RE: Scala Vs Python

2016-08-31 Thread AssafMendelson
I believe this would greatly depend on your use case and your familiarity with 
the languages.

In general, Scala gives much better performance than Python, and not all 
interfaces are available in Python.
That said, if you plan to use DataFrames without any UDFs, the performance hit 
is practically nonexistent.
Even if you need UDFs, you can write them in Scala and wrap them for Python, 
again avoiding the performance hit.
Python has no interface for UDAFs.

I believe that if you have large, structured data and do not generally need 
UDFs/UDAFs, you can certainly work in Python without losing much.
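
One minimal sketch of the Scala-side wrapping Assaf describes (the function name
and column are made up, not from this thread): register the UDF against the
shared SQLContext so that PySpark can invoke it by name through SQL or
selectExpr, keeping the UDF body in the JVM.

import org.apache.spark.sql.SQLContext

object PyUdfs {
  // Register once on the driver; the function is then callable from PySpark
  // by name, so no rows are shipped to a Python worker for this UDF.
  def register(sqlContext: SQLContext): Unit = {
    sqlContext.udf.register("normalize",
      (s: String) => if (s == null) null else s.trim.toLowerCase)
  }
}

// PySpark side, for reference:
//   df.selectExpr("normalize(name)")
//   sqlContext.sql("select normalize(name) from t")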


From: ayan guha [mailto:guha.a...@gmail.com]
Sent: Thursday, September 01, 2016 5:03 AM
To: user
Subject: Scala Vs Python

Hi Users

Thought to ask (again and again) the question: While I am building any 
production application, should I use Scala or Python?

I have read many if not most articles but all seems pre-Spark 2. Anything 
changed with Spark 2? Either pro-scala way or pro-python way?

I am thinking performance, feature parity and future direction, not so much in 
terms of skillset or ease of use.

Or, if you think it is a moot point, please say so as well.

Any real life example, production experience, anecdotes, personal taste, 
profanity all are welcome :)

--
Best Regards,
Ayan Guha




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/RE-Scala-Vs-Python-tp27637.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

RE: Window Functions with SQLContext

2016-08-31 Thread Saurabh Dubey
Hi Divya,

 

Then how can https://issues.apache.org/jira/browse/SPARK-11001 be resolved?

 

Thanks,

Saurabh

 

From: Divya Gehlot [mailto:divya.htco...@gmail.com] 
Sent: 01 September 2016 11:33
To: saurabh3d
Cc: user @spark
Subject: Re: Window Functions with SQLContext

 

Hi Saurabh,

 

I am also using a Spark 1.6+ version, and when I didn't create a HiveContext it 
threw the same error.

So you have to create a HiveContext to access window functions.

 

Thanks,

Divya 

 

On 1 September 2016 at 13:16, saurabh3d [saurabh.s.du...@oracle.com] wrote:

Hi All,

As per  SPARK-11001    ,
Window functions should be supported by SQLContext. But when i try to run

SQLContext sqlContext = new SQLContext(jsc);
WindowSpec w = Window.partitionBy("assetId").orderBy("assetId");
DataFrame df_2 = df1.withColumn("row_number", row_number().over(w));
df_2.show(false);

it fails with:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Could not
resolve window function 'row_number'. Note that, using window functions
currently requires a HiveContext;

This code runs fine with HiveContext.
Any idea what’s going on?  Is this a known issue and is there a workaround
to make Window function work without HiveContext.

Thanks,
Saurabh




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Window-Functions-with-SQLContext-tp27636.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

 


Re: Window Functions with SQLContext

2016-08-31 Thread Divya Gehlot
Hi Saurabh,

I am also using a Spark 1.6+ version, and when I didn't create a HiveContext it
threw the same error.
So you have to create a HiveContext to access window functions.
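
For reference, a minimal Scala sketch of that workaround on Spark 1.6 (assuming
Hive support is on the classpath; the input path and column name are
illustrative):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)                  // sc: existing SparkContext
val df1 = hiveContext.read.parquet("/path/to/assets")  // hypothetical input
val w = Window.partitionBy("assetId").orderBy("assetId")
df1.withColumn("row_number", row_number().over(w)).show(false)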

Thanks,
Divya

On 1 September 2016 at 13:16, saurabh3d  wrote:

> Hi All,
>
> As per  SPARK-11001 
>  ,
> Window functions should be supported by SQLContext. But when i try to run
>
> SQLContext sqlContext = new SQLContext(jsc);
> WindowSpec w = Window.partitionBy("assetId").orderBy("assetId");
> DataFrame df_2 = df1.withColumn("row_number", row_number().over(w));
> df_2.show(false);
>
> it fails with:
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Could
> not
> resolve window function 'row_number'. Note that, using window functions
> currently requires a HiveContext;
>
> This code runs fine with HiveContext.
> Any idea what’s going on?  Is this a known issue and is there a workaround
> to make Window function work without HiveContext.
>
> Thanks,
> Saurabh
>
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Window-Functions-with-SQLContext-tp27636.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


RE: Window Functions with SQLContext

2016-08-31 Thread Saurabh Dubey
Hi Ayan,

 

Even with a SQL query like:

 

SQLContext sqlContext = new SQLContext(jsc);

DataFrame df_2_sql = sqlContext.sql("select assetId, row_number() over ( 
partition by " +
    "assetId order by assetId) as " +
    "serial from df1");
df_2_sql.show(false);

 

It fails with:

 

Exception in thread "main" java.lang.RuntimeException: [1.35] failure: 
``union'' expected but `(' found

 

select assetId, row_number() over ( partition by assetId order by assetId) as 
serial from df1

  ^

    at scala.sys.package$.error(package.scala:27)

    at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)

    at 
org.apache.spark.sql.catalyst.DefaultParserDialect.parse(ParserDialect.scala:67)

    at 
org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)

    at 
org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)

 

Thanks

 

From: ayan guha [mailto:guha.a...@gmail.com] 
Sent: 01 September 2016 11:12
To: saurabh3d
Cc: user
Subject: Re: Window Functions with SQLContext

 

I think you can write the SQL query and run it using sqlContext.

 

like 

 

select *, row_number() over (partition by assetid order by assetid) rn from t

 

 

 

On Thu, Sep 1, 2016 at 3:16 PM, saurabh3d [saurabh.s.du...@oracle.com] wrote:

Hi All,

As per  SPARK-11001    ,
Window functions should be supported by SQLContext. But when i try to run

SQLContext sqlContext = new SQLContext(jsc);
WindowSpec w = Window.partitionBy("assetId").orderBy("assetId");
DataFrame df_2 = df1.withColumn("row_number", row_number().over(w));
df_2.show(false);

it fails with:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Could not
resolve window function 'row_number'. Note that, using window functions
currently requires a HiveContext;

This code runs fine with HiveContext.
Any idea what’s going on?  Is this a known issue and is there a workaround
to make Window function work without HiveContext.

Thanks,
Saurabh




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Window-Functions-with-SQLContext-tp27636.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org





 

-- 

Best Regards,
Ayan Guha


Re: Window Functions with SQLContext

2016-08-31 Thread ayan guha
I think you can write the SQL query and run it using sqlContext.

like

select *, row_number() over (partition by assetid order by assetid) rn from t
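
Note that in Spark 1.6 the plain SQLContext parser rejects the OVER clause (see
the parse error reported elsewhere in this thread), so the SQL form also needs a
HiveContext. A rough sketch with illustrative names:

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val df1 = hiveContext.read.parquet("/path/to/assets")  // hypothetical input
df1.registerTempTable("t")
hiveContext.sql(
  "select *, row_number() over (partition by assetid order by assetid) rn from t"
).show()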



On Thu, Sep 1, 2016 at 3:16 PM, saurabh3d 
wrote:

> Hi All,
>
> As per  SPARK-11001 
>  ,
> Window functions should be supported by SQLContext. But when i try to run
>
> SQLContext sqlContext = new SQLContext(jsc);
> WindowSpec w = Window.partitionBy("assetId").orderBy("assetId");
> DataFrame df_2 = df1.withColumn("row_number", row_number().over(w));
> df_2.show(false);
>
> it fails with:
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Could
> not
> resolve window function 'row_number'. Note that, using window functions
> currently requires a HiveContext;
>
> This code runs fine with HiveContext.
> Any idea what’s going on?  Is this a known issue and is there a workaround
> to make Window function work without HiveContext.
>
> Thanks,
> Saurabh
>
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Window-Functions-with-SQLContext-tp27636.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
Best Regards,
Ayan Guha


RE: Window Functions with SQLContext

2016-08-31 Thread Saurabh Dubey
Hi Adline,

rowNumber and row_number are the same function:
@scala.deprecated("Use row_number. This will be removed in Spark 2.0.")
def rowNumber() : org.apache.spark.sql.Column = { /* compiled code */ }
def row_number() : org.apache.spark.sql.Column = { /* compiled code */ }

but the issue here is about using SQLContext instead of HiveContext.

Thanks,
Saurabh

-Original Message-
From: Adline Dsilva [mailto:adline.dsi...@mimos.my] 
Sent: 01 September 2016 11:05
To: saurabh3d; user@spark.apache.org
Subject: RE: Window Functions with SQLContext

Hi,
  Use function rowNumber instead of row_number

df1.withColumn("row_number", rowNumber.over(w));

Regards,
Adline

From: saurabh3d [saurabh.s.du...@oracle.com]
Sent: 01 September 2016 13:16
To: user@spark.apache.org
Subject: Window Functions with SQLContext

Hi All,

As per  SPARK-11001    ,
Window functions should be supported by SQLContext. But when i try to run

SQLContext sqlContext = new SQLContext(jsc); WindowSpec w = 
Window.partitionBy("assetId").orderBy("assetId");
DataFrame df_2 = df1.withColumn("row_number", row_number().over(w)); 
df_2.show(false);

it fails with:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Could not 
resolve window function 'row_number'. Note that, using window functions 
currently requires a HiveContext;

This code runs fine with HiveContext.
Any idea what's going on?  Is this a known issue and is there a workaround to 
make Window function work without HiveContext.

Thanks,
Saurabh




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Window-Functions-with-SQLContext-tp27636.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org




-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



RE: Window Functions with SQLContext

2016-08-31 Thread Adline Dsilva
Hi,
  Use function rowNumber instead of row_number

df1.withColumn("row_number", rowNumber.over(w));

Regards,
Adline

From: saurabh3d [saurabh.s.du...@oracle.com]
Sent: 01 September 2016 13:16
To: user@spark.apache.org
Subject: Window Functions with SQLContext

Hi All,

As per  SPARK-11001    ,
Window functions should be supported by SQLContext. But when i try to run

SQLContext sqlContext = new SQLContext(jsc);
WindowSpec w = Window.partitionBy("assetId").orderBy("assetId");
DataFrame df_2 = df1.withColumn("row_number", row_number().over(w));
df_2.show(false);

it fails with:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Could not
resolve window function 'row_number'. Note that, using window functions
currently requires a HiveContext;

This code runs fine with HiveContext.
Any idea what’s going on?  Is this a known issue and is there a workaround
to make Window function work without HiveContext.

Thanks,
Saurabh




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Window-Functions-with-SQLContext-tp27636.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org




-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Window Functions with SQLContext

2016-08-31 Thread saurabh3d
Hi All,

As per SPARK-11001,
window functions should be supported by SQLContext. But when I try to run

SQLContext sqlContext = new SQLContext(jsc);
WindowSpec w = Window.partitionBy("assetId").orderBy("assetId");
DataFrame df_2 = df1.withColumn("row_number", row_number().over(w));
df_2.show(false);

it fails with:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Could not
resolve window function 'row_number'. Note that, using window functions
currently requires a HiveContext;

This code runs fine with HiveContext.
Any idea what's going on? Is this a known issue, and is there a workaround
to make window functions work without a HiveContext?

Thanks,
Saurabh




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Window-Functions-with-SQLContext-tp27636.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



RE: Scala Vs Python

2016-08-31 Thread Santoshakhilesh
Hi,
I would prefer Scala if you are starting afresh, considering ease of use, 
features, performance and support.
You will find numerous examples and plenty of support for Scala, which might not 
be true for other languages.
I personally developed the first version of my app in Java 1.6 for some 
unavoidable reasons, and the code is very verbose and ugly.
With Java 8's lambda support that is not a problem anymore.
As for Python, there is no compile-time safety, and if you plan to use Spark 
2.0 the Dataset API is not available.
Given the choice I would prefer Scala any day, for the simple reason that I get 
all future features and optimizations out of the box and I need to type less ☺.
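
For illustration, the kind of compile-time checking referred to above, as a
minimal Spark 2.0 Scala sketch (the case class and data are made up):

case class Asset(assetId: Long, name: String)

import spark.implicits._
val ds = Seq(Asset(1L, "pump"), Asset(2L, "valve")).toDS()
// With the typed Dataset API a typo such as _.assetid fails at compile time,
// whereas the equivalent string-based column name would only fail at runtime.
ds.filter(_.assetId > 1L).show()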


Regards,
Santosh Akhilesh


From: ayan guha [mailto:guha.a...@gmail.com]
Sent: 01 September 2016 11:03
To: user
Subject: Scala Vs Python

Hi Users

Thought to ask (again and again) the question: While I am building any 
production application, should I use Scala or Python?

I have read many if not most articles but all seems pre-Spark 2. Anything 
changed with Spark 2? Either pro-scala way or pro-python way?

I am thinking performance, feature parity and future direction, not so much in 
terms of skillset or ease of use.

Or, if you think it is a moot point, please say so as well.

Any real life example, production experience, anecdotes, personal taste, 
profanity all are welcome :)

--
Best Regards,
Ayan Guha


KeyManager exception in Spark 1.6.2

2016-08-31 Thread Eric Ho
I was trying to enable SSL in Spark 1.6.2 and got this exception.
Not sure if I'm missing something or my keystore / truststore files got bad
although keytool showed that both files are fine...

=

16/09/01 04:01:41 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable

Exception in thread "main" java.security.KeyManagementException: Default
SSLContext is initialized automatically
	at sun.security.ssl.SSLContextImpl$DefaultSSLContext.engineInit(SSLContextImpl.java:749)
	at javax.net.ssl.SSLContext.init(SSLContext.java:282)
	at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:284)
	at org.apache.spark.deploy.master.Master$.startRpcEnvAndEndpoint(Master.scala:1121)
	at org.apache.spark.deploy.master.Master$.main(Master.scala:1106)
	at org.apache.spark.deploy.master.Master.main(Master.scala)
=


-- 

-eric ho


Re: Spark build 1.6.2 error

2016-08-31 Thread Divya Gehlot
Which Java version are you using?

On 31 August 2016 at 04:30, Diwakar Dhanuskodi  wrote:

> Hi,
>
> While building Spark 1.6.2 , getting below error in spark-sql. Much
> appreciate for any help.
>
> ERROR] missing or invalid dependency detected while loading class file
> 'WebUI.class'.
> Could not access term eclipse in package org,
> because it (or its dependencies) are missing. Check your build definition
> for
> missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see
> the problematic classpath.)
> A full rebuild may help if 'WebUI.class' was compiled against an
> incompatible version of org.
> [ERROR] missing or invalid dependency detected while loading class file
> 'WebUI.class'.
> Could not access term jetty in value org.eclipse,
> because it (or its dependencies) are missing. Check your build definition
> for
> missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see
> the problematic classpath.)
> A full rebuild may help if 'WebUI.class' was compiled against an
> incompatible version of org.eclipse.
> [WARNING] 17 warnings found
> [ERROR] two errors found
> [INFO] 
> 
> [INFO] Reactor Summary:
> [INFO]
> [INFO] Spark Project Parent POM .. SUCCESS [4.399s]
> [INFO] Spark Project Test Tags ... SUCCESS [3.443s]
> [INFO] Spark Project Launcher  SUCCESS
> [10.131s]
> [INFO] Spark Project Networking .. SUCCESS
> [11.849s]
> [INFO] Spark Project Shuffle Streaming Service ... SUCCESS [6.641s]
> [INFO] Spark Project Unsafe .. SUCCESS
> [19.765s]
> [INFO] Spark Project Core  SUCCESS
> [4:16.511s]
> [INFO] Spark Project Bagel ... SUCCESS
> [13.401s]
> [INFO] Spark Project GraphX .. SUCCESS
> [1:08.824s]
> [INFO] Spark Project Streaming ... SUCCESS
> [2:18.844s]
> [INFO] Spark Project Catalyst  SUCCESS
> [2:43.695s]
> [INFO] Spark Project SQL . FAILURE
> [1:01.762s]
> [INFO] Spark Project ML Library .. SKIPPED
> [INFO] Spark Project Tools ... SKIPPED
> [INFO] Spark Project Hive  SKIPPED
> [INFO] Spark Project Docker Integration Tests  SKIPPED
> [INFO] Spark Project REPL  SKIPPED
> [INFO] Spark Project YARN Shuffle Service  SKIPPED
> [INFO] Spark Project YARN  SKIPPED
> [INFO] Spark Project Assembly  SKIPPED
> [INFO] Spark Project External Twitter  SKIPPED
> [INFO] Spark Project External Flume Sink . SKIPPED
> [INFO] Spark Project External Flume .. SKIPPED
> [INFO] Spark Project External Flume Assembly . SKIPPED
> [INFO] Spark Project External MQTT ... SKIPPED
> [INFO] Spark Project External MQTT Assembly .. SKIPPED
> [INFO] Spark Project External ZeroMQ . SKIPPED
> [INFO] Spark Project External Kafka .. SKIPPED
> [INFO] Spark Project Examples  SKIPPED
> [INFO] Spark Project External Kafka Assembly . SKIPPED
> [INFO] 
> 
> [INFO] BUILD FAILURE
> [INFO] 
> 
> [INFO] Total time: 12:40.525s
> [INFO] Finished at: Wed Aug 31 01:56:50 IST 2016
> [INFO] Final Memory: 71M/830M
> [INFO] 
> 
> [ERROR] Failed to execute goal 
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile
> (scala-compile-first) on project spark-sql_2.11: Execution
> scala-compile-first of goal 
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile
> failed. CompileFailed -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the
> -e switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions,
> please read the following articles:
> [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/
> PluginExecutionException
> [ERROR]
> [ERROR] After correcting the problems, you can resume the build with the
> command
> [ERROR]   mvn  -rf :spark-sql_2.11
>
>
>


[Error:]while read s3 buckets in Spark 1.6 in spark -submit

2016-08-31 Thread Divya Gehlot
Hi,
I am using Spark 1.6.1 on an EMR machine and am trying to read S3 buckets in my
Spark job.
When I read them through the Spark shell it works, but when I package the job
and run it with spark-submit I get the error below:

16/08/31 07:36:38 INFO ApplicationMaster: Registered signal handlers for
[TERM, HUP, INT]

> 16/08/31 07:36:39 INFO ApplicationMaster: ApplicationAttemptId:
> appattempt_1468570153734_2851_01
> Exception in thread "main" java.util.ServiceConfigurationError:
> org.apache.hadoop.fs.FileSystem: Provider
> org.apache.hadoop.fs.s3a.S3AFileSystem could not be instantiated
> at java.util.ServiceLoader.fail(ServiceLoader.java:224)
> at java.util.ServiceLoader.access$100(ServiceLoader.java:181)
> at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:377)
> at java.util.ServiceLoader$1.next(ServiceLoader.java:445)
> at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2673)
> at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2684)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2701)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2737)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2719)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:375)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:174)
> at
> org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:142)
> at
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:653)
> at
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:69)
> at
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:68)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:68)
> at
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:651)
> at
> org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
> Caused by: java.lang.NoClassDefFoundError:
> com/amazonaws/services/s3/AmazonS3
> at java.lang.Class.getDeclaredConstructors0(Native Method)
> at java.lang.Class.privateGetDeclaredConstructors(Class.java:2595)
> at java.lang.Class.getConstructor0(Class.java:2895)
> at java.lang.Class.newInstance(Class.java:354)
> at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:373)
> ... 19 more
> Caused by: java.lang.ClassNotFoundException:
> com.amazonaws.services.s3.AmazonS3
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> ... 24 more
> End of LogType:stderr



I have already included

 "com.amazonaws" % "aws-java-sdk-s3" % "1.11.15",

in my build.sbt
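
For what it is worth, a build.sbt sketch of the pairing that usually matters
here: the provider class in the trace, org.apache.hadoop.fs.s3a.S3AFileSystem,
comes from hadoop-aws, which in turn needs matching AWS SDK classes on the
runtime classpath. The version numbers below are placeholders, not a verified
fix for this job.

// build.sbt sketch; align the versions with the Hadoop build on the cluster
libraryDependencies ++= Seq(
  "org.apache.hadoop" % "hadoop-aws"      % "2.7.2",
  "com.amazonaws"     % "aws-java-sdk-s3" % "1.11.15"
)
// the jars also have to reach the YARN application master and executors at
// runtime, e.g. via an assembly jar or spark-submit --jars / --packages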


I also tried providing the access key in my job, and the same error persists.

From what I found when googling, if you have an IAM role created there is no
need to provide an access key.

Would really appreciate the help.


Thanks,

Divya


Re:Scala Vs Python

2016-08-31 Thread 时金魁
Must be Scala, or Java.

On 2016-09-01 10:02:54, "ayan guha"  wrote:

Hi Users


Thought to ask (again and again) the question: While I am building any 
production application, should I use Scala or Python? 


I have read many if not most articles but all seems pre-Spark 2. Anything 
changed with Spark 2? Either pro-scala way or pro-python way? 


I am thinking performance, feature parity and future direction, not so much in 
terms of skillset or ease of use. 


Or, if you think it is a moot point, please say so as well. 


Any real life example, production experience, anecdotes, personal taste, 
profanity all are welcome :)



--

Best Regards,
Ayan Guha


Scala Vs Python

2016-08-31 Thread ayan guha
Hi Users

Thought to ask (again and again) the question: While I am building any
production application, should I use Scala or Python?

I have read many if not most articles, but all seem pre-Spark 2. Has anything
changed with Spark 2, either in a pro-Scala or a pro-Python way?

I am thinking performance, feature parity and future direction, not so much
in terms of skillset or ease of use.

Or, if you think it is a moot point, please say so as well.

Any real life example, production experience, anecdotes, personal taste,
profanity all are welcome :)

-- 
Best Regards,
Ayan Guha


Re: AnalysisException exception while parsing XML

2016-08-31 Thread Peyman Mohajerian
here is an example:

df1 = df0.select(explode("manager.subordinates.subordinate_clerk.duties").alias("duties_flat"))
df2 = df1.select(col("duties_flat.duty.name").alias("duty_name"))

this is in PySpark; I may have some part of this wrong (didn't test it), but
something similar.
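
In Scala, against the schema quoted below, the same idea works out to one
explode per array level. A rough, untested sketch (df is the DataFrame from the
original mail):

import org.apache.spark.sql.functions.explode
import sqlContext.implicits._  // for the $"..." column syntax in the shell

// subordinate_clerk is an array of structs: explode once, one clerk per row
val clerks = df.select(explode($"manager.subordinates.subordinate_clerk").as("clerk"))
// duties.duty is the inner array: explode again, one duty per row
val duties = clerks.select(explode($"clerk.duties.duty").as("duty"))
duties.select($"duty.name").show()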

On Wed, Aug 31, 2016 at 5:54 PM,  wrote:

> How do we explode nested arrays?
>
>
>
> Thanks,
> Sreekanth Jella
>
>
>
> *From: *Peyman Mohajerian 
> *Sent: *Wednesday, August 31, 2016 7:41 PM
> *To: *srikanth.je...@gmail.com
> *Cc: *user@spark.apache.org
> *Subject: *Re: AnalysisException exception while parsing XML
>
>
>
> Once you get to the 'Array' type, you got to use explode, you cannot to
> the same traversing.
>
>
>
> On Wed, Aug 31, 2016 at 2:19 PM,  wrote:
>
> Hello Experts,
>
>
>
> I am using Spark XML package to parse the XML. Below exception is being
> thrown when trying to *parse a tag which exist in arrays of array depth*.
> i.e. in this case subordinate_clerk. .duty.name
>
>
>
> With below sample XML, issue is reproducible:
>
>
>
> 
>
>   
>
>
>
> 1
>
> mgr1
>
> 2005-07-31
>
> 
>
>   
>
> 2
>
> clerk2
>
> 2005-07-31
>
>   
>
>   
>
> 3
>
> clerk3
>
> 2005-07-31
>
>   
>
> 
>
>
>
>   
>
>   
>
>
>
>11
>
>mgr11
>
> 
>
>   
>
> 12
>
> clerk12
>
> 
>
>  
>
>first duty
>
>  
>
>  
>
>second duty
>
>  
>
>
>
>   
>
> 
>
>
>
>   
>
> 
>
>
>
>
>
> scala> df.select( 
> "manager.subordinates.subordinate_clerk.duties.duty.name").show
>
>
>
> Exception is:
>
>  org.apache.spark.sql.AnalysisException: cannot resolve 
> 'manager.subordinates.subordinate_clerk.duties.duty[name]' due to data type 
> mismatch: argument 2 requires integral type, however, 'name' is of string 
> type.;
>
>at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>
>at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:65)
>
>at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>
>at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
>
>at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
>
>at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>
>at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:334)
>
>at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:332)
>
>at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:332)
>
>at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:281)
>
>at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>
>at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>
>at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>
>at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>
>at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>
>at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>
>at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>
>at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>
>at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>
>at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>
>at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>
>at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>
>at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:321)
>
>at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:332)
>
> ... more
>
>
>
>
>
>
>
>
>
> scala> df.printSchema
>
> root
>
>  |-- manager: struct (nullable = true)
>
>  ||-- dateOfJoin: string (nullable = true)
>
>  ||-- id: long (nullable = true)
>
>  ||-- name: string (nullable = true)
>
>  ||-- subordinates: struct (nullable = true)
>
>  |||-- subordinate_clerk: array (nullable = true)
>
>  ||||-- element: struct (containsNull = true)
>
>  |||||-- cid: long (nullable = true)
>
>  |||||-- cname: string (nullable = true)
>
>  |||||-- dateOfJoin: string (nullable = true)
>
>  |||||-- duties: struct (nullable = true)
>
>  ||||||-- duty: array (nullable = true)
>
>  |||

RE: AnalysisException exception while parsing XML

2016-08-31 Thread srikanth.jella
How do we explode nested arrays?

Thanks,
Sreekanth Jella

From: Peyman Mohajerian

Re: Spark 2.0 - Parquet data with fields containing periods "."

2016-08-31 Thread Don Drake
Yes, I just tested it against the nightly build from 8/31. Looking at the
PR, I'm happy to see that the test it adds covers my issue.

Thanks.

-Don

On Wed, Aug 31, 2016 at 6:49 PM, Hyukjin Kwon  wrote:

> Hi Don, I guess this should be fixed from 2.0.1.
>
> Please refer this PR. https://github.com/apache/spark/pull/14339
>
> On 1 Sep 2016 2:48 a.m., "Don Drake"  wrote:
>
>> I am in the process of migrating a set of Spark 1.6.2 ETL jobs to Spark
>> 2.0 and have encountered some interesting issues.
>>
>> First, it seems the SQL parsing is different, and I had to rewrite some
>> SQL that was doing a mix of inner joins (using where syntax, not inner) and
>> outer joins to get the SQL to work.  It was complaining about columns not
>> existing.  I can't reproduce that one easily and can't share the SQL.  Just
>> curious if anyone else is seeing this?
>>
>> I do have a showstopper problem with Parquet dataset that have fields
>> containing a "." in the field name.  This data comes from an external
>> provider (CSV) and we just pass through the field names.  This has worked
>> flawlessly in Spark 1.5 and 1.6, but now spark can't seem to read these
>> parquet files.
>>
>> I've reproduced a trivial example below. Jira created:
>> https://issues.apache.org/jira/browse/SPARK-17341
>>
>>
>> Spark context available as 'sc' (master = local[*], app id =
>> local-1472664486578).
>> Spark session available as 'spark'.
>> Welcome to
>>     __
>>  / __/__  ___ _/ /__
>> _\ \/ _ \/ _ `/ __/  '_/
>>/___/ .__/\_,_/_/ /_/\_\   version 2.0.0
>>   /_/
>>
>> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java
>> 1.7.0_51)
>> Type in expressions to have them evaluated.
>> Type :help for more information.
>>
>> scala> val squaresDF = spark.sparkContext.makeRDD(1 to 5).map(i => (i, i
>> * i)).toDF("value", "squared.value")
>> 16/08/31 12:28:44 WARN ObjectStore: Version information not found in
>> metastore. hive.metastore.schema.verification is not enabled so
>> recording the schema version 1.2.0
>> 16/08/31 12:28:44 WARN ObjectStore: Failed to get database default,
>> returning NoSuchObjectException
>> squaresDF: org.apache.spark.sql.DataFrame = [value: int, squared.value:
>> int]
>>
>> scala> squaresDF.take(2)
>> res0: Array[org.apache.spark.sql.Row] = Array([1,1], [2,4])
>>
>> scala> squaresDF.write.parquet("squares")
>> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
>> SLF4J: Defaulting to no-operation (NOP) logger implementation
>> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for
>> further details.
>> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig:
>> Compression: SNAPPY
>> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig:
>> Compression: SNAPPY
>> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig:
>> Compression: SNAPPY
>> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig:
>> Compression: SNAPPY
>> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig:
>> Compression: SNAPPY
>> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig:
>> Compression: SNAPPY
>> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig:
>> Compression: SNAPPY
>> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig:
>> Compression: SNAPPY
>> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
>> Parquet block size to 134217728
>> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
>> Parquet block size to 134217728
>> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
>> Parquet page size to 1048576
>> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
>> Parquet block size to 134217728
>> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
>> Parquet block size to 134217728
>> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
>> Parquet block size to 134217728
>> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
>> Parquet page size to 1048576
>> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
>> Parquet page size to 1048576
>> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
>> Parquet block size to 134217728
>> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
>> Parquet block size to 134217728
>> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
>> Parquet block size to 134217728
>> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
>> Parquet page size to 1048576
>> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
>> Parquet page size to 1048576
>> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
>> Parq

Re: Spark 2.0 - Parquet data with fields containing periods "."

2016-08-31 Thread Hyukjin Kwon
Hi Don, I guess this should be fixed from 2.0.1.

Please refer this PR. https://github.com/apache/spark/pull/14339
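
In the meantime, on 2.0.0, one possible stop-gap (an untested sketch, not the
fix in that PR) is to keep dots out of the Parquet field names at write time,
for example by renaming the columns when the external CSV is ingested:

val sanitized = squaresDF.toDF(squaresDF.columns.map(_.replace(".", "_")): _*)
sanitized.write.parquet("squares_sanitized")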

On 1 Sep 2016 2:48 a.m., "Don Drake"  wrote:

> I am in the process of migrating a set of Spark 1.6.2 ETL jobs to Spark
> 2.0 and have encountered some interesting issues.
>
> First, it seems the SQL parsing is different, and I had to rewrite some
> SQL that was doing a mix of inner joins (using where syntax, not inner) and
> outer joins to get the SQL to work.  It was complaining about columns not
> existing.  I can't reproduce that one easily and can't share the SQL.  Just
> curious if anyone else is seeing this?
>
> I do have a showstopper problem with Parquet dataset that have fields
> containing a "." in the field name.  This data comes from an external
> provider (CSV) and we just pass through the field names.  This has worked
> flawlessly in Spark 1.5 and 1.6, but now spark can't seem to read these
> parquet files.
>
> I've reproduced a trivial example below. Jira created: https://issues.
> apache.org/jira/browse/SPARK-17341
>
>
> Spark context available as 'sc' (master = local[*], app id =
> local-1472664486578).
> Spark session available as 'spark'.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.0.0
>   /_/
>
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java
> 1.7.0_51)
> Type in expressions to have them evaluated.
> Type :help for more information.
>
> scala> val squaresDF = spark.sparkContext.makeRDD(1 to 5).map(i => (i, i *
> i)).toDF("value", "squared.value")
> 16/08/31 12:28:44 WARN ObjectStore: Version information not found in
> metastore. hive.metastore.schema.verification is not enabled so recording
> the schema version 1.2.0
> 16/08/31 12:28:44 WARN ObjectStore: Failed to get database default,
> returning NoSuchObjectException
> squaresDF: org.apache.spark.sql.DataFrame = [value: int, squared.value:
> int]
>
> scala> squaresDF.take(2)
> res0: Array[org.apache.spark.sql.Row] = Array([1,1], [2,4])
>
> scala> squaresDF.write.parquet("squares")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further
> details.
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig:
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig:
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig:
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig:
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig:
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig:
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig:
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig:
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
> Parquet page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
> Parquet page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
> Parquet page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
> Parquet page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
> Parquet page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
> Parquet dictionary page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
> Parquet dictionary page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
> Dictionary is on
> Aug 31, 2016 12:29:08 PM INFO: org.apac

Re: AnalysisException exception while parsing XML

2016-08-31 Thread Peyman Mohajerian
Once you get to the 'Array' type, you got to use explode, you cannot to the
same traversing.

On Wed, Aug 31, 2016 at 2:19 PM,  wrote:

> Hello Experts,
>
>
>
> I am using Spark XML package to parse the XML. Below exception is being
> thrown when trying to *parse a tag which exist in arrays of array depth*.
> i.e. in this case subordinate_clerk. .duty.name
>
>
>
> With below sample XML, issue is reproducible:
>
>
>
> 
>
>   
>
>
>
> 1
>
> mgr1
>
> 2005-07-31
>
> 
>
>   
>
> 2
>
> clerk2
>
> 2005-07-31
>
>   
>
>   
>
> 3
>
> clerk3
>
> 2005-07-31
>
>   
>
> 
>
>
>
>   
>
>   
>
>
>
>11
>
>mgr11
>
> 
>
>   
>
> 12
>
> clerk12
>
> 
>
>  
>
>first duty
>
>  
>
>  
>
>second duty
>
>  
>
>
>
>   
>
> 
>
>
>
>   
>
> 
>
>
>
>
>
> scala> df.select( 
> "manager.subordinates.subordinate_clerk.duties.duty.name").show
>
>
>
> Exception is:
>
>  org.apache.spark.sql.AnalysisException: cannot resolve 
> 'manager.subordinates.subordinate_clerk.duties.duty[name]' due to data type 
> mismatch: argument 2 requires integral type, however, 'name' is of string 
> type.;
>
>at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>
>at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:65)
>
>at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>
>at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
>
>at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
>
>at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>
>at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:334)
>
>at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:332)
>
>at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:332)
>
>at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:281)
>
>at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>
>at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>
>at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>
>at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>
>at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>
>at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>
>at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>
>at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>
>at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>
>at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>
>at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>
>at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>
>at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:321)
>
>at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:332)
>
> ... more
>
>
>
>
>
>
>
>
>
> scala> df.printSchema
>
> root
>
>  |-- manager: struct (nullable = true)
>
>  ||-- dateOfJoin: string (nullable = true)
>
>  ||-- id: long (nullable = true)
>
>  ||-- name: string (nullable = true)
>
>  ||-- subordinates: struct (nullable = true)
>
>  |||-- subordinate_clerk: array (nullable = true)
>
>  ||||-- element: struct (containsNull = true)
>
>  |||||-- cid: long (nullable = true)
>
>  |||||-- cname: string (nullable = true)
>
>  |||||-- dateOfJoin: string (nullable = true)
>
>  |||||-- duties: struct (nullable = true)
>
>  ||||||-- duty: array (nullable = true)
>
>  |||||||-- element: struct (containsNull = true)
>
>  ||||||||-- name: string (nullable = true)
>
>
>
>
>
>
>
> Versions info:
>
> Spark - 1.6.0
>
> Scala - 2.10.5
>
> Spark XML - com.databricks:spark-xml_2.10:0.3.3
>
>
>
> Please let me know if there is a solution or workaround for this?
>
>
>
> Thanks,
>
> Sreekanth
>
>
>


Re: Expected benefit of parquet filter pushdown?

2016-08-31 Thread Robert Kruszewski
Your statistics seem corrupted. The creator field doesn't match the version
spec, so Parquet is not using the statistics to filter. I would check whether you
have references to PARQUET-251 or PARQUET-297 in your executor logs. This bug
existed between Parquet 1.5.0 and 1.8.0; see
https://issues.apache.org/jira/browse/PARQUET-251. Only master of Spark has
Parquet >= 1.8.0.

Also check out VersionParser in Parquet: since your createdBy is invalid, the
statistics will be deemed corrupted even with a fixed Parquet.
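
Not a fix for the corrupted statistics described above, but a quick way to
confirm that pushdown is enabled and to see which predicates were actually
pushed (a Spark 2.0 shell sketch, using the path from the mail below):

import spark.implicits._

spark.conf.get("spark.sql.parquet.filterPushdown")  // should be "true", the default
val t2 = spark.read.parquet("hdfs://nameservice1/user/me/parquet_test/blob")
t2.filter($"key1" === "rare value").select("blob").explain()  // check PushedFilters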

-  Robert

On 8/31/16, 10:29 PM, "cde...@apple.com on behalf of Christon DeWan" 
 wrote:

I have a data set stored in parquet with several short key fields and one 
relatively large (several kb) blob field. The data set is sorted by key1, key2.

message spark_schema {
  optional binary key1 (UTF8);
  optional binary key2;
  optional binary blob;
}

One use case of this dataset is to fetch all the blobs for a given 
predicate of key1, key2. I would expect parquet predicate pushdown to help 
greatly by not reading blobs from rowgroups where the predicate on the keys 
matched zero records. That does not appear to be the case, however.

For a predicate that only returns 2 rows (out of 6 million), this query:

select sum(length(key2)) from t2 where key1 = 'rare value'

takes 5x longer and reads 50x more data (according to the web UI) than this 
query:

select sum(length(blob)) from t2 where key1 = 'rare value'

The parquet scan does appear to be getting the predicate (says explain(), 
see below), and those columns do even appear to be dictionary encoded (see 
further below).

So does filter pushdown not actually allow us to read less data or is there 
something wrong with my setup?

Thanks,
Xton



scala> spark.sql("select sum(length(blob)) from t2 where key1 = 'rare 
value'").explain()
== Physical Plan ==
*HashAggregate(keys=[], functions=[sum(cast(length(blob#48) as bigint))])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_sum(cast(length(blob#48) 
as bigint))])
  +- *Project [blob#48]
 +- *Filter (isnotnull(key1#46) && (key1#46 = rare value))
+- *BatchedScan parquet [key1#46,blob#48] Format: 
ParquetFormat, InputPaths: hdfs://nameservice1/user/me/parquet_test/blob, 
PushedFilters: [IsNotNull(key1), EqualTo(key1,rare value)], ReadSchema: 
struct



$ parquet-tools meta example.snappy.parquet 
creator: parquet-mr (build 32c46643845ea8a705c35d4ec8fc654cc8ff816d) 
extra:   org.apache.spark.sql.parquet.row.metadata = 
{"type":"struct","fields":[{"name":"key1","type":"string","nullable":true,"metadata":{}},{"name":"key2","type":"binary","nullable":true,"metadata":{}},{"
 [more]...

file schema: spark_schema 


key1:OPTIONAL BINARY O:UTF8 R:0 D:1
key2:OPTIONAL BINARY R:0 D:1
blob:OPTIONAL BINARY R:0 D:1

row group 1: RC:3971 TS:320593029 


key1: BINARY SNAPPY DO:0 FPO:4 SZ:84/80/0.95 VC:3971 
ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
key2: BINARY SNAPPY DO:0 FPO:88 SZ:49582/53233/1.07 VC:3971 
ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
blob: BINARY SNAPPY DO:0 FPO:49670 SZ:134006918/320539716/2.39 
VC:3971 ENC:BIT_PACKED,RLE,PLAIN
...


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org






Fwd: Pyspark Hbase Problem

2016-08-31 Thread md mehrab
I want to read and write data from HBase using PySpark. I am getting the error
below; please help.

My code

from pyspark import SparkContext, SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

sparkconf = {
    "hbase.zookeeper.quorum": "localhost",
    "hbase.mapreduce.inputtable": "test"
}
keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"

hbase_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter=keyConv,
    valueConverter=valueConv,
    conf=sparkconf)


This raises the error:

Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.io.IOException: No table was provided.
at 
org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:130)
at 
org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:121)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1284)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.take(RDD.scala:1279)
at 
org.apache.spark.api.python.SerDeUtil$.pairRDDToPython(SerDeUtil.scala:203)
at 
org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDD(PythonRDD.scala:582)
at 
org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)


-- 
Md Mehrab Alam

https://www.linkedin.com/in/iammehrabalam
https://github.com/iammehrabalam


Expected benefit of parquet filter pushdown?

2016-08-31 Thread Christon DeWan
I have a data set stored in parquet with several short key fields and one 
relatively large (several kb) blob field. The data set is sorted by key1, key2.

message spark_schema {
  optional binary key1 (UTF8);
  optional binary key2;
  optional binary blob;
}

One use case of this dataset is to fetch all the blobs for a given predicate of 
key1, key2. I would expect parquet predicate pushdown to help greatly by not 
reading blobs from rowgroups where the predicate on the keys matched zero 
records. That does not appear to be the case, however.

For a predicate that only returns 2 rows (out of 6 million), this query:

select sum(length(key2)) from t2 where key1 = 'rare value'

takes 5x longer and reads 50x more data (according to the web UI) than this 
query:

select sum(length(blob)) from t2 where key1 = 'rare value'

The parquet scan does appear to be getting the predicate (says explain(), see 
below), and those columns do even appear to be dictionary encoded (see further 
below).

So does filter pushdown not actually allow us to read less data or is there 
something wrong with my setup?

Thanks,
Xton



scala> spark.sql("select sum(length(blob)) from t2 where key1 = 'rare 
value'").explain()
== Physical Plan ==
*HashAggregate(keys=[], functions=[sum(cast(length(blob#48) as bigint))])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_sum(cast(length(blob#48) as 
bigint))])
  +- *Project [blob#48]
 +- *Filter (isnotnull(key1#46) && (key1#46 = rare value))
+- *BatchedScan parquet [key1#46,blob#48] Format: ParquetFormat, 
InputPaths: hdfs://nameservice1/user/me/parquet_test/blob, PushedFilters: 
[IsNotNull(key1), EqualTo(key1,rare value)], ReadSchema: 
struct



$ parquet-tools meta example.snappy.parquet 
creator: parquet-mr (build 32c46643845ea8a705c35d4ec8fc654cc8ff816d) 
extra:   org.apache.spark.sql.parquet.row.metadata = 
{"type":"struct","fields":[{"name":"key1","type":"string","nullable":true,"metadata":{}},{"name":"key2","type":"binary","nullable":true,"metadata":{}},{"
 [more]...

file schema: spark_schema 

key1:OPTIONAL BINARY O:UTF8 R:0 D:1
key2:OPTIONAL BINARY R:0 D:1
blob:OPTIONAL BINARY R:0 D:1

row group 1: RC:3971 TS:320593029 

key1: BINARY SNAPPY DO:0 FPO:4 SZ:84/80/0.95 VC:3971 
ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
key2: BINARY SNAPPY DO:0 FPO:88 SZ:49582/53233/1.07 VC:3971 
ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
blob: BINARY SNAPPY DO:0 FPO:49670 SZ:134006918/320539716/2.39 VC:3971 
ENC:BIT_PACKED,RLE,PLAIN
...


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



AnalysisException exception while parsing XML

2016-08-31 Thread srikanth.jella
Hello Experts,

I am using the Spark XML package to parse XML. The exception below is thrown
when trying to parse a tag that sits at arrays-of-arrays depth, i.e. in this
case subordinate_clerk.duties.duty.name.

With the sample XML below, the issue is reproducible:


  
   
1
mgr1
2005-07-31

  
2
clerk2
2005-07-31
  
  
3
clerk3
2005-07-31
  

   
  
  
   
   11
   mgr11

  
12
clerk12

  
first duty
  
  
second duty
  

  

   
  
  


scala> df.select( 
"manager.subordinates.subordinate_clerk.duties.duty.name").show   

Exception is:
 org.apache.spark.sql.AnalysisException: cannot resolve 
'manager.subordinates.subordinate_clerk.duties.duty[name]' due to data type 
mismatch: argument 2 requires integral type, however, 'name' is of string type.;
at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:65)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:334)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:332)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:332)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:281)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:321)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:332)
... more




scala> df.printSchema
root
 |-- manager: struct (nullable = true)
 ||-- dateOfJoin: string (nullable = true)
 ||-- id: long (nullable = true)
 ||-- name: string (nullable = true)
 ||-- subordinates: struct (nullable = true)
 |||-- subordinate_clerk: array (nullable = true)
 ||||-- element: struct (containsNull = true)
 |||||-- cid: long (nullable = true)
 |||||-- cname: string (nullable = true)
 |||||-- dateOfJoin: string (nullable = true)
 |||||-- duties: struct (nullable = true)
 ||||||-- duty: array (nullable = true)
 |||||||-- element: struct (containsNull = true)
 ||||||||-- name: string (nullable = true)



Versions info:
Spark - 1.6.0
Scala - 2.10.5
Spark XML - com.databricks:spark-xml_2.10:0.3.3

Please let me know if there is a solution or workaround for this?

Thanks,
Sreekanth
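
A minimal sketch of one possible workaround (an assumption on my part, not an
answer from the thread), based on the schema printed above: flatten the nested
arrays with explode so that no positional index is needed on the path.

import org.apache.spark.sql.functions.explode

// One row per subordinate_clerk struct.
val clerks = df.select(explode(df("manager.subordinates.subordinate_clerk")).as("clerk"))

// One row per duty struct, then pick the string leaf.
val duties = clerks.select(explode(clerks("clerk.duties.duty")).as("duty"))
duties.select("duty.name").show()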



Spark jobs failing by looking for TachyonFS

2016-08-31 Thread Venkatesh Rudraraju
My spark job fails by trying to initialize Tachyon FS even though it is not
configured to.
With a *Standalone* setup on my laptop the job *succeeds*. But with a
*Mesos+docker* setup it *fails* (once every few runs). Please reply if anyone
has seen this or knows why it is looking for Tachyon at all. Below is the
exception it fails with:

spark version : 1.6.0

 Framework registered with 25a4e045-ffbd-479a-b24c-2d2cfcbf84c9-56355
2016/07/20 15:31:47 ERROR SparkContext: Error initializing SparkContext.
java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem:
Provider tachyon.hadoop.TFS could not be instantiated
  at java.util.ServiceLoader.fail(ServiceLoader.java:232)
  at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
  at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
  at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
  at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
  at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2563)
  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2574)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
  at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1650)
  at 
org.apache.spark.scheduler.EventLoggingListener.(EventLoggingListener.scala:66)
  at org.apache.spark.SparkContext.(SparkContext.scala:547)
  at org.apache.spark.SparkContext.(SparkContext.scala:123)
  at c
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:497)
  at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
  at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
  at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ExceptionInInitializerError
  at tachyon.hadoop.AbstractTFS.(AbstractTFS.java:72)
  at tachyon.hadoop.TFS.(TFS.java:28)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
  at java.lang.Class.newInstance(Class.java:442)
  at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
  ... 24 more
Caused by: java.util.ConcurrentModificationException
  at java.util.Hashtable$Enumerator.next(Hashtable.java:1367)
  at java.util.Hashtable.putAll(Hashtable.java:522)
  at tachyon.conf.TachyonConf.(TachyonConf.java:158)
  at tachyon.conf.TachyonConf.(TachyonConf.java:111)
  at tachyon.client.ClientContext.reset(ClientContext.java:57)
  at tachyon.client.ClientContext.(ClientContext.java:47)


Custom return code

2016-08-31 Thread Pierre Villard
Hi,

I am using Spark 1.5.2 and I am submitting a job (jar file) using
spark-submit command in a yarn cluster mode. I'd like the command to return
a custom return code.

In the run method, if I do:
sys.exit(myCode)
the command will always return 0.

The only way to have something not equal to 0 is to throw an exception and
this will return 1.

Is there a way to have a custom return code from the job application?

Thanks a lot!


Re: Model abstract class in spark ml

2016-08-31 Thread Mohit Jaggi
Thanks Cody. That was a good explanation!

Mohit Jaggi
Founder,
Data Orchard LLC
www.dataorchardllc.com




> On Aug 31, 2016, at 7:32 AM, Cody Koeninger  wrote:
> 
> http://blog.originate.com/blog/2014/02/27/types-inside-types-in-scala/
> 
> On Wed, Aug 31, 2016 at 2:19 AM, Sean Owen  wrote:
>> Weird, I recompiled Spark with a similar change to Model and it seemed
>> to work but maybe I missed a step in there.
>> 
>> On Wed, Aug 31, 2016 at 6:33 AM, Mohit Jaggi  wrote:
>>> I think I figured it out. There is indeed "something deeper in Scala” :-)
>>> 
>>> abstract class A {
>>>  def a: this.type
>>> }
>>> 
>>> class AA(i: Int) extends A {
>>>  def a = this
>>> }
>>> 
>>> the above works ok. But if you return anything other than “this”, you will
>>> get a compile error.
>>> 
>>> abstract class A {
>>>  def a: this.type
>>> }
>>> 
>>> class AA(i: Int) extends A {
>>>  def a = new AA(1)
>>> }
>>> 
>>> Error:(33, 11) type mismatch;
>>> found   : com.dataorchard.datagears.AA
>>> required: AA.this.type
>>>  def a = new AA(1)
>>>  ^
>>> 
>>> So you have to do:
>>> 
>>> abstract class A[T <: A[T]]  {
>>>  def a: T
>>> }
>>> 
>>> class AA(i: Int) extends A[AA] {
>>>  def a = new AA(1)
>>> }
>>> 
>>> 
>>> 
>>> Mohit Jaggi
>>> Founder,
>>> Data Orchard LLC
>>> www.dataorchardllc.com
>>> 
>>> 
>> 
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> 





Spark 2.0 - Parquet data with fields containing periods "."

2016-08-31 Thread Don Drake
I am in the process of migrating a set of Spark 1.6.2 ETL jobs to Spark 2.0
and have encountered some interesting issues.

First, it seems the SQL parsing is different, and I had to rewrite some SQL
that was doing a mix of inner joins (using where syntax, not inner) and
outer joins to get the SQL to work.  It was complaining about columns not
existing.  I can't reproduce that one easily and can't share the SQL.  Just
curious if anyone else is seeing this?

I do have a showstopper problem with Parquet dataset that have fields
containing a "." in the field name.  This data comes from an external
provider (CSV) and we just pass through the field names.  This has worked
flawlessly in Spark 1.5 and 1.6, but now spark can't seem to read these
parquet files.

I've reproduced a trivial example below. Jira created:
https://issues.apache.org/jira/browse/SPARK-17341


Spark context available as 'sc' (master = local[*], app id =
local-1472664486578).
Spark session available as 'spark'.
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0
  /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java
1.7.0_51)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val squaresDF = spark.sparkContext.makeRDD(1 to 5).map(i => (i, i *
i)).toDF("value", "squared.value")
16/08/31 12:28:44 WARN ObjectStore: Version information not found in
metastore. hive.metastore.schema.verification is not enabled so recording
the schema version 1.2.0
16/08/31 12:28:44 WARN ObjectStore: Failed to get database default,
returning NoSuchObjectException
squaresDF: org.apache.spark.sql.DataFrame = [value: int, squared.value: int]

scala> squaresDF.take(2)
res0: Array[org.apache.spark.sql.Row] = Array([1,1], [2,4])

scala> squaresDF.write.parquet("squares")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further
details.
Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig:
Compression: SNAPPY
Aug 31, 2016 12:29:08 PM INFO:
org.apache.parquet.hadoop.ParquetOutputFormat: Parquet block size to
134217728
Aug 31, 2016 12:29:08 PM INFO:
org.apache.parquet.hadoop.ParquetOutputFormat: Parquet page size to 1048576
Aug 31, 2016 12:29:08 PM INFO:
org.apache.parquet.hadoop.ParquetOutputFormat: Parquet dictionary page size
to 1048576
Aug 31, 2016 12:29:08 PM INFO:
org.apache.parquet.hadoop.ParquetOutputFormat: Dictionary is on
...
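
A hedged workaround sketch (my own assumption, not something suggested in this
thread): rename the columns to drop the "." before writing, so the files
round-trip regardless of how the reader treats dotted field names.

// squaresDF is the DataFrame built above; replace "." in column names with "_".
val safeDF = squaresDF.toDF(squaresDF.columns.map(_.replace(".", "_")): _*)
safeDF.write.parquet("squares_safe")
spark.read.parquet("squares_safe").show()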

Re: Spark to Kafka communication encrypted ?

2016-08-31 Thread Luciano Resende
I believe the encryption section should be updated now that a bunch of
related jiras were fixed yerterday such as
https://issues.apache.org/jira/browse/SPARK-5682

On Wed, Aug 31, 2016 at 9:46 AM, Cody Koeninger  wrote:

> http://spark.apache.org/docs/latest/security.html
>
> On Wed, Aug 31, 2016 at 11:15 AM, Mihai Iacob  wrote:
>
>> Does Spark support encryption for inter node communication ?
>>
>>
>> Regards,
>>
>> *Mihai Iacob*
>> Next Generation Platform - Security
>> IBM Analytics
>>
>>
>> --
>> Phone: 1-905-413-5378
>> E-mail: mia...@ca.ibm.com
>> Chat: mia...@ca.ibm.com / Skype: mihai-iacob
>> 8200 Warden Ave
>> Markham, ON L6G 1C7
>> Canada
>>
>>
>>
>> - Original message -
>> From: Cody Koeninger 
>> To: Eric Ho 
>> Cc: "user@spark.apache.org" 
>> Subject: Re: Spark to Kafka communication encrypted ?
>> Date: Wed, Aug 31, 2016 10:34 AM
>>
>> Encryption is only available in spark-streaming-kafka-0-10, not 0-8.
>> You enable it the same way you enable it for the Kafka project's new
>> consumer, by setting kafka configuration parameters appropriately.
>> http://kafka.apache.org/documentation.html#security_ssl
>>
>> On Wed, Aug 31, 2016 at 2:03 AM, Eric Ho  wrote:
>> > I can't find in Spark 1.6.2's docs in how to turn encryption on for
>> Spark to
>> > Kafka communication ...  I think that the Spark docs only tells you how
>> to
>> > turn on encryption for inter Spark node communications ..  Am I wrong ?
>> >
>> > Thanks.
>> >
>> > --
>> >
>> > -eric ho
>> >
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>>
>>
>>
>


-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: Spark to Kafka communication encrypted ?

2016-08-31 Thread Cody Koeninger
http://spark.apache.org/docs/latest/security.html

On Wed, Aug 31, 2016 at 11:15 AM, Mihai Iacob  wrote:

> Does Spark support encryption for inter node communication ?
>
>
> Regards,
>
> *Mihai Iacob*
> Next Generation Platform - Security
> IBM Analytics
>
>
> --
> Phone: 1-905-413-5378
> E-mail: mia...@ca.ibm.com
> Chat: mia...@ca.ibm.com / Skype: mihai-iacob
> 8200 Warden Ave
> Markham, ON L6G 1C7
> Canada
>
>
>
> - Original message -
> From: Cody Koeninger 
> To: Eric Ho 
> Cc: "user@spark.apache.org" 
> Subject: Re: Spark to Kafka communication encrypted ?
> Date: Wed, Aug 31, 2016 10:34 AM
>
> Encryption is only available in spark-streaming-kafka-0-10, not 0-8.
> You enable it the same way you enable it for the Kafka project's new
> consumer, by setting kafka configuration parameters appropriately.
> http://kafka.apache.org/documentation.html#security_ssl
>
> On Wed, Aug 31, 2016 at 2:03 AM, Eric Ho  wrote:
> > I can't find in Spark 1.6.2's docs in how to turn encryption on for
> Spark to
> > Kafka communication ...  I think that the Spark docs only tells you how
> to
> > turn on encryption for inter Spark node communications ..  Am I wrong ?
> >
> > Thanks.
> >
> > --
> >
> > -eric ho
> >
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>
>
>


Re: releasing memory without stopping the spark context ?

2016-08-31 Thread Mich Talebzadeh
Spark memory is the sum of execution memory and storage memory.

unpersist only removes the storage memory. Execution memory is there which
is what is Spark all about.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 31 August 2016 at 17:18, Rajani, Arpan  wrote:

> *Removing Data*
>
>
>
> Spark automatically monitors cache usage on each node and drops out old
> data partitions in a least-recently-used (LRU) fashion.
>
> If you would like to manually remove an RDD instead of waiting for it to
> fall out of the cache, use the *RDD.unpersist()* method. (Copied from
> Documentation
> 
> )
>
>
>
> *From:* Cesar [mailto:ces...@gmail.com]
> *Sent:* 31 August 2016 16:57
> *To:* user
> *Subject:* releasing memory without stopping the spark context ?
>
>
>
>
>
> Is there a way to release all persisted RDD's/DataFrame's in Spark without
> stopping the SparkContext ?
>
>
>
>
>
> Thanks a lot
>
> --
>
> Cesar Flores
>
>
>
>
>
>


Re: Spark to Kafka communication encrypted ?

2016-08-31 Thread Mihai Iacob
Does Spark support encryption for inter node communication ?
 
Regards, 

Mihai Iacob
Next Generation Platform - Security
IBM Analytics 
Phone: 1-905-413-5378
E-mail: mia...@ca.ibm.com
Chat: mia...@ca.ibm.com / Skype: mihai-iacob
8200 Warden Ave
Markham, ON L6G 1C7
Canada
 
 
- Original message -
From: Cody Koeninger
To: Eric Ho
Cc: "user@spark.apache.org"
Subject: Re: Spark to Kafka communication encrypted ?
Date: Wed, Aug 31, 2016 10:34 AM

Encryption is only available in spark-streaming-kafka-0-10, not 0-8.
You enable it the same way you enable it for the Kafka project's new
consumer, by setting kafka configuration parameters appropriately.
http://kafka.apache.org/documentation.html#security_ssl

On Wed, Aug 31, 2016 at 2:03 AM, Eric Ho  wrote:
> I can't find in Spark 1.6.2's docs in how to turn encryption on for Spark to
> Kafka communication ...  I think that the Spark docs only tells you how to
> turn on encryption for inter Spark node communications ..  Am I wrong ?
>
> Thanks.
>
> --
>
> -eric ho
 





RE: releasing memory without stopping the spark context ?

2016-08-31 Thread Rajani, Arpan
Removing Data

Spark automatically monitors cache usage on each node and drops out old data 
partitions in a least-recently-used (LRU) fashion.
If you would like to manually remove an RDD instead of waiting for it to fall 
out of the cache, use the RDD.unpersist() method. (Copied from 
Documentation
 )
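
A minimal sketch of applying that to everything at once (assuming `sc` and
`sqlContext` are the already-created contexts; this is an illustration, not
part of the documentation excerpt above):

// Unpersist every RDD the context still tracks, without stopping the SparkContext.
sc.getPersistentRDDs.values.foreach(_.unpersist(blocking = true))

// DataFrames/tables cached through the SQL side can be dropped as well.
sqlContext.clearCache()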

From: Cesar [mailto:ces...@gmail.com]
Sent: 31 August 2016 16:57
To: user
Subject: releasing memory without stopping the spark context ?


Is there a way to release all persisted RDD's/DataFrame's in Spark without 
stopping the SparkContext ?


Thanks a lot
--
Cesar Flores


releasing memory without stopping the spark context ?

2016-08-31 Thread Cesar
Is there a way to release all persisted RDD's/DataFrame's in Spark without
stopping the SparkContext ?


Thanks a lot
-- 
Cesar Flores


Re: Does a driver jvm houses some rdd partitions?

2016-08-31 Thread Mich Talebzadeh
Hi,

Are you caching RDD into storage memory here?

Example

s.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY)

Do you have a snapshot of your storage tab?

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 31 August 2016 at 14:53, Jakub Dubovsky 
wrote:

> Hey all,
>
> I have a conceptual question which I have hard time finding answer for.
>
> Is the jvm where spark driver is running also used to run computations
> over rdd partitions and persist them? The answer is obvious for local mode
> (yes). But when it runs on yarn/mesos/standalone with many executors is the
> answer no?
>
> *My motivation is following*
> In "executors" tab of sparkUI in "storage memory" column for driver table
> line one can see "0.0 B / 14.2 GB" for example. This suggests that 14G of
> ram are not available to computations done in driver but are reserved for
> rdd caching.
>
> But I have plenty of memory on executors to cache rdd there. I would like
> to use driver memory for being able to collect medium sized data. Since I
> assume that collected data are stored out of memory reserved from cache
> this means that those 14G not available for saving collected data.
>
> It looks like spark2.0.0 is doing this cache vs non-cache memory
> management somehow automatically but I do not understand that yet
>
> Thanks for any insight on this
>
> Jakub D.
>


Re: Random Forest Classification

2016-08-31 Thread Bryan Cutler
I see.  You might try this, create a pipeline of just your feature
transformers, then call fit() on the complete dataset to get a model.
Finally make second pipeline and add this model and the decision tree as
stages.
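
A rough sketch of that suggestion, reusing the stage and dataset names from the
quoted code below (treat them as assumptions; this is not tested against the
original pipeline):

import org.apache.spark.ml.Pipeline

// First pipeline: only the feature transformers, fitted on the complete dataset.
val featurePipeline = new Pipeline().setStages(Array(
  col1_indexer, col2_indexer, col3_indexer, col4_indexer, col5_indexer,
  vectorAssembler, featureVectorIndexer))
val featureModel = featurePipeline.fit(completeDataset)

// Second pipeline: the fitted feature model plus the classifier; fit() now only
// trains the decision tree, the feature stages just transform.
val fullPipeline = new Pipeline().setStages(Array(featureModel, decisionTree))
val model = fullPipeline.fit(trainingSet)
val output = model.transform(cvSet)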

On Aug 30, 2016 8:19 PM, "Bahubali Jain"  wrote:

> Hi Bryan,
> Thanks for the reply.
> I am indexing 5 columns ,then using these indexed columns to generate the
> "feature" column thru vector assembler.
> Which essentially means that I cannot use *fit()* directly on the
> "completeDataset" dataframe since it will have neither the "feature" column
> nor the 5 indexed columns.
> Of-course there is a dirty way of doing this, but I am wondering if there
> some optimized/intelligent approach for this.
>
> Thanks,
> Baahu
>
> On Wed, Aug 31, 2016 at 3:30 AM, Bryan Cutler  wrote:
>
>> You need to first fit just the VectorIndexer which returns the model,
>> then add the model to the pipeline where it will only transform.
>>
>> val featureVectorIndexer = new VectorIndexer()
>> .setInputCol("feature")
>> .setOutputCol("indexedfeature")
>> .setMaxCategories(180)
>> .fit(completeDataset)
>>
>> On Tue, Aug 30, 2016 at 9:57 AM, Bahubali Jain 
>> wrote:
>>
>>> Hi,
>>> I had run into similar exception " java.util.NoSuchElementException:
>>> key not found: " .
>>> After further investigation I realized it is happening due to
>>> vectorindexer being executed on training dataset and not on entire dataset.
>>>
>>> In the dataframe I have 5 categories , each of these have to go thru
>>> stringindexer and then these are put thru a vector indexer to generate
>>> feature vector.
>>> What is the right way to do this, so that vector indexer can be run on
>>> the entire data and not just on training data?
>>>
>>> Below is the current approach, as evident  VectorIndexer is being
>>> generated based on the training set.
>>>
>>> Please Note: fit() on Vectorindexer cannot be called on entireset
>>> dataframe since it doesn't have the required column(*feature *column is
>>> being generated dynamically in pipeline execution)
>>> How can the vectorindexer be *fit()* on the entireset?
>>>
>>>  val col1_indexer = new StringIndexer().setInputCol("col1").setOutputCol("indexed_col1")
>>> val col2_indexer = new StringIndexer().setInputCol("col2").setOutputCol("indexed_col2")
>>> val col3_indexer = new StringIndexer().setInputCol("col3").setOutputCol("indexed_col3")
>>> val col4_indexer = new StringIndexer().setInputCol("col4").setOutputCol("indexed_col4")
>>> val col5_indexer = new StringIndexer().setInputCol("col5").setOutputCol("indexed_col5")
>>>
>>> val featureArray =  Array("indexed_col1","indexed_col2","indexed_col3","indexed_col4","indexed_col5")
>>> val vectorAssembler = new VectorAssembler().setInputCols(featureArray).setOutputCol("*feature*")
>>> val featureVectorIndexer = new VectorIndexer()
>>> .setInputCol("feature")
>>> .setOutputCol("indexedfeature")
>>> .setMaxCategories(180)
>>>
>>> val decisionTree = new DecisionTreeClassifier().setMaxBins(300).setMaxDepth(1).setImpurity("entropy").setLabelCol("indexed_user_action").setFeaturesCol("indexedfeature").setPredictionCol("prediction")
>>>
>>> val pipeline = new Pipeline().setStages(Array(col1_indexer,col2_indexer,col3_indexer,col4_indexer,col5_indexer,vectorAssembler,featureVectorIndexer,decisionTree))
>>> val model = pipeline.*fit(trainingSet)*
>>> val output = model.transform(cvSet)
>>>
>>>
>>> Thanks,
>>> Baahu
>>>
>>> On Fri, Jul 8, 2016 at 11:24 PM, Bryan Cutler  wrote:
>>>
 Hi Rich,

 I looked at the notebook and it seems like you are fitting the
 StringIndexer and VectorIndexer to only the training data, and it should be
 the entire data set.  So if the training data does not include all of
 the labels and an unknown label appears in the test data during evaluation,
 then it will not know how to index it.  So your code should be like this,
 fit with 'digits' instead of 'training'

 val labelIndexer = new StringIndexer().setInputCol("label").setOutputCol("indexedLabel").fit(digits)
 // Automatically identify categorical features, and index them.
 // Set maxCategories so features with > 4 distinct values are treated as continuous.
 val featureIndexer = new VectorIndexer().setInputCol("features").setOutputCol("indexedFeatures").setMaxCategories(4).fit(digits)

 Hope that helps!

 On Fri, Jul 1, 2016 at 9:24 AM, Rich Tarro  wrote:

> Hi Bryan.
>
> Thanks for your continued help.
>
> Here is the code shown in a Jupyter notebook. I figured this was
> easier that cutting and pasting the code into an email. If you  would like
> me to send you the code in a different format let, me know. The necessary
> data is all downloaded within the notebook itself.
>
> https://console.ng.bluemix.net/data/notebooks/fe7e578a-401f-
> 4744-a318-b1b6bcf6f5f8/view

Iterative update for LocalLDAModel

2016-08-31 Thread jamborta
Hi,

I am trying to take the OnlineLDAOptimizer and apply it iteratively to new
data. My use case would be:

- Train the model using the DistributedLDAModel
- Convert to LocalLDAModel 
- Apply to new data as data comes in using the OnlineLDAOptimizer

I cannot see how this can be done without a code change, as the LDA class does
not have an option to initialise with an existing model. Am I missing
something?

Thanks,



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Iterative-update-for-LocalLDAModel-tp27632.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Spark to Kafka communication encrypted ?

2016-08-31 Thread Cody Koeninger
Encryption is only available in spark-streaming-kafka-0-10, not 0-8.
You enable it the same way you enable it for the Kafka project's new
consumer, by setting kafka configuration parameters appropriately.
http://kafka.apache.org/documentation.html#security_ssl

On Wed, Aug 31, 2016 at 2:03 AM, Eric Ho  wrote:
> I can't find in Spark 1.6.2's docs in how to turn encryption on for Spark to
> Kafka communication ...  I think that the Spark docs only tells you how to
> turn on encryption for inter Spark node communications ..  Am I wrong ?
>
> Thanks.
>
> --
>
> -eric ho
>
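
For illustration only (not from this thread), a minimal sketch of what those
settings look like with the 0-10 direct stream; the broker address, topic and
truststore values below are placeholders:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

val ssc = new StreamingContext(new SparkConf().setAppName("kafka-ssl-example"), Seconds(5))

// Standard consumer properties plus the SSL settings from the Kafka docs above.
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9093",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group",
  "security.protocol" -> "SSL",
  "ssl.truststore.location" -> "/path/to/client.truststore.jks",
  "ssl.truststore.password" -> "changeit"
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("mytopic"), kafkaParams)
)

If the brokers also require client authentication, the matching ssl.keystore.*
properties would be needed as well.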




Re: Model abstract class in spark ml

2016-08-31 Thread Cody Koeninger
http://blog.originate.com/blog/2014/02/27/types-inside-types-in-scala/

On Wed, Aug 31, 2016 at 2:19 AM, Sean Owen  wrote:
> Weird, I recompiled Spark with a similar change to Model and it seemed
> to work but maybe I missed a step in there.
>
> On Wed, Aug 31, 2016 at 6:33 AM, Mohit Jaggi  wrote:
>> I think I figured it out. There is indeed "something deeper in Scala” :-)
>>
>> abstract class A {
>>   def a: this.type
>> }
>>
>> class AA(i: Int) extends A {
>>   def a = this
>> }
>>
>> the above works ok. But if you return anything other than “this”, you will
>> get a compile error.
>>
>> abstract class A {
>>   def a: this.type
>> }
>>
>> class AA(i: Int) extends A {
>>   def a = new AA(1)
>> }
>>
>> Error:(33, 11) type mismatch;
>>  found   : com.dataorchard.datagears.AA
>>  required: AA.this.type
>>   def a = new AA(1)
>>   ^
>>
>> So you have to do:
>>
>> abstract class A[T <: A[T]]  {
>>   def a: T
>> }
>>
>> class AA(i: Int) extends A[AA] {
>>   def a = new AA(1)
>> }
>>
>>
>>
>> Mohit Jaggi
>> Founder,
>> Data Orchard LLC
>> www.dataorchardllc.com
>>
>>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>




Does a driver jvm houses some rdd partitions?

2016-08-31 Thread Jakub Dubovsky
Hey all,

I have a conceptual question which I have hard time finding answer for.

Is the jvm where spark driver is running also used to run computations over
rdd partitions and persist them? The answer is obvious for local mode
(yes). But when it runs on yarn/mesos/standalone with many executors is the
answer no?

*My motivation is following*
In "executors" tab of sparkUI in "storage memory" column for driver table
line one can see "0.0 B / 14.2 GB" for example. This suggests that 14G of
ram are not available to computations done in driver but are reserved for
rdd caching.

But I have plenty of memory on executors to cache rdd there. I would like
to use driver memory to be able to collect medium-sized data. Since I
assume that collected data are stored outside the memory reserved for caching,
this means that those 14G are not available for saving collected data.

It looks like spark2.0.0 is doing this cache vs non-cache memory management
somehow automatically but I do not understand that yet

Thanks for any insight on this

Jakub D.
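
For reference, a small sketch of the knobs behind that automatic behaviour
(the values shown are the Spark 2.0 defaults as I understand them; treat this
as an assumption, not advice from the thread):

import org.apache.spark.SparkConf

// spark.memory.fraction: share of (heap - 300MB) used for execution + storage combined.
// spark.memory.storageFraction: portion of that share in which cached blocks are
// protected from eviction; unused storage memory can still be borrowed by execution.
val conf = new SparkConf()
  .setAppName("unified-memory-example")
  .set("spark.memory.fraction", "0.6")
  .set("spark.memory.storageFraction", "0.5")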


Re: Spark build 1.6.2 error

2016-08-31 Thread Adam Roberts
Looks familiar, got the zinc server running and using a shared dev box?

ps -ef | grep "com.typesafe.zinc.Nailgun", look for the zinc server 
process, kill it and try again. Spark branch-1.6 builds great here from 
scratch; I had plenty of problems thanks to a running zinc server here 
(started with build/mvn).




From:   Nachiketa 
To: Diwakar Dhanuskodi 
Cc: user 
Date:   31/08/2016 12:17
Subject:Re: Spark build 1.6.2 error



Hi Diwakar,

Could you please share the entire maven command that you are using to 
build ? And also the JDK version you are using ?

Also could you please confirm that you did execute the script for change 
scala version to 2.11 before starting the build ? Thanks.

Regards,
Nachiketa

On Wed, Aug 31, 2016 at 2:00 AM, Diwakar Dhanuskodi <
diwakar.dhanusk...@gmail.com> wrote:
Hi, 

While building Spark 1.6.2 , getting below error in spark-sql. Much 
appreciate for any help.

ERROR] missing or invalid dependency detected while loading class file 
'WebUI.class'.
Could not access term eclipse in package org,
because it (or its dependencies) are missing. Check your build definition 
for
missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see 
the problematic classpath.)
A full rebuild may help if 'WebUI.class' was compiled against an 
incompatible version of org.
[ERROR] missing or invalid dependency detected while loading class file 
'WebUI.class'.
Could not access term jetty in value org.eclipse,
because it (or its dependencies) are missing. Check your build definition 
for
missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see 
the problematic classpath.)
A full rebuild may help if 'WebUI.class' was compiled against an 
incompatible version of org.eclipse.
[WARNING] 17 warnings found
[ERROR] two errors found
[INFO] 

[INFO] Reactor Summary:
[INFO] 
[INFO] Spark Project Parent POM .. SUCCESS 
[4.399s]
[INFO] Spark Project Test Tags ... SUCCESS 
[3.443s]
[INFO] Spark Project Launcher  SUCCESS 
[10.131s]
[INFO] Spark Project Networking .. SUCCESS 
[11.849s]
[INFO] Spark Project Shuffle Streaming Service ... SUCCESS 
[6.641s]
[INFO] Spark Project Unsafe .. SUCCESS 
[19.765s]
[INFO] Spark Project Core  SUCCESS 
[4:16.511s]
[INFO] Spark Project Bagel ... SUCCESS 
[13.401s]
[INFO] Spark Project GraphX .. SUCCESS 
[1:08.824s]
[INFO] Spark Project Streaming ... SUCCESS 
[2:18.844s]
[INFO] Spark Project Catalyst  SUCCESS 
[2:43.695s]
[INFO] Spark Project SQL . FAILURE 
[1:01.762s]
[INFO] Spark Project ML Library .. SKIPPED
[INFO] Spark Project Tools ... SKIPPED
[INFO] Spark Project Hive  SKIPPED
[INFO] Spark Project Docker Integration Tests  SKIPPED
[INFO] Spark Project REPL  SKIPPED
[INFO] Spark Project YARN Shuffle Service  SKIPPED
[INFO] Spark Project YARN  SKIPPED
[INFO] Spark Project Assembly  SKIPPED
[INFO] Spark Project External Twitter  SKIPPED
[INFO] Spark Project External Flume Sink . SKIPPED
[INFO] Spark Project External Flume .. SKIPPED
[INFO] Spark Project External Flume Assembly . SKIPPED
[INFO] Spark Project External MQTT ... SKIPPED
[INFO] Spark Project External MQTT Assembly .. SKIPPED
[INFO] Spark Project External ZeroMQ . SKIPPED
[INFO] Spark Project External Kafka .. SKIPPED
[INFO] Spark Project Examples  SKIPPED
[INFO] Spark Project External Kafka Assembly . SKIPPED
[INFO] 

[INFO] BUILD FAILURE
[INFO] 

[INFO] Total time: 12:40.525s
[INFO] Finished at: Wed Aug 31 01:56:50 IST 2016
[INFO] Final Memory: 71M/830M
[INFO] 

[ERROR] Failed to execute goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) 
on project spark-sql_2.11: Execution scala-compile-first of goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed. CompileFailed 
-> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the 
-e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, 
please read the following articles:
[ERROR]

Grouping on bucketed and sorted columns

2016-08-31 Thread Fridtjof Sander

Hi Spark users,

I'm currently investigating spark's bucketing and partitioning 
capabilities and I have some questions:


Let /T/ be a table that is bucketed and sorted by /T.id/ and partitioned 
by /T.date/. Before persisting, /T/ has been repartitioned by /T.id/ to 
get only one file per bucket.

I want to group by /T.id/ over a subset of /T.date/'s values.

It seems to me that the best execution plan in this scenario would be 
the following:
- Schedule one stage (no exchange) with as many tasks as we have 
bucket-ids, so that there is a mapping from each task to a bucket-id
- Each tasks opens all bucket-files belonging to "it's" bucket-id 
simultaneously, which is one per affected partition /T.date/
- Since the data inside the buckets are sorted, we can perform the 
second phase of "two-phase-multiway-merge-sort" to get our groups, which 
can be "pipelined" into the next operator


From what I understand after scanning through the code, however, it 
appears to me that each bucket-file is read completely before the 
record-iterator is advanced to the next bucket file (see FileScanRDD , 
same applies to Hive). So a groupBy would require to sort the partitions 
of the resulting RDD before the groups can be emitted, which results in 
a blocking operation.


Could anyone confirm that I'm assessing the situation correctly here, or 
correct me if not?


Followup questions:

1. Is there a way to get the "sql" groups into the RDD API, like the RDD 
groupBy would return them? I fail to formulate a job like this, because 
a query with groupBy that lacks an aggregation function is invalid.
2. I haven't simply testet this, because I fail to load a table with the 
specified properties like above:

After writing a table like this:

.write().partitionBy("date").bucketBy(4,"id").sortBy("id").format("json").saveAsTable("table");

I fail to read it again, with the partitioning and bucketing being 
recognized.
Is a functioning Hive-Metastore required for this to work, or is there a 
workaround?


I hope someone can spare the time to help me out here.

All the best,
Fridtjof
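
As a side note on question 1 above, a minimal sketch (my own assumption, not an
answer from the thread) of materialising groups in the SQL API by supplying
collect_list as the otherwise missing aggregate; "payload" and the date literals
are placeholder names:

import org.apache.spark.sql.functions.collect_list
import spark.implicits._

// Keep the grouped rows as arrays, roughly what RDD.groupBy would give back.
val grouped = spark.table("table")
  .where($"date".isin("2016-01-01", "2016-01-02"))
  .groupBy($"id")
  .agg(collect_list($"payload").as("records"))
grouped.show()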




Re: Why does spark take so much time for simple task without calculation?

2016-08-31 Thread Bedrytski Aliaksandr
Hi xiefeng,

Spark Context initialization takes some time and the tool does not
really shine for small data computations:
http://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html

But, when working with terabytes (petabytes) of data, those 35 seconds
of initialization don't really matter. 

Regards,

-- 
  Bedrytski Aliaksandr
  sp...@bedryt.ski

On Wed, Aug 31, 2016, at 11:45, xiefeng wrote:
> I install a spark standalone and run the spark cluster(one master and one
> worker) in a windows 2008 server with 16cores and 24GB memory.
> 
> I have done a simple test: Just create  a string RDD and simply return
> it. I
> use JMeter to test throughput but the highest is around 35/sec. I think
> spark is powerful at distribute calculation, but why the throughput is so
> limit in such simple test scenario only contains simple task dispatch and
> no
> calculation?
> 
> 1. In JMeter I test both 10 threads or 100 threads, there is little
> difference around 2-3/sec.
> 2. I test both cache/not cache the RDD, there is little difference. 
> 3. During the test, the cpu and memory are in low level.
> 
> Below is my test code:
> @RestController
> public class SimpleTest {   
>   @RequestMapping(value = "/SimpleTest", method = RequestMethod.GET)
>   @ResponseBody
>   public String testProcessTransaction() {
>   return SparkShardTest.simpleRDDTest();
>   }
> }
> 
> final static Map<String, JavaRDD<String>> simpleRDDs = initSimpleRDDs();
> public static Map<String, JavaRDD<String>> initSimpleRDDs()
>   {
>   Map<String, JavaRDD<String>> result = new ConcurrentHashMap<String, JavaRDD<String>>();
>   JavaRDD<String> rddData = JavaSC.parallelize(data);
>   rddData.cache().count();//this cache will improve 1-2/sec
>   result.put("MyRDD", rddData);
>   return result;
>   }
>   
>   public static String simpleRDDTest()
>   {   
>   JavaRDD<String> rddData = simpleRDDs.get("MyRDD");
>   return rddData.first();
>   }
> 
> 
> 
> 
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Why-does-spark-take-so-much-time-for-simple-task-without-calculation-tp27628.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 




Re: SVD output within Spark

2016-08-31 Thread Yanbo Liang
The signs of the eigenvectors are essentially arbitrary, so both result of
Spark and Matlab are right.

Thanks

On Thu, Jul 21, 2016 at 3:50 PM, Martin Somers  wrote:

>
> just looking at a comparision between Matlab and Spark for svd with an
> input matrix N
>
>
> this is matlab code - yes very small matrix
>
> N =
>
> 2.5903   -0.0416    0.6023
> -0.1236    2.5596    0.7629
> 0.0148   -0.0693    0.2490
>
>
>
> U =
>
>-0.3706   -0.9284    0.0273
>-0.9287    0.3708    0.0014
>-0.0114   -0.0248   -0.9996
>
> 
> Spark code
>
> // Breeze to spark
> val N1D = N.reshape(1, 9).toArray
>
>
> // Note I had to transpose array to get correct values with incorrect signs
> val V2D = N1D.grouped(3).toArray.transpose
>
>
> // Then convert the array into a RDD
> // val NVecdis = Vectors.dense(N1D.map(x => x.toDouble))
> // val V2D = N1D.grouped(3).toArray
>
>
> val rowlocal = V2D.map{x => Vectors.dense(x)}
> val rows = sc.parallelize(rowlocal)
> val mat = new RowMatrix(rows)
> val mat = new RowMatrix(rows)
> val svd = mat.computeSVD(mat.numCols().toInt, computeU=true)
>
> 
>
> Spark Output - notice the change in sign on the 2nd and 3rd column
> -0.3158590633523746   0.9220516369164243   -0.22372713505049768
> -0.8822050381939436   -0.3721920780944116  -0.28842213436035985
> -0.34920956843045253  0.10627246051309004  0.9309988407367168
>
>
>
> And finally some julia code
> N  = [2.59031   -0.0416335  0.602295;
> -0.123584   2.55964    0.762906;
> 0.0148463  -0.0693119  0.249017]
>
> svd(N, thin=true)   --- same as matlab
> -0.315859  -0.922052   0.223727
> -0.882205   0.372192   0.288422
> -0.34921   -0.106272  -0.930999
>
> Most likely its an issue with my implementation rather than being a bug
> with svd within the spark environment
> My spark instance is running locally with a docker container
> Any suggestions
> tks
>
>
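
A quick way to see why both results are right (a sketch using Breeze with the
Matlab matrix quoted above; not from the original exchange): flipping the sign
of a left singular vector together with the matching right singular vector
leaves U * diag(s) * Vt unchanged, so both factorizations reconstruct N.

import breeze.linalg.{diag, svd, DenseMatrix}

// The input matrix from the message above.
val n = DenseMatrix(
  (2.5903, -0.0416, 0.6023),
  (-0.1236, 2.5596, 0.7629),
  (0.0148, -0.0693, 0.2490))

val svd.SVD(u, s, vt) = svd(n)
// Reproduces n up to floating point error, whatever signs u and vt carry.
val reconstructed = u * diag(s) * vt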


Problem with Graphx and number of partitions

2016-08-31 Thread alvarobrandon
Helo everyone:

I have a problem when setting the number of partitions inside Graphx with
the ConnectedComponents function. When I launch the application with the
default number of partition everything runs smoothly. However when I
increase the number of partitions to 150 for example ( it happens with
bigger values as well) it gets stuck in stage 5 in the last task.


 

with the following error


[Stage
5:=>  
(190 + 10) / 200]241.445: [GC [PSYoungGen: 118560K->800K(233472K)]
418401K->301406K(932864K), 0.0029430 secs] [Times: user=0.02 sys=0.00,
real=0.01 secs]
[Stage
5:=>(199
+ 1) / 200]16/08/31 11:09:23 ERROR spark.ContextCleaner: Error cleaning
broadcast 4
org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120
seconds. This timeout is controlled by spark.rpc.askTimeout
at
org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:185)
at scala.util.Try$.apply(Try.scala:161)
at scala.util.Failure.recover(Try.scala:185)
at
scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
at
scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at
org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
at
scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133)
at
scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at
scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at scala.concurrent.Promise$class.complete(Promise.scala:55)
at
scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at
scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.processBatch$1(Future.scala:643)
at
scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply$mcV$sp(Future.scala:658)
at
scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635)
at
scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635)
at
scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at
scala.concurrent.Future$InternalCallbackExecutor$Batch.run(Future.scala:634)
at
scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
at
scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:685)
at
scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at
scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
at
scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153)
at
org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:241)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply
in 120 seconds
at
org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:242)
... 7 more

The way I set the number of partitions is when reading the graph through:

val graph = GraphLoader.edgeListFile(sc, input, true, minEdge,
StorageLevel.MEMORY_AND_DISK, StorageLevel.MEMORY_AND_DISK)
val res = graph.connectedComponents().vertices

The version of Spark I'm using is 1
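
Not an answer from the thread, but since the stack trace itself names
spark.rpc.askTimeout, a minimal sketch of raising those timeouts before the
context is created (the values are arbitrary examples):

import org.apache.spark.{SparkConf, SparkContext}

// The timeouts named in the error can be set through SparkConf.
val conf = new SparkConf()
  .setAppName("connected-components")
  .set("spark.rpc.askTimeout", "600s")
  .set("spark.network.timeout", "600s")
val sc = new SparkContext(conf)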

Re: Spark build 1.6.2 error

2016-08-31 Thread Nachiketa
Hi Diwakar,

Could you please share the entire maven command that you are using to build
? And also the JDK version you are using ?

Also could you please confirm that you did execute the script for change
scala version to 2.11 before starting the build ? Thanks.

Regards,
Nachiketa

On Wed, Aug 31, 2016 at 2:00 AM, Diwakar Dhanuskodi <
diwakar.dhanusk...@gmail.com> wrote:

> Hi,
>
> While building Spark 1.6.2 , getting below error in spark-sql. Much
> appreciate for any help.
>
> ERROR] missing or invalid dependency detected while loading class file
> 'WebUI.class'.
> Could not access term eclipse in package org,
> because it (or its dependencies) are missing. Check your build definition
> for
> missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see
> the problematic classpath.)
> A full rebuild may help if 'WebUI.class' was compiled against an
> incompatible version of org.
> [ERROR] missing or invalid dependency detected while loading class file
> 'WebUI.class'.
> Could not access term jetty in value org.eclipse,
> because it (or its dependencies) are missing. Check your build definition
> for
> missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see
> the problematic classpath.)
> A full rebuild may help if 'WebUI.class' was compiled against an
> incompatible version of org.eclipse.
> [WARNING] 17 warnings found
> [ERROR] two errors found
> [INFO] 
> 
> [INFO] Reactor Summary:
> [INFO]
> [INFO] Spark Project Parent POM .. SUCCESS [4.399s]
> [INFO] Spark Project Test Tags ... SUCCESS [3.443s]
> [INFO] Spark Project Launcher  SUCCESS
> [10.131s]
> [INFO] Spark Project Networking .. SUCCESS
> [11.849s]
> [INFO] Spark Project Shuffle Streaming Service ... SUCCESS [6.641s]
> [INFO] Spark Project Unsafe .. SUCCESS
> [19.765s]
> [INFO] Spark Project Core  SUCCESS
> [4:16.511s]
> [INFO] Spark Project Bagel ... SUCCESS
> [13.401s]
> [INFO] Spark Project GraphX .. SUCCESS
> [1:08.824s]
> [INFO] Spark Project Streaming ... SUCCESS
> [2:18.844s]
> [INFO] Spark Project Catalyst  SUCCESS
> [2:43.695s]
> [INFO] Spark Project SQL . FAILURE
> [1:01.762s]
> [INFO] Spark Project ML Library .. SKIPPED
> [INFO] Spark Project Tools ... SKIPPED
> [INFO] Spark Project Hive  SKIPPED
> [INFO] Spark Project Docker Integration Tests  SKIPPED
> [INFO] Spark Project REPL  SKIPPED
> [INFO] Spark Project YARN Shuffle Service  SKIPPED
> [INFO] Spark Project YARN  SKIPPED
> [INFO] Spark Project Assembly  SKIPPED
> [INFO] Spark Project External Twitter  SKIPPED
> [INFO] Spark Project External Flume Sink . SKIPPED
> [INFO] Spark Project External Flume .. SKIPPED
> [INFO] Spark Project External Flume Assembly . SKIPPED
> [INFO] Spark Project External MQTT ... SKIPPED
> [INFO] Spark Project External MQTT Assembly .. SKIPPED
> [INFO] Spark Project External ZeroMQ . SKIPPED
> [INFO] Spark Project External Kafka .. SKIPPED
> [INFO] Spark Project Examples  SKIPPED
> [INFO] Spark Project External Kafka Assembly . SKIPPED
> [INFO] 
> 
> [INFO] BUILD FAILURE
> [INFO] 
> 
> [INFO] Total time: 12:40.525s
> [INFO] Finished at: Wed Aug 31 01:56:50 IST 2016
> [INFO] Final Memory: 71M/830M
> [INFO] 
> 
> [ERROR] Failed to execute goal 
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile
> (scala-compile-first) on project spark-sql_2.11: Execution
> scala-compile-first of goal 
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile
> failed. CompileFailed -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the
> -e switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions,
> please read the following articles:
> [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/
> PluginExecutionException
> [ERROR]
> [ERROR] After correcting the problems, you can resume the build with the
> command
> [ERROR]   mvn  -rf :spark-sql_2.11
>
>
>


-- 
Regards,
-- Nachiketa


Why does spark take so much time for simple task without calculation?

2016-08-31 Thread xiefeng
I install a spark standalone and run the spark cluster(one master and one
worker) in a windows 2008 server with 16cores and 24GB memory.

I have done a simple test: Just create  a string RDD and simply return it. I
use JMeter to test throughput but the highest is around 35/sec. I think
spark is powerful at distribute calculation, but why the throughput is so
limit in such simple test scenario only contains simple task dispatch and no
calculation?

1. In JMeter I test both 10 threads or 100 threads, there is little
difference around 2-3/sec.
2. I test both cache/not cache the RDD, there is little difference. 
3. During the test, the cpu and memory are in low level.

Below is my test code:
@RestController
public class SimpleTest {   
@RequestMapping(value = "/SimpleTest", method = RequestMethod.GET)
@ResponseBody
public String testProcessTransaction() {
return SparkShardTest.simpleRDDTest();
}
}

final static Map<String, JavaRDD<String>> simpleRDDs = initSimpleRDDs();
public static Map<String, JavaRDD<String>> initSimpleRDDs()
{
Map<String, JavaRDD<String>> result = new ConcurrentHashMap<String, JavaRDD<String>>();
JavaRDD<String> rddData = JavaSC.parallelize(data);
rddData.cache().count();//this cache will improve 1-2/sec
result.put("MyRDD", rddData);
return result;
}

public static String simpleRDDTest()
{   
JavaRDD<String> rddData = simpleRDDs.get("MyRDD");
return rddData.first();
}




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Why-does-spark-take-so-much-time-for-simple-task-without-calculation-tp27628.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Model abstract class in spark ml

2016-08-31 Thread Sean Owen
Weird, I recompiled Spark with a similar change to Model and it seemed
to work but maybe I missed a step in there.

On Wed, Aug 31, 2016 at 6:33 AM, Mohit Jaggi  wrote:
> I think I figured it out. There is indeed "something deeper in Scala” :-)
>
> abstract class A {
>   def a: this.type
> }
>
> class AA(i: Int) extends A {
>   def a = this
> }
>
> the above works ok. But if you return anything other than “this”, you will
> get a compile error.
>
> abstract class A {
>   def a: this.type
> }
>
> class AA(i: Int) extends A {
>   def a = new AA(1)
> }
>
> Error:(33, 11) type mismatch;
>  found   : com.dataorchard.datagears.AA
>  required: AA.this.type
>   def a = new AA(1)
>   ^
>
> So you have to do:
>
> abstract class A[T <: A[T]]  {
>   def a: T
> }
>
> class AA(i: Int) extends A[AA] {
>   def a = new AA(1)
> }
>
>
>
> Mohit Jaggi
> Founder,
> Data Orchard LLC
> www.dataorchardllc.com
>
>




Re: Slow activation using Spark Streaming's new receiver scheduling mechanism

2016-08-31 Thread Renxia Wang
I do also have this problem. The total time for launching receivers seems
related to the total number of executors. In my case, when I run 400
executors with 200 receivers, it takes about a minute for all receivers
become active, but with 800 executors, it takes 3 minutes to activate all
receivers.

I am running on YARN on EMR 4.7.2, with Spark 1.6.2.

2015-10-21 12:15 GMT-07:00 Budde, Adam :

> Hi all,
>
> My team uses Spark Streaming to implement the batch processing component
> of a lambda architecture with 5 min intervals. We process roughly 15 TB/day
> using three discrete Spark clusters and about 250 receivers per cluster.
> We've been having some issues migrating our platform from Spark 1.4.x to
> Spark 1.5.x.
>
> The first issue we've been having relates to receiver scheduling. Under
> Spark 1.4.x, each receiver becomes active almost immediately and the
> application quickly reaches its peak input throughput. Under the new
> receiver scheduling mechanism introduced in Spark 1.5.x (SPARK-8882
> ) we see that it takes
> quite a while for our receivers to become active. I haven't spent too much
> time gathering hard numbers on this, but my estimate would be that it takes
> over half an hour for half the receivers to become active and well over an
> hour for all of them to become active.
>
> I spent some time digging into the code for the *ReceiverTracker*,
> *ReceiverSchedulingPolicy*, and *ReceiverSupervisor* classes and
> recompiling Spark with some added debug logging. As far as I can tell, this
> is what is happening:
>
>- On program start, the ReceiverTracker RPC endpoint receives a
>*StartAllReceivers* message via its own *launchReceivers()* method
>(itself invoked by *start()*)
>- The handler for StartAllReceivers invokes
>*ReceiverSchedulingPolicy.scheduleReceivers()* to generate a desired
>receiver to executor mapping and calls
>*ReceiverTracker.startReceiver()* for each receiver
>- *startReceiver()* uses the SparkContext to submit a job that creates
>an instance of ReceiverSupervisorImpl to run the receiver on a random
>executor
>- While bootstrapping the receiver, the
>*ReceiverSupervisorImpl.onReceiverStart()* sends a* RegisterReceiver*
>message to the ReceiverTracker RPC endpoint
>- The handler for RegisterReceiver checks if the randomly-selected
>executor was the one the receiver was assigned to by
>ReceiverSchedulingPolicy.scheduleReceivers() and fails the job if it
>isn't
>- ReceiverTracker restarts the failed receiver job and this process
>continues until all receivers are assigned to their proper executor
>
> Assuming this order of operations is correct, I have the following
> questions:
>
>1. Is there any way to coerce SparkContext.submitJob() into scheduling
>a job on a specific executor? Put another way, is there a mechanism we can
>use to ensure that each receiver job is run on the executor it was assigned
>to on the first call to ReceiverSchedulingPolicy.scheduleReceivers()?
>2. If (1) is not possible, is there anything we can do to speed up the
>StartReceiver -> RegisterReceiver -> RestartReceiver loop? Right now, it
>seems to take about 30-40 sec between attempts to invoke RegisterReceiver
>on a given receiver.
>
> Thanks for the help!
>
> Adam
>


Spark to Kafka communication encrypted ?

2016-08-31 Thread Eric Ho
I can't find in Spark 1.6.2's docs how to turn encryption on for Spark-to-Kafka
communication ...  I think that the Spark docs only tell you how
to turn on encryption for inter-Spark-node communications ..  Am I wrong ?

Thanks.

-- 

-eric ho