Re: Invalid ContainerId ... Caused by: java.lang.NumberFormatException: For input string: e04
On 24 Mar 2015, at 02:10, Marcelo Vanzin van...@cloudera.com wrote: This happens most probably because the Spark 1.3 you have downloaded is built against an older version of the Hadoop libraries than those used by CDH, and those libraries cannot parse the container IDs generated by CDH. This sounds suspiciously like the change in YARN for HA (the epoch number) not being parsed by older versions of the YARN client libs. This is effectively a regression in the YARN code - it's creating container IDs that can't be easily parsed by old apps. It may be possible to fix that Spark-side by having Spark use its own parser for the YARN container/app environment variable.
Spark as a service
Hello, As of now, if I have to execute a Spark job, I need to create a jar and deploy it. If I need to run a dynamically formed SQL from a Web application, is there any way of using SparkSQL in this manner? Perhaps, through a Web Service or something similar. Regards, Ashish
Standalone Scheduler VS YARN Performance
What is the performance overhead caused by YARN, or what configurations are changed when the app is run through YARN? The following example: sqlContext.sql("SELECT dayStamp(date), count(distinct deviceId) AS c FROM full GROUP BY dayStamp(date) ORDER BY c DESC LIMIT 10").collect() runs in the shell when we use the standalone scheduler: ./spark-shell --master sparkmaster:7077 --executor-memory 20g --executor-cores 10 --driver-memory 10g --num-executors 8 but fails due to a lost executor when we run it through YARN: ./spark-shell --master yarn-client --executor-memory 20g --executor-cores 10 --driver-memory 10g --num-executors 8 There are no evident logs, just messages that executors are being lost, and connection-refused errors (apparently due to executor failures). The cluster is the same: 8 nodes, 64 GB RAM each. The data format is Parquet. -- RGRDZ Harut
Re: issue while creating spark context
That's probably the problem; the intended path is on HDFS but the configuration specifies a local path. See the exception message. On Tue, Mar 24, 2015 at 1:08 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Its in your local file system, not in hdfs. Thanks Best Regards On Tue, Mar 24, 2015 at 6:25 PM, Sachin Singh sachin.sha...@gmail.com wrote: hi, I can see required permission is granted for this directory as under, hadoop dfs -ls /user/spark DEPRECATED: Use of this script to execute hdfs command is deprecated. Instead use the hdfs command for it. Found 1 items drwxrwxrwt - spark spark 0 2015-03-20 01:04 /user/spark/applicationHistory regards Sachin
Re: issue while creating spark context
Thanks Sean. Could you please suggest in which file or configuration I need to set the proper path? Please elaborate; that would help. Thanks, Regards Sachin On Tue, Mar 24, 2015 at 7:15 PM, Sean Owen so...@cloudera.com wrote: That's probably the problem; the intended path is on HDFS but the configuration specifies a local path. See the exception message. On Tue, Mar 24, 2015 at 1:08 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Its in your local file system, not in hdfs. Thanks Best Regards On Tue, Mar 24, 2015 at 6:25 PM, Sachin Singh sachin.sha...@gmail.com wrote: hi, I can see required permission is granted for this directory as under, hadoop dfs -ls /user/spark DEPRECATED: Use of this script to execute hdfs command is deprecated. Instead use the hdfs command for it. Found 1 items drwxrwxrwt - spark spark 0 2015-03-20 01:04 /user/spark/applicationHistory regards Sachin
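The event log path in the exception is being resolved against the local file system (a file: URI), so the usual fix on CDH is to point Spark's event logging explicitly at the HDFS directory shown above. A minimal sketch of the relevant spark-defaults.conf entries, assuming the NameNode address is namenode:8020 (a placeholder, not taken from this thread):

spark.eventLog.enabled true
spark.eventLog.dir hdfs://namenode:8020/user/spark/applicationHistory

With a fully qualified hdfs:// URI the application log directory is created under the HDFS path that already carries the spark:spark ownership listed above, rather than under file:/user/spark.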
Re: Spark streaming alerting
Streaming _from_ Cassandra, CassandraInputDStream, is coming BTW: https://issues.apache.org/jira/browse/SPARK-6283 I am working on it now. Helena @helenaedelson On Mar 23, 2015, at 5:22 AM, Khanderao Kand Gmail khanderao.k...@gmail.com wrote: Akhil, you are right in your answer to what Mohit wrote. However, what Mohit seems to be alluding to but did not write clearly might be different. Mohit, you are wrong in saying that streaming generally works on HDFS and Cassandra. Streaming typically works with a streaming or queueing source like Kafka, Kinesis, Twitter, Flume, ZeroMQ, etc. (but can also read from HDFS and S3). However, the streaming context (the receiver within the streaming context) gets events/messages/records and forms a time-window-based batch (RDD), so there is a maximum gap of one window length between when the alert message was available to Spark and when the processing happens. I think this is what you meant. As per the Spark programming model, the RDD is the right way to deal with the data. If you are fine with a minimum delay of, say, a second (based on the minimum window that a DStream can support), then what Rohit gave is the right model. Khanderao On Mar 22, 2015, at 11:39 PM, Akhil Das ak...@sigmoidanalytics.com wrote: What do you mean you can't send it directly from spark workers? Here's a simple approach which you could do: val data = ssc.textFileStream("sigmoid/") val dist = data.filter(_.contains("ERROR")).foreachRDD(rdd => alert("Errors : " + rdd.count())) And the alert() function could be anything triggering an email or sending an SMS alert. Thanks Best Regards On Sun, Mar 22, 2015 at 1:52 AM, Mohit Anchlia mohitanch...@gmail.com wrote: Is there a module in spark streaming that lets you listen to the alerts/conditions as they happen in the streaming module? Generally spark streaming components will execute on large sets of clusters like HDFS or Cassandra, however when it comes to alerting you generally can't send it directly from the spark workers, which means you need a way to listen to the alerts.
Re: issue while creating spark context
Hi Akhil, thanks for your quick reply. Could you please elaborate, i.e. what kind of permission is required? Thanks in advance, Regards Sachin On Tue, Mar 24, 2015 at 5:29 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Its an IOException, just make sure you are having the correct permission over */user/spark* directory. Thanks Best Regards On Tue, Mar 24, 2015 at 5:21 PM, sachin Singh sachin.sha...@gmail.com wrote: hi all, all of a sudden I am getting the below error when I submit a spark job using master as yarn; it is not able to create the spark context (it was previously working fine). I am using CDH 5.3.1 and creating a javaHiveContext. spark-submit --jars ./analiticlibs/mysql-connector-java-5.1.17.jar,./analiticlibs/log4j-1.2.17.jar --master yarn --class myproject.com.java.jobs.Aggregationtask sparkjob-1.0.jar error message- java.io.IOException: Error in creating log directory: file:/user/spark/applicationHistory/application_1427194309307_0005
Optimal solution for getting the header from CSV with Spark
Hello! I would like to know what the optimal solution is for getting the header from a CSV file with Spark. My approach was: def getHeader(data: RDD[String]): String = { data.zipWithIndex().filter(_._2 == 0).map(x => x._1).take(1).mkString() } Thanks.
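Since the header is simply the first element of the RDD, first() avoids the zipWithIndex pass entirely. A minimal sketch, reusing the data: RDD[String] from the question (the filter step assumes no data line is identical to the header):

val header = data.first()                        // first line of the file
val rows = data.filter(line => line != header)   // everything except the header

first() only touches the first partition, so it is considerably cheaper than zipWithIndex().filter(...) for this purpose.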
EC2 Having script run at startup
Hello, in the context of SPARK-2394 Make it easier to read LZO-compressed files from EC2 clusters https://issues.apache.org/jira/browse/SPARK-2394 , I was wondering: Is there an easy way to make a user-provided script run at every machine in a cluster launched on EC2? Regards, Theodore
Re: issue while creating spark context
hi, I can see required permission is granted for this directory as under, hadoop dfs -ls /user/spark DEPRECATED: Use of this script to execute hdfs command is deprecated. Instead use the hdfs command for it. Found 1 items *drwxrwxrwt - spark spark 0 2015-03-20 01:04 /user/spark/applicationHistory* regards Sachin On Tue, Mar 24, 2015 at 6:13 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Write permission, as it is clearly saying: java.io.IOException:* Error in creating log directory:* file:*/user/spark/*applicationHistory/application_1427194309307_0005 Thanks Best Regards On Tue, Mar 24, 2015 at 6:08 PM, Sachin Singh sachin.sha...@gmail.com wrote: Hi Akhil, thanks for your quick reply. Could you please elaborate, i.e. what kind of permission is required? Thanks in advance, Regards Sachin On Tue, Mar 24, 2015 at 5:29 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Its an IOException, just make sure you are having the correct permission over */user/spark* directory. Thanks Best Regards On Tue, Mar 24, 2015 at 5:21 PM, sachin Singh sachin.sha...@gmail.com wrote: hi all, all of a sudden I am getting the below error when I submit a spark job using master as yarn; it is not able to create the spark context (it was previously working fine). I am using CDH 5.3.1 and creating a javaHiveContext. spark-submit --jars ./analiticlibs/mysql-connector-java-5.1.17.jar,./analiticlibs/log4j-1.2.17.jar --master yarn --class myproject.com.java.jobs.Aggregationtask sparkjob-1.0.jar error message- java.io.IOException: Error in creating log directory: file:/user/spark/applicationHistory/application_1427194309307_0005
Re: issue while creating spark context
Its in your local file system, not in hdfs. Thanks Best Regards On Tue, Mar 24, 2015 at 6:25 PM, Sachin Singh sachin.sha...@gmail.com wrote: hi, I can see required permission is granted for this directory as under, hadoop dfs -ls /user/spark DEPRECATED: Use of this script to execute hdfs command is deprecated. Instead use the hdfs command for it. Found 1 items *drwxrwxrwt - spark spark 0 2015-03-20 01:04 /user/spark/applicationHistory* regards Sachin On Tue, Mar 24, 2015 at 6:13 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Write permission, as it is clearly saying: java.io.IOException:* Error in creating log directory:* file:*/user/spark/*applicationHistory/application_1427194309307_0005 Thanks Best Regards On Tue, Mar 24, 2015 at 6:08 PM, Sachin Singh sachin.sha...@gmail.com wrote: Hi Akhil, thanks for your quick reply. Could you please elaborate, i.e. what kind of permission is required? Thanks in advance, Regards Sachin On Tue, Mar 24, 2015 at 5:29 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Its an IOException, just make sure you are having the correct permission over */user/spark* directory. Thanks Best Regards On Tue, Mar 24, 2015 at 5:21 PM, sachin Singh sachin.sha...@gmail.com wrote: hi all, all of a sudden I am getting the below error when I submit a spark job using master as yarn; it is not able to create the spark context (it was previously working fine). I am using CDH 5.3.1 and creating a javaHiveContext. spark-submit --jars ./analiticlibs/mysql-connector-java-5.1.17.jar,./analiticlibs/log4j-1.2.17.jar --master yarn --class myproject.com.java.jobs.Aggregationtask sparkjob-1.0.jar error message- java.io.IOException: Error in creating log directory: file:/user/spark/applicationHistory/application_1427194309307_0005
Re: issue while creating spark context
Write permission, as it is clearly saying: java.io.IOException:* Error in creating log directory:* file:*/user/spark/*applicationHistory/application_1427194309307_0005 Thanks Best Regards On Tue, Mar 24, 2015 at 6:08 PM, Sachin Singh sachin.sha...@gmail.com wrote: Hi Akhil, thanks for your quick reply. Could you please elaborate, i.e. what kind of permission is required? Thanks in advance, Regards Sachin On Tue, Mar 24, 2015 at 5:29 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Its an IOException, just make sure you are having the correct permission over */user/spark* directory. Thanks Best Regards On Tue, Mar 24, 2015 at 5:21 PM, sachin Singh sachin.sha...@gmail.com wrote: hi all, all of a sudden I am getting the below error when I submit a spark job using master as yarn; it is not able to create the spark context (it was previously working fine). I am using CDH 5.3.1 and creating a javaHiveContext. spark-submit --jars ./analiticlibs/mysql-connector-java-5.1.17.jar,./analiticlibs/log4j-1.2.17.jar --master yarn --class myproject.com.java.jobs.Aggregationtask sparkjob-1.0.jar error message- java.io.IOException: Error in creating log directory: file:/user/spark/applicationHistory/application_1427194309307_0005
issue while creating spark context
hi all, all of a sudden I am getting the below error when I submit a spark job using master as yarn; it is not able to create the spark context (it was previously working fine). I am using CDH 5.3.1 and creating a javaHiveContext. spark-submit --jars ./analiticlibs/mysql-connector-java-5.1.17.jar,./analiticlibs/log4j-1.2.17.jar --master yarn --class myproject.com.java.jobs.Aggregationtask sparkjob-1.0.jar error message- java.io.IOException: Error in creating log directory: file:/user/spark/applicationHistory/application_1427194309307_0005 at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:133) at org.apache.spark.util.FileLogger.start(FileLogger.scala:115) at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74) at org.apache.spark.SparkContext.init(SparkContext.scala:353) at org.apache.spark.api.java.JavaSparkContext.init(JavaSparkContext.scala:61) at myproject.com.java.core.SparkAnaliticEngine.getJavaSparkContext(SparkAnaliticEngine.java:77) at myproject.com.java.core.SparkAnaliticTable.evmyprojectate(SparkAnaliticTable.java:108) at myproject.com.java.core.SparkAnaliticEngine.evmyprojectateAnaliticTable(SparkAnaliticEngine.java:55) at myproject.com.java.core.SparkAnaliticEngine.evmyprojectateAnaliticTable(SparkAnaliticEngine.java:65) at myproject.com.java.jobs.CustomAggregationJob.main(CustomAggregationJob.java:184) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Re: diffrence in PCA of MLib vs H2o in R
Those implementations are computing an SVD of the input matrix directly, and while you generally need the columns to have mean 0, you can turn that off with the options you cite. I don't think this is possible in the MLlib implementation, since it is computing the principal components by computing eigenvectors of the covariance matrix. The means inherently don't matter either way in this computation. On Tue, Mar 24, 2015 at 6:13 AM, roni roni.epi...@gmail.com wrote: I am trying to compute PCA using computePrincipalComponents. I also computed PCA using h2o in R and R's prcomp. The answers I get from H2o and R's prcomp (non h2o) are the same when I set the options for H2o as standardized=FALSE and for R's prcomp as center=FALSE. How do I make sure that the settings for MLlib PCA are the same as I am using for H2o or prcomp. Thanks Roni
Re: Spark as a service
Hi Ashish, this might be what you're looking for: https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server Regards, Jeff 2015-03-24 11:28 GMT+01:00 Ashish Mukherjee ashish.mukher...@gmail.com: Hello, As of now, if I have to execute a Spark job, I need to create a jar and deploy it. If I need to run a dynamically formed SQL from a Web application, is there any way of using SparkSQL in this manner? Perhaps, through a Web Service or something similar. Regards, Ashish
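Once the Thrift JDBC/ODBC server is running, dynamically formed SQL can be sent to it from a web application over a plain JDBC connection. A minimal sketch, assuming the server runs on host sparkhost with the default port 10000 and that the Hive JDBC driver is on the classpath (host, table, and credentials are placeholders):

import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")   // register the HiveServer2-compatible driver
val conn = DriverManager.getConnection("jdbc:hive2://sparkhost:10000/default", "user", "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("SELECT count(*) FROM some_table")   // the dynamically built SQL goes here
while (rs.next()) {
  println(rs.getLong(1))
}
conn.close()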
Re: Question about Data Sources API
Hello Michael, Thanks for your quick reply. My question wrt Java/Scala was related to extending the classes to support new custom data sources, so was wondering if those could be written in Java, since our company is a Java shop. The additional push downs I am looking for are aggregations with grouping and sorting. Essentially, I am trying to evaluate if this API can give me much of what is possible with the Apache MetaModel project. Regards, Ashish On Tue, Mar 24, 2015 at 1:57 PM, Michael Armbrust mich...@databricks.com wrote: On Tue, Mar 24, 2015 at 12:57 AM, Ashish Mukherjee ashish.mukher...@gmail.com wrote: 1. Is the Data Source API stable as of Spark 1.3.0? It is marked DeveloperApi, but in general we do not plan to change even these APIs unless there is a very compelling reason to. 2. The Data Source API seems to be available only in Scala. Is there any plan to make it available for Java too? We tried to make all the suggested interfaces (other than CatalystScan which exposes internals and is only for experimentation) usable from Java. Is there something in particular you are having trouble with? 3. Are only filters and projections pushed down to the data source and all the data pulled into Spark for other processing? For now, this is all that is provided by the public stable API. We left a hook for more powerful push downs (sqlContext.experimental.extraStrategies), and would be interested in feedback on other operations we should push down as we expand the API.
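For reference, the interfaces in question live in org.apache.spark.sql.sources and are ordinary classes and traits, so they can be implemented from Java as well as Scala. A minimal Scala sketch of a read-only source (the class names and in-memory data are made up for illustration; only a full table scan is shown, while PrunedScan/PrunedFilteredScan add column pruning and filter push-down):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Looked up when the source is referenced via the USING clause / load() API.
class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation =
    new ExampleRelation(sqlContext)
}

class ExampleRelation(val sqlContext: SQLContext) extends BaseRelation with TableScan {
  override def schema: StructType =
    StructType(Seq(StructField("id", IntegerType), StructField("name", StringType)))

  // A full scan; a real implementation would read from the external system here.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(Row(1, "a"), Row(2, "b")))
}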
Re: Measure Bytes READ and Peak Memory Usage for Query
Yeah thanks, I can now see the memory usage. Please also verify whether bytes read == the combined size of all RDDs? Actually, all my RDDs are completely cached in memory, so the combined size of my RDDs = Mem used (verified from the Web UI). On Fri, Mar 20, 2015 at 12:07 PM, Akhil Das ak...@sigmoidanalytics.com wrote: You could do a cache and see the memory usage under the Storage tab in the driver UI (runs on port 4040) Thanks Best Regards On Fri, Mar 20, 2015 at 12:02 PM, anu anamika.guo...@gmail.com wrote: Hi All I would like to measure Bytes Read and Peak Memory Usage for a Spark SQL Query. Please clarify if Bytes Read = aggregate size of all RDDs ?? All my RDDs are in memory and 0B spill to disk. And I am clueless how to measure Peak Memory Usage.
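The cached sizes can also be read programmatically on the driver, which makes it easier to log them per query. A small sketch, assuming sc is the active SparkContext; note this reports the storage footprint of cached RDDs, which is not the same thing as the per-stage input "bytes read" metric shown in the UI:

val storage = sc.getRDDStorageInfo
storage.foreach(info =>
  println(info.name + ": " + info.memSize + " bytes in memory, " + info.diskSize + " bytes on disk"))
val totalCachedInMemory = storage.map(_.memSize).sum
println("Total cached in memory: " + totalCachedInMemory + " bytes")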
Re: issue while creating spark context
Its an IOException, just make sure you are having the correct permission over */user/spark* directory. Thanks Best Regards On Tue, Mar 24, 2015 at 5:21 PM, sachin Singh sachin.sha...@gmail.com wrote: hi all, all of a sudden I am getting the below error when I submit a spark job using master as yarn; it is not able to create the spark context (it was previously working fine). I am using CDH 5.3.1 and creating a javaHiveContext. spark-submit --jars ./analiticlibs/mysql-connector-java-5.1.17.jar,./analiticlibs/log4j-1.2.17.jar --master yarn --class myproject.com.java.jobs.Aggregationtask sparkjob-1.0.jar error message- java.io.IOException: Error in creating log directory: file:/user/spark/applicationHistory/application_1427194309307_0005
How to deploy binary dependencies to workers?
Hi, I am doing ML using Spark mllib. However, I do not have full control of the cluster; I am using Microsoft Azure HDInsight. I want to deploy the BLAS or whatever required dependencies to accelerate the computation. But I don't know how to deploy those DLLs when I submit my JAR to the cluster. I know how to pack those DLLs into a jar. The real challenge is how to let the system find them... Thanks, David
Re: Spark as a service
I don't think there's a general approach to that - the use cases are just too different. If you really need it, you will probably have to implement it yourself in the driver of your application. PS: Make sure to use the reply to all button so that the mailing list is included in your reply. Otherwise only I will get your mail. Regards, Jeff 2015-03-24 12:01 GMT+01:00 Ashish Mukherjee ashish.mukher...@gmail.com: Hi Jeffrey, Thanks. Yes, this resolves the SQL problem. My bad - I was looking for something which would work for Spark Streaming and other Spark jobs too, not just SQL. Regards, Ashish On Tue, Mar 24, 2015 at 4:07 PM, Jeffrey Jedele jeffrey.jed...@gmail.com wrote: Hi Ashish, this might be what you're looking for: https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server Regards, Jeff 2015-03-24 11:28 GMT+01:00 Ashish Mukherjee ashish.mukher...@gmail.com: Hello, As of now, if I have to execute a Spark job, I need to create a jar and deploy it. If I need to run a dynamically formed SQL from a Web application, is there any way of using SparkSQL in this manner? Perhaps, through a Web Service or something similar. Regards, Ashish
Re: Spark as a service
Perhaps this project, https://github.com/calrissian/spark-jetty-server, could help with your requirements. On Tue, Mar 24, 2015 at 7:12 AM, Jeffrey Jedele jeffrey.jed...@gmail.com wrote: I don't think there's a general approach to that - the use cases are just too different. If you really need it, you will probably have to implement it yourself in the driver of your application. PS: Make sure to use the reply to all button so that the mailing list is included in your reply. Otherwise only I will get your mail. Regards, Jeff 2015-03-24 12:01 GMT+01:00 Ashish Mukherjee ashish.mukher...@gmail.com: Hi Jeffrey, Thanks. Yes, this resolves the SQL problem. My bad - I was looking for something which would work for Spark Streaming and other Spark jobs too, not just SQL. Regards, Ashish On Tue, Mar 24, 2015 at 4:07 PM, Jeffrey Jedele jeffrey.jed...@gmail.com wrote: Hi Ashish, this might be what you're looking for: https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server Regards, Jeff 2015-03-24 11:28 GMT+01:00 Ashish Mukherjee ashish.mukher...@gmail.com: Hello, As of now, if I have to execute a Spark job, I need to create a jar and deploy it. If I need to run a dynamically formed SQL from a Web application, is there any way of using SparkSQL in this manner? Perhaps, through a Web Service or something similar. Regards, Ashish
1.3 Hadoop File System problem
I have code that works under 1.2.1 but when I upgraded to 1.3.0 it fails to find the s3 hadoop file system. I get java.lang.IllegalArgumentException: Wrong FS: s3://[path to my file], expected: file:/// when I try to save a parquet file. This worked in 1.2.1. Has anyone else seen this? I'm running spark using local[8] so it's all internal. These are actually unit tests in our app that are failing now. Thanks. Jim
Re: Invalid ContainerId ... Caused by: java.lang.NumberFormatException: For input string: e04
Thanks Marcelo - I was using the SBT built spark per earlier thread. I switched now to the distro (with the conf changes for CDH path in front) and guava issue is gone. Thanks, On Tue, Mar 24, 2015 at 1:50 PM, Marcelo Vanzin van...@cloudera.com wrote: Hi there, On Tue, Mar 24, 2015 at 1:40 PM, Manoj Samel manojsamelt...@gmail.com wrote: When I run any query, it gives java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode; Are you running a custom-compiled Spark by any chance? Specifically, one you built with sbt? That would hit this problem, because the path I suggested (/usr/lib/hadoop/client/*) contains an older guava library, which would override the one shipped with the sbt-built Spark. If you build Spark with maven, or use the pre-built Spark distro, or specifically filter out the guava jar from your classpath when setting up the Spark job, things should work. -- Marcelo
Re: diffrence in PCA of MLib vs H2o in R
Reza, That SVD.V matches the H2o and R prcomp (non-centered) output. Thanks -R On Tue, Mar 24, 2015 at 11:38 AM, Sean Owen so...@cloudera.com wrote: (Oh sorry, I've only been thinking of TallSkinnySVD) On Tue, Mar 24, 2015 at 6:36 PM, Reza Zadeh r...@databricks.com wrote: If you want to do a nonstandard (or uncentered) PCA, you can call computeSVD on RowMatrix, and look at the resulting 'V' Matrix. That should match the output of the other two systems. Reza On Tue, Mar 24, 2015 at 3:53 AM, Sean Owen so...@cloudera.com wrote: Those implementations are computing an SVD of the input matrix directly, and while you generally need the columns to have mean 0, you can turn that off with the options you cite. I don't think this is possible in the MLlib implementation, since it is computing the principal components by computing eigenvectors of the covariance matrix. The means inherently don't matter either way in this computation. On Tue, Mar 24, 2015 at 6:13 AM, roni roni.epi...@gmail.com wrote: I am trying to compute PCA using computePrincipalComponents. I also computed PCA using h2o in R and R's prcomp. The answers I get from H2o and R's prcomp (non h2o) are the same when I set the options for H2o as standardized=FALSE and for R's prcomp as center=FALSE. How do I make sure that the settings for MLlib PCA are the same as I am using for H2o or prcomp. Thanks Roni
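A minimal sketch of the uncentered variant Reza describes, assuming the data is already an RDD[Vector] called rows and that k components are wanted (both names are placeholders):

import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat = new RowMatrix(rows)                   // rows: RDD[org.apache.spark.mllib.linalg.Vector]
val k = 10
val svd = mat.computeSVD(k, computeU = false)   // SVD of the raw, uncentered matrix
val directions = svd.V                          // columns of V: uncentered principal directions
// For comparison, computePrincipalComponents(k) works from the covariance matrix,
// which implicitly centers the columns.
val pcs = mat.computePrincipalComponents(k)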
Re: java.lang.OutOfMemoryError: unable to create new native thread
My memory is hazy on this but aren't there hidden limitations to Linux-based threads? I ran into some issues a couple of years ago where, and here is the fuzzy part, the kernel wants to reserve virtual memory per thread equal to the stack size. When the total amount of reserved memory (not necessarily resident memory) exceeds the memory of the system it throws an OOM. I'm looking for material to back this up. Sorry for the initial vague response. Matthew On Tue, Mar 24, 2015 at 12:53 PM, Thomas Gerber thomas.ger...@radius.com wrote: Additional notes: I did not find anything wrong with the number of threads (ps -u USER -L | wc -l): around 780 on the master and 400 on executors. I am running on 100 r3.2xlarge. On Tue, Mar 24, 2015 at 12:38 PM, Thomas Gerber thomas.ger...@radius.com wrote: Hello, I am seeing various crashes in spark on large jobs which all share a similar exception: java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:714) I increased nproc (i.e. ulimit -u) 10 fold, but it doesn't help. Does anyone know how to avoid those kinds of errors? Noteworthy: I added -XX:ThreadStackSize=10m on both driver and executor extra java options, which might have amplified the problem. Thanks for you help, Thomas
Re: SparkSQL UDTs with Ordering
Awesome. yep - I have seen the warnings on UDTs, happy to keep up with the API changes :). Would this be a reasonable PR to toss up despite the API unstableness or would you prefer it to wait? Thanks -Pat On Tue, Mar 24, 2015 at 7:44 PM, Michael Armbrust mich...@databricks.com wrote: I'll caution that the UDTs are not a stable public interface yet. We'd like to do this someday, but currently this feature is mostly for MLlib as we have not finalized the API. Having an ordering could be useful, but I'll add that currently UDTs actually exist in serialized form so the ordering would have to be on the internal form, not the user visible form. On Tue, Mar 24, 2015 at 12:25 PM, Patrick Woody patrick.woo...@gmail.com wrote: Hey all, Currently looking into UDTs and I was wondering if it is reasonable to add the ability to define an Ordering (or if this is possible, then how)? Currently it will throw an error when non-Native types are used. Thanks! -Pat
Re: diffrence in PCA of MLib vs H2o in R
Great! On Tue, Mar 24, 2015 at 2:53 PM, roni roni.epi...@gmail.com wrote: Reza, That SVD.V matches the H2o and R prcomp (non-centered) output. Thanks -R On Tue, Mar 24, 2015 at 11:38 AM, Sean Owen so...@cloudera.com wrote: (Oh sorry, I've only been thinking of TallSkinnySVD) On Tue, Mar 24, 2015 at 6:36 PM, Reza Zadeh r...@databricks.com wrote: If you want to do a nonstandard (or uncentered) PCA, you can call computeSVD on RowMatrix, and look at the resulting 'V' Matrix. That should match the output of the other two systems. Reza On Tue, Mar 24, 2015 at 3:53 AM, Sean Owen so...@cloudera.com wrote: Those implementations are computing an SVD of the input matrix directly, and while you generally need the columns to have mean 0, you can turn that off with the options you cite. I don't think this is possible in the MLlib implementation, since it is computing the principal components by computing eigenvectors of the covariance matrix. The means inherently don't matter either way in this computation. On Tue, Mar 24, 2015 at 6:13 AM, roni roni.epi...@gmail.com wrote: I am trying to compute PCA using computePrincipalComponents. I also computed PCA using h2o in R and R's prcomp. The answers I get from H2o and R's prcomp (non h2o) are the same when I set the options for H2o as standardized=FALSE and for R's prcomp as center=FALSE. How do I make sure that the settings for MLlib PCA are the same as I am using for H2o or prcomp. Thanks Roni
Re: iPython Notebook + Spark + Accumulo -- best practice?
hi all, got a vagrant image with spark notebook, spark, accumulo, and hadoop all running. from the notebook I can manually create a scanner and pull test data from a table I created using one of the accumulo examples: val instanceNameS = "accumulo" val zooServersS = "localhost:2181" val instance: Instance = new ZooKeeperInstance(instanceNameS, zooServersS) val connector: Connector = instance.getConnector("root", new PasswordToken("password")) val auths = new Authorizations("exampleVis") val scanner = connector.createScanner("batchtest1", auths) scanner.setRange(new Range("row_00", "row_10")) for (entry: Entry[Key, Value] <- scanner) { println(entry.getKey + " is " + entry.getValue) } will give the first ten rows of table data. when I try to create the RDD thusly: val rdd2 = sparkContext.newAPIHadoopRDD( new Configuration(), classOf[org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat], classOf[org.apache.accumulo.core.data.Key], classOf[org.apache.accumulo.core.data.Value] ) I get an RDD returned to me that I can't do much with due to the following error: java.io.IOException: Input info has not been set. at org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.validateOptions(InputConfigurator.java:630) at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.validateOptions(AbstractInputFormat.java:343) at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.getSplits(AbstractInputFormat.java:538) at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:98) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:220) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1367) at org.apache.spark.rdd.RDD.count(RDD.scala:927) which totally makes sense in light of the fact that I haven't specified any parameters as to which table to connect with, what the auths are, etc. so my question is: what do I need to do from here to get those first ten rows of table data into my RDD? DAVID HOLIDAY Software Engineer 760 607 3300 | Office 312 758 8385 | Mobile dav...@annaisystems.com www.AnnaiSystems.com On Mar 19, 2015, at 11:25 AM, David Holiday dav...@annaisystems.com wrote: kk - I'll put something together and get back to you with more :-) DAVID HOLIDAY Software Engineer 760 607 3300 | Office 312 758 8385 | Mobile dav...@annaisystems.com www.AnnaiSystems.com On Mar 19, 2015, at 10:59 AM, Irfan Ahmad ir...@cloudphysics.com wrote: Once you setup spark-notebook, it'll handle the submits for interactive work. Non-interactive is not handled by it. For that spark-kernel could be used. Give it a shot ... it only takes 5 minutes to get it running in local mode. Irfan Ahmad CTO | Co-Founder | CloudPhysics http://www.cloudphysics.com/ Best of VMworld Finalist Best Cloud Management Award NetworkWorld 10 Startups to Watch EMA Most Notable Vendor On Thu, Mar 19, 2015 at 9:51 AM, David Holiday dav...@annaisystems.com wrote: hi all - thx for the alacritous replies! so regarding how to get things from notebook to spark and back, am I correct that spark-submit is the way to go?
DAVID HOLIDAY Software Engineer 760 607 3300 | Office 312 758 8385 | Mobile dav...@annaisystems.com www.AnnaiSystems.com On Mar 19, 2015, at 1:14 AM, Paolo Platter paolo.plat...@agilelab.it wrote: Yes, I would suggest spark-notebook too. It's very simple to set up and it's growing pretty fast. Paolo Sent from my Windows Phone From: Irfan Ahmad ir...@cloudphysics.com Sent: 19/03/2015 04:05 To: davidh dav...@annaisystems.com Cc: user@spark.apache.org Subject: Re: iPython Notebook + Spark + Accumulo -- best practice? I forgot to mention that there is also Zeppelin and jove-notebook but I haven't got any experience with those yet. Irfan Ahmad CTO | Co-Founder | CloudPhysics http://www.cloudphysics.com/ Best of VMworld Finalist Best Cloud Management Award NetworkWorld 10 Startups to Watch EMA Most Notable Vendor On Wed, Mar 18, 2015 at 8:01 PM, Irfan Ahmad ir...@cloudphysics.com wrote: Hi David, W00t indeed and great questions. On the notebook front, there are two options depending on what you are looking for. You can either go with iPython 3 with Spark-kernel as a backend or you can use spark-notebook. Both have
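The "Input info has not been set" error is thrown because AccumuloInputFormat reads its connection and table settings from the Hadoop configuration, and none were set before calling newAPIHadoopRDD. A rough sketch of the missing configuration, reusing the instance/table names from the scanner example above and assuming Accumulo 1.6-style static setters (treat the exact method names as an assumption to verify against the Accumulo version in use):

import org.apache.accumulo.core.client.ClientConfiguration
import org.apache.accumulo.core.client.mapreduce.{AbstractInputFormat, AccumuloInputFormat, InputFormatBase}
import org.apache.accumulo.core.client.security.tokens.PasswordToken
import org.apache.accumulo.core.data.{Key, Value}
import org.apache.accumulo.core.security.Authorizations
import org.apache.hadoop.mapreduce.Job

val job = Job.getInstance()  // wraps the Configuration handed to newAPIHadoopRDD
// The setters live on the input format's parent classes; calling them there
// sidesteps Scala's trouble with inherited Java statics.
AbstractInputFormat.setZooKeeperInstance(job,
  new ClientConfiguration().withInstance(instanceNameS).withZkHosts(zooServersS))
AbstractInputFormat.setConnectorInfo(job, "root", new PasswordToken("password"))
AbstractInputFormat.setScanAuthorizations(job, auths)
InputFormatBase.setInputTableName(job, "batchtest1")

val rdd2 = sparkContext.newAPIHadoopRDD(
  job.getConfiguration,
  classOf[AccumuloInputFormat],
  classOf[Key],
  classOf[Value])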
Re: Spark as a service
Also look at the spark-kernel and spark job server projects. Irfan On Mar 24, 2015 5:03 AM, Todd Nist tsind...@gmail.com wrote: Perhaps this project, https://github.com/calrissian/spark-jetty-server, could help with your requirements. On Tue, Mar 24, 2015 at 7:12 AM, Jeffrey Jedele jeffrey.jed...@gmail.com wrote: I don't think there's a general approach to that - the use cases are just too different. If you really need it, you will probably have to implement it yourself in the driver of your application. PS: Make sure to use the reply to all button so that the mailing list is included in your reply. Otherwise only I will get your mail. Regards, Jeff 2015-03-24 12:01 GMT+01:00 Ashish Mukherjee ashish.mukher...@gmail.com: Hi Jeffrey, Thanks. Yes, this resolves the SQL problem. My bad - I was looking for something which would work for Spark Streaming and other Spark jobs too, not just SQL. Regards, Ashish On Tue, Mar 24, 2015 at 4:07 PM, Jeffrey Jedele jeffrey.jed...@gmail.com wrote: Hi Ashish, this might be what you're looking for: https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server Regards, Jeff 2015-03-24 11:28 GMT+01:00 Ashish Mukherjee ashish.mukher...@gmail.com: Hello, As of now, if I have to execute a Spark job, I need to create a jar and deploy it. If I need to run a dynamically formed SQL from a Web application, is there any way of using SparkSQL in this manner? Perhaps, through a Web Service or something similar. Regards, Ashish
Re: How to avoid being killed by YARN node manager ?
Hi Yuichiro, The way to avoid this is to boost spark.yarn.executor.memoryOverhead until the executors have enough off-heap memory to avoid going over their limits. -Sandy On Tue, Mar 24, 2015 at 11:49 AM, Yuichiro Sakamoto ks...@muc.biglobe.ne.jp wrote: Hello. We use ALS (collaborative filtering) from Spark MLlib on YARN. The Spark version is 1.2.0, included in CDH 5.3.1. 1,000,000,000 records (5,000,000 users' data and 5,000,000 items' data) are used for machine learning with ALS. These large quantities of data increase virtual memory usage, and the YARN node manager kills the Spark worker process. Even though Spark runs again after the process is killed, the Spark worker process is killed again; as a result, the whole Spark application is terminated. # When the Spark worker process is killed, it seems that the virtual memory usage, increased by # shuffle or disk writing, goes over the YARN threshold. To avoid this, we set 'yarn.nodemanager.vmem-check-enabled' to false, and the job then exits successfully, but that does not seem to be an appropriate way. If you know, please let me know about tuning methods for Spark. The conditions of the machines and the Spark settings are as follows. 1) six machines, physical memory is 32GB on each machine. 2) Spark settings - spark.executor.memory=16g - spark.closure.serializer=org.apache.spark.serializer.KryoSerializer - spark.rdd.compress=true - spark.shuffle.memoryFraction=0.4 Thanks, Yuichiro Sakamoto
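For reference, spark.yarn.executor.memoryOverhead (a value in MB) is the off-heap headroom YARN adds on top of spark.executor.memory when sizing each executor container. A sketch of how it might be raised for this job, either in spark-defaults.conf or via --conf on the command line; the 2048 MB figure is only an illustrative starting point, not a value from the thread:

spark.executor.memory 16g
spark.yarn.executor.memoryOverhead 2048

With these settings each executor container requests roughly 18 GB, so yarn.nodemanager.resource.memory-mb and yarn.scheduler.maximum-allocation-mb need to be at least that large on the 32 GB nodes.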
Re: spark disk-to-disk
imran, great, i will take a look at the pullreq. seems we are interested in similar things On Tue, Mar 24, 2015 at 11:00 AM, Imran Rashid iras...@cloudera.com wrote: I think writing to hdfs and reading it back again is totally reasonable. In fact, in my experience, writing to hdfs and reading back in actually gives you a good opportunity to handle some other issues as well: a) instead of just writing as an object file, I've found its helpful to write in a format that is a little more readable. Json if efficiency doesn't matter :) or you could use something like avro, which at least has a good set of command line tools. b) when developing, I hate it when I introduce a bug in step 12 of a long pipeline, and need to re-run the whole thing. If you save to disk, you can write a little application logic that realizes step 11 is already sitting on disk, and just restart from there. c) writing to disk is also a good opportunity to do a little crude auto-tuning of the number of partitions. You can look at the size of each partition on hdfs, and then adjust the number of partitions. And I completely agree that losing the partitioning info is a major limitation -- I submitted a PR to help deal w/ it: https://github.com/apache/spark/pull/4449 getting narrow dependencies w/ partitioners can lead to pretty big performance improvements, so I do think its important to make it easily accessible to the user. Though now I'm thinking that maybe this api is a little clunky, and this should get rolled into the other changes you are proposing to hadoop RDD friends -- but I'll go into more discussion on that thread. On Mon, Mar 23, 2015 at 12:55 PM, Koert Kuipers ko...@tresata.com wrote: there is a way to reinstate the partitioner, but that requires sc.objectFile to read exactly what i wrote, which means sc.objectFile should never split files on reading (a feature of hadoop file inputformat that gets in the way here). On Mon, Mar 23, 2015 at 1:39 PM, Koert Kuipers ko...@tresata.com wrote: i just realized the major limitation is that i lose partitioning info... On Mon, Mar 23, 2015 at 1:34 AM, Reynold Xin r...@databricks.com wrote: On Sun, Mar 22, 2015 at 6:03 PM, Koert Kuipers ko...@tresata.com wrote: so finally i can resort to: rdd.saveAsObjectFile(...) sc.objectFile(...) but that seems like a rather broken abstraction. This seems like a fine solution to me.
Re: How to deploy binary dependencies to workers?
Both spark-submit and spark-shell have a --jars option for passing additional jars to the cluster. They will be added to the appropriate classpaths. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Tue, Mar 24, 2015 at 4:13 AM, Xi Shen davidshe...@gmail.com wrote: Hi, I am doing ML using Spark mllib. However, I do not have full control to the cluster. I am using Microsoft Azure HDInsight I want to deploy the BLAS or whatever required dependencies to accelerate the computation. But I don't know how to deploy those DLLs when I submit my JAR to the cluster. I know how to pack those DLLs into a jar. The real challenge is how to let the system find them... Thanks, David
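--jars covers JVM dependencies; native libraries (.dll/.so) additionally need to be visible on the executor JVM's java.library.path. One possible approach, sketched with placeholder file and class names: ship the native files with --files so YARN localizes them into each executor's working directory, and point the library path at that directory:

./bin/spark-submit --class com.example.MyJob --master yarn-client \
  --files netlib-native_system-win-x86_64.dll \
  --conf "spark.executor.extraJavaOptions=-Djava.library.path=." \
  myjob.jar

Whether the BLAS layer actually picks the library up also depends on how it loads natives (netlib-java, for example, can extract its own bundled natives from a jar), so treat this as a starting point rather than a guaranteed recipe.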
How to avoid being killed by YARN node manager ?
Hello. We use ALS (collaborative filtering) from Spark MLlib on YARN. The Spark version is 1.2.0, included in CDH 5.3.1. 1,000,000,000 records (5,000,000 users' data and 5,000,000 items' data) are used for machine learning with ALS. These large quantities of data increase virtual memory usage, and the YARN node manager kills the Spark worker process. Even though Spark runs again after the process is killed, the Spark worker process is killed again; as a result, the whole Spark application is terminated. # When the Spark worker process is killed, it seems that the virtual memory usage, increased by # shuffle or disk writing, goes over the YARN threshold. To avoid this, we set 'yarn.nodemanager.vmem-check-enabled' to false, and the job then exits successfully, but that does not seem to be an appropriate way. If you know, please let me know about tuning methods for Spark. The conditions of the machines and the Spark settings are as follows. 1) six machines, physical memory is 32GB on each machine. 2) Spark settings - spark.executor.memory=16g - spark.closure.serializer=org.apache.spark.serializer.KryoSerializer - spark.rdd.compress=true - spark.shuffle.memoryFraction=0.4 Thanks, Yuichiro Sakamoto
Re: Why doesn't the --conf parameter work in yarn-cluster mode (but works in yarn-client and local)?
Hello Sandy, Thank you for your explanation. Then I would at least expect that to be consistent across local, yarn-client, and yarn-cluster modes. (And not lead to the case where it somehow works in two of them, and not for the third). Kind regards, Emre Sevinç http://www.bigindustries.be/ On Tue, Mar 24, 2015 at 4:38 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Ah, yes, I believe this is because only properties prefixed with spark get passed on. The purpose of the --conf option is to allow passing Spark properties to the SparkConf, not to add general key-value pairs to the JVM system properties. -Sandy On Tue, Mar 24, 2015 at 4:25 AM, Emre Sevinc emre.sev...@gmail.com wrote: Hello Sandy, Your suggestion does not work when I try it locally: When I pass --conf key=someValue and then try to retrieve it like: SparkConf sparkConf = new SparkConf(); logger.info(* * * key ~~~ {}, sparkConf.get(key)); I get Exception in thread main java.util.NoSuchElementException: key And I think that's expected because the key is an arbitrary one, not necessarily a Spark configuration element. This is why I was passing it via --conf and retrieving System.getProperty(key) (which worked locally and in yarn-client mode but not in yarn-cluster mode). I'm surprised why I can't use it on the cluster while I can use it while local development and testing. Kind regards, Emre Sevinç http://www.bigindustries.be/ On Mon, Mar 23, 2015 at 6:15 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Emre, The --conf property is meant to work with yarn-cluster mode. System.getProperty(key) isn't guaranteed, but new SparkConf().get(key) should. Does it not? -Sandy On Mon, Mar 23, 2015 at 8:39 AM, Emre Sevinc emre.sev...@gmail.com wrote: Hello, According to Spark Documentation at https://spark.apache.org/docs/1.2.1/submitting-applications.html : --conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces wrap “key=value” in quotes (as shown). And indeed, when I use that parameter, in my Spark program I can retrieve the value of the key by using: System.getProperty(key); This works when I test my program locally, and also in yarn-client mode, I can log the value of the key and see that it matches what I wrote in the command line, but it returns *null* when I submit the very same program in *yarn-cluster* mode. Why can't I retrieve the value of key given as --conf key=value when I submit my Spark application in *yarn-cluster* mode? Any ideas and/or workarounds? -- Emre Sevinç http://www.bigindustries.be/ -- Emre Sevinc -- Emre Sevinc
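A small sketch of the pattern Sandy describes: use a spark.-prefixed key so the value is shipped to the driver/AM in every deploy mode, and read it back from SparkConf rather than from JVM system properties (the key name spark.myapp.greeting is a made-up example):

spark-submit --master yarn-cluster --conf spark.myapp.greeting=hello ...

and inside the application:

val conf = new SparkConf()
val greeting = conf.get("spark.myapp.greeting", "default")   // present in local, yarn-client and yarn-cluster modes

Arbitrary non-spark.* keys are only visible as system properties where the launcher JVM happens to be the driver JVM (local and yarn-client), which is presumably why System.getProperty appeared to work in those two modes but not in yarn-cluster.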
Re: spark disk-to-disk
I think writing to hdfs and reading it back again is totally reasonable. In fact, in my experience, writing to hdfs and reading back in actually gives you a good opportunity to handle some other issues as well: a) instead of just writing as an object file, I've found its helpful to write in a format that is a little more readable. Json if efficiency doesn't matter :) or you could use something like avro, which at least has a good set of command line tools. b) when developing, I hate it when I introduce a bug in step 12 of a long pipeline, and need to re-run the whole thing. If you save to disk, you can write a little application logic that realizes step 11 is already sitting on disk, and just restart from there. c) writing to disk is also a good opportunity to do a little crude auto-tuning of the number of partitions. You can look at the size of each partition on hdfs, and then adjust the number of partitions. And I completely agree that losing the partitioning info is a major limitation -- I submitted a PR to help deal w/ it: https://github.com/apache/spark/pull/4449 getting narrow dependencies w/ partitioners can lead to pretty big performance improvements, so I do think its important to make it easily accessible to the user. Though now I'm thinking that maybe this api is a little clunky, and this should get rolled into the other changes you are proposing to hadoop RDD friends -- but I'll go into more discussion on that thread. On Mon, Mar 23, 2015 at 12:55 PM, Koert Kuipers ko...@tresata.com wrote: there is a way to reinstate the partitioner, but that requires sc.objectFile to read exactly what i wrote, which means sc.objectFile should never split files on reading (a feature of hadoop file inputformat that gets in the way here). On Mon, Mar 23, 2015 at 1:39 PM, Koert Kuipers ko...@tresata.com wrote: i just realized the major limitation is that i lose partitioning info... On Mon, Mar 23, 2015 at 1:34 AM, Reynold Xin r...@databricks.com wrote: On Sun, Mar 22, 2015 at 6:03 PM, Koert Kuipers ko...@tresata.com wrote: so finally i can resort to: rdd.saveAsObjectFile(...) sc.objectFile(...) but that seems like a rather broken abstraction. This seems like a fine solution to me.
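For completeness, the round trip being discussed looks roughly like this; the path and element type are placeholders, and note that sc.objectFile may re-split the files on read, which is why the original RDD's partitioner is lost:

rdd.saveAsObjectFile("hdfs:///tmp/stage11-output")
val restored = sc.objectFile[(String, Int)]("hdfs:///tmp/stage11-output", rdd.partitions.size)
// Any partitioner the original rdd had has to be re-applied, e.g. restored.partitionBy(new HashPartitioner(n))

saveAsObjectFile writes Java-serialized SequenceFiles, so as noted above a more readable/portable format (JSON, Avro) is often a better choice for intermediate results.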
Re: Spark streaming alerting
Helena, The CassandraInputDStream sounds interesting. I don't find much in the JIRA though. Do you have more details on what it tries to achieve? Thanks, Anwar. On Tue, Mar 24, 2015 at 2:39 PM, Helena Edelson helena.edel...@datastax.com wrote: Streaming _from_ Cassandra, CassandraInputDStream, is coming BTW: https://issues.apache.org/jira/browse/SPARK-6283 I am working on it now. Helena @helenaedelson On Mar 23, 2015, at 5:22 AM, Khanderao Kand Gmail khanderao.k...@gmail.com wrote: Akhil, you are right in your answer to what Mohit wrote. However, what Mohit seems to be alluding to but did not write clearly might be different. Mohit, you are wrong in saying that streaming generally works on HDFS and Cassandra. Streaming typically works with a streaming or queueing source like Kafka, Kinesis, Twitter, Flume, ZeroMQ, etc. (but can also read from HDFS and S3). However, the streaming context (the receiver within the streaming context) gets events/messages/records and forms a time-window-based batch (RDD), so there is a maximum gap of one window length between when the alert message was available to Spark and when the processing happens. I think this is what you meant. As per the Spark programming model, the RDD is the right way to deal with the data. If you are fine with a minimum delay of, say, a second (based on the minimum window that a DStream can support), then what Rohit gave is the right model. Khanderao On Mar 22, 2015, at 11:39 PM, Akhil Das ak...@sigmoidanalytics.com wrote: What do you mean you can't send it directly from spark workers? Here's a simple approach which you could do: val data = ssc.textFileStream("sigmoid/") val dist = data.filter(_.contains("ERROR")).foreachRDD(rdd => alert("Errors : " + rdd.count())) And the alert() function could be anything triggering an email or sending an SMS alert. Thanks Best Regards On Sun, Mar 22, 2015 at 1:52 AM, Mohit Anchlia mohitanch...@gmail.com wrote: Is there a module in spark streaming that lets you listen to the alerts/conditions as they happen in the streaming module? Generally spark streaming components will execute on large sets of clusters like HDFS or Cassandra, however when it comes to alerting you generally can't send it directly from the spark workers, which means you need a way to listen to the alerts.
Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class
I found the problem. In mapped-site.xml, mapreduce.application.classpath has references to “${hdp.version}” which is not getting replaced when launch_container.sh is created. The executor fails with a substitution error at line 27 in launch_container.sh because bash can’t deal with “${hdp.version}. I have hdp.version defined in my spark-defaults.conf via spark.{driver,yarn.am}.extraJavaOptions -Dhdp.version=2.2.0-2041, so something is not doing the substitution. To work around this problem, I replaced ${hdp.version}” with “current” in mapred-site.xml. I found a similar bug, https://issues.apache.org/jira/browse/AMBARI-8028, and the fix was exactly what I did to work around it. Not sure if this is an AMBARI bug (not doing variable substitution when writing mapred-site.xml) or YARN bug (its not doing the variable substitution when writing launch_container.sh) Anybody have an opinion ? Doug On Mar 19, 2015, at 5:51 PM, Doug Balog doug.sparku...@dugos.com wrote: I’m seeing the same problem. I’ve set logging to DEBUG, and I think some hints are in the “Yarn AM launch context” that is printed out before Yarn runs java. My next step is to talk to the admins and get them to set yarn.nodemanager.delete.debug-delay-sec in the config, as recommended in http://spark.apache.org/docs/latest/running-on-yarn.html Then I can see exactly whats in the directory. Doug ps Sorry for the dup message Bharath and Todd, used wrong email address. On Mar 19, 2015, at 1:19 AM, Bharath Ravi Kumar reachb...@gmail.com wrote: Thanks for clarifying Todd. This may then be an issue specific to the HDP version we're using. Will continue to debug and post back if there's any resolution. On Thu, Mar 19, 2015 at 3:40 AM, Todd Nist tsind...@gmail.com wrote: Yes I believe you are correct. For the build you may need to specify the specific HDP version of hadoop to use with the -Dhadoop.version=. I went with the default 2.6.0, but Horton may have a vendor specific version that needs to go here. I know I saw a similar post today where the solution was to use -Dhadoop.version=2.5.0-cdh5.3.2 but that was for a cloudera installation. I am not sure what the HDP version would be to put here. -Todd On Wed, Mar 18, 2015 at 12:49 AM, Bharath Ravi Kumar reachb...@gmail.com wrote: Hi Todd, Yes, those entries were present in the conf under the same SPARK_HOME that was used to run spark-submit. On a related note, I'm assuming that the additional spark yarn options (like spark.yarn.jar) need to be set in the same properties file that is passed to spark-submit. That apart, I assume that no other host on the cluster should require a deployment of the spark distribution or any other config change to support a spark job. Isn't that correct? On Tue, Mar 17, 2015 at 6:19 PM, Todd Nist tsind...@gmail.com wrote: Hi Bharath, Do you have these entries in your $SPARK_HOME/conf/spark-defaults.conf file? spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0-2041 spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041 On Tue, Mar 17, 2015 at 1:04 AM, Bharath Ravi Kumar reachb...@gmail.com wrote: Still no luck running purpose-built 1.3 against HDP 2.2 after following all the instructions. Anyone else faced this issue? On Mon, Mar 16, 2015 at 8:53 PM, Bharath Ravi Kumar reachb...@gmail.com wrote: Hi Todd, Thanks for the help. I'll try again after building a distribution with the 1.3 sources. 
However, I wanted to confirm what I mentioned earlier: is it sufficient to copy the distribution only to the client host from where spark-submit is invoked (with spark.yarn.jar set), or is there a need to ensure that the entire distribution is pre-deployed on every host in the yarn cluster? I'd assume that the latter shouldn't be necessary. On Mon, Mar 16, 2015 at 8:38 PM, Todd Nist tsind...@gmail.com wrote: Hi Bharath, I ran into the same issue a few days ago, here is a link to a post on the Hortonworks forum. http://hortonworks.com/community/forums/search/spark+1.2.1/ In case anyone else needs to perform this, these are the steps I took to get it to work with Spark 1.2.1 as well as Spark 1.3.0-RC3: 1. Pull 1.2.1 Source 2. Apply the following patches a. Address jackson version, https://github.com/apache/spark/pull/3938 b. Address the propagation of the hdp.version set in the spark-defaults.conf, https://github.com/apache/spark/pull/3409 3. Build with $SPARK_HOME/make-distribution.sh --name hadoop2.6 --tgz -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -DskipTests package Then deploy the resulting artifact = spark-1.2.1-bin-hadoop2.6.tgz following the instructions in the HDP Spark preview http://hortonworks.com/hadoop-tutorial/using-apache-spark-hdp/ FWIW spark-1.3.0 appears to be working fine with HDP as well and steps 2a and 2b are not required. HTH -Todd
Re: Why doesn't the --conf parameter work in yarn-cluster mode (but works in yarn-client and local)?
Ah, yes, I believe this is because only properties prefixed with spark get passed on. The purpose of the --conf option is to allow passing Spark properties to the SparkConf, not to add general key-value pairs to the JVM system properties. -Sandy On Tue, Mar 24, 2015 at 4:25 AM, Emre Sevinc emre.sev...@gmail.com wrote: Hello Sandy, Your suggestion does not work when I try it locally: When I pass --conf key=someValue and then try to retrieve it like: SparkConf sparkConf = new SparkConf(); logger.info(* * * key ~~~ {}, sparkConf.get(key)); I get Exception in thread main java.util.NoSuchElementException: key And I think that's expected because the key is an arbitrary one, not necessarily a Spark configuration element. This is why I was passing it via --conf and retrieving System.getProperty(key) (which worked locally and in yarn-client mode but not in yarn-cluster mode). I'm surprised why I can't use it on the cluster while I can use it while local development and testing. Kind regards, Emre Sevinç http://www.bigindustries.be/ On Mon, Mar 23, 2015 at 6:15 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Emre, The --conf property is meant to work with yarn-cluster mode. System.getProperty(key) isn't guaranteed, but new SparkConf().get(key) should. Does it not? -Sandy On Mon, Mar 23, 2015 at 8:39 AM, Emre Sevinc emre.sev...@gmail.com wrote: Hello, According to Spark Documentation at https://spark.apache.org/docs/1.2.1/submitting-applications.html : --conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces wrap “key=value” in quotes (as shown). And indeed, when I use that parameter, in my Spark program I can retrieve the value of the key by using: System.getProperty(key); This works when I test my program locally, and also in yarn-client mode, I can log the value of the key and see that it matches what I wrote in the command line, but it returns *null* when I submit the very same program in *yarn-cluster* mode. Why can't I retrieve the value of key given as --conf key=value when I submit my Spark application in *yarn-cluster* mode? Any ideas and/or workarounds? -- Emre Sevinç http://www.bigindustries.be/ -- Emre Sevinc
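A small sketch of the workaround implied above (the property name is made up for illustration): prefix the key with "spark." so that spark-submit forwards it in yarn-cluster mode, then read it back through SparkConf rather than JVM system properties.

  // submit side (illustrative key name):
  //   spark-submit --master yarn-cluster --conf spark.myapp.configKey=someValue ... myapp.jar

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
  val value = conf.get("spark.myapp.configKey", "defaultValue")  // hypothetical key, read with a default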
Re: Spark streaming alerting
I created a jira ticket for my work in both the spark and spark-cassandra-connector JIRAs, I don't know why you cannot see them. Users can stream from any cassandra table, just as one can stream from a Kafka topic; same principle. Helena @helenaedelson On Mar 24, 2015, at 11:29 AM, Anwar Rizal anriza...@gmail.com wrote: Helena, The CassandraInputDStream sounds interesting. I don't find many things in the jira though. Do you have more details on what it tries to achieve? Thanks, Anwar. On Tue, Mar 24, 2015 at 2:39 PM, Helena Edelson helena.edel...@datastax.com wrote: Streaming _from_ cassandra, CassandraInputDStream, is coming BTW https://issues.apache.org/jira/browse/SPARK-6283 I am working on it now. Helena @helenaedelson On Mar 23, 2015, at 5:22 AM, Khanderao Kand Gmail khanderao.k...@gmail.com wrote: Akhil You are right in your answer to what Mohit wrote. However, what Mohit seems to be alluding to but did not write properly might be different. Mohit You are wrong in saying that streaming generally works with HDFS and Cassandra. Streaming typically works with a streaming or queueing source like Kafka, Kinesis, Twitter, Flume, ZeroMQ, etc. (but can also read from HDFS and S3). However, the streaming context (the receiver within the streaming context) gets events/messages/records and forms a time-window-based batch (RDD), so there is a maximum gap of one window interval between when an alert message becomes available to Spark and when the processing happens. I think this is what you meant. As per the Spark programming model, the RDD is the right way to deal with the data. If you are fine with a minimum delay of, say, a second (based on the minimum window that DStreams can support), then what Rohit gave is the right model. Khanderao On Mar 22, 2015, at 11:39 PM, Akhil Das ak...@sigmoidanalytics.com wrote: What do you mean you can't send it directly from spark workers? Here's a simple approach which you could do: val data = ssc.textFileStream("sigmoid/") val dist = data.filter(_.contains("ERROR")).foreachRDD(rdd => alert("Errors : " + rdd.count())) And the alert() function could be anything triggering an email or sending an SMS alert. Thanks Best Regards On Sun, Mar 22, 2015 at 1:52 AM, Mohit Anchlia mohitanch...@gmail.com wrote: Is there a module in spark streaming that lets you listen to the alerts/conditions as they happen in the streaming module? Generally spark streaming components will execute on a large set of clusters like hdfs or Cassandra, however when it comes to alerting you generally can't send it directly from the spark workers, which means you need a way to listen to the alerts.
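For reference, a cleaned-up, compilable version of the alerting sketch quoted above (directory, batch interval, and the alert body are placeholders; alert() is just a stub here):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  def alert(msg: String): Unit = println(msg)  // stand-in for an email/SMS hook

  val conf = new SparkConf().setAppName("AlertingSketch")
  val ssc = new StreamingContext(conf, Seconds(1))

  // watch a directory and raise an alert for each batch that contains ERROR lines
  ssc.textFileStream("sigmoid/")
     .filter(_.contains("ERROR"))
     .foreachRDD(rdd => alert("Errors: " + rdd.count()))

  ssc.start()
  ssc.awaitTermination()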
Does HiveContext connect to HiveServer2?
I am wondering if HiveContext connects to HiveServer2 or does it work through Hive CLI. The reason I am asking is because Cloudera has deprecated Hive CLI. If the connection is through HiveServer2, is there a way to specify user credentials? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-HiveContext-connect-to-HiveServer2-tp22200.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Invalid ContainerId ... Caused by: java.lang.NumberFormatException: For input string: e04
Steve, that's correct, but the problem only shows up when different versions of the YARN jars are included on the classpath. -Sandy On Tue, Mar 24, 2015 at 6:29 AM, Steve Loughran ste...@hortonworks.com wrote: On 24 Mar 2015, at 02:10, Marcelo Vanzin van...@cloudera.com wrote: This happens most probably because the Spark 1.3 you have downloaded is built against an older version of the Hadoop libraries than those used by CDH, and those libraries cannot parse the container IDs generated by CDH. This sounds suspiciously like the changes in YARN for HA (the epoch number) isn't being parsed by older versions of the YARN client libs. This is effectively a regression in the YARN code -its creating container IDs that can't be easily parsed by old apps. It may be possible to fix that spark-side by having its own parser for the YARN container/app environment variable - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
filter expression in API document for DataFrame
The following statement appears in the Scala API example at https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame people.filter("age" > 30). I tried this example and it gave a compilation error. I think this needs to be changed to people.filter(people("age") > 30) Also, it would be good to add some examples for the new equality operator for columns (e.g. (people("age") === 30) ). The programming guide does not have an example for this in the DataFrame Operations section and it was not very obvious that we need to be using a different equality operator for columns. thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/filter-expression-in-API-document-for-DataFrame-tp22213.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
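A small Scala sketch of both forms, assuming an existing DataFrame `people` with an integer `age` column:

  // column-expression form of the filter
  val adults = people.filter(people("age") > 30)

  // equality on a column uses ===, which builds a Column expression instead of a Boolean
  val thirty = people.filter(people("age") === 30)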
Graphx gets slower as the iteration number increases
I'm working with graphx to calculate the pageranks of an extremely large social network with a billion vertices. As the iteration number increases, each iteration becomes slower and eventually unacceptable. Is there any reason for this? How can I accelerate the iteration process? orangepri...@foxmail.com
Re: 1.3 Hadoop File System problem
Hey Jim, Thanks for reporting this. Can you give a small end-to-end code example that reproduces it? If so, we can definitely fix it. - Patrick On Tue, Mar 24, 2015 at 4:55 PM, Jim Carroll jimfcarr...@gmail.com wrote: I have code that works under 1.2.1 but when I upgraded to 1.3.0 it fails to find the s3 hadoop file system. I get the java.lang.IllegalArgumentException: Wrong FS: s3://path to my file], expected: file:/// when I try to save a parquet file. This worked in 1.2.1. Has anyone else seen this? I'm running spark using local[8] so it's all internal. These are actually unit tests in our app that are failing now. Thanks. Jim -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/1-3-Hadoop-File-System-problem-tp22207.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Graphx gets slower as the iteration number increases
This might be because partitions are getting dropped from memory and needing to be recomputed. How much memory is in the cluster, and how large are the partitions? This information should be in the Executors and Storage pages in the web UI. Ankur http://www.ankurdave.com/ On Tue, Mar 24, 2015 at 7:12 PM, orangepri...@foxmail.com orangepri...@foxmail.com wrote: I'm working with graphx to calculate the pageranks of an extremely large social network with a billion vertices. As the iteration number increases, each iteration becomes slower and eventually unacceptable. Is there any reason for this?
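If recomputation is the cause, one thing worth trying is to persist the graph at a storage level that can spill to disk and materialize it before iterating. A minimal sketch, assuming an existing Graph[VD, ED] called `graph` that has not already been persisted at a different level:

  import org.apache.spark.storage.StorageLevel

  val cached = graph.persist(StorageLevel.MEMORY_AND_DISK)
  cached.edges.count()   // force materialization before the iterations start
  val ranks = cached.pageRank(tol = 0.0001).vertices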
Re: 1.3 Hadoop File System problem
You are probably hitting SPARK-6351 https://issues.apache.org/jira/browse/SPARK-6351, which will be fixed in 1.3.1 (hopefully cutting an RC this week). On Tue, Mar 24, 2015 at 4:55 PM, Jim Carroll jimfcarr...@gmail.com wrote: I have code that works under 1.2.1 but when I upgraded to 1.3.0 it fails to find the s3 hadoop file system. I get the java.lang.IllegalArgumentException: Wrong FS: s3://path to my file], expected: file:/// when I try to save a parquet file. This worked in 1.2.1. Has anyone else seen this? I'm running spark using local[8] so it's all internal. These are actually unit tests in our app that are failing now. Thanks. Jim -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/1-3-Hadoop-File-System-problem-tp22207.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
column expression in left outer join for DataFrame
Hi, I am trying to port some code that was working in Spark 1.2.0 on the latest version, Spark 1.3.0. This code involves a left outer join between two SchemaRDDs which I am now trying to change to a left outer join between 2 DataFrames. I followed the example for left outer join of DataFrame at https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html Here's my code, where df1 and df2 are the 2 dataframes I am joining on the country field: val join_df = df1.join(df2, df1.country == df2.country, "left_outer") But I got a compilation error that value country is not a member of sql.DataFrame I also tried the following: val join_df = df1.join(df2, df1("country") == df2("country"), "left_outer") I got a compilation error that it is a Boolean whereas a Column is required. So what is the correct Column expression I need to provide for joining the 2 dataframes on a specific field? thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/column-expression-in-left-outer-join-for-DataFrame-tp22209.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: column expression in left outer join for DataFrame
You need to use `===`, so that you are constructing a column expression instead of evaluating the standard scala equality method. Calling methods to access columns (i.e. df.country) is only supported in Python. val join_df = df1.join(df2, df1("country") === df2("country"), "left_outer") On Tue, Mar 24, 2015 at 5:50 PM, SK skrishna...@gmail.com wrote: Hi, I am trying to port some code that was working in Spark 1.2.0 on the latest version, Spark 1.3.0. This code involves a left outer join between two SchemaRDDs which I am now trying to change to a left outer join between 2 DataFrames. I followed the example for left outer join of DataFrame at https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html Here's my code, where df1 and df2 are the 2 dataframes I am joining on the country field: val join_df = df1.join(df2, df1.country == df2.country, "left_outer") But I got a compilation error that value country is not a member of sql.DataFrame I also tried the following: val join_df = df1.join(df2, df1("country") == df2("country"), "left_outer") I got a compilation error that it is a Boolean whereas a Column is required. So what is the correct Column expression I need to provide for joining the 2 dataframes on a specific field? thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/column-expression-in-left-outer-join-for-DataFrame-tp22209.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: java.lang.OutOfMemoryError: unable to create new native thread
So, 1. I reduced my -XX:ThreadStackSize to 5m (instead of 10m - default is 1m), which is still OK for my need. 2. I reduced the executor memory to 44GB for a 60GB machine (instead of 49GB). This seems to have helped. Thanks to Matthew and Sean. Thomas On Tue, Mar 24, 2015 at 3:49 PM, Matt Silvey matt.sil...@videoamp.com wrote: My memory is hazy on this but aren't there hidden limitations to Linux-based threads? I ran into some issues a couple of years ago where, and here is the fuzzy part, the kernel wants to reserve virtual memory per thread equal to the stack size. When the total amount of reserved memory (not necessarily resident memory) exceeds the memory of the system it throws an OOM. I'm looking for material to back this up. Sorry for the initial vague response. Matthew On Tue, Mar 24, 2015 at 12:53 PM, Thomas Gerber thomas.ger...@radius.com wrote: Additional notes: I did not find anything wrong with the number of threads (ps -u USER -L | wc -l): around 780 on the master and 400 on executors. I am running on 100 r3.2xlarge. On Tue, Mar 24, 2015 at 12:38 PM, Thomas Gerber thomas.ger...@radius.com wrote: Hello, I am seeing various crashes in spark on large jobs which all share a similar exception: java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:714) I increased nproc (i.e. ulimit -u) 10 fold, but it doesn't help. Does anyone know how to avoid those kinds of errors? Noteworthy: I added -XX:ThreadStackSize=10m on both driver and executor extra java options, which might have amplified the problem. Thanks for you help, Thomas
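For reference, a sketch of how the two changes above might look in spark-defaults.conf (the values mirror the ones mentioned in this thread; treat them as starting points, not recommendations):

  spark.executor.memory            44g
  spark.driver.extraJavaOptions    -XX:ThreadStackSize=5m
  spark.executor.extraJavaOptions  -XX:ThreadStackSize=5m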
Re: issue while creating spark context
thanks Sean and Akhil, I changed the the permission of */user/spark/applicationHistory, *now it works, On Tue, Mar 24, 2015 at 7:35 PM, Sachin Singh sachin.sha...@gmail.com wrote: thanks Sean, please can you suggest in which file or configuration I need to modify proper path, please elaborate which may help, thanks, Regards Sachin On Tue, Mar 24, 2015 at 7:15 PM, Sean Owen so...@cloudera.com wrote: That's probably the problem; the intended path is on HDFS but the configuration specifies a local path. See the exception message. On Tue, Mar 24, 2015 at 1:08 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Its in your local file system, not in hdfs. Thanks Best Regards On Tue, Mar 24, 2015 at 6:25 PM, Sachin Singh sachin.sha...@gmail.com wrote: hi, I can see required permission is granted for this directory as under, hadoop dfs -ls /user/spark DEPRECATED: Use of this script to execute hdfs command is deprecated. Instead use the hdfs command for it. Found 1 items drwxrwxrwt - spark spark 0 2015-03-20 01:04 /user/spark/applicationHistory regards Sachin
Re: Errors in SPARK
The error you're seeing typically means that you cannot connect to the Hive metastore itself. Some quick thoughts: - If you were to run show tables (instead of the CREATE TABLE statement), are you still getting the same error? - To confirm, the Hive metastore (MySQL database) is up and running - Did you download or build your version of Spark? On Tue, Mar 24, 2015 at 10:48 PM sandeep vura sandeepv...@gmail.com wrote: Hi Denny, Still facing the same issue.Please find the following errors. *scala val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)* *sqlContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@4e4f880c* *scala sqlContext.sql(CREATE TABLE IF NOT EXISTS src (key INT, value STRING))* *java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient* Cheers, Sandeep.v On Wed, Mar 25, 2015 at 11:10 AM, sandeep vura sandeepv...@gmail.com wrote: No I am just running ./spark-shell command in terminal I will try with above command On Wed, Mar 25, 2015 at 11:09 AM, Denny Lee denny.g@gmail.com wrote: Did you include the connection to a MySQL connector jar so that way spark-shell / hive can connect to the metastore? For example, when I run my spark-shell instance in standalone mode, I use: ./spark-shell --master spark://servername:7077 --driver-class-path /lib/ mysql-connector-java-5.1.27.jar On Fri, Mar 13, 2015 at 8:31 AM sandeep vura sandeepv...@gmail.com wrote: Hi Sparkers, Can anyone please check the below error and give solution for this.I am using hive version 0.13 and spark 1.2.1 . Step 1 : I have installed hive 0.13 with local metastore (mySQL database) Step 2: Hive is running without any errors and able to create tables and loading data in hive table Step 3: copied hive-site.xml in spark/conf directory Step 4: copied core-site.xml in spakr/conf directory Step 5: started spark shell Please check the below error for clarifications. scala val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) sqlContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.Hi veContext@2821ec0c scala sqlContext.sql(CREATE TABLE IF NOT EXISTS src (key INT, value STRING)) java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate or g.apache.hadoop.hive. metastore.HiveMetaStoreClient at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.jav a:346) at org.apache.spark.sql.hive.HiveContext$$anonfun$4.apply(HiveContext.sc ala:235) at org.apache.spark.sql.hive.HiveContext$$anonfun$4.apply(HiveContext.sc ala:231) at scala.Option.orElse(Option.scala:257) at org.apache.spark.sql.hive.HiveContext.x$3$lzycompute(HiveContext.scal a:231) at org.apache.spark.sql.hive.HiveContext.x$3(HiveContext. scala:229) at org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext .scala:229) at org.apache.spark.sql.hive.HiveContext.hiveconf( HiveContext.scala:229) at org.apache.spark.sql.hive.HiveMetastoreCatalog.init(HiveMetastoreCa talog.scala:55) Regards, Sandeep.v
Re: Errors in SPARK
Hi Denny, Still facing the same issue.Please find the following errors. *scala val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)* *sqlContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@4e4f880c* *scala sqlContext.sql(CREATE TABLE IF NOT EXISTS src (key INT, value STRING))* *java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient* Cheers, Sandeep.v On Wed, Mar 25, 2015 at 11:10 AM, sandeep vura sandeepv...@gmail.com wrote: No I am just running ./spark-shell command in terminal I will try with above command On Wed, Mar 25, 2015 at 11:09 AM, Denny Lee denny.g@gmail.com wrote: Did you include the connection to a MySQL connector jar so that way spark-shell / hive can connect to the metastore? For example, when I run my spark-shell instance in standalone mode, I use: ./spark-shell --master spark://servername:7077 --driver-class-path /lib/mysql-connector-java-5.1.27.jar On Fri, Mar 13, 2015 at 8:31 AM sandeep vura sandeepv...@gmail.com wrote: Hi Sparkers, Can anyone please check the below error and give solution for this.I am using hive version 0.13 and spark 1.2.1 . Step 1 : I have installed hive 0.13 with local metastore (mySQL database) Step 2: Hive is running without any errors and able to create tables and loading data in hive table Step 3: copied hive-site.xml in spark/conf directory Step 4: copied core-site.xml in spakr/conf directory Step 5: started spark shell Please check the below error for clarifications. scala val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) sqlContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.Hi veContext@2821ec0c scala sqlContext.sql(CREATE TABLE IF NOT EXISTS src (key INT, value STRING)) java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate or g.apache.hadoop.hive.metastore.HiveMetaStoreClient at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.jav a:346) at org.apache.spark.sql.hive.HiveContext$$anonfun$4.apply(HiveContext.sc ala:235) at org.apache.spark.sql.hive.HiveContext$$anonfun$4.apply(HiveContext.sc ala:231) at scala.Option.orElse(Option.scala:257) at org.apache.spark.sql.hive.HiveContext.x$3$lzycompute(HiveContext.scal a:231) at org.apache.spark.sql.hive.HiveContext.x$3(HiveContext.scala:229) at org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext .scala:229) at org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:229) at org.apache.spark.sql.hive.HiveMetastoreCatalog.init(HiveMetastoreCa talog.scala:55) Regards, Sandeep.v
Re: Errors in SPARK
No I am just running ./spark-shell command in terminal I will try with above command On Wed, Mar 25, 2015 at 11:09 AM, Denny Lee denny.g@gmail.com wrote: Did you include the connection to a MySQL connector jar so that way spark-shell / hive can connect to the metastore? For example, when I run my spark-shell instance in standalone mode, I use: ./spark-shell --master spark://servername:7077 --driver-class-path /lib/mysql-connector-java-5.1.27.jar On Fri, Mar 13, 2015 at 8:31 AM sandeep vura sandeepv...@gmail.com wrote: Hi Sparkers, Can anyone please check the below error and give solution for this.I am using hive version 0.13 and spark 1.2.1 . Step 1 : I have installed hive 0.13 with local metastore (mySQL database) Step 2: Hive is running without any errors and able to create tables and loading data in hive table Step 3: copied hive-site.xml in spark/conf directory Step 4: copied core-site.xml in spakr/conf directory Step 5: started spark shell Please check the below error for clarifications. scala val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) sqlContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.Hi veContext@2821ec0c scala sqlContext.sql(CREATE TABLE IF NOT EXISTS src (key INT, value STRING)) java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate or g.apache.hadoop.hive.metastore.HiveMetaStoreClient at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.jav a:346) at org.apache.spark.sql.hive.HiveContext$$anonfun$4.apply(HiveContext.sc ala:235) at org.apache.spark.sql.hive.HiveContext$$anonfun$4.apply(HiveContext.sc ala:231) at scala.Option.orElse(Option.scala:257) at org.apache.spark.sql.hive.HiveContext.x$3$lzycompute(HiveContext.scal a:231) at org.apache.spark.sql.hive.HiveContext.x$3(HiveContext.scala:229) at org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext .scala:229) at org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:229) at org.apache.spark.sql.hive.HiveMetastoreCatalog.init(HiveMetastoreCa talog.scala:55) Regards, Sandeep.v
Re: Errors in SPARK
Did you include the connection to a MySQL connector jar so that way spark-shell / hive can connect to the metastore? For example, when I run my spark-shell instance in standalone mode, I use: ./spark-shell --master spark://servername:7077 --driver-class-path /lib/mysql-connector-java-5.1.27.jar On Fri, Mar 13, 2015 at 8:31 AM sandeep vura sandeepv...@gmail.com wrote: Hi Sparkers, Can anyone please check the below error and give solution for this.I am using hive version 0.13 and spark 1.2.1 . Step 1 : I have installed hive 0.13 with local metastore (mySQL database) Step 2: Hive is running without any errors and able to create tables and loading data in hive table Step 3: copied hive-site.xml in spark/conf directory Step 4: copied core-site.xml in spakr/conf directory Step 5: started spark shell Please check the below error for clarifications. scala val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) sqlContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.Hi veContext@2821ec0c scala sqlContext.sql(CREATE TABLE IF NOT EXISTS src (key INT, value STRING)) java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate or g.apache.hadoop.hive.metastore.HiveMetaStoreClient at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.jav a:346) at org.apache.spark.sql.hive.HiveContext$$anonfun$4.apply(HiveContext.sc ala:235) at org.apache.spark.sql.hive.HiveContext$$anonfun$4.apply(HiveContext.sc ala:231) at scala.Option.orElse(Option.scala:257) at org.apache.spark.sql.hive.HiveContext.x$3$lzycompute(HiveContext.scal a:231) at org.apache.spark.sql.hive.HiveContext.x$3(HiveContext.scala:229) at org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext .scala:229) at org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:229) at org.apache.spark.sql.hive.HiveMetastoreCatalog.init(HiveMetastoreCa talog.scala:55) Regards, Sandeep.v
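For completeness, a minimal hive-site.xml sketch for a MySQL-backed metastore, which would go in spark/conf alongside the MySQL connector jar on the driver classpath (host, database name, and credentials are placeholders):

  <configuration>
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://localhost:3306/metastore</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hiveuser</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>hivepassword</value>
    </property>
  </configuration>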
Hadoop 2.5 not listed in Spark 1.4 build page
http://spark.apache.org/docs/latest/building-spark.html#packaging-without-hadoop-dependencies-for-yarn does not list hadoop 2.5 in the Hadoop version table, etc. I assume it is still OK to compile with -Pyarn -Phadoop-2.5 for use with Hadoop 2.5 (cdh 5.3.2) Thanks,
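As far as I can tell there is no hadoop-2.5 profile in the build; Hadoop 2.5.x builds generally go through the hadoop-2.4 profile with hadoop.version set explicitly, roughly like this (the version string below is the CDH one mentioned above):

  mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.3.2 -DskipTests clean package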
Re: Weird exception in Spark job
Any Ideas on this? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Weird-exception-in-Spark-job-tp22195p22204.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Does HiveContext connect to HiveServer2?
spark-submit --files /path/to/hive-site.xml On Tue, Mar 24, 2015 at 10:31 AM, Udit Mehta ume...@groupon.com wrote: Another question related to this, how can we propagate the hive-site.xml to all workers when running in the yarn cluster mode? On Tue, Mar 24, 2015 at 10:09 AM, Marcelo Vanzin van...@cloudera.com wrote: It does neither. If you provide a Hive configuration to Spark, HiveContext will connect to your metastore server, otherwise it will create its own metastore in the working directory (IIRC). On Tue, Mar 24, 2015 at 8:58 AM, nitinkak001 nitinkak...@gmail.com wrote: I am wondering if HiveContext connects to HiveServer2 or does it work though Hive CLI. The reason I am asking is because Cloudera has deprecated Hive CLI. If the connection is through HiverServer2, is there a way to specify user credentials? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-HiveContext-connect-to-HiveServer2-tp22200.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Marcelo - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Marcelo - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Dataframe groupby custom functions (python)
Hi all, I have been trying out the new dataframe api in 1.3, which looks great by the way. I have found an example to define udfs and add them to select operations, like this: slen = F.udf(lambda s: len(s), IntegerType()) df.select(df.age, slen(df.name).alias('slen')).collect() Is it possible to do something similar with aggregates? Something like this: gdf = df.groupBy(df.name) gdf.agg(slen(df.age)).collect() thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Dataframe-groupby-custom-functions-python-tp22205.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Dataframe groupby custom functions (python)
The only UDAFs that we support today are those defined using the Hive UDAF API. Otherwise you'll have to drop into Spark operations. I'd suggest opening a JIRA. On Tue, Mar 24, 2015 at 10:49 AM, jamborta jambo...@gmail.com wrote: Hi all, I have been trying out the new dataframe api in 1.3, which looks great by the way. I have found an example to define udfs and add them to select operations, like this: slen = F.udf(lambda s: len(s), IntegerType()) df.select(df.age, slen(df.name).alias('slen')).collect() is it possible to to something similar with aggregates? Something like this: gdf = df.groupBy(df.name) gdf.agg(slen(df.age)).collect() thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Dataframe-groupby-custom-functions-python-tp22205.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: diffrence in PCA of MLib vs H2o in R
If you want to do a nonstandard (or uncentered) PCA, you can call computeSVD on RowMatrix, and look at the resulting 'V' Matrix. That should match the output of the other two systems. Reza On Tue, Mar 24, 2015 at 3:53 AM, Sean Owen so...@cloudera.com wrote: Those implementations are computing an SVD of the input matrix directly, and while you generally need the columns to have mean 0, you can turn that off with the options you cite. I don't think this is possible in the MLlib implementation, since it is computing the principal components by computing eigenvectors of the covariance matrix. The means inherently don't matter either way in this computation. On Tue, Mar 24, 2015 at 6:13 AM, roni roni.epi...@gmail.com wrote: I am trying to compute PCA using computePrincipalComponents. I also computed PCA using h2o in R and R's prcomp. The answers I get from H2o and R's prComp (non h2o) is same when I set the options for H2o as standardized=FALSE and for r's prcomp as center = false. How do I make sure that the settings for MLib PCA is same as I am using for H2o or prcomp. Thanks Roni - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: diffrence in PCA of MLib vs H2o in R
(Oh sorry, I've only been thinking of TallSkinnySVD) On Tue, Mar 24, 2015 at 6:36 PM, Reza Zadeh r...@databricks.com wrote: If you want to do a nonstandard (or uncentered) PCA, you can call computeSVD on RowMatrix, and look at the resulting 'V' Matrix. That should match the output of the other two systems. Reza On Tue, Mar 24, 2015 at 3:53 AM, Sean Owen so...@cloudera.com wrote: Those implementations are computing an SVD of the input matrix directly, and while you generally need the columns to have mean 0, you can turn that off with the options you cite. I don't think this is possible in the MLlib implementation, since it is computing the principal components by computing eigenvectors of the covariance matrix. The means inherently don't matter either way in this computation. On Tue, Mar 24, 2015 at 6:13 AM, roni roni.epi...@gmail.com wrote: I am trying to compute PCA using computePrincipalComponents. I also computed PCA using h2o in R and R's prcomp. The answers I get from H2o and R's prComp (non h2o) is same when I set the options for H2o as standardized=FALSE and for r's prcomp as center = false. How do I make sure that the settings for MLib PCA is same as I am using for H2o or prcomp. Thanks Roni - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
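A sketch of Reza's suggestion, assuming an existing RowMatrix `mat` built from an RDD[Vector] and a target dimensionality k:

  import org.apache.spark.mllib.linalg.distributed.RowMatrix

  val k = 10
  val svd = mat.computeSVD(k, computeU = false)
  val v = svd.V   // n x k matrix of right singular vectors (uncentered principal directions)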
Re: How to deploy binary dependencies to workers?
I would recommend uploading those jars to HDFS and using the --jars option of spark-submit with an HDFS URI instead of a local filesystem URI. That avoids the problem of fetching jars from the driver, which can be a bottleneck. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Tue, Mar 24, 2015 at 4:13 AM, Xi Shen davidshe...@gmail.com wrote: Hi, I am doing ML using Spark mllib. However, I do not have full control over the cluster. I am using Microsoft Azure HDInsight. I want to deploy the BLAS or whatever required dependencies to accelerate the computation. But I don't know how to deploy those DLLs when I submit my JAR to the cluster. I know how to pack those DLLs into a jar. The real challenge is how to let the system find them... Thanks, David - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
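A sketch of what DB describes (paths and jar names are placeholders):

  hadoop fs -put my-blas-deps.jar /libs/
  spark-submit --master yarn-cluster \
    --class com.example.MyMLJob \
    --jars hdfs:///libs/my-blas-deps.jar \
    my-ml-job.jar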
Re: Optimal solution for getting the header from CSV with Spark
I think this works in practice, but I don't know that the first block of the file is guaranteed to be in the first partition? certainly later down the pipeline that won't be true but presumably this is happening right after reading the file. I've always just written some filter that would only match the header, which assumes that this is possible to distinguish, but usually is. On Tue, Mar 24, 2015 at 2:41 PM, Dean Wampler deanwamp...@gmail.com wrote: Instead of data.zipWithIndex().filter(_._2==0), which will cause Spark to read the whole file, use data.take(1), which is simpler. From the Rdd.take documentation, it works by first scanning one partition, and using the results from that partition to estimate the number of additional partitions needed to satisfy the limit. In this case, it will trivially stop at the first. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Typesafe @deanwampler http://polyglotprogramming.com On Tue, Mar 24, 2015 at 7:12 AM, Spico Florin spicoflo...@gmail.com wrote: Hello! I would like to know what is the optimal solution for getting the header with from a CSV file with Spark? My aproach was: def getHeader(data: RDD[String]): String = { data.zipWithIndex().filter(_._2==0).map(x=x._1).take(1).mkString() } Thanks. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Does HiveContext connect to HiveServer2?
It does neither. If you provide a Hive configuration to Spark, HiveContext will connect to your metastore server, otherwise it will create its own metastore in the working directory (IIRC). On Tue, Mar 24, 2015 at 8:58 AM, nitinkak001 nitinkak...@gmail.com wrote: I am wondering if HiveContext connects to HiveServer2 or does it work though Hive CLI. The reason I am asking is because Cloudera has deprecated Hive CLI. If the connection is through HiverServer2, is there a way to specify user credentials? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-HiveContext-connect-to-HiveServer2-tp22200.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Marcelo - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Is yarn-standalone mode deprecated?
I checked and apparently it hasn't be released yet. it will be available in the upcoming CDH 5.4 release. -Sandy On Mon, Mar 23, 2015 at 1:32 PM, Nitin kak nitinkak...@gmail.com wrote: I know there was an effort for this, do you know which version of Cloudera distribution we could find that? On Mon, Mar 23, 2015 at 1:13 PM, Sandy Ryza sandy.r...@cloudera.com wrote: The former is deprecated. However, the latter is functionally equivalent to it. Both launch an app in what is now called yarn-cluster mode. Oozie now also has a native Spark action, though I'm not familiar on the specifics. -Sandy On Mon, Mar 23, 2015 at 1:01 PM, Nitin kak nitinkak...@gmail.com wrote: To be more clear, I am talking about SPARK_JAR=SPARK_ASSEMBLY_JAR_FILE ./bin/spark-class org.apache.spark.deploy.yarn.Client \ --jar YOUR_APP_JAR_FILE \ --class APP_MAIN_CLASS \ --args APP_MAIN_ARGUMENTS \ --num-workers NUMBER_OF_WORKER_MACHINES \ --master-class ApplicationMaster_CLASS --master-memory MEMORY_FOR_MASTER \ --worker-memory MEMORY_PER_WORKER \ --worker-cores CORES_PER_WORKER \ --name application_name \ --queue queue_name \ --addJars any_local_files_used_in_SparkContext.addJar \ --files files_for_distributed_cache \ --archives archives_for_distributed_cache which I thought was the yarn-standalone mode vs spark-submit ./bin/spark-submit --class org.apache.spark.examples.SparkPi \ --master yarn-cluster \ --num-executors 3 \ --driver-memory 4g \ --executor-memory 2g \ --executor-cores 1 \ --queue thequeue \ lib/spark-examples*.jar I didnt see example of ./bin/spark-class in 1.2.0 documentation, so am wondering if that is deprecated. On Mon, Mar 23, 2015 at 12:11 PM, Sandy Ryza sandy.r...@cloudera.com wrote: The mode is not deprecated, but the name yarn-standalone is now deprecated. It's now referred to as yarn-cluster. -Sandy On Mon, Mar 23, 2015 at 11:49 AM, nitinkak001 nitinkak...@gmail.com wrote: Is yarn-standalone mode deprecated in Spark now. The reason I am asking is because while I can find it in 0.9.0 documentation(https://spark.apache.org/docs/0.9.0/running-on-yarn.html). I am not able to find it in 1.2.0. I am using this mode to run the Spark jobs from Oozie as a java action. Removing this mode will prevent me from doing that. Are there any other ways of running a Spark job from Oozie other than Shell action? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-yarn-standalone-mode-deprecated-tp22188.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
akka.version error
I am facing the same issue as listed here: http://apache-spark-user-list.1001560.n3.nabble.com/Packaging-a-spark-job-using-maven-td5615.html The solution mentioned is here: https://gist.github.com/prb/d776a47bd164f704eecb However, I think I don't understand a few things: 1) Why are the jars being split into worker and driver jars? 2) Does it mean I now need to create 2 jars? 3) I am assuming I still need both jars on the path when I run this job? I am simply trying to execute a basic word count example.
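The akka.version error usually means the Akka reference.conf files were clobbered when the uber jar was assembled. If you are building with the Maven shade plugin, the usual fix (a sketch; the plugin version and the rest of the pom are omitted) is an AppendingTransformer so the Typesafe config files are merged rather than overwritten:

  <plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <configuration>
      <transformers>
        <!-- merge, rather than overwrite, the Akka reference.conf files -->
        <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
          <resource>reference.conf</resource>
        </transformer>
      </transformers>
    </configuration>
  </plugin>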
What is the ideal method to interact with a Spark Cluster from a Cloud App?
Hi all, We have a cloud application, to which we are adding a reporting service. For this we have narrowed down to using Cassandra + Spark for data store and processing respectively. Since the cloud application is separate from the Cassandra + Spark deployment, what is the ideal method to interact with the Spark master from the application? We have been evaluating spark-job-server [1], which is a RESTful layer on top of Spark. Are there any other such tools? Or is there any other better approach which can be explored? We are evaluating the following requirements against spark-job-server: 1. Provide a platform for applications to submit jobs 2. Provide RESTful APIs using which applications will interact with the server - Upload jar for running jobs - Submit job - Get job list - Get job status - Get job result 3. Provide support for kill/restart job - Kill job - Restart job 4. Support job priority 5. Queue up job submissions if resources not available 6. Troubleshoot job execution - Failure – job logs - Measure performance 7. Manage cluster deployment - Bootstrap, scale up/down (add, remove, replace nodes) 8. Monitor cluster deployment - Health report: Report metrics – CPU, memory, number of jobs, spark processes - Alert DevOps about threshold limits of these metrics - Alert DevOps about job failures - Self healing? 9. Security - AAA job submissions 10. High availability/Redundancy - This is for the spark-jobserver component itself Any help is appreciated! Thanks and Regards Noorul - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Optimal solution for getting the header from CSV with Spark
Good point. There's no guarantee that you'll get the actual first partition. One reason why I wouldn't allow a CSV header line in a real data file, if I could avoid it. Back to Spark, a safer approach is RDD.foreachPartition, which takes a function expecting an iterator. You'll only need to grab the first element (being careful that the partition isn't empty!) and then determine which of those first lines has the header info. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Tue, Mar 24, 2015 at 11:12 AM, Sean Owen so...@cloudera.com wrote: I think this works in practice, but I don't know that the first block of the file is guaranteed to be in the first partition? certainly later down the pipeline that won't be true but presumably this is happening right after reading the file. I've always just written some filter that would only match the header, which assumes that this is possible to distinguish, but usually is. On Tue, Mar 24, 2015 at 2:41 PM, Dean Wampler deanwamp...@gmail.com wrote: Instead of data.zipWithIndex().filter(_._2==0), which will cause Spark to read the whole file, use data.take(1), which is simpler. From the Rdd.take documentation, it works by first scanning one partition, and using the results from that partition to estimate the number of additional partitions needed to satisfy the limit. In this case, it will trivially stop at the first. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Typesafe @deanwampler http://polyglotprogramming.com On Tue, Mar 24, 2015 at 7:12 AM, Spico Florin spicoflo...@gmail.com wrote: Hello! I would like to know what is the optimal solution for getting the header with from a CSV file with Spark? My aproach was: def getHeader(data: RDD[String]): String = { data.zipWithIndex().filter(_._2==0).map(x=x._1).take(1).mkString() } Thanks.
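Another common variant of the same idea, sketched below: grab the header with first() and strip it from the data with mapPartitionsWithIndex, dropping the first line only from partition 0 (this still assumes the header really does live in the first partition, which is the caveat Sean raised):

  // assumes data: RDD[String] read with sc.textFile(...)
  val header = data.first()

  val rows = data.mapPartitionsWithIndex { (idx, iter) =>
    if (idx == 0) iter.drop(1) else iter
  }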
Re: Is it possible to use json4s 3.2.11 with Spark 1.3.0?
From the exception it seems like your app is also repackaging Scala classes somehow. Can you double check that and remove the Scala classes from your app if they're there? On Mon, Mar 23, 2015 at 10:07 PM, Alexey Zinoviev alexey.zinov...@gmail.com wrote: Thanks Marcelo, this options solved the problem (I'm using 1.3.0), but it works only if I remove extends Logging from the object, with extends Logging it return: Exception in thread main java.lang.LinkageError: loader constraint violation in interface itable initialization: when resolving method App1$.logInfo(Lscala/Function0;Ljava/lang/Throwable;)V the class loader (instance of org/apache/spark/util/ChildFirstURLClassLoader) of the current class, App1$, and the class loader (instance of sun/misc/Launcher$AppClassLoader) for interface org/apache/spark/Logging have different Class objects for the type scala/Function0 used in the signature at App1.main(App1.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Do you have any idea what's wrong with Logging? PS: I'm running it with spark-1.3.0/bin/spark-submit --class App1 --conf spark.driver.userClassPathFirst=true --conf spark.executor.userClassPathFirst=true $HOME/projects/sparkapp/target/scala-2.10/sparkapp-assembly-1.0.jar Thanks, Alexey On Tue, Mar 24, 2015 at 5:03 AM, Marcelo Vanzin van...@cloudera.com wrote: You could build a far jar for your application containing both your code and the json4s library, and then run Spark with these two options: spark.driver.userClassPathFirst=true spark.executor.userClassPathFirst=true Both only work in 1.3. (1.2 has spark.files.userClassPathFirst, but that only works for executors.) On Mon, Mar 23, 2015 at 2:12 PM, Alexey Zinoviev alexey.zinov...@gmail.com wrote: Spark has a dependency on json4s 3.2.10, but this version has several bugs and I need to use 3.2.11. I added json4s-native 3.2.11 dependency to build.sbt and everything compiled fine. But when I spark-submit my JAR it provides me with 3.2.10. build.sbt import sbt.Keys._ name := sparkapp version := 1.0 scalaVersion := 2.10.4 libraryDependencies += org.apache.spark %% spark-core % 1.3.0 % provided libraryDependencies += org.json4s %% json4s-native % 3.2.11` plugins.sbt logLevel := Level.Warn resolvers += Resolver.url(artifactory, url(http://scalasbt.artifactoryonline.com/scalasbt/sbt-plugin-releases;))(Resolver.ivyStylePatterns) addSbtPlugin(com.eed3si9n % sbt-assembly % 0.13.0) App1.scala import org.apache.spark.SparkConf import org.apache.spark.rdd.RDD import org.apache.spark.{Logging, SparkConf, SparkContext} import org.apache.spark.SparkContext._ object App1 extends Logging { def main(args: Array[String]) = { val conf = new SparkConf().setAppName(App1) val sc = new SparkContext(conf) println(sjson4s version: ${org.json4s.BuildInfo.version.toString}) } } sbt 0.13.7, sbt-assembly 0.13.0, Scala 2.10.4 Is it possible to force 3.2.11 version usage? 
Thanks, Alexey -- Marcelo -- Marcelo - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
CombineByKey - Please explain its working
I am reading about combineByKey and going through the example below from a blog post, but I can't understand how it works step by step. Can someone please explain?
case class Fruit(kind: String, weight: Int) { def makeJuice: Juice = Juice(weight * 100) }
case class Juice(volumn: Int) { def add(j: Juice): Juice = Juice(volumn + j.volumn) }
val apple1 = Fruit("Apple", 5)
val apple2 = Fruit("Apple", 8)
val orange1 = Fruit("orange", 10)
val fruit = sc.parallelize(List(("Apple", apple1), ("orange", orange1), ("Apple", apple2)))
val juice = fruit.combineByKey(
  (f: Fruit) => f.makeJuice,
  (j: Juice, f: Fruit) => j.add(f.makeJuice),
  (j1: Juice, j2: Juice) => j1.add(j2)
)
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/CombineByKey-Please-explain-its-working-tp22203.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
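Here is the same combineByKey call annotated, with its three arguments labelled (class and value names are kept from the snippet above; this assumes a spark-shell with `sc` available):

  val fruit = sc.parallelize(List(("Apple", apple1), ("orange", orange1), ("Apple", apple2)))

  val juice = fruit.combineByKey(
    // createCombiner: run on the first value seen for a key within a partition
    (f: Fruit) => f.makeJuice,
    // mergeValue: fold each subsequent value for that key (in the same partition) into the combiner
    (j: Juice, f: Fruit) => j.add(f.makeJuice),
    // mergeCombiners: merge the per-partition combiners for the same key across partitions
    (j1: Juice, j2: Juice) => j1.add(j2)
  )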
Re: Question about Data Sources API
My question wrt Java/Scala was related to extending the classes to support new custom data sources, so I was wondering if those could be written in Java, since our company is a Java shop. Yes, you should be able to extend the required interfaces using Java. The additional push downs I am looking for are aggregations with grouping and sorting. Essentially, I am trying to evaluate if this API can give me much of what is possible with the Apache MetaModel project. We don't currently push those down today as our initial focus is on getting data into Spark so that you can join with other sources and then do such processing. It's possible we will extend the pushdown API in the future, though.
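To illustrate the extension point being discussed, here is a minimal Scala sketch of a read-only relation against the Spark 1.3 sources API (the class names and the hard-coded rows are invented for illustration; the same interfaces can be implemented from Java since they are plain traits and abstract classes):

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.{Row, SQLContext}
  import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
  import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

  class DefaultSource extends RelationProvider {
    override def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation =
      new DummyRelation(sqlContext)
  }

  class DummyRelation(val sqlContext: SQLContext) extends BaseRelation with TableScan {
    // fixed two-column schema for the example
    override def schema: StructType =
      StructType(Seq(StructField("name", StringType), StructField("age", IntegerType)))

    // a real source would read from the external system here
    override def buildScan(): RDD[Row] =
      sqlContext.sparkContext.parallelize(Seq(Row("alice", 30), Row("bob", 25)))
  }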
Spark GraphX In Action on documentation page?
Can my new book, Spark GraphX In Action, which is currently in MEAP http://manning.com/malak/, be added to https://spark.apache.org/documentation.html and, if appropriate, to https://spark.apache.org/graphx/ ? Michael Malak - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark-thriftserver Issue
Zhan specifying port fixed the port issue. Is it possible to specify the log directory while starting the spark thriftserver? Still getting this error even through the folder exists and everyone has permission to use that directory. drwxr-xr-x 2 root root 4096 Mar 24 19:04 spark-events Exception in thread main java.lang.IllegalArgumentException: Log directory /tmp/spark-events does not exist. at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:99) at org.apache.spark.SparkContext.init(SparkContext.scala:399) at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:49) at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:58) at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) On Mon, Mar 23, 2015 at 6:51 PM, Zhan Zhang zzh...@hortonworks.com wrote: Probably the port is already used by others, e.g., hive. You can change the port similar to below ./sbin/start-thriftserver.sh --master yarn --executor-memory 512m --hiveconf hive.server2.thrift.port=10001 Thanks. Zhan Zhang On Mar 23, 2015, at 12:01 PM, Neil Dev neilk...@gmail.com wrote: Hi, I am having issue starting spark-thriftserver. I'm running spark 1.3.with Hadoop 2.4.0. I would like to be able to change its port too so, I can hive hive-thriftserver as well as spark-thriftserver running at the same time. Starting sparkthrift server:- sudo ./start-thriftserver.sh --master spark://ip-172-31-10-124:7077 --executor-memory 2G Error:- I created the folder manually but still getting the following error Exception in thread main java.lang.IllegalArgumentException: Log directory /tmp/spark-events does not exist. I am getting the following error 15/03/23 15:07:02 ERROR thrift.ThriftCLIService: Error: org.apache.thrift.transport.TTransportException: Could not create ServerSocket on address0.0.0.0/0.0.0.0:1. at org.apache.thrift.transport.TServerSocket.init(TServerSocket.java:93) at org.apache.thrift.transport.TServerSocket.init(TServerSocket.java:79) at org.apache.hive.service.auth.HiveAuthFactory.getServerSocket(HiveAuthFactory.java:236) at org.apache.hive.service.cli.thrift.ThriftBinaryCLIService.run(ThriftBinaryCLIService.java:69) at java.lang.Thread.run(Thread.java:745) Thanks Neil
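On the second question: that directory comes from Spark's event log settings, so it can be overridden when starting the thrift server, something like (path and port are examples):

  ./sbin/start-thriftserver.sh --master spark://ip-172-31-10-124:7077 \
    --hiveconf hive.server2.thrift.port=10001 \
    --conf spark.eventLog.enabled=true \
    --conf spark.eventLog.dir=hdfs:///user/spark/applicationHistory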
FAILED SelectChannelConnector@0.0.0.0:4040 java.net.BindException: Address already in use
I get the following message each time I run a spark job: 15/03/24 15:35:56 WARN AbstractLifeCycle: FAILED SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address already in use The full trace is here: http://pastebin.com/xSvRN01f How do I fix this? I am on CDH 5.3.1 thanks roy
Re: SparkSQL UDTs with Ordering
I'll caution that the UDTs are not a stable public interface yet. We'd like to make them public someday, but currently this feature is mostly for MLlib as we have not finalized the API. Having an ordering could be useful, but I'll add that currently UDTs actually exist in serialized form, so the ordering would have to be on the internal form, not the user-visible form. On Tue, Mar 24, 2015 at 12:25 PM, Patrick Woody patrick.woo...@gmail.com wrote: Hey all, Currently looking into UDTs and I was wondering if it is reasonable to add the ability to define an Ordering (or if this is possible, then how)? Currently it will throw an error when non-Native types are used. Thanks! -Pat
Spark Application Hung
Hi, We are observing a hung spark application when one of the yarn datanode (running multiple spark executors) go down. Setup details: * Spark: 1.2.1 * Hadoop: 2.4.0 * Spark Application Mode: yarn-client * 2 datanodes (DN1, DN2) * 6 spark executors (initially 3 executors on both DN1 and DN2, after rebooting DN2, changes to 4 executors on DN1 and 2 executors on DN2) Scenario: When one of the datanodes (DN2) is brought down, the application gets hung, with spark driver continuously showing the following warning: 15/03/24 12:39:26 WARN TaskSetManager: Lost task 5.0 in stage 232.0 (TID 37941, DN1): FetchFailed(null, shuffleId=155, mapId=-1, reduceId=5, message= org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 155 at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:384) at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:381) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:380) at org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:176) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42) at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40) at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) When DN2 is brought down, one executor gets launched on DN1. 
When DN2 is brought back up after 15mins, 2 executors get launched on it. All the executors (including the ones which got launched after DN2 comes back), keep showing the following errors: 15/03/24 12:43:30 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 155, fetching them 15/03/24 12:43:30 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 155, fetching them 15/03/24 12:43:30 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker actor = Actor[akka.tcp://sparkDriver@NN1:44353/user/MapOutputTracker#-957394722] 15/03/24 12:43:30 INFO spark.MapOutputTrackerWorker: Got the output locations 15/03/24 12:43:30 ERROR spark.MapOutputTracker: Missing an output location for shuffle 155 15/03/24 12:43:30 ERROR spark.MapOutputTracker: Missing an output location for shuffle 155 15/03/24 12:43:30 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 44623 15/03/24 12:43:30 INFO executor.Executor: Running task 5.0 in stage 232.960 (TID 44623) 15/03/24 12:43:30 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 44629
Re: hadoop input/output format advanced control
You can indeed override the Hadoop configuration at a per-RDD level - though it is a little more verbose, as in the below example, and you need to effectively make a copy of the hadoop Configuration: val thisRDDConf = new Configuration(sc.hadoopConfiguration) thisRDDConf.set(mapred.min.split.size, 5) val rdd = sc.newAPIHadoopFile(path, classOf[SequenceFileInputFormat[IntWritable, Text]], classOf[IntWritable], classOf[Text], thisRDDConf ) println(rdd.partitions.size) val rdd2 = sc.newAPIHadoopFile(path, classOf[SequenceFileInputFormat[IntWritable, Text]], classOf[IntWritable], classOf[Text] ) println(rdd2.partitions.size) For example, if I run the above on the following directory (some files I have lying around): -rw-r--r-- 1 Nick staff 0B Jul 11 2014 _SUCCESS -rw-r--r-- 1 Nick staff 291M Sep 16 2014 part-0 -rw-r--r-- 1 Nick staff 227M Sep 16 2014 part-1 -rw-r--r-- 1 Nick staff 370M Sep 16 2014 part-2 -rw-r--r-- 1 Nick staff 244M Sep 16 2014 part-3 -rw-r--r-- 1 Nick staff 240M Sep 16 2014 part-4 I get output: 15/03/24 20:43:12 INFO FileInputFormat: Total input paths to process : 5 *5* ... and then for the second RDD: 15/03/24 20:43:12 INFO SparkContext: Created broadcast 1 from newAPIHadoopFile at TestHash.scala:41 *45* As expected. Though a more succinct way of passing in those conf options would be nice - but this should get you what you need. On Mon, Mar 23, 2015 at 10:36 PM, Koert Kuipers ko...@tresata.com wrote: currently its pretty hard to control the Hadoop Input/Output formats used in Spark. The conventions seems to be to add extra parameters to all methods and then somewhere deep inside the code (for example in PairRDDFunctions.saveAsHadoopFile) all these parameters get translated into settings on the Hadoop Configuration object. for example for compression i see codec: Option[Class[_ : CompressionCodec]] = None added to a bunch of methods. how scalable is this solution really? for example i need to read from a hadoop dataset and i dont want the input (part) files to get split up. the way to do this is to set mapred.min.split.size. now i dont want to set this at the level of the SparkContext (which can be done), since i dont want it to apply to input formats in general. i want it to apply to just this one specific input dataset i need to read. which leaves me with no options currently. i could go add yet another input parameter to all the methods (SparkContext.textFile, SparkContext.hadoopFile, SparkContext.objectFile, etc.). but that seems ineffective. why can we not expose a Map[String, String] or some other generic way to manipulate settings for hadoop input/output formats? it would require adding one more parameter to all methods to deal with hadoop input/output formats, but after that its done. one parameter to rule them all then i could do: val x = sc.textFile(/some/path, formatSettings = Map(mapred.min.split.size - 12345)) or rdd.saveAsTextFile(/some/path, formatSettings = Map(mapred.output.compress - true, mapred.output.compression.codec - somecodec))
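For reference, a self-contained restatement of the per-RDD override with string quoting restored; the path and the 512 MB minimum split size are illustrative values, not the ones from the run above:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat

// Copy the context-wide conf so the override stays local to this one RDD.
val oneRddConf = new Configuration(sc.hadoopConfiguration)
// A large minimum split size discourages splitting the input files for this read only.
oneRddConf.set("mapred.min.split.size", (512L * 1024 * 1024).toString)

val rdd = sc.newAPIHadoopFile(
  "/some/path",                                        // placeholder path
  classOf[SequenceFileInputFormat[IntWritable, Text]],
  classOf[IntWritable],
  classOf[Text],
  oneRddConf)
println(rdd.partitions.size)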
SparkSQL UDTs with Ordering
Hey all, Currently looking into UDTs and I was wondering if it is reasonable to add the ability to define an Ordering (or if this is possible, then how)? Currently it will throw an error when non-Native types are used. Thanks! -Pat
java.lang.OutOfMemoryError: unable to create new native thread
Hello, I am seeing various crashes in spark on large jobs which all share a similar exception: java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:714) I increased nproc (i.e. ulimit -u) 10-fold, but it doesn't help. Does anyone know how to avoid those kinds of errors? Noteworthy: I added -XX:ThreadStackSize=10m on both driver and executor extra java options, which might have amplified the problem. Thanks for your help, Thomas
Re: Spark-thriftserver Issue
You can try to set it in spark-env.sh. # - SPARK_LOG_DIR Where log files are stored. (Default: ${SPARK_HOME}/logs) # - SPARK_PID_DIR Where the pid file is stored. (Default: /tmp) Thanks. Zhan Zhang On Mar 24, 2015, at 12:10 PM, Anubhav Agarwal anubha...@gmail.com wrote: Zhan, specifying the port fixed the port issue. Is it possible to specify the log directory while starting the spark thriftserver? Still getting this error even though the folder exists and everyone has permission to use that directory. drwxr-xr-x 2 root root 4096 Mar 24 19:04 spark-events Exception in thread main java.lang.IllegalArgumentException: Log directory /tmp/spark-events does not exist. at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:99) at org.apache.spark.SparkContext.init(SparkContext.scala:399) at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:49) at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:58) at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) On Mon, Mar 23, 2015 at 6:51 PM, Zhan Zhang zzh...@hortonworks.com wrote: Probably the port is already used by others, e.g., hive. You can change the port similar to below ./sbin/start-thriftserver.sh --master yarn --executor-memory 512m --hiveconf hive.server2.thrift.port=10001 Thanks. Zhan Zhang On Mar 23, 2015, at 12:01 PM, Neil Dev neilk...@gmail.com wrote: Hi, I am having an issue starting spark-thriftserver. I'm running Spark 1.3 with Hadoop 2.4.0. I would like to be able to change its port too, so I can have hive-thriftserver as well as spark-thriftserver running at the same time. Starting spark thriftserver:- sudo ./start-thriftserver.sh --master spark://ip-172-31-10-124:7077 --executor-memory 2G Error:- I created the folder manually but still getting the following error Exception in thread main java.lang.IllegalArgumentException: Log directory /tmp/spark-events does not exist. I am getting the following error 15/03/23 15:07:02 ERROR thrift.ThriftCLIService: Error: org.apache.thrift.transport.TTransportException: Could not create ServerSocket on address 0.0.0.0/0.0.0.0:1. at org.apache.thrift.transport.TServerSocket.init(TServerSocket.java:93) at org.apache.thrift.transport.TServerSocket.init(TServerSocket.java:79) at org.apache.hive.service.auth.HiveAuthFactory.getServerSocket(HiveAuthFactory.java:236) at org.apache.hive.service.cli.thrift.ThriftBinaryCLIService.run(ThriftBinaryCLIService.java:69) at java.lang.Thread.run(Thread.java:745) Thanks Neil
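One more note on the remaining error: the stack trace goes through EventLoggingListener, which reads the Spark event log location rather than the daemon log directory, so the setting to check may be spark.eventLog.dir; this is an inference from the trace, not something confirmed in the thread. A sketch, with a placeholder HDFS path:

# conf/spark-defaults.conf
spark.eventLog.enabled  true
spark.eventLog.dir      hdfs:///user/spark/applicationHistory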
diffrence in PCA of MLib vs H2o in R
I am trying to compute PCA using computePrincipalComponents. I also computed PCA using H2O in R and R's prcomp. The answers I get from H2O and R's prcomp (non-H2O) are the same when I set the options for H2O as standardized=FALSE and for R's prcomp as center=FALSE. How do I make sure that the settings for MLlib PCA are the same as the ones I am using for H2O or prcomp? Thanks Roni
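For reference, a sketch of making the preprocessing explicit on the Spark side before calling computePrincipalComponents, so it can be lined up with H2O's standardize and prcomp's center/scale. settings. Variable names are placeholders, and whether an explicit centering step duplicates work MLlib already does internally is an assumption worth verifying:

import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// rows: RDD[org.apache.spark.mllib.linalg.Vector] holding the input data.
// Center only (withMean = true) and do not scale, mirroring prcomp's defaults;
// flip the flags to mimic a center = FALSE or fully standardized run.
val scaler = new StandardScaler(withMean = true, withStd = false).fit(rows)
val centered = scaler.transform(rows)
val pcs = new RowMatrix(centered).computePrincipalComponents(5) // top 5 components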
Re: Hive context datanucleus error
Has this issue been fixed in Spark 1.2? https://issues.apache.org/jira/browse/SPARK-2624 On Mon, Mar 23, 2015 at 9:19 PM, Udit Mehta ume...@groupon.com wrote: I am trying to run a simple query to view tables in my hive metastore using hive context. I am getting this error: spark Persistence process has been specified to use a ClassLoader Resolve of name datanucleus yet this has not been found by the DataNucleus plugin mechanism. Please check your CLASSPATH and plugin specification. I am able to access the metastore using spark-sql. Can someone point out what the issue could be? thanks
Re: Standalone Scheduler VS YARN Performance
By any chance does this thread address look similar: http://apache-spark-developers-list.1001551.n3.nabble.com/Lost-executor-on-YARN-ALS-iterations-td7916.html ? On Tue, Mar 24, 2015 at 5:23 AM Harut Martirosyan harut.martiros...@gmail.com wrote: What is performance overhead caused by YARN, or what configurations are being changed when the app is ran through YARN? The following example: sqlContext.sql(SELECT dayStamp(date), count(distinct deviceId) AS c FROM full GROUP BY dayStamp(date) ORDER BY c DESC LIMIT 10) .collect() runs on shell when we use standalone scheduler: ./spark-shell --master sparkmaster:7077 --executor-memory 20g --executor-cores 10 --driver-memory 10g --num-executors 8 and fails due to losing an executor, when we run it through YARN. ./spark-shell --master yarn-client --executor-memory 20g --executor-cores 10 --driver-memory 10g --num-executors 8 There are no evident logs, just messages that executors are being lost, and connection refused errors, (apparently due to executor failures) The cluster is the same, 8 nodes, 64Gb RAM each. Format is parquet. -- RGRDZ Harut
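If that thread is the same issue, the executors are most likely being killed by YARN for exceeding container memory (the JVM's off-heap overhead on top of the 20g heap), and the usual mitigation is to raise the overhead allowance. A sketch only, with an arbitrary value, since the diagnosis is not confirmed here:

./spark-shell --master yarn-client --executor-memory 20g --executor-cores 10 --driver-memory 10g --num-executors 8 --conf spark.yarn.executor.memoryOverhead=2048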
Re: Optimal solution for getting the header from CSV with Spark
Instead of data.zipWithIndex().filter(_._2==0), which will cause Spark to read the whole file, use data.take(1), which is simpler. From the Rdd.take documentation, it works by first scanning one partition, and using the results from that partition to estimate the number of additional partitions needed to satisfy the limit. In this case, it will trivially stop at the first. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Tue, Mar 24, 2015 at 7:12 AM, Spico Florin spicoflo...@gmail.com wrote: Hello! I would like to know what is the optimal solution for getting the header with from a CSV file with Spark? My aproach was: def getHeader(data: RDD[String]): String = { data.zipWithIndex().filter(_._2==0).map(x=x._1).take(1).mkString() } Thanks.
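In other words, the helper from the original question reduces to something like this (a sketch; returning an empty string for an empty RDD is an assumption about what the caller wants):

import org.apache.spark.rdd.RDD

def getHeader(data: RDD[String]): String =
  data.take(1).headOption.getOrElse("")  // scans only the first partition in the common case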
Re: Invalid ContainerId ... Caused by: java.lang.NumberFormatException: For input string: e04
Thanks All - perhaps I misread the earlier posts as dependencies with the Hadoop version, but the key is also CDH 5.3.2 (not just Hadoop 2.5 v/s 2.4) etc. After adding the classpath as Marcelo/Harsh suggested (loading CDH libs in front), I am able to get spark-shell started without the invalid container error etc, so that issue is solved. When I run any query, it gives java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode; This seems to be a Guava lib version issue that has been known ... I will look into it. Thanks again! On Tue, Mar 24, 2015 at 12:50 PM, Harsh J ha...@cloudera.com wrote: My comment's still the same: Runtime-link-via-classpath Spark to use CDH 5.3.2 libraries, just like your cluster does, not Apache Hadoop 2.5.0 (which CDH is merely based on, but carries several backports on top that aren't in Apache Hadoop 2.5.0, one of which addresses this parsing trouble). You do not need to recompile Spark, just alter the hadoop libraries in its classpath to be those of the CDH server version (overwrite from parcels, etc.). On Wed, Mar 25, 2015 at 1:06 AM, Manoj Samel manojsamelt...@gmail.com wrote: I recompiled Spark 1.3 with Hadoop 2.5; it still gives the same stack trace. A quick browse into the stack trace with Hadoop 2.5.0 org.apache.hadoop.yarn.util.ConverterUtils ... 1. toContainerId gets parameter containerId which I assume is container_e06_1427223073530_0001_01_01 2. It splits it using public static final Splitter _SPLITTER = Splitter.on('_').trimResults(); 3. Line 172 checks container prefix with CONTAINER_PREFIX which is valid (container) 4. It calls toApplicationAttemptId 5. toApplicationAttemptId tries Long.parseLong(it.next()) on e06 and dies Seems like it is not expecting a non-numeric character. Is this a YARN issue? Thanks, On Tue, Mar 24, 2015 at 8:25 AM, Manoj Samel manoj.sa...@gmail.com wrote: I'll compile Spark with Hadoop libraries and try again ... Thanks, Manoj On Mar 23, 2015, at 10:34 PM, Harsh J ha...@cloudera.com wrote: This may happen if you are using different versions of CDH5 jars between Spark and the cluster. Can you ensure your Spark's Hadoop CDH jars match the cluster version exactly, since you seem to be using a custom version of Spark (out of CDH) here? On Tue, Mar 24, 2015 at 7:32 AM, Manoj Samel manojsamelt...@gmail.com wrote: x-post to CDH list for any insight ... Thanks, -- Forwarded message -- From: Manoj Samel manojsamelt...@gmail.com Date: Mon, Mar 23, 2015 at 6:32 PM Subject: Invalid ContainerId ... Caused by: java.lang.NumberFormatException: For input string: e04 To: user@spark.apache.org Spark 1.3, CDH 5.3.2, Kerberos Setup works fine with base configuration, spark-shell can be used in yarn client mode etc.
When work recovery feature is enabled via http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/admin_ha_yarn_work_preserving_recovery.html, the spark-shell fails with following log 15/03/24 01:20:16 ERROR yarn.ApplicationMaster: Uncaught exception: java.lang.IllegalArgumentException: Invalid ContainerId: container_e04_1427159778706_0002_01_01 at org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:182) at org.apache.spark.deploy.yarn.YarnRMClient.getAttemptId(YarnRMClient.scala:93) at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:83) at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:576) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59) at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:574) at org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:597) at org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala) Caused by: java.lang.NumberFormatException: For input string: e04 at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Long.parseLong(Long.java:589) at java.lang.Long.parseLong(Long.java:631) at org.apache.hadoop.yarn.util.ConverterUtils.toApplicationAttemptId(ConverterUtils.java:137) at org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:177) ... 12 more 15/03/24 01:20:16 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 10, (reason:
Re: Invalid ContainerId ... Caused by: java.lang.NumberFormatException: For input string: e04
Hi there, On Tue, Mar 24, 2015 at 1:40 PM, Manoj Samel manojsamelt...@gmail.com wrote: When I run any query, it gives java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode; Are you running a custom-compiled Spark by any chance? Specifically, one you built with sbt? That would hit this problem, because the path I suggested (/usr/lib/hadoop/client/*) contains an older guava library, which would override the one shipped with the sbt-built Spark. If you build Spark with maven, or use the pre-built Spark distro, or specifically filter out the guava jar from your classpath when setting up the Spark job, things should work. -- Marcelo
updateStateByKey - Seq[V] order
Hi Does updateStateByKey pass elements to updateFunc (in Seq[V]) in order in which they appear in the RDD? My guess is no which means updateFunc needs to be commutative. Am I correct? I've asked this question before but there were no takers. Here's the scala docs for updateStateByKey /** * Return a new state DStream where the state for each key is updated by applying * the given function on the previous state of the key and the new values of each key. * Hash partitioning is used to generate the RDDs with Spark's default number of partitions. * @param updateFunc State update function. If `this` function returns None, then * corresponding state key-value pair will be eliminated. * @tparam S State type */ def updateStateByKey[S: ClassTag]( updateFunc: (Seq[V], Option[S]) = Option[S] ): DStream[(K, S)] = { updateStateByKey(updateFunc, defaultPartitioner()) }
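For what it's worth, the safe assumption is that no ordering is guaranteed, so the update function should not depend on the order of the values; a minimal sketch with a commutative, associative update (the pairs DStream and Int counts are hypothetical):

// pairs: DStream[(String, Int)], e.g. per-batch counts keyed by word
val updateFunc = (values: Seq[Int], state: Option[Int]) =>
  Some(values.sum + state.getOrElse(0))
val runningCounts = pairs.updateStateByKey[Int](updateFunc)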
Re: java.lang.OutOfMemoryError: unable to create new native thread
I doubt you're hitting the limit of threads you can spawn, but as you say, running out of memory that the JVM process is allowed to allocate, since your threads are grabbing stacks 10x bigger than usual. The thread stacks are 4GB by themselves. I suppose you can't avoid upping the stack size so much? If so, then I think you need to make more, smaller executors instead? On Tue, Mar 24, 2015 at 7:38 PM, Thomas Gerber thomas.ger...@radius.com wrote: Hello, I am seeing various crashes in spark on large jobs which all share a similar exception: java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:714) I increased nproc (i.e. ulimit -u) 10-fold, but it doesn't help. Does anyone know how to avoid those kinds of errors? Noteworthy: I added -XX:ThreadStackSize=10m on both driver and executor extra java options, which might have amplified the problem. Thanks for your help, Thomas
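For scale: a few hundred threads at 10 MB of stack each works out to roughly 4 GB of native memory per JVM, which matches the estimate above. A sketch of dialing the stacks back down (the 2m value is arbitrary; -Xss is the more common spelling of the same knob as ThreadStackSize):

spark-submit --conf "spark.driver.extraJavaOptions=-Xss2m" --conf "spark.executor.extraJavaOptions=-Xss2m" ...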
Re: Registering custom UDAFs with HiveConetxt in SparkSQL, how?
Shahab - This should do the trick until Hao's changes are out: sqlContext.sql(create temporary function foobar as 'com.myco.FoobarUDAF'); sqlContext.sql(select foobar(some_column) from some_table); This works without requiring to 'deploy' a JAR with the UDAF in it - just make sure the UDAF is in your project's classpath. On Tue, Mar 10, 2015 at 8:21 PM, Cheng, Hao hao.ch...@intel.com wrote: Oh, sorry, my bad, currently Spark SQL doesn't provide the user interface for UDAF, but it can work seamlessly with Hive UDAF (via HiveContext). I am also working on the UDAF interface refactoring, after that we can provide the custom interface for extension. https://github.com/apache/spark/pull/3247 *From:* shahab [mailto:shahab.mok...@gmail.com] *Sent:* Wednesday, March 11, 2015 1:44 AM *To:* Cheng, Hao *Cc:* user@spark.apache.org *Subject:* Re: Registering custom UDAFs with HiveConetxt in SparkSQL, how? Thanks Hao, But my question concerns UDAF (user defined aggregation function ) not UDTF( user defined type function ). I appreciate if you could point me to some starting point on UDAF development in Spark. Thanks Shahab On Tuesday, March 10, 2015, Cheng, Hao hao.ch...@intel.com wrote: Currently, Spark SQL doesn't provide interface for developing the custom UDTF, but it can work seamless with Hive UDTF. I am working on the UDTF refactoring for Spark SQL, hopefully will provide an Hive independent UDTF soon after that. *From:* shahab [mailto:shahab.mok...@gmail.com] *Sent:* Tuesday, March 10, 2015 5:44 PM *To:* user@spark.apache.org *Subject:* Registering custom UDAFs with HiveConetxt in SparkSQL, how? Hi, I need o develop couple of UDAFs and use them in the SparkSQL. While UDFs can be registered as a function in HiveContext, I could not find any documentation of how UDAFs can be registered in the HiveContext?? so far what I have found is to make a JAR file, out of developed UDAF class, and then deploy the JAR file to SparkSQL . But is there any way to avoid deploying the jar file and register it programmatically? best, /Shahab
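The same two calls written out with explicit string quoting (function, class and table names are placeholders):

sqlContext.sql("create temporary function foobar as 'com.myco.FoobarUDAF'")
sqlContext.sql("select foobar(some_column) from some_table")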
Re: Question about Data Sources API
On Tue, Mar 24, 2015 at 12:57 AM, Ashish Mukherjee ashish.mukher...@gmail.com wrote: 1. Is the Data Source API stable as of Spark 1.3.0? It is marked DeveloperApi, but in general we do not plan to change even these APIs unless there is a very compelling reason to. 2. The Data Source API seems to be available only in Scala. Is there any plan to make it available for Java too? We tried to make all the suggested interfaces (other than CatalystScan which exposes internals and is only for experimentation) usable from Java. Is there something in particular you are having trouble with? 3. Are only filters and projections pushed down to the data source and all the data pulled into Spark for other processing? For now, this is all that is provided by the public stable API. We left a hook for more powerful push downs (sqlContext.experimental.extraStrategies), and would be interested in feedback on other operations we should push down as we expand the API.
Re: java.lang.OutOfMemoryError: unable to create new native thread
Additional notes: I did not find anything wrong with the number of threads (ps -u USER -L | wc -l): around 780 on the master and 400 on executors. I am running on 100 r3.2xlarge. On Tue, Mar 24, 2015 at 12:38 PM, Thomas Gerber thomas.ger...@radius.com wrote: Hello, I am seeing various crashes in spark on large jobs which all share a similar exception: java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:714) I increased nproc (i.e. ulimit -u) 10 fold, but it doesn't help. Does anyone know how to avoid those kinds of errors? Noteworthy: I added -XX:ThreadStackSize=10m on both driver and executor extra java options, which might have amplified the problem. Thanks for you help, Thomas
Re: Hadoop 2.5 not listed in Spark 1.4 build page
Hadoop 2.5 would be referenced as via -Dhadoop-2.5 using the profile -Phadoop-2.4 Please note earlier in the link the section: # Apache Hadoop 2.4.X or 2.5.X mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package Versions of Hadoop after 2.5.X may or may not work with the -Phadoop-2.4 profile (they were released after this version of Spark). HTH! On Tue, Mar 24, 2015 at 10:28 AM Manoj Samel manojsamelt...@gmail.com wrote: http://spark.apache.org/docs/latest/building-spark.html#packaging-without-hadoop-dependencies-for-yarn does not list hadoop 2.5 in the Hadoop version table etc. I assume it is still OK to compile with -Pyarn -Phadoop-2.5 for use with Hadoop 2.5 (cdh 5.3.2) Thanks,
Re: Hadoop 2.5 not listed in Spark 1.4 build page
The right invocation is still a bit different: ... -Phadoop-2.4 -Dhadoop.version=2.5.0 hadoop-2.4 == Hadoop 2.4+ On Tue, Mar 24, 2015 at 5:44 PM, Denny Lee denny.g@gmail.com wrote: Hadoop 2.5 would be referenced as via -Dhadoop-2.5 using the profile -Phadoop-2.4 Please note earlier in the link the section: # Apache Hadoop 2.4.X or 2.5.X mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package Versions of Hadoop after 2.5.X may or may not work with the -Phadoop-2.4 profile (they were released after this version of Spark). HTH! On Tue, Mar 24, 2015 at 10:28 AM Manoj Samel manojsamelt...@gmail.com wrote: http://spark.apache.org/docs/latest/building-spark.html#packaging-without-hadoop-dependencies-for-yarn does not list hadoop 2.5 in the Hadoop version table etc. I assume it is still OK to compile with -Pyarn -Phadoop-2.5 for use with Hadoop 2.5 (cdh 5.3.2) Thanks,
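Putting the correction together with the original question, a full build invocation for a CDH 5.3.2 cluster would look roughly like the line below; the -cdh suffix on the version string is an assumption about the CDH artifact naming, not something stated in this thread:

mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.3.2 -DskipTests clean package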
Re: Does HiveContext connect to HiveServer2?
Another question related to this: how can we propagate the hive-site.xml to all workers when running in yarn-cluster mode? On Tue, Mar 24, 2015 at 10:09 AM, Marcelo Vanzin van...@cloudera.com wrote: It does neither. If you provide a Hive configuration to Spark, HiveContext will connect to your metastore server, otherwise it will create its own metastore in the working directory (IIRC). On Tue, Mar 24, 2015 at 8:58 AM, nitinkak001 nitinkak...@gmail.com wrote: I am wondering if HiveContext connects to HiveServer2 or does it work through Hive CLI. The reason I am asking is because Cloudera has deprecated Hive CLI. If the connection is through HiveServer2, is there a way to specify user credentials? -- Marcelo
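On the hive-site.xml question: one commonly used approach, offered here as a suggestion to verify rather than something from this thread, is to ship the file with the application so the driver running inside the YARN cluster picks it up (the path is just an example):

spark-submit --master yarn-cluster --files /etc/hive/conf/hive-site.xml ...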
spark worker on mesos slave | possible networking config issue
is there some setting i am missing: this is my spark-env.sh export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so export SPARK_EXECUTOR_URI=http://100.125.5.93/sparkx.tgz export SPARK_LOCAL_IP=127.0.0.1 here is what i see on the slave node. less 20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56/stderr WARNING: Logging before InitGoogleLogging() is written to STDERR I0324 02:30:29.389225 27755 fetcher.cpp:76] Fetching URI ' http://100.125.5.93/sparkx.tgz' I0324 02:30:29.389361 27755 fetcher.cpp:126] Downloading ' http://100.125.5.93/sparkx.tgz' to '/tmp/mesos/slaves/20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56/sparkx.tgz' I0324 02:30:35.353446 27755 fetcher.cpp:64] Extracted resource '/tmp/mesos/slaves/20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56/sparkx.tgz' into '/tmp/mesos/slaves/20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56' Spark assembly has been built with Hive, including Datanucleus jars on classpath Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 15/03/24 02:30:37 INFO MesosExecutorBackend: Registered signal handlers for [TERM, HUP, INT] I0324 02:30:37.071077 27863 exec.cpp:132] Version: 0.21.1 I0324 02:30:37.080971 27885 exec.cpp:206] Executor registered on slave 20150226-160708-78932-5050-8971-S0 15/03/24 02:30:37 INFO MesosExecutorBackend: Registered with Mesos as executor ID 20150226-160708-78932-5050-8971-S0 with 1 cpus 15/03/24 02:30:37 INFO SecurityManager: Changing view acls to: ubuntu 15/03/24 02:30:37 INFO SecurityManager: Changing modify acls to: ubuntu 15/03/24 02:30:37 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ubuntu); users with modify permissions: Set(ubuntu) 15/03/24 02:30:37 INFO Slf4jLogger: Slf4jLogger started 15/03/24 02:30:37 INFO Remoting: Starting remoting 15/03/24 02:30:38 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor@mesos-si2:50542] 15/03/24 02:30:38 INFO Utils: Successfully started service 'sparkExecutor' on port 50542. 15/03/24 02:30:38 INFO AkkaUtils: Connecting to MapOutputTracker: akka.tcp://sparkDriver@localhost:51849/user/MapOutputTracker 15/03/24 02:30:38 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkDriver@localhost:51849]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. 
Reason: Connection refused: localhost/127.0.0.1:51849 akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://sparkDriver@localhost:51849/), Path(/user/MapOutputTracker)] at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65) at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59) at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72) at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58) at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74) at akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:110) at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73) at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:267) at akka.actor.EmptyLocalActorRef.specialHandle(ActorRef.scala:508) at akka.actor.DeadLetterActorRef.specialHandle(ActorRef.scala:541) at akka.actor.DeadLetterActorRef.$bang(ActorRef.scala:531) at akka.remote.RemoteActorRefProvider$RemoteDeadLetterActorRef.$bang(RemoteActorRefProvider.scala:87)
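A guess based on the log above: the driver is advertising itself as sparkDriver@localhost, which is what SPARK_LOCAL_IP=127.0.0.1 would produce, so the executors on the Mesos slaves try to connect back to their own loopback interface. If that is the cause, advertising a routable address should help; a sketch, with 100.125.5.93 standing in for whatever address the driver host is actually reachable on:

# spark-env.sh on the machine running the driver
export SPARK_LOCAL_IP=100.125.5.93
# or, per application: --conf spark.driver.host=100.125.5.93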
Spark SQL: Day of month from Timestamp
Hi guys. Basically, we had to define a UDF that extracts the day of month from a timestamp; is there a built-in function that we can use for it? -- RGRDZ Harut
Re: Spark SQL: Day of month from Timestamp
Hi You can use functions like year(date), month(date) Thanks Arush On Tue, Mar 24, 2015 at 12:46 PM, Harut Martirosyan harut.martiros...@gmail.com wrote: Hi guys. Basically, we had to define a UDF that extracts the day of month from a timestamp; is there a built-in function that we can use for it? -- RGRDZ Harut -- Arush Kharbanda || Technical Teamlead ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
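With a HiveContext, the Hive datetime UDFs are also available directly in SQL, which should cover the original ask; a sketch with placeholder table and column names (dayofmonth is the Hive name, day is an alias):

sqlContext.sql("SELECT dayofmonth(event_time) AS day_of_month, count(*) FROM events GROUP BY dayofmonth(event_time)")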
Question about Data Sources API
Hello, I have some questions related to the Data Sources API - 1. Is the Data Source API stable as of Spark 1.3.0? 2. The Data Source API seems to be available only in Scala. Is there any plan to make it available for Java too? 3. Are only filters and projections pushed down to the data source and all the data pulled into Spark for other processing? Regards, Ashish
Re: FAILED SelectChannelConnector@0.0.0.0:4040 java.net.BindException: Address already in use
Does your application actually fail? That message just means there's another application listening on that port. Spark should try to bind to a different one after that and keep going. On Tue, Mar 24, 2015 at 12:43 PM, Roy rp...@njit.edu wrote: I get the following message each time I run a spark job: 15/03/24 15:35:56 WARN AbstractLifeCycle: FAILED SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address already in use The full trace is here: http://pastebin.com/xSvRN01f How do I fix this? I am on CDH 5.3.1 thanks roy -- Marcelo
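If the warning itself is the concern, the collision can also be avoided by starting the UI on an explicitly chosen free port; a sketch (4041 is arbitrary):

spark-submit --conf spark.ui.port=4041 ...
# or in conf/spark-defaults.conf: spark.ui.port 4041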