Re: Invalid ContainerId ... Caused by: java.lang.NumberFormatException: For input string: e04
On 24 Mar 2015, at 02:10, Marcelo Vanzin van...@cloudera.com wrote: This happens most probably because the Spark 1.3 you have downloaded is built against an older version of the Hadoop libraries than those used by CDH, and those libraries cannot parse the container IDs generated by CDH. This sounds suspiciously like the change in YARN for HA (the epoch number) not being parsed by older versions of the YARN client libs. This is effectively a regression in the YARN code - it's creating container IDs that can't be easily parsed by old apps. It may be possible to fix that Spark-side by having Spark use its own parser for the YARN container/app environment variable.
Spark as a service
Hello, As of now, if I have to execute a Spark job, I need to create a jar and deploy it. If I need to run a dynamically formed SQL from a Web application, is there any way of using SparkSQL in this manner? Perhaps, through a Web Service or something similar. Regards, Ashish
Standalone Scheduler VS YARN Performance
What is the performance overhead caused by YARN, or what configurations are changed when the app is run through YARN? The following example: sqlContext.sql("SELECT dayStamp(date), count(distinct deviceId) AS c FROM full GROUP BY dayStamp(date) ORDER BY c DESC LIMIT 10").collect() runs in the shell when we use the standalone scheduler: ./spark-shell --master sparkmaster:7077 --executor-memory 20g --executor-cores 10 --driver-memory 10g --num-executors 8 but fails due to a lost executor when we run it through YARN: ./spark-shell --master yarn-client --executor-memory 20g --executor-cores 10 --driver-memory 10g --num-executors 8 There are no evident logs, just messages that executors are being lost, and connection-refused errors (apparently due to executor failures). The cluster is the same: 8 nodes, 64 GB RAM each. The data format is Parquet. -- RGRDZ Harut
Re: issue while creating spark context
That's probably the problem; the intended path is on HDFS but the configuration specifies a local path. See the exception message. On Tue, Mar 24, 2015 at 1:08 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Its in your local file system, not in hdfs. Thanks Best Regards On Tue, Mar 24, 2015 at 6:25 PM, Sachin Singh sachin.sha...@gmail.com wrote: hi, I can see required permission is granted for this directory as under, hadoop dfs -ls /user/spark DEPRECATED: Use of this script to execute hdfs command is deprecated. Instead use the hdfs command for it. Found 1 items drwxrwxrwt - spark spark 0 2015-03-20 01:04 /user/spark/applicationHistory regards Sachin
Re: issue while creating spark context
Thanks Sean. Could you please suggest in which file or configuration I need to set the proper path? Please elaborate; that would help. Thanks, Regards Sachin On Tue, Mar 24, 2015 at 7:15 PM, Sean Owen so...@cloudera.com wrote: That's probably the problem; the intended path is on HDFS but the configuration specifies a local path. See the exception message. On Tue, Mar 24, 2015 at 1:08 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Its in your local file system, not in hdfs. Thanks Best Regards On Tue, Mar 24, 2015 at 6:25 PM, Sachin Singh sachin.sha...@gmail.com wrote: hi, I can see required permission is granted for this directory as under, hadoop dfs -ls /user/spark DEPRECATED: Use of this script to execute hdfs command is deprecated. Instead use the hdfs command for it. Found 1 items drwxrwxrwt - spark spark 0 2015-03-20 01:04 /user/spark/applicationHistory regards Sachin
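The event log path in the exception is being resolved against the local file system (a file: URI), so the usual fix on CDH is to point Spark's event logging explicitly at the HDFS directory shown above. A minimal sketch of the relevant spark-defaults.conf entries, assuming the NameNode address is namenode:8020 (a placeholder, not taken from this thread):

spark.eventLog.enabled true
spark.eventLog.dir hdfs://namenode:8020/user/spark/applicationHistory

With a fully qualified hdfs:// URI the application log directory is created under the HDFS path that already carries the spark:spark ownership listed above, rather than under file:/user/spark.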
Re: Spark streaming alerting
Streaming _from_ Cassandra, CassandraInputDStream, is coming BTW: https://issues.apache.org/jira/browse/SPARK-6283 I am working on it now. Helena @helenaedelson On Mar 23, 2015, at 5:22 AM, Khanderao Kand Gmail khanderao.k...@gmail.com wrote: Akhil, you are right in your answer to what Mohit wrote. However, what Mohit seems to be alluding to but did not write clearly might be different. Mohit, you are wrong in saying that streaming generally works on HDFS and Cassandra. Streaming typically works with a streaming or queueing source like Kafka, Kinesis, Twitter, Flume, ZeroMQ, etc. (but can also read from HDFS and S3). However, the streaming context (the receiver within the streaming context) gets events/messages/records and forms a time-window-based batch (RDD), so there is a maximum gap of one window length between when the alert message was available to Spark and when the processing happens. I think this is what you meant. As per the Spark programming model, the RDD is the right way to deal with the data. If you are fine with a minimum delay of, say, a second (based on the minimum window that a DStream can support), then what Rohit gave is the right model. Khanderao On Mar 22, 2015, at 11:39 PM, Akhil Das ak...@sigmoidanalytics.com wrote: What do you mean you can't send it directly from spark workers? Here's a simple approach which you could do: val data = ssc.textFileStream("sigmoid/") val dist = data.filter(_.contains("ERROR")).foreachRDD(rdd => alert("Errors : " + rdd.count())) And the alert() function could be anything triggering an email or sending an SMS alert. Thanks Best Regards On Sun, Mar 22, 2015 at 1:52 AM, Mohit Anchlia mohitanch...@gmail.com wrote: Is there a module in spark streaming that lets you listen to the alerts/conditions as they happen in the streaming module? Generally spark streaming components will execute on large sets of clusters like HDFS or Cassandra, however when it comes to alerting you generally can't send it directly from the spark workers, which means you need a way to listen to the alerts.
Re: issue while creating spark context
Hi Akhil, thanks for your quick reply. Could you please elaborate, i.e. what kind of permission is required? Thanks in advance, Regards Sachin On Tue, Mar 24, 2015 at 5:29 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Its an IOException, just make sure you are having the correct permission over */user/spark* directory. Thanks Best Regards On Tue, Mar 24, 2015 at 5:21 PM, sachin Singh sachin.sha...@gmail.com wrote: hi all, all of a sudden I am getting the below error when I submit a spark job using master as yarn; it is not able to create the spark context (it was previously working fine). I am using CDH 5.3.1 and creating a javaHiveContext. spark-submit --jars ./analiticlibs/mysql-connector-java-5.1.17.jar,./analiticlibs/log4j-1.2.17.jar --master yarn --class myproject.com.java.jobs.Aggregationtask sparkjob-1.0.jar error message- java.io.IOException: Error in creating log directory: file:/user/spark/applicationHistory/application_1427194309307_0005
Optimal solution for getting the header from CSV with Spark
Hello! I would like to know what the optimal solution is for getting the header from a CSV file with Spark. My approach was: def getHeader(data: RDD[String]): String = { data.zipWithIndex().filter(_._2 == 0).map(x => x._1).take(1).mkString() } Thanks.
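Since the header is simply the first element of the RDD, first() avoids the zipWithIndex pass entirely. A minimal sketch, reusing the data: RDD[String] from the question (the filter step assumes no data line is identical to the header):

val header = data.first()                        // first line of the file
val rows = data.filter(line => line != header)   // everything except the header

first() only touches the first partition, so it is considerably cheaper than zipWithIndex().filter(...) for this purpose.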
EC2 Having script run at startup
Hello, in the context of SPARK-2394 Make it easier to read LZO-compressed files from EC2 clusters https://issues.apache.org/jira/browse/SPARK-2394 , I was wondering: Is there an easy way to make a user-provided script run at every machine in a cluster launched on EC2? Regards, Theodore
Re: issue while creating spark context
hi, I can see required permission is granted for this directory as under, hadoop dfs -ls /user/spark DEPRECATED: Use of this script to execute hdfs command is deprecated. Instead use the hdfs command for it. Found 1 items *drwxrwxrwt - spark spark 0 2015-03-20 01:04 /user/spark/applicationHistory* regards Sachin On Tue, Mar 24, 2015 at 6:13 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Write permission, as it is clearly saying: java.io.IOException:* Error in creating log directory:* file:*/user/spark/*applicationHistory/application_1427194309307_0005 Thanks Best Regards On Tue, Mar 24, 2015 at 6:08 PM, Sachin Singh sachin.sha...@gmail.com wrote: Hi Akhil, thanks for your quick reply. Could you please elaborate, i.e. what kind of permission is required? Thanks in advance, Regards Sachin On Tue, Mar 24, 2015 at 5:29 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Its an IOException, just make sure you are having the correct permission over */user/spark* directory. Thanks Best Regards On Tue, Mar 24, 2015 at 5:21 PM, sachin Singh sachin.sha...@gmail.com wrote: hi all, all of a sudden I am getting the below error when I submit a spark job using master as yarn; it is not able to create the spark context (it was previously working fine). I am using CDH 5.3.1 and creating a javaHiveContext. spark-submit --jars ./analiticlibs/mysql-connector-java-5.1.17.jar,./analiticlibs/log4j-1.2.17.jar --master yarn --class myproject.com.java.jobs.Aggregationtask sparkjob-1.0.jar error message- java.io.IOException: Error in creating log directory: file:/user/spark/applicationHistory/application_1427194309307_0005
Re: issue while creating spark context
Its in your local file system, not in hdfs. Thanks Best Regards On Tue, Mar 24, 2015 at 6:25 PM, Sachin Singh sachin.sha...@gmail.com wrote: hi, I can see required permission is granted for this directory as under, hadoop dfs -ls /user/spark DEPRECATED: Use of this script to execute hdfs command is deprecated. Instead use the hdfs command for it. Found 1 items *drwxrwxrwt - spark spark 0 2015-03-20 01:04 /user/spark/applicationHistory* regards Sachin On Tue, Mar 24, 2015 at 6:13 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Write permission, as it is clearly saying: java.io.IOException:* Error in creating log directory:* file:*/user/spark/*applicationHistory/application_1427194309307_0005 Thanks Best Regards On Tue, Mar 24, 2015 at 6:08 PM, Sachin Singh sachin.sha...@gmail.com wrote: Hi Akhil, thanks for your quick reply. Could you please elaborate, i.e. what kind of permission is required? Thanks in advance, Regards Sachin On Tue, Mar 24, 2015 at 5:29 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Its an IOException, just make sure you are having the correct permission over */user/spark* directory. Thanks Best Regards On Tue, Mar 24, 2015 at 5:21 PM, sachin Singh sachin.sha...@gmail.com wrote: hi all, all of a sudden I am getting the below error when I submit a spark job using master as yarn; it is not able to create the spark context (it was previously working fine). I am using CDH 5.3.1 and creating a javaHiveContext. spark-submit --jars ./analiticlibs/mysql-connector-java-5.1.17.jar,./analiticlibs/log4j-1.2.17.jar --master yarn --class myproject.com.java.jobs.Aggregationtask sparkjob-1.0.jar error message- java.io.IOException: Error in creating log directory: file:/user/spark/applicationHistory/application_1427194309307_0005
Re: issue while creating spark context
Write permission, as it is clearly saying: java.io.IOException:* Error in creating log directory:* file:*/user/spark/*applicationHistory/application_1427194309307_0005 Thanks Best Regards On Tue, Mar 24, 2015 at 6:08 PM, Sachin Singh sachin.sha...@gmail.com wrote: Hi Akhil, thanks for your quick reply. Could you please elaborate, i.e. what kind of permission is required? Thanks in advance, Regards Sachin On Tue, Mar 24, 2015 at 5:29 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Its an IOException, just make sure you are having the correct permission over */user/spark* directory. Thanks Best Regards On Tue, Mar 24, 2015 at 5:21 PM, sachin Singh sachin.sha...@gmail.com wrote: hi all, all of a sudden I am getting the below error when I submit a spark job using master as yarn; it is not able to create the spark context (it was previously working fine). I am using CDH 5.3.1 and creating a javaHiveContext. spark-submit --jars ./analiticlibs/mysql-connector-java-5.1.17.jar,./analiticlibs/log4j-1.2.17.jar --master yarn --class myproject.com.java.jobs.Aggregationtask sparkjob-1.0.jar error message- java.io.IOException: Error in creating log directory: file:/user/spark/applicationHistory/application_1427194309307_0005
issue while creating spark context
hi all, all of a sudden I am getting the below error when I submit a spark job using master as yarn; it is not able to create the spark context (it was previously working fine). I am using CDH 5.3.1 and creating a javaHiveContext. spark-submit --jars ./analiticlibs/mysql-connector-java-5.1.17.jar,./analiticlibs/log4j-1.2.17.jar --master yarn --class myproject.com.java.jobs.Aggregationtask sparkjob-1.0.jar error message- java.io.IOException: Error in creating log directory: file:/user/spark/applicationHistory/application_1427194309307_0005 at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:133) at org.apache.spark.util.FileLogger.start(FileLogger.scala:115) at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74) at org.apache.spark.SparkContext.init(SparkContext.scala:353) at org.apache.spark.api.java.JavaSparkContext.init(JavaSparkContext.scala:61) at myproject.com.java.core.SparkAnaliticEngine.getJavaSparkContext(SparkAnaliticEngine.java:77) at myproject.com.java.core.SparkAnaliticTable.evmyprojectate(SparkAnaliticTable.java:108) at myproject.com.java.core.SparkAnaliticEngine.evmyprojectateAnaliticTable(SparkAnaliticEngine.java:55) at myproject.com.java.core.SparkAnaliticEngine.evmyprojectateAnaliticTable(SparkAnaliticEngine.java:65) at myproject.com.java.jobs.CustomAggregationJob.main(CustomAggregationJob.java:184) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Re: diffrence in PCA of MLib vs H2o in R
Those implementations are computing an SVD of the input matrix directly, and while you generally need the columns to have mean 0, you can turn that off with the options you cite. I don't think this is possible in the MLlib implementation, since it is computing the principal components by computing eigenvectors of the covariance matrix. The means inherently don't matter either way in this computation. On Tue, Mar 24, 2015 at 6:13 AM, roni roni.epi...@gmail.com wrote: I am trying to compute PCA using computePrincipalComponents. I also computed PCA using h2o in R and R's prcomp. The answers I get from H2o and R's prcomp (non h2o) are the same when I set the options for H2o as standardized=FALSE and for R's prcomp as center=FALSE. How do I make sure that the settings for MLlib PCA are the same as I am using for H2o or prcomp. Thanks Roni
Re: Spark as a service
Hi Ashish, this might be what you're looking for: https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server Regards, Jeff 2015-03-24 11:28 GMT+01:00 Ashish Mukherjee ashish.mukher...@gmail.com: Hello, As of now, if I have to execute a Spark job, I need to create a jar and deploy it. If I need to run a dynamically formed SQL from a Web application, is there any way of using SparkSQL in this manner? Perhaps, through a Web Service or something similar. Regards, Ashish
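Once the Thrift JDBC/ODBC server is running, dynamically formed SQL can be sent to it from a web application over a plain JDBC connection. A minimal sketch, assuming the server runs on host sparkhost with the default port 10000 and that the Hive JDBC driver is on the classpath (host, table, and credentials are placeholders):

import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")   // register the HiveServer2-compatible driver
val conn = DriverManager.getConnection("jdbc:hive2://sparkhost:10000/default", "user", "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("SELECT count(*) FROM some_table")   // the dynamically built SQL goes here
while (rs.next()) {
  println(rs.getLong(1))
}
conn.close()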
Re: Question about Data Sources API
Hello Michael, Thanks for your quick reply. My question wrt Java/Scala was related to extending the classes to support new custom data sources, so was wondering if those could be written in Java, since our company is a Java shop. The additional push downs I am looking for are aggregations with grouping and sorting. Essentially, I am trying to evaluate if this API can give me much of what is possible with the Apache MetaModel project. Regards, Ashish On Tue, Mar 24, 2015 at 1:57 PM, Michael Armbrust mich...@databricks.com wrote: On Tue, Mar 24, 2015 at 12:57 AM, Ashish Mukherjee ashish.mukher...@gmail.com wrote: 1. Is the Data Source API stable as of Spark 1.3.0? It is marked DeveloperApi, but in general we do not plan to change even these APIs unless there is a very compelling reason to. 2. The Data Source API seems to be available only in Scala. Is there any plan to make it available for Java too? We tried to make all the suggested interfaces (other than CatalystScan which exposes internals and is only for experimentation) usable from Java. Is there something in particular you are having trouble with? 3. Are only filters and projections pushed down to the data source and all the data pulled into Spark for other processing? For now, this is all that is provided by the public stable API. We left a hook for more powerful push downs (sqlContext.experimental.extraStrategies), and would be interested in feedback on other operations we should push down as we expand the API.
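For reference, the interfaces in question live in org.apache.spark.sql.sources and are ordinary classes and traits, so they can be implemented from Java as well as Scala. A minimal Scala sketch of a read-only source (the class names and in-memory data are made up for illustration; only a full table scan is shown, while PrunedScan/PrunedFilteredScan add column pruning and filter push-down):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Looked up when the source is referenced via the USING clause / load() API.
class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation =
    new ExampleRelation(sqlContext)
}

class ExampleRelation(val sqlContext: SQLContext) extends BaseRelation with TableScan {
  override def schema: StructType =
    StructType(Seq(StructField("id", IntegerType), StructField("name", StringType)))

  // A full scan; a real implementation would read from the external system here.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(Row(1, "a"), Row(2, "b")))
}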
Re: Measure Bytes READ and Peak Memory Usage for Query
Yeah thanks, I can now see the memory usage. Please also verify whether bytes read == the combined size of all RDDs? Actually, all my RDDs are completely cached in memory, so the combined size of my RDDs = Mem used (verified from the Web UI). On Fri, Mar 20, 2015 at 12:07 PM, Akhil Das ak...@sigmoidanalytics.com wrote: You could do a cache and see the memory usage under the Storage tab in the driver UI (runs on port 4040) Thanks Best Regards On Fri, Mar 20, 2015 at 12:02 PM, anu anamika.guo...@gmail.com wrote: Hi All I would like to measure Bytes Read and Peak Memory Usage for a Spark SQL Query. Please clarify if Bytes Read = aggregate size of all RDDs ?? All my RDDs are in memory and 0B spill to disk. And I am clueless how to measure Peak Memory Usage.
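The cached sizes can also be read programmatically on the driver, which makes it easier to log them per query. A small sketch, assuming sc is the active SparkContext; note this reports the storage footprint of cached RDDs, which is not the same thing as the per-stage input "bytes read" metric shown in the UI:

val storage = sc.getRDDStorageInfo
storage.foreach(info =>
  println(info.name + ": " + info.memSize + " bytes in memory, " + info.diskSize + " bytes on disk"))
val totalCachedInMemory = storage.map(_.memSize).sum
println("Total cached in memory: " + totalCachedInMemory + " bytes")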
Re: issue while creating spark context
Its an IOException, just make sure you are having the correct permission over */user/spark* directory. Thanks Best Regards On Tue, Mar 24, 2015 at 5:21 PM, sachin Singh sachin.sha...@gmail.com wrote: hi all, all of a sudden I am getting the below error when I submit a spark job using master as yarn; it is not able to create the spark context (it was previously working fine). I am using CDH 5.3.1 and creating a javaHiveContext. spark-submit --jars ./analiticlibs/mysql-connector-java-5.1.17.jar,./analiticlibs/log4j-1.2.17.jar --master yarn --class myproject.com.java.jobs.Aggregationtask sparkjob-1.0.jar error message- java.io.IOException: Error in creating log directory: file:/user/spark/applicationHistory/application_1427194309307_0005
How to deploy binary dependencies to workers?
Hi, I am doing ML using Spark mllib. However, I do not have full control of the cluster; I am using Microsoft Azure HDInsight. I want to deploy the BLAS or whatever required dependencies to accelerate the computation. But I don't know how to deploy those DLLs when I submit my JAR to the cluster. I know how to pack those DLLs into a jar. The real challenge is how to let the system find them... Thanks, David
Re: Spark as a service
I don't think there's a general approach to that - the use cases are just too different. If you really need it, you will probably have to implement it yourself in the driver of your application. PS: Make sure to use the reply to all button so that the mailing list is included in your reply. Otherwise only I will get your mail. Regards, Jeff 2015-03-24 12:01 GMT+01:00 Ashish Mukherjee ashish.mukher...@gmail.com: Hi Jeffrey, Thanks. Yes, this resolves the SQL problem. My bad - I was looking for something which would work for Spark Streaming and other Spark jobs too, not just SQL. Regards, Ashish On Tue, Mar 24, 2015 at 4:07 PM, Jeffrey Jedele jeffrey.jed...@gmail.com wrote: Hi Ashish, this might be what you're looking for: https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server Regards, Jeff 2015-03-24 11:28 GMT+01:00 Ashish Mukherjee ashish.mukher...@gmail.com: Hello, As of now, if I have to execute a Spark job, I need to create a jar and deploy it. If I need to run a dynamically formed SQL from a Web application, is there any way of using SparkSQL in this manner? Perhaps, through a Web Service or something similar. Regards, Ashish
Re: Spark as a service
Perhaps this project, https://github.com/calrissian/spark-jetty-server, could help with your requirements. On Tue, Mar 24, 2015 at 7:12 AM, Jeffrey Jedele jeffrey.jed...@gmail.com wrote: I don't think there's a general approach to that - the use cases are just too different. If you really need it, you will probably have to implement it yourself in the driver of your application. PS: Make sure to use the reply to all button so that the mailing list is included in your reply. Otherwise only I will get your mail. Regards, Jeff 2015-03-24 12:01 GMT+01:00 Ashish Mukherjee ashish.mukher...@gmail.com: Hi Jeffrey, Thanks. Yes, this resolves the SQL problem. My bad - I was looking for something which would work for Spark Streaming and other Spark jobs too, not just SQL. Regards, Ashish On Tue, Mar 24, 2015 at 4:07 PM, Jeffrey Jedele jeffrey.jed...@gmail.com wrote: Hi Ashish, this might be what you're looking for: https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server Regards, Jeff 2015-03-24 11:28 GMT+01:00 Ashish Mukherjee ashish.mukher...@gmail.com: Hello, As of now, if I have to execute a Spark job, I need to create a jar and deploy it. If I need to run a dynamically formed SQL from a Web application, is there any way of using SparkSQL in this manner? Perhaps, through a Web Service or something similar. Regards, Ashish
1.3 Hadoop File System problem
I have code that works under 1.2.1 but when I upgraded to 1.3.0 it fails to find the s3 hadoop file system. I get java.lang.IllegalArgumentException: Wrong FS: s3://[path to my file], expected: file:/// when I try to save a parquet file. This worked in 1.2.1. Has anyone else seen this? I'm running spark using local[8] so it's all internal. These are actually unit tests in our app that are failing now. Thanks. Jim
Re: Invalid ContainerId ... Caused by: java.lang.NumberFormatException: For input string: e04
Thanks Marcelo - I was using the SBT built spark per earlier thread. I switched now to the distro (with the conf changes for CDH path in front) and guava issue is gone. Thanks, On Tue, Mar 24, 2015 at 1:50 PM, Marcelo Vanzin van...@cloudera.com wrote: Hi there, On Tue, Mar 24, 2015 at 1:40 PM, Manoj Samel manojsamelt...@gmail.com wrote: When I run any query, it gives java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode; Are you running a custom-compiled Spark by any chance? Specifically, one you built with sbt? That would hit this problem, because the path I suggested (/usr/lib/hadoop/client/*) contains an older guava library, which would override the one shipped with the sbt-built Spark. If you build Spark with maven, or use the pre-built Spark distro, or specifically filter out the guava jar from your classpath when setting up the Spark job, things should work. -- Marcelo
Re: diffrence in PCA of MLib vs H2o in R
Reza, That SVD.V matches the H2o and R prcomp (non-centered) output. Thanks -R On Tue, Mar 24, 2015 at 11:38 AM, Sean Owen so...@cloudera.com wrote: (Oh sorry, I've only been thinking of TallSkinnySVD) On Tue, Mar 24, 2015 at 6:36 PM, Reza Zadeh r...@databricks.com wrote: If you want to do a nonstandard (or uncentered) PCA, you can call computeSVD on RowMatrix, and look at the resulting 'V' Matrix. That should match the output of the other two systems. Reza On Tue, Mar 24, 2015 at 3:53 AM, Sean Owen so...@cloudera.com wrote: Those implementations are computing an SVD of the input matrix directly, and while you generally need the columns to have mean 0, you can turn that off with the options you cite. I don't think this is possible in the MLlib implementation, since it is computing the principal components by computing eigenvectors of the covariance matrix. The means inherently don't matter either way in this computation. On Tue, Mar 24, 2015 at 6:13 AM, roni roni.epi...@gmail.com wrote: I am trying to compute PCA using computePrincipalComponents. I also computed PCA using h2o in R and R's prcomp. The answers I get from H2o and R's prcomp (non h2o) are the same when I set the options for H2o as standardized=FALSE and for R's prcomp as center=FALSE. How do I make sure that the settings for MLlib PCA are the same as I am using for H2o or prcomp. Thanks Roni
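A minimal sketch of the uncentered variant Reza describes, assuming the data is already an RDD[Vector] called rows and that k components are wanted (both names are placeholders):

import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat = new RowMatrix(rows)                   // rows: RDD[org.apache.spark.mllib.linalg.Vector]
val k = 10
val svd = mat.computeSVD(k, computeU = false)   // SVD of the raw, uncentered matrix
val directions = svd.V                          // columns of V: uncentered principal directions
// For comparison, computePrincipalComponents(k) works from the covariance matrix,
// which implicitly centers the columns.
val pcs = mat.computePrincipalComponents(k)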
Re: java.lang.OutOfMemoryError: unable to create new native thread
My memory is hazy on this but aren't there hidden limitations to Linux-based threads? I ran into some issues a couple of years ago where, and here is the fuzzy part, the kernel wants to reserve virtual memory per thread equal to the stack size. When the total amount of reserved memory (not necessarily resident memory) exceeds the memory of the system it throws an OOM. I'm looking for material to back this up. Sorry for the initial vague response. Matthew On Tue, Mar 24, 2015 at 12:53 PM, Thomas Gerber thomas.ger...@radius.com wrote: Additional notes: I did not find anything wrong with the number of threads (ps -u USER -L | wc -l): around 780 on the master and 400 on executors. I am running on 100 r3.2xlarge. On Tue, Mar 24, 2015 at 12:38 PM, Thomas Gerber thomas.ger...@radius.com wrote: Hello, I am seeing various crashes in spark on large jobs which all share a similar exception: java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:714) I increased nproc (i.e. ulimit -u) 10 fold, but it doesn't help. Does anyone know how to avoid those kinds of errors? Noteworthy: I added -XX:ThreadStackSize=10m on both driver and executor extra java options, which might have amplified the problem. Thanks for you help, Thomas
Re: SparkSQL UDTs with Ordering
Awesome. yep - I have seen the warnings on UDTs, happy to keep up with the API changes :). Would this be a reasonable PR to toss up despite the API unstableness or would you prefer it to wait? Thanks -Pat On Tue, Mar 24, 2015 at 7:44 PM, Michael Armbrust mich...@databricks.com wrote: I'll caution that the UDTs are not a stable public interface yet. We'd like to do this someday, but currently this feature is mostly for MLlib as we have not finalized the API. Having an ordering could be useful, but I'll add that currently UDTs actually exist in serialized form so the ordering would have to be on the internal form, not the user visible form. On Tue, Mar 24, 2015 at 12:25 PM, Patrick Woody patrick.woo...@gmail.com wrote: Hey all, Currently looking into UDTs and I was wondering if it is reasonable to add the ability to define an Ordering (or if this is possible, then how)? Currently it will throw an error when non-Native types are used. Thanks! -Pat
Re: diffrence in PCA of MLib vs H2o in R
Great! On Tue, Mar 24, 2015 at 2:53 PM, roni roni.epi...@gmail.com wrote: Reza, That SVD.V matches the H2o and R prcomp (non-centered) output. Thanks -R On Tue, Mar 24, 2015 at 11:38 AM, Sean Owen so...@cloudera.com wrote: (Oh sorry, I've only been thinking of TallSkinnySVD) On Tue, Mar 24, 2015 at 6:36 PM, Reza Zadeh r...@databricks.com wrote: If you want to do a nonstandard (or uncentered) PCA, you can call computeSVD on RowMatrix, and look at the resulting 'V' Matrix. That should match the output of the other two systems. Reza On Tue, Mar 24, 2015 at 3:53 AM, Sean Owen so...@cloudera.com wrote: Those implementations are computing an SVD of the input matrix directly, and while you generally need the columns to have mean 0, you can turn that off with the options you cite. I don't think this is possible in the MLlib implementation, since it is computing the principal components by computing eigenvectors of the covariance matrix. The means inherently don't matter either way in this computation. On Tue, Mar 24, 2015 at 6:13 AM, roni roni.epi...@gmail.com wrote: I am trying to compute PCA using computePrincipalComponents. I also computed PCA using h2o in R and R's prcomp. The answers I get from H2o and R's prcomp (non h2o) are the same when I set the options for H2o as standardized=FALSE and for R's prcomp as center=FALSE. How do I make sure that the settings for MLlib PCA are the same as I am using for H2o or prcomp. Thanks Roni
Re: iPython Notebook + Spark + Accumulo -- best practice?
hi all, got a vagrant image with spark notebook, spark, accumulo, and hadoop all running. from the notebook I can manually create a scanner and pull test data from a table I created using one of the accumulo examples: val instanceNameS = "accumulo" val zooServersS = "localhost:2181" val instance: Instance = new ZooKeeperInstance(instanceNameS, zooServersS) val connector: Connector = instance.getConnector("root", new PasswordToken("password")) val auths = new Authorizations("exampleVis") val scanner = connector.createScanner("batchtest1", auths) scanner.setRange(new Range("row_00", "row_10")) for (entry: Entry[Key, Value] <- scanner) { println(entry.getKey + " is " + entry.getValue) } will give the first ten rows of table data. when I try to create the RDD thusly: val rdd2 = sparkContext.newAPIHadoopRDD( new Configuration(), classOf[org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat], classOf[org.apache.accumulo.core.data.Key], classOf[org.apache.accumulo.core.data.Value] ) I get an RDD returned to me that I can't do much with due to the following error: java.io.IOException: Input info has not been set. at org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.validateOptions(InputConfigurator.java:630) at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.validateOptions(AbstractInputFormat.java:343) at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.getSplits(AbstractInputFormat.java:538) at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:98) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:220) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1367) at org.apache.spark.rdd.RDD.count(RDD.scala:927) which totally makes sense in light of the fact that I haven't specified any parameters as to which table to connect with, what the auths are, etc. so my question is: what do I need to do from here to get those first ten rows of table data into my RDD? DAVID HOLIDAY Software Engineer 760 607 3300 | Office 312 758 8385 | Mobile dav...@annaisystems.com www.AnnaiSystems.com On Mar 19, 2015, at 11:25 AM, David Holiday dav...@annaisystems.com wrote: kk - I'll put something together and get back to you with more :-) DAVID HOLIDAY Software Engineer 760 607 3300 | Office 312 758 8385 | Mobile dav...@annaisystems.com www.AnnaiSystems.com On Mar 19, 2015, at 10:59 AM, Irfan Ahmad ir...@cloudphysics.com wrote: Once you setup spark-notebook, it'll handle the submits for interactive work. Non-interactive is not handled by it. For that spark-kernel could be used. Give it a shot ... it only takes 5 minutes to get it running in local mode. Irfan Ahmad CTO | Co-Founder | CloudPhysics http://www.cloudphysics.com/ Best of VMworld Finalist Best Cloud Management Award NetworkWorld 10 Startups to Watch EMA Most Notable Vendor On Thu, Mar 19, 2015 at 9:51 AM, David Holiday dav...@annaisystems.com wrote: hi all - thx for the alacritous replies! so regarding how to get things from notebook to spark and back, am I correct that spark-submit is the way to go?
DAVID HOLIDAY Software Engineer 760 607 3300 | Office 312 758 8385 | Mobile dav...@annaisystems.com www.AnnaiSystems.com On Mar 19, 2015, at 1:14 AM, Paolo Platter paolo.plat...@agilelab.it wrote: Yes, I would suggest spark-notebook too. It's very simple to set up and it's growing pretty fast. Paolo Sent from my Windows Phone From: Irfan Ahmad ir...@cloudphysics.com Sent: 19/03/2015 04:05 To: davidh dav...@annaisystems.com Cc: user@spark.apache.org Subject: Re: iPython Notebook + Spark + Accumulo -- best practice? I forgot to mention that there is also Zeppelin and jove-notebook but I haven't got any experience with those yet. Irfan Ahmad CTO | Co-Founder | CloudPhysics http://www.cloudphysics.com/ Best of VMworld Finalist Best Cloud Management Award NetworkWorld 10 Startups to Watch EMA Most Notable Vendor On Wed, Mar 18, 2015 at 8:01 PM, Irfan Ahmad ir...@cloudphysics.com wrote: Hi David, W00t indeed and great questions. On the notebook front, there are two options depending on what you are looking for. You can either go with iPython 3 with Spark-kernel as a backend or you can use spark-notebook. Both have
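The "Input info has not been set" error is thrown because AccumuloInputFormat reads its connection and table settings from the Hadoop configuration, and none were set before calling newAPIHadoopRDD. A rough sketch of the missing configuration, reusing the instance/table names from the scanner example above and assuming Accumulo 1.6-style static setters (treat the exact method names as an assumption to verify against the Accumulo version in use):

import org.apache.accumulo.core.client.ClientConfiguration
import org.apache.accumulo.core.client.mapreduce.{AbstractInputFormat, AccumuloInputFormat, InputFormatBase}
import org.apache.accumulo.core.client.security.tokens.PasswordToken
import org.apache.accumulo.core.data.{Key, Value}
import org.apache.accumulo.core.security.Authorizations
import org.apache.hadoop.mapreduce.Job

val job = Job.getInstance()  // wraps the Configuration handed to newAPIHadoopRDD
// The setters live on the input format's parent classes; calling them there
// sidesteps Scala's trouble with inherited Java statics.
AbstractInputFormat.setZooKeeperInstance(job,
  new ClientConfiguration().withInstance(instanceNameS).withZkHosts(zooServersS))
AbstractInputFormat.setConnectorInfo(job, "root", new PasswordToken("password"))
AbstractInputFormat.setScanAuthorizations(job, auths)
InputFormatBase.setInputTableName(job, "batchtest1")

val rdd2 = sparkContext.newAPIHadoopRDD(
  job.getConfiguration,
  classOf[AccumuloInputFormat],
  classOf[Key],
  classOf[Value])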
Re: Spark as a service
Also look at the spark-kernel and spark job server projects. Irfan On Mar 24, 2015 5:03 AM, Todd Nist tsind...@gmail.com wrote: Perhaps this project, https://github.com/calrissian/spark-jetty-server, could help with your requirements. On Tue, Mar 24, 2015 at 7:12 AM, Jeffrey Jedele jeffrey.jed...@gmail.com wrote: I don't think there's a general approach to that - the use cases are just too different. If you really need it, you will probably have to implement it yourself in the driver of your application. PS: Make sure to use the reply to all button so that the mailing list is included in your reply. Otherwise only I will get your mail. Regards, Jeff 2015-03-24 12:01 GMT+01:00 Ashish Mukherjee ashish.mukher...@gmail.com: Hi Jeffrey, Thanks. Yes, this resolves the SQL problem. My bad - I was looking for something which would work for Spark Streaming and other Spark jobs too, not just SQL. Regards, Ashish On Tue, Mar 24, 2015 at 4:07 PM, Jeffrey Jedele jeffrey.jed...@gmail.com wrote: Hi Ashish, this might be what you're looking for: https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server Regards, Jeff 2015-03-24 11:28 GMT+01:00 Ashish Mukherjee ashish.mukher...@gmail.com: Hello, As of now, if I have to execute a Spark job, I need to create a jar and deploy it. If I need to run a dynamically formed SQL from a Web application, is there any way of using SparkSQL in this manner? Perhaps, through a Web Service or something similar. Regards, Ashish
Re: How to avoid being killed by YARN node manager ?
Hi Yuichiro, The way to avoid this is to boost spark.yarn.executor.memoryOverhead until the executors have enough off-heap memory to avoid going over their limits. -Sandy On Tue, Mar 24, 2015 at 11:49 AM, Yuichiro Sakamoto ks...@muc.biglobe.ne.jp wrote: Hello. We use ALS (collaborative filtering) from Spark MLlib on YARN. The Spark version is 1.2.0, included in CDH 5.3.1. 1,000,000,000 records (5,000,000 users' data and 5,000,000 items' data) are used for machine learning with ALS. These large quantities of data increase virtual memory usage, and the YARN node manager kills the Spark worker process. Even though Spark runs again after the process is killed, the Spark worker process is killed again; as a result, the whole Spark application is terminated. # When the Spark worker process is killed, it seems that the virtual memory usage, increased by # shuffle or disk writing, goes over the YARN threshold. To avoid this, we set 'yarn.nodemanager.vmem-check-enabled' to false, and the job then exits successfully, but that does not seem to be an appropriate way. If you know, please let me know about tuning methods for Spark. The conditions of the machines and the Spark settings are as follows. 1) six machines, physical memory is 32GB on each machine. 2) Spark settings - spark.executor.memory=16g - spark.closure.serializer=org.apache.spark.serializer.KryoSerializer - spark.rdd.compress=true - spark.shuffle.memoryFraction=0.4 Thanks, Yuichiro Sakamoto
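For reference, spark.yarn.executor.memoryOverhead (a value in MB) is the off-heap headroom YARN adds on top of spark.executor.memory when sizing each executor container. A sketch of how it might be raised for this job, either in spark-defaults.conf or via --conf on the command line; the 2048 MB figure is only an illustrative starting point, not a value from the thread:

spark.executor.memory 16g
spark.yarn.executor.memoryOverhead 2048

With these settings each executor container requests roughly 18 GB, so yarn.nodemanager.resource.memory-mb and yarn.scheduler.maximum-allocation-mb need to be at least that large on the 32 GB nodes.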
Re: spark disk-to-disk
imran, great, i will take a look at the pullreq. seems we are interested in similar things On Tue, Mar 24, 2015 at 11:00 AM, Imran Rashid iras...@cloudera.com wrote: I think writing to hdfs and reading it back again is totally reasonable. In fact, in my experience, writing to hdfs and reading back in actually gives you a good opportunity to handle some other issues as well: a) instead of just writing as an object file, I've found its helpful to write in a format that is a little more readable. Json if efficiency doesn't matter :) or you could use something like avro, which at least has a good set of command line tools. b) when developing, I hate it when I introduce a bug in step 12 of a long pipeline, and need to re-run the whole thing. If you save to disk, you can write a little application logic that realizes step 11 is already sitting on disk, and just restart from there. c) writing to disk is also a good opportunity to do a little crude auto-tuning of the number of partitions. You can look at the size of each partition on hdfs, and then adjust the number of partitions. And I completely agree that losing the partitioning info is a major limitation -- I submitted a PR to help deal w/ it: https://github.com/apache/spark/pull/4449 getting narrow dependencies w/ partitioners can lead to pretty big performance improvements, so I do think its important to make it easily accessible to the user. Though now I'm thinking that maybe this api is a little clunky, and this should get rolled into the other changes you are proposing to hadoop RDD friends -- but I'll go into more discussion on that thread. On Mon, Mar 23, 2015 at 12:55 PM, Koert Kuipers ko...@tresata.com wrote: there is a way to reinstate the partitioner, but that requires sc.objectFile to read exactly what i wrote, which means sc.objectFile should never split files on reading (a feature of hadoop file inputformat that gets in the way here). On Mon, Mar 23, 2015 at 1:39 PM, Koert Kuipers ko...@tresata.com wrote: i just realized the major limitation is that i lose partitioning info... On Mon, Mar 23, 2015 at 1:34 AM, Reynold Xin r...@databricks.com wrote: On Sun, Mar 22, 2015 at 6:03 PM, Koert Kuipers ko...@tresata.com wrote: so finally i can resort to: rdd.saveAsObjectFile(...) sc.objectFile(...) but that seems like a rather broken abstraction. This seems like a fine solution to me.
Re: How to deploy binary dependencies to workers?
Both spark-submit and spark-shell have a --jars option for passing additional jars to the cluster. They will be added to the appropriate classpaths. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Tue, Mar 24, 2015 at 4:13 AM, Xi Shen davidshe...@gmail.com wrote: Hi, I am doing ML using Spark mllib. However, I do not have full control to the cluster. I am using Microsoft Azure HDInsight I want to deploy the BLAS or whatever required dependencies to accelerate the computation. But I don't know how to deploy those DLLs when I submit my JAR to the cluster. I know how to pack those DLLs into a jar. The real challenge is how to let the system find them... Thanks, David
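--jars covers JVM dependencies; native libraries (.dll/.so) additionally need to be visible on the executor JVM's java.library.path. One possible approach, sketched with placeholder file and class names: ship the native files with --files so YARN localizes them into each executor's working directory, and point the library path at that directory:

./bin/spark-submit --class com.example.MyJob --master yarn-client \
  --files netlib-native_system-win-x86_64.dll \
  --conf "spark.executor.extraJavaOptions=-Djava.library.path=." \
  myjob.jar

Whether the BLAS layer actually picks the library up also depends on how it loads natives (netlib-java, for example, can extract its own bundled natives from a jar), so treat this as a starting point rather than a guaranteed recipe.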
How to avoid being killed by YARN node manager ?
Hello. We use ALS (collaborative filtering) from Spark MLlib on YARN. The Spark version is 1.2.0, included in CDH 5.3.1. 1,000,000,000 records (5,000,000 users' data and 5,000,000 items' data) are used for machine learning with ALS. These large quantities of data increase virtual memory usage, and the YARN node manager kills the Spark worker process. Even though Spark runs again after the process is killed, the Spark worker process is killed again; as a result, the whole Spark application is terminated. # When the Spark worker process is killed, it seems that the virtual memory usage, increased by # shuffle or disk writing, goes over the YARN threshold. To avoid this, we set 'yarn.nodemanager.vmem-check-enabled' to false, and the job then exits successfully, but that does not seem to be an appropriate way. If you know, please let me know about tuning methods for Spark. The conditions of the machines and the Spark settings are as follows. 1) six machines, physical memory is 32GB on each machine. 2) Spark settings - spark.executor.memory=16g - spark.closure.serializer=org.apache.spark.serializer.KryoSerializer - spark.rdd.compress=true - spark.shuffle.memoryFraction=0.4 Thanks, Yuichiro Sakamoto
Re: Why doesn't the --conf parameter work in yarn-cluster mode (but works in yarn-client and local)?
Hello Sandy, Thank you for your explanation. Then I would at least expect that to be consistent across local, yarn-client, and yarn-cluster modes. (And not lead to the case where it somehow works in two of them, and not for the third). Kind regards, Emre Sevinç http://www.bigindustries.be/ On Tue, Mar 24, 2015 at 4:38 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Ah, yes, I believe this is because only properties prefixed with spark get passed on. The purpose of the --conf option is to allow passing Spark properties to the SparkConf, not to add general key-value pairs to the JVM system properties. -Sandy On Tue, Mar 24, 2015 at 4:25 AM, Emre Sevinc emre.sev...@gmail.com wrote: Hello Sandy, Your suggestion does not work when I try it locally: When I pass --conf key=someValue and then try to retrieve it like: SparkConf sparkConf = new SparkConf(); logger.info(* * * key ~~~ {}, sparkConf.get(key)); I get Exception in thread main java.util.NoSuchElementException: key And I think that's expected because the key is an arbitrary one, not necessarily a Spark configuration element. This is why I was passing it via --conf and retrieving System.getProperty(key) (which worked locally and in yarn-client mode but not in yarn-cluster mode). I'm surprised why I can't use it on the cluster while I can use it while local development and testing. Kind regards, Emre Sevinç http://www.bigindustries.be/ On Mon, Mar 23, 2015 at 6:15 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Emre, The --conf property is meant to work with yarn-cluster mode. System.getProperty(key) isn't guaranteed, but new SparkConf().get(key) should. Does it not? -Sandy On Mon, Mar 23, 2015 at 8:39 AM, Emre Sevinc emre.sev...@gmail.com wrote: Hello, According to Spark Documentation at https://spark.apache.org/docs/1.2.1/submitting-applications.html : --conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces wrap “key=value” in quotes (as shown). And indeed, when I use that parameter, in my Spark program I can retrieve the value of the key by using: System.getProperty(key); This works when I test my program locally, and also in yarn-client mode, I can log the value of the key and see that it matches what I wrote in the command line, but it returns *null* when I submit the very same program in *yarn-cluster* mode. Why can't I retrieve the value of key given as --conf key=value when I submit my Spark application in *yarn-cluster* mode? Any ideas and/or workarounds? -- Emre Sevinç http://www.bigindustries.be/ -- Emre Sevinc -- Emre Sevinc
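A small sketch of the pattern Sandy describes: use a spark.-prefixed key so the value is shipped to the driver/AM in every deploy mode, and read it back from SparkConf rather than from JVM system properties (the key name spark.myapp.greeting is a made-up example):

spark-submit --master yarn-cluster --conf spark.myapp.greeting=hello ...

and inside the application:

val conf = new SparkConf()
val greeting = conf.get("spark.myapp.greeting", "default")   // present in local, yarn-client and yarn-cluster modes

Arbitrary non-spark.* keys are only visible as system properties where the launcher JVM happens to be the driver JVM (local and yarn-client), which is presumably why System.getProperty appeared to work in those two modes but not in yarn-cluster.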
Re: spark disk-to-disk
I think writing to hdfs and reading it back again is totally reasonable. In fact, in my experience, writing to hdfs and reading back in actually gives you a good opportunity to handle some other issues as well: a) instead of just writing as an object file, I've found its helpful to write in a format that is a little more readable. Json if efficiency doesn't matter :) or you could use something like avro, which at least has a good set of command line tools. b) when developing, I hate it when I introduce a bug in step 12 of a long pipeline, and need to re-run the whole thing. If you save to disk, you can write a little application logic that realizes step 11 is already sitting on disk, and just restart from there. c) writing to disk is also a good opportunity to do a little crude auto-tuning of the number of partitions. You can look at the size of each partition on hdfs, and then adjust the number of partitions. And I completely agree that losing the partitioning info is a major limitation -- I submitted a PR to help deal w/ it: https://github.com/apache/spark/pull/4449 getting narrow dependencies w/ partitioners can lead to pretty big performance improvements, so I do think its important to make it easily accessible to the user. Though now I'm thinking that maybe this api is a little clunky, and this should get rolled into the other changes you are proposing to hadoop RDD friends -- but I'll go into more discussion on that thread. On Mon, Mar 23, 2015 at 12:55 PM, Koert Kuipers ko...@tresata.com wrote: there is a way to reinstate the partitioner, but that requires sc.objectFile to read exactly what i wrote, which means sc.objectFile should never split files on reading (a feature of hadoop file inputformat that gets in the way here). On Mon, Mar 23, 2015 at 1:39 PM, Koert Kuipers ko...@tresata.com wrote: i just realized the major limitation is that i lose partitioning info... On Mon, Mar 23, 2015 at 1:34 AM, Reynold Xin r...@databricks.com wrote: On Sun, Mar 22, 2015 at 6:03 PM, Koert Kuipers ko...@tresata.com wrote: so finally i can resort to: rdd.saveAsObjectFile(...) sc.objectFile(...) but that seems like a rather broken abstraction. This seems like a fine solution to me.
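For completeness, the round trip being discussed looks roughly like this; the path and element type are placeholders, and note that sc.objectFile may re-split the files on read, which is why the original RDD's partitioner is lost:

rdd.saveAsObjectFile("hdfs:///tmp/stage11-output")
val restored = sc.objectFile[(String, Int)]("hdfs:///tmp/stage11-output", rdd.partitions.size)
// Any partitioner the original rdd had has to be re-applied, e.g. restored.partitionBy(new HashPartitioner(n))

saveAsObjectFile writes Java-serialized SequenceFiles, so as noted above a more readable/portable format (JSON, Avro) is often a better choice for intermediate results.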
Re: Spark streaming alerting
Helena, The CassandraInputDStream sounds interesting. I don't find much in the JIRA though. Do you have more details on what it tries to achieve? Thanks, Anwar. On Tue, Mar 24, 2015 at 2:39 PM, Helena Edelson helena.edel...@datastax.com wrote: Streaming _from_ Cassandra, CassandraInputDStream, is coming BTW: https://issues.apache.org/jira/browse/SPARK-6283 I am working on it now. Helena @helenaedelson On Mar 23, 2015, at 5:22 AM, Khanderao Kand Gmail khanderao.k...@gmail.com wrote: Akhil, you are right in your answer to what Mohit wrote. However, what Mohit seems to be alluding to but did not write clearly might be different. Mohit, you are wrong in saying that streaming generally works on HDFS and Cassandra. Streaming typically works with a streaming or queueing source like Kafka, Kinesis, Twitter, Flume, ZeroMQ, etc. (but can also read from HDFS and S3). However, the streaming context (the receiver within the streaming context) gets events/messages/records and forms a time-window-based batch (RDD), so there is a maximum gap of one window length between when the alert message was available to Spark and when the processing happens. I think this is what you meant. As per the Spark programming model, the RDD is the right way to deal with the data. If you are fine with a minimum delay of, say, a second (based on the minimum window that a DStream can support), then what Rohit gave is the right model. Khanderao On Mar 22, 2015, at 11:39 PM, Akhil Das ak...@sigmoidanalytics.com wrote: What do you mean you can't send it directly from spark workers? Here's a simple approach which you could do: val data = ssc.textFileStream("sigmoid/") val dist = data.filter(_.contains("ERROR")).foreachRDD(rdd => alert("Errors : " + rdd.count())) And the alert() function could be anything triggering an email or sending an SMS alert. Thanks Best Regards On Sun, Mar 22, 2015 at 1:52 AM, Mohit Anchlia mohitanch...@gmail.com wrote: Is there a module in spark streaming that lets you listen to the alerts/conditions as they happen in the streaming module? Generally spark streaming components will execute on large sets of clusters like HDFS or Cassandra, however when it comes to alerting you generally can't send it directly from the spark workers, which means you need a way to listen to the alerts.
Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class
I found the problem. In mapped-site.xml, mapreduce.application.classpath has references to “${hdp.version}” which is not getting replaced when launch_container.sh is created. The executor fails with a substitution error at line 27 in launch_container.sh because bash can’t deal with “${hdp.version}. I have hdp.version defined in my spark-defaults.conf via spark.{driver,yarn.am}.extraJavaOptions -Dhdp.version=2.2.0-2041, so something is not doing the substitution. To work around this problem, I replaced ${hdp.version}” with “current” in mapred-site.xml. I found a similar bug, https://issues.apache.org/jira/browse/AMBARI-8028, and the fix was exactly what I did to work around it. Not sure if this is an AMBARI bug (not doing variable substitution when writing mapred-site.xml) or YARN bug (its not doing the variable substitution when writing launch_container.sh) Anybody have an opinion ? Doug On Mar 19, 2015, at 5:51 PM, Doug Balog doug.sparku...@dugos.com wrote: I’m seeing the same problem. I’ve set logging to DEBUG, and I think some hints are in the “Yarn AM launch context” that is printed out before Yarn runs java. My next step is to talk to the admins and get them to set yarn.nodemanager.delete.debug-delay-sec in the config, as recommended in http://spark.apache.org/docs/latest/running-on-yarn.html Then I can see exactly whats in the directory. Doug ps Sorry for the dup message Bharath and Todd, used wrong email address. On Mar 19, 2015, at 1:19 AM, Bharath Ravi Kumar reachb...@gmail.com wrote: Thanks for clarifying Todd. This may then be an issue specific to the HDP version we're using. Will continue to debug and post back if there's any resolution. On Thu, Mar 19, 2015 at 3:40 AM, Todd Nist tsind...@gmail.com wrote: Yes I believe you are correct. For the build you may need to specify the specific HDP version of hadoop to use with the -Dhadoop.version=. I went with the default 2.6.0, but Horton may have a vendor specific version that needs to go here. I know I saw a similar post today where the solution was to use -Dhadoop.version=2.5.0-cdh5.3.2 but that was for a cloudera installation. I am not sure what the HDP version would be to put here. -Todd On Wed, Mar 18, 2015 at 12:49 AM, Bharath Ravi Kumar reachb...@gmail.com wrote: Hi Todd, Yes, those entries were present in the conf under the same SPARK_HOME that was used to run spark-submit. On a related note, I'm assuming that the additional spark yarn options (like spark.yarn.jar) need to be set in the same properties file that is passed to spark-submit. That apart, I assume that no other host on the cluster should require a deployment of the spark distribution or any other config change to support a spark job. Isn't that correct? On Tue, Mar 17, 2015 at 6:19 PM, Todd Nist tsind...@gmail.com wrote: Hi Bharath, Do you have these entries in your $SPARK_HOME/conf/spark-defaults.conf file? spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0-2041 spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041 On Tue, Mar 17, 2015 at 1:04 AM, Bharath Ravi Kumar reachb...@gmail.com wrote: Still no luck running purpose-built 1.3 against HDP 2.2 after following all the instructions. Anyone else faced this issue? On Mon, Mar 16, 2015 at 8:53 PM, Bharath Ravi Kumar reachb...@gmail.com wrote: Hi Todd, Thanks for the help. I'll try again after building a distribution with the 1.3 sources. 
However, I wanted to confirm what I mentioned earlier: is it sufficient to copy the distribution only to the client host from where spark-submit is invoked (with spark.yarn.jar set), or is there a need to ensure that the entire distribution is pre-deployed on every host in the yarn cluster? I'd assume that the latter shouldn't be necessary. On Mon, Mar 16, 2015 at 8:38 PM, Todd Nist tsind...@gmail.com wrote: Hi Bharath, I ran into the same issue a few days ago, here is a link to a post on the Hortonworks forum. http://hortonworks.com/community/forums/search/spark+1.2.1/ In case anyone else needs to perform this, these are the steps I took to get it to work with Spark 1.2.1 as well as Spark 1.3.0-RC3: 1. Pull 1.2.1 Source 2. Apply the following patches a. Address jackson version, https://github.com/apache/spark/pull/3938 b. Address the propagation of the hdp.version set in the spark-defaults.conf, https://github.com/apache/spark/pull/3409 3. Build with $SPARK_HOME/make-distribution.sh --name hadoop2.6 --tgz -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -DskipTests package Then deploy the resulting artifact = spark-1.2.1-bin-hadoop2.6.tgz following the instructions in the HDP Spark preview http://hortonworks.com/hadoop-tutorial/using-apache-spark-hdp/ FWIW spark-1.3.0 appears to be working fine with HDP as well and steps 2a and 2b are not required. HTH -Todd
Re: Why doesn't the --conf parameter work in yarn-cluster mode (but works in yarn-client and local)?
Ah, yes, I believe this is because only properties prefixed with spark get passed on. The purpose of the --conf option is to allow passing Spark properties to the SparkConf, not to add general key-value pairs to the JVM system properties. -Sandy On Tue, Mar 24, 2015 at 4:25 AM, Emre Sevinc emre.sev...@gmail.com wrote: Hello Sandy, Your suggestion does not work when I try it locally: When I pass --conf key=someValue and then try to retrieve it like: SparkConf sparkConf = new SparkConf(); logger.info(* * * key ~~~ {}, sparkConf.get(key)); I get Exception in thread main java.util.NoSuchElementException: key And I think that's expected because the key is an arbitrary one, not necessarily a Spark configuration element. This is why I was passing it via --conf and retrieving System.getProperty(key) (which worked locally and in yarn-client mode but not in yarn-cluster mode). I'm surprised why I can't use it on the cluster while I can use it while local development and testing. Kind regards, Emre Sevinç http://www.bigindustries.be/ On Mon, Mar 23, 2015 at 6:15 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Emre, The --conf property is meant to work with yarn-cluster mode. System.getProperty(key) isn't guaranteed, but new SparkConf().get(key) should. Does it not? -Sandy On Mon, Mar 23, 2015 at 8:39 AM, Emre Sevinc emre.sev...@gmail.com wrote: Hello, According to Spark Documentation at https://spark.apache.org/docs/1.2.1/submitting-applications.html : --conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces wrap “key=value” in quotes (as shown). And indeed, when I use that parameter, in my Spark program I can retrieve the value of the key by using: System.getProperty(key); This works when I test my program locally, and also in yarn-client mode, I can log the value of the key and see that it matches what I wrote in the command line, but it returns *null* when I submit the very same program in *yarn-cluster* mode. Why can't I retrieve the value of key given as --conf key=value when I submit my Spark application in *yarn-cluster* mode? Any ideas and/or workarounds? -- Emre Sevinç http://www.bigindustries.be/ -- Emre Sevinc
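A small sketch of the workaround implied above (the property name is made up for illustration): prefix the key with "spark." so that spark-submit forwards it in yarn-cluster mode, then read it back through SparkConf rather than JVM system properties.

  // submit side (illustrative key name):
  //   spark-submit --master yarn-cluster --conf spark.myapp.configKey=someValue ... myapp.jar

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
  val value = conf.get("spark.myapp.configKey", "defaultValue")  // hypothetical key, read with a default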
Re: Spark streaming alerting
I created a jira ticket for my work in both the spark and spark-cassandra-connector JIRAs, I don't know why you cannot see them. Users can stream from any cassandra table, just as one can stream from a Kafka topic; same principle. Helena @helenaedelson On Mar 24, 2015, at 11:29 AM, Anwar Rizal anriza...@gmail.com wrote: Helena, The CassandraInputDStream sounds interesting. I don't find many things in the jira though. Do you have more details on what it tries to achieve? Thanks, Anwar. On Tue, Mar 24, 2015 at 2:39 PM, Helena Edelson helena.edel...@datastax.com wrote: Streaming _from_ cassandra, CassandraInputDStream, is coming BTW https://issues.apache.org/jira/browse/SPARK-6283 I am working on it now. Helena @helenaedelson On Mar 23, 2015, at 5:22 AM, Khanderao Kand Gmail khanderao.k...@gmail.com wrote: Akhil You are right in your answer to what Mohit wrote. However, what Mohit seems to be alluding to but did not write properly might be different. Mohit You are wrong in saying that streaming generally works with HDFS and Cassandra. Streaming typically works with a streaming or queueing source like Kafka, Kinesis, Twitter, Flume, ZeroMQ, etc. (but can also read from HDFS and S3). However, the streaming context (the receiver within the streaming context) gets events/messages/records and forms a time-window-based batch (RDD), so there is a maximum gap of one window interval between when an alert message becomes available to Spark and when the processing happens. I think this is what you meant. As per the Spark programming model, the RDD is the right way to deal with the data. If you are fine with a minimum delay of, say, a second (based on the minimum window that DStreams can support), then what Rohit gave is the right model. Khanderao On Mar 22, 2015, at 11:39 PM, Akhil Das ak...@sigmoidanalytics.com wrote: What do you mean you can't send it directly from spark workers? Here's a simple approach which you could do: val data = ssc.textFileStream("sigmoid/") val dist = data.filter(_.contains("ERROR")).foreachRDD(rdd => alert("Errors : " + rdd.count())) And the alert() function could be anything triggering an email or sending an SMS alert. Thanks Best Regards On Sun, Mar 22, 2015 at 1:52 AM, Mohit Anchlia mohitanch...@gmail.com wrote: Is there a module in spark streaming that lets you listen to the alerts/conditions as they happen in the streaming module? Generally spark streaming components will execute on a large set of clusters like hdfs or Cassandra, however when it comes to alerting you generally can't send it directly from the spark workers, which means you need a way to listen to the alerts.
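For reference, a cleaned-up, compilable version of the alerting sketch quoted above (directory, batch interval, and the alert body are placeholders; alert() is just a stub here):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  def alert(msg: String): Unit = println(msg)  // stand-in for an email/SMS hook

  val conf = new SparkConf().setAppName("AlertingSketch")
  val ssc = new StreamingContext(conf, Seconds(1))

  // watch a directory and raise an alert for each batch that contains ERROR lines
  ssc.textFileStream("sigmoid/")
     .filter(_.contains("ERROR"))
     .foreachRDD(rdd => alert("Errors: " + rdd.count()))

  ssc.start()
  ssc.awaitTermination()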
Does HiveContext connect to HiveServer2?
I am wondering if HiveContext connects to HiveServer2 or does it work through Hive CLI. The reason I am asking is because Cloudera has deprecated Hive CLI. If the connection is through HiveServer2, is there a way to specify user credentials? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-HiveContext-connect-to-HiveServer2-tp22200.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Invalid ContainerId ... Caused by: java.lang.NumberFormatException: For input string: e04
Steve, that's correct, but the problem only shows up when different versions of the YARN jars are included on the classpath. -Sandy On Tue, Mar 24, 2015 at 6:29 AM, Steve Loughran ste...@hortonworks.com wrote: On 24 Mar 2015, at 02:10, Marcelo Vanzin van...@cloudera.com wrote: This happens most probably because the Spark 1.3 you have downloaded is built against an older version of the Hadoop libraries than those used by CDH, and those libraries cannot parse the container IDs generated by CDH. This sounds suspiciously like the changes in YARN for HA (the epoch number) isn't being parsed by older versions of the YARN client libs. This is effectively a regression in the YARN code -its creating container IDs that can't be easily parsed by old apps. It may be possible to fix that spark-side by having its own parser for the YARN container/app environment variable - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
filter expression in API document for DataFrame
The following statement appears in the Scala API example at https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame people.filter("age" > 30). I tried this example and it gave a compilation error. I think this needs to be changed to people.filter(people("age") > 30) Also, it would be good to add some examples for the new equality operator for columns (e.g. (people("age") === 30) ). The programming guide does not have an example for this in the DataFrame Operations section and it was not very obvious that we need to be using a different equality operator for columns. thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/filter-expression-in-API-document-for-DataFrame-tp22213.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
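A small Scala sketch of both forms, assuming an existing DataFrame `people` with an integer `age` column:

  // column-expression form of the filter
  val adults = people.filter(people("age") > 30)

  // equality on a column uses ===, which builds a Column expression instead of a Boolean
  val thirty = people.filter(people("age") === 30)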
Graphx gets slower as the iteration number increases
I'm working with graphx to calculate the pageranks of an extremely large social network with a billion vertices. As the iteration number increases, each iteration becomes slower and eventually unacceptable. Is there any reason for this? How can I accelerate the iteration process? orangepri...@foxmail.com
Re: 1.3 Hadoop File System problem
Hey Jim, Thanks for reporting this. Can you give a small end-to-end code example that reproduces it? If so, we can definitely fix it. - Patrick On Tue, Mar 24, 2015 at 4:55 PM, Jim Carroll jimfcarr...@gmail.com wrote: I have code that works under 1.2.1 but when I upgraded to 1.3.0 it fails to find the s3 hadoop file system. I get the java.lang.IllegalArgumentException: Wrong FS: s3://path to my file], expected: file:/// when I try to save a parquet file. This worked in 1.2.1. Has anyone else seen this? I'm running spark using local[8] so it's all internal. These are actually unit tests in our app that are failing now. Thanks. Jim -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/1-3-Hadoop-File-System-problem-tp22207.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Graphx gets slower as the iteration number increases
This might be because partitions are getting dropped from memory and needing to be recomputed. How much memory is in the cluster, and how large are the partitions? This information should be in the Executors and Storage pages in the web UI. Ankur http://www.ankurdave.com/ On Tue, Mar 24, 2015 at 7:12 PM, orangepri...@foxmail.com orangepri...@foxmail.com wrote: I'm working with graphx to calculate the pageranks of an extremely large social network with a billion vertices. As the iteration number increases, each iteration becomes slower and eventually unacceptable. Is there any reason for this?
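If recomputation is the cause, one thing worth trying is to persist the graph at a storage level that can spill to disk and materialize it before iterating. A minimal sketch, assuming an existing Graph[VD, ED] called `graph` that has not already been persisted at a different level:

  import org.apache.spark.storage.StorageLevel

  val cached = graph.persist(StorageLevel.MEMORY_AND_DISK)
  cached.edges.count()   // force materialization before the iterations start
  val ranks = cached.pageRank(tol = 0.0001).vertices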
Re: 1.3 Hadoop File System problem
You are probably hitting SPARK-6351 https://issues.apache.org/jira/browse/SPARK-6351, which will be fixed in 1.3.1 (hopefully cutting an RC this week). On Tue, Mar 24, 2015 at 4:55 PM, Jim Carroll jimfcarr...@gmail.com wrote: I have code that works under 1.2.1 but when I upgraded to 1.3.0 it fails to find the s3 hadoop file system. I get the java.lang.IllegalArgumentException: Wrong FS: s3://path to my file], expected: file:/// when I try to save a parquet file. This worked in 1.2.1. Has anyone else seen this? I'm running spark using local[8] so it's all internal. These are actually unit tests in our app that are failing now. Thanks. Jim -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/1-3-Hadoop-File-System-problem-tp22207.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
column expression in left outer join for DataFrame
Hi, I am trying to port some code that was working in Spark 1.2.0 on the latest version, Spark 1.3.0. This code involves a left outer join between two SchemaRDDs which I am now trying to change to a left outer join between 2 DataFrames. I followed the example for left outer join of DataFrame at https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html Here's my code, where df1 and df2 are the 2 dataframes I am joining on the country field: val join_df = df1.join(df2, df1.country == df2.country, "left_outer") But I got a compilation error that value country is not a member of sql.DataFrame I also tried the following: val join_df = df1.join(df2, df1("country") == df2("country"), "left_outer") I got a compilation error that it is a Boolean whereas a Column is required. So what is the correct Column expression I need to provide for joining the 2 dataframes on a specific field? thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/column-expression-in-left-outer-join-for-DataFrame-tp22209.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: column expression in left outer join for DataFrame
You need to use `===`, so that you are constructing a column expression instead of evaluating the standard scala equality method. Calling methods to access columns (i.e. df.country) is only supported in Python. val join_df = df1.join(df2, df1("country") === df2("country"), "left_outer") On Tue, Mar 24, 2015 at 5:50 PM, SK skrishna...@gmail.com wrote: Hi, I am trying to port some code that was working in Spark 1.2.0 on the latest version, Spark 1.3.0. This code involves a left outer join between two SchemaRDDs which I am now trying to change to a left outer join between 2 DataFrames. I followed the example for left outer join of DataFrame at https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html Here's my code, where df1 and df2 are the 2 dataframes I am joining on the country field: val join_df = df1.join(df2, df1.country == df2.country, "left_outer") But I got a compilation error that value country is not a member of sql.DataFrame I also tried the following: val join_df = df1.join(df2, df1("country") == df2("country"), "left_outer") I got a compilation error that it is a Boolean whereas a Column is required. So what is the correct Column expression I need to provide for joining the 2 dataframes on a specific field? thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/column-expression-in-left-outer-join-for-DataFrame-tp22209.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: java.lang.OutOfMemoryError: unable to create new native thread
So, 1. I reduced my -XX:ThreadStackSize to 5m (instead of 10m - default is 1m), which is still OK for my need. 2. I reduced the executor memory to 44GB for a 60GB machine (instead of 49GB). This seems to have helped. Thanks to Matthew and Sean. Thomas On Tue, Mar 24, 2015 at 3:49 PM, Matt Silvey matt.sil...@videoamp.com wrote: My memory is hazy on this but aren't there hidden limitations to Linux-based threads? I ran into some issues a couple of years ago where, and here is the fuzzy part, the kernel wants to reserve virtual memory per thread equal to the stack size. When the total amount of reserved memory (not necessarily resident memory) exceeds the memory of the system it throws an OOM. I'm looking for material to back this up. Sorry for the initial vague response. Matthew On Tue, Mar 24, 2015 at 12:53 PM, Thomas Gerber thomas.ger...@radius.com wrote: Additional notes: I did not find anything wrong with the number of threads (ps -u USER -L | wc -l): around 780 on the master and 400 on executors. I am running on 100 r3.2xlarge. On Tue, Mar 24, 2015 at 12:38 PM, Thomas Gerber thomas.ger...@radius.com wrote: Hello, I am seeing various crashes in spark on large jobs which all share a similar exception: java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:714) I increased nproc (i.e. ulimit -u) 10 fold, but it doesn't help. Does anyone know how to avoid those kinds of errors? Noteworthy: I added -XX:ThreadStackSize=10m on both driver and executor extra java options, which might have amplified the problem. Thanks for you help, Thomas
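For reference, a sketch of how the two changes above might look in spark-defaults.conf (the values mirror the ones mentioned in this thread; treat them as starting points, not recommendations):

  spark.executor.memory            44g
  spark.driver.extraJavaOptions    -XX:ThreadStackSize=5m
  spark.executor.extraJavaOptions  -XX:ThreadStackSize=5m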
Re: issue while creating spark context
thanks Sean and Akhil, I changed the the permission of */user/spark/applicationHistory, *now it works, On Tue, Mar 24, 2015 at 7:35 PM, Sachin Singh sachin.sha...@gmail.com wrote: thanks Sean, please can you suggest in which file or configuration I need to modify proper path, please elaborate which may help, thanks, Regards Sachin On Tue, Mar 24, 2015 at 7:15 PM, Sean Owen so...@cloudera.com wrote: That's probably the problem; the intended path is on HDFS but the configuration specifies a local path. See the exception message. On Tue, Mar 24, 2015 at 1:08 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Its in your local file system, not in hdfs. Thanks Best Regards On Tue, Mar 24, 2015 at 6:25 PM, Sachin Singh sachin.sha...@gmail.com wrote: hi, I can see required permission is granted for this directory as under, hadoop dfs -ls /user/spark DEPRECATED: Use of this script to execute hdfs command is deprecated. Instead use the hdfs command for it. Found 1 items drwxrwxrwt - spark spark 0 2015-03-20 01:04 /user/spark/applicationHistory regards Sachin
Re: Errors in SPARK
The error you're seeing typically means that you cannot connect to the Hive metastore itself. Some quick thoughts: - If you were to run show tables (instead of the CREATE TABLE statement), are you still getting the same error? - To confirm, the Hive metastore (MySQL database) is up and running - Did you download or build your version of Spark? On Tue, Mar 24, 2015 at 10:48 PM sandeep vura sandeepv...@gmail.com wrote: Hi Denny, Still facing the same issue.Please find the following errors. *scala val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)* *sqlContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@4e4f880c* *scala sqlContext.sql(CREATE TABLE IF NOT EXISTS src (key INT, value STRING))* *java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient* Cheers, Sandeep.v On Wed, Mar 25, 2015 at 11:10 AM, sandeep vura sandeepv...@gmail.com wrote: No I am just running ./spark-shell command in terminal I will try with above command On Wed, Mar 25, 2015 at 11:09 AM, Denny Lee denny.g@gmail.com wrote: Did you include the connection to a MySQL connector jar so that way spark-shell / hive can connect to the metastore? For example, when I run my spark-shell instance in standalone mode, I use: ./spark-shell --master spark://servername:7077 --driver-class-path /lib/ mysql-connector-java-5.1.27.jar On Fri, Mar 13, 2015 at 8:31 AM sandeep vura sandeepv...@gmail.com wrote: Hi Sparkers, Can anyone please check the below error and give solution for this.I am using hive version 0.13 and spark 1.2.1 . Step 1 : I have installed hive 0.13 with local metastore (mySQL database) Step 2: Hive is running without any errors and able to create tables and loading data in hive table Step 3: copied hive-site.xml in spark/conf directory Step 4: copied core-site.xml in spakr/conf directory Step 5: started spark shell Please check the below error for clarifications. scala val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) sqlContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.Hi veContext@2821ec0c scala sqlContext.sql(CREATE TABLE IF NOT EXISTS src (key INT, value STRING)) java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate or g.apache.hadoop.hive. metastore.HiveMetaStoreClient at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.jav a:346) at org.apache.spark.sql.hive.HiveContext$$anonfun$4.apply(HiveContext.sc ala:235) at org.apache.spark.sql.hive.HiveContext$$anonfun$4.apply(HiveContext.sc ala:231) at scala.Option.orElse(Option.scala:257) at org.apache.spark.sql.hive.HiveContext.x$3$lzycompute(HiveContext.scal a:231) at org.apache.spark.sql.hive.HiveContext.x$3(HiveContext. scala:229) at org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext .scala:229) at org.apache.spark.sql.hive.HiveContext.hiveconf( HiveContext.scala:229) at org.apache.spark.sql.hive.HiveMetastoreCatalog.init(HiveMetastoreCa talog.scala:55) Regards, Sandeep.v
Re: Errors in SPARK
Hi Denny, Still facing the same issue.Please find the following errors. *scala val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)* *sqlContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@4e4f880c* *scala sqlContext.sql(CREATE TABLE IF NOT EXISTS src (key INT, value STRING))* *java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient* Cheers, Sandeep.v On Wed, Mar 25, 2015 at 11:10 AM, sandeep vura sandeepv...@gmail.com wrote: No I am just running ./spark-shell command in terminal I will try with above command On Wed, Mar 25, 2015 at 11:09 AM, Denny Lee denny.g@gmail.com wrote: Did you include the connection to a MySQL connector jar so that way spark-shell / hive can connect to the metastore? For example, when I run my spark-shell instance in standalone mode, I use: ./spark-shell --master spark://servername:7077 --driver-class-path /lib/mysql-connector-java-5.1.27.jar On Fri, Mar 13, 2015 at 8:31 AM sandeep vura sandeepv...@gmail.com wrote: Hi Sparkers, Can anyone please check the below error and give solution for this.I am using hive version 0.13 and spark 1.2.1 . Step 1 : I have installed hive 0.13 with local metastore (mySQL database) Step 2: Hive is running without any errors and able to create tables and loading data in hive table Step 3: copied hive-site.xml in spark/conf directory Step 4: copied core-site.xml in spakr/conf directory Step 5: started spark shell Please check the below error for clarifications. scala val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) sqlContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.Hi veContext@2821ec0c scala sqlContext.sql(CREATE TABLE IF NOT EXISTS src (key INT, value STRING)) java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate or g.apache.hadoop.hive.metastore.HiveMetaStoreClient at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.jav a:346) at org.apache.spark.sql.hive.HiveContext$$anonfun$4.apply(HiveContext.sc ala:235) at org.apache.spark.sql.hive.HiveContext$$anonfun$4.apply(HiveContext.sc ala:231) at scala.Option.orElse(Option.scala:257) at org.apache.spark.sql.hive.HiveContext.x$3$lzycompute(HiveContext.scal a:231) at org.apache.spark.sql.hive.HiveContext.x$3(HiveContext.scala:229) at org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext .scala:229) at org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:229) at org.apache.spark.sql.hive.HiveMetastoreCatalog.init(HiveMetastoreCa talog.scala:55) Regards, Sandeep.v
Re: Errors in SPARK
No I am just running ./spark-shell command in terminal I will try with above command On Wed, Mar 25, 2015 at 11:09 AM, Denny Lee denny.g@gmail.com wrote: Did you include the connection to a MySQL connector jar so that way spark-shell / hive can connect to the metastore? For example, when I run my spark-shell instance in standalone mode, I use: ./spark-shell --master spark://servername:7077 --driver-class-path /lib/mysql-connector-java-5.1.27.jar On Fri, Mar 13, 2015 at 8:31 AM sandeep vura sandeepv...@gmail.com wrote: Hi Sparkers, Can anyone please check the below error and give solution for this.I am using hive version 0.13 and spark 1.2.1 . Step 1 : I have installed hive 0.13 with local metastore (mySQL database) Step 2: Hive is running without any errors and able to create tables and loading data in hive table Step 3: copied hive-site.xml in spark/conf directory Step 4: copied core-site.xml in spakr/conf directory Step 5: started spark shell Please check the below error for clarifications. scala val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) sqlContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.Hi veContext@2821ec0c scala sqlContext.sql(CREATE TABLE IF NOT EXISTS src (key INT, value STRING)) java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate or g.apache.hadoop.hive.metastore.HiveMetaStoreClient at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.jav a:346) at org.apache.spark.sql.hive.HiveContext$$anonfun$4.apply(HiveContext.sc ala:235) at org.apache.spark.sql.hive.HiveContext$$anonfun$4.apply(HiveContext.sc ala:231) at scala.Option.orElse(Option.scala:257) at org.apache.spark.sql.hive.HiveContext.x$3$lzycompute(HiveContext.scal a:231) at org.apache.spark.sql.hive.HiveContext.x$3(HiveContext.scala:229) at org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext .scala:229) at org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:229) at org.apache.spark.sql.hive.HiveMetastoreCatalog.init(HiveMetastoreCa talog.scala:55) Regards, Sandeep.v
Re: Errors in SPARK
Did you include the connection to a MySQL connector jar so that way spark-shell / hive can connect to the metastore? For example, when I run my spark-shell instance in standalone mode, I use: ./spark-shell --master spark://servername:7077 --driver-class-path /lib/mysql-connector-java-5.1.27.jar On Fri, Mar 13, 2015 at 8:31 AM sandeep vura sandeepv...@gmail.com wrote: Hi Sparkers, Can anyone please check the below error and give solution for this.I am using hive version 0.13 and spark 1.2.1 . Step 1 : I have installed hive 0.13 with local metastore (mySQL database) Step 2: Hive is running without any errors and able to create tables and loading data in hive table Step 3: copied hive-site.xml in spark/conf directory Step 4: copied core-site.xml in spakr/conf directory Step 5: started spark shell Please check the below error for clarifications. scala val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) sqlContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.Hi veContext@2821ec0c scala sqlContext.sql(CREATE TABLE IF NOT EXISTS src (key INT, value STRING)) java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate or g.apache.hadoop.hive.metastore.HiveMetaStoreClient at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.jav a:346) at org.apache.spark.sql.hive.HiveContext$$anonfun$4.apply(HiveContext.sc ala:235) at org.apache.spark.sql.hive.HiveContext$$anonfun$4.apply(HiveContext.sc ala:231) at scala.Option.orElse(Option.scala:257) at org.apache.spark.sql.hive.HiveContext.x$3$lzycompute(HiveContext.scal a:231) at org.apache.spark.sql.hive.HiveContext.x$3(HiveContext.scala:229) at org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext .scala:229) at org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:229) at org.apache.spark.sql.hive.HiveMetastoreCatalog.init(HiveMetastoreCa talog.scala:55) Regards, Sandeep.v
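For completeness, a minimal hive-site.xml sketch for a MySQL-backed metastore, which would go in spark/conf alongside the MySQL connector jar on the driver classpath (host, database name, and credentials are placeholders):

  <configuration>
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://localhost:3306/metastore</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hiveuser</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>hivepassword</value>
    </property>
  </configuration>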
Hadoop 2.5 not listed in Spark 1.4 build page
http://spark.apache.org/docs/latest/building-spark.html#packaging-without-hadoop-dependencies-for-yarn does not list hadoop 2.5 in the Hadoop version table, etc. I assume it is still OK to compile with -Pyarn -Phadoop-2.5 for use with Hadoop 2.5 (cdh 5.3.2) Thanks,
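As far as I can tell there is no hadoop-2.5 profile in the build; Hadoop 2.5.x builds generally go through the hadoop-2.4 profile with hadoop.version set explicitly, roughly like this (the version string below is the CDH one mentioned above):

  mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.3.2 -DskipTests clean package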
Re: Weird exception in Spark job
Any Ideas on this? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Weird-exception-in-Spark-job-tp22195p22204.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Does HiveContext connect to HiveServer2?
spark-submit --files /path/to/hive-site.xml On Tue, Mar 24, 2015 at 10:31 AM, Udit Mehta ume...@groupon.com wrote: Another question related to this, how can we propagate the hive-site.xml to all workers when running in the yarn cluster mode? On Tue, Mar 24, 2015 at 10:09 AM, Marcelo Vanzin van...@cloudera.com wrote: It does neither. If you provide a Hive configuration to Spark, HiveContext will connect to your metastore server, otherwise it will create its own metastore in the working directory (IIRC). On Tue, Mar 24, 2015 at 8:58 AM, nitinkak001 nitinkak...@gmail.com wrote: I am wondering if HiveContext connects to HiveServer2 or does it work though Hive CLI. The reason I am asking is because Cloudera has deprecated Hive CLI. If the connection is through HiverServer2, is there a way to specify user credentials? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-HiveContext-connect-to-HiveServer2-tp22200.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Marcelo - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Marcelo - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Dataframe groupby custom functions (python)
Hi all, I have been trying out the new dataframe api in 1.3, which looks great by the way. I have found an example to define udfs and add them to select operations, like this: slen = F.udf(lambda s: len(s), IntegerType()) df.select(df.age, slen(df.name).alias('slen')).collect() Is it possible to do something similar with aggregates? Something like this: gdf = df.groupBy(df.name) gdf.agg(slen(df.age)).collect() thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Dataframe-groupby-custom-functions-python-tp22205.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Dataframe groupby custom functions (python)
The only UDAFs that we support today are those defined using the Hive UDAF API. Otherwise you'll have to drop into Spark operations. I'd suggest opening a JIRA. On Tue, Mar 24, 2015 at 10:49 AM, jamborta jambo...@gmail.com wrote: Hi all, I have been trying out the new dataframe api in 1.3, which looks great by the way. I have found an example to define udfs and add them to select operations, like this: slen = F.udf(lambda s: len(s), IntegerType()) df.select(df.age, slen(df.name).alias('slen')).collect() is it possible to to something similar with aggregates? Something like this: gdf = df.groupBy(df.name) gdf.agg(slen(df.age)).collect() thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Dataframe-groupby-custom-functions-python-tp22205.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: diffrence in PCA of MLib vs H2o in R
If you want to do a nonstandard (or uncentered) PCA, you can call computeSVD on RowMatrix, and look at the resulting 'V' Matrix. That should match the output of the other two systems. Reza On Tue, Mar 24, 2015 at 3:53 AM, Sean Owen so...@cloudera.com wrote: Those implementations are computing an SVD of the input matrix directly, and while you generally need the columns to have mean 0, you can turn that off with the options you cite. I don't think this is possible in the MLlib implementation, since it is computing the principal components by computing eigenvectors of the covariance matrix. The means inherently don't matter either way in this computation. On Tue, Mar 24, 2015 at 6:13 AM, roni roni.epi...@gmail.com wrote: I am trying to compute PCA using computePrincipalComponents. I also computed PCA using h2o in R and R's prcomp. The answers I get from H2o and R's prComp (non h2o) is same when I set the options for H2o as standardized=FALSE and for r's prcomp as center = false. How do I make sure that the settings for MLib PCA is same as I am using for H2o or prcomp. Thanks Roni - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: diffrence in PCA of MLib vs H2o in R
(Oh sorry, I've only been thinking of TallSkinnySVD) On Tue, Mar 24, 2015 at 6:36 PM, Reza Zadeh r...@databricks.com wrote: If you want to do a nonstandard (or uncentered) PCA, you can call computeSVD on RowMatrix, and look at the resulting 'V' Matrix. That should match the output of the other two systems. Reza On Tue, Mar 24, 2015 at 3:53 AM, Sean Owen so...@cloudera.com wrote: Those implementations are computing an SVD of the input matrix directly, and while you generally need the columns to have mean 0, you can turn that off with the options you cite. I don't think this is possible in the MLlib implementation, since it is computing the principal components by computing eigenvectors of the covariance matrix. The means inherently don't matter either way in this computation. On Tue, Mar 24, 2015 at 6:13 AM, roni roni.epi...@gmail.com wrote: I am trying to compute PCA using computePrincipalComponents. I also computed PCA using h2o in R and R's prcomp. The answers I get from H2o and R's prComp (non h2o) is same when I set the options for H2o as standardized=FALSE and for r's prcomp as center = false. How do I make sure that the settings for MLib PCA is same as I am using for H2o or prcomp. Thanks Roni - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
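A sketch of Reza's suggestion, assuming an existing RowMatrix `mat` built from an RDD[Vector] and a target dimensionality k:

  import org.apache.spark.mllib.linalg.distributed.RowMatrix

  val k = 10
  val svd = mat.computeSVD(k, computeU = false)
  val v = svd.V   // n x k matrix of right singular vectors (uncentered principal directions)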
Re: How to deploy binary dependencies to workers?
I would recommend uploading those jars to HDFS and using the --jars option of spark-submit with an HDFS URI instead of a local filesystem URI. That avoids the problem of fetching jars from the driver, which can be a bottleneck. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Tue, Mar 24, 2015 at 4:13 AM, Xi Shen davidshe...@gmail.com wrote: Hi, I am doing ML using Spark mllib. However, I do not have full control over the cluster. I am using Microsoft Azure HDInsight. I want to deploy the BLAS or whatever required dependencies to accelerate the computation. But I don't know how to deploy those DLLs when I submit my JAR to the cluster. I know how to pack those DLLs into a jar. The real challenge is how to let the system find them... Thanks, David - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
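A sketch of what DB describes (paths and jar names are placeholders):

  hadoop fs -put my-blas-deps.jar /libs/
  spark-submit --master yarn-cluster \
    --class com.example.MyMLJob \
    --jars hdfs:///libs/my-blas-deps.jar \
    my-ml-job.jar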
Re: Optimal solution for getting the header from CSV with Spark
I think this works in practice, but I don't know that the first block of the file is guaranteed to be in the first partition? certainly later down the pipeline that won't be true but presumably this is happening right after reading the file. I've always just written some filter that would only match the header, which assumes that this is possible to distinguish, but usually is. On Tue, Mar 24, 2015 at 2:41 PM, Dean Wampler deanwamp...@gmail.com wrote: Instead of data.zipWithIndex().filter(_._2==0), which will cause Spark to read the whole file, use data.take(1), which is simpler. From the Rdd.take documentation, it works by first scanning one partition, and using the results from that partition to estimate the number of additional partitions needed to satisfy the limit. In this case, it will trivially stop at the first. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Typesafe @deanwampler http://polyglotprogramming.com On Tue, Mar 24, 2015 at 7:12 AM, Spico Florin spicoflo...@gmail.com wrote: Hello! I would like to know what is the optimal solution for getting the header with from a CSV file with Spark? My aproach was: def getHeader(data: RDD[String]): String = { data.zipWithIndex().filter(_._2==0).map(x=x._1).take(1).mkString() } Thanks. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Does HiveContext connect to HiveServer2?
It does neither. If you provide a Hive configuration to Spark, HiveContext will connect to your metastore server, otherwise it will create its own metastore in the working directory (IIRC). On Tue, Mar 24, 2015 at 8:58 AM, nitinkak001 nitinkak...@gmail.com wrote: I am wondering if HiveContext connects to HiveServer2 or does it work though Hive CLI. The reason I am asking is because Cloudera has deprecated Hive CLI. If the connection is through HiverServer2, is there a way to specify user credentials? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-HiveContext-connect-to-HiveServer2-tp22200.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Marcelo - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Is yarn-standalone mode deprecated?
I checked and apparently it hasn't be released yet. it will be available in the upcoming CDH 5.4 release. -Sandy On Mon, Mar 23, 2015 at 1:32 PM, Nitin kak nitinkak...@gmail.com wrote: I know there was an effort for this, do you know which version of Cloudera distribution we could find that? On Mon, Mar 23, 2015 at 1:13 PM, Sandy Ryza sandy.r...@cloudera.com wrote: The former is deprecated. However, the latter is functionally equivalent to it. Both launch an app in what is now called yarn-cluster mode. Oozie now also has a native Spark action, though I'm not familiar on the specifics. -Sandy On Mon, Mar 23, 2015 at 1:01 PM, Nitin kak nitinkak...@gmail.com wrote: To be more clear, I am talking about SPARK_JAR=SPARK_ASSEMBLY_JAR_FILE ./bin/spark-class org.apache.spark.deploy.yarn.Client \ --jar YOUR_APP_JAR_FILE \ --class APP_MAIN_CLASS \ --args APP_MAIN_ARGUMENTS \ --num-workers NUMBER_OF_WORKER_MACHINES \ --master-class ApplicationMaster_CLASS --master-memory MEMORY_FOR_MASTER \ --worker-memory MEMORY_PER_WORKER \ --worker-cores CORES_PER_WORKER \ --name application_name \ --queue queue_name \ --addJars any_local_files_used_in_SparkContext.addJar \ --files files_for_distributed_cache \ --archives archives_for_distributed_cache which I thought was the yarn-standalone mode vs spark-submit ./bin/spark-submit --class org.apache.spark.examples.SparkPi \ --master yarn-cluster \ --num-executors 3 \ --driver-memory 4g \ --executor-memory 2g \ --executor-cores 1 \ --queue thequeue \ lib/spark-examples*.jar I didnt see example of ./bin/spark-class in 1.2.0 documentation, so am wondering if that is deprecated. On Mon, Mar 23, 2015 at 12:11 PM, Sandy Ryza sandy.r...@cloudera.com wrote: The mode is not deprecated, but the name yarn-standalone is now deprecated. It's now referred to as yarn-cluster. -Sandy On Mon, Mar 23, 2015 at 11:49 AM, nitinkak001 nitinkak...@gmail.com wrote: Is yarn-standalone mode deprecated in Spark now. The reason I am asking is because while I can find it in 0.9.0 documentation(https://spark.apache.org/docs/0.9.0/running-on-yarn.html). I am not able to find it in 1.2.0. I am using this mode to run the Spark jobs from Oozie as a java action. Removing this mode will prevent me from doing that. Are there any other ways of running a Spark job from Oozie other than Shell action? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-yarn-standalone-mode-deprecated-tp22188.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
akka.version error
I am facing the same issue as listed here: http://apache-spark-user-list.1001560.n3.nabble.com/Packaging-a-spark-job-using-maven-td5615.html The solution mentioned is here: https://gist.github.com/prb/d776a47bd164f704eecb However, I think I don't understand a few things: 1) Why are the jars being split into worker and driver jars? 2) Does it mean I now need to create 2 jars? 3) I am assuming I still need both jars on the path when I run this job? I am simply trying to execute a basic word count example.
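The akka.version error usually means the Akka reference.conf files were clobbered when the uber jar was assembled. If you are building with the Maven shade plugin, the usual fix (a sketch; the plugin version and the rest of the pom are omitted) is an AppendingTransformer so the Typesafe config files are merged rather than overwritten:

  <plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <configuration>
      <transformers>
        <!-- merge, rather than overwrite, the Akka reference.conf files -->
        <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
          <resource>reference.conf</resource>
        </transformer>
      </transformers>
    </configuration>
  </plugin>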
What is the ideal method to interact with a Spark Cluster from a Cloud App?
Hi all, We have a cloud application, to which we are adding a reporting service. For this we have narrowed down to using Cassandra + Spark for data store and processing respectively. Since the cloud application is separate from the Cassandra + Spark deployment, what is the ideal method to interact with the Spark master from the application? We have been evaluating spark-job-server [1], which is a RESTful layer on top of Spark. Are there any other such tools? Or is there any other better approach which can be explored? We are evaluating the following requirements against spark-job-server: 1. Provide a platform for applications to submit jobs 2. Provide RESTful APIs using which applications will interact with the server - Upload jar for running jobs - Submit job - Get job list - Get job status - Get job result 3. Provide support for kill/restart job - Kill job - Restart job 4. Support job priority 5. Queue up job submissions if resources not available 6. Troubleshoot job execution - Failure – job logs - Measure performance 7. Manage cluster deployment - Bootstrap, scale up/down (add, remove, replace nodes) 8. Monitor cluster deployment - Health report: Report metrics – CPU, memory, number of jobs, spark processes - Alert DevOps about threshold limits of these metrics - Alert DevOps about job failures - Self healing? 9. Security - AAA job submissions 10. High availability/Redundancy - This is for the spark-jobserver component itself Any help is appreciated! Thanks and Regards Noorul - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Optimal solution for getting the header from CSV with Spark
Good point. There's no guarantee that you'll get the actual first partition. One reason why I wouldn't allow a CSV header line in a real data file, if I could avoid it. Back to Spark, a safer approach is RDD.foreachPartition, which takes a function expecting an iterator. You'll only need to grab the first element (being careful that the partition isn't empty!) and then determine which of those first lines has the header info. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Tue, Mar 24, 2015 at 11:12 AM, Sean Owen so...@cloudera.com wrote: I think this works in practice, but I don't know that the first block of the file is guaranteed to be in the first partition? certainly later down the pipeline that won't be true but presumably this is happening right after reading the file. I've always just written some filter that would only match the header, which assumes that this is possible to distinguish, but usually is. On Tue, Mar 24, 2015 at 2:41 PM, Dean Wampler deanwamp...@gmail.com wrote: Instead of data.zipWithIndex().filter(_._2==0), which will cause Spark to read the whole file, use data.take(1), which is simpler. From the Rdd.take documentation, it works by first scanning one partition, and using the results from that partition to estimate the number of additional partitions needed to satisfy the limit. In this case, it will trivially stop at the first. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Typesafe @deanwampler http://polyglotprogramming.com On Tue, Mar 24, 2015 at 7:12 AM, Spico Florin spicoflo...@gmail.com wrote: Hello! I would like to know what is the optimal solution for getting the header with from a CSV file with Spark? My aproach was: def getHeader(data: RDD[String]): String = { data.zipWithIndex().filter(_._2==0).map(x=x._1).take(1).mkString() } Thanks.
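Another common variant of the same idea, sketched below: grab the header with first() and strip it from the data with mapPartitionsWithIndex, dropping the first line only from partition 0 (this still assumes the header really does live in the first partition, which is the caveat Sean raised):

  // assumes data: RDD[String] read with sc.textFile(...)
  val header = data.first()

  val rows = data.mapPartitionsWithIndex { (idx, iter) =>
    if (idx == 0) iter.drop(1) else iter
  }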
Re: Is it possible to use json4s 3.2.11 with Spark 1.3.0?
From the exception it seems like your app is also repackaging Scala classes somehow. Can you double check that and remove the Scala classes from your app if they're there? On Mon, Mar 23, 2015 at 10:07 PM, Alexey Zinoviev alexey.zinov...@gmail.com wrote: Thanks Marcelo, this options solved the problem (I'm using 1.3.0), but it works only if I remove extends Logging from the object, with extends Logging it return: Exception in thread main java.lang.LinkageError: loader constraint violation in interface itable initialization: when resolving method App1$.logInfo(Lscala/Function0;Ljava/lang/Throwable;)V the class loader (instance of org/apache/spark/util/ChildFirstURLClassLoader) of the current class, App1$, and the class loader (instance of sun/misc/Launcher$AppClassLoader) for interface org/apache/spark/Logging have different Class objects for the type scala/Function0 used in the signature at App1.main(App1.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Do you have any idea what's wrong with Logging? PS: I'm running it with spark-1.3.0/bin/spark-submit --class App1 --conf spark.driver.userClassPathFirst=true --conf spark.executor.userClassPathFirst=true $HOME/projects/sparkapp/target/scala-2.10/sparkapp-assembly-1.0.jar Thanks, Alexey On Tue, Mar 24, 2015 at 5:03 AM, Marcelo Vanzin van...@cloudera.com wrote: You could build a far jar for your application containing both your code and the json4s library, and then run Spark with these two options: spark.driver.userClassPathFirst=true spark.executor.userClassPathFirst=true Both only work in 1.3. (1.2 has spark.files.userClassPathFirst, but that only works for executors.) On Mon, Mar 23, 2015 at 2:12 PM, Alexey Zinoviev alexey.zinov...@gmail.com wrote: Spark has a dependency on json4s 3.2.10, but this version has several bugs and I need to use 3.2.11. I added json4s-native 3.2.11 dependency to build.sbt and everything compiled fine. But when I spark-submit my JAR it provides me with 3.2.10. build.sbt import sbt.Keys._ name := sparkapp version := 1.0 scalaVersion := 2.10.4 libraryDependencies += org.apache.spark %% spark-core % 1.3.0 % provided libraryDependencies += org.json4s %% json4s-native % 3.2.11` plugins.sbt logLevel := Level.Warn resolvers += Resolver.url(artifactory, url(http://scalasbt.artifactoryonline.com/scalasbt/sbt-plugin-releases;))(Resolver.ivyStylePatterns) addSbtPlugin(com.eed3si9n % sbt-assembly % 0.13.0) App1.scala import org.apache.spark.SparkConf import org.apache.spark.rdd.RDD import org.apache.spark.{Logging, SparkConf, SparkContext} import org.apache.spark.SparkContext._ object App1 extends Logging { def main(args: Array[String]) = { val conf = new SparkConf().setAppName(App1) val sc = new SparkContext(conf) println(sjson4s version: ${org.json4s.BuildInfo.version.toString}) } } sbt 0.13.7, sbt-assembly 0.13.0, Scala 2.10.4 Is it possible to force 3.2.11 version usage? 
Thanks, Alexey -- Marcelo -- Marcelo - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
CombineByKey - Please explain its working
I am reading about combineByKey and going through the example below from a blog post, but I can't understand how it works step by step. Can someone please explain?
case class Fruit(kind: String, weight: Int) { def makeJuice: Juice = Juice(weight * 100) }
case class Juice(volumn: Int) { def add(j: Juice): Juice = Juice(volumn + j.volumn) }
val apple1 = Fruit("Apple", 5)
val apple2 = Fruit("Apple", 8)
val orange1 = Fruit("orange", 10)
val fruit = sc.parallelize(List(("Apple", apple1), ("orange", orange1), ("Apple", apple2)))
val juice = fruit.combineByKey(
  (f: Fruit) => f.makeJuice,
  (j: Juice, f: Fruit) => j.add(f.makeJuice),
  (j1: Juice, j2: Juice) => j1.add(j2)
)
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/CombineByKey-Please-explain-its-working-tp22203.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
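Here is the same combineByKey call annotated, with its three arguments labelled (class and value names are kept from the snippet above; this assumes a spark-shell with `sc` available):

  val fruit = sc.parallelize(List(("Apple", apple1), ("orange", orange1), ("Apple", apple2)))

  val juice = fruit.combineByKey(
    // createCombiner: run on the first value seen for a key within a partition
    (f: Fruit) => f.makeJuice,
    // mergeValue: fold each subsequent value for that key (in the same partition) into the combiner
    (j: Juice, f: Fruit) => j.add(f.makeJuice),
    // mergeCombiners: merge the per-partition combiners for the same key across partitions
    (j1: Juice, j2: Juice) => j1.add(j2)
  )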
Re: Question about Data Sources API
My question wrt Java/Scala was related to extending the classes to support new custom data sources, so I was wondering if those could be written in Java, since our company is a Java shop. Yes, you should be able to extend the required interfaces using Java. The additional push downs I am looking for are aggregations with grouping and sorting. Essentially, I am trying to evaluate if this API can give me much of what is possible with the Apache MetaModel project. We don't currently push those down today as our initial focus is on getting data into Spark so that you can join with other sources and then do such processing. It's possible we will extend the pushdown API in the future, though.
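To illustrate the extension point being discussed, here is a minimal Scala sketch of a read-only relation against the Spark 1.3 sources API (the class names and the hard-coded rows are invented for illustration; the same interfaces can be implemented from Java since they are plain traits and abstract classes):

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.{Row, SQLContext}
  import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
  import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

  class DefaultSource extends RelationProvider {
    override def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation =
      new DummyRelation(sqlContext)
  }

  class DummyRelation(val sqlContext: SQLContext) extends BaseRelation with TableScan {
    // fixed two-column schema for the example
    override def schema: StructType =
      StructType(Seq(StructField("name", StringType), StructField("age", IntegerType)))

    // a real source would read from the external system here
    override def buildScan(): RDD[Row] =
      sqlContext.sparkContext.parallelize(Seq(Row("alice", 30), Row("bob", 25)))
  }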
Spark GraphX In Action on documentation page?
Can my new book, Spark GraphX In Action, which is currently in MEAP http://manning.com/malak/, be added to https://spark.apache.org/documentation.html and, if appropriate, to https://spark.apache.org/graphx/ ? Michael Malak - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark-thriftserver Issue
Zhan specifying port fixed the port issue. Is it possible to specify the log directory while starting the spark thriftserver? Still getting this error even through the folder exists and everyone has permission to use that directory. drwxr-xr-x 2 root root 4096 Mar 24 19:04 spark-events Exception in thread main java.lang.IllegalArgumentException: Log directory /tmp/spark-events does not exist. at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:99) at org.apache.spark.SparkContext.init(SparkContext.scala:399) at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:49) at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:58) at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) On Mon, Mar 23, 2015 at 6:51 PM, Zhan Zhang zzh...@hortonworks.com wrote: Probably the port is already used by others, e.g., hive. You can change the port similar to below ./sbin/start-thriftserver.sh --master yarn --executor-memory 512m --hiveconf hive.server2.thrift.port=10001 Thanks. Zhan Zhang On Mar 23, 2015, at 12:01 PM, Neil Dev neilk...@gmail.com wrote: Hi, I am having issue starting spark-thriftserver. I'm running spark 1.3.with Hadoop 2.4.0. I would like to be able to change its port too so, I can hive hive-thriftserver as well as spark-thriftserver running at the same time. Starting sparkthrift server:- sudo ./start-thriftserver.sh --master spark://ip-172-31-10-124:7077 --executor-memory 2G Error:- I created the folder manually but still getting the following error Exception in thread main java.lang.IllegalArgumentException: Log directory /tmp/spark-events does not exist. I am getting the following error 15/03/23 15:07:02 ERROR thrift.ThriftCLIService: Error: org.apache.thrift.transport.TTransportException: Could not create ServerSocket on address0.0.0.0/0.0.0.0:1. at org.apache.thrift.transport.TServerSocket.init(TServerSocket.java:93) at org.apache.thrift.transport.TServerSocket.init(TServerSocket.java:79) at org.apache.hive.service.auth.HiveAuthFactory.getServerSocket(HiveAuthFactory.java:236) at org.apache.hive.service.cli.thrift.ThriftBinaryCLIService.run(ThriftBinaryCLIService.java:69) at java.lang.Thread.run(Thread.java:745) Thanks Neil
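On the second question: that directory comes from Spark's event log settings, so it can be overridden when starting the thrift server, something like (path and port are examples):

  ./sbin/start-thriftserver.sh --master spark://ip-172-31-10-124:7077 \
    --hiveconf hive.server2.thrift.port=10001 \
    --conf spark.eventLog.enabled=true \
    --conf spark.eventLog.dir=hdfs:///user/spark/applicationHistory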
FAILED SelectChannelConnector@0.0.0.0:4040 java.net.BindException: Address already in use
I get the following message each time I run a spark job: 15/03/24 15:35:56 WARN AbstractLifeCycle: FAILED SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address already in use The full trace is here: http://pastebin.com/xSvRN01f How do I fix this? I am on CDH 5.3.1 thanks roy
Re: SparkSQL UDTs with Ordering
I'll caution that the UDTs are not a stable public interface yet. We'd like to make them public someday, but currently this feature is mostly for MLlib as we have not finalized the API. Having an ordering could be useful, but I'll add that currently UDTs actually exist in serialized form, so the ordering would have to be on the internal form, not the user-visible form. On Tue, Mar 24, 2015 at 12:25 PM, Patrick Woody patrick.woo...@gmail.com wrote: Hey all, Currently looking into UDTs and I was wondering if it is reasonable to add the ability to define an Ordering (or if this is possible, then how)? Currently it will throw an error when non-Native types are used. Thanks! -Pat
Spark Application Hung
Hi, We are observing a hung spark application when one of the yarn datanode (running multiple spark executors) go down. Setup details: * Spark: 1.2.1 * Hadoop: 2.4.0 * Spark Application Mode: yarn-client * 2 datanodes (DN1, DN2) * 6 spark executors (initially 3 executors on both DN1 and DN2, after rebooting DN2, changes to 4 executors on DN1 and 2 executors on DN2) Scenario: When one of the datanodes (DN2) is brought down, the application gets hung, with spark driver continuously showing the following warning: 15/03/24 12:39:26 WARN TaskSetManager: Lost task 5.0 in stage 232.0 (TID 37941, DN1): FetchFailed(null, shuffleId=155, mapId=-1, reduceId=5, message= org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 155 at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:384) at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:381) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:380) at org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:176) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42) at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40) at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) When DN2 is brought down, one executor gets launched on DN1. 
When DN2 is brought back up after 15mins, 2 executors get launched on it. All the executors (including the ones which got launched after DN2 comes back), keep showing the following errors: 15/03/24 12:43:30 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 155, fetching them 15/03/24 12:43:30 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 155, fetching them 15/03/24 12:43:30 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker actor = Actor[akka.tcp://sparkDriver@NN1:44353/user/MapOutputTracker#-957394722] 15/03/24 12:43:30 INFO spark.MapOutputTrackerWorker: Got the output locations 15/03/24 12:43:30 ERROR spark.MapOutputTracker: Missing an output location for shuffle 155 15/03/24 12:43:30 ERROR spark.MapOutputTracker: Missing an output location for shuffle 155 15/03/24 12:43:30 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 44623 15/03/24 12:43:30 INFO executor.Executor: Running task 5.0 in stage 232.960 (TID 44623) 15/03/24 12:43:30 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 44629
Re: hadoop input/output format advanced control
You can indeed override the Hadoop configuration at a per-RDD level - though it is a little more verbose, as in the below example, and you need to effectively make a copy of the hadoop Configuration: val thisRDDConf = new Configuration(sc.hadoopConfiguration) thisRDDConf.set(mapred.min.split.size, 5) val rdd = sc.newAPIHadoopFile(path, classOf[SequenceFileInputFormat[IntWritable, Text]], classOf[IntWritable], classOf[Text], thisRDDConf ) println(rdd.partitions.size) val rdd2 = sc.newAPIHadoopFile(path, classOf[SequenceFileInputFormat[IntWritable, Text]], classOf[IntWritable], classOf[Text] ) println(rdd2.partitions.size) For example, if I run the above on the following directory (some files I have lying around): -rw-r--r-- 1 Nick staff 0B Jul 11 2014 _SUCCESS -rw-r--r-- 1 Nick staff 291M Sep 16 2014 part-0 -rw-r--r-- 1 Nick staff 227M Sep 16 2014 part-1 -rw-r--r-- 1 Nick staff 370M Sep 16 2014 part-2 -rw-r--r-- 1 Nick staff 244M Sep 16 2014 part-3 -rw-r--r-- 1 Nick staff 240M Sep 16 2014 part-4 I get output: 15/03/24 20:43:12 INFO FileInputFormat: Total input paths to process : 5 *5* ... and then for the second RDD: 15/03/24 20:43:12 INFO SparkContext: Created broadcast 1 from newAPIHadoopFile at TestHash.scala:41 *45* As expected. Though a more succinct way of passing in those conf options would be nice - but this should get you what you need. On Mon, Mar 23, 2015 at 10:36 PM, Koert Kuipers ko...@tresata.com wrote: currently its pretty hard to control the Hadoop Input/Output formats used in Spark. The conventions seems to be to add extra parameters to all methods and then somewhere deep inside the code (for example in PairRDDFunctions.saveAsHadoopFile) all these parameters get translated into settings on the Hadoop Configuration object. for example for compression i see codec: Option[Class[_ : CompressionCodec]] = None added to a bunch of methods. how scalable is this solution really? for example i need to read from a hadoop dataset and i dont want the input (part) files to get split up. the way to do this is to set mapred.min.split.size. now i dont want to set this at the level of the SparkContext (which can be done), since i dont want it to apply to input formats in general. i want it to apply to just this one specific input dataset i need to read. which leaves me with no options currently. i could go add yet another input parameter to all the methods (SparkContext.textFile, SparkContext.hadoopFile, SparkContext.objectFile, etc.). but that seems ineffective. why can we not expose a Map[String, String] or some other generic way to manipulate settings for hadoop input/output formats? it would require adding one more parameter to all methods to deal with hadoop input/output formats, but after that its done. one parameter to rule them all then i could do: val x = sc.textFile(/some/path, formatSettings = Map(mapred.min.split.size - 12345)) or rdd.saveAsTextFile(/some/path, formatSettings = Map(mapred.output.compress - true, mapred.output.compression.codec - somecodec))
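For reference, a self-contained restatement of the per-RDD override with string quoting restored; the path and the 512 MB minimum split size are illustrative values, not the ones from the run above:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat

// Copy the context-wide conf so the override stays local to this one RDD.
val oneRddConf = new Configuration(sc.hadoopConfiguration)
// A large minimum split size discourages splitting the input files for this read only.
oneRddConf.set("mapred.min.split.size", (512L * 1024 * 1024).toString)

val rdd = sc.newAPIHadoopFile(
  "/some/path",                                        // placeholder path
  classOf[SequenceFileInputFormat[IntWritable, Text]],
  classOf[IntWritable],
  classOf[Text],
  oneRddConf)
println(rdd.partitions.size)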
SparkSQL UDTs with Ordering
Hey all, Currently looking into UDTs and I was wondering if it is reasonable to add the ability to define an Ordering (or if this is possible, then how)? Currently it will throw an error when non-Native types are used. Thanks! -Pat
java.lang.OutOfMemoryError: unable to create new native thread
Hello, I am seeing various crashes in spark on large jobs which all share a similar exception: java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:714) I increased nproc (i.e. ulimit -u) 10-fold, but it doesn't help. Does anyone know how to avoid those kinds of errors? Noteworthy: I added -XX:ThreadStackSize=10m on both driver and executor extra java options, which might have amplified the problem. Thanks for your help, Thomas
Re: Spark-thriftserver Issue
You can try to set it in spark-env.sh. # - SPARK_LOG_DIR Where log files are stored. (Default: ${SPARK_HOME}/logs) # - SPARK_PID_DIR Where the pid file is stored. (Default: /tmp) Thanks. Zhan Zhang On Mar 24, 2015, at 12:10 PM, Anubhav Agarwal anubha...@gmail.com wrote: Zhan, specifying the port fixed the port issue. Is it possible to specify the log directory while starting the spark thriftserver? Still getting this error even though the folder exists and everyone has permission to use that directory. drwxr-xr-x 2 root root 4096 Mar 24 19:04 spark-events Exception in thread main java.lang.IllegalArgumentException: Log directory /tmp/spark-events does not exist. at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:99) at org.apache.spark.SparkContext.init(SparkContext.scala:399) at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:49) at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:58) at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) On Mon, Mar 23, 2015 at 6:51 PM, Zhan Zhang zzh...@hortonworks.com wrote: Probably the port is already used by others, e.g., hive. You can change the port similar to below ./sbin/start-thriftserver.sh --master yarn --executor-memory 512m --hiveconf hive.server2.thrift.port=10001 Thanks. Zhan Zhang On Mar 23, 2015, at 12:01 PM, Neil Dev neilk...@gmail.com wrote: Hi, I am having an issue starting spark-thriftserver. I'm running Spark 1.3 with Hadoop 2.4.0. I would like to be able to change its port too, so I can have hive-thriftserver as well as spark-thriftserver running at the same time. Starting spark thriftserver:- sudo ./start-thriftserver.sh --master spark://ip-172-31-10-124:7077 --executor-memory 2G Error:- I created the folder manually but still getting the following error Exception in thread main java.lang.IllegalArgumentException: Log directory /tmp/spark-events does not exist. I am getting the following error 15/03/23 15:07:02 ERROR thrift.ThriftCLIService: Error: org.apache.thrift.transport.TTransportException: Could not create ServerSocket on address 0.0.0.0/0.0.0.0:1. at org.apache.thrift.transport.TServerSocket.init(TServerSocket.java:93) at org.apache.thrift.transport.TServerSocket.init(TServerSocket.java:79) at org.apache.hive.service.auth.HiveAuthFactory.getServerSocket(HiveAuthFactory.java:236) at org.apache.hive.service.cli.thrift.ThriftBinaryCLIService.run(ThriftBinaryCLIService.java:69) at java.lang.Thread.run(Thread.java:745) Thanks Neil
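One more note on the remaining error: the stack trace goes through EventLoggingListener, which reads the Spark event log location rather than the daemon log directory, so the setting to check may be spark.eventLog.dir; this is an inference from the trace, not something confirmed in the thread. A sketch, with a placeholder HDFS path:

# conf/spark-defaults.conf
spark.eventLog.enabled  true
spark.eventLog.dir      hdfs:///user/spark/applicationHistory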
diffrence in PCA of MLib vs H2o in R
I am trying to compute PCA using computePrincipalComponents. I also computed PCA using H2O in R and R's prcomp. The answers I get from H2O and R's prcomp (non-H2O) are the same when I set the options for H2O as standardized=FALSE and for R's prcomp as center=FALSE. How do I make sure that the settings for MLlib PCA are the same as the ones I am using for H2O or prcomp? Thanks Roni
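For reference, a sketch of making the preprocessing explicit on the Spark side before calling computePrincipalComponents, so it can be lined up with H2O's standardize and prcomp's center/scale. settings. Variable names are placeholders, and whether an explicit centering step duplicates work MLlib already does internally is an assumption worth verifying:

import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// rows: RDD[org.apache.spark.mllib.linalg.Vector] holding the input data.
// Center only (withMean = true) and do not scale, mirroring prcomp's defaults;
// flip the flags to mimic a center = FALSE or fully standardized run.
val scaler = new StandardScaler(withMean = true, withStd = false).fit(rows)
val centered = scaler.transform(rows)
val pcs = new RowMatrix(centered).computePrincipalComponents(5) // top 5 components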
Re: Hive context datanucleus error
Has this issue been fixed in Spark 1.2? https://issues.apache.org/jira/browse/SPARK-2624 On Mon, Mar 23, 2015 at 9:19 PM, Udit Mehta ume...@groupon.com wrote: I am trying to run a simple query to view tables in my hive metastore using hive context. I am getting this error: spark Persistence process has been specified to use a ClassLoader Resolve of name datanucleus yet this has not been found by the DataNucleus plugin mechanism. Please check your CLASSPATH and plugin specification. I am able to access the metastore using spark-sql. Can someone point out what the issue could be? thanks
Re: Standalone Scheduler VS YARN Performance
By any chance does this thread address look similar: http://apache-spark-developers-list.1001551.n3.nabble.com/Lost-executor-on-YARN-ALS-iterations-td7916.html ? On Tue, Mar 24, 2015 at 5:23 AM Harut Martirosyan harut.martiros...@gmail.com wrote: What is performance overhead caused by YARN, or what configurations are being changed when the app is ran through YARN? The following example: sqlContext.sql(SELECT dayStamp(date), count(distinct deviceId) AS c FROM full GROUP BY dayStamp(date) ORDER BY c DESC LIMIT 10) .collect() runs on shell when we use standalone scheduler: ./spark-shell --master sparkmaster:7077 --executor-memory 20g --executor-cores 10 --driver-memory 10g --num-executors 8 and fails due to losing an executor, when we run it through YARN. ./spark-shell --master yarn-client --executor-memory 20g --executor-cores 10 --driver-memory 10g --num-executors 8 There are no evident logs, just messages that executors are being lost, and connection refused errors, (apparently due to executor failures) The cluster is the same, 8 nodes, 64Gb RAM each. Format is parquet. -- RGRDZ Harut
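If that thread is the same issue, the executors are most likely being killed by YARN for exceeding container memory (the JVM's off-heap overhead on top of the 20g heap), and the usual mitigation is to raise the overhead allowance. A sketch only, with an arbitrary value, since the diagnosis is not confirmed here:

./spark-shell --master yarn-client --executor-memory 20g --executor-cores 10 --driver-memory 10g --num-executors 8 --conf spark.yarn.executor.memoryOverhead=2048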
Re: Optimal solution for getting the header from CSV with Spark
Instead of data.zipWithIndex().filter(_._2==0), which will cause Spark to read the whole file, use data.take(1), which is simpler. From the Rdd.take documentation, it works by first scanning one partition, and using the results from that partition to estimate the number of additional partitions needed to satisfy the limit. In this case, it will trivially stop at the first. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Tue, Mar 24, 2015 at 7:12 AM, Spico Florin spicoflo...@gmail.com wrote: Hello! I would like to know what is the optimal solution for getting the header with from a CSV file with Spark? My aproach was: def getHeader(data: RDD[String]): String = { data.zipWithIndex().filter(_._2==0).map(x=x._1).take(1).mkString() } Thanks.
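In other words, the helper from the original question reduces to something like this (a sketch; returning an empty string for an empty RDD is an assumption about what the caller wants):

import org.apache.spark.rdd.RDD

def getHeader(data: RDD[String]): String =
  data.take(1).headOption.getOrElse("")  // scans only the first partition in the common case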
Re: Invalid ContainerId ... Caused by: java.lang.NumberFormatException: For input string: e04
Thanks All - perhaps I misread the earlier posts as dependencies with the Hadoop version, but the key is also CDH 5.3.2 (not just Hadoop 2.5 v/s 2.4) etc. After adding the classpath as Marcelo/Harsh suggested (loading CDH libs in front), I am able to get spark-shell started without the invalid container error etc, so that issue is solved. When I run any query, it gives java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode; This seems to be a Guava lib version issue that has been known ... I will look into it. Thanks again! On Tue, Mar 24, 2015 at 12:50 PM, Harsh J ha...@cloudera.com wrote: My comment's still the same: Runtime-link-via-classpath Spark to use CDH 5.3.2 libraries, just like your cluster does, not Apache Hadoop 2.5.0 (which CDH is merely based on, but carries several backports on top that aren't in Apache Hadoop 2.5.0, one of which addresses this parsing trouble). You do not need to recompile Spark, just alter the hadoop libraries in its classpath to be those of the CDH server version (overwrite from parcels, etc.). On Wed, Mar 25, 2015 at 1:06 AM, Manoj Samel manojsamelt...@gmail.com wrote: I recompiled Spark 1.3 with Hadoop 2.5; it still gives the same stack trace. A quick browse into the stack trace with Hadoop 2.5.0 org.apache.hadoop.yarn.util.ConverterUtils ... 1. toContainerId gets parameter containerId which I assume is container_e06_1427223073530_0001_01_01 2. It splits it using public static final Splitter _SPLITTER = Splitter.on('_').trimResults(); 3. Line 172 checks container prefix with CONTAINER_PREFIX which is valid (container) 4. It calls toApplicationAttemptId 5. toApplicationAttemptId tries Long.parseLong(it.next()) on e06 and dies Seems like it is not expecting a non-numeric character. Is this a YARN issue? Thanks, On Tue, Mar 24, 2015 at 8:25 AM, Manoj Samel manoj.sa...@gmail.com wrote: I'll compile Spark with Hadoop libraries and try again ... Thanks, Manoj On Mar 23, 2015, at 10:34 PM, Harsh J ha...@cloudera.com wrote: This may happen if you are using different versions of CDH5 jars between Spark and the cluster. Can you ensure your Spark's Hadoop CDH jars match the cluster version exactly, since you seem to be using a custom version of Spark (out of CDH) here? On Tue, Mar 24, 2015 at 7:32 AM, Manoj Samel manojsamelt...@gmail.com wrote: x-post to CDH list for any insight ... Thanks, -- Forwarded message -- From: Manoj Samel manojsamelt...@gmail.com Date: Mon, Mar 23, 2015 at 6:32 PM Subject: Invalid ContainerId ... Caused by: java.lang.NumberFormatException: For input string: e04 To: user@spark.apache.org Spark 1.3, CDH 5.3.2, Kerberos Setup works fine with base configuration, spark-shell can be used in yarn client mode etc.
When work recovery feature is enabled via http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/admin_ha_yarn_work_preserving_recovery.html, the spark-shell fails with following log 15/03/24 01:20:16 ERROR yarn.ApplicationMaster: Uncaught exception: java.lang.IllegalArgumentException: Invalid ContainerId: container_e04_1427159778706_0002_01_01 at org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:182) at org.apache.spark.deploy.yarn.YarnRMClient.getAttemptId(YarnRMClient.scala:93) at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:83) at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:576) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59) at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:574) at org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:597) at org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala) Caused by: java.lang.NumberFormatException: For input string: e04 at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Long.parseLong(Long.java:589) at java.lang.Long.parseLong(Long.java:631) at org.apache.hadoop.yarn.util.ConverterUtils.toApplicationAttemptId(ConverterUtils.java:137) at org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:177) ... 12 more 15/03/24 01:20:16 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 10, (reason:
Re: Invalid ContainerId ... Caused by: java.lang.NumberFormatException: For input string: e04
Hi there, On Tue, Mar 24, 2015 at 1:40 PM, Manoj Samel manojsamelt...@gmail.com wrote: When I run any query, it gives java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode; Are you running a custom-compiled Spark by any chance? Specifically, one you built with sbt? That would hit this problem, because the path I suggested (/usr/lib/hadoop/client/*) contains an older guava library, which would override the one shipped with the sbt-built Spark. If you build Spark with maven, or use the pre-built Spark distro, or specifically filter out the guava jar from your classpath when setting up the Spark job, things should work. -- Marcelo
updateStateByKey - Seq[V] order
Hi Does updateStateByKey pass elements to updateFunc (in Seq[V]) in order in which they appear in the RDD? My guess is no which means updateFunc needs to be commutative. Am I correct? I've asked this question before but there were no takers. Here's the scala docs for updateStateByKey /** * Return a new state DStream where the state for each key is updated by applying * the given function on the previous state of the key and the new values of each key. * Hash partitioning is used to generate the RDDs with Spark's default number of partitions. * @param updateFunc State update function. If `this` function returns None, then * corresponding state key-value pair will be eliminated. * @tparam S State type */ def updateStateByKey[S: ClassTag]( updateFunc: (Seq[V], Option[S]) = Option[S] ): DStream[(K, S)] = { updateStateByKey(updateFunc, defaultPartitioner()) }
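For what it's worth, the safe assumption is that no ordering is guaranteed, so the update function should not depend on the order of the values; a minimal sketch with a commutative, associative update (the pairs DStream and Int counts are hypothetical):

// pairs: DStream[(String, Int)], e.g. per-batch counts keyed by word
val updateFunc = (values: Seq[Int], state: Option[Int]) =>
  Some(values.sum + state.getOrElse(0))
val runningCounts = pairs.updateStateByKey[Int](updateFunc)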
Re: java.lang.OutOfMemoryError: unable to create new native thread
I doubt you're hitting the limit of threads you can spawn, but as you say, running out of memory that the JVM process is allowed to allocate, since your threads are grabbing stacks 10x bigger than usual. The thread stacks are 4GB by themselves. I suppose you can't avoid upping the stack size so much? If so, then I think you need to make more, smaller executors instead? On Tue, Mar 24, 2015 at 7:38 PM, Thomas Gerber thomas.ger...@radius.com wrote: Hello, I am seeing various crashes in spark on large jobs which all share a similar exception: java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:714) I increased nproc (i.e. ulimit -u) 10-fold, but it doesn't help. Does anyone know how to avoid those kinds of errors? Noteworthy: I added -XX:ThreadStackSize=10m on both driver and executor extra java options, which might have amplified the problem. Thanks for your help, Thomas
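For scale: a few hundred threads at 10 MB of stack each works out to roughly 4 GB of native memory per JVM, which matches the estimate above. A sketch of dialing the stacks back down (the 2m value is arbitrary; -Xss is the more common spelling of the same knob as ThreadStackSize):

spark-submit --conf "spark.driver.extraJavaOptions=-Xss2m" --conf "spark.executor.extraJavaOptions=-Xss2m" ...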
Re: Registering custom UDAFs with HiveConetxt in SparkSQL, how?
Shahab - This should do the trick until Hao's changes are out: sqlContext.sql(create temporary function foobar as 'com.myco.FoobarUDAF'); sqlContext.sql(select foobar(some_column) from some_table); This works without requiring to 'deploy' a JAR with the UDAF in it - just make sure the UDAF is in your project's classpath. On Tue, Mar 10, 2015 at 8:21 PM, Cheng, Hao hao.ch...@intel.com wrote: Oh, sorry, my bad, currently Spark SQL doesn't provide the user interface for UDAF, but it can work seamlessly with Hive UDAF (via HiveContext). I am also working on the UDAF interface refactoring, after that we can provide the custom interface for extension. https://github.com/apache/spark/pull/3247 *From:* shahab [mailto:shahab.mok...@gmail.com] *Sent:* Wednesday, March 11, 2015 1:44 AM *To:* Cheng, Hao *Cc:* user@spark.apache.org *Subject:* Re: Registering custom UDAFs with HiveConetxt in SparkSQL, how? Thanks Hao, But my question concerns UDAF (user defined aggregation function ) not UDTF( user defined type function ). I appreciate if you could point me to some starting point on UDAF development in Spark. Thanks Shahab On Tuesday, March 10, 2015, Cheng, Hao hao.ch...@intel.com wrote: Currently, Spark SQL doesn't provide interface for developing the custom UDTF, but it can work seamless with Hive UDTF. I am working on the UDTF refactoring for Spark SQL, hopefully will provide an Hive independent UDTF soon after that. *From:* shahab [mailto:shahab.mok...@gmail.com] *Sent:* Tuesday, March 10, 2015 5:44 PM *To:* user@spark.apache.org *Subject:* Registering custom UDAFs with HiveConetxt in SparkSQL, how? Hi, I need o develop couple of UDAFs and use them in the SparkSQL. While UDFs can be registered as a function in HiveContext, I could not find any documentation of how UDAFs can be registered in the HiveContext?? so far what I have found is to make a JAR file, out of developed UDAF class, and then deploy the JAR file to SparkSQL . But is there any way to avoid deploying the jar file and register it programmatically? best, /Shahab
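The same two calls written out with explicit string quoting (function, class and table names are placeholders):

sqlContext.sql("create temporary function foobar as 'com.myco.FoobarUDAF'")
sqlContext.sql("select foobar(some_column) from some_table")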
Re: Question about Data Sources API
On Tue, Mar 24, 2015 at 12:57 AM, Ashish Mukherjee ashish.mukher...@gmail.com wrote: 1. Is the Data Source API stable as of Spark 1.3.0? It is marked DeveloperApi, but in general we do not plan to change even these APIs unless there is a very compelling reason to. 2. The Data Source API seems to be available only in Scala. Is there any plan to make it available for Java too? We tried to make all the suggested interfaces (other than CatalystScan which exposes internals and is only for experimentation) usable from Java. Is there something in particular you are having trouble with? 3. Are only filters and projections pushed down to the data source and all the data pulled into Spark for other processing? For now, this is all that is provided by the public stable API. We left a hook for more powerful push downs (sqlContext.experimental.extraStrategies), and would be interested in feedback on other operations we should push down as we expand the API.
Re: java.lang.OutOfMemoryError: unable to create new native thread
Additional notes: I did not find anything wrong with the number of threads (ps -u USER -L | wc -l): around 780 on the master and 400 on executors. I am running on 100 r3.2xlarge. On Tue, Mar 24, 2015 at 12:38 PM, Thomas Gerber thomas.ger...@radius.com wrote: Hello, I am seeing various crashes in spark on large jobs which all share a similar exception: java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:714) I increased nproc (i.e. ulimit -u) 10 fold, but it doesn't help. Does anyone know how to avoid those kinds of errors? Noteworthy: I added -XX:ThreadStackSize=10m on both driver and executor extra java options, which might have amplified the problem. Thanks for you help, Thomas
Re: Hadoop 2.5 not listed in Spark 1.4 build page
Hadoop 2.5 would be referenced as via -Dhadoop-2.5 using the profile -Phadoop-2.4 Please note earlier in the link the section: # Apache Hadoop 2.4.X or 2.5.X mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package Versions of Hadoop after 2.5.X may or may not work with the -Phadoop-2.4 profile (they were released after this version of Spark). HTH! On Tue, Mar 24, 2015 at 10:28 AM Manoj Samel manojsamelt...@gmail.com wrote: http://spark.apache.org/docs/latest/building-spark.html#packaging-without-hadoop-dependencies-for-yarn does not list hadoop 2.5 in the Hadoop version table etc. I assume it is still OK to compile with -Pyarn -Phadoop-2.5 for use with Hadoop 2.5 (cdh 5.3.2) Thanks,
Re: Hadoop 2.5 not listed in Spark 1.4 build page
The right invocation is still a bit different: ... -Phadoop-2.4 -Dhadoop.version=2.5.0 hadoop-2.4 == Hadoop 2.4+ On Tue, Mar 24, 2015 at 5:44 PM, Denny Lee denny.g@gmail.com wrote: Hadoop 2.5 would be referenced as via -Dhadoop-2.5 using the profile -Phadoop-2.4 Please note earlier in the link the section: # Apache Hadoop 2.4.X or 2.5.X mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package Versions of Hadoop after 2.5.X may or may not work with the -Phadoop-2.4 profile (they were released after this version of Spark). HTH! On Tue, Mar 24, 2015 at 10:28 AM Manoj Samel manojsamelt...@gmail.com wrote: http://spark.apache.org/docs/latest/building-spark.html#packaging-without-hadoop-dependencies-for-yarn does not list hadoop 2.5 in the Hadoop version table etc. I assume it is still OK to compile with -Pyarn -Phadoop-2.5 for use with Hadoop 2.5 (cdh 5.3.2) Thanks,
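Putting the correction together with the original question, a full build invocation for a CDH 5.3.2 cluster would look roughly like the line below; the -cdh suffix on the version string is an assumption about the CDH artifact naming, not something stated in this thread:

mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.3.2 -DskipTests clean package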
Re: Does HiveContext connect to HiveServer2?
Another question related to this: how can we propagate the hive-site.xml to all workers when running in yarn-cluster mode? On Tue, Mar 24, 2015 at 10:09 AM, Marcelo Vanzin van...@cloudera.com wrote: It does neither. If you provide a Hive configuration to Spark, HiveContext will connect to your metastore server, otherwise it will create its own metastore in the working directory (IIRC). On Tue, Mar 24, 2015 at 8:58 AM, nitinkak001 nitinkak...@gmail.com wrote: I am wondering if HiveContext connects to HiveServer2 or does it work through Hive CLI. The reason I am asking is because Cloudera has deprecated Hive CLI. If the connection is through HiveServer2, is there a way to specify user credentials? -- Marcelo
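On the hive-site.xml question: one commonly used approach, offered here as a suggestion to verify rather than something from this thread, is to ship the file with the application so the driver running inside the YARN cluster picks it up (the path is just an example):

spark-submit --master yarn-cluster --files /etc/hive/conf/hive-site.xml ...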
spark worker on mesos slave | possible networking config issue
is there some setting i am missing: this is my spark-env.sh export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so export SPARK_EXECUTOR_URI=http://100.125.5.93/sparkx.tgz export SPARK_LOCAL_IP=127.0.0.1 here is what i see on the slave node. less 20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56/stderr WARNING: Logging before InitGoogleLogging() is written to STDERR I0324 02:30:29.389225 27755 fetcher.cpp:76] Fetching URI ' http://100.125.5.93/sparkx.tgz' I0324 02:30:29.389361 27755 fetcher.cpp:126] Downloading ' http://100.125.5.93/sparkx.tgz' to '/tmp/mesos/slaves/20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56/sparkx.tgz' I0324 02:30:35.353446 27755 fetcher.cpp:64] Extracted resource '/tmp/mesos/slaves/20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56/sparkx.tgz' into '/tmp/mesos/slaves/20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56' Spark assembly has been built with Hive, including Datanucleus jars on classpath Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 15/03/24 02:30:37 INFO MesosExecutorBackend: Registered signal handlers for [TERM, HUP, INT] I0324 02:30:37.071077 27863 exec.cpp:132] Version: 0.21.1 I0324 02:30:37.080971 27885 exec.cpp:206] Executor registered on slave 20150226-160708-78932-5050-8971-S0 15/03/24 02:30:37 INFO MesosExecutorBackend: Registered with Mesos as executor ID 20150226-160708-78932-5050-8971-S0 with 1 cpus 15/03/24 02:30:37 INFO SecurityManager: Changing view acls to: ubuntu 15/03/24 02:30:37 INFO SecurityManager: Changing modify acls to: ubuntu 15/03/24 02:30:37 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ubuntu); users with modify permissions: Set(ubuntu) 15/03/24 02:30:37 INFO Slf4jLogger: Slf4jLogger started 15/03/24 02:30:37 INFO Remoting: Starting remoting 15/03/24 02:30:38 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor@mesos-si2:50542] 15/03/24 02:30:38 INFO Utils: Successfully started service 'sparkExecutor' on port 50542. 15/03/24 02:30:38 INFO AkkaUtils: Connecting to MapOutputTracker: akka.tcp://sparkDriver@localhost:51849/user/MapOutputTracker 15/03/24 02:30:38 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkDriver@localhost:51849]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. 
Reason: Connection refused: localhost/127.0.0.1:51849 akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://sparkDriver@localhost:51849/), Path(/user/MapOutputTracker)] at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65) at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59) at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72) at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58) at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74) at akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:110) at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73) at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:267) at akka.actor.EmptyLocalActorRef.specialHandle(ActorRef.scala:508) at akka.actor.DeadLetterActorRef.specialHandle(ActorRef.scala:541) at akka.actor.DeadLetterActorRef.$bang(ActorRef.scala:531) at akka.remote.RemoteActorRefProvider$RemoteDeadLetterActorRef.$bang(RemoteActorRefProvider.scala:87)
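A guess based on the log above: the driver is advertising itself as sparkDriver@localhost, which is what SPARK_LOCAL_IP=127.0.0.1 would produce, so the executors on the Mesos slaves try to connect back to their own loopback interface. If that is the cause, advertising a routable address should help; a sketch, with 100.125.5.93 standing in for whatever address the driver host is actually reachable on:

# spark-env.sh on the machine running the driver
export SPARK_LOCAL_IP=100.125.5.93
# or, per application: --conf spark.driver.host=100.125.5.93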
Spark SQL: Day of month from Timestamp
Hi guys. Basically, we had to define a UDF that extracts the day of month from a timestamp; is there a built-in function that we can use for it? -- RGRDZ Harut
Re: Spark SQL: Day of month from Timestamp
Hi You can use functions like year(date), month(date) Thanks Arush On Tue, Mar 24, 2015 at 12:46 PM, Harut Martirosyan harut.martiros...@gmail.com wrote: Hi guys. Basically, we had to define a UDF that extracts the day of month from a timestamp; is there a built-in function that we can use for it? -- RGRDZ Harut -- Arush Kharbanda || Technical Teamlead ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
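With a HiveContext, the Hive datetime UDFs are also available directly in SQL, which should cover the original ask; a sketch with placeholder table and column names (dayofmonth is the Hive name, day is an alias):

sqlContext.sql("SELECT dayofmonth(event_time) AS day_of_month, count(*) FROM events GROUP BY dayofmonth(event_time)")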
Question about Data Sources API
Hello, I have some questions related to the Data Sources API - 1. Is the Data Source API stable as of Spark 1.3.0? 2. The Data Source API seems to be available only in Scala. Is there any plan to make it available for Java too? 3. Are only filters and projections pushed down to the data source and all the data pulled into Spark for other processing? Regards, Ashish
Re: FAILED SelectChannelConnector@0.0.0.0:4040 java.net.BindException: Address already in use
Does your application actually fail? That message just means there's another application listening on that port. Spark should try to bind to a different one after that and keep going. On Tue, Mar 24, 2015 at 12:43 PM, Roy rp...@njit.edu wrote: I get the following message each time I run a spark job: 15/03/24 15:35:56 WARN AbstractLifeCycle: FAILED SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address already in use The full trace is here: http://pastebin.com/xSvRN01f How do I fix this? I am on CDH 5.3.1 thanks roy -- Marcelo
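If the warning itself is the concern, the collision can also be avoided by starting the UI on an explicitly chosen free port; a sketch (4041 is arbitrary):

spark-submit --conf spark.ui.port=4041 ...
# or in conf/spark-defaults.conf: spark.ui.port 4041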