Re: Python + Spark unable to connect to S3 bucket .... Invalid hostname in URI
So after doing some more research I found the root cause of the problem. The bucket name we were using contained an underscore '_'. This goes against the new requirements for naming buckets. Using a bucket that is not named with an underscore solved the issue. If anyone else runs into this problem, I hope this will help them out. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Python-Spark-unable-to-connect-to-S3-bucket-Invalid-hostname-in-URI-tp12076p12169.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
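For anyone hitting the same error: as far as I can tell, the S3 filesystem layer builds a java.net.URI from the bucket name, and the JDK's URI parser does not accept '_' in a hostname, so getHost() returns null and the request later fails with "Invalid hostname in URI". A quick standalone way to see this (the bucket names below are made up):

import java.net.URI

// A hyphenated bucket name parses to a normal host...
println(new URI("s3n://my-bucket/path/to/data").getHost)       // prints: my-bucket

// ...but '_' is not legal in a hostname, so getHost() is null, which is what
// later surfaces as "Invalid hostname in URI".
println(new URI("s3n://my_bucket/path/to/data").getHost)       // prints: null
println(new URI("s3n://my_bucket/path/to/data").getAuthority)  // prints: my_bucket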
spark on yarn cluster can't launch
The following does not run: ../bin/spark-submit --class org.apache.spark.examples.SparkPi \ --master yarn \ --deploy-mode cluster \ --verbose \ --num-executors 3 \ --driver-memory 4g \ --executor-memory 2g \ --executor-cores 1 \ ../lib/spark-examples*.jar \ 100 Exception in thread "main" java.lang.NullPointerException at org.apache.spark.deploy.yarn.Client$anonfun$logClusterResourceDetails$2.apply(Client.scala:109) at org.apache.spark.deploy.yarn.Client$anonfun$logClusterResourceDetails$2.apply(Client.scala:108) at org.apache.spark.Logging$class.logInfo(Logging.scala:58) However, when I remove --deploy-mode cluster \ the exception disappears. My understanding is that --deploy-mode cluster runs the job in yarn-cluster mode, and without it the default is yarn-client mode. But why does yarn-cluster mode throw this exception? Thanks -- cente...@gmail.com|齐忠
Re: Debugging Task not serializable
Hi Sourav, I will take a look at that too, thanks a lot for your help. Greetings, Juan 2014-07-30 10:58 GMT+02:00 Sourav Chandra sourav.chan...@livestream.com: While running the application, set this: -Dsun.io.serialization.extendedDebugInfo=true This is applicable for Java versions after 1.6. On Wed, Jul 30, 2014 at 2:13 PM, Juan Rodríguez Hortalá juan.rodriguez.hort...@gmail.com wrote: Akhil, Andry, thanks a lot for your suggestions. I will take a look at those JVM options. Greetings, Juan 2014-07-28 18:56 GMT+02:00 andy petrella andy.petre...@gmail.com: Also check the guides for the JVM option that prints messages for such problems. Sorry, sent from phone and don't know it by heart :/ On 28 Jul 2014, at 18:44, Akhil Das ak...@sigmoidanalytics.com wrote: A quick fix would be to implement java.io.Serializable in those classes which are causing this exception. Thanks Best Regards On Mon, Jul 28, 2014 at 9:21 PM, Juan Rodríguez Hortalá juan.rodriguez.hort...@gmail.com wrote: Hi all, I was wondering if someone has conceived a method for debugging Task not serializable: java.io.NotSerializableException errors, apart from commenting and uncommenting parts of the program, or just turning everything into Serializable. I find this kind of error very hard to debug, as these errors originate in the Spark runtime system. I'm using Spark from Java. Thanks a lot in advance, Juan -- Sourav Chandra Senior Software Engineer · · · · · · · · · · · · · · · · sourav.chan...@livestream.com o: +91 80 4121 8723 m: +91 988 699 3746 skype: sourav.chandra Livestream Ajmera Summit, First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main, 3rd Block, Koramangala Industrial Area, Bangalore 560034 www.livestream.com
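For reference, the flag Sourav mentions can be pushed to the executors as well as the driver through the standard extraJavaOptions settings; a minimal sketch follows (the property keys are ordinary Spark 1.x configuration, the app name is made up). Note that for the driver the option usually has to go on the spark-submit command line (--driver-java-options) or into spark-defaults.conf, because the driver JVM is already running by the time a SparkConf set in code is read.

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: ask the JVM to print the object-graph path that led to the
// NotSerializableException, on both the driver and the executors.
val conf = new SparkConf()
  .setAppName("serialization-debugging")
  .set("spark.driver.extraJavaOptions",
       "-Dsun.io.serialization.extendedDebugInfo=true")
  .set("spark.executor.extraJavaOptions",
       "-Dsun.io.serialization.extendedDebugInfo=true")
val sc = new SparkContext(conf)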
Re: spark streaming - lambda architecture
You may be interested in https://github.com/OryxProject/oryx, which is at heart exactly the lambda architecture on Spark Streaming, with ML pipelines on top. The architecture diagram and a peek at the code may give you a good example of how this could be implemented. I choose to view the batch layer as just a long-period streaming job on Spark Streaming, and implement the speed layer as a short-period streaming job. Summingbird is a good example too, although it uses Storm and MapReduce and is architected specifically for simple aggregations. I am not sure it generalizes, but you may not need anything complex. On Thu, Aug 14, 2014 at 10:27 PM, salemi alireza.sal...@udo.edu wrote: Hi, How would you implement the batch layer of the lambda architecture with spark/spark streaming? Thanks, Ali -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-lamda-architecture-tp12142.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
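To make the "batch layer as a long-period streaming job" idea concrete, here is a minimal sketch (the socket source, the key extraction and the window lengths are placeholders I chose, not anything from Oryx or the original question): the speed layer recomputes small windows every few seconds, while the "batch" view is the same computation over a much longer window, recomputed rarely.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, Minutes, StreamingContext}
import org.apache.spark.streaming.StreamingContext._   // pair-DStream operations in Spark 1.x

object LambdaSketch {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("lambda-sketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Placeholder source; in practice this would be e.g. KafkaUtils.createStream(...).
    val events = ssc.socketTextStream("localhost", 9999)
    val pairs = events.map(line => (line.split(",")(0), 1L))

    // Speed layer: small, frequent windows for a low-latency view.
    val speedView = pairs.reduceByKeyAndWindow(_ + _, Minutes(1), Seconds(5))
    speedView.print()

    // "Batch" layer: the same aggregation over a long window, recomputed rarely.
    val batchView = pairs.reduceByKeyAndWindow(_ + _, Minutes(60), Minutes(10))
    batchView.saveAsTextFiles("hdfs:///views/batch")

    ssc.start()
    ssc.awaitTermination()
  }
}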
Re: How to implement multinomial logistic regression(softmax regression) in Spark?
Did I not describe the problem clearly? Is anyone familiar with softmax regression? Thanks. Cui xp. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-implement-multinomial-logistic-regression-softmax-regression-in-Spark-tp11939p12175.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: spark won't build with maven
You are running a Continuous Compilation. AFAIK, it runs in an infinite loop and will compile only the modified files. For compiling with maven, have a look at these steps - https://spark.apache.org/docs/latest/building-with-maven.html Thanks, Visakh -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-won-t-build-with-maven-tp12173p12176.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: SparkR: split, apply, combine strategy for dataframes?
Thanks for your reply. I think that the problem was that SparkR tried to serialize the whole environment. Mind that the large dataframe was part of it. So every worker received their slice / partition (which is very small) plus the whole thing! So I deleted the large dataframe and list before parallelizing and the cluster ran without memory issues. Best, Carlos J. Gil Bellosta http://www.datanalytics.com 2014-08-15 3:53 GMT+02:00 Shivaram Venkataraman shiva...@eecs.berkeley.edu: Could you try increasing the number of slices with the large data set ? SparkR assumes that each slice (or partition in Spark terminology) can fit in memory of a single machine. Also is the error happening when you do the map function or does it happen when you combine the results ? Thanks Shivaram On Thu, Aug 14, 2014 at 3:53 PM, Carlos J. Gil Bellosta gilbello...@gmail.com wrote: Hello, I am having problems trying to apply the split-apply-combine strategy for dataframes using SparkR. I have a largish dataframe and I would like to achieve something similar to what ddply(df, .(id), foo) would do, only that using SparkR as computing engine. My df has a few million records and I would like to split it by id and operate on the pieces. These pieces are quite small in size: just a few hundred records. I do something along the following lines: 1) Use split to transform df into a list of dfs. 2) parallelize the resulting list as a RDD (using a few thousand slices) 3) map my function on the pieces using Spark. 4) recombine the results (do.call, rbind, etc.) My cluster works and I can perform medium sized batch jobs. However, it fails with my full df: I get a heap space out of memory error. It is funny as the slices are very small in size. Should I send smaller batches to my cluster? Is there any recommended general approach to these kind of split-apply-combine problems? Best, Carlos J. Gil Bellosta http://www.datanalytics.com - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Issues with S3 client library and Apache Spark
I've seen a couple of issues posted about this, but I never saw a resolution. When I'm using Spark 1.0.2 (and the spark-submit script to submit my jobs) and AWS SDK 1.8.7, I get the stack trace below. However, if I drop back to AWS SDK 1.3.26 (or anything from the AWS SDK 1.4.* family) then everything works fine. It would appear that after AWS SDK 1.4, there became a dependency on HTTP Client 4.2 (instead of 4.1). I would like to use the more recent versions of the AWS SDK (and not use something nearly 2 years old) so I'm curious whether anyone has figured out a workaround to this problem. Thanks. Darin. java.lang.NoSuchMethodError: org.apache.http.impl.conn.DefaultClientConnectionOperator.init(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)V at org.apache.http.impl.conn.PoolingClientConnectionManager.createConnectionOperator(PoolingClientConnectionManager.java:140) at org.apache.http.impl.conn.PoolingClientConnectionManager.init(PoolingClientConnectionManager.java:114) at org.apache.http.impl.conn.PoolingClientConnectionManager.init(PoolingClientConnectionManager.java:99) at com.amazonaws.http.ConnectionManagerFactory.createPoolingClientConnManager(ConnectionManagerFactory.java:29) at com.amazonaws.http.HttpClientFactory.createHttpClient(HttpClientFactory.java:97) at com.amazonaws.http.AmazonHttpClient.init(AmazonHttpClient.java:181) at com.amazonaws.AmazonWebServiceClient.init(AmazonWebServiceClient.java:119) at com.amazonaws.services.s3.AmazonS3Client.init(AmazonS3Client.java:408) at com.amazonaws.services.s3.AmazonS3Client.init(AmazonS3Client.java:390) at com.amazonaws.services.s3.AmazonS3Client.init(AmazonS3Client.java:374) at com.amazonaws.services.s3.AmazonS3Client.init(AmazonS3Client.java:313) at com.elsevier.s3.SimpleStorageService.clinit(SimpleStorageService.java:27) at com.elsevier.spark.XMLKeyPair.call(SDKeyMapKeyPairRDD.java:75) at com.elsevier.spark.XMLKeyPair.call(SDKeyMapKeyPairRDD.java:65) at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750) at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:779) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:769) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:680)
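I don't have a definitive fix, but beyond pinning the old SDK there are two things people tend to try: shade/relocate org.apache.http inside the application assembly jar, or flip Spark's experimental flag that makes executors prefer classes from the user's jars over the HttpClient 4.1 that ships on Spark's classpath. The latter is just a configuration property; a sketch of setting it follows, with the caveat that the flag was experimental in the 1.0.x line and only affects executor-side class loading, so treat it as something to test rather than a guaranteed workaround.

import org.apache.spark.SparkConf

// Sketch: prefer classes from the user-supplied jars (e.g. the newer httpclient
// pulled in by aws-java-sdk) over the versions already on the executor classpath.
val conf = new SparkConf()
  .setAppName("s3-sdk-test")
  .set("spark.files.userClassPathFirst", "true")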
Re: Seattle Spark Meetup: Spark at eBay - Troubleshooting the everyday issues Slides
Apologies, but we had restricted downloading of the slides to Seattle Spark Meetup members only - we actually meant to share them with everyone. We have since fixed this and now you can download them. HTH! On August 14, 2014 at 18:14:35, Denny Lee (denny.g@gmail.com) wrote: For those who were not able to attend the Seattle Spark Meetup - Spark at eBay - Troubleshooting the Everyday Issues, the slides have now been posted at: http://files.meetup.com/12063092/SparkMeetupAugust2014Public.pdf. Enjoy! Denny
Re: Spark webUI - application details page
Hi Andrew, I'm running something close to the present master (I compiled several days ago) but am having some trouble viewing history. I set spark.eventLog.dir to true, but continually receive the error message (via the web UI) Application history not found...No event logs found for application ml-pipeline in file:/tmp/spark-events/ml-pipeline-1408117588599. I tried 2 fixes: -I manually set spark.eventLog.dir to a path beginning with file:///, believe that perhaps the problem was an invalid protocol specification. -I inspected /tmp/spark-events manually and noticed that each job directory (and the files there-in) were owned by the user who launched the job and were not world readable. Since I run Spark from a dedicated Spark user, I set the files world readable but I still receive the same Application history not found error. Is there a configuration step I may be missing? -Brad On Thu, Aug 14, 2014 at 7:33 PM, Andrew Or and...@databricks.com wrote: Hi SK, Not sure if I understand you correctly, but here is how the user normally uses the event logging functionality: After setting spark.eventLog.enabled and optionally spark.eventLog.dir, the user runs his/her Spark application and calls sc.stop() at the end of it. Then he/she goes to the standalone Master UI (under http://master-url:8080 by default) and click on the application under the Completed Applications table. This will link to the Spark UI of the finished application in its completed state, under a path that looks like http://master-url:8080/history/app-Id. It won't be on http://localhost:4040; anymore because the port is now freed for new applications to bind their SparkUIs to. To access the file that stores the raw statistics, go to the file specified in spark.eventLog.dir. This is by default /tmp/spark-events, though in Spark 1.0.1 it may be in HDFS under the same path. I could be misunderstanding what you mean by the stats being buried in the console output, because the events are not logged to the console but to a file in spark.eventLog.dir. For all of this to work, of course, you have to run Spark in standalone mode (i.e. with master set to spark://master-url:7077). In other modes, you will need to use the history server instead. Does this make sense? Andrew 2014-08-14 18:08 GMT-07:00 SK skrishna...@gmail.com: More specifically, as indicated by Patrick above, in 1.0+, apps will have persistent state so that the UI can be reloaded. Is there a way to enable this feature in 1.0.1? thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-webUI-application-details-page-tp3490p12157.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
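For anyone else wiring this up, the two properties Andrew refers to can be set directly on the SparkConf (or in conf/spark-defaults.conf); a minimal sketch with a made-up HDFS path is below; the directory must already exist and be writable by the user running the application.

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: enable event logging so the standalone Master (or the history server)
// can rebuild the application UI after the job finishes.
val conf = new SparkConf()
  .setAppName("ml-pipeline")
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs:///user/spark/spark-events")
val sc = new SparkContext(conf)

// ... run the job ...

sc.stop()   // without stop(), the final "Application Complete" event is never written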
Running Spark shell on YARN
I've been using the standalone cluster all this time and it worked fine. Recently I started using another Spark cluster that is based on YARN, and I have no experience with YARN. The YARN cluster has 10 nodes and a total memory of 480G. I'm having trouble starting the spark-shell with enough memory. I'm doing a very simple operation - reading a 100GB file from HDFS and running a count on it. This fails due to out of memory on the executors. Can someone point me to the command line parameters I should use for spark-shell so that it can handle this? Thanks -Soumya
Re: Spark webUI - application details page
Hi, Ok, I was specifying --master local. I changed that to --master spark://localhostname:7077 and am now able to see the completed applications. It provides summary stats about runtime and memory usage, which is sufficient for me at this time. However it doesn't seem to archive the info in the application detail UI that lists detailed stats about the completed stages of the application - which would be useful for identifying bottleneck steps in a large application. I guess we need to capture the application detail UI screen before the app run completes or find a way to extract this info by parsing the Json log file in /tmp/spark-events. thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-webUI-application-details-page-tp3490p12187.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Running Spark shell on YARN
Hi Soumya, The driver's console output prints out how much memory is actually granted to each executor, so from there you can verify how much memory the executors are actually getting. You should use the '--executor-memory' argument in spark-shell. For instance, assuming each node has 48G of memory, bin/spark-shell --executor-memory 46g --master yarn We leave a small cushion for the OS so we don't take up all of the entire system's memory. This option also applies to the standalone mode you've been using, but if you have been using the ec2 scripts, we set spark.executor.memory in conf/spark-defaults.conf for you automatically so you don't have to specify it each time on the command line. Of course, you can also do the same in YARN. -Andrew 2014-08-15 10:45 GMT-07:00 Soumya Simanta soumya.sima...@gmail.com: I've been using the standalone cluster all this time and it worked fine. Recently I'm using another Spark cluster that is based on YARN and I've not experience with YARN. The YARN cluster has 10 nodes and a total memory of 480G. I'm having trouble starting the spark-shell with enough memory. I'm doing a very simple operation - reading a file 100GB from HDFS and running a count on it. This fails due to out of memory on the executors. Can someone point to the command line parameters that I should use for spark-shell so that it? Thanks -Soumya
Re: spark on yarn cluster can't launch
Hi 齐忠, Thanks for reporting this. You're correct that the default deploy mode is client. However, this seems to be a bug in the YARN integration code; we should not throw null pointer exception in any case. What version of Spark are you using? Andrew 2014-08-15 0:23 GMT-07:00 centerqi hu cente...@gmail.com: The code does not run as follows ../bin/spark-submit --class org.apache.spark.examples.SparkPi \ --master yarn \ --deploy-mode cluster \ --verbose \ --num-executors 3 \ --driver-memory 4g \ --executor-memory 2g \ --executor-cores 1 \ ../lib/spark-examples*.jar \ 100 Exception in thread main java.lang.NullPointerException at org.apache.spark.deploy.yarn.Client$anonfun$logClusterResourceDetails$2.apply(Client.scala:109) at org.apache.spark.deploy.yarn.Client$anonfun$logClusterResourceDetails$2.apply(Client.scala:108) at org.apache.spark.Logging$class.logInfo(Logging.scala:58) However, when I removed --deploy-mode cluster \ Exception disappear. I think with the deploy-mode cluster is running in yarn cluster mode, if not, the default will be run in yarn client mode. But why did yarn cluster get Exception? Thanks -- cente...@gmail.com|齐忠
[Spark Streaming] How can we use consecutive data points as the features?
Hi guys, We have a use case where we need to use consecutive data points to predict the status (yes, like using time series data to predict machine failure). Is there a straightforward way to do this in Spark Streaming? If all consecutive data points are in one batch, it's not complicated, except that the order of data points is not guaranteed within the batch, so I have to use the timestamp in each data point to reach my goal. However, when the consecutive data points are spread across two or more batches, how can I do this? From my understanding, I need to use state management. But it's not easy to use updateStateByKey; e.g., I will need to add one data point and delete the oldest data point, but cannot do them in a batch fashion. Does anyone in the community have a similar use case, and how do you solve this? Thank you. Cheers, Fang, Yan yanfang...@gmail.com +1 (206) 849-4108
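One way to keep a sliding buffer of the most recent points per key across batch boundaries is updateStateByKey; below is a minimal sketch (the Point record, the key field and the buffer size are placeholder assumptions on my part, not from the original question). The state is the last N points sorted by event timestamp, so ordering inside a batch does not matter, and the oldest point is dropped whenever a newer one arrives.

import org.apache.spark.streaming.StreamingContext._   // pair-DStream operations in Spark 1.x
import org.apache.spark.streaming.dstream.DStream

// Placeholder record: a key (e.g. machine id), an event timestamp and a reading.
case class Point(machineId: String, ts: Long, value: Double)

val windowSize = 10   // keep the last 10 points per machine

// State = the most recent `windowSize` points in timestamp order. Note that
// updateStateByKey requires ssc.checkpoint(...) to be configured.
def updateBuffer(newPoints: Seq[Point], state: Option[Vector[Point]]): Option[Vector[Point]] = {
  val merged = (state.getOrElse(Vector.empty) ++ newPoints)
    .sortBy(_.ts)            // batches do not preserve order, so sort by timestamp
    .takeRight(windowSize)   // drop the oldest points
  Some(merged)
}

def slidingBuffers(points: DStream[Point]): DStream[(String, Vector[Point])] =
  points.map(p => (p.machineId, p)).updateStateByKey(updateBuffer _)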
Re: How to implement multinomial logistic regression(softmax regression) in Spark?
Hi Cui You can take a look at multinomial logistic regression PR I created. https://github.com/apache/spark/pull/1379 Ref: http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297 Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Fri, Aug 15, 2014 at 2:24 AM, Cui xp lifeiniao...@gmail.com wrote: Did I describe the problem not clearly? Is anyone familiar to softmax regression? Thanks. Cui xp. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-implement-multinomial-logistic-regression-softmax-regression-in-Spark-tp11939p12175.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Running Spark shell on YARN
I just checked the YARN config and it looks like I need to change this value. Should it be raised to 48G (the max memory allocated to YARN) per node? <property> <name>yarn.scheduler.maximum-allocation-mb</name> <value>6144</value> <source>java.io.BufferedInputStream@2e7e1ee</source> </property> On Fri, Aug 15, 2014 at 2:37 PM, Soumya Simanta soumya.sima...@gmail.com wrote: Andrew, Thanks for your response. When I try to do the following: ./spark-shell --executor-memory 46g --master yarn I get the following error. Exception in thread "main" java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment. at org.apache.spark.deploy.SparkSubmitArguments.checkRequiredArguments(SparkSubmitArguments.scala:166) at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:61) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:50) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) After this I set the following env variable: export YARN_CONF_DIR=/usr/lib/hadoop-yarn/etc/hadoop/ The program launches but then halts with the following error. *14/08/15 14:33:22 ERROR yarn.Client: Required executor memory (47104 MB), is above the max threshold (6144 MB) of this cluster.* I guess this is some YARN setting that is not set correctly. Thanks -Soumya On Fri, Aug 15, 2014 at 2:19 PM, Andrew Or and...@databricks.com wrote: Hi Soumya, The driver's console output prints out how much memory is actually granted to each executor, so from there you can verify how much memory the executors are actually getting. You should use the '--executor-memory' argument in spark-shell. For instance, assuming each node has 48G of memory, bin/spark-shell --executor-memory 46g --master yarn We leave a small cushion for the OS so we don't take up all of the entire system's memory. This option also applies to the standalone mode you've been using, but if you have been using the ec2 scripts, we set spark.executor.memory in conf/spark-defaults.conf for you automatically so you don't have to specify it each time on the command line. Of course, you can also do the same in YARN. -Andrew 2014-08-15 10:45 GMT-07:00 Soumya Simanta soumya.sima...@gmail.com: I've been using the standalone cluster all this time and it worked fine. Recently I'm using another Spark cluster that is based on YARN and I've not experience with YARN. The YARN cluster has 10 nodes and a total memory of 480G. I'm having trouble starting the spark-shell with enough memory. I'm doing a very simple operation - reading a file 100GB from HDFS and running a count on it. This fails due to out of memory on the executors. Can someone point to the command line parameters that I should use for spark-shell so that it? Thanks -Soumya
Re: Running Spark shell on YARN
We generally recommend setting yarn.scheduler.maximum-allocation-mbto the maximum node capacity. -Sandy On Fri, Aug 15, 2014 at 11:41 AM, Soumya Simanta soumya.sima...@gmail.com wrote: I just checked the YARN config and looks like I need to change this value. Should be upgraded to 48G (the max memory allocated to YARN) per node ? property nameyarn.scheduler.maximum-allocation-mb/name value6144/value sourcejava.io.BufferedInputStream@2e7e1ee/source /property On Fri, Aug 15, 2014 at 2:37 PM, Soumya Simanta soumya.sima...@gmail.com wrote: Andrew, Thanks for your response. When I try to do the following. ./spark-shell --executor-memory 46g --master yarn I get the following error. Exception in thread main java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment. at org.apache.spark.deploy.SparkSubmitArguments.checkRequiredArguments(SparkSubmitArguments.scala:166) at org.apache.spark.deploy.SparkSubmitArguments.init(SparkSubmitArguments.scala:61) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:50) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) After this I set the following env variable. export YARN_CONF_DIR=/usr/lib/hadoop-yarn/etc/hadoop/ The program launches but then halts with the following error. *14/08/15 14:33:22 ERROR yarn.Client: Required executor memory (47104 MB), is above the max threshold (6144 MB) of this cluster.* I guess this is some YARN setting that is not set correctly. Thanks -Soumya On Fri, Aug 15, 2014 at 2:19 PM, Andrew Or and...@databricks.com wrote: Hi Soumya, The driver's console output prints out how much memory is actually granted to each executor, so from there you can verify how much memory the executors are actually getting. You should use the '--executor-memory' argument in spark-shell. For instance, assuming each node has 48G of memory, bin/spark-shell --executor-memory 46g --master yarn We leave a small cushion for the OS so we don't take up all of the entire system's memory. This option also applies to the standalone mode you've been using, but if you have been using the ec2 scripts, we set spark.executor.memory in conf/spark-defaults.conf for you automatically so you don't have to specify it each time on the command line. Of course, you can also do the same in YARN. -Andrew 2014-08-15 10:45 GMT-07:00 Soumya Simanta soumya.sima...@gmail.com: I've been using the standalone cluster all this time and it worked fine. Recently I'm using another Spark cluster that is based on YARN and I've not experience with YARN. The YARN cluster has 10 nodes and a total memory of 480G. I'm having trouble starting the spark-shell with enough memory. I'm doing a very simple operation - reading a file 100GB from HDFS and running a count on it. This fails due to out of memory on the executors. Can someone point to the command line parameters that I should use for spark-shell so that it? Thanks -Soumya
Hardware Context on Spark Worker Hosts
Is it practical to maintain a hardware context on each of the worker hosts in Spark? In my particular problem I have an OpenCL (or JavaCL) context which has two things associated with it: - Data stored on a GPU - Code compiled for the GPU If the context goes away, the data is lost and the code must be recompiled. The calling code is quite basic and is intended to be used in batch and streaming modes. Here is the batch version:

object Classify {
  def run(sparkContext: SparkContext, config: com.infoblox.Config) {
    val subjects = Subject.load(sparkContext, config)
    val classifications = subjects.mapPartitions(subjectIter => classify(config.gpu, subjectIter)).reduceByKey(_ + _)
    classifications.saveAsTextFile(config.output)
  }

  private def classify(gpu: Option[String], subjects: Iterator[Subject]): Iterator[(String, Long)] = {
    val javaCLContext = JavaCLContext.build(gpu)      // <--
    val classifier = Classifier.build(javaCLContext)  // <--
    subjects.foreach(subject => classifier.classifyInBatches(subject))
    classifier.classifyRemaining
    val results = classifier.results
    classifier.release
    results.result.iterator
  }
}

The two lines with <-- on them are where the JavaCL/OpenCL context is currently created and used, which is wrong. The JavaCL context is specific to the host, not the map. How do I keep this context between maps, and over a longer duration for a streaming job? Thanks, Chris... - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
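A pattern often suggested for per-JVM native resources is a lazily initialized singleton on the executors, so the OpenCL context is built once per worker JVM and reused by every task (batch or streaming) instead of being rebuilt in each mapPartitions call. A rough sketch against the names in the code above: the holder object and the gpu.device system property are mine, JavaCLContext/Classifier/Subject come from the original post, and whether one classifier instance may safely be shared by concurrently running tasks is an assumption that would need to be verified.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // pair-RDD operations in Spark 1.x

// Hypothetical per-executor holder: a Scala object is initialized at most once
// per JVM, so tasks on the same executor reuse the GPU context and compiled kernels.
object GpuContextHolder {
  lazy val javaCLContext = JavaCLContext.build(sys.props.get("gpu.device"))
  lazy val classifier    = Classifier.build(javaCLContext)
}

def run(sparkContext: SparkContext, config: com.infoblox.Config) {
  val subjects = Subject.load(sparkContext, config)
  val classifications = subjects.mapPartitions { subjectIter =>
    val classifier = GpuContextHolder.classifier     // created lazily, once per executor JVM
    subjectIter.foreach(subject => classifier.classifyInBatches(subject))
    classifier.classifyRemaining
    classifier.results.result.iterator               // do not release the context here
  }.reduceByKey(_ + _)
  classifications.saveAsTextFile(config.output)
}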
Re: Spark webUI - application details page
@Brad Your configuration looks alright to me. We parse both file:/ and file:/// the same way so that shouldn't matter. I just tried this on the latest master and verified that it works for me. Can you dig into the directory /tmp/spark-events/ml-pipeline-1408117588599 to make sure that it's not empty? In particular, look for a file that looks like EVENT_LOG_0, then check the content of that file. The last event (on the last line) of the file should be an Application Complete event. If this is not true, it's likely that your application did not call sc.stop(), though the logs should still show up in spite of that. If all of that fails, try logging it in a more accessible place through setting spark.eventLog.dir. Let me know if that helps. @SK You shouldn't need to capture the screen before it finishes; the whole point of the event logging functionality is that the user doesn't have to do that themselves. What happens if you click into the application detail UI? In Spark 1.0.1, if it can't find the logs it may just refresh instead of printing a more explicit message. However, from your configuration you should be able to see the detailed stage information in the UI in addition to just the summary statistics under Completed Applications. I have listed a few debugging steps in the paragraph above, so maybe they're also applicable to you. Let me know if that works, Andrew 2014-08-15 11:07 GMT-07:00 SK skrishna...@gmail.com: Hi, Ok, I was specifying --master local. I changed that to --master spark://localhostname:7077 and am now able to see the completed applications. It provides summary stats about runtime and memory usage, which is sufficient for me at this time. However it doesn't seem to archive the info in the application detail UI that lists detailed stats about the completed stages of the application - which would be useful for identifying bottleneck steps in a large application. I guess we need to capture the application detail UI screen before the app run completes or find a way to extract this info by parsing the Json log file in /tmp/spark-events. thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-webUI-application-details-page-tp3490p12187.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
closure issue - works in scalatest but not in spark-shell
Folks, I wrote the following wrapper on top of combineByKey. The RDD is of Array[Any] and I am extracting a field at a given index for combining. There are two ways in which I tried this: Option A: leave colIndex abstract in the Aggregator class and define it in the derived object Aggtor with value -1. It is set later in the function myAggregate. Works fine, but I want to keep the API user unaware of colIndex. Option B (shown in the code below): set colIndex to -1 in the abstract class. Aggtor does not mention it at all. It is set later in myAggregate. Option B works from scalatest in Eclipse but runs into a closure mishap in the spark-shell. I am looking for an explanation and a possible solution/workaround. Appreciate any help! Thanks, Mohit.

-- API helper -
abstract class Aggregator[U] {
  var colIndex: Int = -1
  def convert(a: Array[Any]): U = { a(colIndex).asInstanceOf[U] }
  def mergeValue(a: U, b: Array[Any]): U = { aggregate(a, convert(b)) }
  def mergeCombiners(x: U, y: U): U = { aggregate(x, y) }
  def aggregate(p: U, q: U): U
}

-- API handler -
def myAggregate[U: ClassTag](...aggtor: Aggregator[U]) = {
  aggtor.colIndex = something
  keyBy(aggByCol).combineByKey(aggtor.convert, aggtor.mergeValue, aggtor.mergeCombiners)
}

call the API
case object Aggtor extends Aggregator[List[String]] {
  //var colIndex = -1
  def aggregate =
}
myAggregate(...Aggtor)
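Not an authoritative diagnosis, but the usual explanation for "works under scalatest, breaks in the shell" is that the spark-shell compiles every line into nested wrapper objects, so an Aggregator subclass defined at the prompt becomes an inner class whose instances carry an $outer reference to the wrapper; serializing Aggtor then tries to drag in that wrapper and everything else defined in the session (including the SparkContext). Two quick checks you can run in the shell to confirm this, using only standard reflection and Java serialization (nothing Spark-specific):

// If an $outer (or $iw / $iwC) field shows up here, the REPL wrapper is being captured.
Aggtor.getClass.getDeclaredFields.map(_.getName).foreach(println)

// Try to serialize the aggregator on its own; this mirrors the task
// serialization step and reports which object in the graph is not serializable.
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(Aggtor)

If that is what is happening, compiling the Aggregator hierarchy into a jar and adding it to the shell with --jars (or ADD_JARS), rather than pasting the definitions into the session, usually makes the shell behave like the compiled test.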
Re: How to implement multinomial logistic regression(softmax regression) in Spark?
DB, Did you compare softmax regression with one-vs-all and found that softmax is better ? one-vs-all can be implemented as a wrapper over binary classifier that we have in mllib...I am curious if softmax multinomial is better on most cases or is it worthwhile to add a one vs all version of mlor as well ? Thanks. Deb On Fri, Aug 15, 2014 at 11:39 AM, DB Tsai dbt...@dbtsai.com wrote: Hi Cui You can take a look at multinomial logistic regression PR I created. https://github.com/apache/spark/pull/1379 Ref: http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297 Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Fri, Aug 15, 2014 at 2:24 AM, Cui xp lifeiniao...@gmail.com wrote: Did I describe the problem not clearly? Is anyone familiar to softmax regression? Thanks. Cui xp. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-implement-multinomial-logistic-regression-softmax-regression-in-Spark-tp11939p12175.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: How to implement multinomial logistic regression(softmax regression) in Spark?
Hi Debasish, I didn't try one-vs-all vs softmax regression. One issue is that for one-vs-all, we have to train k classifiers for k classes problem. The training time will be k times longer. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Fri, Aug 15, 2014 at 11:53 AM, Debasish Das debasish.da...@gmail.com wrote: DB, Did you compare softmax regression with one-vs-all and found that softmax is better ? one-vs-all can be implemented as a wrapper over binary classifier that we have in mllib...I am curious if softmax multinomial is better on most cases or is it worthwhile to add a one vs all version of mlor as well ? Thanks. Deb On Fri, Aug 15, 2014 at 11:39 AM, DB Tsai dbt...@dbtsai.com wrote: Hi Cui You can take a look at multinomial logistic regression PR I created. https://github.com/apache/spark/pull/1379 Ref: http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297 Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Fri, Aug 15, 2014 at 2:24 AM, Cui xp lifeiniao...@gmail.com wrote: Did I describe the problem not clearly? Is anyone familiar to softmax regression? Thanks. Cui xp. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-implement-multinomial-logistic-regression-softmax-regression-in-Spark-tp11939p12175.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
ALS checkpoint performance
Hi, Are there any experiments detailing the performance hit due to HDFS checkpointing in ALS? As we scale to large ranks with more ratings, I believe we have to cut the RDD lineage to safeguard against lineage issues... Thanks. Deb
Re: Running Spark shell on YARN
After changing the allocation I'm getting the following in my logs. No idea what this means. 14/08/15 15:44:33 INFO cluster.YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort: -1 appStartTime: 1408131861372 yarnAppState: ACCEPTED 14/08/15 15:44:34 INFO cluster.YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort: -1 appStartTime: 1408131861372 yarnAppState: ACCEPTED 14/08/15 15:44:35 INFO cluster.YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort: -1 appStartTime: 1408131861372 yarnAppState: ACCEPTED 14/08/15 15:44:36 INFO cluster.YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort: -1 appStartTime: 1408131861372 yarnAppState: ACCEPTED 14/08/15 15:44:37 INFO cluster.YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort: -1 appStartTime: 1408131861372 yarnAppState: ACCEPTED 14/08/15 15:44:38 INFO cluster.YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort: -1 appStartTime: 1408131861372 yarnAppState: ACCEPTED 14/08/15 15:44:39 INFO cluster.YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort: -1 appStartTime: 1408131861372 yarnAppState: ACCEPTED 14/08/15 15:44:40 INFO cluster.YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort: -1 appStartTime: 1408131861372 yarnAppState: ACCEPTED 14/08/15 15:44:41 INFO cluster.YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort: -1 appStartTime: 1408131861372 yarnAppState: ACCEPTED 14/08/15 15:44:42 INFO cluster.YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort: -1 appStartTime: 1408131861372 yarnAppState: ACCEPTED 14/08/15 15:44:43 INFO cluster.YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort: -1 appStartTime: 1408131861372 yarnAppState: ACCEPTED 14/08/15 15:44:44 INFO cluster.YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort: -1 appStartTime: 1408131861372 yarnAppState: ACCEPTED 14/08/15 15:44:45 INFO cluster.YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort: -1 appStartTime: 1408131861372 yarnAppState: ACCEPTED 14/08/15 15:44:46 INFO cluster.YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort: -1 appStartTime: 1408131861372 yarnAppState: ACCEPTED 14/08/15 15:44:47 INFO cluster.YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort: -1 appStartTime: 1408131861372 yarnAppState: ACCEPTED 14/08/15 15:44:48 INFO cluster.YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort: -1 appStartTime: 1408131861372 yarnAppState: ACCEPTED 14/08/15 15:44:49 INFO cluster.YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort: -1 appStartTime: 1408131861372 yarnAppState: ACCEPTED 14/08/15 15:44:50 INFO cluster.YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort: -1 appStartTime: 1408131861372 yarnAppState: ACCEPTED 14/08/15 15:44:51 INFO cluster.YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort: -1 appStartTime: 1408131861372 yarnAppState: ACCEPTED On Fri, Aug 15, 2014 at 2:47 PM, Sandy Ryza sandy.r...@cloudera.com wrote: We generally recommend setting yarn.scheduler.maximum-allocation-mbto the maximum node capacity. -Sandy On Fri, Aug 15, 2014 at 11:41 AM, Soumya Simanta soumya.sima...@gmail.com wrote: I just checked the YARN config and looks like I need to change this value. Should be upgraded to 48G (the max memory allocated to YARN) per node ? 
property nameyarn.scheduler.maximum-allocation-mb/name value6144/value sourcejava.io.BufferedInputStream@2e7e1ee/source /property On Fri, Aug 15, 2014 at 2:37 PM, Soumya Simanta soumya.sima...@gmail.com wrote: Andrew, Thanks for your response. When I try to do the following. ./spark-shell --executor-memory 46g --master yarn I get the following error. Exception in thread main java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment. at org.apache.spark.deploy.SparkSubmitArguments.checkRequiredArguments(SparkSubmitArguments.scala:166) at org.apache.spark.deploy.SparkSubmitArguments.init(SparkSubmitArguments.scala:61) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:50) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) After this I set the following env variable. export YARN_CONF_DIR=/usr/lib/hadoop-yarn/etc/hadoop/ The program launches but then halts with the following error. *14/08/15 14:33:22 ERROR yarn.Client: Required executor memory (47104 MB), is above the max threshold (6144 MB) of this cluster.* I guess this is some YARN setting that is not set correctly. Thanks -Soumya On Fri, Aug 15, 2014 at 2:19 PM, Andrew Or
Re: Running Spark shell on YARN
Sandy and others: Is there a single source of Yarn/Hadoop properties that should be set or reset for running Spark on Yarn? We've sort of stumbled through one property after another, and (unless there's an update I've not yet seen) CDH5 Spark-related properties are for running the Spark Master instead of Yarn. Thanks Kevin On 08/15/2014 12:47 PM, Sandy Ryza wrote: We generally recommend setting yarn.scheduler.maximum-allocation-mbto the maximum node capacity. -Sandy On Fri, Aug 15, 2014 at 11:41 AM, Soumya Simanta soumya.sima...@gmail.com wrote: I just checked the YARN config and looks like I need to change this value. Should be upgraded to 48G (the max memory allocated to YARN) per node ? property nameyarn.scheduler.maximum-allocation-mb/name value6144/value sourcejava.io.BufferedInputStream@2e7e1ee/source /property On Fri, Aug 15, 2014 at 2:37 PM, Soumya Simanta soumya.sima...@gmail.com wrote: Andrew, Thanks for your response. When I try to do the following. ./spark-shell --executor-memory 46g --master yarn I get the following error. Exception in thread "main" java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment. at org.apache.spark.deploy.SparkSubmitArguments.checkRequiredArguments(SparkSubmitArguments.scala:166) at org.apache.spark.deploy.SparkSubmitArguments.init(SparkSubmitArguments.scala:61) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:50) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) After this I set the following env variable. export YARN_CONF_DIR=/usr/lib/hadoop-yarn/etc/hadoop/ The program launches but then halts with the following error. 14/08/15 14:33:22 ERROR yarn.Client: Required executor memory (47104 MB), is above the max threshold (6144 MB) of this cluster. I guess this is some YARN setting that is not set correctly. Thanks -Soumya On Fri, Aug 15, 2014 at 2:19 PM, Andrew Or and...@databricks.com wrote: Hi Soumya, The driver's console output prints out how much memory is actually granted to each executor, so from there you can verify how much memory the executors are actually getting. You should use the '--executor-memory' argument in spark-shell. For instance, assuming each node has 48G of memory, bin/spark-shell --executor-memory 46g --master yarn We leave a small cushion for the
spark streaming - saving kafka DStream into hadoop throws exception
Hi All, I am just trying to save the kafka dstream to hadoop as followed val dStream = KafkaUtils.createStream(ssc, zkQuorum, group, topicpMap) dStream.saveAsHadoopFiles(hdfsDataUrl, data) It throws the following exception. What am I doing wrong? 14/08/15 14:30:09 ERROR OneForOneStrategy: org.apache.hadoop.mapred.JobConf java.io.NotSerializableException: org.apache.hadoop.mapred.JobConf at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:440) at org.apache.spark.streaming.DStreamGraph.writeObject(DStreamGraph.scala:168) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.streaming.CheckpointWriter.write(Checkpoint.scala:185) at org.apache.spark.streaming.scheduler.JobGenerator.doCheckpoint(JobGenerator.scala:259) at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:167) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1$$anonfun$receive$1.applyOrElse(JobGenerator.scala:76) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at 
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) ^C14/08/15 14:30:10 ERROR OneForOneStrategy: org.apache.hadoop.mapred.JobConf java.io.NotSerializableException: org.apache.hadoop.mapred.JobConf at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at
Open sourcing Spindle by Adobe Research, a web analytics processing engine in Scala, Spark, and Parquet.
Hi Spark community, At Adobe Research, we're happy to open source a prototype technology called Spindle we've been developing over the past few months for processing analytics queries with Spark. Please take a look at the repository on GitHub at https://github.com/adobe-research/spindle, and we welcome any feedback. Thanks! Regards, Brandon. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Open-sourcing-Spindle-by-Adobe-Research-a-web-analytics-processing-engine-in-Scala-Spark-and-Parquet-tp12203.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: spark streaming - saving kafka DStream into hadoop throws exception
Somewhere, your function has a reference to the Hadoop JobConf object and is trying to send that to the workers. It's not in this code you pasted so must be from something slightly different? It shouldn't need to send that around and in fact it can't be serialized as you see. If you need a Hadoop Configuration object, you can get that from SparkContext, which you can get from the StreamingContext. On Fri, Aug 15, 2014 at 9:37 PM, salemi alireza.sal...@udo.edu wrote: Hi All, I am just trying to save the kafka dstream to hadoop as followed val dStream = KafkaUtils.createStream(ssc, zkQuorum, group, topicpMap) dStream.saveAsHadoopFiles(hdfsDataUrl, data) It throws the following exception. What am I doing wrong? 14/08/15 14:30:09 ERROR OneForOneStrategy: org.apache.hadoop.mapred.JobConf java.io.NotSerializableException: org.apache.hadoop.mapred.JobConf at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:440) at org.apache.spark.streaming.DStreamGraph.writeObject(DStreamGraph.scala:168) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.streaming.CheckpointWriter.write(Checkpoint.scala:185) at org.apache.spark.streaming.scheduler.JobGenerator.doCheckpoint(JobGenerator.scala:259) at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:167) at 
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1$$anonfun$receive$1.applyOrElse(JobGenerator.scala:76) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) ^C14/08/15 14:30:10 ERROR OneForOneStrategy: org.apache.hadoop.mapred.JobConf java.io.NotSerializableException: org.apache.hadoop.mapred.JobConf at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183) at
Re: ALS checkpoint performance
Guoqiang reported some results in his PRs https://github.com/apache/spark/pull/828 and https://github.com/apache/spark/pull/929 . But this is really problem-dependent. -Xiangrui On Fri, Aug 15, 2014 at 12:30 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, Are there any experiments detailing the performance hit due to HDFS checkpoint in ALS ? As we scale to large ranks with more ratings, I believe we have to cut the RDD lineage to safe guard against the lineage issue... Thanks. Deb - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: spark streaming - saving kafka DStream into hadoop throws exception
Look this is the whole program. I am not trying to serialize the JobConf.

def main(args: Array[String]) {
  try {
    val properties = getProperties("settings.properties")
    StreamingExamples.setStreamingLogLevels()

    val zkQuorum = properties.get("zookeeper.list").toString()
    val topic = properties.get("topic.name").toString()
    val group = properties.get("group.name").toString()
    val threads = properties.get("consumer.threads").toString()
    val topicpMap = Map(topic -> threads.toInt)
    val hdfsNameNodeUrl = properties.get("hdfs.namenode.url").toString()
    val hdfsCheckPointUrl = hdfsNameNodeUrl + properties.get("hdfs.checkpoint.path").toString()
    val hdfsDataUrl = hdfsNameNodeUrl + properties.get("hdfs.data.path").toString()
    val checkPointInterval = properties.get("spark.streaming.checkpoint.interval").toString().toInt

    val sparkConf = new SparkConf().setAppName("KafkaMessageReceiver")
    println("===")
    println("kafka configuration: zk: " + zkQuorum + "; topic: " + topic + "; group: " + group + "; threads: " + threads)
    println("===")

    val ssc = new StreamingContext(sparkConf, Seconds(1))
    ssc.checkpoint(hdfsCheckPointUrl)

    val dStream = KafkaUtils.createStream(ssc, zkQuorum, group, topicpMap)
    dStream.checkpoint(Seconds(checkPointInterval))
    dStream.saveAsNewAPIHadoopFiles(hdfsDataUrl, "csv", classOf[String], classOf[String], classOf[TextOutputFormat[String, String]], ssc.sparkContext.hadoopConfiguration)

    val eventData = dStream.map(_._2).map(_.split(",")).map(data => DataObject(data(0), data(1), data(2), data(3), data(4), data(5), data(6), data(7), data(8).toLong, data(9), data(10), data(11), data(12).toLong, data(13), data(14)))
    val count = eventData.filter(_.state == "COMPLETE").countByWindow(Minutes(15), Seconds(1))
    count.map(cnt => "the Total count of calls in complete state in the last 15 minutes is: " + cnt).print()

    ssc.start()
    ssc.awaitTermination()
  } catch {
    case e: Exception => println("exception caught: " + e);
  }
}

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-saving-kafka-DStream-into-hadoop-throws-exception-tp12202p12207.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
mlib model viewing and saving
Hi All, I have an MLlib model: val model = DecisionTree.train(parsedData, Regression, Variance, maxDepth) I see the model has the following methods: algo, asInstanceOf, isInstanceOf, predict, toString, topNode. model.topNode outputs: org.apache.spark.mllib.tree.model.Node = id = 0, isLeaf = false, predict = 0.5, split = Some(Feature = 87, threshold = 0.7931471805599453, featureType = Continuous, categories = List()), stats = Some(gain = 0.89, impurity = 0.35, left impurity = 0.12, right impurity = 0.00, predict = 0.50) I was wondering what is the best way to look at the model. We want to see what the decision tree looks like -- which features are selected, the details of the splits, what the depth is, etc. Is there an easy way to see that? I can traverse it recursively using topNode.leftNode and topNode.rightNode. However, I was wondering if there is any other way to look at the model, and also to save it on HDFS for later use.
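There is no built-in pretty-printer or save/load for DecisionTreeModel in this version of MLlib, so a recursive traversal like the one you describe is the usual approach. A sketch is below; it is written against the 1.0-era mllib.tree.model.Node/Split fields (isLeaf, predict, split, leftNode, rightNode, feature, threshold), so double-check the field names against your release. For persistence, people typically either keep the training data and re-train, or serialize the model object to HDFS with plain Java serialization, since there is no official export format yet.

import org.apache.spark.mllib.tree.model.Node

// Sketch: print each split (feature index and threshold) and each leaf
// prediction, indenting by depth so the tree shape is visible.
def printTree(node: Node, indent: String = ""): Unit = {
  if (node.isLeaf) {
    println(indent + "predict = " + node.predict)
  } else {
    node.split.foreach(s => println(indent + "if (feature " + s.feature + " <= " + s.threshold + ")"))
    node.leftNode.foreach(printTree(_, indent + "  "))
    node.split.foreach(s => println(indent + "else  // feature " + s.feature + " > " + s.threshold))
    node.rightNode.foreach(printTree(_, indent + "  "))
  }
}

printTree(model.topNode)   // `model` is the DecisionTreeModel trained above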
Does HiveContext support Parquet?
Since SqlContext supports less SQL than Hive (if I understand correctly), I plan to run more of my queries through hql. However, is it possible to create some tables as Parquet in hql? What kind of commands should I use? Thanks in advance for any information. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-HiveContext-support-Parquet-tp12209.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Scala Spark Distinct on a case class doesn't work
I just discovered that the Distinct call is working as expected when I run a driver through spark-submit. This is only an issue in the REPL environment. Very strange... -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Scala-Spark-Distinct-on-a-case-class-doesn-t-work-tp12206p12210.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Updating existing JSON files
I have a bunch of JSON files stored in HDFS that I want to read in, modify, and write back out. I'm new to all this and am not sure if this is even the right thing to do. Basically, my JSON files contain my raw data, and I want to calculate some derived data and add it to the existing data. So first, is my basic approach to the problem flawed? Should I be placing derived data somewhere else? If not, how do I modify the existing JSON files? Note: I have been able to read the JSON files into an RDD using sqlContext.jsonFile, and save them back using RDD.saveAsTextFile(). But this creates new files. Is there a way to overwrite the original files? Thanks!
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Updating-exising-JSON-files-tp12211.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
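For what it's worth, HDFS files are effectively immutable, so the usual pattern is to write the enriched output to a new path and then swap it in, rather than overwriting in place. A rough sketch under that assumption; the paths and the enrichedJson RDD (your records serialized back to JSON strings, with the derived fields added) are placeholders:

import org.apache.hadoop.fs.{FileSystem, Path}

val currentDir = new Path("hdfs:///data/events")          // where the raw JSON lives today
val stagingDir = new Path("hdfs:///data/events_staging")  // temporary output location

// enrichedJson: RDD[String], one JSON record per line.
enrichedJson.saveAsTextFile(stagingDir.toString)

// Only after the new output is completely written, replace the old directory.
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.delete(currentDir, true)        // recursive delete of the old data
fs.rename(stagingDir, currentDir)

Whether derived data belongs next to the raw data is a design choice; many pipelines keep the raw JSON untouched and write derived results to a separate directory, so the originals stay a reliable source of truth.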
Re: Does HiveContext support Parquet?
Yes, you can write to Parquet tables. On Spark 1.0.2 all I had to do was include the parquet-hive-bundle-1.5.0.jar on my classpath.
From: lyc <yanchen@huawei.com>
Sent: Friday, August 15, 2014 7:30 PM
To: u...@spark.incubator.apache.org
Re: spark streaming - saving kafka DStream into hadoop throws exception
If I reduce the app to the following code, then I don't see the exception. It creates the Hadoop files, but they are empty! The DStream doesn't get written out to the files!

def main(args: Array[String]) {
  try {
    val properties = getProperties("settings.properties")
    StreamingExamples.setStreamingLogLevels()
    val zkQuorum = properties.get("zookeeper.list").toString()
    val topic = properties.get("topic.name").toString()
    val group = properties.get("group.name").toString()
    val threads = properties.get("consumer.threads").toString()
    val topicpMap = Map(topic -> threads.toInt)
    val hdfsNameNodeUrl = properties.get("hdfs.namenode.url").toString()
    val hdfsCheckPointUrl = hdfsNameNodeUrl + properties.get("hdfs.checkpoint.path").toString()
    val hdfsDataUrl = hdfsNameNodeUrl + properties.get("hdfs.data.path").toString()
    val checkPointInterval = properties.get("spark.streaming.checkpoint.interval").toString().toInt
    val sparkConf = new SparkConf().setAppName("KafkaMessageReceiver")
    println("===")
    println("kafka configuration: zk: " + zkQuorum + "; topic: " + topic + "; group: " + group + "; threads: " + threads)
    println("===")
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    val dStream = KafkaUtils.createStream(ssc, zkQuorum, group, topicpMap)
    dStream.saveAsNewAPIHadoopFiles(hdfsDataUrl, "csv", classOf[String], classOf[String],
      classOf[TextOutputFormat[String, String]], ssc.sparkContext.hadoopConfiguration)
    ssc.start()
    ssc.awaitTermination()
  } catch {
    case e: Exception => println("exception caught: " + e)
  }
}

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-saving-kafka-DStream-into-hadoop-throws-exception-tp12202p12213.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
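One quick diagnostic worth adding, an assumption on my part rather than something established in this thread: confirm that records are actually arriving in each 1-second batch, since saveAsNewAPIHadoopFiles writes one output directory per batch interval and an empty batch yields empty part files.

// Prints the number of Kafka records received in each batch interval.
dStream.count().print()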
Question regarding spark data partition and coalesce. Need info on my use case.
My use case is as mentioned below.
1. Read input data from the local file system using sparkContext.textFile(input path).
2. Partition the input data (80 million records) using RDD.coalesce(numberOfPartitions) before submitting it to the mapper/reducer function. Without using coalesce() or repartition() on the input data, Spark executes really slowly and fails with an out-of-memory exception.
The issue I am facing here is in deciding the number of partitions to apply to the input data. *The input data size varies every time, and hard-coding a particular value is not an option. Spark performs really well only when a certain optimum number of partitions is applied to the input data, which I have to find through lots of iteration (trial and error). That is not an option in a production environment.*
My question: Is there a rule of thumb to decide the number of partitions required depending on the input data size and the cluster resources available (executors, cores, etc.)? If yes, please point me in that direction. Any help is much appreciated. I am using Spark 1.0 on YARN.
Thanks,
AG
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Question-regarding-spark-data-partition-and-coalesce-Need-info-on-my-use-case-tp12214.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
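There is no official formula, but a common heuristic is to aim for partitions of roughly 64-128 MB each and at least 2-3 tasks per available core. A rough sketch of computing a partition count from the input size at runtime; the variable names, the 128 MB target, and the factor of 3 are assumptions, not Spark settings:

import org.apache.hadoop.fs.{FileSystem, Path}

val targetPartitionBytes = 128L * 1024 * 1024                   // aim for ~128 MB per partition
val inputFs = new Path(inputPath).getFileSystem(sc.hadoopConfiguration)
val inputBytes = inputFs.getContentSummary(new Path(inputPath)).getLength   // total input size

val totalCores = numExecutors * coresPerExecutor                // from your spark-submit settings
val numPartitions = math.max(totalCores * 3, (inputBytes / targetPartitionBytes).toInt)

val data = sc.textFile(inputPath).repartition(numPartitions)

Note that coalesce(n) without shuffle can only merge partitions downward, while repartition(n) always shuffles; when the input arrives in too few oversized partitions, the shuffle is usually what you want.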
Re: GraphX Pagerank application
On Wed, Aug 6, 2014 at 11:37 AM, AlexanderRiggers alexander.rigg...@gmail.com wrote:
To perform the page rank I have to create a graph object, adding the edges by setting sourceID=id and distID=brand. In GraphLab there is a function: g = SGraph().add_edges(data, src_field='id', dst_field='brand') Is there something similar in GraphX?
It sounds like you're trying to parse an edge list file into a graph, where each line is a comma-separated pair of numeric vertex ids. There's a built-in parser for tab-separated pairs (see GraphLoader) and it should be easy to adapt that to comma-separated pairs. You can also drop the header line using RDD#filter (and eventually using https://github.com/apache/spark/pull/1839).
Ankur http://www.ankurdave.com/
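A rough sketch of that suggestion, assuming a CSV with a header row and two numeric columns (id, brand); the path, the column layout, and the PageRank tolerance are placeholders:

import org.apache.spark.graphx.{Edge, Graph}

val lines = sc.textFile("hdfs:///data/id_brand.csv")
val header = lines.first()

val edges = lines
  .filter(_ != header)                             // drop the header line
  .map { line =>
    val fields = line.split(",").map(_.trim)
    Edge(fields(0).toLong, fields(1).toLong, 1)    // srcId = id, dstId = brand, unit edge attribute
  }

val graph = Graph.fromEdges(edges, defaultValue = 1)
val ranks = graph.pageRank(0.0001).vertices        // (vertexId, rank) pairs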
Re: Does HiveContext support Parquet?
Thank you for your reply. Do you know where I can find some detailed information about how to use Parquet in HiveContext? Any information is appreciated. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-HiveContext-support-Parquet-tp12209p12216.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
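Not something I have verified on every version, but with the parquet-hive-bundle on the classpath (the Hive 0.12-era setup mentioned above), a Parquet-backed table is usually declared through the Parquet SerDe. A minimal sketch from a HiveContext; the table and column names are made up:

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// Declare a table whose storage format is Parquet, then populate it from an existing table.
hiveContext.hql("""
  CREATE TABLE events_parquet (id BIGINT, name STRING)
  ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
  STORED AS
    INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
    OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
""")
hiveContext.hql("INSERT OVERWRITE TABLE events_parquet SELECT id, name FROM some_existing_table")

On Hive 0.13 and later the same declaration can usually be written as just STORED AS PARQUET.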
Error in sbt/sbt package
I am getting the following error while doing SPARK_HADOOP_VERSION=2.3.0 sbt/sbt package

java.io.IOException: Cannot run program /home/deep/spark-1.0.0/usr/lib/jvm/java-7-oracle/bin/javac: error=2, No such file or directory
...lots of errors
[error] (core/compile:compile) java.io.IOException: Cannot run program /home/deep/spark-1.0.0/usr/lib/jvm/java-7-oracle/bin/javac: error=2, No such file or directory
[error] Total time: 198 s, completed 16 Aug, 2014 10:25:50 AM

My ~/.bashrc file has the following (apart from other paths):
export JAVA_HOME=usr/lib/jvm/java-7-oracle
export PATH=$PATH:$JAVA_HOME/bin

My spark-env.sh file has the following (apart from other paths):
export JAVA_HOME=/usr/lib/jvm/jdk-7-oracle

Can anyone tell me what I should modify? Thank you
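A hedged observation based only on the paths shown above: javac is being looked up under /home/deep/spark-1.0.0/usr/lib/jvm/..., i.e. relative to the Spark directory, which matches the missing leading slash in the ~/.bashrc setting. Making JAVA_HOME absolute there and re-sourcing ~/.bashrc (or opening a new shell) would be the first thing to try:

export JAVA_HOME=/usr/lib/jvm/java-7-oracle
export PATH=$PATH:$JAVA_HOME/bin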