Re: Spark performance for small queries
Hello, We were comparing performance of some of our production Hive queries between Hive and Spark. We compared Hive (0.13) + Hadoop (1.2.1) against both Spark 0.9 and 1.1, and the performance gains in Spark have been good. We tried a very simple query, select count(*) from T where col3=123, in both Spark SQL and Hive (with hive.map.aggr=true) and found that Spark was 2x faster than Hive (60 sec vs 120 sec). Table T is stored in S3 as a single 600MB GZIP file. My question is, why is Spark faster than Hive here? In both cases the file has to be downloaded, uncompressed, and the lines counted by a single process. In the Hive case the reducer is an identity function, since hive.map.aggr is true. Note that disk spills and network I/O are minimal in Hive's case as well.
Re: what is the roadmap for Spark SQL dialect in the coming releases?
Hi, would like to know if there is an update on this? rgds On Mon, Jan 12, 2015 at 10:44 AM, Niranda Perera niranda.per...@gmail.com wrote: Hi, I found out that SparkSQL supports only a relatively small subset of SQL dialect currently. I would like to know the roadmap for the coming releases. And, are you focusing more on popularizing the 'Hive on Spark' SQL dialect or the Spark SQL dialect? Rgds -- Niranda
Re: Spark 1.1 (slow, working), Spark 1.2 (fast, freezing)
Seems like it is a bug rather than a feature. I filed a bug report: https://issues.apache.org/jira/browse/SPARK-5363 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-1-slow-working-Spark-1-2-fast-freezing-tp21278p21317.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Are these numbers abnormal for spark streaming?
and post the code (if possible). In a nutshell, your processing time > batch interval, resulting in an ever-increasing delay that will end up in a crash. 3 secs to process 14 messages looks like a lot. Curious what the job logic is. -kr, Gerard. On Thu, Jan 22, 2015 at 12:15 PM, Tathagata Das tathagata.das1...@gmail.com wrote: This is not normal. It's a huge scheduling delay!! Can you tell me more about the application? - cluster setup, number of receivers, what's the computation, etc. On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab as...@live.com wrote: Hate to do this...but...erm...bump? Would really appreciate input from others using Streaming. Or at least some docs that would tell me if these are expected or not. -- From: as...@live.com To: user@spark.apache.org Subject: Are these numbers abnormal for spark streaming? Date: Wed, 21 Jan 2015 11:26:31 +0000 Hi Guys, I've got Spark Streaming set up for a low data rate system (using spark's features for analysis, rather than high throughput). Messages are coming in throughout the day, at around 1-20 per second (finger in the air estimate...not analysed yet). In the spark streaming UI for the application, I'm getting the following after 17 hours.
Streaming
  Started at:        Tue Jan 20 16:58:43 GMT 2015
  Time since start:  18 hours 24 minutes 34 seconds
  Network receivers: 2
  Batch interval:    2 seconds
  Processed batches: 16482
  Waiting batches:   1

Statistics over last 100 processed batches

Receiver Statistics
  Receiver      | Status | Location | Records in last batch [2015/01/21 11:23:18] | Min rate [rec/s] | Median rate [rec/s] | Max rate [rec/s] | Last Error
  RmqReceiver-0 | ACTIVE | F        | 14 | 4 | 7 | 27 | -
  RmqReceiver-1 | ACTIVE | BR       | 12 | 4 | 7 | 26 | -

Batch Processing Statistics
  Metric           | Last batch            | Minimum               | 25th percentile        | Median                | 75th percentile       | Maximum
  Processing Time  | 3 s 994 ms            | 157 ms                | 4 s 16 ms              | 4 s 961 ms            | 5 s 3 ms              | 5 s 171 ms
  Scheduling Delay | 9 h 15 min 4 s        | 9 h 10 min 54 s       | 9 h 11 min 56 s        | 9 h 12 min 57 s       | 9 h 14 min 5 s        | 9 h 15 min 4 s
  Total Delay      | 9 h 15 min 8 s        | 9 h 10 min 58 s       | 9 h 12 min             | 9 h 13 min 2 s        | 9 h 14 min 10 s       | 9 h 15 min 8 s

Are these normal? I was wondering what the scheduling delay and total delay terms are, and if it's normal for them to be 9 hours. I've got a standalone spark master and 4 spark nodes. The streaming app has been given 4 cores, and it's using 1 core per worker node. The streaming app is submitted from a 5th machine, and that machine has nothing but the driver running. The worker nodes are running alongside Cassandra (and reading and writing to it). Any insights would be appreciated. Regards, Ashic.
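Gerard's rule of thumb can be checked with a toy model (not Spark code; the constants below are lifted from the UI stats above): whenever processing time exceeds the batch interval, the backlog grows by the difference on every single batch, which is exactly how a 2-second interval with ~4.5-second batches accumulates hours of scheduling delay.

```python
# Toy model of a streaming backlog: batches arrive every `interval` seconds
# but each takes `processing` seconds, so whenever processing > interval the
# scheduling delay grows by the difference on every batch.
def scheduling_delay(batches, interval, processing):
    delay = 0.0
    for _ in range(batches):
        delay = max(0.0, delay + processing - interval)
    return delay

# Constants from the UI stats above: 2 s interval, ~4.5 s typical
# processing time, 16482 processed batches.
hours = scheduling_delay(16482, 2.0, 4.5) / 3600
print(round(hours, 1))  # prints 11.4, the same order as the ~9 h observed
```

The model overshoots slightly (real processing times varied, and some early batches were fast), but it shows why the delay climbs without bound until processing time drops below the batch interval.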
RE: Are these numbers abnormal for spark streaming?
Hi Gerard, Thanks for the response. The messages get deserialised from msgpack format, and one of the strings is deserialised to json. Certain fields are checked to decide if further processing is required. If so, it goes through a series of in-mem filters to check if more processing is required. Only then does the heavy work start. That consists of a few db queries, and potential updates to the db + a message on a message queue. The majority of messages don't need processing. The messages needing processing at peak are about three every other second. One possible thing that might be happening is the session initialisation and prepared statement initialisation for each partition. I can resort to some tricks, but I think I'll try increasing the batch interval to 15 seconds. I'll report back with findings. Thanks, Ashic.
Re: Are these numbers abnormal for spark streaming?
This is not normal. It's a huge scheduling delay!! Can you tell me more about the application? - cluster setup, number of receivers, what's the computation, etc. On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab as...@live.com wrote: Hate to do this...but...erm...bump? Would really appreciate input from others using Streaming. Or at least some docs that would tell me if these are expected or not.
Re: Spark Team - Paco Nathan said that your team can help
http://spark.apache.org/docs/latest/ Follow this. It's easy to get started. Use a prebuilt version of spark as of now :D On Thu, Jan 22, 2015 at 5:06 PM, Sudipta Banerjee asudipta.baner...@gmail.com wrote: Hi Apache-Spark team, What are the system requirements for installing Hadoop and Apache Spark? I have attached the screen shot of Gparted. Thanks and regards, Sudipta -- Sudipta Banerjee Consultant, Business Analytics and Cloud Based Architecture Call me +919019578099
Re: KNN for large data set
Thanks Xiangrui Meng, will try this. Also found https://github.com/kaushikranjan/knnJoin. Will this work with double data? Can we find out the z value of *Vector(10.3,4.5,3,5)*? On Thu, Jan 22, 2015 at 12:25 AM, Xiangrui Meng men...@gmail.com wrote: For large datasets, you need hashing in order to compute k-nearest neighbors locally. You can start with LSH + k-nearest in Google scholar: http://scholar.google.com/scholar?q=lsh+k+nearest -Xiangrui On Tue, Jan 20, 2015 at 9:55 PM, DEVAN M.S. msdeva...@gmail.com wrote: Hi all, Please help me to find out the best way to do K-nearest neighbor using spark for large data sets.
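For reference, the LSH family Xiangrui mentions can be sketched in a few lines. This is an illustrative random-hyperplane (sign-of-projection) hash, not the knnJoin project's z-value scheme and not any Spark API; all names here are made up:

```python
import random

# Illustrative random-hyperplane LSH: each hyperplane contributes one sign
# bit, so nearby vectors (small angle between them) tend to get identical
# signatures and land in the same bucket, shrinking the k-NN search space.
def signature(point, hyperplanes):
    return tuple(int(sum(p * h for p, h in zip(point, plane)) >= 0)
                 for plane in hyperplanes)

random.seed(42)
dim, n_planes = 4, 8
planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]

data = [[10.3, 4.5, 3.0, 5.0],   # the vector from the question
        [10.4, 4.4, 3.1, 5.2],   # a nearby vector
        [-9.0, 2.0, -3.0, 1.0]]  # a distant one
buckets = {}
for p in data:
    buckets.setdefault(signature(p, planes), []).append(p)
# Nearby vectors usually share a bucket; with 8 sign bits, distant ones
# rarely do, so a k-NN query only scans its own bucket.
```

This works on double-valued vectors directly; the z-value approach in knnJoin serves the same purpose (mapping nearby points to nearby keys) via a space-filling curve instead of random projections.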
RE: Are these numbers abnormal for spark streaming?
Hate to do this...but...erm...bump? Would really appreciate input from others using Streaming. Or at least some docs that would tell me if these are expected or not. From: as...@live.com To: user@spark.apache.org Subject: Are these numbers abnormal for spark streaming? Date: Wed, 21 Jan 2015 11:26:31 +0000 Hi Guys, I've got Spark Streaming set up for a low data rate system (using spark's features for analysis, rather than high throughput). Messages are coming in throughout the day, at around 1-20 per second (finger in the air estimate...not analysed yet). In the spark streaming UI for the application, I'm getting the following after 17 hours. [streaming stats identical to those quoted earlier in the thread] Are these normal? I was wondering what the scheduling delay and total delay terms are, and if it's normal for them to be 9 hours. I've got a standalone spark master and 4 spark nodes. The streaming app has been given 4 cores, and it's using 1 core per worker node. The streaming app is submitted from a 5th machine, and that machine has nothing but the driver running. The worker nodes are running alongside Cassandra (and reading and writing to it). Any insights would be appreciated. Regards, Ashic.
RE: Are these numbers abnormal for spark streaming?
Hi TD, Here's some information:

1. Cluster has one standalone master, 4 workers. Workers are co-hosted with Apache Cassandra. Master is set up with external Zookeeper.
2. Each machine has 2 cores and 4GB of ram. This is for testing. All machines are vmware vms. Spark has 2GB dedicated to it on each node.
3. In addition to the streaming details, the master details as of now are given below. Only the streaming app is running.
4. I'm listening to two rabbitmq queues using a rabbitmq receiver (code: https://gist.github.com/ashic/b5edc7cfdc85aa60b066 ). Notifier code is here: https://gist.github.com/ashic/9abd352c691eafc8c9f3
5. The receivers are initialised with the following code:

   val ssc = new StreamingContext(sc, Seconds(2))
   val messages1 = ssc.receiverStream(new RmqReceiver("abc", "abc", "/", "vdclog03", "abc_input", None))
   val messages2 = ssc.receiverStream(new RmqReceiver("abc", "abc", "/", "vdclog04", "abc_input", None))
   val messages = messages1.union(messages2)
   val notifier = new RabbitMQEventNotifier("vdclog03", "abc", "abc_output_events", "abc", "abc", "/")

6. Usage:

   messages.map(x => ScalaMessagePack.read[RadioMessage](x))
     .flatMap(InputMessageParser.parse(_).getEvents())
     .foreachRDD(x => {
       x.foreachPartition(x => {
         cassandraConnector.withSessionDo(session => {
           val graphStorage = new CassandraGraphStorage(session)
           val notificationStorage = new CassandraNotificationStorage(session)
           val savingNotifier = new SavingNotifier(notifier, notificationStorage)
           x.foreach(eventWrapper => eventWrapper.event match {
             // do some queries.
             // save some stuff if needed to cassandra
             // raise a message to a separate queue with a (msg => Unit) operation.

7. The algorithm is simple: listen to messages from two separate rmq queues. Union them. For each message, check message properties. If needed, query cassandra for additional details (graph search...but done in 0.5-3 seconds...and rare...shouldn't overwhelm with the low input rate). If needed, save some info back into cassandra (1-2ms), and raise an event to the notifier.
I'm probably missing something basic, just wondering what. It has been running fine for about 42 hours now, but the numbers are a tad worrying. Cheers, Ashic.

Workers: 4   Cores: 8 Total, 4 Used   Memory: 8.0 GB Total, 2000.0 MB Used
Applications: 1 Running, 0 Completed   Drivers: 0 Running, 0 Completed   Status: ALIVE

Workers
  Id                                             | Address                  | State | Cores      | Memory
  worker-20141208131918-VDCAPP50.AAA.local-44476 | VDCAPP50.AAA.local:44476 | ALIVE | 2 (1 Used) | 2.0 GB (500.0 MB Used)
  worker-20141208132012-VDCAPP52.AAA.local-34349 | VDCAPP52.AAA.local:34349 | ALIVE | 2 (1 Used) | 2.0 GB (500.0 MB Used)
  worker-20141208132136-VDCAPP53.AAA.local-54000 | VDCAPP53.AAA.local:54000 | ALIVE | 2 (1 Used) | 2.0 GB (500.0 MB Used)
  worker-2014121627-VDCAPP49.AAA.local-57899     | VDCAPP49.AAA.local:57899 | ALIVE | 2 (1 Used) | 2.0 GB (500.0 MB Used)

Running Applications
  ID                      | Name | Cores | Memory per Node | Submitted Time      | User | State   | Duration
  app-20150120165844-0005 | App1 | 4     | 500.0 MB        | 2015/01/20 16:58:44 | root | WAITING | 42.4 h
Fwd: Spark Team - Paco Nathan said that your team can help
Hi Apache-Spark team, What are the system requirements for installing Hadoop and Apache Spark? I have attached the screen shot of Gparted. Thanks and regards, Sudipta -- Sudipta Banerjee Consultant, Business Analytics and Cloud Based Architecture Call me +919019578099
Re: Spark Team - Paco Nathan said that your team can help
Hi, Let me reword your request so you understand how (too) generic your question is: "Hi, I have $10,000, please find me some means of transportation so I can get to work." Please provide (a lot) more details. If you can't, consider using one of the pre-built express VMs from Cloudera, Hortonworks or MapR, for example. Marco On Jan 22, 2015, at 7:36 AM, Sudipta Banerjee asudipta.baner...@gmail.com wrote: Hi Apache-Spark team, What are the system requirements for installing Hadoop and Apache Spark? I have attached the screen shot of Gparted. Thanks and regards, Sudipta -- Sudipta Banerjee Consultant, Business Analytics and Cloud Based Architecture Call me +919019578099 Screenshot - Wednesday 21 January 2015 - 10:55:29 IST.png
Re: Trying to run SparkSQL over Spark Streaming
Hi, I'm also trying to use the insertInto method, but I end up getting the assertion error. Is there any workaround for this? rgds -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Trying-to-run-SparkSQL-over-Spark-Streaming-tp12530p21316.html
Re: Is Apache Spark less accurate than Scikit Learn?
Hi, There are many different variants of gradient descent, mostly dealing with what the step size is and how it might be adjusted as the algorithm proceeds. If it uses a stochastic variant (as opposed to batch descent) then there are variations there too. I don’t know off-hand what MLlib’s detailed implementation is but no doubt there are differences between the two - perhaps someone with more knowledge of the internals could comment. As you can tell from playing around with the parameters, step size is vitally important to the performance of the algorithm. On 22 Jan 2015, at 06:44, Jacques Heunis jaaksem...@gmail.com wrote: Ah I see, thanks! I was just confused because given the same configuration, I would have thought that Spark and Scikit would give more similar results, but I guess this is simply not the case (as in your example, in order to get spark to give an mse sufficiently close to scikit's you have to give it a significantly larger step and iteration count). Would that then be a result of MLlib and Scikit differing slightly in their exact implementation of the optimizer? Or rather a case of (as you say) Scikit being a far more mature system (and therefore that MLlib would 'get better' over time)? Surely it is far from ideal that to get the same results you need more iterations (IE more computation), or do you think that that is simply coincidence and that given a different model/dataset it may be the other way around? I ask because I encountered this situation on other, larger datasets, so this is not an isolated case (though being the simplest example I could think of I would imagine that it's somewhat indicative of general behaviour) On Thu, Jan 22, 2015 at 1:57 AM, Robin East robin.e...@xense.co.uk wrote: I don’t get those results. I get: spark: 0.14, scikit-learn: 0.85. The scikit-learn mse is due to the very low eta0 setting. Tweak that to 0.1 and push iterations to 400 and you get an mse ~= 0.
Of course the coefficients are both ~1 and the intercept ~0. Similarly if you change the mllib step size to 0.5 and number of iterations to 1200 you again get a very low mse. One of the issues with SGD is you have to tweak these parameters to tune the algorithm. FWIW I wouldn’t see Spark MLlib as a replacement for scikit-learn. MLlib is nowhere near as mature as scikit-learn. However if you have large datasets that won’t sensibly fit scikit-learn's in-core model, MLlib is one of the top choices. Similarly if you are running proof of concepts that you are eventually going to scale up to production environments then there is a definite argument for using MLlib at both the PoC and production stages. On 21 Jan 2015, at 20:39, JacquesH jaaksem...@gmail.com wrote: I've recently been trying to get to know Apache Spark as a replacement for Scikit Learn, however it seems to me that even in simple cases, Scikit converges to an accurate model far faster than Spark does. For example I generated 1000 data points for a very simple linear function (z=x+y) with the following script: http://pastebin.com/ceRkh3nb I then ran the following Scikit script: http://pastebin.com/1aECPfvq And then this Spark script: (with spark-submit filename, no other arguments) http://pastebin.com/s281cuTL Strangely though, the error given by spark is an order of magnitude larger than that given by Scikit (0.185 and 0.045 respectively) despite the two models having a nearly identical setup (as far as I can tell). I understand that this is using SGD with very few iterations and so the results may differ, but I wouldn't have thought that it would be anywhere near such a large difference or such a large error, especially given the exceptionally simple data. Is there something I'm misunderstanding in Spark? Is it not correctly configured? Surely I should be getting a smaller error than that?
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-Apache-Spark-less-accurate-than-Scikit-Learn-tp21301.html
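Robin's point about step size and iteration count can be reproduced without either library. Below is a self-contained toy, plain batch gradient descent on the same z = x + y data; it mirrors neither MLlib's nor scikit-learn's actual optimizer (names and constants are made up), but it shows how the same step size produces wildly different MSEs depending on where you stop:

```python
import random

# Plain batch gradient descent for z = w0*x + w1*y on the z = x + y data
# from the thread. Step size and iteration count, not the library, decide
# how close the final MSE gets to zero.
def fit(data, step, iters):
    w0, w1 = 0.0, 0.0
    n = len(data)
    for _ in range(iters):
        g0 = g1 = 0.0
        for (x, y), z in data:
            err = w0 * x + w1 * y - z
            g0 += 2 * err * x / n
            g1 += 2 * err * y / n
        w0, w1 = w0 - step * g0, w1 - step * g1
    return w0, w1

def mse(w, data):
    return sum((w[0] * x + w[1] * y - z) ** 2 for (x, y), z in data) / len(data)

random.seed(0)
data = [((x, y), x + y)
        for x, y in ((random.random(), random.random()) for _ in range(1000))]

few = fit(data, step=0.01, iters=10)     # stops early: large MSE
many = fit(data, step=0.01, iters=1000)  # both weights approach 1.0
```

With the same step size, 10 iterations leave the MSE near its starting value while 1000 iterations drive it close to zero; neither optimizer is "less accurate", the stopping point and step size just differ, which is Robin's point about the two defaults.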
spark-shell has syntax error on windows.
I have a problem running the spark shell on Windows 7. I made the following steps: 1. downloaded and installed Scala 2.11.5 2. downloaded spark 1.2.0 by git clone git://github.com/apache/spark.git 3. ran dev/change-version-to-2.11.sh and mvn -Dscala-2.11 -DskipTests clean package (in git bash) After installation I tried to run spark-shell.cmd in the cmd shell and it says there is a syntax error in the file. What could I do to fix the problem? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-shell-has-syntax-error-on-windows-tp21313.html
unable to write SequenceFile using saveAsNewAPIHadoopFile
Hi All, I'm using the saveAsNewAPIHadoopFile API to write SequenceFiles but I'm getting the following runtime exception:

Exception in thread "main" org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:1435)
    at org.apache.spark.rdd.RDD.map(RDD.scala:271)
    at org.apache.spark.api.java.JavaRDDLike$class.mapToPair(JavaRDDLike.scala:102)
    at org.apache.spark.api.java.JavaPairRDD.mapToPair(JavaPairRDD.scala:45)
    at XoanonKMeansV2.main(XoanonKMeansV2.java:125)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: org.apache.hadoop.io.IntWritable
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
    ... 13 more

Pls find below the code snippet:

joiningKeyPlusPredictedPoint.mapToPair(
    new PairFunction<Tuple2<String, Integer>, Text, IntWritable>() {
        Text text = new Text();
        IntWritable intwritable = new IntWritable();

        @Override
        public Tuple2<Text, IntWritable> call(Tuple2<String, Integer> tuple) throws Exception {
            text.set(tuple._1);
            intwritable.set(tuple._2);
            return new Tuple2<Text, IntWritable>(text, intwritable);
        }
    })
    .saveAsNewAPIHadoopFile("/mllib/data/clusteroutput_seq", Text.class,
        IntWritable.class, SequenceFileOutputFormat.class);

Regards, Skanda
GraphX: ShortestPaths does not terminate on a grid graph
Hello, I try to execute a simple program that runs the ShortestPaths algorithm (org.apache.spark.graphx.lib.ShortestPaths) on a small grid graph. I use Spark 1.2.0 downloaded from spark.apache.org. The program's code is the following:

object GraphXGridSP {
  def main(args: Array[String]): Unit = {
    val appname: String = "GraphXGridShortestPath"
    val conf = new SparkConf().setAppName(appname)
    val sc = new SparkContext(conf)
    val gridSize: Int = 70
    val nPartitions: Int = 4
    val graph = GraphGenerators.gridGraph(sc, gridSize, gridSize).
      partitionBy(PartitionStrategy.EdgePartition2D, nPartitions)
    val landmarks: Seq[VertexId] = Seq(0)
    val graph2: Graph[SPMap, Double] = ShortestPaths.run(graph, landmarks)
    graph2.vertices.count
  }
}

This program runs for more than 2 hours when the grid size is 70x70 as above, and is then killed by the resource manager of the cluster (Torque). After 5-6 minutes of execution, the Spark master UI does not even respond. I use a cluster of 5 nodes (4 workers, 1 executor per node). For a grid size of 30x30, the program terminates in about 20 seconds, and for a grid size of 50x50 it finishes in about 80 seconds. The problem appears for a grid size of 70x70 and above. What's wrong with this program? Thanks for any help. Regards, Nicolas.
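One hedged hypothesis (an assumption, not a confirmed diagnosis): ShortestPaths runs as a Pregel loop that advances the frontier roughly one hop per superstep, and each superstep adds stages and lengthens the RDD lineage. The superstep count itself grows only linearly with grid size, which a quick back-of-the-envelope check shows:

```python
# Rough superstep estimate for Pregel-style shortest paths on an n x n grid:
# with the landmark at vertex 0 (a corner), the farthest vertex is the
# opposite corner, 2*(n-1) hops away, and the frontier needs about that
# many supersteps before it stops producing messages.
def supersteps(n):
    return 2 * (n - 1)

for n in (30, 50, 70):
    print(n, supersteps(n))
```

A 70x70 grid needs roughly 138 supersteps versus 58 for 30x30, yet the observed runtime goes from ~20 seconds to hours, so the per-superstep cost must be growing, which is consistent with unbounded lineage/GC pressure in a long iterative job rather than with the algorithm itself. If that is the cause, periodic checkpointing (sc.setCheckpointDir plus checkpointing the graph's RDDs every few dozen iterations) may help, but that is a guess.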
Re: unable to write SequenceFile using saveAsNewAPIHadoopFile
First, as an aside: I am pretty sure you cannot reuse one Text and IntWritable object here, since Spark does not necessarily finish with one call()'s value before the next call(). Although that should not be directly related to the serialization problem, I suspect it is: your function is not serializable, since it contains references to these cached writables. I think removing them fixes both problems. On Jan 22, 2015 9:42 AM, Skanda skanda.ganapa...@gmail.com wrote: Hi All, I'm using the saveAsNewAPIHadoopFile API to write SequenceFiles but I'm getting the following runtime exception:

Exception in thread main org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:1435)
    at org.apache.spark.rdd.RDD.map(RDD.scala:271)
    at org.apache.spark.api.java.JavaRDDLike$class.mapToPair(JavaRDDLike.scala:102)
    at org.apache.spark.api.java.JavaPairRDD.mapToPair(JavaPairRDD.scala:45)
    at XoanonKMeansV2.main(XoanonKMeansV2.java:125)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: org.apache.hadoop.io.IntWritable
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
    ... 13 more

Please find below the code snippet:

joiningKeyPlusPredictedPoint.mapToPair(
        new PairFunction<Tuple2<String, Integer>, Text, IntWritable>() {
            Text text = new Text();
            IntWritable intwritable = new IntWritable();

            @Override
            public Tuple2<Text, IntWritable> call(Tuple2<String, Integer> tuple)
                    throws Exception {
                text.set(tuple._1);
                intwritable.set(tuple._2);
                return new Tuple2<Text, IntWritable>(text, intwritable);
            }
        })
        .saveAsNewAPIHadoopFile("/mllib/data/clusteroutput_seq", Text.class,
                IntWritable.class, SequenceFileOutputFormat.class);

Regards, Skanda
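Sean's diagnosis can be reproduced without Spark or Hadoop at all. The following is a minimal stdlib-only sketch: `NotSerializableValue` and `PairFn` are hypothetical stand-ins for `org.apache.hadoop.io.IntWritable` and Spark's `PairFunction`, and `serializes` plays the role of Spark's ClosureCleaner, which Java-serializes the function to ship it to executors. A function object that caches a non-serializable writable in a field fails exactly as in the stack trace above; the same function creating its values inside the method serializes fine.

```java
import java.io.*;

public class ClosureSerialization {

    // Hypothetical stand-in for org.apache.hadoop.io.IntWritable: a plain
    // class that does NOT implement java.io.Serializable.
    static class NotSerializableValue {
        int value;
    }

    // Hypothetical stand-in for Spark's PairFunction; Spark requires its
    // functions to be Serializable so they can be sent to executors.
    interface PairFn extends Serializable {
        int call(int x);
    }

    // Roughly what Spark's ClosureCleaner does: try to Java-serialize the
    // function and report whether it succeeds.
    static boolean serializes(Object o) {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false; // the same failure the stack trace above reports
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Caches a writable in a field, like the original snippet: the field is
    // dragged into serialization and the whole function becomes unserializable.
    static PairFn cachingFn() {
        return new PairFn() {
            NotSerializableValue cached = new NotSerializableValue();
            public int call(int x) { cached.value = x; return cached.value; }
        };
    }

    // Creates the value inside call(), as Sean suggests: no cached field,
    // so the function serializes.
    static PairFn freshFn() {
        return new PairFn() {
            public int call(int x) {
                NotSerializableValue v = new NotSerializableValue();
                v.value = x;
                return v.value;
            }
        };
    }

    public static void main(String[] args) {
        System.out.println("caching fn serializable: " + serializes(cachingFn()));
        System.out.println("fresh fn serializable:   " + serializes(freshFn()));
    }
}
```

This also illustrates why moving the `new Text()` / `new IntWritable()` calls into `call()` fixes both problems Sean mentions: nothing non-serializable remains in the closure, and each record gets its own objects.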
Re: Discourse: A proposed alternative to the Spark User list
Ok, thanks for the clarifications. I didn't know this list has to remain the only official list. Nabble is really not the best solution in the world, but we're stuck with it, I guess. That's it from me on this subject. Petar On 22.1.2015. 3:55, Nicholas Chammas wrote: I think a few things need to be laid out clearly:
1. This mailing list is the “official” user discussion platform. That is, it is sponsored and managed by the ASF.
2. Users are free to organize independent discussion platforms focusing on Spark, and there is already one such platform in Stack Overflow under the apache-spark and related tags. Stack Overflow works quite well.
3. The ASF will not agree to deprecating or migrating this user list to a platform that they do not control.
4. This mailing list has grown to an unwieldy size and discussions are hard to find or follow; discussion tooling is also lacking. We want to improve the utility and user experience of this mailing list.
5. We don’t want to fragment this “official” discussion community.
6. Nabble is an independent product not affiliated with the ASF. It offers a slightly better interface to the Apache mailing list archives.
So to respond to some of your points, pzecevic: “Apache user group could be frozen (not accepting new questions, if that’s possible) and redirect users to Stack Overflow (automatic reply?).” From what I understand of the ASF’s policies, this is not possible. :( This mailing list must remain the official Spark user discussion platform. “Other thing, about new Stack Exchange site I proposed earlier. If a new site is created, there is no problem with guidelines, I think, because Spark community can apply different guidelines for the new site.” I think Stack Overflow and the various Spark tags are working fine. I don’t see a compelling need for a Stack Exchange dedicated to Spark, either now or in the near future.
Also, I doubt a Spark-specific site can pass the 4 tests in the Area 51 FAQ (http://area51.stackexchange.com/faq):
* Almost all Spark questions are on-topic for Stack Overflow
* Stack Overflow already exists, it already has a tag for Spark, and nobody is complaining
* You’re not creating such a big group that you don’t have enough experts to answer all possible questions
* There’s a high probability that users of Stack Overflow would enjoy seeing the occasional question about Spark
I think complaining won’t be sufficient. :) “Someone expressed a concern that they won’t allow creating a project-specific site, but there already exist some project-specific sites, like Tor, Drupal, Ubuntu…” The communities for these projects are many, many times larger than the Spark community is or likely ever will be, simply due to the nature of the problems they are solving. What we need is an improvement to this mailing list. We need better tooling than Nabble to sit on top of the Apache archives, and we also need some way to control the volume and quality of mail on the list so that it remains a useful resource for the majority of users. Nick On Wed Jan 21 2015 at 3:13:21 PM pzecevic petar.zece...@gmail.com wrote: Hi, I tried to find the last reply by Nick Chammas (that I received in the digest) using the Nabble web interface, but I cannot find it (perhaps he didn't reply directly to the user list?). That's one example of Nabble's usability. Anyhow, I wanted to add my two cents... Apache user group could be frozen (not accepting new questions, if that's possible) and redirect users to Stack Overflow (automatic reply?). Old questions remain (and are searchable) on Nabble, new questions go to Stack Exchange, so no need for migration. That's the idea, at least, as I'm not sure if that's technically doable... Is it? dev mailing list could perhaps stay on Nabble (it's not that busy), or have a special tag on Stack Exchange.
Other thing, about new Stack Exchange site I proposed earlier. If a new site is created, there is no problem with guidelines, I think, because Spark community can apply different guidelines for the new site. There is a FAQ about creating new sites: http://area51.stackexchange.com/faq It says: Stack Exchange sites are free to create and free to use. All we ask is that you have an enthusiastic, committed group of expert users who check in regularly, asking and answering questions. I think this requirement is satisfied... Someone expressed a concern that they won't allow creating a project-specific site, but there already exist some project-specific sites, like Tor, Drupal, Ubuntu... Later, though, the FAQ also says: If Y already exists, it already has a tag for X, and nobody is complaining (then you should not create a new
Re: unable to write SequenceFile using saveAsNewAPIHadoopFile
Yeah, it worked like a charm!! Thank you! On Thu, Jan 22, 2015 at 2:28 PM, Sean Owen so...@cloudera.com wrote: First, as an aside: I am pretty sure you cannot reuse one Text and IntWritable object here, since Spark does not necessarily finish with one call()'s value before the next call(). Although that should not be directly related to the serialization problem, I suspect it is: your function is not serializable, since it contains references to these cached writables. I think removing them fixes both problems.
Re: Discourse: A proposed alternative to the Spark User list
But voting is done on the dev list, right? That could stay there... Overlay might be a fine solution, too, but that still gives two user lists (SO and Nabble+overlay). On 22.1.2015. 10:42, Sean Owen wrote: Yes, there is some project business, like votes of record on releases, that needs to be carried on in a standard, simple, accessible place, and SO is not at all suitable. Nobody is stuck with Nabble. The suggestion is to enable a different overlay on the existing list. SO remains a place you can ask questions too. So I agree with Nick's take. BTW, are there perhaps plans to split this mailing list into subproject-specific lists? That might also help tune in/out the subset of conversations of interest. On Jan 22, 2015 10:30 AM, Petar Zecevic petar.zece...@gmail.com wrote: Ok, thanks for the clarifications. I didn't know this list has to remain the only official list. Nabble is really not the best solution in the world, but we're stuck with it, I guess. That's it from me on this subject. Petar
Re: Spark on YARN: java.lang.ClassCastException SerializedLambda to org.apache.spark.api.java.function.Function in instance of org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1
Update: I deployed a standalone Spark on localhost, set the master to spark://localhost:7077, and hit the same issue. I don't know how to solve it. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-YARN-java-lang-ClassCastException-SerializedLambda-to-org-apache-spark-api-java-function-Fu1-tp21261p21315.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Discourse: A proposed alternative to the Spark User list
Yes, there is some project business, like votes of record on releases, that needs to be carried on in a standard, simple, accessible place, and SO is not at all suitable. Nobody is stuck with Nabble. The suggestion is to enable a different overlay on the existing list. SO remains a place you can ask questions too. So I agree with Nick's take. BTW, are there perhaps plans to split this mailing list into subproject-specific lists? That might also help tune in/out the subset of conversations of interest. On Jan 22, 2015 10:30 AM, Petar Zecevic petar.zece...@gmail.com wrote: Ok, thanks for the clarifications. I didn't know this list has to remain the only official list. Nabble is really not the best solution in the world, but we're stuck with it, I guess. That's it from me on this subject. Petar
Re: Are these numbers abnormal for spark streaming?
Hi Ashic Mahtab, Are Cassandra and ZooKeeper installed as part of the YARN architecture, or are they installed in a separate layer alongside Apache Spark? Thanks and Regards, Sudipta On Thu, Jan 22, 2015 at 8:13 PM, Ashic Mahtab as...@live.com wrote: Hi Guys, So I changed the interval to 15 seconds. There's obviously a lot more messages per batch, but (I think) it looks a lot healthier. Can you see any major warning signs? I think that with 2 second intervals, the setup / teardown per partition was what was causing the delays.

Streaming
- Started at: Thu Jan 22 13:23:12 GMT 2015
- Time since start: 1 hour 17 minutes 16 seconds
- Network receivers: 2
- Batch interval: 15 seconds
- Processed batches: 309
- Waiting batches: 0

Statistics over last 100 processed batches

Receiver Statistics
Receiver       Status  Location            Records in last batch [2015/01/22 14:40:29]  Min rate [rec/s]  Median rate [rec/s]  Max rate [rec/s]  Last Error
RmqReceiver-0  ACTIVE  VDCAPP53.foo.local  2.6 K                                        29                106                  295               -
RmqReceiver-1  ACTIVE  VDCAPP50.bar.local  2.6 K                                        29                107                  291               -

Batch Processing Statistics
Metric            Last batch  Minimum     25th pct    Median      75th pct    Maximum
Processing Time   4 s 812 ms  4 s 698 ms  4 s 738 ms  4 s 761 ms  4 s 788 ms  5 s 802 ms
Scheduling Delay  2 ms        0 ms        3 ms        3 ms        4 ms        9 ms
Total Delay       4 s 814 ms  4 s 701 ms  4 s 739 ms  4 s 764 ms  4 s 792 ms  5 s 809 ms

Regards, Ashic. -- From: as...@live.com To: gerard.m...@gmail.com CC: user@spark.apache.org Subject: RE: Are these numbers abnormal for spark streaming? Date: Thu, 22 Jan 2015 12:32:05 + Hi Gerard, Thanks for the response. The messages get deserialised from msgpack format, and one of the strings is deserialised to JSON. Certain fields are checked to decide if further processing is required. If so, it goes through a series of in-mem filters to check if more processing is required. If so, only then does the heavy work start.
That consists of a few db queries, and potential updates to the db + a message on a message queue. The majority of messages don't need processing. The messages needing processing at peak are about three every other second. One possible thing that might be happening is the session initialisation and prepared statement initialisation for each partition. I can resort to some tricks, but I think I'll try increasing the batch interval to 15 seconds. I'll report back with findings. Thanks, Ashic. -- From: gerard.m...@gmail.com Date: Thu, 22 Jan 2015 12:30:08 +0100 Subject: Re: Are these numbers abnormal for spark streaming? To: tathagata.das1...@gmail.com CC: as...@live.com; t...@databricks.com; user@spark.apache.org and post the code (if possible). In a nutshell, your processing time > batch interval, resulting in an ever-increasing delay that will end up in a crash. 3 secs to process 14 messages looks like a lot. Curious what the job logic is. -kr, Gerard. On Thu, Jan 22, 2015 at 12:15 PM, Tathagata Das tathagata.das1...@gmail.com wrote: This is not normal. It's a huge scheduling delay!! Can you tell me more about the application? - cluster setup, number of receivers, what's the computation, etc. On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab as...@live.com wrote: Hate to do this...but...erm...bump? Would really appreciate input from others using Streaming. Or at least some docs that would tell me if these are expected or not. -- From: as...@live.com To: user@spark.apache.org Subject: Are these numbers abnormal for spark streaming? Date: Wed, 21 Jan 2015 11:26:31 + Hi Guys, I've got Spark Streaming set up for a low data rate system (using Spark's features for analysis, rather than high throughput). Messages are coming in throughout the day, at around 1-20 per second (finger in the air estimate...not analysed yet). In the Spark Streaming UI for the application, I'm getting the following after 17 hours.
Streaming
- Started at: Tue Jan 20 16:58:43 GMT 2015
- Time since start: 18 hours 24 minutes 34 seconds
- Network receivers: 2
- Batch interval: 2 seconds
- Processed batches: 16482
- Waiting batches: 1

Statistics over last 100 processed batches

Receiver Statistics
Receiver       Status  Location  Records in last batch [2015/01/21 11:23:18]  Min rate [rec/s]  Median rate [rec/s]  Max rate [rec/s]  Last Error
RmqReceiver-0  ACTIVE  F         14                                           4                 7                    27                -
RmqReceiver-1  ACTIVE  BR        12                                           4                 7                    26                -

Batch Processing Statistics
Metric           Last batch        Minimum  25th pct         Median            75th pct
Processing Time  3 seconds 994 ms  157 ms   4 seconds 16 ms  4 seconds 961 ms  5
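Gerard's warning about processing time exceeding the batch interval can be made concrete with a back-of-the-envelope model (this is not Spark code, just arithmetic): each batch arrives every `intervalMs` but can only start once the previous batches have finished, so any excess of processing time over the interval accumulates as scheduling delay.

```java
public class SchedulingDelay {

    /**
     * Scheduling delay of the Nth batch when every batch takes processingMs
     * to run and a new batch arrives every intervalMs. Each batch must wait
     * for all previous batches to finish before it can start.
     */
    static long delayOfBatch(int n, long intervalMs, long processingMs) {
        long delay = 0;
        for (int i = 0; i < n; i++) {
            // The next batch arrives intervalMs later, but the executor only
            // frees up after the current delay plus one processing time.
            delay = Math.max(0, delay + processingMs - intervalMs);
        }
        return delay;
    }

    public static void main(String[] args) {
        // 2 s interval, ~4 s processing (as in the stats above):
        // the delay grows by ~2 s per batch, without bound.
        System.out.println(delayOfBatch(100, 2000, 4000) + " ms after 100 batches");
        // 15 s interval, ~5 s processing (the poster's later fix):
        // the backlog never forms.
        System.out.println(delayOfBatch(100, 15000, 5000) + " ms after 100 batches");
    }
}
```

This is why moving from a 2-second to a 15-second interval "fixed" things: with ~4-5 s of per-batch work (dominated by per-partition setup/teardown), only the longer interval leaves headroom.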
Re: Discourse: A proposed alternative to the Spark User list
I've been contributing to SO for a while now. Here are a few observations I'd like to contribute to the discussion: The level of questions on SO is often more entry-level. Harder questions (that require expertise in a certain area) remain unanswered for a while. The same questions here on the list (as they are often cross-posted) receive a faster turnaround. Roughly speaking, there are two groups of questions: implementing things on Spark, and running Spark. The second one is borderline under SO guidelines, as such questions often involve cluster setups, long logs, and little idea of what's going on (mind you, those questions often come from people starting with Spark). In my opinion, Stack Overflow offers a better Q/A experience; in particular, they have tooling in place to reduce duplicates, something that often overloads this list (the same getting-started issues, or how to map, filter, flatmap over and over again). That said, this list offers a richer forum, where the expertise pool is a lot deeper. Also, while SO is fairly strict in requiring posters to show a minimal amount of effort in the question being asked, this list is quite friendly to the same behavior. This is probably an element that makes the list 'lower impedance'. One additional thing on SO is that the [apache-spark] tag is a 'low rep' tag. Neither questions nor answers get significant voting, reducing the 'rep gaming' factor (discouraging participation?). Thinking about how to improve both platforms, SO [apache-spark] and this ML, and get the list back down from overwhelming message volumes, we could implement some 'load balancing' policies:
- encourage new users to use Stack Overflow; in particular, redirect newbie questions to SO the friendly way: did you search SO already? or link to an existing question - most how to map, flatmap, filter, aggregate, reduce, ...
would fall under this category
- encourage domain experts to hang out on SO more often (my impression is that MLlib and GraphX are fairly underserved)
- have an 'escalation' process in place, where we could post interesting/hard/bug questions from SO back to the list (or encourage the poster to do so)
- update our community guidelines at http://spark.apache.org/community.html to implement such policies.
Those are just some ideas on how to improve the community and better serve the newcomers while avoiding overload of our existing expertise pool. kr, Gerard. On Thu, Jan 22, 2015 at 10:42 AM, Sean Owen so...@cloudera.com wrote: Yes, there is some project business, like votes of record on releases, that needs to be carried on in a standard, simple, accessible place, and SO is not at all suitable. Nobody is stuck with Nabble. The suggestion is to enable a different overlay on the existing list. SO remains a place you can ask questions too. So I agree with Nick's take. BTW, are there perhaps plans to split this mailing list into subproject-specific lists? That might also help tune in/out the subset of conversations of interest.
Spark performance for small queries
Hello, We were comparing the performance of some of our production Hive queries between Hive and Spark. We compared Hive (0.13) + Hadoop (1.2.1) against both Spark 0.9 and 1.1, and could see that the performance gains in Spark have been good. We tried a very simple query, select count(*) from T where col3=123, in both Spark SQL and Hive (with hive.map.aggr=true) and found that Spark's performance was 2x better than Hive's (60 sec vs 120 sec). Table T is stored in S3 and consists of a single 600 MB GZIP file. My question is, why is Spark faster than Hive here? In both cases, the file will be downloaded and uncompressed, and the lines will be counted, by a single process. In Hive's case, the reducer will be an identity function since hive.map.aggr is true. Note that disk spills and network I/O are minimal in Hive's case as well. -- Regards, Saumitra Shahapure
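The key constraint in the question above is that a single GZIP file is one compressed stream, so neither engine can split it: both must decompress and scan it sequentially in one task. The following stdlib-only sketch mirrors that single-threaded pass for a query like select count(*) from T where col3=123 (assuming col3 is the third comma-separated field; the column layout and values are made up for illustration). Any engine-level speedup therefore has to come from overheads around this pass (job startup, S3 download, scheduling), not from parallelism.

```java
import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipScanCount {

    /** Gzip a string in memory (stand-in for the 600 MB file in S3). */
    static byte[] gzip(String text) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (OutputStreamWriter w =
                     new OutputStreamWriter(new GZIPOutputStream(buf), "UTF-8")) {
                w.write(text);
            }
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Single-threaded equivalent of: select count(*) from T where col3 = value.
     * One GZIP stream means one sequential reader, whatever the engine.
     */
    static long countCol3Equals(byte[] gzipped, String value) {
        long count = 0;
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                 new GZIPInputStream(new ByteArrayInputStream(gzipped)), "UTF-8"))) {
            String line;
            while ((line = r.readLine()) != null) {
                String[] cols = line.split(",");
                if (cols.length > 2 && cols[2].equals(value)) count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] data = gzip("a,b,123\nc,d,456\ne,f,123\n");
        System.out.println(countCol3Equals(data, "123")); // prints 2
    }
}
```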
Re: HDFS Namenode in safemode when I turn off my EC2 instance
If you are using CDH, you would be shutting down services with Cloudera Manager. I believe you can do it manually using Linux 'services' if you do the steps correctly across your whole cluster. I'm not sure if the stock stop-all.sh script is supposed to work. Certainly, if you are using CM, by far the easiest is to start/stop all of these things in CM. On Wed, Jan 21, 2015 at 6:08 PM, Su She suhsheka...@gmail.com wrote: Hello Sean Akhil, I tried running the stop-all.sh script on my master and I got this message: localhost: Permission denied (publickey,gssapi-keyex,gssapi-with-mic). chown: changing ownership of `/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/spark/logs': Operation not permitted no org.apache.spark.deploy.master.Master to stop I am running Spark (Yarn) via Cloudera Manager. I tried stopping it from Cloudera Manager first, but it looked like it was only stopping the history server, so I started Spark again and tried ./stop-all.sh and got the above message. Also, what is the command for shutting down storage or can I simply stop hdfs in Cloudera Manager? Thank you for the help! On Sat, Jan 17, 2015 at 12:58 PM, Su She suhsheka...@gmail.com wrote: Thanks Akhil and Sean for the responses. I will try shutting down spark, then storage and then the instances. Initially, when hdfs was in safe mode, I waited for 1 hour and the problem still persisted. I will try this new method. Thanks! On Sat, Jan 17, 2015 at 2:03 AM, Sean Owen so...@cloudera.com wrote: You would not want to turn off storage underneath Spark. Shut down Spark first, then storage, then shut down the instances. Reverse the order when restarting. HDFS will be in safe mode for a short time after being started before it becomes writeable. I would first check that it's not just that. Otherwise, find out why the cluster went into safe mode from the logs, fix it, and then leave safe mode. 
On Sat, Jan 17, 2015 at 9:03 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Safest way would be to first shut down HDFS and then shut down Spark (calling stop-all.sh would do) and then shut down the machines. You can execute the following command to leave safe mode: hdfs dfsadmin -safemode leave Thanks Best Regards On Sat, Jan 17, 2015 at 8:31 AM, Su She suhsheka...@gmail.com wrote: Hello Everyone, I am encountering trouble running Spark applications when I shut down my EC2 instances. Everything else seems to work except Spark. When I try running a simple Spark application, like sc.parallelize(), I get the message that the HDFS name node is in safemode. Has anyone else had this issue? Is there a proper protocol I should be following to turn off my Spark nodes? Thank you! - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: [mllib] Decision Tree - prediction probabilites of label classes
You are right that this isn't implemented. I presume you could propose a PR for this. The impurity calculator implementations already receive category counts. The only drawback I see is having to store N probabilities at each leaf, not 1. On Wed, Jan 21, 2015 at 3:36 PM, Zsolt Tóth toth.zsolt@gmail.com wrote: Hi, I use DecisionTree for multi-class classification. I can get the probability of the predicted label for every node in the decision tree from node.predict().prob(). Is it possible to retrieve or count the probability of every possible label class in the node? To be more clear: Say in Node A there are 4 of label 0.0, 2 of label 1.0 and 3 of label 2.0. If I'm correct, predict.prob() is 4/9 in this case. I need the values 2/9 and 3/9 for the 2 other labels. It would be great to retrieve the exact count of label classes ([4,2,3] in the example) but I don't think that's possible now. Is something like this planned for a future release? Thanks!
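Since the impurity calculators already receive per-class counts, the per-class probabilities Zsolt asks for are just the normalized counts. A minimal sketch in plain Python (outside MLlib; the helper name is made up for illustration):

```python
def class_probabilities(label_counts):
    """Normalize per-class instance counts at a tree node into
    per-class probabilities. predict.prob() today only exposes
    the entry for the majority (predicted) label."""
    total = sum(label_counts)
    return [count / total for count in label_counts]

# Node A from the example: 4 of label 0.0, 2 of label 1.0, 3 of label 2.0.
probs = class_probabilities([4, 2, 3])  # [4/9, 2/9, 3/9]
```

Storing this full list at each leaf is the "N probabilities, not 1" cost mentioned above.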
Re: Discourse: A proposed alternative to the Spark User list
I agree with Sean that a Spark-specific Stack Exchange likely won't help and almost certainly won't make it out of Area 51. The idea certainly sounds nice from our perspective as Spark users, but it doesn't mesh with the structure of Stack Exchange or the criteria for creating new sites. On Thu Jan 22 2015 at 1:23:14 PM Sean Owen so...@cloudera.com wrote: FWIW I am a moderator for datascience.stackexchange.com, and even that hasn't really achieved the critical mass that SE sites are supposed to: http://area51.stackexchange.com/proposals/55053/data-science I think a Spark site would have a lot less traffic. One annoyance is that people can't figure out when to post on SO vs Data Science vs Cross Validated. A Spark site would have the same problem, fragmentation and cross posting with SO. I don't think this would be accepted as a StackExchange site and don't think it helps. On Thu, Jan 22, 2015 at 6:16 PM, pierred pie...@demartines.com wrote: A dedicated stackexchange site for Apache Spark sounds to me like the logical solution. Less trolling, more enthusiasm, and with the participation of the people on this list, I think it would very quickly become the reference for many technical questions, as well as a great vehicle to promote the awesomeness of Spark.
RE: sparkcontext.objectFile return thousands of partitions
Sean, you said: "If you know that this number is too high you can request a number of partitions when you read it." How to do that? Can you give a code snippet? I want to read it into 8 partitions, so I do val rdd2 = sc.objectFile[LabeledPoint]("file:///tmp/mydir", 8) However rdd2 contains thousands of partitions instead of 8 partitions. Regards, Ningjun Wang Consulting Software Engineer LexisNexis 121 Chanlon Road New Providence, NJ 07974-1541 From: Sean Owen [mailto:so...@cloudera.com] Sent: Wednesday, January 21, 2015 2:32 PM To: Wang, Ningjun (LNG-NPV) Cc: user@spark.apache.org Subject: Re: sparkcontext.objectFile return thousands of partitions You have 8 files, not 8 partitions. It does not follow that they should be read as 8 partitions since they are presumably large and so you would be stuck using at most 8 tasks in parallel to process. The number of partitions is determined by Hadoop input splits and generally makes a partition per block of data. If you know that this number is too high you can request a number of partitions when you read it. Don't coalesce, just read the desired number from the start. On Jan 21, 2015 4:32 PM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: Why does sc.objectFile(...) return an RDD with thousands of partitions? I save an RDD to the file system using rdd.saveAsObjectFile("file:///tmp/mydir") Note that the rdd contains 7 million objects. I checked the directory /tmp/mydir/; it contains 8 partitions part-0 part-2 part-4 part-6 _SUCCESS part-1 part-3 part-5 part-7 I then load the rdd back using val rdd2 = sc.objectFile[LabeledPoint]("file:///tmp/mydir", 8) I expect rdd2 to have 8 partitions. But from the master UI, I see that rdd2 has over 1000 partitions. This is very inefficient. How can I limit it to 8 partitions just like what is stored on the file system?
Regards, Ningjun Wang Consulting Software Engineer LexisNexis 121 Chanlon Road New Providence, NJ 07974-1541
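As Sean notes, the partition count comes from Hadoop input splits (roughly one per block), not from the number of part files. A back-of-the-envelope model in plain Python, with file and block sizes assumed purely for illustration:

```python
import math

# Assumed sizes for illustration: 8 part files of 1 GB each,
# read with a 64 MB HDFS-style block size.
block_size = 64 * 1024 * 1024
file_sizes = [1024 * 1024 * 1024] * 8

# Roughly one input split (and hence one partition) per block per file:
partitions = sum(math.ceil(size / block_size) for size in file_sizes)
print(partitions)  # 128 partitions from just 8 files
```

With larger files or smaller split sizes the count climbs into the thousands, matching what the master UI shows. Note also that the second argument to sc.objectFile is a minimum partition count, so requesting 8 cannot reduce a split count that is already higher.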
Re: Discourse: A proposed alternative to the Spark User list
we could implement some 'load balancing' policies: I think Gerard's suggestions are good. We need some "official" buy-in from the project's maintainers and heavy contributors and we should move forward with them. I know that at least Josh Rosen, Sean Owen, and Tathagata Das, who are active on this list, are also active on SO http://stackoverflow.com/tags/apache-spark/topusers. So perhaps we're already part of the way there. Nick On Thu Jan 22 2015 at 5:32:40 AM Gerard Maas gerard.m...@gmail.com wrote: I've been contributing to SO for a while now. Here are a few observations I'd like to contribute to the discussion: Questions on SO are often more entry-level. Harder questions (that require expertise in a certain area) remain unanswered for a while. The same questions here on the list (as they are often cross-posted) receive a faster turnaround. Roughly speaking, there are two groups of questions: implementing things on Spark and running Spark. The second one is borderline under SO guidelines, as such questions often involve cluster setups, long logs and little idea of what's going on (mind you, often those questions come from people starting with Spark). In my opinion, Stack Overflow offers a better Q/A experience; in particular, they have tooling in place to reduce duplicates, something that often overloads this list (the same getting-started issues or how to map, filter, flatmap over and over again). That said, this list offers a richer forum, where the expertise pool is a lot deeper. Also, while SO is fairly strict in requiring posters to show a minimal amount of effort in the question being asked, this list is quite friendly to the same behavior. This is probably an element that makes the list 'lower impedance'. One additional thing on SO is that the [apache-spark] tag is a 'low rep' tag. Neither questions nor answers get significant voting, reducing the 'rep gaming' factor (discouraging participation?)
Thinking about how to improve both platforms, SO [apache-spark] and this ML, and bring the list back from overwhelming message volumes, we could implement some 'load balancing' policies: - encourage new users to use Stack Overflow; in particular, redirect newbie questions to SO the friendly way: did you search SO already? or link to an existing question. - most how to map, flatmap, filter, aggregate, reduce, ... questions would fall under this category - encourage domain experts to hang out on SO more often (my impression is that MLlib and GraphX are fairly underserved) - have an 'escalation' process in place, where we could post 'interesting/hard/bug' questions from SO back to the list (or encourage the poster to do so) - update our community guidelines on [http://spark.apache.org/community.html] to implement such policies. Those are just some ideas on how to improve the community and better serve the newcomers while avoiding overload of our existing expertise pool. kr, Gerard. On Thu, Jan 22, 2015 at 10:42 AM, Sean Owen so...@cloudera.com wrote: Yes, there is some project business like votes of record on releases that needs to be carried on in a standard, simple, accessible place, and SO is not at all suitable. Nobody is stuck with Nabble. The suggestion is to enable a different overlay on the existing list. SO remains a place you can ask questions too. So I agree with Nick's take. BTW are there perhaps plans to split this mailing list into subproject-specific lists? That might also help tune in/out the subset of conversations of interest. On Jan 22, 2015 10:30 AM, Petar Zecevic petar.zece...@gmail.com wrote: Ok, thanks for the clarifications. I didn't know this list has to remain as the only official list. Nabble is really not the best solution in the world, but we're stuck with it, I guess. That's it from me on this subject. Petar On 22.1.2015. 3:55, Nicholas Chammas wrote: I think a few things need to be laid out clearly: 1.
This mailing list is the “official” user discussion platform. That is, it is sponsored and managed by the ASF. 2. Users are free to organize independent discussion platforms focusing on Spark, and there is already one such platform in Stack Overflow under the apache-spark and related tags. Stack Overflow works quite well. 3. The ASF will not agree to deprecating or migrating this user list to a platform that they do not control. 4. This mailing list has grown to an unwieldy size and discussions are hard to find or follow; discussion tooling is also lacking. We want to improve the utility and user experience of this mailing list. 5. We don’t want to fragment this “official” discussion community. 6. Nabble is an independent product not affiliated with the ASF. It offers a slightly better interface to the Apache mailing list archives. So to respond to some of your points, pzecevic: Apache user group could be frozen (not accepting new
Re: spark-shell has syntax error on windows.
I am not sure if you get the same exception as I do -- spark-shell2.cmd works fine for me. Windows 7 as well. I've never bothered looking to fix it, as it seems spark-shell just calls spark-shell2 anyway... On Thu, Jan 22, 2015 at 3:16 AM, Vladimir Protsenko protsenk...@gmail.com wrote: I have a problem with running the Spark shell in Windows 7. I made the following steps: 1. downloaded and installed Scala 2.11.5 2. downloaded spark 1.2.0 by git clone git://github.com/apache/spark.git 3. ran dev/change-version-to-2.11.sh and mvn -Dscala-2.11 -DskipTests clean package (in git bash) After installation I tried to run spark-shell.cmd in a cmd shell and it says there is a syntax error in the file. What can I do to fix the problem? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-shell-has-syntax-error-on-windows-tp21313.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
RE: Are these numbers abnormal for spark streaming?
Another quick question... I've got 4 nodes with 2 cores each. I've assigned the streaming app 4 cores. It seems to be using one per node. I imagine forwarding from the receivers to the executors is causing unnecessary processing. Is there a way to specify that I want 2 cores from the same machines to be involved (even better if this can be specified during spark-submit)? Thanks, Ashic. From: as...@live.com To: gerard.m...@gmail.com; asudipta.baner...@gmail.com CC: user@spark.apache.org; tathagata.das1...@gmail.com Subject: RE: Are these numbers abnormal for spark streaming? Date: Thu, 22 Jan 2015 15:40:17 + Yup...looks like it. I can do some tricks to reduce setup costs further, but this is much better than where I was yesterday. Thanks for your awesome input :) -Ashic. From: gerard.m...@gmail.com Date: Thu, 22 Jan 2015 16:34:38 +0100 Subject: Re: Are these numbers abnormal for spark streaming? To: asudipta.baner...@gmail.com CC: as...@live.com; user@spark.apache.org; tathagata.das1...@gmail.com Given that the process, and in particular the setup of connections, is bound to the number of partitions (in x.foreachPartition{ x => ??? }), I think it would be worth trying to reduce them. Increasing 'spark.streaming.blockInterval' will do the trick (you can read the tuning details here: http://www.virdata.com/tuning-spark/#Partitions) -kr, Gerard. On Thu, Jan 22, 2015 at 4:28 PM, Gerard Maas gerard.m...@gmail.com wrote: So the system has gone from 7 msgs in 4.961 secs (median) to 106 msgs in 4.761 secs. I think there's evidence that setup costs are quite high in this case and increasing the batch interval is helping. On Thu, Jan 22, 2015 at 4:12 PM, Sudipta Banerjee asudipta.baner...@gmail.com wrote: Hi Ashic Mahtab, The Cassandra and the Zookeeper: are they installed as part of the Yarn architecture, or in a separate layer with Apache Spark?
Thanks and Regards, Sudipta On Thu, Jan 22, 2015 at 8:13 PM, Ashic Mahtab as...@live.com wrote: Hi Guys, So I changed the interval to 15 seconds. There's obviously a lot more messages per batch, but (I think) it looks a lot healthier. Can you see any major warning signs? I think that with 2 second intervals, the setup / teardown per partition was what was causing the delays. Streaming: Started at: Thu Jan 22 13:23:12 GMT 2015; Time since start: 1 hour 17 minutes 16 seconds; Network receivers: 2; Batch interval: 15 seconds; Processed batches: 309; Waiting batches: 0. Receiver statistics (over last 100 processed batches): RmqReceiver-0, ACTIVE, VDCAPP53.foo.local, records in last batch [2015/01/22 14:40:29]: 2.6 K, min/median/max rate [records/sec]: 29/106/295, last error: -; RmqReceiver-1, ACTIVE, VDCAPP50.bar.local, 2.6 K, 29/107/291, -. Batch processing statistics (last batch / minimum / 25th percentile / median / 75th percentile / maximum): Processing Time: 4s 812ms / 4s 698ms / 4s 738ms / 4s 761ms / 4s 788ms / 5s 802ms; Scheduling Delay: 2ms / 0ms / 3ms / 3ms / 4ms / 9ms; Total Delay: 4s 814ms / 4s 701ms / 4s 739ms / 4s 764ms / 4s 792ms / 5s 809ms. Regards, Ashic. From: as...@live.com To: gerard.m...@gmail.com CC: user@spark.apache.org Subject: RE: Are these numbers abnormal for spark streaming? Date: Thu, 22 Jan 2015 12:32:05 + Hi Gerard, Thanks for the response. The messages get deserialised from msgpack format, and one of the strings is deserialised to json. Certain fields are checked to decide if further processing is required. If so, it goes through a series of in-mem filters to check if more processing is required. If so, only then does the heavy work start. That consists of a few db queries, and potential updates to the db + a message on a message queue. The majority of messages don't need processing. The messages needing processing at peak are about three every other second.
One possible thing that might be happening is the session initialisation and prepared statement initialisation for each partition. I can resort to some tricks, but I think I'll try increasing the batch interval to 15 seconds. I'll report back with findings. Thanks, Ashic. From: gerard.m...@gmail.com Date: Thu, 22 Jan 2015 12:30:08 +0100 Subject: Re: Are these numbers abnormal for spark streaming? To: tathagata.das1...@gmail.com CC: as...@live.com; t...@databricks.com; user@spark.apache.org and post the code (if possible). In a nutshell, your processing time > batch interval, resulting in an ever-increasing delay that will end up in a crash. 3 secs to process 14 messages looks like a lot. Curious what the job logic is. -kr, Gerard. On Thu, Jan 22, 2015 at 12:15 PM, Tathagata Das tathagata.das1...@gmail.com wrote: This is not normal. It's a huge scheduling delay!! Can you tell me more about the application? - cluster setup, number of receivers, what's the computation, etc. On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab as...@live.com wrote: Hate to do
Re: Spark Team - Paco Nathan said that your team can help
Yes, this isn't a well-formed question, and got maybe the response it deserved, but the tone is veering off the rails. I just got a much ruder reply from Sudipta privately, which I will not forward. Sudipta, I suggest you take the responses you've gotten so far as about as much answer as can be had here and do some work yourself, and come back with much more specific questions, and it will all be helpful and polite again. On Thu, Jan 22, 2015 at 2:51 PM, Sudipta Banerjee asudipta.baner...@gmail.com wrote: Hi Marco, Thanks for the confirmation. Please let me know what are the lot more detail you need to answer a very specific question WHAT IS THE MINIMUM HARDWARE CONFIGURATION REQUIRED TO BUILT HDFS+ MAPREDUCE+SPARK+YARN on a system? Please let me know if you need any further information and if you dont know please drive across with the $1 to Sir Paco Nathan and get me the answer. Thanks and Regards, Sudipta On Thu, Jan 22, 2015 at 5:33 PM, Marco Shaw marco.s...@gmail.com wrote: Hi, Let me reword your request so you understand how (too) generic your question is Hi, I have $10,000, please find me some means of transportation so I can get to work. Please provide (a lot) more details. If you can't, consider using one of the pre-built express VMs from either Cloudera, Hortonworks or MapR, for example. Marco On Jan 22, 2015, at 7:36 AM, Sudipta Banerjee asudipta.baner...@gmail.com wrote: Hi Apache-Spark team , What are the system requirements installing Hadoop and Apache Spark? I have attached the screen shot of Gparted. 
Thanks and regards, Sudipta -- Sudipta Banerjee Consultant, Business Analytics and Cloud Based Architecture Call me +919019578099 Screenshot - Wednesday 21 January 2015 - 10:55:29 IST.png
Exception: NoSuchMethodError: org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions
Hi, I get this exception when I run a Spark test case on my local machine: An exception or error caused a run to abort: org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions(Lorg/apache/spark/streaming/dstream/DStream;Lscala/reflect/ClassTag;Lscala/reflect/ClassTag;Lscala/math/Ordering;)Lorg/apache/spark/streaming/dstream/PairDStreamFunctions; java.lang.NoSuchMethodError: org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions(Lorg/apache/spark/streaming/dstream/DStream;Lscala/reflect/ClassTag;Lscala/reflect/ClassTag;Lscala/math/Ordering;)Lorg/apache/spark/streaming/dstream/PairDStreamFunctions; In my test case I have these Spark-related imports: import org.apache.spark.streaming.StreamingContext._ import org.apache.spark.streaming.TestSuiteBase import org.apache.spark.streaming.dstream.DStream import org.apache.spark.streaming.StreamingContext.toPairDStreamFunctions -Adrian
Re: How to 'Pipe' Binary Data in Apache Spark
Venkat, No problem! So, creating a custom InputFormat or using sc.binaryFiles alone is not the right solution. We also need the modified version of RDD.pipe to support binary data? Is my understanding correct? Yep! That is correct. The custom InputFormat allows Spark to load binary formatted data from disk/HDFS/S3/etc…, but then the default RDD.pipe reads/writes text to a pipe, so you’d need the custom mapPartitions call. If yes, this can be added as new enhancement Jira request? The code that I have right now is fairly custom to my application, but if there was interest, I would be glad to port it for the Spark core. Regards, Frank Austin Nothaft fnoth...@berkeley.edu fnoth...@eecs.berkeley.edu 202-340-0466 On Jan 22, 2015, at 7:11 AM, Venkat, Ankam ankam.ven...@centurylink.com wrote: Thanks Frank for your response. So, creating a custom InputFormat or using sc.binaryFiles alone is not the right solution. We also need the modified version of RDD.pipe to support binary data? Is my understanding correct? If yes, this can be added as new enhancement Jira request? Nick: What’s your take on this? Regards, Venkat Ankam From: Frank Austin Nothaft [mailto:fnoth...@berkeley.edu] Sent: Wednesday, January 21, 2015 12:30 PM To: Venkat, Ankam Cc: Nick Allen; user@spark.apache.org Subject: Re: How to 'Pipe' Binary Data in Apache Spark Hi Venkat/Nick, The Spark RDD.pipe method pipes text data into a subprocess and then receives text data back from that process. Once you have the binary data loaded into an RDD properly, to pipe binary data to/from a subprocess (e.g., you want the data in the pipes to contain binary, not text), you need to implement your own, modified version of RDD.pipe. The implementation of RDD.pipe spawns a process per partition (IIRC), as well as threads for writing to and reading from the process (as well as stderr for the process). When writing via RDD.pipe, Spark calls *.toString on the object, and pushes that text representation down the pipe. 
There is an example of how to pipe binary data from within a mapPartitions call using the Scala API in lines 107-177 of this file. This specific code contains some nastiness around the packaging of downstream libraries that we rely on in that project, so I’m not sure if it is the cleanest way, but it is a workable way. Regards, Frank Austin Nothaft fnoth...@berkeley.edu fnoth...@eecs.berkeley.edu 202-340-0466 On Jan 21, 2015, at 9:17 AM, Venkat, Ankam ankam.ven...@centurylink.com wrote: I am trying to solve a similar problem. I am using option # 2 as suggested by Nick. I have created an RDD with sc.binaryFiles for a list of .wav files. But I am not able to pipe it to the external programs. For example: sq = sc.binaryFiles(wavfiles) # all .wav files stored in the "wavfiles" directory on HDFS sq.keys().collect() # works fine, shows the list of file names sq.values().collect() # works fine, shows the content of the files sq.values().pipe(lambda x: subprocess.call(['/usr/local/bin/sox', '-t' 'wav', '-', '-n', 'stats'])).collect() # does not work; tried different options. AttributeError: 'function' object has no attribute 'read' Any suggestions? Regards, Venkat Ankam From: Nick Allen [mailto:n...@nickallen.org] Sent: Friday, January 16, 2015 11:46 AM To: user@spark.apache.org Subject: Re: How to 'Pipe' Binary Data in Apache Spark I just wanted to reiterate the solution for the benefit of the community. The problem is not from my use of 'pipe', but that 'textFile' cannot be used to read in binary data. (Doh) There are a couple of options to move forward. 1. Implement a custom 'InputFormat' that understands the binary input data. (Per Sean Owen) 2. Use 'SparkContext.binaryFiles' to read in the entire binary file as a single record. This will impact performance as it prevents the use of more than one mapper on the file's data. In my specific case for #1 I can only find one project from RIPE-NCC (https://github.com/RIPE-NCC/hadoop-pcap) that does this.
Unfortunately, it appears to only support a limited set of network protocols. On Fri, Jan 16, 2015 at 10:40 AM, Nick Allen n...@nickallen.org wrote: Per your last comment, it appears I need something like this: https://github.com/RIPE-NCC/hadoop-pcap Thanks a ton. That got me oriented in the right direction. On Fri, Jan 16, 2015 at 10:20 AM, Sean Owen so...@cloudera.com wrote: Well it looks like you're reading some kind of binary file as text. That isn't going to work, in Spark or elsewhere, as binary data is not even necessarily a valid encoding of a string. There are no line breaks to delimit lines and thus elements of the RDD. Your input has some record structure (or else it's not really useful to put it into an RDD). You can encode this as a SequenceFile and read it with objectFile.
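To make Frank's point concrete, here is a rough sketch of the binary-safe variant of the pipe pattern: write raw bytes to the external process's stdin and read raw bytes back, instead of the *.toString round-trip that RDD.pipe performs. In Spark this function would be handed to rdd.mapPartitions(...); here it is shown standalone, with `cat` standing in for a real tool like sox:

```python
import subprocess

def pipe_binary_partition(records, command):
    """Pipe one partition's raw bytes through an external command.
    records: iterable of bytes objects. Returns the process's stdout
    as a single bytes record for the partition."""
    proc = subprocess.Popen(command, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)
    out, _ = proc.communicate(b"".join(records))
    return [out]

# 'cat' simply echoes the bytes back, showing nothing was text-mangled.
result = pipe_binary_partition([b"\x00\x01", b"\xff"], ["cat"])
```

For large partitions you would stream via writer/reader threads (as the real RDD.pipe does) rather than buffer everything with communicate, but the bytes-in/bytes-out shape is the key difference from the text-based pipe.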
Re: spark streaming with checkpoint
Thank you Jerry. Does the window operation create new RDDs for each slide duration? I am asking this because I see a constant increase in memory even when no logs are received. If not checkpointing, is there any alternative that you would suggest? On Tue, Jan 20, 2015 at 7:08 PM, Shao, Saisai saisai.s...@intel.com wrote: Hi, Seems you have such a large window (24 hours), so the phenomenon of memory increasing is expectable, because the window operation will cache the RDDs within this window in memory. So for your requirement, memory should be enough to hold the data of 24 hours. I don't think checkpointing in Spark Streaming can alleviate this problem, because checkpoints are mainly for fault tolerance. Thanks Jerry *From:* balu.naren [mailto:balu.na...@gmail.com] *Sent:* Tuesday, January 20, 2015 7:17 PM *To:* user@spark.apache.org *Subject:* spark streaming with checkpoint I am a beginner to Spark Streaming, so I have a basic doubt regarding checkpoints. My use case is to calculate the number of unique users per day. I am using reduce-by-key-and-window for this, where my window duration is 24 hours and slide duration is 5 mins. I am updating the processed record to mongodb. Currently I am replacing the existing record each time. But I see the memory is slowly increasing over time and it kills the process after 1 and 1/2 hours (in an aws small instance). The DB write after the restart clears all the old data. So I understand checkpointing is the solution for this. But my doubts are: - What should my checkpoint duration be? As per the documentation it says 5-10 times the slide duration. But I need the data of the entire day. So is it ok to keep 24 hrs? - Where ideally should the checkpoint be? Initially when I receive the stream, or just before the window operation, or after the data reduction has taken place. Appreciate your help.
Thank you -- View this message in context: spark streaming with checkpoint http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-with-checkpoint-tp21263.html Sent from the Apache Spark User List mailing list archive http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.
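To see why memory tracks the window length in this setup, here is a toy model of a reduce-by-key-and-window unique-user count in plain Python (no Spark; the data is made up). A 24-hour window with 5-minute slides means 24*60/5 = 288 slides' worth of state must stay cached at all times, which is what Jerry describes:

```python
from collections import deque

def unique_users_per_window(batches, window_slides):
    """For each new slide, keep the last `window_slides` per-slide user
    sets and union them. The deque holds the whole window in memory,
    mirroring how Spark caches the RDDs covered by the window."""
    window = deque(maxlen=window_slides)  # old slides evicted automatically
    counts = []
    for batch in batches:
        window.append(set(batch))
        counts.append(len(set().union(*window)))
    return counts

# A tiny 2-slide window for demonstration (the thread's setup would use 288):
counts = unique_users_per_window([["a", "b"], ["b", "c"], ["d"]], window_slides=2)
```

Checkpointing does not shrink this state; it only persists it for recovery, so the memory budget has to cover the full window regardless.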
RE: Are these numbers abnormal for spark streaming?
Hi Sudipta, Standalone spark master. Separate Zookeeper cluster. 4 worker nodes with cassandra + spark on each. No hadoop / hdfs / yarn. Regards, Ashic. Date: Thu, 22 Jan 2015 20:42:43 +0530 Subject: Re: Are these numbers abnormal for spark streaming? From: asudipta.baner...@gmail.com To: as...@live.com CC: gerard.m...@gmail.com; user@spark.apache.org; tathagata.das1...@gmail.com Hi Ashic Mahtab, The Cassandra and the Zookeeper: are they installed as part of the Yarn architecture, or in a separate layer with Apache Spark? Thanks and Regards, Sudipta On Thu, Jan 22, 2015 at 8:13 PM, Ashic Mahtab as...@live.com wrote: Hi Guys, So I changed the interval to 15 seconds. There's obviously a lot more messages per batch, but (I think) it looks a lot healthier. Can you see any major warning signs? I think that with 2 second intervals, the setup / teardown per partition was what was causing the delays. Streaming: Started at: Thu Jan 22 13:23:12 GMT 2015; Time since start: 1 hour 17 minutes 16 seconds; Network receivers: 2; Batch interval: 15 seconds; Processed batches: 309; Waiting batches: 0. Receiver statistics (over last 100 processed batches): RmqReceiver-0, ACTIVE, VDCAPP53.foo.local, records in last batch [2015/01/22 14:40:29]: 2.6 K, min/median/max rate [records/sec]: 29/106/295, last error: -; RmqReceiver-1, ACTIVE, VDCAPP50.bar.local, 2.6 K, 29/107/291, -. Batch processing statistics (last batch / minimum / 25th percentile / median / 75th percentile / maximum): Processing Time: 4s 812ms / 4s 698ms / 4s 738ms / 4s 761ms / 4s 788ms / 5s 802ms; Scheduling Delay: 2ms / 0ms / 3ms / 3ms / 4ms / 9ms; Total Delay: 4s 814ms / 4s 701ms / 4s 739ms / 4s 764ms / 4s 792ms / 5s 809ms. Regards, Ashic. From: as...@live.com To: gerard.m...@gmail.com CC: user@spark.apache.org Subject: RE: Are these numbers abnormal for spark streaming? Date: Thu, 22 Jan 2015 12:32:05 + Hi Gerard, Thanks for the response.
The messages get deserialised from msgpack format, and one of the strings is deserialised to json. Certain fields are checked to decide if further processing is required. If so, it goes through a series of in-mem filters to check if more processing is required. If so, only then does the heavy work start. That consists of a few db queries, and potential updates to the db + a message on a message queue. The majority of messages don't need processing. The messages needing processing at peak are about three every other second. One possible thing that might be happening is the session initialisation and prepared statement initialisation for each partition. I can resort to some tricks, but I think I'll try increasing the batch interval to 15 seconds. I'll report back with findings. Thanks, Ashic. From: gerard.m...@gmail.com Date: Thu, 22 Jan 2015 12:30:08 +0100 Subject: Re: Are these numbers abnormal for spark streaming? To: tathagata.das1...@gmail.com CC: as...@live.com; t...@databricks.com; user@spark.apache.org and post the code (if possible). In a nutshell, your processing time > batch interval, resulting in an ever-increasing delay that will end up in a crash. 3 secs to process 14 messages looks like a lot. Curious what the job logic is. -kr, Gerard. On Thu, Jan 22, 2015 at 12:15 PM, Tathagata Das tathagata.das1...@gmail.com wrote: This is not normal. It's a huge scheduling delay!! Can you tell me more about the application? - cluster setup, number of receivers, what's the computation, etc. On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab as...@live.com wrote: Hate to do this...but...erm...bump? Would really appreciate input from others using Streaming. Or at least some docs that would tell me if these are expected or not. From: as...@live.com To: user@spark.apache.org Subject: Are these numbers abnormal for spark streaming?
Date: Wed, 21 Jan 2015 11:26:31 + Hi Guys, I've got Spark Streaming set up for a low data rate system (using spark's features for analysis, rather than high throughput). Messages are coming in throughout the day, at around 1-20 per second (finger in the air estimate...not analysed yet). In the spark streaming UI for the application, I'm getting the following after 17 hours. Streaming: Started at: Tue Jan 20 16:58:43 GMT 2015; Time since start: 18 hours 24 minutes 34 seconds; Network receivers: 2; Batch interval: 2 seconds; Processed batches: 16482; Waiting batches: 1. Receiver statistics (over last 100 processed batches): RmqReceiver-0, ACTIVE, F, records in last batch [2015/01/21 11:23:18] and min/median/max rates: 144727-; RmqReceiver-1, ACTIVE, BR, 124726-. Batch processing statistics (last batch / minimum / 25th percentile / median / 75th percentile / maximum): Processing Time: 3s 994ms / 157ms / 4s 16ms / 4s 961ms / 5s 3ms / 5s 171ms; Scheduling Delay: 9 hours 15 minutes 4 seconds / 9 hours 10 minutes
Re: Are these numbers abnormal for spark streaming?
Given that the process, and in particular the setup of connections, is bound to the number of partitions (in x.foreachPartition{ x => ??? }), I think it would be worth trying to reduce them. Increasing 'spark.streaming.blockInterval' will do the trick (you can read the tuning details here: http://www.virdata.com/tuning-spark/#Partitions) -kr, Gerard. On Thu, Jan 22, 2015 at 4:28 PM, Gerard Maas gerard.m...@gmail.com wrote: So the system has gone from 7 msgs in 4.961 secs (median) to 106 msgs in 4.761 secs. I think there's evidence that setup costs are quite high in this case and increasing the batch interval is helping. On Thu, Jan 22, 2015 at 4:12 PM, Sudipta Banerjee asudipta.baner...@gmail.com wrote: Hi Ashic Mahtab, The Cassandra and the Zookeeper: are they installed as part of the Yarn architecture, or in a separate layer with Apache Spark? Thanks and Regards, Sudipta On Thu, Jan 22, 2015 at 8:13 PM, Ashic Mahtab as...@live.com wrote: Hi Guys, So I changed the interval to 15 seconds. There's obviously a lot more messages per batch, but (I think) it looks a lot healthier. Can you see any major warning signs? I think that with 2 second intervals, the setup / teardown per partition was what was causing the delays.
Streaming
- Started at: Thu Jan 22 13:23:12 GMT 2015
- Time since start: 1 hour 17 minutes 16 seconds
- Network receivers: 2
- Batch interval: 15 seconds
- Processed batches: 309
- Waiting batches: 0

Statistics over last 100 processed batches

Receiver Statistics (rates in records/sec)
- RmqReceiver-0: ACTIVE, VDCAPP53.foo.local, 2.6 K records in last batch [2015/01/22 14:40:29]; min 29, median 106, max 295; no errors
- RmqReceiver-1: ACTIVE, VDCAPP50.bar.local, 2.6 K records in last batch; min 29, median 107, max 291; no errors

Batch Processing Statistics (last batch / min / 25th pct / median / 75th pct / max)
- Processing Time: 4 s 812 ms / 4 s 698 ms / 4 s 738 ms / 4 s 761 ms / 4 s 788 ms / 5 s 802 ms
- Scheduling Delay: 2 ms / 0 ms / 3 ms / 3 ms / 4 ms / 9 ms
- Total Delay: 4 s 814 ms / 4 s 701 ms / 4 s 739 ms / 4 s 764 ms / 4 s 792 ms / 5 s 809 ms

Regards, Ashic. -- From: as...@live.com To: gerard.m...@gmail.com CC: user@spark.apache.org Subject: RE: Are these numbers abnormal for spark streaming? Date: Thu, 22 Jan 2015 12:32:05 +0000 Hi Gerard, Thanks for the response. The messages get deserialised from msgpack format, and one of the strings is deserialised to JSON. Certain fields are checked to decide if further processing is required. If so, it goes through a series of in-memory filters to check if more processing is required. If so, only then does the heavy work start. That consists of a few DB queries, and potential updates to the DB plus a message on a message queue. The majority of messages don't need processing. The messages needing processing at peak are about three every other second. One possible thing that might be happening is the session initialisation and prepared statement initialisation for each partition. I can resort to some tricks, but I think I'll try increasing the batch interval to 15 seconds. I'll report back with findings. Thanks, Ashic.
-- From: gerard.m...@gmail.com Date: Thu, 22 Jan 2015 12:30:08 +0100 Subject: Re: Are these numbers abnormal for spark streaming? To: tathagata.das1...@gmail.com CC: as...@live.com; t...@databricks.com; user@spark.apache.org ... and post the code (if possible). In a nutshell, your processing time > batch interval, resulting in an ever-increasing delay that will end up in a crash. 3 secs to process 14 messages looks like a lot. Curious what the job logic is. -kr, Gerard. On Thu, Jan 22, 2015 at 12:15 PM, Tathagata Das tathagata.das1...@gmail.com wrote: This is not normal. It's a huge scheduling delay!! Can you tell me more about the application? - cluster setup, number of receivers, what's the computation, etc. On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab as...@live.com wrote: Hate to do this...but...erm...bump? Would really appreciate input from others using Streaming. Or at least some docs that would tell me if these are expected or not. -- From: as...@live.com To: user@spark.apache.org Subject: Are these numbers abnormal for spark streaming? Date: Wed, 21 Jan 2015 11:26:31 +0000 Hi Guys, I've got Spark Streaming set up for a low data rate system (using Spark's features for analysis, rather than high throughput). Messages are coming in throughout the day, at around 1-20 per second (finger-in-the-air estimate... not analysed yet). In the Spark Streaming UI for the application, I'm getting the following after 17 hours.
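Gerard's point above, that connection setup cost is bound to the number of partitions in x.foreachPartition, can be sketched in plain Python with no Spark dependency. The names (open_connection, handle_partition) are hypothetical stand-ins, and partitions are simulated as plain lists so the pattern is runnable anywhere:

```python
# Sketch (not the poster's actual code): why per-partition setup cost scales
# with the number of partitions. In Spark you'd write
# rdd.foreachPartition(handle_partition); here partitions are plain lists.

SETUP_CALLS = {"count": 0}

def open_connection():
    # Stand-in for an expensive resource (DB session, prepared statements, ...)
    SETUP_CALLS["count"] += 1
    return object()

def handle_partition(records):
    conn = open_connection()   # one setup per partition, not per record
    for r in records:
        pass                   # process r using conn
    # conn would be closed/returned to a pool here

def run(partitions):
    for p in partitions:
        handle_partition(p)

data = list(range(100))
# 20 partitions of 5 records -> 20 connection setups
run([data[i:i + 5] for i in range(0, 100, 5)])
many = SETUP_CALLS["count"]
SETUP_CALLS["count"] = 0
# 4 partitions of 25 records -> 4 connection setups for the same data
run([data[i:i + 25] for i in range(0, 100, 25)])
few = SETUP_CALLS["count"]
```

With fewer partitions (or a larger batch interval producing fewer, fuller partitions), the same records pay far fewer setups, which is consistent with the improvement Ashic observed.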
RE: How to 'Pipe' Binary Data in Apache Spark
How much time would it take to port it? Spark committers: please let us know your thoughts. Regards, Venkat From: Frank Austin Nothaft [mailto:fnoth...@berkeley.edu] Sent: Thursday, January 22, 2015 9:08 AM To: Venkat, Ankam Cc: Nick Allen; user@spark.apache.org Subject: Re: How to 'Pipe' Binary Data in Apache Spark Venkat, No problem! So, creating a custom InputFormat or using sc.binaryFiles alone is not the right solution. We also need the modified version of RDD.pipe to support binary data? Is my understanding correct? Yep! That is correct. The custom InputFormat allows Spark to load binary-formatted data from disk/HDFS/S3/etc., but the default RDD.pipe reads/writes text to a pipe, so you'd need the custom mapPartitions call. If yes, can this be added as a new enhancement Jira request? The code that I have right now is fairly custom to my application, but if there was interest, I would be glad to port it for Spark core. Regards, Frank Austin Nothaft fnoth...@berkeley.edu fnoth...@eecs.berkeley.edu 202-340-0466 On Jan 22, 2015, at 7:11 AM, Venkat, Ankam ankam.ven...@centurylink.com wrote: Thanks Frank for your response. So, creating a custom InputFormat or using sc.binaryFiles alone is not the right solution. We also need the modified version of RDD.pipe to support binary data? Is my understanding correct? If yes, can this be added as a new enhancement Jira request? Nick: What's your take on this? Regards, Venkat Ankam From: Frank Austin Nothaft [mailto:fnoth...@berkeley.edu] Sent: Wednesday, January 21, 2015 12:30 PM To: Venkat, Ankam Cc: Nick Allen; user@spark.apache.org Subject: Re: How to 'Pipe' Binary Data in Apache Spark Hi Venkat/Nick, The Spark RDD.pipe method pipes text data into a subprocess and then receives text data back from that process.
Once you have the binary data loaded into an RDD properly, to pipe binary data to/from a subprocess (e.g., you want the data in the pipes to contain binary, not text), you need to implement your own, modified version of RDD.pipe. The implementation of RDD.pipe (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PipedRDD.scala) spawns a process per partition (IIRC), as well as threads for writing to and reading from the process (plus one for the process's stderr). When writing via RDD.pipe, Spark calls *.toString on each object and pushes that text representation down the pipe. There is an example of how to pipe binary data from within a mapPartitions call using the Scala API in lines 107-177 of this file: https://github.com/bigdatagenomics/avocado/blob/master/avocado-core/src/main/scala/org/bdgenomics/avocado/genotyping/ExternalGenotyper.scala. This specific code contains some nastiness around the packaging of downstream libraries that we rely on in that project, so I'm not sure if it is the cleanest way, but it is a workable way. Regards, Frank Austin Nothaft fnoth...@berkeley.edu fnoth...@eecs.berkeley.edu 202-340-0466 On Jan 21, 2015, at 9:17 AM, Venkat, Ankam ankam.ven...@centurylink.com wrote: I am trying to solve a similar problem. I am using option #2 as suggested by Nick. I have created an RDD with sc.binaryFiles for a list of .wav files, but I am not able to pipe it to the external programs. For example:

sq = sc.binaryFiles("wavfiles")  # all .wav files stored in the 'wavfiles' directory on HDFS
sq.keys().collect()    # works fine; shows the list of file names
sq.values().collect()  # works fine; shows the content of the files
sq.values().pipe(lambda x: subprocess.call(['/usr/local/bin/sox', '-t' 'wav', '-', '-n', 'stats'])).collect()  # does not work; tried different options
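Frank's modified-pipe idea can be sketched in Python without Spark: inside what would be a mapPartitions call, feed raw bytes to the subprocess's stdin and collect raw bytes from stdout, instead of calling toString on each element as the stock PipedRDD does. pipe_binary is a hypothetical helper, and 'cat' stands in for a real binary tool such as sox:

```python
import subprocess

def pipe_binary(records, cmd):
    # Stand-in for a binary-safe RDD.pipe: write the partition's raw bytes to
    # the subprocess's stdin and return the raw bytes from its stdout.
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    out, _ = proc.communicate(b"".join(records))
    return [out]

# In Spark this would be roughly:
#   rdd.mapPartitions(lambda part: pipe_binary(part, cmd))
# 'cat' is a placeholder command that echoes its binary input unchanged.
result = pipe_binary([b"\x00\x01", b"\xff\xfe"], ["cat"])
```

Note that bytes like 0x00 and 0xFF round-trip intact here, which is exactly what toString-based piping cannot guarantee.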
AttributeError: 'function' object has no attribute 'read' Any suggestions? Regards, Venkat Ankam From: Nick Allen [mailto:n...@nickallen.org] Sent: Friday, January 16, 2015 11:46 AM To: user@spark.apache.org Subject: Re: How to 'Pipe' Binary Data in Apache Spark I just wanted to reiterate the solution for the benefit of the community. The problem is not my use of 'pipe', but that 'textFile' cannot be used to read in binary data. (Doh) There are a couple of options to move forward. 1. Implement a custom 'InputFormat' that understands the binary input data. (Per Sean Owen) 2. Use 'SparkContext.binaryFiles' to read in the entire binary file as a single record. This will impact performance as it prevents the use of more than one mapper on the file's data. In my specific case, for #1 I can only find one project, from RIPE-NCC (https://github.com/RIPE-NCC/hadoop-pcap), that does this. Unfortunately, it appears to support only a limited set of network protocols. On Fri, Jan 16, 2015 at 10:40 AM, Nick Allen
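The AttributeError above follows from passing a Python lambda to RDD.pipe: pipe expects a command string (it writes each element to that command's stdin as text), not a callable. One hedged alternative, assuming the per-file bytes fit in memory, is to invoke the tool yourself per record and pass the bytes on stdin. run_tool is a hypothetical helper, and 'wc -c' is used as a stand-in since sox may not be installed:

```python
import subprocess

def run_tool(wav_bytes, cmd):
    # RDD.pipe wants a command *string* and is text-oriented; for binary input,
    # run the external tool directly with the bytes on stdin. In Spark this
    # would be roughly: sq.values().map(lambda b: run_tool(b, cmd)).collect()
    done = subprocess.run(cmd, input=wav_bytes, capture_output=True, check=True)
    return done.stdout

# Placeholder command; the poster's real one would be
# ['/usr/local/bin/sox', '-t', 'wav', '-', '-n', 'stats']. Note there is a
# comma missing between '-t' and 'wav' in the original snippet, so Python
# silently concatenated them into the single argument '-twav'.
out = run_tool(b"abc", ["wc", "-c"])
```

This sidesteps pipe's text assumption entirely, at the cost of one subprocess per record rather than one per partition.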
Large dataset, reduceByKey - java heap space error
I'm trying to process a large dataset; mapping/filtering works OK, but as soon as I try to reduceByKey, I get out-of-memory errors: http://pastebin.com/70M5d0Bn Any ideas how I can fix that? Thanks. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
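A common mitigation (an assumption here, since the pastebin contents aren't shown) is to raise the number of reduce partitions so each reduce task holds fewer keys in memory at once. The sketch below shows, in plain Python, the per-key fold that each reduceByKey task performs, which is why heap use scales with the distinct keys per partition; the numPartitions value in the comment is purely illustrative:

```python
# What each reduceByKey task does: fold values into a per-key accumulator.
# Memory per task scales with the number of distinct keys in that task's
# partition, so the usual first fix for heap errors is more partitions, e.g.:
#   rdd.reduceByKey(lambda a, b: a + b, numPartitions=400)  # value illustrative
def reduce_by_key(pairs, func):
    acc = {}
    for k, v in pairs:
        acc[k] = func(acc[k], v) if k in acc else v
    return acc

counts = reduce_by_key([("a", 1), ("b", 2), ("a", 3)], lambda a, b: a + b)
```

Other knobs worth checking are executor memory and whether the map-side output is heavily skewed toward a few keys.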
Re: PySpark Client
Hey Andrew, Thanks for the response. Is this the issue you're referring to (the duplicate linked there has an associated patch): https://issues.apache.org/jira/browse/SPARK-5162 ? Just to confirm that I understand this: with this patch, Python jobs can be submitted to YARN, and a node from the cluster will act as the driver, meaning that the Python version of the submission client vs. the cluster shouldn't be an issue? Thanks, Chris On Tue, Jan 20, 2015 at 10:34 AM, Andrew Or and...@databricks.com wrote: Hi Chris, Short answer is no, not yet. Longer answer is that PySpark only supports client mode, which means your driver runs on the same machine as your submission client. By corollary this means your submission client must currently depend on all of Spark and its dependencies. There is a patch that supports this for *cluster* mode (as opposed to client mode), which would be the first step towards what you want. -Andrew 2015-01-20 8:36 GMT-08:00 Chris Beavers cbeav...@trifacta.com: Hey all, Is there any notion of a lightweight Python client for submitting jobs to a Spark cluster remotely? If I essentially install Spark on the client machine, and that machine has the same OS, same version of Python, etc., then I'm able to communicate with the cluster just fine. But if Python versions differ slightly, then I start to see a lot of opaque errors that often bubble up as EOFExceptions. Furthermore, this just seems like a very heavyweight way to set up a client. Does anyone have any suggestions for setting up a thin pyspark client on a node which doesn't necessarily conform to the homogeneity of the target Spark cluster? Best, Chris
RE: Are these numbers abnormal for spark streaming?
Hi Guys, So I changed the interval to 15 seconds. There's obviously a lot more messages per batch, but (I think) it looks a lot healthier. Can you see any major warning signs? I think that with 2 second intervals, the setup / teardown per partition was what was causing the delays.

Streaming
- Started at: Thu Jan 22 13:23:12 GMT 2015
- Time since start: 1 hour 17 minutes 16 seconds
- Network receivers: 2
- Batch interval: 15 seconds
- Processed batches: 309
- Waiting batches: 0

Statistics over last 100 processed batches

Receiver Statistics (rates in records/sec)
- RmqReceiver-0: ACTIVE, VDCAPP53.foo.local, 2.6 K records in last batch [2015/01/22 14:40:29]; min 29, median 106, max 295; no errors
- RmqReceiver-1: ACTIVE, VDCAPP50.bar.local, 2.6 K records in last batch; min 29, median 107, max 291; no errors

Batch Processing Statistics (last batch / min / 25th pct / median / 75th pct / max)
- Processing Time: 4 s 812 ms / 4 s 698 ms / 4 s 738 ms / 4 s 761 ms / 4 s 788 ms / 5 s 802 ms
- Scheduling Delay: 2 ms / 0 ms / 3 ms / 3 ms / 4 ms / 9 ms
- Total Delay: 4 s 814 ms / 4 s 701 ms / 4 s 739 ms / 4 s 764 ms / 4 s 792 ms / 5 s 809 ms

Regards, Ashic. From: as...@live.com To: gerard.m...@gmail.com CC: user@spark.apache.org Subject: RE: Are these numbers abnormal for spark streaming? Date: Thu, 22 Jan 2015 12:32:05 +0000 Hi Gerard, Thanks for the response. The messages get deserialised from msgpack format, and one of the strings is deserialised to JSON. Certain fields are checked to decide if further processing is required. If so, it goes through a series of in-memory filters to check if more processing is required. If so, only then does the heavy work start. That consists of a few DB queries, and potential updates to the DB plus a message on a message queue. The majority of messages don't need processing. The messages needing processing at peak are about three every other second.
One possible thing that might be happening is the session initialisation and prepared statement initialisation for each partition. I can resort to some tricks, but I think I'll try increasing the batch interval to 15 seconds. I'll report back with findings. Thanks, Ashic. From: gerard.m...@gmail.com Date: Thu, 22 Jan 2015 12:30:08 +0100 Subject: Re: Are these numbers abnormal for spark streaming? To: tathagata.das1...@gmail.com CC: as...@live.com; t...@databricks.com; user@spark.apache.org ... and post the code (if possible). In a nutshell, your processing time > batch interval, resulting in an ever-increasing delay that will end up in a crash. 3 secs to process 14 messages looks like a lot. Curious what the job logic is. -kr, Gerard. On Thu, Jan 22, 2015 at 12:15 PM, Tathagata Das tathagata.das1...@gmail.com wrote: This is not normal. It's a huge scheduling delay!! Can you tell me more about the application? - cluster setup, number of receivers, what's the computation, etc. On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab as...@live.com wrote: Hate to do this...but...erm...bump? Would really appreciate input from others using Streaming. Or at least some docs that would tell me if these are expected or not. From: as...@live.com To: user@spark.apache.org Subject: Are these numbers abnormal for spark streaming? Date: Wed, 21 Jan 2015 11:26:31 +0000 Hi Guys, I've got Spark Streaming set up for a low data rate system (using Spark's features for analysis, rather than high throughput). Messages are coming in throughout the day, at around 1-20 per second (finger-in-the-air estimate... not analysed yet). In the Spark Streaming UI for the application, I'm getting the following after 17 hours.
Streaming
- Started at: Tue Jan 20 16:58:43 GMT 2015
- Time since start: 18 hours 24 minutes 34 seconds
- Network receivers: 2
- Batch interval: 2 seconds
- Processed batches: 16482
- Waiting batches: 1

Statistics over last 100 processed batches

Receiver Statistics (rates in records/sec)
- RmqReceiver-0: ACTIVE, location F, 14 records in last batch [2015/01/21 11:23:18]; min 4, median 7, max 27; no errors
- RmqReceiver-1: ACTIVE, location BR, 12 records in last batch; min 4, median 7, max 26; no errors

Batch Processing Statistics (last batch / min / 25th pct / median / 75th pct / max)
- Processing Time: 3 s 994 ms / 157 ms / 4 s 16 ms / 4 s 961 ms / 5 s 3 ms / 5 s 171 ms
- Scheduling Delay: 9 h 15 m 4 s / 9 h 10 m 54 s / 9 h 11 m 56 s / 9 h 12 m 57 s / 9 h 14 m 5 s / 9 h 15 m 4 s
- Total Delay: 9 h 15 m 8 s / 9 h 10 m 58 s / 9 h 12 m / 9 h 13 m 2 s / 9 h 14 m 10 s / 9 h 15 m 8 s

Are these normal? I was wondering what the scheduling delay and total delay terms are, and if it's normal for them to be 9 hours. I've got a standalone Spark master and 4 Spark nodes. The streaming app has been given 4 cores, and it's using 1 core per worker node. The streaming app is submitted from a 5th machine, and that machine has nothing
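On what the delay terms mean: total delay is scheduling delay (how long the batch waited before processing started) plus processing time. The last-batch numbers in the UI dump above check out, which is a quick way to confirm the reading:

```python
# Sanity check from the UI numbers above: Total Delay = Scheduling Delay +
# Processing Time. Last batch: scheduling delay 9 h 15 m 4 s, processing
# time 3 s 994 ms, total delay 9 h 15 m 8 s.
sched = 9 * 3600 + 15 * 60 + 4   # scheduling delay in seconds
proc = 3.994                     # processing time in seconds
total = 9 * 3600 + 15 * 60 + 8   # total delay in seconds
```

So the 9-hour figure is not a measurement glitch: every new batch is queuing behind a backlog that has been growing since startup.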
Re: Spark Team - Paco Nathan said that your team can help
Hi Marco, Thanks for the confirmation. Please let me know what is the "lot more detail" you need to answer a very specific question: WHAT IS THE MINIMUM HARDWARE CONFIGURATION REQUIRED TO BUILD HDFS + MAPREDUCE + SPARK + YARN on a system? Please let me know if you need any further information, and if you don't know, please drive across with the $1 to Sir Paco Nathan and get me the answer. Thanks and Regards, Sudipta On Thu, Jan 22, 2015 at 5:33 PM, Marco Shaw marco.s...@gmail.com wrote: Hi, Let me reword your request so you understand how (too) generic your question is: "Hi, I have $10,000, please find me some means of transportation so I can get to work." Please provide (a lot) more details. If you can't, consider using one of the pre-built express VMs from Cloudera, Hortonworks or MapR, for example. Marco On Jan 22, 2015, at 7:36 AM, Sudipta Banerjee asudipta.baner...@gmail.com wrote: Hi Apache Spark team, What are the system requirements for installing Hadoop and Apache Spark? I have attached a screenshot of GParted. Thanks and regards, Sudipta -- Sudipta Banerjee Consultant, Business Analytics and Cloud Based Architecture Call me +919019578099 Screenshot - Wednesday 21 January 2015 - 10:55:29 IST.png -- Sudipta Banerjee Consultant, Business Analytics and Cloud Based Architecture Call me +919019578099
Re: Are these numbers abnormal for spark streaming?
So the system has gone from 7 msgs in 4.961 secs (median) to 106 msgs in 4.761 seconds. I think there's evidence that setup costs are quite high in this case and increasing the batch interval is helping. On Thu, Jan 22, 2015 at 4:12 PM, Sudipta Banerjee asudipta.baner...@gmail.com wrote: Hi Ashic Mahtab, Are Cassandra and Zookeeper installed as a part of the YARN architecture, or are they installed in a separate layer with Apache Spark? Thanks and Regards, Sudipta On Thu, Jan 22, 2015 at 8:13 PM, Ashic Mahtab as...@live.com wrote: Hi Guys, So I changed the interval to 15 seconds. There's obviously a lot more messages per batch, but (I think) it looks a lot healthier. Can you see any major warning signs? I think that with 2 second intervals, the setup / teardown per partition was what was causing the delays.

Streaming
- Started at: Thu Jan 22 13:23:12 GMT 2015
- Time since start: 1 hour 17 minutes 16 seconds
- Network receivers: 2
- Batch interval: 15 seconds
- Processed batches: 309
- Waiting batches: 0

Statistics over last 100 processed batches

Receiver Statistics (rates in records/sec)
- RmqReceiver-0: ACTIVE, VDCAPP53.foo.local, 2.6 K records in last batch [2015/01/22 14:40:29]; min 29, median 106, max 295; no errors
- RmqReceiver-1: ACTIVE, VDCAPP50.bar.local, 2.6 K records in last batch; min 29, median 107, max 291; no errors

Batch Processing Statistics (last batch / min / 25th pct / median / 75th pct / max)
- Processing Time: 4 s 812 ms / 4 s 698 ms / 4 s 738 ms / 4 s 761 ms / 4 s 788 ms / 5 s 802 ms
- Scheduling Delay: 2 ms / 0 ms / 3 ms / 3 ms / 4 ms / 9 ms
- Total Delay: 4 s 814 ms / 4 s 701 ms / 4 s 739 ms / 4 s 764 ms / 4 s 792 ms / 5 s 809 ms

Regards, Ashic. -- From: as...@live.com To: gerard.m...@gmail.com CC: user@spark.apache.org Subject: RE: Are these numbers abnormal for spark streaming? Date: Thu, 22 Jan 2015 12:32:05 +0000 Hi Gerard, Thanks for the response.
The messages get deserialised from msgpack format, and one of the strings is deserialised to JSON. Certain fields are checked to decide if further processing is required. If so, it goes through a series of in-memory filters to check if more processing is required. If so, only then does the heavy work start. That consists of a few DB queries, and potential updates to the DB plus a message on a message queue. The majority of messages don't need processing. The messages needing processing at peak are about three every other second. One possible thing that might be happening is the session initialisation and prepared statement initialisation for each partition. I can resort to some tricks, but I think I'll try increasing the batch interval to 15 seconds. I'll report back with findings. Thanks, Ashic. -- From: gerard.m...@gmail.com Date: Thu, 22 Jan 2015 12:30:08 +0100 Subject: Re: Are these numbers abnormal for spark streaming? To: tathagata.das1...@gmail.com CC: as...@live.com; t...@databricks.com; user@spark.apache.org ... and post the code (if possible). In a nutshell, your processing time > batch interval, resulting in an ever-increasing delay that will end up in a crash. 3 secs to process 14 messages looks like a lot. Curious what the job logic is. -kr, Gerard. On Thu, Jan 22, 2015 at 12:15 PM, Tathagata Das tathagata.das1...@gmail.com wrote: This is not normal. It's a huge scheduling delay!! Can you tell me more about the application? - cluster setup, number of receivers, what's the computation, etc. On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab as...@live.com wrote: Hate to do this...but...erm...bump? Would really appreciate input from others using Streaming. Or at least some docs that would tell me if these are expected or not. -- From: as...@live.com To: user@spark.apache.org Subject: Are these numbers abnormal for spark streaming?
Date: Wed, 21 Jan 2015 11:26:31 +0000 Hi Guys, I've got Spark Streaming set up for a low data rate system (using Spark's features for analysis, rather than high throughput). Messages are coming in throughout the day, at around 1-20 per second (finger-in-the-air estimate... not analysed yet). In the Spark Streaming UI for the application, I'm getting the following after 17 hours.

Streaming
- Started at: Tue Jan 20 16:58:43 GMT 2015
- Time since start: 18 hours 24 minutes 34 seconds
- Network receivers: 2
- Batch interval: 2 seconds
- Processed batches: 16482
- Waiting batches: 1

Statistics over last 100 processed batches

Receiver Statistics (rates in records/sec)
- RmqReceiver-0: ACTIVE, location F, 14 records in last batch [2015/01/21 11:23:18]; min 4, median 7, max 27
RE: Are these numbers abnormal for spark streaming?
Yup...looks like it. I can do some tricks to reduce setup costs further, but this is much better than where I was yesterday. Thanks for your awesome input :) -Ashic. From: gerard.m...@gmail.com Date: Thu, 22 Jan 2015 16:34:38 +0100 Subject: Re: Are these numbers abnormal for spark streaming? To: asudipta.baner...@gmail.com CC: as...@live.com; user@spark.apache.org; tathagata.das1...@gmail.com Given that the process, and in particular the setup of connections, is bound to the number of partitions (in x.foreachPartition{ x => ??? }), I think it would be worth trying to reduce them. Increasing 'spark.streaming.blockInterval' will do the trick (you can read the tuning details here: http://www.virdata.com/tuning-spark/#Partitions) -kr, Gerard. On Thu, Jan 22, 2015 at 4:28 PM, Gerard Maas gerard.m...@gmail.com wrote: So the system has gone from 7 msgs in 4.961 secs (median) to 106 msgs in 4.761 seconds. I think there's evidence that setup costs are quite high in this case and increasing the batch interval is helping. On Thu, Jan 22, 2015 at 4:12 PM, Sudipta Banerjee asudipta.baner...@gmail.com wrote: Hi Ashic Mahtab, Are Cassandra and Zookeeper installed as a part of the YARN architecture, or are they installed in a separate layer with Apache Spark? Thanks and Regards, Sudipta On Thu, Jan 22, 2015 at 8:13 PM, Ashic Mahtab as...@live.com wrote: Hi Guys, So I changed the interval to 15 seconds. There's obviously a lot more messages per batch, but (I think) it looks a lot healthier. Can you see any major warning signs? I think that with 2 second intervals, the setup / teardown per partition was what was causing the delays.
Streaming
- Started at: Thu Jan 22 13:23:12 GMT 2015
- Time since start: 1 hour 17 minutes 16 seconds
- Network receivers: 2
- Batch interval: 15 seconds
- Processed batches: 309
- Waiting batches: 0

Statistics over last 100 processed batches

Receiver Statistics (rates in records/sec)
- RmqReceiver-0: ACTIVE, VDCAPP53.foo.local, 2.6 K records in last batch [2015/01/22 14:40:29]; min 29, median 106, max 295; no errors
- RmqReceiver-1: ACTIVE, VDCAPP50.bar.local, 2.6 K records in last batch; min 29, median 107, max 291; no errors

Batch Processing Statistics (last batch / min / 25th pct / median / 75th pct / max)
- Processing Time: 4 s 812 ms / 4 s 698 ms / 4 s 738 ms / 4 s 761 ms / 4 s 788 ms / 5 s 802 ms
- Scheduling Delay: 2 ms / 0 ms / 3 ms / 3 ms / 4 ms / 9 ms
- Total Delay: 4 s 814 ms / 4 s 701 ms / 4 s 739 ms / 4 s 764 ms / 4 s 792 ms / 5 s 809 ms

Regards, Ashic. From: as...@live.com To: gerard.m...@gmail.com CC: user@spark.apache.org Subject: RE: Are these numbers abnormal for spark streaming? Date: Thu, 22 Jan 2015 12:32:05 +0000 Hi Gerard, Thanks for the response. The messages get deserialised from msgpack format, and one of the strings is deserialised to JSON. Certain fields are checked to decide if further processing is required. If so, it goes through a series of in-memory filters to check if more processing is required. If so, only then does the heavy work start. That consists of a few DB queries, and potential updates to the DB plus a message on a message queue. The majority of messages don't need processing. The messages needing processing at peak are about three every other second. One possible thing that might be happening is the session initialisation and prepared statement initialisation for each partition. I can resort to some tricks, but I think I'll try increasing the batch interval to 15 seconds. I'll report back with findings. Thanks, Ashic. From: gerard.m...@gmail.com Date: Thu, 22 Jan 2015 12:30:08 +0100 Subject: Re: Are these numbers abnormal for spark streaming?
To: tathagata.das1...@gmail.com CC: as...@live.com; t...@databricks.com; user@spark.apache.org ... and post the code (if possible). In a nutshell, your processing time > batch interval, resulting in an ever-increasing delay that will end up in a crash. 3 secs to process 14 messages looks like a lot. Curious what the job logic is. -kr, Gerard. On Thu, Jan 22, 2015 at 12:15 PM, Tathagata Das tathagata.das1...@gmail.com wrote: This is not normal. It's a huge scheduling delay!! Can you tell me more about the application? - cluster setup, number of receivers, what's the computation, etc. On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab as...@live.com wrote: Hate to do this...but...erm...bump? Would really appreciate input from others using Streaming. Or at least some docs that would tell me if these are expected or not. From: as...@live.com To: user@spark.apache.org Subject: Are these numbers abnormal for spark streaming? Date: Wed, 21 Jan 2015 11:26:31 +0000 Hi Guys, I've got Spark Streaming set up for a low data rate system (using Spark's features for analysis, rather than high throughput). Messages are coming in throughout the day, at around 1-20 per second (finger-in-the-air estimate... not analysed yet). In the Spark Streaming UI for the application, I'm getting the following
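The whole thread boils down to one stability condition: a streaming job keeps up only if each batch is processed faster than batches arrive. With the numbers reported here (roughly fixed ~4.7 s of per-batch setup cost), a 2 s interval can never keep up, while 15 s can. A back-of-envelope sketch, using the thread's own figures:

```python
# Stability condition for a receiver-based streaming job: if median processing
# time exceeds the batch interval, scheduling delay grows without bound.
def stable(batch_interval_s, processing_time_s):
    return processing_time_s < batch_interval_s

# 2 s interval, ~4.7 s median processing time (mostly fixed per-partition
# setup): every batch falls ~2.7 s further behind, which over 17 hours is
# the observed 9+ hour scheduling delay.
# 15 s interval, ~4.76 s processing time: the same setup cost is amortised
# over ~7.5x more messages, and scheduling delay stays at the ms level.
```

This is why raising the interval fixed the delay even though the per-batch processing time barely changed.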
RE: Spark Team - Paco Nathan said that your team can help
Sudipta, Use the Docker image [1] and play around with Hadoop and Spark in the VM for a while. Decide on your use case(s), and then you can move ahead with installing on a cluster, etc. This Docker image has all you want [HDFS + MapReduce + Spark + YARN]. All the best! [1]: https://github.com/sequenceiq/docker-spark From: Sudipta Banerjee [mailto:asudipta.baner...@gmail.com] Sent: 22 January 2015 14:51 To: Marco Shaw Cc: user@spark.apache.org Subject: Re: Spark Team - Paco Nathan said that your team can help Hi Marco, Thanks for the confirmation. Please let me know what is the "lot more detail" you need to answer a very specific question: WHAT IS THE MINIMUM HARDWARE CONFIGURATION REQUIRED TO BUILD HDFS + MAPREDUCE + SPARK + YARN on a system? Please let me know if you need any further information, and if you don't know, please drive across with the $1 to Sir Paco Nathan and get me the answer. Thanks and Regards, Sudipta On Thu, Jan 22, 2015 at 5:33 PM, Marco Shaw marco.s...@gmail.com wrote: Hi, Let me reword your request so you understand how (too) generic your question is: "Hi, I have $10,000, please find me some means of transportation so I can get to work." Please provide (a lot) more details. If you can't, consider using one of the pre-built express VMs from Cloudera, Hortonworks or MapR, for example. Marco On Jan 22, 2015, at 7:36 AM, Sudipta Banerjee asudipta.baner...@gmail.com wrote: Hi Apache Spark team, What are the system requirements for installing Hadoop and Apache Spark? I have attached a screenshot of GParted.
Thanks and regards, Sudipta -- Sudipta Banerjee Consultant, Business Analytics and Cloud Based Architecture Call me +919019578099 Screenshot - Wednesday 21 January 2015 - 10:55:29 IST.png -- Sudipta Banerjee Consultant, Business Analytics and Cloud Based Architecture Call me +919019578099 __ Disclaimer: This email and any attachments are sent in strictest confidence for the sole use of the addressee and may contain legally privileged, confidential, and proprietary data. If you are not the intended recipient, please advise the sender by replying promptly to this email and then delete and destroy this email and any attachments without any further use, copying or forwarding.
RE: How to 'Pipe' Binary Data in Apache Spark
Thanks Frank for your response. So, creating a custom InputFormat or using sc.binaryFiles alone is not the right solution. We also need the modified version of RDD.pipe to support binary data? Is my understanding correct? If yes, can this be added as a new enhancement Jira request? Nick: What's your take on this? Regards, Venkat Ankam From: Frank Austin Nothaft [mailto:fnoth...@berkeley.edu] Sent: Wednesday, January 21, 2015 12:30 PM To: Venkat, Ankam Cc: Nick Allen; user@spark.apache.org Subject: Re: How to 'Pipe' Binary Data in Apache Spark Hi Venkat/Nick, The Spark RDD.pipe method pipes text data into a subprocess and then receives text data back from that process. Once you have the binary data loaded into an RDD properly, to pipe binary data to/from a subprocess (e.g., you want the data in the pipes to contain binary, not text), you need to implement your own, modified version of RDD.pipe. The implementation of RDD.pipe (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PipedRDD.scala) spawns a process per partition (IIRC), as well as threads for writing to and reading from the process (plus one for the process's stderr). When writing via RDD.pipe, Spark calls *.toString on each object and pushes that text representation down the pipe. There is an example of how to pipe binary data from within a mapPartitions call using the Scala API in lines 107-177 of this file: https://github.com/bigdatagenomics/avocado/blob/master/avocado-core/src/main/scala/org/bdgenomics/avocado/genotyping/ExternalGenotyper.scala. This specific code contains some nastiness around the packaging of downstream libraries that we rely on in that project, so I'm not sure if it is the cleanest way, but it is a workable way.
Regards, Frank Austin Nothaft fnoth...@berkeley.edu fnoth...@eecs.berkeley.edu 202-340-0466 On Jan 21, 2015, at 9:17 AM, Venkat, Ankam ankam.ven...@centurylink.com wrote: I am trying to solve a similar problem. I am using option #2 as suggested by Nick. I have created an RDD with sc.binaryFiles for a list of .wav files, but I am not able to pipe it to the external programs. For example:

sq = sc.binaryFiles("wavfiles")  # all .wav files stored in the 'wavfiles' directory on HDFS
sq.keys().collect()    # works fine; shows the list of file names
sq.values().collect()  # works fine; shows the content of the files
sq.values().pipe(lambda x: subprocess.call(['/usr/local/bin/sox', '-t' 'wav', '-', '-n', 'stats'])).collect()  # does not work; tried different options

AttributeError: 'function' object has no attribute 'read' Any suggestions? Regards, Venkat Ankam From: Nick Allen [mailto:n...@nickallen.org] Sent: Friday, January 16, 2015 11:46 AM To: user@spark.apache.org Subject: Re: How to 'Pipe' Binary Data in Apache Spark I just wanted to reiterate the solution for the benefit of the community. The problem is not my use of 'pipe', but that 'textFile' cannot be used to read in binary data. (Doh) There are a couple of options to move forward. 1. Implement a custom 'InputFormat' that understands the binary input data. (Per Sean Owen) 2. Use 'SparkContext.binaryFiles' to read in the entire binary file as a single record. This will impact performance as it prevents the use of more than one mapper on the file's data. In my specific case, for #1 I can only find one project, from RIPE-NCC (https://github.com/RIPE-NCC/hadoop-pcap), that does this. Unfortunately, it appears to support only a limited set of network protocols.
On Fri, Jan 16, 2015 at 10:40 AM, Nick Allen n...@nickallen.org wrote: Per your last comment, it appears I need something like this: https://github.com/RIPE-NCC/hadoop-pcap Thanks a ton. That gets me oriented in the right direction. On Fri, Jan 16, 2015 at 10:20 AM, Sean Owen so...@cloudera.com wrote: Well, it looks like you're reading some kind of binary file as text. That isn't going to work, in Spark or elsewhere, as binary data is not even necessarily a valid encoding of a string, and there are no line breaks to delimit lines and thus elements of the RDD. Your input has some record structure (or else it's not really useful to put it into an RDD). You can encode this as a SequenceFile and read it with objectFile. You could also write a custom InputFormat that knows how to parse pcap records directly. On Fri, Jan 16, 2015 at 3:09 PM, Nick Allen n...@nickallen.org wrote: I have an RDD containing binary data. I would like to use 'RDD.pipe' to pipe that binary data to an external program that will translate it to string/text data. Unfortunately, it seems that Spark is mangling the binary data before it gets passed to the external program. This code is representative of what I am trying to do. What am I doing wrong?
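Frank's point above, that RDD.pipe is text-only and binary data therefore has to be pushed through the subprocess by hand (e.g., from inside mapPartitions), can be sketched in plain Python. The helper below is only an illustration of the technique, not Spark API; the byte-reversing subprocess merely stands in for a real external tool such as sox.

```python
import subprocess
import sys

def pipe_binary(records, command):
    """Pipe each binary record through an external command and collect the
    raw bytes it writes back. This is the kind of helper one would call from
    mapPartitions, since RDD.pipe only handles newline-delimited text."""
    out = []
    for payload in records:
        proc = subprocess.Popen(command, stdin=subprocess.PIPE,
                                stdout=subprocess.PIPE)
        stdout, _ = proc.communicate(payload)  # bytes in, bytes out
        out.append(stdout)
    return out

# Stand-in for an external tool such as sox: a subprocess that reads binary
# stdin and writes binary stdout (here it simply reverses the bytes).
reverse_cmd = [sys.executable, '-c',
               'import sys; sys.stdout.buffer.write(sys.stdin.buffer.read()[::-1])']

results = pipe_binary([b'\x00\x01\x02', b'wav-bytes'], reverse_cmd)
```

Because the payload never passes through *.toString, arbitrary bytes (including NUL and newline bytes) survive the round trip, which is exactly what RDD.pipe cannot guarantee.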
Re: Spark Team - Paco Nathan said that your team can help
Hi Sudipta, I would also like to suggest asking this question on the Cloudera mailing list, since you have HDFS, MapReduce and YARN requirements. Spark can work with HDFS and YARN, but it is more like a client to those clusters. Cloudera can provide services to answer your question more clearly. I'm not affiliated with Cloudera, but it seems they are very active in the Spark project and provide a Hadoop distribution. HTH, Jerry btw, who is Paco Nathan? On Thu, Jan 22, 2015 at 10:03 AM, Babu, Prashanth prashanth.b...@nttdata.com wrote: Sudipta, Use the Docker image [1] and play around with Hadoop and Spark in the VM for a while. Decide on your use case(s) and then you can move ahead with installing on a cluster, etc. This Docker image has all you want [HDFS + MapReduce + Spark + YARN]. All the best! [1]: https://github.com/sequenceiq/docker-spark *From:* Sudipta Banerjee [mailto:asudipta.baner...@gmail.com] *Sent:* 22 January 2015 14:51 *To:* Marco Shaw *Cc:* user@spark.apache.org *Subject:* Re: Spark Team - Paco Nathan said that your team can help Hi Marco, Thanks for the confirmation. Please let me know what further details you need to answer a very specific question: WHAT IS THE MINIMUM HARDWARE CONFIGURATION REQUIRED TO BUILD HDFS+MAPREDUCE+SPARK+YARN on a system? Please let me know if you need any further information, and if you don't know, please drive across with the $1 to Sir Paco Nathan and get me the answer. Thanks and Regards, Sudipta On Thu, Jan 22, 2015 at 5:33 PM, Marco Shaw marco.s...@gmail.com wrote: Hi, Let me reword your request so you understand how (too) generic your question is: Hi, I have $10,000, please find me some means of transportation so I can get to work. Please provide (a lot) more details. If you can't, consider using one of the pre-built express VMs from either Cloudera, Hortonworks or MapR, for example.
Marco On Jan 22, 2015, at 7:36 AM, Sudipta Banerjee asudipta.baner...@gmail.com wrote: Hi Apache Spark team, What are the system requirements for installing Hadoop and Apache Spark? I have attached a screenshot of GParted. Thanks and regards, Sudipta -- Sudipta Banerjee Consultant, Business Analytics and Cloud Based Architecture Call me +919019578099 Screenshot - Wednesday 21 January 2015 - 10:55:29 IST.png - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Sudipta Banerjee Consultant, Business Analytics and Cloud Based Architecture Call me +919019578099 __ Disclaimer: This email and any attachments are sent in strictest confidence for the sole use of the addressee and may contain legally privileged, confidential, and proprietary data. If you are not the intended recipient, please advise the sender by replying promptly to this email and then delete and destroy this email and any attachments without any further use, copying or forwarding.
RE: spark 1.1.0 save data to hdfs failed
I looked into the namenode log and found this message: 2015-01-22 22:18:39,441 WARN org.apache.hadoop.ipc.Server: Incorrect header or version mismatch from 10.33.140.233:53776 got version 9 expected version 4 What should I do to fix this? Thanks. Ey-Chih From: eyc...@hotmail.com To: yuzhih...@gmail.com CC: user@spark.apache.org Subject: RE: spark 1.1.0 save data to hdfs failed Date: Wed, 21 Jan 2015 23:12:56 -0800 The hdfs release should be hadoop 1.0.4. Ey-Chih Chow Date: Wed, 21 Jan 2015 16:56:25 -0800 Subject: Re: spark 1.1.0 save data to hdfs failed From: yuzhih...@gmail.com To: eyc...@hotmail.com CC: user@spark.apache.org What hdfs release are you using ? Can you check the namenode log around the time of the error below to see if there is some clue ? Cheers On Wed, Jan 21, 2015 at 4:51 PM, ey-chih chow eyc...@hotmail.com wrote: Hi, I used the following fragment of a Scala program to save data to HDFS:

contextAwareEvents
  .map(e => (new AvroKey(e), null))
  .saveAsNewAPIHadoopFile("hdfs://" + masterHostname + ":9000/ETL/output/" + dateDir,
    classOf[AvroKey[GenericRecord]], classOf[NullWritable],
    classOf[AvroKeyOutputFormat[GenericRecord]], job.getConfiguration)

But it failed with the following error messages. Is there anyone who can help? Thanks.
Ey-Chih Chow = Exception in thread main java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:40) at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala) Caused by: java.io.IOException: Failed on local exception: java.io.EOFException; Host Details : local host is: ip-10-33-140-157/10.33.140.157; destination host is: ec2-54-203-58-2.us-west-2.compute.amazonaws.com:9000; at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764) at org.apache.hadoop.ipc.Client.call(Client.java:1415) at org.apache.hadoop.ipc.Client.call(Client.java:1364) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:744) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy15.getFileInfo(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1925) at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1079) at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1075) at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1075) at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1400) at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:145) at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:900) at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:832) at com.crowdstar.etl.ParseAndClean$.main(ParseAndClean.scala:101) at com.crowdstar.etl.ParseAndClean.main(ParseAndClean.scala) ... 6 more Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:392) at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1055) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:950) === -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-1-0-save-data-to-hdfs-failed-tp21305.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Large dataset, reduceByKey - java heap space error
Hi Kane- http://spark.apache.org/docs/latest/tuning.html has excellent information that may be helpful. In particular, increasing the number of tasks may help, as well as confirming that you don't have more data than you're expecting landing on a single key. Also, if you are using Spark 1.2.0, setting spark.shuffle.manager=sort was a huge help for many of our shuffle-heavy workloads (sort is now the default in 1.2.0). Cheers, Sean On Jan 22, 2015, at 3:15 PM, Kane Kim kane.ist...@gmail.com wrote: I'm trying to process a large dataset; mapping/filtering works ok, but as soon as I try to reduceByKey, I get out-of-memory errors: http://pastebin.com/70M5d0Bn Any ideas how I can fix that? Thanks. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
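Sean's suggestion to confirm that no single key is receiving an outsized share of the data can be checked cheaply before the reduceByKey. In Spark this is roughly `rdd.sample(...).countByKey()`; the plain-Python sketch below (on illustrative data) shows the idea.

```python
from collections import Counter

# Stand-in for rdd.sample(False, 0.01).countByKey(): records per key on a sample.
sample = [("a", 1), ("b", 2), ("a", 3), ("a", 4), ("c", 5)]
key_counts = Counter(k for k, _ in sample)

# A key far above the mean count is a candidate for blowing the heap of the
# one reduce task that has to hold all of its values.
heaviest_key, heaviest = key_counts.most_common(1)[0]
mean = sum(key_counts.values()) / len(key_counts)
skew_ratio = heaviest / mean
```

If the ratio is large, repartitioning alone will not help, since all values for the hot key still land in one task; the aggregation itself has to be restructured.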
Re: reducing number of output files
One output file is produced per partition. If you want fewer, use coalesce() before saving the RDD. On Thu, Jan 22, 2015 at 10:46 PM, Kane Kim kane.ist...@gmail.com wrote: How can I reduce the number of output files? Is there a parameter to saveAsTextFile? Thanks. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
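Since saveAsTextFile writes one part-file per partition, coalesce() controls the output file count by merging partitions. The toy model below only sketches that merging behavior in plain Python (the real coalesce plans which partitions to combine so no full shuffle is needed); it is not Spark code.

```python
def coalesce(partitions, n):
    """Toy model of RDD.coalesce: fold the existing partitions into n
    buckets, so saving would produce n part-files instead of one per
    original partition."""
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)
    return merged

parts = [["r1"], ["r2"], ["r3"], ["r4"]]  # 4 partitions -> 4 output files
fewer = coalesce(parts, 2)                # 2 partitions -> 2 output files
```

In Spark itself this is simply `rdd.coalesce(2).saveAsTextFile(path)`; coalescing all the way to 1 serializes the write through a single task, so it trades parallelism for fewer files.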
Results never return to driver | Spark Custom Reader
Hi All, I wrote a custom reader to read a DB, and it is able to return keys and values as expected, but after it finished it never returned to the driver. Here is the output of the worker log: 15/01/23 15:51:38 INFO worker.ExecutorRunner: Launch command: java -cp ::/usr/local/spark-1.2.0-bin-hadoop2.4/sbin/../conf:/usr/local/spark-1.2.0-bin-hadoop2.4/lib/spark-assembly-1.2.0-hadoop2.4.0.jar:/usr/local/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar:/usr/local/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/usr/local/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/usr/local/hadoop/etc/hadoop -XX:MaxPermSize=128m -Dspark.driver.port=53484 -Xms1024M -Xmx1024M org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://sparkDriver@VM90:53484/user/CoarseGrainedScheduler 6 VM99 4 app-20150123155114- akka.tcp://sparkWorker@VM99:44826/user/Worker 15/01/23 15:51:47 INFO worker.Worker: Executor app-20150123155114-/6 finished with state EXITED message Command exited with code 1 exitStatus 1 15/01/23 15:51:47 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@VM99:57695] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 15/01/23 15:51:47 INFO actor.LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkWorker/deadLetters] to Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%40143.96.25.29%3A35065-4#-915179653] was not delivered. [3] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. 15/01/23 15:51:49 INFO worker.Worker: Asked to kill unknown executor app-20150123155114-/6 If someone notices any clue to fix this, I will really appreciate it.
--Harihar -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Results-never-return-to-driver-Spark-Custom-Reader-tp21328.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
processing large dataset
I'm trying to process 5TB of data, not doing anything fancy, just map/filter and reduceByKey. I spent the whole day today trying to get it processed, but never succeeded. I tried to deploy to EC2 with the script provided with Spark, on pretty beefy machines (100 r3.2xlarge nodes). Really frustrated that Spark doesn't work out of the box for anything bigger than the word-count sample. One big problem is that the defaults are not suitable for processing big datasets; the provided EC2 script could do a better job, knowing the instance type requested. Second, it takes hours to figure out what is wrong when a Spark job fails having almost finished processing. Even after raising all limits as per https://spark.apache.org/docs/latest/tuning.html it still fails (now with: error communicating with MapOutputTracker). After all, I have only one question: how do I get Spark tuned up for processing terabytes of data, and is there a way to make this configuration easier and more transparent? Thanks. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
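For what it's worth, the knobs most often raised for shuffle-heavy multi-terabyte jobs on Spark 1.x are along these lines. The values below are illustrative only, not recommendations; the right numbers depend on the cluster and the job.

```
# spark-defaults.conf (illustrative values, Spark 1.x settings)
spark.shuffle.manager        sort    # already the default in 1.2.0+
spark.default.parallelism    2000    # more, smaller shuffle tasks
spark.akka.frameSize         50      # MB; large map-output statuses travel over Akka
spark.executor.memory        20g
```

"error communicating with MapOutputTracker" in particular is often a symptom of map-status messages growing with the number of partitions, which is why frame size and task counts tend to be tuned together.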
Re: KNN for large data set
Hi Devan and Xiangrui, Can you please explain the cost and optimization function of the KNN algorithm that is being used? Thanks and Regards, Sudipta On Thu, Jan 22, 2015 at 6:59 PM, DEVAN M.S. msdeva...@gmail.com wrote: Thanks Xiangrui Meng, will try this. And, found this https://github.com/kaushikranjan/knnJoin also. Will this work with double data? Can we find out the z value of *Vector(10.3,4.5,3,5)*? On Thu, Jan 22, 2015 at 12:25 AM, Xiangrui Meng men...@gmail.com wrote: For large datasets, you need hashing in order to compute k-nearest neighbors locally. You can start with LSH + k-nearest in Google Scholar: http://scholar.google.com/scholar?q=lsh+k+nearest -Xiangrui On Tue, Jan 20, 2015 at 9:55 PM, DEVAN M.S. msdeva...@gmail.com wrote: Hi all, Please help me to find out the best way to do k-nearest-neighbor search using Spark for large data sets. -- Sudipta Banerjee Consultant, Business Analytics and Cloud Based Architecture Call me +919019578099
Exception in Parsely pyspark cassandra hadoop connector
I am following the GitHub repo for the pyspark cassandra connector at https://github.com/Parsely/pyspark-cassandra On executing the line: ./run_script.py src/main/python/pyspark_cassandra_hadoop_example.py run test it ends up with an exception: ERROR Executor: Exception in task 9.0 in stage 2.0 (TID 14) java.io.NotSerializableException: scala.collection.convert.Wrappers$MapWrapper I am unable to figure out the cause of the exception. Thanks, Nishant
Re: spark 1.1.0 save data to hdfs failed
It means your client app is using Hadoop 2.x and your HDFS is Hadoop 1.x. On Thu, Jan 22, 2015 at 10:32 PM, ey-chih chow eyc...@hotmail.com wrote: I looked into the namenode log and found this message: 2015-01-22 22:18:39,441 WARN org.apache.hadoop.ipc.Server: Incorrect header or version mismatch from 10.33.140.233:53776 got version 9 expected version 4 What should I do to fix this? Thanks. Ey-Chih
Re: GraphX: ShortestPaths does not terminate on a grid graph
At 2015-01-22 02:06:37 -0800, NicolasC nicolas.ch...@inria.fr wrote: I try to execute a simple program that runs the ShortestPaths algorithm (org.apache.spark.graphx.lib.ShortestPaths) on a small grid graph. I use Spark 1.2.0 downloaded from spark.apache.org. This program runs more than 2 hours when the grid size is 70x70 as above, and is then killed by the resource manager of the cluster (Torque). After 5-6 minutes of execution, the Spark master UI does not even respond. For a grid size of 30x30, the program terminates in about 20 seconds, and for a grid size of 50x50 it finishes in about 80 seconds. The problem appears for a grid size of 70x70 and above. Unfortunately this problem is due to a Spark bug. In later iterations of iterative algorithms, the lineage maintained for fault tolerance grows long and causes Spark to consume increasing amounts of resources for scheduling and task serialization. The workaround is to checkpoint the graph periodically, which writes it to stable storage and interrupts the lineage chain before it grows too long. If you're able to recompile Spark, you can do this by applying the patch to GraphX at the end of this mail, and before running graph algorithms, calling sc.setCheckpointDir("/tmp") to set the checkpoint directory as desired. Ankur === patch begins here ===

diff --git a/graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala b/graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala
index 5e55620..1fbbb87 100644
--- a/graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala
+++ b/graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala
@@ -126,6 +126,8 @@ object Pregel extends Logging {
     // Loop
     var prevG: Graph[VD, ED] = null
     var i = 0
+    val checkpoint = g.vertices.context.getCheckpointDir.nonEmpty
+    val checkpointFrequency = 25
     while (activeMessages > 0 && i < maxIterations) {
       // Receive the messages. Vertices that didn't get any messages do not appear in newVerts.
       val newVerts = g.vertices.innerJoin(messages)(vprog).cache()
@@ -139,6 +141,14 @@ object Pregel extends Logging {
       // get to send messages. We must cache messages so it can be materialized on the next line,
       // allowing us to uncache the previous iteration.
       messages = g.mapReduceTriplets(sendMsg, mergeMsg, Some((newVerts, activeDirection))).cache()
+
+      if (checkpoint && i % checkpointFrequency == checkpointFrequency - 1) {
+        logInfo("Checkpointing in iteration " + i)
+        g.vertices.checkpoint()
+        g.edges.checkpoint()
+        messages.checkpoint()
+      }
+
       // The call to count() materializes `messages`, `newVerts`, and the vertices of `g`. This
       // hides oldMessages (depended on by newVerts), newVerts (depended on by messages), and the
       // vertices of prevG (depended on by newVerts, oldMessages, and the vertices of g).

- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
save a histogram to a file
Hi, histogram() returns an object that is a pair of Arrays. There appears to be no saveAsTextFile() for this paired object. Currently I am using the following to save the output to a file:

val hist = a.histogram(10)
val arr1 = sc.parallelize(hist._1).saveAsTextFile("file1")
val arr2 = sc.parallelize(hist._2).saveAsTextFile("file2")

Is there a simpler way to save the histogram() result to a file? thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/save-a-histogram-to-a-file-tp21324.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
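One alternative, since histogram(n) returns n+1 bucket boundaries alongside n counts, is to zip the two arrays into printable rows first and save a single RDD. The pairing is shown below in plain Python just to make it concrete; in Spark the final step would be something like sc.parallelize(lines, 1).saveAsTextFile("file1") (file name illustrative).

```python
# a.histogram(2) returns a pair of arrays shaped like this:
# 3 bucket boundaries and 2 counts.
buckets = [0.0, 5.0, 10.0]
counts = [7, 3]

# Pair each (lo, hi) bucket with its count, producing one CSV row per bucket.
lines = ["%g,%g,%d" % (lo, hi, c)
         for (lo, hi), c in zip(zip(buckets, buckets[1:]), counts)]
```

This keeps the boundaries and counts aligned in one output instead of two separate files that have to be re-joined later.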
RE: spark streaming with checkpoint
Hi, A new RDD will be created in each slide duration; if there's no data coming, an empty RDD will be generated. I'm not sure there's a way to alleviate your problem from the Spark side. Does your application design have to build such a large window? Can you change your implementation if that is easy for you? I think it's better and easier for you to change your implementation rather than rely on Spark to handle this. Thanks Jerry From: Balakrishnan Narendran [mailto:balu.na...@gmail.com] Sent: Friday, January 23, 2015 12:19 AM To: Shao, Saisai Cc: user@spark.apache.org Subject: Re: spark streaming with checkpoint Thank you Jerry, Does the window operation create new RDDs for each slide duration? I am asking this because I see a constant increase in memory even when there are no logs received. If not checkpointing, is there any alternative that you would suggest? On Tue, Jan 20, 2015 at 7:08 PM, Shao, Saisai saisai.s...@intel.com wrote: Hi, It seems you have such a large window (24 hours), so the phenomenon of memory increasing is expectable, because the window operation will cache the RDDs within this window in memory. So for your requirement, memory should be enough to hold the data of 24 hours. I don't think checkpointing in Spark Streaming can alleviate such a problem, because checkpoints are mainly for fault tolerance. Thanks Jerry From: balu.naren [mailto:balu.na...@gmail.com] Sent: Tuesday, January 20, 2015 7:17 PM To: user@spark.apache.org Subject: spark streaming with checkpoint I am a beginner to Spark Streaming, so I have a basic doubt regarding checkpoints. My use case is to calculate the number of unique users per day. I am using reduce-by-key-and-window for this, where my window duration is 24 hours and slide duration is 5 mins. I am updating the processed record to MongoDB. Currently I replace the existing record each time.
But I see the memory slowly increasing over time, and the process gets killed after 1 and 1/2 hours (on an AWS small instance). The DB write after the restart clears all the old data. So I understand checkpointing is the solution for this. But my doubts are: * What should my checkpoint duration be? As per the documentation it should be 5-10 times the slide duration, but I need the data of the entire day, so is it ok to keep it at 24 hrs? * Where ideally should the checkpoint be? Initially when I receive the stream, just before the window operation, or after the data reduction has taken place? Appreciate your help. Thank you View this message in context: spark streaming with checkpoint http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-with-checkpoint-tp21263.html Sent from the Apache Spark User List mailing list archive http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.
Re: reading a csv dynamically
Spark can definitely process data with optional fields. It kind of depends on what you want to do with the results -- it's more of an object design / knowing-Scala-types question. E.g., Scala has a built-in type Option specifically for handling optional data, which works nicely with pattern matching and functional programming. Just to save myself some typing, I'm going to show an example with 2 or 3 fields:

val myProcessedRdd: RDD[(String, Double, Option[Double])] = sc.textFile("file.csv").map { txt =>
  val split = txt.split(",")
  val opt = if (split.length == 3) Some(split(2).toDouble) else None
  (split(0), split(1).toDouble, opt)
}

Then, e.g., say in a later processing step you want to make the 3rd field have a default of 6.9, you'd do something like:

myProcessedRdd.map { case (name, count, ageOpt) =>  // arbitrary variable names I'm just making up
  val age = ageOpt.getOrElse(6.9)
  ...
}

You might be interested in reading up on Scala's Option type and how you can use it. There are a lot of other options too, e.g. the Either type if you want to track two alternatives, Try for keeping track of errors, etc. You can play around with all of them outside of Spark. Of course you could do similar things in Java as well without these types; you just need to write your own container for dealing with missing data, which could be really simple in your use case. I would advise against first creating a key with the number of fields and then doing a groupByKey. Since you are only expecting two different lengths, all the data will go to only two tasks, so this will not scale very well. And though the data is then grouped by length, it's all in one RDD, so you've still got to figure out what to do with both record lengths. Imran On Wed, Jan 21, 2015 at 6:46 PM, Pankaj Narang pankajnaran...@gmail.com wrote: Yes, I think you need to create one map first which will keep the number of values in every line. Now you can group all the records with the same number of values, and you know how many types of arrays you will have.
val dataRDD = sc.textFile("file.csv")
val dataLengthRDD = dataRDD.map(line => (line.split(",").length, line))
val groupedData = dataLengthRDD.groupByKey()

Now you can process groupedData, as it will have arrays of length x in one RDD. groupByKey([numTasks]): when called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. I hope this helps Regards Pankaj Infoshore Software India -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/reading-a-csv-dynamically-tp21304p21307.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
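The Option-based approach Imran shows in Scala carries over to PySpark, where None plays the role of Option's None. Below is a plain-Python sketch of the same parse step (the field layout -- name, count, optional age -- is just the hypothetical one from the example above).

```python
def parse_line(txt):
    """Parse 'name,count[,age]': the third field is optional, and None
    marks it missing (the analogue of Scala's Option/None)."""
    parts = txt.split(",")
    age = float(parts[2]) if len(parts) == 3 else None
    return parts[0], float(parts[1]), age

rows = [parse_line(l) for l in ["bob,2", "ann,3,4.5"]]

# A later step can fill in a default, like ageOpt.getOrElse(6.9):
with_default = [(n, c, a if a is not None else 6.9) for n, c, a in rows]
```

In a real job, parse_line would be the function passed to rdd.map; the point is that missing data is represented explicitly rather than by grouping records by field count.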
Missing output partition file in S3
Hi, My team is using Spark 1.0.1 and the project we're working on needs to compute exact numbers, which are then saved to S3, to be reused later in other Spark jobs to compute other numbers. The problem we noticed yesterday: one of the output partition files in S3 was missing :/ (some part-00218)... The problem only occurred once, and cannot be reproed. However because of this incident, our numbers may not be reliable. From the Spark logs (from the cluster which generated the files with the missing partition), we noticed some errors appearing multiple times: - Loss was due to java.io.FileNotFoundException java.io.FileNotFoundException: s3://xxx/_temporary/_attempt_201501142002__m_000368_12139/part-00368: No such file or directory. at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:340) at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:165) at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172) at org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132) at org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109) at org.apache.spark.rdd.PairRDDFunctions.org$apache$spark$rdd$PairRDDFunctions$$writeToFile$1(PairRDDFunctions.scala:785) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$2.apply(PairRDDFunctions.scala:788) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$2.apply(PairRDDFunctions.scala:788) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) And: - WARN BlockManagerMasterActor: Removing BlockManager 
BlockManagerId(3, ip-10-152-30-234.ec2.internal, 48973, 0) with no recent heart beats: 72614ms exceeds 45000ms Questions: - Do those errors explain why the output partition file was missing? (knowing that we still get those errors in our logs). - Is there a way to detect data loss during runtime, and then stop our Spark job completely ASAP if it happens? Thanks, Nicolas -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Missing-output-partition-file-in-S3-tp21326.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: How to make spark partition sticky, i.e. stay with node?
Also, Setting spark.locality.wait=100 did not work for me. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-make-spark-partition-sticky-i-e-stay-with-node-tp21322p21325.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: sparkcontext.objectFile return thousands of partitions
I think you should also just be able to provide an input format that never splits the input data. This has come up before on the list, but I couldn't find it.* I think this should work, but I can't try it out at the moment. Can you please try and let us know if it works?

class TextFormatNoSplits extends TextInputFormat {
  override def isSplitable(fs: FileSystem, file: Path): Boolean = false
}

def textFileNoSplits(sc: SparkContext, path: String): RDD[String] = {
  // note this is just a copy of sc.textFile, with a different InputFormat class
  sc.hadoopFile(path, classOf[TextFormatNoSplits], classOf[LongWritable],
    classOf[Text]).map(pair => pair._2.toString).setName(path)
}

* yes, I realize the irony given the recent discussion about mailing list vs. stackoverflow ... On Thu, Jan 22, 2015 at 11:01 AM, Sean Owen so...@cloudera.com wrote: Yes, that second argument is what I was referring to, but yes it's a *minimum*, oops, right. OK, you will want to coalesce then, indeed. On Thu, Jan 22, 2015 at 6:51 PM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: > If you know that this number is too high you can request a number of partitions when you read it. How to do that? Can you give a code snippet? I want to read it into 8 partitions, so I do

val rdd2 = sc.objectFile[LabeledPoint]("file:///tmp/mydir", 8)

However rdd2 contains thousands of partitions instead of 8 partitions. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
RE: spark 1.1.0 save data to hdfs failed
Thanks. But after I replaced the Maven dependency from

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.5.0-cdh5.2.0</version>
  <scope>provided</scope>
  <exclusions>
    <exclusion>
      <groupId>org.mortbay.jetty</groupId>
      <artifactId>servlet-api</artifactId>
    </exclusion>
    <exclusion>
      <groupId>javax.servlet</groupId>
      <artifactId>servlet-api</artifactId>
    </exclusion>
    <exclusion>
      <groupId>io.netty</groupId>
      <artifactId>netty</artifactId>
    </exclusion>
  </exclusions>
</dependency>

to

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>1.0.4</version>
  <scope>provided</scope>
  <exclusions>
    <exclusion>
      <groupId>org.mortbay.jetty</groupId>
      <artifactId>servlet-api</artifactId>
    </exclusion>
    <exclusion>
      <groupId>javax.servlet</groupId>
      <artifactId>servlet-api</artifactId>
    </exclusion>
    <exclusion>
      <groupId>io.netty</groupId>
      <artifactId>netty</artifactId>
    </exclusion>
  </exclusions>
</dependency>

the warning message still shows up in the namenode log. Is there anything else I need to do? Thanks. Ey-Chih Chow From: so...@cloudera.com Date: Thu, 22 Jan 2015 22:34:22 + Subject: Re: spark 1.1.0 save data to hdfs failed To: eyc...@hotmail.com CC: yuzhih...@gmail.com; user@spark.apache.org It means your client app is using Hadoop 2.x and your HDFS is Hadoop 1.x. On Thu, Jan 22, 2015 at 10:32 PM, ey-chih chow eyc...@hotmail.com wrote: I looked into the namenode log and found this message: 2015-01-22 22:18:39,441 WARN org.apache.hadoop.ipc.Server: Incorrect header or version mismatch from 10.33.140.233:53776 got version 9 expected version 4 What should I do to fix this? Thanks. Ey-Chih From: eyc...@hotmail.com To: yuzhih...@gmail.com CC: user@spark.apache.org Subject: RE: spark 1.1.0 save data to hdfs failed Date: Wed, 21 Jan 2015 23:12:56 -0800 The hdfs release should be hadoop 1.0.4. Ey-Chih Chow Date: Wed, 21 Jan 2015 16:56:25 -0800 Subject: Re: spark 1.1.0 save data to hdfs failed From: yuzhih...@gmail.com To: eyc...@hotmail.com CC: user@spark.apache.org What hdfs release are you using ?
Can you check the namenode log around the time of the error below to see if there is some clue? Cheers On Wed, Jan 21, 2015 at 4:51 PM, ey-chih chow eyc...@hotmail.com wrote: Hi, I used the following fragment of a Scala program to save data to hdfs:

contextAwareEvents
  .map(e => (new AvroKey(e), null))
  .saveAsNewAPIHadoopFile("hdfs://" + masterHostname + ":9000/ETL/output/" + dateDir,
    classOf[AvroKey[GenericRecord]],
    classOf[NullWritable],
    classOf[AvroKeyOutputFormat[GenericRecord]],
    job.getConfiguration)

But it failed with the following error messages. Can anyone help? Thanks. Ey-Chih Chow = Exception in thread "main" java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:40) at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala) Caused by: java.io.IOException: Failed on local exception: java.io.EOFException; Host Details : local host is: ip-10-33-140-157/10.33.140.157; destination host is: ec2-54-203-58-2.us-west-2.compute.amazonaws.com:9000; at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764) at org.apache.hadoop.ipc.Client.call(Client.java:1415) at org.apache.hadoop.ipc.Client.call(Client.java:1364)
Re: reducing number of output files
rdd.coalesce(1) will coalesce the RDD and give only one output file; coalesce(2) will give two, and so on. On Jan 23, 2015 4:58 AM, Sean Owen so...@cloudera.com wrote: One output file is produced per partition. If you want fewer, use coalesce() before saving the RDD. On Thu, Jan 22, 2015 at 10:46 PM, Kane Kim kane.ist...@gmail.com wrote: How can I reduce the number of output files? Is there a parameter to saveAsTextFile? Thanks.
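Sean's suggestion can be sketched as follows (a minimal sketch; the app name and paths are placeholders, and it assumes a running Spark cluster):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoalesceBeforeSave {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("coalesce-before-save"))
    val rdd = sc.textFile("hdfs:///input/path") // hypothetical input path
    // One part-NNNNN file is written per partition, so coalescing to 1
    // partition before saving yields a single output file (at the cost
    // of funneling all the data through a single task).
    rdd.coalesce(1).saveAsTextFile("hdfs:///output/path")
    sc.stop()
  }
}
```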
Re: Using third party libraries in pyspark
You need to install these libraries on all the slaves, or submit via spark-submit: spark-submit --py-files xxx On Thu, Jan 22, 2015 at 11:23 AM, Mohit Singh mohit1...@gmail.com wrote: Hi, I might be asking something very trivial, but whats the recommend way of using third party libraries. I am using tables to read hdf5 format file.. And here is the error trace: print rdd.take(2) File /tmp/spark/python/pyspark/rdd.py, line , in take res = self.context.runJob(self, takeUpToNumLeft, p, True) File /tmp/spark/python/pyspark/context.py, line 818, in runJob it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions, allowLocal) File /tmp/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /tmp/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, srv-108-23.720.rdio): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File /hadoop/disk3/mapred/local/filecache/540/spark-assembly-1.2.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/worker.py, line 90, in main command = pickleSer._read_with_length(infile) File /hadoop/disk3/mapred/local/filecache/540/spark-assembly-1.2.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py, line 151, in _read_with_length return self.loads(obj) File /hadoop/disk3/mapred/local/filecache/540/spark-assembly-1.2.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py, line 396, in loads return cPickle.loads(obj) File /hadoop/disk3/mapred/local/filecache/540/spark-assembly-1.2.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/cloudpickle.py, line 825, in subimport __import__(name) ImportError: ('No module named tables', function subimport at 0x47e1398, ('tables',)) Though, import tables works fine on the 
local python shell.. but seems like every thing is being pickled.. Are we expected to send all the files as helper files? that doesn't seems right? Thanks -- Mohit When you want success as badly as you want the air, then you will get it. There is no other secret of success. -Socrates - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
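Davies' suggestion above can be sketched like this (the file names are hypothetical; note that pure-Python packages can be shipped this way, but libraries with C extensions such as PyTables generally have to be installed on every worker node instead):

```shell
# Zip the pure-Python package(s) your script imports...
zip -r deps.zip mypackage/
# ...and hand the archive to spark-submit; it is added to the
# PYTHONPATH of every executor.
spark-submit --py-files deps.zip my_job.py

# Libraries with native extensions (e.g. PyTables for HDF5) cannot be
# shipped as a zip; install them on each slave, for example:
#   pip install tables    # run on every worker node
```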
Re: processing large dataset
Did you try it with a smaller subset of the data first? On Jan 23, 2015 05:54, Kane Kim kane.ist...@gmail.com wrote: I'm trying to process 5TB of data, not doing anything fancy, just map/filter and reduceByKey. I spent the whole day today trying to get it processed, but never succeeded. I've tried to deploy to EC2 with the script provided with Spark on pretty beefy machines (100 r3.2xlarge nodes). It is really frustrating that Spark doesn't work out of the box for anything bigger than the word count sample. One big problem is that the defaults are not suitable for processing big datasets; the provided EC2 script could do a better job, knowing the instance type requested. Second, it takes hours to figure out what is wrong when a Spark job fails having almost finished processing. Even after raising all limits as per https://spark.apache.org/docs/latest/tuning.html it still fails (now with: error communicating with MapOutputTracker). After all I have only one question - how do I get Spark tuned for processing terabytes of data, and is there a way to make this configuration easier and more transparent? Thanks.
Re: sparkcontext.objectFile return thousands of partitions
Yes, that second argument is what I was referring to, but yes it's a *minimum*, oops, right. OK, you will want to coalesce then, indeed. On Thu, Jan 22, 2015 at 6:51 PM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: Ø If you know that this number is too high you can request a number of partitions when you read it. How to do that? Can you give a code snippet? I want to read it into 8 partitions, so I do val rdd2 = sc.objectFile[LabeledPoint]("file:///tmp/mydir", 8) However rdd2 contains thousands of partitions instead of 8 partitions
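To make the two steps explicit (a sketch assuming an existing SparkContext and a directory previously written with saveAsObjectFile; the path is from Ningjun's example): the second argument to objectFile is only a *minimum* partition hint, so follow it with coalesce to actually force the count down.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.regression.LabeledPoint

def readInEightPartitions(sc: SparkContext): Unit = {
  // The 8 here is only a minimum hint; the underlying input format
  // may still create many more partitions (one per HDFS block, etc.).
  val rdd = sc.objectFile[LabeledPoint]("file:///tmp/mydir", 8)
  // coalesce() actually reduces the partition count to 8.
  val rdd8 = rdd.coalesce(8)
  println(rdd8.partitions.length)
}
```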
How to make spark partition sticky, i.e. stay with node?
I posted a question on Stack Overflow and haven't gotten any answer yet. http://stackoverflow.com/questions/28079037/how-to-make-spark-partition-sticky-i-e-stay-with-node Is there a way to make a partition stay with a node in Spark Streaming? I need this since I have to load a large amount of partition-specific auxiliary data for processing the stream. I noticed that the partitions move among the nodes. I cannot afford to move the large auxiliary data around. Thanks, Mingyu -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-make-spark-partition-sticky-i-e-stay-with-node-tp21322.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Using third party libraries in pyspark
Hi, I might be asking something very trivial, but whats the recommend way of using third party libraries. I am using tables to read hdf5 format file.. And here is the error trace: print rdd.take(2) File /tmp/spark/python/pyspark/rdd.py, line , in take res = self.context.runJob(self, takeUpToNumLeft, p, True) File /tmp/spark/python/pyspark/context.py, line 818, in runJob it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions, allowLocal) File /tmp/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /tmp/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, srv-108-23.720.rdio): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File /hadoop/disk3/mapred/local/filecache/540/spark-assembly-1.2.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/worker.py, line 90, in main command = pickleSer._read_with_length(infile) File /hadoop/disk3/mapred/local/filecache/540/spark-assembly-1.2.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py, line 151, in _read_with_length return self.loads(obj) File /hadoop/disk3/mapred/local/filecache/540/spark-assembly-1.2.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py, line 396, in loads return cPickle.loads(obj) File /hadoop/disk3/mapred/local/filecache/540/spark-assembly-1.2.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/cloudpickle.py, line 825, in subimport __import__(name) ImportError: ('No module named tables', function subimport at 0x47e1398, ('tables',)) Though, import tables works fine on the local python shell.. but seems like every thing is being pickled.. Are we expected to send all the files as helper files? that doesn't seems right? 
Thanks -- Mohit When you want success as badly as you want the air, then you will get it. There is no other secret of success. -Socrates
Re: Discourse: A proposed alternative to the Spark User list
On Thu, Jan 22, 2015 at 10:21 AM, Sean Owen so...@cloudera.com wrote: I think a Spark site would have a lot less traffic. One annoyance is that people can't figure out when to post on SO vs Data Science vs Cross Validated. Another is that a lot of the discussions we see on the Spark users list would be closed very quickly at Stack Overflow. Long and abstract discussions are generally a good recipe to get your question closed. Which is an argument for why Discourse would be more appropriate, I guess. Finally, maybe I'm showing my age, but I really dislike having to follow lots of different places. What would happen is that, personally, I'd end up either ignoring any new discussion forum, or just treating it like a mailing list and doing everything by e-mail. Now get off my lawn. -- Marcelo - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Apache Spark broadcast error: Error sending message as driverActor is null [message = UpdateBlockInfo(BlockManagerId(4)
Hi, I'm using Apache Spark 1.1.0 and I'm currently having an issue with the broadcast method. When I call the broadcast function on a small dataset in a 5-node cluster, I experience the "Error sending message as driverActor is null" error after broadcasting the variables several times (the apps run under JBoss). Any help would be appreciated. Thanks, Edwin
Durability of Spark Streaming Applications
I deployed Spark Streaming applications to a standalone cluster; after a cluster restart, all the deployed applications are gone and I cannot see any applications through the Spark Web UI. How can I make the Spark Streaming applications durable and auto-restart after a cluster restart? Thanks, Daniel
Re: spark streaming with checkpoint
Maybe you are using the wrong approach - try something like HyperLogLog or bitmap structures, as you can find them, for instance, in Redis. They are much smaller. On Jan 22, 2015 17:19, Balakrishnan Narendran balu.na...@gmail.com wrote: Thank you Jerry, Does the window operation create new RDDs for each slide duration..? I am asking this because I see a constant increase in memory even when there are no logs received. If not checkpointing, is there any alternative that you would suggest? On Tue, Jan 20, 2015 at 7:08 PM, Shao, Saisai saisai.s...@intel.com wrote: Hi, Seems you have such a large window (24 hours), so the phenomenon of memory increasing is expectable, because the window operation will cache the RDDs within this window in memory. So for your requirement, memory should be enough to hold the data of 24 hours. I don't think checkpointing in Spark Streaming can alleviate such a problem, because checkpoints are mainly for fault tolerance. Thanks Jerry *From:* balu.naren [mailto:balu.na...@gmail.com] *Sent:* Tuesday, January 20, 2015 7:17 PM *To:* user@spark.apache.org *Subject:* spark streaming with checkpoint I am a beginner to Spark Streaming, so I have a basic doubt regarding checkpoints. My use case is to calculate the number of unique users per day. I am using reduceByKey and window for this, where my window duration is 24 hours and slide duration is 5 mins. I am updating the processed record to MongoDB; currently I am replacing the existing record each time. But I see the memory slowly increasing over time, which kills the process after 1.5 hours (on an AWS small instance). The DB write after the restart clears all the old data. So I understand checkpointing is the solution for this. But my doubts are: - What should my checkpoint duration be? The documentation says 5-10 times the slide duration, but I need the data of the entire day - is it OK to keep it at 24 hrs? - Where ideally should the checkpoint be?
Initially when I receive the stream or just before the window operation or after the data reduction has taken place. Appreciate your help. Thank you -- View this message in context: spark streaming with checkpoint http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-with-checkpoint-tp21263.html Sent from the Apache Spark User List mailing list archive http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.
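For reference, wiring checkpointing into a streaming app is just a matter of pointing the context at a fault-tolerant directory (a sketch; the app name, path, and durations are placeholders, and note Jerry's caveat above that checkpointing addresses fault tolerance rather than the memory footprint of a 24-hour window):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("unique-users")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Checkpoint to a fault-tolerant store. The usual guideline of
    // 5-10x the slide duration applies to the per-DStream checkpoint
    // interval (dstream.checkpoint(...)), not to the window length.
    ssc.checkpoint("hdfs:///checkpoints/unique-users")
    // ... build the windowed computation here, e.g. reduceByKeyAndWindow
    // over a 24-hour window with a 5-minute slide ...
    ssc.start()
    ssc.awaitTermination()
  }
}
```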
Re: How to 'Pipe' Binary Data in Apache Spark
Nick, Have you tried https://github.com/kaitoy/pcap4j ? I've used this in a Spark app already and didn't have any issues. My use case was slightly different than yours, but you should give it a try. From: Nick Allen n...@nickallen.org Date: Friday, January 16, 2015 at 10:09 AM To: user@spark.apache.org Subject: How to 'Pipe' Binary Data in Apache Spark I have an RDD containing binary data. I would like to use 'RDD.pipe' to pipe that binary data to an external program that will translate it to string/text data. Unfortunately, it seems that Spark is mangling the binary data before it gets passed to the external program. This code is representative of what I am trying to do. What am I doing wrong? How can I pipe binary data in Spark? Maybe it is getting corrupted when I read it in initially with 'textFile'?

bin = sc.textFile("binary-data.dat")
csv = bin.pipe("/usr/bin/binary-to-csv.sh")
csv.saveAsTextFile("text-data.csv")

Specifically, I am trying to use Spark to transform pcap (packet capture) data to text/csv so that I can perform an analysis on it. Thanks! -- Nick Allen n...@nickallen.org
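One likely culprit is textFile itself, which splits the input on newlines and decodes it as text, so arbitrary bytes get mangled. As of Spark 1.2 there is sc.binaryFiles, which yields the raw bytes per file; a hedged sketch (paths are from Nick's example, it assumes Spark 1.2+ on Java 8, and base64-encodes the bytes since pipe() still talks to the external process over text streams):

```scala
import org.apache.spark.SparkContext

// Sketch, assuming an existing SparkContext `sc`.
def pcapToCsv(sc: SparkContext): Unit = {
  // binaryFiles returns (path, PortableDataStream) pairs and leaves
  // the bytes untouched, unlike textFile.
  val bin = sc.binaryFiles("binary-data.dat")
  // pipe() writes lines of text to the child process's stdin, so the
  // raw bytes must be encoded first (java.util.Base64 needs Java 8;
  // commons-codec would work on older JVMs). Alternatively, do the
  // conversion in-JVM with a library such as pcap4j.
  val b64 = bin.map { case (_, stream) =>
    java.util.Base64.getEncoder.encodeToString(stream.toArray())
  }
  b64.pipe("/usr/bin/binary-to-csv.sh").saveAsTextFile("text-data.csv")
}
```

The hypothetical binary-to-csv.sh would then need to base64-decode each input line before converting it.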
Installing Spark Standalone to a Cluster
I have downloaded spark-1.2.0.tgz on each of my nodes and executed ./sbt/sbt assembly on each of them. Then I execute ./sbin/start-master.sh on my master and ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT on the workers. However, when I go to http://localhost:8080 I cannot see any workers. Why is that? Did I do something wrong with the installation/deployment of Spark? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Installing-Spark-Standalone-to-a-Cluster-tp21319.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Exception: NoSuchMethodError: org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions
NoSuchMethodError almost always means that you have compiled some code against one version of a library but are running against another. I wonder if you are including different versions of Spark in your project, or running against a cluster on an older version? On Thu, Jan 22, 2015 at 3:57 PM, Adrian Mocanu amoc...@verticalscope.com wrote: Hi I get this exception when I run a Spark test case on my local machine: An exception or error caused a run to abort: org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions(Lorg/apache/spark/streaming/dstream/DStream;Lscala/reflect/ClassTag;Lscala/reflect/ClassTag;Lscala/math/Ordering;)Lorg/apache/spark/streaming/dstream/PairDStreamFunctions; java.lang.NoSuchMethodError: org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions(Lorg/apache/spark/streaming/dstream/DStream;Lscala/reflect/ClassTag;Lscala/reflect/ClassTag;Lscala/math/Ordering;)Lorg/apache/spark/streaming/dstream/PairDStreamFunctions; In my test case I have these Spark related imports: import org.apache.spark.streaming.StreamingContext._ import org.apache.spark.streaming.TestSuiteBase import org.apache.spark.streaming.dstream.DStream import org.apache.spark.streaming.StreamingContext.toPairDStreamFunctions -Adrian
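A quick way to rule out Sean's first suspicion is to pin every Spark artifact in the build to a single version (an illustrative sbt fragment; the version shown is an example and should match the cluster you run against):

```scala
// build.sbt sketch: keep all Spark modules on the same version
// (and the same Scala binary version) as the deployed cluster.
val sparkVersion = "1.1.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % sparkVersion % "provided",
  "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided"
)
```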
Re: Spark Team - Paco Nathan said that your team can help
Folks, Just a gentle reminder we owe to ourselves: - this is a public forum and we need to behave accordingly, it is not place to vent frustration in rude way - getting attention here is an earned privilege and not entitlement - this is not a “Platinum Support” department of your vendor rather and open source collaboration forum where people volunteer their time to pay attention to your needs - there are still many gray areas so be patient and articulate questions in as much details as possible if you want to get quick help and not just be perceived as a smart a$$ FYI - Paco Nathan is a well respected Spark evangelist and many people, including myself, owe to his passion for jumping on Spark platform promise. People like Sean Owen keep us believing in things when we feel like hitting the dead-end. Please, be respectful of what connections you are prized with and act civilized. Have a great day! - Nicos On Jan 22, 2015, at 7:49 AM, Sean Owen so...@cloudera.com wrote: Yes, this isn't a well-formed question, and got maybe the response it deserved, but the tone is veering off the rails. I just got a much ruder reply from Sudipta privately, which I will not forward. Sudipta, I suggest you take the responses you've gotten so far as about as much answer as can be had here and do some work yourself, and come back with much more specific questions, and it will all be helpful and polite again. On Thu, Jan 22, 2015 at 2:51 PM, Sudipta Banerjee asudipta.baner...@gmail.com wrote: Hi Marco, Thanks for the confirmation. Please let me know what are the lot more detail you need to answer a very specific question WHAT IS THE MINIMUM HARDWARE CONFIGURATION REQUIRED TO BUILT HDFS+ MAPREDUCE+SPARK+YARN on a system? Please let me know if you need any further information and if you dont know please drive across with the $1 to Sir Paco Nathan and get me the answer. 
Thanks and Regards, Sudipta On Thu, Jan 22, 2015 at 5:33 PM, Marco Shaw marco.s...@gmail.com wrote: Hi, Let me reword your request so you understand how (too) generic your question is Hi, I have $10,000, please find me some means of transportation so I can get to work. Please provide (a lot) more details. If you can't, consider using one of the pre-built express VMs from either Cloudera, Hortonworks or MapR, for example. Marco On Jan 22, 2015, at 7:36 AM, Sudipta Banerjee asudipta.baner...@gmail.com wrote: Hi Apache-Spark team , What are the system requirements installing Hadoop and Apache Spark? I have attached the screen shot of Gparted. Thanks and regards, Sudipta -- Sudipta Banerjee Consultant, Business Analytics and Cloud Based Architecture Call me +919019578099 Screenshot - Wednesday 21 January 2015 - 10:55:29 IST.png - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Sudipta Banerjee Consultant, Business Analytics and Cloud Based Architecture Call me +919019578099 - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Discourse: A proposed alternative to the Spark User list
Love it! There is a reason why SO is so effective and popular. Search is excellent, you can quickly find very thoughtful answers about sometimes thorny problems, and it is easy to contribute, format code, etc. Perhaps the most useful feature is that the best answers naturally bubble up to the top, so these are the ones you see first. One annoyance is the troll phenomenon, see e.g. http://michael.richter.name/blogs/why-i-no-longer-contribute-to-stackoverflow (that also mentions other pet peeves about SO). That phenomenon is, IMHO, most prevalent on the stackoverflow itself, perhaps less so on other stackexchange sites. At the same time, I do appreciate the pressure to provide well-written, concise, and for the posterity questions and answers. That peer pressure is what, to a good extent, makes the material on SO so valuable and useful. It is probably a tricky balance to strike. A dedicated stackexchange site for Apache Spark sounds to me like the logical solution. Less trolling, more enthusiasm, and with the participation of the people on this list, I think it would very quickly become the reference for many technical questions, as well as a great vehicle to promote the awesomeness of Spark. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Discourse-A-proposed-alternative-to-the-Spark-User-list-tp20851p21321.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Discourse: A proposed alternative to the Spark User list
FWIW I am a moderator for datascience.stackexchange.com, and even that hasn't really achieved the critical mass that SE sites are supposed to: http://area51.stackexchange.com/proposals/55053/data-science I think a Spark site would have a lot less traffic. One annoyance is that people can't figure out when to post on SO vs Data Science vs Cross Validated. A Spark site would have the same problem, fragmentation and cross posting with SO. I don't think this would be accepted as a StackExchange site and don't think it helps. On Thu, Jan 22, 2015 at 6:16 PM, pierred pie...@demartines.com wrote: A dedicated stackexchange site for Apache Spark sounds to me like the logical solution. Less trolling, more enthusiasm, and with the participation of the people on this list, I think it would very quickly become the reference for many technical questions, as well as a great vehicle to promote the awesomeness of Spark. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
RE: Exception: NoSuchMethodError: org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions
I use spark 1.1.0-SNAPSHOT and the test I'm running is in local mode. My test case uses org.apache.spark.streaming.TestSuiteBase

val spark = "org.apache.spark" %% "spark-core" % "1.1.0-SNAPSHOT" % "provided" excludeAll(
val sparkStreaming = "org.apache.spark" % "spark-streaming_2.10" % "1.1.0-SNAPSHOT" % "provided" excludeAll(
val sparkCassandra = "com.tuplejump" % "calliope_2.10" % "0.9.0-C2-EA" exclude("org.apache.cassandra", "cassandra-all") exclude("org.apache.cassandra", "cassandra-thrift")
val casAll = "org.apache.cassandra" % "cassandra-all" % "2.0.3" intransitive()
val casThrift = "org.apache.cassandra" % "cassandra-thrift" % "2.0.3" intransitive()
val sparkStreamingFromKafka = "org.apache.spark" % "spark-streaming-kafka_2.10" % "0.9.1" excludeAll(

-Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: January-22-15 11:39 AM To: Adrian Mocanu Cc: u...@spark.incubator.apache.org Subject: Re: Exception: NoSuchMethodError: org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions NoSuchMethodError almost always means that you have compiled some code against one version of a library but are running against another. I wonder if you are including different versions of Spark in your project, or running against a cluster on an older version?
On Thu, Jan 22, 2015 at 3:57 PM, Adrian Mocanu amoc...@verticalscope.com wrote: Hi I get this exception when I run a Spark test case on my local machine: An exception or error caused a run to abort: org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions(Lorg/apache/spark/streaming/dstream/DStream;Lscala/reflect/ClassTag;Lscala/reflect/ClassTag;Lscala/math/Ordering;)Lorg/apache/spark/streaming/dstream/PairDStreamFunctions; java.lang.NoSuchMethodError: org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions(Lorg/apache/spark/streaming/dstream/DStream;Lscala/reflect/ClassTag;Lscala/reflect/ClassTag;Lscala/math/Ordering;)Lorg/apache/spark/streaming/dstream/PairDStreamFunctions; In my test case I have these Spark related imports: import org.apache.spark.streaming.StreamingContext._ import org.apache.spark.streaming.TestSuiteBase import org.apache.spark.streaming.dstream.DStream import org.apache.spark.streaming.StreamingContext.toPairDStreamFunctions -Adrian
RE: Exception: NoSuchMethodError: org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions
I use spark 1.1.0-SNAPSHOT

val spark = "org.apache.spark" %% "spark-core" % "1.1.0-SNAPSHOT" % "provided" excludeAll(

-Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: January-22-15 11:39 AM To: Adrian Mocanu Cc: u...@spark.incubator.apache.org Subject: Re: Exception: NoSuchMethodError: org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions NoSuchMethodError almost always means that you have compiled some code against one version of a library but are running against another. I wonder if you are including different versions of Spark in your project, or running against a cluster on an older version? On Thu, Jan 22, 2015 at 3:57 PM, Adrian Mocanu amoc...@verticalscope.com wrote: Hi I get this exception when I run a Spark test case on my local machine: An exception or error caused a run to abort: org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions(Lorg/apache/spark/streaming/dstream/DStream;Lscala/reflect/ClassTag;Lscala/reflect/ClassTag;Lscala/math/Ordering;)Lorg/apache/spark/streaming/dstream/PairDStreamFunctions; java.lang.NoSuchMethodError: org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions(Lorg/apache/spark/streaming/dstream/DStream;Lscala/reflect/ClassTag;Lscala/reflect/ClassTag;Lscala/math/Ordering;)Lorg/apache/spark/streaming/dstream/PairDStreamFunctions; In my test case I have these Spark related imports: import org.apache.spark.streaming.StreamingContext._ import org.apache.spark.streaming.TestSuiteBase import org.apache.spark.streaming.dstream.DStream import org.apache.spark.streaming.StreamingContext.toPairDStreamFunctions -Adrian
Re: Spark Team - Paco Nathan said that your team can help
Hi Nicos, Taking forward your argument,please be a smart a$$ and dont use unprofessional language just for the sake of being a moderator. Paco Nathan is respected for the dignity he carries in sharing his knowledge and making it available free for a$$es like us right! So just mind your tongue next time you put such a$$ in your mouth. Best Regards, Sudipta On Thu, Jan 22, 2015 at 10:39 PM, Nicos Kekchidis ikon...@me.com wrote: Folks, Just a gentle reminder we owe to ourselves: - this is a public forum and we need to behave accordingly, it is not place to vent frustration in rude way - getting attention here is an earned privilege and not entitlement - this is not a “Platinum Support” department of your vendor rather and open source collaboration forum where people volunteer their time to pay attention to your needs - there are still many gray areas so be patient and articulate questions in as much details as possible if you want to get quick help and not just be perceived as a smart a$$ FYI - Paco Nathan is a well respected Spark evangelist and many people, including myself, owe to his passion for jumping on Spark platform promise. People like Sean Owen keep us believing in things when we feel like hitting the dead-end. Please, be respectful of what connections you are prized with and act civilized. Have a great day! - Nicos On Jan 22, 2015, at 7:49 AM, Sean Owen so...@cloudera.com wrote: Yes, this isn't a well-formed question, and got maybe the response it deserved, but the tone is veering off the rails. I just got a much ruder reply from Sudipta privately, which I will not forward. Sudipta, I suggest you take the responses you've gotten so far as about as much answer as can be had here and do some work yourself, and come back with much more specific questions, and it will all be helpful and polite again. On Thu, Jan 22, 2015 at 2:51 PM, Sudipta Banerjee asudipta.baner...@gmail.com wrote: Hi Marco, Thanks for the confirmation. 
Please let me know what are the lot more detail you need to answer a very specific question WHAT IS THE MINIMUM HARDWARE CONFIGURATION REQUIRED TO BUILT HDFS+ MAPREDUCE+SPARK+YARN on a system? Please let me know if you need any further information and if you dont know please drive across with the $1 to Sir Paco Nathan and get me the answer. Thanks and Regards, Sudipta On Thu, Jan 22, 2015 at 5:33 PM, Marco Shaw marco.s...@gmail.com wrote: Hi, Let me reword your request so you understand how (too) generic your question is Hi, I have $10,000, please find me some means of transportation so I can get to work. Please provide (a lot) more details. If you can't, consider using one of the pre-built express VMs from either Cloudera, Hortonworks or MapR, for example. Marco On Jan 22, 2015, at 7:36 AM, Sudipta Banerjee asudipta.baner...@gmail.com wrote: Hi Apache-Spark team , What are the system requirements installing Hadoop and Apache Spark? I have attached the screen shot of Gparted. Thanks and regards, Sudipta -- Sudipta Banerjee Consultant, Business Analytics and Cloud Based Architecture Call me +919019578099 Screenshot - Wednesday 21 January 2015 - 10:55:29 IST.png - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Sudipta Banerjee Consultant, Business Analytics and Cloud Based Architecture Call me +919019578099 - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Sudipta Banerjee Consultant, Business Analytics and Cloud Based Architecture Call me +919019578099
Re: Spark Team - Paco Nathan said that your team can help
Thank you very much Marco! Really appreciate your support. On Thu, Jan 22, 2015 at 10:57 PM, Marco Shaw marco.s...@gmail.com wrote: (Starting over...) The best place to look for the requirements would be at the individual pages of each technology. As for absolute minimum requirements, I would suggest 50GB of disk space and at least 8GB of memory. This is the absolute minimum. Architecting a solution like you are looking for is very complex. If you are just looking for a proof-of-concept consider a Docker image or going to Cloudera/Hortonworks/MapR and look for their express VMs which can usually run on Oracle Virtualbox or VMware. Marco On Thu, Jan 22, 2015 at 7:36 AM, Sudipta Banerjee asudipta.baner...@gmail.com wrote: Hi Apache-Spark team , What are the system requirements installing Hadoop and Apache Spark? I have attached the screen shot of Gparted. Thanks and regards, Sudipta -- Sudipta Banerjee Consultant, Business Analytics and Cloud Based Architecture Call me +919019578099 - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Sudipta Banerjee Consultant, Business Analytics and Cloud Based Architecture Call me +919019578099
Re: Spark Team - Paco Nathan said that your team can help
Sudipta - Please don't ever come here or post here again.
Re: Installing Spark Standalone to a Cluster
You can do ./sbin/start-slave.sh --master spark://IP:PORT. I believe you're missing --master. In addition, it's a good idea to pass with --master exactly the Spark master's endpoint as shown on your UI at http://localhost:8080. But that should do it. If that's not working, you can look at the Worker log and see where it's trying to connect to and whether it's getting any errors. On Thu, Jan 22, 2015 at 12:06 PM, riginos samarasrigi...@gmail.com wrote: I have downloaded spark-1.2.0.tgz on each of my nodes and executed ./sbt/sbt assembly on each of them. So I execute ./sbin/start-master.sh on my master and ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT. Although when I go to http://localhost:8080 I cannot see any worker. Why is that? Did I do something wrong with the installation or deployment of Spark? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Installing-Spark-Standalone-to-a-Cluster-tp21319.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Would Join on PairRDD's result in co-locating data by keys?
Hi, I wanted to understand how a join on two pair RDDs works. Would it result in shuffling data with the same key from both RDDs into the same partition? If that is the case, would it be better to use the partitionBy function to partition the RDDs (by the join attribute) at creation, for less shuffling? Thanks Ankur
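For intuition, here is a toy sketch in plain Python (not Spark code) of the idea behind co-partitioning: when both datasets are hash-partitioned by key with the same partitioner, equal keys land at the same partition index, so a join can proceed partition-by-partition without moving data again. This is roughly what pre-applying partitionBy with the same partitioner to both pair RDDs buys you; all names and data below are illustrative.

```python
# Toy model of hash partitioning: equal keys always map to the same
# partition index, so the two datasets are "co-located" by key.
def partition_by(pairs, num_partitions):
    parts = [[] for _ in range(num_partitions)]
    for k, v in pairs:
        parts[hash(k) % num_partitions].append((k, v))
    return parts

left = [(1, "a"), (2, "b"), (3, "c")]
right = [(1, "x"), (2, "y")]

p = 4
left_parts = partition_by(left, p)
right_parts = partition_by(right, p)

# Join each co-located partition pair independently -- no cross-partition
# data movement is needed once both sides share the partitioner.
joined = []
for lp, rp in zip(left_parts, right_parts):
    rmap = dict(rp)
    joined.extend((k, (v, rmap[k])) for k, v in lp if k in rmap)

print(sorted(joined))  # [(1, ('a', 'x')), (2, ('b', 'y'))]
```

Without the shared partitioner, matching keys could sit in arbitrary partitions on each side, which is why a plain join must shuffle both inputs first.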
Apache Spark broadcast error: Error sending message as driverActor is null [message = UpdateBlockInfo(BlockManagerId
I'm using Apache Spark 1.1.0 and I'm currently having an issue with the broadcast method. When I call the broadcast function on a small dataset on a 5-node cluster, I experience the Error sending message as driverActor is null error after broadcasting the variables several times (the apps run under JBoss). Any help would be appreciated. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-broadcast-error-Error-sending-message-as-driverActor-is-null-message-UpdateBlockInfo-Bld-tp21320.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Spark Team - Paco Nathan said that your team can help
Sudipta, with all due respect... don't respond to me if you don't like what I say is not the same as not being a jerk about it. One earns social capital, by being respectful and by respecting the social norms during interaction; by everything I've seen, you've been demanding and disrespectful (although some of it's been in private, so I don't know what those interactions have looked like, nor do I care.) Bottom line, you could do a lot to mitigate negative responses in what has been a supportive environment, by all appearances. That's on you, not the community. On Thu, Jan 22, 2015 at 12:35 PM, Sudipta Banerjee asudipta.baner...@gmail.com wrote: Dont ever reply to my queries :D On Thu, Jan 22, 2015 at 11:02 PM, Lukas Nalezenec lukas.naleze...@firma.seznam.cz wrote: +1 On 22.1.2015 18:30, Marco Shaw wrote: Sudipta - Please don't ever come here or post here again.
Re: Spark Team - Paco Nathan said that your team can help
(Starting over...) The best place to look for the requirements would be at the individual pages of each technology. As for absolute minimum requirements, I would suggest 50GB of disk space and at least 8GB of memory. This is the absolute minimum. Architecting a solution like you are looking for is very complex. If you are just looking for a proof-of-concept consider a Docker image or going to Cloudera/Hortonworks/MapR and look for their express VMs which can usually run on Oracle Virtualbox or VMware. Marco
Re: Spark Team - Paco Nathan said that your team can help
+1 On 22.1.2015 18:30, Marco Shaw wrote: Sudipta - Please don't ever come here or post here again.
Re: Spark Team - Paco Nathan said that your team can help
Dont ever reply to my queries :D On Thu, Jan 22, 2015 at 11:02 PM, Lukas Nalezenec lukas.naleze...@firma.seznam.cz wrote: +1 On 22.1.2015 18:30, Marco Shaw wrote: Sudipta - Please don't ever come here or post here again. -- Sudipta Banerjee Consultant, Business Analytics and Cloud Based Architecture Call me +919019578099
Re: Using third party libraries in pyspark
Python couldn't find your module. Do you have that on each worker node? You will need to have that on each one --- Original Message --- From: Davies Liu dav...@databricks.com Sent: January 22, 2015 9:12 PM To: Mohit Singh mohit1...@gmail.com Cc: user@spark.apache.org Subject: Re: Using third party libraries in pyspark You need to install these libraries on all the slaves, or submit via spark-submit: spark-submit --py-files xxx On Thu, Jan 22, 2015 at 11:23 AM, Mohit Singh mohit1...@gmail.com wrote: Hi, I might be asking something very trivial, but whats the recommend way of using third party libraries. I am using tables to read hdf5 format file.. And here is the error trace: print rdd.take(2) File /tmp/spark/python/pyspark/rdd.py, line , in take res = self.context.runJob(self, takeUpToNumLeft, p, True) File /tmp/spark/python/pyspark/context.py, line 818, in runJob it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions, allowLocal) File /tmp/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /tmp/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. 
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, srv-108-23.720.rdio): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File /hadoop/disk3/mapred/local/filecache/540/spark-assembly-1.2.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/worker.py, line 90, in main command = pickleSer._read_with_length(infile) File /hadoop/disk3/mapred/local/filecache/540/spark-assembly-1.2.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py, line 151, in _read_with_length return self.loads(obj) File /hadoop/disk3/mapred/local/filecache/540/spark-assembly-1.2.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py, line 396, in loads return cPickle.loads(obj) File /hadoop/disk3/mapred/local/filecache/540/spark-assembly-1.2.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/cloudpickle.py, line 825, in subimport __import__(name) ImportError: ('No module named tables', function subimport at 0x47e1398, ('tables',)) Though, import tables works fine on the local python shell.. but seems like every thing is being pickled.. Are we expected to send all the files as helper files? that doesn't seems right? Thanks -- Mohit When you want success as badly as you want the air, then you will get it. There is no other secret of success. -Socrates - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
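For context, here is a minimal stdlib-only sketch of the mechanism the --py-files option relies on: Python can import modules directly from a zip archive once that archive is on sys.path, which is essentially what happens on each executor with the archives spark-submit ships. The module name and its contents below are made up purely for illustration.

```python
import os
import sys
import tempfile
import zipfile

# Build a tiny zip archive containing a single module
# (hypothetical name and content, for illustration only).
tmp = tempfile.mkdtemp()
zip_path = os.path.join(tmp, "deps.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("mymod.py", "ANSWER = 42\n")

# Putting the archive on sys.path makes its modules importable,
# the same way shipped --py-files archives become importable on workers.
sys.path.insert(0, zip_path)
import mymod

print(mymod.ANSWER)  # 42
```

Note this only works for pure-Python dependencies; packages with compiled extensions (like PyTables in the thread above) generally have to be installed on every worker node instead.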
Re: processing large dataset
Often when this happens to me, it is actually an exception parsing a few messages. It's easy to miss this, as error messages aren't always informative. I would be blaming Spark, when in reality it was missing fields in a CSV file. As has been said, make a file with a few records and see if your job works. On Thursday, January 22, 2015, Jörn Franke jornfra...@gmail.com wrote: Did you try it with a smaller subset of the data first? On 23 Jan 2015 at 05:54, Kane Kim kane.ist...@gmail.com wrote: I'm trying to process 5TB of data, not doing anything fancy, just map/filter and reduceByKey. I spent the whole day today trying to get it processed, but never succeeded. I've tried to deploy to EC2 with the script provided with Spark on pretty beefy machines (100 r3.2xlarge nodes). Really frustrated that Spark doesn't work out of the box for anything bigger than the word count sample. One big problem is that the defaults are not suitable for processing big datasets; the provided EC2 script could do a better job, knowing the instance type requested. Second, it takes hours to figure out what is wrong when a Spark job fails with processing almost finished. Even after raising all limits as per https://spark.apache.org/docs/latest/tuning.html it still fails (now with: error communicating with MapOutputTracker). After all, I have only one question - how to get Spark tuned up for processing terabytes of data, and is there a way to make this configuration easier and more transparent? Thanks. -- Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com