Re: The running time of spark
In fact I think it's highly unlikely, but I just want some confirmation from you. Please leave your opinion, thanks :)
Re: [PySpark][Python 2.7.8][Spark 1.0.2] count() with TypeError: an integer is required
Yes. And point that variable at your virtualenv Python.

Eric Friedman

On Aug 22, 2014, at 6:08 AM, Earthson earthson...@gmail.com wrote:
> Do I have to deploy Python to every machine to make $PYSPARK_PYTHON work correctly?
Re: The running time of spark
I think you would have to be more specific. How are you running shortest-path? How long does it take? How long do you expect it to take, roughly? Does the bottleneck seem to be I/O or CPU? Are you caching what needs to be cached?

If your cluster is virtualized and has little memory, you may be hitting disk constantly, and also hitting the overhead of virtualized I/O. It's unclear what your infrastructure is like. "Too slow" is one of those how-long-is-a-piece-of-string questions. There's no inherent reason 500GB of data can't be processed, but how fast will depend on what you are doing.

On Fri, Aug 22, 2014 at 2:49 AM, Denis RP qq378789...@gmail.com wrote:
> Hi, I'm using Spark on a cluster of 8 VMs, each with two cores and 3.5GB RAM, but I need to run a shortest-path algorithm on 500+GB of data (a text file in which each line contains a node id and the nodes it points to).
> I've tested it on the cluster, but the speed seems to be extremely slow, and I haven't gotten any result yet. Is it natural to be this slow given such a cluster and data, or is something wrong, since the problem could be solved much more efficiently (say, half an hour after reading the data)?
> Thanks!
Re: The running time of spark
The algorithm uses GraphX's Pregel API. It ran for more than one day and only reached the third stage, and I cancelled it because the time consumption was unacceptable. The expected time is about ten minutes (not expected by me ...), but I think a couple of hours would be acceptable.

The bottleneck seems to be I/O; CPU usage stays around 10%~15% most of the time on each VM. The caching is maintained by Pregel, so it should be reliable. The storage level is MEMORY_AND_DISK_SER.

I think the cluster is just too small for the data to be processed efficiently, so the ten-minute expectation is unreachable. So, I need some confirmation, or suggestions to make the process fast enough. Thanks!
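For readers following the thread, single-source shortest paths with GraphX's Pregel API looks roughly like the sketch below, adapted from the standard GraphX example; the input path and source vertex id are placeholders, not details from the thread.

    import org.apache.spark.SparkContext
    import org.apache.spark.graphx._

    // Sketch: single-source shortest paths via GraphX's Pregel API,
    // following the canonical GraphX example. The path and source id
    // are placeholders.
    def shortestPaths(sc: SparkContext): Unit = {
      val graph = GraphLoader.edgeListFile(sc, "hdfs:///path/to/edges")
      val sourceId: VertexId = 42L

      // Every vertex starts at "infinity" except the source.
      val initialGraph = graph.mapVertices((id, _) =>
        if (id == sourceId) 0.0 else Double.PositiveInfinity)

      val sssp = initialGraph.pregel(Double.PositiveInfinity)(
        (id, dist, newDist) => math.min(dist, newDist),   // vertex program
        triplet => {                                      // send messages
          if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
            Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
          else
            Iterator.empty
        },
        (a, b) => math.min(a, b))                         // merge messages

      sssp.vertices.take(10).foreach(println)
    }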
Re: Finding previous and next element in a sorted RDD
Using mapPartitions, you could get the neighbors within a partition, but if you think about it, it's much more difficult to accomplish this for the complete dataset.

On Fri, Aug 22, 2014 at 11:24 AM, cjwang c...@cjwang.us wrote:
> It would be nice if an RDD that was massaged by OrderedRDDFunctions could know its neighbors.
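To make the partition-local version concrete, here is a minimal sketch (an illustration, not code from the thread) that pairs each element with its previous and next neighbors inside a partition; elements at partition boundaries see no neighbor, which is exactly the limitation described above.

    import org.apache.spark.rdd.RDD

    // Sketch: emit (previous, current, next) triples within each partition.
    // Boundary elements of a partition are silently dropped, so this does
    // not solve the problem across the complete dataset.
    def neighborsWithinPartitions(data: RDD[Int]): RDD[(Int, Int, Int)] =
      data.sortBy(identity).mapPartitions { iter =>
        iter.sliding(3).collect { case Seq(prev, cur, next) => (prev, cur, next) }
      }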
Re: FetchFailed when collect at YARN cluster
I think I emailed about a similar issue, but in standalone mode. I haven't investigated much, so I don't know what a good fix is.

On Fri, Aug 22, 2014 at 12:00 PM, Jiayu Zhou dearji...@gmail.com wrote:
> Hi, I am having this FetchFailed issue when the driver is about to collect about 2.5M lines of short strings (about 10 characters each) from a YARN cluster with 400 nodes:
>
> 14/08/22 11:43:27 WARN scheduler.TaskSetManager: Lost task 205.0 in stage 0.0 (TID 1228, aaa.xxx.com): FetchFailed(BlockManagerId(220, aaa.xxx.com, 37899, 0), shuffleId=0, mapId=420, reduceId=205)
> 14/08/22 11:43:27 WARN scheduler.TaskSetManager: Lost task 603.0 in stage 0.0 (TID 1626, aaa.xxx.com): FetchFailed(BlockManagerId(220, aaa.xxx.com, 37899, 0), shuffleId=0, mapId=420, reduceId=603)
>
> Other than this FetchFailed, I am not able to see anything else in the log file (no OOM errors shown). This does not happen when there are only 2M lines. I guessed it might be because of the Akka message size, so I used the following:
>
> spark.akka.frameSize 100
> spark.akka.timeout 200
>
> but that did not help either. Has anyone experienced similar problems? Thanks, Jiayu
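For reference, the two Akka settings tried above can also be set programmatically on the SparkConf rather than in spark-defaults.conf; a minimal sketch with the same values (the values come from the message above and are not recommendations):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: the Akka settings from the message above. In Spark 1.x,
    // spark.akka.frameSize is the maximum message size in MB and
    // spark.akka.timeout is in seconds.
    val conf = new SparkConf()
      .setAppName("collect-large-result")
      .set("spark.akka.frameSize", "100")
      .set("spark.akka.timeout", "200")
    val sc = new SparkContext(conf)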
Re: Advantage of using cache()
Yep - that's correct. As an optimization, we save the shuffle output and re-use it if you execute a stage twice. So this can make A/B tests like this a bit confusing.

- Patrick

On Friday, August 22, 2014, Nieyuan qiushuiwuh...@gmail.com wrote:
> Because map-reduce tasks like join will save shuffle data to disk, the only difference between the caching and no-caching versions is:
>
> .map { case (x, (n, i)) => (x, n) }
>
> Thanks, Nieyuan
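A small sketch of the comparison being discussed (an illustration, not code from the thread): even without cache(), a second action on a post-shuffle RDD can be fast, because Spark re-reads the shuffle files saved by the first action.

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: two actions on a shuffled RDD. The second count() can reuse
    // the shuffle files written by the first even without cache(), which is
    // why cached and uncached timings can look surprisingly similar.
    val sc = new SparkContext(new SparkConf().setAppName("cache-demo"))
    val left  = sc.parallelize(1 to 1000000).map(x => (x % 1000, x))
    val right = sc.parallelize(1 to 1000000).map(x => (x % 1000, x.toString))

    val joined = left.join(right).map { case (k, (n, _)) => (k, n) }

    joined.count() // triggers the shuffle; shuffle output lands on disk
    joined.count() // reuses the saved shuffle output

    joined.cache() // explicit caching stores the mapped result itself
    joined.count()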
Re: The running time of spark
At 2014-08-23 08:33:48 -0700, Denis RP qq378789...@gmail.com wrote:
> Bottleneck seems to be I/O, the CPU usage ranges 10%~15% most time per VM. The caching is maintained by pregel, should be reliable. Storage level is MEMORY_AND_DISK_SER.

I'd suggest trying the DISK_ONLY storage level and possibly increasing the number of partitions. I did a local test with a 2G heap, 1M vertices, 126M edges, and 100 edge partitions, and MEMORY_AND_DISK_SER failed with OutOfMemoryErrors while DISK_ONLY succeeded.

Ankur
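Expressed in code, that suggestion amounts to something like the sketch below; the path and partition count are placeholders, and the two storage-level parameters assume a GraphX version whose GraphLoader.edgeListFile exposes them (they are not present in all releases).

    import org.apache.spark.SparkContext
    import org.apache.spark.graphx.GraphLoader
    import org.apache.spark.storage.StorageLevel

    // Sketch: load the edge list with many partitions and keep the graph's
    // RDDs on disk only, as suggested above.
    def loadForLowMemory(sc: SparkContext) =
      GraphLoader.edgeListFile(
        sc,
        "hdfs:///path/to/edges",
        false,                    // canonicalOrientation
        100,                      // number of edge partitions (placeholder)
        StorageLevel.DISK_ONLY,   // edge storage level
        StorageLevel.DISK_ONLY)   // vertex storage level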
Re: Finding Rank in Spark
Spearman's correlation requires the calculation of ranks for columns. You can check out the code here and slice out the part you need!
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmanCorrelation.scala

Best, Burak

- Original Message -
From: athiradas athira@flutura.com
To: u...@spark.incubator.apache.org
Sent: Friday, August 22, 2014 4:14:34 AM
Subject: Re: Finding Rank in Spark

Does anyone know a way to do this? I tried sorting it and writing an auto-increment function, but since it's parallel computing, the result is wrong. Is there any way? Please reply.
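As a concrete alternative to a hand-rolled auto-increment counter (which breaks under parallel execution), the idea used in SpearmanCorrelation can be sketched with sortBy plus zipWithIndex, which assigns consecutive global indices without any shared counter (a minimal illustration; ties are not averaged here, unlike true Spearman ranks):

    import org.apache.spark.rdd.RDD

    // Sketch: global rank of each value. zipWithIndex derives each
    // element's index from its partition's offset, so no shared mutable
    // counter is needed across tasks.
    def ranks(values: RDD[Double]): RDD[(Double, Long)] =
      values.sortBy(identity).zipWithIndex()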
Re: FetchFailed when collect at YARN cluster
I saw your post. What operations did you do? Are you trying to collect data to the driver? Did you try the Akka configurations?
Re: FetchFailed when collect at YARN cluster
I did not try the Akka configs. I was doing a shuffle operation, I believe a sort, but with two copies of the operation running at the same time. It was a 20M-row dataset of reasonable horizontal size.

On Sat, Aug 23, 2014 at 2:23 PM, Jiayu Zhou dearji...@gmail.com wrote:
> I saw your post. What operations did you do? Are you trying to collect data to the driver? Did you try the Akka configurations?
Re: Open sourcing Spindle by Adobe Research, a web analytics processing engine in Scala, Spark, and Parquet.
cc: David Tompkins and Jim Donahue, in case they have anything to add. cc: my school email; please include bamos_cmu.edu for further discussion.

Hi Soumya,

ssimanta wrote:
> The project mentions - process petabytes of data in real-time. I'm curious to know if the architecture implemented in the GitHub repo was used to process petabytes? If yes, how many nodes did you use, and did you use a Spark standalone cluster or YARN/Mesos? I'm also interested to know what issues you had with Spray and Akka working at this scale.

Great question. I've added the following portion to the README's intro section to make it clear that Spindle is not yet ready for processing petabytes of data in real time. Also, I'd be interested in seeing how Spray/Akka at larger scales compares to using job or resource managers. We're currently running Spindle on the standalone cluster.

Regards, Brandon.

---
This repo contains the Spindle implementation and benchmarking scripts to observe Spindle's performance while exploring Spark's tuning options. Spindle's goal is to process petabytes of data on thousands of nodes, but the current implementation has not yet been tested at this scale. Our current experimental results use six nodes, each with 24 cores and 21g of Spark memory, to query 13.1GB of analytics data. The trends show that further Spark tuning and optimizations should be investigated before attempting larger-scale deployments.
Re: The running time of spark
Thanks for the suggestion. The program actually failed with OutOfMemoryError: Java heap space; I tried some modifications and it got further, but the exception might occur again anyway. How long did your test take? I could use it as a reference.