Re: The running time of spark

2014-08-23 Thread Denis RP
In fact I think it's all but impossible, but I just want some confirmation from you; please share your opinion, thanks :) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/The-running-time-of-spark-tp12624p12691.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: [PySpark][Python 2.7.8][Spark 1.0.2] count() with TypeError: an integer is required

2014-08-23 Thread Eric Friedman
Yes. And point that variable at your virtualenv python. Eric Friedman On Aug 22, 2014, at 6:08 AM, Earthson earthson...@gmail.com wrote: Do I have to deploy Python to every machine to make $PYSPARK_PYTHON work correctly?
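A minimal sketch of the pattern Eric describes, assuming a hypothetical virtualenv path (PYSPARK_PYTHON is a real PySpark environment variable; the path and app name here are illustrative):

```python
import os

# Hypothetical virtualenv path; in practice the same path must exist on
# every worker node (or live on a shared mount), which is why Earthson's
# question about deploying Python to every machine comes up.
VENV_PYTHON = "/opt/venvs/spark/bin/python"

# PYSPARK_PYTHON must be set before the SparkContext is created, e.g. in
# conf/spark-env.sh on each node or at the top of the driver program:
os.environ["PYSPARK_PYTHON"] = VENV_PYTHON

# from pyspark import SparkContext   # workers would now launch VENV_PYTHON
# sc = SparkContext(appName="venv-demo")
```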

Re: The running time of spark

2014-08-23 Thread Sean Owen
I think you would have to be more specific. How are you running shortest-path? How long does it take? How long do you expect, roughly? Does the bottleneck seem to be I/O or CPU? Are you caching what needs to be cached? If your cluster is virtualized and has little memory, you may be hitting disk

Re: The running time of spark

2014-08-23 Thread Denis RP
The algorithm uses GraphX's Pregel. It ran for more than a day and only reached the third stage, so I cancelled it because the cost was unacceptable. The expected time is about ten minutes (not my expectation ...), but I think a couple of hours would be acceptable. The bottleneck seems to be I/O,
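GraphX's Pregel API itself is not shown here; this is a plain-Python toy (no Spark) of the superstep model it implements, applied to single-source shortest paths, to make the iteration cost concrete — each superstep is one round of messages, and GraphX materializes one such round per job stage:

```python
# Toy Pregel-style single-source shortest paths (plain Python, no Spark).
# Vertices hold a current distance; in each superstep, active vertices send
# dist + weight along out-edges and receivers keep the minimum message.
INF = float("inf")

def pregel_sssp(edges, num_vertices, source):
    dist = [INF] * num_vertices
    dist[source] = 0.0
    active = {source}
    supersteps = 0
    while active:                       # loop until no vertex changes
        messages = {}
        for u in active:
            for v, w in edges.get(u, []):
                cand = dist[u] + w
                if cand < messages.get(v, INF):
                    messages[v] = cand
        active = set()
        for v, cand in messages.items():
            if cand < dist[v]:          # improved distance -> stays active
                dist[v] = cand
                active.add(v)
        supersteps += 1
    return dist, supersteps

edges = {0: [(1, 1.0), (2, 4.0)], 1: [(2, 1.0), (3, 5.0)], 2: [(3, 1.0)]}
dist, steps = pregel_sssp(edges, 4, 0)
# dist == [0.0, 1.0, 2.0, 3.0], after 4 supersteps
```

The number of supersteps grows with the graph's diameter, which is one reason a Pregel shortest-path job on a large, poorly cached graph can run far longer than expected.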

Re: Finding previous and next element in a sorted RDD

2014-08-23 Thread Victor Tso-Guillen
Using mapPartitions, you could get the neighbors within a partition, but if you think about it, it's much more difficult to accomplish this for the complete dataset. On Fri, Aug 22, 2014 at 11:24 AM, cjwang c...@cjwang.us wrote: It would be nice if an RDD that was massaged by
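One common workaround (not from the thread itself) is a boundary exchange: collect the first element of each partition at the driver, then let each partition pair its elements locally plus one pair across the boundary. A plain-Python sketch, modelling the sorted RDD as a list of non-empty partitions:

```python
# Plain-Python sketch of the boundary-exchange idea. Within a partition,
# mapPartitions can pair each element with its successor; across partition
# boundaries we first collect each partition's head (a small driver-side
# collect) and append the next partition's head before pairing.
# Assumes every partition is non-empty.
def with_next(partitions):
    heads = [p[0] for p in partitions]          # one element per partition
    out = []
    for i, part in enumerate(partitions):
        nxt_head = heads[i + 1] if i + 1 < len(heads) else None
        extended = part + ([nxt_head] if nxt_head is not None else [])
        n_pairs = len(part) if nxt_head is not None else len(part) - 1
        out.append([(extended[j], extended[j + 1]) for j in range(n_pairs)])
    return out

parts = [[1, 3], [5, 7], [9]]
pairs = with_next(parts)
# -> [[(1, 3), (3, 5)], [(5, 7), (7, 9)], []]
```

In Spark this corresponds to one small collect of partition heads followed by a single mapPartitionsWithIndex pass, avoiding a full shuffle.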

Re: FetchFailed when collect at YARN cluster

2014-08-23 Thread Victor Tso-Guillen
I think I emailed about a similar issue, but in standalone mode. I haven't investigated much so I don't know what's a good fix. On Fri, Aug 22, 2014 at 12:00 PM, Jiayu Zhou dearji...@gmail.com wrote: Hi, I am having this FetchFailed issue when the driver is about to collect about 2.5M

Re: Advantage of using cache()

2014-08-23 Thread Patrick Wendell
Yep - that's correct. As an optimization we save the shuffle output and re-use it if you execute a stage twice. So this can make A/B tests like this a bit confusing. - Patrick On Friday, August 22, 2014, Nieyuan qiushuiwuh...@gmail.com wrote: Because map-reduce tasks like join will save
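Patrick's point is about implicit shuffle-output reuse; more generally, any reuse skews a timing comparison of cached vs. uncached runs. A toy model (plain Python, no Spark; ToyRDD is invented for illustration) of why the second action on a cached dataset is cheap:

```python
# Toy illustration: an uncached "RDD" recomputes its lineage on every
# action, while a cached one computes once and serves later actions from
# memory. The computations counter makes the difference observable.
class ToyRDD:
    def __init__(self, data, fn):
        self.data, self.fn = data, fn
        self.computations = 0
        self._cache = None
        self.cached = False

    def cache(self):
        self.cached = True
        return self

    def _materialize(self):
        if self.cached and self._cache is not None:
            return self._cache          # served from cache, no recompute
        self.computations += 1          # full lineage recomputation
        result = [self.fn(x) for x in self.data]
        if self.cached:
            self._cache = result
        return result

    def count(self):
        return len(self._materialize())

uncached = ToyRDD(range(5), lambda x: x * x)
uncached.count(); uncached.count()
# uncached.computations == 2

cached = ToyRDD(range(5), lambda x: x * x).cache()
cached.count(); cached.count()
# cached.computations == 1
```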

Re: The running time of spark

2014-08-23 Thread Ankur Dave
At 2014-08-23 08:33:48 -0700, Denis RP qq378789...@gmail.com wrote: Bottleneck seems to be I/O, the CPU usage ranges 10%~15% most time per VM. The caching is maintained by pregel, should be reliable. Storage level is MEMORY_AND_DISK_SER. I'd suggest trying the DISK_ONLY storage level and

Re: Finding Rank in Spark

2014-08-23 Thread Burak Yavuz
Spearman's correlation requires computing ranks for columns. You can check out the code here and slice out the part you need! https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmanCorrelation.scala Best, Burak - Original
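The core operation Burak points at is rank assignment. A plain-Python sketch (not MLlib's distributed implementation) of the average-rank scheme Spearman's correlation uses, where ties share the mean of their positions:

```python
# Average ("fractional") ranks, 1-based: sort the indices by value, then
# give each run of equal values the mean of the positions it occupies.
def ranks(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of ties starting at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0      # mean position, 1-based
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

r = ranks([10, 20, 20, 30])
# -> [1.0, 2.5, 2.5, 4.0]
```

Spearman's coefficient is then just Pearson's correlation computed on the two columns of ranks.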

Re: FetchFailed when collect at YARN cluster

2014-08-23 Thread Jiayu Zhou
I saw your post. What are the operations you did? Are you trying to collect data at the driver? Did you try the akka configurations?
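The thread doesn't say which Akka settings are meant; the ones below are the settings Spark 1.x documented for large driver-executor messages, with purely illustrative values:

```python
# Illustrative Spark 1.x Akka settings sometimes raised when a large
# collect() fails (property names from the Spark 1.x configuration docs;
# the values here are only examples, not recommendations).
akka_settings = {
    "spark.akka.frameSize": "128",    # max driver<->executor message size, MB
    "spark.akka.askTimeout": "300",   # seconds to wait on control-plane replies
}

# Hypothetical usage with a SparkConf:
# conf = SparkConf()
# for key, value in akka_settings.items():
#     conf.set(key, value)
```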

Re: FetchFailed when collect at YARN cluster

2014-08-23 Thread Victor Tso-Guillen
I did not try the Akka configs. I was doing a shuffle operation, I believe a sort, with two copies of the operation running at the same time. It was a 20M-row dataset of reasonable horizontal size. On Sat, Aug 23, 2014 at 2:23 PM, Jiayu Zhou dearji...@gmail.com wrote: I saw your post. What are the

Re: Open sourcing Spindle by Adobe Research, a web analytics processing engine in Scala, Spark, and Parquet.

2014-08-23 Thread Brandon Amos
\cc David Tompkins and Jim Donahue if they have anything to add. \cc my school email; please include bamos_cmu.edu for further discussion. Hi Soumya, ssimanta wrote: The project mentions - process petabytes of data in real-time. I'm curious to know if the architecture implemented in the

Re: The running time of spark

2014-08-23 Thread Denis RP
Thanks for the suggestion. The program actually failed with OutOfMemory: Java heap space; I tried some modifications and it got further, but the exception might occur again anyway. How long did your test take? I could use it as a reference.