Re: Spark Streaming with Flume event

2014-08-23 Thread Spidy
Anybody? An example of how to deserialize FlumeEvent data using Scala. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-with-Flume-event-tp12569p12709.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Printing the RDDs in SparkPageRank

2014-08-23 Thread Deep Pradhan
Hi, I was going through the SparkPageRank code and want to inspect the RDDs formed at the intermediate steps. Here is a part of the code along with the lines that I added in order to print the RDDs. I want to print the "*parts*" in the code (denoted by the comment in Bold l
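The SparkPageRank snippet itself is cut off above, but the debugging idea can be illustrated without a cluster. Below is a plain-Python analogue of the PageRank loop that prints the intermediate "parts"; in the real Spark job you would instead call `parts.collect()` (or `parts.foreach(println)`) on the intermediate RDD. The link data and variable names here are made up for illustration, not taken from SparkPageRank.

```python
# Plain-Python analogue of the PageRank loop; ordinary lists stand in for RDDs.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {page: 1.0 for page in links}

for iteration in range(10):
    # "parts": each page's rank split evenly among its outgoing links
    parts = []
    for page, neighbors in links.items():
        share = ranks[page] / len(neighbors)
        for dest in neighbors:
            parts.append((dest, share))
    print(f"iteration {iteration}: parts = {parts}")  # the intermediate step

    # reduceByKey equivalent: sum the contributions arriving at each page
    contribs = {}
    for dest, share in parts:
        contribs[dest] = contribs.get(dest, 0.0) + share
    ranks = {page: 0.15 + 0.85 * contribs.get(page, 0.0) for page in links}
```

Note that `collect()`-style printing pulls the whole dataset to one place; on a real RDD of any size you would print only a sample, e.g. `parts.take(10)`.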

Re: The running time of spark

2014-08-23 Thread Denis RP
Thanks for the suggestion. The program actually failed with OutOfMemory: Java heap space; after some modifications it went further, but the exception might occur again anyway. How long did your test take? I can use it for reference.
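A heap-space OutOfMemory in this setting is usually attacked first by giving the driver and executors more memory at submit time. A hedged sketch of a Spark 1.x `spark-submit` invocation follows; the class name, jar, and sizes are illustrative, not taken from this thread.

```shell
# All names and sizes below are illustrative; tune to your cluster.
# --driver-memory / --executor-memory and spark.storage.memoryFraction
# are standard Spark 1.x knobs.
spark-submit \
  --class org.example.ShortestPath \
  --driver-memory 4g \
  --executor-memory 6g \
  --conf spark.storage.memoryFraction=0.4 \
  app.jar
```

Lowering `spark.storage.memoryFraction` leaves more of the heap for computation at the cost of cache capacity, which pairs naturally with the `DISK_ONLY` suggestion elsewhere in this thread.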

Re: Open sourcing Spindle by Adobe Research, a web analytics processing engine in Scala, Spark, and Parquet.

2014-08-23 Thread Brandon Amos
\cc David Tompkins and Jim Donahue if they have anything to add. \cc My school email. Please include bamos_cmu.edu for further discussion. Hi Soumya, ssimanta wrote > The project mentions - "process petabytes of data in real-time". I'm > curious to know if the architecture implemented in the G

Re: FetchFailed when collect at YARN cluster

2014-08-23 Thread Victor Tso-Guillen
I did not try the Akka configs. I was doing a shuffle operation, I believe a sort, with two copies of the operation running at the same time. It was a 20M-row dataset of reasonable horizontal size. On Sat, Aug 23, 2014 at 2:23 PM, Jiayu Zhou wrote: > I saw your post. What are the operations you did? Are

Re: FetchFailed when collect at YARN cluster

2014-08-23 Thread Jiayu Zhou
I saw your post. What operations did you run? Are you trying to collect data at the driver? Did you try the Akka configurations? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/FetchFailed-when-collect-at-YARN-cluster-tp12670p12703.html
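For context on "the Akka configurations": in Spark 1.x, results collected back to the driver travel over Akka and can exceed the default frame size, which shows up as fetch/collect failures. A hedged sketch of the relevant settings (values are illustrative; `spark.akka.frameSize` is in MB and defaulted to 10 in this era, `spark.akka.timeout` is in seconds):

```shell
# Illustrative values; spark.akka.frameSize (MB) and spark.akka.timeout (s)
# are Spark 1.x settings often raised when collecting large results.
spark-submit \
  --conf spark.akka.frameSize=128 \
  --conf spark.akka.timeout=300 \
  app.jar
```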

Re: Finding Rank in Spark

2014-08-23 Thread Burak Yavuz
Spearman's correlation requires computing ranks for columns. You can check out the code here and slice out the part you need! https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmanCorrelation.scala Best, Burak - Original Message
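To make the idea concrete, here is a minimal plain-Python sketch of the rank computation behind Spearman's correlation (averaging ranks over ties, then taking the Pearson correlation of the ranks). MLlib's `SpearmanCorrelation` does the same thing distributedly over RDD columns; this local version is for illustration only.

```python
def ranks(xs):
    """Return 1-based ranks of xs, averaging the ranks of tied values."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        # find the run of ties starting at sorted position i
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Pearson correlation of the rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```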

Re: The running time of spark

2014-08-23 Thread Ankur Dave
At 2014-08-23 08:33:48 -0700, Denis RP wrote: > Bottleneck seems to be I/O, the CPU usage ranges 10%~15% most time per VM. > The caching is maintained by pregel, should be reliable. Storage level is > MEMORY_AND_DISK_SER. I'd suggest trying the DISK_ONLY storage level and possibly increasing the

Re: Advantage of using cache()

2014-08-23 Thread Patrick Wendell
Yep - that's correct. As an optimization we save the shuffle output and re-use it if you execute a stage twice. So this can make A/B tests like this a bit confusing. - Patrick On Friday, August 22, 2014, Nieyuan wrote: > Because map-reduce tasks like join will save shuffle data to disk. So the
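A plain-Python analogue of Patrick's point: if intermediate results are kept around (as Spark keeps shuffle output), a second run of the "same stage" skips the expensive work, so back-to-back timing comparisons of cached vs. uncached runs are misleading. Everything below is a local illustration, not Spark API.

```python
# Tiny cache standing in for saved shuffle output; calls["count"] tracks
# how many times the expensive work actually executes.
calls = {"count": 0}
cache = {}

def expensive_stage(key):
    if key in cache:                  # reuse of saved output
        return cache[key]
    calls["count"] += 1               # only real work increments this
    result = sum(i * i for i in range(10_000))
    cache[key] = result
    return result

first = expensive_stage("stage-1")
second = expensive_stage("stage-1")   # reused: no recomputation on this run
```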

Spark Cluster Benchmarking Frameworks

2014-08-23 Thread Jonathan Hodges
Hi Spark Experts, I am curious what people are using to benchmark their Spark clusters. We are about to start a build (bare metal) vs buy (AWS/Google Cloud/Qubole) project to determine our Hadoop and Spark deployment selection. On the Hadoop side we will test live workloads as well as simulated

Re: FetchFailed when collect at YARN cluster

2014-08-23 Thread Victor Tso-Guillen
I think I emailed about a similar issue, but in standalone mode. I haven't investigated much so I don't know what's a good fix. On Fri, Aug 22, 2014 at 12:00 PM, Jiayu Zhou wrote: > Hi, > > I am having this FetchFailed issue when the driver is about to collect > about > 2.5M lines of short stri

Re: Finding previous and next element in a sorted RDD

2014-08-23 Thread Victor Tso-Guillen
Using mapPartitions, you could get the neighbors within a partition, but if you think about it, it's much more difficult to accomplish this for the complete dataset. On Fri, Aug 22, 2014 at 11:24 AM, cjwang wrote: > It would be nice if an RDD that was massaged by OrderedRDDFunction could > know
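Victor's point can be sketched locally: within one partition, pairing each element with its neighbors is easy, but partition boundaries leave gaps that need extra work (e.g. shipping the first/last element of each partition around in a second pass). The names and data below are illustrative; plain lists stand in for the partitions of a sorted RDD.

```python
def neighbors_in_partition(part):
    """Yield (prev, cur, next) triples for one partition; None at the edges."""
    out = []
    for i, cur in enumerate(part):
        prev = part[i - 1] if i > 0 else None
        nxt = part[i + 1] if i + 1 < len(part) else None
        out.append((prev, cur, nxt))
    return out

# Two "partitions" of a sorted dataset (stand-ins for RDD partitions).
partitions = [[1, 3, 5], [7, 9]]
per_part = [neighbors_in_partition(p) for p in partitions]
# Note the Nones at the partition boundary: 5 has no next and 7 has no prev,
# even though globally 5 and 7 are neighbors.
```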

Re: The running time of spark

2014-08-23 Thread Denis RP
The algorithm uses GraphX's Pregel. It ran for more than one day and only reached the third stage, so I cancelled it because the cost is unacceptable. The expected time is about ten minutes (not expected by me ...), but I think a couple of hours would be acceptable. Bottleneck seems to be I/O,

Re: The running time of spark

2014-08-23 Thread Sean Owen
I think you would have to be more specific. How are you running shortest-path? How long does it take? How long do you expect, roughly? Does the bottleneck seem to be I/O or CPU? Are you caching what needs to be cached? If your cluster is virtualized and has little memory, you may be hitting disk co

Re: [PySpark][Python 2.7.8][Spark 1.0.2] count() with TypeError: an integer is required

2014-08-23 Thread Eric Friedman
Yes. And point that variable at your virtual env python. Eric Friedman > On Aug 22, 2014, at 6:08 AM, Earthson wrote: > > Do I have to deploy Python to every machine to make "$PYSPARK_PYTHON" work > correctly? > > > > -- > View this message in context: > http://apache-spark-user-list.
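To make Eric's answer concrete: `PYSPARK_PYTHON` must name an interpreter path that exists on every worker node, so the virtualenv has to be deployed (or mounted) identically everywhere. The path below is illustrative, not from this thread.

```shell
# Point PySpark at a virtualenv interpreter present at the same path on
# every worker node (path below is illustrative).
export PYSPARK_PYTHON=/opt/venvs/spark/bin/python
# then launch as usual, e.g.:
#   ./bin/pyspark
```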

Re: The running time of spark

2014-08-23 Thread Denis RP
In fact I think it's nearly impossible, but I just want some confirmation from you; please leave your opinion, thanks :) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/The-running-time-of-spark-tp12624p12691.html