Re: The running time of spark

2014-08-23 Thread Denis RP
In fact I think it's all but impossible, but I just want some confirmation
from you. Please leave your opinion, thanks :)






Re: [PySpark][Python 2.7.8][Spark 1.0.2] count() with TypeError: an integer is required

2014-08-23 Thread Eric Friedman
Yes. And point that variable at your virtual env python. 
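
A minimal sketch of what that looks like, assuming the virtualenv lives at
the hypothetical path /opt/venvs/myapp on the driver and on every worker:

    # conf/spark-env.sh on each machine (path is hypothetical)
    export PYSPARK_PYTHON=/opt/venvs/myapp/bin/python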


Eric Friedman

 On Aug 22, 2014, at 6:08 AM, Earthson earthson...@gmail.com wrote:
 
 Do I have to deploy Python to every machine to make $PYSPARK_PYTHON work
 correctly?
 
 
 
 




Re: The running time of spark

2014-08-23 Thread Sean Owen
I think you would have to be more specific. How are you running
shortest-path? How long does it take? How long do you expect it to take,
roughly? Does the bottleneck seem to be I/O or CPU? Are you caching what
needs to be cached?

If your cluster is virtualized, and has little memory, you may be
hitting disk constantly, and also hitting the overhead of virtualized
I/O. It's unclear what your infrastructure is like.

"Too slow" is one of those how-long-is-a-piece-of-string questions.
There's no inherent reason 500GB of data can't be processed, but how fast
will depend on what you are doing.

On Fri, Aug 22, 2014 at 2:49 AM, Denis RP qq378789...@gmail.com wrote:
 Hi,

 I'm using Spark on a cluster of 8 VMs, each with two cores and 3.5GB RAM,
 but I need to run a shortest-path algorithm on 500+ GB of data (a text
 file; each line contains a node id and the nodes it points to).

 I've tested it on the cluster, but it seems extremely slow, and I haven't
 gotten any result yet.

 Is it natural to be this slow given such a cluster and data, or is
 something wrong, since the problem could be solved much more efficiently
 (say, within half an hour after reading the data)?

 Thanks!
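
For reference, a minimal sketch of turning that adjacency-list format into
GraphX edges; the path, separator, and unit edge weight are assumptions,
not Denis's actual code:

    import org.apache.spark.graphx.Edge

    // each input line: "<node id> <neighbor id> <neighbor id> ..."
    val edges = sc.textFile("hdfs://.../edges.txt").flatMap { line =>
      val ids = line.trim.split("\\s+")
      val src = ids.head.toLong
      ids.tail.map(dst => Edge(src, dst.toLong, 1.0))  // unit weight per link
    }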






Re: The running time of spark

2014-08-23 Thread Denis RP
The algorithm uses GraphX's Pregel.
It ran for more than a day and had only reached the third stage, so I
cancelled it because that much resource consumption is unacceptable.
The expected time is about ten minutes (not my expectation ...), but I
think a couple of hours would be acceptable.

The bottleneck seems to be I/O; CPU usage stays around 10%~15% most of the
time on each VM. Caching is handled by Pregel, so it should be reliable.
The storage level is MEMORY_AND_DISK_SER.
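
For context, single-source shortest paths with GraphX's Pregel looks
roughly like this. A minimal sketch along the lines of the GraphX
programming guide's example; graph: Graph[_, Double] and the sourceId are
assumptions, not the actual job:

    import org.apache.spark.graphx._

    val sourceId: VertexId = 0L                        // assumed source vertex
    // start every vertex at infinity except the source
    val init = graph.mapVertices((id, _) =>
      if (id == sourceId) 0.0 else Double.PositiveInfinity)
    val sssp = init.pregel(Double.PositiveInfinity)(
      (id, dist, newDist) => math.min(dist, newDist),  // keep the shortest distance seen
      triplet =>                                       // relax edges that shorten a path
        if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
          Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
        else Iterator.empty,
      (a, b) => math.min(a, b)                         // combine incoming distances
    )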

I think the cluster is just too small for the data to be processed
efficiently, so the ten-minute expectation is unreachable. I need either
confirmation of that, or suggestions to make the process fast enough.

Thanks!






Re: Finding previous and next element in a sorted RDD

2014-08-23 Thread Victor Tso-Guillen
Using mapPartitions, you could get the neighbors within a partition, but if
you think about it, it's much more difficult to accomplish this for the
complete dataset.
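
One whole-dataset approach is a global sort plus zipWithIndex, then a
self-join on shifted indices. A minimal sketch, assuming rdd: RDD[Double]
(the names and the join-based shift are illustrative, not code from the
thread):

    import org.apache.spark.SparkContext._   // pair-RDD functions (Spark 1.x)

    // give every element a stable global position in sorted order
    val indexed = rdd.sortBy(identity).zipWithIndex().map(_.swap)  // (index, value)
    // the value at index i is the predecessor of the value at index i + 1
    val shifted = indexed.map { case (i, v) => (i + 1, v) }
    val withPrev = indexed.leftOuterJoin(shifted)       // (index, (value, Option[prev]))
      .map { case (_, (v, prev)) => (v, prev) }         // None for the first element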


On Fri, Aug 22, 2014 at 11:24 AM, cjwang c...@cjwang.us wrote:

 It would be nice if an RDD that was massaged by OrderedRDDFunctions could
 know its neighbors.








Re: FetchFailed when collect at YARN cluster

2014-08-23 Thread Victor Tso-Guillen
I think I emailed about a similar issue, but in standalone mode. I haven't
investigated much so I don't know what's a good fix.


On Fri, Aug 22, 2014 at 12:00 PM, Jiayu Zhou dearji...@gmail.com wrote:

 Hi,

 I am having this FetchFailed issue when the driver is about to collect
 about 2.5M lines of short strings (about 10 characters each) from a YARN
 cluster with 400 nodes:

 14/08/22 11:43:27 WARN scheduler.TaskSetManager: Lost task 205.0 in stage
 0.0 (TID 1228, aaa.xxx.com): FetchFailed(BlockManagerId(220, aaa.xxx.com,
 37899, 0), shuffleId=0, mapId=420, reduceId=205)
 14/08/22 11:43:27 WARN scheduler.TaskSetManager: Lost task 603.0 in stage
 0.0 (TID 1626, aaa.xxx.com): FetchFailed(BlockManagerId(220, aaa.xxx.com,
 37899, 0), shuffleId=0, mapId=420, reduceId=603)

 Other than this FetchFailed, I am not able to see anything else in the
 log file (no OOM errors shown).

 This does not happen when there are only 2M lines. I guessed it might be
 because of the Akka message size, so I used the following:

 spark.akka.frameSize  100
 spark.akka.timeout  200

 But that does not help either. Has anyone experienced similar problems?

 Thanks,
 Jiayu
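
For reference, a minimal sketch of applying those same two properties
programmatically instead of via spark-defaults.conf (the app name is a
placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("collect-test")            // placeholder name
      .set("spark.akka.frameSize", "100")    // driver/executor message cap, MB
      .set("spark.akka.timeout", "200")      // Akka timeout, seconds
    val sc = new SparkContext(conf)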







Re: Advantage of using cache()

2014-08-23 Thread Patrick Wendell
Yep, that's correct. As an optimization we save the shuffle output and
re-use it if you execute a stage twice. So this can make A/B tests like
this a bit confusing.
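
A sketch of the effect described, assuming two pair RDDs a and b (the
names are illustrative):

    import org.apache.spark.SparkContext._   // pair-RDD functions (Spark 1.x)

    val joined = a.join(b)   // the join's shuffle output is written to disk
    joined.count()           // the first action pays the full shuffle cost
    joined.count()           // re-executing the stage reuses the saved
                             // shuffle files, even without cache()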

- Patrick

On Friday, August 22, 2014, Nieyuan qiushuiwuh...@gmail.com wrote:

 Because map-reduce tasks like join will save shuffle data to disk, the
 only difference between the caching and no-caching versions is:

     .map { case (x, (n, i)) => (x, n) }



 -
 Thanks,
 Nieyuan




Re: The running time of spark

2014-08-23 Thread Ankur Dave
At 2014-08-23 08:33:48 -0700, Denis RP qq378789...@gmail.com wrote:
 The bottleneck seems to be I/O; CPU usage stays around 10%~15% most of
 the time on each VM. Caching is handled by Pregel, so it should be
 reliable. The storage level is MEMORY_AND_DISK_SER.

I'd suggest trying the DISK_ONLY storage level and possibly increasing the 
number of partitions. I did a local test with a 2G heap, 1M vertices, 126M 
edges, and 100 edge partitions, and MEMORY_AND_DISK_SER failed with 
OutOfMemoryErrors while DISK_ONLY succeeded.
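
A sketch of wiring that in at load time, assuming a GraphX build whose
GraphLoader exposes storage-level parameters; the path and partition count
are placeholders:

    import org.apache.spark.graphx.GraphLoader
    import org.apache.spark.storage.StorageLevel

    val graph = GraphLoader.edgeListFile(sc, "hdfs://.../edges.txt",
      numEdgePartitions = 100,                      // more partitions, smaller tasks
      edgeStorageLevel = StorageLevel.DISK_ONLY,
      vertexStorageLevel = StorageLevel.DISK_ONLY)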

Ankur




Re: Finding Rank in Spark

2014-08-23 Thread Burak Yavuz
Spearman's correlation requires computing ranks for columns. You can check
out the code here and slice out the part you need!

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmanCorrelation.scala
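
The core idea there is a global sort followed by zipWithIndex, which hands
out a consistent global position even though the computation is parallel
(the pitfall athiradas hit below). A minimal sketch, assuming values:
RDD[Double] and ignoring Spearman's tie-averaging:

    val ranks = values.sortBy(identity)        // global sort
      .zipWithIndex()                          // stable global position
      .map { case (v, idx) => (v, idx + 1) }   // 1-based rank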

Best,
Burak

- Original Message -
From: athiradas athira@flutura.com
To: u...@spark.incubator.apache.org
Sent: Friday, August 22, 2014 4:14:34 AM
Subject: Re: Finding Rank in Spark

Does anyone know a way to do this?

I tried sorting it and writing an auto-increment function.

But since the computation is parallel, the result is wrong.

Is there any way? Please reply.






Re: FetchFailed when collect at YARN cluster

2014-08-23 Thread Jiayu Zhou
I saw your post. What operations were you doing? Were you trying to
collect data to the driver? Did you try the Akka configurations?






Re: FetchFailed when collect at YARN cluster

2014-08-23 Thread Victor Tso-Guillen
I did not try the Akka configs. I was doing a shuffle operation (a sort, I
believe), but with two copies of the operation running at the same time.
It was a 20M-row dataset of reasonable width.


On Sat, Aug 23, 2014 at 2:23 PM, Jiayu Zhou dearji...@gmail.com wrote:

 I saw your post. What operations were you doing? Were you trying to
 collect data to the driver? Did you try the Akka configurations?







Re: Open sourcing Spindle by Adobe Research, a web analytics processing engine in Scala, Spark, and Parquet.

2014-08-23 Thread Brandon Amos
\cc David Tompkins and Jim Donahue if they have anything to add. 
\cc My school email. Please include bamos_cmu.edu for further discussion. 

Hi Soumya,


ssimanta wrote
 The project mentions "process petabytes of data in real-time". I'm
 curious to know whether the architecture implemented in the GitHub repo
 has been used to process petabytes.
 If yes, how many nodes did you use, and did you use a Spark standalone
 cluster or YARN/Mesos?
 I'm also interested to know what issues you had with Spray and Akka at
 this scale.

Great question. I've added the following portion to the README's intro
section to make it clear that Spindle is not yet ready for processing
petabytes of data in real-time.

Also, I'd be interested in seeing how Spray/Akka at larger
scales compares to using job or resource managers.
We're currently running Spindle on the standalone cluster.

Regards,
Brandon.

---

This repo contains the Spindle implementation and benchmarking scripts
to observe Spindle's performance while exploring Spark's tuning options.
Spindle's goal is to process petabytes of data on thousands of nodes,
but the current implementation has not yet been tested at this scale.
Our current experimental results use six nodes,
each with 24 cores and 21g of Spark memory, to query 13.1GB of analytics
data.
The trends show that further Spark tuning and optimizations should
be investigated before attempting larger scale deployments.






Re: The running time of spark

2014-08-23 Thread Denis RP
Thanks for the suggestion. The program actually failed with
OutOfMemoryError: Java heap space; I tried some modifications and it got
further, but the exception might occur again anyway.

How long did your test take? I'd like to use it as a reference.


