In fact I think it's highly unlikely, but I just want some confirmation
from you. Please leave your opinion, thanks :)
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/The-running-time-of-spark-tp12624p12691.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Yes. And point that variable at your virtual env python.
Eric Friedman
On Aug 22, 2014, at 6:08 AM, Earthson earthson...@gmail.com wrote:
Do I have to deploy Python to every machine to make $PYSPARK_PYTHON work
correctly?
I think you would have to be more specific. How are you running
shortest-path? how long does it take? how long do you expect, roughly?
does the bottleneck seem to be I/O, CPU? are you caching what needs to
be cached?
If your cluster is virtualized, and has little memory, you may be
hitting disk
The algorithm uses Pregel of GraphX.
It ran for more than one day and only reached the third stage, so I
cancelled it because the time consumption was unacceptable.
The expected time is about ten minutes (not my expectation ...), but I think
a couple of hours would be acceptable.
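For readers unfamiliar with the algorithm under discussion: GraphX's Pregel-based shortest path repeatedly relaxes edges until no vertex distance improves, with each round corresponding to one Pregel superstep. The following is a minimal single-machine sketch of that loop in plain Scala (no Spark dependency); the object name, toy graph, and edge weights are illustrative, not from the original thread.

```scala
// Single-machine sketch of the Pregel-style shortest-path loop that
// GraphX runs distributed; the tiny graph here is purely illustrative.
object SsspSketch {
  type VertexId = Long
  // Directed weighted edges: (src, dst, weight)
  val edges = Seq((0L, 1L, 1.0), (1L, 2L, 2.0), (0L, 2L, 5.0))

  def shortestPaths(source: VertexId): Map[VertexId, Double] = {
    val vertices = edges.flatMap { case (s, d, _) => Seq(s, d) }.distinct
    var dist = vertices
      .map(v => v -> (if (v == source) 0.0 else Double.PositiveInfinity))
      .toMap
    var changed = true
    while (changed) {                 // one pass = one Pregel superstep
      changed = false
      for ((s, d, w) <- edges) {      // "sendMsg": try to relax each edge
        val cand = dist(s) + w
        if (cand < dist(d)) {         // "mergeMsg"/"vprog": keep the minimum
          dist = dist.updated(d, cand)
          changed = true
        }
      }
    }
    dist
  }
}
```

The number of supersteps is bounded by the graph diameter, which is one reason a Pregel job over a large, poorly-cached graph can run for many long stages.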
Bottleneck seems to be I/O,
Using mapPartitions, you could get the neighbors within a partition, but if
you think about it, it's much more difficult to accomplish this for the
complete dataset.
On Fri, Aug 22, 2014 at 11:24 AM, cjwang c...@cjwang.us wrote:
It would be nice if an RDD that was massaged by
I think I emailed about a similar issue, but in standalone mode. I haven't
investigated much so I don't know what's a good fix.
On Fri, Aug 22, 2014 at 12:00 PM, Jiayu Zhou dearji...@gmail.com wrote:
Hi,
I am having this FetchFailed issue when the driver is about to collect about 2.5M
Yep - that's correct. As an optimization we save the shuffle output and
re-use it if you execute a stage twice. So this can make A/B tests like
this a bit confusing.
- Patrick
On Friday, August 22, 2014, Nieyuan qiushuiwuh...@gmail.com wrote:
Because map-reduce tasks like join will save
At 2014-08-23 08:33:48 -0700, Denis RP qq378789...@gmail.com wrote:
Bottleneck seems to be I/O, the CPU usage ranges 10%~15% most time per VM.
The caching is maintained by pregel, should be reliable. Storage level is
MEMORY_AND_DISK_SER.
I'd suggest trying the DISK_ONLY storage level and
Spearman's Correlation requires calculating ranks for columns. You can
check out the code here and slice out the part you need:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmanCorrelation.scala
Best,
Burak
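To make the rank-based approach concrete, here is a plain-Scala sketch of what the linked SpearmanCorrelation does conceptually: rank each column (averaging ranks for ties), then take the Pearson correlation of the ranks. MLlib performs the ranking with a distributed sort; this local version, with names of my own choosing, is only for illustration.

```scala
// Local sketch of Spearman correlation: rank both columns, then compute
// Pearson correlation on the ranks. Not MLlib code; illustrative only.
object SpearmanSketch {
  // 1-based ranks, averaging the rank over groups of tied values.
  def ranks(xs: Seq[Double]): Seq[Double] = {
    val sorted = xs.zipWithIndex.sortBy(_._1)
    val r = new Array[Double](xs.length)
    var i = 0
    while (i < sorted.length) {
      var j = i
      while (j + 1 < sorted.length && sorted(j + 1)._1 == sorted(i)._1) j += 1
      val avg = (i + j + 2) / 2.0       // average of ranks i+1 .. j+1
      for (k <- i to j) r(sorted(k)._2) = avg
      i = j + 1
    }
    r.toSeq
  }

  def pearson(x: Seq[Double], y: Seq[Double]): Double = {
    val n = x.length
    val mx = x.sum / n
    val my = y.sum / n
    val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
    val sx = math.sqrt(x.map(a => (a - mx) * (a - mx)).sum)
    val sy = math.sqrt(y.map(b => (b - my) * (b - my)).sum)
    cov / (sx * sy)
  }

  def spearman(x: Seq[Double], y: Seq[Double]): Double =
    pearson(ranks(x), ranks(y))
}
```

Any monotonically increasing relationship yields a coefficient of 1.0, which is exactly what distinguishes Spearman from Pearson.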
I saw your post. What are the operations you did? Are you trying to collect
data from driver? Did you try the akka configurations?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/FetchFailed-when-collect-at-YARN-cluster-tp12670p12703.html
Sent from the
I did not try the Akka configs. I was doing a shuffle operation, a sort I
believe, with two copies of the operation running at the same time. It was
a 20M-row dataset of reasonable horizontal size.
On Sat, Aug 23, 2014 at 2:23 PM, Jiayu Zhou dearji...@gmail.com wrote:
I saw your post. What are the
/cc David Tompkins and Jim Donahue if they have anything to add.
/cc My school email. Please include bamos_cmu.edu for further discussion.
Hi Soumya,
ssimanta wrote
The project mentions "process petabytes of data in real-time". I'm
curious to know if the architecture implemented in the
Thanks for the suggestion. The program actually failed because of
OutOfMemoryError: Java heap space, and after some modifications it got
further, but the exception might occur again anyway.
How long did your test take? I can use it as a reference.