Anybody? An example of how to deserialize FlumeEvent data using Scala?
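A minimal sketch of one way to do it, assuming the stream was created with FlumeUtils.createStream and the event bodies are UTF-8 text (host, port and app name below are placeholders; swap in your own decoder if the payload is binary):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

val conf = new SparkConf().setAppName("FlumeEventDecoder")
val ssc = new StreamingContext(conf, Seconds(10))

val flumeStream = FlumeUtils.createStream(ssc, "localhost", 9999)

// Each SparkFlumeEvent wraps an AvroFlumeEvent; getBody returns a java.nio.ByteBuffer.
val bodies = flumeStream.map { sparkEvent =>
  val buf = sparkEvent.event.getBody
  val bytes = new Array[Byte](buf.remaining())
  buf.get(bytes)
  new String(bytes, "UTF-8")   // decode the body as text
}
bodies.print()

ssc.start()
ssc.awaitTermination()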
--
Hi,
I was going through the SparkPageRank code and want to see the intermediate
steps, i.e. the RDDs formed along the way.
Here is a part of the code along with the lines that I added in order to
print the RDDs.
I want to print the "*parts*" in the code (denoted by the comment in bold
letters).
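The code itself didn't make it into this digest, but here is a rough sketch of how an intermediate RDD in the standard SparkPageRank example can be inspected; the input path and variable names are only illustrative:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

val sc = new SparkContext(new SparkConf().setAppName("pagerank-debug"))
val lines = sc.textFile("pagerank_data.txt")   // placeholder input path

val links = lines.map { line =>
  val parts = line.split("\\s+")     // "parts" is a local Array inside the closure,
  // println(parts.mkString(","))    // printing here goes to the executor's stdout log
  (parts(0), parts(1))
}.distinct().groupByKey().cache()

// To inspect the resulting RDD itself, pull a small sample back to the driver:
links.take(10).foreach(println)
// links.collect().foreach(println)  // only for small inputs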
Thanks for the suggestion. The program actually failed because of
OutOfMemory: Java heap space; I tried some modifications and it got
further, but the exception might still occur again.
How long did your test take? I'd like to use it as a reference.
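In case it helps others hitting the same heap-space error, a hedged sketch of the memory-related settings that are usually the first things to try (the values are only illustrative, not from this thread):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("PageRankWithMoreHeap")
  .set("spark.executor.memory", "4g")           // heap available to each executor
  .set("spark.storage.memoryFraction", "0.4")   // Spark 1.x knob: leave more heap for user objects and shuffle
val sc = new SparkContext(conf)

If the failure happens on the driver (e.g. during a collect), driver memory normally has to be set before the JVM starts, e.g. spark-submit --driver-memory 4g.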
--
\cc David Tompkins and Jim Donahue if they have anything to add.
\cc My school email. Please include bamos_cmu.edu for further discussion.
Hi Soumya,
ssimanta wrote
> The project mentions - "process petabytes of data in real-time". I'm
> curious to know if the architecture implemented in the G
I did not try the Akka configs. I was doing a shuffle operation, I believe
a sort, with two copies of the operation running at the same time. It was a
20M-row dataset of reasonable horizontal size.
On Sat, Aug 23, 2014 at 2:23 PM, Jiayu Zhou wrote:
> I saw your post. What are the operations you did? Are you trying to collect
> data from driver? Did you try the akka configurations?
I saw your post. What are the operations you did? Are you trying to collect
data from driver? Did you try the akka configurations?
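For anyone wondering which Akka settings are usually meant here, a hedged sketch using the Spark 1.x property names (the numbers are only illustrative):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("LargeCollect")
  .set("spark.akka.frameSize", "128")   // max Akka message size in MB; large task results can exceed the default
  .set("spark.akka.timeout", "300")     // seconds; raise if executors get flagged lost under heavy GC/shuffle load
// Driver heap for a big collect() is set outside the app, e.g.:
//   spark-submit --driver-memory 8g ...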
--
Spearman's correlation requires the calculation of ranks for the columns. You can
check out the code here and slice out the part you need!
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmanCorrelation.scala
Best,
Burak
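If copying the class turns out not to be necessary: MLlib also exposes Spearman directly through Statistics.corr (I believe this landed for the 1.1 release). A minimal sketch with made-up data:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

val sc = new SparkContext(new SparkConf().setAppName("spearman-demo").setMaster("local[2]"))

val x: RDD[Double] = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0, 5.0))
val y: RDD[Double] = sc.parallelize(Seq(5.0, 6.0, 4.0, 8.0, 7.0))

// Spearman's rho is Pearson's correlation computed on the per-column ranks,
// which is what SpearmanCorrelation.scala does internally.
val rho = Statistics.corr(x, y, "spearman")
println(s"Spearman's rho = $rho")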
At 2014-08-23 08:33:48 -0700, Denis RP wrote:
> Bottleneck seems to be I/O, the CPU usage ranges 10%~15% most time per VM.
> The caching is maintained by pregel, should be reliable. Storage level is
> MEMORY_AND_DISK_SER.
I'd suggest trying the DISK_ONLY storage level and possibly increasing the
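For completeness, a self-contained sketch of what switching to DISK_ONLY looks like; the RDD here is only a stand-in for whatever Pregel/GraphX data is being cached:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("disk-only-demo"))
val data = sc.parallelize(1 to 1000000).map(i => (i % 100, i.toString * 10))

// DISK_ONLY keeps no partitions in memory, trading I/O and serialization cost for heap headroom.
data.persist(StorageLevel.DISK_ONLY)
println(data.count())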
Yep - that's correct. As an optimization we save the shuffle output and
re-use it if you execute a stage twice. So this can make A:B tests like
this a bit confusing.
- Patrick
On Friday, August 22, 2014, Nieyuan wrote:
> Because map-reduce tasks like join will save shuffle data to disk . So the
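A hedged illustration of the behaviour Patrick describes: the second action over the same shuffled RDD skips the map stage and reuses the shuffle files, which is what can skew back-to-back timing comparisons.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

val sc = new SparkContext(new SparkConf().setAppName("shuffle-reuse-demo"))
val pairs = sc.parallelize(1 to 1000000).map(i => (i % 1000, 1L))

val counts = pairs.reduceByKey(_ + _)   // introduces a shuffle
counts.count()                          // first action: map-side shuffle output is written to disk
counts.count()                          // second action: the map stage is skipped, shuffle output is reused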
Hi Spark Experts,
I am curious what people are using to benchmark their Spark clusters. We
are about to start a build (bare metal) vs buy (AWS/Google Cloud/Qubole)
project to determine our Hadoop and Spark deployment selection. On the
Hadoop side we will test live workloads as well as simulated
I think I emailed about a similar issue, but in standalone mode. I haven't
investigated much so I don't know what's a good fix.
On Fri, Aug 22, 2014 at 12:00 PM, Jiayu Zhou wrote:
> Hi,
>
> I am having this FetchFailed issue when the driver is about to collect
> about
> 2.5M lines of short strings
Using mapPartitions, you could get the neighbors within a partition, but if
you think about it, it's much more difficult to accomplish this for the
complete dataset.
On Fri, Aug 22, 2014 at 11:24 AM, cjwang wrote:
> It would be nice if an RDD that was massaged by OrderedRDDFunction could
> know
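A rough sketch of the within-partition version being described: pair each element with its successor inside the same partition. Elements that straddle a partition boundary are simply missed here, which is exactly the hard part for the complete dataset.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

val sc = new SparkContext(new SparkConf().setAppName("neighbors-demo"))
val sorted = sc.parallelize(1 to 100).map(i => (i, s"value-$i")).sortByKey()

// Pair each element with its next neighbour, but only within a partition.
val neighborPairs = sorted.mapPartitions { iter =>
  iter.sliding(2).collect { case Seq(a, b) => (a, b) }
}
neighborPairs.take(5).foreach(println)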
The algorithm uses Pregel of GraphX.
It ran for more than one day and only reached the third stage, so I
cancelled it because the time consumption was unacceptable.
The expected time is about ten minutes (not my expectation ...), but I think
a couple of hours would be acceptable.
Bottleneck seems to be I/O; the CPU usage ranges 10%~15% most of the time
per VM. The caching is maintained by Pregel, so it should be reliable.
Storage level is MEMORY_AND_DISK_SER.
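The thread doesn't show the actual job, but for context, a hedged sketch of what a single-source shortest-path computation with GraphX's Pregel API typically looks like; the graph, edge weights and source vertex below are all made up:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.graphx.util.GraphGenerators

val sc = new SparkContext(new SparkConf().setAppName("sssp-demo"))

// A small random graph with Double edge weights, standing in for the real dataset.
val graph: Graph[Long, Double] =
  GraphGenerators.logNormalGraph(sc, numVertices = 100).mapEdges(e => e.attr.toDouble)
val sourceId: VertexId = 42L

val initialGraph = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),              // vertex program: keep the shorter distance
  triplet =>                                                   // send messages along cheaper paths
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    else
      Iterator.empty,
  (a, b) => math.min(a, b)                                     // merge messages
)
println(sssp.vertices.take(10).mkString("\n"))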
I think you would have to be more specific. How are you running
shortest-path? how long does it take? how long do you expect, roughly?
does the bottleneck seem to be I/O, CPU? are you caching what needs to
be cached?
If your cluster is virtualized, and has little memory, you may be
hitting disk co
Yes. And point that variable at your virtual env python.
Eric Friedman
> On Aug 22, 2014, at 6:08 AM, Earthson wrote:
>
> Do I have to deploy Python to every machine to make "$PYSPARK_PYTHON" work
> correctly?
>
>
>
> --
In fact I think it's nearly impossible, but I just want some confirmation
from you; please leave your opinion, thanks :)
--