I bring this up because the performance we are seeing is dreadful. From
CPU usage, it appears the bottleneck is the spark-shell node's CPU power. We
have upgraded this node from an EC2 medium to an xl; we are seeing slightly
better performance, but still not great.
My understanding of Spark was that most of
Sorry for the following question, but I just need a little clarity on
expectations of Spark using YARN.
Is it possible to use the spark-shell with YARN? Or is the only way to submit
a Spark job to YARN to write a Java application and submit it via the
yarn.Client application?
Also is there
Hi, Philip.
Basically, we need PairRDDFunctions.saveAsHadoopDataset to do the job; since
HBase is not a filesystem, saveAsHadoopFile doesn't work.

def saveAsHadoopDataset(conf: JobConf): Unit

This function takes a JobConf parameter, which should be configured.
Essentially, you need to set the output format
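To make that concrete, here is a rough sketch of configuring the JobConf for HBase output (a sketch only, not tested here: the class names are from the HBase mapred-era API, and the table name, column family, and the rdd variable are hypothetical; check the exact names against your HBase version):

```scala
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes

// Configure the JobConf with the output format and target table.
val jobConf = new JobConf()
jobConf.setOutputFormat(classOf[TableOutputFormat])
jobConf.set(TableOutputFormat.OUTPUT_TABLE, "my_table") // hypothetical table name

// Assume rdd: RDD[(String, String)] of (rowkey, value) pairs (hypothetical).
// TableOutputFormat expects (ImmutableBytesWritable, Put) records.
val puts = rdd.map { case (key, value) =>
  val put = new Put(Bytes.toBytes(key))
  put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
  (new ImmutableBytesWritable(Bytes.toBytes(key)), put)
}
puts.saveAsHadoopDataset(jobConf)
```

This requires a running Spark context and an HBase cluster, so treat it as a starting point rather than a drop-in solution.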
Hi!
I load data from a list ( sc.parallelize() ) with a length of about 140
items. After that I run data.filter(func1).map(func2). This operation
runs in less than a second. But after that, count() (or
collect()) takes about 30 seconds. Please help me to reduce this
time!
Best Regards,
Valentin
Hi,
Just wondering what people suggest for joining of 2 RDDs of very different
sizes
I have a sequence of map reduce that will in the end yield me a RDD ~ 500MB
- 800MB that typically has a couple hundred partitions.
After that I want to join that rdd with 2 smaller rdds 1 will be <50MB
anothe
Hi Valentin,
data.filter() and rdd.map() do not actually do the computation. When
you call count() or collect(), your RDD first does the filter(), then
the map(), and then the count() or collect().
See this for more info:
https://github.com/mesos/spark/wiki/Spark-Programming-Guide#transformations
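In other words (a minimal sketch, assuming a SparkContext `sc` as in the shell), the time you see on count() includes the whole lazy pipeline:

```scala
// Transformations are lazy: nothing runs when these lines execute.
val data = sc.parallelize(1 to 140)
val filtered = data.filter(_ % 2 == 0)   // not computed yet
val mapped = filtered.map(_ * 10)        // still not computed

// Actions trigger the computation: filter, map, and the count
// all actually run here, which is why this step looks "slow".
val n = mapped.count()
```

So the 30 seconds is the cost of the filter and map as well, not of count() alone.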
I've done this with a "broadcast". It worked pretty well. Around 10g
(for the smaller dataset) I started having problems (cf
http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201310.mbox/%3ccamgysq9sivs0j9dhv9qgdzp9qxgfadqkrd58b3ynbnhdgkp...@mail.gmail.com%3E
)
If it's really only 800M
The starting data set is much larger than that, I start from a couple ~20GB
data sets.
Any hints on when it becomes impractical to broadcast? Somewhere above
~50 MB? Any ballpark?
On Thu, Nov 14, 2013 at 11:44 AM, Ryan Compton wrote:
> I've done this with a "broadcast". It worked pretty well. Around
My broadcast joins were working fine when the small data set was
several GB, though that probably has more to do with the computer I
was using than anything else. Switching code between a regular join
and a broadcast join is easy. I basically copied the example here:
http://ampcamp.berkeley.edu/wp-
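For reference, a broadcast (map-side) join along the lines of that example looks roughly like this (a sketch, assuming the small RDD's collected map fits comfortably in each executor's memory; `smallRdd` and `bigRdd` are hypothetical pair RDDs):

```scala
// smallRdd: RDD[(K, V)] (small), bigRdd: RDD[(K, W)] (large)

// Broadcast join: collect the small side and ship it to every node,
// avoiding the shuffle of the large side entirely.
val smallMap = sc.broadcast(smallRdd.collectAsMap())
val joined = bigRdd.flatMap { case (k, w) =>
  smallMap.value.get(k).map(v => (k, (w, v)))
}

// Regular shuffle join, for comparison -- a one-line switch:
val shuffled = bigRdd.join(smallRdd)
```

The appeal is exactly what Ryan describes: the two variants are close enough that you can swap between them and measure which wins for your sizes.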
Thank you, Meisam! But I have found something interesting (for me, as a
novice in Spark). Working with 400k elements, count() takes 30 secs
and .take(Int.MaxValue).size takes less than a second!
The problem comes when working with 1400k elements:
.take(Int.MaxValue).size is not so quick.
Best regards,
V
I would like to get info like the total cores and total memory available in
the Spark cluster via a Spark Java API; any suggestions?
This will help me set the right number of partitions when invoking
parallelize, for example.
Thanks,
Hussam
Hey folks, just a quick announcement -- in case you’re interested in learning
more about Spark in the Boston area, I’m going to speak at the Boston Hadoop
Meetup next Thursday: http://www.meetup.com/bostonhadoop/events/150875522/.
This is a good chance to meet local users and learn more about th
Hi! I think I have the maximally horrendous solution to this problem. If
you just want to know the total cores of a Standalone or Coarse Grained
scheduler, and are OK with going "off trail" of the public API, so to
speak, you can use something like the following (just beware that it's
liable to bre
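If you'd rather stay on the public API, one rough approximation (a sketch; what this value reflects varies by scheduler, so verify it on your deployment) is sc.defaultParallelism, which on Standalone/coarse-grained mode tends to track total cores:

```scala
// Public-API approximation: on Standalone / coarse-grained schedulers,
// defaultParallelism usually reflects the total cores in the cluster.
val cores = sc.defaultParallelism

// e.g. use it to size partitions when parallelizing
val data = sc.parallelize(1 to 1000000, cores * 2)
```

This won't give you total memory, though, so it only covers half of Hussam's question.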