Hi,
Oracle JDK and OpenJDK, which one is better or preferred for Spark?
Regards,
Wang Hao(王灏)
CloudTeam | School of Software Engineering
Shanghai Jiao Tong University
Address: 800 Dongchuan Road, Minhang District, Shanghai, 200240
Email: wh.s...@gmail.com
I would like to say that Oracle JDK may be the better choice. A lot of
Hadoop distribution vendors use Oracle JDK instead of OpenJDK for
enterprise deployments.
On Mon, May 19, 2014 at 2:50 PM, Hao Wang wh.s...@gmail.com wrote:
Hi,
Oracle JDK and OpenJDK, which one is better or preferred for Spark?
Btw, is there a command or script to update the slaves from the master?
Thanks
Daniel
On Mon, May 19, 2014 at 1:48 AM, Andrew Ash and...@andrewash.com wrote:
If the codebase for Spark's broadcast is pretty self-contained, you could
consider creating a small bootstrap sent out via the doubling
Hi all,
When I set the persistence level to DISK_ONLY, Spark still tries to use
memory for caching.
Any reason?
Do I need to override some parameter elsewhere?
Thanks!
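[For reference, a minimal sketch of disk-only persistence (the input path is a placeholder). Note that DISK_ONLY only controls where cached blocks are stored; tasks still use executor memory while computing and reading partitions:

import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("hdfs:///some/input") // placeholder path
rdd.persist(StorageLevel.DISK_ONLY)
rdd.count() // materializes the RDD; cached blocks go to disk, not the memory store
]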
Hi Eugen,
Thanks for your help. I'm not familiar with the shade plugin and I was
wondering: does it replace the assembly plugin? Also, do I have to specify
all the artifacts and sub-artifacts in the artifactSet? Or can I just use a
*:* wildcard and let the Maven scopes do their work? I have a
2014-05-19 10:35 GMT+02:00 Laurent T laurent.thou...@ldmobile.net:
Hi Eugen,
Thanks for your help. I'm not familiar with the shade plugin and I was
wondering: does it replace the assembly plugin?
Nope, it doesn't replace it. It allows you to make fat jars and do other
nice things such as
On Mon, May 19, 2014 at 2:04 AM, Daniel Mahler dmah...@gmail.com wrote:
I agree that for updating, rsync is probably preferable, and it seems like
for that purpose it would also parallelize well, since most of the time is
spent computing checksums, so the process is not constrained by the total
Why does it need to be a local file? Why not do some filter ops on the HDFS
file and save the result to HDFS, from where you can create an RDD?
You can read a small file in the driver program and use sc.parallelize to
turn it into an RDD.
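[A minimal sketch of that suggestion (the file path is a placeholder):

import scala.io.Source

// Read a small file on the driver, then distribute its lines as an RDD.
val lines = Source.fromFile("/path/to/small-file.txt").getLines().toSeq
val rdd = sc.parallelize(lines)
]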
On May 16, 2014 7:01 PM, Sai Prasanna ansaiprasa...@gmail.com wrote:
I found that
Hi
I am working with a Cloudera 5 cluster with 192 nodes and can't work out how to
get the Spark REPL to use more than 2 nodes in an interactive session.
So, this works, but is non-interactive (using yarn-client as MASTER)
I am encountering the same thing. Basic YARN apps work, as does the SparkPi
example, but my custom application gives this result. I am using
compute-classpath to create the proper classpath for my application, same
as with SparkPi. Was there a resolution to this issue?
Thanks,
Arun
On Wed, Feb
This is the patch for it: https://github.com/apache/spark/pull/50/. It might be
possible to backport it to 0.8.
Matei
On May 19, 2014, at 2:04 AM, Sai Prasanna ansaiprasa...@gmail.com wrote:
Matei, I am using 0.8.1!
But is there a way to bypass the cache without moving to 0.9.1?
On
Hi Eric,
Have you tried setting the SPARK_WORKER_INSTANCES env variable before
running spark-shell?
http://spark.apache.org/docs/0.9.0/running-on-yarn.html
-Sandy
On Mon, May 19, 2014 at 8:08 AM, Eric Friedman e...@spottedsnake.netwrote:
Hi
I am working with a Cloudera 5 cluster with 192
Hi All
I am new to Spark. I was trying to use Spark Streaming and Shark at the same
time.
I was receiving messages from Kafka and pushing them to HDFS after minor
processing.
It was working fine, but it was taking all the CPUs, and at the same time in
another terminal I tried to access Shark
Hey Andrew,
Since we're seeing so many of these e-mails, I think it's worth
pointing out that it's not really obvious where to find unsubscription
information for the lists.
The community link on the Spark site
(http://spark.apache.org/community.html) does not have instructions
for unsubscribing; it
Hi, I'm attempting to run spark-ec2 launch on AWS. My AWS instances
would be in our EC2 VPC (which seems to be causing a problem).
The two security groups MyClusterName-master and MyClusterName-slaves have
already been setup with the same ports open as the security group that
spark-ec2 tries to
Agree that the links to the archives should probably point to the Apache
archives rather than Nabble's, so the unsubscribe documentation is clearer.
Also, an (unsubscribe) link right next to subscribe with the email already
generated could help too.
I'd be one of those highly against a footer on
On the ec2 machines, you can update the slaves from the master using
something like ~/spark-ec2/copy-dir ~/spark.
Spark's TorrentBroadcast relies on the Block Manager to distribute blocks,
making it relatively hard to extract.
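[For context, the user-facing entry point here is just sc.broadcast; TorrentBroadcast is one implementation behind it. A minimal usage sketch:

// Ship a read-only value to all executors once instead of with every task.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
val rdd = sc.parallelize(Seq("a", "b", "c"))
val mapped = rdd.map(x => lookup.value.getOrElse(x, 0))
]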
On Mon, May 19, 2014 at 12:36 AM, Daniel Mahler dmah...@gmail.com
Good catch. In that case, using BitTornado/murder would be better.
--
Mosharaf Chowdhury
http://www.mosharaf.com/
On Mon, May 19, 2014 at 11:17 AM, Aaron Davidson ilike...@gmail.com wrote:
On the ec2 machines, you can update the slaves from the master using
something like
Thanks Xiangrui,
But I did not find the directory:
examples/src/main/scala/org/apache/spark/examples/mllib.
Could you give me more detail or show me one example? Thanks a lot.
Besides Hadoop, are there any other components of Spark that do not support
IPv6?
Sandy, thank you so much — that was indeed my omission!
Eric
On May 19, 2014, at 10:14 AM, Sandy Ryza sandy.r...@cloudera.com wrote:
Hi Eric,
Have you tried setting the SPARK_WORKER_INSTANCES env variable before running
spark-shell?
http://spark.apache.org/docs/0.9.0/running-on-yarn.html
If you’d like to work on just this code for your own changes, it might be best
to copy it to a separate project. Look at
http://spark.apache.org/docs/latest/quick-start.html for how to set up a
standalone job.
Matei
On May 19, 2014, at 4:53 AM, Hao Wang wh.s...@gmail.com wrote:
Hi,
I am
I am using CDH5 on a three-machine cluster. I map data from HBase as (String,
V) pairs, then call combineByKey like this:
.combineByKey[C](
  (v: V) => new C(v), // this line throws java.lang.ClassCastException: C cannot be cast to V
  (c: C, v: V) => C,
  (c1: C, c2: C) => C)
I am very
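[For reference, a minimal working combineByKey sketch, with Int values and List[Int] combiners standing in for V and C; each function must return a value of the combiner type, not the type itself:

// In a compiled job, also: import org.apache.spark.SparkContext._ (pair-RDD functions)
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val combined = pairs.combineByKey[List[Int]](
  (v: Int) => List(v),                         // createCombiner: V => C
  (c: List[Int], v: Int) => v :: c,            // mergeValue: (C, V) => C
  (c1: List[Int], c2: List[Int]) => c1 ::: c2) // mergeCombiners: (C, C) => C
]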
Which version is this with? I haven’t seen standalone masters lose workers. Is
there other stuff on the machines that’s killing them, or what errors do you
see?
Matei
On May 16, 2014, at 9:53 AM, Josh Marcus jmar...@meetup.com wrote:
Hey folks,
I'm wondering what strategies other folks
Thanks Xiangrui,
Sorry, I am new to Spark. Could you give me more detail about
master or branch-1.0? I do not know what master or branch-1.0 is.
Thanks again.
On Mon, May 19, 2014 at 10:17 PM, Xiangrui Meng [via Apache Spark User
List] ml-node+s1001560n6064...@n3.nabble.com wrote:
Hi yxzhao,
Those are branches in the source code git repository. You can get to them
with git checkout branch-1.0 once you've cloned the git repository.
Cheers,
Andrew
On Mon, May 19, 2014 at 8:30 PM, yxzhao yxz...@ualr.edu wrote:
Thanks Xiangrui,
Sorry, I am new to Spark, could
Has anyone observed Spark worker threads stalling during a shuffle phase with
the following message (one per worker host) being echoed to the terminal on
the driver thread?
INFO spark.MapOutputTrackerActor: Asked to send map output locations for
shuffle 0 to [worker host]...
At this point
Hi Xiangrui, many thanks to you and Sandy for fixing this issue!
On Fri, May 16, 2014 at 10:23 PM, Xiangrui Meng men...@gmail.com wrote:
Hi Andrew,
I submitted a patch and verified it solves the problem. You can
download the patch from
https://issues.apache.org/jira/browse/HADOOP-10614 .
From looking at the source code, I see executors run in their own JVM
subprocesses.
How long do they live for? As long as the worker/slave? Or are they tied to
the SparkContext and live/die with it?
Thx
Thanks Andrew,
Yes, I have downloaded the master code.
But actually I just want to know how to run the classification algorithms
SVM and LogisticRegression implemented under
/spark-0.9.0-incubating/mllib/src/main/scala/org/apache/spark/mllib/classification. Thanks.
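[A minimal sketch of invoking those classifiers against the 0.9-era MLlib API; the input path, line format, and iteration count are placeholders:

import org.apache.spark.mllib.classification.{LogisticRegressionWithSGD, SVMWithSGD}
import org.apache.spark.mllib.regression.LabeledPoint

// Parse lines of the form "label,f1 f2 f3" into labeled points.
val data = sc.textFile("hdfs:///path/to/data.txt").map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, parts(1).split(' ').map(_.toDouble))
}
val svmModel = SVMWithSGD.train(data, 100)               // 100 iterations
val lrModel = LogisticRegressionWithSGD.train(data, 100)
]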
On Mon,
Is your RDD of Strings? If so, you should make sure to use the Kryo
serializer instead of the default Java one. It stores strings as UTF-8
rather than Java's default UTF-16 representation, which can save you half
the memory usage in the right situation.
Try setting the persistence level on the
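[A minimal sketch of enabling Kryo through the standard spark.serializer property (the app name is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("KryoExample") // placeholder
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)
]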
Is there any way to get Facebook data into Spark and filter its content?
Hi,
How does one submit a Spark job to YARN and specify a queue?
The code that successfully submits to YARN is:
val conf = new SparkConf()
val sc = new SparkContext("yarn-client", "Simple App", conf)
Where do I need to specify the queue?
Thanks in advance for any help on this...
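[One sketch, assuming a build where the spark.yarn.queue property is honored in yarn-client mode (it appears in the 1.0-era YARN docs); the queue name is a placeholder:

val conf = new SparkConf()
  .set("spark.yarn.queue", "myQueue") // placeholder queue name
val sc = new SparkContext("yarn-client", "Simple App", conf)
]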
They’re tied to the SparkContext (application) that launched them.
Matei
On May 19, 2014, at 8:44 PM, Koert Kuipers ko...@tresata.com wrote:
from looking at the source code, I see executors run in their own JVM
subprocesses.
how long do they live for? as long as the worker/slave? or are
Thanks Sean. Yes, your solution works :-) I did oversimplify my real
problem, which has other parameters that go along with the sequence.
On Fri, May 16, 2014 at 3:03 AM, Sean Owen so...@cloudera.com wrote:
Not sure if this is feasible, but this literally does what I think you
are describing:
I guess it needs to be this way to benefit from caching of RDDs in
memory. It would be nice, however, if the RDD cache could be dissociated from
the JVM heap so that in cases where garbage collection is difficult to
tune, one could choose to discard the JVM and run the next operation in a
fresh one.
That's one of the main motivations for using Tachyon ;)
http://tachyon-project.org/
It gives off-heap in-memory caching. And starting with Spark 0.9, you can cache
any RDD in Tachyon just by specifying the appropriate StorageLevel.
TD
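[A minimal sketch, assuming a build where the Tachyon-backed OFF_HEAP storage level is available and spark.tachyonStore.url points at a running Tachyon master:

import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000)
val cached = rdd.persist(StorageLevel.OFF_HEAP) // blocks live in Tachyon, off the JVM heap
]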
On Mon, May 19, 2014 at 10:22 PM, Mohit Jaggi mohitja...@gmail.com