It could be a GC issue; maybe the first time it triggers a full GC, it takes too
much time?
Make sure you have Xms and Xmx set to the same value, and try
-XX:+UseConcMarkSweepGC.
And analyse the GC logs.
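(For reference, a sketch of how this could be wired up in 0.9-era Spark; the
property name and the spark-env.sh usage are assumptions, so check your version's docs:)

    // Set before creating the SparkContext. In standalone mode the executor
    // heap is sized from this once, which keeps -Xms and -Xmx equal.
    System.setProperty("spark.executor.memory", "4g")

    // GC choice and GC logging are plain JVM flags; in this era they were
    // typically passed through SPARK_JAVA_OPTS in conf/spark-env.sh, e.g.:
    //   SPARK_JAVA_OPTS="-XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails"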
André Bois-Crettez
On 2014-04-16 16:44, Arpit Tak wrote:
I am loading some data (25GB) into Shark from HDFS:
Hi, Andre, thanks a lot for your reply, but I still get the same exception. The
complete exception message is below:
Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task
1.0:9 failed 4 times (most recent failure: Exception failure:
java.lang.OutOfMemoryError: Java
Hi,
For one of my applications, I want to use Random Forests (RF) on top of Spark. I
see that currently MLlib does not have an implementation of RF. Which other
open-source RF implementations would be good to use with Spark in terms of speed?
Regards,
Laeeq Ahmed,
KTH, Sweden.
Hi, there
I would like to know whether there are any differences between Spark on YARN and
Spark on Mesos. Is there any comparison between them? What are the
advantages and disadvantages of each? Is there any criterion for
choosing between YARN and Mesos?
BTW, we need MPI in our framework, and
Just out of curiosity, as you are using Cloudera Manager Hadoop and Spark:
how did you build Shark for it?
Are you able to read any file from HDFS? Did you try that out?
Regards,
Arpit Tak
On Thu, Apr 17, 2014 at 7:07 PM, ge ko koenig@gmail.com wrote:
Hi,
the error
Hi all,
Just a quick email to share a new GitHub project we've just released at
Snowplow:
https://github.com/snowplow/spark-example-project
It's an example Scala SBT project which can assemble a fat jar ready for
running on Amazon Elastic MapReduce. It includes Specs2 tests too.
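(For readers unfamiliar with the fat-jar approach, the core of such a build looks
roughly like this; the versions below are illustrative, not taken from the project:)

    // project/plugins.sbt -- pull in the sbt-assembly plugin
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

    // build.sbt -- mark Spark "provided" so the cluster's copy is used
    // and the assembled jar stays small
    libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.1" % "provided"

Running sbt assembly then produces a single jar you can hand to EMR.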
The blog post
How many tasks are there in your job?
Sent from my iPhone
On 2014-4-17, at 16:24, Qin Wei wei@dewmobile.net wrote:
Hi, Andre, thanks a lot for your reply, but I still get the same exception.
The complete exception message is below:
Exception in thread "main" org.apache.spark.SparkException: Job
Hi,
I am new to Spark. When trying to write some simple tests in the Spark shell, I met
the following problem.
I created a very small text file, named 5.txt:
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
and experimented in the Spark shell:
scala> val d5 = sc.textFile("5.txt").cache()
d5: org.apache.spark.rdd.RDD[String] =
MLlib has a decision tree. There is an RF PR which is not active now; take
that and swap the tree builder with the fast tree builder that's in
MLlib... search for the Spark JIRA... the code is based on Google's PLANET
paper...
I am sure people on the dev list are already working on it... send an email to
Is there a way to create continuously-running, or at least
continuously-loaded, jobs that can be 'invoked' rather than 'sent', to
avoid the job creation overhead of a couple of seconds?
I read through the following:
So I tried a fix found on the list...
The issue was due to a Mesos version mismatch, as I am using the latest Mesos
0.17.0, but Spark uses 0.13.0.
Fixed by updating SparkBuild.scala to the latest version.
I changed this line in SparkBuild.scala:
"org.apache.mesos" % "mesos"
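(Presumably the change was along these lines, with the before/after versions taken
from the message above:)

    // SparkBuild.scala -- before:
    "org.apache.mesos" % "mesos" % "0.13.0"
    // after, matching the installed Mesos:
    "org.apache.mesos" % "mesos" % "0.17.0"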
Debasish, we've tested the MLlib decision tree a bit and it eats up too
much memory for RF purposes.
Once the tree got to depth 8~9, it was easy to get a heap exception, even
with 2~4 GB of memory per worker.
With RF, it's very easy to get 100+ depth with even only 100,000+
rows (because
Heya,
I still have to try it myself (I'm trying to create GCE images with Spark
on Mesos 0.18.0), but I think your change is one of the required ones;
however, my gut feeling is that others will be required to get this working.
Actually, in my understanding, this core dump is due to protobuf
Sorry - I meant to say that Multiclass classification, Gradient Boosting,
and Random Forest support based on the recent Decision Tree implementation
in MLlib is planned and coming soon.
On Thu, Apr 17, 2014 at 12:07 PM, Evan R. Sparks evan.spa...@gmail.com wrote:
Multiclass classification,
I don't know if it's anything you or the project is missing... that's
just a JDK bug.
FWIW I am on 1.7.0_51 and have not seen anything like that.
I don't think it's a protobuf issue -- you don't crash the JVM with
simple version incompatibilities :)
--
Sean Owen | Director, Data Science | London
Daniel,
I'm new to Spark but I thought that thread hinted at the right answer.
Thanks,
Jim
Evan,
I actually haven't heard of 'shallow' random forest. I think that the only
scenarios where shallow trees are useful are boosting scenarios.
AFAIK, Random Forest is a variance-reducing technique and doesn't do much
about bias (although some people claim that it does have some bias-reducing
No, of course not, but I was guessing that some native libs imported in the
project (to communicate with Mesos) could... miserably crash the JVM.
Anyway, so you're telling us that using this Oracle version you don't have any
issues running Spark on Mesos 0.18.0; that's interesting, 'cause AFAIR,
my last
I'm quite new myself (just subscribed to the mailing list today :)), but
this happens to be something we've had success with. So let me know if you
hit any problems with this sort of usage.
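(A minimal sketch of the pattern being discussed: keep one SparkContext alive in a
server process and treat each request as a method call against it. All names and
paths here are illustrative:)

    import org.apache.spark.SparkContext

    object LongRunningApp {
      // Created once and reused for every subsequent invocation
      val sc = new SparkContext("spark://master:7077", "LongRunningApp")
      val data = sc.textFile("hdfs:///data/events").cache()

      // Each call runs a job on the warm, cached context, skipping the
      // several-second overhead of a fresh job submission
      def countMatching(substr: String): Long =
        data.filter(_.contains(substr)).count()
    }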
On Thu, Apr 17, 2014 at 9:11 PM, Jim Carroll jimfcarr...@gmail.com wrote:
Daniel,
I'm new to Spark but
FYI, I've tried older versions (JDK 6.x) and OpenJDK. Also, here's a fresh core dump
on jdk7u55-b13:
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x7f7c6b718d39, pid=7708, tid=140171900581632
#
# JRE version: Java(TM) SE Runtime Environment
If you can test it quickly, an option would be to try with the exact same
version that Sean used (1.7.0_51) ?
Maybe it was a bug fixed in 51 and a regression was introduced in 55
:-D
Andy
On Thu, Apr 17, 2014 at 9:36 PM, Steven Cox s...@renci.org wrote:
FYI, I've tried older versions
Has anyone managed to write Booleans to Cassandra from an RDD with Calliope?
My Booleans give compile-time errors: "expression of type List[Any] does not
conform to expected type Types.CQLRowValues".
CQLColumnValue is defined as ByteBuffer: type CQLColumnValue = ByteBuffer
For now I convert them
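(The message is cut off here; one hedged guess at the kind of conversion meant,
encoding a Boolean as a single-byte ByteBuffer, is:)

    import java.nio.ByteBuffer

    // Hypothetical workaround: represent true/false as one byte
    def booleanToByteBuffer(b: Boolean): ByteBuffer =
      ByteBuffer.wrap(Array[Byte](if (b) 1.toByte else 0.toByte))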
Hi, I'm completely new to Spark Streaming (and Spark) and have been reading
up on it and trying out various examples for the past few days. I have a
particular use case which I think it would work well for, but I wanted to
put it out there and get some feedback on whether or not it actually would.
The
Well, if you read the original paper,
http://oz.berkeley.edu/~breiman/randomforest2001.pdf:
"Grow the tree using CART methodology to maximum size and do not prune."
Now, The Elements of Statistical Learning, on page 598, says that you
could potentially overfit fully-grown regression random
Additionally, 'random features per node' (or mtry in R) is a very
important feature of Random Forest. The variance reduction comes when the
trees are decorrelated from each other, and often the random features per
node do more than the bootstrap samples. And this is something that would
have to be
You have two kinds of serialization: data and closures. Both use Java
serialization by default. This means that if, in your function, you reference an
object outside of it, that object gets serialized with your task. To enable Kryo
serialization for closures, set the spark.closure.serializer property. But usually I
don't, as the default allows me to detect such
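(A sketch of the property-based configuration described above, in the 0.9-era
System-property style; set these before creating the SparkContext:)

    // Kryo for data serialization
    System.setProperty("spark.serializer",
      "org.apache.spark.serializer.KryoSerializer")
    // Kryo for closures too (off by default, per the advice above)
    System.setProperty("spark.closure.serializer",
      "org.apache.spark.serializer.KryoSerializer")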
Evan,
Wasn't the MLlib decision tree implemented using ideas from Google's PLANET
paper... does the paper also propose growing a shallow tree?
Thanks.
Deb
On Thu, Apr 17, 2014 at 1:52 PM, Sung Hwan Chung
coded...@cs.stanford.edu wrote:
Additionally, the 'random features per node' (or mtry in R) is
I believe they show one example comparing a depth-1 ensemble vs a depth-3
ensemble, but it is based on boosting, not bagging.
On Thu, Apr 17, 2014 at 2:21 PM, Debasish Das debasish.da...@gmail.com wrote:
Evan,
Was not mllib decision tree implemented using ideas from Google's PLANET
Does this continue in newer versions? (I'm on 0.8.0 now)
When I use .distinct() on moderately large datasets (224GB, 8.5B rows,
I'm guessing about 500M are distinct) my jobs fail with:
14/04/17 15:04:02 INFO cluster.ClusterTaskSetManager: Loss was due to
java.io.FileNotFoundException
Btw, I've got System.setProperty("spark.shuffle.consolidate.files",
"true") and use ext3 (CentOS...)
On Thu, Apr 17, 2014 at 3:20 PM, Ryan Compton compton.r...@gmail.com wrote:
Does this continue in newer versions? (I'm on 0.8.0 now)
When I use .distinct() on moderately large datasets (224GB, 8.5B
Yeah, I got it!
Using println to debug is great for me to explore Spark.
Thank you very much for your kind help.
On Fri, Apr 18, 2014 at 12:54 AM, Daniel Darabos
daniel.dara...@lynxanalytics.com wrote:
Here's a way to debug something like this:
scala> d5.keyBy(_.split(
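(The quoted snippet is cut off; a plausible reconstruction, assuming a space
delimiter, is:)

    scala> d5.keyBy(_.split(" ")(0)).collect().foreach(println)
    (1,1 2 3 4 5)
    (1,1 2 3 4 5)
    (1,1 2 3 4 5)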
What kind of data are you training on? These effects are *highly* data
dependent, and while saying "a depth of 10 is simply not adequate to
build high-accuracy models" may be accurate for the particular problem
you're modeling, it is not true in general. From a statistical perspective,
I consider
Yes, it should be data specific and perhaps we're biased toward the data
sets that we are playing with. To put things in perspective, we're highly
interested in (and I believe, our customers are):
1. large (hundreds of millions of rows)
2. multi-class classification - nowadays, dozens of target
A tip: using println is only convenient when you are working in local
mode. When running Spark in cluster mode (standalone/YARN/Mesos), the output
of println goes to executor stdout.
On Fri, Apr 18, 2014 at 6:53 AM, 诺铁 noty...@gmail.com wrote:
Yeah, I got it!
Using println to debug is great
Hi Cheng,
thank you for letting me know this. So what do you think is a better way to
debug?
On Fri, Apr 18, 2014 at 9:27 AM, Cheng Lian lian.cs@gmail.com wrote:
A tip: using println is only convenient when you are working in local
mode. When running Spark in cluster mode
Ah, I’m not saying println is bad, it’s just that you need to go to the
right place to locate the output, e.g. you can check stdout of any executor
from the Web UI.
On Fri, Apr 18, 2014 at 9:48 AM, 诺铁 noty...@gmail.com wrote:
Hi Cheng,
thank you for letting me know this. So what do you think
Preferably, increase the ulimit on your machines. Spark needs to access a lot of
small files, so it is hard to keep the number of open file handles under control.
—
Sent from Mailbox
On Fri, Apr 18, 2014 at 3:59 AM, Ryan Compton compton.r...@gmail.com
wrote:
Btw, I've got System.setProperty("spark.shuffle.consolidate.files",
got it, thank you.
On Fri, Apr 18, 2014 at 9:55 AM, Cheng Lian lian.cs@gmail.com wrote:
Ah, I’m not saying println is bad, it’s just that you need to go to the
right place to locate the output, e.g. you can check stdout of any executor
from the Web UI.
On Fri, Apr 18, 2014 at 9:48 AM,