Re: Create cache fails on first time

2014-04-17 Thread Andre Bois-Crettez
It could be a GC issue; the first time it triggers a full GC that takes too much time? Make sure you have Xms and Xmx set to the same value, and try -XX:+UseConcMarkSweepGC. And analyse the GC logs. André Bois-Crettez On 2014-04-16 16:44, Arpit Tak wrote: I am loading some data (25GB) in Shark from HDFS:

Re: Re: Spark program thows OutOfMemoryError

2014-04-17 Thread Qin Wei
Hi Andre, thanks a lot for your reply, but I still get the same exception. The complete exception message is as below: Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task 1.0:9 failed 4 times (most recent failure: Exception failure: java.lang.OutOfMemoryError: Java

Random Forest on Spark

2014-04-17 Thread Laeeq Ahmed
Hi, for one of my applications I want to use Random Forests (RF) on top of Spark. I see that currently MLlib does not have an implementation of RF. What other open-source RF implementations would be good to use with Spark in terms of speed? Regards, Laeeq Ahmed, KTH, Sweden.

Spark on Yarn or Mesos?

2014-04-17 Thread Wei Wang
Hi there, I would like to know whether there are any differences between Spark on YARN and Spark on Mesos. Is there any comparison between them? What are the advantages and disadvantages of each? Is there any criterion for choosing between YARN and Mesos? BTW, we need MPI in our framework, and

Re: Shark: ClassNotFoundException org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat

2014-04-17 Thread Arpit Tak
Just out of curiosity, as you are using Cloudera Manager Hadoop and Spark: how did you build Shark for it? Are you able to read any file from HDFS? Did you try that out? Regards, Arpit Tak On Thu, Apr 17, 2014 at 7:07 PM, ge ko koenig@gmail.com wrote: Hi, the error

Spark Example Project, runnable on EMR, open sourced

2014-04-17 Thread Alex Dean
Hi all, Just a quick email to share a new GitHub project we've just released at Snowplow: https://github.com/snowplow/spark-example-project It's an example Scala SBT project which can assemble a fat jar ready for running on Amazon Elastic MapReduce. It includes Specs2 tests too. The blog post

Re: Spark program thows OutOfMemoryError

2014-04-17 Thread yypvsxf19870706
How many tasks are there in your job? Sent from my iPhone On 2014-4-17, 16:24, Qin Wei wei@dewmobile.net wrote: Hi Andre, thanks a lot for your reply, but I still get the same exception. The complete exception message is as below: Exception in thread "main" org.apache.spark.SparkException: Job

confused by reduceByKey usage

2014-04-17 Thread 诺铁
Hi, I am new to Spark. When trying to write some simple tests in the Spark shell, I met the following problem. I created a very small text file, named 5.txt: 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 and experimented in the Spark shell: scala> val d5 = sc.textFile("5.txt").cache() d5: org.apache.spark.rdd.RDD[String] =
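
For illustration, a minimal sketch of this kind of reduceByKey experiment in the Spark shell, assuming the goal is to aggregate the lines of 5.txt by their first field (the original intent is truncated in the archive):

  val d5 = sc.textFile("5.txt").cache()            // sc is the shell's SparkContext

  val sums = d5
    .map(_.split(" "))                             // "1 2 3 4 5" -> Array("1","2","3","4","5")
    .map(fields => (fields(0), fields(1).toInt))   // key = first field, value = second field
    .reduceByKey(_ + _)                            // sum values that share a key

  sums.collect().foreach(println)                  // prints (1,6) for the three identical lines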

Re: Random Forest on Spark

2014-04-17 Thread Debasish Das
MLlib has a decision tree... there is an RF PR which is not active now... take that and swap the tree builder with the fast tree builder that's in MLlib... search for the Spark JIRA... the code is based on the Google PLANET paper... I am sure people on the dev list are already working on it... send an email to

Continuously running non-streaming jobs

2014-04-17 Thread Jim Carroll
Is there a way to create continuously-running, or at least continuously-loaded, jobs that can be 'invoked' rather than 'sent', to avoid the job creation overhead of a couple of seconds? I read through the following:

Spark 0.9.1 core dumps on Mesos 0.18.0

2014-04-17 Thread Steven Cox
So I tried a fix found on the list... The issue was due to a Mesos version mismatch: I am using the latest Mesos 0.17.0, but Spark uses 0.13.0. Fixed by updating SparkBuild.scala to the latest version. I changed this line in SparkBuild.scala: org.apache.mesos % mesos

Re: Random Forest on Spark

2014-04-17 Thread Sung Hwan Chung
Debasish, we've tested the MLlib decision tree a bit and it eats up too much memory for RF purposes. Once the tree got to depth 8~9, it was easy to get a heap exception, even with 2~4 GB of memory per worker. With RF, it's very easy to get 100+ depth with even only 100,000+ rows (because

Re: Spark 0.9.1 core dumps on Mesos 0.18.0

2014-04-17 Thread andy petrella
Heya, I still have to try it myself (I'm trying to create GCE images with Spark on Mesos 0.18.0), but I think your change is one of the required ones; however, my gut feeling is that others will be required to have this working. Actually, in my understanding, this core dump is due to protobuf

Re: Random Forest on Spark

2014-04-17 Thread Evan R. Sparks
Sorry - I meant to say that Multiclass classification, Gradient Boosting, and Random Forest support based on the recent Decision Tree implementation in MLlib is planned and coming soon. On Thu, Apr 17, 2014 at 12:07 PM, Evan R. Sparks evan.spa...@gmail.comwrote: Multiclass classification,

Re: Spark 0.9.1 core dumps on Mesos 0.18.0

2014-04-17 Thread Sean Owen
I don't know if it's anything you or the project is missing... that's just a JDK bug. FWIW I am on 1.7.0_51 and have not seen anything like that. I don't think it's a protobuf issue -- you don't crash the JVM with simple version incompatibilities :) -- Sean Owen | Director, Data Science | London

Re: Continuously running non-streaming jobs

2014-04-17 Thread Jim Carroll
Daniel, I'm new to Spark but I thought that thread hinted at the right answer. Thanks, Jim -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Continuously-running-non-streaming-jobs-tp4391p4397.html Sent from the Apache Spark User List mailing list archive

Re: Random Forest on Spark

2014-04-17 Thread Sung Hwan Chung
Evan, I actually haven't heard of 'shallow' random forest. I think that the only scenarios where shallow trees are useful are boosting scenarios. AFAIK, Random Forest is a variance reducing technique and doesn't do much about bias (although some people claim that it does have some bias reducing

Re: Spark 0.9.1 core dumps on Mesos 0.18.0

2014-04-17 Thread andy petrella
No, of course not, but I was guessing that some native libs imported into the project (to communicate with Mesos) could miserably crash the JVM. Anyway, so you're telling us that with this Oracle version you don't have any issues when using Spark on Mesos 0.18.0; that's interesting 'cause AFAIR, my last

Re: Continuously running non-streaming jobs

2014-04-17 Thread Daniel Darabos
I'm quite new myself (just subscribed to the mailing list today :)), but this happens to be something we've had success with. So let me know if you hit any problems with this sort of usage. On Thu, Apr 17, 2014 at 9:11 PM, Jim Carroll jimfcarr...@gmail.com wrote: Daniel, I'm new to Spark but

RE: Spark 0.9.1 core dumps on Mesos 0.18.0

2014-04-17 Thread Steven Cox
FYI, I've tried older versions (jdk6.x), openjdk. Also here's a fresh core dump on jdk7u55-b13: # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x7f7c6b718d39, pid=7708, tid=140171900581632 # # JRE version: Java(TM) SE Runtime Environment

Re: Spark 0.9.1 core dumps on Mesos 0.18.0

2014-04-17 Thread andy petrella
If you can test it quickly, an option would be to try with the exact same version that Sean used (1.7.0_51) ? Maybe it was a bug fixed in 51 and a regression has been introduced in 55 :-D Andy On Thu, Apr 17, 2014 at 9:36 PM, Steven Cox s...@renci.org wrote: FYI, I've tried older versions

writing booleans w Calliope

2014-04-17 Thread Adrian Mocanu
Has anyone managed to write Booleans to Cassandra from an RDD with Calliope? My Booleans give compile-time errors: expression of type List[Any] does not conform to expected type Types.CQLRowValues. CQLColumnValue is defined as ByteBuffer: type CQLColumnValue = ByteBuffer. For now I convert them
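
For the record, one plausible sketch of the Boolean-to-ByteBuffer conversion alluded to above (the usage shape is an assumption, not taken from the original message):

  import java.nio.ByteBuffer

  // Encode a Boolean as a single-byte ByteBuffer (1 = true, 0 = false),
  // which matches Cassandra's on-the-wire boolean representation.
  def booleanToByteBuffer(b: Boolean): ByteBuffer =
    ByteBuffer.wrap(Array[Byte](if (b) 1.toByte else 0.toByte))

  // Hypothetical usage: build the CQL row values from an RDD of (id, flag) pairs.
  // rdd.map { case (id, flag) =>
  //   List(ByteBuffer.wrap(id.getBytes("UTF-8")), booleanToByteBuffer(flag))
  // }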

Valid spark streaming use case?

2014-04-17 Thread xargsgrep
Hi, I'm completely new to Spark streaming (and Spark) and have been reading up on it and trying out various examples the past few days. I have a particular use case which I think it would work well for, but I wanted to put it out there and get some feedback on whether or not it actually would. The

Re: Random Forest on Spark

2014-04-17 Thread Sung Hwan Chung
Well, if you read the original paper, http://oz.berkeley.edu/~breiman/randomforest2001.pdf: "Grow the tree using CART methodology to maximum size and do not prune." Now, The Elements of Statistical Learning (page 598) says that you could potentially overfit fully-grown regression random

Re: Random Forest on Spark

2014-04-17 Thread Sung Hwan Chung
Additionally, the 'random features per node' (or mtry in R) is a very important feature of Random Forest. The variance reduction comes when the trees are decorrelated from each other, and often the random features per node do more than the bootstrap samples. And this is something that would have to be

Re: RDD collect help

2014-04-17 Thread Eugen Cepoi
You have two kinds of serialization: data and closures. They both use Java serialization. This means that in your function you are referencing an object outside of it, and it is getting serialized with your task. To enable Kryo serialization for closures, set the spark.closure.serializer property. But usually I don't, as it allows me to detect such
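
For reference, a minimal sketch of setting those serializers via SparkConf, assuming the 0.9-era property names used in this reply:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("kryo-example")
    // Kryo for data serialization (RDD contents, shuffle data).
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // Uncommenting this would switch closure serialization to Kryo as well;
    // the reply suggests leaving closures on Java serialization so that
    // accidental captures of outside objects are easier to detect.
    // .set("spark.closure.serializer", "org.apache.spark.serializer.KryoSerializer")

  val sc = new SparkContext(conf)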

Re: Random Forest on Spark

2014-04-17 Thread Debasish Das
Evan, wasn't the MLlib decision tree implemented using ideas from Google's PLANET paper? Does the paper also propose growing a shallow tree? Thanks. Deb On Thu, Apr 17, 2014 at 1:52 PM, Sung Hwan Chung coded...@cs.stanford.eduwrote: Additionally, the 'random features per node' (or mtry in R) is

Re: Random Forest on Spark

2014-04-17 Thread Sung Hwan Chung
I believe that they show one example comparing depth 1 ensemble vs depth 3 ensemble but it is based on boosting, not bagging. On Thu, Apr 17, 2014 at 2:21 PM, Debasish Das debasish.da...@gmail.comwrote: Evan, Was not mllib decision tree implemented using ideas from Google's PLANET

Re: distinct on huge dataset

2014-04-17 Thread Ryan Compton
Does this continue in newer versions? (I'm on 0.8.0 now) When I use .distinct() on moderately large datasets (224GB, 8.5B rows, I'm guessing about 500M are distinct) my jobs fail with: 14/04/17 15:04:02 INFO cluster.ClusterTaskSetManager: Loss was due to java.io.FileNotFoundException

Re: distinct on huge dataset

2014-04-17 Thread Ryan Compton
Btw, I've got System.setProperty("spark.shuffle.consolidate.files", "true") and use ext3 (CentOS...) On Thu, Apr 17, 2014 at 3:20 PM, Ryan Compton compton.r...@gmail.com wrote: Does this continue in newer versions? (I'm on 0.8.0 now) When I use .distinct() on moderately large datasets (224GB, 8.5B

Re: confused by reduceByKey usage

2014-04-17 Thread 诺铁
Yeah, I got it! Using println to debug is great for me to explore Spark. Thank you very much for your kind help. On Fri, Apr 18, 2014 at 12:54 AM, Daniel Darabos daniel.dara...@lynxanalytics.com wrote: Here's a way to debug something like this: scala> d5.keyBy(_.split(
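
A hypothetical reconstruction of this kind of shell debugging (the exact snippet is cut off in the archive; 5.txt is the file from earlier in the thread), where intermediate (key, value) pairs are printed before reducing them:

  val d5 = sc.textFile("5.txt").cache()

  val keyed = d5.keyBy(_.split(" ")(0))   // key each line by its first field
  keyed.collect().foreach(println)        // e.g. (1,1 2 3 4 5)

  keyed.mapValues(_ => 1).reduceByKey(_ + _).collect().foreach(println)  // lines per key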

Re: Random Forest on Spark

2014-04-17 Thread Evan R. Sparks
What kind of data are you training on? These effects are *highly* data dependent, and while saying "a depth of 10 is simply not adequate to build high-accuracy models" may be accurate for the particular problem you're modeling, it is not true in general. From a statistical perspective, I consider

Re: Random Forest on Spark

2014-04-17 Thread Sung Hwan Chung
Yes, it should be data specific and perhaps we're biased toward the data sets that we are playing with. To put things in perspective, we're highly interested in (and I believe, our customers are): 1. large (hundreds of millions of rows) 2. multi-class classification - nowadays, dozens of target

Re: confused by reduceByKey usage

2014-04-17 Thread Cheng Lian
A tip: using println is only convenient when you are working in local mode. When running Spark in cluster mode (standalone/YARN/Mesos), the output of println goes to the executor stdout. On Fri, Apr 18, 2014 at 6:53 AM, 诺铁 noty...@gmail.com wrote: Yeah, I got it! Using println to debug is great

Re: confused by reduceByKey usage

2014-04-17 Thread 诺铁
Hi Cheng, thank you for letting me know this. So what do you think is a better way to debug? On Fri, Apr 18, 2014 at 9:27 AM, Cheng Lian lian.cs@gmail.com wrote: A tip: using println is only convenient when you are working in local mode. When running Spark in cluster mode

Re: confused by reduceByKey usage

2014-04-17 Thread Cheng Lian
Ah, I’m not saying println is bad, it’s just that you need to go to the right place to locate the output, e.g. you can check stdout of any executor from the Web UI. On Fri, Apr 18, 2014 at 9:48 AM, 诺铁 noty...@gmail.com wrote: hi,Cheng, thank you for let me know this. so what do you think

Re: distinct on huge dataset

2014-04-17 Thread Mayur Rustagi
Preferably, increase the ulimit on your machines. Spark needs to access a lot of small files, so it is hard to keep the number of open file handles under control. — Sent from Mailbox On Fri, Apr 18, 2014 at 3:59 AM, Ryan Compton compton.r...@gmail.com wrote: Btw, I've got System.setProperty("spark.shuffle.consolidate.files",

Re: confused by reduceByKey usage

2014-04-17 Thread 诺铁
got it, thank you. On Fri, Apr 18, 2014 at 9:55 AM, Cheng Lian lian.cs@gmail.com wrote: Ah, I’m not saying println is bad, it’s just that you need to go to the right place to locate the output, e.g. you can check stdout of any executor from the Web UI. On Fri, Apr 18, 2014 at 9:48 AM,