Snappy initialization issue: Spark assembly jar missing Snappy classes?

2016-07-20 Thread Eugene Morozov
Greetings! We're reading input files with newApiHadoopFile configured with a multiline split. Everything's fine, except for https://issues.apache.org/jira/browse/MAPREDUCE-6549. It looks like the issue is fixed, but only in Hadoop 2.7.2, which means we have to download Spark without Hadoop and
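For a "Hadoop free" Spark build, the usual way to wire in a user-provided Hadoop (and the snappy-java jar bundled with it) is the `SPARK_DIST_CLASSPATH` variable. A minimal sketch, assuming Hadoop 2.7.2 is installed under `$HADOOP_HOME`:

```shell
# In conf/spark-env.sh of a "Hadoop free" Spark build: point Spark at
# the user-provided Hadoop jars (including its bundled snappy-java)
# by exporting the hadoop classpath.
export SPARK_DIST_CLASSPATH=$("${HADOOP_HOME}/bin/hadoop" classpath)
```
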

Re: Union of multiple RDDs

2016-06-21 Thread Eugene Morozov
Apurva, I'd say you have to apply repartition just once, to the RDD that is the union of all your files, and it has to be done right before you do anything else. If something in your files is not needed, then the sooner you project it away, the better. Hope this helps. -- Be well! Jean Morozov On Tue,

Re: Does Spark use data locality information from HDFS when running in standalone mode?

2016-06-05 Thread Eugene Morozov
Marco, I'd say yes, because it uses a different implementation of Hadoop's InputFormat interface underneath. What kind of proof would you like to see? -- Be well! Jean Morozov On Sun, Jun 5, 2016 at 12:50 PM, Marco Capuccini < marco.capucc...@farmbio.uu.se> wrote: > Dear all, > > Does Spark use

Re: StackOverflowError even with JavaSparkContext union(JavaRDD... rdds)

2016-06-05 Thread Eugene Morozov
Everett, try to increase the thread stack size. To do that, run your application with the following options (my app is a web application, so you might adjust something): -XX:ThreadStackSize=81920 -Dspark.executor.extraJavaOptions="-XX:ThreadStackSize=81920" The number 81920 is memory in KB. You could
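The same options can be passed through spark-submit rather than JVM system properties. A sketch under the thread's assumptions (the class name and jar are illustrative; 81920 KB is the value from the post, not a recommendation):

```shell
# Raise the thread stack size on both driver and executors.
# -XX:ThreadStackSize is in KB, so 81920 == 80 MB per thread.
spark-submit \
  --driver-java-options "-XX:ThreadStackSize=81920" \
  --conf "spark.executor.extraJavaOptions=-XX:ThreadStackSize=81920" \
  --class com.example.MyApp \
  myapp.jar
```
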

Re: RDDs caching in typical machine learning use cases

2016-04-04 Thread Eugene Morozov
Hi, Yes, I believe people do that. I also believe that SparkML is able to figure out when to cache some internal RDDs on its own; that's definitely true for the random forest algo. It doesn't hurt to cache the same RDD twice, either. But it's not clear what you'd want to know... -- Be well! Jean Morozov On

Re: SparkML RandomForest java.lang.StackOverflowError

2016-03-31 Thread Eugene Morozov
mentation (and > any PLANET-like implementation) > > Using fewer partitions is a good idea. > > Which Spark version was this on? > > On Tue, Mar 29, 2016 at 5:21 AM, Eugene Morozov < > evgeny.a.moro...@gmail.com> wrote: > >> The questions I have in mind: >>

Re: SparkML RandomForest java.lang.StackOverflowError

2016-03-30 Thread Eugene Morozov
increases over time. When the warning first appeared it was around 100KB. The time to complete collectAsMap at DecisionTree.scala:651 also increased from 8 seconds at the beginning of the training to 20-24 seconds now. -- Be well! Jean Morozov On Wed, Mar 30, 2016 at 12:14 AM, Eugene Morozov

Re: SparkML RandomForest java.lang.StackOverflowError

2016-03-29 Thread Eugene Morozov
od idea. > > Which Spark version was this on? > > On Tue, Mar 29, 2016 at 5:21 AM, Eugene Morozov < > evgeny.a.moro...@gmail.com> wrote: > >> The questions I have in mind: >> >> Is it smth that the one might expect? From the stack trace itself it's >>

Re: SparkML RandomForest java.lang.StackOverflowError

2016-03-29 Thread Eugene Morozov
the right thing to do, but I've increased the thread stack size 10 times (to 80MB) and reduced default parallelism 10 times (only 20 cores are available). Thank you in advance. -- Be well! Jean Morozov On Tue, Mar 29, 2016 at 1:12 PM, Eugene Morozov <evgeny.a.moro...@gmail.com> wrote: >

SparkML RandomForest java.lang.StackOverflowError

2016-03-29 Thread Eugene Morozov
Hi, I have a web service that provides a REST API to train the random forest algo. I train random forest on a 5-node Spark cluster with enough memory - everything is cached (~22 GB). On small datasets of up to 100k samples everything is fine, but with the biggest one (400k samples and ~70k features)

Re: IntelliJ idea not work well with spark

2016-03-27 Thread Eugene Morozov
Could you please share your code, so that I could try it? -- Be well! Jean Morozov On Sun, Mar 27, 2016 at 5:20 PM, 吴文超 wrote: > I am a newbie to spark, when I use IntelliJ idea to write some scala code, > i found it reports an error when using spark's implicit

Re: SparkML algos limitations question.

2016-03-21 Thread Eugene Morozov
> Joseph > > On Mon, Dec 14, 2015 at 10:52 AM, Eugene Morozov < > evgeny.a.moro...@gmail.com> wrote: > >> Hello! >> >> I'm currently working on POC and try to use Random Forest (classification >> and regression). I also have to check SVM and Mul

SparkML. RandomForest scalability question.

2016-03-08 Thread Eugene Morozov
Hi, I have a 4-node cluster: one master (which also hosts the HDFS namenode) and 3 workers (with 3 colocated HDFS datanodes). Each worker has only 2 cores and spark.executor.memory is 2.3g. The input file is two HDFS blocks; one block is configured as 64MB. I train random forest regression with numTrees=50 and

Re: Dynamic partitions reallocations with added worker nodes

2016-03-05 Thread Eugene Morozov
I haven't added one more HDFS node to a hadoop cluster > > Does each of three nodes colocate with hdfs data nodes ? > The absence of 4th data node might have something to do with the partition > allocation. > > Can you show your code snippet ? > > Thanks > > On Sat,

Dynamic partitions reallocations with added worker nodes

2016-03-05 Thread Eugene Morozov
Hi, My cluster (standalone deployment), consisting of 3 worker nodes, was in the middle of computations when I added one more worker node. I can see that the new worker is registered with the master and that my job actually got one more executor. I have configured default parallelism as 12, and thus I see

Re: Fair scheduler pool details

2016-03-02 Thread Eugene Morozov
lgorithm. There is > no pre-emption or rescheduling of Tasks that the scheduler has already sent > to the workers, nor is there any attempt to anticipate when already running > Tasks will complete. > > > On Sat, Feb 20, 2016 at 4:14 PM, Eugene Morozov < > evgeny.a.moro...

SparkML Using Pipeline API locally on driver

2016-02-26 Thread Eugene Morozov
Hi everyone. I have a requirement to run prediction for a random forest model locally in a web service, without touching Spark at all, in some specific cases. I've achieved that with the previous mllib API (java 8 syntax): public List> predictLocally(RandomForestModel model,
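A sketch of what such a method can look like with the mllib 1.x API (the class and method names `org.apache.spark.mllib.tree.model.RandomForestModel` and `predict(Vector)` are real; the surrounding service class is hypothetical, and this needs the spark-mllib jar on the classpath). The key point is that `predict(Vector)` is a plain JVM call, so it runs on the web-service thread with no Spark job:

```java
import java.util.List;
import java.util.stream.Collectors;

import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.tree.model.RandomForestModel;

public class LocalPredictor {
    // Predict each sample locally: no RDD, no SparkContext, no cluster
    // involved - the model object walks its trees in-process.
    public List<Double> predictLocally(RandomForestModel model, List<Vector> samples) {
        return samples.stream()
                      .map(model::predict)
                      .collect(Collectors.toList());
    }
}
```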

Fair scheduler pool details

2016-02-20 Thread Eugene Morozov
Hi, I'm trying to understand how this thing works underneath. Let's say I have two types of jobs - highly important ones, which use a small number of cores and have to run pretty fast, and less important but greedy ones, which use as many cores as available. So, the idea is to use two corresponding pools.
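A sketch of what the two pools can look like in `conf/fairscheduler.xml` (pool names and numbers are illustrative). This assumes `spark.scheduler.mode=FAIR` is set and a job is assigned to a pool per thread via `sc.setLocalProperty("spark.scheduler.pool", "urgent")`:

```xml
<!-- conf/fairscheduler.xml: "urgent" gets a guaranteed minShare and a
     high weight so its tasks are scheduled quickly; "greedy" takes
     whatever cores are left over. -->
<allocations>
  <pool name="urgent">
    <schedulingMode>FAIR</schedulingMode>
    <weight>4</weight>
    <minShare>2</minShare>
  </pool>
  <pool name="greedy">
    <schedulingMode>FIFO</schedulingMode>
    <weight>1</weight>
    <minShare>0</minShare>
  </pool>
</allocations>
```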

Best practises of share Spark cluster over few applications

2016-02-13 Thread Eugene Morozov
Hi, I have several instances of the same web service that runs some ML algos on Spark (both training and prediction) and does some Spark-unrelated work. Each web-service instance creates its own JavaSparkContext, thus they're seen as separate applications by Spark, and thus they're configured
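In standalone mode, each such application can be capped so several instances share the cluster. A sketch of per-application settings (values are illustrative), set either in each instance's SparkConf or in `spark-defaults.conf`:

```properties
# Static caps: each application grabs at most this many cores / memory.
spark.cores.max        8
spark.executor.memory  4g

# Alternative: let idle applications release executors instead of
# holding static caps (dynamic allocation needs the external shuffle
# service in standalone mode).
spark.dynamicAllocation.enabled  true
spark.shuffle.service.enabled    true
```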

[SparkML] RandomForestModel save on disk.

2016-02-12 Thread Eugene Morozov
Hello, I'm building a simple web service that works with Spark and allows users to train a random forest model (mllib API) and use it for prediction. Trained models are stored on the local file system (the web service and a Spark cluster of just one worker run on the same machine). I'm concerned about

Re: [SparkML] RandomForestModel save on disk.

2016-02-12 Thread Eugene Morozov
(ScalaReflection.scala:642) ~[spark-catalyst_2.10-1.6.0.jar:1.6.0] -- Be well! Jean Morozov On Fri, Feb 12, 2016 at 5:57 PM, Eugene Morozov <evgeny.a.moro...@gmail.com> wrote: > Hello, > > I'm building simple web service that works with spark and allows users to > train random forest model

Re: Concurrent Spark jobs

2016-01-22 Thread Eugene Morozov
Emlyn, Have you considered using pools? http://spark.apache.org/docs/latest/job-scheduling.html#fair-scheduler-pools I haven't tried it myself, but it looks like the pool setting is applied per thread, which means it's possible to configure the fair scheduler so that more than one job is on a

Re: Usage of SparkContext within a Web container

2016-01-14 Thread Eugene Morozov
Praveen, Zeppelin uses Spark's REPL. I'm currently writing an app that is a web service which is going to run Spark jobs. So, at the init stage I just create a JavaSparkContext and then use it for all user requests. The web service is stateless. The issue with stateless is that it's possible to run

Re: Stuck with DataFrame df.select("select * from table");

2015-12-29 Thread Eugene Morozov
, > > Try this: > > df.select("""select * from tmptable where x1 = '3.0'""").show(); > > > *Note: *you have to use 3 double quotes as marked > > > > On Friday, December 25, 2015 11:30 AM, Eugene Morozov < > evgeny.a.moro...@gmail.com> wro

Re: Help: Driver OOM when shuffle large amount of data

2015-12-28 Thread Eugene Morozov
Kendal, have you tried to reduce the number of partitions? -- Be well! Jean Morozov On Mon, Dec 28, 2015 at 9:02 AM, kendal wrote: > My driver is running OOM with my 4T data set... I don't collect any data to > the driver. All the program does is map - reduce - saveAsTextFile.

Re: Stuck with DataFrame df.select("select * from table");

2015-12-26 Thread Eugene Morozov
>> https://github.com/apache/incubator-zeppelin/blob/01f4884a3a971ece49d668a9783d6b705cf6dbb5/spark/src/main/java/org/apache/zeppelin/spark/SparkSqlInterpreter.java#L140-L141 >> >> >> Also, keep in mind that you can do something like this if you want to >> stay in DataFram

Stuck with DataFrame df.select("select * from table");

2015-12-25 Thread Eugene Morozov
Hello, I'm basically stuck, as I have no idea where to look. The following simple code, given that my DataSource is working, gives me an exception. DataFrame df = sqlc.load(filename, "com.epam.parso.spark.ds.DefaultSource"); df.cache(); df.printSchema(); <-- prints the schema perfectly fine!
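The root of the confusion in this thread is that `df.select` expects column expressions, not a SQL statement; a full SQL query has to go through the SQLContext on a registered temp table. A sketch against the Spark 1.x API (this needs a Spark runtime; the table and column names follow the thread):

```scala
// df.select("select * from table") fails because the string is parsed
// as a column name. Register the DataFrame and run SQL instead:
val df = sqlc.load(filename, "com.epam.parso.spark.ds.DefaultSource")
df.registerTempTable("tmptable")
sqlc.sql("SELECT * FROM tmptable WHERE x1 = '3.0'").show()

// Equivalent, staying in the DataFrame API:
df.filter(df("x1") === "3.0").show()
```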

Re: Stuck with DataFrame df.select("select * from table");

2015-12-25 Thread Eugene Morozov
from SQL query. > I searched unit tests but didn't find any in the form of df.select("select > ...") > > Looks like you should use sqlContext as other people suggested. > > On Fri, Dec 25, 2015 at 8:29 AM, Eugene Morozov < > evgeny.a.moro...@gmail.com> wrote:

Re: Stuck with DataFrame df.select("select * from table");

2015-12-25 Thread Eugene Morozov
ail.com> > wrote: > >> hello >> you can try to use df.limit(5).show() >> just trick :) >> >> On Fri, Dec 25, 2015 at 2:34 PM, Eugene Morozov < >> evgeny.a.moro...@gmail.com> wrote: >> >>> Hello, I'm basically stuck as I have no idea w

[SparkML] RandomForestModel vs PipelineModel API on a Driver.

2015-12-17 Thread Eugene Morozov
Hi! I'm looking for a way to run prediction for a learned model in the most performant way. It might happen that some users want to predict just a couple of samples (literally one or two), while others would run prediction for tens of thousands. It's not a surprise there is an overhead to

SparkML algos limitations question.

2015-12-14 Thread Eugene Morozov
Hello! I'm currently working on a POC and trying to use Random Forest (classification and regression). I also have to check SVM and Multiclass perceptron (other algos are less important at the moment). So far I've discovered that Random Forest has a maxDepth limitation for trees and just out of

SparkML. RandomForest predict performance for small dataset.

2015-12-09 Thread Eugene Morozov
Hello, I'm using the RandomForest pipeline (ml package). Everything is working fine (learning models, prediction, etc), but I'd like to tune it for the case when I predict with a small dataset. My issue is that when I apply (PipelineModel)model.transform(dataset) The model consists of the following

Re: Spark ML Random Forest output.

2015-12-05 Thread Eugene Morozov
ithTheOriginalLabels) > .setLabels(labelIndexer.labels) > > val pipeline = new Pipeline() > .setStages(Array(labelIndexer, randomForest, labelConverter)) > > Hoping that helps, > Ben. > > On Sat, Dec 5, 2015 at 12:26 PM, Eugene Morozov < > evgeny.a.moro

Re: Spark ML Random Forest output.

2015-12-05 Thread Eugene Morozov
create your own map and reverse map of (label to index) and > (index to label) and use this for getting back your original label. > > May be there is better way to do this.. > > Regards, > Vishnu > > On Fri, Dec 4, 2015 at 4:56 PM, Eugene Morozov <evgeny.a.moro...@gmail.c

Spark ML Random Forest output.

2015-12-04 Thread Eugene Morozov
Hello, I've got an input dataset of handwritten digits and working Java code that uses the random forest classification algorithm to determine the numbers. My test set is just some lines from the same input dataset - just to be sure I'm doing the right thing. My understanding is that having correct

DataFrame Explode for ArrayBuffer[Any]

2015-10-10 Thread Eugene Morozov
Hi, I have a DataFrame with several columns I'd like to explode. Each of the columns I have to explode has an ArrayBuffer of some other type inside. I'd say that the following code is totally legit to use as an explode function for any given ArrayBuffer - my assumption is that for any given

Re: StructType has more rows, than corresponding Row has objects.

2015-10-06 Thread Eugene Morozov
. -- Be well! Jean Morozov On Tue, Oct 6, 2015 at 1:58 AM, Davies Liu <dav...@databricks.com> wrote: > Could you tell us a way to reproduce this failure? Reading from JSON or > Parquet? > > On Mon, Oct 5, 2015 at 4:28 AM, Eugene Morozov > <evgeny.a.moro...@gmail.com> w

StructType has more rows, than corresponding Row has objects.

2015-10-05 Thread Eugene Morozov
Hi, We're building our own framework on top of Spark and we give users a pretty complex schema to work with. That requires us to build dataframes ourselves: we transform business objects to rows and struct types and use these two to create a dataframe. Everything was fine until I started to

DataFrame. SparkPlan / Project serialization issue: ArrayIndexOutOfBounds.

2015-08-21 Thread Eugene Morozov
Hi, I'm using Spark 1.3.1 built against Hadoop 1.0.4 and Java 1.7, and I'm trying to save my data frame to Parquet. The issue I'm stuck on looks like serialization is trying to do a pretty weird thing: write to an empty array. The last (through the stack trace) line of Spark code that leads to

Re: DataFrame column structure change

2015-08-13 Thread Eugene Morozov
(nullable = true) ||-- e: string (nullable = true) help me. Regards, Rishabh. Eugene Morozov fathers...@list.ru

Eviction of RDD persisted on disk

2015-08-13 Thread Eugene Morozov
, stored partitions have to be deleted somehow. How does that happen? -- Eugene Morozov fathers...@list.ru

Re: using Spark or pig group by efficient in my use case?

2015-08-13 Thread Eugene Morozov
Eugene Morozov fathers...@list.ru

Re: Sorted Multiple Outputs

2015-08-12 Thread Eugene Morozov
java.nio.channels.SocketChannel Probably it's hitting a race condition. Has anyone else faced this situation? Any suggestions? Thanks a lot! On 15 July 2015 at 14:04, Eugene Morozov fathers...@list.ru wrote: Yiannis, It looks like you might explore another approach. sc.textFile

Re: Possible issue for Spark SQL/DataFrame

2015-08-12 Thread Eugene Morozov
? Hopefully I have described the issue clearly, and please feel free to correct me if have done something wrong, thanks a lot. Eugene Morozov fathers...@list.ru

Might Spark optimization skip running a transformation?

2015-08-12 Thread Eugene Morozov
Hi! I'd like to perform an action (store / print something) inside a transformation (map or mapPartitions). This approach has some flaws, but there is a question: might it happen that Spark optimises (RDD or DataFrame) processing so that my mapPartitions simply won't happen? -- Eugene Morozov

Re: grouping by a partitioned key

2015-08-11 Thread Eugene Morozov
by a key that I'm already partitioned by? - Philip Eugene Morozov fathers...@list.ru

Re: How to distribute non-serializable object in transform task or broadcast ?

2015-08-07 Thread Eugene Morozov
constructor for the class C and deserialization is broken with an invalid constructor exception. I think it's a common use case. Any help is appreciated. -- Hao Ren Data Engineer @ leboncoin Paris, France Eugene Morozov fathers...@list.ru

Re: Debugging Spark job in Eclipse

2015-08-05 Thread Eugene Morozov
the in-between data values. Regards, Deepesh Eugene Morozov fathers...@list.ru

Re: spark 1.3.1 : unable to access s3n:// urls (no file system for scheme s3n:)

2015-07-22 Thread Eugene Morozov
-- Sujee Maniyam (http://sujee.net | http://www.linkedin.com/in/sujeemaniyam ) Eugene Morozov fathers...@list.ru

Kryo Serializer on Worker doesn't work by default.

2015-07-08 Thread Eugene Morozov
, and a second time in the properties file, which looks weird, and it's unclear to me why I should do that. What is the reason for it? I thought the jar file has to be copied to all Worker nodes (or else it's not possible to run the job on Workers). Can anyone shed some light on this? Thanks -- Eugene

Re: Kryo Serializer on Worker doesn't work by default.

2015-07-08 Thread Eugene Morozov
might explain why KryoRegistrator is not found on the Worker - there are no functions that use it directly, so it's never copied to Workers. Could you please explain how code ends up on a Worker, or give me a hint where I can find it in the sources? On 08 Jul 2015, at 17:40, Eugene Morozov

Spark. Efficiency. toDebugString understanding

2015-06-25 Thread Eugene Morozov
Spark does reshuffle. Why does it do so? Thanks in advance. -- Eugene Morozov fathers...@list.ru

DataFrame nested sctructure selection limit

2015-05-28 Thread Eugene Morozov
) at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:67) Eugene Morozov fathers...@list.ru

DataFrame.explode produces field with wrong type.

2015-05-26 Thread Eugene Morozov
. And unfortunately it's not possible to cast this column, as casting string to struct is not allowed. Are there any workarounds to get the correct schema? Thanks in advance. Eugene Morozov fathers...@list.ru

Spark SQL: SchemaRDD, DataFrame. Multi-value, Nested attributes

2015-04-22 Thread Eugene Morozov
be implementation of DataFrame itself provides some sort of custom types or something pluggable that I might consider. Any clue would be really appreciated. Thanks -- Eugene Morozov fathers...@list.ru