Snappy initialization issue, spark assembly jar missing snappy classes?

2016-07-20 Thread Eugene Morozov
Greetings! We're reading input files with newApiHadoopFile that is configured with multiline split. Everything's fine, besides https://issues.apache.org/jira/browse/MAPREDUCE-6549. It looks like the issue is fixed, but within hadoop 2.7.2. Which means we have to download spark without hadoop and

SparkML RandomForest java.lang.StackOverflowError

2016-03-29 Thread Eugene Morozov
Hi, I have a web service that provides a REST API to train the random forest algo. I train random forest on a 5-node Spark cluster with enough memory - everything is cached (~22 GB). On small datasets of up to 100k samples everything is fine, but with the biggest one (400k samples and ~70k features)

Re: SparkML algos limitations question.

2016-03-21 Thread Eugene Morozov
> Joseph > > On Mon, Dec 14, 2015 at 10:52 AM, Eugene Morozov <evgeny.a.moro...@gmail.com> wrote: > >> Hello! >> >> I'm currently working on POC and try to use Random Forest (classification and regression). I also have to check SVM and Mul

Dynamic allocation availability on standalone mode. Misleading doc.

2016-03-07 Thread Eugene Morozov
Hi, the feature looks like the one I'd like to use, but there are two different descriptions in the docs of whether it's available. I'm on a standalone deployment mode and here: http://spark.apache.org/docs/latest/configuration.html it's specified the feature is available only for YARN, but here:

Typo in community databricks cloud docs

2016-03-05 Thread Eugene Morozov
Hi, I'm not sure where to put this, but I've found a typo on a page Caching: " - *Serialization:* The default serialization in Spark is Java serialization. However for better peformance, we recommend Kyro serialization, which you can learn more about here

Re: Usage of SparkContext within a Web container

2016-01-14 Thread Eugene Morozov
Praveen, Zeppelin uses Spark's REPL. I'm currently writing an app that is a web service, which is going to run Spark jobs. So, at the init stage I just create a JavaSparkContext and then use it for all users' requests. The web service is stateless. The issue with stateless is that it's possible to run

A bug in Spark ML? NoSuchElementException while using RandomForest for regression.

2015-12-16 Thread Eugene Morozov
Hi! I've looked through the issues and haven't found anything like that, so I've created a new one. Everything to reproduce is attached to it: https://issues.apache.org/jira/browse/SPARK-12367 Could you please take a look and, if possible, advise any workaround? Thank you in advance. -- Be well!

SparkML algos limitations question.

2015-12-14 Thread Eugene Morozov
Hello! I'm currently working on POC and try to use Random Forest (classification and regression). I also have to check SVM and Multiclass perceptron (other algos are less important at the moment). So far I've discovered that Random Forest has a limitation of maxDepth for trees and just out of

Re: StructType has more rows, than corresponding Row has objects.

2015-10-06 Thread Eugene Morozov
. -- Be well! Jean Morozov On Tue, Oct 6, 2015 at 1:58 AM, Davies Liu <dav...@databricks.com> wrote: > Could you tell us a way to reproduce this failure? Reading from JSON or > Parquet? > > On Mon, Oct 5, 2015 at 4:28 AM, Eugene Morozov > <evgeny.a.moro...@gmail.com> w

StructType has more rows, than corresponding Row has objects.

2015-10-05 Thread Eugene Morozov
Hi, We're building our own framework on top of Spark and we give users a pretty complex schema to work with. That requires us to build DataFrames ourselves: we transform business objects into rows and struct types and use the two to create a DataFrame. Everything was fine until I started to

DataFrame. SparkPlan / Project serialization issue: ArrayIndexOutOfBounds.

2015-08-21 Thread Eugene Morozov
Hi, I'm using Spark 1.3.1 built against Hadoop 1.0.4 and Java 1.7 and I'm trying to save my data frame to Parquet. The issue I'm stuck on looks like serialization doing a pretty weird thing: it tries to write to an empty array. The last line of Spark code in the stack trace that leads to

Does Spark optimization might miss to run transformation?

2015-08-12 Thread Eugene Morozov
Hi! I’d like to complete an action (store / print something) inside a transformation (map or mapPartitions). This approach has some flaws, but there is a question. Might it happen that Spark will optimise (RDD or DataFrame) processing so that my mapPartitions simply won’t happen? -- Eugene Morozov
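
As a rough illustration of the concern in the question above, here is a plain-Python analogy, not Spark code. Spark transformations are lazy and run only when an action needs their output, so a side effect placed inside map or mapPartitions fires only for the data that is actually computed. The generator below mimics that behaviour. The function name `lazy_map_with_side_effect` and the `log` list are hypothetical stand-ins, not part of any Spark API.

```python
# Plain-Python analogy (NOT Spark): a lazy pipeline runs its side effects
# only for the elements that a downstream consumer actually pulls.
from itertools import islice


def lazy_map_with_side_effect(records, log):
    """Hypothetical stand-in for a map/mapPartitions function with a side effect."""
    for record in records:
        log.append(record)  # the "store / print" side effect
        yield record * 2


log = []
pipeline = lazy_map_with_side_effect(range(5), log)

# Building the pipeline performs no work, so no side effect has fired yet.
assert log == []

# Consuming only two elements triggers the side effect only twice.
first_two = list(islice(pipeline, 2))
```

In this sketch the remaining three elements are never processed, so their side effects never happen. The analogous risk in Spark is that work nobody asks for is not computed, which is one reason to keep side effects in explicit actions rather than inside transformations.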

Re: KryoSerializer gives class cast exception

2015-07-20 Thread Eugene Morozov
more previous discussions RE: Kryo upgrade. Anyhow, I'm not sure what the right solution is yet, but just wanted to link to some previous context / discussions. - Josh On Thu, Jul 16, 2015 at 7:57 AM, Eugene Morozov fathers...@list.ru wrote: Hi, some time ago we’ve found that it’s

KryoSerializer gives class cast exception

2015-07-16 Thread Eugene Morozov
. Thanks. -- Eugene Morozov fathers...@list.ru

Re: question related partitions of the DataFrame

2015-07-14 Thread Eugene Morozov
. Eugene Morozov fathers...@list.ru

Code movements from Driver to Workers

2015-07-08 Thread Eugene Morozov
class not found. Is my “new” understanding correct? Could you please explain in a couple of words how code is moved from the Driver to the Workers? Could you give me a hint of where to find this in the sources? Thanks in advance. -- Eugene Morozov fathers...@list.ru