unsubscribe

2016-10-02 Thread Nikos Viorres

Re: Loading data into Hbase table throws NoClassDefFoundError: org/apache/htrace/Trace error

2016-10-02 Thread Mich Talebzadeh
Thanks Ben. The thing is, I am using Spark 2 and no stack from CDH! Is this approach to reading/writing to HBase specific to Cloudera? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Spark ML Decision Trees Algorithm

2016-10-02 Thread Yan Facai
Perhaps the best way is to read the code. The decision tree is implemented as a 1-tree random forest, whose entry point is the `run` method: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L88 I'm not familiar with the so-called algorithm

Re: Dataframe, Java: How to convert String to Vector ?

2016-10-02 Thread Yan Facai
Hi, Peter. It's interesting that `DecisionTreeRegressor.transformImpl` also uses a udf to transform the dataframe, instead of using map: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala#L175 On Wed, Sep 21, 2016 at 10:22 PM, P

Partitioned windows in spark streaming

2016-10-02 Thread Adrienne Kole
Hi, Does Spark 2.0.0 support partitioned windows in streaming? Cheers Adrienne

Re: use CrossValidatorModel for prediction

2016-10-02 Thread Pengcheng Luo
> On Oct 2, 2016, at 1:04 AM, Pengcheng wrote: > > Dear Spark Users, > > I was wondering. > > I have a trained crossvalidator model > model: CrossValidatorModel > > I want to predict a score for features: RDD[Features] > > Right now I have to convert features to dataframe and then perform

Re: DataFrame Sort gives Cannot allocate a page with more than 17179869176 bytes

2016-10-02 Thread Babak Alipour
Thanks Vadim for sharing your experience, but I have tried a multi-JVM setup (2 workers), various sizes for spark.executor.memory (8g, 16g, 20g, 32g, 64g) and spark.executor.cores (2-4), same error all along. As for the files, these are all .snappy.parquet files, resulting from inserting some data fr

statistical theory behind estimating the number of total tasks in GroupedSumEvaluator.scala

2016-10-02 Thread philipghu
Hi, I've been struggling to understand the statistical theory behind this piece of code (from /core/src/main/scala/org/apache/spark/partial/GroupedSumEvaluator.scala) below, especially with respect to estimating the size of the population (total tasks) and its variance. Also I'm trying to under
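The core idea behind that estimator can be sketched without Spark: treat the completed tasks as a sample of all tasks, scale the sample mean up to the full task count, and derive a confidence interval from the sample variance. This is a simplified illustration, not Spark's exact code — GroupedSumEvaluator additionally uses the t-distribution for small samples, and all names below are made up for the example.

```python
import math

def estimate_total_sum(partial_sums, total_tasks, z=1.96):
    """Estimate the grand total from the sums of already-completed tasks.

    Assumes the completed tasks are a representative sample (n >= 2):
    the estimated total is (mean per-task sum) * total_tasks, and the
    standard error of that estimate is total_tasks * sqrt(var / n),
    where var is the sample variance of the per-task sums.
    """
    n = len(partial_sums)
    mean = sum(partial_sums) / n
    # unbiased sample variance of the per-task sums
    var = sum((x - mean) ** 2 for x in partial_sums) / (n - 1)
    estimate = mean * total_tasks
    # standard error of the scaled-up total
    std_err = total_tasks * math.sqrt(var / n)
    return estimate, (estimate - z * std_err, estimate + z * std_err)

# 3 of 100 tasks done, with per-task sums 10, 12, 11:
est, (low, high) = estimate_total_sum([10.0, 12.0, 11.0], 100)
# est is 11.0 * 100 = 1100.0, bracketed by a ~95% interval
```

The real evaluator refines this with a finite-population correction (the variance shrinks as the fraction of completed tasks approaches 1), which this sketch omits for clarity.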

Re: Pls assist: Spark 2.0 build failure on Ubuntu 16.06

2016-10-02 Thread Marco Mistroni
Hi Sean, thanks. I managed to build Spark 2 (it was actually 2.1, not 2.0... I'm sourcing it from here (git clone git://github.com/apache/spark.git)). Now, I managed to build it but I had to - use Java 1.7 along with MAVEN_OPTS (using Java 1.8 sends the whole process into insufficient memory for the JVM w

unsubscribe

2016-10-02 Thread Qing Lin

Spark Streaming: How to load a Pipeline on a Stream?

2016-10-02 Thread manueslapera
I am implementing a lambda architecture system for stream processing. I have no issue creating a Pipeline with GridSearch in Spark batch: pipeline = Pipeline(stages=[data1_indexer, data2_indexer, ..., assembler, logistic_regressor]) paramGrid = ( ParamGridBuilder() .add

Re: Setting conf options in jupyter

2016-10-02 Thread Kabeer Ahmed
William: Try something based on the lines below. You should get rid of the error that you reported. HTH, scala> sc.stop scala> :paste // Entering paste mode (ctrl-D to finish) import org.apache.spark._ val sc = new SparkContext( new SparkConf().setAppName("bar")
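An alternative to stopping and recreating the SparkContext inside the notebook is to set the options before launch, e.g. in conf/spark-defaults.conf. The property names below are standard Spark configuration keys; the values are purely illustrative:

```properties
# illustrative values; adjust for your cluster
spark.app.name        bar
spark.master          local[4]
spark.executor.memory 4g
```

Options set this way are picked up when the context is first created, so no restart of sc is needed in the session.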

unsubscribe

2016-10-02 Thread Moses Oduma
unsubscribe

filtering in SparkR

2016-10-02 Thread Yogesh Vyas
Hi, I have two SparkDataFrames, df1 and df2. Their schemas are as follows: df1=>SparkDataFrame[id:double, c1:string, c2:string] df2=>SparkDataFrame[id:double, c3:string, c4:string] I want to filter out rows from df1 where df1$id does not match df2$id. I tried some expression: filter(df1,!(df1$id
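What is being asked for here is an anti-join: keep the rows of df1 whose id has no match in df2 (in SparkR this is typically expressed via a join rather than a row-by-row filter; the exact join-type names vary by Spark version). The set semantics can be sketched in plain Python with illustrative data:

```python
def anti_join(left_rows, right_ids, key="id"):
    """Keep rows of left_rows whose key does NOT appear in right_ids
    (the semantics of a left anti join)."""
    excluded = set(right_ids)
    return [row for row in left_rows if row[key] not in excluded]

# toy stand-ins for df1 and df2's id column
df1_rows = [{"id": 1.0, "c1": "a"}, {"id": 2.0, "c1": "b"}, {"id": 3.0, "c1": "c"}]
df2_ids = [2.0]

# rows of df1 whose id has no match in df2: ids 1.0 and 3.0 survive
result = anti_join(df1_rows, df2_ids)
```

An expression like `filter(df1, !(df1$id ...))` compares columns positionally within one dataframe, which is why it doesn't capture this cross-dataframe membership test.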

Re: Loading data into Hbase table throws NoClassDefFoundError: org/apache/htrace/Trace error

2016-10-02 Thread Benjamin Kim
We installed Apache Spark 1.6.0 alongside CDH 5.4.8 because Cloudera only had Spark 1.3.0 at the time, and we wanted to use Spark 1.6.0's features. We borrowed the /etc/spark/conf/spark-env.sh file that Cloudera generated because it was customized to add jars first from paths listed

Fwd: filtering in SparkR

2016-10-02 Thread Yogesh Vyas
Hi, I have two SparkDataFrames, df1 and df2. Their schemas are as follows: df1=>SparkDataFrame[id:double, c1:string, c2:string] df2=>SparkDataFrame[id:double, c3:string, c4:string] I want to filter out rows from df1 where df1$id does not match df2$id. I tried some expression: filter(df1,!(df1$id