[Pyspark mllib] RowMatrix.columnSimilarities losing spark context?

2018-05-31 Thread pchu
I'm getting a strange error when I try to use the result of a RowMatrix.columnSimilarities call in pyspark. Hoping to get a second opinion. I'm somewhat new to spark - to me it looks like the RDD behind the CoordinateMatrix returned by columnSimilarities() doesn't have a handle on the spark
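For reference, a minimal sketch of the API pattern being described (not the poster's code): building a RowMatrix in PySpark, calling columnSimilarities(), and reading the entries of the resulting CoordinateMatrix. The input data is made up.

    from pyspark.sql import SparkSession
    from pyspark.mllib.linalg.distributed import RowMatrix

    spark = SparkSession.builder.appName("column-similarities-demo").getOrCreate()
    sc = spark.sparkContext

    # toy data: an RDD of rows (each row is one observation, columns are features)
    rows = sc.parallelize([[1.0, 2.0, 3.0],
                           [4.0, 5.0, 6.0],
                           [7.0, 8.0, 9.0]])
    mat = RowMatrix(rows)

    # columnSimilarities() returns a CoordinateMatrix; .entries is an RDD of MatrixEntry
    sims = mat.columnSimilarities()
    for entry in sims.entries.collect():
        print(entry.i, entry.j, entry.value)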

Spark Task Failure due to OOM and subsequently task finishes

2018-05-31 Thread sparknewbie1
When running a Spark job, oftentimes some tasks fail for stage X with OOM, yet the same task for the same stage eventually succeeds when relaunched, and stage X and the job complete successfully. One thing I can think of is, say there are 2 cores per executor and say executor memory of 8G, so initially the task got
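As a point of reference only, a minimal sketch of the executor sizing described above (2 cores per executor, 8G of executor memory); the values are the poster's example numbers, not a recommendation. With 2 concurrent tasks per executor, each first-attempt task shares that 8G, which is one common reason a task OOMs initially but succeeds when retried with less contention.

    from pyspark.sql import SparkSession

    # illustrative settings only - matches the 2-core / 8G executor example above;
    # these are normally set via spark-submit or spark-defaults, shown here via the builder
    spark = (SparkSession.builder
             .appName("oom-tuning-sketch")
             .config("spark.executor.cores", "2")
             .config("spark.executor.memory", "8g")
             .getOrCreate())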

REMINDER: Apache EU Roadshow 2018 in Berlin is less than 2 weeks away!

2018-05-31 Thread sharan
Hello Apache Supporters and Enthusiasts This is a reminder that our Apache EU Roadshow in Berlin is less than two weeks away and we need your help to spread the word. Please let your work colleagues, friends and anyone interested in attending know about our Apache EU Roadshow event. We

Is Spark DataFrame limit function action or transformation?

2018-05-31 Thread unk1102
Is the Spark DataFrame limit function an action or a transformation? It returns a DataFrame, so it should be a transformation, but it executes the entire DAG, so I think it is an action. The same goes for the persist function. Please guide. Thanks in advance. -- Sent from:
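A small sketch that probes the question directly: limit() by itself schedules no job (it is a transformation that returns a new DataFrame); work only happens once an action such as collect() is called on the result. The dataset here is made up.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("limit-lazy-check").getOrCreate()

    df = spark.range(0, 1000000)
    limited = df.limit(10)   # no job appears in the Spark UI yet - limit is a transformation
    limited.explain()        # prints the plan, which includes the limit
    rows = limited.collect() # the action: only now is a job actually run
    print(len(rows))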

Re: Apache Spark Installation error

2018-05-31 Thread Irving Duran
You probably want "spark-shell" to be recognized as a command in your environment. Maybe try "sudo ln -s /path/to/spark-shell /usr/bin/spark-shell". Have you tried "./spark-shell" in the current path to see if it works? Thank You, Irving Duran On Thu, May 31, 2018 at 9:00 AM Remil Mohanan wrote:

Fwd: [Help] PySpark Dynamic mean calculation

2018-05-31 Thread Aakash Basu
Solved it myself. In case anyone needs it, the code can be re-used. orig_list = ['Married-spouse-absent', 'Married-AF-spouse', 'Separated', 'Married-civ-spouse', 'Widowed', 'Divorced', 'Never-married'] k_folds = 3 cols = df.columns # ['fnlwgt_bucketed', 'Married-spouse-absent_fold_0',
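Since the code above is cut off in the archive, here is a hedged reconstruction of the general idea rather than the poster's actual solution: build the per-row fold means dynamically from the column names instead of hard-coding each column. The column names and fold count are taken from the question below; everything else is an assumption.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("dynamic-fold-means").getOrCreate()

    df = spark.createDataFrame(
        [(1, 1, 2, 3, 4, 5, 6), (2, 7, 5, 3, 5, 2, 1)],
        ["key", "a_fold_0", "b_fold_0", "a_fold_1", "b_fold_1", "a_fold_2", "b_fold_2"])

    k_folds = 3
    bases = ["a", "b"]

    # build one mean expression per base column, averaging its fold columns
    mean_exprs = []
    for b in bases:
        fold_cols = [F.col("%s_fold_%d" % (b, i)) for i in range(k_folds)]
        total = fold_cols[0]
        for c in fold_cols[1:]:
            total = total + c
        mean_exprs.append((total / k_folds).alias("%s_mean" % b))

    df.select("key", *mean_exprs).show()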

[Help] PySpark Dynamic mean calculation

2018-05-31 Thread Aakash Basu
Hi, Using - Python 3.6, Spark 2.3. Original DF -

    key  a_fold_0  b_fold_0  a_fold_1  b_fold_1  a_fold_2  b_fold_2
    1    1         2         3         4         5         6
    2    7         5         3         5         2         1

I want to calculate means from the below dataframe as follows (like this for all columns and all folds) - key a_fold_0 b_fold_0 a_fold_1 b_fold_1 a_fold_2

[Suggestions needed] Weight of Evidence PySpark

2018-05-31 Thread Aakash Basu
Hi guys, I'm trying to calculate WoE on a particular categorical column depending on the target column. But the code is taking a lot of time on very few datapoints (rows). How can I optimize it to make it performant enough? Here's the code (here categorical_col is a python list of columns) -
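In case it helps, a hedged sketch of a set-based way to compute WoE per category with a single groupBy instead of per-category loops (looping over collected category values is usually what makes such code slow). The column names cat_col and target, and the 0/1 binary target, are illustrative assumptions, not taken from the original code.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("woe-sketch").getOrCreate()

    df = spark.createDataFrame(
        [("a", 1), ("a", 0), ("b", 1), ("b", 1), ("c", 0)],
        ["cat_col", "target"])

    # overall event / non-event totals
    totals = df.agg(F.sum("target").alias("events"),
                    F.sum(1 - F.col("target")).alias("non_events")).collect()[0]

    # WoE per category: ln( (events_i / total_events) / (non_events_i / total_non_events) )
    woe = (df.groupBy("cat_col")
             .agg(F.sum("target").alias("events"),
                  F.sum(1 - F.col("target")).alias("non_events"))
             .withColumn("woe",
                         F.log((F.col("events") / F.lit(float(totals["events"]))) /
                               (F.col("non_events") / F.lit(float(totals["non_events"]))))))
    woe.show()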

Re: Fastest way to drop useless columns

2018-05-31 Thread devjyoti patra
One thing that we do on our datasets is: 1. Take 'n' random samples of equal size. 2. Check whether the distribution is heavily skewed for one key in your samples; the way we define "heavy skewness" is that the mean is more than one standard deviation away from the median. In your case, you can drop this column.
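A hedged sketch of the heuristic described above, for numeric columns: on a random sample, flag a column when its mean is more than one standard deviation away from its (approximate) median. The sample fraction, the function name, and the assumption that df and the list of numeric columns already exist are all illustrative.

    from pyspark.sql import functions as F

    def flag_skewed_columns(df, numeric_cols, fraction=0.1):
        sample = df.sample(False, fraction, 42)
        flagged = []
        for c in numeric_cols:
            stats = sample.agg(F.mean(c).alias("mean"),
                               F.stddev(c).alias("std")).collect()[0]
            median = sample.approxQuantile(c, [0.5], 0.01)[0]
            if stats["std"] is not None and abs(stats["mean"] - median) > stats["std"]:
                flagged.append(c)
        return flagged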

Re: Fastest way to drop useless columns

2018-05-31 Thread julio . cesare
I believe this only works when we need to drop duplicate ROWS. Here I want to drop cols which contain only one unique value. On 2018-05-31 11:16, Divya Gehlot wrote: you can try dropduplicate function

Re: Fastest way to drop useless columns

2018-05-31 Thread Divya Gehlot
you can try dropduplicate function https://github.com/spirom/LearningSpark/blob/master/src/main/scala/dataframe/DropDuplicates.scala On 31 May 2018 at 16:34, wrote: > Hi there ! > > I have a potentially large dataset ( regarding number of rows and cols ) > > And I want to find the fastest way

Re: Fastest way to drop useless columns

2018-05-31 Thread Anastasios Zouzias
Hi Julien, One quick and easy-to-implement idea is to use sampling on your dataset, i.e., sample a large enough subset of your data and test whether some columns contain only a single unique value. Repeat the process a few times and then do the full test on the surviving columns. This will allow you to
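A hedged sketch of that idea: test candidate columns on small samples first, and only run the full distinct-count check on the columns that still look constant. The function name, sample fraction, and number of repeats are placeholders; df is assumed to be the input DataFrame.

    from pyspark.sql import functions as F

    def constant_columns(df, fraction=0.01, repeats=3):
        candidates = set(df.columns)
        for _ in range(repeats):
            sample = df.sample(False, fraction)
            counts = sample.agg(*[F.countDistinct(c).alias(c) for c in candidates]).collect()[0]
            candidates = {c for c in candidates if counts[c] <= 1}
            if not candidates:
                return []
        # full pass only over the surviving candidates
        counts = df.agg(*[F.countDistinct(c).alias(c) for c in candidates]).collect()[0]
        return [c for c in candidates if counts[c] <= 1]

    # usage: df_clean = df.drop(*constant_columns(df))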

Fastest way to drop useless columns

2018-05-31 Thread julio . cesare
Hi there! I have a potentially large dataset (regarding number of rows and cols) and I want to find the fastest way to drop some useless cols for me, i.e. cols containing only a single unique value! I want to know what you think I could do to do this as fast as possible using Spark.
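For comparison, a minimal sketch of the direct approach: a single full scan that counts distinct values per column and drops the columns with at most one value. df is assumed to be the loaded DataFrame; the sampling variant sketched under Anastasios' reply above is a way to avoid running this full scan over every column.

    from pyspark.sql import functions as F

    distinct_counts = df.agg(
        *[F.countDistinct(c).alias(c) for c in df.columns]).collect()[0]
    useless = [c for c in df.columns if distinct_counts[c] <= 1]
    df_clean = df.drop(*useless)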

[PySpark Pipeline XGboost] How to use XGboost in PySpark Pipeline

2018-05-31 Thread Daniel Du
Dear all, I want to update my PySpark code. In PySpark, the base model must be put in a pipeline; the official pipeline demo uses LogisticRegression as the base model. However, it does not seem possible to use an XGBoost model in the pipeline API. How can I use pyspark like this: from
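Since the demo being referenced is the standard pyspark.ml pipeline, here is a hedged sketch of that pattern with LogisticRegression as the final stage. PySpark itself does not ship an XGBoost estimator; a Spark-compatible XGBoost wrapper (e.g. around xgboost4j-spark) would have to take the place of the classifier stage, so that substitution is only noted in a comment. Column names are assumptions.

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler

    assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
    clf = LogisticRegression(featuresCol="features", labelCol="label")
    # an XGBoost-on-Spark estimator implementing the ml Estimator interface would go here instead of clf

    pipeline = Pipeline(stages=[assembler, clf])
    # model = pipeline.fit(train_df)  # train_df assumed to have f1, f2 and label columns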