[PySpark MLlib] RowMatrix.columnSimilarities losing Spark context?
I'm getting a strange error when I try to use the result of a RowMatrix.columnSimilarities call in PySpark, and I'm hoping for a second opinion. I'm somewhat new to Spark; to me it looks like the RDD behind the CoordinateMatrix returned by columnSimilarities() doesn't have a handle on the SparkContext. Is there something I'm missing, or might there be a bug in how the result is translated back to Python from the JVM?

I found a related post on StackOverflow, but no responses yet:
https://stackoverflow.com/questions/44929009/collecting-pyspark-matrixs-entries-raise-a-weird-error-when-run-in-test?rq=1

Here's the PySpark documentation on columnSimilarities() (it's just a Java/Scala function call):
http://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/linalg/distributed.html#RowMatrix.columnSimilarities

This snippet should reproduce the issue:

    from pyspark.mllib.linalg.distributed import RowMatrix

    rows = spark.sparkContext.parallelize([[0, 1, 2], [1, 1, 1]])
    matrix = RowMatrix(rows)
    sims = matrix.columnSimilarities()

    print(sims.numRows(), sims.numCols())  # Prints correctly: "3 3"
    print(sims.entries.collect())  # Error: 'NoneType' object has no attribute 'setCallSite'

Full stack trace of the error:

    AttributeError                            Traceback (most recent call last)
    in ()
    ----> 1 sims.entries.collect()

    /usr/lib/spark/python/pyspark/rdd.py in collect(self)
        821         to be small, as all the data is loaded into the driver's memory.
        822         """
    --> 823         with SCCallSiteSync(self.context) as css:
        824             port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
        825             return list(_load_from_socket(port, self._jrdd_deserializer))

    /usr/lib/spark/python/pyspark/traceback_utils.py in __enter__(self)
         70     def __enter__(self):
         71         if SCCallSiteSync._spark_stack_depth == 0:
    ---> 72             self._context._jsc.setCallSite(self._call_site)
         73         SCCallSiteSync._spark_stack_depth += 1
         74
    AttributeError: 'NoneType' object has no attribute 'setCallSite'

--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
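As a side note for anyone reproducing this: columnSimilarities() returns the upper triangle of the column-wise cosine-similarity matrix, so once the collect() issue is worked around, the entries for the 2x3 matrix above can be checked against a plain-Python computation (this is a sketch of the expected values, not Spark code):

```python
import math

rows = [[0, 1, 2], [1, 1, 1]]
cols = list(zip(*rows))  # columns of the 2x3 matrix: (0,1), (1,1), (2,1)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Upper-triangle entries (i < j), matching CoordinateMatrix entries (i, j, value)
sims = {(i, j): cosine(cols[i], cols[j])
        for i in range(len(cols)) for j in range(i + 1, len(cols))}
print({k: round(v, 4) for k, v in sims.items()})
# {(0, 1): 0.7071, (0, 2): 0.4472, (1, 2): 0.9487}
```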
Spark task fails with OOM but subsequently succeeds on relaunch
When running a Spark job, some tasks in stage X often fail with OOM, yet the same task for the same stage eventually succeeds when relaunched, and stage X and the job complete successfully.

One explanation I can think of: say there are 2 cores per executor and an executor memory of 8 GB. Initially a task got OOM because the 2 tasks running on the executor together needed more than 8 GB, but when the task was relaunched on that executor no other task was running, so it could finish successfully.

https://stackoverflow.com/questions/48532836/spark-memory-management-for-oom

However, I have not found a very clear answer there.
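The 2-cores / 8 GB reasoning above can be made concrete with Spark's unified memory model defaults (spark.memory.fraction = 0.6 and roughly 300 MB of reserved memory in Spark 2.x). The arithmetic below is a rough estimate, not exact allocator behaviour: each of two concurrently running tasks is only guaranteed half of the unified region, while a relaunched task running alone can claim all of it.

```python
# Rough per-task memory arithmetic under Spark's unified memory model
# (Spark 2.x defaults: spark.memory.fraction = 0.6, ~300 MB reserved).
heap_mb = 8 * 1024              # executor heap: 8 GB
reserved_mb = 300               # fixed reserved memory
fraction = 0.6                  # spark.memory.fraction

unified_mb = (heap_mb - reserved_mb) * fraction
per_task_min_mb = unified_mb / 2   # guaranteed share with 2 concurrent tasks
per_task_max_mb = unified_mb       # available when the relaunched task runs alone

print(round(unified_mb))       # ~4735 MB shared execution/storage region
print(round(per_task_min_mb))  # ~2368 MB guaranteed per task while 2 are running
```

So a task needing, say, 3 GB of execution memory can OOM while sharing the executor but succeed on relaunch once it runs alone.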
REMINDER: Apache EU Roadshow 2018 in Berlin is less than 2 weeks away!
Hello Apache Supporters and Enthusiasts,

This is a reminder that our Apache EU Roadshow in Berlin is less than two weeks away, and we need your help to spread the word. Please let your work colleagues, friends and anyone interested in attending know about our Apache EU Roadshow event.

We have a great schedule, including tracks on Apache Tomcat, Apache HTTP Server, Microservices, Internet of Things (IoT) and Cloud Technologies. You can find more details at the link below:

https://s.apache.org/0hnG

Ticket prices will be going up on 8th June 2018, so please make sure that you register soon if you want to beat the price increase.

https://foss-backstage.de/tickets

Remember that registering for the Apache EU Roadshow also gives you access to FOSS Backstage, so you can attend any talks and workshops from both conferences. And don't forget that our Apache Lounge will be open throughout the whole conference as a place to meet up, hack and relax.

We look forward to seeing you in Berlin!

Thanks,
Sharan Foga, VP Apache Community Development

http://apachecon.com/
@apachecon

PLEASE NOTE: You are receiving this message because you are subscribed to a user@ or dev@ list of one or more Apache Software Foundation projects.
Is the Spark DataFrame limit function an action or a transformation?
Is the Spark DataFrame limit function an action or a transformation? It returns a DataFrame, so it should be a transformation, but it executes the entire DAG, so I think it is an action. The same question applies to the persist function. Please guide. Thanks in advance.
Re: Apache Spark Installation error
You probably want "spark-shell" to be recognized as a command in your environment. Maybe try "sudo ln -s /path/to/spark-shell /usr/bin/spark-shell".

Have you tried running "./spark-shell" from the directory that contains it, to see if it works?

Thank You,
Irving Duran

On Thu, May 31, 2018 at 9:00 AM Remil Mohanan wrote:
> Hi there,
>
> I am not able to execute the spark-shell command. Can you please help.
>
> Thanks
>
> Remil
Fwd: [Help] PySpark Dynamic mean calculation
Solved it myself. In case anyone needs it, the code below can be reused.

    orig_list = ['Married-spouse-absent', 'Married-AF-spouse', 'Separated',
                 'Married-civ-spouse', 'Widowed', 'Divorced', 'Never-married']
    k_folds = 3
    cols = df.columns
    # ['fnlwgt_bucketed',
    #  'Married-spouse-absent_fold_0', 'Married-AF-spouse_fold_0', 'Separated_fold_0',
    #  'Married-civ-spouse_fold_0', 'Widowed_fold_0', 'Divorced_fold_0', 'Never-married_fold_0',
    #  'Married-spouse-absent_fold_1', 'Married-AF-spouse_fold_1', 'Separated_fold_1',
    #  'Married-civ-spouse_fold_1', 'Widowed_fold_1', 'Divorced_fold_1', 'Never-married_fold_1',
    #  'Married-spouse-absent_fold_2', 'Married-AF-spouse_fold_2', 'Separated_fold_2',
    #  'Married-civ-spouse_fold_2', 'Widowed_fold_2', 'Divorced_fold_2', 'Never-married_fold_2']

    for folds in range(k_folds):
        for column in orig_list:
            col_namer = []
            for fold in range(k_folds):
                if fold != folds:
                    col_namer.append(column + '_fold_' + str(fold))
            df = df.withColumn(column + '_fold_' + str(folds) + '_mean',
                               sum(df[col] for col in col_namer) / (k_folds - 1))
            print(col_namer)

    df.show(1)

-- Forwarded message --
From: Aakash Basu
Date: Thu, May 31, 2018 at 3:40 PM
Subject: [Help] PySpark Dynamic mean calculation
To: user

Hi,

Using:
Python 3.6
Spark 2.3

Original DF:

    key  a_fold_0  b_fold_0  a_fold_1  b_fold_1  a_fold_2  b_fold_2
    1    1         2         3         4         5         6
    2    7         5         3         5         2         1

I want to calculate means as follows (like this for all columns and all folds):

    key  a_fold_0  b_fold_0  a_fold_1  b_fold_1  a_fold_2  b_fold_2  a_fold_0_mean  b_fold_0_mean  a_fold_1_mean
    1    1         2         3         4         5         6         (3 + 5) / 2    (4 + 6) / 2    (1 + 5) / 2
    2    7         5         3         5         2         1         (3 + 2) / 2    (5 + 1) / 2    (7 + 2) / 2

Process:
For fold_0 my mean should be (fold_1 + fold_2) / 2
For fold_1 my mean should be (fold_0 + fold_2) / 2
For fold_2 my mean should be (fold_0 + fold_1) / 2

For each column. And my number of columns, number of folds, everything would be dynamic.

How to go about this problem on a PySpark dataframe?

Thanks,
Aakash.
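A quick plain-Python check of the fold-mean logic, using the 'a' column of key 1 from the example dataframe (a_fold_0 = 1, a_fold_1 = 3, a_fold_2 = 5): the mean for each fold averages the values of all the other folds.

```python
k_folds = 3
# values of column 'a' for key 1, indexed by fold number
a = {0: 1, 1: 3, 2: 5}

fold_means = {
    f: sum(v for g, v in a.items() if g != f) / (k_folds - 1)
    for f in range(k_folds)
}
print(fold_means)  # {0: 4.0, 1: 3.0, 2: 2.0}
```

This matches the table: a_fold_0_mean for key 1 is (3 + 5) / 2 = 4.0.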
[Help] PySpark Dynamic mean calculation
Hi,

Using:
Python 3.6
Spark 2.3

Original DF:

    key  a_fold_0  b_fold_0  a_fold_1  b_fold_1  a_fold_2  b_fold_2
    1    1         2         3         4         5         6
    2    7         5         3         5         2         1

I want to calculate means as follows (like this for all columns and all folds):

    key  a_fold_0  b_fold_0  a_fold_1  b_fold_1  a_fold_2  b_fold_2  a_fold_0_mean  b_fold_0_mean  a_fold_1_mean
    1    1         2         3         4         5         6         (3 + 5) / 2    (4 + 6) / 2    (1 + 5) / 2
    2    7         5         3         5         2         1         (3 + 2) / 2    (5 + 1) / 2    (7 + 2) / 2

Process:
For fold_0 my mean should be (fold_1 + fold_2) / 2
For fold_1 my mean should be (fold_0 + fold_2) / 2
For fold_2 my mean should be (fold_0 + fold_1) / 2

For each column. And my number of columns, number of folds, everything would be dynamic.

How to go about this problem on a PySpark dataframe?

Thanks,
Aakash.
[Suggestions needed] Weight of Evidence PySpark
Hi guys,

I'm trying to calculate WoE (Weight of Evidence) on a particular categorical column depending on the target column. But the code is taking a lot of time on very few datapoints (rows). How can I optimize it to make it performant enough?

Here's the code (here categorical_col is a Python list of columns):

    for item in categorical_col:
        new_df = spark.sql('Select `' + item + '`, `' + target_col + '`, count(*) as Counts from a ' +
                           'group by `' + item + '`, `' + target_col + '` order by `' + item + '`')
        # new_df.show()
        new_df.registerTempTable('b')
        # exit(0)
        new_df2 = spark.sql('Select `' + item + '`, ' +
                            'case when `' + target_col + '` == 0 then Counts else 0 end as Count_0, ' +
                            'case when `' + target_col + '` == 1 then Counts else 0 end as Count_1 ' +
                            'from b')
        spark.catalog.dropTempView('b')
        # new_df2.show()
        new_df2.registerTempTable('c')
        # exit(0)
        new_df3 = spark.sql('SELECT `' + item + '`, SUM(Count_0) AS Count_0, ' +
                            'SUM(Count_1) AS Count_1 FROM c GROUP BY `' + item + '`')
        spark.catalog.dropTempView('c')
        # new_df3.show()
        # exit(0)
        new_df3.registerTempTable('d')
        # SQL DF Experiment
        new_df4 = spark.sql('Select `' + item + '` as bucketed_col_of_source, ' +
                            'Count_0/(select sum(d.Count_0) as sum from d) as Prop_0, ' +
                            'Count_1/(select sum(d.Count_1) as sum from d) as Prop_1 from d')
        spark.catalog.dropTempView('d')
        # new_df4.show()
        # exit(0)
        new_df4.registerTempTable('e')
        new_df5 = spark.sql('Select *, case when log(e.Prop_0/e.Prop_1) IS NULL then 0 ' +
                            'else log(e.Prop_0/e.Prop_1) end as WoE from e')
        spark.catalog.dropTempView('e')
        # print('Problem starts here: ')
        # new_df5.show()
        new_df5.registerTempTable('WoE_table')
        joined_Train_DF = spark.sql('Select bucketed.*, WoE_table.WoE as `' + item + '_WoE` ' +
                                    'from a bucketed inner join WoE_table ' +
                                    'on bucketed.`' + item + '` = WoE_table.bucketed_col_of_source')
        # joined_Train_DF.show()
        joined_Test_DF = spark.sql('Select bucketed.*, WoE_table.WoE as `' + item + '_WoE` ' +
                                   'from test_data bucketed inner join WoE_table ' +
                                   'on bucketed.`' + item + '` = WoE_table.bucketed_col_of_source')
        if validation:
            joined_Validation_DF = spark.sql('Select bucketed.*, WoE_table.WoE as `' + item + '_WoE` ' +
                                             'from validation_data bucketed inner join WoE_table ' +
                                             'on bucketed.`' + item + '` = WoE_table.bucketed_col_of_source')
            WoE_Validation_DF = joined_Validation_DF
        spark.catalog.dropTempView('WoE_table')
        WoE_Train_DF = joined_Train_DF
        WoE_Test_DF = joined_Test_DF
        col_len = len(categorical_col)
        if col_len > 1:
            WoE_Train_DF.registerTempTable('a')
            WoE_Test_DF.registerTempTable('test_data')
            if validation:
                # print('got inside...')
                WoE_Validation_DF.registerTempTable('validation_data')

Any help?

Thanks,
Aakash.
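For reference, the quantity the SQL above computes per category is WoE = ln(Prop_0 / Prop_1): the log of the ratio between the category's share of target-0 rows and its share of target-1 rows, with undefined values mapped to 0 as in the CASE WHEN guard. A minimal pure-Python sketch of that arithmetic on hypothetical counts (the function name and numbers are illustrative, not from the original code):

```python
import math

def woe(count_0, count_1, total_0, total_1):
    """Weight of Evidence: ln(prop_0 / prop_1); falls back to 0
    when the ratio is undefined, mirroring the CASE WHEN guard."""
    if count_0 == 0 or count_1 == 0 or total_0 == 0 or total_1 == 0:
        return 0.0
    return math.log((count_0 / total_0) / (count_1 / total_1))

# Hypothetical counts for one categorical column with two levels:
# level 'A': 30 rows with target 0, 10 with target 1
# level 'B': 30 rows with target 0, 30 with target 1
total_0, total_1 = 60, 40
print(round(woe(30, 10, total_0, total_1), 4))  # ln((30/60)/(10/40)) = ln(2)
print(round(woe(30, 30, total_0, total_1), 4))  # ln((30/60)/(30/40)) = ln(2/3)
```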
Re: Fastest way to drop useless columns
One thing that we do on our datasets:

1. Take 'n' random samples of equal size.
2. Check whether the distribution is heavily skewed for one key in your samples. The way we define "heavy skewness" is: the mean is more than one standard deviation away from the median. In your case, you can drop this column.

On Thu, 31 May 2018, 14:55, wrote:
> I believe this only works when we need to drop duplicate ROWS.
>
> Here I want to drop cols which contain one unique value.
>
> On 2018-05-31 11:16, Divya Gehlot wrote:
> > you can try the dropDuplicates function
> >
> > https://github.com/spirom/LearningSpark/blob/master/src/main/scala/dataframe/DropDuplicates.scala
> >
> > On 31 May 2018 at 16:34, wrote:
> >
> >> Hi there!
> >>
> >> I have a potentially large dataset (regarding number of rows and cols),
> >> and I want to find the fastest way to drop some useless cols for me, i.e.
> >> cols containing only a single value.
> >>
> >> I want to know what you think I could do to achieve this as fast as
> >> possible using Spark.
> >>
> >> I already have a solution using distinct().count() or approxCountDistinct(),
> >> but they may not be the best choice, as they require going through all the
> >> data, even if the first 2 tested values for a col are already different
> >> (in which case I know I can keep the col).
> >>
> >> Thx for your ideas!
> >>
> >> Julien
Re: Fastest way to drop useless columns
I believe this only works when we need to drop duplicate ROWS.

Here I want to drop cols which contain one unique value.

On 2018-05-31 11:16, Divya Gehlot wrote:
> you can try the dropDuplicates function
>
> https://github.com/spirom/LearningSpark/blob/master/src/main/scala/dataframe/DropDuplicates.scala
>
> On 31 May 2018 at 16:34, wrote:
>
>> Hi there!
>>
>> I have a potentially large dataset (regarding number of rows and cols),
>> and I want to find the fastest way to drop some useless cols for me, i.e.
>> cols containing only a single value.
>>
>> I want to know what you think I could do to achieve this as fast as
>> possible using Spark.
>>
>> I already have a solution using distinct().count() or approxCountDistinct(),
>> but they may not be the best choice, as they require going through all the
>> data, even if the first 2 tested values for a col are already different
>> (in which case I know I can keep the col).
>>
>> Thx for your ideas!
>>
>> Julien
Re: Fastest way to drop useless columns
You can try the dropDuplicates function:

https://github.com/spirom/LearningSpark/blob/master/src/main/scala/dataframe/DropDuplicates.scala

On 31 May 2018 at 16:34, wrote:
> Hi there!
>
> I have a potentially large dataset (regarding number of rows and cols),
> and I want to find the fastest way to drop some useless cols for me, i.e.
> cols containing only a single value.
>
> I want to know what you think I could do to achieve this as fast as
> possible using Spark.
>
> I already have a solution using distinct().count() or approxCountDistinct(),
> but they may not be the best choice, as they require going through all the
> data, even if the first 2 tested values for a col are already different
> (in which case I know I can keep the col).
>
> Thx for your ideas!
>
> Julien
Re: Fastest way to drop useless columns
Hi Julien,

One quick and easy-to-implement idea is to use sampling on your dataset, i.e., sample a large enough subset of your data and test whether any column contains more than one distinct value. Repeat the process a few times, then do the full test only on the surviving columns. This will also allow you to load only a subset of your dataset if it is stored in Parquet.

Best,
Anastasios

On Thu, May 31, 2018 at 10:34 AM, wrote:
> Hi there!
>
> I have a potentially large dataset (regarding number of rows and cols),
> and I want to find the fastest way to drop some useless cols for me, i.e.
> cols containing only a single value.
>
> I want to know what you think I could do to achieve this as fast as
> possible using Spark.
>
> I already have a solution using distinct().count() or approxCountDistinct(),
> but they may not be the best choice, as they require going through all the
> data, even if the first 2 tested values for a col are already different
> (in which case I know I can keep the col).
>
> Thx for your ideas!
>
> Julien

--
Anastasios Zouzias
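The sampling idea can be sketched in plain Python (the helper name and parameters are hypothetical; in Spark you would sample the DataFrame itself, e.g. with its sample method, rather than a Python list):

```python
import random

def likely_constant(values, sample_size=100, rounds=3, seed=42):
    """Heuristic pre-filter: repeatedly sample the column; if any sample
    contains more than one distinct value, the column is definitely not
    constant. Surviving columns still need the full check before dropping."""
    rng = random.Random(seed)
    for _ in range(rounds):
        sample = rng.sample(values, min(sample_size, len(values)))
        if len(set(sample)) > 1:
            return False  # found two different values: keep the column
    return True  # candidate for the full distinct-count test

print(likely_constant([7] * 1000))         # True: every sample is constant
print(likely_constant(list(range(1000))))  # False: any sample of 100 rows differs
```

Note this can produce false positives (a rare second value missed by every sample), which is why the full test on the surviving columns is still required.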
Fastest way to drop useless columns
Hi there!

I have a potentially large dataset (regarding number of rows and cols), and I want to find the fastest way to drop some useless cols for me, i.e. cols containing only a single value.

I want to know what you think I could do to achieve this as fast as possible using Spark.

I already have a solution using distinct().count() or approxCountDistinct(), but they may not be the best choice, as they require going through all the data, even if the first 2 tested values for a col are already different (in which case I know I can keep the col).

Thx for your ideas!

Julien
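The early-exit idea in the question, stopping the scan of a column as soon as two different values are seen, can be sketched in plain Python (the function name is hypothetical; in Spark itself something like taking 2 elements of the column's distinct values would play this role):

```python
def is_constant_column(values):
    """Return True iff the column holds at most one distinct value.
    Stops at the first pair of differing values instead of scanning everything."""
    it = iter(values)
    first = next(it, None)
    return all(v == first for v in it)  # all() short-circuits on the first mismatch

print(is_constant_column([3, 3, 3, 3]))  # True
print(is_constant_column([3, 3, 7, 3]))  # False, stops as soon as 7 is seen
```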
[PySpark Pipeline XGboost] How to use XGboost in PySpark Pipeline
Dear all,

I want to update my PySpark code. In PySpark, the base model must be put in a Pipeline; the official demo of Pipeline uses LogisticRegression as the base model. However, it seems the XGBoost model cannot be used in the Pipeline API. How can I use PySpark like this:

    from xgboost import XGBClassifier
    ...
    model = XGBClassifier()
    model.fit(X_train, y_train)
    pipeline = Pipeline(stages=[..., model, ...])

It is convenient to use the Pipeline API, so can anybody give some advice? Thank you!

Daniel