[Pyspark mllib] RowMatrix.columnSimilarities losing spark context?

2018-05-31 Thread pchu
I'm getting a strange error when I try to use the result of a
RowMatrix.columnSimilarities call in pyspark. Hoping to get a second
opinion.

I'm somewhat new to spark - to me it looks like the RDD behind the
CoordinateMatrix returned by columnSimilarities() doesn't have a handle on
the spark context. Is there something I'm missing or might there be a bug in
how the result is translated back to python from the JVM?

I found a related post on StackOverflow but no responses yet:
https://stackoverflow.com/questions/44929009/collecting-pyspark-matrixs-entries-raise-a-weird-error-when-run-in-test?rq=1

Here's the pyspark documentation on columnSimilarities() (it's just a Java /
Scala function call):
http://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/linalg/distributed.html#RowMatrix.columnSimilarities

*This snippet should reproduce the issue:*

from pyspark.mllib.linalg.distributed import RowMatrix

rows = spark.sparkContext.parallelize([[0,1,2],[1,1,1]])
matrix = RowMatrix(rows)
sims = matrix.columnSimilarities()

print(sims.numRows(), sims.numCols())  # Prints correctly: "3 3"
print(sims.entries.collect())  # Error: 'NoneType' object has no attribute 'setCallSite'


*Full stack trace of the Error:*

AttributeError                            Traceback (most recent call last)
 in ()
--> 1 sims.entries.collect()

/usr/lib/spark/python/pyspark/rdd.py in collect(self)
    821         to be small, as all the data is loaded into the driver's memory.
    822         """
--> 823         with SCCallSiteSync(self.context) as css:
    824             port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
    825             return list(_load_from_socket(port, self._jrdd_deserializer))

/usr/lib/spark/python/pyspark/traceback_utils.py in __enter__(self)
     70     def __enter__(self):
     71         if SCCallSiteSync._spark_stack_depth == 0:
--> 72             self._context._jsc.setCallSite(self._call_site)
     73         SCCallSiteSync._spark_stack_depth += 1
     74 

AttributeError: 'NoneType' object has no attribute 'setCallSite'
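
A heavily hedged workaround sketch, in case it helps while the root cause is unclear: it assumes the failure is only the Python-side RDD wrapper losing its ctx reference (hence self.context being None inside collect) while the underlying JVM RDD is still alive, and simply re-attaches the active SparkContext. This is a diagnostic hack, not a fix for the underlying behaviour.

from pyspark import SparkContext

entries = sims.entries
print(entries.ctx)  # if this prints None, the Python wrapper lost its context reference

# Assumption: the JVM-side RDD is intact, so re-attaching the live driver context
# lets collect() set the call site and proceed.
if entries.ctx is None:
    entries.ctx = SparkContext.getOrCreate()

print(entries.collect())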









Spark Task Failure due to OOM and subsequently task finishes

2018-05-31 Thread sparknewbie1
When running a Spark job, some tasks of stage X often fail with OOM, yet the
same task for the same stage eventually succeeds when it is relaunched, and
stage X and the job complete successfully.

One explanation I can think of: say there are 2 cores per executor and 8 GB of
executor memory. Initially a task gets an OOM because the 2 tasks sharing the
executor together need more than 8 GB, but when the task is relaunched on that
executor no other task is running, so it can finish successfully.

https://stackoverflow.com/questions/48532836/spark-memory-management-for-oom

However, I have not found a very clear answer there.
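
For what it's worth, a hedged configuration sketch along those lines (values are illustrative assumptions, not recommendations): giving each task more headroom, either by running fewer concurrent tasks per executor or by reserving extra overhead memory, targets exactly the scenario above where two tasks sharing one 8 GB executor transiently exhaust it.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("oom-tuning-sketch")
         .config("spark.executor.cores", "1")             # one task at a time per executor
         .config("spark.executor.memory", "8g")
         .config("spark.executor.memoryOverhead", "2g")   # off-heap headroom (Spark 2.3+ name)
         .config("spark.sql.shuffle.partitions", "400")   # more, smaller partitions per task
         .getOrCreate())

Note that executor sizing settings only take effect when the application and its executors are launched, so they belong in spark-submit or in the builder before the first context is created.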






REMINDER: Apache EU Roadshow 2018 in Berlin is less than 2 weeks away!

2018-05-31 Thread sharan

Hello Apache Supporters and Enthusiasts

This is a reminder that our Apache EU Roadshow in Berlin is less than 
two weeks away and we need your help to spread the word. Please let your 
work colleagues, friends and anyone interested in attending know 
about our Apache EU Roadshow event.


We have a great schedule including tracks on Apache Tomcat, Apache Http 
Server, Microservices, Internet of Things (IoT) and Cloud Technologies. 
You can find more details at the link below:


https://s.apache.org/0hnG

Ticket prices will be going up on 8th June 2018, so please make sure 
that you register soon if you want to beat the price increase. 
https://foss-backstage.de/tickets


Remember that registering for the Apache EU Roadshow also gives you 
access to FOSS Backstage so you can attend any talks and workshops from 
both conferences. And don’t forget that our Apache Lounge will be open 
throughout the whole conference as a place to meet up, hack and relax.


We look forward to seeing you in Berlin!

Thanks
Sharan Foga,  VP Apache Community Development

http://apachecon.com/
@apachecon

PLEASE NOTE: You are receiving this message because you are subscribed 
to a user@ or dev@ list of one or more Apache Software Foundation projects.


Is Spark DataFrame limit function action or transformation?

2018-05-31 Thread unk1102
Is the Spark DataFrame limit function an action or a transformation? It
returns a DataFrame, so it should be a transformation, but it executes the
entire DAG, which makes it look like an action. The same question applies to
the persist function. Please guide. Thanks in advance.
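
A small hedged demonstration of the distinction (assuming an existing DataFrame df): both limit and persist return lazily evaluated DataFrames; execution is only triggered by an action such as collect, count or show. If the whole DAG runs, it is the subsequent action doing it, not limit or persist themselves.

limited = df.limit(10)     # transformation: returns a new DataFrame, nothing runs yet
limited.explain()          # inspecting the plan still does not launch a job
rows = limited.collect()   # action: this is what actually executes the DAG

cached = df.persist()      # also lazy: only marks df for caching
cached.count()             # the cache is materialized by the first action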






Re: Apache Spark Installation error

2018-05-31 Thread Irving Duran
You probably need "spark-shell" to be recognized as a command in your
environment.  Maybe try "sudo ln -s /path/to/spark-shell
/usr/bin/spark-shell".  Have you tried "./spark-shell" in the current path
to see if it works?

Thank You,

Irving Duran


On Thu, May 31, 2018 at 9:00 AM Remil Mohanan  wrote:

> Hi there,
>
>I am not able to execute the spark-shell command. Can you please help.
>
> Thanks
>
> Remil
>


Fwd: [Help] PySpark Dynamic mean calculation

2018-05-31 Thread Aakash Basu
Solved it myself.

In case anyone needs it, the code below can be reused.

orig_list = ['Married-spouse-absent', 'Married-AF-spouse', 'Separated',
             'Married-civ-spouse', 'Widowed', 'Divorced', 'Never-married']
k_folds = 3

# df.columns == ['fnlwgt_bucketed',
#                'Married-spouse-absent_fold_0', 'Married-AF-spouse_fold_0',
#                'Separated_fold_0', 'Married-civ-spouse_fold_0', 'Widowed_fold_0',
#                'Divorced_fold_0', 'Never-married_fold_0',
#                'Married-spouse-absent_fold_1', 'Married-AF-spouse_fold_1',
#                'Separated_fold_1', 'Married-civ-spouse_fold_1', 'Widowed_fold_1',
#                'Divorced_fold_1', 'Never-married_fold_1',
#                'Married-spouse-absent_fold_2', 'Married-AF-spouse_fold_2',
#                'Separated_fold_2', 'Married-civ-spouse_fold_2', 'Widowed_fold_2',
#                'Divorced_fold_2', 'Never-married_fold_2']
cols = df.columns

for folds in range(k_folds):
    for column in orig_list:
        col_namer = []
        for fold in range(k_folds):
            if fold != folds:
                col_namer.append(column + '_fold_' + str(fold))
        df = df.withColumn(column + '_fold_' + str(folds) + '_mean',
                           (sum(df[col] for col in col_namer) / (k_folds - 1)))
        print(col_namer)

df.show(1)



-- Forwarded message --
From: Aakash Basu 
Date: Thu, May 31, 2018 at 3:40 PM
Subject: [Help] PySpark Dynamic mean calculation
To: user 


Hi,

Using -
Python 3.6
Spark 2.3

Original DF -
key a_fold_0 b_fold_0 a_fold_1 b_fold_1 a_fold_2 b_fold_2
1 1 2 3 4 5 6
2 7 5 3 5 2 1


I want to calculate means from the above dataframe as follows (like this
for all columns and all folds) -

key a_fold_0 b_fold_0 a_fold_1 b_fold_1 a_fold_2 b_fold_2 a_fold_0_mean b_fold_0_mean a_fold_1_mean
1   1        2        3        4        5        6        (3 + 5) / 2   (4 + 6) / 2   (1 + 5) / 2
2   7        5        3        5        2        1        (3 + 2) / 2   (5 + 1) / 2   (7 + 2) / 2

Process -

For fold_0 my mean should be (fold_1 + fold_2) / 2
For fold_1 my mean should be (fold_0 + fold_2) / 2
For fold_2 my mean should be (fold_0 + fold_1) / 2

For each column.

And my number of columns, no. of folds, everything would be dynamic.

How to go about this problem on a pyspark dataframe?

Thanks,
Aakash.



[Suggestions needed] Weight of Evidence PySpark

2018-05-31 Thread Aakash Basu
Hi guys,

I'm trying to calculate WoE (Weight of Evidence) on a particular categorical
column against the target column, but the code is taking a lot of time even on
very few data points (rows).

How can I optimize it to make it performant enough?

Here's the code (here categorical_col is a python list of columns) -

for item in categorical_col:
    new_df = spark.sql('Select `' + item + '`, `' + target_col + '`, count(*) as Counts from a group by `'
                       + item + '`, `' + target_col + '` order by `' + item + '`')
    # new_df.show()
    new_df.registerTempTable('b')
    # exit(0)
    new_df2 = spark.sql('Select `' + item + '`, ' +
                        'case when `' + target_col + '` == 0 then Counts else 0 end as Count_0, ' +
                        'case when `' + target_col + '` == 1 then Counts else 0 end as Count_1 ' +
                        'from b')

    spark.catalog.dropTempView('b')
    # new_df2.show()
    new_df2.registerTempTable('c')
    # exit(0)

    new_df3 = spark.sql('SELECT `' + item + '`, SUM(Count_0) AS Count_0, ' +
                        'SUM(Count_1) AS Count_1 FROM c GROUP BY `' + item + '`')

    spark.catalog.dropTempView('c')
    # new_df3.show()
    # exit(0)

    new_df3.registerTempTable('d')

    # SQL DF Experiment
    new_df4 = spark.sql('Select `' + item + '` as bucketed_col_of_source, ' +
                        'Count_0/(select sum(d.Count_0) as sum from d) as Prop_0, ' +
                        'Count_1/(select sum(d.Count_1) as sum from d) as Prop_1 from d')

    spark.catalog.dropTempView('d')
    # new_df4.show()
    # exit(0)
    new_df4.registerTempTable('e')

    new_df5 = spark.sql('Select *, case when log(e.Prop_0/e.Prop_1) IS NULL then 0 '
                        'else log(e.Prop_0/e.Prop_1) end as WoE from e')

    spark.catalog.dropTempView('e')

    # print('Problem starts here: ')
    # new_df5.show()

    new_df5.registerTempTable('WoE_table')

    joined_Train_DF = spark.sql('Select bucketed.*, WoE_table.WoE as `' + item +
                                '_WoE` from a bucketed inner join WoE_table on bucketed.`' + item +
                                '` = WoE_table.bucketed_col_of_source')

    # joined_Train_DF.show()
    joined_Test_DF = spark.sql('Select bucketed.*, WoE_table.WoE as `' + item +
                               '_WoE` from test_data bucketed inner join WoE_table on bucketed.`' + item +
                               '` = WoE_table.bucketed_col_of_source')

    if validation:
        joined_Validation_DF = spark.sql('Select bucketed.*, WoE_table.WoE as `' + item +
                                         '_WoE` from validation_data bucketed inner join WoE_table on bucketed.`' + item +
                                         '` = WoE_table.bucketed_col_of_source')
        WoE_Validation_DF = joined_Validation_DF

    spark.catalog.dropTempView('WoE_table')

    WoE_Train_DF = joined_Train_DF
    WoE_Test_DF = joined_Test_DF
    col_len = len(categorical_col)
    if col_len > 1:
        WoE_Train_DF.registerTempTable('a')
        WoE_Test_DF.registerTempTable('test_data')
        if validation:
            # print('got inside...')
            WoE_Validation_DF.registerTempTable('validation_data')

Any help?

Thanks,
Aakash.
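
A hedged sketch of one possible restructuring (the helper name and aliases below are mine, not from the code above): compute the per-category counts and the WoE for a column with a single groupBy using DataFrame functions, and broadcast the small WoE lookup for the join, instead of chaining temp views with correlated subqueries.

from pyspark.sql import functions as F

def add_woe(df, item, target_col):
    # Per-category counts of target == 0 and target == 1 in one aggregation.
    counts = (df.groupBy(item)
                .agg(F.sum(F.when(F.col(target_col) == 0, 1).otherwise(0)).alias("cnt_0"),
                     F.sum(F.when(F.col(target_col) == 1, 1).otherwise(0)).alias("cnt_1")))
    totals = counts.agg(F.sum("cnt_0").alias("tot_0"),
                        F.sum("cnt_1").alias("tot_1")).collect()[0]
    # WoE = log(Prop_0 / Prop_1), with nulls (e.g. log of zero) replaced by 0 as above.
    woe = counts.withColumn(
        "WoE",
        F.coalesce(F.log((F.col("cnt_0") / totals["tot_0"]) /
                         (F.col("cnt_1") / totals["tot_1"])), F.lit(0.0)))
    # The per-category WoE table is tiny, so broadcast it to keep the join cheap.
    return (df.join(F.broadcast(woe.select(item, "WoE")), on=item, how="left")
              .withColumnRenamed("WoE", item + "_WoE"))

The same function can then be applied to the train, test and validation DataFrames inside the loop over categorical_col.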


Re: Fastest way to drop useless columns

2018-05-31 Thread devjyoti patra
One thing that we do on our datasets is:
1. Take 'n' random samples of equal size.
2. Check whether the distribution is heavily skewed for a key in your samples.
The way we define "heavy skewness" is: the mean is more than one standard
deviation away from the median.

In your case, you can drop this column.
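
A rough, hedged sketch of that heuristic for a single column (the function name, sample fraction and tolerance are illustrative; it assumes the column is numeric and the sample is non-empty):

from pyspark.sql import functions as F

def looks_heavily_skewed(df, col_name, fraction=0.1, seed=42):
    sample = df.select(col_name).sample(False, fraction, seed)
    stats = sample.agg(F.mean(col_name).alias("mean"),
                       F.stddev(col_name).alias("std")).collect()[0]
    median = sample.approxQuantile(col_name, [0.5], 0.01)[0]
    # "Heavy skewness" as defined above: mean more than one std deviation from the median.
    return abs(stats["mean"] - median) > stats["std"]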

On Thu, 31 May 2018, 14:55 ,  wrote:

> I believe this only works when we need to drop duplicate ROWS
>
> Here I want to drop cols which contains one unique value
>
>
> Le 2018-05-31 11:16, Divya Gehlot a écrit :
> > you can try dropduplicate function
> >
> >
> https://github.com/spirom/LearningSpark/blob/master/src/main/scala/dataframe/DropDuplicates.scala
> >
> > On 31 May 2018 at 16:34,  wrote:
> >
> >> Hi there !
> >>
> >> I have a potentially large dataset ( regarding number of rows and
> >> cols )
> >>
> >> And I want to find the fastest way to drop some useless cols for me,
> >> i.e. cols containing only an unique value !
> >>
> >> I want to know what do you think that I could do to do this as fast
> >> as possible using spark.
> >>
> >> I already have a solution using distinct().count() or
> >> approxCountDistinct()
> >> But, they may not be the best choice as this requires to go through
> >> all the data, even if the 2 first tested values for a col are
> >> already different ( and in this case I know that I can keep the col
> >> )
> >>
> >> Thx for your ideas !
> >>
> >> Julien
> >>
> >>
> > -
> >> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Fastest way to drop useless columns

2018-05-31 Thread julio . cesare

I believe this only works when we need to drop duplicate ROWS.

Here I want to drop cols which contain only one unique value.


Le 2018-05-31 11:16, Divya Gehlot a écrit :

you can try dropduplicate function

https://github.com/spirom/LearningSpark/blob/master/src/main/scala/dataframe/DropDuplicates.scala

On 31 May 2018 at 16:34,  wrote:


Hi there !

I have a potentially large dataset ( regarding number of rows and
cols )

And I want to find the fastest way to drop some useless cols for me,
i.e. cols containing only an unique value !

I want to know what do you think that I could do to do this as fast
as possible using spark.

I already have a solution using distinct().count() or
approxCountDistinct()
But, they may not be the best choice as this requires to go through
all the data, even if the 2 first tested values for a col are
already different ( and in this case I know that I can keep the col
)

Thx for your ideas !

Julien






Re: Fastest way to drop useless columns

2018-05-31 Thread Divya Gehlot
you can try dropduplicate function

https://github.com/spirom/LearningSpark/blob/master/src/main/scala/dataframe/DropDuplicates.scala

On 31 May 2018 at 16:34,  wrote:

> Hi there !
>
> I have a potentially large dataset ( regarding number of rows and cols )
>
> And I want to find the fastest way to drop some useless cols for me, i.e.
> cols containing only an unique value !
>
> I want to know what do you think that I could do to do this as fast as
> possible using spark.
>
>
> I already have a solution using distinct().count() or approxCountDistinct()
> But, they may not be the best choice as this requires to go through all
> the data, even if the 2 first tested values for a col are already different
> ( and in this case I know that I can keep the col )
>
>
> Thx for your ideas !
>
> Julien
>
>
>


Re: Fastest way to drop useless columns

2018-05-31 Thread Anastasios Zouzias
Hi Julien,

One quick and easy-to-implement idea is to use sampling on your dataset,
i.e., sample a large enough subset of your data and test whether some columns
contain only a single distinct value. Repeat the process a few times and then
do the full test on the surviving columns.

This will allow you to load only a subset of your dataset if it is stored
in Parquet.

Best,
Anastasios
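
A hedged sketch of that idea (the fraction, seed and "<= 1" test are illustrative): one cheap aggregation on a sample eliminates most columns, and only the surviving candidates are checked exactly on the full data.

from pyspark.sql import functions as F

sample = df.sample(False, 0.01, 42)
approx = sample.agg(*[F.countDistinct(c).alias(c) for c in df.columns]).first().asDict()
candidates = [c for c, n in approx.items() if n <= 1]   # constant within the sample

if candidates:
    exact = df.agg(*[F.countDistinct(c).alias(c) for c in candidates]).first().asDict()
    constant_cols = [c for c, n in exact.items() if n <= 1]
    df_clean = df.drop(*constant_cols)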

On Thu, May 31, 2018 at 10:34 AM,  wrote:

> Hi there !
>
> I have a potentially large dataset ( regarding number of rows and cols )
>
> And I want to find the fastest way to drop some useless cols for me, i.e.
> cols containing only an unique value !
>
> I want to know what do you think that I could do to do this as fast as
> possible using spark.
>
>
> I already have a solution using distinct().count() or approxCountDistinct()
> But, they may not be the best choice as this requires to go through all
> the data, even if the 2 first tested values for a col are already different
> ( and in this case I know that I can keep the col )
>
>
> Thx for your ideas !
>
> Julien
>
>
>


-- 
-- Anastasios Zouzias



Fastest way to drop useless columns

2018-05-31 Thread julio . cesare

Hi there !

I have a potentially large dataset (in terms of number of rows and cols),
and I want to find the fastest way to drop some cols that are useless to me,
i.e. cols containing only a single unique value.


I want to know what you think I could do to achieve this as fast as
possible using Spark.



I already have a solution using distinct().count() or
approxCountDistinct(), but they may not be the best choice, as this requires
going through all the data, even if the first 2 values tested for a col are
already different (in which case I know that I can keep the col).



Thx for your ideas !

Julien
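
A hedged sketch of one alternative (not from this thread): compute approximate distinct counts for all columns in a single job instead of one job per column, then drop the columns that come back with at most one value. approx_count_distinct can be slightly off, so an exact check on the flagged columns can be added if needed.

from pyspark.sql import functions as F

approx = df.agg(*[F.approx_count_distinct(c).alias(c) for c in df.columns]).first().asDict()
constant_cols = [c for c, n in approx.items() if n <= 1]
df_clean = df.drop(*constant_cols)

This still scans the data once, so it does not give the early-exit behaviour described above, but it avoids launching a separate job per column.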




[PySpark Pipeline XGboost] How to use XGboost in PySpark Pipeline

2018-05-31 Thread Daniel Du
Dear all, 

I want to update my PySpark code. In PySpark, the base model must be put in
a pipeline; the official pipeline demo uses LogisticRegression as the base
model. However, it does not seem possible to use an XGBoost model in the
Pipeline API. How can I use PySpark like this:

from xgboost import XGBClassifier 
... 
model = XGBClassifier() 
model.fit(X_train, y_train) 
pipeline = Pipeline(stages=[..., model, ...]) 

It is convenient to use the Pipeline API, so can anybody give some advice?
Thank you!

Daniel
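
A hedged note and sketch: the scikit-learn style XGBClassifier trains on local (driver-side) data and does not implement the pyspark.ml Estimator interface, so it cannot be used directly as a Pipeline stage; distributed XGBoost on Spark generally goes through the JVM XGBoost4J-Spark package or a wrapper around it. As a Pipeline-compatible alternative, Spark ML ships its own gradient-boosted trees; the column names below are illustrative assumptions.

from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
gbt = GBTClassifier(featuresCol="features", labelCol="label", maxIter=50)
pipeline = Pipeline(stages=[assembler, gbt])
model = pipeline.fit(train_df)   # train_df is assumed to have f1, f2, f3 and label columns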


