I'm getting a strange error when I try to use the result of a
RowMatrix.columnSimilarities call in pyspark. Hoping to get a second
opinion.
I'm somewhat new to spark - to me it looks like the RDD behind the
CoordinateMatrix returned by columnSimilarities() doesn't have a handle on
the spark
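For reference, a minimal sketch of the call in question (the rest of the error report is truncated above, so this only shows the documented usage of columnSimilarities in PySpark; the data is a toy placeholder):

    from pyspark.sql import SparkSession
    from pyspark.mllib.linalg.distributed import RowMatrix

    spark = SparkSession.builder.appName("col-sim-repro").getOrCreate()

    # Tiny RowMatrix just to exercise the call; real data would come from your job.
    rows = spark.sparkContext.parallelize([[1.0, 2.0, 3.0],
                                           [4.0, 5.0, 6.0],
                                           [7.0, 8.0, 9.0]])
    mat = RowMatrix(rows)

    sims = mat.columnSimilarities()      # returns a CoordinateMatrix
    print(sims.entries.take(5))          # entries is an RDD of MatrixEntry(i, j, value)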
When running a Spark job, some tasks in stage X often fail with OOM, yet the
same task for the same stage eventually succeeds when relaunched, and
stage X and the job complete successfully.
One thing I can think of: say there are 2 cores per executor and executor
memory of 8G, so initially the task got
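To make the memory arithmetic concrete, a hedged sketch (the figures mirror the ones in the mail; the config keys are standard Spark settings): with 2 cores per executor, two concurrent tasks share roughly the same 8G heap, so a relaunched task that happens to run alone on the executor effectively gets more headroom.

    from pyspark.sql import SparkSession

    # Sketch only: values mirror the numbers in the mail above.
    # Two concurrent tasks per executor share the 8g heap; dropping to one
    # core per executor (or raising executor memory) gives each task more room.
    spark = (SparkSession.builder
             .appName("oom-tuning-sketch")
             .config("spark.executor.memory", "8g")
             .config("spark.executor.cores", "2")      # try "1" to run one task at a time per executor
             .config("spark.memory.fraction", "0.6")   # default; share of heap for execution + storage
             .getOrCreate())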
Hello Apache Supporters and Enthusiasts
This is a reminder that our Apache EU Roadshow in Berlin is less than
two weeks away and we need your help to spread the word. Please let your
work colleagues, friends, and anyone interested in attending know
about our Apache EU Roadshow event.
We
Is the Spark DataFrame limit function an action or a transformation? It
returns a DataFrame, so it should be a transformation, but it seems to execute
the entire DAG, so I think it is an action. The same goes for the persist
function. Please guide.
Thanks in advance.
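For what it's worth, limit returns a DataFrame and is lazy (a transformation); nothing runs until an action is called. persist is also lazy: it only marks the DataFrame for caching, and the first action materializes it. A minimal sketch, assuming an existing SparkSession named spark:

    df = spark.range(1000000)
    limited = df.limit(5)      # transformation: no job is launched yet
    limited.show()             # action: this is what triggers execution

    cached = df.persist()      # lazy: only marks df for caching
    cached.count()             # first action actually materializes the cache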
You probably need "spark-shell" to be recognized as a command in your
environment. Maybe try "sudo ln -s /path/to/spark-shell
/usr/bin/spark-shell". Have you tried "./spark-shell" from the directory it
lives in to see if it works?
Thank You,
Irving Duran
On Thu, May 31, 2018 at 9:00 AM Remil Mohanan wrote:
Solved it myself.
In case anyone needs it, the code below can be reused.
orig_list = ['Married-spouse-absent', 'Married-AF-spouse',
'Separated', 'Married-civ-spouse', 'Widowed', 'Divorced',
'Never-married']
k_folds = 3
cols = df.columns # ['fnlwgt_bucketed',
'Married-spouse-absent_fold_0',
Hi,
Using -
Python 3.6
Spark 2.3
Original DF -
key  a_fold_0  b_fold_0  a_fold_1  b_fold_1  a_fold_2  b_fold_2
1    1         2         3         4         5         6
2    7         5         3         5         2         1
I want to calculate means from the below dataframe as follows (like this
for all columns and all folds) -
key a_fold_0 b_fold_0 a_fold_1 b_fold_1 a_fold_2
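Since the expected output is cut off above, the sketch below assumes the goal is a per-row mean of each base column (a, b) across the three folds; the column names follow the pattern in the sample data and df stands for the original DataFrame:

    from pyspark.sql import functions as F

    k_folds = 3
    base_cols = ["a", "b"]
    for c in base_cols:
        fold_cols = [F.col(f"{c}_fold_{i}") for i in range(k_folds)]
        df = df.withColumn(f"{c}_mean", sum(fold_cols) / k_folds)
    df.show()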
Hi guys,
I'm trying to calculate WoE (Weight of Evidence) for a particular categorical
column with respect to the target column, but the code takes a long time even
on very few data points (rows).
How can I optimize it to make it performant enough?
Here's the code (here categorical_col is a python list of columns) -
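The code itself is cut off in the archive, so here is only a hedged sketch of one way to avoid per-row Python loops: build the WoE table with a single groupBy per categorical column and join it back. The target column name, a 0/1 target encoding, and the formula WoE = ln((events share) / (non-events share)) are all assumptions:

    from pyspark.sql import functions as F

    def add_woe(df, cat_col, target_col="target"):
        # Per-category event / non-event counts in one aggregation.
        agg = (df.groupBy(cat_col)
                 .agg(F.sum(F.col(target_col)).alias("events"),
                      F.sum(1 - F.col(target_col)).alias("non_events")))
        totals = agg.agg(F.sum("events").alias("tot_e"),
                         F.sum("non_events").alias("tot_ne")).collect()[0]
        # Categories with zero events or non-events yield null/inf here;
        # real code would add smoothing.
        woe = agg.withColumn(
            "woe_" + cat_col,
            F.log((F.col("events") / totals["tot_e"]) /
                  (F.col("non_events") / totals["tot_ne"])))
        return df.join(woe.select(cat_col, "woe_" + cat_col),
                       on=cat_col, how="left")

    for c in categorical_col:          # categorical_col as in the mail
        df = add_woe(df, c)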
One thing that we do on our datasets is:
1. Take 'n' random samples of equal size.
2. Check whether the distribution is heavily skewed for one key in your
samples. The way we define "heavy skewness" is: the mean is more than one
standard deviation away from the median.
In your case, you can drop such a column (a small sketch of the check follows
below).
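A sketch of that check on one sampled column; the column name, sample fraction, and number of repeats are placeholders:

    from pyspark.sql import functions as F

    sample = df.sample(fraction=0.1, seed=42)
    stats = sample.select(
        F.mean("some_col").alias("mean"),
        F.stddev("some_col").alias("std"),
        F.expr("percentile_approx(some_col, 0.5)").alias("median")).collect()[0]

    # "heavy skewness": mean more than one std deviation away from the median
    heavily_skewed = abs(stats["mean"] - stats["median"]) > stats["std"]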
I believe this only works when we need to drop duplicate ROWS.
Here I want to drop cols that contain only one unique value.
On 2018-05-31 11:16, Divya Gehlot wrote:
You can try the dropDuplicates function:
https://github.com/spirom/LearningSpark/blob/master/src/main/scala/dataframe/DropDuplicates.scala
On 31 May 2018 at 16:34, wrote:
> Hi there !
>
> I have a potentially large dataset ( regarding number of rows and cols )
>
> And I want to find the fastest way
Hi Julien,
One quick and easy-to-implement idea is to use sampling on your dataset,
i.e., sample a large enough subset of your data and test whether some columns
contain only a single unique value. Repeat the process a few times and then
do the full test on the surviving columns.
This will allow you to
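A hedged sketch of this sample-then-verify idea (the sample fraction, the number of repeats, and the final exact check are arbitrary choices, and df stands for your DataFrame):

    from pyspark.sql import functions as F

    candidates = set(df.columns)
    for _ in range(3):                          # repeat the sampling a few times
        if not candidates:
            break
        sample = df.sample(fraction=0.01)
        counts = sample.agg(*[F.countDistinct(c).alias(c) for c in candidates]).first()
        # keep only the columns that still look constant in this sample
        candidates = {c for c in candidates if counts[c] <= 1}

    if candidates:                              # exact check on the survivors only
        counts = df.agg(*[F.countDistinct(c).alias(c) for c in candidates]).first()
        df = df.drop(*[c for c in candidates if counts[c] == 1])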
Hi there !
I have a potentially large dataset (in terms of number of rows and cols),
and I want to find the fastest way to drop some cols that are useless for me,
i.e. cols containing only a single unique value!
I would like to know what you think I could do to make this as fast as
possible using Spark.
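For comparison, the straightforward single-pass baseline (which the sampling suggestion earlier in the thread tries to speed up) could look like this, assuming the DataFrame is called df:

    from pyspark.sql import functions as F

    # One full scan; approx_count_distinct would be cheaper but approximate.
    counts = df.agg(*[F.countDistinct(c).alias(c) for c in df.columns]).first()
    constant_cols = [c for c in df.columns if counts[c] <= 1]
    df_clean = df.drop(*constant_cols)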
Dear all,
I want to update my PySpark code. In PySpark, the base model must be put in a
pipeline; the official pipeline demo uses LogisticRegression as the base
model. However, it does not seem possible to use an XGBoost model with the
Pipeline API. How can I use PySpark like this:
from
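The import above is cut off in the archive. For reference, the standard Pipeline pattern with LogisticRegression as the base model looks like the sketch below (feature and label column names are placeholders). Using XGBoost inside a Pipeline needs an estimator that implements the Spark ML Estimator interface; plain Python xgboost does not, so you would need a Spark-compatible wrapper (recent xgboost releases ship xgboost.spark.SparkXGBClassifier, but whether that is available depends on your environment and version):

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler

    assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="label")

    pipeline = Pipeline(stages=[assembler, lr])
    model = pipeline.fit(train_df)          # train_df / test_df are placeholders
    predictions = model.transform(test_df)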