each step.
Thanks,
Piero
From: DB Tsai [mailto:dbt...@dbtsai.com]
Sent: Wednesday, June 03, 2015 10:33 PM
To: Piero Cinquegrana
Cc: user@spark.apache.org
Subject: Re: Standard Scaler taking 1.5hrs
Can you do count() before fit to force the RDD to materialize? I think something before fit is what's slow.
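The suggestion above might look like the following in Spark's Scala MLlib API. This is a minimal sketch, not code from the thread: the RDD name `data` and the helper `fitScaler` are assumptions.

```scala
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Sketch: cache the feature vectors and count() them first, so the time
// measured for fit() reflects the scaler itself rather than upstream
// transformations being lazily evaluated inside it.
def fitScaler(data: RDD[LabeledPoint]) = {
  val features = data.map(_.features).cache()
  features.count() // forces materialization of the cached RDD
  new StandardScaler(withStd = true).fit(features)
}
```

If fit() is still slow after the count(), the cost is in the scaler itself; if the count() absorbs the time, the bottleneck was upstream.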
Hello User group,
I have an RDD of LabeledPoint composed of sparse vectors, as shown below. In the next step, I standardize the columns with the StandardScaler. The data has 2450 columns and ~110M rows. It took 1.5 hrs to complete the standardization with 10 nodes and 80 executors.
The fit part is very slow; transform is not slow at all.
The number of partitions was 210, versus 80 executors.
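For context, the workflow being described could be sketched roughly as below. The identifiers (`trainingData`, `scaled`) are hypothetical, not taken from the actual job.

```scala
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.LabeledPoint

// fit() computes per-column means/variances over the whole RDD (the
// reportedly slow step); transform() then rescales each vector locally,
// with no shuffle. withMean is left false because mean-centering would
// densify the 2450-column sparse vectors.
val scaler = new StandardScaler(withMean = false, withStd = true)
  .fit(trainingData.map(_.features))
val scaled = trainingData.map(lp =>
  LabeledPoint(lp.label, scaler.transform(lp.features)))
```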
Spark 1.4 sounds great, but as my company is using Qubole, we are dependent upon them to upgrade from version 1.3.1. Until that happens, can you think of any other reasons as to why it
Which part of StandardScaler is slow, fit or transform? Fit has a shuffle, but a very small one, and transform doesn't shuffle at all. I guess you don't have enough partitions, so please repartition your input dataset to a number at least larger than the number of executors you have.
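A repartition along those lines might look like this. The partition count is illustrative (a few partitions per executor is a common rule of thumb), and `data` is a placeholder name.

```scala
// Repartition so each of the 80 executors gets several partitions,
// then cache and materialize before fitting. 320 is illustrative,
// not a recommendation from the thread.
val repartitioned = data.repartition(320).cache()
repartitioned.count() // materialize the repartitioned, cached RDD
```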
In Spark 1.4's new ML pipeline