RE: Standard Scaler taking 1.5hrs

2015-06-04 Thread Piero Cinquegrana
each step. Thanks, Piero From: DB Tsai [mailto:dbt...@dbtsai.com] Sent: Wednesday, June 03, 2015 10:33 PM To: Piero Cinquegrana Cc: user@spark.apache.org Subject: Re: Standard Scaler taking 1.5hrs Can you do count() before fit to force materialize the RDD? I think something before fit is slow

Re: Standard Scaler taking 1.5hrs

2015-06-04 Thread Holden Karau
, Piero *From:* DB Tsai [mailto:dbt...@dbtsai.com] *Sent:* Wednesday, June 03, 2015 10:33 PM *To:* Piero Cinquegrana *Cc:* user@spark.apache.org *Subject:* Re: Standard Scaler taking 1.5hrs

Standard Scaler taking 1.5hrs

2015-06-03 Thread Piero Cinquegrana
Hello User group, I have an RDD of LabeledPoint composed of sparse vectors, as shown below. In the next step, I am standardizing the columns with the Standard Scaler. The data has 2450 columns and ~110M rows. It took 1.5hrs to complete the standardization with 10 nodes and 80 executors. The

Re: Standard Scaler taking 1.5hrs

2015-06-03 Thread Piero Cinquegrana
The fit part is very slow, transform not at all. The number of partitions was 210 vs number of executors 80. Spark 1.4 sounds great but as my company is using Qubole we are dependent upon them to upgrade from version 1.3.1. Until that happens, can you think of any other reasons as to why it

Re: Standard Scaler taking 1.5hrs

2015-06-03 Thread DB Tsai
Which part of StandardScaler is slow? Fit or transform? Fit has a shuffle, but a very small one, and transform doesn't shuffle at all. I guess you don't have enough partitions, so please repartition your input dataset to a number at least larger than the # of executors you have. In Spark 1.4's new ML pipeline
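
To illustrate why the shuffle in fit can be very small, here is a minimal pure-Python sketch, not Spark's actual implementation. It assumes fit only needs per-column means and standard deviations, that each partition can be reduced to one small summary, and that rows are dense lists (the thread's data is sparse). The names `ColumnStats`, `summarize`, `merge`, and `sample_std` are hypothetical, and the n-1 denominator is an assumption made for this example.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ColumnStats:
    """Per-column summary of one partition: size depends on columns, not rows."""
    n: int
    mean: List[float]
    m2: List[float]  # per-column sum of squared deviations from the mean


def summarize(rows: Sequence[Sequence[float]], num_cols: int) -> ColumnStats:
    """Reduce one partition to a ColumnStats in a single pass (Welford update)."""
    stats = ColumnStats(0, [0.0] * num_cols, [0.0] * num_cols)
    for row in rows:
        stats.n += 1
        for j, x in enumerate(row):
            delta = x - stats.mean[j]
            stats.mean[j] += delta / stats.n
            stats.m2[j] += delta * (x - stats.mean[j])
    return stats


def merge(a: ColumnStats, b: ColumnStats) -> ColumnStats:
    """Combine two partition summaries; only these small objects need to move."""
    if a.n == 0:
        return b
    if b.n == 0:
        return a
    n = a.n + b.n
    mean, m2 = [], []
    for j in range(len(a.mean)):
        delta = b.mean[j] - a.mean[j]
        mean.append(a.mean[j] + delta * b.n / n)
        m2.append(a.m2[j] + b.m2[j] + delta * delta * a.n * b.n / n)
    return ColumnStats(n, mean, m2)


def sample_std(stats: ColumnStats) -> List[float]:
    """Per-column standard deviation with an n-1 denominator (assumed here)."""
    return [math.sqrt(v / (stats.n - 1)) for v in stats.m2]
```

Under these assumptions, a dataset with 2450 columns would exchange a few thousand numbers per partition regardless of how many rows it holds, so the number of partitions mainly affects how much of the per-row work runs in parallel.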

Re: Standard Scaler taking 1.5hrs

2015-06-03 Thread DB Tsai
Can you do count() before fit to force materialize the RDD? I think something before fit is slow. On Wednesday, June 3, 2015, Piero Cinquegrana pcinquegr...@marketshare.com wrote: The fit part is very slow, transform not at all. The number of partitions was 210 vs number of executors 80.
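
The reasoning behind calling count() first can be shown with a toy model of lazy evaluation. This is a sketch only: `LazyDataset` is a hypothetical stand-in, not the Spark RDD API, and it assumes the dataset is marked for caching so that a materialized result is reused. The point is that with lazy evaluation, upstream work is charged to whichever step first triggers it, so a slow "fit" may really be slow parsing or loading.

```python
from typing import Callable, List, Optional


class LazyDataset:
    """Toy lazily evaluated dataset (hypothetical, not the Spark API)."""

    def __init__(self, source: Callable[[], List[float]]) -> None:
        self._source = source          # upstream work, e.g. loading and parsing
        self._cache_enabled = False
        self._cached: Optional[List[float]] = None
        self.source_runs = 0           # how many times upstream work was executed

    def cache(self) -> "LazyDataset":
        """Mark the dataset so the first materialization is kept."""
        self._cache_enabled = True
        return self

    def _rows(self) -> List[float]:
        if self._cached is not None:
            return self._cached
        self.source_runs += 1
        rows = self._source()
        if self._cache_enabled:
            self._cached = rows
        return rows

    def count(self) -> int:
        """Action that forces the upstream work to run."""
        return len(self._rows())

    def mean(self) -> float:
        """Stand-in for the statistics pass done by a fit step."""
        rows = self._rows()
        return sum(rows) / len(rows)


def slow_parse() -> List[float]:
    """Placeholder for an expensive upstream step."""
    return [float(i) for i in range(5)]


cached = LazyDataset(slow_parse).cache()
cached.count()   # upstream work runs here and can be timed separately
cached.mean()    # the fit-like step now reuses the cached rows

uncached = LazyDataset(slow_parse)
uncached.count()
uncached.mean()  # upstream work runs again inside the fit-like step
```

Timing count() and the fit step separately on a cached dataset would then show which of the two is actually responsible for the 1.5 hours.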