Re: Re: spark sql data skew

2018-07-20 Thread Xiaomeng Wan
Try divide and conquer: create a column x for the first character of userid, and group by company+x. If the groups are still too large, try the first two characters. On 17 July 2018 at 02:25, 崔苗 wrote: > 30G user data, how to get distinct users count after creating a composite > key based on company and userid? > >
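A minimal Scala sketch of that suggestion, assuming a DataFrame df with string columns company and userid (both names assumed):

    import org.apache.spark.sql.functions._

    // Bucket by the first character of userid so no single group gets too big,
    // count distinct userids per (company, bucket), then sum the buckets.
    // Each (company, userid) pair lands in exactly one bucket, so the sum is
    // the overall distinct count of the composite key.
    val bucketed = df.withColumn("x", substring(col("userid"), 1, 1))

    val perBucket = bucketed
      .groupBy(col("company"), col("x"))
      .agg(countDistinct(col("userid")).as("distinct_users"))

    val totalDistinct = perBucket.agg(sum("distinct_users")).first.getLong(0)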

Re: IDE for python

2017-06-28 Thread Xiaomeng Wan
ening in pub. > > > > *From:* Md. Rezaul Karim [mailto:rezaul.ka...@insight-centre.org] > *Sent:* Wednesday, June 28, 2017 12:55 PM > *To:* Sotola, Radim <radim.sot...@teradata.com> > *Cc:* spark users <user@spark.apache.org>; ayan guha <guha.a...@gmail.com>; > Abhina

IDE for python

2017-06-27 Thread Xiaomeng Wan
Hi, I recently switched from Scala to Python, and wondered which IDE people are using for Python. I heard about PyCharm, Spyder, etc. How do they compare with each other? Thanks, Shawn

Re: Un-exploding / denormalizing Spark SQL help

2017-02-08 Thread Xiaomeng Wan
You could also try pivot. On 7 February 2017 at 16:13, Everett Anderson wrote: > > > On Tue, Feb 7, 2017 at 2:21 PM, Michael Armbrust > wrote: > >> I think the fastest way is likely to use a combination of conditionals >> (when / otherwise),
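A minimal sketch of the pivot approach, assuming a long-format DataFrame longDf with columns id, key, and value (names assumed) that should be denormalized into one row per id with one column per key:

    import org.apache.spark.sql.functions._

    // groupBy + pivot turns the distinct values of "key" into columns;
    // first("value") picks the value for each (id, key) cell.
    val wide = longDf
      .groupBy("id")
      .pivot("key")
      .agg(first("value"))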

Re: How to save spark-ML model in Java?

2017-01-19 Thread Xiaomeng Wan
cv.fit is going to give you a CrossValidatorModel. If you want to extract the real model built, you need to do

    val cvModel = cv.fit(data)
    val plmodel = cvModel.bestModel.asInstanceOf[PipelineModel]
    val model = plmodel.stages(2).asInstanceOf[whatever_model]

then you can model.save
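For example, a hedged sketch assuming the pipeline's third stage (index 2) happens to be a RandomForestClassificationModel; substitute whatever estimator you actually used, and the path is only a placeholder:

    import org.apache.spark.ml.PipelineModel
    import org.apache.spark.ml.classification.RandomForestClassificationModel

    val cvModel = cv.fit(data)
    val plModel = cvModel.bestModel.asInstanceOf[PipelineModel]
    val rfModel = plModel.stages(2).asInstanceOf[RandomForestClassificationModel]
    rfModel.save("/tmp/rf-model")  // MLWritable.save; path is an example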

Re: filter rows by all columns

2017-01-17 Thread Xiaomeng Wan
Thank you Hyukjin, it works. This is what I ended up doing:

    df.filter(_.toSeq.zipWithIndex.forall(v =>
      math.abs(v._1.toString().toDouble - means(v._2)) <= 3 * stddevs(v._2))).show()

Regards, Shawn On 16 January 2017 at 18:30, Hyukjin Kwon wrote: > Hi Shawn, > > Could we do this as
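A more self-contained sketch of the same idea, assuming a DataFrame df whose columns are all numeric (how means and stddevs are computed is not shown in the thread, so that part is an assumption):

    import org.apache.spark.sql.functions._

    // Compute per-column mean and stddev once, collect them to the driver,
    // then keep only rows whose values all lie within 3 standard deviations.
    val cols = df.columns
    val stats = df.select(cols.flatMap(c => Seq(mean(c), stddev(c))): _*).first()
    val means = cols.indices.map(i => stats.getDouble(2 * i)).toArray
    val stddevs = cols.indices.map(i => stats.getDouble(2 * i + 1)).toArray

    val filtered = df.filter(_.toSeq.zipWithIndex.forall { case (v, i) =>
      math.abs(v.toString.toDouble - means(i)) <= 3 * stddevs(i)
    })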

filter rows based on all columns

2017-01-13 Thread Xiaomeng Wan
I need to filter out outliers from a dataframe on all columns. I can manually list all columns like:

    df.filter(x => math.abs(x.get(0).toString().toDouble - means(0)) <= 3 * stddevs(0))
      .filter(x => math.abs(x.get(1).toString().toDouble - means(1)) <= 3 * stddevs(1))
      ...

But I want to turn it into a

spark linear regression error training dataset is empty

2016-12-21 Thread Xiaomeng Wan
Hi, I am running linear regression on a dataframe and get the following error:

    Exception in thread "main" java.lang.AssertionError: assertion failed: Training dataset is empty.
        at scala.Predef$.assert(Predef.scala:170)
        at

build models in parallel

2016-11-29 Thread Xiaomeng Wan
I want to divide big data into groups (e.g. group by some id) and build one model for each group. I am wondering whether I can parallelize the model building process by implementing a UDAF (e.g. running LinearRegression in its evaluate method). Is it good practice? Does anybody have experience? Thanks!
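One alternative to the UDAF idea, sketched under assumptions (a DataFrame df with a groupId column and the usual features/label columns already prepared; names are placeholders), is to fit one model per group from the driver and let the scheduler run the jobs concurrently:

    import org.apache.spark.ml.regression.LinearRegression
    import org.apache.spark.sql.functions.col

    // Collect the distinct group ids, then fit one LinearRegression per group.
    // The .par parallel collection submits the fits concurrently from the driver.
    val groupIds = df.select("groupId").distinct.collect.map(_.get(0))

    val models = groupIds.par.map { id =>
      val groupDf = df.filter(col("groupId") === id)
      id -> new LinearRegression().fit(groupDf)
    }.seq.toMap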

Re: how to see Pipeline model information

2016-11-24 Thread Xiaomeng Wan
> GBTClassifier seems not to give much detailed model description, but by GBTClassificationModel.toString() we may get the specific trees, which are just what I want. > GBTClassificationModel model = (GBTClassificationModel)get; // wrong, does not compile > > Then ho

Re: how to see Pipeline model information

2016-11-23 Thread Xiaomeng Wan
You can use pipelinemodel.stages(0).asInstanceOf[RandomForestModel]. The number (0 in this example) depends on the order in which you call setStages. Shawn On 23 November 2016 at 10:21, Zhiliang Zhu wrote: > > Dear All, > > I am building model by spark pipeline, and
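A hedged sketch of that pattern for the GBT case in the question, assuming the spark.ml classes and that the GBTClassifier was the first stage passed to setStages:

    import org.apache.spark.ml.PipelineModel
    import org.apache.spark.ml.classification.GBTClassificationModel

    // Pull the fitted GBT model out of the fitted pipeline and print its trees.
    val gbtModel = pipelineModel.stages(0).asInstanceOf[GBTClassificationModel]
    println(gbtModel.toDebugString)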

Re: Spark Partitioning Strategy with Parquet

2016-11-17 Thread Xiaomeng Wan
You can partition on the first n letters of userid. On 17 November 2016 at 08:25, titli batali wrote: > Hi, > > I have a use case, where we have 1000 csv files with a column user_Id, > having 8 million unique users. The data contains: userid,date,transaction, > where we
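A minimal sketch of that suggestion at write time, assuming the column is named user_Id as in the question and that a two-letter prefix is enough (the output path is a placeholder):

    import org.apache.spark.sql.functions._

    // Derive a prefix column and use it as the Parquet partition column, so
    // all users sharing a prefix end up under the same directory.
    val withPrefix = df.withColumn("user_prefix", substring(col("user_Id"), 1, 2))

    withPrefix.write
      .partitionBy("user_prefix")
      .parquet("s3a://bucket/path/transactions")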

load large number of files from s3

2016-11-11 Thread Xiaomeng Wan
Hi, We have 30 million small files (100k each) on s3. I want to know how bad it is to load them directly from s3 (e.g. driver memory, I/O, executor memory, s3 reliability) before merging or distcp'ing them. Does anybody have experience? Thanks in advance! Regards, Shawn

read large number of files on s3

2016-11-08 Thread Xiaomeng Wan
Hi, We have 30 million small (100k each) files on s3 to process. I am thinking about something like below to load them in parallel:

    val data = sc.union(sc.wholeTextFiles("s3a://.../*.json").map(...))
    data.toDF().createOrReplaceTempView("data")

How to estimate the driver memory it should be given? is
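A slightly fuller sketch of that pattern, with the union taken over several hypothetical path prefixes (bucket names, prefixes, and the JSON handling are placeholders, not from the thread):

    // Listing several smaller globs and unioning the resulting RDDs may help
    // keep any single S3 listing (which happens on the driver) manageable.
    val prefixes = Seq("s3a://bucket/data/part1", "s3a://bucket/data/part2")

    val rdds = prefixes.map(p => sc.wholeTextFiles(s"$p/*.json"))
    val data = sc.union(rdds).map { case (_, content) => content }

    import spark.implicits._
    data.toDF("json").createOrReplaceTempView("data")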