Try divide and conquer: create a column x holding the first character of userid,
and group by company + x. If the groups are still too large, try the first two characters.
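The bucketing idea can be sketched in pure Python (not Spark; the column layout and helper name are illustrative). Each bucket holds a small set of userids, and because the buckets are disjoint their distinct counts simply add up:

```python
from collections import defaultdict

def distinct_count_bucketed(rows, prefix_len=1):
    """Count distinct (company, userid) pairs by first bucketing on a
    userid prefix, so each bucket's set stays small."""
    buckets = defaultdict(set)
    for company, userid in rows:
        # bucket key: company plus the first prefix_len characters of userid
        buckets[(company, userid[:prefix_len])].add(userid)
    # buckets are disjoint, so the distinct counts simply add up
    return sum(len(users) for users in buckets.values())

rows = [("acme", "alice"), ("acme", "bob"), ("acme", "alice"), ("beta", "alice")]
print(distinct_count_bucketed(rows))  # 3 distinct (company, userid) pairs
```

Widening the prefix (prefix_len=2) gives the same total with smaller buckets, which is the point of the "first two characters" fallback.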
On 17 July 2018 at 02:25, 崔苗 wrote:
> 30G user data, how to get distinct users count after creating a composite
> key based on company and userid?
Hi,
I recently switched from Scala to Python, and wondered which IDE people are
using for Python. I heard about PyCharm, Spyder, etc. How do they compare
with each other?
Thanks,
Shawn
You could also try pivot.
On 7 February 2017 at 16:13, Everett Anderson
wrote:
>
>
> On Tue, Feb 7, 2017 at 2:21 PM, Michael Armbrust
> wrote:
>
>> I think the fastest way is likely to use a combination of conditionals
>> (when / otherwise),
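In pure Python (not Spark; the row layout is illustrative), the reshaping a pivot performs looks like this — each long-format (id, key, value) row becomes one cell of a wide row keyed by id:

```python
from collections import defaultdict

def pivot(rows):
    """Turn long-format (id, key, value) rows into one wide dict per id --
    the same reshaping a DataFrame groupBy(...).pivot(...) performs."""
    wide = defaultdict(dict)
    for id_, key, value in rows:
        wide[id_][key] = value
    return dict(wide)

long_rows = [("u1", "jan", 10), ("u1", "feb", 20), ("u2", "jan", 5)]
print(pivot(long_rows))
# {'u1': {'jan': 10, 'feb': 20}, 'u2': {'jan': 5}}
```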
cv.fit is going to give you a CrossValidatorModel. If you want to extract
the real model that was built, you need to do
val cvModel = cv.fit(data)
val plmodel = cvModel.bestModel.asInstanceOf[PipelineModel]
val model = plmodel.stages(2).asInstanceOf[whatever_model]
and then you can call model.save.
Thank you Hyukjin,
It works. This is what I ended up doing:
df.filter(_.toSeq.zipWithIndex.forall(v => math.abs(v._1.toString().toDouble -
means(v._2)) <= 3*stddevs(v._2))).show()
Regards,
Shawn
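The per-row test that filter applies can be sketched in pure Python (not Spark; `means` and `stddevs` stand in for precomputed per-column statistics): keep a row only when every value is within three standard deviations of its column mean.

```python
def within_3_sigma(row, means, stddevs):
    """True when every value in the row is within 3 standard deviations
    of its column mean -- the per-row test the filter above applies."""
    return all(abs(v - means[i]) <= 3 * stddevs[i] for i, v in enumerate(row))

means, stddevs = [0.0, 10.0], [1.0, 2.0]
rows = [[0.5, 11.0], [5.0, 10.0]]   # second row: 5.0 is five sigma from the mean
kept = [r for r in rows if within_3_sigma(r, means, stddevs)]
print(kept)  # [[0.5, 11.0]]
```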
On 16 January 2017 at 18:30, Hyukjin Kwon wrote:
> Hi Shawn,
>
> Could we do this as
I need to filter out outliers from a dataframe on all columns. I can
manually list all columns like:
df.filter(x => math.abs(x.get(0).toString().toDouble - means(0)) <= 3*stddevs(0))
  .filter(x => math.abs(x.get(1).toString().toDouble - means(1)) <= 3*stddevs(1))
...
But I want to turn it into a
Hi,
I am running linear regression on a dataframe and get the following error:
Exception in thread "main" java.lang.AssertionError: assertion failed:
Training dataset is empty.
at scala.Predef$.assert(Predef.scala:170)
at
I want to divide big data into groups (e.g. group by some id), and build one
model for each group. I am wondering whether I can parallelize the model
building process by implementing a UDAF (e.g. running LinearRegression in its
evaluate method). Is that good practice? Does anybody have experience? Thanks!
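The group-then-fit pattern can be sketched in pure Python (not Spark; the row layout and ordinary-least-squares fit are illustrative stand-ins for whatever model each group needs):

```python
from collections import defaultdict

def fit_line(points):
    """Ordinary least squares for y = a*x + b over a list of (x, y) pairs."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return a, (sy - a * sx) / n

def fit_per_group(rows):
    """rows of (group_id, x, y): collect each group, then fit one model per group.
    The per-group fits are independent, which is what makes them parallelizable."""
    groups = defaultdict(list)
    for gid, x, y in rows:
        groups[gid].append((x, y))
    return {gid: fit_line(pts) for gid, pts in groups.items()}

rows = [("g1", 0.0, 1.0), ("g1", 1.0, 3.0), ("g2", 0.0, 0.0), ("g2", 2.0, 2.0)]
print(fit_per_group(rows))  # {'g1': (2.0, 1.0), 'g2': (1.0, 0.0)}
```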
> GBTClassifier does not seem to offer much of a detailed model description to
> get, but via GBTClassificationModel.toString() we may get the specific trees,
> which are just what I want.
>
> GBTClassificationModel model = (GBTClassificationModel)get; // wrong,
> does not compile
>
>
> Then ho
You can use pipelineModel.stages(0).asInstanceOf[RandomForestModel]. The
number (0 in this example) passed to stages depends on the order in which you call setStages.
Shawn
On 23 November 2016 at 10:21, Zhiliang Zhu
wrote:
>
> Dear All,
>
> I am building model by spark pipeline, and
You can partition on the first n letters of userid.
On 17 November 2016 at 08:25, titli batali wrote:
> Hi,
>
> I have a use case, where we have 1000 csv files with a column user_Id,
> having 8 million unique users. The data contains: userid,date,transaction,
> where we
Hi,
We have 30 million small files (100k each) on s3. I want to know how bad it
is to load them directly from s3 (e.g. driver memory, IO, executor memory,
s3 reliability) before merging or distcp-ing them. Does anybody have experience? Thanks
in advance!
Regards,
Shawn
Hi,
We have 30 million small (100k each) files on s3 to process. I am thinking
about something like below to load them in parallel
val data = sc.union(sc.wholeTextFiles("s3a://.../*.json").map(...))
  .toDF().createOrReplaceTempView("data")
How to estimate the driver memory it should be given? is
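A back-of-envelope sizing for the question above can be sketched as follows (pure arithmetic; the per-path overhead is an assumed constant for illustration, not a measured Spark figure):

```python
# Back-of-envelope sizing; bytes_per_path is an assumed constant,
# not a measured Spark figure.
n_files = 30_000_000
file_size = 100 * 1024            # ~100 KB each
total_bytes = n_files * file_size
print(f"total data: {total_bytes / 1024**4:.1f} TiB")   # ~2.8 TiB

bytes_per_path = 500              # assumed driver-side cost per file path/metadata
driver_bytes = n_files * bytes_per_path
print(f"driver listing overhead: {driver_bytes / 1024**3:.1f} GiB")
```

Even under these rough assumptions, tens of millions of file paths held on the driver add up to double-digit GiB before any data is read, which is why merging small files first is usually advised.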