filter rows by all columns

2017-01-16 Thread Shawn Wan
I need to filter outliers out of a dataframe across all of its columns. I can
filter on each column manually, like:

df.filter(x => math.abs(x.get(0).toString().toDouble - means(0)) <= 3 * stddevs(0))
  .filter(x => math.abs(x.get(1).toString().toDouble - means(1)) <= 3 * stddevs(1))

...

But I want to turn this into a general function that can handle a variable
number of columns. How could I do that? Thanks in advance!
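One way to generalize this (a sketch, not from the thread) is to fold over the column indices, chaining one filter per column; `means` and `stddevs` are assumed to be per-column statistics indexed in column order. It is shown here on plain Scala collections so it runs standalone; with a Spark DataFrame the same `foldLeft` would chain `df.filter` calls:

```scala
// Sketch: keep only rows within 3 standard deviations on every column.
// `rows`, `means`, `stddevs` are illustrative names, indexed by column.
def filterOutliers(rows: Seq[Seq[Double]],
                   means: Seq[Double],
                   stddevs: Seq[Double]): Seq[Seq[Double]] =
  means.indices.foldLeft(rows) { (kept, i) =>
    // drop rows whose i-th value is more than 3 stddevs from the mean
    kept.filter(r => math.abs(r(i) - means(i)) <= 3 * stddevs(i))
  }

// Usage: two columns, both with mean 0.0 and stddev 1.0
val rows = Seq(Seq(0.5, 0.5), Seq(10.0, 0.0), Seq(0.0, -10.0))
val kept = filterOutliers(rows, Seq(0.0, 0.0), Seq(1.0, 1.0))
// kept == Seq(Seq(0.5, 0.5))
```

With the DataFrame API you could instead build a single predicate from per-column conditions, e.g. `df.columns.zipWithIndex.map { case (c, i) => abs(col(c) - means(i)) <= 3 * stddevs(i) }.reduce(_ && _)` passed to one `df.filter`, which avoids chaining many filters.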


Regards,

Shawn




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/filter-rows-by-all-columns-tp28309.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

load large number of files from s3

2016-11-11 Thread Shawn Wan
Hi,
We have 30 million small files (~100 KB each) on S3. Before merging them or
running distcp, I want to know how badly loading them directly from S3 behaves
(e.g. driver memory, I/O, executor memory, S3 reliability). Does anybody have
experience with this? Thanks in advance!

Regards,
Shawn




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/load-large-number-of-files-from-s3-tp28062.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.