filter rows by all columns
I need to filter outliers out of a DataFrame using every column. I can list the columns manually, for example:

    df.filter(x => math.abs(x.get(0).toString.toDouble - means(0)) <= 3 * stddevs(0))
      .filter(x => math.abs(x.get(1).toString.toDouble - means(1)) <= 3 * stddevs(1))
      ...

but I would like to turn this into a general function that can handle a variable number of columns. How could I do that?

Thanks in advance!

Regards,
Shawn
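One way to generalize this, sketched below under a few assumptions: `means` and `stddevs` are `Seq[Double]` aligned with the DataFrame's column order, every column parses cleanly as a Double, and the function name `filterOutliers` is made up for illustration.

    import org.apache.spark.sql.{DataFrame, Row}

    // Keep a row only if every column is within 3 standard deviations of its mean.
    def filterOutliers(df: DataFrame, means: Seq[Double], stddevs: Seq[Double]): DataFrame = {
      val n = df.columns.length
      df.filter { row: Row =>
        (0 until n).forall { i =>
          math.abs(row.get(i).toString.toDouble - means(i)) <= 3 * stddevs(i)
        }
      }
    }

An alternative that stays in the Column API (and so avoids deserializing each Row) is to build a single combined predicate, e.g. `df.columns.zipWithIndex.map { case (c, i) => abs(col(c).cast("double") - means(i)) <= 3 * stddevs(i) }.reduce(_ && _)`, and pass it to one `df.filter(...)` call.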
load large number of files from s3
Hi,

We have 30 million small files (about 100 KB each) on S3. Before merging them or running distcp, I would like to know how bad it is to load them directly from S3, e.g. in terms of driver memory, I/O, executor memory, and S3 reliability. Does anybody have experience with this?

Thanks in advance!

Regards,
Shawn
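For what it's worth, a minimal sketch of reading the objects directly and compacting them in the same job; the bucket, prefix, and partition count below are placeholders, not recommendations.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("compact-small-files").getOrCreate()

    // With tens of millions of objects, listing the prefix and tracking
    // per-file metadata on the driver is usually the first bottleneck,
    // well before executor memory becomes an issue.
    val raw = spark.read.text("s3a://my-bucket/small-files/*")   // placeholder path

    // Compact into far fewer, larger files before any further processing.
    raw.repartition(1000)
       .write
       .text("s3a://my-bucket/compacted/")                       // placeholder path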