How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread MoTao
Hi all, I have 10M (ID, BINARY) records, and the size of each BINARY is 5MB on average. In my daily application, I need to filter out 10K BINARY records according to an ID list. How should I store the whole dataset to make this filtering faster? I'm using DataFrames in Spark 2.0.0 and I've tried row-based ...
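A minimal sketch of one common layout (not from the thread): write the records as a bucketed, sorted Parquet table, then retrieve the 10K rows with a broadcast join against the ID list. The paths, the table name records_bucketed, the bucket count 256, and the assumption that id is a string column are all hypothetical.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    object StoreAndFilter {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("store-and-filter").getOrCreate()

        // Assumed input: a DataFrame of (id: String, binary: Array[Byte]) pairs.
        val records = spark.read.parquet("/data/raw_records")

        // Bucket and sort by id so a lookup touches only a few files and
        // Parquet min/max statistics let Spark skip whole row groups.
        // Note: bucketBy requires saveAsTable (available since Spark 2.0).
        records.write
          .bucketBy(256, "id")
          .sortBy("id")
          .format("parquet")
          .saveAsTable("records_bucketed")

        // Retrieval: broadcast-join against the small ID list; for 10K ids
        // this avoids building one giant IN (...) predicate.
        val wanted = spark.read.text("/data/id_list").toDF("id")
        val filtered = spark.table("records_bucketed").join(broadcast(wanted), "id")

        filtered.write.parquet("/data/filtered")
      }
    }

A broadcast join never shuffles the large side, which matters here: with a 5MB payload per row, any plan that reshuffles the BINARY column will dominate the runtime.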

How to generate a column using mapPartitions and then add it back to the df?

2016-08-08 Thread MoTao
Hi all, I'm trying to append a column to a df. I understand that the new column must be created by 1) using literals, 2) transforming an existing column of the df, or 3) applying a UDF over the df. In my case, the column to be appended is created by processing each row, like val df =
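One way to do this without any join (my sketch, not the poster's code): run mapPartitions over the DataFrame's underlying RDD[Row], carry the existing columns through, and append the computed value, then rebuild the DataFrame with an extended schema. The score column, the per-partition "resource", and the toy input are invented for illustration.

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

    object AppendColumn {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("append-column").getOrCreate()

        val df = spark.range(0, 100).toDF("id")  // stand-in for the real df

        // Original schema plus the new column we are about to compute.
        val newSchema = StructType(
          df.schema.fields :+ StructField("score", DoubleType, nullable = false))

        val withScore = spark.createDataFrame(
          df.rdd.mapPartitions { rows =>
            // Hypothetical expensive setup done once per partition, which is
            // usually the reason to prefer mapPartitions over a plain UDF.
            val resource = math.Pi
            rows.map(r => Row.fromSeq(r.toSeq :+ r.getLong(0) * resource))
          },
          newSchema
        )

        withScore.show()
      }
    }

If the new values really must be produced separately, zipWithIndex on the RDD gives a stable key to join back on, but carrying the original columns through a single mapPartitions pass as above avoids that shuffle entirely.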