You need to sort the data by id otherwise q situation can occur where the index does not work. Aside from this, it sounds odd to put a 5 MB column using those formats. This will be also not so efficient. What is in the 5 MB binary data? You could use HAR or maybe Hbase to store this kind of data (if it does not get much larger than 5 MB).
> On 17. Apr 2017, at 08:23, MoTao <[email protected]> wrote: > > Hi all, > > I have 10M (ID, BINARY) record, and the size of each BINARY is 5MB on > average. > In my daily application, I need to filter out 10K BINARY according to an ID > list. > How should I store the whole data to make the filtering faster? > > I'm using DataFrame in Spark 2.0.0 and I've tried row-based format (avro) > and column-based format (orc). > However, both of them require to scan almost ALL records, making the > filtering stage very very slow. > The code block for filtering looks like: > > val IDSet: Set[String] = ... > val checkID = udf { ID: String => IDSet(ID) } > spark.read.orc("/path/to/whole/data") > .filter(checkID($"ID")) > .select($"ID", $"BINARY") > .write... > > Thanks for any advice! > > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/How-to-store-10M-records-in-HDFS-to-speed-up-further-filtering-tp28605.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > --------------------------------------------------------------------- > To unsubscribe e-mail: [email protected] > --------------------------------------------------------------------- To unsubscribe e-mail: [email protected]
