you can build a search tree using ids within each partition to act like an index, or create a bloom filter to see if current partition would have any hit.
What's your expected qps and response time for the filter request? On Mon, Apr 17, 2017 at 2:23 PM, MoTao <[email protected]> wrote: > Hi all, > > I have 10M (ID, BINARY) record, and the size of each BINARY is 5MB on > average. > In my daily application, I need to filter out 10K BINARY according to an ID > list. > How should I store the whole data to make the filtering faster? > > I'm using DataFrame in Spark 2.0.0 and I've tried row-based format (avro) > and column-based format (orc). > However, both of them require to scan almost ALL records, making the > filtering stage very very slow. > The code block for filtering looks like: > > val IDSet: Set[String] = ... > val checkID = udf { ID: String => IDSet(ID) } > spark.read.orc("/path/to/whole/data") > .filter(checkID($"ID")) > .select($"ID", $"BINARY") > .write... > > Thanks for any advice! > > > > > -- > View this message in context: http://apache-spark-user-list. > 1001560.n3.nabble.com/How-to-store-10M-records-in-HDFS-to- > speed-up-further-filtering-tp28605.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > --------------------------------------------------------------------- > To unsubscribe e-mail: [email protected] > >
