1. Per my understanding, for ORC files Spark should push down the filters,
which means the ID column will be scanned for all rows but the binary data is
read only for the matched ones. I haven't dug into the spark-orc reader though..
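
If pushdown is in play, a minimal sketch like this should let the reader skip
non-matching stripes (the path and the ID values are placeholders, and the
flag may already be on depending on your version). Note the predicate has to
be a plain column expression such as isin; a filter wrapped in a udf cannot
be pushed down to the reader:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.col

  val spark = SparkSession.builder()
    .config("spark.sql.orc.filterPushdown", "true") // may be off by default
    .getOrCreate()

  val wantedIds: Seq[String] = Seq("id-0001", "id-0002") // the 10K IDs in practice

  spark.read.orc("/path/to/whole/data")
    .filter(col("ID").isin(wantedIds: _*)) // pushable, unlike a udf
    .select("ID", "BINARY")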

2. ORC itself has a row group index and a bloom filter index. You may try
configurations like 'orc.bloom.filter.columns' on the source table first.
From the Spark side, with mapPartitions, it's possible to build a sort of
index for each partition, along the lines of the sketch below.
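
A very rough sketch of that per-partition index idea (names like `records`,
`wantedIds` and the sizing numbers are placeholders, not a tuned solution).
The one-off indexing pass still reads everything once; the gain is for the
repeated lookups afterwards:

  import org.apache.spark.rdd.RDD
  import org.apache.spark.util.sketch.BloomFilter

  // (ID, BINARY) pairs, however you load them
  val records: RDD[(String, Array[Byte])] = ???

  // one-off pass: remember which IDs each partition holds, in compact form
  val partitionFilters: Array[(Int, BloomFilter)] =
    records.mapPartitionsWithIndex { (idx, iter) =>
      val bf = BloomFilter.create(100000, 0.01) // expected IDs per partition, fpp
      iter.foreach { case (id, _) => bf.putString(id) }
      Iterator((idx, bf))
    }.collect()

  // query time: keep only the partitions that may contain a wanted ID...
  val wantedIds: Set[String] = ???
  val candidates = partitionFilters.collect {
    case (idx, bf) if wantedIds.exists(id => bf.mightContainString(id)) => idx
  }.toSet

  // ...and only those partitions do any real work
  val matched = records.mapPartitionsWithIndex { (idx, iter) =>
    if (candidates(idx)) iter.filter { case (id, _) => wantedIds(id) }
    else Iterator.empty
  }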

And could you check how many tasks the filter stage has? Maybe there are too
few partitions..
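
Something like this shows how many tasks the scan stage gets (path is a
placeholder):

  val df = spark.read.orc("/path/to/whole/data")
  df.rdd.getNumPartitions // roughly the task count of the filter stage

If that number is very small, giving the stage more partitions (smaller input
splits, or an explicit repartition) may help.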

On Mon, Apr 17, 2017 at 3:01 PM, 莫涛 <mo...@sensetime.com> wrote:

> Hi Ryan,
>
>
> 1. "expected qps and response time for the filter request"
>
> I expect that only the requested BINARY records are scanned instead of all
> records, so the response time would be about "10K * 5MB / disk read speed",
> or a few times that.
>
> In practice, our cluster has 30 SAS disks and scanning all the 10M * 5MB
> data takes about 6 hours now. It should become several minutes as expected.
>
>
> 2. "build a search tree using ids within each partition to act like an
> index, or create a bloom filter to see if current partition would have any
> hit"
>
> Sounds like the thing I'm looking for!
>
> Could you kindly provide some links for reference? I found nothing in the
> Spark documentation about an index or bloom filter working inside a
> partition.
>
>
> Thanks very much!
>
>
> Mo Tao
>
> ------------------------------
> *From:* Ryan <ryan.hd....@gmail.com>
> *Sent:* April 17, 2017 14:32:00
> *To:* 莫涛
> *Cc:* user
> *主题:* Re: How to store 10M records in HDFS to speed up further filtering?
>
> You can build a search tree using IDs within each partition to act like an
> index, or create a bloom filter to see if the current partition would have
> any hit.
>
> What's your expected qps and response time for the filter request?
>
>
> On Mon, Apr 17, 2017 at 2:23 PM, MoTao <mo...@sensetime.com> wrote:
>
>> Hi all,
>>
>> I have 10M (ID, BINARY) records, and the size of each BINARY is 5MB on
>> average.
>> In my daily application, I need to filter out 10K BINARY records according
>> to an ID list.
>> How should I store the whole data to make the filtering faster?
>>
>> I'm using DataFrames in Spark 2.0.0 and I've tried a row-based format
>> (avro) and a column-based format (orc).
>> However, both of them require scanning almost ALL records, which makes the
>> filtering stage very, very slow.
>> The code block for filtering looks like:
>>
>> val IDSet: Set[String] = ...
>> val checkID = udf { ID: String => IDSet(ID) }
>> spark.read.orc("/path/to/whole/data")
>>   .filter(checkID($"ID"))
>>   .select($"ID", $"BINARY")
>>   .write...
>>
>> Thanks for any advice!
>>
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-store-10M-records-in-HDFS-to-speed-up-further-filtering-tp28605.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>>
>
