Hi all,

I'm looking for suggestions on how to optimize a number of Hadoop jobs (written 
using Cascading) that only need a fraction of the records stored in Avro files.

I have a small set (say 10K) of essentially random keys out of a total of 100M 
unique values, and I need to select & process all and only those records in my 
Avro files whose key field matches one of them. The set of keys of interest 
changes with each run.

I have about 1TB of compressed data to scan through, saved as roughly 200 files 
of about 5GB each. This represents about 10B records.

The data format has to stay as Avro, for interchange with various groups.

As I'm building the Avro files, I could sort by the key field.

I'm wondering if it's feasible to build a skip table that would let me seek to 
a sync position in the Avro file and start reading from there. If the default 
sync interval is 16KB, then I'd have roughly 65M sync points to work with, and 
even if every key of interest had 100 records that each landed in a separate 
block, this would still dramatically cut down on the amount of data I'd have to 
scan over.
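To make that concrete, here's roughly what I have in mind, as a rough sketch in 
plain Java (not Cascading) against a local file. I'm assuming a long key field 
named "key", using a TreeMap to stand in for the skip table (which would really 
get persisted alongside each file), and relying on DataFileWriter.sync() 
returning a position that DataFileReader.seek() can jump to; on HDFS I'd 
presumably go through FsInput instead of a plain File.

import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroSkipTableSketch {

    // Force a block boundary every N records so block sizes are predictable.
    private static final int RECORDS_PER_BLOCK = 16 * 1024;

    // Write records (already sorted by the "key" field) and build a skip table
    // of (first key in block -> position returned by DataFileWriter.sync()).
    public static TreeMap<Long, Long> writeSorted(Schema schema, File out,
            Iterable<GenericRecord> sortedRecords) throws IOException {
        TreeMap<Long, Long> skipTable = new TreeMap<>();
        try (DataFileWriter<GenericRecord> writer =
                new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, out);
            long blockPos = writer.sync();       // seekable position of the next block
            boolean startOfBlock = true;
            long count = 0;
            for (GenericRecord rec : sortedRecords) {
                if (startOfBlock) {
                    skipTable.put((Long) rec.get("key"), blockPos);
                    startOfBlock = false;
                }
                writer.append(rec);
                if (++count % RECORDS_PER_BLOCK == 0) {
                    blockPos = writer.sync();    // end this block, note where the next starts
                    startOfBlock = true;
                }
            }
        }
        return skipTable;                        // would get persisted next to the file
    }

    // For each key of interest, seek to the last block whose first key is
    // strictly less than the target (earlier blocks can't contain it, since
    // the file is sorted), then scan forward until we pass the target.
    public static void readMatches(File in, TreeMap<Long, Long> skipTable,
            List<Long> keysOfInterest) throws IOException {
        if (skipTable.isEmpty()) {
            return;
        }
        try (DataFileReader<GenericRecord> reader =
                new DataFileReader<>(in, new GenericDatumReader<GenericRecord>())) {
            for (long target : keysOfInterest) {
                Map.Entry<Long, Long> block = skipTable.lowerEntry(target);
                long startPos = (block != null)
                        ? block.getValue()
                        : skipTable.firstEntry().getValue();
                reader.seek(startPos);           // jump straight to that block's sync point
                while (reader.hasNext()) {
                    GenericRecord rec = reader.next();
                    long key = (Long) rec.get("key");
                    if (key > target) {
                        break;                   // sorted, so we're past all matches
                    }
                    if (key == target) {
                        process(rec);            // hypothetical per-record handler
                    }
                }
            }
        }
    }

    private static void process(GenericRecord rec) {
        System.out.println(rec);
    }
}

If the keys of interest are themselves sorted, many of those seeks would land 
on blocks already being read, so there's probably more to squeeze out there -- 
but that's the basic idea.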

But is that possible? Any input would be appreciated.

Thanks,

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr




