Ken, Did you get this working? If so, I'd love to get some pointers (or a blog post) on how you did it. I'm interested in the same functionality in the near future, but don't have time right now to dive in and figure it out.
-Michael

From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
Sent: Thursday, December 04, 2014 11:05 PM
To: user@avro.apache.org
Subject: RE: Building skip table for Avro data

Hi Doug,

From: Doug Cutting
Sent: December 4, 2014 7:41:05am PST
To: user@avro.apache.org
Subject: Re: Building skip table for Avro data

> Have you looked at SortedKeyValueFile?
> https://avro.apache.org/docs/current/api/java/org/apache/avro/hadoop/file/SortedKeyValueFile.html
> This may already provide what you need.

Seems pretty darn close.

So to use this in a regular Hadoop job, it looks like I'd essentially not emit anything via the reducer's standard context.write() call, but instead I'd need to set up a SortedKeyValueFile.Writer to write to the job's output directory (using a sub-directory Path based on the task number), yes? Which means I'd wind up with:

<job output path>/part-xxxxx/data
<job output path>/part-xxxxx/index

However I'm a little confused about the /data file being "an ordinary Avro container file". The doc says:

> Each record has exactly two fields, 'key' and 'value'. The keys are sorted lexicographically.

But both the key and value fields have their own Avro Schema, right? Or at least that's what I assume from the withKeySchema (https://avro.apache.org/docs/current/api/java/org/apache/avro/hadoop/file/SortedKeyValueFile.Writer.Options.html#withKeySchema(org.apache.avro.Schema)) and withValueSchema calls for setting up the writer options. So in that case:

a. How is it sorted lexicographically (as per the SortedKeyValueFile JavaDocs)?
b. How would a reader who's expecting a regular Avro file read the records?
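For what it's worth, the writer setup Ken describes might look roughly like this. This is only a sketch: the output path layout, the schemas, and the helper method name are placeholders, and the index interval shown is an arbitrary choice.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.hadoop.file.SortedKeyValueFile;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

import java.io.IOException;

public class SkvfReducerSketch {

  // Hypothetical helper: open a SortedKeyValueFile writer for one reduce task,
  // instead of emitting records via context.write().
  public static SortedKeyValueFile.Writer<CharSequence, GenericRecord> openWriter(
      Configuration conf, Path jobOutput, int taskId,
      Schema keySchema, Schema valueSchema) throws IOException {

    SortedKeyValueFile.Writer.Options options =
        new SortedKeyValueFile.Writer.Options()
            .withConfiguration(conf)
            .withKeySchema(keySchema)
            .withValueSchema(valueSchema)
            // Each part-xxxxx is a *directory* containing the data and index files.
            .withPath(new Path(jobOutput, String.format("part-%05d", taskId)))
            .withIndexInterval(128); // write every 128th key into the index

    // Records must then be appended in sorted key order:
    //   writer.append(key, value);
    // and the writer closed when the task finishes.
    return new SortedKeyValueFile.Writer<>(options);
  }
}
```

Lookups on the result would go through SortedKeyValueFile.Reader rather than a plain DataFileReader, which sidesteps Ken's question (b) for the indexed-access case.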
Would they get records that were the union of fields in the key + value schemas?

Thanks again,

-- Ken

On Dec 3, 2014 10:14 PM, "Joey Echeverria" <j...@cloudera.com> wrote:

It sounds feasible to me. You can certainly seek to a specific sync marker, and so long as you're periodically calling sync to get the last position, you can save those offsets in a separate file (or files) that you load into memory or search sequentially.

This sounds very similar to MapFiles, which used a pair of SequenceFiles: one with the data and one with an index of every Nth key to speed up lookups of sorted data.

-Joey

On Wed, Dec 3, 2014 at 6:06 PM, Ken Krugler <kkrugler_li...@transpac.com> wrote:
> Hi all,
>
> I'm looking for suggestions on how to optimize a number of Hadoop jobs
> (written using Cascading) that only need a fraction of the records stored in
> Avro files.
>
> Essentially I have a small number (let's say 10K) of essentially random keys
> out of a total of 100M unique values, and I need to select & process all and
> only those records in my Avro files where the key field matches. The set of
> keys that are of interest changes with each run.
>
> I have about 1TB of compressed data to scan through, saved as about 200 5GB
> files. This represents about 10B records.
>
> The data format has to stay as Avro, for interchange with various groups.
>
> As I'm building the Avro files, I could sort by the key field.
>
> I'm wondering if it's feasible to build a skip table that would let me seek
> to a sync position in the Avro file and read from it. If the default sync
> interval is 16K, then I'd have 65M of these that I could use, and even if
> every key of interest had 100 records that were each in a separate block,
> this would still dramatically cut down on the amount of data I'd have to
> scan over.
>
> But is that possible? Any input would be appreciated.
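Joey's approach above can be sketched as an in-memory skip table: while writing, periodically call DataFileWriter.sync() and record (first key of the next block, returned position); at read time, floor-lookup the target key and hand that offset to DataFileReader.seek(). A minimal sketch of the lookup side, with the Avro calls indicated only in comments (the SkipTable class itself is illustrative, not part of Avro):

```java
import java.util.Map;
import java.util.TreeMap;

// In-memory skip table mapping the first key of each Avro block to the
// file position returned by DataFileWriter.sync() when that block started.
public class SkipTable {
  private final TreeMap<String, Long> index = new TreeMap<>();

  // Called at write time: record (first key in block, sync position).
  public void add(String firstKeyInBlock, long syncPosition) {
    index.put(firstKeyInBlock, syncPosition);
  }

  // Returns the position to seek to before scanning for 'key': the start of
  // the last block whose first key is <= key (0 = start of file if none).
  // The reader side would then do roughly:
  //   reader.seek(position);              // DataFileReader.seek(long)
  //   while (reader.hasNext()) { ... }    // scan forward until past the key
  public long positionFor(String key) {
    Map.Entry<String, Long> entry = index.floorEntry(key);
    return entry == null ? 0L : entry.getValue();
  }
}
```

Since the keys within each file are sorted, the forward scan can stop as soon as it reads a key greater than the target, so each lookup touches only the blocks that could contain matches.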
>
> Thanks,
>
> -- Ken

--
Joey Echeverria

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr