Ha! Quite possibly. Let's JIRA it up.

Victor, I haven't had much coffee yet, but it looks like there is a bug in the gist-- the MultiSequenceFileInputFormat refers to a new CombineFileRecordReader, which has a different constructor signature from the MultiSequenceFileRecordReader in the patch. What did I miss?

J

On Wed, Feb 13, 2013 at 8:34 AM, Dave Beech <[email protected]> wrote:

> Love it enough to write it for us? ;) I'll stick it in JIRA just in
> case. Or if not, maybe one day I'll have a free couple of hours and
> feel like doing it myself!
>
> Cheers,
> Dave
>
> On 13 February 2013 16:18, Josh Wills <[email protected]> wrote:
> > Yep, I would love that.
> >
> >
> > On Wed, Feb 13, 2013 at 7:30 AM, Dave Beech <[email protected]> wrote:
> >
> >> Actually, while we're on the subject of small files and
> >> CombineFileInputFormat...
> >>
> >> I believe Hive has a feature whereby CombineFileInputFormat is used
> >> internally if it's required to read many small files to make the
> >> resulting mapreduce jobs more efficient. Would it be worth looking
> >> into whether Crunch could support this, too?
> >>
> >>
> >> On 13 February 2013 15:27, Dave Beech <[email protected]> wrote:
> >> > thanks!
> >> >
> >> > On 13 February 2013 15:22, Victor Iacoban <[email protected]> wrote:
> >> >> https://gist.github.com/viacoban/4945325
> >> >>
> >> >>
> >> >> On Wed, Feb 13, 2013 at 9:59 AM, Dave Beech <[email protected]> wrote:
> >> >>
> >> >>> A gist would be great - thanks very much
> >> >>>
> >> >>> Dave
> >> >>>
> >> >>> On 13 February 2013 14:52, Victor Iacoban <[email protected]> wrote:
> >> >>> > Dave,
> >> >>> >
> >> >>> > How do you want this, copy pasted code into a gist or a reusable jar?
> >> >>> >
> >> >>> > --victor
> >> >>> >
> >> >>> >
> >> >>> > On Wed, Feb 13, 2013 at 3:59 AM, Dave Beech <[email protected]> wrote:
> >> >>> >
> >> >>> >> Hi Victor,
> >> >>> >> Any chance you could share your implementation of a Source that reads
> >> >>> >> from multiple paths? I've wanted this for a while but haven't found
> >> >>> >> time to go ahead and write one myself!
> >> >>> >> Thanks,
> >> >>> >> Dave
> >> >>> >>
> >> >>> >> On 12 February 2013 23:07, Victor Iacoban <[email protected]> wrote:
> >> >>> >> > Thanks J
> >> >>> >> >
> >> >>> >> > I could not extend the FileSourceImpl since it works with only one
> >> >>> >> > input path,
> >> >>> >> > but I implemented the Source interface directly and it appears to do
> >> >>> >> > the job, thx for the pointer
> >> >>> >> >
> >> >>> >> > -- victor
> >> >>> >> >
> >> >>> >> >
> >> >>> >> >
> >> >>> >> > On Tue, Feb 12, 2013 at 5:20 PM, Josh Wills <[email protected]> wrote:
> >> >>> >> >
> >> >>> >> >> Yep-- check out the formattedFile function in o.a.c.io.From. You can
> >> >>> >> >> also write a custom extension of o.a.c.io.impl.FileSourceImpl if it's
> >> >>> >> >> one you're going to be using a lot, or if there is custom
> >> >>> >> >> configuration information required to use the InputFormat.
> >> >>> >> >>
> >> >>> >> >> J
> >> >>> >> >>
> >> >>> >> >>
> >> >>> >> >> On Tue, Feb 12, 2013 at 2:13 PM, Victor Iacoban <[email protected]> wrote:
> >> >>> >> >>
> >> >>> >> >> > That's exactly what I have in the code not using Crunch API:
> >> >>> >> >> > public class MultiSequenceFileInputFormat<K, V> extends CombineFileInputFormat<K, V> {
> >> >>> >> >> > ...
> >> >>> >> >> > }
> >> >>> >> >> >
> >> >>> >> >> > Are you saying there is way to use my custom input format with
> >> >>> >> >> > Crunch?
> >> >>> >> >> >
> >> >>> >> >> >
> >> >>> >> >> >
> >> >>> >> >> > On Tue, Feb 12, 2013 at 5:06 PM, Josh Wills <[email protected]> wrote:
> >> >>> >> >> >
> >> >>> >> >> > > Depends on the size of the files-- if there are a bunch of tiny
> >> >>> >> >> > > ones, it can be worthwhile to have a CombineFileInputFormat, ala
> >> >>> >> >> > >
> >> >>> >> >> > > http://yaseminavcular.blogspot.com/2011/03/many-small-input-files.html
> >> >>> >> >> > >
> >> >>> >> >> > > J
> >> >>> >> >> > >
> >> >>> >> >> > >
> >> >>> >> >> > > On Tue, Feb 12, 2013 at 1:56 PM, Victor Iacoban <[email protected]> wrote:
> >> >>> >> >> > >
> >> >>> >> >> > > > Thanks Josh,
> >> >>> >> >> > > > Is there any performance penalty in unions, assuming that I
> >> >>> >> >> > > > have several hundreds of input files?
> >> >>> >> >> > > >
> >> >>> >> >> > > >
> >> >>> >> >> > > > On Tue, Feb 12, 2013 at 4:39 PM, Josh Wills <[email protected]> wrote:
> >> >>> >> >> > > >
> >> >>> >> >> > > > > Yeah, of course-- that's how stuff like joins work.
> >> >>> >> >> > > > >
> >> >>> >> >> > > > > PTable<K, V> first = pipeline.read(new TableSource<K, V>(firstFile));
> >> >>> >> >> > > > > PTable<K, V> second = ...;
> >> >>> >> >> > > > > PTable<K, V> union = first.union(second);
> >> >>> >> >> > > > >
> >> >>> >> >> > > > > etc.
> >> >>> >> >> > > > >
> >> >>> >> >> > > > >
> >> >>> >> >> > > > > On Tue, Feb 12, 2013 at 1:36 PM, Victor Iacoban <[email protected]> wrote:
> >> >>> >> >> > > > >
> >> >>> >> >> > > > > > Is there any support in crunch to use multiple sequence
> >> >>> >> >> > > > > > files as pipeline source?
> >> >>> >> >> > > > > > something similar to standard MultipleInputs
> >> >>> >> >> > > > > >
> >> >>> >> >> > > > > > Thanks,
> >> >>> >> >> > > > > > victor
> >> >>> >> >> > > > > >
> >> >>> >> >> > > > >
> >> >>> >> >> > > >
> >> >>> >> >> > >
> >> >>> >> >> >
> >> >>> >> >>
> >> >>> >>
> >> >>>
> >>
> >
> >
> >
> > --
> > Director of Data Science
> > Cloudera <http://www.cloudera.com>
> > Twitter: @josh_wills <http://twitter.com/josh_wills>
>
