https://gist.github.com/viacoban/4945325
On Wed, Feb 13, 2013 at 9:59 AM, Dave Beech <[email protected]> wrote:

> A gist would be great - thanks very much
>
> Dave
>
> On 13 February 2013 14:52, Victor Iacoban <[email protected]> wrote:
> > Dave,
> >
> > How do you want this, copy pasted code into a gist or a reusable jar?
> >
> > --victor
> >
> >
> > On Wed, Feb 13, 2013 at 3:59 AM, Dave Beech <[email protected]> wrote:
> >
> >> Hi Victor,
> >> Any chance you could share your implementation of a Source that reads
> >> from multiple paths? I've wanted this for a while but haven't found
> >> time to go ahead and write one myself!
> >> Thanks,
> >> Dave
> >>
> >> On 12 February 2013 23:07, Victor Iacoban <[email protected]> wrote:
> >> > Thanks J
> >> >
> >> > I could not extend the FileSourceImpl since it works with only one
> >> > input path,
> >> > but I implemented the Source interface directly and it appears to do
> >> > the job, thx for the pointer
> >> >
> >> > -- victor
> >> >
> >> >
> >> > On Tue, Feb 12, 2013 at 5:20 PM, Josh Wills <[email protected]> wrote:
> >> >
> >> >> Yep-- check out the formattedFile function in o.a.c.io.From. You can
> >> >> also write a custom extension of o.a.c.io.impl.FileSourceImpl if it's
> >> >> one you're going to be using a lot, or if there is custom
> >> >> configuration information required to use the InputFormat.
> >> >>
> >> >> J
> >> >>
> >> >>
> >> >> On Tue, Feb 12, 2013 at 2:13 PM, Victor Iacoban <[email protected]> wrote:
> >> >>
> >> >> > That's exactly what I have in the code not using Crunch API:
> >> >> > public class MultiSequenceFileInputFormat<K, V> extends CombineFileInputFormat<K, V> {
> >> >> > ...
> >> >> > }
> >> >> >
> >> >> > Are you saying there is way to use my custom input format with
> >> >> > Crunch?
> >> >> >
> >> >> >
> >> >> > On Tue, Feb 12, 2013 at 5:06 PM, Josh Wills <[email protected]> wrote:
> >> >> >
> >> >> > > Depends on the size of the files-- if there are a bunch of tiny
> >> >> > > ones, it can be worthwhile to have a CombineFileInputFormat, ala
> >> >> > >
> >> >> > > http://yaseminavcular.blogspot.com/2011/03/many-small-input-files.html
> >> >> > >
> >> >> > > J
> >> >> > >
> >> >> > >
> >> >> > > On Tue, Feb 12, 2013 at 1:56 PM, Victor Iacoban <[email protected]> wrote:
> >> >> > >
> >> >> > > > Thanks Josh,
> >> >> > > > Is there any performance penalty in unions, assuming that I
> >> >> > > > have several hundreds of input files?
> >> >> > > >
> >> >> > > >
> >> >> > > > On Tue, Feb 12, 2013 at 4:39 PM, Josh Wills <[email protected]> wrote:
> >> >> > > >
> >> >> > > > > Yeah, of course-- that's how stuff like joins work.
> >> >> > > > >
> >> >> > > > > PTable<K, V> first = pipeline.read(new TableSource<K, V>(firstFile));
> >> >> > > > > PTable<K, V> second = ...;
> >> >> > > > > PTable<K, V> union = first.union(second);
> >> >> > > > >
> >> >> > > > > etc.
> >> >> > > > >
> >> >> > > > >
> >> >> > > > > On Tue, Feb 12, 2013 at 1:36 PM, Victor Iacoban <[email protected]> wrote:
> >> >> > > > >
> >> >> > > > > > Is there any support in crunch to use multiple sequence
> >> >> > > > > > files as pipeline source?
> >> >> > > > > > something similar to standard MultipleInputs
> >> >> > > > > >
> >> >> > > > > > Thanks,
> >> >> > > > > > victor
