Hi Victor,

Any chance you could share your implementation of a Source that reads from multiple paths? I've wanted this for a while but haven't found time to go ahead and write one myself!

Thanks,
Dave

On 12 February 2013 23:07, Victor Iacoban <[email protected]> wrote:
> Thanks J
>
> I could not extend the FileSourceImpl since it works with only one input
> path, but I implemented the Source interface directly and it appears to do
> the job, thx for the pointer
>
> -- victor
>
> On Tue, Feb 12, 2013 at 5:20 PM, Josh Wills <[email protected]> wrote:
>
>> Yep-- check out the formattedFile function in o.a.c.io.From. You can also
>> write a custom extension of o.a.c.io.impl.FileSourceImpl if it's one you're
>> going to be using a lot, or if there is custom configuration information
>> required to use the InputFormat.
>>
>> J
>>
>> On Tue, Feb 12, 2013 at 2:13 PM, Victor Iacoban <[email protected]> wrote:
>>
>> > That's exactly what I have in the code not using Crunch API:
>> > public class MultiSequenceFileInputFormat<K, V> extends
>> > CombineFileInputFormat<K, V> {
>> > ...
>> > }
>> >
>> > Are you saying there is way to use my custom input format with Crunch?
>> >
>> > On Tue, Feb 12, 2013 at 5:06 PM, Josh Wills <[email protected]> wrote:
>> >
>> > > Depends on the size of the files-- if there are a bunch of tiny ones, it
>> > > can be worthwhile to have a CombineFileInputFormat, ala
>> > >
>> > > http://yaseminavcular.blogspot.com/2011/03/many-small-input-files.html
>> > >
>> > > J
>> > >
>> > > On Tue, Feb 12, 2013 at 1:56 PM, Victor Iacoban <[email protected]> wrote:
>> > >
>> > > > Thanks Josh,
>> > > > Is there any performance penalty in unions, assuming that I have several
>> > > > hundreds of input files?
>> > > >
>> > > > On Tue, Feb 12, 2013 at 4:39 PM, Josh Wills <[email protected]> wrote:
>> > > >
>> > > > > Yeah, of course-- that's how stuff like joins work.
>> > > > >
>> > > > > PTable<K, V> first = pipeline.read(new TableSource<K, V>(firstFile));
>> > > > > PTable<K, V> second = ...;
>> > > > > PTable<K, V> union = first.union(second);
>> > > > >
>> > > > > etc.
>> > > > >
>> > > > > On Tue, Feb 12, 2013 at 1:36 PM, Victor Iacoban <[email protected]> wrote:
>> > > > >
>> > > > > > Is there any support in crunch to use multiple sequence files as pipeline
>> > > > > > source?
>> > > > > > something similar to standard MultipleInputs
>> > > > > >
>> > > > > > Thanks,
>> > > > > > victor
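
For anyone skimming the thread: Josh's point about many tiny files can be illustrated with a minimal sketch, in plain Java with no Hadoop or Crunch dependency, of the packing idea behind a CombineFileInputFormat. The class name SplitPacker, the method pack, and the file sizes are hypothetical and only for illustration. This is not Hadoop's actual algorithm, which also takes node and rack locality into account.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitPacker {
    /**
     * Greedily groups file sizes (in bytes) into combined splits so that no
     * split exceeds maxSplitBytes, unless a single file is itself larger.
     */
    public static List<List<Long>> pack(List<Long> fileSizes, long maxSplitBytes) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current split if adding this file would overflow it.
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        // Keep the last, possibly partially filled, split.
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        // 300 hypothetical files of 1 MB each, packed into 64 MB splits.
        List<Long> sizes = new ArrayList<>();
        for (int i = 0; i < 300; i++) {
            sizes.add(1L << 20);
        }
        List<List<Long>> splits = pack(sizes, 64L << 20);
        System.out.println(sizes.size() + " files -> " + splits.size() + " splits");
    }
}
```

With these assumed sizes the 300 files collapse into 5 splits, so a handful of map tasks each read many files instead of every file getting a task of its own.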
