Ah, gotcha-- good to know.
On Tue, Feb 12, 2013 at 3:07 PM, Victor Iacoban <[email protected]> wrote:

> Thanks J
>
> I could not extend the FileSourceImpl since it works with only one input
> path, but I implemented the Source interface directly and it appears to do
> the job, thx for the pointer
>
> -- victor
>
> On Tue, Feb 12, 2013 at 5:20 PM, Josh Wills <[email protected]> wrote:
>
> > Yep-- check out the formattedFile function in o.a.c.io.From. You can also
> > write a custom extension of o.a.c.io.impl.FileSourceImpl if it's one you're
> > going to be using a lot, or if there is custom configuration information
> > required to use the InputFormat.
> >
> > J
> >
> > On Tue, Feb 12, 2013 at 2:13 PM, Victor Iacoban <[email protected]> wrote:
> >
> > > That's exactly what I have in the code not using Crunch API:
> > > public class MultiSequenceFileInputFormat<K, V> extends
> > > CombineFileInputFormat<K, V> {
> > > ...
> > > }
> > >
> > > Are you saying there is way to use my custom input format with Crunch?
> > >
> > > On Tue, Feb 12, 2013 at 5:06 PM, Josh Wills <[email protected]> wrote:
> > >
> > > > Depends on the size of the files-- if there are a bunch of tiny ones, it
> > > > can be worthwhile to have a CombineFileInputFormat, ala
> > > >
> > > > http://yaseminavcular.blogspot.com/2011/03/many-small-input-files.html
> > > >
> > > > J
> > > >
> > > > On Tue, Feb 12, 2013 at 1:56 PM, Victor Iacoban <[email protected]> wrote:
> > > >
> > > > > Thanks Josh,
> > > > > Is there any performance penalty in unions, assuming that I have several
> > > > > hundreds of input files?
> > > > >
> > > > > On Tue, Feb 12, 2013 at 4:39 PM, Josh Wills <[email protected]> wrote:
> > > > >
> > > > > > Yeah, of course-- that's how stuff like joins work.
> > > > > >
> > > > > > PTable<K, V> first = pipeline.read(new TableSource<K, V>(firstFile));
> > > > > > PTable<K, V> second = ...;
> > > > > > PTable<K, V> union = first.union(second);
> > > > > >
> > > > > > etc.
> > > > > >
> > > > > > On Tue, Feb 12, 2013 at 1:36 PM, Victor Iacoban <[email protected]> wrote:
> > > > > >
> > > > > > > Is there any support in crunch to use multiple sequence files as pipeline
> > > > > > > source?
> > > > > > > something similar to standard MultipleInputs
> > > > > > >
> > > > > > > Thanks,
> > > > > > > victor

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
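
On the small-files point in the thread: the benefit of a CombineFileInputFormat-style input is that many tiny files get grouped into fewer splits, so fewer map tasks are launched. A minimal sketch of that grouping idea in plain Java follows. It is a simplified greedy packing, not Hadoop's actual split logic (which also takes data locality into account), and SplitPacker, pack, and maxSplitBytes are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: not part of Crunch or Hadoop.
class SplitPacker {

    // Greedily groups file sizes (in bytes), in input order, into splits whose
    // total stays at or below maxSplitBytes. A file larger than the limit
    // ends up alone in its own split.
    static List<List<Long>> pack(List<Long> fileSizes, long maxSplitBytes) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current split if adding this file would overflow it.
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        // Flush the last, partially filled split.
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        List<Long> sizes = Arrays.asList(40L, 50L, 30L, 90L, 10L);
        List<List<Long>> splits = pack(sizes, 100L);
        // Prints: 3 splits: [[40, 50], [30], [90, 10]]
        System.out.println(splits.size() + " splits: " + splits);
    }
}
```

Five input files become three splits here; with several hundred tiny sequence files, the reduction in map tasks is where the savings would come from.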
