Love it enough to write it for us? ;) I'll stick it in JIRA just in case. Or if not, maybe one day I'll have a couple of free hours and feel like doing it myself!
Cheers,
Dave

On 13 February 2013 16:18, Josh Wills <[email protected]> wrote:
> Yep, I would love that.
>
>
> On Wed, Feb 13, 2013 at 7:30 AM, Dave Beech <[email protected]> wrote:
>
>> Actually, while we're on the subject of small files and
>> CombineFileInputFormat...
>>
>> I believe Hive has a feature whereby CombineFileInputFormat is used
>> internally if it's required to read many small files to make the
>> resulting mapreduce jobs more efficient. Would it be worth looking
>> into whether Crunch could support this, too?
>>
>>
>> On 13 February 2013 15:27, Dave Beech <[email protected]> wrote:
>> > thanks!
>> >
>> > On 13 February 2013 15:22, Victor Iacoban <[email protected]>
>> wrote:
>> >> https://gist.github.com/viacoban/4945325
>> >>
>> >>
>> >> On Wed, Feb 13, 2013 at 9:59 AM, Dave Beech <[email protected]>
>> wrote:
>> >>
>> >>> A gist would be great - thanks very much
>> >>>
>> >>> Dave
>> >>>
>> >>> On 13 February 2013 14:52, Victor Iacoban <[email protected]>
>> >>> wrote:
>> >>> > Dave,
>> >>> >
>> >>> > How do you want this, copy pasted code into a gist or a reusable jar?
>> >>> >
>> >>> > --victor
>> >>> >
>> >>> >
>> >>> > On Wed, Feb 13, 2013 at 3:59 AM, Dave Beech <[email protected]>
>> >>> wrote:
>> >>> >
>> >>> >> Hi Victor,
>> >>> >> Any chance you could share your implementation of a Source that
>> reads
>> >>> >> from multiple paths? I've wanted this for a while but haven't found
>> >>> >> time to go ahead and write one myself!
>> >>> >> Thanks,
>> >>> >> Dave
>> >>> >>
>> >>> >> On 12 February 2013 23:07, Victor Iacoban <[email protected]
>> >
>> >>> >> wrote:
>> >>> >> > Thanks J
>> >>> >> >
>> >>> >> > I could not extend the FileSourceImpl since it works with only one
>> >>> input
>> >>> >> > path,
>> >>> >> > but I implemented the Source interface directly and it appears to
>> do
>> >>> the
>> >>> >> > job, thx for the pointer
>> >>> >> >
>> >>> >> > -- victor
>> >>> >> >
>> >>> >> >
>> >>> >> >
>> >>> >> > On Tue, Feb 12, 2013 at 5:20 PM, Josh Wills <[email protected]
>> >
>> >>> >> wrote:
>> >>> >> >
>> >>> >> >> Yep-- check out the formattedFile function in o.a.c.io.From. You
>> can
>> >>> >> also
>> >>> >> >> write a custom extension of o.a.c.io.impl.FileSourceImpl if it's
>> one
>> >>> >> you're
>> >>> >> >> going to be using a lot, or if there is custom configuration
>> >>> information
>> >>> >> >> required to use the InputFormat.
>> >>> >> >>
>> >>> >> >> J
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> On Tue, Feb 12, 2013 at 2:13 PM, Victor Iacoban <
>> >>> >> [email protected]
>> >>> >> >> >wrote:
>> >>> >> >>
>> >>> >> >> > That's exactly what I have in the code not using Crunch API:
>> >>> >> >> > public class MultiSequenceFileInputFormat<K, V> extends
>> >>> >> >> > CombineFileInputFormat<K, V> {
>> >>> >> >> > ...
>> >>> >> >> > }
>> >>> >> >> >
>> >>> >> >> > Are you saying there is way to use my custom input format with
>> >>> Crunch?
>> >>> >> >> >
>> >>> >> >> >
>> >>> >> >> >
>> >>> >> >> > On Tue, Feb 12, 2013 at 5:06 PM, Josh Wills <
>> [email protected]>
>> >>> >> >> wrote:
>> >>> >> >> >
>> >>> >> >> > > Depends on the size of the files-- if there are a bunch of
>> tiny
>> >>> >> ones,
>> >>> >> >> it
>> >>> >> >> > > can be worthwhile to have a CombineFileInputFormat, ala
>> >>> >> >> > >
>> >>> >> >> > >
>> >>> >> >> http://yaseminavcular.blogspot.com/2011/03/many-small-input-files.html
>> >>> >> >> > >
>> >>> >> >> > > J
>> >>> >> >> > >
>> >>> >> >> > >
>> >>> >> >> > > On Tue, Feb 12, 2013 at 1:56 PM, Victor Iacoban <
>> >>> >> >> > [email protected]
>> >>> >> >> > > >wrote:
>> >>> >> >> > >
>> >>> >> >> > > > Thanks Josh,
>> >>> >> >> > > > Is there any performance penalty in unions, assuming that I
>> >>> have
>> >>> >> >> > several
>> >>> >> >> > > > hundreds of input files?
>> >>> >> >> > > >
>> >>> >> >> > > >
>> >>> >> >> > > > On Tue, Feb 12, 2013 at 4:39 PM, Josh Wills <
>> >>> [email protected]
>> >>> >> >
>> >>> >> >> > > wrote:
>> >>> >> >> > > >
>> >>> >> >> > > > > Yeah, of course-- that's how stuff like joins work.
>> >>> >> >> > > > >
>> >>> >> >> > > > > PTable<K, V> first = pipeline.read(new TableSource<K,
>> >>> >> >> V>(firstFile));
>> >>> >> >> > > > > PTable<K, V> second = ...;
>> >>> >> >> > > > > PTable<K, V> union = first.union(second);
>> >>> >> >> > > > >
>> >>> >> >> > > > > etc.
>> >>> >> >> > > > >
>> >>> >> >> > > > >
>> >>> >> >> > > > > On Tue, Feb 12, 2013 at 1:36 PM, Victor Iacoban <
>> >>> >> >> > > > [email protected]
>> >>> >> >> > > > > >wrote:
>> >>> >> >> > > > >
>> >>> >> >> > > > > > Is there any support in crunch to use multiple sequence
>> >>> files
>> >>> >> as
>> >>> >> >> > > > pipeline
>> >>> >> >> > > > > > source?
>> >>> >> >> > > > > > something similar to standard MultipleInputs
>> >>> >> >> > > > > >
>> >>> >> >> > > > > > Thanks,
>> >>> >> >> > > > > > victor
>> >>> >> >> > > > > >
>> >>> >> >> > > > >
>> >>> >> >> > > >
>> >>> >> >> > >
>> >>> >> >> >
>> >>> >> >>
>> >>> >> >>
>>> >> >
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
