Re: multiple input files as pipeline source?

Dave Beech Wed, 13 Feb 2013 08:35:00 -0800

Love it enough to write it for us? ;) I'll stick it in JIRA just in
case. Or if not, maybe one day I'll have a free couple of hours and
feel like doing it myself!


Cheers,
Dave

On 13 February 2013 16:18, Josh Wills <[email protected]> wrote:
> Yep, I would love that.
>
>
> On Wed, Feb 13, 2013 at 7:30 AM, Dave Beech <[email protected]> wrote:
>
>> Actually, while we're on the subject of small files and
>> CombineFileInputFormat...
>>
>> I believe Hive has a feature whereby CombineFileInputFormat is used
>> internally if it's required to read many small files to make the
>> resulting mapreduce jobs more efficient. Would it be worth looking
>> into whether Crunch could support this, too?
>>
>>
>> On 13 February 2013 15:27, Dave Beech <[email protected]> wrote:
>> > thanks!
>> >
>> > On 13 February 2013 15:22, Victor Iacoban <[email protected]> wrote:
>> >> https://gist.github.com/viacoban/4945325
>> >>
>> >>
>> >> On Wed, Feb 13, 2013 at 9:59 AM, Dave Beech <[email protected]> wrote:
>> >>
>> >>> A gist would be great - thanks very much
>> >>>
>> >>> Dave
>> >>>
>> >>> On 13 February 2013 14:52, Victor Iacoban <[email protected]>
>> >>> wrote:
>> >>> > Dave,
>> >>> >
>> >>> > How do you want this, copy pasted code into a gist or a reusable jar?
>> >>> >
>> >>> > --victor
>> >>> >
>> >>> >
>> >>> > On Wed, Feb 13, 2013 at 3:59 AM, Dave Beech <[email protected]> wrote:
>> >>> >
>> >>> >> Hi Victor,
>> >>> >> Any chance you could share your implementation of a Source that reads
>> >>> >> from multiple paths? I've wanted this for a while but haven't found
>> >>> >> time to go ahead and write one myself!
>> >>> >> Thanks,
>> >>> >> Dave
>> >>> >>
>> >>> >> On 12 February 2013 23:07, Victor Iacoban <[email protected]> wrote:
>> >>> >> > Thanks J
>> >>> >> >
>> >>> >> > I could not extend the FileSourceImpl since it works with only one
>> >>> >> > input path, but I implemented the Source interface directly and it
>> >>> >> > appears to do the job, thx for the pointer
>> >>> >> >
>> >>> >> > -- victor
>> >>> >> >
>> >>> >> >
>> >>> >> >
>> >>> >> > On Tue, Feb 12, 2013 at 5:20 PM, Josh Wills <[email protected]> wrote:
>> >>> >> >
>> >>> >> >> Yep-- check out the formattedFile function in o.a.c.io.From. You can
>> >>> >> >> also write a custom extension of o.a.c.io.impl.FileSourceImpl if it's
>> >>> >> >> one you're going to be using a lot, or if there is custom
>> >>> >> >> configuration information required to use the InputFormat.
>> >>> >> >>
>> >>> >> >> J
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> On Tue, Feb 12, 2013 at 2:13 PM, Victor Iacoban <[email protected]> wrote:
>> >>> >> >>
>> >>> >> >> > That's exactly what I have in the code not using Crunch API:
>> >>> >> >> > public class MultiSequenceFileInputFormat<K, V> extends
>> >>> >> >> > CombineFileInputFormat<K, V> {
>> >>> >> >> > ...
>> >>> >> >> > }
>> >>> >> >> >
>> >>> >> >> > Are you saying there is a way to use my custom input format with
>> >>> >> >> > Crunch?
>> >>> >> >> >
>> >>> >> >> >
>> >>> >> >> >
>> >>> >> >> > On Tue, Feb 12, 2013 at 5:06 PM, Josh Wills <[email protected]> wrote:
>> >>> >> >> >
>> >>> >> >> > > Depends on the size of the files-- if there are a bunch of tiny
>> >>> >> >> > > ones, it can be worthwhile to have a CombineFileInputFormat, ala
>> >>> >> >> > >
>> >>> >> >> > > http://yaseminavcular.blogspot.com/2011/03/many-small-input-files.html
>> >>> >> >> > >
>> >>> >> >> > > J
>> >>> >> >> > >
>> >>> >> >> > >
>> >>> >> >> > > On Tue, Feb 12, 2013 at 1:56 PM, Victor Iacoban <[email protected]> wrote:
>> >>> >> >> > >
>> >>> >> >> > > > Thanks Josh,
>> >>> >> >> > > > Is there any performance penalty in unions, assuming that I have
>> >>> >> >> > > > several hundred input files?
>> >>> >> >> > > >
>> >>> >> >> > > >
>> >>> >> >> > > > On Tue, Feb 12, 2013 at 4:39 PM, Josh Wills <[email protected]> wrote:
>> >>> >> >> > > >
>> >>> >> >> > > > > Yeah, of course-- that's how stuff like joins work.
>> >>> >> >> > > > >
>> >>> >> >> > > > > PTable<K, V> first = pipeline.read(new TableSource<K, V>(firstFile));
>> >>> >> >> > > > > PTable<K, V> second = ...;
>> >>> >> >> > > > > PTable<K, V> union = first.union(second);
>> >>> >> >> > > > >
>> >>> >> >> > > > > etc.
>> >>> >> >> > > > >
>> >>> >> >> > > > >
>> >>> >> >> > > > > On Tue, Feb 12, 2013 at 1:36 PM, Victor Iacoban <[email protected]> wrote:
>> >>> >> >> > > > >
>> >>> >> >> > > > > > Is there any support in crunch to use multiple sequence files as
>> >>> >> >> > > > > > pipeline source?
>> >>> >> >> > > > > > something similar to standard MultipleInputs
>> >>> >> >> > > > > >
>> >>> >> >> > > > > > Thanks,
>> >>> >> >> > > > > > victor
>> >>> >> >> > > > > >
>> >>> >> >> > > > >
>> >>> >> >> > > >
>> >>> >> >> > >
>> >>> >> >> >
>> >>> >> >>
>> >>> >>
>> >>>
>>
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>


