There's a patch available that lets you use any available javax.script
language to do the conversion from any Java object type in the
sequence file to Pig types. See
https://issues.apache.org/jira/browse/PIG-1777

On Tue, Sep 24, 2013 at 5:22 AM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:
> I assume by scala you mean scalding?
> If so, yeah, scalding should be much easier for working with custom data
> types.
>
> Pig doesn't handle generic "objects" well. You have to write converters to
> and from, like the ones we created in ElephantBird for Protocol Buffers and
> Thrift (and a bunch of writables, as Pradeep pointed out).
>
> D
>
>
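
The converters mentioned above boil down to two small pieces of plumbing: turning the
Writable into a Pig tuple on the way in, and (if you also store) going back the other
way. A rough, untested sketch in plain Java, using the VisitKey class that appears later
in this thread; the visitorId/timestamp accessors are made up for illustration and are
not the real class, and this is not elephant-bird's actual converter API:

    import java.io.ByteArrayInputStream;
    import java.io.DataInputStream;
    import java.io.IOException;

    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;

    // Illustrative helper only: maps the custom VisitKey Writable to a Pig tuple
    // and rebuilds it from raw bytes. The VisitKey field names are assumptions.
    public class VisitKeyToPig {
        private static final TupleFactory TUPLE_FACTORY = TupleFactory.getInstance();

        // Writable -> Pig: copy each field of the key into a tuple field.
        public static Tuple toTuple(VisitKey key) throws IOException {
            Tuple t = TUPLE_FACTORY.newTuple(2);
            t.set(0, key.getVisitorId());   // chararray
            t.set(1, key.getTimestamp());   // long
            return t;
        }

        // bytes -> Writable: rebuild the key from its own serialized form,
        // assuming the loader hands over the Writable's raw bytes unchanged.
        public static VisitKey fromBytes(byte[] bytes) throws IOException {
            VisitKey key = new VisitKey();
            key.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
            return key;
        }
    }

Elephant-bird wraps this kind of mapping in its own converter classes, so in practice
you would plug logic like this into whatever converter hook the loader you use exposes.
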
> On Tue, Sep 17, 2013 at 9:20 AM, Yang <teddyyyy...@gmail.com> wrote:
>
>> Thanks Pradeep.
>>
>> It seems in this case just using scala/cascalog is easier for my purposes.
>> I tried out scala yesterday; it works fine for me in local mode.
>>
>>
>> On Mon, Sep 16, 2013 at 7:47 PM, Pradeep Gollakota <pradeep...@gmail.com> wrote:
>>
>> > It doesn't look like the SequenceFileLoader from the piggybank has much
>> > support. The elephant-bird version looks like it does what you need it to
>> > do.
>> >
>> > https://github.com/kevinweil/elephant-bird/blob/master/pig/src/main/java/com/twitter/elephantbird/pig/load/SequenceFileLoader.java
>> >
>> > You'll have to write the converters from your types to Pig data types and
>> > pass them into the constructor of the SequenceFileLoader.
>> >
>> > Hope this helps!
>> >
>> >
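
If I remember right, the elephant-bird loader takes the converter class names as
constructor string arguments, so the LOAD ends up looking something like
USING com.twitter.elephantbird.pig.load.SequenceFileLoader('-c com.mycompany.pig.VisitKeyConverter', '-c com.twitter.elephantbird.pig.util.TextConverter').
Treat the '-c' flag, the TextConverter name, and the VisitKeyConverter class as
assumptions from memory, and check the loader's javadoc and constructors before
relying on them.
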
>> > On Mon, Sep 16, 2013 at 6:56 PM, Pradeep Gollakota <pradeep...@gmail.com> wrote:
>> >
>> > > That's correct...
>> > >
>> > > The "load ... AS (k:chararray, v:chararray);" doesn't actually do what you
>> > > think it does. The AS statement tells Pig what the schema types are, so it
>> > > will call the appropriate LoadCaster method to get each field into the right
>> > > type. A LoadCaster object defines how to map byte[] into the appropriate Pig
>> > > datatypes. If the LoadFunc is not schema-aware and you don't have the schema
>> > > defined when you load, everything will be loaded as a bytearray.
>> > >
>> > > The problem you have is that the custom writable isn't a Pig datatype. I
>> > > don't think you'll be able to do this without writing some custom code.
>> > > I'll take a look at the source code for the SequenceFileLoader and see if
>> > > there's a way to specify your own LoadCaster. If there is, then you'll just
>> > > have to write a custom LoadCaster and specify it in the configuration. If
>> > > not, you'll have to extend and roll your own SequenceFileLoader.
>> > >
>> > >
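
If it turns out the loader does expose a LoadCaster hook, the custom caster can be
fairly small: extend Pig's default Utf8StorageConverter and override only the
conversions you need. A rough, untested sketch, assuming the caster is handed the
Writable's raw serialized bytes (the VisitKey usage is illustrative, not the real class):

    import java.io.ByteArrayInputStream;
    import java.io.DataInputStream;
    import java.io.IOException;

    import org.apache.pig.builtin.Utf8StorageConverter;

    // Sketch of a custom LoadCaster: reuse Pig's default UTF-8 caster and only
    // change how raw bytes become a chararray for the custom key type.
    public class VisitKeyCaster extends Utf8StorageConverter {

        @Override
        public String bytesToCharArray(byte[] bytes) throws IOException {
            // Assumption: the bytes are the serialized form of VisitKey itself.
            VisitKey key = new VisitKey();
            key.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
            return key.toString();  // or build whatever string form you want to inspect
        }
    }

Whether the piggybank SequenceFileLoader ever asks for a custom caster is exactly the
open question raised above, so this only helps if such a hook exists.
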
>> > > On Mon, Sep 16, 2013 at 6:43 PM, Yang <teddyyyy...@gmail.com> wrote:
>> > >
>> > >> I think my custom type has toString(), well at least writable() says it's
>> > >> writable to bytes, so supposedly if I force it to bytes or string, pig
>> > >> should be able to cast, like
>> > >>
>> > >> load ... AS ( k:chararray, v:chararray);
>> > >>
>> > >> but this actually fails
>> > >>
>> > >>
>> > >> On Mon, Sep 16, 2013 at 6:22 PM, Pradeep Gollakota <pradeep...@gmail.com> wrote:
>> > >>
>> > >> > The problem is that pig only speaks its data types. So you need to tell
>> > >> > it how to translate from your custom writable to a pig datatype.
>> > >> >
>> > >> > Apparently elephant-bird has some support for doing this type of thing...
>> > >> > take a look at this SO post:
>> > >> >
>> > >> > http://stackoverflow.com/questions/16540651/apache-pig-can-we-convert-a-custom-writable-object-to-pig-format
>> > >> >
>> > >> >
>> > >> > On Mon, Sep 16, 2013 at 5:37 PM, Yang <teddyyyy...@gmail.com> wrote:
>> > >> >
>> > >> > > I tried to do a quick and dirty inspection of some of our data feeds,
>> > >> > > which are encoded in gzipped SequenceFile.
>> > >> > >
>> > >> > > basically I did
>> > >> > >
>> > >> > > a = load 'myfile' using ......SequenceFileLoader() AS ( mykey, myvalue);
>> > >> > >
>> > >> > > but it gave me some error:
>> > >> > >
>> > >> > > 2013-09-16 17:34:28,915 [Thread-5] INFO  org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
>> > >> > > 2013-09-16 17:34:28,915 [Thread-5] INFO  org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
>> > >> > > 2013-09-16 17:34:28,915 [Thread-5] INFO  org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
>> > >> > > 2013-09-16 17:34:28,961 [Thread-5] WARN  org.apache.pig.piggybank.storage.SequenceFileLoader - Unable to translate key class com.mycompany.model.VisitKey to a Pig datatype
>> > >> > > 2013-09-16 17:34:28,962 [Thread-5] WARN  org.apache.hadoop.mapred.FileOutputCommitter - Output path is null in cleanup
>> > >> > > 2013-09-16 17:34:28,963 [Thread-5] WARN  org.apache.hadoop.mapred.LocalJobRunner - job_local_0001
>> > >> > > org.apache.pig.backend.BackendException: ERROR 0: Unable to translate class com.mycompany.model.VisitKey to a Pig datatype
>> > >> > >   at org.apache.pig.piggybank.storage.SequenceFileLoader.setKeyType(SequenceFileLoader.java:78)
>> > >> > >   at org.apache.pig.piggybank.storage.SequenceFileLoader.getNext(SequenceFileLoader.java:133)
>> > >> > >
>> > >> > > in the pig file, I have already REGISTERED the jar that contains the
>> > >> > > class com.mycompany.model.VisitKey
>> > >> > >
>> > >> > > if Pig doesn't work, the only other approach is probably to use one of
>> > >> > > the newer "pseudo-scripting" languages like cascalog or scala.
>> > >> > >
>> > >> > > thanks
>> > >> > > Yang
>> > >> > >
>> > >> >
>> > >>
>> > >
>> > >
>> >
>>
