In case someone lands on this thread with the same issue, please vote
for this bug:
https://issues.apache.org/jira/browse/PIG-1271
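
For anyone skimming the quoted thread below, here is a minimal sketch of why
the store/load round trip breaks when the delimiter is not escaped. It is a
toy model in Python of a writer and reader that use raw tab and newline
delimiters. It is not PigStorage's actual code, and the function names and
sample rows are made up for illustration.

```python
# Toy model of a delimited writer/reader that escapes nothing, which is
# the behaviour described in this thread. NOT PigStorage's implementation.

def store(rows, field_delim="\t", record_delim="\n"):
    """Join fields and records with raw delimiters, with no escaping."""
    return record_delim.join(field_delim.join(fields) for fields in rows)

def load(text, field_delim="\t", record_delim="\n"):
    """Split on the same raw delimiters, with no unescaping."""
    return [line.split(field_delim) for line in text.split(record_delim)]

# Hypothetical rows: a URI key plus one value per row.
clean = [["http://example.com/a", "value-1"],
         ["http://example.com/b", "value-2"]]

# Same shape, but the first value stands in for binary data that happens
# to contain a tab and a newline.
dirty = [["http://example.com/a", "bin\tary\nblob"],
         ["http://example.com/b", "value-2"]]

clean_roundtrip = load(store(clean))
dirty_roundtrip = load(store(dirty))

print(clean_roundtrip == clean)  # data without delimiters survives
print(dirty_roundtrip == dirty)  # data with embedded delimiters does not
print(dirty_roundtrip)           # fields and records are split in the wrong places
```

In this model the embedded tab adds a field and the embedded newline adds a
record, so the first field of a loaded row is sometimes a real key and
sometimes a fragment of a value. That lines up with the symptom in the
original report. A format that does not depend on in-band delimiters (the
thread mentions AvroStorage from piggybank) avoids the problem.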


On Tue, Nov 6, 2012 at 4:50 PM, William Oberman <ober...@civicscience.com> wrote:

> Wow, ok.  That is completely unexpected.  Thanks for the heads up!
>
> In my case, because part of my data is binary (UUIDs from Cassandra) all
> possible characters can appear in the data, making PigStorage.... unhelpful
> ;-)
>
> I just tried AvroStorage in piggybank and that is able to store/load my
> data correctly.
>
>
> On Tue, Nov 6, 2012 at 4:35 PM, Cheolsoo Park <cheol...@cloudera.com> wrote:
>
>> >> This is a dumb question, but PigStorage escapes the delimiter, right?
>>
>> No it doesn't.
>>
>> On Tue, Nov 6, 2012 at 1:29 PM, William Oberman <ober...@civicscience.com>
>> wrote:
>>
>> > This is a dumb question, but PigStorage escapes the delimiter, right?  I
>> > was assuming I didn't have to select a delimiter such that it doesn't
>> > appear in the data as it would get escaped by the export process, and
>> > unescaped in the import process....
>> >
>> >
>> > On Tue, Nov 6, 2012 at 4:01 PM, Cheolsoo Park <cheol...@cloudera.com>
>> > wrote:
>> >
>> > > Hi Will,
>> > >
>> > > >> data = LOAD 'hdfs://ZZZ/tmp/test' USING PigStorage() AS
>> > > >> (key:chararray, columns:bag {column:tuple (name, value)});
>> > >
>> > > Can you please provide some of your data from this file
>> > > (hdfs://ZZZ/tmp/test) that can help us to reproduce your problem? 1 ~ 2
>> > > rows would be sufficient.
>> > >
>> > > Thanks,
>> > > Cheolsoo
>> > >
>> > > On Tue, Nov 6, 2012 at 12:20 PM, William Oberman
>> > > <ober...@civicscience.com> wrote:
>> > >
>> > > > I'm trying to play around with Amazon EMR, and I currently have self
>> > > > hosted Cassandra as the source of data.  I was going to try to do:
>> > > > Cassandra -> S3 -> EMR.  I've traced my problems to PigStorage.  At
>> > > > this point I can recreate my problem "locally" without involving S3
>> > > > or Amazon.
>> > > >
>> > > > In my local test environment I have this script:
>> > > >
>> > > > data = LOAD 'cassandra://XXX/YYY' USING CassandraStorage() AS
>> > > > (key:chararray, columns:bag {column:tuple (name, value)});
>> > > >
>> > > > STORE data INTO 'hdfs://ZZZ/tmp/test' USING PigStorage();
>> > > >
>> > > >
>> > > > I can verify that HDFS file looks vaguely correct (\t separated
>> > > > fields, return separated lines, my data is in the right spots).
>> > > >
>> > > >
>> > > > Then if I do:
>> > > >
>> > > > data = LOAD 'hdfs://ZZZ/tmp/test' USING PigStorage() AS
>> > > > (key:chararray, columns:bag {column:tuple (name, value)});
>> > > >
>> > > > keys = FOREACH data GENERATE key;
>> > > >
>> > > > DUMP keys;
>> > > >
>> > > >
>> > > > I can see that data is wrong.  In the dump sometimes I see keys,
>> > > > sometimes I see columns, and sometimes I see a mismatch of
>> > > > keys/columns lumped together.
>> > > >
>> > > >
>> > > > As far as I can tell PigStorage is unable to parse the data it just
>> > > > persisted.  I've tried pig 0.8, 0.9 and 0.10 with the same results.
>> > > >
>> > > >
>> > > > In terms of my data:
>> > > >
>> > > > key = URI (ASCII)
>> > > >
>> > > > columns = binary UUID -> JSON (ASCII)
>> > > >
>> > > >
>> > > > Any ideas?  Next I guess I'll see what kind of debugging is in pig
>> > > > in the STORE/LOAD processes.
>> > > >
>> > > >
>> > > > Thanks!
>> > > >
>> > > >
>> > > > will
>> > > >
>> > >
>> >
>>
>
>
>
