Hi Will,

>> data = LOAD 'hdfs://ZZZ/tmp/test' USING PigStorage() AS
>> (key:chararray, columns:bag {column:tuple (name, value)});

Can you please provide a sample of the data from this file
(hdfs://ZZZ/tmp/test) to help us reproduce the problem? One or two rows
would be sufficient.
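
In the meantime, one quick check (just a sketch, assuming the default tab
delimiter): load the stored file back as raw lines and count the fields in
each line. PigStorage does not escape delimiter characters, so if any of
your values happen to contain a tab or newline byte, some lines will come
back with the wrong field count.

raw = LOAD 'hdfs://ZZZ/tmp/test' USING TextLoader() AS (line:chararray);
-- STRSPLIT takes a regex; '\\t' matches a literal tab
counts = FOREACH raw GENERATE SIZE(STRSPLIT(line, '\\t')) AS nfields, line;
-- every well-formed row should have exactly 2 fields (key, columns)
bad = FILTER counts BY nfields != 2;
DUMP bad;

(A value with an embedded newline would also show up here, since TextLoader
would split that record across two short lines.)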

Thanks,
Cheolsoo

On Tue, Nov 6, 2012 at 12:20 PM, William Oberman
<ober...@civicscience.com>wrote:

> I'm trying to play around with Amazon EMR, and I currently have
> self-hosted Cassandra as the source of data.  I was going to try:
> Cassandra -> S3
> -> EMR.  I've traced my problems to PigStorage.  At this point I can
> recreate my problem "locally" without involving S3 or Amazon.
>
> In my local test environment I have this script:
>
> data = LOAD 'cassandra://XXX/YYY' USING CassandraStorage() AS
> (key:chararray, columns:bag {column:tuple (name, value)});
>
> STORE data INTO 'hdfs://ZZZ/tmp/test' USING PigStorage();
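>
> (For reference, with the default delimiter I believe each stored row
> should look roughly like the following, where the values are made up:
>
> key1\t{(name1,value1),(name2,value2)}
>
> i.e. the key and the bag separated by a tab, with the bag written as
> {(name,value),...}.)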
>
>
> I can verify that the HDFS file looks roughly correct (tab-separated
> fields, newline-separated records, my data in the right spots).
>
>
> Then if I do:
>
> data = LOAD 'hdfs://ZZZ/tmp/test' USING PigStorage() AS (key:chararray,
> columns:bag {column:tuple (name, value)});
>
> keys = FOREACH data GENERATE key;
>
> DUMP keys;
>
>
> I can see that the data is wrong.  In the dump I sometimes see keys,
> sometimes columns, and sometimes a jumble of keys and columns lumped
> together.
>
>
> As far as I can tell, PigStorage is unable to parse the data it just
> persisted.  I've tried Pig 0.8, 0.9, and 0.10 with the same results.
>
>
> In terms of my data:
>
> key = URI (ASCII)
>
> columns = binary UUID -> JSON (ASCII)
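>
> (One guess: raw UUID bytes can happen to contain 0x09 or 0x0A, and I
> don't believe PigStorage escapes delimiter characters inside values.  If
> that's the cause, maybe something like BinStorage would sidestep
> delimited text entirely.  Just a sketch; note that BinStorage output is
> only meant to be read back by Pig.)
>
> STORE data INTO 'hdfs://ZZZ/tmp/test_bin' USING BinStorage();
> -- reload through Pig's binary format instead of tab-delimited text
> data2 = LOAD 'hdfs://ZZZ/tmp/test_bin' USING BinStorage() AS
>     (key:chararray, columns:bag {column:tuple (name, value)});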
>
>
> Any ideas?  Next I guess I'll see what kind of debugging Pig offers
> around the STORE/LOAD process, starting with the sketch below.
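>
> (For a first pass, DESCRIBE and ILLUSTRATE should at least show the
> schema Pig thinks it has and how sample rows get parsed at each step.)
>
> DESCRIBE data;
> -- ILLUSTRATE runs example rows through every step of the script
> ILLUSTRATE keys;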
>
>
> Thanks!
>
>
> will
>
