Hi Will,

>> data = LOAD 'hdfs://ZZZ/tmp/test' USING PigStorage() AS (key:chararray,
>> columns:bag {column:tuple (name, value)});

Can you please provide some of the data from this file
(hdfs://ZZZ/tmp/test) that would help us reproduce your problem? 1 ~ 2
rows would be sufficient.

Thanks,
Cheolsoo

On Tue, Nov 6, 2012 at 12:20 PM, William Oberman <ober...@civicscience.com> wrote:

> I'm trying to play around with Amazon EMR, and I currently have self-hosted
> Cassandra as the source of data. I was going to try: Cassandra -> S3 ->
> EMR. I've traced my problems to PigStorage. At this point I can
> recreate my problem "locally" without involving S3 or Amazon.
>
> In my local test environment I have this script:
>
> data = LOAD 'cassandra://XXX/YYY' USING CassandraStorage() AS
> (key:chararray, columns:bag {column:tuple (name, value)});
>
> STORE data INTO 'hdfs://ZZZ/tmp/test' USING PigStorage();
>
> I can verify that the HDFS file looks vaguely correct (\t-separated fields,
> newline-separated lines, my data in the right spots).
>
> Then if I do:
>
> data = LOAD 'hdfs://ZZZ/tmp/test' USING PigStorage() AS (key:chararray,
> columns:bag {column:tuple (name, value)});
>
> keys = FOREACH data GENERATE key;
>
> DUMP keys;
>
> I can see that the data is wrong. In the dump I sometimes see keys,
> sometimes columns, and sometimes a jumble of keys and columns lumped
> together.
>
> As far as I can tell, PigStorage is unable to parse the data it just
> persisted. I've tried Pig 0.8, 0.9, and 0.10 with the same results.
>
> In terms of my data:
>
> key = URI (ASCII)
> columns = binary UUID -> JSON (ASCII)
>
> Any ideas? Next I guess I'll see what kind of debugging is available in
> Pig's STORE/LOAD processes.
>
> Thanks!
>
> will
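One plausible explanation for the mangled round-trip described above: the column names are raw binary UUIDs, and PigStorage delimits fields with tabs and records with newlines, so any 0x09 or 0x0A byte inside a binary value shifts the fields on re-parse. A minimal sketch of that delimiter collision, in Python rather than Pig (the tab/newline layout is a simplified stand-in for PigStorage's output, not its actual serialization code):

```python
# Illustration of why embedding raw binary in a tab/newline-delimited
# format breaks round-tripping. Not Pig code; "store"/"load" here mimic
# only the delimiter behavior of a PigStorage-style text format.

def store(rows):
    # Join each row's fields with tabs, and rows with newlines.
    return "\n".join("\t".join(fields) for fields in rows)

def load(text):
    # Split back the same way -- embedded delimiter bytes cause trouble here.
    return [line.split("\t") for line in text.split("\n")]

# A key plus a "column name" containing a tab byte (0x09), as a raw
# binary UUID easily can, plus an ASCII JSON value.
rows = [["key1", "uuid-with-\x09-inside", '{"json": 1}']]

reloaded = load(store(rows))

# The row stored as 3 fields comes back as 4: the embedded tab split
# the binary value, shifting every later field over by one.
print(len(rows[0]))      # 3
print(len(reloaded[0]))  # 4
```

If this is the cause, the usual workarounds are to encode binary values (e.g. hex or base64) before storing, or to use a binary-safe storage function instead of plain-text PigStorage.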