This is a dumb question, but PigStorage escapes the delimiter, right? I was assuming I didn't have to pick a delimiter that never appears in the data, since it would get escaped by the export process and unescaped by the import process...
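To make the question concrete, here is a minimal sketch of what happens if the writer and reader do not escape anything. This is not PigStorage code: `store` and `load` are hypothetical stand-ins written in plain Python, and the sample rows are made up. It only models a writer that joins fields with a tab and a reader that splits on tabs and newlines. Whether PigStorage actually behaves this way is the assumption being tested, so it is worth checking against the documentation for the Pig version in use.

```python
# Hypothetical model of a delimiter-separated writer/reader with NO escaping.
# Not PigStorage itself; function names and sample data are illustrative only.

def store(rows, delim="\t"):
    """Join each row's fields with delim, one row per line, without escaping."""
    return "\n".join(delim.join(fields) for fields in rows)

def load(text, delim="\t"):
    """Split on newlines, then on delim, without any unescaping."""
    return [line.split(delim) for line in text.split("\n")]

# Made-up rows: a URI key plus a value, loosely mirroring key -> JSON data.
clean = [["http://example.com/a", '{"note": "no tab"}']]
with_tab = [["http://example.com/b", '{"note": "has\ta tab"}']]
with_newline = [["http://example.com/c", "raw\nbytes"]]

clean_back = load(store(clean))
tab_back = load(store(with_tab))
newline_back = load(store(with_newline))

print(clean_back == clean)   # True: nothing to collide with the delimiter
print(len(tab_back[0]))      # 3: the embedded tab split one field into two
print(len(newline_back))     # 2: the embedded newline split one row into two
```

If the real STORE/LOAD path works like this model, binary values that happen to contain tab or newline bytes would produce the shifted keys and columns described in the DUMP output quoted below.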
On Tue, Nov 6, 2012 at 4:01 PM, Cheolsoo Park <cheol...@cloudera.com> wrote:

> Hi Will,
>
> >> data = LOAD 'hdfs://ZZZ/tmp/test' USING PigStorage() AS (key:chararray,columns:bag {column:tuple (name, value)});
>
> Can you please provide some of your data from this file
> (hdfs://ZZZ/tmp/test) that can help us to reproduce your problem? 1 ~ 2
> rows would be sufficient.
>
> Thanks,
> Cheolsoo
>
> On Tue, Nov 6, 2012 at 12:20 PM, William Oberman
> <ober...@civicscience.com> wrote:
>
> > I'm trying to play around with Amazon EMR, and I currently have self
> > hosted Cassandra as the source of data. I was going to try to do:
> > Cassandra -> S3 -> EMR. I've traced my problems to PigStorage. At this
> > point I can recreate my problem "locally" without involving S3 or Amazon.
> >
> > In my local test environment I have this script:
> >
> > data = LOAD 'cassandra://XXX/YYY' USING CassandraStorage() AS
> > (key:chararray, columns:bag {column:tuple (name, value)});
> >
> > STORE data INTO 'hdfs://ZZZ/tmp/test' USING PigStorage();
> >
> > I can verify that HDFS file looks vaguely correct (\t separated fields,
> > return separated lines, my data is in the right spots).
> >
> > Then if I do:
> >
> > data = LOAD 'hdfs://ZZZ/tmp/test' USING PigStorage() AS (key:chararray,
> > columns:bag {column:tuple (name, value)});
> >
> > keys = FOREACH data GENERATE key;
> >
> > DUMP keys;
> >
> > I can see that data is wrong. In the dump sometimes I see keys, sometimes
> > I see columns, and sometimes I see a mismatch of keys/columns lumped
> > together.
> >
> > As far as I can tell PigStorage is unable to parse the data it just
> > persisted. I've tried pig 0.8, 0.9 and 0.10 with the same results.
> >
> > In terms of my data:
> >
> > key = URI (ASCII)
> >
> > columns = binary UUID -> JSON (ASCII)
> >
> > Any ideas? Next I guess I'll see what kind of debugging is in pig in the
> > STORE/LOAD processes.
> >
> > Thanks!
> >
> > will