In case someone hits this thread with the same issue, please vote for this bug: https://issues.apache.org/jira/browse/PIG-1271
On Tue, Nov 6, 2012 at 4:50 PM, William Oberman <ober...@civicscience.com> wrote:

> Wow, ok. That is completely unexpected. Thanks for the heads up!
>
> In my case, because part of my data is binary (UUIDs from Cassandra), all
> possible characters can appear in the data, making PigStorage... unhelpful ;-)
>
> I just tried AvroStorage in piggybank, and that is able to store/load my
> data correctly.
>
> On Tue, Nov 6, 2012 at 4:35 PM, Cheolsoo Park <cheol...@cloudera.com> wrote:
>
>>> This is a dumb question, but PigStorage escapes the delimiter, right?
>>
>> No, it doesn't.
>>
>> On Tue, Nov 6, 2012 at 1:29 PM, William Oberman <ober...@civicscience.com> wrote:
>>
>>> This is a dumb question, but PigStorage escapes the delimiter, right? I
>>> was assuming I didn't have to select a delimiter such that it doesn't
>>> appear in the data, as it would get escaped by the export process and
>>> unescaped in the import process....
>>>
>>> On Tue, Nov 6, 2012 at 4:01 PM, Cheolsoo Park <cheol...@cloudera.com> wrote:
>>>
>>>> Hi Will,
>>>>
>>>>> data = LOAD 'hdfs://ZZZ/tmp/test' USING PigStorage() AS
>>>>> (key:chararray, columns:bag {column:tuple (name, value)});
>>>>
>>>> Can you please provide some of your data from this file
>>>> (hdfs://ZZZ/tmp/test) that can help us reproduce your problem? 1-2
>>>> rows would be sufficient.
>>>>
>>>> Thanks,
>>>> Cheolsoo
>>>>
>>>> On Tue, Nov 6, 2012 at 12:20 PM, William Oberman <ober...@civicscience.com> wrote:
>>>>
>>>>> I'm trying to play around with Amazon EMR, and I currently have
>>>>> self-hosted Cassandra as the source of data. I was going to try to
>>>>> do: Cassandra -> S3 -> EMR. I've traced my problems to PigStorage.
>>>>> At this point I can recreate my problem "locally" without involving
>>>>> S3 or Amazon.
>>>>>
>>>>> In my local test environment I have this script:
>>>>>
>>>>> data = LOAD 'cassandra://XXX/YYY' USING CassandraStorage() AS
>>>>> (key:chararray, columns:bag {column:tuple (name, value)});
>>>>>
>>>>> STORE data INTO 'hdfs://ZZZ/tmp/test' USING PigStorage();
>>>>>
>>>>> I can verify that the HDFS file looks vaguely correct (\t-separated
>>>>> fields, newline-separated lines, my data in the right spots).
>>>>>
>>>>> Then if I do:
>>>>>
>>>>> data = LOAD 'hdfs://ZZZ/tmp/test' USING PigStorage() AS
>>>>> (key:chararray, columns:bag {column:tuple (name, value)});
>>>>>
>>>>> keys = FOREACH data GENERATE key;
>>>>>
>>>>> DUMP keys;
>>>>>
>>>>> I can see that the data is wrong. In the dump I sometimes see keys,
>>>>> sometimes columns, and sometimes a mix of keys and columns lumped
>>>>> together.
>>>>>
>>>>> As far as I can tell, PigStorage is unable to parse the data it just
>>>>> persisted. I've tried Pig 0.8, 0.9, and 0.10 with the same results.
>>>>>
>>>>> In terms of my data:
>>>>>
>>>>> key = URI (ASCII)
>>>>> columns = binary UUID -> JSON (ASCII)
>>>>>
>>>>> Any ideas? Next I guess I'll see what kind of debugging is in Pig in
>>>>> the STORE/LOAD processes.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> will
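For anyone landing here later, the failure mode the thread describes can be sketched in a few lines of Python (not Pig itself, just a minimal stand-in for what a delimiter-based store/load like PigStorage does, assuming no escaping of delimiter bytes inside field data):

```python
# Sketch of why an unescaped delimiter breaks a store/load round-trip:
# fields are joined with '\t' on store and split on '\t' on load, so any
# tab byte that appears INSIDE a binary value shifts all later fields.

def store(rows):
    """Serialize rows the way a naive delimiter-based store does."""
    return "\n".join("\t".join(fields) for fields in rows)

def load(text):
    """Parse the text back by splitting on the same delimiters."""
    return [line.split("\t") for line in text.split("\n")]

# A "clean" row survives the round-trip...
clean = [["key1", "value1"]]
assert load(store(clean)) == clean

# ...but a binary value containing the delimiter byte does not: the
# single value field comes back as two fields.
binary = [["key2", "uuid\tjson"]]  # a tab byte hiding inside the data
assert load(store(binary)) == [["key2", "uuid", "json"]]
assert load(store(binary)) != binary
```

This is why switching to a format with a real schema and length-prefixed/encoded values (AvroStorage from piggybank, as above) fixes the round-trip for binary keys and columns.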