[ https://issues.apache.org/jira/browse/PIG-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dmitriy V. Ryaboy reassigned PIG-1797:
--------------------------------------

    Assignee: Dmitriy V. Ryaboy

> Problems when applying FOREACH ... GENERATE on data loaded from HBase
> ---------------------------------------------------------------------
>
>                 Key: PIG-1797
>                 URL: https://issues.apache.org/jira/browse/PIG-1797
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.8.0
>         Environment: Our environment consists of Hadoop 0.20.2, HBase 0.20.6, ZooKeeper 3.3.2 and Pig 0.8.0. They are configured to run as a pseudo-distributed system.
>            Reporter: Eduardo Galán Herrero
>            Assignee: Dmitriy V. Ryaboy
>
> We defined a table in HBase and populated it with some data:
> create 'tests', {NAME => 'age'}, {NAME => 'colour'}
> put 'tests', 'one', 'age', '22'
> put 'tests', 'one', 'colour', 'green'
> put 'tests', 'another', 'age', '439'
> put 'tests', 'another', 'colour', 'red'
> put 'tests', 'more', 'colour', 'grey'
> scan 'tests'
> ROW       COLUMN+CELL
>  another  column=age:, timestamp=1294745175613, value=439
>  another  column=colour:, timestamp=1294745155873, value=red
>  more     column=colour:, timestamp=1294745185331, value=grey
>  one      column=age:, timestamp=1294745127129, value=22
>  one      column=colour:, timestamp=1294745144160, value=green
>
> We are using Pig in mapreduce mode to load the data from HBase (also retrieving the row key):
> > DATA = LOAD 'hbase://tests' USING
> >     org.apache.pig.backend.hadoop.hbase.HBaseStorage('age: colour:',
> >     '-loadKey') AS (row:chararray,age:int,colour:chararray);
>
> We verified that the data was loaded correctly:
> > dump DATA;
> (another,439,red)
> (more,,grey)
> (one,22,green)
> > describe DATA;
> DATA: {row: chararray,age: int,colour: chararray}
>
> We get correct results if the "FOREACH .. GENERATE" statement uses all the columns ($0, $1 and $2) that were loaded:
> > b= FOREACH DATA GENERATE $0, $1, $2;
> > dump b;
> (another,439,red)
> (more,,grey)
> (one,22,green)
>
> no matter the order...
> c= FOREACH DATA GENERATE $2, $0, $1;
> dump c;
> (red,another,439)
> (grey,more,)
> (green,one,22)
>
> but if we omit a column from the "FOREACH .. GENERATE" statement (in this example, the $2 column), we hit the following bug:
> > d= FOREACH DATA GENERATE $0, $1;
> > dump d;
> (another,)
> (more,)
> (one,)
> > describe d;
> d: {row: chararray,age: int}
>
> Here is another example of the bug:
> > e= FOREACH DATA GENERATE $1, $2;
> > dump e;
> (,439)
> (,)
> (,22)
> > describe e;
> e: {age: int,colour: chararray}
>
> Here is one more example of the bug:
> > f= FOREACH DATA GENERATE $0, $2;
> > dump f;
> (another,another)
> (more,more)
> (one,one)
> > describe f;
> f: {row: chararray,colour: chararray}
>
> Regards,
> Eduardo Galan Herrero

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
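
For comparison with the outputs reported above, here is a minimal Python sketch of what a positional projection would be expected to return for d, e and f, given the rows shown by `dump DATA`. It only models the intended semantics of FOREACH .. GENERATE on an in-memory list; it does not use Pig or HBase and says nothing about the cause of the bug. The names `project` and `format_tuple` are hypothetical helpers introduced for illustration.

```python
# Rows as printed by `dump DATA` in the report: (row, age, colour).
# None stands in for the missing 'age' cell of row 'more'.
DATA = [
    ("another", 439, "red"),
    ("more", None, "grey"),
    ("one", 22, "green"),
]


def project(rows, positions):
    """Model FOREACH .. GENERATE $i, $j, ...: pick fields by position."""
    return [tuple(row[i] for i in positions) for row in rows]


def format_tuple(fields):
    """Render a tuple the way dump prints it, with nulls as empty fields."""
    return "(" + ",".join("" if v is None else str(v) for v in fields) + ")"


expected_d = project(DATA, [0, 1])  # d = FOREACH DATA GENERATE $0, $1;
expected_e = project(DATA, [1, 2])  # e = FOREACH DATA GENERATE $1, $2;
expected_f = project(DATA, [0, 2])  # f = FOREACH DATA GENERATE $0, $2;

for name, rows in (("d", expected_d), ("e", expected_e), ("f", expected_f)):
    print(name, " ".join(format_tuple(t) for t in rows))
```

Under this model, d would be expected to contain (another,439), (more,) and (one,22); e would contain (439,red), (,grey) and (22,green); and f would contain (another,red), (more,grey) and (one,green). The dumps in the report differ from these in every case.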