[ https://issues.apache.org/jira/browse/PIG-1207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yan Zhou updated PIG-1207:
--------------------------
    Assignee: Yan Zhou

> [zebra] Data sanity check should be performed at the end of writing instead of later at query time
> ---------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1207
>                 URL: https://issues.apache.org/jira/browse/PIG-1207
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Yan Zhou
>            Assignee: Yan Zhou
>
> Currently, the check that all column groups contain the same number of rows
> is performed only at query time. The error information is sketchy: the query
> either emits only "Column groups are not evenly distributed" or, worse,
> throws an IndexOutOfBounds exception from CGScanner.getCGValue. The exception
> arises because BasicTable.atEnd and BasicTable.getKey, which are called just
> before BasicTable.getValue, check only the first column group in the
> projection. If the number of rows per file differs across the column groups
> in the projection, BasicTable.atEnd can return false and BasicTable.getKey
> can return a key normally even though another column group has already
> exhausted its current file, so the call to that group's CGScanner.getCGValue
> throws the exception.
> This check should also be performed at the end of writing, and the error
> information should be more informative.
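As a rough illustration of the write-time check the issue asks for, the sketch below compares per-column-group row counts for one output file and fails with a message that names the file and every column group that disagrees. It is not Zebra code: the class `ColumnGroupRowCheck`, the method `checkRowCounts`, the map of row counts, and the file and column group names are all hypothetical, and how the writer would collect those counts is left as an assumption.

```java
import java.util.Map;

/** Hypothetical helper for illustration only; not part of the Zebra API. */
final class ColumnGroupRowCheck {

    /**
     * Verifies that every column group wrote the same number of rows for one
     * output file. The first entry in the map is used as the reference count.
     *
     * @param fileName           name of the file being closed (assumed input)
     * @param rowsPerColumnGroup column group name mapped to rows written (assumed input)
     * @throws IllegalStateException naming each column group that disagrees
     */
    static void checkRowCounts(String fileName, Map<String, Long> rowsPerColumnGroup) {
        if (rowsPerColumnGroup.isEmpty()) {
            return; // nothing was written, so there is nothing to compare
        }
        Map.Entry<String, Long> reference =
                rowsPerColumnGroup.entrySet().iterator().next();

        // Collect every column group whose count differs from the reference.
        StringBuilder mismatches = new StringBuilder();
        for (Map.Entry<String, Long> entry : rowsPerColumnGroup.entrySet()) {
            if (!entry.getValue().equals(reference.getValue())) {
                if (mismatches.length() > 0) {
                    mismatches.append(", ");
                }
                mismatches.append(entry.getKey()).append('=').append(entry.getValue());
            }
        }

        if (mismatches.length() > 0) {
            throw new IllegalStateException(
                    "Row count mismatch in file '" + fileName + "': column group "
                    + reference.getKey() + " has " + reference.getValue()
                    + " rows, but found " + mismatches);
        }
    }
}
```

Calling `ColumnGroupRowCheck.checkRowCounts(...)` when a writer closes a file would surface the discrepancy immediately, with the file and column groups identified, rather than as an out-of-bounds exception during a later query.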