[ https://issues.apache.org/jira/browse/HAWQ-56?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Caleb Welton updated HAWQ-56:
-----------------------------
    Priority: Critical  (was: Minor)

> Non-deterministic header results with "HEADER" option from external table
> -------------------------------------------------------------------------
>
>                 Key: HAWQ-56
>                 URL: https://issues.apache.org/jira/browse/HAWQ-56
>             Project: Apache HAWQ
>          Issue Type: Bug
>          Components: External Tables & PXF
>            Reporter: Goden Yao
>            Priority: Critical
>
> **Repro Steps**
> External table definition
> {code:sql}
> drop external table if exists testtbl;
> create external table testtbl(i text, j text)
> location
> ('pxf://nakaphdns/tmp/testdata/*?Fragmenter=com.pivotal.pxf.plugins.hdfs.HdfsDataFragmenter&Accessor=com.pivotal.pxf.plugins.hdfs.TextFileAccessor&Resolver=com.pivotal.pxf.plugins.hdfs.TextResolver')
> format 'TEXT' (delimiter ',' header);
> select * from testtbl;
> {code}
> Example with 4 segment servers and 4 test files with headers in HDFS
> {code:sql}
> gpadmin=# select * from testtbl ;
>  i | j
> ---+---
>  3 | c
>  2 | b
>  1 | a
>  4 | d
> (4 rows)
> {code}
> With 5 test files
> {code:sql}
> gpadmin=# select * from testtbl ;
>  i  |   j
> ----+-------
>  5  | e
>  2  | b
>  ID | Value
>  3  | c
>  1  | a
>  ID | VALUE
>  4  | d
> (7 rows)
> {code}
> **Analysis**
> When the HEADER option is used, the header line is removed only once per
> segment. As a result, the output differs depending on how many segments and
> fragments are scanned.
> The reason for this behavior is that header line handling is done by the
> external protocol code (fileam.c), which checks whether the header_line flag
> is true and, if so, skips the first line and sets the flag to false. This
> code calls the custom protocol code (in our case pxf) to get the next tuples,
> and so doesn't know whether the tuples come from the same resource or not.
> From what I can see, the gpfdist protocol solves this problem by letting the
> custom protocol code handle the header itself and marking the flag as false
> for the external protocol infrastructure in fileam.c.
> **Proposed Solution**
> Add a check to the pxf validator function (a function that is called as
> part of external table creation).
> This check will error out if the HEADER option is used in a PXF table.
> Currently the validation function API only allows access to the table's URLs
> (in the LOCATION part of the table), and not to the format options. Adding
> the check therefore requires an API change in the external protocol.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
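
The behavior described under Analysis, and the check described under Proposed Solution, can be sketched as a small Python model. This is only an illustration under stated assumptions, not the fileam.c or PXF code: the function names, the "ID,Value" header text, and above all the file-to-segment assignment are hypothetical, since the report does not say how fragments were distributed. The assignment below was chosen so that two segments each read two files, which happens to give the 7 rows seen in the 5-file run.

```python
from typing import Dict, List

HEADER = "ID,Value"  # assumed header text; the report shows "ID | Value"


def make_file(n: int) -> List[str]:
    """Build one test file: a header line followed by a single data row."""
    return [HEADER, f"{n},{chr(ord('a') + n - 1)}"]


def scan_segment(files: List[List[str]]) -> List[str]:
    """Model a per-segment scan in which the header flag is cleared after
    the first skipped line and never reset for later files."""
    header_line = True
    rows: List[str] = []
    for lines in files:
        for line in lines:
            if header_line:
                header_line = False  # only the first line seen is skipped
                continue
            rows.append(line)
    return rows


def scan_table(assignment: Dict[int, List[int]]) -> List[str]:
    """Scan every segment; assignment maps segment id -> file numbers
    (a hypothetical distribution, not PXF's real fragment allocation)."""
    rows: List[str] = []
    for file_ids in assignment.values():
        rows.extend(scan_segment([make_file(i) for i in file_ids]))
    return rows


def validate_pxf_format_options(options: Dict[str, str]) -> None:
    """Hypothetical validator check: reject HEADER on a PXF table.
    It presumes the validator can see format options, which the report
    says the current API does not allow."""
    if "header" in {key.lower() for key in options}:
        raise ValueError("HEADER option is not supported for PXF tables")


# One file per segment: every header is the first line seen and is skipped.
four_files = scan_table({0: [1], 1: [2], 2: [3], 3: [4]})
# Two segments read two files each: the second file's header leaks through.
five_files = scan_table({0: [1, 2], 1: [3, 4], 2: [5], 3: []})

print(len(four_files), len(five_files), five_files.count(HEADER))
```

In this model a header leaks whenever a segment reads more than one file, so the row count depends on the distribution and not only on the number of files. Rejecting HEADER at table creation time, as in the last function, avoids the ambiguity altogether, at the cost of the API change the report mentions.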