[ https://issues.apache.org/jira/browse/HAWQ-56?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15018366#comment-15018366 ]
ASF GitHub Bot commented on HAWQ-56: ------------------------------------ Github user shivzone commented on the pull request: https://github.com/apache/incubator-hawq/pull/121#issuecomment-158464458 +1 > Non-deterministic header results with "HEADER" option from external table > ------------------------------------------------------------------------- > > Key: HAWQ-56 > URL: https://issues.apache.org/jira/browse/HAWQ-56 > Project: Apache HAWQ > Issue Type: Bug > Components: PXF > Reporter: Goden Yao > Assignee: Noa Horn > Priority: Critical > > *Repro Steps* > External table definition > {code:sql} > drop external table if exists testtbl; > create external table testtbl(i text, j text) > location > ('pxf://nakaphdns/tmp/testdata/*?Fragmenter=com.pivotal.pxf.plugins.hdfs.HdfsDataFragmenter&Accessor=com.pivotal.pxf.plugins.hdfs.TextFileAccessor&Resolver=com.pivotal.pxf.plugins.hdfs.TextResolver') > format 'TEXT' (delimiter ',' header); > select * from testtbl; > {code} > example with 4 segment servers and 4 test files with headers in hdfs > {code:sql} > gpadmin=# select * from testtbl ; > i | j > ---+--- > 3 | c > 2 | b > 1 | a > 4 | d > (4 rows) > {code} > With 5 test files > {code:sql} > gpadmin=# select * from testtbl ; > i | j > ----+------- > 5 | e > 2 | b > ID | Value > 3 | c > 1 | a > ID | VALUE > 4 | d > (7 rows) > {code} > *Analysis* > When using HEADER option, header line is removed only once per segment. > As a result there will be different results depending on the number of > segments/fragments are scanned. If the number of files is greater than the > number of segments, the header row is included in the data for some files. > If the number of files is less than or equal to the number of segments, the > data retrieved is good. Thus, non-deterministic. > The reason for this behavior is that header line handling is done by the > external protocol code (fileam.c) which checks if the header_line flag is > true, and if so skips the first line and sets the flag to false. This code > calls the custom protocol code (in our case pxf) to get the next tuples, and > so doesn't know if the tuples are from the same resource or not. > From what I can see, in gpfdist protocol the problem is solved by letting the > custom protocol code handle this and marking the flag as false for the > external protocol infrastructure in fileam.c. > *Proposed Solution* > Add a check in pxf validator function (a function that is being called as > part of external table creation). > This check will error out if HEADER option is used in a PXF table. > Currently the validation function API only allows access to the table's URLs > (in the LOCATION part of the table), and not the format options. In order to > add the check an API change in the external protocol is required. -- This message was sent by Atlassian JIRA (v6.3.4#6332)