[ 
https://issues.apache.org/jira/browse/HAWQ-56?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Caleb Welton updated HAWQ-56:
-----------------------------
    Priority: Critical  (was: Minor)

> Non-deterministic header results with "HEADER" option from external table
> -------------------------------------------------------------------------
>
>                 Key: HAWQ-56
>                 URL: https://issues.apache.org/jira/browse/HAWQ-56
>             Project: Apache HAWQ
>          Issue Type: Bug
>          Components: External Tables & PXF
>            Reporter: Goden Yao
>            Priority: Critical
>
> **Repro Steps**
> External table definition
> {code:sql}
> drop external table if exists testtbl;
> create external table testtbl(i text, j text)
> location 
> ('pxf://nakaphdns/tmp/testdata/*?Fragmenter=com.pivotal.pxf.plugins.hdfs.HdfsDataFragmenter&Accessor=com.pivotal.pxf.plugins.hdfs.TextFileAccessor&Resolver=com.pivotal.pxf.plugins.hdfs.TextResolver')
> format 'TEXT' (delimiter ',' header);
> select * from testtbl;
> {code}
> example with 4 segment servers and 4 test files with headers in hdfs
> {code:sql}
> gpadmin=# select * from testtbl ;
>  i | j
> ---+---
>  3 | c
>  2 | b
>  1 | a
>  4 | d
> (4 rows)
> {code}
> With 5 test files
> {code:sql}
> gpadmin=# select * from testtbl ;
>  i  |   j
> ----+-------
>  5  | e
>  2  | b
>  ID | Value
>  3  | c
>  1  | a
>  ID | VALUE
>  4  | d
> (7 rows)
> {code}
> **Analysis**
> When using HEADER option, header line is removed only once per segment.
> As a result there will be different results depending on the number of 
> segments/fragments are scanned.
> The reason for this behavior is that header line handling is done by the 
> external protocol code (fileam.c) which checks if the header_line flag is 
> true, and if so skips the first line and sets the flag to false. This code 
> calls the custom protocol code (in our case pxf) to get the next tuples, and 
> so doesn't know if the tuples are from the same resource or not.
> From what I can see, in gpfdist protocol the problem is solved by letting the 
> custom protocol code handle this and marking the flag as false for the 
> external protocol infrastructure in fileam.c.
> **Proposed Solution**
> Add a check in pxf validator function (a function that is being called as 
> part of external table creation).
> This check will error out if HEADER option is used in a PXF table.
> Currently the validation function API only allows access to the table's URLs 
> (in the LOCATION part of the table), and not the format options. In order to 
> add the check an API change in the external protocol is required.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to