[ 
https://issues.apache.org/jira/browse/HAWQ-56?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030913#comment-16030913
 ] 

Jacek Dobrowolski commented on HAWQ-56:
---------------------------------------

https://hdb.docs.pivotal.io/200/hawq/pxf/ReadWritePXF.html#built-inprofiles

HdfsTextMulti  - Read delimited single or multi-line records (with quoted 
linefeeds) from plain text files on HDFS. This profile is not splittable (non 
parallel); therefore reading is slower than reading with HdfsTextSimple.

Because HdfsTextMulti is not splittable, each file is read as a whole (all of 
its blocks) and only the first line of the file (the header line) is stripped, 
so the HEADER keyword is handled correctly.

Conclusion: you should not disable HEADER for PROFILE=HdfsTextMulti
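
To illustrate the difference, here is a minimal Python sketch of the 
once-per-segment header flag described in the quoted analysis below. It is a 
toy model under stated assumptions, not the actual fileam.c or PXF code: the 
function name read_segment and the sample rows are hypothetical, and a 
"fragment" is simply modelled as a list of lines handed to one segment.

```python
def read_segment(fragments, header=True):
    """Toy model of one segment's scan.

    The header flag is consumed once per segment, not once per fragment,
    which mirrors the behaviour described in the issue (assumption: this
    is a simplification, not the real implementation).
    """
    skip_header = header
    rows = []
    for fragment in fragments:      # each fragment is a list of text lines
        for line in fragment:
            if skip_header:
                skip_header = False  # flag is cleared after the first line
                continue
            rows.append(line)
    return rows


# One unsplit file handled by one segment: the header is stripped.
one_file = [["ID,Value", "1,a", "2,b"]]
# Two files handled by the same segment: the second header leaks through.
two_files = [["ID,Value", "1,a"], ["ID,Value", "2,b"]]

print(read_segment(one_file))
print(read_segment(two_files))
```

In this model a segment that receives a single whole file returns only data 
rows, while a segment that receives a second file also returns that file's 
header line as data, which matches the 5-file output in the issue.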

> Non-deterministic header results with "HEADER" option from external table
> -------------------------------------------------------------------------
>
>                 Key: HAWQ-56
>                 URL: https://issues.apache.org/jira/browse/HAWQ-56
>             Project: Apache HAWQ
>          Issue Type: Bug
>          Components: PXF
>            Reporter: Goden Yao
>            Assignee: Noa Horn
>            Priority: Critical
>             Fix For: 2.0.0.0-incubating
>
>
> *Repro Steps*
> External table definition
> {code:sql}
> drop external table if exists testtbl;
> create external table testtbl(i text, j text)
> location 
> ('pxf://nakaphdns/tmp/testdata/*?Fragmenter=com.pivotal.pxf.plugins.hdfs.HdfsDataFragmenter&Accessor=com.pivotal.pxf.plugins.hdfs.TextFileAccessor&Resolver=com.pivotal.pxf.plugins.hdfs.TextResolver')
> format 'TEXT' (delimiter ',' header);
> select * from testtbl;
> {code}
> Example with 4 segment servers and 4 test files with headers in HDFS
> {code:sql}
> gpadmin=# select * from testtbl ;
>  i | j
> ---+---
>  3 | c
>  2 | b
>  1 | a
>  4 | d
> (4 rows)
> {code}
> With 5 test files
> {code:sql}
> gpadmin=# select * from testtbl ;
>  i  |   j
> ----+-------
>  5  | e
>  2  | b
>  ID | Value
>  3  | c
>  1  | a
>  ID | VALUE
>  4  | d
> (7 rows)
> {code}
> *Analysis*
> When the HEADER option is used, the header line is removed only once per 
> segment. As a result, the output differs depending on how many 
> segments/fragments are scanned. If the number of files is greater than the 
> number of segments, the header row is included in the data for some files.
> If the number of files is less than or equal to the number of segments, the 
> data retrieved is good. Thus, non-deterministic.
> The reason for this behavior is that header line handling is done by the 
> external protocol code (fileam.c) which checks if the header_line flag is 
> true, and if so skips the first line and sets the flag to false. This code 
> calls the custom protocol code (in our case pxf) to get the next tuples, and 
> so doesn't know if the tuples are from the same resource or not.
> From what I can see, in gpfdist protocol the problem is solved by letting the 
> custom protocol code handle this and marking the flag as false for the 
> external protocol infrastructure in fileam.c.
> *Proposed Solution*
> Add a check in pxf validator function (a function that is being called as 
> part of external table creation).
> This check will error out if HEADER option is used in a PXF table.
> Currently the validation function API only allows access to the table's URLs 
> (in the LOCATION part of the table), and not the format options. In order to 
> add the check an API change in the external protocol is required.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
