[jira] [Updated] (HIVE-27498) Support custom delimiter in SkippingTextInputFormat

Taraka Rama Rao Lethavadla (Jira) Thu, 13 Jul 2023 01:06:52 -0700


     [ 
https://issues.apache.org/jira/browse/HIVE-27498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Taraka Rama Rao Lethavadla updated HIVE-27498:
----------------------------------------------
    Description: 
A simple select returns the expected results when the table has these properties
{noformat}
'skip.header.line.count'='1',
'textinputformat.record.delimiter'='|'{noformat}
but select count(*), or any other query that launches a Tez job, treats the 
whole text as a single line.

*Test case*

data.csv
{noformat}
Code    Name|A AAAA|B BBBB
CCCC|C  DDDD{noformat}
DDL
{noformat}
create external table test(code string,name string)
ROW FORMAT SERDE
   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
 WITH SERDEPROPERTIES (
   'field.delim'='\t')
 STORED AS INPUTFORMAT
   'org.apache.hadoop.mapred.TextInputFormat'
 OUTPUTFORMAT
   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
   location '${system:test.tmp.dir}/test'
 TBLPROPERTIES (
   'skip.header.line.count'='1',
   'textinputformat.record.delimiter'='|');{noformat}
Query result
select code,name from test;
{noformat}
A AAAA
B BBBB
CCCC
C DDDD{noformat}
*Problem:* The query _+select count(*) from test+_ returns 1 instead of 3.

This used to work in older Hive versions.

The behaviour changed with the introduction of the feature in 
https://issues.apache.org/jira/browse/HIVE-21924

That feature allows text files to be split while reading even when the table 
is configured to skip headers. This increases the number of mappers that 
process the query and thereby improves its throughput.

The actual problem lies in how the new feature reads a file. It does not 
consider the 'textinputformat.record.delimiter' property and scans the file 
for newline characters instead. Since the input file does not have a newline 
after every record, the whole file is read as a single line and the count is 
returned as 1.
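
The mismatch can be illustrated outside Hive with a minimal sketch. This is not Hive's code path: count_records is a hypothetical helper, the fields in data.csv are assumed to be tab-separated (as 'field.delim' suggests), and the file is assumed to have no trailing newline. It only shows that the same bytes yield different record counts depending on which delimiter the reader honours.

```python
# Simplified model of record counting; not Hive's actual implementation.
# Assumption: fields in data.csv are tab-separated and the file has no trailing newline.
DATA = "Code\tName|A\tAAAA|B\tBBBB\nCCCC|C\tDDDD"


def count_records(text: str, record_delimiter: str, header_lines: int) -> int:
    """Split text on record_delimiter, drop the header records, count the rest."""
    records = text.split(record_delimiter)
    return len(records[header_lines:])


# Honouring 'textinputformat.record.delimiter'='|' gives the expected count.
with_custom_delimiter = count_records(DATA, "|", header_lines=1)

# Ignoring the property and splitting on newlines loses two records.
with_newline_only = count_records(DATA, "\n", header_lines=1)

print(with_custom_delimiter, with_newline_only)
```

Under these assumptions the first count is 3 and the second is 1, which matches the expected and the reported results.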

Ref: 
[https://github.com/apache/hive/blob/24a82a65f96b65eeebe4e23b2fec425037a70216/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java#L548]

 

 *Workaround*

Removing the header from the data and the skip header config from the table 
properties, or compressing the files, avoids this issue.
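
For the first option, the header record has to be removed from the data files before 'skip.header.line.count' is dropped from the table properties. A minimal sketch of that preprocessing step, assuming the header is the first '|'-delimited record and the fields are tab-separated; strip_header is a hypothetical helper, not part of Hive:

```python
def strip_header(text: str, record_delimiter: str = "|", header_records: int = 1) -> str:
    """Return text without its first header_records records."""
    records = text.split(record_delimiter)
    return record_delimiter.join(records[header_records:])


# Assumption: fields are tab-separated, as implied by 'field.delim'.
original = "Code\tName|A\tAAAA|B\tBBBB\nCCCC|C\tDDDD"
cleaned = strip_header(original)
print(repr(cleaned))
```

The remaining records keep their original '|' delimiter, so the table definition only needs to lose the skip header property.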

 

 

  was:
The previous description differed only in the final Workaround sentence, which 
ended "then we will get into this issue" (without "not").

> Support custom delimiter in SkippingTextInputFormat
> ---------------------------------------------------
>
>                 Key: HIVE-27498
>                 URL: https://issues.apache.org/jira/browse/HIVE-27498
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>            Reporter: Taraka Rama Rao Lethavadla
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
