[ 
https://issues.apache.org/jira/browse/SQOOP-3151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boglarka Egyed updated SQOOP-3151:
----------------------------------
    Description: 
It appears that Sqoop export tries to detect the file format by reading the 
first 3 characters of a file. Based on that header, the appropriate file reader 
is used. However, if the exported data happens to begin with one of these header 
sequences, the wrong reader is chosen, resulting in a misleading error.
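The detection behavior described above can be sketched as follows. This is a hypothetical simplification for illustration only, not Sqoop's actual code; the class and method names are invented:

```java
import java.nio.charset.StandardCharsets;

public class HeaderSniffDemo {
    enum Format { PARQUET, AVRO, SEQUENCEFILE, TEXT }

    // Guess the file format from the first three bytes only, as the
    // bug report describes. Any text file whose data happens to start
    // with one of these sequences is misclassified.
    static Format sniff(byte[] firstBytes) {
        String header = new String(firstBytes, 0,
                Math.min(3, firstBytes.length), StandardCharsets.ISO_8859_1);
        if (header.equals("PAR")) return Format.PARQUET;      // Parquet magic is "PAR1"
        if (header.equals("Obj")) return Format.AVRO;         // Avro magic is "Obj" + 0x01
        if (header.equals("SEQ")) return Format.SEQUENCEFILE; // SequenceFile magic
        return Format.TEXT;
    }

    public static void main(String[] args) {
        // A plain-text export file whose first field value is "PART"
        byte[] textFile = "PART\n".getBytes(StandardCharsets.UTF_8);
        // Misclassified as Parquet, so the file is handed to the Kite SDK
        System.out.println(sniff(textFile));
    }
}
```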

For example, suppose someone exports a table in which one of the field values 
is "PART". Since Sqoop sees the letters "PAR", it invokes the Kite SDK, 
assuming the file is in Parquet format. This leads to a misleading error:

ERROR sqoop.Sqoop: Got exception running Sqoop: 
org.kitesdk.data.DatasetNotFoundException: Descriptor location does not exist: 
hdfs://<path>.metadata 
org.kitesdk.data.DatasetNotFoundException: Descriptor location does not exist: 
hdfs://<path>.metadata

This can be reproduced easily using Hive as a real-world example:

> create table test (val string);
> insert into test values ('PAR');

Then run a sqoop export against the table data:

$ sqoop export --connect $MYCONN --username $MYUSER --password $MYPWD -m 1 
--export-dir /user/hive/warehouse/test --table $MYTABLE

Sqoop will fail with the following:
ERROR sqoop.Sqoop: Got exception running Sqoop: 
org.kitesdk.data.DatasetNotFoundException: Descriptor location does not exist: 
hdfs://<path>.metadata
org.kitesdk.data.DatasetNotFoundException: Descriptor location does not exist: 
hdfs://<path>.metadata

Changing the value from 'PAR' to another header sequence, such as 'Obj' (Avro) 
or 'SEQ' (SequenceFile), results in similar errors.



> Sqoop export HDFS file type auto detection can pick wrong type
> --------------------------------------------------------------
>
>                 Key: SQOOP-3151
>                 URL: https://issues.apache.org/jira/browse/SQOOP-3151
>             Project: Sqoop
>          Issue Type: Bug
>    Affects Versions: 1.4.6
>            Reporter: Boglarka Egyed
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
