[ https://issues.apache.org/jira/browse/SQOOP-3151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15905132#comment-15905132 ]
Boglarka Egyed edited comment on SQOOP-3151 at 3/10/17 2:14 PM:
----------------------------------------------------------------

Related code part:
{code}
ExportJobBase.java:
190   private static FileType fromMagicNumber(Path file, Configuration conf) {
191     // Test target's header to see if it contains magic numbers indicating its
192     // file type
193     byte [] header = new byte[3];
194     FSDataInputStream is = null;
195     try {
196       FileSystem fs = file.getFileSystem(conf);
197       is = fs.open(file);
198       is.readFully(header);
199     } catch (IOException ioe) {
200       // Error reading header or EOF; assume unknown
201       LOG.warn("IOException checking input file header: " + ioe);
202       return FileType.UNKNOWN;
203     } finally {
204       try {
205         if (null != is) {
206           is.close();
207         }
208       } catch (IOException ioe) {
209         // ignore; closing.
210         LOG.warn("IOException closing input stream: " + ioe + "; ignoring.");
211       }
212     }
213
214     if (header[0] == 'S' && header[1] == 'E' && header[2] == 'Q') {
215       return FileType.SEQUENCE_FILE;
216     }
217     if (header[0] == 'O' && header[1] == 'b' && header[2] == 'j') {
218       return FileType.AVRO_DATA_FILE;
219     }
220     if (header[0] == 'P' && header[1] == 'A' && header[2] == 'R') {
221       return FileType.PARQUET_FILE;
222     }
223     return FileType.UNKNOWN;
224   }
{code}
https://git-wip-us.apache.org/repos/asf?p=sqoop.git;a=blob;f=src/java/org/apache/sqoop/mapreduce/ExportJobBase.java;hb=98c5ccb80f8039dd5e1f9451c43443bb01dfd973#l190

It should be investigated whether
* the code could be changed to avoid these cases, or
* a new command line option could be introduced to force Sqoop to use a specific file format during export (similar options currently exist only for import)

> Sqoop export HDFS file type auto detection can pick wrong type
> --------------------------------------------------------------
>
>                 Key: SQOOP-3151
>                 URL: https://issues.apache.org/jira/browse/SQOOP-3151
>             Project: Sqoop
>          Issue Type: Bug
>    Affects Versions: 1.4.6
>            Reporter: Boglarka Egyed
>
> It appears that Sqoop export tries to detect the file format by reading the
> first 3 characters of a file. Based on that header, the appropriate file
> reader is used. However, if the result set happens to contain the header
> sequence, the wrong reader is chosen, resulting in a misleading error.
> For example, suppose someone is exporting a table in which one of the field
> values is "PART".
> Since Sqoop sees the letters "PAR", it invokes the Kite SDK, as it assumes
> the file is in Parquet format. This leads to a misleading error:
> ERROR sqoop.Sqoop: Got exception running Sqoop:
> org.kitesdk.data.DatasetNotFoundException: Descriptor location does not
> exist: hdfs://<path>.metadata
> org.kitesdk.data.DatasetNotFoundException: Descriptor location does not
> exist: hdfs://<path>.metadata
> This can be reproduced easily, using Hive as a real world example:
> > create table test (val string);
> > insert into test values ('PAR');
> Then run a Sqoop export against the table data:
> $ sqoop export --connect $MYCONN --username $MYUSER --password $MYPWD -m 1
> --export-dir /user/hive/warehouse/test --table $MYTABLE
> Sqoop will fail with the following:
> ERROR sqoop.Sqoop: Got exception running Sqoop:
> org.kitesdk.data.DatasetNotFoundException: Descriptor location does not
> exist: hdfs://<path>.metadata
> org.kitesdk.data.DatasetNotFoundException: Descriptor location does not
> exist: hdfs://<path>.metadata
> Changing the value from "PAR" to another magic sequence, such as 'Obj' (Avro)
> or 'SEQ' (SequenceFile), will result in similar errors.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
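The misdetection described above can be sketched with a minimal, hypothetical re-implementation of the header check. This is not Sqoop's actual code path: it reads from an in-memory stream instead of HDFS, and the class and method names are illustrative only. It shows how a plain-text export record whose first field starts with "PAR" gets classified as a Parquet file:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class MagicNumberDemo {

    enum FileType { SEQUENCE_FILE, AVRO_DATA_FILE, PARQUET_FILE, UNKNOWN }

    // Simplified stand-in for ExportJobBase.fromMagicNumber: classify a
    // stream by its first three bytes only. Like the real code, any read
    // failure falls back to UNKNOWN.
    static FileType fromMagicNumber(InputStream in) {
        byte[] header = new byte[3];
        try {
            if (in.read(header) < 3) {
                return FileType.UNKNOWN;
            }
        } catch (IOException ioe) {
            // Error reading header or EOF; assume unknown
            return FileType.UNKNOWN;
        }
        if (header[0] == 'S' && header[1] == 'E' && header[2] == 'Q') {
            return FileType.SEQUENCE_FILE;
        }
        if (header[0] == 'O' && header[1] == 'b' && header[2] == 'j') {
            return FileType.AVRO_DATA_FILE;
        }
        if (header[0] == 'P' && header[1] == 'A' && header[2] == 'R') {
            return FileType.PARQUET_FILE;
        }
        return FileType.UNKNOWN;
    }

    public static void main(String[] args) {
        // A delimited text record whose first field happens to be "PART":
        byte[] textRecord = "PART,42\n".getBytes(StandardCharsets.UTF_8);
        FileType detected = fromMagicNumber(new ByteArrayInputStream(textRecord));
        // The text file is misclassified as Parquet, so Sqoop would hand it
        // to the Kite SDK and fail with DatasetNotFoundException.
        System.out.println(detected); // prints PARQUET_FILE
    }
}
```

Because only three bytes are inspected and there is no negative check (for example, verifying the full 4-byte Parquet magic "PAR1" or confirming the file is not plain text), any delimited text file starting with one of these sequences trips the detection.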