[ https://issues.apache.org/jira/browse/SQOOP-3151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15905132#comment-15905132 ]
Boglarka Egyed edited comment on SQOOP-3151 at 3/10/17 2:14 PM:
----------------------------------------------------------------

Related code part:
{code}
ExportJobBase.java:
190   private static FileType fromMagicNumber(Path file, Configuration conf) {
191     // Test target's header to see if it contains magic numbers indicating its
192     // file type
193     byte [] header = new byte[3];
194     FSDataInputStream is = null;
195     try {
196       FileSystem fs = file.getFileSystem(conf);
197       is = fs.open(file);
198       is.readFully(header);
199     } catch (IOException ioe) {
200       // Error reading header or EOF; assume unknown
201       LOG.warn("IOException checking input file header: " + ioe);
202       return FileType.UNKNOWN;
203     } finally {
204       try {
205         if (null != is) {
206           is.close();
207         }
208       } catch (IOException ioe) {
209         // ignore; closing.
210         LOG.warn("IOException closing input stream: " + ioe + "; ignoring.");
211       }
212     }
213
214     if (header[0] == 'S' && header[1] == 'E' && header[2] == 'Q') {
215       return FileType.SEQUENCE_FILE;
216     }
217     if (header[0] == 'O' && header[1] == 'b' && header[2] == 'j') {
218       return FileType.AVRO_DATA_FILE;
219     }
220     if (header[0] == 'P' && header[1] == 'A' && header[2] == 'R') {
221       return FileType.PARQUET_FILE;
222     }
223     return FileType.UNKNOWN;
224   }
{code}
https://git-wip-us.apache.org/repos/asf?p=sqoop.git;a=blob;f=src/java/org/apache/sqoop/mapreduce/ExportJobBase.java;hb=98c5ccb80f8039dd5e1f9451c43443bb01dfd973#l190

It should be investigated whether
* the code could be changed to avoid these cases, or
* a new command line option could be introduced to force Sqoop to use a specific file format during export (similar options currently exist only for import)

> Sqoop export HDFS file type auto detection can pick wrong type
> --------------------------------------------------------------
>
>                 Key: SQOOP-3151
>                 URL: https://issues.apache.org/jira/browse/SQOOP-3151
>             Project: Sqoop
>          Issue Type: Bug
>    Affects Versions: 1.4.6
>            Reporter: Boglarka Egyed
>
> It appears that Sqoop export tries to detect the file format by reading the
> first 3 characters of a file. Based on that header, the appropriate file
> reader is used. However, if the result set happens to contain the header
> sequence, the wrong reader is chosen, resulting in a misleading error.
> For example, suppose someone is exporting a table in which one of the field
> values is "PART".
> Since Sqoop sees the letters "PAR", it invokes the Kite SDK, as it assumes
> the file is in Parquet format. This leads to a misleading error:
> ERROR sqoop.Sqoop: Got exception running Sqoop:
> org.kitesdk.data.DatasetNotFoundException: Descriptor location does not
> exist: hdfs://<path>.metadata
> org.kitesdk.data.DatasetNotFoundException: Descriptor location does not
> exist: hdfs://<path>.metadata
> This can be reproduced easily, using Hive as a real world example:
> > create table test (val string);
> > insert into test values ('PAR');
> Then run a Sqoop export against the table data:
> $ sqoop export --connect $MYCONN --username $MYUSER --password $MYPWD -m 1
> --export-dir /user/hive/warehouse/test --table $MYTABLE
> Sqoop will fail with the following:
> ERROR sqoop.Sqoop: Got exception running Sqoop:
> org.kitesdk.data.DatasetNotFoundException: Descriptor location does not
> exist: hdfs://<path>.metadata
> org.kitesdk.data.DatasetNotFoundException: Descriptor location does not
> exist: hdfs://<path>.metadata
> Changing the value from "PAR" to another magic sequence, such as 'Obj' (Avro)
> or 'SEQ' (SequenceFile), will result in similar errors.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
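The misdetection described above can be sketched with a minimal, hypothetical re-implementation of the header check. This is not Sqoop's actual code path: it reads from an in-memory stream instead of HDFS, and the class and method names are illustrative only. It shows how a plain-text export record whose first field starts with "PAR" gets classified as a Parquet file:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class MagicNumberDemo {

    enum FileType { SEQUENCE_FILE, AVRO_DATA_FILE, PARQUET_FILE, UNKNOWN }

    // Simplified stand-in for ExportJobBase.fromMagicNumber: classify a
    // stream by its first three bytes only. Like the real code, any read
    // failure falls back to UNKNOWN.
    static FileType fromMagicNumber(InputStream in) {
        byte[] header = new byte[3];
        try {
            if (in.read(header) < 3) {
                return FileType.UNKNOWN;
            }
        } catch (IOException ioe) {
            // Error reading header or EOF; assume unknown
            return FileType.UNKNOWN;
        }
        if (header[0] == 'S' && header[1] == 'E' && header[2] == 'Q') {
            return FileType.SEQUENCE_FILE;
        }
        if (header[0] == 'O' && header[1] == 'b' && header[2] == 'j') {
            return FileType.AVRO_DATA_FILE;
        }
        if (header[0] == 'P' && header[1] == 'A' && header[2] == 'R') {
            return FileType.PARQUET_FILE;
        }
        return FileType.UNKNOWN;
    }

    public static void main(String[] args) {
        // A delimited text record whose first field happens to be "PART":
        byte[] textRecord = "PART,42\n".getBytes(StandardCharsets.UTF_8);
        FileType detected = fromMagicNumber(new ByteArrayInputStream(textRecord));
        // The text file is misclassified as Parquet, so Sqoop would hand it
        // to the Kite SDK and fail with DatasetNotFoundException.
        System.out.println(detected); // prints PARQUET_FILE
    }
}
```

Because only three bytes are inspected and there is no negative check (for example, verifying the full 4-byte Parquet magic "PAR1" or confirming the file is not plain text), any delimited text file starting with one of these sequences trips the detection.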