[
https://issues.apache.org/jira/browse/HIVE-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089612#comment-13089612
]
Raghu Angadi commented on HIVE-2395:
------------------------------------
> .lzo files require that an LzoIndexer is run on them.
This is not a requirement. You need the index file only if you want to split
large lzo files. You could just remove the index files as a quick workaround (in
which case you might as well just use TextInputFormat).
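To make the role of the index concrete, here is a minimal sketch of how a list
of block offsets lets one large .lzo file be cut into several splits, and how
everything collapses to a single whole-file split when no index is present. The
names (LzoSplitSketch, computeSplits) and the offsets are hypothetical; this
illustrates the idea only and is not the hadoop-lzo implementation.

```java
import java.util.ArrayList;
import java.util.List;

class LzoSplitSketch {

    /**
     * Returns [start, end) byte ranges for one compressed file.
     * blockOffsets holds the start offset of each compressed block in
     * ascending order; null or empty means "no index available".
     */
    static List<long[]> computeSplits(long fileLength, long[] blockOffsets, long targetSplitSize) {
        List<long[]> splits = new ArrayList<>();
        long start = 0L;
        if (blockOffsets != null) {
            for (long offset : blockOffsets) {
                // A split may only end where a compressed block begins.
                if (offset - start >= targetSplitSize) {
                    splits.add(new long[] {start, offset});
                    start = offset;
                }
            }
        }
        // Remainder of the file, or the whole file when there is no index.
        if (start < fileLength || splits.isEmpty()) {
            splits.add(new long[] {start, fileLength});
        }
        return splits;
    }

    public static void main(String[] args) {
        long[] offsets = {100L, 300L, 500L, 700L, 900L}; // made-up block starts
        System.out.println("with index:    " + computeSplits(1000L, offsets, 400L).size());
        System.out.println("without index: " + computeSplits(1000L, null, 400L).size());
    }
}
```

Without offsets there is no safe place to cut the compressed stream, so the file
is read by one task from start to end, which is fine for small files.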
> Misleading "No LZO codec found, cannot run." exception when using external
> table and LZO / DeprecatedLzoTextInputFormat
> -----------------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-2395
> URL: https://issues.apache.org/jira/browse/HIVE-2395
> Project: Hive
> Issue Type: Bug
> Components: Serializers/Deserializers
> Affects Versions: 0.7.1
> Environment: Cloudera 3u1 with
> https://github.com/kevinweil/hadoop-lzo or
> https://github.com/kevinweil/elephant-bird
> Reporter: Vitaliy Fuks
>
> We have a {{/tables/}} directory containing .lzo files with our data,
> compressed using lzop.
> We {{CREATE EXTERNAL TABLE}} on top of this directory, using {{STORED AS
> INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"}}.
> .lzo files require that an LzoIndexer is run on them. When this is done, a
> .lzo.index file is created for every .lzo file, so we end up with:
> {noformat}
> /tables/ourdata_2011-08-19.lzo
> /tables/ourdata_2011-08-19.lzo.index
> /tables/ourdata_2011-08-18.lzo
> /tables/ourdata_2011-08-18.lzo.index
> ..etc
> {noformat}
> The issue is that org.apache.hadoop.hive.ql.io.CombineHiveRecordReader is
> attempting to getRecordReader() for .lzo.index files. This throws a pretty
> confusing exception:
> {noformat}
> Caused by: java.io.IOException: No LZO codec found, cannot run.
>     at com.hadoop.mapred.DeprecatedLzoLineRecordReader.<init>(DeprecatedLzoLineRecordReader.java:53)
>     at com.hadoop.mapred.DeprecatedLzoTextInputFormat.getRecordReader(DeprecatedLzoTextInputFormat.java:128)
>     at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:68)
> {noformat}
> More precisely, it dies on the second invocation of getRecordReader(). Here
> is some System.out.println() output:
> {noformat}
> DeprecatedLzoTextInputFormat.getRecordReader(): split=/tables/ourdata_2011-08-19.lzo:0+616479
> DeprecatedLzoTextInputFormat.getRecordReader(): split=/tables/ourdata_2011-08-19.lzo.index:0+64
> {noformat}
> DeprecatedLzoTextInputFormat contains the following code, which causes the
> ultimate exception and death of the query, as it obviously doesn't have a
> codec to read .lzo.index files.
> {noformat}
> final CompressionCodec codec = codecFactory.getCodec(file);
> if (codec == null) {
>     throw new IOException("No LZO codec found, cannot run.");
> }
> {noformat}
> So I understand that the way things are right now is that Hive considers all
> files within a directory to be part of a table. There is an open patch
> HIVE-951 which would allow a quick workaround for this problem.
> Does it make sense to add some hooks so that CombineHiveRecordReader or its
> parents are more aware of what files should be considered instead of blindly
> trying to read everything?
> Any suggestions for a quick workaround to make it skip .index files?
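The failure quoted above can be reproduced in miniature. Hadoop's codec factory
picks a codec from the file name suffix, so a name ending in .lzo.index matches
nothing and the lookup returns null. The sketch below imitates that lookup with
a plain map and adds a name-based check that would skip index files. The class,
the registry contents, and the method names are hypothetical stand-ins, not Hive
or hadoop-lzo code; in Hadoop terms the check would correspond to something like
an org.apache.hadoop.fs.PathFilter, and where Hive could apply one is left open.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SuffixCodecLookupSketch {

    // Hypothetical registry: file suffix -> codec name.
    static final Map<String, String> CODECS_BY_SUFFIX = new LinkedHashMap<>();
    static {
        CODECS_BY_SUFFIX.put(".lzo", "LzopCodec");
        CODECS_BY_SUFFIX.put(".gz", "GzipCodec");
    }

    /** Returns the codec name whose suffix matches the file name, or null if none does. */
    static String findCodec(String fileName) {
        for (Map.Entry<String, String> entry : CODECS_BY_SUFFIX.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }

    /** Accepts only .lzo data files, so .lzo.index files are never handed to a reader. */
    static boolean acceptAsTableData(String fileName) {
        return fileName.endsWith(".lzo");
    }

    public static void main(String[] args) {
        String[] names = {
            "/tables/ourdata_2011-08-19.lzo",
            "/tables/ourdata_2011-08-19.lzo.index"
        };
        for (String name : names) {
            System.out.println(name + " codec=" + findCodec(name)
                    + " accepted=" + acceptAsTableData(name));
        }
    }
}
```

Filtering on the name before a record reader is requested keeps the index file
from ever reaching the null-codec check, which is also why the exception message
is misleading: the codec is installed, it is simply being asked about the wrong
file.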