Something must have changed with the bzip2 codec in later versions of Hadoop. When I get time, I'll investigate which version actually breaks it and see what changed.
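In the meantime, if anyone wants to try reproducing this outside of Beam, a plain Hadoop read loop should exercise the same decompression path (just a sketch; the path below is a placeholder for one of the affected files):

####
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ReadSeqFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder path: point this at one of the affected files.
    Path path = new Path("/tmp/affected-file.seq");
    try (SequenceFile.Reader reader =
        new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))) {
      Text key = new Text();
      Text value = new Text();
      long n = 0;
      // reader.next(key, value) decompresses the value; that is where the
      // IndexOutOfBoundsException from CBZip2InputStream surfaced for us.
      while (reader.next(key, value)) {
        n++;
      }
      System.out.println("read " + n + " records");
    }
  }
}
####

Running that with the Hadoop 2.7.4 vs. 3.2.0 jars on the classpath should narrow down which side changed.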
On Thu, Sep 5, 2019 at 11:40 AM Lukasz Cwik <[email protected]> wrote:

> Sorry for the poor experience and thanks for sharing a solution with
> others.
>
> On Thu, Sep 5, 2019 at 6:34 AM Shannon Duncan <[email protected]>
> wrote:
>
>> FYI this was due to the Hadoop version. 3.2.0 was throwing this error,
>> but after rolling back to the version in Google's pom.xml (2.7.4) it is
>> working fine now.
>>
>> Kind of annoying, because I wasted several hours jumping through hoops
>> trying to get 3.2.0 working :(
>>
>> On Wed, Sep 4, 2019 at 5:09 PM Shannon Duncan <[email protected]>
>> wrote:
>>
>>> I have been successfully using the sequence file source located here:
>>>
>>> https://github.com/googleapis/java-bigtable-hbase/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSource.java
>>>
>>> However, we recently started doing block-level compression with bzip2
>>> on the SequenceFile. This is supported out of the box on the Hadoop
>>> side of things.
>>>
>>> When reading the files back in, most records parse out just fine, but
>>> a handful of records throw:
>>>
>>> ####
>>> Exception in thread "main" java.lang.IndexOutOfBoundsException:
>>> offs(1368) + len(1369) > dest.length(1467).
>>> at
>>> org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:398)
>>> ####
>>>
>>> I've gone in circles looking at this. It seems that the last record
>>> read from the SequenceFile in each thread hits this on the value
>>> retrieval (the key retrieves just fine, but the value throws this
>>> error).
>>>
>>> Any clues as to what this could be?
>>>
>>> The file is KV<Text, Text>, i.e. its header reads
>>> "SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text(org.apache.hadoop.io.compress.BZip2Codec"
>>>
>>> Any help is appreciated!
>>>
>>> - Shannon
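P.S. for context, this is roughly how a block-compressed bzip2 SequenceFile like the one quoted above gets written on the Hadoop side (a sketch with a placeholder path and records, not our actual job):

####
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec;

public class WriteSeqFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder output path.
    Path path = new Path("/tmp/example.seq");
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(path),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(Text.class),
        // Block-level bzip2 compression, which produces the
        // "...Text...Text(...BZip2Codec" header quoted above.
        SequenceFile.Writer.compression(CompressionType.BLOCK,
            new BZip2Codec()))) {
      for (int i = 0; i < 1000; i++) {
        writer.append(new Text("key-" + i), new Text("value-" + i));
      }
    }
  }
}
####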
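And for anyone else hitting the same wall: the workaround boiled down to pinning the Hadoop dependency back to the 2.7.4 that Google's pom.xml uses, something along these lines (artifactId assumed; match whichever Hadoop artifacts your build actually pulls in):

####
<!-- Pin back to 2.7.4 instead of 3.2.0 until the codec change is found. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.7.4</version>
</dependency>
####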
