Looks like it hasn't been committed yet. I think you're interested in this PR: https://github.com/apache/parquet-mr/pull/280
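For later readers: the root cause in the quoted trace below is the NullPointerException in MapRedCounterAdapter.increment, which suggests the mapred counter that the adapter wraps is null in this code path. A minimal sketch of the kind of null guard such a fix involves — the names ICounter, SimpleCounter, and NullSafeCounter are illustrative only, not the actual parquet-mr classes:

```java
// Illustrative sketch only: guard counter increments against a missing
// (null) delegate, the defensive pattern behind the NPE in the trace.
interface ICounter {
    void increment(long value);
    long value();
}

// A plain in-memory counter standing in for a real mapred counter.
class SimpleCounter implements ICounter {
    private long count;
    public void increment(long v) { count += v; }
    public long value() { return count; }
}

// Wrapper that tolerates a null delegate instead of throwing an NPE.
class NullSafeCounter implements ICounter {
    private final ICounter delegate; // may be null outside a task context
    NullSafeCounter(ICounter delegate) { this.delegate = delegate; }
    public void increment(long v) {
        if (delegate != null) {      // the guard the failing adapter lacks
            delegate.increment(v);
        }
    }
    public long value() { return delegate == null ? 0 : delegate.value(); }
}

public class CounterGuardDemo {
    public static void main(String[] args) {
        ICounter noContext = new NullSafeCounter(null);
        noContext.increment(42);                  // silently dropped, no NPE
        System.out.println(noContext.value());    // prints 0

        ICounter withContext = new NullSafeCounter(new SimpleCounter());
        withContext.increment(42);
        System.out.println(withContext.value());  // prints 42
    }
}
```

The real fix lives in parquet-mr's counter classes; this sketch only illustrates the defensive pattern, not the patch in the PR above.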
rb

On Mon, Mar 21, 2016 at 4:31 AM, Santlal J Gupta <[email protected]> wrote:

> Hi Ryan,
>
> Did you get a chance to look into this query? I am still waiting for your
> reply.
>
> Could you tell me in which version Reuben fixed the issue?
>
> Thanks
> Santlal
>
>
> From: Santlal J Gupta
> Sent: Monday, February 29, 2016 6:02 PM
> To: '[email protected]'
> Cc: Gurdit Singh
> Subject: HashJoin throws ParquetDecodingException with input as ParquetTupleScheme
>
> Hi Ryan,
>
> Currently I am using the following versions:
>
> Hadoop : 2.6.0
> Cascading : 3.0.1
> Parquet : 1.6.0
> Parquet-cascading : 1.6.0
> Parquet-hadoop : 1.6.0
> Parquet-column : 1.6.0
>
> Thanks
> Santlal
>
> ------------------------------------------------------------------------
>
> Santlal,
>
> What version of Parquet are you using? I think this was recently fixed by
> Reuben.
>
> rb
>
> On Tue, Feb 16, 2016 at 5:16 AM, Santlal J Gupta
> <[email protected]> wrote:
>
> > Hi,
> >
> > I am facing a problem while using *HashJoin* with input using
> > *ParquetTupleScheme*. I have two source taps, one using the
> > *TextDelimited* scheme and the other using *ParquetTupleScheme*. I am
> > performing a *HashJoin* and writing the data out as a delimited file.
> > The program runs successfully in local mode, but when I try to run it
> > on the cluster it gives the following error:
> >
> > parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://Hostname:8020/user/username/testData/lookup-file.parquet
> >   at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:211)
> >   at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:144)
> >   at parquet.hadoop.mapred.DeprecatedParquetInputFormat$RecordReaderWrapper.<init>(DeprecatedParquetInputFormat.java:91)
> >   at parquet.hadoop.mapred.DeprecatedParquetInputFormat.getRecordReader(DeprecatedParquetInputFormat.java:42)
> >   at cascading.tap.hadoop.io.MultiRecordReaderIterator.makeReader(MultiRecordReaderIterator.java:123)
> >   at cascading.tap.hadoop.io.MultiRecordReaderIterator.getNextReader(MultiRecordReaderIterator.java:172)
> >   at cascading.tap.hadoop.io.MultiRecordReaderIterator.hasNext(MultiRecordReaderIterator.java:133)
> >   at cascading.tuple.TupleEntrySchemeIterator.<init>(TupleEntrySchemeIterator.java:94)
> >   at cascading.tap.hadoop.io.HadoopTupleEntrySchemeIterator.<init>(HadoopTupleEntrySchemeIterator.java:49)
> >   at cascading.tap.hadoop.io.HadoopTupleEntrySchemeIterator.<init>(HadoopTupleEntrySchemeIterator.java:44)
> >   at cascading.tap.hadoop.Hfs.openForRead(Hfs.java:439)
> >   at cascading.tap.hadoop.Hfs.openForRead(Hfs.java:108)
> >   at cascading.flow.stream.element.SourceStage.map(SourceStage.java:82)
> >   at cascading.flow.stream.element.SourceStage.run(SourceStage.java:66)
> >   at cascading.flow.hadoop.FlowMapper.run(FlowMapper.java:139)
> >   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
> >   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
> >   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
> >   at java.security.AccessController.doPrivileged(Native Method)
> >   at javax.security.auth.Subject.doAs(Subject.java:415)
> >   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
> >   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> > Caused by: java.lang.NullPointerException
> >   at parquet.hadoop.util.counters.mapred.MapRedCounterAdapter.increment(MapRedCounterAdapter.java:34)
> >   at parquet.hadoop.util.counters.BenchmarkCounter.incrementTotalBytes(BenchmarkCounter.java:75)
> >   at parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:349)
> >   at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:114)
> >   at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:191)
> >   ... 21 more
> >
> > *Below is the use case:*
> >
> >     public static void main(String[] args) throws IOException {
> >         Configuration conf = new Configuration();
> >
> >         String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
> >
> >         String argsString = "";
> >         for (String arg : otherArgs) {
> >             argsString = argsString + " " + arg;
> >         }
> >         System.out.println("After processing, arguments are:" + argsString);
> >
> >         Properties properties = new Properties();
> >         properties.putAll(conf.getValByRegex(".*"));
> >
> >         String outputPath = "testData/BasicEx_Output";
> >         Class[] types1 = { String.class, String.class, String.class };
> >         Fields f1 = new Fields("id1", "city1", "state");
> >
> >         Tap source = new Hfs(new TextDelimited(f1, "|", "", types1, false), "main-txt-file.dat");
> >         Pipe pipe = new Pipe("ReadWrite");
> >
> >         Scheme pScheme = new ParquetTupleScheme();
> >         Tap source2 = new Hfs(pScheme, "testData/lookup-file.parquet");
> >         Pipe pipe2 = new Pipe("ReadWrite2");
> >
> >         Pipe tokenPipe = new HashJoin(pipe, new Fields("id1"), pipe2, new Fields("id"), new LeftJoin());
> >
> >         Tap sink = new Hfs(new TextDelimited(f1, true, "|"), outputPath, SinkMode.REPLACE);
> >
> >         FlowDef flowDef1 = FlowDef.flowDef()
> >             .addSource(pipe, source)
> >             .addSource(pipe2, source2)
> >             .addTailSink(tokenPipe, sink);
> >         new Hadoop2MR1FlowConnector(properties).connect(flowDef1).complete();
> >     }
> >
> > I have attached the input files for reference. Please help me solve
> > this issue.
> >
> > I asked the same question on the cascading-user Google group, and below
> > is the response I received:
> >
> > André Kelpe wrote:
> >
> > > This looks like a bug caused by a wrong assumption in parquet. I fixed
> > > a similar thing 2 years ago in parquet:
> > > https://github.com/Parquet/parquet-mr/pull/388/ Can you check with the
> > > upstream project? It looks like it is their problem and not a problem
> > > in Cascading.
> > >
> > > - André
> > >
> > > --
> > > André Kelpe
> > > [email protected]
> > > http://concurrentinc.com
> >
> > Thanks
> >
> > Santlal
>
> --
> Ryan Blue
> Software Engineer
> Netflix

--
Ryan Blue
Software Engineer
Netflix
