[ https://issues.apache.org/jira/browse/PIG-5277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adam Szita updated PIG-5277:
----------------------------
    Description:
After committing PIG-3655 a couple of Spark mode tests (e.g. org.apache.pig.test.TestEvalPipeline.testCogroupAfterDistinct) started failing on:
{code}
java.lang.Error: java.io.IOException: Corrupt data file, expected tuple type byte, but seen 27
    at org.apache.pig.backend.hadoop.executionengine.HJob$1.hasNext(HJob.java:122)
    at org.apache.pig.test.TestEvalPipeline.testCogroupAfterDistinct(TestEvalPipeline.java:1052)
Caused by: java.io.IOException: Corrupt data file, expected tuple type byte, but seen 27
    at org.apache.pig.impl.io.InterRecordReader.readDataOrEOF(InterRecordReader.java:158)
    at org.apache.pig.impl.io.InterRecordReader.nextKeyValue(InterRecordReader.java:194)
    at org.apache.pig.impl.io.InterStorage.getNext(InterStorage.java:79)
    at org.apache.pig.impl.io.ReadToEndLoader.getNextHelper(ReadToEndLoader.java:238)
    at org.apache.pig.impl.io.ReadToEndLoader.getNext(ReadToEndLoader.java:218)
    at org.apache.pig.backend.hadoop.executionengine.HJob$1.hasNext(HJob.java:115)
{code}
This is because InterRecordReader became much stricter after PIG-3655. Before, it simply skipped these bytes, treating them as garbage at the beginning of the split. Now, where we expect a [proper tuple with a tuple type byte|https://github.com/apache/pig/blob/trunk/src/org/apache/pig/impl/io/InterRecordReader.java#L153], we see these nulls and throw an exception.

As far as I can see, this happens because JoinGroupSparkConverter has to return something even when it shouldn't.
When the POPackage operator returns a [POStatus.STATUS_NULL|https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/spark/converter/JoinGroupSparkConverter.java#L211], the converter shouldn't return anything, but it can't do better than returning a null. This null then gets written out by Spark.

  was:
After committing PIG-3655 a couple of Spark mode tests (e.g.
org.apache.pig.test.TestEvalPipeline.testCogroupAfterDistinct) started failing on:
{code}
java.lang.Error: java.io.IOException: Corrupt data file, expected tuple type byte, but seen 27
    at org.apache.pig.backend.hadoop.executionengine.HJob$1.hasNext(HJob.java:122)
    at org.apache.pig.test.TestEvalPipeline.testCogroupAfterDistinct(TestEvalPipeline.java:1052)
Caused by: java.io.IOException: Corrupt data file, expected tuple type byte, but seen 27
    at org.apache.pig.impl.io.InterRecordReader.readDataOrEOF(InterRecordReader.java:158)
    at org.apache.pig.impl.io.InterRecordReader.nextKeyValue(InterRecordReader.java:194)
    at org.apache.pig.impl.io.InterStorage.getNext(InterStorage.java:79)
    at org.apache.pig.impl.io.ReadToEndLoader.getNextHelper(ReadToEndLoader.java:238)
    at org.apache.pig.impl.io.ReadToEndLoader.getNext(ReadToEndLoader.java:218)
    at org.apache.pig.backend.hadoop.executionengine.HJob$1.hasNext(HJob.java:115)
{code}
This is because InterRecordReader became much stricter after PIG-3655. Before, it simply skipped these bytes, treating them as garbage at the beginning of the split. Now, where we expect a [proper tuple with a tuple type byte|https://github.com/apache/pig/blob/trunk/src/org/apache/pig/impl/io/InterRecordReader.java#L153], we see these nulls and throw an exception.


> Spark mode is writing nulls among tuples to the output
> -------------------------------------------------------
>
>                 Key: PIG-5277
>                 URL: https://issues.apache.org/jira/browse/PIG-5277
>             Project: Pig
>          Issue Type: Bug
>          Components: spark
>            Reporter: Adam Szita
>            Assignee: Adam Szita
>

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
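
The mechanism described in the issue (a per-record converter that has to return something for every input, and so returns null when there is no tuple to emit) can be illustrated with a small, self-contained sketch. This is not the fix committed for PIG-5277 and it uses no Pig or Spark classes: `NullSkippingSketch`, `convert` and `convertAndDropNulls` are hypothetical names, and strings stand in for tuples. It only shows one possible way to keep such null placeholders from reaching the writer, namely filtering them out after the mapping step.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

class NullSkippingSketch {

    // Hypothetical stand-in for a per-record converter function. It must
    // return a value for every input, so it returns null when there is
    // nothing to emit (modelled here by an empty input string).
    static String convert(String input) {
        return input.isEmpty() ? null : "(" + input + ")";
    }

    // Maps every input through the converter, then drops the null
    // placeholders so that only real records are passed on to the writer.
    static List<String> convertAndDropNulls(List<String> inputs) {
        return inputs.stream()
                .map(NullSkippingSketch::convert)
                .filter(Objects::nonNull)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The empty string in the middle produces a null, which is filtered out.
        System.out.println(convertAndDropNulls(Arrays.asList("a", "", "b")));
        // prints [(a), (b)]
    }
}
```

Without the `filter` step the resulting list would contain a null between the two records, which corresponds to the nulls that end up among the tuples in the output.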