[ 
https://issues.apache.org/jira/browse/PIG-5277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Szita updated PIG-5277:
----------------------------
    Description: 
After committing PIG-3655 a couple of Spark mode tests (e.g. 
org.apache.pig.test.TestEvalPipeline.testCogroupAfterDistinct) started failing 
on:
{code}
java.lang.Error: java.io.IOException: Corrupt data file, expected tuple type 
byte, but seen 27
        at 
org.apache.pig.backend.hadoop.executionengine.HJob$1.hasNext(HJob.java:122)
        at 
org.apache.pig.test.TestEvalPipeline.testCogroupAfterDistinct(TestEvalPipeline.java:1052)
Caused by: java.io.IOException: Corrupt data file, expected tuple type byte, 
but seen 27
        at 
org.apache.pig.impl.io.InterRecordReader.readDataOrEOF(InterRecordReader.java:158)
        at 
org.apache.pig.impl.io.InterRecordReader.nextKeyValue(InterRecordReader.java:194)
        at org.apache.pig.impl.io.InterStorage.getNext(InterStorage.java:79)
        at 
org.apache.pig.impl.io.ReadToEndLoader.getNextHelper(ReadToEndLoader.java:238)
        at 
org.apache.pig.impl.io.ReadToEndLoader.getNext(ReadToEndLoader.java:218)
        at 
org.apache.pig.backend.hadoop.executionengine.HJob$1.hasNext(HJob.java:115)
{code}

This is because InterRecordReader became much stricter after PIG-3655. Previously 
it simply skipped such bytes, treating them as garbage at the beginning of the 
split. Now that we expect a [proper tuple with a tuple type 
byte|https://github.com/apache/pig/blob/trunk/src/org/apache/pig/impl/io/InterRecordReader.java#L153]
 we see these nulls and throw an exception.
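The behavioural change can be sketched roughly as follows. This is a simplified stand-in, not the actual InterRecordReader code: the class name, method names, and the marker value {{TUPLE_TYPE}} are all hypothetical, chosen only to contrast the old lenient skip with the new strict check that produces the "Corrupt data file" error above.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

public class StrictReaderSketch {
    // Hypothetical stand-in for Pig's tuple type marker byte.
    static final byte TUPLE_TYPE = 110;

    // Old, lenient behaviour: bytes that are not a tuple marker are
    // silently skipped as presumed garbage at the split beginning.
    static int countTuplesLenient(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        int count = 0;
        while (in.available() > 0) {
            if (in.readByte() == TUPLE_TYPE) {
                count++;
            }
        }
        return count;
    }

    // New, strict behaviour: any unexpected byte is treated as corruption.
    static int countTuplesStrict(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        int count = 0;
        while (in.available() > 0) {
            byte b = in.readByte();
            if (b != TUPLE_TYPE) {
                throw new IOException(
                    "Corrupt data file, expected tuple type byte, but seen " + b);
            }
            count++;
        }
        return count;
    }
}
```

With a stray null (0) byte between two tuples, the lenient reader returns 2 while the strict one throws, which matches the stack trace above.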


As far as I can see, this is happening because JoinGroupSparkConverter has to 
return something even when it shouldn't.
When the POPackage operator returns 
[POStatus.STATUS_NULL|https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/spark/converter/JoinGroupSparkConverter.java#L211],
 the converter shouldn't return anything, but it can do no better than 
returning a null. This then gets written out by Spark.
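The null-leak can be illustrated with plain java.util.stream as a simplified stand-in for the Spark RDD pipeline (the class {{NullRecordDemo}} and the {{convert}} method are hypothetical names, not Pig code): a map-style converter must emit exactly one output per input, so the STATUS_NULL case forces a null into the output, whereas a flatMap-style converter could emit zero records for that input.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class NullRecordDemo {
    // Stand-in for the converter: returns null when the operator
    // produces STATUS_NULL (modelled here as a negative key).
    static String convert(int key) {
        return key < 0 ? null : "tuple-" + key;
    }

    public static void main(String[] args) {
        List<Integer> keys = Arrays.asList(1, -2, 3);

        // map() must emit one element per input, so the STATUS_NULL
        // case leaks a null into the output...
        List<String> withNulls = keys.stream()
                .map(NullRecordDemo::convert)
                .collect(Collectors.toList());
        System.out.println(withNulls); // [tuple-1, null, tuple-3]

        // ...while flatMap() can emit zero elements and drop it.
        List<String> filtered = keys.stream()
                .flatMap(k -> {
                    String t = convert(k);
                    return t == null ? Stream.<String>empty() : Stream.of(t);
                })
                .collect(Collectors.toList());
        System.out.println(filtered); // [tuple-1, tuple-3]
    }
}
```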

  was:
After committing PIG-3655 a couple of Spark mode tests (e.g. 
org.apache.pig.test.TestEvalPipeline.testCogroupAfterDistinct) started failing 
on:
{code}
java.lang.Error: java.io.IOException: Corrupt data file, expected tuple type 
byte, but seen 27
        at 
org.apache.pig.backend.hadoop.executionengine.HJob$1.hasNext(HJob.java:122)
        at 
org.apache.pig.test.TestEvalPipeline.testCogroupAfterDistinct(TestEvalPipeline.java:1052)
Caused by: java.io.IOException: Corrupt data file, expected tuple type byte, 
but seen 27
        at 
org.apache.pig.impl.io.InterRecordReader.readDataOrEOF(InterRecordReader.java:158)
        at 
org.apache.pig.impl.io.InterRecordReader.nextKeyValue(InterRecordReader.java:194)
        at org.apache.pig.impl.io.InterStorage.getNext(InterStorage.java:79)
        at 
org.apache.pig.impl.io.ReadToEndLoader.getNextHelper(ReadToEndLoader.java:238)
        at 
org.apache.pig.impl.io.ReadToEndLoader.getNext(ReadToEndLoader.java:218)
        at 
org.apache.pig.backend.hadoop.executionengine.HJob$1.hasNext(HJob.java:115)
{code}

This is because InterRecordReader became much stricter after PIG-3655. Previously 
it simply skipped such bytes, treating them as garbage at the beginning of the 
split. Now that we expect a [proper tuple with a tuple type 
byte|https://github.com/apache/pig/blob/trunk/src/org/apache/pig/impl/io/InterRecordReader.java#L153]
 we see these nulls and throw an exception.


> Spark mode is writing nulls among tuples to the output 
> -------------------------------------------------------
>
>                 Key: PIG-5277
>                 URL: https://issues.apache.org/jira/browse/PIG-5277
>             Project: Pig
>          Issue Type: Bug
>          Components: spark
>            Reporter: Adam Szita
>            Assignee: Adam Szita
>
> After committing PIG-3655 a couple of Spark mode tests (e.g. 
> org.apache.pig.test.TestEvalPipeline.testCogroupAfterDistinct) started 
> failing on:
> {code}
> java.lang.Error: java.io.IOException: Corrupt data file, expected tuple type 
> byte, but seen 27
>       at 
> org.apache.pig.backend.hadoop.executionengine.HJob$1.hasNext(HJob.java:122)
>       at 
> org.apache.pig.test.TestEvalPipeline.testCogroupAfterDistinct(TestEvalPipeline.java:1052)
> Caused by: java.io.IOException: Corrupt data file, expected tuple type byte, 
> but seen 27
>       at 
> org.apache.pig.impl.io.InterRecordReader.readDataOrEOF(InterRecordReader.java:158)
>       at 
> org.apache.pig.impl.io.InterRecordReader.nextKeyValue(InterRecordReader.java:194)
>       at org.apache.pig.impl.io.InterStorage.getNext(InterStorage.java:79)
>       at 
> org.apache.pig.impl.io.ReadToEndLoader.getNextHelper(ReadToEndLoader.java:238)
>       at 
> org.apache.pig.impl.io.ReadToEndLoader.getNext(ReadToEndLoader.java:218)
>       at 
> org.apache.pig.backend.hadoop.executionengine.HJob$1.hasNext(HJob.java:115)
> {code}
> This is because InterRecordReader became much stricter after PIG-3655. Previously 
> it simply skipped such bytes, treating them as garbage at the beginning of the 
> split. Now that we expect a [proper tuple with a tuple type 
> byte|https://github.com/apache/pig/blob/trunk/src/org/apache/pig/impl/io/InterRecordReader.java#L153]
>  we see these nulls and throw an exception.
> As far as I can see, this is happening because JoinGroupSparkConverter has to 
> return something even when it shouldn't.
> When the POPackage operator returns 
> [POStatus.STATUS_NULL|https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/spark/converter/JoinGroupSparkConverter.java#L211],
>  the converter shouldn't return anything, but it can do no better than 
> returning a null. This then gets written out by Spark.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)