[ https://issues.apache.org/jira/browse/SPARK-23076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
zhoukang updated SPARK-23076:
-----------------------------
Description:

For the query below:
{code:java}
select * from csv_demo limit 3;
{code}
the correct result should be:

0: jdbc:hive2://10.108.230.228:10000/> select * from csv_demo limit 3;
+----------+------+--+
|   _c0    | _c1  |
+----------+------+--+
| Joe      | 20   |
| Tom      | 30   |
| Hyukjin  | 25   |
+----------+------+--+

However, when we call cache() on the ShuffledRowRDD (or on an RDD that depends on a ShuffledRowRDD within the same stage):
!shufflerowrdd-cache.png!
the result is wrong:

0: jdbc:hive2://xxx/> select * from csv_demo limit 3;
+----------+------+--+
|   _c0    | _c1  |
+----------+------+--+
| Hyukjin  | 25   |
| Hyukjin  | 25   |
| Hyukjin  | 25   |
+----------+------+--+

The reason this happens is that all of the UnsafeRow instances produced by ShuffledRowRDD#compute share the same underlying byte buffer. The log below demonstrates this; UnsafeRow.toString() was modified to print baseObject and baseOffset:
{code:java}
// This is for debugging only: print the row's words in hex, plus the
// backing baseObject and baseOffset, so buffer sharing is visible.
@Override
public String toString() {
  StringBuilder build = new StringBuilder("[");
  for (int i = 0; i < sizeInBytes; i += 8) {
    if (i != 0) build.append(',');
    build.append(java.lang.Long.toHexString(Platform.getLong(baseObject, baseOffset + i)));
  }
  // Print baseObject and baseOffset here
  build.append(',').append(baseObject).append(',').append(baseOffset).append(']');
  return build.toString();
}
{code}
Note that all three rows report the same baseObject ([B@6225ec90):
{code:java}
2018-01-12,22:08:47,441 INFO org.apache.spark.sql.execution.ShuffledRowRDD: Read value: [0,1800000003,2000000002,656f4a,3032,[B@6225ec90,16]
2018-01-12,22:08:47,445 INFO org.apache.spark.sql.execution.ShuffledRowRDD: Read value: [0,1800000003,2000000002,6d6f54,3033,[B@6225ec90,16]
2018-01-12,22:08:47,448 INFO org.apache.spark.sql.execution.ShuffledRowRDD: Read value: [0,1800000007,2000000002,6e696a6b757948,3532,[B@6225ec90,16]
{code}

> When we call cache() on ShuffleRowRDD, we will get an error result
> ------------------------------------------------------------------
>
>                 Key: SPARK-23076
>                 URL: https://issues.apache.org/jira/browse/SPARK-23076
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: zhoukang
>            Priority: Major
>         Attachments: shufflerowrdd-cache.png
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
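The buffer-reuse pitfall described in the report, and the copy-before-cache fix, can be sketched outside Spark. This is a minimal, hypothetical illustration, not Spark code: `Row`, `rows()`, `cacheWithoutCopy`, and `cacheWithCopy` are stand-ins for `UnsafeRow`, `ShuffledRowRDD#compute`, and caching the rows with and without an explicit copy of each one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class BufferReuseDemo {

    // Mutable "row" standing in for UnsafeRow: its value field plays the
    // role of the shared underlying byte buffer.
    static final class Row {
        String value;
        Row copy() { Row r = new Row(); r.value = this.value; return r; }
    }

    // Iterator that returns the SAME Row instance every time, mutated in
    // place -- analogous to ShuffledRowRDD#compute reusing one byte buffer.
    static Iterator<Row> rows(final String... values) {
        final Row shared = new Row();
        final Iterator<String> it = Arrays.asList(values).iterator();
        return new Iterator<Row>() {
            @Override public boolean hasNext() { return it.hasNext(); }
            @Override public Row next() { shared.value = it.next(); return shared; }
            @Override public void remove() { throw new UnsupportedOperationException(); }
        };
    }

    // "Caching" by storing the row references directly: every cached entry
    // aliases the shared buffer, so when read back they all show the last
    // value written -- the all-Hyukjin result from the bug report.
    static List<String> cacheWithoutCopy(String... values) {
        List<Row> cache = new ArrayList<>();
        for (Iterator<Row> it = rows(values); it.hasNext(); ) cache.add(it.next());
        List<String> out = new ArrayList<>();
        for (Row r : cache) out.add(r.value);
        return out;
    }

    // Copying each row before caching gives every entry its own buffer, so
    // the cached values survive later mutations of the shared one.
    static List<String> cacheWithCopy(String... values) {
        List<Row> cache = new ArrayList<>();
        for (Iterator<Row> it = rows(values); it.hasNext(); ) cache.add(it.next().copy());
        List<String> out = new ArrayList<>();
        for (Row r : cache) out.add(r.value);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(cacheWithoutCopy("Joe", "Tom", "Hyukjin")); // [Hyukjin, Hyukjin, Hyukjin]
        System.out.println(cacheWithCopy("Joe", "Tom", "Hyukjin"));    // [Joe, Tom, Hyukjin]
    }
}
```

The design point is that an iterator handing out a reused mutable object is only safe for immediate, streaming consumption; any consumer that retains the elements (such as a cache) must materialize an independent copy of each one first.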