liyunzhang_intel created PIG-4899:
-------------------------------------
Summary: The number of records of input file is calculated
wrongly in spark mode in multiquery case
Key: PIG-4899
URL: https://issues.apache.org/jira/browse/PIG-4899
Project: Pig
Issue Type: Sub-task
Reporter: liyunzhang_intel
Assignee: liyunzhang_intel
The sparkCounter that calculates the number of records in the input
file (in LoadConverter#ToTupleFunction#apply) is executed multiple times in
the multiquery case. As a result, the number of input records is calculated
wrongly. For example:
{code}
#--------------------------------------------------
# Spark Plan
#--------------------------------------------------
Spark node scope-534
Split - scope-548
| |
| Store(hdfs://localhost:48350/tmp/temp649016960/tmp48836938:org.apache.pig.impl.io.InterStorage) - scope-538
| |
| |---C: Filter[bag] - scope-495
| | |
| | Less Than or Equal[boolean] - scope-498
| | |
| | |---Project[int][1] - scope-496
| | |
| | |---Constant(5) - scope-497
| |
| Store(hdfs://localhost:48350/tmp/temp649016960/tmp804709981:org.apache.pig.impl.io.InterStorage) - scope-546
| |
| |---B: Filter[bag] - scope-507
| | |
| | Equal To[boolean] - scope-510
| | |
| | |---Project[int][0] - scope-508
| | |
| | |---Constant(3) - scope-509
|
|---A: New For Each(false,false,false)[bag] - scope-491
| |
| Cast[int] - scope-483
| |
| |---Project[bytearray][0] - scope-482
| |
| Cast[int] - scope-486
| |
| |---Project[bytearray][1] - scope-485
| |
| Cast[int] - scope-489
| |
| |---Project[bytearray][2] - scope-488
|
|---A: Load(hdfs://localhost:48350/user/root/input:org.apache.pig.builtin.PigStorage) - scope-481--------
Spark node scope-540
C: Store(hdfs://localhost:48350/user/root/output:org.apache.pig.builtin.PigStorage) - scope-502
|
|---Load(hdfs://localhost:48350/tmp/temp649016960/tmp48836938:org.apache.pig.impl.io.InterStorage) - scope-539--------
Spark node scope-542
D: Store(hdfs://localhost:48350/user/root/output2:org.apache.pig.builtin.PigStorage) - scope-533
|
|---D: FRJoin[tuple] - scope-525
| |
| Project[int][0] - scope-522
| |
| Project[int][0] - scope-523
| |
| Project[int][0] - scope-524
|
|---Load(hdfs://localhost:48350/tmp/temp649016960/tmp48836938:org.apache.pig.impl.io.InterStorage) - scope-541--------
Spark node scope-545
Store(hdfs://localhost:48350/tmp/temp649016960/tmp-2036144538:org.apache.pig.impl.io.InterStorage) - scope-547
|
|---A1: New For Each(false,false,false)[bag] - scope-521
| |
| Cast[int] - scope-513
| |
| |---Project[bytearray][0] - scope-512
| |
| Cast[int] - scope-516
| |
| |---Project[bytearray][1] - scope-515
| |
| Cast[int] - scope-519
| |
| |---Project[bytearray][2] - scope-518
|
|---A1: Load(hdfs://localhost:48350/user/root/input2:org.apache.pig.builtin.PigStorage) - scope-511-------
{code}
Because this is a multiquery case, the PhysicalOperator (LoadA) is executed
in LoadConverter#ToTupleFunction#apply more times than it should be, so the
input records are counted more than once.
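
The over-counting described above can be illustrated with a small standalone sketch. It does not use Pig or Spark code. The class MultiQueryCounterSketch, the inputRecords counter, and the toTuple method are hypothetical stand-ins for the Spark counter and for LoadConverter#ToTupleFunction#apply, and the lazily evaluated stream stands in for the load that is re-run for each branch of the Split (B and C in the plan). The second method shows one possible way to avoid the double count, by materializing the load once. That is only an assumption for illustration, not necessarily the fix chosen for this issue.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class MultiQueryCounterSketch {
    // Hypothetical stand-in for the Spark counter of input records.
    static final AtomicLong inputRecords = new AtomicLong();

    // Hypothetical stand-in for LoadConverter#ToTupleFunction#apply:
    // converts one raw line to a tuple and counts it as a side effect.
    static int[] toTuple(String line) {
        inputRecords.incrementAndGet();
        String[] fields = line.split(",");
        int[] tuple = new int[fields.length];
        for (int i = 0; i < fields.length; i++) {
            tuple[i] = Integer.parseInt(fields[i].trim());
        }
        return tuple;
    }

    // Each branch of the split re-evaluates the lazy load, so every
    // input line passes through toTuple once per branch.
    static long countWithReevaluation(List<String> input) {
        inputRecords.set(0);
        Supplier<Stream<int[]>> loadA =
                () -> input.stream().map(MultiQueryCounterSketch::toTuple);
        // B: filter on field 0 == 3 (as in the plan above)
        List<int[]> b = loadA.get().filter(t -> t[0] == 3).collect(Collectors.toList());
        // C: filter on field 1 <= 5 (as in the plan above)
        List<int[]> c = loadA.get().filter(t -> t[1] <= 5).collect(Collectors.toList());
        return inputRecords.get();
    }

    // Assumed mitigation: evaluate the load once and let both branches
    // read the materialized result, so each line is counted once.
    static long countWithMaterializedLoad(List<String> input) {
        inputRecords.set(0);
        List<int[]> a = input.stream()
                .map(MultiQueryCounterSketch::toTuple)
                .collect(Collectors.toList());
        List<int[]> b = a.stream().filter(t -> t[0] == 3).collect(Collectors.toList());
        List<int[]> c = a.stream().filter(t -> t[1] <= 5).collect(Collectors.toList());
        return inputRecords.get();
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("1,2,3", "3,4,5", "3,9,1");
        System.out.println(countWithReevaluation(input));     // prints 6
        System.out.println(countWithMaterializedLoad(input)); // prints 3
    }
}
```

With three input lines and two branches, the re-evaluated version reports 6 input records instead of 3, which is the kind of wrong count reported here.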