[jira] [Commented] (PIG-5135) HDFS bytes read stats are always 0 in Spark mode

2017-05-25 Thread Adam Szita (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16024858#comment-16024858
 ] 

Adam Szita commented on PIG-5135:
-

[~kellyzly] I just realized you already made this change in one of your patches 
in PIG-5215. Let's continue the discussion there.

> HDFS bytes read stats are always 0 in Spark mode
> 
>
> Key: PIG-5135
> URL: https://issues.apache.org/jira/browse/PIG-5135
> Project: Pig
>  Issue Type: Bug
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: Adam Szita
> Fix For: spark-branch
>
> Attachments: PIG-5135.0.patch, PIG-5135.1.patch, PIG-5135.2.patch, 
> PIG-5135.smallfixes.patch
>
>
> I discovered this while running TestOrcStoragePushdown unit test in Spark 
> mode where the test depends on the value of this stat.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5135) HDFS bytes read stats are always 0 in Spark mode

2017-05-25 Thread Adam Szita (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16024736#comment-16024736
 ] 

Adam Szita commented on PIG-5135:
-

[~kellyzly]: fair point. I've done the suggested modifications in 
[^PIG-5135.smallfixes.patch]. In this patch there is also a fix for a missing 
"else" keyword that currently causes the test to fail in MR mode. We could do 
it in a separate jira but it's very minimal and could go together with the 
other change I think.



[jira] [Commented] (PIG-5135) HDFS bytes read stats are always 0 in Spark mode

2017-05-24 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16024218#comment-16024218
 ] 

liyunzhang_intel commented on PIG-5135:
---

[~szita]:
bq. I've checked this. It seems that {{assertEquals(30, inputStats.get(0).getBytes());}} is fine, but {{assertEquals(18, inputStats.get(1).getBytes());}} does not hold: Spark returns -1 here. The plan generated for Spark consists of 4 jobs, the last one being responsible for the replicated join. That job does 3 loads, and thus SparkPigStats reports its bytes read as -1. (Even after adding together the bytes from all load ops in this job, I got a different result than 18.) I guess compression of the generated tmp file parts is also at work here, further altering the number of bytes read.
The Spark plan for org.apache.pig.test.TestPigRunner#simpleMultiQueryTest3 is:
{code}
#--
# Spark Plan  
#--

Spark node scope-53
Store(hdfs://localhost:58892/tmp/temp-1660154197/tmp1818797386:org.apache.pig.impl.io.InterStorage) - scope-54
|
|---A: New For Each(false,false,false)[bag] - scope-10
|   |
|   Cast[int] - scope-2
|   |
|   |---Project[bytearray][0] - scope-1
|   |
|   Cast[int] - scope-5
|   |
|   |---Project[bytearray][1] - scope-4
|   |
|   Cast[int] - scope-8
|   |
|   |---Project[bytearray][2] - scope-7
|
|---A: Load(hdfs://localhost:58892/user/root/input:org.apache.pig.builtin.PigStorage) - scope-0

Spark node scope-55
Store(hdfs://localhost:58892/tmp/temp-1660154197/tmp-546700946:org.apache.pig.impl.io.InterStorage) - scope-56
|
|---C: Filter[bag] - scope-14
|   |
|   Less Than or Equal[boolean] - scope-17
|   |
|   |---Project[int][1] - scope-15
|   |
|   |---Constant(5) - scope-16
|
|---Load(hdfs://localhost:58892/tmp/temp-1660154197/tmp1818797386:org.apache.pig.impl.io.InterStorage) - scope-10

Spark node scope-57
C: Store(hdfs://localhost:58892/user/root/output:org.apache.pig.builtin.PigStorage) - scope-21
|
|---Load(hdfs://localhost:58892/tmp/temp-1660154197/tmp-546700946:org.apache.pig.impl.io.InterStorage) - scope-14

Spark node scope-65
D: Store(hdfs://localhost:58892/user/root/output2:org.apache.pig.builtin.PigStorage) - scope-52
|
|---D: FRJoinSpark[tuple] - scope-44
|   |
|   Project[int][0] - scope-41
|   |
|   Project[int][0] - scope-42
|   |
|   Project[int][0] - scope-43
|
|---Load(hdfs://localhost:58892/tmp/temp-1660154197/tmp-546700946:org.apache.pig.impl.io.InterStorage) - scope-58
|
|---BroadcastSpark - scope-63
|   |
|   |---B: Filter[bag] - scope-26
|   |   |
|   |   Equal To[boolean] - scope-29
|   |   |
|   |   |---Project[int][0] - scope-27
|   |   |
|   |   |---Constant(3) - scope-28
|   |
|   |---Load(hdfs://localhost:58892/tmp/temp-1660154197/tmp1818797386:org.apache.pig.impl.io.InterStorage) - scope-60
|
|---BroadcastSpark - scope-64
|
|---A1: New For Each(false,false,false)[bag] - scope-40
|   |
|   Cast[int] - scope-32
|   |
|   |---Project[bytearray][0] - scope-31
|   |
|   Cast[int] - scope-35
|   |
|   |---Project[bytearray][1] - scope-34
|   |
|   Cast[int] - scope-38
|   |
|   |---Project[bytearray][2] - scope-37
|
|---A1: Load(hdfs://localhost:58892/user/root/input2:org.apache.pig.builtin.PigStorage) - scope-30
{code}
assertEquals(30, inputStats.get(0).getBytes()) is correct in Spark mode, but 
assertEquals(18, inputStats.get(1).getBytes()) is wrong in Spark mode because 
there are 3 loads in {{Spark node scope-65}}. 
[{{stats.get("BytesRead")}}|https://github.com/apache/pig/blob/spark/src/org/apache/pig/tools/pigstats/spark/SparkJobStats.java#L93] 
returns 49 (I guess this is the sum of the three loads: {{input2}}, 
{{tmp1818797386}} and {{tmp-546700946}}). But the current 
[{{bytesRead}}|https://github.com/apache/pig/blob/spark/src/org/apache/pig/tools/pigstats/spark/SparkJobStats.java#L91] 
is -1 because 
[{{singleInput}}|https://github.com/apache/pig/blob/spark/src/org/apache/pig/tools/pigstats/spark/SparkJobStats.java#L92] 
is false.
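
The behaviour described above can be pictured with a small standalone sketch. It is not the actual SparkJobStats code: the class {{InputBytesSketch}}, the method {{bytesForInput}} and the constant {{UNKNOWN}} are hypothetical names, and the only assumption is the rule stated in the comment, namely that the job-level "BytesRead" counter is reported for a load only when the job has a single input, and -1 otherwise.

```java
// Hypothetical sketch for illustration only; NOT the actual SparkJobStats code.
final class InputBytesSketch {
    /** Value reported when the bytes read cannot be attributed to one load. */
    static final long UNKNOWN = -1L;

    /**
     * Returns the job-level "BytesRead" counter only when the job has a
     * single load. With several loads the counter is a total over all of
     * them, so it cannot be attributed to any one input.
     */
    static long bytesForInput(boolean singleInput, long jobBytesRead) {
        return singleInput ? jobBytesRead : UNKNOWN;
    }

    public static void main(String[] args) {
        // One load in the job: the counter belongs to that load.
        System.out.println(bytesForInput(true, 30L));
        // Three loads, as in Spark node scope-65: 49 is a total, so -1 is reported.
        System.out.println(bytesForInput(false, 49L));
    }
}
```

Under this assumption a test that expects 18 for the second input can only pass if the per-load byte count is tracked separately from the job-level total.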


Let's modify the code like this:
{code}

  // Since Tez has only one load per job, its values are correct.
  // The inputStats result for this assertion is also correct in Spark mode.
  if (!Util.isMapredExecType(cluster.getExecType())) {
    assertEquals(30, inputStats.get(0).getBytes());
  }

  // TODO PIG-5240: Fix TestPigRunner#simpleMultiQueryTest3 in Spark mode for wrong inputStats
  if (!Util.isMapredExecType(cluster.getExecT

[jira] [Commented] (PIG-5135) HDFS bytes read stats are always 0 in Spark mode

2017-05-18 Thread Adam Szita (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015412#comment-16015412
 ] 

Adam Szita commented on PIG-5135:
-

[~kellyzly] Thanks for catching this. It was probably caused by missing 
{{svn add}} calls during the original commit. I can see it's now fixed by 
committing the missing files.

> HDFS bytes read stats are always 0 in Spark mode
> 
>
> Key: PIG-5135
> URL: https://issues.apache.org/jira/browse/PIG-5135
> Project: Pig
>  Issue Type: Bug
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: Adam Szita
> Fix For: spark-branch
>
> Attachments: PIG-5135.0.patch, PIG-5135.1.patch, PIG-5135.2.patch
>
>
> I discovered this while running TestOrcStoragePushdown unit test in Spark 
> mode where the test depends on the value of this stat.





[jira] [Commented] (PIG-5135) HDFS bytes read stats are always 0 in Spark mode

2017-05-17 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015285#comment-16015285
 ] 

liyunzhang_intel commented on PIG-5135:
---

[~rohini]: there is a problem in the last check-in that causes the [Jenkins 
failure|https://builds.apache.org/job/Pig-spark/402/consoleFull]:
* 19463c9 - (HEAD, origin/spark, spark) PIG-5135: HDFS bytes read stats are 
always 0 in Spark mode (szita via rohini) (8 hours ago) 

Please recommit and see whether Jenkins passes.



[jira] [Commented] (PIG-5135) HDFS bytes read stats are always 0 in Spark mode

2017-05-17 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16014833#comment-16014833
 ] 

liyunzhang_intel commented on PIG-5135:
---

[~szita]: I will update the review board today according to the latest patch 
of PIG-5135 (latest branch code).



[jira] [Commented] (PIG-5135) HDFS bytes read stats are always 0 in Spark mode

2017-05-17 Thread Adam Szita (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16014831#comment-16014831
 ] 

Adam Szita commented on PIG-5135:
-

Thanks for the review, Rohini and Liyun.



[jira] [Commented] (PIG-5135) HDFS bytes read stats are always 0 in Spark mode

2017-05-15 Thread Adam Szita (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16010331#comment-16010331
 ] 

Adam Szita commented on PIG-5135:
-

[~rohini] Thanks for the review - I've attached [^PIG-5135.2.patch] with the 
fixes



[jira] [Commented] (PIG-5135) HDFS bytes read stats are always 0 in Spark mode

2017-05-14 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16009837#comment-16009837
 ] 

Rohini Palaniswamy commented on PIG-5135:
-

This patch is good to go. Just two minor comments:
   1) Rename isFiledSplits -> isFileSplits.
   2) Some cleanup unrelated to this patch, but good to do since the patch 
touches that code: get rid of the static activeSplit variable and the 
getActiveSplit method in PigInputFormat. They do not seem to be used anywhere, 
and static state is not a good idea. Also remove PigInputFormat.sJob, 
which has long been deprecated.
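
To illustrate the concern about static state, here is a minimal sketch. It does not reproduce PigInputFormat: the class {{StaticStateSketch}}, its static {{activeSplit}} field and the nested {{Reader}} class are hypothetical stand-ins, chosen only to show why one static slot shared by all readers in a JVM loses information, while instance state does not.

```java
// Hypothetical classes for illustration only; this is not Pig's PigInputFormat.
final class StaticStateSketch {
    /** Anti-pattern: a single slot shared by every reader in the JVM. */
    static String activeSplit;

    /** Alternative: each reader keeps its own split as instance state. */
    static final class Reader {
        private final String split;

        Reader(String split) {
            this.split = split;
        }

        String split() {
            return split;
        }
    }

    public static void main(String[] args) {
        // Two readers record their split in the static field one after another,
        // so the value written by the first reader is overwritten.
        activeSplit = "split-0";
        activeSplit = "split-1";
        System.out.println("static: " + activeSplit);

        // With instance state, both values remain available.
        Reader first = new Reader("split-0");
        Reader second = new Reader("split-1");
        System.out.println("instance: " + first.split() + ", " + second.split());
    }
}
```

The overwrite is shown sequentially here to keep the sketch deterministic; when several tasks share one JVM, the same overwrite can also happen concurrently, which makes it harder to spot.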

> HDFS bytes read stats are always 0 in Spark mode
> 
>
> Key: PIG-5135
> URL: https://issues.apache.org/jira/browse/PIG-5135
> Project: Pig
>  Issue Type: Bug
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: Adam Szita
> Fix For: spark-branch
>
> Attachments: PIG-5135.0.patch, PIG-5135.1.patch
>
>
> I discovered this while running TestOrcStoragePushdown unit test in Spark 
> mode where the test depends on the value of this stat.





[jira] [Commented] (PIG-5135) HDFS bytes read stats are always 0 in Spark mode

2017-04-20 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15978017#comment-15978017
 ] 

liyunzhang_intel commented on PIG-5135:
---

[~szita]: please include PIG-5135.1.patch in the total change on the [review 
board|https://reviews.apache.org/r/57317/diff/5-6/].



[jira] [Commented] (PIG-5135) HDFS bytes read stats are always 0 in Spark mode

2017-04-13 Thread Adam Szita (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967670#comment-15967670
 ] 

Adam Szita commented on PIG-5135:
-

[~kellyzly] I've checked this. It seems that {{assertEquals(30, 
inputStats.get(0).getBytes());}} is fine, but {{assertEquals(18, 
inputStats.get(1).getBytes());}} does not hold: Spark returns -1 here. The plan 
generated for Spark consists of 4 jobs, the last one being responsible for the 
replicated join. That job does 3 loads, and thus SparkPigStats reports its 
bytes read as -1. (Even after adding together the bytes from all load ops in 
this job, I got a different result than 18.) I guess compression of the 
generated tmp file parts is also at work here, further altering the number of 
bytes read.

I would say we should leave the exclusion for Spark as is, but update the 
comment, since we don't get the expected numbers for a different reason. 
What do you think?



[jira] [Commented] (PIG-5135) HDFS bytes read stats are always 0 in Spark mode

2017-04-12 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15966615#comment-15966615
 ] 

liyunzhang_intel commented on PIG-5135:
---

[~szita]: I see. Please create a jira board and add [~rohini] as a reviewer to 
help review the modification of PigInputFormat.java.

Also, remove some code in org.apache.pig.test.TestPigRunner#simpleMultiQueryTest3, 
since the HDFS bytes read stats are no longer always 0 in Spark mode:
{code}
// For mapreduce, since hdfs bytes read includes replicated tables bytes read is wrong
// Since Tez does has only one load per job its values are correct
// By pass the check for spark due to PIG-4788
if (!Util.isMapredExecType(cluster.getExecType()) && !Util.isSparkExecType(cluster.getExecType())) {
    assertEquals(30, inputStats.get(0).getBytes());
    assertEquals(18, inputStats.get(1).getBytes());
}
{code}



[jira] [Commented] (PIG-5135) HDFS bytes read stats are always 0 in Spark mode

2017-04-12 Thread Adam Szita (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15966093#comment-15966093
 ] 

Adam Szita commented on PIG-5135:
-

I've also attached a new patch with one missing method override that came up 
during e2e test execution.
