[jira] [Commented] (PIG-5135) HDFS bytes read stats are always 0 in Spark mode
[ https://issues.apache.org/jira/browse/PIG-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16024858#comment-16024858 ]

Adam Szita commented on PIG-5135:
---------------------------------

[~kellyzly] I just realized you already made this change in one of your patches in PIG-5215. Let's continue the discussion there.

> HDFS bytes read stats are always 0 in Spark mode
> ------------------------------------------------
>
> Key: PIG-5135
> URL: https://issues.apache.org/jira/browse/PIG-5135
> Project: Pig
> Issue Type: Bug
> Components: spark
> Reporter: liyunzhang_intel
> Assignee: Adam Szita
> Fix For: spark-branch
>
> Attachments: PIG-5135.0.patch, PIG-5135.1.patch, PIG-5135.2.patch, PIG-5135.smallfixes.patch
>
> I discovered this while running TestOrcStoragePushdown unit test in Spark mode where the test depends on the value of this stat.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (PIG-5135) HDFS bytes read stats are always 0 in Spark mode
[ https://issues.apache.org/jira/browse/PIG-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16024736#comment-16024736 ]

Adam Szita commented on PIG-5135:
---------------------------------

[~kellyzly]: fair point. I've made the suggested modifications in [^PIG-5135.smallfixes.patch]. This patch also includes a fix for a missing "else" keyword that currently causes the test to fail in MR mode. We could do that in a separate jira, but it's very minimal and I think it could go together with the other change.
[jira] [Commented] (PIG-5135) HDFS bytes read stats are always 0 in Spark mode
[ https://issues.apache.org/jira/browse/PIG-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16024218#comment-16024218 ]

liyunzhang_intel commented on PIG-5135:
---------------------------------------

[~szita]:
bq. I've checked this. It seems that assertEquals(30, inputStats.get(0).getBytes()); is fine, but assertEquals(18, inputStats.get(1).getBytes()); does not hold: Spark returns -1 here. The plan generated for Spark consists of 4 jobs, the last one being responsible for the replicated join. That job does 3 loads, and thus SparkPigStats handles this as -1. (Even after adding together the bytes from all load ops in this job I got a different result than 18.) I guess compression is also at work on the tmp file part generation, which further alters the number of bytes being read.

org.apache.pig.test.TestPigRunner#simpleMultiQueryTest3
{code}
#--------------------------------------------------
# Spark Plan
#--------------------------------------------------

Spark node scope-53
Store(hdfs://localhost:58892/tmp/temp-1660154197/tmp1818797386:org.apache.pig.impl.io.InterStorage) - scope-54
|
|---A: New For Each(false,false,false)[bag] - scope-10
    |   |
    |   Cast[int] - scope-2
    |   |
    |   |---Project[bytearray][0] - scope-1
    |   |
    |   Cast[int] - scope-5
    |   |
    |   |---Project[bytearray][1] - scope-4
    |   |
    |   Cast[int] - scope-8
    |   |
    |   |---Project[bytearray][2] - scope-7
    |
    |---A: Load(hdfs://localhost:58892/user/root/input:org.apache.pig.builtin.PigStorage) - scope-0

Spark node scope-55
Store(hdfs://localhost:58892/tmp/temp-1660154197/tmp-546700946:org.apache.pig.impl.io.InterStorage) - scope-56
|
|---C: Filter[bag] - scope-14
    |   |
    |   Less Than or Equal[boolean] - scope-17
    |   |
    |   |---Project[int][1] - scope-15
    |   |
    |   |---Constant(5) - scope-16
    |
    |---Load(hdfs://localhost:58892/tmp/temp-1660154197/tmp1818797386:org.apache.pig.impl.io.InterStorage) - scope-10

Spark node scope-57
C: Store(hdfs://localhost:58892/user/root/output:org.apache.pig.builtin.PigStorage) - scope-21
|
|---Load(hdfs://localhost:58892/tmp/temp-1660154197/tmp-546700946:org.apache.pig.impl.io.InterStorage) - scope-14

Spark node scope-65
D: Store(hdfs://localhost:58892/user/root/output2:org.apache.pig.builtin.PigStorage) - scope-52
|
|---D: FRJoinSpark[tuple] - scope-44
    |   |
    |   Project[int][0] - scope-41
    |   |
    |   Project[int][0] - scope-42
    |   |
    |   Project[int][0] - scope-43
    |
    |---Load(hdfs://localhost:58892/tmp/temp-1660154197/tmp-546700946:org.apache.pig.impl.io.InterStorage) - scope-58
    |
    |---BroadcastSpark - scope-63
    |   |
    |   |---B: Filter[bag] - scope-26
    |       |   |
    |       |   Equal To[boolean] - scope-29
    |       |   |
    |       |   |---Project[int][0] - scope-27
    |       |   |
    |       |   |---Constant(3) - scope-28
    |       |
    |       |---Load(hdfs://localhost:58892/tmp/temp-1660154197/tmp1818797386:org.apache.pig.impl.io.InterStorage) - scope-60
    |
    |---BroadcastSpark - scope-64
        |
        |---A1: New For Each(false,false,false)[bag] - scope-40
            |   |
            |   Cast[int] - scope-32
            |   |
            |   |---Project[bytearray][0] - scope-31
            |   |
            |   Cast[int] - scope-35
            |   |
            |   |---Project[bytearray][1] - scope-34
            |   |
            |   Cast[int] - scope-38
            |   |
            |   |---Project[bytearray][2] - scope-37
            |
            |---A1: Load(hdfs://localhost:58892/user/root/input2:org.apache.pig.builtin.PigStorage) - scope-30
{code}

assertEquals(30, inputStats.get(0).getBytes()) is correct in spark mode. assertEquals(18, inputStats.get(1).getBytes()) is wrong in spark mode, as there are 3 loads in {{Spark node scope-65}}. [{{stats.get("BytesRead")}}|https://github.com/apache/pig/blob/spark/src/org/apache/pig/tools/pigstats/spark/SparkJobStats.java#L93] returns 49 (I guess this is the sum of the three loads: {{input2}}, {{tmp1818797386}}, {{tmp-546700946}}). But the current [{{bytesRead}}|https://github.com/apache/pig/blob/spark/src/org/apache/pig/tools/pigstats/spark/SparkJobStats.java#L91] is -1 because [{{singleInput}}|https://github.com/apache/pig/blob/spark/src/org/apache/pig/tools/pigstats/spark/SparkJobStats.java#L92] is false.

Let's modify the code like
{code}
// Since Tez does has only one load per job its values are correct
// the result of inputStats in spark mode is also correct
if (!Util.isMapredExecType(cluster.getExecType())) {
    assertEquals(30, inputStats.get(0).getBytes());
}
//TODO PIG-5240:Fix TestPigRunner#simpleMultiQueryTest3 in spark mode for wrong inputStats
if (!Util.isMapredExecType(cluster.getExecT
[jira] [Commented] (PIG-5135) HDFS bytes read stats are always 0 in Spark mode
[ https://issues.apache.org/jira/browse/PIG-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015412#comment-16015412 ]

Adam Szita commented on PIG-5135:
---------------------------------

[~kellyzly] Thanks for catching this - it was probably caused by missing {{svn add}} calls during the original commit. I can see it's now fixed by committing the missing files.
[jira] [Commented] (PIG-5135) HDFS bytes read stats are always 0 in Spark mode
[ https://issues.apache.org/jira/browse/PIG-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015285#comment-16015285 ]

liyunzhang_intel commented on PIG-5135:
---------------------------------------

[~rohini]: there is a problem in the last check-in that causes the [jenkins failure|https://builds.apache.org/job/Pig-spark/402/consoleFull]:

* 19463c9 - (HEAD, origin/spark, spark) PIG-5135: HDFS bytes read stats are always 0 in Spark mode (szita via rohini) (8 hours ago)

Recommit and see whether jenkins passes or not.
[jira] [Commented] (PIG-5135) HDFS bytes read stats are always 0 in Spark mode
[ https://issues.apache.org/jira/browse/PIG-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16014833#comment-16014833 ]

liyunzhang_intel commented on PIG-5135:
---------------------------------------

[~szita]: I will update the review board today according to the latest patch of PIG-5135 (latest branch code).
[jira] [Commented] (PIG-5135) HDFS bytes read stats are always 0 in Spark mode
[ https://issues.apache.org/jira/browse/PIG-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16014831#comment-16014831 ]

Adam Szita commented on PIG-5135:
---------------------------------

Thanks for the review, Rohini and Liyun.
[jira] [Commented] (PIG-5135) HDFS bytes read stats are always 0 in Spark mode
[ https://issues.apache.org/jira/browse/PIG-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16010331#comment-16010331 ]

Adam Szita commented on PIG-5135:
---------------------------------

[~rohini] Thanks for the review - I've attached [^PIG-5135.2.patch] with the fixes.
[jira] [Commented] (PIG-5135) HDFS bytes read stats are always 0 in Spark mode
[ https://issues.apache.org/jira/browse/PIG-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16009837#comment-16009837 ]

Rohini Palaniswamy commented on PIG-5135:
-----------------------------------------

This patch is good to go. Just two minor comments:
1) isFiledSplits -> isFileSplits
2) Some cleanup unrelated to this patch, but good to do as it touches that code: get rid of the static activeSplit variable and the getActiveSplit method in PigInputFormat. They do not seem to be used anywhere, and it is not a good idea to have static state. Also remove PigInputFormat.sJob, which has been deprecated for a long time.
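The concern about static state in the comment above can be illustrated with a small sketch. This is not Pig code: `StaticStateSketch`, `StaticHolder` and `InstanceHolder` are hypothetical names made up for the illustration, and only the field name `activeSplit` echoes the comment. It shows how a static field is silently shared by every instance in the JVM, while an instance field is not.

```java
/** Hypothetical illustration only; these classes are not part of Pig. */
public class StaticStateSketch {

    /** Keeps the "active split" in a static field: every instance sees the same value. */
    public static class StaticHolder {
        private static String activeSplit;

        public void open(String split) { activeSplit = split; }
        public String current() { return activeSplit; }
    }

    /** Keeps the "active split" in an instance field: each instance has its own value. */
    public static class InstanceHolder {
        private String activeSplit;

        public void open(String split) { activeSplit = split; }
        public String current() { return activeSplit; }
    }

    public static void main(String[] args) {
        StaticHolder a = new StaticHolder();
        StaticHolder b = new StaticHolder();
        a.open("split-0");
        b.open("split-1");                  // overwrites what 'a' stored
        System.out.println(a.current());    // prints "split-1"

        InstanceHolder c = new InstanceHolder();
        InstanceHolder d = new InstanceHolder();
        c.open("split-0");
        d.open("split-1");                  // does not affect 'c'
        System.out.println(c.current());    // prints "split-0"
    }
}
```

With several readers in one JVM, the static variant makes the reported value depend on which reader wrote last, which is presumably why removing such a field is a safe cleanup when nothing reads it.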
[jira] [Commented] (PIG-5135) HDFS bytes read stats are always 0 in Spark mode
[ https://issues.apache.org/jira/browse/PIG-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15978017#comment-15978017 ]

liyunzhang_intel commented on PIG-5135:
---------------------------------------

[~szita]: include PIG-5135.1.patch in the total change of [review board|https://reviews.apache.org/r/57317/diff/5-6/].
[jira] [Commented] (PIG-5135) HDFS bytes read stats are always 0 in Spark mode
[ https://issues.apache.org/jira/browse/PIG-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967670#comment-15967670 ]

Adam Szita commented on PIG-5135:
---------------------------------

[~kellyzly] I've checked this. It seems that {{assertEquals(30, inputStats.get(0).getBytes());}} is fine, but {{assertEquals(18, inputStats.get(1).getBytes());}} does not hold: Spark returns -1 here. The plan generated for Spark consists of 4 jobs, the last one being responsible for the replicated join. That job does 3 loads, and thus SparkPigStats handles this as -1. (Even after adding together the bytes from all load ops in this job I got a different result than 18.) I guess compression is also at work on the tmp file part generation, which further alters the number of bytes being read.

I would say we should leave the exclusion for Spark as is, but update the comment, since we don't get the expected numbers for a different reason. What do you think?
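The behaviour described in this comment can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual SparkJobStats code: the class `BytesReadSketch`, the method `bytesReadForInput` and the constant `UNKNOWN` are hypothetical names. Only the idea comes from the thread: a job exposes a single bytes-read counter, so the value can be attributed to an input only when the job has exactly one load, and -1 is reported otherwise.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; names are illustrative and not taken from the Pig code base. */
public class BytesReadSketch {

    /** Value reported when bytes read cannot be attributed to a single input. */
    public static final long UNKNOWN = -1L;

    /**
     * @param loadsInJob        locations loaded by one Spark job
     * @param jobLevelBytesRead the single bytes-read counter available for the whole job
     * @return the counter if the job has exactly one load, otherwise UNKNOWN
     */
    public static long bytesReadForInput(List<String> loadsInJob, long jobLevelBytesRead) {
        boolean singleInput = loadsInJob.size() == 1;
        return singleInput ? jobLevelBytesRead : UNKNOWN;
    }

    public static void main(String[] args) {
        // One load in the job: the job-level counter belongs to that input.
        System.out.println(bytesReadForInput(Arrays.asList("input"), 30L));   // prints 30

        // Three loads in the job (as in the replicated join): no per-input value.
        System.out.println(bytesReadForInput(
                Arrays.asList("input2", "tmp-a", "tmp-b"), 49L));             // prints -1
    }
}
```

Under this reading, an assertion on a per-input byte count can only hold in Spark mode for jobs with a single load, which is consistent with the first assertion passing and the second one seeing -1.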
[jira] [Commented] (PIG-5135) HDFS bytes read stats are always 0 in Spark mode
[ https://issues.apache.org/jira/browse/PIG-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15966615#comment-15966615 ]

liyunzhang_intel commented on PIG-5135:
---------------------------------------

[~szita]: I see. Please create a jira board and add [~rohini] as a reviewer to help review the modification of PigInputFormat.java. Also remove some code in org.apache.pig.test.TestPigRunner#simpleMultiQueryTest3, as the hdfs bytes read stats are no longer always 0 in spark mode:
{code}
// For mapreduce, since hdfs bytes read includes replicated tables bytes read is wrong
// Since Tez does has only one load per job its values are correct
// By pass the check for spark due to PIG-4788
if (!Util.isMapredExecType(cluster.getExecType()) && !Util.isSparkExecType(cluster.getExecType())) {
    assertEquals(30, inputStats.get(0).getBytes());
    assertEquals(18, inputStats.get(1).getBytes());
}
{code}
[jira] [Commented] (PIG-5135) HDFS bytes read stats are always 0 in Spark mode
[ https://issues.apache.org/jira/browse/PIG-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15966093#comment-15966093 ]

Adam Szita commented on PIG-5135:
---------------------------------

Also attached a new patch with one missing method override that came up during e2e test execution.