[jira] [Commented] (PIG-4634) Fix records count issues in output statistics
[ https://issues.apache.org/jira/browse/PIG-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982091#comment-14982091 ]

Xianda Ke commented on PIG-4634:

Hi [~mohitsabharwal],
Thank you for your comments. The code readability nits are fixed (https://reviews.apache.org/r/37627/diff/5-6/).
Thanks a lot!

> Fix records count issues in output statistics
>
> Key: PIG-4634
> URL: https://issues.apache.org/jira/browse/PIG-4634
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Reporter: Xianda Ke
> Assignee: Xianda Ke
> Fix For: spark-branch
>
> Attachments: PIG-4634-3.patch, PIG-4634-4.patch, PIG-4634-5.patch, PIG-4634-6.patch, PIG-4634.patch, PIG-4634_2.patch
>
> Test cases simpleTest() and simpleTest2() in TestPigRunner failed, caused by the following issues:
> 1. pig context in SparkPigStats isn't initialized.
> 2. the records count logic hasn't been implemented.
> 3. getOutputAlias(), getPigProperties(), getBytesWritten() and getRecordWritten() have not been implemented.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4634) Fix records count issues in output statistics
[ https://issues.apache.org/jira/browse/PIG-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982093#comment-14982093 ]

Xianda Ke commented on PIG-4634:

The latest patch, PIG-4634-6.patch, is attached.
[jira] [Commented] (PIG-4634) Fix records count issues in output statistics
[ https://issues.apache.org/jira/browse/PIG-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982908#comment-14982908 ]

Mohit Sabharwal commented on PIG-4634:

Thanks, [~kexianda]!
+1 (non-binding)
[jira] [Commented] (PIG-4634) Fix records count issues in output statistics
[ https://issues.apache.org/jira/browse/PIG-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14975606#comment-14975606 ]

Mohit Sabharwal commented on PIG-4634:

Thanks, [~xianda]. I had a couple of code readability nits on RB. Otherwise LGTM.
[jira] [Commented] (PIG-4634) Fix records count issues in output statistics
[ https://issues.apache.org/jira/browse/PIG-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14948325#comment-14948325 ]

Xianda Ke commented on PIG-4634:

Hi Mohit,
PIG-4634-5.patch is attached.
RB: https://reviews.apache.org/r/37627/diff/5/
[jira] [Commented] (PIG-4634) Fix records count issues in output statistics
[ https://issues.apache.org/jira/browse/PIG-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14946839#comment-14946839 ]

Xianda Ke commented on PIG-4634:

Hi Mohit, please try [it | https://reviews.apache.org/r/37627/diff/3/] again.
[jira] [Commented] (PIG-4634) Fix records count issues in output statistics
[ https://issues.apache.org/jira/browse/PIG-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14933705#comment-14933705 ]

Xianda Ke commented on PIG-4634:

Thanks. I have replied to you on RB.
[jira] [Commented] (PIG-4634) Fix records count issues in output statistics
[ https://issues.apache.org/jira/browse/PIG-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731603#comment-14731603 ]

Mohit Sabharwal commented on PIG-4634:

Thanks, [~kexianda]. I left some comments on RB.
[jira] [Commented] (PIG-4634) Fix records count issues in output statistics
[ https://issues.apache.org/jira/browse/PIG-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14721892#comment-14721892 ]

Xianda Ke commented on PIG-4634:

Hi [~mohitsabharwal],
I have created the RB ([RB37627 | https://reviews.apache.org/r/37627/]).
Thanks.
[jira] [Commented] (PIG-4634) Fix records count issues in output statistics
[ https://issues.apache.org/jira/browse/PIG-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720863#comment-14720863 ]

Mohit Sabharwal commented on PIG-4634:

[~kexianda], could you please create an RB request for this?
[jira] [Commented] (PIG-4634) Fix records count issues in output statistics
[ https://issues.apache.org/jira/browse/PIG-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14693038#comment-14693038 ]

kexianda commented on PIG-4634:

Hi [~mohitsabharwal] [~xuefuz],
PIG-4634-3.patch is attached. Would you please help review the code?

1. Implement the records count logic using SparkCounter:
   (a) SparkPigStatusReporter.java: a singleton factory for obtaining Spark counters.
   (b) Create a new SparkCounter in StoreConverter.convert() and increment it in FromTupleFunction. The key of the store operator is appended to the counter name (in SparkStatsUtil.getStoreSparkCounterName()) to avoid counter name conflicts when output files have the same short name (say, /tmp1/output and /tmp2/output).
2. Some slight changes/fixes:
   (a) Set pigContext when initializing SparkPigStats.
   (b) getOutputAlias() in spark mode.

How to test: run TestPigRunner.simpleTest().
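The counter-naming idea in item 1(b) of the comment above can be illustrated with a minimal, single-JVM sketch. It is not the code from the patch: the class StoreCounters, its methods, and the operator keys used below ("scope-10", "scope-20") are hypothetical, and a real Spark counter would be aggregated across executors rather than kept in a local map.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for per-store record counters. None of these names
// come from the patch; they only illustrate the naming scheme.
final class StoreCounters {
    private final Map<String, Long> counts = new HashMap<>();

    // Builds a counter name from the output's short name plus the store
    // operator key, so that two outputs sharing a short name stay distinct.
    static String counterName(String outputPath, String operatorKey) {
        // lastIndexOf returns -1 when there is no '/', which keeps the whole path.
        String shortName = outputPath.substring(outputPath.lastIndexOf('/') + 1);
        return shortName + "_" + operatorKey;
    }

    // Called once per stored tuple in this sketch.
    void increment(String name) {
        counts.merge(name, 1L, Long::sum);
    }

    // Returns 0 for a counter that was never incremented.
    long get(String name) {
        return counts.getOrDefault(name, 0L);
    }
}
```

With only the short name, /tmp1/output and /tmp2/output would both map to a counter called "output"; appending the operator key keeps the two record counts separate.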
[jira] [Commented] (PIG-4634) Fix records count issues in output statistics
[ https://issues.apache.org/jira/browse/PIG-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14643721#comment-14643721 ]

kexianda commented on PIG-4634:

Hi [~xuefuz],
While investigating the input statistics issue, I found that this solution does not seem good. In MR mode, Counters are used for output/input statistics. I will investigate this and provide a new patch for spark mode.
-Xianda
[jira] [Commented] (PIG-4634) Fix records count issues in output statistics
[ https://issues.apache.org/jira/browse/PIG-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14638486#comment-14638486 ]

kexianda commented on PIG-4634:

Hi [~xuefuz],
The latest patch (PIG-4634_2.patch) is attached. Please help review the code. Thanks.

1. I threw the exception simply to follow the behavior of SparkJobStats.collectStats(). I agree with you that it is too harsh. The code now logs a message instead of throwing an exception. The exception is also removed from SparkJobStats.collectStats().
2. From my understanding, we should aggregate only the OutputMetrics of the Result stages and ignore the OutputMetrics of the ShuffleMap stages. The JobMetricsListener just collects the taskMetrics of all stages, so we don't know whether a given taskMetrics comes from a ShuffleMap stage or a Result stage. There is a tricky detail: the OutputMetrics is null in a ShuffleMapTask, and the previous patch relied on that. It is not intuitive and is hard to maintain. To improve robustness, the IDs of the Result stages are now collected in a set in JobMetricsListener.java. Then, in getRecordCount(), only the Result stages' OutputMetrics are aggregated; the OutputMetrics of ShuffleMap stages are ignored.
3. How to test this? Here are the two test cases:
   TestPigRunner.simpleTest()
   TestPigRunner.simpleTest2()
   ant -Dhadoopversion=23 -Dexectype=spark -Dtestcase=TestPigRunner test

Regards,
Xianda
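The Result-stage filtering and the log-instead-of-throw behavior described in the comment above can be sketched roughly as follows. This is a simplified model and not the patch itself: StageOutputTracker and its methods are hypothetical, it does not use the real Spark listener API, and returning -1 when no metrics were collected is an assumption of this sketch (the comment only says that a message is logged instead of an exception being thrown).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.logging.Logger;

// Hypothetical, simplified model of the approach; it stands in for the
// patch's JobMetricsListener without using any Spark classes.
final class StageOutputTracker {
    private static final Logger LOG =
            Logger.getLogger(StageOutputTracker.class.getName());

    // IDs of Result stages, recorded explicitly instead of inferring the
    // stage type from a null OutputMetrics.
    private final Set<Integer> resultStageIds = new HashSet<>();
    private final Map<Integer, Long> recordsWrittenByStage = new HashMap<>();

    // Records the metrics of every completed stage, as the listener does,
    // and remembers which stages are Result stages.
    void onStageCompleted(int stageId, boolean isResultStage, long recordsWritten) {
        if (isResultStage) {
            resultStageIds.add(stageId);
        }
        recordsWrittenByStage.merge(stageId, recordsWritten, Long::sum);
    }

    // Sums records written by Result stages only; ShuffleMap stages are ignored.
    long getRecordCount() {
        if (recordsWrittenByStage.isEmpty()) {
            // Assumption: log and report "unknown" rather than fail the job.
            LOG.warning("No stage metrics collected; record count is unknown");
            return -1L;
        }
        long total = 0L;
        for (int stageId : resultStageIds) {
            total += recordsWrittenByStage.getOrDefault(stageId, 0L);
        }
        return total;
    }
}
```

Keeping the Result stage IDs in an explicit set makes the filter independent of whether a ShuffleMap task happens to report a null OutputMetrics.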
[jira] [Commented] (PIG-4634) Fix records count issues in output statistics
[ https://issues.apache.org/jira/browse/PIG-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634479#comment-14634479 ]

Xuefu Zhang commented on PIG-4634:

Patch looks good. Three questions:
1. In getRecordCount(), we throw a runtime exception if the job metrics is null. This might fail the job. Is that too harsh?
2. We are aggregating records in different stages. Then what does the record count mean exactly?
3. Is there a way to test this?