[jira] [Commented] (PIG-4634) Fix records count issues in output statistics

2015-10-30 Thread Xianda Ke (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982091#comment-14982091
 ] 

Xianda Ke commented on PIG-4634:


Hi [~mohitsabharwal], thank you for your comments. The code readability nits 
are fixed (https://reviews.apache.org/r/37627/diff/5-6/). Thanks a lot!



> Fix records count issues in output statistics
> -
>
> Key: PIG-4634
> URL: https://issues.apache.org/jira/browse/PIG-4634
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Reporter: Xianda Ke
> Assignee: Xianda Ke
> Fix For: spark-branch
>
> Attachments: PIG-4634-3.patch, PIG-4634-4.patch, PIG-4634-5.patch, 
> PIG-4634-6.patch, PIG-4634.patch, PIG-4634_2.patch
>
>
> Test cases simpleTest() and simpleTest2() in TestPigRunner failed, caused by 
> the following issues:
> 1. The Pig context in SparkPigStats isn't initialized.
> 2. The records count logic hasn't been implemented.
> 3. getOutputAlias(), getPigProperties(), getBytesWritten() and 
> getRecordWritten() have not been implemented.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4634) Fix records count issues in output statistics

2015-10-30 Thread Xianda Ke (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982093#comment-14982093
 ] 

Xianda Ke commented on PIG-4634:


The latest patch, PIG-4634-6.patch, is attached.



[jira] [Commented] (PIG-4634) Fix records count issues in output statistics

2015-10-30 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982908#comment-14982908
 ] 

Mohit Sabharwal commented on PIG-4634:
--

Thanks, [~kexianda]! 

+1 (non-binding)



[jira] [Commented] (PIG-4634) Fix records count issues in output statistics

2015-10-26 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14975606#comment-14975606
 ] 

Mohit Sabharwal commented on PIG-4634:
--

Thanks, [~xianda]. I had a couple of code readability nits on RB. Otherwise LGTM.



[jira] [Commented] (PIG-4634) Fix records count issues in output statistics

2015-10-08 Thread Xianda Ke (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14948325#comment-14948325
 ] 

Xianda Ke commented on PIG-4634:


Hi Mohit, PIG-4634-5.patch is attached. RB: 
https://reviews.apache.org/r/37627/diff/5/



[jira] [Commented] (PIG-4634) Fix records count issues in output statistics

2015-10-07 Thread Xianda Ke (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14946839#comment-14946839
 ] 

Xianda Ke commented on PIG-4634:


Hi Mohit, please try [it | https://reviews.apache.org/r/37627/diff/3/] again. 



[jira] [Commented] (PIG-4634) Fix records count issues in output statistics

2015-09-28 Thread Xianda Ke (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14933705#comment-14933705
 ] 

Xianda Ke commented on PIG-4634:


Thanks. I have replied to you on RB.



[jira] [Commented] (PIG-4634) Fix records count issues in output statistics

2015-09-04 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731603#comment-14731603
 ] 

Mohit Sabharwal commented on PIG-4634:
--

Thanks, [~kexianda], left some comments on RB.



[jira] [Commented] (PIG-4634) Fix records count issues in output statistics

2015-08-30 Thread Xianda Ke (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14721892#comment-14721892
 ] 

Xianda Ke commented on PIG-4634:


Hi [~mohitsabharwal], I have created the RB ([RB37627 | 
https://reviews.apache.org/r/37627/]). Thanks.



[jira] [Commented] (PIG-4634) Fix records count issues in output statistics

2015-08-28 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720863#comment-14720863
 ] 

Mohit Sabharwal commented on PIG-4634:
--

[~kexianda], could you create an RB request for this, please?



[jira] [Commented] (PIG-4634) Fix records count issues in output statistics

2015-08-12 Thread kexianda (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14693038#comment-14693038
 ] 

kexianda commented on PIG-4634:
---

Hi [~mohitsabharwal], [~xuefuz],
PIG-4634-3.patch is attached. Would you please help review the code?

1. Implement the records count logic using SparkCounter:
(a) SparkPigStatusReporter.java: a singleton factory for obtaining Spark counters.
(b) Create a new SparkCounter in StoreConverter.convert(), and increment the 
counter in FromTupleFunction.
We append the key of the store operator to the counter name (in 
SparkStatsUtil.getStoreSparkCounterName()) to avoid counter name conflicts 
when output files have the same short name (say, /tmp1/output and /tmp2/output).

2. Some minor changes/fixes:
(a) Set pigContext when initializing SparkPigStats.
(b) Implement getOutputAlias() in Spark mode.

How to test:
Run TestPigRunner.simpleTest()



[jira] [Commented] (PIG-4634) Fix records count issues in output statistics

2015-07-27 Thread kexianda (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14643721#comment-14643721
 ] 

kexianda commented on PIG-4634:
---

Hi [~xuefuz], while investigating the input statistics issue, I found that this 
solution does not seem good.
In MR mode, Counters are used for output/input statistics. I will investigate 
this and provide a new patch for Spark mode.
  -Xianda



[jira] [Commented] (PIG-4634) Fix records count issues in output statistics

2015-07-23 Thread kexianda (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14638486#comment-14638486
 ] 

kexianda commented on PIG-4634:
---

Hi [~xuefuz],

The latest patch (PIG-4634_2.patch) is attached. Please help review the code. Thanks.

1. I threw the exception simply to follow the behavior of 
SparkJobStats.collectStats(). I agree with you that it is too harsh. The code now 
logs a message instead of throwing an exception. The exception is also removed from 
SparkJobStats.collectStats().

2. From my understanding, we should aggregate only the OutputMetrics of the 
Result stages and ignore the OutputMetrics of the ShuffleMap stages. The 
JobMetricsListener just collects all the taskMetrics of the stages, so we don't 
know whether a given taskMetrics comes from a ShuffleMap stage or a Result stage. 
There is a tricky detail: the OutputMetrics is null in a ShuffleMapTask, and 
the previous patch relied on it. That is not intuitive and is hard to 
maintain.
To improve robustness, the IDs of the Result stages are now collected in a set in 
JobMetricsListener.java. Then, in getRecordCount(), only the Result stages' 
OutputMetrics are aggregated; the OutputMetrics of the ShuffleMap stages are 
ignored.

3. How to test this?
Here are the two cases for testing this:
TestPigRunner.simpleTest()
TestPigRunner.simpleTest2() 
ant -Dhadoopversion=23 -Dexectype=spark -Dtestcase=TestPigRunner test
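
Points 1 and 2 above can be sketched roughly as follows, in plain Java with no Spark dependency. The class ResultStageCountSketch, the TaskMetric type, the callback names and the -1 "unknown" return value are all assumptions made for illustration; the actual JobMetricsListener works with Spark's listener events and TaskMetrics, which are not reproduced here.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.logging.Logger;

// Hypothetical stand-in for the listener plus the record-count logic.
class ResultStageCountSketch {
    private static final Logger LOG = Logger.getLogger("ResultStageCountSketch");

    // One task's metrics; recordsWritten is null when the task reported no output.
    static final class TaskMetric {
        final int stageId;
        final Long recordsWritten;

        TaskMetric(int stageId, Long recordsWritten) {
            this.stageId = stageId;
            this.recordsWritten = recordsWritten;
        }
    }

    private final Set<Integer> resultStageIds = new HashSet<>();
    private final List<TaskMetric> taskMetrics = new ArrayList<>();

    // Remember which stages are Result stages, instead of inferring it
    // from a null output metric.
    void onResultStage(int stageId) {
        resultStageIds.add(stageId);
    }

    void onTaskEnd(TaskMetric metric) {
        taskMetrics.add(metric);
    }

    // Sums records written by Result stages only. When no metrics were
    // collected, logs a message and returns -1 (an assumed sentinel)
    // rather than throwing and failing the job.
    long getRecordCount() {
        if (taskMetrics.isEmpty()) {
            LOG.info("No job metrics collected; record count is unknown");
            return -1L;
        }
        long total = 0L;
        for (TaskMetric m : taskMetrics) {
            if (resultStageIds.contains(m.stageId) && m.recordsWritten != null) {
                total += m.recordsWritten;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        ResultStageCountSketch listener = new ResultStageCountSketch();
        listener.onResultStage(2);
        listener.onTaskEnd(new TaskMetric(1, null)); // ShuffleMap stage task
        listener.onTaskEnd(new TaskMetric(2, 7L));   // Result stage task
        listener.onTaskEnd(new TaskMetric(2, 3L));   // Result stage task
        System.out.println("records written: " + listener.getRecordCount());
    }
}
```

Filtering on an explicit set of stage IDs means the count stays correct even if a ShuffleMap task were ever to report a non-null output metric.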

Regards, 
Xianda



[jira] [Commented] (PIG-4634) Fix records count issues in output statistics

2015-07-20 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634479#comment-14634479
 ] 

Xuefu Zhang commented on PIG-4634:
--

Patch looks good. A few questions:

1. In getRecordCount(), we throw a runtime exception if the job metrics are null. 
This might fail the job. Is that too harsh?
2. We are aggregating records across different stages. What, then, does the record 
count mean exactly?
3. Is there a way to test this?
