[jira] [Updated] (PIG-3634) Improve performance of order-by

2013-12-27 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3634:


Attachment: (was: PIG-3634-1.patch)

> Improve performance of order-by
> ---
>
> Key: PIG-3634
> URL: https://issues.apache.org/jira/browse/PIG-3634
> Project: Pig
>  Issue Type: Sub-task
>  Components: tez
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: tez-branch
>
> Attachments: PIG-3634-0.patch, PIG-3634-1.patch
>
>
> This is a followup for PIG-3534. In PIG-3534, we use 5 vertexes (3 DAGs) to 
> implement an order-by. We can optimize to use 4 vertexes in 1 DAG:
> vertex 1: close the current vertex, create input + samples input
> vertex 2: aggregate samples to create quantiles
> vertex 3: use quantiles to partition input
> vertex 4: sort input after partition
> The DAG is:
> {code}
> vertex 1   -->  vertex 3 --> vertex 4
>\--> vertex 2 ---/
> {code}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (PIG-3634) Improve performance of order-by

2013-12-27 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3634:


Attachment: PIG-3634-1.patch

> Improve performance of order-by
> ---
>
> Key: PIG-3634
> URL: https://issues.apache.org/jira/browse/PIG-3634
> Project: Pig
>  Issue Type: Sub-task
>  Components: tez
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: tez-branch
>
> Attachments: PIG-3634-0.patch, PIG-3634-1.patch
>
>
> This is a followup for PIG-3534. In PIG-3534, we use 5 vertexes (3 DAGs) to 
> implement an order-by. We can optimize to use 4 vertexes in 1 DAG:
> vertex 1: close the current vertex, create input + samples input
> vertex 2: aggregate samples to create quantiles
> vertex 3: use quantiles to partition input
> vertex 4: sort input after partition
> The DAG is:
> {code}
> vertex 1   -->  vertex 3 --> vertex 4
>\--> vertex 2 ---/
> {code}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (PIG-3634) Improve performance of order-by

2013-12-27 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3634:


Attachment: PIG-3634-1.patch

Operators_4 fixed.

> Improve performance of order-by
> ---
>
> Key: PIG-3634
> URL: https://issues.apache.org/jira/browse/PIG-3634
> Project: Pig
>  Issue Type: Sub-task
>  Components: tez
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: tez-branch
>
> Attachments: PIG-3634-0.patch, PIG-3634-1.patch
>
>
> This is a followup for PIG-3534. In PIG-3534, we use 5 vertexes (3 DAGs) to 
> implement an order-by. We can optimize to use 4 vertexes in 1 DAG:
> vertex 1: close the current vertex, create input + samples input
> vertex 2: aggregate samples to create quantiles
> vertex 3: use quantiles to partition input
> vertex 4: sort input after partition
> The DAG is:
> {code}
> vertex 1   -->  vertex 3 --> vertex 4
>\--> vertex 2 ---/
> {code}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (PIG-3608) ClassCastException when looking up a value from AvroMapWrapper using a Utf8 key

2013-12-27 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-3608:
--

   Resolution: Fixed
Fix Version/s: 0.13.0
 Release Note: 
Committed to trunk.

 Hadoop Flags: Reviewed
   Status: Resolved  (was: Patch Available)

> ClassCastException when looking up a value from AvroMapWrapper using a Utf8 
> key
> ---
>
> Key: PIG-3608
> URL: https://issues.apache.org/jira/browse/PIG-3608
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.12.0
>Reporter: Richard Ding
>Assignee: Richard Ding
>Priority: Minor
> Fix For: 0.13.0
>
> Attachments: PIG-3608.patch, PIG-3608_2.patch
>
>
> One got the following exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.util.Utf8 incompatible with 
> java.lang.String 
> at 
> org.apache.pig.impl.util.avro.AvroMapWrapper.get(AvroMapWrapper.java:80)
> {code}
> This is related to the change by PIG-3420.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (PIG-3643) Nested Foreach with UDF and bincond is broken

2013-12-27 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3643:
---

Status: Patch Available  (was: Open)

> Nested Foreach with UDF and bincond is broken
> -
>
> Key: PIG-3643
> URL: https://issues.apache.org/jira/browse/PIG-3643
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.13.0
>Reporter: Rohini Palaniswamy
>Assignee: Cheolsoo Park
> Attachments: PIG-3643-1.patch
>
>
> Was checking out PIG-3000. 
> A = load 'data' as (a:chararray);
> B = foreach A { c = UPPER(a); generate ((c eq 'TEST') ? 1 : 0), ((c eq 'DEV') 
> ? 1 : 0); }
> This now throws "Invalid field projection. Projected field [c] does not exist 
> in schema".  Works fine in 0.11. Broken in trunk. Haven't checked 0.12. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (PIG-3581) Incorrect scope resolution with nested foreach

2013-12-27 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13857896#comment-13857896
 ] 

Cheolsoo Park commented on PIG-3581:


[~aniket486], I put up a fix in PIG-3643.

> Incorrect scope resolution with nested foreach
> --
>
> Key: PIG-3581
> URL: https://issues.apache.org/jira/browse/PIG-3581
> Project: Pig
>  Issue Type: Bug
>Reporter: Venu Satuluri
>Assignee: Aniket Mokashi
> Fix For: 0.13.0
>
> Attachments: PIG-3581-1.patch, PIG-3581-2.patch
>
>
> Consider the following script:
> {code}
> A = LOAD 'test_data' AS (a: int, b: int);
> C = FOREACH A GENERATE *;
> B = FOREACH (GROUP A BY a) {
>   C = FILTER A BY b % 2 == 0;
>   D = FILTER A BY b % 2 == 1;
>   GENERATE group AS a, A.b AS every, C.b AS even, D.b AS odd;
> };
> DESCRIBE B;
> {code}
> Notice that C is defined both inside the nested foreach as well as outside. I 
> would expect that in the GENERATE inside the nested FOREACH, the C that is 
> used will be the one that is defined inside. If that is not so, I think at 
> least a warning is due.
> However, currently Pig silently assumes that the C you mean one is the one 
> that is defined *outside* the nested FOREACH.
> Hence, the result of "DESCRIBE B" looks as follows:
> {code}
> B: {
> a: int,
> every: {
> (
> b: int
> )
> },
> even: int,
> odd: {
> (
> b: int
> )
> }
> }
> {code}
> If I remove the definition of C that is outside the foreach, then I get the 
> following for "DESCRIBE B":
> {code}
> B: {
> a: int,
> every: {
> (
> b: int
> )
> },
> even: {
> (
> b: int
> )
> },
> odd: {
> (
> b: int
> )
> }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (PIG-3643) Nested Foreach with UDF and bincond is broken

2013-12-27 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3643:
---

Attachment: PIG-3643-1.patch

Uploading a patch that fixes the regression.

To summarize the issue-
# Before PIG-3581:
{code}
if X, do a
else if Y, do b
else do c
{code}
# With PIG-3581:
{code}
If Y && N, do b
else if X, do a
else do c
{code}
As can be seen, if Y && !N, it falls into c now while we used to fall into b.
# With my patch:
{code}
If Y && N, do b
else if X, do a
else if Y, do b
else do c
{code}

I verified-
* TestColumnAliasConversion passes.
* TestLogicalPlanGenerator passes.
* The new test cases added by PIG-3581 pass.
* The example in this jira wrks.

> Nested Foreach with UDF and bincond is broken
> -
>
> Key: PIG-3643
> URL: https://issues.apache.org/jira/browse/PIG-3643
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.13.0
>Reporter: Rohini Palaniswamy
>Assignee: Cheolsoo Park
> Attachments: PIG-3643-1.patch
>
>
> Was checking out PIG-3000. 
> A = load 'data' as (a:chararray);
> B = foreach A { c = UPPER(a); generate ((c eq 'TEST') ? 1 : 0), ((c eq 'DEV') 
> ? 1 : 0); }
> This now throws "Invalid field projection. Projected field [c] does not exist 
> in schema".  Works fine in 0.11. Broken in trunk. Haven't checked 0.12. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (PIG-3634) Improve performance of order-by

2013-12-27 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3634:


Attachment: PIG-3634-0.patch

Attach partial patch. Operators_4 is still failing, I will attach a new patch 
once fixed.

> Improve performance of order-by
> ---
>
> Key: PIG-3634
> URL: https://issues.apache.org/jira/browse/PIG-3634
> Project: Pig
>  Issue Type: Sub-task
>  Components: tez
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: tez-branch
>
> Attachments: PIG-3634-0.patch
>
>
> This is a followup for PIG-3534. In PIG-3534, we use 5 vertexes (3 DAGs) to 
> implement an order-by. We can optimize to use 4 vertexes in 1 DAG:
> vertex 1: close the current vertex, create input + samples input
> vertex 2: aggregate samples to create quantiles
> vertex 3: use quantiles to partition input
> vertex 4: sort input after partition
> The DAG is:
> {code}
> vertex 1   -->  vertex 3 --> vertex 4
>\--> vertex 2 ---/
> {code}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] Subscription: PIG patch available

2013-12-27 Thread jira
Issue Subscription
Filter: PIG patch available (6 issues)

Subscriber: pigdaily

Key Summary
PIG-3635Fix e2e tests for Hadoop 2.X on Windows
https://issues.apache.org/jira/browse/PIG-3635
PIG-3608ClassCastException when looking up a value from AvroMapWrapper 
using a Utf8 key
https://issues.apache.org/jira/browse/PIG-3608
PIG-3573Provide StoreFunc and LoadFunc for Accumulo
https://issues.apache.org/jira/browse/PIG-3573
PIG-3453Implement a Storm backend to Pig
https://issues.apache.org/jira/browse/PIG-3453
PIG-3441Allow Pig to use default resources from Configuration objects
https://issues.apache.org/jira/browse/PIG-3441
PIG-3347Store invocation brings side effect
https://issues.apache.org/jira/browse/PIG-3347

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=13225&filterId=12322384


[jira] [Updated] (PIG-3645) Move FileLocalizer.setR() calls to unit tests

2013-12-27 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3645:
---

Attachment: PIG-3645-4.patch

As per discussion with Rohini, moved FiliLocalizer.setR() to setup. This 
minimizes the number of gold files to be modified to 11. It seems inevitable to 
regenerate these 11 files because I am re-enabling testSim4 that used to be 
omitted, and that introduces some randomness.

I will kick off unit tests again with this patch. Thanks!

> Move FileLocalizer.setR() calls to unit tests
> -
>
> Key: PIG-3645
> URL: https://issues.apache.org/jira/browse/PIG-3645
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>Priority: Minor
> Fix For: 0.13.0
>
> Attachments: PIG-3645-1.patch, PIG-3645-2.patch, PIG-3645-3.patch, 
> PIG-3645-4.patch, TEST-org.apache.pig.test.TestMRCompiler.txt
>
>
> Currently, temporary paths are generated by FileLocalizer using 
> Random.nextInt(). To provide strong randomness, MapReduceLauncher resets the 
> Random object every time when compiling physical plan to MR plan:
> {code}
> MRCompiler comp = new MRCompiler(php, pc); 
> comp.randomizeFileLocalizer(); // This in turn calls FileLocalizer.setR(new 
> Random()).
> {code}
> Besides, there are a couple of places calling FileLocalizer.setR() (e.g. 
> MRCompiler) with some random seed.
> I think-
> # Randomizing Random seed is unnecessary if we switch to UUID.
> # Setting Random objects in code like this is error-prone because it can be 
> easily broken by having or missing a FileLocalizer.setR() somewhere else. See 
> an example [here|http://search-hadoop.com/m/2nxTzQXfHw1].
> So I propose that we remove all this "randomizing Random seed" code and use 
> UUID instead in temporary paths.
> For unit tests that compare the results against gold files, we should still 
> allow to set Random seed through FileLocalizer.setR(). But this method will 
> be annotated as "VisibleForTesting" to ensure it is not used nowhere else 
> other than in unit tests.
> Regarding the existing gold files, they can be easily regenerated by 
> TestMRCompiler as follows-
> {code}
> FileOutputStream fos = new FileOutputStream(expectedFile + "_new");
> PrintWriter pw = new PrintWriter(fos);
> pw.write(compiledPlan);
> {code}
> I assume there won't be any kind of regressions due to this change. But 
> please let me know if I am wrong.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Assigned] (PIG-3643) Nested Foreach with UDF and bincond is broken

2013-12-27 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park reassigned PIG-3643:
--

Assignee: Cheolsoo Park

TestLogicalPlanGenerator is broken by this too. Let me fix this.

> Nested Foreach with UDF and bincond is broken
> -
>
> Key: PIG-3643
> URL: https://issues.apache.org/jira/browse/PIG-3643
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.13.0
>Reporter: Rohini Palaniswamy
>Assignee: Cheolsoo Park
>
> Was checking out PIG-3000. 
> A = load 'data' as (a:chararray);
> B = foreach A { c = UPPER(a); generate ((c eq 'TEST') ? 1 : 0), ((c eq 'DEV') 
> ? 1 : 0); }
> This now throws "Invalid field projection. Projected field [c] does not exist 
> in schema".  Works fine in 0.11. Broken in trunk. Haven't checked 0.12. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (PIG-3645) Move FileLocalizer.setR() calls to unit tests

2013-12-27 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3645:
---

Attachment: TEST-org.apache.pig.test.TestMRCompiler.txt

I tried PigServer.resetScope() and NodeIdGenerator.reset(""), and 
FileLocalizer.setR(new Random(1331L)) in setup. But that doesn't seem to 
entirely fix it. In fact, temporary paths do not match while scope and node ids 
match.

I am attaching the test log if you're interested.

> Move FileLocalizer.setR() calls to unit tests
> -
>
> Key: PIG-3645
> URL: https://issues.apache.org/jira/browse/PIG-3645
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>Priority: Minor
> Fix For: 0.13.0
>
> Attachments: PIG-3645-1.patch, PIG-3645-2.patch, PIG-3645-3.patch, 
> TEST-org.apache.pig.test.TestMRCompiler.txt
>
>
> Currently, temporary paths are generated by FileLocalizer using 
> Random.nextInt(). To provide strong randomness, MapReduceLauncher resets the 
> Random object every time when compiling physical plan to MR plan:
> {code}
> MRCompiler comp = new MRCompiler(php, pc); 
> comp.randomizeFileLocalizer(); // This in turn calls FileLocalizer.setR(new 
> Random()).
> {code}
> Besides, there are a couple of places calling FileLocalizer.setR() (e.g. 
> MRCompiler) with some random seed.
> I think-
> # Randomizing Random seed is unnecessary if we switch to UUID.
> # Setting Random objects in code like this is error-prone because it can be 
> easily broken by having or missing a FileLocalizer.setR() somewhere else. See 
> an example [here|http://search-hadoop.com/m/2nxTzQXfHw1].
> So I propose that we remove all this "randomizing Random seed" code and use 
> UUID instead in temporary paths.
> For unit tests that compare the results against gold files, we should still 
> allow to set Random seed through FileLocalizer.setR(). But this method will 
> be annotated as "VisibleForTesting" to ensure it is not used nowhere else 
> other than in unit tests.
> Regarding the existing gold files, they can be easily regenerated by 
> TestMRCompiler as follows-
> {code}
> FileOutputStream fos = new FileOutputStream(expectedFile + "_new");
> PrintWriter pw = new PrintWriter(fos);
> pw.write(compiledPlan);
> {code}
> I assume there won't be any kind of regressions due to this change. But 
> please let me know if I am wrong.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (PIG-3639) TestRegisteredJarVisibility is broken in trunk

2013-12-27 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3639:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to trunk. Thank you Jarcec and Rohini for the review!

> TestRegisteredJarVisibility is broken in trunk
> --
>
> Key: PIG-3639
> URL: https://issues.apache.org/jira/browse/PIG-3639
> Project: Pig
>  Issue Type: Bug
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.13.0
>
> Attachments: PIG-3639-1.patch
>
>
> {code}
> [junit] Running org.apache.pig.test.TestRegisteredJarVisibility
> [junit] Tests run: 3, Failures: 1, Errors: 0, Time elapsed: 44.127 sec
> [junit] Test org.apache.pig.test.TestRegisteredJarVisibility FAILED
> {code}
> This is a side-effect of PIG-3584 that bumped avro version to 1.7.5.
> The problem is that avro 1.7.5 pulls down jackson 1.9.9 jars as dependencies, 
> and that makes 
> TestRegisteredJarVisibility.testRegisterJarOverridePigJarPackages fail 
> because the test case assumes that jackson 1.9.9 jars are not present in 
> classpath.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (PIG-3645) Move FileLocalizer.setR() calls to unit tests

2013-12-27 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13857790#comment-13857790
 ] 

Rohini Palaniswamy commented on PIG-3645:
-

FileLocalizer.setR(new Random(1331L)); - If you do this in setup instead of 
setUpBeforeClass in TestMRCompiler, I think you will not need to regenerate the 
golden files. But if you still need to regenerate the golden files, is it 
possible for you to copy over what is done in TestTezCompiler setup() method. 
That makes each test's golden file independent of another and allows tests to 
be run in any order or just run individually from eclipse without having to run 
previous tests. Doing that in a separate jira is fine as well. Everything else 
looks good. Will do a +1 once you confirm full unit test suite passes.

> Move FileLocalizer.setR() calls to unit tests
> -
>
> Key: PIG-3645
> URL: https://issues.apache.org/jira/browse/PIG-3645
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>Priority: Minor
> Fix For: 0.13.0
>
> Attachments: PIG-3645-1.patch, PIG-3645-2.patch, PIG-3645-3.patch
>
>
> Currently, temporary paths are generated by FileLocalizer using 
> Random.nextInt(). To provide strong randomness, MapReduceLauncher resets the 
> Random object every time when compiling physical plan to MR plan:
> {code}
> MRCompiler comp = new MRCompiler(php, pc); 
> comp.randomizeFileLocalizer(); // This in turn calls FileLocalizer.setR(new 
> Random()).
> {code}
> Besides, there are a couple of places calling FileLocalizer.setR() (e.g. 
> MRCompiler) with some random seed.
> I think-
> # Randomizing Random seed is unnecessary if we switch to UUID.
> # Setting Random objects in code like this is error-prone because it can be 
> easily broken by having or missing a FileLocalizer.setR() somewhere else. See 
> an example [here|http://search-hadoop.com/m/2nxTzQXfHw1].
> So I propose that we remove all this "randomizing Random seed" code and use 
> UUID instead in temporary paths.
> For unit tests that compare the results against gold files, we should still 
> allow to set Random seed through FileLocalizer.setR(). But this method will 
> be annotated as "VisibleForTesting" to ensure it is not used nowhere else 
> other than in unit tests.
> Regarding the existing gold files, they can be easily regenerated by 
> TestMRCompiler as follows-
> {code}
> FileOutputStream fos = new FileOutputStream(expectedFile + "_new");
> PrintWriter pw = new PrintWriter(fos);
> pw.write(compiledPlan);
> {code}
> I assume there won't be any kind of regressions due to this change. But 
> please let me know if I am wrong.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (PIG-3645) Move FileLocalizer.setR() calls to unit tests

2013-12-27 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3645:
---

Attachment: PIG-3645-3.patch

Reset "generate" to false in TestMRCompiler in the new patch. Changes can be 
viewed 
[here|https://github.com/piaozhexiu/apache-pig/commit/f833c8f950dc158a711043721e8876d9de96fe78].

I am running full unit tests now.



> Move FileLocalizer.setR() calls to unit tests
> -
>
> Key: PIG-3645
> URL: https://issues.apache.org/jira/browse/PIG-3645
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>Priority: Minor
> Fix For: 0.13.0
>
> Attachments: PIG-3645-1.patch, PIG-3645-2.patch, PIG-3645-3.patch
>
>
> Currently, temporary paths are generated by FileLocalizer using 
> Random.nextInt(). To provide strong randomness, MapReduceLauncher resets the 
> Random object every time when compiling physical plan to MR plan:
> {code}
> MRCompiler comp = new MRCompiler(php, pc); 
> comp.randomizeFileLocalizer(); // This in turn calls FileLocalizer.setR(new 
> Random()).
> {code}
> Besides, there are a couple of places calling FileLocalizer.setR() (e.g. 
> MRCompiler) with some random seed.
> I think-
> # Randomizing Random seed is unnecessary if we switch to UUID.
> # Setting Random objects in code like this is error-prone because it can be 
> easily broken by having or missing a FileLocalizer.setR() somewhere else. See 
> an example [here|http://search-hadoop.com/m/2nxTzQXfHw1].
> So I propose that we remove all this "randomizing Random seed" code and use 
> UUID instead in temporary paths.
> For unit tests that compare the results against gold files, we should still 
> allow to set Random seed through FileLocalizer.setR(). But this method will 
> be annotated as "VisibleForTesting" to ensure it is not used nowhere else 
> other than in unit tests.
> Regarding the existing gold files, they can be easily regenerated by 
> TestMRCompiler as follows-
> {code}
> FileOutputStream fos = new FileOutputStream(expectedFile + "_new");
> PrintWriter pw = new PrintWriter(fos);
> pw.write(compiledPlan);
> {code}
> I assume there won't be any kind of regressions due to this change. But 
> please let me know if I am wrong.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (PIG-3645) Move FileLocalizer.setR() calls to unit tests

2013-12-27 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3645:
---

Summary: Move FileLocalizer.setR() calls to unit tests  (was: Replace 
Random with UUID in FileLocalizer.getTemporaryPath())

> Move FileLocalizer.setR() calls to unit tests
> -
>
> Key: PIG-3645
> URL: https://issues.apache.org/jira/browse/PIG-3645
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>Priority: Minor
> Fix For: 0.13.0
>
> Attachments: PIG-3645-1.patch, PIG-3645-2.patch
>
>
> Currently, temporary paths are generated by FileLocalizer using 
> Random.nextInt(). To provide strong randomness, MapReduceLauncher resets the 
> Random object every time when compiling physical plan to MR plan:
> {code}
> MRCompiler comp = new MRCompiler(php, pc); 
> comp.randomizeFileLocalizer(); // This in turn calls FileLocalizer.setR(new 
> Random()).
> {code}
> Besides, there are a couple of places calling FileLocalizer.setR() (e.g. 
> MRCompiler) with some random seed.
> I think-
> # Randomizing Random seed is unnecessary if we switch to UUID.
> # Setting Random objects in code like this is error-prone because it can be 
> easily broken by having or missing a FileLocalizer.setR() somewhere else. See 
> an example [here|http://search-hadoop.com/m/2nxTzQXfHw1].
> So I propose that we remove all this "randomizing Random seed" code and use 
> UUID instead in temporary paths.
> For unit tests that compare the results against gold files, we should still 
> allow to set Random seed through FileLocalizer.setR(). But this method will 
> be annotated as "VisibleForTesting" to ensure it is not used nowhere else 
> other than in unit tests.
> Regarding the existing gold files, they can be easily regenerated by 
> TestMRCompiler as follows-
> {code}
> FileOutputStream fos = new FileOutputStream(expectedFile + "_new");
> PrintWriter pw = new PrintWriter(fos);
> pw.write(compiledPlan);
> {code}
> I assume there won't be any kind of regressions due to this change. But 
> please let me know if I am wrong.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (PIG-3645) Replace Random with UUID in FileLocalizer.getTemporaryPath()

2013-12-27 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3645:
---

Attachment: PIG-3645-2.patch

Uploading a new patch as per discussion. The changes include-
# Annotate FileLocalizer.setR() as @VisibleForTesting.
# Remove FileLocalizer.setR() calls from MRCompiler and MapReduceLauncher.
# Add FileLocalizer.setR(new Random(1331L)) in TestMRCompiler.
# Regenerate gold files for TestMRCompiler.
# Enabled testSim4 in TestMRCompier. Looks like it was unintentionally not 
marked as @Test.
# Remove FileLocalizer.setR(new Random()) in unit tests. Now FileLocalizer is 
randomized by default, so no need to do it again-
## TestEvalPipeline.java
## TestEvalPipeline2.java
## TestEvalPipelineLocal.java
## TestJoin.java
## TestScriptUDF.java
## TestSecondarySort.java

> Replace Random with UUID in FileLocalizer.getTemporaryPath()
> 
>
> Key: PIG-3645
> URL: https://issues.apache.org/jira/browse/PIG-3645
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>Priority: Minor
> Fix For: 0.13.0
>
> Attachments: PIG-3645-1.patch, PIG-3645-2.patch
>
>
> Currently, temporary paths are generated by FileLocalizer using 
> Random.nextInt(). To provide strong randomness, MapReduceLauncher resets the 
> Random object every time when compiling physical plan to MR plan:
> {code}
> MRCompiler comp = new MRCompiler(php, pc); 
> comp.randomizeFileLocalizer(); // This in turn calls FileLocalizer.setR(new 
> Random()).
> {code}
> Besides, there are a couple of places calling FileLocalizer.setR() (e.g. 
> MRCompiler) with some random seed.
> I think-
> # Randomizing Random seed is unnecessary if we switch to UUID.
> # Setting Random objects in code like this is error-prone because it can be 
> easily broken by having or missing a FileLocalizer.setR() somewhere else. See 
> an example [here|http://search-hadoop.com/m/2nxTzQXfHw1].
> So I propose that we remove all this "randomizing Random seed" code and use 
> UUID instead in temporary paths.
> For unit tests that compare the results against gold files, we should still 
> allow to set Random seed through FileLocalizer.setR(). But this method will 
> be annotated as "VisibleForTesting" to ensure it is not used nowhere else 
> other than in unit tests.
> Regarding the existing gold files, they can be easily regenerated by 
> TestMRCompiler as follows-
> {code}
> FileOutputStream fos = new FileOutputStream(expectedFile + "_new");
> PrintWriter pw = new PrintWriter(fos);
> pw.write(compiledPlan);
> {code}
> I assume there won't be any kind of regressions due to this change. But 
> please let me know if I am wrong.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (PIG-3643) Nested Foreach with UDF and bincond is broken

2013-12-27 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13857765#comment-13857765
 ] 

Cheolsoo Park commented on PIG-3643:


TestColumnAliasConversion is broken in trunk for this as well.

> Nested Foreach with UDF and bincond is broken
> -
>
> Key: PIG-3643
> URL: https://issues.apache.org/jira/browse/PIG-3643
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.13.0
>Reporter: Rohini Palaniswamy
>
> Was checking out PIG-3000. 
> A = load 'data' as (a:chararray);
> B = foreach A { c = UPPER(a); generate ((c eq 'TEST') ? 1 : 0), ((c eq 'DEV') 
> ? 1 : 0); }
> This now throws "Invalid field projection. Projected field [c] does not exist 
> in schema".  Works fine in 0.11. Broken in trunk. Haven't checked 0.12. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (PIG-3645) Replace Random with UUID in FileLocalizer.getTemporaryPath()

2013-12-27 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13857727#comment-13857727
 ] 

Rohini Palaniswamy commented on PIG-3645:
-

Sounds good. +1 on moving all FileLocalizer.setR() calls to test classes.

> Replace Random with UUID in FileLocalizer.getTemporaryPath()
> 
>
> Key: PIG-3645
> URL: https://issues.apache.org/jira/browse/PIG-3645
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>Priority: Minor
> Fix For: 0.13.0
>
> Attachments: PIG-3645-1.patch
>
>
> Currently, temporary paths are generated by FileLocalizer using 
> Random.nextInt(). To provide strong randomness, MapReduceLauncher resets the 
> Random object every time when compiling physical plan to MR plan:
> {code}
> MRCompiler comp = new MRCompiler(php, pc); 
> comp.randomizeFileLocalizer(); // This in turn calls FileLocalizer.setR(new 
> Random()).
> {code}
> Besides, there are a couple of places calling FileLocalizer.setR() (e.g. 
> MRCompiler) with some random seed.
> I think-
> # Randomizing Random seed is unnecessary if we switch to UUID.
> # Setting Random objects in code like this is error-prone because it can be 
> easily broken by having or missing a FileLocalizer.setR() somewhere else. See 
> an example [here|http://search-hadoop.com/m/2nxTzQXfHw1].
> So I propose that we remove all this "randomizing Random seed" code and use 
> UUID instead in temporary paths.
> For unit tests that compare the results against gold files, we should still 
> allow to set Random seed through FileLocalizer.setR(). But this method will 
> be annotated as "VisibleForTesting" to ensure it is not used nowhere else 
> other than in unit tests.
> Regarding the existing gold files, they can be easily regenerated by 
> TestMRCompiler as follows-
> {code}
> FileOutputStream fos = new FileOutputStream(expectedFile + "_new");
> PrintWriter pw = new PrintWriter(fos);
> pw.write(compiledPlan);
> {code}
> I assume there won't be any kind of regressions due to this change. But 
> please let me know if I am wrong.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (PIG-3645) Replace Random with UUID in FileLocalizer.getTemporaryPath()

2013-12-27 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13857703#comment-13857703
 ] 

Cheolsoo Park commented on PIG-3645:


[~rohini], thank you for the detailed explanation. My gaol is to simplify the 
current code rather than increase strongeness of random.

Let's separate two things here-
1) Remove all the FileLocalizer.setR() in our code (except ones with fixed 
seeds in unit tests).
2) Replaced Random with UUID

You and I agree on #1 but disagree on #2, correct? Only reason why I proposed 
#2 in addition to #1 because I didn't want to reduce randomness than the 
current form. Sounds like you're not worried about that. If that's the case, I 
can do #1 only.

How does that sound to you?




> Replace Random with UUID in FileLocalizer.getTemporaryPath()
> 
>
> Key: PIG-3645
> URL: https://issues.apache.org/jira/browse/PIG-3645
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>Priority: Minor
> Fix For: 0.13.0
>
> Attachments: PIG-3645-1.patch
>
>
> Currently, temporary paths are generated by FileLocalizer using 
> Random.nextInt(). To provide strong randomness, MapReduceLauncher resets the 
> Random object every time when compiling physical plan to MR plan:
> {code}
> MRCompiler comp = new MRCompiler(php, pc); 
> comp.randomizeFileLocalizer(); // This in turn calls FileLocalizer.setR(new 
> Random()).
> {code}
> Besides, there are a couple of places calling FileLocalizer.setR() (e.g. 
> MRCompiler) with some random seed.
> I think-
> # Randomizing Random seed is unnecessary if we switch to UUID.
> # Setting Random objects in code like this is error-prone because it can be 
> easily broken by having or missing a FileLocalizer.setR() somewhere else. See 
> an example [here|http://search-hadoop.com/m/2nxTzQXfHw1].
> So I propose that we remove all this "randomizing Random seed" code and use 
> UUID instead in temporary paths.
> For unit tests that compare the results against gold files, we should still 
> allow to set Random seed through FileLocalizer.setR(). But this method will 
> be annotated as "VisibleForTesting" to ensure it is not used nowhere else 
> other than in unit tests.
> Regarding the existing gold files, they can be easily regenerated by 
> TestMRCompiler as follows-
> {code}
> FileOutputStream fos = new FileOutputStream(expectedFile + "_new");
> PrintWriter pw = new PrintWriter(fos);
> pw.write(compiledPlan);
> {code}
> I assume there won't be any kind of regressions due to this change. But 
> please let me know if I am wrong.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (PIG-3645) Replace Random with UUID in FileLocalizer.getTemporaryPath()

2013-12-27 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13857699#comment-13857699
 ] 

Rohini Palaniswamy commented on PIG-3645:
-

> UUID returns takes up 36 chars, while random we are only using 10 chars. 
> Would prefer shorter filenames to take up less space in NN and prevent GCs
  This is actually a bigger concern for me as intermediate jobs will create lot 
of temporary files under the temporary directory with MR when there are lot of 
maps/reduces. Since these will be short lived files and will be deleted soon 
will be a pressure on NN if they have long paths. We already have serious 
problems with NN performance on big clusters with heavy usage and hadoop team 
has been squeezing optimizations in cutting down on the number of calls MR 
framework makes. 

> Replace Random with UUID in FileLocalizer.getTemporaryPath()
> 
>
> Key: PIG-3645
> URL: https://issues.apache.org/jira/browse/PIG-3645
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>Priority: Minor
> Fix For: 0.13.0
>
> Attachments: PIG-3645-1.patch
>
>
> Currently, temporary paths are generated by FileLocalizer using 
> Random.nextInt(). To provide strong randomness, MapReduceLauncher resets the 
> Random object every time when compiling physical plan to MR plan:
> {code}
> MRCompiler comp = new MRCompiler(php, pc); 
> comp.randomizeFileLocalizer(); // This in turn calls FileLocalizer.setR(new 
> Random()).
> {code}
> Besides, there are a couple of places calling FileLocalizer.setR() (e.g. 
> MRCompiler) with some random seed.
> I think-
> # Randomizing Random seed is unnecessary if we switch to UUID.
> # Setting Random objects in code like this is error-prone because it can be 
> easily broken by having or missing a FileLocalizer.setR() somewhere else. See 
> an example [here|http://search-hadoop.com/m/2nxTzQXfHw1].
> So I propose that we remove all this "randomizing Random seed" code and use 
> UUID instead in temporary paths.
> For unit tests that compare the results against gold files, we should still 
> allow to set Random seed through FileLocalizer.setR(). But this method will 
> be annotated as "VisibleForTesting" to ensure it is not used nowhere else 
> other than in unit tests.
> Regarding the existing gold files, they can be easily regenerated by 
> TestMRCompiler as follows-
> {code}
> FileOutputStream fos = new FileOutputStream(expectedFile + "_new");
> PrintWriter pw = new PrintWriter(fos);
> pw.write(compiledPlan);
> {code}
> I assume there won't be any kind of regressions due to this change. But 
> please let me know if I am wrong.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (PIG-3645) Replace Random with UUID in FileLocalizer.getTemporaryPath()

2013-12-27 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13857691#comment-13857691
 ] 

Rohini Palaniswamy commented on PIG-3645:
-

This change is not required.  Few reasons:
  - new Random() gives enough randomness as it takes into account 
System.nanoTime() and UUID is not going to add any advantage. Infact UUID uses 
Random internally. Differences are it uses SecureRandom (which we don't need) 
and does more bytes (which also we don't need as integer range itself is huge 
and is not going to repeat within a execution of a pig script) and uses clock 
sequence (System.nanoTime() in Random is good enough for us). And Random has 
been working fine in production for 7+ years without collision between users 
and I don't see any reason to change that.  If UUID were to give some advantage 
or fixes a real issue, I would certainly go for it.
  - UUID returns takes up 36 chars, while random we are only using 10 chars. 
Would prefer shorter filenames to take up less space in NN and prevent GCs. 
  - Problem is not with randomness of Random but mix up in MRComplier code 
resetting the seed of Random to take care of tests.  What is been done in MR is 
that they set new Random(1331) in MRCompiler so tests get the same plan 
everytime. But they reset it to new Random() again when the script is actually 
run (in MapreduceLauncher) so that it does not write to the same file again and 
again. Simple change would be to move the new Random(1331) to TestMRCompiler 
and org.apache.pig.test.Util.buildMRPlan() as that should belong in the tests 
and not in the normal code path. If that is done, 
comp.randomizeFileLocalizer(); can be removed from MapreduceLauncher.

> Replace Random with UUID in FileLocalizer.getTemporaryPath()
> 
>
> Key: PIG-3645
> URL: https://issues.apache.org/jira/browse/PIG-3645
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>Priority: Minor
> Fix For: 0.13.0
>
> Attachments: PIG-3645-1.patch
>
>
> Currently, temporary paths are generated by FileLocalizer using 
> Random.nextInt(). To provide strong randomness, MapReduceLauncher resets the 
> Random object every time when compiling physical plan to MR plan:
> {code}
> MRCompiler comp = new MRCompiler(php, pc); 
> comp.randomizeFileLocalizer(); // This in turn calls FileLocalizer.setR(new 
> Random()).
> {code}
> Besides, there are a couple of places calling FileLocalizer.setR() (e.g. 
> MRCompiler) with some random seed.
> I think-
> # Randomizing Random seed is unnecessary if we switch to UUID.
> # Setting Random objects in code like this is error-prone because it can be 
> easily broken by having or missing a FileLocalizer.setR() somewhere else. See 
> an example [here|http://search-hadoop.com/m/2nxTzQXfHw1].
> So I propose that we remove all this "randomizing Random seed" code and use 
> UUID instead in temporary paths.
> For unit tests that compare the results against gold files, we should still 
> allow to set Random seed through FileLocalizer.setR(). But this method will 
> be annotated as "VisibleForTesting" to ensure it is not used nowhere else 
> other than in unit tests.
> Regarding the existing gold files, they can be easily regenerated by 
> TestMRCompiler as follows-
> {code}
> FileOutputStream fos = new FileOutputStream(expectedFile + "_new");
> PrintWriter pw = new PrintWriter(fos);
> pw.write(compiledPlan);
> {code}
> I assume there won't be any kind of regressions due to this change. But 
> please let me know if I am wrong.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (PIG-3645) Replace Random with UUID in FileLocalizer.getTemporaryPath()

2013-12-27 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3645:
---

Attachment: PIG-3645-1.patch

Attaching an initial patch that includes regenerated gold files.

I will run the full suite of unit tests.

> Replace Random with UUID in FileLocalizer.getTemporaryPath()
> 
>
> Key: PIG-3645
> URL: https://issues.apache.org/jira/browse/PIG-3645
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>Priority: Minor
> Fix For: 0.13.0
>
> Attachments: PIG-3645-1.patch
>
>
> Currently, temporary paths are generated by FileLocalizer using 
> Random.nextInt(). To provide strong randomness, MapReduceLauncher resets the 
> Random object every time when compiling physical plan to MR plan:
> {code}
> MRCompiler comp = new MRCompiler(php, pc); 
> comp.randomizeFileLocalizer(); // This in turn calls FileLocalizer.setR(new 
> Random()).
> {code}
> Besides, there are a couple of places calling FileLocalizer.setR() (e.g. 
> MRCompiler) with some random seed.
> I think-
> # Randomizing Random seed is unnecessary if we switch to UUID.
> # Setting Random objects in code like this is error-prone because it can be 
> easily broken by having or missing a FileLocalizer.setR() somewhere else. See 
> an example [here|http://search-hadoop.com/m/2nxTzQXfHw1].
> So I propose that we remove all this "randomizing Random seed" code and use 
> UUID instead in temporary paths.
> For unit tests that compare the results against gold files, we should still 
> allow to set Random seed through FileLocalizer.setR(). But this method will 
> be annotated as "VisibleForTesting" to ensure it is not used nowhere else 
> other than in unit tests.
> Regarding the existing gold files, they can be easily regenerated by 
> TestMRCompiler as follows-
> {code}
> FileOutputStream fos = new FileOutputStream(expectedFile + "_new");
> PrintWriter pw = new PrintWriter(fos);
> pw.write(compiledPlan);
> {code}
> I assume there won't be any kind of regressions due to this change. But 
> please let me know if I am wrong.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (PIG-3637) PigCombiner creating log spam

2013-12-27 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-3637:


  Resolution: Fixed
Hadoop Flags: Reviewed
  Status: Resolved  (was: Patch Available)

Committed to trunk (0.13). Thanks for the review Cheolsoo.

> PigCombiner creating log spam
> -
>
> Key: PIG-3637
> URL: https://issues.apache.org/jira/browse/PIG-3637
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.11.1
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
>Priority: Minor
> Fix For: 0.13.0
>
> Attachments: PIG-3637-1.patch, PIG-3637-2-whitespacechanges.patch
>
>
> "Aliases being processed per job phase" is logged a lot of times even though 
> it is in setup because Combiners can be called multiple times in both map and 
> reduce by MR framework (HADOOP-3226). With Hadoop 2.0 since logs go to hdfs 
> this is inefficient.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (PIG-3637) PigCombiner creating log spam

2013-12-27 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-3637:


Attachment: PIG-3637-2-whitespacechanges.patch

> PigCombiner creating log spam
> -
>
> Key: PIG-3637
> URL: https://issues.apache.org/jira/browse/PIG-3637
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.11.1
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
>Priority: Minor
> Fix For: 0.13.0
>
> Attachments: PIG-3637-1.patch, PIG-3637-2-whitespacechanges.patch
>
>
> "Aliases being processed per job phase" is logged a lot of times even though 
> it is in setup because Combiners can be called multiple times in both map and 
> reduce by MR framework (HADOOP-3226). With Hadoop 2.0 since logs go to hdfs 
> this is inefficient.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (PIG-3639) TestRegisteredJarVisibility is broken in trunk

2013-12-27 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13857678#comment-13857678
 ] 

Rohini Palaniswamy commented on PIG-3639:
-

+1

> TestRegisteredJarVisibility is broken in trunk
> --
>
> Key: PIG-3639
> URL: https://issues.apache.org/jira/browse/PIG-3639
> Project: Pig
>  Issue Type: Bug
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.13.0
>
> Attachments: PIG-3639-1.patch
>
>
> {code}
> [junit] Running org.apache.pig.test.TestRegisteredJarVisibility
> [junit] Tests run: 3, Failures: 1, Errors: 0, Time elapsed: 44.127 sec
> [junit] Test org.apache.pig.test.TestRegisteredJarVisibility FAILED
> {code}
> This is a side-effect of PIG-3584 that bumped avro version to 1.7.5.
> The problem is that avro 1.7.5 pulls down jackson 1.9.9 jars as dependencies, 
> and that makes 
> TestRegisteredJarVisibility.testRegisterJarOverridePigJarPackages fail 
> because the test case assumes that jackson 1.9.9 jars are not present in 
> classpath.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Created] (PIG-3645) Replace Random with UUID in FileLocalizer.getTemporaryPath()

2013-12-27 Thread Cheolsoo Park (JIRA)
Cheolsoo Park created PIG-3645:
--

 Summary: Replace Random with UUID in 
FileLocalizer.getTemporaryPath()
 Key: PIG-3645
 URL: https://issues.apache.org/jira/browse/PIG-3645
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Cheolsoo Park
Assignee: Cheolsoo Park
Priority: Minor
 Fix For: 0.13.0


Currently, temporary paths are generated by FileLocalizer using 
Random.nextInt(). To provide strong randomness, MapReduceLauncher resets the 
Random object every time when compiling physical plan to MR plan:
{code}
MRCompiler comp = new MRCompiler(php, pc); 
comp.randomizeFileLocalizer(); // This in turn calls FileLocalizer.setR(new 
Random()).
{code}

Besides, there are a couple of places calling FileLocalizer.setR() (e.g. 
MRCompiler) with some random seed.

I think-
# Randomizing Random seed is unnecessary if we switch to UUID.
# Setting Random objects in code like this is error-prone because it can be 
easily broken by having or missing a FileLocalizer.setR() somewhere else. See 
an example [here|http://search-hadoop.com/m/2nxTzQXfHw1].

So I propose that we remove all this "randomizing Random seed" code and use 
UUID instead in temporary paths.

For unit tests that compare the results against gold files, we should still 
allow to set Random seed through FileLocalizer.setR(). But this method will be 
annotated as "VisibleForTesting" to ensure it is not used nowhere else other 
than in unit tests.

Regarding the existing gold files, they can be easily regenerated by 
TestMRCompiler as follows-
{code}
FileOutputStream fos = new FileOutputStream(expectedFile + "_new");
PrintWriter pw = new PrintWriter(fos);
pw.write(compiledPlan);
{code}

I assume there won't be any kind of regressions due to this change. But please 
let me know if I am wrong.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (PIG-3632) Add option to configure cacheBlocks in HBaseStorage

2013-12-27 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-3632:


  Resolution: Fixed
Hadoop Flags: Reviewed
  Status: Resolved  (was: Patch Available)

 We want it to be false by default in MR (Refer PIG-2619). Our hbase team also 
has recommended the same as default, but for few users they wanted to have this 
true. So made it configurable.

Committed to trunk (0.13). Thanks Prashant for the review.

> Add option to configure cacheBlocks in HBaseStorage
> ---
>
> Key: PIG-3632
> URL: https://issues.apache.org/jira/browse/PIG-3632
> Project: Pig
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.13.0
>
> Attachments: PIG-3632-1.patch
>
>
> By default it should be set to false as we don't wan't users trashing the 
> cache unintentionally.  But there are some cases, a user wants to work on the 
> same data potentially and it would be good to warm up the cache.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Created] (PIG-3644) Implement skewed join in Tez

2013-12-27 Thread Cheolsoo Park (JIRA)
Cheolsoo Park created PIG-3644:
--

 Summary: Implement skewed join in Tez
 Key: PIG-3644
 URL: https://issues.apache.org/jira/browse/PIG-3644
 Project: Pig
  Issue Type: Sub-task
  Components: tez
Affects Versions: tez-branch
Reporter: Cheolsoo Park
Assignee: Cheolsoo Park
 Fix For: tez-branch


Skewed join in Tez can be implemented similarly to order-by (PIG-3634).



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (PIG-3637) PigCombiner creating log spam

2013-12-27 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13857608#comment-13857608
 ] 

Cheolsoo Park commented on PIG-3637:


+1. LGTM.

> PigCombiner creating log spam
> -
>
> Key: PIG-3637
> URL: https://issues.apache.org/jira/browse/PIG-3637
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.11.1
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
>Priority: Minor
> Fix For: 0.13.0
>
> Attachments: PIG-3637-1.patch
>
>
> "Aliases being processed per job phase" is logged a lot of times even though 
> it is in setup because Combiners can be called multiple times in both map and 
> reduce by MR framework (HADOOP-3226). With Hadoop 2.0 since logs go to hdfs 
> this is inefficient.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (PIG-3643) Nested Foreach with UDF and bincond is broken

2013-12-27 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13857604#comment-13857604
 ] 

Cheolsoo Park commented on PIG-3643:


This is a regression of PIG-3581. If I drop the PIG-3581 commit, the example 
works in trunk too.

[~aniket486] and I were discussion this in PIG-3581 too.



> Nested Foreach with UDF and bincond is broken
> -
>
> Key: PIG-3643
> URL: https://issues.apache.org/jira/browse/PIG-3643
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.13.0
>Reporter: Rohini Palaniswamy
>
> Was checking out PIG-3000. 
> A = load 'data' as (a:chararray);
> B = foreach A { c = UPPER(a); generate ((c eq 'TEST') ? 1 : 0), ((c eq 'DEV') 
> ? 1 : 0); }
> This now throws "Invalid field projection. Projected field [c] does not exist 
> in schema".  Works fine in 0.11. Broken in trunk. Haven't checked 0.12. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (PIG-3643) Nested Foreach with UDF and bincond is broken

2013-12-27 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-3643:


Summary: Nested Foreach with UDF and bincond is broken  (was: Nested 
Foreach with simple UDF is broken)

> Nested Foreach with UDF and bincond is broken
> -
>
> Key: PIG-3643
> URL: https://issues.apache.org/jira/browse/PIG-3643
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.13.0
>Reporter: Rohini Palaniswamy
>
> Was checking out PIG-3000. 
> A = load 'data' as (a:chararray);
> B = foreach A { c = UPPER(a); generate ((c eq 'TEST') ? 1 : 0), ((c eq 'DEV') 
> ? 1 : 0); }
> This now throws "Invalid field projection. Projected field [c] does not exist 
> in schema".  Works fine in 0.11. Broken in trunk. Haven't checked 0.12. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Created] (PIG-3643) Nested Foreach with simple UDF is broken

2013-12-27 Thread Rohini Palaniswamy (JIRA)
Rohini Palaniswamy created PIG-3643:
---

 Summary: Nested Foreach with simple UDF is broken
 Key: PIG-3643
 URL: https://issues.apache.org/jira/browse/PIG-3643
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.13.0
Reporter: Rohini Palaniswamy


Was checking out PIG-3000. 

A = load 'data' as (a:chararray);
B = foreach A { c = UPPER(a); generate ((c eq 'TEST') ? 1 : 0), ((c eq 'DEV') ? 
1 : 0); }

This now throws "Invalid field projection. Projected field [c] does not exist 
in schema".  Works fine in 0.11. Broken in trunk. Haven't checked 0.12. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)