[jira] [Commented] (PIG-3634) Improve performance of order-by

2013-12-29 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858526#comment-13858526
 ] 

Daniel Dai commented on PIG-3634:
-

Thanks for the clarification. PIG-3634-2.patch should now work with the top of 
the tez branch.

> Improve performance of order-by
> ---
>
> Key: PIG-3634
> URL: https://issues.apache.org/jira/browse/PIG-3634
> Project: Pig
>  Issue Type: Sub-task
>  Components: tez
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: tez-branch
>
> Attachments: PIG-3634-0.patch, PIG-3634-1.patch, PIG-3634-2.patch
>
>
> This is a followup for PIG-3534. In PIG-3534, we use 5 vertexes (3 DAGs) to 
> implement an order-by. We can optimize to use 4 vertexes in 1 DAG:
> vertex 1: close the current vertex, create input + samples input
> vertex 2: aggregate samples to create quantiles
> vertex 3: use quantiles to partition input
> vertex 4: sort input after partition
> The DAG is:
> {code}
> vertex 1 ------------------> vertex 3 --> vertex 4
>          \--> vertex 2 ---/
> {code}
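
The four steps in the description can be sketched in plain Java as a single 
in-process run. This is only an illustration of sample-based range 
partitioning, not the code in the patch: the class OrderBySketch, its method 
names, the "every n-th row" sampling rule and the integer keys are all 
assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical, single-process illustration of the 4-vertex order-by plan.
final class OrderBySketch {
    // "vertex 2": sort the samples and pick (numPartitions - 1) cut points.
    static int[] quantiles(int[] samples, int numPartitions) {
        int[] sorted = samples.clone();
        Arrays.sort(sorted);
        int[] cuts = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            cuts[i - 1] = sorted[i * sorted.length / numPartitions];
        }
        return cuts;
    }

    // "vertex 3": route a key to the first range whose cut point exceeds it.
    static int partition(int key, int[] cuts) {
        int p = 0;
        while (p < cuts.length && key >= cuts[p]) {
            p++;
        }
        return p;
    }

    static List<Integer> orderBy(int[] input, int sampleEvery, int numPartitions) {
        if (input.length == 0) {
            return new ArrayList<>();
        }
        // "vertex 1": pass the input on and emit every n-th row as a sample.
        List<Integer> sampleList = new ArrayList<>();
        for (int i = 0; i < input.length; i += sampleEvery) {
            sampleList.add(input[i]);
        }
        int[] samples = sampleList.stream().mapToInt(Integer::intValue).toArray();
        int[] cuts = quantiles(samples, numPartitions);

        List<List<Integer>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new ArrayList<>());
        }
        for (int key : input) {
            parts.get(partition(key, cuts)).add(key);
        }

        // "vertex 4": sort each partition; ranges are disjoint and ordered,
        // so concatenating the partitions yields a globally sorted result.
        List<Integer> out = new ArrayList<>();
        for (List<Integer> part : parts) {
            Collections.sort(part);
            out.addAll(part);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] input = {42, 7, 19, 3, 88, 61, 25, 14, 70, 5};
        System.out.println(orderBy(input, 2, 3));
    }
}
```

In this sketch the quantiles only balance the partitions; the output is 
sorted regardless of how good the sample is, because each key is routed by 
range and every partition is sorted on its own.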



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (PIG-3634) Improve performance of order-by

2013-12-29 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858518#comment-13858518
 ] 

Cheolsoo Park commented on PIG-3634:


To be clear, I see no e2e failures in the current tez branch either.



[jira] [Commented] (PIG-3634) Improve performance of order-by

2013-12-29 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858515#comment-13858515
 ] 

Cheolsoo Park commented on PIG-3634:


[~rohini], with the latest patch (PIG-3634-2.patch) that Daniel uploaded 
everything works for me. Do you still see any error?



[jira] [Commented] (PIG-3634) Improve performance of order-by

2013-12-29 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858511#comment-13858511
 ] 

Rohini Palaniswamy commented on PIG-3634:
-

[~cheolsoo],
   The java.lang.ClassCastException (org.apache.pig.impl.io.NullableBytesWritable 
cannot be cast to org.apache.pig.impl.io.NullableText) happens after PIG-3636, as 
Daniel mentioned. Without that check-in, the e2e tests pass for me as well.  
When I first saw your comment, I thought the query had failed because Daniel 
said that a load followed directly by an order-by will not work with this patch. 



[jira] [Commented] (PIG-3634) Improve performance of order-by

2013-12-29 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858508#comment-13858508
 ] 

Daniel Dai commented on PIG-3634:
-

Here is RB link: https://reviews.apache.org/r/16510/



[jira] Subscription: PIG patch available

2013-12-29 Thread jira
Issue Subscription
Filter: PIG patch available (6 issues)

Subscriber: pigdaily

Key       Summary
PIG-3642  Direct HDFS access for small jobs (fetch)
          https://issues.apache.org/jira/browse/PIG-3642
PIG-3635  Fix e2e tests for Hadoop 2.X on Windows
          https://issues.apache.org/jira/browse/PIG-3635
PIG-3573  Provide StoreFunc and LoadFunc for Accumulo
          https://issues.apache.org/jira/browse/PIG-3573
PIG-3453  Implement a Storm backend to Pig
          https://issues.apache.org/jira/browse/PIG-3453
PIG-3441  Allow Pig to use default resources from Configuration objects
          https://issues.apache.org/jira/browse/PIG-3441
PIG-3347  Store invocation brings side effect
          https://issues.apache.org/jira/browse/PIG-3347

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=13225&filterId=12322384


[jira] [Updated] (PIG-3642) Direct HDFS access for small jobs (fetch)

2013-12-29 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3642:
---

Status: Patch Available  (was: Open)

> Direct HDFS access for small jobs (fetch) 
> --
>
> Key: PIG-3642
> URL: https://issues.apache.org/jira/browse/PIG-3642
> Project: Pig
>  Issue Type: Improvement
>Reporter: Lorand Bendig
>Assignee: Lorand Bendig
> Fix For: 0.13.0
>
> Attachments: PIG-3642.patch
>
>
> With this patch I'd like to add the possibility to directly read data from 
> HDFS instead of launching MR jobs in case of simple (map-only) tasks. Hive 
> already has this feature (fetch). This patch shares some similarities with 
> the local mode of Pig 0.6. Here, fetching kicks off when the following holds 
> for a script:
> * it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM, 
> (nested) FOREACH with expression operators, custom UDFs, etc.
> * no scalar aliases
> * no SampleLoader
> * single leaf job
> * DUMP (no STORE)
> The feature is enabled by default and can be toggled with:
> * -N or -no_fetch 
> * set opt.fetch true/false; 
> There's no STORE support because I wanted to make it explicit that this 
> "optimization" is for launching small/simple scripts during development, 
> rather than querying and filtering a large number of rows on the client 
> machine. However, a threshold could be given on the input size (an 
> estimation) to determine whether to prefer fetch over MR jobs, similar to 
> what Hive's '{{hive.fetch.task.conversion.threshold}}' does (perhaps through 
> Pig's LoadMetadata#getStatistics?).
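
The eligibility rules listed in the description can be read as one boolean 
check. The sketch below is only a reading aid under stated assumptions: the 
class FetchEligibilitySketch, the string operator names and the parameters 
are hypothetical, LOAD is added to the allowed set because every script needs 
one, and the "UNION only if no split is generated" condition is not modelled. 
It is not how the patch's FetchOptimizer is implemented.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical predicate mirroring the bullet list in the issue description.
final class FetchEligibilitySketch {
    // Operator names are illustrative strings, not Pig classes.
    static final Set<String> ALLOWED = new HashSet<>(
            Arrays.asList("LOAD", "LIMIT", "FILTER", "UNION", "STREAM", "FOREACH"));

    static boolean canFetch(boolean fetchEnabled, List<String> operators,
                            boolean hasScalarAlias, boolean usesSampleLoader,
                            int leafCount, boolean endsWithDump) {
        if (!fetchEnabled) {
            return false; // switched off via -N / -no_fetch or opt.fetch=false
        }
        if (hasScalarAlias || usesSampleLoader) {
            return false; // "no scalar aliases", "no SampleLoader"
        }
        if (leafCount != 1 || !endsWithDump) {
            return false; // "single leaf job", "DUMP (no STORE)"
        }
        return ALLOWED.containsAll(operators); // only simple map-side operators
    }

    public static void main(String[] args) {
        System.out.println(canFetch(true, Arrays.asList("LOAD", "FILTER", "LIMIT"),
                false, false, 1, true));
        System.out.println(canFetch(true, Arrays.asList("LOAD", "ORDER"),
                false, false, 1, true));
    }
}
```

An input-size threshold, as suggested at the end of the description, would 
be one more early return in the same method.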





[jira] [Commented] (PIG-3645) Move FileLocalizer.setR() calls to unit tests

2013-12-29 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858477#comment-13858477
 ] 

Cheolsoo Park commented on PIG-3645:


I also merged trunk into tez branch, so the UUID stuff in tez branch is all 
overwritten now. 

> Move FileLocalizer.setR() calls to unit tests
> -
>
> Key: PIG-3645
> URL: https://issues.apache.org/jira/browse/PIG-3645
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>Priority: Minor
> Fix For: 0.13.0
>
> Attachments: PIG-3645-1.patch, PIG-3645-2.patch, PIG-3645-3.patch, 
> PIG-3645-4.patch, TEST-org.apache.pig.test.TestMRCompiler.txt
>
>
> Currently, temporary paths are generated by FileLocalizer using 
> Random.nextInt(). To provide strong randomness, MapReduceLauncher resets the 
> Random object every time when compiling physical plan to MR plan:
> {code}
> MRCompiler comp = new MRCompiler(php, pc);
> // This in turn calls FileLocalizer.setR(new Random()).
> comp.randomizeFileLocalizer();
> {code}
> Besides, there are a couple of places calling FileLocalizer.setR() (e.g. 
> MRCompiler) with some random seed.
> I think-
> # Randomizing Random seed is unnecessary if we switch to UUID.
> # Setting Random objects in code like this is error-prone because it can be 
> easily broken by having or missing a FileLocalizer.setR() somewhere else. See 
> an example [here|http://search-hadoop.com/m/2nxTzQXfHw1].
> So I propose that we remove all this "randomizing Random seed" code and use 
> UUID instead in temporary paths.
> For unit tests that compare the results against gold files, we should still 
> allow setting the Random seed through FileLocalizer.setR(). But this method 
> will be annotated as "VisibleForTesting" to ensure it is not used anywhere 
> other than in unit tests.
> Regarding the existing gold files, they can be easily regenerated by 
> TestMRCompiler as follows-
> {code}
> FileOutputStream fos = new FileOutputStream(expectedFile + "_new");
> PrintWriter pw = new PrintWriter(fos);
> pw.write(compiledPlan);
> pw.close(); // flush buffered output so the new gold file is complete
> {code}
> I assume there won't be any kind of regressions due to this change. But 
> please let me know if I am wrong.
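
The proposal (UUIDs in production, a seedable Random only for gold-file 
tests) might look roughly like the sketch below. It is an assumption-laden 
illustration, not the committed FileLocalizer code: the class TempPathSketch, 
the method names and the "/tmp" + suffix path layout are all made up for the 
example.

```java
import java.util.Random;
import java.util.UUID;

// Hypothetical stand-in for a temporary-path generator.
final class TempPathSketch {
    // Set only by tests that compare against gold files; null in production.
    private static Random testRandom = null;

    // Plays the role of a "VisibleForTesting" setter.
    static void setRandomForTesting(Random r) {
        testRandom = r;
    }

    static String temporaryPath(String baseDir) {
        // Production: a UUID needs no seeding and no re-randomizing.
        // Tests: a seeded Random keeps the generated names reproducible.
        String suffix = (testRandom != null)
                ? Integer.toString(testRandom.nextInt())
                : UUID.randomUUID().toString();
        return baseDir + "/tmp" + suffix;
    }

    public static void main(String[] args) {
        System.out.println(temporaryPath("/base"));
        setRandomForTesting(new Random(42L));
        System.out.println(temporaryPath("/base"));
    }
}
```

With this shape, no production code path ever has to call the setter, which 
is the failure mode the description calls error-prone.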





[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)

2013-12-29 Thread Lorand Bendig (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858473#comment-13858473
 ] 

Lorand Bendig commented on PIG-3642:


Please find attached the review request at : https://reviews.apache.org/r/16507/



[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)

2013-12-29 Thread Lorand Bendig (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858471#comment-13858471
 ] 

Lorand Bendig commented on PIG-3642:


[~cheolsoo] Yes, I'd like to have it reviewed.



Review Request 16507: PIG-3642 Direct HDFS access for small jobs (fetch)

2013-12-29 Thread Lorand Bendig

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16507/
---

Review request for pig.


Bugs: PIG-3642
https://issues.apache.org/jira/browse/PIG-3642


Repository: pig


Description
---

With this patch I'd like to add the possibility to directly read data from HDFS 
instead of launching MR jobs in case of simple (map-only) tasks. Hive already 
has this feature (fetch). This patch shares some similarities with the local 
mode of Pig 0.6. Here, fetching kicks off when the following holds for a script:

- it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM, 
  (nested) FOREACH with expression operators, custom UDFs, etc.
- no scalar aliases
- no SampleLoader
- single leaf job
- DUMP (no STORE)

The feature is enabled by default and can be toggled with:

- -N or -no_fetch
- set opt.fetch true/false;

There's no STORE support because I wanted to make it explicit that this 
"optimization" is for launching small/simple scripts during development, rather 
than querying and filtering a large number of rows on the client machine. 
However, a threshold could be given on the input size (an estimation) to 
determine whether to prefer fetch over MR jobs, similar to what Hive's 
'hive.fetch.task.conversion.threshold' does (perhaps through Pig's 
LoadMetadata#getStatistics?).


Diffs
-

  
  /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java 1553596 
  /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/FixedWidthLoader.java 1553596 
  /trunk/src/org/apache/pig/Main.java 1553596 
  /trunk/src/org/apache/pig/PigServer.java 1553596 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java 1553596 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchLauncher.java PRE-CREATION 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java PRE-CREATION 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchPOStoreImpl.java PRE-CREATION 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchProgressableReporter.java PRE-CREATION 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java 1553596 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1553596 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POStream.java 1553596 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java 1553596 
  /trunk/src/org/apache/pig/impl/util/PropertiesUtil.java 1553596 
  /trunk/src/org/apache/pig/newplan/logical/expression/ExpToPhyTranslationVisitor.java 1553596 
  /trunk/src/org/apache/pig/tools/pigstats/SimpleFetchPigStats.java PRE-CREATION 
  /trunk/test/org/apache/pig/test/TestAssert.java 1553596 
  /trunk/test/org/apache/pig/test/TestEvalPipeline2.java 1553596 
  /trunk/test/org/apache/pig/test/TestFetch.java PRE-CREATION 
  /trunk/test/org/apache/pig/test/TestPigRunner.java 1553596 

Diff: https://reviews.apache.org/r/16507/diff/


Testing
---

- new testcase added:  TestFetch
- the patch was checked against test-commit and test-core
- Because opt.fetch is set by default, the testcases were using fetch instead 
of MR jobs wherever it was possible


Thanks,

Lorand Bendig



[jira] [Updated] (PIG-3645) Move FileLocalizer.setR() calls to unit tests

2013-12-29 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3645:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to trunk. I removed FileLocalizer.getR().

Thank you Rohini for all the help with this jira!



[jira] [Commented] (PIG-3616) TestBuiltIn.testURIwithCurlyBrace() silently fails

2013-12-29 Thread Lorand Bendig (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858467#comment-13858467
 ] 

Lorand Bendig commented on PIG-3616:


Sure, I had no objections. Thank you for the update!

> TestBuiltIn.testURIwithCurlyBrace() silently fails
> --
>
> Key: PIG-3616
> URL: https://issues.apache.org/jira/browse/PIG-3616
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.12.0
>Reporter: Lorand Bendig
>Assignee: Lorand Bendig
>Priority: Minor
>  Labels: test
> Fix For: 0.13.0
>
> Attachments: PIG-3616-2.patch, PIG-3616.patch
>
>
> This test runs against MiniCluster but takes the input from the local path.
> The empty catch block swallows the exception ("input path does not exist") 
> thus making a false negative result.
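
The failure mode described here, an empty catch block hiding the real error, 
can be shown without any test framework. The sketch below is a generic 
illustration only: SwallowedExceptionSketch, loadInput and the path are 
invented for the example and are not taken from TestBuiltIn.

```java
import java.io.FileNotFoundException;

// Hypothetical illustration of a test body that hides its own failure.
final class SwallowedExceptionSketch {
    // Stand-in for the code under test: fails the way a missing input would.
    static void loadInput(String path) throws FileNotFoundException {
        throw new FileNotFoundException("input path does not exist: " + path);
    }

    // Anti-pattern: the empty catch discards the failure, so the caller
    // (think: the test runner) sees a normal return and reports success.
    static boolean swallowingTestPasses() {
        try {
            loadInput("/no/such/input");
        } catch (FileNotFoundException e) {
            // intentionally empty: this is the bug being illustrated
        }
        return true;
    }

    // Fix: declare the exception and let it propagate to the caller.
    static void propagatingTest() throws FileNotFoundException {
        loadInput("/no/such/input");
    }

    public static void main(String[] args) {
        System.out.println("swallowing variant passes: " + swallowingTestPasses());
        try {
            propagatingTest();
        } catch (FileNotFoundException e) {
            System.out.println("propagating variant fails: " + e.getMessage());
        }
    }
}
```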





[jira] [Updated] (PIG-3643) Nested Foreach with UDF and bincond is broken

2013-12-29 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3643:
---

   Resolution: Fixed
Fix Version/s: 0.13.0
   Status: Resolved  (was: Patch Available)

Committed to trunk. Thank you Rohini for the review!

> Nested Foreach with UDF and bincond is broken
> -
>
> Key: PIG-3643
> URL: https://issues.apache.org/jira/browse/PIG-3643
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.13.0
>Reporter: Rohini Palaniswamy
>Assignee: Cheolsoo Park
> Fix For: 0.13.0
>
> Attachments: PIG-3643-1.patch
>
>
> Was checking out PIG-3000. 
> A = load 'data' as (a:chararray);
> B = foreach A { c = UPPER(a); generate ((c eq 'TEST') ? 1 : 0), ((c eq 'DEV') ? 1 : 0); }
> This now throws "Invalid field projection. Projected field [c] does not exist 
> in schema".  Works fine in 0.11. Broken in trunk. Haven't checked 0.12. 





[jira] [Updated] (PIG-3616) TestBuiltIn.testURIwithCurlyBrace() silently fails

2013-12-29 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3616:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to trunk. Thank you Lorand and Rohini!



[jira] [Commented] (PIG-3643) Nested Foreach with UDF and bincond is broken

2013-12-29 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858418#comment-13858418
 ] 

Rohini Palaniswamy commented on PIG-3643:
-

+1. This patch just adds back code that was removed by PIG-3581. 



[jira] [Commented] (PIG-3645) Move FileLocalizer.setR() calls to unit tests

2013-12-29 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858410#comment-13858410
 ] 

Rohini Palaniswamy commented on PIG-3645:
-

+1. Can you remove the FileLocalizer.getR() method while committing? You had 
done that in tez branch.



[jira] [Commented] (PIG-3616) TestBuiltIn.testURIwithCurlyBrace() silently fails

2013-12-29 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858409#comment-13858409
 ] 

Rohini Palaniswamy commented on PIG-3616:
-

+1



[jira] [Commented] (PIG-3634) Improve performance of order-by

2013-12-29 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858407#comment-13858407
 ] 

Rohini Palaniswamy commented on PIG-3634:
-

[~daijy],
Is there a reviewboard link for this patch?



[jira] [Commented] (PIG-3608) ClassCastException when looking up a value from AvroMapWrapper using a Utf8 key

2013-12-29 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858404#comment-13858404
 ] 

Richard Ding commented on PIG-3608:
---

Thanks [~cheolsoo].

> ClassCastException when looking up a value from AvroMapWrapper using a Utf8 
> key
> ---
>
> Key: PIG-3608
> URL: https://issues.apache.org/jira/browse/PIG-3608
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.12.0
>Reporter: Richard Ding
>Assignee: Richard Ding
>Priority: Minor
> Fix For: 0.13.0
>
> Attachments: PIG-3608.patch, PIG-3608_2.patch
>
>
> One got the following exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.util.Utf8 incompatible with java.lang.String
>     at org.apache.pig.impl.util.avro.AvroMapWrapper.get(AvroMapWrapper.java:80)
> {code}
> This is related to the change by PIG-3420.
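
The exception above is the kind that appears when a map lookup assumes its 
key is a java.lang.String but receives another CharSequence type. The sketch 
below reproduces that shape without an Avro dependency. Everything in it is 
an assumption for illustration: FakeUtf8 is a minimal stand-in for a 
non-String key type, and neither lookup method is the actual AvroMapWrapper 
code or the fix in the attached patches.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a String-only key cast and a tolerant lookup.
final class Utf8KeyLookupSketch {
    // Minimal stand-in for a CharSequence key that is not a String.
    static final class FakeUtf8 implements CharSequence {
        private final String value;

        FakeUtf8(String value) {
            this.value = value;
        }

        public int length() {
            return value.length();
        }

        public char charAt(int index) {
            return value.charAt(index);
        }

        public CharSequence subSequence(int start, int end) {
            return value.subSequence(start, end);
        }

        @Override
        public String toString() {
            return value;
        }
    }

    // Fragile: throws ClassCastException for any key that is not a String.
    static Object getWithCast(Map<String, Object> map, Object key) {
        return map.get((String) key);
    }

    // Tolerant: normalizes the key to its String form before the lookup.
    static Object getNormalized(Map<String, Object> map, Object key) {
        return key == null ? null : map.get(key.toString());
    }

    public static void main(String[] args) {
        Map<String, Object> map = new HashMap<>();
        map.put("color", "red");
        System.out.println(getNormalized(map, new FakeUtf8("color")));
    }
}
```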





[jira] [Updated] (PIG-3646) LoadFunc cannot get a hold of the associated user defined schema

2013-12-29 Thread Costin Leau (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Costin Leau updated PIG-3646:
-

Description: 
Described on the mailing list here: 
http://www.mail-archive.com/user%40pig.apache.org/msg09009.html

A Pig {{LoadFunc}} cannot get a hold of its associated schema. For example, in 
the following script:
{code}
A = LOAD 'pig/tupleartists' USING MyStorage() AS (name: chararray, links 
(url:chararray, picture:chararray));
B = FOREACH A GENERATE name, links.url;
DUMP B;
{code}

{{MyStorage}} cannot get a hold of {{(name:chararray, links ...}} even when 
{{LoadPushDown#pushProjection()}} is implemented (which is called only when a 
transformation occurs - PlanOptimizer/ColumnMapKeyPrune).

One can look into a {{POStore}}, but even then the information obtained is 
incomplete: the schema is partial, and the fields mentioned in 
{{FOREACH}} are dereferenced, so {{links.url}} is returned as {{url}}.

The purpose of this issue is to allow a {{LoadFunc}} implementation to get 
access to its schema declaration as specified in the script.

Thanks!

  was:
Described on the mailing list here: 
http://www.mail-archive.com/user%40pig.apache.org/msg09009.html

A Pig LoadFunc cannot get a hold of its associated schema. For example, in the 
following script:



> LoadFunc cannot get a hold of the associated user defined schema
> 
>
> Key: PIG-3646
> URL: https://issues.apache.org/jira/browse/PIG-3646
> Project: Pig
>  Issue Type: Bug
>  Components: data
>Affects Versions: 0.12.0
>Reporter: Costin Leau
>





[jira] [Created] (PIG-3646) LoadFunc cannot get a hold of the associated user defined schema

2013-12-29 Thread Costin Leau (JIRA)
Costin Leau created PIG-3646:


 Summary: LoadFunc cannot get a hold of the associated user defined 
schema
 Key: PIG-3646
 URL: https://issues.apache.org/jira/browse/PIG-3646
 Project: Pig
  Issue Type: Bug
  Components: data
Affects Versions: 0.12.0
Reporter: Costin Leau


Described on the mailing list here: 
http://www.mail-archive.com/user%40pig.apache.org/msg09009.html

A Pig LoadFunc cannot get a hold of its associated schema. For example, in the 
following script:






[jira] [Updated] (PIG-3634) Improve performance of order-by

2013-12-29 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3634:


Attachment: PIG-3634-2.patch

Not sure why it worked before PIG-3636. Reattaching the patch.
