[jira] [Commented] (PIG-3634) Improve performance of order-by

2014-01-02 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860079#comment-13860079
 ] 

Daniel Dai commented on PIG-3634:
-

Thanks Cheolsoo!

 Improve performance of order-by
 ---

 Key: PIG-3634
 URL: https://issues.apache.org/jira/browse/PIG-3634
 Project: Pig
  Issue Type: Sub-task
  Components: tez
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: tez-branch

 Attachments: PIG-3634-0.patch, PIG-3634-1.patch, PIG-3634-2.patch, 
 PIG-3634-3.patch


 This is a followup for PIG-3534. In PIG-3534, we use 5 vertexes (3 DAGs) to 
 implement an order-by. We can optimize to use 4 vertexes in 1 DAG:
 vertex 1: close the current vertex, create input + samples input
 vertex 2: aggregate samples to create quantiles
 vertex 3: use quantiles to partition input
 vertex 4: sort input after partition
 The DAG is:
 {code}
 vertex 1 -------------> vertex 3 --> vertex 4
      \--> vertex 2 ---/
 {code}
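
 The four stages listed above can be illustrated outside of Tez. What follows is a minimal Python sketch of sample-based range partitioning, not the patch's implementation; the function names, the sampling rate, and the number of partitions are all assumptions made for illustration.

```python
import bisect
import random


def sample_input(rows, rate, seed=0):
    """Vertex 1 (sketch): emit a random sample of the input rows."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < rate]


def build_quantiles(samples, num_partitions):
    """Vertex 2 (sketch): turn the samples into num_partitions - 1 cut points."""
    ordered = sorted(samples)
    if not ordered:
        return []
    return [ordered[(i * len(ordered)) // num_partitions]
            for i in range(1, num_partitions)]


def partition_input(rows, cut_points):
    """Vertex 3 (sketch): route each row to a range partition via the cut points."""
    partitions = [[] for _ in range(len(cut_points) + 1)]
    for r in rows:
        partitions[bisect.bisect_right(cut_points, r)].append(r)
    return partitions


def sort_partitions(partitions):
    """Vertex 4 (sketch): sort each partition; their concatenation is fully ordered."""
    return [x for part in partitions for x in sorted(part)]


def order_by(rows, num_partitions=4, rate=0.1):
    """Chain the four stages in the order the DAG wires them together."""
    samples = sample_input(rows, rate)
    cut_points = build_quantiles(samples, num_partitions)
    return sort_partitions(partition_input(rows, cut_points))
```

 Because every row in one range partition is smaller than every row in the next, each partition can be sorted independently. A poor sample only skews partition sizes; it does not affect the correctness of the final order.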



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)

2014-01-02 Thread Gianmarco De Francisci Morales (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860156#comment-13860156
 ] 

Gianmarco De Francisci Morales commented on PIG-3642:
-

I am -0 on this idea.
Skipping MR requires rewriting a good part of the execution logic, and might 
introduce weird optimization bugs.
More importantly, the advantage brought by this feature is small.
Usually, if you want to test your program on a small input, you copy it locally 
and run Pig in local mode.

 Direct HDFS access for small jobs (fetch) 
 --

 Key: PIG-3642
 URL: https://issues.apache.org/jira/browse/PIG-3642
 Project: Pig
  Issue Type: Improvement
Reporter: Lorand Bendig
Assignee: Lorand Bendig
 Fix For: 0.13.0

 Attachments: PIG-3642.patch


 With this patch I'd like to add the possibility to directly read data from 
 HDFS instead of launching MR jobs in case of simple (map-only) tasks. Hive 
 already has this feature (fetch). This patch shares some similarities with 
 the local mode of Pig 0.6. Here, fetching kicks in when all of the following 
 hold for a script:
 * it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM, 
 (nested) FOREACH with expression operators, custom UDFs, etc.
 * no scalar aliases
 * no SampleLoader
 * single leaf job
 * DUMP (no STORE)
 The feature is enabled by default and can be toggled with:
 * -N or -no_fetch 
 * set opt.fetch true/false; 
 There's no STORE support because I wanted to make it explicit that this 
 optimization is for launching small/simple scripts during development, 
 rather than querying and filtering a large number of rows on the client 
 machine. However, a threshold could be given on the (estimated) input size 
 to determine whether to prefer fetch over MR jobs, similar to 
 what Hive's '{{hive.fetch.task.conversion.threshold}}' does (perhaps through 
 Pig's LoadMetadata#getStatistics?).
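
 The conditions above amount to a predicate over the script's plan. The sketch below is a simplified Python model of that check, not the logic in FetchOptimizer; the `Plan` fields, the `is_fetchable` name, and the optional size threshold are hypothetical and only mirror the list in the description.

```python
from dataclasses import dataclass
from typing import List, Optional

# Operators the description lists as fetch-friendly.
# LOAD is included on the assumption that every script starts with one.
FETCHABLE_OPS = {"LOAD", "LIMIT", "FILTER", "UNION", "STREAM", "FOREACH"}


@dataclass
class Plan:
    """Hypothetical, flattened view of a script's plan."""
    operators: List[str]
    leaf_count: int = 1
    leaf_is_dump: bool = True
    has_scalar_alias: bool = False
    uses_sample_loader: bool = False
    generates_split: bool = False
    input_bytes: Optional[int] = None  # estimated input size, if known


def is_fetchable(plan, opt_fetch=True, max_input_bytes=None):
    """Return True if the plan may be read directly instead of launching MR."""
    if not opt_fetch:
        return False
    if any(op not in FETCHABLE_OPS for op in plan.operators):
        return False
    if plan.has_scalar_alias or plan.uses_sample_loader or plan.generates_split:
        return False
    if plan.leaf_count != 1 or not plan.leaf_is_dump:
        return False
    # Optional threshold: fall back to MR when the estimated input is too large.
    if max_input_bytes is not None and plan.input_bytes is not None:
        return plan.input_bytes <= max_input_bytes
    return True
```

 For example, `is_fetchable(Plan(["LOAD", "FILTER", "LIMIT"]))` is true, while a plan containing a GROUP, or one whose leaf is a STORE, is rejected and would go through the regular MR path.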





[jira] [Updated] (PIG-3453) Implement a Storm backend to Pig

2014-01-02 Thread Gianmarco De Francisci Morales (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gianmarco De Francisci Morales updated PIG-3453:


Status: Open  (was: Patch Available)

Canceling patch as it is not ready to be committed.

 Implement a Storm backend to Pig
 

 Key: PIG-3453
 URL: https://issues.apache.org/jira/browse/PIG-3453
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.13.0
Reporter: Pradeep Gollakota
Assignee: Jacob Perkins
  Labels: storm
 Fix For: 0.13.0

 Attachments: storm-integration.patch


 There is a lot of interest around implementing a Storm backend to Pig for 
 streaming processing. The proposal and initial discussions can be found at 
 https://cwiki.apache.org/confluence/display/PIG/Pig+on+Storm+Proposal





Re: Review Request 16507: PIG-3642 Direct HDFS access for small jobs (fetch)

2014-01-02 Thread Lorand Bendig


 On Dec. 30, 2013, 9:50 p.m., Cheolsoo Park wrote:
  This is great work. Thank you very much!
  
  I have a few minor comments below, mostly about tests.

Cheolsoo, thanks for taking the time to review it!
I fixed or commented on the issues.


 On Dec. 30, 2013, 9:50 p.m., Cheolsoo Park wrote:
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java,
   line 74
  https://reviews.apache.org/r/16507/diff/1/?file=404117#file404117line74
 
  Can you move this to PigConfiguration?

PigConfiguration seems to be a better place to put OPT_FETCH


 On Dec. 30, 2013, 9:50 p.m., Cheolsoo Park wrote:
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POStream.java,
   line 23
  https://reviews.apache.org/r/16507/diff/1/?file=404122#file404122line23
 
  Do you mind removing unused imports?
  - import java.util.LinkedList;
  - import org.apache.pig.impl.util.IdentityHashSet;
  - import org.apache.pig.pen.util.LineageTracer;

Sure. Initially I didn't want to remove these leftovers from PIG-1712.


 On Dec. 30, 2013, 9:50 p.m., Cheolsoo Park wrote:
  /trunk/test/org/apache/pig/test/TestAssert.java, lines 91-94
  https://reviews.apache.org/r/16507/diff/1/?file=404127#file404127line91
 
  Does this else block ever get executed given that we're running the test 
  with opt.fetch on?
  
  I think you can do either-
  
  1) explicitly set opt.fetch to true or false in setup(),
  
  or
  
  2) change the test to run the query twice with opt.fetch on and off to 
  ensure we're not breaking anything when opt.fetch is off.

Not really. I chose the second option


 On Dec. 30, 2013, 9:50 p.m., Cheolsoo Park wrote:
  /trunk/test/org/apache/pig/test/TestPigRunner.java, line 174
  https://reviews.apache.org/r/16507/diff/1/?file=404130#file404130line174
 
  Why is this changed? I think the default value for opt.multiquery is 
  true.

I accidentally changed it


- Lorand


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16507/#review30936
---


On Dec. 29, 2013, 11:19 p.m., Lorand Bendig wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/16507/
 ---
 
 (Updated Dec. 29, 2013, 11:19 p.m.)
 
 
 Review request for pig.
 
 
 Bugs: PIG-3642
 https://issues.apache.org/jira/browse/PIG-3642
 
 
 Repository: pig
 
 
 Diffs
 -
 
   
   /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java 1553596 
   /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/FixedWidthLoader.java 1553596 
   /trunk/src/org/apache/pig/Main.java 1553596 
   /trunk/src/org/apache/pig/PigServer.java 1553596 
   /trunk/src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java 1553596 
   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchLauncher.java PRE-CREATION 
   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java PRE-CREATION 
   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchPOStoreImpl.java PRE-CREATION 
   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchProgressableReporter.java PRE-CREATION 
   /trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java 1553596 
   /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1553596 
   
 

Re: Review Request 16507: PIG-3642 Direct HDFS access for small jobs (fetch)

2014-01-02 Thread Lorand Bendig

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16507/
---

(Updated Jan. 2, 2014, 2:05 p.m.)


Review request for pig.


Changes
---

Updated patch: PIG-3642-2.patch


Bugs: PIG-3642
https://issues.apache.org/jira/browse/PIG-3642


Repository: pig


Diffs (updated)
-

  
  /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java 1554785 
  /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/FixedWidthLoader.java 1554785 
  /trunk/src/org/apache/pig/Main.java 1554785 
  /trunk/src/org/apache/pig/PigConfiguration.java 1554785 
  /trunk/src/org/apache/pig/PigServer.java 1554785 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java 1554785 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchLauncher.java PRE-CREATION 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java PRE-CREATION 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchPOStoreImpl.java PRE-CREATION 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchProgressableReporter.java PRE-CREATION 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java 1554785 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1554785 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POStream.java 1554785 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java 1554785 
  /trunk/src/org/apache/pig/impl/util/PropertiesUtil.java 1554785 
  /trunk/src/org/apache/pig/newplan/logical/expression/ExpToPhyTranslationVisitor.java 1554785 
  /trunk/src/org/apache/pig/tools/pigstats/SimpleFetchPigStats.java PRE-CREATION 
  /trunk/test/org/apache/pig/test/TestAssert.java 1554785 
  /trunk/test/org/apache/pig/test/TestEvalPipeline2.java 1554785 
  /trunk/test/org/apache/pig/test/TestFetch.java PRE-CREATION 
  /trunk/test/org/apache/pig/test/TestPigRunner.java 1554785 
Diff: https://reviews.apache.org/r/16507/diff/


Testing
---

- new testcase added:  TestFetch
- the patch was checked against test-commit and test-core
- Because opt.fetch is set by default, the testcases were using fetch instead 
of MR jobs wherever it was possible


Thanks,

Lorand Bendig



[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)

2014-01-02 Thread Lorand Bendig (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860224#comment-13860224
 ] 

Lorand Bendig commented on PIG-3642:


[~azaroth], I took the idea of this patch from HIVE-2925 and PIG-2864. I agree 
that the benefit is limited; however, simple scripts/queries will run 
significantly faster than in local MR mode. As far as I can judge, aside from 
some mocking and initialization, the execution logic literally follows Pig's 
pull-based model. What optimization bugs do you think could happen?




[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)

2014-01-02 Thread Gianmarco De Francisci Morales (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860228#comment-13860228
 ] 

Gianmarco De Francisci Morales commented on PIG-3642:
-

I haven't reviewed the patch thoroughly, so take my comments with due care.
I am just afraid that we will repeat the mistake we made with the local 
mode execution of Pig that you mention in the ticket.
That mode of execution was removed because it was a burden to maintain, and in 
the end the two implementations (MR and local mode) were out of sync, 
resulting in the same script doing different things.
I just want to avoid the same thing happening again.

If [~cheolsoo] has reviewed the patch, I would like to hear his comments on 
this issue.



[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)

2014-01-02 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860392#comment-13860392
 ] 

Cheolsoo Park commented on PIG-3642:


[~azaroth], thank you for raising a concern. But I still think we should commit 
this patch for the following reasons:

# Fetch optimization happens after the physical plan is fully built. If the 
plan is fetchable (i.e. meets all the conditions Lorand listed in the 
description), Pig will launch a job via FetchLauncher instead of via 
MapReduceLauncher. Given this code path, I think the possibility of 
introducing a weird optimization bug is minimal. In addition, the optimization 
is only applicable to fairly small queries.
# There are indeed changes to some backend operators such as POStream. This is 
because the logic about when to pull data from the pipeline is different in 
some cases. But these changes are fairly minimal too.
# IMO, the benefit of this optimization is big. I am constantly asked by users 
about this feature. It is true that it won't improve the performance of 
production ETL jobs, but it will shorten development iteration. In addition, 
launching a full MR job for a simple load/dump query definitely makes a bad 
impression on new users.








[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)

2014-01-02 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860531#comment-13860531
 ] 

Alan Gates commented on PIG-3642:
-

I don't think this will result in the same local mode/mr mode problem that we 
had before.  The issue there was we tried (and failed) to have two modes where 
Pig provided all features.  This is much more limited to doing things locally 
that can easily be done locally.



[jira] [Commented] (PIG-3615) Update the way that JsonLoader/JsonStorage deal with BigDecimal

2014-01-02 Thread Erik Selin (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860699#comment-13860699
 ] 

Erik Selin commented on PIG-3615:
-

Could someone take a look at this tiny little PR? It would be great to get it 
merged or at least have a discussion about it :)

 Update the way that JsonLoader/JsonStorage deal with BigDecimal
 ---

 Key: PIG-3615
 URL: https://issues.apache.org/jira/browse/PIG-3615
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.12.0
Reporter: Erik Selin
Priority: Minor
 Attachments: bugPig-3615.patch


 It's a common (and good) convention to quote fixed-point numbers when storing 
 them as JSON. The reason is that the majority of JSON libraries will 
 implicitly load any number value as a floating-point number, and if you care 
 about data integrity this will make you very sad.
 This update makes JsonLoader able to load BigDecimal values from quoted 
 values (the old Jackson library that we're using doesn't support this through 
 the current approach) and makes JsonStorage store BigDecimal values 
 as quoted strings.
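
 The precision problem described above is easy to reproduce outside of Pig. This is a small Python illustration using the standard `json` and `decimal` modules rather than JsonLoader, JsonStorage, or Jackson; the field name `price` and the sample value are made up.

```python
import json
from decimal import Decimal

value = Decimal("12345678901234567890.123456789")

# Stored as a bare JSON number, the value is parsed as a binary float by default.
unquoted = json.loads('{"price": 12345678901234567890.123456789}')
assert isinstance(unquoted["price"], float)
assert Decimal(unquoted["price"]) != value  # digits were lost in the float

# Stored as a quoted string, the exact digits survive the round trip.
quoted_doc = json.dumps({"price": str(value)})
restored = Decimal(json.loads(quoted_doc)["price"])
assert restored == value
```

 Python's `json.loads` can also be told to parse numbers as `Decimal` through its `parse_float` argument, but that only helps when every reader is configured the same way. Quoting the value puts the intent in the data itself.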





[jira] [Commented] (PIG-3615) Update the way that JsonLoader/JsonStorage deal with BigDecimal

2014-01-02 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860715#comment-13860715
 ] 

Cheolsoo Park commented on PIG-3615:


[~tyro89], please hit the Submit Patch button. That will make this jira show 
up in the Patch Available list.



[jira] [Updated] (PIG-3615) Update the way that JsonLoader/JsonStorage deal with BigDecimal

2014-01-02 Thread Erik Selin (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Selin updated PIG-3615:


Status: Patch Available  (was: Open)



[jira] Subscription: PIG patch available

2014-01-02 Thread jira
Issue Subscription
Filter: PIG patch available (8 issues)

Subscriber: pigdaily

Key         Summary
PIG-3642    Direct HDFS access for small jobs (fetch)
            https://issues.apache.org/jira/browse/PIG-3642
PIG-3635    Fix e2e tests for Hadoop 2.X on Windows
            https://issues.apache.org/jira/browse/PIG-3635
PIG-3615    Update the way that JsonLoader/JsonStorage deal with BigDecimal
            https://issues.apache.org/jira/browse/PIG-3615
PIG-3613    UDF for SimilarityMatching between strings with matching scores
            https://issues.apache.org/jira/browse/PIG-3613
PIG-3587    add functionality for rolling over dates
            https://issues.apache.org/jira/browse/PIG-3587
PIG-3573    Provide StoreFunc and LoadFunc for Accumulo
            https://issues.apache.org/jira/browse/PIG-3573
PIG-3441    Allow Pig to use default resources from Configuration objects
            https://issues.apache.org/jira/browse/PIG-3441
PIG-3347    Store invocation brings side effect
            https://issues.apache.org/jira/browse/PIG-3347

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=13225&filterId=12322384


Re: Review Request 16507: PIG-3642 Direct HDFS access for small jobs (fetch)

2014-01-02 Thread Cheolsoo Park

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16507/#review31099
---


I have one last comment below. Other than that, everything looks good.

Also, can you document this? I think it's worth mentioning in the Performance 
and Efficiency section of the manual. You can post a doc patch in a separate 
jira if you'd like.


/trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java
https://reviews.apache.org/r/16507/#comment59452

This won't work if the temporary file storage is not InterStorage. It can 
be one of Inter, TFile, and SequenceFile storages.

See here-

https://github.com/apache/pig/blob/trunk/src/org/apache/pig/impl/util/Utils.java#L347
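
One way to picture the concern: the reader for intermediate files has to be chosen from the configuration rather than assumed. The sketch below is a Python model of such a lookup, not the code in Utils.java. The property names `pig.tmpfilecompression` and `pig.tmpfilecompression.storage`, their values, and the returned class names are assumptions here and should be checked against the Pig source.

```python
def tmp_file_storage(props):
    """Pick the intermediate-file storage name from a dict of properties.

    All keys, values, and returned names are assumptions for illustration.
    """
    compressed = str(props.get("pig.tmpfilecompression", "false")).lower() == "true"
    if not compressed:
        return "InterStorage"
    storage = props.get("pig.tmpfilecompression.storage", "tfile").lower()
    choices = {"tfile": "TFileStorage", "seqfile": "SequenceFileInterStorage"}
    if storage not in choices:
        raise ValueError("unsupported temporary file storage: " + storage)
    return choices[storage]
```

Under these assumptions, a fetch path that hard-codes the first branch would misread intermediate data whenever compression is switched on, which is the case the comment above points at.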



- Cheolsoo Park

