[jira] [Commented] (PIG-3419) Pluggable Execution Engine
    [ https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747131#comment-13747131 ]

Mark Wagner commented on PIG-3419:
----------------------------------

I'd also be in favor of putting this in trunk as opposed to a Tez branch. Although the motivation for this is Tez, I think we would want this patch in Pig even if there wasn't Tez support.

A couple short comments for Achal:
* It looks like the build targets that include the META-INF are only executed when building against hadoopversion=23. The META-INF also doesn't seem to be included in the pig.jar and pig-withouthadoop.jar that go in the root directory. I tried copying in the correct jars, but it seems like something is still off.
* The changes to the try/catch blocks in MapReduceLauncher break on 23, because HadoopShims for 23 doesn't throw an exception where 20 does. Maybe that should be fixed in HadoopShims though.

> Pluggable Execution Engine
> --------------------------
>
>                 Key: PIG-3419
>                 URL: https://issues.apache.org/jira/browse/PIG-3419
>             Project: Pig
>          Issue Type: New Feature
>    Affects Versions: 0.12
>            Reporter: Achal Soni
>            Assignee: Achal Soni
>            Priority: Minor
>         Attachments: execengine.patch, finalpatch.patch, mapreduce_execengine.patch, stats_scriptstate.patch, test_suite.patch
>
>
> In an effort to adapt Pig to work using Apache Tez (https://issues.apache.org/jira/browse/TEZ), I made some changes to allow for a cleaner ExecutionEngine abstraction than existed before. The changes are not that major as Pig was already relatively abstracted out between the frontend and backend. The changes in the attached commit are essentially the barebones changes -- I tried to not change the structure of Pig's different components too much. I think it will be interesting to see in the future how we can refactor more areas of Pig to really honor this abstraction between the frontend and backend.
> Some of the changes were to reinstate an ExecutionEngine interface to tie together the frontend and backend, to make the changes in Pig to delegate to the EE when necessary, and to create an MRExecutionEngine that implements this interface. Other work included changing ExecType to cycle through the ExecutionEngines on the classpath and select the appropriate one (this is done using Java ServiceLoader, exactly how MapReduce does for choosing the framework to use between local and distributed mode). Also I tried to make ScriptState, JobStats, and PigStats as abstract as possible in their current state. I think in the future some work will need to be done here to perhaps re-evaluate the usage of ScriptState and the responsibilities of the different statistics classes. I haven't touched the PPNL, but I think more abstraction is needed here, perhaps in a separate patch.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
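The issue description above says engines are discovered on the classpath with Java's ServiceLoader. As a rough illustration of that mechanism only, here is a minimal sketch. Every name in it (EngineLookup, Engine, DefaultEngine, accepts, select) is hypothetical and is not taken from the PIG-3419 patch; the only real API used is java.util.ServiceLoader.

```java
import java.util.ServiceLoader;

// Hypothetical names throughout: this is NOT the interface from the PIG-3419 patch.
class EngineLookup {

    /** Stand-in for an execution-engine contract. */
    interface Engine {
        boolean accepts(String execType);
        String name();
    }

    /** Built-in fallback used when no discovered provider claims the requested type. */
    static final class DefaultEngine implements Engine {
        public boolean accepts(String execType) { return true; }
        public String name() { return "default"; }
    }

    /** Returns the first candidate that accepts execType, else the fallback. */
    static Engine select(Iterable<Engine> candidates, String execType) {
        for (Engine e : candidates) {
            if (e.accepts(execType)) {
                return e;
            }
        }
        return new DefaultEngine();
    }

    public static void main(String[] args) {
        // ServiceLoader looks for provider class names listed in a
        // META-INF/services/<binary name of the interface> resource on the classpath.
        ServiceLoader<Engine> discovered = ServiceLoader.load(Engine.class);
        System.out.println(select(discovered, "mapreduce").name());
    }
}
```

Because discovery depends entirely on the META-INF/services resource being present in the jar, a build that leaves it out would find no providers, which may be related to the packaging problem Mark describes.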
[jira] Subscription: PIG patch available
Issue Subscription
Filter: PIG patch available (19 issues)
Subscriber: pigdaily

Key         Summary
PIG-3431    Return more information for parsing related exceptions.
            https://issues.apache.org/jira/browse/PIG-3431
PIG-3430    Add xml format for explaining MapReduce Plan.
            https://issues.apache.org/jira/browse/PIG-3430
PIG-3426    Add support for removing s3 files
            https://issues.apache.org/jira/browse/PIG-3426
PIG-3419    Pluggable Execution Engine
            https://issues.apache.org/jira/browse/PIG-3419
PIG-3379    Alias reuse in nested foreach causes PIG script to fail
            https://issues.apache.org/jira/browse/PIG-3379
PIG-3374    CASE and IN fail when expression includes dereferencing operator
            https://issues.apache.org/jira/browse/PIG-3374
PIG-3349    Document ToString(Datetime, String) UDF
            https://issues.apache.org/jira/browse/PIG-3349
PIG-3346    New property that controls the number of combined splits
            https://issues.apache.org/jira/browse/PIG-3346
PIG-        Fix remaining Windows core unit test failures
            https://issues.apache.org/jira/browse/PIG-
PIG-3325    Adding a tuple to a bag is slow
            https://issues.apache.org/jira/browse/PIG-3325
PIG-3295    Casting from bytearray failing after Union (even when each field is from a single Loader)
            https://issues.apache.org/jira/browse/PIG-3295
PIG-3292    Logical plan invalid state: duplicate uid in schema during self-join to get cross product
            https://issues.apache.org/jira/browse/PIG-3292
PIG-3257    Add unique identifier UDF
            https://issues.apache.org/jira/browse/PIG-3257
PIG-3199    Expose LogicalPlan via PigServer API
            https://issues.apache.org/jira/browse/PIG-3199
PIG-3168    TestMultiQueryBasic.testMultiQueryWithSplitInMapAndMultiMerge fails in trunk
            https://issues.apache.org/jira/browse/PIG-3168
PIG-3117    A debug mode in which pig does not delete temporary files
            https://issues.apache.org/jira/browse/PIG-3117
PIG-3088    Add a builtin udf which removes prefixes
            https://issues.apache.org/jira/browse/PIG-3088
PIG-3048    Add mapreduce workflow information to job configuration
            https://issues.apache.org/jira/browse/PIG-3048
PIG-3021    Split results missing records when there is null values in the column comparison
            https://issues.apache.org/jira/browse/PIG-3021

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=13225&filterId=12322384
Re: Slow Group By operator
Hi Benjamin,

Thank you very much for sharing detailed information!

1) From the runtime numbers that you provided, the mappers are very slow.

                     Map        Reduce   Total
CPU time spent (ms)  5,081,610  168,740  5,250,350
CPU time spent (ms)  5,052,700  178,220  5,230,920
CPU time spent (ms)  5,084,430  193,480  5,277,910

2) In your GROUP BY query, you have an algebraic UDF "COUNT".

I am wondering whether disabling the combiner will help here. I have seen a lot of cases where the combiner actually hurts performance significantly when it doesn't reduce the mapper output much. Briefly looking at generate_data.pl in PIG-200, it looks like a lot of random keys are generated. So I guess you will end up with a large number of small bags rather than a small number of large bags. If that's the case, the combiner will only add overhead to the mappers.

Can you try to include this "set pig.exec.nocombiner true;" and see whether it helps?

Thanks,
Cheolsoo

On Wed, Aug 21, 2013 at 3:52 AM, Benjamin Jakobus wrote:

> Hi Cheolsoo,
>
> >> What's your query like? Can you share it? Do you call any algebraic UDF
> >> after group by? I am wondering whether combiner matters in your test.
> I have been running 3 different types of queries.
>
> The first was performed on datasets of 6 different sizes:
>
>    - Dataset size 1: 30,000 records (772KB)
>    - Dataset size 2: 300,000 records (6.4MB)
>    - Dataset size 3: 3,000,000 records (63MB)
>    - Dataset size 4: 30 million records (628MB)
>    - Dataset size 5: 300 million records (6.2GB)
>    - Dataset size 6: 3 billion records (62GB)
>
> The datasets scale linearly, whereby the size equates to 3000 * 10^n.
> A seventh dataset consisting of 1,000 records (23KB) was produced to
> perform join operations on. Its schema is as follows:
> name - string
> marks - integer
> gpa - float
> The data was generated using the generate_data.pl Perl script available
> for download from https://issues.apache.org/jira/browse/PIG-200 to produce
> the datasets.
> The results are as follows:
>
>              Set 1   Set 2   Set 3   Set 4    Set 5     Set 6
> Arithmetic   32.82   36.21   49.49   83.25    423.63    3900.78
> Filter 10%   32.94   34.32   44.56   66.68    295.59    2640.52
> Filter 90%   33.93   32.55   37.86   53.22    197.36    1657.37
> Group        49.43   53.34   69.84   105.12   497.61    4394.21
> Join         49.89   50.08   78.55   150.39   1045.34   10258.19
>
> *Averaged performance of arithmetic, join, group, order, distinct select
> and filter operations on six datasets using Pig. Scripts were configured as
> to use 8 reduce and 11 map tasks.*
>
>              Set 1   Set 2   Set 3   Set 4    Set 5     Set 6
> Arithmetic   32.84   37.33   72.55   300.08   2633.72   27821.19
> Filter 10%   32.36   53.28   59.22   209.5    1672.3    18222.19
> Filter 90%   31.23   32.68   36.8    69.55    331.88    3320.59
> Group        48.27   47.68   46.87   53.66    141.36    1233.4
> Join         48.54   56.86   104.6   517.5    4388.34   -
> Distinct     48.73   53.28   72.54   109.77   -         -
>
> *Averaged performance of arithmetic, join, group, distinct select and
> filter operations on six datasets using Hive. Scripts were configured as to
> use 8 reduce and 11 map tasks.*
>
> (If you want to see the standard deviation, let me know).
>
> So, to summarize the results: Pig outperforms Hive, with the exception of
> using *Group By*.
>
> The Pig scripts used for this benchmark are as follows:
>
> *Arithmetic*
> -- Generate with basic arithmetic
> A = load '$input/dataset_3' using PigStorage('\t') as (name, age, gpa) PARALLEL $reducers;
> B = foreach A generate age * gpa + 3, age/gpa - 1.5 PARALLEL $reducers;
> store B into '$output/dataset_3_projection' using PigStorage() PARALLEL $reducers;
>
> *Filter 10%*
> -- Filter that removes 10% of data
> A = load '$input/dataset_3' using PigStorage('\t') as (name, age, gpa) PARALLEL $reducers;
> B = filter A by gpa < '3.6' PARALLEL $reducers;
> store B into '$output/dataset_3_filter_10' using PigStorage() PARALLEL $reducers;
>
> *Filter 90%*
> -- Filter that removes 90% of data
> A = load '$input/dataset_3' using PigStorage('\t') as (name, age, gpa) PARALLEL $reducers;
> B = filter A by age < '25' PARALLEL $reducers;
> store B into '$output/dataset_3_filter_90' using PigStorage() PARALLEL $reducers;
>
> *Group*
> A = load '$input/dataset_3' using PigStorage('\t') as (name, age, gpa)
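Cheolsoo's argument in this thread is that a combiner only pays off when many map-side records share a key. The following is a small simulation of that effect, not Pig or Hadoop code: it counts how many (key, partial count) records a COUNT-style combiner would emit for one mapper's input. The class and method names, the key ranges, and the seed are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Illustrative simulation only; none of these names come from Pig or Hadoop.
class CombinerSketch {

    /** Number of (key, partialCount) records left after combining one mapper's keys. */
    static int combinedRecords(List<Integer> keys) {
        Map<Integer, Long> partial = new HashMap<>();
        for (int k : keys) {
            partial.merge(k, 1L, Long::sum); // one partial count per distinct key
        }
        return partial.size();
    }

    /** Generates n keys drawn uniformly from [0, distinct). */
    static List<Integer> randomKeys(int n, int distinct, long seed) {
        Random rnd = new Random(seed);
        List<Integer> keys = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            keys.add(rnd.nextInt(distinct));
        }
        return keys;
    }

    public static void main(String[] args) {
        int n = 100_000;
        // Few distinct keys: the combiner collapses the input to at most 100 records.
        System.out.println("few keys:  " + n + " -> "
                + combinedRecords(randomKeys(n, 100, 42L)));
        // Mostly unique keys: the combiner emits nearly as many records as it read.
        System.out.println("many keys: " + n + " -> "
                + combinedRecords(randomKeys(n, Integer.MAX_VALUE, 42L)));
    }
}
```

In the second case the combiner does the grouping work without shrinking the data much, which is the situation where "set pig.exec.nocombiner true;" might be worth trying.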
[jira] [Commented] (PIG-3419) Pluggable Execution Engine
    [ https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747064#comment-13747064 ]

Julien Le Dem commented on PIG-3419:
------------------------------------

The point is to be able to implement alternate execution engines without having to fork Pig. I think it should go in trunk.
[jira] [Updated] (PIG-3436) Make pigmix run with Hadoop2
    [ https://issues.apache.org/jira/browse/PIG-3436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohini Palaniswamy updated PIG-3436:
------------------------------------

    Attachment: PIG-3436-1.patch

> Make pigmix run with Hadoop2
> ----------------------------
>
>                 Key: PIG-3436
>                 URL: https://issues.apache.org/jira/browse/PIG-3436
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.12
>
>         Attachments: PIG-3436-1.patch
[jira] [Commented] (PIG-3419) Pluggable Execution Engine
    [ https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747065#comment-13747065 ]

Dmitriy V. Ryaboy commented on PIG-3419:
----------------------------------------

I'd like this patch in trunk since it's not Tez-specific, and allows people to experiment with other runtimes (for example, Spark or Drill).
[jira] [Created] (PIG-3436) Make pigmix run with Hadoop2
Rohini Palaniswamy created PIG-3436:
---------------------------------------

             Summary: Make pigmix run with Hadoop2
                 Key: PIG-3436
                 URL: https://issues.apache.org/jira/browse/PIG-3436
             Project: Pig
          Issue Type: Improvement
            Reporter: Rohini Palaniswamy
            Assignee: Rohini Palaniswamy
             Fix For: 0.12
         Attachments: PIG-3436-1.patch
[jira] [Commented] (PIG-3419) Pluggable Execution Engine
    [ https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747021#comment-13747021 ]

Cheolsoo Park commented on PIG-3419:
------------------------------------

[~julienledem], I haven't looked at it yet, but I will review it tonight. I will also run full unit tests.

Btw, I was meeting Mark, Olga, Rohini, and Daniel at LinkedIn this morning. We decided to create a tez branch. Rohini suggested that this patch should go into that branch instead of trunk. Can we agree where we should commit this patch first?

Personally, I think this can go into trunk directly since it's quite general. But there were some concerns.
[jira] [Commented] (PIG-3419) Pluggable Execution Engine
    [ https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13746999#comment-13746999 ]

Julien Le Dem commented on PIG-3419:
------------------------------------

I have submitted my review. This looks great [~achalsoni81]!
[~cheolsoo] does it look good to you?
Once Achal has updated his patch I'm willing to commit.
[jira] [Commented] (PIG-3385) DISTINCT no longer uses custom partitioner
    [ https://issues.apache.org/jira/browse/PIG-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13746937#comment-13746937 ]

Koji Noguchi commented on PIG-3385:
-----------------------------------

While looking at this jira, noticed custom partitioner being dropped when run with multi query optimization. Created PIG-3435.

> DISTINCT no longer uses custom partitioner
> ------------------------------------------
>
>                 Key: PIG-3385
>                 URL: https://issues.apache.org/jira/browse/PIG-3385
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>            Reporter: Will Oberman
>            Assignee: Koji Noguchi
>            Priority: Minor
>         Attachments: pig-3385-v01.patch, pig-3385-v02.patch
>
>
> From u...@pig.apache.org: It looks like an optimization was put in to make distinct use a special partitioner which prevents the user from setting the partitioner.
[jira] [Created] (PIG-3435) Custom Partitioner not working with MultiQueryOptimizer
Koji Noguchi created PIG-3435:
---------------------------------

             Summary: Custom Partitioner not working with MultiQueryOptimizer
                 Key: PIG-3435
                 URL: https://issues.apache.org/jira/browse/PIG-3435
             Project: Pig
          Issue Type: Bug
          Components: impl
            Reporter: Koji Noguchi
            Assignee: Koji Noguchi


When looking at PIG-3385, noticed some issues in handling of custom partitioner with multi-query optimization.

{noformat}
C1 = group B1 by col1 PARTITION BY org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2;
C2 = group B2 by col1 PARTITION BY org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2;
{noformat}
This seems to be merged into one mapreduce job correctly, but the custom partitioner information was lost.

{noformat}
C1 = group B1 by col1 PARTITION BY org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2;
C2 = group B2 by col1 parallel 2;
{noformat}
This seems to be merged even though the two groups should run with two different partitioners.
[jira] [Updated] (PIG-3419) Pluggable Execution Engine
    [ https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Achal Soni updated PIG-3419:
----------------------------

    Attachment: finalpatch.patch

Let me know if there are any pressing changes to this patch!
[jira] [Commented] (PIG-3419) Pluggable Execution Engine
    [ https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13746842#comment-13746842 ]

Achal Soni commented on PIG-3419:
---------------------------------

I have taken all the suggestions into account and regenerated a new patch that is hopefully cleaner, smaller, and reflects most of the suggestions. The patch is attached and the review board is the following: https://reviews.apache.org/r/13714/
RE: can't parse the values using XML loader
Part of the problem might be that the regexp has (.*) but you need (.*)

Using regexps to parse XML is awfully brittle. An alternative is to use a UDF that calls out to an XML parser. I use ElementTree from python UDFs.

Will Dowling

From: Muni mahesh [mahesh87.had...@gmail.com]
Sent: Wednesday, August 21, 2013 6:58 AM
To: dev@pig.apache.org; u...@pig.apache.org
Subject: can't parse the values using XML loader

*Input file :*
hadoop developer ajay india ITC 10.90 2013

*Pig Script:*
register /usr/lib/pig/piggybank.jar;
A = load '/home/sudeep/Desktop/CATALOG.xml' using org.apache.pig.piggybank.storage.XMLLoader('CATALOG') as (x: chararray);
B = foreach A GENERATE FLATTEN(REGEX_EXTRACT_ALL(x,'\\n*\\n(.*)\\n\\s*(.*)\\n\\s*(.*)\\n\\s*(.*)\\n\\s*(.*)\\n\\s*(.*)\\n\\s*\\n*')) as (id: int, name:chararray);

*Output Expected :*
(hadoop, ajay, india, ITC, 10.90, 2013)

*Issue :*
But the output i am getting is : ()
I hope it is not able to parse the values between the tags
Re: can't parse the values using XML loader
Hello,

Moreover, REGEX_EXTRACT_ALL uses Matcher.matches(), which tries to match the entire input string against the pattern rather than parts of it. You may want to write your own REGEX UDF (if you are not going the route suggested by Will) that uses Matcher.find() instead of Matcher.matches().

Regards,
Amit
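The difference Amit points out between Matcher.matches() and Matcher.find() can be seen with plain java.util.regex. This is a minimal sketch, not a Pig UDF: the class name, helper names, and the sample record (including its tag names, since the tags in the original post were lost) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative helpers only; not part of Pig or piggybank.
class FindVsMatches {

    /** True only if the pattern covers the whole input, as Matcher.matches() requires. */
    static boolean wholeMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    /** Collects capture group 1 of every occurrence found anywhere in the input. */
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical record; the tag names are assumptions.
        String record = "<item>\n  <name>ajay</name>\n  <country>india</country>\n</item>";
        String regex = "<name>(.*)</name>";

        // matches() fails because the record contains text outside the pattern.
        System.out.println("matches(): " + wholeMatch(regex, record));
        // find() locates the pattern inside the larger record.
        System.out.println("find():    " + findAll(regex, record));
    }
}
```

A UDF built around the find() loop would tolerate surrounding whitespace and neighbouring elements, though Will's point stands that a real XML parser is the more robust option.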
[jira] [Updated] (PIG-3385) DISTINCT no longer uses custom partitioner
    [ https://issues.apache.org/jira/browse/PIG-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Noguchi updated PIG-3385:
------------------------------

    Attachment: pig-3385-v02.patch

Uploading a patch with test. Noticed that the original test for custom partitioners didn't give different partition results than the default, so added one silly partitioner that always returns 1 (second reducer).
[jira] [Updated] (PIG-3385) DISTINCT no longer uses custom partitioner
[ https://issues.apache.org/jira/browse/PIG-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Noguchi updated PIG-3385:
------------------------------
    Component/s: (was: documentation)
                 impl
    Assignee: Koji Noguchi

> DISTINCT no longer uses custom partitioner
> ------------------------------------------
>
> Key: PIG-3385
> URL: https://issues.apache.org/jira/browse/PIG-3385
> Project: Pig
> Issue Type: Bug
> Components: impl
> Reporter: Will Oberman
> Assignee: Koji Noguchi
> Priority: Minor
> Attachments: pig-3385-v01.patch
>
>
> From u...@pig.apache.org: It looks like an optimization was put in to make
> distinct use a special partitioner which prevents the user from setting the
> partitioner.
[jira] [Updated] (PIG-3117) A debug mode in which pig does not delete temporary files
[ https://issues.apache.org/jira/browse/PIG-3117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ido Hadanny updated PIG-3117:
-----------------------------
    Fix Version/s: 0.12
    Affects Version/s: 0.11.1
    Status: Patch Available (was: Open)

The patch introduces a pig.delete.intermediate.files property that keeps all intermediate files when set to false.

> A debug mode in which pig does not delete temporary files
> ---------------------------------------------------------
>
> Key: PIG-3117
> URL: https://issues.apache.org/jira/browse/PIG-3117
> Project: Pig
> Issue Type: Wish
> Affects Versions: 0.11.1, 0.10.0
> Reporter: Ido Hadanny
> Assignee: Cheolsoo Park
> Fix For: 0.12
>
> Attachments: remove_intermediate_results.diff
>
>
> when we debug our pig jobs on pre-production data, we usually find bugs we
> couldn't detect in our UT, as env and data are not quite the same.
> when the final output of a script is not quite what we expect, we start
> divide-and-conquer, running it line by line and inspecting the intermediate
> output of each stage.
> It would be great if we could simply configure pig not to delete the
> intermediate MR outputs, and store them as plaintext instead of snappy format.
[jira] [Updated] (PIG-3117) A debug mode in which pig does not delete temporary files
[ https://issues.apache.org/jira/browse/PIG-3117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ido Hadanny updated PIG-3117:
-----------------------------
    Attachment: remove_intermediate_results.diff

The patch introduces a pig.delete.intermediate.files property that keeps all intermediate files when set to false.

> A debug mode in which pig does not delete temporary files
> ---------------------------------------------------------
>
> Key: PIG-3117
> URL: https://issues.apache.org/jira/browse/PIG-3117
> Project: Pig
> Issue Type: Wish
> Affects Versions: 0.10.0
> Reporter: Ido Hadanny
> Assignee: Cheolsoo Park
> Attachments: remove_intermediate_results.diff
>
>
> when we debug our pig jobs on pre-production data, we usually find bugs we
> couldn't detect in our UT, as env and data are not quite the same.
> when the final output of a script is not quite what we expect, we start
> divide-and-conquer, running it line by line and inspecting the intermediate
> output of each stage.
> It would be great if we could simply configure pig not to delete the
> intermediate MR outputs, and store them as plaintext instead of snappy format.
[jira] [Updated] (PIG-3434) Null subexpression in bincond nullifies outer tuple (or bag)
[ https://issues.apache.org/jira/browse/PIG-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pavel Fedyakov updated PIG-3434:
--------------------------------
    Description:
According to docs, for bincond operator "If a Boolean subexpression results in null value, the resulting expression is null" (http://pig.apache.org/docs/r0.11.0/basic.html#nulls).
It works as described in plain foreach..generate expression:
{{in = load 'in';}}
{{out = FOREACH in GENERATE 1, ($0 > 0 ? 2 : 3);}}
{{dump out;}}
in (3 lines, 2nd is empty):
{{0}}

{{1}}
out:
{{(1,3)}}
{{(1,)}}
{{(1,2)}}
But if we wrap generated variables in tuple (or bag), we lose the whole 2nd line in output:
{{out = FOREACH in GENERATE (1, ($0 > 0 ? 2 : 3));}}
out:
{{((1,3))}}
{{()}}
{{((1,2))}}

  was:
According to docs, for bincond operator "If a Boolean subexpression results in null value, the resulting expression is null" (http://pig.apache.org/docs/r0.11.0/basic.html#nulls).
It works as described in plain foreach..generate expression:
{{in = load 'in';}}
{{out = FOREACH in GENERATE 1, ($0 > 0 ? 2 : 3);}}
{{dump out;}}
in (3 lines, 2nd is empty):
{{0}}

{{1}}
out:
{{(1,3)}}
{{(1,)}}
{{(1,2)}}
But if we wrap generated variables in tuple (or bag), we lost the whole 2nd line in output:
{{out = FOREACH in GENERATE (1, ($0 > 0 ? 2 : 3));}}
out:
{{((1,3))}}
{{()}}
{{((1,2))}}

> Null subexpression in bincond nullifies outer tuple (or bag)
> ------------------------------------------------------------
>
> Key: PIG-3434
> URL: https://issues.apache.org/jira/browse/PIG-3434
> Project: Pig
> Issue Type: Bug
> Reporter: Pavel Fedyakov
>
> According to docs, for bincond operator "If a Boolean subexpression results
> in null value, the resulting expression is null"
> (http://pig.apache.org/docs/r0.11.0/basic.html#nulls).
> It works as described in plain foreach..generate expression:
> {{in = load 'in';}}
> {{out = FOREACH in GENERATE 1, ($0 > 0 ? 2 : 3);}}
> {{dump out;}}
> in (3 lines, 2nd is empty):
> {{0}}
>
> {{1}}
> out:
> {{(1,3)}}
> {{(1,)}}
> {{(1,2)}}
> But if we wrap generated variables in tuple (or bag), we lose the whole 2nd
> line in output:
> {{out = FOREACH in GENERATE (1, ($0 > 0 ? 2 : 3));}}
> out:
> {{((1,3))}}
> {{()}}
> {{((1,2))}}
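The null handling quoted from the docs can be modeled in a few lines. This is a sketch in Python of the semantics the report appears to expect, not of Pig's implementation: None stands in for null, and the helper names bincond and greater_than are invented for the illustration.

```python
def greater_than(value, threshold):
    """Comparison that propagates null (None) instead of raising an error."""
    if value is None:
        return None
    return value > threshold

def bincond(condition, if_true, if_false):
    """Conditional in which a null condition yields a null result."""
    if condition is None:
        return None
    return if_true if condition else if_false

# Three input rows; the empty second line is modeled as None.
rows = [0, None, 1]

# Plain GENERATE 1, ($0 > 0 ? 2 : 3)
flat = [(1, bincond(greater_than(v, 0), 2, 3)) for v in rows]

# Wrapped GENERATE (1, ($0 > 0 ? 2 : 3)): one field holding a tuple
nested = [((1, bincond(greater_than(v, 0), 2, 3)),) for v in rows]

print(flat)
print(nested)
```

Under this model only the bincond field of the second row is null, in both the flat and the wrapped form, which corresponds to ((1,)) rather than the () shown in the report.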
[jira] [Created] (PIG-3434) Null subexpression in bincond nullifies outer tuple (or bag)
Pavel Fedyakov created PIG-3434:
-----------------------------------

    Summary: Null subexpression in bincond nullifies outer tuple (or bag)
    Key: PIG-3434
    URL: https://issues.apache.org/jira/browse/PIG-3434
    Project: Pig
    Issue Type: Bug
    Reporter: Pavel Fedyakov

According to docs, for bincond operator "If a Boolean subexpression results in null value, the resulting expression is null" (http://pig.apache.org/docs/r0.11.0/basic.html#nulls).
It works as described in plain foreach..generate expression:
{{in = load 'in';}}
{{out = FOREACH in GENERATE 1, ($0 > 0 ? 2 : 3);}}
{{dump out;}}
in (3 lines, 2nd is empty):
{{0}}

{{1}}
out:
{{(1,3)}}
{{(1,)}}
{{(1,2)}}
But if we wrap generated variables in tuple (or bag), we lost the whole 2nd line in output:
{{out = FOREACH in GENERATE (1, ($0 > 0 ? 2 : 3));}}
out:
{{((1,3))}}
{{()}}
{{((1,2))}}
can't parse the values using XML loader
*Input file :*
hadoop developer ajay india ITC 10.90 2013

*Pig Script:*
register /usr/lib/pig/piggybank.jar;
A = load '/home/sudeep/Desktop/CATALOG.xml' using org.apache.pig.piggybank.storage.XMLLoader('CATALOG') as (x: chararray);
B = foreach A GENERATE FLATTEN(REGEX_EXTRACT_ALL(x,'\\n*\\n(.*)\\n\\s*(.*)\\n\\s*(.*)\\n\\s*(.*)\\n\\s*(.*)\\n\\s*(.*)\\n\\s*\\n*')) as (id: int, name:chararray);

*Output Expected :*
(hadoop, ajay, india, ITC, 10.90, 2013)

*Issue :*
But the output I am getting is: ()
I think it is not able to parse the values between the tags.
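Elsewhere in this thread an XML parser (ElementTree, called from a Python UDF) is suggested as an alternative to the regular expression. A minimal sketch of that idea follows. The tag names are placeholders, since the markup of the input file did not survive in the message, and parse_record is a hypothetical helper; registering it as a Pig UDF is left out.

```python
import xml.etree.ElementTree as ET

def parse_record(xml_text):
    """Return the text of each child element of one record, in document order."""
    root = ET.fromstring(xml_text)
    return tuple((child.text or "").strip() for child in root)

# Hypothetical record: tag names are assumptions; the values follow the
# expected output given in the message.
sample = """<CATALOG>
  <TITLE>hadoop</TITLE>
  <ARTIST>ajay</ARTIST>
  <COUNTRY>india</COUNTRY>
  <COMPANY>ITC</COMPANY>
  <PRICE>10.90</PRICE>
  <YEAR>2013</YEAR>
</CATALOG>"""

fields = parse_record(sample)
print(fields)
```

Because the parser works on elements rather than on line layout, changes in whitespace or indentation between tags do not affect the result.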
Re: Slow Group By operator
Hi Cheolsoo,

>>What's your query like? Can you share it? Do you call any algebraic UDF
>> after group by? I am wondering whether combiner matters in your test.

I have been running 3 different types of queries. The first was performed on datasets of 6 different sizes:

- Dataset size 1: 30,000 records (772KB)
- Dataset size 2: 300,000 records (6.4MB)
- Dataset size 3: 3,000,000 records (63MB)
- Dataset size 4: 30 million records (628MB)
- Dataset size 5: 300 million records (6.2GB)
- Dataset size 6: 3 billion records (62GB)

The datasets scale linearly, whereby the size equates to 3000 * 10^n. A seventh dataset consisting of 1,000 records (23KB) was produced to perform join operations on. Its schema is as follows:

name - string
marks - integer
gpa - float

The datasets were generated using the generate data.pl perl script available for download from https://issues.apache.org/jira/browse/PIG-200.

The results are as follows:

              Set 1    Set 2    Set 3    Set 4     Set 5      Set 6
Arithmetic    32.82    36.21    49.49    83.25    423.63    3900.78
Filter 10%    32.94    34.32    44.56    66.68    295.59    2640.52
Filter 90%    33.93    32.55    37.86    53.22    197.36    1657.37
Group         49.43    53.34    69.84   105.12    497.61    4394.21
Join          49.89    50.08    78.55   150.39   1045.34   10258.19

*Averaged performance of arithmetic, join, group, order, distinct select and filter operations on six datasets using Pig. Scripts were configured as to use 8 reduce and 11 map tasks.*

              Set 1    Set 2    Set 3    Set 4     Set 5      Set 6
Arithmetic    32.84    37.33    72.55   300.08   2633.72   27821.19
Filter 10%    32.36    53.28    59.22   209.5    1672.3    18222.19
Filter 90%    31.23    32.68    36.8     69.55    331.88    3320.59
Group         48.27    47.68    46.87    53.66    141.36    1233.4
Join          48.54    56.86   104.6    517.5    4388.34       -
Distinct      48.73    53.28    72.54   109.77       -         -

*Averaged performance of arithmetic, join, group, distinct select and filter operations on six datasets using Hive. Scripts were configured as to use 8 reduce and 11 map tasks.*

(If you want to see the standard deviation, let me know).

So, to summarize the results: Pig outperforms Hive, with the exception of using *Group By*.

The Pig scripts used for this benchmark are as follows:

*Arithmetic*
-- Generate with basic arithmetic
A = load '$input/dataset_3' using PigStorage('\t') as (name, age, gpa) PARALLEL $reducers;
B = foreach A generate age * gpa + 3, age/gpa - 1.5 PARALLEL $reducers;
store B into '$output/dataset_3_projection' using PigStorage() PARALLEL $reducers;

*Filter 10%*
-- Filter that removes 10% of data
A = load '$input/dataset_3' using PigStorage('\t') as (name, age, gpa) PARALLEL $reducers;
B = filter A by gpa < '3.6' PARALLEL $reducers;
store B into '$output/dataset_3_filter_10' using PigStorage() PARALLEL $reducers;

*Filter 90%*
-- Filter that removes 90% of data
A = load '$input/dataset_3' using PigStorage('\t') as (name, age, gpa) PARALLEL $reducers;
B = filter A by age < '25' PARALLEL $reducers;
store B into '$output/dataset_3_filter_90' using PigStorage() PARALLEL $reducers;

*Group*
A = load '$input/dataset_3' using PigStorage('\t') as (name, age, gpa) PARALLEL $reducers;
B = group A by name PARALLEL $reducers;
C = foreach B generate flatten(group), COUNT(A.age) PARALLEL $reducers;
store C into '$output/dataset_3_group' using PigStorage() PARALLEL $reducers;

*Join*
A = load '$input/dataset_3' using PigStorage('\t') as (name, age, gpa) PARALLEL $reducers;
B = load '$input/dataset_join' using PigStorage('\t') as (name, age, gpa) PARALLEL $reducers;
C = cogroup A by name inner, B by name inner PARALLEL $reducers;
D = foreach C generate flatten(A), flatten(B) PARALLEL $reducers;
store D into '$output/dataset_3_cogroup_big' using PigStorage() PARALLEL $reducers;

Similarly, here are the Hive scripts:

*Arithmetic*
SELECT (dataset.age * dataset.gpa + 3) AS F1, (dataset.age/dataset.gpa - 1.5) AS F2 FROM dataset WHERE dataset.gpa > 0;

*Filter 10%*
SELECT * FROM dataset WHERE dataset.gpa < 3.6;

*Filter 90%*
SELECT * FROM dataset WHERE dataset.age < 25;

*Group*
SELECT COUNT(dataset.age) FROM dataset GROUP BY dataset.name;

*Join*
SELECT * FROM dataset JOIN dataset_join ON dataset.name = dataset_join.name;

I will re-run the benchmarks to see whether it is the reduce or
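The comparison summarized in the message can be made concrete by dividing the Hive timings by the Pig timings for each operation and dataset. The sketch below does that in Python using the figures from the two tables in the message; the names PIG, HIVE, dataset_sizes and hive_over_pig are invented for the illustration, and the split of the fused Hive value "2633.7227821.19" into 2633.72 and 27821.19 is an assumption.

```python
# Timings copied from the two tables in the message (units are not stated there).
PIG = {
    "Arithmetic": [32.82, 36.21, 49.49, 83.25, 423.63, 3900.78],
    "Filter 10%": [32.94, 34.32, 44.56, 66.68, 295.59, 2640.52],
    "Filter 90%": [33.93, 32.55, 37.86, 53.22, 197.36, 1657.37],
    "Group": [49.43, 53.34, 69.84, 105.12, 497.61, 4394.21],
    "Join": [49.89, 50.08, 78.55, 150.39, 1045.34, 10258.19],
}
HIVE = {
    # Sets 5 and 6 of Arithmetic are an assumed split of a fused value.
    "Arithmetic": [32.84, 37.33, 72.55, 300.08, 2633.72, 27821.19],
    "Filter 10%": [32.36, 53.28, 59.22, 209.5, 1672.3, 18222.19],
    "Filter 90%": [31.23, 32.68, 36.8, 69.55, 331.88, 3320.59],
    "Group": [48.27, 47.68, 46.87, 53.66, 141.36, 1233.4],
    "Join": [48.54, 56.86, 104.6, 517.5, 4388.34, None],  # no Set 6 result
}

def dataset_sizes(count=6):
    """Record counts of the datasets: 3000 * 10^n for n = 1..count."""
    return [3000 * 10 ** n for n in range(1, count + 1)]

def hive_over_pig(operation):
    """Hive time divided by Pig time per dataset; above 1 means Pig was faster."""
    return [
        None if hive is None else round(hive / pig, 2)
        for pig, hive in zip(PIG[operation], HIVE[operation])
    ]

print(dataset_sizes())
for operation in PIG:
    print(operation, hive_over_pig(operation))
```

A ratio above 1 means Pig finished sooner on that dataset, a ratio below 1 means Hive did, and None marks a missing Hive measurement.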
[jira] [Created] (PIG-3433) The import sdsu cannot be resolved
Ido Hadanny created PIG-3433:
--------------------------------

    Summary: The import sdsu cannot be resolved
    Key: PIG-3433
    URL: https://issues.apache.org/jira/browse/PIG-3433
    Project: Pig
    Issue Type: Bug
    Components: build
    Affects Versions: 0.11.1
    Environment: Eclipse indigo
    Reporter: Ido Hadanny

executed:
➜ trunk svn update
At revision 1516115.
ant clean eclipse-files
ant compile gen

getting: https://issues.apache.org/jira/browse/PIG-3399
AND after manually removing the wrong javacc-4.2 dependency, getting:
"The import sdsu cannot be resolved" in DataGenerator.java