[jira] Commented: (PIG-6) Addition of Hbase Storage Option In Load/Store Statement
[ https://issues.apache.org/jira/browse/PIG-6?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715385#action_12715385 ] Amr Awadallah commented on PIG-6: - Any progress on this? > Addition of Hbase Storage Option In Load/Store Statement > > > Key: PIG-6 > URL: https://issues.apache.org/jira/browse/PIG-6 > Project: Pig > Issue Type: New Feature > Environment: all environments >Reporter: Edward J. Yoon > Fix For: 0.2.0 > > Attachments: hbase-0.18.1-test.jar, hbase-0.18.1.jar, m34813f5.txt, > PIG-6.patch, PIG-6_V01.patch > > > Pig needs to be able to load a full table from HBase. (Maybe ... difficult? I'm not sure yet.) > Also, as described below, it needs to compose an abstract 2-d table containing only the data filtered out of the HBase array structure by an arbitrary delimited query.
> {code}
> A = LOAD table('hbase_table');
> or
> B = LOAD table('hbase_table') USING HbaseQuery('Query-delimited by attributes & timestamp') AS (f1, f2[, f3]);
> {code}
> Once testing is done on my local machines, I will clarify the grammar and give you more examples to help explain more storage options. > Any advice welcome. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-825) PIG_HADOOP_VERSION should be 18
[ https://issues.apache.org/jira/browse/PIG-825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715368#action_12715368 ] Hudson commented on PIG-825: Integrated in Pig-trunk #460 (See [http://hudson.zones.apache.org/hudson/job/Pig-trunk/460/]) : PIG_HADOOP_VERSION should be set to 18. > PIG_HADOOP_VERSION should be 18 > --- > > Key: PIG-825 > URL: https://issues.apache.org/jira/browse/PIG-825 > Project: Pig > Issue Type: Bug > Components: grunt >Reporter: Dmitriy V. Ryaboy > Fix For: 0.3.0 > > Attachments: pig-825.patch, pig-825.patch > > > PIG_HADOOP_VERSION should be set to 18, not 17, as Hadoop 0.18 is now > considered default. > Patch coming. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Build failed in Hudson: Pig-Patch-minerva.apache.org #65
See http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/65/changes Changes: [gates] PIG-825: PIG_HADOOP_VERSION should be set to 18. [olga] PIG-802: PERFORMANCE: not creating bags for ORDER BY (serakesh via olgan) [pradeepkth] PIG-816: PigStorage() does not accept Unicode characters in its constructor (pradeepkth) -- [...truncated 788 lines...]
[jira] Commented: (PIG-796) support conversion from numeric types to chararray
[ https://issues.apache.org/jira/browse/PIG-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715325#action_12715325 ] Pradeep Kamath commented on PIG-796: A few comments:
- In TestPOCast.java, the POCast variables could be named something like "opWithInputTypeAsByteArray", since the intent is not so clear with the current names.
- In POCast.java, you can check for the realType inside the catch clause rather than before trying to cast to ByteArray. This way, if the cast to ByteArray is always successful, we will not incur the overhead of the if (realType == null) check.
- In POCast.java, you can avoid catching ExecException and checking for errorCode == 1071. Since the getNext() call in POCast already throws ExecException, you can just let ExecExceptions from the DataType.toXXX() methods bubble out.
> support conversion from numeric types to chararray > --- > > Key: PIG-796 > URL: https://issues.apache.org/jira/browse/PIG-796 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.2.0 >Reporter: Olga Natkovich > Attachments: 796.patch, pig-796.patch > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
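[Editor's note] The second review suggestion, probing the real type only after the optimistic bytearray cast has failed, can be sketched roughly as below. All names here are illustrative stand-ins, not the actual POCast code.

```java
// Illustrative sketch (NOT the actual POCast code): probe the element's
// real runtime type only inside the catch clause, so the common case --
// the field really is a byte array -- pays nothing for the check.
public class CastSketch {
    private Class<?> realType;  // null until the optimistic cast first fails

    public Integer toInteger(Object field) {
        try {
            // Optimistic path: assume the declared bytearray type is right.
            byte[] bytes = (byte[]) field;
            return Integer.valueOf(new String(bytes));
        } catch (ClassCastException e) {
            // Only on failure do we pay for discovering the real type.
            if (realType == null) {
                realType = field.getClass();
            }
            if (realType == Integer.class) {
                return (Integer) field;
            }
            return Integer.valueOf(field.toString());
        }
    }
}
```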
[jira] Created: (PIG-829) DECLARE statement stops processing after special characters such as dot ".", "+", "%", etc.
DECLARE statement stops processing after special characters such as dot ".", "+", "%", etc. -- Key: PIG-829 URL: https://issues.apache.org/jira/browse/PIG-829 Project: Pig Issue Type: Bug Components: grunt Affects Versions: 0.3.0 Reporter: Viraj Bhat Fix For: 0.3.0 The Pig script below does not work when special characters are used in the DECLARE statement.
{code}
%DECLARE OUT foo.bar
x = LOAD 'something' as (a:chararray, b:chararray);
y = FILTER x BY ( a MATCHES '^.*yahoo.*$' );
STORE y INTO '$OUT';
{code}
When the above script is run in dry-run mode, the substituted file does not contain the special character.
{code}
java -cp pig.jar:/homes/viraj/hadoop-0.18.0-dev/conf -Dhod.server='' org.apache.pig.Main -r declaresp.pig
{code}
Resulting file, "declaresp.pig.substituted":
{code}
x = LOAD 'something' as (a:chararray, b:chararray);
y = FILTER x BY ( a MATCHES '^.*yahoo.*$' );
STORE y INTO 'foo';
{code}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
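[Editor's note] The reported truncation of foo.bar to foo is the behaviour one would expect if the preprocessor scanned the DECLARE value with a word-character-only pattern. The following is a hypothetical illustration of that failure mode, not the actual Pig parameter-substitution code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only (NOT the actual Pig preprocessor): if
// the DECLARE value is read with a word-character pattern such as \w+,
// everything from the first special character ('.', '+', '%', ...) on
// is silently dropped, matching the behaviour reported in PIG-829.
public class DeclareValueSketch {
    private static final Pattern WORD = Pattern.compile("\\w+");

    public static String readValue(String raw) {
        Matcher m = WORD.matcher(raw);
        return m.find() ? m.group() : "";  // "foo.bar" yields only "foo"
    }
}
```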
[jira] Updated: (PIG-796) support conversion from numeric types to chararray
[ https://issues.apache.org/jira/browse/PIG-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-796: --- Status: Patch Available (was: Open) > support conversion from numeric types to chararray > --- > > Key: PIG-796 > URL: https://issues.apache.org/jira/browse/PIG-796 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.2.0 >Reporter: Olga Natkovich > Attachments: 796.patch, pig-796.patch > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-753) Provide support for UDFs without parameters
[ https://issues.apache.org/jira/browse/PIG-753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-753: --- Status: Open (was: Patch Available) > Provide support for UDFs without parameters > --- > > Key: PIG-753 > URL: https://issues.apache.org/jira/browse/PIG-753 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.3.0 >Reporter: Jeff Zhang > Fix For: 0.3.0 > > Attachments: Pig_753_Patch.txt > > > Pig does not support UDFs without parameters; it forces me to provide a parameter, > as in the following statement: > B = FOREACH A GENERATE bagGenerator(); -- this will generate an error. I have to > provide a parameter, as in: > B = FOREACH A GENERATE bagGenerator($0); > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-753) Provide support for UDFs without parameters
[ https://issues.apache.org/jira/browse/PIG-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715312#action_12715312 ] Alan Gates commented on PIG-753: The patch should include unit tests that check whether a Pig script with a UDF that has no parameters will parse, and whether the backend will properly execute a UDF that takes no parameters. > Provide support for UDFs without parameters > --- > > Key: PIG-753 > URL: https://issues.apache.org/jira/browse/PIG-753 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.3.0 >Reporter: Jeff Zhang > Fix For: 0.3.0 > > Attachments: Pig_753_Patch.txt > > > Pig does not support UDFs without parameters; it forces me to provide a parameter, > as in the following statement: > B = FOREACH A GENERATE bagGenerator(); -- this will generate an error. I have to > provide a parameter, as in: > B = FOREACH A GENERATE bagGenerator($0); > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
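[Editor's note] For context, a zero-parameter UDF on the backend would simply ignore its (empty) input tuple and manufacture its output from scratch. The EvalFunc interface below is a stand-in so the example is self-contained; a real Pig UDF would extend org.apache.pig.EvalFunc<DataBag> instead.

```java
import java.util.ArrayList;
import java.util.List;

// Self-contained sketch of a zero-parameter UDF. The nested EvalFunc
// interface is a stand-in so the example compiles on its own; in real
// Pig one would extend org.apache.pig.EvalFunc<DataBag>.
public class ZeroArgUdfSketch {
    interface EvalFunc<T> {
        T exec(List<Object> input);  // input plays the role of Pig's Tuple
    }

    // A UDF taking no Pig-level arguments ignores its (empty) input
    // tuple and generates its output from scratch.
    static class BagGenerator implements EvalFunc<List<String>> {
        public List<String> exec(List<Object> input) {
            List<String> bag = new ArrayList<String>();
            bag.add("generated");
            return bag;
        }
    }
}
```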
[jira] Updated: (PIG-796) support conversion from numeric types to chararray
[ https://issues.apache.org/jira/browse/PIG-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-796: - Attachment: 796.patch Updated patch. This patch fixes the following issue: sometimes (e.g. for values coming out of a map lookup) Pig assumes the type of an element is ByteArray when it is actually of some other type. In such cases a request for a cast fails. This patch first finds the actual type of the element before casting it (specifically when Pig thinks it is a ByteArray) and then does the cast. It also caches the type. When the type changes, a ClassCastException is raised, which gets caught; the cast is then tried again and the cached type is updated. This ensures that the type is not determined on every cast call, while still handling casts whose input type changes from one call to the next. > support conversion from numeric types to chararray > --- > > Key: PIG-796 > URL: https://issues.apache.org/jira/browse/PIG-796 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.2.0 >Reporter: Olga Natkovich > Attachments: 796.patch, pig-796.patch > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
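[Editor's note] The cache-and-retry scheme described in the comment can be sketched roughly as follows. Names and conversions are illustrative; this is not the actual POCast implementation.

```java
// Rough sketch of the cache-and-retry scheme described above (NOT the
// actual POCast code). The discovered runtime type is cached; when a
// later value arrives with a different type, the ClassCastException is
// caught, the cache is refreshed, and the cast is retried exactly once.
public class TypeCacheSketch {
    private Class<?> cachedType;  // the element type discovered so far

    public long asLong(Object value) {
        if (cachedType == null) {
            cachedType = value.getClass();   // first call: probe the real type
        }
        try {
            return convert(cachedType, value);   // fast path: trust the cache
        } catch (ClassCastException e) {
            cachedType = value.getClass();       // type changed: re-probe
            return convert(cachedType, value);   // retry once with fresh type
        }
    }

    private long convert(Class<?> type, Object v) {
        if (type == String.class) return Long.parseLong((String) v);
        return ((Number) v).longValue();
    }
}
```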
[jira] Updated: (PIG-828) Problem accessing a tuple within a bag
[ https://issues.apache.org/jira/browse/PIG-828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-828: --- Attachment: tupleacc.pig studenttab5 Input script and data. > Problem accessing a tuple within a bag > -- > > Key: PIG-828 > URL: https://issues.apache.org/jira/browse/PIG-828 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.3.0 >Reporter: Viraj Bhat > Fix For: 0.3.0 > > Attachments: studenttab5, tupleacc.pig > > > The Pig script below creates a tuple containing 3 columns, 2 of which are chararrays; the third column is a bag of a constant chararray. The script later projects the tuple within a bag.
> {code}
> a = load 'studenttab5' as (name, age, gpa);
> b = foreach a generate ('viraj', {('sms')}, 'pig') as document:(id,singlebag:{singleTuple:(single)}, article);
> describe b;
> c = foreach b generate document.singlebag;
> dump c;
> {code}
> When we run this script we get a run-time error in the Map phase:
> java.lang.ClassCastException: org.apache.pig.data.DefaultTuple cannot be cast to org.apache.pig.data.DataBag
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:402)
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:183)
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:400)
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:183)
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:250)
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:245)
> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:236)
> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
> at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)
> -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-828) Problem accessing a tuple within a bag
Problem accessing a tuple within a bag -- Key: PIG-828 URL: https://issues.apache.org/jira/browse/PIG-828 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.3.0 Reporter: Viraj Bhat Fix For: 0.3.0 The Pig script below creates a tuple containing 3 columns, 2 of which are chararrays; the third column is a bag of a constant chararray. The script later projects the tuple within a bag.
{code}
a = load 'studenttab5' as (name, age, gpa);
b = foreach a generate ('viraj', {('sms')}, 'pig') as document:(id,singlebag:{singleTuple:(single)}, article);
describe b;
c = foreach b generate document.singlebag;
dump c;
{code}
When we run this script we get a run-time error in the Map phase:
java.lang.ClassCastException: org.apache.pig.data.DefaultTuple cannot be cast to org.apache.pig.data.DataBag
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:402)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:183)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:400)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:183)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:250)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:245)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:236)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: A proposal for changing pig's memory management
Alan Gates wrote:

On May 19, 2009, at 10:30 PM, Mridul Muralidharan wrote:

I am still not very convinced about the value of this implementation, particularly considering the advances made since 1.3 in memory allocators and garbage collection.

My fundamental concern is not with the slowness of garbage collection. I am asserting (along with the paper) that garbage collection is not an optimal choice for a large data processing system. I don't want to improve the garbage collector, I want to manage a subset of the memory without it.

I should probably have elaborated better. Most objects in Pig are in the young generation (please correct me if I am wrong), so promoting them from there (which is handled pretty optimally and blazingly fast by the VM) into slower, longer-lived memory pools should be done with some thought (management of buffers, etc.). The only (corner) cases where this is not valid, off the top of my head, are when a single tuple becomes really large, usually due to a bag with either a large number of tuples in it or tuples with larger payloads; and IMO that results in quite similar costs with this proposal too, but I could be wrong. The side effects of this proposal are many, and sometimes non-obvious: implicitly moving young-generation data into the older generation, causing much more memory pressure for GC; fragmentation of memory blocks, causing quite a bit of memory pressure; replicating quite a bit of the garbage collector's functionality; the possibility of bugs with ref counting; etc.

I don't understand your concerns regarding the load on the GC and memory fragmentation. Let's say I have 10,000 tuples, each with 10 fields. Let's also assume that these tuples live long enough to make it into the "old" memory pool, since this is the interesting case where objects live long enough to cause a problem. In the current implementation there will be 110,000 objects that the GC has to manage moving into the old pool, and check every time it cleans the old pool. In the proposed implementation there would be 10,001 objects (assuming all the data fit into one buffer) to manage. And rather than allocating 100,000 small pieces of memory, we would have allocated one large segment. My belief is that this would lighten the load on the GC.

Old-gen memory management is not trivial. For example (this should probably be commonly known by now): if an old block is freed and yet the cost of moving existing blocks around to use the 'free' block is high, the VM just leaves it around. Over time, you will end up with fragmentation in the old gen which can't be freed. (This is not a VM bug; the costs outweigh the benefits.) That being said, as I mentioned above, the cost of memory usage is not linear: the young gen is way faster (allocation, management, freeing) than objects promoted to successively older generations (compaction, reference changes, etc. in GC). In Pig's case, since it is essentially streaming in nature, most tuples/bags, except in corner cases, would fall into the young gen, where things are faster. Just a note, though: the last time I had to dabble in memory management for my server needs, it was already pretty complex and un-intuitive (not to mention environment- and implementation-specific), and that was a few years back. Unfortunately, I have not kept abreast of recent changes (and quite a few have gone into the VM for Java 6, I was told), so my comments above might not be valid anymore. Other than saying you would probably want to test extensively like we had to, and that things are not as simple as they normally appear (and IMO almost all books/articles get it wrong, so testing is the only way out), I can't really comment more authoritatively anymore :-) Any improvement to Pig memory management would be a welcome change though!

Regards, Mridul

This does replicate some of the functionality of the garbage collector. Complex systems frequently need to re-implement foundational functionality in order to optimize it for their needs. Hence many RDBMS engines have their own implementations of memory management, file I/O, thread scheduling, etc.

As for bugs in ref counting, I agree that forgetting to deallocate is one of the most pernicious problems of allowing programmers to do memory management. But in this case all that will happen is that a buffer will get left around that isn't needed. If the system needs more memory then that buffer will eventually get selected for flushing to disk, and then it will stay there, as no one will call it back into memory. So the cost of forgetting to deallocate is minor.

If the assumption is that the current working set of bags/tuples does not need to be spilled, and anything else can be, then this will pretty much deteriorate to the current implementation in the worst case.

That is not the assumption. There are two issues: 1) trying to spill bags only when we determine we need to is highly error prone, because we
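[Editor's note] The object-count argument above (110,000 GC-managed objects versus roughly 10,001) can be illustrated with a toy sketch in which each "tuple" is just an offset into one shared buffer. This is only an illustration of the idea, not the design actually proposed for Pig.

```java
import java.nio.ByteBuffer;

// Toy illustration of the buffer-backed tuple idea: instead of one
// object per field, each tuple records only an offset into one shared
// buffer, so the GC tracks a single large allocation plus the tuple
// handles. Fields are fixed-width ints for simplicity; this is NOT the
// actual design proposed for Pig.
public class BufferTupleSketch {
    private final ByteBuffer buffer = ByteBuffer.allocate(1 << 20);

    // Writes the fields contiguously and returns the tuple's offset,
    // which acts as its handle.
    public int writeTuple(int[] fields) {
        int offset = buffer.position();
        for (int f : fields) buffer.putInt(f);
        return offset;
    }

    public int readField(int tupleOffset, int fieldIndex) {
        return buffer.getInt(tupleOffset + 4 * fieldIndex);
    }
}
```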
[jira] Resolved: (PIG-825) PIG_HADOOP_VERSION should be 18
[ https://issues.apache.org/jira/browse/PIG-825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates resolved PIG-825. Resolution: Fixed Fix Version/s: 0.3.0 Patch checked in. Thanks Dmitriy. > PIG_HADOOP_VERSION should be 18 > --- > > Key: PIG-825 > URL: https://issues.apache.org/jira/browse/PIG-825 > Project: Pig > Issue Type: Bug > Components: grunt >Reporter: Dmitriy V. Ryaboy > Fix For: 0.3.0 > > Attachments: pig-825.patch, pig-825.patch > > > PIG_HADOOP_VERSION should be set to 18, not 17, as Hadoop 0.18 is now > considered default. > Patch coming. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-827) Redesign graph operations in OperatorPlan
Redesign graph operations in OperatorPlan - Key: PIG-827 URL: https://issues.apache.org/jira/browse/PIG-827 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.1 Reporter: Santhosh Srinivasan Fix For: 0.2.1 The graph operations swap, insertBetween, pushBefore, etc. have to be re-implemented in a layered fashion. The layering will facilitate the re-use of operations. In addition, the use of operator.rewire in the aforementioned operations requires transaction-like ability due to various pre-conditions. Often, the result of one of the operations leaves the graph in an inconsistent state for the rewire operation. Clear layering and assignment of the ability to rewire will remove these inconsistencies. For now, use of rewire has resulted in slightly less maintainable code, along with the necessity to use rewire with discretion. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-697) Proposed improvements to pig's optimizer
[ https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715143#action_12715143 ] Santhosh Srinivasan commented on PIG-697: - The graph operation pushAfter was added as a complementary operation to pushBefore. Currently, on the logical side, there are no concrete use cases for pushAfter. The only operator that truly supports multiple outputs is split. Our current model for split is to have a no-op split operator that has multiple successors, split outputs, each of which is the equivalent of a filter. The split output has inner plans which could have projection operators that hold references to the split's predecessor. When an operator is pushed after split, the operator will be placed between the split and the split output. As a result, when rewire on split is called, the call is dispatched to the split output. The references in the split output after the rewire will now point to split's predecessor instead of pointing to the operator that was pushed after. The intention of pushAfter in the case of a split is to push after the split output. However, the generic pushAfter operation does not distinguish between split and split output. A possible way out is to override this method in the logical plan, duplicating most of the code in OperatorPlan and adding new code to handle split. As of now, pushAfter will not be used in the logical layer. > Proposed improvements to pig's optimizer > > > Key: PIG-697 > URL: https://issues.apache.org/jira/browse/PIG-697 > Project: Pig > Issue Type: Bug > Components: impl >Reporter: Alan Gates >Assignee: Santhosh Srinivasan > Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, > OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, > OptimizerPhase3_parrt1.patch > > > I propose the following changes to pig optimizer, plan, and operator > functionality to support more robust optimization: > 1) Remove the required array from Rule.
This will change rules so that they only match exact patterns instead of allowing missing elements in the pattern. This has the downside that if a given rule applies to two patterns (say Load->Filter->Group, Load->Group) you have to write two rules. But it has the upside that the resulting rules know exactly what they are getting. The original intent of this was to reduce the number of rules that needed to be written. But the resulting rules have to do a lot of work to understand the operators they are working with. With exact matches only, each rule will know exactly the operators it is working on and can apply the logic of shifting the operators around. All four of the existing rules set all entries of required to true, so removing this will have no effect on them.
> 2) Change PlanOptimizer.optimize to iterate over the rules until there are no conversions or a certain number of iterations has been reached. Currently the function is:
> {code}
> public final void optimize() throws OptimizerException {
>     RuleMatcher matcher = new RuleMatcher();
>     for (Rule rule : mRules) {
>         if (matcher.match(rule)) {
>             // It matches the pattern. Now check if the transformer
>             // approves as well.
>             List<List<O>> matches = matcher.getAllMatches();
>             for (List<O> match : matches) {
>                 if (rule.transformer.check(match)) {
>                     // The transformer approves.
>                     rule.transformer.transform(match);
>                 }
>             }
>         }
>     }
> }
> {code}
> It would change to be:
> {code}
> public final void optimize() throws OptimizerException {
>     RuleMatcher matcher = new RuleMatcher();
>     boolean sawMatch;
>     int numIterations = 0;
>     do {
>         sawMatch = false;
>         for (Rule rule : mRules) {
>             List<List<O>> matches = matcher.getAllMatches();
>             for (List<O> match : matches) {
>                 // It matches the pattern. Now check if the transformer
>                 // approves as well.
>                 if (rule.transformer.check(match)) {
>                     // The transformer approves.
>                     sawMatch = true;
>                     rule.transformer.transform(match);
>                 }
>             }
>         }
>         // Not sure if 1000 is the right number of iterations, maybe it
>         // should be configurable so that large scripts don't stop too
>         // early.
>     } while (sawMatch && numIterations++ < 1000);
> }
> {code}
> The reason for limiting the number of itera
[jira] Updated: (PIG-826) DISTINCT as "Function/Operator" rather than statement/operator - High Level Pig
[ https://issues.apache.org/jira/browse/PIG-826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Ciemiewicz updated PIG-826: - Summary: DISTINCT as "Function/Operator" rather than statement/operator - High Level Pig (was: DISTINCT as "Function" rather than statement - High Level Pig) > DISTINCT as "Function/Operator" rather than statement/operator - High Level > Pig > --- > > Key: PIG-826 > URL: https://issues.apache.org/jira/browse/PIG-826 > Project: Pig > Issue Type: New Feature >Reporter: David Ciemiewicz > > In SQL, a user would think nothing of doing something like:
> {code}
> select
>     COUNT(DISTINCT(user)) as user_count,
>     COUNT(DISTINCT(country)) as country_count,
>     COUNT(DISTINCT(url)) as url_count
> from
>     server_logs;
> {code}
> But in Pig, we'd need to do something like the following. And this is about the most compact version I could come up with.
> {code}
> Logs = load 'log' using PigStorage()
>     as ( user: chararray, country: chararray, url: chararray);
> DistinctUsers = distinct (foreach Logs generate user);
> DistinctCountries = distinct (foreach Logs generate country);
> DistinctUrls = distinct (foreach Logs generate url);
> DistinctUsersCount = foreach (group DistinctUsers all) generate
>     group, COUNT(DistinctUsers) as user_count;
> DistinctCountriesCount = foreach (group DistinctCountries all) generate
>     group, COUNT(DistinctCountries) as country_count;
> DistinctUrlCount = foreach (group DistinctUrls all) generate
>     group, COUNT(DistinctUrls) as url_count;
> AllDistinctCounts = cross
>     DistinctUsersCount, DistinctCountriesCount, DistinctUrlCount;
> Report = foreach AllDistinctCounts generate
>     DistinctUsersCount::user_count,
>     DistinctCountriesCount::country_count,
>     DistinctUrlCount::url_count;
> store Report into 'log_report' using PigStorage();
> {code}
> It would be good if there was a higher-level version of Pig that permitted code to be written as:
> {code}
> Logs = load 'log' using PigStorage()
>     as ( user: chararray, country: chararray, url: chararray);
> Report = overall Logs generate
>     COUNT(DISTINCT(user)) as user_count,
>     COUNT(DISTINCT(country)) as country_count,
>     COUNT(DISTINCT(url)) as url_count;
> store Report into 'log_report' using PigStorage();
> {code}
> I do want this in Pig and not as SQL. I'd expect High Level Pig to generate Lower Level Pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-753) Provide support for UDFs without parameters
[ https://issues.apache.org/jira/browse/PIG-753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated PIG-753: --- Fix Version/s: 0.3.0 Affects Version/s: 0.3.0 Status: Patch Available (was: Open) Submitted the patch. Now we do not have to provide a parameter for a UDF; a zero-parameter UDF is OK too. > Provide support for UDFs without parameters > --- > > Key: PIG-753 > URL: https://issues.apache.org/jira/browse/PIG-753 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.3.0 >Reporter: Jeff Zhang > Fix For: 0.3.0 > > Attachments: Pig_753_Patch.txt > > > Pig does not support UDFs without parameters; it forces me to provide a parameter, > as in the following statement: > B = FOREACH A GENERATE bagGenerator(); -- this will generate an error. I have to > provide a parameter, as in: > B = FOREACH A GENERATE bagGenerator($0); > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-753) Provide support for UDFs without parameters
[ https://issues.apache.org/jira/browse/PIG-753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated PIG-753: --- Attachment: Pig_753_Patch.txt Attached the patch. > Provide support for UDFs without parameters > --- > > Key: PIG-753 > URL: https://issues.apache.org/jira/browse/PIG-753 > Project: Pig > Issue Type: Improvement >Reporter: Jeff Zhang > Attachments: Pig_753_Patch.txt > > > Pig does not support UDFs without parameters; it forces me to provide a parameter, > as in the following statement: > B = FOREACH A GENERATE bagGenerator(); -- this will generate an error. I have to > provide a parameter, as in: > B = FOREACH A GENERATE bagGenerator($0); > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.