[jira] Resolved: (PIG-1605) Adding soft link to plan to solve input file dependency
[ https://issues.apache.org/jira/browse/PIG-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai resolved PIG-1605.
-----------------------------
    Hadoop Flags: [Reviewed]
    Resolution: Fixed

The release audit warning is due to jdiff; no new files were added. Patch committed to both trunk and the 0.8 branch.

> Adding soft link to plan to solve input file dependency
> -------------------------------------------------------
>
> Key: PIG-1605
> URL: https://issues.apache.org/jira/browse/PIG-1605
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: 0.8.0
> Reporter: Daniel Dai
> Assignee: Daniel Dai
> Fix For: 0.8.0
> Attachments: PIG-1605-1.patch, PIG-1605-2.patch
>
> In the scalar implementation we need to deal with implicit dependencies.
> [PIG-1603|https://issues.apache.org/jira/browse/PIG-1603] tries to solve the
> problem by adding an LOScalar operator. Here is a different approach: we add
> a soft link to the plan, and the soft link is visible only to the walkers.
> By doing this we make sure we visit the LOStore which generates the scalar
> first, and then the LOForEach which uses the scalar. All other parts of the
> logical plan are unaware of the soft link. The benefits are:
> 1. The logical plan does not need to deal with LOScalar, which keeps the
> logical plan cleaner.
> 2. Conceptually, a scalar dependency is different. A regular link represents
> a data flow in the pipeline; for a scalar, the dependency means one operator
> depends on a file generated by another operator. It is a different type of
> data dependency.
> 3. Soft links can solve other dependency problems in the future. If we
> introduce another UDF that depends on a file generated by another operator,
> we can use this mechanism to solve it.
> 4. With soft links we can use scalars coming from different sources in the
> same statement, which in my mind is not a rare use case (e.g.: D = foreach C
> generate c0/A.total, c1/B.count;).
> Currently there are two cases where we can use a soft link:
> 1. scalar dependency, where the ReadScalar UDF uses a file generated by an
> LOStore;
> 2. store-load dependency, where we load a file that is generated by a store
> in the same script. This happens in the multi-store case. Currently we solve
> it with a regular link; it is better to use a soft link.

-- This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
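The soft-link idea above can be sketched in a few lines of Python. This is an illustration only, not Pig's actual classes: soft edges participate in the walker's topological ordering but are invisible to ordinary plan queries such as successors().

```python
# Sketch of PIG-1605's soft links (hypothetical Plan class, not Pig's code):
# walkers order operators over regular AND soft edges, while the plan's
# visible structure (successors) sees only regular edges.
from collections import defaultdict

class Plan:
    def __init__(self):
        self.edges = defaultdict(list)       # regular links: pipeline data flow
        self.soft_edges = defaultdict(list)  # soft links: file dependencies
        self.nodes = []

    def add(self, node):
        self.nodes.append(node)

    def connect(self, src, dst, soft=False):
        (self.soft_edges if soft else self.edges)[src].append(dst)

    def successors(self, node):
        # The rest of the logical plan is unaware of soft links.
        return list(self.edges[node])

    def walk_order(self):
        # Topological order over regular + soft edges, so the LOStore that
        # produces the scalar file is visited before the LOForEach reading it.
        indeg = {n: 0 for n in self.nodes}
        for graph in (self.edges, self.soft_edges):
            for src, dsts in graph.items():
                for d in dsts:
                    indeg[d] += 1
        ready = [n for n in self.nodes if indeg[n] == 0]
        order = []
        while ready:
            n = ready.pop(0)
            order.append(n)
            for graph in (self.edges, self.soft_edges):
                for d in graph[n]:
                    indeg[d] -= 1
                    if indeg[d] == 0:
                        ready.append(d)
        return order

plan = Plan()
for op in ["LOLoad", "LOStore(scalar)", "LOForEach(uses scalar)"]:
    plan.add(op)
plan.connect("LOLoad", "LOStore(scalar)")
plan.connect("LOLoad", "LOForEach(uses scalar)")
# the scalar reference becomes a soft link, visible only to walk_order():
plan.connect("LOStore(scalar)", "LOForEach(uses scalar)", soft=True)
```

Note how successors("LOStore(scalar)") stays empty: the soft link changes visitation order without adding a visible plan edge.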
[jira] Updated: (PIG-1605) Adding soft link to plan to solve input file dependency
[ https://issues.apache.org/jira/browse/PIG-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1605:
----------------------------
    Attachment: PIG-1605-2.patch

PIG-1605-2.patch fixes the findbugs warnings. test-patch result:

     [exec] -1 overall.
     [exec] +1 @author. The patch does not contain any @author tags.
     [exec] +1 tests included. The patch appears to include 6 new or modified tests.
     [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
     [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
     [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
     [exec] -1 release audit. The applied patch generated 455 release audit warnings (more than the trunk's current 453 warnings).
[jira] Updated: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed
[ https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1635:
--------------------------
    Status: Patch Available (was: Open)

> Logical simplifier does not simplify away constants under AND and OR; after
> simplification the ordering of operands of AND and OR may get changed
> ---------------------------------------------------------------------------
>
> Key: PIG-1635
> URL: https://issues.apache.org/jira/browse/PIG-1635
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: 0.8.0
> Reporter: Yan Zhou
> Assignee: Yan Zhou
> Priority: Minor
> Fix For: 0.8.0
> Attachments: PIG-1635.patch
>
> b = FILTER a by ((f1 > 1) AND (1 == 1))
> or
> b = FILTER a by ((f1 > 1) OR (1 == 0))
> should be simplified to
> b = FILTER a by f1 > 1;
> Regarding the ordering change, an example is that
> b = filter a by ((f1 is not null) AND (f2 is not null));
> even without any possible simplification, is changed to
> b = filter a by ((f2 is not null) AND (f1 is not null));
> Even though the ordering change in this case, and probably in most other
> cases, makes no difference, users might care about the ordering for two
> reasons: stateful UDFs may be used as operands of AND or OR, and the
> ordering may be intended by the application designer to maximize the chance
> of short-circuiting the composite boolean evaluation.
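The behavior this issue asks for, eliminating constants under AND/OR without reordering the remaining operands, can be sketched as follows. The tuple representation here is hypothetical, not Pig's logical-operator classes.

```python
# Order-preserving constant elimination under AND/OR (sketch of the
# PIG-1635 fix; expression encoding is invented for illustration).
def simplify(expr):
    # expr is a constant True/False, a string leaf like "f1 > 1",
    # or a tuple ("AND"|"OR", left, right).
    if not isinstance(expr, tuple):
        return expr
    op, l, r = expr
    l, r = simplify(l), simplify(r)
    identity = (op == "AND")   # AND x true = x ;  OR x false = x
    absorber = not identity    # AND x false = false ;  OR x true = true
    if l == absorber or r == absorber:
        return absorber
    if l == identity:
        return r
    if r == identity:
        return l
    return (op, l, r)          # operands keep their original order

# (f1 > 1) AND (1 == 1)  simplifies to  f1 > 1
assert simplify(("AND", "f1 > 1", True)) == "f1 > 1"
```

The key point for the second half of the issue is the last line of simplify(): when nothing folds away, (op, l, r) is rebuilt with l and r in their original positions, so stateful UDFs and deliberate short-circuit ordering are preserved.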
[jira] Updated: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed
[ https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1635:
--------------------------
    Attachment: PIG-1635.patch
[jira] Commented: (PIG-1636) Scalar fails if the scalar variable is generated by limit
[ https://issues.apache.org/jira/browse/PIG-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913339#action_12913339 ]

Thejas M Nair commented on PIG-1636:
------------------------------------

+1

> Scalar fails if the scalar variable is generated by limit
> ---------------------------------------------------------
>
> Key: PIG-1636
> URL: https://issues.apache.org/jira/browse/PIG-1636
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: 0.8.0
> Reporter: Daniel Dai
> Assignee: Daniel Dai
> Fix For: 0.8.0
> Attachments: PIG-1636-1.patch
>
> The following script fails:
> {code}
> a = load 'studenttab10k' as (name: chararray, age: int, gpa: float);
> b = group a all;
> c = foreach b generate SUM(a.age) as total;
> c1 = limit c 1;
> d = foreach a generate name, age/(double)c1.total as d_sum;
> store d into '111';
> {code}
> The problem is that d holds a reference to c1. In the optimizer we push the
> limit before the foreach, d still references the limit, and we get the wrong
> schema for the scalar.
[jira] Commented: (PIG-1605) Adding soft link to plan to solve input file dependency
[ https://issues.apache.org/jira/browse/PIG-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913335#action_12913335 ]

Thejas M Nair commented on PIG-1605:
------------------------------------

Looks good. +1

Possible optimizations (can be done in future):
1. If the column-pruning rule removes the relation-as-scalar column, then the soft link can be removed.
2. The split-filter rule is disabled if the filter expression contains a relation-as-scalar. If we split filter expressions that contain a relation-as-scalar and update the soft links accordingly, we don't need to disable this rule.
[jira] Updated: (PIG-1598) Pig gobbles up error messages - Part 2
[ https://issues.apache.org/jira/browse/PIG-1598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1598:
----------------------------
    Status: Resolved (was: Patch Available)
    Hadoop Flags: [Reviewed]
    Resolution: Fixed

Patch looks good. Committed to both trunk and the 0.8 branch.

> Pig gobbles up error messages - Part 2
> --------------------------------------
>
> Key: PIG-1598
> URL: https://issues.apache.org/jira/browse/PIG-1598
> Project: Pig
> Issue Type: Improvement
> Reporter: Ashutosh Chauhan
> Assignee: niraj rai
> Fix For: 0.8.0
> Attachments: PIG-1598_0.patch
>
> Another case of PIG-1531.
[jira] Resolved: (PIG-772) Semantics of Filter statement inside ForEach should support filtering on aliases used in the Group statement preceding it
[ https://issues.apache.org/jira/browse/PIG-772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates resolved PIG-772.
----------------------------
    Resolution: Invalid

The error message here is bad, but this is an error. You are trying to secretly do a join in the filter line by referencing two relations (N and A). Pig does not allow a filter operator to have multiple inputs.

> Semantics of Filter statement inside ForEach should support filtering on
> aliases used in the Group statement preceding it
> ------------------------------------------------------------------------
>
> Key: PIG-772
> URL: https://issues.apache.org/jira/browse/PIG-772
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Affects Versions: 0.3.0
> Reporter: Viraj Bhat
> Assignee: Alan Gates
> Priority: Minor
> Fix For: 0.9.0
> Attachments: half.txt
>
> I have a Pig script which tries to display all bags which are greater than
> the average value in the group.
> Input: half.txt
> ===============
> A 1
> A 2
> A 3
> B 1
> B 3
>
> {code}
> A = LOAD 'half.txt' AS (key:CHARARRAY, val:INT);
> B = GROUP A BY key;
> C = FOREACH B {
>     N = AVG(A.val);
>     HALF = FILTER A by val >= N;
>     GENERATE
>         FLATTEN(GROUP),
>         HALF;
> };
> dump C;
> {code}
>
> Expected output:
> (A,{(A,2),(A,3)})
> (B,{(B,3)})
>
> Presently the semantics of the FILTER statement inside the FOREACH do not
> support these types of operations. Error when running the above script:
> =====
> ERROR 1000: Error during parsing. Invalid alias: A in {key: chararray,val: int}
> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. Invalid alias: A in {key: chararray,val: int}
>     at org.apache.pig.PigServer.parseQuery(PigServer.java:320)
>     at org.apache.pig.PigServer.registerQuery(PigServer.java:279)
>     at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:529)
>     at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:280)
>     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:99)
>     at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88)
>     at org.apache.pig.Main.main(Main.java:364)
> =====
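What the script above is trying to compute, keeping each row whose value is at least the group average, can be written without a multi-input filter. A plain-Python sketch of the intended semantics, using the half.txt data and the expected output from the issue:

```python
# Group by key, then keep rows whose value is >= the group's average
# (the result the PIG-772 reporter expected). Plain-Python illustration,
# not a statement about Pig's filter semantics.
from collections import defaultdict

rows = [("A", 1), ("A", 2), ("A", 3), ("B", 1), ("B", 3)]

groups = defaultdict(list)
for key, val in rows:
    groups[key].append(val)

result = {
    key: [v for v in vals if v >= sum(vals) / len(vals)]
    for key, vals in groups.items()
}
# result == {"A": [2, 3], "B": [3]}, matching the expected output
# (A,{(A,2),(A,3)}) / (B,{(B,3)}) in the issue.
```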
[jira] Updated: (PIG-1639) New logical plan: PushUpFilter should not optimize if filter condition contains UDF
[ https://issues.apache.org/jira/browse/PIG-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1639:
--------------------------------
    Assignee: Xuefu Zhang (was: Daniel Dai)

> New logical plan: PushUpFilter should not optimize if filter condition
> contains UDF
> ----------------------------------------------------------------------
>
> Key: PIG-1639
> URL: https://issues.apache.org/jira/browse/PIG-1639
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: 0.8.0
> Reporter: Daniel Dai
> Assignee: Xuefu Zhang
> Fix For: 0.8.0
>
> The following script fails:
> {code}
> a = load 'file' AS (f1, f2, f3);
> b = group a by f1;
> c = filter b by COUNT(a) > 1;
> dump c;
> {code}
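The rule guard the title describes can be sketched as a simple predicate check: decline to push a filter above a group when its condition calls a UDF, since an expression such as COUNT(a) is only meaningful after the group. The token representation below is invented for illustration, not Pig's PushUpFilter code.

```python
# Sketch of the PIG-1639 guard: a PushUpFilter-style rule should skip
# filters whose condition invokes a UDF (hypothetical token-list encoding).
def contains_udf(predicate_tokens):
    # treat any NAME( token as a function/UDF invocation
    return any(tok.endswith("(") for tok in predicate_tokens)

def can_push_above_group(predicate_tokens):
    return not contains_udf(predicate_tokens)

# "f1 > 1" is safe to push; "COUNT(a) > 1" is not.
assert can_push_above_group(["f1", ">", "1"])
assert not can_push_above_group(["COUNT(", "a", ")", ">", "1"])
```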
[jira] Created: (PIG-1640) bin/pig does not run in local mode due to classes missing from classpath
bin/pig does not run in local mode due to classes missing from classpath
------------------------------------------------------------------------

             Key: PIG-1640
             URL: https://issues.apache.org/jira/browse/PIG-1640
         Project: Pig
      Issue Type: Bug
Affects Versions: 0.8.0
        Reporter: Olga Natkovich
         Fix For: 0.8.0

This issue was reported by one of the Yahoo users. I have not verified the problem. Here is the report:

"When doing bin/pig -x local, the shell doesn't come up. It complained about jline not being found. Here is a patch to bin/pig:

+for f in $PIG_HOME/build/ivy/lib/Pig/*.jar; do
+    CLASSPATH=${CLASSPATH}:$f;
+done
"
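The reported patch simply appends every jar under build/ivy/lib/Pig to the classpath. The same construction, as a small Python helper (directory layout and names are illustrative, taken from the patch above):

```python
# Equivalent of the bin/pig shell loop: join each .jar in the ivy lib
# directory onto an existing classpath with ':' separators.
def extend_classpath(classpath, jar_names, jar_dir="build/ivy/lib/Pig"):
    jars = [jar_dir + "/" + j for j in sorted(jar_names) if j.endswith(".jar")]
    return ":".join([classpath] + jars) if jars else classpath

# non-jar entries are skipped, just like the *.jar glob in the shell loop
assert extend_classpath("pig.jar", ["jline.jar", "README.txt"]) == \
    "pig.jar:build/ivy/lib/Pig/jline.jar"
```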
[jira] Updated: (PIG-1639) New logical plan: PushUpFilter should not optimize if filter condition contains UDF
[ https://issues.apache.org/jira/browse/PIG-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1639:
----------------------------
    Description:
The following script fails:
{code}
a = load 'file' AS (f1, f2, f3);
b = group a by f1;
c = filter b by COUNT(a) > 1;
dump c;
{code}
[jira] Created: (PIG-1639) New logical plan: PushUpFilter should not optimize if filter condition contains UDF
New logical plan: PushUpFilter should not optimize if filter condition contains UDF
-----------------------------------------------------------------------------------

             Key: PIG-1639
             URL: https://issues.apache.org/jira/browse/PIG-1639
         Project: Pig
      Issue Type: Bug
      Components: impl
Affects Versions: 0.8.0
        Reporter: Daniel Dai
        Assignee: Daniel Dai
         Fix For: 0.8.0
[jira] Updated: (PIG-1636) Scalar fails if the scalar variable is generated by limit
[ https://issues.apache.org/jira/browse/PIG-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1636:
----------------------------
    Attachment: PIG-1636-1.patch

This patch depends on PIG-1605.
[jira] Updated: (PIG-1638) sh output gets mixed up with the grunt prompt
[ https://issues.apache.org/jira/browse/PIG-1638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

niraj rai updated PIG-1638:
---------------------------
    Status: Patch Available (was: Open)

> sh output gets mixed up with the grunt prompt
> ---------------------------------------------
>
> Key: PIG-1638
> URL: https://issues.apache.org/jira/browse/PIG-1638
> Project: Pig
> Issue Type: Bug
> Components: grunt
> Affects Versions: 0.8.0
> Reporter: niraj rai
> Assignee: niraj rai
> Priority: Minor
> Fix For: 0.8.0
> Attachments: PIG-1638_0.patch
>
> Many times the grunt prompt gets mixed up with the sh output, e.g.:
> grunt> sh ls
> 000
> autocomplete
> bin
> build
> build.xml
> grunt> CHANGES.txt
> conf
> contrib
> In the above case, grunt> is mixed up with the output.
[jira] Updated: (PIG-1638) sh output gets mixed up with the grunt prompt
[ https://issues.apache.org/jira/browse/PIG-1638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

niraj rai updated PIG-1638:
---------------------------
    Attachment: PIG-1638_0.patch
[jira] Created: (PIG-1638) sh output gets mixed up with the grunt prompt
sh output gets mixed up with the grunt prompt
---------------------------------------------

             Key: PIG-1638
             URL: https://issues.apache.org/jira/browse/PIG-1638
         Project: Pig
      Issue Type: Bug
      Components: grunt
Affects Versions: 0.8.0
        Reporter: niraj rai
        Assignee: niraj rai
        Priority: Minor
         Fix For: 0.8.0
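The interleaving reported in PIG-1638 happens when the next prompt is printed before the child process's output has been fully consumed. One way to avoid it, sketched in Python rather than grunt's Java code: collect the command's output to completion, write it out, and only then let the caller print the next prompt.

```python
# Sketch of a race-free `sh` command handler: run the child to completion
# and drain its output before control returns to the prompt loop.
import subprocess
import sys

def run_sh(cmd):
    # capture and wait: nothing from the child can appear after the prompt
    proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    sys.stdout.write(proc.stdout)
    sys.stdout.flush()
    return proc.returncode

# only after run_sh() returns should the caller print "grunt> " again
```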
[jira] Assigned: (PIG-419) Combiner optimizations extended to nested foreach statements as well
[ https://issues.apache.org/jira/browse/PIG-419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates reassigned PIG-419:
------------------------------
    Assignee: Thejas M Nair

> Combiner optimizations extended to nested foreach statements as well
> --------------------------------------------------------------------
>
> Key: PIG-419
> URL: https://issues.apache.org/jira/browse/PIG-419
> Project: Pig
> Issue Type: Improvement
> Reporter: Anand Murugappan
> Assignee: Thejas M Nair
>
> While Pig 2.0 seems to have optimized foreach statements by using the
> combiner more aggressively, nested foreach statements lack this
> functionality. Given that several of our projects use nested foreach
> statements, we would like to see the optimizations extended to those cases
> as well.
[jira] Updated: (PIG-453) Scope resolution operators in flattened schemas need to be fixed
[ https://issues.apache.org/jira/browse/PIG-453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-453:
---------------------------
    Assignee: Alan Gates (was: Santhosh Srinivasan)
    Fix Version/s: 0.9.0

> Scope resolution operators in flattened schemas need to be fixed
> ----------------------------------------------------------------
>
> Key: PIG-453
> URL: https://issues.apache.org/jira/browse/PIG-453
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.2.0
> Reporter: Santhosh Srinivasan
> Assignee: Alan Gates
> Priority: Minor
> Fix For: 0.9.0
>
> Currently, the scope resolution operator :: is stored as part of the field
> schema alias. As a result, users may get confused by queries like:
> {code}
> a = load 'st10k' as (name, age, gpa);
> b = group a by name;
> c = foreach b generate flatten(a);
> d = filter c by name != 'fred';
> e = group d by name;
> f = foreach e generate flatten(d);
> g = foreach f generate name;
> {code}
> With PIG-451, the schema for f will have a column with aliases a::name and
> d::a::name. The use of d::a::name is particularly confusing.
[jira] Updated: (PIG-438) Handle realiasing of existing Alias (A=B;)
[ https://issues.apache.org/jira/browse/PIG-438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-438:
---------------------------
    Assignee: Alan Gates
    Fix Version/s: 0.9.0
    Priority: Minor (was: Major)

> Handle realiasing of existing Alias (A=B;)
> ------------------------------------------
>
> Key: PIG-438
> URL: https://issues.apache.org/jira/browse/PIG-438
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.2.0
> Reporter: Pradeep Kamath
> Assignee: Alan Gates
> Priority: Minor
> Fix For: 0.9.0
>
> We do not handle re-aliasing of an existing alias; this should be handled
> correctly. The following script should work:
> {code}
> a = load 'studenttab10k';
> b = filter a by $1 > '25';
> c = b;
> -- use b
> d = cogroup b by $0, a by $0;
> e = foreach d generate flatten(b), flatten(a);
> dump e;
> -- use c
> f = cogroup c by $0, a by $0;
> g = foreach f generate flatten(c), flatten(a);
> dump g;
> {code}
[jira] Assigned: (PIG-479) PERFORMANCE: more extensive use of the combiner
[ https://issues.apache.org/jira/browse/PIG-479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates reassigned PIG-479:
------------------------------
    Assignee: Thejas M Nair

> PERFORMANCE: more extensive use of the combiner
> -----------------------------------------------
>
> Key: PIG-479
> URL: https://issues.apache.org/jira/browse/PIG-479
> Project: Pig
> Issue Type: Improvement
> Affects Versions: 0.2.0
> Reporter: Olga Natkovich
> Assignee: Thejas M Nair
>
> On the types branch, the combiner is used any time a foreach includes only
> simple projections and/or algebraic functions. It would also be useful to
> invoke the combiner in cases where algebraic and non-algebraic operations
> are mixed, or where expression evaluation is included in the foreach.
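An "algebraic" function, in the sense used above, is one that can be computed in pieces: an initial step per input split, an intermediate combine step that may run any number of times, and a final step. A minimal sketch with SUM (class and method names here are illustrative, not Pig's Algebraic interface):

```python
# Why algebraic functions are combiner-friendly: splitting the input and
# combining partial results must give the same answer as one global pass.
class AlgebraicSum:
    @staticmethod
    def initial(values):      # runs map-side, once per split
        return sum(values)

    @staticmethod
    def intermed(partials):   # runs in the combiner, zero or more times
        return sum(partials)

    @staticmethod
    def final(partials):      # runs reduce-side, once
        return sum(partials)

data = [1, 2, 3, 4, 5, 6]
# two "map tasks" each produce a partial, the combiner merges them:
partials = [AlgebraicSum.initial(data[:3]), AlgebraicSum.initial(data[3:])]
combined = AlgebraicSum.final([AlgebraicSum.intermed(partials)])
assert combined == sum(data)   # same result as a single global sum
```

This decomposition is what the combiner exploits; mixing in a non-algebraic operation breaks the property, which is why extending combiner use to mixed foreach statements is the harder case this issue asks for.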
[jira] Updated: (PIG-496) project of bags from complex data causes failures
[ https://issues.apache.org/jira/browse/PIG-496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-496:
---------------------------
    Assignee: Alan Gates
    Fix Version/s: 0.9.0
    Priority: Minor (was: Major)

> project of bags from complex data causes failures
> -------------------------------------------------
>
> Key: PIG-496
> URL: https://issues.apache.org/jira/browse/PIG-496
> Project: Pig
> Issue Type: Bug
> Reporter: Olga Natkovich
> Assignee: Alan Gates
> Priority: Minor
> Fix For: 0.9.0
>
> A = load 'complex data' as (x: bag{});
> B = foreach A generate x.($1, $2);
> produces the stack trace:
> 2008-10-14 15:11:07,639 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher - Error message from task (reduce) task_200809241441_9923_r_00
> java.lang.NullPointerException
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:183)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:215)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:166)
>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:252)
>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:222)
>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:134)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
>     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
> Pradeep suspects that the problem is in
> src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POProject.java,
> line 374.
[jira] Updated: (PIG-516) order by
[ https://issues.apache.org/jira/browse/PIG-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-516:
---------------------------
    Assignee: Alan Gates
    Fix Version/s: 0.9.0

> order by
> --------
>
> Key: PIG-516
> URL: https://issues.apache.org/jira/browse/PIG-516
> Project: Pig
> Issue Type: New Feature
> Affects Versions: 0.2.0
> Reporter: Christopher Olston
> Assignee: Alan Gates
> Priority: Minor
> Fix For: 0.9.0
>
> I want to do ORDER A BY f($0). This should be allowed. (The workaround of
> adding a column, sorting, then removing the column is yucky, and in fact
> impossible if I don't know the schema.)
> Important use case: ORDER A BY Random(), to do random sorting.
[jira] Updated: (PIG-666) Bug in Schema comparison for equality
[ https://issues.apache.org/jira/browse/PIG-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-666:
---------------------------
    Assignee: Alan Gates
    Fix Version/s: 0.9.0

> Bug in Schema comparison for equality
> -------------------------------------
>
> Key: PIG-666
> URL: https://issues.apache.org/jira/browse/PIG-666
> Project: Pig
> Issue Type: Bug
> Components: build
> Environment: i686 i386 GNU/Linux
> Reporter: Araceli Henley
> Assignee: Alan Gates
> Priority: Minor
> Fix For: 0.9.0
>
> Santhosh is currently improving error handling; I ran these tests against
> pig_phase_3.jar. This is a bug in the schema comparison for equality.
>
> # valid use of MAX with Bag as value
> TEST: AggregateFunc_23.pig
> A = LOAD '/user/pig/tests/data/types/DataAll' USING PigStorage() AS ( Fint:int, Flong:long, Fdouble:double, Ffloat:float, Fchar:chararray, Fchararray:chararray, Fbytearray:bytearray, Fmap:map[], Fbag:BAG{ t:tuple( name, age, avg ) }, Ftuple:( name:chararray, age:int, avg:float) );
> B = GROUP A ALL;
> X = FOREACH B GENERATE A.Fint, MAX( (BAG{tuple(int)}) A.Fbag.age );
> STORE X INTO '/user/pig/tests/results/araceli.1234381533/AggregateFunc_23.out' USING PigStorage();
>
> # valid use of SUM with int with valid cast
> TEST: AggregateFunc_231.pig
> A = LOAD '/user/pig/tests/data/types/DataAll' USING PigStorage() AS ( Fint:int, Flong:long, Fdouble:double, Ffloat:float, Fchar:chararray, Fchararray:chararray, Fbytearray:bytearray, Fmap:map[], Fbag:BAG{ t:tuple( name, age, avg ) }, Ftuple:( name:chararray, age:int, avg:float) );
> B = GROUP A ALL;
> X = FOREACH B GENERATE A.Fint, SUM( (BAG{tuple(double)}) A.Fbag.age );
> STORE X INTO '/user/pig/tests/results/araceli.1234381533/AggregateFunc_231.out' USING PigStorage();
>
> # valid use of SUM with cast for field in bag
> TEST: AggregateFunc_26.pig
> A = LOAD '/user/pig/tests/data/types/DataAll' USING PigStorage() AS ( Fint:int, Flong:long, Fdouble:double, Ffloat:float, Fchar:chararray, Fchararray:chararray, Fbytearray:bytearray, Fmap:map[], Fbag:BAG{ t:tuple( name, age, avg ) }, Ftuple:( name:chararray, age:int, avg:float) );
> B = GROUP A ALL;
> X = FOREACH B GENERATE SUM( (BAG{tuple(int)}) A.Fbag.age );
> STORE X INTO '/user/pig/tests/results/araceli.1234381533/AggregateFunc_26.out' USING PigStorage();
>
> # valid use of MIN with cast for field in bag
> TEST: AggregateFunc_27.pig
> A = LOAD '/user/pig/tests/data/types/DataAll' USING PigStorage() AS ( Fint:int, Flong:long, Fdouble:double, Ffloat:float, Fchar:chararray, Fchararray:chararray, Fbytearray:bytearray, Fmap:map[], Fbag:BAG{ t:tuple( name, age, avg ) }, Ftuple:( name:chararray, age:int, avg:float) );
> B = GROUP A ALL;
> X = FOREACH B GENERATE MIN( (BAG{tuple(int)}) A.Fbag.age );
> STORE X INTO '/user/pig/tests/results/araceli.1234381533/AggregateFunc_27.out' USING PigStorage();
>
> # valid use of AVG with Long as value
> TEST: AggregateFunc_46.pig
> A = LOAD '/user/pig/tests/data/types/DataAll' USING PigStorage() AS ( Fint:int, Flong:long, Fdouble:double, Ffloat:float, Fchar:chararray, Fchararray:chararray, Fbytearray:bytearray, Fmap:map[], Fbag:BAG{ t:tuple( name, age, avg ) }, Ftuple:( name:chararray, age:int, avg:float) );
> B = GROUP A ALL;
> X = FOREACH B GENERATE A.Fint, AVG( (BAG{tuple(double)}) A.Fint );
> STORE X INTO '/user/pig/tests/results/araceli.1234381533/AggregateFunc_47.out' USING PigStorage();
[jira] Updated: (PIG-667) Error in projection implementation or in typechecking when casting a member of Bag
[ https://issues.apache.org/jira/browse/PIG-667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-667: --- Assignee: Alan Gates Fix Version/s: 0.9.0 Priority: Minor (was: Major) > Error in projection implementation or in typechecking when casting a member > of Bag > --- > > Key: PIG-667 > URL: https://issues.apache.org/jira/browse/PIG-667 > Project: Pig > Issue Type: Bug > Environment: i686 i386 GNU/Linux >Reporter: Araceli Henley >Assignee: Alan Gates >Priority: Minor > Fix For: 0.9.0 > > > As one of its members, a bag contains "age" of type "int". When this value is > used as an argument to DIFF and cast as an int for the comparison, the > following error is thrown: > 9/02/11 14:20:46 INFO mapReduceLayer.MapReduceLauncher: 50% complete > 09/02/11 14:21:31 ERROR mapReduceLayer.MapReduceLauncher: Map reduce job > failed > 09/02/11 14:21:31 ERROR mapReduceLayer.MapReduceLauncher: Number of failed > jobs: 1 > 09/02/11 14:21:31 ERROR mapReduceLayer.MapReduceLauncher: Job failed! 
> error message for task: map > error message for task: reduce > 09/02/11 14:21:31 ERROR grunt.Grunt: ERROR 1072: Out of bounds access: > Request for field number 1 exceeds tuple size of 1 > Steps to reproduce > # valid use of DIFF with valid cast for bag field > TEST ErrorHandling.AggregateFunc_601 > A =LOAD '/user/pig/tests/data/types/DataAll' USING PigStorage() AS ( > Fint:int, Flong:long, Fdouble:double, Ffloat:float, Fchar:chararray, > Fchararray:chararray, Fbytearray:bytearray, Fmap:map[], Fbag:BAG{ t:tuple( > name, age, avg ) }, Ftuple:( name:chararray, age:int, avg:float) ); > B =GROUP A ALL; > X =FOREACH B GENERATE DIFF ( ( BAG{tuple(int)} ) A.Fbag.age, A.Fint ); > STORE X INTO > '/user/pig/tests/results/araceli.1234390832/AggregateFunc_601.out' USING > PigStorage(); > # invalid use of DIFF with valid cast for bag field, DIFF contains one > argument instead of two > TEST ErrorHandling.AggregateFunc_60 > A =LOAD '/user/pig/tests/data/types/DataAll' USING PigStorage() AS ( > Fint:int, Flong:long, Fdouble:double, Ffloat:float, Fchar:chararray, > Fchararray:chararray, Fbytearray:bytearray, Fmap:map[], Fbag:BAG{ t:tuple( > name, age, avg ) }, Ftuple:( name:chararray, age:int, avg:float) ); > B =GROUP A ALL; > X =FOREACH B GENERATE DIFF ( ( BAG{tuple(int)} ) A.Fbag.age ); > STORE X INTO > '/user/pig/tests/results/araceli.1234381533/AggregateFunc_60.out' USING > PigStorage(); -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-669) Bug in Schema comparison for casting
[ https://issues.apache.org/jira/browse/PIG-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates resolved PIG-669. Resolution: Not A Problem This is correct behavior. SUM does not accept two arguments. > Bug in Schema comparison for casting > > > Key: PIG-669 > URL: https://issues.apache.org/jira/browse/PIG-669 > Project: Pig > Issue Type: Bug > Environment: i686 i386 GNU/Linux >Reporter: Araceli Henley > > This is a bug in the Schema comparison for casting. This is a valid use of a > cast in SUM; the first and second arguments are a cast to a Bag with an int. > ERROR 1045: Could not infer the matching function for > org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an > explicit cast. > TEST: AggregateFunc_61 > A =LOAD '/user/pig/tests/data/types/DataAll' USING PigStorage() AS ( > Fint:int, Flong:long, Fdouble:double, Ffloat:float, Fchar:chararray, > Fchararray:chararray, Fbytearray:bytearray, Fmap:map[], Fbag:BAG{ t:tuple( > name, age, avg ) }, Ftuple:( name:chararray, age:int, avg:float) ); > B =GROUP A ALL; > X =FOREACH B GENERATE SUM ( ( BAG{tuple(int)} ) A.Fbag.age, ( BAG{tuple(int)} > ) A.Fbag.age); > STORE X INTO > '/user/pig/tests/results/araceli.1234465985/AggregateFunc_61.out' USING > PigStorage(); > Suggest you also try: > X =FOREACH B GENERATE SUM ( ( BAG{tuple(int)} ) A.Fbag.age, A.Fint ); -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-670) DIFF contains an invalid expression - possible parser error
[ https://issues.apache.org/jira/browse/PIG-670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-670: --- Assignee: Xuefu Zhang Fix Version/s: 0.9.0 Priority: Minor (was: Major) > DIFF contains an invalid expression - possible parser error > --- > > Key: PIG-670 > URL: https://issues.apache.org/jira/browse/PIG-670 > Project: Pig > Issue Type: Bug > Environment: i686 i386 GNU/Linux >Reporter: Araceli Henley >Assignee: Xuefu Zhang >Priority: Minor > Fix For: 0.9.0 > > > Requires further investigation. > This test takes in an invalid expression as the first argument in the DIFF > function and results in the following error: > ERROR 1000: Error during parsing. Invalid alias: DIFF > Why is the parser interpreting DIFF as an alias? > TEST: AggregateFunc_131 > A =LOAD '/user/pig/tests/data/types/DataAll' USING PigStorage() AS ( > Fint:int, Flong:long, Fdouble:double, Ffloat:float, Fchar:chararray, > Fchararray:chararray, Fbytearray:bytearray, Fmap:map[], Fbag:BAG{ t:tuple( > name, age, avg ) }, Ftuple:( name:chararray, age:int, avg:float) ); > B =GROUP A ALL; > X =FOREACH B GENERATE DIFF( A.Fint + A.Fint + ); > STORE X INTO > '/user/pig/tests/results/araceli.1234381533/AggregateFunc_131.out' USING > PigStorage(); -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-931) Samples Syntax Error in Pig UDF Manual
[ https://issues.apache.org/jira/browse/PIG-931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Corinne Chandel resolved PIG-931. - Resolution: Fixed Fixed as part of Pig 0.8.0 beta-1 (see PIG-1600 for patch). > Samples Syntax Error in Pig UDF Manual > -- > > Key: PIG-931 > URL: https://issues.apache.org/jira/browse/PIG-931 > Project: Pig > Issue Type: Improvement > Components: documentation >Affects Versions: 0.2.0, 0.3.0 > Environment: Windows XP, firefox 3.5.2 >Reporter: Yiwei Chen >Assignee: Corinne Chandel >Priority: Trivial > Fix For: 0.8.0 > > > All samples with 'extends EvalFunc' have syntax errors in > http://hadoop.apache.org/pig/docs/r0.3.0/udf.html . > There shouldn't be parentheses; they should be angle brackets. > For example, in the "How to Write a Simple Eval Function" section: > public class UPPER extends EvalFunc (String) > should be > public class UPPER extends EvalFunc<String> -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
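The fix above is easier to see with a plain-Java illustration of the same generic syntax: a subclass fixes the type parameter of a generic base class, just as UPPER extends EvalFunc<String> does in Pig. The base class here is a stand-in, not Pig's actual EvalFunc, so this sketch compiles without Pig on the classpath.

```java
import java.util.Locale;

// Upper fixes the type parameter of the generic base class Func<T>
// with angle brackets, the syntax the UDF manual intended.
public class Upper extends Func<String> {
    @Override
    public String exec(String input) {
        return input.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(new Upper().exec("hello")); // prints HELLO
    }
}

// Stand-in for Pig's EvalFunc<T>: a generic base whose exec() returns T.
abstract class Func<T> {
    public abstract T exec(String input);
}
```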
[jira] Updated: (PIG-678) "as" support for group-by
[ https://issues.apache.org/jira/browse/PIG-678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-678: --- Assignee: Alan Gates Fix Version/s: 0.9.0 Priority: Minor (was: Major) > "as" support for group-by > - > > Key: PIG-678 > URL: https://issues.apache.org/jira/browse/PIG-678 > Project: Pig > Issue Type: Improvement >Reporter: Christopher Olston >Assignee: Alan Gates >Priority: Minor > Fix For: 0.9.0 > > > I should be able to use "as" with GROUP the same way I use it with LOAD, i.e. > rename the entire schema. This is especially important b/c the system > automatically assigns schema names for the output of group that many people > find unintuitive. > e.g. this should work: > grouped = GROUP data BY url AS (url, history); -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-719) store into 'filename'; should be valid syntax, but does not work
[ https://issues.apache.org/jira/browse/PIG-719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-719: --- Assignee: Xuefu Zhang Fix Version/s: 0.9.0 > store into 'filename'; should be valid syntax, but does not work > --- > > Key: PIG-719 > URL: https://issues.apache.org/jira/browse/PIG-719 > Project: Pig > Issue Type: Bug > Environment: pig local model (although I think it's a parsing > problem, not an execution problem) >Reporter: Christopher Olston >Assignee: Xuefu Zhang >Priority: Minor > Fix For: 0.9.0 > > > This pig script should work: > STORE (LOAD 'inputfile') INTO 'outputfile'; > but it does not. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-749) No attempt to check if 'flatten(group) as' has the same cardinality as 'group alias by'
[ https://issues.apache.org/jira/browse/PIG-749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-749: --- Assignee: Alan Gates Fix Version/s: 0.9.0 > No attempt to check if 'flatten(group) as' has the same cardinality as 'group > alias by' > --- > > Key: PIG-749 > URL: https://issues.apache.org/jira/browse/PIG-749 > Project: Pig > Issue Type: Bug > Components: grunt >Affects Versions: 0.3.0 >Reporter: Viraj Bhat >Assignee: Alan Gates >Priority: Minor > Fix For: 0.9.0 > > > A Pig script which groups by 3 columns and flattens them as 4 columns works > when in principle it should not, and should perhaps fail with a front-end error. > {code} > A = load 'groupcardinalitycheck.txt' using PigStorage() as (col1:chararray, > col2:chararray, col3:int, col4:chararray); > B = group A by (col1, col2, col3); > C = foreach B generate >flatten(group) as (col1, col2, col3, col4), >SIZE(A) as frequency; > dump C; > {code} > == > Data > == > hello CC 1 there > hello YSO 2 out > ouch CC 2 hey > == > Result of the preceding script > == > (ouch,CC,2,1L) > (hello,CC,1,1L) > (hello,YSO,2,1L) > == -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-750) Use combiner when a mix of algebraic and non-algebraic functions are used
[ https://issues.apache.org/jira/browse/PIG-750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-750: -- Assignee: Thejas M Nair Our performance tests have shown that having combiner and non-combiner functions in the same MR job actually severely slows things down. We suspect that this is because you have to pass the bags for the non-combiner functions through the combiner and you pay for the multiple (de)serialization passes. However, the other issue noted in this bug, the need to use the combiner when algebraic UDFs are involved in simple expressions, is valid, and is along the lines of issues Thejas is working on for the combiner. So I'm assigning the issue to him. > Use combiner when a mix of algebraic and non-algebraic functions are used > - > > Key: PIG-750 > URL: https://issues.apache.org/jira/browse/PIG-750 > Project: Pig > Issue Type: Improvement >Reporter: Amir Youssefi >Assignee: Thejas M Nair >Priority: Minor > > Currently Pig uses the combiner when all of a, b, c, ... are algebraic (e.g. SUM, AVG > etc.) in foreach: > foreach X generate a,b,c,... > It would be a performance improvement if it used the combiner when a mix of algebraic > and non-algebraic functions is used as well. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
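Why algebraic functions are combiner-friendly can be sketched outside Pig entirely: an algebraic aggregate can be computed from partial results, so each map-side combiner ships one small value instead of a whole bag. A hedged, self-contained Java sketch; the method names are illustrative and do not follow Pig's actual Algebraic interface:

```java
import java.util.ArrayList;
import java.util.List;

public class AlgebraicSumSketch {
    // "Initial/Intermediate" stage: runs map-side over one slice of
    // the data, emitting a single partial sum per slice.
    static long partialSum(List<Long> slice) {
        long s = 0;
        for (long v : slice) s += v;
        return s;
    }

    // "Final" stage: runs on the reducer over the partial results
    // only, so the full bag is never shipped or re-deserialized.
    static long finalSum(List<Long> partials) {
        long s = 0;
        for (long v : partials) s += v;
        return s;
    }

    public static void main(String[] args) {
        List<Long> mapper1 = List.of(1L, 2L, 3L);
        List<Long> mapper2 = List.of(4L, 5L);
        List<Long> partials = new ArrayList<>();
        partials.add(partialSum(mapper1));
        partials.add(partialSum(mapper2));
        System.out.println(finalSum(partials)); // prints 15
    }
}
```

A non-algebraic function has no such decomposition, which is why mixing the two forces the bags through the combiner and pays the serialization cost described above.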
[jira] Resolved: (PIG-916) Change the pig hbase interface to get more than one row at a time when scanning
[ https://issues.apache.org/jira/browse/PIG-916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy resolved PIG-916. --- Fix Version/s: 0.8.0 Resolution: Duplicate Fixed in PIG-1205 > Change the pig hbase interface to get more than one row at a time when > scanning > --- > > Key: PIG-916 > URL: https://issues.apache.org/jira/browse/PIG-916 > Project: Pig > Issue Type: Improvement >Reporter: Alex Newman >Assignee: Dmitriy V. Ryaboy >Priority: Trivial > Fix For: 0.8.0 > > > It should be significantly faster to get numerous rows at the same time > rather than one row at a time for large table extraction processes. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-772) Semantics of Filter statement inside ForEach should support filtering on aliases used in the Group statement preceding it
[ https://issues.apache.org/jira/browse/PIG-772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-772: --- Assignee: Alan Gates Fix Version/s: 0.9.0 > Semantics of Filter statement inside ForEach should support filtering on > aliases used in the Group statement preceding it > - > > Key: PIG-772 > URL: https://issues.apache.org/jira/browse/PIG-772 > Project: Pig > Issue Type: Improvement > Components: impl >Affects Versions: 0.3.0 >Reporter: Viraj Bhat >Assignee: Alan Gates >Priority: Minor > Fix For: 0.9.0 > > Attachments: half.txt > > > I have a Pig script which tries to display all bags which are greater than > the average value in the group. > Input: half.txt > === > A 1 > A 2 > A 3 > B 1 > B 3 > > {code} > A = LOAD 'half.txt' AS (key:CHARARRAY, val:INT); > B = GROUP A BY key; > C = FOREACH B { >N = AVG(A.val); >HALF = FILTER A by val >= N; > GENERATE >FLATTEN(GROUP), >HALF; > }; > dump C; > {code} > > Expected Output: > > (A,{(A,2),(A,3)}) > (B,{(B,3)}) > > Presently the semantics of the Filter statement inside the FOREACH does not > support these types of operations. > Error when running the above script. > = > ERROR 1000: Error during parsing. Invalid alias: A in {key: chararray,val: > int} > org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during > parsing. Invalid alias: A in {key: chararray,val: int} > at org.apache.pig.PigServer.parseQuery(PigServer.java:320) > at org.apache.pig.PigServer.registerQuery(PigServer.java:279) > at > org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:529) > at > org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:280) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:99) > at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88) > at org.apache.pig.Main.main(Main.java:364) > = -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-827) Redesign graph operations in OperatorPlan
[ https://issues.apache.org/jira/browse/PIG-827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates resolved PIG-827. Resolution: Fixed The new optimizer and plan structure introduced in 0.7 cover this. > Redesign graph operations in OperatorPlan > - > > Key: PIG-827 > URL: https://issues.apache.org/jira/browse/PIG-827 > Project: Pig > Issue Type: Improvement > Components: impl >Affects Versions: 0.3.0 >Reporter: Santhosh Srinivasan > > The graph operations swap, insertBetween, pushBefore, etc. have to be > re-implemented in a layered fashion. The layering will facilitate the re-use > of operations. In addition, use of operator.rewire in the aforementioned > operations requires transaction like ability due to various pre-conditions. > Often, the result of one of the operations leaves the graph in an > inconsistent state for the rewire operation. Clear layering and assignment of > the ability to rewire will remove these inconsistencies. For now, use of > rewire has resulted in a slightly less maintainable code along with the > necessity to use rewire with discretion. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-828) Problem accessing a tuple within a bag
[ https://issues.apache.org/jira/browse/PIG-828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-828: --- Assignee: Alan Gates Fix Version/s: 0.9.0 > Problem accessing a tuple within a bag > -- > > Key: PIG-828 > URL: https://issues.apache.org/jira/browse/PIG-828 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.3.0 >Reporter: Viraj Bhat >Assignee: Alan Gates > Fix For: 0.9.0 > > Attachments: studenttab5, tupleacc.pig > > > Below pig script creates a tuple which contains 3 columns, 2 of which are > chararray's and the third column is a bag of constant chararray. The script > later projects the tuple within a bag. > {code} > a = load 'studenttab5' as (name, age, gpa); > b = foreach a generate ('viraj', {('sms')}, 'pig') as > document:(id,singlebag:{singleTuple:(single)}, article); > describe b; > c = foreach b generate document.singlebag; > dump c; > {code} > When we run this script we get a run-time error in the Map phase. 
> > java.lang.ClassCastException: org.apache.pig.data.DefaultTuple cannot be cast > to org.apache.pig.data.DataBag > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:402) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:183) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:400) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:183) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:250) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:245) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:236) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65) > at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227) > at > org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209) > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-836) Allow setting of end-of-record delimiter in PigStorage
[ https://issues.apache.org/jira/browse/PIG-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates resolved PIG-836. Resolution: Won't Fix PigStorage now depends on TextInputFormat to parse lines. It does not allow the user to specify the end of line indicator. If it does at some point in the future then Pig can make use of that. We are not going to rewrite TextInputFormat for ourselves just to get this feature. > Allow setting of end-of-record delimiter in PigStorage > -- > > Key: PIG-836 > URL: https://issues.apache.org/jira/browse/PIG-836 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: George Mavromatis >Assignee: Benjamin Reed > > PigStorage allows overriding the default field delimiter ('\t'), but does not > allow overriding the record delimiter ('\n'). > It is a valid use case that fields contain new lines, e.g. because they are > contents of a document/web page. It is possible for the user to create a > custom load/store UDF to achieve that, but that is extra work for the user, > many users will have to do it, and that udf would be an exact code > duplicate of PigStorage except for the delimiter. > Thus, PigStorage() should allow configuring both field and record separators. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
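The requested feature amounts to splitting the input on a configurable record delimiter before field parsing, so fields may themselves contain '\n'. A minimal Java sketch of the idea; the names and the '\u0001' delimiter are illustrative, not part of Pig's or TextInputFormat's API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class DelimitedRecordSketch {
    // Split into records on recordDelim, then into fields on fieldDelim.
    static List<String[]> parse(String data, String recordDelim, String fieldDelim) {
        List<String[]> records = new ArrayList<>();
        for (String record : data.split(Pattern.quote(recordDelim), -1)) {
            if (!record.isEmpty()) {
                records.add(record.split(Pattern.quote(fieldDelim), -1));
            }
        }
        return records;
    }

    public static void main(String[] args) {
        // Fields may contain '\n' because records end with '\u0001'.
        String data = "doc1\tline one\nline two\u0001doc2\tbody\u0001";
        List<String[]> recs = parse(data, "\u0001", "\t");
        System.out.println(recs.size());    // prints 2
        System.out.println(recs.get(0)[1]); // field spans two lines
    }
}
```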
[jira] Commented: (PIG-1076) Make PigOutputCommitter conform with new FileOutputCommitter in hadoop trunk
[ https://issues.apache.org/jira/browse/PIG-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913242#action_12913242 ] Pradeep Kamath commented on PIG-1076: - The patch needs new Hadoop sources which have not yet been released on Apache, so until then the patch can be used against Hadoop trunk; but since the Pig build picks up released Hadoop, this would not be seamless. > Make PigOutputCommitter conform with new FileOutputCommitter in hadoop trunk > --- > > Key: PIG-1076 > URL: https://issues.apache.org/jira/browse/PIG-1076 > Project: Pig > Issue Type: Improvement >Reporter: Pradeep Kamath >Assignee: Pradeep Kamath > Attachments: PIG-1076.patch > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-847) Setting twoLevelAccessRequired field in a bag schema should not be required to access fields in the tuples of the bag
[ https://issues.apache.org/jira/browse/PIG-847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-847: --- Fix Version/s: 0.9.0 > Setting twoLevelAccessRequired field in a bag schema should not be required > to access fields in the tuples of the bag > - > > Key: PIG-847 > URL: https://issues.apache.org/jira/browse/PIG-847 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.3.0 >Reporter: Pradeep Kamath >Assignee: Alan Gates > Fix For: 0.9.0 > > > Currently Pig interprets the result type of a relation as a bag. However the > schema of the relation directly contains the schema describing the fields in > the tuples for the relation. However when a udf wants to return a bag or if > there is a bag in input data or if the user creates a bag constant, the > schema of the bag has one field schema which is that of the tuple. The > Tuple's schema has the types of the fields. To be able to access the fields > from the bag directly in such a case by using something like > . or ., the schema of the bag should > have the twoLevelAccess set to true so that pig's type system can > traverse the tuple schema and get to the field in question. This is confusing > - we should try and see if we can avoid needing this extra flag. A possible > solution is to treat bags the same way - whether they represent relations or > real bags. Another way is to introduce a special "relation" datatype for the > result type of a relation and bag type would be used only for true bags. In > this case, we would always need bag schema to have a tuple schema which would > describe the fields. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-871) Improve distribution of keys in reduce phase
[ https://issues.apache.org/jira/browse/PIG-871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-871: -- Assignee: Thejas M Nair > Improve distribution of keys in reduce phase > > > Key: PIG-871 > URL: https://issues.apache.org/jira/browse/PIG-871 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.3.0 >Reporter: Ankur >Assignee: Thejas M Nair > > The default hashing scheme used to distribute keys in reduce phase sometimes > results in an uneven distribution of keys resulting in 5 - 10 % of reducers > being overloaded with data. This bottleneck makes the PIG jobs really slow > and gives users a bad impression. > While there is no bulletproof solution to the problem in general, the > hashing can certainly be improved for better distribution. The proposal here > is to evaluate and incorporate other hashing schemes that give high avalanche > and more even distribution. We can start by evaluating MurmurHash which is > Apache 2.0 licensed and freely available here - > http://www.getopt.org/murmur/MurmurHash.java -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
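The skew can be measured directly by bucketing keys the way a default partitioner would, hash(key) mod numReducers, and inspecting bucket sizes; evaluating MurmurHash would mean swapping the hash call below. A self-contained Java sketch using String.hashCode as a stand-in for the default scheme (the keys are invented for illustration). Note the structural weakness it exposes: since 31 ≡ 1 (mod 3), String.hashCode mod 3 degenerates to the character sum mod 3, exactly the kind of pattern a high-avalanche hash avoids.

```java
import java.util.HashMap;
import java.util.Map;

public class KeySkewSketch {
    // Count how many keys land in each of `reducers` buckets under
    // hash(key) mod reducers, as a hash-partitioner would assign them.
    static Map<Integer, Integer> bucketCounts(String[] keys, int reducers) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (String k : keys) {
            int bucket = Math.floorMod(k.hashCode(), reducers);
            counts.merge(bucket, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] keys = {"foo", "bar", "banana", "orange", "pear", "kiwi"};
        Map<Integer, Integer> counts = bucketCounts(keys, 3);
        int total = counts.values().stream().mapToInt(Integer::intValue).sum();
        int largest = counts.values().stream().max(Integer::compare).orElse(0);
        // Four of the six keys collide into one bucket here.
        System.out.println(total + " keys, largest bucket: " + largest);
    }
}
```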
[jira] Commented: (PIG-904) Conversion from double to chararray for udf input arguments does not occur
[ https://issues.apache.org/jira/browse/PIG-904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913227#action_12913227 ] Alan Gates commented on PIG-904: I don't understand what the issue is here. CONCAT does not take doubles. The script above tries to pass it a double, and Pig properly says you can't do that. Is the issue that an implicit cast isn't inserted here? I don't think Pig currently does implicit casts to match possible UDF signatures. > Conversion from double to chararray for udf input arguments does not occur > -- > > Key: PIG-904 > URL: https://issues.apache.org/jira/browse/PIG-904 > Project: Pig > Issue Type: Bug >Affects Versions: 0.4.0 >Reporter: Pradeep Kamath >Assignee: Alan Gates > Fix For: 0.9.0 > > > Script showing the problem: > {noformat} > "a = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, > gpa:double); b = foreach a generate CONCAT(gpa, 'dummy'); dump b;" > Error shown: > 2009-08-03 17:04:27,573 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR > 1045: Could not infer the matching function for org.apache.pig.builtin.CONCAT > as multiple or none of them fit. Please use an explicit cast. > {noformat} > The error goes away if gpa is cast to chararray. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-904) Conversion from double to chararray for udf input arguments does not occur
[ https://issues.apache.org/jira/browse/PIG-904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-904: --- Assignee: Alan Gates Fix Version/s: 0.9.0 > Conversion from double to chararray for udf input arguments does not occur > -- > > Key: PIG-904 > URL: https://issues.apache.org/jira/browse/PIG-904 > Project: Pig > Issue Type: Bug >Affects Versions: 0.4.0 >Reporter: Pradeep Kamath >Assignee: Alan Gates > Fix For: 0.9.0 > > > Script showing the problem: > {noformat} > "a = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, > gpa:double); b = foreach a generate CONCAT(gpa, 'dummy'); dump b;" > Error shown: > 2009-08-03 17:04:27,573 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR > 1045: Could not infer the matching function for org.apache.pig.builtin.CONCAT > as multiple or none of them fit. Please use an explicit cast. > {noformat} > The error goes away if gpa is cast to chararray. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-946) Combiner optimizer does not optimize when limit follow group, foreach
[ https://issues.apache.org/jira/browse/PIG-946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-946: --- Assignee: Thejas M Nair Fix Version/s: 0.9.0 > Combiner optimizer does not optimize when limit follow group, foreach > - > > Key: PIG-946 > URL: https://issues.apache.org/jira/browse/PIG-946 > Project: Pig > Issue Type: Bug >Affects Versions: 0.3.0 >Reporter: Pradeep Kamath >Assignee: Thejas M Nair > Fix For: 0.9.0 > > Attachments: PIG-946-codechange-draft.patch > > > The following script is combinable but is not optimized: > a = load '/user/pig/tests/data/singlefile/studenttab10k'; > b = group a by $1; > c = foreach b generate group, AVG(a.$2); > d = limit c 10; > dump d; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-916) Change the pig hbase interface to get more than one row at a time when scanning
[ https://issues.apache.org/jira/browse/PIG-916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-916: --- Assignee: Dmitriy V. Ryaboy Dmitriy, isn't this fixed by your recent changes to HBaseStorage? > Change the pig hbase interface to get more than one row at a time when > scanning > --- > > Key: PIG-916 > URL: https://issues.apache.org/jira/browse/PIG-916 > Project: Pig > Issue Type: Improvement >Reporter: Alex Newman >Assignee: Dmitriy V. Ryaboy >Priority: Trivial > > It should be significantly faster to get numerous rows at the same time > rather than one row at a time for large table extraction processes. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1016) Allow map to take non-bytearray value types.
[ https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-1016: Assignee: Alan Gates Fix Version/s: 0.9.0 > Allow map to take non-bytearray value types. > > > Key: PIG-1016 > URL: https://issues.apache.org/jira/browse/PIG-1016 > Project: Pig > Issue Type: Improvement > Components: data >Affects Versions: 0.4.0 >Reporter: hc busy >Assignee: Alan Gates > Fix For: 0.9.0 > > > Hi, I'm trying to load a map that has a tuple for a value. The read fails in > 0.4.0 because of a misconfiguration in the parser, whereas almost all > documentation states that the value of a map can be any type. > I've attached a patch that allows us to read in complex objects as values as > documented. I've done simple verification of loading in maps with tuple/map > values and writing them back out using LOAD and STORE. All seems to work fine. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1076) Make PigOutputCommitter conform with new FileOutputCommitter in hadoop trunk
[ https://issues.apache.org/jira/browse/PIG-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913205#action_12913205 ] Alan Gates commented on PIG-1076: - Why did this get abandoned? > Make PigOutputCommitter conform with new FileOutputCommitter in hadoop trunk > --- > > Key: PIG-1076 > URL: https://issues.apache.org/jira/browse/PIG-1076 > Project: Pig > Issue Type: Improvement >Reporter: Pradeep Kamath >Assignee: Pradeep Kamath > Attachments: PIG-1076.patch > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1222) cast ends up with NULL value
[ https://issues.apache.org/jira/browse/PIG-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-1222: Assignee: Alan Gates Fix Version/s: 0.9.0 Priority: Minor (was: Major) > cast ends up with NULL value > > > Key: PIG-1222 > URL: https://issues.apache.org/jira/browse/PIG-1222 > Project: Pig > Issue Type: Bug >Reporter: Ying He >Assignee: Alan Gates >Priority: Minor > Fix For: 0.9.0 > > > I want to generate data with bags, so I did this, > take a simple text file b.txt > 100 apple > 200 orange > 300 pear > 400 apple > then run query: > a = load 'b.txt' as (id, f); > b = group a by id; > store b into 'g' using BinStorage(); > then run another query to load data generated from previous step. > a = load 'g/part*' using BinStorage() as (id, d:bag{t:(v, s)}); > b = foreach a generate (double)id, flatten(d); > dump b; > then I got the following result: > (,100,apple) > (,100,apple) > (,200,orange) > (,200,apple) > (,300,strawberry) > (,300,pear) > (,400,pear) > the value for id is gone. If there is no cast, then the result is correct. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1244) parameter syntax in scripts, add support for ${VAR} (in addition to current $VAR)
[ https://issues.apache.org/jira/browse/PIG-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-1244: Assignee: Xuefu Zhang Fix Version/s: 0.9.0 Priority: Minor (was: Major) > parameter syntax in scripts, add support for ${VAR} (in addition to current > $VAR) > - > > Key: PIG-1244 > URL: https://issues.apache.org/jira/browse/PIG-1244 > Project: Pig > Issue Type: Improvement > Components: impl > Environment: all >Reporter: Alejandro Abdelnur >Assignee: Xuefu Zhang >Priority: Minor > Fix For: 0.9.0 > > > Currently the parameter syntax in pig scripts is $VAR. > This complicates scripts as parameter-literal concatenation is not supported. > For example: > An occurrence of '$OUT_tmp' in a script resolves to a parameter 'OUT_tmp'; it > would be desirable for this to resolve to a concatenation of $OUT and '_tmp'. > This can be solved by supporting the parameter syntax ${VAR}, so the pig parser > can identify the end of the parameter name. > Adding support for the ${VAR} syntax in addition to $VAR would maintain backwards > compatibility. Changing to the ${VAR} syntax alone would break backwards > compatibility. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
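The ${VAR} form removes the ambiguity because the closing brace marks the end of the parameter name, while bare $VAR greedily consumes '_tmp' into the name. A hedged Java sketch of such a resolver; it is illustrative only, not Pig's actual parameter preprocessor, and the /data/out value is invented:

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParamSubstSketch {
    // ${NAME} is tried first, so the brace bounds the name;
    // bare $NAME falls through to the second alternative.
    static final Pattern PARAM = Pattern.compile("\\$\\{(\\w+)\\}|\\$(\\w+)");

    static String resolve(String script, Map<String, String> params) {
        Matcher m = PARAM.matcher(script);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String name = m.group(1) != null ? m.group(1) : m.group(2);
            // Unknown parameters are left verbatim in this sketch.
            String value = params.getOrDefault(name, m.group());
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = Map.of("OUT", "/data/out");
        // Braced form: OUT resolves, '_tmp' stays literal.
        System.out.println(resolve("store X into '${OUT}_tmp';", params));
        // Bare form: the whole of OUT_tmp is taken as the name.
        System.out.println(resolve("store X into '$OUT_tmp';", params));
    }
}
```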
[jira] Updated: (PIG-1297) algebraic interface of udf does not get used if the foreach with udf projects column within group
[ https://issues.apache.org/jira/browse/PIG-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-1297: Assignee: Thejas M Nair Fix Version/s: 0.9.0 > algebraic interface of udf does not get used if the foreach with udf projects > column within group > - > > Key: PIG-1297 > URL: https://issues.apache.org/jira/browse/PIG-1297 > Project: Pig > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Thejas M Nair >Assignee: Thejas M Nair > Fix For: 0.9.0 > > > grunt> l = load 'file' as (a,b,c); > grunt> g = group l by (a,b); > grunt> f = foreach g generate SUM(l.c), group.a; > grunt> explain f; > ... > ... > #-- > # Map Reduce Plan > #-- > MapReduce node 1-752 > Map Plan > Local Rearrange[tuple]{tuple}(false) - 1-742 > | | > | Project[bytearray][0] - 1-743 > | | > | Project[bytearray][1] - 1-744 > | > |---Load(file:///Users/tejas/pig/trunk/file:org.apache.pig.builtin.PigStorage) > - 1-739 > Reduce Plan > Store(fakefile:org.apache.pig.builtin.PigStorage) - 1-751 > | > |---New For Each(false,false)[bag] - 1-750 > | | > | POUserFunc(org.apache.pig.builtin.SUM)[double] - 1-747 > | | > | |---Project[bag][2] - 1-746 > | | > | |---Project[bag][1] - 1-745 > | | > | Project[bytearray][0] - 1-749 > | | > | |---Project[tuple][0] - 1-748 > | > |---Package[tuple]{tuple} - 1-741 > Global sort: false > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1637) Combiner not used because optimizer inserts a foreach between group and algebraic function
Combiner not used because optimizer inserts a foreach between group and algebraic function Key: PIG-1637 URL: https://issues.apache.org/jira/browse/PIG-1637 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 The following script does not use the combiner after the new optimization change. {code} A = load ':INPATH:/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links); B = foreach A generate user, (int)timespent as timespent, (double)estimated_revenue as estimated_revenue; C = group B all; D = foreach C generate SUM(B.timespent), AVG(B.estimated_revenue); store D into ':OUTPATH:'; {code} This is because, after the group, the optimizer detects that the group key is not used afterward and adds a foreach statement after C. This is how the script looks after optimization: {code} A = load ':INPATH:/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links); B = foreach A generate user, (int)timespent as timespent, (double)estimated_revenue as estimated_revenue; C = group B all; C1 = foreach C generate B; D = foreach C1 generate SUM(B.timespent), AVG(B.estimated_revenue); store D into ':OUTPATH:'; {code} That cancels the combiner optimization for D. The way to solve the issue is to merge the inserted C1 and D. Currently, we do not merge these two foreachs. The reason is that one output of the first foreach (B) is referred to twice in D, and the current rule assumes that after the merge we would need to calculate B twice in D. Actually, C1 only does projection, with no calculation of B, so merging C1 and D will not result in calculating B twice. Therefore C1 and D should be merged. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
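The merge condition argued for in PIG-1637 can be sketched as a small predicate. The names below are hypothetical, not Pig's optimizer API; the point is that duplicating a pure projection is free, while duplicating a computation is not:

```python
# Hedged sketch of a foreach-merge safety check (illustrative names only,
# not Pig's ForEach merge rule): a foreach output may be referenced more
# than once by the consumer as long as producing it involves no computation.
def safe_to_merge(first_foreach_exprs, consumer_reference_counts):
    """first_foreach_exprs: alias -> ('project', ...) or ('compute', ...).
    consumer_reference_counts: alias -> times referenced by the consumer."""
    for alias, expr in first_foreach_exprs.items():
        kind = expr[0]
        refs = consumer_reference_counts.get(alias, 0)
        # Duplicating a projection costs nothing; duplicating a computation does.
        if kind == "compute" and refs > 1:
            return False
    return True

# C1 = foreach C generate B;  -- projection only, referenced twice in D,
# so the merge is safe and the combiner stays applicable.
print(safe_to_merge({"B": ("project", "B")}, {"B": 2}))
```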
[jira] Created: (PIG-1636) Scalar fails if the scalar variable is generated by limit
Scalar fails if the scalar variable is generated by limit Key: PIG-1636 URL: https://issues.apache.org/jira/browse/PIG-1636 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 The following script fails: {code} a = load 'studenttab10k' as (name: chararray, age: int, gpa: float); b = group a all; c = foreach b generate SUM(a.age) as total; c1= limit c 1; d = foreach a generate name, age/(double)c1.total as d_sum; store d into '111'; {code} The problem is that we have a reference to c1 in d. In the optimizer, we push the limit before the foreach; d still references the limit, and we get the wrong schema for the scalar. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1337) Need a way to pass distributed cache configuration information to hadoop backend in Pig's LoadFunc
[ https://issues.apache.org/jira/browse/PIG-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913154#action_12913154 ] Alan Gates commented on PIG-1337: - The problem with allowing load and store functions access to the config file is that the config file they see is not the config file that goes to Hadoop. This is not all Pig's fault (see comments above on this). The other problem is that multiple instances of the same load and store function may be operating in a given script, so there are namespace issues to resolve. The proposal for Hadoop 0.22 is that, rather than providing access to the config file at all, Hadoop will serialize objects such as InputFormat and OutputFormat and pass those to the backend. It will make sense for Pig to follow suit and serialize all UDFs on the front end. This will remove the need for the UDFContext black magic that we do at the moment and should allow all UDFs to easily transfer information from front end to backend. So, hopefully this can get resolved when Pig migrates to Hadoop 0.22, whenever that is. > Need a way to pass distributed cache configuration information to hadoop > backend in Pig's LoadFunc > -- > > Key: PIG-1337 > URL: https://issues.apache.org/jira/browse/PIG-1337 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.6.0 >Reporter: Chao Wang > > The Zebra storage layer needs to use the distributed cache to reduce name node > load during job runs. > To do this, Zebra needs to set up distributed cache-related configuration > information in TableLoader (which extends Pig's LoadFunc). > It is doing this within getSchema(conf). The problem is that the conf object > here is not the one that is serialized to the map/reduce backend. As such, > the distributed cache is not set up properly. 
> To work around this problem, Pig needs to provide in its LoadFunc a way that > we can use to set up distributed cache information in a conf object, where this > conf object is the one used by the map/reduce backend. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
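Alan's serialize-the-UDFs proposal can be sketched language-agnostically. Python's pickle stands in for Java serialization here, and `TableLoader` below is a toy stand-in rather than Zebra's actual class; the point is that state configured on the front end travels with the serialized instance, so no shared conf object is needed:

```python
import pickle

class TableLoader:
    """Toy stand-in for a loader that must carry front-end configuration
    (e.g. distributed-cache file lists) to the backend."""
    def __init__(self):
        self.cache_files = []

    def setup_frontend(self, files):
        # Configuration that exists only on the front end and must
        # survive the trip to the map/reduce backend.
        self.cache_files = list(files)

loader = TableLoader()
loader.setup_frontend(["/schemas/table1.meta"])
shipped = pickle.dumps(loader)           # what the front end ships with the job
backend_loader = pickle.loads(shipped)   # what the backend reconstructs
print(backend_loader.cache_files)        # front-end state arrives intact
```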
[jira] Updated: (PIG-1371) Pig should handle deep casting of complex types
[ https://issues.apache.org/jira/browse/PIG-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-1371: Fix Version/s: 0.9.0 > Pig should handle deep casting of complex types > > > Key: PIG-1371 > URL: https://issues.apache.org/jira/browse/PIG-1371 > Project: Pig > Issue Type: Bug >Reporter: Pradeep Kamath >Assignee: Alan Gates > Fix For: 0.9.0 > > Attachments: PIG-1371-partial.patch > > > Consider input data in BinStorage format which has a field of bag type - > bg:{t:(i:int)}. In the load statement, if the schema specified has the type > for this field specified as bg:{t:(c:chararray)}, the current behavior is > that Pig thinks of the field as being of the type specified in the load statement > (bg:{t:(c:chararray)}), but no deep cast from bag of int (the real data) to > bag of chararray (the user-specified schema) is made. > There are two issues currently: > 1) The TypeCastInserter only compares the byte 'type' between the loader-presented > schema and the user-specified schema to decide whether to introduce a > cast or not. In the above case, since both schemas have the type "bag", no cast > is inserted. This check has to be extended to consider the full FieldSchema > (with inner subschema) in order to decide whether a cast is needed. > 2) POCast should be changed to handle casting a complex type to the type > specified in the user-supplied FieldSchema. Here there is one issue to be > considered - if the user specified the cast type to be bg:{t:(i:int, j:int)} > and the real data had only one field, what should the result of the cast be: > * A bag with two fields - the int field and a null? - In this approach pig > is assuming the lone field in the data is the first field, which might be > incorrect if it in fact is the second field. > * A null bag to indicate that the bag is of unknown value - this is the one > I personally prefer > * The cast throws an IncompatibleCastException -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1339) International characters in column names not supported
[ https://issues.apache.org/jira/browse/PIG-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1339: Fix Version/s: 0.9.0 We should see if the new parser makes this easier and if so fix it. > International characters in column names not supported > -- > > Key: PIG-1339 > URL: https://issues.apache.org/jira/browse/PIG-1339 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.6.0, 0.7.0, 0.8.0 >Reporter: Viraj Bhat > Fix For: 0.9.0 > > > There is a particular use-case in which someone specifies a column name to be > in International characters. > {code} > inputdata = load '/user/viraj/inputdata.txt' using PigStorage() as (あいうえお); > describe inputdata; > dump inputdata; > {code} > == > Pig Stack Trace > --- > ERROR 1000: Error during parsing. Lexical error at line 1, column 64. > Encountered: "\u3042" (12354), after : "" > org.apache.pig.impl.logicalLayer.parser.TokenMgrError: Lexical error at line > 1, column 64. Encountered: "\u3042" (12354), after : "" > at > org.apache.pig.impl.logicalLayer.parser.QueryParserTokenManager.getNextToken(QueryParserTokenManager.java:1791) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_scan_token(QueryParser.java:8959) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_51(QueryParser.java:7462) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_120(QueryParser.java:7769) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_106(QueryParser.java:7787) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_63(QueryParser.java:8609) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_32(QueryParser.java:8621) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3_4(QueryParser.java:8354) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_2_4(QueryParser.java:6903) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1249) > at > 
org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700) > at > org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63) > at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164) > at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114) > at org.apache.pig.PigServer.registerQuery(PigServer.java:425) > at > org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737) > at > org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138) > at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89) > at org.apache.pig.Main.main(Main.java:391) > == > Thanks Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1339) International characters in column names not supported
[ https://issues.apache.org/jira/browse/PIG-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-1339: Assignee: Xuefu Zhang > International characters in column names not supported > -- > > Key: PIG-1339 > URL: https://issues.apache.org/jira/browse/PIG-1339 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.6.0, 0.7.0, 0.8.0 >Reporter: Viraj Bhat >Assignee: Xuefu Zhang > Fix For: 0.9.0 > > > There is a particular use-case in which someone specifies a column name to be > in International characters. > {code} > inputdata = load '/user/viraj/inputdata.txt' using PigStorage() as (あいうえお); > describe inputdata; > dump inputdata; > {code} > == > Pig Stack Trace > --- > ERROR 1000: Error during parsing. Lexical error at line 1, column 64. > Encountered: "\u3042" (12354), after : "" > org.apache.pig.impl.logicalLayer.parser.TokenMgrError: Lexical error at line > 1, column 64. Encountered: "\u3042" (12354), after : "" > at > org.apache.pig.impl.logicalLayer.parser.QueryParserTokenManager.getNextToken(QueryParserTokenManager.java:1791) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_scan_token(QueryParser.java:8959) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_51(QueryParser.java:7462) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_120(QueryParser.java:7769) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_106(QueryParser.java:7787) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_63(QueryParser.java:8609) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_32(QueryParser.java:8621) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3_4(QueryParser.java:8354) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_2_4(QueryParser.java:6903) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1249) > at > 
org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700) > at > org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63) > at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164) > at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114) > at org.apache.pig.PigServer.registerQuery(PigServer.java:425) > at > org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737) > at > org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138) > at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89) > at org.apache.pig.Main.main(Main.java:391) > == > Thanks Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-1412) Make Pig OwlLoader work with remote HDFS in secure mode
[ https://issues.apache.org/jira/browse/PIG-1412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates resolved PIG-1412. - Resolution: Won't Fix Owl is dead, thus there is no need to fix OwlLoader. > Make Pig OwlLoader work with remote HDFS in secure mode > --- > > Key: PIG-1412 > URL: https://issues.apache.org/jira/browse/PIG-1412 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.7.0 >Reporter: Daniel Dai > > PIG-1403 does not address the case in which a LoadFunc does not expose the hdfs URL to > Pig. One major use case is OwlLoader. We need to change OwlLoader to add the > remote namenode to JobConf. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1429) Add Boolean Data Type to Pig
[ https://issues.apache.org/jira/browse/PIG-1429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-1429: Fix Version/s: 0.9.0 > Add Boolean Data Type to Pig > > > Key: PIG-1429 > URL: https://issues.apache.org/jira/browse/PIG-1429 > Project: Pig > Issue Type: New Feature > Components: data >Affects Versions: 0.7.0 >Reporter: Russell Jurney >Assignee: Russell Jurney > Fix For: 0.9.0 > > Attachments: working_boolean.patch > > Original Estimate: 8h > Remaining Estimate: 8h > > Pig needs a Boolean data type. Pig-1097 is dependent on doing this. > I volunteer. Is there anything beyond the work in src/org/apache/pig/data/ > plus unit tests to make this work? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1479) Embed Pig in scripting languages
[ https://issues.apache.org/jira/browse/PIG-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-1479: Assignee: Richard Ding Fix Version/s: 0.9.0 > Embed Pig in scripting languages > > > Key: PIG-1479 > URL: https://issues.apache.org/jira/browse/PIG-1479 > Project: Pig > Issue Type: New Feature >Reporter: Julien Le Dem >Assignee: Richard Ding > Fix For: 0.9.0 > > Attachments: PIG-1479.patch, PIG-1479_2.patch, pig-greek-test.tar, > pig-greek-test.tar, pig-greek.tgz > > > It should be possible to embed Pig calls in a scripting language and make > functions defined in the same script available as UDFs. > This is a spin-off of https://issues.apache.org/jira/browse/PIG-928, which > lets users define UDFs in scripting languages. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1491) Failure planning nested FOREACH with DISTINCT, POLoad cannot be cast to POLocalRearrange
[ https://issues.apache.org/jira/browse/PIG-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-1491: Fix Version/s: 0.9.0 > Failure planning nested FOREACH with DISTINCT, POLoad cannot be cast to > POLocalRearrange > > > Key: PIG-1491 > URL: https://issues.apache.org/jira/browse/PIG-1491 > Project: Pig > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Scott Carey > Fix For: 0.9.0 > > > I have a failure that occurs during planning while using DISTINCT in a nested > FOREACH. > Caused by: java.lang.ClassCastException: > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad > cannot be cast to > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SecondaryKeyOptimizer.visitMROp(SecondaryKeyOptimizer.java:352) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper.visit(MapReduceOper.java:218) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper.visit(MapReduceOper.java:40) > at > org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:67) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1545) Secondary alias gives problems when it has an alias in the group by statement.
[ https://issues.apache.org/jira/browse/PIG-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-1545: Assignee: Xuefu Zhang Fix Version/s: 0.9.0 This is a parser issue. If you do the same operation in the generate clause, it works. > Secondary alias gives problems when it has an alias in the group by statement. > --- > > Key: PIG-1545 > URL: https://issues.apache.org/jira/browse/PIG-1545 > Project: Pig > Issue Type: Bug >Reporter: niraj rai >Assignee: Xuefu Zhang > Fix For: 0.9.0 > > > When I run the following script, I get the error: Could not open iterator for > C. > A = LOAD '/tmp' as (a:int, b:chararray, c:int); > B = GROUP A BY (a, b); > C = FOREACH B { bg = A.(b,c); GENERATE group, bg; } ; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1605) Adding soft link to plan to solve input file dependency
[ https://issues.apache.org/jira/browse/PIG-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1605: Attachment: PIG-1605-1.patch > Adding soft link to plan to solve input file dependency > --- > > Key: PIG-1605 > URL: https://issues.apache.org/jira/browse/PIG-1605 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: PIG-1605-1.patch > > > In the scalar implementation, we need to deal with implicit dependencies. > [PIG-1603|https://issues.apache.org/jira/browse/PIG-1603] tries to solve > the problem by adding a LOScalar operator. Here is a different approach. We > will add a soft link to the plan, and the soft link is visible only to the > walkers. By doing this, we can make sure we visit the LOStore which generates the > scalar first, and then the LOForEach which uses the scalar. All other parts of the > logical plan do not know of the existence of the soft link. The benefits are: > 1. The logical plan does not need to deal with LOScalar, which makes the logical plan > cleaner. > 2. Conceptually, a scalar dependency is different. A regular link represents a data > flow in the pipeline. For a scalar, the dependency means an operator depends on a > file generated by another operator. It's a different type of data dependency. > 3. Soft links can solve other dependency problems in the future. If we > introduce another UDF dependent on a file generated by another operator, we > can use this mechanism to solve it. > 4. With soft links, we can use scalars coming from different sources in the same > statement, which in my mind is not a rare use case. (eg: D = foreach C > generate c0/A.total, c1/B.count; ) > Currently, there are two cases where we can use a soft link: > 1. scalar dependency, where the ReadScalar UDF will use a file generated by an > LOStore > 2. store-load dependency, where we will load a file which is generated by a > store in the same script. This happens in the multi-store case. Currently we > solve it with a regular link; it is better to use a soft link. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
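The soft-link mechanism described in PIG-1605 can be sketched as a graph with two edge sets: one visible to everyone and one visible only to walkers. The class below is illustrative, not Pig's LogicalPlan API:

```python
class Plan:
    """Toy plan with hard (dataflow) edges plus soft (file-dependency)
    edges that only walkers can see; illustrative, not Pig's API."""
    def __init__(self):
        self.hard = {}
        self.soft = {}
    def connect(self, src, dst):
        self.hard.setdefault(src, []).append(dst)
    def create_soft_link(self, src, dst):
        self.soft.setdefault(src, []).append(dst)
    def successors(self, node):
        # What ordinary plan code sees: hard links only.
        return list(self.hard.get(node, []))
    def walker_successors(self, node):
        # What walkers see: hard links plus soft links.
        return list(self.hard.get(node, [])) + list(self.soft.get(node, []))

p = Plan()
p.connect("LOForEach(c)", "LOStore(c)")
# d's scalar expression reads the file that LOStore(c) writes:
p.create_soft_link("LOStore(c)", "LOForEach(d)")
print(p.successors("LOStore(c)"))         # [] -- invisible to the plan
print(p.walker_successors("LOStore(c)"))  # ['LOForEach(d)'] -- walker visits the store first
```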
[jira] Updated: (PIG-1554) PERF: create accumulative bag in RelationToExpressionProject
[ https://issues.apache.org/jira/browse/PIG-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-1554: Assignee: Thejas M Nair Fix Version/s: 0.9.0 > PERF: create accumulative bag in RelationToExpressionProject > > > Key: PIG-1554 > URL: https://issues.apache.org/jira/browse/PIG-1554 > Project: Pig > Issue Type: Improvement >Reporter: Thejas M Nair >Assignee: Thejas M Nair > Fix For: 0.9.0 > > > In nested-foreach, RelationToExpressionProject creates a DefaultDataBag out > of the results of PODistinct and POSort. If the results of the plan are > going to be consumed by an operation that supports the accumulative interface, such > as COUNT, the results can be linked to a new accumulative bag. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1576) Difference in Semantics between Load statement in Pig and HDFS client on Command line
[ https://issues.apache.org/jira/browse/PIG-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-1576: Fix Version/s: 0.9.0 > Difference in Semantics between Load statement in Pig and HDFS client on > Command line > - > > Key: PIG-1576 > URL: https://issues.apache.org/jira/browse/PIG-1576 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.6.0, 0.7.0 >Reporter: Viraj Bhat > Fix For: 0.9.0 > > > Here is my directory structure on HDFS which I want to access using Pig. > This is a sample, but in the real use case I have more than 100 of these > directories. > {code} > $ hadoop fs -ls /user/viraj/recursive/ > Found 3 items > drwxr-xr-x - viraj supergroup 0 2010-08-26 11:25 > /user/viraj/recursive/20080615 > drwxr-xr-x - viraj supergroup 0 2010-08-26 11:25 > /user/viraj/recursive/20080616 > drwxr-xr-x - viraj supergroup 0 2010-08-26 11:25 > /user/viraj/recursive/20080617 > {code} > Using the command line, I can access them using a variety of options: > {code} > $ hadoop fs -ls /user/viraj/recursive/{200806}{15..17}/ > -rw-r--r-- 1 viraj supergroup 5791 2010-08-26 11:25 > /user/viraj/recursive/20080615/kv2.txt > -rw-r--r-- 1 viraj supergroup 5791 2010-08-26 11:25 > /user/viraj/recursive/20080616/kv2.txt > -rw-r--r-- 1 viraj supergroup 5791 2010-08-26 11:25 > /user/viraj/recursive/20080617/kv2.txt > $ hadoop fs -ls /user/viraj/recursive/{20080615..20080617}/ > -rw-r--r-- 1 viraj supergroup 5791 2010-08-26 11:25 > /user/viraj/recursive/20080615/kv2.txt > -rw-r--r-- 1 viraj supergroup 5791 2010-08-26 11:25 > /user/viraj/recursive/20080616/kv2.txt > -rw-r--r-- 1 viraj supergroup 5791 2010-08-26 11:25 > /user/viraj/recursive/20080617/kv2.txt > {code} > I have written a Pig script; none of the below combinations of load statements > work: 
> {code} > --A = load '/user/viraj/recursive/{200806}{15..17}/' using > PigStorage('\u0001') as (k:int, v:chararray); > A = load '/user/viraj/recursive/{20080615..20080617}/' using > PigStorage('\u0001') as (k:int, v:chararray); > AL = limit A 10; > dump AL; > {code} > I get the following error in Pig 0.8 > {noformat} > 2010-08-27 16:34:27,704 [main] ERROR > org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed! > 2010-08-27 16:34:27,711 [main] INFO org.apache.pig.tools.pigstats.PigStats - > Script Statistics: > HadoopVersion PigVersion UserId StartedAt FinishedAt > Features > 0.20.2 0.8.0-SNAPSHOT viraj 2010-08-27 16:34:24 2010-08-27 16:34:27 > LIMIT > Failed! > Failed Jobs: > JobId Alias Feature Message Outputs > N/A A,ALMessage: > org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to > create input splits for: /user/viraj/recursive/{20080615..20080617}/ > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:279) > at > org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885) > at > org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779) > at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730) > at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378) > at > org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247) > at > org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279) > at java.lang.Thread.run(Thread.java:619) > Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input > Pattern hdfs://localhost:9000/user/viraj/recursive/{20080615..20080617} > matches 0 files > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:224) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36) > at > 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:268) > ... 7 more > hdfs://localhost:9000/tmp/temp241388470/tmp987803889, > {noformat} > The following works: > {code} > A = load '/user/viraj/recursive/{200806}{15,16,17}/' using > PigStorage('\u0001') as (k:int, v:chararray); > AL = limit A 10; > dump AL; > {code} > Why is there an inconsistency between HDFS client and Pig? > Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
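One plausible explanation for the inconsistency in PIG-1576 (an assumption worth verifying against Hadoop's glob implementation, not something stated in the report): on the command line the shell expands bash-style {m..n} ranges before hadoop ever sees the pattern, whereas Pig hands the raw string to Hadoop's glob matcher, which supports only {a,b,c} alternation. A sketch of such alternation-only expansion:

```python
import re

def expand_alternations(pattern):
    """Expand {a,b,c} alternation groups; leave {m..n} ranges (no comma)
    untouched, mirroring a glob engine without range support.
    Toy sketch, not Hadoop's actual GlobPattern code."""
    m = re.search(r"\{([^{}]*,[^{}]*)\}", pattern)
    if m is None:
        return [pattern]
    head, tail = pattern[:m.start()], pattern[m.end():]
    return [head + choice + rest
            for choice in m.group(1).split(",")
            for rest in expand_alternations(tail)]

# Alternation expands to three concrete paths, so this form matches files:
print(expand_alternations("/user/viraj/recursive/200806{15,16,17}/"))
# The range form has no comma, survives verbatim, and then matches 0 files --
# consistent with the ERROR 2118 "matches 0 files" in the report:
print(expand_alternations("/user/viraj/recursive/{20080615..20080617}/"))
```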
[jira] Updated: (PIG-1581) Parser fails to recognize semicolons in quoted strings
[ https://issues.apache.org/jira/browse/PIG-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-1581: Assignee: Xuefu Zhang Fix Version/s: 0.9.0 > Parser fails to recognize semicolons in quoted strings > -- > > Key: PIG-1581 > URL: https://issues.apache.org/jira/browse/PIG-1581 > Project: Pig > Issue Type: Bug > Components: grunt >Affects Versions: 0.7.0 > Environment: CentOS 5.5 >Reporter: Christopher Hackman >Assignee: Xuefu Zhang >Priority: Minor > Fix For: 0.9.0 > > > Within some contexts, the parser fails to treat semicolons correctly, and > sees them as an EOL. > Given an input file: > /test1.txt (in the hdfs) > 1;a > 2;b > 3;c > 4;d > 5;e > And the following Pig script: > REGISTER /tmp/piggybank.jar ; > DEFINE REGEXEXTRACTALL > org.apache.pig.piggybank.evaluation.string.RegexExtractAll(); > lines = LOAD '/test1.txt' AS (line:chararray); > delimited = FOREACH lines GENERATE FLATTEN ( > REGEXEXTRACTALL(line, '^(\\d+);(\\w+)$') > ) AS ( > digit:int, > word:chararray > ); > DUMP delimited; > I receive the following error: > ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. > Lexical error at line 5, column 40. Encountered: after : "\'^(d+);" -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
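The fix PIG-1581 asks for amounts to quote-aware statement splitting: a ';' inside a string literal must not terminate the statement. The splitter below is an illustrative sketch, not Grunt's actual lexer:

```python
# Hedged sketch: track quoting state while scanning so semicolons inside
# quoted strings (like the regex '^(\d+);(\w+)$') stay in their statement.
def split_statements(script):
    statements, buf, in_quote = [], [], None
    i = 0
    while i < len(script):
        ch = script[i]
        if in_quote:
            buf.append(ch)
            if ch == "\\" and i + 1 < len(script):  # keep escaped chars intact
                buf.append(script[i + 1])
                i += 1
            elif ch == in_quote:                    # closing quote
                in_quote = None
        elif ch in ("'", '"'):
            in_quote = ch
            buf.append(ch)
        elif ch == ";":                             # terminator only outside quotes
            statements.append("".join(buf).strip())
            buf = []
        else:
            buf.append(ch)
        i += 1
    if "".join(buf).strip():
        statements.append("".join(buf).strip())
    return statements

# The regex literal from the bug report stays inside one statement:
print(split_statements(r"x = REGEXEXTRACTALL(line, '^(\d+);(\w+)$'); DUMP x;"))
```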
[jira] Commented: (PIG-1633) Using an alias within Nested Foreach causes indeterminate behaviour
[ https://issues.apache.org/jira/browse/PIG-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913120#action_12913120 ] Alan Gates commented on PIG-1633: - This is a design decision we made when implementing nested foreach. Each expression in the generate list has its own pipeline. This had the advantage that it was easy to implement. The disadvantage is that it invokes certain operators (like your random function) multiple times, which is inefficient performance-wise. In the case of indeterminate functions it also produces strange results. We could not think of any use cases where users would have indeterminate functions, so we did not worry about that too much. If you have a real use case, we would be interested. > Using an alias within Nested Foreach causes indeterminate behaviour > > > Key: PIG-1633 > URL: https://issues.apache.org/jira/browse/PIG-1633 > Project: Pig > Issue Type: Bug >Affects Versions: 0.4.0, 0.5.0, 0.6.0, 0.7.0 >Reporter: Viraj Bhat > > I have created a RANDOMINT function which generates random numbers between 0 > and the specified value. For example, RANDOMINT(4) gives random numbers between 0 > and 3 (inclusive). > {code} > $hadoop fs -cat rand.dat > f > g > h > i > j > k > l > m > {code} > The pig script is as follows: > {code} > register math.jar; > A = load 'rand.dat' using PigStorage() as (data); > B = foreach A { > r = math.RANDOMINT(4); > generate > data, > r as random, > ((r == 3)?1:0) as quarter; > }; > dump B; > {code} > The results are as follows: > {code} > {color:red} > (f,0,0) > (g,3,0) > (h,0,0) > (i,2,0) > (j,3,0) > (k,2,0) > (l,0,1) > (m,1,0) > {color} > {code} > If you observe, (j,3,0) is created because r is used in both the foreach and > generate clauses and generates different values. > Modifying the above script as below solves the issue. The M/R jobs from both > scripts are the same. It is just a matter of convenience. 
> {code} > A = load 'rand.dat' using PigStorage() as (data); > B = foreach A generate > data, > math.RANDOMINT(4) as r; > C = foreach B generate > data, > r, > ((r == 3)?1:0) as quarter; > dump C; > {code} > Is this issue related to PIG:747? > Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
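The per-expression-pipeline semantics Alan describes can be mimicked in a few lines. This is a behavioral sketch, not Pig's execution code: the "broken" version evaluates the non-deterministic function once per generate expression, while the two-foreach rewrite evaluates it once per row:

```python
import random

def nested_foreach_broken(rows, rng):
    """Each generate expression has its own pipeline, so the alias r
    is re-evaluated for each expression that references it."""
    out = []
    for row in rows:
        r_for_random = rng.randrange(4)   # pipeline for the 'random' column
        r_for_quarter = rng.randrange(4)  # separate pipeline for 'quarter'
        out.append((row, r_for_random, 1 if r_for_quarter == 3 else 0))
    return out

def two_foreach_fixed(rows, rng):
    """Computing r in its own foreach first makes both columns agree."""
    out = []
    for row in rows:
        r = rng.randrange(4)              # evaluated exactly once per row
        out.append((row, r, 1 if r == 3 else 0))
    return out

fixed = two_foreach_fixed(["f", "g", "h"], random.Random(0))
# In the rewritten script the flag always agrees with the value it came from.
assert all(q == (1 if r == 3 else 0) for _, r, q in fixed)
```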
[jira] Commented: (PIG-1634) Multiple names for the "group" field
[ https://issues.apache.org/jira/browse/PIG-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913115#action_12913115 ] Alan Gates commented on PIG-1634: - In Pig's semantics c.group, c.foo, and c.bar are all separate columns, and only the first one is $0. Because the bags from the cogroup contain all columns in the row (not just non-key columns), foo is in a and bar in b. Changing something like this would be a radical shift in Pig semantics. > Multiple names for the "group" field > > > Key: PIG-1634 > URL: https://issues.apache.org/jira/browse/PIG-1634 > Project: Pig > Issue Type: New Feature >Affects Versions: 0.1.0, 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0 >Reporter: Viraj Bhat > > I am hoping that in Pig if I type > {quote} c = cogroup a by foo, b by bar, the fields c.group, c.foo and c.bar > should all map to c.$0 {quote} > This would improve the readability of the Pig script. > Here's a real use case: > {code} > --- > pages = LOAD 'pages.dat' AS (url, pagerank); > visits = LOAD 'user_log.dat' AS (user_id, url); > page_visits = COGROUP pages BY url, visits BY url; > frequent_visits = FILTER page_visits BY COUNT(visits) >= 2; > answer = FOREACH frequent_visits GENERATE url, FLATTEN(pages.pagerank); > --- > {code} > (The important part is the final GENERATE statement, which references the > field "url", which was the grouping field in the earlier COGROUP.) To get it > to work I have to write it in a less intuitive way. > Maybe with the new parser changes in Pig 0.9 it would be easier to specify > that. > Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed
[ https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1635: Fix Version/s: 0.8.0 > Logical simplifier does not simplify away constants under AND and OR; after > simplification the ordering of operands of AND and OR may get changed > > > Key: PIG-1635 > URL: https://issues.apache.org/jira/browse/PIG-1635 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.8.0 >Reporter: Yan Zhou >Assignee: Yan Zhou >Priority: Minor > Fix For: 0.8.0 > > > b = FILTER a by ((f1 > 1) AND (1 == 1)) > or > b = FILTER a by ((f1 > 1) OR (1 == 0)) > should be simplified to > b = FILTER a by f1 > 1; > Regarding the ordering change, an example is that > b = filter a by ((f1 is not null) AND (f2 is not null)); > Even without any possible simplification, the expression is changed to > b = filter a by ((f2 is not null) AND (f1 is not null)); > Although the ordering change in this case, and probably in most other > cases, makes no difference, users might care about the ordering for two > reasons: stateful UDFs may be used as operands of AND or OR, and the ordering > may be intended by the application designer to maximize the chance of > short-circuiting the composite boolean evaluation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
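The constant-folding behavior requested in PIG-1635 can be sketched in Python. This is a hypothetical illustration of the desired rewrite rules, not Pig's actual logical simplifier; the tuple-based expression encoding is invented for the example. Note that the rules never swap the left and right operands, which addresses the second half of the report.

```python
def simplify(expr):
    """Fold boolean constants under AND/OR, preserving operand order.

    expr is either ("and", l, r), ("or", l, r), the constants True/False,
    or an opaque leaf (here, a string standing in for a predicate).
    """
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr
    left, right = simplify(left), simplify(right)
    if op == "and":
        if left is True:
            return right          # (TRUE AND x)  -> x
        if right is True:
            return left           # (x AND TRUE)  -> x
        if left is False or right is False:
            return False          # anything AND FALSE -> FALSE
    elif op == "or":
        if left is False:
            return right          # (FALSE OR x)  -> x
        if right is False:
            return left           # (x OR FALSE)  -> x
        if left is True or right is True:
            return True           # anything OR TRUE -> TRUE
    return (op, left, right)      # left/right are never swapped

# (f1 > 1) AND (1 == 1)  ->  f1 > 1
assert simplify(("and", "f1 > 1", True)) == "f1 > 1"
# (f1 > 1) OR (1 == 0)   ->  f1 > 1
assert simplify(("or", "f1 > 1", False)) == "f1 > 1"
# When nothing folds, operand order is preserved.
assert simplify(("and", "f1 is not null", "f2 is not null")) == \
       ("and", "f1 is not null", "f2 is not null")
```

The short-circuit concern in the report is exactly why order preservation matters: a user may deliberately place the cheap or most-selective predicate first.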
[jira] Commented: (PIG-1531) Pig gobbles up error messages
[ https://issues.apache.org/jira/browse/PIG-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913048#action_12913048 ] Ashutosh Chauhan commented on PIG-1531: --- Oh Hudson, oh well... Ran the full suite of 400 minutes of unit tests; all passed. Patch is ready for review. > Pig gobbles up error messages > - > > Key: PIG-1531 > URL: https://issues.apache.org/jira/browse/PIG-1531 > Project: Pig > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Ashutosh Chauhan >Assignee: niraj rai > Fix For: 0.8.0 > > Attachments: pig-1531_3.patch, pig-1531_4.patch, PIG_1531.patch, > PIG_1531_2.patch > > > Consider the following. I have my own Storer implementing StoreFunc and I am > throwing FrontEndException (and other Exceptions derived from PigException) > in its various methods. I expect those error messages to be shown in error > scenarios. Instead Pig gobbles up my error messages and shows its own generic > error message like: > {code} > 2010-07-31 14:14:25,414 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR > 2116: Unexpected error. Could not validate the output specification for: > default.partitoned > Details at logfile: /Users/ashutosh/workspace/pig/pig_1280610650690.log > {code} > Instead I expect it to display my error messages which it stores away in that > log file. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
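The symptom described in PIG-1531, a generic wrapper message hiding the storer's own error text, can be sketched in Python. The class and function names here are invented for illustration and do not reflect Pig's actual code; the sketch only shows how surfacing the wrapped exception's message fixes the complaint.

```python
class FrontendException(Exception):
    """Stand-in for a StoreFunc raising a descriptive, user-authored error."""
    pass

def validate_output_spec(location):
    # A custom storer rejecting a location with a specific, useful message.
    raise FrontendException(
        f"Output location {location} is not valid for this storer")

def run_gobbling(location):
    """What the report complains about: the cause is silently dropped."""
    try:
        validate_output_spec(location)
    except FrontendException:
        return ("ERROR 2116: Unexpected error. Could not validate the "
                f"output specification for: {location}")

def run_fixed(location):
    """The expected behavior: the underlying message is surfaced."""
    try:
        validate_output_spec(location)
    except FrontendException as e:
        return ("ERROR 2116: Could not validate the output specification "
                f"for: {location}: {e}")
```

The generic variant forces the user to dig through the log file for the real cause; the fixed variant puts the storer's message on the console where the report expects it.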
[jira] Commented: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed
[ https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913036#action_12913036 ] Yan Zhou commented on PIG-1635: --- This is regarding a new feature (PIG-1399) added for 0.8. > Logical simplifier does not simplify away constants under AND and OR; after > simplification the ordering of operands of AND and OR may get changed > > > Key: PIG-1635 > URL: https://issues.apache.org/jira/browse/PIG-1635 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.8.0 >Reporter: Yan Zhou >Assignee: Yan Zhou >Priority: Minor > > b = FILTER a by ((f1 > 1) AND (1 == 1)) > or > b = FILTER a by ((f1 > 1) OR (1 == 0)) > should be simplified to > b = FILTER a by f1 > 1; > Regarding the ordering change, an example is that > b = filter a by ((f1 is not null) AND (f2 is not null)); > Even without any possible simplification, the expression is changed to > b = filter a by ((f2 is not null) AND (f1 is not null)); > Although the ordering change in this case, and probably in most other > cases, makes no difference, users might care about the ordering for two > reasons: stateful UDFs may be used as operands of AND or OR, and the ordering > may be intended by the application designer to maximize the chance of > short-circuiting the composite boolean evaluation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed
[ https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1635: -- Affects Version/s: 0.8.0 > Logical simplifier does not simplify away constants under AND and OR; after > simplification the ordering of operands of AND and OR may get changed > > > Key: PIG-1635 > URL: https://issues.apache.org/jira/browse/PIG-1635 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.8.0 >Reporter: Yan Zhou >Assignee: Yan Zhou >Priority: Minor > > b = FILTER a by ((f1 > 1) AND (1 == 1)) > or > b = FILTER a by ((f1 > 1) OR (1 == 0)) > should be simplified to > b = FILTER a by f1 > 1; > Regarding the ordering change, an example is that > b = filter a by ((f1 is not null) AND (f2 is not null)); > Even without any possible simplification, the expression is changed to > b = filter a by ((f2 is not null) AND (f1 is not null)); > Although the ordering change in this case, and probably in most other > cases, makes no difference, users might care about the ordering for two > reasons: stateful UDFs may be used as operands of AND or OR, and the ordering > may be intended by the application designer to maximize the chance of > short-circuiting the composite boolean evaluation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed
Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed Key: PIG-1635 URL: https://issues.apache.org/jira/browse/PIG-1635 Project: Pig Issue Type: Bug Components: impl Reporter: Yan Zhou Assignee: Yan Zhou Priority: Minor b = FILTER a by ((f1 > 1) AND (1 == 1)) or b = FILTER a by ((f1 > 1) OR (1 == 0)) should be simplified to b = FILTER a by f1 > 1; Regarding the ordering change, an example is that b = filter a by ((f1 is not null) AND (f2 is not null)); Even without any possible simplification, the expression is changed to b = filter a by ((f2 is not null) AND (f1 is not null)); Although the ordering change in this case, and probably in most other cases, makes no difference, users might care about the ordering for two reasons: stateful UDFs may be used as operands of AND or OR, and the ordering may be intended by the application designer to maximize the chance of short-circuiting the composite boolean evaluation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1628) log this message at debug level : 'Pig Internal storage in use'
[ https://issues.apache.org/jira/browse/PIG-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913029#action_12913029 ] Yan Zhou commented on PIG-1628: --- +1. Patch looks good. > log this message at debug level : 'Pig Internal storage in use' > --- > > Key: PIG-1628 > URL: https://issues.apache.org/jira/browse/PIG-1628 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: Thejas M Nair >Assignee: Thejas M Nair > Fix For: 0.8.0 > > Attachments: PIG-1628.1.patch > > > The temporary storage functions used are logging at the INFO level. This > should change to the debug level, as these messages reduce the visibility of more > useful INFO messages. The messages include 'Pig Internal storage in use' from > InterStorage and 'TFile storage in use' from TFileStorage. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
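The intent of the PIG-1628 fix, demoting these messages so the default threshold hides them, can be shown with Python's standard logging module. Pig itself uses Java logging, so this is only an analogy; the logger name and messages follow the report.

```python
import io
import logging

# Capture log output in a string buffer with the common default
# threshold of INFO (requires Python 3.8+ for force=True).
stream = io.StringIO()
logging.basicConfig(stream=stream, level=logging.INFO, force=True)
log = logging.getLogger("InterStorage")

log.debug("Pig Internal storage in use")   # suppressed at the INFO threshold
log.info("1 map-reduce job completed")     # still visible to the user

output = stream.getvalue()
```

After the demotion, routine internal-storage chatter disappears from the console unless the user opts into DEBUG, which is exactly the visibility improvement the issue asks for.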
[jira] Updated: (PIG-1628) log this message at debug level : 'Pig Internal storage in use'
[ https://issues.apache.org/jira/browse/PIG-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thejas M Nair updated PIG-1628: --- Attachment: PIG-1628.1.patch Patch passes unit tests and test-patch. Ready for review. > log this message at debug level : 'Pig Internal storage in use' > --- > > Key: PIG-1628 > URL: https://issues.apache.org/jira/browse/PIG-1628 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: Thejas M Nair >Assignee: Thejas M Nair > Fix For: 0.8.0 > > Attachments: PIG-1628.1.patch > > > The temporary storage functions used are logging at the INFO level. This > should change to the debug level, as these messages reduce the visibility of more > useful INFO messages. The messages include 'Pig Internal storage in use' from > InterStorage and 'TFile storage in use' from TFileStorage. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1628) log this message at debug level : 'Pig Internal storage in use'
[ https://issues.apache.org/jira/browse/PIG-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thejas M Nair updated PIG-1628: --- Status: Patch Available (was: Open) > log this message at debug level : 'Pig Internal storage in use' > --- > > Key: PIG-1628 > URL: https://issues.apache.org/jira/browse/PIG-1628 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: Thejas M Nair >Assignee: Thejas M Nair > Fix For: 0.8.0 > > Attachments: PIG-1628.1.patch > > > The temporary storage functions used are logging at the INFO level. This > should change to the debug level, as these messages reduce the visibility of more > useful INFO messages. The messages include 'Pig Internal storage in use' from > InterStorage and 'TFile storage in use' from TFileStorage. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.