[jira] Commented: (PIG-506) Does pig need a NATIVE keyword?
[ https://issues.apache.org/jira/browse/PIG-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903263#action_12903263 ]

Daniel Dai commented on PIG-506:

Patch looks good. One minor comment: PlanHelper.LoadStoreFinder might be better named PlanHelper.LoadStoreNativeFinder.

Does pig need a NATIVE keyword?

Key: PIG-506
URL: https://issues.apache.org/jira/browse/PIG-506
Project: Pig
Issue Type: New Feature
Components: impl
Reporter: Alan Gates
Assignee: Aniket Mokashi
Priority: Minor
Fix For: 0.8.0
Attachments: NativeImplInitial.patch, NativeMapReduceFinale1.patch, NativeMapReduceFinale2.patch, NativeMapReduceFinale3.patch, PIG-506.2.patch, PIG-506.3.patch, PIG-506.patch, TestWordCount.jar

Assume a user had a job that broke easily into three pieces. Further assume that pieces one and three were easily expressible in pig, but that piece two needed to be written in map reduce for whatever reason (performance, something that pig could not easily express, legacy job that was too important to change, etc.). Today the user would either have to use map reduce for the entire job or manually handle the stitching together of pig and map reduce jobs.

What if instead pig provided a NATIVE keyword that would allow the script to pass off the data stream to the underlying system (in this case map reduce)? The semantics of NATIVE would vary by underlying system. In the map reduce case, we would assume that this indicated a collection of one or more fully contained map reduce jobs, so that pig would store the data, invoke the map reduce jobs, and then read the resulting data to continue. It might look something like this:

{code}
A = load 'myfile';
X = load 'myotherfile';
B = group A by $0;
C = foreach B generate group, myudf(B);
D = native (jar=mymr.jar, infile=frompig outfile=topig);
E = join D by $0, X by $0;
...
{code}

This differs from streaming in that it allows the user to insert an arbitrary amount of native processing, whereas streaming allows the insertion of one binary. It also differs in that, for streaming, data is piped directly into and out of the binary as part of the pig pipeline. Here the pipeline would be broken, data written to disk, and the native block invoked, then data read back from disk.

Another alternative is to say this is unnecessary because the user can do the coordination from java, using the PigServer interface to run pig and calling the map reduce job explicitly. The advantage of the native keyword is that the user need not worry about coordination between the jobs; pig will take care of it. Also, the user can make use of existing java applications without being a java programmer.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1574) Optimization rule PushUpFilter causes filter to be pushed up out joins
[ https://issues.apache.org/jira/browse/PIG-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xuefu Zhang updated PIG-1574:

Attachment: (was: jira-1574-1.patch)

Optimization rule PushUpFilter causes filter to be pushed up out joins

Key: PIG-1574
URL: https://issues.apache.org/jira/browse/PIG-1574
Project: Pig
Issue Type: Bug
Reporter: Xuefu Zhang
Assignee: Xuefu Zhang
Fix For: 0.8.0

The PushUpFilter optimization rule in the new logical plan moves the filter up to one of the join branches. It does this aggressively by finding an operator that has all the projection UIDs. However, it does not consider that the found operator might be another join. If that join is outer, the filter cannot simply be moved to one of its branches. As an example, the following script will be erroneously optimized:

    A = load 'myfile' as (d1:int);
    B = load 'anotherfile' as (d2:int);
    C = join A by d1 full outer, B by d2;
    D = load 'xxx' as (d3:int);
    E = join C by d1, D by d3;
    F = filter E by d1 5;
    G = store F into 'dummy';
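The following is a minimal Python sketch of why pushing a filter into one branch of a full outer join is unsafe. It is not Pig code: `full_outer_join` and `keep` are hypothetical helpers written for this illustration, the data is made up, and the predicate `d1 > 5` is an assumption because the comparison operator in the script above is not legible.

```python
def full_outer_join(left, right):
    """Full outer join of two lists of ints on equality; None marks a missing side."""
    rows = [(l, r) for l in left for r in right if l == r]
    rows += [(l, None) for l in left if l not in right]
    rows += [(None, r) for r in right if r not in left]
    return rows


def keep(d1):
    # Assumed predicate (d1 > 5); a null d1 never passes the filter.
    return d1 is not None and d1 > 5


A = [3, 8]
B = [3, 8]

# Filter applied after the outer join, as the script is written.
after_join = [row for row in full_outer_join(A, B) if keep(row[0])]

# Filter pushed into branch A before the outer join (the unsafe rewrite).
pushed = full_outer_join([d1 for d1 in A if keep(d1)], B)

print(after_join)  # [(8, 8)]
print(pushed)      # [(8, 8), (None, 3)]
```

Filtering branch A first removes the row that would have matched B's 3, so the outer join emits an extra null-padded row that the original plan would never produce.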
[jira] Updated: (PIG-1574) Optimization rule PushUpFilter causes filter to be pushed up out joins
[ https://issues.apache.org/jira/browse/PIG-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xuefu Zhang updated PIG-1574:

Status: Patch Available (was: Open)

Optimization rule PushUpFilter causes filter to be pushed up out joins

Key: PIG-1574
URL: https://issues.apache.org/jira/browse/PIG-1574
Project: Pig
Issue Type: Bug
Reporter: Xuefu Zhang
Assignee: Xuefu Zhang
Fix For: 0.8.0
Attachments: jira-1574-1.patch

The PushUpFilter optimization rule in the new logical plan moves the filter up to one of the join branches. It does this aggressively by finding an operator that has all the projection UIDs. However, it does not consider that the found operator might be another join. If that join is outer, the filter cannot simply be moved to one of its branches. As an example, the following script will be erroneously optimized:

    A = load 'myfile' as (d1:int);
    B = load 'anotherfile' as (d2:int);
    C = join A by d1 full outer, B by d2;
    D = load 'xxx' as (d3:int);
    E = join C by d1, D by d3;
    F = filter E by d1 5;
    G = store F into 'dummy';
[jira] Created: (PIG-1575) Complete the migration of optimization rule PushUpFilter including missing test cases
Complete the migration of optimization rule PushUpFilter including missing test cases

Key: PIG-1575
URL: https://issues.apache.org/jira/browse/PIG-1575
Project: Pig
Issue Type: Bug
Reporter: Xuefu Zhang
Assignee: Xuefu Zhang
Fix For: 0.8.0

The PushUpFilter optimization rule under the new logical plan covers only a subset of the scenarios handled by the same rule under the old logical plan. For instance, it only considers a filter after a join, whereas the old rule also considers other operators such as CoGroup, Union, Cross, etc. The migration of the rule should be completed.

Also, the test cases created for the old PushUpFilter weren't migrated to the new logical plan code base. They should be migrated as well. (A few have been migrated in PIG-1574.)
[jira] Commented: (PIG-1563) SUBSTRING function is broken
[ https://issues.apache.org/jira/browse/PIG-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903578#action_12903578 ]

Olga Natkovich commented on PIG-1563:

I was able to make it work (without wrapping) for the functions that have a fixed number of arguments:

    LAST_INDEX_OF
    REPLACE
    TRIM

I don't believe there is currently a way to make it work with a variable number of args (even if the number of combinations is fixed). Moreover, if we add the mapping table in this case, it breaks the case of typed data, which is bad. This is the case with the remaining functions, INDEXOF and SPLIT.

So my suggestion is to fix only the first set of functions and delay the rest to 0.9, when we fix the mapping code.

Dmitriy and others, are you ok with this? If so, I can update the patch to reflect this.

SUBSTRING function is broken

Key: PIG-1563
URL: https://issues.apache.org/jira/browse/PIG-1563
Project: Pig
Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Olga Natkovich
Assignee: Dmitriy V. Ryaboy
Fix For: 0.8.0
Attachments: PIG_1563.patch

Script:

    A = load 'studenttab10k' as (name, age, gpa);
    C = foreach A generate SUBSTRING(name, 0,5);
    E = limit C 10;
    dump E;

Output is always empty:

    ()
    ()
    ()
    ()
    ()
    ()
    ()
    ()
    ()
    ()
[jira] Updated: (PIG-1502) Document and track system limits
[ https://issues.apache.org/jira/browse/PIG-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1502:

Fix Version/s: 0.9.0 (was: 0.8.0)

Document and track system limits

Key: PIG-1502
URL: https://issues.apache.org/jira/browse/PIG-1502
Project: Pig
Issue Type: Improvement
Reporter: Olga Natkovich
Fix For: 0.9.0

We need to be able to publish what the system limitations are, to make sure that Pig is used in the way it was intended and tested. For instance, combining 30 joins in a single MR job (via multiquery) might not work.
[jira] Commented: (PIG-1150) VAR() Variance UDF
[ https://issues.apache.org/jira/browse/PIG-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903581#action_12903581 ] Olga Natkovich commented on PIG-1150: - Dmitry, are you planning to add unit tests? Do we still want this in for 0.8? (Since it is going into piggybank, we can do this post branching but then we need to test in 2 places.) VAR() Variance UDF -- Key: PIG-1150 URL: https://issues.apache.org/jira/browse/PIG-1150 Project: Pig Issue Type: New Feature Affects Versions: 0.5.0 Environment: UDF, written in Pig 0.5 contrib/ Reporter: Russell Jurney Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: var.patch I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates variance in a distributed manner, based on the AVG() builtin. It works by calculating the count, sum and sum of squares, as described here: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm Is this a worthwhile contribution? Taking the square root of this value using the contrib SQRT() function gives Standard Deviation, which is missing from Pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
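As a rough illustration of the count / sum / sum-of-squares decomposition described above, here is a minimal Python sketch. It is not the attached UDF: the function names `initial`, `combine` and `final` are only loosely modelled on the three phases of an algebraic function, and the data is made up.

```python
from functools import reduce


def initial(values):
    """Map-side step: summarise one chunk as (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))


def combine(p, q):
    """Combiner step: partial summaries add component-wise."""
    return (p[0] + q[0], p[1] + q[1], p[2] + q[2])


def final(p):
    """Reduce-side step: population variance = E[x^2] - (E[x])^2."""
    n, s, ss = p
    mean = s / n
    return ss / n - mean * mean


# Three chunks standing in for data processed by three separate tasks.
chunks = [[2.0, 4.0, 4.0], [4.0, 5.0], [5.0, 7.0, 9.0]]
variance = final(reduce(combine, map(initial, chunks)))

print(variance)         # 4.0
print(variance ** 0.5)  # 2.0 (standard deviation)
```

Because the partial summaries simply add, they can be merged in any order, which is what makes the calculation distributable. Note that the sum-of-squares formula can lose precision when the mean is large relative to the spread of the data.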
[jira] Commented: (PIG-1549) Provide utility to construct CNF form of predicates
[ https://issues.apache.org/jira/browse/PIG-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903591#action_12903591 ] Olga Natkovich commented on PIG-1549: - I don't think this patch applies. can you regenerate the patch with svn diff from the latest code and also add unit tests, thanks Provide utility to construct CNF form of predicates --- Key: PIG-1549 URL: https://issues.apache.org/jira/browse/PIG-1549 Project: Pig Issue Type: New Feature Components: impl Affects Versions: 0.8.0 Reporter: Swati Jain Assignee: Swati Jain Fix For: 0.8.0 Attachments: 0001-Add-CNF-utility-class.patch Provide utility to construct CNF form of predicates -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1494) PIG Logical Optimization: Use CNF in PushUpFilter
[ https://issues.apache.org/jira/browse/PIG-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903593#action_12903593 ]

Olga Natkovich commented on PIG-1494:

Can this be moved from 0.8 to 0.9 release since we are about to branch for 0.9?

PIG Logical Optimization: Use CNF in PushUpFilter

Key: PIG-1494
URL: https://issues.apache.org/jira/browse/PIG-1494
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.7.0
Reporter: Swati Jain
Assignee: Swati Jain
Priority: Minor
Fix For: 0.8.0

The PushUpFilter rule is not able to handle complicated boolean expressions. For example, the SplitFilter rule splits one LOFilter into two at a top-level AND. However, it cannot split an LOFilter whose top-level operator is OR. For example:

*ex script:*

    A = load 'file_a' USING PigStorage(',') as (a1:int,a2:int,a3:int);
    B = load 'file_b' USING PigStorage(',') as (b1:int,b2:int,b3:int);
    C = load 'file_c' USING PigStorage(',') as (c1:int,c2:int,c3:int);
    J1 = JOIN B by b1, C by c1;
    J2 = JOIN J1 by $0, A by a1;
    D = *Filter J2 by ( (c1 10) AND (a3+b3 10) ) OR (c2 == 5);*
    explain D;

In the above example, PushUpFilter is not able to push any filter condition across any join, because the expression contains columns from all branches (inputs). But if we convert this expression into Conjunctive Normal Form (CNF), we would be able to push the filter condition c1 10 and c2 == 5 below both join conditions. Here is the CNF expression for the highlighted line:

    ( (c1 10) OR (c2 == 5) ) AND ( (a3+b3 10) OR (c2 ==5) )

*Suggestion:* It would be a good idea to convert LOFilter's boolean expression into CNF; it would then be easy to push parts (conjuncts) of the LOFilter boolean expression selectively. We would also no longer need the SplitFilter rule if we were to add this utility to the PushUpFilter rule itself.
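To make the suggestion concrete, here is a minimal Python sketch of a CNF conversion by distributing OR over AND, followed by a check of which conjuncts touch only one input. It is an illustration only, not the proposed Pig utility: the tuple-based expression tree, the helper names and the `COLUMNS` table are all assumptions, and the three comparisons are treated as opaque atoms because their operators are not legible in the script above.

```python
def to_cnf(e):
    """Convert an AND/OR tree (no NOT nodes) to CNF by distributing OR over AND."""
    if isinstance(e, str):
        return e
    op, left, right = e
    left, right = to_cnf(left), to_cnf(right)
    if op == 'and':
        return ('and', left, right)
    # op == 'or': (x AND y) OR z  ->  (x OR z) AND (y OR z)
    if not isinstance(left, str) and left[0] == 'and':
        return ('and', to_cnf(('or', left[1], right)), to_cnf(('or', left[2], right)))
    if not isinstance(right, str) and right[0] == 'and':
        return ('and', to_cnf(('or', left, right[1])), to_cnf(('or', left, right[2])))
    return ('or', left, right)


def conjuncts(e):
    """Flatten the top-level ANDs of a CNF tree into a list of clauses."""
    if not isinstance(e, str) and e[0] == 'and':
        return conjuncts(e[1]) + conjuncts(e[2])
    return [e]


# Hypothetical atom names and the columns each atom reads.
COLUMNS = {'c1_cond': {'c1'}, 'a3b3_cond': {'a3', 'b3'}, 'c2_eq_5': {'c2'}}


def inputs(e):
    """Set of input relations (first letter of each column name) used by a clause."""
    if isinstance(e, str):
        return {col[0] for col in COLUMNS[e]}
    return inputs(e[1]) | inputs(e[2])


expr = ('or', ('and', 'c1_cond', 'a3b3_cond'), 'c2_eq_5')
cnf = to_cnf(expr)
for clause in conjuncts(cnf):
    print(clause, sorted(inputs(clause)))
```

The first clause reads only columns of C, so it is a candidate for pushing toward that input, while the second still spans all three inputs and has to stay above the joins.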
[jira] Assigned: (PIG-1542) log level not propogated to MR task loggers
[ https://issues.apache.org/jira/browse/PIG-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich reassigned PIG-1542: --- Assignee: niraj rai This will be looked at after the branch since this is a regression and we don't have time to do it now. log level not propogated to MR task loggers --- Key: PIG-1542 URL: https://issues.apache.org/jira/browse/PIG-1542 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Assignee: niraj rai Fix For: 0.8.0 Specifying -d DEBUG does not affect the logging of the MR tasks . This was fixed earlier in PIG-882 . -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-1543) IsEmpty returns the wrong value after using LIMIT
[ https://issues.apache.org/jira/browse/PIG-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich reassigned PIG-1543:

Assignee: Daniel Dai

Daniel, can you check if this is related to the limit optimizer and if it was addressed with the new optimizer? (This can be done post branch since it is a bug split.)

IsEmpty returns the wrong value after using LIMIT

Key: PIG-1543
URL: https://issues.apache.org/jira/browse/PIG-1543
Project: Pig
Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Justin Hu
Assignee: Daniel Dai
Fix For: 0.8.0

1. Two input files:

    1a: limit_empty.input_a
        1
        1
        1
    1b: limit_empty.input_b
        2
        2

2. The pig script: limit_empty.pig

    -- A contains only 1's
    -- B contains only 2's
    A = load 'limit_empty.input_a' as (a1:int);
    B = load 'limit_empty.input_b' as (b1:int);
    C = COGROUP A by a1, B by b1;
    D = FOREACH C generate A, B, (IsEmpty(A)? 0:1), (IsEmpty(B)? 0:1), COUNT(A), COUNT(B);
    store D into 'limit_empty.output/d';
    -- After the script is done, we see the right results:
    -- {(1),(1),(1)} {} 1 0 3 0
    -- {} {(2),(2)} 0 1 0 2
    C1 = foreach C { Alim = limit A 1; Blim = limit B 1; generate Alim, Blim; }
    D1 = FOREACH C1 generate Alim, Blim, (IsEmpty(Alim)? 0:1), (IsEmpty(Blim)? 0:1), COUNT(Alim), COUNT(Blim);
    store D1 into 'limit_empty.output/d1';
    -- After the script is done, we see the unexpected results:
    -- {(1)} {} 1 1 1 0
    -- {} {(2)} 1 1 0 1
    dump D;
    dump D1;

3. Run the script and redirect the stdout (2 dumps) to a file.

There are two issues.

The major one: IsEmpty() returns FALSE for an empty bag in limit_empty.output/d1/*, while IsEmpty() returns the correct value in limit_empty.output/d/*. The difference is that LIMIT has been applied to the bag before IsEmpty() is used.

The minor one: the redirected output only contains the first dump:

    ({(1),(1),(1)},{},1,0,3L,0L)
    ({},{(2),(2)},0,1,0L,2L)

We expect two more lines like:

    ({(1)},{},1,1,1L,0L)
    ({},{(2)},1,1,0L,1L)

Besides, an error is reported:

    [main] ERROR org.apache.pig.backend.hadoop.executionengine.HJob - java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.pig.data.Tuple
[jira] Assigned: (PIG-1567) Optimization rule FilterAboveForeach is too restrictive and doesn't handle project * correctly
[ https://issues.apache.org/jira/browse/PIG-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich reassigned PIG-1567:

Assignee: Xuefu Zhang

Optimization rule FilterAboveForeach is too restrictive and doesn't handle project * correctly

Key: PIG-1567
URL: https://issues.apache.org/jira/browse/PIG-1567
Project: Pig
Issue Type: Bug
Reporter: Xuefu Zhang
Assignee: Xuefu Zhang
Fix For: 0.8.0

The FilterAboveForeach rule optimizes the plan by pushing a filter above the preceding foreach operator. However, during code review, two major problems were found:

1. The current implementation assumes that if no projection is found in the filter condition, then all columns from the foreach are projected. This prevents the following optimization:

    A = LOAD 'file.txt' AS (a(u,v), b, c);
    B = FOREACH A GENERATE $0, b;
    C = FILTER B BY 8 5;
    STORE C INTO 'empty';

2. The current implementation doesn't handle the * projection, which means project all columns. As a result, it isn't able to optimize the following:

    A = LOAD 'file.txt' AS (a(u,v), b, c);
    B = FOREACH A GENERATE $0, b;
    C = FILTER B BY Identity.class.getName(*) 5;
    STORE C INTO 'empty';
[jira] Assigned: (PIG-1570) native mapreduce operator MR job does not follow same failure handling logic as other pig MR jobs
[ https://issues.apache.org/jira/browse/PIG-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich reassigned PIG-1570: --- Assignee: Thejas M Nair native mapreduce operator MR job does not follow same failure handling logic as other pig MR jobs - Key: PIG-1570 URL: https://issues.apache.org/jira/browse/PIG-1570 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.8.0 The code path for handling failure in MR job corresponding to native MR is different and does not have the same behavior. For example, even if the MR job for mapreduce operator fails, the number of jobs that failed is being reported as 0 in PigStats log. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-1572) change default datatype when relations are used as scalar to bytearray
[ https://issues.apache.org/jira/browse/PIG-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich reassigned PIG-1572: --- Assignee: Thejas M Nair change default datatype when relations are used as scalar to bytearray -- Key: PIG-1572 URL: https://issues.apache.org/jira/browse/PIG-1572 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.8.0 When relations are cast to scalar, the current default type is chararray. This is inconsistent with the behavior in rest of pig-latin. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-1567) Optimization rule FilterAboveForeach is too restrictive and doesn't handle project * correctly
[ https://issues.apache.org/jira/browse/PIG-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xuefu Zhang resolved PIG-1567.

Resolution: Duplicate

Duplicate of PIG-1568.

Optimization rule FilterAboveForeach is too restrictive and doesn't handle project * correctly

Key: PIG-1567
URL: https://issues.apache.org/jira/browse/PIG-1567
Project: Pig
Issue Type: Bug
Reporter: Xuefu Zhang
Assignee: Xuefu Zhang
Fix For: 0.8.0

The FilterAboveForeach rule optimizes the plan by pushing a filter above the preceding foreach operator. However, during code review, two major problems were found:

1. The current implementation assumes that if no projection is found in the filter condition, then all columns from the foreach are projected. This prevents the following optimization:

    A = LOAD 'file.txt' AS (a(u,v), b, c);
    B = FOREACH A GENERATE $0, b;
    C = FILTER B BY 8 5;
    STORE C INTO 'empty';

2. The current implementation doesn't handle the * projection, which means project all columns. As a result, it isn't able to optimize the following:

    A = LOAD 'file.txt' AS (a(u,v), b, c);
    B = FOREACH A GENERATE $0, b;
    C = FILTER B BY Identity.class.getName(*) 5;
    STORE C INTO 'empty';
[jira] Commented: (PIG-1150) VAR() Variance UDF
[ https://issues.apache.org/jira/browse/PIG-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903634#action_12903634 ] Dmitriy V. Ryaboy commented on PIG-1150: I won't have time before the 30th. BTW one doesn't even need a udf if using the sum of squares approach.. :-) just generate the square and the sum in the foreach (it will perform the algebraic decomposition automatically) VAR() Variance UDF -- Key: PIG-1150 URL: https://issues.apache.org/jira/browse/PIG-1150 Project: Pig Issue Type: New Feature Affects Versions: 0.5.0 Environment: UDF, written in Pig 0.5 contrib/ Reporter: Russell Jurney Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: var.patch I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates variance in a distributed manner, based on the AVG() builtin. It works by calculating the count, sum and sum of squares, as described here: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm Is this a worthwhile contribution? Taking the square root of this value using the contrib SQRT() function gives Standard Deviation, which is missing from Pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1563) SUBSTRING function is broken
[ https://issues.apache.org/jira/browse/PIG-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903636#action_12903636 ] Dmitriy V. Ryaboy commented on PIG-1563: Sounds good. Should we just merge in the amazon contrib for some of these? SUBSTRING function is broken Key: PIG-1563 URL: https://issues.apache.org/jira/browse/PIG-1563 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Olga Natkovich Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG_1563.patch Script: A = load 'studenttab10k' as (name, age, gpa); C = foreach A generate SUBSTRING(name, 0,5); E = limit C 10; dump E; Output is always empty: () () () () () () () () () () -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1150) VAR() Variance UDF
[ https://issues.apache.org/jira/browse/PIG-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903637#action_12903637 ] Olga Natkovich commented on PIG-1150: - So should we unlink this from the release? VAR() Variance UDF -- Key: PIG-1150 URL: https://issues.apache.org/jira/browse/PIG-1150 Project: Pig Issue Type: New Feature Affects Versions: 0.5.0 Environment: UDF, written in Pig 0.5 contrib/ Reporter: Russell Jurney Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: var.patch I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates variance in a distributed manner, based on the AVG() builtin. It works by calculating the count, sum and sum of squares, as described here: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm Is this a worthwhile contribution? Taking the square root of this value using the contrib SQRT() function gives Standard Deviation, which is missing from Pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1563) SUBSTRING function is broken
[ https://issues.apache.org/jira/browse/PIG-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903640#action_12903640 ] Olga Natkovich commented on PIG-1563: - which JIRA is that? I will just get this in - I think that's all I have time today but I can look at the other one as well next week SUBSTRING function is broken Key: PIG-1563 URL: https://issues.apache.org/jira/browse/PIG-1563 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Olga Natkovich Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG_1563.patch Script: A = load 'studenttab10k' as (name, age, gpa); C = foreach A generate SUBSTRING(name, 0,5); E = limit C 10; dump E; Output is always empty: () () () () () () () () () () -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1150) VAR() Variance UDF
[ https://issues.apache.org/jira/browse/PIG-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903643#action_12903643 ] Dmitriy V. Ryaboy commented on PIG-1150: Yeah I think it's not a big deal if we are splitting piggybank out soon anyway. VAR() Variance UDF -- Key: PIG-1150 URL: https://issues.apache.org/jira/browse/PIG-1150 Project: Pig Issue Type: New Feature Affects Versions: 0.5.0 Environment: UDF, written in Pig 0.5 contrib/ Reporter: Russell Jurney Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: var.patch I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates variance in a distributed manner, based on the AVG() builtin. It works by calculating the count, sum and sum of squares, as described here: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm Is this a worthwhile contribution? Taking the square root of this value using the contrib SQRT() function gives Standard Deviation, which is missing from Pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1563) SUBSTRING function is broken
[ https://issues.apache.org/jira/browse/PIG-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903644#action_12903644 ] Dmitriy V. Ryaboy commented on PIG-1563: Olga, the amazon contrib is PIG-1565 SUBSTRING function is broken Key: PIG-1563 URL: https://issues.apache.org/jira/browse/PIG-1563 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Olga Natkovich Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG_1563.patch Script: A = load 'studenttab10k' as (name, age, gpa); C = foreach A generate SUBSTRING(name, 0,5); E = limit C 10; dump E; Output is always empty: () () () () () () () () () () -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1150) VAR() Variance UDF
[ https://issues.apache.org/jira/browse/PIG-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1150: Fix Version/s: 0.9.0 (was: 0.8.0) VAR() Variance UDF -- Key: PIG-1150 URL: https://issues.apache.org/jira/browse/PIG-1150 Project: Pig Issue Type: New Feature Affects Versions: 0.5.0 Environment: UDF, written in Pig 0.5 contrib/ Reporter: Russell Jurney Assignee: Dmitriy V. Ryaboy Fix For: 0.9.0 Attachments: var.patch I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates variance in a distributed manner, based on the AVG() builtin. It works by calculating the count, sum and sum of squares, as described here: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm Is this a worthwhile contribution? Taking the square root of this value using the contrib SQRT() function gives Standard Deviation, which is missing from Pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1512) PlanPrinter does not print LOJoin operator in the new logical optimization framework
[ https://issues.apache.org/jira/browse/PIG-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1512: Status: Resolved (was: Patch Available) Resolution: Fixed This is already fixed in the latest code. Thanks Swati! PlanPrinter does not print LOJoin operator in the new logical optimization framework Key: PIG-1512 URL: https://issues.apache.org/jira/browse/PIG-1512 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Swati Jain Assignee: Swati Jain Fix For: 0.8.0 Attachments: printJoin.patch PlanPrinter does not print LOJoin relational operator. As such, the LOJoin operator would not get printed when we do an explain. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1321) Logical Optimizer: Merge cascading foreach
[ https://issues.apache.org/jira/browse/PIG-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1321:

Status: Resolved (was: Patch Available)
Hadoop Flags: [Reviewed]
Resolution: Fixed

Patch committed. Thanks Xuefu!

Logical Optimizer: Merge cascading foreach

Key: PIG-1321
URL: https://issues.apache.org/jira/browse/PIG-1321
Project: Pig
Issue Type: Sub-task
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Xuefu Zhang
Fix For: 0.8.0
Attachments: jira-1321-2.patch, jira-1321-3.patch, pig-1321.patch

We can merge consecutive foreach statements. E.g.:

    b = foreach a generate a0#'key1' as b0, a0#'key2' as b1, a1;
    c = foreach b generate b0#'kk1', b0#'kk2', b1, a1;
    =>
    c = foreach a generate a0#'key1'#'kk1', a0#'key1'#'kk2', a0#'key2', a1;
[jira] Updated: (PIG-1321) Logical Optimizer: Merge cascading foreach
[ https://issues.apache.org/jira/browse/PIG-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1321:

Attachment: jira-1321-3.patch

Reposting the pre-conditions:

1. There are two consecutive foreach statements.
2. The second foreach statement has a simple inner plan in which the only statement is a GENERATE statement. In other words, the second foreach statement must be something like FOREACH A GENERATE
3. The first foreach statement cannot contain flatten, due to its complexity.
4. No output of the first foreach is referred to more than once in the second foreach, e.g.: B = foreach ; C = foreach B generate $0, $1, $0 will not be merged. The reason is that if we merge, $0 will be calculated twice, which defeats the benefit of merging.

All tests pass. test-patch result:

    [exec] +1 overall.
    [exec]
    [exec] +1 @author. The patch does not contain any @author tags.
    [exec]
    [exec] +1 tests included. The patch appears to include 3 new or modified tests.
    [exec]
    [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
    [exec]
    [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    [exec]
    [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
    [exec]
    [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

Logical Optimizer: Merge cascading foreach

Key: PIG-1321
URL: https://issues.apache.org/jira/browse/PIG-1321
Project: Pig
Issue Type: Sub-task
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Xuefu Zhang
Fix For: 0.8.0
Attachments: jira-1321-2.patch, jira-1321-3.patch, pig-1321.patch

We can merge consecutive foreach statements. E.g.:

    b = foreach a generate a0#'key1' as b0, a0#'key2' as b1, a1;
    c = foreach b generate b0#'kk1', b0#'kk2', b1, a1;
    =>
    c = foreach a generate a0#'key1'#'kk1', a0#'key1'#'kk2', a0#'key2', a1;
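The merge and pre-condition 4 can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the committed rule: expressions are plain strings, each entry of the second foreach is a hypothetical (reference, suffix) pair, and pre-conditions 1 to 3 are not modelled at all.

```python
from collections import Counter


def can_merge(second):
    """Pre-condition 4: no output of the first foreach is used more than once."""
    counts = Counter(ref for ref, _ in second)
    return all(n <= 1 for n in counts.values())


def merge(first, second):
    """Replace each reference in the second foreach with its defining expression."""
    return [first[ref] + suffix for ref, suffix in second]


# First foreach: output alias -> expression over the original input.
first = {"b0": "a0#'key1'", "b1": "a0#'key2'", "a1": "a1"}

# Second foreach, written as (referenced alias, map lookup applied to it).
once = [("b0", "#'kk1'"), ("b1", ""), ("a1", "")]
twice = [("b0", "#'kk1'"), ("b0", "#'kk2'"), ("b1", ""), ("a1", "")]

print(can_merge(once), merge(first, once))
print(can_merge(twice))
```

The second list references b0 twice, so under pre-condition 4 it is left unmerged: substituting would evaluate the expression behind b0 two times.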
[jira] Updated: (PIG-1515) Migrate logical optimization rule: PushDownForeachFlatten
[ https://issues.apache.org/jira/browse/PIG-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1515:

Attachment: jira-1515-2.patch

All tests pass. test-patch result:

    [exec] +1 overall.
    [exec]
    [exec] +1 @author. The patch does not contain any @author tags.
    [exec]
    [exec] +1 tests included. The patch appears to include 3 new or modified tests.
    [exec]
    [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
    [exec]
    [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    [exec]
    [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
    [exec]
    [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

Migrate logical optimization rule: PushDownForeachFlatten

Key: PIG-1515
URL: https://issues.apache.org/jira/browse/PIG-1515
Project: Pig
Issue Type: Sub-task
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Xuefu Zhang
Fix For: 0.8.0
Attachments: jira-1515-1.patch, jira-1515-2.patch
[jira] Updated: (PIG-1515) Migrate logical optimization rule: PushDownForeachFlatten
[ https://issues.apache.org/jira/browse/PIG-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1515:

    Status: Resolved (was: Patch Available)
    Hadoop Flags: [Reviewed]
    Resolution: Fixed

Patch committed. Thanks Xuefu!

Migrate logical optimization rule: PushDownForeachFlatten

    Key: PIG-1515
    URL: https://issues.apache.org/jira/browse/PIG-1515
    Project: Pig
    Issue Type: Sub-task
    Components: impl
    Affects Versions: 0.7.0
    Reporter: Daniel Dai
    Assignee: Xuefu Zhang
    Fix For: 0.8.0
    Attachments: jira-1515-1.patch, jira-1515-2.patch
[jira] Updated: (PIG-1399) Logical Optimizer: Expression optimizor rule
[ https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1399:

    Attachment: PIG-1399.patch

Addressed the review comments, except that the rule was not split into several optimization rules, since the order in which the rules are applied is significant.

Logical Optimizer: Expression optimizor rule

    Key: PIG-1399
    URL: https://issues.apache.org/jira/browse/PIG-1399
    Project: Pig
    Issue Type: Sub-task
    Components: impl
    Affects Versions: 0.7.0
    Reporter: Daniel Dai
    Assignee: Yan Zhou
    Fix For: 0.8.0
    Attachments: PIG-1399.patch, PIG-1399.patch, PIG-1399.patch, PIG-1399.patch, PIG-1399.patch

We can optimize expressions in several ways:

1. Constant pre-calculation
Example:
B = filter A by a0 > 5+7;
=> B = filter A by a0 > 12;

2. Boolean expression optimization
Example:
B = filter A by not (not(a0>5) or a>10);
=> B = filter A by a0>5 and a<=10;
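The two rewrites in the description can be sketched on a toy expression tree. This is a Python illustration only, not the rule in the attached patch. It assumes the comparisons in the examples are `a0 > 5` and `a > 10` (the operators are not legible in the archived text), represents expressions as nested tuples of the form `(operator, operand, ...)`, and uses the hypothetical names `fold` and `push_not`.

```python
import operator

# Arithmetic operators that can be pre-calculated when all operands are constants.
ARITHMETIC = {"+": operator.add, "-": operator.sub, "*": operator.mul}
# Each comparison operator mapped to its logical negation.
NEGATED = {">": "<=", "<=": ">", "<": ">=", ">=": "<", "==": "!=", "!=": "=="}


def fold(expr):
    """Pre-calculate arithmetic sub-expressions whose operands are all constants."""
    if not isinstance(expr, tuple):
        return expr  # a column name or a constant
    op, *args = expr
    args = [fold(a) for a in args]
    if op in ARITHMETIC and all(isinstance(a, (int, float)) for a in args):
        return ARITHMETIC[op](*args)
    return (op, *args)


def push_not(expr, negate=False):
    """Remove 'not' by pushing it down to the comparisons (De Morgan's laws)."""
    if not isinstance(expr, tuple):
        return ("not", expr) if negate else expr
    op, *args = expr
    if op == "not":
        return push_not(args[0], not negate)
    if op in ("and", "or"):
        if negate:
            op = "or" if op == "and" else "and"
        return (op, *[push_not(a, negate) for a in args])
    if op in NEGATED:
        return (NEGATED[op], *args) if negate else expr
    # Unknown operator: keep an explicit 'not' rather than guess its negation.
    return ("not", expr) if negate else expr
```

Under those assumptions, `fold((">", "a0", ("+", 5, 7)))` yields `(">", "a0", 12)`, and `push_not` turns `not (not(a0 > 5) or a > 10)` into `a0 > 5 and a <= 10`. Because the second rewrite can expose new constant sub-expressions and vice versa, the order of application matters, which matches the remark in the comment above.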
[jira] Resolved: (PIG-529) Want support for loading CSV files
[ https://issues.apache.org/jira/browse/PIG-529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich resolved PIG-529.

    Resolution: Duplicate

This is a duplicate of PIG-1555, which has been resolved for Pig 0.8.

Want support for loading CSV files

    Key: PIG-529
    URL: https://issues.apache.org/jira/browse/PIG-529
    Project: Pig
    Issue Type: New Feature
    Components: data
    Reporter: Tom White

Want to be able to load CSV data into Pig. This needs to handle quoting correctly.
[jira] Resolved: (PIG-771) PigDump does not properly output Chinese UTF8 characters - they are displayed as question marks ??
[ https://issues.apache.org/jira/browse/PIG-771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich resolved PIG-771.

    Fix Version/s: 0.7.0
    Resolution: Fixed

PigDump is no longer supported.

PigDump does not properly output Chinese UTF8 characters - they are displayed as question marks ??

    Key: PIG-771
    URL: https://issues.apache.org/jira/browse/PIG-771
    Project: Pig
    Issue Type: Bug
    Reporter: David Ciemiewicz
    Fix For: 0.7.0

PigDump does not properly output Chinese UTF8 characters. The reason is that the function Tuple.toString() is called. DefaultTuple implements Tuple.toString(), and it calls Object.toString() on the opaque object d. I think the code should instead call the new DataType.toString() function.

{code}
@Override
public String toString() {
    StringBuilder sb = new StringBuilder();
    sb.append('(');
    for (Iterator<Object> it = mFields.iterator(); it.hasNext();) {
        Object d = it.next();
        if (d != null) {
            if (d instanceof Map) {
                sb.append(DataType.mapToString((Map<Object, Object>) d));
            } else {
                sb.append(DataType.toString(d)); // Change this one line
                if (d instanceof Long) {
                    sb.append("L");
                } else if (d instanceof Float) {
                    sb.append("F");
                }
            }
        } else {
            sb.append("");
        }
        if (it.hasNext())
            sb.append(",");
    }
    sb.append(')');
    return sb.toString();
}
{code}
[jira] Commented: (PIG-1482) Pig gets confused when more than one loader is involved
[ https://issues.apache.org/jira/browse/PIG-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903714#action_12903714 ]

Thejas M Nair commented on PIG-1482:

Patch review comments -

- Schema.java
{code}
public FieldSchema(String a, Schema s, byte t) throws FrontendException {
    alias = a;
    schema = s;
    log.debug("t: " + t + " Bag: " + DataType.BAG + " tuple: " + DataType.TUPLE);
    /*
     * The following check is removed because it may not be always true. As a matter of
     * fact, the condition can be produced using other constructors anyway.
     *
    if ((null != s) && !(DataType.isSchemaType(t))) {
        int errCode = 1020;
        throw new FrontendException("Only a BAG or TUPLE can have schemas. Got "
                + DataType.findTypeName(t), errCode, PigException.INPUT);
    }
    */
{code}
I think some other code paths might be relying on this constructor for error checking. It would be safer to create another constructor with a boolean check argument
{code}
public FieldSchema(String a, Schema s, byte t, boolean innerTypeCheck)
{code}
and call that from the constructor above and from FieldSchema.copyAndLink(..).

- In LOStream.java.getSchema(), mIsSchemaComputed is used to keep track of whether the fieldschema parents have been set. I think it would be better to use a different variable for this purpose: it would be more readable, and less likely to break any assumptions people make about this variable, which comes from the LogicalOperator class.

- TypeCheckingVisitor.java
insertCastForUDF is called on the input of a udf. It seems like the same logic should be used for other expressions as well (instead of insertCast(..)). Also, insertCastForUDF(..) and insertCast(..) differ in only two lines, so we can share the rest of the code.
{code}
private void insertCastForUDF(LOUserFunc udf, FieldSchema fromFS, FieldSchema toFs, ExpressionOperator predecessor) {
    toFs.setParent(fromFS.canonicalName, predecessor);
    insertCast(udf, toFs.type, toFs, predecessor);
}
{code}

- TypeCheckingVisitor.java
In visit(LOCast), it seems like we just pick any of the matching predecessor load functions. Shouldn't we check whether all the FuncSpecs returned are the same?
{code}
for (Map.Entry<String, LogicalOperator> entry : canonicalMap.entrySet()) {
    FuncSpec loadFuncSpec = getLoadFuncSpec(entry.getValue(), entry.getKey());
    cast.setLoadFuncSpec(loadFuncSpec);
}
{code}

- LOProject.java
The commented line can be removed -
{code}
+// mFieldSchema.setParent(fs.canonicalName, expressionOperator);
{code}

Pig gets confused when more than one loader is involved

    Key: PIG-1482
    URL: https://issues.apache.org/jira/browse/PIG-1482
    Project: Pig
    Issue Type: Bug
    Affects Versions: 0.7.0
    Reporter: Ankur
    Assignee: Xuefu Zhang
    Fix For: 0.8.0
    Attachments: jira-1482-final.patch, jira-1482-final.patch, jira-1482-final.patch

When two relations are loaded using different loaders, then joined, grouped and projected, pig gets confused trying to find the appropriate loader for the requested cast. Consider the following script :-

A = LOAD 'data1' USING PigStorage() AS (s, m, l);
B = FOREACH A GENERATE s#'k1' as v1, m#'k2' as v2, l#'k3' as v3;
C = FOREACH B GENERATE v1, (v2 == 'v2' ? 1L : 0L) as v2:long, (v3 == 'v3' ? 1 :0) as v3:int;
D = LOAD 'data2' USING TextLoader() AS (a);
E = JOIN C BY v1, D BY a USING 'replicated';
F = GROUP E BY (v1, a);
G = FOREACH F GENERATE (chararray)group.v1, group.a;
dump G;

This throws an error, the stack trace of which is in the next comment.
[jira] Created: (PIG-1576) Difference in Semantics between Load statement in Pig and HDFS client on Command line
Difference in Semantics between Load statement in Pig and HDFS client on Command line

    Key: PIG-1576
    URL: https://issues.apache.org/jira/browse/PIG-1576
    Project: Pig
    Issue Type: Bug
    Components: impl
    Affects Versions: 0.7.0, 0.6.0
    Reporter: Viraj Bhat

Here is my directory structure on HDFS, which I want to access using Pig. This is a sample; in the real use case I have more than 100 of these directories.
{code}
$ hadoop fs -ls /user/viraj/recursive/
Found 3 items
drwxr-xr-x - viraj supergroup 0 2010-08-26 11:25 /user/viraj/recursive/20080615
drwxr-xr-x - viraj supergroup 0 2010-08-26 11:25 /user/viraj/recursive/20080616
drwxr-xr-x - viraj supergroup 0 2010-08-26 11:25 /user/viraj/recursive/20080617
{code}
From the command line I can access them in a variety of ways:
{code}
$ hadoop fs -ls /user/viraj/recursive/{200806}{15..17}/
-rw-r--r-- 1 viraj supergroup 5791 2010-08-26 11:25 /user/viraj/recursive/20080615/kv2.txt
-rw-r--r-- 1 viraj supergroup 5791 2010-08-26 11:25 /user/viraj/recursive/20080616/kv2.txt
-rw-r--r-- 1 viraj supergroup 5791 2010-08-26 11:25 /user/viraj/recursive/20080617/kv2.txt
$ hadoop fs -ls /user/viraj/recursive/{20080615..20080617}/
-rw-r--r-- 1 viraj supergroup 5791 2010-08-26 11:25 /user/viraj/recursive/20080615/kv2.txt
-rw-r--r-- 1 viraj supergroup 5791 2010-08-26 11:25 /user/viraj/recursive/20080616/kv2.txt
-rw-r--r-- 1 viraj supergroup 5791 2010-08-26 11:25 /user/viraj/recursive/20080617/kv2.txt
{code}
I have written a Pig script, but neither of the load statements below works:
{code}
--A = load '/user/viraj/recursive/{200806}{15..17}/' using PigStorage('\u0001') as (k:int, v:chararray);
A = load '/user/viraj/recursive/{20080615..20080617}/' using PigStorage('\u0001') as (k:int, v:chararray);
AL = limit A 10;
dump AL;
{code}
I get the following error in Pig 0.8:
{noformat}
2010-08-27 16:34:27,704 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2010-08-27 16:34:27,711 [main] INFO org.apache.pig.tools.pigstats.PigStats - Script Statistics:

HadoopVersion PigVersion UserId StartedAt FinishedAt Features
0.20.2 0.8.0-SNAPSHOT viraj 2010-08-27 16:34:24 2010-08-27 16:34:27 LIMIT

Failed!

Failed Jobs:
JobId Alias Feature Message Outputs
N/A A,AL Message: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: /user/viraj/recursive/{20080615..20080617}/
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:279)
    at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
    at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
    at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
    at java.lang.Thread.run(Thread.java:619)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input Pattern hdfs://localhost:9000/user/viraj/recursive/{20080615..20080617} matches 0 files
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:224)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:268)
    ... 7 more
hdfs://localhost:9000/tmp/temp241388470/tmp987803889,
{noformat}
The following works:
{code}
A = load '/user/viraj/recursive/{200806}{15,16,17}/' using PigStorage('\u0001') as (k:int, v:chararray);
AL = limit A 10;
dump AL;
{code}
Why is there an inconsistency between the HDFS client and Pig?

Viraj
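One possible explanation, offered as an assumption rather than a confirmed diagnosis: the shell expands `{15..17}` into separate arguments before `hadoop fs` ever runs, whereas the path in a Pig load statement reaches Hadoop's glob matching unexpanded, and that matching accepts comma lists such as `{15,16,17}` but not `..` ranges. Under that assumption, whatever generates the script could rewrite ranges into comma lists, the form reported to work above. The helper name `expand_numeric_ranges` is hypothetical; this is a Python sketch, not part of Pig or Hadoop.

```python
import re

# Matches a numeric brace range such as {15..17}.
_RANGE = re.compile(r"\{(\d+)\.\.(\d+)\}")


def expand_numeric_ranges(path: str) -> str:
    """Rewrite every {m..n} range in path as an explicit {m,...,n} list."""

    def to_list(match) -> str:
        first, last = match.group(1), match.group(2)
        lo, hi = int(first), int(last)
        step = 1 if lo <= hi else -1
        # Preserve zero padding when either bound was written with a leading zero.
        padded = (len(first) > 1 and first[0] == "0") or (len(last) > 1 and last[0] == "0")
        width = max(len(first), len(last)) if padded else 0
        values = [str(n).zfill(width) for n in range(lo, hi + step, step)]
        return "{" + ",".join(values) + "}"

    return _RANGE.sub(to_list, path)
```

Applied to `/user/viraj/recursive/{200806}{15..17}/`, this produces `/user/viraj/recursive/{200806}{15,16,17}/`, which is the load path that works in the report. For the more than 100 directories mentioned, the generated list grows accordingly, so a plain wildcard may be preferable when every subdirectory is wanted.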
[jira] Commented: (PIG-1564) add support for multiple filesystems
[ https://issues.apache.org/jira/browse/PIG-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903721#action_12903721 ]

Andrew Hitchcock commented on PIG-1564:

Hi all,

I think this patch is still useful. With current Pig trunk you can't CD between different filesystems. Example:

grunt> pwd
hdfs://ip-10-218-57-248.ec2.internal:9000/user/hadoop
grunt> cd s3://anhi-test-data/
2010-08-27 23:53:10,522 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. This file system object (hdfs://ip-10-218-57-248.ec2.internal:9000) does not support access to the request path 's3://anhi-test-data/' You possibly called FileSystem.get(conf) when you should of called FileSystem.get(uri, conf) to obtain a file system supporting your path.
Details at logfile: /home/hadoop/pig_1282952081120.log

This patch fixes that issue.

Andrew

add support for multiple filesystems

    Key: PIG-1564
    URL: https://issues.apache.org/jira/browse/PIG-1564
    Project: Pig
    Issue Type: Improvement
    Reporter: Andrew Hitchcock
    Attachments: PIG-1564-1.patch

Currently you can't run Pig scripts that read data from one file system and write it to another. Also, Grunt doesn't support CDing from one directory to another on different file systems.
[jira] Updated: (PIG-1399) Logical Optimizer: Expression optimizor rule
[ https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1399:

    Attachment: PIG-1399.patch

Rebased on the latest trunk.

Logical Optimizer: Expression optimizor rule

    Key: PIG-1399
    URL: https://issues.apache.org/jira/browse/PIG-1399
    Project: Pig
    Issue Type: Sub-task
    Components: impl
    Affects Versions: 0.7.0
    Reporter: Daniel Dai
    Assignee: Yan Zhou
    Fix For: 0.8.0
    Attachments: PIG-1399.patch, PIG-1399.patch, PIG-1399.patch, PIG-1399.patch, PIG-1399.patch, PIG-1399.patch
[jira] Commented: (PIG-1564) add support for multiple filesystems
[ https://issues.apache.org/jira/browse/PIG-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903728#action_12903728 ]

Dmitriy V. Ryaboy commented on PIG-1564:

Andrew, does 'fs -cd s3://anhi-test-data/' work? The cd command is also deprecated (though not marked as such) :)

add support for multiple filesystems

    Key: PIG-1564
    URL: https://issues.apache.org/jira/browse/PIG-1564
    Project: Pig
    Issue Type: Improvement
    Reporter: Andrew Hitchcock
    Attachments: PIG-1564-1.patch
[jira] Updated: (PIG-1482) Pig gets confused when more than one loader is involved
[ https://issues.apache.org/jira/browse/PIG-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xuefu Zhang updated PIG-1482:

    Status: Open (was: Patch Available)

Pig gets confused when more than one loader is involved

    Key: PIG-1482
    URL: https://issues.apache.org/jira/browse/PIG-1482
    Project: Pig
    Issue Type: Bug
    Affects Versions: 0.7.0
    Reporter: Ankur
    Assignee: Xuefu Zhang
    Fix For: 0.8.0
    Attachments: jira-1482-final-1.patch, jira-1482-final.patch, jira-1482-final.patch, jira-1482-final.patch
[jira] Updated: (PIG-1482) Pig gets confused when more than one loader is involved
[ https://issues.apache.org/jira/browse/PIG-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xuefu Zhang updated PIG-1482:

    Attachment: jira-1482-final-1.patch

Pig gets confused when more than one loader is involved

    Key: PIG-1482
    URL: https://issues.apache.org/jira/browse/PIG-1482
    Project: Pig
    Issue Type: Bug
    Affects Versions: 0.7.0
    Reporter: Ankur
    Assignee: Xuefu Zhang
    Fix For: 0.8.0
    Attachments: jira-1482-final-1.patch, jira-1482-final.patch, jira-1482-final.patch, jira-1482-final.patch
[jira] Updated: (PIG-1482) Pig gets confused when more than one loader is involved
[ https://issues.apache.org/jira/browse/PIG-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xuefu Zhang updated PIG-1482:

    Status: Patch Available (was: Open)

Updated the patch based on the review comments. Regarding the next-to-last review comment above: the map should contain only one entry. Before the result is obtained, an exception is thrown whenever two different load FuncSpecs are found. It was done that way before.

Pig gets confused when more than one loader is involved

    Key: PIG-1482
    URL: https://issues.apache.org/jira/browse/PIG-1482
    Project: Pig
    Issue Type: Bug
    Affects Versions: 0.7.0
    Reporter: Ankur
    Assignee: Xuefu Zhang
    Fix For: 0.8.0
    Attachments: jira-1482-final-1.patch, jira-1482-final.patch, jira-1482-final.patch, jira-1482-final.patch
[jira] Commented: (PIG-1564) add support for multiple filesystems
[ https://issues.apache.org/jira/browse/PIG-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903733#action_12903733 ]

Andrew Hitchcock commented on PIG-1564:

Nope:

grunt> fs -cd s3://anhi-test-data/
cd: Unknown command

Does that require a specific version of Hadoop to work (since it appears to be sending the call to Hadoop code)?

add support for multiple filesystems

    Key: PIG-1564
    URL: https://issues.apache.org/jira/browse/PIG-1564
    Project: Pig
    Issue Type: Improvement
    Reporter: Andrew Hitchcock
    Attachments: PIG-1564-1.patch
[jira] Created: (PIG-1577) support to variable number of arguments in UDF
support to variable number of arguments in UDF

    Key: PIG-1577
    URL: https://issues.apache.org/jira/browse/PIG-1577
    Project: Pig
    Issue Type: Bug
    Affects Versions: 0.6.0
    Reporter: Olga Natkovich
    Fix For: 0.9.0

In the current implementation, the functionality that maps arguments to classes does not support functions with a variable number of arguments. It also does not support functions that can take one of several fixed numbers of arguments. This causes problems for string UDFs such as CONCAT, which can take an arbitrary number of arguments, or TRIM, which can take 1, 2, or 3 arguments.
[jira] Updated: (PIG-1563) SUBSTRING function is broken
[ https://issues.apache.org/jira/browse/PIG-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1563:

    Attachment: PIG_1563_v2.patch

SUBSTRING function is broken

    Key: PIG-1563
    URL: https://issues.apache.org/jira/browse/PIG-1563
    Project: Pig
    Issue Type: Bug
    Affects Versions: 0.8.0
    Reporter: Olga Natkovich
    Assignee: Dmitriy V. Ryaboy
    Fix For: 0.8.0
    Attachments: PIG_1563.patch, PIG_1563_v2.patch

Script:

A = load 'studenttab10k' as (name, age, gpa);
C = foreach A generate SUBSTRING(name, 0,5);
E = limit C 10;
dump E;

Output is always empty:

()
()
()
()
()
()
()
()
()
()
[jira] Updated: (PIG-1531) Pig gobbles up error messages
[ https://issues.apache.org/jira/browse/PIG-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

niraj rai updated PIG-1531:

    Attachment: PIG_1531_2.patch

I have tried to accommodate all the recommendations from Ashutosh. I have changed the existing test case to validate the error message in the case where the store directory already exists. Writing a test case for the case where the input file does not exist was more effort than the actual fix, so I verified it manually and the messages looked good.
Thanks
Niraj

Pig gobbles up error messages

    Key: PIG-1531
    URL: https://issues.apache.org/jira/browse/PIG-1531
    Project: Pig
    Issue Type: Bug
    Affects Versions: 0.7.0
    Reporter: Ashutosh Chauhan
    Assignee: niraj rai
    Fix For: 0.8.0
    Attachments: PIG_1531.patch, PIG_1531_2.patch

Consider the following. I have my own Storer implementing StoreFunc and I am throwing FrontEndException (and other Exceptions derived from PigException) in its various methods. I expect those error messages to be shown in error scenarios. Instead Pig gobbles up my error messages and shows its own generic error message like:
{code}
010-07-31 14:14:25,414 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2116: Unexpected error. Could not validate the output specification for: default.partitoned
Details at logfile: /Users/ashutosh/workspace/pig/pig_1280610650690.log
{code}
Instead I expect it to display my error messages, which it stores away in that log file.
[jira] Commented: (PIG-1563) SUBSTRING function is broken
[ https://issues.apache.org/jira/browse/PIG-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903744#action_12903744 ]

Olga Natkovich commented on PIG-1563:

Uploaded a new patch which does the following:

(1) Adds a mapping function for functions with a fixed number of arguments: SUBSTRING, LAST_INDEX_OF, REPLACE, TRIM
(2) Leaves the rest of the functions alone, which means that until 0.9 they will only work on typed data. CONCAT is in the same category.
(3) Re-uses the applicable tests that Dmitriy created, thanks!
(4) Adds a couple of e2e tests to make sure that we test the mapping function as well

Please review. We will keep the issue open until we address (2) in 0.9.

SUBSTRING function is broken

    Key: PIG-1563
    URL: https://issues.apache.org/jira/browse/PIG-1563
    Project: Pig
    Issue Type: Bug
    Affects Versions: 0.8.0
    Reporter: Olga Natkovich
    Assignee: Dmitriy V. Ryaboy
    Fix For: 0.8.0
    Attachments: PIG_1563.patch, PIG_1563_v2.patch
[jira] Commented: (PIG-1563) SUBSTRING function is broken
[ https://issues.apache.org/jira/browse/PIG-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903753#action_12903753 ]

Dmitriy V. Ryaboy commented on PIG-1563:

+1

question/comment -- any reason you discarded the new buildSimpleFuncSpec I wrote in the first iteration of this patch? I think it simplifies the code:
{code}
funcList.add(Utils.buildSimpleFuncSpec(
    this.getClass().getName(), DataType.CHARARRAY, DataType.CHARARRAY));
{code}
vs
{code}
Schema s = new Schema();
s.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
s.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
funcList.add(new FuncSpec(this.getClass().getName(), s));
{code}

SUBSTRING function is broken

    Key: PIG-1563
    URL: https://issues.apache.org/jira/browse/PIG-1563
    Project: Pig
    Issue Type: Bug
    Affects Versions: 0.8.0
    Reporter: Olga Natkovich
    Assignee: Dmitriy V. Ryaboy
    Fix For: 0.8.0
    Attachments: PIG_1563.patch, PIG_1563_v2.patch
[jira] Updated: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with
[ https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1178:

    Attachment: PIG-1178-8.patch

PIG-1178-8.patch fixes TestPruneColumn.testMapKey3.

LogicalPlan and Optimizer are too complex and hard to work with

    Key: PIG-1178
    URL: https://issues.apache.org/jira/browse/PIG-1178
    Project: Pig
    Issue Type: Improvement
    Reporter: Alan Gates
    Assignee: Daniel Dai
    Fix For: 0.8.0
    Attachments: expressions-2.patch, expressions.patch, lp.patch, lp.patch, PIG-1178-4.patch, PIG-1178-5.patch, PIG-1178-6.patch, PIG-1178-7.patch, PIG-1178-8.patch, pig_1178.patch, pig_1178.patch, PIG_1178.patch, pig_1178_2.patch, pig_1178_3.2.patch, pig_1178_3.3.patch, pig_1178_3.4.patch, pig_1178_3.patch

The current implementation of the logical plan and the logical optimizer in Pig has proven not to be easily extensible. Developer feedback has indicated that adding new rules to the optimizer is quite burdensome. In addition, the logical plan has been an area of numerous bugs, many of which have been difficult to fix. Developers also feel that the logical plan is difficult to understand and maintain. The root cause of these issues is that a number of design decisions made as part of the 0.2 rewrite of the front end have now proven to be sub-optimal. The heart of this proposal is to revisit a number of those decisions and rebuild the logical plan with a simpler design that will make it much easier to maintain the logical plan as well as extend the logical optimizer.

See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full details.
[jira] Updated: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails
[ https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

niraj rai updated PIG-1343:

    Status: Patch Available (was: Open)

pig_log file missing even though Main tells it is creating one and an M/R job fails

    Key: PIG-1343
    URL: https://issues.apache.org/jira/browse/PIG-1343
    Project: Pig
    Issue Type: Bug
    Components: impl
    Affects Versions: 0.6.0
    Reporter: Viraj Bhat
    Assignee: niraj rai
    Fix For: 0.8.0
    Attachments: 1343.patch, PIG-1343-1.patch, pig_1343_2.patch, pig_1343_4.patch, PIG_1343_5.patch

I hit this particular case while running with the latest trunk of Pig.
{code}
$ java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig
[main] INFO org.apache.pig.Main - Logging error messages to: /homes/viraj/pig_1263420012601.log
$ ls -l pig_1263420012601.log
ls: pig_1263420012601.log: No such file or directory
{code}
The job failed and the log file did not contain anything; the only way to debug was to look into the Jobtracker logs.
Here are some reasons which could have caused this behavior:
1) The underlying filer/NFS had some issues. In that case, do we not report an error on stdout?
2) There are some errors from the backend which are not being captured.

Viraj
[jira] Updated: (PIG-1562) Fix the version for the dependent packages for the maven
[ https://issues.apache.org/jira/browse/PIG-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

niraj rai updated PIG-1562:

    Status: Patch Available (was: Open)

Fix the version for the dependent packages for the maven

    Key: PIG-1562
    URL: https://issues.apache.org/jira/browse/PIG-1562
    Project: Pig
    Issue Type: Bug
    Reporter: niraj rai
    Assignee: niraj rai
    Fix For: 0.8.0
    Attachments: PIG_1562_0.patch

We need to fix how the version is set, so that it is properly set for the dependent packages in the Maven repository.
[jira] Updated: (PIG-1531) Pig gobbles up error messages
[ https://issues.apache.org/jira/browse/PIG-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

niraj rai updated PIG-1531:

    Status: Patch Available (was: Open)

Pig gobbles up error messages

    Key: PIG-1531
    URL: https://issues.apache.org/jira/browse/PIG-1531
    Project: Pig
    Issue Type: Bug
    Affects Versions: 0.7.0
    Reporter: Ashutosh Chauhan
    Assignee: niraj rai
    Fix For: 0.8.0
    Attachments: PIG_1531.patch, PIG_1531_2.patch