[jira] Commented: (PIG-993) [zebra] Abitlity to drop a column group in a table
[ https://issues.apache.org/jira/browse/PIG-993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762707#action_12762707 ] Yan Zhou commented on PIG-993: -- The patch attached by me (Yan Zhou) was based upon Raghu's patch minus some unrelated changes. [zebra] Abitlity to drop a column group in a table -- Key: PIG-993 URL: https://issues.apache.org/jira/browse/PIG-993 Project: Pig Issue Type: Bug Reporter: Raghu Angadi Assignee: Raghu Angadi Attachments: DropColumnGroupExample.java, zebra-drop-cg.patch, zebra-drop-cg.patch A Zebra table is stored as multiple sub tables each containing a set of columns called column group (CG). The user specifies how these columns are grouped while creating a table through the _storage hint_. For some of the large tables, it might be necessary for users to remove a set of columns and retain the rest. This jira provides a way for users to delete an entire column group. The following comments will have more details on API and the semantics. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-983) PERFORMANCE: multi-query optimization on multiple group bys following a join or cogroup
[ https://issues.apache.org/jira/browse/PIG-983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-983: --- Resolution: Fixed Fix Version/s: 0.6.0 Hadoop Flags: [Reviewed] Status: Resolved (was: Patch Available) +1 Patch committed, thanks Richard!
PERFORMANCE: multi-query optimization on multiple group bys following a join or cogroup --- Key: PIG-983 URL: https://issues.apache.org/jira/browse/PIG-983 Project: Pig Issue Type: Improvement Components: impl Reporter: Richard Ding Assignee: Richard Ding Fix For: 0.6.0 Attachments: PIG-983.patch
The current multi-query optimizer works well with Pig scripts like this one:
{code}
data = LOAD 'input' AS (a:chararray, b:int, c:int);
A = GROUP data BY b;
B = GROUP data BY c;
C = FOREACH A GENERATE group, COUNT(data);
D = FOREACH B GENERATE group, SUM(data.b);
STORE C INTO 'output1';
STORE D INTO 'output2';
{code}
In this case the original three Map-Reduce jobs are merged into one MR job by the optimizer. The current optimizer, however, won't reduce the number of MR jobs for scripts in which multiple group bys follow a join or a cogroup, such as this one:
{code}
data1 = LOAD 'input1' AS (a1:chararray, b1:int, c1:int);
data2 = LOAD 'input2' AS (a2:chararray, b2:int, c2:int);
A = JOIN data1 BY a1, data2 BY a2;
B = GROUP A BY data1::b1;
C = GROUP A BY data2::c2;
D = FOREACH B GENERATE group, COUNT(A);
E = FOREACH C GENERATE group, SUM(A.data2::b2);
STORE D INTO 'output1';
STORE E INTO 'output2';
{code}
Three MR jobs are still needed to run this script. The multi-query optimizer should handle scripts of this kind by merging the group bys and reducing the overall number of MR jobs.
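The idea behind merging several group bys that read the same joined relation can be illustrated outside Pig. The following is a minimal Java sketch, not the optimizer's implementation: it assumes the joined rows are already in memory and computes both aggregations from the second script (a count per b1 and a sum of b2 per c2) in a single pass. The class, field, and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is not Pig code.
class MergedGroupBySketch {
    // One joined row, reduced to the fields the two aggregations need.
    static final class JoinedRow {
        final int b1;
        final int b2;
        final int c2;

        JoinedRow(int b1, int b2, int c2) {
            this.b1 = b1;
            this.b2 = b2;
            this.c2 = c2;
        }
    }

    // Both outputs, produced by a single scan of the joined rows.
    static final class Result {
        final Map<Integer, Long> countByB1 = new HashMap<>();
        final Map<Integer, Long> sumB2ByC2 = new HashMap<>();
    }

    static Result aggregateInOnePass(List<JoinedRow> rows) {
        Result result = new Result();
        for (JoinedRow row : rows) {
            // Branch 1: GROUP BY b1, COUNT
            result.countByB1.merge(row.b1, 1L, Long::sum);
            // Branch 2: GROUP BY c2, SUM(b2)
            result.sumB2ByC2.merge(row.c2, (long) row.b2, Long::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<JoinedRow> rows = Arrays.asList(
                new JoinedRow(1, 10, 7),
                new JoinedRow(1, 20, 8),
                new JoinedRow(2, 30, 7));
        Result result = aggregateInOnePass(rows);
        System.out.println("count by b1: " + result.countByB1);
        System.out.println("sum(b2) by c2: " + result.sumB2ByC2);
    }
}
```

The point of the sketch is only that both branches consume the same input once; how Pig actually merges the jobs is described in the patch, not here.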
[jira] Commented: (PIG-993) [zebra] Abitlity to drop a column group in a table
[ https://issues.apache.org/jira/browse/PIG-993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762731#action_12762731 ] Yan Zhou commented on PIG-993: Patch Reviewed +1
[jira] Commented: (PIG-987) [zebra] Zebra Column Group Access Control
[ https://issues.apache.org/jira/browse/PIG-987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762733#action_12762733 ] Gaurav Jain commented on PIG-987: Patch Reviewed +1
[zebra] Zebra Column Group Access Control - Key: PIG-987 URL: https://issues.apache.org/jira/browse/PIG-987 Project: Pig Issue Type: New Feature Affects Versions: 0.6.0 Reporter: Yan Zhou Assignee: Yan Zhou Attachments: ColumnGroupSecurity.patch
Access Control: when processes try to read from the column groups, Zebra should be able to handle allowed vs. disallowed user/application accesses. The security is eventually granted by the corresponding HDFS security of the data stored.
Expected behavior when column group permissions are set: when a user selects only columns that they do not have permission to access, Zebra should return an error with the message
Error #: Permission denied for accessing column <column name or names>
Access control applies to an entire column group, so all columns in a column group have the same permissions.
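The expected behavior described in PIG-987 can be modelled in a few lines. This is a minimal sketch, not the Zebra API: the class and method names are hypothetical, permissions are reduced to a readable/not-readable flag per column group, and returning the readable subset when only some of the selected columns are accessible is an assumption, since the issue only specifies the case where none are.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the described behavior; not the Zebra API.
class ColumnGroupAccessSketch {
    private final Map<String, String> columnToGroup = new HashMap<>();
    private final Set<String> readableGroups = new HashSet<>();

    void addColumn(String column, String group) {
        columnToGroup.put(column, group);
    }

    // Permission is granted per column group, so it covers every column in it.
    void allowGroup(String group) {
        readableGroups.add(group);
    }

    // Returns the selected columns the caller may read.
    // Throws only when none of the selected columns is readable.
    List<String> readableColumns(List<String> selected) {
        List<String> allowed = new ArrayList<>();
        for (String column : selected) {
            String group = columnToGroup.get(column);
            if (group != null && readableGroups.contains(group)) {
                allowed.add(column);
            }
        }
        if (allowed.isEmpty()) {
            throw new SecurityException(
                    "Permission denied for accessing column " + String.join(", ", selected));
        }
        return allowed;
    }

    public static void main(String[] args) {
        ColumnGroupAccessSketch table = new ColumnGroupAccessSketch();
        table.addColumn("a", "cg1");
        table.addColumn("b", "cg1");
        table.addColumn("c", "cg2");
        table.allowGroup("cg1");
        System.out.println(table.readableColumns(Arrays.asList("a", "c")));
        try {
            table.readableColumns(Arrays.asList("c"));
        } catch (SecurityException e) {
            System.out.println(e.getMessage());
        }
    }
}
```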
[jira] Commented: (PIG-991) [zebra] A few minor bugs as described in the Description section
[ https://issues.apache.org/jira/browse/PIG-991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762734#action_12762734 ] Gaurav Jain commented on PIG-991: Patch Reviewed +1
[zebra] A few minor bugs as described in the Description section Key: PIG-991 URL: https://issues.apache.org/jira/browse/PIG-991 Project: Pig Issue Type: Bug Affects Versions: 0.4.0 Reporter: Yan Zhou Assignee: Yan Zhou Priority: Minor Fix For: 0.6.0 Attachments: Bugs.patch
1) lzo2 was used as the compressor name for the LZO compression algorithm; it should be lzo instead;
2) the default compression is changed from lzo to gz for gzip;
3) in the JavaCC file SchemaParser.jjt, the package name was wrong, using the old package org.apache.pig.table.types;
4) in build.xml, two new javacc targets are added to generate the TableSchemaParser and TableStorageParser Java code;
5) support of column group security ( https://issues.apache.org/jira/browse/PIG-987 ) lacked support of the dumpinfo method: the groups and permissions were not displayed. Note that as a consequence, the patch herein must be applied after that of JIRA987;
6) and 7) a couple of issues reported in Jira917.
[jira] Commented: (PIG-922) Logical optimizer: push up project
[ https://issues.apache.org/jira/browse/PIG-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762754#action_12762754 ] Pradeep Kamath commented on PIG-922: Some comments on new patch:
PruneColumns.java:
| 274 if (relevantFields!=null && relevantFields.needAllFields())
| 275 {
| 276     requiredInputFieldsList.set(j, new RequiredFields(true));
| 277     continue;
| 278 }
| 279
| 280 // Mapping output map keys to input map keys
| 281 //
| 282 if (rlo instanceof LOCogroup)
| 283 {
| 284     if (relevantFields!=null && relevantFields.needAllFields())
| 285     {
| 286         for (Pair<Integer, Integer> pair : relevantFields.getFields())
| 287             relevantFields.setMapKeysInfo(pair.first, pair.second,
| 288                 new MapKeysInfo(true));
| 289     }
| 290 }
Wouldn't the last if be redundant? It is the same as the first if, and when the first if is true the loop continues and never reaches the last if.
Line numbers per old code:
326 // Collect required map keys in foreach plan here.
327 // This is the only logical operator that we collect map keys
328 // which are introduced by the operator here.
[jira] Commented: (PIG-976) Multi-query optimization throws ClassCastException
[ https://issues.apache.org/jira/browse/PIG-976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762803#action_12762803 ] Hadoop QA commented on PIG-976:
+1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12421451/PIG-976.patch against trunk revision 822382.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 7 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/61/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/61/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/61/console
This message is automatically generated.
Multi-query optimization throws ClassCastException -- Key: PIG-976 URL: https://issues.apache.org/jira/browse/PIG-976 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.4.0 Reporter: Ankur Assignee: Richard Ding Attachments: PIG-976.patch
Multi-query optimization fails to merge 2 branches when 1 is a result of Group By ALL and another is a result of Group By field1 where field 1 is of type long. Here is the script that fails with multi-query on.
data = LOAD 'test' USING PigStorage('\t') AS (a:long, b:double, c:double);
A = GROUP data ALL;
B = FOREACH A GENERATE SUM(data.b) AS sum1, SUM(data.c) AS sum2;
C = FOREACH B GENERATE (sum1/sum2) AS rate;
STORE C INTO 'result1';
D = GROUP data BY a;
E = FOREACH D GENERATE group AS a, SUM(data.b), SUM(data.c);
STORE E into 'result2';
Here is the exception from the logs
java.lang.ClassCastException: org.apache.pig.data.DefaultTuple cannot be cast to org.apache.pig.data.DataBag
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:399)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:180)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.processInput(POUserFunc.java:145)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:197)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:235)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:254)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:240)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.runPipeline(PODemux.java:264)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.getNext(PODemux.java:254)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:196)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:174)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:63)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(MapTask.java:906)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:786)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:228)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2206)
[jira] Commented: (PIG-994) Provide 'append' keyword to allow appending to diferent dataset once the feature is available in Hadoop
[ https://issues.apache.org/jira/browse/PIG-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762809#action_12762809 ] Alan Gates commented on PIG-994: Should it be a separate keyword or an option on store? I like it better as an option on store, as it can then be create or append depending on the file's existence. So it might look like:
{code}
store z into 'bla' append
{code}
Provide 'append' keyword to allow appending to diferent dataset once the feature is available in Hadoop --- Key: PIG-994 URL: https://issues.apache.org/jira/browse/PIG-994 Project: Pig Issue Type: New Feature Components: impl Affects Versions: 0.4.0 Environment: Grid clusters Reporter: Rekha Priority: Minor
Provide an 'append' keyword to allow appending to a different dataset on Pig 0.5.0, as it now runs on Hadoop 0.20 (which has the append feature).
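The "create or append depending on the file's existence" semantics suggested in the comment can be shown with a local-filesystem analogy. This is a minimal sketch using only java.nio.file; it does not touch Pig or HDFS, and the class name, the helper methods, and the target name 'bla' (borrowed from the comment's example) are purely illustrative.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Local-filesystem analogy only; Pig and HDFS are not involved here.
class CreateOrAppendSketch {
    // Creates the target if it is missing, otherwise appends to it.
    static void store(Path target, String data) {
        try {
            Files.write(target, data.getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String read(Path target) {
        try {
            return new String(Files.readAllBytes(target), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns a path inside a fresh temporary directory; the file itself does not exist yet.
    static Path newTempTarget() {
        try {
            return Files.createTempDirectory("append-sketch").resolve("bla");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path target = newTempTarget();
        store(target, "first run\n");   // file is created
        store(target, "second run\n");  // file is appended to
        System.out.print(read(target));
    }
}
```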
[jira] Reopened: (PIG-948) [Usability] Relating pig script with MR jobs
[ https://issues.apache.org/jira/browse/PIG-948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai reopened PIG-948: After this patch, local hadoop mode logs many messages of the form: ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Exception occured while trying to retrieve extra information about job in MapReduceLauncher.String index out of range: -1. We should suppress this message.
[Usability] Relating pig script with MR jobs Key: PIG-948 URL: https://issues.apache.org/jira/browse/PIG-948 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.4.0 Reporter: Ashutosh Chauhan Assignee: Ashutosh Chauhan Priority: Minor Fix For: 0.6.0 Attachments: pig-948-2.patch, pig-948-3.patch, pig-948.patch
Currently it's hard to relate a Pig script to a specific MR job. In a loaded cluster with multiple simultaneous job submissions, it's not easy to figure out which specific MR jobs were launched for a given Pig script. If Pig can provide this info, it will be useful for debugging and monitoring the jobs resulting from a Pig script. At the very least, Pig should be able to provide the user with the following information:
1) Job id of the launched job.
2) Complete web url of the jobtracker running this job.
[jira] Commented: (PIG-987) [zebra] Zebra Column Group Access Control
[ https://issues.apache.org/jira/browse/PIG-987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762812#action_12762812 ] Raghu Angadi commented on PIG-987: I tried to commit this patch. 'ant test' says all the tests fail, whereas only one or two tests fail without the patch. Does Hudson actually run Zebra tests?
[jira] Updated: (PIG-991) [zebra] A few minor bugs as described in the Description section
[ https://issues.apache.org/jira/browse/PIG-991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raghu Angadi updated PIG-991: Release Note: (was: Patch should be applied after that of Jira987.) bq. Patch should be applied after that of Jira987. [moved above comment from 'Release Notes' to this comment].
[jira] Updated: (PIG-948) [Usability] Relating pig script with MR jobs
[ https://issues.apache.org/jira/browse/PIG-948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-948: Attachment: PIG-948-4.patch
[jira] Updated: (PIG-948) [Usability] Relating pig script with MR jobs
[ https://issues.apache.org/jira/browse/PIG-948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-948: Status: Patch Available (was: Reopened)
[jira] Created: (PIG-996) [zebra] Zebra build script does not have findbugs and clover targets.
[zebra] Zebra build script does not have findbugs and clover targets. - Key: PIG-996 URL: https://issues.apache.org/jira/browse/PIG-996 Project: Pig Issue Type: Bug Components: build Reporter: Chao Wang Assignee: Chao Wang Zebra build script does not have findbugs and clover targets, leading hudson build process to fail on Zebra. This jira is to fix this by adding these two targets.
[jira] Created: (PIG-997) [zebra] Sorted Table Support by Zebra
[zebra] Sorted Table Support by Zebra - Key: PIG-997 URL: https://issues.apache.org/jira/browse/PIG-997 Project: Pig Issue Type: New Feature Reporter: Yan Zhou Fix For: 0.6.0
This new feature is for Zebra to support sorted data in storage. As a storage library, Zebra will not sort the data by itself, but it will support creation and use of sorted data either through PIG or through map/reduce tasks that use Zebra as the storage format. The sorted table keeps the data in a totally sorted manner across all TFiles created by potentially all mappers or reducers.
For sorted data creation through PIG's STORE operator, if the input data is sorted through ORDER BY, the new Zebra table will be marked as sorted on the sorted columns.
For sorted data creation through Map/Reduce tasks, three new static methods of the BasicTableOutput class will be provided to help the user achieve the goal. setSortInfo allows the user to specify the sorted columns of the input tuple to be stored; getSortKeyGenerator and getSortKey help the user generate a key acceptable to Zebra as a sort key, based upon the schema, the sorted columns and the input tuple.
For sorted data read through PIG's LOAD operator, pass the string 'sorted' as an extra argument to the TableLoader constructor to ask for a sorted table to be loaded.
For sorted data read through Map/Reduce tasks, a new static method of the TableInputFormat class, requireSortedTable, can be called to ask for a sorted table to be read. Additionally, an overloaded version of the new method can be called to ask for a sorted table on specified sort columns and comparator.
For this release, sorted tables only support sorting in ascending order, not in descending order. In addition, the sort keys must be of simple types, not complex types such as RECORD, COLLECTION and MAP. Multiple-key sorting is supported, but the ordering of the multiple sort keys is significant, with the first sort column being the primary sort key, the second being the secondary sort key, etc. In this release, the sort keys are stored along with the sort columns from which the keys were originally created, resulting in some data storage redundancy.
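Why the ordering of multiple sort keys is significant can be seen with a small, Zebra-independent example. The sketch below uses only java.util.Comparator; the Row class, its columns (city, year), and the sample data are hypothetical and stand in for tuples with two simple-typed sort columns, both sorted ascending.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustration of primary vs. secondary sort keys; not Zebra code.
class SortKeyOrderSketch {
    static final class Row {
        final String city;
        final int year;

        Row(String city, int year) {
            this.city = city;
            this.year = year;
        }

        @Override
        public String toString() {
            return city + ":" + year;
        }
    }

    // city is the primary key, year the secondary key (both ascending).
    static final Comparator<Row> CITY_THEN_YEAR =
            Comparator.comparing((Row r) -> r.city).thenComparingInt((Row r) -> r.year);

    // Same two columns, but year is now the primary key.
    static final Comparator<Row> YEAR_THEN_CITY =
            Comparator.comparingInt((Row r) -> r.year).thenComparing((Row r) -> r.city);

    // Sorts a copy of the rows and returns their string forms.
    static List<String> sortBy(List<Row> rows, Comparator<Row> order) {
        List<Row> copy = new ArrayList<>(rows);
        copy.sort(order);
        List<String> out = new ArrayList<>();
        for (Row row : copy) {
            out.add(row.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row("Oslo", 2009),
                new Row("Bergen", 2009),
                new Row("Oslo", 2008),
                new Row("Bergen", 2010));
        System.out.println(sortBy(rows, CITY_THEN_YEAR));
        System.out.println(sortBy(rows, YEAR_THEN_CITY));
    }
}
```

The same four rows come out in a different total order depending on which column is listed first, which is why a table sorted on (city, year) cannot be treated as sorted on (year, city).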
[jira] Commented: (PIG-987) [zebra] Zebra Column Group Access Control
[ https://issues.apache.org/jira/browse/PIG-987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762824#action_12762824 ] Yan Zhou commented on PIG-987: I checked Hudson test results and they do not seem to run Zebra. But I ran ant test in contrib/zebra directory and they passed. What errors did you get? I suspect some env issue at your end.
[jira] Updated: (PIG-987) [zebra] Zebra Column Group Access Control
[ https://issues.apache.org/jira/browse/PIG-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raghu Angadi updated PIG-987: Attachment: TEST-org.apache.hadoop.zebra.mapred.TestCheckin.txt I am attaching {{mapred.TestCheckin.txt}} that passes without the patch. btw, not all tests pass even without the patch. What is the environment required? I did a fresh checkout and ran 'ant test'. I guess the test failures on trunk are related to lzo. But I didn't expect more failures with the patch. Looks like PIG-991 removes the lzo dependency. I will try with that patch included.
[jira] Updated: (PIG-987) [zebra] Zebra Column Group Access Control
[ https://issues.apache.org/jira/browse/PIG-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Wang updated PIG-987: I ran into the same issue also. I did a fresh checkout from apache trunk and ran ant test; 14 test cases failed. Actually, they are caused by an incompatible exception type between pig and zebra. It seems pig has already moved on with the change (IOException changed to IndexOutofBoundException), but zebra is a bit behind on this.
[jira] Commented: (PIG-987) [zebra] Zebra Column Group Access Control
[ https://issues.apache.org/jira/browse/PIG-987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762829#action_12762829 ] Raghu Angadi commented on PIG-987: Not sure if this is related to PIG. When I applied PIG-991 over this, the tests passed (except the ones that fail on trunk).
Build failed in Hudson: Pig-trunk #580
See http://hudson.zones.apache.org/hudson/job/Pig-trunk/580/changes Changes: [pradeepkth] PERFORMANCE: multi-query optimization on multiple group bys following a join or cogroup (rding via pradeepkth) -- [...truncated 167006 lines...] [junit] 09/10/07 01:16:28 INFO hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:60366 is added to blk_6792183205926187173_1014 size 1859 [junit] 09/10/07 01:16:28 INFO datanode.DataNode: PacketResponder 2 for block blk_6792183205926187173_1014 terminating [junit] 09/10/07 01:16:28 INFO hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:51295 is added to blk_6792183205926187173_1014 size 1859 [junit] 09/10/07 01:16:28 INFO hdfs.StateChange: DIR* NameSystem.completeFile: file /tmp/hadoop-hudson/mapred/system/job_20091007011555276_0002/job.split is closed by DFSClient_-1468843592 [junit] 09/10/07 01:16:28 INFO FSNamesystem.audit: ugi=hudson,hudson ip=/127.0.0.1 cmd=create src=/tmp/hadoop-hudson/mapred/system/job_20091007011555276_0002/job.xml dst=nullperm=hudson:supergroup:rw-r--r-- [junit] 09/10/07 01:16:28 INFO FSNamesystem.audit: ugi=hudson,hudson ip=/127.0.0.1 cmd=setPermission src=/tmp/hadoop-hudson/mapred/system/job_20091007011555276_0002/job.xml dst=nullperm=hudson:supergroup:rw-r--r-- [junit] 09/10/07 01:16:28 INFO hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /tmp/hadoop-hudson/mapred/system/job_20091007011555276_0002/job.xml. 
blk_-6478926164587938265_1015 [junit] 09/10/07 01:16:28 INFO datanode.DataNode: Receiving block blk_-6478926164587938265_1015 src: /127.0.0.1:34216 dest: /127.0.0.1:59951 [junit] 09/10/07 01:16:28 INFO datanode.DataNode: Receiving block blk_-6478926164587938265_1015 src: /127.0.0.1:35478 dest: /127.0.0.1:51295 [junit] 09/10/07 01:16:28 INFO datanode.DataNode: Receiving block blk_-6478926164587938265_1015 src: /127.0.0.1:45552 dest: /127.0.0.1:49650 [junit] 09/10/07 01:16:28 INFO DataNode.clienttrace: src: /127.0.0.1:45552, dest: /127.0.0.1:49650, bytes: 48254, op: HDFS_WRITE, cliID: DFSClient_-1468843592, srvID: DS-1821165369-127.0.1.1-49650-1254878155200, blockid: blk_-6478926164587938265_1015 [junit] 09/10/07 01:16:28 INFO datanode.DataNode: PacketResponder 0 for block blk_-6478926164587938265_1015 terminating [junit] 09/10/07 01:16:28 INFO hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:49650 is added to blk_-6478926164587938265_1015 size 48254 [junit] 09/10/07 01:16:28 INFO DataNode.clienttrace: src: /127.0.0.1:35478, dest: /127.0.0.1:51295, bytes: 48254, op: HDFS_WRITE, cliID: DFSClient_-1468843592, srvID: DS-1845303905-127.0.1.1-51295-1254878153423, blockid: blk_-6478926164587938265_1015 [junit] 09/10/07 01:16:28 INFO hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:51295 is added to blk_-6478926164587938265_1015 size 48254 [junit] 09/10/07 01:16:28 INFO datanode.DataNode: PacketResponder 1 for block blk_-6478926164587938265_1015 terminating [junit] 09/10/07 01:16:28 INFO DataNode.clienttrace: src: /127.0.0.1:34216, dest: /127.0.0.1:59951, bytes: 48254, op: HDFS_WRITE, cliID: DFSClient_-1468843592, srvID: DS-632073239-127.0.1.1-59951-1254878154621, blockid: blk_-6478926164587938265_1015 [junit] 09/10/07 01:16:28 INFO hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:59951 is added to blk_-6478926164587938265_1015 size 48254 [junit] 09/10/07 01:16:28 INFO 
datanode.DataNode: PacketResponder 2 for block blk_-6478926164587938265_1015 terminating [junit] 09/10/07 01:16:28 INFO hdfs.StateChange: DIR* NameSystem.completeFile: file /tmp/hadoop-hudson/mapred/system/job_20091007011555276_0002/job.xml is closed by DFSClient_-1468843592 [junit] 09/10/07 01:16:28 INFO FSNamesystem.audit: ugi=hudson,hudson ip=/127.0.0.1 cmd=open src=/tmp/hadoop-hudson/mapred/system/job_20091007011555276_0002/job.xml dst=nullperm=null [junit] 09/10/07 01:16:28 INFO DataNode.clienttrace: src: /127.0.0.1:59951, dest: /127.0.0.1:34219, bytes: 48634, op: HDFS_READ, cliID: DFSClient_-1468843592, srvID: DS-632073239-127.0.1.1-59951-1254878154621, blockid: blk_-6478926164587938265_1015 [junit] 09/10/07 01:16:28 INFO FSNamesystem.audit: ugi=hudson,hudson ip=/127.0.0.1 cmd=open src=/tmp/hadoop-hudson/mapred/system/job_20091007011555276_0002/job.jar dst=nullperm=null [junit] 09/10/07 01:16:28 INFO DataNode.clienttrace: src: /127.0.0.1:49650, dest: /127.0.0.1:45554, bytes: 2482874, op: HDFS_READ, cliID: DFSClient_-1468843592, srvID: DS-1821165369-127.0.1.1-49650-1254878155200, blockid: blk_590227262299005753_1013 [junit] 09/10/07 01:16:28 INFO mapred.JobTracker: Initializing job_20091007011555276_0002 [junit] 09/10/07 01:16:28 INFO
[jira] Updated: (PIG-993) [zebra] Ability to drop a column group in a table
[ https://issues.apache.org/jira/browse/PIG-993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated PIG-993:
Fix Version/s: 0.6.0

[zebra] Ability to drop a column group in a table
Key: PIG-993
URL: https://issues.apache.org/jira/browse/PIG-993
Project: Pig
Issue Type: Bug
Reporter: Raghu Angadi
Assignee: Raghu Angadi
Fix For: 0.6.0
Attachments: DropColumnGroupExample.java, zebra-drop-cg.patch, zebra-drop-cg.patch

A Zebra table is stored as multiple sub tables, each containing a set of columns called a column group (CG). The user specifies how these columns are grouped while creating a table through the _storage hint_. For some of the large tables, it might be necessary for users to remove a set of columns and retain the rest. This jira provides a way for users to delete an entire column group. The following comments will have more details on API and the semantics.
[jira] Commented: (PIG-987) [zebra] Zebra Column Group Access Control
[ https://issues.apache.org/jira/browse/PIG-987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762854#action_12762854 ]

Yan Zhou commented on PIG-987:

It's because this patch exposes the env problem with using lzo as compression that 991 eventually fixes. Can you commit 991's patch along with this? What are the failures from trunk? What are the error messages?

[zebra] Zebra Column Group Access Control
Key: PIG-987
URL: https://issues.apache.org/jira/browse/PIG-987
Project: Pig
Issue Type: New Feature
Affects Versions: 0.6.0
Reporter: Yan Zhou
Assignee: Yan Zhou
Attachments: ColumnGroupSecurity.patch, TEST-org.apache.hadoop.zebra.mapred.TestCheckin.txt

Access Control: when processes try to read from the column groups, Zebra should be able to handle allowed vs. disallowed user/application accesses. The security is eventually granted by the corresponding HDFS security of the data stored.

Expected behavior when column group permissions are set: when a user selects only columns that they do not have permissions to access, Zebra should return an error with the message "Error #: Permission denied for accessing column <column name or names>".

Access control applies to an entire column group, so all columns in a column group have the same permissions.
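The expected behavior described in PIG-987 can be illustrated with a small Python sketch. This is not Zebra's API: the function name `check_projection`, the dictionaries and the `readable` flags are all hypothetical, and the handling of a mixed projection (some readable columns, some not) is an assumption, since the issue only specifies the case where every selected column is unreadable.

```python
def check_projection(column_groups, readable, projection):
    """Return the projected columns the caller may read.

    column_groups: hypothetical mapping of column group name -> list of columns
    readable:      hypothetical mapping of column group name -> bool (HDFS-level access)
    projection:    list of column names the caller selected
    """
    # Permissions apply to a whole column group, so look up each column's group.
    owner = {col: cg for cg, cols in column_groups.items() for col in cols}

    unknown = [col for col in projection if col not in owner]
    if unknown:
        raise KeyError("Unknown column(s): " + ", ".join(unknown))

    allowed = [col for col in projection if readable[owner[col]]]

    # Case stated in the issue: only unreadable columns were selected.
    if projection and not allowed:
        raise PermissionError(
            "Permission denied for accessing column " + ", ".join(projection)
        )

    # Assumption (not stated in the issue): a mixed projection yields the readable subset.
    return allowed
```

For example, with groups `{"cg1": ["a", "b"], "cg2": ["c"]}` where only `cg1` is readable, selecting `["c"]` raises the error, because every selected column lives in an unreadable group.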
[jira] Commented: (PIG-948) [Usability] Relating pig script with MR jobs
[ https://issues.apache.org/jira/browse/PIG-948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762861#action_12762861 ]

Hadoop QA commented on PIG-948:

-1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12421472/PIG-948-4.patch against trunk revision 822382.

+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/62/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/62/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/62/console

This message is automatically generated.

[Usability] Relating pig script with MR jobs
Key: PIG-948
URL: https://issues.apache.org/jira/browse/PIG-948
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.4.0
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
Priority: Minor
Fix For: 0.6.0
Attachments: pig-948-2.patch, pig-948-3.patch, PIG-948-4.patch, pig-948.patch

Currently it's hard to find a way to relate a pig script with a specific MR job. In a loaded cluster with multiple simultaneous job submissions, it's not easy to figure out which specific MR jobs were launched for a given pig script. If Pig can provide this info, it will be useful to debug and monitor the jobs resulting from a pig script. At the very least, Pig should be able to provide the user the following information:
1) Job id of the launched job.
2) Complete web url of jobtracker running this job.
[jira] Commented: (PIG-987) [zebra] Zebra Column Group Access Control
[ https://issues.apache.org/jira/browse/PIG-987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762871#action_12762871 ]

Raghu Angadi commented on PIG-987:

Even with PIG-991 included, I am seeing lzo related failures. Could you run tests on a clean checkout? If you didn't see the errors before then you probably have lzo set up in your environment, which is not a requirement.

[zebra] Zebra Column Group Access Control
Key: PIG-987
URL: https://issues.apache.org/jira/browse/PIG-987
Project: Pig
Issue Type: New Feature
Affects Versions: 0.6.0
Reporter: Yan Zhou
Assignee: Yan Zhou
Attachments: ColumnGroupSecurity.patch, TEST-org.apache.hadoop.zebra.mapred.TestCheckin.txt

Access Control: when processes try to read from the column groups, Zebra should be able to handle allowed vs. disallowed user/application accesses. The security is eventually granted by the corresponding HDFS security of the data stored.

Expected behavior when column group permissions are set: when a user selects only columns that they do not have permissions to access, Zebra should return an error with the message "Error #: Permission denied for accessing column <column name or names>".

Access control applies to an entire column group, so all columns in a column group have the same permissions.
[jira] Updated: (PIG-994) Provide 'append' keyword to allow appending to different dataset once the feature is available in Hadoop
[ https://issues.apache.org/jira/browse/PIG-994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rekha updated PIG-994:
Tags: append, update, hadoop 0.20 (was: append, hadoop 0.20)

Provide 'append' keyword to allow appending to different dataset once the feature is available in Hadoop
Key: PIG-994
URL: https://issues.apache.org/jira/browse/PIG-994
Project: Pig
Issue Type: New Feature
Components: impl
Affects Versions: 0.4.0
Environment: Grid clusters
Reporter: Rekha
Priority: Minor

Provide 'append' keyword to allow appending to different dataset on pig 0.5.0 as it is now on hadoop 0.20 (which has append feature)
[jira] Updated: (PIG-994) Provide 'append' keyword to allow appending to different dataset once the feature is available in Hadoop
[ https://issues.apache.org/jira/browse/PIG-994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rekha updated PIG-994:

Thanks Alan. I am for 'option on store' mostly, and definitely if they are exclusive possibilities. However, for argument's sake, a keyword approach can be considered in addition. This is because I am hoping append will open doors to easily patching an update feature into the pig api along similar lines (and hopefully as part of the same jira ticket).

My idea of update is a syntax like

update DS1 by (join_keys) from DS2 by (join_keys) parallel $PARALLEL

This will update dataset1 (DS1) with data from dataset2 (DS2) based on key joins.

{code}
update b by (join_key1, join_key2) from c by (join_key1, join_key2); // this will update the DS b directly
// or alternatively
// x = update b by (join_key1, join_key2) from c by (join_key1, join_key2); // making it two-step.
z = foreach b generate $0, $32, $50; // in case you are taking only few cols from main(b), new (c)
store z into 'bla' append; // appends the o/p data into 'bla' directly.
{code}

For the append case, the construct below would be another way of doing it.

{code}
append b, c; // appends directly into b.
z = foreach b generate $0, $32, $50; // in case you are taking only few cols from main(b), new (c)
store z into 'bla';
{code}

Provide 'append' keyword to allow appending to different dataset once the feature is available in Hadoop
Key: PIG-994
URL: https://issues.apache.org/jira/browse/PIG-994
Project: Pig
Issue Type: New Feature
Components: impl
Affects Versions: 0.4.0
Environment: Grid clusters
Reporter: Rekha
Priority: Minor

Provide 'append' keyword to allow appending to different dataset on pig 0.5.0 as it is now on hadoop 0.20 (which has append feature)
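To make the proposed `update ... by ... from ... by ...` semantics concrete, here is a minimal Python sketch of one possible reading. It is not Pig code and not part of any patch: the function `update_by_key` is hypothetical, and it assumes that the join keys sit at the same positions in both datasets, that a matching row in the first dataset is replaced wholesale by the row from the second, that rows of the second dataset without a match are ignored, and that the last duplicate key wins.

```python
def update_by_key(ds1, ds2, key_indexes):
    """Return ds1 with every row replaced by the ds2 row sharing its key, if any.

    ds1, ds2:    lists of tuples (rows)
    key_indexes: positions of the join key fields (assumed identical in both datasets)
    """
    def key_of(row):
        return tuple(row[i] for i in key_indexes)

    # Index the updating dataset by key; a later duplicate key overwrites an earlier one.
    replacements = {key_of(row): row for row in ds2}

    # Keep ds1's order; swap in the ds2 row wherever the key matches.
    return [replacements.get(key_of(row), row) for row in ds1]


if __name__ == "__main__":
    b = [(1, "x", "old"), (2, "y", "old")]
    c = [(2, "y", "new"), (3, "z", "new")]
    print(update_by_key(b, c, (0, 1)))
```

Whether unmatched rows of the second dataset should instead be appended is exactly the kind of question the keyword-versus-store-option discussion would need to settle.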
[jira] Updated: (PIG-922) Logical optimizer: push up project
[ https://issues.apache.org/jira/browse/PIG-922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-922:
Attachment: PIG-922-p3_6.patch

Address comments by Pradeep and Hudson.

Logical optimizer: push up project
Key: PIG-922
URL: https://issues.apache.org/jira/browse/PIG-922
Project: Pig
Issue Type: New Feature
Components: impl
Affects Versions: 0.3.0
Reporter: Daniel Dai
Assignee: Daniel Dai
Fix For: 0.6.0
Attachments: PIG-922-p1_0.patch, PIG-922-p1_1.patch, PIG-922-p1_2.patch, PIG-922-p1_3.patch, PIG-922-p1_4.patch, PIG-922-p2_preview.patch, PIG-922-p2_preview2.patch, PIG-922-p3_1.patch, PIG-922-p3_2.patch, PIG-922-p3_3.patch, PIG-922-p3_4.patch, PIG-922-p3_5.patch, PIG-922-p3_6.patch

This is a continuation of [PIG-697|https://issues.apache.org/jira/browse/PIG-697]. We need to add another rule to the logical optimizer: push up project, i.e., prune columns as early as possible.
[jira] Updated: (PIG-922) Logical optimizer: push up project
[ https://issues.apache.org/jira/browse/PIG-922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-922:
Status: Patch Available (was: Open)

Logical optimizer: push up project
Key: PIG-922
URL: https://issues.apache.org/jira/browse/PIG-922
Project: Pig
Issue Type: New Feature
Components: impl
Affects Versions: 0.3.0
Reporter: Daniel Dai
Assignee: Daniel Dai
Fix For: 0.6.0
Attachments: PIG-922-p1_0.patch, PIG-922-p1_1.patch, PIG-922-p1_2.patch, PIG-922-p1_3.patch, PIG-922-p1_4.patch, PIG-922-p2_preview.patch, PIG-922-p2_preview2.patch, PIG-922-p3_1.patch, PIG-922-p3_2.patch, PIG-922-p3_3.patch, PIG-922-p3_4.patch, PIG-922-p3_5.patch, PIG-922-p3_6.patch

This is a continuation of [PIG-697|https://issues.apache.org/jira/browse/PIG-697]. We need to add another rule to the logical optimizer: push up project, i.e., prune columns as early as possible.
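The idea behind "push up project" can be sketched outside of Pig with a toy load, filter, project pipeline in Python. Nothing here reflects the actual optimizer code in the PIG-922 patches; the function names and the dict-based rows are hypothetical and only illustrate that dropping unused columns at load time leaves the result unchanged while fewer fields travel through the pipeline.

```python
def required_columns(projected, filter_column):
    """Columns some later step actually reads: the projected ones plus the filter column."""
    return sorted(set(projected) | {filter_column})


def load(rows, keep=None):
    """Toy loader; when `keep` is given, all other columns are dropped immediately."""
    if keep is None:
        return [dict(row) for row in rows]
    return [{col: row[col] for col in keep} for row in rows]


def pipeline(rows, projected, filter_column, threshold, prune):
    """Load, filter on `filter_column` > threshold, then project `projected`."""
    keep = required_columns(projected, filter_column) if prune else None
    loaded = load(rows, keep)
    filtered = [row for row in loaded if row[filter_column] > threshold]
    return [tuple(row[col] for col in projected) for row in filtered]


if __name__ == "__main__":
    data = [
        {"a": "x", "b": 1, "c": 10, "d": "unused payload"},
        {"a": "y", "b": 5, "c": 20, "d": "unused payload"},
    ]
    # Same answer either way; with prune=True column "d" is never carried past the loader.
    print(pipeline(data, ["a", "c"], "b", 2, prune=False))
    print(pipeline(data, ["a", "c"], "b", 2, prune=True))
```

In this toy model the pruning decision is a simple set union; a real logical optimizer has to trace required fields through every operator in the plan, which is presumably where most of the work in the patches lies.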