[jira] Commented: (PIG-1053) Consider moving to Hadoop for local mode
[ https://issues.apache.org/jira/browse/PIG-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12770336#action_12770336 ] Raghu Angadi commented on PIG-1053: --- A big +1. It is understandable, from a Pig developer's point of view, to be annoyed by beginners complaining about run time with toy local inputs. Maybe a clear heads-up in the tutorial would reduce those complaints. > Consider moving to Hadoop for local mode > > > Key: PIG-1053 > URL: https://issues.apache.org/jira/browse/PIG-1053 > Project: Pig > Issue Type: Improvement >Reporter: Alan Gates > > We need to consider moving Pig to use Hadoop's local mode instead of its own. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-993) [zebra] Abitlity to drop a column group in a table
[ https://issues.apache.org/jira/browse/PIG-993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766841#action_12766841 ] Raghu Angadi commented on PIG-993: -- I think the test needs to be fixed. It deletes 6 column groups from 6 different threads. The spec explicitly states that read accesses and parallel deletions are expected to fail, but that the table is always left in a consistent state. The rationale is that in practice these tables are accessed from different machines, and it would add unnecessary complication to coordinate all the readers and writers (especially with no locking support on HDFS). Zebra tables have no state outside the directory. This applies to writing as well. One option I see is to have each thread make multiple attempts in case of errors. > [zebra] Abitlity to drop a column group in a table > -- > > Key: PIG-993 > URL: https://issues.apache.org/jira/browse/PIG-993 > Project: Pig > Issue Type: Bug > Reporter: Raghu Angadi > Assignee: Raghu Angadi > Fix For: 0.6.0 > > Attachments: DropColumnGroupExample.java, > TEST-org.apache.hadoop.zebra.io.TestCheckin.txt, zebra-drop-cg.patch, > zebra-drop-cg.patch, zebra-drop-cg.patch > > > A Zebra table is stored as multiple sub tables each containing a set of > columns called column group (CG). The user specifies how these columns are > grouped while creating a table through the _storage hint_. > For some of the large tables, it might be necessary for users to remove a set > of columns and retain the rest. This jira provides a way for users to delete > an entire column group. > The following comments will have more details on API and the semantics.
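The retry idea suggested in the comment above could look roughly like the following. This is only a minimal sketch, not the code in the PIG-993 patch or its tests: the class `RetrySketch`, the helper `withRetries`, and the choice to retry only on `IOException` are assumptions made for illustration.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical helper: retries an action that may fail because another
// thread is modifying the same table at the same time.
final class RetrySketch {

    static <T> T withRetries(int maxAttempts, Callable<T> action) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (IOException e) {
                // Assumption: a concurrent deletion surfaces as an IOException.
                // Remember it and try again; any other exception propagates.
                last = e;
            }
        }
        throw last; // every attempt failed; report the most recent error
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        String result = withRetries(3, () -> {
            if (++calls[0] < 3) {
                throw new IOException("simulated failure from a parallel deletion");
            }
            return "dropped";
        });
        System.out.println(result + " after " + calls[0] + " attempts");
        // prints "dropped after 3 attempts"
    }
}
```

In a test with several deleting threads, each thread would wrap its drop call in such a helper, so that a failure caused by a neighbouring thread is retried rather than reported as a test failure.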
[jira] Commented: (PIG-993) [zebra] Abitlity to drop a column group in a table
[ https://issues.apache.org/jira/browse/PIG-993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764552#action_12764552 ] Raghu Angadi commented on PIG-993: -- This patch depends on PIG-992. It is not a functional dependency and can be removed if required. > [zebra] Abitlity to drop a column group in a table > -- > > Key: PIG-993 > URL: https://issues.apache.org/jira/browse/PIG-993 > Project: Pig > Issue Type: Bug > Reporter: Raghu Angadi > Assignee: Raghu Angadi > Fix For: 0.6.0 > > Attachments: DropColumnGroupExample.java, zebra-drop-cg.patch, > zebra-drop-cg.patch > > > A Zebra table is stored as multiple sub tables each containing a set of > columns called column group (CG). The user specifies how these columns are > grouped while creating a table through the _storage hint_. > For some of the large tables, it might be necessary for users to remove a set > of columns and retain the rest. This jira provides a way for users to delete > an entire column group. > The following comments will have more details on API and the semantics. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-986) [zebra] Zebra Column Group Naming Support
[ https://issues.apache.org/jira/browse/PIG-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raghu Angadi updated PIG-986: - Resolution: Fixed Hadoop Flags: [Reviewed] Status: Resolved (was: Patch Available) I just committed this. Thanks Yan. > [zebra] Zebra Column Group Naming Support > - > > Key: PIG-986 > URL: https://issues.apache.org/jira/browse/PIG-986 > Project: Pig > Issue Type: New Feature > Components: impl >Affects Versions: 0.4.0 >Reporter: Chao Wang >Assignee: Chao Wang > Fix For: 0.6.0 > > Attachments: ColumnGroupName.patch, ColumnGroupName.patch, > ColumnGroupName.patch > > > We introduce column group name to Zebra and make it a first-class citizen in > Zebra. This can ease management of column groups. > We plan to introduce an "as" clause for column group name in Zebra's syntax. > Functional Specifications: > 1) Column group names are optional. For column groups which do not have a > user-provided name, Zebra will assign some default column group names > internally that is unique for that table - CG0, CG1, CG2 ... Note: If CGx is > used by user, then it can not be used for internal names. > 2) We introduce an "AS" clause in Zebra's syntax for column group names. If > it occurs, it has to immediately follow [ ]. For example, "[a1, a2] as PI > secure by user:joe group:secure perm:640; [a3, a4] as General compress by > lzo". Note that keyword "AS" is case insensitive. > 3) Column group names are unique within one table and are case sensitive, > i.e., c1 and C1 are different. > 4) Column group names will be used as the physical column group directory > path names. > 5) Zebra V2 will support dropColumnGroup by column group names (will > integrate with Raghu's A29 drop column work). > 6) Zebra V2 can support backward compatibility (If there are Zebra V1 created > tables in production when V2 is released). More specifically, this means that > Zebra V2 can load from V1-created tables and do dropColumnGroup on it. 
> 7) Does NOT support renaming. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
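Item 1 of the functional specification above (default names CG0, CG1, CG2 ... that must not collide with user-provided names) can be illustrated with a small sketch. This is not Zebra's implementation: the class `DefaultCgNamesSketch`, the method `assignNames`, and the strategy of skipping to the next free counter value are assumptions; the specification only says that a CGx name used by the user cannot be reused internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of default column group naming.
final class DefaultCgNamesSketch {

    // userNames holds one entry per column group; null means "no name given".
    static List<String> assignNames(List<String> userNames) {
        // Names are case sensitive, so a plain HashSet of strings is enough.
        Set<String> taken = new HashSet<>();
        for (String name : userNames) {
            if (name != null && !taken.add(name)) {
                throw new IllegalArgumentException("Duplicate column group name: " + name);
            }
        }
        List<String> result = new ArrayList<>();
        int next = 0;
        for (String name : userNames) {
            if (name != null) {
                result.add(name);
                continue;
            }
            // Assumed strategy: skip any CGx that the user already claimed.
            String candidate;
            do {
                candidate = "CG" + next++;
            } while (!taken.add(candidate));
            result.add(candidate);
        }
        return result;
    }

    public static void main(String[] args) {
        // The third group is explicitly named CG0 by the user.
        System.out.println(assignNames(Arrays.asList("PI", null, "CG0", null)));
        // prints [PI, CG1, CG0, CG2]
    }
}
```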
[jira] Updated: (PIG-986) [zebra] Zebra Column Group Naming Support
[ https://issues.apache.org/jira/browse/PIG-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raghu Angadi updated PIG-986: - Status: Patch Available (was: Open) > [zebra] Zebra Column Group Naming Support > - > > Key: PIG-986 > URL: https://issues.apache.org/jira/browse/PIG-986 > Project: Pig > Issue Type: New Feature > Components: impl >Affects Versions: 0.4.0 >Reporter: Chao Wang >Assignee: Chao Wang > Fix For: 0.6.0 > > Attachments: ColumnGroupName.patch, ColumnGroupName.patch, > ColumnGroupName.patch > > > We introduce column group name to Zebra and make it a first-class citizen in > Zebra. This can ease management of column groups. > We plan to introduce an "as" clause for column group name in Zebra's syntax. > Functional Specifications: > 1) Column group names are optional. For column groups which do not have a > user-provided name, Zebra will assign some default column group names > internally that is unique for that table - CG0, CG1, CG2 ... Note: If CGx is > used by user, then it can not be used for internal names. > 2) We introduce an "AS" clause in Zebra's syntax for column group names. If > it occurs, it has to immediately follow [ ]. For example, "[a1, a2] as PI > secure by user:joe group:secure perm:640; [a3, a4] as General compress by > lzo". Note that keyword "AS" is case insensitive. > 3) Column group names are unique within one table and are case sensitive, > i.e., c1 and C1 are different. > 4) Column group names will be used as the physical column group directory > path names. > 5) Zebra V2 will support dropColumnGroup by column group names (will > integrate with Raghu's A29 drop column work). > 6) Zebra V2 can support backward compatibility (If there are Zebra V1 created > tables in production when V2 is released). More specifically, this means that > Zebra V2 can load from V1-created tables and do dropColumnGroup on it. > 7) Does NOT support renaming. -- This message is automatically generated by JIRA. 
[jira] Updated: (PIG-986) [zebra] Zebra Column Group Naming Support
[ https://issues.apache.org/jira/browse/PIG-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raghu Angadi updated PIG-986: - Status: Open (was: Patch Available) > [zebra] Zebra Column Group Naming Support > - > > Key: PIG-986 > URL: https://issues.apache.org/jira/browse/PIG-986 > Project: Pig > Issue Type: New Feature > Components: impl >Affects Versions: 0.4.0 >Reporter: Chao Wang >Assignee: Chao Wang > Fix For: 0.6.0 > > Attachments: ColumnGroupName.patch, ColumnGroupName.patch, > ColumnGroupName.patch > > > We introduce column group name to Zebra and make it a first-class citizen in > Zebra. This can ease management of column groups. > We plan to introduce an "as" clause for column group name in Zebra's syntax. > Functional Specifications: > 1) Column group names are optional. For column groups which do not have a > user-provided name, Zebra will assign some default column group names > internally that is unique for that table - CG0, CG1, CG2 ... Note: If CGx is > used by user, then it can not be used for internal names. > 2) We introduce an "AS" clause in Zebra's syntax for column group names. If > it occurs, it has to immediately follow [ ]. For example, "[a1, a2] as PI > secure by user:joe group:secure perm:640; [a3, a4] as General compress by > lzo". Note that keyword "AS" is case insensitive. > 3) Column group names are unique within one table and are case sensitive, > i.e., c1 and C1 are different. > 4) Column group names will be used as the physical column group directory > path names. > 5) Zebra V2 will support dropColumnGroup by column group names (will > integrate with Raghu's A29 drop column work). > 6) Zebra V2 can support backward compatibility (If there are Zebra V1 created > tables in production when V2 is released). More specifically, this means that > Zebra V2 can load from V1-created tables and do dropColumnGroup on it. > 7) Does NOT support renaming. -- This message is automatically generated by JIRA. 
[jira] Updated: (PIG-991) [zebra] A few minor bugs as described in the Description section
[ https://issues.apache.org/jira/browse/PIG-991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raghu Angadi updated PIG-991: - Resolution: Fixed Status: Resolved (was: Patch Available) I just committed this. Thanks Yan. > [zebra] A few minor bugs as described in the Description section > > > Key: PIG-991 > URL: https://issues.apache.org/jira/browse/PIG-991 > Project: Pig > Issue Type: Bug >Affects Versions: 0.4.0 >Reporter: Yan Zhou >Assignee: Yan Zhou >Priority: Minor > Fix For: 0.6.0 > > Attachments: Bugs-2.patch, Bugs.patch > > > 1) "lzo2" was used as the compressor name for the LZO compression algorithm; > it should be "lzo" instead; > 2) the default compression is changed from "lzo" to "gz" for gzip; > 3) In JAVACC file SchemaParser.jjt, the package name was wrong using the old > "package org.apache.pig.table.types"; > 4) in build.xml, two new javacc targets are added to generate > TableSchemaParser and TableStorageParser java codes; > 5) Support of column group security ( > https://issues.apache.org/jira/browse/PIG-987 ) lacked support of the > dumpinfo method: the groups and permissions were not displayed. Note that as > a consequence, the patch herein must be applied after that of JIRA987. > 6) and 7) a couple of issues reported in Jira917. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-987) [zebra] Zebra Column Group Access Control
[ https://issues.apache.org/jira/browse/PIG-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raghu Angadi updated PIG-987: - Resolution: Fixed Fix Version/s: 0.6.0 Status: Resolved (was: Patch Available) I just committed this. Thanks Yan! > [zebra] Zebra Column Group Access Control > - > > Key: PIG-987 > URL: https://issues.apache.org/jira/browse/PIG-987 > Project: Pig > Issue Type: New Feature >Affects Versions: 0.6.0 >Reporter: Yan Zhou >Assignee: Yan Zhou > Fix For: 0.6.0 > > Attachments: ColumnGroupSecurity.patch, ColumnGroupSecurity.patch, > ColumnGroupSecurity.patch, TEST-org.apache.hadoop.zebra.io.TestCheckin.txt, > TEST-org.apache.hadoop.zebra.mapred.TestCheckin.txt, tmp-987-plus-991.patch > > > Access Control: when processes try to read from the column groups, Zebra > should be able to handle allowed vs. disallowed user/application accesses. > The security is eventuallt granted by corresponding HDFS security of the > data stored. > Expected behavior when column group permissions are set: > When user selects only columns that they do not have permissions to > access, Zebra should return error with message "Error #: Permission denied > for accessing column > Access control applies to an entire column group, so all columns in a column > group have same permissions. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-991) [zebra] A few minor bugs as described in the Description section
[ https://issues.apache.org/jira/browse/PIG-991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raghu Angadi updated PIG-991: - Attachment: Bugs-2.patch I am committing a slightly modified patch. I removed the following lines that modified build.xml at the top level. Please ask one of the PIG committers to commit that change. The part that is removed : {noformat} @@ -940,4 +942,13 @@ + + + + + {noformat} > [zebra] A few minor bugs as described in the Description section > > > Key: PIG-991 > URL: https://issues.apache.org/jira/browse/PIG-991 > Project: Pig > Issue Type: Bug >Affects Versions: 0.4.0 >Reporter: Yan Zhou >Assignee: Yan Zhou >Priority: Minor > Fix For: 0.6.0 > > Attachments: Bugs-2.patch, Bugs.patch > > > 1) "lzo2" was used as the compressor name for the LZO compression algorithm; > it should be "lzo" instead; > 2) the default compression is changed from "lzo" to "gz" for gzip; > 3) In JAVACC file SchemaParser.jjt, the package name was wrong using the old > "package org.apache.pig.table.types"; > 4) in build.xml, two new javacc targets are added to generate > TableSchemaParser and TableStorageParser java codes; > 5) Support of column group security ( > https://issues.apache.org/jira/browse/PIG-987 ) lacked support of the > dumpinfo method: the groups and permissions were not displayed. Note that as > a consequence, the patch herein must be applied after that of JIRA987. > 6) and 7) a couple of issues reported in Jira917. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-987) [zebra] Zebra Column Group Access Control
[ https://issues.apache.org/jira/browse/PIG-987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763836#action_12763836 ] Raghu Angadi commented on PIG-987: -- Thanks Yan. It might be better to remove gauravj also since it is ignored anyway. This implies column access control is not tested in this patch, right? > [zebra] Zebra Column Group Access Control > - > > Key: PIG-987 > URL: https://issues.apache.org/jira/browse/PIG-987 > Project: Pig > Issue Type: New Feature >Affects Versions: 0.6.0 >Reporter: Yan Zhou >Assignee: Yan Zhou > Attachments: ColumnGroupSecurity.patch, ColumnGroupSecurity.patch, > ColumnGroupSecurity.patch, TEST-org.apache.hadoop.zebra.io.TestCheckin.txt, > TEST-org.apache.hadoop.zebra.mapred.TestCheckin.txt, tmp-987-plus-991.patch > > > Access Control: when processes try to read from the column groups, Zebra > should be able to handle allowed vs. disallowed user/application accesses. > The security is eventuallt granted by corresponding HDFS security of the > data stored. > Expected behavior when column group permissions are set: > When user selects only columns that they do not have permissions to > access, Zebra should return error with message "Error #: Permission denied > for accessing column > Access control applies to an entire column group, so all columns in a column > group have same permissions. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-987) [zebra] Zebra Column Group Access Control
[ https://issues.apache.org/jira/browse/PIG-987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763516#action_12763516 ] Raghu Angadi commented on PIG-987: -- > Can you chgrp a local FS file to a group called "users" on your box? No. It's the same problem. I don't have a group called "users", and I don't think we can require others to have it. I didn't know the owner is ignored. Is it still allowed by the storage hint? > [zebra] Zebra Column Group Access Control > - > > Key: PIG-987 > URL: https://issues.apache.org/jira/browse/PIG-987 > Project: Pig > Issue Type: New Feature >Affects Versions: 0.6.0 >Reporter: Yan Zhou >Assignee: Yan Zhou > Attachments: ColumnGroupSecurity.patch, ColumnGroupSecurity.patch, > TEST-org.apache.hadoop.zebra.io.TestCheckin.txt, > TEST-org.apache.hadoop.zebra.mapred.TestCheckin.txt, tmp-987-plus-991.patch > > > Access Control: when processes try to read from the column groups, Zebra > should be able to handle allowed vs. disallowed user/application accesses. > The security is eventuallt granted by corresponding HDFS security of the > data stored. > Expected behavior when column group permissions are set: > When user selects only columns that they do not have permissions to > access, Zebra should return error with message "Error #: Permission denied > for accessing column > Access control applies to an entire column group, so all columns in a column > group have same permissions.
[jira] Commented: (PIG-987) [zebra] Zebra Column Group Access Control
[ https://issues.apache.org/jira/browse/PIG-987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763346#action_12763346 ]

Raghu Angadi commented on PIG-987:

I finally got some time to look into this. Yes, I think it should be fixed in the tests. TestColumnGroup.java says:
{noformat}
ColumnGroup.Writer writer = new ColumnGroup.Writer(path, strSchema, sorted, "pig", "gz", "gauravj", "users", (short) Short.parseShort("755", 8), false, conf);
{noformat}
using the local FS. How can we expect users to have a user named "gauravj" on their machines and to run as superusers :)? It just cannot be done. If the test wants to run with these permissions, we should:
a) use HDFS (MiniDFSCluster) rather than the local filesystem. The tester has all the permissions on a MiniDFS.
b) minor: use a more generic name than gauravj.

> [zebra] Zebra Column Group Access Control
>
> Key: PIG-987
> URL: https://issues.apache.org/jira/browse/PIG-987
> Project: Pig
> Issue Type: New Feature
> Affects Versions: 0.6.0
> Reporter: Yan Zhou
> Assignee: Yan Zhou
> Attachments: ColumnGroupSecurity.patch, ColumnGroupSecurity.patch,
> TEST-org.apache.hadoop.zebra.io.TestCheckin.txt,
> TEST-org.apache.hadoop.zebra.mapred.TestCheckin.txt, tmp-987-plus-991.patch
>
> Access Control: when processes try to read from the column groups, Zebra
> should be able to handle allowed vs. disallowed user/application accesses.
> The security is eventuallt granted by corresponding HDFS security of the
> data stored.
> Expected behavior when column group permissions are set:
> When user selects only columns that they do not have permissions to
> access, Zebra should return error with message "Error #: Permission denied
> for accessing column
> Access control applies to an entire column group, so all columns in a column
> group have same permissions.
[jira] Updated: (PIG-987) [zebra] Zebra Column Group Access Control
[ https://issues.apache.org/jira/browse/PIG-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raghu Angadi updated PIG-987: - Attachment: tmp-987-plus-991.patch TEST-org.apache.hadoop.zebra.io.TestCheckin.txt Attachments : # tmp-987-plus-991.patch : latest patch here + patch for PIG-991 # TEST-org.apache.hadoop.zebra.io.TestCheckin.txt : output of the failed tests. Yan, looks like lzo related errors are fixed with the combined patch. But there are still some failures. I think some of these failures exist on trunk as well. > [zebra] Zebra Column Group Access Control > - > > Key: PIG-987 > URL: https://issues.apache.org/jira/browse/PIG-987 > Project: Pig > Issue Type: New Feature >Affects Versions: 0.6.0 >Reporter: Yan Zhou >Assignee: Yan Zhou > Attachments: ColumnGroupSecurity.patch, ColumnGroupSecurity.patch, > TEST-org.apache.hadoop.zebra.io.TestCheckin.txt, > TEST-org.apache.hadoop.zebra.mapred.TestCheckin.txt, tmp-987-plus-991.patch > > > Access Control: when processes try to read from the column groups, Zebra > should be able to handle allowed vs. disallowed user/application accesses. > The security is eventuallt granted by corresponding HDFS security of the > data stored. > Expected behavior when column group permissions are set: > When user selects only columns that they do not have permissions to > access, Zebra should return error with message "Error #: Permission denied > for accessing column > Access control applies to an entire column group, so all columns in a column > group have same permissions. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-987) [zebra] Zebra Column Group Access Control
[ https://issues.apache.org/jira/browse/PIG-987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762871#action_12762871 ] Raghu Angadi commented on PIG-987: -- Even with PIG-991 included, I am seeing lzo related failures. Could you run tests on a clean checkout? If you didn't see the errors before then you probably have lzo set up in your environment, which is not a requirement. > [zebra] Zebra Column Group Access Control > - > > Key: PIG-987 > URL: https://issues.apache.org/jira/browse/PIG-987 > Project: Pig > Issue Type: New Feature >Affects Versions: 0.6.0 >Reporter: Yan Zhou >Assignee: Yan Zhou > Attachments: ColumnGroupSecurity.patch, > TEST-org.apache.hadoop.zebra.mapred.TestCheckin.txt > > > Access Control: when processes try to read from the column groups, Zebra > should be able to handle allowed vs. disallowed user/application accesses. > The security is eventuallt granted by corresponding HDFS security of the > data stored. > Expected behavior when column group permissions are set: > When user selects only columns that they do not have permissions to > access, Zebra should return error with message "Error #: Permission denied > for accessing column > Access control applies to an entire column group, so all columns in a column > group have same permissions. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-993) [zebra] Abitlity to drop a column group in a table
[ https://issues.apache.org/jira/browse/PIG-993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raghu Angadi updated PIG-993: - Fix Version/s: 0.6.0 > [zebra] Abitlity to drop a column group in a table > -- > > Key: PIG-993 > URL: https://issues.apache.org/jira/browse/PIG-993 > Project: Pig > Issue Type: Bug > Reporter: Raghu Angadi > Assignee: Raghu Angadi > Fix For: 0.6.0 > > Attachments: DropColumnGroupExample.java, zebra-drop-cg.patch, > zebra-drop-cg.patch > > > A Zebra table is stored as multiple sub tables each containing a set of > columns called column group (CG). The user specifies how these columns are > grouped while creating a table through the _storage hint_. > For some of the large tables, it might be necessary for users to remove a set > of columns and retain the rest. This jira provides a way for users to delete > an entire column group. > The following comments will have more details on API and the semantics. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-987) [zebra] Zebra Column Group Access Control
[ https://issues.apache.org/jira/browse/PIG-987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762829#action_12762829 ] Raghu Angadi commented on PIG-987: -- Not sure if this is related to PIG. When I applied PIG-991 over this, the tests passed (except the ones that fail on trunk). > [zebra] Zebra Column Group Access Control > - > > Key: PIG-987 > URL: https://issues.apache.org/jira/browse/PIG-987 > Project: Pig > Issue Type: New Feature >Affects Versions: 0.6.0 >Reporter: Yan Zhou >Assignee: Yan Zhou > Attachments: ColumnGroupSecurity.patch, > TEST-org.apache.hadoop.zebra.mapred.TestCheckin.txt > > > Access Control: when processes try to read from the column groups, Zebra > should be able to handle allowed vs. disallowed user/application accesses. > The security is eventuallt granted by corresponding HDFS security of the > data stored. > Expected behavior when column group permissions are set: > When user selects only columns that they do not have permissions to > access, Zebra should return error with message "Error #: Permission denied > for accessing column > Access control applies to an entire column group, so all columns in a column > group have same permissions. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-987) [zebra] Zebra Column Group Access Control
[ https://issues.apache.org/jira/browse/PIG-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raghu Angadi updated PIG-987: - Attachment: TEST-org.apache.hadoop.zebra.mapred.TestCheckin.txt I am attaching {{mapred.TestCheckin.txt}} that passes without the patch. By the way, not all tests pass even without the patch. What environment is required? I did a fresh checkout and ran 'ant test'. I guess the test failures on trunk are related to lzo, but I didn't expect more failures with the patch. It looks like PIG-991 removes the lzo dependency; I will try with that patch included. > [zebra] Zebra Column Group Access Control > - > > Key: PIG-987 > URL: https://issues.apache.org/jira/browse/PIG-987 > Project: Pig > Issue Type: New Feature >Affects Versions: 0.6.0 >Reporter: Yan Zhou >Assignee: Yan Zhou > Attachments: ColumnGroupSecurity.patch, > TEST-org.apache.hadoop.zebra.mapred.TestCheckin.txt > > > Access Control: when processes try to read from the column groups, Zebra > should be able to handle allowed vs. disallowed user/application accesses. > The security is eventuallt granted by corresponding HDFS security of the > data stored. > Expected behavior when column group permissions are set: > When user selects only columns that they do not have permissions to > access, Zebra should return error with message "Error #: Permission denied > for accessing column > Access control applies to an entire column group, so all columns in a column > group have same permissions.
[jira] Updated: (PIG-991) [zebra] A few minor bugs as described in the Description section
[ https://issues.apache.org/jira/browse/PIG-991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raghu Angadi updated PIG-991: - Release Note: (was: Patch should be applied after that of Jira987.) bq. Patch should be applied after that of Jira987. [moved above comment from 'Release Notes' to this comment]. > [zebra] A few minor bugs as described in the Description section > > > Key: PIG-991 > URL: https://issues.apache.org/jira/browse/PIG-991 > Project: Pig > Issue Type: Bug >Affects Versions: 0.4.0 >Reporter: Yan Zhou >Assignee: Yan Zhou >Priority: Minor > Fix For: 0.6.0 > > Attachments: Bugs.patch > > > 1) "lzo2" was used as the compressor name for the LZO compression algorithm; > it should be "lzo" instead; > 2) the default compression is changed from "lzo" to "gz" for gzip; > 3) In JAVACC file SchemaParser.jjt, the package name was wrong using the old > "package org.apache.pig.table.types"; > 4) in build.xml, two new javacc targets are added to generate > TableSchemaParser and TableStorageParser java codes; > 5) Support of column group security ( > https://issues.apache.org/jira/browse/PIG-987 ) lacked support of the > dumpinfo method: the groups and permissions were not displayed. Note that as > a consequence, the patch herein must be applied after that of JIRA987. > 6) and 7) a couple of issues reported in Jira917. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-987) [zebra] Zebra Column Group Access Control
[ https://issues.apache.org/jira/browse/PIG-987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762812#action_12762812 ] Raghu Angadi commented on PIG-987: -- I tried to commit this patch. 'ant test' says all the tests fail, whereas only one or two tests fail without the patch. Does Hudson actually run Zebra tests? > [zebra] Zebra Column Group Access Control > - > > Key: PIG-987 > URL: https://issues.apache.org/jira/browse/PIG-987 > Project: Pig > Issue Type: New Feature >Affects Versions: 0.6.0 >Reporter: Yan Zhou >Assignee: Yan Zhou > Attachments: ColumnGroupSecurity.patch > > > Access Control: when processes try to read from the column groups, Zebra > should be able to handle allowed vs. disallowed user/application accesses. > The security is eventuallt granted by corresponding HDFS security of the > data stored. > Expected behavior when column group permissions are set: > When user selects only columns that they do not have permissions to > access, Zebra should return error with message "Error #: Permission denied > for accessing column > Access control applies to an entire column group, so all columns in a column > group have same permissions.
[jira] Commented: (PIG-993) [zebra] Abitlity to drop a column group in a table
[ https://issues.apache.org/jira/browse/PIG-993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761769#action_12761769 ] Raghu Angadi commented on PIG-993: -- > zebra-drop-cg.patch : This patch would apply only after a patch for PIG-896. I meant to say PIG-986. > [zebra] Abitlity to drop a column group in a table > -- > > Key: PIG-993 > URL: https://issues.apache.org/jira/browse/PIG-993 > Project: Pig > Issue Type: Bug > Reporter: Raghu Angadi >Assignee: Raghu Angadi > Fix For: 0.5.0 > > Attachments: DropColumnGroupExample.java, zebra-drop-cg.patch > > > A Zebra table is stored as multiple sub tables each containing a set of > columns called column group (CG). The user specifies how these columns are > grouped while creating a table through the _storage hint_. > For some of the large tables, it might be necessary for users to remove a set > of columns and retain the rest. This jira provides a way for users to delete > an entire column group. > The following comments will have more details on API and the semantics.
[jira] Updated: (PIG-993) [zebra] Abitlity to drop a column group in a table
[ https://issues.apache.org/jira/browse/PIG-993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raghu Angadi updated PIG-993: - Attachment: zebra-drop-cg.patch DropColumnGroupExample.java Attachments: DropColumnGroupExample.java : a simple example to illustrate the functionality. zebra-drop-cg.patch : This patch would apply only after a patch for PIG-896. Some of the tests included there were written by Jing Huang. Jing also helped with testing the patch on real clusters with various errors. Yan Zhou helped with correctly handling missing column groups. > [zebra] Abitlity to drop a column group in a table > -- > > Key: PIG-993 > URL: https://issues.apache.org/jira/browse/PIG-993 > Project: Pig > Issue Type: Bug > Reporter: Raghu Angadi > Assignee: Raghu Angadi > Fix For: 0.5.0 > > Attachments: DropColumnGroupExample.java, zebra-drop-cg.patch > > > A Zebra table is stored as multiple sub tables each containing a set of > columns called column group (CG). The user specifies how these columns are > grouped while creating a table through the _storage hint_. > For some of the large tables, it might be necessary for users to remove a set > of columns and retain the rest. This jira provides a way for users to delete > an entire column group. > The following comments will have more details on API and the semantics. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-993) [zebra] Abitlity to drop a column group in a table
[ https://issues.apache.org/jira/browse/PIG-993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761767#action_12761767 ] Raghu Angadi commented on PIG-993: -- Deletion procedure : # Check if a column group with the given name exists and throw an error if there is no such group. # If the column group is already deleted return normally. ** If a column group is already marked deleted and the corresponding physical directory still exists, try to remove the column group data again. An earlier attempt might not have removed the directory. # Create an empty file ".deleted-CGNAME" in the top level directory. # If the creation fails, check if the file already exists. This can happen when two users concurrently try to delete the same column group. If CG is marked deleted after this, return success. Exception is thrown for any other error. # Delete the column group directory. # An exception is thrown if deletion fails. Note that the column group is already marked deleted even though the deletion of a directory failed. A subsequent deletion of such a column group will again try to delete the directory. > [zebra] Abitlity to drop a column group in a table > -- > > Key: PIG-993 > URL: https://issues.apache.org/jira/browse/PIG-993 > Project: Pig > Issue Type: Bug >Reporter: Raghu Angadi >Assignee: Raghu Angadi > Fix For: 0.5.0 > > > A Zebra table is stored as multiple sub tables each containing a set of > columns called column group (CG). The user specifies how these columns are > grouped while creating a table through the _storage hint_. > For some of the large tables, it might be necessary for users to remove a set > of columns and retain the rest. This jira provides a way for users to delete > an entire column group. > The following comments will have more details on API and the semantics. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
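The deletion procedure in the comment above can be sketched roughly as follows. This is only an illustration on a local filesystem with java.nio.file, not the code in zebra-drop-cg.patch: the class name DropCgSketch, the knownCgs parameter (standing in for the table schema) and the helper deleteRecursively are all assumptions made for the example. Only the ".deleted-CGNAME" marker name and the order of the steps come from the comment.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical local-filesystem sketch of the marker-file steps; not the Zebra implementation. */
final class DropCgSketch {

    /** Marker file ".deleted-CGNAME" in the top level table directory. */
    static Path marker(Path tableDir, String cgName) {
        return tableDir.resolve(".deleted-" + cgName);
    }

    /** knownCgs is an assumption: it stands in for the CG names recorded in the table schema. */
    static void dropColumnGroup(Path tableDir, Set<String> knownCgs, String cgName)
            throws IOException {
        // Step 1: an unknown column group is an error.
        if (!knownCgs.contains(cgName)) {
            throw new IOException("No such column group: " + cgName);
        }
        Path cgDir = tableDir.resolve(cgName);
        Path mark = marker(tableDir, cgName);

        // Step 2: already marked deleted. Retry the directory removal in case
        // an earlier attempt left it behind, then return normally.
        if (Files.exists(mark)) {
            deleteRecursively(cgDir);
            return;
        }

        // Steps 3 and 4: create the empty marker. If another deleter created it
        // first, the CG is marked deleted and we return success; any other
        // IOException propagates to the caller.
        try {
            Files.createFile(mark);
        } catch (FileAlreadyExistsException e) {
            return;
        }

        // Steps 5 and 6: delete the data. A failure here throws, but the marker
        // stays, so a later call takes the step 2 path and tries again.
        deleteRecursively(cgDir);
    }

    /** Removes a directory tree, children before parents; a missing directory is a no-op. */
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }
}
```

Writing the marker before touching the data is what keeps the table consistent: a reader that sees the marker can treat the CG as gone even if its directory is only partly removed.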
[jira] Commented: (PIG-993) [zebra] Abitlity to drop a column group in a table
[ https://issues.apache.org/jira/browse/PIG-993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761766#action_12761766 ] Raghu Angadi commented on PIG-993: -- API is pretty simple : {code} class org.apache.hadoop.zebra.BasicTable { /** see the patch for JavaDoc and attached example for usage */ public static void dropColumnGroup(Path path, Configuration conf, String cgName) throws IOException { ... } } {code} * Table schema is not modified. * This API takes a name for a column group. PIG-986 adds explicit names for CGs. * Once a CG is deleted, NULL is returned for the fields that were stored in the CG. ** This is the main difference between just manually deleting a directory on filesystem and 'properly' deleting a CG. ** Many changes made in other parts of zebra are related to handling the missing CGs. > [zebra] Abitlity to drop a column group in a table > -- > > Key: PIG-993 > URL: https://issues.apache.org/jira/browse/PIG-993 > Project: Pig > Issue Type: Bug >Reporter: Raghu Angadi >Assignee: Raghu Angadi > Fix For: 0.5.0 > > > A Zebra table is stored as multiple sub tables each containing a set of > columns called column group (CG). The user specifies how these columns are > grouped while creating a table through the _storage hint_. > For some of the large tables, it might be necessary for users to remove a set > of columns and retain the rest. This jira provides a way for users to delete > an entire column group. > The following comments will have more details on API and the semantics. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-993) [zebra] Abitlity to drop a column group in a table
[zebra] Abitlity to drop a column group in a table -- Key: PIG-993 URL: https://issues.apache.org/jira/browse/PIG-993 Project: Pig Issue Type: Bug Reporter: Raghu Angadi Assignee: Raghu Angadi Fix For: 0.5.0 A Zebra table is stored as multiple sub tables each containing a set of columns called column group (CG). The user specifies how these columns are grouped while creating a table through the _storage hint_. For some of the large tables, it might be necessary for users to remove a set of columns and retain the rest. This jira provides a way for users to delete an entire column group. The following comments will have more details on API and the semantics. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-985) [zebra] Make necessary changes to build scripts to accommodate new zebra features plus other improvement.
[ https://issues.apache.org/jira/browse/PIG-985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761045#action_12761045 ] Raghu Angadi commented on PIG-985: -- > 5) drop column group change (Raghu Angadi) > 6) schema package separation change (Yan Zhou) Just to clarify, this patch does not contain the above two features. It only contains a couple of minor changes made in build.xml as part of these changes. Separate jiras will be filed for these two and other features soon. > [zebra] Make necessary changes to build scripts to accommodate new zebra > features plus other improvement. > - > > Key: PIG-985 > URL: https://issues.apache.org/jira/browse/PIG-985 > Project: Pig > Issue Type: Task > Components: build >Reporter: Chao Wang >Assignee: Chao Wang > Attachments: patch > > > The whole task consists of a series of steps as follows: > 1) nightly test change - prevent checkin tests from running twice in nightly > (Chao Wang) > 2) row based block splits for tables change (Raghu Angadi) > 3) add clover target (Jing Huang) > 4) add findbugs target (Chao Wang) > 5) drop column group change (Raghu Angadi) > 6) schema package separation change (Yan Zhou) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-949) Zebra Bug: splitting map into multiple column group using storage hint causes unexpected behaviour
[ https://issues.apache.org/jira/browse/PIG-949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759789#action_12759789 ] Raghu Angadi commented on PIG-949: -- I just committed this. Thanks Yan for the fix and Jing for the test! > Zebra Bug: splitting map into multiple column group using storage hint causes > unexpected behaviour > -- > > Key: PIG-949 > URL: https://issues.apache.org/jira/browse/PIG-949 > Project: Pig > Issue Type: Bug >Affects Versions: 0.4.0 > Environment: linux >Reporter: Alok Singh >Assignee: Yan Zhou > Fix For: 0.5.0 > > Attachments: Pig_949.patch, Pig_949.patch, Pig_949.patch > > > Hi > The storage hint > specification plays an important part in whether the output table is readable or > not > say if we have the map 'map'. > One can split the map into a column group using [map#{k1}, map#{k2}...] > however the remaining map field will automatically be added to the default > group. > if a user tries to create a new column group for the remaining fields as follows > [map#{k1}, map#{k2}, ..][map] i.e. create a separate column group > the table writer will create the table. 
> however, if one tries to load the created table via pig or via map reduce > using TableInputFormat > > then the reader have problem reading the map > We get the following stack trace > 09/09/09 00:09:45 INFO mapred.JobClient: Task Id : > attempt_200908191538_33939_m_21_2, Status : FAILED > java.io.IOException: getValue() failed: null > at > org.apache.hadoop.zebra.io.BasicTable$Reader$BTScanner.getValue(BasicTable.java:775) > at > org.apache.hadoop.zebra.mapred.TableRecordReader.next(TableInputFormat.java:717) > at > org.apache.hadoop.zebra.mapred.TableRecordReader.next(TableInputFormat.java:651) > at > org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:191) > at > org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:175) > at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48) > at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:356) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) > at org.apache.hadoop.mapred.Child.main(Child.java:170) > Alok -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-949) Zebra Bug: splitting map into multiple column group using storage hint causes unexpected behaviour
[ https://issues.apache.org/jira/browse/PIG-949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raghu Angadi updated PIG-949: - Resolution: Fixed Fix Version/s: (was: 0.4.0) Status: Resolved (was: Patch Available) > Zebra Bug: splitting map into multiple column group using storage hint causes > unexpected behaviour > -- > > Key: PIG-949 > URL: https://issues.apache.org/jira/browse/PIG-949 > Project: Pig > Issue Type: Bug >Affects Versions: 0.4.0 > Environment: linux >Reporter: Alok Singh >Assignee: Yan Zhou > Fix For: 0.5.0 > > Attachments: Pig_949.patch, Pig_949.patch, Pig_949.patch > > > Hi > The storage hint > specification plays an important part in whether the output table is readable or > not > say if we have the map 'map'. > One can split the map into a column group using [map#{k1}, map#{k2}...] > however the remaining map field will automatically be added to the default > group. > if a user tries to create a new column group for the remaining fields as follows > [map#{k1}, map#{k2}, ..][map] i.e. create a separate column group > the table writer will create the table. 
> however, if one tries to load the created table via pig or via map reduce > using TableInputFormat > > then the reader have problem reading the map > We get the following stack trace > 09/09/09 00:09:45 INFO mapred.JobClient: Task Id : > attempt_200908191538_33939_m_21_2, Status : FAILED > java.io.IOException: getValue() failed: null > at > org.apache.hadoop.zebra.io.BasicTable$Reader$BTScanner.getValue(BasicTable.java:775) > at > org.apache.hadoop.zebra.mapred.TableRecordReader.next(TableInputFormat.java:717) > at > org.apache.hadoop.zebra.mapred.TableRecordReader.next(TableInputFormat.java:651) > at > org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:191) > at > org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:175) > at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48) > at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:356) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) > at org.apache.hadoop.mapred.Child.main(Child.java:170) > Alok -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: [VOTE] Release Pig 0.4.0 (candidate 2)
+1. ran 'ant test-core'. contrib/zebra: 'ant test' passed after following directions as suggested : got a patch from PIG-660, and hadoop20.jar from PIG-833. For clarity we might attach patch suitable for PIG-660 for 0.4. Raghu. Olga Natkovich wrote: Hi, The new version is available in http://people.apache.org/~olga/pig-0.4.0-candidate-2/. I see one failure in a unit test in piggybank (contrib.) but it is not related to the functions themselves but seems to be an issue with MiniCluster and I don't feel we need to chase this down. I made sure that the same test runs ok with Hadoop 20. Please, vote by end of day on Thursday, 9/24. Olga -Original Message- From: Olga Natkovich [mailto:ol...@yahoo-inc.com] Sent: Thursday, September 17, 2009 12:09 PM To: pig-dev@hadoop.apache.org; priv...@hadoop.apache.org Subject: [VOTE] Release Pig 0.4.0 (candidate 1) Hi, I have fixed the issue causing the failure that Alan reported. Please test the new release: http://people.apache.org/~olga/pig-0.4.0-candidate-1/. Vote closes on Tuesday, 9/22. Olga -Original Message- From: Olga Natkovich [mailto:ol...@yahoo-inc.com] Sent: Monday, September 14, 2009 2:06 PM To: pig-dev@hadoop.apache.org; priv...@hadoop.apache.org Subject: [VOTE] Release Pig 0.4.0 (candidate 0) Hi, I created a candidate build for Pig 0.4.0 release. The highlights of this release are - Performance improvements especially in the area of JOIN support where we introduced two new join types: skew join to deal with data skew and sort merge join to take advantage of the sorted data sets. - Support for Outer join. - Works with Hadoop 18 I ran the release audit and rat report looked fine. The relevant part is attached below. Keys used to sign the release are available at http://svn.apache.org/viewvc/hadoop/pig/trunk/KEYS?view=markup. Please download the release and try it out: http://people.apache.org/~olga/pig-0.4.0-candidate-0. Should we release this? Vote closes on Thursday, 9/17. Olga [java] !? 
/home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/contrib/CHANGES.txt
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/contrib/zebra/CHANGES.txt
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/broken-links.xml
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/cookbook.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/index.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/linkmap.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/piglatin_reference.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/piglatin_users.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/setup.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/tutorial.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/udf.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/api/package-list
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/missingSinces.txt
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/user_comments_for_pig_0.3.1_to_pig_0.5.0-dev.xml
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/alldiffs_index_additions.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/alldiffs_index_all.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/alldiffs_index_changes.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/alldiffs_index_removals.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/changes-summary.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/classes_index_additions.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/classes_index_all.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/classes_index_changes.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/classes_index_removals.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/constructors_index_additions.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/constructors_index_all.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/constructors_index_changes.html
[java] !?
[jira] Updated: (PIG-949) Zebra Bug: splitting map into multiple column group using storage hint causes unexpected behaviour
[ https://issues.apache.org/jira/browse/PIG-949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raghu Angadi updated PIG-949: - Fix Version/s: 0.5.0 0.4.0 Status: Patch Available (was: Open) > Zebra Bug: splitting map into multiple column group using storage hint causes > unexpected behaviour > -- > > Key: PIG-949 > URL: https://issues.apache.org/jira/browse/PIG-949 > Project: Pig > Issue Type: Bug >Affects Versions: 0.4.0 > Environment: linux >Reporter: Alok Singh >Assignee: Yan Zhou > Fix For: 0.4.0, 0.5.0 > > Attachments: Pig_949.patch, Pig_949.patch, Pig_949.patch > > > Hi > The storage hint > specification plays an important part in whether the output table is readable or > not > say if we have the map 'map'. > One can split the map into a column group using [map#{k1}, map#{k2}...] > however the remaining map field will automatically be added to the default > group. > if a user tries to create a new column group for the remaining fields as follows > [map#{k1}, map#{k2}, ..][map] i.e. create a separate column group > the table writer will create the table. 
> however, if one tries to load the created table via pig or via map reduce > using TableInputFormat > > then the reader have problem reading the map > We get the following stack trace > 09/09/09 00:09:45 INFO mapred.JobClient: Task Id : > attempt_200908191538_33939_m_21_2, Status : FAILED > java.io.IOException: getValue() failed: null > at > org.apache.hadoop.zebra.io.BasicTable$Reader$BTScanner.getValue(BasicTable.java:775) > at > org.apache.hadoop.zebra.mapred.TableRecordReader.next(TableInputFormat.java:717) > at > org.apache.hadoop.zebra.mapred.TableRecordReader.next(TableInputFormat.java:651) > at > org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:191) > at > org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:175) > at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48) > at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:356) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) > at org.apache.hadoop.mapred.Child.main(Child.java:170) > Alok -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-949) Zebra Bug: splitting map into multiple column group using storage hint causes unexpected behaviour
[ https://issues.apache.org/jira/browse/PIG-949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raghu Angadi updated PIG-949: - Status: Open (was: Patch Available) > Zebra Bug: splitting map into multiple column group using storage hint causes > unexpected behaviour > -- > > Key: PIG-949 > URL: https://issues.apache.org/jira/browse/PIG-949 > Project: Pig > Issue Type: Bug >Affects Versions: 0.4.0 > Environment: linux >Reporter: Alok Singh >Assignee: Yan Zhou > Attachments: Pig_949.patch, Pig_949.patch, Pig_949.patch > > > Hi > The storage hint > specification plays an important part in whether the output table is readable or > not > say if we have the map 'map'. > One can split the map into a column group using [map#{k1}, map#{k2}...] > however the remaining map field will automatically be added to the default > group. > if a user tries to create a new column group for the remaining fields as follows > [map#{k1}, map#{k2}, ..][map] i.e. create a separate column group > the table writer will create the table. 
> however, if one tries to load the created table via pig or via map reduce > using TableInputFormat > > then the reader have problem reading the map > We get the following stack trace > 09/09/09 00:09:45 INFO mapred.JobClient: Task Id : > attempt_200908191538_33939_m_21_2, Status : FAILED > java.io.IOException: getValue() failed: null > at > org.apache.hadoop.zebra.io.BasicTable$Reader$BTScanner.getValue(BasicTable.java:775) > at > org.apache.hadoop.zebra.mapred.TableRecordReader.next(TableInputFormat.java:717) > at > org.apache.hadoop.zebra.mapred.TableRecordReader.next(TableInputFormat.java:651) > at > org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:191) > at > org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:175) > at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48) > at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:356) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) > at org.apache.hadoop.mapred.Child.main(Child.java:170) > Alok -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-949) Zebra Bug: splitting map into multiple column group using storage hint causes unexpected behaviour
[ https://issues.apache.org/jira/browse/PIG-949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12758328#action_12758328 ] Raghu Angadi commented on PIG-949: -- Yan, please include the test case in the patch. Also I would suggest a regular name for the test case file, something like 'TestMapAcrossMultipleCGs.java' or something shorter. Inside the file you could mention the JIRA number in a comment. Raghu. > Zebra Bug: splitting map into multiple column group using storage hint causes > unexpected behaviour > -- > > Key: PIG-949 > URL: https://issues.apache.org/jira/browse/PIG-949 > Project: Pig > Issue Type: Bug >Affects Versions: 0.4.0 > Environment: linux >Reporter: Alok Singh >Assignee: Yan Zhou > Attachments: Pig_949.patch > > > Hi > The storage hint > specification plays an important part in whether the output table is readable or > not > say if we have the map 'map'. > One can split the map into a column group using [map#{k1}, map#{k2}...] > however the remaining map field will automatically be added to the default > group. > if a user tries to create a new column group for the remaining fields as follows > [map#{k1}, map#{k2}, ..][map] i.e. create a separate column group > the table writer will create the table. 
> however, if one tries to load the created table via pig or via map reduce > using TableInputFormat > > then the reader have problem reading the map > We get the following stack trace > 09/09/09 00:09:45 INFO mapred.JobClient: Task Id : > attempt_200908191538_33939_m_21_2, Status : FAILED > java.io.IOException: getValue() failed: null > at > org.apache.hadoop.zebra.io.BasicTable$Reader$BTScanner.getValue(BasicTable.java:775) > at > org.apache.hadoop.zebra.mapred.TableRecordReader.next(TableInputFormat.java:717) > at > org.apache.hadoop.zebra.mapred.TableRecordReader.next(TableInputFormat.java:651) > at > org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:191) > at > org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:175) > at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48) > at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:356) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) > at org.apache.hadoop.mapred.Child.main(Child.java:170) > Alok -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-918) [zebra] LOAD call will hang if only the first column group is queried
[ https://issues.apache.org/jira/browse/PIG-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raghu Angadi reassigned PIG-918: Assignee: Yan Zhou > [zebra] LOAD call will hang if only the first column group is queried > - > > Key: PIG-918 > URL: https://issues.apache.org/jira/browse/PIG-918 > Project: Pig > Issue Type: Bug >Affects Versions: 0.4.0 >Reporter: Yan Zhou >Assignee: Yan Zhou > Fix For: 0.4.0 > > Attachments: pig-zebra.patch, pig-zebra.patch > > > Zebra's LOAD call with projections that only include column(s) in the first > column group will hang because an improper range of random numbers for index > to the array of column groups always skips the first element so that if all > other column groups are not used, the looping keeps running without a chance > to break. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
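The kind of hang described in PIG-918 can be reproduced in miniature. The sketch below is not the Zebra code, which is not shown in this thread; the class PickColumnGroupSketch, its methods and the maxTries cap are hypothetical and exist only to show how a random index range that never yields 0 makes the search loop spin when only the first column group is used.

```java
import java.util.Random;

/** Hypothetical illustration of the PIG-918 symptom; names and structure are assumptions. */
final class PickColumnGroupSketch {

    /** Improper range: 1 + nextInt(n - 1) yields 1..n-1, so index 0 is never chosen (needs n >= 2). */
    static int buggyIndex(Random rng, int n) {
        return 1 + rng.nextInt(n - 1);
    }

    /** Proper range: nextInt(n) yields 0..n-1. */
    static int fixedIndex(Random rng, int n) {
        return rng.nextInt(n);
    }

    /**
     * Looks for a random column group that is in use. The real loop had no way out;
     * maxTries is added here only so that the sketch terminates. Returns -1 on give-up.
     */
    static int pickUsed(boolean[] used, Random rng, boolean buggy, int maxTries) {
        for (int i = 0; i < maxTries; i++) {
            int idx = buggy ? buggyIndex(rng, used.length) : fixedIndex(rng, used.length);
            if (used[idx]) {
                return idx;
            }
        }
        return -1;
    }
}
```

With used = {true, false, false}, the buggy variant can never land on index 0 and runs until the cap, while the fixed variant finds it after a few draws.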
[jira] Resolved: (PIG-918) [zebra] LOAD call will hang if only the first column group is queried
[ https://issues.apache.org/jira/browse/PIG-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raghu Angadi resolved PIG-918. -- Resolution: Fixed > [zebra] LOAD call will hang if only the first column group is queried > - > > Key: PIG-918 > URL: https://issues.apache.org/jira/browse/PIG-918 > Project: Pig > Issue Type: Bug >Affects Versions: 0.4.0 >Reporter: Yan Zhou >Assignee: Yan Zhou > Fix For: 0.4.0 > > Attachments: pig-zebra.patch, pig-zebra.patch > > > Zebra's LOAD call with projections that only include column(s) in the first > column group will hang because an improper range of random numbers for index > to the array of column groups always skips the first element so that if all > other column groups are not used, the looping keeps running without a chance > to break. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-918) [zebra] LOAD call will hang if only the first column group is queried
[ https://issues.apache.org/jira/browse/PIG-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12750055#action_12750055 ] Raghu Angadi commented on PIG-918: -- I just committed this. Thanks Yan. > [zebra] LOAD call will hang if only the first column group is queried > - > > Key: PIG-918 > URL: https://issues.apache.org/jira/browse/PIG-918 > Project: Pig > Issue Type: Bug >Affects Versions: 0.4.0 >Reporter: Yan Zhou > Fix For: 0.4.0 > > Attachments: pig-zebra.patch, pig-zebra.patch > > > Zebra's LOAD call with projections that only include column(s) in the first > column group will hang because an improper range of random numbers for index > to the array of column groups always skips the first element so that if all > other column groups are not used, the looping keeps running without a chance > to break. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-918) [zebra] LOAD call will hang if only the first column group is queried
[ https://issues.apache.org/jira/browse/PIG-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raghu Angadi updated PIG-918: - Affects Version/s: (was: 0.3.0) 0.4.0 > [zebra] LOAD call will hang if only the first column group is queried > - > > Key: PIG-918 > URL: https://issues.apache.org/jira/browse/PIG-918 > Project: Pig > Issue Type: Bug >Affects Versions: 0.4.0 >Reporter: Yan Zhou > Fix For: 0.4.0 > > Attachments: pig-zebra.patch, pig-zebra.patch > > > Zebra's LOAD call with projections that only include column(s) in the first > column group will hang because an improper range of random numbers for index > to the array of column groups always skips the first element so that if all > other column groups are not used, the looping keeps running without a chance > to break. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-918) [zebra] LOAD call will hang if only the first column group is queried
[ https://issues.apache.org/jira/browse/PIG-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raghu Angadi updated PIG-918: - Attachment: pig-zebra.patch When you generate a patch with 'git diff', please use 'git diff --no-prefix' so that the patch applies with the 'patch -p0' command. I am updating the attached patch with this change. > [zebra] LOAD call will hang if only the first column group is queried > - > > Key: PIG-918 > URL: https://issues.apache.org/jira/browse/PIG-918 > Project: Pig > Issue Type: Bug >Affects Versions: 0.3.0 >Reporter: Yan Zhou > Fix For: 0.4.0 > > Attachments: pig-zebra.patch, pig-zebra.patch > > > Zebra's LOAD call with projections that only include column(s) in the first > column group will hang because an improper range of random numbers for index > to the array of column groups always skips the first element so that if all > other column groups are not used, the looping keeps running without a chance > to break. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-833) Storage access layer
[ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745219#action_12745219 ] Raghu Angadi commented on PIG-833: -- Thanks Jing. There are some PIG examples listed at the bottom of the Zebra wiki : http://wiki.apache.org/pig/zebra (wiki is still under construction). Just listing the java strings in Jing's comment without Jira formatting : {noformat} final static String STR_SCHEMA = "s1:bool, s2:int, s3:long, s4:float, s5:string, s6:bytes, " + "r1:record(f1:int, f2:long), r2:record(r3:record(f3:float, f4)), " + "m1:map(string),m2:map(map(int)), c:collection(f13:double, f14:float, f15:bytes)"; final static String STR_STORAGE = "[s1, s2]; [m1#{a}]; [r1.f1]; [s3, s4, r2.r3.f3]; [s5, s6, m2#{x|y}]; " + "[r1.f2, m1#{b}]; [r2.r3.f4, m2#{z}]"; {noformat} > Storage access layer > > > Key: PIG-833 > URL: https://issues.apache.org/jira/browse/PIG-833 > Project: Pig > Issue Type: New Feature >Reporter: Jay Tang > Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, > PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, > TEST-org.apache.hadoop.zebra.pig.TestCheckin1.txt, test.out, zebra-javadoc.tgz > > > A layer is needed to provide a high level data access abstraction and a > tabular view of data in Hadoop, and could free Pig users from implementing > their own data storage/retrieval code. This layer should also include a > columnar storage format in order to provide fast data projection, > CPU/space-efficient data serialization, and a schema language to manage > physical storage metadata. Eventually it could also support predicate > pushdown for further performance improvement. Initially, this layer could be > a contrib project in Pig and become a hadoop subproject later on. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
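To make the shape of STR_STORAGE easier to read, here is a small sketch that splits a storage hint into its bracketed column groups. It is not Zebra's parser and does not interpret map keys such as m1#{a} or m2#{x|y}; the class StorageHintSketch and the method columnGroups are hypothetical names, and the only thing assumed about the syntax is what the strings above show: each [...] is one column group and its fields are separated by commas.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that lists the column groups named in a storage hint string. */
final class StorageHintSketch {

    /** One bracketed group, e.g. "[s1, s2]"; the capture is the text between the brackets. */
    private static final Pattern GROUP = Pattern.compile("\\[([^\\]]*)\\]");

    /** Returns one list of field specifications per bracketed group, in order of appearance. */
    static List<List<String>> columnGroups(String hint) {
        List<List<String>> groups = new ArrayList<>();
        Matcher m = GROUP.matcher(hint);
        while (m.find()) {
            List<String> fields = new ArrayList<>();
            for (String field : m.group(1).split(",")) {
                String trimmed = field.trim();
                if (!trimmed.isEmpty()) {
                    fields.add(trimmed);
                }
            }
            groups.add(fields);
        }
        return groups;
    }
}
```

Applied to STR_STORAGE above, this yields seven groups, the first holding s1 and s2. Fields not named in any group (the collection c here) would presumably go to a default group, as the PIG-949 report describes for leftover map fields.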
Re: Proposal to create a branch for contrib project Zebra
Right. I just noticed the mails on Pig 0.4.0. I joined the pig-dev list just yesterday. Waiting for 0.4.0 might be good enough if it is just a couple of weeks. Will keep a watch on it. I think we will wait for a few days and attach any new feature patches to jiras. Those patches can certainly wait there. For interdependencies of the patches, we might maintain a private git. Raghu. Santhosh Srinivasan wrote: I would recommend that zebra wait for Pig 0.4.0 (a couple of weeks?). A branch will be created for the 0.4.0 release and zebra will automatically benefit. Santhosh -Original Message- From: Raghu Angadi [mailto:rang...@yahoo-inc.com] Sent: Tuesday, August 18, 2009 9:49 AM To: pig-dev@hadoop.apache.org Subject: Re: Proposal to create a branch for contrib project Zebra Milind A Bhandarkar wrote: Since zebra.jar is not included in pig.jar (I hope not), I can still use stable zebra jar (binary) with latest pig compiled in trunk. The problem is that though the current version is "expected to be" stable, it would still require some bug fixes. We essentially need to maintain another branch (official or a private git) to provide version 0.1 jar with critical bug fixes. In that sense, would it be better if we created a "zebra-v1" branch and commit the new features to trunk? Maybe for regular users we can create Pig.jar and zebra.jar from different lines. Raghu. Also, build failure in zebra need not impact pig release, since the other contrib, i.e. Piggybank is also "build-optional". I think that creating a branch results in too many changes on that branch before a mainline merge happens. Each of the feature additions you mention would be very highly desirable even in the absence of others. Just my 2 non-binding cents. - milind
Re: Proposal to create a branch for contrib project Zebra
Milind A Bhandarkar wrote:
> Since zebra.jar is not included in pig.jar (I hope not), I can still use stable zebra jar (binary) with latest pig compiled in trunk.

The problem is that though the current version is "expected to be" stable, it would still require some bug fixes. We essentially need to maintain another branch (official or a private git) to provide version 0.1 jar with critical bug fixes. In that sense, would it be better if we created a "zebra-v1" branch and commit the new features to trunk? Maybe for regular users we can create Pig.jar and zebra.jar from different lines.

Raghu.

> Also, build failure in zebra need not impact pig release, since the other contrib, i.e. Piggybank is also "build-optional". I think that creating a branch results in too many changes on that branch before a mainline merge happens. Each of the feature additions you mention would be very highly desirable even in the absence of others. Just my 2 non-binding cents.
>
> - milind
Re: Proposal to create a branch for contrib project Zebra
Raghu Angadi wrote:
> Hi Santhosh,
>
> There are two separate things: (a) voting a contributor as a committer (b) committing to a contrib project.
> [...]
> Reason for (a) is simple scalability. We can not monitor everything. If

I meant to say "Reason for (b)" (why contrib commits are treated a bit differently). Our motivation is not to bypass any oversight; it is just that we don't want to burden PIG committers too much. We are happy if a PIG committer volunteers to oversee and commit.

Raghu.

> you or another PIG developer volunteers to commit zebra patches, we are more than happy to let you do it. Please let us know. Or at any stage, if you feel we may be violating normal conventions (like breaking builds or committing some PIG changes), please raise the issue. We have not seen serious problems in this regard with any other project, so I think we should get the benefit of the doubt.
>
> I have not addressed the reason for a new branch here. I will pitch for it in another mail.
>
> Raghu.
>
> Santhosh Srinivasan wrote:
>> Is there any precedence for such proposals? I am not comfortable with extending committer access to contrib teams. I would suggest that Zebra be made a sub-project of Hadoop and have a life of its own.
>>
>> Santhosh
>>
>> -----Original Message-----
>> From: Raghu Angadi [mailto:rang...@yahoo-inc.com]
>> Sent: Monday, August 17, 2009 4:06 PM
>> To: pig-dev@hadoop.apache.org
>> Subject: Proposal to create a branch for contrib project Zebra
>>
>> Thanks to the PIG team, The first version of contrib project Zebra (PIG-833) is committed to PIG trunk. In short, Zebra is a table storage layer built for use in PIG and other Hadoop applications. While we are stabilizing current version V1 in the trunk, we plan to add more new features to it. We would like to create an svn branch for the new features. We will be responsible for managing zebra in PIG trunk and in the new branch. We will merge the branch when it is ready. We expect the changes to affect only 'contrib/zebra' directory. As a regular contributor to Hadoop, I will be the initial committer for Zebra. As more patches are contributed by other Zebra developers, there might be more commiters added through normal Hadoop/Apache procedure. I would like to create a branch called 'zebra-v2' with approval from PIG team.
>>
>> Thanks, Raghu.
Re: Proposal to create a branch for contrib project Zebra
The reason for a branch is purely the fair number of improvements we are planning for Zebra and our desire to have a stable Zebra implementation for users to use along with PIG on Hadoop-0.20.

New features planned (jiras will be filed soon):
* Column security (different permissions for different columns)
* Ability to drop columns
* Ability to address "column groups" by name
* Support for sorted tables, map side joins
* ...

Many of these changes involve changes to table metadata, schema syntax, and the on-disk format of the metadata (all of these will be backward compatible).

If Zebra were a project of its own, one would have made a 0.1.0 branch and worked on new features in the trunk. The new proposed branch is for achieving the same by keeping PIG and stable Zebra together. PIG branch 0.4.0 will be made when it is appropriate for PIG. Generally, a contrib project should not influence that decision.

Is there an alternative to creating a branch? Would you prefer we commit new features to a line that is being used by users?

Raghu.

Milind A Bhandarkar wrote: IANAC, but my (non-binding) vote is also -1. I think all the improvements and feature addition to zebra should be available through pig trunk. The codebase is not big enough to justify creating a branch. If the reason is Pig's dependence on a checked in hadoop jar, the shims proposal by Dmitry should be taken up asap, so that those who want to use zebra can use pig trunk with hadoop 0.20 - milind On 8/17/09 5:14 PM, "Yiping Han" wrote: +1 On 8/18/09 7:11 AM, "Olga Natkovich" wrote: +1 -----Original Message----- From: Raghu Angadi [mailto:rang...@yahoo-inc.com] Sent: Monday, August 17, 2009 4:06 PM To: pig-dev@hadoop.apache.org Subject: Proposal to create a branch for contrib project Zebra Thanks to the PIG team, The first version of contrib project Zebra (PIG-833) is committed to PIG trunk. In short, Zebra is a table storage layer built for use in PIG and other Hadoop applications. 
While we are stabilizing current version V1 in the trunk, we plan to add more new features to it. We would like to create an svn branch for the new features. We will be responsible for managing zebra in PIG trunk and in the new branch. We will merge the branch when it is ready. We expect the changes to affect only 'contrib/zebra' directory. As a regular contributor to Hadoop, I will be the initial committer for Zebra. As more patches are contributed by other Zebra developers, there might be more commiters added through normal Hadoop/Apache procedure. I would like to create a branch called 'zebra-v2' with approval from PIG team. Thanks, Raghu. -- Yiping Han F-3140 (408)349-4403 y...@yahoo-inc.com
[jira] Commented: (PIG-833) Storage access layer
[ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744361#action_12744361 ] Raghu Angadi commented on PIG-833: -- I will try to get some initial docs attached to this jira asap. I think the current plan is to have proper wiki pages (also attached here). This is part of the reason why we would like to keep this jira open. The bulk initial dump is certainly not desirable but has been fairly common for many contrib projects in Hadoop. The rush to get this committed to contrib is in part to avoid such large changes happening again. The longer we delay, the larger the patch is going to get. We want to get the subsequent patches and discussions onto the public jira asap, and we are already doing that. I would like to clarify that this is not a PIG feature but rather a contrib project. We would not want this commit to be generalized to PIG commits. All the responsibility is with the Zebra team. This patch is the initial version. It does include many tests.
Re: Proposal to create a branch for contrib project Zebra
Hi Santhosh, There are two separate things: (a) voting a contributor as a committer (b) committing to a contrib project. (b): My experience with Hadoop is that "Contrib" by definition is very loosely coupled with core. By convention, we as committers to core (hdfs, mapred, etc) did not have to monitor changes to contrib as thoroughly as we would monitor core changes. It is the responsibility of contrib developers to make sure they are not breaking builds etc. Contrib changes get reviewed by people interested in the project. (a): Voting takes place when a contributor is being blessed as a committer. It involves some legal stuff as well. Although a committer has permissions to commit to any part of a project, it is expected that they don't misuse it. e.g. if I have a patch for core Map/Reduce, I would certainly wait for a regular MR contributor to review it and possibly commit it. It does not matter how many patches I might have contributed to, say, HDFS. Reason for (a) is simple scalability. We can not monitor everything. If you or another PIG developer volunteers to commit zebra patches, we are more than happy to let you do it. Please let us know. Or at any stage, if you feel we may be violating normal conventions (like breaking builds or committing some PIG changes), please raise the issue. We have not seen serious problems in this regard with any other project, so I think we should get the benefit of the doubt. I have not addressed the reason for a new branch here. I will pitch for it in another mail. Raghu. Santhosh Srinivasan wrote: Is there any precedence for such proposals? I am not comfortable with extending committer access to contrib teams. I would suggest that Zebra be made a sub-project of Hadoop and have a life of its own. 
Santhosh -Original Message----- From: Raghu Angadi [mailto:rang...@yahoo-inc.com] Sent: Monday, August 17, 2009 4:06 PM To: pig-dev@hadoop.apache.org Subject: Proposal to create a branch for contrib project Zebra Thanks to the PIG team, The first version of contrib project Zebra (PIG-833) is committed to PIG trunk. In short, Zebra is a table storage layer built for use in PIG and other Hadoop applications. While we are stabilizing current version V1 in the trunk, we plan to add more new features to it. We would like to create an svn branch for the new features. We will be responsible for managing zebra in PIG trunk and in the new branch. We will merge the branch when it is ready. We expect the changes to affect only 'contrib/zebra' directory. As a regular contributor to Hadoop, I will be the initial committer for Zebra. As more patches are contributed by other Zebra developers, there might be more commiters added through normal Hadoop/Apache procedure. I would like to create a branch called 'zebra-v2' with approval from PIG team. Thanks, Raghu.
Proposal to create a branch for contrib project Zebra
Thanks to the PIG team, The first version of contrib project Zebra (PIG-833) is committed to PIG trunk. In short, Zebra is a table storage layer built for use in PIG and other Hadoop applications. While we are stabilizing current version V1 in the trunk, we plan to add more new features to it. We would like to create an svn branch for the new features. We will be responsible for managing zebra in PIG trunk and in the new branch. We will merge the branch when it is ready. We expect the changes to affect only 'contrib/zebra' directory. As a regular contributor to Hadoop, I will be the initial committer for Zebra. As more patches are contributed by other Zebra developers, there might be more commiters added through normal Hadoop/Apache procedure. I would like to create a branch called 'zebra-v2' with approval from PIG team. Thanks, Raghu.
[jira] Commented: (PIG-833) Storage access layer
[ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742435#action_12742435 ] Raghu Angadi commented on PIG-833: -- > this means Pig contrib/ is no longer compatible with Hadoop 18. This is not desirable and is expected to be temporary until PIG-660 is committed. PIG-660 has other dependencies and a different schedule. We thought committing zebra now would make zebra builds and subsequent patches easier. As such, PIG does not build contrib from the top level ('ant test-contrib' is a no-op), so each contrib project needs to be built explicitly anyway. This is different from the Hadoop build. This patch should not fail existing automated builds.
[jira] Commented: (PIG-833) Storage access layer
[ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742069#action_12742069 ] Raghu Angadi commented on PIG-833: -- Alan, in order to run unit tests you need to build pig test-core. As mentioned in the instructions above please run {{'ant -Dtestcase=none test-core'}} under top level directory before running 'ant test' under contrib/zebra.
[jira] Updated: (PIG-833) Storage access layer
[ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raghu Angadi updated PIG-833: - Attachment: PIG-833-zebra.patch.bz2
[jira] Updated: (PIG-833) Storage access layer
[ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raghu Angadi updated PIG-833: - Attachment: PIG-833-zebra.patch.bz2 Updated patch. Only change is that ant prints a descriptive error to user if hadoop20.jar does not exist in top level lib directory. It lists basic steps to get this built until PIG-660 is committed.
[jira] Commented: (PIG-833) Storage access layer
[ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736998#action_12736998 ] Raghu Angadi commented on PIG-833: -- There will be benchmark results attached either to this jira or to a subsequent jira. I would like to compare against SequenceFiles and the new format in Hive. We should see on-par performance. The major performance benefits come from commonly used projections (through column groups) and map side joins of sorted tables. An important part of the motivation is features like column security and the ability to delete entire columns. We are running some larger scale benchmarks internally, but these run on Yahoo's internal data sources.
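As a rough illustration of why projections benefit from column groups, the Java sketch below works out which column groups a reader would have to touch for a given projection. It is not Zebra's reader code: the class and method names (ProjectionSketch, groupsToRead) are hypothetical, and the sketch assumes each column lives in exactly one column group and that a group is read in full whenever any of its columns is requested.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; this is NOT Zebra's reader implementation.
class ProjectionSketch {

    // Returns the indexes of the column groups that contain at least one
    // projected column. Groups not listed never need to be opened.
    static List<Integer> groupsToRead(List<List<String>> columnGroups, Set<String> projection) {
        List<Integer> needed = new ArrayList<>();
        for (int i = 0; i < columnGroups.size(); i++) {
            for (String column : columnGroups.get(i)) {
                if (projection.contains(column)) {
                    needed.add(i);
                    break; // one hit is enough to require this group
                }
            }
        }
        return needed;
    }

    public static void main(String[] args) {
        // Three made-up column groups for the example.
        List<List<String>> groups = Arrays.asList(
            Arrays.asList("s1", "s2"),
            Arrays.asList("s3", "s4"),
            Arrays.asList("s5", "s6"));

        // A projection that matches one group reads only that group.
        System.out.println(groupsToRead(groups, new HashSet<>(Arrays.asList("s1", "s2")))); // prints [0]

        // A projection that straddles groups reads every group it touches.
        System.out.println(groupsToRead(groups, new HashSet<>(Arrays.asList("s1", "s5")))); // prints [0, 2]
    }
}
```

Under these assumptions, a storage hint that groups columns the way they are commonly projected keeps the number of groups read per query small, which is the effect the comment above refers to.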
[jira] Updated: (PIG-833) Storage access layer
[ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raghu Angadi updated PIG-833: - Attachment: zebra-javadoc.tgz
[jira] Updated: (PIG-833) Storage access layer
[ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raghu Angadi updated PIG-833: - Attachment: PIG-833-zebra.patch The first cut of contrib/zebra. The patch is very large, so we should probably compress subsequent versions of it. More documentation on design and usage will be added to the jira.
How to compile:
* check out the latest PIG trunk
* apply the latest patch from PIG-660
* copy the attached hadoop20.jar to ./lib
* run '{{ant jar}}' (and {{'ant -Dtestcase=none test-core'}} for zebra tests)
* cd contrib/zebra
* ant jar
* ant test (for tests)
Currently there are compile-time deprecation warnings related to use of the deprecated mapred API (JobConf). These will be fixed later.
[jira] Commented: (PIG-833) Storage access layer
[ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736424#action_12736424 ] Raghu Angadi commented on PIG-833: -- I will surely look at Hive's storage layer and SerDe. I will be able to comment better on specifics once I get a better handle on them. In the meantime I will attach the work that has already been done on Zebra. This is currently a contrib in PIG. Based on these experiences we could probably provide a common storage layer more widely suitable for multiple Hadoop related projects.
[jira] Updated: (PIG-833) Storage access layer
[ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raghu Angadi updated PIG-833: - Attachment: hadoop20.jar.bz2 Attaching hadoop20.jar, which needs to be placed under the lib/ directory under the top level PIG directory. I will include specific instructions later in the jira.
[jira] Updated: (PIG-660) Integration with Hadoop 0.20
[ https://issues.apache.org/jira/browse/PIG-660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raghu Angadi updated PIG-660: - Attachment: PIG-660_6.patch Updated patch fixes two minor conflicts with the current pig trunk. > Integration with Hadoop 0.20 > > > Key: PIG-660 > URL: https://issues.apache.org/jira/browse/PIG-660 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.2.0 > Environment: Hadoop 0.20 >Reporter: Santhosh Srinivasan >Assignee: Santhosh Srinivasan > Fix For: 0.4.0 > > Attachments: PIG-660.patch, PIG-660_1.patch, PIG-660_2.patch, > PIG-660_3.patch, PIG-660_4.patch, PIG-660_5.patch, PIG-660_6.patch > > > With Hadoop 0.20, it will be possible to query the status of each map and > reduce in a map reduce job. This will allow better error reporting. Some of > the other items that could be on Hadoop's feature requests/bugs are > documented here for tracking. > 1. Hadoop should return objects instead of strings when exceptions are thrown > 2. The JobControl should handle all exceptions and report them appropriately. > For example, when the JobControl fails to launch jobs, it should handle > exceptions appropriately and should support APIs that query this state, i.e., > failure to launch jobs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-660) Integration with Hadoop 0.20
[ https://issues.apache.org/jira/browse/PIG-660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736297#action_12736297 ] Raghu Angadi commented on PIG-660: -- Thanks Olga and Santhosh. The build.xml change is already in the patch. I will attach a hadoop20.jar that works with PIG. This is useful for anyone who wants to try out the patch. It will also be used by zebra (PIG-833). Please commit the jar file to PIG trunk. It could be updated with a later version of the hadoop-0.20 branch.
[jira] Commented: (PIG-660) Integration with Hadoop 0.20
[ https://issues.apache.org/jira/browse/PIG-660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736264#action_12736264 ] Raghu Angadi commented on PIG-660: -- Currently, hadoop jar for 0.18 under lib/ is called hadoop18.jar. Should we change build.xml to use hadoop20.jar instead of hadoop18.jar? I can file a jira to commit hadoop20.jar. This might be replaced by updated jar when this jira is committed.