[jira] Commented: (PIG-984) PERFORMANCE: Implement a map-side group operator to speed up processing of ordered data
[ https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12761257#action_12761257 ]

Alan Gates commented on PIG-984:

The controlling philosophic point here is that pigs are domestic animals (see http://wiki.apache.org/pig/PigPhilosophy). Just as in join, where we have exposed all possible join implementations to the user, we want to do the same with this new feature. At some future point when we have a capable optimizer, we will try to select the best type of join, and try to select this form of grouping when it's appropriate. But even then, we want to expose this functionality to the user directly because the optimizer may not have access to the necessary information to determine the best grouping choice (e.g., data sources with no schema). And we don't want to wait until the optimizer can handle these things to start exposing it.

I don't agree with Santhosh's assertion that the language is evolving with no definition. I agree we do not yet have a comprehensive definition of Pig Latin, which we need. But this is in line with what we've done for joins, philosophically, semantically, and syntactically.

PERFORMANCE: Implement a map-side group operator to speed up processing of ordered data

Key: PIG-984
URL: https://issues.apache.org/jira/browse/PIG-984
Project: Pig
Issue Type: New Feature
Reporter: Richard Ding

The general group by operation in Pig needs both mappers and reducers (the aggregation is done in reducers). This incurs disk writes/reads between mappers and reducers. However, in the cases where the input data has the following properties

1. The records with the same key are grouped together (for example, the data is sorted by the keys).
2. The records with the same key are in the same mapper input.

the group by operation can be performed in the mappers only, which removes the overhead of disk writes/reads.

Alan proposed adding a hint to the group by clause like this one:

{code}
A = load 'input' using SomeLoader(...);
B = group A by $0 using mapside;
C = foreach B generate ...
{code}

The proposed "using mapside" addition to group will select a map-side group operator that collects all records for a given key into a buffer. When it sees a key change it will emit the key and the bag of records it had buffered. It will assume that all records for a given key are collected together and thus there is no need to buffer across keys.

It is expected that SomeLoader will be implemented by data systems such as Zebra to ensure the data emitted by the loader satisfies the above properties (1) and (2). It will be the responsibility of the user (or the loader) to guarantee properties (1) and (2) before invoking the mapside hint for the group by clause. The Pig runtime can't check for violations of these properties in the input data.

For group by clauses with the mapside hint, Pig Latin will only support group by columns (including *), not group by expressions nor group all.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
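The buffering behaviour described for the map-side group operator in PIG-984 can be sketched as follows. This is a minimal Python illustration, not Pig's implementation: the function name `collected_group` and the `(key, value)` record layout are assumptions made for the example. It relies on property (1), that records with the same key arrive next to each other.

```python
from typing import Any, Iterable, Iterator, List, Tuple


def collected_group(
    records: Iterable[Tuple[Any, Any]]
) -> Iterator[Tuple[Any, List[Any]]]:
    """Group (key, value) records whose equal keys are already adjacent.

    Hypothetical helper: buffers the values of the current key only and
    emits (key, bag) as soon as the key changes, so nothing is buffered
    across keys.
    """
    buffer: List[Any] = []
    current_key = None
    have_key = False  # separate flag so that None is a legal key
    for key, value in records:
        if have_key and key != current_key:
            # Key change: emit what was buffered and start a new bag.
            yield current_key, buffer
            buffer = []
        current_key = key
        have_key = True
        buffer.append(value)
    if have_key:
        # Emit the bag for the last key once the input is exhausted.
        yield current_key, buffer
```

If the input violates property (1), for example the keys arrive as a, b, a, this sketch silently emits key a twice. That is the kind of error the description says the runtime cannot check for, which is why the guarantee is left to the user or the loader.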
[jira] Commented: (PIG-984) PERFORMANCE: Implement a map-side group operator to speed up processing of ordered data
[ https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12761270#action_12761270 ]

Santhosh Srinivasan commented on PIG-984:

bq. But this is in line with what we've done for joins, philosophically, semantically, and syntactically.

Not exactly; with joins we are exposing different kinds of joins. Here we are exposing the underlying aspects of the framework (mapside). If there is a parallel framework that does not do map-reduce then having mapside in the language is philosophically and semantically not correct.
[jira] Created: (PIG-987) Zebra Column Group Access Control for A29 compliance and Performance
Zebra Column Group Access Control for A29 compliance and Performance

Key: PIG-987
URL: https://issues.apache.org/jira/browse/PIG-987
Project: Pig
Issue Type: New Feature
Affects Versions: 0.6.0
Reporter: Yan Zhou
Assignee: Yan Zhou

Access Control: when processes try to read from the column groups, Zebra should be able to handle allowed vs. disallowed user/application accesses.

Expected behavior when column group permissions are set: when a user selects only columns that they do not have permission to access, Zebra should return an error with the message

Error #: Permission denied for accessing column <column name or names>

Access control applies to an entire column group, so all columns in a column group have the same permissions.
[jira] Updated: (PIG-987) Zebra Column Group Access Control for A29 compliance and Performance
[ https://issues.apache.org/jira/browse/PIG-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-987:

Status: Patch Available (was: Open)

A29_ColumnGroupSecurity.patch is the patch file name.
[jira] Updated: (PIG-987) Zebra Column Group Access Control for A29 compliance and Performance
[ https://issues.apache.org/jira/browse/PIG-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-987:

Attachment: A29_ColumnGroupSecurity.patch
[jira] Commented: (PIG-984) PERFORMANCE: Implement a map-side group operator to speed up processing of ordered data
[ https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12761276#action_12761276 ]

Alan Gates commented on PIG-984:

I'm fine with changing the name from 'mapside' to 'collected' or something. I see your point that exposing the term 'mapside' is bad because it is hadoop specific. But I think the overall idea of allowing the user to select the type of grouping is good.
[jira] Created: (PIG-988) Better implementation of distinct aggs
Better implementation of distinct aggs

Key: PIG-988
URL: https://issues.apache.org/jira/browse/PIG-988
Project: Pig
Issue Type: Improvement
Reporter: Alan Gates

Distinct aggregates by definition cannot use the combiner (though the distinct can be and is done in the combiner). Since this is a common use case it would be good to optimize.
[jira] Commented: (PIG-988) Better implementation of distinct aggs
[ https://issues.apache.org/jira/browse/PIG-988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12761284#action_12761284 ]

Alan Gates commented on PIG-988:

Consider a script like:

{code}
A = load 'bla';
B = group A by $0;
C = foreach B {
    D = A.$1;
    E = distinct D;
    generate group, COUNT(E);
}
{code}

This is count distinct, and a fairly common thing to do. Currently Pig will use the combiner to remove as many duplicate values from D as possible. But a final distinct pass is still required on the reducer. Currently DistinctBag is used for this. In this particular case, it would be possible to instead use Hadoop's secondary sort to sort the incoming records on the full tuple, and then use a different implementation of DistinctBag that expected the incoming records to be sorted and remove duplicates. Note that this could not be used in conjunction with the order by optimization proposed in PIG-980.
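The idea in Alan's PIG-988 comment, a distinct pass that expects sorted input, can be illustrated with a short Python sketch. It is only an analogy for the proposed DistinctBag variant, not Pig code; `count_distinct_sorted` is a hypothetical name. Because the secondary sort would deliver equal tuples next to each other, duplicates can be dropped by comparing each record with the previous one, without holding the whole bag in memory.

```python
from typing import Any, Iterable


def count_distinct_sorted(sorted_values: Iterable[Any]) -> int:
    """Count distinct values in a stream that is already sorted.

    Assumes equal values are adjacent (as a sort guarantees), so only
    the previously seen value has to be remembered.
    """
    count = 0
    previous = object()  # sentinel that compares unequal to any real value
    for value in sorted_values:
        if value != previous:
            count += 1
            previous = value
    return count
```

For example, `count_distinct_sorted([(1,), (1,), (2,), (3,), (3,)])` counts each run of equal tuples once. On unsorted input the count would be too high, which is why the sketch depends on the sort.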
[jira] Created: (PIG-989) Allow type merge between numerical type and non-numerical type
Allow type merge between numerical type and non-numerical type

Key: PIG-989
URL: https://issues.apache.org/jira/browse/PIG-989
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.5.0
Reporter: Daniel Dai

Currently, we do not allow type merge between a numerical type and a non-numerical type, and the error message is confusing. E.g., if you run:

a = load '1.txt' as (a0:chararray, a1:chararray);
b = load '2.txt' as (b0:long, b1:chararray);
c = join a by a0, b by b0;
dump c;

the error message is:

ERROR 1051: Cannot cast to Unknown

We should:
1. Allow the type merge between a numerical type and a non-numerical type
2. Or at least, provide a more meaningful error message to the user
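Option 2 of PIG-989 (a more meaningful error) could look roughly like the Python sketch below. Everything in it is hypothetical: the function `merge_types`, the exception `TypeMergeError`, the widening order int, long, float, double, and the wording of the message are assumptions for illustration and do not reflect what the attached patch does.

```python
# Assumed numeric widening order; a wider type sits later in the list.
NUMERIC_ORDER = ["int", "long", "float", "double"]


class TypeMergeError(Exception):
    """Raised when two join key types cannot be merged (hypothetical)."""


def merge_types(left_name: str, left_type: str,
                right_name: str, right_type: str) -> str:
    """Return the merged type of two join keys or raise a descriptive error."""
    if left_type == right_type:
        return left_type
    if left_type in NUMERIC_ORDER and right_type in NUMERIC_ORDER:
        # Both numeric: widen to the later entry in NUMERIC_ORDER.
        return max(left_type, right_type, key=NUMERIC_ORDER.index)
    # Numeric vs. non-numeric: name both columns and both types so the
    # user can see which keys clash and where a cast is needed.
    raise TypeMergeError(
        f"Cannot merge join keys {left_name}:{left_type} and "
        f"{right_name}:{right_type}; add an explicit cast to one of them"
    )
```

With the script from the report, `merge_types("a0", "chararray", "b0", "long")` would raise an error that names a0 and b0 instead of reporting only "Cannot cast to Unknown".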
[jira] Commented: (PIG-987) Zebra Column Group Access Control for A29 compliance and Performance
[ https://issues.apache.org/jira/browse/PIG-987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12761327#action_12761327 ]

Hadoop QA commented on PIG-987:

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12421038/A29_ColumnGroupSecurity.patch
against trunk revision 820394.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 38 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
-1 release audit. The applied patch generated 288 release audit warnings (more than the trunk's current 281 warnings).
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/54/testReport/
Release audit warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/54/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/54/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/54/console

This message is automatically generated.
[jira] Updated: (PIG-987) Zebra Column Group Access Control
[ https://issues.apache.org/jira/browse/PIG-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-987:

Summary: Zebra Column Group Access Control (was: Zebra Column Group Access Control for A29 compliance and Performance)
[jira] Updated: (PIG-987) Zebra Column Group Access Control
[ https://issues.apache.org/jira/browse/PIG-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-987:

Attachment: (was: A29_ColumnGroupSecurity.patch)
[jira] Updated: (PIG-987) Zebra Column Group Access Control
[ https://issues.apache.org/jira/browse/PIG-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-987:

Attachment: ColumnGroupSecurity.patch
[jira] Commented: (PIG-987) Zebra Column Group Access Control
[ https://issues.apache.org/jira/browse/PIG-987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12761365#action_12761365 ]

Yan Zhou commented on PIG-987:

During STORE, the storage hint is enhanced to take a new "secure by" section, e.g.,

[c1,c2] secure by group:secure perm:640

meaning the column group of columns c1 and c2 will belong to group secure with a file permission octal value of 0640 which, in turn, means read+write for the user, read for the group, and none for others.

After Zebra table creation, all files and directories inside the secured column group will have the same permission and group membership within the table. If a column group is not secured, the default behavior is determined by the Hadoop MapReduce default permission and group membership set upon the new files and directories.
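To make the reading of perm:640 concrete, the following small Python sketch decodes an octal permission string into the usual user/group/others triplets. The helper name `describe_permission` is made up for this illustration and is not part of Zebra.

```python
def describe_permission(octal_text: str) -> str:
    """Render an octal permission such as "640" in rwx notation."""
    mode = int(octal_text, 8)
    parts = []
    # Three bits each for user, group and others, highest bits first.
    for shift in (6, 3, 0):
        bits = (mode >> shift) & 0o7
        parts.append(
            ("r" if bits & 0o4 else "-")
            + ("w" if bits & 0o2 else "-")
            + ("x" if bits & 0o1 else "-")
        )
    return "".join(parts)


print(describe_permission("640"))  # prints "rw-r-----"
```

The output matches the description in the comment: read+write for the user, read for the group, and nothing for others.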
[jira] Updated: (PIG-592) schema inferred incorrectly
[ https://issues.apache.org/jira/browse/PIG-592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-592:

Status: Patch Available (was: Open)

schema inferred incorrectly

Key: PIG-592
URL: https://issues.apache.org/jira/browse/PIG-592
Project: Pig
Issue Type: Bug
Affects Versions: 0.4.0
Reporter: Christopher Olston
Fix For: 0.6.0
Attachments: PIG-592-1.patch, PIG-592-2.patch

A simple Pig script that never introduces any schema information:

A = load 'foo';
B = foreach (group A by $8) generate group, COUNT($1);
C = load 'bar'; -- ('bar' has two columns)
D = join B by $0, C by $0;
E = foreach D generate $0, $1, $3;

fails, complaining that $3 does not exist:

java.io.IOException: Out of bound access. Trying to access non-existent column: 3. Schema {B::group: bytearray,long,bytearray} has 3 column(s).

Apparently Pig gets confused, and thinks it knows the schema for C (a single bytearray column).
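One plausible reading of the PIG-592 report is that the schema of a join should stay unknown whenever one of its inputs has no schema, instead of being filled in with a guessed single column. The Python sketch below illustrates that rule only; `join_schema` and the list-of-names representation are assumptions for the example and do not describe Pig's internals or the attached patches.

```python
from typing import List, Optional

# None stands for "schema unknown"; a list holds the column names.
Schema = Optional[List[str]]


def join_schema(left: Schema, right: Schema) -> Schema:
    """Schema of a join: concatenation, or unknown if any input is unknown."""
    if left is None or right is None:
        # Do not invent columns for an input loaded without a schema.
        return None
    return left + right
```

In the script from the report, B has two known columns while C was loaded without a schema, so under this rule `join_schema(["B::group", "count"], None)` is `None` and a later reference to $3 would not be rejected at compile time.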
[jira] Updated: (PIG-960) Using Hadoop's optimized LineRecordReader for reading Tuples in PigStorage
[ https://issues.apache.org/jira/browse/PIG-960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankit Modi updated PIG-960:

Attachment: pig_rlr.patch

Using Hadoop's optimized LineRecordReader for reading Tuples in PigStorage

Key: PIG-960
URL: https://issues.apache.org/jira/browse/PIG-960
Project: Pig
Issue Type: Improvement
Components: impl
Reporter: Ankit Modi
Attachments: pig_rlr.patch

PigStorage's reading of Tuples (lines) can be optimized using Hadoop's {{LineRecordReader}}.

This can help in the following areas:
- Improving the performance of reading Tuples (lines) in {{PigStorage}}
- Any future improvements to line reading made in Hadoop's {{LineRecordReader}} are automatically carried over to Pig

Issues that are handled by this patch:
- BZip uses internal buffers and positioning for determining the number of bytes read. Hence buffering done by {{LineRecordReader}} has to be turned off
- The current implementation of {{LocalSeekableInputStream}} does not implement the {{available}} method. This method has to be implemented.
[jira] Updated: (PIG-960) Using Hadoop's optimized LineRecordReader for reading Tuples in PigStorage
[ https://issues.apache.org/jira/browse/PIG-960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankit Modi updated PIG-960:

Attachment: (was: pig_rlr.patch)
[jira] Commented: (PIG-960) Using Hadoop's optimized LineRecordReader for reading Tuples in PigStorage
[ https://issues.apache.org/jira/browse/PIG-960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12761376#action_12761376 ]

Ankit Modi commented on PIG-960:

Added the latest patch making PigLineRecordReader a wrapper only.
[jira] Updated: (PIG-989) Allow type merge between numerical type and non-numerical type
[ https://issues.apache.org/jira/browse/PIG-989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-989:

Status: Patch Available (was: Open)

Attachments: PIG-989-1.patch
[jira] Commented: (PIG-989) Allow type merge between numerical type and non-numerical type
[ https://issues.apache.org/jira/browse/PIG-989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12761390#action_12761390 ]

Daniel Dai commented on PIG-989:

There is no good way to add a unit test to it. I tested it manually.
[jira] Created: (PIG-990) Provide a way to pin LogicalOperator Options
Provide a way to pin LogicalOperator Options

Key: PIG-990
URL: https://issues.apache.org/jira/browse/PIG-990
Project: Pig
Issue Type: Bug
Components: impl
Reporter: Dmitriy V. Ryaboy
Priority: Minor
Fix For: 0.6.0

This is a proactive patch, setting up the groundwork for adding an optimizer. Some of the LogicalOperators have options. For example, LOJoin has a variety of join types (regular, fr, skewed, merge), which can be set by the user or chosen by a hypothetical optimizer. If a user selects a join type, Pig philosophy guides us to always respect the user's choice and not explore alternatives. Therefore, we need a way to pin options.
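The pinning idea in PIG-990 can be sketched in a few lines of Python. This is a sketch under stated assumptions, not the proposed patch: the class `JoinOperator`, its methods `pin_option` and `is_pinned`, and the function `optimizer_choose_join` are all hypothetical names. The point is only that an optimizer checks for a pin before changing an option the user set explicitly.

```python
class JoinOperator:
    """Stand-in for a logical join operator with a selectable join type."""

    def __init__(self, join_type: str = "regular") -> None:
        self.join_type = join_type
        self._pinned = set()  # names of options the optimizer must not change

    def pin_option(self, name: str) -> None:
        """Mark an option as chosen by the user."""
        self._pinned.add(name)

    def is_pinned(self, name: str) -> bool:
        return name in self._pinned


def optimizer_choose_join(op: JoinOperator, suggested: str) -> bool:
    """Apply the optimizer's suggestion unless the user pinned the join type.

    Returns True if the suggestion was applied.
    """
    if op.is_pinned("join_type"):
        return False  # respect the user's explicit choice
    op.join_type = suggested
    return True
```

A parser would call `pin_option("join_type")` when the script names a join type explicitly, and leave the option unpinned otherwise so that the optimizer remains free to pick one.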
[jira] Commented: (PIG-592) schema inferred incorrectly
[ https://issues.apache.org/jira/browse/PIG-592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12761472#action_12761472 ]

Hadoop QA commented on PIG-592:

+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12421093/PIG-592-3.patch
against trunk revision 820394.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 12 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/56/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/56/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/56/console

This message is automatically generated.