[jira] Commented: (PIG-984) PERFORMANCE: Implement a map-side group operator to speed up processing of ordered data

2009-10-01 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12761257#action_12761257
 ] 

Alan Gates commented on PIG-984:


The controlling philosophic point here is that pigs are domestic animals (see 
http://wiki.apache.org/pig/PigPhilosophy).  Just as in join, where we have 
exposed all possible join implementations to the user, we want to do the same 
with this new feature.  At some future point when we have a capable optimizer, 
we will try to select the best type of join, and try to select this form of 
grouping when it's appropriate.  But even then, we want to expose this 
functionality to the user directly because the optimizer may not have access to 
the necessary information to determine the best grouping choice (e.g., data 
sources with no schema).  And we don't want to wait until the optimizer can 
handle these things to start exposing it.  

I don't agree with Santhosh's assertion that the language is evolving with no 
definition.  I agree we do not yet have a comprehensive definition of Pig 
Latin, which we need.  But this is in line with what we've done for joins, 
philosophically, semantically, and syntactically.

 PERFORMANCE: Implement a map-side group operator to speed up processing of 
 ordered data 
 

 Key: PIG-984
 URL: https://issues.apache.org/jira/browse/PIG-984
 Project: Pig
  Issue Type: New Feature
Reporter: Richard Ding

 The general group by operation in Pig needs both mappers and reducers (the 
 aggregation is done in reducers). This incurs disk writes/reads between 
 mappers and reducers.
 However, when the input data has the following properties:
1. The records with the same key are grouped together (for example, because 
 the data is sorted by key).
2. The records with the same key are in the same mapper input.
 the group by operation can be performed in the mappers only, which removes 
 the overhead of disk writes/reads.
 Alan proposed adding a hint to the group by clause like this one:
 {code}
 A = load 'input' using SomeLoader(...);
 B = group A by $0 using mapside;
 C = foreach B generate ...
 {code}
 The proposed "using mapside" clause selects a map-side group operator that 
 collects all records for a given key into a buffer. When it sees a key 
 change, it emits the key and the bag of records it has buffered. It assumes 
 that all records for a given key are collected together, and thus there is 
 no need to buffer across keys. 
 It is expected that SomeLoader will be implemented by data systems such as 
 Zebra to ensure the data emitted by the loader satisfies properties (1) and 
 (2) above.
 It will be the responsibility of the user (or the loader) to guarantee 
 properties (1) and (2) before invoking the mapside hint for the group by 
 clause. The Pig runtime can't check the input data for violations.
 For group by clauses with the mapside hint, Pig Latin will only support group 
 by columns (including *), not group by expressions or group all. 
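
The buffering behaviour described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Pig's operator: the function name collected_group and the key_fn parameter are made up for this example, and records are plain tuples.

```python
def collected_group(records, key_fn):
    """Yield (key, bag) pairs, assuming records with equal keys are adjacent.

    Hypothetical sketch of the proposed map-side grouping: buffer records
    until the key changes, then emit the key and the buffered bag.
    """
    no_key = object()          # sentinel: nothing buffered yet
    current_key = no_key
    bag = []
    for record in records:
        key = key_fn(record)
        if current_key is not no_key and key != current_key:
            yield current_key, bag
            bag = []           # start a fresh buffer for the new key
        current_key = key
        bag.append(record)
    if current_key is not no_key:
        yield current_key, bag  # emit the last buffered group


grouped = list(collected_group([("a", 1), ("a", 2), ("b", 3)], lambda r: r[0]))
# Input that violates property (1) silently produces the key "a" twice.
broken = list(collected_group([("a", 1), ("b", 2), ("a", 3)], lambda r: r[0]))
```

The second call shows why the guarantee has to come from the user or the loader: the operator never looks back across keys, so it cannot detect that a key reappears later in the input.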
   

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-984) PERFORMANCE: Implement a map-side group operator to speed up processing of ordered data

2009-10-01 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12761270#action_12761270
 ] 

Santhosh Srinivasan commented on PIG-984:
-

bq. But this is in line with what we've done for joins, philosophically, 
semantically, and syntactically.

Not exactly; with joins we are exposing different kinds of joins. Here we are 
exposing the underlying aspects of the framework (mapside). If there is a 
parallel framework that does not do map-reduce then having mapside in the 
language is philosophically and semantically not correct.




[jira] Created: (PIG-987) Zebra Column Group Access Control for A29 compliance and Performance

2009-10-01 Thread Yan Zhou (JIRA)
Zebra Column Group Access Control for A29 compliance and Performance


 Key: PIG-987
 URL: https://issues.apache.org/jira/browse/PIG-987
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.6.0
Reporter: Yan Zhou
Assignee: Yan Zhou


Access Control: when processes try to read from the column groups, Zebra should 
be able to handle allowed vs. disallowed user/application accesses. 

Expected behavior when column group permissions are set:

When a user selects only columns that they do not have permission to access, 
Zebra should return an error with the message "Error #: Permission denied for 
accessing column <column name or names>".

Access control applies to an entire column group, so all columns in a column 
group have the same permissions. 





[jira] Updated: (PIG-987) Zebra Column Group Access Control for A29 compliance and Performance

2009-10-01 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-987:
-

Status: Patch Available  (was: Open)

A29_ColumnGroupSecurity.patch is the patch file name.




[jira] Updated: (PIG-987) Zebra Column Group Access Control for A29 compliance and Performance

2009-10-01 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-987:
-

Attachment: A29_ColumnGroupSecurity.patch




[jira] Commented: (PIG-984) PERFORMANCE: Implement a map-side group operator to speed up processing of ordered data

2009-10-01 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12761276#action_12761276
 ] 

Alan Gates commented on PIG-984:


I'm fine with changing the name from 'mapside' to 'collected' or something.  I 
see your point that exposing the term 'mapside' is bad because it is 
Hadoop-specific.  But I think the overall idea of allowing the user to select 
the type of grouping is good.




[jira] Created: (PIG-988) Better implementation of distinct aggs

2009-10-01 Thread Alan Gates (JIRA)
Better implementation of distinct aggs
--

 Key: PIG-988
 URL: https://issues.apache.org/jira/browse/PIG-988
 Project: Pig
  Issue Type: Improvement
Reporter: Alan Gates


Distinct aggregates by definition cannot use the combiner (though the distinct 
can be and is done in the combiner).  Since this is a common use case it would 
be good to optimize.




[jira] Commented: (PIG-988) Better implementation of distinct aggs

2009-10-01 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12761284#action_12761284
 ] 

Alan Gates commented on PIG-988:


Consider a script like:

{code}
A = load 'bla';
B = group A by $0;
C = foreach B {
   D = A.$1;
   E = distinct D;
   generate group, COUNT(E);
}
{code}

This is count distinct, and a fairly common thing to do.  Currently Pig will 
use the combiner to remove as many duplicate values from D as possible.  But a 
final distinct pass is still required on the reducer.  Currently DistinctBag is 
used for this.  In this particular case, it would be possible to instead use 
Hadoop's secondary sort to sort the incoming records on the full tuple, and 
then use a different implementation of DistinctBag that expects the incoming 
records to be sorted and removes duplicates as they stream past.

Note that this could not be used in conjunction with the order by optimization 
proposed in PIG-980.
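
To make the benefit of sorted input concrete, here is a minimal Python sketch. It is not Pig's DistinctBag; count_distinct_sorted is a hypothetical name, and the call to sorted() merely stands in for the ordering that Hadoop's secondary sort would provide.

```python
def count_distinct_sorted(values):
    """Count distinct values in an already sorted sequence.

    Only the previous value is kept, so memory use stays constant no
    matter how many values arrive (illustrative sketch only).
    """
    count = 0
    first = True
    previous = None
    for value in values:
        if first or value != previous:
            count += 1          # a new run of equal values starts here
        previous = value
        first = False
    return count


# sorted() plays the role of the secondary sort in this sketch.
values_for_one_group = sorted([3, 1, 3, 2, 1, 3])
distinct_count = count_distinct_sorted(values_for_one_group)
```

A general-purpose distinct has to remember every value it has seen (or sort the values itself), whereas the sorted variant compares each record only with its predecessor.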




[jira] Created: (PIG-989) Allow type merge between numerical type and non-numerical type

2009-10-01 Thread Daniel Dai (JIRA)
Allow type merge between numerical type and non-numerical type
--

 Key: PIG-989
 URL: https://issues.apache.org/jira/browse/PIG-989
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.5.0
Reporter: Daniel Dai


Currently, we do not allow type merge between numerical and non-numerical 
types, and the error message is confusing. 

E.g., if you run:

a = load '1.txt' as (a0:chararray, a1:chararray);
b = load '2.txt' as (b0:long, b1:chararray);
c = join a by a0, b by b0;
dump c;

The error message is ERROR 1051: Cannot cast to Unknown

We should:
1. Allow the type merge between numerical type and non-numerical type
2. Or at least, provide a more meaningful error message to the user




[jira] Commented: (PIG-987) Zebra Column Group Access Control for A29 compliance and Performance

2009-10-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12761327#action_12761327
 ] 

Hadoop QA commented on PIG-987:
---

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12421038/A29_ColumnGroupSecurity.patch
  against trunk revision 820394.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 38 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

-1 release audit.  The applied patch generated 288 release audit warnings 
(more than the trunk's current 281 warnings).

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/54/testReport/
Release audit warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/54/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/54/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/54/console

This message is automatically generated.




[jira] Updated: (PIG-987) Zebra Column Group Access Control

2009-10-01 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-987:
-

Summary: Zebra Column Group Access Control  (was: Zebra Column Group Access 
Control for A29 compliance and Performance)




[jira] Updated: (PIG-987) Zebra Column Group Access Control

2009-10-01 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-987:
-

Attachment: (was: A29_ColumnGroupSecurity.patch)




[jira] Updated: (PIG-987) Zebra Column Group Access Control

2009-10-01 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-987:
-

Attachment: ColumnGroupSecurity.patch




[jira] Commented: (PIG-987) Zebra Column Group Access Control

2009-10-01 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12761365#action_12761365
 ] 

Yan Zhou commented on PIG-987:
--

During STORE, the storage hint is enhanced to take a new "secure by" section, 
e.g., 

[c1,c2] secure by group:secure perm:640

meaning the column group of columns c1 and c2 will belong to group "secure" 
with a file permission octal value of 0640, which in turn means read+write for 
the user, read for the group, and none for others.

After Zebra table creation, all files and directories inside the secured column 
group will have the same permission and group membership within the table.

If a column group is not secured, the default behavior is determined by the 
Hadoop Map/Reduce default permission and group membership set upon the new 
files and directories.
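
As a rough illustration of how such a hint decodes, the Python sketch below parses the single example above and renders the octal value in the familiar rwx form. The regular expression and the names parse_secure_hint and describe_mode are assumptions made for this example; the real Zebra storage-hint grammar may accept more than this.

```python
import re
import stat

# Assumed pattern, derived only from the one example hint in this comment.
HINT_RE = re.compile(
    r"\[(?P<cols>[^\]]*)\]\s+secure\s+by\s+"
    r"group:(?P<group>\S+)\s+perm:(?P<perm>[0-7]{3,4})"
)


def parse_secure_hint(hint):
    """Return (columns, group, mode) for a hint like the example above."""
    match = HINT_RE.fullmatch(hint.strip())
    if match is None:
        raise ValueError("unrecognized storage hint: %r" % hint)
    columns = [c.strip() for c in match.group("cols").split(",") if c.strip()]
    mode = int(match.group("perm"), 8)   # the permission is written in octal
    return columns, match.group("group"), mode


def describe_mode(mode):
    """Render permission bits the way `ls -l` shows them for a regular file."""
    return stat.filemode(stat.S_IFREG | mode)


columns, group, mode = parse_secure_hint("[c1,c2] secure by group:secure perm:640")
summary = describe_mode(mode)   # owner rw, group r, others nothing
```

Reading the permission as base 8 matters: 640 parsed as a decimal number would set a completely different set of bits.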





[jira] Updated: (PIG-592) schema inferred incorrectly

2009-10-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-592:
---

Status: Patch Available  (was: Open)

 schema inferred incorrectly
 ---

 Key: PIG-592
 URL: https://issues.apache.org/jira/browse/PIG-592
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.4.0
Reporter: Christopher Olston
 Fix For: 0.6.0

 Attachments: PIG-592-1.patch, PIG-592-2.patch


 A simple pig script, that never introduces any schema information:
 A = load 'foo';
 B = foreach (group A by $8) generate group, COUNT($1);
 C = load 'bar';   // ('bar' has two columns)
 D = join B by $0, C by $0;
 E = foreach D generate $0, $1, $3;
 Fails, complaining that $3 does not exist:
 java.io.IOException: Out of bound access. Trying to access non-existent 
 column: 3. Schema {B::group: bytearray,long,bytearray} has 3 column(s).
 Apparently Pig gets confused, and thinks it knows the schema for C (a single 
 bytearray column).




[jira] Updated: (PIG-960) Using Hadoop's optimized LineRecordReader for reading Tuples in PigStorage

2009-10-01 Thread Ankit Modi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankit Modi updated PIG-960:
---

Attachment: pig_rlr.patch

 Using Hadoop's optimized LineRecordReader for reading Tuples in PigStorage 
 ---

 Key: PIG-960
 URL: https://issues.apache.org/jira/browse/PIG-960
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Ankit Modi
 Attachments: pig_rlr.patch


 PigStorage's reading of Tuples (lines) can be optimized using Hadoop's 
 {{LineRecordReader}}.
 This can help in the following areas:
 - Improving the performance of reading Tuples (lines) in {{PigStorage}}
 - Any future improvements in line reading done in Hadoop's 
 {{LineRecordReader}} are automatically carried over to Pig
 Issues that are handled by this patch:
 - BZip uses internal buffers and positioning for determining the number of 
 bytes read. Hence buffering done by {{LineRecordReader}} has to be turned off
 - The current implementation of {{LocalSeekableInputStream}} does not 
 implement the {{available}} method. This method has to be implemented.




[jira] Updated: (PIG-960) Using Hadoop's optimized LineRecordReader for reading Tuples in PigStorage

2009-10-01 Thread Ankit Modi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankit Modi updated PIG-960:
---

Attachment: (was: pig_rlr.patch)




[jira] Commented: (PIG-960) Using Hadoop's optimized LineRecordReader for reading Tuples in PigStorage

2009-10-01 Thread Ankit Modi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12761376#action_12761376
 ] 

Ankit Modi commented on PIG-960:


Added the latest patch making PigLineRecordReader a wrapper only.




[jira] Updated: (PIG-989) Allow type merge between numerical type and non-numerical type

2009-10-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-989:
---

Status: Patch Available  (was: Open)

 Attachments: PIG-989-1.patch





[jira] Commented: (PIG-989) Allow type merge between numerical type and non-numerical type

2009-10-01 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12761390#action_12761390
 ] 

Daniel Dai commented on PIG-989:


There is no good way to add a unit test to it. I tested it manually.




[jira] Created: (PIG-990) Provide a way to pin LogicalOperator Options

2009-10-01 Thread Dmitriy V. Ryaboy (JIRA)
Provide a way to pin LogicalOperator Options


 Key: PIG-990
 URL: https://issues.apache.org/jira/browse/PIG-990
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Dmitriy V. Ryaboy
Priority: Minor
 Fix For: 0.6.0


This is a proactive patch, setting up the groundwork for adding an optimizer.

Some of the LogicalOperators have options. For example, LOJoin has a variety of 
join types (regular, fr, skewed, merge), which can be set by the user or chosen 
by a hypothetical optimizer.  If a user selects a join type, Pig philosophy 
guides us to always respect the user's choice and not explore alternatives.  
Therefore, we need a way to pin options.  
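
One way to picture pinning is sketched below in Python. This is not the API of the patch: the class OperatorOptions and its method names are invented for illustration, under the assumption that a user-supplied value should block any later change by an optimizer.

```python
class OperatorOptions:
    """Hypothetical option holder in which user choices are pinned."""

    def __init__(self):
        self._values = {}
        self._pinned = set()

    def set_by_user(self, name, value):
        # A user's explicit choice is recorded and pinned.
        self._values[name] = value
        self._pinned.add(name)

    def set_by_optimizer(self, name, value):
        # The optimizer may only change options the user left unpinned.
        if name in self._pinned:
            return False
        self._values[name] = value
        return True

    def is_pinned(self, name):
        return name in self._pinned

    def get(self, name, default=None):
        return self._values.get(name, default)


opts = OperatorOptions()
opts.set_by_user("join_type", "merge")
changed = opts.set_by_optimizer("join_type", "skewed")   # refused: pinned
tuned = opts.set_by_optimizer("parallel", 10)            # allowed: unpinned
```

Returning a boolean rather than raising lets an optimizer probe several alternatives and simply skip the ones the user has fixed.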




[jira] Commented: (PIG-592) schema inferred incorrectly

2009-10-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12761472#action_12761472
 ] 

Hadoop QA commented on PIG-592:
---

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12421093/PIG-592-3.patch
  against trunk revision 820394.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 12 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/56/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/56/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/56/console

This message is automatically generated.
