date:20091006


[ 
https://issues.apache.org/jira/browse/PIG-993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762707#action_12762707
 ] 

Yan Zhou commented on PIG-993:
--

The patch attached by me (Yan Zhou) was based upon Raghu's patch minus some 
unrelated changes.

 [zebra] Ability to drop a column group in a table
 --

 Key: PIG-993
 URL: https://issues.apache.org/jira/browse/PIG-993
 Project: Pig
  Issue Type: Bug
Reporter: Raghu Angadi
Assignee: Raghu Angadi
 Attachments: DropColumnGroupExample.java, zebra-drop-cg.patch, 
 zebra-drop-cg.patch


 A Zebra table is stored as multiple sub-tables, each containing a set of 
 columns called a column group (CG). The user specifies how these columns are 
 grouped through the _storage hint_ when creating a table.
 For some large tables, users may need to remove a set of columns and retain 
 the rest. This jira provides a way for users to delete an entire column 
 group. 
 The following comments will have more details on the API and the semantics. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-983) PERFORMANCE: multi-query optimization on multiple group bys following a join or cogroup

2009-10-06 Thread Pradeep Kamath (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-983:
---

   Resolution: Fixed
Fix Version/s: 0.6.0
 Hadoop Flags: [Reviewed]
   Status: Resolved  (was: Patch Available)

+1
Patch committed, thanks Richard!

 PERFORMANCE: multi-query optimization on multiple group bys following a join 
 or cogroup
 ---

 Key: PIG-983
 URL: https://issues.apache.org/jira/browse/PIG-983
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.6.0

 Attachments: PIG-983.patch


 The current multi-query optimizer works well with pig scripts like this one:
 {code}
 data = LOAD 'input' AS (a:chararray, b:int, c:int);
 A = GROUP data BY b;
 B = GROUP data BY c;
 C = FOREACH A GENERATE group, COUNT(data);
 D = FOREACH B GENERATE group, SUM(data.b);
 STORE C INTO 'output1';
 STORE D INTO 'output2';
 {code}
 In this case the original three Map-Reduce jobs are merged into one MR job by 
 the optimizer.
 The current optimizer, however, won't reduce the number of MR jobs for the 
 scripts in which multiple group bys follow a join or a cogroup, such as this 
 one:
 {code}
 data1 = LOAD 'input1' AS (a1:chararray, b1:int, c1:int);
 data2 = LOAD 'input2' AS (a2:chararray, b2:int, c2:int);
 A = JOIN data1 BY a1, data2 BY a2;
 B = GROUP A BY data1::b1;
 C = GROUP A BY data2::c2;
 D = FOREACH B GENERATE group, COUNT(A);
 E = FOREACH C GENERATE group, SUM(A.data2::b2);
 STORE D INTO 'output1';
 STORE E INTO 'output2';
 {code}
 Three MR jobs are still needed to run this script.
 The multi-query optimizer should handle this kind of script by merging the 
 group bys and reducing the overall number of MR jobs. 


[jira] Commented: (PIG-993) [zebra] Ability to drop a column group in a table


[ 
https://issues.apache.org/jira/browse/PIG-993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762731#action_12762731
 ] 

Yan Zhou commented on PIG-993:
--

Patch Reviewed +1


[jira] Commented: (PIG-987) [zebra] Zebra Column Group Access Control

2009-10-06 Thread Gaurav Jain (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762733#action_12762733
 ] 

Gaurav Jain commented on PIG-987:
-


Patch Reviewed

+1

 [zebra] Zebra Column Group Access Control
 -

 Key: PIG-987
 URL: https://issues.apache.org/jira/browse/PIG-987
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.6.0
Reporter: Yan Zhou
Assignee: Yan Zhou
 Attachments: ColumnGroupSecurity.patch


 Access Control: when processes try to read from the column groups, Zebra 
 should be able to handle allowed vs. disallowed user/application accesses.  
 The security is eventually granted by the corresponding HDFS security of the 
 data stored.
 Expected behavior when column group permissions are set:
 When a user selects only columns that they do not have permission to 
 access, Zebra should return an error with the message "Error #: Permission 
 denied for accessing column <column name or names>".
 Access control applies to an entire column group, so all columns in a column 
 group have the same permissions. 
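
 The group-level rule above can be sketched roughly as follows. This is a
 hypothetical model, not Zebra's implementation: the class
 ColumnGroupAccessDemo, its methods, and the use of SecurityException are all
 assumptions made for illustration. The issue does not say what happens when
 a selection mixes readable and unreadable columns; returning only the
 readable ones is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of group-level access control; this is not Zebra code.
final class ColumnGroupAccessDemo {
    private final Map<String, String> columnToGroup = new HashMap<>();
    private final Set<String> readableGroups = new HashSet<>();

    void addColumn(String column, String group) {
        columnToGroup.put(column, group);
    }

    void allowGroup(String group) {
        readableGroups.add(group);
    }

    // A column is readable exactly when its whole column group is readable.
    boolean canRead(String column) {
        return readableGroups.contains(columnToGroup.get(column));
    }

    // Fails when none of the selected columns is readable.
    // Assumption: for a mixed selection, only the readable columns are returned.
    List<String> select(List<String> columns) {
        List<String> readable = new ArrayList<>();
        for (String c : columns) {
            if (canRead(c)) {
                readable.add(c);
            }
        }
        if (!columns.isEmpty() && readable.isEmpty()) {
            throw new SecurityException(
                "Permission denied for accessing column " + String.join(", ", columns));
        }
        return readable;
    }
}
```

 Because permissions attach to the group, two columns in the same group can
 never differ in readability in this model.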


[jira] Commented: (PIG-991) [zebra] A few minor bugs as described in the Description section

2009-10-06 Thread Gaurav Jain (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762734#action_12762734
 ] 

Gaurav Jain commented on PIG-991:
-



Patch Reviewed 
+1

 [zebra] A few minor bugs as described in the Description section
 

 Key: PIG-991
 URL: https://issues.apache.org/jira/browse/PIG-991
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.4.0
Reporter: Yan Zhou
Assignee: Yan Zhou
Priority: Minor
 Fix For: 0.6.0

 Attachments: Bugs.patch


 1) lzo2 was used as the compressor name for the LZO compression algorithm; 
 it should be lzo instead;
 2) the default compression is changed from lzo to gz for gzip;
 3) In JAVACC file SchemaParser.jjt, the package name was wrong using the old 
 package org.apache.pig.table.types;
 4) in build.xml, two new javacc targets are added to generate 
 TableSchemaParser and TableStorageParser java codes;
 5) Support of column group security ( 
 https://issues.apache.org/jira/browse/PIG-987 ) lacked support of the 
 dumpinfo method: the groups and permissions were not displayed. Note that as 
 a consequence, the patch herein must be applied after that of JIRA987.
 6) and 7) a couple of issues reported in Jira917.


[jira] Commented: (PIG-922) Logical optimizer: push up project

2009-10-06 Thread Pradeep Kamath (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762754#action_12762754
 ] 

Pradeep Kamath commented on PIG-922:


Some comments on new patch:
PruneColumns.java:
| 274 if (relevantFields!=null && relevantFields.needAllFields())
| 275 {
| 276     requiredInputFieldsList.set(j, new RequiredFields(true));
| 277     continue;
| 278 }
| 279
| 280 // Mapping output map keys to input map keys
| 281 //
| 282 if (rlo instanceof LOCogroup)
| 283 {
| 284     if (relevantFields!=null && relevantFields.needAllFields())
| 285     {
| 286         for (Pair<Integer, Integer> pair : relevantFields.getFields())
| 287             relevantFields.setMapKeysInfo(pair.first, pair.second,
| 288                 new MapKeysInfo(true));
| 289     }
| 290 }

Wouldn't the last if be redundant? It has the same condition as the first if, 
and when the first if is true the loop continues and never reaches the last if.
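
A reduced sketch of that control flow, to make the point concrete. The names 
here (RedundantCheckDemo, needAll, isCogroup) are hypothetical stand-ins and 
not the PruneColumns.java code; the sketch only shows that a branch guarded by 
the same condition as an earlier continue is never entered.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reduction of the quoted control flow; not the real PruneColumns code.
final class RedundantCheckDemo {

    // Returns a trace of which branch each loop iteration took.
    static List<String> trace(boolean[] needAll, boolean isCogroup) {
        List<String> out = new ArrayList<>();
        for (int j = 0; j < needAll.length; j++) {
            if (needAll[j]) {
                out.add("first-if");
                continue; // skips the rest of this iteration
            }
            if (isCogroup) {
                if (needAll[j]) {
                    // Never entered: needAll[j] is false whenever control gets here.
                    out.add("inner-if");
                }
            }
            out.add("fallthrough");
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(trace(new boolean[] {true, false}, true));
    }
}
```

Whatever the inputs, the trace never contains "inner-if".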

line numbers per old code:

  326 // Collect required map keys in foreach plan here.
  327 // This is the only logical operator that we collect map keys
  328 // which are introduced by the operator here.

[jira] Commented: (PIG-976) Multi-query optimization throws ClassCastException

2009-10-06 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762803#action_12762803
 ] 

Hadoop QA commented on PIG-976:
---

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12421451/PIG-976.patch
  against trunk revision 822382.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 7 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/61/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/61/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/61/console

This message is automatically generated.

 Multi-query optimization throws ClassCastException
 --

 Key: PIG-976
 URL: https://issues.apache.org/jira/browse/PIG-976
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.4.0
Reporter: Ankur
Assignee: Richard Ding
 Attachments: PIG-976.patch


 Multi-query optimization fails to merge 2 branches when 1 is a result of 
 Group By ALL and another is a result of Group By field1 where field 1 is of 
 type long. Here is the script that fails with multi-query on.
 data = LOAD 'test' USING PigStorage('\t') AS (a:long, b:double, c:double); 
 A = GROUP data ALL;
 B = FOREACH A GENERATE SUM(data.b) AS sum1, SUM(data.c) AS sum2;
 C = FOREACH B GENERATE (sum1/sum2) AS rate; 
 STORE C INTO 'result1';
 D = GROUP data BY a; 
 E = FOREACH D GENERATE group AS a, SUM(data.b), SUM(data.c);
 STORE E into 'result2';
  
 Here is the exception from the logs
 java.lang.ClassCastException: org.apache.pig.data.DefaultTuple cannot be cast to org.apache.pig.data.DataBag
   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:399)
   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:180)
   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.processInput(POUserFunc.java:145)
   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:197)
   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:235)
   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:254)
   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:240)
   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.runPipeline(PODemux.java:264)
   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.getNext(PODemux.java:254)
   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:196)
   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:174)
   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:63)
   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(MapTask.java:906)
   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:786)
   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:228)
   at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2206)


[jira] Commented: (PIG-994) Provide 'append' keyword to allow appending to different dataset once the feature is available in Hadoop

2009-10-06 Thread Alan Gates (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762809#action_12762809
 ] 

Alan Gates commented on PIG-994:


Should it be a separate keyword or an option on store?  I like it better as an 
option for store, as it can then create or append depending on the file's 
existence.  So it might look like:

{code}
store z into 'bla' append
{code}
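
The create-or-append behaviour suggested here is similar to how ordinary file 
APIs combine the two modes. As a rough analogy only, the sketch below uses the 
local filesystem through java.nio rather than HDFS or Pig; the class 
StoreOrAppendDemo and the helper storeOrAppend are hypothetical names.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.List;

// Analogy on the local filesystem; this is not Pig or HDFS code.
final class StoreOrAppendDemo {

    // Creates the file if it is missing, otherwise appends to the existing file.
    static void storeOrAppend(Path target, List<String> lines) throws IOException {
        Files.write(target, lines, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempFile("bla", ".txt");
        Files.delete(target); // start from a path that does not exist yet
        storeOrAppend(target, Arrays.asList("first"));  // creates the file
        storeOrAppend(target, Arrays.asList("second")); // appends to it
        System.out.println(Files.readAllLines(target, StandardCharsets.UTF_8));
        Files.delete(target);
    }
}
```

A single option covering both cases means the caller does not have to check 
for existence first.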



 Provide 'append' keyword to allow appending to different dataset once the 
 feature is available in Hadoop
 ---

 Key: PIG-994
 URL: https://issues.apache.org/jira/browse/PIG-994
 Project: Pig
  Issue Type: New Feature
  Components: impl
Affects Versions: 0.4.0
 Environment: Grid clusters
Reporter: Rekha
Priority: Minor

 Provide an 'append' keyword to allow appending to a different dataset on pig 
 0.5.0, as it now runs on hadoop 0.20 (which has the append feature).


[jira] Reopened: (PIG-948) [Usability] Relating pig script with MR jobs


 [ 
https://issues.apache.org/jira/browse/PIG-948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai reopened PIG-948:



I see lots of the following error in local hadoop mode after this patch:
ERROR 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- Exception occured while trying to retrieve extra information about job in 
MapReduceLauncher.String index out of range: -1
We should suppress this message.

 [Usability] Relating pig script with MR jobs
 

 Key: PIG-948
 URL: https://issues.apache.org/jira/browse/PIG-948
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.4.0
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
Priority: Minor
 Fix For: 0.6.0

 Attachments: pig-948-2.patch, pig-948-3.patch, pig-948.patch


 Currently it's hard to relate a pig script to specific MR jobs. 
 In a loaded cluster with multiple simultaneous job submissions, it's not easy 
 to figure out which specific MR jobs were launched for a given pig script. If 
 Pig can provide this info, it will be useful for debugging and monitoring the 
 jobs resulting from a pig script.
 At the very least, Pig should be able to provide the user with the following 
 information:
 1) Job id of the launched job.
 2) Complete web url of the jobtracker running this job. 


[jira] Commented: (PIG-987) [zebra] Zebra Column Group Access Control


[ 
https://issues.apache.org/jira/browse/PIG-987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762812#action_12762812
 ] 

Raghu Angadi commented on PIG-987:
--

I tried to commit this patch. 'ant test' says all the tests fail, whereas only 
one or two tests fail without the patch.

Does Hudson actually run Zebra tests?



[jira] Updated: (PIG-991) [zebra] A few minor bugs as described in the Description section


 [ 
https://issues.apache.org/jira/browse/PIG-991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raghu Angadi updated PIG-991:
-

Release Note:   (was: Patch should be applied after that of Jira987.)

bq. Patch should be applied after that of Jira987.

[moved above comment from 'Release Notes' to this comment].


[jira] Updated: (PIG-948) [Usability] Relating pig script with MR jobs


 [ 
https://issues.apache.org/jira/browse/PIG-948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-948:
---

Attachment: PIG-948-4.patch

 [Usability] Relating pig script with MR jobs
 

 Key: PIG-948
 URL: https://issues.apache.org/jira/browse/PIG-948
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.4.0
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
Priority: Minor
 Fix For: 0.6.0

 Attachments: pig-948-2.patch, pig-948-3.patch, PIG-948-4.patch, 
 pig-948.patch



[jira] Updated: (PIG-948) [Usability] Relating pig script with MR jobs


 [ 
https://issues.apache.org/jira/browse/PIG-948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-948:
---

Status: Patch Available  (was: Reopened)


[jira] Created: (PIG-996) [zebra] Zebra build script does not have findbugs and clover targets.

2009-10-06 Thread Chao Wang (JIRA)

[zebra] Zebra build script does not have findbugs and clover targets.
-

 Key: PIG-996
 URL: https://issues.apache.org/jira/browse/PIG-996
 Project: Pig
  Issue Type: Bug
  Components: build
Reporter: Chao Wang
Assignee: Chao Wang


The Zebra build script does not have findbugs and clover targets, which causes 
the Hudson build process to fail on Zebra.

This jira fixes that by adding these two targets.


[jira] Created: (PIG-997) [zebra] Sorted Table Support by Zebra

[zebra] Sorted Table Support by Zebra
-

 Key: PIG-997
 URL: https://issues.apache.org/jira/browse/PIG-997
 Project: Pig
  Issue Type: New Feature
Reporter: Yan Zhou
 Fix For: 0.6.0


This new feature lets Zebra support sorted data in storage. As a storage 
library, Zebra will not sort the data itself, but it will support the creation 
and use of sorted data either through Pig or through map/reduce tasks that use 
Zebra as the storage format.

The sorted table keeps the data in a totally sorted manner across all TFiles 
created by potentially all mappers or reducers.

For sorted data creation through Pig's STORE operator, if the input data is 
sorted through ORDER BY, the new Zebra table will be marked as sorted on the 
sorted columns.

For sorted data creation through Map/Reduce tasks, three new static methods of 
the BasicTableOutput class will be provided to help the user achieve the goal. 
setSortInfo allows the user to specify the sorted columns of the input tuple 
to be stored; getSortKeyGenerator and getSortKey help the user generate a key 
that Zebra accepts as a sort key, based upon the schema, the sorted columns 
and the input tuple.

For sorted data read through Pig's LOAD operator, pass the string "sorted" as 
an extra argument to the TableLoader constructor to ask for a sorted table to 
be loaded.

For sorted data read through Map/Reduce tasks, a new static method of 
TableInputFormat class, requireSortedTable, can be called to ask for a sorted 
table to be read. Additionally, an overloaded version of the new method can be 
called to ask for a sorted table on specified sort columns and comparator.

For this release, sorted tables only support sorting in ascending order, not 
in descending order. In addition, the sort keys must be of simple types, not 
complex types such as RECORD, COLLECTION and MAP. 

Multiple-key sorting is supported, but the order of the sort keys is 
significant: the first sort column is the primary sort key, the second is the 
secondary sort key, and so on.
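
To illustrate why the order of the sort keys matters, here is a small sketch 
using plain Java comparators. It does not use the Zebra API; the Row class and 
its fields a and b are hypothetical, and only ascending order is shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustration only: plain Java, not Zebra. Row and its fields are hypothetical.
final class SortKeyOrderDemo {

    static final class Row {
        final int a;
        final String b;

        Row(int a, String b) {
            this.a = a;
            this.b = b;
        }

        @Override
        public String toString() {
            return a + ":" + b;
        }
    }

    // Ascending sort with 'a' as the primary key and 'b' as the secondary key.
    static List<Row> sortByAThenB(List<Row> rows) {
        List<Row> copy = new ArrayList<>(rows);
        copy.sort(Comparator.comparingInt((Row r) -> r.a).thenComparing(r -> r.b));
        return copy;
    }

    // Same columns with the key order swapped: 'b' primary, 'a' secondary.
    static List<Row> sortByBThenA(List<Row> rows) {
        List<Row> copy = new ArrayList<>(rows);
        copy.sort(Comparator.comparing((Row r) -> r.b).thenComparingInt(r -> r.a));
        return copy;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
            new Row(2, "x"), new Row(1, "y"), new Row(1, "z"), new Row(2, "w"));
        System.out.println(sortByAThenB(rows));
        System.out.println(sortByBThenA(rows));
    }
}
```

The two calls return the same rows in different orders, so a table sorted on 
(a, b) cannot be treated as sorted on (b, a).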

In this release, the sort keys are stored along with the sort columns from 
which they were created, resulting in some data storage redundancy.


[jira] Commented: (PIG-987) [zebra] Zebra Column Group Access Control


[ 
https://issues.apache.org/jira/browse/PIG-987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762824#action_12762824
 ] 

Yan Zhou commented on PIG-987:
--

I checked Hudson test results and they do not seem to run Zebra.

But I ran ant test in contrib/zebra directory and they passed. What errors 
did you get? I suspect some env issue at your end.


[jira] Updated: (PIG-987) [zebra] Zebra Column Group Access Control

[
https://issues.apache.org/jira/browse/PIG-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Raghu Angadi updated PIG-987:
-

Attachment: TEST-org.apache.hadoop.zebra.mapred.TestCheckin.txt

I am attaching {{mapred.TestCheckin.txt}} that passes without the patch.

btw, not all tests pass even without the patch. What is the environment
required? I did a fresh check out, and ran 'ant test'.

I guess the test failures on trunk are related to lzo. But I didn't expect
more failures with the patch.

Looks like PIG-991 removes the lzo dependency. I will try with that patch
included.

[zebra] Zebra Column Group Access Control
-

Key: PIG-987
URL: https://issues.apache.org/jira/browse/PIG-987
Project: Pig
Issue Type: New Feature
Affects Versions: 0.6.0
Reporter: Yan Zhou
Assignee: Yan Zhou
Attachments: ColumnGroupSecurity.patch,
TEST-org.apache.hadoop.zebra.mapred.TestCheckin.txt


[jira] Updated: (PIG-987) [zebra] Zebra Column Group Access Control

2009-10-06 Thread Chao Wang (JIRA)

[
https://issues.apache.org/jira/browse/PIG-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Chao Wang updated PIG-987:
--

I ran into the same issue.

I did a fresh checkout from Apache trunk and ran ant test; 14 test cases
failed.

They are caused by an incompatible exception type between pig and zebra. It
seems pig has already moved on with the change (IOException changed to
IndexOutofBoundException), but zebra is a bit behind on this.

[zebra] Zebra Column Group Access Control
-


[jira] Commented: (PIG-987) [zebra] Zebra Column Group Access Control


[ 
https://issues.apache.org/jira/browse/PIG-987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762829#action_12762829
 ] 

Raghu Angadi commented on PIG-987:
--

Not sure if this is related to PIG. When I applied PIG-991 over this, the tests 
passed (except the ones that fail on trunk).


 [zebra] Zebra Column Group Access Control
 -

 Key: PIG-987
 URL: https://issues.apache.org/jira/browse/PIG-987
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.6.0
Reporter: Yan Zhou
Assignee: Yan Zhou
 Attachments: ColumnGroupSecurity.patch, 
 TEST-org.apache.hadoop.zebra.mapred.TestCheckin.txt



Build failed in Hudson: Pig-trunk #580

2009-10-06 Thread Apache Hudson Server

See http://hudson.zones.apache.org/hudson/job/Pig-trunk/580/changes

Changes:

[pradeepkth] PERFORMANCE: multi-query optimization on multiple group bys 
following a join or cogroup (rding via pradeepkth)

--
[...truncated 167006 lines...]
[junit] 09/10/07 01:16:28 INFO hdfs.StateChange: BLOCK* 
NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:60366 is added to 
blk_6792183205926187173_1014 size 1859
[junit] 09/10/07 01:16:28 INFO datanode.DataNode: PacketResponder 2 for 
block blk_6792183205926187173_1014 terminating
[junit] 09/10/07 01:16:28 INFO hdfs.StateChange: BLOCK* 
NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:51295 is added to 
blk_6792183205926187173_1014 size 1859
[junit] 09/10/07 01:16:28 INFO hdfs.StateChange: DIR* 
NameSystem.completeFile: file 
/tmp/hadoop-hudson/mapred/system/job_20091007011555276_0002/job.split is closed 
by DFSClient_-1468843592
[junit] 09/10/07 01:16:28 INFO FSNamesystem.audit: ugi=hudson,hudson
ip=/127.0.0.1   cmd=create  
src=/tmp/hadoop-hudson/mapred/system/job_20091007011555276_0002/job.xml 
dst=nullperm=hudson:supergroup:rw-r--r--
[junit] 09/10/07 01:16:28 INFO FSNamesystem.audit: ugi=hudson,hudson
ip=/127.0.0.1   cmd=setPermission   
src=/tmp/hadoop-hudson/mapred/system/job_20091007011555276_0002/job.xml 
dst=nullperm=hudson:supergroup:rw-r--r--
[junit] 09/10/07 01:16:28 INFO hdfs.StateChange: BLOCK* 
NameSystem.allocateBlock: 
/tmp/hadoop-hudson/mapred/system/job_20091007011555276_0002/job.xml. 
blk_-6478926164587938265_1015
[junit] 09/10/07 01:16:28 INFO datanode.DataNode: Receiving block 
blk_-6478926164587938265_1015 src: /127.0.0.1:34216 dest: /127.0.0.1:59951
[junit] 09/10/07 01:16:28 INFO datanode.DataNode: Receiving block 
blk_-6478926164587938265_1015 src: /127.0.0.1:35478 dest: /127.0.0.1:51295
[junit] 09/10/07 01:16:28 INFO datanode.DataNode: Receiving block 
blk_-6478926164587938265_1015 src: /127.0.0.1:45552 dest: /127.0.0.1:49650
[junit] 09/10/07 01:16:28 INFO DataNode.clienttrace: src: /127.0.0.1:45552, 
dest: /127.0.0.1:49650, bytes: 48254, op: HDFS_WRITE, cliID: 
DFSClient_-1468843592, srvID: DS-1821165369-127.0.1.1-49650-1254878155200, 
blockid: blk_-6478926164587938265_1015
[junit] 09/10/07 01:16:28 INFO datanode.DataNode: PacketResponder 0 for 
block blk_-6478926164587938265_1015 terminating
[junit] 09/10/07 01:16:28 INFO hdfs.StateChange: BLOCK* 
NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:49650 is added to 
blk_-6478926164587938265_1015 size 48254
[junit] 09/10/07 01:16:28 INFO DataNode.clienttrace: src: /127.0.0.1:35478, 
dest: /127.0.0.1:51295, bytes: 48254, op: HDFS_WRITE, cliID: 
DFSClient_-1468843592, srvID: DS-1845303905-127.0.1.1-51295-1254878153423, 
blockid: blk_-6478926164587938265_1015
[junit] 09/10/07 01:16:28 INFO hdfs.StateChange: BLOCK* 
NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:51295 is added to 
blk_-6478926164587938265_1015 size 48254
[junit] 09/10/07 01:16:28 INFO datanode.DataNode: PacketResponder 1 for 
block blk_-6478926164587938265_1015 terminating
[junit] 09/10/07 01:16:28 INFO DataNode.clienttrace: src: /127.0.0.1:34216, 
dest: /127.0.0.1:59951, bytes: 48254, op: HDFS_WRITE, cliID: 
DFSClient_-1468843592, srvID: DS-632073239-127.0.1.1-59951-1254878154621, 
blockid: blk_-6478926164587938265_1015
[junit] 09/10/07 01:16:28 INFO hdfs.StateChange: BLOCK* 
NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:59951 is added to 
blk_-6478926164587938265_1015 size 48254
[junit] 09/10/07 01:16:28 INFO datanode.DataNode: PacketResponder 2 for 
block blk_-6478926164587938265_1015 terminating
[junit] 09/10/07 01:16:28 INFO hdfs.StateChange: DIR* 
NameSystem.completeFile: file 
/tmp/hadoop-hudson/mapred/system/job_20091007011555276_0002/job.xml is closed 
by DFSClient_-1468843592
[junit] 09/10/07 01:16:28 INFO FSNamesystem.audit: ugi=hudson,hudson
ip=/127.0.0.1   cmd=open
src=/tmp/hadoop-hudson/mapred/system/job_20091007011555276_0002/job.xml 
dst=nullperm=null
[junit] 09/10/07 01:16:28 INFO DataNode.clienttrace: src: /127.0.0.1:59951, 
dest: /127.0.0.1:34219, bytes: 48634, op: HDFS_READ, cliID: 
DFSClient_-1468843592, srvID: DS-632073239-127.0.1.1-59951-1254878154621, 
blockid: blk_-6478926164587938265_1015
[junit] 09/10/07 01:16:28 INFO FSNamesystem.audit: ugi=hudson,hudson
ip=/127.0.0.1   cmd=open
src=/tmp/hadoop-hudson/mapred/system/job_20091007011555276_0002/job.jar 
dst=nullperm=null
[junit] 09/10/07 01:16:28 INFO DataNode.clienttrace: src: /127.0.0.1:49650, 
dest: /127.0.0.1:45554, bytes: 2482874, op: HDFS_READ, cliID: 
DFSClient_-1468843592, srvID: DS-1821165369-127.0.1.1-49650-1254878155200, 
blockid: blk_590227262299005753_1013
[junit] 09/10/07 01:16:28 INFO mapred.JobTracker: Initializing 
job_20091007011555276_0002
[junit] 09/10/07 01:16:28 INFO

[jira] Updated: (PIG-993) [zebra] Abitlity to drop a column group in a table


 [ 
https://issues.apache.org/jira/browse/PIG-993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raghu Angadi updated PIG-993:
-

Fix Version/s: 0.6.0

 [zebra] Abitlity to drop a column group in a table
 --

 Key: PIG-993
 URL: https://issues.apache.org/jira/browse/PIG-993
 Project: Pig
  Issue Type: Bug
Reporter: Raghu Angadi
Assignee: Raghu Angadi
 Fix For: 0.6.0

 Attachments: DropColumnGroupExample.java, zebra-drop-cg.patch, 
 zebra-drop-cg.patch


 A Zebra table is stored as multiple sub-tables, each containing a set of 
 columns called a column group (CG). The user specifies how these columns are 
 grouped while creating a table through the _storage hint_.
 For some of the large tables, it might be necessary for users to remove a set 
 of columns and retain the rest. This jira provides a way for users to delete 
 an entire column group. 
 The following comments will have more details on the API and the semantics. 
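
 As a rough illustration of the idea above (this is not the Zebra API, which 
 the comments on the issue define), a table can be modelled as a mapping from 
 column group name to its columns. Dropping a group removes all of its columns 
 and leaves the rest untouched. The names ToyTable and drop_column_group are 
 hypothetical.

```python
class ToyTable:
    """Toy in-memory model of a table split into column groups (CGs)."""

    def __init__(self, column_groups):
        # column_groups: dict mapping CG name -> list of column names
        self.column_groups = {name: list(cols) for name, cols in column_groups.items()}

    def columns(self):
        """Return all columns still present, in column group order."""
        return [c for cols in self.column_groups.values() for c in cols]

    def drop_column_group(self, name):
        """Remove an entire column group; the other groups are retained."""
        if name not in self.column_groups:
            raise KeyError("no such column group: " + name)
        del self.column_groups[name]


table = ToyTable({"cg1": ["a", "b"], "cg2": ["c"], "cg3": ["d", "e"]})
table.drop_column_group("cg2")
print(table.columns())
```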

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-987) [zebra] Zebra Column Group Access Control


[ 
https://issues.apache.org/jira/browse/PIG-987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762854#action_12762854
 ] 

Yan Zhou commented on PIG-987:
--

That is because this patch exposes the environment problem with using lzo as 
compression, which PIG-991 eventually fixes.

Can you commit PIG-991's patch along with this one? What are the failures from trunk? 
What are the error messages?

 [zebra] Zebra Column Group Access Control
 -

 Key: PIG-987
 URL: https://issues.apache.org/jira/browse/PIG-987
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.6.0
Reporter: Yan Zhou
Assignee: Yan Zhou
 Attachments: ColumnGroupSecurity.patch, 
 TEST-org.apache.hadoop.zebra.mapred.TestCheckin.txt


 Access Control: when processes try to read from the column groups, Zebra 
 should be able to handle allowed vs. disallowed user/application accesses.  
 The security is ultimately granted by the corresponding HDFS security of the 
 data stored.
 Expected behavior when column group permissions are set:
 When a user selects only columns that they do not have permission to 
 access, Zebra should return an error with the message "Error #: Permission denied 
 for accessing column <column name or names>".
 Access control applies to an entire column group, so all columns in a column 
 group have the same permissions. 
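
 The expected behavior above can be sketched roughly as follows. This is a toy 
 model, not Zebra code: the function name check_column_access and its arguments 
 are hypothetical, and the set of readable groups stands in for whatever HDFS 
 reports. The description does not say what happens for a mixed selection; 
 returning only the readable columns in that case is purely an assumption.

```python
def check_column_access(selected, column_groups, readable_groups):
    """Return the selected columns the caller may read.

    selected:        list of requested column names
    column_groups:   dict mapping CG name -> list of column names
    readable_groups: set of CG names the caller is allowed to read
    """
    # Permissions are per column group, so each column inherits its group's permission.
    owner = {col: cg for cg, cols in column_groups.items() for col in cols}
    unknown = [c for c in selected if c not in owner]
    if unknown:
        raise KeyError("unknown column(s): " + ", ".join(unknown))

    allowed = [c for c in selected if owner[c] in readable_groups]
    # Only the "nothing selected is readable" case is specified as an error.
    if selected and not allowed:
        raise PermissionError(
            "Permission denied for accessing column " + ", ".join(selected))
    # Assumption: a mixed selection silently yields the readable subset.
    return allowed


groups = {"cg1": ["a", "b"], "cg2": ["c"]}
print(check_column_access(["a", "c"], groups, {"cg1"}))
```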


[jira] Commented: (PIG-948) [Usability] Relating pig script with MR jobs

2009-10-06 Thread Hadoop QA (JIRA)

[
https://issues.apache.org/jira/browse/PIG-948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762861#action_12762861
]

Hadoop QA commented on PIG-948:
---

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12421472/PIG-948-4.patch
against trunk revision 822382.

+1 @author. The patch does not contain any @author tags.

-1 tests included. The patch doesn't appear to include any new or modified
tests.
Please justify why no tests are needed for this patch.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac
compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of
release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results:
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/62/testReport/
Findbugs warnings:
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/62/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output:
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/62/console

This message is automatically generated.

[Usability] Relating pig script with MR jobs

Key: PIG-948
URL: https://issues.apache.org/jira/browse/PIG-948
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.4.0
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
Priority: Minor
Fix For: 0.6.0

Attachments: pig-948-2.patch, pig-948-3.patch, PIG-948-4.patch,
pig-948.patch

Currently it's hard to find a way to relate a pig script to a specific MR job.
In a loaded cluster with multiple simultaneous job submissions, it's not easy
to figure out which specific MR jobs were launched for a given pig script. If
Pig can provide this info, it will be useful for debugging and monitoring the jobs
resulting from a pig script.
At the very least, Pig should be able to provide the user with the following
information:
1) Job id of the launched job.
2) Complete web url of the jobtracker running this job.
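
A minimal sketch of the kind of report this asks for, assuming the jobtracker
web UI serves job pages at jobdetails.jsp?jobid=<id> (treat that path as an
assumption to verify against your Hadoop version). The host name, script name,
and job id below are made up for illustration, and job_url and report are
hypothetical helper names, not part of Pig.

```python
def job_url(jobtracker_http_address, job_id):
    """Build the web URL for one job from the jobtracker's host:port."""
    # Assumed URL layout of the jobtracker web UI.
    return "http://%s/jobdetails.jsp?jobid=%s" % (jobtracker_http_address, job_id)


def report(script_name, job_ids, jobtracker_http_address):
    """Return one line per launched job: script, job id, and job URL."""
    return ["%s\t%s\t%s" % (script_name, jid, job_url(jobtracker_http_address, jid))
            for jid in job_ids]


for line in report("myscript.pig", ["job_200910061200_0001"], "jt.example.com:50030"):
    print(line)
```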


[jira] Commented: (PIG-987) [zebra] Zebra Column Group Access Control


[ 
https://issues.apache.org/jira/browse/PIG-987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762871#action_12762871
 ] 

Raghu Angadi commented on PIG-987:
--

Even with PIG-991 included, I am seeing lzo related failures. Could you run 
tests on a clean checkout? If you didn't see the errors before then you 
probably have lzo set up in your environment, which is not a requirement. 



 [zebra] Zebra Column Group Access Control
 -

 Key: PIG-987
 URL: https://issues.apache.org/jira/browse/PIG-987
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.6.0
Reporter: Yan Zhou
Assignee: Yan Zhou
 Attachments: ColumnGroupSecurity.patch, 
 TEST-org.apache.hadoop.zebra.mapred.TestCheckin.txt


 Access Control: when processes try to read from the column groups, Zebra 
 should be able to handle allowed vs. disallowed user/application accesses.  
 The security is ultimately granted by the corresponding HDFS security of the 
 data stored.
 Expected behavior when column group permissions are set:
 When a user selects only columns that they do not have permission to 
 access, Zebra should return an error with the message "Error #: Permission denied 
 for accessing column <column name or names>".
 Access control applies to an entire column group, so all columns in a column 
 group have the same permissions. 


[jira] Updated: (PIG-994) Provide 'append' keyword to allow appending to diferent dataset once the feature is available in Hadoop

2009-10-06 Thread Rekha (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rekha updated PIG-994:
--

Tags: append, update, hadoop 0.20  (was: append, hadoop 0.20)

 Provide 'append' keyword to allow appending to diferent dataset once the 
 feature is available in Hadoop
 ---

 Key: PIG-994
 URL: https://issues.apache.org/jira/browse/PIG-994
 Project: Pig
  Issue Type: New Feature
  Components: impl
Affects Versions: 0.4.0
 Environment: Grid clusters
Reporter: Rekha
Priority: Minor

 Provide an 'append' keyword to allow appending to a different dataset in Pig 0.5.0, 
 since it is now on Hadoop 0.20 (which has an append feature).


[jira] Updated: (PIG-994) Provide 'append' keyword to allow appending to diferent dataset once the feature is available in Hadoop

2009-10-06 Thread Rekha (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rekha updated PIG-994:
--


Thanks Alan.

I am mostly for the 'option on store' approach, and definitely so if the two are 
mutually exclusive possibilities.

However, for argument's sake, a keyword approach could be considered in addition.

This is because I am hoping append will open the door to easily patching 
an update feature into the Pig API along similar lines (and hopefully as part of 
the same jira ticket).
My idea of update is a syntax like: update DS1 by (join_keys) from DS2 by 
(join_keys) parallel $PARALLEL
This would update dataset1 (DS1) with data from dataset2 (DS2) based on key joins.

{code}
update b by (join_key1, join_key2) from c by (join_key1, join_key2); // this will update the DS b directly
// or alternatively
// x = update b by (join_key1, join_key2) from c by (join_key1, join_key2); // making it two-step.
z = foreach b generate $0, $32, $50; // in case you are taking only a few cols from main (b), new (c)
store z into 'bla' append; // appends the o/p data into 'bla' directly.
{code}

For the append case, the construct below would be another way of doing it.
{code}
append b, c; // appends directly into b.
z = foreach b generate $0, $32, $50; // in case you are taking only a few cols from main (b), new (c)
store z into 'bla';
{code}
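
One possible reading of the proposed semantics, sketched in Python rather than
Pig Latin so it can be run. Rows are dicts; update and append are hypothetical
helper names. The proposal does not say what update does with rows of the second
dataset that match no key in the first; ignoring them here is an assumption.

```python
def key_of(row, key_fields):
    """Extract the join key of a row as a tuple."""
    return tuple(row[f] for f in key_fields)


def update(ds1, ds2, key_fields):
    """Replace each row of ds1 with the ds2 row that has the same key, if any."""
    replacements = {key_of(r, key_fields): r for r in ds2}
    # Assumption: ds2 rows with no matching key in ds1 are ignored.
    return [replacements.get(key_of(r, key_fields), r) for r in ds1]


def append(ds1, ds2):
    """Append all rows of ds2 after the rows of ds1."""
    return ds1 + ds2


b = [{"k": 1, "v": "old"}, {"k": 2, "v": "keep"}]
c = [{"k": 1, "v": "new"}, {"k": 3, "v": "extra"}]
print(update(b, c, ["k"]))
print(append(b, c))
```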


 Provide 'append' keyword to allow appending to diferent dataset once the 
 feature is available in Hadoop
 ---

 Key: PIG-994
 URL: https://issues.apache.org/jira/browse/PIG-994
 Project: Pig
  Issue Type: New Feature
  Components: impl
Affects Versions: 0.4.0
 Environment: Grid clusters
Reporter: Rekha
Priority: Minor

 Provide an 'append' keyword to allow appending to a different dataset in Pig 0.5.0, 
 since it is now on Hadoop 0.20 (which has an append feature).


[jira] Updated: (PIG-922) Logical optimizer: push up project


 [ 
https://issues.apache.org/jira/browse/PIG-922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-922:
---

Attachment: PIG-922-p3_6.patch

Address comments by Pradeep and Hudson.

 Logical optimizer: push up project
 --

 Key: PIG-922
 URL: https://issues.apache.org/jira/browse/PIG-922
 Project: Pig
  Issue Type: New Feature
  Components: impl
Affects Versions: 0.3.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.6.0

 Attachments: PIG-922-p1_0.patch, PIG-922-p1_1.patch, 
 PIG-922-p1_2.patch, PIG-922-p1_3.patch, PIG-922-p1_4.patch, 
 PIG-922-p2_preview.patch, PIG-922-p2_preview2.patch, PIG-922-p3_1.patch, 
 PIG-922-p3_2.patch, PIG-922-p3_3.patch, PIG-922-p3_4.patch, 
 PIG-922-p3_5.patch, PIG-922-p3_6.patch


 This is a continuation of 
 [PIG-697|https://issues.apache.org/jira/browse/PIG-697]. We need to add 
 another rule to the logical optimizer: push up project, i.e., prune columns as 
 early as possible.
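
 The idea of pruning columns as early as possible can be illustrated with a 
 small backward pass over a linear plan. This is only a sketch of the general 
 technique, not the optimizer rule in the patch: the plan representation (a 
 list of (inputs, produced) pairs) and the name prune_columns are hypothetical.

```python
def prune_columns(loaded, operators, outputs):
    """Return the loaded columns that the plan actually needs.

    loaded:    columns available from the loader, in order
    operators: list of (inputs, produced) pairs in execution order;
               an operator with no produced columns (e.g. a filter)
               always needs its inputs
    outputs:   columns required at the end of the plan
    """
    needed = set(outputs)
    # Walk the plan from the end back to the load.
    for inputs, produced in reversed(operators):
        if not produced:
            needed |= set(inputs)
        elif needed & set(produced):
            # The produced columns are computed here, so require the inputs instead.
            needed = (needed - set(produced)) | set(inputs)
    return [c for c in loaded if c in needed]


plan = [
    (["b"], []),          # filter on column b
    (["a", "c"], ["e"]),  # derive e from a and c
]
print(prune_columns(["a", "b", "c", "d"], plan, ["e"]))
```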


[jira] Updated: (PIG-922) Logical optimizer: push up project