[jira] Created: (PIG-945) [Zebra] Column Security feature for Zebra

2009-09-03 Thread Gaurav Jain (JIRA)
[Zebra] Column Security feature for Zebra
-

 Key: PIG-945
 URL: https://issues.apache.org/jira/browse/PIG-945
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.4.0
Reporter: Gaurav Jain



With this feature, a user can secure the data in a particular column group by 
setting appropriate HDFS permissions, if needed. Zebra column group security is 
as secure as HDFS security. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-952) [Zebra] Make Zebra Version Same as Pig Version

2009-09-10 Thread Gaurav Jain (JIRA)
[Zebra] Make Zebra Version Same as Pig Version
--

 Key: PIG-952
 URL: https://issues.apache.org/jira/browse/PIG-952
 Project: Pig
  Issue Type: Improvement
  Components: build
Affects Versions: 0.4.0
Reporter: Gaurav Jain
Assignee: Gaurav Jain
Priority: Minor
 Fix For: 0.4.0



Zebra release versions need to be the same as Pig release versions for consistency.




[jira] Commented: (PIG-987) [zebra] Zebra Column Group Access Control

2009-10-06 Thread Gaurav Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762733#action_12762733
 ] 

Gaurav Jain commented on PIG-987:
-


Patch Reviewed

+1

> [zebra] Zebra Column Group Access Control
> -
>
> Key: PIG-987
> URL: https://issues.apache.org/jira/browse/PIG-987
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.6.0
>Reporter: Yan Zhou
>Assignee: Yan Zhou
> Attachments: ColumnGroupSecurity.patch
>
>
> Access Control: when processes try to read from the column groups, Zebra 
> should be able to handle allowed vs. disallowed user/application accesses.  
> Security is ultimately granted by the corresponding HDFS security of the 
> stored data.
> Expected behavior when column group permissions are set:
> When a user selects only columns that they do not have permission to 
> access, Zebra should return an error with the message "Error #: Permission denied 
> for accessing column  
> Access control applies to an entire column group, so all columns in a column 
> group have the same permissions. 




[jira] Commented: (PIG-991) [zebra] A few minor bugs as described in the Description section

2009-10-06 Thread Gaurav Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762734#action_12762734
 ] 

Gaurav Jain commented on PIG-991:
-



Patch Reviewed 
+1

> [zebra] A few minor bugs as described in the Description section
> 
>
> Key: PIG-991
> URL: https://issues.apache.org/jira/browse/PIG-991
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.4.0
>Reporter: Yan Zhou
>Assignee: Yan Zhou
>Priority: Minor
> Fix For: 0.6.0
>
> Attachments: Bugs.patch
>
>
> 1) "lzo2" was used as the compressor name for the LZO compression algorithm; 
> it should be "lzo" instead;
> 2) the default compression is changed from "lzo" to "gz" for gzip;
> 3) In JAVACC file SchemaParser.jjt, the package name was wrong using the old 
> "package org.apache.pig.table.types";
> 4) in build.xml, two new javacc targets are added to generate the 
> TableSchemaParser and TableStorageParser Java code;
> 5) Support of column group security ( 
> https://issues.apache.org/jira/browse/PIG-987 ) lacked support of the 
> dumpinfo method: the groups and permissions were not displayed. Note that as 
> a consequence, the patch herein must be applied after that of JIRA987.
> 6) and 7) a couple of issues reported in Jira917.




[jira] Commented: (PIG-997) [zebra] Sorted Table Support by Zebra

2009-10-30 Thread Gaurav Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772177#action_12772177
 ] 

Gaurav Jain commented on PIG-997:
-

Reviewed

+1

> [zebra] Sorted Table Support by Zebra
> -
>
> Key: PIG-997
> URL: https://issues.apache.org/jira/browse/PIG-997
> Project: Pig
>  Issue Type: New Feature
>Reporter: Yan Zhou
>Assignee: Yan Zhou
> Fix For: 0.6.0
>
> Attachments: SortedTable.patch, SortedTable.patch
>
>
> This new feature is for Zebra to support sorted data in storage. As a storage 
> library, Zebra will not sort the data by itself. But it will support creation 
> and use of sorted data either through PIG  or through map/reduce tasks that 
> use Zebra as storage format.
> The sorted table keeps the data in a "totally sorted" manner across all 
> TFiles created by potentially all mappers or reducers.
> For sorted data creation through PIG's STORE operator ,  if the input data is 
> sorted through "ORDER BY", the new Zebra table will be marked as sorted on 
> the sorted columns;
> For sorted data creation through Map/Reduce tasks,  three new static methods 
> of the BasicTableOutput class will be provided to allow or help the user to 
> achieve the goal. "setSortInfo" allows the user to specify the sorted columns 
> of the input tuple to be stored; "getSortKeyGenerator" and "getSortKey" help 
> the user to generate the key acceptable by Zebra as a sorted key based upon 
> the schema, sorted columns and the input tuple.
> For sorted data read through PIG's LOAD operator, pass string "sorted" as an 
> extra argument to the TableLoader constructor to ask for sorted table to be 
> loaded;
> For sorted data read through Map/Reduce tasks, a new static method of 
> TableInputFormat class, requireSortedTable, can be called to ask for a sorted 
> table to be read. Additionally, an overloaded version of the new method can 
> be called to ask for a sorted table on specified sort columns and comparator.
> For this release, sorted tables support sorting only in ascending order, not 
> in descending order. In addition, the sort keys must be of simple types, not 
> complex types such as RECORD, COLLECTION and MAP. 
> Multiple-key sorting is supported. But the ordering of the multiple sort keys 
> is significant with the first sort column being the primary sort key, the 
> second being the secondary sort key, etc.
> In this release, the sort keys are stored along with the sort columns where 
> the keys were originally created from, resulting in some data storage 
> redundancy.
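
As an illustration only, the multi-key ordering described above (first sort column as the primary key, second as the secondary key, ascending order only) can be sketched in plain Java. This does not use Zebra's API: the class name SortKeySketch, the helpers compareRows and sortRows, and the modeling of rows as Object[] of simple Comparable values are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of Zebra.
class SortKeySketch {

    // Compares two rows column by column, in the order the sort columns are listed.
    // The first column that differs decides the result (ascending order only).
    @SuppressWarnings({"unchecked", "rawtypes"})
    static int compareRows(Object[] a, Object[] b, int[] sortColumns) {
        for (int col : sortColumns) {
            int c = ((Comparable) a[col]).compareTo(b[col]);
            if (c != 0) {
                return c;
            }
        }
        return 0;
    }

    // Returns a sorted copy of the rows; the input list is left untouched.
    static List<Object[]> sortRows(List<Object[]> rows, int[] sortColumns) {
        List<Object[]> sorted = new ArrayList<>(rows);
        sorted.sort((x, y) -> compareRows(x, y, sortColumns));
        return sorted;
    }
}
```

Sorting on columns {0, 1} and on columns {1, 0} generally gives different results, which is why the order of the sort keys is significant.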




[jira] Commented: (PIG-1095) [zebra] Schema support of anonymous fields in COLLECTION fails

2009-11-24 Thread Gaurav Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782264#action_12782264
 ] 

Gaurav Jain commented on PIG-1095:
--


+1

Ideally, generic concepts should not contain references to specific 
implementations. In this case, I did not like the idea of handling projection 
specially in the Schema class.

However, given the developer's explanation, I am giving a +1.



> [zebra] Schema support of anonymous fields in COLLECTION fails
> -
>
> Key: PIG-1095
> URL: https://issues.apache.org/jira/browse/PIG-1095
> Project: Pig
>  Issue Type: Bug
>Reporter: Yan Zhou
>Assignee: Yan Zhou
>Priority: Minor
> Fix For: 0.6.0, 0.7.0
>
> Attachments: PIG-1095.patch
>
>
> The schema parser fails on schemas of COLLECTION columns like 
> c:collection(int).




[jira] Created: (PIG-1111) [Zebra]

2009-11-25 Thread Gaurav Jain (JIRA)
[Zebra]
---

 Key: PIG-
 URL: https://issues.apache.org/jira/browse/PIG-
 Project: Pig
  Issue Type: New Feature
Reporter: Gaurav Jain
Assignee: Gaurav Jain
 Fix For: 0.6.0, 0.7.0



Zebra enables an application to stream data into different Zebra table instances.

New interface added:

setMultipleOutputs( JobConf jobconf, String commaSeparatedLocation, Class<? extends ZebraOutputPartitioner> theClass )

Zebra maintains a list of table instances based on commaSeparatedLocation (in 
that order).

The ZebraOutputPartitioner interface has a getOutputPartition method, which is 
implemented by the application. It returns an index into the list, and Zebra 
writes to that instance.

We also introduce a new mapred property for setting multiple outputs.

mapred.lib.table.multi.output.dirs
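
A rough sketch of the routing idea, in plain Java and without any Hadoop or Zebra classes: the comma-separated locations become an ordered list, and an application-supplied partitioner returns an index into that list. The names PartitionerSketch and MultiOutputSketch, the String key argument, and the paths used in the usage note are hypothetical; the real ZebraOutputPartitioner signature is not shown in this message.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for ZebraOutputPartitioner (real signature not shown here).
interface PartitionerSketch {
    // Returns an index into the ordered list of output locations.
    int getOutputPartition(String key);
}

// Hypothetical illustration of how an index could select one of several outputs.
class MultiOutputSketch {
    private final List<String> locations = new ArrayList<>();
    private final PartitionerSketch partitioner;

    MultiOutputSketch(String commaSeparatedLocations, PartitionerSketch partitioner) {
        // Keep the locations in the order they were given.
        for (String location : commaSeparatedLocations.split(",")) {
            locations.add(location.trim());
        }
        this.partitioner = partitioner;
    }

    // Returns the location that a record with this key would be written to.
    String locationFor(String key) {
        int index = partitioner.getOutputPartition(key);
        if (index < 0 || index >= locations.size()) {
            throw new IllegalArgumentException("partition index out of range: " + index);
        }
        return locations.get(index);
    }
}
```

For example, new MultiOutputSketch("/data/us, /data/other", key -> key.startsWith("us") ? 0 : 1) would route keys starting with "us" to the first location and everything else to the second.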
 




[jira] Commented: (PIG-1074) Zebra store function should allow '::' in column names in output schema

2009-11-25 Thread Gaurav Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782698#action_12782698
 ] 

Gaurav Jain commented on PIG-1074:
--


Looks incorrect

 )+ (  |  |  )* (  
)? (  )* (  |  |  )* >

Should be

 )+ (  |  |  )* (  
)? (  )+ (  |  |  )* >

or

 )+ (  |  |  )* (  
)? (  |  |  )+ >


> Zebra store function should allow '::' in column names in output schema
> ---
>
> Key: PIG-1074
> URL: https://issues.apache.org/jira/browse/PIG-1074
> Project: Pig
>  Issue Type: Bug
>Reporter: Pradeep Kamath
>Assignee: Yan Zhou
> Fix For: 0.6.0, 0.7.0
>
> Attachments: PIG-1074.patch
>
>
> the following script fails: 
>  {noformat}
> a = load '/zebra/singlefile/studenttab10k' using 
> org.apache.hadoop.zebra.pig.TableLoader() as (name, age, gpa);
> b = load '/zebra/singlefile/votertab10k' using 
> org.apache.hadoop.zebra.pig.TableLoader() as (name, age, registration, 
> contributions);
> c = filter a by age < 20;
> d = filter b by age < 20;
> store c into 
> '/user/pig/out//ZebraMultiQuery_30.out.1' using 
> org.apache.hadoop.zebra.pig.TableStorer('');
> store d into 
> '/user/pig/out//ZebraMultiQuery_30.out.2' using 
> org.apache.hadoop.zebra.pig.TableStorer('');
> e = cogroup c by name, d by name;
> f = foreach e generate flatten(c), flatten(d);
> store f into '/user/pig//ZebraMultiQuery_30.out.3' 
> using org.apache.hadoop.zebra.pig.TableStorer('');
> {noformat}
> Here the schema of f has names like c::name and it looks like zebra storefunc 
> does not allow '::' in column name 
> The stack trace is
>  
> ERROR 2997: Unable to recreate exception from backend error: 
> java.io.IOException: ColumnGroup.Writer constructor failed : Partition 
> constructor failed :Encountered " ":" ": "" at line 1, column 3.
>  




[jira] Commented: (PIG-1074) Zebra store function should allow '::' in column names in output schema

2009-11-25 Thread Gaurav Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782708#action_12782708
 ] 

Gaurav Jain commented on PIG-1074:
--

+1

> Zebra store function should allow '::' in column names in output schema
> ---
>
> Key: PIG-1074
> URL: https://issues.apache.org/jira/browse/PIG-1074
> Project: Pig
>  Issue Type: Bug
>Reporter: Pradeep Kamath
>Assignee: Yan Zhou
> Fix For: 0.6.0, 0.7.0
>
> Attachments: PIG-1074.patch, PIG-1074.patch
>
>



[jira] Commented: (PIG-1074) Zebra store function should allow '::' in column names in output schema

2009-11-25 Thread Gaurav Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782722#action_12782722
 ] 

Gaurav Jain commented on PIG-1074:
--


Question for Pig:


From the example in the description, it looks like a column name can be 

c::name while storing in a Zebra table.

Can it be 

c::c::c::c::name?

> Zebra store function should allow '::' in column names in output schema
> ---
>
> Key: PIG-1074
> URL: https://issues.apache.org/jira/browse/PIG-1074
> Project: Pig
>  Issue Type: Bug
>Reporter: Pradeep Kamath
>Assignee: Yan Zhou
> Fix For: 0.6.0, 0.7.0
>
> Attachments: PIG-1074.patch, PIG-1074.patch, PIG-1074.patch
>
>



[jira] Commented: (PIG-1074) Zebra store function should allow '::' in column names in output schema

2009-11-25 Thread Gaurav Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782724#action_12782724
 ] 

Gaurav Jain commented on PIG-1074:
--

+1

With the latest patch,

it allows

c::c::c::c::d

but it does not allow

aab::c

If that is needed by Pig, then we need to revisit this.
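
For comparison, a pattern that would accept both forms above (one or more identifiers joined by '::') can be sketched as a Java regular expression. This is not the Zebra grammar from the patch, which is a JavaCC grammar and is not reproduced here. The class name ColumnNameSketch and the identifier rule (a letter followed by letters, digits, or underscores) are assumptions.

```java
import java.util.regex.Pattern;

// Hypothetical illustration; the identifier rule is an assumption, not Zebra's.
class ColumnNameSketch {
    // One identifier: a letter followed by letters, digits, or underscores.
    private static final String IDENT = "[A-Za-z][A-Za-z0-9_]*";

    // One identifier, optionally followed by any number of "::identifier" parts.
    private static final Pattern QUALIFIED =
            Pattern.compile(IDENT + "(::" + IDENT + ")*");

    // True when the whole name matches the qualified-name pattern.
    static boolean isValid(String name) {
        return QUALIFIED.matcher(name).matches();
    }
}
```

Under this rule, c::name, c::c::c::c::d, and aab::c are all accepted, while a single ':' or a leading or trailing '::' is rejected.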

> Zebra store function should allow '::' in column names in output schema
> ---
>
> Key: PIG-1074
> URL: https://issues.apache.org/jira/browse/PIG-1074
> Project: Pig
>  Issue Type: Bug
>Reporter: Pradeep Kamath
>Assignee: Yan Zhou
> Fix For: 0.6.0, 0.7.0
>
> Attachments: PIG-1074.patch, PIG-1074.patch, PIG-1074.patch
>
>



[jira] Updated: (PIG-1111) [Zebra] multiple outputs support

2009-12-01 Thread Gaurav Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gaurav Jain updated PIG-:
-

Attachment: PIG-.patch

Source code and test cases for the feature

> [Zebra] multiple outputs support
> 
>
> Key: PIG-
> URL: https://issues.apache.org/jira/browse/PIG-
> Project: Pig
>  Issue Type: New Feature
>Reporter: Gaurav Jain
>Assignee: Gaurav Jain
> Fix For: 0.6.0, 0.7.0
>
> Attachments: PIG-.patch
>
>



[jira] Updated: (PIG-1111) [Zebra] multiple outputs support

2009-12-02 Thread Gaurav Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gaurav Jain updated PIG-:
-

Affects Version/s: 0.7.0
   0.6.0
   Status: Patch Available  (was: Open)


Please review and provide feedback at your earliest convenience

> [Zebra] multiple outputs support
> 
>
> Key: PIG-
> URL: https://issues.apache.org/jira/browse/PIG-
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.6.0, 0.7.0
>Reporter: Gaurav Jain
>Assignee: Gaurav Jain
> Fix For: 0.6.0, 0.7.0
>
> Attachments: PIG-.patch
>
>



[jira] Updated: (PIG-1111) [Zebra] multiple outputs support

2009-12-02 Thread Gaurav Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gaurav Jain updated PIG-:
-

Status: Open  (was: Patch Available)


Submitting an update

> [Zebra] multiple outputs support
> 
>
> Key: PIG-
> URL: https://issues.apache.org/jira/browse/PIG-
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.6.0, 0.7.0
>Reporter: Gaurav Jain
>Assignee: Gaurav Jain
> Fix For: 0.6.0, 0.7.0
>
> Attachments: PIG-.patch
>
>



[jira] Updated: (PIG-1111) [Zebra] multiple outputs support

2009-12-03 Thread Gaurav Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gaurav Jain updated PIG-:
-

Attachment: PIG-.patch


I did some code cleaning in this patch.

Please review at your earliest convenience.

> [Zebra] multiple outputs support
> 
>
> Key: PIG-
> URL: https://issues.apache.org/jira/browse/PIG-
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.6.0, 0.7.0
>Reporter: Gaurav Jain
>Assignee: Gaurav Jain
> Fix For: 0.6.0, 0.7.0
>
> Attachments: PIG-.patch, PIG-.patch
>
>



[jira] Commented: (PIG-1111) [Zebra] multiple outputs support

2009-12-03 Thread Gaurav Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785216#action_12785216
 ] 

Gaurav Jain commented on PIG-:
--


Yan Zhou provided code review feedback outside of JIRA:

1)   Why does build.xml need any changes?

2)   BasicTableOutputFormat.IS_MULTI should be of package scope instead of 
public.

3)   In the RecordWriter::write() method, the check 
"if(jobConf.getBoolean(BasicTableOutputFormat.IS_MULTI, false) == true)" should 
be replaced with a simple "if (op != null)". As a consequence, the "jobConf" 
variable is not needed.

4)   Many RuntimeExceptions are thrown; they should be replaced 
with IOException.

5)   getRecordWriter: why remove the null check for the Path? The patch 
seems to be inconsistent with what's on trunk. The patch says the check is 
completely removed, while trunk has an empty check.

6)   TableRecordWriter: commaSeparatedLocs is never used.

7)   In getOutputPartition, why are setConf/getConf necessary? Just curious.


In the latest patch, all of the above issues have been addressed.
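
Points 3) and 4) can be illustrated with a small self-contained Java sketch. This is not the code from the patch: the class WriterSketch, the field tableCount, and the use of ToIntFunction as a stand-in for the partitioner object "op" are hypothetical and only mirror the shape of the review comments.

```java
import java.io.IOException;
import java.util.function.ToIntFunction;

// Hypothetical illustration of review points 3) and 4); not the patch itself.
class WriterSketch {
    // Non-null only when multiple outputs are configured (stand-in for "op").
    private final ToIntFunction<String> op;
    private final int tableCount;

    WriterSketch(ToIntFunction<String> op, int tableCount) {
        this.op = op;
        this.tableCount = tableCount;
    }

    // Returns the index of the table a record with this key should go to.
    int targetTable(String key) throws IOException {
        // Point 3: test the partitioner itself instead of re-reading a config flag.
        if (op == null) {
            return 0; // single-output case
        }
        int index = op.applyAsInt(key);
        // Point 4: report bad input as a checked IOException, not a RuntimeException.
        if (index < 0 || index >= tableCount) {
            throw new IOException("Invalid output partition index: " + index);
        }
        return index;
    }
}
```

Checking the object directly keeps the write path independent of the job configuration, which is why the jobConf variable becomes unnecessary.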


> [Zebra] multiple outputs support
> 
>
> Key: PIG-
> URL: https://issues.apache.org/jira/browse/PIG-
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.6.0, 0.7.0
>Reporter: Gaurav Jain
>Assignee: Gaurav Jain
> Fix For: 0.6.0, 0.7.0
>
> Attachments: PIG-.patch, PIG-.patch
>
>



[jira] Updated: (PIG-1111) [Zebra] multiple outputs support

2009-12-03 Thread Gaurav Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gaurav Jain updated PIG-:
-

Status: Patch Available  (was: Open)


Please review

> [Zebra] multiple outputs support
> 
>
> Key: PIG-
> URL: https://issues.apache.org/jira/browse/PIG-
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.6.0, 0.7.0
>Reporter: Gaurav Jain
>Assignee: Gaurav Jain
> Fix For: 0.6.0, 0.7.0
>
> Attachments: PIG-.patch, PIG-.patch
>
>



[jira] Updated: (PIG-1119) [zebra] "group" is a Pig preserved word, zebra needs to use other string for table's group information

2009-12-03 Thread Gaurav Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gaurav Jain updated PIG-1119:
-

Attachment: PIG-1119.patch


Please review the patch

> [zebra] "group" is a Pig preserved word, zebra needs to use other string for 
> table's group information
> --
>
> Key: PIG-1119
> URL: https://issues.apache.org/jira/browse/PIG-1119
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.6.0
>Reporter: Jing Huang
> Fix For: 0.6.0
>
> Attachments: PIG-1119.patch
>
>





[jira] Updated: (PIG-1119) [zebra] "group" is a Pig preserved word, zebra needs to use other string for table's group information

2009-12-03 Thread Gaurav Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gaurav Jain updated PIG-1119:
-

Status: Patch Available  (was: Open)


Please review the patch

> [zebra] "group" is a Pig preserved word, zebra needs to use other string for 
> table's group information
> --
>
> Key: PIG-1119
> URL: https://issues.apache.org/jira/browse/PIG-1119
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.6.0
>Reporter: Jing Huang
> Fix For: 0.6.0
>
> Attachments: PIG-1119.patch
>
>





[jira] Commented: (PIG-1111) [Zebra] multiple outputs support

2009-12-03 Thread Gaurav Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785396#action_12785396
 ] 

Gaurav Jain commented on PIG-:
--


In response to feedback:

1) build.xml has tags to run multiple outputs tests

2) Changed to package scope

3) Change has been made

4) Throws IOException now

5) Was done as part of code cleaning. The null check is done in 
getOutputPaths() now

6) This variable is taken out

7) ZebraOutputPartitioner implements the Configurable interface; setConf and 
getConf are interface methods

> [Zebra] multiple outputs support
> 
>
> Key: PIG-
> URL: https://issues.apache.org/jira/browse/PIG-
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.6.0, 0.7.0
>Reporter: Gaurav Jain
>Assignee: Gaurav Jain
> Fix For: 0.6.0, 0.7.0
>
> Attachments: PIG-.patch, PIG-.patch
>
>



[jira] Commented: (PIG-1104) [zebra] Provide streaming support in Zebra.

2009-12-03 Thread Gaurav Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785552#action_12785552
 ] 

Gaurav Jain commented on PIG-1104:
--


My 2 cents:

The hadoop.common.CsvRecordOutput code is duplicated in CsvZebraTupleOutput, which 
might not be a good idea. 

Instead, CsvZebraTupleOutput should hold a CsvRecordOutput as a member variable, 
and the CsvRecordOutput.stream should be connected to a byte stream.

You can also extend the class.

The stream is UTF-8 encoded.


Then CsvZebraTupleOutput.writeTuple() should look like this:

writeTuple ( ... ) {

   csvRecordOutputObject.writeLong( ... )
   csvRecordOutputObject.writeInt( ... )
   ...
   ...

}

Then create the string from the byte array stream connected above, using the 
UTF-8 charset.


There might be various null object exceptions.

For example:

// c should be checked for null here

ZebraTuple(List c) {
   mFields = new ArrayList(c.size());

There are other similar cases.
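
To make the suggestion concrete, here is a minimal Java sketch of the delegation idea: a writer object is held as a member, it writes into an in-memory byte stream, and the result is decoded as UTF-8. CsvOutSketch is a hypothetical stand-in and is not Hadoop's CsvRecordOutput; TupleOutputSketch and its writeTuple signature are likewise assumptions. The null check mirrors the kind of guard suggested for the ZebraTuple constructor.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical stand-in for a CSV record writer; not Hadoop's CsvRecordOutput.
class CsvOutSketch {
    private final ByteArrayOutputStream stream;
    private boolean first = true;

    CsvOutSketch(ByteArrayOutputStream stream) {
        this.stream = stream;
    }

    // Appends one field, UTF-8 encoded, with a comma before all but the first.
    void writeField(Object value) {
        String text = (first ? "" : ",") + value;
        byte[] encoded = text.getBytes(StandardCharsets.UTF_8);
        stream.write(encoded, 0, encoded.length);
        first = false;
    }
}

// Hypothetical tuple output that delegates to the writer instead of copying its code.
class TupleOutputSketch {
    static String writeTuple(List<?> fields) {
        // Guard against null input up front rather than failing later.
        if (fields == null) {
            throw new IllegalArgumentException("fields must not be null");
        }
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        CsvOutSketch out = new CsvOutSketch(bytes);
        for (Object field : fields) {
            out.writeField(field);
        }
        // Build the string from the byte stream using the UTF-8 charset.
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

This sketch does no escaping of commas or quotes inside field values; a real CSV writer would need to handle that.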

  







> [zebra] Provide streaming support in Zebra.
> ---
>
> Key: PIG-1104
> URL: https://issues.apache.org/jira/browse/PIG-1104
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.4.0
>Reporter: Chao Wang
>Assignee: Chao Wang
> Fix For: 0.6.0, 0.7.0
>
> Attachments: PIG1104.patch
>
>
> Hadoop streaming is very popular among Hadoop users. The main attraction is 
> the simplicity of use. A user can write the application logic in any language 
> and process large amounts of data using Hadoop framework. As more people 
> start to use Zebra to store their data, we expect users would like to run 
> Hadoop streaming scripts to easily process Zebra tables. 
> The following lists a simple example of using Hadoop streaming to access 
> Zebra data. It loads data from foo table using Zebra's TableInputFormat and 
> then writes the data into output using default TextOutputFormat. 
> $ hadoop jar hadoop-streaming.jar -D mapred.reduce.tasks=0 -input foo -output 
> output -mapper 'cat' -inputformat 
> org.apache.hadoop.zebra.mapred.TableInputFormat 
> In more detail, Zebra uses Pig's DefaultTuple implementation of Tuple for its 
> records. Currently, when Zebra's TableInputFormat is used for input, the user 
> script sees each line containing " key_if_any\tTuple.toString() ". We plan to 
> generate a CSV format representation of our Pig tuples. To this end, we plan to 
> do the following: 
> 1) Derive a subclass ZebraTuple from Pig's DefaultTuple class and override 
> its toString() method to present the data in CSV format. 
> 2) On the Zebra side, the tuple factory should be changed to create ZebraTuple 
> objects instead of DefaultTuple objects. 
> Note that we can only support streaming on the input side - ability to use 
> streaming to read data from Zebra tables. For the output side, the streaming 
> support is not feasible, since the streaming mapper or reducer only emits 
> "Text\tText", the output collector has no way of knowing how to convert this 
> to (BytesWritable,Tuple).




[jira] Commented: (PIG-1104) [zebra] Provide streaming support in Zebra.

2009-12-03 Thread Gaurav Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785578#action_12785578
 ] 

Gaurav Jain commented on PIG-1104:
--


+1

> [zebra] Provide streaming support in Zebra.
> ---
>
> Key: PIG-1104
> URL: https://issues.apache.org/jira/browse/PIG-1104
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.4.0
>Reporter: Chao Wang
>Assignee: Chao Wang
> Fix For: 0.6.0, 0.7.0
>
> Attachments: PIG1104.patch
>
>
> Hadoop streaming is very popular among Hadoop users. The main attraction is 
> the simplicity of use. A user can write the application logic in any language 
> and process large amounts of data using Hadoop framework. As more people 
> start to use Zebra to store their data, we expect users would like to run 
> Hadoop streaming scripts to easily process Zebra tables. 
> The following lists a simple example of using Hadoop streaming to access 
> Zebra data. It loads data from foo table using Zebra's TableInputFormat and 
> then writes the data into output using default TextOutputFormat. 
> $ hadoop jar hadoop-streaming.jar -D mapred.reduce.tasks=0 -input foo -output 
> output -mapper 'cat' -inputformat 
> org.apache.hadoop.zebra.mapred.TableInputFormat 
> In more detail, Zebra uses Pig's DefaultTuple implementation of Tuple for its 
> records. Currently, when Zebra's TableInputFormat is used for input, the user 
> script sees each line containing " key_if_any\tTuple.toString() ". We plan to 
> generate CSV format representation of our Pig tuples. To this end, we plan to 
> do the following: 
> 1) Derive a subclass ZebraTuple from Pig's DefaultTuple class and override 
> its toString() method to present the data in CSV format. 
> 2) On Zebra side, the tuple factory should be changed to create ZebraTuple 
> objects, instead of DefaultTuple objects. 
> Note that we can only support streaming on the input side - ability to use 
> streaming to read data from Zebra tables. For the output side, the streaming 
> support is not feasible, since the streaming mapper or reducer only emits 
> "Text\tText", the output collector has no way of knowing how to convert this 
> to (BytesWritable,Tuple).




[jira] Updated: (PIG-653) Make fieldsToRead work in loader

2009-12-04 Thread Gaurav Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gaurav Jain updated PIG-653:


Attachment: PIG-653.patch


Zebra changes for the proposed feature.

Please review at your earliest convenience.

> Make fieldsToRead work in loader
> 
>
> Key: PIG-653
> URL: https://issues.apache.org/jira/browse/PIG-653
> Project: Pig
>  Issue Type: New Feature
>Reporter: Alan Gates
>Assignee: Pradeep Kamath
> Attachments: PIG-653-2.comment, PIG-653-3-proposal.txt, PIG-653.patch
>
>
> Currently pig does not call the fieldsToRead function in LoadFunc, thus it 
> does not provide information to load functions on what fields are needed.  We 
> need to implement a visitor that determines (where possible) which fields in 
> a file will be used and relays that information to the load function.




[jira] Updated: (PIG-1119) [zebra] "group" is a Pig preserved word, zebra needs to use other string for table's group information

2009-12-04 Thread Gaurav Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gaurav Jain updated PIG-1119:
-

Attachment: PIG-1119.patch


Changes incorporated as part of code review feedback.

> [zebra] "group" is a Pig preserved word, zebra needs to use other string for 
> table's group information
> --
>
> Key: PIG-1119
> URL: https://issues.apache.org/jira/browse/PIG-1119
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.6.0
>Reporter: Jing Huang
> Fix For: 0.6.0
>
> Attachments: PIG-1119.patch, PIG-1119.patch
>
>





[jira] Updated: (PIG-1119) [zebra] "group" is a Pig preserved word, zebra needs to use other string for table's group information

2009-12-04 Thread Gaurav Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gaurav Jain updated PIG-1119:
-

Status: Open  (was: Patch Available)


Providing an updated version.

> [zebra] "group" is a Pig preserved word, zebra needs to use other string for 
> table's group information
> --
>
> Key: PIG-1119
> URL: https://issues.apache.org/jira/browse/PIG-1119
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.6.0
>Reporter: Jing Huang
> Fix For: 0.6.0
>
> Attachments: PIG-1119.patch, PIG-1119.patch
>
>





[jira] Updated: (PIG-653) Make fieldsToRead work in loader

2009-12-04 Thread Gaurav Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gaurav Jain updated PIG-653:


Status: Patch Available  (was: Open)

> Make fieldsToRead work in loader
> 
>
> Key: PIG-653
> URL: https://issues.apache.org/jira/browse/PIG-653
> Project: Pig
>  Issue Type: New Feature
>Reporter: Alan Gates
>Assignee: Pradeep Kamath
> Attachments: PIG-653-2.comment, PIG-653-3-proposal.txt, PIG-653.patch
>
>
> Currently pig does not call the fieldsToRead function in LoadFunc, thus it 
> does not provide information to load functions on what fields are needed.  We 
> need to implement a visitor that determines (where possible) which fields in 
> a file will be used and relays that information to the load function.




[jira] Updated: (PIG-1119) [zebra] "group" is a Pig preserved word, zebra needs to use other string for table's group information

2009-12-04 Thread Gaurav Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gaurav Jain updated PIG-1119:
-

Status: Patch Available  (was: Open)

> [zebra] "group" is a Pig preserved word, zebra needs to use other string for 
> table's group information
> --
>
> Key: PIG-1119
> URL: https://issues.apache.org/jira/browse/PIG-1119
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.6.0
>Reporter: Jing Huang
> Fix For: 0.6.0
>
> Attachments: PIG-1119.patch, PIG-1119.patch
>
>





[jira] Commented: (PIG-1125) [zebra] Using typed APIs for Zebra's Map/Reduce interface

2009-12-08 Thread Gaurav Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787649#action_12787649
 ] 

Gaurav Jain commented on PIG-1125:
--


In requireSortedTable,

we take an array of strings for the sort columns; this could be changed to a 
List<String> to be consistent with the typed API semantics.
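
A minimal sketch of what the List-based form could look like, with an array 
overload kept as a bridge. SortColumnsSketch and its methods are hypothetical 
names for illustration only, not the signatures in the patch:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical holder showing a List<String> API next to a String[] bridge.
final class SortColumnsSketch {
    private final List<String> sortColumns;

    // Typed form: callers pass a List<String>; a defensive copy is stored.
    SortColumnsSketch(List<String> sortColumns) {
        this.sortColumns =
            Collections.unmodifiableList(new ArrayList<String>(sortColumns));
    }

    // Bridge for existing callers that still hold a String[].
    static SortColumnsSketch fromArray(String[] sortColumns) {
        return new SortColumnsSketch(Arrays.asList(sortColumns));
    }

    List<String> getSortColumns() {
        return sortColumns;
    }
}
```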



> [zebra] Using typed APIs for Zebra's Map/Reduce interface
> -
>
> Key: PIG-1125
> URL: https://issues.apache.org/jira/browse/PIG-1125
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.4.0
>Reporter: Chao Wang
>Assignee: Chao Wang
> Fix For: 0.6.0, 0.7.0
>
> Attachments: PIG-1125.patch
>
>
> We plan to modify Zebra's M/R interface to use typed APIs, i.e., APIs taking 
> object arguments, instead of String arguments.
> Take TableInputFormat as an example:
> setSchema(jobConf conf, String schema) is changing to setSchema(jobConf conf, 
> ZebraSchemaInfo schemaInfo)
> setProjection(jobConf conf, String projection) is changing to 
> setProjection(jobConf conf, ZebraProjectionInfo projectionInfo)
> and so on.
> Benefits: 1) Typed APIs make it easier to detect usage mistakes earlier and 
> 2) Typed APIs are richer and hide things better.
> In the meanwhile, we plan to make the old APIs deprecated, instead of 
> removing them, for the sake of safety.




[jira] Commented: (PIG-1125) [zebra] Using typed APIs for Zebra's Map/Reduce interface

2009-12-08 Thread Gaurav Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787722#action_12787722
 ] 

Gaurav Jain commented on PIG-1125:
--



+1

> [zebra] Using typed APIs for Zebra's Map/Reduce interface
> -
>
> Key: PIG-1125
> URL: https://issues.apache.org/jira/browse/PIG-1125
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.4.0
>Reporter: Chao Wang
>Assignee: Chao Wang
> Fix For: 0.6.0, 0.7.0
>
> Attachments: PIG-1125.patch, PIG-1125.patch
>
>
> We plan to modify Zebra's M/R interface to use typed APIs, i.e., APIs taking 
> object arguments, instead of String arguments.
> Take TableInputFormat as an example:
> setSchema(jobConf conf, String schema) is changing to setSchema(jobConf conf, 
> ZebraSchemaInfo schemaInfo)
> setProjection(jobConf conf, String projection) is changing to 
> setProjection(jobConf conf, ZebraProjectionInfo projectionInfo)
> and so on.
> Benefits: 1) Typed APIs make it easier to detect usage mistakes earlier and 
> 2) Typed APIs are richer and hide things better.
> In the meanwhile, we plan to make the old APIs deprecated, instead of 
> removing them, for the sake of safety.




[jira] Commented: (PIG-1140) [zebra] Use of Hadoop 2.0 APIs

2010-01-19 Thread Gaurav Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802441#action_12802441
 ] 

Gaurav Jain commented on PIG-1140:
--


+1 

Pig-related Zebra changes have not been migrated to the new Hadoop 0.20 API in 
this patch. Those will continue to work with the old Hadoop 0.18 API.

Pig is redesigning its interfaces; those changes will be incorporated in Zebra 
in the next patch.

Also, in BasicTableOutputFormat the M/R commit interface is a no-op for now in 
this patch, as it is used exclusively for the Pig interfaces.

> [zebra] Use of Hadoop 2.0 APIs  
> 
>
> Key: PIG-1140
> URL: https://issues.apache.org/jira/browse/PIG-1140
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.6.0
>Reporter: Yan Zhou
> Fix For: 0.7.0
>
> Attachments: zebra.0112
>
>
> Currently, Zebra is still using already deprecated Hadoop 1.8 APIs. Need to 
> upgrade to its 2.0 APIs.




[jira] Commented: (PIG-1206) [zebra] throws an exception if a descending "order by" by pig tries to to create such a table

2010-02-03 Thread Gaurav Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829226#action_12829226
 ] 

Gaurav Jain commented on PIG-1206:
--


Zebra will treat a Pig descending "order by" clause as "un-ordered".

+1

> [zebra] throws an exception if a descending "order by" by pig tries to to 
> create such a table
> -
>
> Key: PIG-1206
> URL: https://issues.apache.org/jira/browse/PIG-1206
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.6.0
>Reporter: Yan Zhou
>Assignee: Yan Zhou
> Fix For: 0.6.0
>
> Attachments: PIG-1206.patch
>
>
> As Zebra does not support descending sorted tables, Zebra will throw an 
> exception at the backend when the TFile sortedness check fails. It has been 
> determined that a desirable behavior is to store the data as unsorted after 
> logging a warning.




[jira] Assigned: (PIG-1115) [zebra] temp files are not cleaned.

2010-02-05 Thread Gaurav Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gaurav Jain reassigned PIG-1115:


Assignee: Gaurav Jain

> [zebra] temp files are not cleaned.
> ---
>
> Key: PIG-1115
> URL: https://issues.apache.org/jira/browse/PIG-1115
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Hong Tang
>Assignee: Gaurav Jain
>
> Temp files created by zebra during table creation are not cleaned where there 
> is any task failure, which results in waste of disk space.




[jira] Commented: (PIG-1115) [zebra] temp files are not cleaned.

2010-02-05 Thread Gaurav Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830383#action_12830383
 ] 

Gaurav Jain commented on PIG-1115:
--

Proposed Solution:

-- Zebra will implement ZebraOutputCommitter

-- Zebra FrontEnd will create all the final directories and schema files 

$basicTable/.btschema
$basicTable/CG0/.schema
$basicTable/CG1/.schema


-- Zebra will create a temporary directory per BasicTable and write all data 
there during RecordWriter.write(), under

 $basicTable/_temporary/CG0/part-
 $basicTable/_temporary/CG1/part-

-- The _temporary directory will always be created under $basicTable

-- In the BackEnd, Zebra creates RecordWriters, which in turn create 
CGInserter. CGInserter works on a directory, which we call 'workOutputPath':
  $basicTable/_temporary/$CG/
 But it needs the .schema file, which is located 2 levels up. So it 
reads the schema file from
  $basicTable/$workOutputPath.getName()

-- In CGInserter.close(), 
 $basicTable/_temporary/CG0/part-   --->
  $basicTable/CG0/part-

-- In ZebraOutputCommitter.cleanupJob(), BasicTableOutputFormat.close() will be 
called.

-- In BasicTableOutputFormat.close()
  remove($basicTable/_temporary/)
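
The write / promote / cleanup flow above can be sketched on a local file 
system with java.nio.file. This is only an illustration under assumptions: 
TempCommitSketch and its method names are hypothetical, it does not use the 
HDFS FileSystem API, and it is not the code in the patch.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical local-filesystem model of the _temporary directory protocol.
final class TempCommitSketch {

    // Write a part file under <table>/_temporary/<cg>/ and return its path.
    static Path writePart(Path table, String cg, String part, String data)
            throws IOException {
        Path workDir = table.resolve("_temporary").resolve(cg);
        Files.createDirectories(workDir);
        return Files.write(workDir.resolve(part),
                           data.getBytes(StandardCharsets.UTF_8));
    }

    // Promote <table>/_temporary/<cg>/<part> to <table>/<cg>/<part>.
    // The column group name is taken from the work directory's own name.
    static Path promote(Path table, Path tempPart) throws IOException {
        String cg = tempPart.getParent().getFileName().toString();
        Path finalDir = table.resolve(cg);
        Files.createDirectories(finalDir);
        return Files.move(tempPart, finalDir.resolve(tempPart.getFileName()));
    }

    // Remove <table>/_temporary and everything left under it.
    static void cleanup(Path table) throws IOException {
        Path temp = table.resolve("_temporary");
        if (!Files.exists(temp)) {
            return;
        }
        try (Stream<Path> walk = Files.walk(temp)) {
            // Reverse order visits children before their parent directories.
            walk.sorted(Comparator.reverseOrder())
                .forEach(p -> p.toFile().delete());
        }
    }
}
```

In this model a failed task simply never calls promote(), so whatever it 
wrote stays under _temporary until cleanup() removes it.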






> [zebra] temp files are not cleaned.
> ---
>
> Key: PIG-1115
> URL: https://issues.apache.org/jira/browse/PIG-1115
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Hong Tang
>
> Temp files created by zebra during table creation are not cleaned where there 
> is any task failure, which results in waste of disk space.




[jira] Commented: (PIG-1140) [zebra] Use of Hadoop 2.0 APIs

2010-02-10 Thread Gaurav Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832287#action_12832287
 ] 

Gaurav Jain commented on PIG-1140:
--


A few suggestions on the implementation:


TableLoader:

 -- In the initialize() method, we should use

   Configuration conf = new Configuration(false), which creates an empty object.

   Configuration conf = new Configuration() populates the object from 
default-*xml, which may contain conflicting properties.

   (good to have)

 -- In the seekNear() method, we might want to check tableRecordReader for 
null. (good to have)

 -- In createIndexReader(), since we set the projection, we should not pass a 
null projection, as in createTableRecordReader(job, null). 
 It should be createTableRecordReader(job, 
TableInputFormat.getProjection(job)). (need to have)

 -- In setLocation() and getSchema(), if we are handling paths == null then we 
might want to check paths.isEmpty() as well. (good to have)


TableStorer:

 -- Instead of implementing new classes (TableOutputFormat and 
TableOutputCommitter), we should use BasicTableOutputFormat and 
BasicTableOutputFormat.TableOutputCommitter in the zebra mapreduce package. 
(must have)

   (There would be a separate jira/patch to do the same.)

 -- Code from storeSchema should go in 
TableOutputFormat.TableOutputCommitter.cleanupJob().

 -- Does Pig call OutputCommitter.abortJob() for failed jobs?
 


> [zebra] Use of Hadoop 2.0 APIs  
> 
>
> Key: PIG-1140
> URL: https://issues.apache.org/jira/browse/PIG-1140
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.6.0
>Reporter: Yan Zhou
>Assignee: Xuefu Zhang
> Fix For: 0.7.0
>
> Attachments: zebra.0209
>
>
> Currently, Zebra is still using already deprecated Hadoop 1.8 APIs. Need to 
> upgrade to its 2.0 APIs.




[jira] Commented: (PIG-1140) [zebra] Use of Hadoop 2.0 APIs

2010-02-11 Thread Gaurav Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832776#action_12832776
 ] 

Gaurav Jain commented on PIG-1140:
--

 
+1

> [zebra] Use of Hadoop 2.0 APIs  
> 
>
> Key: PIG-1140
> URL: https://issues.apache.org/jira/browse/PIG-1140
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.6.0
>Reporter: Yan Zhou
>Assignee: Xuefu Zhang
> Fix For: 0.7.0
>
> Attachments: zebra.0209, zebra.0211
>
>
> Currently, Zebra is still using already deprecated Hadoop 1.8 APIs. Need to 
> upgrade to its 2.0 APIs.




[jira] Updated: (PIG-1115) [zebra] temp files are not cleaned.

2010-02-16 Thread Gaurav Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gaurav Jain updated PIG-1115:
-

Attachment: PIG-1115.patch


Patch for the fix.

We rely on the application to call BTOF.close() (BasicTableOutputFormat) for 
successful jobs because, in the Hadoop 0.21 OutputCommitter, we cannot 
differentiate between failed and successful jobs. A Hadoop patch for this 
issue is available in Hadoop 0.22.

For the same reason, we rely on applications to clean up any unwanted 
files/dirs for FAILED JOBS, as they are doing currently.

Once the Hadoop patch/release is available, we can port the above inside the 
zebra libraries.

> [zebra] temp files are not cleaned.
> ---
>
> Key: PIG-1115
> URL: https://issues.apache.org/jira/browse/PIG-1115
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Hong Tang
>Assignee: Gaurav Jain
> Attachments: PIG-1115.patch
>
>
> Temp files created by zebra during table creation are not cleaned where there 
> is any task failure, which results in waste of disk space.




[jira] Commented: (PIG-1115) [zebra] temp files are not cleaned.

2010-02-16 Thread Gaurav Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834380#action_12834380
 ] 

Gaurav Jain commented on PIG-1115:
--


We discussed the backport (patch MAPREDUCE-947) with the M/R team; the earliest 
it can be done is in the next release of Hadoop.

By that I mean Hadoop 0.20/0.21 (any release other than trunk).

> [zebra] temp files are not cleaned.
> ---
>
> Key: PIG-1115
> URL: https://issues.apache.org/jira/browse/PIG-1115
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Hong Tang
>Assignee: Gaurav Jain
> Attachments: PIG-1115.patch
>
>
> Temp files created by zebra during table creation are not cleaned where there 
> is any task failure, which results in waste of disk space.




[jira] Created: (PIG-1240) [Zebra] suggestion to have zebra manifest file contain version and svn-revision etc.

2010-02-16 Thread Gaurav Jain (JIRA)
[Zebra]  suggestion to have zebra manifest file contain version and 
svn-revision etc.
-

 Key: PIG-1240
 URL: https://issues.apache.org/jira/browse/PIG-1240
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.7.0
Reporter: Gaurav Jain
Assignee: Gaurav Jain
Priority: Minor
 Fix For: 0.7.0



The Zebra jar's manifest file should contain the version, svn-revision, etc.




[jira] Updated: (PIG-1240) [Zebra] suggestion to have zebra manifest file contain version and svn-revision etc.

2010-02-16 Thread Gaurav Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gaurav Jain updated PIG-1240:
-

Attachment: PIG-1240.patch


The old manifest file looked like:


Manifest-Version: 1.0
Ant-Version: Apache Ant 1.7.1
Created-By: 14.0-b16 (Sun Microsystems Inc.)
Main-class:   org.apache.hadoop.zebra.io.BasicTable



The new manifest file would look like:


Manifest-Version: 1.0
Ant-Version: Apache Ant 1.7.1
Created-By: 14.0-b16 (Sun Microsystems Inc.)

Name: org/apache/hadoop/zebra
Implementation-Vendor: Apache
Implementation-Title: Zebra
Implementation-Version: 0.7.0-dev
Build-TimeStamp: Feb 16 2010, 20:17:07
Svn-Revision: 910376


Zebra is a library and does not have a main class, so the Main-class attribute 
is removed.
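
For reference, a minimal sketch of how a per-entry attribute such as 
Implementation-Version could be read back with java.util.jar.Manifest. 
ManifestSketch and entryValue are hypothetical names; the manifest text is 
passed in as a string here instead of being read from the jar:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

// Hypothetical helper for looking up attributes of a named manifest section.
final class ManifestSketch {

    // Returns the value of `key` in the section "Name: <entryName>",
    // or null if the section or the key is absent.
    static String entryValue(String manifestText, String entryName, String key)
            throws IOException {
        Manifest mf = new Manifest(new ByteArrayInputStream(
            manifestText.getBytes(StandardCharsets.UTF_8)));
        Attributes attrs = mf.getAttributes(entryName);
        return attrs == null ? null : attrs.getValue(key);
    }
}
```

With the new layout, entryValue(text, "org/apache/hadoop/zebra", 
"Implementation-Version") would be the lookup for the version string.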

> [Zebra]  suggestion to have zebra manifest file contain version and 
> svn-revision etc.
> -
>
> Key: PIG-1240
> URL: https://issues.apache.org/jira/browse/PIG-1240
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.7.0
>Reporter: Gaurav Jain
>Assignee: Gaurav Jain
>Priority: Minor
> Fix For: 0.7.0
>
> Attachments: PIG-1240.patch
>
>
> The Zebra jar's manifest file should contain the version, svn-revision, etc.




[jira] Updated: (PIG-1164) [zebra]smoke test

2010-02-22 Thread Gaurav Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gaurav Jain updated PIG-1164:
-

Attachment: PIG-SMOKE.patch


Contains:

ant target for building the zebra smoke jar and tarballs

M/R and Pig smoke tests

> [zebra]smoke test
> -
>
> Key: PIG-1164
> URL: https://issues.apache.org/jira/browse/PIG-1164
> Project: Pig
>  Issue Type: Test
>Affects Versions: 0.6.0
>Reporter: Jing Huang
> Fix For: 0.7.0
>
> Attachments: PIG-SMOKE.patch, smoke.patch
>
>
> Change zebra build.xml file to add smoke target. 
> And env.sh and run script under zebra/src/test/smoke dir




[jira] Updated: (PIG-1164) [zebra]smoke test

2010-02-23 Thread Gaurav Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gaurav Jain updated PIG-1164:
-

Attachment: PIG-1164.patch


Excluded smoke tests from zebra nightly tests

> [zebra]smoke test
> -
>
> Key: PIG-1164
> URL: https://issues.apache.org/jira/browse/PIG-1164
> Project: Pig
>  Issue Type: Test
>Affects Versions: 0.6.0
>Reporter: Jing Huang
> Fix For: 0.7.0
>
> Attachments: PIG-1164.patch, PIG-SMOKE.patch, smoke.patch
>
>
> Change zebra build.xml file to add smoke target. 
> And env.sh and run script under zebra/src/test/smoke dir




[jira] Commented: (PIG-1207) [zebra] Data sanity check should be performed at the end of writing instead of later at query time

2010-03-09 Thread Gaurav Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843390#action_12843390
 ] 

Gaurav Jain commented on PIG-1207:
--


Looks good

+1

> [zebra] Data sanity check should be performed at the end  of writing instead 
> of later at query time
> ---
>
> Key: PIG-1207
> URL: https://issues.apache.org/jira/browse/PIG-1207
> Project: Pig
>  Issue Type: Improvement
>Reporter: Yan Zhou
>Assignee: Yan Zhou
> Attachments: PIG-1207.patch, PIG-1207.patch
>
>
> Currently the equality check on the number of rows across different column 
> groups is performed at query time. The error info is sketchy: it only emits 
> "Column groups are not evenly distributed" or, worse, throws an 
> IndexOutOfBound exception from CGScanner.getCGValue. This happens because 
> BasicTable.atEnd and BasicTable.getKey, which are called just before 
> BasicTable.getValue, only check the first column group in the projection. 
> Any discrepancy in the number of rows per file across multiple column groups 
> in the projection can let BasicTable.atEnd return false and 
> BasicTable.getKey return a key normally, while another column group has 
> already exhausted its current file, so the call to its CGScanner.getCGValue 
> throws the exception. 
> This check should also be performed at the end of writing and the error info 
> should be more informative.




[jira] Commented: (PIG-1258) [zebra] Number of sorted input splits is unusually high

2010-03-19 Thread Gaurav Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847560#action_12847560
 ] 

Gaurav Jain commented on PIG-1258:
--


+1

> [zebra] Number of sorted input splits is unusually high
> ---
>
> Key: PIG-1258
> URL: https://issues.apache.org/jira/browse/PIG-1258
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.6.0
>Reporter: Yan Zhou
> Attachments: PIG-1258.patch
>
>
> Number of sorted input splits is unusually high if the projections are on 
> multiple column groups, or a union of tables, or column group(s) that hold 
> many small tfiles. In one test, the number is about 100 times bigger than 
> that from unsorted input splits on the same input tables.




[jira] Created: (PIG-1318) [Zebra] Invalid type for source_table field when using order-preserving Sorted Table Union

2010-03-22 Thread Gaurav Jain (JIRA)
[Zebra] Invalid type for source_table field when using order-preserving Sorted 
Table Union
--

 Key: PIG-1318
 URL: https://issues.apache.org/jira/browse/PIG-1318
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Gaurav Jain
 Fix For: 0.7.0


When we are trying to use order-preserving sorted union:


We got the following schema, where the type of 'source_table' is (null) with no 
column name:

{id: chararray,name: chararray,context: chararray,writer: chararray,rev: 
chararray,schema: chararray,(null)}

I tried to project the 'source_table' field but failed:

B = FOREACH A GENERATE id, $6; 
DUMP B;

But then we got exception org.apache.pig.impl.logicalLayer.FrontendException: 
ERROR 1066: Unable to open iterator for alias B.

Can you guys please let us know how to access this column? Or is the symptom 
described above a bug?






[jira] Updated: (PIG-1318) [Zebra] Invalid type for source_table field when using order-preserving Sorted Table Union

2010-03-23 Thread Gaurav Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gaurav Jain updated PIG-1318:
-

Attachment: PIG-1318.patch

> [Zebra] Invalid type for source_table field when using order-preserving 
> Sorted Table Union
> --
>
> Key: PIG-1318
> URL: https://issues.apache.org/jira/browse/PIG-1318
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Gaurav Jain
> Fix For: 0.7.0
>
> Attachments: PIG-1318.patch
>
>
> When we are trying to use order-preserving sorted union:
> 
> We got the following schema, where the type of 'source_table' is (null) with 
> no column name:
> {id: chararray,name: chararray,context: chararray,writer: chararray,rev: 
> chararray,schema: chararray,(null)}
> I tried to project the 'source_table' field but failed:
> B = FOREACH A GENERATE id, $6; 
> DUMP B;
> But then we got exception org.apache.pig.impl.logicalLayer.FrontendException: 
> ERROR 1066: Unable to open iterator for alias B.
> Can you guys please let us know how to access this column? Or is the symptom 
> described above a bug?




[jira] Updated: (PIG-1318) [Zebra] Invalid type for source_table field when using order-preserving Sorted Table Union

2010-03-23 Thread Gaurav Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gaurav Jain updated PIG-1318:
-

Status: Patch Available  (was: Open)


Fix for the jira.

> [Zebra] Invalid type for source_table field when using order-preserving 
> Sorted Table Union
> --
>
> Key: PIG-1318
> URL: https://issues.apache.org/jira/browse/PIG-1318
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Gaurav Jain
> Fix For: 0.7.0
>
> Attachments: PIG-1318.patch
>
>
> When we are trying to use order-preserving sorted union:
> 
> We got the following schema, where the type of 'source_table' is (null) with 
> no column name:
> {id: chararray,name: chararray,context: chararray,writer: chararray,rev: 
> chararray,schema: chararray,(null)}
> I tried to project the 'source_table' field but failed:
> B = FOREACH A GENERATE id, $6; 
> DUMP B;
> But then we got exception org.apache.pig.impl.logicalLayer.FrontendException: 
> ERROR 1066: Unable to open iterator for alias B.
> Can you guys please let us know how to access this column? Or is the symptom 
> described above a bug?




[jira] Created: (PIG-1355) [Zebra] Zebra Multiple Outputs should enable application to skip records

2010-04-05 Thread Gaurav Jain (JIRA)
[Zebra]  Zebra Multiple Outputs should enable application to skip records
-

 Key: PIG-1355
 URL: https://issues.apache.org/jira/browse/PIG-1355
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.8.0
Reporter: Gaurav Jain
Priority: Minor
 Fix For: 0.8.0



Applications may not always want to write a record to a table. Zebra should 
allow applications to skip such records.

The Zebra Multiple Outputs interface allows users to stream data to different 
tables by inspecting the data Tuple. 

https://issues.apache.org/jira/browse/PIG-

So,

If ZebraOutputPartition returns -1, Zebra Multiple Outputs will skip that 
record and thus will not write it to any table.

However, Zebra BasicTableOutputFormat (different from Zebra Multiple Outputs) 
will write every record to a table.




[jira] Assigned: (PIG-1355) [Zebra] Zebra Multiple Outputs should enable application to skip records

2010-04-05 Thread Gaurav Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gaurav Jain reassigned PIG-1355:


Assignee: Gaurav Jain




[jira] Commented: (PIG-1355) [Zebra] Zebra Multiple Outputs should enable application to skip records

2010-04-05 Thread Gaurav Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853574#action_12853574
 ] 

Gaurav Jain commented on PIG-1355:
--


BasicTableOutputFormat supports both single and multi output modes.

If multi output mode is not set, every record will be written to a table.

In multi output mode, -1 is treated specially to skip records.
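
The behaviour described in this comment can be sketched roughly as follows. This is an illustrative Python model only, not Zebra's Java API: `write_records`, `by_country`, and the record layout are hypothetical names chosen for the example; only the -1 sentinel and the single/multi mode distinction come from the issue.

```python
SKIP = -1  # sentinel partition index described in the issue


def write_records(records, num_tables, partition_fn=None):
    """Return one list of written records per table.

    partition_fn=None models single output mode: every record is written
    to table 0. Otherwise partition_fn(record) picks the table index, and
    SKIP means the record is not written to any table.
    """
    tables = [[] for _ in range(num_tables)]
    for record in records:
        if partition_fn is None:
            tables[0].append(record)
            continue
        index = partition_fn(record)
        if index == SKIP:
            continue  # multi output mode: drop this record
        if not 0 <= index < num_tables:
            raise ValueError(f"partition index {index} out of range")
        tables[index].append(record)
    return tables


def by_country(record):
    # Hypothetical routing rule: records from unknown countries are skipped.
    return {"us": 0, "in": 1}.get(record["country"], SKIP)


records = [
    {"id": 1, "country": "us"},
    {"id": 2, "country": "xx"},
    {"id": 3, "country": "in"},
]
multi = write_records(records, 2, by_country)   # record 2 is skipped
single = write_records(records, 1)              # all three records written
```

Using a reserved index keeps the partition function's return type unchanged, so applications that never return -1 behave exactly as before.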





[jira] Created: (PIG-1361) [Zebra] Zebra TableLoader.getSchema() should return the projectionSchema specified in the constructor of TableLoader instead of pruned projection by pig

2010-04-07 Thread Gaurav Jain (JIRA)
[Zebra] Zebra TableLoader.getSchema() should return the projectionSchema 
specified in the constructor of TableLoader instead of pruned projection by pig 
-

 Key: PIG-1361
 URL: https://issues.apache.org/jira/browse/PIG-1361
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.8.0
Reporter: Gaurav Jain
Priority: Minor
 Fix For: 0.8.0



For consistency among different TableLoaders, Pig requests that Zebra 
TableLoader.getSchema() return the projectionSchema specified in the 
constructor of TableLoader instead of the projection pruned by Pig.
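
A minimal sketch of the requested behaviour, written in Python for brevity. The class and method names (`SketchTableLoader`, `push_projection`, `columns_to_read`) are hypothetical and do not mirror the real Zebra or Pig Java interfaces; the point is only that the reported schema stays fixed while the pruned projection is tracked separately.

```python
class SketchTableLoader:
    """Toy loader that keeps the constructor projection and the pruned one apart."""

    def __init__(self, projection):
        # Projection given by the user when the loader is constructed.
        self._constructor_projection = list(projection)
        # Projection actually read; starts out identical.
        self._pruned_projection = list(projection)

    def push_projection(self, required_columns):
        # Models Pig pruning columns the script never uses.
        required = set(required_columns)
        self._pruned_projection = [
            c for c in self._constructor_projection if c in required
        ]

    def get_schema(self):
        # Always reports the constructor projection, regardless of pruning.
        return list(self._constructor_projection)

    def columns_to_read(self):
        # What the loader would actually fetch from storage.
        return list(self._pruned_projection)


loader = SketchTableLoader(["id", "name", "context"])
loader.push_projection(["id"])
print(loader.get_schema())       # ['id', 'name', 'context']
print(loader.columns_to_read())  # ['id']
```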




[jira] Assigned: (PIG-1361) [Zebra] Zebra TableLoader.getSchema() should return the projectionSchema specified in the constructor of TableLoader instead of pruned projection by pig

2010-04-07 Thread Gaurav Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gaurav Jain reassigned PIG-1361:


Assignee: Gaurav Jain




[jira] Commented: (PIG-1291) [zebra] Zebra need to support the virtual column 'source_table' for the unsorted table unions also

2010-04-08 Thread Gaurav Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855038#action_12855038
 ] 

Gaurav Jain commented on PIG-1291:
--

 +1

> [zebra] Zebra need to support the virtual column 'source_table' for the 
> unsorted table unions also 
> ---
>
> Key: PIG-1291
> URL: https://issues.apache.org/jira/browse/PIG-1291
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.7.0, 0.8.0
>Reporter: Alok Singh
>Assignee: Yan Zhou
> Fix For: 0.7.0, 0.8.0
>
> Attachments: PIG-1291.patch, PIG-1291.patch, PIG-1291.patch
>
>
> In the Pig contrib project Zebra,
> when a user takes the union of sorted tables, the resulting table contains a 
> virtual column called 'source_table',
> which lets the user know the name of the original table that each row of 
> the result table comes from.
> This feature is also very useful when the input tables are not 
> sorted.
> Based on the discussion with the Zebra dev team, it should be easy to 
> implement.
> I am filing this enhancement JIRA for Zebra.
> Alok




[jira] Updated: (PIG-1361) [Zebra] Zebra TableLoader.getSchema() should return the projectionSchema specified in the constructor of TableLoader instead of pruned projection by pig

2010-04-10 Thread Gaurav Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gaurav Jain updated PIG-1361:
-

Attachment: PIG-1361.patch


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Updated: (PIG-1361) [Zebra] Zebra TableLoader.getSchema() should return the projectionSchema specified in the constructor of TableLoader instead of pruned projection by pig

2010-04-10 Thread Gaurav Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gaurav Jain updated PIG-1361:
-

Status: Patch Available  (was: Open)





[jira] Created: (PIG-1411) [Zebra] Can Zebra use HAR to reduce file/block count for namenode

2010-05-10 Thread Gaurav Jain (JIRA)
[Zebra] Can Zebra use HAR to reduce file/block count for namenode
-

 Key: PIG-1411
 URL: https://issues.apache.org/jira/browse/PIG-1411
 Project: Pig
  Issue Type: New Feature
  Components: impl
Affects Versions: 0.8.0
Reporter: Gaurav Jain
Priority: Minor
 Fix For: 0.8.0



Due to its column group structure, Zebra can create extra files for the namenode 
to track, which means the namenode uses more memory for Zebra-related files.

The goal is to reduce the number of files/blocks.

Among the various options, the idea is to use HAR (Hadoop Archive). A Hadoop 
Archive reduces the block and file count by copying data from small files (1M, 
2M, ...) into an HDFS block of larger size, thus reducing the total number of 
blocks and files.


 




[jira] Assigned: (PIG-1411) [Zebra] Can Zebra use HAR to reduce file/block count for namenode

2010-05-10 Thread Gaurav Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gaurav Jain reassigned PIG-1411:


Assignee: Gaurav Jain




[jira] Commented: (PIG-1411) [Zebra] Can Zebra use HAR to reduce file/block count for namenode

2010-05-10 Thread Gaurav Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866003#action_12866003
 ] 

Gaurav Jain commented on PIG-1411:
--


We ran a few performance tests and found the following:

-- Zebra table with 700 files and 70G of data in them
-- Created a HAR file for the above table with 36 files of ~2GB

Observed:

10-15 seconds of overhead in creating splits per HAR file

50% increase in SLOT_MILLI_MAPS between HAR and non-HAR reading of the Zebra 
table, with HAR taking more time

Similar results were observed for a union of the above tables (5-table union)

Further performance tests are subject to the fix for MAPREDUCE-1712






[jira] Commented: (PIG-1411) [Zebra] Can Zebra use HAR to reduce file/block count for namenode

2010-05-10 Thread Gaurav Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866008#action_12866008
 ] 

Gaurav Jain commented on PIG-1411:
--


In general, 

HAR is a good idea for a use case with lots of small files. 

In the namenode:

-- Each block takes 200 bytes
-- There are 3 replicas, so 600 bytes
-- 200 bytes for the inode of a file with 1 block
-- 800 bytes to 1K for 1 file with 1 block

Let's say:

-- There are 128 files of 1M each
-- 128K bytes are taken in the namenode

With HAR:

-- HDFS block size of 128M
-- All 128 of the 1M blocks are written to 1 block in a HAR part file
-- 1K is taken in the namenode

As seen, the amount of memory consumed goes down considerably.
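
The estimate above can be reproduced with a few lines of Python. This is a back-of-the-envelope sketch: the per-block and per-inode byte counts are the approximate figures quoted in this comment, `namenode_bytes` is a hypothetical helper, and the extra index files a HAR carries are ignored.

```python
BLOCK_BYTES = 200   # approximate namenode bytes per block replica (from the comment)
REPLICAS = 3        # replication factor assumed in the comment
INODE_BYTES = 200   # approximate namenode bytes per file inode (from the comment)


def namenode_bytes(num_files, blocks_per_file):
    """Rough namenode memory for num_files files of blocks_per_file blocks each."""
    per_file = INODE_BYTES + blocks_per_file * BLOCK_BYTES * REPLICAS
    return num_files * per_file


# 128 small files of 1M each: one block per file.
without_har = namenode_bytes(num_files=128, blocks_per_file=1)

# The same data packed into a single 128M block of one HAR part file.
with_har = namenode_bytes(num_files=1, blocks_per_file=1)

print(without_har, with_har)  # 102400 800
```

With 800 bytes per single-block file the totals come to about 100K versus under 1K, which matches the rounded 128K versus 1K figures when the upper bound of 1K per file is used.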

So, in this use case, if the fixed performance overhead is acceptable to the 
application, then HAR is a good choice for LONG RUNNING jobs.

However, for files >= 128M, HAR does not yield significant memory savings. 

Explained below.




[jira] Commented: (PIG-1411) [Zebra] Can Zebra use HAR to reduce file/block count for namenode

2010-05-11 Thread Gaurav Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866225#action_12866225
 ] 

Gaurav Jain commented on PIG-1411:
--


-- 128 files totalling 128 blocks of 128M each
-- 128K bytes taken in the namenode

-- With a 2GB HAR block size, 128 files --> 8 files (16 blocks in one HAR part 
file)
-- ~80K bytes taken in the namenode
-- The total number of HDFS blocks stays the same, since each block is still 128M

So it's a ~40% improvement in namespace memory (128K down to ~80K), which is 
not huge and needs to be evaluated against the performance loss of using HAR.

With larger files, the savings are not huge, and performance should be taken 
into account before using HAR.

With larger block sizes for either HAR or HDFS, further gains are expected, but 
those have their own tradeoffs.
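
The large-file case can be checked the same way. The sketch below is an assumption-laden estimate, not a measurement: it reuses the approximate figures from the earlier comment (200 bytes per block replica, 3 replicas, 200 bytes per inode), and `namenode_bytes` is a hypothetical helper.

```python
BLOCK_BYTES = 200   # approximate namenode bytes per block replica
REPLICAS = 3        # replication factor
INODE_BYTES = 200   # approximate namenode bytes per file inode


def namenode_bytes(num_files, blocks_per_file):
    """Rough namenode memory for num_files files of blocks_per_file blocks each."""
    per_file = INODE_BYTES + blocks_per_file * BLOCK_BYTES * REPLICAS
    return num_files * per_file


# 128 files, each a single 128M block.
before = namenode_bytes(num_files=128, blocks_per_file=1)

# 8 HAR part files of 2GB, each holding 16 blocks of 128M.
after = namenode_bytes(num_files=8, blocks_per_file=16)

reduction = 1 - after / before
print(before, after, f"{reduction:.0%}")  # 102400 78400 23%
```

Because the block count is unchanged, only the per-file inode cost shrinks. The model gives roughly a 23% reduction, or about 38% with the rounded 128K and ~80K figures, which is far smaller than the gain in the small-file case.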




[jira] Commented: (PIG-1432) [zebra] There are some debugging info output to STDOUT in PIG's TableStorer call path

2010-06-02 Thread Gaurav Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874871#action_12874871
 ] 

Gaurav Jain commented on PIG-1432:
--


+1

> [zebra] There are some debugging info output to STDOUT in PIG's TableStorer 
> call path
> 
>
> Key: PIG-1432
> URL: https://issues.apache.org/jira/browse/PIG-1432
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Yan Zhou
>Assignee: Yan Zhou
>Priority: Trivial
> Fix For: 0.7.0
>
> Attachments: PIG-1432.patch
>
>
> Users redirecting STDOUT to disk file got "disk full" errors.




[jira] Created: (PIG-1444) [Zebra] Zebra build should have a test-smoke target

2010-06-09 Thread Gaurav Jain (JIRA)
[Zebra] Zebra build should have a test-smoke target
---

 Key: PIG-1444
 URL: https://issues.apache.org/jira/browse/PIG-1444
 Project: Pig
  Issue Type: Task
  Components: build
Affects Versions: 0.8.0
Reporter: Gaurav Jain
Priority: Minor
 Fix For: 0.8.0


The Zebra build should have a test-smoke target that at least uses minicluster 
for its test cases.




[jira] Updated: (PIG-1444) [Zebra] Zebra build should have a test-smoke target

2010-06-09 Thread Gaurav Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gaurav Jain updated PIG-1444:
-

Attachment: PIG-1444.patch


patch 1




[jira] Commented: (PIG-1451) [zebra] change the build.test property in build to test.build.dir to be consistent with PIG

2010-06-15 Thread Gaurav Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879066#action_12879066
 ] 

Gaurav Jain commented on PIG-1451:
--


+1

> [zebra] change the build.test property in build to test.build.dir to be 
> consistent with PIG
> --
>
> Key: PIG-1451
> URL: https://issues.apache.org/jira/browse/PIG-1451
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.6.0, 0.7.0, 0.8.0
>Reporter: Yan Zhou
>Assignee: Yan Zhou
>Priority: Minor
> Fix For: 0.6.0, 0.7.0, 0.8.0
>
> Attachments: PIG-1451.patch
>
>
> Because the build process handles PIG and Zebra builds with the same settings, 
> the property should be the same so the build process has consistent controls.
