[jira] Updated: (PIG-831) Records and bytes written reported by pig are wrong in a multi-store program

2009-06-04 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-831:
---

Status: Patch Available  (was: Open)

> Records and bytes written reported by pig are wrong in a multi-store program
> 
>
> Key: PIG-831
> URL: https://issues.apache.org/jira/browse/PIG-831
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.3.0
>    Reporter: Alan Gates
>Assignee: Alan Gates
>Priority: Minor
> Attachments: PIG-831.patch
>
>
> The stats features checked in as part of PIG-626 (reporting the number of 
> records and bytes written at the end of the query) print wrong values (often 
> but not always 0) when the pig script being run contains more than 1 store.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-830) Port Apache Log parsing piggybank contrib to Pig 0.2

2009-06-05 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716677#action_12716677
 ] 

Alan Gates commented on PIG-830:


I've started reviewing it.  I hope to finish the review today or Monday.

> Port Apache Log parsing piggybank contrib to Pig 0.2
> 
>
> Key: PIG-830
> URL: https://issues.apache.org/jira/browse/PIG-830
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.2.0
>Reporter: Dmitriy V. Ryaboy
>Priority: Minor
> Attachments: pig-830-v2.patch, pig-830-v3.patch, pig-830.patch, 
> TEST-org.apache.pig.piggybank.test.storage.TestMyRegExLoader.txt
>
>
> The piggybank contribs (pig-472, pig-473,  pig-474, pig-476, pig-486, 
> pig-487, pig-488, pig-503, pig-509) got dropped after the types branch was 
> merged in.
> They should be updated to work with the current APIs and added back into 
> trunk.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-823) Hadoop Metadata Service

2009-06-05 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716679#action_12716679
 ] 

Alan Gates commented on PIG-823:


By lower level of metadata, we don't mean storing information already present 
in the namenode.  The difference is in the model perspective.  Hive's metadata 
model consists of tables and partitions, which is appropriate since it works 
with SQL which presents a relational view to users.  Our proposal is to 
construct a metadata service that models directories and files.  Map Reduce and 
Pig Latin present a file based view to users, and thus this model is more 
appropriate for those tools.

I met a couple of times with the Facebook team to discuss metadata, and our 
desire to have a hierarchical model.  They agreed that this did not fit with 
the model they were using.  We both agreed that any metadata service built 
around the files should have an interface that their metadata service can 
easily connect to, so that if a user wishes to use both they can do so without 
needing to register metadata in both.

As for documentation, we're working on getting it ready for external release.  We 
hope to post it in the next week or so.
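
To make the file-and-directory-oriented model concrete, here is a purely 
hypothetical sketch of what registering and looking up metadata against an HDFS 
path might look like.  None of these names come from the actual proposal; the 
real design is in the document mentioned above, once it is posted.

{code}
// Hypothetical illustration only -- not taken from the actual proposal.
import java.io.IOException;
import java.util.Map;

public interface FileMetadataService {

    // Attach key/value metadata (schema, statistics, ...) to a file or
    // directory path; metadata on a directory applies to the files under it.
    void register(String hdfsPath, Map<String, String> metadata) throws IOException;

    // Retrieve metadata previously registered for a path.
    Map<String, String> lookup(String hdfsPath) throws IOException;
}
{code}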


> Hadoop Metadata Service
> ---
>
> Key: PIG-823
> URL: https://issues.apache.org/jira/browse/PIG-823
> Project: Pig
>  Issue Type: New Feature
>Reporter: Olga Natkovich
>
> This JIRA is created to track development of a metadata system for  Hadoop. 
> The goal of the system is to allow users and applications to register data 
> stored on HDFS, search for the data available on HDFS, and associate metadata 
> such as schema, statistics, etc. with a particular data unit or a data set 
> stored on HDFS. The initial goal is to provide a fairly generic, low level 
> abstraction that any user or application on HDFS can use to store and retrieve 
> metadata. Over time, higher level abstractions closely tied to particular 
> applications or tools can be developed.
> Over time, it would make sense for the metadata service to become a 
> subproject within Hadoop. For now, the proposal is to make it a contrib to 
> Pig since Pig SQL is likely to be the first user of the system.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-830) Port Apache Log parsing piggybank contrib to Pig 0.2

2009-06-05 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-830:
---

   Resolution: Fixed
Fix Version/s: 0.3.0
   Status: Resolved  (was: Patch Available)

Patch checked in.  Thanks Dmitriy for picking this up and getting it back into 
Piggy Bank.

> Port Apache Log parsing piggybank contrib to Pig 0.2
> 
>
> Key: PIG-830
> URL: https://issues.apache.org/jira/browse/PIG-830
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.2.0
>Reporter: Dmitriy V. Ryaboy
>Priority: Minor
> Fix For: 0.3.0
>
> Attachments: pig-830-v2.patch, pig-830-v3.patch, pig-830.patch, 
> TEST-org.apache.pig.piggybank.test.storage.TestMyRegExLoader.txt
>
>
> The piggybank contribs (pig-472, pig-473,  pig-474, pig-476, pig-486, 
> pig-487, pig-488, pig-503, pig-509) got dropped after the types branch was 
> merged in.
> They should be updated to work with the current APIs and added back into 
> trunk.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-823) Hadoop Metadata Service

2009-06-09 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717841#action_12717841
 ] 

Alan Gates commented on PIG-823:


http://wiki.apache.org/pig/Metadata

> Hadoop Metadata Service
> ---
>
> Key: PIG-823
> URL: https://issues.apache.org/jira/browse/PIG-823
> Project: Pig
>  Issue Type: New Feature
>Reporter: Olga Natkovich
>
> This JIRA is created to track development of a metadata system for  Hadoop. 
> The goal of the system is to allow users and applications to register data 
> stored on HDFS, search for the data available on HDFS, and associate metadata 
> such as schema, statistics, etc. with a particular data unit or a data set 
> stored on HDFS. The initial goal is to provide a fairly generic, low level 
> abstraction that any user or application on HDFS can use to store and retrieve 
> metadata. Over time, higher level abstractions closely tied to particular 
> applications or tools can be developed.
> Over time, it would make sense for the metadata service to become a 
> subproject within Hadoop. For now, the proposal is to make it a contrib to 
> Pig since Pig SQL is likely to be the first user of the system.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-823) Hadoop Metadata Service

2009-06-10 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718063#action_12718063
 ] 

Alan Gates commented on PIG-823:


In response to Matei's comment:

The intent is not that this is Pig metadata, but that it be grid wide metadata. 
 We don't want to put it directly in HDFS by extending the namenode, since the 
namenode is already heavily loaded and a central contention point in the 
system.  We also want it to remain optional, as many users will not need it.

The vision is that this will be a separate module that Hadoop users can choose 
to install and use with their system, along with other modules they use, such 
as Pig, Hive, Chukwa, etc.

The Pig team is volunteering to put it in our contrib for now because Pig is 
interested in it and willing to devote the resources to help it get started.

> Hadoop Metadata Service
> ---
>
> Key: PIG-823
> URL: https://issues.apache.org/jira/browse/PIG-823
> Project: Pig
>  Issue Type: New Feature
>Reporter: Olga Natkovich
>
> This JIRA is created to track development of a metadata system for  Hadoop. 
> The goal of the system is to allow users and applications to register data 
> stored on HDFS, search for the data available on HDFS, and associate metadata 
> such as schema, statistics, etc. with a particular data unit or a data set 
> stored on HDFS. The initial goal is to provide a fairly generic, low level 
> abstraction that any user or application on HDFS can use to store and retrieve 
> metadata. Over time, higher level abstractions closely tied to particular 
> applications or tools can be developed.
> Over time, it would make sense for the metadata service to become a 
> subproject within Hadoop. For now, the proposal is to make it a contrib to 
> Pig since Pig SQL is likely to be the first user of the system.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.




[jira] Commented: (PIG-6) Addition of Hbase Storage Option In Load/Store Statement

2009-06-10 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-6?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718071#action_12718071
 ] 

Alan Gates commented on PIG-6:
--

The outstanding patch that has not been applied (m34813f5.txt) connects Pig 
with Hbase 0.19.  Since Pig does not yet support Hadoop 0.19 as a released 
version (there is a patch you can apply and build yourself to make it work), we 
haven't incorporated this patch yet either.

> Addition of Hbase Storage Option In Load/Store Statement
> 
>
> Key: PIG-6
> URL: https://issues.apache.org/jira/browse/PIG-6
> Project: Pig
>  Issue Type: New Feature
> Environment: all environments
>Reporter: Edward J. Yoon
> Fix For: 0.2.0
>
> Attachments: hbase-0.18.1-test.jar, hbase-0.18.1.jar, m34813f5.txt, 
> PIG-6.patch, PIG-6_V01.patch
>
>
> It needs to be able to load full table in hbase.  (maybe ... difficult? i'm 
> not sure yet.)
> Also, as described below, 
> It needs to compose an abstract 2d-table only with certain data filtered from 
> hbase array structure using arbitrary query-delimited. 
> {code}
> A = LOAD table('hbase_table');
> or
> B = LOAD table('hbase_table') Using HbaseQuery('Query-delimited by attributes 
> & timestamp') as (f1, f2[, f3]);
> {code}
> Once testing is done on my local machines, 
> I will clarify the grammar and give you more examples to help explain 
> more storage options. 
> Any advice welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-753) Provide support for UDFs without parameters

2009-06-13 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-753:
---

Status: Patch Available  (was: Open)

Marking patch submitted so that hudson will pick it up.

> Provide support for UDFs without parameters
> ---
>
> Key: PIG-753
> URL: https://issues.apache.org/jira/browse/PIG-753
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.3.0
>Reporter: Jeff Zhang
> Fix For: 0.3.0
>
> Attachments: Pig_753_Patch.txt
>
>
> Pig does not support UDFs without parameters; it forces me to provide a parameter,
> like the following statement:
>  B = FOREACH A GENERATE bagGenerator();  this will generate an error. I have to 
> provide a parameter like the following:
>  B = FOREACH A GENERATE bagGenerator($0);
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-753) Provide support for UDFs without parameters

2009-06-15 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719616#action_12719616
 ] 

Alan Gates commented on PIG-753:


The test failures are in bzip tests, which I doubt are affected by this.  I'll 
run them myself with the patch to check.  But the release audit warnings are 
real.  The two new test files need to have Apache license headers put on them.  You can 
grab the header from any of the other java files.
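
For reference, the header in question is the standard ASF license header found 
at the top of the existing java files:

{code}
/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
{code}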

> Provide support for UDFs without parameters
> ---
>
> Key: PIG-753
> URL: https://issues.apache.org/jira/browse/PIG-753
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.3.0
>Reporter: Jeff Zhang
> Fix For: 0.3.0
>
> Attachments: Pig_753_Patch.txt
>
>
> Pig does not support UDFs without parameters; it forces me to provide a parameter,
> like the following statement:
>  B = FOREACH A GENERATE bagGenerator();  this will generate an error. I have to 
> provide a parameter like the following:
>  B = FOREACH A GENERATE bagGenerator($0);
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-753) Provide support for UDFs without parameters

2009-06-15 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-753:
---

Status: Open  (was: Patch Available)

> Provide support for UDFs without parameters
> ---
>
> Key: PIG-753
> URL: https://issues.apache.org/jira/browse/PIG-753
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.3.0
>Reporter: Jeff Zhang
> Fix For: 0.3.0
>
> Attachments: Pig_753_Patch.txt
>
>
> Pig does not support UDFs without parameters; it forces me to provide a parameter,
> like the following statement:
>  B = FOREACH A GENERATE bagGenerator();  this will generate an error. I have to 
> provide a parameter like the following:
>  B = FOREACH A GENERATE bagGenerator($0);
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-842) PigStorage should support multi-byte delimiters

2009-06-16 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720259#action_12720259
 ] 

Alan Gates commented on PIG-842:


I'm concerned about the performance hit of supporting multi-byte delimiters.  
Before we commit to doing this in PigStorage, we should test how much it slows 
down reading data.  If it is significant, we should consider having a 
PigMultiByteStorage or something that handles multi-byte delimiter characters.  
It could extend PigStorage and only differ in how it parses the records.
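
As a rough illustration of the extra per-record work (this is not the actual 
PigStorage code; the class and method names below are made up), a multi-byte 
delimiter means searching for a pattern rather than comparing single bytes:

{code}
// Hypothetical sketch: splitting one already-decoded record on a
// multi-byte delimiter using String.indexOf.
import java.util.ArrayList;
import java.util.List;

public class MultiByteDelimiterSplit {

    public static List<String> split(String record, String delimiter) {
        List<String> fields = new ArrayList<String>();
        int start = 0;
        int hit;
        while ((hit = record.indexOf(delimiter, start)) != -1) {
            fields.add(record.substring(start, hit));
            start = hit + delimiter.length();
        }
        fields.add(record.substring(start));  // trailing field
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(split("a::b::c", "::"));  // prints [a, b, c]
    }
}
{code}

Benchmarking something like this against the current single-byte scan would 
tell us whether the hit is big enough to justify a separate loader.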

> PigStorage should support multi-byte delimiters
> ---
>
> Key: PIG-842
> URL: https://issues.apache.org/jira/browse/PIG-842
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.3.0
>Reporter: Santhosh Srinivasan
> Fix For: 0.3.0
>
>
> Currently, PigStorage supports single byte delimiters. Users have requested 
> multi-byte delimiters. There are performance implications with multi-byte 
> delimiters, i.e., instead of looking for a single byte, PigStorage should 
> look for a pattern ala BinStorage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-734) Non-string keys in maps

2009-06-18 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-734:
---

Status: Open  (was: Patch Available)

> Non-string keys in maps
> ---
>
> Key: PIG-734
> URL: https://issues.apache.org/jira/browse/PIG-734
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.2.0
>    Reporter: Alan Gates
>    Assignee: Alan Gates
>Priority: Minor
> Fix For: 0.3.0
>
> Attachments: PIG-734.patch
>
>
> With the addition of types to pig, maps were changed to allow any atomic type 
> to be a key.  However, in practice we do not see people using keys other than 
> strings.  And allowing multiple types is causing us issues in serializing 
> data (we have to check what every key type is) and in the design for non-java 
> UDFs (since many scripting languages include associative arrays such as 
> Perl's hash).
> So I propose we scope back maps to only have string keys.  This would be a 
> non-compatible change.  But I am not aware of anyone using non-string keys, 
> so hopefully it would have little or no impact.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-734) Non-string keys in maps

2009-06-18 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-734:
---

Attachment: PIG-734_2.patch

New version of the patch, brought up to date with current trunk.

> Non-string keys in maps
> ---
>
> Key: PIG-734
> URL: https://issues.apache.org/jira/browse/PIG-734
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.2.0
>    Reporter: Alan Gates
>    Assignee: Alan Gates
>Priority: Minor
> Fix For: 0.3.0
>
> Attachments: PIG-734.patch, PIG-734_2.patch
>
>
> With the addition of types to pig, maps were changed to allow any atomic type 
> to be a key.  However, in practice we do not see people using keys other than 
> strings.  And allowing multiple types is causing us issues in serializing 
> data (we have to check what every key type is) and in the design for non-java 
> UDFs (since many scripting languages include associative arrays such as 
> Perl's hash).
> So I propose we scope back maps to only have string keys.  This would be a 
> non-compatible change.  But I am not aware of anyone using non-string keys, 
> so hopefully it would have little or no impact.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-734) Non-string keys in maps

2009-06-18 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-734:
---

Status: Patch Available  (was: Open)

> Non-string keys in maps
> ---
>
> Key: PIG-734
> URL: https://issues.apache.org/jira/browse/PIG-734
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.2.0
>    Reporter: Alan Gates
>    Assignee: Alan Gates
>Priority: Minor
> Fix For: 0.3.0
>
> Attachments: PIG-734.patch, PIG-734_2.patch
>
>
> With the addition of types to pig, maps were changed to allow any atomic type 
> to be a key.  However, in practice we do not see people using keys other than 
> strings.  And allowing multiple types is causing us issues in serializing 
> data (we have to check what every key type is) and in the design for non-java 
> UDFs (since many scripting languages include associative arrays such as 
> Perl's hash).
> So I propose we scope back maps to only have string keys.  This would be a 
> non-compatible change.  But I am not aware of anyone using non-string keys, 
> so hopefully it would have little or no impact.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-734) Non-string keys in maps

2009-06-18 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-734:
---

Status: Open  (was: Patch Available)

> Non-string keys in maps
> ---
>
> Key: PIG-734
> URL: https://issues.apache.org/jira/browse/PIG-734
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.2.0
>    Reporter: Alan Gates
>    Assignee: Alan Gates
>Priority: Minor
> Fix For: 0.3.0
>
> Attachments: PIG-734.patch, PIG-734_2.patch
>
>
> With the addition of types to pig, maps were changed to allow any atomic type 
> to be a key.  However, in practice we do not see people using keys other than 
> strings.  And allowing multiple types is causing us issues in serializing 
> data (we have to check what every key type is) and in the design for non-java 
> UDFs (since many scripting languages include associative arrays such as 
> Perl's hash).
> So I propose we scope back maps to only have string keys.  This would be a 
> non-compatible change.  But I am not aware of anyone using non-string keys, 
> so hopefully it would have little or no impact.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-734) Non-string keys in maps

2009-06-18 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-734:
---

Fix Version/s: (was: 0.3.0)
   0.4.0
   Status: Patch Available  (was: Open)

> Non-string keys in maps
> ---
>
> Key: PIG-734
> URL: https://issues.apache.org/jira/browse/PIG-734
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.2.0
>    Reporter: Alan Gates
>    Assignee: Alan Gates
>Priority: Minor
> Fix For: 0.4.0
>
> Attachments: PIG-734.patch, PIG-734_2.patch, PIG-734_3.patch
>
>
> With the addition of types to pig, maps were changed to allow any atomic type 
> to be a key.  However, in practice we do not see people using keys other than 
> strings.  And allowing multiple types is causing us issues in serializing 
> data (we have to check what every key type is) and in the design for non-java 
> UDFs (since many scripting languages include associative arrays such as 
> Perl's hash).
> So I propose we scope back maps to only have string keys.  This would be a 
> non-compatible change.  But I am not aware of anyone using non-string keys, 
> so hopefully it would have little or no impact.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-734) Non-string keys in maps

2009-06-18 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-734:
---

Attachment: PIG-734_3.patch

Attaching a version of the file that fixes some of the introduced compiler 
warnings.  The findbugs warnings have to do with naming convention.  All of the 
function names in QueryParser start with upper case, so I am only following the 
convention there.

> Non-string keys in maps
> ---
>
> Key: PIG-734
> URL: https://issues.apache.org/jira/browse/PIG-734
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.2.0
>    Reporter: Alan Gates
>    Assignee: Alan Gates
>Priority: Minor
> Fix For: 0.4.0
>
> Attachments: PIG-734.patch, PIG-734_2.patch, PIG-734_3.patch
>
>
> With the addition of types to pig, maps were changed to allow any atomic type 
> to be a key.  However, in practice we do not see people using keys other than 
> strings.  And allowing multiple types is causing us issues in serializing 
> data (we have to check what every key type is) and in the design for non-java 
> UDFs (since many scripting languages include associative arrays such as 
> Perl's hash).
> So I propose we scope back maps to only have string keys.  This would be a 
> non-compatible change.  But I am not aware of anyone using non-string keys, 
> so hopefully it would have little or no impact.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-753) Provide support for UDFs without parameters

2009-06-18 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721559#action_12721559
 ] 

Alan Gates commented on PIG-753:


+1

I tested the patch, and the issue was just with the bzip tests.

I'd like to have Santhosh's opinion on this, as he is the expert in the logical 
plan and type checker area where these changes are.

> Provide support for UDFs without parameters
> ---
>
> Key: PIG-753
> URL: https://issues.apache.org/jira/browse/PIG-753
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.3.0
>Reporter: Jeff Zhang
> Attachments: Pig_753_Patch.txt
>
>
> Pig does not support UDFs without parameters; it forces me to provide a parameter,
> like the following statement:
>  B = FOREACH A GENERATE bagGenerator();  this will generate an error. I have to 
> provide a parameter like the following:
>  B = FOREACH A GENERATE bagGenerator($0);
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-856) PERFORMANCE: reduce number of replicas

2009-06-18 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721590#action_12721590
 ] 

Alan Gates commented on PIG-856:


My $0.02, based on the assumption that we see a significant performance 
improvement using only 1 replica instead of 2 or 3:

In the long term we might want Pig to retry jobs if they fail for this reason.  But in 
the short term, I would think some users would be willing to trade reliability 
for performance and some would not, so we should let them choose.  
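
If we do make it user selectable, one way to wire it up would be a Pig property 
that is pushed down to the standard HDFS replication setting for the jobs that 
write temporary data.  A minimal sketch, with a made-up property name 
(pig.temp.replication is only for illustration):

{code}
// Illustration only: lower the replication factor for temporary inter-job
// output via the standard dfs.replication setting.
import java.util.Properties;

import org.apache.hadoop.mapred.JobConf;

public class TempReplicationSketch {

    public static void configure(JobConf conf, Properties pigProperties) {
        // Default to 2 replicas for temp data; users who prefer reliability
        // over performance (or vice versa) can override.
        String replicas = pigProperties.getProperty("pig.temp.replication", "2");
        conf.set("dfs.replication", replicas);
    }
}
{code}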

> PERFORMANCE: reduce number of replicas
> --
>
> Key: PIG-856
> URL: https://issues.apache.org/jira/browse/PIG-856
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.3.0
>Reporter: Olga Natkovich
>
> Currently, Pig uses the default number of replicas for data passed between MR 
> jobs, which is 3. Given the temporary nature of the data, we should never need 
> more than 2 and should explicitly set it to improve performance and to be nicer 
> to the name node.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-753) Provide support for UDFs without parameters

2009-06-18 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates resolved PIG-753.


   Resolution: Fixed
Fix Version/s: 0.4.0

I added headers to the two test files.  I reran the unit tests and the bzip 
(and all other) unit tests passed.

Patch checked in.  Thanks Jeff.

> Provide support for UDFs without parameters
> ---
>
> Key: PIG-753
> URL: https://issues.apache.org/jira/browse/PIG-753
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.3.0
>Reporter: Jeff Zhang
> Fix For: 0.4.0
>
> Attachments: Pig_753_Patch.txt
>
>
> Pig does not support UDFs without parameters; it forces me to provide a parameter,
> like the following statement:
>  B = FOREACH A GENERATE bagGenerator();  this will generate an error. I have to 
> provide a parameter like the following:
>  B = FOREACH A GENERATE bagGenerator($0);
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-697) Proposed improvements to pig's optimizer

2009-06-19 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721968#action_12721968
 ] 

Alan Gates commented on PIG-697:


Why is it that some Logical operators (LOCross, LOStream) don't have rewire 
implemented?

Near the end of ProjectFixerUpper.visit(POProject), you have a TODO about the 
walking.  We should figure out whether that is necessary or not, as doing 
visiting by the visit function and by the walker can result in double visiting.

Is there a need to add a clear concept to LogicalTransformer in order to clear 
state between calls to check, since each transformer will potentially be called 
multiple times now?

> Proposed improvements to pig's optimizer
> 
>
> Key: PIG-697
> URL: https://issues.apache.org/jira/browse/PIG-697
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Alan Gates
>Assignee: Santhosh Srinivasan
> Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, 
> OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, 
> OptimizerPhase3_parrt1.patch, OptimizerPhase3_part2_3.patch
>
>
> I propose the following changes to pig optimizer, plan, and operator 
> functionality to support more robust optimization:
> 1) Remove the required array from Rule.  This will change rules so that they 
> only match exact patterns instead of allowing missing elements in the pattern.
> This has the downside that if a given rule applies to two patterns (say 
> Load->Filter->Group, Load->Group) you have to write two rules.  But it has 
> the upside that
> the resulting rules know exactly what they are getting.  The original intent 
> of this was to reduce the number of rules that needed to be written.  But the
> resulting rules have to do a lot of work to understand the operators they are 
> working with.  With exact matches only, each rule will know exactly the 
> operators it
> is working on and can apply the logic of shifting the operators around.  All 
> four of the existing rules set all entries of required to true, so removing 
> this
> will have no effect on them.
> 2) Change PlanOptimizer.optimize to iterate over the rules until there are no 
> conversions or a certain number of iterations has been reached.  Currently the
> function is:
> {code}
> public final void optimize() throws OptimizerException {
> RuleMatcher matcher = new RuleMatcher();
> for (Rule rule : mRules) {
> if (matcher.match(rule)) {
> // It matches the pattern.  Now check if the transformer
> // approves as well.
> List<List<O>> matches = matcher.getAllMatches();
> for (List<O> match : matches)
> {
>   if (rule.transformer.check(match)) {
>   // The transformer approves.
>   rule.transformer.transform(match);
>   }
> }
> }
> }
> }
> {code}
> It would change to be:
> {code}
> public final void optimize() throws OptimizerException {
> RuleMatcher matcher = new RuleMatcher();
> boolean sawMatch;
> int numIterations = 0;
> do {
> sawMatch = false;
> for (Rule rule : mRules) {
> List<List<O>> matches = matcher.getAllMatches();
> for (List<O> match : matches) {
> // It matches the pattern.  Now check if the transformer
> // approves as well.
> if (rule.transformer.check(match)) {
> // The transformer approves.
> sawMatch = true;
> rule.transformer.transform(match);
> }
> }
> }
> // Not sure if 1000 is the right number of iterations, maybe it
> // should be configurable so that large scripts don't stop too 
> // early.
> } while (sawMatch && numIterations++ < 1000);
> }
> {code}
> The reason for limiting the number of iterations is to avoid infinite loops.  
> The reason for iterating over the rules is so that each rule can be applied 
> multiple
> times as necessary.  This allows us to write simple rules, mostly swaps 
> between neighboring operators, without worrying that we get the plan right in 
> one pass.
> For example, we might have a plan that looks like:  
> Load->Join->Filter->Foreach, and we want to optimize it to 
> Load->F

[jira] Commented: (PIG-697) Proposed improvements to pig's optimizer

2009-06-19 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721977#action_12721977
 ] 

Alan Gates commented on PIG-697:


+1, looks good.

> Proposed improvements to pig's optimizer
> 
>
> Key: PIG-697
> URL: https://issues.apache.org/jira/browse/PIG-697
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Alan Gates
>Assignee: Santhosh Srinivasan
> Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, 
> OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, 
> OptimizerPhase3_parrt1.patch, OptimizerPhase3_part2_3.patch
>
>
> I propose the following changes to pig optimizer, plan, and operator 
> functionality to support more robust optimization:
> 1) Remove the required array from Rule.  This will change rules so that they 
> only match exact patterns instead of allowing missing elements in the pattern.
> This has the downside that if a given rule applies to two patterns (say 
> Load->Filter->Group, Load->Group) you have to write two rules.  But it has 
> the upside that
> the resulting rules know exactly what they are getting.  The original intent 
> of this was to reduce the number of rules that needed to be written.  But the
> resulting rules have to do a lot of work to understand the operators they are 
> working with.  With exact matches only, each rule will know exactly the 
> operators it
> is working on and can apply the logic of shifting the operators around.  All 
> four of the existing rules set all entries of required to true, so removing 
> this
> will have no effect on them.
> 2) Change PlanOptimizer.optimize to iterate over the rules until there are no 
> conversions or a certain number of iterations has been reached.  Currently the
> function is:
> {code}
> public final void optimize() throws OptimizerException {
> RuleMatcher matcher = new RuleMatcher();
> for (Rule rule : mRules) {
> if (matcher.match(rule)) {
> // It matches the pattern.  Now check if the transformer
> // approves as well.
> List<List<O>> matches = matcher.getAllMatches();
> for (List<O> match : matches)
> {
>   if (rule.transformer.check(match)) {
>   // The transformer approves.
>   rule.transformer.transform(match);
>   }
> }
> }
> }
> }
> {code}
> It would change to be:
> {code}
> public final void optimize() throws OptimizerException {
> RuleMatcher matcher = new RuleMatcher();
> boolean sawMatch;
> int numIterations = 0;
> do {
> sawMatch = false;
> for (Rule rule : mRules) {
> List<List<O>> matches = matcher.getAllMatches();
> for (List<O> match : matches) {
> // It matches the pattern.  Now check if the transformer
> // approves as well.
> if (rule.transformer.check(match)) {
> // The transformer approves.
> sawMatch = true;
> rule.transformer.transform(match);
> }
> }
> }
> // Not sure if 1000 is the right number of iterations, maybe it
> // should be configurable so that large scripts don't stop too 
> // early.
> } while (sawMatch && numIterations++ < 1000);
> }
> {code}
> The reason for limiting the number of iterations is to avoid infinite loops.  
> The reason for iterating over the rules is so that each rule can be applied 
> multiple
> times as necessary.  This allows us to write simple rules, mostly swaps 
> between neighboring operators, without worrying that we get the plan right in 
> one pass.
> For example, we might have a plan that looks like:  
> Load->Join->Filter->Foreach, and we want to optimize it to 
> Load->Foreach->Filter->Join.  With two simple
> rules (swap filter and join and swap foreach and filter), applied 
> iteratively, we can get from the initial to final plan, without needing to 
> understand the
> big picture of the entire plan.
> 3) Add three calls to OperatorPlan:
> {code}
> /**
>  * Swap two operators in a plan.  Both of the operators must have single
>  * inputs and single outputs.
>  * @param first operator
>  * @param second operator
>

[jira] Commented: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

2009-06-23 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723204#action_12723204
 ] 

Alan Gates commented on PIG-820:


+1

> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume 
> another loader
> -
>
> Key: PIG-820
> URL: https://issues.apache.org/jira/browse/PIG-820
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.3.0, 0.4.0
>Reporter: Alan Gates
>Assignee: Ashutosh Chauhan
> Fix For: 0.4.0
>
> Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch
>
>
> Currently a sampling job requires that data already be stored in 
> BinaryStorage format, since RandomSampleLoader extends BinaryStorage.  For 
> order by this
> has mostly been acceptable, because users tend to use order by at the end of 
> their script where other MR jobs have already operated on the data and thus it
> is already being stored in BinaryStorage.  For pig scripts that just did an 
> order by, an entire MR job is required to read the data and write it out
> in BinaryStorage format.
> As we begin work on join algorithms that will require sampling, this 
> requirement to read the entire input and write it back out will not be 
> acceptable.
> Join is often the first operation of a script, and thus is much more likely 
> to trigger this useless up front translation job.
> Instead RandomSampleLoader can be changed to subsume an existing loader, 
> using the user specified loader to read the tuples while handling the skipping
> between tuples itself.  This will require the subsumed loader to implement a 
> Samplable Interface, that will look something like:
> {code}
> public interface SamplableLoader extends LoadFunc {
> 
> /**
>  * Skip ahead in the input stream.
>  * @param n number of bytes to skip
>  * @return number of bytes actually skipped.  The return semantics are
> * exactly the same as {@link java.io.InputStream#skip(long)}
>  */
> public long skip(long n) throws IOException;
> 
> /**
>  * Get the current position in the stream.
>  * @return position in the stream.
>  */
> public long getPosition() throws IOException;
> }
> {code}
> The MRCompiler would then check if the loader being used to load data 
> implemented the SamplableLoader interface.  If so, rather than create an 
> initial MR
> job to do the translation it would create the sampling job, having 
> RandomSampleLoader use the user specified loader.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-734) Non-string keys in maps

2009-06-23 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-734:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Patch checked in.

> Non-string keys in maps
> ---
>
> Key: PIG-734
> URL: https://issues.apache.org/jira/browse/PIG-734
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.2.0
>    Reporter: Alan Gates
>    Assignee: Alan Gates
>Priority: Minor
> Fix For: 0.4.0
>
> Attachments: PIG-734.patch, PIG-734_2.patch, PIG-734_3.patch
>
>
> With the addition of types to pig, maps were changed to allow any atomic type 
> to be a key.  However, in practice we do not see people using keys other than 
> strings.  And allowing multiple types is causing us issues in serializing 
> data (we have to check what every key type is) and in the design for non-java 
> UDFs (since many scripting languages include associative arrays such as 
> Perl's hash).
> So I propose we scope back maps to only have string keys.  This would be a 
> non-compatible change.  But I am not aware of anyone using non-string keys, 
> so hopefully it would have little or no impact.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-794) Use Avro serialization in Pig

2009-06-24 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723812#action_12723812
 ] 

Alan Gates commented on PIG-794:


PIG-734 has been committed.  This will allow this patch to simplify its 
handling of maps to match avro maps, since Pig maps now only allow strings as 
keys.

> Use Avro serialization in Pig
> -
>
> Key: PIG-794
> URL: https://issues.apache.org/jira/browse/PIG-794
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.2.0
>Reporter: Rakesh Setty
> Fix For: 0.2.0
>
> Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, 
> jackson-asl-0.9.4.jar, PIG-794.patch
>
>
> We would like to use Avro serialization in Pig to pass data between MR jobs 
> instead of the current BinStorage. Attached is an implementation of 
> AvroBinStorage which performs significantly better compared to BinStorage on 
> our benchmarks.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-793) Improving memory efficiency of Tuple implementation

2009-06-26 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-793:
--

Assignee: Alan Gates

> Improving memory efficiency of Tuple implementation
> ---
>
> Key: PIG-793
> URL: https://issues.apache.org/jira/browse/PIG-793
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>    Assignee: Alan Gates
>
> Currently, our tuple is a real pig and uses a lot of extra memory. 
> There are several places where we can improve memory efficiency:
> (1) Laying out memory for the fields rather than using java objects, since 
> each object for a numeric field takes 16 bytes
> (2) For the cases where we know the schema using Java arrays rather than 
> ArrayList.
> There might be more.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-793) Improving memory efficiency of Tuple implementation

2009-06-26 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724594#action_12724594
 ] 

Alan Gates commented on PIG-793:


Using jmap, I've been toying around with our DefaultTuple implementation to see 
how much memory it takes.  For a tuple with 3 elements, one int, one double, 
one 20 character string I see it taking:

16 bytes for the Tuple object
24 bytes for the ArrayList in the tuple
~26 bytes for pointers in the ArrayList
16 bytes for the Integer
16 bytes for the Double
24 bytes for the String overhead
~52 bytes for the String data

Pointers in the ArrayList and character data in the String appear to be padded 
and vary somewhat depending on how I run the experiments.

I played with changing the ArrayList in DefaultTuple to an Object[].  
There are two advantages: the 24 bytes of ArrayList shrink to 12 for the 
Object[], and since I wrote it to always have the Object[] be exactly the right 
size, there is no padding cost.  The downside to this is that append becomes a more 
expensive operation because it's growing the Object[] by one every time.  
However, after some investigation I believe that most places we use append can 
be changed to use set, thus alleviating this issue.  I'm working on a patch to 
change this.  Once I have that done I'll report on how that changes memory 
usage as well as any performance gains or losses.
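
A rough sketch of the Object[]-backed idea (this is not the real DefaultTuple, 
just an illustration of the trade-off between an exactly-sized array and the 
cost of append):

{code}
// Illustration only: exact-size Object[] backing with O(1) get/set and an
// append that has to grow the array by one slot on every call.
import java.util.Arrays;

public class ArrayBackedTupleSketch {

    private Object[] fields;

    public ArrayBackedTupleSketch(int size) {
        fields = new Object[size];  // no ArrayList header, no growth slack
    }

    public Object get(int i) { return fields[i]; }

    public void set(int i, Object value) { fields[i] = value; }

    public void append(Object value) {
        // The copy on every call is why append becomes more expensive than
        // the ArrayList version.
        fields = Arrays.copyOf(fields, fields.length + 1);
        fields[fields.length - 1] = value;
    }

    public int size() { return fields.length; }
}
{code}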

A related item I would like to look into is using Hadoop's Text instead of 
String to back chararray.  Text takes 16 bytes of overhead + 36 bytes for 
string data to store 20 characters, versus the 24 / 52 of String.  Obviously 
this would be a huge change and needs to have very impressive results to be 
considered.  I'll play with it and report results here.


> Improving memory efficiency of Tuple implementation
> ---
>
> Key: PIG-793
> URL: https://issues.apache.org/jira/browse/PIG-793
> Project: Pig
>  Issue Type: Improvement
>    Reporter: Olga Natkovich
>Assignee: Alan Gates
>
> Currently, our tuple is a real pig and uses a lot of extra memory. 
> There are several places where we can improve memory efficiency:
> (1) Laying out memory for the fields rather than using java objects, since 
> each object for a numeric field takes 16 bytes
> (2) For the cases where we know the schema using Java arrays rather than 
> ArrayList.
> There might be more.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-697) Proposed improvements to pig's optimizer

2009-06-26 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724731#action_12724731
 ] 

Alan Gates commented on PIG-697:


+1 on the phase 4 part 1 patch.

> Proposed improvements to pig's optimizer
> 
>
> Key: PIG-697
> URL: https://issues.apache.org/jira/browse/PIG-697
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Alan Gates
>Assignee: Santhosh Srinivasan
> Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, 
> OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, 
> OptimizerPhase3_parrt1.patch, OptimizerPhase3_part2_3.patch, 
> OptimizerPhase4_part1-1.patch
>
>
> I propose the following changes to pig optimizer, plan, and operator 
> functionality to support more robust optimization:
> 1) Remove the required array from Rule.  This will change rules so that they 
> only match exact patterns instead of allowing missing elements in the pattern.
> This has the downside that if a given rule applies to two patterns (say 
> Load->Filter->Group, Load->Group) you have to write two rules.  But it has 
> the upside that
> the resulting rules know exactly what they are getting.  The original intent 
> of this was to reduce the number of rules that needed to be written.  But the
> resulting rules have to do a lot of work to understand the operators they are 
> working with.  With exact matches only, each rule will know exactly the 
> operators it
> is working on and can apply the logic of shifting the operators around.  All 
> four of the existing rules set all entries of required to true, so removing 
> this
> will have no effect on them.
> 2) Change PlanOptimizer.optimize to iterate over the rules until there are no 
> conversions or a certain number of iterations has been reached.  Currently the
> function is:
> {code}
> public final void optimize() throws OptimizerException {
> RuleMatcher matcher = new RuleMatcher();
> for (Rule rule : mRules) {
> if (matcher.match(rule)) {
> // It matches the pattern.  Now check if the transformer
> // approves as well.
> List<List<O>> matches = matcher.getAllMatches();
> for (List<O> match : matches)
> {
>   if (rule.transformer.check(match)) {
>   // The transformer approves.
>   rule.transformer.transform(match);
>   }
> }
> }
> }
> }
> {code}
> It would change to be:
> {code}
> public final void optimize() throws OptimizerException {
> RuleMatcher matcher = new RuleMatcher();
> boolean sawMatch;
> int numIterations = 0;
> do {
> sawMatch = false;
> for (Rule rule : mRules) {
> List<List<O>> matches = matcher.getAllMatches();
> for (List<O> match : matches) {
> // It matches the pattern.  Now check if the transformer
> // approves as well.
> if (rule.transformer.check(match)) {
> // The transformer approves.
> sawMatch = true;
> rule.transformer.transform(match);
> }
> }
> }
> // Not sure if 1000 is the right number of iterations, maybe it
> // should be configurable so that large scripts don't stop too 
> // early.
> } while (sawMatch && numIterations++ < 1000);
> }
> {code}
> The reason for limiting the number of iterations is to avoid infinite loops.  
> The reason for iterating over the rules is so that each rule can be applied 
> multiple
> times as necessary.  This allows us to write simple rules, mostly swaps 
> between neighboring operators, without worrying that we get the plan right in 
> one pass.
> For example, we might have a plan that looks like:  
> Load->Join->Filter->Foreach, and we want to optimize it to 
> Load->Foreach->Filter->Join.  With two simple
> rules (swap filter and join and swap foreach and filter), applied 
> iteratively, we can get from the initial to final plan, without needing to 
> understand the
> big picture of the entire plan.
> 3) Add three calls to OperatorPlan:
> {code}
> /**
>  * Swap two operators in a plan.  Both of the operators must have single
>  * inputs and single outputs.
>  * @pa

[jira] Commented: (PIG-793) Improving memory efficiency of Tuple implementation

2009-06-27 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724880#action_12724880
 ] 

Alan Gates commented on PIG-793:


The cost for storing data raw is:

16 bytes for the tuple object
12 bytes for the byte array object
12 bytes + 2 bytes/field for a short[] to hold offsets into the byte[]
Then as you say above for the data itself, plus 1 byte per field to store type 
and nullness.

So our example tuple would take ~85 bytes.

But in general, yes you can do much better with raw bytes.  We played with this 
some and we found that the cost of Tuple.get/set goes up 10x because of the 
need to turn the bytes into objects.  In a typical query this added about 2x to 
the overall run time.  The solution to this would be to rewrite all the Pig 
operators to work on byte data instead of objects.  This is a large project, 
and doesn't solve the UDFs.  We could pay the performance penalty for UDFs, or 
we could change the UDFs to take byte data.  Currently many of our users are 
asking for the ability to write UDFs in Python or other scripting languages.  
If we instead go the other way and basically make them write C-style Java, I 
don't think that will be popular.

What we're playing with now (changing ArrayList to Object[] and String 
to Text) will reap somewhere around 50% of the benefits in terms of memory 
savings compared to going to fully raw data.  But it's around 10% of the work.  I'm not 
excluding moving to storing everything in a byte[] in the future.  But I want 
to see if for a little work now we can get a decent amount of improvement.
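
Spelling that estimate out for the same 3-field example (int, double, 20 
character chararray), assuming the chararray data is stored as single-byte 
characters and ignoring alignment padding:

{code}
16  tuple object
12  byte[] object
18  short[] offsets (12 + 2 bytes x 3 fields)
 3  type/null byte x 3 fields
 4  int data
 8  double data
20  chararray data
--
81  bytes, roughly the ~85 above once padding is counted
{code}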

> Improving memory efficiency of Tuple implementation
> ---
>
> Key: PIG-793
> URL: https://issues.apache.org/jira/browse/PIG-793
> Project: Pig
>  Issue Type: Improvement
>    Reporter: Olga Natkovich
>Assignee: Alan Gates
>
> Currently, our tuple is a real pig and uses a lot of extra memory. 
> There are several places where we can improve memory efficiency:
> (1) Laying out memory for the fields rather than using java objects, since 
> each object for a numeric field takes 16 bytes
> (2) For the cases where we know the schema using Java arrays rather than 
> ArrayList.
> There might be more.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

2009-06-30 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-820:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

v6 of the patch checked in.  Thanks Ashutosh.

> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume 
> another loader
> -
>
> Key: PIG-820
> URL: https://issues.apache.org/jira/browse/PIG-820
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.3.0, 0.4.0
>Reporter: Alan Gates
>Assignee: Ashutosh Chauhan
> Fix For: 0.4.0
>
> Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, 
> pig-820_v4.patch, pig-820_v5.patch, pig-820_v6.patch
>
>
> Currently a sampling job requires that data already be stored in 
> BinaryStorage format, since RandomSampleLoader extends BinaryStorage.  For 
> order by this
> has mostly been acceptable, because users tend to use order by at the end of 
> their script where other MR jobs have already operated on the data and thus it
> is already being stored in BinaryStorage.  For pig scripts that just did an 
> order by, an entire MR job is required to read the data and write it out
> in BinaryStorage format.
> As we begin work on join algorithms that will require sampling, this 
> requirement to read the entire input and write it back out will not be 
> acceptable.
> Join is often the first operation of a script, and thus is much more likely 
> to trigger this useless up front translation job.
> Instead RandomSampleLoader can be changed to subsume an existing loader, 
> using the user specified loader to read the tuples while handling the skipping
> between tuples itself.  This will require the subsumed loader to implement a 
> Samplable Interface, that will look something like:
> {code}
> public interface SamplableLoader extends LoadFunc {
> 
> /**
>  * Skip ahead in the input stream.
>  * @param n number of bytes to skip
>  * @return number of bytes actually skipped.  The return semantics are
> * exactly the same as {@link java.io.InputStream#skip(long)}
>  */
> public long skip(long n) throws IOException;
> 
> /**
>  * Get the current position in the stream.
>  * @return position in the stream.
>  */
> public long getPosition() throws IOException;
> }
> {code}
> The MRCompiler would then check if the loader being used to load data 
> implemented the SamplableLoader interface.  If so, rather than create an 
> initial MR
> job to do the translation it would create the sampling job, having 
> RandomSampleLoader use the user specified loader.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-788) Proposal to remove float from Pig data types

2009-06-30 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates resolved PIG-788.


Resolution: Won't Fix

Avro has decided to keep float as a type.

> Proposal to remove float from Pig data types
> 
>
> Key: PIG-788
> URL: https://issues.apache.org/jira/browse/PIG-788
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.2.0
>Reporter: Alan Gates
>Assignee: Alan Gates
>
> Pig would like to use the new Hadoop Avro serialization package to pass data 
> between MR jobs, and eventually between Pig and UDFs that are not written in 
> Java.  Avro will not be supporting the float data type, but only double (see 
> AVRO-17).  Pig currently supports both float and double.  Double is the 
> default floating point type (so if the user says x + 1.0, 1.0 is taken to be 
> a double, not a float).  Float was initially included in the list of Pig 
> types because Hadoop supported it as one of the Writable types, and we were 
> trying to make sure all of Hadoop's writable types could be represented in 
> Pig.  
> In practice we do not see anyone using the float type.   In order to be able 
> to easily use Avro I propose dropping the float type.  
> Please speak up if you are using the float type and you have a compelling 
> reason not to use double.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-773) Empty complex constants (empty bag, empty tuple and empty map) should be supported

2009-07-01 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726124#action_12726124
 ] 

Alan Gates commented on PIG-773:


In response to point 2 of Santhosh's previous comment, I agree it is fine for 
now.  We need to answer this question definitively, but until we do I think 
this is a reasonable answer.

> Empty complex constants (empty bag, empty tuple and empty map) should be 
> supported
> --
>
> Key: PIG-773
> URL: https://issues.apache.org/jira/browse/PIG-773
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.0
>Reporter: Pradeep Kamath
>Assignee: Ashutosh Chauhan
>Priority: Minor
> Fix For: 0.4.0
>
> Attachments: pig-773.patch, pig-773_v2.patch, pig-773_v3.patch
>
>
> We should be able to create empty bag constant using {}, empty tuple constant 
> using (), empty map constant using [] within a pig script

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-697) Proposed improvements to pig's optimizer

2009-07-02 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726528#action_12726528
 ] 

Alan Gates commented on PIG-697:


A couple of questions and a comment on patch4-part2

I don't understand what the following code does:
{code}
> List foreachAddedFields = foreachProjectionMap.getAddedFields();
> if (foreachAddedFields != null) {
>     Set foreachAddedFieldsSet = new HashSet(foreachAddedFields);
>     flattenedColumnSet.removeAll(foreachAddedFieldsSet);
> }
{code}

Why are you removing added fields from the flattened set?  Won't all flattened 
fields appear as added in the projection map?

I think it would be very helpful to insert some comments on why this rule only 
applies if the successor is an Order, Cross, or Join.  

Why was the code dealing with flattening a bag with an unknown schema removed 
from LOForeach?


> Proposed improvements to pig's optimizer
> 
>
> Key: PIG-697
> URL: https://issues.apache.org/jira/browse/PIG-697
> Project: Pig
>  Issue Type: Bug
>      Components: impl
>Reporter: Alan Gates
>Assignee: Santhosh Srinivasan
> Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, 
> OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, 
> OptimizerPhase3_parrt1.patch, OptimizerPhase3_part2_3.patch, 
> OptimizerPhase4_part1-1.patch, OptimizerPhase4_part2.patch
>
>
> I propose the following changes to pig optimizer, plan, and operator 
> functionality to support more robust optimization:
> 1) Remove the required array from Rule.  This will change rules so that they 
> only match exact patterns instead of allowing missing elements in the pattern.
> This has the downside that if a given rule applies to two patterns (say 
> Load->Filter->Group, Load->Group) you have to write two rules.  But it has 
> the upside that
> the resulting rules know exactly what they are getting.  The original intent 
> of this was to reduce the number of rules that needed to be written.  But the
> resulting rules have to do a lot of work to understand the operators they are 
> working with.  With exact matches only, each rule will know exactly the 
> operators it
> is working on and can apply the logic of shifting the operators around.  All 
> four of the existing rules set all entries of required to true, so removing 
> this
> will have no effect on them.
> 2) Change PlanOptimizer.optimize to iterate over the rules until there are no 
> conversions or a certain number of iterations has been reached.  Currently the
> function is:
> {code}
> public final void optimize() throws OptimizerException {
>     RuleMatcher matcher = new RuleMatcher();
>     for (Rule rule : mRules) {
>         if (matcher.match(rule)) {
>             // It matches the pattern.  Now check if the transformer
>             // approves as well.
>             List<List<O>> matches = matcher.getAllMatches();
>             for (List<O> match : matches) {
>                 if (rule.transformer.check(match)) {
>                     // The transformer approves.
>                     rule.transformer.transform(match);
>                 }
>             }
>         }
>     }
> }
> {code}
> It would change to be:
> {code}
> public final void optimize() throws OptimizerException {
>     RuleMatcher matcher = new RuleMatcher();
>     boolean sawMatch;
>     int numIterations = 0;
>     do {
>         sawMatch = false;
>         for (Rule rule : mRules) {
>             List<List<O>> matches = matcher.getAllMatches();
>             for (List<O> match : matches) {
>                 // It matches the pattern.  Now check if the transformer
>                 // approves as well.
>                 if (rule.transformer.check(match)) {
>                     // The transformer approves.
>                     sawMatch = true;
>                     rule.transformer.transform(match);
>                 }
>             }
>         }
>         // Not sure if 1000 is the right number of iterations, maybe it
>         // should be configurable so that large scripts don't stop too
>         // early.
>     } while (sawMatch && numIterations++ < 1000);
> }
> {code}
> The reason for limiting the number of iterations is to avoid infinite loops.  
> The reason for iterating over the rules is so that each rule can be applied 
> multiple times.

[jira] Created: (PIG-878) Pig is returning too many blocks in the InputSplit

2009-07-10 Thread Alan Gates (JIRA)
Pig is returning too many blocks in the InputSplit
--

 Key: PIG-878
 URL: https://issues.apache.org/jira/browse/PIG-878
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Alan Gates
Assignee: Alan Gates
Priority: Critical


When SlicerWrapper builds a slice, it currently returns the 3 locations for 
every block in the file it is slicing, instead of the 3 locations for the block 
covered by that slice.  This means Pig's odds of having its maps placed on 
nodes local to the data go way down.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-794) Use Avro serialization in Pig

2009-07-10 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12729700#action_12729700
 ] 

Alan Gates commented on PIG-794:


I agree with Doug's comments that it's better to use an API to build the schema 
that will give us compile time checking.  I think it will also (hopefully) be 
easier to figure out the schema when reading the code, as it will avoid the 
need to read JSON directly.

I have a general question on the approach.  This is a direct port of Pig's 
BinStorage to use Avro, including the writing of indicator bytes for types.  I 
do not have a deep knowledge of Avro.  But I had assumed that since it was a 
de/serialization framework with types, part of what it would provide was type 
recognition.  That is, can't this code rely on Avro to set the type for it?  Do 
we need to be writing those indicator bytes ourselves?  Perhaps this is the 
same comment that Doug is making about using GenericDatumReader and addField.

In response to Hong's comment, the sync marks are vulnerable as you point out.  
But the loader needs some way to find a proper starting place when it's handed 
any block but the initial block of a file.  I wonder if we could create a new 
sync type.  It would always consist of a 100 byte marker (say the first 25 
prime numbers, or the first 25 digits of pi or something).  We could then write 
a tuple with that sync type every 1000 records in the data.  Loaders that don't 
start at position 0 could then seek to the first sync type it found before it 
began reading.  All loaders would read past the end of their position until 
they saw a sync type.

As for this being compatible with non-pig apps, that isn't the purpose of 
this AvroStorage function.  This is for pig to pass data between MR jobs for 
itself.  Having a tool independent storage format is a bigger project, as it 
requires agreeing on things like sync marks, how to represent different Avro 
objects, etc.
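
To make the sync-marker idea concrete, here is a rough sketch under the stated assumptions (a fixed 100-byte marker, one marker every 1000 records); the class and constants are invented for illustration and are not Pig code:

{code}
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

public class SyncMarkerSketch {
    // An arbitrary but fixed 100-byte pattern, per the suggestion above.
    static final byte[] SYNC = new byte[100];
    static { for (int i = 0; i < SYNC.length; i++) SYNC[i] = (byte) (i * 7 + 1); }
    static final int RECORDS_PER_SYNC = 1000;

    // Writer side: emit the marker before every 1000th record.
    static void writeRecord(DataOutputStream out, byte[] record, long recordNum)
            throws IOException {
        if (recordNum % RECORDS_PER_SYNC == 0) out.write(SYNC);
        out.writeInt(record.length);
        out.write(record);
    }

    // Reader side: a loader bound to a non-zero offset scans forward until it
    // has seen the full marker, then begins reading records from that point.
    static void seekToNextSync(DataInputStream in) throws IOException {
        byte[] window = new byte[SYNC.length];
        in.readFully(window);
        while (!Arrays.equals(window, SYNC)) {
            // slide the comparison window forward one byte at a time
            System.arraycopy(window, 1, window, 0, window.length - 1);
            window[window.length - 1] = in.readByte();
        }
    }
}
{code}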

> Use Avro serialization in Pig
> -
>
> Key: PIG-794
> URL: https://issues.apache.org/jira/browse/PIG-794
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.2.0
>Reporter: Rakesh Setty
> Fix For: 0.2.0
>
> Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, 
> jackson-asl-0.9.4.jar, PIG-794.patch
>
>
> We would like to use Avro serialization in Pig to pass data between MR jobs 
> instead of the current BinStorage. Attached is an implementation of 
> AvroBinStorage which performs significantly better compared to BinStorage on 
> our benchmarks.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-877) Push up filter does not account for added columns in foreach

2009-07-13 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730521#action_12730521
 ] 

Alan Gates commented on PIG-877:


Is the NPE at run time or optimization time?

> Push up filter does not account for added columns in foreach
> 
>
> Key: PIG-877
> URL: https://issues.apache.org/jira/browse/PIG-877
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.1
>Reporter: Santhosh Srinivasan
>Assignee: Santhosh Srinivasan
> Fix For: 0.3.1
>
> Attachments: PIG-877.patch
>
>
> If a filter follows a foreach that produces an added column then push up 
> filter fails with a null pointer exception.
> {code}
> ...
> x = foreach w generate $0, COUNT($1);
> y = filter x by $1 > 10;
> {code}
> In the above example, the column in the filter's expression is an added 
> column. As a result, the optimizer rule is not able to map it back to the 
> input resulting in a null value. The subsequent for loop is failing due to 
> NPE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-877) Push up filter does not account for added columns in foreach

2009-07-13 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730529#action_12730529
 ] 

Alan Gates commented on PIG-877:


+1, patch looks good.

> Push up filter does not account for added columns in foreach
> 
>
> Key: PIG-877
> URL: https://issues.apache.org/jira/browse/PIG-877
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.1
>Reporter: Santhosh Srinivasan
>Assignee: Santhosh Srinivasan
> Fix For: 0.3.1
>
> Attachments: PIG-877.patch
>
>
> If a filter follows a foreach that produces an added column then push up 
> filter fails with a null pointer exception.
> {code}
> ...
> x = foreach w generate $0, COUNT($1);
> y = filter x by $1 > 10;
> {code}
> In the above example, the column in the filter's expression is an added 
> column. As a result, the optimizer rule is not able to map it back to the 
> input resulting in a null value. The subsequent for loop is failing due to 
> NPE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-889) Pig can not access reporter of PigHadoopLog in Load Func

2009-07-16 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-889:
---

Status: Open  (was: Patch Available)

> Pig can not access reporter of PigHadoopLog in Load Func
> 
>
> Key: PIG-889
> URL: https://issues.apache.org/jira/browse/PIG-889
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.4.0
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Fix For: 0.4.0
>
> Attachments: Pig_889_Patch.txt
>
>
> I'd like to increment Counter in my own LoadFunc, but it will throw 
> NullPointerException. It seems that the reporter is not initialized.  
> I looked into this problem and found that it needs to call 
> PigHadoopLogger.getInstance().setReporter(reporter) in PigInputFormat.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-889) Pig can not access reporter of PigHadoopLog in Load Func

2009-07-16 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732136#action_12732136
 ] 

Alan Gates commented on PIG-889:


Jeff,

Could you include a unit test that has a load func that fetches the logger?  
That way we'll be sure your fix fixes what you want.

> Pig can not access reporter of PigHadoopLog in Load Func
> 
>
> Key: PIG-889
> URL: https://issues.apache.org/jira/browse/PIG-889
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.4.0
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Fix For: 0.4.0
>
> Attachments: Pig_889_Patch.txt
>
>
> I'd like to increment Counter in my own LoadFunc, but it will throw 
> NullPointerException. It seems that the reporter is not initialized.  
> I looked into this problem and found that it needs to call 
> PigHadoopLogger.getInstance().setReporter(reporter) in PigInputFormat.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-240) Support launching concurrent Pig jobs from one VM

2009-07-16 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732148#action_12732148
 ] 

Alan Gates commented on PIG-240:


Jeff,

It looks to me like we don't want to share LogicalPlanCloneHelper across 
threads.  Instead, there should be one instance per job.  Otherwise separate 
jobs may mingle their logical plan maps.  If we instead had a shared container 
that tracked LogicalPlanCloneHelper by thread id, and then changed LogicalPlan 
and LogicalPlanCloneHelper to fetch the right map, this should work.
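
A rough sketch of that kind of shared, per-thread container (illustrative only; CloneHelperState stands in for whatever per-clone map LogicalPlanCloneHelper keeps today):

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CloneHelperRegistry {
    // One entry per thread, so concurrent jobs no longer mingle their maps.
    private static final Map<Long, CloneHelperState> STATE =
            new ConcurrentHashMap<Long, CloneHelperState>();

    public static CloneHelperState get() {
        long tid = Thread.currentThread().getId();
        CloneHelperState s = STATE.get(tid);
        if (s == null) {
            // Only this thread ever writes its own key, so get-then-put is safe.
            s = new CloneHelperState();
            STATE.put(tid, s);
        }
        return s;
    }

    // Called when a job finishes so entries do not accumulate.
    public static void clear() {
        STATE.remove(Thread.currentThread().getId());
    }

    public static class CloneHelperState { /* per-job clone bookkeeping */ }
}
{code}

A java.lang.ThreadLocal would give the same isolation with less bookkeeping; the map version is shown only because the comment above describes a container keyed by thread id.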

> Support launching concurrent Pig jobs from one VM
> -
>
> Key: PIG-240
> URL: https://issues.apache.org/jira/browse/PIG-240
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Tom White
>Assignee: Jeff Zhang
> Attachments: patch_240.txt, pig-240.patch
>
>
> For some applications it would be convenient to launch concurrent Pig jobs 
> from a single VM. This is currently not possible since Pig has static mutable 
> state.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-792) PERFORMANCE: Support skewed join in pig

2009-07-16 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732207#action_12732207
 ] 

Alan Gates commented on PIG-792:


Your code has tabs in it.  It should instead have 4 spaces.


In MRCompiler.visitSkewedJoin, I don't understand the following:

{code}
int rp = op.getRequestedParallelism();

Pair sampleJobPair = getSkewedJoinSampleJob(op, mro, fSpec, partitionFile, rp);
rp = sampleJobPair.second;

// set parallelism of SkewedJoin same as the sampling job
op.setRequestedParallelism(rp);
{code}
Why is the job parallelism being reset based on the sample?  

The join sampling puts out data in a certain format.  That
format should be documented clearly in the comments somewhere.  It is referred
to in the class comments for SkewedPartitioner, but not completely specified.

Rather than creating a separate POSkewedJoinFileSetter to correct the changes
made by the SampleOptimizer, the SampleOptimizer should be changed to
correctly handle file names in the case of skewed join.

Why does MapReduceOper need to know about skewedJoinPartitionFile?

In POPartitionRearrange.constructPROutput, what does
{code}
opTuple.set(1, Byte.valueOf(""+reducerIdx));
{code}
do?  It looks like you're forcing reducerIdx to String and then to byte.
That's rather inefficient.  We can't go straight from int to byte?  And why is
reducerIdx an Integer instead of an int?
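
As a side note on the int-to-byte question, a sketch of the two forms (assuming reducerIdx is an Integer whose value fits in a byte, so both produce the same result):

{code}
// Current form: boxes reducerIdx, formats it as a String, then re-parses it.
opTuple.set(1, Byte.valueOf("" + reducerIdx));

// Direct form: a plain narrowing cast, no String round trip.
opTuple.set(1, Byte.valueOf((byte) reducerIdx.intValue()));
{code}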

There are still some System.out/err.println statements in the code.  These
should be removed or converted to log.debug statements.

Why do we need a new NullablePartitionWritable class?  Couldn't Tuple be used
for this?


 



> PERFORMANCE: Support skewed join in pig
> ---
>
> Key: PIG-792
> URL: https://issues.apache.org/jira/browse/PIG-792
> Project: Pig
>  Issue Type: Improvement
>Reporter: Sriranjan Manjunath
> Attachments: skewedjoin.patch
>
>
> Fragmented replicated join has a few limitations:
>  - One of the tables needs to be loaded into memory
>  - Join is limited to two tables
> Skewed join partitions the table and joins the records in the reduce phase. 
> It computes a histogram of the key space to account for skewing in the input 
> records. Further, it adjusts the number of reducers depending on the key 
> distribution.
> We need to implement the skewed join in pig.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-728) All backend error messages must be logged to preserve the original error messages

2009-07-16 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732253#action_12732253
 ] 

Alan Gates commented on PIG-728:


+1

> All backend error messages must be logged to preserve the original error 
> messages
> -
>
> Key: PIG-728
> URL: https://issues.apache.org/jira/browse/PIG-728
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.0
>Reporter: Santhosh Srinivasan
>Assignee: Santhosh Srinivasan
>Priority: Minor
> Fix For: 0.4.0
>
> Attachments: PIG-728_1.patch
>
>
> The current error handling framework logs backend error messages only when 
> Pig is not able to parse the error message. Instead, Pig should log the 
> backend error message irrespective of Pig's ability to parse backend error 
> messages. On a side note, the use of instantiateFuncFromSpec in Launcher.java 
> is not consistent and should avoid the use of class_name + "(" + 
> string_constructor_args + ")".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-878) Pig is returning too many blocks in the InputSplit

2009-07-17 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-878:
---

Attachment: PIG-878.patch

Patch written collaboratively with Arun Murthy

> Pig is returning too many blocks in the InputSplit
> --
>
> Key: PIG-878
> URL: https://issues.apache.org/jira/browse/PIG-878
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.0
>    Reporter: Alan Gates
>    Assignee: Alan Gates
>Priority: Critical
> Attachments: PIG-878.patch
>
>
> When SlicerWrapper builds a slice, it currently returns the 3 locations for 
> every block in the file it is slicing, instead of the 3 locations for the 
> block covered by that slice.  This means Pig's odds of having its maps placed 
> on nodes local to the data go way down.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-878) Pig is returning too many blocks in the InputSplit

2009-07-17 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-878:
---

Status: Patch Available  (was: Open)

> Pig is returning too many blocks in the InputSplit
> --
>
> Key: PIG-878
> URL: https://issues.apache.org/jira/browse/PIG-878
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.0
>    Reporter: Alan Gates
>    Assignee: Alan Gates
>Priority: Critical
> Attachments: PIG-878.patch
>
>
> When SlicerWrapper builds a slice, it currently returns the 3 locations for 
> every block in the file it is slicing, instead of the 3 locations for the 
> block covered by that slice.  This means Pig's odds of having its maps placed 
> on nodes local to the data go way down.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-878) Pig is returning too many blocks in the InputSplit

2009-07-20 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733293#action_12733293
 ] 

Alan Gates commented on PIG-878:


Should note also that I didn't add any tests because this was a fix for 
existing functionality, and frankly I'm not exactly sure how to test it.

Also should have noted thanks to Milind for first bringing this to our attention.

> Pig is returning too many blocks in the InputSplit
> --
>
> Key: PIG-878
> URL: https://issues.apache.org/jira/browse/PIG-878
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.0
>Reporter: Alan Gates
>Assignee: Alan Gates
>Priority: Critical
> Fix For: 0.4.0
>
> Attachments: PIG-878.patch
>
>
> When SlicerWrapper builds a slice, it currently returns the 3 locations for 
> every block in the file it is slicing, instead of the 3 locations for the 
> block covered by that slice.  This means Pig's odds of having its maps placed 
> on nodes local to the data go way down.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-878) Pig is returning too many blocks in the InputSplit

2009-07-20 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-878:
---

   Resolution: Fixed
Fix Version/s: 0.4.0
   Status: Resolved  (was: Patch Available)

Checked in patch.  Thanks Arun.

> Pig is returning too many blocks in the InputSplit
> --
>
> Key: PIG-878
> URL: https://issues.apache.org/jira/browse/PIG-878
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.0
>    Reporter: Alan Gates
>    Assignee: Alan Gates
>Priority: Critical
> Fix For: 0.4.0
>
> Attachments: PIG-878.patch
>
>
> When SlicerWrapper builds a slice, it currently returns the 3 locations for 
> every block in the file it is slicing, instead of the 3 locations for the 
> block covered by that slice.  This means Pig's odds of having its maps placed 
> on nodes local to the data go way down.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-893) support cast of chararray to other simple types

2009-07-27 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12735665#action_12735665
 ] 

Alan Gates commented on PIG-893:


Semantics of chararray to numeric conversion seem straightforward.  Adopting 
standards similar to Java's Integer.valueOf() etc. would make sense.

I do not think these should be UDFs, they should be part of the language 
definition.

> support cast of chararray to other simple types
> ---
>
> Key: PIG-893
> URL: https://issues.apache.org/jira/browse/PIG-893
> Project: Pig
>  Issue Type: New Feature
>Reporter: Thejas M Nair
>
> Pig should support casting of chararray to 
> integer,long,float,double,bytearray. If the conversion fails for reasons such 
> as overflow, cast should return null and log a warning.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-792) PERFORMANCE: Support skewed join in pig

2009-07-27 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12735674#action_12735674
 ] 

Alan Gates commented on PIG-792:


I'll take a look at the revised patch today.

> PERFORMANCE: Support skewed join in pig
> ---
>
> Key: PIG-792
> URL: https://issues.apache.org/jira/browse/PIG-792
> Project: Pig
>  Issue Type: Improvement
>Reporter: Sriranjan Manjunath
> Attachments: skewedjoin.patch
>
>
> Fragmented replicated join has a few limitations:
>  - One of the tables needs to be loaded into memory
>  - Join is limited to two tables
> Skewed join partitions the table and joins the records in the reduce phase. 
> It computes a histogram of the key space to account for skewing in the input 
> records. Further, it adjusts the number of reducers depending on the key 
> distribution.
> We need to implement the skewed join in pig.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-878) Pig is returning too many blocks in the InputSplit

2009-07-31 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12737655#action_12737655
 ] 

Alan Gates commented on PIG-878:


Port fix to 0.3 branch as well.

> Pig is returning too many blocks in the InputSplit
> --
>
> Key: PIG-878
> URL: https://issues.apache.org/jira/browse/PIG-878
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.0
>    Reporter: Alan Gates
>Assignee: Alan Gates
>Priority: Critical
> Fix For: 0.4.0
>
> Attachments: PIG-878.patch
>
>
> When SlicerWrapper builds a slice, it currently returns the 3 locations for 
> every block in the file it is slicing, instead of the 3 locations for the 
> block covered by that slice.  This means Pig's odds of having its maps placed 
> on nodes local to the data go way down.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-891) Fixing dfs statement for Pig

2009-08-05 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739652#action_12739652
 ] 

Alan Gates commented on PIG-891:


+1 to Pradeep's suggestion.  I think it would be better for Pig to dump its shell 
command implementations and have grunt dispatch shell commands to Hadoop.  This 
avoids Pig needing to implement these commands.  It also means Pig will 
automatically be in sync with Hadoop on how shell commands work.
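
A hedged sketch of what such dispatching could look like, using Hadoop's own org.apache.hadoop.fs.FsShell (illustrative; not the actual grunt code):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FsShell;
import org.apache.hadoop.util.ToolRunner;

public class GruntFsDispatch {
    // e.g. cmdArgs = {"-ls", "/user/alan"} for a grunt "fs -ls /user/alan"
    public static int run(Configuration conf, String[] cmdArgs) throws Exception {
        // Hand the command line straight to Hadoop's shell implementation
        // instead of re-implementing each command inside Pig.
        return ToolRunner.run(conf, new FsShell(), cmdArgs);
    }
}
{code}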

> Fixing dfs statement for Pig
> 
>
> Key: PIG-891
> URL: https://issues.apache.org/jira/browse/PIG-891
> Project: Pig
>  Issue Type: Bug
>Reporter: Daniel Dai
>Assignee: Jeff Zhang
>Priority: Minor
>
> Several hadoop dfs commands are not supported or are restrictive in current Pig. We 
> need to fix that. These include:
> 1. Several commands are not supported: lsr, dus, count, rmr, expunge, put, 
> moveFromLocal, get, getmerge, text, moveToLocal, mkdir, touchz, test, stat, 
> tail, chmod, chown, chgrp. A reference for these command can be found in 
> http://hadoop.apache.org/common/docs/current/hdfs_shell.html
> 2. All existing dfs commands do not support globing.
> 3. Pig should provide a programmatic way to perform dfs commands. Several of 
> them exist in PigServer, but not all of them.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-893) support cast of chararray to other simple types

2009-08-06 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740188#action_12740188
 ] 

Alan Gates commented on PIG-893:


CastUtil should not include byteToX() calls.  Pig never casts from ByteArray to 
a type; it leaves that to loaders because Pig has no idea what the 
representation of the data is in the bytes.  It might be UTF8 (as it is for 
PigStorage) or something entirely different.

CastUtil.stringToX calls should not call byteToX methods.  This is inefficient 
since those methods just turn the bytes back into a String and then do the 
conversion explicitly.
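
A sketch of the direct conversion being suggested (hedged; the method shown is illustrative and not necessarily CastUtil's actual signature):

{code}
// Convert the chararray directly, with no detour through a byte[] and back.
// Per the proposed cast semantics, a value that does not parse yields null.
public static Integer stringToInteger(String s) {
    if (s == null) return null;
    try {
        return Integer.valueOf(s);
    } catch (NumberFormatException e) {
        // caller logs a warning and the cast evaluates to null
        return null;
    }
}
{code}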

> support cast of chararray to other simple types
> ---
>
> Key: PIG-893
> URL: https://issues.apache.org/jira/browse/PIG-893
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.4.0
>Reporter: Thejas M Nair
>Assignee: Jeff Zhang
> Fix For: 0.4.0
>
> Attachments: Pig_893_Patch.txt
>
>
> Pig should support casting of chararray to 
> integer,long,float,double,bytearray. If the conversion fails for reasons such 
> as overflow, cast should return null and log a warning.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-893) support cast of chararray to other simple types

2009-08-06 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-893:
---

Status: Open  (was: Patch Available)

> support cast of chararray to other simple types
> ---
>
> Key: PIG-893
> URL: https://issues.apache.org/jira/browse/PIG-893
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.4.0
>Reporter: Thejas M Nair
>Assignee: Jeff Zhang
> Fix For: 0.4.0
>
> Attachments: Pig_893_Patch.txt
>
>
> Pig should support casting of chararray to 
> integer,long,float,double,bytearray. If the conversion fails for reasons such 
> as overflow, cast should return null and log a warning.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-833) Storage access layer

2009-08-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-833:
---

Attachment: test.out

When I run ant test in contrib/zebra, I get failures.  I've attached the output 
of the command.

> Storage access layer
> 
>
> Key: PIG-833
> URL: https://issues.apache.org/jira/browse/PIG-833
> Project: Pig
>  Issue Type: New Feature
>Reporter: Jay Tang
> Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, 
> PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, test.out, zebra-javadoc.tgz
>
>
> A layer is needed to provide a high level data access abstraction and a 
> tabular view of data in Hadoop, and could free Pig users from implementing 
> their own data storage/retrieval code.  This layer should also include a 
> columnar storage format in order to provide fast data projection, 
> CPU/space-efficient data serialization, and a schema language to manage 
> physical storage metadata.  Eventually it could also support predicate 
> pushdown for further performance improvement.  Initially, this layer could be 
> a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-833) Storage access layer

2009-08-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-833:
---

Attachment: TEST-org.apache.hadoop.zebra.pig.TestCheckin1.txt

Okay, now that I've first built Pig's tests, I run the tests and get:

{code}
 [delete] Deleting directory 
/Users/gates/src/pig/apache/top/zebra/trunk/build/contrib/zebra/test/logs
[mkdir] Created dir: 
/Users/gates/src/pig/apache/top/zebra/trunk/build/contrib/zebra/test/logs
[junit] Running org.apache.hadoop.zebra.io.TestCheckin
[junit] Tests run: 125, Failures: 0, Errors: 0, Time elapsed: 16.894 sec
[junit] Running org.apache.hadoop.zebra.mapred.TestCheckin
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 158.741 sec
[junit] Running org.apache.hadoop.zebra.pig.TestCheckin1
[junit] Tests run: 0, Failures: 0, Errors: 2, Time elapsed: 0.13 sec
[junit] Test org.apache.hadoop.zebra.pig.TestCheckin1 FAILED
[junit] Running org.apache.hadoop.zebra.pig.TestCheckin2
[junit] Tests run: 0, Failures: 0, Errors: 2, Time elapsed: 0.131 sec
[junit] Test org.apache.hadoop.zebra.pig.TestCheckin2 FAILED
[junit] Running org.apache.hadoop.zebra.pig.TestCheckin3
[junit] Tests run: 0, Failures: 0, Errors: 2, Time elapsed: 0.133 sec
[junit] Test org.apache.hadoop.zebra.pig.TestCheckin3 FAILED
[junit] Running org.apache.hadoop.zebra.pig.TestCheckin4
[junit] Tests run: 0, Failures: 0, Errors: 2, Time elapsed: 0.128 sec
[junit] Test org.apache.hadoop.zebra.pig.TestCheckin4 FAILED
[junit] Running org.apache.hadoop.zebra.pig.TestCheckin5
[junit] Tests run: 0, Failures: 0, Errors: 2, Time elapsed: 0.128 sec
[junit] Test org.apache.hadoop.zebra.pig.TestCheckin5 FAILED
[junit] Running org.apache.hadoop.zebra.types.TestCheckin
[junit] Tests run: 45, Failures: 0, Errors: 0, Time elapsed: 0.253 sec
{code}

I've attached the output from one of the tests.

> Storage access layer
> 
>
> Key: PIG-833
> URL: https://issues.apache.org/jira/browse/PIG-833
> Project: Pig
>  Issue Type: New Feature
>Reporter: Jay Tang
> Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, 
> PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, 
> TEST-org.apache.hadoop.zebra.pig.TestCheckin1.txt, test.out, zebra-javadoc.tgz
>
>
> A layer is needed to provide a high level data access abstraction and a 
> tabular view of data in Hadoop, and could free Pig users from implementing 
> their own data storage/retrieval code.  This layer should also include a 
> columnar storage format in order to provide fast data projection, 
> CPU/space-efficient data serialization, and a schema language to manage 
> physical storage metadata.  Eventually it could also support predicate 
> pushdown for further performance improvement.  Initially, this layer could be 
> a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-833) Storage access layer

2009-08-11 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742093#action_12742093
 ] 

Alan Gates commented on PIG-833:


My bad.  I missed the line in the instructions where it said to apply the 
PIG-660 patch.  I applied that and am trying again.

> Storage access layer
> 
>
> Key: PIG-833
> URL: https://issues.apache.org/jira/browse/PIG-833
> Project: Pig
>  Issue Type: New Feature
>Reporter: Jay Tang
> Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, 
> PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, 
> TEST-org.apache.hadoop.zebra.pig.TestCheckin1.txt, test.out, zebra-javadoc.tgz
>
>
> A layer is needed to provide a high level data access abstraction and a 
> tabular view of data in Hadoop, and could free Pig users from implementing 
> their own data storage/retrieval code.  This layer should also include a 
> columnar storage format in order to provide fast data projection, 
> CPU/space-efficient data serialization, and a schema language to manage 
> physical storage metadata.  Eventually it could also support predicate 
> pushdown for further performance improvement.  Initially, this layer could be 
> a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-833) Storage access layer

2009-08-11 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742100#action_12742100
 ] 

Alan Gates commented on PIG-833:


Patch checked in.  All the unit tests passed.

> Storage access layer
> 
>
> Key: PIG-833
> URL: https://issues.apache.org/jira/browse/PIG-833
> Project: Pig
>  Issue Type: New Feature
>Reporter: Jay Tang
> Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, 
> PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, 
> TEST-org.apache.hadoop.zebra.pig.TestCheckin1.txt, test.out, zebra-javadoc.tgz
>
>
> A layer is needed to provide a high level data access abstraction and a 
> tabular view of data in Hadoop, and could free Pig users from implementing 
> their own data storage/retrieval code.  This layer should also include a 
> columnar storage format in order to provide fast data projection, 
> CPU/space-efficient data serialization, and a schema language to manage 
> physical storage metadata.  Eventually it could also support predicate 
> pushdown for further performance improvement.  Initially, this layer could be 
> a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-893) support cast of chararray to other simple types

2009-08-11 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742144#action_12742144
 ] 

Alan Gates commented on PIG-893:


I'm reviewing this patch.

> support cast of chararray to other simple types
> ---
>
> Key: PIG-893
> URL: https://issues.apache.org/jira/browse/PIG-893
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.4.0
>Reporter: Thejas M Nair
>Assignee: Jeff Zhang
> Fix For: 0.4.0
>
> Attachments: Pig_893.Patch
>
>
> Pig should support casting of chararray to 
> integer,long,float,double,bytearray. If the conversion fails for reasons such 
> as overflow, cast should return null and log a warning.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-893) support cast of chararray to other simple types

2009-08-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-893:
---

  Resolution: Fixed
Release Note: PIG-893:  Added casts from chararray to int, long, float, and 
double.
  Status: Resolved  (was: Patch Available)

Patch checked in.  Thanks Jeff for your work on this.

> support cast of chararray to other simple types
> ---
>
> Key: PIG-893
> URL: https://issues.apache.org/jira/browse/PIG-893
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.4.0
>Reporter: Thejas M Nair
>Assignee: Jeff Zhang
> Fix For: 0.4.0
>
> Attachments: Pig_893.Patch
>
>
> Pig should support casting of chararray to 
> integer,long,float,double,bytearray. If the conversion fails for reasons such 
> as overflow, cast should return null and log a warning.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-911) [Piggybank] SequenceFileLoader

2009-08-11 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742239#action_12742239
 ] 

Alan Gates commented on PIG-911:


Dmitry,

First this is great.  We've had requests to read Sequence files.  Being able to 
write them also would be great.

A few thoughts:

1) This should not extend UTF8StorageConverter.  This loader will be returning 
actual data types, not bytes that need to be interpreted.  I would think 
instead that it should implement the bytesToX() methods itself and just throw 
an exception saying it didn't expect to do any conversion.

2) The getSampledTuple method looks fine, provided skip() gets the stream to a 
point where reading the next tuple is viable.

3) In the bindTo call, where you obtain the key and value by reflection, should 
there be a try/catch block there in case the cast to Writable fails?  In the 
same way, in describe schema you're asking how to suppress warnings from the 
cast in reader.getKeyClass().  But don't you want to check that what you got 
really is a Writable, since there is no guarantee?
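
For point 3, a hedged sketch of the kind of defensive check being suggested (method and variable names are illustrative, not the patch's actual code):

{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Instantiate the key class reflectively and verify it really is a Writable
// before casting, since the file's metadata gives no compile-time guarantee.
public static Writable newKeyInstance(SequenceFile.Reader reader, Configuration conf)
        throws IOException {
    Object keyObj = ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    if (!(keyObj instanceof Writable)) {
        throw new IOException("Key class " + reader.getKeyClass().getName()
                + " does not implement Writable");
    }
    return (Writable) keyObj;
}
{code}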



> [Piggybank] SequenceFileLoader 
> ---
>
> Key: PIG-911
> URL: https://issues.apache.org/jira/browse/PIG-911
> Project: Pig
>  Issue Type: New Feature
>Reporter: Dmitriy V. Ryaboy
> Attachments: pig_sequencefile.patch
>
>
> The proposed piggybank contribution adds a SequenceFileLoader to the 
> piggybank.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-845) PERFORMANCE: Merge Join

2009-08-12 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742245#action_12742245
 ] 

Alan Gates commented on PIG-845:


Dmitry wrote> Would it make sense to expose this to the users via a 'CREATE 
INDEX' (or similar) command?
That way the index could be persisted, and the user could tell you to use an 
existing index instead of rescanning the data.

Ashutosh wrote> If we allow that then we also need to deal with managing and 
persisting the index. Once Owl is integrated, we could make use of that to do 
all this for Pig. Till then, we can continue creating index every time and as I 
said overhead of index creation is negligible as compared to run times of 
actual joins.

My thinking was that at some future point, Pig would automatically cache this 
sample the first time it creates it, so that subsequent joins on the same data 
set could make use of it without re-running the sampling.  I'm hoping we can use Owl for 
that, as Ashutosh indicated.

-

Dmitry wrote> I am not sure about the approach of pushing sampling above 
filters. Have you guys benchmarked this? Seems like you'd wind up reading the 
whole file in the sample job if the filter is selective enough (and high filter 
selectivity would also make materialize->sample go much faster).

You want to build your index on the unfiltered data because your index is 
telling you what block to look for the data in.  The fact that the filter may 
have removed that record doesn't matter.  It will either be in the block 
indicated in the index or not present.  Also, you want to avoid filtering and 
then building the index because it adds another write and read of the data (you 
have to filter, write the data to HDFS, then read it to build the index, then 
read it again to do the join).

> PERFORMANCE: Merge Join
> ---
>
> Key: PIG-845
> URL: https://issues.apache.org/jira/browse/PIG-845
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>Assignee: Ashutosh Chauhan
> Attachments: merge-join-1.patch, merge-join-for-review.patch
>
>
> This join would work if the data for both tables is sorted on the join key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-928) UDFs in scripting languages

2009-08-19 Thread Alan Gates (JIRA)
UDFs in scripting languages
---

 Key: PIG-928
 URL: https://issues.apache.org/jira/browse/PIG-928
 Project: Pig
  Issue Type: New Feature
Reporter: Alan Gates


It should be possible to write UDFs in scripting languages such as python, 
ruby, etc.  This frees users from needing to compile Java, generate a jar, etc. 
 It also opens Pig to programmers who prefer scripting languages over Java.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-928) UDFs in scripting languages

2009-08-19 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-928:
---

Attachment: package.zip

Attaching some preliminary work by Kishore Gopalakrishna on this.  This code is 
a good start, but not ready for inclusion.  It needs to be cleaned up, put in 
our class structure, etc.  

Comments from Kishore:

It contains all the libraries required and also the GenericEval UDF and
GenericFilter UDF

I didn't get a chance to get the Algebraic function working.

To test it, just unzip the package and run

rm -rf wordcount/output;
pig -x local wordcount.pig ---> to test eval
pig -x local wordcount_filter.pig ---> to test filter [sorry it should
be named filter.pig]
cat wordcount/output

> UDFs in scripting languages
> ---
>
> Key: PIG-928
> URL: https://issues.apache.org/jira/browse/PIG-928
> Project: Pig
>  Issue Type: New Feature
>    Reporter: Alan Gates
> Attachments: package.zip
>
>
> It should be possible to write UDFs in scripting languages such as python, 
> ruby, etc.  This frees users from needing to compile Java, generate a jar, 
> etc.  It also opens Pig to programmers who prefer scripting languages over 
> Java.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-930) merge join should handle compressed bz2 sorted files

2009-08-27 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748372#action_12748372
 ] 

Alan Gates commented on PIG-930:


One question seems worth asking: is it right that the offset returned 
on bzip2 data cannot be properly bound to?  If possible, it would seem better to 
correct this issue in the loader that is dealing with bzip2 data than to force 
operators up in the pipeline to handle the fact that not all data formats will 
properly honor the bind position.
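
One hedged sketch of what fixing this in the loader could look like (purely illustrative, not Pig's bzip2 handling): have the loader report the offset of the last compressed-block boundary it has seen, so a later bindTo() on that value always lands on a block header rather than mid-block:

{code}
// Illustrative position bookkeeping for a block-compressed input stream.
public class BlockAlignedPosition {
    private long rawOffset = 0;       // raw bytes consumed so far
    private long lastBlockStart = 0;  // offset of the last block header seen

    void onBlockHeader() {            // call when the decompressor hits a new block
        lastBlockStart = rawOffset;
    }

    void onBytesConsumed(long n) {
        rawOffset += n;
    }

    // Value handed out via getPosition(); always safe to bindTo() later.
    public long getPosition() {
        return lastBlockStart;
    }
}
{code}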

> merge join should handle compressed bz2 sorted files
> 
>
> Key: PIG-930
> URL: https://issues.apache.org/jira/browse/PIG-930
> Project: Pig
>  Issue Type: Bug
>Reporter: Pradeep Kamath
>
> There are two issues - POLoad which is used to read the right side input does 
> not handle bz2 files right now. This needs to be fixed.
> Further, in the index map job we bindTo(startOfBlockOffSet) (this will 
> internally discard first tuple if offset > 0). Then we do the following:
> {noformat}
> While(tuple survives pipeline) {
>   Pos =  getPosition()
>   getNext() 
>   run the tuple  through pipeline in the right side which could have filter
> }
> Emit(key, pos, filename).
> {noformat}
>  
> Then in the map job which does the join, we bindTo(pos > 0 ? pos - 1 : pos) 
> (we do pos -1 because bindTo will discard first tuple for pos> 0). Then we do 
> getNext()
> Now in bz2 compressed files, getPosition() returns a position which is not 
> really accurate. The problem is it could be a position in the middle of a 
> compressed bz2 block. Then when we use that position to bindTo() in the final 
> map job, the code would first hunt for a bz2 block header thus skipping the 
> whole current bz2 block. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-759) HBaseStorage scheme for Load/Slice function

2009-09-08 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752601#action_12752601
 ] 

Alan Gates commented on PIG-759:


Things can be passed as bytes in Pig by passing them as bytearrays.  This is 
the default if a type is not declared.

I can't assign the bug to you because you're not in the list of assignable 
people for Pig bugs.  I think Olga has to add you to that list.

> HBaseStorage scheme for Load/Slice function
> ---
>
> Key: PIG-759
> URL: https://issues.apache.org/jira/browse/PIG-759
> Project: Pig
>  Issue Type: Bug
>Reporter: Gunther Hagleitner
> Attachments: patch.p1
>
>
> We would like to change the HBaseStorage function to use a scheme when 
> loading a table in pig. The scheme we are thinking of is: "hbase". So in 
> order to load an hbase table in a pig script the statement should read:
> {noformat}
> table = load 'hbase://' using HBaseStorage();
> {noformat}
> If the scheme is omitted pig would assume the tablename to be an hdfs path 
> and the storage function would use the last component of the path as a table 
> name and output a warning.
> For details on why see jira issue: PIG-758

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-759) HBaseStorage scheme for Load/Slice function

2009-09-08 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-759:
---

Fix Version/s: 0.4.0
   Status: Patch Available  (was: Open)

Marking as submitted so Hudson will pick it up.

> HBaseStorage scheme for Load/Slice function
> ---
>
> Key: PIG-759
> URL: https://issues.apache.org/jira/browse/PIG-759
> Project: Pig
>  Issue Type: Bug
>Reporter: Gunther Hagleitner
> Fix For: 0.4.0
>
> Attachments: patch.p1
>
>
> We would like to change the HBaseStorage function to use a scheme when 
> loading a table in pig. The scheme we are thinking of is: "hbase". So in 
> order to load an hbase table in a pig script the statement should read:
> {noformat}
> table = load 'hbase://' using HBaseStorage();
> {noformat}
> If the scheme is omitted pig would assume the tablename to be an hdfs path 
> and the storage function would use the last component of the path as a table 
> name and output a warning.
> For details on why see jira issue: PIG-758

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-939) Checkstyle pulls in junit3.7 which causes the build of test code to fail.

2009-09-08 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752617#action_12752617
 ] 

Alan Gates commented on PIG-939:


Why is antlr being added as a dependency?  I don't think Pig uses antlr.

> Checkstyle pulls in junit3.7 which causes the build of test code to fail.
> -
>
> Key: PIG-939
> URL: https://issues.apache.org/jira/browse/PIG-939
> Project: Pig
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.3.0
>Reporter: Lee Tucker
>Assignee: Giridharan Kesavan
> Attachments: pig-939.patch
>
>
> Pig fails to compile if you execute: 
> ant -D clean findbugs checkstyle 
> test 
> It gets the error:
> [javac] Compiling 153 source files to 
> /export/crawlspace/kryptonite/hadoopqa/workspace/workspace/CCDI-Pig-2.3/pig-2.3.0.0.20.0.2967040009/build/test/classes
> [javac] 
> /export/crawlspace/kryptonite/hadoopqa/workspace/workspace/CCDI-Pig-2.3/pig-2.3.0.0.20.0.2967040009/test/org/apache/pig/test/PigExecTestCase.java:31:
>  cannot find symbol
> [javac] symbol  : constructor TestCase()
> [javac] location: class junit.framework.TestCase
> [javac] public abstract class PigExecTestCase extends TestCase {
> [javac] ^
> Once that's done, there's a copy of junit 3.7 cached from ivy that will 
> continue to cause the build to fail.  It will succeed, if you remove it, and 
> then do:
> ant -D clean findbugs test
> This proves it's running checkstyle that pulls in junit 3.7

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-833) Storage access layer

2009-09-08 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates resolved PIG-833.


   Resolution: Fixed
Fix Version/s: 0.4.0

Patch was checked in a while ago.

> Storage access layer
> 
>
> Key: PIG-833
> URL: https://issues.apache.org/jira/browse/PIG-833
> Project: Pig
>  Issue Type: New Feature
>Reporter: Jay Tang
> Fix For: 0.4.0
>
> Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, 
> PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, 
> TEST-org.apache.hadoop.zebra.pig.TestCheckin1.txt, test.out, zebra-javadoc.tgz
>
>
> A layer is needed to provide a high level data access abstraction and a 
> tabular view of data in Hadoop, and could free Pig users from implementing 
> their own data storage/retrieval code.  This layer should also include a 
> columnar storage format in order to provide fast data projection, 
> CPU/space-efficient data serialization, and a schema language to manage 
> physical storage metadata.  Eventually it could also support predicate 
> pushdown for further performance improvement.  Initially, this layer could be 
> a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-759) HBaseStorage scheme for Load/Slice function

2009-09-08 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752692#action_12752692
 ] 

Alan Gates commented on PIG-759:


You can ignore the core test failures, as Hudson is having some problem with 
the tests.  When I run the unit tests on my box they all pass.  But the findbugs 
warnings will need to be fixed before the patch can be committed.

> HBaseStorage scheme for Load/Slice function
> ---
>
> Key: PIG-759
> URL: https://issues.apache.org/jira/browse/PIG-759
> Project: Pig
>  Issue Type: Bug
>Reporter: Gunther Hagleitner
> Fix For: 0.4.0
>
> Attachments: patch.p1
>
>
> We would like to change the HBaseStorage function to use a scheme when 
> loading a table in pig. The scheme we are thinking of is: "hbase". So in 
> order to load an hbase table in a pig script the statement should read:
> {noformat}
> table = load 'hbase://' using HBaseStorage();
> {noformat}
> If the scheme is omitted pig would assume the tablename to be an hdfs path 
> and the storage function would use the last component of the path as a table 
> name and output a warning.
> For details on why see jira issue: PIG-758

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-759) HBaseStorage scheme for Load/Slice function

2009-09-08 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-759:
---

Status: Open  (was: Patch Available)

> HBaseStorage scheme for Load/Slice function
> ---
>
> Key: PIG-759
> URL: https://issues.apache.org/jira/browse/PIG-759
> Project: Pig
>  Issue Type: Bug
>Reporter: Gunther Hagleitner
> Fix For: 0.4.0
>
> Attachments: patch.p1
>
>
> We would like to change the HBaseStorage function to use a scheme when 
> loading a table in pig. The scheme we are thinking of is: "hbase". So in 
> order to load an hbase table in a pig script the statement should read:
> {noformat}
> table = load 'hbase://' using HBaseStorage();
> {noformat}
> If the scheme is omitted pig would assume the tablename to be an hdfs path 
> and the storage function would use the last component of the path as a table 
> name and output a warning.
> For details on why see jira issue: PIG-758

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-939) Checkstyle pulls in junit3.7 which causes the build of test code to fail.

2009-09-09 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753165#action_12753165
 ] 

Alan Gates commented on PIG-939:


+1

> Checkstyle pulls in junit3.7 which causes the build of test code to fail.
> -
>
> Key: PIG-939
> URL: https://issues.apache.org/jira/browse/PIG-939
> Project: Pig
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.3.0
>Reporter: Lee Tucker
>Assignee: Giridharan Kesavan
> Attachments: pig-939.patch
>
>
> Pig fails to compile if you execute: 
> ant -D clean findbugs checkstyle 
> test 
> It gets the error:
> [javac] Compiling 153 source files to 
> /export/crawlspace/kryptonite/hadoopqa/workspace/workspace/CCDI-Pig-2.3/pig-2.3.0.0.20.0.2967040009/build/test/classes
> [javac] 
> /export/crawlspace/kryptonite/hadoopqa/workspace/workspace/CCDI-Pig-2.3/pig-2.3.0.0.20.0.2967040009/test/org/apache/pig/test/PigExecTestCase.java:31:
>  cannot find symbol
> [javac] symbol  : constructor TestCase()
> [javac] location: class junit.framework.TestCase
> [javac] public abstract class PigExecTestCase extends TestCase {
> [javac] ^
> Once that's done, there's a copy of junit 3.7 cached from ivy that will 
> continue to cause the build to fail.  It will succeed, if you remove it, and 
> then do:
> ant -D clean findbugs test
> This proves it's running checkstyle that pulls in junit 3.7

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-927) null should be handled consistently in Join

2009-09-09 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753249#action_12753249
 ] 

Alan Gates commented on PIG-927:


It seems that the right semantic would be to follow SQL consistently, as that 
is what we say we do.

> null should be handled consistently in Join
> ---
>
> Key: PIG-927
> URL: https://issues.apache.org/jira/browse/PIG-927
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.0
>Reporter: Pradeep Kamath
>
> Currently Pig mostly follows SQL semantics for handling null. However, there 
> are certain cases where Pig may not be handling nulls correctly. One example 
> is the join - joins on single keys result in null keys not matching to 
> produce an output. However if the join is on >1 keys, in the key tuple, if 
> one of the values is null, it still matches with another key tuple which has 
> a null for that value. We need to decide the right semantics here. 
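A sketch of the SQL-style semantics under discussion, where a null in any component of a composite key prevents a match. Tuple and ExecException are Pig's real classes, but the helper itself is hypothetical.
{code}
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;

// Hypothetical helper: composite join keys match only if no component is null
// and all components are equal, mirroring SQL's treatment of NULL in joins.
public class NullSafeKeyMatcher {
    public static boolean keysMatch(Tuple left, Tuple right) throws ExecException {
        if (left.size() != right.size()) {
            return false;
        }
        for (int i = 0; i < left.size(); i++) {
            Object l = left.get(i);
            Object r = right.get(i);
            if (l == null || r == null) {
                return false;   // SQL semantics: null never equals anything
            }
            if (!l.equals(r)) {
                return false;
            }
        }
        return true;
    }
}
{code}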

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data

2009-09-10 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753843#action_12753843
 ] 

Alan Gates commented on PIG-953:


-1 to adding an orderPreserving flag on operators.  We have no intention of 
ever promising that any relational operator beyond Order and Limit preserve 
order.  The fact that some happen to now (like filter) is a side effect of the 
current implementation, not a feature.  If we add a flag, it becomes a feature 
that we will be expected to maintain.

> Enable merge join in pig to work with loaders and store functions which can 
> internally index sorted data 
> -
>
> Key: PIG-953
> URL: https://issues.apache.org/jira/browse/PIG-953
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.3.0
>Reporter: Pradeep Kamath
>Assignee: Pradeep Kamath
> Attachments: PIG-953.patch
>
>
> Currently the merge join implementation in pig includes construction of an index 
> on sorted data and use of that index to seek into the "right input" to 
> efficiently perform the join operation. Some loaders (notably the zebra 
> loader) internally implement an index on sorted data and can perform this 
> seek efficiently using their index. So the use of the index needs to be 
> abstracted in such a way that when the loader supports indexing, pig uses it 
> (indirectly through the loader) and does not construct an index. 
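One way to picture the abstraction being proposed: a hypothetical interface a loader could implement when it maintains its own index over sorted data. The names are illustrative only, not the committed API.
{code}
import java.io.IOException;
import org.apache.pig.data.Tuple;

// Hypothetical contract for loaders (e.g. the zebra loader) that keep their own
// index over sorted data.  When the right-hand loader of a merge join implements
// this, Pig can skip building its own index and ask the loader to seek instead.
public interface SortedIndexCapableLoader {
    // Position the reader at (or just before) the first record whose join key
    // is greater than or equal to the given key tuple.
    void seekNear(Tuple joinKey) throws IOException;

    // Release any resources held for index-based seeking.
    void close() throws IOException;
}
{code}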

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-793) Improving memory efficiency of Tuple implementation

2009-09-14 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755019#action_12755019
 ] 

Alan Gates commented on PIG-793:


Sri is looking into the array vs arraylist changes as well.

> Improving memory efficiency of Tuple implementation
> ---
>
> Key: PIG-793
> URL: https://issues.apache.org/jira/browse/PIG-793
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>    Assignee: Alan Gates
>
> Currently, our tuple is a real pig and uses a lot of extra memory. 
> There are several places where we can improve memory efficiency:
> (1) Laying out memory for the fields rather than using Java objects, since 
> each object for a numeric field takes 16 bytes.
> (2) For the cases where we know the schema, using Java arrays rather than 
> ArrayList.
> There might be more.
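A sketch of idea (2): when the schema is known and fixed, a tuple can be backed by a plain Object[] of the right size instead of an ArrayList. This is purely illustrative and not the actual Pig implementation.
{code}
import java.util.Arrays;
import java.util.List;

// Illustrative only: a fixed-width tuple backed by a plain array.  A real Pig
// tuple would implement org.apache.pig.data.Tuple; this sketch only shows the
// memory-layout idea of replacing ArrayList with Object[].
public class FixedWidthTuple {
    private final Object[] fields;

    public FixedWidthTuple(int arity) {
        this.fields = new Object[arity];   // no ArrayList growth slack
    }

    public Object get(int i) { return fields[i]; }

    public void set(int i, Object value) { fields[i] = value; }

    public int size() { return fields.length; }

    public List<Object> asList() { return Arrays.asList(fields); }
}
{code}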

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-831) Records and bytes written reported by pig are wrong in a multi-store program

2009-09-15 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-831:
---

   Resolution: Fixed
Fix Version/s: 0.4.0
   Status: Resolved  (was: Patch Available)

Fix checked in 6 June 2009

> Records and bytes written reported by pig are wrong in a multi-store program
> 
>
> Key: PIG-831
> URL: https://issues.apache.org/jira/browse/PIG-831
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.3.0
>    Reporter: Alan Gates
>Assignee: Alan Gates
>Priority: Minor
> Fix For: 0.4.0
>
> Attachments: PIG-831.patch
>
>
> The stats features checked in as part of PIG-626 (reporting the number of 
> records and bytes written at the end of the query) print wrong values (often 
> but not always 0) when the pig script being run contains more than 1 store.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-802) PERFORMANCE: not creating bags for ORDER BY

2009-09-15 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-802:
---

   Resolution: Fixed
Fix Version/s: 0.4.0
   Status: Resolved  (was: Patch Available)

Fix checked in 30 May 2009

> PERFORMANCE: not creating bags for ORDER BY
> ---
>
> Key: PIG-802
> URL: https://issues.apache.org/jira/browse/PIG-802
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Olga Natkovich
> Fix For: 0.4.0
>
> Attachments: OrderByOptimization.patch
>
>
> Order by should be changed to not use POPackage to put all of the tuples in a 
> bag on the reduce side, as the bag is just immediately flattened. It can 
> instead work like join does for the last input in the join. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-865) Performance: Unnnecessary computation in FRJoin

2009-09-15 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-865:
---

Status: Open  (was: Patch Available)

When I ran PigMix_2 (which does an FR join) with this patch, it actually slowed it 
down by about 10%.

> Performance: Unnnecessary computation in FRJoin
> ---
>
> Key: PIG-865
> URL: https://issues.apache.org/jira/browse/PIG-865
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.3.0
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
>Priority: Minor
> Attachments: pig-865.patch, pig-865_v2.patch
>
>
> In the POFRJoin implementation, POLocalRearrange is used to extract the join keys from 
> the input tuples. If the keys match, then to perform the actual join the input tuples are 
> fed to a Foreach which does a cross on its inputs. After the keys are extracted 
> from the POLocalRearrange output, the function getValueTuple(POLocalRearrange lr, 
> Tuple tuple) is called to reconstruct the input tuple. This 
> function call seems unnecessary since we already have the input tuple at that time. 
> This is not a bug, but since this function gets called for every tuple, 
> eliminating it should certainly help to improve performance. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-911) [Piggybank] SequenceFileLoader

2009-09-15 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755709#action_12755709
 ] 

Alan Gates commented on PIG-911:


I'm reviewing this patch

> [Piggybank] SequenceFileLoader 
> ---
>
> Key: PIG-911
> URL: https://issues.apache.org/jira/browse/PIG-911
> Project: Pig
>  Issue Type: New Feature
>Reporter: Dmitriy V. Ryaboy
> Attachments: pig_911.2.patch, pig_sequencefile.patch
>
>
> The proposed piggybank contribution adds a SequenceFileLoader to the 
> piggybank.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-960) Using Hadoop's optimized LineRecordReader for reading Tuples in PigStorage

2009-09-15 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755780#action_12755780
 ] 

Alan Gates commented on PIG-960:


+1, patch looks good.  Since I wrote the first 25% or so of the code, someone 
else should review this too.

> Using Hadoop's optimized LineRecordReader for reading Tuples in PigStorage 
> ---
>
> Key: PIG-960
> URL: https://issues.apache.org/jira/browse/PIG-960
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Ankit Modi
> Attachments: pig_rlr.patch
>
>
> PigStorage's reading of Tuples (lines) can be optimized using Hadoop's 
> {{LineRecordReader}}.
> This can help in the following areas
> - Improving the performance of reading Tuples (lines) in {{PigStorage}}
> - Any future improvements in line reading done in Hadoop's 
> {{LineRecordReader}} are automatically carried over to Pig
> Issues that are handled by this patch
> - BZip uses internal buffers and positioning for determining the number of 
> bytes read. Hence buffering done by {{LineRecordReader}} has to be turned off
> - Current implementation of {{LocalSeekableInputStream}} does not implement 
> {{available}} method. This method has to be implemented.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-911) [Piggybank] SequenceFileLoader

2009-09-15 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-911:
---

   Resolution: Fixed
Fix Version/s: 0.5.0
   Status: Resolved  (was: Patch Available)

Committed patch.  Thanks, Dmitriy.

> [Piggybank] SequenceFileLoader 
> ---
>
> Key: PIG-911
> URL: https://issues.apache.org/jira/browse/PIG-911
> Project: Pig
>  Issue Type: New Feature
>Reporter: Dmitriy V. Ryaboy
> Fix For: 0.5.0
>
> Attachments: pig_911.2.patch, pig_sequencefile.patch
>
>
> The proposed piggybank contribution adds a SequenceFileLoader to the 
> piggybank.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-926) Merge-Join phase 2

2009-09-15 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-926:
---

   Resolution: Fixed
Fix Version/s: 0.4.0
   Status: Resolved  (was: Patch Available)

Patch committed 20 August 2009

> Merge-Join phase 2
> --
>
> Key: PIG-926
> URL: https://issues.apache.org/jira/browse/PIG-926
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
>Priority: Minor
> Fix For: 0.4.0
>
> Attachments: mj_phase2_1.patch
>
>
> This jira is created to keep track of phase-2 work for MergeJoin. Various 
> limitations exist in phase-1 for Merge Join which are listed on: 
> http://wiki.apache.org/pig/PigMergeJoin Those will be addressed here.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-366) PigPen - Eclipse plugin for a graphical PigLatin editor

2009-09-17 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756636#action_12756636
 ] 

Alan Gates commented on PIG-366:


At this point no one has picked up PigPen recently and kept it up to date.  I 
know it worked with Pig 0.2.0, but it has not been updated since then.

> PigPen - Eclipse plugin for a graphical PigLatin editor
> ---
>
> Key: PIG-366
> URL: https://issues.apache.org/jira/browse/PIG-366
> Project: Pig
>  Issue Type: New Feature
>Reporter: Shubham Chopra
>Assignee: Shubham Chopra
>Priority: Minor
> Attachments: org.apache.pig.pigpen_0.0.1.jar, 
> org.apache.pig.pigpen_0.0.1.tgz, org.apache.pig.pigpen_0.0.4.jar, 
> pigpen.patch, pigPen.patch, PigPen.tgz
>
>
> This is an Eclipse plugin that provides a GUI that can help users create 
> PigLatin scripts and see the example generator outputs on the fly and submit 
> the jobs to hadoop clusters.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-951) Reset parallelism to 1 for indexing job in MergeJoin

2009-09-17 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756654#action_12756654
 ] 

Alan Gates commented on PIG-951:


I'll be reviewing this patch.

> Reset parallelism to 1 for indexing job in MergeJoin
> 
>
> Key: PIG-951
> URL: https://issues.apache.org/jira/browse/PIG-951
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Attachments: pig-951.patch
>
>
> After sampling one tuple from every block, one reducer is used to sort the 
> index entries in the reduce phase to produce the sorted index to be used in the 
> actual join job. Thus, the parallelism of the index job should be explicitly set 
> to 1. Currently, it's not.
> Currently, this is a non-issue, since we don't allow any blocking operators 
> in the pipeline before merge-join. However, later when we do allow blocking 
> operators, the parallelism of the indexing job will be that of the preceding 
> blocking operator. Even then, the job will complete successfully because all 
> tuples will go to only one reducer, since we are grouping on only one key, 
> "all". However, it will waste cluster resources by starting all the extra 
> reducers, which get no data and thus do nothing.
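A minimal sketch of the point being made, using the Hadoop 0.20 mapred API; the surrounding job setup is elided and the class is illustrative.
{code}
import org.apache.hadoop.mapred.JobConf;

// The index-building job sorts all sampled index entries on a single reducer,
// so its reduce parallelism should always be pinned to 1, regardless of the
// parallelism requested for preceding operators.
public class IndexJobParallelism {
    public static void pinToSingleReducer(JobConf indexJobConf) {
        indexJobConf.setNumReduceTasks(1);
    }
}
{code}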

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-951) Reset parallelism to 1 for indexing job in MergeJoin

2009-09-18 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-951:
---

   Resolution: Fixed
Fix Version/s: 0.6.0
   Status: Resolved  (was: Patch Available)

Patch checked in.  Thanks Ashutosh.

> Reset parallelism to 1 for indexing job in MergeJoin
> 
>
> Key: PIG-951
> URL: https://issues.apache.org/jira/browse/PIG-951
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.6.0
>
> Attachments: pig-951.patch
>
>
> After sampling one tuple from every block, one reducer is used to sort the 
> index entries in the reduce phase to produce the sorted index to be used in the 
> actual join job. Thus, the parallelism of the index job should be explicitly set 
> to 1. Currently, it's not.
> Currently, this is a non-issue, since we don't allow any blocking operators 
> in the pipeline before merge-join. However, later when we do allow blocking 
> operators, the parallelism of the indexing job will be that of the preceding 
> blocking operator. Even then, the job will complete successfully because all 
> tuples will go to only one reducer, since we are grouping on only one key, 
> "all". However, it will waste cluster resources by starting all the extra 
> reducers, which get no data and thus do nothing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-948) [Usability] Relating pig script with MR jobs

2009-09-18 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-948:
---

Status: Open  (was: Patch Available)

Marking this as open again rather than patch available until issues with job 
tracker URI in the message are resolved.

> [Usability] Relating pig script with MR jobs
> 
>
> Key: PIG-948
> URL: https://issues.apache.org/jira/browse/PIG-948
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
>Priority: Minor
> Attachments: pig-948.patch
>
>
> Currently it's hard to relate a pig script to its specific MR jobs. 
> In a loaded cluster with multiple simultaneous job submissions, it's not easy 
> to figure out which specific MR jobs were launched for a given pig script. If 
> Pig can provide this info, it will be useful for debugging and monitoring the jobs 
> resulting from a pig script.
> At the very least, Pig should be able to provide the user with the following 
> information:
> 1) The job id of the launched job.
> 2) The complete web URL of the jobtracker running this job. 
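A sketch of the kind of reporting being asked for, using the old org.apache.hadoop.mapred API that Pig used at the time; the log destination and wording are illustrative, not what the patch does.
{code}
import java.io.IOException;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

// Illustrative only: after submitting an MR job for a pig script, report the
// job id and the jobtracker's web URL so users can relate the script to its jobs.
public class JobInfoLogger {
    public static RunningJob submitAndReport(JobConf conf) throws IOException {
        JobClient client = new JobClient(conf);
        RunningJob job = client.submitJob(conf);
        System.out.println("Launched MR job " + job.getID()
                + " for this pig script; track it at " + job.getTrackingURL());
        return job;
    }
}
{code}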

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-777) Code refactoring: Create optimization out of store/load post processing code

2009-09-18 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-777:
---

Status: Open  (was: Patch Available)

Moving from patch available to open since the contributed patch has been 
committed and the JIRA is being held open to address other issues.

> Code refactoring: Create optimization out of store/load post processing code
> 
>
> Key: PIG-777
> URL: https://issues.apache.org/jira/browse/PIG-777
> Project: Pig
>  Issue Type: Improvement
>Reporter: Gunther Hagleitner
> Attachments: log_message.patch
>
>
> The postProcessing method in the pig server checks whether a logical graph 
> contains stores to and loads from the same location. If so, it will either 
> connect the store and load, or optimize by throwing out the load and 
> connecting the store predecessor with the successor of the load.
> Ideally the introduction of the store and load connection should happen in 
> the query compiler, while the optimization should then happen in an separate 
> optimizer step as part of the optimizer framework.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-752) local mode doesn't read bzip2 and gzip compressed data files

2009-09-18 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-752:
---

Status: Open  (was: Patch Available)

When I try to apply this patch I get:

{code}
patching file src/org/apache/pig/impl/util/IOStreamFactory.java
patching file src/org/apache/pig/backend/hadoop/datastorage/HFile.java
Hunk #1 FAILED at 29.
Hunk #2 FAILED at 67.
2 out of 2 hunks FAILED -- saving rejects to file 
src/org/apache/pig/backend/hadoop/datastorage/HFile.java.rej
patching file src/org/apache/pig/backend/executionengine/PigSlice.java
patching file src/org/apache/pig/impl/io/FileLocalizer.java
patching file 
src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigOutputFormat.java
patching file test/org/apache/pig/test/TestBZip.java
{code}

> local mode doesn't read bzip2 and gzip compressed data files
> 
>
> Key: PIG-752
> URL: https://issues.apache.org/jira/browse/PIG-752
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.4.0
>Reporter: David Ciemiewicz
>Assignee: Jeff Zhang
> Fix For: 0.4.0
>
> Attachments: Pig_752.Patch
>
>
> Problem 1)  use of .bz2 file extension does not store results bzip2 
> compressed in Local mode (-exectype local)
> If I use the .bz2 filename extension in a STORE statement on HDFS, the 
> results are stored with bzip2 compression.
> If I use the .bz2 filename extension in a STORE statement on local file 
> system, the results are NOT stored with bzip2 compression.
> compact.bz2.pig:
> {code}
> A = load 'events.test' using PigStorage();
> store A into 'events.test.bz2' using PigStorage();
> C = load 'events.test.bz2' using PigStorage();
> C = limit C 10;
> dump C;
> {code}
> {code}
> -bash-3.00$ pig -exectype local compact.bz2.pig
> -bash-3.00$ file events.test
> events.test: ASCII English text, with very long lines
> -bash-3.00$ file events.test.bz2
> events.test.bz2: ASCII English text, with very long lines
> -bash-3.00$ cat events.test | bzip2 > events.test.bz2
> -bash-3.00$ file events.test.bz2
> events.test.bz2: bzip2 compressed data, block size = 900k
> {code}
> The output format in local mode is definitely not bzip2, but it should be.
> Problem 2) pig in local mode does not decompress bzip2 compressed files, but 
> should, to be consistent with HDFS
> read.bz2.pig:
> {code}
> A = load 'events.test.bz2' using PigStorage();
> A = limit A 10;
> dump A;
> {code}
> The output should be human readable but is instead garbage, indicating no 
> decompression took place during the load:
> {code}
> -bash-3.00$ pig -exectype local read.bz2.pig
> USING: /grid/0/gs/pig/current
> 2009-04-03 18:26:30,455 [main] INFO  
> org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
> 2009-04-03 18:26:30,456 [main] INFO  
> org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
> (BZh91AY&syoz?u?...@{x_?d?|u-??mK???;??4?C??)
> ((R? 6?*m?&???g, 
> ?6?Zj?k,???0?QT?d???hY?#mJ?>[j???z?m?t?u?K)??K5+??)?m?E7j?X?8a??
> ??U?p@@MT?$?B?P??N??=???(z<}gk...@c$\??i]?g:?J)
> a(R?,?u?v???...@?i@??J??!D?)???A?PP?IY??m?
> (mP(i?4,#F[?I)@>?...@??|7^?}U??wwg,?u?$?T???((Q!D?=`*?}hP??_|??=?(??2???m=?xG?(?rC?B?(33??:4?N???t|??T?*??k??NT?x???=?fyv?w>f??4z???4t?)
> (?oou?t???Kwl?3?nCM?WS?;l???P?s?x
> a???e)B??9?  ?44
> ((?...@4?)
> (f)
> (?...@+?d?0@>?U)
> (Q?SR)
> -bash-3.00$ 
> {code}
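A sketch of the behaviour the report expects from local mode: pick a decompression wrapper based on the file extension, as HDFS mode does. The gzip branch uses java.util.zip; the bzip2 branch is left as a placeholder because the decompressor class is version-specific.
{code}
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// Illustrative only: local mode should wrap the raw stream the same way HDFS
// mode does, based on the extension of the load location.
public class LocalCompressionSketch {
    public static InputStream openForRead(String fileName, InputStream raw)
            throws IOException {
        if (fileName.endsWith(".gz")) {
            return new GZIPInputStream(raw);
        }
        if (fileName.endsWith(".bz2")) {
            // Placeholder: hand off to whichever bzip2 decompressor the build
            // bundles (omitted here because the class differs across versions).
            throw new IOException("bzip2 decompression not wired up in this sketch");
        }
        return raw;   // uncompressed
    }
}
{code}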

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-682) Fix the ssh tunneling code

2009-09-18 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-682:
---

Status: Open  (was: Patch Available)

Moving to open until the patch is changed per the comments by Santhosh and 
Pradeep.

> Fix the ssh tunneling code
> --
>
> Key: PIG-682
> URL: https://issues.apache.org/jira/browse/PIG-682
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Benjamin Reed
> Attachments: jsch-0.1.41.jar, PIG-682.patch
>
>
> Hadoop has changed a bit and the ssh-gateway code no longer works. Pig needs 
> to be updated to register with the new socket framework. Reporting of 
> problems also needs to be better.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-651) PERFORMANCE: Use specialized POForEachNoFlatten for cases where the foreach has no flattens

2009-09-18 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757297#action_12757297
 ] 

Alan Gates commented on PIG-651:


Is it worth adding this complexity to the code for a 2% speed up?  I'd vote no.

> PERFORMANCE: Use specialized POForEachNoFlatten for cases where the foreach 
> has no flattens
> ---
>
> Key: PIG-651
> URL: https://issues.apache.org/jira/browse/PIG-651
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Pradeep Kamath
>Assignee: Pradeep Kamath
> Attachments: PIG-651.patch
>
>
> POForEach has a lot of code to handle flattening (cross product) of the fields 
> in the generate. This is relevant only when at least one field in the generate 
> needs to be flattened. If no fields in the generate need to be 
> flattened, a simpler and hopefully more efficient POForEach can be 
> used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-593) RegExLoader stops an non-matching line

2009-09-18 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-593:
---

Resolution: Duplicate
Status: Resolved  (was: Patch Available)

Looks like this issue has already been addressed with a separate patch.

> RegExLoader stops an non-matching line
> --
>
> Key: PIG-593
> URL: https://issues.apache.org/jira/browse/PIG-593
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.1.0
>Reporter: Vadim Zaliva
>Priority: Minor
> Attachments: PIG-593.diff
>
>
> The RegExLoader class and all its subclasses stop if a line does not match 
> the provided regular expression.
> In particular, I have noticed this when CombinedLogLoader stopped at the 
> following line:
> 58.210.62.24 - - [29/Dec/2008:23:06:57 -0800] "GET 
> /tor/browse/?id=24746&rel=FLY
> 999%40Jack's+Teen+America+22%2FFLY999原創%40單掛D.C.資訊交流網+Jack's+Teen+Ameri
> ca+22+cd1.avi HTTP/1.1" 8952 200 
> "http://img252.imageshack.us/tor/browse/?id=247
> 46&rel=FLY999%40Jack%27s+Teen+America+22" "Mozilla/4.0 (compatible; MSIE 6.0; 
> Wi
> ndows NT 5.1; )" "-"
> Looks like some Japanese characters here do not match the \S expression used.  
> In general I expect it to skip such lines, not to stop processing the data file.
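A sketch of the skip-instead-of-stop behaviour the reporter expects, using only java.util.regex; the loader plumbing around it is omitted and the helper is hypothetical.
{code}
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: lines that do not match the pattern are skipped rather
// than terminating the load.
public class SkippingRegexReader {
    public static List<String[]> readMatching(BufferedReader in, Pattern pattern)
            throws IOException {
        List<String[]> rows = new ArrayList<String[]>();
        String line;
        while ((line = in.readLine()) != null) {
            Matcher m = pattern.matcher(line);
            if (!m.matches()) {
                continue;   // skip the non-matching line, don't stop
            }
            String[] fields = new String[m.groupCount()];
            for (int i = 0; i < m.groupCount(); i++) {
                fields[i] = m.group(i + 1);
            }
            rows.add(fields);
        }
        return rows;
    }
}
{code}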

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-592) schema inferred incorrectly

2009-09-18 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757306#action_12757306
 ] 

Alan Gates commented on PIG-592:


+1, patch looks good.  Let's get this in, as it's an annoying bug.

> schema inferred incorrectly
> ---
>
> Key: PIG-592
> URL: https://issues.apache.org/jira/browse/PIG-592
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.4.0
>Reporter: Christopher Olston
> Fix For: 0.5.0
>
> Attachments: PIG-592-1.patch
>
>
> A simple pig script that never introduces any schema information:
> A = load 'foo';
> B = foreach (group A by $8) generate group, COUNT($1);
> C = load 'bar';   // ('bar' has two columns)
> D = join B by $0, C by $0;
> E = foreach D generate $0, $1, $3;
> Fails, complaining that $3 does not exist:
> java.io.IOException: Out of bound access. Trying to access non-existent 
> column: 3. Schema {B::group: bytearray,long,bytearray} has 3 column(s).
> Apparently Pig gets confused, and thinks it knows the schema for C (a single 
> bytearray column).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-966) Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces

2009-09-18 Thread Alan Gates (JIRA)
Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces
---

 Key: PIG-966
 URL: https://issues.apache.org/jira/browse/PIG-966
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Alan Gates
Assignee: Alan Gates


I propose that we rework the LoadFunc, StoreFunc, and Slice/r interfaces 
significantly.  See http://wiki.apache.org/pig/LoadStoreRedesignProposal for 
full details

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-967) Proposal for adding a metadata interface to Pig

2009-09-18 Thread Alan Gates (JIRA)
Proposal for adding a metadata interface to Pig
---

 Key: PIG-967
 URL: https://issues.apache.org/jira/browse/PIG-967
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Alan Gates
Assignee: Alan Gates


Pig needs to have an interface to connect to metadata systems.  
http://wiki.apache.org/pig/MetadataInterfaceProposal proposes an interface for 
this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-651) PERFORMANCE: Use specialized POForEachNoFlatten for cases where the foreach has no flattens

2009-09-18 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-651:
---

Resolution: Won't Fix
Status: Resolved  (was: Patch Available)

> PERFORMANCE: Use specialized POForEachNoFlatten for cases where the foreach 
> has no flattens
> ---
>
> Key: PIG-651
> URL: https://issues.apache.org/jira/browse/PIG-651
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Pradeep Kamath
>Assignee: Pradeep Kamath
> Attachments: PIG-651.patch
>
>
> POForEach has a lot of code to handle flattening (cross product) of the fields 
> in the generate. This is relevant only when at least one field in the generate 
> needs to be flattened. If no fields in the generate need to be 
> flattened, a simpler and hopefully more efficient POForEach can be 
> used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-513) PERFORMANCE: optimize some of the code in DefaultTuple

2009-09-18 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-513:
---

   Resolution: Fixed
Fix Version/s: 0.6.0
   Status: Resolved  (was: Patch Available)

Patch checked in.  Thanks Ashutosh.

> PERFORMANCE: optimize some of the code in DefaultTuple
> --
>
> Key: PIG-513
> URL: https://issues.apache.org/jira/browse/PIG-513
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.2.0
>Reporter: Pradeep Kamath
>Assignee: Pradeep Kamath
> Fix For: 0.6.0
>
> Attachments: PIG-513.patch, pig-513_2.patch
>
>
> The following areas in DefaultTuple.java can be changed:
> The member methods get(), set(), getType() and isNull() all call 
> checkBounds(), which is a redundant call since all four of these functions throw 
> ExecException. Instead of doing a bounds check, we can catch the 
> IndexOutOfBounds exception in a try-catch and throw it as an ExecException.
> The write() method has the following unused object (d in the code below):
> {code}
> for (int i = 0; i < sz; i++) {
> try {
> Object d = get(i);
> } catch (ExecException ee) {
> throw new RuntimeException(ee);
> }
> DataReaderWriter.writeDatum(out, mFields.get(i));
> }
> {code}
> {noformat}
> The get(i) call in the try block should be removed and writeDatum called directly, 
> since d is never used and there is an unnecessary call to get().
> {noformat}
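A sketch of the cleaned-up loop described above, with the unused get(i) removed so each field is serialized directly. DataReaderWriter.writeDatum is Pig's real helper; the wrapper class and parameter names are stand-ins for DefaultTuple's mFields and the DataOutput passed to write().
{code}
import java.io.DataOutput;
import java.io.IOException;
import java.util.List;
import org.apache.pig.data.DataReaderWriter;

// Sketch of the simplified serialization loop: write each field directly
// instead of first calling get(i) and discarding the result.
public class TupleWriteSketch {
    public static void writeFields(DataOutput out, List<Object> fields)
            throws IOException {
        int sz = fields.size();
        for (int i = 0; i < sz; i++) {
            DataReaderWriter.writeDatum(out, fields.get(i));
        }
    }
}
{code}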

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-752) local mode doesn't read bzip2 and gzip compressed data files

2009-09-21 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757944#action_12757944
 ] 

Alan Gates commented on PIG-752:


It means that the patch program was unable to apply your patch to HFile.java.  
I would try regenerating the patch against the latest trunk and see if you get 
better results.

> local mode doesn't read bzip2 and gzip compressed data files
> 
>
> Key: PIG-752
> URL: https://issues.apache.org/jira/browse/PIG-752
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.4.0
>Reporter: David Ciemiewicz
>Assignee: Jeff Zhang
> Fix For: 0.4.0
>
> Attachments: Pig_752.Patch
>
>
> Problem 1)  use of .bz2 file extension does not store results bzip2 
> compressed in Local mode (-exectype local)
> If I use the .bz2 filename extension in a STORE statement on HDFS, the 
> results are stored with bzip2 compression.
> If I use the .bz2 filename extension in a STORE statement on local file 
> system, the results are NOT stored with bzip2 compression.
> compact.bz2.pig:
> {code}
> A = load 'events.test' using PigStorage();
> store A into 'events.test.bz2' using PigStorage();
> C = load 'events.test.bz2' using PigStorage();
> C = limit C 10;
> dump C;
> {code}
> {code}
> -bash-3.00$ pig -exectype local compact.bz2.pig
> -bash-3.00$ file events.test
> events.test: ASCII English text, with very long lines
> -bash-3.00$ file events.test.bz2
> events.test.bz2: ASCII English text, with very long lines
> -bash-3.00$ cat events.test | bzip2 > events.test.bz2
> -bash-3.00$ file events.test.bz2
> events.test.bz2: bzip2 compressed data, block size = 900k
> {code}
> The output format in local mode is definitely not bzip2, but it should be.
> Problem 2) pig in local mode does not decompress bzip2 compressed files, but 
> should, to be consistent with HDFS
> read.bz2.pig:
> {code}
> A = load 'events.test.bz2' using PigStorage();
> A = limit A 10;
> dump A;
> {code}
> The output should be human readable but is instead garbage, indicating no 
> decompression took place during the load:
> {code}
> -bash-3.00$ pig -exectype local read.bz2.pig
> USING: /grid/0/gs/pig/current
> 2009-04-03 18:26:30,455 [main] INFO  
> org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
> 2009-04-03 18:26:30,456 [main] INFO  
> org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
> (BZh91AY&syoz?u?...@{x_?d?|u-??mK???;??4?C??)
> ((R? 6?*m?&???g, 
> ?6?Zj?k,???0?QT?d???hY?#mJ?>[j???z?m?t?u?K)??K5+??)?m?E7j?X?8a??
> ??U?p@@MT?$?B?P??N??=???(z<}gk...@c$\??i]?g:?J)
> a(R?,?u?v???...@?i@??J??!D?)???A?PP?IY??m?
> (mP(i?4,#F[?I)@>?...@??|7^?}U??wwg,?u?$?T???((Q!D?=`*?}hP??_|??=?(??2???m=?xG?(?rC?B?(33??:4?N???t|??T?*??k??NT?x???=?fyv?w>f??4z???4t?)
> (?oou?t???Kwl?3?nCM?WS?;l???P?s?x
> a???e)B??9?  ?44
> ((?...@4?)
> (f)
> (?...@+?d?0@>?U)
> (Q?SR)
> -bash-3.00$ 
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-968) findContainingJar fails when there's a + in the path

2009-09-21 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-968:
---

Status: Open  (was: Patch Available)

You need to add a unit test that checks that this works when there is a + in 
the path.

Also, a more general question:  I'm guessing that '+' isn't the only mishandled 
character.  Are there others that should be checked?

> findContainingJar fails when there's a + in the path
> 
>
> Key: PIG-968
> URL: https://issues.apache.org/jira/browse/PIG-968
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.4.0, 0.5.0
>Reporter: Todd Lipcon
> Attachments: pig-968.txt
>
>
> This is the same bug as in MAPREDUCE-714. Please see discussion there.
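A sketch of the fix style used for the analogous Hadoop bug (MAPREDUCE-714): escape '+' before URL-decoding the classloader URL so it is not turned into a space. This is a stand-alone illustration, not Pig's actual findContainingJar.
{code}
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Illustrative only: URLDecoder.decode treats '+' as an encoded space, so a jar
// path containing a literal '+' gets mangled.  Escaping '+' to %2b first
// preserves it through decoding.
public class JarPathDecoder {
    public static String decodeJarPath(String rawUrlPath)
            throws UnsupportedEncodingException {
        String escaped = rawUrlPath.replaceAll("\\+", "%2b");
        return URLDecoder.decode(escaped, "UTF-8");
    }
}
{code}
For example, decodeJarPath("/tmp/pig+test/udfs.jar") returns "/tmp/pig+test/udfs.jar" instead of "/tmp/pig test/udfs.jar".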

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-966) Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces

2009-09-21 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12758026#action_12758026
 ] 

Alan Gates commented on PIG-966:


Responses to Dmitry's and Ashutosh's comments:

{quote}
Can you explain why everything has a Load prefix? Seems like this limits the 
interfaces unnecessarily, and is a bit inconsistent semantically (LoadMetadata 
does not represent metadata associated with loading - it loads metadata. 
LoadStatistics does not load statistics; it represents statistics, and is 
loaded using LoadMetadata).
{quote}
I don't claim to be a naming guru, so I'm open to other naming suggestions.  I 
chose to prefix all of the interfaces with Load or Store to show that they were 
related to Load and Store.  For example, by calling it LoadMetadata I
did intend to show explicitly that this is metadata associated with loading.  I 
agree that naming schemas and statistics something other than Load is good, 
because they aren't used solely for loading.

{quote}
In regards to the appropriate parameters for setURI - can you explain the 
advantage of this over Strings in more detail? I think the current setLocation 
approach is preferable; it gives users more flexibility. Plus Hadoop Paths are 
constructed from strings, not URIs, so we are forcing a string->uri->string 
conversion on the common case.
{quote}
The real concern I have here is that I want Pig to be able to distinguish when users 
intend to refer to a filename and when they don't. This is important because 
Pig sometimes munges file names.  Consider the following Pig
Latin script:

{code}
cd '/user/gates';
A = load './bla';
...
Z = limit Y 10;
cd '/tmp';
dump Z;
{code}

By the time Pig evaluates Z for dumping, ./bla will have a different meaning 
than it did when the user typed it.  Pig understands that and transforms the 
load statement to load '/user/gates/bla'.  But it needs to know not
to mess with statements like:

{code}
A = load 'http://...';
{code}

By explicitly making the location a URI we encourage users and load function 
writers to think this way.  Your argument that Hadoop paths are by default 
strings is persuasive.  Perhaps it's best to leave this as strings but look
for a scheme at the beginning and interpret it as a URI if it has one (which is 
what Pig does now).

{quote}
prepareToRead: does it need a finishReading() mate?
{quote}
A good idea.  Same for finishWriting() below.

{quote}
I would like to see a "standard" method for getting the jobconf (or whatever it 
is called in 20/21), both for LoadFunc and StoreFunc.
{quote}
I agree, but I didn't take that on here.  We need a standard way to move 
configuration information (Hadoop and Pig) into Load, Store, and Eval Funcs.  
But I viewed that as a separate issue that should be solved for all UDFs.

{quote}
We think that the schema should be uniform for everything a single instance of 
a loader is responsible for loading (and the loader can fill in null or 
defaults where appropriate if some resources are missing fields).
{quote}
Agreed, that is what I was trying to say.  Perhaps it wasn't clear.

{quote}
Should org.apache.pig.impl.logicalLayer.schema.Schema be changed to use this as 
an internal representation?
{quote}
No.  It serves a different purpose, which is to define the content of data 
flows inside the logical plan.  We should not tie these two together.

{quote}
PartitionKeys aren't really part of schema; they are a storage/distribution 
property. This should go into the Metadata and refer to the schema.
{quote}
We need partition keys as part of this interface, as Pig will need to be able 
to pass partition keys to loaders that are capable of doing partition pruning.  
So we could add getPartitionKeys to the LoadMetadata interface.

{quote}
Why the public fields? Not that I am a huge fan of getters and setters but I 
sense findbugs warnings heading our way.
{quote}
LoadSchema and LoadStatistics as proposed are structs.  I don't see any reason 
to pretend otherwise.  And I'm not inclined to bend my programming style to 
match that of whoever wrote findbugs.

{quote}
I had envisioned statistics as more of a key-value thing, with some keys 
predefined in a separate class. So we would have:

ResourceStats.NUM_RECORDS
ResourceStats.SIZE_IN_BYTES
//etc

and to get the stats we would call

MyResourceStats.getLong(ResourceStats.NUM_RECORDS)
MyResourceStats.getFloat(ResourceStats.SOMETHING_THAT_IS_A_FLOAT)
//etc

This allows us to be far more flexible in regards to the things marked as 
"//probably more here."
{quote}

The problem with key/value setups like this is that it can be hard for people to 
understand what is already there.  So they end up not using what already 
exists, or worse, re-inventing the wheel.  My hope is that by
versioning this we c

[jira] Commented: (PIG-966) Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces

2009-09-21 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12758089#action_12758089
 ] 

Alan Gates commented on PIG-966:




{quote}
I must not be clear on what pushing down to a loader does. My interpretation 
was that it allows pushing down operations to the point where you don't read 
unnecessary data off disk. A classic example of filter projection would be 
filtering by a partition key (so, dt >sysdate-30 , and our data is stored in 
files one per day). An example of projection pushdown is when we have a column 
store that simply avoids loading some of the columns.

I don't see how a loader can push down a join. That seems to require reading 
and changing data. Is the idea that such a join can be performed without an MR 
step? That seems like a Pig thing, not a loader thing.

In any case, yes, I think something like this would require a new interface in 
the same namespace, since it's a drastically different capability.

Any thoughts on advisability of simplifying projection pushdown to just work on 
an int array? I know it's limiting, but it's going to be a heck of a lot easier 
for users to implement.
{quote}

Limiting the data you need to read off disk is partition pruning, or in the 
case of columnar stores, column pruning.  But this isn't the only case in which 
you might want to push down operators.  Consider
data that has (name, age, address) and is partitioned on name.  A user might 
want to query only over adults (age > 17).  This isn't a partition field.  But 
if it's a columnar store and age is compressed in, 
say, run-length or offset encoding, the load function may be able to apply the 
filter on the compressed data.  This can be a huge win, as we avoid 
decompressing whole rows that we don't need.  To see another
case where we might want to push operators to the loader, consider the case 
where a user is loading a set of Zebra files, all of which are sorted on one 
key.  Pig may want to keep those zebra files
sorted.  It will need a way to tell the loader to merge those files as it loads 
them rather than concatenating them and forcing Pig to re-sort the input.

I understand your concern about making it difficult to pass down just projection, 
and you are not the only one to express this concern.  Though even for 
full projections, we need more than a simple int array, so that we can
handle things like map, bag, etc. projections.  But maybe we need a simpler 
option for users who just want to push projection and then a full-blown 
option for power users who want to push selection, etc.
Beginner and advanced interfaces, I guess.
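A sketch of the "beginner" option floated above: the loader is handed just the top-level column positions Pig needs. The richer option for maps, bags, and selection would need more structure; the interface below is hypothetical, not the eventual Pig API.
{code}
// Hypothetical 'simple' projection-pushdown contract: Pig tells the loader which
// 0-based top-level columns it actually needs.  A loader that cannot prune
// returns false, and Pig then expects full tuples.
public interface SimpleProjectionPushdown {
    boolean pushProjection(int[] requiredColumnPositions);
}
{code}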



> Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces
> ---
>
> Key: PIG-966
> URL: https://issues.apache.org/jira/browse/PIG-966
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Alan Gates
>Assignee: Alan Gates
>
> I propose that we rework the LoadFunc, StoreFunc, and Slice/r interfaces 
> significantly.  See http://wiki.apache.org/pig/LoadStoreRedesignProposal for 
> full details

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-966) Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces

2009-09-22 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12758518#action_12758518
 ] 

Alan Gates commented on PIG-966:


In thinking about it more, it becomes obvious that we have to separate out 
determining the partition keys for an input from getting the schema, as Dmitriy 
and Ashutosh suggested above.  The reason is that Pig cannot ask the loader for 
a schema until it has completely defined what will be loaded (because the 
schema will depend on what is being loaded).  And to completely define what is 
being loaded it needs to determine the partition keys and possibly specify a 
filter condition for them.  So we need to add a getPartitionKeys and 
setPartitionFilter to the LoadMetadata interface.  
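Putting that conclusion into a shape, a hypothetical version of the metadata-facing interface with partition handling split out from schema retrieval. The names echo the wiki proposal but are placeholders, not the final committed API.
{code}
import java.io.IOException;

// Hypothetical sketch of the direction described above: partition keys and a
// partition filter are separate calls from schema retrieval, because the schema
// can only be computed once Pig has fully pinned down what will be loaded.
public interface LoadMetadataSketch {
    interface ResourceSchema {}        // placeholder for the proposed schema struct
    interface ResourceStatistics {}    // placeholder for the proposed stats struct
    interface PartitionFilter {}       // placeholder for a filter over partition keys

    // What the loader knows about the shape of the data at this location.
    ResourceSchema getSchema(String location) throws IOException;

    // Statistics about the data, if the underlying storage keeps any.
    ResourceStatistics getStatistics(String location) throws IOException;

    // Keys the data is partitioned on, so Pig can decide what to push down.
    String[] getPartitionKeys(String location) throws IOException;

    // A filter over partition keys that the loader should apply while loading.
    void setPartitionFilter(PartitionFilter filter) throws IOException;
}
{code}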

> Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces
> ---
>
> Key: PIG-966
> URL: https://issues.apache.org/jira/browse/PIG-966
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>    Reporter: Alan Gates
>Assignee: Alan Gates
>
> I propose that we rework the LoadFunc, StoreFunc, and Slice/r interfaces 
> significantly.  See http://wiki.apache.org/pig/LoadStoreRedesignProposal for 
> full details

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-968) findContainingJar fails when there's a + in the path

2009-09-24 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759188#action_12759188
 ] 

Alan Gates commented on PIG-968:


Ok, if it's hard to test in an automated way that's fine.  To test it manually, 
is it sufficient to create a jar with a + in the path, register it in a Pig 
Latin script, and then use a UDF from that jar in the script?

> findContainingJar fails when there's a + in the path
> 
>
> Key: PIG-968
> URL: https://issues.apache.org/jira/browse/PIG-968
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.4.0, 0.5.0
>Reporter: Todd Lipcon
> Attachments: pig-968.txt
>
>
> This is the same bug as in MAPREDUCE-714. Please see discussion there.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


