[jira] Commented: (PIG-1661) Add alternative search-provider to Pig site

2010-10-02 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917246#action_12917246
 ] 

Santhosh Srinivasan commented on PIG-1661:
--

Sure, worth a try.

> Add alternative search-provider to Pig site
> ---
>
> Key: PIG-1661
> URL: https://issues.apache.org/jira/browse/PIG-1661
> Project: Pig
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Alex Baranau
>Priority: Minor
> Attachments: PIG-1661.patch
>
>
> Use the search-hadoop.com service to make search available over Pig sources, 
> mailing lists, wiki, etc.
> This was initially proposed on the user mailing list. The search service was 
> already added to the site's skin (common to all Hadoop-related projects) via 
> AVRO-626, so this issue is about enabling it for Pig.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1344) PigStorage should be able to read back complex data containing delimiters created by PigStorage

2010-03-30 Thread Santhosh Srinivasan (JIRA)
PigStorage should be able to read back complex data containing delimiters 
created by PigStorage
---

 Key: PIG-1344
 URL: https://issues.apache.org/jira/browse/PIG-1344
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Santhosh Srinivasan
Assignee: Daniel Dai
 Fix For: 0.8.0


With Pig 0.7, the TextDataParser has been removed and the logic to parse 
complex data types has moved to Utf8StorageConverter. However, this does not 
handle the case where the complex data types contain delimiters ('{', '}', 
',', '(', ')', '[', ']', '#'). Fixing this issue will make PigStorage 
self-contained and more usable.
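
To make the round-trip problem concrete, here is a minimal hypothetical 
sketch (file names and data are invented; the behavior is the one described 
above):

{code}
-- Hypothetical data: one row is   1,[k#(a,b)]   (a map whose value is a tuple)
A = LOAD 'in.txt' USING PigStorage(',') AS (id:int, m:map[]);
STORE A INTO 'out' USING PigStorage(',');
B = LOAD 'out' USING PigStorage(',') AS (id:int, m:map[]);
-- On re-read, the ',' and ')' inside the stored tuple value cannot be
-- distinguished from the complex-type delimiters, so B need not equal A.
{code}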

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1331) Owl Hadoop Table Management Service

2010-03-26 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850355#action_12850355
 ] 

Santhosh Srinivasan commented on PIG-1331:
--

Thanks for the information. Looking at the Hive design at 
http://wiki.apache.org/hadoop/Hive/Design, it looks like there is no 
significant difference between Owl and Hive. As you indicate, I hope we 
converge on a common metastore for Hadoop.



> Owl Hadoop Table Management Service
> ---
>
> Key: PIG-1331
> URL: https://issues.apache.org/jira/browse/PIG-1331
> Project: Pig
>  Issue Type: New Feature
>Reporter: Jay Tang
>
> This JIRA is a proposal to create a Hadoop table management service: Owl. 
> Today, MapReduce and Pig applications interact directly with HDFS 
> directories and files and must deal with low-level data management issues 
> such as storage formats, serialization/compression schemes, data layout, and 
> efficient data access, often with different solutions. Owl aims to 
> provide a standard way to address this issue and abstracts away the 
> complexities of reading/writing huge amounts of data from/to HDFS.
> Owl has a data access API that is modeled after the traditional Hadoop 
> InputFormat and a management API to manipulate Owl objects.  This JIRA is 
> related to PIG-823 (Hadoop Metadata Service) as Owl has an internal metadata 
> store.  Owl integrates with different storage modules like Zebra via a 
> pluggable architecture.
> Initially, the proposal is to submit Owl as a Pig contrib project.  Over 
> time, it makes sense to move it to a Hadoop subproject.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1331) Owl Hadoop Table Management Service

2010-03-26 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850342#action_12850342
 ] 

Santhosh Srinivasan commented on PIG-1331:
--

Jay, 

In PIG-823 there was a discussion around how Owl is different from Hive's 
metastore. Is that still true today? If not, can you elaborate on the key 
differences between the two systems?

Thanks,
Santhosh

> Owl Hadoop Table Management Service
> ---
>
> Key: PIG-1331
> URL: https://issues.apache.org/jira/browse/PIG-1331
> Project: Pig
>  Issue Type: New Feature
>Reporter: Jay Tang
>
> This JIRA is a proposal to create a Hadoop table management service: Owl. 
> Today, MapReduce and Pig applications interact directly with HDFS 
> directories and files and must deal with low-level data management issues 
> such as storage formats, serialization/compression schemes, data layout, and 
> efficient data access, often with different solutions. Owl aims to 
> provide a standard way to address this issue and abstracts away the 
> complexities of reading/writing huge amounts of data from/to HDFS.
> Owl has a data access API that is modeled after the traditional Hadoop 
> InputFormat and a management API to manipulate Owl objects.  This JIRA is 
> related to PIG-823 (Hadoop Metadata Service) as Owl has an internal metadata 
> store.  Owl integrates with different storage modules like Zebra via a 
> pluggable architecture.
> Initially, the proposal is to submit Owl as a Pig contrib project.  Over 
> time, it makes sense to move it to a Hadoop subproject.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1117) Pig reading hive columnar rc tables

2010-01-11 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798917#action_12798917
 ] 

Santhosh Srinivasan commented on PIG-1117:
--

+1 on making it part of the main piggybank. We should not be creating a 
separate directory just to handle Hive.

> Pig reading hive columnar rc tables
> ---
>
> Key: PIG-1117
> URL: https://issues.apache.org/jira/browse/PIG-1117
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.7.0
>Reporter: Gerrit Jansen van Vuuren
>Assignee: Gerrit Jansen van Vuuren
> Fix For: 0.7.0
>
> Attachments: HiveColumnarLoader.patch, HiveColumnarLoaderTest.patch, 
> PIG-1117.patch, PIG-117-v.0.6.0.patch, PIG-117-v.0.7.0.patch
>
>
> I've coded a LoadFunc implementation that can read from Hive Columnar RC 
> tables. This is needed for a project that I'm working on because all our data 
> is stored using the Hive thrift-serialized Columnar RC format. I have looked 
> at the piggybank but did not find any implementation that could do this. 
> We've been running it on our cluster for the last week and have worked out 
> most bugs.
> There are still some improvements I would like to make, such as setting 
> the number of mappers based on date partitioning. It has been optimized to 
> read only specific columns, and with this improvement it can churn through a 
> data set almost 8 times faster because not all column data is read.
> I would like to contribute the class to the piggybank; can you guide me on 
> what I need to do?
> I've used Hive-specific classes to implement this; is it possible to add them 
> to the piggybank Ivy build for automatic download of the dependencies?
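
For context, usage of such a loader would presumably look like the following; 
the class name comes from the attached patches, while the table path and the 
constructor argument describing the columns are assumptions:

{code}
register piggybank.jar;
A = LOAD '/warehouse/mytable' USING org.apache.pig.piggybank.storage.HiveColumnarLoader('f1 string,f2 int');
{code}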

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1065) In-determinate behaviour of Union when there are 2 non-matching schema's

2009-11-10 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776098#action_12776098
 ] 

Santhosh Srinivasan commented on PIG-1065:
--

bq. Aliasing inside foreach is hugely useful for readability. Are you 
suggesting removing the ability to assign aliases inside a foreach, or just to 
change/assign schemas?

For consistency, all relational operators should support the AS clause. 
Gradually, per-column aliasing in foreach should be removed from the 
documentation, deprecated, and eventually removed. This is a long-term 
recommendation.
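
A sketch of the contrast, with the extended AS clause shown as hypothetical 
syntax (it is not in Pig today):

{code}
-- Supported today: AS on load, plus per-column aliasing inside foreach
A = LOAD 'in.txt' AS (key:chararray, cnt:long);
B = FOREACH A GENERATE $0 AS k, $1 AS c;
-- Long-term direction described above (hypothetical syntax): a single
-- AS clause on the relational operator instead of per-column aliases
B = FOREACH A GENERATE $0, $1 AS (k:chararray, c:long);
{code}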

> In-determinate behaviour of Union when there are 2 non-matching schema's
> 
>
> Key: PIG-1065
> URL: https://issues.apache.org/jira/browse/PIG-1065
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.6.0
>Reporter: Viraj Bhat
> Fix For: 0.6.0
>
>
> I have a script which first does a union of these schemas and then does a 
> ORDER BY of this result.
> {code}
> f1 = LOAD '1.txt' as (key:chararray, v:chararray);
> f2 = LOAD '2.txt' as (key:chararray);
> u0 = UNION f1, f2;
> describe u0;
> dump u0;
> u1 = ORDER u0 BY $0;
> dump u1;
> {code}
> When I run in Map Reduce mode I get the following result:
> $java -cp pig.jar:$HADOOP_HOME/conf org.apache.pig.Main broken.pig
> 
> Schema for u0 unknown.
> 
> (1,2)
> (2,3)
> (1)
> (2)
> 
> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to 
> open iterator for alias u1
> at org.apache.pig.PigServer.openIterator(PigServer.java:475)
> at 
> org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:532)
> at 
> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
> at org.apache.pig.Main.main(Main.java:397)
> 
> Caused by: java.io.IOException: Type mismatch in key from map: expected 
> org.apache.pig.impl.io.NullableBytesWritable, recieved 
> org.apache.pig.impl.io.NullableText
> at 
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:415)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:108)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:251)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
> 
> When I run the same script in local mode I get a different result, as we know 
> that local mode does not use any Hadoop classes.
> $java -cp pig.jar org.apache.pig.Main -x local broken.pig
> 
> Schema for u0 unknown
> 
> (1,2)
> (1)
> (2,3)
> (2)
> 
> (1,2)
> (1)
> (2,3)
> (2)
> 
> Here are some questions:
> 1) Why do we allow a union if the schemas do not match?
> 2) Should we not print an error message/warning so that the user knows that 
> this is not allowed or that he can get unexpected results?
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1065) In-determinate behaviour of Union when there are 2 non-matching schema's

2009-11-10 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775968#action_12775968
 ] 

Santhosh Srinivasan commented on PIG-1065:
--

The schema will then correspond to the prefix, as implemented today. For 
example, if the AS clause is defined for FLATTEN($1), $1 flattens to 10 
columns, and the AS clause names only 3 columns, then the prefix is used and 
the remaining columns are left undefined.
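
For example (hypothetical data, illustrating the prefix behavior described 
above):

{code}
-- $1 is a tuple that flattens to four columns, but AS names only two
B = FOREACH A GENERATE FLATTEN($1) AS (x, y);
-- x and y name the first two flattened columns; the remaining columns
-- keep no alias, i.e. they stay undefined in the schema.
{code}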

> In-determinate behaviour of Union when there are 2 non-matching schema's
> 
>
> Key: PIG-1065
> URL: https://issues.apache.org/jira/browse/PIG-1065
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.6.0
>Reporter: Viraj Bhat
> Fix For: 0.6.0
>
>
> I have a script which first does a union of these schemas and then does a 
> ORDER BY of this result.
> {code}
> f1 = LOAD '1.txt' as (key:chararray, v:chararray);
> f2 = LOAD '2.txt' as (key:chararray);
> u0 = UNION f1, f2;
> describe u0;
> dump u0;
> u1 = ORDER u0 BY $0;
> dump u1;
> {code}
> When I run in Map Reduce mode I get the following result:
> $java -cp pig.jar:$HADOOP_HOME/conf org.apache.pig.Main broken.pig
> 
> Schema for u0 unknown.
> 
> (1,2)
> (2,3)
> (1)
> (2)
> 
> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to 
> open iterator for alias u1
> at org.apache.pig.PigServer.openIterator(PigServer.java:475)
> at 
> org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:532)
> at 
> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
> at org.apache.pig.Main.main(Main.java:397)
> 
> Caused by: java.io.IOException: Type mismatch in key from map: expected 
> org.apache.pig.impl.io.NullableBytesWritable, recieved 
> org.apache.pig.impl.io.NullableText
> at 
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:415)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:108)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:251)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
> 
> When I run the same script in local mode I get a different result, as we know 
> that local mode does not use any Hadoop classes.
> $java -cp pig.jar org.apache.pig.Main -x local broken.pig
> 
> Schema for u0 unknown
> 
> (1,2)
> (1)
> (2,3)
> (2)
> 
> (1,2)
> (1)
> (2,3)
> (2)
> 
> Here are some questions:
> 1) Why do we allow a union if the schemas do not match?
> 2) Should we not print an error message/warning so that the user knows that 
> this is not allowed or that he can get unexpected results?
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1065) In-determinate behaviour of Union when there are 2 non-matching schema's

2009-11-05 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774153#action_12774153
 ] 

Santhosh Srinivasan commented on PIG-1065:
--

Answer to Question 1: Pig 1.0 had that syntax, and it was retained for 
backward compatibility. Paolo suggested that, for uniformity, the AS clause of 
the load statement should be extended to all relational operators. Gradually, 
column aliasing in foreach should be removed from the documentation and 
eventually removed from the language.

> In-determinate behaviour of Union when there are 2 non-matching schema's
> 
>
> Key: PIG-1065
> URL: https://issues.apache.org/jira/browse/PIG-1065
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.6.0
>Reporter: Viraj Bhat
> Fix For: 0.6.0
>
>
> I have a script which first does a union of these schemas and then does a 
> ORDER BY of this result.
> {code}
> f1 = LOAD '1.txt' as (key:chararray, v:chararray);
> f2 = LOAD '2.txt' as (key:chararray);
> u0 = UNION f1, f2;
> describe u0;
> dump u0;
> u1 = ORDER u0 BY $0;
> dump u1;
> {code}
> When I run in Map Reduce mode I get the following result:
> $java -cp pig.jar:$HADOOP_HOME/conf org.apache.pig.Main broken.pig
> 
> Schema for u0 unknown.
> 
> (1,2)
> (2,3)
> (1)
> (2)
> 
> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to 
> open iterator for alias u1
> at org.apache.pig.PigServer.openIterator(PigServer.java:475)
> at 
> org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:532)
> at 
> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
> at org.apache.pig.Main.main(Main.java:397)
> 
> Caused by: java.io.IOException: Type mismatch in key from map: expected 
> org.apache.pig.impl.io.NullableBytesWritable, recieved 
> org.apache.pig.impl.io.NullableText
> at 
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:415)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:108)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:251)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
> 
> When I run the same script in local mode I get a different result, as we know 
> that local mode does not use any Hadoop classes.
> $java -cp pig.jar org.apache.pig.Main -x local broken.pig
> 
> Schema for u0 unknown
> 
> (1,2)
> (1)
> (2,3)
> (2)
> 
> (1,2)
> (1)
> (2,3)
> (2)
> 
> Here are some questions:
> 1) Why do we allow a union if the schemas do not match?
> 2) Should we not print an error message/warning so that the user knows that 
> this is not allowed or that he can get unexpected results?
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1073) LogicalPlanCloner can't clone plan containing LOJoin

2009-11-05 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774147#action_12774147
 ] 

Santhosh Srinivasan commented on PIG-1073:
--

If my memory serves me correctly, logical plan cloning was implemented (by 
me) for cloning the inner plans of foreach. As such, top-level plan cloning 
was never tested and some items are marked as TODO (see the visit methods for 
LOLoad, LOStore and LOStream).

If you want to use it as you describe in your test case, you will need to add 
code for cloning the LOLoad, LOStore, LOStream and LOJoin operators.


> LogicalPlanCloner can't clone plan containing LOJoin
> 
>
> Key: PIG-1073
> URL: https://issues.apache.org/jira/browse/PIG-1073
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Ashutosh Chauhan
>
> Add the following test case to TestLogicalPlanBuilder.java:
> {code}
> public void testLogicalPlanCloner() throws CloneNotSupportedException {
>     LogicalPlan lp = buildPlan("C = join (load 'A') by $0, (load 'B') by $0;");
>     LogicalPlanCloner cloner = new LogicalPlanCloner(lp);
>     cloner.getClonedPlan();
> }
> {code}
> and this fails with the following stacktrace:
> java.lang.NullPointerException
> at 
> org.apache.pig.impl.logicalLayer.LOVisitor.visit(LOVisitor.java:171)
> at 
> org.apache.pig.impl.logicalLayer.PlanSetter.visit(PlanSetter.java:63)
> at org.apache.pig.impl.logicalLayer.LOJoin.visit(LOJoin.java:213)
> at org.apache.pig.impl.logicalLayer.LOJoin.visit(LOJoin.java:45)
> at 
> org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:67)
> at 
> org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:69)
> at 
> org.apache.pig.impl.plan.DepthFirstWalker.walk(DepthFirstWalker.java:50)
> at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
> at 
> org.apache.pig.impl.logicalLayer.LogicalPlanCloneHelper.getClonedPlan(LogicalPlanCloneHelper.java:73)
> at 
> org.apache.pig.impl.logicalLayer.LogicalPlanCloner.getClonedPlan(LogicalPlanCloner.java:46)
> at 
> org.apache.pig.test.TestLogicalPlanBuilder.testLogicalPlanCloneHelper(TestLogicalPlanBuilder.java:2110)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1016) Reading in map data seems broken

2009-10-29 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771442#action_12771442
 ] 

Santhosh Srinivasan commented on PIG-1016:
--

Hc Busy, thanks for taking the time to contribute the patch, explain the 
details, and especially for being patient. A few more questions and details 
have to be cleared up before we commit this patch.

IMHO, the right comparison should be along the lines of checking whether o1 
and o2 are NullableBytesWritable, followed by a check for PigNullableWritable, 
and then followed by error handling code.

Alan, can you comment on this approach?

There is a more important semantic issue. If the map values are strings and 
the strings are numeric, then the value types for the maps will be inferred as 
different types. In that case, the load function will break. In addition, 
conversion routines might fail when the compareTo method is invoked. An 
example to illustrate this issue:

Suppose the record is ['key'#1234567890124567]. PIG-880 would treat the value 
as a string and there would be no problem. Now, with the changes reverted, the 
type is inferred as integer and parsing will fail because the value is too big 
to fit into an integer.

Secondly, assuming that the integer was small enough to be converted, the 
comparison method in DataType.java will return the wrong result when an 
integer and a string are compared. For example, if the records are:

[key#*$]
[key#123]

the first value is treated as a string and the second value is treated as an 
integer. The compareTo method will return 1 to indicate that string > integer, 
while in reality 123 > *$.

Please correct me if the last statement is incorrect or let me know if it needs 
more explanation.
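
In script form, the second hazard would look roughly like this (hypothetical 
input file and rows):

{code}
-- rows in 'data' (one map per row):
--   [key#*$]    value only readable as a chararray
--   [key#123]   value inferred as an int
A = LOAD 'data' AS (m:map[]);
B = FOREACH A GENERATE m#'key' AS v;
C = ORDER B BY v;  -- compares a chararray with an int; DataType.compareTo
                   -- then reports string > integer regardless of the values
{code}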

Thoughts/comments from other committers?

> Reading in map data seems broken
> 
>
> Key: PIG-1016
> URL: https://issues.apache.org/jira/browse/PIG-1016
> Project: Pig
>  Issue Type: Improvement
>  Components: data
>Affects Versions: 0.4.0
>Reporter: hc busy
> Fix For: 0.5.0
>
> Attachments: PIG-1016.patch
>
>
> Hi, I'm trying to load a map that has a tuple for a value. The read fails in 
> 0.4.0 because of a misconfiguration in the parser, whereas almost all 
> documentation states that the value of the map can be any type.
> I've attached a patch that allows us to read in complex objects as values, as 
> documented. I've done simple verification of loading maps with tuple/map 
> values and writing them back out using LOAD and STORE. All seems to work fine.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1016) Reading in map data seems broken

2009-10-28 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771287#action_12771287
 ] 

Santhosh Srinivasan commented on PIG-1016:
--

I am summarizing my understanding of the patch that has been submitted by hc 
busy.

Root cause: PIG-880 changed the value type of maps in PigStorage from native 
Java types to DataByteArray. As a result of this change, parsing of complex 
types as map values was disabled.

Proposed fix: Revert the changes made as part of PIG-880 so that map values 
are again interpreted as Java types. In addition, change the comparison method 
to check the object type and call the appropriate compareTo method. The latter 
is required to work around the fact that the front-end assigns the value type 
DataByteArray whereas the backend sees the actual type (Integer, Long, Tuple, 
DataBag, etc.).

Based on this understanding I have the following review comment(s).

Index: 
src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigBytesRawComparator.java
===

Can you explain the checks in the if and the else? Specifically, 
NullableBytesWritable is a subclass of PigNullableWritable. As a result, the 
check in the if part that both o1 and o2 are not PigNullableWritable is 
confusing, since nbw1 and nbw2 are then cast to NullableBytesWritable even 
though o1 and o2 are not PigNullableWritable.

{code}
+// findbugs is complaining about nulls. This check sequence will
+// prevent nulls from being dereferenced.
+if (o1 != null && o2 != null) {
+
+    // In case the objects are comparable
+    if ((o1 instanceof NullableBytesWritable && o2 instanceof NullableBytesWritable) ||
+        !(o1 instanceof PigNullableWritable && o2 instanceof PigNullableWritable)) {
+
+        NullableBytesWritable nbw1 = (NullableBytesWritable) o1;
+        NullableBytesWritable nbw2 = (NullableBytesWritable) o2;
+
+        // If either are null, handle differently.
+        if (!nbw1.isNull() && !nbw2.isNull()) {
+            rc = ((DataByteArray) nbw1.getValueAsPigType()).compareTo((DataByteArray) nbw2.getValueAsPigType());
+        } else {
+            // For sorting purposes two nulls are equal.
+            if (nbw1.isNull() && nbw2.isNull()) rc = 0;
+            else if (nbw1.isNull()) rc = -1;
+            else rc = 1;
+        }
+    } else {
+        // enter here only if both o1 and o2 are non-NullableBytesWritable
+        // PigNullableWritable's
+        PigNullableWritable nbw1 = (PigNullableWritable) o1;
+        PigNullableWritable nbw2 = (PigNullableWritable) o2;
+        // If either are null, handle differently.
+        if (!nbw1.isNull() && !nbw2.isNull()) {
+            rc = nbw1.compareTo(nbw2);
+        } else {
+            // For sorting purposes two nulls are equal.
+            if (nbw1.isNull() && nbw2.isNull()) rc = 0;
+            else if (nbw1.isNull()) rc = -1;
+            else rc = 1;
+        }
+    }
+} else {
+    if (o1 == null && o2 == null) { rc = 0; }
+    else if (o1 == null) { rc = -1; }
+    else { rc = 1; }
{code}

> Reading in map data seems broken
> 
>
> Key: PIG-1016
> URL: https://issues.apache.org/jira/browse/PIG-1016
> Project: Pig
>  Issue Type: Improvement
>  Components: data
>Affects Versions: 0.4.0
>Reporter: hc busy
> Fix For: 0.5.0
>
> Attachments: PIG-1016.patch
>
>
> Hi, I'm trying to load a map that has a tuple for a value. The read fails in 
> 0.4.0 because of a misconfiguration in the parser, whereas almost all 
> documentation states that the value of the map can be any type.
> I've attached a patch that allows us to read in complex objects as values, as 
> documented. I've done simple verification of loading maps with tuple/map 
> values and writing them back out using LOAD and STORE. All seems to work fine.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1056) table can not be loaded after store

2009-10-27 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12770743#action_12770743
 ] 

Santhosh Srinivasan commented on PIG-1056:
--

Do you have the right load statement? I don't see the using clause that 
specifies the zebra loader.
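
For reference, a load that names the stored table location (rather than the 
alias B) would presumably look like this; the path 'filter1' is a hedged guess 
taken from the store statement in the script below:

{code}
rec1 = load 'filter1' using org.apache.hadoop.zebra.pig.TableLoader();
{code}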

> table can not be loaded after store
> ---
>
> Key: PIG-1056
> URL: https://issues.apache.org/jira/browse/PIG-1056
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.6.0
>Reporter: Jing Huang
>
> Pig Stack Trace
> ---
> ERROR 1018: Problem determining schema during load
> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during 
> parsing. Problem determining schema during load
> at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1023)
> at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:967)
> at org.apache.pig.PigServer.registerQuery(PigServer.java:383)
> at 
> org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:716)
> at 
> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
> at org.apache.pig.Main.main(Main.java:397)
> Caused by: org.apache.pig.impl.logicalLayer.parser.ParseException: Problem 
> determining schema during load
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:734)
> at 
> org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
> at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1017)
> ... 8 more
> Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1018: 
> Problem determining schema during load
> at org.apache.pig.impl.logicalLayer.LOLoad.getSchema(LOLoad.java:155)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:732)
> ... 10 more
> Caused by: java.io.IOException: No table specified for input
> at 
> org.apache.hadoop.zebra.pig.TableLoader.checkConf(TableLoader.java:238)
> at 
> org.apache.hadoop.zebra.pig.TableLoader.determineSchema(TableLoader.java:258)
> at org.apache.pig.impl.logicalLayer.LOLoad.getSchema(LOLoad.java:148)
> ... 11 more
> 
> ~ 
> 
> script:
> register /grid/0/dev/hadoopqa/hadoop/lib/zebra.jar;
> A = load 'filter.txt' as (name:chararray, age:int);
> B = filter A by age < 20;
> --dump B;
> store B into 'filter1' using 
> org.apache.hadoop.zebra.pig.TableStorer('[name];[age]');
> rec1 = load 'B' using org.apache.hadoop.zebra.pig.TableLoader();
> dump rec1;

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1012) FINDBUGS: SE_BAD_FIELD: Non-transient non-serializable instance field in serializable class

2009-10-21 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768368#action_12768368
 ] 

Santhosh Srinivasan commented on PIG-1012:
--

I just looked at the first patch. It was setting generate to true in 
TestMRCompiler.java. It should be set to false in order to run the test case 
correctly.

+++ test/org/apache/pig/test/TestMRCompiler.java

-private boolean generate = false;
+private boolean generate = true;

> FINDBUGS: SE_BAD_FIELD: Non-transient non-serializable instance field in 
> serializable class
> ---
>
> Key: PIG-1012
> URL: https://issues.apache.org/jira/browse/PIG-1012
> Project: Pig
>  Issue Type: Bug
>Reporter: Olga Natkovich
> Attachments: PIG-1012-2.patch, PIG-1012.patch
>
>
> SeClass org.apache.pig.backend.executionengine.PigSlice defines 
> non-transient non-serializable instance field is
> SeClass org.apache.pig.backend.executionengine.PigSlice defines 
> non-transient non-serializable instance field loader
> Sejava.util.zip.GZIPInputStream stored into non-transient field 
> PigSlice.is
> Seorg.apache.pig.backend.datastorage.SeekableInputStream stored into 
> non-transient field PigSlice.is
> Seorg.apache.tools.bzip2r.CBZip2InputStream stored into non-transient 
> field PigSlice.is
> Seorg.apache.pig.builtin.PigStorage stored into non-transient field 
> PigSlice.loader
> Seorg.apache.pig.backend.hadoop.DoubleWritable$Comparator implements 
> Comparator but not Serializable
> Se
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigBagWritableComparator
>  implements Comparator but not Serializable
> Se
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigCharArrayWritableComparator
>  implements Comparator but not Serializable
> Se
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigDBAWritableComparator
>  implements Comparator but not Serializable
> Se
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigDoubleWritableComparator
>  implements Comparator but not Serializable
> Se
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigFloatWritableComparator
>  implements Comparator but not Serializable
> Se
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigIntWritableComparator
>  implements Comparator but not Serializable
> Se
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigLongWritableComparator
>  implements Comparator but not Serializable
> Se
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigTupleWritableComparator
>  implements Comparator but not Serializable
> Se
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigWritableComparator
>  implements Comparator but not Serializable
> SeClass 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper 
> defines non-transient non-serializable instance field nig
> SeClass 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.EqualToExpr
>  defines non-transient non-serializable instance field log
> SeClass 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.GreaterThanExpr
>  defines non-transient non-serializable instance field log
> SeClass 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.GTOrEqualToExpr
>  defines non-transient non-serializable instance field log
> SeClass 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.LessThanExpr
>  defines non-transient non-serializable instance field log
> SeClass 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.LTOrEqualToExpr
>  defines non-transient non-serializable instance field log
> SeClass 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.NotEqualToExpr
>  defines non-transient non-serializable instance field log
> SeClass 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast
>  defines non-transient non-serializable instance field log
> SeClass 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject
>  defines non-transient non-serializable instance field bagIterator
> SeClass 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserComparisonFunc
>  defines non-transient non-serializable instance field log
> SeClass 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressi

[jira] Commented: (PIG-1016) Reading in map data seems broken

2009-10-15 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766382#action_12766382
 ] 

Santhosh Srinivasan commented on PIG-1016:
--

hc busy,

From your example snippet, I was not able to understand whether Pig is 
preventing you from doing that with the current code base. If not, what is the 
error that you are seeing?

Santhosh

> Reading in map data seems broken
> 
>
> Key: PIG-1016
> URL: https://issues.apache.org/jira/browse/PIG-1016
> Project: Pig
>  Issue Type: Improvement
>  Components: data
>Affects Versions: 0.4.0
>Reporter: hc busy
> Attachments: PIG-1016.patch
>
>
> Hi, I'm trying to load a map that has a tuple for a value. The read fails in 
> 0.4.0 because of a misconfiguration in the parser, whereas almost all 
> documentation states that the value of the map can be any type.
> I've attached a patch that allows us to read in complex objects as values, as 
> documented. I've done simple verification of loading maps with tuple/map 
> values and writing them back out using LOAD and STORE. All seems to work fine.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1014) Pig should convert COUNT(relation) to COUNT_STAR(relation) so that all records are counted without considering nullness of the fields in the records

2009-10-14 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765779#action_12765779
 ] 

Santhosh Srinivasan commented on PIG-1014:
--

Another option is to change the implementation of COUNT to reflect the 
proposed semantics. If the underlying UDF is changed, the user should be 
notified via an informational message; otherwise, a user who checks the 
explain output will notice COUNT_STAR and be confused.

> Pig should convert COUNT(relation) to COUNT_STAR(relation) so that all 
> records are counted without considering nullness of the fields in the records
> 
>
> Key: PIG-1014
> URL: https://issues.apache.org/jira/browse/PIG-1014
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.4.0
>Reporter: Pradeep Kamath
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1016) Reading in map data seems broken

2009-10-14 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765695#action_12765695
 ] 

Santhosh Srinivasan commented on PIG-1016:
--

The fix proposed in this JIRA reverts the changes made as part of PIG-880. 
Can you explain in more detail the issue that you are currently facing? 
Specifically, can you provide a test case that reproduces this bug?

> Reading in map data seems broken
> 
>
> Key: PIG-1016
> URL: https://issues.apache.org/jira/browse/PIG-1016
> Project: Pig
>  Issue Type: Improvement
>  Components: data
>Affects Versions: 0.4.0
>Reporter: hc busy
> Attachments: PIG-1016.patch
>
>
> Hi, I'm trying to load a map that has a tuple for a value. The read fails in 
> 0.4.0 because of a misconfiguration in the parser, whereas almost all 
> documentation states that the value of the map can be any type.
> I've attached a patch that allows us to read in complex objects as values, as 
> documented. I've done simple verification of loading maps with tuple/map 
> values and writing them back out using LOAD and STORE. All seems to work fine.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1014) Pig should convert COUNT(relation) to COUNT_STAR(relation) so that all records are counted without considering nullness of the fields in the records

2009-10-13 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765357#action_12765357
 ] 

Santhosh Srinivasan commented on PIG-1014:
--

After a discussion with Pradeep, who also graciously ran SQL queries to 
verify semantics, we have the following proposal.

The semantics of COUNT could be defined as:

1. COUNT( A ) is equivalent to COUNT( A.* ), and the result of COUNT( A ) 
will count null tuples in the relation.
2. COUNT( A.$0 ) will not count null tuples in the relation.

3. COUNT( A.($0, $1) ) is equivalent to COUNT( A1.* ), where A1 is the 
relation containing tuples with two columns, and will exhibit the behavior of 
statement 1,

OR

3. COUNT( A.($0, $1) ) is equivalent to COUNT( A1.* ), where A1 is the 
relation containing tuples with two columns, and will exhibit the behavior of 
statement 2.

Point 3 needs more discussion.
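
A small hypothetical example of the two proposed behaviors:

{code}
-- Suppose relation A holds the tuples (1), (), (3), where () is a null tuple
G = GROUP A ALL;
X = FOREACH G GENERATE COUNT(A);     -- statement 1: null tuples counted -> 3
Y = FOREACH G GENERATE COUNT(A.$0);  -- statement 2: null tuples skipped -> 2
{code}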

Comments/thoughts/suggestions/anything else welcome.


> Pig should convert COUNT(relation) to COUNT_STAR(relation) so that all 
> records are counted without considering nullness of the fields in the records
> 
>
> Key: PIG-1014
> URL: https://issues.apache.org/jira/browse/PIG-1014
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.4.0
>Reporter: Pradeep Kamath
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1014) Pig should convert COUNT(relation) to COUNT_STAR(relation) so that all records are counted without considering nullness of the fields in the records

2009-10-13 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765194#action_12765194
 ] 

Santhosh Srinivasan commented on PIG-1014:
--

Essentially, Pradeep is pointing out an issue in the implementation of COUNT. 
If that is the case, then COUNT has to be fixed, or the semantics of COUNT 
have to be documented to explain the current implementation. I would vote for 
fixing COUNT to have the correct semantics.

> Pig should convert COUNT(relation) to COUNT_STAR(relation) so that all 
> records are counted without considering nullness of the fields in the records
> 
>
> Key: PIG-1014
> URL: https://issues.apache.org/jira/browse/PIG-1014
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.4.0
>Reporter: Pradeep Kamath
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-984) PERFORMANCE: Implement a map-side group operator to speed up processing of ordered data

2009-10-12 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764846#action_12764846
 ] 

Santhosh Srinivasan commented on PIG-984:
-

Very quick comment: the parser has a log.info that should be converted to a 
log.debug.

Index: src/org/apache/pig/impl/logicalLayer/parser/QueryParser.jjt
===


+[ ("\"collected\"" { 
+log.info("Using mapside");


> PERFORMANCE: Implement a map-side group operator to speed up processing of 
> ordered data 
> 
>
> Key: PIG-984
> URL: https://issues.apache.org/jira/browse/PIG-984
> Project: Pig
>  Issue Type: New Feature
>Reporter: Richard Ding
>Assignee: Richard Ding
> Attachments: PIG-984.patch, PIG-984_1.patch
>
>
> The general group by operation in Pig needs both mappers and reducers (the 
> aggregation is done in reducers). This incurs disk writes/reads  between 
> mappers and reducers.
> However, in the cases where the input data has the following properties
>1. The records with the same key are grouped together (such as the data is 
> sorted by the keys).
>2. The records with the same key are in the same mapper input.
> the group by operation can be performed in the mappers only and thus remove 
> the overhead of disk writes/reads.
> Alan proposed adding a hint to the group by clause like this one:
> {code}
> A = load 'input' using SomeLoader(...);
> B = group A by $0 using "mapside";
> C = foreach B generate ...
> {code}
> The proposed addition of using "mapside" to group will be a mapside group 
> operator that collects all records for a given key into a buffer. When it 
> sees a key change it will emit the key and bag for records it had buffered. 
> It will assume that all records for a given key are collected together and 
> thus there is no need to buffer across keys. 
> It is expected that "SomeLoader" will be implemented by data systems such as 
> Zebra to ensure the data emitted by the loader satisfies the above properties 
> (1) and (2).
> It will be the responsibility of the user (or the loader) to guarantee these 
> properties (1) & (2) before invoking the mapside hint for the group by 
> clause. The Pig runtime can't check for the errors in the input data.
> For the group by clauses with mapside hint, Pig Latin will only support group 
> by columns (including *), not group by expressions nor group all. 
>   

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1014) Pig should convert COUNT(relation) to COUNT_STAR(relation) so that all records are counted without considering nullness of the fields in the records

2009-10-12 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764792#action_12764792
 ] 

Santhosh Srinivasan commented on PIG-1014:
--

If the user wants to count without regard to nulls, then the user should use 
COUNT_STAR. One of the philosophies of Pig has been to allow users to do 
exactly what they want. Here, we are violating that philosophy and, secondly, 
we are second-guessing the user's intention.

> Pig should convert COUNT(relation) to COUNT_STAR(relation) so that all 
> records are counted without considering nullness of the fields in the records
> 
>
> Key: PIG-1014
> URL: https://issues.apache.org/jira/browse/PIG-1014
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.4.0
>Reporter: Pradeep Kamath
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1014) Pig should convert COUNT(relation) to COUNT_STAR(relation) so that all records are counted without considering nullness of the fields in the records

2009-10-12 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764771#action_12764771
 ] 

Santhosh Srinivasan commented on PIG-1014:
--

Is Pig trying to guess the user's intent? What if the user wanted to count 
without nulls?

> Pig should convert COUNT(relation) to COUNT_STAR(relation) so that all 
> records are counted without considering nullness of the fields in the records
> 
>
> Key: PIG-1014
> URL: https://issues.apache.org/jira/browse/PIG-1014
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.4.0
>Reporter: Pradeep Kamath
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1014) Pig should convert COUNT(relation) to COUNT_STAR(relation) so that all records are counted without considering nullness of the fields in the records

2009-10-10 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764368#action_12764368
 ] 

Santhosh Srinivasan commented on PIG-1014:
--

When the semantics of COUNT were changed, I thought this was communicated to 
the users. What is the intention of this JIRA?

> Pig should convert COUNT(relation) to COUNT_STAR(relation) so that all 
> records are counted without considering nullness of the fields in the records
> 
>
> Key: PIG-1014
> URL: https://issues.apache.org/jira/browse/PIG-1014
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.4.0
>Reporter: Pradeep Kamath
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-995) Limit Optimizer throw exception "ERROR 2156: Error while fixing projections"

2009-10-09 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764119#action_12764119
 ] 

Santhosh Srinivasan commented on PIG-995:
-

Review comments:

The initialization code is fine. However, the try-catch block is shared 
between the rebuildSchemas() and rebuildProjectionMaps() invocations. This 
could lead to a misleading error message: specifically, if rebuildSchemas() 
throws an exception, the error message will indicate that rebuilding the 
projection maps failed.

> Limit Optimizer throw exception "ERROR 2156: Error while fixing projections"
> 
>
> Key: PIG-995
> URL: https://issues.apache.org/jira/browse/PIG-995
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.3.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.6.0
>
> Attachments: PIG-995-1.patch, PIG-995-2.patch, PIG-995-3.patch
>
>
> The following script fail:
> A = load '1.txt' AS (a0, a1, a2);
> B = order A by a1;
> C = limit B 10;
> D = foreach C generate $0;
> dump D;
> Error log:
> Caused by: org.apache.pig.impl.plan.VisitorException: ERROR 2156: Error while 
> fixing projections. Projection map of node to be replaced is null.
> at 
> org.apache.pig.impl.logicalLayer.ProjectFixerUpper.visit(ProjectFixerUpper.java:138)
> at 
> org.apache.pig.impl.logicalLayer.LOProject.visit(LOProject.java:408)
> at org.apache.pig.impl.logicalLayer.LOProject.visit(LOProject.java:58)
> at 
> org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:65)
> at 
> org.apache.pig.impl.plan.DepthFirstWalker.walk(DepthFirstWalker.java:50)
> at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
> at 
> org.apache.pig.impl.logicalLayer.LOForEach.rewire(LOForEach.java:761)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-984) PERFORMANCE: Implement a map-side group operator to speed up processing of ordered data

2009-10-01 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761270#action_12761270
 ] 

Santhosh Srinivasan commented on PIG-984:
-

bq. But this is in line with what we've done for joins, philosophically, 
semantically, and syntactically.

Not exactly; with joins we are exposing different kinds of joins. Here we are 
exposing underlying aspects of the framework (mapside). If there is a parallel 
framework that does not do map-reduce, then having mapside in the language is 
philosophically and semantically incorrect.

> PERFORMANCE: Implement a map-side group operator to speed up processing of 
> ordered data 
> 
>
> Key: PIG-984
> URL: https://issues.apache.org/jira/browse/PIG-984
> Project: Pig
>  Issue Type: New Feature
>Reporter: Richard Ding
>
> The general group by operation in Pig needs both mappers and reducers (the 
> aggregation is done in reducers). This incurs disk writes/reads  between 
> mappers and reducers.
> However, in the cases where the input data has the following properties
>1. The records with the same key are grouped together (such as the data is 
> sorted by the keys).
>2. The records with the same key are in the same mapper input.
> the group by operation can be performed in the mappers only and thus remove 
> the overhead of disk writes/reads.
> Alan proposed adding a hint to the group by clause like this one:
> {code}
> A = load 'input' using SomeLoader(...);
> B = group A by $0 using "mapside";
> C = foreach B generate ...
> {code}
> The proposed addition of using "mapside" to group will be a mapside group 
> operator that collects all records for a given key into a buffer. When it 
> sees a key change it will emit the key and bag for records it had buffered. 
> It will assume that all records for a given key are collected together and 
> thus there is no need to buffer across keys. 
> It is expected that "SomeLoader" will be implemented by data systems such as 
> Zebra to ensure the data emitted by the loader satisfies the above properties 
> (1) and (2).
> It will be the responsibility of the user (or the loader) to guarantee these 
> properties (1) & (2) before invoking the mapside hint for the group by 
> clause. The Pig runtime can't check for the errors in the input data.
> For the group by clauses with mapside hint, Pig Latin will only support group 
> by columns (including *), not group by expressions nor group all. 
>   

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-984) PERFORMANCE: Implement a map-side group operator to speed up processing of ordered data

2009-09-30 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761073#action_12761073
 ] 

Santhosh Srinivasan commented on PIG-984:
-

bq. This is something that can be inferred looking at the schema and 
distribution key. I understand wanting a manual handle to turn on the behavior 
while developing, but the production version of this can be done automatically 
( "if distributed by and sorted on a subset of group keys, apply map-side 
group" rule in the optimizer).

+1. That's what I meant when I said:

bq. 1. I am concerned about extending the language to support features that 
can be handled internally. The scope of the language has not been defined, but 
the language continues to evolve.

> PERFORMANCE: Implement a map-side group operator to speed up processing of 
> ordered data 
> 
>
> Key: PIG-984
> URL: https://issues.apache.org/jira/browse/PIG-984
> Project: Pig
>  Issue Type: New Feature
>Reporter: Richard Ding
>
> The general group by operation in Pig needs both mappers and reducers (the 
> aggregation is done in reducers). This incurs disk writes/reads  between 
> mappers and reducers.
> However, in the cases where the input data has the following properties
>1. The records with the same key are grouped together (such as the data is 
> sorted by the keys).
>2. The records with the same key are in the same mapper input.
> the group by operation can be performed in the mappers only and thus remove 
> the overhead of disk writes/reads.
> Alan proposed adding a hint to the group by clause like this one:
> {code}
> A = load 'input' using SomeLoader(...);
> B = group A by $0 using "mapside";
> C = foreach B generate ...
> {code}
> The proposed addition of using "mapside" to group will be a mapside group 
> operator that collects all records for a given key into a buffer. When it 
> sees a key change it will emit the key and bag for records it had buffered. 
> It will assume that all records for a given key are collected together and 
> thus there is no need to buffer across keys. 
> It is expected that "SomeLoader" will be implemented by data systems such as 
> Zebra to ensure the data emitted by the loader satisfies the above properties 
> (1) and (2).
> It will be the responsibility of the user (or the loader) to guarantee these 
> properties (1) & (2) before invoking the mapside hint for the group by 
> clause. The Pig runtime can't check for the errors in the input data.
> For the group by clauses with mapside hint, Pig Latin will only support group 
> by columns (including *), not group by expressions nor group all. 
>   

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-984) PERFORMANCE: Implement a map-side group operator to speed up processing of ordered data

2009-09-30 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761028#action_12761028
 ] 

Santhosh Srinivasan commented on PIG-984:
-

A couple of things:

1. I am concerned about extending the language to support features that can 
be handled internally. The scope of the language has not been defined, but the 
language continues to evolve.

2. I agree with Thejas' comment about allowing expressions that do not alter 
the property. Pig will not be able to check that, but it is no different from 
being able to check whether the data is sorted or not.

> PERFORMANCE: Implement a map-side group operator to speed up processing of 
> ordered data 
> 
>
> Key: PIG-984
> URL: https://issues.apache.org/jira/browse/PIG-984
> Project: Pig
>  Issue Type: New Feature
>Reporter: Richard Ding
>
> The general group by operation in Pig needs both mappers and reducers (the 
> aggregation is done in reducers). This incurs disk writes/reads  between 
> mappers and reducers.
> However, in the cases where the input data has the following properties
>1. The records with the same key are grouped together (such as the data is 
> sorted by the keys).
>2. The records with the same key are in the same mapper input.
> the group by operation can be performed in the mappers only and thus remove 
> the overhead of disk writes/reads.
> Alan proposed adding a hint to the group by clause like this one:
> {code}
> A = load 'input' using SomeLoader(...);
> B = group A by $0 using "mapside";
> C = foreach B generate ...
> {code}
> The proposed addition of using "mapside" with group will introduce a mapside 
> group operator that collects all records for a given key into a buffer. When it 
> sees a key change, it will emit the key and a bag of the records it had buffered. 
> It will assume that all records for a given key are collected together and 
> thus there is no need to buffer across keys. 
> It is expected that "SomeLoader" will be implemented by data systems such as 
> Zebra to ensure the data emitted by the loader satisfies the above properties 
> (1) and (2).
> It will be the responsibility of the user (or the loader) to guarantee these 
> properties (1) & (2) before invoking the mapside hint for the group by 
> clause. The Pig runtime can't check for the errors in the input data.
> For the group by clauses with mapside hint, Pig Latin will only support group 
> by columns (including *), not group by expressions nor group all. 
>   

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-955) Skewed join generates incorrect results

2009-09-11 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754349#action_12754349
 ] 

Santhosh Srinivasan commented on PIG-955:
-

Hi Ying,

How are Fragment Replicate Join and Skewed Join related as you mention in your 
bug description? Also, skewed join has been part of trunk for more than a month 
now. Your bug description states that Pig needs skewed join.

Thanks,
Santhosh

> Skewed join generates incorrect results 
> -
>
> Key: PIG-955
> URL: https://issues.apache.org/jira/browse/PIG-955
> Project: Pig
>  Issue Type: Improvement
>Reporter: Ying He
> Attachments: PIG-955.patch
>
>
> Fragmented replicated join has a few limitations:
>  - One of the tables needs to be loaded into memory
>  - Join is limited to two tables
> Skewed join partitions the table and joins the records in the reduce phase. 
> It computes a histogram of the key space to account for skewing in the input 
> records. Further, it adjusts the number of reducers depending on the key 
> distribution.
> We need to implement the skewed join in pig.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-922) Logical optimizer: push up project

2009-08-25 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12747560#action_12747560
 ] 

Santhosh Srinivasan commented on PIG-922:
-

For relational operators that require multiple inputs, the list will correspond 
to each of its inputs. If you notice getRequiredFields, the list is populated 
on a per input basis. In the case of getRequiredInputs, I see that the use of 
the list is not consistent for LOJoin, LOUnion, LOCogroup and LOCross.
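
As a concrete illustration of the per-input convention (a sketch with a stand-in Pair class; the real one is org.apache.pig.impl.util.Pair), a consistent answer for C = join A by a0, B by b0 would carry one entry per input:

{code}
import java.util.ArrayList;
import java.util.List;

// Stand-in for org.apache.pig.impl.util.Pair, holding (input, column).
class Pair<L, R> {
    L first; R second;
    Pair(L first, R second) { this.first = first; this.second = second; }
}

class PerInputRequiredFieldsSketch {
    // Element 0 lists the fields required from input A, element 1 the
    // fields required from input B, mirroring getRequiredFields.
    static List<List<Pair<Integer, Integer>>> forJoinOnFirstColumns() {
        List<Pair<Integer, Integer>> fromA = new ArrayList<Pair<Integer, Integer>>();
        fromA.add(new Pair<Integer, Integer>(0, 0)); // A.a0
        List<Pair<Integer, Integer>> fromB = new ArrayList<Pair<Integer, Integer>>();
        fromB.add(new Pair<Integer, Integer>(1, 0)); // B.b0

        List<List<Pair<Integer, Integer>>> perInput =
                new ArrayList<List<Pair<Integer, Integer>>>();
        perInput.add(fromA);
        perInput.add(fromB);
        return perInput;
    }
}
{code}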

> Logical optimizer: push up project
> --
>
> Key: PIG-922
> URL: https://issues.apache.org/jira/browse/PIG-922
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>Affects Versions: 0.3.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.4.0
>
> Attachments: PIG-922-p1_0.patch, PIG-922-p1_1.patch, 
> PIG-922-p1_2.patch
>
>
> This is a continuation work of 
> [PIG-697|https://issues.apache.org/jira/browse/PIG-697]. We need to add 
> another rule to the logical optimizer: Push up project, ie, prune columns as 
> early as possible.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-922) Logical optimizer: push up project

2009-08-25 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12747521#action_12747521
 ] 

Santhosh Srinivasan commented on PIG-922:
-

I am not sure about the logic for handling cogroup. Let me take another example 
of an operator with multiple inputs - union. If you look at the code below, the 
method returns a single required fields element. The required fields element 
contains a reference to all the inputs that are required to compute that 
particular column. However, wrt cogroup you are returning a list of required 
fields that contains nulls for all the positions that are of no interest.

{code}
+ArrayList<Pair<Integer, Integer>> inputList = new 
ArrayList<Pair<Integer, Integer>>();
+for (int i=0;i<inputs.size();i++)
+    inputList.add(new Pair<Integer, Integer>(i, column));
+List<RequiredFields> result = new ArrayList<RequiredFields>();
+result.add(new RequiredFields(inputList));
+return result;
{code}

> Logical optimizer: push up project
> --
>
> Key: PIG-922
> URL: https://issues.apache.org/jira/browse/PIG-922
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>Affects Versions: 0.3.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.4.0
>
> Attachments: PIG-922-p1_0.patch, PIG-922-p1_1.patch, 
> PIG-922-p1_2.patch
>
>
> This is a continuation work of 
> [PIG-697|https://issues.apache.org/jira/browse/PIG-697]. We need to add 
> another rule to the logical optimizer: Push up project, ie, prune columns as 
> early as possible.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-922) Logical optimizer: push up project

2009-08-21 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746387#action_12746387
 ] 

Santhosh Srinivasan commented on PIG-922:
-

Review comments for patch p1_1

I have not reviewed the test cases. I have reviewed all the sources.

Index: src/org/apache/pig/impl/logicalLayer/RelationalOperator.java
===

In the second example, why are A.a0 and B.b0 relevant input columns for C.$0 ? 
I don't see this logic in LOJoin.getRelevantInputs()

{code}
+ * eg2:
+ * A = load 'a' AS (a0, a1);
+ * B = load 'b' AS (b0, b1);
+ * C = join A by a0, B by b0;
+ * 
+ * Relevant input columns for C.$0 is A.a0, B.b0. Relevant input columns 
for C.$1 is A.a1.
{code}

Index: src/org/apache/pig/impl/logicalLayer/LOForEach.java
===

I am not sure about the logic for the computation of the inner plan number that 
produces the output column in getRelevantInputs. I would recommend that you 
cache the schema generated by the inner plan (as part of getSchema()) and use 
that information here.

{code}
+// find the index of foreach inner plan for this particular output 
column
+LogicalOperator pOp = null;
+int planIndex = 0;
+try {
+pOp = 
mSchema.getField(0).getReverseCanonicalMap().keySet().iterator().next();
+ 
+for (int i=1;i<=column;i++)
+{
+if 
(mSchema.getField(i).getReverseCanonicalMap().keySet().iterator().next()!=pOp)
+{
+planIndex++;
+pOp = 
mSchema.getField(i).getReverseCanonicalMap().keySet().iterator().next();
+}
+}
+} catch (FrontendException e) {
+log.warn("Cannot retrieve field schema from "+mSchema.toString());
+return null;
+}
{code}
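
One way the suggested caching could look, assuming getSchema() can record the producing inner-plan index for each output column as it builds the schema (all names here are hypothetical):

{code}
import java.util.ArrayList;
import java.util.List;

// Hypothetical cache populated while getSchema() walks the inner plans:
// planIndexForColumn(c) then answers in O(1) which foreach inner plan
// produced output column c, so getRelevantInputs does not have to
// re-derive it from the reverse canonical map.
class ForEachSchemaCacheSketch {
    private final List<Integer> columnToPlanIndex = new ArrayList<Integer>();

    // Called once per output column, in order, during schema construction.
    void recordColumn(int planIndex) {
        columnToPlanIndex.add(planIndex);
    }

    int planIndexForColumn(int column) {
        return columnToPlanIndex.get(column);
    }
}
{code}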

Index: src/org/apache/pig/impl/logicalLayer/LOCogroup.java
===

Why are we adding null to the list of required fields while iterating over the 
inputs?

{code}
+if(inputNum == column-1) {
+result.add(new RequiredFields(true));
+} else {
+result.add(null);
+}
{code}


Index: src/org/apache/pig/impl/plan/RequiredFields.java
===

Where are the following methods used? I did not see any calls to them.

{code}
+
+// return true if this merge modifies the object itself 
+public boolean merge(RequiredFields r2)
+{
+boolean newRequiredFields = false;
+if (r2==null)
+return newRequiredFields;
+if (r2.getNeedAllFields())
+{
+mNeedAllFields = true;
+}
+if (!r2.getNeedNoFields())
+{
+mNeedNoFields = false;
+}
+if (r2.getFields()==null)
+return newRequiredFields;
+for (Pair<Integer, Integer> f:r2.getFields())
+{
+if (mFields==null)
+mFields = new ArrayList<Pair<Integer, Integer>>(); 
+if (!mFields.contains(f))
+{
+mFields.add(f);
+mNeedNoFields = false;
+newRequiredFields = true;
+}
+}
+return newRequiredFields;
+}
+
+public void reIndex(int i)
+{
+for (Pair<Integer, Integer> p:mFields)
+{
+p.first = i;
+}
+}
{code}

> Logical optimizer: push up project
> --
>
> Key: PIG-922
> URL: https://issues.apache.org/jira/browse/PIG-922
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>Affects Versions: 0.3.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.4.0
>
> Attachments: PIG-922-p1_0.patch, PIG-922-p1_1.patch
>
>
> This is a continuation work of 
> [PIG-697|https://issues.apache.org/jira/browse/PIG-697]. We need to add 
> another rule to the logical optimizer: Push up project, ie, prune columns as 
> early as possible.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-924) Make Pig work with multiple versions of Hadoop

2009-08-20 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745568#action_12745568
 ] 

Santhosh Srinivasan commented on PIG-924:
-

Hadoop has promised "APIs in stone" forever and has not delivered on that 
promise yet. Higher layers in the stack have to learn how to cope with an 
ever-changing lower layer. How this change is managed is a matter of convenience 
to the owners of the higher layer. I really like the Shims approach, which avoids the 
cost of branching out Pig every time we make a compatible release. The cost of 
creating a branch for each version of hadoop seems to be too high compared to 
the cost of the Shims approach.

Of course, there are pros and cons to each approach. The question here is when 
will Hadoop set its APIs in stone and how many more releases will we have 
before this happens. If the answer to the question is 12 months and 2 more 
releases, then we should go with the Shims approach. If the answer is 3-6 
months and one more release then we should stick with our current approach and 
pay the small penalty of patches supplied to work with the specific release of 
Hadoop.

Summary: Use the shims patch if APIs are not set in stone within a quarter or 
two and if there is more than one release of Hadoop.
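
For readers following along, the shim idea is roughly one interface over the Hadoop calls that diverge between releases, with a per-version implementation picked at runtime. A minimal sketch with hypothetical names (not Pig's actual shim classes):

{code}
// Hypothetical shim layer: the interface captures the calls that changed
// between Hadoop 18/19/20; one implementation exists per supported release.
interface HadoopShim {
    String jobTrackerAddress();
}

class Hadoop18Shim implements HadoopShim {
    public String jobTrackerAddress() { return "0.18-specific lookup"; }
}

class Hadoop20Shim implements HadoopShim {
    public String jobTrackerAddress() { return "0.20-specific lookup"; }
}

class ShimLoader {
    // Dispatch on the detected version instead of rebuilding Pig per release.
    static HadoopShim load(String hadoopVersion) {
        return hadoopVersion.startsWith("0.18")
                ? new Hadoop18Shim()
                : new Hadoop20Shim();
    }
}
{code}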

> Make Pig work with multiple versions of Hadoop
> --
>
> Key: PIG-924
> URL: https://issues.apache.org/jira/browse/PIG-924
> Project: Pig
>  Issue Type: Bug
>Reporter: Dmitriy V. Ryaboy
> Attachments: pig_924.2.patch, pig_924.3.patch, pig_924.patch
>
>
> The current Pig build scripts package hadoop and other dependencies into the 
> pig.jar file.
> This means that if users upgrade Hadoop, they also need to upgrade Pig.
> Pig has relatively few dependencies on Hadoop interfaces that changed between 
> 18, 19, and 20.  It is possible to write a dynamic shim that allows Pig to 
> use the correct calls for any of the above versions of Hadoop. Unfortunately, 
> the building process precludes us from the ability to do this at runtime, and 
> forces an unnecessary Pig rebuild even if dynamic shims are created.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-913) Error in Pig script when grouping on chararray column

2009-08-11 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742150#action_12742150
 ] 

Santhosh Srinivasan commented on PIG-913:
-

+1 for the fix. As Dmitriy indicates, we need new unit test cases after Hudson 
verifies the patch.

> Error in Pig script when grouping on chararray column
> -
>
> Key: PIG-913
> URL: https://issues.apache.org/jira/browse/PIG-913
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.4.0
>Reporter: Viraj Bhat
>Priority: Critical
> Fix For: 0.4.0
>
> Attachments: PIG-913.patch
>
>
> I have a very simple script which fails at parsetime due to the schema I 
> specified in the loader.
> {code}
> data = LOAD '/user/viraj/studenttab10k' AS (s:chararray);
> dataSmall = limit data 100;
> bb = GROUP dataSmall by $0;
> dump bb;
> {code}
> =
> 2009-08-06 18:47:56,297 [main] INFO  org.apache.pig.Main - Logging error 
> messages to: /homes/viraj/pig-svn/trunk/pig_1249609676296.log
> 09/08/06 18:47:56 INFO pig.Main: Logging error messages to: 
> /homes/viraj/pig-svn/trunk/pig_1249609676296.log
> 2009-08-06 18:47:56,459 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
> to hadoop file system at: hdfs://localhost:9000
> 09/08/06 18:47:56 INFO executionengine.HExecutionEngine: Connecting to hadoop 
> file system at: hdfs://localhost:9000
> 2009-08-06 18:47:56,694 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
> to map-reduce job tracker at: localhost:9001
> 09/08/06 18:47:56 INFO executionengine.HExecutionEngine: Connecting to 
> map-reduce job tracker at: localhost:9001
> 2009-08-06 18:47:57,008 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 1002: Unable to store alias bb
> 09/08/06 18:47:57 ERROR grunt.Grunt: ERROR 1002: Unable to store alias bb
> Details at logfile: /homes/viraj/pig-svn/trunk/pig_1249609676296.log
> =
> =
> Pig Stack Trace
> ---
> ERROR 1002: Unable to store alias bb
> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to 
> open iterator for alias bb
> at org.apache.pig.PigServer.openIterator(PigServer.java:481)
> at 
> org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:531)
> at 
> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:141)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
> at org.apache.pig.Main.main(Main.java:397)
> Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: 
> Unable to store alias bb
> at org.apache.pig.PigServer.store(PigServer.java:536)
> at org.apache.pig.PigServer.openIterator(PigServer.java:464)
> ... 6 more
> Caused by: java.lang.NullPointerException
> at 
> org.apache.pig.impl.logicalLayer.LOCogroup.unsetSchema(LOCogroup.java:359)
> at 
> org.apache.pig.impl.logicalLayer.optimizer.SchemaRemover.visit(SchemaRemover.java:64)
> at 
> org.apache.pig.impl.logicalLayer.LOCogroup.visit(LOCogroup.java:335)
> at org.apache.pig.impl.logicalLayer.LOCogroup.visit(LOCogroup.java:46)
> at 
> org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:68)
> at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
> at 
> org.apache.pig.impl.logicalLayer.optimizer.LogicalTransformer.rebuildSchemas(LogicalTransformer.java:67)
> at 
> org.apache.pig.impl.logicalLayer.optimizer.LogicalOptimizer.optimize(LogicalOptimizer.java:187)
> at org.apache.pig.PigServer.compileLp(PigServer.java:854)
> at org.apache.pig.PigServer.compileLp(PigServer.java:791)
> at org.apache.pig.PigServer.store(PigServer.java:509)
> ... 7 more
> =
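
The trace bottoms out in LOCogroup.unsetSchema; without knowing the committed patch, the failure mode is the classic one a null guard avoids. A sketch of that shape (stand-in types, not the actual fix):

{code}
import java.util.List;

// Sketch of a defensive guard of the kind that avoids the NPE above.
// SchemaLike is a stand-in; the real logic lives in LOCogroup.unsetSchema.
class UnsetSchemaSketch {
    interface SchemaLike { void unset(); }

    static void unsetAll(List<SchemaLike> inputSchemas) {
        if (inputSchemas == null) return; // unguarded code would NPE here
        for (SchemaLike s : inputSchemas) {
            if (s != null) s.unset();
        }
    }
}
{code}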

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-561) Need to generate empty tuples and bags as a part of Pig Syntax

2009-08-09 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan resolved PIG-561.
-

Resolution: Duplicate

Duplicate of PIG-773

> Need to generate empty tuples and bags as a part of Pig Syntax
> --
>
> Key: PIG-561
> URL: https://issues.apache.org/jira/browse/PIG-561
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.2.0
>Reporter: Viraj Bhat
>
> There is a need to sometimes generate empty tuples and bags as a part of the 
> Pig syntax rather than using UDFs
> {code}
> a = load 'mydata.txt' using PigStorage();
> b = foreach a generate ( ) as emptytuple;
> c = foreach a generate { } as emptybag;
> dump c;
> {code}
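
For context, the UDF workaround that this request wants to make unnecessary is roughly the following sketch (the class name is illustrative, but EvalFunc and BagFactory are Pig's real UDF API):

{code}
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

// A UDF that always returns an empty bag; invoked from Pig Latin as
//   c = foreach a generate EmptyBag();
public class EmptyBag extends EvalFunc<DataBag> {
    @Override
    public DataBag exec(Tuple input) throws IOException {
        return BagFactory.getInstance().newDefaultBag();
    }
}
{code}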

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-697) Proposed improvements to pig's optimizer

2009-08-07 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-697:


  Resolution: Fixed
Hadoop Flags: [Reviewed]
  Status: Resolved  (was: Patch Available)

All optimizer related patches have been committed.

> Proposed improvements to pig's optimizer
> 
>
> Key: PIG-697
> URL: https://issues.apache.org/jira/browse/PIG-697
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Alan Gates
>Assignee: Santhosh Srinivasan
> Attachments: Optimizer_Phase5.patch, OptimizerPhase1.patch, 
> OptimizerPhase1_part2.patch, OptimizerPhase2.patch, 
> OptimizerPhase3_parrt1-1.patch, OptimizerPhase3_parrt1.patch, 
> OptimizerPhase3_part2_3.patch, OptimizerPhase4_part1-1.patch, 
> OptimizerPhase4_part2.patch
>
>
> I propose the following changes to pig optimizer, plan, and operator 
> functionality to support more robust optimization:
> 1) Remove the required array from Rule.  This will change rules so that they 
> only match exact patterns instead of allowing missing elements in the pattern.
> This has the downside that if a given rule applies to two patterns (say 
> Load->Filter->Group, Load->Group) you have to write two rules.  But it has 
> the upside that
> the resulting rules know exactly what they are getting.  The original intent 
> of this was to reduce the number of rules that needed to be written.  But the
> resulting rules have to do a lot of work to understand the operators they are 
> working with.  With exact matches only, each rule will know exactly the 
> operators it
> is working on and can apply the logic of shifting the operators around.  All 
> four of the existing rules set all entries of required to true, so removing 
> this
> will have no effect on them.
> 2) Change PlanOptimizer.optimize to iterate over the rules until there are no 
> conversions or a certain number of iterations has been reached.  Currently the
> function is:
> {code}
> public final void optimize() throws OptimizerException {
> RuleMatcher matcher = new RuleMatcher();
> for (Rule rule : mRules) {
> if (matcher.match(rule)) {
> // It matches the pattern.  Now check if the transformer
> // approves as well.
> List<List<O>> matches = matcher.getAllMatches();
> for (List<O> match:matches)
> {
>   if (rule.transformer.check(match)) {
>   // The transformer approves.
>   rule.transformer.transform(match);
>   }
> }
> }
> }
> }
> {code}
> It would change to be:
> {code}
> public final void optimize() throws OptimizerException {
> RuleMatcher matcher = new RuleMatcher();
> boolean sawMatch;
> int numIterations = 0;
> do {
> sawMatch = false;
> for (Rule rule : mRules) {
> List<List<O>> matches = matcher.getAllMatches();
> for (List<O> match:matches) {
> // It matches the pattern.  Now check if the transformer
> // approves as well.
> if (rule.transformer.check(match)) {
> // The transformer approves.
> sawMatch = true;
> rule.transformer.transform(match);
> }
> }
> }
> // Not sure if 1000 is the right number of iterations, maybe it
> // should be configurable so that large scripts don't stop too 
> // early.
> } while (sawMatch && numIterations++ < 1000);
> }
> {code}
> The reason for limiting the number of iterations is to avoid infinite loops.  
> The reason for iterating over the rules is so that each rule can be applied 
> multiple
> times as necessary.  This allows us to write simple rules, mostly swaps 
> between neighboring operators, without worrying that we get the plan right in 
> one pass.
> For example, we might have a plan that looks like:  
> Load->Join->Filter->Foreach, and we want to optimize it to 
> Load->Foreach->Filter->Join.  With two simple
> rules (swap filter and join and swap foreach and filter), applied 
> iteratively, we can get from the initial to final plan, without needing to 
> understand the
> big picture of the entire plan.
> 3) Add three calls to OperatorPlan:
> {code}
> /**
>  * Swap two operators in a plan.  Both of the operators must have single
>  * inputs and single outputs.
>  * @param first operator
>  * @param second operator
>  * @throws PlanException if either operator is not single input and output.
>  */
> public void swap(E first, E second) throws PlanException {
> ...
> }
> /**
> 

[jira] Commented: (PIG-912) Rename/Add 'string' as a type in place of chararray - and deprecate (and probably eventually remove) the use of 'chararray'

2009-08-06 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740357#action_12740357
 ] 

Santhosh Srinivasan commented on PIG-912:
-

+1

> Rename/Add 'string' as a type in place of chararray - and deprecate (and 
> probably eventually remove) the use of 'chararray'
> ---
>
> Key: PIG-912
> URL: https://issues.apache.org/jira/browse/PIG-912
> Project: Pig
>  Issue Type: Bug
>Reporter: Mridul Muralidharan
>
> The type 'chararray' in pig does not refer to an array of characters (char 
> []) but rather to java.lang.String
> This is inconsistent and confusing naming; additionally, it will be an 
> interoperability issue with other systems which support schemas (zebra among 
> others).
> It would be good to have a consistent naming across projects, while also 
> having appropriate names for the various types.
> Since use of 'chararray' is already widely deployed, it would be good to :
> a) Add a type 'string' (or equivalent) which is an alias for 'chararray'.
> Additionally, it is possible to envision these too (if deemed necessary - not 
> a main requirement):
> b) Modify documentation and example scripts to use this new type.
> c) Emit warnings about chararray being deprecated.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-908) Need a way to correlate MR jobs with Pig statements

2009-08-04 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739147#action_12739147
 ] 

Santhosh Srinivasan commented on PIG-908:
-

+1

This approach has been discussed but not documented.

> Need a way to correlate MR jobs with Pig statements
> ---
>
> Key: PIG-908
> URL: https://issues.apache.org/jira/browse/PIG-908
> Project: Pig
>  Issue Type: Wish
>Reporter: Dmitriy V. Ryaboy
>
> Complex Pig Scripts often generate many Map-Reduce jobs, especially with the 
> recent introduction of multi-store capabilities.
> For example, the first script in the Pig tutorial produces 5 MR jobs.
> There is currently very little support for debugging resulting jobs; if one 
> of the MR jobs fails, it is hard to figure out which part of the script it 
> was responsible for. Explain plans help, but even with the explain plan, a 
> fair amount of effort (and sometimes, experimentation) is required to 
> correlate the failing MR job with the corresponding PigLatin statements.
> This ticket is created to discuss approaches to alleviating this problem.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-724) Treating map values in PigStorage

2009-07-31 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan resolved PIG-724.
-

  Resolution: Fixed
Assignee: Santhosh Srinivasan
Hadoop Flags: [Incompatible change]

Issue fixed as part of PIG-880

> Treating map values in PigStorage
> -
>
> Key: PIG-724
> URL: https://issues.apache.org/jira/browse/PIG-724
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.2.1
>Reporter: Santhosh Srinivasan
>Assignee: Santhosh Srinivasan
> Fix For: 0.2.1
>
>
> Currently, PigStorage treats the materialized string 123 as an integer with 
> the value 123. If the user intended this to be the string 123, PigStorage 
> cannot deal with it. This reasoning also applies to doubles. Due to this 
> issue, for maps that contain values which are of the same type but manifest 
> the issue discussed at the beginning of the paragraph, Pig throws its hands 
> up at runtime.  An example to illustrate the problem will help.
> In the example below a sample row in the data (map.txt) contains the 
> following:
> [key01#35,key02#value01]
> When Pig tries to convert the stream to a map, it creates a Map<String, 
> Object> where the key is a string and the value is an integer. Running the 
> script shown below, results in a run-time error.
> {code}
> grunt> a = load 'map.txt' as (themap: map[]);
> grunt> b = filter a by (chararray)(themap#'key01') == 'hello';
>   
> grunt> dump b;
> 2009-03-18 15:19:03,773 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>  - 0% complete
> 2009-03-18 15:19:28,797 [main] ERROR 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>  - Map reduce job failed
> 2009-03-18 15:19:28,817 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 1081: Cannot cast to chararray. Expected bytearray but received: int
> {code} 
> There are two ways to resolve this issue:
> 1. Change the conversion routine for bytesToMap to return a map where the 
> value is a bytearray and not the actual type. This change breaks backward 
> compatibility
> 2. Introduce checks in POCast where conversions that are legal in the type 
> checking world are allowed, i.e., run time checks will be made to check for 
> compatible casts. In the above example, an int can be converted to a 
> chararray and the cast will be made. If on the other hand, it was a chararray 
> to int conversion then an exception will be thrown.
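
Option 2 amounts to a runtime compatibility check inside the cast operator. A condensed sketch of the idea (hypothetical helper, not POCast's real code):

{code}
// Sketch of option 2: at runtime, allow casts the type checker considers
// safe and reject the rest, matching the int-vs-chararray example above.
class RuntimeCastSketch {
    static String castToChararray(Object value) {
        if (value == null) return null;
        if (value instanceof Integer || value instanceof Double) {
            return value.toString(); // int/double -> chararray is allowed
        }
        if (value instanceof String) return (String) value;
        throw new ClassCastException("Cannot cast "
                + value.getClass().getSimpleName() + " to chararray");
    }

    static Integer castToInt(Object value) {
        if (value == null) return null;
        if (value instanceof Integer) return (Integer) value;
        // chararray -> int is the disallowed direction in the example above.
        throw new ClassCastException("Cannot cast "
                + value.getClass().getSimpleName() + " to int");
    }
}
{code}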

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-880) Order by is broken with complex fields

2009-07-31 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-880:


Tags: MapValues Bytearray
  Resolution: Fixed
Hadoop Flags: [Incompatible change, Reviewed]
  Status: Resolved  (was: Patch Available)

Patch has been committed. This fix breaks backward compatibility when 
PigStorage reads maps. The type of the map values will now be bytearray instead 
of the actual type.
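
In behavioural terms, the change means the map-conversion path stops guessing value types and defers typing to a later explicit cast. A sketch of the post-fix shape (assumed method, but DataByteArray is Pig's real wrapper type):

{code}
import java.util.HashMap;
import java.util.Map;

import org.apache.pig.data.DataByteArray;

// Sketch: every map value stays a DataByteArray instead of being parsed
// into int/double/chararray, so a later cast decides the final type.
class BytesToMapSketch {
    static Map<String, Object> bytesToMap(Map<String, byte[]> rawValues) {
        Map<String, Object> result = new HashMap<String, Object>();
        for (Map.Entry<String, byte[]> e : rawValues.entrySet()) {
            result.put(e.getKey(), new DataByteArray(e.getValue()));
        }
        return result;
    }
}
{code}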

> Order by is broken with complex fields
> --
>
> Key: PIG-880
> URL: https://issues.apache.org/jira/browse/PIG-880
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.0
>Reporter: Olga Natkovich
>Assignee: Santhosh Srinivasan
> Fix For: 0.4.0
>
> Attachments: PIG-880-bytearray-mapvalue-code-without-tests.patch, 
> PIG-880_1.patch
>
>
> Pig script:
> a = load 'studentcomplextab10k' as (smap:map[],c2,c3);
> f = foreach a generate smap#'name', smap#'age', smap#'gpa' ;
> s = order f by $0;   
> store s into 'sc.out' 
> Stack:
> Caused by: java.lang.ArrayStoreException
> at java.lang.System.arraycopy(Native Method)
> at java.util.Arrays.copyOf(Arrays.java:2763)
> at java.util.ArrayList.toArray(ArrayList.java:305)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.convertToArray(WeightedRangePartitioner.java:154)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.configure(WeightedRangePartitioner.java:96)
> ... 5 more
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:230)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:179)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:204)
> at 
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
> at 
> org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:769)
> at org.apache.pig.PigServer.execute(PigServer.java:762)
> at org.apache.pig.PigServer.access$100(PigServer.java:91)
> at org.apache.pig.PigServer$Graph.execute(PigServer.java:933)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:245)
> at 
> org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:112)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:140)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88)
> at org.apache.pig.Main.main(Main.java:389)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-697) Proposed improvements to pig's optimizer

2009-07-31 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-697:


Status: Patch Available  (was: In Progress)

> Proposed improvements to pig's optimizer
> 
>
> Key: PIG-697
> URL: https://issues.apache.org/jira/browse/PIG-697
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Alan Gates
>Assignee: Santhosh Srinivasan
> Attachments: Optimizer_Phase5.patch, OptimizerPhase1.patch, 
> OptimizerPhase1_part2.patch, OptimizerPhase2.patch, 
> OptimizerPhase3_parrt1-1.patch, OptimizerPhase3_parrt1.patch, 
> OptimizerPhase3_part2_3.patch, OptimizerPhase4_part1-1.patch, 
> OptimizerPhase4_part2.patch
>
>
> I propose the following changes to pig optimizer, plan, and operator 
> functionality to support more robust optimization:
> 1) Remove the required array from Rule.  This will change rules so that they 
> only match exact patterns instead of allowing missing elements in the pattern.
> This has the downside that if a given rule applies to two patterns (say 
> Load->Filter->Group, Load->Group) you have to write two rules.  But it has 
> the upside that
> the resulting rules know exactly what they are getting.  The original intent 
> of this was to reduce the number of rules that needed to be written.  But the
> resulting rules have to do a lot of work to understand the operators they are 
> working with.  With exact matches only, each rule will know exactly the 
> operators it
> is working on and can apply the logic of shifting the operators around.  All 
> four of the existing rules set all entries of required to true, so removing 
> this
> will have no effect on them.
> 2) Change PlanOptimizer.optimize to iterate over the rules until there are no 
> conversions or a certain number of iterations has been reached.  Currently the
> function is:
> {code}
> public final void optimize() throws OptimizerException {
> RuleMatcher matcher = new RuleMatcher();
> for (Rule rule : mRules) {
> if (matcher.match(rule)) {
> // It matches the pattern.  Now check if the transformer
> // approves as well.
> List<List<O>> matches = matcher.getAllMatches();
> for (List<O> match:matches)
> {
>   if (rule.transformer.check(match)) {
>   // The transformer approves.
>   rule.transformer.transform(match);
>   }
> }
> }
> }
> }
> {code}
> It would change to be:
> {code}
> public final void optimize() throws OptimizerException {
> RuleMatcher matcher = new RuleMatcher();
> boolean sawMatch;
> int numIterations = 0;
> do {
> sawMatch = false;
> for (Rule rule : mRules) {
> List<List<O>> matches = matcher.getAllMatches();
> for (List<O> match:matches) {
> // It matches the pattern.  Now check if the transformer
> // approves as well.
> if (rule.transformer.check(match)) {
> // The transformer approves.
> sawMatch = true;
> rule.transformer.transform(match);
> }
> }
> }
> // Not sure if 1000 is the right number of iterations, maybe it
> // should be configurable so that large scripts don't stop too 
> // early.
> } while (sawMatch && numIterations++ < 1000);
> }
> {code}
> The reason for limiting the number of iterations is to avoid infinite loops.  
> The reason for iterating over the rules is so that each rule can be applied 
> multiple
> times as necessary.  This allows us to write simple rules, mostly swaps 
> between neighboring operators, without worrying that we get the plan right in 
> one pass.
> For example, we might have a plan that looks like:  
> Load->Join->Filter->Foreach, and we want to optimize it to 
> Load->Foreach->Filter->Join.  With two simple
> rules (swap filter and join and swap foreach and filter), applied 
> iteratively, we can get from the initial to final plan, without needing to 
> understand the
> big picture of the entire plan.
> 3) Add three calls to OperatorPlan:
> {code}
> /**
>  * Swap two operators in a plan.  Both of the operators must have single
>  * inputs and single outputs.
>  * @param first operator
>  * @param second operator
>  * @throws PlanException if either operator is not single input and output.
>  */
> public void swap(E first, E second) throws PlanException {
> ...
> }
> /**
>  * Push one operator in front of another.  This function is for use when
>  * the first operator has multipl

[jira] Updated: (PIG-697) Proposed improvements to pig's optimizer

2009-07-31 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-697:


Attachment: Optimizer_Phase5.patch

Attached patch removes references to LOFRJoin and replaces it with LOJoin. All 
the optimization rules and test cases now use LOJoin.

> Proposed improvements to pig's optimizer
> 
>
> Key: PIG-697
> URL: https://issues.apache.org/jira/browse/PIG-697
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Alan Gates
>Assignee: Santhosh Srinivasan
> Attachments: Optimizer_Phase5.patch, OptimizerPhase1.patch, 
> OptimizerPhase1_part2.patch, OptimizerPhase2.patch, 
> OptimizerPhase3_parrt1-1.patch, OptimizerPhase3_parrt1.patch, 
> OptimizerPhase3_part2_3.patch, OptimizerPhase4_part1-1.patch, 
> OptimizerPhase4_part2.patch
>
>
> I propose the following changes to pig optimizer, plan, and operator 
> functionality to support more robust optimization:
> 1) Remove the required array from Rule.  This will change rules so that they 
> only match exact patterns instead of allowing missing elements in the pattern.
> This has the downside that if a given rule applies to two patterns (say 
> Load->Filter->Group, Load->Group) you have to write two rules.  But it has 
> the upside that
> the resulting rules know exactly what they are getting.  The original intent 
> of this was to reduce the number of rules that needed to be written.  But the
> resulting rules have to do a lot of work to understand the operators they are 
> working with.  With exact matches only, each rule will know exactly the 
> operators it
> is working on and can apply the logic of shifting the operators around.  All 
> four of the existing rules set all entries of required to true, so removing 
> this
> will have no effect on them.
> 2) Change PlanOptimizer.optimize to iterate over the rules until there are no 
> conversions or a certain number of iterations has been reached.  Currently the
> function is:
> {code}
> public final void optimize() throws OptimizerException {
> RuleMatcher matcher = new RuleMatcher();
> for (Rule rule : mRules) {
> if (matcher.match(rule)) {
> // It matches the pattern.  Now check if the transformer
> // approves as well.
> List<List<O>> matches = matcher.getAllMatches();
> for (List<O> match:matches)
> {
>   if (rule.transformer.check(match)) {
>   // The transformer approves.
>   rule.transformer.transform(match);
>   }
> }
> }
> }
> }
> {code}
> It would change to be:
> {code}
> public final void optimize() throws OptimizerException {
> RuleMatcher matcher = new RuleMatcher();
> boolean sawMatch;
> int numIterations = 0;
> do {
> sawMatch = false;
> for (Rule rule : mRules) {
> List<List<O>> matches = matcher.getAllMatches();
> for (List<O> match:matches) {
> // It matches the pattern.  Now check if the transformer
> // approves as well.
> if (rule.transformer.check(match)) {
> // The transformer approves.
> sawMatch = true;
> rule.transformer.transform(match);
> }
> }
> }
> // Not sure if 1000 is the right number of iterations, maybe it
> // should be configurable so that large scripts don't stop too 
> // early.
> } while (sawMatch && numIterations++ < 1000);
> }
> {code}
> The reason for limiting the number of iterations is to avoid infinite loops.  
> The reason for iterating over the rules is so that each rule can be applied 
> multiple
> times as necessary.  This allows us to write simple rules, mostly swaps 
> between neighboring operators, without worrying that we get the plan right in 
> one pass.
> For example, we might have a plan that looks like:  
> Load->Join->Filter->Foreach, and we want to optimize it to 
> Load->Foreach->Filter->Join.  With two simple
> rules (swap filter and join and swap foreach and filter), applied 
> iteratively, we can get from the initial to final plan, without needing to 
> understand the
> big picture of the entire plan.
> 3) Add three calls to OperatorPlan:
> {code}
> /**
>  * Swap two operators in a plan.  Both of the operators must have single
>  * inputs and single outputs.
>  * @param first operator
>  * @param second operator
>  * @throws PlanException if either operator is not single input and output.
>  */
> public void swap(E first, E second) throws PlanException {
>   

[jira] Updated: (PIG-697) Proposed improvements to pig's optimizer

2009-07-31 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-697:


Status: In Progress  (was: Patch Available)

> Proposed improvements to pig's optimizer
> 
>
> Key: PIG-697
> URL: https://issues.apache.org/jira/browse/PIG-697
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Alan Gates
>Assignee: Santhosh Srinivasan
> Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, 
> OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, 
> OptimizerPhase3_parrt1.patch, OptimizerPhase3_part2_3.patch, 
> OptimizerPhase4_part1-1.patch, OptimizerPhase4_part2.patch
>
>
> I propose the following changes to pig optimizer, plan, and operator 
> functionality to support more robust optimization:
> 1) Remove the required array from Rule.  This will change rules so that they 
> only match exact patterns instead of allowing missing elements in the pattern.
> This has the downside that if a given rule applies to two patterns (say 
> Load->Filter->Group, Load->Group) you have to write two rules.  But it has 
> the upside that
> the resulting rules know exactly what they are getting.  The original intent 
> of this was to reduce the number of rules that needed to be written.  But the
> resulting rules have to do a lot of work to understand the operators they are 
> working with.  With exact matches only, each rule will know exactly the 
> operators it
> is working on and can apply the logic of shifting the operators around.  All 
> four of the existing rules set all entries of required to true, so removing 
> this
> will have no effect on them.
> 2) Change PlanOptimizer.optimize to iterate over the rules until there are no 
> conversions or a certain number of iterations has been reached.  Currently the
> function is:
> {code}
> public final void optimize() throws OptimizerException {
> RuleMatcher matcher = new RuleMatcher();
> for (Rule rule : mRules) {
> if (matcher.match(rule)) {
> // It matches the pattern.  Now check if the transformer
> // approves as well.
> List<List<O>> matches = matcher.getAllMatches();
> for (List<O> match:matches)
> {
>   if (rule.transformer.check(match)) {
>   // The transformer approves.
>   rule.transformer.transform(match);
>   }
> }
> }
> }
> }
> {code}
> It would change to be:
> {code}
> public final void optimize() throws OptimizerException {
> RuleMatcher matcher = new RuleMatcher();
> boolean sawMatch;
> int numIterations = 0;
> do {
> sawMatch = false;
> for (Rule rule : mRules) {
> List<List<O>> matches = matcher.getAllMatches();
> for (List<O> match:matches) {
> // It matches the pattern.  Now check if the transformer
> // approves as well.
> if (rule.transformer.check(match)) {
> // The transformer approves.
> sawMatch = true;
> rule.transformer.transform(match);
> }
> }
> }
> // Not sure if 1000 is the right number of iterations, maybe it
> // should be configurable so that large scripts don't stop too 
> // early.
> } while (sawMatch && numIterations++ < 1000);
> }
> {code}
> The reason for limiting the number of iterations is to avoid infinite loops.  
> The reason for iterating over the rules is so that each rule can be applied 
> multiple
> times as necessary.  This allows us to write simple rules, mostly swaps 
> between neighboring operators, without worrying that we get the plan right in 
> one pass.
> For example, we might have a plan that looks like:  
> Load->Join->Filter->Foreach, and we want to optimize it to 
> Load->Foreach->Filter->Join.  With two simple
> rules (swap filter and join and swap foreach and filter), applied 
> iteratively, we can get from the initial to final plan, without needing to 
> understand the
> big picture of the entire plan.
> 3) Add three calls to OperatorPlan:
> {code}
> /**
>  * Swap two operators in a plan.  Both of the operators must have single
>  * inputs and single outputs.
>  * @param first operator
>  * @param second operator
>  * @throws PlanException if either operator is not single input and output.
>  */
> public void swap(E first, E second) throws PlanException {
> ...
> }
> /**
>  * Push one operator in front of another.  This function is for use when
>  * the first operator has multiple inputs.  The caller can s

[jira] Updated: (PIG-880) Order by is broken with complex fields

2009-07-30 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-880:


Status: In Progress  (was: Patch Available)

> Order by is broken with complex fields
> --
>
> Key: PIG-880
> URL: https://issues.apache.org/jira/browse/PIG-880
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.0
>Reporter: Olga Natkovich
>Assignee: Santhosh Srinivasan
> Fix For: 0.4.0
>
> Attachments: PIG-880-bytearray-mapvalue-code-without-tests.patch, 
> PIG-880_1.patch
>
>
> Pig script:
> a = load 'studentcomplextab10k' as (smap:map[],c2,c3);
> f = foreach a generate smap#'name', smap#'age', smap#'gpa' ;
> s = order f by $0;   
> store s into 'sc.out' 
> Stack:
> Caused by: java.lang.ArrayStoreException
> at java.lang.System.arraycopy(Native Method)
> at java.util.Arrays.copyOf(Arrays.java:2763)
> at java.util.ArrayList.toArray(ArrayList.java:305)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.convertToArray(WeightedRangePartitioner.java:154)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.configure(WeightedRangePartitioner.java:96)
> ... 5 more
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:230)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:179)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:204)
> at 
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
> at 
> org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:769)
> at org.apache.pig.PigServer.execute(PigServer.java:762)
> at org.apache.pig.PigServer.access$100(PigServer.java:91)
> at org.apache.pig.PigServer$Graph.execute(PigServer.java:933)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:245)
> at 
> org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:112)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:140)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88)
> at org.apache.pig.Main.main(Main.java:389)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-880) Order by is broken with complex fields

2009-07-30 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-880:


Status: Patch Available  (was: In Progress)

> Order by is broken with complex fields
> --
>
> Key: PIG-880
> URL: https://issues.apache.org/jira/browse/PIG-880
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.0
>Reporter: Olga Natkovich
>Assignee: Santhosh Srinivasan
> Fix For: 0.4.0
>
> Attachments: PIG-880-bytearray-mapvalue-code-without-tests.patch, 
> PIG-880_1.patch
>
>
> Pig script:
> a = load 'studentcomplextab10k' as (smap:map[],c2,c3);
> f = foreach a generate smap#'name', smap#'age', smap#'gpa' ;
> s = order f by $0;   
> store s into 'sc.out' 
> Stack:
> Caused by: java.lang.ArrayStoreException
> at java.lang.System.arraycopy(Native Method)
> at java.util.Arrays.copyOf(Arrays.java:2763)
> at java.util.ArrayList.toArray(ArrayList.java:305)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.convertToArray(WeightedRangePartitioner.java:154)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.configure(WeightedRangePartitioner.java:96)
> ... 5 more
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:230)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:179)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:204)
> at 
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
> at 
> org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:769)
> at org.apache.pig.PigServer.execute(PigServer.java:762)
> at org.apache.pig.PigServer.access$100(PigServer.java:91)
> at org.apache.pig.PigServer$Graph.execute(PigServer.java:933)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:245)
> at 
> org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:112)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:140)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88)
> at org.apache.pig.Main.main(Main.java:389)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-880) Order by is broken with complex fields

2009-07-30 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-880:


Attachment: PIG-880_1.patch

Attaching a new patch that fixes a couple of unit tests.

> Order by is broken with complex fields
> --
>
> Key: PIG-880
> URL: https://issues.apache.org/jira/browse/PIG-880
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.0
>Reporter: Olga Natkovich
>Assignee: Santhosh Srinivasan
> Fix For: 0.4.0
>
> Attachments: PIG-880-bytearray-mapvalue-code-without-tests.patch, 
> PIG-880_1.patch
>
>
> Pig script:
> a = load 'studentcomplextab10k' as (smap:map[],c2,c3);
> f = foreach a generate smap#'name', smap#'age', smap#'gpa' ;
> s = order f by $0;   
> store s into 'sc.out' 
> Stack:
> Caused by: java.lang.ArrayStoreException
> at java.lang.System.arraycopy(Native Method)
> at java.util.Arrays.copyOf(Arrays.java:2763)
> at java.util.ArrayList.toArray(ArrayList.java:305)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.convertToArray(WeightedRangePartitioner.java:154)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.configure(WeightedRangePartitioner.java:96)
> ... 5 more
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:230)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:179)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:204)
> at 
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
> at 
> org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:769)
> at org.apache.pig.PigServer.execute(PigServer.java:762)
> at org.apache.pig.PigServer.access$100(PigServer.java:91)
> at org.apache.pig.PigServer$Graph.execute(PigServer.java:933)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:245)
> at 
> org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:112)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:140)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88)
> at org.apache.pig.Main.main(Main.java:389)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-880) Order by is broken with complex fields

2009-07-30 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-880:


Attachment: (was: PIG-880.patch)

> Order by is broken with complex fields
> --
>
> Key: PIG-880
> URL: https://issues.apache.org/jira/browse/PIG-880
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.0
>Reporter: Olga Natkovich
>Assignee: Santhosh Srinivasan
> Fix For: 0.4.0
>
> Attachments: PIG-880-bytearray-mapvalue-code-without-tests.patch, 
> PIG-880_1.patch
>
>
> Pig script:
> a = load 'studentcomplextab10k' as (smap:map[],c2,c3);
> f = foreach a generate smap#'name', smap#'age', smap#'gpa' ;
> s = order f by $0;   
> store s into 'sc.out' 
> Stack:
> Caused by: java.lang.ArrayStoreException
> at java.lang.System.arraycopy(Native Method)
> at java.util.Arrays.copyOf(Arrays.java:2763)
> at java.util.ArrayList.toArray(ArrayList.java:305)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.convertToArray(WeightedRangePartitioner.java:154)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.configure(WeightedRangePartitioner.java:96)
> ... 5 more
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:230)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:179)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:204)
> at 
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
> at 
> org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:769)
> at org.apache.pig.PigServer.execute(PigServer.java:762)
> at org.apache.pig.PigServer.access$100(PigServer.java:91)
> at org.apache.pig.PigServer$Graph.execute(PigServer.java:933)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:245)
> at 
> org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:112)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:140)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88)
> at org.apache.pig.Main.main(Main.java:389)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-898) TextDataParser does not handle delimiters from one complex type in another

2009-07-30 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12737319#action_12737319
 ] 

Santhosh Srinivasan commented on PIG-898:
-

In addition, empty bags, tuples and constants, and nulls are not handled.

> TextDataParser does not handle delimiters from one complex type in another
> --
>
> Key: PIG-898
> URL: https://issues.apache.org/jira/browse/PIG-898
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.4.0
>Reporter: Santhosh Srinivasan
>Priority: Minor
> Fix For: 0.4.0
>
>
> Currently, TextDataParser does not handle delimiters of one complex type in 
> another. An example of such a case is key1(#value1}, which will not be parsed 
> correctly. The production for strings matches any sequence of characters that 
> does not contain any delimiters for the complex types.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-880) Order by is broken with complex fields

2009-07-30 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-880:


Status: Patch Available  (was: In Progress)

> Order by is broken with complex fields
> --
>
> Key: PIG-880
> URL: https://issues.apache.org/jira/browse/PIG-880
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.0
>Reporter: Olga Natkovich
>Assignee: Santhosh Srinivasan
> Fix For: 0.4.0
>
> Attachments: PIG-880-bytearray-mapvalue-code-without-tests.patch, 
> PIG-880.patch
>
>
> Pig script:
> a = load 'studentcomplextab10k' as (smap:map[],c2,c3);
> f = foreach a generate smap#'name', smap#'age', smap#'gpa' ;
> s = order f by $0;   
> store s into 'sc.out' 
> Stack:
> Caused by: java.lang.ArrayStoreException
> at java.lang.System.arraycopy(Native Method)
> at java.util.Arrays.copyOf(Arrays.java:2763)
> at java.util.ArrayList.toArray(ArrayList.java:305)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.convertToArray(WeightedRangePartitioner.java:154)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.configure(WeightedRangePartitioner.java:96)
> ... 5 more
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:230)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:179)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:204)
> at 
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
> at 
> org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:769)
> at org.apache.pig.PigServer.execute(PigServer.java:762)
> at org.apache.pig.PigServer.access$100(PigServer.java:91)
> at org.apache.pig.PigServer$Graph.execute(PigServer.java:933)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:245)
> at 
> org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:112)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:140)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88)
> at org.apache.pig.Main.main(Main.java:389)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Work started: (PIG-880) Order by is broken with complex fields

2009-07-30 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on PIG-880 started by Santhosh Srinivasan.

> Order by is broken with complex fields
> --
>
> Key: PIG-880
> URL: https://issues.apache.org/jira/browse/PIG-880
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.0
>Reporter: Olga Natkovich
>Assignee: Santhosh Srinivasan
> Fix For: 0.4.0
>
> Attachments: PIG-880-bytearray-mapvalue-code-without-tests.patch, 
> PIG-880.patch
>
>
> Pig script:
> a = load 'studentcomplextab10k' as (smap:map[],c2,c3);
> f = foreach a generate smap#'name', smap#'age', smap#'gpa';
> s = order f by $0;
> store s into 'sc.out';
> Stack:
> Caused by: java.lang.ArrayStoreException
> at java.lang.System.arraycopy(Native Method)
> at java.util.Arrays.copyOf(Arrays.java:2763)
> at java.util.ArrayList.toArray(ArrayList.java:305)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.convertToArray(WeightedRangePartitioner.java:154)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.configure(WeightedRangePartitioner.java:96)
> ... 5 more
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:230)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:179)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:204)
> at 
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
> at 
> org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:769)
> at org.apache.pig.PigServer.execute(PigServer.java:762)
> at org.apache.pig.PigServer.access$100(PigServer.java:91)
> at org.apache.pig.PigServer$Graph.execute(PigServer.java:933)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:245)
> at 
> org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:112)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:140)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88)
> at org.apache.pig.Main.main(Main.java:389)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-880) Order by is broken with complex fields

2009-07-30 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-880:


Status: Open  (was: Patch Available)

> Order by is broken with complex fields
> --
>
> Key: PIG-880
> URL: https://issues.apache.org/jira/browse/PIG-880
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.0
>Reporter: Olga Natkovich
>Assignee: Santhosh Srinivasan
> Fix For: 0.4.0
>
> Attachments: PIG-880-bytearray-mapvalue-code-without-tests.patch, 
> PIG-880.patch
>
>
> Pig script:
> a = load 'studentcomplextab10k' as (smap:map[],c2,c3);
> f = foreach a generate smap#'name', smap#'age', smap#'gpa';
> s = order f by $0;
> store s into 'sc.out';
> Stack:
> Caused by: java.lang.ArrayStoreException
> at java.lang.System.arraycopy(Native Method)
> at java.util.Arrays.copyOf(Arrays.java:2763)
> at java.util.ArrayList.toArray(ArrayList.java:305)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.convertToArray(WeightedRangePartitioner.java:154)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.configure(WeightedRangePartitioner.java:96)
> ... 5 more
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:230)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:179)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:204)
> at 
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
> at 
> org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:769)
> at org.apache.pig.PigServer.execute(PigServer.java:762)
> at org.apache.pig.PigServer.access$100(PigServer.java:91)
> at org.apache.pig.PigServer$Graph.execute(PigServer.java:933)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:245)
> at 
> org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:112)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:140)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88)
> at org.apache.pig.Main.main(Main.java:389)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-898) TextDataParser does not handle delimiters from one complex type in another

2009-07-29 Thread Santhosh Srinivasan (JIRA)
TextDataParser does not handle delimiters from one complex type in another
--

 Key: PIG-898
 URL: https://issues.apache.org/jira/browse/PIG-898
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.4.0
Reporter: Santhosh Srinivasan
Priority: Minor
 Fix For: 0.4.0


Currently, TextDataParser does not handle delimiters of one complex type 
appearing inside another. An example of such a case is key1(#value1}, which 
will not be parsed correctly. The production for strings matches any sequence 
of characters that does not contain any delimiters for the complex types.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-880) Order by is broken with complex fields

2009-07-29 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-880:


Status: Patch Available  (was: Open)

> Order by is broken with complex fields
> --
>
> Key: PIG-880
> URL: https://issues.apache.org/jira/browse/PIG-880
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.0
>Reporter: Olga Natkovich
>Assignee: Santhosh Srinivasan
> Fix For: 0.4.0
>
> Attachments: PIG-880-bytearray-mapvalue-code-without-tests.patch, 
> PIG-880.patch
>
>
> Pig script:
> a = load 'studentcomplextab10k' as (smap:map[],c2,c3);
> f = foreach a generate smap#'name', smap#'age', smap#'gpa';
> s = order f by $0;
> store s into 'sc.out';
> Stack:
> Caused by: java.lang.ArrayStoreException
> at java.lang.System.arraycopy(Native Method)
> at java.util.Arrays.copyOf(Arrays.java:2763)
> at java.util.ArrayList.toArray(ArrayList.java:305)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.convertToArray(WeightedRangePartitioner.java:154)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.configure(WeightedRangePartitioner.java:96)
> ... 5 more
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:230)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:179)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:204)
> at 
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
> at 
> org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:769)
> at org.apache.pig.PigServer.execute(PigServer.java:762)
> at org.apache.pig.PigServer.access$100(PigServer.java:91)
> at org.apache.pig.PigServer$Graph.execute(PigServer.java:933)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:245)
> at 
> org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:112)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:140)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88)
> at org.apache.pig.Main.main(Main.java:389)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-880) Order by is broken with complex fields

2009-07-29 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-880:


Attachment: PIG-880.patch

The attached patch creates maps with the value type set to DataByteArray (i.e., 
bytearray) for text data parsed by PigStorage. This change is consistent with 
the language semantics of treating the map value type as bytearray. New test 
cases have been added.
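
As a minimal sketch of the idea only (the helper name is illustrative, not the 
patch itself), every parsed map value becomes an uninterpreted bytearray:

{code}
import org.apache.pig.data.DataByteArray;

// Sketch: map values from text data are stored as DataByteArray (Pig's
// bytearray) instead of being interpreted as int, long, etc. Later casts
// in the script decide the real type of the value.
public class MapValueSketch {
    static Object parseMapValue(String rawValue) {
        return rawValue == null ? null : new DataByteArray(rawValue.getBytes());
    }
}
{code}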

> Order by is broken with complex fields
> --
>
> Key: PIG-880
> URL: https://issues.apache.org/jira/browse/PIG-880
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.0
>Reporter: Olga Natkovich
>Assignee: Santhosh Srinivasan
> Fix For: 0.4.0
>
> Attachments: PIG-880-bytearray-mapvalue-code-without-tests.patch, 
> PIG-880.patch
>
>
> Pig script:
> a = load 'studentcomplextab10k' as (smap:map[],c2,c3);
> f = foreach a generate smap#'name', smap#'age', smap#'gpa';
> s = order f by $0;
> store s into 'sc.out';
> Stack:
> Caused by: java.lang.ArrayStoreException
> at java.lang.System.arraycopy(Native Method)
> at java.util.Arrays.copyOf(Arrays.java:2763)
> at java.util.ArrayList.toArray(ArrayList.java:305)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.convertToArray(WeightedRangePartitioner.java:154)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.configure(WeightedRangePartitioner.java:96)
> ... 5 more
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:230)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:179)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:204)
> at 
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
> at 
> org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:769)
> at org.apache.pig.PigServer.execute(PigServer.java:762)
> at org.apache.pig.PigServer.access$100(PigServer.java:91)
> at org.apache.pig.PigServer$Graph.execute(PigServer.java:933)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:245)
> at 
> org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:112)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:140)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88)
> at org.apache.pig.Main.main(Main.java:389)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-880) Order by is broken with complex fields

2009-07-29 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan reassigned PIG-880:
---

Assignee: Santhosh Srinivasan

> Order by is broken with complex fields
> --
>
> Key: PIG-880
> URL: https://issues.apache.org/jira/browse/PIG-880
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.0
>Reporter: Olga Natkovich
>Assignee: Santhosh Srinivasan
> Fix For: 0.4.0
>
> Attachments: PIG-880-bytearray-mapvalue-code-without-tests.patch
>
>
> Pig script:
> a = load 'studentcomplextab10k' as (smap:map[],c2,c3);
> f = foreach a generate smap#'name', smap#'age', smap#'gpa';
> s = order f by $0;
> store s into 'sc.out';
> Stack:
> Caused by: java.lang.ArrayStoreException
> at java.lang.System.arraycopy(Native Method)
> at java.util.Arrays.copyOf(Arrays.java:2763)
> at java.util.ArrayList.toArray(ArrayList.java:305)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.convertToArray(WeightedRangePartitioner.java:154)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.configure(WeightedRangePartitioner.java:96)
> ... 5 more
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:230)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:179)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:204)
> at 
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
> at 
> org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:769)
> at org.apache.pig.PigServer.execute(PigServer.java:762)
> at org.apache.pig.PigServer.access$100(PigServer.java:91)
> at org.apache.pig.PigServer$Graph.execute(PigServer.java:933)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:245)
> at 
> org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:112)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:140)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88)
> at org.apache.pig.Main.main(Main.java:389)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-897) Pig should support counters

2009-07-29 Thread Santhosh Srinivasan (JIRA)
Pig should support counters
---

 Key: PIG-897
 URL: https://issues.apache.org/jira/browse/PIG-897
 Project: Pig
  Issue Type: New Feature
  Components: impl
Affects Versions: 0.4.0
Reporter: Santhosh Srinivasan
 Fix For: 0.4.0


Pig should support the use of counters. Counters could be exposed via the 
script or via Java APIs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-889) Pig can not access reporter of PigHadoopLog in Load Func

2009-07-29 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan resolved PIG-889.
-

  Resolution: Won't Fix
Release Note: As per the discussion with Jeff, closing the bug as won't fix

> Pig can not access reporter of PigHadoopLog in Load Func
> 
>
> Key: PIG-889
> URL: https://issues.apache.org/jira/browse/PIG-889
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.4.0
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Fix For: 0.4.0
>
> Attachments: Pig_889_Patch.txt
>
>
> I'd like to increment a Counter in my own LoadFunc, but it throws a 
> NullPointerException. It seems that the reporter is not initialized.
> I looked into this problem and found that it needs to call 
> PigHadoopLogger.getInstance().setReporter(reporter) in PigInputFormat.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-889) Pig can not access reporter of PigHadoopLog in Load Func

2009-07-29 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736990#action_12736990
 ] 

Santhosh Srinivasan commented on PIG-889:
-

PigHadoopLogger implements the PigLogger interface. As part of the 
implementation it uses the Hadoop reporter for aggregating the warning messages.

> Pig can not access reporter of PigHadoopLog in Load Func
> 
>
> Key: PIG-889
> URL: https://issues.apache.org/jira/browse/PIG-889
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.4.0
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Fix For: 0.4.0
>
> Attachments: Pig_889_Patch.txt
>
>
> I'd like to increment a Counter in my own LoadFunc, but it throws a 
> NullPointerException. It seems that the reporter is not initialized.
> I looked into this problem and found that it needs to call 
> PigHadoopLogger.getInstance().setReporter(reporter) in PigInputFormat.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-882) log level not propogated to loggers

2009-07-28 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736359#action_12736359
 ] 

Santhosh Srinivasan commented on PIG-882:
-

Minor comment:

Index: src/org/apache/pig/Main.java
===

Instead of printing the warning message to stdout, it should be printed to 
stderr.

{code}
+catch (IOException e)
+{
+    System.out.println("Warn: Cannot open log4j properties file, use default");
+}
{code}
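A minimal sketch of the suggested change, mirroring the same catch block but 
writing to stderr (my suggestion only, not the patch under review):

{code}
catch (IOException e)
{
    // Suggested: warnings belong on stderr so stdout stays clean for output
    System.err.println("Warning: Cannot open log4j properties file, using default");
}
{code}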


The rest of the patch looks fine.

> log level not propogated to loggers 
> 
>
> Key: PIG-882
> URL: https://issues.apache.org/jira/browse/PIG-882
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Thejas M Nair
> Attachments: PIG-882-1.patch, PIG-882-2.patch
>
>
> Pig accepts log level as a parameter. But the log level it captures is not 
> set appropriately, so that loggers in different classes log at the specified 
> level.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-660) Integration with Hadoop 0.20

2009-07-28 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736283#action_12736283
 ] 

Santhosh Srinivasan commented on PIG-660:
-

The build.xml in the patch(es) has the reference to hadoop20.jar. The missing 
part is the hadoop20.jar that Pig can use to build its sources. Pig cannot use 
the hadoop20.jar coming from the Hadoop release.

> Integration with Hadoop 0.20
> 
>
> Key: PIG-660
> URL: https://issues.apache.org/jira/browse/PIG-660
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.2.0
> Environment: Hadoop 0.20
>Reporter: Santhosh Srinivasan
>Assignee: Santhosh Srinivasan
> Fix For: 0.4.0
>
> Attachments: PIG-660.patch, PIG-660_1.patch, PIG-660_2.patch, 
> PIG-660_3.patch, PIG-660_4.patch, PIG-660_5.patch
>
>
> With Hadoop 0.20, it will be possible to query the status of each map and 
> reduce in a map reduce job. This will allow better error reporting. Some of 
> the other items that could be on Hadoop's feature requests/bugs are 
> documented here for tracking.
> 1. Hadoop should return objects instead of strings when exceptions are thrown
> 2. The JobControl should handle all exceptions and report them appropriately. 
> For example, when the JobControl fails to launch jobs, it should handle 
> exceptions appropriately and should support APIs that query this state, i.e., 
> failure to launch jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-889) Pig can not access reporter of PigHadoopLog in Load Func

2009-07-28 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736201#action_12736201
 ] 

Santhosh Srinivasan commented on PIG-889:
-

The PigHadoopLogger implements the PigLogger interface. The only method this 
interface supports is warn(). Supporting counters as part of Pig will involve 
part of what you suggest. While your implementation extends PigHadoopLogger, it 
is not generic enough to support counters in Pig. Other load functions would 
have to hold direct references to PigHadoopLogger, which is not the correct way 
of accessing and updating counters. Pig needs to extend the load function 
interface (and store function interface?) to allow access to counters.

Summary: Pig needs to support counters, and that is a slightly bigger topic. 
Extending the functionality of existing classes that are meant for a different 
purpose will make support difficult in the future.

If you agree, can we mark this issue as invalid and open a new jira that will 
capture the requirements for supporting counters in Pig?
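
For illustration only, a hypothetical sketch of what such an extension could 
look like; these names are invented for this comment and are not part of Pig's 
API:

{code}
// Hypothetical counter facade, decoupled from Hadoop's Reporter so that
// local mode can supply a no-op implementation instead of a null reporter.
interface CounterReporter {
    void incrCounter(String group, String counter, long delta);
}

// Hypothetical mixin that the load/store function interfaces could extend,
// letting Pig hand the facade to user code before getNext() is ever called.
interface CounterAware {
    void setCounterReporter(CounterReporter reporter);
}
{code}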

> Pig can not access reporter of PigHadoopLog in Load Func
> 
>
> Key: PIG-889
> URL: https://issues.apache.org/jira/browse/PIG-889
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.4.0
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Fix For: 0.4.0
>
> Attachments: Pig_889_Patch.txt
>
>
> I'd like to increment a Counter in my own LoadFunc, but it throws a 
> NullPointerException. It seems that the reporter is not initialized.
> I looked into this problem and found that it needs to call 
> PigHadoopLogger.getInstance().setReporter(reporter) in PigInputFormat.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-889) Pig can not access reporter of PigHadoopLog in Load Func

2009-07-24 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12735129#action_12735129
 ] 

Santhosh Srinivasan commented on PIG-889:
-

The issue here is the lack of support for counters within Pig.

The intention of the warn method in the PigLogger interface was to allow 
sources within Pig and UDFs to aggregate warnings. Your use of the reporter 
within the logger is not supported. An implementation detail prevents the 
correct use of this interface for load functions. The Hadoop reporter object is 
provided in the getRecordReader, map and reduce calls. For load functions, Pig 
provides an interface and for UDFs, an abstract class. As a result, the logger 
instance cannot be initialized in the loaders until we decide to add a method 
to support it.

Will having the code from PigMapBase.map() in PigInputFormat.getRecordReader() 
work for you?

{code}
PigHadoopLogger pigHadoopLogger = PigHadoopLogger.getInstance();
pigHadoopLogger.setAggregate(aggregateWarning);
pigHadoopLogger.setReporter(reporter);
PhysicalOperator.setPigLogger(pigHadoopLogger);
{code}

Note that this is a workaround for your situation. I would highly recommend 
that you move to the use of counters when they are supported.

> Pig can not access reporter of PigHadoopLog in Load Func
> 
>
> Key: PIG-889
> URL: https://issues.apache.org/jira/browse/PIG-889
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.4.0
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Fix For: 0.4.0
>
> Attachments: Pig_889_Patch.txt
>
>
> I'd like to increment a Counter in my own LoadFunc, but it throws a 
> NullPointerException. It seems that the reporter is not initialized.
> I looked into this problem and found that it needs to call 
> PigHadoopLogger.getInstance().setReporter(reporter) in PigInputFormat.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-773) Empty complex constants (empty bag, empty tuple and empty map) should be supported

2009-07-23 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-773:


  Resolution: Fixed
Hadoop Flags: [Reviewed]
  Status: Resolved  (was: Patch Available)

Patch has been committed. Thanks for the fix Ashutosh.

> Empty complex constants (empty bag, empty tuple and empty map) should be 
> supported
> --
>
> Key: PIG-773
> URL: https://issues.apache.org/jira/browse/PIG-773
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.0
>Reporter: Pradeep Kamath
>Assignee: Ashutosh Chauhan
>Priority: Minor
> Fix For: 0.4.0
>
> Attachments: pig-773.patch, pig-773_v2.patch, pig-773_v3.patch, 
> pig-773_v4.patch, pig-773_v5.patch
>
>
> We should be able to create an empty bag constant using {}, an empty tuple 
> constant using (), and an empty map constant using [] within a Pig script

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-773) Empty complex constants (empty bag, empty tuple and empty map) should be supported

2009-07-23 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734810#action_12734810
 ] 

Santhosh Srinivasan commented on PIG-773:
-

+1 for the changes.

> Empty complex constants (empty bag, empty tuple and empty map) should be 
> supported
> --
>
> Key: PIG-773
> URL: https://issues.apache.org/jira/browse/PIG-773
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.0
>Reporter: Pradeep Kamath
>Assignee: Ashutosh Chauhan
>Priority: Minor
> Fix For: 0.4.0
>
> Attachments: pig-773.patch, pig-773_v2.patch, pig-773_v3.patch, 
> pig-773_v4.patch, pig-773_v5.patch
>
>
> We should be able to create an empty bag constant using {}, an empty tuple 
> constant using (), and an empty map constant using [] within a Pig script

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-892) Make COUNT and AVG deal with nulls according to the SQL standard

2009-07-23 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734806#action_12734806
 ] 

Santhosh Srinivasan commented on PIG-892:
-

1. Index: src/org/apache/pig/builtin/FloatAvg.java
===

The size of 't' is not checked before calling t.get(0) in the count method (see 
the sketch after these comments).


{code}
+if (t != null && t.get(0) != null)
+cnt++;
+}
{code}

2. Index: src/org/apache/pig/builtin/IntAvg.java
===

Same comment as FloatAvg.java

3. Index: src/org/apache/pig/builtin/DoubleAvg.java
===

Same comment as FloatAvg.java

4. Index: src/org/apache/pig/builtin/AVG.java
===

Same comment as FloatAvg.java

5. Index: src/org/apache/pig/builtin/LongAvg.java
===

Same comment as FloatAvg.java


6. Index: src/org/apache/pig/builtin/COUNT_STAR.java
===

I am not sure about the naming convention here. None of the built-in functions 
have a special character in the class name. COUNTSTAR would be better than 
COUNT_STAR.
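
A minimal sketch of the missing guard from item 1, which applies equally to the 
other Avg classes above (imports from org.apache.pig.data and the backend 
ExecException are assumed; the surrounding accumulator code is elided):

{code}
// Sketch: check the tuple's size before calling t.get(0), so empty tuples
// do not throw and are simply not counted.
static long count(DataBag values) throws ExecException {
    long cnt = 0;
    for (Tuple t : values) {
        if (t != null && t.size() > 0 && t.get(0) != null)
            cnt++;
    }
    return cnt;
}
{code}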


> Make COUNT and AVG deal with nulls according to the SQL standard
> ---
>
> Key: PIG-892
> URL: https://issues.apache.org/jira/browse/PIG-892
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.3.0
>Reporter: Olga Natkovich
>Assignee: Olga Natkovich
> Fix For: 0.4.0
>
> Attachments: PIG-892.patch, PIG-892_v2.patch
>
>
> Both COUNT and AVG need to ignore nulls. Also, add COUNT_STAR to match 
> COUNT(*) in SQL.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-892) Make COUNT and AVG deal with nulls according to the SQL standard

2009-07-22 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734417#action_12734417
 ] 

Santhosh Srinivasan commented on PIG-892:
-

I am reviewing the patch.

> Make COUNT and AVG deal with nulls according to the SQL standard
> ---
>
> Key: PIG-892
> URL: https://issues.apache.org/jira/browse/PIG-892
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.3.0
>Reporter: Olga Natkovich
>Assignee: Olga Natkovich
> Fix For: 0.4.0
>
> Attachments: PIG-892.patch, PIG-892_v2.patch
>
>
> Both COUNT and AVG need to ignore nulls. Also, add COUNT_STAR to match 
> COUNT(*) in SQL.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-889) Pig can not access reporter of PigHadoopLog in Load Func

2009-07-21 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733825#action_12733825
 ] 

Santhosh Srinivasan commented on PIG-889:
-

Comments:

The reporter inside the logger is set up correctly in PigInputFormat for 
Hadoop. However, the usage of the logger to retrieve the reporter and then 
increment counters is flawed for the following reasons:

1. In the test case, the new loader uses PigHadoopLogger directly. When the 
loader is used in local mode, the notion of Hadoop disappears and the reference 
to PigHadoopLogger is not usable (i.e., will result in a NullPointerException).

{code}
+   @Override
+   public Tuple getNext() throws IOException {
+   PigHadoopLogger.getInstance().getReporter().incrCounter(
+   MyCounter.TupleCounter, 1);
+   return super.getNext();
+   }
{code}

2. The loggers were meant for warning aggregation. Here, there is a case being 
made to expand the capabilities to allow user-defined counter aggregations. If 
that's the case, then new methods have to be added to the PigLogger interface.

> Pig can not access reporter of PigHadoopLog in Load Func
> 
>
> Key: PIG-889
> URL: https://issues.apache.org/jira/browse/PIG-889
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.4.0
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Fix For: 0.4.0
>
> Attachments: Pig_889_Patch.txt
>
>
> I'd like to increment a Counter in my own LoadFunc, but it throws a 
> NullPointerException. It seems that the reporter is not initialized.
> I looked into this problem and found that it needs to call 
> PigHadoopLogger.getInstance().setReporter(reporter) in PigInputFormat.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-893) support cast of chararray to other simple types

2009-07-21 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733773#action_12733773
 ] 

Santhosh Srinivasan commented on PIG-893:
-

What are the semantics of casting chararray (string) to numeric types? 

Pig does not support conversion of any non-bytearray type to bytearray. The 
proposal in the jira description is minimalistic. Does it match that of SQL? 

Without a clear articulation of what these conversions mean, we cannot/should 
not support chararray to numeric type conversions. PiggyBank already supports 
UDFs that convert strings to int, double, etc.

It's nice to have as part of the language, but it's better positioned as a UDF. 
If clear semantics are laid out, then making it part of the language will be a 
matter of consensus.
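
For illustration, a minimal sketch of such a UDF with null-on-failure 
semantics; the class name is hypothetical and PiggyBank's actual converters may 
differ:

{code}
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.PigWarning;
import org.apache.pig.data.Tuple;

// Hypothetical UDF: cast a chararray to an Integer, returning null and
// aggregating a warning when parsing fails (e.g., overflow, bad format).
public class StringToInt extends EvalFunc<Integer> {
    @Override
    public Integer exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null)
            return null;
        try {
            return Integer.parseInt((String) input.get(0));
        } catch (NumberFormatException e) {
            warn("Cannot cast chararray to int, returning null",
                 PigWarning.UDF_WARNING_1);
            return null;
        }
    }
}
{code}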

> support cast of chararray to other simple types
> ---
>
> Key: PIG-893
> URL: https://issues.apache.org/jira/browse/PIG-893
> Project: Pig
>  Issue Type: New Feature
>Reporter: Thejas M Nair
>
> Pig should support casting of chararray to integer, long, float, double, and 
> bytearray. If the conversion fails for reasons such as overflow, the cast 
> should return null and log a warning.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-695) Pig should not fail when error logs cannot be created

2009-07-21 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-695:


Fix Version/s: 0.4.0

> Pig should not fail when error logs cannot be created
> -
>
> Key: PIG-695
> URL: https://issues.apache.org/jira/browse/PIG-695
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.2.0
>Reporter: Santhosh Srinivasan
>Assignee: Santhosh Srinivasan
> Fix For: 0.4.0
>
> Attachments: PIG-695.patch
>
>
> Currently, PIG validates the log file location and fails/exits when the log 
> file cannot be created. Instead, it should print a warning and continue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-695) Pig should not fail when error logs cannot be created

2009-07-21 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-695:


  Resolution: Fixed
Hadoop Flags: [Reviewed]
  Status: Resolved  (was: Patch Available)

Patch has been committed.

> Pig should not fail when error logs cannot be created
> -
>
> Key: PIG-695
> URL: https://issues.apache.org/jira/browse/PIG-695
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.2.0
>Reporter: Santhosh Srinivasan
>Assignee: Santhosh Srinivasan
> Attachments: PIG-695.patch
>
>
> Currently, PIG validates the log file location and fails/exits when the log 
> file cannot be created. Instead, it should print a warning and continue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-695) Pig should not fail when error logs cannot be created

2009-07-20 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733416#action_12733416
 ] 

Santhosh Srinivasan commented on PIG-695:
-

No unit tests have been added for this fix, since testing it would mean testing 
Main; currently, there are no unit tests for Main.

> Pig should not fail when error logs cannot be created
> -
>
> Key: PIG-695
> URL: https://issues.apache.org/jira/browse/PIG-695
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.2.0
>Reporter: Santhosh Srinivasan
>Assignee: Santhosh Srinivasan
> Attachments: PIG-695.patch
>
>
> Currently, PIG validates the log file location and fails/exits when the log 
> file cannot be created. Instead, it should print a warning and continue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-880) Order by is broken with complex fields

2009-07-19 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733018#action_12733018
 ] 

Santhosh Srinivasan commented on PIG-880:
-

Review Comment:

PigStorage should read map values as uninterpreted strings instead of guessing 
their types. This way, values such as integers that are too long to fit into an 
Integer will still be treated as bytearray.

> Order by is broken with complex fields
> --
>
> Key: PIG-880
> URL: https://issues.apache.org/jira/browse/PIG-880
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.0
>Reporter: Olga Natkovich
> Fix For: 0.4.0
>
> Attachments: PIG-880-bytearray-mapvalue-code-without-tests.patch
>
>
> Pig script:
> a = load 'studentcomplextab10k' as (smap:map[],c2,c3);
> f = foreach a generate smap#'name', smap#'age', smap#'gpa';
> s = order f by $0;
> store s into 'sc.out';
> Stack:
> Caused by: java.lang.ArrayStoreException
> at java.lang.System.arraycopy(Native Method)
> at java.util.Arrays.copyOf(Arrays.java:2763)
> at java.util.ArrayList.toArray(ArrayList.java:305)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.convertToArray(WeightedRangePartitioner.java:154)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.configure(WeightedRangePartitioner.java:96)
> ... 5 more
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:230)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:179)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:204)
> at 
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
> at 
> org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:769)
> at org.apache.pig.PigServer.execute(PigServer.java:762)
> at org.apache.pig.PigServer.access$100(PigServer.java:91)
> at org.apache.pig.PigServer$Graph.execute(PigServer.java:933)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:245)
> at 
> org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:112)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:140)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88)
> at org.apache.pig.Main.main(Main.java:389)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Work stopped: (PIG-695) Pig should not fail when error logs cannot be created

2009-07-17 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on PIG-695 stopped by Santhosh Srinivasan.

> Pig should not fail when error logs cannot be created
> -
>
> Key: PIG-695
> URL: https://issues.apache.org/jira/browse/PIG-695
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.2.0
>Reporter: Santhosh Srinivasan
>Assignee: Santhosh Srinivasan
> Attachments: PIG-695.patch
>
>
> Currently, PIG validates the log file location and fails/exits when the log 
> file cannot be created. Instead, it should print a warning and continue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-695) Pig should not fail when error logs cannot be created

2009-07-17 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-695:


Status: Patch Available  (was: Open)

> Pig should not fail when error logs cannot be created
> -
>
> Key: PIG-695
> URL: https://issues.apache.org/jira/browse/PIG-695
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.2.0
>Reporter: Santhosh Srinivasan
>Assignee: Santhosh Srinivasan
> Attachments: PIG-695.patch
>
>
> Currently, PIG validates the log file location and fails/exits when the log 
> file cannot be created. Instead, it should print a warning and continue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Work started: (PIG-695) Pig should not fail when error logs cannot be created

2009-07-17 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on PIG-695 started by Santhosh Srinivasan.

> Pig should not fail when error logs cannot be created
> -
>
> Key: PIG-695
> URL: https://issues.apache.org/jira/browse/PIG-695
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.2.0
>Reporter: Santhosh Srinivasan
>Assignee: Santhosh Srinivasan
> Attachments: PIG-695.patch
>
>
> Currently, PIG validates the log file location and fails/exits when the log 
> file cannot be created. Instead, it should print a warning and continue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-695) Pig should not fail when error logs cannot be created

2009-07-17 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-695:


Attachment: PIG-695.patch

Attached patch ensures that Pig does not error out when the error log file is 
not writable. 
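
A minimal sketch of the behavior described, assuming the validation lives in 
Main (variable names illustrative; java.io.File and java.io.IOException 
assumed):

{code}
// Sketch: warn and continue instead of exiting when the error log
// location is not writable.
File logFile = new File(logFileName);
try {
    if (!logFile.exists() && !logFile.createNewFile()) {
        throw new IOException("cannot create " + logFileName);
    }
} catch (IOException e) {
    System.err.println("Warning: cannot write to error log file " + logFileName
            + "; continuing without a client-side error log.");
    logFileName = null; // disables file logging downstream
}
{code}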

> Pig should not fail when error logs cannot be created
> -
>
> Key: PIG-695
> URL: https://issues.apache.org/jira/browse/PIG-695
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.2.0
>Reporter: Santhosh Srinivasan
>Assignee: Santhosh Srinivasan
> Attachments: PIG-695.patch
>
>
> Currently, PIG validates the log file location and fails/exits when the log 
> file cannot be created. Instead, it should print a warning and continue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-728) All backend error messages must be logged to preserve the original error messages

2009-07-16 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-728:


  Resolution: Fixed
Hadoop Flags: [Reviewed]
  Status: Resolved  (was: Patch Available)

Issue has been resolved.

> All backend error messages must be logged to preserve the original error 
> messages
> -
>
> Key: PIG-728
> URL: https://issues.apache.org/jira/browse/PIG-728
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.0
>Reporter: Santhosh Srinivasan
>Assignee: Santhosh Srinivasan
>Priority: Minor
> Fix For: 0.4.0
>
> Attachments: PIG-728_1.patch
>
>
> The current error handling framework logs backend error messages only when 
> Pig is not able to parse the error message. Instead, Pig should log the 
> backend error message irrespective of Pig's ability to parse backend error 
> messages. On a side note, the use of instantiateFuncFromSpec in Launcher.java 
> is not consistent and should avoid the use of class_name + "(" + 
> string_constructor_args + ")".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-728) All backend error messages must be logged to preserve the original error messages

2009-07-16 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732281#action_12732281
 ] 

Santhosh Srinivasan commented on PIG-728:
-

Patch has been committed.

> All backend error messages must be logged to preserve the original error 
> messages
> -
>
> Key: PIG-728
> URL: https://issues.apache.org/jira/browse/PIG-728
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.0
>Reporter: Santhosh Srinivasan
>Assignee: Santhosh Srinivasan
>Priority: Minor
> Fix For: 0.4.0
>
> Attachments: PIG-728_1.patch
>
>
> The current error handling framework logs backend error messages only when 
> Pig is not able to parse the error message. Instead, Pig should log the 
> backend error message irrespective of Pig's ability to parse backend error 
> messages. On a side note, the use of instantiateFuncFromSpec in Launcher.java 
> is not consistent and should avoid the use of class_name + "(" + 
> string_constructor_args + ")".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-728) All backend error messages must be logged to preserve the original error messages

2009-07-16 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-728:


Attachment: PIG-728_1.patch

Attaching a new patch that fixes the findbugs issue.

> All backend error messages must be logged to preserve the original error 
> messages
> -
>
> Key: PIG-728
> URL: https://issues.apache.org/jira/browse/PIG-728
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.0
>Reporter: Santhosh Srinivasan
>Assignee: Santhosh Srinivasan
>Priority: Minor
> Fix For: 0.4.0
>
> Attachments: PIG-728_1.patch
>
>
> The current error handling framework logs backend error messages only when 
> Pig is not able to parse the error message. Instead, Pig should log the 
> backend error message irrespective of Pig's ability to parse backend error 
> messages. On a side note, the use of instantiateFuncFromSpec in Launcher.java 
> is not consistent and should avoid the use of class_name + "(" + 
> string_constructor_args + ")".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-728) All backend error messages must be logged to preserve the original error messages

2009-07-16 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-728:


Status: Patch Available  (was: In Progress)

> All backend error messages must be logged to preserve the original error 
> messages
> -
>
> Key: PIG-728
> URL: https://issues.apache.org/jira/browse/PIG-728
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.0
>Reporter: Santhosh Srinivasan
>Assignee: Santhosh Srinivasan
>Priority: Minor
> Fix For: 0.4.0
>
> Attachments: PIG-728_1.patch
>
>
> The current error handling framework logs backend error messages only when 
> Pig is not able to parse the error message. Instead, Pig should log the 
> backend error message irrespective of Pig's ability to parse backend error 
> messages. On a side note, the use of instantiateFuncFromSpec in Launcher.java 
> is not consistent and should avoid the use of class_name + "(" + 
> string_constructor_args + ")".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-728) All backend error messages must be logged to preserve the original error messages

2009-07-16 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-728:


Attachment: (was: PIG-728.patch)

> All backend error messages must be logged to preserve the original error 
> messages
> -
>
> Key: PIG-728
> URL: https://issues.apache.org/jira/browse/PIG-728
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.0
>Reporter: Santhosh Srinivasan
>Assignee: Santhosh Srinivasan
>Priority: Minor
> Fix For: 0.4.0
>
>
> The current error handling framework logs backend error messages only when 
> Pig is not able to parse the error message. Instead, Pig should log the 
> backend error message irrespective of Pig's ability to parse backend error 
> messages. On a side note, the use of instantiateFuncFromSpec in Launcher.java 
> is not consistent and should avoid the use of class_name + "(" + 
> string_constructor_args + ")".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-728) All backend error messages must be logged to preserve the original error messages

2009-07-16 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-728:


Status: In Progress  (was: Patch Available)

> All backend error messages must be logged to preserve the original error 
> messages
> -
>
> Key: PIG-728
> URL: https://issues.apache.org/jira/browse/PIG-728
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.0
>Reporter: Santhosh Srinivasan
>Assignee: Santhosh Srinivasan
>Priority: Minor
> Fix For: 0.4.0
>
>
> The current error handling framework logs backend error messages only when 
> Pig is not able to parse the error message. Instead, Pig should log the 
> backend error message irrespective of Pig's ability to parse backend error 
> messages. On a side note, the use of instantiateFuncFromSpec in Launcher.java 
> is not consistent and should avoid the use of class_name + "(" + 
> string_constructor_args + ")".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-728) All backend error messages must be logged to preserve the original error messages

2009-07-15 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-728:


Attachment: PIG-728.patch

The attached patch logs all backend error messages before Pig tries to parse 
the messages. In addition, the log format has been cleaned up to be more 
user-friendly. No new test cases have been added.
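
A minimal sketch of the ordering described (the method and field names here 
are illustrative, not the patch itself):

{code}
// Sketch: log every raw backend message before attempting to parse it,
// so the original text is preserved even when parsing fails.
for (String rawMessage : backendErrorMessages) {   // hypothetical collection
    log.error("Backend error message: " + rawMessage);
    parseErrorMessage(rawMessage);                 // hypothetical parsing step
}
{code}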

> All backend error messages must be logged to preserve the original error 
> messages
> -
>
> Key: PIG-728
> URL: https://issues.apache.org/jira/browse/PIG-728
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.0
>Reporter: Santhosh Srinivasan
>Assignee: Santhosh Srinivasan
>Priority: Minor
> Fix For: 0.4.0
>
> Attachments: PIG-728.patch
>
>
> The current error handling framework logs backend error messages only when 
> Pig is not able to parse the error message. Instead, Pig should log the 
> backend error message irrespective of Pig's ability to parse backend error 
> messages. On a side note, the use of instantiateFuncFromSpec in Launcher.java 
> is not consistent and should avoid the use of class_name + "(" + 
> string_constructor_args + ")".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-728) All backend error messages must be logged to preserve the original error messages

2009-07-15 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-728:


Fix Version/s: (was: 0.2.1)
               0.4.0
Affects Version/s: (was: 0.2.1)
                   0.3.0
           Status: Patch Available  (was: Open)

> All backend error messages must be logged to preserve the original error 
> messages
> -
>
> Key: PIG-728
> URL: https://issues.apache.org/jira/browse/PIG-728
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.0
>Reporter: Santhosh Srinivasan
>Assignee: Santhosh Srinivasan
>Priority: Minor
> Fix For: 0.4.0
>
> Attachments: PIG-728.patch
>
>
> The current error handling framework logs backend error messages only when 
> Pig is not able to parse the error message. Instead, Pig should log the 
> backend error message irrespective of Pig's ability to parse backend error 
> messages. On a side note, the use of instantiateFuncFromSpec in Launcher.java 
> is not consistent and should avoid the use of class_name + "(" + 
> string_constructor_args + ")".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-877) Push up filter does not account for added columns in foreach

2009-07-14 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-877:


  Resolution: Fixed
Hadoop Flags: [Reviewed]
  Status: Resolved  (was: Patch Available)

Issue has been fixed.

> Push up filter does not account for added columns in foreach
> 
>
> Key: PIG-877
> URL: https://issues.apache.org/jira/browse/PIG-877
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.1
>Reporter: Santhosh Srinivasan
>Assignee: Santhosh Srinivasan
> Fix For: 0.3.1
>
> Attachments: PIG-877.patch
>
>
> If a filter follows a foreach that produces an added column, then push up 
> filter fails with a null pointer exception.
> {code}
> ...
> x = foreach w generate $0, COUNT($1);
> y = filter x by $1 > 10;
> {code}
> In the above example, the column in the filter's expression is an added 
> column. As a result, the optimizer rule is not able to map it back to the 
> input, resulting in a null value. The subsequent for loop fails due to the 
> NPE.
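
For illustration only, the kind of guard the fix implies in the push-up-filter 
rule (all names here are hypothetical):

{code}
// Sketch: an added column (e.g. COUNT($1)) has no corresponding input
// column, so the mapping lookup returns null and the rule must bail out
// instead of dereferencing it.
Integer inputColumn = columnMapper.mapToInput(filterColumn); // hypothetical lookup
if (inputColumn == null) {
    return false; // filter cannot be pushed above this foreach
}
{code}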

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-877) Push up filter does not account for added columns in foreach

2009-07-13 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730595#action_12730595
 ] 

Santhosh Srinivasan commented on PIG-877:
-

Patch has been committed.

> Push up filter does not account for added columns in foreach
> 
>
> Key: PIG-877
> URL: https://issues.apache.org/jira/browse/PIG-877
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.1
>Reporter: Santhosh Srinivasan
>Assignee: Santhosh Srinivasan
> Fix For: 0.3.1
>
> Attachments: PIG-877.patch
>
>
> If a filter follows a foreach that produces an added column, then push up 
> filter fails with a null pointer exception.
> {code}
> ...
> x = foreach w generate $0, COUNT($1);
> y = filter x by $1 > 10;
> {code}
> In the above example, the column in the filter's expression is an added 
> column. As a result, the optimizer rule is not able to map it back to the 
> input, resulting in a null value. The subsequent for loop fails due to the 
> NPE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-877) Push up filter does not account for added columns in foreach

2009-07-13 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730525#action_12730525
 ] 

Santhosh Srinivasan commented on PIG-877:
-

It's at optimization time.

> Push up filter does not account for added columns in foreach
> 
>
> Key: PIG-877
> URL: https://issues.apache.org/jira/browse/PIG-877
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.1
>Reporter: Santhosh Srinivasan
>Assignee: Santhosh Srinivasan
> Fix For: 0.3.1
>
> Attachments: PIG-877.patch
>
>
> If a filter follows a foreach that produces an added column, then push up 
> filter fails with a null pointer exception.
> {code}
> ...
> x = foreach w generate $0, COUNT($1);
> y = filter x by $1 > 10;
> {code}
> In the above example, the column in the filter's expression is an added 
> column. As a result, the optimizer rule is not able to map it back to the 
> input, resulting in a null value. The subsequent for loop fails due to the 
> NPE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-877) Push up filter does not account for added columns in foreach

2009-07-10 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12729729#action_12729729
 ] 

Santhosh Srinivasan commented on PIG-877:
-

The unit test failures are unrelated to the fix.

> Push up filter does not account for added columns in foreach
> 
>
> Key: PIG-877
> URL: https://issues.apache.org/jira/browse/PIG-877
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.1
>Reporter: Santhosh Srinivasan
>Assignee: Santhosh Srinivasan
> Fix For: 0.3.1
>
> Attachments: PIG-877.patch
>
>
> If a filter follows a foreach that produces an added column, then push up 
> filter fails with a null pointer exception.
> {code}
> ...
> x = foreach w generate $0, COUNT($1);
> y = filter x by $1 > 10;
> {code}
> In the above example, the column in the filter's expression is an added 
> column. As a result, the optimizer rule is not able to map it back to the 
> input, resulting in a null value. The subsequent for loop fails due to the 
> NPE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-877) Push up filter does not account for added columns in foreach

2009-07-09 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-877:


Attachment: PIG-877.patch

Attached patch fixes the NPE.

> Push up filter does not account for added columns in foreach
> 
>
> Key: PIG-877
> URL: https://issues.apache.org/jira/browse/PIG-877
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.1
>Reporter: Santhosh Srinivasan
>Assignee: Santhosh Srinivasan
> Fix For: 0.3.1
>
> Attachments: PIG-877.patch
>
>
> If a filter follows a foreach that produces an added column, then push up 
> filter fails with a null pointer exception.
> {code}
> ...
> x = foreach w generate $0, COUNT($1);
> y = filter x by $1 > 10;
> {code}
> In the above example, the column in the filter's expression is an added 
> column. As a result, the optimizer rule is not able to map it back to the 
> input, resulting in a null value. The subsequent for loop fails due to the 
> NPE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-877) Push up filter does not account for added columns in foreach

2009-07-09 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-877:


Status: Patch Available  (was: Open)

> Push up filter does not account for added columns in foreach
> 
>
> Key: PIG-877
> URL: https://issues.apache.org/jira/browse/PIG-877
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.1
>Reporter: Santhosh Srinivasan
>Assignee: Santhosh Srinivasan
> Fix For: 0.3.1
>
> Attachments: PIG-877.patch
>
>
> If a filter follows a foreach that produces an added column, then push up 
> filter fails with a null pointer exception.
> {code}
> ...
> x = foreach w generate $0, COUNT($1);
> y = filter x by $1 > 10;
> {code}
> In the above example, the column in the filter's expression is an added 
> column. As a result, the optimizer rule is not able to map it back to the 
> input, resulting in a null value. The subsequent for loop fails due to the 
> NPE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-877) Push up filter does not account for added columns in foreach

2009-07-09 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan reassigned PIG-877:
---

Assignee: Santhosh Srinivasan

> Push up filter does not account for added columns in foreach
> 
>
> Key: PIG-877
> URL: https://issues.apache.org/jira/browse/PIG-877
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.1
>Reporter: Santhosh Srinivasan
>Assignee: Santhosh Srinivasan
> Fix For: 0.3.1
>
>
> If a filter follows a foreach that produces an added column, then push up 
> filter fails with a null pointer exception.
> {code}
> ...
> x = foreach w generate $0, COUNT($1);
> y = filter x by $1 > 10;
> {code}
> In the above example, the column in the filter's expression is an added 
> column. As a result, the optimizer rule is not able to map it back to the 
> input, resulting in a null value. The subsequent for loop fails due to the 
> NPE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-877) Push up filter does not account for added columns in foreach

2009-07-09 Thread Santhosh Srinivasan (JIRA)
Push up filter does not account for added columns in foreach


 Key: PIG-877
 URL: https://issues.apache.org/jira/browse/PIG-877
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.1
Reporter: Santhosh Srinivasan
 Fix For: 0.3.1


If a filter follows a foreach that produces an added column, then push up filter 
fails with a null pointer exception.

{code}
...
x = foreach w generate $0, COUNT($1);
y = filter x by $1 > 10;
{code}

In the above example, the column in the filter's expression is an added column. 
As a result, the optimizer rule is not able to map it back to the input, 
resulting in a null value. The subsequent for loop fails due to the NPE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-792) PERFORMANCE: Support skewed join in pig

2009-07-09 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12729420#action_12729420
 ] 

Santhosh Srinivasan commented on PIG-792:
-

Index: 
src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/LogToPhyTranslationVisitor.java
===

If the method addLocalRearrange is not required, can it be removed?

> PERFORMANCE: Support skewed join in pig
> ---
>
> Key: PIG-792
> URL: https://issues.apache.org/jira/browse/PIG-792
> Project: Pig
>  Issue Type: Improvement
>Reporter: Sriranjan Manjunath
> Attachments: skewedjoin.patch
>
>
> Fragmented replicated join has a few limitations:
>  - One of the tables needs to be loaded into memory
>  - Join is limited to two tables
> Skewed join partitions the table and joins the records in the reduce phase. 
> It computes a histogram of the key space to account for skewing in the input 
> records. Further, it adjusts the number of reducers depending on the key 
> distribution.
> We need to implement the skewed join in pig.
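
As a rough, non-authoritative illustration of the histogram idea (not the attached patch): sample the key frequencies, then give any key whose share exceeds one reducer's capacity a proportional number of reducers. All names and thresholds below are invented for the example.

{code}
import java.util.HashMap;
import java.util.Map;

// Toy sketch of skewed-join reducer allocation from a key histogram.
public class SkewedPartitionSketch {
    public static Map<String, Integer> reducersPerKey(
            Map<String, Long> keyHistogram, long totalRecords, int reducers) {
        long perReducer = Math.max(1L, totalRecords / reducers);
        Map<String, Integer> allocation = new HashMap<String, Integer>();
        for (Map.Entry<String, Long> e : keyHistogram.entrySet()) {
            // A key with more records than one reducer can hold is
            // spread across ceil(count / perReducer) reducers.
            long needed = (e.getValue() + perReducer - 1) / perReducer;
            allocation.put(e.getKey(), (int) Math.min(reducers, needed));
        }
        return allocation;
    }
}
{code}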

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-851) Map type used as return type in UDFs not recognized at all times

2009-07-08 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12728991#action_12728991
 ] 

Santhosh Srinivasan commented on PIG-851:
-

The patch has been committed. Thanks for fixing this issue, Jeff.

> Map type used as return type in UDFs not recognized at all times
> 
>
> Key: PIG-851
> URL: https://issues.apache.org/jira/browse/PIG-851
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.3.0
>Reporter: Santhosh Srinivasan
>Assignee: Jeff Zhang
> Fix For: 0.4.0
>
> Attachments: Pig_851_patch.txt
>
>
> When a UDF returns a map and the outputSchema method is not overridden, Pig 
> does not figure out the data type. As a result, the type is set to unknown, 
> resulting in a run-time failure. An example script and UDF follow:
> {code}
> public class mapUDF extends EvalFunc<Map<Object, Object>> {
> @Override
> public Map<Object, Object> exec(Tuple input) throws IOException {
> return new HashMap<Object, Object>();
> }
> //Note that the outputSchema method is commented out
> /*
> @Override
> public Schema outputSchema(Schema input) {
> try {
> return new Schema(new Schema.FieldSchema(null, null, 
> DataType.MAP));
> } catch (FrontendException e) {
> return null;
> }
> }
> */
> {code}
> {code}
> grunt> a = load 'student_tab.data';   
> grunt> b = foreach a generate EXPLODE(1);
> grunt> describe b;
> b: {Unknown}
> grunt> dump b;
> 2009-06-15 17:59:01,776 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>  - Failed!
> 2009-06-15 17:59:01,781 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 2080: Foreach currently does not handle type Unknown
> {code}
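
Until this is fixed, the commented-out outputSchema above is the workaround. For reference, a self-contained version of that sketch, with the generics that the mail archive stripped restored (this assumes the 0.3-era Schema API and is illustrative, not part of any patch):

{code}
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.FrontendException;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class mapUDF extends EvalFunc<Map<Object, Object>> {
    @Override
    public Map<Object, Object> exec(Tuple input) throws IOException {
        return new HashMap<Object, Object>();
    }

    // Declaring the return type lets the front end type the foreach
    // output as map instead of Unknown.
    @Override
    public Schema outputSchema(Schema input) {
        try {
            return new Schema(
                    new Schema.FieldSchema(null, null, DataType.MAP));
        } catch (FrontendException e) {
            return null;
        }
    }
}
{code}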

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-874) Problems in pushing down foreach with flatten

2009-07-07 Thread Santhosh Srinivasan (JIRA)
Problems in pushing down foreach with flatten
-

 Key: PIG-874
 URL: https://issues.apache.org/jira/browse/PIG-874
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.1
Reporter: Santhosh Srinivasan
 Fix For: 0.4.0


If the graph contains more than one foreach connected to an operator, pushing 
down foreach with flatten is not possible with the current optimizer pattern 
matching algorithm and current implementation of rewire. The following 
mechanism of pushing foreach with flatten does not work.

1. Search for foreach (with flatten) connected to an operator
2. If the checks pass, unflatten the flattened column in the foreach
3. Create a new foreach that flattens the mapped column (the original column 
number could have changed) and insert the new foreach after the old foreach's 
successor.

An example to illustrate the problem:

{code}
A = load 'myfile' as (name, age, gpa:(letter_grade, point_score));
B = foreach A generate $0, $1, flatten($2);
C = load 'anotherfile' as (name, age, preference:(course_name, instructor));
D = foreach C generate $0, $1, flatten($2);
E = join B by $0, D by $0 using "replicated";
F = limit E 10;
{code}

In the code snippet above, the optimizer will find two matches, B->E and 
D->E. For the first pattern match (B->E), $2 will be unflattened and a new 
foreach will be introduced after the join.

{code}
A = load 'myfile' as (name, age, gpa:(letter_grade, point_score));
B = foreach A generate $0, $1, $2;
C = load 'anotherfile' as (name, age, preference:(course_name, instructor));
D = foreach C generate $0, $1, flatten($2);
E = join B by $0, D by $0 using "replicated";
E1 = foreach E generate $0, $1, flatten($2), $3, $4, $5, $6;
F = limit E1 10;
{code}

For the second match (D->E), the same transformation is applied. However, this 
transformation will not work for the following reason. The new foreach is now 
inserted between E and E1. When E1 is rewired, rewire is unable to map $6 
in E1 as it does not exist in E. In order to fix such situations, the pattern 
matching should return a global match instead of a local match.

Reference: PIG-873

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-873) Optimizer should allow search for global patterns

2009-07-07 Thread Santhosh Srinivasan (JIRA)
Optimizer should allow search for global patterns
-

 Key: PIG-873
 URL: https://issues.apache.org/jira/browse/PIG-873
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.3.1
Reporter: Santhosh Srinivasan
 Fix For: 0.4.0


Currently, the optimizer works on the following mechanism:

1. Specify the pattern to be searched
2. For each occurrence of the pattern, check and then apply a transformation

With this approach, the search for a pattern is localized. An example will 
illustrate the problem.

If the pattern to be searched for is foreach (with flatten) connected to any 
operator and if the graph has more than one foreach (with flatten) connected to 
an operator (cross, join, union, etc), then each instance of foreach connected 
to the operator is returned as a match. While this is fine for a localized view 
(per match), at a global view the pattern to be searched for is any number of 
foreach connected to an operator.

The implication of not having a globalized view is more rules: one rule for 
one foreach connected to an operator, one rule for two foreaches connected to 
an operator, and so on.
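
To make the globalized view concrete, a sketch of what a single global match might return; the Graph interface and method names are invented for the illustration and are not the RuleMatcher API:

{code}
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a "global" match collects every foreach-with-flatten
// feeding the same operator (join, cross, union) into one match, so one
// rule application sees all of them together. A local matcher would
// instead return one singleton match per foreach.
public class GlobalMatchSketch<O> {
    public interface Graph<O> {
        List<O> predecessors(O op);
        boolean isForeachWithFlatten(O op);
    }

    public List<O> globalMatch(Graph<O> plan, O target) {
        List<O> match = new ArrayList<O>();
        for (O pred : plan.predecessors(target)) {
            if (plan.isForeachWithFlatten(pred)) {
                match.add(pred); // collect all matches, not just the first
            }
        }
        return match;
    }
}
{code}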


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-773) Empty complex constants (empty bag, empty tuple and empty map) should be supported

2009-07-06 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12727674#action_12727674
 ] 

Santhosh Srinivasan commented on PIG-773:
-

A comment on the latest patch (pig-773_v4.patch):

Index: src/org/apache/pig/data/DataType.java
===

Since the bag schema contains the tuple schema, the bag schema should set the 
two-level access flag to true.

{code}
@@ -998,8 +999,8 @@
         schema = schemas.get(0);
         if(null == schema) {
             Schema.FieldSchema tupleFs = new Schema.FieldSchema(null, null, TUPLE);
-            Schema bagSchema = new Schema(tupleFs);
-            return new Schema.FieldSchema(null, null, BAG);
+            bagSchema = new Schema(tupleFs);
+            return new Schema.FieldSchema(null, bagSchema, BAG);
{code}
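
Spelled out as plain code, the suggestion is that the bag's field schema carry the tuple schema and mark two-level access; setTwoLevelAccessRequired below is used on the assumption that the Schema API of this era exposes it, so treat this as a sketch rather than the final patch:

{code}
import org.apache.pig.data.DataType;
import org.apache.pig.impl.logicalLayer.schema.Schema;

// Sketch of the suggested fix: the empty-bag constant's FieldSchema gets
// the inner tuple schema, and two-level access is set so consumers see
// bag{tuple(...)}.
public class EmptyBagSchemaSketch {
    public static Schema.FieldSchema emptyBagFieldSchema() throws Exception {
        Schema.FieldSchema tupleFs =
                new Schema.FieldSchema(null, null, DataType.TUPLE);
        Schema bagSchema = new Schema(tupleFs);
        bagSchema.setTwoLevelAccessRequired(true); // per the comment above
        return new Schema.FieldSchema(null, bagSchema, DataType.BAG);
    }
}
{code}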

> Empty complex constants (empty bag, empty tuple and empty map) should be 
> supported
> --
>
> Key: PIG-773
> URL: https://issues.apache.org/jira/browse/PIG-773
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.0
>Reporter: Pradeep Kamath
>Assignee: Ashutosh Chauhan
>Priority: Minor
> Fix For: 0.4.0
>
> Attachments: pig-773.patch, pig-773_v2.patch, pig-773_v3.patch, 
> pig-773_v4.patch
>
>
> We should be able to create empty bag constant using {}, empty tuple constant 
> using (), empty map constant using [] within a pig script

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-792) PERFORMANCE: Support skewed join in pig

2009-07-02 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-792:


Status: Patch Available  (was: Open)

> PERFORMANCE: Support skewed join in pig
> ---
>
> Key: PIG-792
> URL: https://issues.apache.org/jira/browse/PIG-792
> Project: Pig
>  Issue Type: Improvement
>Reporter: Sriranjan Manjunath
> Attachments: skewedjoin.patch
>
>
> Fragmented replicated join has a few limitations:
>  - One of the tables needs to be loaded into memory
>  - Join is limited to two tables
> Skewed join partitions the table and joins the records in the reduce phase. 
> It computes a histogram of the key space to account for skewing in the input 
> records. Further, it adjusts the number of reducers depending on the key 
> distribution.
> We need to implement the skewed join in pig.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-697) Proposed improvements to pig's optimizer

2009-07-02 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726684#action_12726684
 ] 

Santhosh Srinivasan commented on PIG-697:
-

The Phase 4 part 2 patch has been committed.

> Proposed improvements to pig's optimizer
> 
>
> Key: PIG-697
> URL: https://issues.apache.org/jira/browse/PIG-697
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Alan Gates
>Assignee: Santhosh Srinivasan
> Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, 
> OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, 
> OptimizerPhase3_parrt1.patch, OptimizerPhase3_part2_3.patch, 
> OptimizerPhase4_part1-1.patch, OptimizerPhase4_part2.patch
>
>
> I propose the following changes to pig optimizer, plan, and operator 
> functionality to support more robust optimization:
> 1) Remove the required array from Rule.  This will change rules so that they 
> only match exact patterns instead of allowing missing elements in the pattern.
> This has the downside that if a given rule applies to two patterns (say 
> Load->Filter->Group, Load->Group) you have to write two rules.  But it has 
> the upside that
> the resulting rules know exactly what they are getting.  The original intent 
> of this was to reduce the number of rules that needed to be written.  But the
> resulting rules have to do a lot of work to understand the operators they are 
> working with.  With exact matches only, each rule will know exactly the 
> operators it
> is working on and can apply the logic of shifting the operators around.  All 
> four of the existing rules set all entries of required to true, so removing 
> this
> will have no effect on them.
> 2) Change PlanOptimizer.optimize to iterate over the rules until there are no 
> conversions or a certain number of iterations has been reached.  Currently the
> function is:
> {code}
> public final void optimize() throws OptimizerException {
> RuleMatcher matcher = new RuleMatcher();
> for (Rule rule : mRules) {
> if (matcher.match(rule)) {
> // It matches the pattern.  Now check if the transformer
> // approves as well.
> List<List<O>> matches = matcher.getAllMatches();
> for (List<O> match:matches)
> {
>   if (rule.transformer.check(match)) {
>   // The transformer approves.
>   rule.transformer.transform(match);
>   }
> }
> }
> }
> }
> {code}
> It would change to be:
> {code}
> public final void optimize() throws OptimizerException {
> RuleMatcher matcher = new RuleMatcher();
> boolean sawMatch;
> int numIterations = 0;
> do {
> sawMatch = false;
> for (Rule rule : mRules) {
> List<List<O>> matches = matcher.getAllMatches();
> for (List<O> match:matches) {
> // It matches the pattern.  Now check if the transformer
> // approves as well.
> if (rule.transformer.check(match)) {
> // The transformer approves.
> sawMatch = true;
> rule.transformer.transform(match);
> }
> }
> }
> // Not sure if 1000 is the right number of iterations, maybe it
> // should be configurable so that large scripts don't stop too 
> // early.
> } while (sawMatch && numIterations++ < 1000);
> }
> {code}
> The reason for limiting the number of iterations is to avoid infinite loops.  
> The reason for iterating over the rules is so that each rule can be applied 
> multiple
> times as necessary.  This allows us to write simple rules, mostly swaps 
> between neighboring operators, without worrying that we get the plan right in 
> one pass.
> For example, we might have a plan that looks like:  
> Load->Join->Filter->Foreach, and we want to optimize it to 
> Load->Foreach->Filter->Join.  With two simple
> rules (swap filter and join and swap foreach and filter), applied 
> iteratively, we can get from the initial to final plan, without needing to 
> understand the
> big picture of the entire plan.
> 3) Add three calls to OperatorPlan:
> {code}
> /**
>  * Swap two operators in a plan.  Both of the operators must have single
>  * inputs and single outputs.
>  * @param first operator
>  * @param second operator
>  * @throws PlanException if either operator is not single input and output.
>  */
> public void swap(E first, E second) throws PlanException {
> ...
> }
> /**
>  * Push one operator in front of another.  This function is for use when
>  * the first

[jira] Commented: (PIG-697) Proposed improvements to pig's optimizer

2009-07-02 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726601#action_12726601
 ] 

Santhosh Srinivasan commented on PIG-697:
-

-1 javac. The applied patch generated 250 javac compiler warnings (more than 
the trunk's current 248 warnings).

The additional 2 compiler warning messages are related to type inference. At 
this point these messages are harmless. 

Dodgy warning:
The findbugs warnings are harmless; there is an explicit check for null to 
print null as opposed to the contents of the object.

Correctness warning:
There are checks in place to ensure that the variable can never be null.


> Proposed improvements to pig's optimizer
> 
>
> Key: PIG-697
> URL: https://issues.apache.org/jira/browse/PIG-697
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Alan Gates
>Assignee: Santhosh Srinivasan
> Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, 
> OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, 
> OptimizerPhase3_parrt1.patch, OptimizerPhase3_part2_3.patch, 
> OptimizerPhase4_part1-1.patch, OptimizerPhase4_part2.patch
>
>
> I propose the following changes to pig optimizer, plan, and operator 
> functionality to support more robust optimization:
> 1) Remove the required array from Rule.  This will change rules so that they 
> only match exact patterns instead of allowing missing elements in the pattern.
> This has the downside that if a given rule applies to two patterns (say 
> Load->Filter->Group, Load->Group) you have to write two rules.  But it has 
> the upside that
> the resulting rules know exactly what they are getting.  The original intent 
> of this was to reduce the number of rules that needed to be written.  But the
> resulting rules have to do a lot of work to understand the operators they are 
> working with.  With exact matches only, each rule will know exactly the 
> operators it
> is working on and can apply the logic of shifting the operators around.  All 
> four of the existing rules set all entries of required to true, so removing 
> this
> will have no effect on them.
> 2) Change PlanOptimizer.optimize to iterate over the rules until there are no 
> conversions or a certain number of iterations has been reached.  Currently the
> function is:
> {code}
> public final void optimize() throws OptimizerException {
> RuleMatcher matcher = new RuleMatcher();
> for (Rule rule : mRules) {
> if (matcher.match(rule)) {
> // It matches the pattern.  Now check if the transformer
> // approves as well.
> List<List<O>> matches = matcher.getAllMatches();
> for (List<O> match:matches)
> {
>   if (rule.transformer.check(match)) {
>   // The transformer approves.
>   rule.transformer.transform(match);
>   }
> }
> }
> }
> }
> {code}
> It would change to be:
> {code}
> public final void optimize() throws OptimizerException {
> RuleMatcher matcher = new RuleMatcher();
> boolean sawMatch;
> int numIterations = 0;
> do {
> sawMatch = false;
> for (Rule rule : mRules) {
> List<List<O>> matches = matcher.getAllMatches();
> for (List<O> match:matches) {
> // It matches the pattern.  Now check if the transformer
> // approves as well.
> if (rule.transformer.check(match)) {
> // The transformer approves.
> sawMatch = true;
> rule.transformer.transform(match);
> }
> }
> }
> // Not sure if 1000 is the right number of iterations, maybe it
> // should be configurable so that large scripts don't stop too 
> // early.
> } while (sawMatch && numIterations++ < 1000);
> }
> {code}
> The reason for limiting the number of iterations is to avoid infinite loops.  
> The reason for iterating over the rules is so that each rule can be applied 
> multiple
> times as necessary.  This allows us to write simple rules, mostly swaps 
> between neighboring operators, without worrying that we get the plan right in 
> one pass.
> For example, we might have a plan that looks like:  
> Load->Join->Filter->Foreach, and we want to optimize it to 
> Load->Foreach->Filter->Join.  With two simple
> rules (swap filter and join and swap foreach and filter), applied 
> iteratively, we can get from the initial to fina

[jira] Commented: (PIG-697) Proposed improvements to pig's optimizer

2009-07-02 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726566#action_12726566
 ] 

Santhosh Srinivasan commented on PIG-697:
-

1. Removing added fields from the flattened set.

The flattened set is the set of all flattened columns. It can contain mapped 
and added fields. In order to remove the added fields from this set, the 
removeAll method is used (see the sketch after this list).

2. Comments on why the rule applies only to Order, Cross and Join

Will add these comments.

3. Removing code in LOForEach for flattening a bag with unknown schema

The code that I removed was redundant and also had a bug. The check for a field 
getting mapped was neglected in one case. After I added the check, the code for 
the if and the else was identical. I removed the redundant code and made it 
simpler.
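
Point 1 above is plain java.util.Set difference; a tiny illustration (element types invented for the example):

{code}
import java.util.HashSet;
import java.util.Set;

// The flattened set can contain both mapped and added columns; dropping
// the added ones is a single Set.removeAll call.
public class FlattenedSetSketch {
    public static Set<Integer> mappedOnly(Set<Integer> flattened,
                                          Set<Integer> added) {
        Set<Integer> result = new HashSet<Integer>(flattened);
        result.removeAll(added); // remove added fields, keep mapped ones
        return result;
    }
}
{code}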

> Proposed improvements to pig's optimizer
> 
>
> Key: PIG-697
> URL: https://issues.apache.org/jira/browse/PIG-697
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Alan Gates
>Assignee: Santhosh Srinivasan
> Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, 
> OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, 
> OptimizerPhase3_parrt1.patch, OptimizerPhase3_part2_3.patch, 
> OptimizerPhase4_part1-1.patch, OptimizerPhase4_part2.patch
>
>
> I propose the following changes to pig optimizer, plan, and operator 
> functionality to support more robust optimization:
> 1) Remove the required array from Rule.  This will change rules so that they 
> only match exact patterns instead of allowing missing elements in the pattern.
> This has the downside that if a given rule applies to two patterns (say 
> Load->Filter->Group, Load->Group) you have to write two rules.  But it has 
> the upside that
> the resulting rules know exactly what they are getting.  The original intent 
> of this was to reduce the number of rules that needed to be written.  But the
> resulting rules have to do a lot of work to understand the operators they are 
> working with.  With exact matches only, each rule will know exactly the 
> operators it
> is working on and can apply the logic of shifting the operators around.  All 
> four of the existing rules set all entries of required to true, so removing 
> this
> will have no effect on them.
> 2) Change PlanOptimizer.optimize to iterate over the rules until there are no 
> conversions or a certain number of iterations has been reached.  Currently the
> function is:
> {code}
> public final void optimize() throws OptimizerException {
> RuleMatcher matcher = new RuleMatcher();
> for (Rule rule : mRules) {
> if (matcher.match(rule)) {
> // It matches the pattern.  Now check if the transformer
> // approves as well.
> List<List<O>> matches = matcher.getAllMatches();
> for (List<O> match:matches)
> {
>   if (rule.transformer.check(match)) {
>   // The transformer approves.
>   rule.transformer.transform(match);
>   }
> }
> }
> }
> }
> {code}
> It would change to be:
> {code}
> public final void optimize() throws OptimizerException {
> RuleMatcher matcher = new RuleMatcher();
> boolean sawMatch;
> int numIterations = 0;
> do {
> sawMatch = false;
> for (Rule rule : mRules) {
> List<List<O>> matches = matcher.getAllMatches();
> for (List<O> match:matches) {
> // It matches the pattern.  Now check if the transformer
> // approves as well.
> if (rule.transformer.check(match)) {
> // The transformer approves.
> sawMatch = true;
> rule.transformer.transform(match);
> }
> }
> }
> // Not sure if 1000 is the right number of iterations, maybe it
> // should be configurable so that large scripts don't stop too 
> // early.
> } while (sawMatch && numIterations++ < 1000);
> }
> {code}
> The reason for limiting the number of iterations is to avoid infinite loops.  
> The reason for iterating over the rules is so that each rule can be applied 
> multiple
> times as necessary.  This allows us to write simple rules, mostly swaps 
> between neighboring operators, without worrying that we get the plan right in 
> one pass.
> For example, we might have a plan that looks like:  
> Load->Join->Filter->Foreach, and we want to optimize it to 
> Load->Foreach->Filter->Join.  With two simple
> rules (swap filter and join and swap foreach and filter), applied 

[jira] Updated: (PIG-697) Proposed improvements to pig's optimizer

2009-07-01 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-697:


Status: Patch Available  (was: In Progress)

> Proposed improvements to pig's optimizer
> 
>
> Key: PIG-697
> URL: https://issues.apache.org/jira/browse/PIG-697
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Alan Gates
>Assignee: Santhosh Srinivasan
> Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, 
> OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, 
> OptimizerPhase3_parrt1.patch, OptimizerPhase3_part2_3.patch, 
> OptimizerPhase4_part1-1.patch, OptimizerPhase4_part2.patch
>
>
> I propose the following changes to pig optimizer, plan, and operator 
> functionality to support more robust optimization:
> 1) Remove the required array from Rule.  This will change rules so that they 
> only match exact patterns instead of allowing missing elements in the pattern.
> This has the downside that if a given rule applies to two patterns (say 
> Load->Filter->Group, Load->Group) you have to write two rules.  But it has 
> the upside that
> the resulting rules know exactly what they are getting.  The original intent 
> of this was to reduce the number of rules that needed to be written.  But the
> resulting rules have to do a lot of work to understand the operators they are 
> working with.  With exact matches only, each rule will know exactly the 
> operators it
> is working on and can apply the logic of shifting the operators around.  All 
> four of the existing rules set all entries of required to true, so removing 
> this
> will have no effect on them.
> 2) Change PlanOptimizer.optimize to iterate over the rules until there are no 
> conversions or a certain number of iterations has been reached.  Currently the
> function is:
> {code}
> public final void optimize() throws OptimizerException {
> RuleMatcher matcher = new RuleMatcher();
> for (Rule rule : mRules) {
> if (matcher.match(rule)) {
> // It matches the pattern.  Now check if the transformer
> // approves as well.
> List<List<O>> matches = matcher.getAllMatches();
> for (List<O> match:matches)
> {
>   if (rule.transformer.check(match)) {
>   // The transformer approves.
>   rule.transformer.transform(match);
>   }
> }
> }
> }
> }
> {code}
> It would change to be:
> {code}
> public final void optimize() throws OptimizerException {
> RuleMatcher matcher = new RuleMatcher();
> boolean sawMatch;
> int numIterations = 0;
> do {
> sawMatch = false;
> for (Rule rule : mRules) {
> List<List<O>> matches = matcher.getAllMatches();
> for (List<O> match:matches) {
> // It matches the pattern.  Now check if the transformer
> // approves as well.
> if (rule.transformer.check(match)) {
> // The transformer approves.
> sawMatch = true;
> rule.transformer.transform(match);
> }
> }
> }
> // Not sure if 1000 is the right number of iterations, maybe it
> // should be configurable so that large scripts don't stop too 
> // early.
> } while (sawMatch && numIterations++ < 1000);
> }
> {code}
> The reason for limiting the number of iterations is to avoid infinite loops.  
> The reason for iterating over the rules is so that each rule can be applied 
> multiple
> times as necessary.  This allows us to write simple rules, mostly swaps 
> between neighboring operators, without worrying that we get the plan right in 
> one pass.
> For example, we might have a plan that looks like:  
> Load->Join->Filter->Foreach, and we want to optimize it to 
> Load->Foreach->Filter->Join.  With two simple
> rules (swap filter and join and swap foreach and filter), applied 
> iteratively, we can get from the initial to final plan, without needing to 
> understand the
> big picture of the entire plan.
> 3) Add three calls to OperatorPlan:
> {code}
> /**
>  * Swap two operators in a plan.  Both of the operators must have single
>  * inputs and single outputs.
>  * @param first operator
>  * @param second operator
>  * @throws PlanException if either operator is not single input and output.
>  */
> public void swap(E first, E second) throws PlanException {
> ...
> }
> /**
>  * Push one operator in front of another.  This function is for use when
>  * the first operator has multiple inputs.  The caller can s
