[jira] Updated: (PIG-794) Use Avro serialization in Pig

2010-08-31 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated PIG-794:
---

Attachment: AvroStorage_2.patch

> Use Avro serialization in Pig
> -
>
> Key: PIG-794
> URL: https://issues.apache.org/jira/browse/PIG-794
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.2.0
>Reporter: Rakesh Setty
>Assignee: Dmitriy V. Ryaboy
> Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, 
> AvroStorage_2.patch, jackson-asl-0.9.4.jar, PIG-794.patch
>
>
> We would like to use Avro serialization in Pig to pass data between MR jobs 
> instead of the current BinStorage. Attached is an implementation of 
> AvroBinStorage which performs significantly better compared to BinStorage on 
> our benchmarks.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-794) Use Avro serialization in Pig

2010-08-31 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904551#action_12904551
 ] 

Jeff Zhang commented on PIG-794:


I did some experiments with Avro; AvroStorage_2.patch is the detailed 
implementation.

Here I use Avro as the data storage between map-reduce jobs, replacing 
InterStorage (which is already optimized compared to BinStorage).
I use a simple Pig script which will be translated into 2 map-reduce jobs:
{code}
a = load '/a.txt';
b = load '/b.txt';
c = join a by $0, b by $0;
d = group c by $0;
dump d;
{code}

The following table shows my experiment results (1 master + 3 slaves):
|| Storage || Time spent on job_1 || Output size of job_1 || Mapper task number of job_2 || Time spent on job_2 || Total time spent on pig script ||
| AvroStorage | 5min 57 sec | 7.97G | 120 | 16min 50 sec | 22min 47 sec |
| InterStorage | 4min 33 sec | 9.55G | 143 | 17min 17 sec | 21min 50 sec |

The experiment shows that AvroStorage has a more compact format than 
InterStorage (according to the output size of job_1), but more serialization 
overhead (according to the time spent on job_1). I think the time spent on 
job_2 using AvroStorage is less than with InterStorage because the input size 
of job_2 (the output of job_1) is much smaller with AvroStorage, so it needs 
fewer mapper tasks.

Overall, AvroStorage is not as good as expected.
One possible reason is that I do not use Avro's API correctly (I hope the Avro 
folks can review my code); another is that Avro's serialization performance is 
not that good.
BTW, I use Avro trunk.





[jira] Updated: (PIG-794) Use Avro serialization in Pig

2010-08-31 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated PIG-794:
---

Attachment: AvroTest.java

> Use Avro serialization in Pig
> -
>
> Key: PIG-794
> URL: https://issues.apache.org/jira/browse/PIG-794
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.2.0
>Reporter: Rakesh Setty
>Assignee: Dmitriy V. Ryaboy
> Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, 
> AvroStorage_2.patch, AvroTest.java, jackson-asl-0.9.4.jar, PIG-794.patch
>
>
> We would like to use Avro serialization in Pig to pass data between MR jobs 
> instead of the current BinStorage. Attached is an implementation of 
> AvroBinStorage which performs significantly better compared to BinStorage on 
> our benchmarks.




[jira] Commented: (PIG-794) Use Avro serialization in Pig

2010-08-31 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904555#action_12904555
 ] 

Jeff Zhang commented on PIG-794:


Besides the above experiment, I also did an experiment comparing 
AvroRecordWriter and InterRecordWriter in a local environment. See the 
attached file AvroTest.java.
I wrote 50,000,000 records using these two RecordWriters; the time spent with 
AvroRecordWriter is 70 seconds, while it is 29 seconds with InterRecordWriter.

The performance of InterRecordWriter is much better than AvroRecordWriter. 
Internally they use DataFileWriter (avro) and FSDataOutputStream (inter), and 
both of them use BufferedOutputStream as one buffer layer. The difference is 
that DataFileWriter (avro) has another buffer layer: it first writes contents 
to an in-memory block and then writes the block to the BufferedOutputStream 
when it is full. I am not sure whether this extra layer has overhead.
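The two layerings described above can be sketched in isolation (a hypothetical 
micro-model, not code from the patch; the class and method names are mine, and 
ByteArrayOutputStream stands in for the real FSDataOutputStream):

{code}
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class BufferLayers {
    // One buffer layer, as described for InterRecordWriter:
    // record -> BufferedOutputStream -> output stream.
    static int writeSingleLayer(int records, byte[] record) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        BufferedOutputStream out = new BufferedOutputStream(sink, 8192);
        for (int i = 0; i < records; i++) out.write(record);
        out.flush();
        return sink.size();
    }

    // Extra in-memory block layer, as described for Avro's DataFileWriter:
    // records accumulate in a block that is copied to the buffered stream
    // whenever the block fills up.
    static int writeDoubleLayer(int records, byte[] record) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        BufferedOutputStream out = new BufferedOutputStream(sink, 8192);
        ByteArrayOutputStream block = new ByteArrayOutputStream(16 * 1024);
        for (int i = 0; i < records; i++) {
            block.write(record);
            if (block.size() >= 16 * 1024) {
                block.writeTo(out);
                block.reset();
            }
        }
        block.writeTo(out);  // write out the partial final block
        out.flush();
        return sink.size();
    }

    public static void main(String[] args) throws IOException {
        byte[] record = new byte[64];
        System.out.println(writeSingleLayer(1000, record));
        System.out.println(writeDoubleLayer(1000, record));
        // Both paths emit the same bytes; the second just adds one extra copy.
    }
}
{code}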







[jira] Updated: (PIG-794) Use Avro serialization in Pig

2010-08-31 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated PIG-794:
---

Attachment: AvroStorage_3.patch

> Use Avro serialization in Pig
> -
>
> Key: PIG-794
> URL: https://issues.apache.org/jira/browse/PIG-794
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.2.0
>Reporter: Rakesh Setty
>Assignee: Dmitriy V. Ryaboy
> Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, 
> AvroStorage_2.patch, AvroStorage_3.patch, AvroTest.java, 
> jackson-asl-0.9.4.jar, PIG-794.patch
>
>
> We would like to use Avro serialization in Pig to pass data between MR jobs 
> instead of the current BinStorage. Attached is an implementation of 
> AvroBinStorage which performs significantly better compared to BinStorage on 
> our benchmarks.




[jira] Commented: (PIG-794) Use Avro serialization in Pig

2010-08-31 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904575#action_12904575
 ] 

Jeff Zhang commented on PIG-794:


Attached is the updated patch AvroStorage_3.patch (I found one place that 
could be optimized).
The following are the latest experiment results (which show AvroStorage is a 
little better than InterStorage):
|| Storage || Time spent on job_1 || Output size of job_1 || Mapper task number of job_2 || Time spent on job_2 || Total time spent on pig script ||
| AvroStorage | 3min 51 sec | 7.96G | 120 | 17min 09 sec | 21min 0 sec |
| InterStorage | 4min 33 sec | 9.55G | 143 | 17min 17 sec | 21min 50 sec |




[jira] Commented: (PIG-794) Use Avro serialization in Pig

2010-08-31 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904615#action_12904615
 ] 

Dmitriy V. Ryaboy commented on PIG-794:
---

Jeff, have you checked out Scott Carey's work here: 
https://issues.apache.org/jira/browse/AVRO-592 ?




[jira] Commented: (PIG-794) Use Avro serialization in Pig

2010-08-31 Thread Scott Carey (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904674#action_12904674
 ] 

Scott Carey commented on PIG-794:
-

AVRO-592 creates an AvroStorage class for writing and reading M/R inputs and 
outputs but does not deal with intermediate M/R output.  I have some 
in-progress updates to that which simplify it further.  Some aspects may be 
reusable for this too.

One thing to note is that Avro cannot be completely optimal for intermediate 
M/R output because the Hadoop API for this has a performance flaw that prevents 
efficient use of buffers and input/output streams there.  This would affect 
InterStorage as well, though.

I'll take a look at the patch here and see whether I can spot any performance 
optimizations.
Note that there are still several performance optimizations left to do in Avro 
itself.  For example, the BinaryDecoder has been optimized, but not yet the 
Encoder.

Also, I'm somewhat blocked on AVRO-592 due to the lack of Pig 0.7 Maven 
availability.






[jira] Commented: (PIG-794) Use Avro serialization in Pig

2010-08-31 Thread Scott Carey (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904680#action_12904680
 ] 

Scott Carey commented on PIG-794:
-

So a summary of the differences I can see quickly are:

h5. Schema usage:
This creates a 'generic' Avro schema that can be used for any Pig data.  Each 
field in a Tuple is a union of all possible Pig types, and each Tuple is a list 
of fields.  It does not preserve the field names or types -- these are not 
important for intermediate data anyway.

AVRO-592 translates the Pig schema into a specific Avro schema that persists 
the field names and types, so that:
STORE foo INTO 'file' USING AvroStorage();
will create a file from which
foo2 = LOAD 'file' USING AvroStorage();
can re-create the exact schema for use in a script.

h5. Serialization and Deserialization:
This uses the same style as Avro's GenericRecord, which traverses the schema on 
the fly and writes fields for each record.

AVRO-592 constructs a state machine for each specific schema to optimally 
traverse a Tuple when serializing a record, or to create a Tuple when 
deserializing.  This should be faster, but the code is definitely harder to 
read (though easy to unit test -- AVRO-592 has 98% unit-test code coverage on 
that portion).


Integrating these should not be too hard.  I'll try to put my latest version 
of AVRO-592 up there late today or tomorrow.







[jira] Commented: (PIG-794) Use Avro serialization in Pig

2010-08-31 Thread Scott Carey (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904683#action_12904683
 ] 

Scott Carey commented on PIG-794:
-

bq.  The performance of InterRecordWriter is much better than AvroRecordWriter. 
Internally they use DataFileWriter (avro) and FSDataOutputStream (inter), and 
both of them use BufferedOutputStream as one buffer layer. The difference is 
that DataFileWriter (avro) has another buffer layer: it first writes contents 
to an in-memory block and then writes the block to the BufferedOutputStream 
when it is full. I am not sure whether this extra layer has overhead.

I've tested this a bit before; the extra block copy is minor overhead.  How the 
BufferedOutputStream is used is the problem.  We have not yet completely 
optimized the write side of Avro -- there are enhancements to the serialization 
process that can still be made.




[jira] Commented: (PIG-794) Use Avro serialization in Pig

2010-08-31 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904687#action_12904687
 ] 

Doug Cutting commented on PIG-794:
--

A few comments about the attached code:
 - Is there a reason you don't subclass GenericDatumReader and 
GenericDatumWriter, overriding readRecord() and writeRecord()?  That would 
simplify things and better guarantee that you're conforming to a schema.  
Currently, e.g., your writeMap() doesn't appear to write a valid Avro map, 
writeArray() doesn't write a valid Avro array, etc., so the data written is not 
interoperable.
 - My guess is that a lot of time is spent in findSchemaIndex().  If that's 
right, you might improve this in various ways, e.g.:
 -- sort this by the most common types; the order in Pig's DataType.java is 
probably a good one.
 -- try using a static Map cache of indexes.
 - Have you run this under a profiler?
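The static-Map cache suggestion could look something like the following 
minimal sketch (the class name, the placeholder scan, and the byte-to-index 
mapping are illustrative assumptions, not code from the patch):

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SchemaIndexCache {
    // Pig type byte (as returned by DataType.findType()) -> index of the
    // matching branch in the union schema.  Computed once per type, cached.
    private static final Map<Byte, Integer> CACHE = new ConcurrentHashMap<>();

    public static int findSchemaIndex(byte pigType) {
        return CACHE.computeIfAbsent(pigType, SchemaIndexCache::scanUnion);
    }

    // Stand-in for the real linear scan over the union's branches.
    private static int scanUnion(byte pigType) {
        return pigType & 0x0f;  // placeholder mapping, for illustration only
    }

    public static void main(String[] args) {
        System.out.println(findSchemaIndex((byte) 10));  // scans once
        System.out.println(findSchemaIndex((byte) 10));  // served from cache
    }
}
{code}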

I don't see where this specifies an Avro schema for Pig data.  It's possible to 
construct a generic schema for all Pig data.  In this, a Bag should be a record 
with a single field, an array of Tuples.  A Tuple should be a record with a 
single field, an array of a union of all types.  Given such a schema, one could 
then write a DatumReader/Writer using the control logic of Pig's 
DataReaderWriter (i.e., a switch based on the value of DataType.findType()), 
but, instead of calling DataInput/Output methods, use Encoder/Decoder methods 
with a ValidatingEncoder (at least while debugging) to ensure you conform to 
that schema.
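Sketched as an Avro schema, the generic layout described above might look 
roughly like this (the names and the map simplification are mine; a real 
version would also need to cover the map-value types properly, which this 
sketch does not attempt):

{code}
{"type": "record", "name": "Tuple",
 "fields": [
   {"name": "fields",
    "type": {"type": "array", "items": [
      "null", "boolean", "int", "long", "float", "double",
      "string", "bytes",
      "Tuple",
      {"type": "record", "name": "Bag",
       "fields": [
         {"name": "tuples",
          "type": {"type": "array", "items": "Tuple"}}]},
      {"type": "map", "values": "Tuple"}
    ]}}]}
{code}

Note the recursive reference to "Tuple", which Avro named types allow; the 
union is what makes writing fully generic Pig data expensive.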

Alternately, in Avro 1.4 (snapshot in Maven now, release this week, hopefully) 
Avro arrays can be arbitrary Collection implementations.  Bag already 
implements all of the required Collection methods -- clear(), add(), size(), & 
iterator(), so there's no reason I can see for Bag not to implement 
Collection.  So then one could subclass GenericData, GenericDatumReader 
& Writer, overriding:

{code}
protected boolean isRecord(Object datum) {
  return datum instanceof Tuple || datum instanceof Bag;
}
protected void writeRecord(Schema schema, Object datum, Encoder out) throws 
IOException {
  if (TUPLE_NAME.equals(schema.getFullName()))
    datum = ((Tuple)datum).getAll();
  writeArray(schema.getFields().get(0).getType(), datum, out);
}
protected Object readRecord(Object old, Schema expected, ResolvingDecoder in) 
throws IOException {
  Object result;
  if (TUPLE_NAME.equals(expected.getFullName())) {
    old = new ArrayList();
    result = new Tuple(old);
  } else {
    old = result = new Bag();
  }
  readArray(old, expected.getFields().get(0).getType(), in);
  return result;
}
{code}

Finally, if you knew the schema for the dataset being processed, rather than 
using a fully general Pig schema, you could translate that schema to an Avro 
schema.  In most cases this schema would not have a huge, 
compute-intensive-to-write union in it.  Or you might use something like what 
Scott proposed in AVRO-592.





[jira] Updated: (PIG-1314) Add DateTime Support to Pig

2010-08-31 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1314:


Fix Version/s: (was: 0.8.0)

Unlinking from 0.8 since we are branching today

> Add DateTime Support to Pig
> ---
>
> Key: PIG-1314
> URL: https://issues.apache.org/jira/browse/PIG-1314
> Project: Pig
>  Issue Type: Bug
>  Components: data
>Affects Versions: 0.7.0
>Reporter: Russell Jurney
>Assignee: Russell Jurney
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> Hadoop/Pig are primarily used to parse log data, and most logs have a 
> timestamp component.  Therefore Pig should support dates as a primitive.
> Can someone familiar with adding types to pig comment on how hard this is?  
> We're looking at doing this, rather than use UDFs.  Is this a patch that 
> would be accepted?




[jira] Updated: (PIG-1429) Add Boolean Data Type to Pig

2010-08-31 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1429:


Fix Version/s: (was: 0.8.0)

Unlinking because we are branching for release today

> Add Boolean Data Type to Pig
> 
>
> Key: PIG-1429
> URL: https://issues.apache.org/jira/browse/PIG-1429
> Project: Pig
>  Issue Type: New Feature
>  Components: data
>Affects Versions: 0.7.0
>Reporter: Russell Jurney
>Assignee: Russell Jurney
> Attachments: working_boolean.patch
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> Pig needs a Boolean data type.  Pig-1097 is dependent on doing this.  
> I volunteer.  Is there anything beyond the work in src/org/apache/pig/data/ 
> plus unit tests to make this work?  




[jira] Updated: (PIG-1549) Provide utility to construct CNF form of predicates

2010-08-31 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1549:


Fix Version/s: (was: 0.8.0)

Unlinking from 0.8 release since we are about to branch

> Provide utility to construct CNF form of predicates
> ---
>
> Key: PIG-1549
> URL: https://issues.apache.org/jira/browse/PIG-1549
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Swati Jain
>Assignee: Swati Jain
> Attachments: 0001-Add-CNF-utility-class.patch
>
>
> Provide utility to construct CNF form of predicates




[jira] Resolved: (PIG-1530) PIG Logical Optimization: Push LOFilter above LOCogroup

2010-08-31 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich resolved PIG-1530.
-

Resolution: Duplicate

Xuefu is addressing this issue as part of 
https://issues.apache.org/jira/browse/PIG-1575.

>  PIG Logical Optimization: Push LOFilter above LOCogroup
> 
>
> Key: PIG-1530
> URL: https://issues.apache.org/jira/browse/PIG-1530
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>Reporter: Swati Jain
>Assignee: Swati Jain
>Priority: Minor
> Fix For: 0.8.0
>
>
> Consider the following:
> {noformat}
> A = load '' USING PigStorage(',') as (a1:int,a2:int,a3:int);
> B = load '' USING PigStorage(',') as (b1:int,b2:int,b3:int);
> G = COGROUP A by (a1,a2) , B by (b1,b2);
> D = Filter G by group.$0 + 5 > group.$1;
> explain D;
> {noformat}
> In the above example, LOFilter can be pushed above LOCogroup. Note there are 
> some tricky NULL issues to think about when the Cogroup is not of type INNER 
> (Similar to issues that need to be thought through when pushing LOFilter on 
> the right side of a LeftOuterJoin).
> Also note that typically the LOFilter in user programs will be below a 
> ForEach-Cogroup pair. To make this really useful, we need to also implement 
> LOFilter pushed across ForEach. 




[jira] Updated: (PIG-1494) PIG Logical Optimization: Use CNF in PushUpFilter

2010-08-31 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1494:



Unlinking from 0.8 since we are about to branch for release

> PIG Logical Optimization: Use CNF in PushUpFilter
> -
>
> Key: PIG-1494
> URL: https://issues.apache.org/jira/browse/PIG-1494
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Swati Jain
>Assignee: Swati Jain
>Priority: Minor
>
> The PushUpFilter rule is not able to handle complicated boolean expressions.
> For example, SplitFilter rule is splitting one LOFilter into two by "AND". 
> However it will not be able to split LOFilter if the top level operator is 
> "OR". For example:
> *ex script:*
> A = load 'file_a' USING PigStorage(',') as (a1:int,a2:int,a3:int);
> B = load 'file_b' USING PigStorage(',') as (b1:int,b2:int,b3:int);
> C = load 'file_c' USING PigStorage(',') as (c1:int,c2:int,c3:int);
> J1 = JOIN B by b1, C by c1;
> J2 = JOIN J1 by $0, A by a1;
> D = *Filter J2 by ( (c1 < 10) AND (a3+b3 > 10) ) OR (c2 == 5);*
> explain D;
> In the above example, the PushUpFilter is not able to push any filter 
> condition across any join as it contains columns from all branches (inputs). 
> But if we convert this expression into "Conjunctive Normal Form" (CNF) then 
> we would be able to push filter condition c1< 10 and c2 == 5 below both join 
> conditions. Here is the CNF expression for highlighted line:
> ( (c1 < 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 ==5) )
> *Suggestion:* It would be a good idea to convert LOFilter's boolean 
> expression into CNF, it would then be easy to push parts (conjuncts) of the 
> LOFilter boolean expression selectively. We would also not require rule 
> SplitFilter anymore if we were to add this utility to rule PushUpFilter 
> itself.




[jira] Updated: (PIG-1494) PIG Logical Optimization: Use CNF in PushUpFilter

2010-08-31 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1494:


Fix Version/s: (was: 0.8.0)




[jira] Created: (PIG-1582) upgrade commons-logging version with ivy

2010-08-31 Thread Giridharan Kesavan (JIRA)
upgrade commons-logging version with ivy


 Key: PIG-1582
 URL: https://issues.apache.org/jira/browse/PIG-1582
 Project: Pig
  Issue Type: Improvement
  Components: build
Reporter: Giridharan Kesavan


to upgrade the commons-logging version for pig from 1.0.3 to 1.1.1




[jira] Assigned: (PIG-1582) upgrade commons-logging version with ivy

2010-08-31 Thread Giridharan Kesavan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giridharan Kesavan reassigned PIG-1582:
---

Assignee: Giridharan Kesavan

> upgrade commons-logging version with ivy
> 
>
> Key: PIG-1582
> URL: https://issues.apache.org/jira/browse/PIG-1582
> Project: Pig
>  Issue Type: Improvement
>  Components: build
>Reporter: Giridharan Kesavan
>Assignee: Giridharan Kesavan
> Attachments: pig-1582.patch
>
>
> to upgrade the commons-logging version for pig from 1.0.3 to 1.1.1




[jira] Updated: (PIG-1582) upgrade commons-logging version with ivy

2010-08-31 Thread Giridharan Kesavan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giridharan Kesavan updated PIG-1582:


Attachment: pig-1582.patch




[jira] Updated: (PIG-1582) upgrade commons-logging version with ivy

2010-08-31 Thread Giridharan Kesavan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giridharan Kesavan updated PIG-1582:


Status: Patch Available  (was: Open)




[jira] Updated: (PIG-1582) upgrade commons-logging version with ivy

2010-08-31 Thread Giridharan Kesavan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giridharan Kesavan updated PIG-1582:


   Status: Resolved  (was: Patch Available)
Fix Version/s: 0.8.0
   Resolution: Fixed

> upgrade commons-logging version with ivy
> 
>
> Key: PIG-1582
> URL: https://issues.apache.org/jira/browse/PIG-1582
> Project: Pig
>  Issue Type: Improvement
>  Components: build
>Reporter: Giridharan Kesavan
>Assignee: Giridharan Kesavan
> Fix For: 0.8.0
>
> Attachments: pig-1582.patch
>
>
> to upgrade the commons-logging version for pig from 1.0.3 to 1.1.1




[jira] Created: (PIG-1583) piggybank unit test TestLookupInFiles is broken

2010-08-31 Thread Daniel Dai (JIRA)
piggybank unit test TestLookupInFiles is broken
---

 Key: PIG-1583
 URL: https://issues.apache.org/jira/browse/PIG-1583
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0
 Attachments: PIG-1583-1.patch

Error message:
10/08/31 09:32:12 INFO mapred.TaskInProgress: Error from 
attempt_20100831093139211_0001_m_00_3: 
org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught 
error from UDF: org.apache.pig.piggybank.evaluation.string.LookupInFiles 
[LookupInFiles : Cannot open file one]
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:262)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:283)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:355)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.io.IOException: LookupInFiles : Cannot open file one
at 
org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:92)
at 
org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:115)
at 
org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:49)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229)
... 10 more
Caused by: java.io.IOException: hdfs://localhost:47453/user/hadoopqa/one 
does not exist
at 
org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:224)
at 
org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:172)
at 
org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:89)
... 13 more





[jira] Updated: (PIG-1583) piggybank unit test TestLookupInFiles is broken

2010-08-31 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1583:


Attachment: PIG-1583-1.patch

> piggybank unit test TestLookupInFiles is broken
> ---
>
> Key: PIG-1583
> URL: https://issues.apache.org/jira/browse/PIG-1583
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1583-1.patch
>
>




[jira] Updated: (PIG-1583) piggybank unit test TestLookupInFiles is broken

2010-08-31 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1583:


Attachment: (was: PIG-1583-1.patch)

> piggybank unit test TestLookupInFiles is broken
> ---
>
> Key: PIG-1583
> URL: https://issues.apache.org/jira/browse/PIG-1583
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1583-1.patch
>
>




[jira] Updated: (PIG-1583) piggybank unit test TestLookupInFiles is broken

2010-08-31 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1583:


Attachment: PIG-1583-1.patch

> piggybank unit test TestLookupInFiles is broken
> ---
>
> Key: PIG-1583
> URL: https://issues.apache.org/jira/browse/PIG-1583
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1583-1.patch
>
>




Branching for Pig 0.8 release

2010-08-31 Thread Olga Natkovich
Hi,

I am about to branch for the release. Please hold off on commits until I am done. 
I will send a follow-up email at that time.

Thanks,

Olga



RE: Branching for Pig 0.8 release

2010-08-31 Thread Olga Natkovich
The branch has been created. Only bug fixes related to the 0.8 release should be 
committed there. When committing on the branch, please make sure to make (and 
test) the corresponding changes on trunk.

Thanks,

Olga

-Original Message-
From: Olga Natkovich [mailto:ol...@yahoo-inc.com] 
Sent: Tuesday, August 31, 2010 2:08 PM
To: pig-dev@hadoop.apache.org
Subject: Branching for Pig 0.8 release

Hi,

I am about to branch for release. Please, hold off your commits till I am done. 
I will send a follow up email at that time.

Thanks,

Olga



[jira] Commented: (PIG-1583) piggybank unit test TestLookupInFiles is broken

2010-08-31 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904783#action_12904783
 ] 

Xuefu Zhang commented on PIG-1583:
--

+1 Patch Looks Good.

> piggybank unit test TestLookupInFiles is broken
> ---
>
> Key: PIG-1583
> URL: https://issues.apache.org/jira/browse/PIG-1583
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1583-1.patch
>
>




[jira] Updated: (PIG-1583) piggybank unit test TestLookupInFiles is broken

2010-08-31 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-1583:
-

Status: Patch Available  (was: Open)

> piggybank unit test TestLookupInFiles is broken
> ---
>
> Key: PIG-1583
> URL: https://issues.apache.org/jira/browse/PIG-1583
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1583-1.patch
>
>




[jira] Commented: (PIG-1506) Need to clarify the difference between null handling in JOIN and COGROUP

2010-08-31 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904785#action_12904785
 ] 

Olga Natkovich commented on PIG-1506:
-

This is what we need to document:

In the case of GROUP/COGROUP, data with a NULL key from the same input is 
grouped together. For instance:

Input data:

joe 5   2.5
sam 3.0
bob 3.5

script:

A = load 'small' as (name, age, gpa);
B = group A by age;
dump B;

Output:

(5,{(joe,5,2.5)})
(,{(sam,,3.0),(bob,,3.5)})

Note that both records with null age are grouped together.

However, data with null keys from different inputs is considered distinct and 
will generate multiple tuples in the case of COGROUP. For instance:

Input: Self cogroup on the same input.

Script:

A = load 'small' as (name, age, gpa);
B = load 'small' as (name, age, gpa);
C = cogroup A by age, B by age;
dump C;

Output:

(5,{(joe,5,2.5)},{(joe,5,2.5)})
(,{(sam,,3.0),(bob,,3.5)},{})
(,{},{(sam,,3.0),(bob,,3.5)})

Note that there are two tuples in the output corresponding to the null key: one 
that contains tuples from the first input (with no match from the second) and 
one the other way around.

JOIN adds another interesting twist because it follows the SQL standard: by 
default, JOIN is an inner join, which throws away all the nulls.

Input: the same as for COGROUP

Script:

A = load 'small' as (name, age, gpa);
B = load 'small' as (name, age, gpa);
C = join A by age, B by age;
dump C;

Output:

(joe,5,2.5,joe,5,2.5)

Note that all tuples that had a NULL key were filtered out.
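For readers who want to experiment with these semantics outside Pig, here is an illustrative Python sketch of the behavior described above (this is a simulation, not Pig's implementation; `group_by_age` and `inner_join_by_age` are hypothetical helper names, and `None` stands in for NULL):

```python
# Illustrative sketch of the null-key semantics described above,
# using the same 'small' dataset. None stands in for NULL.
rows = [("joe", 5, 2.5), ("sam", None, 3.0), ("bob", None, 3.5)]

def group_by_age(rel):
    # GROUP/COGROUP: null keys from the SAME input collapse into one group.
    groups = {}
    for t in rel:
        groups.setdefault(t[1], []).append(t)
    return groups

def inner_join_by_age(left, right):
    # JOIN (inner): a tuple with a null key never matches, so it is dropped.
    return [l + r for l in left for r in right
            if l[1] is not None and l[1] == r[1]]

g = group_by_age(rows)
assert g[None] == [("sam", None, 3.0), ("bob", None, 3.5)]  # grouped together

j = inner_join_by_age(rows, rows)
assert j == [("joe", 5, 2.5, "joe", 5, 2.5)]  # null-age tuples filtered out
```

The explicit `is not None` check in the join is what models the SQL rule that NULL never compares equal to anything, including another NULL.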


> Need to clarify the difference between null handling in JOIN and COGROUP
> 
>
> Key: PIG-1506
> URL: https://issues.apache.org/jira/browse/PIG-1506
> Project: Pig
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Olga Natkovich
>Assignee: Corinne Chandel
> Fix For: 0.8.0
>
>





[jira] Commented: (PIG-1399) Logical Optimizer: Expression optimizor rule

2010-08-31 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904808#action_12904808
 ] 

Alan Gates commented on PIG-1399:
-

 [exec] +1 overall.  
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 6 new or 
modified tests.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.
 [exec] 
 [exec] 

> Logical Optimizer: Expression optimizor rule
> 
>
> Key: PIG-1399
> URL: https://issues.apache.org/jira/browse/PIG-1399
> Project: Pig
>  Issue Type: Sub-task
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: newPatchFindbugsWarnings.html, PIG-1399.patch, 
> PIG-1399.patch, PIG-1399.patch, PIG-1399.patch, PIG-1399.patch, 
> PIG-1399.patch, PIG-1399.patch
>
>
> We can optimize expression in several ways:
> 1. Constant pre-calculation
> Example:
> B = filter A by a0 > 5+7;
> => B = filter A by a0 > 12;
> 2. Boolean expression optimization
> Example:
> B = filter A by not (not(a0>5) or a>10);
> => B = filter A by a0>5 and a<=10;
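The two rewrites above can be sanity-checked with a quick sketch; this is plain Python illustrating the logic of the rules (two-valued logic, null handling aside), not the optimizer's actual code:

```python
# 1. Constant pre-calculation: "a0 > 5 + 7" becomes "a0 > 12" because the
#    pure-constant subexpression can be evaluated once at plan time.
threshold = 5 + 7
assert threshold == 12

# 2. Boolean simplification (De Morgan's laws):
#    not (not (a0 > 5) or a > 10)  ==  a0 > 5 and a <= 10
def original(a0, a):
    return not (not (a0 > 5) or a > 10)

def simplified(a0, a):
    return a0 > 5 and a <= 10

# The rewrite must agree on every input for the rule to be safe.
for a0 in (4, 6):
    for a in (9, 10, 11):
        assert original(a0, a) == simplified(a0, a)
```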




[jira] Commented: (PIG-1506) Need to clarify the difference between null handling in JOIN and COGROUP

2010-08-31 Thread Scott Carey (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904819#action_12904819
 ] 

Scott Carey commented on PIG-1506:
--

The SQL behavior of the above for an outer join would be to have five rows of 
output -- just like COGROUP would if flattened. So that seems fine to me. A 
self-join should be the same as a COGROUP with yourself, which is different 
from a simple GROUP.

However, there is a problem with inner join and nulls.
Pig JOIN does not behave like SQL with respect to nulls on multi-column joins. 
(I have not tried this on trunk, however.)

In SQL, if ANY of the columns in a multi-column join is null, the row is not 
output. 

Try:

{code}
A = load 'small' as (name, age, gpa);
B = load 'small' as (name, age, gpa);
C = join A by (name,age), B by (name,age);
dump C;
{code}

The result in SQL would be the single row 
joe 5 2.5 joe 5 2.5
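A sketch of that SQL rule in plain Python (illustrative only; `sql_multi_key_join` is a hypothetical helper, and `None` stands in for NULL):

```python
# SQL inner-join semantics on a multi-column key: if ANY join column is
# null, the row cannot match and is not output.
rows = [("joe", 5, 2.5), ("sam", None, 3.0), ("bob", None, 3.5)]

def sql_multi_key_join(left, right):
    def key_ok(t):  # the (name, age) key columns must both be non-null
        return t[0] is not None and t[1] is not None
    return [l + r for l in left for r in right
            if key_ok(l) and key_ok(r) and (l[0], l[1]) == (r[0], r[1])]

result = sql_multi_key_join(rows, rows)
assert result == [("joe", 5, 2.5, "joe", 5, 2.5)]  # only joe survives
```

The explicit non-null check matters because Python's `None == None` evaluates to True, whereas SQL's `NULL = NULL` is unknown and never satisfies a join predicate.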



> Need to clarify the difference between null handling in JOIN and COGROUP
> 
>
> Key: PIG-1506
> URL: https://issues.apache.org/jira/browse/PIG-1506
> Project: Pig
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Olga Natkovich
>Assignee: Corinne Chandel
> Fix For: 0.8.0
>
>





[jira] Created: (PIG-1584) deal with inner cogroup

2010-08-31 Thread Olga Natkovich (JIRA)
deal with inner cogroup
---

 Key: PIG-1584
 URL: https://issues.apache.org/jira/browse/PIG-1584
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
 Fix For: 0.9.0


The current implementation of inner in the case of cogroup conflicts with 
join. We need to decide whether to fix inner cogroup or remove the 
functionality if it is not widely used.




[jira] Updated: (PIG-1583) piggybank unit test TestLookupInFiles is broken

2010-08-31 Thread Giridharan Kesavan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giridharan Kesavan updated PIG-1583:


Status: Open  (was: Patch Available)

Submitting to Hudson.

> piggybank unit test TestLookupInFiles is broken
> ---
>
> Key: PIG-1583
> URL: https://issues.apache.org/jira/browse/PIG-1583
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1583-1.patch
>
>




[jira] Updated: (PIG-1583) piggybank unit test TestLookupInFiles is broken

2010-08-31 Thread Giridharan Kesavan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giridharan Kesavan updated PIG-1583:


Status: Patch Available  (was: Open)

> piggybank unit test TestLookupInFiles is broken
> ---
>
> Key: PIG-1583
> URL: https://issues.apache.org/jira/browse/PIG-1583
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1583-1.patch
>
>




[jira] Commented: (PIG-1506) Need to clarify the difference between null handling in JOIN and COGROUP

2010-08-31 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904829#action_12904829
 ] 

Olga Natkovich commented on PIG-1506:
-

I verified that the 0.8 code deals correctly with multi-column keys with nulls.

> Need to clarify the difference between null handling in JOIN and COGROUP
> 
>
> Key: PIG-1506
> URL: https://issues.apache.org/jira/browse/PIG-1506
> Project: Pig
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Olga Natkovich
>Assignee: Corinne Chandel
> Fix For: 0.8.0
>
>





[jira] Updated: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-31 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1501:
--

Release Note: 
This feature will save HDFS space used to store the intermediate data used by 
PIG and potentially improve query execution speed. In general, the more 
intermediate data generated, the greater the storage and speedup benefits.

There are no backward compatibility issues as a result of this feature.

Two Java properties are used to control the behavior:

pig.tmpfilecompression, which defaults to false, tells whether the temporary 
files should be compressed or not.  If true, then

pig.tmpfilecompression.codec specifies which compression codec to use. 
Currently, PIG only accepts "gz" and "lzo" as possible values. Since LZO is 
under GPL license, Hadoop may need to be configured to use LZO codec. Please 
refer to http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ for details.


An example is the following "test.pig" script:

register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views' using 
org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
as (user, action, timespent:long, query_term, ip_addr, timestamp, 
estimated_revenue, page_info, page_links);
B1 = filter A by timespent == 4;
B = load '/user/pig/tests/data/pigmix/queryterm' as (query_term);
C = join B1 by query_term, B by query_term using 'skewed' parallel 300;
D = distinct C parallel 300;
store D into 'output.lzo';

which is launched as follows:

java -cp /grid/0/gs/conf/current:/grid/0/jars/pig.jar 
-Djava.library.path=/grid/0/gs/hadoop/current/lib/native/Linux-i386-32 
-Dpig.tmpfilecompression=true -Dpig.tmpfilecompression.codec=lzo 
org.apache.pig.Main ./test.pig



> need to investigate the impact of compression on pig performance
> 
>
> Key: PIG-1501
> URL: https://issues.apache.org/jira/browse/PIG-1501
> Project: Pig
>  Issue Type: Test
>Reporter: Olga Natkovich
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: compress_perf_data.txt, compress_perf_data_2.txt, 
> PIG-1501.patch, PIG-1501.patch, PIG-1501.patch
>
>
> We would like to understand how compressing map results as well as 
> reducer output in a chain of MR jobs impacts performance. We can use PigMix 
> queries for this investigation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-31 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1501:
--

Release Note: 
This feature will save HDFS space used to store the intermediate data used by 
PIG and potentially improve query execution speed. In general, the more 
intermediate data generated, the greater the storage and speedup benefits.

There are no backward compatibility issues as a result of this feature.

Two Java properties are used to control the behavior:

pig.tmpfilecompression, which defaults to false, tells whether the temporary 
files should be compressed or not.  If true, then

pig.tmpfilecompression.codec specifies which compression codec to use. 
Currently, PIG only accepts "gz" and "lzo" as possible values. Since LZO is 
under GPL license, Hadoop may need to be configured to use LZO codec. Please 
refer to http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ for details.


An example is the following "test.pig" script:

register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views' using 
org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
as (user, action, timespent:long, query_term, ip_addr, timestamp, 
estimated_revenue, page_info, page_links);
B1 = filter A by timespent == 4;
B = load '/user/pig/tests/data/pigmix/queryterm' as (query_term);
C = join B1 by query_term, B by query_term using 'skewed' parallel 300;
D = distinct C parallel 300;
store D into 'output.lzo';

which is launched as follows:

java -cp /grid/0/gs/conf/current:/grid/0/jars/pig.jar 
-Djava.library.path=/grid/0/gs/hadoop/current/lib/native/Linux-i386-32 
-Dpig.tmpfilecompression=true -Dpig.tmpfilecompression.codec=lzo 
org.apache.pig.Main ./test.pig

  was:
This feature will save HDFS space used to store the intermediate data used by 
PIG and potentially improve query execution speed. In general, the more 
intermediate data generated, the more storage and speedup benefits.

There are no backward compatibility issues as result of this feature.

Two java properties are used to control the behavoir:

pig.tmpfilecompression, default to false, tells if the temporary files should 
be compressed or not.  If true, then

pig.tmpfilecompression.codec specifies which compression codec to use. 
Currently, PIG only accepts "gz" and "lzo" as possible values. Since LZO is 
under GPL license, Hadoop may need to be configured to use LZO codec. Please 
refer to http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ for details.


An example is the following "test.pig" script:

register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views' using 
org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
as (user, action, timespent:long, query_term, ip_addr, timestamp, 
estimated_revenue, page_info, page_links);
B1 = filter A by timespent == 4;
B = load '/user/pig/tests/data/pigmix/queryterm' as (query_term);
C = join B1 by query_term, B by query_term using 'skewed' parallel 300;
D = distinct C parallel 300;
store D into 'output.lzo';

which is launched as follows:

java -cp /grid/0/gs/conf/current:/grid/0/jars/pig.jar 
-Djava.library.path=/grid/0/gs/hadoop/current/lib/native/Linux-i386-32 
-Dpig.tmpfilecompression=true -Dpig.tmpfilecompression.codec=lzo 
org.apache.pig.Main ./test.pig




> need to investigate the impact of compression on pig performance
> 
>
> Key: PIG-1501
> URL: https://issues.apache.org/jira/browse/PIG-1501
> Project: Pig
>  Issue Type: Test
>Reporter: Olga Natkovich
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: compress_perf_data.txt, compress_perf_data_2.txt, 
> PIG-1501.patch, PIG-1501.patch, PIG-1501.patch
>
>
> We would like to understand how compressing map results as well as 
> reducer output in a chain of MR jobs impacts performance. We can use PigMix 
> queries for this investigation.

[jira] Commented: (PIG-1506) Need to clarify the difference between null handling in JOIN and COGROUP

2010-08-31 Thread Scott Carey (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904835#action_12904835
 ] 

Scott Carey commented on PIG-1506:
--

I have just confirmed that on 0.7 it works fine, but 0.5 does not. So this was 
fixed in 0.6 or 0.7.  I suppose I can take out some null guards from my scripts 
now :)

This was my test:

{code}
A = LOAD '/tmp/test.txt' as (a,b,c);
B = LOAD '/tmp/test.txt' as (a,b,c);
C = JOIN A by (a,b), B by (a,b);

DUMP A;
DUMP C;
{code}

With 0.5 I get:
A:
(fred,1,3)
(bob,,4)
C:
(bob,,4,bob,,4)
(fred,1,3,fred,1,3)

and with 0.7 C is:
(fred,1,3,fred,1,3)



> Need to clarify the difference between null handling in JOIN and COGROUP
> 
>
> Key: PIG-1506
> URL: https://issues.apache.org/jira/browse/PIG-1506
> Project: Pig
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Olga Natkovich
>Assignee: Corinne Chandel
> Fix For: 0.8.0
>
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1585) Add new properties to help and documentation

2010-08-31 Thread Olga Natkovich (JIRA)
Add new properties to help and documentation


 Key: PIG-1585
 URL: https://issues.apache.org/jira/browse/PIG-1585
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Olga Natkovich
 Fix For: 0.8.0


New properties:

Compression:

pig.tmpfilecompression, which defaults to false, tells whether the temporary 
files should be compressed or not. If true, then 
pig.tmpfilecompression.codec specifies which compression codec to use. 
Currently, PIG only accepts "gz" and "lzo" as possible values. Since LZO is 
under GPL license, Hadoop may need to be configured to use LZO codec. Please 
refer to http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ for details. 

Combining small files:

pig.noSplitCombination - disables combining multiple small files to the block 
size
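As a hedged sketch of how these properties are passed (the pig.jar path, codec choice, and script name below are assumptions, following the invocation shown in the PIG-1501 release note):

```shell
# Sketch only: paths and the value format of pig.noSplitCombination are assumptions.
# Enable gz compression of temporary files and disable small-file combination.
java -cp /path/to/pig.jar \
  -Dpig.tmpfilecompression=true \
  -Dpig.tmpfilecompression.codec=gz \
  -Dpig.noSplitCombination=true \
  org.apache.pig.Main ./test.pig
```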


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-31 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904843#action_12904843
 ] 

Ashutosh Chauhan commented on PIG-1501:
---

If it's not backward-incompatible, then is there any specific reason to default 
pig.tmpfilecompression to false? This seems to be a useful feature, so it 
should be true by default, no?

> need to investigate the impact of compression on pig performance
> 
>
> Key: PIG-1501
> URL: https://issues.apache.org/jira/browse/PIG-1501
> Project: Pig
>  Issue Type: Test
>Reporter: Olga Natkovich
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: compress_perf_data.txt, compress_perf_data_2.txt, 
> PIG-1501.patch, PIG-1501.patch, PIG-1501.patch
>
>
> We would like to understand how compressing map results as well as 
> reducer output in a chain of MR jobs impacts performance. We can use PigMix 
> queries for this investigation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1572) change default datatype when relations are used as scalar to bytearray

2010-08-31 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1572:
---

Attachment: PIG-1572.2.patch

PIG-1572.2.patch 
- Fixed loss of lineage information in translation during explain call
- Added cast on output of ReadScalars so that type information is not lost 
during schema reset from optimizer.

Unit tests and test-patch have passed. The patch is ready for review.

 [exec] +1 overall.  
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 3 new or 
modified tests.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.


> change default datatype when relations are used as scalar to bytearray
> --
>
> Key: PIG-1572
> URL: https://issues.apache.org/jira/browse/PIG-1572
> Project: Pig
>  Issue Type: Bug
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
> Fix For: 0.8.0
>
> Attachments: PIG-1572.1.patch, PIG-1572.2.patch
>
>
> When relations are cast to scalar, the current default type is chararray. 
> This is inconsistent with the behavior in rest of pig-latin.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-31 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904848#action_12904848
 ] 

Olga Natkovich commented on PIG-1501:
-

Ashutosh,

The reason it is off by default is that the default compression is gzip, 
which is really slow and most of the time not what you want. Because of the 
licensing issue with LZO, users need to set it up on their own. Once they do 
the setup, they can enable the compression.

> need to investigate the impact of compression on pig performance
> 
>
> Key: PIG-1501
> URL: https://issues.apache.org/jira/browse/PIG-1501
> Project: Pig
>  Issue Type: Test
>Reporter: Olga Natkovich
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: compress_perf_data.txt, compress_perf_data_2.txt, 
> PIG-1501.patch, PIG-1501.patch, PIG-1501.patch
>
>
> We would like to understand how compressing map results as well as 
> reducer output in a chain of MR jobs impacts performance. We can use PigMix 
> queries for this investigation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1586) Parameter subsitution using -param option runs into problems when substituing entire pig statements in a shell script (maybe this is a bash problem)

2010-08-31 Thread Viraj Bhat (JIRA)
Parameter subsitution using -param option runs into problems when substituing 
entire pig statements in a shell script (maybe this is a bash problem)


 Key: PIG-1586
 URL: https://issues.apache.org/jira/browse/PIG-1586
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Viraj Bhat


I have a Pig script as a template:

{code}
register Countwords.jar;
A = $INPUT;
B = FOREACH A GENERATE
examples.udf.SubString($0,0,1),
$1 as num;
C = GROUP B BY $0;
D = FOREACH C GENERATE group, SUM(B.num);
STORE D INTO $OUTPUT;
{code}


I attempt to do Parameter substitutions using the following:

Using Shell script:

{code}
#!/bin/bash
java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -r -file 
sub.pig \
 -param INPUT="(foreach (COGROUP(load '/user/viraj/dataset1' USING 
PigStorage() AS (word:chararray,num:int)) by (word),(load 
'/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by 
(word)) generate flatten(examples.udf.CountWords(\\$0,\\$1,\\$2)))" \
 -param OUTPUT="\'/user/viraj/output\' USING PigStorage()"
{code}

{code}
register Countwords.jar;

A = (foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS 
(word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING 
PigStorage() AS (word:chararray,num:int)) by (word)) generate 
flatten(examples.udf.CountWords(runsub.sh,,)));
B = FOREACH A GENERATE
examples.udf.SubString($0,0,1),
$1 as num;
C = GROUP B BY $0;
D = FOREACH C GENERATE group, SUM(B.num);

STORE D INTO /user/viraj/output;
{code}

The shell substitutes the $0 before passing it to java. 
a) Is there a workaround for this?  
b) Is this a Pig param problem?
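For reference, a minimal bash demonstration of the quoting behavior at issue (the CountWords text is just illustrative): inside double quotes, bash substitutes its own positional parameters, while a backslash keeps the literal text for Pig:

```shell
#!/bin/bash
# In double quotes, bash expands $0 to the shell/script name and unset $1/$2 to empty.
expanded="CountWords($0,$1,$2)"
# A backslash escapes the dollar sign, so the literal $0,$1,$2 survives.
literal="CountWords(\$0,\$1,\$2)"
echo "expanded: $expanded"
echo "literal:  $literal"   # prints: literal:  CountWords($0,$1,$2)
```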


Viraj



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1586) Parameter subsitution using -param option runs into problems when substituing entire pig statements in a shell script (maybe this is a bash problem)

2010-08-31 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-1586:


Description: 
I have a Pig script as a template:

{code}
register Countwords.jar;
A = $INPUT;
B = FOREACH A GENERATE
examples.udf.SubString($0,0,1),
$1 as num;
C = GROUP B BY $0;
D = FOREACH C GENERATE group, SUM(B.num);
STORE D INTO $OUTPUT;
{code}


I attempt to do Parameter substitutions using the following:

Using Shell script:

{code}
#!/bin/bash
java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -r -file 
sub.pig \
 -param INPUT="(foreach (COGROUP(load '/user/viraj/dataset1' USING 
PigStorage() AS (word:chararray,num:int)) by (word),(load 
'/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by 
(word)) generate flatten(examples.udf.CountWords(\\$0,\\$1,\\$2)))" \
 -param OUTPUT="\'/user/viraj/output\' USING PigStorage()"
{code}

{code}
register Countwords.jar;

A = (foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS 
(word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING 
PigStorage() AS (word:chararray,num:int)) by (word)) generate 
flatten(examples.udf.CountWords(runsub.sh,,)));
B = FOREACH A GENERATE
examples.udf.SubString($0,0,1),
$1 as num;
C = GROUP B BY $0;
D = FOREACH C GENERATE group, SUM(B.num);

STORE D INTO /user/viraj/output;
{code}

The shell substitutes the $0 before passing it to java. 
a) Is there a workaround for this?  
b) Is this a Pig param problem?


Viraj

  was:
I have a Pig script as a template:

{code}
register Countwords.jar;
A = $INPUT;
B = FOREACH A GENERATE
examples.udf.SubString($0,0,1),
$1 as num;
C = GROUP B BY $0;
D = FOREACH C GENERATE group, SUM(B.num);
STORE D INTO $OUTPUT;
{code}


I attempt to do Parameter substitutions using the following:

Using Shell script:

{code}
#!/bin/bash
java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -r -file 
sub.pig \
 -param INPUT="(foreach (COGROUP(load '/user/viraj/dataset1' USING 
PigStorage() AS (word:chararray,num:int)) by (word),(load 
'/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by 
(word)) generate flatten(examples.udf.CountWords(\\$0,\\$1,\\$2)))" \
 -param OUTPUT="\'/user/viraj/output\' USING PigStorage()"
{code}

register Countwords.jar;

A = (foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS 
(word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING 
PigStorage() AS (word:chararray,num:int)) by (word)) generate 
flatten(examples.udf.CountWords(runsub.sh,,)));
B = FOREACH A GENERATE
examples.udf.SubString($0,0,1),
$1 as num;
C = GROUP B BY $0;
D = FOREACH C GENERATE group, SUM(B.num);

STORE D INTO /user/viraj/output;
{code}

The shell substitutes the $0 before passing it to java. 
a) Is there a workaround for this?  
b) Is this is Pig param problem?


Viraj




> Parameter subsitution using -param option runs into problems when substituing 
> entire pig statements in a shell script (maybe this is a bash problem)
> 
>
> Key: PIG-1586
> URL: https://issues.apache.org/jira/browse/PIG-1586
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Viraj Bhat
>
> I have a Pig script as a template:
> {code}
> register Countwords.jar;
> A = $INPUT;
> B = FOREACH A GENERATE
> examples.udf.SubString($0,0,1),
> $1 as num;
> C = GROUP B BY $0;
> D = FOREACH C GENERATE group, SUM(B.num);
> STORE D INTO $OUTPUT;
> {code}
> I attempt to do Parameter substitutions using the following:
> Using Shell script:
> {code}
> #!/bin/bash
> java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -r 
> -file sub.pig \
>  -param INPUT="(foreach (COGROUP(load '/user/viraj/dataset1' 
> USING PigStorage() AS (word:chararray,num:int)) by (word),(load 
> '/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by 
> (word)) generate flatten(examples.udf.CountWords(\\$0,\\$1,\\$2)))" \
>  -param OUTPUT="\'/user/viraj/output\' USING PigStorage()"
> {code}
> {code}
> register Countwords.jar;
> A = (foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS 
> (word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING 
> PigStorage() AS (word:chararray,num:int)) by (word)) generate 
> flatten(examples.udf.CountWords(runsub.sh,,)));
> B = FOREACH A GENERATE
> examples.udf.SubString($0,0,1),
> $1 as num;
> C = GROUP B BY $0;
> D = FOREACH C GENERATE group, SUM(B.num);
> STORE D INTO /user/viraj/output;
> {code}
> The shell substitutes the $0 before passing it to java. 
> a) Is there a workaround for this?  
> b) Is this is Pig param problem?
> Viraj

-- 
This message is automatically g

[jira] Created: (PIG-1587) Cloning utility functions for new logical plan

2010-08-31 Thread Daniel Dai (JIRA)
Cloning utility functions for new logical plan
--

 Key: PIG-1587
 URL: https://issues.apache.org/jira/browse/PIG-1587
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
 Fix For: 0.9.0


We sometimes need to copy a logical operator/plan when writing an optimization 
rule. Currently, copying an operator/plan is awkward. We need to write some 
utilities to facilitate this process. Swati contributed PIG-1510, but we feel it 
still cannot address most use cases. I propose to add some more utilities to the 
new logical plan:

all LogicalExpressions:
{code}
copy(LogicalExpressionPlan newPlan, boolean keepUid);
{code}
* Do a shallow copy of the logical expression operator (except for fieldSchema, 
uidOnlySchema, ProjectExpression.attachedRelationalOp)
* Set the plan to newPlan
* If keepUid is true, further copy uidOnlyFieldSchema

all LogicalRelationalOperators:
{code}
copy(LogicalPlan newPlan, boolean keepUid);
{code}
* Do a shallow copy of the logical relational operator (except for schema, uid 
related fields)
* Set the plan to newPlan;
* If the operator has an inner plan/expression plan, copy the whole inner plan 
with the same keepUid flag (in particular, LOInnerLoad will copy its inner 
project with the same keepUid flag)
* If keepUid is true, further copy uid related fields (LOUnion.uidMapping, 
LOCogroup.groupKeyUidOnlySchema, LOCogroup.generatedInputUids)

LogicalExpressionPlan.java
{code}
LogicalExpressionPlan copy(LogicalRelationalOperator attachedRelationalOp, 
boolean keepUid);
{code}
* Copy expression operators along with their connections, using the same keepUid flag
* Set all ProjectExpression.attachedRelationalOp to attachedRelationalOp 
parameter

{code}
List merge(LogicalExpressionPlan plan);
{code}
* Merge plan into the current logical expression plan as an independent tree
* return the sources of this independent tree


LogicalPlan.java
{code}
LogicalPlan copy(boolean keepUid);
{code}
* Main use case to copy inner plan of ForEach
* Copy all relational operators along with their connections
* Copy all expression plans inside relational operators, setting plan and 
attachedRelationalOp properly

{code}
List merge(LogicalPlan plan);
{code}
* Merge plan into the current logical plan as an independent tree
* return the sources of this independent tree


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1586) Parameter subsitution using -param option runs into problems when substituing entire pig statements in a shell script (maybe this is a bash problem)

2010-08-31 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich reassigned PIG-1586:
---

Assignee: Viraj Bhat

Viraj volunteered to print the line that Pig gets as part of parameter 
substitution to see if the escapes and quotes are eaten by the shell. Thanks, 
Viraj.

> Parameter subsitution using -param option runs into problems when substituing 
> entire pig statements in a shell script (maybe this is a bash problem)
> 
>
> Key: PIG-1586
> URL: https://issues.apache.org/jira/browse/PIG-1586
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Viraj Bhat
>Assignee: Viraj Bhat
>
> I have a Pig script as a template:
> {code}
> register Countwords.jar;
> A = $INPUT;
> B = FOREACH A GENERATE
> examples.udf.SubString($0,0,1),
> $1 as num;
> C = GROUP B BY $0;
> D = FOREACH C GENERATE group, SUM(B.num);
> STORE D INTO $OUTPUT;
> {code}
> I attempt to do Parameter substitutions using the following:
> Using Shell script:
> {code}
> #!/bin/bash
> java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -r 
> -file sub.pig \
>  -param INPUT="(foreach (COGROUP(load '/user/viraj/dataset1' 
> USING PigStorage() AS (word:chararray,num:int)) by (word),(load 
> '/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by 
> (word)) generate flatten(examples.udf.CountWords(\\$0,\\$1,\\$2)))" \
>  -param OUTPUT="\'/user/viraj/output\' USING PigStorage()"
> {code}
> {code}
> register Countwords.jar;
> A = (foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS 
> (word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING 
> PigStorage() AS (word:chararray,num:int)) by (word)) generate 
> flatten(examples.udf.CountWords(runsub.sh,,)));
> B = FOREACH A GENERATE
> examples.udf.SubString($0,0,1),
> $1 as num;
> C = GROUP B BY $0;
> D = FOREACH C GENERATE group, SUM(B.num);
> STORE D INTO /user/viraj/output;
> {code}
> The shell substitutes the $0 before passing it to java. 
> a) Is there a workaround for this?  
> b) Is this is Pig param problem?
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1588) Parameter pre-processing of values containing pig positional variables ($0, $1 etc)

2010-08-31 Thread Laukik Chitnis (JIRA)
Parameter pre-processing of values containing pig positional variables ($0, $1 
etc)
---

 Key: PIG-1588
 URL: https://issues.apache.org/jira/browse/PIG-1588
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Laukik Chitnis
 Fix For: 0.7.0


Pig 0.7 requires the positional variables to be escaped by a \\ when passed as 
part of a parameter value (either through a cmd-line param or through a 
param_file), which was not the case in Pig 0.6. Assuming that this was not an 
intended breakage of backward compatibility (it could not be found in the 
release notes), this would be a bug.

For example, we need to pass
INPUT=CountWords(\\$0,\\$1,\\$2)

instead of simply
INPUT=CountWords($0,$1,$2)



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-1588) Parameter pre-processing of values containing pig positional variables ($0, $1 etc)

2010-08-31 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich resolved PIG-1588.
-

Resolution: Duplicate

This is a duplicate of https://issues.apache.org/jira/browse/PIG-1586, and at 
this point we do not believe that either is a bug in Pig. Viraj is verifying 
that, but we think the shell removes the escapes before giving them to Pig.

> Parameter pre-processing of values containing pig positional variables ($0, 
> $1 etc)
> ---
>
> Key: PIG-1588
> URL: https://issues.apache.org/jira/browse/PIG-1588
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Laukik Chitnis
> Fix For: 0.7.0
>
>
> Pig 0.7 requires the positional variables to be escaped by a \\ when passed 
> as part of a parameter value (either through cmd line param or through 
> param_file), which was not the case in Pig 0.6 Assuming that this was not an 
> intended breakage of backward compatibility (could not find it in release 
> notes), this would be a bug.
> For example, We need to pass
> INPUT=CountWords(\\$0,\\$1,\\$2)
> instead of simply
> INPUT=CountWords($0,$1,$2)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-1537) Column pruner causes wrong results when using both Custom Store UDF and PigStorage

2010-08-31 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich resolved PIG-1537.
-

Resolution: Fixed

> Column pruner causes wrong results when using both Custom Store UDF and 
> PigStorage
> --
>
> Key: PIG-1537
> URL: https://issues.apache.org/jira/browse/PIG-1537
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Viraj Bhat
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
>
> I have script which is of this pattern and it uses 2 StoreFunc's:
> {code}
> register loader.jar
> register piggy-bank/java/build/storage.jar;
> %DEFAULT OUTPUTDIR /user/viraj/prunecol/
> ss_sc_0 = LOAD '/data/click/20100707/0' USING Loader() AS (a, b, c);
> ss_sc_filtered_0 = FILTER ss_sc_0 BY
> a#'id' matches '1.*' OR
> a#'id' matches '2.*' OR
> a#'id' matches '3.*' OR
> a#'id' matches '4.*';
> ss_sc_1 = LOAD '/data/click/20100707/1' USING Loader() AS (a, b, c);
> ss_sc_filtered_1 = FILTER ss_sc_1 BY
> a#'id' matches '65.*' OR
> a#'id' matches '466.*' OR
> a#'id' matches '043.*' OR
> a#'id' matches '044.*' OR
> a#'id' matches '0650.*' OR
> a#'id' matches '001.*';
> ss_sc_all = UNION ss_sc_filtered_0,ss_sc_filtered_1;
> ss_sc_all_proj = FOREACH ss_sc_all GENERATE
> a#'query' as query,
> a#'testid' as testid,
> a#'timestamp' as timestamp,
> a,
> b,
> c;
> ss_sc_all_ord = ORDER ss_sc_all_proj BY query,testid,timestamp PARALLEL 10;
> ss_sc_all_map = FOREACH ss_sc_all_ord  GENERATE a, b, c;
> STORE ss_sc_all_map INTO '$OUTPUTDIR/data/20100707' using Storage();
> ss_sc_all_map_count = group ss_sc_all_map all;
> count = FOREACH ss_sc_all_map_count GENERATE 'record_count' as 
> record_count,COUNT($1);
> STORE count INTO '$OUTPUTDIR/count/20100707' using PigStorage('\u0009');
> {code}
> I run this script using:
> a) java -cp pig0.7.jar script.pig
> b) java -cp pig0.7.jar -t PruneColumns script.pig
> What I observe is that the alias "count" produces the same number of records 
> but "ss_sc_all_map" has different sizes when run with the above 2 options.
> Is this due to the fact that there are 2 store func's used?
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-747) Logical to Physical Plan Translation fails when temporary alias are created within foreach

2010-08-31 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-747:
---

Fix Version/s: 0.9.0
   (was: 0.8.0)

> Logical to Physical Plan Translation fails when temporary alias are created 
> within foreach
> --
>
> Key: PIG-747
> URL: https://issues.apache.org/jira/browse/PIG-747
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.4.0
>Reporter: Viraj Bhat
>Assignee: Daniel Dai
> Fix For: 0.9.0
>
> Attachments: physicalplan.txt, physicalplanprob.pig, PIG-747-1.patch
>
>
> Consider a the pig script which calculates a new column F inside the foreach 
> as:
> {code}
> A = load 'physicalplan.txt' as (col1,col2,col3);
> B = foreach A {
>D = col1/col2;
>E = col3/col2;
>F = E - (D*D);
>generate
>F as newcol;
> };
> dump B;
> {code}
> This gives the following error:
> ===
> Caused by: 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogicalToPhysicalTranslatorException:
>  ERROR 2015: Invalid physical operators in the physical plan
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:377)
> at 
> org.apache.pig.impl.logicalLayer.LOMultiply.visit(LOMultiply.java:63)
> at 
> org.apache.pig.impl.logicalLayer.LOMultiply.visit(LOMultiply.java:29)
> at 
> org.apache.pig.impl.plan.DependencyOrderWalkerWOSeenChk.walk(DependencyOrderWalkerWOSeenChk.java:68)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:908)
> at 
> org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:122)
> at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:41)
> at 
> org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:68)
> at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
> at 
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:246)
> ... 10 more
> Caused by: org.apache.pig.impl.plan.PlanException: ERROR 0: Attempt to give 
> operator of type 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.Divide
>  multiple outputs.  This operator does not support multiple outputs.
> at 
> org.apache.pig.impl.plan.OperatorPlan.connect(OperatorPlan.java:158)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.plans.PhysicalPlan.connect(PhysicalPlan.java:89)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:373)
> ... 19 more
> ===
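Until the fix lands, one way to sidestep this translation failure is to avoid reusing a temporary alias (D) inside another nested expression: hoist the intermediate ratios into a preceding FOREACH so each expression operator has a single consumer. A sketch of the workaround (untested, equivalent to the failing script):

{code}
A = load 'physicalplan.txt' as (col1, col2, col3);
-- compute the ratios once, outside the nested block
A1 = foreach A generate col1/col2 as d, col3/col2 as e;
-- each of d and e now feeds exactly one downstream expression
B = foreach A1 generate e - d*d as newcol;
dump B;
{code}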




[jira] Updated: (PIG-1319) New logical optimization rules

2010-08-31 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1319:


Fix Version/s: 0.9.0
   (was: 0.8.0)

> New logical optimization rules
> --
>
> Key: PIG-1319
> URL: https://issues.apache.org/jira/browse/PIG-1319
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.9.0
>
>
> In [PIG-1178|https://issues.apache.org/jira/browse/PIG-1178], we build a new 
> logical optimization framework. One design goal for the new logical optimizer 
> is to make it easier to add new logical optimization rules. In this Jira, we 
> keep track of the development of these new logical optimization rules.




[jira] Updated: (PIG-1373) We need to add jdiff output to docs on the website

2010-08-31 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1373:


Status: Resolved  (was: Patch Available)
Resolution: Fixed

> We need to add jdiff output to docs on the website
> --
>
> Key: PIG-1373
> URL: https://issues.apache.org/jira/browse/PIG-1373
> Project: Pig
>  Issue Type: Bug
>Reporter: Alan Gates
>Assignee: Daniel Dai
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: PIG-1373-1.patch, PIG-1373-2.patch
>
>
> Our build process constructs a jdiff between APIs for different versions.  
> But we don't post the results of that to the website when we deploy the docs. 
>  We should, in order to help users understand changes across versions of pig.




[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-31 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904868#action_12904868
 ] 

Yan Zhou commented on PIG-1501:
---

To be more accurate, the default codec would be gzip if compression were 
enabled by default. Currently, the compression type has to be specified 
explicitly and takes no default value. This is so that users fully appreciate 
the pros and cons of either compression method.
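For reference, enabling intermediate-file compression explicitly might look like the following in a script (the property names follow the attached PIG-1501 patch; treat the names and the codec value as illustrative rather than authoritative):

{code}
-- opt in to compressing intermediate MR output; no codec is assumed by default
SET pig.tmpfilecompression true;
SET pig.tmpfilecompression.codec gz;
{code}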

> need to investigate the impact of compression on pig performance
> 
>
> Key: PIG-1501
> URL: https://issues.apache.org/jira/browse/PIG-1501
> Project: Pig
>  Issue Type: Test
>Reporter: Olga Natkovich
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: compress_perf_data.txt, compress_perf_data_2.txt, 
> PIG-1501.patch, PIG-1501.patch, PIG-1501.patch
>
>
> We would like to understand how compressing map results as well as 
> reducer output in a chain of MR jobs impacts performance. We can use PigMix 
> queries for this investigation.




Does Pig Re-Use FileInputLoadFuncs Objects?

2010-08-31 Thread Russell Jurney
Pardon the cross-post: Does Pig ever re-use FileInputLoadFunc objects?  We
suspect state is being retained between different stores, but we don't
actually know this.  Figured I'd ask to verify the hunch.

Our load func for our in-house format works fine with Pig scripts
normally... but I have a pig script that looks like this:

LOAD thing1
SPLIT thing1 INTO thing2, thing3
STORE thing2 INTO thing2
STORE thing3 INTO thing3

LOAD thing4
SPLIT thing4 INTO thing5, thing6
STORE thing5 INTO thing5
STORE thing6 INTO thing6


And it works via PigStorage, but not via our FileInputLoadFunc.

Russ