[jira] Updated: (PIG-1289) PIG Join fails while doing a filter on joined data
[ https://issues.apache.org/jira/browse/PIG-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1289:
----------------------------
    Resolution: Fixed
    Hadoop Flags: [Reviewed]
    Status: Resolved (was: Patch Available)

Unit test failure was due to a port conflict. Manual test successful. Patch committed.

PIG Join fails while doing a filter on joined data
--------------------------------------------------

Key: PIG-1289
URL: https://issues.apache.org/jira/browse/PIG-1289
Project: Pig
Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Karim Saadah
Assignee: Daniel Dai
Priority: Minor
Fix For: 0.7.0
Attachments: PIG-1289-1.patch, PIG-1289-2.patch

PIG Join fails while doing a filter on joined data. Here are the steps to reproduce it:

-bash-3.1$ pig -latest -x local
grunt> a = load 'first.dat' using PigStorage('\u0001') as (f1:int, f2:chararray);
grunt> DUMP a;
(1,A)
(2,B)
(3,C)
(4,D)
grunt> b = load 'second.dat' using PigStorage() as (f3:chararray);
grunt> DUMP b;
(A)
(D)
(E)
grunt> c = join a by f2 LEFT OUTER, b by f3;
grunt> DUMP c;
(1,A,A)
(2,B,)
(3,C,)
(4,D,D)
grunt> describe c;
c: {a::f1: int,a::f2: chararray,b::f3: chararray}
grunt> d = filter c by (f3 is null or f3 == '');
grunt> dump d;
2010-03-03 15:00:37,129 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No column pruned for b
2010-03-03 15:00:37,129 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No map keys pruned for b
2010-03-03 15:00:37,129 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No column pruned for a
2010-03-03 15:00:37,130 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No map keys pruned for a
2010-03-03 15:00:37,130 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1002: Unable to store alias d

This one fails too:

grunt> d = filter c by (b::f3 is null or b::f3 == '');

And this one does not return the expected results:

grunt> d = foreach c generate f1 as f1, f2 as f2, f3 as f3;
grunt> e = filter d by (f3 is null or f3 == '');
grunt> DUMP e;
(1,A,)
(2,B,)
(3,C,)
(4,D,)

while the expected result is

(2,B,)
(3,C,)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
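For reference, the semantics the report relies on can be sketched outside Pig. The plain-Java sketch below (hypothetical helper names, no Pig classes involved) mimics `join a by f2 LEFT OUTER, b by f3` followed by a `f3 is null` filter: every left-side row survives the join, the right side is null when unmatched, and the filter should keep only the unmatched rows, i.e. (2,B) and (3,C).

```java
import java.util.*;

public class LeftOuterNullFilter {
    // Mimics: c = join a by f2 LEFT OUTER, b by f3; d = filter c by f3 is null;
    // Returns the left-side rows that found no match on the right.
    public static List<String> unmatchedLeftRows(Map<Integer, String> a, Set<String> b) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, String> row : a.entrySet()) {
            // LEFT OUTER: every row of a survives; the b side is null when there is no match
            boolean matched = b.contains(row.getValue());
            if (!matched) { // the "f3 is null" filter keeps only unmatched rows
                out.add(row.getKey() + "," + row.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, String> a = new LinkedHashMap<>();
        a.put(1, "A"); a.put(2, "B"); a.put(3, "C"); a.put(4, "D");
        Set<String> b = new HashSet<>(Arrays.asList("A", "D", "E"));
        System.out.println(unmatchedLeftRows(a, b)); // expected: [2,B, 3,C]
    }
}
```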
[jira] Commented: (PIG-1258) [zebra] Number of sorted input splits is unusually high
[ https://issues.apache.org/jira/browse/PIG-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847260#action_12847260 ]

Hadoop QA commented on PIG-1258:
--------------------------------

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12438944/PIG-1258.patch
against trunk revision 925034.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 9 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/244/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/244/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/244/console

This message is automatically generated.

[zebra] Number of sorted input splits is unusually high
-------------------------------------------------------

Key: PIG-1258
URL: https://issues.apache.org/jira/browse/PIG-1258
Project: Pig
Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Yan Zhou
Attachments: PIG-1258.patch

The number of sorted input splits is unusually high if the projections are on multiple column groups, a union of tables, or column group(s) that hold many small tfiles. In one test, the number is about 100 times bigger than that from unsorted input splits on the same input tables.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1117) Pig reading hive columnar rc tables
[ https://issues.apache.org/jira/browse/PIG-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy V. Ryaboy updated PIG-1117:
-----------------------------------
    Attachment: PIG-1117-0.7.0-reviewed.patch

Minor review changes, all superficial.
- changed the spacing to conform to project conventions
- added spaces before/after curly braces where I saw them missing
- fixed spelling and occasional references to HiveRCLoader in the docs (you've renamed it to HiveColumnarLoader)
- minor tweak to get rid of one remaining deprecation warning in the RecordReader

Tests pass on my machine. Gerrit, if you are ok with these changes, I will commit.

Pig reading hive columnar rc tables
-----------------------------------

Key: PIG-1117
URL: https://issues.apache.org/jira/browse/PIG-1117
Project: Pig
Issue Type: New Feature
Affects Versions: 0.7.0
Reporter: Gerrit Jansen van Vuuren
Assignee: Gerrit Jansen van Vuuren
Fix For: 0.7.0
Attachments: HiveColumnarLoader.patch, HiveColumnarLoaderTest.patch, PIG-1117-0.7.0-new.patch, PIG-1117-0.7.0-reviewed.patch, PIG-1117.patch, PIG-117-v.0.6.0.patch, PIG-117-v.0.7.0.patch

I've coded a LoadFunc implementation that can read from Hive Columnar RC tables. This is needed for a project that I'm working on because all our data is stored using the Hive thrift-serialized Columnar RC format. I have looked at the piggybank but did not find any implementation that could do this. We've been running it on our cluster for the last week and have worked out most bugs. There are still some improvements to be done, like setting the number of mappers based on date partitioning. It's been optimized to read only specific columns, and it can churn through a data set almost 8 times faster with this improvement because not all column data is read. I would like to contribute the class to the piggybank; can you guide me in what I need to do? I've used Hive-specific classes to implement this; is it possible to add this to the piggybank build ivy for automatic download of the dependencies?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1117) Pig reading hive columnar rc tables
[ https://issues.apache.org/jira/browse/PIG-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy V. Ryaboy updated PIG-1117:
-----------------------------------
    Attachment: PIG-1117-0.7.0-reviewed.patch

Attaching again -- forgot to click the license check box. Which reminded me to check for Apache license headers in the new files, and it turns out they were missing -- so I added them. Assuming that's ok, since Gerrit granted license for the patches when he attached them to the Jira.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1258) [zebra] Number of sorted input splits is unusually high
[ https://issues.apache.org/jira/browse/PIG-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1258:
--------------------------
    Status: Open (was: Patch Available)

The test report page with the claimed failures of some core tests is not available on the web. Will resubmit.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1258) [zebra] Number of sorted input splits is unusually high
[ https://issues.apache.org/jira/browse/PIG-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1258:
--------------------------
    Status: Patch Available (was: Open)

Resubmit so Hudson will rerun.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1117) Pig reading hive columnar rc tables
[ https://issues.apache.org/jira/browse/PIG-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy V. Ryaboy updated PIG-1117:
-----------------------------------
    Resolution: Fixed
    Status: Resolved (was: Patch Available)

Patch committed. Thanks for this contribution, Gerrit! This will really help people who are working with both Hive and Pig. Now we just need a Zebra SerDe... :-)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1298) Restore file traversal behavior to Pig loaders
[ https://issues.apache.org/jira/browse/PIG-1298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1298:
------------------------------
    Status: Patch Available (was: Open)

Restore file traversal behavior to Pig loaders
----------------------------------------------

Key: PIG-1298
URL: https://issues.apache.org/jira/browse/PIG-1298
Project: Pig
Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Richard Ding
Assignee: Richard Ding
Fix For: 0.7.0
Attachments: PIG-1298.patch, PIG-1298_1.patch

Given a location, a Pig loader is expected to recursively load all the files under that location (i.e., all the files returned by the ls -R command). However, after the transition to the Hadoop 20 API, only the files returned by the ls command are loaded.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
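The "ls -R" contract the issue describes is plain recursive descent. The sketch below is illustrative only: Pig's real loaders walk HDFS through Hadoop's FileSystem API, not java.io.File, and none of these names are Pig's. It shows the difference between listing only the top level (the post-Hadoop-20 behavior being fixed) and descending into subdirectories.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class RecursiveListing {
    // Collects all regular files under dir, descending into
    // subdirectories -- the "ls -R" behavior, as opposed to a flat "ls".
    public static List<File> listRecursively(File dir) {
        List<File> files = new ArrayList<>();
        File[] children = dir.listFiles();
        if (children == null) return files; // not a directory, or unreadable
        for (File child : children) {
            if (child.isDirectory()) {
                files.addAll(listRecursively(child)); // recurse, like ls -R
            } else {
                files.add(child);
            }
        }
        return files;
    }
}
```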
[jira] Updated: (PIG-1298) Restore file traversal behavior to Pig loaders
[ https://issues.apache.org/jira/browse/PIG-1298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1298:
------------------------------
    Status: Open (was: Patch Available)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1298) Restore file traversal behavior to Pig loaders
[ https://issues.apache.org/jira/browse/PIG-1298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1298:
------------------------------
    Attachment: PIG-1298_1.patch

Fix release audit issue.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1117) Pig reading hive columnar rc tables
[ https://issues.apache.org/jira/browse/PIG-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847455#action_12847455 ]

Gerrit Jansen van Vuuren commented on PIG-1117:
-----------------------------------------------

:) Yep, I might just start on a Zebra SerDe for Hive; then we can have complete Hive-Pig harmony.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-915) Load row names in HBase loader
[ https://issues.apache.org/jira/browse/PIG-915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847467#action_12847467 ]

Olga Natkovich commented on PIG-915:
------------------------------------

Jeff, are you still planning to get this patch in for 0.7.0? We are planning to branch on Monday and need to get it in before that. Otherwise, we can postpone it till the 0.8.0 release.

Load row names in HBase loader
------------------------------

Key: PIG-915
URL: https://issues.apache.org/jira/browse/PIG-915
Project: Pig
Issue Type: Improvement
Affects Versions: 0.6.0
Reporter: Alex Newman
Assignee: Jeff Zhang
Priority: Minor
Fix For: 0.7.0
Attachments: Pig_915.Patch

Currently there is no way to get the row names when doing a query from HBase; we should probably remedy this, as important data may be stored there.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1182) Pig reference manual does not mention syntax for comments
[ https://issues.apache.org/jira/browse/PIG-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1182: Fix Version/s: (was: 0.7.0) Assignee: (was: Corinne Chandel) Pig reference manual does not mention syntax for comments - Key: PIG-1182 URL: https://issues.apache.org/jira/browse/PIG-1182 Project: Pig Issue Type: Bug Components: documentation Affects Versions: 0.5.0 Reporter: David Ciemiewicz The Pig 0.5.0 reference manual does not mention how to write comments in your pig code using -- (two dashes). http://hadoop.apache.org/pig/docs/r0.5.0/piglatin_reference.html Also, does /* */ also work? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
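To the question the report asks: recent Pig releases accept both comment styles. A small Pig Latin sketch (worth confirming against the reference manual for the release in question):

{code}
-- a single-line comment: everything after the double dash is ignored
A = LOAD 'input.txt' AS (f1:int);  -- trailing comments work too
/* a C-style block comment,
   which may span multiple lines */
DUMP A;
{code}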
[jira] Commented: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
[ https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847470#action_12847470 ] Olga Natkovich commented on PIG-1205: - Jeff, are you still planning to get this into Pig 0.7.0 by Monday or should we move this to Pig 0.8.0? Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc -- Key: PIG-1205 URL: https://issues.apache.org/jira/browse/PIG-1205 Project: Pig Issue Type: Sub-task Affects Versions: 0.7.0 Reporter: Jeff Zhang Assignee: Jeff Zhang Fix For: 0.7.0 Attachments: PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, PIG_1205_4.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
[ https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847473#action_12847473 ] Dmitriy V. Ryaboy commented on PIG-1205: fwiw -- I have an implementation for 0.6 that does most of what I outlined above; could probably port to 0.7 and make apache-friendly within the next couple of weeks. Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc -- Key: PIG-1205 URL: https://issues.apache.org/jira/browse/PIG-1205 Project: Pig Issue Type: Sub-task Affects Versions: 0.7.0 Reporter: Jeff Zhang Assignee: Jeff Zhang Fix For: 0.7.0 Attachments: PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, PIG_1205_4.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1238) Dump does not respect the schema
[ https://issues.apache.org/jira/browse/PIG-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847474#action_12847474 ]

Richard Ding commented on PIG-1238:
-----------------------------------

Hi Ankur, I ran the following script

{code}
A = LOAD '1.txt' USING PigStorage();
B = FOREACH A GENERATE ['a'#'12'] as b:map[], ['b'#['c'#'12']] as mapFields;
C = FOREACH B GENERATE (CHARARRAY) mapFields#'b'#'c' AS f1, RANDOM() AS f2;
D = ORDER C BY f2 PARALLEL 10;
E = LIMIT D 20;
F = FOREACH E GENERATE f1;
dump F;
{code}

and it returns the correct result. Can you sync again with the trunk and let me know if the problem still exists?

Dump does not respect the schema
--------------------------------

Key: PIG-1238
URL: https://issues.apache.org/jira/browse/PIG-1238
Project: Pig
Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Ankur
Assignee: Richard Ding
Fix For: 0.7.0
Attachments: PIG-1238.patch

For complex data types and certain sequences of operations, dump produces results with a non-existent field in the relation.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
Begin a discussion about Pig as a top level project
You have probably heard by now that there is a discussion going on in the Hadoop PMC as to whether a number of the subprojects (HBase, Avro, ZooKeeper, Hive, and Pig) should move out from under the Hadoop umbrella and become top-level Apache projects (TLPs). This discussion has picked up recently since the Apache board has clearly communicated to the Hadoop PMC that it is concerned that Hadoop is acting as an umbrella project with many disjoint subprojects underneath it. They are concerned that this gives Apache little insight into the health and happenings of the subproject communities, which in turn means Apache cannot properly mentor those communities.

The purpose of this email is to start a discussion within the Pig community about this topic. Let me cover first what becoming a TLP would mean for Pig, and then I'll go into what options I think we as a community have.

Becoming a TLP would mean that Pig would itself have a PMC that would report directly to the Apache board. Who would be on the PMC would be something we as a community would need to decide. Common options would be to say all active committers are on the PMC, or all active committers who have been a committer for at least a year. We would also need to elect a chair of the PMC. This lucky person would have no additional power, but would have the additional responsibility of writing quarterly reports on Pig's status for Apache board meetings, as well as coordinating with Apache to get accounts for new committers, etc. For more information see http://www.apache.org/foundation/how-it-works.html#roles

Becoming a TLP would not mean that we are ostracized from the Hadoop community. We would continue to be invited to Hadoop Summits, HUGs, etc. Since all Pig developers and users are by definition Hadoop users, we would continue to be a strong presence in the Hadoop community.

I see three ways that we as a community can respond to this:

1) Say yes, we want to be a TLP now.

2) Say yes, we want to be a TLP, but not yet. We feel we need more time to mature. If we choose this option we need to be able to clearly articulate how much time we need and what we hope to see change in that time.

3) Say no, we feel the benefits of staying with Hadoop outweigh the drawbacks of being a disjoint subproject. If we choose this, we need to be able to say exactly what those benefits are and why we feel they will be compromised by leaving the Hadoop project.

There may be other options that I haven't thought of. Please feel free to suggest any you think of.

Questions? Thoughts? Let the discussion begin.

Alan.
[jira] Commented: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
[ https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847476#action_12847476 ] Olga Natkovich commented on PIG-1205: - Sounds good. Then I will mark it for inclusion in 0.8.0. Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc -- Key: PIG-1205 URL: https://issues.apache.org/jira/browse/PIG-1205 Project: Pig Issue Type: Sub-task Affects Versions: 0.7.0 Reporter: Jeff Zhang Assignee: Jeff Zhang Fix For: 0.8.0 Attachments: PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, PIG_1205_4.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
[ https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1205: Fix Version/s: (was: 0.7.0) 0.8.0 Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc -- Key: PIG-1205 URL: https://issues.apache.org/jira/browse/PIG-1205 Project: Pig Issue Type: Sub-task Affects Versions: 0.7.0 Reporter: Jeff Zhang Assignee: Jeff Zhang Fix For: 0.8.0 Attachments: PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, PIG_1205_4.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1285) Allow SingleTupleBag to be serialized
[ https://issues.apache.org/jira/browse/PIG-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847481#action_12847481 ]

Olga Natkovich commented on PIG-1285:
-------------------------------------

Dmitriy, are you still planning to get this in before Monday or should we move it to Pig 0.8.0?

Allow SingleTupleBag to be serialized
-------------------------------------

Key: PIG-1285
URL: https://issues.apache.org/jira/browse/PIG-1285
Project: Pig
Issue Type: Improvement
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
Fix For: 0.7.0
Attachments: PIG-1285.patch

Currently, Pig uses a SingleTupleBag for efficiency when a full-blown spillable bag implementation is not needed in the Combiner optimization. Unfortunately this can create problems. The Initial.exec() code below fails at run-time with the message that a SingleTupleBag cannot be serialized:

{code}
@Override
public Tuple exec(Tuple in) throws IOException {
    // single record. just copy.
    if (in == null) return null;
    try {
        Tuple resTuple = tupleFactory_.newTuple(in.size());
        for (int i = 0; i < in.size(); i++) {
            resTuple.set(i, in.get(i));
        }
        return resTuple;
    } catch (IOException e) {
        log.warn(e);
        return null;
    }
}
{code}

The code below can fix the problem in the UDF, but it seems like something that should be handled transparently, not requiring UDF authors to know about SingleTupleBags.

{code}
@Override
public Tuple exec(Tuple in) throws IOException {
    // single record. just copy.
    if (in == null) return null;
    /*
     * Unfortunately SingleTupleBags are not serializable. We cache whether a given index contains a bag
     * in the map below, and copy all bags into DefaultBags before returning to avoid serialization exceptions.
     */
    Map<Integer, Boolean> isBagAtIndex = Maps.newHashMap();
    try {
        Tuple resTuple = tupleFactory_.newTuple(in.size());
        for (int i = 0; i < in.size(); i++) {
            Object obj = in.get(i);
            if (!isBagAtIndex.containsKey(i)) {
                isBagAtIndex.put(i, obj instanceof SingleTupleBag);
            }
            if (isBagAtIndex.get(i)) {
                DataBag newBag = bagFactory_.newDefaultBag();
                newBag.addAll((DataBag) obj);
                obj = newBag;
            }
            resTuple.set(i, obj);
        }
        return resTuple;
    } catch (IOException e) {
        log.warn(e);
        return null;
    }
}
{code}

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
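A serialization fix of the kind the UDF workaround above sidesteps -- letting the single-tuple bag write and read its own fields -- can be sketched in isolation. This is illustrative only: the String payload stands in for a Pig Tuple, and none of these class or method names are Pig's.

```java
import java.io.*;

public class SingleElementBag {
    private final String element; // stand-in for the single Tuple

    public SingleElementBag(String element) { this.element = element; }

    public String get() { return element; }

    public void write(DataOutput out) throws IOException {
        out.writeLong(1L);     // bag size header: always exactly one element
        out.writeUTF(element);
    }

    public static SingleElementBag readFields(DataInput in) throws IOException {
        long size = in.readLong();
        if (size != 1L) throw new IOException("expected exactly one element, got " + size);
        return new SingleElementBag(in.readUTF());
    }

    // Serialize to bytes and deserialize back, as a shuffle boundary would.
    public static SingleElementBag roundTrip(SingleElementBag bag) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            bag.write(new DataOutputStream(bytes));
            return readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```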
[jira] Assigned: (PIG-1308) Infinite loop in JobClient when reading from BinStorage Message: [org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2]
[ https://issues.apache.org/jira/browse/PIG-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich reassigned PIG-1308:
-----------------------------------
    Assignee: Pradeep Kamath

Infinite loop in JobClient when reading from BinStorage Message: [org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2]

Key: PIG-1308
URL: https://issues.apache.org/jira/browse/PIG-1308
Project: Pig
Issue Type: Bug
Reporter: Viraj Bhat
Assignee: Pradeep Kamath
Fix For: 0.7.0

A simple script fails to read files from BinStorage() and fails to submit jobs to the JobTracker. This occurs with trunk and not with the Pig 0.6 branch.

{code}
data = load 'binstoragesample' using BinStorage() as (s, m, l);
A = foreach data generate s#'key' as value;
X = limit A 20;
dump X;
{code}

When this script is submitted to the JobTracker, we found the following error:

2010-03-18 22:31:22,296 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2
2010-03-18 22:32:01,574 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2
2010-03-18 22:32:43,276 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2
2010-03-18 22:33:21,743 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2
2010-03-18 22:34:02,004 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2
2010-03-18 22:34:43,442 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2
2010-03-18 22:35:25,907 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2
2010-03-18 22:36:07,402 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2
2010-03-18 22:36:48,596 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2
2010-03-18 22:37:28,014 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2
2010-03-18 22:38:04,823 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2
2010-03-18 22:38:38,981 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2
2010-03-18 22:39:12,220 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2

The stack trace revealed:

at org.apache.pig.impl.io.ReadToEndLoader.init(ReadToEndLoader.java:144)
at org.apache.pig.impl.io.ReadToEndLoader.init(ReadToEndLoader.java:115)
at org.apache.pig.builtin.BinStorage.getSchema(BinStorage.java:404)
at org.apache.pig.impl.logicalLayer.LOLoad.determineSchema(LOLoad.java:167)
at org.apache.pig.impl.logicalLayer.LOLoad.getProjectionMap(LOLoad.java:263)
at org.apache.pig.impl.logicalLayer.ProjectionMapCalculator.visit(ProjectionMapCalculator.java:112)
at org.apache.pig.impl.logicalLayer.LOLoad.visit(LOLoad.java:210)
at org.apache.pig.impl.logicalLayer.LOLoad.visit(LOLoad.java:52)
at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69)
at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
at org.apache.pig.impl.logicalLayer.optimizer.LogicalTransformer.rebuildProjectionMaps(LogicalTransformer.java:76)
at org.apache.pig.impl.logicalLayer.optimizer.LogicalOptimizer.optimize(LogicalOptimizer.java:216)
at org.apache.pig.PigServer.compileLp(PigServer.java:883)
at org.apache.pig.PigServer.store(PigServer.java:564)

The BinStorage data was generated from 2 datasets using limit and union:

{code}
Large1 = load 'input1' using PigStorage();
Large2 = load 'input2' using PigStorage();
V = limit Large1 1;
C = limit Large2 1;
U = union V, C;
store U into 'binstoragesample' using BinStorage();
{code}

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1285) Allow SingleTupleBag to be serialized
[ https://issues.apache.org/jira/browse/PIG-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847507#action_12847507 ] Dmitriy V. Ryaboy commented on PIG-1285: Yeah I'll post it over the weekend. Just to make sure -- Pradeep, you would be ok then if I just copied the writeFields and readFields out of DefaultAbstractBag into SingleTupleBag? Allow SingleTupleBag to be serialized - Key: PIG-1285 URL: https://issues.apache.org/jira/browse/PIG-1285 Project: Pig Issue Type: Improvement Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.7.0 Attachments: PIG-1285.patch Currently, Pig uses a SingleTupleBag for efficiency when a full-blown spillable bag implementation is not needed in the Combiner optimization. Unfortunately this can create problems. The below Initial.exec() code fails at run-time with the message that a SingleTupleBag cannot be serialized:
{code}
@Override
public Tuple exec(Tuple in) throws IOException {
    // single record. just copy.
    if (in == null) return null;
    try {
        Tuple resTuple = tupleFactory_.newTuple(in.size());
        for (int i = 0; i < in.size(); i++) {
            resTuple.set(i, in.get(i));
        }
        return resTuple;
    } catch (IOException e) {
        log.warn(e);
        return null;
    }
}
{code}
The code below can fix the problem in the UDF, but it seems like something that should be handled transparently, not requiring UDF authors to know about SingleTupleBags.
{code}
@Override
public Tuple exec(Tuple in) throws IOException {
    // single record. just copy.
    if (in == null) return null;
    /*
     * Unfortunately SingleTupleBags are not serializable. We cache whether a given index contains a bag
     * in the map below, and copy all bags into DefaultBags before returning to avoid serialization exceptions.
     */
    Map<Integer, Boolean> isBagAtIndex = Maps.newHashMap();
    try {
        Tuple resTuple = tupleFactory_.newTuple(in.size());
        for (int i = 0; i < in.size(); i++) {
            Object obj = in.get(i);
            if (!isBagAtIndex.containsKey(i)) {
                isBagAtIndex.put(i, obj instanceof SingleTupleBag);
            }
            if (isBagAtIndex.get(i)) {
                DataBag newBag = bagFactory_.newDefaultBag();
                newBag.addAll((DataBag) obj);
                obj = newBag;
            }
            resTuple.set(i, obj);
        }
        return resTuple;
    } catch (IOException e) {
        log.warn(e);
        return null;
    }
}
{code}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
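As a plain-Java illustration of why the workaround above helps: copying the contents of a non-serializable container into an ArrayList yields something Java serialization accepts. `SingleItemBag` below is a hypothetical stand-in, not Pig's actual SingleTupleBag or DataBag classes.

```java
import java.io.*;
import java.util.*;

// Hypothetical stand-in for Pig's SingleTupleBag: iterable but not Serializable.
class SingleItemBag implements Iterable<String> {
    private final String item;
    SingleItemBag(String item) { this.item = item; }
    public Iterator<String> iterator() {
        return Collections.singletonList(item).iterator();
    }
}

public class CopyBeforeSerialize {
    // Copy a possibly non-serializable bag into a serializable container
    // before handing it to Java serialization.
    static ArrayList<String> toSerializable(Iterable<String> bag) {
        ArrayList<String> copy = new ArrayList<>();
        for (String s : bag) copy.add(s);
        return copy;
    }

    public static void main(String[] args) throws Exception {
        ArrayList<String> safe = toSerializable(new SingleItemBag("hello"));
        // Round-trip through Java serialization to show the copy survives;
        // serializing the SingleItemBag itself would throw NotSerializableException.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(safe);
        oos.close();
        ObjectInputStream in =
            new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()));
        @SuppressWarnings("unchecked")
        ArrayList<String> back = (ArrayList<String>) in.readObject();
        System.out.println(back.get(0));
    }
}
```

The fix proposed in the comment (serializing SingleTupleBag directly) removes the need for this per-UDF copy.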
[jira] Created: (PIG-1309) Map-side Cogroup
Map-side Cogroup Key: PIG-1309 URL: https://issues.apache.org/jira/browse/PIG-1309 Project: Pig Issue Type: Bug Components: impl Reporter: Ashutosh Chauhan Assignee: Ashutosh Chauhan Attachments: mapsideCogrp.patch In the never-ending quest to make Pig go faster, we want to parallelize as many relational operations as possible. It's already possible to do Group-by (PIG-984) and Joins (PIG-845, PIG-554) purely map-side in Pig. This jira is to add a map-side implementation of Cogroup in Pig. Details to follow. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1309) Map-side Cogroup
[ https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1309: -- Attachment: mapsideCogrp.patch Preliminary patch to discuss the approach. Not ready for inclusion yet. Map-side Cogroup Key: PIG-1309 URL: https://issues.apache.org/jira/browse/PIG-1309 Project: Pig Issue Type: Bug Components: impl Reporter: Ashutosh Chauhan Assignee: Ashutosh Chauhan Attachments: mapsideCogrp.patch In the never-ending quest to make Pig go faster, we want to parallelize as many relational operations as possible. It's already possible to do Group-by (PIG-984) and Joins (PIG-845, PIG-554) purely map-side in Pig. This jira is to add a map-side implementation of Cogroup in Pig. Details to follow. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1253) [zebra] make map/reduce test cases run on real cluster
[ https://issues.apache.org/jira/browse/PIG-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847521#action_12847521 ] Yan Zhou commented on PIG-1253: --- +1 on PIG-1253-0.6.patch that is committed to the 0.6 branch. [zebra] make map/reduce test cases run on real cluster -- Key: PIG-1253 URL: https://issues.apache.org/jira/browse/PIG-1253 Project: Pig Issue Type: Task Affects Versions: 0.6.0 Reporter: Chao Wang Assignee: Chao Wang Fix For: 0.7.0 Attachments: PIG-1253-0.6.patch, PIG-1253.patch, PIG-1253.patch The goal of this task is to make map/reduce test cases run on real cluster. Currently map/reduce test cases are mostly tested under local mode. When running on real cluster, all involved jars have to be manually deployed in advance which is not desired. The major change here is to support -libjars option to be able to ship user jars to backend automatically. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1258) [zebra] Number of sorted input splits is unusually high
[ https://issues.apache.org/jira/browse/PIG-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847560#action_12847560 ] Gaurav Jain commented on PIG-1258: -- +1 [zebra] Number of sorted input splits is unusually high --- Key: PIG-1258 URL: https://issues.apache.org/jira/browse/PIG-1258 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Yan Zhou Attachments: PIG-1258.patch Number of sorted input splits is unusually high if the projections are on multiple column groups, or a union of tables, or column group(s) that hold many small tfiles. In one test, the number is about 100 times bigger than that from unsorted input splits on the same input tables. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1307) when we spill the DefaultDataBag we are not setting the sized changed flag to be true.
[ https://issues.apache.org/jira/browse/PIG-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847587#action_12847587 ] Daniel Dai commented on PIG-1307: - Hi, Ben, Is the patch ready? Do you need help to add some test cases? when we spill the DefaultDataBag we are not setting the sized changed flag to be true. -- Key: PIG-1307 URL: https://issues.apache.org/jira/browse/PIG-1307 Project: Pig Issue Type: Bug Reporter: Benjamin Reed Assignee: Benjamin Reed Fix For: 0.7.0 Attachments: PIG-1307.patch pig uses a size changed flag to indicate when we should recalculate the memory footprint of the bag. the setting of this flag is sprinkled throughout the code. unfortunately, it is missing in DefaultDataBag.spill(). there may be other cases as well. the problem with this case is that when the low memory threshold kicks in, bags are spilled until the desired amount of memory is freed. since the flag is not being reset subsequent calls to the threshold events will retrigger the spill() and think more memory was freed even though nothing was actually spilled. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1309) Map-side Cogroup
[ https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847589#action_12847589 ] Alan Gates commented on PIG-1309: - Here's a write up on the design behind this: http://wiki.apache.org/pig/MapSideCogroup Map-side Cogroup Key: PIG-1309 URL: https://issues.apache.org/jira/browse/PIG-1309 Project: Pig Issue Type: Bug Components: impl Reporter: Ashutosh Chauhan Assignee: Ashutosh Chauhan Attachments: mapsideCogrp.patch In the never-ending quest to make Pig go faster, we want to parallelize as many relational operations as possible. It's already possible to do Group-by (PIG-984) and Joins (PIG-845, PIG-554) purely map-side in Pig. This jira is to add a map-side implementation of Cogroup in Pig. Details to follow. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847594#action_12847594 ] Allen Wittenauer commented on PIG-794: -- What is the latest on getting Avro support in pig? Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1310) ISO Date UDFs: Conversion, Rounding and Date Math
ISO Date UDFs: Conversion, Rounding and Date Math - Key: PIG-1310 URL: https://issues.apache.org/jira/browse/PIG-1310 Project: Pig Issue Type: New Feature Components: impl Reporter: Russell Jurney Fix For: 0.7.0 I've written UDFs to handle loading unix times, datemonth values and ISO 8601 formatted date strings, and working with them as ISO datetimes using jodatime. The working code is here: http://github.com/rjurney/oink/tree/master/src/java/oink/udf/isodate/ It needs to be documented and tests added, and a couple UDFs are missing, but these work if you REGISTER the jodatime jar in your script. Hopefully I can get this stuff in piggybank before someone else writes it this time :) The rounding also may not be performant, but the code works. Ultimately I'd also like to enable support for ISO 8601 durations. Someone slap me if this isn't done soon, it is not much work and this should help everyone working with time series. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
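For reference, the conversions the issue describes (unix time to an ISO 8601 string, and rounding down to a coarser unit) can be sketched with the JDK's java.time instead of the joda-time jar the linked code registers. The method names below are illustrative, not the ones in the linked repository.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.TemporalAdjusters;

public class IsoDateDemo {
    // Convert a unix timestamp (seconds) to an ISO 8601 string in UTC.
    static String toIso(long unixSeconds) {
        return Instant.ofEpochSecond(unixSeconds)
                      .atZone(ZoneOffset.UTC)
                      .format(DateTimeFormatter.ISO_OFFSET_DATE_TIME);
    }

    // Round a unix timestamp down to the first day of its month (UTC),
    // one example of the "rounding" the UDFs perform.
    static String roundToMonth(long unixSeconds) {
        return Instant.ofEpochSecond(unixSeconds)
                      .atZone(ZoneOffset.UTC)
                      .with(TemporalAdjusters.firstDayOfMonth())
                      .toLocalDate()
                      .toString();
    }

    public static void main(String[] args) {
        long ts = 1269253800L; // 2010-03-22 10:30:00 UTC
        System.out.println(toIso(ts));
        System.out.println(roundToMonth(ts));
    }
}
```

In a real Pig UDF these helpers would sit inside an EvalFunc and pull the timestamp out of the input Tuple.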
[jira] Updated: (PIG-1310) ISO Date UDFs: Conversion, Rounding and Date Math
[ https://issues.apache.org/jira/browse/PIG-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1310: Fix Version/s: (was: 0.7.0) 0.8.0 I think this will be very useful for many users! We are freezing 0.7.0 on Monday so moving this to 0.8.0 release. ISO Date UDFs: Conversion, Rounding and Date Math - Key: PIG-1310 URL: https://issues.apache.org/jira/browse/PIG-1310 Project: Pig Issue Type: New Feature Components: impl Reporter: Russell Jurney Fix For: 0.8.0 Original Estimate: 168h Remaining Estimate: 168h I've written UDFs to handle loading unix times, datemonth values and ISO 8601 formatted date strings, and working with them as ISO datetimes using jodatime. The working code is here: http://github.com/rjurney/oink/tree/master/src/java/oink/udf/isodate/ It needs to be documented and tests added, and a couple UDFs are missing, but these work if you REGISTER the jodatime jar in your script. Hopefully I can get this stuff in piggybank before someone else writes it this time :) The rounding also may not be performant, but the code works. Ultimately I'd also like to enable support for ISO 8601 durations. Someone slap me if this isn't done soon, it is not much work and this should help everyone working with time series. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1150) VAR() Variance UDF
[ https://issues.apache.org/jira/browse/PIG-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847600#action_12847600 ] Russell Jurney commented on PIG-1150: - Yes, this sounds like the thing to do :) On Tue, Mar 16, 2010 at 5:29 PM, Dmitriy V. Ryaboy (JIRA) VAR() Variance UDF -- Key: PIG-1150 URL: https://issues.apache.org/jira/browse/PIG-1150 Project: Pig Issue Type: New Feature Affects Versions: 0.5.0 Environment: UDF, written in Pig 0.5 contrib/ Reporter: Russell Jurney Fix For: 0.7.0 Attachments: var.patch I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates variance in a distributed manner, based on the AVG() builtin. It works by calculating the count, sum and sum of squares, as described here: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm Is this a worthwhile contribution? Taking the square root of this value using the contrib SQRT() function gives Standard Deviation, which is missing from Pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
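The parallel variance algorithm the issue links combines per-partition (count, sum, sum-of-squares) triples, which is what makes the UDF Algebraic: partials from each map task can be merged in any order. A minimal plain-Java sketch (not the attached var.patch; population variance, for brevity):

```java
public class ParallelVariance {
    // Partial state for one partition: count, sum, and sum of squares.
    static final class Partial {
        long n; double sum; double sumSq;
        void add(double x) { n++; sum += x; sumSq += x * x; }
        // Combine two partials -- the associative "combiner" step.
        Partial merge(Partial o) {
            Partial m = new Partial();
            m.n = n + o.n; m.sum = sum + o.sum; m.sumSq = sumSq + o.sumSq;
            return m;
        }
        // Population variance: E[X^2] - (E[X])^2.
        double variance() {
            double mean = sum / n;
            return sumSq / n - mean * mean;
        }
    }

    public static void main(String[] args) {
        // Two "partitions" of the data 1..5.
        double[] left = {1, 2, 3}, right = {4, 5};
        Partial a = new Partial(), b = new Partial();
        for (double x : left) a.add(x);
        for (double x : right) b.add(x);
        System.out.println(a.merge(b).variance());  // variance of 1..5 is 2.0
    }
}
```

Taking Math.sqrt of the result gives the standard deviation mentioned in the description.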
[jira] Commented: (PIG-1309) Map-side Cogroup
[ https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847604#action_12847604 ] Alan Gates commented on PIG-1309: - Comments: A liberal dose of comments would help greatly in understanding what the various helper methods are doing. You use LocalRearrange to split the keys and values. What's the overhead of that? Would it be more efficient to factor the key splitting code out of LR and share it between LR and here? I don't understand the need for pullTuplesFromSideLoaders(). In setup() you put one tuple from each input into the heap. Then you pull from the heap until you see a key change. But I don't understand the next step. At key change you call pullTuplesFromSideLoaders(). But if you've been adding into the heap as you pull tuples, there's no need to pull anything from the side loaders at this point. All you should need to do is package up the bags you've built and return them as your tuple. Also, it appears you're using pullTuplesFromSideLoaders() to fill the heap. You shouldn't be pulling all tuples for a current key from side loaders, as you're likely to miss tuples with keys that are in the side loaders but not in the main loader. The algorithm should be that as you pull a tuple from the heap, you place the next tuple from that same stream into the heap. The heap will guarantee that your tuples come out in order. Map-side Cogroup Key: PIG-1309 URL: https://issues.apache.org/jira/browse/PIG-1309 Project: Pig Issue Type: Bug Components: impl Reporter: Ashutosh Chauhan Assignee: Ashutosh Chauhan Attachments: mapsideCogrp.patch In the never-ending quest to make Pig go faster, we want to parallelize as many relational operations as possible. It's already possible to do Group-by (PIG-984) and Joins (PIG-845, PIG-554) purely map-side in Pig. This jira is to add a map-side implementation of Cogroup in Pig. Details to follow. -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online.
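The heap invariant Alan describes -- whenever you pull a tuple from the heap, push the next tuple from that same stream -- is an ordinary k-way merge. A self-contained sketch over sorted integer lists (plain Java, not Pig's tuple types; `Cursor` is a hypothetical helper standing in for a side loader):

```java
import java.util.*;

public class HeapMerge {
    // One sorted input stream plus the tuple currently at its head.
    static final class Cursor {
        final Iterator<Integer> it;
        int head;
        Cursor(Iterator<Integer> it) { this.it = it; head = it.next(); }
    }

    // k-way merge: each time a value is pulled from the heap, the next
    // value from that same stream is pushed, so output stays key-ordered
    // without ever draining a whole key from any one stream.
    static List<Integer> merge(List<List<Integer>> streams) {
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>(Comparator.comparingInt((Cursor c) -> c.head));
        for (List<Integer> s : streams)
            if (!s.isEmpty()) heap.add(new Cursor(s.iterator()));
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            out.add(c.head);
            if (c.it.hasNext()) { c.head = c.it.next(); heap.add(c); }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(merge(Arrays.asList(
            Arrays.asList(1, 4, 7), Arrays.asList(2, 5), Arrays.asList(3, 6))));
    }
}
```

A map-side cogroup would additionally accumulate consecutive equal keys into per-input bags before emitting an output tuple.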
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847606#action_12847606 ] Alan Gates commented on PIG-794: It depends on what you mean by support. As far as Pig using Avro for serialization between Map and Reduce and MR jobs, we haven't done anything on that front lately. Last time we tested, the performance was comparable to our own BinStorage, so we weren't motivated to move yet. Now that Avro has matured a bit, maybe we should test again. As far as using Avro to store user data, with Pig 0.7 it should become quite easy to write Avro load and store functions. Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847607#action_12847607 ] Jeff Hammerbacher commented on PIG-794: --- bq. Last time we tested the performance was comparable to our own BinStorage so we weren't motivated to move yet. Hey Alan, There should be benefits to using Avro besides just performance. Either way, looking forward to seeing you on the Avro lists when you decide to test again! Regards, Jeff Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1150) VAR() Variance UDF
[ https://issues.apache.org/jira/browse/PIG-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1150: --- Fix Version/s: (was: 0.7.0) 0.8.0 Changed the target to 0.8 -- we won't have time to finish splitting this out by Monday. VAR() Variance UDF -- Key: PIG-1150 URL: https://issues.apache.org/jira/browse/PIG-1150 Project: Pig Issue Type: New Feature Affects Versions: 0.5.0 Environment: UDF, written in Pig 0.5 contrib/ Reporter: Russell Jurney Fix For: 0.8.0 Attachments: var.patch I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates variance in a distributed manner, based on the AVG() builtin. It works by calculating the count, sum and sum of squares, as described here: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm Is this a worthwhile contribution? Taking the square root of this value using the contrib SQRT() function gives Standard Deviation, which is missing from Pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847613#action_12847613 ] Alan Gates commented on PIG-794: Jeff, Beyond performance what do you see as the big wins of using Avro? I'm just thinking here of moving data between MR jobs in a Pig script and between Map and Reduce phases. I see lots of advantages to users using Avro to store their data. Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847614#action_12847614 ] Dmitriy V. Ryaboy commented on PIG-794: --- I'll take a crack at it. Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Assignee: Dmitriy V. Ryaboy Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy reassigned PIG-794: - Assignee: Dmitriy V. Ryaboy Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Assignee: Dmitriy V. Ryaboy Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-1150) VAR() Variance UDF
[ https://issues.apache.org/jira/browse/PIG-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy reassigned PIG-1150: -- Assignee: Dmitriy V. Ryaboy VAR() Variance UDF -- Key: PIG-1150 URL: https://issues.apache.org/jira/browse/PIG-1150 Project: Pig Issue Type: New Feature Affects Versions: 0.5.0 Environment: UDF, written in Pig 0.5 contrib/ Reporter: Russell Jurney Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: var.patch I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates variance in a distributed manner, based on the AVG() builtin. It works by calculating the count, sum and sum of squares, as described here: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm Is this a worthwhile contribution? Taking the square root of this value using the contrib SQRT() function gives Standard Deviation, which is missing from Pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1298) Restore file traversal behavior to Pig loaders
[ https://issues.apache.org/jira/browse/PIG-1298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847630#action_12847630 ] Ashutosh Chauhan commented on PIG-1298: --- +1 Restore file traversal behavior to Pig loaders -- Key: PIG-1298 URL: https://issues.apache.org/jira/browse/PIG-1298 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Richard Ding Assignee: Richard Ding Fix For: 0.7.0 Attachments: PIG-1298.patch, PIG-1298_1.patch Given a location to a Pig loader, it is expected to recursively load all the files under the location (i.e., all the files returned with ls -R command). However, after the transition to using Hadoop 20 API, only files returned with ls command are loaded. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
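The intended semantics -- everything `ls -R` would return under the location, not just the top level a flat `ls` shows -- can be illustrated in plain Java with NIO. This is an analogue of the expected behavior, not the actual Pig/Hadoop traversal code in the patch.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

public class RecursiveListing {
    // Return every regular file under root, recursively (the `ls -R`
    // behavior), rather than only root's immediate children (`ls`).
    static List<Path> listRecursively(Path root) throws IOException {
        try (Stream<Path> s = Files.walk(root)) {
            return s.filter(Files::isRegularFile)
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a small tree: top.dat at the root, nested.dat one level down.
        Path root = Files.createTempDirectory("walkdemo");
        Files.createDirectories(root.resolve("part-1"));
        Files.createFile(root.resolve("top.dat"));
        Files.createFile(root.resolve("part-1/nested.dat"));
        for (Path p : listRecursively(root))
            System.out.println(root.relativize(p));
    }
}
```

A non-recursive listing of the same tree would report only top.dat and miss the nested file, which is the regression the issue describes.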
[jira] Commented: (PIG-1307) when we spill the DefaultDataBag we are not setting the sized changed flag to be true.
[ https://issues.apache.org/jira/browse/PIG-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847631#action_12847631 ] Benjamin Reed commented on PIG-1307: I don't have test cases, since they would be orders of magnitude more difficult to write than the patch and may not reproduce the problem across different machine configurations. when we spill the DefaultDataBag we are not setting the sized changed flag to be true. -- Key: PIG-1307 URL: https://issues.apache.org/jira/browse/PIG-1307 Project: Pig Issue Type: Bug Reporter: Benjamin Reed Assignee: Benjamin Reed Fix For: 0.7.0 Attachments: PIG-1307.patch pig uses a size changed flag to indicate when we should recalculate the memory footprint of the bag. the setting of this flag is sprinkled throughout the code. unfortunately, it is missing in DefaultDataBag.spill(). there may be other cases as well. the problem with this case is that when the low memory threshold kicks in, bags are spilled until the desired amount of memory is freed. since the flag is not being reset, subsequent calls to the threshold events will retrigger the spill() and think more memory was freed even though nothing was actually spilled. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1311) Pig interfaces should be clearly classified in terms of scope and stability
[ https://issues.apache.org/jira/browse/PIG-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847632#action_12847632 ] Alan Gates commented on PIG-1311: - Hadoop has a proposal on how to approach this in HADOOP-5073. I propose we use the same nomenclature. Java interfaces would be marked via annotations (provided by Hadoop commons). For other interfaces we would need to provide version-specific documents (that is, in Forrest, not on the wiki) that detail scope and stability for each interface. Pig interfaces should be clearly classified in terms of scope and stability --- Key: PIG-1311 URL: https://issues.apache.org/jira/browse/PIG-1311 Project: Pig Issue Type: Improvement Reporter: Alan Gates Assignee: Alan Gates Clearly marking Pig interfaces (Java interfaces but also things like config files, CLIs, Pig Latin syntax and semantics, etc.) to show scope (public/private) and stability (stable/evolving/unstable) will help users understand how to interact with Pig and developers to understand what things they can and cannot change. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
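The HADOOP-5073 approach boils down to runtime-retained marker annotations that tools and readers can inspect. A minimal sketch of the idea; Hadoop's real annotations live in the org.apache.hadoop.classification package (InterfaceAudience, InterfaceStability), so the annotation types defined here are stand-ins only.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

public class InterfaceClassificationDemo {
    // Stand-in audience/stability markers, retained at runtime so tools
    // (doc generators, compatibility checkers) can read them reflectively.
    @Retention(RetentionPolicy.RUNTIME) @interface Public {}
    @Retention(RetentionPolicy.RUNTIME) @interface Evolving {}

    // A public-facing but still-evolving API would be marked like this.
    @Public @Evolving
    static class LoadFuncExampleApi {}

    public static void main(String[] args) {
        Class<?> c = LoadFuncExampleApi.class;
        System.out.println("public=" + c.isAnnotationPresent(Public.class)
            + " evolving=" + c.isAnnotationPresent(Evolving.class));
    }
}
```

Non-Java interfaces (config files, CLIs, Pig Latin syntax) cannot carry annotations, which is why the comment proposes versioned documents for those instead.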
[jira] Commented: (PIG-1307) when we spill the DefaultDataBag we are not setting the sized changed flag to be true.
[ https://issues.apache.org/jira/browse/PIG-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847636#action_12847636 ] Daniel Dai commented on PIG-1307: - +1. Will commit it shortly. when we spill the DefaultDataBag we are not setting the sized changed flag to be true. -- Key: PIG-1307 URL: https://issues.apache.org/jira/browse/PIG-1307 Project: Pig Issue Type: Bug Reporter: Benjamin Reed Assignee: Benjamin Reed Fix For: 0.7.0 Attachments: PIG-1307.patch pig uses a size changed flag to indicate when we should recalculate the memory footprint of the bag. the setting of this flag is sprinkled throughout the code. unfortunately, it is missing in DefaultDataBag.spill(). there may be other cases as well. the problem with this case is that when the low memory threshold kicks in, bags are spilled until the desired amount of memory is freed. since the flag is not being reset subsequent calls to the threshold events will retrigger the spill() and think more memory was freed even though nothing was actually spilled. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1307) when we spill the DefaultDataBag we are not setting the sized changed flag to be true.
[ https://issues.apache.org/jira/browse/PIG-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1307: Resolution: Fixed Hadoop Flags: [Reviewed] Status: Resolved (was: Patch Available) Performed a manual test, and it works. Patch committed. Thanks Ben! when we spill the DefaultDataBag we are not setting the sized changed flag to be true. -- Key: PIG-1307 URL: https://issues.apache.org/jira/browse/PIG-1307 Project: Pig Issue Type: Bug Reporter: Benjamin Reed Assignee: Benjamin Reed Fix For: 0.7.0 Attachments: PIG-1307.patch pig uses a size changed flag to indicate when we should recalculate the memory footprint of the bag. the setting of this flag is sprinkled throughout the code. unfortunately, it is missing in DefaultDataBag.spill(). there may be other cases as well. the problem with this case is that when the low memory threshold kicks in, bags are spilled until the desired amount of memory is freed. since the flag is not being reset subsequent calls to the threshold events will retrigger the spill() and think more memory was freed even though nothing was actually spilled. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
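The fix described in PIG-1307 can be illustrated with a toy bag that caches its memory estimate behind a size-changed flag. This is a hypothetical stand-in, not Pig's actual DefaultDataBag, and the byte accounting is invented for the example; the point is the flagged line in spill().

```java
import java.util.ArrayList;
import java.util.List;

public class SpillFlagDemo {
    static class Bag {
        private final List<long[]> tuples = new ArrayList<>();
        private boolean sizeChanged = false;
        private long cachedSize = 0;

        void add(long[] t) { tuples.add(t); sizeChanged = true; }

        // The fix: spilling changes the in-memory footprint, so the flag
        // must be set here too. Without it, later getMemorySize() calls
        // keep returning the pre-spill size, and the low-memory handler
        // believes memory was freed when nothing more was spilled.
        long spill() {
            long freed = getMemorySize();
            tuples.clear();          // pretend the tuples went to disk
            sizeChanged = true;      // <-- the line the patch adds
            return freed;
        }

        long getMemorySize() {
            if (sizeChanged) {       // recalculate only when flagged
                long total = 0;
                for (long[] t : tuples) total += 16 + 8L * t.length;
                cachedSize = total;
                sizeChanged = false;
            }
            return cachedSize;
        }
    }

    public static void main(String[] args) {
        Bag bag = new Bag();
        bag.add(new long[]{1, 2, 3});
        long before = bag.getMemorySize();
        bag.spill();
        System.out.println(before + " -> " + bag.getMemorySize());
    }
}
```

With the flag line removed, the second getMemorySize() would still report the stale pre-spill estimate, which is exactly the retriggering behavior the issue describes.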
[jira] Created: (PIG-1312) Make Pig work with hadoop security
Make Pig work with hadoop security -- Key: PIG-1312 URL: https://issues.apache.org/jira/browse/PIG-1312 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.7.0 In order to make Pig work with hadoop security, we need to set mapreduce.job.credentials.binary in the JobConf before we call getSplit() in the backend. We need to change code in merge join. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1312) Make Pig work with hadoop security
[ https://issues.apache.org/jira/browse/PIG-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1312: Status: Patch Available (was: Open) Make Pig work with hadoop security -- Key: PIG-1312 URL: https://issues.apache.org/jira/browse/PIG-1312 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.7.0 Attachments: PIG-1312-1.patch In order to make Pig work with hadoop security, we need to set mapreduce.job.credentials.binary in the JobConf before we call getSplit() in the backend. We need to change code in merge join. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1312) Make Pig work with hadoop security
[ https://issues.apache.org/jira/browse/PIG-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1312: Attachment: PIG-1312-1.patch Make Pig work with hadoop security -- Key: PIG-1312 URL: https://issues.apache.org/jira/browse/PIG-1312 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.7.0 Attachments: PIG-1312-1.patch In order to make Pig work with hadoop security, we need to set mapreduce.job.credentials.binary in the JobConf before we call getSplit() in the backend. We need to change code in merge join. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.