[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-03-30 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851736#action_12851736
 ] 

Daniel Dai commented on PIG-1295:
-

Thanks Gianmarco,
My suggestion is to divide it into two steps:
1. Make the binary comparator work
2. Integrate it into the current Pig code

It is better to make sure we have a quality deliverable for step 1 before we 
move on to step 2.

> Binary comparator for secondary sort
> 
>
> Key: PIG-1295
> URL: https://issues.apache.org/jira/browse/PIG-1295
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
>
> When the Hadoop framework does the sorting, it will try to use a binary 
> version of the comparator if one is available. The benefit of a binary 
> comparator is that we do not need to instantiate objects before we compare 
> them. We saw a ~30% speedup after switching to a binary comparator. 
> Currently, Pig uses a binary comparator in the following cases:
> 1. When the semantics of the order do not matter. For example, in distinct, 
> we need to sort in order to filter out duplicate values; however, we do not 
> care how the comparator sorts the keys. Group-by shares this characteristic. 
> In this case, we rely on Hadoop's default binary comparator.
> 2. When the semantics of the order matter, but the key is of a simple type. 
> In this case, we have implementations for simple types such as integer, 
> long, float, chararray, databytearray, and string.
> However, if the key is a tuple and the sort semantics matter, we do not have 
> a binary comparator implementation. This especially matters when we switch 
> to secondary sort. In secondary sort, we convert the inner sort of a nested 
> foreach into the secondary key and rely on Hadoop to sort on both the main 
> key and the secondary key. The sort key then becomes a two-item tuple. Since 
> the secondary key is the sort key of the nested foreach, the sort semantics 
> matter. It turns out we have no binary comparator once we use secondary 
> sort, and we see a significant slowdown.
> A binary comparator for tuples should be doable once we understand the 
> binary structure of the serialized tuple. We can focus on the most common 
> use case first, which is a "group by" followed by a nested sort. In this 
> case, we use secondary sort: the semantics of the first key do not matter, 
> but the semantics of the secondary key do. We need to identify the boundary 
> between the main key and the secondary key in the binary tuple buffer 
> without instantiating the tuple itself. Then, if the first keys are equal, 
> we use a binary comparator to compare the secondary keys. The secondary key 
> can also be of a complex data type, but for the first step, we focus on 
> simple secondary keys, which are the most common use case.
> We mark this issue as a candidate project for the "Google Summer of Code 
> 2010" program.
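The core idea above, ordering records by comparing their serialized bytes directly without instantiating objects, can be sketched as follows. This is a minimal illustration under an assumed layout (a single 4-byte big-endian int key), not Pig's or Hadoop's actual comparator; a real tuple comparator must also walk type tags and nested fields.

```java
// Minimal sketch of a binary (raw) comparator: order two serialized int keys
// without deserializing them into objects. Hypothetical 4-byte big-endian
// layout; shown only to illustrate the byte-level comparison idea.
public class RawIntComparator {
    // Compare the 4-byte keys starting at offsets s1/s2 in each buffer.
    public static int compare(byte[] b1, int s1, byte[] b2, int s2) {
        // Decode to int first: a plain unsigned byte-by-byte comparison
        // would order negative values incorrectly.
        return Integer.compare(readInt(b1, s1), readInt(b2, s2));
    }

    // Decode a big-endian int straight from the byte buffer.
    private static int readInt(byte[] b, int off) {
        return ((b[off] & 0xff) << 24) | ((b[off + 1] & 0xff) << 16)
             | ((b[off + 2] & 0xff) << 8) | (b[off + 3] & 0xff);
    }

    // Big-endian serialization, the inverse of readInt.
    public static byte[] serialize(int v) {
        return new byte[] { (byte) (v >>> 24), (byte) (v >>> 16),
                            (byte) (v >>> 8), (byte) v };
    }

    public static void main(String[] args) {
        byte[] a = serialize(3), b = serialize(7);
        System.out.println(compare(a, 0, b, 0) < 0);  // true: 3 sorts before 7
    }
}
```

The speedup described in the issue comes from skipping object instantiation entirely: the sort touches only the byte buffers.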

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1330) Move pruned schema tracking logic from LoadFunc to core code

2010-03-30 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1330:


Attachment: PIG-1330-1.patch

> Move pruned schema tracking logic from LoadFunc to core code
> 
>
> Key: PIG-1330
> URL: https://issues.apache.org/jira/browse/PIG-1330
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.7.0
>
> Attachments: PIG-1330-1.patch
>
>
> Currently, LoadFunc.getSchema requires a schema after column pruning. The 
> good side of this is that LoadFunc.getSchema matches the data it actually 
> loads, which gives a sense of consistency. However, this means every 
> LoadFunc needs to keep track of the pruned columns, which is an unnecessary 
> burden on the LoadFunc writer and very error prone. This issue is to move 
> that logic from LoadFunc into the Pig core; LoadFunc.getSchema then only 
> needs to return the original schema, even after pruning.
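The proposed division of labor can be sketched as follows: the LoadFunc keeps returning the original schema, and the core derives the pruned view from the set of required column indexes. The class and method names here are illustrative stand-ins, not Pig's actual API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Sketch: core-side pruning of a schema the LoadFunc reported in full.
public class SchemaPruner {
    // Keep only the fields at the required indexes, in original column order.
    public static List<String> prune(List<String> originalSchema,
                                     Set<Integer> requiredColumns) {
        List<String> pruned = new ArrayList<>();
        for (int i : new TreeSet<>(requiredColumns)) {  // sorted -> stable order
            pruned.add(originalSchema.get(i));
        }
        return pruned;
    }

    public static void main(String[] args) {
        List<String> schema = List.of("name", "age", "gpa");
        System.out.println(prune(schema, Set.of(0, 2)));  // [name, gpa]
    }
}
```

With this shape, no LoadFunc writer has to track pruning state; the core applies it uniformly.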

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1330) Move pruned schema tracking logic from LoadFunc to core code

2010-03-30 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1330:


Status: Patch Available  (was: Open)

> Move pruned schema tracking logic from LoadFunc to core code
> 
>
> Key: PIG-1330
> URL: https://issues.apache.org/jira/browse/PIG-1330
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.7.0
>
> Attachments: PIG-1330-1.patch
>
>
> Currently, LoadFunc.getSchema requires a schema after column pruning. The 
> good side of this is that LoadFunc.getSchema matches the data it actually 
> loads, which gives a sense of consistency. However, this means every 
> LoadFunc needs to keep track of the pruned columns, which is an unnecessary 
> burden on the LoadFunc writer and very error prone. This issue is to move 
> that logic from LoadFunc into the Pig core; LoadFunc.getSchema then only 
> needs to return the original schema, even after pruning.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1344) PigStorage should be able to read back complex data containing delimiters created by PigStorage

2010-03-30 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851729#action_12851729
 ] 

Daniel Dai commented on PIG-1344:
-

Hi Santhosh,
We changed complex data parsing in 0.7, and all values will be read as 
bytearray. The goal of this change is to stop guessing the data type for 
complex data. You can cast to other data types either implicitly or 
explicitly. So here, do you mean you still want data type guessing in some 
cases?

> PigStorage should be able to read back complex data containing delimiters 
> created by PigStorage
> ---
>
> Key: PIG-1344
> URL: https://issues.apache.org/jira/browse/PIG-1344
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Santhosh Srinivasan
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
>
> With Pig 0.7, the TextDataParser has been removed and the logic to parse 
> complex data types has moved to Utf8StorageConverter. However, this does not 
> handle the case where the complex data types contain delimiters ('{', '}', 
> ',', '(', ')', '[', ']', '#'). Fixing this issue will make PigStorage 
> self-contained and more usable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1313) PigServer leaks memory over time

2010-03-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851717#action_12851717
 ] 

Hadoop QA commented on PIG-1313:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12440276/PIG-1313-3.patch
  against trunk revision 929236.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no tests are needed for this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/272/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/272/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/272/console

This message is automatically generated.

> PigServer leaks memory over time
> 
>
> Key: PIG-1313
> URL: https://issues.apache.org/jira/browse/PIG-1313
> Project: Pig
>  Issue Type: Bug
>Reporter: Bill Graham
>Assignee: Bill Graham
> Attachments: PIG-1313-0.4.0-1.patch, PIG-1313-1.patch, 
> PIG-1313-1.patch, PIG-1313-2.patch, PIG-1313-3.patch, Pig1313Reproducer.java
>
>
> When {{PigServer}} runs it creates temporary files using the 
> {{FileLocalizer.getTemporaryPath(..)}}. This static method creates and 
> returns a handle to a temporary file (as an instance of 
> {{ElementDescriptor}}). The {{ElementDescriptors}} returned by this method 
> are kept on a static {{Stack}} named {{toDelete}}. The items on {{toDelete}} 
> get removed by the {{FileLocalizer.deleteTempFile()}} method.
> The only place in the code where I see {{FileLocalizer.deleteTempFile()}} 
> called is in the Main class. {{PigServer}} does not call that method though, 
> so a long-running VM that repeatedly uses instances of {{PigServer}} to run 
> jobs will leak memory via {{toDelete}}.
> One suggested fix is to have {{PigServer.shutdown()}} call 
> {{FileLocalizer.deleteTempFile()}}, but this would cause problems in a 
> multi-threaded environment, since it seems {{ElementDescriptors}} are pushed 
> onto the {{toDelete}} stack before they're used, not once they're done with. 
> With this approach, running multiple instances of {{PigServer}} in separate 
> threads could cause one completed job to clobber the other's still-in-use 
> temp files while they are still in use.
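The leak pattern described above can be sketched as a static collection that only grows unless some caller remembers to drain it. The names below are illustrative stand-ins for FileLocalizer's toDelete stack, not Pig's actual code.

```java
import java.util.Stack;

// Sketch of the leak: every temporary-path request pushes a handle onto a
// static stack, and nothing removes it unless deleteTempFiles() is called.
public class TempFileTracker {
    private static final Stack<String> toDelete = new Stack<>();

    // Stand-in for FileLocalizer.getTemporaryPath(..): the handle is
    // retained forever on the static stack.
    public static String getTemporaryPath(String name) {
        toDelete.push(name);
        return name;
    }

    public static int pending() {
        return toDelete.size();
    }

    // Stand-in for FileLocalizer.deleteTempFile(): drains the stack.
    public static void deleteTempFiles() {
        toDelete.clear();
    }
}
```

In a long-running VM that never calls the drain method, pending() grows without bound, which is the leak. As the issue notes, draining from a shared shutdown path is unsafe with multiple threads, because entries are pushed before the files are done being used.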

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1344) PigStorage should be able to read back complex data containing delimiters created by PigStorage

2010-03-30 Thread Santhosh Srinivasan (JIRA)
PigStorage should be able to read back complex data containing delimiters 
created by PigStorage
---

 Key: PIG-1344
 URL: https://issues.apache.org/jira/browse/PIG-1344
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Santhosh Srinivasan
Assignee: Daniel Dai
 Fix For: 0.8.0


With Pig 0.7, the TextDataParser has been removed and the logic to parse 
complex data types has moved to Utf8StorageConverter. However, this does not 
handle the case where the complex data types contain delimiters ('{', '}', 
',', '(', ')', '[', ']', '#'). Fixing this issue will make PigStorage 
self-contained and more usable.
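The parsing problem is that a field's delimiter characters can also appear inside nested values. One standard fix is depth-aware splitting: only treat a comma as a field separator when it is not inside brackets. The sketch below illustrates that idea only; it is not Utf8StorageConverter's actual logic and handles no quoting or escaping.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: split a serialized tuple body on ',' only at nesting depth zero,
// so commas inside nested tuples/bags/maps are left intact.
public class NestedSplitter {
    public static List<String> splitFields(String tupleBody) {
        List<String> fields = new ArrayList<>();
        int depth = 0, start = 0;
        for (int i = 0; i < tupleBody.length(); i++) {
            char c = tupleBody.charAt(i);
            if (c == '(' || c == '{' || c == '[') depth++;
            else if (c == ')' || c == '}' || c == ']') depth--;
            else if (c == ',' && depth == 0) {   // top-level delimiter only
                fields.add(tupleBody.substring(start, i));
                start = i + 1;
            }
        }
        fields.add(tupleBody.substring(start));  // trailing field
        return fields;
    }

    public static void main(String[] args) {
        // The nested map {b,c} keeps its inner comma intact.
        System.out.println(splitFields("a,{b,c},d"));  // [a, {b,c}, d]
    }
}
```

A full fix would also need to handle delimiter characters occurring inside scalar values themselves (e.g. a chararray containing '#'), which requires escaping or quoting at write time.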

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1336) Optimize POStore serialized into JobConf

2010-03-30 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1336:


Status: Open  (was: Patch Available)

> Optimize POStore serialized into JobConf
> 
>
> Key: PIG-1336
> URL: https://issues.apache.org/jira/browse/PIG-1336
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Attachments: PIG-1336-1.patch, PIG-1336-2.patch, PIG-1336-3.patch, 
> PIG-1336-4.patch
>
>
> We serialize POStore too early in the JobControlCompiler. At that time, the 
> storeFunc has unconstrained links to other operators; in the worst case, it 
> will pull in the whole physical plan. Also, in the multi-store case, POStore 
> has a link to its data source, which is not needed and increases the 
> footprint of the serialized POStore.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1336) Optimize POStore serialized into JobConf

2010-03-30 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1336:


Attachment: PIG-1336-4.patch

> Optimize POStore serialized into JobConf
> 
>
> Key: PIG-1336
> URL: https://issues.apache.org/jira/browse/PIG-1336
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Attachments: PIG-1336-1.patch, PIG-1336-2.patch, PIG-1336-3.patch, 
> PIG-1336-4.patch
>
>
> We serialize POStore too early in the JobControlCompiler. At that time, the 
> storeFunc has unconstrained links to other operators; in the worst case, it 
> will pull in the whole physical plan. Also, in the multi-store case, POStore 
> has a link to its data source, which is not needed and increases the 
> footprint of the serialized POStore.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1336) Optimize POStore serialized into JobConf

2010-03-30 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1336:


Status: Patch Available  (was: Open)

Good catch. Thanks, Richard. Modified the patch to address that.

> Optimize POStore serialized into JobConf
> 
>
> Key: PIG-1336
> URL: https://issues.apache.org/jira/browse/PIG-1336
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Attachments: PIG-1336-1.patch, PIG-1336-2.patch, PIG-1336-3.patch, 
> PIG-1336-4.patch
>
>
> We serialize POStore too early in the JobControlCompiler. At that time, the 
> storeFunc has unconstrained links to other operators; in the worst case, it 
> will pull in the whole physical plan. Also, in the multi-store case, POStore 
> has a link to its data source, which is not needed and increases the 
> footprint of the serialized POStore.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1338) Pig should exclude hadoop conf in local mode

2010-03-30 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1338:


Status: Open  (was: Patch Available)

> Pig should exclude hadoop conf in local mode
> 
>
> Key: PIG-1338
> URL: https://issues.apache.org/jira/browse/PIG-1338
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Attachments: PIG-1338-1.patch, PIG-1338-2.patch, PIG-1338-3.patch
>
>
> Currently, the behavior for hadoop conf lookup is:
> * in local mode, if there is a hadoop conf, bail out; if there is no hadoop 
> conf, launch in local mode
> * in hadoop mode, if there is a hadoop conf, use it to launch Pig; if not, 
> still launch without warning, but much functionality will go wrong
> We should change this to a more intuitive behavior:
> * in local mode, always launch Pig in local mode
> * in hadoop mode, if there is a hadoop conf, use it to launch Pig; if not, 
> bail out with a meaningful message
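The proposed lookup behavior amounts to a small decision function. The sketch below is illustrative only; the enum and return values are assumptions, and Pig's real logic lives in its launcher code.

```java
// Sketch of the proposed conf-lookup behavior: local mode ignores any hadoop
// conf, while mapreduce mode requires one and fails loudly otherwise.
public class ConfLookup {
    public enum ExecMode { LOCAL, MAPREDUCE }

    public static String resolve(ExecMode mode, boolean hadoopConfFound) {
        if (mode == ExecMode.LOCAL) {
            return "launch local";               // always local; conf excluded
        }
        if (!hadoopConfFound) {
            throw new IllegalStateException(
                "hadoop conf not found; cannot launch in mapreduce mode");
        }
        return "launch mapreduce with conf";
    }

    public static void main(String[] args) {
        System.out.println(resolve(ExecMode.LOCAL, true));  // launch local
    }
}
```

The key behavioral change is the hard failure: rather than silently launching a half-working hadoop-mode session without a conf, the launcher bails out with a meaningful message.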

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1338) Pig should exclude hadoop conf in local mode

2010-03-30 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1338:


Attachment: PIG-1338-3.patch

Did some code restructuring and gave it another shot.

> Pig should exclude hadoop conf in local mode
> 
>
> Key: PIG-1338
> URL: https://issues.apache.org/jira/browse/PIG-1338
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Attachments: PIG-1338-1.patch, PIG-1338-2.patch, PIG-1338-3.patch
>
>
> Currently, the behavior for hadoop conf lookup is:
> * in local mode, if there is a hadoop conf, bail out; if there is no hadoop 
> conf, launch in local mode
> * in hadoop mode, if there is a hadoop conf, use it to launch Pig; if not, 
> still launch without warning, but much functionality will go wrong
> We should change this to a more intuitive behavior:
> * in local mode, always launch Pig in local mode
> * in hadoop mode, if there is a hadoop conf, use it to launch Pig; if not, 
> bail out with a meaningful message

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1338) Pig should exclude hadoop conf in local mode

2010-03-30 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1338:


Status: Patch Available  (was: Open)

> Pig should exclude hadoop conf in local mode
> 
>
> Key: PIG-1338
> URL: https://issues.apache.org/jira/browse/PIG-1338
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Attachments: PIG-1338-1.patch, PIG-1338-2.patch, PIG-1338-3.patch
>
>
> Currently, the behavior for hadoop conf lookup is:
> * in local mode, if there is a hadoop conf, bail out; if there is no hadoop 
> conf, launch in local mode
> * in hadoop mode, if there is a hadoop conf, use it to launch Pig; if not, 
> still launch without warning, but much functionality will go wrong
> We should change this to a more intuitive behavior:
> * in local mode, always launch Pig in local mode
> * in hadoop mode, if there is a hadoop conf, use it to launch Pig; if not, 
> bail out with a meaningful message

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails

2010-03-30 Thread Viraj Bhat (JIRA)
pig_log file missing even though Main tells it is creating one and an M/R job 
fails 


 Key: PIG-1343
 URL: https://issues.apache.org/jira/browse/PIG-1343
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat


I ran into a particular case while running with the latest trunk of Pig.

{code}
$java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig

[main] INFO  org.apache.pig.Main - Logging error messages to: 
/homes/viraj/pig_1263420012601.log

$ls -l pig_1263420012601.log
ls: pig_1263420012601.log: No such file or directory
{code}

The job failed and the log file did not contain anything; the only way to 
debug was to look into the JobTracker logs.

Here are some reasons that could have caused this behavior:
1) The underlying filer/NFS had some issues. In that case, should we not 
report an error on stdout?
2) There are some errors from the backend that are not being captured

Viraj


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1341) Cannot convert DataByeArray to Chararray and results in FIELD_DISCARDED_TYPE_CONVERSION_FAILED

2010-03-30 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1341:


Fix Version/s: 0.7.0

> Cannot convert DataByeArray to Chararray and results in 
> FIELD_DISCARDED_TYPE_CONVERSION_FAILED
> --
>
> Key: PIG-1341
> URL: https://issues.apache.org/jira/browse/PIG-1341
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.6.0
>Reporter: Viraj Bhat
> Fix For: 0.7.0
>
>
> The script reads in BinStorage data and tries to convert a column that is a 
> DataByteArray to a Chararray.
> {code}
> raw = load 'sampledata' using BinStorage() as (col1,col2, col3);
> --filter out null columns
> A = filter raw by col1#'bcookie' is not null;
> B = foreach A generate col1#'bcookie'  as reqcolumn;
> describe B;
> --B: {regcolumn: bytearray}
> X = limit B 5;
> dump X;
> B = foreach A generate (chararray)col1#'bcookie'  as convertedcol;
> describe B;
> --B: {convertedcol: chararray}
> X = limit B 5;
> dump X;
> {code}
> The first dump produces:
> (36co9b55onr8s)
> (36co9b55onr8s)
> (36hilul5oo1q1)
> (36hilul5oo1q1)
> (36l4cj15ooa8a)
> The second dump produces:
> ()
> ()
> ()
> ()
> ()
> It also throws an error message: FIELD_DISCARDED_TYPE_CONVERSION_FAILED 5 
> time(s).
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1341) Cannot convert DataByeArray to Chararray and results in FIELD_DISCARDED_TYPE_CONVERSION_FAILED

2010-03-30 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich reassigned PIG-1341:
---

Assignee: Richard Ding

> Cannot convert DataByeArray to Chararray and results in 
> FIELD_DISCARDED_TYPE_CONVERSION_FAILED
> --
>
> Key: PIG-1341
> URL: https://issues.apache.org/jira/browse/PIG-1341
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.6.0
>Reporter: Viraj Bhat
>Assignee: Richard Ding
> Fix For: 0.7.0
>
>
> The script reads in BinStorage data and tries to convert a column that is a 
> DataByteArray to a Chararray.
> {code}
> raw = load 'sampledata' using BinStorage() as (col1,col2, col3);
> --filter out null columns
> A = filter raw by col1#'bcookie' is not null;
> B = foreach A generate col1#'bcookie'  as reqcolumn;
> describe B;
> --B: {regcolumn: bytearray}
> X = limit B 5;
> dump X;
> B = foreach A generate (chararray)col1#'bcookie'  as convertedcol;
> describe B;
> --B: {convertedcol: chararray}
> X = limit B 5;
> dump X;
> {code}
> The first dump produces:
> (36co9b55onr8s)
> (36co9b55onr8s)
> (36hilul5oo1q1)
> (36hilul5oo1q1)
> (36l4cj15ooa8a)
> The second dump produces:
> ()
> ()
> ()
> ()
> ()
> It also throws an error message: FIELD_DISCARDED_TYPE_CONVERSION_FAILED 5 
> time(s).
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1342) [Zebra] Avoid making unnecessary name node calls for writes in Zebra

2010-03-30 Thread Chao Wang (JIRA)
[Zebra] Avoid making unnecessary name node calls for writes in Zebra


 Key: PIG-1342
 URL: https://issues.apache.org/jira/browse/PIG-1342
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.6.0, 0.7.0
Reporter: Chao Wang
Assignee: Chao Wang
 Fix For: 0.8.0


Currently, table and column-group level metadata is extracted from the job 
configuration object and written to HDFS within checkOutputSpec(). Later on, 
writers at the back end open these files to access the metadata when doing 
writes. This puts extra load on the name node, since all writers need to make 
name node calls to open the files.

We propose the following approach to this problem: writers at the back end 
extract the meta information from the job configuration object directly, 
rather than making name node calls and going to HDFS to fetch it.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db

2010-03-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851665#action_12851665
 ] 

Hadoop QA commented on PIG-1229:


+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12440249/jira-1229-v2.patch
  against trunk revision 928950.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 4 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/260/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/260/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/260/console

This message is automatically generated.

> allow pig to write output into a JDBC db
> 
>
> Key: PIG-1229
> URL: https://issues.apache.org/jira/browse/PIG-1229
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>Reporter: Ian Holsman
>Assignee: Ankur
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: jira-1229-v2.patch
>
>
> UDF to store data into a DB

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1309) Map-side Cogroup

2010-03-30 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851661#action_12851661
 ] 

Ashutosh Chauhan commented on PIG-1309:
---

To build the index, we sample every split and get an index entry 
corresponding to each split. After sampling, all the index entries are sorted 
and then the index is written to disk. When I first wrote MergeJoin, I wasn't 
able to figure out how to use Hadoop's sorting to sort the index, so there is 
a comment in MRCompiler about that:
{noformat}
// Sorting of index can possibly be achieved by using Hadoop sorting 
// between map and reduce instead of Pig doing sort. If that is so, 
// it will simplify lot of the code below.
{noformat}
Now I have figured it out :) By default, if LocalRearranges produce a key of 
type tuple, Pig supplies a raw binary comparator (PigTupleWritableComparator) 
to Hadoop to compare the tuples, which ignores the semantics of the tuple. We 
need to override that behavior to make Pig supply the correct version of the 
tuple comparator (PigTupleRawComparator). We need to communicate this info 
from MRCompiler to JobControlCompiler, so I am doing that through the 
MapReduceOper object.

As nice side effects of this:
a) the code in MRCompiler is indeed simplified now
b) we got rid of the extra index sort inside the reducer.
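The override described in the comment boils down to a selection carried on the map-reduce operator: the comparator names below come from the comment itself, but this decision-function shape is only a sketch of the idea, not the actual MRCompiler/JobControlCompiler code.

```java
// Sketch: a flag on the map-reduce operator tells the job compiler which
// tuple comparator to hand to Hadoop.
public class ComparatorChooser {
    public static String chooseTupleComparator(boolean typedComparator) {
        // Default raw comparator ignores tuple semantics (fine for group-by /
        // distinct); the typed one honors them (needed to sort the merge-join
        // index correctly between map and reduce).
        return typedComparator ? "PigTupleRawComparator"
                               : "PigTupleWritableComparator";
    }

    public static void main(String[] args) {
        System.out.println(chooseTupleComparator(true));  // PigTupleRawComparator
    }
}
```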

> Map-side Cogroup
> 
>
> Key: PIG-1309
> URL: https://issues.apache.org/jira/browse/PIG-1309
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Attachments: mapsideCogrp.patch, pig-1309_1.patch
>
>
> In our never-ending quest to make Pig go faster, we want to parallelize as 
> many relational operations as possible. It is already possible to do 
> group-by (PIG-984) and joins (PIG-845, PIG-554) purely on the map side in 
> Pig. This JIRA is to add a map-side implementation of cogroup in Pig. 
> Details to follow.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1341) Cannot convert DataByeArray to Chararray and results in FIELD_DISCARDED_TYPE_CONVERSION_FAILED

2010-03-30 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-1341:


Component/s: impl
Summary: Cannot convert DataByeArray to Chararray and results in 
FIELD_DISCARDED_TYPE_CONVERSION_FAILED  (was: Cannot convert DataByeArray to 
Chararray and results in FIELD_DISCARDED_TYPE_CONVERSION_FAILED 20)

> Cannot convert DataByeArray to Chararray and results in 
> FIELD_DISCARDED_TYPE_CONVERSION_FAILED
> --
>
> Key: PIG-1341
> URL: https://issues.apache.org/jira/browse/PIG-1341
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.6.0
>Reporter: Viraj Bhat
>
> The script reads in BinStorage data and tries to convert a column that is a 
> DataByteArray to a Chararray.
> {code}
> raw = load 'sampledata' using BinStorage() as (col1,col2, col3);
> --filter out null columns
> A = filter raw by col1#'bcookie' is not null;
> B = foreach A generate col1#'bcookie'  as reqcolumn;
> describe B;
> --B: {regcolumn: bytearray}
> X = limit B 5;
> dump X;
> B = foreach A generate (chararray)col1#'bcookie'  as convertedcol;
> describe B;
> --B: {convertedcol: chararray}
> X = limit B 5;
> dump X;
> {code}
> The first dump produces:
> (36co9b55onr8s)
> (36co9b55onr8s)
> (36hilul5oo1q1)
> (36hilul5oo1q1)
> (36l4cj15ooa8a)
> The second dump produces:
> ()
> ()
> ()
> ()
> ()
> It also throws an error message: FIELD_DISCARDED_TYPE_CONVERSION_FAILED 5 
> time(s).
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1341) Cannot convert DataByeArray to Chararray and results in FIELD_DISCARDED_TYPE_CONVERSION_FAILED 20

2010-03-30 Thread Viraj Bhat (JIRA)
Cannot convert DataByeArray to Chararray and results in 
FIELD_DISCARDED_TYPE_CONVERSION_FAILED 20
-

 Key: PIG-1341
 URL: https://issues.apache.org/jira/browse/PIG-1341
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Viraj Bhat


The script reads in BinStorage data and tries to convert a column that is a 
DataByteArray to a Chararray.

{code}
raw = load 'sampledata' using BinStorage() as (col1,col2, col3);
--filter out null columns
A = filter raw by col1#'bcookie' is not null;

B = foreach A generate col1#'bcookie'  as reqcolumn;
describe B;
--B: {regcolumn: bytearray}
X = limit B 5;
dump X;

B = foreach A generate (chararray)col1#'bcookie'  as convertedcol;
describe B;
--B: {convertedcol: chararray}
X = limit B 5;
dump X;

{code}

The first dump produces:

(36co9b55onr8s)
(36co9b55onr8s)
(36hilul5oo1q1)
(36hilul5oo1q1)
(36l4cj15ooa8a)

The second dump produces:
()
()
()
()
()

It also throws an error message: FIELD_DISCARDED_TYPE_CONVERSION_FAILED 5 
time(s).
Viraj
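For reference, the happy path of a bytearray-to-chararray cast is just a UTF-8 decode of the raw bytes. The failure reported above happens before this step, when the map value's runtime type cannot be converted; this sketch is only an illustration of the expected conversion, not Pig's Utf8StorageConverter.

```java
import java.nio.charset.StandardCharsets;

// Sketch: what a successful (chararray) cast of raw bytes amounts to.
public class BytesToChararray {
    public static String bytesToCharArray(byte[] raw) {
        // Null stays null; otherwise decode the bytes as UTF-8 text.
        return raw == null ? null : new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] cookie = "36co9b55onr8s".getBytes(StandardCharsets.UTF_8);
        System.out.println(bytesToCharArray(cookie));  // 36co9b55onr8s
    }
}
```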

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1340) [zebra] The zebra version number should be changed from 0.7 to 0.8

2010-03-30 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou reassigned PIG-1340:
-

Assignee: Yan Zhou

> [zebra] The zebra version number should be changed from 0.7 to 0.8
> --
>
> Key: PIG-1340
> URL: https://issues.apache.org/jira/browse/PIG-1340
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Yan Zhou
>Assignee: Yan Zhou
>Priority: Trivial
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1340) [zebra] The zebra version number should be changed from 0.7 to 0.8

2010-03-30 Thread Yan Zhou (JIRA)
[zebra] The zebra version number should be changed from 0.7 to 0.8
--

 Key: PIG-1340
 URL: https://issues.apache.org/jira/browse/PIG-1340
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Priority: Trivial




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1309) Map-side Cogroup

2010-03-30 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851655#action_12851655
 ] 

Alan Gates commented on PIG-1309:
-

I'm not clear on the need for the typedComparator logic in MapReduceOper.  Can 
you explain why that's necessary?

> Map-side Cogroup
> 
>
> Key: PIG-1309
> URL: https://issues.apache.org/jira/browse/PIG-1309
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Attachments: mapsideCogrp.patch, pig-1309_1.patch
>
>
> In the never-ending quest to make Pig go faster, we want to parallelize as 
> many relational operations as possible. It is already possible to do Group-by 
> (PIG-984) and Joins (PIG-845, PIG-554) purely on the map side in Pig. This 
> jira is to add a map-side implementation of Cogroup in Pig. Details to follow.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1338) Pig should exclude hadoop conf in local mode

2010-03-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851651#action_12851651
 ] 

Hadoop QA commented on PIG-1338:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12440253/PIG-1338-2.patch
  against trunk revision 928950.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 79 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/271/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/271/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/271/console

This message is automatically generated.

> Pig should exclude hadoop conf in local mode
> 
>
> Key: PIG-1338
> URL: https://issues.apache.org/jira/browse/PIG-1338
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Attachments: PIG-1338-1.patch, PIG-1338-2.patch
>
>
> Currently, the behavior for hadoop conf lookup is:
> * in local mode, if there is a hadoop conf, bail out; if there is no hadoop 
> conf, launch in local mode
> * in hadoop mode, if there is a hadoop conf, use it to launch Pig; if not, 
> still launch without warning, but much functionality will go wrong
> We should change this to the more intuitive behavior:
> * in local mode, always launch Pig in local mode
> * in hadoop mode, if there is a hadoop conf, use it to launch Pig; if not, 
> bail out with a meaningful message

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1310) ISO Date UDFs: Conversion, Truncation and Date Math

2010-03-30 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-1310:


Resolution: Fixed
Status: Resolved  (was: Patch Available)

Checked into trunk and 0.7 branch.  Thanks Russell for your tireless work on 
this.

> ISO Date UDFs: Conversion, Truncation and Date Math
> --
>
> Key: PIG-1310
> URL: https://issues.apache.org/jira/browse/PIG-1310
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>Reporter: Russell Jurney
>Assignee: Russell Jurney
> Fix For: 0.7.0
>
> Attachments: joda-mavenstuff.diff, pass.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> I've written UDFs to handle loading unix times, datemonth values and ISO 8601 
> formatted date strings, and working with them as ISO datetimes using jodatime.
> The working code is here: 
> http://github.com/rjurney/oink/tree/master/src/java/oink/udf/isodate/
> It needs to be documented and tests added, and a couple UDFs are missing, but 
> these work if you REGISTER the jodatime jar in your script.  Hopefully I can 
> get this stuff in piggybank before someone else writes it this time :)  The 
> rounding also may not be performant, but the code works.
> Ultimately I'd also like to enable support for ISO 8601 durations.  Someone 
> slap me if this isn't done soon, it is not much work and this should help 
> everyone working with time series.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1339) International characters in column names not supported

2010-03-30 Thread Viraj Bhat (JIRA)
International characters in column names not supported
--

 Key: PIG-1339
 URL: https://issues.apache.org/jira/browse/PIG-1339
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat


There is a use case in which someone specifies a column name using 
international characters.

{code}
inputdata = load '/user/viraj/inputdata.txt' using PigStorage() as (あいうえお);
describe inputdata;
dump inputdata;
{code}
==
Pig Stack Trace
---
ERROR 1000: Error during parsing. Lexical error at line 1, column 64.  
Encountered: "\u3042" (12354), after : ""

org.apache.pig.impl.logicalLayer.parser.TokenMgrError: Lexical error at line 1, 
column 64.  Encountered: "\u3042" (12354), after : ""

at 
org.apache.pig.impl.logicalLayer.parser.QueryParserTokenManager.getNextToken(QueryParserTokenManager.java:1791)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_scan_token(QueryParser.java:8959)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_51(QueryParser.java:7462)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_120(QueryParser.java:7769)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_106(QueryParser.java:7787)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_63(QueryParser.java:8609)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_32(QueryParser.java:8621)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3_4(QueryParser.java:8354)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_2_4(QueryParser.java:6903)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1249)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700)
at 
org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114)
at org.apache.pig.PigServer.registerQuery(PigServer.java:425)
at 
org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737)
at 
org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
at org.apache.pig.Main.main(Main.java:391)
==
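The failure is mechanical: the grammar's identifier token only admits ASCII letters, so the very first character "\u3042" can never start a token. A minimal sketch of that kind of ASCII-only rule (the pattern below is illustrative, not the actual QueryParser production):

```java
import java.util.regex.Pattern;

public class IdentifierCheck {
    // Hypothetical ASCII-only identifier rule, similar in spirit to the
    // lexer token that rejects the Japanese column name above.
    static final Pattern ASCII_ID = Pattern.compile("[A-Za-z][A-Za-z0-9_]*");

    static boolean isValidId(String s) {
        // matches() anchors the whole string against the pattern
        return ASCII_ID.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidId("col1"));      // true
        System.out.println(isValidId("あいうえお")); // false: \u3042 is not in [A-Za-z]
    }
}
```

Supporting such names would mean widening the identifier character class (e.g. to Unicode letter categories) in the generated lexer.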

Thanks Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1331) Owl Hadoop Table Management Service

2010-03-30 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-1331:


Attachment: build.log

> Owl Hadoop Table Management Service
> ---
>
> Key: PIG-1331
> URL: https://issues.apache.org/jira/browse/PIG-1331
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.8.0
>Reporter: Jay Tang
> Attachments: build.log, owl.contrib.3.tgz
>
>
> This JIRA is a proposal to create a Hadoop table management service: Owl. 
> Today, MapReduce and Pig applications interact directly with HDFS 
> directories and files and must deal with low-level data management issues 
> such as storage format, serialization/compression schemes, data layout, and 
> efficient data access, often with different solutions. Owl aims to provide 
> a standard way to address these issues and abstracts away the complexities 
> of reading/writing huge amounts of data from/to HDFS.
> Owl has a data access API that is modeled after the traditional Hadoop 
> InputFormat and a management API to manipulate Owl objects.  This JIRA is 
> related to PIG-823 (Hadoop Metadata Service), as Owl has an internal metadata 
> store.  Owl integrates with different storage modules, like Zebra, via a 
> pluggable architecture.
>  Initially, the proposal is to submit Owl as a Pig contrib project.  Over 
> time, it makes sense to move it to a Hadoop subproject.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1331) Owl Hadoop Table Management Service

2010-03-30 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851602#action_12851602
 ] 

Alan Gates commented on PIG-1331:
-

Patch as provided doesn't build.  It gets an ivy error.  I've attached a copy 
of the stdout and stderr from the build run.

> Owl Hadoop Table Management Service
> ---
>
> Key: PIG-1331
> URL: https://issues.apache.org/jira/browse/PIG-1331
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.8.0
>Reporter: Jay Tang
> Attachments: build.log, owl.contrib.3.tgz
>
>
> This JIRA is a proposal to create a Hadoop table management service: Owl. 
> Today, MapReduce and Pig applications interact directly with HDFS 
> directories and files and must deal with low-level data management issues 
> such as storage format, serialization/compression schemes, data layout, and 
> efficient data access, often with different solutions. Owl aims to provide 
> a standard way to address these issues and abstracts away the complexities 
> of reading/writing huge amounts of data from/to HDFS.
> Owl has a data access API that is modeled after the traditional Hadoop 
> InputFormat and a management API to manipulate Owl objects.  This JIRA is 
> related to PIG-823 (Hadoop Metadata Service), as Owl has an internal metadata 
> store.  Owl integrates with different storage modules, like Zebra, via a 
> pluggable architecture.
>  Initially, the proposal is to submit Owl as a Pig contrib project.  Over 
> time, it makes sense to move it to a Hadoop subproject.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1310) ISO Date UDFs: Conversion, Truncation and Date Math

2010-03-30 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851594#action_12851594
 ] 

Dmitriy V. Ryaboy commented on PIG-1310:


It builds -- you just have to build pig with the test classes first, *then* 
test piggybank.  Those Piggybank tests require some of the test helpers Pig has.

> ISO Date UDFs: Conversion, Truncation and Date Math
> --
>
> Key: PIG-1310
> URL: https://issues.apache.org/jira/browse/PIG-1310
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>Reporter: Russell Jurney
>Assignee: Russell Jurney
> Fix For: 0.7.0
>
> Attachments: joda-mavenstuff.diff, pass.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> I've written UDFs to handle loading unix times, datemonth values and ISO 8601 
> formatted date strings, and working with them as ISO datetimes using jodatime.
> The working code is here: 
> http://github.com/rjurney/oink/tree/master/src/java/oink/udf/isodate/
> It needs to be documented and tests added, and a couple UDFs are missing, but 
> these work if you REGISTER the jodatime jar in your script.  Hopefully I can 
> get this stuff in piggybank before someone else writes it this time :)  The 
> rounding also may not be performant, but the code works.
> Ultimately I'd also like to enable support for ISO 8601 durations.  Someone 
> slap me if this isn't done soon, it is not much work and this should help 
> everyone working with time series.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1313) PigServer leaks memory over time

2010-03-30 Thread Bill Graham (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Graham updated PIG-1313:
-

Attachment: PIG-1313-3.patch

> PigServer leaks memory over time
> 
>
> Key: PIG-1313
> URL: https://issues.apache.org/jira/browse/PIG-1313
> Project: Pig
>  Issue Type: Bug
>Reporter: Bill Graham
>Assignee: Bill Graham
> Attachments: PIG-1313-0.4.0-1.patch, PIG-1313-1.patch, 
> PIG-1313-1.patch, PIG-1313-2.patch, PIG-1313-3.patch, Pig1313Reproducer.java
>
>
> When {{PigServer}} runs it creates temporary files using the 
> {{FileLocalizer.getTemporaryPath(..)}}. This static method creates and 
> returns a handle to a temporary file (as an instance of 
> {{ElementDescriptor}}). The {{ElementDescriptors}} returned by this method 
> are kept on a static {{Stack}} named {{toDelete}}. The items on {{toDelete}} 
> get removed by the {{FileLocalizer.deleteTempFile()}} method.
> The only place in the code where I see {{FileLocalizer.deleteTempFile()}} 
> called is in the Main class. {{PigServer}} does not call that method though, 
> so a long-running VM that repeatedly uses instances of {{PigServer}} to run 
> jobs will leak memory via {{toDelete}}.
> One suggested fix is to have {{PigServer.shutdown()}} call 
> {{FileLocalizer.deleteTempFile()}}, but this would cause problems in a 
> multi-threaded environment, since it seems {{ElementDescriptors}} are pushed 
> onto the {{toDelete}} stack before they're used, not once they're done with. 
> With this approach, running multiple instances of {{PigServer}} in separate 
> threads could cause one completed job to clobber the other's still-in-use 
> temp files.
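The leak pattern in the quoted description can be sketched in a few lines; the class and method names below are illustrative stand-ins for FileLocalizer's actual API, not Pig code:

```java
import java.util.Stack;

// Minimal sketch of the leak: a static registry of temp-file handles that
// only an explicit cleanup call ever drains.
public class TempFileRegistry {
    private static final Stack<String> toDelete = new Stack<>();

    // Analogue of FileLocalizer.getTemporaryPath(..): every call pushes a
    // handle that stays referenced until deleteTempFiles() runs.
    public static String getTemporaryPath(String name) {
        String path = "/tmp/pig-" + name;
        toDelete.push(path);
        return path;
    }

    // Analogue of FileLocalizer.deleteTempFile(): only Main calls this, so a
    // long-running VM embedding PigServer never drains the stack.
    public static void deleteTempFiles() {
        while (!toDelete.isEmpty()) {
            toDelete.pop(); // real code would also delete the file on disk
        }
    }

    public static int pendingCount() {
        return toDelete.size();
    }

    public static void main(String[] args) {
        // One simulated PigServer job per iteration; nothing ever cleans up.
        for (int job = 0; job < 1000; job++) {
            getTemporaryPath("job" + job);
        }
        System.out.println(pendingCount()); // grows without bound: 1000
    }
}
```

This also shows why a per-instance cleanup in shutdown() is racy: the registry is shared VM-wide, so one instance's drain would discard handles another thread is still using.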

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1313) PigServer leaks memory over time

2010-03-30 Thread Bill Graham (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Graham updated PIG-1313:
-

Status: Patch Available  (was: Open)

> PigServer leaks memory over time
> 
>
> Key: PIG-1313
> URL: https://issues.apache.org/jira/browse/PIG-1313
> Project: Pig
>  Issue Type: Bug
>Reporter: Bill Graham
>Assignee: Bill Graham
> Attachments: PIG-1313-0.4.0-1.patch, PIG-1313-1.patch, 
> PIG-1313-1.patch, PIG-1313-2.patch, PIG-1313-3.patch, Pig1313Reproducer.java
>
>
> When {{PigServer}} runs it creates temporary files using the 
> {{FileLocalizer.getTemporaryPath(..)}}. This static method creates and 
> returns a handle to a temporary file (as an instance of 
> {{ElementDescriptor}}). The {{ElementDescriptors}} returned by this method 
> are kept on a static {{Stack}} named {{toDelete}}. The items on {{toDelete}} 
> get removed by the {{FileLocalizer.deleteTempFile()}} method.
> The only place in the code where I see {{FileLocalizer.deleteTempFile()}} 
> called is in the Main class. {{PigServer}} does not call that method though, 
> so a long-running VM that repeatedly uses instances of {{PigServer}} to run 
> jobs will leak memory via {{toDelete}}.
> One suggested fix is to have {{PigServer.shutdown()}} call 
> {{FileLocalizer.deleteTempFile()}}, but this would cause problems in a 
> multi-threaded environment, since it seems {{ElementDescriptors}} are pushed 
> onto the {{toDelete}} stack before they're used, not once they're done with. 
> With this approach, running multiple instances of {{PigServer}} in separate 
> threads could cause one completed job to clobber the other's still-in-use 
> temp files.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1313) PigServer leaks memory over time

2010-03-30 Thread Bill Graham (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Graham updated PIG-1313:
-

Status: Open  (was: Patch Available)

> PigServer leaks memory over time
> 
>
> Key: PIG-1313
> URL: https://issues.apache.org/jira/browse/PIG-1313
> Project: Pig
>  Issue Type: Bug
>Reporter: Bill Graham
>Assignee: Bill Graham
> Attachments: PIG-1313-0.4.0-1.patch, PIG-1313-1.patch, 
> PIG-1313-1.patch, PIG-1313-2.patch, Pig1313Reproducer.java
>
>
> When {{PigServer}} runs it creates temporary files using the 
> {{FileLocalizer.getTemporaryPath(..)}}. This static method creates and 
> returns a handle to a temporary file (as an instance of 
> {{ElementDescriptor}}). The {{ElementDescriptors}} returned by this method 
> are kept on a static {{Stack}} named {{toDelete}}. The items on {{toDelete}} 
> get removed by the {{FileLocalizer.deleteTempFile()}} method.
> The only place in the code where I see {{FileLocalizer.deleteTempFile()}} 
> called is in the Main class. {{PigServer}} does not call that method though, 
> so a long-running VM that repeatedly uses instances of {{PigServer}} to run 
> jobs will leak memory via {{toDelete}}.
> One suggested fix is to have {{PigServer.shutdown()}} call 
> {{FileLocalizer.deleteTempFile()}}, but this would cause problems in a 
> multi-threaded environment, since it seems {{ElementDescriptors}} are pushed 
> onto the {{toDelete}} stack before they're used, not once they're done with. 
> With this approach, running multiple instances of {{PigServer}} in separate 
> threads could cause one completed job to clobber the other's still-in-use 
> temp files.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1310) ISO Date UDFs: Conversion, Truncation and Date Math

2010-03-30 Thread Russell Jurney (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851542#action_12851542
 ] 

Russell Jurney commented on PIG-1310:
-

Cool - one thing though - Piggybank itself does not build in trunk.  It must 
not have built since 0.6, since the load/store func changes went in.  Does 
something need to be done there?  Should I submit a patch that removes all the 
broken UDFs to make ant build in piggybank work on trunk?

To get piggybank to build, I had to remove:

!   
contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/TestMultiStorage.java
!   
contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/TestSequenceFileLoader.java
!   
contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/TestRegExLoader.java
!   
contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/TestPigStorageSchema.java
!   
contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/evaluation/string/TestLookupInFiles.java
!   
contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/evaluation/TestEvalString.java

Is this just me, or is this fixed on other branches?

> ISO Date UDFs: Conversion, Truncation and Date Math
> --
>
> Key: PIG-1310
> URL: https://issues.apache.org/jira/browse/PIG-1310
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>Reporter: Russell Jurney
>Assignee: Russell Jurney
> Fix For: 0.7.0
>
> Attachments: joda-mavenstuff.diff, pass.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> I've written UDFs to handle loading unix times, datemonth values and ISO 8601 
> formatted date strings, and working with them as ISO datetimes using jodatime.
> The working code is here: 
> http://github.com/rjurney/oink/tree/master/src/java/oink/udf/isodate/
> It needs to be documented and tests added, and a couple UDFs are missing, but 
> these work if you REGISTER the jodatime jar in your script.  Hopefully I can 
> get this stuff in piggybank before someone else writes it this time :)  The 
> rounding also may not be performant, but the code works.
> Ultimately I'd also like to enable support for ISO 8601 durations.  Someone 
> slap me if this isn't done soon, it is not much work and this should help 
> everyone working with time series.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.




[jira] Commented: (PIG-1310) ISO Date UDFs: Conversion, Truncation and Date Math

2010-03-30 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851526#action_12851526
 ] 

Alan Gates commented on PIG-1310:
-

New patch looks good.  Piggybank tests pass.  I'm rerunning the patch test to 
check things like javac warnings, etc.  As long as that all returns success 
I'll commit it.  Then I'll apply it to 0.7, test it there, and assuming all is 
well, commit it there too.

> ISO Date UDFs: Conversion, Truncation and Date Math
> --
>
> Key: PIG-1310
> URL: https://issues.apache.org/jira/browse/PIG-1310
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>Reporter: Russell Jurney
>Assignee: Russell Jurney
> Fix For: 0.7.0
>
> Attachments: joda-mavenstuff.diff, pass.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> I've written UDFs to handle loading unix times, datemonth values and ISO 8601 
> formatted date strings, and working with them as ISO datetimes using jodatime.
> The working code is here: 
> http://github.com/rjurney/oink/tree/master/src/java/oink/udf/isodate/
> It needs to be documented and tests added, and a couple UDFs are missing, but 
> these work if you REGISTER the jodatime jar in your script.  Hopefully I can 
> get this stuff in piggybank before someone else writes it this time :)  The 
> rounding also may not be performant, but the code works.
> Ultimately I'd also like to enable support for ISO 8601 durations.  Someone 
> slap me if this isn't done soon, it is not much work and this should help 
> everyone working with time series.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1335) UDFFinder should find LoadFunc used by POCast

2010-03-30 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851519#action_12851519
 ] 

Richard Ding commented on PIG-1335:
---

+1

> UDFFinder should find LoadFunc used by POCast
> -
>
> Key: PIG-1335
> URL: https://issues.apache.org/jira/browse/PIG-1335
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Attachments: PIG-1335-1.patch
>
>
> UDFFinder doesn't look into POCast so it will miss LoadFunc used by POCast 
> for lineage. We could see "class not found" exception in some cases. Here is 
> a sample script:
> {code}
> a = load '1.txt' using CustomLoader() as (a0, a1, a2);
> b = group a by a0;
> c = foreach b generate flatten(a);
> d = order c by a0;
> e = foreach d generate(a1+a2);  -- use lineage
> dump e;
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1336) Optimize POStore serialized into JobConf

2010-03-30 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851518#action_12851518
 ] 

Richard Ding commented on PIG-1336:
---

In the multi-store case, the parent plan can be set in POStore and should also 
be unlinked.
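The footprint problem being discussed comes from Java serialization following every reference: a store operator that still points into the plan drags the whole chain into its serialized form. A self-contained sketch (illustrative names, not Pig's POStore):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Demonstrates why unlinking an operator before serializing it into the
// JobConf shrinks the payload: writeObject walks the whole object graph.
public class SerializedFootprint {
    static class Operator implements Serializable {
        private static final long serialVersionUID = 1L;
        Operator predecessor;            // link back into the physical plan
        byte[] state = new byte[1024];   // stand-in for per-operator state
    }

    static int serializedSize(Object o) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(o);
            }
            return bos.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Build a chain of 10 operators ending in a "store".
        Operator store = new Operator();
        Operator cur = store;
        for (int i = 0; i < 9; i++) {
            cur.predecessor = new Operator();
            cur = cur.predecessor;
        }
        int linked = serializedSize(store);   // whole chain comes along

        store.predecessor = null;             // "unlink" before serializing
        int unlinked = serializedSize(store); // just the one operator

        System.out.println(linked + " bytes linked vs " + unlinked + " bytes unlinked");
    }
}
```

Unlinking the parent plan (and, per the comment above, the data-source link in the multi-store case) before serialization cuts the payload to a single operator's state.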

> Optimize POStore serialized into JobConf
> 
>
> Key: PIG-1336
> URL: https://issues.apache.org/jira/browse/PIG-1336
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Attachments: PIG-1336-1.patch, PIG-1336-2.patch, PIG-1336-3.patch
>
>
> We serialize POStore too early in the JobControlCompiler. At that time, the 
> storeFunc has unconstrained links to other operators; in the worst case, it 
> will chain in the whole physical plan. Also, in the multi-store case, POStore 
> has a link to its data source, which is not needed and increases the 
> footprint of the serialized POStore. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1310) ISO Date UDFs: Conversion, Truncation and Date Math

2010-03-30 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-1310:
---

Assignee: Russell Jurney

> ISO Date UDFs: Conversion, Truncation and Date Math
> --
>
> Key: PIG-1310
> URL: https://issues.apache.org/jira/browse/PIG-1310
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>Reporter: Russell Jurney
>Assignee: Russell Jurney
> Fix For: 0.7.0
>
> Attachments: joda-mavenstuff.diff, pass.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> I've written UDFs to handle loading unix times, datemonth values and ISO 8601 
> formatted date strings, and working with them as ISO datetimes using jodatime.
> The working code is here: 
> http://github.com/rjurney/oink/tree/master/src/java/oink/udf/isodate/
> It needs to be documented and tests added, and a couple UDFs are missing, but 
> these work if you REGISTER the jodatime jar in your script.  Hopefully I can 
> get this stuff in piggybank before someone else writes it this time :)  The 
> rounding also may not be performant, but the code works.
> Ultimately I'd also like to enable support for ISO 8601 durations.  Someone 
> slap me if this isn't done soon, it is not much work and this should help 
> everyone working with time series.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1338) Pig should exclude hadoop conf in local mode

2010-03-30 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851489#action_12851489
 ] 

Pradeep Kamath commented on PIG-1338:
-

I haven't done a full review but had a comment on one of the changes which is 
pretty important:
{noformat}
Index: src/org/apache/pig/backend/hadoop/datastorage/ConfigurationUtil.java
===
--- src/org/apache/pig/backend/hadoop/datastorage/ConfigurationUtil.java
(revision 928370)
+++ src/org/apache/pig/backend/hadoop/datastorage/ConfigurationUtil.java
(working copy)
@@ -30,7 +30,9 @@
 
 public static Configuration toConfiguration(Properties properties) {
 assert properties != null;
-final Configuration config = new Configuration();
+final Configuration config = new Configuration(false);
+config.addResource("core-default.xml");
+config.addResource("mapred-default.xml");
 final Enumeration iter = properties.keys();
 while (iter.hasMoreElements()) {
 final String key = (String) iter.nextElement()
{noformat}

Looking at the Configuration class's implementation I found the following code:

{noformat}

 static{
//print deprecation warning if hadoop-site.xml is found in classpath
ClassLoader cL = Thread.currentThread().getContextClassLoader();
if (cL == null) {
  cL = Configuration.class.getClassLoader();
}
if(cL.getResource("hadoop-site.xml")!=null) {
  LOG.warn("DEPRECATED: hadoop-site.xml found in the classpath. " +
  "Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, "
  + "mapred-site.xml and hdfs-site.xml to override properties of " +
  "core-default.xml, mapred-default.xml and hdfs-default.xml " +
  "respectively");
}
addDefaultResource("core-default.xml");
addDefaultResource("core-site.xml");
  }

  private void loadResources(Properties properties,
 ArrayList resources,
 boolean quiet) {
if(loadDefaults) {
  for (String resource : defaultResources) {
loadResource(properties, resource, quiet);
  }

  //support the hadoop-site.xml as a deprecated case
  if(getResource("hadoop-site.xml")!=null) {
loadResource(properties, "hadoop-site.xml", quiet);
  }
}

for (Object resource : resources) {
  loadResource(properties, resource, quiet);
}
  }

{noformat}

There are two questions about the code in Configuration vs. the change in this 
patch:
1) The patch adds core-default.xml and mapred-default.xml as resources, while 
Configuration adds core-default.xml and core-site.xml by default.
2) The patch does not consider hadoop-site.xml, while Configuration does - so 
if a hadoop 20.x cluster is installed with hadoop-site.xml configured and 
without the other .xml files (like core-default.xml etc.), wouldn't pig fail 
to get the cluster config information?
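To make the ordering question concrete, here is a tiny plain-Java model (no Hadoop dependency) of the resource-loading behavior quoted above: default resources load first, then the deprecated hadoop-site.xml fallback, then explicitly added resources, with later resources overriding earlier ones. The resource names and property values are hypothetical, chosen only to show how the patch's `new Configuration(false)` path could miss a cluster configured solely via hadoop-site.xml:

```java
import java.util.*;

public class ResourceOrderSketch {
    // Model of Configuration's load order: defaults, then the deprecated
    // hadoop-site.xml fallback (if present on the classpath), then
    // explicitly added resources; later resources override earlier ones.
    static Map<String, String> load(List<String> defaults,
                                    boolean hadoopSiteOnClasspath,
                                    List<String> explicit,
                                    Map<String, Map<String, String>> files) {
        List<String> order = new ArrayList<>(defaults);
        if (hadoopSiteOnClasspath) order.add("hadoop-site.xml");
        order.addAll(explicit);
        Map<String, String> conf = new HashMap<>();
        for (String r : order) {
            Map<String, String> props = files.get(r);
            if (props != null) conf.putAll(props);
        }
        return conf;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> files = new HashMap<>();
        files.put("core-default.xml", Map.of("fs.default.name", "file:///"));
        files.put("hadoop-site.xml", Map.of("fs.default.name", "hdfs://nn:8020"));

        // Patch's path: no defaults, no hadoop-site fallback, explicit adds only.
        System.out.println(load(List.of(), false,
                List.of("core-default.xml"), files).get("fs.default.name"));
        // Stock Configuration: defaults plus the hadoop-site.xml fallback.
        System.out.println(load(List.of("core-default.xml"), true,
                List.of(), files).get("fs.default.name"));
    }
}
```

Under this model the first call never sees the cluster address from hadoop-site.xml, which is exactly the concern in question 2.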

> Pig should exclude hadoop conf in local mode
> 
>
> Key: PIG-1338
> URL: https://issues.apache.org/jira/browse/PIG-1338
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Attachments: PIG-1338-1.patch, PIG-1338-2.patch
>
>
> Currently, the behavior for hadoop conf lookup is:
> * in local mode, if there is a hadoop conf, bail out; if there is no hadoop 
> conf, launch in local mode
> * in hadoop mode, if there is a hadoop conf, use this conf to launch Pig; if 
> not, still launch without warning, but much functionality will go wrong
> We should change this to a more intuitive behavior:
> * in local mode, always launch Pig in local mode
> * in hadoop mode, if there is a hadoop conf, use this conf to launch Pig; if 
> not, bail out with a meaningful message

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1337) Need a way to pass distributed cache configuration information to hadoop backend in Pig's LoadFunc

2010-03-30 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851479#action_12851479
 ] 

Pradeep Kamath commented on PIG-1337:
-

My worry about doing these kinds of job-related updates on the Job in 
getSchema() is that getSchema() is currently designed to be a pure getter 
without any indirect "set" side effects - this is noted in the javadoc:

{noformat}
/**
 * Get a schema for the data to be loaded.
 * @param location Location as returned by
 * {@link LoadFunc#relativeToAbsolutePath(String, org.apache.hadoop.fs.Path)}
 * @param job The {@link Job} object - this should be used only to obtain
 * cluster properties through {@link Job#getConfiguration()} and not to
 * set/query any runtime job information.
...
{noformat}

We should be careful about opening this up to allow "set" capability - 
something to consider before designing a fix for this issue.

> Need a way to pass distributed cache configuration information to hadoop 
> backend in Pig's LoadFunc
> --
>
> Key: PIG-1337
> URL: https://issues.apache.org/jira/browse/PIG-1337
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.6.0
>Reporter: Chao Wang
> Fix For: 0.8.0
>
>
> The Zebra storage layer needs to use the distributed cache to reduce name 
> node load during job runs.
> To do this, Zebra needs to set up distributed-cache-related configuration 
> information in TableLoader (which extends Pig's LoadFunc).
> It is doing this within getSchema(conf). The problem is that the conf object 
> here is not the one that is serialized to the map/reduce backend. As such, 
> the distributed cache is not set up properly.
> To work around this problem, Pig's LoadFunc needs to provide a way for us to 
> set up distributed cache information in a conf object, where that conf 
> object is the one used by the map/reduce backend.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1338) Pig should exclude hadoop conf in local mode

2010-03-30 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1338:


Status: Open  (was: Patch Available)

> Pig should exclude hadoop conf in local mode
> 
>
> Key: PIG-1338
> URL: https://issues.apache.org/jira/browse/PIG-1338
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Attachments: PIG-1338-1.patch, PIG-1338-2.patch
>
>
> Currently, the behavior for hadoop conf lookup is:
> * in local mode, if there is a hadoop conf, bail out; if there is no hadoop 
> conf, launch in local mode
> * in hadoop mode, if there is a hadoop conf, use this conf to launch Pig; if 
> not, still launch without warning, but much functionality will go wrong
> We should change this to a more intuitive behavior:
> * in local mode, always launch Pig in local mode
> * in hadoop mode, if there is a hadoop conf, use this conf to launch Pig; if 
> not, bail out with a meaningful message

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1338) Pig should exclude hadoop conf in local mode

2010-03-30 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1338:


Status: Patch Available  (was: Open)

> Pig should exclude hadoop conf in local mode
> 
>
> Key: PIG-1338
> URL: https://issues.apache.org/jira/browse/PIG-1338
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Attachments: PIG-1338-1.patch, PIG-1338-2.patch
>
>
> Currently, the behavior for hadoop conf lookup is:
> * in local mode, if there is a hadoop conf, bail out; if there is no hadoop 
> conf, launch in local mode
> * in hadoop mode, if there is a hadoop conf, use this conf to launch Pig; if 
> not, still launch without warning, but much functionality will go wrong
> We should change this to a more intuitive behavior:
> * in local mode, always launch Pig in local mode
> * in hadoop mode, if there is a hadoop conf, use this conf to launch Pig; if 
> not, bail out with a meaningful message

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1338) Pig should exclude hadoop conf in local mode

2010-03-30 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1338:


Attachment: PIG-1338-2.patch

> Pig should exclude hadoop conf in local mode
> 
>
> Key: PIG-1338
> URL: https://issues.apache.org/jira/browse/PIG-1338
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Attachments: PIG-1338-1.patch, PIG-1338-2.patch
>
>
> Currently, the behavior for hadoop conf lookup is:
> * in local mode, if there is a hadoop conf, bail out; if there is no hadoop 
> conf, launch in local mode
> * in hadoop mode, if there is a hadoop conf, use this conf to launch Pig; if 
> not, still launch without warning, but much functionality will go wrong
> We should change this to a more intuitive behavior:
> * in local mode, always launch Pig in local mode
> * in hadoop mode, if there is a hadoop conf, use this conf to launch Pig; if 
> not, bail out with a meaningful message

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db

2010-03-30 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851455#action_12851455
 ] 

Olga Natkovich commented on PIG-1229:
-

Since we have already branched, this feature will not go into the 0.7.0 branch 
but will instead be committed to trunk and released as part of the 0.8.0 
release. I think this patch should work just fine against trunk since we have 
not deviated much.

> allow pig to write output into a JDBC db
> 
>
> Key: PIG-1229
> URL: https://issues.apache.org/jira/browse/PIG-1229
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>Reporter: Ian Holsman
>Assignee: Ankur
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: jira-1229-v2.patch
>
>
> UDF to store data into a DB

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db

2010-03-30 Thread Ankur (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur updated PIG-1229:
---

Status: In Progress  (was: Patch Available)

> allow pig to write output into a JDBC db
> 
>
> Key: PIG-1229
> URL: https://issues.apache.org/jira/browse/PIG-1229
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>Reporter: Ian Holsman
>Assignee: Ankur
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: jira-1229-v2.patch
>
>
> UDF to store data into a DB

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db

2010-03-30 Thread Ankur (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur updated PIG-1229:
---

Status: Patch Available  (was: In Progress)

> allow pig to write output into a JDBC db
> 
>
> Key: PIG-1229
> URL: https://issues.apache.org/jira/browse/PIG-1229
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>Reporter: Ian Holsman
>Assignee: Ankur
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: jira-1229-v2.patch
>
>
> UDF to store data into a DB

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db

2010-03-30 Thread Ankur (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur updated PIG-1229:
---

Attachment: (was: jira-1229.patch)

> allow pig to write output into a JDBC db
> 
>
> Key: PIG-1229
> URL: https://issues.apache.org/jira/browse/PIG-1229
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>Reporter: Ian Holsman
>Assignee: Ankur
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: jira-1229-v2.patch
>
>
> UDF to store data into a DB

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db

2010-03-30 Thread Ankur (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur updated PIG-1229:
---

Attachment: jira-1229-v2.patch

Here is the updated patch that compiles against the pig 0.7 branch and 
implements the new load/store APIs.

Note: I haven't used hadoop's DBOutputFormat because the code has not yet been 
moved to o.p.h.mapreduce.lib, and hence there are compatibility issues.

> allow pig to write output into a JDBC db
> 
>
> Key: PIG-1229
> URL: https://issues.apache.org/jira/browse/PIG-1229
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>Reporter: Ian Holsman
>Assignee: Ankur
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: jira-1229-v2.patch, jira-1229.patch
>
>
> UDF to store data into a DB

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db

2010-03-30 Thread Ankur (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur updated PIG-1229:
---

Attachment: (was: hsqldb.jar)

> allow pig to write output into a JDBC db
> 
>
> Key: PIG-1229
> URL: https://issues.apache.org/jira/browse/PIG-1229
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>Reporter: Ian Holsman
>Assignee: Ankur
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: jira-1229-v2.patch, jira-1229.patch
>
>
> UDF to store data into a DB

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-03-30 Thread Gianmarco De Francisci Morales (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851453#action_12851453
 ] 

Gianmarco De Francisci Morales commented on PIG-1295:
-

Hi,
I have been reading the source code and the referenced PIG-1038 issue.

Avro integration is probably too big a project for GSoC, but implementing the 
tuple binary comparator seems doable.
I will write a proposal; any advice on it?

My idea of the project's breakdown would be:

1. Identify the cases that can be optimized and the appropriate visitor for 
those.
2. Write a unit test for this optimization.
3. Implement the comparator knowing the data types of the tuple.
4. Write a second unit test with different types.
5. Write the logic to extract tuple boundaries from schema information (I 
suppose this optimization is possible only if the schema is known).
6. Try to extend it to the general case of a complex data type as the 
secondary key.

Thoughts?
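The breakdown above could start from a tiny raw-bytes prototype. The sketch below is a hypothetical illustration of the core idea only: compare the serialized main key, then the serialized secondary key, directly on the byte buffers, without instantiating any objects. The fixed two-int layout is an assumption for illustration and is not Pig's actual serialized tuple format:

```java
import java.io.*;

public class RawPairComparator {
    // Serialize a (mainKey, secondaryKey) pair as two big-endian ints.
    // (Hypothetical layout; Pig's real tuple encoding is more complex.)
    static byte[] serialize(int main, int secondary) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(main);
        out.writeInt(secondary);
        return bos.toByteArray();
    }

    // Read a big-endian int at the given offset, flipping the sign bit so
    // that unsigned comparison of the result orders signed ints correctly.
    static long orderedIntAt(byte[] b, int off) {
        int v = ((b[off] & 0xff) << 24) | ((b[off + 1] & 0xff) << 16)
              | ((b[off + 2] & 0xff) << 8) | (b[off + 3] & 0xff);
        return (v ^ 0x80000000L) & 0xffffffffL;
    }

    // Compare on the main key first, then the secondary key, straight from
    // the serialized bytes -- no tuple is ever deserialized.
    static int compare(byte[] a, byte[] b) {
        int c = Long.compare(orderedIntAt(a, 0), orderedIntAt(b, 0));
        return c != 0 ? c : Long.compare(orderedIntAt(a, 4), orderedIntAt(b, 4));
    }

    public static void main(String[] args) throws IOException {
        // Equal main keys fall through to the secondary key.
        System.out.println(compare(serialize(1, 2), serialize(1, 3)) < 0); // true
    }
}
```

The sign-bit flip is the classic trick that makes raw byte-wise (unsigned) comparison agree with signed integer order; a real implementation would instead walk the tuple encoding to find the main/secondary key boundary, as the issue describes.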

> Binary comparator for secondary sort
> 
>
> Key: PIG-1295
> URL: https://issues.apache.org/jira/browse/PIG-1295
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
>
> When the hadoop framework does the sorting, it will try to use the binary 
> version of the comparator if available. The benefit of a binary comparator is 
> that we do not need to instantiate the object before we compare. We saw a 
> ~30% speedup after we switched to a binary comparator. Currently, Pig uses a 
> binary comparator in the following cases:
> 1. When the semantics of the order don't matter. For example, in distinct, 
> we need to do a sort in order to filter out duplicate values; however, we do 
> not care how the comparator sorts keys. Group-by also shares this 
> characteristic. In this case, we rely on hadoop's default binary comparator.
> 2. When the semantics of the order matter, but the key is of a simple type. 
> In this case, we have implementations for simple types, such as integer, 
> long, float, chararray, databytearray, string.
> However, if the key is a tuple and the sort semantics matter, we do not have 
> a binary comparator implementation. This especially matters when we switch 
> to secondary sort. In secondary sort, we convert the inner sort of a nested 
> foreach into the secondary key and rely on hadoop to sort on both the main 
> key and the secondary key. The sorting key becomes a two-item tuple. Since 
> the secondary key is the sorting key of the nested foreach, the sorting 
> semantics matter. It turns out we do not have a binary comparator once we 
> use secondary sort, and we see a significant slowdown.
> A binary comparator for tuples should be doable once we understand the 
> binary structure of the serialized tuple. We can focus on the most common 
> use case first, which is "group by" followed by a nested sort. In this case, 
> we will use secondary sort. The semantics of the first key do not matter, 
> but the semantics of the secondary key do. We need to identify the boundary 
> between the main key and the secondary key in the binary tuple buffer 
> without instantiating the tuple itself. Then, if the first keys are equal, 
> we use a binary comparator to compare the secondary keys. The secondary key 
> can also be a complex data type, but for the first step we focus on a simple 
> secondary key, which is the most common use case.
> We mark this issue as a candidate project for the "Google Summer of Code 
> 2010" program.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1336) Optimize POStore serialized into JobConf

2010-03-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851404#action_12851404
 ] 

Hadoop QA commented on PIG-1336:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12440184/PIG-1336-3.patch
  against trunk revision 928950.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no tests are needed for this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/259/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/259/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/259/console

This message is automatically generated.

> Optimize POStore serialized into JobConf
> 
>
> Key: PIG-1336
> URL: https://issues.apache.org/jira/browse/PIG-1336
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Attachments: PIG-1336-1.patch, PIG-1336-2.patch, PIG-1336-3.patch
>
>
> We serialize POStore too early in the JobControlCompiler. At that time, the 
> storeFunc has unconstrained links to other operators; in the worst case, it 
> will chain in the whole physical plan. Also, in the multi-store case, 
> POStore has a link to its data source, which is not needed and increases 
> the footprint of the serialized POStore.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1338) Pig should exclude hadoop conf in local mode

2010-03-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851338#action_12851338
 ] 

Hadoop QA commented on PIG-1338:


+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12440177/PIG-1338-1.patch
  against trunk revision 928950.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 79 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/270/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/270/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/270/console

This message is automatically generated.

> Pig should exclude hadoop conf in local mode
> 
>
> Key: PIG-1338
> URL: https://issues.apache.org/jira/browse/PIG-1338
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Attachments: PIG-1338-1.patch
>
>
> Currently, the behavior for hadoop conf lookup is:
> * in local mode, if there is a hadoop conf, bail out; if there is no hadoop 
> conf, launch in local mode
> * in hadoop mode, if there is a hadoop conf, use this conf to launch Pig; if 
> not, still launch without warning, but much functionality will go wrong
> We should change this to a more intuitive behavior:
> * in local mode, always launch Pig in local mode
> * in hadoop mode, if there is a hadoop conf, use this conf to launch Pig; if 
> not, bail out with a meaningful message

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1309) Map-side Cogroup

2010-03-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851323#action_12851323
 ] 

Hadoop QA commented on PIG-1309:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12440159/pig-1309_1.patch
  against trunk revision 928950.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 6 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

-1 javac.  The applied patch generated 88 javac compiler warnings (more 
than the trunk's current 87 warnings).

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/258/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/258/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/258/console

This message is automatically generated.

> Map-side Cogroup
> 
>
> Key: PIG-1309
> URL: https://issues.apache.org/jira/browse/PIG-1309
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Attachments: mapsideCogrp.patch, pig-1309_1.patch
>
>
> In our never-ending quest to make Pig go faster, we want to parallelize as 
> many relational operations as possible. It is already possible to do 
> group-by (PIG-984) and joins (PIG-845, PIG-554) purely map-side in Pig. 
> This jira is to add a map-side implementation of cogroup in Pig. Details to 
> follow.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-200) Pig Performance Benchmarks

2010-03-30 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851293#action_12851293
 ] 

Daniel Dai commented on PIG-200:


Hi Duncan,
I tried it and didn't see any errors. Are you using the pig 0.6 release? What 
error message did you see?

> Pig Performance Benchmarks
> --
>
> Key: PIG-200
> URL: https://issues.apache.org/jira/browse/PIG-200
> Project: Pig
>  Issue Type: Task
>Reporter: Amir Youssefi
>Assignee: Alan Gates
> Fix For: 0.2.0
>
> Attachments: generate_data.pl, perf-0.6.patch, perf.hadoop.patch, 
> perf.patch
>
>
> To benchmark Pig performance, we need a TPC-H-like large data set plus a 
> script collection. This is used to compare different Pig releases, and Pig 
> against other systems (e.g. Pig + Hadoop vs. Hadoop only).
> Here is the wiki for small tests: http://wiki.apache.org/pig/PigPerformance
> I am currently running long-running Pig scripts over data sets on the order 
> of tens of TBs. The next step is hundreds of TBs.
> We need an open large data set (open-source scripts which generate the 
> data set) and detailed scripts for important operations such as ORDER, 
> AGGREGATION, etc.
> We can call these the Pig Workouts: Cardio (short processing), Marathon 
> (long-running scripts) and Triathlon (mix).
> I will update this JIRA with more details of current activities soon.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.