[jira] [Updated] (PIG-2034) Pig client uses fs.default.name as provided from the JobTracker instead of local value

2011-05-03 Thread Bill Graham (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Graham updated PIG-2034:
-

Attachment: pig_1304465896181.log

Attached is an error log showing the exception raised when the client cannot 
resolve the server's {{fs.default.name}} while running from the Pig shell.

The script that produced it:

{code}
raw = LOAD '/user/bob/some_file.txt' AS (a, b, c, d);
A = LIMIT raw 10;
dump A;
{code}

> Pig client uses fs.default.name as provided from the JobTracker instead of 
> local value
> --
>
> Key: PIG-2034
> URL: https://issues.apache.org/jira/browse/PIG-2034
> Project: Pig
>  Issue Type: Bug
>Reporter: Bill Graham
> Attachments: pig_1304465896181.log
>
>
> When submitting a Pig job, the client uses the {{fs.default.name}} supplied 
> to it by the JobTracker (typically via core-site.xml on the master) during 
> the staging phase. After that, the client uses the {{fs.default.name}} 
> from its local configs. This seems like a bug to me. The expected behavior 
> would be to always use the local value.
> I found this bug when the server configs were set to not use an FQDN for 
> {{fs.default.name}}. This caused the client to fail because it didn't have 
> the same default DNS domain.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (PIG-2034) Pig client uses fs.default.name as provided from the JobTracker instead of local value

2011-05-03 Thread Bill Graham (JIRA)
Pig client uses fs.default.name as provided from the JobTracker instead of 
local value
--

 Key: PIG-2034
 URL: https://issues.apache.org/jira/browse/PIG-2034
 Project: Pig
  Issue Type: Bug
Reporter: Bill Graham


When submitting a Pig job, the client uses the {{fs.default.name}} supplied to 
it by the JobTracker (typically via core-site.xml on the master) during the 
staging phase. After that, the client uses the {{fs.default.name}} from its 
local configs. This seems like a bug to me. The expected behavior would be to 
always use the local value.

I found this bug when the server configs were set to not use an FQDN for 
{{fs.default.name}}. This caused the client to fail because it didn't have the 
same default DNS domain.
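
To illustrate why a short host name in {{fs.default.name}} can behave differently on each machine, here is a minimal Java sketch. The class FsNameSketch, its method name, and the example URIs are hypothetical and not taken from Pig or Hadoop; the sketch only checks whether the host part of the URI contains a dot.

```java
import java.net.URI;

// Hypothetical helper, for illustration only.
final class FsNameSketch {
    // Returns true when the host part has no dot, i.e. it is a short name that
    // each machine would expand using its own DNS search domain.
    static boolean usesShortHostName(String fsDefaultName) {
        String host = URI.create(fsDefaultName).getHost();
        return host != null && !host.contains(".");
    }
}
```

A value such as hdfs://namenode:8020 would be flagged, while hdfs://namenode.example.com:8020 would not. A client in a different DNS domain can only resolve the second form reliably.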



[jira] [Commented] (PIG-2032) Penny related multiple improvements/issues.

2011-05-03 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028575#comment-13028575
 ] 

Mridul Muralidharan commented on PIG-2032:
--

Please note that the changes are not to Pig but to the Penny codebase alone, 
and without them it fails for non-trivial use cases. I will leave it to Ben 
though ...


> Penny related multiple improvements/issues.
> ---
>
> Key: PIG-2032
> URL: https://issues.apache.org/jira/browse/PIG-2032
> Project: Pig
>  Issue Type: Bug
>  Components: tools
>Affects Versions: 0.9.0
> Environment: All env
>Reporter: Mridul Muralidharan
> Attachments: diffs
>
>
> This is from a mail I sent Ben and Chris in feb
> ---
> Essentially the changes are:
> a) Allow for progress reporting so that long-running agent API invocations 
> don't result in killing the task.
> b) Decouple creation from initialization of the monitor agent harness.
> c) I increased the threads for the coordinator since ibis has slightly 
> I/O-bound coordinator messages to be processed, but this might not be 
> generally relevant.
> Maybe some way to configure it would be good? (The default value was 
> observed to be exhausted really fast for some invocations of ibis!)
> d) Handle shutdown more gracefully on the agent side: monitor and wait for 
> the relevant futures to complete.
> Note that the code for this is slightly sensitive in terms of how it is 
> written since there are idioms in netty which it supports (immediate 
> invocation vs deferred invocation on future completion).



[jira] [Commented] (PIG-1866) Dereference a bag within a tuple does not work

2011-05-03 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028562#comment-13028562
 ] 

Dmitriy V. Ryaboy commented on PIG-1866:


Daniel, is there any way to get a bag inside a tuple in 0.8? This ticket seems 
to indicate that this is not possible without your patch.

> Dereference a bag within a tuple does not work
> --
>
> Key: PIG-1866
> URL: https://issues.apache.org/jira/browse/PIG-1866
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.9.0
>
> Attachments: PIG-1866-1.patch, PIG-1866-2.patch, PIG-1866-3.patch
>
>
> The following script does not work (both in new and old logical plan):
> {code}
> a = load '1.txt' as (t : tuple(i: int, b1: bag { b_tuple : tuple ( b_str: 
> chararray) }));
> b = foreach a generate t.b1;
> dump b;
> {code}
> 1.txt:
> (1,{(one),(two)})
> Error from old logical plan:
> java.lang.ClassCastException: org.apache.pig.data.BinSedesTuple cannot be 
> cast to org.apache.pig.data.DataBag
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:482)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:197)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:480)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:197)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:339)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:237)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:232)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> Error from new logical plan:
> java.lang.NullPointerException
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.consumeInputBag(POProject.java:246)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:200)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:339)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:237)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:232)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> If we change "b = foreach a generate t.b1;" to "b = foreach a generate t.i;", 
> it works fine; only dereferencing a bag fails.



[jira] [Assigned] (PIG-1942) script UDF (jython) should utilize the intended output schema to more directly convert Py objects to Pig objects

2011-05-03 Thread Woody Anderson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Woody Anderson reassigned PIG-1942:
---

Assignee: Woody Anderson

> script UDF (jython) should utilize the intended output schema to more 
> directly convert Py objects to Pig objects
> 
>
> Key: PIG-1942
> URL: https://issues.apache.org/jira/browse/PIG-1942
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.8.0, 0.9.0
>Reporter: Woody Anderson
>Assignee: Woody Anderson
>Priority: Minor
>  Labels: python, schema, udf
> Fix For: 0.10
>
> Attachments: 1942.patch, 1942_with_junit.patch
>
>
> from https://issues.apache.org/jira/browse/PIG-1824
> {code}
> import re
> @outputSchema("y:bag{t:tuple(word:chararray)}")
> def strsplittobag(content,regex):
>     return re.compile(regex).split(content)
> {code}
> does not work because split returns a list of strings. However, the output 
> schema is known, and it would be quite simple to implicitly promote each 
> string element to a single-field tuple.
> Also, a list/array/tuple/set etc. are all equally convertible to a bag, and 
> list/array/tuple are equally convertible to a Tuple; this conversion can be 
> done in a much less rigid way with the use of the schema.
> This allows much easier re-use of existing Python code and less memory 
> overhead from creating intermediate objects for re-conversion.
> I wrote the code to do this a while back as part of my version of the 
> Jython script framework; I'll isolate that and attach it.
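
The promotion described above (each scalar element becomes a single-field tuple inside a bag) can be sketched in Java as follows. This is not the attached patch: plain java.util lists stand in for Pig's DataBag and Tuple so the example has no Pig dependency, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: the outer list plays the role of a bag,
// each inner list plays the role of a tuple.
final class BagPromotionSketch {
    static List<List<Object>> toBagOfTuples(List<?> items) {
        List<List<Object>> bag = new ArrayList<List<Object>>();
        for (Object item : items) {
            if (item instanceof List) {
                // Already tuple-like: copy its fields as they are.
                bag.add(new ArrayList<Object>((List<?>) item));
            } else {
                // Scalar: promote it to a single-field tuple.
                bag.add(Collections.singletonList(item));
            }
        }
        return bag;
    }
}
```

With this shape, the list of strings returned by split would map onto a schema such as bag{t:tuple(word:chararray)} without the UDF author building tuples by hand.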



[jira] [Resolved] (PIG-2028) Speed up multiquery unit tests

2011-05-03 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding resolved PIG-2028.
---

  Resolution: Fixed
Hadoop Flags: [Reviewed]

Patch committed to trunk and 0.9 branch.

> Speed up multiquery unit tests 
> ---
>
> Key: PIG-2028
> URL: https://issues.apache.org/jira/browse/PIG-2028
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.9.0
>Reporter: Richard Ding
>Assignee: Richard Ding
> Fix For: 0.9.0
>
> Attachments: PIG-2028.patch, PIG-2028_1.patch
>
>
> Switch TestMultiQueryBasic and TestMultiQuery to use LOCAL mode. The results 
> on my laptop:
> Using Mini Cluster:
> TestMultiQueryBasic: 17 min 17 sec
> TestMultiQuery:  23 min 2 sec
> Using LOCAL mode:
> TestMultiQueryBasic: 4 min 17 sec
> TestMultiQuery:  5 min 51 sec



[jira] [Commented] (PIG-2028) Speed up multiquery unit tests

2011-05-03 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028509#comment-13028509
 ] 

Thejas M Nair commented on PIG-2028:


+1

> Speed up multiquery unit tests 
> ---
>
> Key: PIG-2028
> URL: https://issues.apache.org/jira/browse/PIG-2028
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.9.0
>Reporter: Richard Ding
>Assignee: Richard Ding
> Fix For: 0.9.0
>
> Attachments: PIG-2028.patch, PIG-2028_1.patch
>
>
> Switch TestMultiQueryBasic and TestMultiQuery to use LOCAL mode. The results 
> on my laptop:
> Using Mini Cluster:
> TestMultiQueryBasic: 17 min 17 sec
> TestMultiQuery:  23 min 2 sec
> Using LOCAL mode:
> TestMultiQueryBasic: 4 min 17 sec
> TestMultiQuery:  5 min 51 sec



[jira] [Created] (PIG-2033) Pig returns success for the failed Pig script

2011-05-03 Thread Richard Ding (JIRA)
Pig returns success for the failed Pig script


 Key: PIG-2033
 URL: https://issues.apache.org/jira/browse/PIG-2033
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.1, 0.9.0


Pig returns success when a Pig script fails but the count of failed MR jobs is 
zero. 







[jira] [Commented] (PIG-2028) Speed up multiquery unit tests

2011-05-03 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028485#comment-13028485
 ] 

Richard Ding commented on PIG-2028:
---

Simplified the test cases, using Util.createLocalInputFile whenever possible.

> Speed up multiquery unit tests 
> ---
>
> Key: PIG-2028
> URL: https://issues.apache.org/jira/browse/PIG-2028
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.9.0
>Reporter: Richard Ding
>Assignee: Richard Ding
> Fix For: 0.9.0
>
> Attachments: PIG-2028.patch, PIG-2028_1.patch
>
>
> Switch TestMultiQueryBasic and TestMultiQuery to use LOCAL mode. The results 
> on my laptop:
> Using Mini Cluster:
> TestMultiQueryBasic: 17 min 17 sec
> TestMultiQuery:  23 min 2 sec
> Using LOCAL mode:
> TestMultiQueryBasic: 4 min 17 sec
> TestMultiQuery:  5 min 51 sec



[jira] [Updated] (PIG-2028) Speed up multiquery unit tests

2011-05-03 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-2028:
--

Attachment: PIG-2028_1.patch

> Speed up multiquery unit tests 
> ---
>
> Key: PIG-2028
> URL: https://issues.apache.org/jira/browse/PIG-2028
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.9.0
>Reporter: Richard Ding
>Assignee: Richard Ding
> Fix For: 0.9.0
>
> Attachments: PIG-2028.patch, PIG-2028_1.patch
>
>
> Switch TestMultiQueryBasic and TestMultiQuery to use LOCAL mode. The results 
> on my laptop:
> Using Mini Cluster:
> TestMultiQueryBasic: 17 min 17 sec
> TestMultiQuery:  23 min 2 sec
> Using LOCAL mode:
> TestMultiQueryBasic: 4 min 17 sec
> TestMultiQuery:  5 min 51 sec



[jira] [Commented] (PIG-2008) Cache outputFormat in HBaseStorage

2011-05-03 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028453#comment-13028453
 ] 

Alan Gates commented on PIG-2008:
-

No objections.  Unit tests pass.  Running test-patch now.

> Cache outputFormat in HBaseStorage
> --
>
> Key: PIG-2008
> URL: https://issues.apache.org/jira/browse/PIG-2008
> Project: Pig
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.8.0
>Reporter: Jacob Perkins
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: patch_file.txt
>
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> getOutputFormat gets called more than one time in a StoreFunc. Modify 
> HBaseStorage to only create an instance of TableOutputFormat one time (since 
> it creates a new HTable connection each time) as opposed to multiple times 
> like it does now.
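
The caching idea in the description can be sketched as below. This is only an illustration, not the attached patch: ExpensiveOutputFormat is a hypothetical stand-in for TableOutputFormat (so the sketch needs no HBase dependency), and CachingStoreFuncSketch is a hypothetical stand-in for the store function.

```java
// Hypothetical stand-in for an output format that is costly to construct
// (in the ticket, each TableOutputFormat opens a new HTable connection).
final class ExpensiveOutputFormat {
    static int instancesCreated = 0;

    ExpensiveOutputFormat() {
        instancesCreated++;
    }
}

// Hypothetical store function; only the caching pattern is the point here.
final class CachingStoreFuncSketch {
    // Created on first use and reused on every later call.
    private ExpensiveOutputFormat outputFormat;

    ExpensiveOutputFormat getOutputFormat() {
        if (outputFormat == null) {
            outputFormat = new ExpensiveOutputFormat();
        }
        return outputFormat;
    }
}
```

Because the field is per instance and unsynchronized, this sketch assumes getOutputFormat is called from a single thread for a given store function instance.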



[jira] [Commented] (PIG-1821) UDFContext.getUDFProperties does not handle collisions in hashcode of udf classname (+ arg hashcodes)

2011-05-03 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028438#comment-13028438
 ] 

Richard Ding commented on PIG-1821:
---

+1

> UDFContext.getUDFProperties does not handle collisions in hashcode of udf 
> classname (+ arg hashcodes)
> -
>
> Key: PIG-1821
> URL: https://issues.apache.org/jira/browse/PIG-1821
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
> Fix For: 0.9.0
>
> Attachments: PIG-1821.1.patch, PIG-1821.2.patch
>
>
> In the code below, if generateKey() returns the same value for two UDFs, the 
> UDFs would end up sharing the Properties object. 
> {code}
> private HashMap<Integer, Properties> udfConfs = new HashMap<Integer, Properties>();
> public Properties getUDFProperties(Class c) {
>     Integer k = generateKey(c);
>     Properties p = udfConfs.get(k);
>     if (p == null) {
>         p = new Properties();
>         udfConfs.put(k, p);
>     }
>     return p;
> }
> private int generateKey(Class c) {
>     return c.getName().hashCode();
> }
> public Properties getUDFProperties(Class c, String[] args) {
>     Integer k = generateKey(c, args);
>     Properties p = udfConfs.get(k);
>     if (p == null) {
>         p = new Properties();
>         udfConfs.put(k, p);
>     }
>     return p;
> }
> private int generateKey(Class c, String[] args) {
>     int hc = c.getName().hashCode();
>     for (int i = 0; i < args.length; i++) {
>         hc <<= 1;
>         hc ^= args[i].hashCode();
>     }
>     return hc;
> }
> {code}
> To prevent this, a new class (say X) that can hold the classname and args 
> should be created, and instead of HashMap<Integer, Properties>, 
> HashMap<X, Properties> should be used. Then HashMap will deal with the 
> collisions.
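
The composite-key approach proposed above can be sketched as follows. This is an illustration under assumptions, not the committed patch: the names UdfKey and UdfPropertiesSketch are hypothetical. The key point is that the map key implements equals() as well as hashCode(), so two keys with the same hash code still map to different entries.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical composite key holding the class name and the constructor args.
final class UdfKey {
    private final String className;
    private final String[] args;

    UdfKey(Class<?> c, String[] args) {
        this.className = c.getName();
        this.args = (args == null) ? new String[0] : args.clone();
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof UdfKey)) return false;
        UdfKey other = (UdfKey) o;
        return className.equals(other.className) && Arrays.equals(args, other.args);
    }

    @Override
    public int hashCode() {
        // Collisions are still possible; equals() is what keeps entries apart.
        return 31 * className.hashCode() + Arrays.hashCode(args);
    }
}

// Hypothetical holder showing the map keyed by UdfKey instead of Integer.
final class UdfPropertiesSketch {
    private final Map<UdfKey, Properties> udfConfs = new HashMap<UdfKey, Properties>();

    Properties getUDFProperties(Class<?> c, String[] args) {
        UdfKey k = new UdfKey(c, args);
        Properties p = udfConfs.get(k);
        if (p == null) {
            p = new Properties();
            udfConfs.put(k, p);
        }
        return p;
    }
}
```

For example, the strings "Aa" and "BB" have the same String.hashCode(), so keys built from them collide on hash code, yet each still gets its own Properties object because equals() distinguishes them.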



[jira] [Updated] (PIG-1821) UDFContext.getUDFProperties does not handle collisions in hashcode of udf classname (+ arg hashcodes)

2011-05-03 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1821:
---

Attachment: PIG-1821.2.patch

PIG-1821.2.patch - Changes to incorporate comments from Richard. Passes unit 
tests and test-patch.


> UDFContext.getUDFProperties does not handle collisions in hashcode of udf 
> classname (+ arg hashcodes)
> -
>
> Key: PIG-1821
> URL: https://issues.apache.org/jira/browse/PIG-1821
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
> Fix For: 0.9.0
>
> Attachments: PIG-1821.1.patch, PIG-1821.2.patch
>
>
> In the code below, if generateKey() returns the same value for two UDFs, the 
> UDFs would end up sharing the Properties object. 
> {code}
> private HashMap<Integer, Properties> udfConfs = new HashMap<Integer, Properties>();
> public Properties getUDFProperties(Class c) {
>     Integer k = generateKey(c);
>     Properties p = udfConfs.get(k);
>     if (p == null) {
>         p = new Properties();
>         udfConfs.put(k, p);
>     }
>     return p;
> }
> private int generateKey(Class c) {
>     return c.getName().hashCode();
> }
> public Properties getUDFProperties(Class c, String[] args) {
>     Integer k = generateKey(c, args);
>     Properties p = udfConfs.get(k);
>     if (p == null) {
>         p = new Properties();
>         udfConfs.put(k, p);
>     }
>     return p;
> }
> private int generateKey(Class c, String[] args) {
>     int hc = c.getName().hashCode();
>     for (int i = 0; i < args.length; i++) {
>         hc <<= 1;
>         hc ^= args[i].hashCode();
>     }
>     return hc;
> }
> {code}
> To prevent this, a new class (say X) that can hold the classname and args 
> should be created, and instead of HashMap<Integer, Properties>, 
> HashMap<X, Properties> should be used. Then HashMap will deal with the 
> collisions.



[jira] [Commented] (PIG-2008) Cache outputFormat in HBaseStorage

2011-05-03 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028383#comment-13028383
 ] 

Dmitriy V. Ryaboy commented on PIG-2008:


Yeah looks good. Sorry I missed reviewing this earlier.

Alan, objections to committing this to the 0.9 branch? The HTable connection 
leak is a bug.

> Cache outputFormat in HBaseStorage
> --
>
> Key: PIG-2008
> URL: https://issues.apache.org/jira/browse/PIG-2008
> Project: Pig
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.8.0
>Reporter: Jacob Perkins
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: patch_file.txt
>
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> getOutputFormat gets called more than one time in a StoreFunc. Modify 
> HBaseStorage to only create an instance of TableOutputFormat one time (since 
> it creates a new HTable connection each time) as opposed to multiple times 
> like it does now.



[jira] [Commented] (PIG-2008) Cache outputFormat in HBaseStorage

2011-05-03 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028380#comment-13028380
 ] 

Alan Gates commented on PIG-2008:
-

Dmitriy, could you take a look at this?  It looks OK to me, but I'm not an 
HBase expert.  I'll run the unit tests and such on it.

> Cache outputFormat in HBaseStorage
> --
>
> Key: PIG-2008
> URL: https://issues.apache.org/jira/browse/PIG-2008
> Project: Pig
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.8.0
>Reporter: Jacob Perkins
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: patch_file.txt
>
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> getOutputFormat gets called more than one time in a StoreFunc. Modify 
> HBaseStorage to only create an instance of TableOutputFormat one time (since 
> it creates a new HTable connection each time) as opposed to multiple times 
> like it does now.



[jira] [Updated] (PIG-1973) UDFContext.getUDFContext has a thread race condition around its ThreadLocal

2011-05-03 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1973:


Fix Version/s: (was: 0.9.0)

> UDFContext.getUDFContext has a thread race condition around its ThreadLocal
> 
>
> Key: PIG-1973
> URL: https://issues.apache.org/jira/browse/PIG-1973
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.8.0, 0.9.0
>Reporter: Woody Anderson
>Assignee: Woody Anderson
>Priority: Minor
> Attachments: 1973.patch
>
>
> This probably isn't manifesting anywhere, but it's an incorrect use of the 
> ThreadLocal pattern.
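
The ticket does not show the offending code, so the following is only a sketch of one race-free way to use the ThreadLocal pattern, not the attached patch. The class name ContextSketch is hypothetical. The ThreadLocal is created once in a static final initializer and each thread's value comes from the initial-value supplier, so there is no check-then-set step that two threads could interleave.

```java
// Hypothetical per-thread context holder, for illustration only.
final class ContextSketch {
    // Created exactly once during class initialization; never reassigned.
    private static final ThreadLocal<ContextSketch> CURRENT =
            ThreadLocal.withInitial(ContextSketch::new);

    // Each thread lazily receives its own instance on first access.
    static ContextSketch get() {
        return CURRENT.get();
    }
}
```

ThreadLocal.withInitial requires Java 8 or later; on older JVMs the same effect can be had by subclassing ThreadLocal and overriding initialValue().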



[jira] [Updated] (PIG-1943) jython functions can use the @outputSchema decorator, but only if in the out script that is imported, we should add a built-in module pigdecorators.py so that developers ca

2011-05-03 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1943:


Fix Version/s: (was: 0.9.0)

> jython functions can use the @outputSchema decorator, but only if in the out 
> script that is imported, we should add a built-in module pigdecorators.py so 
> that developers can import and use them in lib scripts
> 
>
> Key: PIG-1943
> URL: https://issues.apache.org/jira/browse/PIG-1943
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.8.0, 0.9.0
>Reporter: Woody Anderson
>Assignee: Woody Anderson
>Priority: Minor
>  Labels: pig, python, schema, udf
>
> If you have Pig UDF functions in a Pig script and want to re-use them (e.g. 
> import them from another script), the decorators must be defined. They will 
> not be, due to scoping rules, so the decorators should be available via a 
> standard importable module that ships with the Jython framework (as we 
> already define the decorators as part of initializing the interpreter).
> This simply involves adding an appropriately named pigdecorators.py to the 
> classpath, so a dev can do:
> {quote}
> from pigdecorators import *
> @outputSchema("w:chararray")
> def word():
>     return 'word'
> {quote}
> This can be done currently in the primary script, but when 
> https://issues.apache.org/jira/browse/PIG-1824 is completed, that script 
> would not import properly when used within another script.



[jira] [Commented] (PIG-1824) Support import modules in Jython UDF

2011-05-03 Thread Woody Anderson (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028366#comment-13028366
 ] 

Woody Anderson commented on PIG-1824:
-

Understood.
Adding that null check/throw etc. is just a change that is unrelated to this 
bug. I can bundle it in, since all the related lines of code are being changed 
by this bug anyway, but that's why I didn't do it originally.

I'll add a throw similar to the current impl of getScriptAsStream.

> Support import modules in Jython UDF
> 
>
> Key: PIG-1824
> URL: https://issues.apache.org/jira/browse/PIG-1824
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.8.0, 0.9.0
>Reporter: Richard Ding
>Assignee: Woody Anderson
> Fix For: 0.9.0
>
> Attachments: 1824.patch, 1824a.patch, 1824b.patch, 1824c.patch
>
>
> Currently, a Jython UDF script doesn't support the Jython import statement, 
> as in the following example:
> {code}
> #!/usr/bin/python
> import re
> @outputSchema("word:chararray")
> def resplit(content, regex, index):
>     return re.compile(regex).split(content)[index]
> {code}
> Can Pig automatically locate the Jython module file and ship it to the 
> backend? Or should we add a ship clause to let the user explicitly specify 
> the module to ship?



[jira] [Commented] (PIG-1824) Support import modules in Jython UDF

2011-05-03 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028362#comment-13028362
 ] 

Julien Le Dem commented on PIG-1824:


Hi Woody,
I had misread the code about automatic deletion. You're right it deletes only 
if it was created by Pig.

I understand the superfluous null check and the warning being somewhat 
incorrect. 
To me there should be either no null check in that case or throw some exception 
if null. This is about debug-ability of the code. If someone changes the 
behavior of getScriptAsStream() there should be an exception in your code at 
that point. Not somewhere else. It also helps with understanding the code so 
that the reader does not wonder why it does nothing when the stream is null 
(because it's never null. But then why do we check ? etc)

otherwise it looks good. Thanks!
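
The fail-fast option discussed above (throw at the point where the stream is unexpectedly null, rather than silently doing nothing) might look like the sketch below. The class ScriptLoaderSketch, the method openRequired, and the choice of IllegalStateException are assumptions for illustration; this is not the code from the patch or from getScriptAsStream.

```java
import java.io.InputStream;

// Hypothetical loader, for illustration only.
final class ScriptLoaderSketch {
    // Class.getResourceAsStream returns null when the resource is missing,
    // so the null case is turned into an exception right here.
    static InputStream openRequired(String resource) {
        InputStream is = ScriptLoaderSketch.class.getResourceAsStream(resource);
        if (is == null) {
            throw new IllegalStateException("script not found: " + resource);
        }
        return is;
    }
}
```

Callers then never need their own null check, and a later change in lookup behavior surfaces as an exception at this call site instead of somewhere downstream.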

> Support import modules in Jython UDF
> 
>
> Key: PIG-1824
> URL: https://issues.apache.org/jira/browse/PIG-1824
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.8.0, 0.9.0
>Reporter: Richard Ding
>Assignee: Woody Anderson
> Fix For: 0.9.0
>
> Attachments: 1824.patch, 1824a.patch, 1824b.patch, 1824c.patch
>
>
> Currently, a Jython UDF script doesn't support the Jython import statement, 
> as in the following example:
> {code}
> #!/usr/bin/python
> import re
> @outputSchema("word:chararray")
> def resplit(content, regex, index):
>     return re.compile(regex).split(content)[index]
> {code}
> Can Pig automatically locate the Jython module file and ship it to the 
> backend? Or should we add a ship clause to let the user explicitly specify 
> the module to ship?



[jira] [Updated] (PIG-1942) script UDF (jython) should utilize the intended output schema to more directly convert Py objects to Pig objects

2011-05-03 Thread Woody Anderson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Woody Anderson updated PIG-1942:


Attachment: 1942_with_junit.patch

I forgot to svn add my unit test, which contains a lot of useful tests and 
comments.

It's included in this patch. It has a timing loop at the end that you can 
enable by adding an annotation, or by running it directly in Eclipse, to 
show the performance difference between the methods.

> script UDF (jython) should utilize the intended output schema to more 
> directly convert Py objects to Pig objects
> 
>
> Key: PIG-1942
> URL: https://issues.apache.org/jira/browse/PIG-1942
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.8.0, 0.9.0
>Reporter: Woody Anderson
>Priority: Minor
>  Labels: python, schema, udf
> Fix For: 0.10
>
> Attachments: 1942.patch, 1942_with_junit.patch
>
>
> from https://issues.apache.org/jira/browse/PIG-1824
> {code}
> import re
> @outputSchema("y:bag{t:tuple(word:chararray)}")
> def strsplittobag(content,regex):
>     return re.compile(regex).split(content)
> {code}
> does not work because split returns a list of strings. However, the output 
> schema is known, and it would be quite simple to implicitly promote each 
> string element to a single-field tuple.
> Also, a list/array/tuple/set etc. are all equally convertible to a bag, and 
> list/array/tuple are equally convertible to a Tuple; this conversion can be 
> done in a much less rigid way with the use of the schema.
> This allows much easier re-use of existing Python code and less memory 
> overhead from creating intermediate objects for re-conversion.
> I wrote the code to do this a while back as part of my version of the 
> Jython script framework; I'll isolate that and attach it.



[jira] [Commented] (PIG-2032) Penny related multiple improvements/issues.

2011-05-03 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028352#comment-13028352
 ] 

Dmitriy V. Ryaboy commented on PIG-2032:


You didn't tie it to 0.10. I'm seeing things. Never mind.

> Penny related multiple improvements/issues.
> ---
>
> Key: PIG-2032
> URL: https://issues.apache.org/jira/browse/PIG-2032
> Project: Pig
>  Issue Type: Bug
>  Components: tools
>Affects Versions: 0.9.0
> Environment: All env
>Reporter: Mridul Muralidharan
> Attachments: diffs
>
>
> This is from a mail I sent Ben and Chris in feb
> ---
> Essentially the changes are:
> a) Allow for progress reporting so that long-running agent API invocations 
> don't result in killing the task.
> b) Decouple creation from initialization of the monitor agent harness.
> c) I increased the threads for the coordinator since ibis has slightly 
> I/O-bound coordinator messages to be processed, but this might not be 
> generally relevant.
> Maybe some way to configure it would be good? (The default value was 
> observed to be exhausted really fast for some invocations of ibis!)
> d) Handle shutdown more gracefully on the agent side: monitor and wait for 
> the relevant futures to complete.
> Note that the code for this is slightly sensitive in terms of how it is 
> written since there are idioms in netty which it supports (immediate 
> invocation vs deferred invocation on future completion).



[jira] [Commented] (PIG-2032) Penny related multiple improvements/issues.

2011-05-03 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028351#comment-13028351
 ] 

Dmitriy V. Ryaboy commented on PIG-2032:


Oh right, I just meant about tying it to 0.10...

> Penny related multiple improvements/issues.
> ---
>
> Key: PIG-2032
> URL: https://issues.apache.org/jira/browse/PIG-2032
> Project: Pig
>  Issue Type: Bug
>  Components: tools
>Affects Versions: 0.9.0
> Environment: All env
>Reporter: Mridul Muralidharan
> Attachments: diffs
>
>
> This is from a mail I sent Ben and Chris in Feb.
> ---
> Essentially the changes are:
> a) Allow for progress reporting so that long-running agent API invocations 
> don't result in the task being killed.
> b) Decouple creation from initialization of the monitor agent harness.
> c) I increased the number of coordinator threads, since Ibis has slightly 
> I/O-bound coordinator messages to process; this might not be generally 
> relevant.
> Maybe some way to configure it would be good? (The default value was 
> observed to be exhausted really fast for some invocations of Ibis!)
> d) Handle shutdown more gracefully on the agent side: monitor and wait for 
> the relevant futures to complete.
> Note that the code for this is slightly sensitive in terms of how it is 
> written, since it has to support netty idioms (immediate invocation vs. 
> deferred invocation on future completion).



[jira] [Updated] (PIG-2021) Parser error while referring a map nested foreach

2011-05-03 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-2021:


Assignee: Xuefu Zhang

> Parser error while referring a map nested foreach
> -
>
> Key: PIG-2021
> URL: https://issues.apache.org/jira/browse/PIG-2021
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.9.0
>Reporter: Vivek Padmanabhan
>Assignee: Xuefu Zhang
> Fix For: 0.9.0
>
>
> The script below throws parser errors:
> {code}
> register string.jar;
> A = load 'test1'  using MapLoader() as ( s, m, l );   
> B = foreach A generate *, string.URLPARSE((chararray) s#'url') as parsedurl;
> C = foreach B {
>   urlpath = (chararray) parsedurl#'path';
>   lc_urlpath = string.TOLOWERCASE((chararray) urlpath);
>   generate *;
> };
> {code}
> Error message:
> | Failed to generate logical plan.
> |Nested exception: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 
> 2225: Projection with nothing to reference!
> PIG-2002 reports a similar issue, but when I tried with the patch from PIG-2002, 
> I got the exception below:
>  ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200:  11, column 33>  mismatched input '(' expecting SEMI_COLON



[jira] [Commented] (PIG-2032) Penny related multiple improvements/issues.

2011-05-03 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028344#comment-13028344
 ] 

Olga Natkovich commented on PIG-2032:
-

But we need to allow some time for stabilizing the code before we release. So 
once we branch, it should be bug fixes, not enhancements.

> Penny related multiple improvements/issues.
> ---
>
> Key: PIG-2032
> URL: https://issues.apache.org/jira/browse/PIG-2032
> Project: Pig
>  Issue Type: Bug
>  Components: tools
>Affects Versions: 0.9.0
> Environment: All env
>Reporter: Mridul Muralidharan
> Attachments: diffs
>
>
> This is from a mail I sent Ben and Chris in Feb.
> ---
> Essentially the changes are:
> a) Allow for progress reporting so that long-running agent API invocations 
> don't result in the task being killed.
> b) Decouple creation from initialization of the monitor agent harness.
> c) I increased the number of coordinator threads, since Ibis has slightly 
> I/O-bound coordinator messages to process; this might not be generally 
> relevant.
> Maybe some way to configure it would be good? (The default value was 
> observed to be exhausted really fast for some invocations of Ibis!)
> d) Handle shutdown more gracefully on the agent side: monitor and wait for 
> the relevant futures to complete.
> Note that the code for this is slightly sensitive in terms of how it is 
> written, since it has to support netty idioms (immediate invocation vs. 
> deferred invocation on future completion).



[jira] [Commented] (PIG-2032) Penny related multiple improvements/issues.

2011-05-03 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028330#comment-13028330
 ] 

Dmitriy V. Ryaboy commented on PIG-2032:


I thought we weren't linking features to releases until they go in, since 
releases are now time-bound, not feature-bound.

> Penny related multiple improvements/issues.
> ---
>
> Key: PIG-2032
> URL: https://issues.apache.org/jira/browse/PIG-2032
> Project: Pig
>  Issue Type: Bug
>  Components: tools
>Affects Versions: 0.9.0
> Environment: All env
>Reporter: Mridul Muralidharan
> Attachments: diffs
>
>
> This is from a mail I sent Ben and Chris in Feb.
> ---
> Essentially the changes are:
> a) Allow for progress reporting so that long-running agent API invocations 
> don't result in the task being killed.
> b) Decouple creation from initialization of the monitor agent harness.
> c) I increased the number of coordinator threads, since Ibis has slightly 
> I/O-bound coordinator messages to process; this might not be generally 
> relevant.
> Maybe some way to configure it would be good? (The default value was 
> observed to be exhausted really fast for some invocations of Ibis!)
> d) Handle shutdown more gracefully on the agent side: monitor and wait for 
> the relevant futures to complete.
> Note that the code for this is slightly sensitive in terms of how it is 
> written, since it has to support netty idioms (immediate invocation vs. 
> deferred invocation on future completion).



[jira] [Resolved] (PIG-1895) Class cast exception while projecting udf result

2011-05-03 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai resolved PIG-1895.
-

Resolution: Fixed
Fix Version/s: 0.8.1 (was: 0.8.0)

Fixed by PIG-1866.

> Class cast exception while projecting udf result
> 
>
> Key: PIG-1895
> URL: https://issues.apache.org/jira/browse/PIG-1895
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.7.0, 0.8.0, 0.9.0
>Reporter: Vivek Padmanabhan
>Assignee: Daniel Dai
> Fix For: 0.8.1
>
>
> A class cast exception is thrown when I try to project the result from my UDF. 
> The UDF has a defined schema of DataType.BAG, DataType.LONG and DataType.INTEGER.
> Below is my script:
> {code}
> Data = load 'file:/home/pvivek/Desktop/input' using PigStorage() as ( i: int 
> );
> AllData = group Data all parallel 1;
> SampledData = foreach AllData generate org.vivek.TestEvalFunc(Data, 5) as rs;
> SampledData1 = foreach SampledData generate rs.sampled;
> {code}
> Even though the output schema defines "sampled" as a data bag, the entire 
> tuple was sent to the projection as the result during processing, instead of 
> only the data bag generated by the UDF.
> {code}
> Exception received:
> java.lang.ClassCastException: org.apache.pig.data.BinSedesTuple cannot be cast to org.apache.pig.data.DataBag
>   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:484)
>   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:197)
>   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:480)
>   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:197)
>   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:339)
>   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291)
>   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:434)
>   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:402)
>   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:382)
>   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:1)
>   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>   at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
>   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
> {code}
> This issue occurs with 0.7, 0.8 and 0.9.



[jira] [Updated] (PIG-2032) Penny related multiple improvements/issues.

2011-05-03 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-2032:


  Description: 
This is from a mail I sent Ben and Chris in Feb.
---

Essentially the changes are:

a) Allow for progress reporting so that long-running agent API invocations 
don't result in the task being killed.

b) Decouple creation from initialization of the monitor agent harness.

c) I increased the number of coordinator threads, since Ibis has slightly 
I/O-bound coordinator messages to process; this might not be generally relevant.
Maybe some way to configure it would be good? (The default value was observed 
to be exhausted really fast for some invocations of Ibis!)


d) Handle shutdown more gracefully on the agent side: monitor and wait for 
the relevant futures to complete.
Note that the code for this is slightly sensitive in terms of how it is written, 
since it has to support netty idioms (immediate invocation vs. deferred 
invocation on future completion).



  was:

This is from a mail I sent Ben and Chris in Feb.
---

Essentially the changes are:

a) Allow for progress reporting so that long-running agent API invocations 
don't result in the task being killed.

b) Decouple creation from initialization of the monitor agent harness.

c) I increased the number of coordinator threads, since Ibis has slightly 
I/O-bound coordinator messages to process; this might not be generally relevant.
Maybe some way to configure it would be good? (The default value was observed 
to be exhausted really fast for some invocations of Ibis!)


d) Handle shutdown more gracefully on the agent side: monitor and wait for 
the relevant futures to complete.
Note that the code for this is slightly sensitive in terms of how it is written, 
since it has to support netty idioms (immediate invocation vs. deferred 
invocation on future completion).



Fix Version/s: (was: 0.9.0)

Since we have already branched for 0.9, I don't believe any Penny improvements 
will make it in. 

In general, for improvements to a particular feature to make it into a 
particular release, it is good to have resources lined up to work on the feature.

> Penny related multiple improvements/issues.
> ---
>
> Key: PIG-2032
> URL: https://issues.apache.org/jira/browse/PIG-2032
> Project: Pig
>  Issue Type: Bug
>  Components: tools
>Affects Versions: 0.9.0
> Environment: All env
>Reporter: Mridul Muralidharan
> Attachments: diffs
>
>
> This is from a mail I sent Ben and Chris in Feb.
> ---
> Essentially the changes are:
> a) Allow for progress reporting so that long-running agent API invocations 
> don't result in the task being killed.
> b) Decouple creation from initialization of the monitor agent harness.
> c) I increased the number of coordinator threads, since Ibis has slightly 
> I/O-bound coordinator messages to process; this might not be generally 
> relevant.
> Maybe some way to configure it would be good? (The default value was 
> observed to be exhausted really fast for some invocations of Ibis!)
> d) Handle shutdown more gracefully on the agent side: monitor and wait for 
> the relevant futures to complete.
> Note that the code for this is slightly sensitive in terms of how it is 
> written, since it has to support netty idioms (immediate invocation vs. 
> deferred invocation on future completion).



[jira] [Updated] (PIG-2032) Penny related multiple improvements/issues.

2011-05-03 Thread Mridul Muralidharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated PIG-2032:
-

Attachment: diffs

I have not cleaned up the file as per Pig requirements, etc. Similarly, no tests 
are attached for the changes.

Ibis depends on these changes and works on top of them, so they have been 
indirectly tested.

> Penny related multiple improvements/issues.
> ---
>
> Key: PIG-2032
> URL: https://issues.apache.org/jira/browse/PIG-2032
> Project: Pig
>  Issue Type: Bug
>  Components: tools
>Affects Versions: 0.9.0
> Environment: All env
>Reporter: Mridul Muralidharan
> Fix For: 0.9.0
>
> Attachments: diffs
>
>
> This is from a mail I sent Ben and Chris in Feb.
> ---
> Essentially the changes are:
> a) Allow for progress reporting so that long-running agent API invocations 
> don't result in the task being killed.
> b) Decouple creation from initialization of the monitor agent harness.
> c) I increased the number of coordinator threads, since Ibis has slightly 
> I/O-bound coordinator messages to process; this might not be generally 
> relevant.
> Maybe some way to configure it would be good? (The default value was 
> observed to be exhausted really fast for some invocations of Ibis!)
> d) Handle shutdown more gracefully on the agent side: monitor and wait for 
> the relevant futures to complete.
> Note that the code for this is slightly sensitive in terms of how it is 
> written, since it has to support netty idioms (immediate invocation vs. 
> deferred invocation on future completion).



[jira] [Created] (PIG-2032) Penny related multiple improvements/issues.

2011-05-03 Thread Mridul Muralidharan (JIRA)
Penny related multiple improvements/issues.
---

 Key: PIG-2032
 URL: https://issues.apache.org/jira/browse/PIG-2032
 Project: Pig
  Issue Type: Bug
  Components: tools
Affects Versions: 0.9.0
 Environment: All env
Reporter: Mridul Muralidharan
 Fix For: 0.9.0



This is from a mail I sent Ben and Chris in Feb.
---

Essentially the changes are:

a) Allow for progress reporting so that long-running agent API invocations 
don't result in the task being killed.

b) Decouple creation from initialization of the monitor agent harness.

c) I increased the number of coordinator threads, since Ibis has slightly 
I/O-bound coordinator messages to process; this might not be generally relevant.
Maybe some way to configure it would be good? (The default value was observed 
to be exhausted really fast for some invocations of Ibis!)


d) Handle shutdown more gracefully on the agent side: monitor and wait for 
the relevant futures to complete.
Note that the code for this is slightly sensitive in terms of how it is written, 
since it has to support netty idioms (immediate invocation vs. deferred 
invocation on future completion).
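
Points (c) and (d) above describe a configurable worker pool and a shutdown that waits for outstanding futures. The attached diff is Java on top of netty and is not reproduced here; what follows is only a rough Python analogy of the idea, using the standard concurrent.futures module. The function name run_then_shutdown and the max_workers default are assumptions made for this illustration, not anything taken from Penny.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def run_then_shutdown(tasks, max_workers=4):
    """Run callables on a pool of configurable size, then shut down gracefully.

    Hypothetical helper for illustration only; Penny's real code is Java/netty.
    """
    executor = ThreadPoolExecutor(max_workers=max_workers)
    futures = [executor.submit(task) for task in tasks]
    # Block until every outstanding future has completed before tearing down.
    done, not_done = wait(futures)
    executor.shutdown(wait=True)
    return [f.result() for f in futures], len(not_done)


results, pending = run_then_shutdown([lambda: 1, lambda: 2], max_workers=2)
```

Making the pool size a parameter is one way to address the "some way to configure it" question in (c); waiting on the futures before shutdown mirrors (d).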





[jira] [Commented] (PIG-1775) Removal of old logical plan

2011-05-03 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028308#comment-13028308
 ] 

Xuefu Zhang commented on PIG-1775:
--

Unit test passed. Patch PIG-1775-2.patch is committed to both trunk and 0.9.0.

> Removal of old logical plan
> ---
>
> Key: PIG-1775
> URL: https://issues.apache.org/jira/browse/PIG-1775
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.9.0
>Reporter: Yan Zhou
>Assignee: Xuefu Zhang
> Fix For: 0.9.0
>
> Attachments: PIG-1775-2.patch, PIG-1775.patch
>
>
> Only the new logical plan will be used, and the old logical plan will be 
> removed once the new one is stable enough. This is scheduled for the 0.9 
> release.



[jira] [Updated] (PIG-1775) Removal of old logical plan

2011-05-03 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-1775:
-

Attachment: PIG-1775-2.patch

Minor update patch. getSchemaFromString() is enabled.

> Removal of old logical plan
> ---
>
> Key: PIG-1775
> URL: https://issues.apache.org/jira/browse/PIG-1775
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.9.0
>Reporter: Yan Zhou
>Assignee: Xuefu Zhang
> Fix For: 0.9.0
>
> Attachments: PIG-1775-2.patch, PIG-1775.patch
>
>
> Only the new logical plan will be used, and the old logical plan will be 
> removed once the new one is stable enough. This is scheduled for the 0.9 
> release.



[jira] [Updated] (PIG-2029) Inconsistency in Pig Stats reports

2011-05-03 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-2029:


Fix Version/s: (was: 0.8.1)
 Assignee: Richard Ding

I do not believe it is a P1 issue, so I don't think it belongs on the 0.8 branch. 
Even for 0.9 I do not see it as a blocker. If we can find a quick reproducible 
case, we will fix it in 0.9. Otherwise we will delay until we can reproduce it. 
Also, this could potentially be an issue with Hadoop.

> Inconsistency in Pig Stats reports 
> ---
>
> Key: PIG-2029
> URL: https://issues.apache.org/jira/browse/PIG-2029
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.1, 0.9.0
>Reporter: Viraj Bhat
>Assignee: Richard Ding
> Fix For: 0.9.0
>
>
> I have a Pig script which reports varying stats for the same M/R job (same 
> inputs). Sometimes PigStats reports all the stats (such as 
> Maps, Reduces, MaxMapTime, MinMapTime, AvgMapTime, MaxReduceTime, MinReduceTime 
> and AvgReduceTime) for the M/R job as 0. Sometimes it reports them correctly.
> Enclosed are the stderr logs for 2 runs. Notice that in Run 1, 
> job_201103091134_556600 has 0 in all the columns, whereas in 
> Run 2, Hadoop job job_201104272229_75693 has some valid values. 
> The actual JobTracker link shows that they are non-empty. This points to a 
> bug in the interaction of the PigStats module with the JobTracker.
> Run 1:
> {quote}
> Job Stats (time in seconds):
> JobId  Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias  Feature  Outputs
> job_201103091134_556458  160  100  552  191  368  1257  371  392  IN,SP10P,SP11P,SP12P,SP13P,SP16P,SP17P,SP18P,SP20P,SP21P,SP22P,SP23P,SP24P,SP26P,SP27P,SP28P,SP29P,SP30P,SP31P,SP32P,SP33P,SP34P,SP4P,SP6P,SP7P,SP8P,SP9P  DISTINCT,MULTI_QUERY
> job_201103091134_556600  0  0  0  0  0  0  0  0  UNION5  MULTI_QUERY,MAP_ONLY  /user/viraj/dir,,
> job_201103091134_556601  7  100  17  8  14  200  15  27  CNJOIN25,GNJOIN25,sampleNJOIN25  GROUP_BY,COMBINER
> job_201103091134_556602  0  0  0  0  0  0  0  0  CNJOIN3,GNJOIN3,sampleNJOIN3  GROUP_BY,COMBINER
> job_201103091134_556603  0  0  0  0  0  0  0  0  CNJOIN15,GNJOIN15,sampleNJOIN15  GROUP_BY,COMBINER
> job_201103091134_556604  2  100  13  7  10  34  13  31  CNJOIN19,GNJOIN19,sampleNJOIN19  GROUP_BY,COMBINER
> job_201103091134_556644  0  0  0  0  0  0  0  0  ONJOIN15  SAMPLER
> job_201103091134_556645  0  0  0  0  0  0  0  0  ONJOIN25  SAMPLER
> job_201103091134_556646  0  0  0  0  0  0  0  0  ONJOIN3  SAMPLER
> job_201103091134_556654  0  0  0  0  0  0  0  0  ONJOIN19  SAMPLER
> job_201103091134_556662  0  0  0  0  0  0  0  0  ONJOIN19  ORDER_BY,COMBINER
> ..
> {quote}
> Run 2:
> {quote}
> Job Stats (time in seconds):
> JobId  Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias  Feature  Outputs
> job_201104272229_75503  159  100  484  192  353  396  308  321  IN,SP10P,SP11P,SP12P,SP13P,SP16P,SP17P,SP18P,SP20P,SP21P,SP22P,SP23P,SP24P,SP26P,SP27P,SP28P,SP29P,SP30P,SP31P,SP32P,SP33P,SP34P,SP4P,SP6P,SP7P,SP8P,SP9P  DISTINCT,MULTI_QUERY
> job_201104272229_75693  18  0  31  14  24  0  0  UNION5  MULTI_QUERY,MAP_ONLY  /user/viraj/dir,
> job_201104272229_75694  7  100  34  13  22  46  20  25  CNJOIN25,GNJOIN25,sampleNJOIN25  GROUP_BY,COMBINER
> job_201104272229_75695  125  100  19  11  15  32  18  26  CNJOIN3,GNJOIN3,sampleNJOIN3  GROUP_BY,COMBINER
> job_201104272229_75698  1  100  12  12  12  13  9  11  CNJOIN15,GNJOIN15,sampleNJOIN15  GROUP_BY,COMBINER
> job_201104272229_75702  2  100  21  5  13  35  22  26  CNJOIN19,GNJOIN19,sampleNJOIN19  GROUP_BY,COMBINER
> job_201104272229_75724  1  1  4  4  4  11  11  11  ONJOIN15  SAMPLER
> job_201104272229_75725  0  0  0  0  0  0  0

[jira] [Resolved] (PIG-1990) support casting of complex types with empty inner schema to complex type with non-empty inner schema

2011-05-03 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair resolved PIG-1990.


Resolution: Fixed

Patch committed to trunk and 0.9 branch.

> support casting of complex types with empty inner schema to complex type with 
> non-empty inner schema
> 
>
> Key: PIG-1990
> URL: https://issues.apache.org/jira/browse/PIG-1990
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.9.0
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
> Fix For: 0.9.0
>
> Attachments: PIG-1990.1.patch
>
>
> A use case like the following should be supported:
> {code}
> a = load '1.txt' as (t:tuple());
> b = foreach a generate (tuple(int))t;
> {code}



Re: Review Request: PIG-1702. Fix for task output logs for streaming jobs containing null input-split information.

2011-05-03 Thread Ashutosh Chauhan


> On 2011-04-13 18:03:22, Dmitriy Ryaboy wrote:
> > trunk/src/org/apache/pig/backend/hadoop/streaming/HadoopExecutableManager.java,
> >  line 202
> > 
> >
> > Do we care about the specifics of how this output is written?
> > 
> > Seems like it would be less code, and potentially better in the long 
> > run (if we are dealing with other kinds of splits) to just call toString() 
> > on the InputSplit. FileSplit already defines its own toString() which 
> > prints out the path, the start offset, and the length.

I agree with Dmitriy. If possible, we should avoid special-casing for a 
particular type of InputSplit. Further, InputSplit provides the getLocations() 
and getLength() APIs, which should be used instead of FileSplit-specific APIs.
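
To make the suggestion concrete, here is a small Python analogy (not the Java under review). The classes InputSplit and FileSplit below are simplified stand-ins written for this illustration, and describe_split is a hypothetical helper; the string format produced by __str__ is likewise only illustrative. The point is that the logging code relies on the generic interface and on each split describing itself, with no type checks.

```python
class InputSplit:
    """Simplified stand-in for a generic split (assumption for this sketch)."""

    def get_length(self):
        raise NotImplementedError

    def get_locations(self):
        raise NotImplementedError


class FileSplit(InputSplit):
    """Simplified stand-in for a file-backed split."""

    def __init__(self, path, start, length, hosts):
        self.path = path
        self.start = start
        self.length = length
        self.hosts = hosts

    def get_length(self):
        return self.length

    def get_locations(self):
        return list(self.hosts)

    def __str__(self):
        # Each split type knows how to describe itself.
        return "%s:%d+%d" % (self.path, self.start, self.length)


def describe_split(split):
    # No isinstance checks: only the generic interface and str() are used.
    return "split=%s length=%d locations=%s" % (
        split, split.get_length(), ",".join(split.get_locations()))


line = describe_split(FileSplit("/data/in.txt", 0, 128, ["host1"]))
```

Any other split type that implements the same three methods would work with describe_split unchanged, which is the long-run benefit the review comments point to.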


- Ashutosh


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/547/#review452
---


On 2011-04-04 19:33:32, Adam Warrington wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/547/
> ---
> 
> (Updated 2011-04-04 19:33:32)
> 
> 
> Review request for pig.
> 
> 
> Summary
> ---
> 
> This is a patch for PIG-1702, which describes an issue where the task output 
> logs for Pig streaming jobs contain null input-split information. The 
> ability to query the input-split information through the JobConf went away 
> with the new MR API. We must now obtain a reference to the underlying 
> FileSplit and query this reference for that information.
> 
> 
> Diffs
> -
> 
>   
> trunk/src/org/apache/pig/backend/hadoop/streaming/HadoopExecutableManager.java
>  1088692 
> 
> Diff: https://reviews.apache.org/r/547/diff
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Adam
> 
>



[jira] [Created] (PIG-2031) NPE in TOP

2011-05-03 Thread Jacob Perkins (JIRA)
NPE in TOP
--

 Key: PIG-2031
 URL: https://issues.apache.org/jira/browse/PIG-2031
 Project: Pig
  Issue Type: Bug
Reporter: Jacob Perkins


If a NULL DataBag is passed to org.apache.pig.builtin.TOP then an NPE is thrown. 
Consider:


{code}
$: cat foo.tsv
a  {(foo,1),(bar,2)}
b
c  {(fyha,4),(asdf,9)}
{code}

then:

{code}
data  = LOAD 'foo.tsv' AS (key:chararray, a_bag:bag {t:tuple (name:chararray, 
value:int)});
tpd   = FOREACH data {
  top_n = TOP(1, 1, a_bag);
  GENERATE
key   AS key,
top_n AS top_n
  ; 
};
DUMP tpd;
{code}

will throw an NPE when it gets to the row with no bag.
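
For illustration, the kind of guard that avoids this failure can be sketched in Python. This is not Pig's TOP implementation and not a proposed patch; top_n is a hypothetical function, and returning None for a null bag is just one plausible behaviour assumed here.

```python
import heapq


def top_n(n, column, bag):
    """Return the n tuples with the largest value in `column`.

    Hypothetical sketch: a null (None) bag yields None instead of raising.
    """
    if bag is None:
        return None
    return heapq.nlargest(n, bag, key=lambda t: t[column])


# Rows mirroring foo.tsv above; key "b" has no bag.
rows = [
    ("a", [("foo", 1), ("bar", 2)]),
    ("b", None),
    ("c", [("fyha", 4), ("asdf", 9)]),
]
result = [(key, top_n(1, 1, bag)) for key, bag in rows]
```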



[jira] [Updated] (PIG-1942) script UDF (jython) should utilize the intended output schema to more directly convert Py objects to Pig objects

2011-05-03 Thread Woody Anderson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Woody Anderson updated PIG-1942:


Attachment: 1942.patch

I wanted to get this started, as this is a bit of a change.

Often, it seems that people misuse the outputSchema annotation such that the 
output does not match the specified schema. At least, there was a unit test 
that did this, and it's possible that a few users in the wild have this issue 
as well.

At any rate, this patch includes code in JythonUtils that will coerce Jython 
object model output into the schema that the function is annotated with.

It's faster than the existing code and has quite a bit more functionality. It 
can convert arrays and many more types than previously. It also makes it much 
easier and faster to convert [1,2,3] to a bag, rather than having to create 
[(1,), (2,), (3,)] in Jython.

Given that this changes the functionality of UDFs that use @outputSchema (by 
coercing schema adherence), we may want to use a different annotation, and 
allow outputSchema to exist in its previous form, in that it doesn't actually 
convert the schema.
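
As a rough illustration of the coercion idea (not the code in 1942.patch, which lives in JythonUtils and works against Pig's Java types), the promotion of plain elements to one-field tuples can be sketched in plain Python. The helper coerce_to_bag is hypothetical, and a "bag" is modelled here simply as a list of tuples; the Pig decorator is omitted so the snippet runs outside Pig.

```python
import re


def coerce_to_bag(value):
    """Hypothetical helper: turn any iterable into a list of tuples (a 'bag').

    Tuples are kept, lists become tuples, and any other element is
    promoted to a one-field tuple.
    """
    bag = []
    for item in value:
        if isinstance(item, tuple):
            bag.append(item)
        elif isinstance(item, list):
            bag.append(tuple(item))
        else:
            bag.append((item,))
    return bag


def strsplittobag(content, regex):
    # Same body as the UDF in the issue, without the @outputSchema decorator.
    return re.compile(regex).split(content)


words = coerce_to_bag(strsplittobag("a b c", r"\s+"))
numbers = coerce_to_bag([1, 2, 3])
```

A schema-driven version would consult the declared field types rather than inspecting each element, which is what allows the conversion to be both stricter and cheaper.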


> script UDF (jython) should utilize the intended output schema to more 
> directly convert Py objects to Pig objects
> 
>
> Key: PIG-1942
> URL: https://issues.apache.org/jira/browse/PIG-1942
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.8.0, 0.9.0
>Reporter: Woody Anderson
>Priority: Minor
>  Labels: python, schema, udf
> Fix For: 0.10
>
> Attachments: 1942.patch
>
>
> from https://issues.apache.org/jira/browse/PIG-1824
> {code}
> import re
> @outputSchema("y:bag{t:tuple(word:chararray)}")
> def strsplittobag(content,regex):
>     return re.compile(regex).split(content)
> {code}
> does not work because split returns a list of strings. However, the output 
> schema is known, and it would be quite simple to implicitly promote the 
> string element to a tupled element.
> Also, list/array/tuple/set etc. are all equally convertible to a bag, and 
> list/array/tuple are equally convertible to a Tuple; this conversion can be 
> done in a much less rigid way with the use of the schema.
> This allows much easier reuse of existing Python code and avoids the memory 
> overhead of creating intermediate objects just to re-convert types.
> I wrote the code to do this a while back as part of my version of the 
> Jython script framework; I'll isolate that and attach it.
