[jira] [Commented] (PIG-3486) Pig hitting OOM while using PigRunner.run()

2013-10-07 Thread Vivek Padmanabhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13787954#comment-13787954
 ] 

Vivek Padmanabhan commented on PIG-3486:


Hi Ankit/Daniel,
 Thanks for looking into this.
Meanwhile, is the current patch safe to apply to our cluster running Pig
0.11.1? All basic tests pass without issues, but I am not sure whether there
are cases I may have missed.


 Pig hitting OOM while using PigRunner.run()
 ---

 Key: PIG-3486
 URL: https://issues.apache.org/jira/browse/PIG-3486
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.11.1
Reporter: Vivek Padmanabhan
 Attachments: histolive.txt, PIG-3486.patch1


 I have a timer-based class that triggers a Pig script execution every 5
 minutes using PigRunner.run(args, null).
 But the heap usage grows gradually; after around 15 days it crossed 1 GB,
 i.e. after the above method had been invoked about 4,000 times.
 The top entries of the live heap histogram are:
  num     #instances          #bytes  class name
 -----------------------------------------------
    1:       2430178       433053080  [C
    2:       3055280        97768960  java.util.Hashtable$Entry
    3:       2454870        78555840  java.lang.String
    4:       1585204        50726528  java.util.HashMap$Entry
    5:        260310        37503984  constMethodKlass
    6:        260310        35413536  methodKlass
    7:         35024        23724672  [Ljava.util.Hashtable$Entry;
    8:          7599        18141016  constantPoolKlass
    9:         47551        18066696  [Ljava.util.HashMap$Entry;
   10:        209516        16761280  java.lang.reflect.Method
   11:        212292        16732008  [I
   12:          6881        11332896  constantPoolCacheKlass
   13:          7599         7160920  instanceKlassKlass
   14:         79412         4447072  java.util.ResourceBundle$CacheKey
   15:         10787         3958464  [S
   16:         79412         3811776  java.util.ResourceBundle$BundleReference
   17:         26634         3458160  [B
   18:        133701         3208824  java.util.LinkedList$Node
   19:         85492         2735744  java.util.concurrent.ConcurrentHashMap$HashEntry
   20:         79412         2541184  java.util.ResourceBundle$LoaderReference
   21:         47515         2280720  java.util.HashMap
   22:         37298         2274416  [Ljava.lang.Object;
   23:         70638         2260416  java.util.LinkedList
   24:          2949         1994376  methodDataKlass
   25:          7914         1749080  java.lang.Class
   26:         62746         1505904  org.apache.commons.logging.impl.Log4JLogger
   27:         16639         1463824  [[I
   28:         21279         1361856  java.net.URL
   29:         28090         1348320  java.util.Hashtable
   30:         14167         1231856  [Ljava.util.WeakHashMap$Entry;
   31:         17770          710800  java.lang.ref.Finalizer
   32:         10626          680064  java.util.jar.JarFile
   33:         14167          680016  java.util.WeakHashMap
   34:         14238          569520  java.util.WeakHashMap$Entry
   35:          7104          568320  java.util.jar.JarFile$JarFileEntry
   36:           165          567264  [Ljava.util.concurrent.ConcurrentHashMap$HashEntry;
   37:         10637          510576  sun.nio.cs.UTF_8$Encoder
   38:         10633          510384  sun.misc.URLClassPath$JarLoader
   39:         14176          453632  java.lang.ref.ReferenceQueue
   40:         17747          409752  [Ljava.lang.Class;
   41:          3463          387856  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$HangingJobKiller
   42:         15355          368520  java.util.ArrayList
   43:         10632          340224  java.util.zip.ZipCoder
   44:          6932          332736  java.util.Properties
   45:          4060          292320  java.lang.reflect.Constructor
   46:          7143          285720  java.util.LinkedHashMap$Entry
   47:          3517          281360  org.apache.pig.impl.PigContext$ContextClassLoader
   48:          3476          278144  [Ljava.lang.ThreadLocal$ThreadLocalMap$Entry;
   49:          3458          276640  java.net.URI
   50:          8576          274432  antlr.ANTLRHashString
   51:         10632          255168  java.util.ArrayDeque
 There are way too many instances of MapReduceLauncher$HangingJobKiller.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (PIG-3486) Pig hitting OOM while using PigRunner.run()

2013-09-26 Thread Vivek Padmanabhan (JIRA)
Vivek Padmanabhan created PIG-3486:
--

 Summary: Pig hitting OOM while using PigRunner.run()
 Key: PIG-3486
 URL: https://issues.apache.org/jira/browse/PIG-3486
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.11.1
Reporter: Vivek Padmanabhan


I have a timer-based class that triggers a Pig script execution every 5
minutes using PigRunner.run(args, null).

But the heap usage grows gradually; after around 15 days it crossed 1 GB,
i.e. after the above method had been invoked about 4,000 times.

The top entries of the live heap histogram are identical to those quoted in
the comment above.



There are way too many instances of MapReduceLauncher$HangingJobKiller.
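Given that HangingJobKiller instances track PigRunner.run() invocations almost one-to-one, a plausible leak mechanism is a per-run JVM shutdown hook that is registered but never deregistered (this is an inference from the instance counts, not a quote from the Pig source). The accumulation, and the corresponding fix, can be sketched with plain JDK calls; all names here are illustrative stand-ins:

```java
import java.util.ArrayList;
import java.util.List;

public class ShutdownHookLeakDemo {
    // Stand-in for MapReduceLauncher$HangingJobKiller: one hook per script run.
    static final List<Thread> live = new ArrayList<>();

    static void runScriptOnce(boolean deregister) {
        Thread killer = new Thread(() -> { /* would kill still-running jobs */ });
        Runtime.getRuntime().addShutdownHook(killer);   // pins the thread until JVM exit
        live.add(killer);
        // ... the actual script execution would happen here ...
        if (deregister) {
            // The fix: remove the hook once the run finishes normally, so the
            // killer (and everything it references) becomes collectible.
            Runtime.getRuntime().removeShutdownHook(killer);
            live.remove(killer);
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 100; i++) runScriptOnce(false);
        System.out.println("leaky: " + live.size() + " hooks still registered");
        for (Thread t : live) Runtime.getRuntime().removeShutdownHook(t);
        live.clear();
        for (int i = 0; i < 100; i++) runScriptOnce(true);
        System.out.println("fixed: " + live.size() + " hooks still registered");
    }
}
```

With the deregistration step, repeated runs no longer pin one thread (plus its retained object graph) per invocation.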




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3486) Pig hitting OOM while using PigRunner.run()

2013-09-26 Thread Vivek Padmanabhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vivek Padmanabhan updated PIG-3486:
---

Attachment: histolive.txt

Attaching the live heap histogram.

 Pig hitting OOM while using PigRunner.run()
 ---

 Key: PIG-3486
 URL: https://issues.apache.org/jira/browse/PIG-3486
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.11.1
Reporter: Vivek Padmanabhan
 Attachments: histolive.txt


 I have a timer-based class that triggers a Pig script execution every 5
 minutes using PigRunner.run(args, null).
 But the heap usage grows gradually; after around 15 days it crossed 1 GB,
 i.e. after the above method had been invoked about 4,000 times.
 The top entries of the live heap histogram are identical to those quoted
 above.
 There are way too many instances of MapReduceLauncher$HangingJobKiller.

--


[jira] [Updated] (PIG-3486) Pig hitting OOM while using PigRunner.run()

2013-09-26 Thread Vivek Padmanabhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vivek Padmanabhan updated PIG-3486:
---

Attachment: PIG-3486.patch1

Attaching an initial patch.

 Pig hitting OOM while using PigRunner.run()
 ---

 Key: PIG-3486
 URL: https://issues.apache.org/jira/browse/PIG-3486
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.11.1
Reporter: Vivek Padmanabhan
 Attachments: histolive.txt, PIG-3486.patch1


 I have a timer-based class that triggers a Pig script execution every 5
 minutes using PigRunner.run(args, null).
 But the heap usage grows gradually; after around 15 days it crossed 1 GB,
 i.e. after the above method had been invoked about 4,000 times.
 The top entries of the live heap histogram are identical to those quoted
 above.
 There are way too many instances of MapReduceLauncher$HangingJobKiller.

--


[jira] [Commented] (PIG-2127) PigStorageSchema need to deal with missing field

2012-06-06 Thread Vivek Padmanabhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13290233#comment-13290233
 ] 

Vivek Padmanabhan commented on PIG-2127:


I think this issue is still present with the PigStorage '-schema' option:


{code}
a = load '2127_withschema' using PigStorage(',','-schema');
b = foreach a generate f1,f2,f3,f4;
dump b;
{code}

input
{code}
d,e,4,1
a,b,1,2
c,b
d,e,4,1
{code}

The above script and input produce the below exception:
{code}
java.lang.IndexOutOfBoundsException: Index: 3, Size: 3
	at java.util.ArrayList.RangeCheck(ArrayList.java:547)
	at java.util.ArrayList.get(ArrayList.java:322)
	at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:156)
	at org.apache.pig.builtin.PigStorage.applySchema(PigStorage.java:282)
	at org.apache.pig.builtin.PigStorage.getNext(PigStorage.java:246)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:194)
	at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:532)
	at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
{code}


 PigStorageSchema need to deal with missing field
 

 Key: PIG-2127
 URL: https://issues.apache.org/jira/browse/PIG-2127
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.10.0
Reporter: Daniel Dai
 Fix For: 0.10.0


 Currently, if the data contains fewer columns than the schema, PigStorageSchema
 will throw an IndexOutOfBoundsException (PigStorageSchema:97). We should pad
 with nulls in this case, as we do in PigStorage.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (PIG-2127) PigStorageSchema need to deal with missing field

2012-06-06 Thread Vivek Padmanabhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13290241#comment-13290241
 ] 

Vivek Padmanabhan commented on PIG-2127:


I am seeing the same issue with PigStorage as well on Pig 0.10.

Input:
{code}
d,e,4,1
a,b,1,2
c,b
d,e,4,1
{code}

Script:
{code}
a = load '2127_withschema' using PigStorage(',') as (f1,f2,f3,f4);
b = foreach a generate f1,f2,f3,f4;
dump b;
{code}

The above script results in the same IndexOutOfBoundsException on Pig 0.10
(it works fine with Pig 0.9).



 PigStorageSchema need to deal with missing field
 

 Key: PIG-2127
 URL: https://issues.apache.org/jira/browse/PIG-2127
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.10.0
Reporter: Daniel Dai
 Fix For: 0.10.0


 Currently, if the data contains fewer columns than the schema, PigStorageSchema
 will throw an IndexOutOfBoundsException (PigStorageSchema:97). We should pad
 with nulls in this case, as we do in PigStorage.

--




[jira] [Created] (PIG-2721) Wrong output generated while loading bags as input

2012-05-24 Thread Vivek Padmanabhan (JIRA)
Vivek Padmanabhan created PIG-2721:
--

 Summary: Wrong output generated while loading bags as input
 Key: PIG-2721
 URL: https://issues.apache.org/jira/browse/PIG-2721
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.9.2, 0.9.0
Reporter: Vivek Padmanabhan


{code}
A = LOAD '/user/pvivek/sample' as 
(id:chararray,mybag:bag{tuple(bttype:chararray,cat:long)});
B = foreach A generate id,FLATTEN(mybag) AS (bttype, cat);
C = order B by id;
dump C;
{code}

The above script generates wrong results when executed with Pig 0.10 and Pig 0.9.
The below is the sample input:
{code}
...LKGaHqg--{(aa,806743)}
..0MI1Y37w--{(aa,498970)}
..0bnlpJrw--{(aa,806740)}
..0p0IIhbA--{(aa,498971),(se,498995)}
..1VkGqvXA--{(aa,805219)}
{code}

I think the Pig optimizers are causing this issue. From the logs I can see that
column $1 is pruned for relation A:

[main] INFO  org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns
pruned for A: $1

One workaround is to disable the ColumnMapKeyPrune rule with '-t ColumnMapKeyPrune'.


--




[jira] [Updated] (PIG-2305) Pig should log the split locations in task logs

2011-09-26 Thread Vivek Padmanabhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vivek Padmanabhan updated PIG-2305:
---

Fix Version/s: 0.9.1
   Status: Patch Available  (was: Open)

 Pig should log the split locations in task logs
 ---

 Key: PIG-2305
 URL: https://issues.apache.org/jira/browse/PIG-2305
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.9.0, 0.8.1
Reporter: Vivek Padmanabhan
Assignee: Vivek Padmanabhan
Priority: Minor
 Fix For: 0.9.1

 Attachments: PIG-2305_1.patch


 It would be helpful if Pig could log the split information in the task logs.
 MAPREDUCE-2076 talks about having this log, but from Pig 0.8 onwards splits
 can be combined, so I am not sure what the result would be. One side effect is
 that these logs will also get printed for intermediate Pig jobs.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (PIG-2305) Pig should log the split locations in task logs

2011-09-26 Thread Vivek Padmanabhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vivek Padmanabhan updated PIG-2305:
---

Attachment: PIG-2305_1.patch

 Pig should log the split locations in task logs
 ---

 Key: PIG-2305
 URL: https://issues.apache.org/jira/browse/PIG-2305
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.1, 0.9.0
Reporter: Vivek Padmanabhan
Assignee: Vivek Padmanabhan
Priority: Minor
 Fix For: 0.9.1

 Attachments: PIG-2305_1.patch


 It would be helpful if Pig could log the split information in the task logs.
 MAPREDUCE-2076 talks about having this log, but from Pig 0.8 onwards splits
 can be combined, so I am not sure what the result would be. One side effect is
 that these logs will also get printed for intermediate Pig jobs.

--




[jira] [Created] (PIG-2291) PigStats.isSuccessful returns false if embedded pig script has dump

2011-09-19 Thread Vivek Padmanabhan (JIRA)
PigStats.isSuccessful returns false if embedded pig script has dump
---

 Key: PIG-2291
 URL: https://issues.apache.org/jira/browse/PIG-2291
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Vivek Padmanabhan


The below is my python script:
{code}
#! /usr/bin/python
from  org.apache.pig.scripting import Pig

P = Pig.compileFromFile("a.pig")
result = P.bind().runSingle()

if result.isSuccessful():
    print 'Pig job succeeded'
else:
    print 'Pig job failed'
{code}


The below is the embedded pig script (a.pig):
{code}
A = LOAD 'a1' USING PigStorage(',') AS (f1:chararray,f2:chararray);
B = GROUP A by f1;
dump B;
{code}


For this script execution, even though the job is successful, the output
printed is 'Pig job failed'.
This is because result.isSuccessful() returns false whenever the pig script
contains a dump statement.

If I run the pig script alone, then the exit code returned is correct.

--




[jira] [Resolved] (PIG-2250) Pig 0.9 error message not useful as compared to 0.8

2011-08-30 Thread Vivek Padmanabhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vivek Padmanabhan resolved PIG-2250.


   Resolution: Fixed
Fix Version/s: 0.10

Verified this with trunk. Sorry for the trouble.

 Pig 0.9 error message not useful as compared to 0.8
 ---

 Key: PIG-2250
 URL: https://issues.apache.org/jira/browse/PIG-2250
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Vivek Padmanabhan
 Fix For: 0.10


 Another instance of change in error message from 0.8 to 0.9 due to parser 
 modifications.
 This improper error message is due to \n in the UDF arguments.
 The below is a sample script;
 a = load 'input' using myLoader('a1,a2,
 a3,a4');
 dump a;
 Error Message from 0.9
 --
 ERROR 1200: Pig script failed to parse: MismatchedTokenException(93!=3)
 Error Message from 0.8
 
  ERROR 1000: Error during parsing. Lexical error at line 1, column 40.  
 Encountered: \n (10), after : \'a1,a2,

--




[jira] [Created] (PIG-2250) Pig 0.9 error message not useful as compared to 0.8

2011-08-29 Thread Vivek Padmanabhan (JIRA)
Pig 0.9 error message not useful as compared to 0.8
---

 Key: PIG-2250
 URL: https://issues.apache.org/jira/browse/PIG-2250
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Vivek Padmanabhan


Another instance of change in error message from 0.8 to 0.9 due to parser 
modifications.
This improper error message is due to \n in the UDF arguments.

The below is a sample script:
{code}
a = load 'input' using myLoader('a1,a2,
a3,a4');
dump a;
{code}

Error Message from 0.9
--
ERROR 1200: Pig script failed to parse: MismatchedTokenException(93!=3)

Error Message from 0.8
--
ERROR 1000: Error during parsing. Lexical error at line 1, column 40.
Encountered: \n (10), after : \'a1,a2,

--




[jira] [Commented] (PIG-2238) Pig 0.9 error message not useful as compared to 0.8

2011-08-25 Thread Vivek Padmanabhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13090820#comment-13090820
 ] 

Vivek Padmanabhan commented on PIG-2238:


Looks like this issue was introduced as part of the parser changes.
In Pig 0.9, the validation is done as below, in org.apache.pig.parser.AstValidator:

{code}
private void validateAliasRef(Set<String> aliases, CommonTree node, String alias)
        throws UndefinedAliasException {
    if( !aliases.contains( alias ) ) {
        throw new UndefinedAliasException( input, new SourceLocation( (PigParserNode)node ), alias );
    }
}
{code}

Here it just checks that the alias is contained in the set of aliases, but this
set contains all the aliases in the script, and it does not check the order in
which they are defined.

Hence this will lead to other sorts of issues, like a NullPointerException, if I
replace the F statement in the above script with the below:
F = foreach F generate *;
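A definition-order-aware check would only let a statement reference aliases defined strictly before it. This is a standalone sketch of that idea (the names and data model are illustrative, not Pig's validator API):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AliasOrderCheckDemo {
    // Each statement defines one alias and may reference one alias (or null).
    // Flag references to aliases that are not defined on an EARLIER statement.
    static List<String> undefinedRefs(List<String[]> statements) {
        Set<String> defined = new HashSet<>();
        List<String> errors = new ArrayList<>();
        for (String[] stmt : statements) {
            String def = stmt[0], ref = stmt[1];
            if (ref != null && !defined.contains(ref)) {
                errors.add(ref);        // referenced before (or without) definition
            }
            defined.add(def);           // visible only from the NEXT statement on
        }
        return errors;
    }

    public static void main(String[] args) {
        List<String[]> script = new ArrayList<>();
        script.add(new String[]{"D", null}); // D = foreach C ...
        script.add(new String[]{"E", "D"});  // E = order D ...   (D defined above: OK)
        script.add(new String[]{"F", "F"});  // F = limit F 20;   (self-reference: flagged)
        System.out.println(undefinedRefs(script)); // [F]
    }
}
```

With such a check, 'F = limit F 20;' would fail with an "undefined alias F" style message instead of surfacing later as a LogicalPlanVisitor error or NullPointerException.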


 Pig 0.9 error message not useful as compared to 0.8
 ---

 Key: PIG-2238
 URL: https://issues.apache.org/jira/browse/PIG-2238
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.9.0
Reporter: Vivek Padmanabhan

 The below is my faulty script (note the usage of alias F) for which Pig 0.9 
 composes not so useful message as compared to 0.8;
 A = load 'input'  using TextLoader as (doc:chararray) ;
 B = foreach A generate flatten(TOKENIZE(doc)) as myword;
 C = group B by myword parallel 30;
 D = foreach C generate group,COUNT(B) as count,SIZE(group) as size;
 E = order D by size parallel 5;
 F = limit F 20;
 dump F;
 For this script , error message in 0.9
 ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2240: LogicalPlanVisitor can 
 only visit logical plan
 Error message in 0.8
 ERROR 1000: Error during parsing. Unrecognized alias F

--




[jira] [Created] (PIG-2238) Pig 0.9 error message not useful as compared to 0.8

2011-08-24 Thread Vivek Padmanabhan (JIRA)
Pig 0.9 error message not useful as compared to 0.8
---

 Key: PIG-2238
 URL: https://issues.apache.org/jira/browse/PIG-2238
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.9.0
Reporter: Vivek Padmanabhan


The below is my faulty script (note the usage of alias F), for which Pig 0.9
produces a less useful message than 0.8:

A = load 'input'  using TextLoader as (doc:chararray) ;
B = foreach A generate flatten(TOKENIZE(doc)) as myword;
C = group B by myword parallel 30;
D = foreach C generate group,COUNT(B) as count,SIZE(group) as size;

E = order D by size parallel 5;
F = limit F 20;
dump F;

For this script, the error message in 0.9 is:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2240: LogicalPlanVisitor can
only visit logical plan

The error message in 0.8 is:
ERROR 1000: Error during parsing. Unrecognized alias F


--




[jira] [Commented] (PIG-2217) POStore.getSchema() returns null if I dont have a schema defined at load statement

2011-08-23 Thread Vivek Padmanabhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089366#comment-13089366
 ] 

Vivek Padmanabhan commented on PIG-2217:


Sorry for my confusing comment.
My point was: if I don't specify a schema definition along with my load
statement, then PigStorageSchema won't save the schema files. This happens
from Pig 0.8 onwards; with Pig 0.7 I can see the files saved.
I believe this is because the schema object is null in 0.8, whereas 0.7 creates
an empty schema. Is this behaviour expected from 0.8 onwards?

 POStore.getSchema() returns null if I dont have a schema defined at load 
 statement
 --

 Key: PIG-2217
 URL: https://issues.apache.org/jira/browse/PIG-2217
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.1, 0.9.0
Reporter: Vivek Padmanabhan

 If I don't specify a schema definition in the load statement, then
 POStore.getSchema() returns null, because of which PigOutputCommitter does not
 store the schema.
 For example, if I run the below script, the .pig_header and .pig_schema files
 won't be saved.
 load_1 =  LOAD 'i1' USING PigStorage();
 ordered_data_1 =  ORDER load_1 BY * ASC PARALLEL 1;
 STORE ordered_data_1 INTO 'myout' using
 org.apache.pig.piggybank.storage.PigStorageSchema();
 This works fine with Pig 0.7, but from 0.8 onwards StoreMetadata.storeSchema is
 not invoked for these cases.

--




[jira] [Created] (PIG-2221) Couldnt find documentation for ColumnMapKeyPrune optimization rule

2011-08-16 Thread Vivek Padmanabhan (JIRA)
Couldnt find documentation for ColumnMapKeyPrune optimization rule
--

 Key: PIG-2221
 URL: https://issues.apache.org/jira/browse/PIG-2221
 Project: Pig
  Issue Type: Bug
  Components: documentation
Affects Versions: 0.8.1
Reporter: Vivek Padmanabhan


There is no documentation for some of the Optimization Rules,
for example ColumnMapKeyPrune, in
http://pig.apache.org/docs/r0.8.1/piglatin_ref1.html#Optimization+Rules

Moreover, I believe the documentation should say how to disable these
rules using the -t option.
It would also be nice if the documentation discussed some use cases where it
makes sense to disable an optimization rule.



--




[jira] [Created] (PIG-2217) POStore.getSchema() returns null if I dont have a schema defined at load statement

2011-08-11 Thread Vivek Padmanabhan (JIRA)
POStore.getSchema() returns null if I dont have a schema defined at load 
statement
--

 Key: PIG-2217
 URL: https://issues.apache.org/jira/browse/PIG-2217
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.9.0, 0.8.1
Reporter: Vivek Padmanabhan


If I don't specify a schema definition in the load statement, then
POStore.getSchema() returns null, because of which PigOutputCommitter does not
store the schema.

For example, if I run the below script, the .pig_header and .pig_schema files
won't be saved.

{code}
load_1 =  LOAD 'i1' USING PigStorage();
ordered_data_1 =  ORDER load_1 BY * ASC PARALLEL 1;
STORE ordered_data_1 INTO 'myout' using org.apache.pig.piggybank.storage.PigStorageSchema();
{code}

This works fine with Pig 0.7, but from 0.8 onwards StoreMetadata.storeSchema is
not invoked for these cases.




--




[jira] [Commented] (PIG-2217) POStore.getSchema() returns null if I dont have a schema defined at load statement

2011-08-11 Thread Vivek Padmanabhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13083939#comment-13083939
 ] 

Vivek Padmanabhan commented on PIG-2217:


For the above mentioned script, the schema is marked as null from the logical 
layer itself, i.e. LOStore.getSchema() returns null.
Since the schema is derived from its predecessor operators, and the schema 
object for LOLoad itself is null,
this scenario will happen for any script that does not define a 
schema in the load statement.


In Pig 0.7, even if the schema value is null from the logical layer, it is 
wrapped with an empty schema while translating.
For example, in LogToPhyTranslationVisitor:
   public void visit(LOStore loStore) throws VisitorException {

 store.setSchema(new Schema(loStore.getSchema()));
Hence the file will look like below
.pig_header (empty file )
.pig_schema
---
{fields:[],version:0,sortKeys:[-1],sortKeyOrders:[ASCENDING]}



But from 0.8 (new logical plan) onwards, the null value is returned directly, 
because of which the metadata is not saved.


This change in behaviour came with the new logical plan introduced in Pig 0.8, 
and it carried over into Pig 0.9.
Disabling the new logical plan in 0.8 (pig -useversion 0.8 
-Dpig.usenewlogicalplan=false) will produce the
.pig_header and .pig_schema files.

 POStore.getSchema() returns null if I dont have a schema defined at load 
 statement
 --

 Key: PIG-2217
 URL: https://issues.apache.org/jira/browse/PIG-2217
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.1, 0.9.0
Reporter: Vivek Padmanabhan

 If I don't specify a schema definition in load statement, then 
 POStore.getSchema() returns null because of which PigOutputCommitter is not 
 storing schema . 
 For example if I run the below script, .pig_header and .pig_schema files 
 wont be saved.
 load_1 =  LOAD 'i1' USING PigStorage();
 ordered_data_1 =  ORDER load_1 BY * ASC PARALLEL 1;
 STORE ordered_data_1 INTO 'myout' using 
 org.apache.pig.piggybank.storage.PigStorageSchema();
 This works fine with Pig 0.7, but 0.8 onwards StoreMetadata.storeSchema is 
 not getting invoked for these cases.





[jira] [Updated] (PIG-2181) Improvement : for error message when describe misses alias

2011-07-22 Thread Vivek Padmanabhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vivek Padmanabhan updated PIG-2181:
---

Assignee: Vivek Padmanabhan
  Status: Patch Available  (was: Open)

 Improvement : for error message when describe misses alias
 --

 Key: PIG-2181
 URL: https://issues.apache.org/jira/browse/PIG-2181
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Vivek Padmanabhan
Assignee: Vivek Padmanabhan
Priority: Minor
  Labels: newbie
 Fix For: 0.10

 Attachments: PIG-2181_1.patch


 In Pig 0.9, if I have a describe without an alias, it throws a 
 NullPointerException like below.
 ERROR 2999: Unexpected internal error. null
 java.lang.NullPointerException
 at 
 org.apache.pig.tools.grunt.GruntParser.processDescribe(GruntParser.java:270)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:317)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:188)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:164)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
 at org.apache.pig.Main.run(Main.java:553)
 at org.apache.pig.Main.main(Main.java:108)
 For example;
 describe;
 This message is of no use from a user's perspective, especially when my script 
 becomes large and I have added a couple of describe statements. 





[jira] [Created] (PIG-2184) Not able to provide positional reference to macro invocations

2011-07-21 Thread Vivek Padmanabhan (JIRA)
Not able to provide positional reference to macro invocations
-

 Key: PIG-2184
 URL: https://issues.apache.org/jira/browse/PIG-2184
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Vivek Padmanabhan


It looks like the macro functionality doesn't support positional references. 
The below is an example script;


DEFINE my_macro (X,key) returns Y
{
tmp1 = foreach  $X generate TOKENIZE((chararray)$key) as tokens;
tmp2 = foreach tmp1 generate flatten(tokens);
tmp3 = order tmp2 by $0;
$Y = distinct tmp3;
}

A = load 'sometext' using TextLoader() as (row1) ;
E = my_macro(A,A.$0);
dump E;


This script execution fails at the parsing stage itself;

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during 
parsing. file try1.pig, line 16,
column 16  mismatched input '.' expecting RIGHT_PAREN

If I replace A.$0 with the field name, i.e. row1, the script runs fine.
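As a workaround sketch (untested), the positional reference can be bound to a named field before the macro invocation, so that only a name is passed into the macro;

{code}
A = load 'sometext' using TextLoader() as (row1);
B = foreach A generate $0 as row1;  -- bind the position to a name first
E = my_macro(B, row1);
dump E;
{code}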





[jira] [Created] (PIG-2181) Improvement : for error message when describe misses alias

2011-07-20 Thread Vivek Padmanabhan (JIRA)
Improvement : for error message when describe misses alias
--

 Key: PIG-2181
 URL: https://issues.apache.org/jira/browse/PIG-2181
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Vivek Padmanabhan
Priority: Minor


In Pig 0.9, if I have a describe without an alias, it throws a 
NullPointerException like below.

ERROR 2999: Unexpected internal error. null

java.lang.NullPointerException
at 
org.apache.pig.tools.grunt.GruntParser.processDescribe(GruntParser.java:270)
at 
org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:317)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:188)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:164)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
at org.apache.pig.Main.run(Main.java:553)
at org.apache.pig.Main.main(Main.java:108)


For example;
describe;

This message is of no use from a user's perspective, especially when my script 
becomes large and I have added a couple of describe statements. 





[jira] [Updated] (PIG-2147) Support nested tags for XMLLoader

2011-07-11 Thread Vivek Padmanabhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vivek Padmanabhan updated PIG-2147:
---

Attachment: PIG-2147_1.patch

Attaching an initial patch.

 Support nested tags for XMLLoader
 -

 Key: PIG-2147
 URL: https://issues.apache.org/jira/browse/PIG-2147
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.1, 0.9.0
Reporter: Vivek Padmanabhan
Assignee: Vivek Padmanabhan
 Fix For: 0.8.1, 0.9.0

 Attachments: PIG-2147_1.patch


 Currently XMLLoader does not support nested tags with the same tag name, i.e. if I 
 have the below content
 {code}
 <event>
  <relatedEvents>
    <event>x</event>
    <event>y</event>
    <event>z</event>
  </relatedEvents>
 </event>
 {code}
 And I load the above using XMLLoader,
 events = load 'input' using 
 org.apache.pig.piggybank.storage.XMLLoader('event') as (doc:chararray);
 The output will be,
 {code}
 <event>
  <relatedEvents>
    <event>x</event>
 {code}
 Whereas the desired output is;
 {code}
  <relatedEvents>
    <event>x</event>
    <event>y</event>
    <event>z</event>
  </relatedEvents>
 {code}





[jira] [Updated] (PIG-2152) Null pointer exception while reporting progress

2011-07-08 Thread Vivek Padmanabhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vivek Padmanabhan updated PIG-2152:
---

Attachment: null_pointer_traces (copy)

Attaching a list of different traces

 Null pointer exception while reporting progress
 ---

 Key: PIG-2152
 URL: https://issues.apache.org/jira/browse/PIG-2152
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Olga Natkovich
 Fix For: 0.9.0

 Attachments: null_pointer_traces (copy)


 We have observed the following issues with code built from Pig 0.9 branch. We 
 have not seen this with earlier versions; however, since this happens once in 
 a while and is not reproducible at will it is not clear whether the issue is 
 specific to 0.9 or not.
 Here is the stack:
 java.lang.NullPointerException at
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.ProgressableReporter.progress(ProgressableReporter.java:37)
 at
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:399)
 at
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
 at
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:261)
  at
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:256)
  at
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:58)
  at
 org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at
 org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764) at
 org.apache.hadoop.mapred.MapTask.run(MapTask.java:370) at 
 org.apache.hadoop.mapred.Child$4.run(Child.java:261) at
 java.security.AccessController.doPrivileged(Native Method) at 
 javax.security.auth.Subject.doAs(Subject.java:396) at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
  at
 org.apache.hadoop.mapred.Child.main(Child.java:255) 
 Note that the code in progress function looks as follows:
 public void progress() {
 if(rep!=null)
 rep.progress();
 }
 This points to some sort of synchronization issue 
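The quoted progress() code is a classic check-then-act race: the field is read once for the null check and again for the call, and another thread can null it out in between. A small Python stand-in (hypothetical names, not the actual Pig classes) makes the failure mode and the usual fix, snapshotting the field into a local, concrete:

```python
class DummyReporter:
    def progress(self):
        pass

class FlakyHolder:
    """Simulates a reporter field that another thread nulls out
    between two reads (a deterministic stand-in for the race)."""
    def __init__(self):
        self._reads = 0

    @property
    def rep(self):
        self._reads += 1
        # first read (the null check) sees a reporter,
        # second read (the call) sees None -- the race window
        return DummyReporter() if self._reads == 1 else None

def progress_unsafe(holder):
    if holder.rep is not None:   # the check reads the field once...
        holder.rep.progress()    # ...the call re-reads it and can hit None

def progress_safe(holder):
    local = holder.rep           # snapshot the field a single time
    if local is not None:
        local.progress()
```

In the Java code the equivalent fix would be copying rep into a local variable (or synchronizing) before the null check and the call.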





[jira] [Commented] (PIG-2152) Null pointer exception while reporting progress

2011-07-08 Thread Vivek Padmanabhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13061891#comment-13061891
 ] 

Vivek Padmanabhan commented on PIG-2152:


Even though it is not reproducible at will, the exception is happening randomly and quite 
frequently. 
From most of the failed jobs it looks like this happens towards the end of 
the task execution. Till now, it has been seen only in map tasks.
The below is one NullPointerException from progress() with a different call hierarchy;

at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.ProgressableReporter.progress(ProgressableReporter.java:37)
at 
org.apache.pig.data.DefaultAbstractBag.reportProgress(DefaultAbstractBag.java:369)
at 
org.apache.pig.data.DefaultDataBag$DefaultDataBagIterator.next(DefaultDataBag.java:165)
at 
org.apache.pig.data.DefaultDataBag$DefaultDataBagIterator.hasNext(DefaultDataBag.java:157)
at org.apache.pig.data.BinInterSedes.writeBag(BinInterSedes.java:522)
at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:361)
at org.apache.pig.data.BinInterSedes.writeTuple(BinInterSedes.java:542)
at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:357)
at org.apache.pig.data.BinInterSedes.writeTuple(BinInterSedes.java:542)
at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:357)
at org.apache.pig.data.BinSedesTuple.write(BinSedesTuple.java:57)
at 
org.apache.pig.impl.io.PigNullableWritable.write(PigNullableWritable.java:123)
at
org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
at
org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1069)
at 
org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
at 
org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:124)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:263)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:256)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:58)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.Child$4.run(Child.java:261)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.mapred.Child.main(Child.java:255)

 Null pointer exception while reporting progress
 ---

 Key: PIG-2152
 URL: https://issues.apache.org/jira/browse/PIG-2152
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Olga Natkovich
 Fix For: 0.9.0

 Attachments: null_pointer_traces (copy)


 We have observed the following issues with code built from Pig 0.9 branch. We 
 have not seen this with earlier versions; however, since this happens once in 
 a while and is not reproducible at will it is not clear whether the issue is 
 specific to 0.9 or not.
 Here is the stack:
 java.lang.NullPointerException at
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.ProgressableReporter.progress(ProgressableReporter.java:37)
 at
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:399)
 at
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
 at
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:261)
  at
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:256)
  at
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:58)
  at
 org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at
 org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764) at
 

[jira] [Commented] (PIG-2036) Set header delimiter in PigStorageSchema

2011-06-28 Thread Vivek Padmanabhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13056330#comment-13056330
 ] 

Vivek Padmanabhan commented on PIG-2036:


Thanks Dmitriy, 0.9 will do. The same patch applies to the 0.9 branch as well.

 Set header delimiter in PigStorageSchema
 

 Key: PIG-2036
 URL: https://issues.apache.org/jira/browse/PIG-2036
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0, 0.8.0
Reporter: Mads Moeller
Assignee: Mads Moeller
Priority: Minor
 Fix For: 0.10

 Attachments: PIG-2036.1.patch, PIG-2036.patch


 Piggybanks' PigStorageSchema currently defaults the delimiter to a tab in the 
 generated header file (.pig_header).
 The attached patch set the header delimiter to what is passed in via the 
 constructor. Otherwise it'll default to tab '\t'.
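If the patch works the way the description suggests, usage would presumably mirror PigStorage's constructor argument (a hypothetical sketch, not taken from the patch itself):

{code}
-- header file written with ',' instead of the default tab
store a into 'out' using org.apache.pig.piggybank.storage.PigStorageSchema(',');
{code}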





[jira] [Created] (PIG-2146) POStore.getSchema() returns null because of which PigOutputCommitter is not storing schema while cleanup

2011-06-28 Thread Vivek Padmanabhan (JIRA)
POStore.getSchema() returns null because of which PigOutputCommitter is not 
storing schema while cleanup


 Key: PIG-2146
 URL: https://issues.apache.org/jira/browse/PIG-2146
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.1, 0.9.0
Reporter: Vivek Padmanabhan


The below is my script;
{code}
register piggybank.jar;
a = load 'myinput' using PigStorage(',') as 
(f1:chararray,f2:chararray,f3:chararray);
b = distinct a;
c = limit b 2;
store c into 'pss001' using org.apache.pig.piggybank.storage.PigStorageSchema();
{code}

Input
---
a,1,aa
b,2,bb
c,3,cc


For this script, PigStorageSchema is not generating the .pig_header and 
.pig_schema files. While debugging I could see that the storeSchema(..) method 
itself is not invoked. The schema object for the store is returned as null 
(POStore.getSchema()), because of which PigOutputCommitter is not invoking 
storeSchema.

The same schema object is valid when I run it in local mode. This issue is 
happening in Pig 0.9 also.







[jira] [Created] (PIG-2147) Support nested tags for XMLLoader

2011-06-28 Thread Vivek Padmanabhan (JIRA)
Support nested tags for XMLLoader
-

 Key: PIG-2147
 URL: https://issues.apache.org/jira/browse/PIG-2147
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.1, 0.9.0
Reporter: Vivek Padmanabhan
Assignee: Vivek Padmanabhan


Currently XMLLoader does not support nested tags with the same tag name, i.e. if I 
have the below content

{code}
<event>
 <relatedEvents>
   <event>x</event>
   <event>y</event>
   <event>z</event>
 </relatedEvents>
</event>
{code}

And I load the above using XMLLoader,
events = load 'input' using org.apache.pig.piggybank.storage.XMLLoader('event') 
as (doc:chararray);


The output will be,
{code}
<event>
 <relatedEvents>
   <event>x</event>
{code}

Whereas the desired output is;
{code}
 <relatedEvents>
   <event>x</event>
   <event>y</event>
   <event>z</event>
 </relatedEvents>
{code}






[jira] [Commented] (PIG-2036) Set header delimiter in PigStorageSchema

2011-06-27 Thread Vivek Padmanabhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055404#comment-13055404
 ] 

Vivek Padmanabhan commented on PIG-2036:


Hi Dmitriy,
 Can we have this patch checked in for the older versions also, like 0.8 and 0.9?

 Set header delimiter in PigStorageSchema
 

 Key: PIG-2036
 URL: https://issues.apache.org/jira/browse/PIG-2036
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0, 0.8.0
Reporter: Mads Moeller
Assignee: Mads Moeller
Priority: Minor
 Fix For: 0.10

 Attachments: PIG-2036.1.patch, PIG-2036.patch


 Piggybanks' PigStorageSchema currently defaults the delimiter to a tab in the 
 generated header file (.pig_header).
 The attached patch set the header delimiter to what is passed in via the 
 constructor. Otherwise it'll default to tab '\t'.





[jira] [Resolved] (PIG-2135) Pig 0.9 ignoring Multiple filter conditions joined with AND/OR

2011-06-22 Thread Vivek Padmanabhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vivek Padmanabhan resolved PIG-2135.


Resolution: Invalid

 Pig 0.9 ignoring Multiple filter conditions joined with AND/OR 
 ---

 Key: PIG-2135
 URL: https://issues.apache.org/jira/browse/PIG-2135
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Vivek Padmanabhan
Assignee: Thejas M Nair
Priority: Critical
 Fix For: 0.9.0


 When I have multiple filter statements joined by AND/OR , except for the 
 first condition all other conditions are ignored.
 For example in the below script the second condition (org.udfs.Func09('e',w) 
 == 1) is ignored ;
 a = load 'sample_input' using PigStorage(',')  as (q:chararray,w:chararray);
 b = filter a by org.udfs.Func09('f1',q) == 1  AND  org.udfs.Func09('e',w) == 
 1 ;
 dump b;
 Output from the script
 (f1,a)
 (f1,e)  -- this record should have been filtered by the second condition
 Input for the script;
 f1,a
 f2,b
 f3,c
 f1,e
 f2,f
 f5,e
 The explain of the alias b shows that the second condition is not included in 
 the plan itself.
 The above statements work fine with Pig 0.8.





[jira] [Commented] (PIG-2135) Pig 0.9 ignoring Multiple filter conditions joined with AND/OR

2011-06-22 Thread Vivek Padmanabhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053380#comment-13053380
 ] 

Vivek Padmanabhan commented on PIG-2135:


Hi Thejas,
I was using an older version of Pig 0.9. This issue is not present in the 
latest code. Sorry about that.



 Pig 0.9 ignoring Multiple filter conditions joined with AND/OR 
 ---

 Key: PIG-2135
 URL: https://issues.apache.org/jira/browse/PIG-2135
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Vivek Padmanabhan
Assignee: Thejas M Nair
Priority: Critical
 Fix For: 0.9.0


 When I have multiple filter statements joined by AND/OR , except for the 
 first condition all other conditions are ignored.
 For example in the below script the second condition (org.udfs.Func09('e',w) 
 == 1) is ignored ;
 a = load 'sample_input' using PigStorage(',')  as (q:chararray,w:chararray);
 b = filter a by org.udfs.Func09('f1',q) == 1  AND  org.udfs.Func09('e',w) == 
 1 ;
 dump b;
 Output from the script
 (f1,a)
 (f1,e)  -- this record should have been filtered by the second condition
 Input for the script;
 f1,a
 f2,b
 f3,c
 f1,e
 f2,f
 f5,e
 The explain of the alias b shows that the second condition is not included in 
 the plan itself.
 The above statements work fine with Pig 0.8.





[jira] [Commented] (PIG-2135) Pig 0.9 ignoring Multiple filter conditions joined with AND/OR

2011-06-21 Thread Vivek Padmanabhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13052414#comment-13052414
 ] 

Vivek Padmanabhan commented on PIG-2135:


The UDF used in the above example (the archive stripped the generics; restored below);

{code}
public class Func09 extends EvalFunc<Integer> {
    @Override
    public Integer exec(Tuple input) throws IOException {
        String field1 = (String) input.get(0);
        String field2 = (String) input.get(1);
        return field1.equals(field2) ? 1 : 0;
    }
}
{code}

 Pig 0.9 ignoring Multiple filter conditions joined with AND/OR 
 ---

 Key: PIG-2135
 URL: https://issues.apache.org/jira/browse/PIG-2135
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Vivek Padmanabhan
Priority: Critical
 Fix For: 0.9.0


 When I have multiple filter statements joined by AND/OR , except for the 
 first condition all other conditions are ignored.
 For example in the below script the second condition (org.udfs.Func09('e',w) 
 == 1) is ignored ;
 a = load 'sample_input' using PigStorage(',')  as (q:chararray,w:chararray);
 b = filter a by org.udfs.Func09('f1',q) == 1  AND  org.udfs.Func09('e',w) == 
 1 ;
 dump b;
 Output from the script
 (f1,a)
 (f1,e)  -- this record should have been filtered by the second condition
 Input for the script;
 f1,a
 f2,b
 f3,c
 f1,e
 f2,f
 f5,e
 The explain of the alias b shows that the second condition is not included in 
 the plan itself.
 The above statements work fine with Pig 0.8.





[jira] [Updated] (PIG-2130) Piggybank:MultiStorage is not compressing output files

2011-06-20 Thread Vivek Padmanabhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vivek Padmanabhan updated PIG-2130:
---

Attachment: PIG-2130_1.patch

Attaching an initial patch

 Piggybank:MultiStorage is not compressing output files
 --

 Key: PIG-2130
 URL: https://issues.apache.org/jira/browse/PIG-2130
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0, 0.9.0
Reporter: Vivek Padmanabhan
Assignee: Vivek Padmanabhan
 Attachments: PIG-2130_1.patch


 MultiStorage is not compressing the records while writing the output. Even 
 though it takes a compression param,  when the record is written it ignores 
 the compression.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (PIG-2130) Piggybank:MultiStorage is not compressing output files

2011-06-20 Thread Vivek Padmanabhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13051841#comment-13051841
 ] 

Vivek Padmanabhan commented on PIG-2130:


Please note that, if compression is used, then the subfolders and output files 
will have the corresponding extension.
For example, if output001.bz2 is the output path and f1,f2 are the keys, the files 
will look like;

/tmp/output001.bz2 
    /tmp/output001.bz2/f1.bz2
        /tmp/output001.bz2/f1.bz2/f1-0.bz2
    /tmp/output001.bz2/f2.bz2
        /tmp/output001.bz2/f2.bz2/f2-0.bz2
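For context, the store call that exercises this path would look roughly like the following (a sketch; the constructor arguments -- output directory, index of the split field, compression codec, and field delimiter -- follow my reading of the piggybank MultiStorage javadoc):

{code}
register piggybank.jar;
a = load 'myinput' using PigStorage(',') as (key:chararray, f2:chararray);
-- one subfolder per distinct value of field 0, bz2-compressed
store a into '/tmp/output001.bz2' using
    org.apache.pig.piggybank.storage.MultiStorage('/tmp/output001.bz2', '0', 'bz2', ',');
{code}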


 Piggybank:MultiStorage is not compressing output files
 --

 Key: PIG-2130
 URL: https://issues.apache.org/jira/browse/PIG-2130
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0, 0.9.0
Reporter: Vivek Padmanabhan
Assignee: Vivek Padmanabhan
 Attachments: PIG-2130_1.patch


 MultiStorage is not compressing the records while writing the output. Even 
 though it takes a compression param,  when the record is written it ignores 
 the compression.





[jira] [Updated] (PIG-2130) Piggybank:MultiStorage is not compressing output files

2011-06-20 Thread Vivek Padmanabhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vivek Padmanabhan updated PIG-2130:
---

Fix Version/s: 0.8.0
   0.9.0
   Status: Patch Available  (was: Open)

 Piggybank:MultiStorage is not compressing output files
 --

 Key: PIG-2130
 URL: https://issues.apache.org/jira/browse/PIG-2130
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0, 0.9.0
Reporter: Vivek Padmanabhan
Assignee: Vivek Padmanabhan
 Fix For: 0.9.0, 0.8.0

 Attachments: PIG-2130_1.patch


 MultiStorage is not compressing the records while writing the output. Even 
 though it takes a compression param,  when the record is written it ignores 
 the compression.





[jira] [Created] (PIG-2130) Piggybank:MultiStorage is not compressing output files

2011-06-17 Thread Vivek Padmanabhan (JIRA)
Piggybank:MultiStorage is not compressing output files
--

 Key: PIG-2130
 URL: https://issues.apache.org/jira/browse/PIG-2130
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0, 0.9.0
Reporter: Vivek Padmanabhan
Assignee: Vivek Padmanabhan


MultiStorage is not compressing the records while writing the output. Even 
though it takes a compression param,  when the record is written it ignores the 
compression.






[jira] [Updated] (PIG-2098) jython - problem with single item tuple in bag

2011-05-26 Thread Vivek Padmanabhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vivek Padmanabhan updated PIG-2098:
---

Description: 
While using a Python (jython) udf, if I create a tuple with a single field, Pig execution 
fails with a ClassCastException.

Caused by: java.io.IOException: Error executing function: 
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Cannot convert 
jython type to pig datatype java.lang.ClassCastException: java.lang.String 
cannot be cast to org.apache.pig.data.Tuple
at 
org.apache.pig.scripting.jython.JythonFunction.exec(JythonFunction.java:111)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:245)


An example to reproduce the issue;

Pig Script
{code}
register 'mapkeys.py' using jython as mapkeys;
A = load 'mapkeys.data' using PigStorage() as ( aMap: map[] );
C = foreach A generate mapkeys.keys(aMap);
dump C;
{code}


mapkeys.py
{code}
@outputSchema("keys:bag{t:tuple(key:chararray)}")
def keys(map):
  print "mapkeys.py:keys:map:", map
  outBag = []
  for key in map.iterkeys():
    t = (key) ## doesn't work, causes Pig to crash
    #t = (key,) ## adding empty value works :-/
    outBag.append(t)
  print "mapkeys.py:keys:outBag:", outBag
  return outBag
{code}

Input data 'mapkeys.data'
[name#John,phone#5551212]


In the udf, with t = (key) the item placed inside the bag is treated as a 
string instead of a tuple, which causes the ClassCastException.
If I provide an additional comma, t = (key,), then the script goes through 
fine.


From the code what I can see is that, for t = (key,), pythonToPig(..) receives 
the pyObject as [(u'name',), (u'phone',)] from the PyFunction call.
But for t = (key) the return from the PyFunction call is [u'name', u'phone'].
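The jython behaviour here is just standard Python: parentheses alone only group an expression, while the trailing comma is what makes a one-element tuple. A quick illustration:

```python
key = "name"

not_a_tuple = (key)    # parentheses just group: still the string itself
one_tuple = (key,)     # the trailing comma creates a 1-element tuple

print(type(not_a_tuple).__name__)  # str
print(type(one_tuple).__name__)    # tuple
```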



  was:
While using phython udf, if I create a tuple with a single field, Pig execution 
fails with ClassCastException.

Caused by: java.io.IOException: Error executing function: 
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Cannot convert 
jython type to pig datatype java.lang.ClassCastException: java.lang.String 
cannot be cast to org.apache.pig.data.Tuple
at 
org.apache.pig.scripting.jython.JythonFunction.exec(JythonFunction.java:111)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:245)


An example to reproduce the issuue ;





 jython - problem with single item tuple in bag
 --

 Key: PIG-2098
 URL: https://issues.apache.org/jira/browse/PIG-2098
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.1, 0.9.0
Reporter: Vivek Padmanabhan

 While using phython udf, if I create a tuple with a single field, Pig 
 execution fails with ClassCastException.
 Caused by: java.io.IOException: Error executing function: 
 org.apache.pig.backend.executionengine.ExecException: ERROR 0: Cannot convert 
 jython type to pig datatype java.lang.ClassCastException: java.lang.String 
 cannot be cast to org.apache.pig.data.Tuple
   at 
 org.apache.pig.scripting.jython.JythonFunction.exec(JythonFunction.java:111)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:245)
 An example to reproduce the issuue ;
 Pig Script
 {code}
 register 'mapkeys.py' using jython as mapkeys;
 A = load 'mapkeys.data' using PigStorage() as ( aMap: map[] );
 C = foreach A generate mapkeys.keys(aMap);
 dump C;
 {code}
 mapkeys.py
 {code}
 @outputSchema(keys:bag{t:tuple(key:chararray)})
 def keys(map):
   print mapkeys.py:keys:map:, map
   outBag = []
   for key in map.iterkeys():
 t = (key) ## doesn't work, causes Pig to crash
 #t = (key,) ## adding empty value works :-/
 outBag.append(t)
   print mapkeys.py:keys:outBag:, outBag
   return outBag
 {code}
 Input data 'mapkeys.data'
 [name#John,phone#5551212]
 In the UDF, t = (key) causes the item inside the bag to be treated as a 
 string instead of a tuple, which leads to the class cast exception.
 If I provide a trailing comma, t = (key,), then the script goes through 
 fine.
 From the code, for t = (key,) pythonToPig(..) receives the pyObject as 
 [(u'name',), (u'phone',)] from the PyFunction call, but for t = (key) the 
 return from the PyFunction call is [u'name', u'phone'].
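The root cause is plain Python semantics, which Jython inherits: parentheses alone do not create a tuple; the trailing comma does. A minimal sketch:

```python
key = "name"

# (key) is just a parenthesized expression; the parentheses are dropped,
# so the value stays a plain string and Pig cannot cast it to a Tuple.
not_a_tuple = (key)

# (key,) -- the trailing comma -- is what actually builds a 1-element tuple.
real_tuple = (key,)

assert not_a_tuple == "name" and isinstance(not_a_tuple, str)
assert real_tuple == ("name",) and isinstance(real_tuple, tuple)
```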

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-2021) Parser error while referring a map nested foreach

2011-05-10 Thread Vivek Padmanabhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13031114#comment-13031114
 ] 

Vivek Padmanabhan commented on PIG-2021:


Hi Xuefu,
 Extremely sorry about that. I was just trying to remove the dependencies. Please 
check whether the below is a valid case:

{code}
register mymapudf.jar;
A = load 'temp' as ( s, m, l );
B = foreach A generate *, org.vivek.udfs.mToMapUDF((chararray) s) as mapout;
C = foreach B {
  urlpath = (chararray) mapout#'k1';
  lc_urlpath = org.vivek.udfs.LOWER((chararray) urlpath);
  generate urlpath,lc_urlpath;
};
{code}


Source for org.vivek.udfs.mToMapUDF
{code}
package org.vivek.udfs;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class mToMapUDF extends EvalFunc<Map<String, Object>> {
    public Map<String, Object> exec(Tuple arg0) throws IOException {
        Map<String, Object> myMapTResult = new HashMap<String, Object>();
        myMapTResult.put("k1", "SomeString");
        myMapTResult.put("k3", "SomeOtherString");
        return myMapTResult;
    }
    public Schema outputSchema(Schema input) {
        return new Schema(new Schema.FieldSchema("mapout", DataType.MAP));
    }
}
{code}



Source for org.vivek.udfs.LOWER
{code}
package org.vivek.udfs;
import java.io.IOException;
import java.util.List;
import java.util.ArrayList;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.DataType;
import org.apache.pig.impl.logicalLayer.schema.Schema;
import org.apache.pig.impl.logicalLayer.FrontendException;
import org.apache.pig.FuncSpec;

public class LOWER extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toLowerCase();
        } catch (Exception e) {
            return null;
        }
    }
    public Schema outputSchema(Schema input) {
        return new Schema(new Schema.FieldSchema(
                getSchemaName(this.getClass().getName().toLowerCase(), input),
                DataType.CHARARRAY));
    }
    public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
        List<FuncSpec> funcList = new ArrayList<FuncSpec>();
        funcList.add(new FuncSpec(this.getClass().getName(),
                new Schema(new Schema.FieldSchema(null, DataType.CHARARRAY))));
        return funcList;
    }
}
{code}



 Parser error while referring a map nested foreach
 -

 Key: PIG-2021
 URL: https://issues.apache.org/jira/browse/PIG-2021
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Vivek Padmanabhan
Assignee: Xuefu Zhang
 Fix For: 0.9.0


 The below script is throwing parser errors
 {code}
 register string.jar;
 A = load 'test1'  using MapLoader() as ( s, m, l );   
 B = foreach A generate *, string.URLPARSE((chararray) s#'url') as parsedurl;
 C = foreach B {
   urlpath = (chararray) parsedurl#'path';
   lc_urlpath = string.TOLOWERCASE((chararray) urlpath);
   generate *;
 };
 {code}
 Error message;
 | Failed to generate logical plan.
 |Nested exception: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 
 2225: Projection with nothing to reference!
 PIG-2002 reports a similar issue, but when I tried with the patch of PIG-2002 
 I was getting the below exception:
  ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: file repro.pig, line 
 11, column 33  mismatched input '(' expecting SEMI_COLON

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-2021) Parser error while referring a map nested foreach

2011-05-09 Thread Vivek Padmanabhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13030682#comment-13030682
 ] 

Vivek Padmanabhan commented on PIG-2021:


Attaching a script that avoids all the dependencies:

{code}
A = load 'temp' as ( s, m, l );
B = foreach A generate *, LOWER((chararray) s#'url') as parsedurl;
C = foreach B {
  urlpath = (chararray) parsedurl#'path';
  lc_urlpath = org.apache.pig.piggybank.evaluation.string.Reverse((chararray) 
urlpath);
  generate *;
};
{code}

 Parser error while referring a map nested foreach
 -

 Key: PIG-2021
 URL: https://issues.apache.org/jira/browse/PIG-2021
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Vivek Padmanabhan
Assignee: Xuefu Zhang
 Fix For: 0.9.0


 The below script is throwing parser errors
 {code}
 register string.jar;
 A = load 'test1'  using MapLoader() as ( s, m, l );   
 B = foreach A generate *, string.URLPARSE((chararray) s#'url') as parsedurl;
 C = foreach B {
   urlpath = (chararray) parsedurl#'path';
   lc_urlpath = string.TOLOWERCASE((chararray) urlpath);
   generate *;
 };
 {code}
 Error message;
 | Failed to generate logical plan.
 |Nested exception: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 
 2225: Projection with nothing to reference!
 PIG-2002 reports a similar issue, but when I tried with the patch of PIG-2002 
 I was getting the below exception:
  ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: file repro.pig, line 
 11, column 33  mismatched input '(' expecting SEMI_COLON

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (PIG-2046) Properties defined through 'SET' are not passed through to fs commands

2011-05-06 Thread Vivek Padmanabhan (JIRA)
Properties defined through 'SET' are not passed through to fs commands
--

 Key: PIG-2046
 URL: https://issues.apache.org/jira/browse/PIG-2046
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0, 0.9.0
Reporter: Vivek Padmanabhan


The properties which are set through 'SET' commands are not passed through to 
FS commands.

Example:
SET dfs.umaskmode '026'
fs -touchz umasktest/file0

It looks like the SET commands are processed by GruntParser after FsShell has 
been created with the current set of properties. Hence properties defined via 
SET are not reflected for fs commands executed in the script.
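The ordering problem can be sketched in a few lines. The class and method names here are illustrative stand-ins, not Pig's actual GruntParser/FsShell API:

```python
# Minimal model of the bug: the shell snapshots the properties at
# construction time, so a SET processed afterwards never reaches it.
class FsShell:
    def __init__(self, props):
        self.props = dict(props)   # snapshot taken here

class GruntParser:
    def __init__(self, props):
        self.props = props
        self.shell = FsShell(props)    # created before any SET runs
    def run_set(self, key, value):
        self.props[key] = value        # too late for self.shell

parser = GruntParser({})
parser.run_set("dfs.umaskmode", "026")
assert parser.props["dfs.umaskmode"] == "026"
assert "dfs.umaskmode" not in parser.shell.props   # fs command won't see it
```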






--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (PIG-2004) Incorrect input types passed on to eval function

2011-04-20 Thread Vivek Padmanabhan (JIRA)
Incorrect input types passed on to eval function


 Key: PIG-2004
 URL: https://issues.apache.org/jira/browse/PIG-2004
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.9.0
Reporter: Vivek Padmanabhan
 Fix For: 0.9.0


The below script fails by throwing a ClassCastException from the MAX udf. The 
udf expects the values of the supplied bag to be DataByteArray, but at run time 
the udf gets the actual type, i.e. Double in this case. This causes the script 
execution to fail with the exception;

| Caused by: java.lang.ClassCastException: java.lang.Double cannot be cast to 
org.apache.pig.data.DataByteArray


The same script runs properly with Pig 0.8.



{code}
A = LOAD 'myinput' as (f1,f2,f3);
B = foreach A generate f1,f2+f3/1000.0 as doub;
C = group B by f1;
D = foreach C generate (long)(MAX(B.doub)) as f4;
dump D;
{code}

myinput
---
a   100012345
b   200023456
c   300034567
a   150054321
b   250065432



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (PIG-1979) New logical plan failing with ERROR 2229: Couldn't find matching uid -1

2011-04-08 Thread Vivek Padmanabhan (JIRA)
New logical plan failing with ERROR 2229: Couldn't find matching uid -1 


 Key: PIG-1979
 URL: https://issues.apache.org/jira/browse/PIG-1979
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0, 0.9.0
Reporter: Vivek Padmanabhan


The below is my script 
{code}
register myudf.jar;
c01 = LOAD 'input'  USING org.test.MyTableLoader('');
c02 = FILTER c01  BY result == 'OK'  AND formatted IS NOT NULL  AND formatted 
!= '' ;
c03 = FOREACH c02 GENERATE url, formatted, FLATTEN(usage);
c04 = FOREACH c03 GENERATE usage::domain AS domain, url, formatted;
doc_001 = FOREACH c04 GENERATE domain,url, FLATTEN(MyExtractor(formatted)) AS 
category;
doc_004_1 = GROUP doc_001 BY (domain,url);
doc_005 = FOREACH doc_004_1 GENERATE group.domain as domain, group.url as url, 
doc_001.category as category;
STORE doc_005 INTO 'out_final' USING PigStorage();

review1 = FOREACH c04 GENERATE domain,url, MyExtractor(formatted) AS rev;
review2 = FILTER review1 BY SIZE(rev) > 0;
joinresult = JOIN review2 by (domain,url), doc_005 by (domain,url);
finalresult = FOREACH joinresult GENERATE  doc_005::category;
STORE finalresult INTO 'out_final' using PigStorage();
{code}

The script fails while building the plan, when applying the logical 
optimization rule AddForEach.

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2229: Couldn't find matching uid 
-1 for project (Name: Project Type: bytearray Uid: 106 Input: 0 Column: 5)

The problem happens when I try to include doc_005::category in the 
projection for relation finalresult. This field originates from the udf 
org.vivek.udfs.MyExtractor (source given below).

{code}

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.*;
import org.apache.pig.impl.logicalLayer.FrontendException;
import org.apache.pig.impl.logicalLayer.schema.Schema;
import org.apache.pig.impl.logicalLayer.schema.Schema.FieldSchema;

public class MyExtractor extends EvalFunc<DataBag>
{
  @Override
public Schema outputSchema(Schema arg0) {
  try {
return Schema.generateNestedSchema(DataType.BAG, 
DataType.CHARARRAY);
} catch (FrontendException e) {
System.err.println("Error while generating schema. " + e);
return new Schema(new FieldSchema(null, DataType.BAG));
}
}

  @Override
  public DataBag exec(Tuple inputTuple)
throws IOException
  {
try {
  Tuple tp2 = TupleFactory.getInstance().newTuple(1);
  tp2.set(0, (inputTuple.get(0).toString()+inputTuple.hashCode()));
  DataBag retBag = BagFactory.getInstance().newDefaultBag();
  retBag.add(tp2);
  return retBag;
}
catch (Exception e) {
  throw new IOException("Caught exception", e);
}
  }
}

{code}

The script goes through fine if I disable AddForEach rule by -t AddForEach

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (PIG-1966) Document Store and load from the same location does not support globbing

2011-04-05 Thread Vivek Padmanabhan (JIRA)
Document Store and load from the same location does not support globbing


 Key: PIG-1966
 URL: https://issues.apache.org/jira/browse/PIG-1966
 Project: Pig
  Issue Type: Improvement
  Components: documentation
Affects Versions: 0.8.0
Reporter: Vivek Padmanabhan
Priority: Minor
 Fix For: 0.8.0


If in my script there is a Store and a load from the same location like below;
STORE A INTO '/user/myname/myoutputfolder';
D = LOAD '/user/myname/myoutputfolder/part*' ;

This will cause my script to fail. Pig requires the store and load locations to 
be exactly the same to realize that there is a dependency.
This behavior of Pig should be documented, preferably in
 http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Load%2FStore+Functions

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (PIG-1948) java.lang.ClassCastException while using double value from result of a group

2011-03-31 Thread Vivek Padmanabhan (JIRA)
java.lang.ClassCastException while using double value from result of a group


 Key: PIG-1948
 URL: https://issues.apache.org/jira/browse/PIG-1948
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0, 0.7.0, 0.9.0
Reporter: Vivek Padmanabhan


I have a fairly simple script (but with too many columns) which is failing with 
a class cast exception.


{code}
register myudf.jar;
A = load 'newinput' as (datestamp: chararray,vtestid: chararray,src_kt1: 
chararray,f1: chararray,f2: chararray,f3: chararray,f4: chararray,f5: 
chararray,f6: int,ipc: chararray,woeid: long,woeid_place: chararray,f7: 
chararray,f8: double,woeid_latitude: double,f9: chararray,woeid_town: 
chararray,woeid_county: chararray,a1: chararray,a2: chararray,woeid_country: 
chararray,a3: chararray,connection_speed: chararray,isp_name: 
chararray,isp_domain: chararray,ecnt: int,vcnt: int,ccnt: int,startts: 
int,duration: int,endts: int,stqust: chararray,startqc: chararray,starts_con: 
chararray,starts_lng: chararray,startv_pk1: int,startv_pk2: int,startv_pk3: 
int,startv_pk4: int,startv_pk5: int,lastquerystring: chararray,lastqc: 
chararray,lasts_con: chararray,lasts_lng: chararray,lastv_pk1: int,lastv_pk2: 
int,lastv_pk3: int,lastv_pk4: int,lastv_pk5: int,b1: chararray,lastsection: 
chararray,lastseclink: chararray,lasturl: chararray,path: chararray,pathtype: 
chararray,firstlastquerymatch: int,log_duration: double,log_duration_sq: 
double,duration_sq: double);

B = foreach A generate  
datestamp,src_kt1,vtestid,stqust,ecnt,vcnt,ccnt,log_duration,duration;
C = group B by ( datestamp, src_kt1,vtestid, stqust ) parallel 4;
D = foreach C generate COUNT( B ) as total, MyEval( B.log_duration ) as 
log_duration_summary;
store D into 'output';

{code}

The above script is failing with class cast exception;

{code}
java.lang.ClassCastException: java.lang.Double cannot be cast to 
java.lang.String
at org.apache.pig.data.BinInterSedes.readMap(BinInterSedes.java:193)
at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:280)
at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:251)
at org.apache.pig.data.BinInterSedes.readTuple(BinInterSedes.java:111)
at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:270)
at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:251)
at 
org.apache.pig.data.BinInterSedes.addColsToTuple(BinInterSedes.java:555)
at org.apache.pig.data.BinSedesTuple.readFields(BinSedesTuple.java:64)
at 
org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
at 
org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
at 
org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
at 
org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
at 
org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
at 
org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1376)
.
.
{code}

The problem happens in the line MyEval( B.log_duration ): even though 
log_duration is defined as a double field, BinInterSedes considers it a map 
value, TINYMAP to be exact. Hence it tries to cast the double value into the 
key identifier, i.e. a String. This bug exists in 0.9 also.
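The failure mode (a value deserialized under the wrong type tag) can be illustrated with a small sketch. This is not the real BinInterSedes wire format; the tag values and layout below are invented for illustration:

```python
import struct

# Hypothetical one-byte type tags, loosely modeled on tagged serialization
# schemes like BinInterSedes (the real tag values and layout differ).
TAG_DOUBLE = 1
TAG_TINYMAP = 2

def write_double(value):
    # big-endian: 1-byte tag followed by an 8-byte IEEE double
    return struct.pack(">bd", TAG_DOUBLE, value)

def read_datum(buf, expected_tag=None):
    (tag,) = struct.unpack_from(">b", buf, 0)
    if expected_tag is not None and tag != expected_tag:
        # The shape of the reported failure: the reader believes the datum
        # is a map and tries to interpret the payload as a String key.
        raise TypeError("wrote tag %d, reader expected tag %d"
                        % (tag, expected_tag))
    (value,) = struct.unpack_from(">d", buf, 1)
    return value

buf = write_double(12.5)
assert read_datum(buf) == 12.5
try:
    read_datum(buf, expected_tag=TAG_TINYMAP)   # misread as a map
    raise AssertionError("expected a type mismatch")
except TypeError:
    pass
```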

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Commented: (PIG-1911) Infinite loop with accumulator function in nested foreach

2011-03-16 Thread Vivek Padmanabhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13007815#comment-13007815
 ] 

Vivek Padmanabhan commented on PIG-1911:


In this case Pig calls the getValue() and cleanup() methods infinitely. Below 
is the UDF source just in case:
{code}
public class MyCOUNT extends EvalFunc<Long> implements Accumulator<Long> {
    @Override
    public Long exec(Tuple input) throws IOException {
        DataBag bag = (DataBag) input.get(0);
        Iterator<Tuple> it = bag.iterator();
        long cnt = 0;
        while (it.hasNext()) {
            Tuple t = it.next();
            if (t != null && t.size() > 0 && t.get(0) != null)
                cnt++;
        }
        return cnt;
    }

    @Override
    public Schema outputSchema(Schema input) {
        return new Schema(new Schema.FieldSchema(null, DataType.LONG));
    }

    private long intermediateCount = 0L;

    @Override
    public void accumulate(Tuple b) throws IOException {
        DataBag bag = (DataBag) b.get(0);
        Iterator<Tuple> it = bag.iterator();
        while (it.hasNext()) {
            Tuple t = it.next();
            if (t != null && t.size() > 0 && t.get(0) != null) {
                intermediateCount += 1;
            }
        }
    }

    @Override
    public void cleanup() {
        intermediateCount = 0L;
    }

    @Override
    public Long getValue() {
        return intermediateCount;
    }
}
{code}
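For context, here is a hypothetical driver loop for the Accumulator contract the UDF implements (names and call order are illustrative, not Pig's internal code): accumulate() is fed batches of tuples, getValue() is read once per group, and cleanup() resets state for the next group. The reported bug is that, in a nested foreach, Pig keeps calling getValue()/cleanup() without terminating:

```python
# Illustrative model of the Accumulator protocol described above.
class MyCount:
    def __init__(self):
        self.intermediate = 0
    def accumulate(self, batch):
        # count non-empty tuples, mirroring the Java accumulate()
        self.intermediate += sum(1 for t in batch if t)
    def get_value(self):
        return self.intermediate
    def cleanup(self):
        self.intermediate = 0

def run_group(func, batches):
    # the well-behaved call sequence: N accumulates, one getValue, one cleanup
    for batch in batches:
        func.accumulate(batch)
    result = func.get_value()
    func.cleanup()       # must be called exactly once per group
    return result

f = MyCount()
assert run_group(f, [[("a",)], [("b",), ("c",)]]) == 3
assert f.get_value() == 0   # state reset by cleanup
```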

 Infinite loop with accumulator function in nested foreach
 -

 Key: PIG-1911
 URL: https://issues.apache.org/jira/browse/PIG-1911
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Olga Natkovich
Assignee: Thejas M Nair
 Fix For: 0.8.0


 Sample script:
 register v_udf.jar;
 a = load '2records' as (f1:chararray,f2:chararray);
 b = group a by f1;
 d = foreach b { sort = order a by f1; 
   generate org.udfs.MyCOUNT(sort) as something ; }
 dump d;
 This causes an infinite loop if MyCOUNT implements the Accumulator interface.
 The workaround is to take the function out of the nested foreach into a separate 
 foreach statement.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Created: (PIG-1902) Documentation : Flatten behaviour should be updated in 0.7 docs

2011-03-14 Thread Vivek Padmanabhan (JIRA)
Documentation : Flatten behaviour should be updated in 0.7 docs
---

 Key: PIG-1902
 URL: https://issues.apache.org/jira/browse/PIG-1902
 Project: Pig
  Issue Type: Improvement
  Components: documentation
Affects Versions: 0.7.0
Reporter: Vivek Padmanabhan
Priority: Minor
 Fix For: 0.7.0


In 0.8 documentation 
,http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Flatten+Operator
the behavior of flatten for empty bags is well documented. 

{code}
Also note that the flatten of empty bag will result in that row being 
discarded; no output is generated.
{code}

Since this is applicable for Pig 0.7 also, the same should be documented in :
http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#Flatten+Operator

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Commented: (PIG-1894) Wrong stats shown when there are multiple stores but same file names

2011-03-13 Thread Vivek Padmanabhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13006328#comment-13006328
 ] 

Vivek Padmanabhan commented on PIG-1894:


Just for reference: the related issue is PIG-1779 (Wrong stats shown when there 
are multiple loads but same file names).

 Wrong stats shown when there are multiple stores but same file names
 

 Key: PIG-1894
 URL: https://issues.apache.org/jira/browse/PIG-1894
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0, 0.9.0
Reporter: Vivek Padmanabhan
Assignee: Richard Ding
 Fix For: 0.9.0


 Pig 0.8/0.9 shows wrong stats for store counters when I have multiple stores 
 with the same file name.
 To reproduce the issue please use the below script :
 {code}
 A = load 'sampledata1' as (f1:chararray,f2:chararray,f3:int);
 B = filter A by f3==1;
 C = filter A by f3==2;
 D = filter A by f3==3;
 store B into '/folder/B/out.gz';
 store C into '/folder/C/out.gz';
 store D into '/folder/D/out.gz';
 {code}
 Input 
 {code}
 aaa a   1
 aaa b   1
 bbb a   2
 bbb b   2
 ccc a   3
 ccc b   3
 {code}
 For this script Pig shows 
 Output(s):
 Successfully stored 6 records (32 bytes) in: /folder/B/out.gz
 Successfully stored 6 records (32 bytes) in: /folder/C/out.gz
 Successfully stored 6 records (32 bytes) in: /folder/D/out.gz
 Counters:
 Total records written : 18
 Total bytes written : 96

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Created: (PIG-1895) Class cast exception while projecting udf result

2011-03-11 Thread Vivek Padmanabhan (JIRA)
Class cast exception while projecting udf result


 Key: PIG-1895
 URL: https://issues.apache.org/jira/browse/PIG-1895
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0, 0.7.0, 0.9.0
Reporter: Vivek Padmanabhan


A class cast exception is thrown when I try to project the result from my udf. 
The udf has a defined schema of DataType.BAG, DataType.LONG and DataType.INTEGER.

The below is my script
{code}
Data = load 'file:/home/pvivek/Desktop/input' using PigStorage() as ( i: int );
AllData = group Data all parallel 1;
SampledData = foreach AllData generate org.vivek.TestEvalFunc(Data, 5) as rs;
SampledData1 = foreach SampledData generate rs.sampled;
{code}

Even though the output schema defines "sampled" as a data bag, during 
processing the entire tuple, instead of only the data bag generated by the 
UDF, was sent to the projection as the result.
{code}
Exception received:
java.lang.ClassCastException: org.apache.pig.data.BinSedesTuple cannot be cast 
to org.apache.pig.data.DataBag
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:484)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:197)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:480)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:197)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:339)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:434)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:402)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:382)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:1)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
at 
org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
{code}
This issue happens with 0.9, 0.8 and 0.7.



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Commented: (PIG-1895) Class cast exception while projecting udf result

2011-03-11 Thread Vivek Padmanabhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13005604#comment-13005604
 ] 

Vivek Padmanabhan commented on PIG-1895:


UDF Source code :
{code}
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DataType;
import org.apache.pig.data.DefaultBagFactory;
import org.apache.pig.data.DefaultTupleFactory;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class TestEvalFunc extends EvalFunc<Tuple> {

    public Tuple exec(Tuple input) throws IOException {
        ArrayList<Tuple> tupleList = new ArrayList<Tuple>(2);
        DataBag values = (DataBag) (input.get(0));
        for (Iterator<Tuple> vit = values.iterator(); vit.hasNext();) {
            tupleList.add(vit.next());
        }
        DataBag sampleBag =
                DefaultBagFactory.getInstance().newDefaultBag(tupleList);
        Tuple output = DefaultTupleFactory.getInstance().newTuple(3);
        output.set(0, sampleBag);
        output.set(1, new Long(3));
        output.set(2, new Integer(2));
        return output;
    }

    public Schema outputSchema(Schema input) {
        Schema udfSchema = new Schema();
        udfSchema.add(new Schema.FieldSchema("sampled", DataType.BAG));
        udfSchema.add(new Schema.FieldSchema("k", DataType.LONG));
        udfSchema.add(new Schema.FieldSchema("i", DataType.INTEGER));
        return udfSchema;
    }
}
{code}

Test case to verify:
{code}
import static org.apache.pig.ExecType.LOCAL;

import java.util.ArrayList;
import java.util.Iterator;

import junit.framework.TestCase;

import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class MyPigUnitTests extends TestCase {
  private static String patternString = "(\\d+)!+(\\w+)~+(\\w+)";
  public static ArrayList<String[]> data = new ArrayList<String[]>();
  static {
    data.add(new String[] { "1" });
    data.add(new String[] { "3" });
  }

  private static String[] script = new String[] {
     "Data = load 'file:/home/pvivek/Desktop/input' using PigStorage() as ( i: int );",
     "AllData = group Data all parallel 1;",
     "SampledData = foreach AllData generate org.vivek.TestEvalFunc(Data, 5) as rs;",
     "SampledData1 = foreach SampledData generate rs.sampled;",
  };

  public void test() throws Exception
  {
     String filename = TestHelper.createTempFile(data, "");
     PigServer pig = new PigServer(LOCAL);
     filename = filename.replace("\\", "");
     patternString = patternString.replace("\\", "");

     for (String query : script) {
        pig.registerQuery(query);
     }
     Iterator<?> it = pig.openIterator("SampledData1");
     int tupleCount = 0;
     while (it.hasNext()) {
        Tuple tuple = (Tuple) it.next();
        if (tuple == null)
          break;
        else {
          if (tuple.size() > 0) {
              tupleCount++;
          }
        }
     }
     assertEquals(1, tupleCount);
  }
}
{code}

 Class cast exception while projecting udf result
 

 Key: PIG-1895
 URL: https://issues.apache.org/jira/browse/PIG-1895
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0, 0.8.0, 0.9.0
Reporter: Vivek Padmanabhan

 A class cast exception is thrown when I try to project the result from my udf. 
 The udf has a defined schema of DataType.BAG, DataType.LONG and DataType.INTEGER.
 The below is my script
 {code}
 Data = load 'file:/home/pvivek/Desktop/input' using PigStorage() as ( i: int 
 );
 AllData = group Data all parallel 1;
 SampledData = foreach AllData generate org.vivek.TestEvalFunc(Data, 5) as rs;
 SampledData1 = foreach SampledData generate rs.sampled;
 {code}
 Even though the output schema defines "sampled" as a data bag, during 
 processing the entire tuple, instead of only the data bag generated by the 
 UDF, was sent to the projection as the result.
 {code}
 Exception received:
 java.lang.ClassCastException: org.apache.pig.data.BinSedesTuple cannot be 
 cast to org.apache.pig.data.DataBag
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:484)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:197)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:480)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:197)
   at 
 

[jira] Created: (PIG-1894) Wrong stats shown when there are multiple stores but same file names

2011-03-10 Thread Vivek Padmanabhan (JIRA)
Wrong stats shown when there are multiple stores but same file names


 Key: PIG-1894
 URL: https://issues.apache.org/jira/browse/PIG-1894
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0, 0.9.0
Reporter: Vivek Padmanabhan


Pig 0.8/0.9 shows wrong stats for store counters when I have multiple stores 
with the same file name.

To reproduce the issue please use the below script :
{code}
A = load 'sampledata1' as (f1:chararray,f2:chararray,f3:int);
B = filter A by f3==1;
C = filter A by f3==2;
D = filter A by f3==3;
store B into '/folder/B/out.gz';
store C into '/folder/C/out.gz';
store D into '/folder/D/out.gz';
{code}

Input 
{code}
aaa a   1
aaa b   1
bbb a   2
bbb b   2
ccc a   3
ccc b   3
{code}


For this script Pig shows 
Output(s):
Successfully stored 6 records (32 bytes) in: /folder/B/out.gz
Successfully stored 6 records (32 bytes) in: /folder/C/out.gz
Successfully stored 6 records (32 bytes) in: /folder/D/out.gz

Counters:
Total records written : 18
Total bytes written : 96

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Commented: (PIG-1842) Improve Scalability of the XMLLoader for large datasets such as wikipedia

2011-03-01 Thread Vivek Padmanabhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13000794#comment-13000794
 ] 

Vivek Padmanabhan commented on PIG-1842:


The errors are because PIG-1839 (XMLLoader will always add an extra empty tuple 
even if no tags are matched), which corrects these test cases, was not applied 
to the 0.8 branch.

 Improve Scalability of the XMLLoader for large datasets such as wikipedia
 -

 Key: PIG-1842
 URL: https://issues.apache.org/jira/browse/PIG-1842
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0, 0.8.0, 0.9.0
Reporter: Viraj Bhat
Assignee: Vivek Padmanabhan
 Fix For: 0.7.0, 0.8.0, 0.9.0

 Attachments: PIG-1842_1.patch, PIG-1842_2.patch, 
 TEST-org.apache.pig.piggybank.test.storage.TestXMLLoader.txt


 The current XMLLoader for Pig does not work well for large datasets such as 
 the wikipedia dataset. Each mapper reads in the entire XML file, resulting in 
 extremely slow run times.
 Viraj

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (PIG-1842) Improve Scalability of the XMLLoader for large datasets such as wikipedia

2011-02-25 Thread Vivek Padmanabhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999315#comment-12999315
 ] 

Vivek Padmanabhan commented on PIG-1842:


Hi Alan,
 The below is how I have handled these cases:

Note:
The XMLLoader considers one record to span from beginning tag to end tag, just 
like a line record reader searching for the newline character.
Split start and end locations are provided by the default FileInputFormat.




Describing the entire sequence of steps in a simple way:

* The loader collects the start and end tags and creates a record out of them 
  (XMLLoaderBufferedPositionedInputStream.collectTag).
* For the begin tag:
  * Read until the tag is found in this block.
  * If the tag is not found and the split end has been reached, then no record 
    is found in this split (return an empty array).
  * If a partial tag is found in the current split, then even though the split 
    end has been reached, continue reading the rest of the file beyond the 
    split end location (handled by the condition in the while loop).
* For the end tag:
  * Read until the end tag is found, even if the split end location is reached.


How far will split 1 read? It seems like it has to read to </a> or else the 
map processing split one will not be able to process this as a coherent 
document. 
Yet from the setting of maxBytesReadable on line 132 it looks to me like it 
won't read past the end point.

The other condition will keep the reading going: (matchBuf.size() > 0).

In this case, let's say my tag identifier is "a". The loader will read till 
the split end searching for the begin tag. 
Then, for the end tag, it reads the rest of the file starting from the last 
read position. Say the split end is reached in between; 
it will check whether it has found a match or a partial match. If not, it 
proceeds reading until it finds an end tag.

 Improve Scalability of the XMLLoader for large datasets such as wikipedia
 -

 Key: PIG-1842
 URL: https://issues.apache.org/jira/browse/PIG-1842
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0, 0.8.0, 0.9.0
Reporter: Viraj Bhat
Assignee: Vivek Padmanabhan
 Fix For: 0.7.0, 0.8.0, 0.9.0

 Attachments: PIG-1842_1.patch, PIG-1842_2.patch


 The current XMLLoader for Pig does not work well for large datasets such as 
 the wikipedia dataset. Each mapper reads in the entire XML file, resulting in 
 extremely slow run times.
 Viraj





[jira] Commented: (PIG-1842) Improve Scalability of the XMLLoader for large datasets such as wikipedia

2011-02-25 Thread Vivek Padmanabhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999316#comment-12999316
 ] 

Vivek Padmanabhan commented on PIG-1842:


I have done manual tests for the split boundary conditions. Please suggest 
whether/how I can do the same with unit tests.

 Improve Scalability of the XMLLoader for large datasets such as wikipedia
 -

 Key: PIG-1842
 URL: https://issues.apache.org/jira/browse/PIG-1842
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0, 0.8.0, 0.9.0
Reporter: Viraj Bhat
Assignee: Vivek Padmanabhan
 Fix For: 0.7.0, 0.8.0, 0.9.0

 Attachments: PIG-1842_1.patch, PIG-1842_2.patch


 The current XMLLoader for Pig does not work well for large datasets such as 
 the wikipedia dataset. Each mapper reads in the entire XML file, resulting in 
 extremely slow run times.
 Viraj





[jira] Created: (PIG-1868) New logical plan fails when I have complex data types from udf

2011-02-23 Thread Vivek Padmanabhan (JIRA)
New logical plan fails when I have complex data types from udf
--

 Key: PIG-1868
 URL: https://issues.apache.org/jira/browse/PIG-1868
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Vivek Padmanabhan


The new logical plan fails when I have complex data types returned from my 
eval function.

The below is my script :

{code}
register myudf.jar;   
B1 = load 'myinput' as (id:chararray,ts:int,url:chararray);
B2 = group B1 by id;
B = foreach B2 {
 Tuples = order B1 by ts;
 generate Tuples;
};
C1 = foreach B generate TransformToMyDataType(Tuples,-1,0,1) as seq: { t: ( 
previous, current, next ) };
C2 = foreach C1 generate FLATTEN(seq);
C3 = foreach C2 generate  current.id as id;
dump C3;
{code}

On C3 it fails with below message :
{code}
Couldn't find matching uid -1 for project (Name: Project Type: bytearray Uid: 
45 Input: 0 Column: 1)
{code}

The below is the describe on C1 ;
{code}
C1: {seq: {t: (previous: (id: chararray,ts: int,url: chararray),current: (id: 
chararray,ts: int,url: chararray),next: (id: chararray,ts: int,url: 
chararray))}}
{code}

The script works if I turn off new logical plan or use Pig 0.7.





[jira] Created: (PIG-1865) BinStorage/PigStorageSchema cannot load data from a different namenode

2011-02-22 Thread Vivek Padmanabhan (JIRA)
BinStorage/PigStorageSchema cannot load data from a different namenode
--

 Key: PIG-1865
 URL: https://issues.apache.org/jira/browse/PIG-1865
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0, 0.7.0, 0.9.0
Reporter: Vivek Padmanabhan


BinStorage/PigStorageSchema cannot load data from a different namenode. The 
main reason is that, in the getSchema method, they use 
org.apache.pig.impl.io.FileLocalizer to check whether the file exists, but the 
filesystem in HDataStorage refers to the natively configured dfs.

The test case is simple:
{code}
a = load 'hdfs://nn2/input' using BinStorage();
dump a;
{code}

Here, specifying -Dmapreduce.job.hdfs-servers should have worked, but Pig 
still takes the fs from fs.default.name, so to make it work I had to override 
fs.default.name on the Pig command line.

Raising this as a bug since the same scenario works with PigStorage.





[jira] Created: (PIG-1864) Pig 0.8 Documentation : non-ascii characters present in sample udf scripts

2011-02-20 Thread Vivek Padmanabhan (JIRA)
Pig 0.8 Documentation : non-ascii characters present in sample udf scripts 
---

 Key: PIG-1864
 URL: https://issues.apache.org/jira/browse/PIG-1864
 Project: Pig
  Issue Type: Bug
  Components: documentation
Affects Versions: 0.8.0
Reporter: Vivek Padmanabhan
Priority: Minor


In documentation ;
http://pig.apache.org/docs/r0.8.0/udf.html#Python+UDFs

For the sample script UDFs, there are some non-ASCII characters present. 
Because of this, when we try to execute the sample scripts, they fail with the 
error:
ERROR 2999: Unexpected internal error. null

SyntaxError: Non-ASCII character in file 'iostream', but no encoding 
declared; see http://www.python.org/peps/pep-0263.html for details



In some lines of the provided sample scripts, wrong characters are present. 
For example:
{code}
@outputSchema(onestring:chararray)
{code}


{code}
@outputSchema(y:bag{t:tuple(len:int,word:chararray)}) 
{code}

Requesting a review of all the UDF examples present, since it is common 
practice to copy the examples directly and run them.
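
As a quick way to spot such characters before running a copied example, a tiny check like the following can help (an illustrative helper, not part of Pig or its docs):

```java
public class NonAsciiCheck {
    /** Returns the index of the first non-ASCII character, or -1 if none. */
    static int firstNonAscii(String script) {
        for (int i = 0; i < script.length(); i++) {
            if (script.charAt(i) > 127) {
                return i; // e.g. a "smart quote" pasted from the docs
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // A decorator copied with typographic quotes (\u201C, \u201D) instead
        // of plain ASCII quotes triggers the Jython SyntaxError above.
        System.out.println(firstNonAscii("@outputSchema(\u201Cword:chararray\u201D)")); // 14
        System.out.println(firstNonAscii("@outputSchema('word:chararray')"));           // -1
    }
}
```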






[jira] Created: (PIG-1860) Bug in plan built for Nested foreach

2011-02-18 Thread Vivek Padmanabhan (JIRA)
Bug in plan built for Nested foreach 
-

 Key: PIG-1860
 URL: https://issues.apache.org/jira/browse/PIG-1860
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0, 0.9.0
Reporter: Vivek Padmanabhan


Using the same inputs as in PIG-1858, 

{code}
register myanotherudf.jar;
A = load 'myinput' using PigStorage() as ( 
date:chararray,bcookie:chararray,count:int,avg:double,pvs:int);
B = foreach A generate (int)(avg / 100.0) * 100   as avg, pvs;
C = group B by ( avg );
D = foreach C {
Pvs = order B by pvs;
Const = org.vivek.MyAnotherUDF(Pvs.pvs).(count,sum);
generate Const.sum as sum;
};
store D into 'out_D';
{code}

In this script, even though I am passing Pvs.pvs to the UDF in the nested 
foreach, at runtime avg is getting passed instead.
It looks like the logical plan created for D is wrong.







[jira] Updated: (PIG-1858) NullPointerException while compiling the new logical plan

2011-02-17 Thread Vivek Padmanabhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vivek Padmanabhan updated PIG-1858:
---

Attachment: MyAnotherUDF.java

Attaching the udf source

 NullPointerException while compiling the new logical plan
 -

 Key: PIG-1858
 URL: https://issues.apache.org/jira/browse/PIG-1858
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0, 0.9.0
Reporter: Vivek Padmanabhan
 Attachments: MyAnotherUDF.java


 The below is my script :
 {code}
 register myanotherudf.jar;
 A = load 'myinput' using PigStorage() as ( 
 date:chararray,bcookie:chararray,count:int,avg:double,pvs:int);
 B = foreach A generate (int)(avg / 100.0) * 100   as avg, pvs;
 C = group B by ( avg );
 D = foreach C {
 Pvs = order B by pvs;
 Const = org.vivek.MyAnotherUDF(Pvs.pvs).(count,sum);
 generate Const.sum as sum;
 };
 store D into 'out_D';
 {code}
 The script fails during compilation of the plan. The usage of the UDF 
 inside the foreach is causing the problem. The UDF implements Algebraic, and 
 its 
 output schema is also defined.
 The below is the exception that I get :
 ERROR 2042: Error in new logical plan. Try -Dpig.usenewlogicalplan=false.
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2042: Error in new 
 logical plan. Try -Dpig.usenewlogicalplan=false.
 at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:309)
 at org.apache.pig.PigServer.compilePp(PigServer.java:1364)
 at 
 org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1206)
 at org.apache.pig.PigServer.execute(PigServer.java:1200)
 at org.apache.pig.PigServer.access$100(PigServer.java:128)
 at org.apache.pig.PigServer$Graph.execute(PigServer.java:1527)
 at org.apache.pig.PigServer.executeBatchEx(PigServer.java:372)
 at org.apache.pig.PigServer.executeBatch(PigServer.java:339)
 at 
 org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:112)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:169)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:141)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90)
 at org.apache.pig.Main.run(Main.java:500)
 at org.apache.pig.Main.main(Main.java:107)
 Caused by: java.lang.NullPointerException
 at 
 org.apache.pig.newplan.ReverseDependencyOrderWalker.walk(ReverseDependencyOrderWalker.java:70)
 at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
 at 
 org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:105)
 at 
 org.apache.pig.newplan.logical.relational.LOGenerate.accept(LOGenerate.java:229)
 at 
 org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
 at 
 org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:94)
 at 
 org.apache.pig.newplan.logical.relational.LOForEach.accept(LOForEach.java:71)
 at 
 org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
 at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
 at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:261)
 ... 13 more
  
 When I turn off the new logical plan, the script executes successfully. The 
 issue is observed in both 0.8 and 0.9.





[jira] Updated: (PIG-1842) Improve Scalability of the XMLLoader for large datasets such as wikipedia

2011-02-17 Thread Vivek Padmanabhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vivek Padmanabhan updated PIG-1842:
---

Status: Patch Available  (was: In Progress)

 Improve Scalability of the XMLLoader for large datasets such as wikipedia
 -

 Key: PIG-1842
 URL: https://issues.apache.org/jira/browse/PIG-1842
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.8.0, 0.7.0, 0.9.0
Reporter: Viraj Bhat
Assignee: Vivek Padmanabhan
 Fix For: 0.9.0, 0.8.0, 0.7.0

 Attachments: PIG-1842_1.patch, PIG-1842_2.patch


 The current XMLLoader for Pig does not work well for large datasets such as 
 the wikipedia dataset. Each mapper reads in the entire XML file, resulting in 
 extremely slow run times.
 Viraj





[jira] Created: (PIG-1850) Order by is failing with ClassCastException if schema is undefined for new logical plan in 0.8

2011-02-11 Thread Vivek Padmanabhan (JIRA)
Order by is failing with ClassCastException if schema is undefined for new 
logical plan in 0.8
--

 Key: PIG-1850
 URL: https://issues.apache.org/jira/browse/PIG-1850
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0, 0.9.0
Reporter: Vivek Padmanabhan


The below is the script:

{code}
A = load 'input' ;
B = group A all;
C = foreach B generate SUM($1.$0);
C1 = CROSS A,C;
D = foreach C1 generate ROUND($0*1.0/$2)/100.0, $1;
E = order D by $0 desc; 
store E  into 'out1';
{code}

Input (tab-separated fields):
{code}
26  A
1349595 B
235693  C
{code}


Exception
java.lang.ClassCastException: org.apache.pig.impl.io.NullableDoubleWritable 
cannot be cast to org.apache.pig.impl.io.NullableBytesWritable
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigBytesRawComparator.compare(PigBytesRawComparator.java:94)
at java.util.Arrays.binarySearch0(Arrays.java:2105)
at java.util.Arrays.binarySearch(Arrays.java:2043)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:72)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:52)
at 
org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:602)
at 
org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:116)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:238)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:676)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:336)
at org.apache.hadoop.mapred.Child$4.run(Child.java:242)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.mapred.Child.main(Child.java:236)


The script fails while doing the order by in WeightedRangePartitioner, since 
it considers the quantiles to be NullableBytesWritable but at run time they 
are NullableDoubleWritable. This happens because no schema is defined in the 
load statement.
But the same works fine when multiquery is turned off.

One more issue worth noting: if I have a filter statement after relation 
E, then the above exception is swallowed by Pig. This makes debugging really 
hard. 






[jira] Created: (PIG-1848) Confusing statement for Merge Join - Both Conditions in Pig reference manual1

2011-02-10 Thread Vivek Padmanabhan (JIRA)
Confusing statement for Merge Join - Both Conditions in Pig reference manual1
--

 Key: PIG-1848
 URL: https://issues.apache.org/jira/browse/PIG-1848
 Project: Pig
  Issue Type: Improvement
  Components: documentation
Reporter: Vivek Padmanabhan


In the Pig reference manual, 
http://pig.apache.org/docs/r0.8.0/piglatin_ref1.html#Merge+Joins,
for merge join under Both Conditions, the example statement is confusing.

{quote}
Both Conditions
For optimal performance, each part file of the left (sorted) input of the join 
should have a size of at least 1 hdfs block size (for example if the hdfs block 
size is 128 MB, each part file should be less than 128 MB). 
{quote}





[jira] Commented: (PIG-1848) Confusing statement for Merge Join - Both Conditions in Pig reference manual1

2011-02-10 Thread Vivek Padmanabhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12992943#comment-12992943
 ] 

Vivek Padmanabhan commented on PIG-1848:


Please consider refining the Both Conditions section.

 Confusing statement for Merge Join - Both Conditions in Pig reference manual1
 --

 Key: PIG-1848
 URL: https://issues.apache.org/jira/browse/PIG-1848
 Project: Pig
  Issue Type: Improvement
  Components: documentation
Reporter: Vivek Padmanabhan

 In the Pig reference manual, 
 http://pig.apache.org/docs/r0.8.0/piglatin_ref1.html#Merge+Joins,
 for merge join under Both Conditions, the example statement is confusing.
 {quote}
 Both Conditions
 For optimal performance, each part file of the left (sorted) input of the 
 join should have a size of at least 1 hdfs block size (for example if the 
 hdfs block size is 128 MB, each part file should be less than 128 MB). 
 {quote}





[jira] Updated: (PIG-1842) Improve Scalability of the XMLLoader for large datasets such as wikipedia

2011-02-09 Thread Vivek Padmanabhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vivek Padmanabhan updated PIG-1842:
---

Attachment: PIG-1842_2.patch

Attaching the patch again

 Improve Scalability of the XMLLoader for large datasets such as wikipedia
 -

 Key: PIG-1842
 URL: https://issues.apache.org/jira/browse/PIG-1842
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0, 0.8.0, 0.9.0
Reporter: Viraj Bhat
Assignee: Vivek Padmanabhan
 Fix For: 0.7.0, 0.8.0, 0.9.0

 Attachments: PIG-1842_1.patch, PIG-1842_2.patch


 The current XMLLoader for Pig does not work well for large datasets such as 
 the wikipedia dataset. Each mapper reads in the entire XML file, resulting in 
 extremely slow run times.
 Viraj





[jira] Commented: (PIG-1842) Improve Scalability of the XMLLoader for large datasets such as wikipedia

2011-02-08 Thread Vivek Padmanabhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991832#comment-12991832
 ] 

Vivek Padmanabhan commented on PIG-1842:


The below are some of the issues addressed in the patch:
a) Marking the loader as splittable except for gz formats
b) Changing XMLLoader to read per split rather than the entire file
c) Handling scenarios regarding split/record boundaries
d) Using CBZip2InputStream to handle bzip2 files
e) An improvement to the logic of collectTag (i.e., skip unnecessary reads to 
find the end tag if no start tags are found)

Manual tests for scalability and functional verification were done for the 
patch.
Using the latest wikipedia dump in bz2 format (10861606 pages; 6.5 GB bz2), 
the new loader completed within 3 minutes, while the older version took more 
than 35 minutes for a simple load-filter null-store script.
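
The splittability rule amounts to something like the following (an illustrative sketch based on the file extension alone; the real patch plugs into Hadoop's input format and routes bz2 files through CBZip2InputStream):

```java
public class SplitRuleSketch {
    /**
     * Gzip has no block structure a reader can resynchronize on, so a .gz
     * file must be handled by a single mapper; plain XML and bzip2 files
     * can be split across mappers.
     */
    static boolean isSplittable(String fileName) {
        return !fileName.endsWith(".gz");
    }

    public static void main(String[] args) {
        System.out.println(isSplittable("enwiki-pages.xml"));     // true
        System.out.println(isSplittable("enwiki-pages.xml.bz2")); // true
        System.out.println(isSplittable("enwiki-pages.xml.gz"));  // false
    }
}
```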



 Improve Scalability of the XMLLoader for large datasets such as wikipedia
 -

 Key: PIG-1842
 URL: https://issues.apache.org/jira/browse/PIG-1842
 Project: Pig
  Issue Type: Improvement
Reporter: Viraj Bhat
Assignee: Vivek Padmanabhan
 Attachments: PIG-1842_1.patch


 The current XMLLoader for Pig does not work well for large datasets such as 
 the wikipedia dataset. Each mapper reads in the entire XML file, resulting in 
 extremely slow run times.
 Viraj





[jira] Updated: (PIG-1842) Improve Scalability of the XMLLoader for large datasets such as wikipedia

2011-02-07 Thread Vivek Padmanabhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vivek Padmanabhan updated PIG-1842:
---

Attachment: PIG-1842_1.patch

Attaching an initial patch.

 Improve Scalability of the XMLLoader for large datasets such as wikipedia
 -

 Key: PIG-1842
 URL: https://issues.apache.org/jira/browse/PIG-1842
 Project: Pig
  Issue Type: Improvement
Reporter: Viraj Bhat
Assignee: Vivek Padmanabhan
 Attachments: PIG-1842_1.patch


 The current XMLLoader for Pig does not work well for large datasets such as 
 the wikipedia dataset. Each mapper reads in the entire XML file, resulting in 
 extremely slow run times.
 Viraj





[jira] Updated: (PIG-1839) piggybank: XMLLoader will always add an extra empty tuple even if no tags are matched

2011-02-02 Thread Vivek Padmanabhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vivek Padmanabhan updated PIG-1839:
---

Attachment: PIG-1839-1.patch

Attaching the initial patch. 
Please note that I have modified the existing test case to assert the 
correct number of tuples.

 piggybank: XMLLoader will always add an extra empty tuple even if no tags are 
 matched
 -

 Key: PIG-1839
 URL: https://issues.apache.org/jira/browse/PIG-1839
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0, 0.8.0, 0.9.0
Reporter: Vivek Padmanabhan
Assignee: Vivek Padmanabhan
 Attachments: PIG-1839-1.patch


 The XMLLoader in piggybank always adds an empty tuple, which every time has 
 to be filtered out. Instead, the loader itself could do this.
 Consider the below script :
 a = load 'a.xml' using org.apache.pig.piggybank.storage.XMLLoader('name');
 dump a;
 b = filter a by $0 is not null;
 dump b;
 The output of the first dump is :
 (<name>foobar</name>)
 (<name>foo</name>)
 (<name>justname</name>)
 ()
 The output of the second dump is :
 (<name>foobar</name>)
 (<name>foo</name>)
 (<name>justname</name>)
 Another case: even if there is no matching tag, the loader will still 
 generate the empty tuple.





[jira] Updated: (PIG-1839) piggybank: XMLLoader will always add an extra empty tuple even if no tags are matched

2011-02-02 Thread Vivek Padmanabhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vivek Padmanabhan updated PIG-1839:
---

Patch Info: [Patch Available]

 piggybank: XMLLoader will always add an extra empty tuple even if no tags are 
 matched
 -

 Key: PIG-1839
 URL: https://issues.apache.org/jira/browse/PIG-1839
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0, 0.8.0, 0.9.0
Reporter: Vivek Padmanabhan
Assignee: Vivek Padmanabhan
 Attachments: PIG-1839-1.patch


 The XMLLoader in piggybank always adds an empty tuple, which every time has 
 to be filtered out. Instead, the loader itself could do this.
 Consider the below script :
 a = load 'a.xml' using org.apache.pig.piggybank.storage.XMLLoader('name');
 dump a;
 b = filter a by $0 is not null;
 dump b;
 The output of the first dump is :
 (<name>foobar</name>)
 (<name>foo</name>)
 (<name>justname</name>)
 ()
 The output of the second dump is :
 (<name>foobar</name>)
 (<name>foo</name>)
 (<name>justname</name>)
 Another case: even if there is no matching tag, the loader will still 
 generate the empty tuple.





[jira] Created: (PIG-1835) Pig 0.9 new logical plan throws class cast exception

2011-02-01 Thread Vivek Padmanabhan (JIRA)
Pig 0.9 new logical plan throws class cast exception


 Key: PIG-1835
 URL: https://issues.apache.org/jira/browse/PIG-1835
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Vivek Padmanabhan


I have the below script, which throws a class cast exception while doing SUM. 
Even though all the fields are properly typed, while computing the sum in 
m_agg01 and m_agg02, the value from the tuple comes as java.lang.Long instead 
of Double.

The problem happens in Pig 0.9. It works fine with 0.9 if I turn off the new 
logical plan with -Dpig.usenewlogicalplan=false. 

{code}
A0 = load 'inputA' using PigStorage('\t') as ( group_id, r_id:long, 
is_phase2:int, roi_value:double,roi_cost:double,ecpm, prob:double,pixel_id, 
pixel_type,
val:long,f3, f4,type:long, amount:double,item_id:long);

A0 = foreach A0 generate r_id, is_phase2, ((val==257 or val==258)? 1: 0) as 
imps,
((val==257 or val==258)? amount: 0.0) as a_out, ((val==257 or 
val==258)? item_id: 0) as a_item_id,
((val==257 or val==258)? roi_value: 0.0) as roi_value,((val==257 or 
val==258)? roi_cost: 0.0) as roi_cost,
((val==257 or val==513)? ecpm: 0.0) as ecpm, ((val==257 or val==513)? 
prob: 0.0) as prob,
((val==257 or val==513)? amount: 0.0) as pub_rev, ((val==257 or 
val==513)? item_id: 0) as pub_line_id,((val==257 or val==513)? type: 0) as 
pub_pt;
-
B0 = load 'inputB' using PigStorage('\t') as ( group_id:long, r_id:long, 
roi_value:double,roi_cost:double,receive_time, 
host_name,site_id,rm_has_cookies,rm_pearl_id, 
f1,f2,pixel_id:long,pixel_type:int, xcookie,val:long,f3, 
f4,type:long,amount:double,item_id:long);

B0 = foreach B0 generate r_id, ((val==257 or val==258)? 1: 0) as B0,((val==257 
or val==258)? amount: 0.0) as a_out,
((val==257 or val==258)? item_id: 0) as a_item_id,((val==257 or 
val==258)? roi_value: 0.0) as roi_value,
((val==257 or val==258)? roi_cost: 0.0) as roi_cost, ((val==257 or 
val==513)? amount: 0.0) as pub_rev,
((val==257 or val==513)? item_id: 0) as pub_line_id, ((val==257 or 
val==513)? type: 0) as pub_pt;

C0 = load 'inputC' using PigStorage('\t') as (  group_id:long, r_id:long, 
roi_value:double, roi_cost:double, receive_time:long, host_name:chararray, 
site_id:long, rm_has_cookies:int,rm_pearl_id:long,f1,f2, pixel_id:long, 
pixel_type:int,rm_is_post_click:int, 
rm_conversion_id,xcookie:chararray,val:long,f3:long,f4:long,type:long,amount:double,item_id:long);

C0 = foreach C0 generate   r_id,((val==257 or val==258)? 1: 0) as C0, 
((val==257 or val==258)? amount: 0.0) as a_out,
((val==257 or val==258)? item_id: 0) as a_item_id,((val==257 or 
val==513)? amount: 0.0) as pub_rev,
((val==257 or val==513)? item_id: 0) as pub_line_id, ((val==257 or 
val==513)? type: 0) as pub_pt;

m_all = cogroup   A0 by (r_id) outer, B0 by (r_id) outer, C0  by (r_id) outer ;
m_agg01 = foreach m_all generate (double)(IsEmpty(C0) ? 0.0 : SUM(C0.pub_rev)) 
as conv_pub_rev;
store m_agg01 into 'out1' USING PigStorage(',');

m_all = cogroup   A0 by (r_id) outer, B0 by (r_id) outer, C0  by (r_id) outer ;
m_agg02 = foreach m_all generate (double)(IsEmpty(C0) ? 0.0 : SUM(C0.pub_rev)) 
as conv_pub_rev;
store m_agg02 into 'out2' USING PigStorage(',');
{code}



The below are the inputs to the script (all single-record and tab-separated):


inputA
--
1   1.1 1.1 1.1 1.1 1   
1.1 

inputB
--
1.1 1.1 a1  1   b1  
1   1   c1  1.1 

inputC
--
1.1 1.1 a1  1   b1  
1   1   1   c1  
1.1 


Exception from reducers 
___
org.apache.pig.backend.executionengine.ExecException: ERROR 2103: Problem while 
computing sum of doubles.
at org.apache.pig.builtin.DoubleSum.sum(DoubleSum.java:147)
at org.apache.pig.builtin.DoubleSum.exec(DoubleSum.java:46)
at org.apache.pig.builtin.DoubleSum.exec(DoubleSum.java:41)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:230)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:302)

[jira] Created: (PIG-1831) Variation in output while using streaming udfs in local mode

2011-01-28 Thread Vivek Padmanabhan (JIRA)
Variation in output while using streaming udfs in local mode


 Key: PIG-1831
 URL: https://issues.apache.org/jira/browse/PIG-1831
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Vivek Padmanabhan


The below script, when run in local mode, gives me a different output. It 
looks like in local mode I have to store a relation obtained through streaming 
in order to use it afterwards.

 For example consider the below script : 
{code:lang=scala|title=} 
DEFINE MySTREAMUDF `test.sh`;

A  = LOAD 'myinput' USING PigStorage() AS (myId:chararray, data2, data3,data4 );
B = STREAM A THROUGH MySTREAMUDF AS (wId:chararray, num:int);

--STORE B into 'output.B';

C = JOIN B by wId LEFT OUTER, A by myId;
D = FOREACH C GENERATE B::wId,B::num,data4 ;
D = STREAM D THROUGH MySTREAMUDF AS (f1:chararray,f2:int);

--STORE D into 'output.D';

E = foreach B GENERATE wId,num;
F = DISTINCT E;
G = GROUP F ALL;
H = FOREACH G GENERATE COUNT_STAR(F) as TotalCount;
I = CROSS D,H;
STORE I  into 'output.I';
{code}

{code:lang=scala|title=test.sh}
#!/bin/bash
cut -f1,3
{code}

And input is 
abcdlabel1  11  feature1
acbdlabel2  22  feature2
adbclabel3  33  feature3


Here, if I store relations B and D, then every time I get the result:
acbd3
abcd3
adbc3

But if I don't store relations B and D, then I get an empty output.







[jira] Updated: (PIG-1831) Variation in output while using streaming udfs in local mode

2011-01-28 Thread Vivek Padmanabhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vivek Padmanabhan updated PIG-1831:
---

Description: 
The below script when run in local mode gives me a different output. It looks 
like in local mode I have to store a relation obtained through streaming in 
order to use it afterwards.

 For example consider the below script : 

DEFINE MySTREAMUDF `test.sh`;
A  = LOAD 'myinput' USING PigStorage() AS (myId:chararray, data2, data3,data4 );
B = STREAM A THROUGH MySTREAMUDF AS (wId:chararray, num:int);

--STORE B into 'output.B';

C = JOIN B by wId LEFT OUTER, A by myId;
D = FOREACH C GENERATE B::wId,B::num,data4 ;
D = STREAM D THROUGH MySTREAMUDF AS (f1:chararray,f2:int);

--STORE D into 'output.D';

E = foreach B GENERATE wId,num;
F = DISTINCT E;
G = GROUP F ALL;
H = FOREACH G GENERATE COUNT_STAR(F) as TotalCount;
I = CROSS D,H;
STORE I  into 'output.I';


#!/bin/bash
cut -f1,3


And input is 
abcdlabel1  11  feature1
acbdlabel2  22  feature2
adbclabel3  33  feature3


Here, if I store relations B and D, then every time I get the result:
acbd3
abcd3
adbc3

But if I don't store relations B and D, then I get an empty output.




  was:
The below script when run in local mode gives me a different output. It looks 
like in local mode I have to store a relation obtained through streaming in 
order to use it afterwards.

 For example consider the below script : 
{code:lang=scala|title=} 
DEFINE MySTREAMUDF `test.sh`;

A  = LOAD 'myinput' USING PigStorage() AS (myId:chararray, data2, data3,data4 );
B = STREAM A THROUGH MySTREAMUDF AS (wId:chararray, num:int);

--STORE B into 'output.B';

C = JOIN B by wId LEFT OUTER, A by myId;
D = FOREACH C GENERATE B::wId,B::num,data4 ;
D = STREAM D THROUGH MySTREAMUDF AS (f1:chararray,f2:int);

--STORE D into 'output.D';

E = foreach B GENERATE wId,num;
F = DISTINCT E;
G = GROUP F ALL;
H = FOREACH G GENERATE COUNT_STAR(F) as TotalCount;
I = CROSS D,H;
STORE I  into 'output.I';
{code}

{code:lang=scala|title=test.sh}
#!/bin/bash
cut -f1,3
{code}

And input is 
abcdlabel1  11  feature1
acbdlabel2  22  feature2
adbclabel3  33  feature3


Here if I store relation B and D then everytime i get the result  :
acbd3
abcd3
adbc3

But if i dont store relations B and D then I get an empty output.  





 Variation in output while using streaming udfs in local mode
 

 Key: PIG-1831
 URL: https://issues.apache.org/jira/browse/PIG-1831
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Vivek Padmanabhan


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1561) XMLLoader in Piggybank does not support bz2 or gzip compressed XML files

2011-01-14 Thread Vivek Padmanabhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vivek Padmanabhan updated PIG-1561:
---

Attachment: PIG-1561-1.patch

Attaching an initial patch for the issue. Please review. 

 XMLLoader in Piggybank does not support bz2 or gzip compressed XML files
 

 Key: PIG-1561
 URL: https://issues.apache.org/jira/browse/PIG-1561
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0, 0.8.0
Reporter: Viraj Bhat
Assignee: Vivek Padmanabhan
 Attachments: PIG-1561-1.patch


 I have a simple Pig script which uses the XMLLoader after the Piggybank is 
 built.
 {code}
 register piggybank.jar;
 A = load '/user/viraj/capacity-scheduler.xml.gz' using 
 org.apache.pig.piggybank.storage.XMLLoader('property') as (docs:chararray);
 B = limit A 1;
 dump B;
 --store B into '/user/viraj/handlegz' using PigStorage();
 {code}
 this returns an empty tuple:
 {code}
 ()
 {code}
 If you supply the uncompressed XML file, you get:
 {code}
 (<property>
 <name>mapred.capacity-scheduler.queue.my.capacity</name>
 <value>10</value>
 <description>Percentage of the number of slots in the cluster that are
   guaranteed to be available for jobs in this queue.
 </description>
 </property>)
 {code}
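The symptom is consistent with the loader reading the file's raw (compressed) bytes instead of routing them through a decompression codec. A quick shell check shows what the loader would need to do to see the XML (the /tmp paths and the one-line property are illustrative, not taken from the report):

```shell
# Write a tiny XML file and gzip it, keeping both copies around.
printf '<property><name>x</name><value>10</value></property>' > /tmp/cap.xml
gzip -c /tmp/cap.xml > /tmp/cap.xml.gz

# Reading the .gz directly yields binary garbage; decompressing first
# recovers the XML that XMLLoader needs to parse.
gunzip -c /tmp/cap.xml.gz
```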

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1779) Wrong stats shown when there are multiple loads but same file names

2010-12-21 Thread Vivek Padmanabhan (JIRA)
Wrong stats shown when there are multiple loads but same file names
---

 Key: PIG-1779
 URL: https://issues.apache.org/jira/browse/PIG-1779
 Project: Pig
  Issue Type: Bug
  Components: tools
Affects Versions: 0.8.0
Reporter: Vivek Padmanabhan


In Pig 0.8, the stats show wrong information whenever I have multiple loads and
the file names are similar.

a) Problem 1
Sample script:
A = LOAD 'myfolder/tryme' AS (f1);
B = LOAD 'myfolder/anotherfolder/tryme' AS (f2);
C = JOIN A BY f1, B BY f2;
DUMP C;

Here I have 10 records for A and 3 records for B, but Pig says:
Successfully read 6 records from: nn/myfolder/anotherfolder/tryme
Successfully read 6 records from: nn/myfolder/tryme

b) Problem 2
A = LOAD 'myfolder/tryme' AS (f1);
B = LOAD 'myfolder/anotherfolder/tryme' AS (f2);
C = JOIN A BY f1, B BY f2;
DUMP C;

Here there is no folder named anotherfolder, while myfolder/tryme exists.
But Pig says:
Failed to read data from nn/myfolder/anotherfolder/tryme
Failed to read data from nn/myfolder/tryme
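One plausible explanation for both problems (not confirmed in this report) is that the stats are keyed by a path suffix rather than the full input path; the two load paths collide on their leaf name, as a quick check shows:

```shell
# Both inputs end in the same leaf name, so any bookkeeping that keys
# on the basename (or matches by suffix) conflates the two loads.
for p in myfolder/tryme myfolder/anotherfolder/tryme; do
  basename "$p"
done
# tryme
# tryme
```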


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.