[jira] [Commented] (PIG-3486) Pig hitting OOM while using PigRunner.run()
[ https://issues.apache.org/jira/browse/PIG-3486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13787954#comment-13787954 ]

Vivek Padmanabhan commented on PIG-3486:

Hi Ankit/Daniel, thanks for looking into this. Meanwhile, is the current patch safe to apply to our cluster for Pig 0.11.1? All basic tests pass without issues; I am not sure of any cases I may have missed.

Pig hitting OOM while using PigRunner.run()
---
Key: PIG-3486
URL: https://issues.apache.org/jira/browse/PIG-3486
Project: Pig
Issue Type: Bug
Affects Versions: 0.11.1
Reporter: Vivek Padmanabhan
Attachments: histolive.txt, PIG-3486.patch1

I have a timer-based class which triggers a Pig script execution every 5 minutes using PigRunner.run(args, null). The heap usage grows gradually: after around 15 days it crossed 1 GB, i.e. after invoking the above method about 4k times. The top entries of the "histo live" output are:

{code}
 num     #instances         #bytes  class name
----------------------------------------------
   1:       2430178      433053080  [C
   2:       3055280       97768960  java.util.Hashtable$Entry
   3:       2454870       78555840  java.lang.String
   4:       1585204       50726528  java.util.HashMap$Entry
   5:        260310       37503984  constMethodKlass
   6:        260310       35413536  methodKlass
   7:         35024       23724672  [Ljava.util.Hashtable$Entry;
   8:          7599       18141016  constantPoolKlass
   9:         47551       18066696  [Ljava.util.HashMap$Entry;
  10:        209516       16761280  java.lang.reflect.Method
  11:        212292       16732008  [I
  12:          6881       11332896  constantPoolCacheKlass
  13:          7599        7160920  instanceKlassKlass
  14:         79412        4447072  java.util.ResourceBundle$CacheKey
  15:         10787        3958464  [S
  16:         79412        3811776  java.util.ResourceBundle$BundleReference
  17:         26634        3458160  [B
  18:        133701        3208824  java.util.LinkedList$Node
  19:         85492        2735744  java.util.concurrent.ConcurrentHashMap$HashEntry
  20:         79412        2541184  java.util.ResourceBundle$LoaderReference
  21:         47515        2280720  java.util.HashMap
  22:         37298        2274416  [Ljava.lang.Object;
  23:         70638        2260416  java.util.LinkedList
  24:          2949        1994376  methodDataKlass
  25:          7914        1749080  java.lang.Class
  26:         62746        1505904  org.apache.commons.logging.impl.Log4JLogger
  27:         16639        1463824  [[I
  28:         21279        1361856  java.net.URL
  29:         28090        1348320  java.util.Hashtable
  30:         14167        1231856  [Ljava.util.WeakHashMap$Entry;
  31:         17770         710800  java.lang.ref.Finalizer
  32:         10626         680064  java.util.jar.JarFile
  33:         14167         680016  java.util.WeakHashMap
  34:         14238         569520  java.util.WeakHashMap$Entry
  35:          7104         568320  java.util.jar.JarFile$JarFileEntry
  36:           165         567264  [Ljava.util.concurrent.ConcurrentHashMap$HashEntry;
  37:         10637         510576  sun.nio.cs.UTF_8$Encoder
  38:         10633         510384  sun.misc.URLClassPath$JarLoader
  39:         14176         453632  java.lang.ref.ReferenceQueue
  40:         17747         409752  [Ljava.lang.Class;
  41:          3463         387856  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$HangingJobKiller
  42:         15355         368520  java.util.ArrayList
  43:         10632         340224  java.util.zip.ZipCoder
  44:          6932         332736  java.util.Properties
  45:          4060         292320  java.lang.reflect.Constructor
  46:          7143         285720  java.util.LinkedHashMap$Entry
  47:          3517         281360  org.apache.pig.impl.PigContext$ContextClassLoader
  48:          3476         278144  [Ljava.lang.ThreadLocal$ThreadLocalMap$Entry;
  49:          3458         276640  java.net.URI
  50:          8576         274432  antlr.ANTLRHashString
  51:         10632         255168  java.util.ArrayDeque
{code}

There are way too many instances of MapReduceLauncher$HangingJobKiller.

--
This message was sent by Atlassian JIRA (v6.1#6144)
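A class histogram like the one above can also be triaged programmatically. The sketch below is a hypothetical helper (not part of Pig or the attached patch) that parses `jmap -histo:live`-style rows and ranks classes by retained bytes, which is how suspects such as HangingJobKiller surface:

```python
import re

def parse_histo(text):
    """Parse 'jmap -histo' style rows: ' rank: instances bytes classname'."""
    rows = []
    for line in text.splitlines():
        m = re.match(r"\s*(\d+):\s+(\d+)\s+(\d+)\s+(\S+)", line)
        if m:
            # (class name, instance count, total bytes)
            rows.append((m.group(4), int(m.group(2)), int(m.group(3))))
    return rows

def top_by_bytes(rows, n=5):
    """Rank classes by total retained bytes, largest first."""
    return sorted(rows, key=lambda r: r[2], reverse=True)[:n]

# Three rows copied from the histogram above.
sample = """\
   1:       2430178      433053080  [C
   2:       3055280       97768960  java.util.Hashtable$Entry
  41:          3463         387856  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$HangingJobKiller
"""
```

Comparing two such snapshots taken between PigRunner.run() invocations (rather than eyeballing one) is what makes a slow leak like this one visible.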
[jira] [Created] (PIG-3486) Pig hitting OOM while using PigRunner.run()
Vivek Padmanabhan created PIG-3486:
--

Summary: Pig hitting OOM while using PigRunner.run()
Key: PIG-3486
URL: https://issues.apache.org/jira/browse/PIG-3486
Project: Pig
Issue Type: Bug
Affects Versions: 0.11.1
Reporter: Vivek Padmanabhan

I have a timer-based class which triggers a Pig script execution every 5 minutes using PigRunner.run(args, null). The heap usage grows gradually: after around 15 days it crossed 1 GB, i.e. after invoking the above method about 4k times. There are way too many instances of MapReduceLauncher$HangingJobKiller. (The full "histo live" output is quoted in the comment above.)

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3486) Pig hitting OOM while using PigRunner.run()
[ https://issues.apache.org/jira/browse/PIG-3486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vivek Padmanabhan updated PIG-3486:
---
Attachment: histolive.txt

Attaching the histo live.
[jira] [Updated] (PIG-3486) Pig hitting OOM while using PigRunner.run()
[ https://issues.apache.org/jira/browse/PIG-3486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vivek Padmanabhan updated PIG-3486:
---
Attachment: PIG-3486.patch1

Attaching an initial patch.
[jira] [Commented] (PIG-2127) PigStorageSchema need to deal with missing field
[ https://issues.apache.org/jira/browse/PIG-2127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13290233#comment-13290233 ]

Vivek Padmanabhan commented on PIG-2127:

I think this issue is still present with the PigStorage '-schema' option:

{code}
a = load '2127_withschema' using PigStorage(',','-schema');
b = foreach a generate f1,f2,f3,f4;
dump b;
{code}

Input:

{code}
d,e,4,1
a,b,1,2
c,b
d,e,4,1
{code}

The script and input above produce the exception below:

{code}
java.lang.IndexOutOfBoundsException: Index: 3, Size: 3
        at java.util.ArrayList.RangeCheck(ArrayList.java:547)
        at java.util.ArrayList.get(ArrayList.java:322)
        at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:156)
        at org.apache.pig.builtin.PigStorage.applySchema(PigStorage.java:282)
        at org.apache.pig.builtin.PigStorage.getNext(PigStorage.java:246)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:194)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:532)
        at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
{code}

PigStorageSchema need to deal with missing field
---
Key: PIG-2127
URL: https://issues.apache.org/jira/browse/PIG-2127
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.10.0
Reporter: Daniel Dai
Fix For: 0.10.0

Currently, if the data contains fewer columns than the schema, PigStorageSchema throws an IndexOutOfBounds exception (PigStorageSchema:97). We should pad with null in this case, as we do in PigStorage.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
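The null-padding the issue asks for can be sketched outside Pig. Below is a minimal Python model (hypothetical, not Pig's actual applySchema code) of splitting a delimited record and padding missing trailing fields with nulls up to the schema size, which avoids the IndexOutOfBounds on short rows like "c,b":

```python
def apply_schema(line, schema_size, delimiter=","):
    """Split a delimited record into fields; if the record has fewer
    fields than the schema, pad the tail with None (Pig's null)."""
    fields = line.split(delimiter)
    if len(fields) < schema_size:
        fields.extend([None] * (schema_size - len(fields)))
    return fields[:schema_size]
```

With a 4-column schema, `apply_schema("c,b", 4)` yields two real values followed by two nulls instead of raising on the missing positions.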
[jira] [Commented] (PIG-2127) PigStorageSchema need to deal with missing field
[ https://issues.apache.org/jira/browse/PIG-2127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13290241#comment-13290241 ]

Vivek Padmanabhan commented on PIG-2127:

I am seeing the same issue with plain PigStorage as well on Pig 0.10.

Input:

{code}
d,e,4,1
a,b,1,2
c,b
d,e,4,1
{code}

Script:

{code}
a = load '2127_withschema' using PigStorage(',') as (f1,f2,f3,f4);
b = foreach a generate f1,f2,f3,f4;
dump b;
{code}

The above script also results in the same IndexOutOfBounds exception on Pig 0.10 (it works fine with Pig 0.9).
[jira] [Created] (PIG-2721) Wrong output generated while loading bags as input
Vivek Padmanabhan created PIG-2721:
--

Summary: Wrong output generated while loading bags as input
Key: PIG-2721
URL: https://issues.apache.org/jira/browse/PIG-2721
Project: Pig
Issue Type: Bug
Affects Versions: 0.9.2, 0.9.0
Reporter: Vivek Padmanabhan

{code}
A = LOAD '/user/pvivek/sample' as (id:chararray,mybag:bag{tuple(bttype:chararray,cat:long)});
B = foreach A generate id,FLATTEN(mybag) AS (bttype, cat);
C = order B by id;
dump C;
{code}

The above script generates wrong results when executed with Pig 0.10 and Pig 0.9. The below is the sample input:

{code}
...LKGaHqg--{(aa,806743)}
..0MI1Y37w--{(aa,498970)}
..0bnlpJrw--{(aa,806740)}
..0p0IIhbA--{(aa,498971),(se,498995)}
..1VkGqvXA--{(aa,805219)}
{code}

I think the Pig optimizers are causing this issue. From the logs I can see that $1 is pruned for relation A:

[main] INFO org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for A: $1

One workaround is to disable the rule with -t ColumnMapKeyPrune.
[jira] [Updated] (PIG-2305) Pig should log the split locations in task logs
[ https://issues.apache.org/jira/browse/PIG-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vivek Padmanabhan updated PIG-2305:
---
Fix Version/s: 0.9.1
Status: Patch Available (was: Open)

Pig should log the split locations in task logs
---
Key: PIG-2305
URL: https://issues.apache.org/jira/browse/PIG-2305
Project: Pig
Issue Type: Improvement
Affects Versions: 0.9.0, 0.8.1
Reporter: Vivek Padmanabhan
Assignee: Vivek Padmanabhan
Priority: Minor
Fix For: 0.9.1
Attachments: PIG-2305_1.patch

It would be helpful if Pig could log the split information in the task logs. MAPREDUCE-2076 talks about having this log, but from Pig 0.8 onwards splits can be combined, so I am not sure what the result would be. One side effect is that these logs will also be printed for intermediate Pig jobs.

--
This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-2305) Pig should log the split locations in task logs
[ https://issues.apache.org/jira/browse/PIG-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vivek Padmanabhan updated PIG-2305:
---
Attachment: PIG-2305_1.patch
[jira] [Created] (PIG-2291) PigStats.isSuccessful returns false if embedded pig script has dump
PigStats.isSuccessful returns false if embedded pig script has dump
---
Key: PIG-2291
URL: https://issues.apache.org/jira/browse/PIG-2291
Project: Pig
Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Vivek Padmanabhan

The below is my Python script:

{code}
#! /usr/bin/python
from org.apache.pig.scripting import Pig
P = Pig.compileFromFile('a.pig')
result = P.bind().runSingle()
if result.isSuccessful():
    print 'Pig job succeeded'
else:
    print 'Pig job failed'
{code}

The below is the embedded pig script (a.pig):

{code}
A = LOAD 'a1' USING PigStorage(',') AS (f1:chararray,f2:chararray);
B = GROUP A by f1;
dump B;
{code}

Even though the job is successful, the output printed is 'Pig job failed'. This is because result.isSuccessful() returns false whenever the pig script contains a dump statement. If I run the pig script alone, the return code is correct.
[jira] [Resolved] (PIG-2250) Pig 0.9 error message not useful as compared to 0.8
[ https://issues.apache.org/jira/browse/PIG-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vivek Padmanabhan resolved PIG-2250.
---
Resolution: Fixed
Fix Version/s: 0.10

Verified this with trunk. Sorry for the trouble.

Pig 0.9 error message not useful as compared to 0.8
---
Key: PIG-2250
URL: https://issues.apache.org/jira/browse/PIG-2250
Project: Pig
Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Vivek Padmanabhan
Fix For: 0.10

Another instance of a change in error message from 0.8 to 0.9 due to parser modifications. The improper error message is caused by a \n inside the UDF arguments. The below is a sample script:

{code}
a = load 'input' using myLoader('a1,a2,
a3,a4');
dump a;
{code}

Error message from 0.9:
ERROR 1200: Pig script failed to parse: MismatchedTokenException(93!=3)

Error message from 0.8:
ERROR 1000: Error during parsing. Lexical error at line 1, column 40. Encountered: \n (10), after : \'a1,a2,
[jira] [Created] (PIG-2250) Pig 0.9 error message not useful as compared to 0.8
Pig 0.9 error message not useful as compared to 0.8
---
Key: PIG-2250
URL: https://issues.apache.org/jira/browse/PIG-2250
Project: Pig
Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Vivek Padmanabhan
[jira] [Commented] (PIG-2238) Pig 0.9 error message not useful as compared to 0.8
[ https://issues.apache.org/jira/browse/PIG-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13090820#comment-13090820 ]

Vivek Padmanabhan commented on PIG-2238:

Looks like this issue was introduced as part of the parser changes. In Pig 0.9, the validation is done as below, in org.apache.pig.parser.AstValidator:

{code}
private void validateAliasRef(Set<String> aliases, CommonTree node, String alias) throws UndefinedAliasException {
    if( !aliases.contains( alias ) ) {
        throw new UndefinedAliasException( input, new SourceLocation( (PigParserNode)node ), alias );
    }
}
{code}

Here it just checks that the alias is contained in the set of aliases, but this set holds all the aliases in the script and does not check the order in which they are defined. Hence this will lead to other sorts of issues, such as a NullPointerException if I replace the F in the above script with:

F = foreach F generate *;

Pig 0.9 error message not useful as compared to 0.8
---
Key: PIG-2238
URL: https://issues.apache.org/jira/browse/PIG-2238
Project: Pig
Issue Type: Improvement
Affects Versions: 0.9.0
Reporter: Vivek Padmanabhan

The below is my faulty script (note the usage of alias F), for which Pig 0.9 composes a less useful message than 0.8:

{code}
A = load 'input' using TextLoader as (doc:chararray);
B = foreach A generate flatten(TOKENIZE(doc)) as myword;
C = group B by myword parallel 30;
D = foreach C generate group, COUNT(B) as count, SIZE(group) as size;
E = order D by size parallel 5;
F = limit F 20;
dump F;
{code}

For this script, the error message in 0.9 is:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2240: LogicalPlanVisitor can only visit logical plan

The error message in 0.8 is:
ERROR 1000: Error during parsing. Unrecognized alias F
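The fix the comment points toward is validating each reference against only the aliases defined on earlier statements. A minimal Python model of that order-aware check (hypothetical, not Pig's actual validator):

```python
class UndefinedAliasError(Exception):
    pass

def validate_alias_order(statements):
    """statements: list of (defined_alias, [referenced_aliases]) in script order.
    Every reference must name an alias defined by an *earlier* statement, so a
    self-reference like 'F = limit F 20;' is rejected instead of slipping through."""
    defined = set()
    for alias, refs in statements:
        for ref in refs:
            if ref not in defined:
                raise UndefinedAliasError(ref)
        # The alias becomes visible only after its own statement is validated.
        defined.add(alias)

# A model of the faulty script above: F references F before it is defined.
script = [("A", []), ("B", ["A"]), ("C", ["B"]),
          ("D", ["C"]), ("E", ["D"]), ("F", ["F"])]
```

Checking membership in the full alias set, as the 0.9 code does, accepts `("F", ["F"])`; the ordered check rejects it and can report "Unrecognized alias F" at parse time, as 0.8 did.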
[jira] [Created] (PIG-2238) Pig 0.9 error message not useful as compared to 0.8
Pig 0.9 error message not useful as compared to 0.8
---
Key: PIG-2238
URL: https://issues.apache.org/jira/browse/PIG-2238
Project: Pig
Issue Type: Improvement
Affects Versions: 0.9.0
Reporter: Vivek Padmanabhan
[jira] [Commented] (PIG-2217) POStore.getSchema() returns null if I dont have a schema defined at load statement
[ https://issues.apache.org/jira/browse/PIG-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089366#comment-13089366 ]

Vivek Padmanabhan commented on PIG-2217:

Sorry for my confusing comment. My point was: if I don't specify a schema definition along with my load statement, then PigStorageSchema won't save the schema files. This happens from Pig 0.8 onwards; with Pig 0.7 I can see the files saved. I believe this is because the schema object is null in 0.8, whereas 0.7 creates an empty schema. Is this behaviour expected from 0.8 onwards?

POStore.getSchema() returns null if I dont have a schema defined at load statement
---
Key: PIG-2217
URL: https://issues.apache.org/jira/browse/PIG-2217
Project: Pig
Issue Type: Bug
Affects Versions: 0.8.1, 0.9.0
Reporter: Vivek Padmanabhan

If I don't specify a schema definition in the load statement, POStore.getSchema() returns null, because of which PigOutputCommitter does not store the schema. For example, if I run the below script, the .pig_header and .pig_schema files won't be saved:

{code}
load_1 = LOAD 'i1' USING PigStorage();
ordered_data_1 = ORDER load_1 BY * ASC PARALLEL 1;
STORE ordered_data_1 INTO 'myout' using org.apache.pig.piggybank.storage.PigStorageSchema();
{code}

This works fine with Pig 0.7, but from 0.8 onwards StoreMetadata.storeSchema is not invoked for these cases.
[jira] [Created] (PIG-2221) Couldnt find documentation for ColumnMapKeyPrune optimization rule
Couldnt find documentation for ColumnMapKeyPrune optimization rule
---
Key: PIG-2221
URL: https://issues.apache.org/jira/browse/PIG-2221
Project: Pig
Issue Type: Bug
Components: documentation
Affects Versions: 0.8.1
Reporter: Vivek Padmanabhan

There is no documentation for some of the optimization rules, for example ColumnMapKeyPrune, in http://pig.apache.org/docs/r0.8.1/piglatin_ref1.html#Optimization+Rules. Moreover, I believe the documentation should say how to disable these rules using the -t option. It would also be nice if the documentation covered some use cases where it makes sense to disable an optimization rule.
[jira] [Created] (PIG-2217) POStore.getSchema() returns null if I dont have a schema defined at load statement
POStore.getSchema() returns null if I dont have a schema defined at load statement
---
Key: PIG-2217
URL: https://issues.apache.org/jira/browse/PIG-2217
Project: Pig
Issue Type: Bug
Affects Versions: 0.9.0, 0.8.1
Reporter: Vivek Padmanabhan
[jira] [Commented] (PIG-2217) POStore.getSchema() returns null if I dont have a schema defined at load statement
[ https://issues.apache.org/jira/browse/PIG-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13083939#comment-13083939 ]

Vivek Padmanabhan commented on PIG-2217:

For the above-mentioned script, the schema is marked as null from the logical layer itself, i.e. LOStore.getSchema() returns null. Since every schema is derived from its predecessor operators, the schema object for LOLoad itself is null. Hence this scenario will occur for all scripts which do not define a schema in the load statement.

In Pig 0.7, even if the schema value is null in the logical layer, it is wrapped with an empty schema during translation. For example, in LogToPhyTranslationVisitor:

{code}
public void visit(LOStore loStore) throws VisitorException {
    ...
    store.setSchema(new Schema(loStore.getSchema()));
    ...
}
{code}

Hence the files will look like below:

.pig_header (empty file)
.pig_schema:
{fields:[],version:0,sortKeys:[-1],sortKeyOrders:[ASCENDING]}

But from 0.8 (the new logical plan) onwards, the null value is returned directly, because of which the metadata is not saved. This change in behaviour came with the new logical plan introduced in Pig 0.8 and was carried over into Pig 0.9. Disabling the new logical plan in 0.8 (pig -useversion 0.8 -Dpig.usenewlogicalplan=false) will produce the .pig_header and .pig_schema files.
[jira] [Updated] (PIG-2181) Improvement : for error message when describe misses alias
[ https://issues.apache.org/jira/browse/PIG-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vivek Padmanabhan updated PIG-2181:
---
Assignee: Vivek Padmanabhan
Status: Patch Available (was: Open)

Improvement : for error message when describe misses alias
---
Key: PIG-2181
URL: https://issues.apache.org/jira/browse/PIG-2181
Project: Pig
Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Vivek Padmanabhan
Assignee: Vivek Padmanabhan
Priority: Minor
Labels: newbie
Fix For: 0.10
Attachments: PIG-2181_1.patch

In Pig 0.9, if I have a describe without an alias, for example:

describe;

it throws a NullPointerException like below:

{code}
ERROR 2999: Unexpected internal error. null
java.lang.NullPointerException
        at org.apache.pig.tools.grunt.GruntParser.processDescribe(GruntParser.java:270)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:317)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:188)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:164)
        at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
        at org.apache.pig.Main.run(Main.java:553)
        at org.apache.pig.Main.main(Main.java:108)
{code}

This message is of no use from a user's perspective, especially when the script becomes large and contains a couple of describe statements.
[jira] [Created] (PIG-2184) Not able to provide positional reference to macro invocations
Not able to provide positional reference to macro invocations - Key: PIG-2184 URL: https://issues.apache.org/jira/browse/PIG-2184 Project: Pig Issue Type: Bug Affects Versions: 0.9.0 Reporter: Vivek Padmanabhan It looks like the macro functionality does not support positional references. The below is an example script: DEFINE my_macro (X,key) returns Y { tmp1 = foreach $X generate TOKENIZE((chararray)$key) as tokens; tmp2 = foreach tmp1 generate flatten(tokens); tmp3 = order tmp2 by $0; $Y = distinct tmp3; } A = load 'sometext' using TextLoader() as (row1) ; E = my_macro(A,A.$0); dump E; This script execution fails at the parsing stage itself: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. file try1.pig, line 16, column 16 mismatched input '.' expecting RIGHT_PAREN If I replace A.$0 with the field name, i.e. row1, the script runs fine. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-2181) Improvement : for error message when describe misses alias
Improvement : for error message when describe misses alias -- Key: PIG-2181 URL: https://issues.apache.org/jira/browse/PIG-2181 Project: Pig Issue Type: Bug Affects Versions: 0.9.0 Reporter: Vivek Padmanabhan Priority: Minor In Pig 0.9, if I have a describe without an alias, it throws a NullPointerException like below. ERROR 2999: Unexpected internal error. null java.lang.NullPointerException at org.apache.pig.tools.grunt.GruntParser.processDescribe(GruntParser.java:270) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:317) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:188) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:164) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81) at org.apache.pig.Main.run(Main.java:553) at org.apache.pig.Main.main(Main.java:108) For example: describe; This message is of no use from a user's perspective, especially when the script becomes large and contains a couple of describe statements. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-2147) Support nested tags for XMLLoader
[ https://issues.apache.org/jira/browse/PIG-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vivek Padmanabhan updated PIG-2147: --- Attachment: PIG-2147_1.patch Attaching an initial patch. Support nested tags for XMLLoader - Key: PIG-2147 URL: https://issues.apache.org/jira/browse/PIG-2147 Project: Pig Issue Type: Bug Affects Versions: 0.8.1, 0.9.0 Reporter: Vivek Padmanabhan Assignee: Vivek Padmanabhan Fix For: 0.8.1, 0.9.0 Attachments: PIG-2147_1.patch Currently XMLLoader does not support nested tags with the same tag name, i.e. if I have the below content {code} <event> <relatedEvents> <event>x</event> <event>y</event> <event>z</event> </relatedEvents> </event> {code} And I load the above using XMLLoader, events = load 'input' using org.apache.pig.piggybank.storage.XMLLoader('event') as (doc:chararray); The output will be, {code} <event> <relatedEvents> <event>x</event> {code} Whereas the desired output is: {code} <relatedEvents> <event>x</event> <event>y</event> <event>z</event> </relatedEvents> {code} -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
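For illustration, the truncation described above is what happens when the loader stops at the first closing tag it sees. The Python sketch below is hypothetical (it is not XMLLoader's actual code, and it ignores attributes and whitespace inside tags); it shows how counting open/close depth recovers the full element even when the same tag name nests.

```python
# Illustrative sketch: naive matching stops at the first </tag>, which is why
# nested same-name <event> tags get truncated; tracking nesting depth finds
# the true closing tag instead.
def extract_element(tag, text):
    """Return the full <tag>...</tag> span, honouring nested same-name tags."""
    open_t, close_t = "<%s>" % tag, "</%s>" % tag
    start = text.find(open_t)
    if start < 0:
        return None
    depth, i = 0, start
    while i < len(text):
        if text.startswith(open_t, i):
            depth += 1
            i += len(open_t)
        elif text.startswith(close_t, i):
            depth -= 1
            i += len(close_t)
            if depth == 0:
                return text[start:i]
        else:
            i += 1
    return None  # unbalanced document

doc = "<event><relatedEvents><event>x</event><event>y</event></relatedEvents></event>"
# stopping at the first </event> would cut the span off right after "x"
```

With depth counting, `extract_element("event", doc)` returns the whole document rather than the truncated prefix.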
[jira] [Updated] (PIG-2152) Null pointer exception while reporting progress
[ https://issues.apache.org/jira/browse/PIG-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vivek Padmanabhan updated PIG-2152: --- Attachment: null_pointer_traces (copy) Attaching a list of different traces Null pointer exception while reporting progress --- Key: PIG-2152 URL: https://issues.apache.org/jira/browse/PIG-2152 Project: Pig Issue Type: Bug Affects Versions: 0.9.0 Reporter: Olga Natkovich Fix For: 0.9.0 Attachments: null_pointer_traces (copy) We have observed the following issues with code built from Pig 0.9 branch. We have not seen this with earlier versions; however, since this happens once in a while and is not reproducible at will it is not clear whether the issue is specific to 0.9 or not. Here is the stack: java.lang.NullPointerException at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.ProgressableReporter.progress(ProgressableReporter.java:37) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:399) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:261) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:256) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:58) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370) at org.apache.hadoop.mapred.Child$4.run(Child.java:261) at 
java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059) at org.apache.hadoop.mapred.Child.main(Child.java:255) Note that the code in progress function looks as follows: public void progress() { if(rep!=null) rep.progress(); } This points to some sort of synchronization issue -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
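The "check-then-act" shape of progress() is consistent with the synchronization theory above: rep can be cleared by another thread between the null check and the call. The sketch below is illustrative Python, not Pig's actual code or patch; the class and names are hypothetical. It shows the hazard and the common fix of reading the shared field once into a local.

```python
# Hypothetical sketch of the race in ProgressableReporter.progress():
# "if (rep != null) rep.progress();" can still dereference null if another
# thread clears rep between the test and the call.
class ProgressableReporterSketch:
    def __init__(self, rep=None):
        self.rep = rep

    def progress_racy(self):
        # check-then-act: self.rep may become None after the check,
        # the Python analogue of the NullPointerException in the trace
        if self.rep is not None:
            self.rep.progress()

    def progress_safe(self):
        local = self.rep          # read the shared field exactly once
        if local is not None:
            local.progress()

class CountingRep:
    """Stand-in for a Hadoop Reporter; counts progress() calls."""
    def __init__(self):
        self.calls = 0
    def progress(self):
        self.calls += 1

reporter = ProgressableReporterSketch(CountingRep())
reporter.progress_safe()
```

The snapshot-into-a-local pattern makes each call self-contained, so a concurrent write to the field can no longer fall between the check and the dereference.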
[jira] [Commented] (PIG-2152) Null pointer exception while reporting progress
[ https://issues.apache.org/jira/browse/PIG-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13061891#comment-13061891 ] Vivek Padmanabhan commented on PIG-2152: Even though it is not reproducible, the exception happens randomly and quite frequently. From most of the failed jobs it looks like this happens towards the end of the task execution. Till now, it has been seen only in map tasks. The below is one NullPointerException from progress() with a different call hierarchy; at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.ProgressableReporter.progress(ProgressableReporter.java:37) at org.apache.pig.data.DefaultAbstractBag.reportProgress(DefaultAbstractBag.java:369) at org.apache.pig.data.DefaultDataBag$DefaultDataBagIterator.next(DefaultDataBag.java:165) at org.apache.pig.data.DefaultDataBag$DefaultDataBagIterator.hasNext(DefaultDataBag.java:157) at org.apache.pig.data.BinInterSedes.writeBag(BinInterSedes.java:522) at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:361) at org.apache.pig.data.BinInterSedes.writeTuple(BinInterSedes.java:542) at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:357) at org.apache.pig.data.BinInterSedes.writeTuple(BinInterSedes.java:542) at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:357) at org.apache.pig.data.BinSedesTuple.write(BinSedesTuple.java:57) at org.apache.pig.impl.io.PigNullableWritable.write(PigNullableWritable.java:123) at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90) at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1069) at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691) at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80) at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:124) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:263) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:256) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:58) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370) at org.apache.hadoop.mapred.Child$4.run(Child.java:261) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059) at org.apache.hadoop.mapred.Child.main(Child.java:255) Null pointer exception while reporting progress --- Key: PIG-2152 URL: https://issues.apache.org/jira/browse/PIG-2152 Project: Pig Issue Type: Bug Affects Versions: 0.9.0 Reporter: Olga Natkovich Fix For: 0.9.0 Attachments: null_pointer_traces (copy) We have observed the following issues with code built from Pig 0.9 branch. We have not seen this with earlier versions; however, since this happens once in a while and is not reproducible at will it is not clear whether the issue is specific to 0.9 or not. 
Here is the stack: java.lang.NullPointerException at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.ProgressableReporter.progress(ProgressableReporter.java:37) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:399) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:261) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:256) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:58) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764) at
[jira] [Commented] (PIG-2036) Set header delimiter in PigStorageSchema
[ https://issues.apache.org/jira/browse/PIG-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13056330#comment-13056330 ] Vivek Padmanabhan commented on PIG-2036: Thanks Dmitriy, 0.9 will do. The same patch works with the 0.9 branch as well. Set header delimiter in PigStorageSchema Key: PIG-2036 URL: https://issues.apache.org/jira/browse/PIG-2036 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0, 0.8.0 Reporter: Mads Moeller Assignee: Mads Moeller Priority: Minor Fix For: 0.10 Attachments: PIG-2036.1.patch, PIG-2036.patch Piggybank's PigStorageSchema currently defaults the delimiter to a tab in the generated header file (.pig_header). The attached patch sets the header delimiter to what is passed in via the constructor; otherwise it defaults to tab '\t'. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-2146) POStore.getSchema() returns null because of which PigOutputCommitter is not storing schema while cleanup
POStore.getSchema() returns null because of which PigOutputCommitter is not storing schema while cleanup Key: PIG-2146 URL: https://issues.apache.org/jira/browse/PIG-2146 Project: Pig Issue Type: Bug Affects Versions: 0.8.1, 0.9.0 Reporter: Vivek Padmanabhan The below is my script; {code} register piggybank.jar; a = load 'myinput' using PigStorage(',') as (f1:chararray,f2:chararray,f3:chararray); b = distinct a; c = limit b 2; store c into 'pss001' using org.apache.pig.piggybank.storage.PigStorageSchema(); {code} Input --- a,1,aa b,2,bb c,3,cc For this script, PigStorageSchema is not generating the .pig_header and .pig_schema files. While debugging I could see that the storeSchema(..) method itself is not invoked. The schema object for the store is returned as null (POStore.getSchema()), because of which PigOutputCommitter is not invoking storeSchema. The same schema object is valid when I run it in local mode. This issue happens with Pig 0.9 also. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-2147) Support nested tags for XMLLoader
Support nested tags for XMLLoader - Key: PIG-2147 URL: https://issues.apache.org/jira/browse/PIG-2147 Project: Pig Issue Type: Bug Affects Versions: 0.8.1, 0.9.0 Reporter: Vivek Padmanabhan Assignee: Vivek Padmanabhan Currently XMLLoader does not support nested tags with the same tag name, i.e. if I have the below content {code} <event> <relatedEvents> <event>x</event> <event>y</event> <event>z</event> </relatedEvents> </event> {code} And I load the above using XMLLoader, events = load 'input' using org.apache.pig.piggybank.storage.XMLLoader('event') as (doc:chararray); The output will be, {code} <event> <relatedEvents> <event>x</event> {code} Whereas the desired output is: {code} <relatedEvents> <event>x</event> <event>y</event> <event>z</event> </relatedEvents> {code} -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-2036) Set header delimiter in PigStorageSchema
[ https://issues.apache.org/jira/browse/PIG-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055404#comment-13055404 ] Vivek Padmanabhan commented on PIG-2036: Hi Dmitriy, can we have this patch checked in for older versions also, like 0.8 and 0.9? Set header delimiter in PigStorageSchema Key: PIG-2036 URL: https://issues.apache.org/jira/browse/PIG-2036 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0, 0.8.0 Reporter: Mads Moeller Assignee: Mads Moeller Priority: Minor Fix For: 0.10 Attachments: PIG-2036.1.patch, PIG-2036.patch Piggybank's PigStorageSchema currently defaults the delimiter to a tab in the generated header file (.pig_header). The attached patch sets the header delimiter to what is passed in via the constructor; otherwise it defaults to tab '\t'. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (PIG-2135) Pig 0.9 ignoring Multiple filter conditions joined with AND/OR
[ https://issues.apache.org/jira/browse/PIG-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vivek Padmanabhan resolved PIG-2135. Resolution: Invalid Pig 0.9 ignoring Multiple filter conditions joined with AND/OR --- Key: PIG-2135 URL: https://issues.apache.org/jira/browse/PIG-2135 Project: Pig Issue Type: Bug Affects Versions: 0.9.0 Reporter: Vivek Padmanabhan Assignee: Thejas M Nair Priority: Critical Fix For: 0.9.0 When I have multiple filter conditions joined by AND/OR, all conditions except the first are ignored. For example, in the below script the second condition (org.udfs.Func09('e',w) == 1) is ignored: a = load 'sample_input' using PigStorage(',') as (q:chararray,w:chararray); b = filter a by org.udfs.Func09('f1',q) == 1 AND org.udfs.Func09('e',w) == 1 ; dump b; Output from the script (f1,a) (f1,e) -- this record should have been filtered by the second condition Input for the script: f1,a f2,b f3,c f1,e f2,f f5,e The explain of the alias b shows that the second condition is not included in the plan itself. The above statements work fine with Pig 0.8. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-2135) Pig 0.9 ignoring Multiple filter conditions joined with AND/OR
[ https://issues.apache.org/jira/browse/PIG-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053380#comment-13053380 ] Vivek Padmanabhan commented on PIG-2135: Hi Thejas, I was using an older version of Pig 0.9. This issue is not present in the latest code. Sorry about that. Pig 0.9 ignoring Multiple filter conditions joined with AND/OR --- Key: PIG-2135 URL: https://issues.apache.org/jira/browse/PIG-2135 Project: Pig Issue Type: Bug Affects Versions: 0.9.0 Reporter: Vivek Padmanabhan Assignee: Thejas M Nair Priority: Critical Fix For: 0.9.0 When I have multiple filter conditions joined by AND/OR, all conditions except the first are ignored. For example, in the below script the second condition (org.udfs.Func09('e',w) == 1) is ignored: a = load 'sample_input' using PigStorage(',') as (q:chararray,w:chararray); b = filter a by org.udfs.Func09('f1',q) == 1 AND org.udfs.Func09('e',w) == 1 ; dump b; Output from the script (f1,a) (f1,e) -- this record should have been filtered by the second condition Input for the script: f1,a f2,b f3,c f1,e f2,f f5,e The explain of the alias b shows that the second condition is not included in the plan itself. The above statements work fine with Pig 0.8. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-2135) Pig 0.9 ignoring Multiple filter conditions joined with AND/OR
[ https://issues.apache.org/jira/browse/PIG-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13052414#comment-13052414 ] Vivek Padmanabhan commented on PIG-2135: UDF used in the above example; {code} public class Func09 extends EvalFunc<Integer> { @Override public Integer exec(Tuple input) throws IOException { String field1 = (String)input.get(0); String field2 = (String)input.get(1); return field1.equals(field2) ? 1 : 0; } } {code} Pig 0.9 ignoring Multiple filter conditions joined with AND/OR --- Key: PIG-2135 URL: https://issues.apache.org/jira/browse/PIG-2135 Project: Pig Issue Type: Bug Affects Versions: 0.9.0 Reporter: Vivek Padmanabhan Priority: Critical Fix For: 0.9.0 When I have multiple filter conditions joined by AND/OR, all conditions except the first are ignored. For example, in the below script the second condition (org.udfs.Func09('e',w) == 1) is ignored: a = load 'sample_input' using PigStorage(',') as (q:chararray,w:chararray); b = filter a by org.udfs.Func09('f1',q) == 1 AND org.udfs.Func09('e',w) == 1 ; dump b; Output from the script (f1,a) (f1,e) -- this record should have been filtered by the second condition Input for the script: f1,a f2,b f3,c f1,e f2,f f5,e The explain of the alias b shows that the second condition is not included in the plan itself. The above statements work fine with Pig 0.8. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-2130) Piggybank:MultiStorage is not compressing output files
[ https://issues.apache.org/jira/browse/PIG-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vivek Padmanabhan updated PIG-2130: --- Attachment: PIG-2130_1.patch Attaching an initial patch Piggybank:MultiStorage is not compressing output files -- Key: PIG-2130 URL: https://issues.apache.org/jira/browse/PIG-2130 Project: Pig Issue Type: Bug Affects Versions: 0.8.0, 0.9.0 Reporter: Vivek Padmanabhan Assignee: Vivek Padmanabhan Attachments: PIG-2130_1.patch MultiStorage is not compressing the records while writing the output. Even though it takes a compression param, when the record is written it ignores the compression. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-2130) Piggybank:MultiStorage is not compressing output files
[ https://issues.apache.org/jira/browse/PIG-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13051841#comment-13051841 ] Vivek Padmanabhan commented on PIG-2130: Please note that if compression is used, the subfolders and output files will have the corresponding extension. For example, if output001.bz2 is the output path and f1,f2 are the keys, the files will look like: /tmp/output001.bz2 /tmp/output001.bz2/f1.bz2 /tmp/output001.bz2/f1.bz2/f1-0.bz2 /tmp/output001.bz2/f2.bz2 /tmp/output001.bz2/f2.bz2/f2-0.bz2 Piggybank:MultiStorage is not compressing output files -- Key: PIG-2130 URL: https://issues.apache.org/jira/browse/PIG-2130 Project: Pig Issue Type: Bug Affects Versions: 0.8.0, 0.9.0 Reporter: Vivek Padmanabhan Assignee: Vivek Padmanabhan Attachments: PIG-2130_1.patch MultiStorage is not compressing the records while writing the output. Even though it takes a compression parameter, the compression is ignored when the record is written. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-2130) Piggybank:MultiStorage is not compressing output files
[ https://issues.apache.org/jira/browse/PIG-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vivek Padmanabhan updated PIG-2130: --- Fix Version/s: 0.8.0 0.9.0 Status: Patch Available (was: Open) Piggybank:MultiStorage is not compressing output files -- Key: PIG-2130 URL: https://issues.apache.org/jira/browse/PIG-2130 Project: Pig Issue Type: Bug Affects Versions: 0.8.0, 0.9.0 Reporter: Vivek Padmanabhan Assignee: Vivek Padmanabhan Fix For: 0.9.0, 0.8.0 Attachments: PIG-2130_1.patch MultiStorage is not compressing the records while writing the output. Even though it takes a compression param, when the record is written it ignores the compression. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-2130) Piggybank:MultiStorage is not compressing output files
Piggybank:MultiStorage is not compressing output files -- Key: PIG-2130 URL: https://issues.apache.org/jira/browse/PIG-2130 Project: Pig Issue Type: Bug Affects Versions: 0.8.0, 0.9.0 Reporter: Vivek Padmanabhan Assignee: Vivek Padmanabhan MultiStorage is not compressing the records while writing the output. Even though it takes a compression param, when the record is written it ignores the compression. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-2098) jython - problem with single item tuple in bag
[ https://issues.apache.org/jira/browse/PIG-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vivek Padmanabhan updated PIG-2098: --- Description: While using a Python UDF, if I create a tuple with a single field, Pig execution fails with ClassCastException. Caused by: java.io.IOException: Error executing function: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Cannot convert jython type to pig datatype java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.pig.data.Tuple at org.apache.pig.scripting.jython.JythonFunction.exec(JythonFunction.java:111) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:245) An example to reproduce the issue: Pig Script {code} register 'mapkeys.py' using jython as mapkeys; A = load 'mapkeys.data' using PigStorage() as ( aMap: map[] ); C = foreach A generate mapkeys.keys(aMap); dump C; {code} mapkeys.py {code} @outputSchema("keys:bag{t:tuple(key:chararray)}") def keys(map): print "mapkeys.py:keys:map:", map outBag = [] for key in map.iterkeys(): t = (key) ## doesn't work, causes Pig to crash #t = (key,) ## adding empty value works :-/ outBag.append(t) print "mapkeys.py:keys:outBag:", outBag return outBag {code} Input data 'mapkeys.data' [name#John,phone#5551212] In the UDF, t = (key); because of this the item inside the bag is treated as a string instead of a tuple, which causes the class cast exception. If I provide an additional comma, t = (key,), then the script goes through fine. From the code what I can see is that, for t = (key,), pythonToPig(..) receives the pyObject as [(u'name',), (u'phone',)] from the PyFunction call. But for t = (key) the return from the PyFunction call is [u'name', u'phone'] was: While using a Python UDF, if I create a tuple with a single field, Pig execution fails with ClassCastException.
Caused by: java.io.IOException: Error executing function: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Cannot convert jython type to pig datatype java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.pig.data.Tuple at org.apache.pig.scripting.jython.JythonFunction.exec(JythonFunction.java:111) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:245) An example to reproduce the issue: jython - problem with single item tuple in bag -- Key: PIG-2098 URL: https://issues.apache.org/jira/browse/PIG-2098 Project: Pig Issue Type: Bug Affects Versions: 0.8.1, 0.9.0 Reporter: Vivek Padmanabhan While using a Python UDF, if I create a tuple with a single field, Pig execution fails with ClassCastException. Caused by: java.io.IOException: Error executing function: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Cannot convert jython type to pig datatype java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.pig.data.Tuple at org.apache.pig.scripting.jython.JythonFunction.exec(JythonFunction.java:111) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:245) An example to reproduce the issue: Pig Script {code} register 'mapkeys.py' using jython as mapkeys; A = load 'mapkeys.data' using PigStorage() as ( aMap: map[] ); C = foreach A generate mapkeys.keys(aMap); dump C; {code} mapkeys.py {code} @outputSchema("keys:bag{t:tuple(key:chararray)}") def keys(map): print "mapkeys.py:keys:map:", map outBag = [] for key in map.iterkeys(): t = (key) ## doesn't work, causes Pig to crash #t = (key,) ## adding empty value works :-/ outBag.append(t) print "mapkeys.py:keys:outBag:", outBag return outBag {code} Input data 'mapkeys.data' [name#John,phone#5551212] In the UDF, t = (key); because of this the item inside the bag is treated as a string instead of a tuple, which causes the class cast
exception. If I provide an additional comma, t = (key,), then the script goes through fine. From the code what I can see is that, for t = (key,), pythonToPig(..) receives the pyObject as [(u'name',), (u'phone',)] from the PyFunction call. But for t = (key) the return from the PyFunction call is [u'name', u'phone'] -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
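The root cause above is plain Python tuple syntax rather than anything Pig-specific: parentheses alone do not create a tuple, only the trailing comma does. A minimal standalone sketch (the map literal mirrors the 'mapkeys.data' input):

```python
# Why t = (key) breaks the UDF: parentheses alone do not make a Python tuple,
# so the bag ends up holding plain strings, which Pig cannot cast to Tuple.
t_wrong = ("name")       # just a parenthesised string
t_right = ("name",)      # trailing comma makes a one-field tuple

m = {"name": "John", "phone": "5551212"}
out_bag = [(k,) for k in m]   # each bag element is a real one-field tuple
```

This matches the observed pyObject values: with the comma, pythonToPig receives a list of tuples; without it, a list of bare strings.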
[jira] [Commented] (PIG-2021) Parser error while referring a map nested foreach
[ https://issues.apache.org/jira/browse/PIG-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13031114#comment-13031114 ] Vivek Padmanabhan commented on PIG-2021: Hi Xuefu, Extremely sorry about that. I was just trying to remove the dependencies. Please check whether the below is a valid case: {code} register mymapudf.jar; A = load 'temp' as ( s, m, l ); B = foreach A generate *, org.vivek.udfs.mToMapUDF((chararray) s) as mapout; C = foreach B { urlpath = (chararray) mapout#'k1'; lc_urlpath = org.vivek.udfs.LOWER((chararray) urlpath); generate urlpath,lc_urlpath; }; {code} Source for org.vivek.udfs.mToMapUDF {code} package org.vivek.udfs; import java.io.IOException; import java.util.HashMap; import java.util.Map; import org.apache.pig.EvalFunc; import org.apache.pig.data.DataType; import org.apache.pig.data.Tuple; import org.apache.pig.impl.logicalLayer.schema.Schema; public class mToMapUDF extends EvalFunc<Map<String, Object>> { public Map<String, Object> exec(Tuple arg0) throws IOException { Map<String, Object> myMapTResult = new HashMap<String, Object>(); myMapTResult.put("k1", "SomeString"); myMapTResult.put("k3", "SomeOtherString"); return myMapTResult; } public Schema outputSchema(Schema input) { return new Schema(new Schema.FieldSchema("mapout", DataType.MAP)); } } {code} Source for org.vivek.udfs.LOWER {code} package org.vivek.udfs; import java.io.IOException; import java.util.List; import java.util.ArrayList; import org.apache.pig.EvalFunc; import org.apache.pig.data.Tuple; import org.apache.pig.data.DataType; import org.apache.pig.impl.logicalLayer.schema.Schema; import org.apache.pig.impl.logicalLayer.FrontendException; import org.apache.pig.FuncSpec; public class LOWER extends EvalFunc<String> { public String exec(Tuple input) throws IOException { if (input == null || input.size() == 0) return null; try { String str = (String)input.get(0); return str.toLowerCase(); } catch(Exception e){ return null; } } public Schema outputSchema(Schema input) {
return new Schema(new Schema.FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), input), DataType.CHARARRAY)); } public List<FuncSpec> getArgToFuncMapping() throws FrontendException { List<FuncSpec> funcList = new ArrayList<FuncSpec>(); funcList.add(new FuncSpec(this.getClass().getName(), new Schema(new Schema.FieldSchema(null, DataType.CHARARRAY)))); return funcList; } } {code} Parser error while referring a map nested foreach - Key: PIG-2021 URL: https://issues.apache.org/jira/browse/PIG-2021 Project: Pig Issue Type: Bug Affects Versions: 0.9.0 Reporter: Vivek Padmanabhan Assignee: Xuefu Zhang Fix For: 0.9.0 The below script is throwing parser errors {code} register string.jar; A = load 'test1' using MapLoader() as ( s, m, l ); B = foreach A generate *, string.URLPARSE((chararray) s#'url') as parsedurl; C = foreach B { urlpath = (chararray) parsedurl#'path'; lc_urlpath = string.TOLOWERCASE((chararray) urlpath); generate *; }; {code} Error message; | Failed to generate logical plan. |Nested exception: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2225: Projection with nothing to reference! PIG-2002 reports a similar issue, but when I tried with the patch of PIG-2002 I was getting the below exception; ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: file repro.pig, line 11, column 33 mismatched input '(' expecting SEMI_COLON -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-2021) Parser error while referring a map nested foreach
[ https://issues.apache.org/jira/browse/PIG-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13030682#comment-13030682 ] Vivek Padmanabhan commented on PIG-2021: Attaching a script avoiding all the dependencies: {code} A = load 'temp' as ( s, m, l ); B = foreach A generate *, LOWER((chararray) s#'url') as parsedurl; C = foreach B { urlpath = (chararray) parsedurl#'path'; lc_urlpath = org.apache.pig.piggybank.evaluation.string.Reverse((chararray) urlpath); generate *; }; {code} Parser error while referring a map nested foreach - Key: PIG-2021 URL: https://issues.apache.org/jira/browse/PIG-2021 Project: Pig Issue Type: Bug Affects Versions: 0.9.0 Reporter: Vivek Padmanabhan Assignee: Xuefu Zhang Fix For: 0.9.0 The below script is throwing parser errors {code} register string.jar; A = load 'test1' using MapLoader() as ( s, m, l ); B = foreach A generate *, string.URLPARSE((chararray) s#'url') as parsedurl; C = foreach B { urlpath = (chararray) parsedurl#'path'; lc_urlpath = string.TOLOWERCASE((chararray) urlpath); generate *; }; {code} Error message; | Failed to generate logical plan. |Nested exception: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2225: Projection with nothing to reference! PIG-2002 reports a similar issue, but when I tried with the patch of PIG-2002 I was getting the below exception; ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: file repro.pig, line 11, column 33 mismatched input '(' expecting SEMI_COLON -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-2046) Properties defined through 'SET' are not passed through to fs commands
Properties defined through 'SET' are not passed through to fs commands -- Key: PIG-2046 URL: https://issues.apache.org/jira/browse/PIG-2046 Project: Pig Issue Type: Bug Affects Versions: 0.8.0, 0.9.0 Reporter: Vivek Padmanabhan The properties which are set through 'SET' commands are not passed through to FS commands. Example: SET dfs.umaskmode '026' fs -touchz umasktest/file0 It looks like the SET commands are processed by GruntParser after the FsShell creation happens with the current set of properties. Hence whatever properties are defined via SET will not be reflected for fs commands executed in the script. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
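The ordering problem described above can be sketched with only java.util.Properties. This is an illustrative stand-in, not Pig's actual GruntParser/FsShell code: a "shell" that snapshots the configuration at construction time never sees properties that are SET afterwards.

```java
import java.util.Properties;

// Hypothetical sketch of the ordering bug: a shell object that copies the
// configuration once at construction misses any property SET later.
// Class and method names are illustrative, not Pig's.
public class SetOrderingDemo {
    static class FsShellLike {
        private final Properties snapshot;
        FsShellLike(Properties conf) {
            // copies the properties once, like an FsShell built at parser startup
            this.snapshot = (Properties) conf.clone();
        }
        String umask() {
            return snapshot.getProperty("dfs.umaskmode", "022");
        }
    }

    public static String umaskSeenByEarlyShell() {
        Properties conf = new Properties();
        FsShellLike shell = new FsShellLike(conf); // created before SET runs
        conf.setProperty("dfs.umaskmode", "026");  // SET dfs.umaskmode '026'
        return shell.umask();                      // still the default
    }

    public static String umaskSeenByLateShell() {
        Properties conf = new Properties();
        conf.setProperty("dfs.umaskmode", "026");  // SET processed first
        FsShellLike shell = new FsShellLike(conf); // shell built afterwards
        return shell.umask();
    }
}
```

The sketch suggests why deferring the shell's creation (or re-reading the live properties) would make SET visible to fs commands.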
[jira] [Created] (PIG-2004) Incorrect input types passed on to eval function
Incorrect input types passed on to eval function Key: PIG-2004 URL: https://issues.apache.org/jira/browse/PIG-2004 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.9.0 Reporter: Vivek Padmanabhan Fix For: 0.9.0 The below script fails by throwing a ClassCastException from the MAX udf. The udf expects the values of the supplied bag to be DataByteArray, but at run time the udf gets the actual type, i.e. Double in this case. This causes the script execution to fail with the exception; | Caused by: java.lang.ClassCastException: java.lang.Double cannot be cast to org.apache.pig.data.DataByteArray The same script runs properly with Pig 0.8. {code} A = LOAD 'myinput' as (f1,f2,f3); B = foreach A generate f1,f2+f3/1000.0 as doub; C = group B by f1; D = foreach C generate (long)(MAX(B.doub)) as f4; dump D; {code} myinput --- a 100012345 b 200023456 c 300034567 a 150054321 b 250065432 -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
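The type mismatch can be illustrated with a stdlib-only sketch (this is not the source of Pig's MAX): a function whose schema declares bytearray input must not blind-cast, because under the bug above the runtime may hand it the concrete type (here Double) instead.

```java
// Illustrative, hypothetical UDF helper: accepts either a raw byte[]
// (the declared bytearray type) or an already-typed Double (the actual
// runtime type reported in this issue) and converts both to double.
public class LenientMax {
    public static double toDouble(Object field) {
        if (field instanceof Double) {
            return (Double) field;               // actual runtime type in PIG-2004
        }
        if (field instanceof byte[]) {
            return Double.parseDouble(new String((byte[]) field)); // declared type
        }
        throw new IllegalArgumentException("unexpected type: " + field.getClass());
    }

    public static double max(Object[] bagValues) {
        double best = Double.NEGATIVE_INFINITY;
        for (Object v : bagValues) best = Math.max(best, toDouble(v));
        return best;
    }
}
```

A UDF written defensively like this tolerates the mismatch, though the real fix belongs in the type propagation, not in every UDF.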
[jira] [Created] (PIG-1979) New logical plan failing with ERROR 2229: Couldn't find matching uid -1
New logical plan failing with ERROR 2229: Couldn't find matching uid -1 Key: PIG-1979 URL: https://issues.apache.org/jira/browse/PIG-1979 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0, 0.9.0 Reporter: Vivek Padmanabhan The below is my script {code} register myudf.jar; c01 = LOAD 'input' USING org.test.MyTableLoader(''); c02 = FILTER c01 BY result == 'OK' AND formatted IS NOT NULL AND formatted != '' ; c03 = FOREACH c02 GENERATE url, formatted, FLATTEN(usage); c04 = FOREACH c03 GENERATE usage::domain AS domain, url, formatted; doc_001 = FOREACH c04 GENERATE domain,url, FLATTEN(MyExtractor(formatted)) AS category; doc_004_1 = GROUP doc_001 BY (domain,url); doc_005 = FOREACH doc_004_1 GENERATE group.domain as domain, group.url as url, doc_001.category as category; STORE doc_005 INTO 'out_final' USING PigStorage(); review1 = FOREACH c04 GENERATE domain,url, MyExtractor(formatted) AS rev; review2 = FILTER review1 BY SIZE(rev) > 0; joinresult = JOIN review2 by (domain,url), doc_005 by (domain,url); finalresult = FOREACH joinresult GENERATE doc_005::category; STORE finalresult INTO 'out_final' using PigStorage(); {code} The script is failing while building the plan, when applying the logical optimization rule AddForEach. ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2229: Couldn't find matching uid -1 for project (Name: Project Type: bytearray Uid: 106 Input: 0 Column: 5) The problem is happening when I try to include doc_005::category in the projection for relation finalresult. This field originates from the udf org.vivek.udfs.MyExtractor (source given below). 
{code} import java.io.IOException; import org.apache.pig.EvalFunc; import org.apache.pig.data.*; import org.apache.pig.impl.logicalLayer.FrontendException; import org.apache.pig.impl.logicalLayer.schema.Schema; import org.apache.pig.impl.logicalLayer.schema.Schema.FieldSchema; public class MyExtractor extends EvalFunc<DataBag> { @Override public Schema outputSchema(Schema arg0) { try { return Schema.generateNestedSchema(DataType.BAG, DataType.CHARARRAY); } catch (FrontendException e) { System.err.println("Error while generating schema. " + e); return new Schema(new FieldSchema(null, DataType.BAG)); } } @Override public DataBag exec(Tuple inputTuple) throws IOException { try { Tuple tp2 = TupleFactory.getInstance().newTuple(1); tp2.set(0, (inputTuple.get(0).toString() + inputTuple.hashCode())); DataBag retBag = BagFactory.getInstance().newDefaultBag(); retBag.add(tp2); return retBag; } catch (Exception e) { throw new IOException("Caught exception", e); } } } {code} The script goes through fine if I disable the AddForEach rule with -t AddForEach -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-1966) Document Store and load from the same location does not support globbing
Document Store and load from the same location does not support globbing Key: PIG-1966 URL: https://issues.apache.org/jira/browse/PIG-1966 Project: Pig Issue Type: Improvement Components: documentation Affects Versions: 0.8.0 Reporter: Vivek Padmanabhan Priority: Minor Fix For: 0.8.0 If in my script there is a Store and a load from the same location like below; STORE A INTO '/user/myname/myoutputfolder'; D = LOAD '/user/myname/myoutputfolder/part*' ; This will cause my script to fail. Pig requires the store and load locations to be exactly the same to realize that there is a dependency. This behavior of Pig should be documented, preferably in http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Load%2FStore+Functions -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
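A minimal sketch of why the glob case is missed, under the behavior the report describes (exact location equality; the method names below are illustrative, not Pig's):

```java
// Hypothetical illustration: an exact-equality dependency check never
// matches a globbed load path against the store path it reads from.
public class StoreLoadDependency {
    // What the report says Pig requires: the two locations must be identical.
    public static boolean exactMatch(String storeLoc, String loadLoc) {
        return storeLoc.equals(loadLoc);
    }

    // A glob-aware check would have to strip the glob component first.
    public static boolean globAwareMatch(String storeLoc, String loadLoc) {
        int star = loadLoc.indexOf('*');
        String prefix = star >= 0
                ? loadLoc.substring(0, loadLoc.lastIndexOf('/', star))
                : loadLoc;
        return storeLoc.equals(prefix);
    }
}
```

So `/user/myname/myoutputfolder` vs `/user/myname/myoutputfolder/part*` fails the exact check but passes the glob-aware one, which is the gap the documentation should call out.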
[jira] [Created] (PIG-1948) java.lang.ClassCastException while using double value from result of a group
java.lang.ClassCastException while using double value from result of a group Key: PIG-1948 URL: https://issues.apache.org/jira/browse/PIG-1948 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0, 0.7.0, 0.9.0 Reporter: Vivek Padmanabhan I have a fairly simple script (but with too many columns) which is failing with a class cast exception. {code} register myudf.jar; A = load 'newinput' as (datestamp: chararray,vtestid: chararray,src_kt1: chararray,f1: chararray,f2: chararray,f3: chararray,f4: chararray,f5: chararray,f6: int,ipc: chararray,woeid: long,woeid_place: chararray,f7: chararray,f8: double,woeid_latitude: double,f9: chararray,woeid_town: chararray,woeid_county: chararray,a1: chararray,a2: chararray,woeid_country: chararray,a3: chararray,connection_speed: chararray,isp_name: chararray,isp_domain: chararray,ecnt: int,vcnt: int,ccnt: int,startts: int,duration: int,endts: int,stqust: chararray,startqc: chararray,starts_con: chararray,starts_lng: chararray,startv_pk1: int,startv_pk2: int,startv_pk3: int,startv_pk4: int,startv_pk5: int,lastquerystring: chararray,lastqc: chararray,lasts_con: chararray,lasts_lng: chararray,lastv_pk1: int,lastv_pk2: int,lastv_pk3: int,lastv_pk4: int,lastv_pk5: int,b1: chararray,lastsection: chararray,lastseclink: chararray,lasturl: chararray,path: chararray,pathtype: chararray,firstlastquerymatch: int,log_duration: double,log_duration_sq: double,duration_sq: double); B = foreach A generate datestamp,src_kt1,vtestid,stqust,ecnt,vcnt,ccnt,log_duration,duration; C = group B by ( datestamp, src_kt1,vtestid, stqust ) parallel 4; D = foreach C generate COUNT( B ) as total, MyEval( B.log_duration ) as log_duration_summary; store D into 'output'; {code} The above script is failing with a class cast exception; {code} java.lang.ClassCastException: java.lang.Double cannot be cast to java.lang.String at org.apache.pig.data.BinInterSedes.readMap(BinInterSedes.java:193) at 
org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:280) at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:251) at org.apache.pig.data.BinInterSedes.readTuple(BinInterSedes.java:111) at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:270) at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:251) at org.apache.pig.data.BinInterSedes.addColsToTuple(BinInterSedes.java:555) at org.apache.pig.data.BinSedesTuple.readFields(BinSedesTuple.java:64) at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114) at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67) at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40) at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116) at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92) at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175) at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1376) . . {code} The problem is happening in the line MyEval( B.log_duration ): here, even though log_duration is defined as a double field, BinInterSedes is considering it a map value, TINYMAP to be exact. Hence it is trying to cast the double value into the key identifier, i.e. a String. This bug exists in 0.9 also. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
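The failure shape can be reproduced with a stdlib-only sketch of a marker-dispatch deserializer. The constants and layout below are hypothetical stand-ins, not BinInterSedes' real encoding: each datum is preceded by a one-byte type marker, and a map entry is a (String key, datum value) pair, so a double payload tagged with the map marker gets decoded as a "key" and the cast to String fails, mirroring the trace above.

```java
import java.io.*;

// Illustrative marker-dispatch reader. DOUBLE/TINYMAP values are made up;
// only the failure mechanism (wrong marker -> wrong cast) matches the report.
public class MarkerDemo {
    static final byte DOUBLE = 1, TINYMAP = 2;

    public static Object readDatum(DataInput in) throws IOException {
        byte marker = in.readByte();
        switch (marker) {
            case DOUBLE:
                return in.readDouble();
            case TINYMAP: {
                String key = (String) readDatum(in); // ClassCastException when
                                                     // the entry is a double
                Object val = readDatum(in);
                return java.util.Collections.singletonMap(key, val);
            }
            default:
                throw new IOException("unknown marker " + marker);
        }
    }

    // Correct encoding: [DOUBLE][8-byte payload] round-trips cleanly.
    public static Object writeAndReadDouble(double v) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            out.writeByte(DOUBLE);
            out.writeDouble(v);
            return readDatum(new DataInputStream(new ByteArrayInputStream(bos.toByteArray())));
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    // Buggy encoding: the same double payload tagged as a TINYMAP, whose first
    // entry is then decoded -- and cast -- as the map key.
    public static String misreadDoubleAsMap() {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            out.writeByte(TINYMAP);          // wrong marker for a double field
            out.writeByte(DOUBLE);
            out.writeDouble(3.14);
            readDatum(new DataInputStream(new ByteArrayInputStream(bos.toByteArray())));
            return "ok";
        } catch (ClassCastException e) {
            return "ClassCastException";     // Double cannot be cast to String
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }
}
```

The sketch shows why a single wrong marker byte is enough to turn a valid double into a map-key cast failure deep inside deserialization.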
[jira] Commented: (PIG-1911) Infinite loop with accumulator function in nested foreach
[ https://issues.apache.org/jira/browse/PIG-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13007815#comment-13007815 ] Vivek Padmanabhan commented on PIG-1911: In this case pig is calling the getValue() and cleanup() methods infinitely. The below is the udf source just in case; {code} public class MyCOUNT extends EvalFunc<Long> implements Accumulator<Long> { @Override public Long exec(Tuple input) throws IOException { DataBag bag = (DataBag)input.get(0); Iterator it = bag.iterator(); long cnt = 0; while (it.hasNext()){ Tuple t = (Tuple)it.next(); if (t != null && t.size() > 0 && t.get(0) != null ) cnt++; } return cnt; } @Override public Schema outputSchema(Schema input) { return new Schema(new Schema.FieldSchema(null, DataType.LONG)); } private long intermediateCount = 0L; @Override public void accumulate(Tuple b) throws IOException { DataBag bag = (DataBag)b.get(0); Iterator it = bag.iterator(); while (it.hasNext()){ Tuple t = (Tuple)it.next(); if (t != null && t.size() > 0 && t.get(0) != null) { intermediateCount += 1; } } } @Override public void cleanup() { intermediateCount = 0L; } @Override public Long getValue() { return intermediateCount; } } {code} Infinite loop with accumulator function in nested foreach - Key: PIG-1911 URL: https://issues.apache.org/jira/browse/PIG-1911 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Olga Natkovich Assignee: Thejas M Nair Fix For: 0.8.0 Sample script: register v_udf.jar; a = load '2records' as (f1:chararray,f2:chararray); b = group a by f1; d = foreach b { sort = order a by f1; generate org.udfs.MyCOUNT(sort) as something ; } dump d; This causes an infinite loop if MyCOUNT implements the Accumulator interface. The workaround is to take the function out of the nested foreach into a separate foreach statement. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
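The contract implied above can be sketched with a stdlib-only stand-in for org.apache.pig.Accumulator: a correct driver calls accumulate() once per batch, getValue() once, then cleanup() to reset, whereas the buggy nested-foreach driver loops on getValue()/cleanup() forever. This is an illustration of the expected call sequence, not Pig's runtime code.

```java
import java.util.List;

// Hypothetical accumulator protocol demo; Acc is a stand-in interface.
public class AccumulatorProtocol {
    interface Acc<T> {
        void accumulate(List<Object> batch);
        T getValue();
        void cleanup();
    }

    static class Count implements Acc<Long> {
        private long n = 0;
        public void accumulate(List<Object> batch) {
            for (Object o : batch) if (o != null) n++; // same null check as MyCOUNT
        }
        public Long getValue() { return n; }
        public void cleanup() { n = 0; }
    }

    // The correct driver loop: all batches, then one getValue, one cleanup.
    public static long drive(List<List<Object>> batches) {
        Count c = new Count();
        for (List<Object> b : batches) c.accumulate(b);
        long result = c.getValue();
        c.cleanup();   // reset for the next group -- called once, not in a loop
        return result;
    }
}
```

Since getValue() and cleanup() carry no termination signal themselves, a driver that re-enters them cannot make progress, which matches the infinite getValue()/cleanup() calls observed.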
[jira] Created: (PIG-1902) Documentation : Flatten behaviour should be updated in 0.7 docs
Documentation : Flatten behaviour should be updated in 0.7 docs --- Key: PIG-1902 URL: https://issues.apache.org/jira/browse/PIG-1902 Project: Pig Issue Type: Improvement Components: documentation Affects Versions: 0.7.0 Reporter: Vivek Padmanabhan Priority: Minor Fix For: 0.7.0 In the 0.8 documentation (http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Flatten+Operator) the behavior of flatten for empty bags is well documented. {code} Also note that the flatten of empty bag will result in that row being discarded; no output is generated. {code} Since this is applicable for Pig 0.7 also, the same should be documented in : http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#Flatten+Operator -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (PIG-1894) Wrong stats shown when there are multiple stores but same file names
[ https://issues.apache.org/jira/browse/PIG-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13006328#comment-13006328 ] Vivek Padmanabhan commented on PIG-1894: Just for reference; the issue is PIG-1779 (Wrong stats shown when there are multiple loads but same file names) Wrong stats shown when there are multiple stores but same file names Key: PIG-1894 URL: https://issues.apache.org/jira/browse/PIG-1894 Project: Pig Issue Type: Bug Affects Versions: 0.8.0, 0.9.0 Reporter: Vivek Padmanabhan Assignee: Richard Ding Fix For: 0.9.0 Pig 0.8/0.9 shows wrong stats for store counters when I have multiple stores but with the same file name. To reproduce the issue please use the below script : {code} A = load 'sampledata1' as (f1:chararray,f2:chararray,f3:int); B = filter A by f3==1; C = filter A by f3==2; D = filter A by f3==3; store B into '/folder/B/out.gz'; store C into '/folder/C/out.gz'; store D into '/folder/D/out.gz'; {code} Input {code} aaa a 1 aaa b 1 bbb a 2 bbb b 2 ccc a 3 ccc b 3 {code} For this script Pig shows Output(s): Successfully stored 6 records (32 bytes) in: /folder/B/out.gz Successfully stored 6 records (32 bytes) in: /folder/C/out.gz Successfully stored 6 records (32 bytes) in: /folder/D/out.gz Counters: Total records written : 18 Total bytes written : 96 -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
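A hypothetical sketch consistent with the symptom above: if store counters were keyed by the output file name instead of the full path, the three stores of "out.gz" would collapse into one bucket and every store would then report the combined total. This is an illustration of the aggregation mistake, not Pig's actual stats code.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative counter aggregation: per-store record counts merged either by
// full path (correct) or by bare file name (the collision shown in the report).
public class StoreCounters {
    public static Map<String, Integer> byKey(Map<String, Integer> stores,
                                             boolean useFullPath) {
        Map<String, Integer> counters = new HashMap<>();
        for (Map.Entry<String, Integer> e : stores.entrySet()) {
            String path = e.getKey();
            String key = useFullPath
                    ? path
                    : path.substring(path.lastIndexOf('/') + 1); // "out.gz"
            counters.merge(key, e.getValue(), Integer::sum);
        }
        return counters;
    }
}
```

With 2 records per store, file-name keying yields a single "out.gz" bucket of 6, matching the inflated per-store numbers in the report.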
[jira] Created: (PIG-1895) Class cast exception while projecting udf result
Class cast exception while projecting udf result Key: PIG-1895 URL: https://issues.apache.org/jira/browse/PIG-1895 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0, 0.7.0, 0.9.0 Reporter: Vivek Padmanabhan A class cast exception is thrown when I try to project the result from my udf. The udf has a defined schema: DataType.BAG, DataType.LONG and DataType.INTEGER. The below is my script {code} Data = load 'file:/home/pvivek/Desktop/input' using PigStorage() as ( i: int ); AllData = group Data all parallel 1; SampledData = foreach AllData generate org.vivek.TestEvalFunc(Data, 5) as rs; SampledData1 = foreach SampledData generate rs.sampled; {code} Even though the output schema defines sampled as a data bag, while processing, instead of sending only the data bag generated from the UDF, the entire tuple was sent to the projection as the result. {code} Exception received : java.lang.ClassCastException: org.apache.pig.data.BinSedesTuple cannot be cast to org.apache.pig.data.DataBag at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:484) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:197) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:480) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:197) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:339) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:434) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:402) 
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:382) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:1) at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176) at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216) {code} This issue is happening with 0.9/0.8 and 0.7 -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
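The failure can be illustrated stdlib-only, with tiny stand-in classes for Pig's Tuple and DataBag: the projection's bag-processing path casts its input to a bag, so when the runtime hands it the UDF's whole output tuple instead of the tuple's bag field, the cast blows up the same way the trace shows. This is a sketch of the mechanism, not POProject's source.

```java
// Hypothetical stand-ins for org.apache.pig.data.Tuple / DataBag.
public class ProjectBag {
    static class Tuple { final Object[] fields; Tuple(Object... f) { fields = f; } }
    static class Bag { final Tuple[] tuples; Bag(Tuple... t) { tuples = t; } }

    // Models the projection's bag path: it expects to receive a bag.
    public static String process(Object input) {
        try {
            Bag b = (Bag) input;
            return "bag with " + b.tuples.length + " tuples";
        } catch (ClassCastException e) {
            return "ClassCastException";  // Tuple cannot be cast to Bag
        }
    }

    // What should reach the projection for 'rs.sampled': only field 0 (the bag).
    public static Object correctInput(Tuple udfResult) {
        return udfResult.fields[0];
    }
}
```

Feeding `process` the whole (bag, long, int) tuple reproduces the cast failure; extracting field 0 first succeeds.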
[jira] Commented: (PIG-1895) Class cast exception while projecting udf result
[ https://issues.apache.org/jira/browse/PIG-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13005604#comment-13005604 ] Vivek Padmanabhan commented on PIG-1895: UDF Source code : {code} import java.io.IOException; import java.util.ArrayList; import java.util.Iterator; import org.apache.pig.EvalFunc; import org.apache.pig.data.DataBag; import org.apache.pig.data.DataType; import org.apache.pig.data.DefaultBagFactory; import org.apache.pig.data.DefaultTupleFactory; import org.apache.pig.data.Tuple; import org.apache.pig.impl.logicalLayer.schema.Schema; public class TestEvalFunc extends EvalFunc<Tuple> { public Tuple exec(Tuple input) throws IOException { ArrayList<Tuple> tupleList = new ArrayList<Tuple>(2); DataBag values = (DataBag)(input.get(0)); for (Iterator<Tuple> vit = values.iterator(); vit.hasNext();) { tupleList.add(vit.next()); } DataBag sampleBag = DefaultBagFactory.getInstance().newDefaultBag(tupleList); Tuple output = DefaultTupleFactory.getInstance().newTuple(3); output.set(0, sampleBag); output.set(1, new Long(3)); output.set(2, new Integer(2)); return output; } public Schema outputSchema(Schema input) { Schema udfSchema = new Schema(); udfSchema.add(new Schema.FieldSchema("sampled", DataType.BAG)); udfSchema.add(new Schema.FieldSchema("k", DataType.LONG)); udfSchema.add(new Schema.FieldSchema("i", DataType.INTEGER)); return udfSchema; } } {code} Test Case to Verify; {code} import static org.apache.pig.ExecType.LOCAL; import java.util.ArrayList; import java.util.Iterator; import junit.framework.TestCase; import org.apache.pig.PigServer; import org.apache.pig.data.Tuple; public class MyPigUnitTests extends TestCase { private static String patternString = "(\\d+)!+(\\w+)~+(\\w+)"; public static ArrayList<String[]> data = new ArrayList<String[]>(); static { data.add(new String[] { "1" }); data.add(new String[] { "3" }); } private static String [] script = new String []{ "Data = load 'file:/home/pvivek/Desktop/input' using PigStorage() as ( i: int );", "AllData = group Data all parallel 1;", "SampledData = foreach AllData generate org.vivek.TestEvalFunc(Data, 5) as rs;", "SampledData1 = foreach SampledData generate rs.sampled;", }; public void test () throws Exception { String filename = TestHelper.createTempFile(data, ""); PigServer pig = new PigServer(LOCAL); filename = filename.replace("\\", ""); patternString = patternString.replace("\\", ""); for (String query : script) { pig.registerQuery(query); } Iterator<?> it = pig.openIterator("SampledData1"); int tupleCount = 0; while (it.hasNext()) { Tuple tuple = (Tuple) it.next(); if (tuple == null) break; else { if (tuple.size() > 0) { tupleCount++; } } } assertEquals(1, tupleCount); } } {code} Class cast exception while projecting udf result Key: PIG-1895 URL: https://issues.apache.org/jira/browse/PIG-1895 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0, 0.8.0, 0.9.0 Reporter: Vivek Padmanabhan Class cast exception is thrown when I try to project the result from my udf. The udf has a defined schema DataType.BAG, DataType.LONG and DataType.INTEGER The below is my script {code} Data = load 'file:/home/pvivek/Desktop/input' using PigStorage() as ( i: int ); AllData = group Data all parallel 1; SampledData = foreach AllData generate org.vivek.TestEvalFunc(Data, 5) as rs; SampledData1 = foreach SampledData generate rs.sampled; {code} Even though the output schema defines sampled as a data bag, while processing, instead of sending only the data bag generated from the UDF, the entire tuple was sent to the projection as the result. 
[jira] Created: (PIG-1894) Wrong stats shown when there are multiple stores but same file names
Wrong stats shown when there are multiple stores but same file names Key: PIG-1894 URL: https://issues.apache.org/jira/browse/PIG-1894 Project: Pig Issue Type: Bug Affects Versions: 0.8.0, 0.9.0 Reporter: Vivek Padmanabhan Pig 0.8/0.9 shows wrong stats for store counters when I have multiple stores but with the same file name. To reproduce the issue please use the below script : {code} A = load 'sampledata1' as (f1:chararray,f2:chararray,f3:int); B = filter A by f3==1; C = filter A by f3==2; D = filter A by f3==3; store B into '/folder/B/out.gz'; store C into '/folder/C/out.gz'; store D into '/folder/D/out.gz'; {code} Input {code} aaa a 1 aaa b 1 bbb a 2 bbb b 2 ccc a 3 ccc b 3 {code} For this script Pig shows Output(s): Successfully stored 6 records (32 bytes) in: /folder/B/out.gz Successfully stored 6 records (32 bytes) in: /folder/C/out.gz Successfully stored 6 records (32 bytes) in: /folder/D/out.gz Counters: Total records written : 18 Total bytes written : 96 -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (PIG-1842) Improve Scalability of the XMLLoader for large datasets such as wikipedia
[ https://issues.apache.org/jira/browse/PIG-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13000794#comment-13000794 ] Vivek Padmanabhan commented on PIG-1842: The errors are because PIG-1839 (XMLLoader will always add an extra empty tuple even if no tags are matched), which corrects these test cases, was not applied to the 0.8 branch. Improve Scalability of the XMLLoader for large datasets such as wikipedia - Key: PIG-1842 URL: https://issues.apache.org/jira/browse/PIG-1842 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0, 0.8.0, 0.9.0 Reporter: Viraj Bhat Assignee: Vivek Padmanabhan Fix For: 0.7.0, 0.8.0, 0.9.0 Attachments: PIG-1842_1.patch, PIG-1842_2.patch, TEST-org.apache.pig.piggybank.test.storage.TestXMLLoader.txt The current XMLLoader for Pig does not work well for large datasets such as the wikipedia dataset. Each mapper reads in the entire XML file, resulting in extremely slow run times. Viraj -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (PIG-1842) Improve Scalability of the XMLLoader for large datasets such as wikipedia
[ https://issues.apache.org/jira/browse/PIG-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999315#comment-12999315 ] Vivek Padmanabhan commented on PIG-1842: Hi Alan, The below is how I have handled these cases. Note :- The XMLLoader will consider one record to span from begin tag to end tag, just like a line record reader searching for the newline char. Split start and end locations are provided by the default FileInputFormat. Describing the entire steps in a simple way;
* The loader will collect the start and end tags and create a record out of them (XMLLoaderBufferedPositionedInputStream.collectTag).
* For the begin tag:
** Read till the tag is found in this block.
** If the tag is not found and the split end has been reached, then no record was found in this split (return an empty array).
** If a partial tag is found in the current split, then even though the split end has been reached, continue reading the rest of the file beyond the split end location (handled by the condition in the while loop).
* For the end tag:
** Read till the end tag is found, even if the split end location is reached.
{quote} How far will split 1 read? It seems like it has to read to </a> or else the map processing split one will not be able to process this as a coherent document. Yet from the setting of maxBytesReadable on line 132 it looks to me like it won't read past the end point. {quote}
The other condition (matchBuf.size() > 0) will keep the reading going on. Here in this case let's say my tag identifier is <a>. Then the loader will read till the split end to search for the begin tag. Now for the end tag, it reads the rest of the file starting from the last read position. Let's say the split end has been reached in between; it will check whether it has found a match or a partial match. If not, it proceeds with the reading till it finds an end tag. 
Improve Scalability of the XMLLoader for large datasets such as wikipedia - Key: PIG-1842 URL: https://issues.apache.org/jira/browse/PIG-1842 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0, 0.8.0, 0.9.0 Reporter: Viraj Bhat Assignee: Vivek Padmanabhan Fix For: 0.7.0, 0.8.0, 0.9.0 Attachments: PIG-1842_1.patch, PIG-1842_2.patch The current XMLLoader for Pig does not work well for large datasets such as the wikipedia dataset. Each mapper reads in the entire XML file, resulting in extremely slow run times. Viraj -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira
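The split-boundary policy described in the comment above can be sketched stdlib-only (the real logic lives in XMLLoaderBufferedPositionedInputStream; this is a simplified in-memory stand-in): search for the begin tag only up to the split end, but once a record has started, keep reading past the split end until the matching end tag closes it.

```java
// Hypothetical, simplified record extraction over an in-memory "file".
public class SplitTagReader {
    // Returns the record between <tag> and </tag> that *starts* inside
    // [splitStart, splitEnd), or null if no begin tag starts in this split.
    public static String readRecord(String file, String tag,
                                    int splitStart, int splitEnd) {
        String open = "<" + tag + ">", close = "</" + tag + ">";
        int begin = file.indexOf(open, splitStart);
        if (begin < 0 || begin >= splitEnd) return null; // no record in split
        // The end tag may lie beyond splitEnd; keep reading past the boundary.
        int end = file.indexOf(close, begin);
        if (end < 0) return null; // unterminated record
        return file.substring(begin, end + close.length());
    }
}
```

Because each record belongs to the split where its begin tag starts, and the reader is allowed past the split end only to finish an already-started record, every record is produced exactly once across splits.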
[jira] Commented: (PIG-1842) Improve Scalability of the XMLLoader for large datasets such as wikipedia
[ https://issues.apache.org/jira/browse/PIG-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999316#comment-12999316 ] Vivek Padmanabhan commented on PIG-1842: I have done manual tests for the split boundary conditions. Please suggest whether/how I can do the same with unit tests. Improve Scalability of the XMLLoader for large datasets such as wikipedia - Key: PIG-1842 URL: https://issues.apache.org/jira/browse/PIG-1842 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0, 0.8.0, 0.9.0 Reporter: Viraj Bhat Assignee: Vivek Padmanabhan Fix For: 0.7.0, 0.8.0, 0.9.0 Attachments: PIG-1842_1.patch, PIG-1842_2.patch The current XMLLoader for Pig does not work well for large datasets such as the wikipedia dataset. Each mapper reads in the entire XML file, resulting in extremely slow run times. Viraj -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Created: (PIG-1868) New logical plan fails when I have complex data types from udf
New logical plan fails when I have complex data types from udf -- Key: PIG-1868 URL: https://issues.apache.org/jira/browse/PIG-1868 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Vivek Padmanabhan The new logical plan fails when I have complex data types returned from my eval function. The below is my script : {code} register myudf.jar; B1 = load 'myinput' as (id:chararray,ts:int,url:chararray); B2 = group B1 by id; B = foreach B2 { Tuples = order B1 by ts; generate Tuples; }; C1 = foreach B generate TransformToMyDataType(Tuples,-1,0,1) as seq: { t: ( previous, current, next ) }; C2 = foreach C1 generate FLATTEN(seq); C3 = foreach C2 generate current.id as id; dump C3; {code} On C3 it fails with the below message : {code} Couldn't find matching uid -1 for project (Name: Project Type: bytearray Uid: 45 Input: 0 Column: 1) {code} The below is the describe on C1 ; {code} C1: {seq: {t: (previous: (id: chararray,ts: int,url: chararray),current: (id: chararray,ts: int,url: chararray),next: (id: chararray,ts: int,url: chararray))}} {code} The script works if I turn off the new logical plan or use Pig 0.7. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Created: (PIG-1865) BinStorage/PigStorageSchema cannot load data from a different namenode
BinStorage/PigStorageSchema cannot load data from a different namenode -- Key: PIG-1865 URL: https://issues.apache.org/jira/browse/PIG-1865 Project: Pig Issue Type: Bug Affects Versions: 0.8.0, 0.7.0, 0.9.0 Reporter: Vivek Padmanabhan BinStorage/PigStorageSchema cannot load data from a different namenode. The main reason for this is that, in the getSchema method, they use org.apache.pig.impl.io.FileLocalizer to check whether the file exists, but the filesystem in HDataStorage refers to the natively configured dfs. The test case is simple : a = load 'hdfs://nn2/input' using BinStorage(); dump a; Here if I specify -Dmapreduce.job.hdfs-servers, it should have worked, but Pig still takes the fs from fs.default.name, so to make it work I had to override fs.default.name on the pig command line. Raising this as a bug since the same scenario works with PigStorage. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira
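The direction of the fix implied above can be sketched with only java.net.URI (method names are illustrative, not FileLocalizer's): the existence check should take the filesystem from the path's own scheme and authority (hdfs://nn2) and fall back to fs.default.name only when the path carries none.

```java
import java.net.URI;

// Hypothetical resolver: which namenode should serve this path?
public class FsResolver {
    public static String fsFor(String path, String defaultFs) {
        URI uri = URI.create(path);
        if (uri.getScheme() == null || uri.getAuthority() == null) {
            return defaultFs;                      // relative or schemeless path
        }
        // Fully qualified path: honor its own namenode, not fs.default.name.
        return uri.getScheme() + "://" + uri.getAuthority();
    }
}
```

Under this rule, `hdfs://nn2/input` resolves against nn2 regardless of the cluster's default filesystem, which is the behavior the report expects from BinStorage.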
[jira] Created: (PIG-1864) Pig 0.8 Documentation : non-ascii characters present in sample udf scripts
Pig 0.8 Documentation : non-ASCII characters present in sample udf scripts --- Key: PIG-1864 URL: https://issues.apache.org/jira/browse/PIG-1864 Project: Pig Issue Type: Bug Components: documentation Affects Versions: 0.8.0 Reporter: Vivek Padmanabhan Priority: Minor In the documentation at http://pig.apache.org/docs/r0.8.0/udf.html#Python+UDFs , for the sample script UDFs, there are some non-ASCII characters present. Because of this, when we try to execute the sample scripts they fail with the error ERROR 2999: Unexpected internal error. null SyntaxError: Non-ASCII character in file 'iostream', but no encoding declared; see http://www.python.org/peps/pep-0263.html for details In some lines of the sample scripts provided, wrong characters are present. For example : {code} @outputSchema("onestring:chararray") {code} {code} @outputSchema("y:bag{t:tuple(len:int,word:chararray)}") {code} Requesting a look at all the udf examples present, since it is a common practice to copy the examples directly and run them. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira
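A small stdlib check one can run over a copied sample before executing it, to find the character that trips the "Non-ASCII character ... but no encoding declared" error described above (typically a curly quote pasted from the rendered docs):

```java
// Reports the first non-ASCII character in a string, or -1 if it is clean.
public class AsciiCheck {
    public static int firstNonAscii(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) > 0x7F) return i;   // outside 7-bit ASCII
        }
        return -1;
    }
}
```

Running this over each line of a pasted UDF pinpoints exactly where a smart quote or other typographic character needs replacing with its plain ASCII form.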
[jira] Created: (PIG-1860) Bug in plan built for Nested foreach
Bug in plan built for Nested foreach - Key: PIG-1860 URL: https://issues.apache.org/jira/browse/PIG-1860 Project: Pig Issue Type: Bug Affects Versions: 0.8.0, 0.9.0 Reporter: Vivek Padmanabhan Using the same inputs as in PIG-1858, {code} register myanotherudf.jar; A = load 'myinput' using PigStorage() as ( date:chararray,bcookie:chararray,count:int,avg:double,pvs:int); B = foreach A generate (int)(avg / 100.0) * 100 as avg, pvs; C = group B by ( avg ); D = foreach C { Pvs = order B by pvs; Const = org.vivek.MyAnotherUDF(Pvs.pvs).(count,sum); generate Const.sum as sum; }; store D into 'out_D'; {code} In this script even though I am passing Pvs.pvs to the UDF in the nested foreach, at runtime the avg is getting passed. It looks like the logical plan created for D is wrong. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (PIG-1858) NullPointerException while compiling the new logical plan
[ https://issues.apache.org/jira/browse/PIG-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vivek Padmanabhan updated PIG-1858: --- Attachment: MyAnotherUDF.java Attaching the udf source NullPointerException while compiling the new logical plan - Key: PIG-1858 URL: https://issues.apache.org/jira/browse/PIG-1858 Project: Pig Issue Type: Bug Affects Versions: 0.8.0, 0.9.0 Reporter: Vivek Padmanabhan Attachments: MyAnotherUDF.java The below is my script : {code} register myanotherudf.jar; A = load 'myinput' using PigStorage() as ( date:chararray,bcookie:chararray,count:int,avg:double,pvs:int); B = foreach A generate (int)(avg / 100.0) * 100 as avg, pvs; C = group B by ( avg ); D = foreach C { Pvs = order B by pvs; Const = org.vivek.MyAnotherUDF(Pvs.pvs).(count,sum); generate Const.sum as sum; }; store D into 'out_D'; {code} The script is failing during compilation of the plan. The usage of the udf inside the foreach is causing the problem. The udf implements algebraic and the output schema is also defined. The below is the exception that I get : ERROR 2042: Error in new logical plan. Try -Dpig.usenewlogicalplan=false. org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2042: Error in new logical plan. Try -Dpig.usenewlogicalplan=false. 
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:309) at org.apache.pig.PigServer.compilePp(PigServer.java:1364) at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1206) at org.apache.pig.PigServer.execute(PigServer.java:1200) at org.apache.pig.PigServer.access$100(PigServer.java:128) at org.apache.pig.PigServer$Graph.execute(PigServer.java:1527) at org.apache.pig.PigServer.executeBatchEx(PigServer.java:372) at org.apache.pig.PigServer.executeBatch(PigServer.java:339) at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:112) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:169) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:141) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90) at org.apache.pig.Main.run(Main.java:500) at org.apache.pig.Main.main(Main.java:107) Caused by: java.lang.NullPointerException at org.apache.pig.newplan.ReverseDependencyOrderWalker.walk(ReverseDependencyOrderWalker.java:70) at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50) at org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:105) at org.apache.pig.newplan.logical.relational.LOGenerate.accept(LOGenerate.java:229) at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75) at org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:94) at org.apache.pig.newplan.logical.relational.LOForEach.accept(LOForEach.java:71) at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75) at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:261) ... 13 more When I turn off the new logical plan, the script executes successfully. The issue is observed in both 0.8 and 0.9.
[jira] Updated: (PIG-1842) Improve Scalability of the XMLLoader for large datasets such as wikipedia
[ https://issues.apache.org/jira/browse/PIG-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vivek Padmanabhan updated PIG-1842: --- Status: Patch Available (was: In Progress) Improve Scalability of the XMLLoader for large datasets such as wikipedia - Key: PIG-1842 URL: https://issues.apache.org/jira/browse/PIG-1842 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.8.0, 0.7.0, 0.9.0 Reporter: Viraj Bhat Assignee: Vivek Padmanabhan Fix For: 0.9.0, 0.8.0, 0.7.0 Attachments: PIG-1842_1.patch, PIG-1842_2.patch The current XMLLoader for Pig does not work well for large datasets such as the wikipedia dataset. Each mapper reads in the entire XML file, resulting in extremely slow run times. Viraj
[jira] Created: (PIG-1850) Order by is failing with ClassCastException if schema is undefined for new logical plan in 0.8
Order by is failing with ClassCastException if schema is undefined for new logical plan in 0.8 -- Key: PIG-1850 URL: https://issues.apache.org/jira/browse/PIG-1850 Project: Pig Issue Type: Bug Affects Versions: 0.8.0, 0.9.0 Reporter: Vivek Padmanabhan The below is the script : A = load 'input' ; B = group A all; C = foreach B generate SUM($1.$0); C1 = CROSS A,C; D = foreach C1 generate ROUND($0*1.0/$2)/100.0, $1; E = order D by $0 desc; store E into 'out1'; input (tab separated fields) 26 A 1349595 B 235693 C Exception java.lang.ClassCastException: org.apache.pig.impl.io.NullableDoubleWritable cannot be cast to org.apache.pig.impl.io.NullableBytesWritable at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigBytesRawComparator.compare(PigBytesRawComparator.java:94) at java.util.Arrays.binarySearch0(Arrays.java:2105) at java.util.Arrays.binarySearch(Arrays.java:2043) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:72) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:52) at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:602) at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:116) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:238) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:676) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:336) at 
org.apache.hadoop.mapred.Child$4.run(Child.java:242) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059) at org.apache.hadoop.mapred.Child.main(Child.java:236) The script is failing while doing the order by in WeightedRangePartitioner, since it considers the quantiles to be NullableBytesWritable but at run time this is NullableDoubleWritable. This is happening because there is no schema defined in the load statement. But the same works fine when multiquery is turned off. One more issue worth noting is that if I have a filter statement after relation E, then the above exception is swallowed by Pig. This makes debugging really hard.
[jira] Created: (PIG-1848) Confusing statement for Merge Join - Both Conditions in Pig reference manual1
Confusing statement for Merge Join - Both Conditions in Pig reference manual1 -- Key: PIG-1848 URL: https://issues.apache.org/jira/browse/PIG-1848 Project: Pig Issue Type: Improvement Components: documentation Reporter: Vivek Padmanabhan In the Pig reference manual, http://pig.apache.org/docs/r0.8.0/piglatin_ref1.html#Merge+Joins, for merge join under Both Conditions, the example statement is confusing. {quote} Both Conditions For optimal performance, each part file of the left (sorted) input of the join should have a size of at least 1 hdfs block size (for example if the hdfs block size is 128 MB, each part file should be less than 128 MB). {quote}
[jira] Commented: (PIG-1848) Confusing statement for Merge Join - Both Conditions in Pig reference manual1
[ https://issues.apache.org/jira/browse/PIG-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12992943#comment-12992943 ] Vivek Padmanabhan commented on PIG-1848: Please consider refining the Both Conditions section; the example first asks for part files of at least one HDFS block, then the parenthetical says each part file should be less than 128 MB, which contradicts it. Confusing statement for Merge Join - Both Conditions in Pig reference manual1 -- Key: PIG-1848 URL: https://issues.apache.org/jira/browse/PIG-1848 Project: Pig Issue Type: Improvement Components: documentation Reporter: Vivek Padmanabhan In the Pig reference manual, http://pig.apache.org/docs/r0.8.0/piglatin_ref1.html#Merge+Joins, for merge join under Both Conditions, the example statement is confusing. {quote} Both Conditions For optimal performance, each part file of the left (sorted) input of the join should have a size of at least 1 hdfs block size (for example if the hdfs block size is 128 MB, each part file should be less than 128 MB). {quote}
[jira] Updated: (PIG-1842) Improve Scalability of the XMLLoader for large datasets such as wikipedia
[ https://issues.apache.org/jira/browse/PIG-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vivek Padmanabhan updated PIG-1842: --- Attachment: PIG-1842_2.patch Attaching the patch again. Improve Scalability of the XMLLoader for large datasets such as wikipedia - Key: PIG-1842 URL: https://issues.apache.org/jira/browse/PIG-1842 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0, 0.8.0, 0.9.0 Reporter: Viraj Bhat Assignee: Vivek Padmanabhan Fix For: 0.7.0, 0.8.0, 0.9.0 Attachments: PIG-1842_1.patch, PIG-1842_2.patch The current XMLLoader for Pig does not work well for large datasets such as the wikipedia dataset. Each mapper reads in the entire XML file, resulting in extremely slow run times. Viraj
[jira] Commented: (PIG-1842) Improve Scalability of the XMLLoader for large datasets such as wikipedia
[ https://issues.apache.org/jira/browse/PIG-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991832#comment-12991832 ] Vivek Padmanabhan commented on PIG-1842: The below are some of the issues addressed in the patch: a) Marking the loader as splittable except for gz formats b) Changing XMLLoader to read per split rather than the entire file c) Handling scenarios regarding split/record boundaries d) Using CBZip2InputStream to handle bzip2 files e) An improvement on the logic of collectTag (i.e., skip unnecessary reads to find an end tag if no start tags are found) Manual tests for scalability and functional verification were done for the patch. Using the latest wikipedia dump in bz2 format (contains 10861606 pages; 6.5 GB bz2), the new loader completed within 3 minutes, while the older version took more than 35 minutes for a simple load-filter null-store script. Improve Scalability of the XMLLoader for large datasets such as wikipedia - Key: PIG-1842 URL: https://issues.apache.org/jira/browse/PIG-1842 Project: Pig Issue Type: Improvement Reporter: Viraj Bhat Assignee: Vivek Padmanabhan Attachments: PIG-1842_1.patch The current XMLLoader for Pig does not work well for large datasets such as the wikipedia dataset. Each mapper reads in the entire XML file, resulting in extremely slow run times. Viraj
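The split-boundary handling described in items (b) and (c) can be sketched roughly as follows. This is a hedged illustration only; the class and method names are invented for the example and are not the patch's actual code. The ownership rule it shows: a record belongs to the split in which its start tag begins, and the reader may scan past the split end to finish the last record it owns.

```java
import java.util.ArrayList;
import java.util.List;

public class XmlSplitSketch {
    // A record is owned by this split only if its start tag begins before
    // splitEnd; the reader may look past splitEnd to reach the matching
    // end tag of the last owned record.
    public static List<String> recordsForSplit(String data, String tag,
                                               int splitStart, int splitEnd) {
        List<String> out = new ArrayList<>();
        String open = "<" + tag + ">", close = "</" + tag + ">";
        int pos = splitStart;
        while (true) {
            int s = data.indexOf(open, pos);
            if (s < 0 || s >= splitEnd) break;   // start tag outside this split
            int e = data.indexOf(close, s);
            if (e < 0) break;                    // truncated record at EOF
            out.add(data.substring(s, e + close.length()));
            pos = e + close.length();
        }
        return out;
    }

    public static void main(String[] args) {
        String xml = "<page>a</page><page>b</page><page>c</page>";
        // A split ending mid-way through the second record still owns that
        // record, because its start tag lies before the split end.
        System.out.println(recordsForSplit(xml, "page", 0, 20));
    }
}
```

With this ownership rule, each record is read by exactly one mapper, so nothing is duplicated or lost at split boundaries.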
[jira] Updated: (PIG-1842) Improve Scalability of the XMLLoader for large datasets such as wikipedia
[ https://issues.apache.org/jira/browse/PIG-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vivek Padmanabhan updated PIG-1842: --- Attachment: PIG-1842_1.patch Attaching an initial patch. Improve Scalability of the XMLLoader for large datasets such as wikipedia - Key: PIG-1842 URL: https://issues.apache.org/jira/browse/PIG-1842 Project: Pig Issue Type: Improvement Reporter: Viraj Bhat Assignee: Vivek Padmanabhan Attachments: PIG-1842_1.patch The current XMLLoader for Pig does not work well for large datasets such as the wikipedia dataset. Each mapper reads in the entire XML file, resulting in extremely slow run times. Viraj
[jira] Updated: (PIG-1839) piggybank: XMLLoader will always add an extra empty tuple even if no tags are matched
[ https://issues.apache.org/jira/browse/PIG-1839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vivek Padmanabhan updated PIG-1839: --- Attachment: PIG-1839-1.patch Attaching the initial patch. Please note that I have modified the existing test case to assert for the correct number of tuples. piggybank: XMLLoader will always add an extra empty tuple even if no tags are matched - Key: PIG-1839 URL: https://issues.apache.org/jira/browse/PIG-1839 Project: Pig Issue Type: Bug Affects Versions: 0.7.0, 0.8.0, 0.9.0 Reporter: Vivek Padmanabhan Assignee: Vivek Padmanabhan Attachments: PIG-1839-1.patch The XMLLoader in piggybank always adds an empty tuple, which has to be filtered out every time. Instead, the loader itself could do this. Consider the below script : a= load 'a.xml' using org.apache.pig.piggybank.storage.XMLLoader('name'); dump a; b= filter a by $0 is not null; dump b; The output of the first dump is : (<name>foobar</name>) (<name>foo</name>) (<name>justname</name>) () The output of the second dump is : (<name>foobar</name>) (<name>foo</name>) (<name>justname</name>) Another case: even if there is no matching tag, the loader will still generate the empty tuple.
[jira] Updated: (PIG-1839) piggybank: XMLLoader will always add an extra empty tuple even if no tags are matched
[ https://issues.apache.org/jira/browse/PIG-1839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vivek Padmanabhan updated PIG-1839: --- Patch Info: [Patch Available] piggybank: XMLLoader will always add an extra empty tuple even if no tags are matched - Key: PIG-1839 URL: https://issues.apache.org/jira/browse/PIG-1839 Project: Pig Issue Type: Bug Affects Versions: 0.7.0, 0.8.0, 0.9.0 Reporter: Vivek Padmanabhan Assignee: Vivek Padmanabhan Attachments: PIG-1839-1.patch The XMLLoader in piggybank always adds an empty tuple, which has to be filtered out every time. Instead, the loader itself could do this. Consider the below script : a= load 'a.xml' using org.apache.pig.piggybank.storage.XMLLoader('name'); dump a; b= filter a by $0 is not null; dump b; The output of the first dump is : (<name>foobar</name>) (<name>foo</name>) (<name>justname</name>) () The output of the second dump is : (<name>foobar</name>) (<name>foo</name>) (<name>justname</name>) Another case: even if there is no matching tag, the loader will still generate the empty tuple.
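The behaviour change this issue asks for can be illustrated with a minimal sketch. The names below are invented for the example and are not the piggybank source: the buggy getNext() produced one trailing empty tuple once input was exhausted, while the fixed version signals end-of-data instead, so no `()` appears in the output.

```java
import java.util.Iterator;
import java.util.List;

public class EmptyTupleSketch {
    // Buggy shape: at end of input, emit an empty tuple (shown in the
    // report as a bare "()" after the real records).
    public static String buggyNext(Iterator<String> matches) {
        return matches.hasNext() ? matches.next() : "";
    }

    // Fixed shape: signal end-of-data with null instead of a tuple.
    public static String fixedNext(Iterator<String> matches) {
        return matches.hasNext() ? matches.next() : null;
    }

    public static void main(String[] args) {
        Iterator<String> it = List.of("<name>foo</name>").iterator();
        System.out.println(fixedNext(it)); // the real record
        System.out.println(fixedNext(it)); // null: end of data, no ()
    }
}
```

With the fixed behaviour the `filter a by $0 is not null` workaround in the script above becomes unnecessary.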
[jira] Created: (PIG-1835) Pig 0.9 new logical plan throws class cast exception
Pig 0.9 new logical plan throws class cast exception Key: PIG-1835 URL: https://issues.apache.org/jira/browse/PIG-1835 Project: Pig Issue Type: Bug Affects Versions: 0.9.0 Reporter: Vivek Padmanabhan I have the below script which is throwing class cast exception while doing SUM. Even though all the fields are properly typed, while computing sum in m_agg0 and m_agg02 the record from tuple is coming as java.lang.Long instead of Double. The problem is happening in Pig 0.9. It works fine with 0.9 if I flag off new logical plan by -Dpig.usenewlogicalplan=false. {code} A0 = load 'inputA' using PigStorage('\t') as ( group_id, r_id:long, is_phase2:int, roi_value:double,roi_cost:double,ecpm, prob:double,pixel_id, pixel_type, val:long,f3, f4,type:long, amount:double,item_id:long); A0 = foreach A0 generate r_id, is_phase2, ((val==257 or val==258)? 1: 0) as imps, ((val==257 or val==258)? amount: 0.0) as a_out, ((val==257 or val==258)? item_id: 0) as a_item_id, ((val==257 or val==258)? roi_value: 0.0) as roi_value,((val==257 or val==258)? roi_cost: 0.0) as roi_cost, ((val==257 or val==513)? ecpm: 0.0) as ecpm, ((val==257 or val==513)? prob: 0.0) as prob, ((val==257 or val==513)? amount: 0.0) as pub_rev, ((val==257 or val==513)? item_id: 0) as pub_line_id,((val==257 or val==513)? type: 0) as pub_pt; - B0 = load 'inputB' using PigStorage('\t') as ( group_id:long, r_id:long, roi_value:double,roi_cost:double,receive_time, host_name,site_id,rm_has_cookies,rm_pearl_id, f1,f2,pixel_id:long,pixel_type:int, xcookie,val:long,f3, f4,type:long,amount:double,item_id:long); B0 = foreach B0 generate r_id, ((val==257 or val==258)? 1: 0) as B0,((val==257 or val==258)? amount: 0.0) as a_out, ((val==257 or val==258)? item_id: 0) as a_item_id,((val==257 or val==258)? roi_value: 0.0) as roi_value, ((val==257 or val==258)? roi_cost: 0.0) as roi_cost, ((val==257 or val==513)? amount: 0.0) as pub_rev, ((val==257 or val==513)? item_id: 0) as pub_line_id, ((val==257 or val==513)? 
type: 0) as pub_pt; C0 = load 'inputC' using PigStorage('\t') as ( group_id:long, r_id:long, roi_value:double, roi_cost:double, receive_time:long, host_name:chararray, site_id:long, rm_has_cookies:int,rm_pearl_id:long,f1,f2, pixel_id:long, pixel_type:int,rm_is_post_click:int, rm_conversion_id,xcookie:chararray,val:long,f3:long,f4:long,type:long,amount:double,item_id:long); C0 = foreach C0 generate r_id,((val==257 or val==258)? 1: 0) as C0, ((val==257 or val==258)? amount: 0.0) as a_out, ((val==257 or val==258)? item_id: 0) as a_item_id,((val==257 or val==513)? amount: 0.0) as pub_rev, ((val==257 or val==513)? item_id: 0) as pub_line_id, ((val==257 or val==513)? type: 0) as pub_pt; m_all = cogroup A0 by (r_id) outer, B0 by (r_id) outer, C0 by (r_id) outer ; m_agg01 = foreach m_all generate (double)(IsEmpty(C0) ? 0.0 : SUM(C0.pub_rev)) as conv_pub_rev; store m_agg01 into 'out1' USING PigStorage(','); m_all = cogroup A0 by (r_id) outer, B0 by (r_id) outer, C0 by (r_id) outer ; m_agg02 = foreach m_all generate (double)(IsEmpty(C0) ? 0.0 : SUM(C0.pub_rev)) as conv_pub_rev; store m_agg02 into 'out2' USING PigStorage(','); {code} The below are the inputs to the script (all single record and tab separated) inputA -- 1 1.1 1.1 1.1 1.1 1 1.1 inputB -- 1.1 1.1 a1 1 b1 1 1 c1 1.1 inputC -- 1.1 1.1 a1 1 b1 1 1 1 c1 1.1 Exception from the reducers: org.apache.pig.backend.executionengine.ExecException: ERROR 2103: Problem while computing sum of doubles. at org.apache.pig.builtin.DoubleSum.sum(DoubleSum.java:147) at org.apache.pig.builtin.DoubleSum.exec(DoubleSum.java:46) at org.apache.pig.builtin.DoubleSum.exec(DoubleSum.java:41) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:230) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:302)
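The failure mode reported here can be reproduced in isolation with a small sketch. This is not Pig's actual DoubleSum code, just an illustration of its shape: a field the schema declares as double arrives at the aggregate as a java.lang.Long, and an unchecked cast to Double then fails with the ClassCastException behind ERROR 2103.

```java
public class SumCastSketch {
    // Illustrative analogue of summing a bag of doubles: each value is
    // cast to Double, which throws if the plan delivered a Long instead.
    public static double sumDoubles(Object[] values) {
        double total = 0.0;
        for (Object v : values) {
            total += (Double) v;   // ClassCastException when v is a Long
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumDoubles(new Object[]{1.1, 2.2}));  // fine
        try {
            sumDoubles(new Object[]{1.1, 1L});  // the bug's shape: Long slips in
        } catch (ClassCastException e) {
            System.out.println("ClassCastException, as in ERROR 2103");
        }
    }
}
```

This matches the report: the logical plan, not the data, delivers the wrongly typed value, which is why disabling the new logical plan avoids it.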
[jira] Created: (PIG-1831) Variation in output while using streaming udfs in local mode
Variation in output while using streaming udfs in local mode Key: PIG-1831 URL: https://issues.apache.org/jira/browse/PIG-1831 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Vivek Padmanabhan The below script when run in local mode gives me a different output. It looks like in local mode I have to store a relation obtained through streaming in order to use it afterwards. For example consider the below script : {code:lang=scala|title=} DEFINE MySTREAMUDF `test.sh`; A = LOAD 'myinput' USING PigStorage() AS (myId:chararray, data2, data3,data4 ); B = STREAM A THROUGH MySTREAMUDF AS (wId:chararray, num:int); --STORE B into 'output.B'; C = JOIN B by wId LEFT OUTER, A by myId; D = FOREACH C GENERATE B::wId,B::num,data4 ; D = STREAM D THROUGH MySTREAMUDF AS (f1:chararray,f2:int); --STORE D into 'output.D'; E = foreach B GENERATE wId,num; F = DISTINCT E; G = GROUP F ALL; H = FOREACH G GENERATE COUNT_STAR(F) as TotalCount; I = CROSS D,H; STORE I into 'output.I'; {code} {code:lang=scala|title=test.sh} #!/bin/bash cut -f1,3 {code} And the input is abcdlabel1 11 feature1 acbdlabel2 22 feature2 adbclabel3 33 feature3 Here if I store relations B and D, then every time I get the result : acbd3 abcd3 adbc3 But if I don't store relations B and D, then I get an empty output. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1831) Variation in output while using streaming udfs in local mode
[ https://issues.apache.org/jira/browse/PIG-1831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vivek Padmanabhan updated PIG-1831: --- Description: The below script when run in local mode gives me a different output. It looks like in local mode I have to store a relation obtained through streaming in order to use it afterwards. For example consider the below script : DEFINE MySTREAMUDF `test.sh`; A = LOAD 'myinput' USING PigStorage() AS (myId:chararray, data2, data3,data4 ); B = STREAM A THROUGH MySTREAMUDF AS (wId:chararray, num:int); --STORE B into 'output.B'; C = JOIN B by wId LEFT OUTER, A by myId; D = FOREACH C GENERATE B::wId,B::num,data4 ; D = STREAM D THROUGH MySTREAMUDF AS (f1:chararray,f2:int); --STORE D into 'output.D'; E = foreach B GENERATE wId,num; F = DISTINCT E; G = GROUP F ALL; H = FOREACH G GENERATE COUNT_STAR(F) as TotalCount; I = CROSS D,H; STORE I into 'output.I'; #!/bin/bash cut -f1,3 And the input is abcdlabel1 11 feature1 acbdlabel2 22 feature2 adbclabel3 33 feature3 Here if I store relations B and D, then every time I get the result : acbd3 abcd3 adbc3 But if I don't store relations B and D, then I get an empty output. was: The below script when run in local mode gives me a different output. It looks like in local mode I have to store a relation obtained through streaming in order to use it afterwards.
For example consider the below script : {code:lang=scala|title=} DEFINE MySTREAMUDF `test.sh`; A = LOAD 'myinput' USING PigStorage() AS (myId:chararray, data2, data3,data4 ); B = STREAM A THROUGH MySTREAMUDF AS (wId:chararray, num:int); --STORE B into 'output.B'; C = JOIN B by wId LEFT OUTER, A by myId; D = FOREACH C GENERATE B::wId,B::num,data4 ; D = STREAM D THROUGH MySTREAMUDF AS (f1:chararray,f2:int); --STORE D into 'output.D'; E = foreach B GENERATE wId,num; F = DISTINCT E; G = GROUP F ALL; H = FOREACH G GENERATE COUNT_STAR(F) as TotalCount; I = CROSS D,H; STORE I into 'output.I'; {code} {code:lang=scala|title=test.sh} #!/bin/bash cut -f1,3 {code} And the input is abcdlabel1 11 feature1 acbdlabel2 22 feature2 adbclabel3 33 feature3 Here if I store relations B and D, then every time I get the result : acbd3 abcd3 adbc3 But if I don't store relations B and D, then I get an empty output. Variation in output while using streaming udfs in local mode Key: PIG-1831 URL: https://issues.apache.org/jira/browse/PIG-1831 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Vivek Padmanabhan The below script when run in local mode gives me a different output. It looks like in local mode I have to store a relation obtained through streaming in order to use it afterwards.
For example consider the below script : DEFINE MySTREAMUDF `test.sh`; A = LOAD 'myinput' USING PigStorage() AS (myId:chararray, data2, data3,data4 ); B = STREAM A THROUGH MySTREAMUDF AS (wId:chararray, num:int); --STORE B into 'output.B'; C = JOIN B by wId LEFT OUTER, A by myId; D = FOREACH C GENERATE B::wId,B::num,data4 ; D = STREAM D THROUGH MySTREAMUDF AS (f1:chararray,f2:int); --STORE D into 'output.D'; E = foreach B GENERATE wId,num; F = DISTINCT E; G = GROUP F ALL; H = FOREACH G GENERATE COUNT_STAR(F) as TotalCount; I = CROSS D,H; STORE I into 'output.I'; #!/bin/bash cut -f1,3 And the input is abcdlabel1 11 feature1 acbdlabel2 22 feature2 adbclabel3 33 feature3 Here if I store relations B and D, then every time I get the result : acbd3 abcd3 adbc3 But if I don't store relations B and D, then I get an empty output.
[jira] Updated: (PIG-1561) XMLLoader in Piggybank does not support bz2 or gzip compressed XML files
[ https://issues.apache.org/jira/browse/PIG-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vivek Padmanabhan updated PIG-1561: --- Attachment: PIG-1561-1.patch Attaching an initial patch for the issue. Please review. XMLLoader in Piggybank does not support bz2 or gzip compressed XML files Key: PIG-1561 URL: https://issues.apache.org/jira/browse/PIG-1561 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0, 0.8.0 Reporter: Viraj Bhat Assignee: Vivek Padmanabhan Attachments: PIG-1561-1.patch I have a simple Pig script which uses the XMLLoader after the Piggybank is built. {code} register piggybank.jar; A = load '/user/viraj/capacity-scheduler.xml.gz' using org.apache.pig.piggybank.storage.XMLLoader('property') as (docs:chararray); B = limit A 1; dump B; --store B into '/user/viraj/handlegz' using PigStorage(); {code} This returns an empty tuple {code} () {code} If you supply the uncompressed XML file, you get {code} (<property> <name>mapred.capacity-scheduler.queue.my.capacity</name> <value>10</value> <description>Percentage of the number of slots in the cluster that are guaranteed to be available for jobs in this queue. </description> </property>) {code}
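The general shape of a fix is to choose a decompressing stream from the file name before the XML record reader sees any bytes. The sketch below is an illustration under stated assumptions, not the patch itself: the class and method names are invented, and a real Hadoop input format would typically resolve codecs via CompressionCodecFactory rather than extension checks in the loader.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CompressedOpenSketch {
    // Wrap the raw byte stream in a decompressor chosen by extension.
    public static InputStream open(String path, InputStream raw) throws IOException {
        if (path.endsWith(".gz")) {
            return new GZIPInputStream(raw);  // gzip: not splittable, one reader
        }
        // bz2 would be handled similarly (PIG-1842 mentions CBZip2InputStream)
        return raw;
    }

    public static void main(String[] args) throws IOException {
        // Simulate a gzipped XML file in memory, then read it back.
        byte[] xml = "<property>x</property>".getBytes("UTF-8");
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (OutputStream gz = new GZIPOutputStream(buf)) { gz.write(xml); }
        InputStream in = open("capacity-scheduler.xml.gz",
                              new ByteArrayInputStream(buf.toByteArray()));
        System.out.println(new BufferedReader(
                new InputStreamReader(in, "UTF-8")).readLine());
    }
}
```

Without such wrapping, the loader scans raw gzip bytes for start tags, finds none, and emits only the empty tuple seen in the report.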
[jira] Created: (PIG-1779) Worng stats shown when there are multiple loads but same file names
Wrong stats shown when there are multiple loads but same file names --- Key: PIG-1779 URL: https://issues.apache.org/jira/browse/PIG-1779 Project: Pig Issue Type: Bug Components: tools Affects Versions: 0.8.0 Reporter: Vivek Padmanabhan In Pig 0.8, the stats show wrong information whenever I have multiple loads and the file names are similar. a) Problem 1 Sample Script : A = LOAD 'myfolder/tryme' AS (f1); B = LOAD 'myfolder/anotherfolder/tryme' AS (f2); C = JOIN A BY f1, B BY f2; DUMP C; Here I have 10 records for A and 3 records for B, but pig says Successfully read 6 records from: nn/myfolder/anotherfolder/tryme Successfully read 6 records from: nn/myfolder/tryme b) Problem 2 A = LOAD 'myfolder/tryme' AS (f1); B = LOAD 'myfolder/anotherfolder/tryme' AS (f2); C = JOIN A BY f1, B BY f2; DUMP C; Here there is no folder named anotherfolder while myfolder/tryme exists. But pig says Failed to read data from nn/myfolder/anotherfolder/tryme Failed to read data from nn/myfolder/tryme