[jira] Updated: (PIG-1354) UDFs for dynamic invocation of simple Java methods
[ https://issues.apache.org/jira/browse/PIG-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1354: --- Resolution: Fixed Status: Resolved (was: Patch Available) Committed. UDFs for dynamic invocation of simple Java methods -- Key: PIG-1354 URL: https://issues.apache.org/jira/browse/PIG-1354 Project: Pig Issue Type: New Feature Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1354.patch, PIG-1354.patch, PIG-1354.patch The need to create wrapper UDFs for simple Java functions creates unnecessary work for Pig users, slows down the development process, and produces a lot of trivial classes. We can use Java's reflection to invoke a number of methods on the fly by creating a generic UDF for this purpose. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
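The reflection idea can be sketched in plain Java (a hedged illustration only: the class and method names below are hypothetical and are not the UDFs committed under PIG-1354, which wrap this pattern in Pig's EvalFunc and handle Tuples):

```java
import java.lang.reflect.Method;

// Hypothetical sketch of a reflection-based generic invoker: resolve a
// static Java method once by name, then call it per input value.
public class DynamicInvoker {
    private final Method method; // resolved once, reused on every call

    public DynamicInvoker(String className, String methodName, Class<?>... paramTypes)
            throws Exception {
        // Resolve the target up front so per-call overhead is a single invoke()
        this.method = Class.forName(className).getMethod(methodName, paramTypes);
    }

    public Object invoke(Object... args) throws Exception {
        return method.invoke(null, args); // null receiver: static methods only
    }

    public static void main(String[] args) throws Exception {
        DynamicInvoker valueOf =
                new DynamicInvoker("java.lang.String", "valueOf", long.class);
        System.out.println(valueOf.invoke(42L)); // prints 42
    }
}
```

Caching the resolved Method is the design point: reflection lookup is expensive, but a cached Method.invoke() per record is cheap enough for a UDF inner loop.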
[jira] Commented: (PIG-1150) VAR() Variance UDF
[ https://issues.apache.org/jira/browse/PIG-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854487#action_12854487 ] Dmitriy V. Ryaboy commented on PIG-1150: Jay, there may be -- I only glanced at the code here. The real problem is this: http://planetmath.org/encyclopedia/OnePassAlgorithmToComputeSampleVariance.html -- you are going to get round-off errors, and possibly overflow errors, using this approach. Thanks for reminding me that I promised this, I'll work on open-sourcing our code. -Dmitriy VAR() Variance UDF -- Key: PIG-1150 URL: https://issues.apache.org/jira/browse/PIG-1150 Project: Pig Issue Type: New Feature Affects Versions: 0.5.0 Environment: UDF, written in Pig 0.5 contrib/ Reporter: Russell Jurney Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: var.patch I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates variance in a distributed manner, based on the AVG() builtin. It works by calculating the count, sum and sum of squares, as described here: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm Is this a worthwhile contribution? Taking the square root of this value using the contrib SQRT() function gives Standard Deviation, which is missing from Pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
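The parallel scheme described above can be sketched as follows (an illustration of the count/sum/sum-of-squares algorithm, not the attached var.patch; as Dmitriy's comment warns, the final E[x^2] - E[x]^2 subtraction is exactly where round-off and overflow bite, which is why a Welford/Chan-style update is preferable in practice):

```java
// Sketch of the textbook "count, sum, sum of squares" parallel variance.
public class ParallelVariance {
    // Per-mapper partial aggregate: (count, sum, sum of squares)
    public static double[] partial(double[] xs) {
        double n = 0, sum = 0, sumSq = 0;
        for (double x : xs) { n++; sum += x; sumSq += x * x; }
        return new double[] { n, sum, sumSq };
    }

    // Combiner/reducer step: partials simply add componentwise,
    // which is what makes the UDF Algebraic.
    public static double[] combine(double[] a, double[] b) {
        return new double[] { a[0] + b[0], a[1] + b[1], a[2] + b[2] };
    }

    // Population variance E[x^2] - E[x]^2: the numerically risky step
    public static double variance(double[] p) {
        double mean = p[1] / p[0];
        return p[2] / p[0] - mean * mean;
    }

    public static void main(String[] args) {
        double[] left = partial(new double[] { 1, 2 });
        double[] right = partial(new double[] { 3, 4 });
        System.out.println(variance(combine(left, right))); // 1.25 for {1,2,3,4}
    }
}
```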
[jira] Updated: (PIG-1359) bin/pig script does not pick up correct jar libraries
[ https://issues.apache.org/jira/browse/PIG-1359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gianmarco De Francisci Morales updated PIG-1359: Description: The bin/pig script tries to load pig jar libraries from the pig-*-core.jar using this bash fragment {code} # for releases, add core pig to CLASSPATH for f in $PIG_HOME/pig-*core.jar; do CLASSPATH=${CLASSPATH}:$f; done # during development pig jar might be in build for f in $PIG_HOME/build/pig-*-core.jar; do CLASSPATH=${CLASSPATH}:$f; done {code} The pig-*-core.jar does not contain the dependencies for pig that are found in build/ivy/lib/Pig/*.jar (jline). The script does not even pick up the pig.jar in PIG_HOME that is produced as a result of the ant build process. This results in the following error after successfully building pig: {code} Exception in thread main java.lang.NoClassDefFoundError: jline/ConsoleReaderInputStream Caused by: java.lang.ClassNotFoundException: jline.ConsoleReaderInputStream {code} was: The bin/pig script tries to load pig jar libraries from the pig-*-core.jar using this bash fragment {code:bash} # for releases, add core pig to CLASSPATH for f in $PIG_HOME/pig-*core.jar; do CLASSPATH=${CLASSPATH}:$f; done # during development pig jar might be in build for f in $PIG_HOME/build/pig-*-core.jar; do CLASSPATH=${CLASSPATH}:$f; done {code} The pig-*-core.jar does not contain the dependencies for pig that are found in build/ivy/lib/Pig/*.jar (jline). The script does not even pick up the pig.jar in PIG_HOME that is produced as a result of the ant build process. 
This results in the following error after successfully building pig: {code} Exception in thread main java.lang.NoClassDefFoundError: jline/ConsoleReaderInputStream Caused by: java.lang.ClassNotFoundException: jline.ConsoleReaderInputStream {code} bin/pig script does not pick up correct jar libraries - Key: PIG-1359 URL: https://issues.apache.org/jira/browse/PIG-1359 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Environment: Linux Ubuntu 8.10, java-6-sun Reporter: Gianmarco De Francisci Morales Priority: Trivial The bin/pig script tries to load pig jar libraries from the pig-*-core.jar using this bash fragment {code} # for releases, add core pig to CLASSPATH for f in $PIG_HOME/pig-*core.jar; do CLASSPATH=${CLASSPATH}:$f; done # during development pig jar might be in build for f in $PIG_HOME/build/pig-*-core.jar; do CLASSPATH=${CLASSPATH}:$f; done {code} The pig-*-core.jar does not contain the dependencies for pig that are found in build/ivy/lib/Pig/*.jar (jline). The script does not even pick up the pig.jar in PIG_HOME that is produced as a result of the ant build process. This results in the following error after successfully building pig: {code} Exception in thread main java.lang.NoClassDefFoundError: jline/ConsoleReaderInputStream Caused by: java.lang.ClassNotFoundException: jline.ConsoleReaderInputStream {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
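One possible shape of a fix, based on the paths named in the report, is to also scan pig.jar from the ant build and the ivy-resolved dependency jars. This is a sketch only (the directory layout is assumed from the description above, and this is not the committed change):

```shell
# Sketch: besides pig-*-core.jar, also pick up pig.jar from the ant build
# and the ivy-resolved dependency jars (jline etc.). Paths are assumptions
# taken from the report, not the actual fix.
build_classpath() {
    PIG_HOME=$1
    CLASSPATH=${CLASSPATH:-}
    for f in "$PIG_HOME"/pig-*core.jar "$PIG_HOME"/pig.jar \
             "$PIG_HOME"/build/pig-*-core.jar \
             "$PIG_HOME"/build/ivy/lib/Pig/*.jar; do
        # unmatched globs stay literal in POSIX sh, so test for existence first
        if [ -e "$f" ]; then CLASSPATH=${CLASSPATH}:$f; fi
    done
    echo "$CLASSPATH"
}
```

The existence test also fixes a quieter bug in the original fragment: when no jar matches, the literal glob string itself would be appended to CLASSPATH.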
[jira] Updated: (PIG-1359) bin/pig script does not pick up correct jar libraries
[ https://issues.apache.org/jira/browse/PIG-1359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gianmarco De Francisci Morales updated PIG-1359: Description: The bin/pig script tries to load pig jar libraries from the pig-*-core.jar using this bash fragment {code} # for releases, add core pig to CLASSPATH for f in $PIG_HOME/pig-*core.jar; do CLASSPATH=${CLASSPATH}:$f; done # during development pig jar might be in build for f in $PIG_HOME/build/pig-*-core.jar; do CLASSPATH=${CLASSPATH}:$f; done {code} The pig-\*-core.jar does not contain the dependencies for pig that are found in build/ivy/lib/Pig/\*.jar (jline). The script does not even pick up the pig.jar in PIG_HOME that is produced as a result of the ant build process. This results in the following error after successfully building pig: {code} Exception in thread main java.lang.NoClassDefFoundError: jline/ConsoleReaderInputStream Caused by: java.lang.ClassNotFoundException: jline.ConsoleReaderInputStream {code} was: The bin/pig script tries to load pig jar libraries from the pig-*-core.jar using this bash fragment {code} # for releases, add core pig to CLASSPATH for f in $PIG_HOME/pig-*core.jar; do CLASSPATH=${CLASSPATH}:$f; done # during development pig jar might be in build for f in $PIG_HOME/build/pig-*-core.jar; do CLASSPATH=${CLASSPATH}:$f; done {code} The pig-*-core.jar does not contain the dependencies for pig that are found in build/ivy/lib/Pig/*.jar (jline). The script does not even pick up the pig.jar in PIG_HOME that is produced as a result of the ant build process. 
This results in the following error after successfully building pig: {code} Exception in thread main java.lang.NoClassDefFoundError: jline/ConsoleReaderInputStream Caused by: java.lang.ClassNotFoundException: jline.ConsoleReaderInputStream {code} bin/pig script does not pick up correct jar libraries - Key: PIG-1359 URL: https://issues.apache.org/jira/browse/PIG-1359 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Environment: Linux Ubuntu 8.10, java-6-sun Reporter: Gianmarco De Francisci Morales Priority: Trivial The bin/pig script tries to load pig jar libraries from the pig-*-core.jar using this bash fragment {code} # for releases, add core pig to CLASSPATH for f in $PIG_HOME/pig-*core.jar; do CLASSPATH=${CLASSPATH}:$f; done # during development pig jar might be in build for f in $PIG_HOME/build/pig-*-core.jar; do CLASSPATH=${CLASSPATH}:$f; done {code} The pig-\*-core.jar does not contain the dependencies for pig that are found in build/ivy/lib/Pig/\*.jar (jline). The script does not even pick up the pig.jar in PIG_HOME that is produced as a result of the ant build process. This results in the following error after successfully building pig: {code} Exception in thread main java.lang.NoClassDefFoundError: jline/ConsoleReaderInputStream Caused by: java.lang.ClassNotFoundException: jline.ConsoleReaderInputStream {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1351) [Zebra] No type check when we write to the basic table
[ https://issues.apache.org/jira/browse/PIG-1351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Wang updated PIG-1351: --- Attachment: (was: PIG-1351.patch) [Zebra] No type check when we write to the basic table -- Key: PIG-1351 URL: https://issues.apache.org/jira/browse/PIG-1351 Project: Pig Issue Type: Improvement Affects Versions: 0.6.0, 0.7.0, 0.8.0 Reporter: Chao Wang Assignee: Chao Wang Fix For: 0.8.0 In Zebra, we do not have any type check when writing to a basic table. Say we have a schema f1:int, f2:string; we can nevertheless write a tuple (abc, 123) without any problem, which is definitely not desirable. To overcome this problem, we decided to perform a certain amount of type checking in Zebra - we check only the first row for each writer. This serves purely as a sanity check for cases where users mis-specify the output schema. We do NOT perform rigorous type checking on all rows, for obvious performance reasons. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1351) [Zebra] No type check when we write to the basic table
[ https://issues.apache.org/jira/browse/PIG-1351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Wang updated PIG-1351: --- Attachment: PIG-1351.patch [Zebra] No type check when we write to the basic table -- Key: PIG-1351 URL: https://issues.apache.org/jira/browse/PIG-1351 Project: Pig Issue Type: Improvement Affects Versions: 0.6.0, 0.7.0, 0.8.0 Reporter: Chao Wang Assignee: Chao Wang Fix For: 0.8.0 Attachments: PIG-1351.patch In Zebra, we do not have any type check when writing to a basic table. Say we have a schema f1:int, f2:string; we can nevertheless write a tuple (abc, 123) without any problem, which is definitely not desirable. To overcome this problem, we decided to perform a certain amount of type checking in Zebra - we check only the first row for each writer. This serves purely as a sanity check for cases where users mis-specify the output schema. We do NOT perform rigorous type checking on all rows, for obvious performance reasons. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1360) Pig API docs should include Piggybank
Pig API docs should include Piggybank - Key: PIG-1360 URL: https://issues.apache.org/jira/browse/PIG-1360 Project: Pig Issue Type: Bug Components: documentation Reporter: Alan Gates Assignee: Alan Gates Priority: Minor Currently piggybank functions aren't included in the javadocs. As they aren't documented anywhere else, this forces users to read the code to understand how to use them. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1351) [Zebra] No type check when we write to the basic table
[ https://issues.apache.org/jira/browse/PIG-1351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854595#action_12854595 ] Chao Wang commented on PIG-1351: 1) We follow Java's type compatibility rule as follows: For an int type column, we allow int data instances. For a long type column, we allow int and long data instances. For a float type column, we allow int, long and float data instances. For a double type column, we allow int, long, float and double data instances. 2) Also, due to the limitation that Pig only supports BYTES as the map value type, we do not check inside a map when its value type is BYTES; otherwise we do check. [Zebra] No type check when we write to the basic table -- Key: PIG-1351 URL: https://issues.apache.org/jira/browse/PIG-1351 Project: Pig Issue Type: Improvement Affects Versions: 0.6.0, 0.7.0, 0.8.0 Reporter: Chao Wang Assignee: Chao Wang Fix For: 0.8.0 Attachments: PIG-1351.patch In Zebra, we do not have any type check when writing to a basic table. Say we have a schema f1:int, f2:string; we can nevertheless write a tuple (abc, 123) without any problem, which is definitely not desirable. To overcome this problem, we decided to perform a certain amount of type checking in Zebra - we check only the first row for each writer. This serves purely as a sanity check for cases where users mis-specify the output schema. We do NOT perform rigorous type checking on all rows, for obvious performance reasons. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
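The widening rule in the comment above reduces to an ordering on the numeric types (a value is accepted if its type is no wider than the declared column type). A minimal sketch, with illustrative names rather than Zebra's actual API:

```java
// Sketch of the compatibility rule: int <= long <= float <= double.
// The enum and method names are hypothetical, not Zebra's real types.
public class TypeCheck {
    public enum T { INT, LONG, FLOAT, DOUBLE } // ranked narrowest to widest

    // A data instance is accepted if its type is no wider than the column's.
    public static boolean accepts(T column, T value) {
        return value.ordinal() <= column.ordinal();
    }

    public static void main(String[] args) {
        System.out.println(accepts(T.LONG, T.INT));  // prints true
        System.out.println(accepts(T.INT, T.LONG));  // prints false
    }
}
```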
[jira] Assigned: (PIG-1361) [Zebra] Zebra TableLoader.getSchema() should return the projectionSchema specified in the constructor of TableLoader instead of the projection pruned by Pig
[ https://issues.apache.org/jira/browse/PIG-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gaurav Jain reassigned PIG-1361: Assignee: Gaurav Jain [Zebra] Zebra TableLoader.getSchema() should return the projectionSchema specified in the constructor of TableLoader instead of the projection pruned by Pig - Key: PIG-1361 URL: https://issues.apache.org/jira/browse/PIG-1361 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.8.0 Reporter: Gaurav Jain Assignee: Gaurav Jain Priority: Minor Fix For: 0.8.0 For consistency among different loaders, Pig requests that Zebra TableLoader.getSchema() return the projectionSchema specified in the constructor of TableLoader instead of the projection pruned by Pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1315) [Zebra] Implementing OrderedLoadFunc interface for Zebra TableLoader
[ https://issues.apache.org/jira/browse/PIG-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated PIG-1315: - Status: Open (was: Patch Available) [Zebra] Implementing OrderedLoadFunc interface for Zebra TableLoader Key: PIG-1315 URL: https://issues.apache.org/jira/browse/PIG-1315 Project: Pig Issue Type: New Feature Reporter: Xuefu Zhang Assignee: Xuefu Zhang Fix For: 0.8.0 Attachments: zebra.0324, zebra.0324, zebra.0324 The OrderedLoadFunc interface is used by Pig to do merge join and map-side cogrouping. For Zebra, implementing this interface is necessary to support map-side cogrouping. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1315) [Zebra] Implementing OrderedLoadFunc interface for Zebra TableLoader
[ https://issues.apache.org/jira/browse/PIG-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated PIG-1315: - Attachment: (was: zebra.0324) [Zebra] Implementing OrderedLoadFunc interface for Zebra TableLoader Key: PIG-1315 URL: https://issues.apache.org/jira/browse/PIG-1315 Project: Pig Issue Type: New Feature Reporter: Xuefu Zhang Assignee: Xuefu Zhang Fix For: 0.8.0 Attachments: pig-1315.patch The OrderedLoadFunc interface is used by Pig to do merge join and map-side cogrouping. For Zebra, implementing this interface is necessary to support map-side cogrouping. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854607#action_12854607 ] Gianmarco De Francisci Morales commented on PIG-1295: - I have drafted my proposal at http://socghop.appspot.com/gsoc/student_proposal/show/google/gsoc2010/azaroth/t127030843242 Any feedback is more than welcome. Binary comparator for secondary sort Key: PIG-1295 URL: https://issues.apache.org/jira/browse/PIG-1295 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Daniel Dai When the Hadoop framework does the sorting, it will try to use a binary version of the comparator if one is available. The benefit of a binary comparator is that we do not need to instantiate objects before comparing. We saw a ~30% speedup after switching to a binary comparator. Currently, Pig uses a binary comparator in the following cases: 1. When the semantics of order don't matter. For example, in distinct, we need to do a sort in order to filter out duplicate values; however, we do not care how the comparator sorts keys. Group-by shares this characteristic. In these cases, we rely on Hadoop's default binary comparator. 2. The semantics of order matter, but the key is of a simple type. In this case, we have implementations for simple types, such as integer, long, float, chararray, databytearray, and string. However, if the key is a tuple and the sort semantics matter, we do not have a binary comparator implementation. This especially matters when we switch to using secondary sort. In secondary sort, we convert the inner sort of a nested foreach into the secondary key and rely on Hadoop to sort on both the main key and the secondary key. The sorting key becomes a two-item tuple. Since the secondary key is the sorting key of the nested foreach, the sorting semantics matter. It turns out we do not have a binary comparator once we use secondary sort, and we see a significant slowdown. 
A binary comparator for tuples should be doable once we understand the binary structure of the serialized tuple. We can focus on the most common use case first, which is a group-by followed by a nested sort. In this case, we will use secondary sort. The semantics of the first key do not matter, but the semantics of the secondary key do. We need to identify the boundary between the main key and the secondary key in the binary tuple buffer without instantiating the tuple itself. Then, if the first keys are equal, we use a binary comparator to compare the secondary keys. The secondary key can also be of a complex data type, but as a first step we focus on simple secondary keys, which is the most common use case. We mark this issue as a candidate project for the Google Summer of Code 2010 program. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
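The "compare without instantiating" idea can be illustrated on a toy serialization (a sketch only: Pig's real serialized tuples are far richer than two fixed-width big-endian ints, and this is not the PIG-1295 implementation; it only shows why a raw-bytes comparator avoids object creation entirely):

```java
import java.nio.ByteBuffer;

// Toy composite key (mainKey, secondaryKey) serialized as two big-endian
// ints. The comparator reads the bytes directly: no tuple is instantiated.
public class RawPairComparator {
    public static int compare(byte[] a, byte[] b) {
        ByteBuffer ba = ByteBuffer.wrap(a), bb = ByteBuffer.wrap(b);
        int main = Integer.compare(ba.getInt(), bb.getInt());
        if (main != 0) return main;                       // main keys differ: done
        return Integer.compare(ba.getInt(), bb.getInt()); // tie-break on secondary
    }

    public static byte[] pair(int main, int secondary) {
        return ByteBuffer.allocate(8).putInt(main).putInt(secondary).array();
    }

    public static void main(String[] args) {
        System.out.println(compare(pair(1, 5), pair(1, 7)) < 0); // prints true
    }
}
```

The hard part the description identifies, finding the main/secondary boundary inside a variable-width serialized tuple, is exactly what this fixed-width toy sidesteps.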
[jira] Updated: (PIG-1348) PigStorage making unnecessary byte array copy when storing data
[ https://issues.apache.org/jira/browse/PIG-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1348: -- Attachment: PIG-1348_2.patch PigStorage making unnecessary byte array copy when storing data --- Key: PIG-1348 URL: https://issues.apache.org/jira/browse/PIG-1348 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Assignee: Richard Ding Fix For: 0.7.0 Attachments: PIG-1348.patch, PIG-1348_2.patch InternalCachedBag estimates the memory available to the VM using Runtime.getRuntime().maxMemory(). It then uses 10% (by default, though configurable) of this memory and divides it among the bags. It keeps track of the memory used by the bags and proactively spills when the bags' memory usage gets close to these limits. Given all this, in theory InternalCachedBag should not run out of memory even when presented with more data than it can handle. But in practice we see OOMs happening. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1348) PigStorage making unnecessary byte array copy when storing data
[ https://issues.apache.org/jira/browse/PIG-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854608#action_12854608 ] Richard Ding commented on PIG-1348: --- Thanks Ashutosh. I changed the signature of write() to take values of type Tuple instead of type Object. On 1) and 3), Hadoop LineRecordWriter#write() is a synchronized method, and I think the JVM is optimized for the 'instanceof' construct and also for uncontended synchronization. I'd prefer that we have some performance numbers before adding optimizations. PigStorage making unnecessary byte array copy when storing data --- Key: PIG-1348 URL: https://issues.apache.org/jira/browse/PIG-1348 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Assignee: Richard Ding Fix For: 0.7.0 Attachments: PIG-1348.patch, PIG-1348_2.patch InternalCachedBag estimates the memory available to the VM using Runtime.getRuntime().maxMemory(). It then uses 10% (by default, though configurable) of this memory and divides it among the bags. It keeps track of the memory used by the bags and proactively spills when the bags' memory usage gets close to these limits. Given all this, in theory InternalCachedBag should not run out of memory even when presented with more data than it can handle. But in practice we see OOMs happening. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1348) PigStorage making unnecessary byte array copy when storing data
[ https://issues.apache.org/jira/browse/PIG-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1348: -- Status: Open (was: Patch Available) PigStorage making unnecessary byte array copy when storing data --- Key: PIG-1348 URL: https://issues.apache.org/jira/browse/PIG-1348 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Assignee: Richard Ding Fix For: 0.7.0 Attachments: PIG-1348.patch, PIG-1348_2.patch InternalCachedBag estimates the memory available to the VM using Runtime.getRuntime().maxMemory(). It then uses 10% (by default, though configurable) of this memory and divides it among the bags. It keeps track of the memory used by the bags and proactively spills when the bags' memory usage gets close to these limits. Given all this, in theory InternalCachedBag should not run out of memory even when presented with more data than it can handle. But in practice we see OOMs happening. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1348) PigStorage making unnecessary byte array copy when storing data
[ https://issues.apache.org/jira/browse/PIG-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854611#action_12854611 ] Alan Gates commented on PIG-1348: - bq. In StorageUtil.putField(), is it possible to get rid of DataType.findType(), possibly by getting hold of the schema and getting type information from there? If not, then maybe we cache the type info the first time, instead of finding it on every call. At the very least, we should get rid of casts for simple types, as that's unnecessary. DataType.isComplex() can be used to determine that. We have to be careful here. In the case where a schema is given, it's ok to use that to cast types. In cases without a schema we cannot assume that all records match the first, because Pig does not impose that as a requirement on the data. So looking at the first record and caching results is not ok. PigStorage making unnecessary byte array copy when storing data --- Key: PIG-1348 URL: https://issues.apache.org/jira/browse/PIG-1348 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Assignee: Richard Ding Fix For: 0.7.0 Attachments: PIG-1348.patch, PIG-1348_2.patch InternalCachedBag estimates the memory available to the VM using Runtime.getRuntime().maxMemory(). It then uses 10% (by default, though configurable) of this memory and divides it among the bags. It keeps track of the memory used by the bags and proactively spills when the bags' memory usage gets close to these limits. Given all this, in theory InternalCachedBag should not run out of memory even when presented with more data than it can handle. But in practice we see OOMs happening. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
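Alan's point, that a declared schema fixes the field type while schemaless data must be inspected per value, can be sketched as follows (illustrative names only, not Pig's StorageUtil or DataType API):

```java
// Sketch of the dispatch being discussed: with a declared schema the type
// is known up front; without one, each value must be inspected, since
// records need not all match the first. Names are hypothetical.
public class FieldWriter {
    public static final byte UNKNOWN = 0, INT = 1, CHARARRAY = 2;

    // Per-value inspection (the DataType.findType-style path)
    public static byte findType(Object o) {
        if (o instanceof Integer) return INT;
        if (o instanceof String)  return CHARARRAY;
        return UNKNOWN;
    }

    public static String serialize(Object o, byte declared) {
        // Schema wins when present; otherwise inspect this value only
        byte t = (declared != UNKNOWN) ? declared : findType(o);
        switch (t) {
            case INT:       return Integer.toString((Integer) o);
            case CHARARRAY: return (String) o;
            default:        return String.valueOf(o);
        }
    }
}
```

Note the per-value fallback never caches what it learned, which is exactly the constraint Alan describes for schemaless data.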
[jira] Updated: (PIG-1362) Provide udf context signature in ensureAllKeysInSameSplit() method of loader
[ https://issues.apache.org/jira/browse/PIG-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1362: -- Attachment: backport.patch Simple one-line fix. Test cases included. Provide udf context signature in ensureAllKeysInSameSplit() method of loader Key: PIG-1362 URL: https://issues.apache.org/jira/browse/PIG-1362 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Priority: Critical Fix For: 0.7.0 Attachments: backport.patch As part of PIG-1292, a check was introduced to make sure the loader used in a collected group-by implements CollectableLoader (a new interface in that patch). In its method, the loader may use the udf context to store some info. We need to make sure that the udf context signature is set up correctly in such cases. This is already the case in trunk; it needs to be backported to the 0.7 branch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1362) Provide udf context signature in ensureAllKeysInSameSplit() method of loader
[ https://issues.apache.org/jira/browse/PIG-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1362: -- Status: Patch Available (was: Open) Provide udf context signature in ensureAllKeysInSameSplit() method of loader Key: PIG-1362 URL: https://issues.apache.org/jira/browse/PIG-1362 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Priority: Critical Fix For: 0.7.0 Attachments: backport.patch As part of PIG-1292, a check was introduced to make sure the loader used in a collected group-by implements CollectableLoader (a new interface in that patch). In its method, the loader may use the udf context to store some info. We need to make sure that the udf context signature is set up correctly in such cases. This is already the case in trunk; it needs to be backported to the 0.7 branch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-1362) Provide udf context signature in ensureAllKeysInSameSplit() method of loader
[ https://issues.apache.org/jira/browse/PIG-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan reassigned PIG-1362: - Assignee: Ashutosh Chauhan Provide udf context signature in ensureAllKeysInSameSplit() method of loader Key: PIG-1362 URL: https://issues.apache.org/jira/browse/PIG-1362 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Assignee: Ashutosh Chauhan Priority: Critical Fix For: 0.7.0 Attachments: backport.patch As part of PIG-1292, a check was introduced to make sure the loader used in a collected group-by implements CollectableLoader (a new interface in that patch). In its method, the loader may use the udf context to store some info. We need to make sure that the udf context signature is set up correctly in such cases. This is already the case in trunk; it needs to be backported to the 0.7 branch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1350) [Zebra] Zebra column names cannot have leading _
[ https://issues.apache.org/jira/browse/PIG-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854623#action_12854623 ] Chao Wang commented on PIG-1350: Patch looks good. +1 [Zebra] Zebra column names cannot have leading _ -- Key: PIG-1350 URL: https://issues.apache.org/jira/browse/PIG-1350 Project: Pig Issue Type: Improvement Reporter: Xuefu Zhang Assignee: Xuefu Zhang Fix For: 0.7.0 Attachments: pig-1350.patch, pig-1350.patch Disallowing '_' as the leading character in column names in a Zebra schema is too restrictive; this restriction should be lifted. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1362) Provide udf context signature in ensureAllKeysInSameSplit() method of loader
[ https://issues.apache.org/jira/browse/PIG-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-1362: Resolution: Fixed Status: Resolved (was: Patch Available) +1 Provide udf context signature in ensureAllKeysInSameSplit() method of loader Key: PIG-1362 URL: https://issues.apache.org/jira/browse/PIG-1362 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Assignee: Ashutosh Chauhan Priority: Critical Fix For: 0.7.0 Attachments: backport.patch As part of PIG-1292, a check was introduced to make sure the loader used in a collected group-by implements CollectableLoader (a new interface in that patch). In its method, the loader may use the udf context to store some info. We need to make sure that the udf context signature is set up correctly in such cases. This is already the case in trunk; it needs to be backported to the 0.7 branch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1357) [zebra] Test cases of map-side GROUP-BY should be added.
[ https://issues.apache.org/jira/browse/PIG-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1357: -- Attachment: PIG-1357.patch [zebra] Test cases of map-side GROUP-BY should be added. Key: PIG-1357 URL: https://issues.apache.org/jira/browse/PIG-1357 Project: Pig Issue Type: Test Affects Versions: 0.7.0 Reporter: Yan Zhou Priority: Minor Fix For: 0.7.0 Attachments: PIG-1357.patch Globally sorted input splits are required for this feature to work properly. Prior to 0.7, all sorted input splits were globally sorted at the LOAD call on a sorted table. But with the support of locally sorted input splits, PIG-1306 and PIG-1315, the globally sorted input splits need to be requested by PIG explicitly. So this creates separate call paths for all PIG features that require map-side-only ops. Currently there are two PIG features that require globally sorted input splits from Zebra: map-side COGROUP and map-side GROUP-BY. PIG-1315 will contain test cases for the former, while this JIRA will cover the latter. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1341) BinStorage cannot convert DataByteArray to Chararray and results in FIELD_DISCARDED_TYPE_CONVERSION_FAILED
[ https://issues.apache.org/jira/browse/PIG-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1341: -- Fix Version/s: (was: 0.7.0) BinStorage cannot convert DataByteArray to Chararray and results in FIELD_DISCARDED_TYPE_CONVERSION_FAILED -- Key: PIG-1341 URL: https://issues.apache.org/jira/browse/PIG-1341 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Assignee: Richard Ding Attachments: PIG-1341.patch Script reads in BinStorage data and tries to convert a column which is in DataByteArray to Chararray. {code} raw = load 'sampledata' using BinStorage() as (col1,col2, col3); --filter out null columns A = filter raw by col1#'bcookie' is not null; B = foreach A generate col1#'bcookie' as reqcolumn; describe B; --B: {regcolumn: bytearray} X = limit B 5; dump X; B = foreach A generate (chararray)col1#'bcookie' as convertedcol; describe B; --B: {convertedcol: chararray} X = limit B 5; dump X; {code} The first dump produces: (36co9b55onr8s) (36co9b55onr8s) (36hilul5oo1q1) (36hilul5oo1q1) (36l4cj15ooa8a) The second dump produces: () () () () () It also throws an error message: FIELD_DISCARDED_TYPE_CONVERSION_FAILED 5 time(s). Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1348) PigStorage making unnecessary byte array copy when storing data
[ https://issues.apache.org/jira/browse/PIG-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854643#action_12854643 ] Ashutosh Chauhan commented on PIG-1348: --- 1) As far as I can see, TextOutputFormat has a synchronized write() because it is meant to work even with mappers implementing MultithreadedMapRunner. But since that's not the case for Pig, we can get rid of it, especially now that we are putting in our own PigTextOutputFormat instead of using TextOutputFormat. 3) That's what I meant: if a Schema is available, we should use it to find types, instead of reflecting on every call. I suggested the workaround of caching for the case where we know the user did provide a Schema but we don't have a handle on it. Clearly, if there is no schema, we need to find the type every time. I can see that dealing with complex types even when there is a schema is not straightforward. In any case, the casts that are currently there for simple types are unnecessary. As for performance numbers, both of these will save CPU time; if we are convinced that we are always I/O bound, we can leave these things as they are. PigStorage making unnecessary byte array copy when storing data --- Key: PIG-1348 URL: https://issues.apache.org/jira/browse/PIG-1348 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Assignee: Richard Ding Fix For: 0.7.0 Attachments: PIG-1348.patch, PIG-1348_2.patch InternalCachedBag makes an estimate of the memory available to the VM using Runtime.getRuntime().maxMemory(). It then uses 10% (by default, though configurable) of this memory and divides it among a number of bags. It keeps track of the memory used by the bags and proactively spills if bag memory usage gets close to these limits. Given all this, in theory, when presented with more data than it can handle, InternalCachedBag should not run out of memory. But in practice we find OOM happening. -- This message is automatically generated by JIRA. 
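The caching workaround suggested in the comment above can be sketched in plain Java. This is only an illustration, not Pig's actual code: the class and method names are invented, and the type-byte values are illustrative stand-ins for whatever constants the storer would use. The point is that the class-to-type resolution runs once per class rather than once per field.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: memoize the class -> type-byte lookup that would
// otherwise be redone (via instanceof chains or reflection) for every field.
public class TypeCache {
    // Illustrative type bytes, not necessarily Pig's DataType constants.
    public static final byte INTEGER = 10, LONG = 15, DOUBLE = 25, CHARARRAY = 55, UNKNOWN = 0;

    private final Map<Class<?>, Byte> cache = new HashMap<>();

    public byte typeOf(Object field) {
        if (field == null) return UNKNOWN;
        // resolve() runs at most once per distinct class.
        return cache.computeIfAbsent(field.getClass(), TypeCache::resolve);
    }

    // The "expensive" resolution we want to avoid repeating per field.
    private static byte resolve(Class<?> c) {
        if (c == Integer.class) return INTEGER;
        if (c == Long.class) return LONG;
        if (c == Double.class) return DOUBLE;
        if (c == String.class) return CHARARRAY;
        return UNKNOWN;
    }
}
```

As the comment notes, this only helps when no schema is available; with a schema, the types are known up front and no per-field resolution is needed at all.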
- You can reply to this email to add a comment to the issue online.
[jira] Reopened: (PIG-1362) Provide udf context signature in ensureAllKeysInSameSplit() method of loader
[ https://issues.apache.org/jira/browse/PIG-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan reopened PIG-1362: --- Provide udf context signature in ensureAllKeysInSameSplit() method of loader Key: PIG-1362 URL: https://issues.apache.org/jira/browse/PIG-1362 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Assignee: Ashutosh Chauhan Priority: Critical Fix For: 0.7.0 Attachments: backport.patch As part of PIG-1292, a check was introduced to make sure the loader used in collected group-by implements CollectableLoader (a new interface in that patch). In that method, the loader may use the udf context to store some info. We need to make sure that the udf context signature is set up correctly in such cases. This is already the case in trunk; it needs to be backported to the 0.7 branch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1363) Unnecessary loadFunc instantiations
Unnecessary loadFunc instantiations --- Key: PIG-1363 URL: https://issues.apache.org/jira/browse/PIG-1363 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Fix For: 0.8.0 In MRCompiler, loadFuncs are instantiated at multiple locations in different visit methods. This is inconsistent and confusing. A LoadFunc should be instantiated in only one place, ideally in LogToPhyTranslation#visit(LOLoad). A getter should be added to POLoad to retrieve this instantiated loadFunc wherever it is needed in later stages of compilation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
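The instantiate-once-plus-getter pattern the issue proposes can be sketched as follows. The class names here are illustrative stand-ins, not Pig's actual POLoad/LoadFunc code; the sketch only shows the shape of the fix: set the instance once at translation time, and have later compilation stages fetch the same instance instead of constructing a new one.

```java
// Minimal sketch (illustrative names, not Pig's real classes): a physical
// load operator that owns exactly one LoadFunc instance.
public class POLoadSketch {
    public interface LoadFunc { }
    public static class TextLoader implements LoadFunc { }

    private LoadFunc loadFunc;

    // Set once, e.g. during logical-to-physical translation.
    public void setLoadFunc(LoadFunc lf) { this.loadFunc = lf; }

    // Later compilation stages reuse the same instance via the getter.
    public LoadFunc getLoadFunc() {
        if (loadFunc == null) {
            throw new IllegalStateException("LoadFunc was never instantiated");
        }
        return loadFunc;
    }
}
```

Because every consumer goes through the getter, identity-sensitive state (such as anything the loader caches internally) is preserved across compilation stages.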
[jira] Commented: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854649#action_12854649 ] Daniel Dai commented on PIG-1295: - Thanks Gianmarco, the proposal looks good. Besides unit tests, we need to add some performance tests in both phase 1 and phase 2. Binary comparator for secondary sort Key: PIG-1295 URL: https://issues.apache.org/jira/browse/PIG-1295 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Daniel Dai When the hadoop framework does the sorting, it will try to use a binary version of the comparator if one is available. The benefit of a binary comparator is that we do not need to instantiate the objects before we compare them. We saw a ~30% speedup after switching to a binary comparator. Currently, Pig uses a binary comparator in the following cases: 1. When the semantics of the order don't matter. For example, in distinct, we need to sort in order to filter out duplicate values; however, we do not care how the comparator sorts the keys. Group-by also shares this characteristic. In this case, we rely on hadoop's default binary comparator. 2. When the semantics of the order matter, but the key is of a simple type. In this case, we have implementations for simple types such as integer, long, float, chararray, databytearray, and string. However, if the key is a tuple and the sort semantics matter, we do not have a binary comparator implementation. This especially matters when we switch to using secondary sort. In secondary sort, we convert the inner sort of a nested foreach into the secondary key and rely on hadoop to sort on both the main key and the secondary key. The sorting key becomes a two-item tuple. Since the secondary key is the sorting key of the nested foreach, the sorting semantics matter. It turns out we do not have a binary comparator once we use secondary sort, and we see a significant slowdown. 
A binary comparator for tuples should be doable once we understand the binary structure of the serialized tuple. We can focus on the most common use case first, which is a group-by followed by a nested sort. In this case, we will use secondary sort. The semantics of the first key do not matter, but the semantics of the secondary key do. We need to identify the boundary between the main key and the secondary key in the binary tuple buffer without instantiating the tuple itself. Then, if the first keys are equal, we use a binary comparator to compare the secondary keys. The secondary key can also be of a complex data type, but as a first step we focus on a simple secondary key, which is the most common use case. We mark this issue as a candidate project for the Google Summer of Code 2010 program. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
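The idea described above, comparing the serialized bytes directly and tie-breaking on the secondary key, can be illustrated with a deliberately simplified sketch. This is not Pig's tuple serialization: here each key pair is assumed to be two big-endian 4-byte ints (main key, then secondary key). The point is that both keys are compared without deserializing anything into objects.

```java
// Illustrative only: assumes (main, secondary) are serialized as two
// big-endian 4-byte ints. Pig's real serialized tuple layout is more complex.
public class RawPairComparator {
    public static int compare(byte[] a, byte[] b) {
        int main = compareIntAt(a, 0, b, 0);   // compare main keys first
        if (main != 0) return main;
        return compareIntAt(a, 4, b, 4);       // equal main keys: compare secondary keys
    }

    private static int compareIntAt(byte[] b1, int o1, byte[] b2, int o2) {
        return Integer.compare(readInt(b1, o1), readInt(b2, o2));
    }

    // Reconstruct the signed int value from big-endian bytes (no object built).
    private static int readInt(byte[] buf, int off) {
        return ((buf[off] & 0xff) << 24) | ((buf[off + 1] & 0xff) << 16)
             | ((buf[off + 2] & 0xff) << 8) | (buf[off + 3] & 0xff);
    }

    // Helper for producing test input in the assumed layout.
    public static byte[] encode(int main, int secondary) {
        return new byte[] {
            (byte) (main >> 24), (byte) (main >> 16), (byte) (main >> 8), (byte) main,
            (byte) (secondary >> 24), (byte) (secondary >> 16), (byte) (secondary >> 8), (byte) secondary
        };
    }
}
```

A real implementation would plug into Hadoop's RawComparator, which hands the comparator byte buffers with offsets and lengths for exactly this reason.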
[jira] Updated: (PIG-1299) Implement Pig counter to track number of output rows for each output files
[ https://issues.apache.org/jira/browse/PIG-1299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1299: -- Attachment: PIG-1299.patch Thanks Pradeep. The new patch addresses the comments. The patch adds a new Hadoop counter group--MultiStoreCounters--that counts the number of output records in each store of a MultiQuery script. Implement Pig counter to track number of output rows for each output files Key: PIG-1299 URL: https://issues.apache.org/jira/browse/PIG-1299 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Richard Ding Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1299.patch, PIG-1299.patch When running a multi-store query, the Hadoop job tracker often displays only 0 for the Reduce output records or Map output records counters. This is incorrect and misleading. Pig should implement an output records counter for each output file in the query. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1299) Implement Pig counter to track number of output rows for each output files
[ https://issues.apache.org/jira/browse/PIG-1299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1299: -- Status: Open (was: Patch Available) Implement Pig counter to track number of output rows for each output files Key: PIG-1299 URL: https://issues.apache.org/jira/browse/PIG-1299 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Richard Ding Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1299.patch, PIG-1299.patch When running a multi-store query, the Hadoop job tracker often displays only 0 for the Reduce output records or Map output records counters. This is incorrect and misleading. Pig should implement an output records counter for each output file in the query. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1299) Implement Pig counter to track number of output rows for each output files
[ https://issues.apache.org/jira/browse/PIG-1299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1299: -- Status: Patch Available (was: Open) Implement Pig counter to track number of output rows for each output files Key: PIG-1299 URL: https://issues.apache.org/jira/browse/PIG-1299 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Richard Ding Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1299.patch, PIG-1299.patch When running a multi-store query, the Hadoop job tracker often displays only 0 for the Reduce output records or Map output records counters. This is incorrect and misleading. Pig should implement an output records counter for each output file in the query. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1348) PigStorage making unnecessary byte array copy when storing data
[ https://issues.apache.org/jira/browse/PIG-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854654#action_12854654 ] Dmitriy V. Ryaboy commented on PIG-1348: In the spirit of better Java and micro-optimizations: StorageUtil does things like this to convert to bytes: {code} out.write(((Integer)field).toString().getBytes()); {code} Integer's toString() method creates a new string every time, even if the same integer (value-wise) is being converted to a String. This is better: {code} out.write(String.valueOf(field).getBytes()); {code} (This also collapses the case statement a fair bit, cleaning up the code -- we can batch Integer, Double, etc. together and fall through to just one line of code.) This discussion should probably go into a separate ticket. PigStorage making unnecessary byte array copy when storing data --- Key: PIG-1348 URL: https://issues.apache.org/jira/browse/PIG-1348 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Assignee: Richard Ding Fix For: 0.7.0 Attachments: PIG-1348.patch, PIG-1348_2.patch InternalCachedBag makes an estimate of the memory available to the VM using Runtime.getRuntime().maxMemory(). It then uses 10% (by default, though configurable) of this memory and divides it among a number of bags. It keeps track of the memory used by the bags and proactively spills if bag memory usage gets close to these limits. Given all this, in theory, when presented with more data than it can handle, InternalCachedBag should not run out of memory. But in practice we find OOM happening. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
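The "batch the wrapper types together" suggestion in the comment above can be sketched like this. The class and method names are illustrative, not StorageUtil's actual API; the sketch just shows how String.valueOf(Object) lets Integer, Long, Float, Double, and Boolean all fall through to a single conversion line instead of one cast-and-toString branch per type.

```java
import java.nio.charset.StandardCharsets;

// Sketch (hypothetical names): one conversion path for all simple types.
public class FieldBytes {
    public static byte[] toBytes(Object field) {
        if (field == null) return new byte[0];
        // String.valueOf(Object) dispatches to the right toString(),
        // so Integer, Long, Float, Double, Boolean share this one line.
        return String.valueOf(field).getBytes(StandardCharsets.UTF_8);
    }
}
```

Note this collapses the branching but still allocates a fresh String per call; avoiding that entirely would require caching or writing digits directly into the output buffer.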
[jira] Updated: (PIG-1315) [Zebra] Implementing OrderedLoadFunc interface for Zebra TableLoader
[ https://issues.apache.org/jira/browse/PIG-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated PIG-1315: - Status: Patch Available (was: Open) [Zebra] Implementing OrderedLoadFunc interface for Zebra TableLoader Key: PIG-1315 URL: https://issues.apache.org/jira/browse/PIG-1315 Project: Pig Issue Type: New Feature Reporter: Xuefu Zhang Assignee: Xuefu Zhang Fix For: 0.8.0 Attachments: pig-1315.patch OrderedLoadFunc interface is used by Pig to do merge join and mapside cogrouping. For Zebra, implementing this interface is necessary to support mapside cogrouping. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1364) Public javadoc on apache site still on 0.2, needs to be updated for each version release
Public javadoc on apache site still on 0.2, needs to be updated for each version release Key: PIG-1364 URL: https://issues.apache.org/jira/browse/PIG-1364 Project: Pig Issue Type: Bug Components: documentation Affects Versions: 0.7.0 Reporter: Alan Gates Assignee: Alan Gates Priority: Critical See http://hadoop.apache.org/pig/javadoc/docs/api/. This currently contains javadocs for 0.2. It is also versionless. It needs to be changed so that javadocs for recent versions are posted. It also needs to change so that the version is in the api so that multiple versions of the API can be posted. It's probably too late to do this for 0.6 and before, but it needs to happen for 0.7. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1364) Public javadoc on apache site still on 0.2, needs to be updated for each version release
[ https://issues.apache.org/jira/browse/PIG-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-1364: Fix Version/s: 0.7.0 Public javadoc on apache site still on 0.2, needs to be updated for each version release Key: PIG-1364 URL: https://issues.apache.org/jira/browse/PIG-1364 Project: Pig Issue Type: Bug Components: documentation Affects Versions: 0.7.0 Reporter: Alan Gates Assignee: Alan Gates Priority: Critical Fix For: 0.7.0 See http://hadoop.apache.org/pig/javadoc/docs/api/. This currently contains javadocs for 0.2. It is also versionless. It needs to be changed so that javadocs for recent versions are posted. It also needs to change so that the version is in the api so that multiple versions of the API can be posted. It's probably too late to do this for 0.6 and before, but it needs to happen for 0.7. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1357) [zebra] Test cases of map-side GROUP-BY should be added.
[ https://issues.apache.org/jira/browse/PIG-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854732#action_12854732 ] Chao Wang commented on PIG-1357: +1 [zebra] Test cases of map-side GROUP-BY should be added. Key: PIG-1357 URL: https://issues.apache.org/jira/browse/PIG-1357 Project: Pig Issue Type: Test Affects Versions: 0.7.0 Reporter: Yan Zhou Priority: Minor Fix For: 0.7.0 Attachments: PIG-1357.patch Globally sorted input splits are required for this feature to work properly. Prior to 0.7, all sorted input splits were globally sorted at the LOAD call on a sorted table. But with the support of locally sorted input splits (PIG-1306 and PIG-1315), globally sorted input splits need to be requested by Pig explicitly. This creates separate call paths for all Pig features that require map-side-only ops. Currently there are two Pig features that require globally sorted input splits from Zebra: map-side COGROUP and map-side GROUP-BY. PIG-1315 will contain test cases for the former, while this JIRA will cover the latter. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1365) WrappedIOException is missing from Pig.jar
WrappedIOException is missing from Pig.jar -- Key: PIG-1365 URL: https://issues.apache.org/jira/browse/PIG-1365 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Assignee: Pradeep Kamath Priority: Critical Fix For: 0.7.0 We need to put it back since UDFs rely on it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1315) [Zebra] Implementing OrderedLoadFunc interface for Zebra TableLoader
[ https://issues.apache.org/jira/browse/PIG-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854738#action_12854738 ] Yan Zhou commented on PIG-1315: --- +1 [Zebra] Implementing OrderedLoadFunc interface for Zebra TableLoader Key: PIG-1315 URL: https://issues.apache.org/jira/browse/PIG-1315 Project: Pig Issue Type: New Feature Reporter: Xuefu Zhang Assignee: Xuefu Zhang Fix For: 0.8.0 Attachments: pig-1315.patch OrderedLoadFunc interface is used by Pig to do merge join and mapside cogrouping. For Zebra, implementing this interface is necessary to support mapside cogrouping. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854740#action_12854740 ] Ashutosh Chauhan commented on PIG-1229: --- You can get rid of this stack trace by overriding relToAbsPathForStoreLocation() of StoreFunc, which DBStorage extends, and turning it into a no-op. Since a DB location is always absolute, there is no need for the default behavior provided in StoreFunc. As for DataType.find(), I found that even PigStorage does the same, so this patch is no worse than PigStorage in that respect. allow pig to write output into a JDBC db Key: PIG-1229 URL: https://issues.apache.org/jira/browse/PIG-1229 Project: Pig Issue Type: New Feature Components: impl Reporter: Ian Holsman Assignee: Ankur Priority: Minor Fix For: 0.8.0 Attachments: jira-1229-v2.patch UDF to store data into a DB -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
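The no-op override suggested above can be sketched with simplified signatures (StoreFunc's real method takes a Hadoop Path for the current directory and may throw IOException; this sketch uses plain Strings to stay self-contained). The contrast is between resolving a relative path against a working directory and returning a JDBC "location" untouched.

```java
// Illustrative sketch, not Pig's actual StoreFunc code.
public class DBStorageSketch {
    // Stand-in for StoreFunc's default behavior: resolve relative
    // store locations against the current working directory.
    static String defaultRelToAbs(String location, String cwd) {
        return location.startsWith("/") ? location : cwd + "/" + location;
    }

    // The suggested no-op override: a DB connection string is already
    // "absolute", so path resolution is meaningless and is skipped.
    static String dbRelToAbs(String location, String cwd) {
        return location;
    }
}
```

With the override in place, a connection string like jdbc:mysql://host/db is never mangled by filesystem path logic, which is what triggers the stack trace the comment refers to.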
[jira] Resolved: (PIG-1362) Provide udf context signature in ensureAllKeysInSameSplit() method of loader
[ https://issues.apache.org/jira/browse/PIG-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan resolved PIG-1362. --- Resolution: Fixed Since hudson is flaky once again, I ran the full test suite; all of it passed. Ran test-patch: {noformat} [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. {noformat} Patch checked in for the 0.7 branch. Provide udf context signature in ensureAllKeysInSameSplit() method of loader Key: PIG-1362 URL: https://issues.apache.org/jira/browse/PIG-1362 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Assignee: Ashutosh Chauhan Priority: Critical Fix For: 0.7.0 Attachments: backport.patch As part of PIG-1292, a check was introduced to make sure the loader used in collected group-by implements CollectableLoader (a new interface in that patch). In that method, the loader may use the udf context to store some info. We need to make sure that the udf context signature is set up correctly in such cases. This is already the case in trunk; it needs to be backported to the 0.7 branch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-756) UDFs should have API for transparently opening and reading files from HDFS or from local file system with only relative path
[ https://issues.apache.org/jira/browse/PIG-756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854762#action_12854762 ] Viraj Bhat commented on PIG-756: In Pig 0.7 we have moved local mode of Pig to local mode of Hadoop. https://issues.apache.org/jira/browse/PIG-1053 Closing issue. UDFs should have API for transparently opening and reading files from HDFS or from local file system with only relative path Key: PIG-756 URL: https://issues.apache.org/jira/browse/PIG-756 Project: Pig Issue Type: Bug Reporter: David Ciemiewicz I have a utility function util.INSETFROMFILE() to which I pass a file name during initialization. {code} define inQuerySet util.INSETFROMFILE('analysis/queries'); A = load 'logs' using PigStorage() as ( date int, query chararray ); B = filter A by inQuerySet(query); {code} This provides a computationally inexpensive way to effect map-side joins for small sets, and functions of this style can encapsulate more complex matching rules. For rapid development and debugging purposes, I want this code to run without modification both on my local file system when I do pig -exectype local and on HDFS. Pig needs to provide an API for UDFs which allows them to either: 1) know when they are in local or HDFS mode and open and read files as appropriate, or 2) just provide a file name and read statements and have Pig transparently manage local or HDFS opens and reads for the UDF. UDFs need to read configuration information off the filesystem, and it simplifies the process if one can just flip the switch of -exectype local. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-756) UDFs should have API for transparently opening and reading files from HDFS or from local file system with only relative path
[ https://issues.apache.org/jira/browse/PIG-756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat resolved PIG-756. Resolution: Fixed Fix Version/s: 0.7.0 https://issues.apache.org/jira/browse/PIG-1053 fixes this issue. UDFs should have API for transparently opening and reading files from HDFS or from local file system with only relative path Key: PIG-756 URL: https://issues.apache.org/jira/browse/PIG-756 Project: Pig Issue Type: Bug Reporter: David Ciemiewicz Fix For: 0.7.0 I have a utility function util.INSETFROMFILE() to which I pass a file name during initialization. {code} define inQuerySet util.INSETFROMFILE('analysis/queries'); A = load 'logs' using PigStorage() as ( date int, query chararray ); B = filter A by inQuerySet(query); {code} This provides a computationally inexpensive way to effect map-side joins for small sets, and functions of this style can encapsulate more complex matching rules. For rapid development and debugging purposes, I want this code to run without modification both on my local file system when I do pig -exectype local and on HDFS. Pig needs to provide an API for UDFs which allows them to either: 1) know when they are in local or HDFS mode and open and read files as appropriate, or 2) just provide a file name and read statements and have Pig transparently manage local or HDFS opens and reads for the UDF. UDFs need to read configuration information off the filesystem, and it simplifies the process if one can just flip the switch of -exectype local. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
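The INSETFROMFILE idea above, loading a small term list once and then doing cheap per-record membership checks, can be illustrated in plain Java. This is not the actual UDF (which would extend Pig's EvalFunc and deal with HDFS-vs-local file access, the very point of the issue); it only shows the set-membership core that makes the filter a cheap map-side semi-join.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;

// Plain-Java sketch of the in-set filter (hypothetical, not the real UDF):
// read the term list once at construction, then answer contains() per record.
public class InSet {
    private final Set<String> terms = new HashSet<>();

    public InSet(Reader source) {
        try (BufferedReader r = new BufferedReader(source)) {
            String line;
            while ((line = r.readLine()) != null) {
                String t = line.trim();
                if (!t.isEmpty()) terms.add(t);   // one entry per non-blank line
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public boolean contains(String query) { return terms.contains(query); }
}
```

The file-access abstraction the issue asks for would sit in front of this: the UDF would obtain the Reader from either the local filesystem or HDFS depending on execution mode, without the class above having to care.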
[jira] Updated: (PIG-1357) [zebra] Test cases of map-side GROUP-BY should be added.
[ https://issues.apache.org/jira/browse/PIG-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1357: -- Status: Patch Available (was: Open) [zebra] Test cases of map-side GROUP-BY should be added. Key: PIG-1357 URL: https://issues.apache.org/jira/browse/PIG-1357 Project: Pig Issue Type: Test Affects Versions: 0.7.0 Reporter: Yan Zhou Priority: Minor Fix For: 0.7.0 Attachments: PIG-1357.patch Globally sorted input splits are required for this feature to work properly. Prior to 0.7, all sorted input splits were globally sorted at the LOAD call on a sorted table. But with the support of locally sorted input splits (PIG-1306 and PIG-1315), globally sorted input splits need to be requested by Pig explicitly. This creates separate call paths for all Pig features that require map-side-only ops. Currently there are two Pig features that require globally sorted input splits from Zebra: map-side COGROUP and map-side GROUP-BY. PIG-1315 will contain test cases for the former, while this JIRA will cover the latter. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1366) PigStorage's pushProjection implementation results in NPE under certain data conditions
PigStorage's pushProjection implementation results in NPE under certain data conditions --- Key: PIG-1366 URL: https://issues.apache.org/jira/browse/PIG-1366 Project: Pig Issue Type: Bug Affects Versions: 0.6.0, 0.7.0 Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: 0.7.0 Under the following conditions, a NullPointerException is caused when PigStorage is used: If in the script, only the 2nd and 3rd column of the data (say) are used, the PruneColumns optimization passes this information to PigStorage through the pushProjection() method. If the data contains a row with only one column (malformed data due to missing cols in certain rows), PigStorage returns a Tuple backed by a null ArrayList. Subsequent projection operations on this tuple result in the NPE. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1366) PigStorage's pushProjection implementation results in NPE under certain data conditions
[ https://issues.apache.org/jira/browse/PIG-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-1366: Attachment: PIG-1366.patch Currently in PigStorage the ArrayList backing the Tuple returned in getNext() is created in readField(). Under the data conditions explained in the description, readField() never gets called and the ArrayList (mProtoTuple) remains null causing the eventual NPE. The patch fixes the issue by initializing mProtoTuple to a new ArrayList at the beginning of getNext(). PigStorage's pushProjection implementation results in NPE under certain data conditions --- Key: PIG-1366 URL: https://issues.apache.org/jira/browse/PIG-1366 Project: Pig Issue Type: Bug Affects Versions: 0.6.0, 0.7.0 Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: 0.7.0 Attachments: PIG-1366.patch Under the following conditions, a NullPointerException is caused when PigStorage is used: If in the script, only the 2nd and 3rd column of the data (say) are used, the PruneColumns optimization passes this information to PigStorage through the pushProjection() method. If the data contains a row with only one column (malformed data due to missing cols in certain rows), PigStorage returns a Tuple backed by a null ArrayList. Subsequent projection operations on this tuple result in the NPE. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
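The fix described above, initializing mProtoTuple at the start of getNext() so a malformed row can never leave it null, can be shown with a simplified parser. This is not PigStorage's actual code (the real class builds Pig Tuples and readField() does the parsing); the field name mirrors the description, but the parsing here is a bare tab-split stand-in.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the PIG-1366 fix (illustrative, not PigStorage itself).
public class LineParser {
    private List<String> mProtoTuple;

    // requiredColumns plays the role of the pushProjection() column list.
    public List<String> getNext(String line, int[] requiredColumns) {
        mProtoTuple = new ArrayList<>();      // the fix: allocate up front, never null
        String[] fields = line.split("\t");
        for (int col : requiredColumns) {
            // Malformed rows may lack some requested columns entirely;
            // previously such a row could skip field handling and leave
            // the backing list null, causing the NPE on projection.
            if (col < fields.length) mProtoTuple.add(fields[col]);
        }
        return mProtoTuple;
    }
}
```

With the up-front allocation, a one-column row projected on columns 2 and 3 yields an empty, safely projectable result instead of a null-backed tuple.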
[jira] Updated: (PIG-1366) PigStorage's pushProjection implementation results in NPE under certain data conditions
[ https://issues.apache.org/jira/browse/PIG-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-1366: Status: Patch Available (was: Open) PigStorage's pushProjection implementation results in NPE under certain data conditions --- Key: PIG-1366 URL: https://issues.apache.org/jira/browse/PIG-1366 Project: Pig Issue Type: Bug Affects Versions: 0.6.0, 0.7.0 Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: 0.7.0 Attachments: PIG-1366.patch Under the following conditions, a NullPointerException is caused when PigStorage is used: If in the script, only the 2nd and 3rd column of the data (say) are used, the PruneColumns optimization passes this information to PigStorage through the pushProjection() method. If the data contains a row with only one column (malformed data due to missing cols in certain rows), PigStorage returns a Tuple backed by a null ArrayList. Subsequent projection operations on this tuple result in the NPE. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1365) WrappedIOException is missing from Pig.jar
[ https://issues.apache.org/jira/browse/PIG-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-1365: Attachment: PIG-1365.patch Attached patch restores WrappedIOException - this is not used in Pig Code and only provided for use by UDFs to maintain backward compatibility. I have marked the class as deprecated so that it can be removed from pig code base in a later release. No unit tests have been added since this is just restoring an old class which is no longer used in the pig code. WrappedIOException is missing from Pig.jar -- Key: PIG-1365 URL: https://issues.apache.org/jira/browse/PIG-1365 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Assignee: Pradeep Kamath Priority: Critical Fix For: 0.7.0 Attachments: PIG-1365.patch We need to put it back since UDFs rely on it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1365) WrappedIOException is missing from Pig.jar
[ https://issues.apache.org/jira/browse/PIG-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-1365: Status: Patch Available (was: Open) WrappedIOException is missing from Pig.jar -- Key: PIG-1365 URL: https://issues.apache.org/jira/browse/PIG-1365 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Assignee: Pradeep Kamath Priority: Critical Fix For: 0.7.0 Attachments: PIG-1365.patch We need to put it back since UDFs rely on it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.