[jira] Updated: (PIG-1354) UDFs for dynamic invocation of simple Java methods

2010-04-07 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1354:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed.

 UDFs for dynamic invocation of simple Java methods
 --

 Key: PIG-1354
 URL: https://issues.apache.org/jira/browse/PIG-1354
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.8.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG-1354.patch, PIG-1354.patch, PIG-1354.patch


 The need to create wrapper UDFs for simple Java functions creates unnecessary 
 work for Pig users, slows down the development process, and produces a lot of 
 trivial classes. We can use Java reflection to invoke such methods dynamically 
 through a generic UDF.

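For illustration, a minimal sketch of the reflection mechanism the description refers to; the class and method names below are illustrative, not the UDFs actually committed for this issue:

{code}
import java.lang.reflect.Method;

/**
 * Minimal sketch of reflection-based dynamic invocation (illustrative only;
 * not the committed invoker UDFs).
 */
public class SimpleInvoker {
    private final Method method;

    // e.g. new SimpleInvoker("java.net.URLDecoder", "decode",
    //                        new Class[] {String.class, String.class})
    public SimpleInvoker(String className, String methodName, Class<?>[] argTypes)
            throws Exception {
        this.method = Class.forName(className).getMethod(methodName, argTypes);
    }

    // Invokes the (static) target method with the given arguments.
    public Object invoke(Object... args) throws Exception {
        return method.invoke(null, args);
    }
}
{code}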
-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1150) VAR() Variance UDF

2010-04-07 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854487#action_12854487
 ] 

Dmitriy V. Ryaboy commented on PIG-1150:


Jay, there may be -- I only glanced at the code here. The real problem is this: 
http://planetmath.org/encyclopedia/OnePassAlgorithmToComputeSampleVariance.html 
 -- you are going to get round-off errors, and possibly overflow errors, using 
this approach.

Thanks for reminding me that I promised this; I'll work on open-sourcing our 
code.

-Dmitriy

 VAR() Variance UDF
 --

 Key: PIG-1150
 URL: https://issues.apache.org/jira/browse/PIG-1150
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.5.0
 Environment: UDF, written in Pig 0.5 contrib/
Reporter: Russell Jurney
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: var.patch


 I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates 
 variance in a distributed manner, based on the AVG() builtin.  It works by 
 calculating the count, sum and sum of squares, as described here: 
 http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
 Is this a worthwhile contribution?  Taking the square root of this value 
 using the contrib SQRT() function gives Standard Deviation, which is missing 
 from Pig.

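For illustration, a minimal sketch of the count/sum/sum-of-squares approach described above, with a combinable partial state (illustrative only, not the attached patch):

{code}
/** Partial aggregate for a parallel variance computation (illustrative sketch). */
public class VarianceState {
    long count;
    double sum;
    double sumSq;

    void accumulate(double x) {
        count++;
        sum += x;
        sumSq += x * x;
    }

    /** Combine two partial aggregates, e.g. in an Algebraic combiner. */
    void merge(VarianceState other) {
        count += other.count;
        sum += other.sum;
        sumSq += other.sumSq;
    }

    /** Population variance; subject to the round-off issues noted in the comment above. */
    double variance() {
        double mean = sum / count;
        return sumSq / count - mean * mean;
    }
}
{code}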
-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1359) bin/pig script does not pick up correct jar libraries

2010-04-07 Thread Gianmarco De Francisci Morales (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gianmarco De Francisci Morales updated PIG-1359:


Description: 
The bin/pig script tries to load pig jar libraries from the pig-*-core.jar 
using this bash fragment

{code}

# for releases, add core pig to CLASSPATH
for f in $PIG_HOME/pig-*core.jar; do
CLASSPATH=${CLASSPATH}:$f;
done

# during development pig jar might be in build
for f in $PIG_HOME/build/pig-*-core.jar; do
CLASSPATH=${CLASSPATH}:$f;
done

{code} 

The pig-*-core.jar does not contain the dependencies for pig that are found in 
build/ivy/lib/Pig/*.jar (jline).
The script does not even pick up the pig.jar in PIG_HOME that is produced as a 
result of the ant build process.

This results in the following error after successfully building pig:

{code} 

Exception in thread "main" java.lang.NoClassDefFoundError: 
jline/ConsoleReaderInputStream
Caused by: java.lang.ClassNotFoundException: jline.ConsoleReaderInputStream

{code} 


  was:
The bin/pig script tries to load pig jar libraries from the pig-*-core.jar 
using this bash fragment

{code:bash}

# for releases, add core pig to CLASSPATH
for f in $PIG_HOME/pig-*core.jar; do
CLASSPATH=${CLASSPATH}:$f;
done

# during development pig jar might be in build
for f in $PIG_HOME/build/pig-*-core.jar; do
CLASSPATH=${CLASSPATH}:$f;
done

{code} 

The pig-*-core.jar does not contain the dependencies for pig that are found in 
build/ivy/lib/Pig/*.jar (jline).
The script does not even pick up the pig.jar in PIG_HOME that is produced as a 
result of the ant build process.

This results in the following error after successfully building pig:

{code} 

Exception in thread "main" java.lang.NoClassDefFoundError: 
jline/ConsoleReaderInputStream
Caused by: java.lang.ClassNotFoundException: jline.ConsoleReaderInputStream

{code} 



 bin/pig script does not pick up correct jar libraries
 -

 Key: PIG-1359
 URL: https://issues.apache.org/jira/browse/PIG-1359
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
 Environment: Linux Ubuntu 8.10, java-6-sun
Reporter: Gianmarco De Francisci Morales
Priority: Trivial

 The bin/pig script tries to load pig jar libraries from the pig-*-core.jar 
 using this bash fragment
 {code}
 # for releases, add core pig to CLASSPATH
 for f in $PIG_HOME/pig-*core.jar; do
 CLASSPATH=${CLASSPATH}:$f;
 done
 # during development pig jar might be in build
 for f in $PIG_HOME/build/pig-*-core.jar; do
 CLASSPATH=${CLASSPATH}:$f;
 done
 {code} 
 The pig-*-core.jar does not contain the dependencies for pig that are found 
 in build/ivy/lib/Pig/*.jar (jline).
 The script does not even pick up the pig.jar in PIG_HOME that is produced as a 
 result of the ant build process.
 This results in the following error after successfully building pig:
 {code} 
 Exception in thread "main" java.lang.NoClassDefFoundError: 
 jline/ConsoleReaderInputStream
 Caused by: java.lang.ClassNotFoundException: jline.ConsoleReaderInputStream
 {code} 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1359) bin/pig script does not pick up correct jar libraries

2010-04-07 Thread Gianmarco De Francisci Morales (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gianmarco De Francisci Morales updated PIG-1359:


Description: 
The bin/pig script tries to load pig jar libraries from the pig-*-core.jar 
using this bash fragment

{code}

# for releases, add core pig to CLASSPATH
for f in $PIG_HOME/pig-*core.jar; do
CLASSPATH=${CLASSPATH}:$f;
done

# during development pig jar might be in build
for f in $PIG_HOME/build/pig-*-core.jar; do
CLASSPATH=${CLASSPATH}:$f;
done

{code} 

The pig-\*-core.jar does not contain the dependencies for pig that are found in 
build/ivy/lib/Pig/\*.jar (jline).
The script does not even pick up the pig.jar in PIG_HOME that is produced as a 
result of the ant build process.

This results in the following error after successfully building pig:

{code} 

Exception in thread "main" java.lang.NoClassDefFoundError: 
jline/ConsoleReaderInputStream
Caused by: java.lang.ClassNotFoundException: jline.ConsoleReaderInputStream

{code} 


  was:
The bin/pig script tries to load pig jar libraries from the pig-*-core.jar 
using this bash fragment

{code}

# for releases, add core pig to CLASSPATH
for f in $PIG_HOME/pig-*core.jar; do
CLASSPATH=${CLASSPATH}:$f;
done

# during development pig jar might be in build
for f in $PIG_HOME/build/pig-*-core.jar; do
CLASSPATH=${CLASSPATH}:$f;
done

{code} 

The pig-*-core.jar does not contain the dependencies for pig that are found in 
build/ivy/lib/Pig/*.jar (jline).
The script does not even pick up the pig.jar in PIG_HOME that is produced as a 
result of the ant build process.

This results in the following error after successfully building pig:

{code} 

Exception in thread "main" java.lang.NoClassDefFoundError: 
jline/ConsoleReaderInputStream
Caused by: java.lang.ClassNotFoundException: jline.ConsoleReaderInputStream

{code} 



 bin/pig script does not pick up correct jar libraries
 -

 Key: PIG-1359
 URL: https://issues.apache.org/jira/browse/PIG-1359
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
 Environment: Linux Ubuntu 8.10, java-6-sun
Reporter: Gianmarco De Francisci Morales
Priority: Trivial

 The bin/pig script tries to load pig jar libraries from the pig-*-core.jar 
 using this bash fragment
 {code}
 # for releases, add core pig to CLASSPATH
 for f in $PIG_HOME/pig-*core.jar; do
 CLASSPATH=${CLASSPATH}:$f;
 done
 # during development pig jar might be in build
 for f in $PIG_HOME/build/pig-*-core.jar; do
 CLASSPATH=${CLASSPATH}:$f;
 done
 {code} 
 The pig-\*-core.jar does not contain the dependencies for pig that are found 
 in build/ivy/lib/Pig/\*.jar (jline).
 The script does not even pick up the pig.jar in PIG_HOME that is produced as a 
 result of the ant build process.
 This results in the following error after successfully building pig:
 {code} 
 Exception in thread "main" java.lang.NoClassDefFoundError: 
 jline/ConsoleReaderInputStream
 Caused by: java.lang.ClassNotFoundException: jline.ConsoleReaderInputStream
 {code} 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1351) [Zebra] No type check when we write to the basic table

2010-04-07 Thread Chao Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Wang updated PIG-1351:
---

Attachment: (was: PIG-1351.patch)

 [Zebra] No type check when we write to the basic table
 --

 Key: PIG-1351
 URL: https://issues.apache.org/jira/browse/PIG-1351
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.6.0, 0.7.0, 0.8.0
Reporter: Chao Wang
Assignee: Chao Wang
 Fix For: 0.8.0


 In Zebra, we do not have any type check when writing to a basic table. 
 Say we have a schema f1:int, f2:string; however, we can write a tuple 
 ('abc', 123) without any problem, which is definitely not desirable.
 To overcome this problem, we have decided to perform a certain amount of type 
 checking in Zebra: we check the first row only for each writer.
 This serves only as a sanity check in cases where users screw up 
 specifying the output schema. We do NOT perform rigorous type checking on 
 all rows, due to performance concerns.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1351) [Zebra] No type check when we write to the basic table

2010-04-07 Thread Chao Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Wang updated PIG-1351:
---

Attachment: PIG-1351.patch

 [Zebra] No type check when we write to the basic table
 --

 Key: PIG-1351
 URL: https://issues.apache.org/jira/browse/PIG-1351
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.6.0, 0.7.0, 0.8.0
Reporter: Chao Wang
Assignee: Chao Wang
 Fix For: 0.8.0

 Attachments: PIG-1351.patch


 In Zebra, we do not have any type check when writing to a basic table. 
 Say we have a schema f1:int, f2:string; however, we can write a tuple 
 ('abc', 123) without any problem, which is definitely not desirable.
 To overcome this problem, we have decided to perform a certain amount of type 
 checking in Zebra: we check the first row only for each writer.
 This serves only as a sanity check in cases where users screw up 
 specifying the output schema. We do NOT perform rigorous type checking on 
 all rows, due to performance concerns.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1360) Pig API docs should include Piggybank

2010-04-07 Thread Alan Gates (JIRA)
Pig API docs should include Piggybank
-

 Key: PIG-1360
 URL: https://issues.apache.org/jira/browse/PIG-1360
 Project: Pig
  Issue Type: Bug
  Components: documentation
Reporter: Alan Gates
Assignee: Alan Gates
Priority: Minor


Currently piggybank functions aren't included in the javadocs.  As they aren't 
documented anywhere else, this forces users to read the code to understand how 
to use them.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1351) [Zebra] No type check when we write to the basic table

2010-04-07 Thread Chao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854595#action_12854595
 ] 

Chao Wang commented on PIG-1351:


1) We follow Java's type compatibility rule as follows:
For int type column, we allow int data instances.
For long type column, we allow int and long data instances.
For float type column, we allow int, long and float data instances.
For double type column, we allow int, long, float and double data instances.

2) Also, due to the limitation that Pig only supports BYTES as the map value 
type, we do not check inside a map when its value type is BYTES; otherwise we 
do check.

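For illustration, a minimal sketch of the widening rule described in 1); the ColumnType enum and class below are illustrative, not Zebra's actual type system or the attached patch:

{code}
/** Numeric types ordered from narrowest to widest (illustrative only). */
enum ColumnType { INT, LONG, FLOAT, DOUBLE }

class TypeCheckSketch {
    /** Returns true if a value of type 'actual' may be written to a column of type 'declared'. */
    static boolean isCompatible(ColumnType declared, ColumnType actual) {
        // Java-style widening: a column accepts its own type and any narrower numeric type,
        // e.g. a double column accepts int, long, float, and double instances.
        return actual.ordinal() <= declared.ordinal();
    }
}
{code}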
 [Zebra] No type check when we write to the basic table
 --

 Key: PIG-1351
 URL: https://issues.apache.org/jira/browse/PIG-1351
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.6.0, 0.7.0, 0.8.0
Reporter: Chao Wang
Assignee: Chao Wang
 Fix For: 0.8.0

 Attachments: PIG-1351.patch


 In Zebra, we do not have any type check when writing to a basic table. 
 Say we have a schema f1:int, f2:string; however, we can write a tuple 
 ('abc', 123) without any problem, which is definitely not desirable.
 To overcome this problem, we have decided to perform a certain amount of type 
 checking in Zebra: we check the first row only for each writer.
 This serves only as a sanity check in cases where users screw up 
 specifying the output schema. We do NOT perform rigorous type checking on 
 all rows, due to performance concerns.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1361) [Zebra] Zebra TableLoader.getSchema() should return the projectionSchema specified in the constructor of TableLoader instead of pruned projection by pig

2010-04-07 Thread Gaurav Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gaurav Jain reassigned PIG-1361:


Assignee: Gaurav Jain

 [Zebra] Zebra TableLoader.getSchema() should return the projectionSchema 
 specified in the constructor of TableLoader instead of pruned projection by 
 pig 
 -

 Key: PIG-1361
 URL: https://issues.apache.org/jira/browse/PIG-1361
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.8.0
Reporter: Gaurav Jain
Assignee: Gaurav Jain
Priority: Minor
 Fix For: 0.8.0


 For consistency among different loaders, Pig requests that Zebra 
 TableLoader.getSchema() return the projectionSchema specified in the 
 constructor of TableLoader instead of the projection pruned by Pig. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1315) [Zebra] Implementing OrderedLoadFunc interface for Zebra TableLoader

2010-04-07 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-1315:
-

Status: Open  (was: Patch Available)

 [Zebra] Implementing OrderedLoadFunc interface for Zebra TableLoader
 

 Key: PIG-1315
 URL: https://issues.apache.org/jira/browse/PIG-1315
 Project: Pig
  Issue Type: New Feature
Reporter: Xuefu Zhang
Assignee: Xuefu Zhang
 Fix For: 0.8.0

 Attachments: zebra.0324, zebra.0324, zebra.0324


 OrderedLoadFunc interface is used by Pig to do merge join and mapside 
 cogrouping. For Zebra, implementing this interface is necessary to support 
 mapside cogrouping.

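For context, a sketch of what an OrderedLoadFunc-style implementation can look like. The split-ordering scheme below is purely illustrative and is not the Zebra patch; it assumes the getSplitComparable(InputSplit) method as found in Pig 0.8's OrderedLoadFunc interface:

{code}
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/**
 * Illustrative sketch only: orders splits by (path, offset) so Pig can line
 * them up for merge join / map-side cogroup. Not the Zebra implementation.
 */
public class SplitOrderingSketch {
    // Mirrors the idea behind OrderedLoadFunc.getSplitComparable(InputSplit).
    public WritableComparable<?> getSplitComparable(InputSplit split) throws IOException {
        FileSplit fs = (FileSplit) split;
        // A simple totally ordered key: file path plus zero-padded start offset.
        return new Text(fs.getPath().toString() + ":" + String.format("%020d", fs.getStart()));
    }
}
{code}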
-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1315) [Zebra] Implementing OrderedLoadFunc interface for Zebra TableLoader

2010-04-07 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-1315:
-

Attachment: (was: zebra.0324)

 [Zebra] Implementing OrderedLoadFunc interface for Zebra TableLoader
 

 Key: PIG-1315
 URL: https://issues.apache.org/jira/browse/PIG-1315
 Project: Pig
  Issue Type: New Feature
Reporter: Xuefu Zhang
Assignee: Xuefu Zhang
 Fix For: 0.8.0

 Attachments: pig-1315.patch


 OrderedLoadFunc interface is used by Pig to do merge join and mapside 
 cogrouping. For Zebra, implementing this interface is necessary to support 
 mapside cogrouping.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1315) [Zebra] Implementing OrderedLoadFunc interface for Zebra TableLoader

2010-04-07 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-1315:
-

Attachment: (was: zebra.0324)

 [Zebra] Implementing OrderedLoadFunc interface for Zebra TableLoader
 

 Key: PIG-1315
 URL: https://issues.apache.org/jira/browse/PIG-1315
 Project: Pig
  Issue Type: New Feature
Reporter: Xuefu Zhang
Assignee: Xuefu Zhang
 Fix For: 0.8.0

 Attachments: pig-1315.patch


 OrderedLoadFunc interface is used by Pig to do merge join and mapside 
 cogrouping. For Zebra, implementing this interface is necessary to support 
 mapside cogrouping.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-04-07 Thread Gianmarco De Francisci Morales (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854607#action_12854607
 ] 

Gianmarco De Francisci Morales commented on PIG-1295:
-

I have drafted my proposal at 
http://socghop.appspot.com/gsoc/student_proposal/show/google/gsoc2010/azaroth/t127030843242

Any feedback is more than welcome.

 Binary comparator for secondary sort
 

 Key: PIG-1295
 URL: https://issues.apache.org/jira/browse/PIG-1295
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai

 When the Hadoop framework does the sorting, it will try to use a binary 
 version of the comparator if one is available. The benefit of a binary 
 comparator is that we do not need to instantiate the objects before we 
 compare them; we saw a ~30% speedup after switching to a binary comparator. 
 Currently, Pig uses a binary comparator in the following cases:
 1. When the semantics of the order don't matter. For example, in distinct we 
 need to sort in order to filter out duplicate values, but we do not care how 
 the comparator orders keys. Group-by shares this characteristic. In these 
 cases, we rely on Hadoop's default binary comparator.
 2. When the semantics of the order matter, but the key is of a simple type. 
 In this case, we have implementations for simple types such as integer, 
 long, float, chararray, databytearray, and string.
 However, if the key is a tuple and the sort semantics matter, we do not have 
 a binary comparator implementation. This especially matters when we switch 
 to secondary sort. In secondary sort, we convert the inner sort of a nested 
 foreach into the secondary key and rely on Hadoop to sort on both the main 
 key and the secondary key, so the sorting key becomes a two-item tuple. 
 Since the secondary key is the sorting key of the nested foreach, its sort 
 semantics matter. As a result, we lose the binary comparator once we use 
 secondary sort, and we see a significant slowdown.
 A binary comparator for tuples should be doable once we understand the 
 binary structure of the serialized tuple. We can focus on the most common 
 use case first: a group-by followed by a nested sort, which uses secondary 
 sort. Here the semantics of the first key do not matter but the semantics of 
 the secondary key do. We need to identify the boundary between the main key 
 and the secondary key in the binary tuple buffer without instantiating the 
 tuple itself. Then, if the first keys are equal, we use a binary comparator 
 to compare the secondary keys. The secondary key can also be a complex data 
 type, but as a first step we focus on simple secondary keys, which are the 
 most common use case.
 We mark this issue as a candidate project for the Google Summer of Code 2010 
 program. 

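To illustrate the general idea of comparing serialized keys without deserializing them, here is a minimal raw-comparison sketch. It assumes a deliberately simplified layout (a 4-byte int main key followed by the secondary key bytes) rather than Pig's actual tuple serialization, and it is not wired into Hadoop's RawComparator interface:

{code}
import org.apache.hadoop.io.WritableComparator;

/**
 * Illustrative byte-level comparison for a composite key laid out as:
 *   [4-byte int main key][remaining bytes = secondary key]
 * This is NOT Pig's serialized tuple format; it only demonstrates the
 * "compare bytes without instantiating the tuple" idea.
 */
public class CompositeKeyRawCompareSketch {

    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        // Compare the 4-byte main keys first.
        int main1 = WritableComparator.readInt(b1, s1);
        int main2 = WritableComparator.readInt(b2, s2);
        if (main1 != main2) {
            return main1 < main2 ? -1 : 1;
        }
        // Main keys are equal: compare the secondary key bytes lexicographically.
        return WritableComparator.compareBytes(b1, s1 + 4, l1 - 4,
                                               b2, s2 + 4, l2 - 4);
    }
}
{code}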
-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1348) PigStorage making unnecessary byte array copy when storing data

2010-04-07 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1348:
--

Attachment: PIG-1348_2.patch

 PigStorage making unnecessary byte array copy when storing data
 ---

 Key: PIG-1348
 URL: https://issues.apache.org/jira/browse/PIG-1348
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Richard Ding
 Fix For: 0.7.0

 Attachments: PIG-1348.patch, PIG-1348_2.patch


 InternalCachedBag makes an estimate of the memory available to the VM using 
 Runtime.getRuntime().maxMemory(). It then takes 10% (by default, though 
 configurable) of this memory and divides it among the number of bags. It 
 keeps track of the memory used by the bags and proactively spills when their 
 memory usage gets close to these limits. Given all this, in theory 
 InternalCachedBag should not run out of memory even when presented with more 
 data than it can handle. But in practice we find OOMs happening. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1348) PigStorage making unnecessary byte array copy when storing data

2010-04-07 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854608#action_12854608
 ] 

Richard Ding commented on PIG-1348:
---

Thanks Ashutosh. I changed the signature of write() to take values of type 
Tuple instead of type Object. 

On 1) and 3), Hadoop's LineRecordWriter#write() is a synchronized method, and I 
think the JVM is optimized for the 'instanceof' construct and also for 
uncontended synchronization. I would prefer that we have some performance 
numbers before adding optimizations.

 PigStorage making unnecessary byte array copy when storing data
 ---

 Key: PIG-1348
 URL: https://issues.apache.org/jira/browse/PIG-1348
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Richard Ding
 Fix For: 0.7.0

 Attachments: PIG-1348.patch, PIG-1348_2.patch


 InternalCachedBag makes an estimate of the memory available to the VM using 
 Runtime.getRuntime().maxMemory(). It then takes 10% (by default, though 
 configurable) of this memory and divides it among the number of bags. It 
 keeps track of the memory used by the bags and proactively spills when their 
 memory usage gets close to these limits. Given all this, in theory 
 InternalCachedBag should not run out of memory even when presented with more 
 data than it can handle. But in practice we find OOMs happening. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1348) PigStorage making unnecessary byte array copy when storing data

2010-04-07 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1348:
--

Status: Open  (was: Patch Available)

 PigStorage making unnecessary byte array copy when storing data
 ---

 Key: PIG-1348
 URL: https://issues.apache.org/jira/browse/PIG-1348
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Richard Ding
 Fix For: 0.7.0

 Attachments: PIG-1348.patch, PIG-1348_2.patch


 InternalCachedBag makes an estimate of the memory available to the VM using 
 Runtime.getRuntime().maxMemory(). It then takes 10% (by default, though 
 configurable) of this memory and divides it among the number of bags. It 
 keeps track of the memory used by the bags and proactively spills when their 
 memory usage gets close to these limits. Given all this, in theory 
 InternalCachedBag should not run out of memory even when presented with more 
 data than it can handle. But in practice we find OOMs happening. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1348) PigStorage making unnecessary byte array copy when storing data

2010-04-07 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854611#action_12854611
 ] 

Alan Gates commented on PIG-1348:
-

bq. In StorageUtil.putField(), is it possible to get rid of 
DataType.findType(), possibly by getting hold of the schema and getting type 
information from there? If not, then maybe we can cache the type info the 
first time, instead of finding it on every call. At the very least, we should 
get rid of casts for simple types as that's unnecessary. DataType.isComplex() 
can be used to determine that. 

We have to be careful here.  In the case where a schema is given, it's OK to 
use that to cast types.  In cases without a schema we cannot assume that all 
records match the first, because Pig does not impose that as a requirement on 
the data.  So looking at the first record and caching the result is not OK.

 PigStorage making unnecessary byte array copy when storing data
 ---

 Key: PIG-1348
 URL: https://issues.apache.org/jira/browse/PIG-1348
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Richard Ding
 Fix For: 0.7.0

 Attachments: PIG-1348.patch, PIG-1348_2.patch


 InternalCachedBag makes an estimate of the memory available to the VM using 
 Runtime.getRuntime().maxMemory(). It then takes 10% (by default, though 
 configurable) of this memory and divides it among the number of bags. It 
 keeps track of the memory used by the bags and proactively spills when their 
 memory usage gets close to these limits. Given all this, in theory 
 InternalCachedBag should not run out of memory even when presented with more 
 data than it can handle. But in practice we find OOMs happening. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1362) Provide udf context signature in ensureAllKeysInSameSplit() method of loader

2010-04-07 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1362:
--

Attachment: backport.patch

Simple one-line fix. Test cases included.

 Provide udf context signature in ensureAllKeysInSameSplit() method of loader
 

 Key: PIG-1362
 URL: https://issues.apache.org/jira/browse/PIG-1362
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Priority: Critical
 Fix For: 0.7.0

 Attachments: backport.patch


 As a part of PIG-1292 a check was introduced to make sure the loader used in 
 collected group-by implements CollectableLoader (a new interface in that 
 patch). In its method, the loader may use the UDF context to store some info. 
 We need to make sure that the UDF context signature is set up correctly in 
 such cases. This is already the case in trunk; it needs to be backported to 
 the 0.7 branch. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1362) Provide udf context signature in ensureAllKeysInSameSplit() method of loader

2010-04-07 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1362:
--

Status: Patch Available  (was: Open)

 Provide udf context signature in ensureAllKeysInSameSplit() method of loader
 

 Key: PIG-1362
 URL: https://issues.apache.org/jira/browse/PIG-1362
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Priority: Critical
 Fix For: 0.7.0

 Attachments: backport.patch


 As a part of PIG-1292 a check was introduced to make sure the loader used in 
 collected group-by implements CollectableLoader (a new interface in that 
 patch). In its method, the loader may use the UDF context to store some info. 
 We need to make sure that the UDF context signature is set up correctly in 
 such cases. This is already the case in trunk; it needs to be backported to 
 the 0.7 branch. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1362) Provide udf context signature in ensureAllKeysInSameSplit() method of loader

2010-04-07 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan reassigned PIG-1362:
-

Assignee: Ashutosh Chauhan

 Provide udf context signature in ensureAllKeysInSameSplit() method of loader
 

 Key: PIG-1362
 URL: https://issues.apache.org/jira/browse/PIG-1362
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
Priority: Critical
 Fix For: 0.7.0

 Attachments: backport.patch


 As a part of PIG-1292 a check was introduced to make sure the loader used in 
 collected group-by implements CollectableLoader (a new interface in that 
 patch). In its method, the loader may use the UDF context to store some info. 
 We need to make sure that the UDF context signature is set up correctly in 
 such cases. This is already the case in trunk; it needs to be backported to 
 the 0.7 branch. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1350) [Zebra] Zebra column names cannot have leading _

2010-04-07 Thread Chao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854623#action_12854623
 ] 

Chao Wang commented on PIG-1350:


Patch looks good +1

 [Zebra] Zebra column names cannot have leading _
 --

 Key: PIG-1350
 URL: https://issues.apache.org/jira/browse/PIG-1350
 Project: Pig
  Issue Type: Improvement
Reporter: Xuefu Zhang
Assignee: Xuefu Zhang
 Fix For: 0.7.0

 Attachments: pig-1350.patch, pig-1350.patch


 Disallowing '_' as leading character in column names in Zebra schema is too 
 restrictive, which should be lifted.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1362) Provide udf context signature in ensureAllKeysInSameSplit() method of loader

2010-04-07 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-1362:


Resolution: Fixed
Status: Resolved  (was: Patch Available)

+1

 Provide udf context signature in ensureAllKeysInSameSplit() method of loader
 

 Key: PIG-1362
 URL: https://issues.apache.org/jira/browse/PIG-1362
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
Priority: Critical
 Fix For: 0.7.0

 Attachments: backport.patch


 As a part of PIG-1292 a check was introduced to make sure the loader used in 
 collected group-by implements CollectableLoader (a new interface in that 
 patch). In its method, the loader may use the UDF context to store some info. 
 We need to make sure that the UDF context signature is set up correctly in 
 such cases. This is already the case in trunk; it needs to be backported to 
 the 0.7 branch. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1357) [zebra] Test cases of map-side GROUP-BY should be added.

2010-04-07 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1357:
--

Attachment: PIG-1357.patch

 [zebra] Test cases of map-side GROUP-BY should be added.
 

 Key: PIG-1357
 URL: https://issues.apache.org/jira/browse/PIG-1357
 Project: Pig
  Issue Type: Test
Affects Versions: 0.7.0
Reporter: Yan Zhou
Priority: Minor
 Fix For: 0.7.0

 Attachments: PIG-1357.patch


 This feature requires globally sorted input splits to work properly. Prior to 
 0.7, all sorted input splits were globally sorted at the LOAD call on a sorted 
 table. But with the support of locally sorted input splits, PIG-1306 and 
 PIG-1315, the globally sorted input splits need to be asked for by Pig 
 explicitly. This creates separate call paths for all Pig features that 
 require map-side-only ops. Currently there are two Pig features that require 
 globally sorted input splits from Zebra: map-side COGROUP and map-side 
 GROUP-BY. PIG-1315 will contain test cases for the former, while this JIRA 
 will cover the latter.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1341) BinStorage cannot convert DataByteArray to Chararray and results in FIELD_DISCARDED_TYPE_CONVERSION_FAILED

2010-04-07 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1341:
--

Fix Version/s: (was: 0.7.0)

 BinStorage cannot convert DataByteArray to Chararray and results in 
 FIELD_DISCARDED_TYPE_CONVERSION_FAILED
 --

 Key: PIG-1341
 URL: https://issues.apache.org/jira/browse/PIG-1341
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: Richard Ding
 Attachments: PIG-1341.patch


 The script reads in BinStorage data and tries to convert a column from 
 DataByteArray to Chararray. 
 {code}
 raw = load 'sampledata' using BinStorage() as (col1,col2, col3);
 --filter out null columns
 A = filter raw by col1#'bcookie' is not null;
 B = foreach A generate col1#'bcookie'  as reqcolumn;
 describe B;
 --B: {regcolumn: bytearray}
 X = limit B 5;
 dump X;
 B = foreach A generate (chararray)col1#'bcookie'  as convertedcol;
 describe B;
 --B: {convertedcol: chararray}
 X = limit B 5;
 dump X;
 {code}
 The first dump produces:
 (36co9b55onr8s)
 (36co9b55onr8s)
 (36hilul5oo1q1)
 (36hilul5oo1q1)
 (36l4cj15ooa8a)
 The second dump produces:
 ()
 ()
 ()
 ()
 ()
 It also throws an error message: FIELD_DISCARDED_TYPE_CONVERSION_FAILED 5 
 time(s).
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1348) PigStorage making unnecessary byte array copy when storing data

2010-04-07 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854643#action_12854643
 ] 

Ashutosh Chauhan commented on PIG-1348:
---

1) As far as I can see, TextOutputFormat has a synchronized write() because it 
is meant to work even with mappers run through MultithreadedMapRunner. But 
since that's not the case for Pig, we can get rid of it, especially now that 
we are putting in our own PigTextOutputFormat instead of using 
TextOutputFormat. 

3) That's what I meant: if a schema is available, we should use that to find 
types, instead of reflecting on every call. I suggested the workaround of 
caching for the case where we know the user did provide a schema but we don't 
have a handle on it. Clearly, if there is no schema, we need to find the type 
every time. I can see that dealing with complex types even when there is a 
schema is not straightforward. In any case, the casts that are currently there 
for simple types are unnecessary.

As for performance numbers, both of these will save CPU time; if we are 
convinced that we are always I/O bound, we can leave these things as they are. 

 PigStorage making unnecessary byte array copy when storing data
 ---

 Key: PIG-1348
 URL: https://issues.apache.org/jira/browse/PIG-1348
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Richard Ding
 Fix For: 0.7.0

 Attachments: PIG-1348.patch, PIG-1348_2.patch


 InternalCachedBag makes an estimate of the memory available to the VM using 
 Runtime.getRuntime().maxMemory(). It then takes 10% (by default, though 
 configurable) of this memory and divides it among the number of bags. It 
 keeps track of the memory used by the bags and proactively spills when their 
 memory usage gets close to these limits. Given all this, in theory 
 InternalCachedBag should not run out of memory even when presented with more 
 data than it can handle. But in practice we find OOMs happening. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Reopened: (PIG-1362) Provide udf context signature in ensureAllKeysInSameSplit() method of loader

2010-04-07 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan reopened PIG-1362:
---


 Provide udf context signature in ensureAllKeysInSameSplit() method of loader
 

 Key: PIG-1362
 URL: https://issues.apache.org/jira/browse/PIG-1362
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
Priority: Critical
 Fix For: 0.7.0

 Attachments: backport.patch


 As a part of PIG-1292 a check was introduced to make sure the loader used in 
 collected group-by implements CollectableLoader (a new interface in that 
 patch). In its method, the loader may use the UDF context to store some info. 
 We need to make sure that the UDF context signature is set up correctly in 
 such cases. This is already the case in trunk; it needs to be backported to 
 the 0.7 branch. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1363) Unnecessary loadFunc instantiations

2010-04-07 Thread Ashutosh Chauhan (JIRA)
Unnecessary loadFunc instantiations
---

 Key: PIG-1363
 URL: https://issues.apache.org/jira/browse/PIG-1363
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
 Fix For: 0.8.0


In MRCompiler loadfuncs are instantiated at multiple locations in different 
visit methods. This is inconsistent and confusing. LoadFunc should be 
instantiated at only one place, ideally in LogToPhyTanslation#visit(LOLoad). A 
getter should be added to POLoad to retrieve this instantiated loadFunc 
wherever it is needed in later stages of compilation. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-04-07 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854649#action_12854649
 ] 

Daniel Dai commented on PIG-1295:
-

Thanks Gianmarco, 
The proposal looks good. Besides unit tests, we need to add some performance 
tests in both phase 1 and phase 2.

 Binary comparator for secondary sort
 

 Key: PIG-1295
 URL: https://issues.apache.org/jira/browse/PIG-1295
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai

 When the Hadoop framework does the sorting, it will try to use a binary 
 version of the comparator if one is available. The benefit of a binary 
 comparator is that we do not need to instantiate the objects before we 
 compare them; we saw a ~30% speedup after switching to a binary comparator. 
 Currently, Pig uses a binary comparator in the following cases:
 1. When the semantics of the order don't matter. For example, in distinct we 
 need to sort in order to filter out duplicate values, but we do not care how 
 the comparator orders keys. Group-by shares this characteristic. In these 
 cases, we rely on Hadoop's default binary comparator.
 2. When the semantics of the order matter, but the key is of a simple type. 
 In this case, we have implementations for simple types such as integer, 
 long, float, chararray, databytearray, and string.
 However, if the key is a tuple and the sort semantics matter, we do not have 
 a binary comparator implementation. This especially matters when we switch 
 to secondary sort. In secondary sort, we convert the inner sort of a nested 
 foreach into the secondary key and rely on Hadoop to sort on both the main 
 key and the secondary key, so the sorting key becomes a two-item tuple. 
 Since the secondary key is the sorting key of the nested foreach, its sort 
 semantics matter. As a result, we lose the binary comparator once we use 
 secondary sort, and we see a significant slowdown.
 A binary comparator for tuples should be doable once we understand the 
 binary structure of the serialized tuple. We can focus on the most common 
 use case first: a group-by followed by a nested sort, which uses secondary 
 sort. Here the semantics of the first key do not matter but the semantics of 
 the secondary key do. We need to identify the boundary between the main key 
 and the secondary key in the binary tuple buffer without instantiating the 
 tuple itself. Then, if the first keys are equal, we use a binary comparator 
 to compare the secondary keys. The secondary key can also be a complex data 
 type, but as a first step we focus on simple secondary keys, which are the 
 most common use case.
 We mark this issue as a candidate project for the Google Summer of Code 2010 
 program. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1299) Implement Pig counter to track number of output rows for each output file

2010-04-07 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1299:
--

Attachment: PIG-1299.patch

Thanks Pradeep.  The new patch addresses the comments.

The patch adds a new Hadoop counter group, MultiStoreCounters, that counts the 
number of output records in each store of a multi-query script. 

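For reference, the general pattern for bumping such per-store counters through the Hadoop mapreduce API looks roughly like the sketch below; the group and counter names are illustrative, not necessarily those used in the patch:

{code}
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Illustrative sketch of incrementing a per-output counter from a map task.
 * Group/counter names are examples only.
 */
public class CountingMapperSketch extends Mapper<Object, Text, NullWritable, Text> {
    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // One counter per logical output ("store"), grouped under a common group name.
        context.getCounter("MultiStoreCounters", "Output records in store_A").increment(1);
        context.write(NullWritable.get(), value);
    }
}
{code}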
 Implement Pig counter to track number of output rows for each output file 
 

 Key: PIG-1299
 URL: https://issues.apache.org/jira/browse/PIG-1299
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1299.patch, PIG-1299.patch


 When running a multi-store query, the Hadoop job tracker often displays only 
 0 for the "Reduce output records" or "Map output records" counters. This is 
 incorrect and misleading. Pig should implement an output records counter 
 for each output file in the query. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1299) Implement Pig counter to track number of output rows for each output file

2010-04-07 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1299:
--

Status: Open  (was: Patch Available)

 Implement Pig counter to track number of output rows for each output file 
 

 Key: PIG-1299
 URL: https://issues.apache.org/jira/browse/PIG-1299
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1299.patch, PIG-1299.patch


 When running a multi-store query, the Hadoop job tracker often displays only 
 0 for the "Reduce output records" or "Map output records" counters. This is 
 incorrect and misleading. Pig should implement an output records counter 
 for each output file in the query. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1299) Implement Pig counter to track number of output rows for each output file

2010-04-07 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1299:
--

Status: Patch Available  (was: Open)

 Implement Pig counter to track number of output rows for each output file 
 

 Key: PIG-1299
 URL: https://issues.apache.org/jira/browse/PIG-1299
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1299.patch, PIG-1299.patch


 When running a multi-store query, the Hadoop job tracker often displays only 
 0 for the "Reduce output records" or "Map output records" counters. This is 
 incorrect and misleading. Pig should implement an output records counter 
 for each output file in the query. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1348) PigStorage making unnecessary byte array copy when storing data

2010-04-07 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854654#action_12854654
 ] 

Dmitriy V. Ryaboy commented on PIG-1348:


In the spirit of better java and micro-optimizations:

StorageUtil does things like this to convert to bytes:

{code}
out.write(((Integer)field).toString().getBytes());
{code}

Integer's toString() method creates a new string every time, even if the same 
integer (value-wise) is being converted to a String.  This is better:

{code}
out.write(String.valueOf(field).getBytes());
{code}

(This reuses the values, and also collapses the case statement a fair bit, 
cleaning up the code -- we can batch Integer, Double, etc, together and fall 
through to just one line of code.)

This discussion should probably go into a separate ticket.

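A minimal sketch of what the collapsed case handling could look like (hypothetical helper, not the code in StorageUtil):

{code}
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical helper illustrating the collapsed per-type handling discussed above. */
public class FieldWriterSketch {
    static void putSimpleField(OutputStream out, Object field) throws IOException {
        // Integer, Long, Float, Double, Boolean, etc. all fall through to one line:
        // String.valueOf() handles null and delegates to the value's own toString().
        out.write(String.valueOf(field).getBytes("UTF-8"));
    }
}
{code}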
 PigStorage making unnecessary byte array copy when storing data
 ---

 Key: PIG-1348
 URL: https://issues.apache.org/jira/browse/PIG-1348
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Richard Ding
 Fix For: 0.7.0

 Attachments: PIG-1348.patch, PIG-1348_2.patch


 InternalCachedBag makes an estimate of the memory available to the VM using 
 Runtime.getRuntime().maxMemory(). It then takes 10% (by default, though 
 configurable) of this memory and divides it among the number of bags. It 
 keeps track of the memory used by the bags and proactively spills when their 
 memory usage gets close to these limits. Given all this, in theory 
 InternalCachedBag should not run out of memory even when presented with more 
 data than it can handle. But in practice we find OOMs happening. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1315) [Zebra] Implementing OrderedLoadFunc interface for Zebra TableLoader

2010-04-07 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-1315:
-

Status: Patch Available  (was: Open)

 [Zebra] Implementing OrderedLoadFunc interface for Zebra TableLoader
 

 Key: PIG-1315
 URL: https://issues.apache.org/jira/browse/PIG-1315
 Project: Pig
  Issue Type: New Feature
Reporter: Xuefu Zhang
Assignee: Xuefu Zhang
 Fix For: 0.8.0

 Attachments: pig-1315.patch


 OrderedLoadFunc interface is used by Pig to do merge join and mapside 
 cogrouping. For Zebra, implementing this interface is necessary to support 
 mapside cogrouping.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1364) Public javadoc on apache site still on 0.2, needs to be updated for each version release

2010-04-07 Thread Alan Gates (JIRA)
Public javadoc on apache site still on 0.2, needs to be updated for each 
version release


 Key: PIG-1364
 URL: https://issues.apache.org/jira/browse/PIG-1364
 Project: Pig
  Issue Type: Bug
  Components: documentation
Affects Versions: 0.7.0
Reporter: Alan Gates
Assignee: Alan Gates
Priority: Critical


See http://hadoop.apache.org/pig/javadoc/docs/api/.  This currently contains 
javadocs for 0.2.  It is also versionless.

It needs to be changed so that javadocs for recent versions are posted.  It 
also needs to change so that the version is part of the API path, allowing 
multiple versions of the API to be posted.

It's probably too late to do this for 0.6 and before, but it needs to happen 
for 0.7.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1364) Public javadoc on apache site still on 0.2, needs to be updated for each version release

2010-04-07 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-1364:


Fix Version/s: 0.7.0

 Public javadoc on apache site still on 0.2, needs to be updated for each 
 version release
 

 Key: PIG-1364
 URL: https://issues.apache.org/jira/browse/PIG-1364
 Project: Pig
  Issue Type: Bug
  Components: documentation
Affects Versions: 0.7.0
Reporter: Alan Gates
Assignee: Alan Gates
Priority: Critical
 Fix For: 0.7.0


 See http://hadoop.apache.org/pig/javadoc/docs/api/.  This currently contains 
 javadocs for 0.2.  It is also versionless.
 It needs to be changed so that javadocs for recent versions are posted.  It 
 also needs to change so that the version is part of the API path, allowing 
 multiple versions of the API to be posted.
 It's probably too late to do this for 0.6 and before, but it needs to happen 
 for 0.7.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1357) [zebra] Test cases of map-side GROUP-BY should be added.

2010-04-07 Thread Chao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854732#action_12854732
 ] 

Chao Wang commented on PIG-1357:


+1

 [zebra] Test cases of map-side GROUP-BY should be added.
 

 Key: PIG-1357
 URL: https://issues.apache.org/jira/browse/PIG-1357
 Project: Pig
  Issue Type: Test
Affects Versions: 0.7.0
Reporter: Yan Zhou
Priority: Minor
 Fix For: 0.7.0

 Attachments: PIG-1357.patch


 This feature requires globally sorted input splits to work properly. Prior to 
 0.7, all sorted input splits were globally sorted at the LOAD call on a sorted 
 table. But with the support of locally sorted input splits, PIG-1306 and 
 PIG-1315, the globally sorted input splits need to be asked for by Pig 
 explicitly. This creates separate call paths for all Pig features that 
 require map-side-only ops. Currently there are two Pig features that require 
 globally sorted input splits from Zebra: map-side COGROUP and map-side 
 GROUP-BY. PIG-1315 will contain test cases for the former, while this JIRA 
 will cover the latter.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1365) WrappedIOException is missing from Pig.jar

2010-04-07 Thread Olga Natkovich (JIRA)
WrappedIOException is missing from Pig.jar
--

 Key: PIG-1365
 URL: https://issues.apache.org/jira/browse/PIG-1365
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Pradeep Kamath
Priority: Critical
 Fix For: 0.7.0


We need to put it back since UDFs rely on it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1315) [Zebra] Implementing OrderedLoadFunc interface for Zebra TableLoader

2010-04-07 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854738#action_12854738
 ] 

Yan Zhou commented on PIG-1315:
---

+1

 [Zebra] Implementing OrderedLoadFunc interface for Zebra TableLoader
 

 Key: PIG-1315
 URL: https://issues.apache.org/jira/browse/PIG-1315
 Project: Pig
  Issue Type: New Feature
Reporter: Xuefu Zhang
Assignee: Xuefu Zhang
 Fix For: 0.8.0

 Attachments: pig-1315.patch


 OrderedLoadFunc interface is used by Pig to do merge join and mapside 
 cogrouping. For Zebra, implementing this interface is necessary to support 
 mapside cogrouping.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db

2010-04-07 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854740#action_12854740
 ] 

Ashutosh Chauhan commented on PIG-1229:
---

You can get rid of this stack trace by overriding 
relToAbsPathForStoreLocation() of StoreFunc, which DBStorage extends, and 
turning it into a no-op. Since a DB location is always absolute, there is no 
need for the default behavior provided by StoreFunc.  

For DataType.find(), I found that even PigStorage does the same, so this patch 
is no worse than PigStorage in that way.

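A sketch of the suggested no-op override. The signature is assumed to match StoreFunc.relToAbsPathForStoreLocation(String, Path) in Pig 0.7 and should be checked against the class DBStorage actually extends:

{code}
import java.io.IOException;
import org.apache.hadoop.fs.Path;

// Sketch only: DBStorage is assumed to extend org.apache.pig.StoreFunc,
// which is omitted here so the snippet stands alone.
public class DBStorageSketch /* extends StoreFunc */ {
    /**
     * No-op override: a JDBC "location" is not a filesystem path, so return it
     * unchanged instead of resolving it relative to the current directory.
     */
    public String relToAbsPathForStoreLocation(String location, Path curDir)
            throws IOException {
        return location;
    }
}
{code}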
 allow pig to write output into a JDBC db
 

 Key: PIG-1229
 URL: https://issues.apache.org/jira/browse/PIG-1229
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Ian Holsman
Assignee: Ankur
Priority: Minor
 Fix For: 0.8.0

 Attachments: jira-1229-v2.patch


 UDF to store data into a DB

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-1362) Provide udf context signature in ensureAllKeysInSameSplit() method of loader

2010-04-07 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan resolved PIG-1362.
---

Resolution: Fixed

Since Hudson is flaky once again, I ran the full test suite; all of it passed. 
I also ran test-patch:

{noformat}
 [exec] +1 overall.  
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 3 new or 
modified tests.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.
{noformat}

Patch checked in to the 0.7 branch.

 Provide udf context signature in ensureAllKeysInSameSplit() method of loader
 

 Key: PIG-1362
 URL: https://issues.apache.org/jira/browse/PIG-1362
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
Priority: Critical
 Fix For: 0.7.0

 Attachments: backport.patch


 As a part of PIG-1292 a check was introduced to make sure the loader used in 
 collected group-by implements CollectableLoader (a new interface in that 
 patch). In its method, the loader may use the UDF context to store some info. 
 We need to make sure that the UDF context signature is set up correctly in 
 such cases. This is already the case in trunk; it needs to be backported to 
 the 0.7 branch. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-756) UDFs should have API for transparently opening and reading files from HDFS or from local file system with only relative path

2010-04-07 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854762#action_12854762
 ] 

Viraj Bhat commented on PIG-756:


In Pig 0.7 we have moved the local mode of Pig to the local mode of Hadoop:
https://issues.apache.org/jira/browse/PIG-1053

Closing this issue.
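For illustration, a rough sketch of how a UDF can now open such a side file 
through Hadoop's FileSystem API, so the same code resolves to the local file 
system under -x local and to HDFS on a cluster (the class below is hypothetical, 
not the reporter's util.INSETFROMFILE):

{code}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

// Illustrative filter UDF: returns true if the first field is in the side file.
public class InSetFromFileSketch extends FilterFunc {
    private final String fileName;
    private Set<String> values;   // loaded lazily on the first call

    public InSetFromFileSketch(String fileName) {
        this.fileName = fileName;
    }

    private void loadSet() throws IOException {
        values = new HashSet<String>();
        // FileSystem.get() resolves against whatever configuration is on the
        // classpath: the local file system in local mode, HDFS on a cluster.
        FileSystem fs = FileSystem.get(new Configuration());
        BufferedReader in =
                new BufferedReader(new InputStreamReader(fs.open(new Path(fileName))));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                values.add(line.trim());
            }
        } finally {
            in.close();
        }
    }

    @Override
    public Boolean exec(Tuple input) throws IOException {
        if (values == null) {
            loadSet();
        }
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return false;
        }
        return values.contains(input.get(0).toString());
    }
}
{code}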

 UDFs should have API for transparently opening and reading files from HDFS or 
 from local file system with only relative path
 

 Key: PIG-756
 URL: https://issues.apache.org/jira/browse/PIG-756
 Project: Pig
  Issue Type: Bug
Reporter: David Ciemiewicz

 I have a utility function util.INSETFROMFILE() to which I pass a file name 
 during initialization.
 {code}
 define inQuerySet util.INSETFROMFILE('analysis/queries');
 A = load 'logs' using PigStorage() as ( date int, query chararray );
 B = filter A by inQuerySet(query);
 {code}
 This provides a computationally inexpensive way to effect map-side joins for 
 small sets; functions of this style also provide the ability to encapsulate 
 more complex matching rules.
 For rapid development and debugging purposes, I want this code to run without 
 modification both on my local file system, when I do pig -exectype local, and 
 on HDFS.
 Pig needs to provide an API for UDFs which allows them to either:
 1) know when they are in local or HDFS mode and let them open and read 
 from files as appropriate, or
 2) just provide a file name and read statements and have Pig transparently 
 manage local or HDFS opens and reads for the UDF.
 UDFs need to read configuration information off the file system, and it 
 simplifies the process if one can just flip the switch of -exectype local.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-756) UDFs should have API for transparently opening and reading files from HDFS or from local file system with only relative path

2010-04-07 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat resolved PIG-756.


   Resolution: Fixed
Fix Version/s: 0.7.0

https://issues.apache.org/jira/browse/PIG-1053 fixes this issue.

 UDFs should have API for transparently opening and reading files from HDFS or 
 from local file system with only relative path
 

 Key: PIG-756
 URL: https://issues.apache.org/jira/browse/PIG-756
 Project: Pig
  Issue Type: Bug
Reporter: David Ciemiewicz
 Fix For: 0.7.0


 I have a utility function util.INSETFROMFILE() to which I pass a file name 
 during initialization.
 {code}
 define inQuerySet util.INSETFROMFILE('analysis/queries');
 A = load 'logs' using PigStorage() as ( date int, query chararray );
 B = filter A by inQuerySet(query);
 {code}
 This provides a computationally inexpensive way to effect map-side joins for 
 small sets; functions of this style also provide the ability to encapsulate 
 more complex matching rules.
 For rapid development and debugging purposes, I want this code to run without 
 modification both on my local file system, when I do pig -exectype local, and 
 on HDFS.
 Pig needs to provide an API for UDFs which allows them to either:
 1) know when they are in local or HDFS mode and let them open and read 
 from files as appropriate, or
 2) just provide a file name and read statements and have Pig transparently 
 manage local or HDFS opens and reads for the UDF.
 UDFs need to read configuration information off the file system, and it 
 simplifies the process if one can just flip the switch of -exectype local.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1357) [zebra] Test cases of map-side GROUP-BY should be added.

2010-04-07 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1357:
--

Status: Patch Available  (was: Open)

 [zebra] Test cases of map-side GROUP-BY should be added.
 

 Key: PIG-1357
 URL: https://issues.apache.org/jira/browse/PIG-1357
 Project: Pig
  Issue Type: Test
Affects Versions: 0.7.0
Reporter: Yan Zhou
Priority: Minor
 Fix For: 0.7.0

 Attachments: PIG-1357.patch


 This feature requires globally sorted input splits to work properly. Prior to 
 0.7, all sorted input splits were globally sorted at the LOAD call on a sorted 
 table. But with the support for locally sorted input splits (PIG-1306 and 
 PIG-1315), globally sorted input splits need to be asked for by Pig 
 explicitly. This creates separate call paths for all Pig features that 
 require map-side-only ops. Currently there are two Pig features that require 
 globally sorted input splits from Zebra: map-side COGROUP and map-side 
 GROUP-BY. PIG-1315 will contain test cases for the former, while this JIRA 
 will cover the latter.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1366) PigStorage's pushProjection implementation results in NPE under certain data conditions

2010-04-07 Thread Pradeep Kamath (JIRA)
PigStorage's pushProjection implementation results in NPE under certain data 
conditions
---

 Key: PIG-1366
 URL: https://issues.apache.org/jira/browse/PIG-1366
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0, 0.7.0
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.7.0


Under the following conditions, a NullPointerException occurs when PigStorage 
is used:
If the script uses only (say) the 2nd and 3rd columns of the data, the 
PruneColumns optimization passes this information to PigStorage through the 
pushProjection() method. If the data then contains a row with only one column 
(malformed data due to missing columns in certain rows), PigStorage returns a 
Tuple backed by a null ArrayList. Subsequent projection operations on this 
tuple result in the NPE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1366) PigStorage's pushProjection implementation results in NPE under certain data conditions

2010-04-07 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-1366:


Attachment: PIG-1366.patch

Currently in PigStorage, the ArrayList backing the Tuple returned from getNext() 
is created in readField(). Under the data conditions explained in the 
description, readField() never gets called and the ArrayList (mProtoTuple) 
remains null, causing the eventual NPE. The patch fixes the issue by 
initializing mProtoTuple to a new ArrayList at the beginning of getNext().
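In code, the shape of the change is roughly the following (a simplified sketch, 
not the exact patch; the line parsing is elided and the class name is 
illustrative):

{code}
import java.io.IOException;
import java.util.ArrayList;

import org.apache.pig.LoadFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Simplified sketch of the fix: mProtoTuple is created up front in getNext()
// instead of lazily in readField(), so a row with fewer columns than the
// pushed projection still yields a valid (possibly short) Tuple.
public abstract class PigStorageSketch extends LoadFunc {
    private ArrayList<Object> mProtoTuple = null;
    private final TupleFactory mTupleFactory = TupleFactory.getInstance();

    @Override
    public Tuple getNext() throws IOException {
        mProtoTuple = new ArrayList<Object>();   // previously only created in readField()
        // ... read the next line and append its fields via readField() ...
        Tuple t = mTupleFactory.newTupleNoCopy(mProtoTuple);
        mProtoTuple = null;
        return t;
    }
}
{code}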

 PigStorage's pushProjection implementation results in NPE under certain data 
 conditions
 ---

 Key: PIG-1366
 URL: https://issues.apache.org/jira/browse/PIG-1366
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0, 0.7.0
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.7.0

 Attachments: PIG-1366.patch


 Under the following conditions, a NullPointerException occurs when PigStorage 
 is used:
 If the script uses only (say) the 2nd and 3rd columns of the data, the 
 PruneColumns optimization passes this information to PigStorage through the 
 pushProjection() method. If the data then contains a row with only one column 
 (malformed data due to missing columns in certain rows), PigStorage returns a 
 Tuple backed by a null ArrayList. Subsequent projection operations on this 
 tuple result in the NPE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1366) PigStorage's pushProjection implementation results in NPE under certain data conditions

2010-04-07 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-1366:


Status: Patch Available  (was: Open)

 PigStorage's pushProjection implementation results in NPE under certain data 
 conditions
 ---

 Key: PIG-1366
 URL: https://issues.apache.org/jira/browse/PIG-1366
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0, 0.7.0
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.7.0

 Attachments: PIG-1366.patch


 Under the following conditions, a NullPointerException occurs when PigStorage 
 is used:
 If the script uses only (say) the 2nd and 3rd columns of the data, the 
 PruneColumns optimization passes this information to PigStorage through the 
 pushProjection() method. If the data then contains a row with only one column 
 (malformed data due to missing columns in certain rows), PigStorage returns a 
 Tuple backed by a null ArrayList. Subsequent projection operations on this 
 tuple result in the NPE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1365) WrappedIOException is missing from Pig.jar

2010-04-07 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-1365:


Attachment: PIG-1365.patch

The attached patch restores WrappedIOException - it is not used in Pig code and 
is only provided for use by UDFs to maintain backward compatibility. I have 
marked the class as deprecated so that it can be removed from the Pig code base 
in a later release.

No unit tests have been added since this is just restoring an old class which 
is no longer used in the Pig code itself.
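Roughly, the restored class looks like this (a sketch assuming the usual static 
wrap() helpers that UDF code calls; not the verbatim patch):

{code}
import java.io.IOException;

// Sketch of the restored, now-deprecated helper: wraps any Throwable in an
// IOException while preserving the original cause and stack trace, so older
// UDFs written against this API keep compiling.
@Deprecated
public class WrappedIOException {

    public static IOException wrap(final Throwable e) {
        return wrap(e.getMessage(), e);
    }

    public static IOException wrap(final String message, final Throwable e) {
        final IOException wrapped = new IOException(message);
        wrapped.initCause(e);
        wrapped.setStackTrace(e.getStackTrace());
        return wrapped;
    }
}
{code}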

 WrappedIOException is missing from Pig.jar
 --

 Key: PIG-1365
 URL: https://issues.apache.org/jira/browse/PIG-1365
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Pradeep Kamath
Priority: Critical
 Fix For: 0.7.0

 Attachments: PIG-1365.patch


 We need to put it back since UDFs rely on it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1365) WrappedIOException is missing from Pig.jar

2010-04-07 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-1365:


Status: Patch Available  (was: Open)

 WrappedIOException is missing from Pig.jar
 --

 Key: PIG-1365
 URL: https://issues.apache.org/jira/browse/PIG-1365
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Pradeep Kamath
Priority: Critical
 Fix For: 0.7.0

 Attachments: PIG-1365.patch


 We need to put it back since UDFs rely on it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.