[jira] [Commented] (PIG-3345) Handle null in DateTime functions

2013-06-03 Thread Prashant Kommireddi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13674070#comment-13674070
 ] 

Prashant Kommireddi commented on PIG-3345:
--

LGTM +1

Thanks Rohini!

> Handle null in DateTime functions
> ---------------------------------
>
>                 Key: PIG-3345
>                 URL: https://issues.apache.org/jira/browse/PIG-3345
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.11.1
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.12
>
>         Attachments: PIG-3345-1.patch, PIG-3345-2.patch
>
>
> NPE is thrown in date time functions when a null value is passed.
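
A minimal Java sketch of the kind of null guard this issue asks for. It does not reproduce the attached patches: the real Pig UDFs operate on Pig tuples, whereas this example uses plain java.time types, and the class and method names (NullSafeDateTime, getYear) are hypothetical.

```java
import java.time.OffsetDateTime;

public class NullSafeDateTime {
    /**
     * Returns the year of an ISO-8601 datetime string with an offset,
     * or null when the input itself is null.
     */
    public static Integer getYear(String isoDateTime) {
        if (isoDateTime == null) {
            // Propagate null instead of letting a NullPointerException escape.
            return null;
        }
        return OffsetDateTime.parse(isoDateTime).getYear();
    }

    public static void main(String[] args) {
        System.out.println(getYear("2013-06-03T17:15:09Z"));
        System.out.println(getYear(null));
    }
}
```

The guard sits at the top of the function so that a null input never reaches the parsing code.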

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Comment Edited] (PIG-3345) Handle null in DateTime functions

2013-06-03 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13674025#comment-13674025
 ] 

Rohini Palaniswamy edited comment on PIG-3345 at 6/4/13 4:25 AM:
-

Thanks Prashant. Added for all the UDFs in testConversionBetweenDTAndString

  was (Author: rohini):
Thanks Prashant. Added for all the methods in 
testConversionBetweenDTAndString
  


[jira] [Updated] (PIG-3345) Handle null in DateTime functions

2013-06-03 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-3345:


Attachment: PIG-3345-2.patch

Thanks Prashant. Added for all the methods in testConversionBetweenDTAndString



[jira] [Updated] (PIG-3346) New property that controls the number of combined splits

2013-06-03 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3346:
---

Status: Patch Available  (was: Open)

> New property that controls the number of combined splits
> --------------------------------------------------------
>
>                 Key: PIG-3346
>                 URL: https://issues.apache.org/jira/browse/PIG-3346
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Cheolsoo Park
>            Assignee: Cheolsoo Park
>             Fix For: 0.12
>
>         Attachments: PIG-3346.patch
>
>
> Currently, the size of combined splits can be configured by the 
> {{pig.maxCombinedSplitSize}} property.
> Although this works fine most of the time, it can lead to an undesired 
> situation where a single mapper ends up loading a lot of combined splits. 
> This is particularly bad if Pig uploads them from S3.
> So it would be useful if the max number of combined splits could be 
> configured via a property such as {{pig.maxCombinedSplitNum}}.



[jira] [Updated] (PIG-3346) New property that controls the number of combined splits

2013-06-03 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3346:
---

Attachment: PIG-3346.patch

The attached patch includes the following changes:
* Adds a new property {{pig.maxCombinedSplitNum}}. By default, it is set to Long.MAX_VALUE.
* Updates the logic of {{MapRedUtil.getCombinePigSplits()}} to take the number of combined splits into account.
* Adds a new test case to {{TestSplitCombine}}.
* Updates the documentation for the new property.
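
The idea behind the new limit can be sketched in plain Java as follows. This is an illustration only, not the logic of {{MapRedUtil.getCombinePigSplits()}}: it assumes the property caps how many input splits are packed into one combined split, it models each split by its size alone, and the names SplitCombiner and combine are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitCombiner {
    /**
     * Greedily packs split sizes into groups. A group is closed as soon as
     * adding the next split would exceed maxSize bytes or maxNum splits.
     * A single split larger than maxSize still gets a group of its own.
     */
    public static List<List<Long>> combine(List<Long> splitSizes, long maxSize, long maxNum) {
        List<List<Long>> combined = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long size : splitSizes) {
            boolean overSize = currentSize + size > maxSize;
            boolean overNum = current.size() >= maxNum;
            if (!current.isEmpty() && (overSize || overNum)) {
                combined.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(size);
            currentSize += size;
        }
        if (!current.isEmpty()) {
            combined.add(current);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Long> sizes = Arrays.asList(10L, 10L, 10L, 10L, 10L);
        // With no count limit, the size limit alone decides the grouping.
        System.out.println(combine(sizes, 100, Long.MAX_VALUE));
        // With a count limit of 2, no group holds more than two splits.
        System.out.println(combine(sizes, 100, 2));
    }
}
```

With the count limit left at Long.MAX_VALUE, the grouping is decided by the size limit alone, which matches the default described above.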

Tests run:
* ant test-commit
* ant test -Dtestcase=TestSplitCombine

Thanks!



[jira] [Commented] (PIG-3329) RANK operator failed when working with SPLIT

2013-06-03 Thread Johnny Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673936#comment-13673936
 ] 

Johnny Zhang commented on PIG-3329:
---

[~xalan], are you working on this right now? I hit a similar exception while 
working on another patch, so it would be very helpful to understand how you 
plan to resolve this issue. Thanks a lot!

> RANK operator failed when working with SPLIT
> --------------------------------------------
>
>                 Key: PIG-3329
>                 URL: https://issues.apache.org/jira/browse/PIG-3329
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.11.1
>            Reporter: Redis Liu
>            Assignee: Allan Avendaño
>            Priority: Critical
>
> input.txt:
> 1 2 3
> 4 5 6
> 7 8 9
> script:
> a = load 'input.txt' using PigStorage(' ') as (a:int, b:int, c:int);
> SPLIT a into b if a > 0, c if a > 5;
> d = RANK b;
> dump d;
> job will fail with error message:
> java.lang.RuntimeException: Unable to read counter pig.counters.counter_4929375455335572575_-1
>   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PORank.addRank(PORank.java:161)
>   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PORank.getNext(PORank.java:134)
>   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:308)
>   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit.getNext(POSplit.java:214)
>   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
>   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
>   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:157)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:673)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:275)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:396)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1340)
>   at org.apache.hadoop.mapred.Child.main(Child.java:269)



[jira] [Commented] (PIG-3342) Allow conditions in case statement

2013-06-03 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673933#comment-13673933
 ] 

Cheolsoo Park commented on PIG-3342:


Thanks Rohini for taking a look.

Here is the RB request:
https://reviews.apache.org/r/11613/

> Allow conditions in case statement
> ----------------------------------
>
>                 Key: PIG-3342
>                 URL: https://issues.apache.org/jira/browse/PIG-3342
>             Project: Pig
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Cheolsoo Park
>            Assignee: Cheolsoo Park
>             Fix For: 0.12
>
>         Attachments: PIG-3342.patch
>
>
> PIG-3268 added CASE statement support, but conditions are currently not 
> allowed in WHEN branches. For example,
> {code}
> CASE
>   WHEN i % 5 == 0 THEN '5n'
>   WHEN i % 5 == 1 THEN '5n+1'
>   WHEN i % 5 == 2 THEN '5n+2'
>   WHEN i % 5 == 3 THEN '5n+3'
>   ELSE '5n+4'
> END
> {code}
> This is currently invalid, but it would be useful to allow it.



Review Request: PIG-3342 Allow conditions in case statement

2013-06-03 Thread Cheolsoo Park

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/11613/
---

Review request for pig.


Description
---

Allows conditional expressions in the WHEN branches of a case statement.


This addresses bug PIG-3342.
https://issues.apache.org/jira/browse/PIG-3342


Diffs
-

  src/org/apache/pig/parser/AstPrinter.g c2abede 
  src/org/apache/pig/parser/AstValidator.g 2c6d4dc 
  src/org/apache/pig/parser/LogicalPlanGenerator.g 9375d60 
  src/org/apache/pig/parser/QueryParser.g 2b84c86 
  test/org/apache/pig/test/TestCase.java dbee495 

Diff: https://reviews.apache.org/r/11613/diff/


Testing
---

All unit tests pass.


Thanks,

Cheolsoo Park



[jira] [Resolved] (PIG-3322) AVRO: AvroStorage give NPE on reading file with union as top level schema

2013-06-03 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy resolved PIG-3322.
-

Resolution: Fixed

Committed to trunk (0.12). Thanks Viraj and Cheolsoo.

> AVRO: AvroStorage give NPE on reading file with union as top level schema
> -------------------------------------------------------------------------
>
>                 Key: PIG-3322
>                 URL: https://issues.apache.org/jira/browse/PIG-3322
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.11.2
>            Reporter: Egil Sorensen
>            Assignee: Viraj Bhat
>              Labels: patch
>             Fix For: 0.12
>
>         Attachments: PIG-3322_3.patch, test_loadavrowithnulls.avro
>
>
> I am getting an NPE when using AvroStorage to load a file that has a schema 
> like:
> {code}
> ["null",{"type":"record","name":"TUPLE_0","fields":[{"name":"name","type":["null","string"],"doc":"autogenerated from Pig Field Schema"},{"name":"age","type":["null","int"],"doc":"autogenerated from Pig Field Schema"},{"name":"gpa","type":["null","double"],"doc":"autogenerated from Pig Field Schema"}]}]
> {code}
> E.g. see the e2e style test, which fails on this:
> {code}
> {
>   'num' => 4,
>   # storing file with Pig type tuple relying on conversion to record
>   # loading using stored schemas
>   'notmq' => 1,
>   'pig' => q\
> a = load ':INPATH:/singlefile/studentcomplextab10k' using PigStorage() as (m:[], t:(name:chararray, age:int, gpa:double), b:{t:(name:chararray, age:int, gpa:double)});
> b = foreach a generate t;
> describe b;
> store b into ':OUTPATH:.intermediate' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
> exec;
> -- Read back what was stored with Avro
> u = load ':OUTPATH:.intermediate' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
> describe u;
> store u into ':OUTPATH:';
> \,
>   'verify_pig_script' => q\
> a = load ':INPATH:/singlefile/studentcomplextab10k' using PigStorage() as (m:[], t:(name:chararray, age:int, gpa:double), b:{t:(name:chararray, age:int, gpa:double)});
> b = foreach a generate t;
> describe b;
> store b into ':OUTPATH:';
> \,
> },
> {code}



[jira] Subscription: PIG patch available

2013-06-03 Thread jira

Issue Subscription
Filter: PIG patch available (19 issues)

Subscriber: pigdaily

Key         Summary
PIG-3345    Handle null in DateTime functions
            https://issues.apache.org/jira/browse/PIG-3345
PIG-3342    Allow conditions in case statement
            https://issues.apache.org/jira/browse/PIG-3342
PIG-        Fix remaining Windows core unit test failures
            https://issues.apache.org/jira/browse/PIG-
PIG-3318    AVRO: 'default value' not honored when merging schemas on load with AvroStorage
            https://issues.apache.org/jira/browse/PIG-3318
PIG-3295    Casting from bytearray failing after Union (even when each field is from a single Loader)
            https://issues.apache.org/jira/browse/PIG-3295
PIG-3288    Kill jobs if the number of output files is over a configurable limit
            https://issues.apache.org/jira/browse/PIG-3288
PIG-3280    Document IN operator and CASE expression
            https://issues.apache.org/jira/browse/PIG-3280
PIG-3257    Add unique identifier UDF
            https://issues.apache.org/jira/browse/PIG-3257
PIG-3247    Piggybank functions to mimic OVER clause in SQL
            https://issues.apache.org/jira/browse/PIG-3247
PIG-3210    Pig fails to start when it cannot write log to log files
            https://issues.apache.org/jira/browse/PIG-3210
PIG-3199    Expose LogicalPlan via PigServer API
            https://issues.apache.org/jira/browse/PIG-3199
PIG-3166    Update eclipse .classpath according to ivy library.properties
            https://issues.apache.org/jira/browse/PIG-3166
PIG-3123    Simplify Logical Plans By Removing Unneccessary Identity Projections
            https://issues.apache.org/jira/browse/PIG-3123
PIG-3088    Add a builtin udf which removes prefixes
            https://issues.apache.org/jira/browse/PIG-3088
PIG-3015    Rewrite of AvroStorage
            https://issues.apache.org/jira/browse/PIG-3015
PIG-2828    Handle nulls in DataType.compare
            https://issues.apache.org/jira/browse/PIG-2828
PIG-2248    Pig parser does not detect when a macro name masks a UDF name
            https://issues.apache.org/jira/browse/PIG-2248
PIG-2244    Macros cannot be passed relation names
            https://issues.apache.org/jira/browse/PIG-2244
PIG-1914    Support load/store JSON data in Pig
            https://issues.apache.org/jira/browse/PIG-1914

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=13225&filterId=12322384


[jira] [Created] (PIG-3346) New property that controls the number of combined splits

2013-06-03 Thread Cheolsoo Park (JIRA)

Cheolsoo Park created PIG-3346:
-------------------------------

             Summary: New property that controls the number of combined splits
                 Key: PIG-3346
                 URL: https://issues.apache.org/jira/browse/PIG-3346
             Project: Pig
          Issue Type: Improvement
          Components: impl
            Reporter: Cheolsoo Park
            Assignee: Cheolsoo Park
             Fix For: 0.12


Currently, the size of combined splits can be configured by the 
{{pig.maxCombinedSplitSize}} property.

Although this works fine most of the time, it can lead to an undesired 
situation where a single mapper ends up loading a lot of combined splits. 
This is particularly bad if Pig uploads them from S3.

So it would be useful if the max number of combined splits could be 
configured via a property such as {{pig.maxCombinedSplitNum}}.



[jira] [Commented] (PIG-3341) Improving performance of loading datetime values

2013-06-03 Thread pat chan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673876#comment-13673876
 ] 

pat chan commented on PIG-3341:
---

I looked through the docs for anything on this topic and found the following 
in http://wiki.apache.org/pig/UDFManual:


The first thing to decide is what to do with invalid data. This depends on the 
format of the data. If the data is of type bytearray it means that it has not 
yet been converted to its proper type. In this case, if the format of the data 
does not match the expected type, a null value should be returned. If, on the 
other hand, the input data is of another type, this means that the conversion 
has already happened and the data should be in the correct format. This is the 
case with our example and that's why it throws an error (line 16.) Note that 
WrappedIOException is a helper class to convert the actual exception to an 
IOException.

Also, note that lines 10-11 check if the input data is null or empty and if so 
returns null.


If I'm reading this correctly, it says that if the type of the input doesn't 
match the signature of the UDF, a null should be returned. However, I get this:

  grunt> A = load 'o' as (a:bytearray);
  grunt> B = foreach A generate ToDate(a); dump B;
  2013-06-03 17:15:09,253 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1046: Multiple matching functions for org.apache.pig.builtin.ToDate with input schema: ({long}, {chararray}). Please use an explicit cast.

It also seems to be saying that if the types are right and the format is 
invalid, an error should be thrown. I just checked and yes, I get an error. 
However, this doesn't match Rohini's proposal to return a null instead. Also, 
as Dmitriy hinted, it's not philosophically consistent with loading behavior 
where invalid things turn into nulls.

  2013-06-03 17:25:12,977 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
  2013-06-03 17:25:12,981 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias B


BTW, the note about lines 10-11 isn't quite right. The code in the example 
doesn't have a check for null and so a null would cause an exception.


> Improving performance of loading datetime values
> ------------------------------------------------
>
>                 Key: PIG-3341
>                 URL: https://issues.apache.org/jira/browse/PIG-3341
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.11.1
>            Reporter: pat chan
>            Priority: Minor
>             Fix For: 0.12, 0.11.2
>
>
> The performance of loading datetime values can be improved by about 25% by 
> moving a single line in ToDate.java:
>   public static DateTimeZone extractDateTimeZone(String dtStr) {
>     Pattern pattern = Pattern.compile("(Z|(?<=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$");
> should become:
>   static Pattern pattern = Pattern.compile("(Z|(?<=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$");
>   public static DateTimeZone extractDateTimeZone(String dtStr) {
> There is no need to recompile the regular expression for every value. I'm not 
> sure whether this function is ever called concurrently, but Pattern objects 
> are thread-safe anyway.
> As a test, I created a file of 10M timestamps:
>   for i in 0..1000
>     puts '2000-01-01T00:00:00+23'
>   end
> I then ran this script:
>   grunt> A = load 'data' as (a:datetime); B = filter A by a is null; dump B;
> Before the change it took 160s.
> After the change, the script took 120s.
> 
> Another performance improvement can be made for invalid datetime values. If a 
> datetime value is invalid, an exception is created and thrown, which is a 
> costly way to fail a validity check. To test the performance impact, I 
> created 10M invalid datetime values:
>   for i in 0..1000
>     puts '2000-99-01T00:00:00+23'
>   end
> In this test, the regex pattern was always recompiled. I then ran this script:
>   grunt> A = load 'data' as (a:datetime); B = filter A by a is not null; dump B;
> The script took 190s.
> I understand this could be considered an edge case and might not be worth 
> changing. However, if there are use cases where invalid dates are part of 
> normal processing, then you might consider fixing this.
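
The regex change proposed in the issue can be sketched as a standalone Java class. This is a sketch under stated assumptions, not the code in ToDate.java: the original method returns a Joda DateTimeZone, whereas this version returns the matched suffix as a String to stay dependency-free, and the names TimeZoneSuffix, TZ_PATTERN and extractZoneSuffix are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TimeZoneSuffix {
    // Compiled once at class-load time. Pattern instances are immutable
    // and safe to share across threads.
    private static final Pattern TZ_PATTERN =
        Pattern.compile("(Z|(?<=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$");

    /** Returns the trailing time-zone designator of dtStr, or null if there is none. */
    public static String extractZoneSuffix(String dtStr) {
        if (dtStr == null) {
            return null;
        }
        // A Matcher is not thread-safe, so a fresh one is created per call.
        Matcher matcher = TZ_PATTERN.matcher(dtStr);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(extractZoneSuffix("2000-01-01T00:00:00+23"));
        System.out.println(extractZoneSuffix("2000-01-01T00:00:00Z"));
        System.out.println(extractZoneSuffix("2000-01-01"));
    }
}
```

Only the Pattern is shared; each call still allocates its own Matcher, which is cheap compared with recompiling the expression for every value.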



Re: Review Request: PIG-3322 Fix the issue where NPE is thrown when reading a union which has nulls and add a testcase

2013-06-03 Thread Rohini Palaniswamy

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/11333/#review21383
---

Ship it!


Ship It!

- Rohini Palaniswamy


On June 4, 2013, 12:15 a.m., Viraj Bhat wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/11333/
> ---
> 
> (Updated June 4, 2013, 12:15 a.m.)
> 
> 
> Review request for pig and Rohini Palaniswamy.
> 
> 
> Description
> ---
> 
> Null pointer exception when loading a union with null in its schema. A 
> sample test case was also added.
> 
> 
> Diffs
> -
> 
>   http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorage.java 1485358 
>   http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/PigAvroRecordReader.java 1485358 
>   http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java 1485358 
> 
> Diff: https://reviews.apache.org/r/11333/diff/
> 
> 
> Testing
> ---
> 
> Yes, all tests pass in piggybank.
> 
> 
> Thanks,
> 
> Viraj Bhat
> 
>



[jira] [Updated] (PIG-3322) AVRO: AvroStorage give NPE on reading file with union as top level schema

2013-06-03 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-3322:


Attachment: (was: test_loadavrowithnulls.avro)



[jira] [Updated] (PIG-3322) AVRO: AvroStorage give NPE on reading file with union as top level schema

2013-06-03 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-3322:


Attachment: PIG-3322_3.patch



[jira] [Updated] (PIG-3322) AVRO: AvroStorage give NPE on reading file with union as top level schema

2013-06-03 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-3322:


Attachment: test_loadavrowithnulls.avro



[jira] [Updated] (PIG-3322) AVRO: AvroStorage give NPE on reading file with union as top level schema

2013-06-03 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-3322:


Attachment: (was: expected_testLoadAvrowithNulls.txt)



[jira] [Updated] (PIG-3322) AVRO: AvroStorage give NPE on reading file with union as top level schema

2013-06-03 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-3322:


Attachment: (was: PIG-3322_2.patch)

> AVRO: AvroStorage give NPE on reading file with union as top level schema
> -
>
> Key: PIG-3322
> URL: https://issues.apache.org/jira/browse/PIG-3322
> Project: Pig
>  Issue Type: Bug
>  Components: piggybank
>Affects Versions: 0.11.2
>Reporter: Egil Sorensen
>Assignee: Viraj Bhat
>  Labels: patch
> Fix For: 0.12
>
> Attachments: PIG-3322_3.patch, test_loadavrowithnulls.avro
>
>


Re: Review Request: PIG-3322 Fix the issue where NPE is thrown when reading a union which has nulls and add a testcase

2013-06-03 Thread Viraj Bhat


> On June 2, 2013, 9:27 p.m., Cheolsoo Park wrote:
> > http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java,
> >  line 1104
> > 
> >
> > If you use mock.Storage here instead of PigStorage, you won't need the 
> > verifyTextResults method and extra output file. Can you please update your 
> > test?
> > 
> > Please see org.apache.pig.builtin.mock.Storage.java.

Added Mock Storage


- Viraj


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/11333/#review21305
---


On June 4, 2013, 12:15 a.m., Viraj Bhat wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/11333/
> ---
> 
> (Updated June 4, 2013, 12:15 a.m.)
> 
> 
> Review request for pig and Rohini Palaniswamy.
> 
> 
> Description
> ---
> 
> A null pointer exception is thrown when loading a union with null in its 
> schema. A sample test case was also added to the tests.
> 
> 
> Diffs
> -
> 
>   
> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorage.java
>  1485358 
>   
> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/PigAvroRecordReader.java
>  1485358 
>   
> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java
>  1485358 
> 
> Diff: https://reviews.apache.org/r/11333/diff/
> 
> 
> Testing
> ---
> 
> Yes all tests pass in the piggybank
> 
> 
> Thanks,
> 
> Viraj Bhat
> 
>



Re: Review Request: PIG-3322 Fix the issue where NPE is thrown when reading a union which has nulls and add a testcase

2013-06-03 Thread Viraj Bhat


> On June 3, 2013, 1:03 p.m., Rohini Palaniswamy wrote:
> > Just minor comments in the naming of the variable. Java variable names 
> > should be camel case.

Thanks, but the verifyTxtResults method is no longer used.


- Viraj


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/11333/#review21312
---


On June 4, 2013, 12:15 a.m., Viraj Bhat wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/11333/
> ---
> 
> (Updated June 4, 2013, 12:15 a.m.)
> 
> 
> Review request for pig and Rohini Palaniswamy.
> 
> 
> Description
> ---
> 
> A null pointer exception is thrown when loading a union with null in its 
> schema. A sample test case was also added to the tests.
> 
> 
> Diffs
> -
> 
>   
> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorage.java
>  1485358 
>   
> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/PigAvroRecordReader.java
>  1485358 
>   
> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java
>  1485358 
> 
> Diff: https://reviews.apache.org/r/11333/diff/
> 
> 
> Testing
> ---
> 
> Yes all tests pass in the piggybank
> 
> 
> Thanks,
> 
> Viraj Bhat
> 
>



Re: Review Request: PIG-3322 Fix the issue where NPE is thrown when reading a union which has nulls and add a testcase

2013-06-03 Thread Viraj Bhat

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/11333/
---

(Updated June 4, 2013, 12:15 a.m.)


Review request for pig and Rohini Palaniswamy.


Changes
---

Using mock.Storage instead of PigStorage and comparing the results inline for 4 
records.


Description
---

A null pointer exception is thrown when loading a union with null in its 
schema. A sample test case was also added to the tests.


Diffs (updated)
-

  
http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorage.java
 1485358 
  
http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/PigAvroRecordReader.java
 1485358 
  
http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java
 1485358 

Diff: https://reviews.apache.org/r/11333/diff/


Testing
---

Yes all tests pass in the piggybank


Thanks,

Viraj Bhat



[jira] [Commented] (PIG-3341) Improving performance of loading datetime values

2013-06-03 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673780#comment-13673780
 ] 

Dmitriy V. Ryaboy commented on PIG-3341:


I don't think we are completely consistent, but turning invalid into null has 
been pretty standard.

My personal preference is also to increment a counter for # of such 
conversions, and to log the first N occurrences (when N errors are encountered, 
log something to the effect of "not logging this error any more because there's 
so much of it.")
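
A minimal sketch of the counter-plus-capped-logging idea above, in plain Java. 
The class name CappedConversionLogger, its methods, and the message wording are 
hypothetical and not existing Pig code; a real patch would presumably go 
through Pig's own counter and warning facilities instead of a Consumer.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

// Hypothetical helper for illustration only; not part of Pig.
final class CappedConversionLogger {
    private final long maxLogged;         // N: how many failures are logged individually
    private final Consumer<String> sink;  // assumed destination for log lines
    private final AtomicLong failures = new AtomicLong();

    CappedConversionLogger(long maxLogged, Consumer<String> sink) {
        this.maxLogged = maxLogged;
        this.sink = sink;
    }

    // Count every failed conversion, but log only the first maxLogged of them.
    void recordFailure(String rawValue) {
        long n = failures.incrementAndGet();
        if (n <= maxLogged) {
            sink.accept("Invalid value converted to null: " + rawValue);
        }
        if (n == maxLogged) {
            // Emitted once, right after the N-th individual message.
            sink.accept("Not logging this error any more after " + maxLogged + " occurrences.");
        }
    }

    // Total number of failed conversions, including the ones that were not logged.
    long failureCount() {
        return failures.get();
    }
}
```

The counter keeps growing after logging stops, so the total number of silent 
null conversions is still visible at the end of the job.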

> Improving performance of loading datetime values
> 
>
> Key: PIG-3341
> URL: https://issues.apache.org/jira/browse/PIG-3341
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.11.1
>Reporter: pat chan
>Priority: Minor
> Fix For: 0.12, 0.11.2
>
>
> The performance of loading datetime values can be improved by about 25% by 
> moving a single line in ToDate.java:
> public static DateTimeZone extractDateTimeZone(String dtStr) {
>   Pattern pattern = 
> Pattern.compile("(Z|(?<=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$");;
> should become:
> static Pattern pattern = 
> Pattern.compile("(Z|(?<=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$");
> public static DateTimeZone extractDateTimeZone(String dtStr) {
> There is no need to recompile the regular expression for every value. I'm not 
> sure if this function is ever called concurrently, but Pattern objects are 
> thread-safe anyways.
> As a test, I created a file of 10M timestamps:
>   for i in 0..1000
> puts '2000-01-01T00:00:00+23'
>   end
> I then ran this script:
>   grunt> A = load 'data' as (a:datetime); B = filter A by a is null; dump B;
> Before the change it took 160s.
> After the change, the script took 120s.
> 
> Another performance improvement can be made for invalid datetime values. If a 
> datetime value is invalid, an exception is created and thrown, which is a 
> costly way to fail a validity check. To test the performance impact, I 
> created 10M invalid datetime values:
>   for i in 0..1000
> puts '2000-99-01T00:00:00+23'
>   end
> In this test, the regex pattern was always recompiled. I then ran this script:
>   grunt> A = load 'data' as (a:datetime); B = filter A by a is not null; dump 
> B;
> The script took 190s.
> I understand this could be considered an edge case and might not be worth 
> changing. However, if there are use cases where invalid dates are part of 
> normal processing, then you might consider fixing this.
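
The change proposed in the description could look roughly like the sketch 
below. The class TimeZoneSuffix and the method extract are hypothetical names, 
and the method returns the matched suffix as a String rather than a 
DateTimeZone so that the sketch needs no external library. The regular 
expression is the one quoted in the description.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the relevant part of ToDate.java.
final class TimeZoneSuffix {
    // Compiled once when the class is loaded instead of once per value.
    // Pattern instances are immutable and safe to share between threads.
    private static final Pattern TZ_PATTERN =
        Pattern.compile("(Z|(?<=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$");

    // Returns the trailing zone designator, or null when the string has none.
    static String extract(String dtStr) {
        // A Matcher is cheap to create and is not shared between calls.
        Matcher matcher = TZ_PATTERN.matcher(dtStr);
        return matcher.find() ? matcher.group(1) : null;
    }
}
```

Only the Pattern is shared here. Each call creates its own Matcher, which is 
the part that is not thread-safe.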



[jira] [Commented] (PIG-3345) Handle null in DateTime functions

2013-06-03 Thread Prashant Kommireddi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673761#comment-13673761
 ] 

Prashant Kommireddi commented on PIG-3345:
--

Hi [~rohini], patch looks good. Would you like to add tests for ToDate* 
functions too (under testConversionBetweenDateTimeAndString())? 

> Handle null in DateTime functions
> -
>
> Key: PIG-3345
> URL: https://issues.apache.org/jira/browse/PIG-3345
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.11.1
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.12
>
> Attachments: PIG-3345-1.patch
>
>
>  NPE is thrown in date time functions when a null value is passed. 
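
As an illustration of the kind of guard the issue asks for, here is a minimal 
sketch, assuming the desired behavior is to return null for a null input. 
NullTolerantYear is a hypothetical name, and java.time.OffsetDateTime is used 
only as a self-contained stand-in for the datetime type Pig uses; this is not 
the code from the attached patch.

```java
import java.time.OffsetDateTime;

// Hypothetical example; not taken from PIG-3345-1.patch.
final class NullTolerantYear {
    // Returns the year of the given datetime, or null when the input is null,
    // instead of letting the dereference throw a NullPointerException.
    static Integer getYear(OffsetDateTime dt) {
        if (dt == null) {
            return null;
        }
        return dt.getYear();
    }
}
```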



[jira] [Commented] (PIG-3341) Improving performance of loading datetime values

2013-06-03 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673760#comment-13673760
 ] 

Rohini Palaniswamy commented on PIG-3341:
-

The current behavior returns null if there is an invalid value while loading as 
datetime. As far as I have seen, Pig does not fail loading when there are 
invalid values. But UDFs do fail. 

Asking the old timers..

[~alangates]/[~daijy]/[~dvryaboy]/[~julienledem]/[~thejas],
   How should we handle the invalid dates?
   

> Improving performance of loading datetime values
> 
>
> Key: PIG-3341
> URL: https://issues.apache.org/jira/browse/PIG-3341
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.11.1
>Reporter: pat chan
>Priority: Minor
> Fix For: 0.12, 0.11.2
>
>


[jira] [Updated] (PIG-2828) Handle nulls in DataType.compare

2013-06-03 Thread Aniket Mokashi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Mokashi updated PIG-2828:


Summary: Handle nulls in DataType.compare  (was: DataType.compare null)

> Handle nulls in DataType.compare
> 
>
> Key: PIG-2828
> URL: https://issues.apache.org/jira/browse/PIG-2828
> Project: Pig
>  Issue Type: Bug
>Reporter: Haitao Yao
>Assignee: Aniket Mokashi
> Attachments: DataType.patch, PIG-2828-format.patch, PIG-2828.patch, 
> test.patch
>
>
> When using TOP, if the DataBag contains a null value to compare, it 
> generates the following exception:
> Caused by: java.lang.NullPointerException
>   at org.apache.pig.data.DataType.compare(DataType.java:427)
>   at org.apache.pig.builtin.TOP$TupleComparator.compare(TOP.java:97)
>   at org.apache.pig.builtin.TOP$TupleComparator.compare(TOP.java:1)
>   at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:649)
>   at java.util.PriorityQueue.siftUp(PriorityQueue.java:627)
>   at java.util.PriorityQueue.offer(PriorityQueue.java:329)
>   at java.util.PriorityQueue.add(PriorityQueue.java:306)
>   at org.apache.pig.builtin.TOP.updateTop(TOP.java:141)
>   at org.apache.pig.builtin.TOP.exec(TOP.java:116)
> code: (TOP.java, starts with line 91)
> Object field1 = o1.get(fieldNum);
> Object field2 = o2.get(fieldNum);
> if (!typeFound) {
> datatype = DataType.findType(field1);
> typeFound = true;
> }
> return DataType.compare(field1, field2, datatype, datatype);
> The reason is that if typeFound is true, datatype is not null, 
> and field1 is null, the script fails.
> So we need to check whether field1 is null.
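
A minimal sketch of a null check before comparing, assuming nulls should sort 
before every non-null value. That ordering, the class name NullSafeCompare, 
and the generic signature are assumptions made for illustration; this is not 
the code from the attached patches.

```java
// Hypothetical helper; the null-first ordering is an assumption.
final class NullSafeCompare {
    // Compares two possibly-null values without dereferencing a null.
    static <T extends Comparable<T>> int compare(T a, T b) {
        if (a == null && b == null) {
            return 0;   // two nulls are treated as equal
        }
        if (a == null) {
            return -1;  // null sorts before any value
        }
        if (b == null) {
            return 1;
        }
        return a.compareTo(b);
    }
}
```

A comparator built this way can be handed to a PriorityQueue of tuples whose 
compared field may be null, which is the situation in the stack trace above.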



[jira] [Commented] (PIG-3341) Improving performance of loading datetime values

2013-06-03 Thread pat chan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673742#comment-13673742
 ] 

pat chan commented on PIG-3341:
---

Hi, you bring up two good design points.

1. Are more formats better for this use case? Some possible cons:

a) the spec becomes more complicated for probably unused formats. The simplest 
spec would be to conform to the w3c profile.
b) you will have to support all these formats forever
c) there could be a performance overhead to support the possibly unused formats
d) ToDate(s,f) and UDFs already give users the ability to handle any format 
that's needed.
e) asymmetry: seems cleaner if the default parseable format is exactly the 
default printed format


2. What is the design philosophy for invalid conversions? Quietly turning 
invalid values into null seems like it could be a possibly dangerous default 
since it would be really hard to know if your query on terabytes of data is 
encountering problems which are quietly being ignored. A safer philosophy would 
have the default be as strict with the data as possible and then if the user 
finds a legitimate case for null-conversions, provide a way for the user to 
enable it explicitly in the script.

cheers
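
To make point 2 concrete, here is a minimal sketch of a strict-by-default 
parser with an explicit opt-in for null conversion. DateTimeLoadPolicy and the 
nullOnInvalid flag are hypothetical names, and java.time.OffsetDateTime is used 
only as a self-contained stand-in; Pig has no such switch as far as this thread 
shows.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeParseException;

// Hypothetical illustration of the strict-versus-lenient choice.
final class DateTimeLoadPolicy {
    // Strict mode (nullOnInvalid == false) rejects bad input loudly;
    // lenient mode turns it into null, as the loader does today.
    static OffsetDateTime parse(String raw, boolean nullOnInvalid) {
        try {
            return OffsetDateTime.parse(raw);
        } catch (DateTimeParseException e) {
            if (nullOnInvalid) {
                return null;
            }
            throw new IllegalArgumentException("Invalid datetime value: " + raw, e);
        }
    }
}
```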


> Improving performance of loading datetime values
> 
>
> Key: PIG-3341
> URL: https://issues.apache.org/jira/browse/PIG-3341
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.11.1
>Reporter: pat chan
>Priority: Minor
> Fix For: 0.12, 0.11.2
>
>


[jira] [Commented] (PIG-3342) Allow conditions in case statement

2013-06-03 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673690#comment-13673690
 ] 

Rohini Palaniswamy commented on PIG-3342:
-

Since it is slightly big, can you upload it in review board?

> Allow conditions in case statement
> --
>
> Key: PIG-3342
> URL: https://issues.apache.org/jira/browse/PIG-3342
> Project: Pig
>  Issue Type: Improvement
>  Components: parser
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.12
>
> Attachments: PIG-3342.patch
>
>
> PIG-3268 added case statement support. But conditions are currently not 
> allowed in when branches. For example,
> {code}
> CASE
>   WHEN i % 5 == 0 THEN '5n'
>   WHEN i % 5 == 1 THEN '5n+1'
>   WHEN i % 5 == 2 THEN '5n+2'
>   WHEN i % 5 == 3 THEN '5n+3'
>   ELSE '5n+4'
> END
> {code}
> This is invalid now. However, it will be useful if it's allowed.



[jira] [Updated] (PIG-3285) Jobs using HBaseStorage fail to ship dependency jars

2013-06-03 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-3285:


Status: Open  (was: Patch Available)

Canceling patch for now so that it does not show in Patch Available list. 

> Jobs using HBaseStorage fail to ship dependency jars
> 
>
> Key: PIG-3285
> URL: https://issues.apache.org/jira/browse/PIG-3285
> Project: Pig
>  Issue Type: Bug
>Reporter: Nick Dimiduk
>Assignee: Nick Dimiduk
> Fix For: 0.11.1
>
> Attachments: 0001-PIG-3285-Add-HBase-dependency-jars.patch, 
> 0001-PIG-3285-Add-HBase-dependency-jars.patch, 1.pig, 1.txt, 2.pig
>
>
> Launching a job consuming {{HBaseStorage}} fails out of the box. The user 
> must specify {{-Dpig.additional.jars}} for HBase and all of its dependencies. 
> Exceptions look something like this:
> {noformat}
> 2013-04-19 18:58:39,360 FATAL org.apache.hadoop.mapred.Child: Error running 
> child : java.lang.NoClassDefFoundError: com/google/protobuf/Message
>   at 
> org.apache.hadoop.hbase.io.HbaseObjectWritable.(HbaseObjectWritable.java:266)
>   at org.apache.hadoop.hbase.ipc.Invocation.write(Invocation.java:139)
>   at 
> org.apache.hadoop.hbase.ipc.HBaseClient$Connection.sendParam(HBaseClient.java:612)
>   at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:975)
>   at 
> org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:84)
>   at $Proxy7.getProtocolVersion(Unknown Source)
>   at 
> org.apache.hadoop.hbase.ipc.WritableRpcEngine.getProxy(WritableRpcEngine.java:136)
>   at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:208)
> {noformat}



[jira] [Updated] (PIG-3345) Handle null in DateTime functions

2013-06-03 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-3345:


Status: Patch Available  (was: Open)

> Handle null in DateTime functions
> -
>
> Key: PIG-3345
> URL: https://issues.apache.org/jira/browse/PIG-3345
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.11.1
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.12
>
> Attachments: PIG-3345-1.patch
>
>
>  NPE is thrown in date time functions when a null value is passed. 



[jira] [Updated] (PIG-3345) Handle null in DateTime functions

2013-06-03 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-3345:


Attachment: PIG-3345-1.patch

> Handle null in DateTime functions
> -
>
> Key: PIG-3345
> URL: https://issues.apache.org/jira/browse/PIG-3345
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.11.1
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.12
>
> Attachments: PIG-3345-1.patch
>
>
>  NPE is thrown in date time functions when a null value is passed. 



[jira] [Created] (PIG-3345) Handle null in DateTime functions

2013-06-03 Thread Rohini Palaniswamy (JIRA)
Rohini Palaniswamy created PIG-3345:
---

 Summary: Handle null in DateTime functions
 Key: PIG-3345
 URL: https://issues.apache.org/jira/browse/PIG-3345
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.11.1
Reporter: Rohini Palaniswamy
Assignee: Rohini Palaniswamy
 Fix For: 0.12


 NPE is thrown in date time functions when a null value is passed. 



[jira] [Updated] (PIG-2828) DataType.compare null

2013-06-03 Thread Aniket Mokashi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Mokashi updated PIG-2828:


Attachment: PIG-2828-format.patch

> DataType.compare null
> -
>
> Key: PIG-2828
> URL: https://issues.apache.org/jira/browse/PIG-2828
> Project: Pig
>  Issue Type: Bug
>Reporter: Haitao Yao
>Assignee: Aniket Mokashi
> Attachments: DataType.patch, PIG-2828-format.patch, PIG-2828.patch, 
> test.patch
>
>


[jira] [Updated] (PIG-3279) Support nested RANK

2013-06-03 Thread Johnny Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johnny Zhang updated PIG-3279:
--

Attachment: PIG-3279-3.patch.txt

Thanks a lot for your comments, [~daijy]! I appreciate it. I changed 
LogToPhyTranslationVisitor.java:
1. For the RANK BY operation, only include POSort -> POCounter -> PORank -> 
POForEach. The current physical plan looks like:
c: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-42
|
|---c: New For Each(true)[bag] - scope-41
|   |
|   RelationToExpressionProject[bag][*] - scope-32
|   |
|   |---New For Each(false,true)[tuple] - scope-40
|   |   |
|   |   Project[long][0] - scope-38
|   |   |
|   |   Project[bag][2] - scope-39
|   |
|   |---d: PORank[tuple] - scope-37
|   |   |
|   |   Project[int][0] - scope-34
|   |
|   |---d: POCounter[tuple] - scope-36
|   |   |
|   |   Project[int][0] - scope-34
|   |
|   |---d: POSort[tuple]() - scope-35
|   |   |
|   |   Project[int][0] - scope-34
|   |
|   |---Project[bag][1] - scope-33
|
|---b: Package[tuple]{chararray} - scope-29
|
|---b: Global Rearrange[tuple] - scope-28
|
|---b: Local Rearrange[tuple]{chararray}(false) - scope-30
|   |
|   Project[chararray][1] - scope-31
|
|---a: New For Each(false,false,false)[bag] - scope-27
|   |
|   Cast[chararray] - scope-19
|   |
|   |---Project[bytearray][0] - scope-18
|   |
|   Cast[chararray] - scope-22
|   |
|   |---Project[bytearray][1] - scope-21
|   |
|   Cast[int] - scope-25
|   |
|   |---Project[bytearray][2] - scope-24
|
|---a: 
Load(file:///home/xiaoyuz/PIG-new/pig/input1:org.apache.pig.builtin.PigStorage) 
- scope-17


2. For the RANK operation, there is no difference between nested and non-nested 
RANK, since there is no POPackage or global rearrange for non-nested RANK anyway.

However, I still get an exception for the RANK BY and RANK operations:
{noformat}
Caused by: java.lang.RuntimeException: Unable to read counter 
pig.counters.counter_2415405541993583480_-1
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PORank.addRank(PORank.java:165)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PORank.getNextTuple(PORank.java:134)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:281)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:242)
... 13 more
{noformat}
Things are getting closer, but it is still not complete. Thanks.

> Support nested RANK
> ---
>
> Key: PIG-3279
> URL: https://issues.apache.org/jira/browse/PIG-3279
> Project: Pig
>  Issue Type: Improvement
>Reporter: Gianmarco De Francisci Morales
>Assignee: Johnny Zhang
> Attachments: PIG-3279-1.patch.txt, PIG-3279-2.patch.txt, 
> PIG-3279-3.patch.txt
>
>




[jira] [Updated] (PIG-2828) DataType.compare null

2013-06-03 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-2828:
---

Assignee: Aniket Mokashi

+1 to PIG-2828.patch. Looks good to me.

[~aniket486], can you please replace all the tabs with 4 spaces when committing 
your patch? Thanks!

> DataType.compare null
> -
>
> Key: PIG-2828
> URL: https://issues.apache.org/jira/browse/PIG-2828
> Project: Pig
>  Issue Type: Bug
>Reporter: Haitao Yao
>Assignee: Aniket Mokashi
> Attachments: DataType.patch, PIG-2828.patch, test.patch
>
>


Re: A major addition to Pig. Working with spatial data

2013-06-03 Thread Russell Jurney
The JIRAs that do best are the ones driven to completion by one person.



Re: A major addition to Pig. Working with spatial data

2013-06-03 Thread Ahmed Eldawy
I've just created a new JIRA issue for the spatial functionality.
https://issues.apache.org/jira/browse/PIG-3344
This issue is all about the new datatype which is the only thing that needs
to be changed internally in Pig in this phase. Pigeon is already working
with the ESRI library but it converts between binary representation and
Geometry class back and forth. Once the new datatype is added, we can
change Pigeon to work with this datatype too. We can still keep the current
conversion functionality, since it lets the system convert automatically
from the bytearray datatype and so autodetect the type when a column is not
given one in the schema.

I don't know whether I should provide a patch for this issue myself or whether
there is someone else who can work on it. I can of course do it, but I think it
will take me some time to finish as I'm not yet familiar with the internals of
Pig. Someone who is familiar with the parser would definitely do a better
job here. I can focus on Pigeon and add more spatial functions there so
that we have plenty of functions once the new datatype is added. I'm
open to both solutions but I'm just checking with you.

Thanks
Ahmed

Best regards,
Ahmed Eldawy


On Wed, May 29, 2013 at 12:17 PM, Russell Jurney
wrote:

> Awesome. This would be a great addition to Pig. Please create a JIRA.
>
> Russell Jurney http://datasyndrome.com
>
> On May 29, 2013, at 8:51 AM, Ahmed Eldawy  wrote:
>
> > Hi all,
> >
> > Nick has pointed out to me an alternative GIS package that can replace
> > JTS. ESRI has recently released a GIS package under the Apache
> > license. I changed Pigeon to work with that new package. I
> > think it could be easier now to integrate this work with main branch of
> > Apache Pig. I will go on with the current project and add more spatial
> > functionality. We can then add a new datatype to Apache and link it to
> > those functions.
> >
> > ESRI package contains a class OGCGeometry
> > <http://esri.github.io/geometry-api-java/javadoc/com/esri/core/geometry/ogc/OGCGeometry.html>
> > which can be linked to a new datatype 'Geometry'. Do you think we can rely on
> > the new package and integrate the work with Apache Pig?
> >
> > On May 23, 2013 11:40 PM, "Ahmed Eldawy"  wrote:
> >
> >> Hi all,
> >>  Thanks for your help. I've started the project with minimal
> >> functionality as a start. It's currently hosted on GitHub. It is licensed
> >> under the Apache public license to make it easier to merge with Pig.
> >> Currently it has only a very few functions. I implemented a function from
> >> each of several types of functions (e.g., Aggregate and create). I'll keep
> >> adding functions, and any contributions to the project are welcome. As a
> >> beginning, I need an ANT build file that runs the tests, compiles and
> >> generates a jar file. I'm not familiar with ANT so any help in this is
> >> encouraged.
> >> Here's the project home page
> >> https://github.com/aseldawy/pigeon
> >>
> >>
> >> If you have any comments or suggestions please contact me.
> >>
> >>
> >> Best regards,
> >> Ahmed Eldawy
> >>
> >>
> >> On Mon, May 6, 2013 at 3:09 PM, Jonathan Coveney wrote:
> >>
> >>> Nick: the only issue is that the way types are implemented in Pig doesn't
> >>> allow us to easily "plug-in" types externally. Adding support for that
> >>> would be cool, but a fair bit of work.
> >>>
> >>>
> >>> 2013/5/6 Nick Dimiduk
> >>>
> >>>> I'm not a lawyer, but I see no reason why this cannot be an external
> >>>> extension to Pig. It would behave the same way PostGIS is an external
> >>>> extension to Postgres. Any Apache issues would be toward general
> >>>> purpose enhancements, not specific to your project.
> >>>>
> >>>> Good on you!
> >>>> -n
> >>>>
> >>>> On Mon, May 6, 2013 at 10:12 AM, Ahmed Eldawy wrote:
> >>>>
> >>>>> I contacted solr developers to see how JTS can be included in an Apache
> >>>>> project. See
> >>>>> http://mail-archives.apache.org/mod_mbox/lucene-dev/201305.mbox/raw/%3C1367815102914-4060969.post%40n3.nabble.com%3E/
> >>>>> As far as I understand, they did not include it in the main solr project;
> >>>>> rather, they created a separate project (spatial 4j) which is still
> >>>>> licensed under Apache license and refers to JTS. Users will have to
> >>>>> download JTS libraries separately to make it run. That's pretty much the
> >>>>> same plan that Jonathan mentioned. We will still have the overhead of
> >>>>> serializing/deserializing the shapes each time a function is called. Also,
> >>>>> we will have to use the ugly bytearray data type for spatial data instead
> >>>>> of creating its own data type (e.g., Geometry).
> >>>>> I think using spatial 4j instead of JTS will not be sufficient for our case
> >>>>> as we need to provide access to all spatial functions of JTS such as
> >>>>> Union, Intersection, Difference, ... etc. This way we
[jira] [Created] (PIG-3344) Add a spatial datatype to Pig

2013-06-03 Thread Ahmed Eldawy (JIRA)
Ahmed Eldawy created PIG-3344:
-

 Summary: Add a spatial datatype to Pig
 Key: PIG-3344
 URL: https://issues.apache.org/jira/browse/PIG-3344
 Project: Pig
  Issue Type: New Feature
  Components: parser
Reporter: Ahmed Eldawy


This issue is about adding a new datatype to Pig that abstracts a spatial 
attribute. Following OGC [http://www.opengeospatial.org/], we will add a new 
datatype called 'Geometry' that abstracts all standard shapes (e.g., Point, 
Polygon and Linestring). This datatype is automatically parsed from either a 
Well-Known Text (WKT) or Well-Known Binary (WKB) represented as a Hex string. 
These two types are the standard export formats for OGC shapes and they are 
supported by many existing tools including PostGIS [http://postgis.net/]. 
Exporting through PigStorage should default to WKB represented as a hex string, 
and there will be additional functions to convert to WKT.

This new datatype maps internally to the class OGCGeometry 
[https://github.com/Esri/geometry-api-java/blob/master/src/com/esri/core/geometry/ogc/OGCGeometry.java]
 licensed under Apache license. This class contains functionality to 
import/export to the WKT and WKB formats.

Data manipulation on the new datatype will all be done through UDFs. 
There is already a spatial extension to Pig (called Pigeon) 
[https://github.com/aseldawy/pigeon] that provides basic spatial functionality 
via UDFs powered by the aforementioned library. Currently, it automatically 
converts WKB and WKT fields to the OGCGeometry class, performs the spatial 
operation, and produces the result back as WKB. Once the Geometry datatype is 
added, Pigeon will use it natively to avoid the conversion.
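
To make the "WKB represented as a hex string" format concrete, here is a minimal sketch that builds the little-endian WKB for a 2D point and round-trips it through hex. It uses only the JDK. The class and method names (WkbHexSketch, pointToWkb, toHex, fromHex) are hypothetical and are not part of Pig, Pigeon, or the ESRI library, which would normally do this encoding itself.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration only; not Pig, Pigeon, or ESRI API.
class WkbHexSketch {

    // Encodes a 2D point as WKB: 1 byte-order byte, 4-byte type, two 8-byte doubles.
    static byte[] pointToWkb(double x, double y) {
        ByteBuffer buf = ByteBuffer.allocate(21).order(ByteOrder.LITTLE_ENDIAN);
        buf.put((byte) 1);   // byte-order marker: 1 = little-endian
        buf.putInt(1);       // geometry type code: 1 = Point
        buf.putDouble(x);
        buf.putDouble(y);
        return buf.array();
    }

    // Renders bytes as an upper-case hex string, two characters per byte.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    // Parses a hex string (even length assumed) back into bytes.
    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    public static void main(String[] args) {
        String hex = toHex(pointToWkb(1.0, 2.0));
        System.out.println(hex); // prints 0101000000000000000000F03F0000000000000040
    }
}
```

The printed string is the same hex form PostGIS shows for POINT(1 2), which is why a hex WKB default for PigStorage would interoperate with such tools.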



[jira] [Commented] (PIG-2828) DataType.compare null

2013-06-03 Thread Aniket Mokashi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673318#comment-13673318
 ] 

Aniket Mokashi commented on PIG-2828:
-

I have created https://issues.apache.org/jira/browse/PIG-3343 to track the API 
refactor.

> DataType.compare null
> -
>
> Key: PIG-2828
> URL: https://issues.apache.org/jira/browse/PIG-2828
> Project: Pig
>  Issue Type: Bug
>Reporter: Haitao Yao
> Attachments: DataType.patch, PIG-2828.patch, test.patch
>
>
> When using TOP, if the DataBag contains a null value to compare, it 
> generates the following exception:
> Caused by: java.lang.NullPointerException
>   at org.apache.pig.data.DataType.compare(DataType.java:427)
>   at org.apache.pig.builtin.TOP$TupleComparator.compare(TOP.java:97)
>   at org.apache.pig.builtin.TOP$TupleComparator.compare(TOP.java:1)
>   at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:649)
>   at java.util.PriorityQueue.siftUp(PriorityQueue.java:627)
>   at java.util.PriorityQueue.offer(PriorityQueue.java:329)
>   at java.util.PriorityQueue.add(PriorityQueue.java:306)
>   at org.apache.pig.builtin.TOP.updateTop(TOP.java:141)
>   at org.apache.pig.builtin.TOP.exec(TOP.java:116)
> code: (TOP.java, starts with line 91)
> Object field1 = o1.get(fieldNum);
> Object field2 = o2.get(fieldNum);
> if (!typeFound) {
> datatype = DataType.findType(field1);
> typeFound = true;
> }
> return DataType.compare(field1, field2, datatype, datatype);
> The reason is that when typeFound is true and datatype is not null, but 
> field1 is null, the script fails.
> So we need to check whether field1 is null.



[jira] [Created] (PIG-3343) Refactor DataType.compare api to handle NULLs and reflection

2013-06-03 Thread Aniket Mokashi (JIRA)
Aniket Mokashi created PIG-3343:
---

 Summary: Refactor DataType.compare api to handle NULLs and 
reflection
 Key: PIG-3343
 URL: https://issues.apache.org/jira/browse/PIG-3343
 Project: Pig
  Issue Type: Bug
  Components: data
Reporter: Aniket Mokashi






[jira] [Commented] (PIG-3337) Fix remaining Window e2e tests

2013-06-03 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673254#comment-13673254
 ] 

Rohini Palaniswamy commented on PIG-3337:
-

[~daijy],
   Any idea why hive hudson messages are appearing here? Saw this before in 
PIG-2955 and PIG-3069 also 

> Fix remaining Window e2e tests
> --
>
> Key: PIG-3337
> URL: https://issues.apache.org/jira/browse/PIG-3337
> Project: Pig
>  Issue Type: Sub-task
>  Components: e2e harness
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.12
>
> Attachments: PIG-3337-1.patch
>
>




[jira] [Commented] (PIG-3337) Fix remaining Window e2e tests

2013-06-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673234#comment-13673234
 ] 

Hudson commented on PIG-3337:
-

Integrated in Hive-trunk-h0.21 #2125 (See 
[https://builds.apache.org/job/Hive-trunk-h0.21/2125/])
PIG-3337: Fix remaining Window e2e tests (Revision 1487967)

 Result = FAILURE
daijy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1487967
Files : 
* /pig/trunk/CHANGES.txt
* /pig/trunk/test/e2e/harness/TestDriver.pm
* /pig/trunk/test/e2e/pig/drivers/TestDriverPig.pm


> Fix remaining Window e2e tests
> --
>
> Key: PIG-3337
> URL: https://issues.apache.org/jira/browse/PIG-3337
> Project: Pig
>  Issue Type: Sub-task
>  Components: e2e harness
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.12
>
> Attachments: PIG-3337-1.patch
>
>




[jira] [Updated] (PIG-3327) Pig hits OOM when fetching task Reports

2013-06-03 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-3327:


Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to trunk (0.12). Thanks Cheolsoo

> Pig hits OOM when fetching task Reports
> ---
>
> Key: PIG-3327
> URL: https://issues.apache.org/jira/browse/PIG-3327
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.10.1
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.12
>
> Attachments: PIG-3327-1.patch
>
>
> java.lang.OutOfMemoryError: GC overhead limit exceeded is hit with hadoop 23 
> by the pig script when a launched job has 80K+ maps. The TaskReport[] array 
> is causing OOM. 



[jira] [Commented] (PIG-3341) Improving performance of loading datetime values

2013-06-03 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673122#comment-13673122
 ] 

Rohini Palaniswamy commented on PIG-3341:
-

bq. Before making the fix, I think there needs to be a little more clarity 
around exactly what formats are supported. For example, pig 0.11.1 currently 
supports datetime strings with no date - "T00:00:00" produces a date in 1970. 
Is this intentional?
   I don't think anyone is looking for such a behaviour. Not intuitive. 

 I think we can go with option 1 (more is better) but also state which of the 
supported formats are not part of the W3C profile. We also need to return null if 
the value does not conform to the format instead of throwing an error. 
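
As a rough sketch of the two points discussed on this issue (compile the regex once, and return null for non-conforming input rather than throwing), something like the following could work. It reuses the regular expression quoted in the issue description but is otherwise hypothetical: the class and method names are made up, and it returns the matched suffix as a String instead of a Joda DateTimeZone so that it runs on the JDK alone. It is not the actual ToDate.java code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the code in Pig's ToDate.java.
class TimeZoneSuffixSketch {

    // Compiled once when the class loads, not once per value.
    // Pattern instances are immutable and safe to share across threads.
    private static final Pattern TZ_PATTERN =
        Pattern.compile("(Z|(?<=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$");

    // Returns the trailing zone designator ("Z", "+23", "-08:00", ...),
    // or null when the input is null or carries no such suffix.
    static String extractZoneSuffix(String dtStr) {
        if (dtStr == null) {
            return null;
        }
        Matcher m = TZ_PATTERN.matcher(dtStr);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractZoneSuffix("2000-01-01T00:00:00+23")); // prints +23
        System.out.println(extractZoneSuffix("2000-01-01"));             // prints null
    }
}
```

Returning null keeps the failure path free of exception construction, which is the cost the issue description measures for invalid values.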

> Improving performance of loading datetime values
> 
>
> Key: PIG-3341
> URL: https://issues.apache.org/jira/browse/PIG-3341
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.11.1
>Reporter: pat chan
>Priority: Minor
> Fix For: 0.12, 0.11.2
>
>
> The performance of loading datetime values can be improved by about 25% by 
> moving a single line in ToDate.java:
> public static DateTimeZone extractDateTimeZone(String dtStr) {
>   Pattern pattern = Pattern.compile("(Z|(?<=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$");
> should become:
> static Pattern pattern = Pattern.compile("(Z|(?<=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$");
> public static DateTimeZone extractDateTimeZone(String dtStr) {
> There is no need to recompile the regular expression for every value. I'm not 
> sure if this function is ever called concurrently, but Pattern objects are 
> thread-safe anyways.
> As a test, I created a file of 10M timestamps:
>   for i in 0..1000
> puts '2000-01-01T00:00:00+23'
>   end
> I then ran this script:
>   grunt> A = load 'data' as (a:datetime); B = filter A by a is null; dump B;
> Before the change it took 160s.
> After the change, the script took 120s.
> 
> Another performance improvement can be made for invalid datetime values. If a 
> datetime value is invalid, an exception is created and thrown, which is a 
> costly way to fail a validity check. To test the performance impact, I 
> created 10M invalid datetime values:
>   for i in 0..1000
> puts '2000-99-01T00:00:00+23'
>   end
> In this test, the regex pattern was always recompiled. I then ran this script:
>   grunt> A = load 'data' as (a:datetime); B = filter A by a is not null; dump B;
> The script took 190s.
> I understand this could be considered an edge case and might not be worth 
> changing. However, if there are use cases where invalid dates are part of 
> normal processing, then you might consider fixing this.



Re: Review Request: PIG-3331 Default values not written to Schema when specified in the output schema

2013-06-03 Thread Rohini Palaniswamy

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/11355/#review21315
---



http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/PigSchema2Avro.java


Initialize defaultValue in a variable and pass defaultValue instead of 
using an if-else condition.



http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java


Isn't a load and store enough to reproduce the test case? Why such a long 
pig script? Please try to keep the unit tests simple.


- Rohini Palaniswamy


On May 30, 2013, 2:29 a.m., Viraj Bhat wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/11355/
> ---
> 
> (Updated May 30, 2013, 2:29 a.m.)
> 
> 
> Review request for pig and Rohini Palaniswamy.
> 
> 
> Description
> ---
> 
> Patch to write default values to the schema when the writer schema in 
> AvroStorage contains them.
> 
> 
> Diffs
> -
> 
>   
> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/PigSchema2Avro.java
>  1485826 
>   
> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java
>  1485826 
>   
> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/avro_test_files/numbers.txt
>  PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/11355/diff/
> 
> 
> Testing
> ---
> 
> Yes against the Piggybank  in Pig trunk/Pig 0.12
> 
> 
> Thanks,
> 
> Viraj Bhat
> 
>



Re: Review Request: PIG-3322 Fix the issue where NPE is thrown when reading a union which has nulls and add a testcase

2013-06-03 Thread Rohini Palaniswamy

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/11333/#review21316
---



http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java


Isn't a load and store enough to reproduce the test case? Why such a long 
pig script?


- Rohini Palaniswamy


On May 29, 2013, 11:07 p.m., Viraj Bhat wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/11333/
> ---
> 
> (Updated May 29, 2013, 11:07 p.m.)
> 
> 
> Review request for pig and Rohini Palaniswamy.
> 
> 
> Description
> ---
> 
> Null pointer exception when loading a union with null in its schema. The test 
> suite was also updated with a sample test case.
> 
> 
> Diffs
> -
> 
>   
> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorage.java
>  1485358 
>   
> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/PigAvroRecordReader.java
>  1485358 
>   
> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java
>  1485358 
>   
> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/avro_test_files/expected_testLoadAvrowithNulls.txt
>  PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/11333/diff/
> 
> 
> Testing
> ---
> 
> Yes all tests pass in the piggybank
> 
> 
> Thanks,
> 
> Viraj Bhat
> 
>



Re: Review Request: PIG-3322 Fix the issue where NPE is thrown when reading a union which has nulls and add a testcase

2013-06-03 Thread Rohini Palaniswamy

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/11333/#review21312
---


Just minor comments on variable naming. Java variable names should 
be camel case.


http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java


goldenOutput



http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java


output



http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java


golden output



http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java


fileOutput


- Rohini Palaniswamy


On May 29, 2013, 11:07 p.m., Viraj Bhat wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/11333/
> ---
> 
> (Updated May 29, 2013, 11:07 p.m.)
> 
> 
> Review request for pig and Rohini Palaniswamy.
> 
> 
> Description
> ---
> 
> Null pointer exception when loading a union with null in its schema. The test 
> suite was also updated with a sample test case.
> 
> 
> Diffs
> -
> 
>   
> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorage.java
>  1485358 
>   
> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/PigAvroRecordReader.java
>  1485358 
>   
> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java
>  1485358 
>   
> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/avro_test_files/expected_testLoadAvrowithNulls.txt
>  PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/11333/diff/
> 
> 
> Testing
> ---
> 
> Yes all tests pass in the piggybank
> 
> 
> Thanks,
> 
> Viraj Bhat
> 
>



[jira] [Commented] (PIG-2828) DataType.compare null

2013-06-03 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672896#comment-13672896
 ] 

Julien Le Dem commented on PIG-2828:


Sounds good to me.

> DataType.compare null
> -
>
> Key: PIG-2828
> URL: https://issues.apache.org/jira/browse/PIG-2828
> Project: Pig
>  Issue Type: Bug
>Reporter: Haitao Yao
> Attachments: DataType.patch, PIG-2828.patch, test.patch
>
>
> When using TOP, if the DataBag contains a null value to compare, it 
> generates the following exception:
> Caused by: java.lang.NullPointerException
>   at org.apache.pig.data.DataType.compare(DataType.java:427)
>   at org.apache.pig.builtin.TOP$TupleComparator.compare(TOP.java:97)
>   at org.apache.pig.builtin.TOP$TupleComparator.compare(TOP.java:1)
>   at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:649)
>   at java.util.PriorityQueue.siftUp(PriorityQueue.java:627)
>   at java.util.PriorityQueue.offer(PriorityQueue.java:329)
>   at java.util.PriorityQueue.add(PriorityQueue.java:306)
>   at org.apache.pig.builtin.TOP.updateTop(TOP.java:141)
>   at org.apache.pig.builtin.TOP.exec(TOP.java:116)
> code: (TOP.java, starts with line 91)
> Object field1 = o1.get(fieldNum);
> Object field2 = o2.get(fieldNum);
> if (!typeFound) {
> datatype = DataType.findType(field1);
> typeFound = true;
> }
> return DataType.compare(field1, field2, datatype, datatype);
> The reason is that when typeFound is true and datatype is not null, but 
> field1 is null, the script fails.
> So we need to check whether field1 is null.



[jira] [Updated] (PIG-2828) DataType.compare null

2013-06-03 Thread Aniket Mokashi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Mokashi updated PIG-2828:


Status: Patch Available  (was: Open)

> DataType.compare null
> -
>
> Key: PIG-2828
> URL: https://issues.apache.org/jira/browse/PIG-2828
> Project: Pig
>  Issue Type: Bug
>Reporter: Haitao Yao
> Attachments: DataType.patch, PIG-2828.patch, test.patch
>
>
> When using TOP, if the DataBag contains a null value to compare, it 
> generates the following exception:
> Caused by: java.lang.NullPointerException
>   at org.apache.pig.data.DataType.compare(DataType.java:427)
>   at org.apache.pig.builtin.TOP$TupleComparator.compare(TOP.java:97)
>   at org.apache.pig.builtin.TOP$TupleComparator.compare(TOP.java:1)
>   at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:649)
>   at java.util.PriorityQueue.siftUp(PriorityQueue.java:627)
>   at java.util.PriorityQueue.offer(PriorityQueue.java:329)
>   at java.util.PriorityQueue.add(PriorityQueue.java:306)
>   at org.apache.pig.builtin.TOP.updateTop(TOP.java:141)
>   at org.apache.pig.builtin.TOP.exec(TOP.java:116)
> code: (TOP.java, starts with line 91)
> Object field1 = o1.get(fieldNum);
> Object field2 = o2.get(fieldNum);
> if (!typeFound) {
> datatype = DataType.findType(field1);
> typeFound = true;
> }
> return DataType.compare(field1, field2, datatype, datatype);
> The reason is that when typeFound is true and datatype is not null, but 
> field1 is null, the script fails.
> So we need to check whether field1 is null.



[jira] [Updated] (PIG-2828) DataType.compare null

2013-06-03 Thread Aniket Mokashi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Mokashi updated PIG-2828:


Attachment: PIG-2828.patch

> DataType.compare null
> -
>
> Key: PIG-2828
> URL: https://issues.apache.org/jira/browse/PIG-2828
> Project: Pig
>  Issue Type: Bug
>Reporter: Haitao Yao
> Attachments: DataType.patch, PIG-2828.patch, test.patch
>
>
> When using TOP, if the DataBag contains a null value to compare, it 
> generates the following exception:
> Caused by: java.lang.NullPointerException
>   at org.apache.pig.data.DataType.compare(DataType.java:427)
>   at org.apache.pig.builtin.TOP$TupleComparator.compare(TOP.java:97)
>   at org.apache.pig.builtin.TOP$TupleComparator.compare(TOP.java:1)
>   at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:649)
>   at java.util.PriorityQueue.siftUp(PriorityQueue.java:627)
>   at java.util.PriorityQueue.offer(PriorityQueue.java:329)
>   at java.util.PriorityQueue.add(PriorityQueue.java:306)
>   at org.apache.pig.builtin.TOP.updateTop(TOP.java:141)
>   at org.apache.pig.builtin.TOP.exec(TOP.java:116)
> code: (TOP.java, starts with line 91)
> Object field1 = o1.get(fieldNum);
> Object field2 = o2.get(fieldNum);
> if (!typeFound) {
> datatype = DataType.findType(field1);
> typeFound = true;
> }
> return DataType.compare(field1, field2, datatype, datatype);
> The reason is that when typeFound is true and datatype is not null, but 
> field1 is null, the script fails.
> So we need to check whether field1 is null.



[jira] [Commented] (PIG-2828) DataType.compare null

2013-06-03 Thread Aniket Mokashi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672880#comment-13672880
 ] 

Aniket Mokashi commented on PIG-2828:
-

The DataType compare API is a little broken.
public static int compare(Object o1, Object o2) - uses reflection to infer 
the datatypes of o1 and o2.
public static int compare(Object o1, Object o2, byte dt1, byte dt2) - doesn't 
use reflection; however, callers of this API use reflection and also deal with 
NULLs themselves.
Currently, callers of the second API handle NULLs somewhat similarly, but not 
consistently. We can refactor the API to avoid reflection and handle NULLs 
consistently in a separate JIRA.
Right now, TOP, which uses the second API directly, fails with an NPE if o1 or 
o2 has null data. We should fix that with NULL < non-NULL semantics.
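
For illustration only, a minimal sketch of what NULL < non-NULL ordering could 
look like. The class NullSafeCompare and the helper compareNullFirst are 
hypothetical names, not part of Pig's DataType API or the attached patch, and 
the example compares plain Integers rather than Pig tuples:

```java
import java.util.Arrays;
import java.util.List;

public class NullSafeCompare {
    // Hypothetical helper (not Pig's API): orders null before any non-null value.
    public static <T extends Comparable<T>> int compareNullFirst(T a, T b) {
        if (a == null && b == null) {
            return 0;   // two nulls compare as equal
        }
        if (a == null) {
            return -1;  // NULL < non-NULL
        }
        if (b == null) {
            return 1;   // non-NULL > NULL
        }
        return a.compareTo(b); // both non-null: fall back to natural ordering
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(3, null, 1);
        // Sorting with the null-aware helper avoids the NPE that a plain
        // compareTo call would raise on the null element.
        values.sort(NullSafeCompare::compareNullFirst);
        System.out.println(values); // prints [null, 1, 3]
    }
}
```

With a check of this shape in front of the typed comparison, a comparator such 
as the one in TOP would sort null fields first instead of failing.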

> DataType.compare null
> -
>
> Key: PIG-2828
> URL: https://issues.apache.org/jira/browse/PIG-2828
> Project: Pig
> Issue Type: Bug
> Reporter: Haitao Yao
> Attachments: DataType.patch, test.patch
>
>
> When using TOP, if the DataBag contains a null value to compare, the 
> following exception is thrown:
> Caused by: java.lang.NullPointerException
>   at org.apache.pig.data.DataType.compare(DataType.java:427)
>   at org.apache.pig.builtin.TOP$TupleComparator.compare(TOP.java:97)
>   at org.apache.pig.builtin.TOP$TupleComparator.compare(TOP.java:1)
>   at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:649)
>   at java.util.PriorityQueue.siftUp(PriorityQueue.java:627)
>   at java.util.PriorityQueue.offer(PriorityQueue.java:329)
>   at java.util.PriorityQueue.add(PriorityQueue.java:306)
>   at org.apache.pig.builtin.TOP.updateTop(TOP.java:141)
>   at org.apache.pig.builtin.TOP.exec(TOP.java:116)
> code: (TOP.java, starts with line 91)
>     Object field1 = o1.get(fieldNum);
>     Object field2 = o2.get(fieldNum);
>     if (!typeFound) {
>         datatype = DataType.findType(field1);
>         typeFound = true;
>     }
>     return DataType.compare(field1, field2, datatype, datatype);
> The reason is that when typeFound is true, datatype is not null, and 
> field1 is null, the script fails.
> So we need to check whether field1 is null.
