[jira] Subscription: PIG patch available

2015-08-12 Thread jira
Issue Subscription
Filter: PIG patch available (29 issues)

Subscriber: pigdaily

Key       Summary
PIG-4657  [Pig on Tez] Optimize GroupBy and Distinct key comparison
          https://issues.apache.org/jira/browse/PIG-4657
PIG-4654  Set tez.task.scale.memory.reserve-fraction to 0.5 at AM level
          https://issues.apache.org/jira/browse/PIG-4654
PIG-4644  PORelationToExprProject.clone() is broken
          https://issues.apache.org/jira/browse/PIG-4644
PIG-4629  org.apache.hadoop.hive.ql.exec.FunctionRegistry#getFunctionInfo() throws SemanticException since Hive 1.1.0
          https://issues.apache.org/jira/browse/PIG-4629
PIG-4598  Allow user defined plan optimizer rules
          https://issues.apache.org/jira/browse/PIG-4598
PIG-4581  thread safe issue in NodeIdGenerator
          https://issues.apache.org/jira/browse/PIG-4581
PIG-4539  New PigUnit
          https://issues.apache.org/jira/browse/PIG-4539
PIG-4534  Pig 0.14.0 with Hive 1.1.0, gives unresolved dependency error for hive-shims-common-secure
          https://issues.apache.org/jira/browse/PIG-4534
PIG-4515  org.apache.pig.builtin.Distinct throws ClassCastException
          https://issues.apache.org/jira/browse/PIG-4515
PIG-4468  Pig's jackson version conflicts with that of hadoop 2.6.0
          https://issues.apache.org/jira/browse/PIG-4468
PIG-4455  Should use DependencyOrderWalker instead of DepthFirstWalker in MRPrinter
          https://issues.apache.org/jira/browse/PIG-4455
PIG-4417  Pig's register command should support automatic fetching of jars from repo.
          https://issues.apache.org/jira/browse/PIG-4417
PIG-4373  Implement PIG-3861 in Tez
          https://issues.apache.org/jira/browse/PIG-4373
PIG-4341  Add CMX support to pig.tmpfilecompression.codec
          https://issues.apache.org/jira/browse/PIG-4341
PIG-4323  PackageConverter hanging in Spark
          https://issues.apache.org/jira/browse/PIG-4323
PIG-4313  StackOverflowError in LIMIT operation on Spark
          https://issues.apache.org/jira/browse/PIG-4313
PIG-4251  Pig on Storm
          https://issues.apache.org/jira/browse/PIG-4251
PIG-4111  Make Pig compiles with avro-1.7.7
          https://issues.apache.org/jira/browse/PIG-4111
PIG-4002  Disable combiner when map-side aggregation is used
          https://issues.apache.org/jira/browse/PIG-4002
PIG-3952  PigStorage accepts '-tagSplit' to return full split information
          https://issues.apache.org/jira/browse/PIG-3952
PIG-3911  Define unique fields with @OutputSchema
          https://issues.apache.org/jira/browse/PIG-3911
PIG-3877  Getting Geo Latitude/Longitude from Address Lines
          https://issues.apache.org/jira/browse/PIG-3877
PIG-3873  Geo distance calculation using Haversine
          https://issues.apache.org/jira/browse/PIG-3873
PIG-3866  Create ThreadLocal classloader per PigContext
          https://issues.apache.org/jira/browse/PIG-3866
PIG-3864  ToDate(userstring, format, timezone) computes DateTime with strange handling of Daylight Saving Time with location based timezones
          https://issues.apache.org/jira/browse/PIG-3864
PIG-3851  Upgrade jline to 2.11
          https://issues.apache.org/jira/browse/PIG-3851
PIG-3668  COR built-in function when atleast one of the coefficient values is NaN
          https://issues.apache.org/jira/browse/PIG-3668
PIG-3635  Fix e2e tests for Hadoop 2.X on Windows
          https://issues.apache.org/jira/browse/PIG-3635
PIG-3587  add functionality for rolling over dates
          https://issues.apache.org/jira/browse/PIG-3587

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=16328&filterId=12322384


[jira] [Updated] (PIG-4657) [Pig on Tez] Optimize GroupBy and Distinct key comparison

2015-08-12 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-4657:

Attachment: PIG-4657-1.patch

> [Pig on Tez] Optimize GroupBy and Distinct key comparison
> -
>
> Key: PIG-4657
> URL: https://issues.apache.org/jira/browse/PIG-4657
> Project: Pig
> Issue Type: Sub-task
> Reporter: Rohini Palaniswamy
> Assignee: Rohini Palaniswamy
> Fix For: 0.16.0
>
> Attachments: PIG-4657-1.patch
>
>
> While the bytes comparator cannot be used for joins until TEZ-2715 is 
> available, it can be used for group by and distinct if they have only one 
> Tez input. If there is more than one input due to union optimization 
> (OrderedGroupedMergedKVInput), the full comparator still has to be used, as 
> OrderedGroupedMergedKVInput uses the comparator to merge the two underlying 
> inputs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4657) [Pig on Tez] Optimize GroupBy and Distinct key comparison

2015-08-12 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-4657:

Status: Patch Available  (was: Open)



[jira] [Created] (PIG-4658) Reduce key comparisons in TezAccumulativeTupleBuffer

2015-08-12 Thread Rohini Palaniswamy (JIRA)
Rohini Palaniswamy created PIG-4658:
---

 Summary: Reduce key comparisons in TezAccumulativeTupleBuffer
 Key: PIG-4658
 URL: https://issues.apache.org/jira/browse/PIG-4658
 Project: Pig
  Issue Type: Sub-task
Reporter: Rohini Palaniswamy
Assignee: Rohini Palaniswamy
 Fix For: 0.16.0


   Currently Accumulator is applicable only for Group by. 
TezAccumulativeTupleBuffer supports more than one Tez input, and the code for 
that adds a lot of additional comparisons. We can make it support only one Tez 
input and get rid of a couple of comparisons. 





[jira] [Created] (PIG-4657) [Pig on Tez] Optimize GroupBy and Distinct key comparison

2015-08-12 Thread Rohini Palaniswamy (JIRA)
Rohini Palaniswamy created PIG-4657:
---

 Summary: [Pig on Tez] Optimize GroupBy and Distinct key comparison
 Key: PIG-4657
 URL: https://issues.apache.org/jira/browse/PIG-4657
 Project: Pig
  Issue Type: Sub-task
Reporter: Rohini Palaniswamy
Assignee: Rohini Palaniswamy
 Fix For: 0.16.0


   While the bytes comparator cannot be used for joins until TEZ-2715 is 
available, it can be used for group by and distinct if they have only one Tez 
input. If there is more than one input due to union optimization 
(OrderedGroupedMergedKVInput), the full comparator still has to be used, as 
OrderedGroupedMergedKVInput uses the comparator to merge the two underlying 
inputs.
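
The distinction between the two comparators can be sketched roughly as follows. This is an illustration only, not code from Pig or Tez: the class and method names are hypothetical, keys are assumed to be UTF-8 strings, and equal keys are assumed to always serialize to identical bytes. With a single, already sorted input, grouping only has to detect where consecutive keys change, so a comparison on raw bytes gives the same groups as a comparison that deserializes the keys first.

```java
import java.nio.charset.StandardCharsets;
import java.util.Comparator;

// Hypothetical sketch; none of these names exist in Pig or Tez.
final class KeyComparatorSketch {

    /** Raw comparison: unsigned, lexicographic, no deserialization. */
    static int compareRaw(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    /** "Full" comparison: decode both keys, then compare the decoded values. */
    static int compareFull(byte[] a, byte[] b) {
        String s = new String(a, StandardCharsets.UTF_8);
        String t = new String(b, StandardCharsets.UTF_8);
        return s.compareTo(t);
    }

    /** Counts groups in one sorted run of keys by detecting key changes. */
    static int countGroups(byte[][] sortedKeys, Comparator<byte[]> cmp) {
        int groups = 0;
        byte[] previous = null;
        for (byte[] key : sortedKeys) {
            if (previous == null || cmp.compare(previous, key) != 0) {
                groups++;
            }
            previous = key;
        }
        return groups;
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[][] keys = {
            utf8("apple"), utf8("apple"), utf8("banana"), utf8("cherry"), utf8("cherry")
        };
        // Both comparators find the same group boundaries in a single sorted run.
        System.out.println(countGroups(keys, KeyComparatorSketch::compareRaw));
        System.out.println(countGroups(keys, KeyComparatorSketch::compareFull));
    }
}
```

When several separately sorted inputs have to be merged, the merge relies on the ordering itself rather than only on equality, which is why the description keeps the full comparator for that case.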





[jira] [Commented] (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

2015-08-12 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14694035#comment-14694035
 ] 

Rohini Palaniswamy commented on PIG-1472:
-

Thanks [~thejas]. Created PIG-4656 to move to WritableUtils.writeVInt. Where 
the type also denotes the size, it will be kept as is. For example, TUPLE_0 to 
TUPLE_9 will stay, as each packs type and size into one byte. But of 
TINYTUPLE, SMALLTUPLE and TUPLE, only TUPLE will be retained, with its size 
written using WritableUtils.writeVInt.

> Optimize serialization/deserialization between Map and Reduce and between MR 
> jobs
> -
>
> Key: PIG-1472
> URL: https://issues.apache.org/jira/browse/PIG-1472
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
> Fix For: 0.8.0
>
> Attachments: PIG-1472.2.patch, PIG-1472.3.patch, PIG-1472.4.patch, 
> PIG-1472.patch
>
>
> In certain types of Pig queries, most of the execution time is spent in 
> serializing/deserializing (sedes) records between Map and Reduce and between 
> MR jobs. 
> For example, if PigMix queries are modified to specify types for all the 
> fields in the load statement schema, some of the queries (L2, L3, L9, L10 in 
> pigmix v1) that have records with bags and maps being transmitted across map 
> or reduce boundaries run a lot longer (a runtime increase of a few times has 
> been seen).
> There are a few optimizations that have been shown to improve the performance 
> of sedes in my tests:
> 1. Use a smaller number of bytes to store the length of the column. For 
> example, if a bytearray is smaller than 255 bytes, a byte can be used to store 
> the length instead of the integer that is currently used.
> 2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and 
> DataInput.readUTF. This reduces the cost of serialization by more than 1/2. 
> Zebra and BinStorage are known to use DefaultTuple sedes functionality. The 
> serialization format that these loaders use cannot change, so after the 
> optimization their format is going to be different from the format used 
> between M/R boundaries.





[jira] [Commented] (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

2015-08-12 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14694026#comment-14694026
 ] 

Thejas M Nair commented on PIG-1472:


I don't remember if I had looked into WritableUtils.writeVInt back then, or if 
it was available with the Pig version being used back then (it's been 5 years! 
:) )
Would using WritableUtils.writeVInt mean that an extra byte needs to be used 
for storing the type, i.e. bag vs. map vs. tuple?
For complex types, savings are more noticeable for smaller sizes. For a bag of 
size 32768, a one-byte saving won't be significant. However, for an int of 
value 32768, the saving of one byte is significant.




[jira] [Commented] (PIG-4652) [Pig on Tez] Key Comparison is slower than mapreduce

2015-08-12 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14694005#comment-14694005
 ] 

Rohini Palaniswamy commented on PIG-4652:
-

  Filed PIG-4656 to optimize the BinInterSedesTupleRawComparator String 
comparison so that PigTupleSortComparator does not perform too badly compared 
to blind bytes comparators.
 
  Filed TEZ-2715 to add support for getting key bytes from Reader in Tez.

> [Pig on Tez] Key Comparison is slower than mapreduce
> 
>
> Key: PIG-4652
> URL: https://issues.apache.org/jira/browse/PIG-4652
> Project: Pig
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
> Fix For: 0.16.0
>
>
> Tez is using PigTupleSortComparator on both map and reduce side and in 
> POShuffleTezLoad.  Mapreduce is using PigTupleWritableComparator on the map 
> and reduce side for comparing tuples which is byte only comparison and very 
> fast.  It then uses PigGroupingWritableComparator as the grouping 
> comparator to correctly group those keys. 
>   It is not possible to use similar method in Tez (PigTupleWritableComparator 
> for output and input and PigTupleSortComparator in POShuffleTezLoad), without 
> addition of APIs in Tez to get raw bytes of the keys. Because when we compare 
> multiple inputs for min key in POShuffleTezLoad, their raw bytes need to be 
> compared to maintain the same order as the map side. In mapreduce, there was 
> only a single input and the mapreduce framework sorted them together. But in Tez, 
> the join inputs are sorted separately and the application only gets the 
> serialized key. Need APIs in Tez KeyValuesReader to get the bytes of the 
> current key as well which can be used in POShuffleTezLoad for min key 
> comparison.





[jira] [Updated] (PIG-4652) [Pig on Tez] Key Comparison is slower than mapreduce

2015-08-12 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-4652:

Description: 
Tez is using PigTupleSortComparator on both map and reduce side and in 
POShuffleTezLoad.  Mapreduce is using PigTupleWritableComparator on the map and 
reduce side for comparing tuples which is byte only comparison and very fast.  
It then uses PigGroupingWritableComparator as the grouping comparator 
to correctly group those keys. 

  It is not possible to use similar method in Tez (PigTupleWritableComparator 
for output and input and PigTupleSortComparator in POShuffleTezLoad), without 
addition of APIs in Tez to get raw bytes of the keys. Because when we compare 
multiple inputs for min key in POShuffleTezLoad, their raw bytes need to be 
compared to maintain the same order as the map side. In mapreduce, there was 
only a single input and the mapreduce framework sorted them together. But in Tez, the 
join inputs are sorted separately and the application only gets the serialized 
key. Need APIs in Tez KeyValuesReader to get the bytes of the current key as 
well which can be used in POShuffleTezLoad for min key comparison.



  was:
Tez is using PigTupleSortComparator on both map and reduce side and in 
POShuffleTezLoad.  Mapreduce is using PigTupleWritableComparator on the map and 
reduce side for comparing tuples which is byte only comparison and very fast.  
It then uses PigGroupingWritableComparator as the grouping comparator 
to correctly group those keys. 

  It is not possible to use similar method in Tez (PigTupleWritableComparator 
for output and input and PigTupleSortComparator in POShuffleTezLoad), without 
addition of APIs in Tez to get raw bytes of the keys. Because when we compare 
multiple inputs for min key in POShuffleTezLoad, their raw bytes need to be 
compared to maintain the same order as the map side. In mapreduce, there was 
only a single input and the mapreduce framework sorted them together. But in Tez, the 
join inputs are sorted separately and the application only gets the serialized 
key. Need APIs in Tez KeyValuesReader to get the bytes of the current key as 
well which can be used in POShuffleTezLoad for min key comparison.

  But the majority of the slowness of PigTupleSortComparator seems to be coming 
from inefficiency of String comparison in BinInterSedesTupleRawComparator which 
initializes String instead of comparing bytes like Text.Comparator. 

{code}
str1 = new String(bb1.array(), bb1.position(), casz1, BinInterSedes.UTF8);
str2 = new String(bb2.array(), bb2.position(), casz2, BinInterSedes.UTF8);
{code}

Fixing that should make performance very close to mapreduce, with negligible 
difference. But following a mapreduce-like model should make it even more 
efficient.
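
A minimal sketch of the two approaches, for illustration only: the class and helper names below are hypothetical and this is not the code in BinInterSedesTupleRawComparator. It assumes array-backed buffers holding UTF-8 data. The first method materializes two String objects, as in the snippet above; the second compares the byte ranges in place, unsigned and lexicographic, which is the general idea behind byte-level comparators.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; names are illustrative, not Pig or Hadoop APIs.
final class Utf8RangeCompareSketch {

    /** Unsigned lexicographic comparison of two byte ranges, no allocation. */
    static int compareBytes(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int x = b1[s1 + i] & 0xff;
            int y = b2[s2 + i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return l1 - l2;
    }

    /** Start of the readable data in the backing array of an array-backed buffer. */
    static int start(ByteBuffer bb) {
        return bb.arrayOffset() + bb.position();
    }

    /** Decodes both ranges into new String objects before comparing. */
    static int compareViaString(ByteBuffer bb1, int len1, ByteBuffer bb2, int len2) {
        String str1 = new String(bb1.array(), start(bb1), len1, StandardCharsets.UTF_8);
        String str2 = new String(bb2.array(), start(bb2), len2, StandardCharsets.UTF_8);
        return str1.compareTo(str2);
    }

    /** Compares the same ranges directly on the bytes. */
    static int compareViaBytes(ByteBuffer bb1, int len1, ByteBuffer bb2, int len2) {
        return compareBytes(bb1.array(), start(bb1), len1, bb2.array(), start(bb2), len2);
    }

    static ByteBuffer utf8(String s) {
        return ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        ByteBuffer a = utf8("apple");
        ByteBuffer b = utf8("apricot");
        // For these keys both approaches agree on the sign.
        System.out.println(Integer.signum(compareViaString(a, a.remaining(), b, b.remaining())));
        System.out.println(Integer.signum(compareViaBytes(a, a.remaining(), b, b.remaining())));

        // Caveat: String.compareTo orders by UTF-16 code units, raw UTF-8 bytes by
        // code point, so a supplementary character (U+1F600) and a high BMP
        // character (U+FF5E) sort differently under the two approaches.
        ByteBuffer c = utf8("\uD83D\uDE00");
        ByteBuffer d = utf8("\uFF5E");
        System.out.println(Integer.signum(compareViaString(c, c.remaining(), d, d.remaining())));
        System.out.println(Integer.signum(compareViaBytes(c, c.remaining(), d, d.remaining())));
    }
}
```

The ordering caveat in the last two lines only matters where the sort order itself is observable; for detecting equal keys the two approaches agree whenever equal strings serialize to identical bytes.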



Summary: [Pig on Tez] Key Comparison is slower than mapreduce  (was: 
[Pig on Tez] Group by on multiple keys is slower than mapreduce)



[jira] [Created] (PIG-4656) Improve serialization and comparator performance in BinInterSedes

2015-08-12 Thread Rohini Palaniswamy (JIRA)
Rohini Palaniswamy created PIG-4656:
---

 Summary: Improve serialization and comparator performance in 
BinInterSedes
 Key: PIG-4656
 URL: https://issues.apache.org/jira/browse/PIG-4656
 Project: Pig
  Issue Type: Improvement
Reporter: Rohini Palaniswamy
Assignee: Rohini Palaniswamy
 Fix For: 0.16.0


Two major optimizations can be done:
  -  PIG-1472 added multiple data types to store different sizes (byte, short, 
int). This can be simplified using WritableUtils.writeVInt. There is no 
difference for byte and short compared to the current approach. But with int, 
it could be beneficial, as a lot of numbers could be written with 3 bytes 
instead of 4. For example, 32768 is written using 3 bytes with 
WritableUtils.writeVInt, whereas currently 4 bytes (int) are used. 
  -  String comparison in BinInterSedesTupleRawComparator initializes a String 
for comparison. It should instead compare bytes like Text.Comparator.
{code}
str1 = new String(bb1.array(), bb1.position(), casz1, BinInterSedes.UTF8);
str2 = new String(bb2.array(), bb2.position(), casz2, BinInterSedes.UTF8);
{code}
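
To make the size argument in the first point concrete, here is a small self-contained sketch. It does not call Hadoop; `vintSize` is a hypothetical helper that re-implements, from memory, the size rule of Hadoop's variable-length integer encoding (one byte for values from -112 to 127, otherwise one length byte plus the non-zero magnitude bytes). Treat that rule as an assumption to check against the Hadoop version in use.

```java
// Hypothetical helper; mirrors the assumed size rule of Hadoop's vint encoding.
final class VIntSizeSketch {

    /** Number of bytes a Hadoop-style variable-length encoding would use. */
    static int vintSize(long value) {
        if (value >= -112 && value <= 127) {
            return 1; // small values fit in the single leading byte
        }
        if (value < 0) {
            value = ~value; // negative values are stored as their one's complement
        }
        int dataBits = Long.SIZE - Long.numberOfLeadingZeros(value);
        return (dataBits + 7) / 8 + 1; // magnitude bytes plus one length byte
    }

    public static void main(String[] args) {
        long[] samples = {127L, 128L, 32768L, 65536L};
        for (long v : samples) {
            System.out.println(v + " -> " + vintSize(v) + " byte(s), fixed int -> 4 bytes");
        }
    }
}
```

Under this rule, 32768 takes 3 bytes, which matches the figure quoted above, while values from 65536 upward take at least as many bytes as a fixed-width int.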





[jira] [Commented] (PIG-4628) Pig 0.14 job with order by fails in mapreduce mode with Oozie

2015-08-12 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14693974#comment-14693974
 ] 

Viraj Bhat commented on PIG-4628:
-

Thanks Koji for your help.
Viraj

> Pig 0.14 job with order by fails in mapreduce mode with Oozie
> -
>
> Key: PIG-4628
> URL: https://issues.apache.org/jira/browse/PIG-4628
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.14.0, 0.15.0
>Reporter: Koji Noguchi
>Assignee: Koji Noguchi
> Fix For: 0.15.1
>
> Attachments: pig-4628-v01.patch, pig-4628-v02.patch
>
>
> A simple pig script with order-by submitted through oozie and running with 
> mapreduce-mode 
> {code}
> A = LOAD '$input' AS (a1:CHARARRAY,a2:CHARARRAY, );
> A_sorted = ORDER A BY url DESC PARALLEL 2;
> STORE A_sorted INTO '$output';
> {code}
> failed on our hadoop cluster which had security turned on.  Part of the stack 
> trace had 
> {noformat}
> 2015-06-08 22:24:39,246 WARN [main] org.apache.hadoop.mapred.YarnChild: 
> Exception running child : java.lang.RuntimeException: 
> java.lang.RuntimeException: java.io.IOException: Exception reading 
> file:/tmp/2/yarn-local/usercache/userA/appcache/application_1432697993142_199266/container_e06_1432697993142_199266_01_03/container_tokens
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.init(WeightedRangePartitioner.java:155)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:75)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:58)
>   at 
> org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:712)
>   at 
> org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
>   at 
> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Map.collect(PigGenericMapReduce.java:135)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:281)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:274)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:784)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694)
>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> {noformat}
> This failing job was from application_1432697993142_199305 and the error path 
> was from application_1432697993142_199266, which was an Oozie pig-launcher job.





[jira] [Updated] (PIG-4628) Pig 0.14 job with order by fails in mapreduce mode with Oozie

2015-08-12 Thread Koji Noguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi updated PIG-4628:
--
   Resolution: Fixed
Fix Version/s: 0.15.1
   Status: Resolved  (was: Patch Available)

Patch committed to 0.15 and trunk.  Thanks for the review [~rohini]! 



Build failed in Jenkins: Pig-trunk-commit #2223

2015-08-12 Thread Apache Jenkins Server
See 

Changes:

[rohini] Fix double application of patch and code duplication in PIG-4651

[rohini] PIG-4651: Optimize NullablePartitionWritable serialization for skewed 
join (rohini)

[rohini] PIG-4627: [Pig on Tez] Self join does not handle null values correctly 
(rohini)

--
[...truncated 4414 lines...]
[junit] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 36.902 sec
[junit] Running org.apache.pig.test.TestNewPlanListener
[junit] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.369 sec
[junit] Running org.apache.pig.test.TestNewPlanLogToPhyTranslationVisitor
[junit] Tests run: 27, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.921 sec
[junit] Running org.apache.pig.test.TestNewPlanLogicalOptimizer
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.396 sec
[junit] Running org.apache.pig.test.TestNewPlanOperatorPlan
[junit] Tests run: 47, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.099 sec
[junit] Running org.apache.pig.test.TestNewPlanPruneMapKeys
[junit] Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 3.092 sec
[junit] Running org.apache.pig.test.TestNewPlanPushDownForeachFlatten
[junit] Tests run: 45, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 7.585 sec
[junit] Running org.apache.pig.test.TestNewPlanPushUpFilter
[junit] Tests run: 46, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 7.848 sec
[junit] Running org.apache.pig.test.TestNewPlanRule
[junit] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.42 sec
[junit] Running org.apache.pig.test.TestNotEqualTo
[junit] Tests run: 28, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.46 sec
[junit] Running org.apache.pig.test.TestNull
[junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.476 sec
[junit] Running org.apache.pig.test.TestNullConstant
[junit] Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 23.146 sec
[junit] Running org.apache.pig.test.TestNumberOfReducers
[junit] Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 527.615 sec
[junit] Running org.apache.pig.test.TestOptimizeLimit
[junit] Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.102 sec
[junit] Running org.apache.pig.test.TestOrderBy3
[junit] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 13.336 sec
[junit] Running org.apache.pig.test.TestPOBinCond
[junit] Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.455 sec
[junit] Running org.apache.pig.test.TestPOCast
[junit] Tests run: 14, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.892 sec
[junit] Running org.apache.pig.test.TestPODistinct
[junit] Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.395 sec
[junit] Running org.apache.pig.test.TestPOGenerate
[junit] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.377 sec
[junit] Running org.apache.pig.test.TestPOMapLookUp
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.361 sec
[junit] Running org.apache.pig.test.TestPONegative
[junit] Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 6.116 sec
[junit] Running org.apache.pig.test.TestPOPartialAgg
[junit] Tests run: 16, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 3.861 sec
[junit] Running org.apache.pig.test.TestPOPartialAggPlan
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 0.247 sec
[junit] Running org.apache.pig.test.TestPORegexp
[junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.376 sec
[junit] Running org.apache.pig.test.TestPOSort
[junit] Tests run: 14, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.45 sec
[junit] Running org.apache.pig.test.TestPOSplit
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.369 sec
[junit] Running org.apache.pig.test.TestPOUserFunc
[junit] Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.45 sec
[junit] Running org.apache.pig.test.TestPackage
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 8.459 sec
[junit] Running org.apache.pig.test.TestParamSubPreproc
[junit] Tests run: 36, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.463 sec
[junit] Running org.apache.pig.test.TestParser
[junit] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.988 sec
[junit] Running org.apache.pig.test.TestPhyOp
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.477 sec
[junit] Running org.apache.pig.test.TestPhyPatternMatch
[junit] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.424 sec
   

[jira] [Updated] (PIG-4627) [Pig on Tez] Self join does not handle null values correctly

2015-08-12 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-4627:

  Resolution: Fixed
Hadoop Flags: Reviewed
  Status: Resolved  (was: Patch Available)

Committed to trunk and branch-0.15. Thanks for the review Daniel.

> [Pig on Tez] Self join does not handle null values correctly
> 
>
> Key: PIG-4627
> URL: https://issues.apache.org/jira/browse/PIG-4627
> Project: Pig
>  Issue Type: Bug
> Reporter: Rohini Palaniswamy
> Assignee: Rohini Palaniswamy
> Fix For: 0.16.0, 0.15.1
>
> Attachments: PIG-4627-1.patch
>
>
> Self join does not produce correct results when null values are involved 
> after PIG-4495, which writes multiple inputs into the same Tez input. The 
> PIG-3761 fix 
> (https://issues.apache.org/jira/secure/attachment/12628162/PIG-3761-1.patch) 
> is needed to handle that by comparing indexes in the raw comparators.
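
The idea of comparing indexes for null keys can be illustrated with a small sketch. This is not the code from the PIG-3761 patch, and it works on objects rather than raw bytes; IndexedKey and NullAwareKeyComparator are hypothetical names used only for illustration. It assumes each key carries the index of the join input it came from, and that two null keys should compare as equal only when they share that index.

```java
import java.util.Comparator;

// Hypothetical stand-in for a join key tagged with the index of its input.
final class IndexedKey {
    final String key;  // null models a null join key
    final int index;   // which join input produced the record

    IndexedKey(String key, int index) {
        this.key = key;
        this.index = index;
    }
}

// Hypothetical comparator: non-null keys compare by value only, while
// null keys fall back to comparing the input index.
final class NullAwareKeyComparator implements Comparator<IndexedKey> {
    @Override
    public int compare(IndexedKey a, IndexedKey b) {
        if (a.key == null && b.key == null) {
            // Nulls from different inputs must not compare as equal,
            // otherwise they would be grouped and joined with each other.
            return Integer.compare(a.index, b.index);
        }
        if (a.key == null) {
            return -1; // assumption: nulls sort before non-null keys
        }
        if (b.key == null) {
            return 1;
        }
        return a.key.compareTo(b.key);
    }
}
```

With this ordering, equal non-null keys from the two sides of a self join still meet in one group, while null keys from input 0 and input 1 stay in separate groups.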



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (PIG-4655) Support InputStats in spark mode

2015-08-12 Thread kexianda (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kexianda reassigned PIG-4655:
-

Assignee: kexianda

> Support InputStats in spark mode
> 
>
> Key: PIG-4655
> URL: https://issues.apache.org/jira/browse/PIG-4655
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
> Reporter: kexianda
> Assignee: kexianda
> Fix For: spark-branch
>
>
> Currently, InputStats is not implemented in spark mode. 
> As a result, the JUnit test case TestPigRunner.testEmptyFileCounter() fails.





[jira] [Created] (PIG-4655) Support InputStats in spark mode

2015-08-12 Thread kexianda (JIRA)
kexianda created PIG-4655:
-

 Summary: Support InputStats in spark mode
 Key: PIG-4655
 URL: https://issues.apache.org/jira/browse/PIG-4655
 Project: Pig
  Issue Type: Sub-task
Reporter: kexianda


Currently, InputStats is not implemented in spark mode. 
As a result, the JUnit test case TestPigRunner.testEmptyFileCounter() fails.





[jira] [Commented] (PIG-4634) Fix records count issues in output statistics

2015-08-12 Thread kexianda (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14693038#comment-14693038
 ] 

kexianda commented on PIG-4634:
---

Hi [~mohitsabharwal] & [~xuefuz],
PIG-4634-3.patch is attached. Would you please help review the code?

1. Implement the record count logic using SparkCounter:
(a) SparkPigStatusReporter.java: a singleton factory for obtaining SparkCounter 
instances.
(b) Create a new SparkCounter in StoreConverter.convert() and increment the 
counter in FromTupleFunction.
We append the key of the store operator to the counter name (in 
SparkStatsUtil.getStoreSparkCounterName()) to avoid counter name conflicts 
when output files have the same short name (say, /tmp1/output & /tmp2/output).

2. Some minor changes/fixes:
(a) Set pigContext when initializing SparkPigStats.
(b) Implement getOutputAlias() in spark mode.
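
The counter naming scheme in 1(b) can be sketched roughly as follows. This is a minimal illustration, not the code in the patch: StoreCounterNames, shortName(), counterName(), the "Output records in " prefix, and the "scope-NN" operator keys are all assumptions made for the example.

```java
// Hypothetical helper; none of these names come from the Pig code base.
final class StoreCounterNames {
    private StoreCounterNames() {}

    // Last path component, e.g. "/tmp1/output" -> "output".
    static String shortName(String outputPath) {
        int slash = outputPath.lastIndexOf('/');
        return slash < 0 ? outputPath : outputPath.substring(slash + 1);
    }

    // Appending the store operator key keeps counter names distinct even
    // when two output paths share the same short name.
    static String counterName(String outputPath, String operatorKey) {
        return "Output records in " + shortName(outputPath) + "_" + operatorKey;
    }
}
```

Here /tmp1/output and /tmp2/output both have the short name "output", but as long as the two store operators have different keys (for example the hypothetical "scope-10" and "scope-20"), the resulting counter names differ.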


How to test:
Run TestPigRunner.simpleTest()

> Fix records count issues in output statistics
> -
>
> Key: PIG-4634
> URL: https://issues.apache.org/jira/browse/PIG-4634
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
> Reporter: kexianda
> Assignee: kexianda
> Fix For: spark-branch
>
> Attachments: PIG-4634-3.patch, PIG-4634.patch, PIG-4634_2.patch
>
>
> Test cases simpleTest() and simpleTest2() in TestPigRunner failed because of 
> the following issues:
> 1. The pig context in SparkPigStats isn't initialized.
> 2. The record count logic hasn't been implemented.
> 3. getOutputAlias(), getPigProperties(), getBytesWritten() and 
> getRecordWritten() have not been implemented.





[jira] [Updated] (PIG-4634) Fix records count issues in output statistics

2015-08-12 Thread kexianda (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kexianda updated PIG-4634:
--
Attachment: PIG-4634-3.patch

> Fix records count issues in output statistics
> -
>
> Key: PIG-4634
> URL: https://issues.apache.org/jira/browse/PIG-4634
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
> Reporter: kexianda
> Assignee: kexianda
> Fix For: spark-branch
>
> Attachments: PIG-4634-3.patch, PIG-4634.patch, PIG-4634_2.patch
>
>
> Test cases simpleTest() and simpleTest2() in TestPigRunner failed because of 
> the following issues:
> 1. The pig context in SparkPigStats isn't initialized.
> 2. The record count logic hasn't been implemented.
> 3. getOutputAlias(), getPigProperties(), getBytesWritten() and 
> getRecordWritten() have not been implemented.


