[jira] [Updated] (PIG-3876) Handle two outputs from split going to same input in MultiQueryOptimizer

2014-10-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3876:

Fix Version/s: (was: 0.14.0)
   0.15.0

> Handle two outputs from split going to same input in MultiQueryOptimizer
> 
>
> Key: PIG-3876
> URL: https://issues.apache.org/jira/browse/PIG-3876
> Project: Pig
>  Issue Type: Sub-task
>  Components: tez
>Reporter: Rohini Palaniswamy
> Fix For: 0.15.0
>
>
> MultiQueryOptimizerTez.java
> {code}
> // Detect diamond shape, we cannot merge it into split, since Tez
> // does not handle double edge between vertexes
> boolean sharedSucc = false;
> if (getPlan().getSuccessors(successor) != null) {
>     for (TezOperator succ_successor : getPlan().getSuccessors(successor)) {
>         if (succ_successors.contains(succ_successor)) {
>             sharedSucc = true;
>             break;
>         }
>     }
>     succ_successors.addAll(getPlan().getSuccessors(successor));
> }
> {code}
> SPLIT A INTO B if <cond1>, C if <cond2>;
> D = JOIN B by x, C by x;
> We would like to do:
> V1 - Split (B -> V2, C -> V2)
> V2 - Join B and C
> Without the check for shared successors, the above plan is created, but B and
> C create two separate edges between V1 and V2, which is not supported by Tez.
> Since the splits are not fully merged into the POSplit, we currently have
> V1 - Split (B -> V3, C -> V2 with just POValueOutputTez)
> V2 - LocalRearrange -> V3
> V3 - Join B and C
> We need to remove the check, merge them into the POSplit, and fix this case
> so that B and C both write to the same edge. Being more aggressive in
> multi-query merging improves performance.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-3912) Evalfunc's getCacheFiles should also support local mode

2014-10-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3912:

Fix Version/s: (was: 0.14.0)
   0.15.0

> Evalfunc's getCacheFiles should also support local mode
> ---
>
> Key: PIG-3912
> URL: https://issues.apache.org/jira/browse/PIG-3912
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.12.1, 0.13.0
>Reporter: Aniket Mokashi
> Fix For: 0.15.0
>
>
> EvalFunc's getCacheFiles only supports files on HDFS right now. We should
> also support local files so that UDFs that use this API can run in local
> (auto-local) mode.
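> For illustration, a minimal sketch of a UDF using this API today (my
> example, not from this ticket; the class name and paths are made up):
> {code}
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
> import org.apache.pig.EvalFunc;
> import org.apache.pig.data.Tuple;
>
> public class LookupUDF extends EvalFunc<String> {
>     @Override
>     public List<String> getCacheFiles() {
>         // "path#symlink": the file becomes readable as ./lookup.txt in the
>         // task's working directory. Today this path must be on HDFS; the
>         // ask is to also accept a local path so the same UDF works in
>         // (auto-)local mode.
>         List<String> files = new ArrayList<String>();
>         files.add("/user/pig/lookup.txt#lookup.txt");
>         return files;
>     }
>
>     @Override
>     public String exec(Tuple input) throws IOException {
>         // ... look the key up in ./lookup.txt ...
>         return null;
>     }
> }
> {code}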



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-3961) Adding HBaseStorage cell value filters

2014-10-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3961:

Fix Version/s: (was: 0.14.0)
   0.15.0

> Adding HBaseStorage cell value filters
> --
>
> Key: PIG-3961
> URL: https://issues.apache.org/jira/browse/PIG-3961
> Project: Pig
>  Issue Type: New Feature
>Reporter: Mike Welch
>Assignee: Mike Welch
>Priority: Minor
> Fix For: 0.15.0
>
> Attachments: filters-patch.v2.diff
>
>
> Adding three additional server-side filtering options when loading data
> with HBaseStorage:
> # specified cf:col does not exist
> {{-null cf:col}}
> # specified cf:col must exist
> {{-notnull cf:col}}
> # specified cf:col contains the given value
> {{-val cf:col=value}}
> These are meant to replace (and optimize, by reducing data transfer) the
> frequent paradigm in Pig of loading data and immediately filtering for a
> specific condition. For example:
> {code}
> data = load 'hbase://mytable' using
> org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:*') as (cf:map[]);
> data_with_value = filter data by cf#'col' == 'value';
> {code}
> can be replaced with:
> {code}
> data_with_value = load 'hbase://mytable' using
> org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:*', '-val cf:col=value')
> as (cf:map[]);
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-3979) group all performance, garbage collection, and incremental aggregation

2014-10-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3979:

Fix Version/s: (was: 0.14.0)
   0.15.0

> group all performance, garbage collection, and incremental aggregation
> --
>
> Key: PIG-3979
> URL: https://issues.apache.org/jira/browse/PIG-3979
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.12.0, 0.11.1
>Reporter: David Dreyfus
>Assignee: David Dreyfus
> Fix For: 0.15.0
>
> Attachments: PIG-3979-v1.patch, POPartialAgg.java.patch, 
> SpillableMemoryManager.java.patch
>
>
> I have a Pig statement similar to:
> {code}
> summary = foreach (group data ALL) generate
>     COUNT(data.col1), SUM(data.col2), SUM(data.col2),
>     Moments(col3),
>     Moments(data.col4);
> {code}
> There are a couple of hundred columns.
> I set the following:
> {code}
> SET pig.exec.mapPartAgg true;
> SET pig.exec.mapPartAgg.minReduction 3;
> SET pig.cachedbag.memusage 0.05;
> {code}
> I found that when I ran this on a JVM with insufficient memory, the process
> eventually timed out because of an infinite garbage collection loop.
> The problem was invariant to the memusage setting.
> I solved the problem by making changes to:
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperator.POPartialAgg.java
> Rather than reading in a fixed number of records to establish an estimate of
> the reduction, I make an estimate after reading in enough tuples to fill
> pig.cachedbag.memusage percent of Runtime.getRuntime().maxMemory().
> I also made a change to guarantee that at least one record is allowed in
> second-tier storage. In the current implementation, if the reduction is very
> high (e.g. 1000:1), the space in second-tier storage is zero.
> With these changes, I can summarize large data sets with small JVMs. I also 
> find that setting pig.cachedbag.memusage to a small number such as 0.05 
> results in much better garbage collection performance without reducing 
> throughput. I suppose tuning GC would also solve a problem with excessive 
> garbage collection.
> The performance is sweet. 
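> A rough sketch of the estimation idea described above (my illustration, not
> the attached patch; variable and helper names are made up):
> {code}
> // Sample until the buffered tuples fill pig.cachedbag.memusage percent of
> // the heap, instead of sampling a fixed record count.
> double memUsage = 0.05; // pig.cachedbag.memusage
> long memBudget = (long) (Runtime.getRuntime().maxMemory() * memUsage);
> long sampledBytes = 0;
> int sampledTuples = 0;
> while (sampledBytes < memBudget && input.hasNext()) {
>     Tuple t = input.next();
>     sampledBytes += t.getMemorySize(); // Tuple's own heap-size estimate
>     sampledTuples++;
>     firstTier.add(t);
> }
> double reduction = estimateReduction(firstTier);
> // Guarantee room for at least one record in second-tier storage, even when
> // the reduction is very high (e.g. 1000:1) and the computed size would
> // otherwise round down to zero.
> int secondTierSize = Math.max(1, (int) (sampledTuples / reduction));
> {code}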



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-3981) Use Tez counter to collect per output stats

2014-10-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3981:

Fix Version/s: (was: 0.14.0)
   0.15.0

> Use Tez counter to collect per output stats
> ---
>
> Key: PIG-3981
> URL: https://issues.apache.org/jira/browse/PIG-3981
> Project: Pig
>  Issue Type: Improvement
>  Components: tez
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.15.0
>
>
> Pig currently goes to HDFS to collect that information. That takes from
> several hundred milliseconds to several seconds, which adds overhead on the
> namenode and hurts performance, especially for iterative queries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PIG-4011) Pig should not put PigContext in job.jar to help jar dedup

2014-10-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai resolved PIG-4011.
-
Resolution: Fixed

This is solved as part of PIG-4054.

> Pig should not put PigContext in job.jar to help jar dedup
> --
>
> Key: PIG-4011
> URL: https://issues.apache.org/jira/browse/PIG-4011
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.13.0
>Reporter: Aniket Mokashi
>Assignee: Daniel Dai
> Fix For: 0.14.0
>
>
> Job jars are largely identical from job to job (just the pig classes and
> their dependencies). However, there is a PigContext in the job jar, and that
> seems to change from job to job. As a result, the job jars are basically
> uncacheable in the shared cache. It would be great if we could find a way to
> separate the PigContext and store it in the distributed cache, separate from
> the (stable) job jar. That would give us a better chance to cache the job jar.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4004) Upgrade the Pigmix queries from the (old) mapred API to mapreduce

2014-10-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-4004:

Fix Version/s: (was: 0.14.0)
   0.15.0

> Upgrade the Pigmix queries from the (old) mapred API to mapreduce
> -
>
> Key: PIG-4004
> URL: https://issues.apache.org/jira/browse/PIG-4004
> Project: Pig
>  Issue Type: Bug
>  Components: tools
>Affects Versions: 0.12.1
>Reporter: Keren Ouaknine
> Fix For: 0.15.0
>
> Attachments: PIG-4004.1.patch
>
>
> Until now, the Pigmix queries were written using the old mapred API.
> As a result, some queries were expressed as three concatenated MR jobs
> instead of one. I rewrote all the queries to match the newer mapreduce API
> and optimized them along the way.
> This is a continuation of the work in PIG-3915.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-3993) Implement illustrate in Tez

2014-10-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3993:

Fix Version/s: (was: 0.14.0)
   0.15.0

> Implement illustrate in Tez
> ---
>
> Key: PIG-3993
> URL: https://issues.apache.org/jira/browse/PIG-3993
> Project: Pig
>  Issue Type: Sub-task
>  Components: tez
>Reporter: Rohini Palaniswamy
> Fix For: 0.15.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4049) Improve performance of Limit following an Orderby on Tez

2014-10-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-4049:

Fix Version/s: (was: 0.14.0)
   0.15.0

> Improve performance of Limit following an Orderby on Tez
> 
>
> Key: PIG-4049
> URL: https://issues.apache.org/jira/browse/PIG-4049
> Project: Pig
>  Issue Type: Sub-task
>  Components: tez
>Reporter: Rohini Palaniswamy
> Fix For: 0.15.0
>
>
> Better algorithms can be applied to improve performance for a LIMIT
> following an ORDER BY.
> For example:
> {code}
> A = LOAD '/tmp/data' ...;
> B = ORDER A by $0 parallel 100;
> C = LIMIT B 100;
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4045) Add e2e tests for AvroStorage

2014-10-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-4045:

Fix Version/s: (was: 0.14.0)
   0.15.0

> Add e2e tests for AvroStorage
> -
>
> Key: PIG-4045
> URL: https://issues.apache.org/jira/browse/PIG-4045
> Project: Pig
>  Issue Type: Improvement
>  Components: e2e harness
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.15.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4072) secondary sort optimizer to support multiple predecessors

2014-10-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-4072:

Fix Version/s: (was: 0.14.0)
   0.15.0

> secondary sort optimizer to support multiple predecessors
> -
>
> Key: PIG-4072
> URL: https://issues.apache.org/jira/browse/PIG-4072
> Project: Pig
>  Issue Type: Sub-task
>  Components: impl
>Reporter: Daniel Dai
> Fix For: 0.15.0
>
>
> As described in PIG-4064, SecondaryOptimizer does not handle two 
> predecessors. When we process the first predecessor, we remove the foreach 
> inner operators from the reduce side, and the second predecessor cannot see 
> them. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4093) Predicate pushdown to support removing filters from pig plan

2014-10-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-4093:

Fix Version/s: (was: 0.14.0)
   0.15.0

> Predicate pushdown to support removing filters from pig plan
> 
>
> Key: PIG-4093
> URL: https://issues.apache.org/jira/browse/PIG-4093
> Project: Pig
>  Issue Type: Sub-task
>Reporter: Rohini Palaniswamy
> Fix For: 0.15.0
>
>
> It is possible for the loaders to evaluate the pushed filter conditions.
> In that case it is not necessary to retain the filter conditions in the pig
> plan, so we need to support two modes:
> 1) filter conditions are pushed into the loader but also retained in the pig
> plan, as the loader might do only best-effort filtering based on block
> metadata
> 2) filter conditions are pushed into the loader and removed from the pig
> plan, when the loader can evaluate the expression itself and filter out
> records. In this case, the loader can do lazy deserialization and avoid
> deserializing the full record. (A sketch of this contract follows.)
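> A hypothetical sketch of what the loader-side contract could look like
> (this is not an existing Pig interface, only an illustration of the two
> modes):
> {code}
> // Hypothetical; not Pig's actual API.
> public interface LoadPredicateRemoval {
>     /**
>      * Called after a predicate has been pushed to the loader. Returns true
>      * if the loader fully evaluates the predicate per record, so the FILTER
>      * can be removed from the pig plan (mode 2). Returns false if the
>      * loader only does best-effort filtering, e.g. based on block metadata,
>      * so the FILTER must be retained (mode 1).
>      */
>     boolean isPredicateFullyApplied();
> }
> {code}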



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4094) Predicate pushdown to support complex data types

2014-10-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-4094:

Fix Version/s: (was: 0.14.0)
   0.15.0

> Predicate pushdown to support complex data types
> 
>
> Key: PIG-4094
> URL: https://issues.apache.org/jira/browse/PIG-4094
> Project: Pig
>  Issue Type: Sub-task
>Reporter: Rohini Palaniswamy
> Fix For: 0.15.0
>
>
>   Parquet has support for pushing predicates on tuples, maps and bags
> according to [~aniket486]. ORC currently only supports primitives, but will
> add support for structs (tuples) in the future. The API needs to be there
> even if not implemented, as it will be hard to change the interface once
> released.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4095) Collapse multiple OR conditions to IN and BETWEEN

2014-10-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-4095:

Fix Version/s: (was: 0.14.0)
   0.15.0

> Collapse multiple OR conditions to IN and BETWEEN
> -
>
> Key: PIG-4095
> URL: https://issues.apache.org/jira/browse/PIG-4095
> Project: Pig
>  Issue Type: Sub-task
>Reporter: Rohini Palaniswamy
> Fix For: 0.15.0
>
>
>   ORC predicate pushdown supports IN and BETWEEN operators. We need
> equivalent expressions in Pig.
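> For illustration (example mine, not from the ticket), the kind of rewrite
> intended:
> {code}
> -- as written by the user:
> b = FILTER a BY (x == 1 OR x == 2 OR x == 3);
> -- collapsed form, which maps directly onto ORC's IN predicate:
> b = FILTER a BY x IN (1, 2, 3);
> -- similarly, (y >= 10 AND y <= 20) could map onto ORC's BETWEEN.
> {code}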



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4096) Implement predicate pushdown evaluation in OrcStorage

2014-10-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-4096:

Fix Version/s: (was: 0.14.0)
   0.15.0

> Implement predicate pushdown evaluation in OrcStorage
> -
>
> Key: PIG-4096
> URL: https://issues.apache.org/jira/browse/PIG-4096
> Project: Pig
>  Issue Type: Sub-task
>Reporter: Rohini Palaniswamy
> Fix For: 0.15.0
>
>
>   Parquet supports predicate evaluation and filtering of records natively
> via
> https://github.com/apache/incubator-parquet-mr/blob/master/parquet-column/src/main/java/parquet/filter2/predicate/FilterApi.java.
> It would be good to have the same in the ORC reader, but it is not available
> right now. So we should evaluate the predicate pushdown condition in
> OrcStorage itself and filter records there. This requires PIG-4093.
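> A rough sketch of the evaluation loop this implies (hypothetical; the
> helper names are made up, not OrcStorage internals):
> {code}
> // Inside OrcStorage.getNext(): ORC can only skip whole stripes/row groups
> // using min/max statistics, so re-check the pushed predicate per tuple.
> Tuple t;
> while ((t = readNextTuple()) != null) {
>     if (pushedPredicate == null || evaluate(pushedPredicate, t)) {
>         return t;
>     }
>     // tuple filtered out; keep reading
> }
> return null;
> {code}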



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4120) Broadcast the index file in case of POMergeCoGroup and POMergeJoin

2014-10-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-4120:

Fix Version/s: (was: 0.14.0)
   0.15.0

> Broadcast the index file in case of POMergeCoGroup and POMergeJoin
> --
>
> Key: PIG-4120
> URL: https://issues.apache.org/jira/browse/PIG-4120
> Project: Pig
>  Issue Type: Sub-task
>  Components: tez
>Reporter: Rohini Palaniswamy
> Fix For: 0.15.0
>
>
> Currently merge join and merge cogroup use two DAGs - the first DAG creates
> the index file in HDFS and the second DAG does the merge join. Similar to
> replicated join, we can broadcast the index file, cache it, and use it in
> merge join and merge cogroup. This will give better performance and also
> eliminate the need for the second DAG.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4130) Store/Load the same file fails for AvroStorage/OrcStorage, etc

2014-10-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-4130:

Fix Version/s: (was: 0.14.0)
   0.15.0

> Store/Load the same file fails for AvroStorage/OrcStorage, etc
> --
>
> Key: PIG-4130
> URL: https://issues.apache.org/jira/browse/PIG-4130
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
>Priority: Minor
> Fix For: 0.15.0
>
>
> The following script fails:
> {code}
> a = load '/user/pig/tests/data/singlefile/studenttab10k' as (name:chararray, 
> age:int, gpa:float);
> store a into 'Avro.intermediate' using OrcStorage();
> b = load 'Avro.intermediate' using OrcStorage();
> c = filter b by age < 30;
> store c into 'ooo';
> {code}
> Message:
>  Invalid field projection. Projected
> field \[age\] does not exist.
> If we put an "exec" after the first store, the script succeeds.
> Pig does compile the script into two MR jobs and correctly figures out the
> dependency between the two, but it still needs to go to "Avro.intermediate"
> for the schema of b when compiling, and at that time "Avro.intermediate"
> does not exist. This also happens with other loaders that need to get the
> schema from the input file, such as AvroStorage, OrcStorage, etc.
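> For reference, the workaround is to force the first job to run before the
> second load is compiled:
> {code}
> store a into 'Avro.intermediate' using OrcStorage();
> exec;  -- runs all pending stores, so 'Avro.intermediate' exists on disk
> b = load 'Avro.intermediate' using OrcStorage();
> {code}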



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4151) Pig Cannot Write Empty Maps to HBase

2014-10-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-4151:

Status: Patch Available  (was: Open)

> Pig Cannot Write Empty Maps to HBase
> 
>
> Key: PIG-4151
> URL: https://issues.apache.org/jira/browse/PIG-4151
> Project: Pig
>  Issue Type: Bug
>  Components: internal-udfs
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.14.0
>
> Attachments: PIG-4151-1.patch
>
>
> Pig is unable to write empty maps to HBase. Instructions to reproduce:
> Input file pig_data_bad.txt:
> {code}
> row1;Homer;Morrison;[1#Silvia,2#Stacy]
> row2;Sheila;Fletcher;[1#Becky,2#Salvador,3#Lois]
> row4;Andre;Morton;[1#Nancy]
> row3;Sonja;Webb;[]
> {code}
> Create the table in HBase:
> {code}
> create 'test', 'info', 'friends'
> {code}
> Pig script:
> {code}
> source = LOAD '/pig_data_bad.txt' USING PigStorage(';') AS (row:chararray, 
> first_name:chararray, last_name:chararray, friends:map[]);
> STORE source INTO 'hbase://test' USING 
> org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:fname info:lname 
> friends:*');
> {code}
> Stack trace:
> java.lang.NullPointerException
> at 
> org.apache.pig.backend.hadoop.hbase.HBaseStorage.putNext(HBaseStorage.java:880)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:139)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:98)
> at 
> org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
> at 
> org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
> at 
> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.collect(PigMapOnly.java:48)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:284)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:277)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
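> The NPE appears to come from putNext iterating the map for the
> {{friends:*}} column. A hypothetical guard (not the attached patch;
> {{objToBytes}} stands in for HBaseStorage's own conversion helper):
> {code}
> @SuppressWarnings("unchecked")
> private void addMapColumns(Put put, byte[] cf, Object field) throws IOException {
>     Map<String, Object> cfMap = (Map<String, Object>) field;
>     if (cfMap == null || cfMap.isEmpty()) {
>         return; // empty map: write no cells for this family instead of NPE
>     }
>     for (Map.Entry<String, Object> entry : cfMap.entrySet()) {
>         Object value = entry.getValue();
>         put.add(cf, Bytes.toBytes(entry.getKey()),
>                 objToBytes(value, DataType.findType(value)));
>     }
> }
> {code}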



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4218) Pig OrcStorage fail to load a map with null key

2014-10-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-4218:

  Resolution: Fixed
Hadoop Flags: Reviewed
  Status: Resolved  (was: Patch Available)

Patch committed to both trunk and 0.14 branch. Thanks Rohini for review!

> Pig OrcStorage fail to load a map with null key
> ---
>
> Key: PIG-4218
> URL: https://issues.apache.org/jira/browse/PIG-4218
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.14.0
>
> Attachments: PIG-4218-1.patch, nullmapkey.orc
>
>
> Error message:
> Backend error message
> -
> AttemptID:attempt_1403634189382_0006_m_00_1 Info:Error: 
> java.lang.NullPointerException
> at org.apache.pig.impl.util.orc.OrcUtils.convertOrcToPig(OrcUtils.java:97)
> at org.apache.pig.impl.util.orc.OrcUtils.convertOrcToPig(OrcUtils.java:82)
> at org.apache.pig.builtin.OrcStorage.getNext(OrcStorage.java:312)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:211)
> at 
> org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:533)
> at 
> org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
> at 
> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4175) PIG CROSS operation follow by STORE produces non-deterministic results each run

2014-10-01 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14155749#comment-14155749
 ] 

Rohini Palaniswamy commented on PIG-4175:
-

Committed PIG-4175-additional-1.patch to both the 0.14 branch and trunk.

> PIG CROSS operation follow by STORE produces non-deterministic results each 
> run
> ---
>
> Key: PIG-4175
> URL: https://issues.apache.org/jira/browse/PIG-4175
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.11, 0.12.0
> Environment: RHEL 6/64-bit
>Reporter: Jim Huang
>Assignee: Daniel Dai
> Fix For: 0.14.0
>
> Attachments: PIG-4175-1.patch, PIG-4175-Debug.patch, 
> PIG-4175-additional-1.patch, mktestdata.py, pig_testcross_plan.png, 
> test_cross.out, test_cross.pig
>
>
> Three files are attached to help visualize this issue:
> 1. mktestdata.py - generates test data to feed the Pig script
> 2. test_cross.pig - the Pig script using CROSS and STORE
> 3. test_cross.out - the Pig console output showing the input/output record
> delta
> To reproduce this Pig CROSS problem, use the supplied Python script,
> mktestdata.py, to generate an input file that is at least 13,948,228,930
> bytes (> 13GB).
> The CROSS between raw_data (m records) and cross_count (1 record) should
> yield exactly m records as the output.
> Instead, the STORE results from the CROSS operation yielded only about 1/3
> of the input records in raw_data.
> If I joined both of the CROSS operations together, the STORE results yielded
> about 2/3 of the input records in raw_data:
> -- data = CROSS raw_data, field04s_count, subsection1_field04s_count,
> subsection2_field04s_count;
> We have reproduced this on both Pig 0.11 (Hadoop 1.x) and Pig 0.12 (Hadoop
> 2.x) clusters.
> The default HDFS block size is 128MB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4218) Pig OrcStorage fail to load a map with null key

2014-10-01 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14155738#comment-14155738
 ] 

Rohini Palaniswamy commented on PIG-4218:
-

Clarified with Daniel why we do not keep null keys in Pig. Pig can support
null keys, but we drop them only to match the behavior of Hive 0.14 and
HCatLoader while processing.

+1.  

> Pig OrcStorage fail to load a map with null key
> ---
>
> Key: PIG-4218
> URL: https://issues.apache.org/jira/browse/PIG-4218
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.14.0
>
> Attachments: PIG-4218-1.patch, nullmapkey.orc
>
>
> Error message:
> Backend error message
> -
> AttemptID:attempt_1403634189382_0006_m_00_1 Info:Error: 
> java.lang.NullPointerException
> at org.apache.pig.impl.util.orc.OrcUtils.convertOrcToPig(OrcUtils.java:97)
> at org.apache.pig.impl.util.orc.OrcUtils.convertOrcToPig(OrcUtils.java:82)
> at org.apache.pig.builtin.OrcStorage.getNext(OrcStorage.java:312)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:211)
> at 
> org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:533)
> at 
> org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
> at 
> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4175) PIG CROSS operation follow by STORE produces non-deterministic results each run

2014-10-01 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14155650#comment-14155650
 ] 

Daniel Dai commented on PIG-4175:
-

+1

> PIG CROSS operation follow by STORE produces non-deterministic results each 
> run
> ---
>
> Key: PIG-4175
> URL: https://issues.apache.org/jira/browse/PIG-4175
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.11, 0.12.0
> Environment: RHEL 6/64-bit
>Reporter: Jim Huang
>Assignee: Daniel Dai
> Fix For: 0.14.0
>
> Attachments: PIG-4175-1.patch, PIG-4175-Debug.patch, 
> PIG-4175-additional-1.patch, mktestdata.py, pig_testcross_plan.png, 
> test_cross.out, test_cross.pig
>
>
> Three files are attached to help visualize this issue:
> 1. mktestdata.py - generates test data to feed the Pig script
> 2. test_cross.pig - the Pig script using CROSS and STORE
> 3. test_cross.out - the Pig console output showing the input/output record
> delta
> To reproduce this Pig CROSS problem, use the supplied Python script,
> mktestdata.py, to generate an input file that is at least 13,948,228,930
> bytes (> 13GB).
> The CROSS between raw_data (m records) and cross_count (1 record) should
> yield exactly m records as the output.
> Instead, the STORE results from the CROSS operation yielded only about 1/3
> of the input records in raw_data.
> If I joined both of the CROSS operations together, the STORE results yielded
> about 2/3 of the input records in raw_data:
> -- data = CROSS raw_data, field04s_count, subsection1_field04s_count,
> subsection2_field04s_count;
> We have reproduced this on both Pig 0.11 (Hadoop 1.x) and Pig 0.12 (Hadoop
> 2.x) clusters.
> The default HDFS block size is 128MB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PIG-4164) After Pig job finish, Pig client spend too much time retry to connect to AM

2014-10-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai resolved PIG-4164.
-
Resolution: Fixed

Patch committed to trunk and 0.14 branch.

> After Pig job finish, Pig client spend too much time retry to connect to AM
> ---
>
> Key: PIG-4164
> URL: https://issues.apache.org/jira/browse/PIG-4164
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.14.0
>
> Attachments: PIG-4164-0.patch, PIG-4164-1.patch
>
>
> For some scripts, after the job finishes, Pig spends a lot of time retrying
> to connect to the AM before getting redirected to the JobHistoryServer.
> Here is the message we saw:
> {code}
> 2014-09-10 15:13:55,370 [main] INFO  org.apache.hadoop.ipc.Client - Retrying 
> connect to server: daijymacpro-2.local/10.11.2.30:55223. Already tried 0 
> time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, 
> sleepTime=1000 MILLISECONDS)
> 2014-09-10 15:13:56,371 [main] INFO  org.apache.hadoop.ipc.Client - Retrying 
> connect to server: daijymacpro-2.local/10.11.2.30:55223. Already tried 1 
> time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, 
> sleepTime=1000 MILLISECONDS)
> 2014-09-10 15:13:57,372 [main] INFO  org.apache.hadoop.ipc.Client - Retrying 
> connect to server: daijymacpro-2.local/10.11.2.30:55223. Already tried 2 
> time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, 
> sleepTime=1000 MILLISECONDS)
> 2014-09-10 15:13:57,476 [main] INFO  
> org.apache.hadoop.mapred.ClientServiceDelegate - Application state is 
> completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4164) After Pig job finish, Pig client spend too much time retry to connect to AM

2014-10-01 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14155641#comment-14155641
 ] 

Rohini Palaniswamy commented on PIG-4164:
-

+1

> After Pig job finish, Pig client spend too much time retry to connect to AM
> ---
>
> Key: PIG-4164
> URL: https://issues.apache.org/jira/browse/PIG-4164
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.14.0
>
> Attachments: PIG-4164-0.patch, PIG-4164-1.patch
>
>
> For some scripts, after the job finishes, Pig spends a lot of time retrying
> to connect to the AM before getting redirected to the JobHistoryServer.
> Here is the message we saw:
> {code}
> 2014-09-10 15:13:55,370 [main] INFO  org.apache.hadoop.ipc.Client - Retrying 
> connect to server: daijymacpro-2.local/10.11.2.30:55223. Already tried 0 
> time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, 
> sleepTime=1000 MILLISECONDS)
> 2014-09-10 15:13:56,371 [main] INFO  org.apache.hadoop.ipc.Client - Retrying 
> connect to server: daijymacpro-2.local/10.11.2.30:55223. Already tried 1 
> time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, 
> sleepTime=1000 MILLISECONDS)
> 2014-09-10 15:13:57,372 [main] INFO  org.apache.hadoop.ipc.Client - Retrying 
> connect to server: daijymacpro-2.local/10.11.2.30:55223. Already tried 2 
> time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, 
> sleepTime=1000 MILLISECONDS)
> 2014-09-10 15:13:57,476 [main] INFO  
> org.apache.hadoop.mapred.ClientServiceDelegate - Application state is 
> completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4164) After Pig job finish, Pig client spend too much time retry to connect to AM

2014-10-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-4164:

Attachment: PIG-4164-1.patch

Added a null check for the local mode case.

> After Pig job finish, Pig client spend too much time retry to connect to AM
> ---
>
> Key: PIG-4164
> URL: https://issues.apache.org/jira/browse/PIG-4164
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.14.0
>
> Attachments: PIG-4164-0.patch, PIG-4164-1.patch
>
>
> For some scripts, after the job finishes, Pig spends a lot of time retrying
> to connect to the AM before getting redirected to the JobHistoryServer.
> Here is the message we saw:
> {code}
> 2014-09-10 15:13:55,370 [main] INFO  org.apache.hadoop.ipc.Client - Retrying 
> connect to server: daijymacpro-2.local/10.11.2.30:55223. Already tried 0 
> time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, 
> sleepTime=1000 MILLISECONDS)
> 2014-09-10 15:13:56,371 [main] INFO  org.apache.hadoop.ipc.Client - Retrying 
> connect to server: daijymacpro-2.local/10.11.2.30:55223. Already tried 1 
> time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, 
> sleepTime=1000 MILLISECONDS)
> 2014-09-10 15:13:57,372 [main] INFO  org.apache.hadoop.ipc.Client - Retrying 
> connect to server: daijymacpro-2.local/10.11.2.30:55223. Already tried 2 
> time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, 
> sleepTime=1000 MILLISECONDS)
> 2014-09-10 15:13:57,476 [main] INFO  
> org.apache.hadoop.mapred.ClientServiceDelegate - Application state is 
> completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4175) PIG CROSS operation follow by STORE produces non-deterministic results each run

2014-10-01 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-4175:

Attachment: PIG-4175-additional-1.patch

Enhanced the test with PIG-4175-additional-1.patch. Also moved the constant out 
of PigConfiguration and into PigImplConstants as it is not user configurable. 

> PIG CROSS operation follow by STORE produces non-deterministic results each 
> run
> ---
>
> Key: PIG-4175
> URL: https://issues.apache.org/jira/browse/PIG-4175
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.11, 0.12.0
> Environment: RHEL 6/64-bit
>Reporter: Jim Huang
>Assignee: Daniel Dai
> Fix For: 0.14.0
>
> Attachments: PIG-4175-1.patch, PIG-4175-Debug.patch, 
> PIG-4175-additional-1.patch, mktestdata.py, pig_testcross_plan.png, 
> test_cross.out, test_cross.pig
>
>
> Three files are attached to help visualize this issue:
> 1. mktestdata.py - generates test data to feed the Pig script
> 2. test_cross.pig - the Pig script using CROSS and STORE
> 3. test_cross.out - the Pig console output showing the input/output record
> delta
> To reproduce this Pig CROSS problem, use the supplied Python script,
> mktestdata.py, to generate an input file that is at least 13,948,228,930
> bytes (> 13GB).
> The CROSS between raw_data (m records) and cross_count (1 record) should
> yield exactly m records as the output.
> Instead, the STORE results from the CROSS operation yielded only about 1/3
> of the input records in raw_data.
> If I joined both of the CROSS operations together, the STORE results yielded
> about 2/3 of the input records in raw_data:
> -- data = CROSS raw_data, field04s_count, subsection1_field04s_count,
> subsection2_field04s_count;
> We have reproduced this on both Pig 0.11 (Hadoop 1.x) and Pig 0.12 (Hadoop
> 2.x) clusters.
> The default HDFS block size is 128MB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PIG-4219) When parsing a schema, pig drops tuple inside of Bag if it contains only one field

2014-10-01 Thread Julien Le Dem (JIRA)
Julien Le Dem created PIG-4219:
--

 Summary: When parsing a schema, pig drops tuple inside of Bag if 
it contains only one field
 Key: PIG-4219
 URL: https://issues.apache.org/jira/browse/PIG-4219
 Project: Pig
  Issue Type: Bug
Reporter: Julien Le Dem


Example
{code:java}
// We generate a schema object and call toString()
String schemaStr = "my_list: {array: (array_element: (num1: int,num2: int))}";
// Reparse it using org.apache.pig.impl.util.Utils
Schema schema = Utils.getSchemaFromString(schemaStr);
// The result no longer matches the original structure:
schema.toString();
// => {my_list: {array_element: (num1: int,num2: int)}}
{code}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4175) PIG CROSS operation follow by STORE produces non-deterministic results each run

2014-10-01 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14155586#comment-14155586
 ] 

Rohini Palaniswamy commented on PIG-4175:
-

Thanks. I will shortly attach an additional patch to this jira that modifies
the testcase to verify the results exactly.

> PIG CROSS operation follow by STORE produces non-deterministic results each 
> run
> ---
>
> Key: PIG-4175
> URL: https://issues.apache.org/jira/browse/PIG-4175
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.11, 0.12.0
> Environment: RHEL 6/64-bit
>Reporter: Jim Huang
>Assignee: Daniel Dai
> Fix For: 0.14.0
>
> Attachments: PIG-4175-1.patch, PIG-4175-Debug.patch, mktestdata.py, 
> pig_testcross_plan.png, test_cross.out, test_cross.pig
>
>
> Three files are attached to help visualize this issue:
> 1. mktestdata.py - generates test data to feed the Pig script
> 2. test_cross.pig - the Pig script using CROSS and STORE
> 3. test_cross.out - the Pig console output showing the input/output record
> delta
> To reproduce this Pig CROSS problem, use the supplied Python script,
> mktestdata.py, to generate an input file that is at least 13,948,228,930
> bytes (> 13GB).
> The CROSS between raw_data (m records) and cross_count (1 record) should
> yield exactly m records as the output.
> Instead, the STORE results from the CROSS operation yielded only about 1/3
> of the input records in raw_data.
> If I joined both of the CROSS operations together, the STORE results yielded
> about 2/3 of the input records in raw_data:
> -- data = CROSS raw_data, field04s_count, subsection1_field04s_count,
> subsection2_field04s_count;
> We have reproduced this on both Pig 0.11 (Hadoop 1.x) and Pig 0.12 (Hadoop
> 2.x) clusters.
> The default HDFS block size is 128MB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4175) PIG CROSS operation follow by STORE produces non-deterministic results each run

2014-10-01 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14155583#comment-14155583
 ] 

Daniel Dai commented on PIG-4175:
-

What job.getResults does is load the stored data with the original loadFunc
(PigStorage here), thus losing the type info. To get the original datatypes,
one should use PigServer.openIterator instead of PigServer.store +
job.getResults.
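
For example (PigServer.openIterator is the standard API; the alias "D" is
from the debug patch discussion):
{code}
// Iterate the alias directly instead of store + getResults, so the tuples
// keep their declared types (int, chararray, long) rather than bytearray.
Iterator<Tuple> it = pigServer.openIterator("D");
while (it.hasNext()) {
    Tuple t = it.next();
    // inspect t.getType(i) / t.get(i) here
}
{code}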

> PIG CROSS operation follow by STORE produces non-deterministic results each 
> run
> ---
>
> Key: PIG-4175
> URL: https://issues.apache.org/jira/browse/PIG-4175
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.11, 0.12.0
> Environment: RHEL 6/64-bit
>Reporter: Jim Huang
>Assignee: Daniel Dai
> Fix For: 0.14.0
>
> Attachments: PIG-4175-1.patch, PIG-4175-Debug.patch, mktestdata.py, 
> pig_testcross_plan.png, test_cross.out, test_cross.pig
>
>
> Three files are attached to help visualize this issue:
> 1. mktestdata.py - generates test data to feed the Pig script
> 2. test_cross.pig - the Pig script using CROSS and STORE
> 3. test_cross.out - the Pig console output showing the input/output record
> delta
> To reproduce this Pig CROSS problem, use the supplied Python script,
> mktestdata.py, to generate an input file that is at least 13,948,228,930
> bytes (> 13GB).
> The CROSS between raw_data (m records) and cross_count (1 record) should
> yield exactly m records as the output.
> Instead, the STORE results from the CROSS operation yielded only about 1/3
> of the input records in raw_data.
> If I joined both of the CROSS operations together, the STORE results yielded
> about 2/3 of the input records in raw_data:
> -- data = CROSS raw_data, field04s_count, subsection1_field04s_count,
> subsection2_field04s_count;
> We have reproduced this on both Pig 0.11 (Hadoop 1.x) and Pig 0.12 (Hadoop
> 2.x) clusters.
> The default HDFS block size is 128MB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4175) PIG CROSS operation follow by STORE produces non-deterministic results each run

2014-10-01 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1419#comment-1419
 ] 

Daniel Dai commented on PIG-4175:
-

pigServer.openIterator("D") gives the right datatype. Seems to be an issue in 
job.getResults().

> PIG CROSS operation follow by STORE produces non-deterministic results each 
> run
> ---
>
> Key: PIG-4175
> URL: https://issues.apache.org/jira/browse/PIG-4175
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.11, 0.12.0
> Environment: RHEL 6/64-bit
>Reporter: Jim Huang
>Assignee: Daniel Dai
> Fix For: 0.14.0
>
> Attachments: PIG-4175-1.patch, PIG-4175-Debug.patch, mktestdata.py, 
> pig_testcross_plan.png, test_cross.out, test_cross.pig
>
>
> Three files are attached to help visualize this issue:
> 1. mktestdata.py - generates test data to feed the Pig script
> 2. test_cross.pig - the Pig script using CROSS and STORE
> 3. test_cross.out - the Pig console output showing the input/output record
> delta
> To reproduce this Pig CROSS problem, use the supplied Python script,
> mktestdata.py, to generate an input file that is at least 13,948,228,930
> bytes (> 13GB).
> The CROSS between raw_data (m records) and cross_count (1 record) should
> yield exactly m records as the output.
> Instead, the STORE results from the CROSS operation yielded only about 1/3
> of the input records in raw_data.
> If I joined both of the CROSS operations together, the STORE results yielded
> about 2/3 of the input records in raw_data:
> -- data = CROSS raw_data, field04s_count, subsection1_field04s_count,
> subsection2_field04s_count;
> We have reproduced this on both Pig 0.11 (Hadoop 1.x) and Pig 0.12 (Hadoop
> 2.x) clusters.
> The default HDFS block size is 128MB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4218) Pig OrcStorage fail to load a map with null key

2014-10-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-4218:

Status: Patch Available  (was: Open)

> Pig OrcStorage fail to load a map with null key
> ---
>
> Key: PIG-4218
> URL: https://issues.apache.org/jira/browse/PIG-4218
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.14.0
>
> Attachments: PIG-4218-1.patch, nullmapkey.orc
>
>
> Error message:
> Backend error message
> -
> AttemptID:attempt_1403634189382_0006_m_00_1 Info:Error: 
> java.lang.NullPointerException
> at org.apache.pig.impl.util.orc.OrcUtils.convertOrcToPig(OrcUtils.java:97)
> at org.apache.pig.impl.util.orc.OrcUtils.convertOrcToPig(OrcUtils.java:82)
> at org.apache.pig.builtin.OrcStorage.getNext(OrcStorage.java:312)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:211)
> at 
> org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:533)
> at 
> org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
> at 
> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4218) Pig OrcStorage fail to load a map with null key

2014-10-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-4218:

Attachment: nullmapkey.orc
PIG-4218-1.patch

Ignore map entries with a null key. Hive does the same (HIVE-8115). Hive also
plans to disable storing maps with null keys. For legacy data, Pig should not
die.

> Pig OrcStorage fail to load a map with null key
> ---
>
> Key: PIG-4218
> URL: https://issues.apache.org/jira/browse/PIG-4218
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.14.0
>
> Attachments: PIG-4218-1.patch, nullmapkey.orc
>
>
> Error message:
> Backend error message
> -
> AttemptID:attempt_1403634189382_0006_m_00_1 Info:Error: 
> java.lang.NullPointerException
> at org.apache.pig.impl.util.orc.OrcUtils.convertOrcToPig(OrcUtils.java:97)
> at org.apache.pig.impl.util.orc.OrcUtils.convertOrcToPig(OrcUtils.java:82)
> at org.apache.pig.builtin.OrcStorage.getNext(OrcStorage.java:312)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:211)
> at 
> org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:533)
> at 
> org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
> at 
> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4213) CSVExcelStorage not quoting texts containing \r (CR) when storing

2014-10-01 Thread Alfonso Nishikawa (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alfonso Nishikawa updated PIG-4213:
---
Attachment: PIG-4213_tests_updated_CR_before.patch

I am uploading the file [^PIG-4213_tests_updated_CR_before.patch] to show the
expected behavior when there is a {{\r}} in the input when LOADing, as my
previous comment specifies.

Do not apply it before the final patch; this patch is only a proof of concept.

> CSVExcelStorage not quoting texts containing \r (CR) when storing
> -
>
> Key: PIG-4213
> URL: https://issues.apache.org/jira/browse/PIG-4213
> Project: Pig
>  Issue Type: Improvement
>  Components: piggybank
>Affects Versions: 0.12.0
>Reporter: Alfonso Nishikawa
>Assignee: Alfonso Nishikawa
>Priority: Trivial
> Attachments: PIG-4213_tests_updated_CR_before.patch, PIG-4213v1.patch
>
>
> While managing tweet data I found that someone wrote a multiline tweet on
> Mac OS 9 (or below). When exporting the text, it is not being quoted, so
> LibreOffice can't import the cell properly (don't try Excel 2007 because
> it's buggy).
> I suggest handling the CR case in the same way as commented at
> http://svn.apache.org/viewvc/pig/tags/release-0.12.1/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java?view=markup#l315
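> A hypothetical one-line illustration of the suggestion (not the actual
> patch; the constant and helper names are made up):
> {code}
> // When storing, quote a field if it contains the delimiter, a quote,
> // a line feed, or (the new case) a carriage return:
> if (field.indexOf(fieldDelimiter) != -1
>         || field.indexOf(DOUBLE_QUOTE) != -1
>         || field.indexOf(LINEFEED) != -1
>         || field.indexOf(CARRIAGE_RETURN) != -1) {  // proposed addition
>     field = quoteField(field);
> }
> {code}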



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PIG-4218) Pig OrcStorage fail to load a map with null key

2014-10-01 Thread Daniel Dai (JIRA)
Daniel Dai created PIG-4218:
---

 Summary: Pig OrcStorage fail to load a map with null key
 Key: PIG-4218
 URL: https://issues.apache.org/jira/browse/PIG-4218
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.14.0


Error message:

Backend error message
-
AttemptID:attempt_1403634189382_0006_m_00_1 Info:Error: 
java.lang.NullPointerException
at org.apache.pig.impl.util.orc.OrcUtils.convertOrcToPig(OrcUtils.java:97)
at org.apache.pig.impl.util.orc.OrcUtils.convertOrcToPig(OrcUtils.java:82)
at org.apache.pig.builtin.OrcStorage.getNext(OrcStorage.java:312)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:211)
at 
org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:533)
at 
org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
at 
org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4213) CSVExcelStorage not quoting texts containing \r (CR) when storing

2014-10-01 Thread Alfonso Nishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14155384#comment-14155384
 ] 

Alfonso Nishikawa commented on PIG-4213:


After researching, I share some conclusions before finishing the patch:

* {{CSVExcelStorage}} uses {{PigTextInputFormat}}, which extends
{{TextInputFormat}}, which instantiates {{LineRecordReader}}.
* {{LineRecordReader}} splits the input at {{\r}}, treating a lone CR as a
linefeed.
* Reading data with {{CSVExcelStorage}} will therefore treat {{\r}} the same
as {{\n}} and {{\r\n}}.
* Reading data with {{CSVExcelStorage}} will substitute any lone {{\r}} in
the input with a {{\n}}; this reading behavior is the same as it has always
been and can't be fixed here, since it belongs to {{LineRecordReader}}.

This implies:

* What can be fixed is the output, which is what this ticket addresses: quote
the field when a {{\r}} is present.
* What cannot be fixed is that {{load(store(x))}} will not be idempotent if
any CR is present. But this behavior, again, is the same as it was before
this patch.

I will check this in a while :)

> CSVExcelStorage not quoting texts containing \r (CR) when storing
> -
>
> Key: PIG-4213
> URL: https://issues.apache.org/jira/browse/PIG-4213
> Project: Pig
>  Issue Type: Improvement
>  Components: piggybank
>Affects Versions: 0.12.0
>Reporter: Alfonso Nishikawa
>Assignee: Alfonso Nishikawa
>Priority: Trivial
> Attachments: PIG-4213v1.patch
>
>
> While managing tweet data I found that someone wrote a multiline tweet on
> Mac OS 9 (or below). When exporting the text, it is not being quoted, so
> LibreOffice can't import the cell properly (don't try Excel 2007 because
> it's buggy).
> I suggest handling the CR case in the same way as commented at
> http://svn.apache.org/viewvc/pig/tags/release-0.12.1/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java?view=markup#l315



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4175) PIG CROSS operation follow by STORE produces non-deterministic results each run

2014-10-01 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14155375#comment-14155375
 ] 

Rohini Palaniswamy commented on PIG-4175:
-

Is it because GFCross does not define outputSchema?

> PIG CROSS operation follow by STORE produces non-deterministic results each 
> run
> ---
>
> Key: PIG-4175
> URL: https://issues.apache.org/jira/browse/PIG-4175
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.11, 0.12.0
> Environment: RHEL 6/64-bit
>Reporter: Jim Huang
>Assignee: Daniel Dai
> Fix For: 0.14.0
>
> Attachments: PIG-4175-1.patch, PIG-4175-Debug.patch, mktestdata.py, 
> pig_testcross_plan.png, test_cross.out, test_cross.pig
>
>
> Three files are attached to help visualize this issue:
> 1. mktestdata.py - generates test data to feed the Pig script
> 2. test_cross.pig - the Pig script using CROSS and STORE
> 3. test_cross.out - the Pig console output showing the input/output record
> delta
> To reproduce this Pig CROSS problem, use the supplied Python script,
> mktestdata.py, to generate an input file that is at least 13,948,228,930
> bytes (> 13GB).
> The CROSS between raw_data (m records) and cross_count (1 record) should
> yield exactly m records as the output.
> Instead, the STORE results from the CROSS operation yielded only about 1/3
> of the input records in raw_data.
> If I joined both of the CROSS operations together, the STORE results yielded
> about 2/3 of the input records in raw_data:
> -- data = CROSS raw_data, field04s_count, subsection1_field04s_count,
> subsection2_field04s_count;
> We have reproduced this on both Pig 0.11 (Hadoop 1.x) and Pig 0.12 (Hadoop
> 2.x) clusters.
> The default HDFS block size is 128MB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4175) PIG CROSS operation follow by STORE produces non-deterministic results each run

2014-10-01 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-4175:

Attachment: PIG-4175-Debug.patch

[~daijy],
   I was trying to enhance the test to compare actual results so that the test 
is more foolproof, but found that the output of cross was all bytearray even 
though the dump of the schema is as expected. 
C: {long}
D: {A::a0: int,A::a1: chararray,long}

Am I missing something? Attaching the debug patch.
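
For the record, a rough sketch of the kind of type assertion the enhanced test 
could make (assumed alias "D" and a PigServer-based test; the attached debug 
patch is what I actually ran):
{code}
// Hedged sketch: walk the cross output and fail if any field still
// comes back as bytearray, which is the symptom described above.
Iterator<Tuple> it = pigServer.openIterator("D");
while (it.hasNext()) {
    Tuple t = it.next();
    for (int i = 0; i < t.size(); i++) {
        assertFalse("field " + i + " should not be bytearray",
                DataType.findType(t.get(i)) == DataType.BYTEARRAY);
    }
}
{code}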

> PIG CROSS operation followed by STORE produces non-deterministic results each 
> run
> ---
>
> Key: PIG-4175
> URL: https://issues.apache.org/jira/browse/PIG-4175
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.11, 0.12.0
> Environment: RHEL 6/64-bit
>Reporter: Jim Huang
>Assignee: Daniel Dai
> Fix For: 0.14.0
>
> Attachments: PIG-4175-1.patch, PIG-4175-Debug.patch, mktestdata.py, 
> pig_testcross_plan.png, test_cross.out, test_cross.pig
>
>
> Three files will be attached to help visualize this issue.
> 1. mktestdata.py - to generate test data to feed the pig script
> 2. test_cross.pig - the PIG script using CROSS and STORE
> 3. test_cross.out - the PIG console output showing the input/output records 
> delta
> To reproduce this PIG CROSS operation problem, you need to use the supplied 
> Python script,
> mktestdata.py, to generate an input file that is at least 13,948,228,930 
> bytes (> 13GB).
> The CROSS between raw_data (m records) and cross_count (1 record) should 
> yield exactly (m records) as the output.  
> The STORE results from the CROSS operations yielded about 1/3 of the input 
> records in raw_data as the output.  
> If I joined both of the CROSS operations together, the STORE results from 
> the CROSS operations yielded about 2/3
> of the input records in raw_data as the output.  
> -- data = CROSS raw_data, field04s_count, subsection1_field04s_count, 
> subsection2_field04s_count;
> We have reproduced this using both Pig 0.11 (Hadoop 1.x) and Pig 0.12 (Hadoop 
> 2.x) clusters.  
> The default HDFS block size is 128MB.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4212) Allow LIMIT of 0 for variableLimit (constant 0 is already allowed)

2014-10-01 Thread Koji Noguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi updated PIG-4212:
--
   Resolution: Fixed
Fix Version/s: 0.14.0
 Hadoop Flags: Reviewed
   Status: Resolved  (was: Patch Available)

Committed to 0.14 and trunk (with the extra line added, as pointed out by Rohini).

Thanks Daniel and Rohini for the review! 
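
For anyone curious, a hedged illustration of what the change amounts to 
(assumed names, not the committed patch): when the limit comes from an 
expression, it is evaluated at runtime, zero is accepted and simply yields no 
records, and only negative values are rejected:
{code}
// Hedged sketch; evaluateLimitExpression is a hypothetical helper.
long limit = evaluateLimitExpression(input);
if (limit < 0) {
    throw new ExecException("Limit requires a non-negative value, got " + limit);
}
// limit == 0 falls through and emits nothing, matching the constant
// form "limit A 0".
{code}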

> Allow LIMIT of 0 for variableLimit (constant 0 is already allowed)
> --
>
> Key: PIG-4212
> URL: https://issues.apache.org/jira/browse/PIG-4212
> Project: Pig
>  Issue Type: Bug
>Reporter: Koji Noguchi
>Assignee: Koji Noguchi
>Priority: Trivial
> Fix For: 0.14.0
>
> Attachments: pig-4212-v1.patch
>
>
> Somehow 
> limit A 0 
> is currently allowed but not
> limit A B.count - B.count 
> I'd like the latter to be allowed as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4196) Auto ship udf jar is broken

2014-10-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-4196:

  Resolution: Fixed
Hadoop Flags: Reviewed
  Status: Resolved  (was: Patch Available)

Patch committed to both trunk and 0.14 branch. Thanks Rohini for review!

> Auto ship udf jar is broken
> ---
>
> Key: PIG-4196
> URL: https://issues.apache.org/jira/browse/PIG-4196
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.14.0
>
> Attachments: PIG-4196-0.patch, PIG-4196-1.patch, PIG-4196-2.patch
>
>
> The mechanism to ship the jar containing a udf was broken by PIG-4054. 
> Attaching a quick fix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4214) Fix unit test fail TestMRJobStats

2014-10-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-4214:

  Resolution: Fixed
Hadoop Flags: Reviewed
  Status: Resolved  (was: Patch Available)

Patch committed to both trunk and 0.14 branch. Thanks Rohini for review!

> Fix unit test fail TestMRJobStats
> -
>
> Key: PIG-4214
> URL: https://issues.apache.org/jira/browse/PIG-4214
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.14.0
>
> Attachments: PIG-4214-1.patch, PIG-4214-2.patch, PIG-4214-3.patch
>
>
> TestMRJobStats is broken by PIG-4050. We shall fix it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4214) Fix unit test fail TestMRJobStats

2014-10-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-4214:

Attachment: PIG-4214-3.patch

Sure, reattaching the patch.

> Fix unit test fail TestMRJobStats
> -
>
> Key: PIG-4214
> URL: https://issues.apache.org/jira/browse/PIG-4214
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.14.0
>
> Attachments: PIG-4214-1.patch, PIG-4214-2.patch, PIG-4214-3.patch
>
>
> TestMRJobStats is broken by PIG-4050. We shall fix it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4217) Fix documentation in BuildBloom

2014-10-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-4217:

   Resolution: Fixed
Fix Version/s: 0.14.0
 Hadoop Flags: Reviewed
   Status: Resolved  (was: Patch Available)

Patch committed to trunk and 0.14 branch. Thanks Praveen!
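
For readers landing here later, a hedged guess at the kind of correction 
involved (the committed patch is authoritative): DEFINE constructor arguments 
in Pig are quoted string constants, and the defined alias (not the class name) 
is what gets invoked, so the javadoc's usage lines would need to look roughly 
like:
{code}
/**
 * Hedged sketch of corrected usage, not a verbatim excerpt of the patch:
 *
 *   define bloom Bloom('mybloom');   -- constructor arg must be quoted
 *   A = load 'foo' as (x, y);
 *   B = load 'bar' as (z);
 *   C = filter B by bloom(z);        -- call the alias, not the class
 *   D = join C by z, A by x;
 */
{code}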

> Fix documentation in BuildBloom
> ---
>
> Key: PIG-4217
> URL: https://issues.apache.org/jira/browse/PIG-4217
> Project: Pig
>  Issue Type: Bug
>Reporter: Praveen Rachabattuni
>Assignee: Praveen Rachabattuni
> Fix For: 0.14.0
>
> Attachments: PIG-4217-1.patch
>
>
> /**
>  * Build a bloom filter for use later in Bloom.  This UDF is intended to run
>  * in a group all job.  For example:
>  * define bb BuildBloom('jenkins', '100', '0.1');
>  * A = load 'foo' as (x, y);
>  * B = group A all;
>  * C = foreach B generate BuildBloom(A.x);
>  * store C into 'mybloom';
>  * The bloom filter can be on multiple keys by passing more than one field
>  * (or the entire bag) to BuildBloom.
>  * The resulting file can then be used in a Bloom filter as:
>  * define bloom Bloom(mybloom);
>  * A = load 'foo' as (x, y);
>  * B = load 'bar' as (z);
>  * C = filter B by Bloom(z);
>  * D = join C by z, A by x;
>  * It uses {@link org.apache.hadoop.util.bloom.BloomFilter}.
>  */
> The Pig script inside the above doc strings doesn't work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-3985) Multiquery execution of RANK with RANK BY causes NPE JobCreationException "ERROR 2017: Internal error creating job configuration"

2014-10-01 Thread Koji Noguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi updated PIG-3985:
--
Attachment: pig-3985-v01.txt

{noformat}
Caused by: java.lang.NullPointerException
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:938)
{noformat}
{code:title=JobControlCompiler.java|borderStyle=solid}
 938 Iterator<Pair<String, Long>> itPairs = globalCounters.get(operationID).iterator();
{code}
This was due to globalCounters not containing operationID.
This itself was caused by saveCounters not being called due to 
mro.isCounterOperation incorrectly returning false.
{code:title=JobControlCompiler.java|borderStyle=solid}
 358 if (!pigContext.inIllustrator && mro.isCounterOperation())
 359 saveCounters(job,mro.getOperationID());
{code}

This was caused by mro.isCounterOperation assuming that the POCounter is always 
placed at the leaf level.
{code:title=MapReduceOper.java|borderStyle=solid}
511 public boolean isCounterOperation() {
512 return (getCounterOperation() != null);
513 }
...
525 private POCounter getCounterOperation() {
526 PhysicalOperator operator;
527 Iterator<PhysicalOperator> it = this.mapPlan.getLeaves().iterator();
528
529 while(it.hasNext()) {
530 operator = it.next();
531 if(operator instanceof POCounter)
532 return (POCounter) operator;
533 }
...
{code}

For the sample pig test program given by Philip, the mapreduce plan showed 
"Split" as the only leaf.
{noformat}
MapReduce node scope-34
Map Plan
Split - scope-69
|   |
|   
Store(file:/tmp/temp465448860/tmp1018450824:org.apache.pig.impl.io.InterStorage)
 - scope-38
|   |
|   |---citypops_nosort_inplace: POCounter[tuple] - scope-14
|   |
|   citypops_ties_cause_skips: Local Rearrange[tuple]{chararray}(false) - 
scope-21
|   |   |
|   |   Project[chararray][0] - scope-22
|
|---citypops: New For Each(false,false,false)[bag] - scope-10
|   |
|   Cast[chararray] - scope-2
|   |
|   |---Project[bytearray][0] - scope-1
|   |
|   Cast[chararray] - scope-5
|   |
|   |---Project[bytearray][1] - scope-4
|   |
|   Cast[int] - scope-8
|   |
|   |---Project[bytearray][2] - scope-7
|
|---citypops: 
Load(file:///Users/knoguchi/git/pig/pig-3985/us_city_pops.tsv:org.apache.pig.builtin.PigStorage)
 - scope-0
{noformat}

I initially tried fixing MapReduceOper.getCounterOperation() so that it would find 
the POCounter even when it is part of the split.  However, I soon learned that 
POCounter requires different map-reduce classes 
(PigMapReduceCounter.PigMapCounter.class and PigReduceCounter.class) and it 
currently doesn't work if they are mixed with other operations.

Instead of rewriting Rank, for now I made a change so that every POCounter starts 
a new mapreduce job.
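
For reference, a rough Java sketch of that abandoned first attempt (assumed 
shape only; see the attached patch for what was actually done):
{code}
// Hedged sketch of the abandoned fix: scan the whole map plan for a
// POCounter instead of only the leaves, so a counter buried inside a
// Split is still found. Dropped because POCounter needs its own
// PigMapCounter/PigReduceCounter job classes and cannot be mixed with
// other operations in one job.
private POCounter getCounterOperation() {
    Iterator<PhysicalOperator> it = this.mapPlan.iterator();
    while (it.hasNext()) {
        PhysicalOperator operator = it.next();
        if (operator instanceof POCounter) {
            return (POCounter) operator;
        }
    }
    return null;
}
{code}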

> Multiquery execution of RANK with RANK BY causes NPE JobCreationException 
> "ERROR 2017: Internal error creating job configuration"
> -
>
> Key: PIG-3985
> URL: https://issues.apache.org/jira/browse/PIG-3985
> Project: Pig
>  Issue Type: Bug
>Reporter: Philip (flip) Kromer
>  Labels: nullpointerexception, rank, udf
> Attachments: many_ranks_much_sadness.pig, pig-3985-v01.txt, 
> us_city_pops.tsv
>
>
> A script with both RANK and RANK BY will crash with a Null Pointer Exception 
> in JobControlCompiler.java when multiquery is enabled.
> The following script will work for any combination of the RANK BY operations; 
> or if there is one RANK operation only (i.e. no other RANK or RANK BY 
> operation). Non-BY-RANKS will perish together but succeed alone.
> Disabling multiquery execution makes everything work again.
> I am using Hadoop 2.4.0 with Pig Trunk (d24d06a48, after PIG-3739). The error 
> occurs in local or mapreduce mode.
> {code}
> -- disable multiquery and you can rank all day long
> -- SET opt.multiquery false
> citypops = LOAD 'us_city_pops.tsv' AS (city:chararray, state:chararray, 
> pop_2011:int);
> citypops_o = ORDER citypops BY city;
> --
> -- if you have one non-by RANK you may not have any other RANKs
> --
> citypops_nosort_inplace= RANK citypops;
> citypops_presorted_inplace = RANK citypops_o;
> citypops_ties_cause_skips  = RANK citypops   BY city;
> citypops_ties_no_skips = RANK citypops   BY city  DENSE;
> citypops_presorted_ranked  = RANK citypops_o BY city;
> STORE citypops_nosort_inplaceINTO '/tmp/citypops_nosort_inplace'USING 
> PigStorage('\t', '--overwrite true');
> -- STORE citypops_presorted_inplace INTO '/tmp/citypops_presorted_inplace' 
> USING PigStorage('\t', '--overwrite true');
> STORE citypops_ties_cause_skips  INTO '

[jira] [Commented] (PIG-4196) Auto ship udf jar is broken

2014-10-01 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14155124#comment-14155124
 ] 

Rohini Palaniswamy commented on PIG-4196:
-

+1

> Auto ship udf jar is broken
> ---
>
> Key: PIG-4196
> URL: https://issues.apache.org/jira/browse/PIG-4196
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.14.0
>
> Attachments: PIG-4196-0.patch, PIG-4196-1.patch, PIG-4196-2.patch
>
>
> The mechanism to ship the jar containing a udf was broken by PIG-4054. 
> Attaching a quick fix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4214) Fix unit test fail TestMRJobStats

2014-10-01 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14155122#comment-14155122
 ] 

Rohini Palaniswamy commented on PIG-4214:
-

+1.  Could you change it to import java.util.Arrays instead of referencing the 
full package name every time, before checking in?

> Fix unit test fail TestMRJobStats
> -
>
> Key: PIG-4214
> URL: https://issues.apache.org/jira/browse/PIG-4214
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.14.0
>
> Attachments: PIG-4214-1.patch, PIG-4214-2.patch
>
>
> TestMRJobStats is broken by PIG-4050. We shall fix it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [RESULT] [VOTE] Drop support for Hadoop 0.20 from Pig 0.14

2014-10-01 Thread Gianmarco De Francisci Morales
We should add this to the release notes for 0.14 as well.

Cheers,

--
Gianmarco

On 29 September 2014 19:25, Rohini Palaniswamy 
wrote:

> My +1 as well.
>
> With 6 binding +1s, 8 non-binding +1s and no -1s this vote passes.
>
> Nothing special to address for this. PIG-3507, which went into Pig 0.14, used
> the UserGroupInformation class without reflection, so Pig 0.14 is already
> incompatible with Hadoop 0.20.
>
> Regards,
> Rohini
>
> On Mon, Sep 22, 2014 at 5:56 PM, Thejas Nair 
> wrote:
>
> > +1
> >
> > On Thu, Sep 18, 2014 at 5:50 PM, Mona Chitnis 
> > wrote:
> > >
> > > +1 (non-binding)
> > >  Mona Chitnis
> > > Yahoo!
> > >
> > >  On Thursday, September 18, 2014 8:48 AM, Ashutosh Chauhan <
> > hashut...@apache.org> wrote:
> > >
> > >
> > >  +1
> > >
> > > On Wed, Sep 17, 2014 at 7:02 PM, Daniel Dai 
> > wrote:
> > >
> > >> +1
> > >>
> > >> On Wed, Sep 17, 2014 at 11:12 AM, Prashant Kommireddi
> > >>  wrote:
> > >> > +1
> > >> >
> > >> > On Wed, Sep 17, 2014 at 8:44 AM, Cheolsoo Park <
> piaozhe...@gmail.com>
> > >> wrote:
> > >> >
> > >> >> +1
> > >> >>
> > >> >> On Wed, Sep 17, 2014 at 7:09 AM, Xuefu Zhang 
> > >> wrote:
> > >> >>
> > >> >> > +1
> > >> >> >
> > >> >> > On Wed, Sep 17, 2014 at 7:04 AM, Julien Le Dem  >
> > >> wrote:
> > >> >> >
> > >> >> > > +1
> > >> >> > >
> > >> >> > > Julien
> > >> >> > >
> > >> >> > > > -Original Message-
> > >> >> > > > From: Rohini Palaniswamy [mailto:rohini.adi...@gmail.com]
> > >> >> > > > Sent: Wednesday, September 17, 2014 12:38 PM
> > >> >> > > > To: dev@pig.apache.org
> > >> >> > > > Subject: [VOTE] Drop support for Hadoop 0.20 from Pig 0.14
> > >> >> > > >
> > >> >> > > > Hi,
> > >> >> > > >  Hadoop has matured far beyond Hadoop 0.20 and has had two major
> > >> >> > > > releases after that, and there has been no development on branch-0.20
> > >> >> > > > (http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20/)
> > >> >> > > > for 3 years now. It is high time we drop support for Hadoop 0.20 and
> > >> >> > > > only support the Hadoop 1.x and 2.x lines going forward. This will
> > >> >> > > > reduce the maintenance effort and also enable us to write more
> > >> >> > > > efficient code and cut down on reflection.
> > >> >> > > >
> > >> >> > > > Vote closes on Tuesday, Sep 23 2014.
> > >> >> > > >
> > >> >> > > > Thanks,
> > >> >> > > > Rohini
> > >> >> > >
> > >> >> >
> > >> >>
> > >>
> > >
> > >
> > >
> >
>


[jira] [Updated] (PIG-4217) Fix documentation in BuildBloom

2014-10-01 Thread Praveen Rachabattuni (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Praveen Rachabattuni updated PIG-4217:
--
Status: Patch Available  (was: Open)

> Fix documentation in BuildBloom
> ---
>
> Key: PIG-4217
> URL: https://issues.apache.org/jira/browse/PIG-4217
> Project: Pig
>  Issue Type: Bug
>Reporter: Praveen Rachabattuni
>Assignee: Praveen Rachabattuni
> Attachments: PIG-4217-1.patch
>
>
> /**
>  * Build a bloom filter for use later in Bloom.  This UDF is intended to run
>  * in a group all job.  For example:
>  * define bb BuildBloom('jenkins', '100', '0.1');
>  * A = load 'foo' as (x, y);
>  * B = group A all;
>  * C = foreach B generate BuildBloom(A.x);
>  * store C into 'mybloom';
>  * The bloom filter can be on multiple keys by passing more than one field
>  * (or the entire bag) to BuildBloom.
>  * The resulting file can then be used in a Bloom filter as:
>  * define bloom Bloom(mybloom);
>  * A = load 'foo' as (x, y);
>  * B = load 'bar' as (z);
>  * C = filter B by Bloom(z);
>  * D = join C by z, A by x;
>  * It uses {@link org.apache.hadoop.util.bloom.BloomFilter}.
>  */
> The Pig script inside the above doc strings doesn't work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4217) Fix documentation in BuildBloom

2014-10-01 Thread Praveen Rachabattuni (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Praveen Rachabattuni updated PIG-4217:
--
Attachment: PIG-4217-1.patch

> Fix documentation in BuildBloom
> ---
>
> Key: PIG-4217
> URL: https://issues.apache.org/jira/browse/PIG-4217
> Project: Pig
>  Issue Type: Bug
>Reporter: Praveen Rachabattuni
>Assignee: Praveen Rachabattuni
> Attachments: PIG-4217-1.patch
>
>
> /**
>  * Build a bloom filter for use later in Bloom.  This UDF is intended to run
>  * in a group all job.  For example:
>  * define bb BuildBloom('jenkins', '100', '0.1');
>  * A = load 'foo' as (x, y);
>  * B = group A all;
>  * C = foreach B generate BuildBloom(A.x);
>  * store C into 'mybloom';
>  * The bloom filter can be on multiple keys by passing more than one field
>  * (or the entire bag) to BuildBloom.
>  * The resulting file can then be used in a Bloom filter as:
>  * define bloom Bloom(mybloom);
>  * A = load 'foo' as (x, y);
>  * B = load 'bar' as (z);
>  * C = filter B by Bloom(z);
>  * D = join C by z, A by x;
>  * It uses {@link org.apache.hadoop.util.bloom.BloomFilter}.
>  */
> The Pig script inside the above doc strings doesn't work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PIG-4217) Fix documentation in BuildBloom

2014-10-01 Thread Praveen Rachabattuni (JIRA)
Praveen Rachabattuni created PIG-4217:
-

 Summary: Fix documentation in BuildBloom
 Key: PIG-4217
 URL: https://issues.apache.org/jira/browse/PIG-4217
 Project: Pig
  Issue Type: Bug
Reporter: Praveen Rachabattuni
Assignee: Praveen Rachabattuni


/**
 * Build a bloom filter for use later in Bloom.  This UDF is intended to run
 * in a group all job.  For example:
 * define bb BuildBloom('jenkins', '100', '0.1');
 * A = load 'foo' as (x, y);
 * B = group A all;
 * C = foreach B generate BuildBloom(A.x);
 * store C into 'mybloom';
 * The bloom filter can be on multiple keys by passing more than one field
 * (or the entire bag) to BuildBloom.
 * The resulting file can then be used in a Bloom filter as:
 * define bloom Bloom(mybloom);
 * A = load 'foo' as (x, y);
 * B = load 'bar' as (z);
 * C = filter B by Bloom(z);
 * D = join C by z, A by x;
 * It uses {@link org.apache.hadoop.util.bloom.BloomFilter}.
 */

The Pig script inside the above doc strings doesn't work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Jenkins build is back to normal : Pig-trunk #1670

2014-10-01 Thread Apache Jenkins Server
See