[jira] [Comment Edited] (PIG-4963) Add a Bloom join

2017-01-25 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15839299#comment-15839299
 ] 

Daniel Dai edited comment on PIG-4963 at 1/26/17 7:02 AM:
--

+1 for PIG-4963-5.patch.


was (Author: daijy):
+1 for the new patch (on RB).

> Add a Bloom join
> 
>
> Key: PIG-4963
> URL: https://issues.apache.org/jira/browse/PIG-4963
> Project: Pig
>  Issue Type: New Feature
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.17.0
>
> Attachments: PIG-4963-1.patch, PIG-4963-2.patch, PIG-4963-3.patch, 
> PIG-4963-4.patch, PIG-4963-5.patch
>
>
> In PIG-4925, an option was added to pass a BloomFilter as a scalar to the bloom 
> function. But it turned out that actually using it for big data, which required a 
> huge vector size, was very inefficient and led to OOM.
> I had initially calculated that it would take around a 12MB bytearray for a 
> 100 million vector size ((100000000 + 7) / 8 = 12500000 bytes) and that would 
> be the scalar value broadcast and would not take much space. But the problem is 
> that the 12MB was written out for every input record by BuildBloom$Initial before 
> the aggregation happens and we arrive at the final BloomFilter vector. And 
> with POPartialAgg it runs into OOM issues. 
> If we added a bloom join implementation, which can be combined with hash or 
> skewed join, it would boost performance for a lot of jobs. A bloom filter of the 
> smaller tables can be sent to the bigger tables as a scalar and the data filtered 
> before the hash or skewed join is applied.
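
As a rough sanity check on the sizing math above, here is a minimal Java sketch 
(class name and record count below are illustrative, not from the patch) of the 
(vectorSize + 7) / 8 byte calculation and the write amplification it implies:

{code}
public class BloomSizeEstimate {
    public static void main(String[] args) {
        // 100 million bit vector, as in the description above.
        long vectorSizeBits = 100_000_000L;
        long vectorBytes = (vectorSizeBits + 7) / 8;   // 12,500,000 bytes ~= 12MB
        // Hypothetical input size, for illustration only.
        long inputRecords = 1_000_000L;
        // If BuildBloom$Initial emits the full vector once per input record
        // before aggregation, the intermediate output grows linearly:
        long intermediateBytes = vectorBytes * inputRecords;
        System.out.printf("vector: %d bytes, intermediate: %d bytes%n",
                vectorBytes, intermediateBytes);
    }
}
{code}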



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] Subscription: PIG patch available

2017-01-25 Thread jira
Issue Subscription
Filter: PIG patch available (27 issues)

Subscriber: pigdaily

Key Summary
PIG-4926  Modify the content of start.xml for spark mode
https://issues-test.apache.org/jira/browse/PIG-4926
PIG-4922  Deadlock between SpillableMemoryManager and 
InternalSortedBag$SortedDataBagIterator
https://issues-test.apache.org/jira/browse/PIG-4922
PIG-4918  Pig on Tez cannot switch pig.temp.dir to another fs
https://issues-test.apache.org/jira/browse/PIG-4918
PIG-4897  Scope of param substitution for run/exec commands
https://issues-test.apache.org/jira/browse/PIG-4897
PIG-4886  Add PigSplit#getLocationInfo to fix the NPE found in log in spark 
mode
https://issues-test.apache.org/jira/browse/PIG-4886
PIG-4854  Merge spark branch to trunk
https://issues-test.apache.org/jira/browse/PIG-4854
PIG-4849  pig on tez will cause tez-ui to crash, because the content from 
timeline server is too long
https://issues-test.apache.org/jira/browse/PIG-4849
PIG-4788  the value BytesRead metric info always returns 0 even the length of 
input file is not 0 in spark engine
https://issues-test.apache.org/jira/browse/PIG-4788
PIG-4745  DataBag should protect content of passed list of tuples
https://issues-test.apache.org/jira/browse/PIG-4745
PIG-4684  Exception should be changed to warning when job diagnostics cannot 
be fetched
https://issues-test.apache.org/jira/browse/PIG-4684
PIG-4656  Improve String serialization and comparator performance in 
BinInterSedes
https://issues-test.apache.org/jira/browse/PIG-4656
PIG-4598  Allow user defined plan optimizer rules
https://issues-test.apache.org/jira/browse/PIG-4598
PIG-4551  Partition filter is not pushed down in case of SPLIT
https://issues-test.apache.org/jira/browse/PIG-4551
PIG-4539  New PigUnit
https://issues-test.apache.org/jira/browse/PIG-4539
PIG-4515  org.apache.pig.builtin.Distinct throws ClassCastException
https://issues-test.apache.org/jira/browse/PIG-4515
PIG-4323  PackageConverter hanging in Spark
https://issues-test.apache.org/jira/browse/PIG-4323
PIG-4313  StackOverflowError in LIMIT operation on Spark
https://issues-test.apache.org/jira/browse/PIG-4313
PIG-4251  Pig on Storm
https://issues-test.apache.org/jira/browse/PIG-4251
PIG-4002  Disable combiner when map-side aggregation is used
https://issues-test.apache.org/jira/browse/PIG-4002
PIG-3952  PigStorage accepts '-tagSplit' to return full split information
https://issues-test.apache.org/jira/browse/PIG-3952
PIG-3911  Define unique fields with @OutputSchema
https://issues-test.apache.org/jira/browse/PIG-3911
PIG-3877  Getting Geo Latitude/Longitude from Address Lines
https://issues-test.apache.org/jira/browse/PIG-3877
PIG-3873  Geo distance calculation using Haversine
https://issues-test.apache.org/jira/browse/PIG-3873
PIG-3864  ToDate(userstring, format, timezone) computes DateTime with strange 
handling of Daylight Saving Time with location based timezones
https://issues-test.apache.org/jira/browse/PIG-3864
PIG-3851  Upgrade jline to 2.11
https://issues-test.apache.org/jira/browse/PIG-3851
PIG-3668  COR built-in function when at least one of the coefficient values is 
NaN
https://issues-test.apache.org/jira/browse/PIG-3668
PIG-3587  add functionality for rolling over dates
https://issues-test.apache.org/jira/browse/PIG-3587

You may edit this subscription at:
https://issues-test.apache.org/jira/secure/FilterSubscription!default.jspa?subId=16328&filterId=12322384


[jira] Subscription: PIG patch available

2017-01-25 Thread jira
Issue Subscription
Filter: PIG patch available (32 issues)

Subscriber: pigdaily

Key Summary
PIG-5115  Builtin AvroStorage generates incorrect avro schema when the same 
pig field name appears in the alias
https://issues.apache.org/jira/browse/PIG-5115
PIG-5106  Optimize when mapreduce.input.fileinputformat.input.dir.recursive 
set to true
https://issues.apache.org/jira/browse/PIG-5106
PIG-5081  Can not run pig on spark source code distribution
https://issues.apache.org/jira/browse/PIG-5081
PIG-5080  Support store alias as spark table
https://issues.apache.org/jira/browse/PIG-5080
PIG-5057  IndexOutOfBoundsException when pig reducer processOnePackageOutput
https://issues.apache.org/jira/browse/PIG-5057
PIG-5029  Optimize sort case when data is skewed
https://issues.apache.org/jira/browse/PIG-5029
PIG-4926  Modify the content of start.xml for spark mode
https://issues.apache.org/jira/browse/PIG-4926
PIG-4854  Merge spark branch to trunk
https://issues.apache.org/jira/browse/PIG-4854
PIG-4849  pig on tez will cause tez-ui to crash, because the content from 
timeline server is too long
https://issues.apache.org/jira/browse/PIG-4849
PIG-4788  the value BytesRead metric info always returns 0 even the length of 
input file is not 0 in spark engine
https://issues.apache.org/jira/browse/PIG-4788
PIG-4750  REPLACE_MULTI should compile Pattern once and reuse it
https://issues.apache.org/jira/browse/PIG-4750
PIG-4748  DateTimeWritable forgets Chronology
https://issues.apache.org/jira/browse/PIG-4748
PIG-4745  DataBag should protect content of passed list of tuples
https://issues.apache.org/jira/browse/PIG-4745
PIG-4684  Exception should be changed to warning when job diagnostics cannot 
be fetched
https://issues.apache.org/jira/browse/PIG-4684
PIG-4656  Improve String serialization and comparator performance in 
BinInterSedes
https://issues.apache.org/jira/browse/PIG-4656
PIG-4598  Allow user defined plan optimizer rules
https://issues.apache.org/jira/browse/PIG-4598
PIG-4551  Partition filter is not pushed down in case of SPLIT
https://issues.apache.org/jira/browse/PIG-4551
PIG-4539  New PigUnit
https://issues.apache.org/jira/browse/PIG-4539
PIG-4515  org.apache.pig.builtin.Distinct throws ClassCastException
https://issues.apache.org/jira/browse/PIG-4515
PIG-4323  PackageConverter hanging in Spark
https://issues.apache.org/jira/browse/PIG-4323
PIG-4313  StackOverflowError in LIMIT operation on Spark
https://issues.apache.org/jira/browse/PIG-4313
PIG-4251  Pig on Storm
https://issues.apache.org/jira/browse/PIG-4251
PIG-4002  Disable combiner when map-side aggregation is used
https://issues.apache.org/jira/browse/PIG-4002
PIG-3952  PigStorage accepts '-tagSplit' to return full split information
https://issues.apache.org/jira/browse/PIG-3952
PIG-3911  Define unique fields with @OutputSchema
https://issues.apache.org/jira/browse/PIG-3911
PIG-3877  Getting Geo Latitude/Longitude from Address Lines
https://issues.apache.org/jira/browse/PIG-3877
PIG-3873  Geo distance calculation using Haversine
https://issues.apache.org/jira/browse/PIG-3873
PIG-3864  ToDate(userstring, format, timezone) computes DateTime with strange 
handling of Daylight Saving Time with location based timezones
https://issues.apache.org/jira/browse/PIG-3864
PIG-3851  Upgrade jline to 2.11
https://issues.apache.org/jira/browse/PIG-3851
PIG-3668  COR built-in function when at least one of the coefficient values is 
NaN
https://issues.apache.org/jira/browse/PIG-3668
PIG-3587  add functionality for rolling over dates
https://issues.apache.org/jira/browse/PIG-3587
PIG-1804  Allow Jython function to implement Algebraic and/or Accumulator 
interfaces
https://issues.apache.org/jira/browse/PIG-1804

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=16328&filterId=12322384


[jira] [Commented] (PIG-4891) Implement FR join by broadcasting small rdd not making more copies of data

2017-01-25 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15839327#comment-15839327
 ] 

liyunzhang_intel commented on PIG-4891:
---

Here is my understanding of this jira; let's use an example to explain it.
{code}
A = load './SkewedJoinInput1.txt' as (id,name,n);
B = load './SkewedJoinInput2.txt' as (id,name);
D = join A by (id,name), B by (id,name) using 'replicated';
explain D;
{code}
Before the patch, the spark plan is:
{code}
#--
# Spark Plan 
#--

Spark node scope-26
Store(hdfs://zly1.sh.intel.com:8020/tmp/temp1749487848/tmp1731009936:org.apache.pig.impl.io.InterStorage)
 - scope-27
|
|---B: New For Each(false,false)[bag] - scope-13
|   |
|   Project[bytearray][0] - scope-9
|   |
|   Project[bytearray][1] - scope-11
|
|---B: 
Load(hdfs://zly1.sh.intel.com:8020/user/root/SkewedJoinInput2.txt:org.apache.pig.builtin.PigStorage)
 - scope-8

Spark node scope-25
D: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-24
|
|---D: FRJoin[tuple] - scope-18
|   |
|   Project[bytearray][0] - scope-14
|   |
|   Project[bytearray][1] - scope-15
|   |
|   Project[bytearray][0] - scope-16
|   |
|   Project[bytearray][1] - scope-17
|
|---A: New For Each(false,false,false)[bag] - scope-7
|   |
|   Project[bytearray][0] - scope-1
|   |
|   Project[bytearray][1] - scope-3
|   |
|   Project[bytearray][2] - scope-5
|
|---A: 
Load(hdfs://zly1.sh.intel.com:8020/user/root/SkewedJoinInput1.txt:org.apache.pig.builtin.PigStorage)
 - scope-0
{code}

After the patch:
{code}
Spark node scope-28
D: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-24
|
|---D: FRJoinSpark[tuple] - scope-18
|   |
|   Project[bytearray][0] - scope-14
|   |
|   Project[bytearray][1] - scope-15
|   |
|   Project[bytearray][0] - scope-16
|   |
|   Project[bytearray][1] - scope-17
|
|---A: New For Each(false,false,false)[bag] - scope-7
|   |   |
|   |   Project[bytearray][0] - scope-1
|   |   |
|   |   Project[bytearray][1] - scope-3
|   |   |
|   |   Project[bytearray][2] - scope-5
|   |
|   |---A: 
Load(hdfs://zly1.sh.intel.com:8020/user/root/SkewedJoinInput1.txt:org.apache.pig.builtin.PigStorage)
 - scope-0
|
|---BroadcastSpark - scope-27
|
|---B: New For Each(false,false)[bag] - scope-13
|   |
|   Project[bytearray][0] - scope-9
|   |
|   Project[bytearray][1] - scope-11
|
|---B: 
Load(hdfs://zly1.sh.intel.com:8020/user/root/SkewedJoinInput2.txt:org.apache.pig.builtin.PigStorage)
 - scope-8
{code}
In the patch:
1. We no longer load the small table into the distributed cache and start a 
new job to load data from the distributed cache.
2. We load the small table as an rdd and broadcast the small rdd with 
SparkContext.broadcast(), as sketched below.
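
A minimal sketch of that approach with Spark's Java API (relation contents and 
key types are hypothetical; the actual patch wires this into Pig's physical 
operators rather than using raw pair RDDs):

{code}
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

import scala.Tuple2;

public class BroadcastFRJoinSketch {
    public static JavaRDD<Tuple2<String, Tuple2<String, String>>> frJoin(
            JavaSparkContext sc,
            JavaRDD<Tuple2<String, String>> bigRdd,
            JavaRDD<Tuple2<String, String>> smallRdd) {
        // 1. Collect the small relation on the driver and broadcast it once,
        //    instead of shipping files through the distributed cache.
        Map<String, String> smallMap = new HashMap<>();
        List<Tuple2<String, String>> rows = smallRdd.collect();
        for (Tuple2<String, String> t : rows) {
            smallMap.put(t._1(), t._2());
        }
        Broadcast<Map<String, String>> bc = sc.broadcast(smallMap);
        // 2. Join map-side: every partition of the big relation probes the
        //    broadcast map, so no extra job or shuffle is needed.
        return bigRdd
                .filter(t -> bc.value().containsKey(t._1()))
                .map(t -> new Tuple2<>(t._1(),
                        new Tuple2<>(t._2(), bc.value().get(t._1()))));
    }
}
{code}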



> Implement FR join by broadcasting small rdd not making more copies of data
> -
>
> Key: PIG-4891
> URL: https://issues.apache.org/jira/browse/PIG-4891
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
>
> In the current implementation of FRJoin (PIG-4771), we just set the replication 
> factor of the data to 10 to make data access more efficient, because the current 
> FRJoin algorithms can be reused that way. We need to figure out how to implement 
> FRJoin by broadcasting the small rdd in the current code base if we find the 
> performance can be improved a lot by broadcasting the rdd.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4891) Implement FR join by broadcasting small rdd not making more copies of data

2017-01-25 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15839313#comment-15839313
 ] 

liyunzhang_intel commented on PIG-4891:
---

[~nkollar]: LGTM except for some minor issues; I left some comments on RB.

> Implement FR join by broadcasting small rdd not making more copies of data
> -
>
> Key: PIG-4891
> URL: https://issues.apache.org/jira/browse/PIG-4891
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
>
> In the current implementation of FRJoin (PIG-4771), we just set the replication 
> factor of the data to 10 to make data access more efficient, because the current 
> FRJoin algorithms can be reused that way. We need to figure out how to implement 
> FRJoin by broadcasting the small rdd in the current code base if we find the 
> performance can be improved a lot by broadcasting the rdd.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4963) Add a Bloom join

2017-01-25 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15839299#comment-15839299
 ] 

Daniel Dai commented on PIG-4963:
-

+1 for the new patch (on RB).

> Add a Bloom join
> 
>
> Key: PIG-4963
> URL: https://issues.apache.org/jira/browse/PIG-4963
> Project: Pig
>  Issue Type: New Feature
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.17.0
>
> Attachments: PIG-4963-1.patch, PIG-4963-2.patch, PIG-4963-3.patch, 
> PIG-4963-4.patch, PIG-4963-5.patch
>
>
> In PIG-4925, an option was added to pass a BloomFilter as a scalar to the bloom 
> function. But it turned out that actually using it for big data, which required a 
> huge vector size, was very inefficient and led to OOM.
> I had initially calculated that it would take around a 12MB bytearray for a 
> 100 million vector size ((100000000 + 7) / 8 = 12500000 bytes) and that would 
> be the scalar value broadcast and would not take much space. But the problem is 
> that the 12MB was written out for every input record by BuildBloom$Initial before 
> the aggregation happens and we arrive at the final BloomFilter vector. And 
> with POPartialAgg it runs into OOM issues. 
> If we added a bloom join implementation, which can be combined with hash or 
> skewed join, it would boost performance for a lot of jobs. A bloom filter of the 
> smaller tables can be sent to the bigger tables as a scalar and the data filtered 
> before the hash or skewed join is applied.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4963) Add a Bloom join

2017-01-25 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-4963:

Attachment: PIG-4963-5.patch

> Add a Bloom join
> 
>
> Key: PIG-4963
> URL: https://issues.apache.org/jira/browse/PIG-4963
> Project: Pig
>  Issue Type: New Feature
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.17.0
>
> Attachments: PIG-4963-1.patch, PIG-4963-2.patch, PIG-4963-3.patch, 
> PIG-4963-4.patch, PIG-4963-5.patch
>
>
> In PIG-4925, an option was added to pass a BloomFilter as a scalar to the bloom 
> function. But it turned out that actually using it for big data, which required a 
> huge vector size, was very inefficient and led to OOM.
> I had initially calculated that it would take around a 12MB bytearray for a 
> 100 million vector size ((100000000 + 7) / 8 = 12500000 bytes) and that would 
> be the scalar value broadcast and would not take much space. But the problem is 
> that the 12MB was written out for every input record by BuildBloom$Initial before 
> the aggregation happens and we arrive at the final BloomFilter vector. And 
> with POPartialAgg it runs into OOM issues. 
> If we added a bloom join implementation, which can be combined with hash or 
> skewed join, it would boost performance for a lot of jobs. A bloom filter of the 
> smaller tables can be sent to the bigger tables as a scalar and the data filtered 
> before the hash or skewed join is applied.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-5111) e2e Utf8Test fails in local mode

2017-01-25 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-5111:

  Resolution: Fixed
Hadoop Flags: Reviewed
  Status: Resolved  (was: Patch Available)

Committed to branch-0.16 and trunk. Thanks for the review, Daniel.

> e2e Utf8Test fails in local mode
> 
>
> Key: PIG-5111
> URL: https://issues.apache.org/jira/browse/PIG-5111
> Project: Pig
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.17.0, 0.16.1
>
> Attachments: PIG-5111-1.patch
>
>
> The required test data is not set up during deploy in local mode 
> (test-e2e-deploy-local)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4963) Add a Bloom join

2017-01-25 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15839298#comment-15839298
 ] 

Rohini Palaniswamy commented on PIG-4963:
-

bq. But I feel it is more clear if the plan shows a filter + regular local 
rearrange. The execution plan of the latter is more understandable.
   Actually, in this case the bloom filter cannot be applied before the local 
rearrange. The local rearrange is what separates the record into key and value for 
the join, and the bloom filter is then applied on the key. So it has to be either 
part of the local rearrange operator, as currently implemented, or a separate 
operator after the local rearrange, which would be a lot more confusing. 
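
For illustration, a hedged sketch of that membership test with Hadoop's 
BloomFilter API (the class below is made up; in the patch this logic lives inside 
POBloomFilterRearrangeTez, and the surrounding operator plumbing is omitted):

{code}
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;

public class JoinKeyFilterSketch {
    // The combined bloom filter, received by the vertex as a scalar.
    private final BloomFilter bloom;

    public JoinKeyFilterSketch(BloomFilter bloom) {
        this.bloom = bloom;
    }

    /** Returns true if the record should be kept for the join. */
    public boolean keep(byte[] serializedJoinKey) {
        // The local rearrange has already split the record into key and value,
        // so the membership test runs directly on the serialized key bytes.
        return bloom.membershipTest(new Key(serializedJoinKey));
    }
}
{code}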

> Add a Bloom join
> 
>
> Key: PIG-4963
> URL: https://issues.apache.org/jira/browse/PIG-4963
> Project: Pig
>  Issue Type: New Feature
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.17.0
>
> Attachments: PIG-4963-1.patch, PIG-4963-2.patch, PIG-4963-3.patch, 
> PIG-4963-4.patch
>
>
> In PIG-4925, an option was added to pass a BloomFilter as a scalar to the bloom 
> function. But it turned out that actually using it for big data, which required a 
> huge vector size, was very inefficient and led to OOM.
> I had initially calculated that it would take around a 12MB bytearray for a 
> 100 million vector size ((100000000 + 7) / 8 = 12500000 bytes) and that would 
> be the scalar value broadcast and would not take much space. But the problem is 
> that the 12MB was written out for every input record by BuildBloom$Initial before 
> the aggregation happens and we arrive at the final BloomFilter vector. And 
> with POPartialAgg it runs into OOM issues. 
> If we added a bloom join implementation, which can be combined with hash or 
> skewed join, it would boost performance for a lot of jobs. A bloom filter of the 
> smaller tables can be sent to the bigger tables as a scalar and the data filtered 
> before the hash or skewed join is applied.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Review Request 55681: [PIG-4963] Add a Bloom join

2017-01-25 Thread Daniel Dai

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55681/#review163100
---


Ship it!




Ship It!

- Daniel Dai


On Jan. 26, 2017, 5:55 a.m., Rohini Palaniswamy wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/55681/
> ---
> 
> (Updated Jan. 26, 2017, 5:55 a.m.)
> 
> 
> Review request for pig, Daniel Dai and Adam Szita.
> 
> 
> Bugs: PIG-4963
> https://issues.apache.org/jira/browse/PIG-4963
> 
> 
> Repository: pig
> 
> 
> Description
> ---
> 
> This patch adds a new type of join called bloom. It supports creating 
> multiple bloom filters partitioned by the hashcode of the key for parallelism. Two 
> new operators and one Packager implementation are added.
>   POBuildBloomRearrangeTez - Builds the bloom filter for one of the relations 
> of the join on the map side, or writes out the join keys, based on the strategy.
>   BloomPackager - Used in the reducer to create or combine bloom filters and 
> produce the final bloom filters. 
>   POBloomFilterRearrangeTez - Applies the bloom filters to the other relations in 
> the join and filters out data.
> 
> More details in the documentation.
> 
> 
> Diffs
> -
> 
>   
> http://svn.apache.org/repos/asf/pig/trunk/src/docs/src/documentation/content/xdocs/basic.xml
>  1779665 
>   
> http://svn.apache.org/repos/asf/pig/trunk/src/docs/src/documentation/content/xdocs/perf.xml
>  1779665 
>   
> http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/PigConfiguration.java
>  1779665 
>   
> http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java
>  1779665 
>   
> http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigCombiner.java
>  1779665 
>   
> http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/EndOfAllInputSetter.java
>  1779665 
>   
> http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/PhysicalOperator.java
>  1779665 
>   
> http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/Packager.java
>  1779665 
>   
> http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java
>  1779665 
>   
> http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/TezCompiler.java
>  1779665 
>   
> http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/TezEdgeDescriptor.java
>  1779665 
>   
> http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/TezOperator.java
>  1779665 
>   
> http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/TezPOPackageAnnotator.java
>  1779665 
>   
> http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/operator/BloomPackager.java
>  PRE-CREATION 
>   
> http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/operator/POBloomFilterRearrangeTez.java
>  PRE-CREATION 
>   
> http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/operator/POBuildBloomRearrangeTez.java
>  PRE-CREATION 
>   
> http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/operator/POLocalRearrangeTez.java
>  1779665 
>   
> http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/operator/POShuffleTezLoad.java
>  1779665 
>   
> http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/optimizer/CombinerOptimizer.java
>  1779665 
>   
> http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/optimizer/ParallelismSetter.java
>  1779665 
>   
> http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/optimizer/SecondaryKeyOptimizerTez.java
>  1779665 
>   
> http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/optimizer/TezEstimatedParallelismClearer.java
>  1779665 
>   
> http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/optimizer/TezOperDependencyParallelismEstimator.java
>  1779665 
>   
> http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/relational/LOJoin.java
>  1779665 
>   
> http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/p

Re: Review Request 55681: [PIG-4963] Add a Bloom join

2017-01-25 Thread Rohini Palaniswamy

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55681/
---

(Updated Jan. 26, 2017, 5:55 a.m.)


Review request for pig, Daniel Dai and Adam Szita.


Changes
---

Addressed review comments on documentation


Bugs: PIG-4963
https://issues.apache.org/jira/browse/PIG-4963


Repository: pig


Description
---

This patch adds a new type of join called bloom. It supports creating multiple 
bloom filters partitioned by the hashcode of the key for parallelism. Two new 
operators and one Packager implementation are added.
  POBuildBloomRearrangeTez - Builds the bloom filter for one of the relations 
of the join on the map side, or writes out the join keys, based on the strategy.
  BloomPackager - Used in the reducer to create or combine bloom filters and 
produce the final bloom filters. 
  POBloomFilterRearrangeTez - Applies the bloom filters to the other relations in 
the join and filters out data.

More details in the documentation.
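
For reference, the join strategy is selected with USING 'bloom' (documented in the 
perf.xml change in this diff); a small driver sketch through PigServer, with 
hypothetical paths and field names:

{code}
import java.io.IOException;
import org.apache.pig.PigServer;

public class BloomJoinExample {
    public static void main(String[] args) throws IOException {
        // Bloom join is implemented for the Tez backend.
        PigServer pig = new PigServer("tez");
        pig.registerQuery("big = LOAD '/data/big' AS (k, v);");
        pig.registerQuery("small = LOAD '/data/small' AS (k, w);");
        // Bloom filters are built from one relation's join keys and used to
        // filter the other relations before the join itself runs.
        pig.registerQuery("j = JOIN big BY k, small BY k USING 'bloom';");
        pig.store("j", "/data/joined");
    }
}
{code}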


Diffs (updated)
-

  
http://svn.apache.org/repos/asf/pig/trunk/src/docs/src/documentation/content/xdocs/basic.xml
 1779665 
  
http://svn.apache.org/repos/asf/pig/trunk/src/docs/src/documentation/content/xdocs/perf.xml
 1779665 
  
http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/PigConfiguration.java
 1779665 
  
http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java
 1779665 
  
http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigCombiner.java
 1779665 
  
http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/EndOfAllInputSetter.java
 1779665 
  
http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/PhysicalOperator.java
 1779665 
  
http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/Packager.java
 1779665 
  
http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java
 1779665 
  
http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/TezCompiler.java
 1779665 
  
http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/TezEdgeDescriptor.java
 1779665 
  
http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/TezOperator.java
 1779665 
  
http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/TezPOPackageAnnotator.java
 1779665 
  
http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/operator/BloomPackager.java
 PRE-CREATION 
  
http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/operator/POBloomFilterRearrangeTez.java
 PRE-CREATION 
  
http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/operator/POBuildBloomRearrangeTez.java
 PRE-CREATION 
  
http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/operator/POLocalRearrangeTez.java
 1779665 
  
http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/operator/POShuffleTezLoad.java
 1779665 
  
http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/optimizer/CombinerOptimizer.java
 1779665 
  
http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/optimizer/ParallelismSetter.java
 1779665 
  
http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/optimizer/SecondaryKeyOptimizerTez.java
 1779665 
  
http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/optimizer/TezEstimatedParallelismClearer.java
 1779665 
  
http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/optimizer/TezOperDependencyParallelismEstimator.java
 1779665 
  
http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/relational/LOJoin.java
 1779665 
  
http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/relational/LogToPhyTranslationVisitor.java
 1779665 
  
http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/LogicalPlanBuilder.java
 1779665 
  
http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/tools/pigstats/ScriptState.java
 1779665 
  
http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/tools/pigstats/tez/TezScriptState.java
 1779665 
  http://svn.apache.org/repos/asf/pig/trunk/test/e2e/pig/build.xml 1779665 
  htt

[jira] [Updated] (PIG-5112) Cleanup pig-template.xml

2017-01-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-5112:

  Resolution: Fixed
Hadoop Flags: Reviewed
  Status: Resolved  (was: Patch Available)

Patch committed to both the 0.16 branch and trunk. Thanks Thejas for the review!

> Cleanup pig-template.xml
> 
>
> Key: PIG-5112
> URL: https://issues.apache.org/jira/browse/PIG-5112
> Project: Pig
>  Issue Type: Bug
>  Components: build
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.17.0, 0.16.1
>
> Attachments: PIG-5112-1.patch
>
>
> Several entries in pig-template.xml are outdated. Attaching a patch to remove or 
> update those entries. Later we shall use ivy:makepom to generate pig.pom and the 
> lib dir; I will open a separate ticket for that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4963) Add a Bloom join

2017-01-25 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15839173#comment-15839173
 ] 

Rohini Palaniswamy commented on PIG-4963:
-

Will address 1. For 3, I did a quick run of the Join tests converted to use bloom 
and they were fine, except for full outer join which is not supported. Actually, the 
tests added for bloom join cover all cases in the Join group and in fact cover a lot 
more - tuple keys and more datatypes for keys, more cases for union and split. They 
also use studentnulltab10k, which tests null cases better. The self join case is 
covered in multiquery.conf.

bq. But I feel it is more clear if the plan shows a filter + regular local 
rearrange. The execution plan of the latter is more understandable.
  I think it is unnecessary overhead to add a separate filter operator just for 
readability. The current Filter operator, which executes a plan for filtering, has 
no relation to the BloomFilter way of filtering, and it does not logically make 
sense to extend it for BloomFilter. This is simpler and cleaner in terms of 
implementation and should also be faster in terms of execution, as there is no 
unnecessary overhead.

> Add a Bloom join
> 
>
> Key: PIG-4963
> URL: https://issues.apache.org/jira/browse/PIG-4963
> Project: Pig
>  Issue Type: New Feature
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.17.0
>
> Attachments: PIG-4963-1.patch, PIG-4963-2.patch, PIG-4963-3.patch, 
> PIG-4963-4.patch
>
>
> In PIG-4925, an option was added to pass a BloomFilter as a scalar to the bloom 
> function. But it turned out that actually using it for big data, which required a 
> huge vector size, was very inefficient and led to OOM.
> I had initially calculated that it would take around a 12MB bytearray for a 
> 100 million vector size ((100000000 + 7) / 8 = 12500000 bytes) and that would 
> be the scalar value broadcast and would not take much space. But the problem is 
> that the 12MB was written out for every input record by BuildBloom$Initial before 
> the aggregation happens and we arrive at the final BloomFilter vector. And 
> with POPartialAgg it runs into OOM issues. 
> If we added a bloom join implementation, which can be combined with hash or 
> skewed join, it would boost performance for a lot of jobs. A bloom filter of the 
> smaller tables can be sent to the bigger tables as a scalar and the data filtered 
> before the hash or skewed join is applied.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-5111) e2e Utf8Test fails in local mode

2017-01-25 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15838976#comment-15838976
 ] 

Daniel Dai commented on PIG-5111:
-

+1

> e2e Utf8Test fails in local mode
> 
>
> Key: PIG-5111
> URL: https://issues.apache.org/jira/browse/PIG-5111
> Project: Pig
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.17.0, 0.16.1
>
> Attachments: PIG-5111-1.patch
>
>
> The required test data is not set up during deploy in local mode 
> (test-e2e-deploy-local)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-5115) Builtin AvroStorage generates incorrect avro schema when the same pig field name appears in the alias

2017-01-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-5115:

Assignee: Anyi Li

> Builtin AvroStorage generates incorrect avro schema when the same pig field 
> name appears in the alias
> -
>
> Key: PIG-5115
> URL: https://issues.apache.org/jira/browse/PIG-5115
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.17.0
>Reporter: Anyi Li
>Assignee: Anyi Li
> Fix For: 0.17.0
>
> Attachments: PIG-5115.patch
>
>
> Pig ResourceSchema allows using the same field name with different types when 
> the fields are not at the same level. Consider a pig schema like
> {quote}
> data: {col1: (col2: (col1_data: chararray)),col2: {col2: (col2_data: 
> chararray)}}
> {quote}
> Although _col2_ has been redefined, the two definitions do not appear at the 
> same level, so it is a totally valid pig schema. 
> However, once it is translated by AvroStorage, it throws an exception: 
> {noformat}
> Can't redefine: col2
> at 
> org.apache.pig.newplan.logical.visitor.InputOutputFileValidatorVisitor.visit(InputOutputFileValidatorVisitor.java:64)
> at 
> org.apache.pig.newplan.logical.relational.LOStore.accept(LOStore.java:66)
> at 
> org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:64)
> at 
> org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
> at 
> org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
> at 
> org.apache.pig.newplan.DepthFirstWalker.walk(DepthFirstWalker.java:53)
> at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
> at 
> org.apache.pig.newplan.logical.relational.LogicalPlan.validate(LogicalPlan.java:212)
> at org.apache.pig.PigServer$Graph.compile(PigServer.java:1767)
> at org.apache.pig.PigServer$Graph.access$300(PigServer.java:1443)
> at org.apache.pig.PigServer.execute(PigServer.java:1356)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:415)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:398)
> at 
> org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:234)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
> at org.apache.pig.Main.run(Main.java:631)
> at org.apache.pig.Main.main(Main.java:177)
> Caused by: org.apache.avro.SchemaParseException: Can't redefine: col2
> at org.apache.avro.Schema$Names.put(Schema.java:1042)
> at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:511)
> at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:626)
> at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:737)
> at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:814)
> at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:648)
> at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:635)
> at org.apache.avro.Schema.toString(Schema.java:297)
> at org.apache.avro.Schema.toString(Schema.java:287)
> at 
> org.apache.pig.builtin.AvroStorage.setOutputAvroSchema(AvroStorage.java:442)
> at 
> org.apache.pig.builtin.AvroStorage.checkSchema(AvroStorage.java:433)
> at 
> org.apache.pig.newplan.logical.visitor.InputOutputFileValidatorVisitor.visit(InputOutputFileValidatorVisitor.java:54)
> ... 18 more
> {noformat}
> It is caused by a bug in the AvroStorageSchemaConversionUtilities class, which 
> uses the tuple name both as the GenericRecord name and as the field name that 
> wraps the record. 
> So it produces an avro schema like the following: 
> {noformat}
> {
>   "type": "record",
>   "name": "data",
>   "fields": [
> {
>   "name": "col1",
>   "type": {
> "type": "record",
> "name": "col1_1",
> "fields": [
>   {
> "name": "col2",
> "type": {
>   "type": "record",
>   "name": "col2",
>   "fields": [
> {
>   "name": "col1_data",
>   "type": "string"
> }
>   ]
> }
>   }
> ]
>   }
> },
> {
>   "name": "col2",
>   "type": {
> "type": "array",
> "items": {
>   "type": "record",
>   "name": "col2",
>   "fields": [
> {
>   "name": "col2_data",
>   "type": "string"
> }
>   ]
> }
>   }
> }
>   ]
> }
> {noformat}
> But according 

[jira] [Resolved] (PIG-5113) Not a valid JAR

2017-01-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai resolved PIG-5113.
-
Resolution: Not A Problem

The classic way is to export the PIG_HEAPSIZE environment variable. But it seems 
you solved the issue anyway.

> Not a valid JAR
> ---
>
> Key: PIG-5113
> URL: https://issues.apache.org/jira/browse/PIG-5113
> Project: Pig
>  Issue Type: Bug
>  Components: grunt
>Affects Versions: 0.16.0
> Environment: Ubuntu Server 16.04
>Reporter: Fabrizio Massara
>
> Hello, I installed Pig on Ubuntu Server 16.04 and I need to use it in local 
> mode.
> Yesterday I tried to run some jobs but unfortunately they were killed because 
> the Java heap space wasn't enough. I increased it, but now, when I try to run 
> pig, this error appears:
> Not a valid JAR: /usr/local/pig/pig-0.16.0-core-h2.jar 
> /usr/local/pig/pig-0.16.0-SNAPSHOT-core-h2.jar
> How can I solve this? I cannot find a solution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4963) Add a Bloom join

2017-01-25 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15838928#comment-15838928
 ] 

Daniel Dai commented on PIG-4963:
-

I glanced through the patch and it looks very good. I have some minor comments:
1. The documentation about the left outer join gives the impression that the user 
can make bloom join efficient by switching the order of the relations. Actually 
this is a limitation of bloom join, and switching the order does not solve the 
problem. We should make that clearer.
2. Currently we use POBloomFilterRearrangeTez for the bloom filter. But I feel it 
is clearer if the plan shows a filter + regular local rearrange. The execution 
plan of the latter is more understandable.
3. The patch does have quite a bit of test coverage. However, we can run the 
existing join e2e tests once with bloom join and make sure it works. That's an 
easy approach for additional tests.

I still need more time to do a code-level review, but I am fine with committing 
once we have done #1 and #3, and dealt with #2 and other review comments in 
follow-up Jiras.

> Add a Bloom join
> 
>
> Key: PIG-4963
> URL: https://issues.apache.org/jira/browse/PIG-4963
> Project: Pig
>  Issue Type: New Feature
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.17.0
>
> Attachments: PIG-4963-1.patch, PIG-4963-2.patch, PIG-4963-3.patch, 
> PIG-4963-4.patch
>
>
> In PIG-4925, an option was added to pass a BloomFilter as a scalar to the bloom 
> function. But it turned out that actually using it for big data, which required a 
> huge vector size, was very inefficient and led to OOM.
> I had initially calculated that it would take around a 12MB bytearray for a 
> 100 million vector size ((100000000 + 7) / 8 = 12500000 bytes) and that would 
> be the scalar value broadcast and would not take much space. But the problem is 
> that the 12MB was written out for every input record by BuildBloom$Initial before 
> the aggregation happens and we arrive at the final BloomFilter vector. And 
> with POPartialAgg it runs into OOM issues. 
> If we added a bloom join implementation, which can be combined with hash or 
> skewed join, it would boost performance for a lot of jobs. A bloom filter of the 
> smaller tables can be sent to the bigger tables as a scalar and the data filtered 
> before the hash or skewed join is applied.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-5115) Builtin AvroStorage generates incorrect avro schema when the same pig field name appears in the alias

2017-01-25 Thread Anyi Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anyi Li updated PIG-5115:
-
Fix Version/s: 0.17.0
   Status: Patch Available  (was: Open)

> Builtin AvroStorage generates incorrect avro schema when the same pig field 
> name appears in the alias
> -
>
> Key: PIG-5115
> URL: https://issues.apache.org/jira/browse/PIG-5115
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.17.0
>Reporter: Anyi Li
> Fix For: 0.17.0
>
> Attachments: PIG-5115.patch
>
>
> Pig ResourceSchema allows using the same field name with different types when 
> the fields are not at the same level. Consider a pig schema like
> {quote}
> data: {col1: (col2: (col1_data: chararray)),col2: {col2: (col2_data: 
> chararray)}}
> {quote}
> Although _col2_ has been redefined, the two definitions do not appear at the 
> same level, so it is a totally valid pig schema. 
> However, once it is translated by AvroStorage, it throws an exception: 
> {noformat}
> Can't redefine: col2
> at 
> org.apache.pig.newplan.logical.visitor.InputOutputFileValidatorVisitor.visit(InputOutputFileValidatorVisitor.java:64)
> at 
> org.apache.pig.newplan.logical.relational.LOStore.accept(LOStore.java:66)
> at 
> org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:64)
> at 
> org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
> at 
> org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
> at 
> org.apache.pig.newplan.DepthFirstWalker.walk(DepthFirstWalker.java:53)
> at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
> at 
> org.apache.pig.newplan.logical.relational.LogicalPlan.validate(LogicalPlan.java:212)
> at org.apache.pig.PigServer$Graph.compile(PigServer.java:1767)
> at org.apache.pig.PigServer$Graph.access$300(PigServer.java:1443)
> at org.apache.pig.PigServer.execute(PigServer.java:1356)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:415)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:398)
> at 
> org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:234)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
> at org.apache.pig.Main.run(Main.java:631)
> at org.apache.pig.Main.main(Main.java:177)
> Caused by: org.apache.avro.SchemaParseException: Can't redefine: col2
> at org.apache.avro.Schema$Names.put(Schema.java:1042)
> at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:511)
> at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:626)
> at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:737)
> at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:814)
> at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:648)
> at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:635)
> at org.apache.avro.Schema.toString(Schema.java:297)
> at org.apache.avro.Schema.toString(Schema.java:287)
> at 
> org.apache.pig.builtin.AvroStorage.setOutputAvroSchema(AvroStorage.java:442)
> at 
> org.apache.pig.builtin.AvroStorage.checkSchema(AvroStorage.java:433)
> at 
> org.apache.pig.newplan.logical.visitor.InputOutputFileValidatorVisitor.visit(InputOutputFileValidatorVisitor.java:54)
> ... 18 more
> {noformat}
> It is caused by a bug in the AvroStorageSchemaConversionUtilities class, which 
> uses the tuple name both as the GenericRecord name and as the field name that 
> wraps the record. 
> So it produces an avro schema like the following: 
> {noformat}
> {
>   "type": "record",
>   "name": "data",
>   "fields": [
> {
>   "name": "col1",
>   "type": {
> "type": "record",
> "name": "col1_1",
> "fields": [
>   {
> "name": "col2",
> "type": {
>   "type": "record",
>   "name": "col2",
>   "fields": [
> {
>   "name": "col1_data",
>   "type": "string"
> }
>   ]
> }
>   }
> ]
>   }
> },
> {
>   "name": "col2",
>   "type": {
> "type": "array",
> "items": {
>   "type": "record",
>   "name": "col2",
>   "fields": [
> {
>   "name": "col2_data",
>   "type": "string"
> }
>   ]
> }
>   }
> }
>   ]
> }
> {noformat}
>

[jira] [Updated] (PIG-5115) Builtin AvroStorage generates incorrect avro schema when the same pig field name appears in the alias

2017-01-25 Thread Anyi Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anyi Li updated PIG-5115:
-
Attachment: PIG-5115.patch

> Builtin AvroStorage generates incorrect avro schema when the same pig field 
> name appears in the alias
> -
>
> Key: PIG-5115
> URL: https://issues.apache.org/jira/browse/PIG-5115
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.17.0
>Reporter: Anyi Li
> Attachments: PIG-5115.patch
>
>
> Pig ResourceSchema allows using the same field name with different types when 
> the fields are not at the same level. Consider a pig schema like
> {quote}
> data: {col1: (col2: (col1_data: chararray)),col2: {col2: (col2_data: 
> chararray)}}
> {quote}
> Although _col2_ has been redefined, the two definitions do not appear at the 
> same level, so it is a totally valid pig schema. 
> However, once it is translated by AvroStorage, it throws an exception: 
> {noformat}
> Can't redefine: col2
> at 
> org.apache.pig.newplan.logical.visitor.InputOutputFileValidatorVisitor.visit(InputOutputFileValidatorVisitor.java:64)
> at 
> org.apache.pig.newplan.logical.relational.LOStore.accept(LOStore.java:66)
> at 
> org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:64)
> at 
> org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
> at 
> org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
> at 
> org.apache.pig.newplan.DepthFirstWalker.walk(DepthFirstWalker.java:53)
> at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
> at 
> org.apache.pig.newplan.logical.relational.LogicalPlan.validate(LogicalPlan.java:212)
> at org.apache.pig.PigServer$Graph.compile(PigServer.java:1767)
> at org.apache.pig.PigServer$Graph.access$300(PigServer.java:1443)
> at org.apache.pig.PigServer.execute(PigServer.java:1356)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:415)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:398)
> at 
> org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:234)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
> at org.apache.pig.Main.run(Main.java:631)
> at org.apache.pig.Main.main(Main.java:177)
> Caused by: org.apache.avro.SchemaParseException: Can't redefine: col2
> at org.apache.avro.Schema$Names.put(Schema.java:1042)
> at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:511)
> at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:626)
> at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:737)
> at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:814)
> at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:648)
> at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:635)
> at org.apache.avro.Schema.toString(Schema.java:297)
> at org.apache.avro.Schema.toString(Schema.java:287)
> at 
> org.apache.pig.builtin.AvroStorage.setOutputAvroSchema(AvroStorage.java:442)
> at 
> org.apache.pig.builtin.AvroStorage.checkSchema(AvroStorage.java:433)
> at 
> org.apache.pig.newplan.logical.visitor.InputOutputFileValidatorVisitor.visit(InputOutputFileValidatorVisitor.java:54)
> ... 18 more
> {noformat}
> It is caused by a bug in the AvroStorageSchemaConversionUtilities class, which 
> uses the tuple name both as the GenericRecord name and as the field name that 
> wraps the record. 
> So it produces an avro schema like the following: 
> {noformat}
> {
>   "type": "record",
>   "name": "data",
>   "fields": [
> {
>   "name": "col1",
>   "type": {
> "type": "record",
> "name": "col1_1",
> "fields": [
>   {
> "name": "col2",
> "type": {
>   "type": "record",
>   "name": "col2",
>   "fields": [
> {
>   "name": "col1_data",
>   "type": "string"
> }
>   ]
> }
>   }
> ]
>   }
> },
> {
>   "name": "col2",
>   "type": {
> "type": "array",
> "items": {
>   "type": "record",
>   "name": "col2",
>   "fields": [
> {
>   "name": "col2_data",
>   "type": "string"
> }
>   ]
> }
>   }
> }
>   ]
> }
> {noformat}
> But according to the avro 1.7.7  specs 
> ([https://avro.apache.org/docs/1

[jira] [Updated] (PIG-5115) Builtin AvroStorage generates incorrect avro schema when the same pig field name appears in the alias

2017-01-25 Thread Anyi Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anyi Li updated PIG-5115:
-
Description: 
Pig ResourceSchema allows using the same field name with different types when the 
fields are not at the same level. Consider a pig schema like
{quote}
data: {col1: (col2: (col1_data: chararray)),col2: {col2: (col2_data: 
chararray)}}
{quote}

Although _col2_ has been redefined, the two definitions do not appear at the same 
level, so it is a totally valid pig schema. 

However, once it is translated by AvroStorage, it throws an exception: 
{noformat}
Can't redefine: col2
at 
org.apache.pig.newplan.logical.visitor.InputOutputFileValidatorVisitor.visit(InputOutputFileValidatorVisitor.java:64)
at 
org.apache.pig.newplan.logical.relational.LOStore.accept(LOStore.java:66)
at 
org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:64)
at 
org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
at 
org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
at 
org.apache.pig.newplan.DepthFirstWalker.walk(DepthFirstWalker.java:53)
at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
at 
org.apache.pig.newplan.logical.relational.LogicalPlan.validate(LogicalPlan.java:212)
at org.apache.pig.PigServer$Graph.compile(PigServer.java:1767)
at org.apache.pig.PigServer$Graph.access$300(PigServer.java:1443)
at org.apache.pig.PigServer.execute(PigServer.java:1356)
at org.apache.pig.PigServer.executeBatch(PigServer.java:415)
at org.apache.pig.PigServer.executeBatch(PigServer.java:398)
at 
org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:234)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
at org.apache.pig.Main.run(Main.java:631)
at org.apache.pig.Main.main(Main.java:177)
Caused by: org.apache.avro.SchemaParseException: Can't redefine: col2
at org.apache.avro.Schema$Names.put(Schema.java:1042)
at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:511)
at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:626)
at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:737)
at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:814)
at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:648)
at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:635)
at org.apache.avro.Schema.toString(Schema.java:297)
at org.apache.avro.Schema.toString(Schema.java:287)
at 
org.apache.pig.builtin.AvroStorage.setOutputAvroSchema(AvroStorage.java:442)
at org.apache.pig.builtin.AvroStorage.checkSchema(AvroStorage.java:433)
at 
org.apache.pig.newplan.logical.visitor.InputOutputFileValidatorVisitor.visit(InputOutputFileValidatorVisitor.java:54)
... 18 more
{noformat}

It is caused by a bug in the AvroStorageSchemaConversionUtilities class, which uses 
the tuple name both as the GenericRecord name and as the field name that wraps the 
record. 

So it produces an avro schema like the following: 
{noformat}
{
  "type": "record",
  "name": "data",
  "fields": [
{
  "name": "col1",
  "type": {
"type": "record",
"name": "col1_1",
"fields": [
  {
"name": "col2",
"type": {
  "type": "record",
  "name": "col2",
  "fields": [
{
  "name": "col1_data",
  "type": "string"
}
  ]
}
  }
]
  }
},
{
  "name": "col2",
  "type": {
"type": "array",
"items": {
  "type": "record",
  "name": "col2",
  "fields": [
{
  "name": "col2_data",
  "type": "string"
}
  ]
}
  }
}
  ]
}

{noformat}
But according to the avro 1.7.7 spec 
([https://avro.apache.org/docs/1.7.7/spec.html#Names]), since _col2_ has been 
defined as a record and later redefined as an array, the schema is invalid unless 
the fullname (namespace + name) is unique. 

Since AvroStorageSchemaConversionUtilities will generate an avro record whenever 
the pig schema is a tuple, we need a way to generate a unique _recordName_ (one 
possible sketch follows the signature below). 

{code}
public static Schema resourceSchemaToAvroSchema(final ResourceSchema rs,
  String recordName, final String recordNameSpace,
  final Map<String, List<Schema>> definedRecordNames,
  final Boolean doubleColonsToDoubleUnderscores) throws IOException {

if (rs == null) {
  return null;
}

recordName = toAvroName(recordName, doubleColonsToDoubleUnderscores);

List fields 
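
One possible shape for that uniquifying step (a sketch only; the suffix scheme 
mirrors the generated name _col1_1_ in the schema above and is not necessarily 
the committed fix):

{code}
import java.util.HashMap;
import java.util.Map;

public class RecordNameUniquifier {
    // Counts how many times each base record name has already been defined.
    private final Map<String, Integer> usedNames = new HashMap<>();

    /** Returns recordName unchanged the first time, then name_1, name_2, ... */
    public String uniquify(String recordName) {
        Integer count = usedNames.get(recordName);
        if (count == null) {
            usedNames.put(recordName, 0);
            return recordName;
        }
        usedNames.put(recordName, count + 1);
        return recordName + "_" + (count + 1);
    }
}
{code}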

[jira] [Updated] (PIG-5115) Builtin AvroStorage generates incorrect avro schema when the same pig field name appears in the alias

2017-01-25 Thread Anyi Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anyi Li updated PIG-5115:
-
Description: 
Pig ResourceSchema allows using the same field name with different types when the 
fields are not at the same level. Consider a pig schema like
{quote}
data: {col1: (col2: (col1_data: chararray)),col2: {col2: (col2_data: 
chararray)}}
{quote}

Although _col2_ has been redefined, the two definitions do not appear at the same 
level, so it is a totally valid pig schema. 

However, once it is translated by AvroStorage, it throws an exception: 
{noformat}
Can't redefine: col2
at 
org.apache.pig.newplan.logical.visitor.InputOutputFileValidatorVisitor.visit(InputOutputFileValidatorVisitor.java:64)
at 
org.apache.pig.newplan.logical.relational.LOStore.accept(LOStore.java:66)
at 
org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:64)
at 
org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
at 
org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
at 
org.apache.pig.newplan.DepthFirstWalker.walk(DepthFirstWalker.java:53)
at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
at 
org.apache.pig.newplan.logical.relational.LogicalPlan.validate(LogicalPlan.java:212)
at org.apache.pig.PigServer$Graph.compile(PigServer.java:1767)
at org.apache.pig.PigServer$Graph.access$300(PigServer.java:1443)
at org.apache.pig.PigServer.execute(PigServer.java:1356)
at org.apache.pig.PigServer.executeBatch(PigServer.java:415)
at org.apache.pig.PigServer.executeBatch(PigServer.java:398)
at 
org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:234)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
at org.apache.pig.Main.run(Main.java:631)
at org.apache.pig.Main.main(Main.java:177)
Caused by: org.apache.avro.SchemaParseException: Can't redefine: col2
at org.apache.avro.Schema$Names.put(Schema.java:1042)
at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:511)
at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:626)
at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:737)
at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:814)
at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:648)
at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:635)
at org.apache.avro.Schema.toString(Schema.java:297)
at org.apache.avro.Schema.toString(Schema.java:287)
at 
org.apache.pig.builtin.AvroStorage.setOutputAvroSchema(AvroStorage.java:442)
at org.apache.pig.builtin.AvroStorage.checkSchema(AvroStorage.java:433)
at 
org.apache.pig.newplan.logical.visitor.InputOutputFileValidatorVisitor.visit(InputOutputFileValidatorVisitor.java:54)
... 18 more
{noformat}

This is caused by a bug in the AvroStorageSchemaConversionUtilities class, which 
uses the tuple name both as the GenericRecord name and as the field name that 
wraps the record. 

As a result, it produces an Avro schema like the following: 
{noformat}
{
  "type": "record",
  "name": "data",
  "fields": [
{
  "name": "col1",
  "type": {
"type": "record",
"name": "col1_1",
"fields": [
  {
"name": "col2",
"type": {
  "type": "record",
  "name": "col2",
  "fields": [
{
  "name": "col1_data",
  "type": "string"
}
  ]
}
  }
]
  }
},
{
  "name": "col2",
  "type": {
"type": "array",
"items": {
  "type": "record",
  "name": "col2",
  "fields": [
{
  "name": "col2_data",
  "type": "string"
}
  ]
}
  }
}
  ]
}

{noformat}
But according to the Avro 1.7.7 spec 
([https://avro.apache.org/docs/1.7.7/spec.html#Names]), _col2_ is first defined 
as a record and later redefined as an array, which makes the schema invalid 
unless each fullname (namespace + name) is unique. 

Since AvroStorageSchemaConversionUtilities generates an Avro record whenever the 
pig schema is a tuple, we need a way to generate a unique _recordName_. 

{code}
public static Schema resourceSchemaToAvroSchema(final ResourceSchema rs,
      String recordName, final String recordNameSpace,
      final Map<String, List<Schema>> definedRecordNames,
      final Boolean doubleColonsToDoubleUnderscores) throws IOException {

    if (rs == null) {
      return null;
    }

    recordName = toAvroName(recordName, doubleColonsToDoubleUnderscores);

    List<Schema.Field> fields = new ArrayList<Schema.Field>();
    // ... (rest of the method)
{code}
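
One possible direction is to make the name unique against the set of record names that have already been defined; a rough sketch (the helper below is hypothetical, not a committed fix):
{code:java}
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: append a counter to a record name that is already
// taken, so sibling and nested tuples named "col2" get distinct Avro names.
public final class UniqueRecordNames {
    public static String uniqueRecordName(final String recordName,
            final Map<String, ?> definedRecordNames) {
        String candidate = recordName;
        int i = 1;
        while (definedRecordNames.containsKey(candidate)) {
            candidate = recordName + "_" + i++;
        }
        return candidate;
    }

    public static void main(String[] args) {
        Map<String, Object> defined = new HashMap<String, Object>();
        defined.put("col2", new Object());
        // Prints "col2_1": the second definition gets a unique name.
        System.out.println(uniqueRecordName("col2", defined));
    }
}
{code}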


[jira] [Updated] (PIG-5115) Builtin AvroStorage generates incorrect avro schema when the same pig field name appears in the alias

2017-01-25 Thread Anyi Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anyi Li updated PIG-5115:
-
Summary: Builtin AvroStorage generates incorrect avro schema when the same 
pig field name appears in the alias  (was: Builtin AvroStorage generates the 
incorrect avro schema when same pig field names appears in the alias)


[jira] [Created] (PIG-5115) Builtin AvroStorage generates the incorrect avro schema when same pig field names appears in the alias

2017-01-25 Thread Anyi Li (JIRA)
Anyi Li created PIG-5115:


 Summary: Builtin AvroStorage generates the incorrect avro schema 
when same pig field names appears in the alias
 Key: PIG-5115
 URL: https://issues.apache.org/jira/browse/PIG-5115
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.17.0
Reporter: Anyi Li



[jira] [Updated] (PIG-5112) Cleanup pig-template.xml

2017-01-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-5112:

Fix Version/s: 0.16.1

> Cleanup pig-template.xml
> 
>
> Key: PIG-5112
> URL: https://issues.apache.org/jira/browse/PIG-5112
> Project: Pig
>  Issue Type: Bug
>  Components: build
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.17.0, 0.16.1
>
> Attachments: PIG-5112-1.patch
>
>
> Several entries in pig-template.xml are outdated. Attaching a patch to remove 
> or update those entries. Later we shall use ivy:makepom to generate pig.pom and 
> the lib dir; I will open a separate ticket for that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-5112) Cleanup pig-template.xml

2017-01-25 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15838166#comment-15838166
 ] 

Thejas M Nair commented on PIG-5112:


+1

> Cleanup pig-template.xml
> 
>
> Key: PIG-5112
> URL: https://issues.apache.org/jira/browse/PIG-5112
> Project: Pig
>  Issue Type: Bug
>  Components: build
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.17.0
>
> Attachments: PIG-5112-1.patch
>
>
> Several entries in pig-template.xml are outdated. Attaching a patch to remove 
> or update those entries. Later we shall use ivy:makepom to generate pig.pom and 
> the lib dir; I will open a separate ticket for that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-5113) Not a valid JAR

2017-01-25 Thread Fabrizio Massara (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15837662#comment-15837662
 ] 

Fabrizio Massara commented on PIG-5113:
---

I increased it using:
export _JAVA_OPTIONS=-Xmx8192m

I also tried PIG_OPTS, but nothing changed.
After verifying the jars with:
jar -tvf pig-0.16.0-core-h2.jar

no errors were raised.

> Not a valid JAR
> ---
>
> Key: PIG-5113
> URL: https://issues.apache.org/jira/browse/PIG-5113
> Project: Pig
>  Issue Type: Bug
>  Components: grunt
>Affects Versions: 0.16.0
> Environment: Ubuntu Server 16.04
>Reporter: Fabrizio Massara
>
> Hello, I installed Pig on Ubuntu Server 16.04 and I need to use it in local 
> mode.
> Yesterday I tried to run some jobs, but unfortunately they were killed because 
> the Java heap space wasn't enough. I updated it, but now, when I try to run 
> Pig, this error appears:
> Not a valid JAR: /usr/local/pig/pig-0.16.0-core-h2.jar 
> /usr/local/pig/pig-0.16.0-SNAPSHOT-core-h2.jar
> How could I solve this? I cannot find a solution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-5113) Not a valid JAR

2017-01-25 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15837591#comment-15837591
 ] 

Nandor Kollar commented on PIG-5113:


How did you increase the heap space? By exporting PIG_OPTS? Did you manually 
verify that the jar is actually a valid jar?

> Not a valid JAR
> ---
>
> Key: PIG-5113
> URL: https://issues.apache.org/jira/browse/PIG-5113
> Project: Pig
>  Issue Type: Bug
>  Components: grunt
>Affects Versions: 0.16.0
> Environment: Ubuntu Server 16.04
>Reporter: Fabrizio Massara
>
> Hello, I installed Pig on Ubuntu Server 16.04 and I need to use it in local 
> mode.
> Yesterday I tried to run some jobs, but unfortunately they were killed because 
> the Java heap space wasn't enough. I updated it, but now, when I try to run 
> Pig, this error appears:
> Not a valid JAR: /usr/local/pig/pig-0.16.0-core-h2.jar 
> /usr/local/pig/pig-0.16.0-SNAPSHOT-core-h2.jar
> How could I solve this? I cannot find a solution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PIG-5114) Getting error 1006-unable to iterate alias for r

2017-01-25 Thread Sandip Samaddar (JIRA)
Sandip Samaddar created PIG-5114:


 Summary: Getting error 1006-unable to iterate alias for r
 Key: PIG-5114
 URL: https://issues.apache.org/jira/browse/PIG-5114
 Project: Pig
  Issue Type: Bug
 Environment: OS  - Ubuntu 16.04
2 virtual machines running Ubuntu 16.04, with 
Hadoop 2.5.1 installed as master and slave.
HBase 1.1.4 installed in distributed mode.
Pig 0.15 is installed in the master virtual machine.
Reporter: Sandip Samaddar


I am using 2 virtual machines, where one is the Hadoop master and the other is a 
Hadoop slave. I have installed HBase 1.1.4 in distributed mode, and then Pig 0.15 
was installed on the master. Now, when I open Pig in mapreduce mode, load a txt 
file from HDFS, and then dump it, I get the error "unable to iterate alias".
But in local mode the dump works fine.
I should also mention that I built with
ant clean tar -Dhadoopversion=23 -Dhbase95.version=1.1.2 
-Dforrest.home=/home/hduser/forrest/apache-forrest-0.9 

The build was successful, but I am still getting the error. Kindly help.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PIG-5113) Not a valid JAR

2017-01-25 Thread Fabrizio Massara (JIRA)
Fabrizio Massara created PIG-5113:
-

 Summary: Not a valid JAR
 Key: PIG-5113
 URL: https://issues.apache.org/jira/browse/PIG-5113
 Project: Pig
  Issue Type: Bug
  Components: grunt
Affects Versions: 0.16.0
 Environment: Ubuntu Server 16.04
Reporter: Fabrizio Massara


Hello, I installed Pig on Ubuntu Server 16.04 and I need to use it in local mode.
Yesterday I tried to run some jobs, but unfortunately they were killed because the 
Java heap space wasn't enough. I updated it, but now, when I try to run Pig, 
this error appears:
Not a valid JAR: /usr/local/pig/pig-0.16.0-core-h2.jar 
/usr/local/pig/pig-0.16.0-SNAPSHOT-core-h2.jar

How could I solve this? I cannot find a solution.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-5104) Union_15 e2e test failing on Spark

2017-01-25 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15837564#comment-15837564
 ] 

Nandor Kollar commented on PIG-5104:


[~kellyzly] attached a unit test to show the issue. It is not included in the 
diff, because we already have similar e2e tests for this scenario, but 
executing the unit test might be easier in local mode.
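
For context, the error comes from GFCross looking up a per-operator parallelism hint in the job configuration; a simplified sketch of the check that fails (the exact property key below is an assumption for illustration):
{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;

// Simplified sketch of the check in GFCross.exec that throws the error in
// the stack trace below; the property key is an assumption for illustration.
public class ParallelismHintCheck {
    static int readParallelismHint(Configuration jobConf, String crossKey)
            throws IOException {
        // The launcher is expected to store a hint under a per-operator key;
        // the failure suggests that in Spark mode the value never reaches
        // the job conf seen by the executors.
        String hint = jobConf.get("pig.cross.parallelism." + crossKey);
        if (hint == null) {
            throw new IOException("Unable to get parallelism hint from job conf");
        }
        return Integer.parseInt(hint);
    }
}
{code}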

> Union_15 e2e test failing on Spark
> --
>
> Key: PIG-5104
> URL: https://issues.apache.org/jira/browse/PIG-5104
> Project: Pig
>  Issue Type: Bug
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
> Attachments: PIG-5104.patch, TestUnion_15.java
>
>
> While working on PIG-4891 I noticed that the Union_15 e2e test is failing in 
> Spark mode with this exception:
> Caused by: java.lang.RuntimeException: 
> org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught 
> error from UDF: org.apache.pig.impl.builtin.GFCross [Unable to get 
> parallelism hint from job conf]
>   at 
> org.apache.pig.backend.hadoop.executionengine.spark.converter.OutputConsumerIterator.readNext(OutputConsumerIterator.java:89)
>   at 
> org.apache.pig.backend.hadoop.executionengine.spark.converter.OutputConsumerIterator.hasNext(OutputConsumerIterator.java:96)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2078: 
> Caught error from UDF: org.apache.pig.impl.builtin.GFCross [Unable to get 
> parallelism hint from job conf]
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:358)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextDataBag(POUserFunc.java:374)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:335)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:404)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:321)
>   at 
> org.apache.pig.backend.hadoop.executionengine.spark.converter.ForEachConverter$ForEachFunction$1$1.getNextResult(ForEachConverter.java:87)
>   at 
> org.apache.pig.backend.hadoop.executionengine.spark.converter.OutputConsumerIterator.readNext(OutputConsumerIterator.java:69)
>   ... 11 more
> Caused by: java.io.IOException: Unable to get parallelism hint from job conf
>   at org.apache.pig.impl.builtin.GFCross.exec(GFCross.java:66)
>   at org.apache.pig.impl.builtin.GFCross.exec(GFCross.java:37)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:330)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-5104) Union_15 e2e test failing on Spark

2017-01-25 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15837562#comment-15837562
 ] 

Nandor Kollar commented on PIG-5104:


Thanks Rohini!

> Union_15 e2e test failing on Spark
> --
>
> Key: PIG-5104
> URL: https://issues.apache.org/jira/browse/PIG-5104
> Project: Pig
>  Issue Type: Bug
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
> Attachments: PIG-5104.patch, TestUnion_15.java
>
>
> While working on PIG-4891 I noticed that the Union_15 e2e test is failing in 
> Spark mode with this exception:
> Caused by: java.lang.RuntimeException: 
> org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught 
> error from UDF: org.apache.pig.impl.builtin.GFCross [Unable to get 
> parallelism hint from job conf]
>   at 
> org.apache.pig.backend.hadoop.executionengine.spark.converter.OutputConsumerIterator.readNext(OutputConsumerIterator.java:89)
>   at 
> org.apache.pig.backend.hadoop.executionengine.spark.converter.OutputConsumerIterator.hasNext(OutputConsumerIterator.java:96)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2078: 
> Caught error from UDF: org.apache.pig.impl.builtin.GFCross [Unable to get 
> parallelism hint from job conf]
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:358)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextDataBag(POUserFunc.java:374)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:335)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:404)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:321)
>   at 
> org.apache.pig.backend.hadoop.executionengine.spark.converter.ForEachConverter$ForEachFunction$1$1.getNextResult(ForEachConverter.java:87)
>   at 
> org.apache.pig.backend.hadoop.executionengine.spark.converter.OutputConsumerIterator.readNext(OutputConsumerIterator.java:69)
>   ... 11 more
> Caused by: java.io.IOException: Unable to get parallelism hint from job conf
>   at org.apache.pig.impl.builtin.GFCross.exec(GFCross.java:66)
>   at org.apache.pig.impl.builtin.GFCross.exec(GFCross.java:37)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:330)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-5104) Union_15 e2e test failing on Spark

2017-01-25 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-5104:
---
Attachment: TestUnion_15.java

> Union_15 e2e test failing on Spark
> --
>
> Key: PIG-5104
> URL: https://issues.apache.org/jira/browse/PIG-5104
> Project: Pig
>  Issue Type: Bug
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
> Attachments: PIG-5104.patch, TestUnion_15.java
>
>
> While working on PIG-4891 I noticed that the Union_15 e2e test is failing in 
> Spark mode with this exception:
> Caused by: java.lang.RuntimeException: 
> org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught 
> error from UDF: org.apache.pig.impl.builtin.GFCross [Unable to get 
> parallelism hint from job conf]
>   at 
> org.apache.pig.backend.hadoop.executionengine.spark.converter.OutputConsumerIterator.readNext(OutputConsumerIterator.java:89)
>   at 
> org.apache.pig.backend.hadoop.executionengine.spark.converter.OutputConsumerIterator.hasNext(OutputConsumerIterator.java:96)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2078: 
> Caught error from UDF: org.apache.pig.impl.builtin.GFCross [Unable to get 
> parallelism hint from job conf]
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:358)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextDataBag(POUserFunc.java:374)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:335)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:404)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:321)
>   at 
> org.apache.pig.backend.hadoop.executionengine.spark.converter.ForEachConverter$ForEachFunction$1$1.getNextResult(ForEachConverter.java:87)
>   at 
> org.apache.pig.backend.hadoop.executionengine.spark.converter.OutputConsumerIterator.readNext(OutputConsumerIterator.java:69)
>   ... 11 more
> Caused by: java.io.IOException: Unable to get parallelism hint from job conf
>   at org.apache.pig.impl.builtin.GFCross.exec(GFCross.java:66)
>   at org.apache.pig.impl.builtin.GFCross.exec(GFCross.java:37)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:330)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)