[jira] [Comment Edited] (PIG-4963) Add a Bloom join
[ https://issues.apache.org/jira/browse/PIG-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15839299#comment-15839299 ]

Daniel Dai edited comment on PIG-4963 at 1/26/17 7:02 AM:
----------------------------------------------------------

+1 for PIG-4963-5.patch.

was (Author: daijy):
+1 for the new patch (on RB).

> Add a Bloom join
> ----------------
>
>                 Key: PIG-4963
>                 URL: https://issues.apache.org/jira/browse/PIG-4963
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.17.0
>
>         Attachments: PIG-4963-1.patch, PIG-4963-2.patch, PIG-4963-3.patch, PIG-4963-4.patch, PIG-4963-5.patch
>
> In PIG-4925, we added an option to pass a BloomFilter as a scalar to the bloom function. But we found that actually using it for big data, which required a huge vector size, was very inefficient and led to OOM.
> I had initially calculated that it would take around a 12MB bytearray for a 100 million vector size ((100,000,000 + 7) / 8 = 12,500,000 bytes), and that would be the scalar value broadcast, so it would not take much space. But the problem is that 12MB was written out for every input record by BuildBloom$Initial before the aggregation happens and we arrive at the final BloomFilter vector. And with POPartialAgg it runs into OOM issues.
> If we added a bloom join implementation, which can be combined with hash or skewed join, it would boost performance for a lot of jobs. The Bloom filter of the smaller tables can be sent to the bigger tables as a scalar and the data filtered before the hash or skewed join is used.
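For reference, the vector-size arithmetic in the comment above is a plain bits-to-bytes conversion: a Bloom filter with an n-bit vector occupies (n + 7) / 8 bytes. A quick standalone check of the 100-million-bit figure (illustrative Java, not Pig code):
{code}
public class BloomSizeCheck {
    public static void main(String[] args) {
        long vectorSizeBits = 100_000_000L;    // 100 million bits
        long bytes = (vectorSizeBits + 7) / 8; // round up to whole bytes
        System.out.println(bytes);             // 12500000, i.e. ~12 MB
    }
}
{code}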
[jira] Subscription: PIG patch available
Issue Subscription
Filter: PIG patch available (27 issues)
Subscriber: pigdaily

Key         Summary
PIG-4926    Modify the content of start.xml for spark mode
            https://issues-test.apache.org/jira/browse/PIG-4926
PIG-4922    Deadlock between SpillableMemoryManager and InternalSortedBag$SortedDataBagIterator
            https://issues-test.apache.org/jira/browse/PIG-4922
PIG-4918    Pig on Tez cannot switch pig.temp.dir to another fs
            https://issues-test.apache.org/jira/browse/PIG-4918
PIG-4897    Scope of param substitution for run/exec commands
            https://issues-test.apache.org/jira/browse/PIG-4897
PIG-4886    Add PigSplit#getLocationInfo to fix the NPE found in log in spark mode
            https://issues-test.apache.org/jira/browse/PIG-4886
PIG-4854    Merge spark branch to trunk
            https://issues-test.apache.org/jira/browse/PIG-4854
PIG-4849    pig on tez will cause tez-ui to crash,because the content from timeline server is too long.
            https://issues-test.apache.org/jira/browse/PIG-4849
PIG-4788    the value BytesRead metric info always returns 0 even the length of input file is not 0 in spark engine
            https://issues-test.apache.org/jira/browse/PIG-4788
PIG-4745    DataBag should protect content of passed list of tuples
            https://issues-test.apache.org/jira/browse/PIG-4745
PIG-4684    Exception should be changed to warning when job diagnostics cannot be fetched
            https://issues-test.apache.org/jira/browse/PIG-4684
PIG-4656    Improve String serialization and comparator performance in BinInterSedes
            https://issues-test.apache.org/jira/browse/PIG-4656
PIG-4598    Allow user defined plan optimizer rules
            https://issues-test.apache.org/jira/browse/PIG-4598
PIG-4551    Partition filter is not pushed down in case of SPLIT
            https://issues-test.apache.org/jira/browse/PIG-4551
PIG-4539    New PigUnit
            https://issues-test.apache.org/jira/browse/PIG-4539
PIG-4515    org.apache.pig.builtin.Distinct throws ClassCastException
            https://issues-test.apache.org/jira/browse/PIG-4515
PIG-4323    PackageConverter hanging in Spark
            https://issues-test.apache.org/jira/browse/PIG-4323
PIG-4313    StackOverflowError in LIMIT operation on Spark
            https://issues-test.apache.org/jira/browse/PIG-4313
PIG-4251    Pig on Storm
            https://issues-test.apache.org/jira/browse/PIG-4251
PIG-4002    Disable combiner when map-side aggregation is used
            https://issues-test.apache.org/jira/browse/PIG-4002
PIG-3952    PigStorage accepts '-tagSplit' to return full split information
            https://issues-test.apache.org/jira/browse/PIG-3952
PIG-3911    Define unique fields with @OutputSchema
            https://issues-test.apache.org/jira/browse/PIG-3911
PIG-3877    Getting Geo Latitude/Longitude from Address Lines
            https://issues-test.apache.org/jira/browse/PIG-3877
PIG-3873    Geo distance calculation using Haversine
            https://issues-test.apache.org/jira/browse/PIG-3873
PIG-3864    ToDate(userstring, format, timezone) computes DateTime with strange handling of Daylight Saving Time with location based timezones
            https://issues-test.apache.org/jira/browse/PIG-3864
PIG-3851    Upgrade jline to 2.11
            https://issues-test.apache.org/jira/browse/PIG-3851
PIG-3668    COR built-in function when atleast one of the coefficient values is NaN
            https://issues-test.apache.org/jira/browse/PIG-3668
PIG-3587    add functionality for rolling over dates
            https://issues-test.apache.org/jira/browse/PIG-3587

You may edit this subscription at:
https://issues-test.apache.org/jira/secure/FilterSubscription!default.jspa?subId=16328&filterId=12322384
[jira] Subscription: PIG patch available
Issue Subscription
Filter: PIG patch available (32 issues)
Subscriber: pigdaily

Key         Summary
PIG-5115    Builtin AvroStorage generates incorrect avro schema when the same pig field name appears in the alias
            https://issues.apache.org/jira/browse/PIG-5115
PIG-5106    Optimize when mapreduce.input.fileinputformat.input.dir.recursive set to true
            https://issues.apache.org/jira/browse/PIG-5106
PIG-5081    Can not run pig on spark source code distribution
            https://issues.apache.org/jira/browse/PIG-5081
PIG-5080    Support store alias as spark table
            https://issues.apache.org/jira/browse/PIG-5080
PIG-5057    IndexOutOfBoundsException when pig reducer processOnePackageOutput
            https://issues.apache.org/jira/browse/PIG-5057
PIG-5029    Optimize sort case when data is skewed
            https://issues.apache.org/jira/browse/PIG-5029
PIG-4926    Modify the content of start.xml for spark mode
            https://issues.apache.org/jira/browse/PIG-4926
PIG-4854    Merge spark branch to trunk
            https://issues.apache.org/jira/browse/PIG-4854
PIG-4849    pig on tez will cause tez-ui to crash,because the content from timeline server is too long.
            https://issues.apache.org/jira/browse/PIG-4849
PIG-4788    the value BytesRead metric info always returns 0 even the length of input file is not 0 in spark engine
            https://issues.apache.org/jira/browse/PIG-4788
PIG-4750    REPLACE_MULTI should compile Pattern once and reuse it
            https://issues.apache.org/jira/browse/PIG-4750
PIG-4748    DateTimeWritable forgets Chronology
            https://issues.apache.org/jira/browse/PIG-4748
PIG-4745    DataBag should protect content of passed list of tuples
            https://issues.apache.org/jira/browse/PIG-4745
PIG-4684    Exception should be changed to warning when job diagnostics cannot be fetched
            https://issues.apache.org/jira/browse/PIG-4684
PIG-4656    Improve String serialization and comparator performance in BinInterSedes
            https://issues.apache.org/jira/browse/PIG-4656
PIG-4598    Allow user defined plan optimizer rules
            https://issues.apache.org/jira/browse/PIG-4598
PIG-4551    Partition filter is not pushed down in case of SPLIT
            https://issues.apache.org/jira/browse/PIG-4551
PIG-4539    New PigUnit
            https://issues.apache.org/jira/browse/PIG-4539
PIG-4515    org.apache.pig.builtin.Distinct throws ClassCastException
            https://issues.apache.org/jira/browse/PIG-4515
PIG-4323    PackageConverter hanging in Spark
            https://issues.apache.org/jira/browse/PIG-4323
PIG-4313    StackOverflowError in LIMIT operation on Spark
            https://issues.apache.org/jira/browse/PIG-4313
PIG-4251    Pig on Storm
            https://issues.apache.org/jira/browse/PIG-4251
PIG-4002    Disable combiner when map-side aggregation is used
            https://issues.apache.org/jira/browse/PIG-4002
PIG-3952    PigStorage accepts '-tagSplit' to return full split information
            https://issues.apache.org/jira/browse/PIG-3952
PIG-3911    Define unique fields with @OutputSchema
            https://issues.apache.org/jira/browse/PIG-3911
PIG-3877    Getting Geo Latitude/Longitude from Address Lines
            https://issues.apache.org/jira/browse/PIG-3877
PIG-3873    Geo distance calculation using Haversine
            https://issues.apache.org/jira/browse/PIG-3873
PIG-3864    ToDate(userstring, format, timezone) computes DateTime with strange handling of Daylight Saving Time with location based timezones
            https://issues.apache.org/jira/browse/PIG-3864
PIG-3851    Upgrade jline to 2.11
            https://issues.apache.org/jira/browse/PIG-3851
PIG-3668    COR built-in function when atleast one of the coefficient values is NaN
            https://issues.apache.org/jira/browse/PIG-3668
PIG-3587    add functionality for rolling over dates
            https://issues.apache.org/jira/browse/PIG-3587
PIG-1804    Alow Jython function to implement Algebraic and/or Accumulator interfaces
            https://issues.apache.org/jira/browse/PIG-1804

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=16328&filterId=12322384
[jira] [Commented] (PIG-4891) Implement FR join by broadcasting small rdd not making more copys of data
[ https://issues.apache.org/jira/browse/PIG-4891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15839327#comment-15839327 ]

liyunzhang_intel commented on PIG-4891:
---------------------------------------

Here is my understanding of this jira; let's use an example to explain it.
{code}
A = load './SkewedJoinInput1.txt' as (id,name,n);
B = load './SkewedJoinInput2.txt' as (id,name);
D = join A by (id,name), B by (id,name) using 'replicated';
explain D;
{code}
Before the patch, the spark plan is:
{code}
#--------------------------------------------------
# Spark Plan
#--------------------------------------------------
Spark node scope-26
Store(hdfs://zly1.sh.intel.com:8020/tmp/temp1749487848/tmp1731009936:org.apache.pig.impl.io.InterStorage) - scope-27
|
|---B: New For Each(false,false)[bag] - scope-13
    |   |
    |   Project[bytearray][0] - scope-9
    |   |
    |   Project[bytearray][1] - scope-11
    |
    |---B: Load(hdfs://zly1.sh.intel.com:8020/user/root/SkewedJoinInput2.txt:org.apache.pig.builtin.PigStorage) - scope-8

Spark node scope-25
D: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-24
|
|---D: FRJoin[tuple] - scope-18
    |   |
    |   Project[bytearray][0] - scope-14
    |   |
    |   Project[bytearray][1] - scope-15
    |   |
    |   Project[bytearray][0] - scope-16
    |   |
    |   Project[bytearray][1] - scope-17
    |
    |---A: New For Each(false,false,false)[bag] - scope-7
        |   |
        |   Project[bytearray][0] - scope-1
        |   |
        |   Project[bytearray][1] - scope-3
        |   |
        |   Project[bytearray][2] - scope-5
        |
        |---A: Load(hdfs://zly1.sh.intel.com:8020/user/root/SkewedJoinInput1.txt:org.apache.pig.builtin.PigStorage) - scope-0
{code}
After the patch:
{code}
Spark node scope-28
D: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-24
|
|---D: FRJoinSpark[tuple] - scope-18
    |   |
    |   Project[bytearray][0] - scope-14
    |   |
    |   Project[bytearray][1] - scope-15
    |   |
    |   Project[bytearray][0] - scope-16
    |   |
    |   Project[bytearray][1] - scope-17
    |
    |---A: New For Each(false,false,false)[bag] - scope-7
    |   |   |
    |   |   Project[bytearray][0] - scope-1
    |   |   |
    |   |   Project[bytearray][1] - scope-3
    |   |   |
    |   |   Project[bytearray][2] - scope-5
    |   |
    |   |---A: Load(hdfs://zly1.sh.intel.com:8020/user/root/SkewedJoinInput1.txt:org.apache.pig.builtin.PigStorage) - scope-0
    |
    |---BroadcastSpark - scope-27
        |
        |---B: New For Each(false,false)[bag] - scope-13
            |   |
            |   Project[bytearray][0] - scope-9
            |   |
            |   Project[bytearray][1] - scope-11
            |
            |---B: Load(hdfs://zly1.sh.intel.com:8020/user/root/SkewedJoinInput2.txt:org.apache.pig.builtin.PigStorage) - scope
{code}
In the patch:
1. We no longer ship the small table to the distributed cache and start a separate job to load the data back from the distributed cache.
2. We load the small table as an rdd and broadcast it via SparkContext.broadcast().

> Implement FR join by broadcasting small rdd not making more copys of data
> --------------------------------------------------------------------------
>
>                 Key: PIG-4891
>                 URL: https://issues.apache.org/jira/browse/PIG-4891
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: liyunzhang_intel
>            Assignee: Nandor Kollar
>             Fix For: spark-branch
>
> In the current implementation of FRJoin (PIG-4771), we just set the replication factor of the data to 10 to make data access more efficient, because the current FRJoin algorithms can be reused that way. We need to figure out how to use broadcasting of the small rdd to implement FRJoin in the current code base, if we find that performance can be improved a lot by broadcasting the rdd.
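The broadcast approach in point 2 can be sketched with Spark's Java API. This is only an illustration of the idea, not Pig's actual FRJoinSpark code: it assumes Spark 2.x, tab-separated input files, unique keys on the small side, and inner-join semantics; all file names and class names here are made up.
{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class BroadcastJoinSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("broadcast-join-sketch"));

        // Small relation B(id, name): collect it to the driver and key it by (id, name).
        Map<String, String> small = new HashMap<>();
        for (String line : sc.textFile("SkewedJoinInput2.txt").collect()) {
            String[] f = line.split("\t");
            small.put(f[0] + "\u0001" + f[1], line);
        }
        // One broadcast copy per executor instead of replicated files in the distributed cache.
        Broadcast<Map<String, String>> smallBc = sc.broadcast(small);

        // Large relation A(id, name, n): stream it and probe the broadcast hash table.
        JavaRDD<String> joined = sc.textFile("SkewedJoinInput1.txt").flatMap(line -> {
            String[] f = line.split("\t");
            String match = smallBc.value().get(f[0] + "\u0001" + f[1]);
            List<String> out = new ArrayList<>();
            if (match != null) {
                out.add(line + "\t" + match);
            }
            return out.iterator();
        });
        joined.saveAsTextFile("joined-out");
        sc.stop();
    }
}
{code}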
[jira] [Commented] (PIG-4891) Implement FR join by broadcasting small rdd not making more copys of data
[ https://issues.apache.org/jira/browse/PIG-4891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15839313#comment-15839313 ]

liyunzhang_intel commented on PIG-4891:
---------------------------------------

[~nkollar]: LGTM except for some minor issues; I left some comments on RB.

> Implement FR join by broadcasting small rdd not making more copys of data
> --------------------------------------------------------------------------
>
>                 Key: PIG-4891
>                 URL: https://issues.apache.org/jira/browse/PIG-4891
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: liyunzhang_intel
>            Assignee: Nandor Kollar
>             Fix For: spark-branch
>
> In the current implementation of FRJoin (PIG-4771), we just set the replication factor of the data to 10 to make data access more efficient, because the current FRJoin algorithms can be reused that way. We need to figure out how to use broadcasting of the small rdd to implement FRJoin in the current code base, if we find that performance can be improved a lot by broadcasting the rdd.
[jira] [Commented] (PIG-4963) Add a Bloom join
[ https://issues.apache.org/jira/browse/PIG-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15839299#comment-15839299 ]

Daniel Dai commented on PIG-4963:
---------------------------------

+1 for the new patch (on RB).

> Add a Bloom join
> ----------------
>
>                 Key: PIG-4963
>                 URL: https://issues.apache.org/jira/browse/PIG-4963
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.17.0
>
>         Attachments: PIG-4963-1.patch, PIG-4963-2.patch, PIG-4963-3.patch, PIG-4963-4.patch, PIG-4963-5.patch
>
> In PIG-4925, we added an option to pass a BloomFilter as a scalar to the bloom function. But we found that actually using it for big data, which required a huge vector size, was very inefficient and led to OOM.
> I had initially calculated that it would take around a 12MB bytearray for a 100 million vector size ((100,000,000 + 7) / 8 = 12,500,000 bytes), and that would be the scalar value broadcast, so it would not take much space. But the problem is that 12MB was written out for every input record by BuildBloom$Initial before the aggregation happens and we arrive at the final BloomFilter vector. And with POPartialAgg it runs into OOM issues.
> If we added a bloom join implementation, which can be combined with hash or skewed join, it would boost performance for a lot of jobs. The Bloom filter of the smaller tables can be sent to the bigger tables as a scalar and the data filtered before the hash or skewed join is used.
[jira] [Updated] (PIG-4963) Add a Bloom join
[ https://issues.apache.org/jira/browse/PIG-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohini Palaniswamy updated PIG-4963:
------------------------------------
    Attachment: PIG-4963-5.patch

> Add a Bloom join
> ----------------
>
>                 Key: PIG-4963
>                 URL: https://issues.apache.org/jira/browse/PIG-4963
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.17.0
>
>         Attachments: PIG-4963-1.patch, PIG-4963-2.patch, PIG-4963-3.patch, PIG-4963-4.patch, PIG-4963-5.patch
>
> In PIG-4925, we added an option to pass a BloomFilter as a scalar to the bloom function. But we found that actually using it for big data, which required a huge vector size, was very inefficient and led to OOM.
> I had initially calculated that it would take around a 12MB bytearray for a 100 million vector size ((100,000,000 + 7) / 8 = 12,500,000 bytes), and that would be the scalar value broadcast, so it would not take much space. But the problem is that 12MB was written out for every input record by BuildBloom$Initial before the aggregation happens and we arrive at the final BloomFilter vector. And with POPartialAgg it runs into OOM issues.
> If we added a bloom join implementation, which can be combined with hash or skewed join, it would boost performance for a lot of jobs. The Bloom filter of the smaller tables can be sent to the bigger tables as a scalar and the data filtered before the hash or skewed join is used.
[jira] [Updated] (PIG-5111) e2e Utf8Test fails in local mode
[ https://issues.apache.org/jira/browse/PIG-5111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohini Palaniswamy updated PIG-5111:
------------------------------------
      Resolution: Fixed
    Hadoop Flags: Reviewed
          Status: Resolved  (was: Patch Available)

Committed to branch-0.16 and trunk. Thanks for the review, Daniel.

> e2e Utf8Test fails in local mode
> --------------------------------
>
>                 Key: PIG-5111
>                 URL: https://issues.apache.org/jira/browse/PIG-5111
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.17.0, 0.16.1
>
>         Attachments: PIG-5111-1.patch
>
> The required test data is not set up during deploy in local mode (test-e2e-deploy-local).
[jira] [Commented] (PIG-4963) Add a Bloom join
[ https://issues.apache.org/jira/browse/PIG-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15839298#comment-15839298 ]

Rohini Palaniswamy commented on PIG-4963:
-----------------------------------------

bq. But I feel it is clearer if the plan shows a filter + a regular local rearrange. The execution plan of the latter is more understandable.

Actually, in this case the bloom filter cannot be applied before the local rearrange. The local rearrange is the operator that separates the record into key and value for the join, and the Bloom filter is then applied on the key. So it has to either be part of the local rearrange operator, as currently implemented, or be a separate operator after the local rearrange, which would be a lot more confusing.

> Add a Bloom join
> ----------------
>
>                 Key: PIG-4963
>                 URL: https://issues.apache.org/jira/browse/PIG-4963
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.17.0
>
>         Attachments: PIG-4963-1.patch, PIG-4963-2.patch, PIG-4963-3.patch, PIG-4963-4.patch
>
> In PIG-4925, we added an option to pass a BloomFilter as a scalar to the bloom function. But we found that actually using it for big data, which required a huge vector size, was very inefficient and led to OOM.
> I had initially calculated that it would take around a 12MB bytearray for a 100 million vector size ((100,000,000 + 7) / 8 = 12,500,000 bytes), and that would be the scalar value broadcast, so it would not take much space. But the problem is that 12MB was written out for every input record by BuildBloom$Initial before the aggregation happens and we arrive at the final BloomFilter vector. And with POPartialAgg it runs into OOM issues.
> If we added a bloom join implementation, which can be combined with hash or skewed join, it would boost performance for a lot of jobs. The Bloom filter of the smaller tables can be sent to the bigger tables as a scalar and the data filtered before the hash or skewed join is used.
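To make the "filter on the rearranged key" point concrete: once the local rearrange has extracted the join key, membership in a bloom filter built from the smaller relation decides whether the record is worth shuffling at all. A minimal sketch using Hadoop's bundled bloom filter classes - not Pig's POBloomFilterRearrangeTez; the key encoding and sizing constants are assumptions:
{code}
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class BloomKeyFilterSketch {
    // Assumption: the join key is serialized as tab-joined text;
    // Pig uses its own tuple serialization here.
    private static byte[] serializeKey(String... parts) {
        return String.join("\t", parts).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Build side: add the join keys of the smaller relation.
        BloomFilter filter = new BloomFilter(1 << 20, 3, Hash.MURMUR_HASH);
        filter.add(new Key(serializeKey("1", "a")));
        filter.add(new Key(serializeKey("2", "b")));

        // Probe side: after local rearrange has produced the (id, name) key,
        // drop records whose key cannot possibly match the smaller relation.
        String[][] bigSide = { {"1", "a", "x"}, {"9", "z", "y"} };
        for (String[] row : bigSide) {
            if (filter.membershipTest(new Key(serializeKey(row[0], row[1])))) {
                System.out.println(String.join("\t", row) + " passes the bloom filter");
            }
        }
    }
}
{code}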
Re: Review Request 55681: [PIG-4963] Add a Bloom join
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55681/#review163100
-----------------------------------------------------------


Ship it!

Ship It!

- Daniel Dai


On Jan. 26, 2017, 5:55 a.m., Rohini Palaniswamy wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/55681/
> -----------------------------------------------------------
> 
> (Updated Jan. 26, 2017, 5:55 a.m.)
> 
> 
> Review request for pig, Daniel Dai and Adam Szita.
> 
> 
> Bugs: PIG-4963
>     https://issues.apache.org/jira/browse/PIG-4963
> 
> 
> Repository: pig
> 
> 
> Description
> -------
> 
> This patch adds a new type of join called bloom. It supports creating multiple bloom filters partitioned by the hashcode of the key for parallelism. Two new operators and one Packager implementation are added.
> POBuildBloomRearrangeTez - Builds the bloom filter for one of the relations of the join on the map side, or writes out the join keys, depending on the strategy.
> BloomPackager - Used in the reducer to create or combine bloom filters and produce the final bloom filters.
> POBloomFilterRearrangeTez - Applies the bloom filters to the other relations in the join and filters out data.
> 
> More details in the documentation.
> 
> 
> Diffs
> -----
> 
>   http://svn.apache.org/repos/asf/pig/trunk/src/docs/src/documentation/content/xdocs/basic.xml 1779665
>   http://svn.apache.org/repos/asf/pig/trunk/src/docs/src/documentation/content/xdocs/perf.xml 1779665
>   http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/PigConfiguration.java 1779665
>   http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java 1779665
>   http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigCombiner.java 1779665
>   http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/EndOfAllInputSetter.java 1779665
>   http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/PhysicalOperator.java 1779665
>   http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/Packager.java 1779665
>   http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java 1779665
>   http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/TezCompiler.java 1779665
>   http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/TezEdgeDescriptor.java 1779665
>   http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/TezOperator.java 1779665
>   http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/TezPOPackageAnnotator.java 1779665
>   http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/operator/BloomPackager.java PRE-CREATION
>   http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/operator/POBloomFilterRearrangeTez.java PRE-CREATION
>   http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/operator/POBuildBloomRearrangeTez.java PRE-CREATION
>   http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/operator/POLocalRearrangeTez.java 1779665
>   http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/operator/POShuffleTezLoad.java 1779665
>   http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/optimizer/CombinerOptimizer.java 1779665
>   http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/optimizer/ParallelismSetter.java 1779665
>   http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/optimizer/SecondaryKeyOptimizerTez.java 1779665
>   http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/optimizer/TezEstimatedParallelismClearer.java 1779665
>   http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/optimizer/TezOperDependencyParallelismEstimator.java 1779665
>   http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/relational/LOJoin.java 1779665
>   http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/p
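The "multiple bloom filters partitioned by the hashcode of the key" idea from the description above can be sketched as follows. This is a hypothetical illustration, not the patch's code; the modulo routing scheme and the filter parameters are assumptions:
{code}
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class PartitionedBloomSketch {
    private final BloomFilter[] filters;

    public PartitionedBloomSketch(int numFilters, int vectorSizeBits, int numHash) {
        filters = new BloomFilter[numFilters];
        for (int i = 0; i < numFilters; i++) {
            filters[i] = new BloomFilter(vectorSizeBits, numHash, Hash.MURMUR_HASH);
        }
    }

    // Route each key to one filter by its hashcode, so the filters can be
    // built by parallel reducers and later probed independently.
    private int partition(String key) {
        return (key.hashCode() & Integer.MAX_VALUE) % filters.length;
    }

    public void add(String key) {
        filters[partition(key)].add(new Key(key.getBytes(StandardCharsets.UTF_8)));
    }

    public boolean mightContain(String key) {
        return filters[partition(key)].membershipTest(new Key(key.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) {
        PartitionedBloomSketch sketch = new PartitionedBloomSketch(4, 1 << 20, 3);
        sketch.add("1\u0001a");
        System.out.println(sketch.mightContain("1\u0001a")); // true
        System.out.println(sketch.mightContain("9\u0001z")); // false (with high probability)
    }
}
{code}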
Re: Review Request 55681: [PIG-4963] Add a Bloom join
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55681/
-----------------------------------------------------------

(Updated Jan. 26, 2017, 5:55 a.m.)


Review request for pig, Daniel Dai and Adam Szita.


Changes
-------

Addressed review comments on documentation


Bugs: PIG-4963
    https://issues.apache.org/jira/browse/PIG-4963


Repository: pig


Description
-------

This patch adds a new type of join called bloom. It supports creating multiple bloom filters partitioned by the hashcode of the key for parallelism. Two new operators and one Packager implementation are added.
POBuildBloomRearrangeTez - Builds the bloom filter for one of the relations of the join on the map side, or writes out the join keys, depending on the strategy.
BloomPackager - Used in the reducer to create or combine bloom filters and produce the final bloom filters.
POBloomFilterRearrangeTez - Applies the bloom filters to the other relations in the join and filters out data.

More details in the documentation.


Diffs (updated)
-----

  http://svn.apache.org/repos/asf/pig/trunk/src/docs/src/documentation/content/xdocs/basic.xml 1779665
  http://svn.apache.org/repos/asf/pig/trunk/src/docs/src/documentation/content/xdocs/perf.xml 1779665
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/PigConfiguration.java 1779665
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java 1779665
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigCombiner.java 1779665
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/EndOfAllInputSetter.java 1779665
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/PhysicalOperator.java 1779665
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/Packager.java 1779665
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java 1779665
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/TezCompiler.java 1779665
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/TezEdgeDescriptor.java 1779665
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/TezOperator.java 1779665
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/TezPOPackageAnnotator.java 1779665
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/operator/BloomPackager.java PRE-CREATION
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/operator/POBloomFilterRearrangeTez.java PRE-CREATION
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/operator/POBuildBloomRearrangeTez.java PRE-CREATION
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/operator/POLocalRearrangeTez.java 1779665
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/operator/POShuffleTezLoad.java 1779665
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/optimizer/CombinerOptimizer.java 1779665
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/optimizer/ParallelismSetter.java 1779665
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/optimizer/SecondaryKeyOptimizerTez.java 1779665
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/optimizer/TezEstimatedParallelismClearer.java 1779665
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/optimizer/TezOperDependencyParallelismEstimator.java 1779665
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/relational/LOJoin.java 1779665
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/relational/LogToPhyTranslationVisitor.java 1779665
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/LogicalPlanBuilder.java 1779665
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/tools/pigstats/ScriptState.java 1779665
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/tools/pigstats/tez/TezScriptState.java 1779665
  http://svn.apache.org/repos/asf/pig/trunk/test/e2e/pig/build.xml 1779665
  htt
[jira] [Updated] (PIG-5112) Cleanup pig-template.xml
[ https://issues.apache.org/jira/browse/PIG-5112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-5112:
----------------------------
      Resolution: Fixed
    Hadoop Flags: Reviewed
          Status: Resolved  (was: Patch Available)

Patch committed to both the 0.16 branch and trunk. Thanks Thejas for the review!

> Cleanup pig-template.xml
> ------------------------
>
>                 Key: PIG-5112
>                 URL: https://issues.apache.org/jira/browse/PIG-5112
>             Project: Pig
>          Issue Type: Bug
>          Components: build
>            Reporter: Daniel Dai
>            Assignee: Daniel Dai
>             Fix For: 0.17.0, 0.16.1
>
>         Attachments: PIG-5112-1.patch
>
> Several entries in pig-template.xml are outdated. Attaching a patch to remove or update those entries. Later we shall use ivy:makepom to generate pig.pom and the lib dir; I will open a separate ticket for that.
[jira] [Commented] (PIG-4963) Add a Bloom join
[ https://issues.apache.org/jira/browse/PIG-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15839173#comment-15839173 ]

Rohini Palaniswamy commented on PIG-4963:
-----------------------------------------

Will address 1. For 3, I did a quick run of the Join tests converted to use bloom and they were fine, except for full outer join, which is not supported. Actually, the tests added for bloom join cover all the cases in the Join group and in fact cover a lot more - tuple keys, more datatypes for keys, and more cases for union and split. They also use studentnulltab10k, which tests null cases better. The self join case is covered in multiquery.conf.

bq. But I feel it is clearer if the plan shows a filter + a regular local rearrange. The execution plan of the latter is more understandable.

I think it is unnecessary overhead to add a separate filter operator just for readability. The current Filter operator, which executes a plan for filtering, has no relation to the BloomFilter way of filtering, and it does not logically make sense to extend it for BloomFilter. This approach is simpler and cleaner in terms of implementation, and it should also be faster in execution, as there is no unnecessary overhead.

> Add a Bloom join
> ----------------
>
>                 Key: PIG-4963
>                 URL: https://issues.apache.org/jira/browse/PIG-4963
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.17.0
>
>         Attachments: PIG-4963-1.patch, PIG-4963-2.patch, PIG-4963-3.patch, PIG-4963-4.patch
>
> In PIG-4925, we added an option to pass a BloomFilter as a scalar to the bloom function. But we found that actually using it for big data, which required a huge vector size, was very inefficient and led to OOM.
> I had initially calculated that it would take around a 12MB bytearray for a 100 million vector size ((100,000,000 + 7) / 8 = 12,500,000 bytes), and that would be the scalar value broadcast, so it would not take much space. But the problem is that 12MB was written out for every input record by BuildBloom$Initial before the aggregation happens and we arrive at the final BloomFilter vector. And with POPartialAgg it runs into OOM issues.
> If we added a bloom join implementation, which can be combined with hash or skewed join, it would boost performance for a lot of jobs. The Bloom filter of the smaller tables can be sent to the bigger tables as a scalar and the data filtered before the hash or skewed join is used.
[jira] [Commented] (PIG-5111) e2e Utf8Test fails in local mode
[ https://issues.apache.org/jira/browse/PIG-5111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15838976#comment-15838976 ]

Daniel Dai commented on PIG-5111:
---------------------------------

+1

> e2e Utf8Test fails in local mode
> --------------------------------
>
>                 Key: PIG-5111
>                 URL: https://issues.apache.org/jira/browse/PIG-5111
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.17.0, 0.16.1
>
>         Attachments: PIG-5111-1.patch
>
> The required test data is not set up during deploy in local mode (test-e2e-deploy-local).
[jira] [Updated] (PIG-5115) Builtin AvroStorage generates incorrect avro schema when the same pig field name appears in the alias
[ https://issues.apache.org/jira/browse/PIG-5115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-5115:
----------------------------
    Assignee: Anyi Li

> Builtin AvroStorage generates incorrect avro schema when the same pig field name appears in the alias
> ------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-5115
>                 URL: https://issues.apache.org/jira/browse/PIG-5115
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.17.0
>            Reporter: Anyi Li
>            Assignee: Anyi Li
>             Fix For: 0.17.0
>
>         Attachments: PIG-5115.patch
>
> Pig ResourceSchema allows using the same field name with different types as long as the definitions are not at the same level. A pig schema like
> {quote}
> data: {col1: (col2: (col1_data: chararray)), col2: {col2: (col2_data: chararray)}}
> {quote}
> redefines _col2_, but since the two definitions do not appear at the same level, it is a totally valid pig schema.
> However, once it is translated by AvroStorage, it throws an exception:
> {noformat}
> Can't redefine: col2
>         at org.apache.pig.newplan.logical.visitor.InputOutputFileValidatorVisitor.visit(InputOutputFileValidatorVisitor.java:64)
>         at org.apache.pig.newplan.logical.relational.LOStore.accept(LOStore.java:66)
>         at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:64)
>         at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
>         at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
>         at org.apache.pig.newplan.DepthFirstWalker.walk(DepthFirstWalker.java:53)
>         at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
>         at org.apache.pig.newplan.logical.relational.LogicalPlan.validate(LogicalPlan.java:212)
>         at org.apache.pig.PigServer$Graph.compile(PigServer.java:1767)
>         at org.apache.pig.PigServer$Graph.access$300(PigServer.java:1443)
>         at org.apache.pig.PigServer.execute(PigServer.java:1356)
>         at org.apache.pig.PigServer.executeBatch(PigServer.java:415)
>         at org.apache.pig.PigServer.executeBatch(PigServer.java:398)
>         at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171)
>         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:234)
>         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
>         at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
>         at org.apache.pig.Main.run(Main.java:631)
>         at org.apache.pig.Main.main(Main.java:177)
> Caused by: org.apache.avro.SchemaParseException: Can't redefine: col2
>         at org.apache.avro.Schema$Names.put(Schema.java:1042)
>         at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:511)
>         at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:626)
>         at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:737)
>         at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:814)
>         at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:648)
>         at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:635)
>         at org.apache.avro.Schema.toString(Schema.java:297)
>         at org.apache.avro.Schema.toString(Schema.java:287)
>         at org.apache.pig.builtin.AvroStorage.setOutputAvroSchema(AvroStorage.java:442)
>         at org.apache.pig.builtin.AvroStorage.checkSchema(AvroStorage.java:433)
>         at org.apache.pig.newplan.logical.visitor.InputOutputFileValidatorVisitor.visit(InputOutputFileValidatorVisitor.java:54)
>         ... 18 more
> {noformat}
> It is caused by a bug in the AvroStorageSchemaConversionUtilities class, which uses the tuple name both as the GenericRecord name and as the field name that wraps the record.
> So it produces an avro schema like the following:
> {noformat}
> {
>   "type": "record",
>   "name": "data",
>   "fields": [
>     {
>       "name": "col1",
>       "type": {
>         "type": "record",
>         "name": "col1_1",
>         "fields": [
>           {
>             "name": "col2",
>             "type": {
>               "type": "record",
>               "name": "col2",
>               "fields": [
>                 { "name": "col1_data", "type": "string" }
>               ]
>             }
>           }
>         ]
>       }
>     },
>     {
>       "name": "col2",
>       "type": {
>         "type": "array",
>         "items": {
>           "type": "record",
>           "name": "col2",
>           "fields": [
>             { "name": "col2_data", "type": "string" }
>           ]
>         }
>       }
>     }
>   ]
> }
> {noformat}
> But according
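The "Can't redefine" failure is easy to reproduce outside Pig, since Avro's own parser rejects any schema tree that defines the same fullname twice. A minimal sketch (the field names other than col2 are made up):
{code}
import org.apache.avro.Schema;

public class RedefineRepro {
    public static void main(String[] args) {
        String json = "{\"type\":\"record\",\"name\":\"data\",\"fields\":["
            + "{\"name\":\"a\",\"type\":{\"type\":\"record\",\"name\":\"col2\","
            + "\"fields\":[{\"name\":\"x\",\"type\":\"string\"}]}},"
            + "{\"name\":\"b\",\"type\":{\"type\":\"record\",\"name\":\"col2\","
            + "\"fields\":[{\"name\":\"y\",\"type\":\"string\"}]}}]}";
        // Throws org.apache.avro.SchemaParseException: Can't redefine: col2
        Schema schema = new Schema.Parser().parse(json);
        System.out.println(schema);
    }
}
{code}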
[jira] [Resolved] (PIG-5113) Not a valid JAR
[ https://issues.apache.org/jira/browse/PIG-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai resolved PIG-5113.
-----------------------------
    Resolution: Not A Problem

The classic way is to export the PIG_HEAPSIZE environment variable (in MB, e.g. export PIG_HEAPSIZE=4096 for a 4 GB heap). But it seems you solved the issue anyway.

> Not a valid JAR
> ---------------
>
>                 Key: PIG-5113
>                 URL: https://issues.apache.org/jira/browse/PIG-5113
>             Project: Pig
>          Issue Type: Bug
>          Components: grunt
>    Affects Versions: 0.16.0
>         Environment: Ubuntu Server 16.04
>            Reporter: Fabrizio Massara
>
> Hello, I installed Pig on Ubuntu Server 16.04 and I need to use it in local mode.
> Yesterday I tried to run some jobs, but unfortunately they were killed because the Java heap space wasn't enough. I increased it, but now, when I try to run pig, this error appears:
> Not a valid JAR: /usr/local/pig/pig-0.16.0-core-h2.jar /usr/local/pig/pig-0.16.0-SNAPSHOT-core-h2.jar
> How could I solve this? I cannot find a solution.
[jira] [Commented] (PIG-4963) Add a Bloom join
[ https://issues.apache.org/jira/browse/PIG-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15838928#comment-15838928 ]

Daniel Dai commented on PIG-4963:
---------------------------------

I glanced through the patch and it looks very good. I have some minor comments:
1. The documentation about the left outer join gives the impression that the user can make bloom join efficient by switching the order of the relations. Actually, this is a limitation of bloom join, and switching the order does not solve the problem. We shall make that clearer.
2. Currently we use POBloomFilterRearrangeTez for the bloom filter. But I feel it is clearer if the plan shows a filter + a regular local rearrange. The execution plan of the latter is more understandable.
3. The patch does have quite a bit of test coverage. However, we can run the existing join e2e tests once with bloom join and make sure it works. That's an easy approach for additional tests.

I still need more time to do a code-level review, but I am fine with committing once we have done #1 and #3, and dealing with #2 and other review comments in follow-up Jiras.

> Add a Bloom join
> ----------------
>
>                 Key: PIG-4963
>                 URL: https://issues.apache.org/jira/browse/PIG-4963
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.17.0
>
>         Attachments: PIG-4963-1.patch, PIG-4963-2.patch, PIG-4963-3.patch, PIG-4963-4.patch
>
> In PIG-4925, we added an option to pass a BloomFilter as a scalar to the bloom function. But we found that actually using it for big data, which required a huge vector size, was very inefficient and led to OOM.
> I had initially calculated that it would take around a 12MB bytearray for a 100 million vector size ((100,000,000 + 7) / 8 = 12,500,000 bytes), and that would be the scalar value broadcast, so it would not take much space. But the problem is that 12MB was written out for every input record by BuildBloom$Initial before the aggregation happens and we arrive at the final BloomFilter vector. And with POPartialAgg it runs into OOM issues.
> If we added a bloom join implementation, which can be combined with hash or skewed join, it would boost performance for a lot of jobs. The Bloom filter of the smaller tables can be sent to the bigger tables as a scalar and the data filtered before the hash or skewed join is used.
[jira] [Updated] (PIG-5115) Builtin AvroStorage generates incorrect avro schema when the same pig field name appears in the alias
[ https://issues.apache.org/jira/browse/PIG-5115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anyi Li updated PIG-5115:
-------------------------
    Fix Version/s: 0.17.0
           Status: Patch Available  (was: Open)

> Builtin AvroStorage generates incorrect avro schema when the same pig field name appears in the alias
> ------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-5115
>                 URL: https://issues.apache.org/jira/browse/PIG-5115
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.17.0
>            Reporter: Anyi Li
>             Fix For: 0.17.0
>
>         Attachments: PIG-5115.patch
[jira] [Updated] (PIG-5115) Builtin AvroStorage generates incorrect avro schema when the same pig field name appears in the alias
[ https://issues.apache.org/jira/browse/PIG-5115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anyi Li updated PIG-5115:
-------------------------
    Attachment: PIG-5115.patch

> Builtin AvroStorage generates incorrect avro schema when the same pig field name appears in the alias
> ------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-5115
>                 URL: https://issues.apache.org/jira/browse/PIG-5115
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.17.0
>            Reporter: Anyi Li
>
>         Attachments: PIG-5115.patch
[jira] [Updated] (PIG-5115) Builtin AvroStorage generates incorrect avro schema when the same pig field name appears in the alias
[ https://issues.apache.org/jira/browse/PIG-5115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anyi Li updated PIG-5115:
-------------------------
    Description: 
Pig ResourceSchema allows using the same field name with different types as long as the definitions are not at the same level. A pig schema like
{quote}
data: {col1: (col2: (col1_data: chararray)), col2: {col2: (col2_data: chararray)}}
{quote}
redefines _col2_, but since the two definitions do not appear at the same level, it is a totally valid pig schema.
However, once it is translated by AvroStorage, it throws an exception:
{noformat}
Can't redefine: col2
        at org.apache.pig.newplan.logical.visitor.InputOutputFileValidatorVisitor.visit(InputOutputFileValidatorVisitor.java:64)
        at org.apache.pig.newplan.logical.relational.LOStore.accept(LOStore.java:66)
        at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:64)
        at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
        at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
        at org.apache.pig.newplan.DepthFirstWalker.walk(DepthFirstWalker.java:53)
        at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
        at org.apache.pig.newplan.logical.relational.LogicalPlan.validate(LogicalPlan.java:212)
        at org.apache.pig.PigServer$Graph.compile(PigServer.java:1767)
        at org.apache.pig.PigServer$Graph.access$300(PigServer.java:1443)
        at org.apache.pig.PigServer.execute(PigServer.java:1356)
        at org.apache.pig.PigServer.executeBatch(PigServer.java:415)
        at org.apache.pig.PigServer.executeBatch(PigServer.java:398)
        at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:234)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
        at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
        at org.apache.pig.Main.run(Main.java:631)
        at org.apache.pig.Main.main(Main.java:177)
Caused by: org.apache.avro.SchemaParseException: Can't redefine: col2
        at org.apache.avro.Schema$Names.put(Schema.java:1042)
        at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:511)
        at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:626)
        at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:737)
        at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:814)
        at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:648)
        at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:635)
        at org.apache.avro.Schema.toString(Schema.java:297)
        at org.apache.avro.Schema.toString(Schema.java:287)
        at org.apache.pig.builtin.AvroStorage.setOutputAvroSchema(AvroStorage.java:442)
        at org.apache.pig.builtin.AvroStorage.checkSchema(AvroStorage.java:433)
        at org.apache.pig.newplan.logical.visitor.InputOutputFileValidatorVisitor.visit(InputOutputFileValidatorVisitor.java:54)
        ... 18 more
{noformat}
It is caused by a bug in the AvroStorageSchemaConversionUtilities class, which uses the tuple name both as the GenericRecord name and as the field name that wraps the record.
So it produces an avro schema like the following:
{noformat}
{
  "type": "record",
  "name": "data",
  "fields": [
    {
      "name": "col1",
      "type": {
        "type": "record",
        "name": "col1_1",
        "fields": [
          {
            "name": "col2",
            "type": {
              "type": "record",
              "name": "col2",
              "fields": [
                { "name": "col1_data", "type": "string" }
              ]
            }
          }
        ]
      }
    },
    {
      "name": "col2",
      "type": {
        "type": "array",
        "items": {
          "type": "record",
          "name": "col2",
          "fields": [
            { "name": "col2_data", "type": "string" }
          ]
        }
      }
    }
  ]
}
{noformat}
But according to the avro 1.7.7 specs ([https://avro.apache.org/docs/1.7.7/spec.html#Names]), _col2_ has been defined as a record and is later redefined as an array; that is an invalid schema unless the fullname (namespace + name) is unique. Since AvroStorageSchemaConversionUtilities generates an avro record whenever the pig schema is a tuple, we need a way to generate a unique _recordName_.
{code}
public static Schema resourceSchemaToAvroSchema(final ResourceSchema rs,
        String recordName, final String recordNameSpace,
        final Map<String, List<Schema>> definedRecordNames,
        final Boolean doubleColonsToDoubleUnderscores) throws IOException {
    if (rs == null) {
        return null;
    }
    recordName = toAvroName(recordName, doubleColonsToDoubleUnderscores);
    List<Schema.Field> fields
{code}
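A fix along the lines the description asks for could deduplicate record names before emitting them. This is a hypothetical sketch, not the code in PIG-5115.patch; the helper name and the counter-suffix scheme are assumptions:
{code}
import java.util.HashSet;
import java.util.Set;

public class UniqueRecordNames {
    // Return base if unused, else base_1, base_2, ... so that every Avro
    // record in one schema tree gets a distinct fullname.
    static String uniqueRecordName(String base, Set<String> used) {
        String candidate = base;
        int suffix = 1;
        while (!used.add(candidate)) {
            candidate = base + "_" + suffix++;
        }
        return candidate;
    }

    public static void main(String[] args) {
        Set<String> used = new HashSet<>();
        System.out.println(uniqueRecordName("col2", used)); // col2
        System.out.println(uniqueRecordName("col2", used)); // col2_1
    }
}
{code}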
[jira] [Updated] (PIG-5115) Builtin AvroStorage generates incorrect avro schema when the same pig field name appears in the alias
[ https://issues.apache.org/jira/browse/PIG-5115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anyi Li updated PIG-5115: - Summary: Builtin AvroStorage generates incorrect avro schema when the same pig field name appears in the alias (was: Builtin AvroStorage generates the incorrect avro schema when same pig field names appears in the alias)

> Builtin AvroStorage generates incorrect avro schema when the same pig field
> name appears in the alias
> -
>
> Key: PIG-5115
> URL: https://issues.apache.org/jira/browse/PIG-5115
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.17.0
> Reporter: Anyi Li
>
> Pig ResourceSchema allows the same field name to be reused with a different
> type, as long as the two occurrences are not at the same level. For example,
> the pig schema
> {quote}
> data: {col1: (col2: (col1_data: chararray)),col2: {col2: (col2_data: chararray)}}
> {quote}
> redefines _col2_, but because the two occurrences are at different levels
> this is a perfectly valid pig schema.
> However, once it is translated by AvroStorage, it throws an exception:
> {noformat}
> Can't redefine: col2
> at org.apache.pig.newplan.logical.visitor.InputOutputFileValidatorVisitor.visit(InputOutputFileValidatorVisitor.java:64)
> at org.apache.pig.newplan.logical.relational.LOStore.accept(LOStore.java:66)
> at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:64)
> at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
> at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
> at org.apache.pig.newplan.DepthFirstWalker.walk(DepthFirstWalker.java:53)
> at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
> at org.apache.pig.newplan.logical.relational.LogicalPlan.validate(LogicalPlan.java:212)
> at org.apache.pig.PigServer$Graph.compile(PigServer.java:1767)
> at org.apache.pig.PigServer$Graph.access$300(PigServer.java:1443)
> at org.apache.pig.PigServer.execute(PigServer.java:1356)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:415)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:398)
> at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171)
> at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:234)
> at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
> at org.apache.pig.Main.run(Main.java:631)
> at org.apache.pig.Main.main(Main.java:177)
> Caused by: org.apache.avro.SchemaParseException: Can't redefine: col2
> at org.apache.avro.Schema$Names.put(Schema.java:1042)
> at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:511)
> at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:626)
> at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:737)
> at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:814)
> at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:648)
> at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:635)
> at org.apache.avro.Schema.toString(Schema.java:297)
> at org.apache.avro.Schema.toString(Schema.java:287)
> at org.apache.pig.builtin.AvroStorage.setOutputAvroSchema(AvroStorage.java:442)
> at org.apache.pig.builtin.AvroStorage.checkSchema(AvroStorage.java:433)
> at org.apache.pig.newplan.logical.visitor.InputOutputFileValidatorVisitor.visit(InputOutputFileValidatorVisitor.java:54)
> ... 18 more
> {noformat}
> This is caused by a bug in the AvroStorageSchemaConversionUtilities class,
> which uses the tuple name both as the GenericRecord name and as the name of
> the field that wraps the record.
> It therefore produces an avro schema like the following:
> {noformat}
> {
>   "type": "record",
>   "name": "data",
>   "fields": [
>     {
>       "name": "col1",
>       "type": {
>         "type": "record",
>         "name": "col1_1",
>         "fields": [
>           {
>             "name": "col2",
>             "type": {
>               "type": "record",
>               "name": "col2",
>               "fields": [
>                 {
>                   "name": "col1_data",
>                   "type": "string"
>                 }
>               ]
>             }
>           }
>         ]
>       }
>     },
>     {
>       "name": "col2",
>       "type": {
>         "type": "array",
>         "items": {
>           "type": "record",
>           "name": "col2",
>           "fields": [
>             {
>               "name": "col2_data",
>               "type": "string"
>             }
>           ]
>         }
>       }
>     }
>   ]
> }
> {noformat}
> But according to the avro 1.7.7 spec
> ([https://avro.apache.org/docs/1.7.7/spec.html#Names]), _col2_ is first
> defined as a record and later redefined as an array, which makes the schema
> invalid unless each definition's fullname (namespace + name) is unique.
> Since AvroStorageSchemaConversionUtilities generates an avro record whenever
> the pig schema is a tuple, we need a way to generate a unique _recordName_:
> {code:java}
> public static Schema resourceSchemaToAvroSchema(final ResourceSchema rs,
>     String recordName, final String recordNameSpace,
>     final Map<String, List<Schema>> definedRecordNames,
>     final Boolean doubleColonsToDoubleUnderscores) throws IOException {
>   if (rs == null) {
>     return null;
>   }
>   recordName = toAvroName(recordName, doubleColonsToDoubleUnderscores);
>   List<Schema.Field> fields = new ArrayList<Schema.Field>();
> {code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (PIG-5115) Builtin AvroStorage generates the incorrect avro schema when same pig field names appears in the alias
Anyi Li created PIG-5115: Summary: Builtin AvroStorage generates the incorrect avro schema when same pig field names appears in the alias Key: PIG-5115 URL: https://issues.apache.org/jira/browse/PIG-5115 Project: Pig Issue Type: Bug Affects Versions: 0.17.0 Reporter: Anyi Li

Pig ResourceSchema allows the same field name to be reused with a different type, as long as the two occurrences are not at the same level. For example, the pig schema
{quote}
data: {col1: (col2: (col1_data: chararray)),col2: {col2: (col2_data: chararray)}}
{quote}
redefines _col2_, but because the two occurrences are at different levels this is a perfectly valid pig schema. However, once it is translated by AvroStorage, it throws an exception:
{quote}
Can't redefine: col2
	at org.apache.pig.newplan.logical.visitor.InputOutputFileValidatorVisitor.visit(InputOutputFileValidatorVisitor.java:64)
	at org.apache.pig.newplan.logical.relational.LOStore.accept(LOStore.java:66)
	at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:64)
	at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
	at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
	at org.apache.pig.newplan.DepthFirstWalker.walk(DepthFirstWalker.java:53)
	at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
	at org.apache.pig.newplan.logical.relational.LogicalPlan.validate(LogicalPlan.java:212)
	at org.apache.pig.PigServer$Graph.compile(PigServer.java:1767)
	at org.apache.pig.PigServer$Graph.access$300(PigServer.java:1443)
	at org.apache.pig.PigServer.execute(PigServer.java:1356)
	at org.apache.pig.PigServer.executeBatch(PigServer.java:415)
	at org.apache.pig.PigServer.executeBatch(PigServer.java:398)
	at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171)
	at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:234)
	at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
	at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
	at org.apache.pig.Main.run(Main.java:631)
	at org.apache.pig.Main.main(Main.java:177)
Caused by: org.apache.avro.SchemaParseException: Can't redefine: col2
	at org.apache.avro.Schema$Names.put(Schema.java:1042)
	at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:511)
	at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:626)
	at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:737)
	at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:814)
	at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:648)
	at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:635)
	at org.apache.avro.Schema.toString(Schema.java:297)
	at org.apache.avro.Schema.toString(Schema.java:287)
	at org.apache.pig.builtin.AvroStorage.setOutputAvroSchema(AvroStorage.java:442)
	at org.apache.pig.builtin.AvroStorage.checkSchema(AvroStorage.java:433)
	at org.apache.pig.newplan.logical.visitor.InputOutputFileValidatorVisitor.visit(InputOutputFileValidatorVisitor.java:54)
	... 18 more
{quote}
This is caused by a bug in the AvroStorageSchemaConversionUtilities class, which uses the tuple name both as the GenericRecord name and as the name of the field that wraps the record.
It therefore produces an avro schema like the following:
{code:json}
{
  "type": "record",
  "name": "data",
  "fields": [
    {
      "name": "col1",
      "type": {
        "type": "record",
        "name": "col1_1",
        "fields": [
          {
            "name": "col2",
            "type": {
              "type": "record",
              "name": "col2",
              "fields": [
                {
                  "name": "col1_data",
                  "type": "string"
                }
              ]
            }
          }
        ]
      }
    },
    {
      "name": "col2",
      "type": {
        "type": "array",
        "items": {
          "type": "record",
          "name": "col2",
          "fields": [
            {
              "name": "col2_data",
              "type": "string"
            }
          ]
        }
      }
    }
  ]
}
{code}
But according to the avro 1.7.7 spec ([https://avro.apache.org/docs/1.7.7/spec.html#Names]), _col2_ is first defined as a record and later redefined as an array, which makes the schema invalid unless each definition's fullname (namespace + name) is unique. Since AvroStorageSchemaConversionUtilities generates an avro record whenever the pig schema is a tuple, we need a way to generate a unique _recordName_:
{code:java}
public static Schema resourceSchemaToAvroSchema(final ResourceSchema rs,
    String recordName, final String recordNameSpace,
    final Map<String, List<Schema>> definedRecordNames,
    final Boolean doubleColonsToDoubleUnderscores) throws IOException {
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
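The problem can also be reproduced without Pig in the picture: feeding the generated JSON back into Avro's Schema.Parser fails with the same "Can't redefine" error, since the record name col2 is defined twice with different shapes. A minimal, self-contained sketch against Avro 1.7.7 (the class name here is made up for illustration; the schema literal is the one from the description above):
{code:java}
import org.apache.avro.Schema;
import org.apache.avro.SchemaParseException;

// Standalone repro of the validation failure, independent of Pig: Avro's
// parser rejects the schema AvroStorage emits because the record name "col2"
// is fully defined twice within the same (empty) namespace.
public class CantRedefineRepro {
    public static void main(String[] args) {
        String json = "{\"type\":\"record\",\"name\":\"data\",\"fields\":["
            + "{\"name\":\"col1\",\"type\":{\"type\":\"record\",\"name\":\"col1_1\",\"fields\":["
            + "{\"name\":\"col2\",\"type\":{\"type\":\"record\",\"name\":\"col2\",\"fields\":["
            + "{\"name\":\"col1_data\",\"type\":\"string\"}]}}]}},"
            + "{\"name\":\"col2\",\"type\":{\"type\":\"array\",\"items\":"
            + "{\"type\":\"record\",\"name\":\"col2\",\"fields\":["
            + "{\"name\":\"col2_data\",\"type\":\"string\"}]}}}]}";
        try {
            new Schema.Parser().parse(json);
        } catch (SchemaParseException e) {
            System.out.println(e.getMessage()); // Can't redefine: col2
        }
    }
}
{code}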
[jira] [Updated] (PIG-5112) Cleanup pig-template.xml
[ https://issues.apache.org/jira/browse/PIG-5112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-5112: Fix Version/s: 0.16.1 > Cleanup pig-template.xml > > > Key: PIG-5112 > URL: https://issues.apache.org/jira/browse/PIG-5112 > Project: Pig > Issue Type: Bug > Components: build >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.17.0, 0.16.1 > > Attachments: PIG-5112-1.patch > > > Several entries in pig-template.xml are outdated. Attaching a patch to remove or > update those entries. Later we shall use ivy:makepom to generate pig.pom and > the lib dir; I will open a separate ticket for that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-5112) Cleanup pig-template.xml
[ https://issues.apache.org/jira/browse/PIG-5112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15838166#comment-15838166 ] Thejas M Nair commented on PIG-5112: +1 > Cleanup pig-template.xml > > > Key: PIG-5112 > URL: https://issues.apache.org/jira/browse/PIG-5112 > Project: Pig > Issue Type: Bug > Components: build >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.17.0 > > Attachments: PIG-5112-1.patch > > > Several entries in pig-template.xml are outdated. Attaching a patch to remove or > update those entries. Later we shall use ivy:makepom to generate pig.pom and > the lib dir; I will open a separate ticket for that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-5113) Not a valid JAR
[ https://issues.apache.org/jira/browse/PIG-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15837662#comment-15837662 ] Fabrizio Massara commented on PIG-5113: --- I increased it using: export _JAVA_OPTIONS=-Xmx8192m. I also tried PIG_OPTS, but nothing changed. After verifying the jar with: jar -tvf pig-0.16.0-core-h2.jar no errors arose. > Not a valid JAR > --- > > Key: PIG-5113 > URL: https://issues.apache.org/jira/browse/PIG-5113 > Project: Pig > Issue Type: Bug > Components: grunt >Affects Versions: 0.16.0 > Environment: Ubuntu Server 16.04 >Reporter: Fabrizio Massara > > Hello, I installed Pig on Ubuntu Server 16.04 and I need to use it in local > mode. > Yesterday I tried to run some jobs, but unfortunately they were killed because > the Java heap space wasn't enough. I updated it, but now, when I try to run > pig, this error appears: > Not a valid JAR: /usr/local/pig/pig-0.16.0-core-h2.jar > /usr/local/pig/pig-0.16.0-SNAPSHOT-core-h2.jar > How can I solve this? I cannot find a solution. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-5113) Not a valid JAR
[ https://issues.apache.org/jira/browse/PIG-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15837591#comment-15837591 ] Nandor Kollar commented on PIG-5113: How did you increase the heap space? Via exporting PIG_OPTS? Did you verify manually that the jar is actually a valid jar? > Not a valid JAR > --- > > Key: PIG-5113 > URL: https://issues.apache.org/jira/browse/PIG-5113 > Project: Pig > Issue Type: Bug > Components: grunt >Affects Versions: 0.16.0 > Environment: Ubuntu Server 16.04 >Reporter: Fabrizio Massara > > Hello, I installed Pig on Ubuntu Server 16.04 and I need to use it in local > mode. > Yesterday I tried to run some jobs, but unfortunately they were killed because > the Java heap space wasn't enough. I updated it, but now, when I try to run > pig, this error appears: > Not a valid JAR: /usr/local/pig/pig-0.16.0-core-h2.jar > /usr/local/pig/pig-0.16.0-SNAPSHOT-core-h2.jar > How can I solve this? I cannot find a solution. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (PIG-5114) Getting error 1006-unable to iterate alias for r
Sandip Samaddar created PIG-5114: Summary: Getting error 1006-unable to iterate alias for r Key: PIG-5114 URL: https://issues.apache.org/jira/browse/PIG-5114 Project: Pig Issue Type: Bug Environment: OS - Ubuntu 16.04 2 virtual machines with OS Ubuntu 16.04, with Hadoop 2.5.1 installed as master and slave. HBase 1.1.4 installed in distributed mode. Pig 0.15 is installed in the master virtual machine. Reporter: Sandip Samaddar I am using 2 virtual machines, where one is the hadoop master and the other is a hadoop slave. I have installed HBase 1.1.4 in distributed mode, and Pig 0.15 is installed on the master. When I open pig in mapreduce mode, load a txt file from hdfs and then dump it, I get the error "unable to iterate alias". In local mode, however, dump works fine. I should also mention that I built with: ant clean tar -Dhadoopversion=23 -Dhbase95.version=1.1.2 -Dforrest.home=/home/hduser/forrest/apache-forrest-0.9 The build was successful, but I am still getting the error. Kindly help. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (PIG-5113) Not a valid JAR
Fabrizio Massara created PIG-5113: - Summary: Not a valid JAR Key: PIG-5113 URL: https://issues.apache.org/jira/browse/PIG-5113 Project: Pig Issue Type: Bug Components: grunt Affects Versions: 0.16.0 Environment: Ubuntu Server 16.04 Reporter: Fabrizio Massara Hello, I installed Pig on Ubuntu Server 16.04 and I need to use it in local mode. Yesterday I tried to run some jobs, but unfortunately they were killed because the Java heap space wasn't enough. I updated it, but now, when I try to run pig, this error appears: Not a valid JAR: /usr/local/pig/pig-0.16.0-core-h2.jar /usr/local/pig/pig-0.16.0-SNAPSHOT-core-h2.jar How can I solve this? I cannot find a solution. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-5104) Union_15 e2e test failing on Spark
[ https://issues.apache.org/jira/browse/PIG-5104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15837564#comment-15837564 ] Nandor Kollar commented on PIG-5104: [~kellyzly], I attached a unit test to show the issue. It is not included in the diff, because we already have similar e2e tests for this scenario, but executing the unit test might be easier in local mode. > Union_15 e2e test failing on Spark > -- > > Key: PIG-5104 > URL: https://issues.apache.org/jira/browse/PIG-5104 > Project: Pig > Issue Type: Bug > Components: spark >Reporter: Nandor Kollar >Assignee: Nandor Kollar > Fix For: spark-branch > > Attachments: PIG-5104.patch, TestUnion_15.java > > > While working on PIG-4891 I noticed that Union_15 e2e test is failing on > Spark mode with this exception: > Caused by: java.lang.RuntimeException: > org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught > error from UDF: org.apache.pig.impl.builtin.GFCross [Unable to get > parallelism hint from job conf] > at > org.apache.pig.backend.hadoop.executionengine.spark.converter.OutputConsumerIterator.readNext(OutputConsumerIterator.java:89) > at > org.apache.pig.backend.hadoop.executionengine.spark.converter.OutputConsumerIterator.hasNext(OutputConsumerIterator.java:96) > at > scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2078: > Caught error from UDF: org.apache.pig.impl.builtin.GFCross [Unable to get > parallelism hint from job conf] > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:358) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextDataBag(POUserFunc.java:374) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:335) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:404) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:321) > at > org.apache.pig.backend.hadoop.executionengine.spark.converter.ForEachConverter$ForEachFunction$1$1.getNextResult(ForEachConverter.java:87) > at > org.apache.pig.backend.hadoop.executionengine.spark.converter.OutputConsumerIterator.readNext(OutputConsumerIterator.java:69) > ... 
11 more > Caused by: java.io.IOException: Unable to get parallelism hint from job conf > at org.apache.pig.impl.builtin.GFCross.exec(GFCross.java:66) > at org.apache.pig.impl.builtin.GFCross.exec(GFCross.java:37) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:330) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
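For context on the failure mode: GFCross.exec expects the frontend to have stamped a parallelism hint into the job conf and bails out when the hint is absent, which is what happens on the Spark backend here. A rough sketch of that kind of guard follows; it is a simplification for illustration only, and the property key below is an assumption, not the actual constant used by org.apache.pig.impl.builtin.GFCross.
{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;

// Simplified illustration of the guard behind "Unable to get parallelism
// hint from job conf". The real lookup lives in GFCross.exec; the key name
// "pig.cross.<operatorKey>.parallelism" is hypothetical.
public class ParallelismHintCheck {
    static int parallelismHint(Configuration conf, String operatorKey) throws IOException {
        // If the frontend never set the per-operator hint, the default -1
        // survives into the backend task and the UDF fails fast.
        int parallelism = conf.getInt("pig.cross." + operatorKey + ".parallelism", -1);
        if (parallelism < 0) {
            throw new IOException("Unable to get parallelism hint from job conf");
        }
        return parallelism;
    }
}
{code}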
[jira] [Commented] (PIG-5104) Union_15 e2e test failing on Spark
[ https://issues.apache.org/jira/browse/PIG-5104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15837562#comment-15837562 ] Nandor Kollar commented on PIG-5104: Thanks Rohini! > Union_15 e2e test failing on Spark > -- > > Key: PIG-5104 > URL: https://issues.apache.org/jira/browse/PIG-5104 > Project: Pig > Issue Type: Bug > Components: spark >Reporter: Nandor Kollar >Assignee: Nandor Kollar > Fix For: spark-branch > > Attachments: PIG-5104.patch, TestUnion_15.java > > > While working on PIG-4891 I noticed that Union_15 e2e test is failing on > Spark mode with this exception: > Caused by: java.lang.RuntimeException: > org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught > error from UDF: org.apache.pig.impl.builtin.GFCross [Unable to get > parallelism hint from job conf] > at > org.apache.pig.backend.hadoop.executionengine.spark.converter.OutputConsumerIterator.readNext(OutputConsumerIterator.java:89) > at > org.apache.pig.backend.hadoop.executionengine.spark.converter.OutputConsumerIterator.hasNext(OutputConsumerIterator.java:96) > at > scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2078: > Caught error from UDF: org.apache.pig.impl.builtin.GFCross [Unable to get > parallelism hint from job conf] > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:358) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextDataBag(POUserFunc.java:374) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:335) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:404) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:321) > at > org.apache.pig.backend.hadoop.executionengine.spark.converter.ForEachConverter$ForEachFunction$1$1.getNextResult(ForEachConverter.java:87) > at > org.apache.pig.backend.hadoop.executionengine.spark.converter.OutputConsumerIterator.readNext(OutputConsumerIterator.java:69) > ... 11 more > Caused by: java.io.IOException: Unable to get parallelism hint from job conf > at org.apache.pig.impl.builtin.GFCross.exec(GFCross.java:66) > at org.apache.pig.impl.builtin.GFCross.exec(GFCross.java:37) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:330) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PIG-5104) Union_15 e2e test failing on Spark
[ https://issues.apache.org/jira/browse/PIG-5104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nandor Kollar updated PIG-5104: --- Attachment: TestUnion_15.java > Union_15 e2e test failing on Spark > -- > > Key: PIG-5104 > URL: https://issues.apache.org/jira/browse/PIG-5104 > Project: Pig > Issue Type: Bug > Components: spark >Reporter: Nandor Kollar >Assignee: Nandor Kollar > Fix For: spark-branch > > Attachments: PIG-5104.patch, TestUnion_15.java > > > While working on PIG-4891 I noticed that Union_15 e2e test is failing on > Spark mode with this exception: > Caused by: java.lang.RuntimeException: > org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught > error from UDF: org.apache.pig.impl.builtin.GFCross [Unable to get > parallelism hint from job conf] > at > org.apache.pig.backend.hadoop.executionengine.spark.converter.OutputConsumerIterator.readNext(OutputConsumerIterator.java:89) > at > org.apache.pig.backend.hadoop.executionengine.spark.converter.OutputConsumerIterator.hasNext(OutputConsumerIterator.java:96) > at > scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2078: > Caught error from UDF: org.apache.pig.impl.builtin.GFCross [Unable to get > parallelism hint from job conf] > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:358) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextDataBag(POUserFunc.java:374) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:335) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:404) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:321) > at > org.apache.pig.backend.hadoop.executionengine.spark.converter.ForEachConverter$ForEachFunction$1$1.getNextResult(ForEachConverter.java:87) > at > org.apache.pig.backend.hadoop.executionengine.spark.converter.OutputConsumerIterator.readNext(OutputConsumerIterator.java:69) > ... 11 more > Caused by: java.io.IOException: Unable to get parallelism hint from job conf > at org.apache.pig.impl.builtin.GFCross.exec(GFCross.java:66) > at org.apache.pig.impl.builtin.GFCross.exec(GFCross.java:37) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:330) -- This message was sent by Atlassian JIRA (v6.3.4#6332)