CROSS/Self-Join Bug - Please Help :(
I have this bug that is killing me, where I can't self-join/cross a dataset with itself. It's blocking my work :( The script is like this:

    businesses = LOAD 'yelp_phoenix_academic_dataset/yelp_academic_dataset_business.json'
        using com.twitter.elephantbird.pig.load.JsonLoader() as json:map[];

    /* {open=true, neighborhoods={}, review_count=14, stars=4.0, name=Drybar,
        business_id=LcAamvosJu0bcPgEVF-9sQ, state=AZ,
        full_address=3172 E Camelback Rd Phoenix, AZ85018,
        categories={(Hair Salons),(Hair Stylists),(Beauty Spas)},
        longitude=-112.0131927, latitude=33.5107772, type=business, city=Phoenix} */

    locations = FOREACH businesses GENERATE
        $0#'business_id' AS business_id,
        $0#'longitude' AS longitude,
        $0#'latitude' AS latitude;
    STORE locations INTO 'yelp_phoenix_academic_dataset/locations.tsv';

    locations_2 = LOAD 'yelp_phoenix_academic_dataset/locations.tsv'
        AS (business_id:chararray, longitude:double, latitude:double);
    location_comparisons = CROSS locations_2, locations;

    distances = FOREACH businesses GENERATE
        locations.business_id AS business_id_1,
        locations_2.business_id AS business_id_2,
        udfs.haversine(locations.longitude, locations.latitude,
                       locations_2.longitude, locations_2.latitude) AS distance;
    STORE distances INTO 'yelp_phoenix_academic_dataset/distances.tsv';

I have also tried converting this to a self-join using JOIN BY '1', and also locations_2 = locations, and I get the same error:

    org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (rncjoVoEFUJGCUoC1JgnUA,-112.241596,33.581867), 2nd :(0FNFSzCFP_rGUoJx8W7tJg,-112.105933,33.604054)
        at org.apache.pig.impl.builtin.ReadScalars.exec(ReadScalars.java:111)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:336)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:438)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:347)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:378)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:298)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)

This makes no sense! What am I to do? I can't self-join :(

--
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
Re: CROSS/Self-Join Bug - Please Help :(
There was a bug in the script on the 2nd to last line. Fixed it, still have same issue.

I found a workaround: if I store the CROSSED relation immediately after the CROSS, then load it... it works. Something about resetting the plan. This is a bug. I'll file a JIRA.

On Wed, Dec 4, 2013 at 1:21 PM, Russell Jurney russell.jur...@gmail.com wrote:
> I have this bug that is killing me, where I can't self-join/cross a dataset with itself. It's blocking my work :(

--
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
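
The workaround described in this message might look roughly like the sketch below. It continues from the original script, so `locations`, `locations_2` and `udfs.haversine` are assumed to be defined as posted. The intermediate path `location_comparisons.tsv`, the alias `crossed`, and the field names in the reloaded schema are assumptions made for illustration; they do not appear in the thread.

```pig
-- Sketch only: continues from the original script (locations, locations_2
-- and udfs.haversine are assumed to exist as posted).
location_comparisons = CROSS locations_2, locations;

-- Materialize the crossed relation (hypothetical output path).
STORE location_comparisons INTO 'yelp_phoenix_academic_dataset/location_comparisons.tsv';

-- Reload it with a flat schema. CROSS locations_2, locations emits the
-- fields of locations_2 first, then the fields of locations.
crossed = LOAD 'yelp_phoenix_academic_dataset/location_comparisons.tsv' AS (
    business_id_2:chararray, longitude_2:double, latitude_2:double,
    business_id_1:chararray, longitude_1:double, latitude_1:double);

-- Every field now has a unique plain name, so no relation prefix is needed.
distances = FOREACH crossed GENERATE
    business_id_1,
    business_id_2,
    udfs.haversine(longitude_1, latitude_1, longitude_2, latitude_2) AS distance;
STORE distances INTO 'yelp_phoenix_academic_dataset/distances.tsv';
```

Reloading gives every column a unique name of its own, which may be why the later FOREACH no longer trips over the relation-qualified references.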
Re: CROSS/Self-Join Bug - Please Help :(
I tried the following script (not exactly the same) and it worked correctly for me.

    businesses = LOAD 'dataset' using PigStorage(',')
        AS (a, b, c, business_id: chararray, lat: double, lng: double);
    locations = FOREACH businesses GENERATE business_id, lat, lng;
    STORE locations INTO 'locations.tsv';
    locations2 = LOAD 'locations.tsv' AS (business_id, lat, lng);
    loc_com = CROSS locations2, locations;
    dump loc_com;

I'm wondering if your problem has something to do with the way that the JsonLoader works. Another thing you can try is to load 'locations.tsv' twice and do a self-cross on that.

On Wed, Dec 4, 2013 at 1:21 PM, Russell Jurney russell.jur...@gmail.com wrote:
> I have this bug that is killing me, where I can't self-join/cross a dataset with itself. It's blocking my work :(
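
The "load it twice and self-cross" suggestion could be sketched as follows. The aliases `locations_a` and `locations_b` are hypothetical names chosen for this illustration; the path and schema are taken from the original script, and the file is assumed to have been written already.

```pig
-- Sketch only: load the stored locations twice under different aliases
-- (locations_a / locations_b are hypothetical names).
locations_a = LOAD 'yelp_phoenix_academic_dataset/locations.tsv'
    AS (business_id:chararray, longitude:double, latitude:double);
locations_b = LOAD 'yelp_phoenix_academic_dataset/locations.tsv'
    AS (business_id:chararray, longitude:double, latitude:double);

location_comparisons = CROSS locations_a, locations_b;

-- Print the schema of the crossed relation. Field names that collide are
-- prefixed with their source alias, e.g. locations_a::business_id.
DESCRIBE location_comparisons;
```

Checking the schema with DESCRIBE before writing the next FOREACH shows exactly which qualified names the crossed relation exposes.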
Re: CROSS/Self-Join Bug - Please Help :(
If you store immediately after the CROSS, it works. If you do another FOREACH/GENERATE, etc. it does not.

On Wed, Dec 4, 2013 at 1:41 PM, Pradeep Gollakota pradeep...@gmail.com wrote:
> I tried the following script (not exactly the same) and it worked correctly for me.

--
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
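
One possible reading of the error, offered as an assumption rather than a confirmed diagnosis: in Pig, an expression of the form `relation.field` used inside a FOREACH over a different relation is treated as a scalar projection, and that requires the referenced relation to hold exactly one row. The `ReadScalars` frame in the stack trace is consistent with this. The sketch below iterates over the crossed relation and uses the `::` disambiguation operator instead; it continues from the original script, so `locations`, `locations_2` and `udfs.haversine` are assumed to be defined as posted.

```pig
-- Sketch only: continues from the original script (locations, locations_2
-- and udfs.haversine are assumed to exist as posted).
location_comparisons = CROSS locations_2, locations;

-- Iterate over the crossed relation and pick fields with '::'.
-- Writing locations.business_id here would instead be parsed as a scalar
-- projection of the whole 'locations' relation.
distances = FOREACH location_comparisons GENERATE
    locations::business_id   AS business_id_1,
    locations_2::business_id AS business_id_2,
    udfs.haversine(locations::longitude, locations::latitude,
                   locations_2::longitude, locations_2::latitude) AS distance;
STORE distances INTO 'yelp_phoenix_academic_dataset/distances.tsv';
```

If this reading is right, it would also fit the observation that a STORE placed immediately after the CROSS succeeds, since no relation-qualified field reference is evaluated at that point.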