CROSS/Self-Join Bug - Please Help :(

2013-12-04 Thread Russell Jurney
I have this bug that is killing me, where I can't self-join/cross a dataset
with itself. It's blocking my work :(

The script is like this:

businesses = LOAD
'yelp_phoenix_academic_dataset/yelp_academic_dataset_business.json' using
com.twitter.elephantbird.pig.load.JsonLoader() as json:map[];

/* {open=true, neighborhoods={}, review_count=14, stars=4.0, name=Drybar,
business_id=LcAamvosJu0bcPgEVF-9sQ, state=AZ, full_address=3172 E Camelback
Rd
Phoenix, AZ85018, categories={(Hair Salons),(Hair Stylists),(Beauty 
Spas)}, longitude=-112.0131927, latitude=33.5107772, type=business,
city=Phoenix} */
locations = FOREACH businesses GENERATE $0#'business_id' AS business_id,
  $0#'longitude' AS longitude,
  $0#'latitude' AS latitude;
STORE locations INTO 'yelp_phoenix_academic_dataset/locations.tsv';
locations_2 = LOAD 'yelp_phoenix_academic_dataset/locations.tsv' AS
(business_id:chararray, longitude:double, latitude:double);
location_comparisons = CROSS locations_2, locations;

distances = FOREACH businesses GENERATE
  locations.business_id AS business_id_1,
  locations_2.business_id AS business_id_2,
  udfs.haversine(locations.longitude,
                 locations.latitude,
                 locations_2.longitude,
                 locations_2.latitude) AS distance;
STORE distances INTO 'yelp_phoenix_academic_dataset/distances.tsv';


I have also tried converting this to a self-join using JOIN BY '1', and
also locations_2 = locations, and I get the same error:

org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (rncjoVoEFUJGCUoC1JgnUA,-112.241596,33.581867), 2nd :(0FNFSzCFP_rGUoJx8W7tJg,-112.105933,33.604054)
    at org.apache.pig.impl.builtin.ReadScalars.exec(ReadScalars.java:111)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:336)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:438)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:347)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:378)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:298)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)

This makes no sense! What am I to do? I can't self-join :(
-- 
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com


Re: CROSS/Self-Join Bug - Please Help :(

2013-12-04 Thread Russell Jurney
There was a bug in the script on the second-to-last line. I fixed it, but I
still have the same issue.

I found a workaround: if I store the CROSSED relation immediately after the
CROSS, then load it... it works. Something about resetting the plan. This
is a bug. I'll file a JIRA.
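
The store-then-load workaround described above might look roughly like the
following. This is only a sketch that continues the original script: it
assumes locations, locations_2 and udfs.haversine are already defined there,
and the intermediate path location_comparisons.tsv and the suffixed field
names are made up for illustration.

```pig
-- Sketch of the workaround: materialize the CROSS, then reload it.
-- Assumes locations and locations_2 exist as in the original script.
location_comparisons = CROSS locations_2, locations;

-- Hypothetical intermediate path, chosen only for this example.
STORE location_comparisons INTO 'yelp_phoenix_academic_dataset/location_comparisons.tsv';

-- CROSS emits the fields of locations_2 first, then those of locations,
-- so the reloaded schema lists the "_2" columns before the "_1" columns.
crossed = LOAD 'yelp_phoenix_academic_dataset/location_comparisons.tsv' AS
    (business_id_2:chararray, longitude_2:double, latitude_2:double,
     business_id_1:chararray, longitude_1:double, latitude_1:double);

-- Every field now belongs to the relation being iterated over.
distances = FOREACH crossed GENERATE
    business_id_1,
    business_id_2,
    udfs.haversine(longitude_1, latitude_1, longitude_2, latitude_2) AS distance;
STORE distances INTO 'yelp_phoenix_academic_dataset/distances.tsv';
```

The cost is an extra write and read of the full cross product, which grows
with the square of the number of businesses.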




Re: CROSS/Self-Join Bug - Please Help :(

2013-12-04 Thread Pradeep Gollakota
I tried the following script (not exactly the same) and it worked correctly
for me.

businesses = LOAD 'dataset' using PigStorage(',') AS (a, b, c,
business_id: chararray, lat: double, lng: double);
locations = FOREACH businesses GENERATE business_id, lat, lng;
STORE locations INTO 'locations.tsv';
locations2 = LOAD 'locations.tsv' AS (business_id, lat, lng);
loc_com = CROSS locations2, locations;
dump loc_com;

I'm wondering whether your problem has something to do with the way that the
JsonLoader works. Another thing you can try is to load 'locations.tsv'
twice and do a self-cross on that.
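
A minimal sketch of the load-twice suggestion, assuming 'locations.tsv' was
already written by an earlier step with three tab-separated columns. The
aliases locations_a and locations_b are made up for this example.

```pig
-- Load the same stored file under two different aliases.
locations_a = LOAD 'locations.tsv' AS (business_id:chararray, lat:double, lng:double);
locations_b = LOAD 'locations.tsv' AS (business_id:chararray, lat:double, lng:double);

-- Self-cross: every row of locations_a paired with every row of locations_b.
loc_com = CROSS locations_a, locations_b;

-- Print the schema to see how the duplicated field names are qualified.
DESCRIBE loc_com;
```

Because both inputs come from PigStorage rather than the JSON loader, this
variant would help tell a loader-specific problem apart from a problem with
the CROSS itself.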





Re: CROSS/Self-Join Bug - Please Help :(

2013-12-04 Thread Russell Jurney
If you store immediately after the CROSS, it works. If you do another
FOREACH/GENERATE, etc. it does not.
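
One possible reading of the error, which nobody in this thread has confirmed:
inside a FOREACH over one relation, an expression of the form
other_relation.field is treated by Pig as a scalar, and a scalar must come
from a relation with exactly one row. The sketch below instead iterates over
the crossed relation and picks fields with the :: disambiguation operator.
It assumes locations, locations_2 and udfs.haversine are defined as in the
original script.

```pig
-- Sketch: run the FOREACH over the output of the CROSS, not over businesses.
location_comparisons = CROSS locations_2, locations;

-- After a CROSS, same-named fields are qualified as alias::field.
distances = FOREACH location_comparisons GENERATE
    locations::business_id   AS business_id_1,
    locations_2::business_id AS business_id_2,
    udfs.haversine(locations::longitude,
                   locations::latitude,
                   locations_2::longitude,
                   locations_2::latitude) AS distance;
STORE distances INTO 'yelp_phoenix_academic_dataset/distances.tsv';
```

If this reading is right, it would also fit the observation that storing and
reloading the crossed relation works, since the reloaded relation is then the
one being iterated over.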

