Trouble with REGEX in PIG
Hi,

I'm trying to use regular expressions in PIG, but it's failing. Based on the documentation http://pig.apache.org/docs/r0.12.0/func.html#regex-extract I am trying this:

    [watrous@c0003913 ~]$ pig -x local
    which: no hadoop in (/opt/krb5/sbin/64:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/X11R6/bin:/sbin:/usr/sbin:/usr/bin:/opt/pb/bin:/opt/perf/bin:/bin:/usr/local/bin:/home/watrous/bin:/home/watrous/pig-0.12.0/bin)
    2013-12-04 17:15:15,398 [main] INFO org.apache.pig.Main - Apache Pig version 0.12.0 (r1529718) compiled Oct 07 2013, 12:20:14
    2013-12-04 17:15:15,398 [main] INFO org.apache.pig.Main - Logging error messages to: /home/watrous/pig_1386177315394.log
    2013-12-04 17:15:15,425 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/watrous/.pigbootup not found
    2013-12-04 17:15:15,599 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
    grunt> REGEX_EXTRACT('192.168.1.5:8020', '(.*):(.*)', 1);
    2013-12-04 17:16:59,753 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: line 1 Cannot expand macro 'REGEX_EXTRACT'. Reason: Macro must be defined before expansion.
    Details at logfile: /home/watrous/pig_1386177315394.log

Here's the relevant bit from the log file:

    Pig Stack Trace
    ---
    ERROR 1200: line 1 Cannot expand macro 'REGEX_EXTRACT'. Reason: Macro must be defined before expansion.
    Failed to parse: line 1 Cannot expand macro 'REGEX_EXTRACT'. Reason: Macro must be defined before expansion.
        at org.apache.pig.parser.PigMacro.macroInline(PigMacro.java:455)
        at org.apache.pig.parser.QueryParserDriver.inlineMacro(QueryParserDriver.java:298)
        at org.apache.pig.parser.QueryParserDriver.expandMacro(QueryParserDriver.java:287)
        at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:180)
        at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1648)
        at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1621)
        at org.apache.pig.PigServer.registerQuery(PigServer.java:575)
        at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1093)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:501)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
        at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
        at org.apache.pig.Main.run(Main.java:541)
        at org.apache.pig.Main.main(Main.java:156)

I attempted to define the macro (following this tutorial http://aws.amazon.com/articles/2729). However, piggybank.jar doesn't define org.apache.pig.piggybank.evaluation.string.EXTRACT, so I located the most likely file in the current version of the jar.

    grunt> register /home/watrous/pig-0.12.0/contrib/piggybank/java/piggybank.jar
    grunt> DEFINE REGEX_EXTRACT org.apache.pig.piggybank.evaluation.string.RegexExtract;
    grunt> REGEX_EXTRACT('192.168.1.5:8020', '(.*):(.*)', 1);
    2013-12-04 17:23:20,383 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: line 3 Cannot expand macro 'REGEX_EXTRACT'. Reason: Macro must be defined before expansion.
    Details at logfile: /home/watrous/pig_1386177315394.log

I get the same stack trace with the only change being a reference to line 3 instead of line 1. Any idea how I can get this working?

Daniel
Re: Trouble with REGEX in PIG
Are you planning to use org.apache.pig.builtin.REGEX_EXTRACT?
Re: Trouble with REGEX in PIG
It's not valid PigLatin... The Grunt shell doesn't let you try out functions and UDFs as you're trying to use them.

    A = LOAD 'data' USING PigStorage() as (ip: chararray);
    B = FOREACH A GENERATE REGEX_EXTRACT(ip, '(.*):(.*)', 1);
    DUMP B;

You always have to load a dataset and work with said dataset(s). You can create a file called 'data' (per the above script) and put 192.168.1.5:8020 in the file and try the above set of commands in the grunt shell.
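For readers who want to see what the FOREACH ... GENERATE REGEX_EXTRACT(ip, '(.*):(.*)', 1) step is expected to produce for each row, here is a minimal sketch in Python. It is only a stand-in: Pig's function is built on Java regular expressions, and the helper name regex_extract and the in-memory rows list are assumptions made for illustration, not part of Pig.

```python
import re


def regex_extract(text, pattern, index):
    """Return capture group `index` (1-based) of the first match, or None.

    Hypothetical helper that mimics the documented behaviour of Pig's
    REGEX_EXTRACT for simple patterns; it is not Pig's implementation.
    """
    match = re.search(pattern, text)
    if match is None or index > match.re.groups:
        return None
    return match.group(index)


# Stand-in for the rows of relation A loaded from the 'data' file.
rows = ["192.168.1.5:8020"]

# In spirit: B = FOREACH A GENERATE REGEX_EXTRACT(ip, '(.*):(.*)', 1);
extracted = [(regex_extract(ip, r"(.*):(.*)", 1),) for ip in rows]
print(extracted)
```

The point of the sketch is the shape of the computation: the function is applied once per row of a loaded relation, which is why the bare call at the grunt prompt is rejected.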
RE: Trouble with REGEX in PIG
That's what I was trying first, but then I tried defining it too.

-Original Message-
From: Ankit Bhatnagar [mailto:ank...@yahoo-inc.com]
Sent: Wednesday, December 04, 2013 11:15 AM
To: user@pig.apache.org; Watrous, Daniel
Subject: Re: Trouble with REGEX in PIG

R u planning to use org.apache.pig.builtin.REGEX_EXTRACT ?
RE: Trouble with REGEX in PIG
Pradeep,

Does the documentation here need to be updated: http://pig.apache.org/docs/r0.12.0/func.html#regex-extract

It suggests that the function can run against a string and should return the expected value. I did confirm that I can use REGEX_EXTRACT on values loaded from a file.

Thank you,
Daniel
CROSS/Self-Join Bug - Please Help :(
I have this bug that is killing me, where I can't self-join/cross a dataset with itself. It's blocking my work :( The script is like this:

    businesses = LOAD 'yelp_phoenix_academic_dataset/yelp_academic_dataset_business.json' using com.twitter.elephantbird.pig.load.JsonLoader() as json:map[];

    /* {open=true, neighborhoods={}, review_count=14, stars=4.0, name=Drybar, business_id=LcAamvosJu0bcPgEVF-9sQ, state=AZ, full_address=3172 E Camelback Rd Phoenix, AZ85018, categories={(Hair Salons),(Hair Stylists),(Beauty Spas)}, longitude=-112.0131927, latitude=33.5107772, type=business, city=Phoenix} */

    locations = FOREACH businesses GENERATE $0#'business_id' AS business_id, $0#'longitude' AS longitude, $0#'latitude' AS latitude;
    STORE locations INTO 'yelp_phoenix_academic_dataset/locations.tsv';
    locations_2 = LOAD 'yelp_phoenix_academic_dataset/locations.tsv' AS (business_id:chararray, longitude:double, latitude:double);
    location_comparisons = CROSS locations_2, locations;
    distances = FOREACH businesses GENERATE locations.business_id AS business_id_1,
                                            locations_2.business_id AS business_id_2,
                                            udfs.haversine(locations.longitude, locations.latitude, locations_2.longitude, locations_2.latitude) AS distance;
    STORE distances INTO 'yelp_phoenix_academic_dataset/distances.tsv';

I have also tried converting this to a self-join using JOIN BY '1', and also locations_2 = locations, and I get the same error:

    *org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (rncjoVoEFUJGCUoC1JgnUA,-112.241596,33.581867), 2nd :(0FNFSzCFP_rGUoJx8W7tJg,-112.105933,33.604054)*
        at org.apache.pig.impl.builtin.ReadScalars.exec(ReadScalars.java:111)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:336)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:438)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:347)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:378)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:298)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)

This makes no sense! What am I to do? I can't self-join :(

--
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
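The script above calls udfs.haversine, whose source is not shown in the thread. As a rough idea of what such a UDF would compute, here is a sketch of the standard haversine great-circle distance in Python. The function body, the kilometre unit, and the Earth radius of 6371 km are assumptions; only the argument order (longitude, latitude, longitude, latitude) is taken from the call in the script.

```python
import math


def haversine(lng1, lat1, lng2, lat2, radius_km=6371.0):
    """Great-circle distance in km between two (longitude, latitude) points.

    Hypothetical stand-in for the udfs.haversine UDF, which is not shown
    in the thread; the unit and radius are assumptions.
    """
    lng1, lat1, lng2, lat2 = map(math.radians, (lng1, lat1, lng2, lat2))
    dlat = lat2 - lat1
    dlng = lng2 - lng1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlng / 2) ** 2)
    # Clamp to guard against tiny floating-point overshoot above 1.0.
    return 2 * radius_km * math.asin(math.sqrt(min(1.0, a)))


# The two rows quoted in the error message, as (business_id, longitude, latitude).
first = ("rncjoVoEFUJGCUoC1JgnUA", -112.241596, 33.581867)
second = ("0FNFSzCFP_rGUoJx8W7tJg", -112.105933, 33.604054)

distance_km = haversine(first[1], first[2], second[1], second[2])
print(first[0], second[0], round(distance_km, 2))
```

Separately, and only as a possible reading of the error rather than something established in the thread: in Pig Latin, a relation.field reference inside a FOREACH over a different relation is treated as a scalar projection, which requires that relation to hold exactly one row. Fields of a CROSS or JOIN result are normally addressed with the :: prefix (for example locations::business_id).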
Re: CROSS/Self-Join Bug - Please Help :(
There was a bug in the script on the 2nd to last line. I fixed it, but I still have the same issue.

I found a workaround: if I store the CROSSED relation immediately after the CROSS, then load it... it works. Something about resetting the plan.

This is a bug. I'll file a JIRA.

--
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
Re: CROSS/Self-Join Bug - Please Help :(
I tried the following script (not exactly the same) and it worked correctly for me.

    businesses = LOAD 'dataset' using PigStorage(',') AS (a, b, c, business_id: chararray, lat: double, lng: double);
    locations = FOREACH businesses GENERATE business_id, lat, lng;
    STORE locations INTO 'locations.tsv';
    locations2 = LOAD 'locations.tsv' AS (business_id, lat, long);
    loc_com = CROSS locations2, locations;
    dump loc_com;

I'm wondering whether your problem has something to do with the way that the JsonLoader works. Another thing you can try is to load 'locations.tsv' twice and do a self-cross on that.
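To make the expected result of the CROSS concrete, here is a small Python sketch of a self-cross: every row of one copy of the relation is paired with every row of the other, so n input rows give n * n output rows. The three sample rows are invented for illustration and do not come from the Yelp dataset.

```python
from itertools import product

# Hypothetical (business_id, lat, lng) rows standing in for 'locations'.
locations = [
    ("biz_a", 33.51, -112.01),
    ("biz_b", 33.58, -112.24),
    ("biz_c", 33.60, -112.11),
]

# Second copy of the same data, like re-loading 'locations.tsv'.
locations2 = list(locations)

# In spirit: loc_com = CROSS locations2, locations;
# Each output row is the fields of the left row followed by the right row.
loc_com = [left + right for left, right in product(locations2, locations)]

for row in loc_com:
    print(row)
```

Each output row carries both sets of fields side by side, which is the shape a later per-pair step (such as a distance calculation) would consume.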
weird classpath problem
Hi everyone,

I am having some weird classpath issues with a UDF that returns a custom tuple. My custom tuple has an arraylist of custom objects. It looks like:

    class MyTuple
        private ArrayList<MyClass> list;

When the UDF is called, everything works fine: the tuples are created and the UDF returns successfully. But then in the next step, when I try to dump the results of the UDF, I see class not found exceptions. That is, MyClass can be found during UDF execution, but it cannot be found while dumping the UDF results. I tried setting PIG_CLASSPATH, but that didn't work. The only working solution so far is to add my jar file to hadoop's classpath. I wonder whether this is a bug (PIG_CLASSPATH should work) or I am not setting something correctly. Any help will be greatly appreciated.

Thanks,
Nezih

Here is my pig script:

    register 'my_jar.jar';
    x = load '/etc/passwd' using PigStorage(':') as (username:chararray, f1: chararray, f2: chararray, f3:chararray, f4:chararray);
    parsed = foreach x generate username, MyParser(*);
    dump parsed;

Here is the stack trace:

    Caused by: java.io.IOException: Could not find class MyClass, while attempting to de-serialize it
        at org.apache.pig.data.BinInterSedes.readWritable(BinInterSedes.java:293)
        at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:422)
        at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:318)
        at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
        at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:349)
        at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:318)
        at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
        at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:349)
        at org.apache.pig.impl.io.InterRecordReader.nextKeyValue(InterRecordReader.java:113)
        at org.apache.pig.impl.io.InterStorage.getNext(InterStorage.java:77)
        at org.apache.pig.impl.io.ReadToEndLoader.getNextHelper(ReadToEndLoader.java:246)
        at org.apache.pig.impl.io.ReadToEndLoader.getNext(ReadToEndLoader.java:226)
        at org.apache.pig.backend.hadoop.executionengine.HJob$1.hasNext(HJob.java:111)
        ... 12 more
Re: CROSS/Self-Join Bug - Please Help :(
If you store immediately after the CROSS, it works. If you do another FOREACH/GENERATE, etc. it does not.

--
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com