Trouble with REGEX in PIG

2013-12-04 Thread Watrous, Daniel
Hi,

I'm trying to use regular expressions in PIG, but it's failing. Based on the 
documentation http://pig.apache.org/docs/r0.12.0/func.html#regex-extract I am 
trying this:

[watrous@c0003913 ~]$ pig -x local
which: no hadoop in 
(/opt/krb5/sbin/64:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/X11R6/bin:/sbin:/usr/sbin:/usr/bin:/opt/pb/bin:/opt/perf/bin:/bin:/usr/local/bin:/home/watrous/bin:/home/watrous/pig-0.12.0/bin)
2013-12-04 17:15:15,398 [main] INFO  org.apache.pig.Main - Apache Pig version 
0.12.0 (r1529718) compiled Oct 07 2013, 12:20:14
2013-12-04 17:15:15,398 [main] INFO  org.apache.pig.Main - Logging error 
messages to: /home/watrous/pig_1386177315394.log
2013-12-04 17:15:15,425 [main] INFO  org.apache.pig.impl.util.Utils - Default 
bootup file /home/watrous/.pigbootup not found
2013-12-04 17:15:15,599 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to 
hadoop file system at: file:///
grunt> REGEX_EXTRACT('192.168.1.5:8020', '(.*):(.*)', 1);
2013-12-04 17:16:59,753 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
1200: line 1 Cannot expand macro 'REGEX_EXTRACT'. Reason: Macro must be 
defined before expansion.
Details at logfile: /home/watrous/pig_1386177315394.log

Here's the relevant bit from the log file:
Pig Stack Trace
---
ERROR 1200: line 1 Cannot expand macro 'REGEX_EXTRACT'. Reason: Macro must be 
defined before expansion.

Failed to parse: line 1 Cannot expand macro 'REGEX_EXTRACT'. Reason: Macro 
must be defined before expansion.
at org.apache.pig.parser.PigMacro.macroInline(PigMacro.java:455)
at 
org.apache.pig.parser.QueryParserDriver.inlineMacro(QueryParserDriver.java:298)
at 
org.apache.pig.parser.QueryParserDriver.expandMacro(QueryParserDriver.java:287)
at 
org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:180)
at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1648)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1621)
at org.apache.pig.PigServer.registerQuery(PigServer.java:575)
at 
org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1093)
at 
org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:501)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:541)
at org.apache.pig.Main.main(Main.java:156)

I attempted to define the macro (following this tutorial 
http://aws.amazon.com/articles/2729). However, piggybank.jar doesn't define 
org.apache.pig.piggybank.evaluation.string.EXTRACT, so I located the most 
likely file in the current version of the jar.

grunt> register /home/watrous/pig-0.12.0/contrib/piggybank/java/piggybank.jar
grunt> DEFINE REGEX_EXTRACT org.apache.pig.piggybank.evaluation.string.RegexExtract;
grunt> REGEX_EXTRACT('192.168.1.5:8020', '(.*):(.*)', 1);
2013-12-04 17:23:20,383 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
1200: line 3 Cannot expand macro 'REGEX_EXTRACT'. Reason: Macro must be 
defined before expansion.
Details at logfile: /home/watrous/pig_1386177315394.log

I get the same stack trace with the only change being a reference to line 3 
instead of line 1.

Any idea how I can get this working?

Daniel


Re: Trouble with REGEX in PIG

2013-12-04 Thread Ankit Bhatnagar
Are you planning to use org.apache.pig.builtin.REGEX_EXTRACT?



Re: Trouble with REGEX in PIG

2013-12-04 Thread Pradeep Gollakota
It's not valid Pig Latin...

The Grunt shell doesn't let you try out functions and UDFs the way you're
trying to use them.

A = LOAD 'data' USING PigStorage() as (ip: chararray);
B = FOREACH A GENERATE REGEX_EXTRACT(ip, '(.*):(.*)', 1);
DUMP B;

You always have to load a dataset and work with said dataset(s).
You can create a file called 'data' (per the above script), put
192.168.1.5:8020 in the file, and try the above set of commands in the
grunt shell.
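
For example, a minimal end-to-end session in local mode might look roughly like
the sketch below (assuming a file named 'data' in the working directory
containing the single line 192.168.1.5:8020; alias and field names are just
illustrative):

A = LOAD 'data' USING PigStorage() AS (ip: chararray);
B = FOREACH A GENERATE REGEX_EXTRACT(ip, '(.*):(.*)', 1) AS host;
DUMP B;
-- group 1 of the pattern is everything before the ':', so the dump should
-- print something like (192.168.1.5)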


RE: Trouble with REGEX in PIG

2013-12-04 Thread Watrous, Daniel
That's what I was trying first, but then I tried defining it too.

-Original Message-
From: Ankit Bhatnagar [mailto:ank...@yahoo-inc.com] 
Sent: Wednesday, December 04, 2013 11:15 AM
To: user@pig.apache.org; Watrous, Daniel
Subject: Re: Trouble with REGEX in PIG

Are you planning to use org.apache.pig.builtin.REGEX_EXTRACT?


RE: Trouble with REGEX in PIG

2013-12-04 Thread Watrous, Daniel
Pradeep,

Does the documentation here need to be updated?
http://pig.apache.org/docs/r0.12.0/func.html#regex-extract

It suggests that the function can run against a string and should return the 
expected value.

I did confirm that I can use REGEX_EXTRACT on values loaded from a file. 

Thank you,
Daniel


CROSS/Self-Join Bug - Please Help :(

2013-12-04 Thread Russell Jurney
I have this bug that is killing me, where I can't self-join/cross a dataset
with itself. It's blocking my work :(

The script is like this:

businesses = LOAD
'yelp_phoenix_academic_dataset/yelp_academic_dataset_business.json' using
com.twitter.elephantbird.pig.load.JsonLoader() as json:map[];

/* {open=true, neighborhoods={}, review_count=14, stars=4.0, name=Drybar,
business_id=LcAamvosJu0bcPgEVF-9sQ, state=AZ, full_address=3172 E Camelback
Rd
Phoenix, AZ85018, categories={(Hair Salons),(Hair Stylists),(Beauty 
Spas)}, longitude=-112.0131927, latitude=33.5107772, type=business,
city=Phoenix} */
locations = FOREACH businesses GENERATE $0#'business_id' AS business_id,
  $0#'longitude' AS longitude,
  $0#'latitude' AS latitude;
STORE locations INTO 'yelp_phoenix_academic_dataset/locations.tsv';
locations_2 = LOAD 'yelp_phoenix_academic_dataset/locations.tsv' AS
(business_id:chararray, longitude:double, latitude:double);
location_comparisons = CROSS locations_2, locations;

distances = FOREACH businesses GENERATE locations.business_id AS business_id_1,
                                        locations_2.business_id AS business_id_2,
                                        udfs.haversine(locations.longitude,
                                                       locations.latitude,
                                                       locations_2.longitude,
                                                       locations_2.latitude) AS distance;
STORE distances INTO 'yelp_phoenix_academic_dataset/distances.tsv';


I have also tried converting this to a self-join using JOIN BY '1', and
also locations_2 = locations, and I get the same error:

*org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has
more than one row in the output. 1st :
(rncjoVoEFUJGCUoC1JgnUA,-112.241596,33.581867), 2nd
:(0FNFSzCFP_rGUoJx8W7tJg,-112.105933,33.604054)*

at org.apache.pig.impl.builtin.ReadScalars.exec(ReadScalars.java:111)

at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:336)

at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:438)

at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:347)

at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:378)

at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:298)

at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)

at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)

at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)

at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)

at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)

at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)

at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)

This makes no sense! What am I to do? I can't self-join :(
-- 
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com


Re: CROSS/Self-Join Bug - Please Help :(

2013-12-04 Thread Russell Jurney
There was a bug in the script on the second-to-last line. I fixed it, but I
still have the same issue.

I found a workaround: if I store the CROSSED relation immediately after the
CROSS, then load it... it works. Something about resetting the plan. This
is a bug. I'll file a JIRA.
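
In rough Pig Latin, the workaround looks something like the sketch below (the
intermediate path and the reloaded field names are illustrative; udfs.haversine
is the UDF from the original script):

location_comparisons = CROSS locations_2, locations;
-- materialize the crossed relation before doing anything else with it
STORE location_comparisons INTO 'yelp_phoenix_academic_dataset/location_comparisons.tsv';
pairs = LOAD 'yelp_phoenix_academic_dataset/location_comparisons.tsv' AS
    (business_id_1:chararray, longitude_1:double, latitude_1:double,
     business_id_2:chararray, longitude_2:double, latitude_2:double);
distances = FOREACH pairs GENERATE business_id_1, business_id_2,
    udfs.haversine(longitude_1, latitude_1, longitude_2, latitude_2) AS distance;
STORE distances INTO 'yelp_phoenix_academic_dataset/distances.tsv';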


-- 
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com


Re: CROSS/Self-Join Bug - Please Help :(

2013-12-04 Thread Pradeep Gollakota
I tried the following script (not exactly the same) and it worked correctly
for me.

businesses = LOAD 'dataset' using PigStorage(',') AS (a, b, c,
business_id: chararray, lat: double, lng: double);
locations = FOREACH businesses GENERATE business_id, lat, lng;
STORE locations INTO 'locations.tsv';
locations2 = LOAD 'locations.tsv' AS (business_id, lat, long);
loc_com = CROSS locations2, locations;
dump loc_com;

I'm wondering whether your problem has something to do with the way that the
JsonStorage works. Another thing you can try is to load 'locations.tsv'
twice and do a self-cross on that.
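
Something like the following sketch (aliases are illustrative):

locations_a = LOAD 'locations.tsv' AS (business_id: chararray, lat: double, lng: double);
locations_b = LOAD 'locations.tsv' AS (business_id: chararray, lat: double, lng: double);
loc_com2 = CROSS locations_a, locations_b;
dump loc_com2;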



weird classpath problem

2013-12-04 Thread Yigitbasi, Nezih
Hi everyone,
I am having some weird classpath issues with a UDF that returns a custom tuple. 
My custom tuple has an arraylist of custom objects. It looks like:
class MyTuple {
    private ArrayList<MyClass> list;
}

When the UDF is called, everything works fine: the tuples are created and the 
UDF returns successfully. But then in the next step when I try to dump the 
results of the UDF I see class not found exceptions. That is, MyClass can be 
found during UDF execution, but it cannot be found while dumping the UDF 
results. I tried setting PIG_CLASSPATH, but didn't work. The only working 
solution so far is to add my jar file to hadoop's classpath. I wonder whether 
this is a bug (PIG_CLASSPATH should work) or I am not setting something 
correctly. Any help will be greatly appreciated.

Thanks,
Nezih

Here is my pig script:
register 'my_jar.jar';
x = load '/etc/passwd' using PigStorage(':') as (username:chararray, f1:chararray, f2:chararray, f3:chararray, f4:chararray);
parsed = foreach x generate username, MyParser(*);
dump parsed;

Here is the stack trace:
Caused by: java.io.IOException: Could not find class MyClass, while attempting 
to de-serialize it
at org.apache.pig.data.BinInterSedes.readWritable(BinInterSedes.java:293)
at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:422)
at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:318)
at 
org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:349)
at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:318)
at 
org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:349)
at 
org.apache.pig.impl.io.InterRecordReader.nextKeyValue(InterRecordReader.java:113)
at org.apache.pig.impl.io.InterStorage.getNext(InterStorage.java:77)
at 
org.apache.pig.impl.io.ReadToEndLoader.getNextHelper(ReadToEndLoader.java:246)
at org.apache.pig.impl.io.ReadToEndLoader.getNext(ReadToEndLoader.java:226)
at 
org.apache.pig.backend.hadoop.executionengine.HJob$1.hasNext(HJob.java:111)
... 12 more



Re: CROSS/Self-Join Bug - Please Help :(

2013-12-04 Thread Russell Jurney
If you store immediately after the CROSS, it works. If you do another
FOREACH/GENERATE, etc., it does not.



-- 
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com