Re: CROSS/Self-Join Bug - Please Help :(

2013-12-04 Thread Russell Jurney
If you store immediately after the CROSS, it works. If you do another
FOREACH/GENERATE, etc., it does not.


On Wed, Dec 4, 2013 at 1:41 PM, Pradeep Gollakota wrote:

> I tried the following script (not exactly the same) and it worked correctly
> for me.
>
> businesses = LOAD 'dataset' using PigStorage(',') AS (a, b, c,
> business_id: chararray, lat: double, lng: double);
> locations = FOREACH businesses GENERATE business_id, lat, lng;
> STORE locations INTO 'locations.tsv';
> locations2 = LOAD 'locations.tsv' AS (business_id, lat, long);
> loc_com = CROSS locations2, locations;
> dump loc_com;
>
> I’m wondering if your problem has something to do with the way that the
> JsonLoader works. Another thing you can try is to load ‘locations.tsv’
> twice and do a self-cross on that.
>
>
> On Wed, Dec 4, 2013 at 1:21 PM, Russell Jurney wrote:
>
> > I have this bug that is killing me, where I can't self-join/cross a dataset
> > with itself. It's blocking my work :(
> >
> > The script is like this:
> >
> > businesses = LOAD
> > 'yelp_phoenix_academic_dataset/yelp_academic_dataset_business.json' using
> > com.twitter.elephantbird.pig.load.JsonLoader() as json:map[];
> >
> > /* {open=true, neighborhoods={}, review_count=14, stars=4.0, name=Drybar,
> > business_id=LcAamvosJu0bcPgEVF-9sQ, state=AZ, full_address=3172 E
> Camelback
> > Rd
> > Phoenix, AZ85018, categories={(Hair Salons),(Hair Stylists),(Beauty &
> > Spas)}, longitude=-112.0131927, latitude=33.5107772, type=business,
> > city=Phoenix} */
> > locations = FOREACH businesses GENERATE $0#'business_id' AS business_id,
> >   $0#'longitude' AS longitude,
> >   $0#'latitude' AS latitude;
> > STORE locations INTO 'yelp_phoenix_academic_dataset/locations.tsv';
> > locations_2 = LOAD 'yelp_phoenix_academic_dataset/locations.tsv' AS
> > (business_id:chararray, longitude:double, latitude:double);
> > location_comparisons = CROSS locations_2, locations;
> >
> > distances = FOREACH businesses GENERATE locations.business_id AS business_id_1,
> >                                         locations_2.business_id AS business_id_2,
> >                                         udfs.haversine(locations.longitude,
> >                                                        locations.latitude,
> >                                                        locations_2.longitude,
> >                                                        locations_2.latitude) AS distance;
> > STORE distances INTO 'yelp_phoenix_academic_dataset/distances.tsv';
> >
> >
> > I have also tried converting this to a self-join using JOIN BY '1', and
> > also locations_2 = locations, and I get the same error:
> >
> > *org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar
> has
> > more than one row in the output. 1st :
> > (rncjoVoEFUJGCUoC1JgnUA,-112.241596,33.581867), 2nd
> > :(0FNFSzCFP_rGUoJx8W7tJg,-112.105933,33.604054)*
> >
> > at org.apache.pig.impl.builtin.ReadScalars.exec(ReadScalars.java:111)
> >
> > at
> >
> >
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:336)
> >
> > at
> >
> >
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:438)
> >
> > at
> >
> >
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:347)
> >
> > at
> >
> >
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:378)
> >
> > at
> >
> >
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:298)
> >
> > at
> >
> >
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
> >
> > at
> >
> >
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
> >
> > at
> >
> >
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
> >
> > at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> >
> > at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> >
> > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
> >
> > at
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> >
> > This makes no sense! What am I to do? I can't self-join :(
> > --
> > Russell Jurney twitter.com/rjurney russell.jur...@gmail.com
> > datasyndrome.com
> >
>



-- 
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com


weird classpath problem

2013-12-04 Thread Yigitbasi, Nezih
Hi everyone,
I am having some weird classpath issues with a UDF that returns a custom tuple. 
My custom tuple has an arraylist of custom objects. It looks like:
class MyTuple {
    private ArrayList<MyClass> list;
}

When the UDF is called, everything works fine: the tuples are created and the 
UDF returns successfully. But then in the next step when I try to dump the 
results of the UDF I see class not found exceptions. That is, MyClass can be 
found during UDF execution, but it cannot be found while dumping the UDF 
results. I tried setting PIG_CLASSPATH, but that didn't work. The only working 
solution so far is to add my jar file to Hadoop's classpath. I wonder whether 
this is a bug (PIG_CLASSPATH should work) or whether I am not setting something 
correctly. Any help will be greatly appreciated.

Thanks,
Nezih

Here is my pig script:
register 'my_jar.jar';
x = load '/etc/passwd' using PigStorage(':') as (username:chararray, f1: 
chararray, f2: chararray, f3:chararray, f4:chararray);
parsed = foreach x generate username, MyParser(*);
dump parsed;

Here is the stack trace:
Caused by: java.io.IOException: Could not find class MyClass, while attempting 
to de-serialize it
at org.apache.pig.data.BinInterSedes.readWritable(BinInterSedes.java:293)
at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:422)
at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:318)
at 
org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:349)
at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:318)
at 
org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:349)
at 
org.apache.pig.impl.io.InterRecordReader.nextKeyValue(InterRecordReader.java:113)
at org.apache.pig.impl.io.InterStorage.getNext(InterStorage.java:77)
at 
org.apache.pig.impl.io.ReadToEndLoader.getNextHelper(ReadToEndLoader.java:246)
at org.apache.pig.impl.io.ReadToEndLoader.getNext(ReadToEndLoader.java:226)
at 
org.apache.pig.backend.hadoop.executionengine.HJob$1.hasNext(HJob.java:111)
... 12 more



Re: CROSS/Self-Join Bug - Please Help :(

2013-12-04 Thread Pradeep Gollakota
I tried the following script (not exactly the same) and it worked correctly
for me.

businesses = LOAD 'dataset' using PigStorage(',') AS (a, b, c,
business_id: chararray, lat: double, lng: double);
locations = FOREACH businesses GENERATE business_id, lat, lng;
STORE locations INTO 'locations.tsv';
locations2 = LOAD 'locations.tsv' AS (business_id, lat, long);
loc_com = CROSS locations2, locations;
dump loc_com;

I’m wondering if your problem has something to do with the way that the
JsonLoader works. Another thing you can try is to load ‘locations.tsv’
twice and do a self-cross on that.
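
A minimal sketch of that load-twice approach (not from the original thread; the
file name and field names are assumptions carried over from the script above):

locations_a = LOAD 'locations.tsv' AS (business_id: chararray, lat: double, lng: double);
locations_b = LOAD 'locations.tsv' AS (business_id: chararray, lat: double, lng: double);
-- cross the two independently loaded copies instead of crossing a stored
-- relation with the relation that produced it
loc_com = CROSS locations_a, locations_b;
dump loc_com;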


On Wed, Dec 4, 2013 at 1:21 PM, Russell Jurney wrote:

> I have this bug that is killing me, where I can't self-join/cross a dataset
> with itself. It's blocking my work :(
>
> The script is like this:
>
> businesses = LOAD
> 'yelp_phoenix_academic_dataset/yelp_academic_dataset_business.json' using
> com.twitter.elephantbird.pig.load.JsonLoader() as json:map[];
>
> /* {open=true, neighborhoods={}, review_count=14, stars=4.0, name=Drybar,
> business_id=LcAamvosJu0bcPgEVF-9sQ, state=AZ, full_address=3172 E Camelback
> Rd
> Phoenix, AZ85018, categories={(Hair Salons),(Hair Stylists),(Beauty &
> Spas)}, longitude=-112.0131927, latitude=33.5107772, type=business,
> city=Phoenix} */
> locations = FOREACH businesses GENERATE $0#'business_id' AS business_id,
>   $0#'longitude' AS longitude,
>   $0#'latitude' AS latitude;
> STORE locations INTO 'yelp_phoenix_academic_dataset/locations.tsv';
> locations_2 = LOAD 'yelp_phoenix_academic_dataset/locations.tsv' AS
> (business_id:chararray, longitude:double, latitude:double);
> location_comparisons = CROSS locations_2, locations;
>
> distances = FOREACH businesses GENERATE locations.business_id AS business_id_1,
>                                         locations_2.business_id AS business_id_2,
>                                         udfs.haversine(locations.longitude,
>                                                        locations.latitude,
>                                                        locations_2.longitude,
>                                                        locations_2.latitude) AS distance;
> STORE distances INTO 'yelp_phoenix_academic_dataset/distances.tsv';
>
>
> I have also tried converting this to a self-join using JOIN BY '1', and
> also locations_2 = locations, and I get the same error:
>
> *org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has
> more than one row in the output. 1st :
> (rncjoVoEFUJGCUoC1JgnUA,-112.241596,33.581867), 2nd
> :(0FNFSzCFP_rGUoJx8W7tJg,-112.105933,33.604054)*
>
> at org.apache.pig.impl.builtin.ReadScalars.exec(ReadScalars.java:111)
>
> at
>
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:336)
>
> at
>
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:438)
>
> at
>
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:347)
>
> at
>
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:378)
>
> at
>
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:298)
>
> at
>
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
>
> at
>
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
>
> at
>
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
>
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>
> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>
> This makes no sense! What am I to do? I can't self-join :(
> --
> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com
> datasyndrome.com
>


Re: CROSS/Self-Join Bug - Please Help :(

2013-12-04 Thread Russell Jurney
There was a bug in the script on the second-to-last line. I fixed it, but I
still have the same issue.

I found a workaround: if I store the CROSSed relation immediately after the
CROSS and then load it back, it works. Something about resetting the plan.
This is a bug; I'll file a JIRA.
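
Roughly, the workaround looks like this (a hedged sketch, not the exact script;
the intermediate path, the reloaded schema, and the udfs.haversine signature
are assumptions based on the script quoted below):

location_comparisons = CROSS locations_2, locations;
STORE location_comparisons INTO 'yelp_phoenix_academic_dataset/location_comparisons.tsv';
-- reload the materialized cross product and work on it from there
pairs = LOAD 'yelp_phoenix_academic_dataset/location_comparisons.tsv' AS
    (business_id_1: chararray, longitude_1: double, latitude_1: double,
     business_id_2: chararray, longitude_2: double, latitude_2: double);
distances = FOREACH pairs GENERATE business_id_1, business_id_2,
    udfs.haversine(longitude_1, latitude_1, longitude_2, latitude_2) AS distance;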


On Wed, Dec 4, 2013 at 1:21 PM, Russell Jurney wrote:

> I have this bug that is killing me, where I can't self-join/cross a
> dataset with itself. It's blocking my work :(
>
> The script is like this:
>
> businesses = LOAD
> 'yelp_phoenix_academic_dataset/yelp_academic_dataset_business.json' using
> com.twitter.elephantbird.pig.load.JsonLoader() as json:map[];
>
> /* {open=true, neighborhoods={}, review_count=14, stars=4.0, name=Drybar,
> business_id=LcAamvosJu0bcPgEVF-9sQ, state=AZ, full_address=3172 E Camelback
> Rd
> Phoenix, AZ85018, categories={(Hair Salons),(Hair Stylists),(Beauty &
> Spas)}, longitude=-112.0131927, latitude=33.5107772, type=business,
> city=Phoenix} */
> locations = FOREACH businesses GENERATE $0#'business_id' AS business_id,
>   $0#'longitude' AS longitude,
>   $0#'latitude' AS latitude;
> STORE locations INTO 'yelp_phoenix_academic_dataset/locations.tsv';
> locations_2 = LOAD 'yelp_phoenix_academic_dataset/locations.tsv' AS
> (business_id:chararray, longitude:double, latitude:double);
> location_comparisons = CROSS locations_2, locations;
>
> distances = FOREACH businesses GENERATE locations.business_id AS business_id_1,
>                                         locations_2.business_id AS business_id_2,
>                                         udfs.haversine(locations.longitude,
>                                                        locations.latitude,
>                                                        locations_2.longitude,
>                                                        locations_2.latitude) AS distance;
> STORE distances INTO 'yelp_phoenix_academic_dataset/distances.tsv';
>
>
> I have also tried converting this to a self-join using JOIN BY '1', and
> also locations_2 = locations, and I get the same error:
>
> *org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has
> more than one row in the output. 1st :
> (rncjoVoEFUJGCUoC1JgnUA,-112.241596,33.581867), 2nd
> :(0FNFSzCFP_rGUoJx8W7tJg,-112.105933,33.604054)*
>
> at org.apache.pig.impl.builtin.ReadScalars.exec(ReadScalars.java:111)
>
> at
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:336)
>
> at
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:438)
>
> at
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:347)
>
> at
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:378)
>
> at
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:298)
>
> at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
>
> at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
>
> at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
>
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>
> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>
> This makes no sense! What am I to do? I can't self-join :(
> --
> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
>



-- 
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com


CROSS/Self-Join Bug - Please Help :(

2013-12-04 Thread Russell Jurney
I have this bug that is killing me, where I can't self-join/cross a dataset
with itself. It's blocking my work :(

The script is like this:

businesses = LOAD
'yelp_phoenix_academic_dataset/yelp_academic_dataset_business.json' using
com.twitter.elephantbird.pig.load.JsonLoader() as json:map[];

/* {open=true, neighborhoods={}, review_count=14, stars=4.0, name=Drybar,
business_id=LcAamvosJu0bcPgEVF-9sQ, state=AZ, full_address=3172 E Camelback
Rd
Phoenix, AZ85018, categories={(Hair Salons),(Hair Stylists),(Beauty &
Spas)}, longitude=-112.0131927, latitude=33.5107772, type=business,
city=Phoenix} */
locations = FOREACH businesses GENERATE $0#'business_id' AS business_id,
  $0#'longitude' AS longitude,
  $0#'latitude' AS latitude;
STORE locations INTO 'yelp_phoenix_academic_dataset/locations.tsv';
locations_2 = LOAD 'yelp_phoenix_academic_dataset/locations.tsv' AS
(business_id:chararray, longitude:double, latitude:double);
location_comparisons = CROSS locations_2, locations;

distances = FOREACH businesses GENERATE locations.business_id AS business_id_1,
                                        locations_2.business_id AS business_id_2,
                                        udfs.haversine(locations.longitude,
                                                       locations.latitude,
                                                       locations_2.longitude,
                                                       locations_2.latitude) AS distance;
STORE distances INTO 'yelp_phoenix_academic_dataset/distances.tsv';


I have also tried converting this to a self-join using JOIN BY '1', and
also locations_2 = locations, and I get the same error:

*org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has
more than one row in the output. 1st :
(rncjoVoEFUJGCUoC1JgnUA,-112.241596,33.581867), 2nd
:(0FNFSzCFP_rGUoJx8W7tJg,-112.105933,33.604054)*

at org.apache.pig.impl.builtin.ReadScalars.exec(ReadScalars.java:111)

at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:336)

at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:438)

at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:347)

at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:378)

at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:298)

at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)

at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)

at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)

at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)

at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)

at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)

at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)

This makes no sense! What am I to do? I can't self-join :(
-- 
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com


RE: Trouble with REGEX in PIG

2013-12-04 Thread Watrous, Daniel
Pradeep,

Does the documentation here need to be updated: 
http://pig.apache.org/docs/r0.12.0/func.html#regex-extract

It suggests that the function can run against a string and should return the 
expected value.

I did confirm that I can use REGEX_EXTRACT on values loaded from a file. 
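
For reference, a minimal sketch of that working pattern (the file name 'data'
and the field name ip are assumptions, following Pradeep's example below):

A = LOAD 'data' USING PigStorage() AS (ip: chararray);
-- REGEX_EXTRACT(source, regex, n) returns the n-th capture group
B = FOREACH A GENERATE REGEX_EXTRACT(ip, '(.*):(.*)', 1) AS host,
                       REGEX_EXTRACT(ip, '(.*):(.*)', 2) AS port;
DUMP B;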

Thank you,
Daniel

-Original Message-
From: Pradeep Gollakota [mailto:pradeep...@gmail.com] 
Sent: Wednesday, December 04, 2013 11:28 AM
To: user@pig.apache.org
Subject: Re: Trouble with REGEX in PIG

It's not valid PigLatin...

The Grunt shell doesn't let you try out functions and UDFs the way you're trying to 
use them.

A = LOAD 'data' USING PigStorage() as (ip: chararray);
B = FOREACH A GENERATE REGEX_EXTRACT(ip, '(.*):(.*)', 1);
DUMP B;

You always have to load a dataset and work with said dataset(s).
You can create a file called 'data' (per the above script) and put
"192.168.1.5:8020" in the file and try the above set of commands in the grunt 
shell.


On Wed, Dec 4, 2013 at 10:15 AM, Ankit Bhatnagar wrote:

> Are you planning to use
>
> org.apache.pig.builtin.REGEX_EXTRACT
>
>
> ?
>
> On 12/4/13 9:28 AM, "Watrous, Daniel"  wrote:
>
> >Hi,
> >
> >I'm trying to use regular expressions in PIG, but it's failing. Based 
> >on the documentation 
> >http://pig.apache.org/docs/r0.12.0/func.html#regex-extract I am 
> >trying
> >this:
> >
> >[watrous@c0003913 ~]$ pig -x local
> >which: no hadoop in
> >(/opt/krb5/sbin/64:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin
> >:/usr 
> >/local/sbin:/usr/sbin:/sbin:/usr/X11R6/bin:/sbin:/usr/sbin:/usr/bin:/
> >opt/p 
> >b/bin:/opt/perf/bin:/bin:/usr/local/bin:/home/watrous/bin:/home/watro
> >us/pi
> >g-0.12.0/bin)
> >2013-12-04 17:15:15,398 [main] INFO  org.apache.pig.Main - Apache Pig 
> >version 0.12.0 (r1529718) compiled Oct 07 2013, 12:20:14
> >2013-12-04 17:15:15,398 [main] INFO  org.apache.pig.Main - Logging 
> >error messages to: /home/watrous/pig_1386177315394.log
> >2013-12-04 17:15:15,425 [main] INFO  org.apache.pig.impl.util.Utils - 
> >Default bootup file /home/watrous/.pigbootup not found
> >2013-12-04 17:15:15,599 [main] INFO
> >org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - 
> >Connecting to hadoop file system at: file:///
> >grunt> REGEX_EXTRACT('192.168.1.5:8020', '(.*):(.*)', 1);
> >2013-12-04 17:16:59,753 [main] ERROR org.apache.pig.tools.grunt.Grunt 
> >- ERROR 1200:  Cannot expand macro 'REGEX_EXTRACT'. Reason: 
> >Macro must be defined before expansion.
> >Details at logfile: /home/watrous/pig_1386177315394.log
> >
> >Here's the relevant bit from the log file:
> >Pig Stack Trace
> >---
> >ERROR 1200:  Cannot expand macro 'REGEX_EXTRACT'. Reason: 
> >Macro must be defined before expansion.
> >
> >Failed to parse:  Cannot expand macro 'REGEX_EXTRACT'. Reason:
> >Macro must be defined before expansion.
> >at org.apache.pig.parser.PigMacro.macroInline(PigMacro.java:455)
> >at
> >org.apache.pig.parser.QueryParserDriver.inlineMacro(QueryParserDriver
> >.java
> >:298)
> >at
> >org.apache.pig.parser.QueryParserDriver.expandMacro(QueryParserDriver
> >.java
> >:287)
> >at
> >org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:180)
> >at
> >org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1648)
> >at
> >org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1621)
> >at org.apache.pig.PigServer.registerQuery(PigServer.java:575)
> >at
> >org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1093)
> >at
> >org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScript
> >Parse
> >r.java:501)
> >at
> >org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.j
> >ava:1
> >98)
> >at
> >org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.j
> >ava:1
> >73)
> >at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
> >at org.apache.pig.Main.run(Main.java:541)
> >at org.apache.pig.Main.main(Main.java:156)
> >
> >I attempted to define the macro (following this tutorial 
> >http://aws.amazon.com/articles/2729). However, piggybank.jar doesn't 
> >define org.apache.pig.piggybank.evaluation.string.EXTRACT, so I 
> >located the most likely file in the current version of the jar.
> >
> >grunt> register
> >/home/watrous/pig-0.12.0/contrib/piggybank/java/piggybank.jar
> >grunt> DEFINE REGEX_EXTRACT
> >org.apache.pig.piggybank.evaluation.string.RegexExtract;
> >grunt> REGEX_EXTRACT('192.168.1.5:8020', '(.*):(.*)', 1);
> >2013-12-04 17:23:20,383 [main] ERROR org.apache.pig.tools.grunt.Grunt 
> >- ERROR 1200:  Cannot expand macro 'REGEX_EXTRACT'. Reason: 
> >Macro must be defined before expansion.
> >Details at logfile: /home/watrous/pig_1386177315394.log
> >
> >I get the same stack trace with the only change being a reference to 
> > instead of .
> >
> >Any idea how I can get this working?
> >
> >Daniel
>
>


RE: Trouble with REGEX in PIG

2013-12-04 Thread Watrous, Daniel
That's what I was trying first, but then I tried defining it too.
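
If the piggybank version is the one wanted, the DEFINE'd alias still has to be
called from a FOREACH over a loaded relation rather than on its own at the
grunt prompt. A rough sketch (hedged; the data file and field name are
assumptions, and a different alias is used so the builtin REGEX_EXTRACT is not
shadowed):

register /home/watrous/pig-0.12.0/contrib/piggybank/java/piggybank.jar;
DEFINE RX org.apache.pig.piggybank.evaluation.string.RegexExtract();
A = LOAD 'data' USING PigStorage() AS (ip: chararray);
B = FOREACH A GENERATE RX(ip, '(.*):(.*)', 1);
DUMP B;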

-Original Message-
From: Ankit Bhatnagar [mailto:ank...@yahoo-inc.com] 
Sent: Wednesday, December 04, 2013 11:15 AM
To: user@pig.apache.org; Watrous, Daniel
Subject: Re: Trouble with REGEX in PIG

Are you planning to use

org.apache.pig.builtin.REGEX_EXTRACT


?

On 12/4/13 9:28 AM, "Watrous, Daniel"  wrote:

>Hi,
>
>I'm trying to use regular expressions in PIG, but it's failing. Based 
>on the documentation 
>http://pig.apache.org/docs/r0.12.0/func.html#regex-extract I am trying
>this:
>
>[watrous@c0003913 ~]$ pig -x local
>which: no hadoop in
>(/opt/krb5/sbin/64:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/
>usr 
>/local/sbin:/usr/sbin:/sbin:/usr/X11R6/bin:/sbin:/usr/sbin:/usr/bin:/op
>t/p 
>b/bin:/opt/perf/bin:/bin:/usr/local/bin:/home/watrous/bin:/home/watrous
>/pi
>g-0.12.0/bin)
>2013-12-04 17:15:15,398 [main] INFO  org.apache.pig.Main - Apache Pig 
>version 0.12.0 (r1529718) compiled Oct 07 2013, 12:20:14
>2013-12-04 17:15:15,398 [main] INFO  org.apache.pig.Main - Logging 
>error messages to: /home/watrous/pig_1386177315394.log
>2013-12-04 17:15:15,425 [main] INFO  org.apache.pig.impl.util.Utils - 
>Default bootup file /home/watrous/.pigbootup not found
>2013-12-04 17:15:15,599 [main] INFO
>org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - 
>Connecting to hadoop file system at: file:///
>grunt> REGEX_EXTRACT('192.168.1.5:8020', '(.*):(.*)', 1);
>2013-12-04 17:16:59,753 [main] ERROR org.apache.pig.tools.grunt.Grunt - 
>ERROR 1200:  Cannot expand macro 'REGEX_EXTRACT'. Reason: Macro 
>must be defined before expansion.
>Details at logfile: /home/watrous/pig_1386177315394.log
>
>Here's the relevant bit from the log file:
>Pig Stack Trace
>---
>ERROR 1200:  Cannot expand macro 'REGEX_EXTRACT'. Reason: Macro 
>must be defined before expansion.
>
>Failed to parse:  Cannot expand macro 'REGEX_EXTRACT'. Reason:
>Macro must be defined before expansion.
>at org.apache.pig.parser.PigMacro.macroInline(PigMacro.java:455)
>at
>org.apache.pig.parser.QueryParserDriver.inlineMacro(QueryParserDriver.j
>ava
>:298)
>at
>org.apache.pig.parser.QueryParserDriver.expandMacro(QueryParserDriver.j
>ava
>:287)
>at
>org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:180)
>at
>org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1648)
>at
>org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1621)
>at org.apache.pig.PigServer.registerQuery(PigServer.java:575)
>at
>org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1093)
>at
>org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptPa
>rse
>r.java:501)
>at
>org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.jav
>a:1
>98)
>at
>org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.jav
>a:1
>73)
>at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
>at org.apache.pig.Main.run(Main.java:541)
>at org.apache.pig.Main.main(Main.java:156)
>
>I attempted to define the macro (following this tutorial 
>http://aws.amazon.com/articles/2729). However, piggybank.jar doesn't 
>define org.apache.pig.piggybank.evaluation.string.EXTRACT, so I located 
>the most likely file in the current version of the jar.
>
>grunt> register
>/home/watrous/pig-0.12.0/contrib/piggybank/java/piggybank.jar
>grunt> DEFINE REGEX_EXTRACT
>org.apache.pig.piggybank.evaluation.string.RegexExtract;
>grunt> REGEX_EXTRACT('192.168.1.5:8020', '(.*):(.*)', 1);
>2013-12-04 17:23:20,383 [main] ERROR org.apache.pig.tools.grunt.Grunt - 
>ERROR 1200:  Cannot expand macro 'REGEX_EXTRACT'. Reason: Macro 
>must be defined before expansion.
>Details at logfile: /home/watrous/pig_1386177315394.log
>
>I get the same stack trace with the only change being a reference to 
> instead of .
>
>Any idea how I can get this working?
>
>Daniel



Re: Trouble with REGEX in PIG

2013-12-04 Thread Pradeep Gollakota
It's not valid PigLatin...

The Grunt shell doesn't let you try out functions and UDFs the way you're
trying to use them.

A = LOAD 'data' USING PigStorage() as (ip: chararray);
B = FOREACH A GENERATE REGEX_EXTRACT(ip, '(.*):(.*)', 1);
DUMP B;

You always have to load a dataset and work with said dataset(s).
You can create a file called 'data' (per the above script) and put
"192.168.1.5:8020" in the file and try the above set of commands in the
grunt shell.


On Wed, Dec 4, 2013 at 10:15 AM, Ankit Bhatnagar wrote:

> Are you planning to use
>
> org.apache.pig.builtin.REGEX_EXTRACT
>
>
> ?
>
> On 12/4/13 9:28 AM, "Watrous, Daniel"  wrote:
>
> >Hi,
> >
> >I'm trying to use regular expressions in PIG, but it's failing. Based on
> >the documentation
> >http://pig.apache.org/docs/r0.12.0/func.html#regex-extract I am trying
> >this:
> >
> >[watrous@c0003913 ~]$ pig -x local
> >which: no hadoop in
> >(/opt/krb5/sbin/64:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr
> >/local/sbin:/usr/sbin:/sbin:/usr/X11R6/bin:/sbin:/usr/sbin:/usr/bin:/opt/p
> >b/bin:/opt/perf/bin:/bin:/usr/local/bin:/home/watrous/bin:/home/watrous/pi
> >g-0.12.0/bin)
> >2013-12-04 17:15:15,398 [main] INFO  org.apache.pig.Main - Apache Pig
> >version 0.12.0 (r1529718) compiled Oct 07 2013, 12:20:14
> >2013-12-04 17:15:15,398 [main] INFO  org.apache.pig.Main - Logging error
> >messages to: /home/watrous/pig_1386177315394.log
> >2013-12-04 17:15:15,425 [main] INFO  org.apache.pig.impl.util.Utils -
> >Default bootup file /home/watrous/.pigbootup not found
> >2013-12-04 17:15:15,599 [main] INFO
> >org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> >Connecting to hadoop file system at: file:///
> >grunt> REGEX_EXTRACT('192.168.1.5:8020', '(.*):(.*)', 1);
> >2013-12-04 17:16:59,753 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> >ERROR 1200:  Cannot expand macro 'REGEX_EXTRACT'. Reason: Macro
> >must be defined before expansion.
> >Details at logfile: /home/watrous/pig_1386177315394.log
> >
> >Here's the relevant bit from the log file:
> >Pig Stack Trace
> >---
> >ERROR 1200:  Cannot expand macro 'REGEX_EXTRACT'. Reason: Macro
> >must be defined before expansion.
> >
> >Failed to parse:  Cannot expand macro 'REGEX_EXTRACT'. Reason:
> >Macro must be defined before expansion.
> >at org.apache.pig.parser.PigMacro.macroInline(PigMacro.java:455)
> >at
> >org.apache.pig.parser.QueryParserDriver.inlineMacro(QueryParserDriver.java
> >:298)
> >at
> >org.apache.pig.parser.QueryParserDriver.expandMacro(QueryParserDriver.java
> >:287)
> >at
> >org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:180)
> >at
> >org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1648)
> >at
> >org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1621)
> >at org.apache.pig.PigServer.registerQuery(PigServer.java:575)
> >at
> >org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1093)
> >at
> >org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParse
> >r.java:501)
> >at
> >org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:1
> >98)
> >at
> >org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:1
> >73)
> >at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
> >at org.apache.pig.Main.run(Main.java:541)
> >at org.apache.pig.Main.main(Main.java:156)
> >
> >I attempted to define the macro (following this tutorial
> >http://aws.amazon.com/articles/2729). However, piggybank.jar doesn't
> >define org.apache.pig.piggybank.evaluation.string.EXTRACT, so I located
> >the most likely file in the current version of the jar.
> >
> >grunt> register
> >/home/watrous/pig-0.12.0/contrib/piggybank/java/piggybank.jar
> >grunt> DEFINE REGEX_EXTRACT
> >org.apache.pig.piggybank.evaluation.string.RegexExtract;
> >grunt> REGEX_EXTRACT('192.168.1.5:8020', '(.*):(.*)', 1);
> >2013-12-04 17:23:20,383 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> >ERROR 1200:  Cannot expand macro 'REGEX_EXTRACT'. Reason: Macro
> >must be defined before expansion.
> >Details at logfile: /home/watrous/pig_1386177315394.log
> >
> >I get the same stack trace with the only change being a reference to
> > instead of .
> >
> >Any idea how I can get this working?
> >
> >Daniel
>
>


Re: Trouble with REGEX in PIG

2013-12-04 Thread Ankit Bhatnagar
Are you planning to use

org.apache.pig.builtin.REGEX_EXTRACT


?

On 12/4/13 9:28 AM, "Watrous, Daniel"  wrote:

>Hi,
>
>I'm trying to use regular expressions in PIG, but it's failing. Based on
>the documentation 
>http://pig.apache.org/docs/r0.12.0/func.html#regex-extract I am trying
>this:
>
>[watrous@c0003913 ~]$ pig -x local
>which: no hadoop in
>(/opt/krb5/sbin/64:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr
>/local/sbin:/usr/sbin:/sbin:/usr/X11R6/bin:/sbin:/usr/sbin:/usr/bin:/opt/p
>b/bin:/opt/perf/bin:/bin:/usr/local/bin:/home/watrous/bin:/home/watrous/pi
>g-0.12.0/bin)
>2013-12-04 17:15:15,398 [main] INFO  org.apache.pig.Main - Apache Pig
>version 0.12.0 (r1529718) compiled Oct 07 2013, 12:20:14
>2013-12-04 17:15:15,398 [main] INFO  org.apache.pig.Main - Logging error
>messages to: /home/watrous/pig_1386177315394.log
>2013-12-04 17:15:15,425 [main] INFO  org.apache.pig.impl.util.Utils -
>Default bootup file /home/watrous/.pigbootup not found
>2013-12-04 17:15:15,599 [main] INFO
>org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
>Connecting to hadoop file system at: file:///
>grunt> REGEX_EXTRACT('192.168.1.5:8020', '(.*):(.*)', 1);
>2013-12-04 17:16:59,753 [main] ERROR org.apache.pig.tools.grunt.Grunt -
>ERROR 1200:  Cannot expand macro 'REGEX_EXTRACT'. Reason: Macro
>must be defined before expansion.
>Details at logfile: /home/watrous/pig_1386177315394.log
>
>Here's the relevant bit from the log file:
>Pig Stack Trace
>---
>ERROR 1200:  Cannot expand macro 'REGEX_EXTRACT'. Reason: Macro
>must be defined before expansion.
>
>Failed to parse:  Cannot expand macro 'REGEX_EXTRACT'. Reason:
>Macro must be defined before expansion.
>at org.apache.pig.parser.PigMacro.macroInline(PigMacro.java:455)
>at 
>org.apache.pig.parser.QueryParserDriver.inlineMacro(QueryParserDriver.java
>:298)
>at 
>org.apache.pig.parser.QueryParserDriver.expandMacro(QueryParserDriver.java
>:287)
>at 
>org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:180)
>at 
>org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1648)
>at 
>org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1621)
>at org.apache.pig.PigServer.registerQuery(PigServer.java:575)
>at 
>org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1093)
>at 
>org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParse
>r.java:501)
>at 
>org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:1
>98)
>at 
>org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:1
>73)
>at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
>at org.apache.pig.Main.run(Main.java:541)
>at org.apache.pig.Main.main(Main.java:156)
>
>I attempted to define the macro (following this tutorial
>http://aws.amazon.com/articles/2729). However, piggybank.jar doesn't
>define org.apache.pig.piggybank.evaluation.string.EXTRACT, so I located
>the most likely file in the current version of the jar.
>
>grunt> register 
>/home/watrous/pig-0.12.0/contrib/piggybank/java/piggybank.jar
>grunt> DEFINE REGEX_EXTRACT
>org.apache.pig.piggybank.evaluation.string.RegexExtract;
>grunt> REGEX_EXTRACT('192.168.1.5:8020', '(.*):(.*)', 1);
>2013-12-04 17:23:20,383 [main] ERROR org.apache.pig.tools.grunt.Grunt -
>ERROR 1200:  Cannot expand macro 'REGEX_EXTRACT'. Reason: Macro
>must be defined before expansion.
>Details at logfile: /home/watrous/pig_1386177315394.log
>
>I get the same stack trace with the only change being a reference to
> instead of .
>
>Any idea how I can get this working?
>
>Daniel



Trouble with REGEX in PIG

2013-12-04 Thread Watrous, Daniel
Hi,

I'm trying to use regular expressions in PIG, but it's failing. Based on the 
documentation http://pig.apache.org/docs/r0.12.0/func.html#regex-extract I am 
trying this:

[watrous@c0003913 ~]$ pig -x local
which: no hadoop in 
(/opt/krb5/sbin/64:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/X11R6/bin:/sbin:/usr/sbin:/usr/bin:/opt/pb/bin:/opt/perf/bin:/bin:/usr/local/bin:/home/watrous/bin:/home/watrous/pig-0.12.0/bin)
2013-12-04 17:15:15,398 [main] INFO  org.apache.pig.Main - Apache Pig version 
0.12.0 (r1529718) compiled Oct 07 2013, 12:20:14
2013-12-04 17:15:15,398 [main] INFO  org.apache.pig.Main - Logging error 
messages to: /home/watrous/pig_1386177315394.log
2013-12-04 17:15:15,425 [main] INFO  org.apache.pig.impl.util.Utils - Default 
bootup file /home/watrous/.pigbootup not found
2013-12-04 17:15:15,599 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to 
hadoop file system at: file:///
grunt> REGEX_EXTRACT('192.168.1.5:8020', '(.*):(.*)', 1);
2013-12-04 17:16:59,753 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
1200:  Cannot expand macro 'REGEX_EXTRACT'. Reason: Macro must be 
defined before expansion.
Details at logfile: /home/watrous/pig_1386177315394.log

Here's the relevant bit from the log file:
Pig Stack Trace
---
ERROR 1200:  Cannot expand macro 'REGEX_EXTRACT'. Reason: Macro must be 
defined before expansion.

Failed to parse:  Cannot expand macro 'REGEX_EXTRACT'. Reason: Macro 
must be defined before expansion.
at org.apache.pig.parser.PigMacro.macroInline(PigMacro.java:455)
at 
org.apache.pig.parser.QueryParserDriver.inlineMacro(QueryParserDriver.java:298)
at 
org.apache.pig.parser.QueryParserDriver.expandMacro(QueryParserDriver.java:287)
at 
org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:180)
at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1648)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1621)
at org.apache.pig.PigServer.registerQuery(PigServer.java:575)
at 
org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1093)
at 
org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:501)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:541)
at org.apache.pig.Main.main(Main.java:156)

I attempted to define the macro (following this tutorial 
http://aws.amazon.com/articles/2729). However, piggybank.jar doesn't define 
org.apache.pig.piggybank.evaluation.string.EXTRACT, so I located the most 
likely file in the current version of the jar.

grunt> register /home/watrous/pig-0.12.0/contrib/piggybank/java/piggybank.jar
grunt> DEFINE REGEX_EXTRACT 
org.apache.pig.piggybank.evaluation.string.RegexExtract;
grunt> REGEX_EXTRACT('192.168.1.5:8020', '(.*):(.*)', 1);
2013-12-04 17:23:20,383 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
1200:  Cannot expand macro 'REGEX_EXTRACT'. Reason: Macro must be 
defined before expansion.
Details at logfile: /home/watrous/pig_1386177315394.log

I get the same stack trace with the only change being a reference to  
instead of .

Any idea how I can get this working?

Daniel