Can't execute pig script with custom UDF in mapreduce mode

2016-03-07 Thread John Smith
Dear all, I wrote a custom Java UDF. To compile it I use maven, and the pom.xml contains exactly the same versions of hadoop and pig that are running on the cluster. The compilation is done on one of the cluster machines. When I run the pig script in local mode using that UDF, all works fine. When I execute p

UDF + libraries classpath

2016-03-01 Thread John Smith
Hi, I wrote a UDF that uses some external libraries such as jackson-databind. How can I specify where Pig should look for these external libs? Thanks
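
A minimal sketch of one common approach, with hypothetical jar paths and UDF names: REGISTER ships each jar to the cluster alongside the job, so the UDF's dependencies travel with it.

    REGISTER /path/to/my-udfs.jar;                   -- hypothetical jar containing the UDF
    REGISTER /path/to/jackson-databind-2.x.jar;      -- the external dependency
    A = LOAD 'input' USING PigStorage(',') AS (line:chararray);
    B = FOREACH A GENERATE com.example.MyUdf(line);  -- hypothetical UDF class

Alternatively, the pig.additional.jars property (a colon-separated jar list passed on the pig command line) achieves the same without editing the script.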

AvroStorage bug/issue

2016-02-01 Thread John Smith
Hi, the problem is described here: https://issues.apache.org/jira/browse/PIG-4793 Thank you

AvroStorage load doesn't work in mapreduce mode

2016-01-28 Thread John Smith
Hi, I'm trying to load an avro file (or a directory containing avro files) using AvroStorage in mapreduce mode. I tried almost all the combinations (hdfs://, /, hdfs://ip:port/file ...) but nothing works. I got this error: set = load '/spool-dir/CustomerData-20160128-1501807/' USING org.apache.pig.piggy
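
One frequent cause in mapreduce mode is that the Avro jars are never shipped to the cluster, which local mode hides. A hedged sketch under that assumption (paths and versions are illustrative):

    REGISTER /path/to/piggybank.jar;        -- provides piggybank's AvroStorage
    REGISTER /path/to/avro-1.7.x.jar;       -- Avro runtime (version illustrative)
    REGISTER /path/to/json-simple-1.x.jar;  -- JSON dependency used by AvroStorage
    records = LOAD '/spool-dir/CustomerData-20160128-1501807/'
              USING org.apache.pig.piggybank.storage.avro.AvroStorage();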

AvroStorage - output file name

2016-01-19 Thread John Smith
Hi, I use AvroStorage to store the result set from Pig. Is there a way to store the data into one specified avro file, e.g. OutputFileGen1? Pig stores the data into a directory named OutputFileGen1 with the structure listed below. Thank you ls -al OutputFileGen1/ total 20 drwxr-xr-x 2 root
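
Pig, like any MapReduce job, always writes a directory of part files; the closest workaround is to force a single part file and rename it afterwards. A sketch, assuming the stored relation is called resultSet (the exact part-file name varies by job):

    SET default_parallel 1;   -- a single reducer yields a single part file
    STORE resultSet INTO 'OutputFileGen1'
        USING org.apache.pig.piggybank.storage.avro.AvroStorage();
    -- then, from the grunt shell:
    fs -mv OutputFileGen1/part-r-00000.avro OutputFileGen1.avro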

outputSchema method issues with input

2016-01-13 Thread John Smith
Hi, could anyone take a look at this issue: http://stackoverflow.com/questions/34276494/apache-pig-udf-and-outputschema-customization The problem is that the input passed to the outputSchema method is empty, even though the UDF is called with a schema -> sourceData = foreach sourceData generate com.pig.Data('test.

Pig - outputSchema - create schema for tuple

2016-01-12 Thread John Smith
I'm trying to define an output schema which should be a tuple that contains another two tuples, i.e. `stats:tuple(c:tuple(),d:tuple())`. The code below doesn't work as intended. It somehow produces the structure stats:tuple(b:tuple(c:tuple(),d:tuple())). Below is the output produced by describe.

Re: PigServer class and script execution

2016-01-11 Thread John Smith
Hi, no, I create the pig commands inside the Java program, so I call *pigServer.registerQuery* Best, John On Mon, Jan 11, 2016 at 11:51 PM, Jianfeng (Jeff) Zhang < jzh...@hortonworks.com> wrote: > > Do you use PigServer.registerScript(fileName) ? Then what errors do you >

Re: Casting

2016-01-11 Thread John Smith
Even when I do explicit casting I'm getting the same error: sensitiveSet = foreach sensitiveSet generate (long) $0, (chararray) $1, (long) $2, (chararray) $3, (chararray) $4, (chararray) $5; On Mon, Jan 11, 2016 at 6:03 PM, John Smith wrote: > Hi, > > I'm trying to dump a relation into AVRO

Re: PigServer class and script execution

2016-01-11 Thread John Smith
Hi, I think you are answering something different. I need to execute the whole pig script using the PigServer class... which apparently doesn't support AvroStorage, as far as I can see. Best, John On Mon, Jan 11, 2016 at 6:27 PM, Jianfeng (Jeff) Zhang < jzh...@hortonworks.com> wrote: > > Of cou

Casting

2016-01-11 Thread John Smith
Hi, I'm trying to dump a relation into an AVRO file but I'm getting a strange error: org.apache.pig.data.DataByteArray cannot be cast to java.lang.CharSequence I don't use DataByteArray (bytearray); see the description of the relation below. sensitiveSet: {rank_ID: long,name: chararray,customerId: long,VIN:
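
The bytearray usually sneaks in when fields are left untyped at load time or pass through a UDF without a declared output schema. A sketch (hypothetical path, delimiter, and field list) that keeps every field concretely typed before the Avro store:

    sensitiveSet = LOAD 'input.csv' USING PigStorage(',')
        AS (rank_ID:long, name:chararray, customerId:long,
            VIN:chararray, birth_date:chararray);
    STORE sensitiveSet INTO 'out'
        USING org.apache.pig.piggybank.storage.avro.AvroStorage();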

data types & avrostorage

2016-01-11 Thread John Smith
Hi, I load data into Pig using LOAD without specifying data types. In the second step I call a UDF and set the proper data types using AS (). The typed set looks as below. grunt> describe sensitiveSet; sensitiveSet: {rank_ID: long,name: chararray,customerId: long,VIN: chararray,birth_date: chararr

PigServer class and script execution

2016-01-11 Thread John Smith
Hi, I have Java code that generates a pig script. I am wondering if there is an option to execute that script directly within the Java code. I found there is an option to embed pig script execution inside Java code using PigServer

AvroStorage.java

2016-01-07 Thread John Smith
Hi, can someone shed some light on the AvroStorage class? I'm passing a JSON Avro schema as a parameter to AvroStorage; the schema contains an additional attribute *"extraAttribute"*. The problem is that the avro file produced by Pig doesn't contain that attribute within the AVRO schema. When I pass the same schema to A

AvroStorage custom attributes

2016-01-06 Thread John Smith
Hi, is there any chance to write custom attributes into an AVRO file schema? A = load 'data' using PigStorage(',') AS (b1:int,b2:int,b3:bytearray); STORE A INTO 'testOutput' USING org.apache.pig.piggybank.storage.avro.AvroStorage( 'schema', ' {"type":"record","name":"X", "fields":[{"name":"b1","ty

outputSchema

2015-12-18 Thread John Smith
Hi, I know this might be a trivial question, but I'm struggling here... I want to define outputSchema to return one bag with two tuples. Below is code which apparently doesn't work and does something different ;-( expected schema { tup1 (), tup2 () } outputSchema(Schema input) ... {

pass data into outputSchema method

2015-12-17 Thread John Smith
Hi there, is there any way to pass a filename/string/any other data into the frontend method outputSchema()? There are distributed caches and UDFContext, but as far as I understood both work with backend methods, i.e. exec(). I need to pre-compute some data before the execution of the pig script and

problem with the schema generation - outputSchema

2015-12-16 Thread John Smith
Hi all, my intention is to write a generic pig script using a UDF that can process csv files with a different number of fields per file. Each time, pig processes one type of input file. The UDF will produce a bag with two tuples; the number of records inside the tuples will depend on the intern

Re: pig data flow controlling with metadata

2015-12-07 Thread John Smith
, drop), (2, passthru), (3, encrypt) (3, drop), (2, passthru), (1, encrypt) Thanks On Sun, Dec 6, 2015 at 11:47 PM, John Smith wrote: > Hi, > > I would like to ask you if there is any possibility to write pig flow > using control (metadata file). > > As a source I wa

Re: pig data flow controlling with metadata

2015-12-06 Thread John Smith
Hi, Not sure if I get it. I want to have a modular pig script that processes a data file based on the metadata file. Not sure if that's possible. John On Mon, Dec 7, 2015 at 12:18 AM, Andrew Oliver wrote: > I wrote a simple shell script to generate the individual pig scripts and a > control

pig data flow controlling with metadata

2015-12-06 Thread John Smith
: [ {name: name, encryption: yes ... }, {...} ]} ... CSV name, age, insurance, John S, 39, yes The pig script reads the json file and, based on it, processes the CSV; e.g. if there is a flag for the first field "name" saying encryption: yes, it will somehow encrypt the value inside c

Re: Combining small S3 inputs

2014-06-17 Thread John Meagher
Check out https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html. I don't know if there's an S3 version, but this should help. On Tue, Jun 17, 2014 at 4:48 PM, Brian Stempin wrote: > Hi, > I was comparing performance of a Hadoop job that I wrote in Ja
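
On the Pig side, split combination can give a similar effect without a custom input format; a sketch using Pig's documented properties (the size threshold is illustrative):

    SET pig.splitCombination true;            -- on by default in recent Pig releases
    SET pig.maxCombinedSplitSize 134217728;   -- pack small inputs into ~128 MB splits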

Re: Any way to join two aliases without using CROSS

2014-03-25 Thread John Meagher
Try this: http://pig.apache.org/docs/r0.11.0/basic.html#rank Rank each data set then join on the rank. On Tue, Mar 25, 2014 at 4:03 PM, Christopher Surage wrote: > The output I would like to see is > > (1,2,3,4,5,10,11) > (1,2,4,5,7,10,12) > (1,5,7,8,9,10,13) > > > On Tue, Mar 25, 2014 at 3:58 P
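
A minimal sketch of that suggestion, assuming two relations A and B with the same row count: RANK prepends a 1-based rank field, which then serves as the join key.

    A1 = RANK A;                    -- rank becomes the new first field ($0)
    B1 = RANK B;
    J  = JOIN A1 BY $0, B1 BY $0;   -- pairs rows positionally instead of CROSS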

Re: limit map tasks for load function

2013-11-04 Thread John
okay. maybe you are right. thanks 2013/11/4 Pradeep Gollakota > You would only be able to set it for the script... which means it will > apply to all 8 jobs. However, my guess is that you don't need to control > the number of map tasks per machine. > > > On Sun, Nov 3

Re: limit map tasks for load function

2013-11-03 Thread John
the other jobs. > > > On Sun, Nov 3, 2013 at 3:04 PM, John wrote: > > > Hi, > > > > is it possible to limit the number of map slots used for the load > function? > > For example I have 5 nodes with 10 map slots (each node has 2 slots for > > every cpu) I

limit map tasks for load function

2013-11-03 Thread John
Hi, is it possible to limit the number of map slots used for the load function? For example, I have 5 nodes with 10 map slots (each node has 2 slots for every cpu) and I want only one map task for every node. Is there a way to set this only for the load function? I know there is an option called "mapred

Re: how to load custom Writable class from sequence file?

2013-09-24 Thread John Meagher
There's a patch available to allow using any available javax.script language to do the conversion from any Java object type in the sequence file to pig types. See https://issues.apache.org/jira/browse/PIG-1777 On Tue, Sep 24, 2013 at 5:22 AM, Dmitriy Ryaboy wrote: > I assume by scala you mean sc

Re: DataByteArray as Input in Load Function

2013-09-17 Thread John
Or is it only possible to execute the load function at the beginning of the script? Otherwise it should be theoretically possible to hand over information that is created while the program is running. 2013/9/17 John > Hi, > > I'm using Pig+HBase. I'm trying to create a Pig program that looks

DataByteArray as Input in Load Function

2013-09-17 Thread John
Hi, I'm using Pig+HBase. I'm trying to create a Pig program that looks like this: MY_BLOOMFILTER = load 'hbase://bloomfilterTable' using ..." ... // do something to transform it to a DataByteArray Now I want to load data outside of HBase based on the bloom filter; therefore I've built my own LoadFunct

Re: Problem while using merge join

2013-09-13 Thread John
to execute a sort after this!? Normally I would join the 3 bags with a multi-join, but a multi-join doesn't work with the merge feature. regards, john 2013/9/13 Pradeep Gollakota > I think a better option is to completely bypass the HBaseStorage mechanism. > Since you've already

Re: Problem while using merge join

2013-09-13 Thread John
g, can you do your join first and then > execute your UDF? > > > On Fri, Sep 13, 2013 at 10:04 AM, John wrote: > > > Okay, I think I have found the problem here: > > http://pig.apache.org/docs/r0.11.1/perf.html#merge-joins ... there it is > > written: > >

Re: Problem while using merge join

2013-09-13 Thread John
If there are other ideas to make it faster, I will try them. regards, john 2013/9/13 Shahab Yunus > Wouldn't this slow down your data retrieval? One column in each call > instead of a batch? > > Regards, > Shahab > > > On Fri, Sep 13, 2013 at 2:34 PM, John wrote: >

Re: Problem while using merge join

2013-09-13 Thread John
setBatch(int batch) Set the maximum number of values to return for each call to next(). I think this will work. Any idea if this approach has disadvantages? regards 2013/9/13 John > hi, > > the join key is in the bag, that's the problem. The Load Function returns > only one element $0 and tha

Re: Problem while using merge join

2013-09-13 Thread John
? thanks 2013/9/13 John > Hi, > > I try to use a merge join for 2 bags. Here is my pig code: > http://pastebin.com/Y9b2UtNk . > > But I got this error: > > Caused by: > org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogicalToPhysicalTranslatorExceptio

Re: Sort Order in HBase with Pig/Piglatin in Java

2013-09-13 Thread John
> security in terms of what you want. > > Regards, > Shahab > > > On Fri, Sep 13, 2013 at 12:29 PM, John wrote: > > > Hi, thanks for your quick answer! I figured it out by myself since the > > mailing server was down the last 2 hours?! Btw. I did option 1. But I &g

Sort Order in HBase with Pig/Piglatin in Java

2013-09-13 Thread John
I created an HBase table in the hbase shell and added some data. In http://hbase.apache.org/book/dm.sort.html it is written that the datasets are sorted first by the rowkey and then by the column. So I tried something in the HBase shell: http://pastebin.com/gLVAX0rJ Everything looks fine. I got the

Problem while using merge join

2013-09-13 Thread John
Hi, I'm trying to use a merge join for 2 bags. Here is my pig code: http://pastebin.com/Y9b2UtNk . But I get this error: Caused by: org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogicalToPhysicalTranslatorException: ERROR 1103: Merge join/Cogroup only supports Filter, Foreach, Ascending
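
As the error text hints, merge join only accepts inputs Pig can verify are sorted on the join key. A hedged sketch (relations and key are hypothetical) that sorts both sides in the script before joining:

    A1 = ORDER A BY key;    -- both merge-join inputs must be sorted on the join key
    B1 = ORDER B BY key;
    J  = JOIN A1 BY key, B1 BY key USING 'merge';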

Re: Sort Order in HBase with Pig/Piglatin in Java

2013-09-13 Thread John
s a great observation John! The problem is that HBaseStorage maps > column families into a HashMap, so the sort ordering is completely lost. > > You have two options: > > 1. Modify HBaseStorage to use a SortedMap data structure (i.e. TreeMap) and > use the modified HBaseStorage. (or

Re: COALESCE UDF?

2013-09-04 Thread John Meagher
https://github.com/mozilla-metrics/akela/blob/master/src/main/java/com/mozilla/pig/eval/Coalesce.java On Wed, Sep 4, 2013 at 12:40 PM, Something Something wrote: > Serega - I think you missed this line: > > "I know it will be very (very) easy to write this, but just don't want > to create one
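
For the common two-argument case, Pig's built-in bincond operator does the same job with no UDF at all; a sketch with hypothetical fields:

    B = FOREACH A GENERATE (val IS NOT NULL ? val : fallback) AS coalesced;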

Pig Latin Program with special Load Function

2013-08-21 Thread John
I'm currently writing a Pig Latin program: A = load 'hbase://mytable1' using my.packages.CustomizeHBaseStorage('VALUE', '-loadKey true', 'myrowkey1') as (rowkey:chararray, columncontent:map[]); ABag = foreach PATTERN_0 generate flatten(my.packages.MapToBag($1)) as (output:chararray); the Custim

Re: Merging files

2013-07-31 Thread John Meagher
It is file size based, not file count based. For fewer files, increase the max-file-blocks setting. On Wed, Jul 31, 2013 at 12:21 PM, Something Something wrote: > Thanks, John. But I don't see an option to specify the # of output files. > How does Crush decide how many files to create?

Re: Merging files

2013-07-31 Thread John Meagher
Here's a great tool for handling exactly that case: https://github.com/edwardcapriolo/filecrush On Wed, Jul 31, 2013 at 2:40 AM, Something Something wrote: > Each bz2 file after merging is about 50Megs. The reducers take about 9 > minutes. > > Note: 'getmerge' is not an option. There isn't eno

Re: Tracking parts of a job taking the most time

2013-06-05 Thread John Meek
each script block? > > Johnny > > > On Tue, Jun 4, 2013 at 7:11 PM, John Meek wrote: > > > All, > > > > I have a 400 line pig script which performs the calculations I need it to > > perform, however I need to figure out the amount of time that specific >

Tracking parts of a job taking the most time

2013-06-04 Thread John Meek
All, I have a 400 line pig script which performs the calculations I need it to perform; however, I need to figure out the amount of time that specific parts of the script take. For example, the initial load from an HBase table - I'd like to know how much time the load takes before moving onto the nex

Re: Removing Parentheses to store to DB

2013-05-30 Thread John Meek
s to store to DB Hi John, What is the exception you are facing? On 30 May 2013 04:50, John Meek wrote: > All, > > How do I remove the parentheses in my output? > > I have a statement like - > > STORE Output INTO 'table' USING > org.apache.pig.piggyb

Removing Parentheses to store to DB

2013-05-29 Thread John Meek
All, How do I remove the parentheses in my output? I have a statement like - STORE Output INTO 'table' USING org.apache.pig.piggybank.storage.DBStorage('com.mysql.jdbc.Driver','jdbc:mysql://172.45.50.40:7312/db','user','pwd','INSERT INTO data (name, operation, date, val, source, target) VALUES

Re: Is there a way to do replicated inner join and filter on another column at once?

2013-05-21 Thread John Meagher
Filter first and it will do it in a single scan and will make the join faster. http://pig.apache.org/docs/r0.11.1/perf.html#filter On Tue, May 21, 2013 at 8:28 PM, Thomas Edison wrote: > Here is a code sample: > > a = load 'fact' as (dim_key:chararray, fact_value:int); > b = load 'dim'; > > c = j
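
A sketch of the suggested rewrite against the quoted relations (the predicate is illustrative): the filter runs in the same map pass, so the replicated join sees less data.

    a  = LOAD 'fact' AS (dim_key:chararray, fact_value:int);
    b  = LOAD 'dim'  AS (dim_key:chararray);
    fa = FILTER a BY fact_value > 0;   -- illustrative predicate, applied before the join
    c  = JOIN fa BY dim_key, b BY dim_key USING 'replicated';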

Error to read counters into Rank operation

2013-05-13 Thread John Meek
Hey all, One of my scripts is giving the error below. The script works fine when I run it in Grunt, but otherwise I get "Error to read counters into Rank operation counterSize 0". ?? I see this https://issues.apache.org/jira/browse/PIG-2985 but am unable to decipher it. Can someone let me know what this m

Re: Hbase Hex Values

2013-05-06 Thread John Meek
s > I am not aware of any built in or Piggybank UDF that converts Hex to Int, > but it would be a welcome contribution if you wanted to write it. > > Alan. > > On May 5, 2013, at 8:14 PM, John Meek wrote: > > > Hey all, > > > > If I need to load a Hbase ta

RE: Stopping after load with no tuples?

2013-04-18 Thread John Farrelly
Thanks for the reply Ruslan. I was guessing it wouldn't be possible, but thought I would ask in case there was an easy way to do it. I hadn't read about Oozie - looks useful! Regards, John. -Original Message- From: Ruslan Al-Fakikh [mailto:metarus...@gmail.com] Sent: 17 Apr

Stopping after load with no tuples?

2013-04-09 Thread John Farrelly
this? Thanks, John. define MyDataLoader com.example.test.MyDataLoader('config1'); raw = LOAD '$inputdir' USING MyDataLoader AS (date:chararray, values:bag{t:(policy:chararray)}); data1= FOREACH raw GENERATE date, . data2= GROUP data1 BY (date, . data3

Re: Spreading data in Pig

2013-03-31 Thread John Meek
in Pig Hi John, The only way I can think of to do this is using the RANK operator (available only in pig version 0.11) along with a custom udf as follows: * RANK the users relation to result in something like: (User1, 1) (User2, 2) (User3, 3) (User4, 4) (User5, 5) (User6, 6) (User7, 7) (User8

Spreading data in Pig

2013-03-31 Thread John Meek
hey all, Can anyone let me know how I can accomplish the problem below in Pig? I have 2 data sources: TABLE A with a list of User IDs: User1 User2 User3 User4 User5 User6 User7 User8 User9 TABLE B with (Host name, Capacity): Hostb 2 Hostc 4 Hostd 3 I basically need to spread the data in table

RE: Don't process already processed files?

2013-03-27 Thread John Farrelly
Thanks Bill. Option 2 is what I've started coding, as I have multiple pig scripts that need to process the same files, and the pig scripts run at different times. Regards, John. -Original Message- From: Bill Graham [mailto:billgra...@gmail.com] Sent: 27 March 2013 15:15 To:

RE: Don't process already processed files?

2013-03-27 Thread John Farrelly
Thanks Mike. That's what I was thinking, but I was wondering if (hoping!) there was something already to do it :) Thanks, John. -Original Message- From: Mike Sukmanowsky [mailto:m...@parsely.com] Sent: 27 March 2013 14:05 To: user@pig.apache.org Subject: Re: Don't proce

Don't process already processed files?

2013-03-27 Thread John Farrelly
processed new files that it hasn't seen before? I was thinking of using a custom PathFilter for my loader, but I thought I would ask to see if there is already a way to do this, rather than me reinventing the wheel (!). Thanks,

Re: Pig Regex Help

2013-03-10 Thread John Meek
Harsha, thanks for your response. I needed to use USING PigStorage(',') in my load statement. It works now. -Original Message- From: Harsha To: user Sent: Sat, Mar 9, 2013 10:40 pm Subject: Re: Pig Regex Help Hi John, I ran these in pig 0.9.2 A = LOAD

Re: Pig Regex Help

2013-03-09 Thread John Meek
hi Harsha, Running release 0.11.0. Thanks. -Original Message- From: Harsha To: user Sent: Sat, Mar 9, 2013 10:40 pm Subject: Re: Pig Regex Help Hi John, I ran these in pig 0.9.2 A = LOAD 'data' as line:chararray; B = FOREACH A GENERA

RE: How can I load external files within jython UDF?

2012-12-09 Thread John Gordon
If you ship the file explicitly, you can use this syntax from there. It will pack it with the job jar and make sure it is in the working directory wherever the job runs. Be careful of shipping very large files; it is probably better to refactor your logic into multiple top-level pig statements

Re: foreach in PIG is not working.

2012-07-25 Thread John Meagher
Change: using PigStorage(',') to: using PigStorage(' ') The delimiter passed into PigStorage does not appear to be correct. On Wed, Jul 25, 2012 at 2:31 PM, wrote: > Thanks All :-) > > yes the file I have uploaded was text file having format > (Yogesh 12) > (Aashi 13) > (Mohit 14) > > > I used
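
The corrected load line in full, under the thread's assumption that the file is space-delimited:

    A = LOAD 'input' USING PigStorage(' ') AS (name:chararray, num:int);
    B = FOREACH A GENERATE name;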

ERROR 2997: Unable to recreate exception from backed error

2012-07-13 Thread John Morrison
ystem (see below). Thanks, John Details: # Data file hadoop fs -cat "hdfs://thadoop2/tmp/v.log" 12/07/13 10:23:23 INFO util.NativeCodeLoader: Loaded the native-hadoop library 1 6 2 8 3 10 4 12 # Pig file cat hdfs_dump.pig A = LOAD 'hdfs://thadoop

Re: How can I set the mapper number for pig script?

2012-06-23 Thread John Meagher
Another option is to either reduce the block sizes of the input data or disabling the combine input format and splitting the data into more files. On Sat, Jun 23, 2012 at 5:58 PM, Yang wrote: > hi Sheng: > > I had exactly the same problem as you did. > > right now with hadoop 0.20 and above you

Re: ? ERROR 1070: Could not resolve sum using imports

2012-05-17 Thread John Morrison
Thanks Jonathan, that was the key. I have been trying for several days to solve this simple problem. I guess I need to RTFM a little closer. :) -John On Thu, May 17, 2012 at 9:07 PM, Jonathan Coveney wrote: > Ah, this is the most common error for people starting out :) > >

Re: ? ERROR 1070: Could not resolve sum using imports

2012-05-17 Thread John Morrison
OK. I have simplified the script and tried 2 different ways without success: 1) B = foreach A generate flatten($0), SUM($0.a) ; 2) B = foreach A generate flatten($0), SUM(a) ; Both produce different errors (see below). Thanks, John Data: cat v.log 1 2 3 4 Complete script 1

Re: ? ERROR 1070: Could not resolve sum using imports

2012-05-17 Thread John Morrison
an explicit cast. On Thu, May 17, 2012 at 11:42 AM, Prashant Kommireddi wrote: > UDFs are case-sensitive. It should be all caps - SUM > > Can you please give that a try? > > > On May 17, 2012, at 8:24 AM, John Morrison > wrote: > > > Hi, > > > > I am ne

Re: ? ERROR 1070: Could not resolve sum using imports

2012-05-17 Thread John Meagher
The UDFs are case sensitive. Use SUM and it will work. On Thu, May 17, 2012 at 11:24 AM, John Morrison wrote: > Hi, > > I am new to pig and am unable to use pig builtin functions (please see > details below). > > Is this a CLASSPATH issue? > > Any ideas on how to resol
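
A sketch of the corrected statement, with the grouping step SUM operates over (the field name is taken from the original script):

    B = GROUP A ALL;
    C = FOREACH B GENERATE SUM(A.lane_nbr);   -- SUM, not sum: built-in names are case-sensitive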

? ERROR 1070: Could not resolve sum using imports

2012-05-17 Thread John Morrison
Hi, I am new to pig and am unable to use pig builtin functions (please see details below). Is this a CLASSPATH issue? Any ideas on how to resolve? Thanks, John Details ### Line in pig script causing issue C = foreach B generate flatten($0), sum(lane_nbr) ; ### Error message 2012-05-17 11

Re: Can I define a procedure or function on pig?

2012-04-19 Thread John Whitfield
Pig supports Macros: http://pig.apache.org/docs/r0.9.2/cont.html#define-macros On Thu, Apr 19, 2012 at 5:21 PM, Fernando Doglio < fernando.dog...@moove-it.com> wrote: > Been looking around for this, but couldn't find an answer. > > Is there any way for me to define a function or procedure inside
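
A minimal macro sketch (relation, field, and threshold are hypothetical) showing the DEFINE ... RETURNS syntax that page describes:

    DEFINE filter_above(rel, threshold) RETURNS out {
        $out = FILTER $rel BY value > $threshold;   -- hypothetical field 'value'
    };
    big = filter_above(data, 10);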

dryrun creating extra nesting of parentheses in FILTER statement

2012-03-30 Thread John Whitfield
D ($7 IS NOT NULL)); dump test_result STORE test_result INTO 'test_result' USING PigStorage('\t'); ___ Note the extra nesting: In my actual script, the filter has even more parameters, and the nesting gets very deep. Is there a way that I could set up my filter to avoid this? Thanks, John

Re: Pig unit tests minus Java

2012-02-21 Thread John Meagher
You can use JUnit for system tests like that, but it ends up being a mess. You would need a JUnit test that ran hadoop, ran any other server pieces you needed, then you can use Selenium http://seleniumhq.org/ for the browser side of the test. On Tue, Feb 21, 2012 at 05:23, Dmitriy Ryaboy wrote:

Re: how to access solr from pig

2011-11-30 Thread John Conwell
that can be easily read by pig. On Wed, Nov 30, 2011 at 4:42 AM, kumar swami wrote: > Hi friends, > > I am new to Pig library. I need help on how to read data from solr using > pig?. If you have any code samples please provide me. > > Thanks, swami > -- Thanks, John C

Re: Distributing our jars to all machines in a cluster

2011-11-16 Thread John Conwell
cluster. This used to work until our cluster size was small. Now > our > >>> > cluster is getting bigger. What's the best way to start a Hadoop Job > >>> that > >>> > automatically distributes the Jar to all machines in a cluster? > >>> > > >>> > I read the doc at: > >>> > > http://hadoop.apache.org/common/docs/current/commands_manual.html#jar > >>> > > >>> > Would -libjars do the trick? But we need to use 'hadoop job' for > that, > >>> > right? Until now, we were using 'hadoop jar' to start all our jobs. > >>> > > >>> > Needless to say, we are getting our feet wet with Hadoop, so > appreciate > >>> > your help with our dumb questions. > >>> > > >>> > Thanks. > >>> > > >>> > PS: We use Pig a lot, which automatically does this, so there must > be > >>> a > >>> > clean way to do this. > >>> > >>> > >> > > > > > -- Thanks, John C

Pig performance profiling and reusing an optimized plan

2011-08-16 Thread John Amos
reuse the generated JAR files instead of regenerating them every time? Regards, John

Re: Advice on algorithm for joining data in bags

2011-07-13 Thread John Conwell
'd > like to get to is something like this (the arbitrary_id could be anything, > I > really just need a set of the overlapping IDs): > > (arbitrary_id, {12, 45, 67, 78}) > > How can I join on the bag of IDs in 'D' to find other labels that have at > least one of the same IDs? Or am I approaching this the wrong way? > > Thanks, > > Mike > -- Thanks, John C

Manually build tuple from three group relations

2011-07-06 Thread John Conwell
I have a dataset where each tuple is a term. I then do two filter operations to find all terms that have numbers, then all terms that don't have numbers. Oddly, there are some terms that don't fit into either group (not really sure how). So at this point I have 3 bags: all terms, terms with numbe

Re: quick factual question about python load/store UDFs ...

2011-05-10 Thread John Meagher
There's a loader available as a patch in Jira, but nothing I'm aware of for storing. https://issues.apache.org/jira/browse/PIG-1777 On Tue, May 10, 2011 at 15:52, Daniel Eklund wrote: > I am looking at the jython UDF function capabilities. > > Is it fair to say that the jython UDFs are only for

Re: JSONToTuple for pig UDF

2011-04-19 Thread John Hui
Really cool. Let me take a look when I have some "downtime". If that's the case, Xavier's parser is much better than mine. Who wants to take the lead in adding this to the piggybank? I am sure this makes for a very useful "storage" utility. John On Tue, A

Re: JSONToTuple for pig UDF

2011-04-19 Thread John Hui
I'll post my solution in a few hours =) On Tue, Apr 19, 2011 at 3:02 PM, John Hui wrote: > I don't think one parser will work for all solution. It really depends on > your data, since there might be a list within a list. > > But pick anyone as a starting point and cus

Re: JSONToTuple for pig UDF

2011-04-19 Thread John Hui
I don't think one parser will work for all solutions. It really depends on your data, since there might be a list within a list. But pick any one as a starting point and customize it for your own json data format. On Tue, Apr 19, 2011 at 3:00 PM, Alan Gates wrote: > > On Apr 19, 2011, at 11:44 A

Re: JSONToTuple for pig UDF

2011-04-19 Thread John Hui
I have a JSON library and pig script working. Should I just contribute it instead of reinventing the wheel? John On Tue, Apr 19, 2011 at 2:44 PM, Daniel Eklund wrote: > Bill, thanks... > > so that is a confirmation... people have rolled their own, and it's not in > pig

Group Concat.

2011-03-28 Thread mike st. john
Is it possible to do a group concat with pig? I've been trying with no success. Basically the data is as follows: 1234|test1 1234|test2 1234|test3 1244|test4 1244|test5 etc etc I'm trying to come up with: 1234|test1 test2 test3 1244|test4 test5 thanks Mike
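
One way to approximate a group concat is GROUP plus BagToString, which ships as a built-in in later Pig releases; a hedged sketch for the data shown:

    A = LOAD 'data' USING PigStorage('|') AS (id:chararray, val:chararray);
    B = GROUP A BY id;
    C = FOREACH B GENERATE group, BagToString(A.val, ' ');   -- e.g. (1234, test1 test2 test3)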

Anti-Joins

2011-03-24 Thread mike st. john
Are there any examples of anti-joins using Pig? Thanks Msj

Re: [VOTE] Sponsoring Howl as an Apache Incubator project

2011-02-04 Thread John Sichi
Got it, thanks for the correction. JVS On Feb 3, 2011, at 4:56 PM, Alex Boisvert wrote: > Hi John, > > Just to clarify where I was going with my line of questioning. There's no > Apache policy that prevents dependencies on incubator project, whether it's > release

Re: [VOTE] Sponsoring Howl as an Apache Incubator project

2011-02-04 Thread John Sichi
On Feb 3, 2011, at 5:09 PM, Alan Gates wrote: > Are you referring to the serde jar or any particular serde's we are making > use of? Both (see below). JVS [jsichi@dev1066 ~/open/howl/howl/howl/src/java/org/apache/hadoop/hive/howl] ls cli/ common/ data/ mapreduce/ pig/ rcfile/ [jsic

Re: [VOTE] Sponsoring Howl as an Apache Incubator project

2011-02-03 Thread John Sichi
it we expect it will continue to add more > additional layers. > > Alan. > > On Feb 3, 2011, at 2:49 PM, John Sichi wrote: > >> But Howl does layer on some additional code, right? >> >> https://github.com/yahoo/howl/tree/howl/howl >> >> JVS >

Re: [VOTE] Sponsoring Howl as an Apache Incubator project

2011-02-03 Thread John Sichi
I was going off of what I read in HADOOP-3676 (which lacks a reference as well). But I guess if a release can be made from the incubator, then it's not a blocker. JVS On Feb 3, 2011, at 3:29 PM, Alex Boisvert wrote: > On Thu, Feb 3, 2011 at 11:38 AM, John Sichi wrote: > Besid

Re: [VOTE] Sponsoring Howl as an Apache Incubator project

2011-02-03 Thread John Sichi
But Howl does layer on some additional code, right? https://github.com/yahoo/howl/tree/howl/howl JVS On Feb 3, 2011, at 1:49 PM, Ashutosh Chauhan wrote: > There are none as of today. In the past, whenever we had to have > changes, we do it in a separate branch in Howl and once those get > commi

Re: [VOTE] Sponsoring Howl as an Apache Incubator project

2011-02-03 Thread John Sichi
Besides the fact that the refactoring required is significant, I don't think this is possible to do quickly since: 1) Hive (unlike Pig) requires a metastore 2) Hive releases can't depend on an incubator project It's worth pointing out that Howl is already using Hive's CLI+DDL (not just metasto

Re: Comparison between long

2010-12-15 Thread John Hui
btw.. you probably want to call that InSeconds, plural) > > D > > On Wed, Dec 15, 2010 at 2:00 PM, John Hui wrote: > > To give more context, the ISOToUnixInSecond return UnixTime in second. > The > > return value of this function is Long > > > > 75 @

Re: Comparison between long

2010-12-15 Thread John Hui
the Long into the proper long type in the pig script. ISOToUnixInSecond('$STARTDATETIME') AS startTime:long Hence during the comparison, it treats the Long as a string value ... On Wed, Dec 15, 2010 at 4:28 PM, John Hui wrote: > This is actually, please ignore the code section belo

Re: Comparison between long

2010-12-15 Thread John Hui
('$STARTDATETIME') AS startTime:long; 8 9 eventData = FILTER eventData BY (event == 'adImpression') AND (eventTimestamp <= startTime); 10 11 DESCRIBE eventData; 12 13 B = GROUP eventData BY (event, publication, deviceType, adID, mcc); On Wed, Dec 15, 2010 at 4:21 PM, John Hui wrote

Comparison between long

2010-12-15 Thread John Hui
I am having a hard time getting a comparison to work. I am comparing two long values but I keep getting a long-to-String cast error: Backend error message - java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.String at java.lang.String.compareT

Re: LOAD data USING to parse data in order to obtain the AS as desired.

2010-11-30 Thread John Hui
hird column into > multiple columns. > > -D > > On Tue, Nov 30, 2010 at 9:26 AM, John Hui wrote: > > > You can try using a custom storage parser. > > > > You can see a bunch of examples here.. > > > > > > > pig-0.7.0/contrib/piggybank/java/sr

Re: LOAD data USING to parse data in order to obtain the AS as desired.

2010-11-30 Thread John Hui
ng. > > I want to do something simple: > > I have a data file, mydata.log, formatted like this: > > a1 | b1 | c=foo&d=bar | e1 > a2 | b2 | c=john&d=doe | e2 > a3 | b3 | c=foo&d=doe | e3 > ... > > and I want to LOAD the data USING in order to get the
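
A sketch of the suggested post-load split using the built-in STRSPLIT (field names hypothetical): load the pipe-delimited columns first, then break the third column apart on '&'.

    A = LOAD 'mydata.log' USING PigStorage('|')
        AS (a:chararray, b:chararray, c:chararray, e:chararray);
    B = FOREACH A GENERATE a, b,
        FLATTEN(STRSPLIT(c, '&')) AS (kv1:chararray, kv2:chararray), e;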

Re: UDF Loader - one line in input result in multiple tuples

2010-10-28 Thread John Hui
Awesome Alan, let me try that out and see if it works. John On Thu, Oct 28, 2010 at 11:49 AM, Alan Gates wrote: > > On Oct 28, 2010, at 8:36 AM, John Hui wrote: > > I look into the return data bag as an option. The problem is the Loader >> interface require me to ret

Re: UDF Loader - one line in input result in multiple tuples

2010-10-28 Thread John Hui
Isn't more flexibility good in this case, given how the LoadFunc class was meant to be extended for different use cases? Thanks for all your responses, it really helps knowing I'm not stuck in a hole all by myself! John On Thu, Oct 28, 2010 at 11:42 AM, Dmitriy Ryaboy wrote: > Alan means

Re: UDF Loader - one line in input result in multiple tuples

2010-10-28 Thread John Hui
loader to return a bag of tuples. Right? John On Wed, Oct 27, 2010 at 6:00 PM, John Hui wrote: > Hi Pig Users, > > I am currently writing a UDF loader. In one of my use cases, one line in > the input stream results in multiple tuples. Has anyone encountered or solved > this iss

UDF Loader - one line in input result in multiple tuples

2010-10-27 Thread John Hui
know if there's a use case out there like mine, I am coding it up to return a List, which is more flexible than returning only one tuple. Thanks, John

Add constant to pig output

2010-10-20 Thread John Hui
oupOnApp GENERATE group, COUNT(distinctAppUserIn), *'time_constant'*; Is this possible in Pig? thanks, John
