Re: extract string in Pig

2011-06-27 Thread Jonathan Holloway
Take a look at REGEX_EXTRACT (http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#REGEX_EXTRACT) and REGEX_EXTRACT_ALL (http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#REGEX_EXTRACT_ALL). You could also use SUBSTRING, but I think a regex would be more applicable here for date/time extracti
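As a rough illustration of the regex approach suggested above, here is a minimal sketch in plain Python rather than Pig Latin. The log line and the pattern are hypothetical, since the original question's data is not shown in this snippet. Picking one capture group by index is broadly what REGEX_EXTRACT does inside a Pig script.

```python
import re

# Hypothetical log line; the real format from the thread is not shown here.
LINE = "2011-06-27 14:05:33 INFO job started"

# One capture group for the date and one for the time (assumed layout).
DATE_TIME = re.compile(r"(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2})")


def extract_group(line, index):
    """Return capture group `index` (1-based), or None when nothing matches."""
    match = DATE_TIME.search(line)
    return match.group(index) if match else None


date = extract_group(LINE, 1)
time = extract_group(LINE, 2)
```

A fixed-offset SUBSTRING would also work for a rigid layout like this one, but the regex keeps working if the timestamp moves within the line.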

Re: Invalid Alias Issue

2011-06-24 Thread Jonathan Holloway
I ended up fixing this issue - I did change it to a bag afterwards, but the main problem was that REGEX_EXTRACT_ALL was returning everything as a string (via group), which meant that MAX, AVG etc. were not matched as functions for a bag of tuples of doubles. I ended up writing a new udf for extr
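The underlying point, that regex capture groups come back as strings and need an explicit conversion before numeric aggregation, can be sketched in plain Python. This is only an illustration, not the UDF mentioned above: the pattern is taken from the "Invalid Alias Issue" post with Pig's doubled backslashes undone, and the sample line and the `parse_count` helper are hypothetical.

```python
import re

# Pattern from the "Invalid Alias Issue" post, written as a Python raw string.
PATTERN = re.compile(r".*Output.Count\s*\-\s*([A-Za-z\.]+)\s*(\d+)")


def parse_count(line):
    """Return (name, count) with count converted to float, or None if no match."""
    match = PATTERN.fullmatch(line)
    if match is None:
        return None
    name, count = match.groups()  # both groups are plain strings at this point
    return name, float(count)     # explicit cast so numeric functions apply


# Hypothetical log line, made up for this sketch.
parsed = parse_count("INFO Output.Count - my.event 42")
```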

Invalid Alias Issue

2011-06-23 Thread Jonathan Holloway
Hi all, I'm getting the exception (at the end) from the following using Pig: eLine = FOREACH logLine GENERATE FLATTEN( REGEX_EXTRACT_ALL( $0, '.*Output.Count\\s*\\-\\s*([A-Za-z\\.]+)\\s*(\\d+)' ) ) AS (ename:CHARARRAY

Flatten a bag to a specific datatype

2011-06-22 Thread Jonathan Holloway
I'm having trouble trying to flatten a bag to a tuple of ints in Pig, e.g. {(12),(4),(7),(190)} to: (12,4,7,190) It seems like it should be trivial to do, but I'm not quite sure how to do it. Can this be done with built-in Pig commands or do I need a custom UDF or an exec? Many thanks, Jon.
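The transformation being asked for can be sketched in plain Python, modelling the bag as a list of one-field tuples. This only illustrates the logic a custom UDF would need; it is not a Pig built-in, and `bag_to_tuple` is a hypothetical name. Note that a Pig bag is unordered, so a real implementation could not rely on the element order shown here.

```python
def bag_to_tuple(bag):
    """Collapse a bag of single-field tuples into one tuple of ints."""
    return tuple(int(value) for (value,) in bag)


# The example from the question: {(12),(4),(7),(190)}
bag = [(12,), (4,), (7,), (190,)]
flat = bag_to_tuple(bag)
```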

Re: Pig 0.9 Changes and Control Flow Structures

2011-06-20 Thread Jonathan Holloway
http://svn.apache.org/viewvc/pig/branches/branch-0.9/src/docs/src/documentation/content/xdocs/cont.xml?view=markup > > Alan. > > > On Jun 20, 2011, at 8:03 AM, Jonathan Holloway wrote: > > Hi all, >> >> Does anybody have a list

Pig 0.9 Changes and Control Flow Structures

2011-06-20 Thread Jonathan Holloway
Hi all, Does anybody have a list of the features for the Pig 0.9 release? I noticed from SVN that control flow structures have been added. How would these work with 0.9? Many thanks, Jon.

Re: PigStorage and ElephantBird's JsonLoader - InputFormat

2011-06-17 Thread Jonathan Holloway
but not the uncompressed version. > > On Jun 15, 2011, at 6:57 PM, Jonathan Holloway < > jonathan.hollo...@gmail.com> wrote: > > > Hi all, > > > > I was wondering whether somebody could explain how Pig deals with nested > > directories of log files, > > Somet

PigStorage and ElephantBird's JsonLoader - InputFormat

2011-06-15 Thread Jonathan Holloway
Hi all, I was wondering whether somebody could explain how Pig deals with nested directories of log files, something like: /logs/2011-01-01/a.log /logs/2011-01-01/b.log /logs/2011-01-01/c.log I'm pretty sure if I give a Pig script the /logs directory as input it will successfully process all log

Computing Histograms with UDF's

2011-04-14 Thread Jonathan Holloway
Hi, This is a follow-on from another question I asked a while ago. I'm calculating percentiles etc. for datasets and I wondered how I would do this with a histogram instead so it's more efficient. Does anybody have an example of this currently in the Pig source code or some advice on how to go a
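One way to read the histogram idea is: count values into fixed-width buckets, then walk the cumulative counts until the target rank is reached. The sketch below is plain Python under that assumption, not code from the Pig source; the bucket width, the function names and the choice to report the bucket's upper edge are all illustrative.

```python
import math
from collections import Counter


def build_histogram(values, width):
    """Count values per bucket; bucket k covers [k*width, (k+1)*width)."""
    return Counter(int(v // width) for v in values)


def approx_percentile(histogram, width, p):
    """Upper edge of the bucket that contains the nearest-rank p-th percentile."""
    total = sum(histogram.values())
    rank = max(1, math.ceil(p * total / 100))
    seen = 0
    for bucket in sorted(histogram):
        seen += histogram[bucket]
        if seen >= rank:
            return (bucket + 1) * width
    raise ValueError("empty histogram")


hist = build_histogram(range(100), 10)
median_estimate = approx_percentile(hist, 10, 50)
```

The estimate is only as precise as the bucket width, which is the trade-off for not having to sort or hold every value.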

Merging lines in a log into a single bag

2011-04-07 Thread Jonathan Holloway
Hi all, I have the following: A {(3),(Log Message A)} A {(5),(Log Message B)} B {(8),(Log Message C)} B {(1),(Log Message D)} C {(2),(Log message E)} C {(7),(Log message F)} and I want to merge the related line letters (A, B, C) into the same bag: A{(3),(Log M

Pig Query

2011-04-01 Thread Jonathan Holloway
Hi all, I'm trying to do something with Pig and I'm not quite sure whether it's possible or not. Hoping somebody could provide some help on how to proceed here. I have a log file with a number of log lines that have relationships with each other. The structure of the log line is: DATE, UUI

Custom Storage Functions - MultiStorage

2011-03-31 Thread Jonathan Holloway
Hi all, I'm working with some data at the moment, for which I needed to generate multiple reports for a given grouped set of data by name. I wasn't initially sure how to do this. I came across MultiStorage in Pig contrib, but I'm a little worried about the 7k limit there at the moment due to a b

Storing and reporting off Pig data

2011-03-23 Thread Jonathan Holloway
I've got a general question surrounding the output of various Pig scripts and generally where people are storing that data and in what kind of format? I read Dmitriy's article on Apache log processing and noticed that the output of the scripts was a format more suitable for reporting and graphing

Re: Iterating through the fields in a tuple for use in a filter

2011-03-18 Thread Jonathan Holloway
u'll use to filter? > > It sounds like you'll want to write your own FilterFunc > > 2011/3/18 Jonathan Holloway > >> Hi, >> >> I want to iterate through the fields in a tuple and then pass each field to >> a FILTER statement. >> Does anybody know how I would go about doing this? >> >> Many thanks, >> Jon. >>

Iterating through the fields in a tuple for use in a filter

2011-03-18 Thread Jonathan Holloway
Hi, I want to iterate through the fields in a tuple and then pass each field to a FILTER statement. Does anybody know how I would go about doing this? Many thanks, Jon.

Percentage Calculation for Two Data Groups

2011-03-16 Thread Jonathan Holloway
Hi, Given the following: Group 1 - Tests Totals: (A, 4) (B, 30) (C, 40) (D, 30) Group 2 - Tests Passed: (A,1) (B,30) How would I calculate the percentage of Group 2 / Group 1 using Pig? I'm assuming one way is to join the two datasets and calculate the percentage that way. Another way
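The join-based approach mentioned in the question can be sketched in plain Python using the figures given. Two assumptions are made here that the post does not state: a test name missing from Group 2 counts as zero passes (the behaviour of a left outer join with a default), and groups with a zero total are skipped. `pass_percentages` is a hypothetical helper, not a Pig function.

```python
# Data from the question.
totals = {"A": 4, "B": 30, "C": 40, "D": 30}
passed = {"A": 1, "B": 30}


def pass_percentages(totals, passed):
    """Join on test name and compute passed / total as a percentage."""
    return {
        name: 100.0 * passed.get(name, 0) / total  # missing name: assume 0 passed
        for name, total in totals.items()
        if total  # skip zero totals to avoid dividing by zero
    }


percentages = pass_percentages(totals, passed)
```

Multiplying by 100.0 before dividing keeps the result a float, which matters in Pig as well, where dividing two ints truncates.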

Re: Percentile UDF

2011-03-11 Thread Jonathan Holloway
I'd be interested in hearing about it. Cheers, Jon. On 10 March 2011 21:01, Jonathan Holloway wrote: > Hey Josh, > > That's the path I started down today, I don't suppose the UDF you wrote is > in the public domain > at all - would you consider contributing it to pig

Converting Pig DataTypes to Java Data Types

2011-03-10 Thread Jonathan Holloway
I ran into an issue tonight with parsing log lines whereby I had to generate a schema in a user defined function. Part of that involved converting various values into their associated data types, but I couldn't see a way to do it via Pig. Enclosed is a patch to convert org.apache.pig.data.DataType

Re: Percentile UDF

2011-03-10 Thread Jonathan Holloway
pull out all the fields as double[] and pass > into > Percentile. > > > http://commons.apache.org/math/apidocs/org/apache/commons/math/stat/descriptive/rank/Percentile.html > > Josh > > > On 10 March 2011 19:38, Kris Coward wrote: > > > On Thu, Mar 10, 2011 at 0

Percentile UDF

2011-03-10 Thread Jonathan Holloway
Hi all, Does anybody have a UDF for calculating the percentile (median, 99%) at all? I took a look at the builtins and the piggybank project, but couldn't seem to see anything. Is there a reason why there isn't a builtin for this? Many thanks, Jon.
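For reference, the calculation such a UDF would perform can be sketched in plain Python using the nearest-rank method. This is one of several percentile definitions, chosen here for simplicity, and is not taken from the thread. It requires all values for a group to be sorted in memory.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    if not values:
        raise ValueError("no values")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


median = percentile([5, 1, 3, 2, 4], 50)
p99 = percentile(list(range(1, 101)), 99)
```

Libraries that interpolate between neighbouring values may return slightly different results for the same input.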

Percentile UDF

2011-03-10 Thread Jonathan Holloway
Hi all, Does anybody know if a Percentile UDF exists at all? I've searched through the manual and the piggybank project, but can't seem to see one there. Many thanks, Jon.