Re: [VOTE] Pig 1.0!

2011-03-09 Thread Dmitriy Ryaboy
The LoadFunc refactoring was painful. I think what you are describing absolutely needs to happen, but may need to be a 2.0 thing. On Wed, Mar 9, 2011 at 5:06 PM, Julien Le Dem wrote: > Before moving to 1.0, I think the public APIs should be refactored a bit. > (UDFs, ...: all the classes users e

Re: [VOTE] Pig 1.0!

2011-03-09 Thread Julien Le Dem
Before moving to 1.0, I think the public APIs should be refactored a bit. (UDFs, ...: all the classes users extend or use) Some of the Pig APIs have grown organically and would need changes. examples: - inconsistencies between EvalFunc and Accumulator - Algebraic UDFs can not pass FuncSpec param

Re: Any reason a bunch of nearly-identical jobs would suddenly stop working?

2011-03-09 Thread Mridul Muralidharan
Did you try checking the task logs? There might be more details there ... Regards, Mridul On Wednesday 09 March 2011 04:23 AM, Kris Coward wrote: So I queued up a batch of jobs last night to run overnight (and into the day a bit, owing to a bottleneck on the scheduler the way that things

Re: Schema

2011-03-09 Thread Mridul Muralidharan
In which case, can't you model that as a Bag? I imagine something like a Tuple with fields person:chararray, books_read:bag{ (name:chararray, isbn:chararray) }, etc.? Of course, it will work as a bag if the tuple contained within it has a fixed schema :-) (unless you repeat this process N nu

Re: Any reason a bunch of nearly-identical jobs would suddenly stop working?

2011-03-09 Thread Guy Bayes
Question, do normal map-reduce jobs run on this cluster? Like the example jar jobs? Guy On Mar 9, 2011, at 2:29 PM, Kris Coward wrote: > > Also, reading some uncompressed data off the same cluster using > PigStorage shows a failure to even read the data in the first place :| > > -K > > On

Any additional consultant-finding resources?

2011-03-09 Thread Kris Coward
Does anyone know of any resources other than http://wiki.apache.org/hadoop/Support for finding consultants who'd be able to help with pig/hadoop administration issues? With my sysadmin on vacation, I'm just looking for someone who can get things running again without completely displacing him, and

Re: Any reason a bunch of nearly-identical jobs would suddenly stop working?

2011-03-09 Thread Kris Coward
Also, reading some uncompressed data off the same cluster using PigStorage shows a failure to even read the data in the first place :| -K On Tue, Mar 08, 2011 at 09:24:18PM -0500, Kris Coward wrote: > > None of the nodes have more than 20% utilization on any of their disks; > so it must be the

Re: Limiting output

2011-03-09 Thread Eric Lubow
Are you looking for: udf_regex_results = my_UDF(...); limited_regex_results = LIMIT udf_regex_results 10; -- 10 is configurable -e On Wed, Mar 9, 2011 at 13:58, souri datta wrote: > Hi, > I have a big dataset which contains mainly urls and their html > contents. Now given a regular expression

Limiting output

2011-03-09 Thread souri datta
Hi, I have a big dataset which contains mainly URLs and their HTML contents. Now, given a regular expression, I want to get 'x' number of URLs matching the regex pattern. I have written a UDF to filter out URLs based on a regular expression. Is there a way in a Pig script to limit the number of results

Fwd: First Hadoop meetup in Houston

2011-03-09 Thread Alan Gates
Begin forwarded message: From: Mark Kerzner Date: March 7, 2011 7:37:38 PM PST To: Hadoop Discussion Group Subject: First Hadoop meetup in Houston Reply-To: "common-u...@hadoop.apache.org" > Hi, I have just created the Houston Hadoop Meetup group, and all suggestions are welcome. http

Re: STORE with variable?

2011-03-09 Thread Xiaomeng Wan
Sorry to hear that. We used it in an old project. It works well with Pig 0.6.0. Shawn On Tue, Mar 8, 2011 at 3:04 PM, Dexin Wang wrote: > Unfortunately, it doesn't work. > Seems the same problem as in https://issues.apache.org/jira/browse/PIG-1547 > > On Tue, Mar 8, 2011 at 1:22 PM, Dexin Wang wr

RE: Schema

2011-03-09 Thread Lai Will
It's the latter. You can imagine my EvalFunc as ArrayList booksRead(Person p) {} So for a list of people I get a List of ArrayList of different lengths. -Original Message- From: Jonathan Coveney [mailto:jcove...@gmail.com] Sent: Wednesday, March 09, 2011 6:12 PM To: user@pig.apache.or

Re: Schema

2011-03-09 Thread Jonathan Coveney
In any given instance will the size of the tuple change, or will it change on a row by row basis? If it's the former, you can have a constructor that indicates how many arguments, and the outputSchema can use that. Barring that, it is "good practice" to do so, but it's not necessary. Your script w

Schema

2011-03-09 Thread Lai Will
Hello, I read that it is good practice to declare the schema in the Pig script as well as in the UDF (by implementing outputSchema), for performance reasons. Now in my case I have an EvalFunc that takes a chararray as input and produces a tuple with a dynamic number of chararrays (it creates