Re: pig latin & lucene

2013-01-07 Thread Dmitriy Ryaboy
Details: https://github.com/kevinweil/elephant-bird/wiki/Elephant-Bird-Lucene On Fri, Jan 4, 2013 at 7:55 AM, Bill Graham wrote: > ElephantBird now has pig-lucene support: > > > https://github.com/kevinweil/elephant-bird/blob/master/pig-lucene/src/main/java/com/twitter/elephantbird/pig/load/Luc

Re: Making Pig run faster in local mode

2013-01-07 Thread Dmitriy Ryaboy
Try jstacking it a few times while it's running. Is it just sitting idly in a sleep()? On Mon, Jan 7, 2013 at 11:56 AM, Cheolsoo Park wrote: > Typo: it makes much sense to run them in cluster => it doesn't make much > sense to run them in cluster. > > On Mon, Jan 7, 2013 at 11:55 AM, Cheolsoo P

Re: Declaring schema for unknown number of columns

2013-01-07 Thread Jinyuan Zhou
Sorry, looks like my suggestion won't help unless you are able to specify the schema in the original load statement. If the number of fields is ONLY available at runtime, but each row has the same number of fields and you know the position of the join key, then I have an ugly approach. First, sample the

Re: Declaring schema for unknown number of columns

2013-01-07 Thread Chan, Tim
Hi Jinyuan, Since I don't know how many columns I will have, I do something like this. six_month_and_variable_month_sales_2 = FOREACH six_month_and_variable_month_sales GENERATE $0 AS ed_style_id, $1 AS sale_start_month, $2 AS sale_month_1, $3 AS sale_month_2, $4 AS sale_month_3
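The positional projection above can be generated rather than typed by hand once the column count is known. The following is a minimal sketch, not something posted in the thread: the helper name build_projection is hypothetical, and the naming pattern beyond sale_month_3 is an assumption extrapolated from the visible part of the statement.

```python
def build_projection(relation, num_columns):
    """Build a Pig FOREACH ... GENERATE statement that names columns by position.

    Assumes (hypothetically) that the first two columns are ed_style_id and
    sale_start_month and that every remaining column is sale_month_<k>.
    """
    if num_columns < 2:
        raise ValueError("expected at least the two leading columns")
    names = ["ed_style_id", "sale_start_month"]
    names += [f"sale_month_{k}" for k in range(1, num_columns - 1)]
    # "$" is a literal character here; {i} supplies the positional index.
    fields = ", ".join(f"${i} AS {name}" for i, name in enumerate(names))
    return f"FOREACH {relation} GENERATE {fields};"


statement = build_projection("six_month_and_variable_month_sales", 5)
print(statement)
```

The generated text can be pasted into a script or written to a file before invoking Pig; the column count still has to be discovered separately.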

Re: JsonLoader schema field order shouldn't matter

2013-01-07 Thread Tim Sell
Hmm, I was using pretty much the same setup and got errors complaining about Counter being an interface when it expected a class. I'll try again with the jars straight out of maven tomorrow. Thanks. ~T On 7 January 2013 21:32, meghana narasimhan wrote: > Hi Tim, > > We are using elephant-bird 3.

Re: JsonLoader schema field order shouldn't matter

2013-01-07 Thread Tim Sell
This seems like a bug to me. It makes it risky to work with JSON data generated by something other than Pig since the ordering might change. What do you think? I didn't see a bug for it in Jira, so would this (still open) one be the place to mention it? Or should I make a new one? https://issues.a

Re: Declaring schema for unknown number of columns

2013-01-07 Thread Jinyuan Zhou
If you can load it but the join operation needs the complete schema, then you can try a GENERATE statement that projects your original relation into one for which you can define a schema for all fields. On Mon, Jan 7, 2013 at 2:19 PM, Chan, Tim wrote: > Is it possible to declare a schema when doing a

Declaring schema for unknown number of columns

2013-01-07 Thread Chan, Tim
Is it possible to declare a schema when doing a LOAD for data in which you do not know the total number of columns? For instance, I know the data contains 6 or more columns. These columns are of the same data type. I basically want to join this data with another data set, but I was getting the fo

Re: JsonLoader schema field order shouldn't matter

2013-01-07 Thread meghana narasimhan
Hi Tim, We are using elephant-bird 3.0.2 with hadoop-2.0.0-mr1-cdh4.1.1 and pig-0.10.0-cdh4.1.1. We are using the jar available in the maven repo. We didn't have to build it ourselves. - Meg On Mon, Jan 7, 2013 at 11:56 AM, Tim Sell wrote: > When using JsonLoader with Pig 0.10.0 > > if I have an input.

Re: JsonLoader schema field order shouldn't matter

2013-01-07 Thread Alan Gates
Currently the JsonLoader does assume ordering of the fields. It does not do any name matching against the given schema to find the right field. Alan. On Jan 7, 2013, at 11:56 AM, Tim Sell wrote: > When using JsonLoader with Pig 0.10.0 > > if I have an input.json file that looks like this: >
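To make the difference concrete, here is a small Python illustration of positional matching versus name matching. It is only a sketch of the two strategies, not Pig's JsonLoader code; the function names are hypothetical, and the schema and record mirror the example from the original post in this thread.

```python
import json

# Mirrors the schema string 'id:int,date:chararray' (name, converter) pairs.
SCHEMA = [("id", int), ("date", str)]


def load_positional(line, schema=SCHEMA):
    """Assign values to schema fields in file order, ignoring field names."""
    values = list(json.loads(line).values())
    return tuple(cast(value) for (_, cast), value in zip(schema, values))


def load_by_name(line, schema=SCHEMA):
    """Look each schema field up by name, so key order in the file is irrelevant."""
    record = json.loads(line)
    return tuple(cast(record[name]) for name, cast in schema)


line = '{"date": "2007-08-25", "id": 16}'

print(load_by_name(line))

try:
    load_positional(line)
except ValueError as exc:
    # The date string lands in the int field because only position is used.
    print("positional load failed:", exc)
```

With positional matching, the schema has to list fields in the same order as they appear in the file; name matching removes that constraint.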

JsonLoader schema field order shouldn't matter

2013-01-07 Thread Tim Sell
When using JsonLoader with Pig 0.10.0 if I have an input.json file that looks like this: {"date": "2007-08-25", "id": 16} {"date": "2007-09-08", "id": 17} {"date": "2007-09-15", "id": 18} And I use a = LOAD 'input.json' USING JsonLoader('id:int,date:chararray'); DUMP a; I get errors when it tr

Re: Making Pig run faster in local mode

2013-01-07 Thread Cheolsoo Park
Typo: it makes much sense to run them in cluster => it doesn't make much sense to run them in cluster. On Mon, Jan 7, 2013 at 11:55 AM, Cheolsoo Park wrote: > it makes much sense to run them in cluster.

Re: Making Pig run faster in local mode

2013-01-07 Thread Cheolsoo Park
Hi Malc, >> When you say to use MR mode, do you mean install hadoop onto the node? I meant the cluster mode, but given the size of your input files, it makes much sense to run them in cluster. Instead, you might consider executing jobs in parallel in local mode if it's possible to process inpu
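One way to run several local-mode jobs side by side is to launch one Pig process per input file from a small driver. The following is a rough sketch under stated assumptions, not a method given in the thread: the script name process.pig, the file names, and the "input" parameter are hypothetical, and the script is assumed to read its input path from $input. Only the standard pig options -x local and -param are used.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(script, input_file):
    """Command line for one local-mode Pig run over a single input file."""
    return ["pig", "-x", "local", "-param", f"input={input_file}", script]


def run_all(script, input_files, max_workers=3, runner=subprocess.call):
    """Run one Pig process per input file, at most max_workers at a time.

    Returns the exit codes in the same order as input_files.
    """
    commands = [build_command(script, name) for name in input_files]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(runner, commands))


if __name__ == "__main__":
    # Hypothetical script and file names.
    codes = run_all("process.pig", ["file1.log", "file2.log", "file3.log"])
    print(codes)
```

Threads are enough here because each worker only waits on an external process. Whether this helps depends on the jobs being independent and on the machine having spare cores and memory.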

RE: Making Pig run faster in local mode

2013-01-07 Thread Malcolm Tye
Hi, It's Pig 0.10.0. Here are some timings I took. I have more than 3 files to process, but I just started out with 3 files to get some numbers.

# Files   Time(s)
1         28
2         48
3         73

Cheolsoo, the documentation does seem to indicate that you wi
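A quick calculation on the three timings above shows how the cost scales with the number of files. This is only arithmetic on the posted numbers; the variable names are illustrative.

```python
# Timings from the message: number of files -> elapsed seconds.
timings = {1: 28, 2: 48, 3: 73}

# Average seconds per file for each run.
per_file = {count: seconds / count for count, seconds in timings.items()}

# Extra seconds incurred by each additional file.
counts = sorted(timings)
marginal = [timings[b] - timings[a] for a, b in zip(counts, counts[1:])]

for count in counts:
    print(f"{count} file(s): {timings[count]} s total, {per_file[count]:.1f} s per file")
print("extra seconds per added file:", marginal)
```

The second and third files add 20 and 25 seconds respectively, so the total grows roughly linearly with the number of files over this small sample.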