Re: Implement Binary Search in PIG

2011-12-13 Thread Prashant Kommireddi
Seems like at the end of this you have a single bag with all the elements, and somehow you would like to check whether an element exists in it based on ipstart/ipend. 1. Use FLATTEN http://pig.apache.org/docs/r0.9.1/basic.html#flatten - this will convert the Bag to a Tuple: to_tuple = FOREACH

Re: Implement Binary Search in PIG

2011-12-13 Thread Jonathan Coveney
Here is a super naive UDF (pseudocode). It assumes that you have the data in HDFS, per Dmitriy's suggestion. public MyUdf() { Get data from distributed cache load data into a TreeMap } public T exec(Tuple input) { TreeMap.get(input.get(0)); } and so on. You might want to lazily initialize
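The lookup core behind Jonathan's pseudocode can be fleshed out in plain Java. This is a hedged sketch, not the thread's actual code: the class and method names are invented, and the map is keyed by ipStart so that `floorEntry()` finds the only segment that could possibly contain a given IP in O(log n).

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch (names hypothetical): segments keyed by ipStart,
// so floorEntry(ip) returns the only candidate segment for ip.
public class IpSegmentLookup {
    // ipStart -> { ipEnd, locName }
    private final TreeMap<Long, Object[]> segments = new TreeMap<Long, Object[]>();

    public void addSegment(long ipStart, long ipEnd, String locName) {
        segments.put(ipStart, new Object[] { ipEnd, locName });
    }

    // Returns the location name if ip falls inside a segment, else null.
    public String lookup(long ip) {
        Map.Entry<Long, Object[]> e = segments.floorEntry(ip);
        if (e == null) {
            return null; // ip is below every ipStart
        }
        long ipEnd = (Long) e.getValue()[0];
        return ip <= ipEnd ? (String) e.getValue()[1] : null;
    }
}
```

In the UDF, `exec(Tuple)` would simply return `lookup((Long) input.get(0))`, with the map built lazily on the first call, as Jonathan suggests.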

Re: Implement Binary Search in PIG

2011-12-13 Thread 唐亮
The detailed PIG codes are as below: raw_ip_segment = load ... ip_segs = foreach raw_ip_segment generate ipstart, ipend, name; group_ip_segs = group ip_segs all; order_ip_segs = foreach group_ip_segs { order_seg = order ip_segs by ipstart, ipend; generate 't' as tag, order_seg; } describe ord

Re: Implement Binary Search in PIG

2011-12-13 Thread Prashant Kommireddi
How are you storing segments in a Bag? Can you forward the script? 2011/12/13 唐亮 > Then how can I transfer all the items in a Bag to a Tuple? > > > 2011/12/14 Jonathan Coveney > > > It's funny, but if you look way back in the past, I actually asked a bunch > of > > questions that circled around, li

Re: Implement Binary Search in PIG

2011-12-13 Thread 唐亮
Then how can I transfer all the items in a Bag to a Tuple? 2011/12/14 Jonathan Coveney > It's funny, but if you look way back in the past, I actually asked a bunch of > questions that circled around, literally, this exact problem. > > Dmitriy and Prashant are correct: the best way is to make a UDF

Re: Help with optimizing query

2011-12-13 Thread Prashant Kommireddi
Seems like the functionality I need can only be achieved with a SPLIT. Basically some records could have a Superset/subset relation in which 1 record would be stored in 2 places. With bincond that might not be achievable. On Wed, Dec 7, 2011 at 8:30 PM, Prashant Kommireddi wrote: > Im not sure if

Re: Implement Binary Search in PIG

2011-12-13 Thread Jonathan Coveney
It's funny, but if you look way back in the past, I actually asked a bunch of questions that circled around, literally, this exact problem. Dmitriy and Prashant are correct: the best way is to make a UDF that can do the lookup really efficiently. This is what the maxmind API does, for example. 2011

Re: Implement Binary Search in PIG

2011-12-13 Thread Prashant Kommireddi
I am lost when you say "If enumerate every IP, it will be more than 1 single IPs". If each bag is a collection of 3 tuples it might not be too bad on the memory if you used Tuple to store segments instead? (8 bytes long + 8 bytes long + 20 bytes for chararray) = 36 bytes. Let's say we incur a

Re: Implement Binary Search in PIG

2011-12-13 Thread Dmitriy Ryaboy
Do you have many such bags or just one? If one, and you want to look up many IPs in it, it might be more efficient to serialize this relation to HDFS, and write a lookup udf that specifies the serialized data set as a file to put in the distributed cache. At init time, load up the file into memory, the
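The init step Dmitriy describes, reading the serialized relation into memory, might look like this in plain Java. The tab-separated line format and all names here are assumptions for illustration, not anything specified in the thread:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.TreeMap;

// Hedged sketch: parse lines of "ipStart<TAB>ipEnd<TAB>name" (an assumed
// serialization format) into a TreeMap keyed by ipStart. In a UDF this
// would run once, at init time, against the distributed-cache copy.
public class SegmentLoader {
    public static TreeMap<Long, String> load(Reader in) throws IOException {
        TreeMap<Long, String> map = new TreeMap<Long, String>();
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            String[] parts = line.split("\t");
            // keep "ipEnd<TAB>name" as the value to keep the sketch simple
            map.put(Long.parseLong(parts[0]), parts[1] + "\t" + parts[2]);
        }
        return map;
    }
}
```

Lookups then proceed against the in-memory map, so each call costs O(log n) with no HDFS round trip.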

Re: Implement Binary Search in PIG

2011-12-13 Thread 唐亮
Thank you all! The detail is: A bag contains many "IP Segments", whose schema is (ipStart:long, ipEnd:long, locName:chararray) and the number of tuples is about 3, and I want to check whether an IP belongs to one segment in the bag. I want to order the "IP Segments" by (ipStart, ipEnd) in
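Once the segments are sorted by (ipStart, ipEnd), the membership check 唐亮 describes is a classic binary search on the ipStart column. A plain-Java sketch, using hypothetical parallel arrays rather than any layout from the thread:

```java
// Hedged sketch of the binary search itself: starts[] holds the sorted
// ipStart values and ends[] the matching ipEnd values (parallel arrays
// are an illustration, not the thread's actual data layout).
public class SegmentSearch {
    // Returns true if ip lies inside some [starts[i], ends[i]] range.
    public static boolean contains(long[] starts, long[] ends, long ip) {
        int lo = 0, hi = starts.length - 1, candidate = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (starts[mid] <= ip) {
                candidate = mid; // last segment starting at or before ip
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return candidate >= 0 && ip <= ends[candidate];
    }
}
```

Because segments are ordered and (presumably) non-overlapping, only the last segment whose ipStart is at or below the IP needs its ipEnd checked.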

Re: Implement Binary Search in PIG

2011-12-13 Thread Thejas Nair
My assumption is that 唐亮 is trying to do binary search on bags within the tuples in a relation (i.e., the schema of the relation has a bag column). I don't think he is trying to treat the entire relation as one bag and do binary search on that. -Thejas On 12/13/11 2:30 PM, Andrew Wells wrote: I d

Re: Implement Binary Search in PIG

2011-12-13 Thread jiang licht
Generally speaking, fancy algorithms for a single machine are oftentimes not doable in a m/r manner; think about graph operations. So, go back to the original goal: what you want is to search for an occurrence of something in something else. For the purpose of doing this in pig, I guess maybe one can do a left o

Join alias inference doesn't work when there are multiple joins

2011-12-13 Thread Jonathan Coveney
I'll file a bug if this is a bug, but here's an example of a script that will generate the error: A = LOAD 'thing1' as (x:chararray); B = LOAD 'thing2' AS (y:long); C = LOAD 'thing3' AS (y:long,x:chararray); joined_1 = join B by y, C by y; D = foreach joined_1 generate x; joined_2 = join D by x

Re: Implement Binary Search in PIG

2011-12-13 Thread Andrew Wells
Oh, I might as well make a suggestion for random access. Try looking into HBase. On Tue, Dec 13, 2011 at 5:30 PM, Andrew Wells wrote: > I don't think this could be done, > > pig is just a hadoop job, and the idea behind hadoop is to read all the > data in a file. > > so by the time you put all t

Re: Implement Binary Search in PIG

2011-12-13 Thread Andrew Wells
I don't think this can be done; pig is just a hadoop job, and the idea behind hadoop is to read all the data in a file. So by the time you put all the data into an array, you would have been better off just checking each element for the one you were looking for. So what you would get is [n + l

limit (order by relation) 100 broken in pig9?

2011-12-13 Thread Jonathan Coveney
In the pig9 branch in svn, running this gives me an error: a = load 'thing' as (x:int); b = group a by x; c = foreach b generate group as x, COUNT(a) as count; d = limit (order c by count DESC) 2000; describe d; 2011-12-13 13:56:32,167 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200:

Re: Using AvroStorage()

2011-12-13 Thread Stan Rosenberg
It works for me with 0.9.1. Not sure what else it could be; '\r' if you're on windows? Can you confirm that you don't have any funny newline characters, e.g., using 'od -h'. On Tue, Dec 13, 2011 at 2:47 PM, IGZ Nick wrote: > DUMP works as expected > If I write the exact same thing in one line,

Re: Using AvroStorage()

2011-12-13 Thread IGZ Nick
DUMP works as expected. If I write the exact same thing in one line, it works. I remember seeing a JIRA for this some time back, but am not able to find it now. On Wed, Dec 14, 2011 at 12:23 AM, Stan Rosenberg < srosenb...@proclivitysystems.com> wrote: > There is something syntactically wrong wit

Re: Using AvroStorage()

2011-12-13 Thread IGZ Nick
ah ok.. Isn't there anything that would take the elements in order as it is? Because mapping each field would almost lead to the same coupling between the schema file and the pig script which I am trying to avoid On Wed, Dec 14, 2011 at 12:21 AM, Bill Graham wrote: > You still need to map the Tu

Re: Implement Binary Search in PIG

2011-12-13 Thread Thejas Nair
Bags can be very large and might not fit into memory, and in such cases some or all of the bag might have to be stored on disk. In such cases, it is not efficient to do random access on the bag. That is why the DataBag interface does not support it. As Prashant suggested, storing it in a tuple wou
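Thejas's distinction, that a bag exposes only an iterator while a tuple supports positional access, is the crux: binary search needs get(i). A plain-Java analogy (deliberately not Pig's actual DataBag/Tuple API):

```java
import java.util.Iterator;
import java.util.List;

// Plain-Java analogy, not Pig API: an iterator-only collection forces an
// O(n) scan, while indexed access makes an O(log n) binary search possible.
public class AccessDemo {
    // All an iterator-only "bag" permits: a linear scan.
    public static boolean scan(Iterable<Long> bag, long key) {
        for (Iterator<Long> it = bag.iterator(); it.hasNext();) {
            if (it.next() == key) {
                return true;
            }
        }
        return false;
    }

    // Indexed access (the "tuple" side) enables binary search.
    public static boolean search(List<Long> sorted, long key) {
        int lo = 0, hi = sorted.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            long v = sorted.get(mid); // random access: unavailable on a bag
            if (v == key) return true;
            if (v < key) lo = mid + 1; else hi = mid - 1;
        }
        return false;
    }
}
```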

Re: Using AvroStorage()

2011-12-13 Thread Stan Rosenberg
There is something syntactically wrong with your script. MismatchedTokenException seems to indicate that the semicolon character was expected (ttype==93). What happens if you replace the entire "STORE A ..." line by say "DUMP A"? On Tue, Dec 13, 2011 at 1:17 PM, IGZ Nick wrote: > Hi Stan, > > Her

Re: Using AvroStorage()

2011-12-13 Thread Bill Graham
You still need to map the Tuple fields to the avro schema fields. See the unit test for an example, or section 4.C of the documentation. It reads the schema from a data file, but the same approach is used when using schema_file instead. https://cwiki.apache.org/confluence/display/PIG/AvroStorage

Re: Using AvroStorage()

2011-12-13 Thread IGZ Nick
Hi Stan, Here is my pig script: REGISTER avro-1.4.0.jar REGISTER joda-time-1.6.jar REGISTER json-simple-1.1.jar REGISTER jackson-core-asl-1.5.5.jar REGISTER jackson-mapper-asl-1.5.5.jar REGISTER pig-0.9.1-SNAPSHOT.jar REGISTER dwh-udf-0.1.jar REGISTER piggybank.jar REGISTER linkedin-pig-0.8.jar RE

Re: Using AvroStorage()

2011-12-13 Thread IGZ Nick
Hi Bill, I tried schema_file but I get this error: grunt> STORE A INTO '/user/hshankar/out1' USING org.apache.pig.piggybank.storage.avro.AvroStorage ('{"schema_file": "/user/hshankar/schema1.schema"}'); 2011-12-13 18:06:00,879 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: could not

Re: Using AvroStorage()

2011-12-13 Thread Bill Graham
Yes, you can reference an Avro schema file in HDFS with the "schema_file" param. See TestAvroStorage.testRecordWithFieldSchemaFromTextWithSchemaFile here for an example: http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAv
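For reference, a schema_file such as the schema1.schema mentioned earlier in the thread holds ordinary Avro record JSON. A minimal hypothetical example, reusing the illustrative "xyz"/"abc" names that appear elsewhere in the thread (the second field is invented):

```json
{
  "name": "xyz",
  "type": "record",
  "fields": [
    {"name": "abc", "type": "string"},
    {"name": "def", "type": "long"}
  ]
}
```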

empty tuple () vs (null)

2011-12-13 Thread Scott Gorlin
Hi, I noticed the following as I'm learning Pig, which I thought I'd share to get some insight on. It seems to me that Pig cannot (automatically) distinguish between an empty tuple and a tuple with a single null value. Example: data = LOAD 'testusers.dat' as (user, empty:tuple(), partial:tupl

Re: Using AvroStorage()

2011-12-13 Thread Stan Rosenberg
The following test script works for me: A = load '$LOGS' using org.apache.pig.piggybank.storage.avro.AvroStorage(); describe A; B = foreach A generate region as my_region, google_ip; dump B; store B into './output' using org.apache.pig.piggybank.sto

Using AvroStorage()

2011-12-13 Thread IGZ Nick
Hi all, I want to keep the pig script and storage schema separate. Is it possible to do this in a clean way? The only way that has worked so far is to do like: AvroStorage('schema', '{"name":"xyz","type":"record","fields":[{"name":"abc","type":"string"}]}'); That too, all the schema in one line.