Seems like at the end of this you have a single bag with all the elements,
and somehow you would like to check whether an element exists in it based
on ipstart/end.
1. Use FLATTEN http://pig.apache.org/docs/r0.9.1/basic.html#flatten -
this will convert the Bag to Tuples: to_tuple = FOREACH ... (see the sketch below).
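A minimal sketch of that FLATTEN step, assuming a hypothetical alias bagged
whose single field segs is the bag with the (ipstart, ipend, name) schema
from this thread:

-- each tuple inside the bag becomes its own output row
to_tuple = FOREACH bagged GENERATE FLATTEN(segs) AS (ipstart, ipend, name);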
Here is a super naive UDF (a sketch; the helper that reads the cached file
is elided). It assumes that you have the data in HDFS, per Dmitriy's
suggestion.
import java.io.IOException;
import java.util.TreeMap;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class MyUdf extends EvalFunc<String> {
    private TreeMap<Long, String> segments; // ipStart -> location name
    public String exec(Tuple input) throws IOException {
        if (segments == null)
            segments = loadSegments(); // load distributed-cache file into the TreeMap (helper elided)
        Long start = segments.floorKey((Long) input.get(0)); // largest ipStart <= the ip
        return start == null ? null : segments.get(start);   // naive: ipEnd check omitted
    }
}
and so on. You might want to lazily initialize the TreeMap inside exec(),
as above, rather than in the constructor.
The detailed Pig code is as below:
raw_ip_segment = load ...
ip_segs = foreach raw_ip_segment generate ipstart, ipend, name;
group_ip_segs = group ip_segs all;
order_ip_segs = foreach group_ip_segs {
order_seg = order ip_segs by ipstart, ipend;
generate 't' as tag, order_seg;
}
describe order_ip_segs;
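From here, a hedged sketch of how the single ordered bag could be used,
assuming a hypothetical FindSegment UDF that binary searches the sorted bag
(ip_records is likewise a stand-in alias for the IPs to look up):

ip_records = LOAD 'ips' AS (ip:long);
-- order_ip_segs holds exactly one tuple, so CROSS just tacks the ordered
-- bag onto every IP record
with_segs = CROSS ip_records, order_ip_segs;
located = FOREACH with_segs GENERATE ip, FindSegment(order_seg, ip) AS name;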
How are you storing segments in a Bag? Can you forward the script?
2011/12/13 唐亮
> Then how can I transfer all the items in a Bag to a Tuple?
>
>
> 2011/12/14 Jonathan Coveney
>
> > It's funny, but if you look way in the past, I actually asked a bunch
> > of questions that circled around, literally, this exact problem.
Then how can I transfer all the items in a Bag to a Tuple?
2011/12/14 Jonathan Coveney
> It's funny, but if you look way in the past, I actually asked a bunch of
> questions that circled around, literally, this exact problem.
>
> Dmitriy and Prashant are correct: the best way is to make a UDF that can
> do the lookup really efficiently.
Seems like the functionality I need can only be achieved with a SPLIT.
Basically, some records could have a superset/subset relation in which one
record would be stored in two places. With bincond that might not be
achievable.
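To illustrate, a hedged sketch with hypothetical aliases and thresholds:
unlike a bincond, SPLIT lets one record satisfy several conditions and be
emitted to every matching output.

SPLIT segments INTO
    wide IF ipend - ipstart >= 256L,
    narrow IF ipend - ipstart <= 65536L;
-- a segment spanning, say, 1024 addresses matches both conditions and
-- lands in both wide and narrow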
On Wed, Dec 7, 2011 at 8:30 PM, Prashant Kommireddi wrote:
> I'm not sure if
It's funny, but if you look way in the past, I actually asked a bunch of
questions that circled around, literally, this exact problem.
Dmitriy and Prashant are correct: the best way is to make a UDF that can do
the lookup really efficiently. This is what the maxmind API does, for
example.
2011
I am lost when you say "If enumerate every IP, it will be more than
1 single IPs"
If each bag is a collection of 3 tuples it might not be too bad on the
memory if you used Tuple to store segments instead?
(8 bytes for one long + 8 bytes for the other long + 20 bytes for the chararray) = 36 bytes
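For the roughly three segments per bag mentioned in this thread, that is on
the order of 3 x 36 = 108 bytes of raw data per bag, before any JVM object
overhead.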
Let's say we incur a
Do you have many such bags or just one? If one, and you want to look up many
IPs in it, it might be more efficient to serialize this relation to HDFS, and
write a lookup udf that specifies the serialized data set as a file to put in
the distributed cache. At init time, load up the file into memory, then do
the lookups against that in-memory copy.
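A hedged sketch of the Pig side of this, where com.example.IpLookup is a
hypothetical UDF class that takes the HDFS path of the serialized segment
data and ships it to the tasks via the distributed cache:

DEFINE IpLookup com.example.IpLookup('/user/foo/ip_segments');
ips = LOAD 'ips' AS (ip:long);
located = FOREACH ips GENERATE ip, IpLookup(ip) AS name;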
Thank you all!
The details:
A bag contains many "IP Segments", whose schema is (ipStart:long,
ipEnd:long, locName:chararray), and the number of tuples is about 3.
I want to check whether an IP belongs to one of the segments in the bag.
I want to order the "IP Segments" by (ipStart, ipEnd) in order to binary
search them.
My assumption is that 唐亮 is trying to do binary search on bags within
the tuples in a relation (i.e. the schema of the relation has a bag column). I
don't think he is trying to treat the entire relation as one bag and do
binary search on that.
-Thejas
On 12/13/11 2:30 PM, Andrew Wells wrote:
I d
Generally speaking, fancy single-machine algorithms are often not doable in
an m/r manner; think about graph operations. So, going back to the original
goal, what you want is to search for occurrences of sth in sth else.
For the purpose of doing this in Pig, I guess maybe one can do a left outer
join.
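Since an IP-to-range check has no equality key for a plain join, a blunt
sketch of the same idea is a CROSS followed by a FILTER (workable here only
because the segment relation is tiny; ips is a hypothetical alias, ip_segs
is from the script earlier in the thread):

pairs = CROSS ips, ip_segs;
hits = FILTER pairs BY ip >= ipstart AND ip <= ipend;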
I'll file a bug if this is a bug, but here's an example of a script that
will generate the error:
A = LOAD 'thing1' as (x:chararray);
B = LOAD 'thing2' AS (y:long);
C = LOAD 'thing3' AS (y:long,x:chararray);
joined_1 = join B by y, C by y;
D = foreach joined_1 generate x;
joined_2 = join D by x, A by x;
Oh, I might as well make a suggestion for random access.
Try looking into HBase.
On Tue, Dec 13, 2011 at 5:30 PM, Andrew Wells wrote:
> I don't think this could be done,
>
> pig is just a hadoop job, and the idea behind hadoop is to read all the
> data in a file.
>
> so by the time you put all the data into an array, you would have been
> better off just checking each element for the one you were looking for.
I don't think this can be done;
pig is just a hadoop job, and the idea behind hadoop is to read all the
data in a file.
so by the time you put all the data into an array, you would have been
better off just checking each element for the one you were looking for.
So what you would get is [n + log(n)] work instead of just [n].
In the pig9 branch in svn, running this gives me an error:
a = load 'thing' as (x:int);
b = group a by x;
c = foreach b generate group as x, COUNT(a) as count;
d = limit (order c by count DESC) 2000;
describe d;
2011-12-13 13:56:32,167 [main] ERROR org.apache.pig.tools.grunt.Grunt
- ERROR 1200:
It works for me with 0.9.1. Not sure what else it could be; '\r' if
you're on windows? Can you confirm that you don't have any funny
newline characters, e.g., using 'od -h'.
On Tue, Dec 13, 2011 at 2:47 PM, IGZ Nick wrote:
> DUMP works as expected
> If I write the exact same thing in one line, it works.
DUMP works as expected
If I write the exact same thing in one line, it works. I remember seeing a
JIRA for this some time back, but am not able to find it now.
On Wed, Dec 14, 2011 at 12:23 AM, Stan Rosenberg <
srosenb...@proclivitysystems.com> wrote:
> There is something syntactically wrong with your script.
Ah, ok. Isn't there anything that would take the elements in order as they
are? Because mapping each field would almost lead to the same coupling
between the schema file and the pig script which I am trying to avoid.
On Wed, Dec 14, 2011 at 12:21 AM, Bill Graham wrote:
> You still need to map the Tuple fields to the avro schema fields.
Bags can be very large and might not fit into memory; in such cases some
or all of the bag might have to be stored on disk, and it is then not
efficient to do random access on the bag. That is why the DataBag
interface does not support it.
As Prashant suggested, storing it in a tuple would allow random access.
There is something syntactically wrong with your script.
MismatchedTokenException seems to indicate that the semicolon
character was expected (ttype==93).
What happens if you replace the entire "STORE A ..." line by say "DUMP A"?
On Tue, Dec 13, 2011 at 1:17 PM, IGZ Nick wrote:
> Hi Stan,
>
> Here is my pig script:
You still need to map the Tuple fields to the avro schema fields. See the
unit test for an example, or section 4.C of the documentation. It reads the
schema from a data file, but the same approach is used when using
schema_file instead.
https://cwiki.apache.org/confluence/display/PIG/AvroStorage
Hi Stan,
Here is my pig script:
REGISTER avro-1.4.0.jar
REGISTER joda-time-1.6.jar
REGISTER json-simple-1.1.jar
REGISTER jackson-core-asl-1.5.5.jar
REGISTER jackson-mapper-asl-1.5.5.jar
REGISTER pig-0.9.1-SNAPSHOT.jar
REGISTER dwh-udf-0.1.jar
REGISTER piggybank.jar
REGISTER linkedin-pig-0.8.jar
RE
Hi Bill,
I tried schema_file but I get this error:
grunt> STORE A INTO '/user/hshankar/out1' USING
org.apache.pig.piggybank.storage.avro.AvroStorage ('{"schema_file":
"/user/hshankar/schema1.schema"}');
2011-12-13 18:06:00,879 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1200: could not
Yes, you can reference an Avro schema file in HDFS with the "schema_file"
param. See TestAvroStorage.testRecordWithFieldSchemaFromTextWithSchemaFile
here for an example:
http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAv
Hi,
I noticed the following as I'm learning Pig, which I thought I'd share to get
some insight on. It seems to me that Pig cannot (automatically) distinguish
between an empty tuple and a tuple with a single null value. Example:
data = LOAD 'testusers.dat' as (user, empty:tuple(), partial:tupl
The following test script works for me:
=
A = load '$LOGS' using org.apache.pig.piggybank.storage.avro.AvroStorage();
describe A;
B = foreach A generate region as my_region, google_ip;
dump B;
store B into './output' using org.apache.pig.piggybank.sto
Hi all,
I want to keep the pig script and storage schema separate. Is it possible
to do this in a clean way? The only way that has worked so far is to do
like:
AvroStorage('schema',
'{"name":"xyz","type":"record","fields":[{"name":"abc","type":"string"}]}');
That too, with all the schema on one line.