Yet more NPEs with TOP

2011-05-06 Thread Jacob Perkins
Hello all, I have a bit of a maddening issue with builtin TOP. Consider the following script: A = LOAD '$DATA' AS (a_bag:bag {t:tuple (value:double)}); B = FOREACH A { top_one = TOP(1,0,a_bag); GENERATE FLATTEN(top_one) AS (value); }; DUMP B; Most of the time this script wor

Re: Working with an unknown number of values

2011-05-06 Thread jacob
On Fri, 2011-05-06 at 16:06 -0600, Christian wrote: > Thank you for taking the time to explain this to me Jacob! > > Am I stuck with hard-coding for my other question? > > Instead of: > 2011-05-01DIRECTIVE132423DIRECTIVE23433DIRECTIVE3 > 1983 > -- > 2011-05-0132423343

Re: Working with an unknown number of values

2011-05-06 Thread Christian
Thank you for taking the time to explain this to me Jacob! Am I stuck with hard-coding for my other question? Instead of: 2011-05-01DIRECTIVE132423DIRECTIVE23433DIRECTIVE3 1983 -- 2011-05-013242334331983 would also do as long as I could count on the column order.

RE: ERROR: String cannot be cast to org.apache.pig.data.Tuple

2011-05-06 Thread william.dowling
In case anyone comes across this ... This problem went away when I fixed a define ... ship(...) to make sure that the file I was shipping was accessible from the running environment on the non-local cluster. William F Dowling Sr Technical Specialist, Software Engineering Thomson Reuters 0 +1 2

Re: Working with an unknown number of values

2011-05-06 Thread jacob
On Fri, 2011-05-06 at 15:38 -0600, Christian wrote: > > > > > #1) Let's say you are tracking messages and extracting the hash tags from > > > the message and storing them as one field (#hash1#hash2#hash3). This > > means > > > you might have a line that looks something like the following: > > >

Re: Working with an unknown number of values

2011-05-06 Thread Christian
> > > #1) Let's say you are tracking messages and extracting the hash tags from > > the message and storing them as one field (#hash1#hash2#hash3). This > means > > you might have a line that looks something like the following: > > 23432011-05-06T03:04:00.000Zusername > > some+message

Re: Working with an unknown number of values

2011-05-06 Thread jacob
Christian, I've answered inline: On Fri, 2011-05-06 at 15:14 -0600, Christian wrote: > I am sorry if this has been asked in the past. I can't seem to find > information on it. > > I have two questions, but they are somewhat related. > > #1) Let's say you are tracking messages and extracting the

Re: Working with an unknown number of values

2011-05-06 Thread Xiaomeng Wan
You can group on the group, like this: A = LOAD '/some/dir' USING PigStorage() AS (date, directive); B = GROUP A BY (date, directive); C = FOREACH B GENERATE FLATTEN(group) AS (date, directive), COUNT(A) AS cnt; D = GROUP C BY date; E = FOREACH D GENERATE group AS date, C.(directive, cnt) AS cnts; Shaw
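
For readers who want to see what shape the "group on group" approach produces, here is a rough Python sketch of the same two-step aggregation: count each (date, directive) pair, then regroup the counts by date. It is not Pig code, and the function name and sample rows are made up for illustration only.

```python
from collections import Counter, defaultdict


def counts_per_date(rows):
    """Emulate the two GROUP steps: count (date, directive) pairs, then regroup by date."""
    # First "GROUP ... BY (date, directive)" plus COUNT.
    pair_counts = Counter(rows)
    # Second "GROUP ... BY date": collect (directive, cnt) pairs under each date.
    by_date = defaultdict(list)
    for (date, directive), cnt in pair_counts.items():
        by_date[date].append((directive, cnt))
    return dict(by_date)


# Hypothetical sample rows of (date, directive); not taken from the thread.
rows = [
    ("2011-05-01", "DIRECTIVE1"),
    ("2011-05-01", "DIRECTIVE1"),
    ("2011-05-01", "DIRECTIVE2"),
    ("2011-05-02", "DIRECTIVE1"),
]
result = counts_per_date(rows)
```

Each date ends up with a variable-length list of (directive, count) pairs, so no column per directive has to be hard-coded.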

Working with an unknown number of values

2011-05-06 Thread Christian
I am sorry if this has been asked in the past. I can't seem to find information on it. I have two questions, but they are somewhat related. #1) Let's say you are tracking messages and extracting the hash tags from the message and storing them as one field (#hash1#hash2#hash3). This means you migh

ERROR: String cannot be cast to org.apache.pig.data.Tuple

2011-05-06 Thread william.dowling
I have a pig script that is tested and working in local mode. But when I try to run it in mapreduce mode on a non-local hadoop cluster I get an error with this stack trace: ERROR 2999: Unexpected internal error. java.lang.String cannot be cast to org.apache.pig.data.Tuple java.lang.ClassCastE

Re: PIG Cassandra - IPs of nodes in a ring

2011-05-06 Thread Jeremy Hanna
Hmmm - if that's the case, then you might try the cassandra user list or ask someone like driftx (brandon) in the #cassandra channel on IRC. He might know what implications there are for that setup. On May 6, 2011, at 1:13 PM, Badrinarayanan S wrote: > Hi, I am running from one of the nodes in

Re: Order By Sampling

2011-05-06 Thread Thejas M Nair
The sampling algorithm for order-by samples 100 records from every map task, using a reservoir sampling algorithm. I can't think of a way to store data that could adversely affect this sampling. This is the class (a pig load function) that is involved in sampling - org.apache.pig.impl.builtin.Ran
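
As a rough illustration of the idea described above, here is a minimal Python sketch of textbook reservoir sampling (often called Algorithm R). It is not Pig's actual implementation; the function name is hypothetical, and only the sample size of 100 comes from the message.

```python
import random


def reservoir_sample(records, k=100, rng=None):
    """Return up to k records chosen uniformly at random from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Keep the new record with probability k / (i + 1),
            # overwriting a uniformly chosen slot.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir


# Example: sample 100 records from a stream of 10,000 with a fixed seed.
sample = reservoir_sample(range(10_000), k=100, rng=random.Random(42))
```

Because each record is kept with the same probability no matter where it appears in the stream, the order in which data is stored should not bias the sample, which matches the point made in the message.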

Re: Query help

2011-05-06 Thread Kim Vogt
Yep, works :) -Kim On Fri, May 6, 2011 at 11:32 AM, jacob wrote: > > Sorry, that's what I get for trying to do things quickly :) > > > A = LOAD 'foo.tsv' AS (item:chararray, user:chararray); > B = GROUP A BY item; > C = FOREACH B { > distinct_users = DISTINCT A.user; > GENERATE >

Re: Query help

2011-05-06 Thread Kim Vogt
Heh, that's it, forgot the COUNT A = LOAD 'data' AS (item:chararray, user:chararray); B = GROUP A BY item; C = FOREACH B { distinct_users = DISTINCT A.user; GENERATE group AS item, COUNT(distinct_users) AS distinct_users ; }; Thanks Jacob. On Fri, May 6, 2011 at 1

Re: Query help

2011-05-06 Thread jacob
Sorry, that's what I get for trying to do things quickly :) A = LOAD 'foo.tsv' AS (item:chararray, user:chararray); B = GROUP A BY item; C = FOREACH B { distinct_users = DISTINCT A.user; GENERATE group AS item, COUNT(distinct_users) AS num_distinct_users ; }

Re: Query help

2011-05-06 Thread Kim Vogt
I think you're missing a SUM and/or COUNT and that's the part I'm stuck on. -Kim On Fri, May 6, 2011 at 11:24 AM, jacob wrote: > Kim, > > This is something pig addresses exceedingly well: > > A = LOAD 'data' AS (item:chararray, user:chararray); > B = GROUP A BY item; > C = FOREACH B { > di

Re: Query help

2011-05-06 Thread jacob
Kim, This is something pig addresses exceedingly well: A = LOAD 'data' AS (item:chararray, user:chararray); B = GROUP A BY item; C = FOREACH B { distinct_users = DISTINCT A.user; GENERATE group AS item, distinct_users AS distinct_users ; }; should work. Have

accessing values from column names when using bag returned by CassandraStorage

2011-05-06 Thread Gianni Moschini
It is possible to access the column values (stored in Cassandra) from Pig, using the column names defined in the Cassandra schema, using the UDF from pygmalion. So imagine a schema being : create column family Users with column_type = Standard and comparator = UTF8Type and default_val

RE: PIG Cassandra - IPs of nodes in a ring

2011-05-06 Thread Badrinarayanan S
Hi, I am running from one of the nodes in the cluster. I too believe it is something to do with different address for rpc_address and listen_address but not sure what it is... -Original Message- From: Jeremy Hanna [mailto:jeremy.hanna1...@gmail.com] Sent: Friday, May 06, 2011 11:10 PM

Query help

2011-05-06 Thread Kim Vogt
Hi, I'm stuck on a query for counting distinct users. Say I have data that looks like this: book, user1 book, user2 book, user1 movie, user1 movie, user2 movie, user3 music, user4 I want to group by the first column and count the number of distinct users for that product. The result would just b
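
To make the requested computation concrete, here is a small Python sketch that counts distinct users per item, using the sample rows from the question. It only illustrates the semantics of a per-group distinct count and is not Pig code; the function name is made up.

```python
from collections import defaultdict

# Sample rows from the question: (item, user).
rows = [
    ("book", "user1"),
    ("book", "user2"),
    ("book", "user1"),
    ("movie", "user1"),
    ("movie", "user2"),
    ("movie", "user3"),
    ("music", "user4"),
]


def count_distinct_users(rows):
    """Group by item and count the distinct users seen for each item."""
    users_by_item = defaultdict(set)
    for item, user in rows:
        users_by_item[item].add(user)  # a set drops duplicate users
    return {item: len(users) for item, users in users_by_item.items()}


counts = count_distinct_users(rows)
```

The duplicate (book, user1) row is counted once, which is the behaviour a DISTINCT followed by COUNT inside a grouped FOREACH is meant to give.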

Question about SHIP

2011-05-06 Thread Mark Laczin
I've asked this question before, but I cannot figure out how to reply to that message, and I'm still quite confused about it. I'll simplify it for brevity: Can the SHIP keyword be used to take files from the machine where the Pig script is running (that is, files in the same directory as t

Re: PIG Cassandra - IPs of nodes in a ring

2011-05-06 Thread Jeremy Hanna
Where are you running the pig script from - your local machine or one of the nodes in the cluster or ? I would think it wouldn't matter which address you use, but what interface it's using. So if the internal and public address are both using the same interface, then you should be able to conn

PIG Cassandra - IPs of nodes in a ring

2011-05-06 Thread Badrinarayanan S
Hi, I have a cluster with seven Cassandra nodes. The ring is formed using the private IPs of each of the nodes. The rpc_address of the nodes is set to private and the listen_address of the nodes is set to public, mainly to facilitate a cross data centre ring. When I ring the nodes, it shows all nodes are