Hello all,
I have a bit of a maddening issue with builtin TOP. Consider the
following script:
A = LOAD '$DATA' AS (a_bag:bag {t:tuple (value:double)});
B = FOREACH A {
    top_one = TOP(1, 0, a_bag);
    GENERATE FLATTEN(top_one) AS (value);
};
DUMP B;
Most of the time this script works.
On Fri, 2011-05-06 at 16:06 -0600, Christian wrote:
> Thank you for taking the time to explain this to me Jacob!
>
> Am I stuck with hard-coding for my other question?
>
> Instead of:
> 2011-05-01DIRECTIVE132423DIRECTIVE23433DIRECTIVE3
> 1983
> --
> 2011-05-0132423343
Thank you for taking the time to explain this to me Jacob!
Am I stuck with hard-coding for my other question?
Instead of:
2011-05-01DIRECTIVE132423DIRECTIVE23433DIRECTIVE3
1983
--
2011-05-013242334331983
would also do as long as I could count on the column order.
In case anyone comes across this ...
This problem went away when I fixed a DEFINE ... SHIP(...) to make sure that
the file I was shipping was accessible from the running environment on the
non-local cluster.
William F Dowling
Sr Technical Specialist, Software Engineering
Thomson Reuters
On Fri, 2011-05-06 at 15:38 -0600, Christian wrote:
> >
> > > #1) Let's say you are tracking messages and extracting the hash tags from
> > > the message and storing them as one field (#hash1#hash2#hash3). This
> > means
> > > you might have a line that looks something like the following:
> > >
>
> > #1) Let's say you are tracking messages and extracting the hash tags from
> > the message and storing them as one field (#hash1#hash2#hash3). This
> means
> > you might have a line that looks something like the following:
> > 23432011-05-06T03:04:00.000Zusername
> > some+message
Christian,
I've answered inline:
On Fri, 2011-05-06 at 15:14 -0600, Christian wrote:
> I am sorry if this has been asked in the past. I can't seem to find
> information on it.
>
> I have two questions, but they are somewhat related.
>
> #1) Let's say you are tracking messages and extracting the
You can group on group, like this:
A = LOAD '/some/dir' USING PigStorage() AS (date:chararray, directive:chararray);
B = GROUP A BY (date, directive);
C = FOREACH B GENERATE FLATTEN(group) AS (date, directive), COUNT(A) AS cnt;
D = GROUP C BY date;
E = FOREACH D GENERATE group AS date, C.(directive, cnt) AS cnts;
Shaw
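The two-level grouping above (count each (date, directive) pair, then regroup the counts by date) can be sketched in plain Python. The sample rows here are hypothetical stand-ins for whatever lives in '/some/dir':

```python
from collections import Counter, defaultdict

# Hypothetical (date, directive) rows standing in for the data in '/some/dir'.
rows = [
    ("2011-05-01", "DIRECTIVE1"),
    ("2011-05-01", "DIRECTIVE1"),
    ("2011-05-01", "DIRECTIVE2"),
    ("2011-05-02", "DIRECTIVE1"),
]

# Like B/C above: count occurrences of each (date, directive) pair.
pair_counts = Counter(rows)

# Like D/E above: regroup the (directive, count) pairs by date.
by_date = defaultdict(list)
for (date, directive), cnt in sorted(pair_counts.items()):
    by_date[date].append((directive, cnt))

print(dict(by_date))
# {'2011-05-01': [('DIRECTIVE1', 2), ('DIRECTIVE2', 1)], '2011-05-02': [('DIRECTIVE1', 1)]}
```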
I am sorry if this has been asked in the past. I can't seem to find
information on it.
I have two questions, but they are somewhat related.
#1) Let's say you are tracking messages and extracting the hash tags from
the message and storing them as one field (#hash1#hash2#hash3). This means
you migh
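For what it's worth, splitting a packed field like "#hash1#hash2#hash3" back into individual tags is a one-liner; a minimal Python sketch (the field value here is just the example from the question):

```python
field = "#hash1#hash2#hash3"

# Split on '#' and drop the empty leading piece produced before the first tag.
tags = [t for t in field.split("#") if t]

print(tags)  # ['hash1', 'hash2', 'hash3']
```

In Pig the same effect could be had by tokenizing the field on '#' and flattening the resulting bag.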
I have a pig script that is tested and working in local mode. But when I try
to run it in mapreduce mode on a non-local hadoop cluster I get an error with
this stack trace:
ERROR 2999: Unexpected internal error. java.lang.String cannot be cast to
org.apache.pig.data.Tuple
java.lang.ClassCastException
Hmmm - if that's the case, then you might try the cassandra user list or ask
someone like driftx (brandon) in the #cassandra channel on IRC. He might know
what implications there are for that setup.
On May 6, 2011, at 1:13 PM, Badrinarayanan S wrote:
> Hi, I am running from one of the nodes in
The sampling algorithm for order-by samples 100 records from every map task,
using a reservoir sampling algorithm.
I can't think of a way to store data that could adversely affect this sampling.
This is the class (a pig load function) that is involved in sampling -
org.apache.pig.impl.builtin.RandomSampleLoader
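The per-map-task reservoir sampling described above can be sketched with Algorithm R; this is a generic illustration with k=100 (matching the 100 records per map task mentioned), not Pig's actual implementation:

```python
import random

def reservoir_sample(stream, k=100, rng=random):
    """Keep a uniform random sample of k items from a stream of
    unknown length (Algorithm R)."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep item i with probability k/(i+1), replacing a
            # uniformly chosen existing element.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(10_000), k=100)
print(len(sample))  # 100
```

Because each incoming record displaces an existing one with the right probability, every record in the stream ends up in the sample with equal probability 100/N, regardless of input order, which is why the on-disk layout of the data shouldn't bias the sample.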
Yep, works :)
-Kim
On Fri, May 6, 2011 at 11:32 AM, jacob wrote:
>
> Sorry, that's what I get for trying to do things quickly :)
>
>
> A = LOAD 'foo.tsv' AS (item:chararray, user:chararray);
> B = GROUP A BY item;
> C = FOREACH B {
> distinct_users = DISTINCT A.user;
> GENERATE
>
Heh, that's it, forgot the COUNT
A = LOAD 'data' AS (item:chararray, user:chararray);
B = GROUP A BY item;
C = FOREACH B {
    distinct_users = DISTINCT A.user;
    GENERATE
        group AS item,
        COUNT(distinct_users) AS distinct_users;
};
Thanks Jacob.
On Fri, May 6, 2011 at 1
Sorry, that's what I get for trying to do things quickly :)
A = LOAD 'foo.tsv' AS (item:chararray, user:chararray);
B = GROUP A BY item;
C = FOREACH B {
    distinct_users = DISTINCT A.user;
    GENERATE
        group AS item,
        COUNT(distinct_users) AS num_distinct_users;
};
I think you're missing a SUM and/or COUNT and that's the part I'm stuck on.
-Kim
On Fri, May 6, 2011 at 11:24 AM, jacob wrote:
> Kim,
>
> This is something pig addresses exceedingly well:
>
> A = LOAD 'data' AS (item:chararray, user:chararray);
> B = GROUP A BY item;
> C = FOREACH B {
> di
Kim,
This is something pig addresses exceedingly well:
A = LOAD 'data' AS (item:chararray, user:chararray);
B = GROUP A BY item;
C = FOREACH B {
distinct_users = DISTINCT A.user;
GENERATE
group AS item,
distinct_users AS distinct_users
;
};
should work. Have
It is possible to access the column values (stored in Cassandra) from Pig,
using the column names defined in the Cassandra schema, via the UDFs from
pygmalion.
So imagine a schema being :
create column family Users
with column_type = Standard
and comparator = UTF8Type
and default_val
Hi, I am running from one of the nodes in the cluster.
I too believe it is something to do with different address for rpc_address
and listen_address but not sure what it is...
-Original Message-
From: Jeremy Hanna [mailto:jeremy.hanna1...@gmail.com]
Sent: Friday, May 06, 2011 11:10 PM
Hi,
I'm stuck on a query for counting distinct users. Say I have data that looks
like this:
book, user1
book, user2
book, user1
movie, user1
movie, user2
movie, user3
music, user4
I want to group by the first column and count the number of distinct users
for that product. The result would just be
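The distinct-count-per-group the thread converges on (GROUP BY item, then COUNT of a nested DISTINCT) can be checked against the sample data above with a small Python sketch:

```python
from collections import defaultdict

# The sample rows from the question.
rows = [
    ("book", "user1"), ("book", "user2"), ("book", "user1"),
    ("movie", "user1"), ("movie", "user2"), ("movie", "user3"),
    ("music", "user4"),
]

# GROUP BY item, collecting users into a set (the DISTINCT step).
users_by_item = defaultdict(set)
for item, user in rows:
    users_by_item[item].add(user)

# COUNT the distinct users per item.
counts = {item: len(users) for item, users in users_by_item.items()}

print(counts)  # {'book': 2, 'movie': 3, 'music': 1}
```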
I've asked this question before - and I cannot figure out how to reply
to the message though, and I'm still quite confused about it. I'll
simplify it for brevity:
Can the SHIP keyword be used to take files from the machine where the
Pig script is running (that is, files in the same directory as t
Where are you running the pig script from - your local machine or one of the
nodes in the cluster or ? I would think it wouldn't matter which address you
use, but what interface it's using. So if the internal and public address are
both using the same interface, then you should be able to connect.
Hi,
I got a cluster with seven Cassandra nodes. The ring is formed using the
private ips of each of the nodes. The rpc_address of the nodes is set to
private and listen_address of the nodes set to public mainly to facilitate
cross data centre ring. When I ring the nodes, it shows all nodes are