rt#Oozie
>
> I am using Cassandra 1.2.10, Oozie 4.0.0 and pig 0.11.1.
>
> I will try testing these options and see if they work.
>
> Thanks in advance
>
> 2013/11/28 Jeremy Hanna
>
>> If I rememb
If I remember correctly when I configured pig, cassandra, and oozie to work
together, I just used vanilla pig but gave it the jars it needed.
What is the problem you’re experiencing that you are unable to do this?
Jeremy
On 28 Nov 2013, at 12:56, Miguel Angel Martin junquera
wrote:
> hi all;
it's just sort of languishing.
>
> 4. ONERROR would be a real coup for pig...there's a spec, someone just
> needs to do the work!
>
> And then there are various and sundry things that I would like to
> do...finish up SchemaTuple, move on to SchemaBag, and so on.
>
Thanks again to Twitter for doing their event and inspiring ours. I just
wanted to report on some things we did in Austin for any interested. We had a
good turnout of about 30 people.
Kevin Safford presented an introduction to Pig, or Pig 101. The slides are
available here:
http://www.slide
,
>
> Dan
> On May 11, 2012 2:00 PM, "Jeremy Hanna" wrote:
>
>> Here in Austin, we've been having a hack day for beginning to intermediate
>> developers. Just wanted to post some slides that were from presentations
>> here:
>> Pig 101 -
>
We've also started to use the #hadoop-pig channel on freenode (IRC).
On May 11, 2012, at 12:04 PM, Russell Jurney wrote:
> Up to 10 people can skype in to the Pig hackday. Call apachepig :)
>
> --
> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
Here in Austin, we've been having a hack day for beginning to intermediate
developers. Just wanted to post some slides that were from presentations here:
Pig 101 - http://www.slideshare.net/ktsafford/dachis-group-pigout101-12895911
Pig 202 - http://www.slideshare.net/thelabdude/dachis-group-pig-h
Here is the Austin event for those interested:
http://pig-hackday-austin.eventbrite.com/
On Apr 19, 2012, at 6:12 PM, Jeremy Hanna wrote:
> Cool - tx Russell et al. I'm talking with the higher ups here to see if we
> want to make it a general pig training and hacking day - we have
We're admittedly on an older version of pig (0.8.0-cdh3u0) but are trying to
build a databag in our UDF and are getting OOM exceptions even with 6 GB of
heap. Specifically, we're marshaling data prior to writing it to Cassandra
using our ToCassandraBag UDF and have a databag as one of the input
Cool - tx Russell et al. I'm talking with the higher ups here to see if we
want to make it a general pig training and hacking day - we have lunch time
training things here where we go over 101, 202, etc. Maybe we'll
organize something like that for this area and hack alongside people there.
Just curious - is there some way to do a remote connection? We have a few
people here in Austin and one in Colorado at the Dachis Group that may want to
participate.
On Apr 18, 2012, at 4:18 PM, Dmitriy Ryaboy wrote:
> Hi folks,
> The Analytics Infra team at Twitter will be hosting a Pig hackd
Not sure what mongo's doing (generate ID or triggers or something) but it
should only be a problem of efficiency if the writes are idempotent.
On Mar 2, 2012, at 3:39 AM, Jonathan Coveney wrote:
> I agree with Bill. Speculative execution is a feature of Hadoop that
> doesn't jibe nicely with sto
rs from apache's repo and get something
that wasn't a release.
On Feb 17, 2012, at 11:44 PM, Dmitriy Ryaboy wrote:
> Do you mean the snapshot of current 0.8 branch? Once 8.2 is released, the
> version in the branch is bumped up. There has been no 8.3 release.
>
> On
So the current releases of pig are 0.8.1 and 0.9.2. However, in the apache mvn
repo (and mirrored repos) there is a pig 0.8.3. I find no release on it, no
svn tag for it, and no user mailing list announcement for it. Where does 0.8.3
come from?
it's in
https://repository.apache.org/index.ht
actually - he just put it on github :)
https://github.com/edwardcapriolo/filecrush
On Nov 30, 2011, at 9:03 AM, Jeremy Hanna wrote:
> We went through some grief with small files and inefficiencies there. First
> we went the route of CombinedInputFormat. That worked for us for a whi
We went through some grief with small files and inefficiencies there. First we
went the route of CombinedInputFormat. That worked for us for a while but then
we started getting errors relating to the number of open files. So we used a
utility that Ed Capriolo in the Hadoop/Hive/Cassandra comm
re of significant issues with HBaseStorage in pig trunk;
> some features are outstanding, but other than that, I think most
> complaints we get are about jar management (which is mostly solved in
> trunk and pig 9, iirc). Do file tickets if you run into problems!
>
> D
>
>
I just wondered about the status of hbase storage, specifically the store part
of it. Is it something people are using in production - ready for prime time?
I seemed to remember a couple of people having problems with the store side of
it and I didn't know if that was rumor or not.
Thanks!
J
Sorry - for cogroup only…? same question though.
On Oct 26, 2011, at 5:28 PM, Jeremy Hanna wrote:
> Does this ticket mean that inner and outer are deprecated for group/cogroup?
> It sounds that way, but I just wanted to make sure. (We may need to refactor
> some things if so.
Does this ticket mean that inner and outer are deprecated for group/cogroup?
It sounds that way, but I just wanted to make sure. (We may need to refactor
some things if so.)
https://issues.apache.org/jira/browse/PIG-1584
One of the reasons why we did pygmalion here was to facilitate working with
tabular data - extracting out values (with FromCassandraBag) using specified
column names. Not sure if it works with your use case, but just to mention it
- it doesn't work as easily with dynamic column names.
https://g
It's been mentioned in this thread, but if you're using tabular (static column
names) data, you might consider using Pygmalion. It will extract the values
from Cassandra to simplify grouping by values and other operations.
https://github.com/jeromatron/pygmalion
What you'll want to look at is th
>
> into task in build.xml, though I am not sure it is acceptable
> for your case.
>
> Daniel
>
> On Thu, Sep 22, 2011 at 11:43 AM, Jeremy Hanna
> wrote:
>> Is there a way to use -Dpig.additional.jars with pigunit to auto-register
>> jars for unit test scripts?
Is there a way to use -Dpig.additional.jars with pigunit to auto-register jars
for unit test scripts? Maybe we're just missing something because this seems
like a basic thing that people would like to use. I see in
test/org/apache/pig/test/pigunit/TestPigTest.java that there is a commented out
Seems like pigunit would be one of those jars that would be handy to just
depend on with maven/ivy. Is there any reason why pigunit isn't pushed to
maven central along with pig itself?
Thanks!
Jeremy
/repos/asf/cassandra/trunk/contrib/pig. Are there any
> other resource that you can point me to? There seems to be a lack of samples
> on this subject.
>
> On Tue, Aug 30, 2011 at 10:56 PM, Jeremy Hanna
> wrote:
> FWIW, we are using Pig (and Hadoop) with Cassandra and are looking to
FWIW, we are using Pig (and Hadoop) with Cassandra and are looking to
potentially move to Brisk because of the simplicity of operations there.
Not sure what you mean about the true power of Hadoop. In my mind the true
power of Hadoop is the ability to parallelize jobs and send each task to wher
ks for the response
>
> Fabio
>
> On 17/08/2011, at 22:14, Jeremy Hanna wrote:
>
>> Hi Fabio,
>>
>> I'm not sure if super columns are fully supported right now in
>> CassandraStorage. Brandon (who I CCed) would know for sure. That and I
>>
Hi Fabio,
I'm not sure if super columns are fully supported right now in
CassandraStorage. Brandon (who I CCed) would know for sure. That and I
thought the pig bug that made it impossible to get to nested data structures
has been resolved - the ticket you commented on today I think was a dupl
72)
>... 9 more
> Caused by: java.io.EOFException
>at java.io.DataInputStream.readInt(DataInputStream.java:375)
>at
> org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:812)
>at org.apache.hadoop.ipc.Client$Connection.run(Client.java
a:303)
>>at
>> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
>>at
>> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:141)
>>at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:76)
>>at org.apa
ig.Main.run(Main.java:465)
>at org.apache.pig.Main.main(Main.java:107)
>
> does anyone else have this problem?
>
>
> On Sun, Jul 31, 2011 at 2:04 PM, Jeremy Hanna
> wrote:
>
>> Try following this and see if it helps getting started:
>> https://github.com/
Try following this and see if it helps getting started:
https://github.com/jeromatron/pygmalion/wiki/Getting-Started
I haven't tried it with 0.9 yet but I plan to this week. We use the
CassandraStorage jar in production. If you can, validate your data with
Cassandra's schema validators. Cassa
Nice work Daniel and all on the release and the blog posts! Looking forward to
the other two. We'll be testing out on our stuff because of all the great
features added.
On Jul 29, 2011, at 4:02 PM, Daniel Dai wrote:
> We wrote a series of blog posts describing the new features of Pig 0.9.0 on
> ht
One thing that we use is filecrush to merge small files below a threshold. It
works pretty well.
http://www.jointhegrid.com/hadoop_filecrush/index.jsp
On Jul 16, 2011, at 1:17 AM, jagaran das wrote:
>
>
> Hi,
>
> Due to requirements in our current production CDH3 cluster we need to copy
> a
DRA-2869 for another
> case where this has reared its head in an improper implementation.
>
> -Grant
>
> On Jul 7, 2011, at 3:24 AM, Jeremy Hanna wrote:
>
>>
>> On Jul 6, 2011, at 11:10 PM, Raghu Angadi wrote:
>>
>>> On Wed, Jul 6, 2011 at
On Jul 6, 2011, at 11:10 PM, Raghu Angadi wrote:
> On Wed, Jul 6, 2011 at 7:20 PM, Jeremy Hanna
> wrote:
>
>>
>> On Jul 6, 2011, at 12:47 PM, Dmitriy Ryaboy wrote:
>>
>>> I think this is the same problem we were having earlier:
>>> http:
in this case we'll just have to require the field names be
entered into the UDF and it won't introspect them. Ah well. Would be nice to
be able to use it but I don't really see another way around this bug with the
shared UDF context.
>
> D
>
> On Wed, Jul 6, 2011 at
We have a UDF that introspects the output schema, gets the field names there,
and uses them in the exec method.
The UDF is found here:
https://github.com/jeromatron/pygmalion/blob/master/udf/src/main/java/org/pygmalion/udf/ToCassandraBag.java
A simple example is found here:
https://github.com
According to
http://pig.apache.org/docs/r0.8.1/cookbook.html#Use+the+Parallel+Features there
are two ways for pig to determine the number of reducers to use:
1- set default_parallel and/or PARALLEL
2- let pig calculate it
What do people generally use right now? Is there a preferred option?
Answering my own question. Penny with 0.9 does this. Wahoo :)
Thanks for telling me Ashutosh.
On Jun 25, 2011, at 9:56 AM, Jeremy Hanna wrote:
> I was just wondering if the following was a common scenario for others and
> whether things could be done in a more debug friendly way und
I was just wondering if the following was a common scenario for others and
whether things could be done in a more debug friendly way under the covers.
Currently we've found that developing with pig is enormously helpful because
it's a scripting language that does a lot of the heavy lifting for u
, that's here: https://github.com/jeromatron/pygmalion/
Jeremy
On Jun 17, 2011, at 9:05 PM, Badrinarayanan S wrote:
> Hi Jeremy,
>
> Thanks. Till we get 1.0 we will also adopt separate CF for analysis
> purposes.
>
> Regards,
> badri
>
> -----Original M
The way cassandra currently does mapreduce is that it iterates over all the
rows of the column family. So yes, performance would be related to the growing
number of rows. You can use the pig FILTER function to filter them down, but
you are still iterating over all of the rows in that columns f
(the script).
>
> On Wed, Jun 15, 2011 at 3:04 PM, Jeremy Hanna
> wrote:
>
>> Hi Will,
>>
>> That's partly why I like to use FromCassandraBag and ToCassandraBag from
>> pygmalion - it does the work for you to get it back into a form that
>> cassandr
Hi Will,
That's partly why I like to use FromCassandraBag and ToCassandraBag from
pygmalion - it does the work for you to get it back into a form that cassandra
understands.
Others may know better how to massage the data into that form using just pig,
but if all else fails, you could write a u
ng keys even if you sampled in a way that didn't actually
> produce any, etc.
>
> D
>
> On Wed, Jun 15, 2011 at 10:35 AM, Jeremy Hanna
> wrote:
>> We started doing this recently and thought it might be useful to others.
>>
>> Pig (and Hive) have a sample
We started doing this recently and thought it might be useful to others.
Pig (and Hive) have a sample function that allows you to sample data from your
data store.
In pig it looks something like this:
mysample = SAMPLE myrelation 0.01;
One possible use for this, with pig and cassandra is to sol
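To make the sampling idea concrete, here is a minimal Python sketch of what a 1% sample does, assuming (as I understand SAMPLE) that each row is kept independently with the given probability, so the result size is approximate rather than exact. The function name `sample` and the `seed` parameter are made up for this illustration and are not part of Pig.

```python
import random


def sample(relation, fraction, seed=None):
    """Keep each row independently with probability `fraction`.

    Illustration only: `seed` exists so the sketch is repeatable.
    """
    rng = random.Random(seed)
    return [row for row in relation if rng.random() < fraction]


# Stand-in for `myrelation`: 100,000 made-up rows.
rows = list(range(100_000))
mysample = sample(rows, 0.01, seed=42)

# The count lands near 1,000 but is not guaranteed to equal it.
print(len(mysample))
```

Because the selection is per row, two runs over the same data can return different rows and slightly different counts unless the random source is fixed.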
You need to set the property in your hadoop configuration:
cassandra.consistencylevel.read
to LOCAL_QUORUM.
All of the properties you can set are in the
org.apache.cassandra.hadoop.ConfigHelper class. You can call that directly
with Java/MapReduce or use the properties defined at the top in you
I looked through the help and the docs pages but couldn't find anything that
did this. Is there any way to show a list of current relations loaded while on
the grunt shell? It would seem that the information is available, just not
exposed via a command.
Thanks!
Jeremy
ething to do with different address for rpc_address
>> and listen_address but not sure what it is...
>>
>>
>>
>> -Original Message-
>> From: Jeremy Hanna [mailto:jeremy.hanna1...@gmail.com]
>> Sent: Friday, May 06, 2011 11:10 PM
>> To: user@
he nodes in the cluster.
>
> I too believe it is something to do with different address for rpc_address
> and listen_address but not sure what it is...
>
>
>
> -Original Message-
> From: Jeremy Hanna [mailto:jeremy.hanna1...@gmail.com]
> Sent: Frida
Where are you running the pig script from - your local machine, one of the
nodes in the cluster, or somewhere else? I would think it wouldn't matter which address you
use, but what interface it's using. So if the internal and public address are
both using the same interface, then you should be able to conn
A few questions:
What are you trying to do? What is the pig script that you're trying to run?
What version of Cassandra?
What version of Pig?
Did you add any column_metadata to your column family, like a validation_class?
On Apr 28, 2011, at 7:58 PM, Himanshu wrote:
> java.lang.ClassCastExceptio
a/browse/PIG-1420
Oh cool - gtk, thanks Bill!
>
>
> On Wed, Apr 27, 2011 at 12:31 PM, Jonathan Ellis wrote:
>> Nice!
>>
>> On Wed, Apr 27, 2011 at 1:57 PM, Jeremy Hanna
>> wrote:
>>> Hi all,
>>>
>>> A little while back, I started a pr
tuple
(name, value)}) - the column names are extracted from the variable names in the
Pig script.
Both contributed by Jacob Perkins with slight revisions by Jeremy Hanna
StringConcat: probably something everyone implements but instead of CONCAT that
only does two strings, it does any number of st
TE key,
FLATTEN(org.pygmalion.udf.FromCassandraBag('first_name, last_name, birth_place,
num_heads', columns)) AS (
first_name:chararray,
last_name:chararray,
birth_place:chararray,
num_heads:long
);
b = group rows by key;
x = foreach b generate group, SUM(rows.num_heads);
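For readers who don't have pygmalion handy, the following Python sketch mimics what the script above appears to do: pull named columns out of a bag of (name, value) pairs, then group by key and sum one field. It is not pygmalion's implementation. The helper names, the sample rows, and the choice to return None for a missing column and skip it when summing are all assumptions made for illustration.

```python
from collections import defaultdict


def from_cassandra_bag(column_names, columns):
    """Return the values for the named columns, in the order requested.

    `columns` is a list of (name, value) pairs. A column that is absent
    yields None (an assumption of this sketch).
    """
    names = [name.strip() for name in column_names.split(",")]
    lookup = dict(columns)
    return tuple(lookup.get(name) for name in names)


def sum_by_key(rows, field_index):
    """Group rows by key and sum one extracted field, skipping None."""
    totals = defaultdict(int)
    for key, fields in rows:
        value = fields[field_index]
        if value is not None:
            totals[key] += value
    return dict(totals)


# Made-up rows in the (key, columns) shape described in this thread.
raw = [
    ("zaphod", [("first_name", "Zaphod"), ("last_name", "Beeblebrox"),
                ("birth_place", "Betelgeuse Five"), ("num_heads", 2)]),
    ("arthur", [("first_name", "Arthur"), ("num_heads", 1)]),
]

wanted = "first_name, last_name, birth_place, num_heads"
rows = [(key, from_cassandra_bag(wanted, cols)) for key, cols in raw]

print(sum_by_key(rows, 3))  # {'zaphod': 2, 'arthur': 1}
```

The point of the extraction step is that downstream operations can address fields by position or name instead of digging through the bag each time.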
6)
>>>> at
>>>>
>>>>
>>>
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
>>>> at
>>>>
>>>>
>>>
>> org.apache.pig.backend.hadoop.ex
On Apr 21, 2011, at 9:25 AM, Mridul Muralidharan wrote:
> On Thursday 21 April 2011 06:41 PM, Jeremy Hanna wrote:
>>
>> On Apr 21, 2011, at 3:19 AM, Mridul Muralidharan wrote:
>>
>>>
>>> In general (on hadoop based systems), if the input is not immu
On Apr 21, 2011, at 3:19 AM, Mridul Muralidharan wrote:
>
> In general (on hadoop based systems), if the input is not immutable - you can
> end up with issues during task re-execution, etc.
> This happens not just for cassandra but for hbase, others too - where you
> modify data in-place.
>
The answer is that it depends on which consistency level you are reading and
writing at. You can make sure you are always reading consistent data by using
quorum for reads and quorum for writes.
For more information on consistency level, see:
http://www.datastax.com/docs/0.7/consistency/index
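The arithmetic behind that advice can be sketched in a few lines of Python. A quorum is a majority of the replicas, and a read is guaranteed to overlap the latest write when the number of replicas read plus the number written exceeds the replication factor. The function names below are made up for this illustration and are not Cassandra APIs.

```python
def quorum(replication_factor):
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor, read_replicas, write_replicas):
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


rf = 3
q = quorum(rf)

# Quorum reads plus quorum writes: 2 + 2 > 3, so the sets always overlap.
print(q, reads_see_latest_write(rf, q, q))  # 2 True

# Reading and writing at ONE: 1 + 1 <= 3, so a read may miss the write.
print(reads_see_latest_write(rf, 1, 1))  # False
```

Other combinations that satisfy the same inequality, such as writing to all replicas and reading from one, give the same overlap guarantee at a different cost.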
example data/query that can be used to reproduce this ?
> Can you paste the entire stacktrace of the ClassCastException ?
> Do you have something like a bincond which might be returning different
> results for different rows ?
>
> -Thejas
>
>
>
>
> On 4/15/11 2:44
I have been getting strange errors in my pig script and narrowed it down a bit
and found that when I do a COUNT, sometimes it returns a float, but most of the
time it returns a long. Some example output of the result column that came
from a COUNT is below. Any reason why this would happen?
Th
ejas
>
>
>
> On 4/8/11 9:30 AM, "Jeremy Hanna" wrote:
>
> I am going through a lot of processing with my data and then I reformat it to
> go back into my data store using the storefunc. I store it out to hdfs and
> it visually looks just fine. However when I tr
The 0.7.4 version is here:
http://svn.apache.org/repos/asf/cassandra/tags/cassandra-0.7.4/contrib/pig/src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java
The latest from 0.7 branch contains a way to get the cassandra schema for the
column family it is querying against though:
http://sv
I am going through a lot of processing with my data and then I reformat it to
go back into my data store using the storefunc. I store it out to hdfs and it
visually looks just fine. However when I try to persist it, I'm getting an
exception that it can't cast one of the values from
org.apache
oing because it just makes it easier to deal with
tabular-like data - we don't have to munge through it quite as much. I'm still
pretty low on my pig-fu but others on the list might have better answers on how
to deal with that data structure.
>
> On Apr 6, 2011, at 3:51 PM, Jeremy
I'm going to put a UDF up on the pygmalion project hopefully today that will
convert that into something more usable. Props to Jacob from infochimps - he
and I have been creating UDFs like that lately for use with Cassandra. There's
an associated UDF for getting it back into the key, cols form
the next couple of days. Feel
free to add to it as well :).
https://github.com/jeromatron/pygmalion
Jeremy
On Apr 6, 2011, at 4:15 AM, Fabio Souto wrote:
> It works. Thank you for your help Jeremy!!
>
> Cheers
> Fabio
>
> On 05/04/2011, at 20:08, Jeremy Hanna wrote:
>
ra.dht.RandomPartitioner
>
>
> BTW I'm using the pig version that comes with Cassandra, the one in
> cassandra/contrib/pig
>
> Thanks for your time Jeremy! :)
> Fabio
>
> On 05/04/2011, at 17:04, Jeremy Hanna wrote:
>
>> Fabio,
>>
>> It look
> BTW I'm using the pig version that comes with Cassandra, the one in
> cassandra/contrib/pig
>
> Thanks for your time Jeremy! :)
> Fabio
>
> On 05/04/2011, at 17:04, Jeremy Hanna wrote:
>
>> Fabio,
>>
>> It looks like you need to set your environ
g.PigServer.executeCompiledLogicalPlan(PigServer.java:1198)
> at org.apache.pig.PigServer.storeEx(PigServer.java:874)
> at org.apache.pig.PigServer.store(PigServer.java:816)
> at org.apache.pig.PigServer.openIterator(PigServer.java:728)
> ... 7 more
>
Fabio,
Could you post the full stack trace that's found in the pig_.log
that's in the directory that you ran pig?
Thanks,
Jeremy
On Apr 5, 2011, at 8:42 AM, Fabio Souto wrote:
> Hello,
>
> I have installed Pig 0.8.0 and Cassandra 0.7.4 and I'm not able to read data
> from cassandra. I write
> On Thu, Mar 31, 2011 at 4:46 PM, Alan Gates wrote:
>
>> Isn't ivy picking it up for you? That's what is supposed to happen.
>>
>> Alan.
>>
>>
>> On Mar 28, 2011, at 11:32 AM, Jeremy Hanna wrote:
>>
>> Is there a standard way t
True. It is mentioned in the readme, but maybe it should be more explicit in
the readme or in the HadoopSupport page. I haven't had problems with
localhost, but how you defined it is the way I set things for running against
my cassandra/hadoop hybrid cluster.
On Mar 29, 2011, at 12:36 PM, Mar
Is there a standard way to get jline and commons-lang into pig? I work around
by copying them into my build/ivy/lib/Pig directory but didn't know if there
was a simpler way I was just overlooking. Otherwise I get UNRESOLVED
DEPENDENCIES errors for those two libs when I try to build pig 0.8.
n the right track.
>
> We may have to go in and explicitly check the types of each column and
> cast manually.
>
> --jacob
>
> On Thu, 2011-03-24 at 13:11 -0500, Jeremy Hanna wrote:
>> I see that there are a few LoadCaster implementations in pig 0.8. There
I see that there are a few LoadCaster implementations in pig 0.8. There's the
Utf8StorageConverter, the HBaseBinaryConverter, and a couple of others.
The HBaseStorage class uses the Utf8StorageConverter by default but can be
configured to use the HBaseBinaryConverter. Also it's just used as a
Replying on here too since I noticed it was sent to the pig user list as well...
The pig output to cassandra was part of recently resolved CASSANDRA-1828. It's
usable and separate so you should be able to download 0.7-branch and build the
jar and use it against a 0.7.3 cluster. I've been using
What is the standard way to copy up jar dependencies to the cluster with Pig
(so that the nodes in the cluster don't get runtime errors with class not found
exceptions)?
moving this to the cassandra user list.
On Nov 10, 2010, at 11:05 AM, Aditya Muralidharan wrote:
> Hi,
>
> I'm building (on windows) a release tar from the HEAD of the Cassandra 0.7
> branch. Running a new single node instance of Cassandra gives me the
> following bootstrap exception:
> INFO 1
81 matches