Re: Hadoop Data Modeling Tutorial?

2015-08-05 Thread Russell Jurney
odeler. Can anyone point me to a tutorial for > getting up to speed modeling data in the Hadoop environment? > > > > Thanks, > > Chris > > > -- Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com

Re: Joins in Hadoop

2015-06-25 Thread Russell Jurney
t;>> >>>>>> On Wed, Jun 24, 2015 at 12:05 PM, Ravikant Dindokar < >>>>>> ravikant.i...@gmail.com >>>>>> > wrote: >>>>>> >>>>>>> Hi Hadoop user, >>>>>>> >>>>>>> I want to use hadoop for performing operation on graph data >>>>>>> I have two file : >>>>>>> >>>>>>> 1. Edge list file >>>>>>> This file contains one line for each edge in the graph. >>>>>>> sample: >>>>>>> 12 (here 1 is source and 2 is sink node for the edge) >>>>>>> 15 >>>>>>> 23 >>>>>>> 42 >>>>>>> 43 >>>>>>> 56 >>>>>>> 54 >>>>>>> 57 >>>>>>> 78 >>>>>>> 89 >>>>>>> 810 >>>>>>> >>>>>>> 2. Partition file : >>>>>>> This file contains one line for each vertex. Each line has >>>>>>> two values first number is and second number is >>>>>> id > >>>>>>> sample : >>>>>>> 21 >>>>>>> 31 >>>>>>> 41 >>>>>>> 52 >>>>>>> 62 >>>>>>> 72 >>>>>>> 81 >>>>>>> 91 >>>>>>> 101 >>>>>>> >>>>>>> >>>>>>> The Edge list file is having size of 32Gb, while partition file is >>>>>>> of 10Gb. >>>>>>> (size is so large that map/reduce can read only partition file . I >>>>>>> have 20 node cluster with 24Gb memory per node.) >>>>>>> >>>>>>> My aim is to get all vertices (along with their adjacency list >>>>>>> )those having same partition id in one reducer so that I can perform >>>>>>> further analytics on a given partition in reducer. >>>>>>> >>>>>>> Is there any way in hadoop to get join of these two file in mapper >>>>>>> and so that I can map based on the partition id ? >>>>>>> >>>>>>> Thanks >>>>>>> Ravikant >>>>>>> >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Harshit Mathur >>>>>> >>>>> >>>>> >>>> >>> >>> >>> -- >>> Harshit Mathur >>> >> >> > > > -- > Harshit Mathur > -- Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com

Re: Interview Questions asked

2015-02-12 Thread Russell Jurney
n idea , that will be great. > > Thanks > Krish > -- Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com

Re: Spark vs Tez

2014-10-18 Thread Russell Jurney
n Fri, Oct 17, 2014 at 11:06 AM, Adaryl "Bob" Wakefield, MBA <adaryl.wakefi...@hotmail.com> wrote:
>
>> Does anybody have any performance figures on how Spark stacks up
>> against Tez? If you don’t have figures, does anybody have an opinion? Spark
>> seems so popular but I’m not really seeing why.
>> B.
>
-- Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com

Re: Reading json format input

2013-05-29 Thread Russell Jurney
"hello world"}
> {"author":"foo234", "text": "hello this world"}
>
> So I want to do wordcount for text part.
> I understand that in mapper, I just have to pass this data as json and
> extract "text" and rest of the code is just the same but I am trying to
> switch from python to java hadoop.
> How do I do this.
> Thanks
>
-- Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com

Re: Accumulo and Mapreduce

2013-03-04 Thread Russell Jurney
AM, Russell Jurney <russell.jur...@gmail.com> wrote: >> http://svn.apache.org/repos/asf/accumulo/contrib/pig/trunk/src/main/java/org/apache/accumulo/pig/Acc

Re: Accumulo and Mapreduce

2013-03-04 Thread Russell Jurney
http://svn.apache.org/repos/asf/accumulo/contrib/pig/trunk/src/main/java/org/apache/accumulo/pig/AccumuloStorage.java AccumuloStorage for Pig comes with Accumulo. Easiest way would be to try it. Russell Jurney http://datasyndrome.com On Mar 4, 2013, at 5:30 AM, Aji Janis wrote: Hello, I

Re: building a department GPU cluster

2013-01-17 Thread Russell Jurney
Hadoop streaming can do this, and there's been some discussion in the past, but it's not a core use case. Check the list archives. Russell Jurney http://datasyndrome.com On Jan 17, 2013, at 9:25 AM, Jeremy Lewi wrote: I don't think running hadoop on a GPU cluster is a commo

Re: Map-Reduce V/S Hadoop Ecosystem

2012-11-07 Thread Russell Jurney
Hourly consultants may prefer MapReduce. Everyone else should be using Pig, Hive, Cascading, etc. Russell Jurney twitter.com/rjurney On Nov 7, 2012, at 8:08 PM, yogesh dhari wrote: Thanks Bejoy Sir, I am always grateful to u for your help. Please explain these word into simple language with

Re: reference architecture

2012-10-29 Thread Russell Jurney
You just made my year. Let me know how I can make it better (off list). Russell Jurney twitter.com/rjurney On Oct 29, 2012, at 2:17 PM, "Daniel Käfer" wrote: > Thank you, that book is exactly what i'm looking for. > > Regards > Daniel Käfer > > Am Samstag, de

Re: reference architecture

2012-10-27 Thread Russell Jurney
Russell Jurney http://datasyndrome.com On Oct 25, 2012, at 12:24 PM, "Daniel Käfer" wrote: > Hello all, > > I'm looking for a reference architecture for hadoop. The only result I > found is Lambda architecture from Nathan Marz[0]. > > With architecture I mean

Re: reference architecture

2012-10-27 Thread Russell Jurney
I define one of these in the book agile data, from O'Reilly. I express opinions on all matters you query us about. But you don't have to take my word for it... It's a reading rainbow! Jordi! Russell Jurney http://datasyndrome.com On Oct 27, 2012, at 1:09 AM, "Daniel

Re: Why they recommend this (CPU) ?

2012-10-13 Thread Russell Jurney
>> Date: Thursday, October 11, 2012 12:36 PM >> To: "user@hadoop.apache.org" <user@hadoop.apache.org>

Re: Why they recommend this (CPU) ?

2012-10-11 Thread Russell Jurney
My own clusters are too temporary and virtual for me to notice. I haven't thought of clock speed as having mattered in a long time, so I'm curious what kind of use cases might benefit from faster cores. Is there a category in some way where this sweet spot for faster cores occurs? Russ

Re: Why they recommend this (CPU) ?

2012-10-11 Thread Russell Jurney
Anyone got data on this? This is interesting, and somewhat counter-intuitive. Russell Jurney http://datasyndrome.com On Oct 11, 2012, at 10:47 AM, Jay Vyas wrote: > Presumably, if you have a reasonable number of cores - speeding the cores up > will be better than forking a task into s

Re: Legal Matter

2012-09-07 Thread Russell Jurney
r messes. >> >> As to the ninjas... sorry that sugar high or even caffeine high can be >> deadly. >> >> Definitely not a good mix. Gluten free foods with simple chicken and fish >> work best. >> >> >> On Sep 7, 2012, at 12:10 AM, Russell Jurney

Re: Legal Matter

2012-09-06 Thread Russell Jurney
With the pastries, I feel like you're calling me fat. And that they're a distraction for the Ninjas. Russell Jurney http://datasyndrome.com On Sep 6, 2012, at 10:05 PM, sathyavageeswaran wrote: Yah that would be great! *From:* Fabio Pitzolu [mailto:fabio.pitz...@gr-ci.com]

Re: Legal Matter

2012-09-06 Thread Russell Jurney
HR is giving us crap over our use of pirates for business development. Russell Jurney http://datasyndrome.com On Sep 6, 2012, at 6:02 AM, Michael Segel wrote: Why can't we use our Ninja's? They are sitting on the bench. On Sep 6, 2012, at 7:52 AM, Russell Jurney wrote: Also there

Re: Legal Matter

2012-09-06 Thread Russell Jurney
forward them along, as we already have a sizable bill outstanding (aforementioned copy and reply fees as well as two days back-retainer for a total of $6,000 US) and billing is hounding me for collection. Please don't make us use ninjas. Russell Jurney http://datasyndrome.com On Sep 5, 2012,

Re: Install Hive and Pig

2012-08-23 Thread Russell Jurney
-one-avroizing-the-enron-emails/ http://hortonworks.com/blog/the-data-lifecycle-part-two-mining-avros-with-pig-consuming-data-with-hive/ Russell Jurney http://datasyndrome.com On Aug 23, 2012, at 8:58 AM, rajesh bathala wrote: > Hi Friends, > > I am new to Hadoop. Can you please let us

Re: Can Hadoop replace the use of MQ b/w processes?

2012-08-19 Thread Russell Jurney
s. Order of processing is important in so far as related messages > need to be processed in sequence hence today all related messages go to the > same queue and are processed by the same queue consumer. > >> > >> The idea would be replace the use of MQ with some kind of reliab

Re: Can Hadoop replace the use of MQ b/w processes?

2012-08-19 Thread Russell Jurney
uling jobs look at Oozie and Azkaban. Russell Jurney http://datasyndrome.com On Aug 19, 2012, at 9:47 AM, Robert Nicholson wrote: > We have an application or a series of applications that listen to incoming > feeds they then distribute this data in XML form to a number of queues. >