Pachiderm Project

2011-09-10 Thread highpointe

Building it from scratch with a homegrown recipe. 

Follow and lend us suggestions, advice, kudos, etc. If you're mean...  Go away. 
You suck. 

www.7ops.com
@7Ops
highpoi...@7ops.com

Cheers. 



Disable Sorting?

2011-09-10 Thread john smith
Hi,

Some of the MR jobs I run don't need the map output sorted within each
partition. Is there some way I can disable the sort?

Any help?

Thanks
jS


Re: Hbase + mapreduce -- operational design question

2011-09-10 Thread Eugene Kirpichov
I believe HBase has some kind of TTL (timeout-based expiry) for
records and it can clean them up on its own.

On Sat, Sep 10, 2011 at 1:54 AM, Dhodapkar, Chinmay
chinm...@qualcomm.com wrote:
 Hello,
 I have a setup where a bunch of clients store 'events' in an Hbase table. 
 Also, periodically (once a day), I run a mapreduce job that goes over the 
 table and computes some reports.

 Now my issue is that on the next run I don't want the mapreduce job to process 
 the 'events' that it has already processed. I know that I can mark 
 processed events in the hbase table and the mapper can filter them out 
 during the next run. But what I would really like is for previously 
 processed events to never even reach the mapper.

 One solution I can think of is to back up the hbase table after running the 
 job and then clear the table. But this has a lot of problems:
 1) Clients may have inserted events while the job was running.
 2) I could disable and drop the table and then create it again, but then the 
 clients would complain about this short window of unavailability.


 What do people using Hbase (live) + mapreduce typically do?

 Thanks!
 Chinmay





-- 
Eugene Kirpichov
Principal Engineer, Mirantis Inc. http://www.mirantis.com/
Editor, http://fprog.ru/


Re: Hbase + mapreduce -- operational design question

2011-09-10 Thread Sonal Goyal
Chinmay, how are you configuring your job? Have you checked using setScan
and selecting the keys you care to run MR over? See

http://ofps.oreilly.com/titles/9781449396107/mapreduce.html
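
One way to read the setScan suggestion is to restrict each run to a time window, so that rows handled by an earlier run are never handed to a mapper and rows written while a job is running are picked up by the next one. The sketch below is plain Python that only simulates that bookkeeping. The `Event` type, the `select_window` and `run_job` helpers, and the in-memory list standing in for the table are all hypothetical, not HBase API. In a real job the same half-open window would presumably be set on the Scan itself (for example with `Scan.setTimeRange`, or with start/stop row keys if the keys encode time) before the Scan is handed to the table mapper setup.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Event:
    row_key: str
    timestamp: int  # write time, e.g. milliseconds since epoch
    payload: str


def select_window(table: List[Event], start_ts: int, end_ts: int) -> List[Event]:
    """Return events with start_ts <= timestamp < end_ts (half-open window)."""
    return [e for e in table if start_ts <= e.timestamp < end_ts]


def run_job(table: List[Event], checkpoint: int, now: int) -> Tuple[List[str], int]:
    """Process only events written since the last checkpoint.

    Returns the processed payloads and the new checkpoint. The upper bound
    is fixed when the job starts, so events written while the job is running
    fall into the next run's window instead of being lost or double-counted.
    """
    batch = select_window(table, checkpoint, now)
    report = [e.payload for e in batch]
    return report, now


if __name__ == "__main__":
    table = [Event("a", 100, "login"), Event("b", 150, "click")]
    report1, checkpoint = run_job(table, checkpoint=0, now=200)
    table.append(Event("c", 250, "logout"))  # arrives after the first run
    report2, checkpoint = run_job(table, checkpoint, now=300)
    print(report1)
    print(report2)
```

The checkpoint has to be stored somewhere durable between runs, and this only works if event timestamps never fall behind a checkpoint that has already been taken.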

As a shameless plug - For your reports, see if you want to leverage Crux:
https://github.com/sonalgoyal/crux

Best Regards,
Sonal
Crux: Reporting for HBase https://github.com/sonalgoyal/crux
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal








Re: Disable Sorting?

2011-09-10 Thread Arun C Murthy
Run a map-only job with #reduces set to 0.
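
As a rough illustration of what zero reduces means: each map task writes its output directly, with no partitioning, sort or shuffle, so the job produces one output per map task, in the order the mapper emitted its records. The snippet is a plain-Python simulation, not Hadoop code; `run_map_only` and `word_mapper` are hypothetical names.

```python
from typing import Callable, Iterable, List, Tuple

KV = Tuple[str, int]


def run_map_only(splits: List[List[str]],
                 mapper: Callable[[str], Iterable[KV]]) -> List[List[KV]]:
    """Simulate a job with zero reduces: one unsorted output per input split."""
    outputs = []
    for split in splits:
        part = []
        for record in split:
            part.extend(mapper(record))  # emitted order is preserved, no sort
        outputs.append(part)
    return outputs


def word_mapper(line: str) -> Iterable[KV]:
    """Emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]


if __name__ == "__main__":
    splits = [["pear apple"], ["zebra apple"]]
    for i, part in enumerate(run_map_only(splits, word_mapper)):
        print(i, part)
```

Note that the same key (`apple`) ends up in both outputs: nothing aggregates across map tasks when there is no reduce phase.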

Arun




Re: Disable Sorting?

2011-09-10 Thread Meng Mao
Is there a way to collate the possibly large number of map output files,
though?





Re: Disable Sorting?

2011-09-10 Thread Owen O'Malley
On Sat, Sep 10, 2011 at 12:33 PM, Meng Mao meng...@gmail.com wrote:

 Is there a way to collate the possibly large number of map output files,
 though?


You can make fewer mappers by setting the mapred.min.split.size to define
the smallest input that will be given to a mapper.
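
To make the effect concrete, here is a back-of-the-envelope calculation in Python. It assumes the split size is max(min split size, min(max split size, block size)), which matches the usual file-based input format behaviour but should be checked against the Hadoop version in use. It also ignores details such as a short final split being folded into the previous one. The function names are made up for the sketch.

```python
import math

MiB = 1024 * 1024


def split_size(block_size: int, min_split: int = 1, max_split: int = 2**63 - 1) -> int:
    """Assumed rule: max(min_split, min(max_split, block_size))."""
    return max(min_split, min(max_split, block_size))


def estimated_mappers(file_size: int, block_size: int, min_split: int = 1) -> int:
    """Approximate number of map tasks for one splittable file."""
    return math.ceil(file_size / split_size(block_size, min_split))


if __name__ == "__main__":
    one_gib = 1024 * MiB
    print(estimated_mappers(one_gib, block_size=64 * MiB))
    print(estimated_mappers(one_gib, block_size=64 * MiB, min_split=256 * MiB))
```

Under these assumptions, a 1 GiB file stored in 64 MiB blocks goes from roughly 16 map tasks to roughly 4 when the minimum split size is raised to 256 MiB, and a map-only job would write correspondingly fewer output files.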

There isn't currently a way of getting a collated but unsorted list of
key/value pairs. For most applications, the in-memory sort is fairly cheap
relative to the shuffle and other parts of the processing.

-- Owen


Re: Disable Sorting?

2011-09-10 Thread john smith
Hey,

I have reduce phases too. But for each reduce, I don't need sorted input
(the map output for that reduce task).
Setting #reduces to 0 completely removes the reduce phase.

Am I missing something?

Thanks,





Re: Disable Sorting?

2011-09-10 Thread Arun C Murthy
The point of a 'reduce phase' is to aggregate keys from different maps (i.e. 
all inputs).

I'm not sure what you are trying to do, but a use case would help.

In any case, the only way to achieve what you are trying to do is to run two 
jobs, with the first being a map-only job (i.e. #reduces = 0).

Arun
