HBase + MapReduce -- operational design question

2011-09-09 Thread Dhodapkar, Chinmay
Hello,
I have a setup where a bunch of clients store 'events' in an HBase table.
Periodically (once a day), I run a MapReduce job that goes over the table
and computes some reports.

Now my issue is that I don't want the next run of the MapReduce job to
process the 'events' it has already processed. I know that I can mark
processed events in the HBase table and have the mapper filter them out
during the next run. But what I would really like is for previously
processed events to never even reach the mapper.
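A sketch of one common pattern for this (an assumption on my part, not something stated in the thread): give each event an insertion timestamp and record a watermark after each successful run, then scan only rows newer than the watermark. In HBase this filtering could be pushed server-side with `Scan.setTimeRange(watermark, Long.MAX_VALUE)` so old rows never reach the mapper; the plain-Java sketch below shows just the watermark logic, with the `Event` type invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: keep already-processed events out of the next run by filtering
// on a watermark timestamp recorded after the previous run. In HBase the
// equivalent filter would run server-side via Scan.setTimeRange, so the
// mapper never sees old rows; here the logic is shown with collections.
public class WatermarkScan {
    static final class Event {
        final long timestamp;   // insertion time of the event
        final String payload;

        Event(long timestamp, String payload) {
            this.timestamp = timestamp;
            this.payload = payload;
        }
    }

    // Return only events inserted strictly after the watermark.
    static List<Event> eventsAfter(List<Event> events, long watermark) {
        List<Event> fresh = new ArrayList<>();
        for (Event e : events) {
            if (e.timestamp > watermark) {
                fresh.add(e);
            }
        }
        return fresh;
    }
}
```

After the job finishes, the watermark is advanced to the largest timestamp processed; events inserted while the job was running keep their later timestamps and are picked up by the next run, which avoids the lost-insert problem of clearing the table.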

One solution I can think of is to back up the HBase table after running the
job and then clear it. But this has a lot of problems:
1) Clients may have inserted events while the job was running.
2) I could disable and drop the table and then create it again, but then
the clients would complain about the short window of unavailability.


What do people running HBase (live) + MapReduce typically do?

Thanks!
Chinmay



Re: HBase and HDFS sync support in the main branch

2011-09-09 Thread Eugene Kirpichov
Hello Arun,

Thanks for the clarification. Do you mean 0.20.205 will be a "default"
(main, trunk, you name it) release and have durable edits?

On Fri, Sep 9, 2011 at 2:22 PM, Arun C Murthy  wrote:
> Eugene,
>
>  Currently we are close to getting 0.20.205 frozen, which will be the
> first Apache Hadoop release with proper support for HBase.
>
> hth,
> Arun
>
> On Sep 9, 2011, at 9:06 AM, Eugene Kirpichov wrote:
>
>> Hello,
>>
>> It appears from http://wiki.apache.org/hadoop/Hbase/HdfsSyncSupport
>> that for at least the past year, durable edits haven't been supported
>> by the main branch of HBase & HDFS - only by a non-main branch, an
>> unreleased branch, and a third-party distribution.
>>
>> This seems strange to me, as
>> http://hbase.apache.org/acid-semantics.html seems to claim that
>> durability is present, without indicating that it's actually absent by
>> default (part of the page reads like a plan rather than a statement of
>> the current state of things, which is also confusing).
>>
>> Is this because durable edits are not that much of a needed feature?
>>
>> [Disclaimer: I haven't done much with HBase or Hadoop in general in a
>> while, so I may be asking completely stupid questions in the current
>> context]
>>
>> --
>> Eugene Kirpichov
>> Principal Engineer, Mirantis Inc. http://www.mirantis.com/
>> Editor, http://fprog.ru/
>
>



-- 
Eugene Kirpichov
Principal Engineer, Mirantis Inc. http://www.mirantis.com/
Editor, http://fprog.ru/


Re: HBase and HDFS sync support in the main branch

2011-09-09 Thread Arun C Murthy
Eugene,

 Currently we are close to getting 0.20.205 frozen, which will be the first
Apache Hadoop release with proper support for HBase.

hth,
Arun

On Sep 9, 2011, at 9:06 AM, Eugene Kirpichov wrote:

> Hello,
> 
> It appears from http://wiki.apache.org/hadoop/Hbase/HdfsSyncSupport
> that for at least the past year, durable edits haven't been supported
> by the main branch of HBase & HDFS - only by a non-main branch, an
> unreleased branch, and a third-party distribution.
> 
> This seems strange to me, as
> http://hbase.apache.org/acid-semantics.html seems to claim that
> durability is present, without indicating that it's actually absent by
> default (part of the page reads like a plan rather than a statement of
> the current state of things, which is also confusing).
> 
> Is this because durable edits are not that much of a needed feature?
> 
> [Disclaimer: I haven't done much with HBase or Hadoop in general in a
> while, so I may be asking completely stupid questions in the current
> context]
> 
> -- 
> Eugene Kirpichov
> Principal Engineer, Mirantis Inc. http://www.mirantis.com/
> Editor, http://fprog.ru/



What is the best way to implement a key that is an array of strings?

2011-09-09 Thread W.P. McNeill
I have a data structure that is a variable-length array of strings. Call it
a StringList. I am using StringLists as Hadoop keys. These objects sort
lexicographically (e.g. ["apple", "banana"] < ["apple", "banana", "pear"] <
["apple", "pear"] < ["zucchini"]) and are equal if and only if they have the
same length and all corresponding elements are equal. What is the best way
to implement this object for Hadoop?

Currently I have implemented StringList as an object that extends
ArrayWritable and sets the value class to Text. The compareTo method just
compares string representations of the StringList objects, since these
representations have the ordering property I desire. This works but I'm
uncertain about how it will perform at scale.

In order to get the highest performance, would I still have to write a raw
comparator for this object, or does ArrayWritable do this for me?

In lieu of writing a raw comparator, should I just implement StringList as
an Avro object? I think Avro gives you raw comparators for free, but I
haven't dug into this.
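One caveat worth noting about the approach described above: comparing string representations matches the element-wise ordering only if the separator character sorts below every character that can appear in an element (for example, ["a", "b"] joined with "|" sorts after ["ab"], while element-wise comparison puts it first). The element-wise ordering itself is straightforward to implement directly; here is a minimal plain-Java sketch (independent of ArrayWritable, class and method names invented for illustration):

```java
import java.util.List;

// Element-wise lexicographic comparison of two string lists:
// compare corresponding elements pairwise; if one list is a strict
// prefix of the other, the shorter list sorts first.
public class StringListCompare {
    static int compare(List<String> a, List<String> b) {
        int n = Math.min(a.size(), b.size());
        for (int i = 0; i < n; i++) {
            int c = a.get(i).compareTo(b.get(i));
            if (c != 0) {
                return c;   // first differing element decides the order
            }
        }
        // All shared elements equal: the shorter list is a prefix, so it sorts first.
        return Integer.compare(a.size(), b.size());
    }
}
```

This reproduces the ordering in the examples above, e.g. ["apple", "banana"] < ["apple", "banana", "pear"] because the shorter list is a prefix of the longer one. A raw (byte-level) comparator would additionally need the serialized form to preserve this ordering, which is exactly where a carefully chosen separator or length-prefixed encoding comes in.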


HBase and HDFS sync support in the main branch

2011-09-09 Thread Eugene Kirpichov
Hello,

It appears from http://wiki.apache.org/hadoop/Hbase/HdfsSyncSupport
that for at least the past year, durable edits haven't been supported
by the main branch of HBase & HDFS - only by a non-main branch, an
unreleased branch, and a third-party distribution.

This seems strange to me, as
http://hbase.apache.org/acid-semantics.html seems to claim that
durability is present, without indicating that it's actually absent by
default (part of the page reads like a plan rather than a statement of
the current state of things, which is also confusing).

Is this because durable edits are not that much of a needed feature?

[Disclaimer: I haven't done much with HBase or Hadoop in general in a
while, so I may be asking completely stupid questions in the current
context]

-- 
Eugene Kirpichov
Principal Engineer, Mirantis Inc. http://www.mirantis.com/
Editor, http://fprog.ru/


[ANN] BigDataCamp Delhi, India, Sep 10, 2011

2011-09-09 Thread Sanjay Sharma
Registration here (a few seats left): http://www.cloudcamp.org/delhi

Agenda:
9:30 am - Food, Drinks & Networking
10:00 am - Welcome, Thank-yous & Introductions
10:15 am - Lightning Talks (5 minutes each)
10:45 am - Unpanel
11:45 am - Prepare for Unconference Breakout Sessions (solicit breakout topics, etc.)
12:00 pm - Break (until 12:15 pm)
12:15 pm - Unconference - Round 1
1:00 pm - Lunch
2:15 pm - Unconference - Round 2
2:45 pm - Unconference - Round 3
3:15 pm - Unconference - Round 4
3:45 pm - Wrap Up

Proposed Topics:
Introduction to Hadoop / Big Data
Kundera (ORM for Cassandra, HBase and MongoDB)
Introduction to NoSQL
BigData Analytics
Crux

Sponsors:
IBM, Impetus, Nasscom

Location:
Impetus Infotech (India) Pvt. Ltd.
D-39 & 40, Sector 59
Noida (Near New Delhi)
Uttar Pradesh - 201307

Regards,
Sanjay Sharma



