HBase + MapReduce -- operational design question
Hello,

I have a setup where a bunch of clients store 'events' in an HBase table. Periodically (once a day) I run a MapReduce job that goes over the table and computes some reports. My issue is that I don't want the next MapReduce run to process the 'events' it has already processed. I know that I can mark processed events in the HBase table and the mapper can filter them out during the next run, but what I would really like is for previously processed events to never even hit the mapper.

One solution I can think of is to back up the HBase table after running the job and then clear the table, but this has a lot of problems: 1) Clients may have inserted events while the job was running. 2) I could disable and drop the table and then create it again... but then the clients would complain about this short window of unavailability.

What do people using HBase (live) + MapReduce typically do?

Thanks!
Chinmay
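[Editor's note: one common pattern, not taken from this thread, is to avoid marker columns entirely and drive each run off timestamps: persist the end of the last successfully processed window, and have the next job scan only the half-open range [lastEnd, jobStart). In HBase that window can be applied with Scan.setTimeRange(...) before handing the Scan to TableMapReduceUtil.initTableMapperJob(...), so older events never reach the mapper at all. A minimal sketch of the bookkeeping, with a hypothetical helper class (the HBase wiring itself is omitted):]

```java
import java.util.concurrent.atomic.AtomicLong;

// Tracks the half-open time window [start, end) that the next report job
// should scan. In HBase this would translate to scan.setTimeRange(start, end).
// RunWindowTracker is an illustrative helper, not part of any HBase API.
class RunWindowTracker {
    private final AtomicLong lastProcessedEnd;

    RunWindowTracker(long initialWatermark) {
        this.lastProcessedEnd = new AtomicLong(initialWatermark);
    }

    // Compute the window for a job starting at jobStartMillis. Events inserted
    // while the job runs get timestamps >= jobStartMillis, so they fall into
    // the *next* window instead of being lost or double-counted.
    long[] nextWindow(long jobStartMillis) {
        return new long[] { lastProcessedEnd.get(), jobStartMillis };
    }

    // Advance the watermark only after the job has succeeded, so a failed run
    // is simply retried over the same window.
    void commit(long windowEnd) {
        lastProcessedEnd.set(windowEnd);
    }
}
```

[The watermark itself could live in a one-row HBase table or a file on HDFS; the key point is that clients keep writing during the job, and their new events simply land in the following window, so no unavailability is needed.]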
Re: HBase and HDFS sync support in the main branch
Hello Arun,

Thanks for the clarification. Do you mean 0.20.205 will be a "default" (main, trunk, you name it) release and have durable edits?

On Fri, Sep 9, 2011 at 2:22 PM, Arun C Murthy wrote:
> Eugene,
>
> Currently we are close to getting 0.20.205 frozen, which will be the first Apache Hadoop release with proper support for HBase.
>
> hth,
> Arun
>
> On Sep 9, 2011, at 9:06 AM, Eugene Kirpichov wrote:
>> Hello,
>>
>> It appears from http://wiki.apache.org/hadoop/Hbase/HdfsSyncSupport that for at least the past year, durable edits haven't been supported by the main branch of HBase & HDFS -- only by a non-main branch, an unreleased branch, and a third-party distribution.
>>
>> This seems strange to me, as http://hbase.apache.org/acid-semantics.html seems to claim that durability is present, without indicating that by default it is actually not (part of the page reads like a plan rather than a claim about the current state of things, which is also confusing).
>>
>> Is this because durable edits are not that much of a needed feature?
>>
>> [Disclaimer: I've not done much with HBase and Hadoop in general in a while, so I may be asking completely stupid questions in the current context]
>>
>> --
>> Eugene Kirpichov
>> Principal Engineer, Mirantis Inc. http://www.mirantis.com/
>> Editor, http://fprog.ru/

--
Eugene Kirpichov
Principal Engineer, Mirantis Inc. http://www.mirantis.com/
Editor, http://fprog.ru/
Re: HBase and HDFS sync support in the main branch
Eugene,

Currently we are close to getting 0.20.205 frozen, which will be the first Apache Hadoop release with proper support for HBase.

hth,
Arun

On Sep 9, 2011, at 9:06 AM, Eugene Kirpichov wrote:
> Hello,
>
> It appears from http://wiki.apache.org/hadoop/Hbase/HdfsSyncSupport that for at least the past year, durable edits haven't been supported by the main branch of HBase & HDFS -- only by a non-main branch, an unreleased branch, and a third-party distribution.
>
> This seems strange to me, as http://hbase.apache.org/acid-semantics.html seems to claim that durability is present, without indicating that by default it is actually not (part of the page reads like a plan rather than a claim about the current state of things, which is also confusing).
>
> Is this because durable edits are not that much of a needed feature?
>
> [Disclaimer: I've not done much with HBase and Hadoop in general in a while, so I may be asking completely stupid questions in the current context]
>
> --
> Eugene Kirpichov
> Principal Engineer, Mirantis Inc. http://www.mirantis.com/
> Editor, http://fprog.ru/
What is the best way to implement a key that is an array of strings?
I have a data structure that is a variable-length array of strings. Call it a StringList. I am using StringLists as Hadoop keys. These objects sort lexicographically (e.g. ["apple", "banana"] < ["apple", "banana", "pear"] < ["apple", "pear"] < ["zucchini"]) and are equivalent if and only if all of their elements are equal. What is the best way to implement this object for Hadoop?

Currently I have implemented StringList as an object that extends ArrayWritable and sets the value class to Text. The compareTo method just compares string representations of the StringList objects, since these representations have the ordering property I desire. This works, but I'm uncertain about how it will perform at scale. In order to get the highest performance, would I still have to write a raw comparator for this object, or does ArrayWritable do this for me?

In lieu of writing a raw comparator, should I just implement StringList as an Avro object? I think Avro gives you raw comparators for free, but I haven't dug into this.
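[Editor's note: as a sanity check on the ordering semantics above, the comparison described is ordinary element-wise lexicographic order with a length tie-break when one list is a prefix of the other. A plain-Java sketch, with no Hadoop types, of the logic a WritableComparable's compareTo would typically apply to the deserialized Text elements:]

```java
import java.util.List;

// Element-wise lexicographic comparison of string lists: compare items
// pairwise; if all shared positions are equal, the shorter list sorts first.
class StringListOrder {
    static int compare(List<String> a, List<String> b) {
        int shared = Math.min(a.size(), b.size());
        for (int i = 0; i < shared; i++) {
            int c = a.get(i).compareTo(b.get(i));
            if (c != 0) {
                return c; // first differing element decides the order
            }
        }
        // one list is a prefix of the other (or they are equal):
        // the shorter list comes first
        return Integer.compare(a.size(), b.size());
    }
}
```

[As far as I know, ArrayWritable registers no raw comparator (it is a plain Writable), so without one the sort will deserialize both keys for every comparison; a hand-written RawComparator over the serialized bytes is what avoids that cost.]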
HBase and HDFS sync support in the main branch
Hello,

It appears from http://wiki.apache.org/hadoop/Hbase/HdfsSyncSupport that for at least the past year, durable edits haven't been supported by the main branch of HBase & HDFS -- only by a non-main branch, an unreleased branch, and a third-party distribution.

This seems strange to me, as http://hbase.apache.org/acid-semantics.html seems to claim that durability is present, without indicating that by default it is actually not (part of the page reads like a plan rather than a claim about the current state of things, which is also confusing).

Is this because durable edits are not that much of a needed feature?

[Disclaimer: I've not done much with HBase and Hadoop in general in a while, so I may be asking completely stupid questions in the current context]

--
Eugene Kirpichov
Principal Engineer, Mirantis Inc. http://www.mirantis.com/
Editor, http://fprog.ru/
[ANN] BigDataCamp Delhi, India, Sep 10, 2011
Registration here (few seats left) - http://www.cloudcamp.org/delhi

Agenda:
9:30 am - Food, Drinks & Networking
10:00 am - Welcome, Thank-yous & Introductions
10:15 am - Lightning Talks (5 minutes each)
10:45 am - Unpanel
11:45 am - Prepare for Unconference Breakout Sessions (solicit breakout topics, etc.)
12:00 - 12:15 pm - Break
12:15 pm - Unconference - Round 1
1:00 pm - Lunch
2:15 pm - Unconference - Round 2
2:45 pm - Unconference - Round 3
3:15 pm - Unconference - Round 4
3:45 pm - Wrap Up

Proposed Topics:
Introduction to Hadoop / Big Data
Kundera (ORM for Cassandra, HBase and MongoDB)
Introduction to NoSQL
BigData Analytics
Crux

Sponsors: IBM, Impetus, Nasscom

Location:
Impetus Infotech (India) Pvt. Ltd.
D-39 & 40, Sector 59
Noida (Near New Delhi)
Uttar Pradesh - 201307

Regards,
Sanjay Sharma