OK.  Let's sink the RC.  It's gotten too many -1s.  HBASE-1792/3 are bad too.

For the record, I'm +1 on RC2 becoming the release.  It's been running here
at pset on 110 nodes for the last week or so.  I downloaded it, checked out
its documentation, and started it up locally.

St.Ack


On Wed, Aug 26, 2009 at 9:04 AM, Jonathan Gray <[email protected]> wrote:

> I'm with Andrew.  -1 on RC2.
>
> I don't see the value in putting 0.20.0 into the wild when there are known
> defects, only to release 0.20.1 shortly thereafter saying "this fixes
> important issues, so please upgrade immediately."  It's completely
> acceptable to say that lots of people are using RC2 in production and it's
> fine to move forward with... and upgrade to the release when it is
> available.  Following the release of 0.20.0, we should all be on a PR kick:
> blogging, tweeting, emailing, reaching out, and talking to everyone we can
> about the awesome new release.  So the initial release itself should be
> solid.
>
> The balancing issue is serious: if you lose a node and it comes back
> online, or if you add a new node, your cluster will suffer some serious
> reliability and performance problems.  I don't think we should consider
> this rare or fringe; in fact, it means you can't do rolling restarts
> properly.
>
> I experienced this in our running production system and eventually had to
> keep running the cluster w/o two of my nodes.  If you have a node with far
> fewer regions than the others, then all new regions go to that
> regionserver... load becomes horribly unbalanced if you have a
> recent-data-bias, with a majority of reads and writes going to a single
> node.  This led to that RS being swamped w/ long GC pauses and generally
> bad cluster stability.  It's a release blocker on its own, IMO.  JSharp ran
> into this yesterday, which is how we realized it had gone uncommitted.
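>
> To make the mechanism concrete, here's a toy sketch (made-up numbers, not
> the actual balancer code): with a naive "give each new region to the
> least-loaded server" rule, a server that rejoins with zero regions wins
> every assignment until it catches up, so the newest (and typically hottest)
> regions all pile onto that one node.
>
> import java.util.ArrayList;
> import java.util.List;
>
> public class BalanceSketch {
>   static class Server {
>     final String name;
>     int regions;
>     Server(String name, int regions) { this.name = name; this.regions = regions; }
>   }
>
>   public static void main(String[] args) {
>     List<Server> servers = new ArrayList<Server>();
>     servers.add(new Server("rs1", 100));
>     servers.add(new Server("rs2", 100));
>     servers.add(new Server("rs3", 0));  // just restarted, came back empty
>
>     // 20 new regions, e.g. from splits of the most recent data.
>     for (int i = 0; i < 20; i++) {
>       Server target = servers.get(0);
>       for (Server s : servers) {
>         if (s.regions < target.regions) {
>           target = s;  // least-loaded server wins the assignment
>         }
>       }
>       target.regions++;
>       System.out.println("new region " + i + " -> " + target.name);
>     }
>     // Every one of the 20 new (hot) regions lands on rs3.
>   }
> }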
>
> I *might* be okay with a release of 0.20.0 w/o a fix for HBASE-1784 because
> it is very rare... however, failed compactions leading to data loss is
> pretty nasty, and we should really try to fix it for release if we squash
> RC2 anyway.  This is at least worth putting some effort into over the next
> few days, to see if we can reproduce the issue and fix it (by rolling back
> failed compactions properly).  It's better that regions grow to huge sizes
> because compactions fail, and thus never split, than to lose data outright.
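>
> A minimal sketch of the rollback idea (hypothetical names, not the actual
> compaction code): merge into a temporary file and only swap it in once the
> merge completes; on any failure, delete the partial output and leave the
> original store files untouched.
>
> import java.io.IOException;
>
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class CompactionRollbackSketch {
>   private final FileSystem fs;
>
>   public CompactionRollbackSketch(FileSystem fs) {
>     this.fs = fs;
>   }
>
>   public void compact(Path[] inputs, Path storeDir) throws IOException {
>     // Merge into a temporary file first; the real store files stay in place.
>     Path tmp = new Path(storeDir, "compaction.tmp");
>     try {
>       mergeFiles(inputs, tmp);
>     } catch (IOException e) {
>       // "Rollback": throw away the partial output and surface the failure.
>       // The region keeps serving from the original, un-compacted files.
>       fs.delete(tmp, false);
>       throw e;
>     }
>     // Only after a successful merge do we swap in the result and drop the inputs.
>     Path result = new Path(storeDir, "compacted." + System.currentTimeMillis());
>     fs.rename(tmp, result);
>     for (Path p : inputs) {
>       fs.delete(p, false);
>     }
>   }
>
>   private void mergeFiles(Path[] inputs, Path out) throws IOException {
>     // Details elided: read all inputs and write one merged file to 'out'.
>   }
> }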
>
> HBASE-1780 should be fixed and should not be too difficult, but maybe not a
> release blocker.
>
> On HBASE-1794, we'll have to hear from Ryan what its status is.
>
> No one wants to delay the release any longer, but the most important thing
> we can do is make sure the release is solid... We can't say that with these
> open issues.
>
> Also, HDFS-200 testing by Ryan is turning up some great stuff and he has
> had success (creating a table, kill -9ing the RS and DN, and META recovers
> fully and the table still exists... magic!).  If we wait until Monday or so
> to cut RC3 (hopefully with fixes for much of the above), then perhaps by
> the time we're ready for release we can also have "official" but
> experimental support for HDFS-200.
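>
> For reference, a minimal sketch of the kind of check that exercise implies,
> assuming the 0.20 client API (the table name "t1" is just an example, not
> necessarily what Ryan used):
>
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.client.HBaseAdmin;
> import org.apache.hadoop.hbase.client.HTable;
> import org.apache.hadoop.hbase.client.Result;
> import org.apache.hadoop.hbase.client.ResultScanner;
> import org.apache.hadoop.hbase.client.Scan;
>
> public class RecoveryCheck {
>   public static void main(String[] args) throws Exception {
>     HBaseConfiguration conf = new HBaseConfiguration();
>
>     // Once the RS and DN are back, META should still know about the table...
>     HBaseAdmin admin = new HBaseAdmin(conf);
>     System.out.println("table exists: " + admin.tableExists("t1"));
>
>     // ...and a full scan should return the rows written before the kill -9.
>     HTable table = new HTable(conf, "t1");
>     ResultScanner scanner = table.getScanner(new Scan());
>     int rows = 0;
>     for (Result r : scanner) {
>       rows++;
>     }
>     scanner.close();
>     System.out.println("rows recovered: " + rows);
>   }
> }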
>
> Ryan mentioned if it works sufficiently well he'd like to put it into
> production at supr... and I feel the same here at streamy.  If it generally
> works, we'll want to put it into production as the current data loss story
> is really the only frightening thing left :)
>
> JG
>
>
>
> Andrew Purtell wrote:
>
>> There is a lot riding on getting this release right. There have been some
>> serious bugs unearthed since 0.20.0 RC1. This makes me nervous. I'm not
>> sure I understand the rationale for releasing 0.20.0 now and then 0.20.1
>> in one week, as opposed to taking the same amount of time to run another
>> RC cycle to produce a 0.20.0 without known bad defects. What is the
>> benefit?
>>
>>    HBASE-1794: Recovered data still seems missing until compaction, which
>> might not happen for 24 hours. Seems like a fix is already known?
>>    HBASE-1780: Data loss, known fix.
>>    HBASE-1784: Data loss.
>>
>> I'll try to put up a patch/band-aid against at least one of these tonight.
>>
>> HBASE-1784 is really troubling. We should roll back a failed compaction,
>> not vaporize data. -1 on those grounds alone.
>>
>>    - Andy
>>
>>
>>
>>
>> ________________________________
>> From: stack <[email protected]>
>> To: [email protected]
>> Sent: Wednesday, August 26, 2009 4:21:33 PM
>> Subject: Re: ANN: hbase 0.20.0 Release Candidate 2 available for download
>>
>> It will take a week or so to roll a new RC and to test and vote on it.
>>
>> Why not let out RC2 as 0.20.0 and do 0.20.1 within the next week or so?
>>
>> The balancing issue happens only when you bring a new node online.  Usually
>> balancing ain't bad.
>>
>> The Mathias issue is bad but still being investigated.
>>
>> Andrew?
>>
>> St.Ack
>>
>>
>> On Wed, Aug 26, 2009 at 1:04 AM, Mathias Herberts <
>> [email protected]> wrote:
>>
>>  On Mon, Aug 24, 2009 at 16:51, Jean-Daniel Cryans<[email protected]>
>>> wrote:
>>>
>>>> +1 I ran it without any problem for a while. I asked Mathias if 1784
>>>> should kill it and he thinks no since it is not deterministic.
>>>>
>>> Given the latest run I did and the associated logs/investigation, which
>>> clearly show that the missing rows are related to failed compactions, I've
>>> changed my mind and now think 1784 should kill this RC.
>>>
>>> so -1 for rc2.
>>>
>>> Mathias.
>>>
>>>
>
