Re: HDFS-1599 status? (HDFS tickets to improve HBase)

2011-06-03 Thread Andrew Purtell
Regarding HDFS-347, I believe the following to be true: - The bastard option, i.e. Ryan's patch against 0.20 that just does local reads via File, does lower latency enough to make a difference in HBase random read latencies as measured. I forget the magnitude of the difference offhand but

Re: HDFS-1599 status? (HDFS tickets to improve HBase)

2011-06-03 Thread Kihwal Lee
HDFS-941: Trunk has moved on, so the patch won't apply. There have been significant changes in HDFS lately, so it will require more than a simple rebase/merge. If the original assignee is busy, I am willing to help. HDFS-347: The analysis is pointing out that local socket communication is

Security

2011-06-03 Thread Andrew Purtell
A competing project is out with intranode security. We are running secure HBase in production at Trend Micro now. And secure ZooKeeper. This is full integration with Secure HDFS and MapReduce (including auth tokens for MR), secure RPC, and policy enforcement of table/column ACLs implemented as

Re: HDFS-1599 status? (HDFS tickets to improve HBase)

2011-06-03 Thread Dhruba Borthakur
I completely agree with Ryan. Most of the measurements in HDFS-347 are point comparisons: data rate over socket, single-threaded sequential read from datanode, single-threaded random read from datanode, etc. These measurements are good, but when you run the entire HBase system at load, you

RE: HDFS-1599 status? (HDFS tickets to improve HBase)

2011-06-03 Thread Doug Meil
Thanks everybody for commenting on this thread. We'd certainly like to lobby for movement on these two tickets, and although we don't have anybody that is familiar with the source code we'd be happy to perform some tests to get some performance numbers. Per Kihwal's comments, it sounds like

Re: HDFS-1599 status? (HDFS tickets to improve HBase)

2011-06-03 Thread Todd Lipcon
On Fri, Jun 3, 2011 at 12:50 PM, Doug Meil doug.m...@explorysmedical.com wrote: Thanks everybody for commenting on this thread. We'd certainly like to lobby for movement on these two tickets, and although we don't have anybody that is familiar with the source code we'd be happy to perform

RE: HDFS-1599 status? (HDFS tickets to improve HBase)

2011-06-03 Thread Andrew Purtell
I have patches for HDFS-347 and HDFS-941 (and HDFS-918) for CDH3U0. - Andy From: Doug Meil doug.m...@explorysmedical.com Subject: RE: HDFS-1599 status? (HDFS tickets to improve HBase) To: dev@hbase.apache.org dev@hbase.apache.org Date: Friday, June 3, 2011, 12:50 PM Thanks everybody for

Re: HDFS-1599 status? (HDFS tickets to improve HBase)

2011-06-03 Thread Todd Lipcon
On Fri, Jun 3, 2011 at 3:38 PM, Andrew Purtell apurt...@apache.org wrote: I have patches for HDFS-347 and HDFS-941 (and HDFS-918) for CDH3U0. Does your 347 patch do security? or just the one where it sneaks around back? Have you tested the others under real load for a couple days?   - Andy

Re: HDFS-1599 status? (HDFS tickets to improve HBase)

2011-06-03 Thread Andrew Purtell
Yes, and though I have patches, and I'm happy to provide them if you want... Indeed, 347 doesn't do security or checksums so needs work to say the least. We use it with HBase given a privileged role such that it shares group-readable DFS data directories with the DataNodes. It works for us,

Re: HDFS-1599 status? (HDFS tickets to improve HBase)

2011-06-03 Thread Andrew Purtell
From: Todd Lipcon t...@cloudera.com I have patches for HDFS-347 and HDFS-941 (and HDFS-918) for CDH3U0. Does your 347 patch do security? or just the one where it sneaks around back? Have you tested the others under real load for a couple days? We use the sneaky 347 and, sure, it's a

Re: HDFS-1599 status? (HDFS tickets to improve HBase)

2011-06-03 Thread Jason Rutherglen
I think one'd need to checksum only once on the first file system instantiation, or first access of the file? As mentioned in HDFS-2004, HBase's usage of HDFS is outside of the initial design motivation. Eg, the rules may need to be bent in order to enable performant use of HBase with HDFS. The

Re: HDFS-1599 status? (HDFS tickets to improve HBase)

2011-06-03 Thread Stack
An HDFS-347 that checksums is over in the Hadoop branch that FB published on GitHub (Dhruba and Jon pointed me at it); I've been meaning to put the patch up in the HDFS-347 issue. St.Ack On Fri, Jun 3, 2011 at 4:42 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: I think one'd

Re: HDFS-1599 status? (HDFS tickets to improve HBase)

2011-06-03 Thread Todd Lipcon
Not to be too mean and discouraging to everyone passing around patches against CDH3 and/or 0.20-append, but just an FYI: there is no chance that these things will get committed to an 0.20 branch without first going through trunk. Sharing patches and testing them on real workloads in 20 is a nice

Sample data set of HBase

2011-06-03 Thread Jason Rutherglen
I'm looking for a sample data set to benchmark the Lucene FST, specifically the keys. I'm guessing a common key type for HBase users is timestamp? Perhaps simply creating timestamps for tens of millions of keys would be a reasonable benchmark? Though synthetic it's also easy to adjust (eg,
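
A synthetic timestamp key set of the kind proposed here could be generated along the following lines. This is only a sketch: the function name `synthetic_timestamp_keys`, the `step_ms` knob, the start date, and the 17-character `yyyyMMddHHmmssSSS` key layout are all assumptions made for illustration, not anything taken from the thread or from HBase.

```python
from datetime import datetime, timedelta, timezone


def synthetic_timestamp_keys(count, start=None, step_ms=1):
    """Yield `count` sorted, fixed-width timestamp keys.

    The key layout (yyyyMMddHHmmssSSS, 17 ASCII characters) is a
    hypothetical choice for this sketch, not an HBase convention.
    """
    if start is None:
        # Arbitrary start instant, assumed for the example.
        start = datetime(2011, 6, 3, tzinfo=timezone.utc)
    for i in range(count):
        ts = start + timedelta(milliseconds=i * step_ms)
        # Seconds-resolution prefix plus a zero-padded millisecond suffix.
        yield ts.strftime("%Y%m%d%H%M%S") + f"{ts.microsecond // 1000:03d}"


# Small run; raise `count` to tens of millions for an actual benchmark.
keys = list(synthetic_timestamp_keys(5, step_ms=250))
for key in keys:
    print(key)
```

Changing `step_ms` adjusts how densely keys are packed, which is one way a synthetic set stays easy to tune.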

Re: prefix compression

2011-06-03 Thread Matt Corgan
Thanks for the feedback Stack. Some inline responses: On Thu, Jun 2, 2011 at 9:48 PM, Stack st...@duboce.net wrote: High-level this sounds like a great. Inline below is some feedback and a bit of history on how we got here in case it helps: On Thu, Jun 2, 2011 at 3:28 PM, Matt Corgan

Re: prefix compression

2011-06-03 Thread Jason Rutherglen
Also the next thing to measure with the FST is the key lookup speed. I'm not sure what that'd look like, or how to compare with HBase right now? On Fri, Jun 3, 2011 at 8:42 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Here's a nice preliminary number with the FST, 50 million dates of

Re: prefix compression

2011-06-03 Thread Matt Corgan
Jason - are you feeding it that whole string for each date? Input data is 17 bytes per record * 50mm records = 850MB, and that reduces to 984 bytes? Is it possible to compress by that much? Maybe I'm missing something about how the FST works. Matt On Fri, Jun 3, 2011 at 8:51 PM, Jason

Re: prefix compression

2011-06-03 Thread Matt Corgan
Ah - I see. It's generating multiple duplicate timestamps per millisecond, so there are fewer than 50mm unique strings. Duplicates just require incrementing a counter. Agree it's very cool though! sent from my phone On Jun 3, 2011 9:02 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote:
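
The arithmetic and the duplicate effect discussed in this exchange can be checked with a toy calculation. The sketch below uses the record count and key width quoted in the thread; the sample keys and the `dedupe` helper are hypothetical, and a plain counter stands in for the FST, so this illustrates only why duplicates shrink the unique input, not how Lucene's FST stores data.

```python
from collections import Counter

RECORDS = 50_000_000    # record count quoted in the thread
BYTES_PER_RECORD = 17   # key width quoted in the thread

# Raw input size: 17 bytes * 50 million records = 850,000,000 bytes (850 MB).
raw_bytes = RECORDS * BYTES_PER_RECORD


def dedupe(keys):
    """Collapse duplicate keys into a unique-key -> occurrence-count map."""
    return Counter(keys)


# Toy input (assumed): 12 records that fall on only 3 distinct milliseconds.
sample = [f"20110603000000{ms:03d}" for ms in (0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2)]
counts = dedupe(sample)

# Only the unique keys need to be stored; each duplicate bumps a counter.
unique_bytes = len(counts) * BYTES_PER_RECORD

print(raw_bytes, len(sample), len(counts), unique_bytes)
```

When many records share a timestamp, the number of distinct strings is far smaller than the record count, so the structure has much less than the raw 850 MB to encode.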

Re: prefix compression

2011-06-03 Thread Stack
That can't be true? (smile) How would you search a 'key' in the FST? St.Ack On Fri, Jun 3, 2011 at 9:01 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Yeah it's truly super wild!  Here's the code: http://pastebin.com/bnB53UQz You can see the line that's adding the string: