Re: HBase and Lucene for realtime search

2011-02-12 Thread Bruno Dumon
Hi, AFAIU scaling fulltext search is usually done by processing partitions of posting lists concurrently. That is essentially what you get with sharded solr/katta/elasticsearch. I wonder how you would map things to HBase so that this would be possible. HBase scales on the row key, so if you use
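A rough sketch of one such mapping, with all names assumed rather than taken from the thread: make the term the row key, so each posting list lands in a single row (and hence a single region), which is exactly why the row-key question matters:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    // Row key = term; one column per doc id under family "p"; the value is an
    // encoded positions list. encodePositions() is a hypothetical helper, and
    // checked IOExceptions are elided.
    Configuration conf = HBaseConfiguration.create();
    HTable index = new HTable(conf, "terms");
    Put put = new Put(Bytes.toBytes("lucene"));
    put.add(Bytes.toBytes("p"), Bytes.toBytes("000042"), encodePositions(positions));
    index.put(put);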

Re: HBase and Lucene for realtime search

2011-02-12 Thread Jason Rutherglen
Bruno, thanks for the response. Regarding solr/katta/elasticsearch: these don't have a distributed solution for realtime search [yet]. E.g., a transaction log is required, and a place to store the versioned documents -- sounds a lot like HBase? The technique of query sharding/partitioning is fairly trivial,
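A minimal sketch of the "place to store the versioned documents" half, assuming HBase's native cell versioning is used as the version store (table and family names are made up here; HBase's own WAL would supply the transaction-log durability):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    // serializedDoc is assumed to be the document bytes; checked IOExceptions elided.
    Configuration conf = HBaseConfiguration.create();
    HTable docs = new HTable(conf, "docs");

    // Every write to the same cell keeps a timestamped version of the document.
    Put put = new Put(Bytes.toBytes("doc-42"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("body"), serializedDoc);
    docs.put(put);

    // Read back the three most recent versions of the document.
    Get get = new Get(Bytes.toBytes("doc-42"));
    get.setMaxVersions(3);
    Result latest = docs.get(get);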

Re: Parent/child relation - go vertical, horizontal, or many tables?

2011-02-12 Thread Jason
Thank you all for the great insight. Based on your thoughts I am going to try a hybrid approach - that is, split children into buckets based on id range and store a bucket per row. The row key then would be parent-id:bucket-id, where bucket-id = child-id / n, and n is a bucket size chosen specifically
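A minimal sketch of that scheme (the separator, the bucket size n, and the family name are assumptions; parentId, childId, and childValue are presumed in scope):

    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    long n = 1000L;                      // bucket size, tuned per workload
    long bucketId = childId / n;         // bucket-id = child-id / n
    byte[] rowKey = Bytes.toBytes(parentId + ":" + bucketId);

    // Store each child as one column within its bucket row.
    Put put = new Put(rowKey);
    put.add(Bytes.toBytes("c"), Bytes.toBytes(Long.toString(childId)), childValue);

Fetching all children of a parent is then a prefix scan over "parent-id:".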

Re: HBase and Lucene for realtime search

2011-02-12 Thread Jason Rutherglen
So in giving this a day of breathing room, it looks like HBase loads values as it's scanning a column? I think that'd be a killer to some Lucene queries, e.g., we'd be loading entire (or parts of) posting lists just for a linear scan of the terms dict? Or we'd probably instead want to place the posting
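If value loading during the scan is the worry, one thing worth checking is stripping values server-side so a terms-dict walk ships keys only -- a sketch, assuming KeyOnlyFilter is available in the HBase build in question, and with made-up table/family names:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    // Checked IOExceptions elided for brevity.
    Configuration conf = HBaseConfiguration.create();
    HTable terms = new HTable(conf, "terms");
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("p"));      // posting-list family (assumed name)
    scan.setFilter(new KeyOnlyFilter());     // return keys only; values stay server-side
    ResultScanner scanner = terms.getScanner(scan);
    for (Result r : scanner) {
      byte[] term = r.getRow();              // walk the terms dict without payloads
    }
    scanner.close();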

porting from hbase 0.20.3 to 0.90

2011-02-12 Thread Oleg Ruchovets
Hi, we are going to port our production environment to 0.90 and I have a couple of questions: 1) We are using HTablePool, which returned HTable in version 0.20.3, but now it returns HTableInterface. In our code we used HTable class methods: getStartEndKeys(); setAutoFlush(false);
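For #1, a sketch of the common workaround, on the assumption (worth verifying against your exact build) that the 0.90 pool still hands back real HTable instances:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.HTableInterface;
    import org.apache.hadoop.hbase.client.HTablePool;
    import org.apache.hadoop.hbase.util.Pair;

    // Checked IOExceptions elided for brevity.
    Configuration conf = HBaseConfiguration.create();
    HTablePool pool = new HTablePool(conf, 10);
    HTableInterface t = pool.getTable("mytable");   // now typed as HTableInterface
    if (t instanceof HTable) {
      HTable ht = (HTable) t;                       // recover the HTable-only methods
      ht.setAutoFlush(false);
      Pair<byte[][], byte[][]> keys = ht.getStartEndKeys();
    }
    pool.putTable(t);                               // return the table to the pool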

Re: HBase and Lucene for realtime search

2011-02-12 Thread Ted Dunning
I really think that putting update semantics into Katta would be much easier. Building the write-ahead log for the Lucene case isn't all that hard. If you follow the ZooKeeper model of having a WAL thread that writes batches of log entries, you can get pretty high speed as well. The basic idea
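A rough sketch of that group-commit pattern (not code from the thread): producers enqueue entries, a single log thread drains whatever has accumulated, and one write-plus-sync covers the whole batch:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    class GroupCommitWal implements Runnable {
      private final BlockingQueue<byte[]> queue = new LinkedBlockingQueue<byte[]>();

      public void append(byte[] entry) throws InterruptedException {
        queue.put(entry);                   // producers block only on a full queue
      }

      public void run() {
        List<byte[]> batch = new ArrayList<byte[]>();
        try {
          while (true) {
            batch.add(queue.take());        // wait for the first entry
            queue.drainTo(batch);           // then grab everything else waiting
            writeAndSync(batch);            // one write + one fsync per batch
            batch.clear();
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }

      private void writeAndSync(List<byte[]> batch) {
        // append entries to the log file, then sync once (details elided)
      }
    }

A real version would also ack each producer only after the sync covering its entry completes.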

Re: porting from hbase 0.20.3 to 0.90

2011-02-12 Thread Ted Yu
Great questions. For #2, I think the hadoop append feature is for durability. From the master log, you would see:

2011-02-11 00:34:09,494 INFO org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Using syncFs -- HDFS-200
2011-02-11 00:34:09,495 INFO

Re: Using the Hadoop bundled in the lib directory of HBase

2011-02-12 Thread Mike Spreitzer
Let me be clear about the amount of testing I did: extremely little. I should also point out that at first I did not fully appreciate the meaning of your earlier comment to Vijay saying "this is a little off" --- I now realize you were in fact saying that Vijay told me to do things backward.

Re: Using the Hadoop bundled in the lib directory of HBase

2011-02-12 Thread Ryan Rawson
If you are taking the jar that we ship and slamming it into a hadoop 0.20.2 based distro, that might work. I'm not sure if there are any differences other than pure code (which would then be expressed in the jar only), so this approach might work. You could also check out the revision that we built