[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13007098#comment-13007098 ]

Jason Rutherglen commented on HBASE-3529:
-----------------------------------------

bq. Patch looks great Jason. Is it working?

Stack, thanks for your comments.  The test cases pass.  They're not very 
stressful yet.

bq. FYI, there is Bytes.equals in place of

It's copied from TestRegionObserverInterface.  

bq. You are making HDFS locks. Would it make more sense doing ephemeral locks 
in zk since zk is part of your toolkit when up on hbase?

That's a good idea; however, if HBase is enforcing the lock on a region, 
meaning the region can only exist on one server at a time, then the Lucene 
index locks are less important.
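
Since HBase only ever assigns a region to a single region server, the Lucene 
write lock is largely redundant there.  As a rough illustration (not the patch 
code; it assumes the Lucene 4.x-era NoLockFactory API and a hypothetical 
per-region path), the index directory could be opened with locking disabled:

{code:java}
import java.io.File;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.NIOFSDirectory;
import org.apache.lucene.store.NoLockFactory;

public class RegionIndexDirectory {
  // Sketch: HBase's single assignment of a region makes the Lucene file
  // lock unnecessary, so the directory is opened with NoLockFactory.
  public static Directory open(String regionIndexPath) throws Exception {
    return new NIOFSDirectory(new File(regionIndexPath),
        NoLockFactory.getNoLockFactory());
  }
}
{code}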

bq. Have you done any perf testing on this stuff. Is it going to be fast 
enough? You hoping for most searches in in-memory.

Once we get the functionality working, restructuring Lucene or Solr as 
needed, and assuming positional reads in HDFS get implemented (I have a 
separate HDFS patch that can be applied), I'll start to benchmark.  The index 
doesn't need to be in heap space, as local positional file reads should rely 
on the system IO cache.
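
To make the reliance on the OS cache concrete, here is a minimal sketch of the 
positional-read pattern the index files would depend on (plain Hadoop 
FileSystem API; the path and sizes are illustrative, and this is not the HDFS 
patch itself):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PositionalReadExample {
  // pread does not move a shared file pointer, so concurrent index readers
  // can share one open stream, and repeated reads of hot index sections are
  // served from the operating system's cache rather than the Java heap.
  public static byte[] pread(FileSystem fs, Path file, long pos, int len)
      throws Exception {
    byte[] buf = new byte[len];
    FSDataInputStream in = fs.open(file);
    try {
      in.readFully(pos, buf, 0, len);   // positional read, no seek()
    } finally {
      in.close();
    }
    return buf;
  }

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    byte[] bytes = pread(fs, new Path("/hbase/index/_0.frq"), 0L, 16);
    System.out.println("read " + bytes.length + " bytes");
  }
}
{code}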

bq. Whats appending codec?

Some Lucene segment files, after being written, seek back to the beginning of 
the file to write header information; the append codec only writes forward.
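
As a generic illustration of the two patterns (plain java.io, not the codec 
itself): the seek-back style patches a count into the header after the data 
is written, while the append-only style writes forward and records the count 
in a trailer:

{code:java}
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

public class AppendOnlyPattern {

  // Seek-back style: reserve space for the count, write the entries, then
  // rewind to patch the header, which an append-only filesystem cannot do.
  static void writeWithHeaderPatch(String path, int[] values) throws IOException {
    RandomAccessFile f = new RandomAccessFile(path, "rw");
    try {
      f.writeInt(0);                    // placeholder for the count
      for (int v : values) f.writeInt(v);
      f.seek(0);
      f.writeInt(values.length);        // seek back and overwrite the header
    } finally {
      f.close();
    }
  }

  // Append-only style: write forward only; the count is read from the tail.
  static void writeAppendOnly(String path, int[] values) throws IOException {
    DataOutputStream out = new DataOutputStream(new FileOutputStream(path));
    try {
      for (int v : values) out.writeInt(v);
      out.writeInt(values.length);      // trailer instead of a patched header
    } finally {
      out.close();
    }
  }
}
{code}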

bq. Class comment missing from documenttransformer to explain what it does. Its 
abstract. Should it be an Interface? (Has no functionality).

I will change it to an interface.
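
Roughly (the method name and parameter type here are hypothetical, just to 
show the interface form):

{code:java}
import org.apache.hadoop.hbase.client.Put;
import org.apache.lucene.document.Document;

/** Turns an incoming HBase mutation into the Lucene document to index. */
public interface DocumentTransformer {
  Document transform(Put put);
}
{code}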

{quote}So, you think the package should be o.a.h.h.search? Do you think this 
all should ship with hbase Jason? By all means push back into hbase changes you 
need for your implementation but its looking big enough to be its own project? 
What you reckon?{quote}

I think it'll be easier to write the code embedded in HBase; then, if it 
works [well], we can decide?

bq. What is this 'convert' in HIS doing? Cloning?

It's loading the actual data from HBase and returning it in a Lucene 
document.  While we can simply return the row + timestamp, loading the doc 
data is useful if we integrate Solr, because Solr needs a fully fleshed-out 
document to perform, for example, highlighting.
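
Conceptually it looks something like the following (class and field names are 
illustrative, assuming the Lucene 4.x field API and the HBase 0.90 client 
API):

{code:java}
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;

public class HitConverter {
  // Re-reads the row from HBase and builds a fully fleshed-out Lucene
  // Document so Solr features such as highlighting have the values to work
  // with, instead of returning only the row + timestamp.
  public static Document convert(HTable table, byte[] row) throws Exception {
    Result result = table.get(new Get(row));
    Document doc = new Document();
    doc.add(new StringField("rowkey", Bytes.toString(row), Field.Store.YES));
    for (KeyValue kv : result.raw()) {
      String name = Bytes.toString(kv.getFamily()) + ":"
          + Bytes.toString(kv.getQualifier());
      doc.add(new StoredField(name, Bytes.toString(kv.getValue())));
    }
    return doc;
  }
}
{code}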

I'll incorporate the rest of the code recommendations.  The next patch will 
[hopefully] have an RPC-based search call, implement index splitting (eg, 
performing the same operation on the index as a region split), and include a 
test case for WAL-based index restoring.
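
For the splitting, one possible approach, sketched below (not necessarily 
what the patch will do; the "rowkey" field name is illustrative), is to copy 
the parent region's index into each daughter directory and then delete the 
documents outside that daughter's row-key range:

{code:java}
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.TermRangeQuery;

public class IndexSplit {
  // After copying the parent region's index to a daughter directory, keep
  // only documents whose row key falls in [startRow, endRow).
  public static void pruneToDaughterRange(IndexWriter daughterWriter,
      String startRow, String endRow) throws Exception {
    // Delete everything below the daughter's start row...
    daughterWriter.deleteDocuments(
        TermRangeQuery.newStringRange("rowkey", null, startRow, false, false));
    // ...and everything at or above its end row.
    daughterWriter.deleteDocuments(
        TermRangeQuery.newStringRange("rowkey", endRow, null, true, false));
    daughterWriter.commit();
  }
}
{code}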

> Add search to HBase
> -------------------
>
>                 Key: HBASE-3529
>                 URL: https://issues.apache.org/jira/browse/HBASE-3529
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.0
>            Reporter: Jason Rutherglen
>         Attachments: HBASE-3529.patch, 
> lucene-analyzers-common-4.0-SNAPSHOT.jar, lucene-core-4.0-SNAPSHOT.jar, 
> lucene-misc-4.0-SNAPSHOT.jar
>
>
> Using the Apache Lucene library we can add freetext search to HBase.  The 
> advantages of this are:
> * HBase is highly scalable and distributed
> * HBase is realtime
> * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
> * Lucene offers many types of queries not currently available in HBase (eg, 
> AND, OR, NOT, phrase, etc)
> * It's easier to build scalable realtime systems on top of an already 
> architecturally sound, scalable realtime data system, eg, HBase.
> * Scaling realtime search will be as simple as scaling HBase.
> Phase 1 - Indexing:
> * Integrate Lucene into HBase such that an index mirrors a given region.  
> This means cascading add, update, and deletes between a Lucene index and an 
> HBase region (and vice versa).
> * Define meta-data to mark a region as indexed, and use a Solr schema to 
> allow the user to define the fields and analyzers.
> * Integrate with the HLog to ensure that index recovery can occur properly 
> (eg, on region server failure)
> * Mirror region splits with indexes (use Lucene's IndexSplitter?)
> * When a region is written to HDFS, also write the corresponding Lucene index 
> to HDFS.
> * A row key will be the ID of a given Lucene document.  The Lucene docstore 
> will explicitly not be used because the document/row data is stored in HBase. 
>  We will need to determine the best data structure for efficiently mapping 
> a docid -> row key; it could be a docstore, field cache, column stride 
> fields, or some other mechanism (a rough sketch of one option follows this 
> description).
> * Write unit tests for the above
> Phase 2 - Queries:
> * Enable distributed Lucene queries
> * Regions that have Lucene indexes are inherently available and may be 
> searched on, meaning there's no need for a separate search related system in 
> Zookeeper.
> * Integrate search with HBase's RPC mechanism
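
Regarding the Phase 1 item above on mapping docid -> row key: the simplest 
option is to index the row key as its own field and resolve hits back to 
HBase rows at search time.  A rough sketch (assuming the Lucene 4.x API; 
whether a field cache or column stride fields end up faster is still open):

{code:java}
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;

public class RowKeyLookup {
  // Resolves each Lucene hit to the HBase row key stored in the "rowkey"
  // field; the actual row data is then fetched from the region rather than
  // from the Lucene docstore.
  public static String[] searchRowKeys(Directory dir, Query query, int n)
      throws Exception {
    IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
    TopDocs hits = searcher.search(query, n);
    String[] rowKeys = new String[hits.scoreDocs.length];
    for (int i = 0; i < hits.scoreDocs.length; i++) {
      ScoreDoc sd = hits.scoreDocs[i];
      rowKeys[i] = searcher.doc(sd.doc).get("rowkey");
    }
    return rowKeys;
  }
}
{code}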

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
