Hi Thomas,

Thanks for sharing your code for lucehbase.
The schema you used seems to be the same as the one used in Lucandra:

-------------------
*Documents Ids are currently random and autogenerated.

*Term keys and Document Keys are encoded as follows (using a random binary delimiter)

     Term Key                     col name         value
     "index_name/field/term" => { documentId , position vector }

     Document Key
     "index_name/documentId" => { fieldName , value }
--------------------

I have two questions:
1) For a given term key, the number of columns can potentially get very large. Have you tried another schema where the document id is put in the key, i.e.:

     Term Key                           col name   value
     "index_name/field/term/docid" => { info , position vector }

That way you get trivial paging in the case where a lot of documents contain the term.
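
The paging idea can be sketched roughly as follows. This is only an illustration in Python that simulates a lexicographically sorted row-key space with a sorted list; it does not use the HBase client API. The names `make_key` and `scan_page`, the "/" delimiter (the quoted schema uses a binary one), and the sample data are all assumptions made for the example.

```python
import bisect

DELIM = "/"  # assumption: stands in for the binary delimiter of the real schema


def make_key(index, field, term, docid):
    """Build a row key of the form index_name/field/term/docid."""
    return DELIM.join((index, field, term, docid))


def scan_page(sorted_keys, index, field, term, start_after=None, limit=2):
    """Return up to `limit` docids for a term, resuming after `start_after`.

    Mimics a prefix scan over sorted row keys: seek to the first key with
    the term prefix (or just past the last docid already seen) and read
    forward until the prefix no longer matches or the page is full.
    """
    prefix = DELIM.join((index, field, term)) + DELIM
    if start_after is None:
        pos = bisect.bisect_left(sorted_keys, prefix)
    else:
        pos = bisect.bisect_right(sorted_keys, prefix + start_after)
    page = []
    while pos < len(sorted_keys) and len(page) < limit:
        key = sorted_keys[pos]
        if not key.startswith(prefix):
            break  # left the key range of this term
        page.append(key[len(prefix):])
        pos += 1
    return page


# Hypothetical postings: three documents contain "hbase", one contains "lucene".
keys = sorted(
    make_key("idx", "body", term, docid)
    for term, docid in [
        ("hbase", "d1"), ("hbase", "d2"), ("hbase", "d3"),
        ("lucene", "d2"),
    ]
)

first = scan_page(keys, "idx", "body", "hbase")
second = scan_page(keys, "idx", "body", "hbase", start_after=first[-1])
```

Each page only needs the last docid of the previous page as its resume point, so no offset counting over a wide row is required.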

2) Once you have the list of docids, fetching the document details (i.e. the pairs { fieldName , value }) will trigger a lot of random-access queries to HBase. By contrast, in 1) you either open a scanner (with the alternative schema "index_name/field/term/docid") or just get one row (with the schema "index_name/field/term"). I am wondering how you can get fast answers that way. If you have few fields, would it be a good idea to also store the values in the index? Only the alternative schema "index_name/field/term/docid" allows this.
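
To make the trade-off in 2) concrete, here is a small Python sketch that counts point lookups under both approaches. It uses plain dictionaries in place of HBase tables, and the field name "title", the row contents, and the function names are hypothetical, chosen only for illustration.

```python
# Index rows keyed by index_name/field/term/docid. In the "stored values"
# variant each row also carries a copy of the document's field values.
index_rows = {
    "idx/body/hbase/d1": {"info": "pos:3,17", "title": "Intro to HBase"},
    "idx/body/hbase/d2": {"info": "pos:5", "title": "Lucene on HBase"},
}

# Document rows keyed by index_name/documentId, as in the quoted schema.
doc_rows = {
    "idx/d1": {"title": "Intro to HBase"},
    "idx/d2": {"title": "Lucene on HBase"},
}


def search_with_doc_lookups(prefix):
    """One scan over the index, then one random get per matching document."""
    gets = 0
    results = []
    for key in sorted(index_rows):
        if key.startswith(prefix):
            docid = key[len(prefix):]
            results.append((docid, doc_rows["idx/" + docid]["title"]))
            gets += 1  # each document row is a separate point lookup
    return results, gets


def search_with_stored_values(prefix):
    """One scan over the index; field values are read from the same rows."""
    results = [
        (key[len(prefix):], cols["title"])
        for key, cols in sorted(index_rows.items())
        if key.startswith(prefix)
    ]
    return results, 0  # no extra point lookups needed


with_gets, n_gets = search_with_doc_lookups("idx/body/hbase/")
stored, n_stored = search_with_stored_values("idx/body/hbase/")
```

Both calls return the same results, but the first needs one extra get per hit. The cost of the second is that field values are duplicated in every term row of a document, which is why it seems reasonable only when there are few, small fields.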

Thanks
TuX



Thomas Koch wrote:
Hi,

Lucandra stores a lucene index on cassandra:
http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend

As the author of lucandra writes: "I’m sure something similar could be built on hbase."

So here it is:
http://github.com/thkoch2001/lucehbase

This is only a first prototype which has not been tested on anything real yet. But if you're interested, please join me to get it production ready!

I propose to keep this thread on hbase-user and java-dev only.
Would it make sense to aim this project to become an hbase contrib? Or a lucene contrib?

Best regards,

Thomas Koch, http://www.koch.ro
