Hi Thomas,
Thanks for sharing your code for lucehbase.
The schema you use seems to be the same as the one used in Lucandra:
-------------------
* Document IDs are currently random and autogenerated.
* Term keys and document keys are encoded as follows (using a random
binary delimiter)
Term Key                        col name     value
"index_name/field/term"    => { documentId , position vector }
Document Key
"index_name/documentId"    => { fieldName  , value }
--------------------
I have two questions:
1) For a given term key, the number of columns can potentially get very
large. Have you tried another schema where the document ID is part of the
row key, i.e.:
Term Key                              col name   value
"index_name/field/term/docid"    => { info     , position vector }
That way you get trivial paging in the case where a lot of documents
contain the term.
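A minimal sketch of the paging idea, using a sorted in-memory list as a
stand-in for HBase's ordered row keys (this is not the HBase client API).
The "/" delimiter, the function names and the sample data are hypothetical;
the real schema uses a binary delimiter.

```python
import bisect

# Assumption: "/" stands in for the binary delimiter of the real schema.
DELIM = "/"


def term_row_key(index, field, term, docid):
    """Build a row key following index_name/field/term/docid."""
    return DELIM.join([index, field, term, docid])


def scan_page(sorted_keys, index, field, term, after_docid=None, limit=2):
    """Return up to `limit` docids for a term, resuming after `after_docid`."""
    prefix = DELIM.join([index, field, term]) + DELIM
    if after_docid is None:
        # First page: start at the first key that could carry the prefix.
        start = bisect.bisect_left(sorted_keys, prefix)
    else:
        # Later pages: start just past the last key already returned.
        start = bisect.bisect_right(sorted_keys, prefix + after_docid)
    page = []
    for key in sorted_keys[start:]:
        # Stop when the term changes or the page is full.
        if not key.startswith(prefix) or len(page) == limit:
            break
        page.append(key[len(prefix):])
    return page


# Hypothetical sample data: three documents contain "hbase", one "lucene".
keys = sorted(
    term_row_key("idx", "body", term, docid)
    for term, docid in [("hbase", "d1"), ("hbase", "d2"),
                        ("hbase", "d3"), ("lucene", "d1")]
)
first = scan_page(keys, "idx", "body", "hbase")
second = scan_page(keys, "idx", "body", "hbase", after_docid=first[-1])
```

Each page only needs the last docid of the previous page as its start
position, so no column offset has to be tracked inside a single wide row.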
2) Once you have the list of docids, fetching the document details (i.e. the
pairs { fieldName , value }) triggers a lot of random-access reads against
HBase, one per document. (Step 1 is cheap by comparison: with the alternative
schema "index_name/field/term/docid" you open a scanner, and with the schema
"index_name/field/term" you get just one row.) I am wondering how you can
get fast answers that way. If you have only a few fields, would it be a good
idea to also store the values in the index? Only the alternative schema
"index_name/field/term/docid" allows this.
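To make the access-pattern difference concrete, here is a rough sketch that
counts simulated requests for both layouts. The CountingTable class, the
"positions" column name and the sample rows are hypothetical, and counting
one scanner as one request is a simplification (a real scanner fetches rows
in batches).

```python
class CountingTable:
    """Dict-backed stand-in for a table that counts simulated requests."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.requests = 0

    def get(self, key):
        # One random-access read per call.
        self.requests += 1
        return self.rows[key]

    def scan_prefix(self, prefix):
        # Assumption: one scanner counts as one request.
        self.requests += 1
        return [(k, v) for k, v in sorted(self.rows.items())
                if k.startswith(prefix)]


def fetch_current(terms, docs, index, field, term):
    """Current schema: one term row, then one get per matching document."""
    postings = terms.get(f"{index}/{field}/{term}")
    return {docid: docs.get(f"{index}/{docid}") for docid in postings}


def fetch_alternative(wide, index, field, term):
    """Alternative schema with field values stored next to the positions."""
    prefix = f"{index}/{field}/{term}/"
    return {
        key[len(prefix):]: {f: v for f, v in row.items() if f != "positions"}
        for key, row in wide.scan_prefix(prefix)
    }


# Hypothetical sample data: the term "hbase" occurs in two documents.
terms = CountingTable({"idx/body/hbase": {"d1": [0], "d2": [3]}})
docs = CountingTable({"idx/d1": {"title": "A"}, "idx/d2": {"title": "B"}})
wide = CountingTable({
    "idx/body/hbase/d1": {"positions": [0], "title": "A"},
    "idx/body/hbase/d2": {"positions": [3], "title": "B"},
})

current_result = fetch_current(terms, docs, "idx", "body", "hbase")
alternative_result = fetch_alternative(wide, "idx", "body", "hbase")
current_requests = terms.requests + docs.requests
alternative_requests = wide.requests
```

The price of the second layout is that the stored values are duplicated for
every term of the document, so the index grows and updates touch more rows.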
Thanks
TuX
Thomas Koch wrote:
Hi,
Lucandra stores a lucene index on cassandra:
http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend
As the author of lucandra writes: "I’m sure something similar could be built
on hbase."
So here it is:
http://github.com/thkoch2001/lucehbase
This is only a first prototype which has not been tested on anything real yet.
But if you're interested, please join me to get it production ready!
I propose to keep this thread on hbase-user and java-dev only.
Would it make sense to aim this project to become an hbase contrib? Or a
lucene contrib?
Best regards,
Thomas Koch, http://www.koch.ro