Hi TuX - There is a different project being done here that uses HBase as the backing store of a TF-IDF index, at http://github.com/akkumar/hbasene , addressing the same problem, and I am speaking on behalf of that project.
On Mon, Apr 19, 2010 at 1:06 AM, TuX RaceR <tuxrace...@gmail.com> wrote:
> Hi Thomas,
>
> Thanks for sharing your code for lucehbase.
> The schema you use seems to be the same as the one used in lucandra:
>
> -------------------
> * Document ids are currently random and autogenerated.
>
> * Term keys and document keys are encoded as follows (using a random
>   binary delimiter):
>
> Term key                      col name => value
> "index_name/field/term"       => { documentId, position vector }
>
> Document key
> "index_name/documentId"       => { fieldName, value }
> --------------------
>
> I have two questions:
>
> 1) For a given term key, the number of columns can potentially get very
> large. Have you tried another schema where the document id is put in the
> key, i.e.:
>
> Term key                      col name => value
> "index_name/field/term/docid" => { info, position vector }
>
> That way you get trivial paging in the case where a lot of documents
> contain the term.

The document ids are encoded as a compressed bitset to scale: with the docid being part of the key (docid * unique terms rows), that layout would not give the best locality of reference for unions, intersections, range queries, etc. The HBase RPC is being modified to append a docid for an already existing field/term to the compressed encoding stored in the family/column name, to achieve that locality of reference and scale with the number of documents.

> 2) Once you get the list of docids, to get the document details (i.e. the
> pairs { fieldName, value }) you will trigger a lot of random-access
> queries to HBase (whereas in 1, with the alternative schema
> "index_name/field/term/docid", you open a scanner, and with the schema
> "index_name/field/term" you just get one row). I am wondering how you can
> get fast answers that way. If you have few fields, would it be a good idea
> to also store the values in the index (only the alternative schema
> "index_name/field/term/docid" allows this)?
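To make the trade-off in question 1 concrete, here is a minimal, self-contained sketch of the "docid in the row key" layout. It does not use the actual HBase client API; a TreeMap stands in for HBase's lexicographically sorted row keys, and the prefix `subMap` plays the role of a Scan with start/stop rows. All class and method names here are hypothetical, not from lucehbase or hbasene.

```java
import java.util.SortedMap;
import java.util.TreeMap;

public class TermKeyScan {
    // Emulate HBase's sorted row keys with a TreeMap: one row per
    // "index_name/field/term/docid", so all postings for a term are
    // contiguous and can be paged with a scanner.
    static final SortedMap<String, String> rows = new TreeMap<>();

    static void put(String index, String field, String term,
                    String docId, String info) {
        rows.put(index + "/" + field + "/" + term + "/" + docId, info);
    }

    // A prefix scan over "index/field/term/" returns exactly the docids
    // for that term, the way an HBase Scan with start/stop rows would.
    static SortedMap<String, String> scanTerm(String index, String field,
                                              String term) {
        String start = index + "/" + field + "/" + term + "/";
        // '0' sorts immediately after '/', so [start, stop) covers
        // only this term's rows and nothing else.
        String stop = index + "/" + field + "/" + term + "0";
        return rows.subMap(start, stop);
    }

    public static void main(String[] args) {
        put("idx", "body", "hbase", "doc1", "pos:3,17");
        put("idx", "body", "hbase", "doc2", "pos:5");
        put("idx", "body", "lucene", "doc1", "pos:9");
        // Prints the two rows matching the term "hbase"
        System.out.println(scanTerm("idx", "body", "hbase").keySet());
    }
}
```

The paging benefit TuX describes falls out of this layout: a scanner can stop after N rows and resume from the last key seen, whereas with one row per term, all docid columns come back at once.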
Once the documents go into the index, for all practical purposes the manipulation is done on numbers assigned to the user-specified id space. More often than not, the only field stored is the "id", which is retrieved after all the computation and can then be used to index into another store to fetch the other details of the search schema. Except for limited cases (sorting, faceting, etc.), using the tf-idf representation to store a document's fields goes against the format being used, and is advised to be used sparingly. There is a low-volume mailing list at http://groups.google.com/group/hbasene-user for discussion about this, which you can hop on if you are interested.

> Thanks
> TuX
>
>
> Thomas Koch wrote:
>> Hi,
>>
>> Lucandra stores a lucene index on cassandra:
>>
>> http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend
>>
>> As the author of lucandra writes: "I’m sure something similar could be
>> built on hbase."
>>
>> So here it is:
>> http://github.com/thkoch2001/lucehbase
>>
>> This is only a first prototype which has not been tested on anything real
>> yet. But if you're interested, please join me to get it production ready!
>>
>> I propose to keep this thread on hbase-user and java-dev only.
>> Would it make sense to aim for this project to become an hbase contrib?
>> Or a lucene contrib?
>>
>> Best regards,
>>
>> Thomas Koch, http://www.koch.ro
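The point about manipulating dense numeric ids (and the compressed-bitset encoding mentioned for question 1) can be illustrated with `java.util.BitSet`. This is only a sketch of the idea, not hbasene's actual posting-list code: once docids are small integers, a boolean query over two terms' posting lists reduces to bitwise AND/OR, which is why locality of those encodings matters so much.

```java
import java.util.BitSet;

public class PostingBitSets {
    public static void main(String[] args) {
        // Hypothetical posting lists over a dense integer docid space:
        // bit i set means document i contains the term.
        BitSet hbase = new BitSet();   // docs containing "hbase"
        BitSet lucene = new BitSet();  // docs containing "lucene"
        hbase.set(1); hbase.set(2); hbase.set(7);
        lucene.set(2); lucene.set(7); lucene.set(9);

        // "hbase AND lucene": intersection via bitwise AND
        BitSet both = (BitSet) hbase.clone();
        both.and(lucene);              // -> {2, 7}

        // "hbase OR lucene": union via bitwise OR
        BitSet either = (BitSet) hbase.clone();
        either.or(lucene);             // -> {1, 2, 7, 9}

        System.out.println(both + " " + either);
    }
}
```

The surviving bit positions are then mapped back to the stored "id" field to look up full documents in another store, as described above.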