Re: Pointing to Hbase for Documents or Directly Saving Documents at Hbase

Jack Krupansky Sat, 06 Apr 2013 18:44:09 -0700

Solr would not be storing the original source form of the documents in any case. Whether you use Tika or SolrCell, only the text stream of the content and the metadata would ever get indexed or stored in Solr.

Solr completely decouples "indexing" and "storing" of data values. If you don't want to "store" the text stream in Solr, then don't.
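
To make the indexed/stored distinction concrete, here is a toy model in Python. It is only an illustration under simplifying assumptions and not how Solr is implemented; the class `ToyIndex` and its `store_text` flag are made-up names. The point it shows is that a document can be fully searchable while none of its text is kept for retrieval.

```python
from collections import defaultdict


class ToyIndex:
    """Toy model of the indexed/stored split; NOT Solr's implementation."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of docs containing it
        self.stored = {}                  # doc id -> field values kept for retrieval

    def add(self, doc_id, text, store_text=False):
        # "Indexing": record which terms occur in the document.
        for term in text.lower().split():
            self.postings[term].add(doc_id)
        # "Storing": keep the text itself only when asked to.
        self.stored[doc_id] = {"text": text} if store_text else {}

    def search(self, term):
        # Return matching doc ids in a stable order.
        return sorted(self.postings.get(term.lower(), set()))


index = ToyIndex()
index.add("doc1", "HBase holds the crawled pages", store_text=False)
index.add("doc2", "Solr indexes the crawled pages", store_text=True)
```

Both documents match a search for `crawled`, but only `doc2` can return its text from the index; `doc1` would have to be fetched from wherever the original lives.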

If you want to "store" the original blob of the source documents in some other data store, that's your choice. In Solr you can store the original URL, or a document ID or URL for some alternate document store. Solr in no way forces you one way or the other, and whether that URL or document ID refers to HBase or a web site doesn't matter to Solr either.
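
The pointer approach described above could be sketched roughly as follows. Everything named here is an assumption for illustration: the field names `hbase_row_key` and `content`, the helper functions, and the in-memory dict standing in for HBase so the sketch runs without a cluster. The Solr document carries only the extracted text and a key; the raw bytes live elsewhere.

```python
import json

# Stand-in for the external document store (HBase in this thread).
# A plain dict keeps the sketch runnable; a real setup would use an HBase client.
blob_store = {}


def save_original(row_key, raw_bytes):
    """Keep the original document bytes outside Solr, under a row key."""
    blob_store[row_key] = raw_bytes


def make_solr_doc(doc_id, url, row_key, text):
    """Build the document sent to Solr: text to index plus a pointer.

    Field names are hypothetical; 'content' is assumed to be configured
    as indexed but not stored in the schema.
    """
    return {"id": doc_id, "url": url, "hbase_row_key": row_key, "content": text}


def fetch_original(search_hit):
    """Resolve a search hit back to the original bytes via its pointer."""
    return blob_store[search_hit["hbase_row_key"]]


raw = b"<html><body>Hello crawl</body></html>"
save_original("com.example/index", raw)
doc = make_solr_doc("1", "http://example.com/", "com.example/index", "Hello crawl")
payload = json.dumps([doc])  # JSON body for an update request to Solr
```

Solr treats `hbase_row_key` as just another string field; resolving it to the original document is entirely the application's job.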

Whether or not you could more efficiently store the original document bytes in Lucene/Solr DocValues vs. HBase is a separate matter - I don't know one way or the other whether DocValues would help. Nor do I know whether a Solr BinaryField might be suitable for storing the original bytes of a document (but without indexing the bytes).

In other words, maybe you could just use two separate Solr servers, one for text index and metadata store, and the other for raw store of the original document bytes.


-- Jack Krupansky

-----Original Message-----
From: Furkan KAMACI

Sent: Saturday, April 06, 2013 6:01 PM
To: solr-user@lucene.apache.org

Subject: Pointing to Hbase for Documents or Directly Saving Documents at Hbase


Hi;

First of all, I should mention that I am new to Solr and am researching
it. What I am trying to do is crawl some websites with Nutch and then
index them with Solr (Nutch 2.1, Solr/SolrCloud 4.2).

I wonder about something. I have a cloud of machines that crawls websites
and stores those documents. Then I send those documents to SolrCloud. Solr
indexes them, generates the indexes, and saves them. I know from
Information Retrieval theory that it *may* not be efficient to store
indexes in a NoSQL database (they are something like linked lists, and if
you store them in that kind of database you *may* end up with a sparse
representation. By the way, there may be solutions for this; if you can
explain them, you are welcome to.)

However, Solr stores some documents too (e.g. for highlighting), so some
of my documents will be duplicated. Considering that I will have many
documents, those duplicated documents may cause a problem for me. So is
there any way to avoid storing those documents in Solr and point to them
in HBase (where I save my crawled documents), or, instead of pointing, to
store them directly in HBase (is that efficient or not)?
