Solr would not store the original source form of the documents in any case. Whether you use Tika directly or SolrCell, only the extracted text stream of the content and the metadata would ever get indexed or stored in Solr.

Solr completely decouples "indexing" and "storing" of data values. If you don't want to "store" the text stream in Solr, then don't.
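
To make the indexed/stored distinction concrete, here is a toy sketch in Python. It is not Solr's implementation and uses no Solr API; the field names and the ToyIndex class are hypothetical, chosen only to show that a field can be searchable without being retrievable, and vice versa.

```python
# Toy model only: this is NOT how Solr is implemented. It just illustrates
# "indexed" and "stored" as independent per-field switches.
from collections import defaultdict

# Hypothetical field configuration (names are assumptions for illustration).
FIELDS = {
    "id":   {"indexed": True,  "stored": True},
    "text": {"indexed": True,  "stored": False},  # searchable, not retrievable
    "url":  {"indexed": False, "stored": True},   # retrievable, not searchable
}


class ToyIndex:
    def __init__(self, fields):
        self.fields = fields
        self.postings = defaultdict(set)  # (field, term) -> set of doc ids
        self.stored = {}                  # doc id -> {field: stored value}

    def add(self, doc):
        doc_id = doc["id"]
        self.stored[doc_id] = {}
        for name, value in doc.items():
            cfg = self.fields[name]
            if cfg["indexed"]:
                # Naive whitespace tokenization, lowercased.
                for term in str(value).lower().split():
                    self.postings[(name, term)].add(doc_id)
            if cfg["stored"]:
                self.stored[doc_id][name] = value

    def search(self, field, term):
        ids = sorted(self.postings.get((field, term.lower()), set()))
        # Only stored fields can be returned with a hit.
        return [self.stored[i] for i in ids]


index = ToyIndex(FIELDS)
index.add({
    "id": "doc1",
    "text": "Solr indexes crawled pages",
    "url": "http://example.com/a",
})
hits = index.search("text", "crawled")
```

Here the search on "text" matches doc1, but the returned hit carries only the stored fields (id and url); the text itself was indexed and then discarded.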

If you want to "store" the original blob of the source documents in some other data store, that's your choice. In Solr you can store the original URL, or a document ID or URL that points into the alternate document store. Solr in no way forces you one way or the other, and whether that URL or document ID refers to HBase or a web site doesn't matter to Solr either.
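
As a rough sketch of the pointer approach, assuming an HBase row key is used as the document ID: the helper names and the field names (hbase_row_key, url, text) are hypothetical and would have to match whatever your schema defines. The code only builds a JSON array of documents, which is the general shape Solr's JSON update handler accepts; it does not talk to Solr or HBase.

```python
import json


def build_solr_doc(row_key, extracted_text, source_url):
    """Build a Solr document that points back to the original in HBase.

    Hypothetical helper: the field names are assumptions, not a default
    Solr schema. "text" is meant to be indexed but not stored, so the
    original content lives only in the external store.
    """
    return {
        "id": row_key,
        "hbase_row_key": row_key,  # pointer to the original document bytes
        "url": source_url,         # original crawl URL
        "text": extracted_text,    # text stream to be indexed
    }


def build_update_payload(docs):
    """Serialize a list of documents as a JSON array for an update request."""
    return json.dumps(docs)


doc = build_solr_doc(
    "com.example/a",
    "Text extracted from the crawled page",
    "http://example.com/a",
)
payload = build_update_payload([doc])
```

At query time the application would read hbase_row_key (or url) from the search hit and fetch the original bytes from the other store itself; Solr never needs to know what the pointer refers to.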

Whether you could store the original document bytes more efficiently in Lucene/Solr DocValues than in HBase is a separate matter; I don't know one way or the other whether DocValues would help. Nor do I know whether a Solr BinaryField might be suitable for storing the original bytes of a document (without indexing the bytes).

In other words, maybe you could just use two separate Solr servers: one for the text index and metadata store, and the other for raw storage of the original document bytes.

-- Jack Krupansky

-----Original Message-----
From: Furkan KAMACI
Sent: Saturday, April 06, 2013 6:01 PM
To: solr-user@lucene.apache.org
Subject: Pointing to Hbase for Docuements or Directly Saving Documents at Hbase

Hi;

First of all, I should mention that I am new to Solr and am researching
it. What I am trying to do is crawl some websites with Nutch and then
index them with Solr. (Nutch 2.1, Solr-SolrCloud 4.2)

I wonder about something. I have a cloud of machines that crawls websites
and stores those documents. Then I send those documents to SolrCloud. Solr
indexes the documents, generates the indexes, and saves them. I know from
Information Retrieval theory that it *may* not be efficient to store
indexes in a NoSQL database (they are something like linked lists, and if
you store them in that kind of database you *may* end up with a sparse
representation. By the way, there may be some solutions for this; if you
can explain them, you are welcome to.)

However, Solr stores some document content too (e.g. for highlighting), so
some of my documents will effectively be duplicated. Since I will have many
documents, those duplicated documents may cause a problem for me. So is
there any way to avoid storing the documents in Solr and instead point to
them in HBase (where I save my crawled documents), or, instead of pointing,
to store them directly in HBase (is that efficient or not)?
