Solr would not be storing the original source form of the documents in any
case. Whether you use Tika or SolrCell, only the text stream of the content
and the metadata would ever get indexed or stored in Solr.
Solr completely decouples "indexing" and "storing" of data values. If you
don't want to "store" the text stream in Solr, then don't.
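As a rough illustration of that decoupling, here is a small Python sketch that generates schema.xml-style field definitions. The field names (`id`, `content`, `title`) and the `string`/`text_general` types are assumptions made up for this example, not anything taken from your setup. The only point is that `indexed` and `stored` are set independently for each field.

```python
import xml.etree.ElementTree as ET


def field_definition(name, field_type, indexed, stored):
    """Build one schema.xml-style <field/> element.

    `indexed` and `stored` are independent switches: a field can be
    searchable without Solr keeping a retrievable copy of its value.
    """
    return ET.Element("field", {
        "name": name,
        "type": field_type,
        "indexed": "true" if indexed else "false",
        "stored": "true" if stored else "false",
    })


# Hypothetical fields for a crawled page (names are assumptions).
fields = [
    field_definition("id", "string", indexed=True, stored=True),
    # Body text is searchable, but Solr keeps no stored copy of it.
    field_definition("content", "text_general", indexed=True, stored=False),
    field_definition("title", "text_general", indexed=True, stored=True),
]

for f in fields:
    print(ET.tostring(f, encoding="unicode"))
```

One thing to keep in mind with this kind of setup: Solr's highlighting works from stored values, so a field declared with stored="false" cannot be highlighted by Solr itself.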
If you want to "store" the original blob of the source documents in some
other data store, that's your choice. In Solr you can store the original URL,
or a document ID or URL that points into the alternate document store. Solr in
no way forces you one way or the other, and whether that URL or document ID
refers to HBase or a web site doesn't matter to Solr either.
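A minimal sketch of that pointer approach, in Python, might look like the following. Everything here is hypothetical: the field names, the row key format, and the functions `make_solr_doc` and `fetch_originals` are invented for illustration, and a plain dict stands in for HBase (or whatever store you pick). No real Solr or HBase client calls are shown.

```python
def make_solr_doc(row_key, url, text):
    """Build the document sent to Solr: text to index plus pointers
    back to the original, which lives in the external store."""
    return {
        "id": row_key,    # reuse the external store's key as the Solr id
        "url": url,       # pointer to the original web page
        "content": text,  # extracted text, to be indexed by Solr
    }


def fetch_originals(hits, blob_store):
    """Resolve search hits to the original bytes held outside Solr."""
    return {
        hit["id"]: blob_store[hit["id"]]
        for hit in hits
        if hit["id"] in blob_store
    }


# A dict stands in for HBase here (assumption for the example).
blob_store = {"com.example/page1": b"<html>original page bytes</html>"}

doc = make_solr_doc(
    "com.example/page1", "http://example.com/page1", "original page bytes"
)

# Pretend Solr returned only the stored pointer fields for a match.
hits = [{"id": doc["id"], "url": doc["url"]}]
originals = fetch_originals(hits, blob_store)
```

The search result carries only the ID and URL; the application does the second lookup against the other store.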
Whether you could store the original document bytes more efficiently
in Lucene/Solr DocValues than in HBase is a separate matter. I don't know one
way or the other whether DocValues would help, or whether a Solr
BinaryField might be suitable for storing the original bytes of a document
(without indexing the bytes).
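If you did want to experiment with the BinaryField idea, a sketch of preparing such a document in Python could look like this. It assumes the binary field takes its value as a Base64 string in update requests, which is my understanding but worth checking against the documentation for your version. The field name `raw_bytes` and the function `make_binary_doc` are made up for the example, and nothing is actually sent to Solr.

```python
import base64
import json


def make_binary_doc(doc_id, raw_bytes, field_name="raw_bytes"):
    """Wrap original document bytes as a Base64 string so they can
    travel in a JSON update body (field name is hypothetical)."""
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return {"id": doc_id, field_name: encoded}


raw = b"%PDF-1.4 example bytes"
doc = make_binary_doc("doc-1", raw)

# JSON body that an update request could carry (not sent anywhere here).
payload = json.dumps([doc])

# Round trip: decoding the field value gives back the original bytes.
decoded = base64.b64decode(doc["raw_bytes"])
```

Note that Base64 inflates the payload by roughly a third on the wire, which is one more thing to weigh against keeping the bytes in HBase.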
In other words, maybe you could just use two separate Solr servers, one for
the text index and metadata store, and the other as a raw store for the
original document bytes.
-- Jack Krupansky
-----Original Message-----
From: Furkan KAMACI
Sent: Saturday, April 06, 2013 6:01 PM
To: solr-user@lucene.apache.org
Subject: Pointing to Hbase for Docuements or Directly Saving Documents at
Hbase
Hi;
First of all, I should mention that I am new to Solr and am researching
it. What I am trying to do is crawl some websites with Nutch
and then index them with Solr (Nutch 2.1, Solr-SolrCloud 4.2).
I wonder about something. I have a cloud of machines that crawls websites
and stores the documents. Then I send those documents to SolrCloud. Solr
indexes those documents, generates the indexes, and saves them. I know
from Information Retrieval theory that it *may* not be efficient to store
indexes in a NoSQL database (they are something like linked lists, and if
you store them in that kind of database you *may* end up with a sparse
representation. By the way, there may be some solutions for this; if you
can explain them, you are welcome to).
However, Solr stores some documents too (e.g. for highlights), so some of my
documents will be duplicated. Given that I will have many
documents, those duplicated documents may cause a problem for me. So is there
any way to avoid storing the documents in Solr and point to them in
HBase (where I save my crawled documents) instead, or, rather than pointing,
to store them directly in HBase (is that efficient or not)?