Re: Minimizing the number of stored fields for Solr

Andrzej Bialecki Sat, 03 Jul 2010 04:15:54 -0700

On 2010-07-03 12:41, Doğacan Güney wrote:

Some fields of course are not useful for display, but are used for
searching only (e.g. anchors). These should be indexed but not stored in
Solr. And it's ok to get them from non-solr storage if requested, because
it's a rare event. The same goes for the full raw content, if you want to
offer a "cached" view - this should not be stored in Solr but instead it
should come from a separate layer (note that sometimes cached view might not
be in the original format - pdf, office, etc - and instead an html
representation may be more suitable, so in general the cached view shouldn't
automatically equal the original raw content).

I am also talking about fields like digest. For the most part, I think we
can get rid of all indexed=false fields.


Yes, digest doesn't have to be stored.


Are you sure about this part: "And it's ok to get them from non-solr
storage if requested, because it's a rare event." Your assumption seems to
be that random reads in solr will be faster (I am talking about reading
stored fields) than random reads in, say, hbase. The reason why I started

I'm pretty sure that's the case - are you aware of any benchmarks that
would prove otherwise?

this discussion was that I think random reads in hbase can actually end up
being faster. Though you are right that this would be a premature
optimization at this point, I think it may be worthwhile to look into it at
some time in future.

Certainly - but at this point if you insist on keeping every non-indexed
bit in external storage it will complicate and slow down the most common
use case, which is just a plain search.

But for other fields I would argue that for now they should remain stored
in Solr, *even the full text*, until we figure out how they affect the
ability and performance of common search operations. E.g. if we remove the
stored "title" field then we need to reach to the storage layer in order to
display each page of results... not to mention issues like highlighting,
faceting, function queries and a host of other functionalities that Solr can
offer just because a field is stored in its index.

So I'm -0 to this proposal - of course we should review our schema, and of
course we should have a mechanism to get data from the storage layer, but
what you propose is IMHO a premature optimization at this point.

You obviously make good points. Am I correct in assuming that you agree that
our current schema needs change? If we want to make use of solr's awesome


Yes, it needs to change, but not as much as you propose. :)

features like faceting, then it makes sense that everything (I mean,
everything that is returned in a typical search query) is stored in solr.
But currently, title is stored in solr while content is not. Thus, we have

That's why I wrote that we should store the content in Solr as well. We
should store as much (and not more) data as we need to present a typical
page of search results.

to hit the storage anyway. My proposition was that we remove all storage
from Solr, but keeping everything in Solr also makes sense if it is actually
everything.

Not everything, just enough to present a typical page of results without
hitting external storage. For other use cases (cached view, anchors,
etc) it's ok to use external storage because such use cases are
relatively infrequent.
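
To make that split concrete, here is a rough sketch in Python. It is only
an illustration of the argument, not code from Nutch or Solr: the field
names, the two sets, and the helpers split_by_source() and external_reads()
are all hypothetical, and it assumes one external read per hit whenever any
requested field is not stored in Solr.

```python
# Hypothetical field split; names are illustrative, not the actual schema.
STORED_IN_SOLR = {"title", "content"}        # enough for a typical result page
EXTERNAL_ONLY = {"anchors", "cached_view"}   # infrequent use cases


def split_by_source(requested):
    """Return (from_solr, from_external) for the requested field names."""
    from_solr = [f for f in requested if f in STORED_IN_SOLR]
    from_external = [f for f in requested if f not in STORED_IN_SOLR]
    return from_solr, from_external


def external_reads(requested, hits_per_page):
    """Estimate external-storage reads needed to render one result page.

    Assumption: a single read per hit fetches all missing fields for
    that hit, so the cost is either 0 or hits_per_page.
    """
    _, from_external = split_by_source(requested)
    return hits_per_page if from_external else 0


if __name__ == "__main__":
    # A plain search page needs only fields stored in Solr.
    print(external_reads(["title", "content"], hits_per_page=10))
    # Asking for anchors on every hit forces a trip to external storage.
    print(external_reads(["title", "anchors"], hits_per_page=10))
```

Under these assumptions a plain result page never leaves Solr, while any
field moved out of Solr costs one extra read per displayed hit.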

But, IMHO, our hybrid approach may need to change.

FWIW, there are some discussions about implementing a hybrid storage
directly in Solr (using column stores for stored fields), but that's
something that will be completely transparent to us, so I think it
doesn't bear on this discussion (especially since it's still vaporware
at this point).


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
