On 2010-07-03 12:41, Doğacan Güney wrote:

Some fields of course are not useful for display, but are used for
searching only (e.g. anchors). These should be indexed but not stored in
Solr. And it's ok to get them from non-solr storage if requested, because
it's a rare event. The same goes for the full raw content, if you want to
offer a "cached" view - this should not be stored in Solr but instead it
should come from a separate layer (note that sometimes the cached view might not
be in the original format - pdf, office, etc. - and an html
representation may be more suitable instead, so in general the cached view
shouldn't automatically equal the original raw content).


I am also talking about fields like digest. For the most part, I think we
can get rid of all indexed=false fields.

Yes, digest doesn't have to be stored.


Are you sure about this part: " And it's ok to get them from non-solr
storage if requested, because it's a rare event." Your assumption seems to
be that random reads in solr will be faster (I am talking about reading
stored fields) than random reads in, say, hbase. The reason why I started

I'm pretty sure that's the case - are you aware of any benchmarks that would prove otherwise?

this discussion was that I think random reads in hbase can actually end up
being faster. Though you are right that this would be a premature
optimization at this point, I think it may be worthwhile to look into it at
some point in the future.

Certainly - but at this point if you insist on keeping every non-indexed bit in external storage it will complicate and slow down the most common use case, which is just a plain search.

But for other fields I would argue that for now they should remain stored
in Solr, *even the full text*, until we figure out how they affect the
ability and performance of common search operations. E.g. if we remove the
stored "title" field then we need to reach into the storage layer in order to
display each page of results... not to mention issues like highlighting,
faceting, function queries and a host of other functionalities that Solr can
offer just because a field is stored in its index.

So I'm -0 to this proposal - of course we should review our schema, and of
course we should have a mechanism to get data from the storage layer, but
what you propose is IMHO a premature optimization at this point.


You obviously make good points. Am I correct in assuming that you agree that
our current schema needs to change? If we want to make use of solr's awesome

Yes, it needs to change, but not as much as you propose. :)

features like faceting, then it makes sense that everything (I mean,
everything that is returned in a typical search query) is stored in solr.
But currently, title is stored in solr while content is not. Thus, we have

That's why I wrote that we should store the content in Solr as well. We should store as much (and not more) data as we need to present a typical page of search results.

to hit the storage anyway. My proposition was that we remove all storage
from Solr, but keeping everything in Solr also makes sense if it is actually
everything.

Not everything, just enough to present a typical page of results without hitting external storage. For other use cases (cached view, anchors, etc) it's ok to use external storage because such use cases are relatively infrequent.
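
To make the split concrete, here is a rough sketch in Python. Everything in it is a hypothetical stand-in (the `SearchIndex` and `ExternalStore` classes, the field names, the substring "query"); none of it is a Solr or HBase API. It only illustrates the idea that a results page is built entirely from fields stored in the index, while the cached view is the rare path that reads from external storage.

```python
from dataclasses import dataclass

# Hypothetical stand-ins only: no real Solr or HBase client calls are used.


@dataclass
class ExternalStore:
    """Stands in for the storage layer (e.g. HBase) and counts reads."""
    rows: dict
    reads: int = 0

    def get(self, doc_id, column):
        self.reads += 1
        return self.rows[doc_id][column]


@dataclass
class SearchIndex:
    """Stands in for Solr: keeps only the stored fields a results page needs."""
    stored: dict

    def search(self, term):
        # Naive substring match standing in for a real query.
        return [doc_id for doc_id, fields in self.stored.items()
                if term in fields["content"]]

    def result_row(self, doc_id):
        fields = self.stored[doc_id]
        return {"id": doc_id,
                "title": fields["title"],
                "snippet": fields["content"][:40]}


def results_page(index, term):
    """Common case: built from the index alone, no external reads."""
    return [index.result_row(doc_id) for doc_id in index.search(term)]


def cached_view(store, doc_id):
    """Infrequent case: fetch the html representation from external storage."""
    return store.get(doc_id, "html_view")


# Illustrative data (made up for this sketch).
store = ExternalStore(rows={
    "doc1": {"html_view": "<html>Nutch and Solr</html>",
             "anchors": ["nutch home"]},
})
index = SearchIndex(stored={
    "doc1": {"title": "Nutch",
             "content": "Nutch crawls pages and indexes them in Solr"},
})

page = results_page(index, "Solr")
reads_after_search = store.reads   # external reads spent on the results page
view = cached_view(store, "doc1")
print(page, reads_after_search)
print(view, store.reads)
```

In this sketch the results page costs no reads against the external store, and only the cached view touches it, which is the trade-off argued for above.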


But, IMHO, our hybrid approach may need to change.

FWIW, there are some discussions about implementing hybrid storage directly in Solr (using column stores for stored fields), but that would be completely transparent to us, so I think it doesn't bear on this discussion (especially since it's still vaporware at this point).

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
