Hi,

On Sat, Jul 3, 2010 at 13:12, Andrzej Bialecki <[email protected]> wrote:
> On 2010-07-03 10:00, Doğacan Güney wrote:
>
>> Hey everyone,
>>
>> This is not really a proposition but rather something I have been
>> wondering for a while so I wanted to see what everyone is thinking.
>>
>> Currently in our solr backend, we have "stored=true indexed=false"
>> fields and "stored=true indexed=true" fields. The former class of
>> fields are mostly used for storing digest, caching information etc.
>> I suggest that we get rid of all "indexed=false" fields and read all
>> such data from storage backend.
>>
>> For the latter class of fields (i.e., stored=true indexed=true), I
>> suggest that we set them to stored=false for everything but "id" field.
>> As an example currently title is stored/indexed in solr while text is
>> only indexed (thus, will need to be fetched from storage backend). But
>> for hbase backend, title and text are already stored close together (in
>> the same column family) so performance hit of reading just text or
>> reading both will likely be same. And removing storage from solr may
>> lead to better caching of indexed fields and may lead to better example.
>>
>> What does everyone think?
>
> The issue is not as simple as it looks. If you want to have a good
> performance for searching & snippet generation then you still need to
> store some data in stored fields - at least url, title, and plain text
> (not to mention the option to use term vectors in order to speed up the
> snippet generation). Solr functionality can be also impaired by a lack
> of data available directly from Lucene storage (field cache, faceting,
> term vector highlighting).
>
> Some fields of course are not useful for display, but are used for
> searching only (e.g. anchors). These should be indexed but not stored in
> Solr. And it's ok to get them from non-solr storage if requested,
> because it's a rare event.
> The same goes for the full raw content, if you want to offer a "cached"
> view - this should not be stored in Solr but instead it should come from
> a separate layer (note that sometimes cached view might not be in the
> original format - pdf, office, etc - and instead an html representation
> may be more suitable, so in general the cached view shouldn't
> automatically equal the original raw content).

I am also talking about fields like digest. For the most part, I think we
can get rid of all indexed=false fields.

Are you sure about this part:

"And it's ok to get them from non-solr storage if requested, because it's
a rare event."

Your assumption seems to be that random reads in solr (I am talking about
reading stored fields) will be faster than random reads in, say, hbase.
The reason I started this discussion is that I think random reads in
hbase can actually end up being faster. Though you are right that this
would be a premature optimization at this point, I think it may be
worthwhile to look into it at some point in the future.

> But for other fields I would argue that for now they should remain
> stored in Solr, *even the full text*, until we figure out how they
> affect the ability and performance of common search operations. E.g. if
> we remove the stored "title" field then we need to reach to the storage
> layer in order to display each page of results... not to mention issues
> like highlighting, faceting, function queries and a host of other
> functionalities that Solr can offer just because a field is stored in
> its index.
>
> So I'm -0 to this proposal - of course we should review our schema, and
> of course we should have a mechanism to get data from the storage layer,
> but what you propose is IMHO a premature optimization at this point.

You obviously make good points. Am I correct in assuming that you agree
that our current schema needs to change?
If we want to make use of solr's awesome features like faceting, then it
makes sense that everything (I mean, everything that is returned in a
typical search query) is stored in solr. But currently, title is stored
in solr while content is not. Thus, we have to hit the storage anyway. My
proposition was that we remove all storage from Solr, but keeping
everything in Solr also makes sense if it is actually everything. But,
IMHO, our hybrid approach may need to change.

> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com

--
Doğacan Güney
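
To make the "we have to hit the storage anyway" argument concrete, here is
a rough sketch that counts storage-backend reads for one page of results
under three schema policies. Everything in it is an assumption made for
illustration: the set of display fields, the policy names, and the
simplification that one row read per hit returns every missing field
(as when title and text share a column family). It models the argument
only and does not call Solr or hbase.

```python
# Fields a results page is assumed to display for each hit (illustrative).
DISPLAY_FIELDS = {"id", "title", "content"}

# Hypothetical schema policies: the fields stored in Solr under each one.
SCHEMAS = {
    "hybrid": {"id", "title"},               # title stored, content indexed only
    "id_only": {"id"},                       # store nothing in Solr except id
    "all_stored": {"id", "title", "content"},
}


def fields_from_backend(stored_in_solr):
    """Return the display fields that Solr cannot supply by itself."""
    return DISPLAY_FIELDS - set(stored_in_solr)


def backend_reads_per_page(stored_in_solr, hits_per_page=10):
    """Count backend row reads for one results page.

    Assumes a single row read per hit fetches all missing fields at once,
    so the count depends only on whether anything is missing.
    """
    return hits_per_page if fields_from_backend(stored_in_solr) else 0


if __name__ == "__main__":
    for name, stored in SCHEMAS.items():
        missing = sorted(fields_from_backend(stored))
        print(name, backend_reads_per_page(stored), missing)
```

Under these assumptions the hybrid schema and the id-only schema cost the
same number of backend reads per page, and only the everything-stored
schema avoids them. Whether a backend read is actually cheaper or more
expensive than reading a stored field from Solr is the open question in
this thread, and the sketch says nothing about it.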

