On Sat, Jul 3, 2010 at 14:14, Andrzej Bialecki <[email protected]> wrote:
> On 2010-07-03 12:41, Doğacan Güney wrote:
>
>>> Some fields of course are not useful for display, but are used for
>>> searching only (e.g. anchors). These should be indexed but not stored in
>>> Solr. And it's ok to get them from non-solr storage if requested, because
>>> it's a rare event. The same goes for the full raw content, if you want to
>>> offer a "cached" view - this should not be stored in Solr but instead it
>>> should come from a separate layer (note that sometimes cached view might
>>> not be in the original format - pdf, office, etc - and instead an html
>>> representation may be more suitable, so in general the cached view
>>> shouldn't automatically equal the original raw content).
>>
>> I am also talking about fields like digest. For the most part, I think we
>> can get rid of all indexed=false fields.
>
> Yes, digest doesn't have to be stored.
>
>> Are you sure about this part: " And it's ok to get them from non-solr
>> storage if requested, because it's a rare event." Your assumption seems to
>> be that random reads in solr will be faster (I am talking about reading
>> stored fields) than random reads in, say, hbase. The reason why I started
>
> I'm pretty sure that's the case - are you aware of any benchmarks that
> would prove otherwise?
>
>> this discussion was that I think random reads in hbase can actually end up
>> being faster. Though you are right that this would be a premature
>> optimization at this point, I think it may be worthwhile to look into it
>> at some time in future.
>
> Certainly - but at this point if you insist on keeping every non-indexed
> bit in external storage it will complicate and slow down the most common use
> case, which is just a plain search.

I am not really insisting on anything. I guess I am wrong but I thought we
do not really display any non-indexed field for plain search (it is really
just URL, title and text, no?)

>>> But for other fields I would argue that for now they should remain stored
>>> in Solr, *even the full text*, until we figure out how they affect the
>>> ability and performance of common search operations. E.g. if we remove
>>> the stored "title" field then we need to reach to the storage layer in
>>> order to display each page of results... not to mention issues like
>>> highlighting, faceting, function queries and a host of other
>>> functionalities that Solr can offer just because a field is stored in its
>>> index.
>>>
>>> So I'm -0 to this proposal - of course we should review our schema, and
>>> of course we should have a mechanism to get data from the storage layer,
>>> but what you propose is IMHO a premature optimization at this point.
>>
>> You obviously make good points. Am I correct in assuming that you agree
>> that our current schema needs change? If we want to make use of solr's
>> awesome
>
> Yes, it needs to change, but not as much as you propose. :)
>
>> features like faceting, then it makes sense that everything (I mean,
>> everything that is returned in a typical search query) is stored in solr.
>> But currently, title is stored in solr while content is not. Thus, we have
>
> That's why I wrote that we should store the content in Solr as well. We
> should store as much (and not more) data as we need to present a typical
> page of search results.
>
>> to hit the storage anyway. My proposition was that we remove all storage
>> from Solr, but keeping everything in Solr also makes sense if it is
>> actually everything.
>
> Not everything, just enough to present a typical page of results without
> hitting external storage. For other use cases (cached view, anchors, etc)
> it's ok to use external storage because such use cases are relatively
> infrequent.

I already clarified my "everything" a couple lines above ("everything (I
mean, everything that is returned in a typical search query)") :)

>> But, IMHO, our hybrid approach may need to change.
>
> FWIW, there are some discussions about implementing a hybrid storage
> directly in Solr (using column stores for stored fields), but that's
> something that will be completely transparent to us, so I think it doesn't
> bear on this discussion (especially since it's still a vaporware at this
> point).

Anyway, I am, for now, convinced that storing content is the better way to
go (compared to my proposal of removing all). I will be dropping this for
now. If, in the future, random reads in hbase are faster than solr, I'll
bring it up again.

> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com

--
Doğacan Güney

