Re: Minimizing the number of stored fields for Solr

Doğacan Güney Sat, 03 Jul 2010 04:31:22 -0700

On Sat, Jul 3, 2010 at 14:14, Andrzej Bialecki <[email protected]> wrote:


> On 2010-07-03 12:41, Doğacan Güney wrote:
>
>>> Some fields of course are not useful for display, but are used for
>>> searching only (e.g. anchors). These should be indexed but not stored in
>>> Solr. And it's ok to get them from non-solr storage if requested, because
>>> it's a rare event. The same goes for the full raw content, if you want to
>>> offer a "cached" view - this should not be stored in Solr but instead it
>>> should come from a separate layer (note that sometimes cached view might
>>> not be in the original format - pdf, office, etc - and instead an html
>>> representation may be more suitable, so in general the cached view
>>> shouldn't automatically equal the original raw content).
>>>
>>>
>> I am also talking about fields like digest. For the most part, I think we
>> can get rid of all indexed=false fields.
>>
>
> Yes, digest doesn't have to be stored.
>
>
>
>> Are you sure about this part: " And it's ok to get them from non-solr
>> storage if requested, because it's a rare event." Your assumption seems to
>> be that random reads in solr will be faster (I am talking about reading
>> stored fields) than random reads in, say, hbase. The reason why I started
>>
>
> I'm pretty sure that's the case - are you aware of any benchmarks that
> would prove otherwise?
>
>
>> this discussion was that I think random reads in hbase can actually end up
>> being faster. Though you are right that this would be a premature
>> optimization at this point, I think it may be worthwhile to look into it at
>> some time in future.
>>
>
> Certainly - but at this point if you insist on keeping every non-indexed
> bit in external storage it will complicate and slow down the most common use
> case, which is just a plain search.
>
>
I am not really insisting on anything. I guess I am wrong but I thought we
do not really display any non-indexed field for plain search (it is really
just URL, title and text, no?)


>
>>> But for other fields I would argue that for now they should remain stored
>>> in Solr, *even the full text*, until we figure out how they affect the
>>> ability and performance of common search operations. E.g. if we remove the
>>> stored "title" field then we need to reach to the storage layer in order to
>>> display each page of results... not to mention issues like highlighting,
>>> faceting, function queries and a host of other functionalities that Solr
>>> can offer just because a field is stored in its index.
>>>
>>> So I'm -0 to this proposal - of course we should review our schema, and of
>>> course we should have a mechanism to get data from the storage layer, but
>>> what you propose is IMHO a premature optimization at this point.
>>>
>>>
>> You obviously make good points. Am I correct in assuming that you agree that
>> our current schema needs change? If we want to make use of solr's awesome
>>
>
> Yes, it needs to change, but not as much as you propose. :)
>
>
>> features like faceting, then it makes sense that everything (I mean,
>> everything that is returned in a typical search query) is stored in solr.
>> But currently, title is stored in solr while content is not. Thus, we have
>>
>
> That's why I wrote that we should store the content in Solr as well. We
> should store as much (and not more) data as we need to present a typical
> page of search results.
>
>
>> to hit the storage anyway. My proposition was that we remove all storage
>> from Solr, but keeping everything in Solr also makes sense if it is
>> actually everything.
>>
>
> Not everything, just enough to present a typical page of results without
> hitting external storage. For other use cases (cached view, anchors, etc)
> it's ok to use external storage because such use cases are relatively
> infrequent.
>
>
I already clarified my "everything" a couple lines above ("everything (I mean,
everything that is returned in a typical search query)") :)


>
>
>> But, IMHO, our hybrid approach may need to change.
>>
>
> FWIW, there are some discussions about implementing a hybrid storage
> directly in Solr (using column stores for stored fields), but that's
> something that will be completely transparent to us, so I think it doesn't
> bear on this discussion (especially since it's still a vaporware at this
> point).
>
>
Anyway, I am, for now, convinced that storing content is the better way to
go (compared to my proposal of removing all). I will be dropping this for
now. If, in the future, random reads in hbase are faster than solr, I'll
bring it up again.
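
For what it's worth, the split we ended up agreeing on could be sketched
roughly as below. This is only an illustration in Python with in-memory
stand-ins, not Nutch, Solr or HBase code: SearchIndex, ExternalStore,
results_page and cached_view are hypothetical names, and the choice of url,
title and content as the stored fields is taken from the discussion above.

```python
from dataclasses import dataclass, field

# Hypothetical field layout: what a results page shows stays stored in the
# search index; rarely requested data lives only in external storage.
STORED_IN_INDEX = ("url", "title", "content")


@dataclass
class ExternalStore:
    """In-memory stand-in for a storage layer such as HBase (assumption)."""
    rows: dict = field(default_factory=dict)
    reads: int = 0

    def get(self, url, column):
        self.reads += 1  # count random reads against external storage
        return self.rows[url][column]


@dataclass
class SearchIndex:
    """In-memory stand-in for Solr; keeps only the stored display fields."""
    docs: dict = field(default_factory=dict)

    def add(self, doc):
        # Keep only the display fields. A real index would also index
        # (without storing) search-only fields such as anchors.
        self.docs[doc["url"]] = {k: doc[k] for k in STORED_IN_INDEX}

    def search(self, term):
        # Naive substring match, only to have something to render.
        return [d for d in self.docs.values() if term in d["content"]]


def results_page(index, term):
    """Render a page of results from stored fields only."""
    return [(d["url"], d["title"], d["content"][:40]) for d in index.search(term)]


def cached_view(store, url):
    """Rare use case: fetch the raw content from external storage."""
    return store.get(url, "raw")


page = {
    "url": "http://example.com/",
    "title": "Example",
    "content": "stored fields in solr",
    "raw": "<html>stored fields in solr</html>",
    "anchors": ["example link"],
}
store = ExternalStore(rows={page["url"]: page})
index = SearchIndex()
index.add(page)

hits = results_page(index, "solr")
print(hits, store.reads)  # the results page needs no external read
print(cached_view(store, page["url"]), store.reads)
```

The read counter is only there to make the point visible: a plain search
never touches the external store, and the cached view costs one read.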


>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>


-- 
Doğacan Güney
