On 9-Oct-07, at 12:36 PM, David Whalen wrote:
<field name="id" type="string" indexed="true" stored="true" />
<field name="content_date" type="date" indexed="true" stored="true" />
<field name="media_type" type="string" indexed="true" stored="true" />
<field name="location" type="string" indexed="true" stored="true" />
<field name="country_code" type="string" indexed="true" stored="true" />
<field name="text" type="text" indexed="true" stored="true" multiValued="true" />
<field name="content_source" type="string" indexed="true" stored="true" />
<field name="title" type="string" indexed="true" stored="true" />
<field name="site_id" type="string" indexed="true" stored="true" />
<field name="journalist_id" type="string" indexed="true" stored="true" />
<field name="blog_url" type="string" indexed="true" stored="true" />
<field name="created_date" type="date" indexed="true" stored="true" />
I'm sure we could stop storing many of these columns, especially
if someone told me that would make a big difference.
I don't think that it would make a difference in memory consumption,
but storage is certainly not necessary for faceting. Extra stored
fields can slow down search if they are large (in terms of bytes),
but don't really occupy extra memory, unless they are polluting the
doc cache. Does 'text' need to be stored?
What does the LukeRequestHandler tell you about the number of
distinct terms in each field that you facet on?
Where would I find that? I could probably estimate that myself
on a per-column basis. It ranges from 4 distinct values for
media_type to 30-ish for location to 200-ish for country_code
to almost 10,000 for site_id to almost 100,000 for journalist_id.
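On where to find it: a minimal sketch, assuming the LukeRequestHandler is registered at /admin/luke (as in the example solrconfig.xml) and that Solr is reachable at http://localhost:8983/solr. Both the path and the base URL are assumptions about your install, and `luke_url` is a made-up helper name. It only builds the request URLs for the faceted fields; the per-field output should include a distinct-term count, but check what your version actually reports.

```python
from urllib.parse import urlencode

# Fields faceted on, taken from the schema above.
FACET_FIELDS = ["media_type", "location", "country_code", "site_id", "journalist_id"]


def luke_url(field, base="http://localhost:8983/solr"):
    """Build a LukeRequestHandler URL for one field.

    `base` and the /admin/luke path are assumptions; adjust to your setup.
    `fl` restricts the report to the named field.
    """
    return f"{base}/admin/luke?{urlencode({'fl': field})}"


if __name__ == "__main__":
    for name in FACET_FIELDS:
        print(luke_url(name))
```

Paste any of the printed URLs into a browser and look at the entry for that field.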
Use the filter cache method for fields like media_type and location.
Each cached filter will occupy ~2.3MB of memory _per unique value_, so
it should be a net win for those (although the space requirements are
quite close for a 30-ary field on your index size).
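A rough back-of-the-envelope sketch of that comparison. The ~2.3MB figure is from above; the rest is assumption for illustration: that each cached filter is a bitset of one bit per document, and that the alternative keeps about 4 bytes per document for the field. The function names are made up.

```python
MB = 1024 * 1024

# ~2.3MB per unique value, as quoted above.
BITSET_BYTES = int(2.3 * MB)

# Assumption: one bit per document, so the bitset size implies the doc count.
IMPLIED_MAX_DOC = BITSET_BYTES * 8


def filter_cache_bytes(unique_values, bitset_bytes=BITSET_BYTES):
    """Memory for one cached bitset per unique value of the field."""
    return unique_values * bitset_bytes


def per_doc_array_bytes(max_doc=IMPLIED_MAX_DOC, bytes_per_doc=4):
    """Memory for a per-document array (assumed 4 bytes per document)."""
    return max_doc * bytes_per_doc


if __name__ == "__main__":
    fields = [("media_type", 4), ("location", 30), ("country_code", 200),
              ("site_id", 10_000), ("journalist_id", 100_000)]
    baseline = per_doc_array_bytes()
    for name, n in fields:
        cost = filter_cache_bytes(n)
        verdict = "filter cache smaller" if cost < baseline else "filter cache larger"
        print(f"{name}: {cost / MB:.1f} MB vs {baseline / MB:.1f} MB ({verdict})")
```

Under those assumptions the two approaches break even at 32 unique values, which is why a 30-value field is a close call while 4 values is an easy win and 10,000 is not.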
-Mike