On 9-Oct-07, at 12:36 PM, David Whalen wrote:

<field name="id" type="string" indexed="true" stored="true" />
<field name="content_date" type="date" indexed="true" stored="true" />
<field name="media_type" type="string" indexed="true" stored="true" />
<field name="location" type="string" indexed="true" stored="true" />
<field name="country_code" type="string" indexed="true" stored="true" />
<field name="text" type="text" indexed="true" stored="true" multiValued="true" />
<field name="content_source" type="string" indexed="true" stored="true" />
<field name="title" type="string" indexed="true" stored="true" />
<field name="site_id" type="string" indexed="true" stored="true" />
<field name="journalist_id" type="string" indexed="true" stored="true" />
<field name="blog_url" type="string" indexed="true" stored="true" />
<field name="created_date" type="date" indexed="true" stored="true" />

I'm sure we could stop storing many of these columns, especially
if someone told me that would make a big difference.

I don't think that it would make a difference in memory consumption, but storage is certainly not necessary for faceting. Extra stored fields can slow down search if they are large (in terms of bytes), but don't really occupy extra memory, unless they are polluting the doc cache. Does 'text' need to be stored?

What does the LukeRequestHandler tell you about the number of
distinct terms in each field that you facet on?

Where would I find that? I could probably estimate that myself
on a per-column basis. It ranges from 4 distinct values for
media_type to 30-ish for location to 200-ish for country_code
to almost 10,000 for site_id to almost 100,000 for journalist_id.
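
On where to find the distinct-term counts: a minimal sketch, assuming the LukeRequestHandler is registered at /admin/luke (as in the example solrconfig.xml) and that the JSON response reports a "distinct" count under each field. The base URL, the helper names and the response layout used here are assumptions for illustration, not something confirmed in this thread.

```python
from urllib.parse import urlencode


def luke_url(base_url, field):
    """Build a LukeRequestHandler URL restricted to one field.

    Assumes the handler lives at <base_url>/admin/luke and accepts
    the 'fl' (field list) and 'wt' (response writer) parameters.
    """
    params = {"fl": field, "wt": "json"}
    return f"{base_url}/admin/luke?{urlencode(params)}"


def distinct_terms(luke_response, field):
    """Pull the distinct-term count for a field out of a parsed response.

    Assumes the parsed JSON looks like
    {"fields": {"<field>": {"distinct": <int>, ...}}}.
    """
    return luke_response["fields"][field]["distinct"]


if __name__ == "__main__":
    # Hypothetical host and port; fetch this URL and json-decode the body.
    print(luke_url("http://localhost:8983/solr", "site_id"))
```

Running this once per faceted field (media_type, location, country_code, site_id, journalist_id) would replace the estimates above with the counts the index actually holds.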

Use the filter cache method for low-cardinality fields like media_type and location. It will occupy ~2.3MB of memory _per unique value_, so it should be a net win for those (although the space requirements come out quite close for a 30-ary field on your index size).
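
To put numbers on that, here is a rough back-of-the-envelope sketch. The ~2.3MB figure and the distinct-value counts come from this thread. Everything else is an assumption made for illustration: that each cached filter is a bitset with one bit per document, and that the alternative it is "quite close" to is a field-cache style array costing about 4 bytes per document.

```python
MB = 2 ** 20                # bytes per megabyte (assumed binary MB)
MB_PER_VALUE = 2.3          # per-unique-value cost quoted in this thread

# Approximate distinct-value counts quoted in this thread.
DISTINCT = {
    "media_type": 4,
    "location": 30,
    "country_code": 200,
    "site_id": 10_000,
    "journalist_id": 100_000,
}


def filter_cache_mb(distinct_values, mb_per_value=MB_PER_VALUE):
    """Memory if every unique value gets its own cached filter."""
    return distinct_values * mb_per_value


def implied_doc_count(mb_per_value=MB_PER_VALUE):
    """Index size implied by the per-value cost, assuming 1 bit per doc."""
    return int(mb_per_value * MB * 8)


def field_cache_mb(doc_count, bytes_per_doc=4):
    """Memory for one array entry per doc (assumed 4 bytes each)."""
    return doc_count * bytes_per_doc / MB


if __name__ == "__main__":
    per_field_array = field_cache_mb(implied_doc_count())
    for field, n in DISTINCT.items():
        print(f"{field:14s} filters: {filter_cache_mb(n):12.1f} MB   "
              f"per-doc array: {per_field_array:.1f} MB")
```

Under those assumptions a 30-value field costs about the same either way, a 4-value field is far cheaper with filters, and the cost of filters for site_id or journalist_id runs into many gigabytes.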

-Mike
