Yeah, it's worth a try. Term vectors aren't strictly necessary for
highlighting, although they do make it more efficient.
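
(At the Lucene level the difference looks roughly like this -- just a
sketch against the current 4.x field API, so names may differ slightly
on your trunk revision, and the field/value are made up. In schema.xml
the equivalent is termVectors="false" on the field.)

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.FieldType;
    import org.apache.lucene.document.TextField;

    public class NoVectorsSketch {
      public static void main(String[] args) {
        // Stored and indexed, but no term vectors: the standard Highlighter
        // can still highlight by re-analyzing the stored text; only
        // FastVectorHighlighter strictly needs term vectors with positions
        // and offsets.
        FieldType noVectors = new FieldType(TextField.TYPE_STORED);
        noVectors.setStoreTermVectors(false);
        noVectors.setStoreTermVectorPositions(false);
        noVectors.setStoreTermVectorOffsets(false);
        noVectors.freeze();

        Document doc = new Document();
        doc.add(new Field("content", "some large body of text", noVectors));
        System.out.println(doc);
      }
    }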

As far as MLT goes, does it really need such a big field?

But if you remove this info and testing shows problems, you may be on
your way to sharding your index....

Best
Erick

On Thu, Mar 29, 2012 at 9:32 AM, Vadim Kisselmann
<v.kisselm...@googlemail.com> wrote:
> Hi Erick,
> thanks:)
> The admin UI gives me the counts, so I can identify fields with big
> numbers of unique terms.
> I already knew this wiki page, but I read it one more time.
> List of my file extensions with sizes in GB (index size ~150GB):
> tvf 90GB
> fdt 30GB
> tim 18GB
> prx 15GB
> frq 12GB
> tip 200MB
> tvx 150MB
>
> tvf is my biggest file extension.
> Wiki: This file contains, for each field that has a term vector
> stored, a list of the terms, their frequencies and, optionally,
> position and offset information.
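>
> (By the way, one quick way to get per-extension totals like the list
> above is a small directory walk over the index dir -- just a sketch,
> the path is a placeholder:)
>
>     import java.io.File;
>     import java.util.Map;
>     import java.util.TreeMap;
>
>     public class IndexSizeByExt {
>       public static void main(String[] args) {
>         File dir = new File("/path/to/solr/data/index");  // placeholder path
>         Map<String, Long> totals = new TreeMap<String, Long>();
>         for (File f : dir.listFiles()) {
>           String name = f.getName();
>           int dot = name.lastIndexOf('.');
>           String ext = (dot < 0) ? name : name.substring(dot + 1);
>           Long prev = totals.get(ext);
>           totals.put(ext, (prev == null ? 0L : prev) + f.length());
>         }
>         for (Map.Entry<String, Long> e : totals.entrySet()) {
>           // print total size per extension in MB
>           System.out.println(e.getKey() + "\t" + (e.getValue() / (1024 * 1024)) + " MB");
>         }
>       }
>     }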
>
> Hmm, I use termVectors on my biggest fields because of MLT and highlighting.
> But I think I should test my performance without termVectors. Good idea? :)
>
> What do you think about my file extension sizes?
>
> Best regards
> Vadim
>
>
>
>
> 2012/3/29 Erick Erickson <erickerick...@gmail.com>:
>> The admin UI (schema browser) will give you the counts of unique terms
>> in your fields, which is where I'd start.
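>>
>> (If you'd rather script it than click through the UI, you can also walk
>> the index with the Lucene API directly -- a rough sketch against the 4.x
>> reader API, so exact class/method names may differ on your trunk
>> revision:)
>>
>>     import java.io.File;
>>     import org.apache.lucene.index.AtomicReader;
>>     import org.apache.lucene.index.AtomicReaderContext;
>>     import org.apache.lucene.index.DirectoryReader;
>>     import org.apache.lucene.index.FieldInfo;
>>     import org.apache.lucene.index.Terms;
>>     import org.apache.lucene.store.FSDirectory;
>>
>>     public class FieldTermCounts {
>>       public static void main(String[] args) throws Exception {
>>         DirectoryReader reader =
>>             DirectoryReader.open(FSDirectory.open(new File(args[0])));
>>         try {
>>           for (AtomicReaderContext ctx : reader.leaves()) {
>>             AtomicReader leaf = ctx.reader();
>>             for (FieldInfo fi : leaf.getFieldInfos()) {
>>               Terms terms = leaf.terms(fi.name);
>>               if (terms != null) {
>>                 // size() can be -1 if the codec can't report it cheaply
>>                 System.out.println(fi.name + " (segment " + ctx.ord + "): "
>>                     + terms.size() + " unique terms");
>>               }
>>             }
>>           }
>>         } finally {
>>           reader.close();
>>         }
>>       }
>>     }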
>>
>> I suspect you've already seen this page, but if not:
>> http://lucene.apache.org/java/3_5_0/fileformats.html#file-names
>> the .fdt and .fdx file extensions are where data goes when
>> you set 'stored="true"'. These files don't affect search speed;
>> they just contain the verbatim copy of the data.
>>
>> The relative sizes of the various files above should give
>> you a hint as to what's using the most space, but it'll be a bit
>> of a hunt for you to pinpoint what's actually going on. TermVectors
>> and norms are often big consumers of space.
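>>
>> (For the norms part, the Lucene-level knob looks roughly like this --
>> again just a sketch with a made-up field name; in schema.xml the
>> equivalent is omitNorms="true" on the field or field type:)
>>
>>     import org.apache.lucene.document.Field;
>>     import org.apache.lucene.document.FieldType;
>>     import org.apache.lucene.document.TextField;
>>
>>     public class NoNormsSketch {
>>       public static void main(String[] args) {
>>         FieldType noNorms = new FieldType(TextField.TYPE_NOT_STORED);
>>         noNorms.setOmitNorms(true);  // drop the per-document norm entry for this field
>>         noNorms.freeze();
>>         Field f = new Field("big_text_field", "some text", noNorms);
>>         System.out.println(f);
>>       }
>>     }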
>>
>> Best
>> Erick
>>
>> On Wed, Mar 28, 2012 at 10:55 AM, Vadim Kisselmann
>> <v.kisselm...@googlemail.com> wrote:
>>> Hello folks,
>>>
>>> I work with Solr 4.0 r1292064 from trunk.
>>> My index grows fast; with 10 million docs I get an index size of 150GB
>>> (25% stored, 75% indexed).
>>> I want to find out which fields (content) are too large, so I can
>>> consider countermeasures.
>>>
>>> How can I localize/discover the largest fields in my index?
>>> Luke (latest from trunk) doesn't work
>>> with my Solr version. I built the Lucene/Solr .jars and tried to feed
>>> Luke with these, but I get many errors
>>> and can't build it.
>>>
>>> What other options do I have?
>>>
>>> Thanks and best regards
>>> Vadim
