Yes, I think so, too :)
MLT doesn't really need termVectors, but it's faster with them. I
found out that MLT works better on the title field in my case,
instead of on big text fields.
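
For example, an MLT query on the title field looks roughly like this
(a sketch using the standard MLT component parameters; core name,
document id and the mlt.* values are placeholders):

  http://localhost:8983/solr/select?q=id:12345&mlt=true
      &mlt.fl=title&mlt.mintf=1&mlt.mindf=2&mlt.count=10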

Sharding is planned, but my setup with SolrCloud, ZK and Tomcat
doesn't work, see here:
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201203.mbox/%3CCA+GXEZE3LCTtgXFzn9uEdRxMymGF=z0ujb9s8b0qkipafn6...@mail.gmail.com%3E
I split my huge index (the 150GB index in this case is my test index)
and want to use SolrCloud, but it's not runnable with Tomcat at this
time.
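
For context, the usual SolrCloud bootstrap on Tomcat goes via system
properties in JAVA_OPTS, roughly like this (a sketch; ZK hosts, paths
and the config name are placeholders):

  # $CATALINA_HOME/bin/setenv.sh
  JAVA_OPTS="$JAVA_OPTS -DzkHost=zk1:2181,zk2:2181,zk3:2181"
  JAVA_OPTS="$JAVA_OPTS -DnumShards=2"
  JAVA_OPTS="$JAVA_OPTS -Dbootstrap_confdir=/path/to/solr/collection1/conf"
  JAVA_OPTS="$JAVA_OPTS -Dcollection.configName=myconf"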

Best regards
Vadim


2012/3/29 Erick Erickson <erickerick...@gmail.com>:
> Yeah, it's worth a try. The term vectors aren't entirely necessary for
> highlighting,
> although they do make things more efficient.
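>
> (To be concrete: with termVectors plus positions and offsets stored,
> Solr can use the FastVectorHighlighter, e.g. something like
>
>     hl=true&hl.fl=text&hl.useFastVectorHighlighter=true
>
> just a sketch, the field name is a placeholder. Without term vectors
> the regular highlighter falls back to re-analyzing the stored text.)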
>
> As far as MLT, does MLT really need such a big field?
>
> But you may be on your way to sharding your index if you remove this info
> and testing shows problems....
>
> Best
> Erick
>
> On Thu, Mar 29, 2012 at 9:32 AM, Vadim Kisselmann
> <v.kisselm...@googlemail.com> wrote:
>> Hi Erick,
>> thanks :)
>> The admin UI gives me the counts, so I can identify fields with big
>> numbers of unique terms.
>> I knew this wiki page, but I read it one more time.
>> List of my file extensions with sizes in GB (index size ~150GB):
>> tvf 90GB
>> fdt 30GB
>> tim 18GB
>> prx 15GB
>> frq 12GB
>> tip 200MB
>> tvx 150MB
>>
>> tvf is my biggest file extension.
>> Wiki: "This file contains, for each field that has a term vector
>> stored, a list of the terms, their frequencies and, optionally,
>> position and offset information."
>>
>> Hmm, I use termVectors on my biggest fields because of MLT and highlighting.
>> But I think I should test my performance without termVectors. Good idea? :)
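>>
>> (For the test I would just switch the flags off in schema.xml, e.g.
>>
>>   <field name="text" type="text_general" indexed="true" stored="true"
>>          termVectors="false" termPositions="false" termOffsets="false"/>
>>
>> a sketch, field name and type are from my schema and only
>> illustrative -- and then reindex, since term vectors are written at
>> index time.)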
>>
>> What do you think about my file extension sizes?
>>
>> Best regards
>> Vadim
>>
>>
>>
>>
>> 2012/3/29 Erick Erickson <erickerick...@gmail.com>:
>>> The admin UI (schema browser) will give you the counts of unique terms
>>> in your fields, which is where I'd start.
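>>>
>>> (You can also get at this via the Luke request handler if you want
>>> the raw view -- a sketch, core name and field are placeholders:
>>>
>>>     http://localhost:8983/solr/admin/luke?fl=title&numTerms=10
>>>
>>> it lists the top terms per field.)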
>>>
>>> I suspect you've already seen this page, but if not:
>>> http://lucene.apache.org/java/3_5_0/fileformats.html#file-names
>>> the .fdt and .fdx file extensions are where data goes when
>>> you set 'stored="true" '. These files don't affect search speed,
>>> they just contain the verbatim copy of the data.
>>>
>>> The relative sizes of the various files above should give
>>> you a hint as to what's using the most space, but it'll be a bit
>>> of a hunt for you to pinpoint what's actually up. TermVectors
>>> and norms are often big consumers of space.
>>>
>>> Best
>>> Erick
>>>
>>> On Wed, Mar 28, 2012 at 10:55 AM, Vadim Kisselmann
>>> <v.kisselm...@googlemail.com> wrote:
>>>> Hello folks,
>>>>
>>>> I work with Solr 4.0 r1292064 from trunk.
>>>> My index grows fast: with 10 million docs I get an index size of 150GB
>>>> (25% stored, 75% indexed).
>>>> I want to find out which fields (content) are too large, so I can
>>>> consider countermeasures.
>>>>
>>>> How can I localize/discover the largest fields in my index?
>>>> Luke (latest from trunk) doesn't work with my Solr version. I built
>>>> the Lucene/Solr .jars and tried to feed Luke with these, but I get
>>>> many errors and can't build it.
>>>>
>>>> What other options do I have?
>>>>
>>>> Thanks and best regards
>>>> Vadim
