Re: Index size increases disproportionately to size of added field when indexed=false

Erick Erickson Thu, 15 Feb 2018 08:39:08 -0800

David:

Rats, the .cfs files make the sizes I'd hoped to interpret ambiguous,
since a compound file conceals the underlying size of each extension
it bundles. We can approach it a bit differently though. Take one
segment that's _not_ in cfs format, where the total size of all files
making up that segment is near 5GB (the default max segment size), and
compare the individual files for that segment only. What I'm hoping
to find out, of course, is which extensions vary dramatically. But
let's assume for the nonce that the numbers you already have are
comparable if we ignore the .cfs files.


.doc    1094.68    2767.53 - term frequencies
.fdt    1633.21    5387.92 - stored data
.pos     809.23    1272.70 - position information

So the file difference (if borne out) indicates the following:

- .doc: you have more documents or more terms or different options on
your terms [1]
- .fdt: you're storing more fields than you used to. [1]
- .pos: you have more docs or more terms or have position information
turned on where you didn't before. [1]

[1] or lots of deleted docs that haven't been merged away. This
information should be on the admin page for any particular core.
NOTE: getting 14M from querying *:* does _not_ say anything about
deleted docs, which still take up space until they're merged away.
This is highly unlikely to be your problem, but let's eliminate the
easy stuff ;)
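
To make the per-segment comparison less tedious, here is a minimal
Python sketch that totals file sizes per extension for each segment
that is _not_ in cfs format. It assumes the usual Lucene naming where
every file of a segment starts with the segment name (_0.fdt,
_0_Lucene50_0.doc and so on); the function names are mine, and the
directory you pass in is whatever your core's data/index path is.

```python
import os
import re
from collections import defaultdict

# Assumed naming: "_<segment>[_<codec suffix>].<ext>", e.g. _0.fdt or
# _0_Lucene50_0.doc. Files such as segments_N or write.lock do not match.
SEGMENT_FILE = re.compile(r"^(_[0-9a-z]+)(?:_.*)?\.([A-Za-z0-9]+)$")


def sizes_by_segment(index_dir):
    """Return {segment: {extension: bytes}} for the files in index_dir."""
    sizes = defaultdict(lambda: defaultdict(int))
    for name in os.listdir(index_dir):
        match = SEGMENT_FILE.match(name)
        if not match:
            continue  # not a per-segment file
        segment, ext = match.groups()
        sizes[segment][ext] += os.path.getsize(os.path.join(index_dir, name))
    return sizes


def non_compound_segments(sizes):
    """Keep only segments without a .cfs file, where each extension is visible."""
    return {seg: dict(exts) for seg, exts in sizes.items() if "cfs" not in exts}


def report(index_dir):
    """Print per-extension sizes in MB, largest first, for non-cfs segments."""
    segments = non_compound_segments(sizes_by_segment(index_dir))
    for seg, exts in sorted(segments.items()):
        total = sum(exts.values())
        print(f"{seg}  total={total / 2**20:.1f} MB")
        for ext, size in sorted(exts.items(), key=lambda kv: -kv[1]):
            print(f"  .{ext:<5} {size / 2**20:10.2f} MB")
```

Run report() against the index directory of each version, pick one
segment of similar total size from each, and compare the .doc, .fdt
and .pos lines.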

Where I'd go from here after checking that these ratios are true for a
single like-sized segment in both cases....

1> the LukeRequestHandler can tell you exactly how the index is
defined. It _should_ show you the realized meta-data for each field
and let you pinpoint any different options that have been enabled.
Using Luke itself can provide a much more detailed look at what's
actually _in_ your index. You could also have Luke reconstruct the
same doc from your index in each case and compare. Perhaps your SQL
is doing something really unexpected.

2> compare your Oracle intermediate tables: are they _really_
identical? The ordering shouldn't make any difference at all to Solr
assuming the same docs are being indexed (plus any expected delta).
There's an edge case I can imagine if you hit a "perfect storm" where
reordering leaves one version with a lot more deleted docs than the
other, but that's unlikely. It would be easy to verify: the two
versions would have a radically different number of deleted docs....
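
Both checks above (field options and deleted docs) can be read off
the LukeRequestHandler output. A rough Python sketch follows. It
assumes the JSON response has an "index" section with numDocs and
maxDoc and a "fields" map whose entries carry "type", "schema" and
"index" flag strings; check that against what your Solr version
actually returns. The function names and the example core URL are
hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_luke(core_url):
    """Fetch Luke output for a core, e.g. http://localhost:8983/solr/mycore
    (hypothetical URL). numTerms=0 skips the per-field top-terms work."""
    with urlopen(f"{core_url}/admin/luke?numTerms=0&wt=json") as resp:
        return json.load(resp)


def deleted_docs(luke):
    """Deleted-but-not-yet-merged docs: maxDoc minus numDocs."""
    index = luke["index"]
    return index["maxDoc"] - index["numDocs"]


def field_differences(luke_a, luke_b, keys=("type", "schema", "index")):
    """Return {field: (a, b)} for fields whose type or flags differ,
    including fields present in only one of the two indexes."""
    fields_a = luke_a.get("fields", {})
    fields_b = luke_b.get("fields", {})
    diffs = {}
    for name in sorted(set(fields_a) | set(fields_b)):
        a = {k: fields_a.get(name, {}).get(k) for k in keys}
        b = {k: fields_b.get(name, {}).get(k) for k in keys}
        if a != b:
            diffs[name] = (a, b)
    return diffs
```

Call fetch_luke() once per installation, then compare deleted_docs()
for the two results and look at what field_differences() reports.
Anything other than the one extra stored field would be worth a
closer look.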

Best,
Erick




On Thu, Feb 15, 2018 at 7:13 AM, Pratik Patel <pra...@semandex.net> wrote:
> @Alessandro I will see if I can reproduce the same issue just by turning
> off omitNorms on field type. I'll open another mail thread if required.
> Thanks.
>
> On Thu, Feb 15, 2018 at 6:12 AM, Howe, David <david.h...@auspost.com.au>
> wrote:
>
>>
>> Hi Alessandro,
>>
>> Some interesting testing today that seems to have gotten me closer to what
>> the issue is.  When I run the version of the index that is working
>> correctly against my database table that has the extra field in it, the
>> index suddenly increases in size.  This is even though the data importer is
>> running the same SELECT as before (which doesn't include the extra column)
>> and loads the same number of rows.
>>
>> After scratching my head for a bit and browsing through both versions of
>> the table I am loading from (with and without the extra field), I noticed
>> that the natural ordering of the tables is different.  These tables are
>> "staging" tables that I populate with another set of queries and inserts to
>> get the data into a format that is easy to ingest into Solr.  When I add
>> the extra field to these queries, it changes the Oracle query plan as the
>> field is contained in a different table that I need to join to.  As I don't
>> specify an "ORDER BY" on the query (as I didn't think it would make a
>> difference and would slow the query down), Oracle is free to choose how it
>> orders the result set.  Adding the extra field changes that natural
>> ordering, which affects the order things go into my staging table.  As I
>> don't specify an "ORDER BY" when I select things out of the staging table,
>> my data in the scenario that is working is being loaded in a different
>> order to the scenario which doesn't work.
>>
>> I am currently running full loads to verify this under each scenario, as I
>> have now forced the data in the scenario that doesn't work to be in the
>> same order as the scenario that does.  Will see how this load goes
>> overnight.
>>
>> This leads to the question of what difference does it make to Solr what
>> order I load the data in?
>>
>> I also noticed that the .cfs file is quite large in the second scenario,
>> even though this is supposed to be disabled by default in Solr.  I checked
>> my Solr config and there is no override of the default.
>>
>> In answer to your questions:
>>
>> 1) same number of documents - YES ~14,000,000 documents
>> 2) identical documents ( + 1 new field each not indexed) - YES, the second
>> scenario has one extra field that is stored but not indexed
>> 3) same number of deleted documents - YES, there are zero deleted
>> documents in both scenarios
>> 4) they both were born from scratch ( an empty index) - YES, both start
>> from a brand new virtual server with a brand new installation of Solr
>>
>> I am using the default auto commit, which I think is 15000.
>>
>> Thanks again for your assistance.
>>
>> Regards,
>>
>> David
>>
