Good point! Do you store the large files in your documents, or just index them?

Do you have a "largest file" limit in your environment? Try this:
ulimit -a

What is the "file size"?
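
If you only want that one number, here is a minimal sketch. It assumes a POSIX-style shell; the unit of the value depends on the shell (bash reports 1024-byte blocks), and the variable name is just for illustration. If the limit works out to less than your 4.8G fdt file, writes past that size would fail.

```shell
# Ask the shell for just the per-process file size limit.
# Prints either "unlimited" or a block count (1024-byte blocks in bash).
fsize_limit=$(ulimit -f)

if [ "$fsize_limit" = "unlimited" ]; then
  echo "no per-process file size limit"
else
  echo "file size limit: $fsize_limit blocks"
fi
```

Run it as the same user that runs Solr, since limits can differ per user.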

On Thu, Apr 19, 2012 at 8:04 AM, Shawn Heisey wrote:
> On 4/19/2012 7:49 AM, Bram Rongen wrote:
>> Yesterday I started indexing again, this time on Solr 3.6. Again Solr is
>> failing around the same point, but not exactly (now the largest fdt file
>> is 4.8G). It happens right after I receive memory errors on the Drupal
>> side, which makes me suspect that it may have something to do with a huge
>> document. Is that possible? I was indexing 1500 documents at once every
>> minute. Drupal builds them all up in memory before submitting them to
>> Solr. At some point it runs out of memory and I have to switch to 10/20
>> documents per minute for a while; then I can switch back to 1000
>> documents per minute.
>> The disk is a software RAID1 over 2 disks. But I've also run into the same
>> problem on another server. This was a VM server with only 1GB of RAM and
>> 40GB of disk. On that server the merge-repeat happened at an earlier stage.
>> I've also let Solr continue merging for about two days before (in an
>> earlier attempt), without submitting new documents. The merging kept
>> repeating.
>> Somebody suggested it could be because I'm using Jetty. Could that be
>> right?
> I am using Jetty for my Solr installation and it handles very large indexes
> without a problem.  I have created a single index with all my data (nearly
> 70 million documents, total index size over 100GB).  Aside from how long it
> takes to build and the fact that I don't have enough RAM to cache it for
> good performance, Solr handled it just fine.  For production I use a
> distributed index on multiple servers.
> I don't know why you are seeing a merge that continually restarts; that's
> truly odd.  I've never used Drupal and don't know a lot about it.  From my
> small amount of research just now, I assume that it uses Tika, another
> tool that I have no experience with.  I am guessing (based purely on
> speculation about the "body" field in your schema) that you store the
> entire text of your documents in Solr, and that they are indexed up to a
> maximum of 10000 tokens (the default value of maxFieldLength in
> solrconfig.xml).
> A document that's 100MB in size, if the whole thing gets stored, will
> completely overwhelm a 32MB buffer, and might even be enough to overwhelm a
> 256MB buffer as well, because it will basically have to build the entire
> index segment in RAM, with term vectors, indexed data, and stored data for
> all fields.
> With such large documents, you may have to increase the maxFieldLength, or
> you won't be able to search on the entire document text.  Depending on the
> content of those documents, it may or may not be a problem that only the
> first 10,000 tokens will get indexed.  Large documents tend to be repetitive
> and there might not be any search value after the introduction and initial
> words.  Your documents may be different, so you'll have to make that
> decision.
> To test whether my current thoughts are right, I recommend that you try with
> the following settings during the initial full import:  ramBufferSizeMB:
> 1024 (or maybe higher), autoCommit maxTime: 0, autoCommit maxDocs: 0.  This
> will mean that unless the indexing process issues commits itself (either in
> the middle of indexing or at the end), you will have to do a manual commit.
> Once you have the initial index built and it is only doing updates, you
> will probably be able to go back to using autoCommit.
> It's possible that I have no understanding of the real problem here, and my
> recommendation above may result in no improvement.  General recommendations,
> no matter what the current problem might be:
> 1) Get a lot more RAM.  Ideally you want to have enough free memory to cache
> your entire index.  That may not be possible, but you want to get as close
> to that goal as you can.
> 2) If you can, see what you can do to increase your IOPS.  Using mirrored
> high RPM SAS is an easy solution, and might be slightly cheaper than SATA
> RAID10, which is my solution.  SSD is easy and very fast, but expensive and
> not redundant -- I am currently not aware of any SSD RAID solutions that
> have OS TRIM support.  RAID10 with high RPM SAS would be best, but very
> expensive.  On the extreme high end, you could go with a high performance
> SAN.
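
On the manual commit Shawn mentions for when autoCommit is off: a minimal sketch, assuming Solr is reachable at http://localhost:8983/solr/update (the port of the example Jetty setup and a single-core path; both are assumptions, so adjust them to your install). The function name and the SOLR_UPDATE_URL variable are made up for illustration. It posts a <commit/> message to the XML update handler with curl.

```shell
# Hypothetical Solr location; override SOLR_UPDATE_URL for your own host/port/core.
SOLR_UPDATE_URL="${SOLR_UPDATE_URL:-http://localhost:8983/solr/update}"

# Send an explicit <commit/> message to Solr's XML update handler.
# --fail makes curl return non-zero if Solr answers with an HTTP error.
solr_commit() {
  curl --silent --show-error --fail \
       -H 'Content-Type: text/xml' \
       --data-binary '<commit/>' \
       "$SOLR_UPDATE_URL"
}

# Call it once the bulk import has finished:
#   solr_commit
```

Until a commit is issued, the newly indexed documents will not show up in searches.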
