Good point! Do you store the large files in your documents, or just index them?

Do you have a "largest file" limit in your environment? Try this:

    ulimit -a

What does it report for "file size"?
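For example, on a Linux box with bash you can pull out just that one limit (the exact output format varies by shell and platform):

    $ ulimit -a | grep 'file size'
    file size               (blocks, -f) unlimited

If it shows a number instead of "unlimited", that is a hard cap (in blocks) on any file the process can write, and a merge that tries to grow an fdt file past it will fail.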
Do you have a "largest file" limit in your environment? Try this: ulimit -a What is the "file size"? On Thu, Apr 19, 2012 at 8:04 AM, Shawn Heisey <s...@elyograg.org> wrote: > On 4/19/2012 7:49 AM, Bram Rongen wrote: >> >> Yesterday I've started indexing again but this time on Solr 3.6.. Again >> Solr is failing around the same time, but not exactly (now the largest fdt >> file is 4.8G).. It's right after the moment I receive memory-errors at the >> Drupal side which make me suspicious that it maybe has something to do >> with >> a huge document.. Is that possible? I was indexing 1500 documents at once >> every minute. Drupal builds them all up in memory before submitting them >> to >> Solr. At some point it runs out of memory and I have to switch to 10/20 >> documents per minute for a while.. then I can switch back to 1000 >> documents >> per minute. >> >> The disk is a software RAID1 over 2 disks. But I've also run into the same >> problem at another server.. This was a VM-server with only 1GB ram and >> 40GB >> of disk. With this server the merge-repeat happened at an earlier stage. >> >> I've also let Solr continue with merging for about two days before (in an >> earlier attempt), without submitting new documents. The merging kept >> repeating. >> >> Somebody suggested it could be because I'm using Jetty, could that be >> right? > > > I am using Jetty for my Solr installation and it handles very large indexes > without a problem. I have created a single index with all my data (nearly > 70 million documents, total index size over 100GB). Aside from how long it > takes to build and the fact that I don't have enough RAM to cache it for > good performance, Solr handled it just fine. For production I use a > distributed index on multiple servers. > > I don't know why you are seeing a merge that continually restarts, that's > truly odd. I've never used drupal, don't know a lot about it. From my > small amount of research just now, I assume that it uses Tika, also another > tool that I have no experience with. I am guessing that you store the > entire text of your documents into solr, and that they are indexed up to a > maximum of 10000 tokens (the default value of maxFieldLength in > solrconfig.xml), based purely on speculation about the "body" field in your > schema. > > A document that's 100MB in size, if the whole thing gets stored, will > completely overwhelm a 32MB buffer, and might even be enough to overwhelm a > 256MB buffer as well, because it will basically have to build the entire > index segment in RAM, with term vectors, indexed data, and stored data for > all fields. > > With such large documents, you may have to increase the maxFieldLength, or > you won't be able to search on the entire document text. Depending on the > content of those documents, it may or may not be a problem that only the > first 10,000 tokens will get indexed. Large documents tend to be repetitive > and there might not be any search value after the introduction and initial > words. Your documents may be different, so you'll have to make that > decision. > > To test whether my current thoughts are right, I recommend that you try with > the following settings during the initial full import: ramBufferSizeMB: > 1024 (or maybe higher), autoCommit maxTime: 0, autoCommit maxDocs: 0. This > will mean that unless the indexing process issues manual commits (either in > the middle of indexing or at the end), you will have to do a manual one. 
> Once you have the initial index built and it is only doing updates, you will probably be able to go back to using autoCommit.
>
> It's possible that I have no understanding of the real problem here, and my recommendation above may result in no improvement. General recommendations, no matter what the current problem might be:
>
> 1) Get a lot more RAM. Ideally you want enough free memory to cache your entire index. That may not be possible, but you want to get as close to that goal as you can.
>
> 2) If you can, see what you can do to increase your IOPS. Mirrored high-RPM SAS drives are an easy solution, and might be slightly cheaper than SATA RAID10, which is my solution. SSD is easy and very fast, but expensive and not redundant -- I am currently not aware of any SSD RAID solutions that have OS TRIM support. RAID10 with high-RPM SAS would be best, but very expensive. On the extreme high end, you could go with a high-performance SAN.
>
> Thanks,
> Shawn

-- 
Lance Norskog
goks...@gmail.com