Hi guys! I've removed the two largest documents; one of them consisted of a single field of around 4MB of text.
This fixed my issue.

Kind regards,
Bram Rongen

On Fri, Apr 20, 2012 at 2:09 PM, Bram Rongen <m...@bramrongen.nl> wrote:
> Hmm, reading your reply again I see that Solr only uses the first 10k
> tokens from each field, so field length should not be a problem per se.
> It could be that my documents contain very large and unorganized tokens;
> could that startle Solr?
>
> On Fri, Apr 20, 2012 at 2:03 PM, Bram Rongen <m...@bramrongen.nl> wrote:
>> Yeah, I'm indexing some PDF documents. I've extracted the text through
>> Tika (pre-indexing), and the largest field in my DB is 20MB. That's
>> quite extensive ;) My solution for the moment is to cut this text down
>> to the first 500KB, which should be enough for a decent index and search
>> capabilities. Should I increase the buffer size for these sizes as well,
>> or will 32MB suffice?
>>
>> FYI, the output of ulimit -a is:
>>   core file size          (blocks, -c) 0
>>   data seg size           (kbytes, -d) unlimited
>>   scheduling priority             (-e) 20
>>   file size               (blocks, -f) unlimited
>>   pending signals                 (-i) 16382
>>   max locked memory       (kbytes, -l) 64
>>   max memory size         (kbytes, -m) unlimited
>>   open files                      (-n) 1024
>>   pipe size            (512 bytes, -p) 8
>>   POSIX message queues     (bytes, -q) 819200
>>   real-time priority              (-r) 0
>>   stack size              (kbytes, -s) 8192
>>   cpu time               (seconds, -t) unlimited
>>   max user processes              (-u) unlimited
>>   virtual memory          (kbytes, -v) unlimited
>>   file locks                      (-x) unlimited
>>
>> Kind regards!
>> Bram
>>
>> On Fri, Apr 20, 2012 at 12:15 PM, Lance Norskog <goks...@gmail.com> wrote:
>>> Good point! Do you store the large files in your documents, or just
>>> index them?
>>>
>>> Do you have a "largest file" limit in your environment? Try this:
>>>   ulimit -a
>>>
>>> What is the "file size"?
>>>
>>> On Thu, Apr 19, 2012 at 8:04 AM, Shawn Heisey <s...@elyograg.org> wrote:
>>>> On 4/19/2012 7:49 AM, Bram Rongen wrote:
>>>>> Yesterday I started indexing again, but this time on Solr 3.6. Again
>>>>> Solr is failing around the same time, but not exactly (now the
>>>>> largest fdt file is 4.8G). It's right after the moment I receive
>>>>> memory errors on the Drupal side, which makes me suspect it has
>>>>> something to do with a huge document. Is that possible? I was
>>>>> indexing 1500 documents at once every minute. Drupal builds them all
>>>>> up in memory before submitting them to Solr. At some point it runs
>>>>> out of memory and I have to switch to 10/20 documents per minute for
>>>>> a while; then I can switch back to 1000 documents per minute.
>>>>>
>>>>> The disk is a software RAID1 over 2 disks, but I've also run into the
>>>>> same problem on another server: a VM with only 1GB RAM and 40GB of
>>>>> disk. On that server the merge-repeat happened at an earlier stage.
>>>>>
>>>>> I've also let Solr continue with merging for about two days before
>>>>> (in an earlier attempt), without submitting new documents. The
>>>>> merging kept repeating.
>>>>>
>>>>> Somebody suggested it could be because I'm using Jetty; could that be
>>>>> right?
>>>>
>>>> I am using Jetty for my Solr installation and it handles very large
>>>> indexes without a problem. I have created a single index with all my
>>>> data (nearly 70 million documents, total index size over 100GB). Aside
>>>> from how long it takes to build and the fact that I don't have enough
>>>> RAM to cache it for good performance, Solr handled it just fine.
>>>> For production I use a distributed index on multiple servers.
>>>>
>>>> I don't know why you are seeing a merge that continually restarts;
>>>> that's truly odd. I've never used Drupal and don't know a lot about
>>>> it. From my small amount of research just now, I assume that it uses
>>>> Tika, another tool that I have no experience with. I am guessing that
>>>> you store the entire text of your documents in Solr, and that they are
>>>> indexed up to a maximum of 10000 tokens (the default value of
>>>> maxFieldLength in solrconfig.xml), based purely on speculation about
>>>> the "body" field in your schema.
>>>>
>>>> A document that's 100MB in size, if the whole thing gets stored, will
>>>> completely overwhelm a 32MB buffer, and might even be enough to
>>>> overwhelm a 256MB buffer as well, because Solr will basically have to
>>>> build the entire index segment in RAM, with term vectors, indexed
>>>> data, and stored data for all fields.
>>>>
>>>> With such large documents, you may have to increase maxFieldLength, or
>>>> you won't be able to search the entire document text. Depending on the
>>>> content of those documents, it may or may not be a problem that only
>>>> the first 10,000 tokens get indexed. Large documents tend to be
>>>> repetitive, and there might not be any search value after the
>>>> introduction and initial words. Your documents may be different, so
>>>> you'll have to make that decision.
>>>>
>>>> To test whether my current thoughts are right, I recommend that you
>>>> try the following settings during the initial full import:
>>>> ramBufferSizeMB: 1024 (or maybe higher), autoCommit maxTime: 0,
>>>> autoCommit maxDocs: 0. This means that unless the indexing process
>>>> issues commits itself (either in the middle of indexing or at the
>>>> end), you will have to do a manual commit. Once you have the initial
>>>> index built and it is only doing updates, you will probably be able to
>>>> go back to using autoCommit.
>>>>
>>>> It's possible that I have no understanding of the real problem here,
>>>> and my recommendation above may result in no improvement. General
>>>> recommendations, no matter what the current problem might be:
>>>>
>>>> 1) Get a lot more RAM. Ideally you want enough free memory to cache
>>>> your entire index. That may not be possible, but you want to get as
>>>> close to that goal as you can.
>>>> 2) If you can, see what you can do to increase your IOPS. Mirrored
>>>> high-RPM SAS is an easy solution, and might be slightly cheaper than
>>>> SATA RAID10, which is my solution. SSD is easy and very fast, but
>>>> expensive and not redundant -- I am currently not aware of any SSD
>>>> RAID solutions that have OS TRIM support. RAID10 with high-RPM SAS
>>>> would be best, but very expensive. On the extreme high end, you could
>>>> go with a high-performance SAN.
>>>>
>>>> Thanks,
>>>> Shawn
>>>
>>> --
>>> Lance Norskog
>>> goks...@gmail.com
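
For anyone who finds this thread later: a rough sketch of how the settings
Shawn describes map onto solrconfig.xml for Solr 3.x. The element placement
follows the stock example config, and the values are illustrative
assumptions taken from this thread, not tested recommendations for this
particular index.

    <!-- In the <indexDefaults> / <mainIndex> section: a larger RAM buffer
         for the initial full import (1024MB as suggested above), and a
         higher token cap than the default 10000 if the whole document
         text should be searchable. -->
    <ramBufferSizeMB>1024</ramBufferSizeMB>
    <maxFieldLength>2147483647</maxFieldLength>

    <!-- In the <updateHandler> section: leave autoCommit disabled
         (commented out) during the initial import, then re-enable it once
         the index is only receiving updates. The maxDocs/maxTime values
         shown are placeholders. -->
    <!--
    <autoCommit>
      <maxDocs>10000</maxDocs>
      <maxTime>60000</maxTime>
    </autoCommit>
    -->

With autoCommit off you have to commit explicitly when the import finishes,
e.g. (assuming the default single-core URL):

    curl 'http://localhost:8983/solr/update?commit=true'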