Hi Guys!

I've removed the two largest documents. One of them consisted of a single
field containing around 4MB of text.

This fixed my issue.

Kind regards,

Bram Rongen

On Fri, Apr 20, 2012 at 2:09 PM, Bram Rongen <m...@bramrongen.nl> wrote:

> Hmm, reading your reply again I see that Solr only indexes the first 10k
> tokens from each field, so field length should not be a problem per se. It
> could be that my documents contain very large and garbled tokens; could
> that trip up Solr?
>
>
> On Fri, Apr 20, 2012 at 2:03 PM, Bram Rongen <m...@bramrongen.nl> wrote:
>
>> Yeah, I'm indexing some PDF documents. I've extracted the text through
>> Tika (pre-indexing), and the largest field in my DB is 20MB. That's quite
>> extensive ;) My solution for the moment is to cut this text down to the
>> first 500KB, which should be enough for a decent index and decent search.
>> Should I increase the buffer size for fields of this size as well, or
>> will 32MB suffice?
>>
>> FYI, output of ulimit -a is
>> core file size          (blocks, -c) 0
>> data seg size           (kbytes, -d) unlimited
>> scheduling priority             (-e) 20
>> *file size               (blocks, -f) unlimited*
>> pending signals                 (-i) 16382
>> max locked memory       (kbytes, -l) 64
>> max memory size         (kbytes, -m) unlimited
>> open files                      (-n) 1024
>> pipe size            (512 bytes, -p) 8
>> POSIX message queues     (bytes, -q) 819200
>> real-time priority              (-r) 0
>> stack size              (kbytes, -s) 8192
>> cpu time               (seconds, -t) unlimited
>> max user processes              (-u) unlimited
>> virtual memory          (kbytes, -v) unlimited
>> file locks                      (-x) unlimited
>>
>>
>> Kind regards!
>> Bram
>>
>> On Fri, Apr 20, 2012 at 12:15 PM, Lance Norskog <goks...@gmail.com> wrote:
>>
>>> Good point! Do you store the large files in your documents, or just index
>>> them?
>>>
>>> Do you have a "largest file" limit in your environment? Try this:
>>> ulimit -a
>>>
>>> What is the "file size"?
>>>
>>> On Thu, Apr 19, 2012 at 8:04 AM, Shawn Heisey <s...@elyograg.org> wrote:
>>> > On 4/19/2012 7:49 AM, Bram Rongen wrote:
>>> >>
>>> >> Yesterday I started indexing again, this time on Solr 3.6. Again Solr
>>> >> is failing around the same point, but not exactly the same (now the
>>> >> largest fdt file is 4.8GB). It happens right after the moment I receive
>>> >> memory errors on the Drupal side, which makes me suspicious that it may
>>> >> have something to do with a huge document. Is that possible? I was
>>> >> indexing 1500 documents at once every minute; Drupal builds them all up
>>> >> in memory before submitting them to Solr. At some point it runs out of
>>> >> memory and I have to switch to 10-20 documents per minute for a while,
>>> >> then I can switch back to 1000 documents per minute.
>>> >>
>>> >> The disk is a software RAID1 over two disks, but I've also run into
>>> >> the same problem on another server. That one was a VM with only 1GB of
>>> >> RAM and 40GB of disk, and there the merge-repeat happened at an earlier
>>> >> stage.
>>> >>
>>> >> I've also let Solr continue merging for about two days (in an earlier
>>> >> attempt) without submitting new documents. The merging kept repeating.
>>> >>
>>> >> Somebody suggested it could be because I'm using Jetty; could that be
>>> >> right?
>>> >
>>> >
>>> > I am using Jetty for my Solr installation and it handles very large
>>> > indexes without a problem.  I have created a single index with all my
>>> > data (nearly 70 million documents, total index size over 100GB).  Aside
>>> > from how long it takes to build and the fact that I don't have enough
>>> > RAM to cache it for good performance, Solr handled it just fine.  For
>>> > production I use a distributed index on multiple servers.
>>> >
>>> > I don't know why you are seeing a merge that continually restarts;
>>> > that's truly odd.  I've never used Drupal and don't know a lot about
>>> > it.  From my small amount of research just now, I assume that it uses
>>> > Tika, another tool that I have no experience with.  I am guessing that
>>> > you store the entire text of your documents into Solr, and that they
>>> > are indexed up to a maximum of 10000 tokens (the default value of
>>> > maxFieldLength in solrconfig.xml), based purely on speculation about
>>> > the "body" field in your schema.
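>>> >
>>> > For reference, this is roughly where those two knobs live -- a minimal
>>> > excerpt along the lines of the stock 3.x example solrconfig.xml (exact
>>> > placement varies between versions, and 3.x already flags maxFieldLength
>>> > as deprecated in favor of LimitTokenCountFilterFactory in the schema):
>>> >
>>> >   <indexDefaults>
>>> >     <!-- Lucene indexing buffer: how much is built up in RAM before a
>>> >          segment is flushed to disk -->
>>> >     <ramBufferSizeMB>32</ramBufferSizeMB>
>>> >     <!-- Only the first 10000 tokens of each field are indexed -->
>>> >     <maxFieldLength>10000</maxFieldLength>
>>> >   </indexDefaults>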
>>> >
>>> > A document that's 100MB in size, if the whole thing gets stored, will
>>> > completely overwhelm a 32MB buffer, and might even be enough to
>>> > overwhelm a 256MB buffer as well, because Solr will basically have to
>>> > build the entire index segment in RAM, with term vectors, indexed data,
>>> > and stored data for all fields.
>>> >
>>> > With such large documents, you may have to increase maxFieldLength, or
>>> > you won't be able to search on the entire document text.  Depending on
>>> > the content of those documents, it may or may not be a problem that
>>> > only the first 10,000 tokens get indexed.  Large documents tend to be
>>> > repetitive, and there might not be any search value after the
>>> > introduction and initial words.  Your documents may be different, so
>>> > you'll have to make that decision.
>>> >
>>> > To test whether my current thoughts are right, I recommend that you
>>> > try the following settings during the initial full import:
>>> > ramBufferSizeMB: 1024 (or maybe higher), autoCommit maxTime: 0,
>>> > autoCommit maxDocs: 0.  This will mean that unless the indexing process
>>> > issues manual commits (either in the middle of indexing or at the end),
>>> > you will have to do a manual one.  Once you have the initial index
>>> > built and it is only doing updates, you will probably be able to go
>>> > back to using autoCommit.
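>>> >
>>> > In solrconfig.xml that would look something like the sketch below
>>> > (zeroing both autoCommit limits, or simply commenting the autoCommit
>>> > block out, turns automatic commits off; you then commit by hand with an
>>> > explicit commit=true update request once the import finishes):
>>> >
>>> >   <indexDefaults>
>>> >     <!-- Larger indexing buffer for the initial full import -->
>>> >     <ramBufferSizeMB>1024</ramBufferSizeMB>
>>> >   </indexDefaults>
>>> >
>>> >   <updateHandler class="solr.DirectUpdateHandler2">
>>> >     <!-- autoCommit disabled during the full import; re-enable it once
>>> >          the index is built and only receiving updates -->
>>> >     <!--
>>> >     <autoCommit>
>>> >       <maxDocs>0</maxDocs>
>>> >       <maxTime>0</maxTime>
>>> >     </autoCommit>
>>> >     -->
>>> >   </updateHandler>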
>>> >
>>> > It's possible that I have no understanding of the real problem here,
>>> > and my recommendation above may result in no improvement.  General
>>> > recommendations, no matter what the current problem might be:
>>> >
>>> > 1) Get a lot more RAM.  Ideally you want to have enough free memory to
>>> > cache your entire index.  That may not be possible, but you want to get
>>> > as close to that goal as you can.
>>> > 2) If you can, see what you can do to increase your IOPS.  Using
>>> > mirrored high-RPM SAS drives is an easy solution, and might be slightly
>>> > cheaper than SATA RAID10, which is my solution.  SSD is easy and very
>>> > fast, but expensive and not redundant -- I am currently not aware of
>>> > any SSD RAID solutions that have OS TRIM support.  RAID10 with high-RPM
>>> > SAS would be best, but very expensive.  On the extreme high end, you
>>> > could go with a high-performance SAN.
>>> >
>>> > Thanks,
>>> > Shawn
>>> >
>>>
>>>
>>>
>>> --
>>> Lance Norskog
>>> goks...@gmail.com
>>>
>>
>>
>
