Mike,

 >> "But, I believe you (system locks up with MMapDirectory for you
use-case), so there is a bug somewhere!  And I wish we could get to the
bottom of that, and fix it."

Yes, that's true on Windows for sure. I haven't tested it on Unix-like
systems at that scale, so I don't have any observations to report there.

>> "Also, this (system locks up when using MMapDirectory) sounds different
from the "Lucene fsyncs files that it doesn't need to" bug, right?"

That's correct, they are separate issues. I just brought up the
system-freezing-up-on-Windows point in response to Uwe's explanation
earlier.

I know I had taken it upon myself to open a Jira for the fsync issue,
but it got delayed on my side as I got occupied with other things
in my day job. I will open one later today.

Thanks,
Rahul


On Wed, Mar 24, 2021 at 12:58 PM Michael McCandless <
[email protected]> wrote:

> MMapDirectory really should be (is supposed to be) better than
> SimpleFSDirectory for your use case.
>
> Memory-mapped pages do not have to fit into your 64 GB of physical RAM, but
> the "hot" pages (the parts of the index that you are actively querying) would
> ideally fit mostly in free RAM on your box to get OK search performance.
> Run with as small a JVM heap as possible so the OS has the most RAM to keep
> such pages hot.  Since you are getting OK performance with
> SimpleFSDirectory, it sounds like you do have enough free RAM for the parts
> of the index you are searching...
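>
> For what it's worth, at the Lucene level the directory choice is a one-line
> swap, so it is easy to A/B the two implementations against the same index.
> A minimal sketch (the index path, field name and query term are made up):
>
>   import java.nio.file.Paths;
>   import org.apache.lucene.index.DirectoryReader;
>   import org.apache.lucene.index.Term;
>   import org.apache.lucene.search.IndexSearcher;
>   import org.apache.lucene.search.TermQuery;
>   import org.apache.lucene.search.TopDocs;
>   import org.apache.lucene.store.Directory;
>   import org.apache.lucene.store.MMapDirectory;
>
>   public class DirectoryCompare {
>     public static void main(String[] args) throws Exception {
>       // Swap in "new SimpleFSDirectory(...)" here to compare the two.
>       try (Directory dir = new MMapDirectory(Paths.get("/path/to/index"));
>            DirectoryReader reader = DirectoryReader.open(dir)) {
>         IndexSearcher searcher = new IndexSearcher(reader);
>         TopDocs hits = searcher.search(new TermQuery(new Term("body", "test")), 10);
>         System.out.println("hits: " + hits.totalHits);
>       }
>     }
>   }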
>
> But, I believe you (system locks up with MMapDirectory for your use case),
> so there is a bug somewhere!  And I wish we could get to the bottom of
> that, and fix it.
>
> Also, this (system locks up when using MMapDirectory) sounds different
> from the "Lucene fsyncs files that it doesn't need to" bug, right?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Mon, Mar 15, 2021 at 4:28 PM Rahul Goswami <[email protected]>
> wrote:
>
>> Uwe,
>> I understand that mmap would only bring *a part* of the index from the
>> virtual address space into physical memory as and when the pages are
>> requested. However, the limitation on our side is that in most cases we
>> cannot ask for more than 128 GB of RAM (and unfortunately even that would
>> be a stretch) for the Solr machine.
>>
>> I have read and re-read the article you referenced in the past :) It's
>> brilliantly written and did help clarify quite a few things for me, I must
>> say. However, at the end of the day, there is only so much the OS (at least
>> Windows) can do before it starts swapping different pages of a 2-3 TB index
>> in and out of 64 GB of physical memory, isn't that right? The CPU usage
>> spikes to 100% at such times and the machine becomes totally unresponsive.
>> Switching to SimpleFSDirectory gets rid of this issue for us. I understand
>> that we are losing out on performance by an order of magnitude compared to
>> mmap, but I don't know of any alternative solution. Also, since most of our
>> use cases are more write-heavy than read-heavy, we can afford to compromise
>> on search performance with SimpleFS.
>>
>> Still, please let me know if there is anything about my explanation that
>> doesn't sound right to you.
>>
>> Thanks,
>> Rahul
>>
>> On Mon, Mar 15, 2021 at 3:54 PM Uwe Schindler <[email protected]> wrote:
>>
>>> This is not true. Memory mapping does not need to load the index into
>>> RAM, so you don't need that much physical memory. Paging is done only
>>> between the index files and RAM; that's what memory mapping is about.
>>>
>>> Please read the blog post:
>>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>>>
>>> Uwe
>>>
>>> On March 15, 2021 7:43:29 PM UTC, Rahul Goswami <
>>> [email protected]> wrote:
>>>>
>>>> Mike,
>>>> Yes, I am using a 64-bit JVM on Windows. I haven't tried reproducing the
>>>> issue on Linux yet. In the past we have had problems with mmap on Windows
>>>> with the machine freezing. The rationale I gave myself is that the amount
>>>> of disk and CPU activity for paging in and out must be intense for the OS
>>>> while trying to map an index that large into 64 GB of RAM. Also, since it's
>>>> an on-premise deployment, we can't expect the customers of the product to
>>>> provide nodes with > 400 GB RAM, which is what *I think* would be required
>>>> to get decent performance with mmap. Hence we had to switch to
>>>> SimpleFSDirectory.
>>>>
>>>> As for the fsync behavior, you are right. I tried with
>>>> NRTCachingDirectoryFactory as well, which defaults to using mmap underneath,
>>>> and it still makes fsync calls on already-existing index files.
>>>>
>>>> Thanks,
>>>> Rahul
>>>>
>>>> On Mon, Mar 15, 2021 at 3:15 PM Michael McCandless <
>>>> [email protected]> wrote:
>>>>
>>>>> Thanks Rahul.
>>>>>
>>>>> > primary reason being that memory mapping multi-terabyte indexes is
>>>>> not feasible through mmap
>>>>>
>>>>> Hmm, that is interesting -- are you using a 64-bit JVM?  If so, what
>>>>> goes wrong with such large maps?  Lucene's MMapDirectory should chunk the
>>>>> mapping to deal with ByteBuffer's int-only address space.
>>>>>
>>>>> SimpleFSDirectory usually has substantially worse performance than
>>>>> MMapDirectory.
>>>>>
>>>>> Still, I suspect you would hit the same issue if you used other
>>>>> FSDirectory implementations -- the fsync behavior should be the same.
>>>>>
>>>>> Mike McCandless
>>>>>
>>>>> http://blog.mikemccandless.com
>>>>>
>>>>>
>>>>> On Fri, Mar 12, 2021 at 1:46 PM Rahul Goswami <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Thanks Michael. To answer your question... yes, I am running Solr on
>>>>>> Windows with SimpleFSDirectoryFactory (primary reason being that
>>>>>> memory mapping multi-terabyte indexes is not feasible through mmap). I
>>>>>> will create a Jira later today with the details in this thread and
>>>>>> assign it to myself. Will take a shot at the fix.
>>>>>>
>>>>>> Thanks,
>>>>>> Rahul
>>>>>>
>>>>>> On Fri, Mar 12, 2021 at 10:00 AM Michael McCandless <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> I think long ago we used to track which files were actually dirty
>>>>>>> (i.e., files we had written bytes to) and only fsync those.  But
>>>>>>> something went wrong with that, and at some point we "simplified" this
>>>>>>> logic, I think on the assumption that asking the OS to fsync a file
>>>>>>> that does in fact exist but has not changed would be harmless?  But
>>>>>>> somehow it is not harmless in your case?  Are you on Windows?
>>>>>>>
>>>>>>> I tried to do a bit of digital archaeology to remember what
>>>>>>> happened here, and I came across this relevant-looking issue:
>>>>>>> https://issues.apache.org/jira/browse/LUCENE-2328.  That issue
>>>>>>> moved the tracking of which files have been written but not yet fsync'd
>>>>>>> down from IndexWriter into FSDirectory.
>>>>>>>
>>>>>>> But there was another change that then removed staleFiles from
>>>>>>> FSDirectory entirely... still trying to find that.  Aha, found it!
>>>>>>> https://issues.apache.org/jira/browse/LUCENE-6150.  Phew, Uwe was
>>>>>>> really quite upset in that issue ;)
>>>>>>>
>>>>>>> I also came across this delightful related issue, showing how a
>>>>>>> massive hurricane (Irene) can lead to finding and fixing a bug in 
>>>>>>> Lucene!
>>>>>>> https://issues.apache.org/jira/browse/LUCENE-3418
>>>>>>>
>>>>>>> > The assumption is that while the commit point is saved, no changes
>>>>>>> happen to the segment files in the saved generation.
>>>>>>>
>>>>>>> This assumption should really be true.  Lucene writes the files
>>>>>>> append-only, once, and then never changes them once they are closed.
>>>>>>> Pulling a commit point from Solr should further ensure that, even as
>>>>>>> indexing continues and new segments are written, the old segments
>>>>>>> referenced in that commit point will not be deleted.  But apparently
>>>>>>> this "harmless fsync" Lucene is doing is not so harmless in your use
>>>>>>> case.  Maybe open an issue and pull out the details from this
>>>>>>> discussion onto it?
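>>>>>>>
>>>>>>> At the Lucene level, "pulling a commit point" boils down to a
>>>>>>> SnapshotDeletionPolicy; Solr wraps this in its own deletion policy, so
>>>>>>> the following is just a rough sketch of the idea (the copySomewhere
>>>>>>> step is hypothetical), not Solr's actual code:
>>>>>>>
>>>>>>>   import java.io.IOException;
>>>>>>>   import java.util.Collection;
>>>>>>>   import org.apache.lucene.index.IndexCommit;
>>>>>>>   import org.apache.lucene.index.IndexWriter;
>>>>>>>   import org.apache.lucene.index.IndexWriterConfig;
>>>>>>>   import org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy;
>>>>>>>   import org.apache.lucene.index.SnapshotDeletionPolicy;
>>>>>>>   import org.apache.lucene.store.Directory;
>>>>>>>
>>>>>>>   public class CommitPointBackup {
>>>>>>>     // Reserve the current commit point, hand its file list to a backup
>>>>>>>     // routine, then release it so its files become deletable again.
>>>>>>>     static void backupCurrentCommit(Directory dir) throws IOException {
>>>>>>>       SnapshotDeletionPolicy snapshots =
>>>>>>>           new SnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy());
>>>>>>>       try (IndexWriter writer = new IndexWriter(dir,
>>>>>>>           new IndexWriterConfig().setIndexDeletionPolicy(snapshots))) {
>>>>>>>         IndexCommit commit = snapshots.snapshot();   // freeze this commit point
>>>>>>>         try {
>>>>>>>           Collection<String> files = commit.getFileNames();
>>>>>>>           copySomewhere(files);                      // hypothetical backup step
>>>>>>>         } finally {
>>>>>>>           snapshots.release(commit);                 // allow deletion again
>>>>>>>         }
>>>>>>>       }
>>>>>>>     }
>>>>>>>
>>>>>>>     static void copySomewhere(Collection<String> files) {
>>>>>>>       System.out.println("would back up: " + files);
>>>>>>>     }
>>>>>>>   }
>>>>>>>
>>>>>>> While the snapshot is held, anything copied from commit.getFileNames()
>>>>>>> should stay byte-for-byte stable.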
>>>>>>>
>>>>>>> Mike McCandless
>>>>>>>
>>>>>>> http://blog.mikemccandless.com
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Mar 12, 2021 at 9:03 AM Michael Sokolov <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Also - I should have said - I think the first step here is to write a
>>>>>>>> focused unit test that demonstrates the existence of the extra fsyncs
>>>>>>>> that we want to eliminate. It would be awesome if you were able to
>>>>>>>> create such a thing.
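>>>>>>>>
>>>>>>>> Such a test could be as simple as wrapping the directory so that every
>>>>>>>> sync() call is recorded, committing twice, and checking whether files
>>>>>>>> that were already fsync'ed by the first commit get fsync'ed again by
>>>>>>>> the second. A rough sketch (not actual Lucene test code; names are
>>>>>>>> made up):
>>>>>>>>
>>>>>>>>   import java.io.IOException;
>>>>>>>>   import java.nio.file.Files;
>>>>>>>>   import java.util.Collection;
>>>>>>>>   import java.util.HashSet;
>>>>>>>>   import java.util.Set;
>>>>>>>>   import org.apache.lucene.document.Document;
>>>>>>>>   import org.apache.lucene.index.IndexWriter;
>>>>>>>>   import org.apache.lucene.index.IndexWriterConfig;
>>>>>>>>   import org.apache.lucene.store.Directory;
>>>>>>>>   import org.apache.lucene.store.FSDirectory;
>>>>>>>>   import org.apache.lucene.store.FilterDirectory;
>>>>>>>>
>>>>>>>>   public class ExtraFsyncCheck {
>>>>>>>>     // Wraps a directory and remembers every file name passed to sync().
>>>>>>>>     static class SyncRecordingDirectory extends FilterDirectory {
>>>>>>>>       final Set<String> synced = new HashSet<>();
>>>>>>>>       SyncRecordingDirectory(Directory in) { super(in); }
>>>>>>>>       @Override
>>>>>>>>       public void sync(Collection<String> names) throws IOException {
>>>>>>>>         synced.addAll(names);
>>>>>>>>         super.sync(names);
>>>>>>>>       }
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     public static void main(String[] args) throws IOException {
>>>>>>>>       SyncRecordingDirectory dir = new SyncRecordingDirectory(
>>>>>>>>           FSDirectory.open(Files.createTempDirectory("extra-fsync-check")));
>>>>>>>>       try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
>>>>>>>>         writer.addDocument(new Document());
>>>>>>>>         writer.commit();
>>>>>>>>         Set<String> firstCommit = new HashSet<>(dir.synced);
>>>>>>>>         dir.synced.clear();
>>>>>>>>
>>>>>>>>         writer.addDocument(new Document());
>>>>>>>>         writer.commit();
>>>>>>>>         firstCommit.retainAll(dir.synced);
>>>>>>>>         // Anything left here was already fsync'ed by the first commit
>>>>>>>>         // and got fsync'ed again by the second one.
>>>>>>>>         System.out.println("re-fsync'ed files: " + firstCommit);
>>>>>>>>       }
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>>
>>>>>>>> If that printed set is non-empty on a plain FSDirectory, that would be a
>>>>>>>> concrete reproduction to attach to the Jira.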
>>>>>>>>
>>>>>>>> On Fri, Mar 12, 2021 at 9:00 AM Michael Sokolov <[email protected]>
>>>>>>>> wrote:
>>>>>>>> >
>>>>>>>> > Yes, please go ahead and open an issue. TBH I'm not sure why this is
>>>>>>>> > happening - there may be a good reason?? But let's explore it using
>>>>>>>> > an issue, thanks.
>>>>>>>> >
>>>>>>>> > On Fri, Mar 12, 2021 at 12:16 AM Rahul Goswami <
>>>>>>>> [email protected]> wrote:
>>>>>>>> > >
>>>>>>>> > > I can create a Jira and assign it to myself if that's OK (?). I
>>>>>>>> > > think this can help improve commit performance.
>>>>>>>> > > Also, to answer your question, we have indexes sometimes going into
>>>>>>>> > > multiple terabytes. Using the replication handler for backup would
>>>>>>>> > > mean requiring disk capacity of more than 2x the index size on the
>>>>>>>> > > machine at all times, which might not be feasible. So we back the
>>>>>>>> > > index up directly from the Solr node to a remote repository.
>>>>>>>> > >
>>>>>>>> > > Thanks,
>>>>>>>> > > Rahul
>>>>>>>> > >
>>>>>>>> > > On Thu, Mar 11, 2021 at 4:09 PM Michael Sokolov <
>>>>>>>> [email protected]> wrote:
>>>>>>>> > >>
>>>>>>>> > >> Well, it certainly doesn't seem necessary to fsync files that are
>>>>>>>> > >> unchanged and have already been fsync'ed. Maybe there's an
>>>>>>>> > >> opportunity to improve it? On the other hand, support for external
>>>>>>>> > >> processes reading Lucene index files isn't likely to become a
>>>>>>>> > >> feature of Lucene. You might want to consider using Solr
>>>>>>>> > >> replication to power your backup?
>>>>>>>> > >>
>>>>>>>> > >> On Thu, Mar 11, 2021 at 2:52 PM Rahul Goswami <
>>>>>>>> [email protected]> wrote:
>>>>>>>> > >> >
>>>>>>>> > >> > Thanks Michael. I thought that since this discussion is closer
>>>>>>>> > >> > to the code than most discussions on the solr-users list, this
>>>>>>>> > >> > seemed like the more appropriate forum. Will be mindful going
>>>>>>>> > >> > forward.
>>>>>>>> > >> > On your point about new segments, I attached a debugger and
>>>>>>>> > >> > tried a new commit (just a pure Solr commit, no backup process
>>>>>>>> > >> > running), and the code indeed does fsync a pre-existing segment
>>>>>>>> > >> > file. I was a bit baffled, since it challenged my fundamental
>>>>>>>> > >> > understanding that segment files, once written, are immutable,
>>>>>>>> > >> > no matter what (unless picked up for a merge, of course). Hence
>>>>>>>> > >> > I thought of reaching out, in case there are scenarios where
>>>>>>>> > >> > this might happen that I am unaware of.
>>>>>>>> > >> >
>>>>>>>> > >> > Thanks,
>>>>>>>> > >> > Rahul
>>>>>>>> > >> >
>>>>>>>> > >> > On Thu, Mar 11, 2021 at 2:38 PM Michael Sokolov <
>>>>>>>> [email protected]> wrote:
>>>>>>>> > >> >>
>>>>>>>> > >> >> This isn't a support forum; solr-users@ might be more
>>>>>>>> > >> >> appropriate. On that list someone might have a better idea
>>>>>>>> > >> >> about how the replication handler gets its list of files. This
>>>>>>>> > >> >> would be a good list to try if you wanted to propose a fix for
>>>>>>>> > >> >> the problem you're having. But since you're here -- it looks to
>>>>>>>> > >> >> me as if IndexWriter indeed syncs all "new" files in the
>>>>>>>> > >> >> current segments being committed; look in
>>>>>>>> > >> >> IndexWriter.startCommit and SegmentInfos.files. Caveat: (1) I'm
>>>>>>>> > >> >> looking at this code for the first time, and (2) things may
>>>>>>>> > >> >> have been different in 7.7.2? Sorry I don't know for sure, but
>>>>>>>> > >> >> are you sure that your backup process is not attempting to copy
>>>>>>>> > >> >> one of the new files?
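>>>>>>>> > >> >>
>>>>>>>> > >> >> If it helps to check that, one way to see exactly which files
>>>>>>>> > >> >> the latest commit point references is to read the newest
>>>>>>>> > >> >> SegmentInfos. A small sketch (the index path is made up; this is
>>>>>>>> > >> >> not how the replication handler does it):
>>>>>>>> > >> >>
>>>>>>>> > >> >>   import java.nio.file.Paths;
>>>>>>>> > >> >>   import org.apache.lucene.index.SegmentInfos;
>>>>>>>> > >> >>   import org.apache.lucene.store.Directory;
>>>>>>>> > >> >>   import org.apache.lucene.store.FSDirectory;
>>>>>>>> > >> >>
>>>>>>>> > >> >>   public class ListCommitFiles {
>>>>>>>> > >> >>     public static void main(String[] args) throws Exception {
>>>>>>>> > >> >>       try (Directory dir = FSDirectory.open(Paths.get("/path/to/index"))) {
>>>>>>>> > >> >>         // Read the most recent segments_N and print every file it
>>>>>>>> > >> >>         // references, including the segments_N file itself.
>>>>>>>> > >> >>         SegmentInfos infos = SegmentInfos.readLatestCommit(dir);
>>>>>>>> > >> >>         System.out.println("generation: " + infos.getGeneration());
>>>>>>>> > >> >>         for (String file : infos.files(true)) {
>>>>>>>> > >> >>           System.out.println(file);
>>>>>>>> > >> >>         }
>>>>>>>> > >> >>       }
>>>>>>>> > >> >>     }
>>>>>>>> > >> >>   }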
>>>>>>>> > >> >>
>>>>>>>> > >> >> On Thu, Mar 11, 2021 at 1:35 PM Rahul Goswami <
>>>>>>>> [email protected]> wrote:
>>>>>>>> > >> >> >
>>>>>>>> > >> >> > Hello,
>>>>>>>> > >> >> > Just wanted to follow up one more time to see if this is
>>>>>>>> the right form for my question? Or is this suitable for some other 
>>>>>>>> mailing
>>>>>>>> list?
>>>>>>>> > >> >> >
>>>>>>>> > >> >> > Best,
>>>>>>>> > >> >> > Rahul
>>>>>>>> > >> >> >
>>>>>>>> > >> >> > On Sat, Mar 6, 2021 at 3:57 PM Rahul Goswami <
>>>>>>>> [email protected]> wrote:
>>>>>>>> > >> >> >>
>>>>>>>> > >> >> >> Hello everyone,
>>>>>>>> > >> >> >> Following up on my question in case anyone has any idea.
>>>>>>>> > >> >> >> The reason this is important to know is that I am thinking
>>>>>>>> > >> >> >> of letting the backup process not hold any lock on the index
>>>>>>>> > >> >> >> files, which should allow the fsync during parallel commits.
>>>>>>>> > >> >> >> BUT, in case doing an fsync on existing segment files in a
>>>>>>>> > >> >> >> saved commit point DOES have an effect, it might leave the
>>>>>>>> > >> >> >> backed-up index in a corrupt state.
>>>>>>>> > >> >> >>
>>>>>>>> > >> >> >> Thanks,
>>>>>>>> > >> >> >> Rahul
>>>>>>>> > >> >> >>
>>>>>>>> > >> >> >> On Fri, Mar 5, 2021 at 3:04 PM Rahul Goswami <
>>>>>>>> [email protected]> wrote:
>>>>>>>> > >> >> >>>
>>>>>>>> > >> >> >>> Hello,
>>>>>>>> > >> >> >>> We have a process which backs up the index (Solr 7.7.2) on
>>>>>>>> > >> >> >>> a schedule. The way we do it is: we first save a commit
>>>>>>>> > >> >> >>> point on the index and then, using Solr's /replication
>>>>>>>> > >> >> >>> handler, get the list of files in that generation. After
>>>>>>>> > >> >> >>> the backup completes, we release the commit point. (Please
>>>>>>>> > >> >> >>> note that this is a separate backup process outside of Solr
>>>>>>>> > >> >> >>> and not the backup command of the /replication handler.)
>>>>>>>> > >> >> >>> The assumption is that while the commit point is saved, no
>>>>>>>> > >> >> >>> changes happen to the segment files in the saved generation.
>>>>>>>> > >> >> >>>
>>>>>>>> > >> >> >>> Now the issue... The backup process opens the index files
>>>>>>>> > >> >> >>> in a shared READ mode, preventing writes. This causes any
>>>>>>>> > >> >> >>> parallel commits to fail, complaining that the index files
>>>>>>>> > >> >> >>> are locked by another process (the backup process). Upon
>>>>>>>> > >> >> >>> debugging, I see that fsync is being called during commit
>>>>>>>> > >> >> >>> on already-existing segment files, which is not expected.
>>>>>>>> > >> >> >>> So, my question is: is there any reason for Lucene to call
>>>>>>>> > >> >> >>> fsync on already-existing segment files?
>>>>>>>> > >> >> >>>
>>>>>>>> > >> >> >>> The line of code I am referring to is as below:
>>>>>>>> > >> >> >>>
>>>>>>>> > >> >> >>>   try (final FileChannel file = FileChannel.open(fileToSync,
>>>>>>>> > >> >> >>>       isDir ? StandardOpenOption.READ : StandardOpenOption.WRITE))
>>>>>>>> > >> >> >>>
>>>>>>>> > >> >> >>> in method fsync(Path fileToSync, boolean isDir) of the
>>>>>>>> > >> >> >>> class file
>>>>>>>> > >> >> >>>
>>>>>>>> > >> >> >>> lucene\core\src\java\org\apache\lucene\util\IOUtils.java
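>>>>>>>> > >> >> >>>
>>>>>>>> > >> >> >>> For context, the body of that method is roughly the
>>>>>>>> > >> >> >>> following (paraphrased, not the exact 7.7.2 source). Note
>>>>>>>> > >> >> >>> the WRITE open option for regular files; opening an
>>>>>>>> > >> >> >>> existing segment file for write is presumably what collides
>>>>>>>> > >> >> >>> with the shared READ lock our backup process holds on
>>>>>>>> > >> >> >>> Windows:
>>>>>>>> > >> >> >>>
>>>>>>>> > >> >> >>>   try (final FileChannel file = FileChannel.open(fileToSync,
>>>>>>>> > >> >> >>>       isDir ? StandardOpenOption.READ : StandardOpenOption.WRITE)) {
>>>>>>>> > >> >> >>>     // Force the file's contents (and metadata) to stable storage.
>>>>>>>> > >> >> >>>     file.force(true);
>>>>>>>> > >> >> >>>   }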
>>>>>>>> > >> >> >>>
>>>>>>>> > >> >> >>> Thanks,
>>>>>>>> > >> >> >>> Rahul
>>>>>>>> > >> >>
>>> --
>>> Uwe Schindler
>>> Achterdiek 19, 28357 Bremen
>>> https://www.thetaphi.de
>>>
>>
