Hello,
I've opened the JIRA below for this issue and will work on submitting a
patch:
[LUCENE-9889] Lucene (unexpected) fsync on existing segments
https://issues.apache.org/jira/browse/LUCENE-9889

Thanks,
Rahul

On Fri, Mar 26, 2021 at 9:56 AM Rahul Goswami <[email protected]> wrote:

> Mike,
>
>  >> "But, I believe you (system locks up with MMapDirectory for you
> use-case), so there is a bug somewhere!  And I wish we could get to the
> bottom of that, and fix it."
>
> Yes, that's true on Windows for sure. I haven't tested it on Unix-like
> systems at that scale, so I don't have any observations to report there.
>
> >> "Also, this (system locks up when using MMapDirectory) sounds different
> from the "Lucene fsyncs files that it doesn't need to" bug, right?"
>
> That's correct, they are separate issues. I just brought up the
> system-freezing-up-on-Windows point in response to Uwe's explanation
> earlier.
>
> I know I had taken it upon myself to open a Jira for the fsync issue,
> but it got delayed on my side as I got occupied with other things
> in my day job. I will open one later today.
>
> Thanks,
> Rahul
>
>
> On Wed, Mar 24, 2021 at 12:58 PM Michael McCandless <
> [email protected]> wrote:
>
>> MMapDirectory really should be (is supposed to be) better than
>> SimpleFSDirectory for your use case.
>>
>> Memory-mapped pages do not have to fit into your 64 GB of physical RAM,
>> but the "hot" pages (the parts of the index you are actively querying)
>> ideally would fit mostly in free RAM on your box for OK search
>> performance.  Run with as small a JVM heap as possible so the OS has the
>> most RAM to keep such pages hot.  Since you are getting OK performance with
>> SimpleFSDirectory, it sounds like you do have enough free RAM for the parts
>> of the index you are searching...
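>>
>> For what it's worth, a minimal sketch of that setup (the index path here
>> is made up): on a 64-bit JVM, FSDirectory.open already picks
>> MMapDirectory where it is supported, and you'd run the JVM with a small
>> heap (e.g. -Xmx4g) so the OS page cache gets the rest of the RAM:
>>
>>   import java.nio.file.Paths;
>>   import org.apache.lucene.store.Directory;
>>   import org.apache.lucene.store.FSDirectory;
>>
>>   // On a 64-bit JVM this returns MMapDirectory where supported.
>>   Directory dir = FSDirectory.open(Paths.get("/data/solr/index"));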
>>
>> But, I believe you (system locks up with MMapDirectory for your
>> use-case), so there is a bug somewhere!  And I wish we could get to the
>> bottom of that, and fix it.
>>
>> Also, this (system locks up when using MMapDirectory) sounds different
>> from the "Lucene fsyncs files that it doesn't need to" bug, right?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Mon, Mar 15, 2021 at 4:28 PM Rahul Goswami <[email protected]>
>> wrote:
>>
>>> Uwe,
>>> I understand that mmap would only map *a part* of the index from virtual
>>> address space to physical memory as and when the pages are requested.
>>> However, the limitation on our side is that in most cases we cannot ask for
>>> more than 128 GB RAM (and unfortunately even that would be a stretch) for
>>> the Solr machine.
>>>
>>> I have read and re-read the article you referenced in the past :) It's
>>> brilliantly written and, I must say, did help clarify quite a few things
>>> for me. However, at the end of the day, there is only so much the OS (at
>>> least Windows) can do before it starts swapping pages of a 2-3 TB index
>>> in and out of 64 GB of physical RAM, isn't that right? The CPU usage
>>> spikes to 100% at such times and the machine becomes totally unresponsive.
>>> Switching to SimpleFSDirectory does rid us of this issue. I understand
>>> that we are losing an order of magnitude in performance compared to
>>> mmap, but I don't know of an alternative solution. Also, since most of our
>>> use cases are more write-heavy than read-heavy, we can afford to
>>> compromise on search performance with SimpleFS.
>>>
>>> Still, please let me know if there is anything about my explanation that
>>> doesn't sound right to you.
>>>
>>> Thanks,
>>> Rahul
>>>
>>> On Mon, Mar 15, 2021 at 3:54 PM Uwe Schindler <[email protected]> wrote:
>>>
>>>> This is not true. Memory mapping does not need to load the index into
>>>> RAM, so you don't need that much physical memory. Paging happens only
>>>> between the index files and RAM; that's what memory mapping is about.
>>>>
>>>> Please read the blog post:
>>>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>>>>
>>>> Uwe
>>>>
>>>> On March 15, 2021 7:43:29 PM UTC, Rahul Goswami <
>>>> [email protected]> wrote:
>>>>>
>>>>> Mike,
>>>>> Yes, I am using a 64-bit JVM on Windows. I haven't tried reproducing
>>>>> the issue on Linux yet. In the past we have had problems with mmap on
>>>>> Windows, with the machine freezing. The rationale I gave myself is that
>>>>> the amount of disk and CPU activity for paging in and out must be
>>>>> intense for the OS while trying to map an index that large into 64 GB
>>>>> of physical RAM. Also, since it's an on-premise deployment, we can't
>>>>> expect the customers of the product to provide nodes with > 400 GB RAM,
>>>>> which is what *I think* would be required to get decent performance
>>>>> with mmap. Hence we had to switch to SimpleFSDirectory.
>>>>>
>>>>> As for the fsync behavior, you are right. I tried with
>>>>> NRTCachingDirectoryFactory as well, which defaults to using mmap
>>>>> underneath, and it still makes fsync calls on already-existing index
>>>>> files.
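>>>>>
>>>>> For reference, roughly how that factory wires things up (a sketch, not
>>>>> Solr's exact code; the cache sizes here are just illustrative):
>>>>>
>>>>>   import org.apache.lucene.store.*;
>>>>>
>>>>>   Directory fsDir = FSDirectory.open(indexPath);              // MMapDirectory on 64-bit JVMs
>>>>>   Directory dir = new NRTCachingDirectory(fsDir, 5.0, 60.0);  // cache small new segments in RAM
>>>>>   // sync() on NRTCachingDirectory delegates to fsDir, so the same
>>>>>   // IOUtils.fsync path runs for already-existing files too.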
>>>>>
>>>>> Thanks,
>>>>> Rahul
>>>>>
>>>>> On Mon, Mar 15, 2021 at 3:15 PM Michael McCandless <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Thanks Rahul.
>>>>>>
>>>>>> > primary reason being that memory mapping multi-terabyte indexes is
>>>>>> not feasible through mmap
>>>>>>
>>>>>> Hmm, that is interesting -- are you using a 64 bit JVM?  If so, what
>>>>>> goes wrong with such large maps?  Lucene's MMapDirectory should chunk the
>>>>>> mapping to deal with ByteBuffer's int-only address space.
>>>>>>
>>>>>> SimpleFSDirectory usually has substantially worse performance than
>>>>>> MMapDirectory.
>>>>>>
>>>>>> Still, I suspect you would hit the same issue if you used other
>>>>>> FSDirectory implementations -- the fsync behavior should be the same.
>>>>>>
>>>>>> Mike McCandless
>>>>>>
>>>>>> http://blog.mikemccandless.com
>>>>>>
>>>>>>
>>>>>> On Fri, Mar 12, 2021 at 1:46 PM Rahul Goswami <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks Michael. For your question... yes, I am running Solr on
>>>>>>> Windows with SimpleFSDirectoryFactory (the primary reason being that
>>>>>>> memory-mapping multi-terabyte indexes is not feasible through mmap).
>>>>>>> I will create a Jira later today with the details in this thread and
>>>>>>> assign it to myself. Will take a shot at the fix.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Rahul
>>>>>>>
>>>>>>> On Fri, Mar 12, 2021 at 10:00 AM Michael McCandless <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> I think long ago we used to track which files were actually dirty
>>>>>>>> (we had written bytes to them) and only fsync those ones.  But something
>>>>>>>> went wrong with that, and at some point we "simplified" this logic, I
>>>>>>>> think on the assumption that asking the OS to fsync a file that already
>>>>>>>> exists and has not changed would be harmless?  But somehow it is not, in
>>>>>>>> your case?  Are you on Windows?
>>>>>>>>
>>>>>>>> I tried to do a bit of digital archaeology to remember what
>>>>>>>> happened here, and I came across this relevant-looking issue:
>>>>>>>> https://issues.apache.org/jira/browse/LUCENE-2328.  That issue
>>>>>>>> moved tracking of which files have been written but not yet fsync'd
>>>>>>>> down from IndexWriter into FSDirectory.
>>>>>>>>
>>>>>>>> But there was another change that then removed staleFiles from
>>>>>>>> FSDirectory entirely... still trying to find that.  Aha, found it!
>>>>>>>> https://issues.apache.org/jira/browse/LUCENE-6150.  Phew, Uwe was
>>>>>>>> really quite upset in that issue ;)
>>>>>>>>
>>>>>>>> I also came across this delightful related issue, showing how a
>>>>>>>> massive hurricane (Irene) can lead to finding and fixing a bug in 
>>>>>>>> Lucene!
>>>>>>>> https://issues.apache.org/jira/browse/LUCENE-3418
>>>>>>>>
>>>>>>>> > The assumption is that while the commit point is saved, no
>>>>>>>> changes happen to the segment files in the saved generation.
>>>>>>>>
>>>>>>>> This assumption should really be true.  Lucene writes the files,
>>>>>>>> append-only, once, and then never changes them once they are closed.
>>>>>>>> Pulling a commit point from Solr should further ensure that, even as
>>>>>>>> indexing continues and new segments are written, the old segments
>>>>>>>> referenced in that commit point will not be deleted.  But apparently
>>>>>>>> this "harmless fsync" Lucene is doing is not so harmless in your use
>>>>>>>> case.  Maybe open an issue and pull the details out of this discussion
>>>>>>>> onto it?
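>>>>>>>>
>>>>>>>> For the record, a minimal sketch of pinning a commit point on the
>>>>>>>> Lucene side (Solr reserves commit points via its own deletion policy
>>>>>>>> to the same effect; this assumes the writer has committed at least
>>>>>>>> once, and "analyzer"/"dir" stand in for your own setup):
>>>>>>>>
>>>>>>>>   import org.apache.lucene.index.*;
>>>>>>>>
>>>>>>>>   SnapshotDeletionPolicy sdp =
>>>>>>>>       new SnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy());
>>>>>>>>   IndexWriterConfig iwc =
>>>>>>>>       new IndexWriterConfig(analyzer).setIndexDeletionPolicy(sdp);
>>>>>>>>   try (IndexWriter writer = new IndexWriter(dir, iwc)) {
>>>>>>>>     IndexCommit commit = sdp.snapshot(); // files in this commit won't be deleted
>>>>>>>>     try {
>>>>>>>>       for (String name : commit.getFileNames()) {
>>>>>>>>         // copy each file out for the backup; indexing can continue meanwhile
>>>>>>>>       }
>>>>>>>>     } finally {
>>>>>>>>       sdp.release(commit); // allow the deletion policy to reclaim it
>>>>>>>>     }
>>>>>>>>   }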
>>>>>>>>
>>>>>>>> Mike McCandless
>>>>>>>>
>>>>>>>> http://blog.mikemccandless.com
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Mar 12, 2021 at 9:03 AM Michael Sokolov <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Also - I should have said - I think the first step here is to write
>>>>>>>>> a focused unit test that demonstrates the existence of the extra
>>>>>>>>> fsyncs that we want to eliminate. It would be awesome if you were
>>>>>>>>> able to create such a thing.
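>>>>>>>>>
>>>>>>>>> One possible shape for such a test (a sketch, untested): wrap the
>>>>>>>>> directory so it records which files get sync'ed, commit twice, and
>>>>>>>>> assert that the second commit doesn't re-sync the first commit's
>>>>>>>>> files:
>>>>>>>>>
>>>>>>>>>   import java.io.IOException;
>>>>>>>>>   import java.util.*;
>>>>>>>>>   import org.apache.lucene.store.Directory;
>>>>>>>>>   import org.apache.lucene.store.FilterDirectory;
>>>>>>>>>
>>>>>>>>>   class SyncTrackingDirectory extends FilterDirectory {
>>>>>>>>>     final Set<String> synced = new HashSet<>();
>>>>>>>>>     SyncTrackingDirectory(Directory in) { super(in); }
>>>>>>>>>     @Override
>>>>>>>>>     public void sync(Collection<String> names) throws IOException {
>>>>>>>>>       synced.addAll(names); // record every file Lucene asks to fsync
>>>>>>>>>       super.sync(names);
>>>>>>>>>     }
>>>>>>>>>   }
>>>>>>>>>
>>>>>>>>>   // Index a doc and commit; copy the synced set aside and clear it.
>>>>>>>>>   // Index another doc and commit again; assert the new synced set
>>>>>>>>>   // contains no file from the first set (nothing re-fsync'ed).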
>>>>>>>>>
>>>>>>>>> On Fri, Mar 12, 2021 at 9:00 AM Michael Sokolov <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>> >
>>>>>>>>> > Yes, please go ahead and open an issue. TBH I'm not sure why this
>>>>>>>>> > is happening - there may be a good reason?? But let's explore it
>>>>>>>>> > using an issue, thanks.
>>>>>>>>> >
>>>>>>>>> > On Fri, Mar 12, 2021 at 12:16 AM Rahul Goswami <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>> > >
>>>>>>>>> > > I can create a Jira and assign it to myself, if that's OK (?).
>>>>>>>>> > > I think this can help improve commit performance.
>>>>>>>>> > > Also, to answer your question: our indexes sometimes run to
>>>>>>>>> > > multiple terabytes. Using the replication handler for backup
>>>>>>>>> > > would mean requiring disk capacity of more than 2x the index
>>>>>>>>> > > size on the machine at all times, which might not be feasible.
>>>>>>>>> > > So we back the index up directly from the Solr node to a remote
>>>>>>>>> > > repository.
>>>>>>>>> > >
>>>>>>>>> > > Thanks,
>>>>>>>>> > > Rahul
>>>>>>>>> > >
>>>>>>>>> > > On Thu, Mar 11, 2021 at 4:09 PM Michael Sokolov <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>> > >>
>>>>>>>>> > >> Well, it certainly doesn't seem necessary to fsync files that
>>>>>>>>> > >> are unchanged and have already been fsync'ed. Maybe there's an
>>>>>>>>> > >> opportunity to improve it? On the other hand, support for
>>>>>>>>> > >> external processes reading Lucene index files isn't likely to
>>>>>>>>> > >> become a feature of Lucene. You might want to consider using
>>>>>>>>> > >> Solr replication to power your backup?
>>>>>>>>> > >>
>>>>>>>>> > >> On Thu, Mar 11, 2021 at 2:52 PM Rahul Goswami <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>> > >> >
>>>>>>>>> > >> > Thanks Michael. Since this discussion is closer to the code
>>>>>>>>> > >> > than most discussions on the solr-users list, it seemed like
>>>>>>>>> > >> > the more appropriate forum. Will be mindful going forward.
>>>>>>>>> > >> > On your point about new segments: I attached a debugger and
>>>>>>>>> > >> > did a new commit (just a pure Solr commit, no backup process
>>>>>>>>> > >> > running), and the code indeed does fsync on a pre-existing
>>>>>>>>> > >> > segment file. I was a bit baffled, since it challenged my
>>>>>>>>> > >> > fundamental understanding that segment files, once written,
>>>>>>>>> > >> > are immutable no matter what (unless picked up for a merge,
>>>>>>>>> > >> > of course). Hence I thought of reaching out, in case this
>>>>>>>>> > >> > happens in scenarios I am unaware of.
>>>>>>>>> > >> >
>>>>>>>>> > >> > Thanks,
>>>>>>>>> > >> > Rahul
>>>>>>>>> > >> >
>>>>>>>>> > >> > On Thu, Mar 11, 2021 at 2:38 PM Michael Sokolov <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>> > >> >>
>>>>>>>>> > >> >> This isn't a support forum; solr-users@ might be more
>>>>>>>>> > >> >> appropriate. On that list someone might have a better idea
>>>>>>>>> > >> >> about how the replication handler gets its list of files.
>>>>>>>>> > >> >> This would be a good list to try if you wanted to propose a
>>>>>>>>> > >> >> fix for the problem you're having. But since you're here --
>>>>>>>>> > >> >> it looks to me as if IndexWriter indeed syncs all "new"
>>>>>>>>> > >> >> files in the current segments being committed; look in
>>>>>>>>> > >> >> IndexWriter.startCommit and SegmentInfos.files. Caveats:
>>>>>>>>> > >> >> (1) I'm looking at this code for the first time, and (2)
>>>>>>>>> > >> >> things may have been different in 7.7.2? Sorry, I don't
>>>>>>>>> > >> >> know for sure, but are you sure your backup process is not
>>>>>>>>> > >> >> attempting to copy one of the new files?
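>>>>>>>>> > >> >>
>>>>>>>>> > >> >> A quick way to see that list for yourself (a sketch; reads
>>>>>>>>> > >> >> the latest commit from an open Directory):
>>>>>>>>> > >> >>
>>>>>>>>> > >> >>   import org.apache.lucene.index.SegmentInfos;
>>>>>>>>> > >> >>
>>>>>>>>> > >> >>   SegmentInfos infos = SegmentInfos.readLatestCommit(dir);
>>>>>>>>> > >> >>   // true = also include the segments_N file itself
>>>>>>>>> > >> >>   for (String name : infos.files(true)) {
>>>>>>>>> > >> >>     System.out.println(name);
>>>>>>>>> > >> >>   }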
>>>>>>>>> > >> >>
>>>>>>>>> > >> >> On Thu, Mar 11, 2021 at 1:35 PM Rahul Goswami <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>> > >> >> >
>>>>>>>>> > >> >> > Hello,
>>>>>>>>> > >> >> > Just wanted to follow up one more time to see if this is
>>>>>>>>> > >> >> > the right forum for my question, or is it better suited
>>>>>>>>> > >> >> > to some other mailing list?
>>>>>>>>> > >> >> >
>>>>>>>>> > >> >> > Best,
>>>>>>>>> > >> >> > Rahul
>>>>>>>>> > >> >> >
>>>>>>>>> > >> >> > On Sat, Mar 6, 2021 at 3:57 PM Rahul Goswami <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>> > >> >> >>
>>>>>>>>> > >> >> >> Hello everyone,
>>>>>>>>> > >> >> >> Following up on my question in case anyone has any
>>>>>>>>> > >> >> >> ideas. This is important to know because I am thinking
>>>>>>>>> > >> >> >> of letting the backup process not hold any lock on the
>>>>>>>>> > >> >> >> index files, which should allow the fsync during
>>>>>>>>> > >> >> >> parallel commits. BUT, if doing an fsync on existing
>>>>>>>>> > >> >> >> segment files in a saved commit point DOES have an
>>>>>>>>> > >> >> >> effect, it might leave the backed-up index in a corrupt
>>>>>>>>> > >> >> >> state.
>>>>>>>>> > >> >> >>
>>>>>>>>> > >> >> >> Thanks,
>>>>>>>>> > >> >> >> Rahul
>>>>>>>>> > >> >> >>
>>>>>>>>> > >> >> >> On Fri, Mar 5, 2021 at 3:04 PM Rahul Goswami <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>> > >> >> >>>
>>>>>>>>> > >> >> >>> Hello,
>>>>>>>>> > >> >> >>> We have a process which backs up the index (Solr 7.7.2)
>>>>>>>>> > >> >> >>> on a schedule. We first save a commit point on the
>>>>>>>>> > >> >> >>> index and then, using Solr's /replication handler, get
>>>>>>>>> > >> >> >>> the list of files in that generation. After the backup
>>>>>>>>> > >> >> >>> completes, we release the commit point. (Please note
>>>>>>>>> > >> >> >>> that this is a separate backup process outside of Solr,
>>>>>>>>> > >> >> >>> not the backup command of the /replication handler.)
>>>>>>>>> > >> >> >>> The assumption is that while the commit point is saved,
>>>>>>>>> > >> >> >>> no changes happen to the segment files in the saved
>>>>>>>>> > >> >> >>> generation.
>>>>>>>>> > >> >> >>>
>>>>>>>>> > >> >> >>> Now the issue... The backup process opens the index
>>>>>>>>> > >> >> >>> files in shared READ mode, preventing writes. This
>>>>>>>>> > >> >> >>> causes any parallel commits to fail, complaining that
>>>>>>>>> > >> >> >>> the index files are locked by another process (the
>>>>>>>>> > >> >> >>> backup process). Upon debugging, I see that fsync is
>>>>>>>>> > >> >> >>> being called during commit on already-existing segment
>>>>>>>>> > >> >> >>> files, which is not expected. So my question is: is
>>>>>>>>> > >> >> >>> there any reason for Lucene to call fsync on
>>>>>>>>> > >> >> >>> already-existing segment files?
>>>>>>>>> > >> >> >>>
>>>>>>>>> > >> >> >>> The line of code I am referring to is as below:
>>>>>>>>> > >> >> >>>
>>>>>>>>> > >> >> >>> try (final FileChannel file = FileChannel.open(fileToSync,
>>>>>>>>> > >> >> >>>     isDir ? StandardOpenOption.READ : StandardOpenOption.WRITE))
>>>>>>>>> > >> >> >>>
>>>>>>>>> > >> >> >>> in the method fsync(Path fileToSync, boolean isDir) of
>>>>>>>>> > >> >> >>> lucene\core\src\java\org\apache\lucene\util\IOUtils.java
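>>>>>>>>> > >> >> >>>
>>>>>>>>> > >> >> >>> For context, roughly the surrounding method (a
>>>>>>>>> > >> >> >>> paraphrase from memory, not a verbatim copy of IOUtils):
>>>>>>>>> > >> >> >>>
>>>>>>>>> > >> >> >>> public static void fsync(Path fileToSync, boolean isDir) throws IOException {
>>>>>>>>> > >> >> >>>   // Directories cannot be opened for WRITE, so READ is used for them;
>>>>>>>>> > >> >> >>>   // regular files are opened for WRITE so force() can flush them.
>>>>>>>>> > >> >> >>>   // On Windows, this WRITE open is exactly what collides with a
>>>>>>>>> > >> >> >>>   // backup process holding the same file open in shared READ mode.
>>>>>>>>> > >> >> >>>   try (final FileChannel file = FileChannel.open(fileToSync,
>>>>>>>>> > >> >> >>>       isDir ? StandardOpenOption.READ : StandardOpenOption.WRITE)) {
>>>>>>>>> > >> >> >>>     file.force(true); // flush contents and metadata to stable storage
>>>>>>>>> > >> >> >>>   }
>>>>>>>>> > >> >> >>> }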
>>>>>>>>> > >> >> >>>
>>>>>>>>> > >> >> >>> Thanks,
>>>>>>>>> > >> >> >>> Rahul
>>>>>>>>>
>>>> --
>>>> Uwe Schindler
>>>> Achterdiek 19, 28357 Bremen
>>>> https://www.thetaphi.de
>>>>
>>>
