and if you delete those segments, will that data ever actually be removed from the underlying physical storage? equally uncertain.
deleting a file from the filesystem is similar to what Lucene is doing: it doesn't really delete anything from the disk, it just allows the space to be overwritten by future writes.

so I don't think we should provide any "GDPRMergePolicy" to satisfy an extreme (and short-sighted) legal interpretation. it wouldn't solve the problem anyway.

On Tue, Nov 28, 2023 at 3:27 PM Ilan Ginzburg <ilans...@gmail.com> wrote:
>
> Are larger and older segments even certain to ever be merged in practice? I
> was assuming that if there is not a lot of new indexed content and not a lot
> of older documents being deleted, large older segments might never have to be
> merged.
>
>
> On Tue 28 Nov 2023 at 20:53, Robert Muir <rcm...@gmail.com> wrote:
>>
>> I don't think there's any problem with GDPR, and I don't think users
>> should be running unnecessary "optimize". GDPR just says data should
>> be erased without "undue" delay. waiting for a merge to nuke the
>> deleted docs isn't "undue", there is a good reason for it.
>>
>> On Tue, Nov 28, 2023 at 2:40 PM Patrick Zhai <zhai7...@gmail.com> wrote:
>> >
>> > Hi Folks,
>> > In LinkedIn we need to comply with GDPR for a large part of our data, and
>> > an important part of it is that we need to be sure we have completely
>> > deleted the data the user requested to delete within a certain period of
>> > time.
>> > The approach we have come up with so far is to:
>> > 1. Record the segment creation time somewhere (not decided yet, maybe
>> > index commit userinfo, maybe some other place outside of Lucene)
>> > 2. Create a new merge policy which delegates most operations to a normal
>> > MP, like TieredMergePolicy, and then adds extra single-segment (merge from
>> > 1 segment to 1 segment, basically only do deletion) merges if it finds any
>> > segment is about to violate the GDPR time frame.
>> >
>> > So here are my questions:
>> > 1. Is there a better/existing way to do this?
>> > 2. I would like to contribute such a merge policy directly to Lucene,
>> > since I think GDPR is more or less a common thing. Would like to know
>> > whether people feel like it's necessary or not?
>> > 3. It would also be nice if we could store the segment creation time in the
>> > index directly via IndexWriter (maybe write to SegmentInfo?). I can try to do
>> > that but would like to ask whether there are any objections?
>> >
>> > Best
>> > Patrick
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
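
For readers following the thread, the selection step of the merge policy proposed in the original message (extra 1-segment-to-1-segment merges for segments about to exceed the time frame) could be sketched roughly as below. This is only a model of the decision logic and deliberately does not use the Lucene API: the class DeadlineMergeSketch, the Segment holder, selectForcedMerges, and the deadline/margin parameters are all hypothetical names invented for illustration. It also assumes the segment creation time is available from somewhere, which the thread leaves undecided.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Lucene code: models only the "which segments need
// a single-segment merge" decision described in the proposal.
class DeadlineMergeSketch {

    // Hypothetical stand-in for the per-segment data such a policy would need.
    static final class Segment {
        final String name;
        final Instant createdAt;   // assumed to be recorded somewhere at segment creation
        final int deletedDocs;     // number of docs marked deleted but not yet merged away

        Segment(String name, Instant createdAt, int deletedDocs) {
            this.name = name;
            this.createdAt = createdAt;
            this.deletedDocs = deletedDocs;
        }
    }

    // Returns segments that still carry deleted docs and whose age has reached
    // (deadline - margin). A wrapping policy would schedule a 1 -> 1 merge for
    // each of these and leave everything else to the delegate policy.
    static List<Segment> selectForcedMerges(
            List<Segment> segments, Instant now, Duration deadline, Duration margin) {
        Duration threshold = deadline.minus(margin);
        List<Segment> forced = new ArrayList<>();
        for (Segment s : segments) {
            Duration age = Duration.between(s.createdAt, now);
            if (s.deletedDocs > 0 && age.compareTo(threshold) >= 0) {
                forced.add(s);
            }
        }
        return forced;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2023-11-28T00:00:00Z");
        List<Segment> segments = List.of(
                new Segment("_a", now.minus(Duration.ofDays(29)), 5),  // old, has deletes
                new Segment("_b", now.minus(Duration.ofDays(29)), 0),  // old, nothing to purge
                new Segment("_c", now.minus(Duration.ofDays(8)), 3));  // young, can wait
        for (Segment s : selectForcedMerges(
                segments, now, Duration.ofDays(30), Duration.ofDays(2))) {
            System.out.println(s.name);  // prints "_a"
        }
    }
}
```

Using segment creation time as the clock is conservative, since a delete can only arrive after the segment exists. As pointed out above, rewriting a segment this way still says nothing about when the old bytes leave the physical storage.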