and if you delete those segments, will that data ever be actually
removed from the underlying physical storage? equally uncertain.

deleting a file from the filesystem is similar to what lucene is
doing: it doesn't really erase anything from the disk, it just frees
the space so it can be overwritten by future writes.

so I don't think we should provide any "GDPRMergePolicy" to satisfy an
extreme (and short-sighted) legal interpretation. it wouldn't solve
the problem anyway.

On Tue, Nov 28, 2023 at 3:27 PM Ilan Ginzburg <ilans...@gmail.com> wrote:
>
> Are larger and older segments even certain to ever be merged in practice? I 
> was assuming that if there is not a lot of new indexed content and not a lot 
> of older documents being deleted, large older segments might never have to be 
> merged.
>
>
> On Tue 28 Nov 2023 at 20:53, Robert Muir <rcm...@gmail.com> wrote:
>>
>> I don't think there's any problem with GDPR, and I don't think users
>> should be running unnecessary "optimize". GDPR just says data should
>> be erased without "undue" delay. waiting for a merge to nuke the
>> deleted docs isn't "undue", there is a good reason for it.
>>
>> On Tue, Nov 28, 2023 at 2:40 PM Patrick Zhai <zhai7...@gmail.com> wrote:
>> >
>> > Hi Folks,
>> > In LinkedIn we need to comply with GDPR for a large part of our data, and 
>> > an important part of it is that we need to be sure we have completely 
>> > deleted the data the user requested to delete within a certain period of 
>> > time.
>> > The way we have come up with so far is to:
>> > 1. Record the segment creation time somewhere (not decided yet, maybe 
>> > index commit userinfo, maybe some other place outside of lucene)
>> > 2. Create a new merge policy which delegates most operations to a normal 
>> > MP, like TieredMergePolicy, and then add extra single-segment (merge from 
>> > 1 segment to 1 segment, basically only do deletion) merges if it finds any 
>> > segment is about to violate the GDPR time frame.
>> >
>> > So here's my question:
>> > 1. Is there a better/existing way to do this?
>> > 2. I would like to contribute such a merge policy directly to Lucene, 
>> > since I think GDPR is more or less a common thing. Would like to know 
>> > whether people feel like it's necessary or not?
>> > 3. It would also be nice if IndexWriter could store the segment creation 
>> > time in the index directly (maybe write to SegmentInfo?). I can try to do 
>> > that but would like to ask whether there are any objections?
>> >
>> > Best
>> > Patrick
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
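
For readers following the thread, the selection step in point 2 of
Patrick's quoted proposal can be sketched without any Lucene
dependency. This is only an illustration under stated assumptions: the
class and field names are hypothetical, none of it is Lucene API, and it
assumes a per-segment creation time is already recorded somewhere (the
open question in point 1). Since a delete cannot be older than the
segment that holds it, segment age is used here as a conservative upper
bound on how long a deleted document has been waiting.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Lucene code: picks segments that would need a
// single-segment (1 -> 1) rewrite to purge deleted docs before a deadline.
final class StaleSegmentSelector {

    // Hypothetical stand-in for per-segment metadata.
    static final class Segment {
        final String name;
        final Instant createdAt; // assumed to be recorded at segment creation
        final int deletedDocs;   // docs marked deleted but still on disk

        Segment(String name, Instant createdAt, int deletedDocs) {
            this.name = name;
            this.createdAt = createdAt;
            this.deletedDocs = deletedDocs;
        }
    }

    // Returns names of segments that have deletions and are at least maxAge old.
    static List<String> select(List<Segment> segments, Instant now, Duration maxAge) {
        List<String> result = new ArrayList<>();
        for (Segment s : segments) {
            boolean hasDeletes = s.deletedDocs > 0;
            boolean tooOld = Duration.between(s.createdAt, now).compareTo(maxAge) >= 0;
            if (hasDeletes && tooOld) {
                result.add(s.name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2023-11-28T00:00:00Z");
        List<Segment> segments = List.of(
            new Segment("_0", now.minus(Duration.ofDays(40)), 12), // old, has deletes
            new Segment("_1", now.minus(Duration.ofDays(40)), 0),  // old, nothing to purge
            new Segment("_2", now.minus(Duration.ofDays(3)), 5));  // recent, can wait
        // The 25-day window is an arbitrary illustrative value.
        System.out.println(select(segments, now, Duration.ofDays(25))); // prints [_0]
    }
}
```

In the proposal, the segments returned here would be the ones given an
extra single-segment merge, with everything else left to the delegate
policy. As the replies above point out, rewriting a segment still does
not guarantee the old bytes are erased from physical storage.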

