Re: GDPR compliance

2023-11-28 Thread Patrick Zhai
It's not that insane: it's about several weeks. However, a big segment can
stay around for quite a long time if there aren't enough updates for a merge
policy to pick it up.

On Tue, Nov 28, 2023, 17:14 Dongyu Xu  wrote:

> What is the expected grace period for a data-deletion request to take
> effect?
>
> I'm not an expert on the policy, but I think something like "I need my data
> to be gone in the next 2 seconds" is unreasonable.
>
> Tony X
>
> --
> *From:* Robert Muir 
> *Sent:* Tuesday, November 28, 2023 11:52 AM
> *To:* dev@lucene.apache.org 
> *Subject:* Re: GDPR compliance
>
> I don't think there's any problem with GDPR, and I don't think users
> should be running unnecessary "optimize". GDPR just says data should
> be erased without "undue" delay. Waiting for a merge to nuke the
> deleted docs isn't "undue", there is a good reason for it.
>
> On Tue, Nov 28, 2023 at 2:40 PM Patrick Zhai  wrote:
> >
> > Hi Folks,
> > In LinkedIn we need to comply with GDPR for a large part of our data,
> and an important part of it is that we need to be sure we have completely
> deleted the data the user requested to delete within a certain period of
> time.
> > The way we have come up with so far is to:
> > 1. Record the segment creation time somewhere (not decided yet, maybe
> index commit userinfo, maybe some other place outside of lucene)
> > 2. Create a new merge policy which delegate most operations to a normal
> MP, like TieredMergePolicy, and then add extra single-segment (merge from 1
> segment to 1 segment, basically only do deletion) merges if it finds any
> segment is about to violate the GDPR time frame.
> >
> > So here's my question:
> > 1. Is there a better/existing way to do this?
> > 2. I would like to directly contribute to Lucene about such a merge
> policy since I think GDPR is more or less a common thing. Would like to
> know whether people feel like it's necessary or not?
> > 3. It's also nice if we can store the segment creation time to the index
> directly by IndexWriter (maybe write to SegmentInfo?), I can try to do that
> but would like to ask whether there's any objections?
> >
> > Best
> > Patrick
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Re: [JENKINS] Lucene » Lucene-NightlyTests-main - Build # 1199 - Unstable!

2023-11-28 Thread Michael McCandless
OK I pushed a fix.

Mike

On Tue, Nov 28, 2023 at 7:32 PM Michael McCandless <
luc...@mikemccandless.com> wrote:

> I think maybe LuceneTestCase.newSearcher is turning on concurrency
> (setting the executor randomly).  Since this test explicitly passes a "no
> concurrency" collector manager I think we should switch to "new
> IndexSearcher(...)".
>
> Mike
>
> On Tue, Nov 28, 2023 at 7:29 PM Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> This reproduces for me.
>>
>> Maybe related to LUCENE-10002 / #240?
>>
>> Mike
>>
>> On Tue, Nov 28, 2023 at 1:58 AM Apache Jenkins Server <
>> jenk...@builds.apache.org> wrote:
>>
>>> Build:
>>> https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-main/1199/
>>>
>>> 1 tests failed.
>>> FAILED:  org.apache.lucene.search.TestTopFieldCollector.testSort
>>>
>>> Error Message:
>>> java.lang.IllegalStateException: This TopFieldCollectorManager was
>>> created without concurrency (supportsConcurrency=false), but multiple
>>> collectors are being created
>>>
>>> Stack Trace:
>>> java.lang.IllegalStateException: This TopFieldCollectorManager was
>>> created without concurrency (supportsConcurrency=false), but multiple
>>> collectors are being created
>>> at
>>> __randomizedtesting.SeedInfo.seed([4B0B913D92123C6D:1AEEB914595F267D]:0)
>>> at
>>> org.apache.lucene.search.TopFieldCollectorManager.newCollector(TopFieldCollectorManager.java:142)
>>> at
>>> org.apache.lucene.search.TopFieldCollectorManager.newCollector(TopFieldCollectorManager.java:31)
>>> at
>>> org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:623)
>>> at
>>> org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:607)
>>> at
>>> org.apache.lucene.search.TestTopFieldCollector.testSort(TestTopFieldCollector.java:124)
>>> at
>>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
>>> Method)
>>> at
>>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
>>> at
>>> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> at java.base/java.lang.reflect.Method.invoke(Method.java:568)
>>> at
>>> com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
>>> at
>>> com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
>>> at
>>> com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
>>> at
>>> com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
>>> at
>>> org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
>>> at
>>> org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
>>> at
>>> org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
>>> at
>>> org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
>>> at
>>> org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
>>> at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>>> at
>>> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
>>> at
>>> com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
>>> at
>>> com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
>>> at
>>> com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
>>> at
>>> com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
>>> at
>>> com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
>>> at
>>> com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
>>> at
>>> com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
>>> at
>>> org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
>>> at
>>> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
>>> at
>>> org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
>>> at
>>> com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
>>> at
>>> com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
>>> at
>>> 

Re: [JENKINS] Lucene » Lucene-NightlyTests-main - Build # 1199 - Unstable!

2023-11-28 Thread Michael McCandless
I think maybe LuceneTestCase.newSearcher is turning on concurrency (setting
the executor randomly).  Since this test explicitly passes a "no
concurrency" collector manager I think we should switch to "new
IndexSearcher(...)".

Mike
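
To illustrate the failure mode described above, here is a simplified,
self-contained model in plain Java. ModelCollectorManager and ModelSearcher
are hypothetical stand-ins, not Lucene classes. The assumption is that a
searcher with an executor asks the manager for one collector per slice,
while a searcher without one asks for a single collector.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of a collector manager with a supportsConcurrency flag (not Lucene's class). */
final class ModelCollectorManager {
  private final boolean supportsConcurrency;
  private int created = 0;

  ModelCollectorManager(boolean supportsConcurrency) {
    this.supportsConcurrency = supportsConcurrency;
  }

  Object newCollector() {
    // A non-concurrent manager refuses to hand out a second collector.
    if (!supportsConcurrency && created > 0) {
      throw new IllegalStateException(
          "created without concurrency (supportsConcurrency=false), "
              + "but multiple collectors are being created");
    }
    created++;
    return new Object();
  }
}

/** Simplified model of a searcher: it requests one collector per slice. */
final class ModelSearcher {
  private final int slices; // 1 models "no executor"; more than 1 models an executor

  ModelSearcher(int slices) {
    this.slices = slices;
  }

  int search(ModelCollectorManager manager) {
    List<Object> collectors = new ArrayList<>();
    for (int i = 0; i < slices; i++) {
      collectors.add(manager.newCollector());
    }
    return collectors.size();
  }
}

final class CollectorModelDemo {
  public static void main(String[] args) {
    // Single slice: the non-concurrent manager is used exactly once.
    System.out.println(new ModelSearcher(1).search(new ModelCollectorManager(false)));
    try {
      // Several slices: the second newCollector() call is rejected.
      new ModelSearcher(3).search(new ModelCollectorManager(false));
    } catch (IllegalStateException e) {
      System.out.println("failed: " + e.getMessage());
    }
  }
}
```

In this model, the test-side remedy is to build the searcher so that it only
ever uses one slice, which is what switching from a randomized searcher to a
plain "new IndexSearcher(...)" is meant to achieve.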

On Tue, Nov 28, 2023 at 7:29 PM Michael McCandless <
luc...@mikemccandless.com> wrote:

> This reproduces for me.
>
> Maybe related to LUCENE-10002 / #240?
>
> Mike
>
> On Tue, Nov 28, 2023 at 1:58 AM Apache Jenkins Server <
> jenk...@builds.apache.org> wrote:
>
>> Build:
>> https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-main/1199/
>>
>> 1 tests failed.
>> FAILED:  org.apache.lucene.search.TestTopFieldCollector.testSort
>>
>> Error Message:
>> java.lang.IllegalStateException: This TopFieldCollectorManager was
>> created without concurrency (supportsConcurrency=false), but multiple
>> collectors are being created
>>
>> Stack Trace:
>> java.lang.IllegalStateException: This TopFieldCollectorManager was
>> created without concurrency (supportsConcurrency=false), but multiple
>> collectors are being created
>> at
>> __randomizedtesting.SeedInfo.seed([4B0B913D92123C6D:1AEEB914595F267D]:0)
>> at
>> org.apache.lucene.search.TopFieldCollectorManager.newCollector(TopFieldCollectorManager.java:142)
>> at
>> org.apache.lucene.search.TopFieldCollectorManager.newCollector(TopFieldCollectorManager.java:31)
>> at
>> org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:623)
>> at
>> org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:607)
>> at
>> org.apache.lucene.search.TestTopFieldCollector.testSort(TestTopFieldCollector.java:124)
>> at
>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
>> Method)
>> at
>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
>> at
>> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.base/java.lang.reflect.Method.invoke(Method.java:568)
>> at
>> com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
>> at
>> com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
>> at
>> com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
>> at
>> com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
>> at
>> org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
>> at
>> org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
>> at
>> org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
>> at
>> org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
>> at
>> org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
>> at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>> at
>> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
>> at
>> com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
>> at
>> com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
>> at
>> com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
>> at
>> com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
>> at
>> com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
>> at
>> com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
>> at
>> com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
>> at
>> org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
>> at
>> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
>> at
>> org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
>> at
>> com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
>> at
>> com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
>> at
>> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
>> at
>> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
>> at
>> 

Re: [JENKINS] Lucene » Lucene-NightlyTests-main - Build # 1199 - Unstable!

2023-11-28 Thread Michael McCandless
This reproduces for me.

Maybe related to LUCENE-10002 / #240?

Mike

On Tue, Nov 28, 2023 at 1:58 AM Apache Jenkins Server <
jenk...@builds.apache.org> wrote:

> Build:
> https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-main/1199/
>
> 1 tests failed.
> FAILED:  org.apache.lucene.search.TestTopFieldCollector.testSort
>
> Error Message:
> java.lang.IllegalStateException: This TopFieldCollectorManager was created
> without concurrency (supportsConcurrency=false), but multiple collectors
> are being created
>
> Stack Trace:
> java.lang.IllegalStateException: This TopFieldCollectorManager was created
> without concurrency (supportsConcurrency=false), but multiple collectors
> are being created
> at
> __randomizedtesting.SeedInfo.seed([4B0B913D92123C6D:1AEEB914595F267D]:0)
> at
> org.apache.lucene.search.TopFieldCollectorManager.newCollector(TopFieldCollectorManager.java:142)
> at
> org.apache.lucene.search.TopFieldCollectorManager.newCollector(TopFieldCollectorManager.java:31)
> at
> org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:623)
> at
> org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:607)
> at
> org.apache.lucene.search.TestTopFieldCollector.testSort(TestTopFieldCollector.java:124)
> at
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
> Method)
> at
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
> at
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.base/java.lang.reflect.Method.invoke(Method.java:568)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
> at
> org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
> at
> org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> at
> org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
> at
> org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
> at
> org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
> at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> at
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at
> com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
> at
> com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
> at
> com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
> at
> org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> at
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at
> org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
> at
> com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
> at
> com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
> at
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at
> org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
> at
> org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> at
> org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
> at
> org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
> at
> 

Re: GDPR compliance

2023-11-28 Thread Patrick Zhai
Thanks Robert and Dawid,
I think what you said is reasonable, so I guess I can keep the MP private
(it's not hard to write anyway, so people facing a similar situation can
still figure it out easily).
In our case I think we have some other constraints that require us to
"clean" them every so often, so we still need to do that.

Anyway, thank you for the interpretation of GDPR. I'm actually not sure what
exactly it's trying to enforce, so it's a good lesson for me as well.

Patrick


On Tue, Nov 28, 2023 at 2:48 PM Robert Muir  wrote:

> and if you delete those segments, will that data ever be actually
> removed from the underlying physical storage? equally uncertain.
>
> deleting a file from the filesystem is similar to what lucene is
> doing, it doesn't really delete anything from the disk, just allows it
> to be overwritten by future writes.
>
> so I don't think we should provide any "GDPRMergePolicy" to satisfy an
> extreme (and short-sighted) legal interpretation. it wouldn't solve
> the problem anyway.
>
> On Tue, Nov 28, 2023 at 3:27 PM Ilan Ginzburg  wrote:
> >
> > Are larger and older segments even certain to ever be merged in
> practice? I was assuming that if there is not a lot of new indexed content
> and not a lot of older documents being deleted, large older segment might
> never have to be merged.
> >
> >
> > On Tue 28 Nov 2023 at 20:53, Robert Muir  wrote:
> >>
> >> I don't think there's any problem with GDPR, and I don't think users
> >> should be running unnecessary "optimize". GDPR just says data should
> >> be erased without "undue" delay. Waiting for a merge to nuke the
> >> deleted docs isn't "undue", there is a good reason for it.
> >>
> >> On Tue, Nov 28, 2023 at 2:40 PM Patrick Zhai 
> wrote:
> >> >
> >> > Hi Folks,
> >> > In LinkedIn we need to comply with GDPR for a large part of our data,
> and an important part of it is that we need to be sure we have completely
> deleted the data the user requested to delete within a certain period of
> time.
> >> > The way we have come up with so far is to:
> >> > 1. Record the segment creation time somewhere (not decided yet, maybe
> index commit userinfo, maybe some other place outside of lucene)
> >> > 2. Create a new merge policy which delegate most operations to a
> normal MP, like TieredMergePolicy, and then add extra single-segment (merge
> from 1 segment to 1 segment, basically only do deletion) merges if it finds
> any segment is about to violate the GDPR time frame.
> >> >
> >> > So here's my question:
> >> > 1. Is there a better/existing way to do this?
> >> > 2. I would like to directly contribute to Lucene about such a merge
> policy since I think GDPR is more or less a common thing. Would like to
> know whether people feel like it's necessary or not?
> >> > 3. It's also nice if we can store the segment creation time to the
> index directly by IndexWriter (maybe write to SegmentInfo?), I can try to
> do that but would like to ask whether there's any objections?
> >> >
> >> > Best
> >> > Patrick
> >>
> >> -
> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: dev-h...@lucene.apache.org
> >>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Re: GDPR compliance

2023-11-28 Thread Robert Muir
And if you delete those segments, will that data ever actually be
removed from the underlying physical storage? Equally uncertain.

Deleting a file from the filesystem is similar to what Lucene is
doing: it doesn't really delete anything from the disk, it just allows it
to be overwritten by future writes.

So I don't think we should provide any "GDPRMergePolicy" to satisfy an
extreme (and short-sighted) legal interpretation. It wouldn't solve
the problem anyway.

On Tue, Nov 28, 2023 at 3:27 PM Ilan Ginzburg  wrote:
>
> Are larger and older segments even certain to ever be merged in practice? I 
> was assuming that if there is not a lot of new indexed content and not a lot 
> of older documents being deleted, large older segment might never have to be 
> merged.
>
>
> On Tue 28 Nov 2023 at 20:53, Robert Muir  wrote:
>>
>> I don't think there's any problem with GDPR, and I don't think users
>> should be running unnecessary "optimize". GDPR just says data should
>> be erased without "undue" delay. Waiting for a merge to nuke the
>> deleted docs isn't "undue", there is a good reason for it.
>>
>> On Tue, Nov 28, 2023 at 2:40 PM Patrick Zhai  wrote:
>> >
>> > Hi Folks,
>> > In LinkedIn we need to comply with GDPR for a large part of our data, and 
>> > an important part of it is that we need to be sure we have completely 
>> > deleted the data the user requested to delete within a certain period of 
>> > time.
>> > The way we have come up with so far is to:
>> > 1. Record the segment creation time somewhere (not decided yet, maybe 
>> > index commit userinfo, maybe some other place outside of lucene)
>> > 2. Create a new merge policy which delegate most operations to a normal 
>> > MP, like TieredMergePolicy, and then add extra single-segment (merge from 
>> > 1 segment to 1 segment, basically only do deletion) merges if it finds any 
>> > segment is about to violate the GDPR time frame.
>> >
>> > So here's my question:
>> > 1. Is there a better/existing way to do this?
>> > 2. I would like to directly contribute to Lucene about such a merge policy 
>> > since I think GDPR is more or less a common thing. Would like to know 
>> > whether people feel like it's necessary or not?
>> > 3. It's also nice if we can store the segment creation time to the index 
>> > directly by IndexWriter (maybe write to SegmentInfo?), I can try to do 
>> > that but would like to ask whether there's any objections?
>> >
>> > Best
>> > Patrick
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>




Re: GDPR compliance

2023-11-28 Thread Dongyu Xu
What is the expected grace period for a data-deletion request to take effect?

I'm not an expert on the policy, but I think something like "I need my data to
be gone in the next 2 seconds" is unreasonable.

Tony X


From: Robert Muir 
Sent: Tuesday, November 28, 2023 11:52 AM
To: dev@lucene.apache.org 
Subject: Re: GDPR compliance

I don't think there's any problem with GDPR, and I don't think users
should be running unnecessary "optimize". GDPR just says data should
be erased without "undue" delay. Waiting for a merge to nuke the
deleted docs isn't "undue", there is a good reason for it.

On Tue, Nov 28, 2023 at 2:40 PM Patrick Zhai  wrote:
>
> Hi Folks,
> In LinkedIn we need to comply with GDPR for a large part of our data, and an 
> important part of it is that we need to be sure we have completely deleted 
> the data the user requested to delete within a certain period of time.
> The way we have come up with so far is to:
> 1. Record the segment creation time somewhere (not decided yet, maybe index 
> commit userinfo, maybe some other place outside of lucene)
> 2. Create a new merge policy which delegate most operations to a normal MP, 
> like TieredMergePolicy, and then add extra single-segment (merge from 1 
> segment to 1 segment, basically only do deletion) merges if it finds any 
> segment is about to violate the GDPR time frame.
>
> So here's my question:
> 1. Is there a better/existing way to do this?
> 2. I would like to directly contribute to Lucene about such a merge policy 
> since I think GDPR is more or less a common thing. Would like to know whether 
> people feel like it's necessary or not?
> 3. It's also nice if we can store the segment creation time to the index 
> directly by IndexWriter (maybe write to SegmentInfo?), I can try to do that 
> but would like to ask whether there's any objections?
>
> Best
> Patrick




Re: GDPR compliance

2023-11-28 Thread Ilan Ginzburg
Are larger and older segments even certain to ever be merged in practice? I
was assuming that if there is not a lot of newly indexed content and not a
lot of older documents being deleted, large older segments might never have
to be merged.


On Tue 28 Nov 2023 at 20:53, Robert Muir  wrote:

> I don't think there's any problem with GDPR, and I don't think users
> should be running unnecessary "optimize". GDPR just says data should
> be erased without "undue" delay. Waiting for a merge to nuke the
> deleted docs isn't "undue", there is a good reason for it.
>
> On Tue, Nov 28, 2023 at 2:40 PM Patrick Zhai  wrote:
> >
> > Hi Folks,
> > In LinkedIn we need to comply with GDPR for a large part of our data,
> and an important part of it is that we need to be sure we have completely
> deleted the data the user requested to delete within a certain period of
> time.
> > The way we have come up with so far is to:
> > 1. Record the segment creation time somewhere (not decided yet, maybe
> index commit userinfo, maybe some other place outside of lucene)
> > 2. Create a new merge policy which delegate most operations to a normal
> MP, like TieredMergePolicy, and then add extra single-segment (merge from 1
> segment to 1 segment, basically only do deletion) merges if it finds any
> segment is about to violate the GDPR time frame.
> >
> > So here's my question:
> > 1. Is there a better/existing way to do this?
> > 2. I would like to directly contribute to Lucene about such a merge
> policy since I think GDPR is more or less a common thing. Would like to
> know whether people feel like it's necessary or not?
> > 3. It's also nice if we can store the segment creation time to the index
> directly by IndexWriter (maybe write to SegmentInfo?), I can try to do that
> but would like to ask whether there's any objections?
> >
> > Best
> > Patrick
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Re: GDPR compliance

2023-11-28 Thread Robert Muir
I don't think there's any problem with GDPR, and I don't think users
should be running unnecessary "optimize". GDPR just says data should
be erased without "undue" delay. Waiting for a merge to nuke the
deleted docs isn't "undue"; there is a good reason for it.

On Tue, Nov 28, 2023 at 2:40 PM Patrick Zhai  wrote:
>
> Hi Folks,
> In LinkedIn we need to comply with GDPR for a large part of our data, and an 
> important part of it is that we need to be sure we have completely deleted 
> the data the user requested to delete within a certain period of time.
> The way we have come up with so far is to:
> 1. Record the segment creation time somewhere (not decided yet, maybe index 
> commit userinfo, maybe some other place outside of lucene)
> 2. Create a new merge policy which delegate most operations to a normal MP, 
> like TieredMergePolicy, and then add extra single-segment (merge from 1 
> segment to 1 segment, basically only do deletion) merges if it finds any 
> segment is about to violate the GDPR time frame.
>
> So here's my question:
> 1. Is there a better/existing way to do this?
> 2. I would like to directly contribute to Lucene about such a merge policy 
> since I think GDPR is more or less a common thing. Would like to know whether 
> people feel like it's necessary or not?
> 3. It's also nice if we can store the segment creation time to the index 
> directly by IndexWriter (maybe write to SegmentInfo?), I can try to do that 
> but would like to ask whether there's any objections?
>
> Best
> Patrick




GDPR compliance

2023-11-28 Thread Patrick Zhai
Hi Folks,
At LinkedIn we need to comply with GDPR for a large part of our data, and
an important part of that is being sure we have completely deleted the data
a user asked us to delete within a certain period of time.
The approach we have come up with so far is to:
1. Record the segment creation time somewhere (not decided yet; maybe index
commit userinfo, maybe some other place outside of Lucene)
2. Create a new merge policy that delegates most operations to a normal MP,
like TieredMergePolicy, and then adds extra single-segment merges (merging
1 segment into 1 segment, which basically only purges deletions) if it finds
any segment that is about to violate the GDPR time frame.
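
As an illustration of step 2, here is a minimal sketch of the selection logic
only, written in plain Java with no Lucene dependency. SegmentStats,
DeadlineMergeSelector, maxAge and safetyMargin are hypothetical names, not
Lucene APIs. In a real merge policy this check would presumably live in a
wrapper around the delegate policy (Lucene's FilterMergePolicy is one possible
base class), with each selected segment becoming its own single-segment merge.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical per-segment bookkeeping; not a Lucene class. */
final class SegmentStats {
  final String name;
  final Instant createdAt; // recorded segment creation time (step 1)
  final int deletedDocs;   // documents marked deleted but not yet purged

  SegmentStats(String name, Instant createdAt, int deletedDocs) {
    this.name = name;
    this.createdAt = createdAt;
    this.deletedDocs = deletedDocs;
  }

  @Override
  public String toString() {
    return name;
  }
}

/** Selects segments that need a 1-to-1 rewrite before they exceed the allowed age. */
final class DeadlineMergeSelector {
  private final Duration maxAge;       // assumed compliance window
  private final Duration safetyMargin; // how early to act before the window closes

  DeadlineMergeSelector(Duration maxAge, Duration safetyMargin) {
    this.maxAge = maxAge;
    this.safetyMargin = safetyMargin;
  }

  List<SegmentStats> selectForRewrite(List<SegmentStats> segments, Instant now) {
    // Segments created at or before this instant are "about to violate" the window.
    Instant cutoff = now.minus(maxAge.minus(safetyMargin));
    List<SegmentStats> selected = new ArrayList<>();
    for (SegmentStats s : segments) {
      // A segment without deletions holds nothing to purge, so it is skipped.
      if (s.deletedDocs > 0 && !s.createdAt.isAfter(cutoff)) {
        selected.add(s);
      }
    }
    return selected;
  }
}

final class DeadlineMergeDemo {
  public static void main(String[] args) {
    Instant now = Instant.now();
    List<SegmentStats> segments = new ArrayList<>();
    segments.add(new SegmentStats("old-with-deletes", now.minus(Duration.ofDays(29)), 5));
    segments.add(new SegmentStats("old-no-deletes", now.minus(Duration.ofDays(29)), 0));
    segments.add(new SegmentStats("fresh-with-deletes", now.minus(Duration.ofDays(3)), 7));
    DeadlineMergeSelector selector =
        new DeadlineMergeSelector(Duration.ofDays(30), Duration.ofDays(2));
    System.out.println("segments to rewrite: " + selector.selectForRewrite(segments, now));
  }
}
```

The 30-day window and 2-day margin are placeholders. The sketch also treats
segment creation time as the clock, as proposed above, rather than tracking
when each individual delete was issued.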

So here are my questions:
1. Is there a better/existing way to do this?
2. I would like to contribute such a merge policy directly to Lucene, since
I think GDPR is more or less a common concern. Do people feel it's
necessary or not?
3. It would also be nice if IndexWriter could store the segment creation
time in the index directly (maybe in SegmentInfo?). I can try to do that,
but would like to ask whether there are any objections.

Best
Patrick


Re: Lucene 9.9.0 Release

2023-11-28 Thread Chris Hegarty
Hi Guo,

Thanks for the update.

Let’s push the 9.9.0 branch cut back to tomorrow (rather than today as
previously suggested), which should allow time to sort out the outstanding
issues you mentioned below. That should be more straightforward all round.

New 9.9.0 branch cut 12:00 29th Nov 2023 UTC.

We have flexibility here, and I hope that this helps.

-Chris.

> On 28 Nov 2023, at 05:31, Guo Feng  wrote:
> 
> +1, thanks for volunteering Chris!
> 
> #12699 is merged to main. I plan to backport it to 9.9 if it fixes the
> performance drop; otherwise I'll revert #12699 and #12631 (the PR that
> introduced the regression) and push them to the next version.
> 
> On 2023/11/21 09:51:43 Chris Hegarty wrote:
>> Hi,
>> 
>> It's been a while since the 9.8.0 release and we’ve accumulated quite a few 
>> changes. I’d like to propose that we release 9.9.0.
>> 
>> If there's no objections, I volunteer to be the release manager and will cut 
>> the feature branch a week from now, 12:00 28th Nov UTC.
>> 
>> -Chris.
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>> 
>> 
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
> 

