Re: Where do we go from here? WAS: Turning off public access to the regression corpora?

Tim Allison Thu, 16 Jan 2025 05:57:43 -0800

https://issues.apache.org/jira/browse/LEGAL-696


On Thu, Jan 16, 2025 at 8:50 AM Dave Fisher <[email protected]> wrote:

> I think the question about (c) might best be directed to the VP, Privacy.
> It feels similar to email exposure of PII on mailing lists. Everything that
> was crawled was publicly available at one time or another.
>
> I could be wrong, but Privacy knows GDPR while LEGAL knows the AL2, etc.
>
> > On Jan 16, 2025, at 8:26 AM, Tim Allison <[email protected]> wrote:
> >
> > This is a really helpful delineation of the issues. Thank you, Maruan, for
> > this and for all of your support with the server.
> >
> > I'll open a ticket on LEGAL's jira?
> >
> > On Wed, Jan 15, 2025 at 3:55 AM [email protected] <
> > [email protected]> wrote:
> >
> >> Hi Tim,
> >>
> >> IMHO there are several parts to it.
> >>
> >> a) serving content which might look like other corps sites can be
> >> interpreted as phishing
> >> b) scraping and storing copyrighted content
> >> c) scraping and storing content containing personal data
> >>
> >> a) is being dealt with in the current form. As long as we don't
> >> publicly serve the files we are fine. We could also allow password
> >> protected https access if that has a benefit over ssh.
> >> b) scraping copyrighted information is typically OK (there are legal
> >> cases where this has been decided) although there might be cases where
> >> we need to remove individual files
> >> c) scraping and storing personal data is mostly not OK with GDPR and
> >> other acts without permission. This becomes very difficult to handle.
> >> E.g. if one uploaded a file to a bug tracker one could argue that if
> >> that file contained personal data by uploading one gave permission to
> >> use it within the context of the bug tracking and the dev process
> >> behind it. That doesn't include permission to load the file from that
> >> system and use it in a different context.
> >>
> >> I think until c) is sorted we cannot allow access in a wider context,
> >> and we even need to reconsider whether we can use it at all, although
> >> it is very beneficial.
> >>
> >> Maybe we can have a chat with legal about that.
> >>
> >> BR
> >> Maruan
> >>
> >>
> >>
> >>
> >> On Tuesday, 14.01.2025 at 08:17 -0500, Tim Allison wrote:
> >>> Hi Stefan,
> >>>
> >>>  I'm sorry for this sudden change. I'm hoping that we can find a way to
> >>> make this all work again, but there are complexities. Part of the challenge
> >>> is that the liability is spread across several organizations and
> >>> individuals; part of the challenge is everything to do with the varying
> >>> global legal/privacy requirements around crawled data. And there are other
> >>> challenges.
> >>>
> >>>  These corpora have been critical to numerous parsing projects at the ASF
> >>> and to devs and projects outside of ASF. I've heard from a few others
> >>> offline who are also affected by this.
> >>>
> >>>
> >>> All,
> >>>  What are our priorities? How can we move forward? Some options that I see:
> >>>
> >>> 0) nuclear option: shut down the server entirely
> >>> 1) continue as we have it now -- no http/s access
> >>> 2) host reports/metadata only via https
> >>> 3) host "packaged" corpora in zips (password protected?) via https
> >>> 4) password protect https access to the corpora
> >>> 5) not a viable option: turn everything back on
> >>> 6) not a viable option: turn everything back on with a strict robots.txt policy
> >>>
> >>>  Any other options? What are our preferences?
> >>>
> >>>          Best,
> >>>
> >>>                Tim
> >>>
> >>> On Sat, Jan 11, 2025 at 9:01 AM stefan6419846 <[email protected]> wrote:
> >>>
> >>>> We at pypdf (https://github.com/py-pdf/pypdf) have been hit by the
> >>>> unexpected shutdown of the service and were glad to at least find this
> >>>> indirect announcement. Nevertheless, it seems like we have to find a
> >>>> suitable alternative for the previously used govdocs1 PDF files from
> >>>> your server, as the official govdocs1 sources do not expose the single
> >>>> PDF files directly.
> >>>>
> >>>> Thanks for hosting these files in the past.
> >>>>
> >>>> Best regards,
> >>>> Stefan
> >>>>
> >>>> On 2025/01/09 01:36:59 Tim Allison wrote:
> >>>>> All,
> >>>>> We've gotten a handful of takedown requests recently. I had initially
> >>>>> envisioned public sharing of files as a key component of our server. We can
> >>>>> still use the files and offer read access to fellow file researchers. I'm
> >>>>> not sure I want to deal with further takedown requests.
> >>>>> As an intermediate step, we could ask robots not to crawl the data, but
> >>>>> that's not reliable.
> >>>>> So, in lieu of that, with heavy heart, I ask if it is time to close off
> >>>>> public access?
> >>>>>  WDYT?
> >>>>>
> >>>>>          Best,
> >>>>>
> >>>>>                    Tim
> >>>>>
> >>>>
> >>
> >>
>
>
