https://issues.apache.org/jira/browse/LEGAL-696
On Thu, Jan 16, 2025 at 8:50 AM Dave Fisher <[email protected]> wrote:

> I think the question about (c) might best be directed to the VP, Privacy.
> It feels similar to email exposure of PII on mailing lists. Everything that
> was crawled was publicly available at one time or another.
>
> I could be wrong, but Privacy knows GDPR while LEGAL knows the AL2, etc.
>
> > On Jan 16, 2025, at 8:26 AM, Tim Allison <[email protected]> wrote:
> >
> > This is a really helpful delineation of the issues. Thank you, Maruan, for
> > this and for all of your support with the server.
> >
> > I'll open a ticket on LEGAL's jira?
> >
> > On Wed, Jan 15, 2025 at 3:55 AM [email protected] <[email protected]> wrote:
> >
> >> Hi Tim,
> >>
> >> IMHO there are several parts to it.
> >>
> >> a) serving content which might look like other corps' sites can be
> >> interpreted as phishing
> >> b) scraping and storing copyrighted content
> >> c) scraping and storing content containing personal data
> >>
> >> a) is being dealt with in the current form. As long as we don't
> >> publicly serve the files we are fine. We could also allow password
> >> protected https access if that has a benefit over ssh.
> >> b) scraping copyrighted information is typically OK (there are legal
> >> cases where this has been decided) although there might be cases where
> >> we need to remove individual files
> >> c) scraping and storing personal data is mostly not OK with GDPR and
> >> other acts without permission. This becomes very difficult to handle.
> >> E.g. if one uploaded a file to a bug tracker one could argue that if
> >> that file contained personal data by uploading one gave permission to
> >> use it within the context of the bug tracking and the dev process
> >> behind it. That doesn't include permission to load the file from that
> >> system and use it in a different context.
> >>
> >> I think until c is sorted we cannot allow access in a wider context
> >> and even need to reconsider if we can use it at all although being very
> >> beneficial.
> >>
> >> Maybe we can have a chat with legal about that.
> >>
> >> BR
> >> Maruan
> >>
> >>
> >> On Tuesday, 14.01.2025 at 08:17 -0500, Tim Allison wrote:
> >>> Hi Stefan,
> >>>
> >>> I'm sorry for this sudden change. I'm hoping that we can find a way to
> >>> make this all work again, but there are complexities. Part of the challenge
> >>> is that the liability is spread across several organizations and
> >>> individuals; part of the challenge is everything to do with the varying
> >>> global legal/privacy requirements around crawled data. And there are other
> >>> challenges.
> >>>
> >>> These corpora have been critical to numerous parsing projects at the ASF
> >>> and to devs and projects outside of ASF. I've heard from a few others
> >>> offline who are also affected by this.
> >>>
> >>>
> >>> All,
> >>> What are our priorities? How can we move forward? Some options that I see:
> >>>
> >>> 0) nuclear option: shut down the server entirely
> >>> 1) continue as we have it now -- no http/s access
> >>> 2) host reports/metadata only via https
> >>> 3) host "packaged" corpora in zips (password protected?) via https
> >>> 4) password protect https access to the corpora
> >>> 5) not a viable option: turn everything back on
> >>> 6) not a viable option: turn everything back on with a strict robots.txt
> >>> policy
> >>>
> >>> Any other options? What are our preferences?
> >>>
> >>> Best,
> >>>
> >>> Tim
> >>>
> >>> On Sat, Jan 11, 2025 at 9:01 AM stefan6419846 <[email protected]>
> >>> wrote:
> >>>
> >>>> We at pypdf (https://github.com/py-pdf/pypdf) have been hit by the
> >>>> unexpected shutdown of the service and were glad to at least find this
> >>>> indirect announcement.
> >>>> Nevertheless, it seems like we have to find a
> >>>> suitable alternative for the previously used govdocs1 PDF files from
> >>>> your server, as the official govdocs1 sources do not expose the single
> >>>> PDF files directly.
> >>>>
> >>>> Thanks for hosting these files in the past.
> >>>>
> >>>> Best regards,
> >>>> Stefan
> >>>>
> >>>> On 2025/01/09 01:36:59 Tim Allison wrote:
> >>>>> All,
> >>>>> We've gotten a handful of takedown requests recently. I had initially
> >>>>> envisioned public sharing of files as a key component of our server. We can
> >>>>> still use the files and offer read access to fellow file researchers. I'm
> >>>>> not sure I want to deal with further takedown requests.
> >>>>> As an intermediate step, we could ask robots not to crawl the data, but
> >>>>> that's not reliable.
> >>>>> So, in lieu of that, with heavy heart, I ask if it is time to close off
> >>>>> public access?
> >>>>> WDYT?
> >>>>>
> >>>>> Best,
> >>>>>
> >>>>> Tim
