There was an amusing revelation on another list of mine about PDF conversion. A blind user was complaining that the PDF manuals are useless for screen readers. Another reader on the list produced an HTML conversion of the PDF in a few hours. It was done by Claude. I was gobsmacked, having been thwarted by PDF documents many times before, but of course it's the perfect task for an LLM.
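For what it's worth, the mechanical half of that job is small when the PDF has a text layer. Here is a minimal sketch, assuming the text has already been extracted with pdftotext (which separates pages with a form feed character). The function name pages_to_html and the one-section-per-page layout are my own illustration, not what the list member or Claude actually did. The hard part an LLM adds, recovering headings, tables, and reading order, is not attempted here.

```python
import html


def pages_to_html(text: str, title: str = "Converted manual") -> str:
    """Wrap page-separated plain text in minimal HTML a screen reader can navigate."""
    # pdftotext emits a form feed ("\f") between pages; drop empty pages.
    pages = [p for p in text.split("\f") if p.strip()]
    parts = [
        "<!DOCTYPE html>",
        '<html lang="en">',
        f'<head><meta charset="utf-8"><title>{html.escape(title)}</title></head>',
        "<body>",
        f"<h1>{html.escape(title)}</h1>",
    ]
    for number, page in enumerate(pages, start=1):
        # One labelled section and heading per page gives the reader landmarks.
        parts.append(f'<section aria-label="Page {number}">')
        parts.append(f"<h2>Page {number}</h2>")
        # Blank lines separate paragraphs; wrapped lines inside one are rejoined.
        for block in page.split("\n\n"):
            paragraph = " ".join(
                line.strip() for line in block.splitlines() if line.strip()
            )
            if paragraph:
                parts.append(f"<p>{html.escape(paragraph)}</p>")
        parts.append("</section>")
    parts += ["</body>", "</html>"]
    return "\n".join(parts)


if __name__ == "__main__":
    # Hypothetical sample standing in for real pdftotext output.
    sample = "Chapter 1\n\nPress the <Power> button\nto start.\n\fChapter 2\n\nDone.\n"
    print(pages_to_html(sample, title="Sample manual"))
```

Scanned PDFs with no text layer would need OCR first, which this sketch does not cover.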
-- rec --

On Fri, Feb 6, 2026 at 7:19 AM Tom Johnson <[email protected]> wrote:

> Thanks, Marcus.
>
> So you're saying there seems to be no consistency or standardization of
> methodology in the so-called review and release process. That is a story in
> itself.
>
> Also, having to convert PDF to text is another time suck but necessary at
> some point.
>
> Another approach would be to FOIA the DOJ for directives from whoTK to
> those assigned to do the reviews/redactions. Of course, given that Trump has
> shut down many of the offices that responded to FOIAs, it's unlikely we
> would see those documents in our lifetime.
>
> Onward,
> Tom
> (TK means "to come" in journalese)
> =======================
> Tom Johnson
> Inst. for Analytic Journalism
> Santa Fe, New Mexico
> 505-577-6482
> =======================
>
> On Fri, Feb 6, 2026, 1:36 AM Marcus Daniels <[email protected]> wrote:
>
>> So.. The early tranches were the FBI searches of the properties. Then
>> there were a bunch of personal photographs of Epstein and Maxwell on their
>> travels with various famous people. Amusingly, faces some folks on this
>> list would recognize. (Read 2and3.md if so inclined and look up Maxwell’s
>> recent proffer to Blanche.)
>>
>> The volume was modest enough in the early sets that I could push a
>> lot through Claude, even images. Summaries attached of that.
>>
>> The new documents vary a lot in size. There are examples of subpoenaed
>> e-mail accounts that go on and on for hundreds of pages, but also single
>> isolated e-mails. There’s an unusually large volume on investigating
>> Epstein’s demise in prison. Overall, it is mostly PDF format, and it is
>> often the case that text can be extracted, e.g., using pdftotext. It’s
>> just the DOJ convention to use PDF. It doesn’t mean they are composed
>> documents.
>>
>> I’ve been focused on “Dataset 9” as that one is large, and the DOJ failed
>> (or refused?) to make a zip file that would be easy to download. This dataset
>> gives more insight into Epstein’s contemptible personality. There are many
>> emotionally manipulative e-mails to some of his more independent young
>> female associates. I haven’t worked with the new data systematically yet,
>> just spot checking the download from time to time. I feel guilty wasting
>> GPU cycles and energy on traumatizing a perfectly good AI on this stuff.
>>
>> The file numbering has become sparse in the later datasets. In the
>> early batches, that occurred when Donald Trump was in a picture. Just
>> sayin.
>>
>> Marcus
>>
>> *From: *Friam <[email protected]> on behalf of Tom Johnson <
>> [email protected]>
>> *Date: *Thursday, February 5, 2026 at 9:38 PM
>> *To: *The Friday Morning Applied Complexity Coffee Group <
>> [email protected]>
>> *Subject: *Re: [FRIAM] Gauging interest..
>>
>> Marcus--
>>
>> Congrats and many thanks for harvesting this whole crop and keeping it in
>> various grain bins.
>>
>> Quick questions:
>>
>> The DOJ, on multiple occasions, has talked about various numbers of
>> pages. How many "pages" do you think you have? Are they all standard 8.5x11
>> pages? All PDF? If so, searchable PDF?
>>
>> Do the various batches released come with any kind of title page, index?
>> Glossary?
>>
>> Are the pages/documents in any chronological order or any categorical
>> order?
>>
>> Do you think we could do a word count vs. lines (each containing a
>> words-per-line estimate) redacted? (i.e., a story reporting X percent of the
>> documents still hidden or useless).
>>
>> I'm sure I can bug you for more.
>>
>> Tom
>>
>> =======================
>> Tom Johnson
>> Inst. for Analytic Journalism
>> Santa Fe, New Mexico
>> 505-577-6482
>> =======================
>>
>> On Thu, Feb 5, 2026, 10:37 PM Marcus Daniels <[email protected]>
>> wrote:
>>
>> I’m closing in on a full download of Dataset 9 of the Epstein
>> Transparency Act. (I have the rest.) I’m thinking of building a vector
>> database (e.g. pgvector for Postgres). I was thinking of wrapping an MCP
>> server around it so LLMs can get a directory of articles and then
>> summarize, or cross-reference sets of them. RAG is what Perplexity does,
>> but apparently, they don’t have the content yet.
>>
>> I imagine a SETI-at-home type project to reduce the data. Another
>> analogy that comes to mind is annotations of the genome: Line all the
>> documents up and then slowly fill in the summaries. The vector database
>> could help inform how to combine documents for consumption within context
>> window limits (PCA vicinity).
>>
>> I could keep my Max subscription on it and make some progress, but really
>> such a project needs tens or hundreds of workers.
>>
>> Marcus
>>
>> .- .-.. .-.. / ..-. --- --- - . .-. ... / .- .-. . / .-- .-. --- -. --. /
>> ... --- -- . / .- .-. . / ..- ... . ..-. ..- .-..
>> FRIAM Applied Complexity Group listserv
>> Fridays 9a-12p Friday St. Johns Cafe / Thursdays 9a-12p Zoom
>> https://bit.ly/virtualfriam
>> to (un)subscribe http://redfish.com/mailman/listinfo/friam_redfish.com
>> FRIAM-COMIC http://friam-comic.blogspot.com/
>> archives: 5/2017 thru present
>> https://redfish.com/pipermail/friam_redfish.com/
>> 1/2003 thru 6/2021 http://friam.383.s1.nabble.com/
