Re: [CODE4LIB] twarc and 30-day limitation

2020-05-05 Thread Edward Summers
Hi Eric,

Like Francis and Darnelle said, Twitter's primary free search API is limited to 
the last 7 days of activity. The so-called "Standard" search API is what twarc 
uses to gather data when you run `twarc search …`.
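If it helps, twarc can also be used as a Python library rather than from the 
command line. Here's a minimal sketch with twarc 1.x (the credentials are 
placeholders for your own Twitter developer keys, and the query is just an 
example):

    from twarc import Twarc

    # Placeholder credentials from your Twitter developer account
    t = Twarc(
        consumer_key="...",
        consumer_secret="...",
        access_token="...",
        access_token_secret="...",
    )

    # The Standard search endpoint only reaches back about 7 days,
    # no matter what query you give it.
    for tweet in t.search("code4lib"):
        print(tweet["id_str"], tweet.get("full_text", tweet.get("text")))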

However, a couple of years ago Twitter added the Premium Search API [1], which 
is a hybrid approach: it lets you search two endpoints (30-day and full-archive), 
and it is engineered to move you from collecting data for free to paying Twitter 
as you (inevitably) want to gather more.

From your email it sounds like you want to use the Full Archive endpoint? Adding 
premium support to twarc has been on the Documenting the Now roadmap for a 
while, but we haven't quite got around to it yet.

I went ahead and created a GitHub issue for you to track our progress [2]. It 
actually shouldn't be too difficult to add, so if you have a pressing need, let 
us know and we can prioritize it higher.

//Ed

PS. As Francis mentioned, twint gets around Twitter's API constraints by 
scraping Twitter's search results web page. Scraping comes with its own set of 
complexities, the biggest being that Twitter actively works to prevent it, 
which (in my experience) can make twint a bit unpredictable to use at times.

[1] https://developer.twitter.com/en/docs/tweets/search/overview/premium
[2] https://github.com/DocNow/twarc/issues/326

Re: [CODE4LIB] file sharing/transfer

2020-01-15 Thread Edward Summers
Depending on the context, OnionShare might be worth looking at:

https://onionshare.org/

The publisher of the data runs OnionShare on their laptop or workstation to 
share it. They send the generated, unguessable .onion URL to the recipient, who 
can download the data by opening the URL in Tor Browser or in their own 
OnionShare app. There is no server in between to trust.

It's important to stress that OnionShare is only as secure as the mechanism used 
to share the .onion URL, since anyone who has that URL can access the data. And 
once the publisher turns off OnionShare on their laptop or workstation, the data 
is no longer available.
 
//Ed

Re: [CODE4LIB] Preserving content from a closed Facebook group?

2019-11-04 Thread Edward Summers
Hi Carol,

As far as I know there's no way to download an official copy of a Facebook 
group, even if you are the owner. This is something that the Society of 
American Archivists wrote a letter to Facebook about in 2015 [1], and I don't 
think they ever received a response.

If you have access to the Facebook group you may want to try using Webrecorder 
Desktop [2] to capture it as a web archive (WARC file). Webrecorder's autopilot 
functionality should let you automatically scroll the page to collect content.

I suggest the desktop version because you will have to log in to your Facebook 
account to retrieve the content, and if you use Webrecorder Desktop you will 
know that there is no chance of your credentials ending up on webrecorder.io.

One nice thing about having the site as a WARC file is that you can go back to 
it later to extract images and other content. It might be a bit thorny, but at 
least it's possible.
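For example, here is a rough sketch of pulling the images back out with the 
warcio Python library (the WARC file name is made up; adjust it to whatever 
Webrecorder gives you):

    import os
    from warcio.archiveiterator import ArchiveIterator

    os.makedirs("images", exist_ok=True)

    # "facebook-group.warc.gz" is a hypothetical name for the Webrecorder capture
    with open("facebook-group.warc.gz", "rb") as stream:
        for i, record in enumerate(ArchiveIterator(stream)):
            if record.rec_type != "response" or record.http_headers is None:
                continue
            content_type = record.http_headers.get_header("Content-Type") or ""
            if content_type.startswith("image/"):
                # e.g. "image/jpeg; charset=..." -> "jpeg"
                ext = content_type.split("/")[1].split(";")[0].strip()
                with open(os.path.join("images", f"{i}.{ext}"), "wb") as out:
                    out.write(record.content_stream().read())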

//Ed

[1] https://www2.archivists.org/sites/all/files/Letter%20to%20Facebook_SAA.pdf
[2] https://blog.webrecorder.io/2019/08/29/desktop-app.html

> On Nov 4, 2019, at 11:25 AM, Carol Kassel  wrote:
> 
> Hi all,
> 
> I wondered if anyone here has any experience in preserving the content of a
> closed Facebook group. I thought it would be straightforward to create some
> kind of archive, or download all the photos, or really anything. But it
> isn't. Searching for a solution yields a variety of interesting options but
> I'm not sure what to try.
> 
> Please note that I used to write commands but now I'm a manager-type who
> only issues commands. However, I can roll up my sleeves and try to remember
> what I used to be able to do, if that's what it takes.
> 
> Thank you!
> 
> -Carol
> 
> -- 
> Carol Kassel
> Senior Manager, Digital Library Infrastructure
> NYU Digital Library Technology Services
> she/her/hers
> c...@nyu.edu
> (212) 992-9246
> dlib.nyu.edu


Re: [CODE4LIB] COinS

2019-03-19 Thread Edward Summers
I think that Zotero still looks for COinS, but understands a lot more? If you 
drop your COinS markup you may want to make sure that you have some other 
metadata in there for Zotero (and tools like it) to use?

> On Mar 19, 2019, at 11:45 AM, Kyle Banerjee  wrote:
> 
> If you have to ask, you have your answer :)
> 
> On Tue, Mar 19, 2019 at 8:00 AM Bigwood, David 
> wrote:
> 
>> Are COinS still of any value? Or should I clean up my pages by getting rid
>> of them? I haven't heard them mentioned in years.
>> 
>> Thanks,
>> David Bigwood
>> dbigw...@lpi.usra.edu
>> Regional Planetary Image Facility/Library
>> Lunar and Planetary Institute
>> https://www.facebook.com/RPIFN/
>> https://repository.hou.usra.edu/
>> 





[CODE4LIB] Call For Proposals: Forum on Ethics and Archiving the Web

2017-10-25 Thread Edward Summers
Forum on Ethics and Archiving the Web
New Museum, New York City, March 22-24
Proposals due by November 14 (funding available)
http://rhizome.org/editorial/2017/oct/24/open-call-national-forum-on-ethics-and-archiving-the-web/

The dramatic rise in the public’s use of the web and social media to document 
events presents tremendous opportunities to transform social memory practices. 
As new kinds of archives emerge, there is a pressing need for dialogue about 
the ethical risks and opportunities that they present to both those documenting 
and those documented.

Proposals for presentations, discussions, case studies, and workshops are now 
being accepted until November 14. Honoraria for presenters are available, as is 
travel funding to help both presenters and interested attendees who might not 
otherwise be able to make it. Topics of interest include, but are not limited to:

* community-driven web archiving efforts
* documentation of activism
* archiving trauma, violence, and human rights issues
* recognizing and dismantling digital colonialism and white supremacy in web 
archives
* strategies for protecting users, from one another, from surveillance, or from 
commercial interests
* design-driven approaches to building values & ethics into web archives
* issues arising when archives become big data or are used for machine learning

There will also be a day for open collaboration on tools, practices, and 
strategies for web archiving.

Recognizing that ethics and web archiving is a rapidly evolving field, and that 
it might not fit directly into your primary work or research interests, we 
wanted to keep the proposal process simple. We just need 100 words from you 
about why you would like to be part of the event, submitted via the form in the 
call for proposals linked above. We hope to see you there!




Re: [CODE4LIB] Persistent Identifiers for organizations/institutions.

2017-10-14 Thread Edward Summers

> On Oct 14, 2017, at 7:09 AM, Stuart A. Yeates  wrote:
> 
> archive.org web harvests include at least some DNS details for the content
> they harvest. I'm not sure how comprehensive it is and I'm pretty sure that
> there isn't a tool for easily exploring it.

I was thinking about that too. It seems like a window onto those DNS records 
stored in WARCs could be pretty useful, right? But I guess with WHOIS privacy 
services in the way, often not much information about the organizations is 
available via DNS anyway.
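For what it's worth, here is a rough sketch of what peeking at those records 
could look like (warcio assumed; the file name is made up, and the "dns:" URI 
convention is how Heritrix crawls typically record lookups):

    from warcio.archiveiterator import ArchiveIterator

    # "crawl.warc.gz" is a hypothetical crawl file from archive.org
    with open("crawl.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            uri = record.rec_headers.get_header("WARC-Target-URI") or ""
            if uri.startswith("dns:"):
                # the payload is the plain-text DNS response that was captured
                print(uri, record.content_stream().read().decode("utf-8", "replace"))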

I have heard through the grapevine that the UK web archiving effort uses DNS 
registrars to help estimate the limits of the UK web, but I'm not sure how they 
negotiated that.

> Having said that, there are some significant organisations which have yet
> to embrace the digital, which makes DNS tracking of their structure
> challenging.

Yes, that’s a good point. Of course any effort to build another registry will 
encounter similar barriers.

//Ed




Re: [CODE4LIB] Persistent Identifiers for organizations/institutions.

2017-10-13 Thread Edward Summers
What if we created an identifier system that organizations would pay an annual 
fee to belong to? This identifier would be guaranteed to be globally unique as 
long as the organization cared to maintain it. You could use this identifier 
with your web browser to find information about the organization.

Yes, that’s DNS.

What if we (memory institutions writ large) did something about remembering the 
history of DNS? It sounds simple, but it’s not. Is it possible?

//Ed

> On Oct 13, 2017, at 1:54 PM, Kyle Banerjee  wrote:
> 
> On Thu, Oct 12, 2017 at 12:08 PM, Jonathan Rochkind 
> wrote:
> 
>> At a former research university employer, I talked to a new high-level
>> research data officer type person, whose team had spent months just trying
>> to make a list of all (or even most) of the academic/research
>> organizational units currently existing, and their hiearchical
>> relationships. Before even getting to change management for the future.  No
>> such list or org chart even existed.




Re: [CODE4LIB] internet archive api

2017-09-18 Thread Edward Summers
The internetarchive [1] Python library that folks have already mentioned is 
pretty nice for working with IA collections.

For a small project I needed to download the metadata for a collection created 
by the National Agricultural Library and write it out to the filesystem as JSON 
in a pairtree [2]. Perhaps the script helps illustrate how to use internetarchive 
as a library, as opposed to from the command line?

https://github.com/UMD-DCIC/seed-catalogs/blob/master/fetch.py 
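As an even smaller sketch (assuming the internetarchive package is installed; 
bplsceep is the collection from Eric's message below), listing the identifiers 
in a collection looks roughly like this:

    from internetarchive import search_items, get_item

    # Each search result is a dict that includes the item's identifier
    for result in search_items("collection:bplsceep"):
        identifier = result["identifier"]
        item = get_item(identifier)  # fetch the full item metadata
        print(identifier, item.metadata.get("title"))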


//Ed

[1] https://internetarchive.readthedocs.io/
[2] https://confluence.ucop.edu/display/Curation/PairTree

> On Sep 18, 2017, at 3:37 PM, Eric Lease Morgan  wrote:
> 
> Is there an Internet Archive API that will allow me to get the contents of a 
> collection as a stream of data and not as a stream of HTML?
> 
> A cool collection of early English print materials is available at the 
> following URL:
> 
>  https://archive.org/details/bplsceep
> 
> Each item is associated with an Internet Archive identifier. If I were able 
> to easily extract these identifiers, then I would be more easily able to 
> provide services based on the collection. But I’m lazy. I don’t want to read 
> the HTML and scrape it accordingly. Ick! I’d rather be given the list of 
> bibliographics in a more computer-friendly way.
> 
> Again, can I programmatically read the contents of an Internet Archive 
> collection?
> 
> —
> Eric Morgan