Re: [CODE4LIB] Code4Lib jobs list data dump?

2021-01-22 Thread Graeme Williams
Python's standard library includes a module (mailbox) that parses mbox files.
It worked on the test file I downloaded from Gmail. If all you want are the
message bodies, it looks like you can do that in seven lines. Obviously, this
doesn't guarantee much of anything for the jobs mbox files.
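
A minimal sketch of that kind of extraction, assuming the standard-library
mailbox module and a local mbox file. The function name message_bodies is
made up for illustration, and the sketch runs longer than seven lines because
it also handles multipart mail and odd character sets:

```python
import mailbox


def message_bodies(mbox_path):
    """Yield (subject, plain-text body) for each message in an mbox file."""
    box = mailbox.mbox(mbox_path, create=False)  # do not create a missing file
    try:
        for message in box:
            chunks = []
            # walk() visits the message itself and, for multipart mail, each part.
            for part in message.walk():
                if part.get_content_type() != "text/plain":
                    continue
                payload = part.get_payload(decode=True)  # bytes, or None
                if payload is None:
                    continue
                charset = part.get_content_charset() or "utf-8"
                try:
                    chunks.append(payload.decode(charset, errors="replace"))
                except LookupError:  # unknown charset name in the header
                    chunks.append(payload.decode("utf-8", errors="replace"))
            yield message["Subject"], "\n".join(chunks)
    finally:
        box.close()
```

Calling list(message_bodies("jobs.mbox")), where jobs.mbox is a placeholder
file name, would give one (subject, body) pair per message. HTML-only
messages come back with an empty body in this sketch.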

Looking at some of the posts on the web site, it looks like you'll have two
top-level problems with posts / message body content:
   - posts that contain more than one job description
   - posts that contain no job descriptions, just a link to a job
     description somewhere else.
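
One rough way to triage those two cases before any closer reading is sketched
below. Everything in it is an assumption for illustration: the function name
triage_post, the 40-word threshold, and the idea that a new description
starts with a line such as "Position:" or "Job Title:". Real postings will
need different rules.

```python
import re

URL_RE = re.compile(r"https?://\S+")
# Hypothetical marker for the start of a job description.
TITLE_RE = re.compile(r"^\s*(?:position|job title)\s*:",
                      re.IGNORECASE | re.MULTILINE)


def triage_post(body, min_words=40):
    """Label a post as 'link-only', 'multiple-jobs', or 'single-job'."""
    words_without_urls = URL_RE.sub(" ", body).split()
    has_link = URL_RE.search(body) is not None
    # A short post whose main content is a URL probably points elsewhere.
    if has_link and len(words_without_urls) < min_words:
        return "link-only"
    # More than one title marker suggests several descriptions in one post.
    if len(TITLE_RE.findall(body)) > 1:
        return "multiple-jobs"
    return "single-job"
```

Posts labelled link-only or multiple-jobs could then be set aside for manual
review instead of being counted as one description each.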

I'm happy to continue this discussion either here or offline, and if
someone sends me an mbox file, I'll see what I can do (in seven lines (-:)).

Graeme Williams
Las Vegas, NV

p.s.  I love scraping web pages


On Fri, Jan 22, 2021 at 10:39 AM Monica Maceli  wrote:

> Hi all,
>
> I've done a couple projects mining the data from the code4lib listserv
> (e.g. https://ejournals.bc.edu/index.php/ital/article/view/5893 ).  Both
> times the fastest route was finding helpful folks involved in it to provide
> me with a data dump vs. spending time on a scraper.
>
> The most recent work I did was in 2018 - I have a tarball of all the
> message log files for the listserv (some will be job posts and others not)
> which is 2003 through 2018.  I believe I asked about this on the c4l Slack
> at the time and Wayne Graham from CLIR kindly helped me out with the data!
> This data is not anonymized (as it was/is publicly available with names
> and emails associated) but I did anonymize the findings for reporting.
>
> Ellen - I'd be happy to chat sometime about how I mined the data for job
> titles and related skills/technologies, feel free to reach out to me
> directly!
>
> Best,
>
> Monica Maceli, Ph.D.
> Associate Professor
> Pratt Institute | School of Information
> 144 W 14th St, 6th Floor, New York, NY, 10011-7301
> www.monicamaceli.com | mmac...@pratt.edu
>
>


Re: [CODE4LIB] Code4Lib jobs list data dump?

2021-01-22 Thread Monica Maceli
Hi all,

I've done a couple projects mining the data from the code4lib listserv
(e.g. https://ejournals.bc.edu/index.php/ital/article/view/5893 ).  Both
times the fastest route was finding helpful folks involved in it to provide
me with a data dump vs. spending time on a scraper.

The most recent work I did was in 2018 - I have a tarball of all the
message log files for the listserv (some will be job posts and others not)
which is 2003 through 2018.  I believe I asked about this on the c4l Slack
at the time and Wayne Graham from CLIR kindly helped me out with the data!
This data is not anonymized (as it was/is publicly available with names
and emails associated) but I did anonymize the findings for reporting.

Ellen - I'd be happy to chat sometime about how I mined the data for job
titles and related skills/technologies, feel free to reach out to me
directly!

Best,

Monica Maceli, Ph.D.
Associate Professor
Pratt Institute | School of Information
144 W 14th St, 6th Floor, New York, NY, 10011-7301
www.monicamaceli.com | mmac...@pratt.edu


On Fri, Jan 22, 2021 at 1:18 PM Andromeda Yelton 
wrote:

> The initial commit in https://github.com/code4lib/shortimer/ was November
> 2011, which is ten years for some values of ten. Taking a quick and
> noncomprehensive glance around, I see postings as old as 2005. I don't see
> an obvious API, but maybe a maintainer could weigh in about data dump
> possibilities?
>
> --
> Andromeda Yelton
> Humanistic Machine Learning for Library Data
> Lecturer, San José State University iSchool
> https://andromedayelton.com
> @ThatAndromeda
> 
>


Re: [CODE4LIB] [EXTERNAL] [CODE4LIB] Code4Lib jobs list data dump?

2021-01-22 Thread Turner, Steven
I would see if you can just get an SQL or CSV dump of the tables. Maybe it’s 
not super-normalized and you can get most of what you need in a table or two. 
Or perhaps the provider would be so kind as to write a join for the data you 
need and dump the result to a CSV file, which you can then import into Excel 
and peruse / analyze to your heart’s content. That seems to be the easiest 
thing by far, to me anyway.
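
A sketch of what such a join-and-dump could look like, using SQLite from the
Python standard library. The table and column names (jobs, employers, title,
post_date, employer_id) are invented for illustration and are not taken from
the real jobs site schema, which may use a different database entirely:

```python
import csv
import sqlite3


def build_demo_db():
    """Create an in-memory database with two hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE employers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE jobs (id INTEGER PRIMARY KEY, title TEXT,
                           post_date TEXT, employer_id INTEGER);
        INSERT INTO employers VALUES (1, 'Example University'),
                                     (2, 'Sample Public Library');
        INSERT INTO jobs VALUES
            (1, 'Systems Librarian', '2015-03-01', 1),
            (2, 'Digital Initiatives Developer', '2018-07-15', 2);
    """)
    return conn


def dump_join_to_csv(conn, out):
    """Write one row per job, joined to its employer, as CSV text."""
    cursor = conn.execute("""
        SELECT jobs.title, employers.name AS employer, jobs.post_date
        FROM jobs
        JOIN employers ON employers.id = jobs.employer_id
        ORDER BY jobs.post_date
    """)
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cursor.description])  # header row
    writer.writerows(cursor)
```

Passing a file opened with open("jobs.csv", "w", newline="") as the out
argument (jobs.csv being a placeholder name) produces a file that Excel can
open directly.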

> On Jan 22, 2021, at 12:17 PM, Andromeda Yelton  
> wrote:
> 
> The initial commit in https://github.com/code4lib/shortimer/ was November
> 2011, which is ten years for some values of ten. Taking a quick and
> noncomprehensive glance around, I see postings as old as 2005. I don't see
> an obvious API, but maybe a maintainer could weigh in about data dump
> possibilities?
> 
> -- 
> Andromeda Yelton
> Humanistic Machine Learning for Library Data
> Lecturer, San José State University iSchool
> https://andromedayelton.com/
> @ThatAndromeda
> 





Re: [CODE4LIB] Code4Lib jobs list data dump?

2021-01-22 Thread Andromeda Yelton
The initial commit in https://github.com/code4lib/shortimer/ was November
2011, which is ten years for some values of ten. Taking a quick and
noncomprehensive glance around, I see postings as old as 2005. I don't see
an obvious API, but maybe a maintainer could weigh in about data dump
possibilities?

On Fri, Jan 22, 2021 at 11:28 AM Eric Lease Morgan  wrote:

> On Jan 22, 2021, at 11:11 AM, Jill Ellern  wrote:
>
> > I'm doing some research into systems librarian duties and wondering if
> > there is an easy way to get a dump of the code4lib jobs from the last 10
> > years?  In excel format?
>
>
> Easy? I'd be surprised.
>
> There are two or three sources of the Code4Lib jobs data:
>
>   1. the underlying data from the jobs.code4lib.org site
>
>   2. any one of a number of different Code4Lib mailing list Web archives
>
>   3. the archived mailbox (mbox) files from the mailing list
>
> I don't think the jobs site has been around for ten years. Has it? Nor do
> I know whether or not the data is archived. If it is, then I'd bet you will
> be able to get it in some sort of structured format like JSON or a
> delimited format that Excel can open.
>
> Scraping different Web archives would require... scraping which,
> personally, I run away from.
>
> Finally, the archived mbox files would be the most comprehensive, but a
> programmer would have to parse the mbox (email) files, which is a
> specialized task in and of itself. If you want to know where the mbox files
> are located, then drop me a line and I'll let you know. Easy.
>
> Finally, what are the questions you would like to answer? How many systems
> librarian jobs have been posted? Where were the jobs? What are the
> characteristics of systems librarianship and how have they changed over
> time? How much do they pay? Extracting some of this information from the
> postings may be difficult, if not heroic in nature.
>
> --
> Eric Morgan
> University of Notre Dame



-- 
Andromeda Yelton
Humanistic Machine Learning for Library Data
Lecturer, San José State University iSchool
https://andromedayelton.com
@ThatAndromeda



Re: [CODE4LIB] Code4Lib jobs list data dump?

2021-01-22 Thread Eric Lease Morgan
On Jan 22, 2021, at 11:11 AM, Jill Ellern  wrote:

> I'm doing some research into systems librarian duties and wondering if there 
> is an easy way to get a dump of the code4lib jobs from the last 10 years?  In 
> excel format?


Easy? I'd be surprised.

There are two or three sources of the Code4Lib jobs data:

  1. the underlying data from the jobs.code4lib.org site

  2. any one of a number of different Code4Lib mailing list Web archives

  3. the archived mailbox (mbox) files from the mailing list

I don't think the jobs site has been around for ten years. Has it? Nor do I 
know whether or not the data is archived. If it is, then I'd bet you will be 
able to get it in some sort of structured format like JSON or a delimited 
format that Excel can open.

Scraping different Web archives would require... scraping which, personally, I 
run away from.

Finally, the archived mbox files would be the most comprehensive, but a 
programmer would have to parse the mbox (email) files, which is a specialized 
task in and of itself. If you want to know where the mbox files are located, 
then drop me a line and I'll let you know. Easy.

Finally, what are the questions you would like to answer? How many systems 
librarian jobs have been posted? Where were the jobs? What are the 
characteristics of systems librarianship and how have they changed over time? 
How much do they pay? Extracting some of this information from the postings may 
be difficult, if not heroic in nature.

--
Eric Morgan
University of Notre Dame
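
The simplest of the questions above, how many matching jobs were posted per
year, can be roughed out from the mbox files as follows. This is a sketch
under assumptions: the function name postings_per_year is made up, it matches
only on the Subject line, and the default keyword is a guess at how such
postings are titled.

```python
import mailbox
from collections import Counter
from email.utils import parsedate_to_datetime


def postings_per_year(mbox_path, keyword="systems librarian"):
    """Count messages whose Subject contains keyword, grouped by year."""
    counts = Counter()
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            # Collapse folded (multi-line) subjects into one line of text.
            subject = " ".join(str(message["Subject"] or "").split())
            if keyword.lower() not in subject.lower():
                continue
            try:
                year = parsedate_to_datetime(message["Date"]).year
            except (TypeError, ValueError):
                continue  # missing or unparseable Date header
            counts[year] += 1
    finally:
        box.close()
    return dict(sorted(counts.items()))
```

Questions about location, pay, or changing duties need the message bodies and
a good deal more interpretation than a keyword match.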

[CODE4LIB] Code4Lib jobs list data dump?

2021-01-22 Thread Jill Ellern
Hey folks,
I'm doing some research into systems librarian duties and wondering if there is 
an easy way to get a dump of the code4lib jobs from the last 10 years?  In 
excel format?
Jill Ellern