Re: [Gossip] doing simple data analysis on list archives

2017-04-22 Thread Matt Morgan

On 04/22/2017 11:12 AM, Matt Morgan wrote:



On 04/20/2017 06:15 PM, Dossy Shiobara wrote:

On 4/20/17 4:43 PM, Matt Morgan wrote:

I guess what I'm asking is: is there an easy path from mail-archive.com
search results into a spreadsheet (I guess MySQL or Postgres would be
OK too) or some other kind of analysis tool?

I thought about doing this in Node.js but that would require a bit more
machinery that isn't available "out of the box" and I didn't want you to
get hung up on any dependencies, so, here it is in PHP which should work
with just out-of-the-box PHP (on most platforms, anyway):

$ php -r '
  // Fetch the search results page and parse it into a SimpleXML tree.
  $dom = new DOMDocument;
  $dom->loadHTML(file_get_contents("https://www.mail-archive.com/search?l=mcn-l%40mcn.edu=%28%2Bjob+OR+%2Bposition%29=1"));
  $doc = simplexml_import_dom($dom);
  $out = fopen("php://output", "w");
  fputcsv($out, array("link", "subject", "date", "name", "message"));
  $msg = array();
  // Each result is an h3 (subject, link), a div (date, name) and a
  // blockquote (body); the br that follows ends the result.
  foreach ($doc->body->div[0]->children() as $node) {
    switch ($node->getName()) {
      case "h3":
        $msg["subj"] = (string) $node->span->a;
        $msg["link"] = "https://www.mail-archive.com" .
          (string) $node->span->a["href"];
        break;
      case "div":
        $msg["date"] = (string) $node->span[0]->span->a;
        $msg["name"] = (string) $node->span[2]->a;
        break;
      case "blockquote":
        $msg["body"] = (string) $node->span->pre;
        break;
      case "br":
        // Write the collected fields as one CSV row, then start over.
        fputcsv($out, array($msg["link"], $msg["subj"],
          $msg["date"], $msg["name"], $msg["body"]));
        $msg = array();
        break;
      default: break;
    }
  }' | tee msgs.csv

Closing the loop on this: Dossy's script works perfectly. In my case I
had to install php-xml, but that was it (and if you use PHP at all you
probably already have that).
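
For anyone who would rather script the analysis than open msgs.csv in a spreadsheet, here is a minimal sketch in Python (my choice of tool, not something from this thread). It tallies postings by day of the month and counts, per year, the postings that mention a given term such as "webmaster". The column names come from the header row the PHP script writes. The function names are made up for illustration, and the sketch assumes the date column starts with YYYY-MM-DD, which the thread does not confirm, so adjust parse_date if your CSV looks different.

```python
import csv
from collections import Counter
from datetime import datetime


def parse_date(text):
    """Return a datetime for a value starting with YYYY-MM-DD, else None.

    The YYYY-MM-DD prefix is an assumption about the date column.
    """
    try:
        return datetime.strptime(text.strip()[:10], "%Y-%m-%d")
    except ValueError:
        return None


def summarize(rows, term):
    """Count postings per day of month, and per year those mentioning term."""
    by_day = Counter()
    term_by_year = Counter()
    needle = term.lower()
    for row in rows:
        when = parse_date(row.get("date") or "")
        if when is None:
            continue  # skip rows whose date does not match the assumed format
        by_day[when.day] += 1
        text = (row.get("subject") or "") + " " + (row.get("message") or "")
        if needle in text.lower():
            term_by_year[when.year] += 1
    return by_day, term_by_year


def load_rows(path):
    """Read the CSV into a list of dicts keyed by the header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

Calling summarize(load_rows("msgs.csv"), "webmaster") returns the two counters. Sorting term_by_year by year should show roughly when a word drops out of use, and by_day shows whether postings cluster near the end of the month.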


I think I may have found a bug in the search engine. Compare these two 
queries:


1. search for

(+job OR +position)

https://www.mail-archive.com/search?l=mcn-l%40mcn.edu=%28%2Bjob+OR+%2Bposition%29=0 



(1167 results, all of which have at least one occurrence of "job" or 
"position").


2. search for

job OR position

https://www.mail-archive.com/search?l=mcn-l%40mcn.edu=job+OR+position=13=18 



(10185 results, many of which have no occurrences of either "job" or 
"position").


I had expected query #2 to do what I wanted, but I had to fuss with 
the parens and plus signs to get it to actually limit the results to 
items with one of the words. Even


+job OR +position

without the parens got me results with neither word.

Oops, I had meant to add that the + signs turned out not to be
necessary. But I had to put the OR expression in parens for it to do
what I expected.
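
One way to double-check the result counts above is to run the CSV export through a quick filter. The sketch below, in Python (an assumption on my part, with made-up function names), does two things: it shows how the search text maps to the percent-encoded form seen in the URLs, and it lists the exported rows that contain none of the search terms. It uses plain substring matching on the "subject" and "message" columns, which may not match how the search engine itself tokenizes or stems words, so treat a nonzero count as a hint rather than proof of a bug.

```python
from urllib.parse import quote_plus


def encode_query(text):
    """Percent-encode search text, with spaces written as plus signs."""
    return quote_plus(text)


def rows_missing_terms(rows, terms):
    """Return rows whose subject and message contain none of the terms.

    Matching is a case-insensitive substring test, so "job" also
    matches "jobs"; this is a simplification, not the engine's logic.
    """
    lowered = [term.lower() for term in terms]
    missing = []
    for row in rows:
        text = (row.get("subject") or "") + " " + (row.get("message") or "")
        text = text.lower()
        if not any(term in text for term in lowered):
            missing.append(row)
    return missing
```

For example, encode_query("(+job OR +position)") gives %28%2Bjob+OR+%2Bposition%29, the same string that appears in the first URL. Passing the rows read from msgs.csv along with ["job", "position"] to rows_missing_terms returns the results that mention neither word.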


___
Gossip mailing list
https://www.mail-archive.com/gossip@mail-archive.com
https://www.mail-archive.com/cgi-bin/mailman/options/gossip


[Gossip] doing simple data analysis on list archives

2017-04-20 Thread Matt Morgan
The Museum Computer Network (http://mcn.edu) is having their 50th 
anniversary this year and I'm doing a volunteer project where I'm 
looking at the history of computery museum jobs. Fortunately, a couple 
years ago I put every mcn-l posting since 1996 up on The Mail Archive 
and there are ~1000 job postings in there. (Other people get to look at 
old print publications for the earlier history.)


It's easy enough on mail-archive to search for "job" or "position" and 
get the results. The expand button gets me the content of each message 
right on the search results page. Hence,


https://www.mail-archive.com/search?l=mcn-l%40mcn.edu=%28%2Bjob+OR+%2Bposition%29=1

And the HTML of that page is nicely structured. I'd love to get them 
from there into a spreadsheet and figure out things like "most job 
postings go out at the end of the month" or "we stopped saying 
'webmaster' in 2001" or other kinds of data-informed insights. Has 
anybody tried something like this before?


I guess what I'm asking is: is there an easy path from mail-archive.com 
search results into a spreadsheet (I guess MySQL or Postgres would be OK 
too) or some other kind of analysis tool?


Thanks!
Matt
