Re: [Gossip] doing simple data analysis on list archives
Glad you were able to get things working. For search experts out there,
The Mail Archive uses Lucene search syntax. See the following
documentation.

http://www.mail-archive.com/faq.html#search
http://www.mail-archive.com/searching.html

More specifically, the site uses the default Lucene query parser, except
for two things. We default the parser to AND instead of OR. And we
support the special search term "sort:newest", which puts results in
chronological order.

query_parser.setDefaultOperator(QueryParser.Operator.AND)

Sounds like things are going well. But if you need a small tweak to the
search engine, just ask.
___
Gossip mailing list
https://www.mail-archive.com/[email protected]
https://www.mail-archive.com/cgi-bin/mailman/options/gossip
Re: [Gossip] doing simple data analysis on list archives
On 04/22/2017 11:12 AM, Matt Morgan wrote:
> Closing the loop on this, Dossy's script works perfectly. In my case I
> had to install php-xml, but that was it (and if you use php at all you
> probably already have that).
>
> I think I may have found a bug in the search engine. Compare these two
> queries:
>
> 1. search for
>    (+job OR +position)
>    https://www.mail-archive.com/search?l=mcn-l%40mcn.edu&q=%28%2Bjob+OR+%2Bposition%29&start=0
>    (1167 results, all of which have at least one occurrence of "job" or
>    "position").
>
> 2. search for
>    job OR position
>    https://www.mail-archive.com/search?l=mcn-l%40mcn.edu&q=job+OR+position&x=13&y=18
>    (10185 results, many of which have no occurrences of either "job" or
>    "position").
>
> I had expected query #2 to do what I wanted, but I had to fuss with
> the parens and plus signs to get it to actually limit the results to
> items with one of the words. Even
>    +job OR +position
> without the parens got me results with neither word.
Oops, had meant to add that the + signs turned out not to be
necessary. But I had to have the OR statement in parens for it to do
what I expected.
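For anyone who wants to script these comparisons, here is a minimal Python sketch of building the search URLs. It assumes only what the URLs in this thread show: a list address in `l` and a Lucene query in `q`. The function name `build_search_url` and the `newest_first` flag are made up for this example, and the other parameters seen in the thread (`f`, `start`, `x`, `y`) are left out because their meaning is not stated here.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://www.mail-archive.com/search"


def build_search_url(list_address, query, newest_first=False):
    """Return a search URL for one list and one Lucene-syntax query.

    Hypothetical helper: only the `l` and `q` parameters are taken from
    the URLs quoted in the thread.
    """
    if newest_first:
        # "sort:newest" is described in the thread as a special search term,
        # so it is appended to the query text itself.
        query = f"{query} sort:newest"
    # urlencode percent-escapes "(", ")", "+" and "@" and turns spaces into "+".
    return SEARCH_URL + "?" + urlencode({"l": list_address, "q": query})


if __name__ == "__main__":
    # The query variants discussed above, for side-by-side comparison.
    for q in ["(+job OR +position)", "(job OR position)", "job OR position"]:
        print(build_search_url("mcn-l@mcn.edu", q))
```

The first query printed reproduces the `l=` and `q=` portion of the URL quoted for search 1, which is a quick check that the escaping matches what the site's own search form produces.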
Re: [Gossip] doing simple data analysis on list archives
On 04/20/2017 06:15 PM, Dossy Shiobara wrote:
> On 4/20/17 4:43 PM, Matt Morgan wrote:
> > I guess what I'm asking, is there an easy path from mail-archive.com
> > search results into a spreadsheet (I guess mySQL or postgres would be
> > OK too) or some other kind of analysis tool?
>
> I thought about doing this in Node.js but that would require a bit more
> machinery that isn't available "out of the box" and I didn't want you to
> get hung up on any dependencies, so, here it is in PHP which should work
> with just out-of-the-box PHP (on most platforms, anyway):
Closing the loop on this, Dossy's script works perfectly. In my case I
had to install php-xml, but that was it (and if you use php at all you
probably already have that).
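Once msgs.csv exists, a first pass at the analysis can be small. What follows is a minimal Python sketch, not something from the script's author. It assumes the column names from the header row the script writes (link, subject, date, name, message), that the file is UTF-8, and that it parses as ordinary CSV. The helper names `load_rows` and `count_by_column` are invented for this example.

```python
import csv
from collections import Counter


def load_rows(path):
    """Read the CSV into a list of dicts keyed by the header row."""
    # newline="" lets the csv module handle line breaks inside quoted fields.
    # UTF-8 is an assumption about how the file was written.
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def count_by_column(rows, column):
    """Count how many rows share each non-empty value in one column."""
    values = ((row.get(column) or "").strip() for row in rows)
    return Counter(v for v in values if v)
```

For example, `count_by_column(load_rows("msgs.csv"), "name").most_common(10)` would list the ten most frequent senders, and passing `"subject"` instead groups the hits by subject line.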

I think I may have found a bug in the search engine. Compare these two
queries:

1. search for
   (+job OR +position)
   https://www.mail-archive.com/search?l=mcn-l%40mcn.edu&q=%28%2Bjob+OR+%2Bposition%29&start=0
   (1167 results, all of which have at least one occurrence of "job" or
   "position").

2. search for
   job OR position
   https://www.mail-archive.com/search?l=mcn-l%40mcn.edu&q=job+OR+position&x=13&y=18
   (10185 results, many of which have no occurrences of either "job" or
   "position").

I had expected query #2 to do what I wanted, but I had to fuss with the
parens and plus signs to get it to actually limit the results to items
with one of the words. Even
   +job OR +position
without the parens got me results with neither word.

Thanks,
Matt
Re: [Gossip] doing simple data analysis on list archives
On 4/20/17 4:43 PM, Matt Morgan wrote:
> I guess what I'm asking, is there an easy path from mail-archive.com
> search results into a spreadsheet (I guess mySQL or postgres would be
> OK too) or some other kind of analysis tool?
I thought about doing this in Node.js but that would require a bit more
machinery that isn't available "out of the box" and I didn't want you to
get hung up on any dependencies, so, here it is in PHP which should work
with just out-of-the-box PHP (on most platforms, anyway):
$ php -r '
// Fetch the search results page and parse it as HTML.
$dom = new DOMDocument;
$dom->loadHTML(file_get_contents("https://www.mail-archive.com/search?l=mcn-l%40mcn.edu&q=%28%2Bjob+OR+%2Bposition%29&f=1"));
$doc = simplexml_import_dom($dom);
// Write CSV to stdout, starting with a header row.
$out = fopen("php://output", "w");
fputcsv($out, array("link", "subject", "date", "name", "message"));
$msg = array();
// The loop assumes each hit is laid out as h3, div, blockquote, br siblings.
foreach ($doc->body->div[0]->children() as $node) {
    switch ($node->getName()) {
        case "h3":  // subject and link
            $msg["subj"] = (string) $node->span->a;
            $msg["link"] = "https://www.mail-archive.com" . (string) $node->span->a["href"];
            break;
        case "div":  // date and sender name
            $msg["date"] = (string) $node->span[0]->span->a;
            $msg["name"] = (string) $node->span[2]->a;
            break;
        case "blockquote":  // message text shown in the result
            $msg["body"] = (string) $node->span->pre;
            break;
        case "br":  // end of a hit: emit the row and start over
            fputcsv($out, array($msg["link"], $msg["subj"],
                $msg["date"], $msg["name"], $msg["body"]));
            $msg = array();
            break;
        default: break;
    }
}' | tee msgs.csv
HTH, HAND,
Dossy
--
Dossy Shiobara | "He realized the fastest way to change
[email protected] | is to laugh at your own folly -- then you
http://panoptic.com/ | can let go and quickly move on." (p. 70)
* WordPress * jQuery * MySQL * Security * Business Continuity *
