Re: [Gossip] doing simple data analysis on list archives

2017-04-22 Thread Jeff Breidenbach
Glad you were able to get things working. For search experts out there, The
Mail Archive uses Lucene search syntax. See the following documentation.

http://www.mail-archive.com/faq.html#search
http://www.mail-archive.com/searching.html

More specifically, the site uses the default Lucene query parser, except
for two things. We default the parser to AND instead of OR. And we support the
special search term "sort:newest" which puts results in chronological order.

query_parser.setDefaultOperator(QueryParser.Operator.AND)
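
For anyone scripting against the search, here is a minimal Python sketch of
turning a query into a search URL. The URL shape ("l" for the list, "q" for
the query) is copied from the search links elsewhere in this thread. The
helper name search_url is made up for illustration, and appending
"sort:newest" to the query text is an assumption about how the special term
is entered.

```python
from urllib.parse import urlencode

BASE = "https://www.mail-archive.com/search"


def search_url(list_address, query, newest_first=False):
    """Build a search URL for one list (hypothetical helper, not a site API)."""
    if newest_first:
        # Assumption: the special term is typed as part of the query text.
        query = f"{query} sort:newest"
    # urlencode percent-escapes "@", "(", ")" and "+", and turns spaces into "+".
    return BASE + "?" + urlencode({"l": list_address, "q": query})


print(search_url("mcn-l@mcn.edu", "(job OR position)"))
```

Because the parser defaults to AND, a query that needs OR has to spell it out,
which is why the example passes the operator explicitly.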

Sounds like things are going well. But if you need a small tweak to the
search engine, just ask.
___
Gossip mailing list
https://www.mail-archive.com/[email protected]
https://www.mail-archive.com/cgi-bin/mailman/options/gossip

Re: [Gossip] doing simple data analysis on list archives

2017-04-22 Thread Matt Morgan

On 04/22/2017 11:12 AM, Matt Morgan wrote:



On 04/20/2017 06:15 PM, Dossy Shiobara wrote:

On 4/20/17 4:43 PM, Matt Morgan wrote:

I guess what I'm asking, is there an easy path from mail-archive.com
search results into a spreadsheet (I guess mySQL or postgres would be
OK too) or some other kind of analysis tool?

I thought about doing this in Node.js but that would require a bit more
machinery that isn't available "out of the box" and I didn't want you to
get hung up on any dependencies, so, here it is in PHP which should work
with just out-of-the-box PHP (on most platforms, anyway):

$ php -r '
  $dom = new DOMDocument;
  $dom->loadHTML(file_get_contents("https://www.mail-archive.com/search?l=mcn-l%40mcn.edu&q=%28%2Bjob+OR+%2Bposition%29&f=1"));
  $doc = simplexml_import_dom($dom);
  $out = fopen("php://output", "w");
  fputcsv($out, array("link", "subject", "date", "name", "message"));
  $msg = array();
  foreach ($doc->body->div[0]->children() as $node) {
    switch ($node->getName()) {
      case "h3":
        $msg["subj"] = (string) $node->span->a;
        $msg["link"] = "https://www.mail-archive.com" . (string) $node->span->a["href"];
        break;
      case "div":
        $msg["date"] = (string) $node->span[0]->span->a;
        $msg["name"] = (string) $node->span[2]->a;
        break;
      case "blockquote":
        $msg["body"] = (string) $node->span->pre;
        break;
      case "br":
        fputcsv($out, array($msg["link"], $msg["subj"],
          $msg["date"], $msg["name"], $msg["body"]));
        $msg = array();
        break;
      default: break;
    }
  }' | tee msgs.csv
Closing the loop on this, Dossy's script works perfectly. In my case I 
had to install php-xml, but that was it (and if you use php at all you 
probably already have that).


I think I may have found a bug in the search engine. Compare these two 
queries:


1. search for

(+job OR +position)

https://www.mail-archive.com/search?l=mcn-l%40mcn.edu&q=%28%2Bjob+OR+%2Bposition%29&start=0 



(1167 results, all of which have at least one occurrence of "job" or 
"position").


2. search for

job OR position

https://www.mail-archive.com/search?l=mcn-l%40mcn.edu&q=job+OR+position&x=13&y=18 



(10185 results, many of which have no occurrences of either "job" or 
"position").


I had expected query #2 to do what I wanted, but I had to fuss with 
the parens and plus signs to get it to actually limit the results to 
items with one of the words. Even


+job OR +position

without the parens got me results with neither word.

Oops, I had meant to add that the + signs turned out not to be
necessary. But I had to put the OR expression in parens for it to do
what I expected.




Re: [Gossip] doing simple data analysis on list archives

2017-04-22 Thread Matt Morgan



On 04/20/2017 06:15 PM, Dossy Shiobara wrote:

On 4/20/17 4:43 PM, Matt Morgan wrote:

I guess what I'm asking, is there an easy path from mail-archive.com
search results into a spreadsheet (I guess mySQL or postgres would be
OK too) or some other kind of analysis tool?

I thought about doing this in Node.js but that would require a bit more
machinery that isn't available "out of the box" and I didn't want you to
get hung up on any dependencies, so, here it is in PHP which should work
with just out-of-the-box PHP (on most platforms, anyway):

$ php -r '
  $dom = new DOMDocument;
  $dom->loadHTML(file_get_contents("https://www.mail-archive.com/search?l=mcn-l%40mcn.edu&q=%28%2Bjob+OR+%2Bposition%29&f=1"));
  $doc = simplexml_import_dom($dom);
  $out = fopen("php://output", "w");
  fputcsv($out, array("link", "subject", "date", "name", "message"));
  $msg = array();
  foreach ($doc->body->div[0]->children() as $node) {
    switch ($node->getName()) {
      case "h3":
        $msg["subj"] = (string) $node->span->a;
        $msg["link"] = "https://www.mail-archive.com" . (string) $node->span->a["href"];
        break;
      case "div":
        $msg["date"] = (string) $node->span[0]->span->a;
        $msg["name"] = (string) $node->span[2]->a;
        break;
      case "blockquote":
        $msg["body"] = (string) $node->span->pre;
        break;
      case "br":
        fputcsv($out, array($msg["link"], $msg["subj"],
          $msg["date"], $msg["name"], $msg["body"]));
        $msg = array();
        break;
      default: break;
    }
  }' | tee msgs.csv
Closing the loop on this, Dossy's script works perfectly. In my case I 
had to install php-xml, but that was it (and if you use php at all you 
probably already have that).


I think I may have found a bug in the search engine. Compare these two 
queries:


1. search for

(+job OR +position)

https://www.mail-archive.com/search?l=mcn-l%40mcn.edu&q=%28%2Bjob+OR+%2Bposition%29&start=0

(1167 results, all of which have at least one occurrence of "job" or 
"position").


2. search for

job OR position

https://www.mail-archive.com/search?l=mcn-l%40mcn.edu&q=job+OR+position&x=13&y=18

(10185 results, many of which have no occurrences of either "job" or 
"position").


I had expected query #2 to do what I wanted, but I had to fuss with the 
parens and plus signs to get it to actually limit the results to items 
with one of the words. Even


+job OR +position

without the parens got me results with neither word.

Thanks,
Matt



Re: [Gossip] doing simple data analysis on list archives

2017-04-20 Thread Dossy Shiobara
On 4/20/17 4:43 PM, Matt Morgan wrote:
> I guess what I'm asking, is there an easy path from mail-archive.com
> search results into a spreadsheet (I guess mySQL or postgres would be
> OK too) or some other kind of analysis tool?

I thought about doing this in Node.js but that would require a bit more
machinery that isn't available "out of the box" and I didn't want you to
get hung up on any dependencies, so, here it is in PHP which should work
with just out-of-the-box PHP (on most platforms, anyway):

$ php -r '
  // Fetch the search results page and parse it as HTML.
  $dom = new DOMDocument;
  $dom->loadHTML(file_get_contents("https://www.mail-archive.com/search?l=mcn-l%40mcn.edu&q=%28%2Bjob+OR+%2Bposition%29&f=1"));
  $doc = simplexml_import_dom($dom);
  // Write CSV to stdout, starting with a header row.
  $out = fopen("php://output", "w");
  fputcsv($out, array("link", "subject", "date", "name", "message"));
  $msg = array();
  // Each result is a run of h3, div and blockquote elements ended by a br.
  foreach ($doc->body->div[0]->children() as $node) {
    switch ($node->getName()) {
      case "h3":
        $msg["subj"] = (string) $node->span->a;
        $msg["link"] = "https://www.mail-archive.com" . (string) $node->span->a["href"];
        break;
      case "div":
        $msg["date"] = (string) $node->span[0]->span->a;
        $msg["name"] = (string) $node->span[2]->a;
        break;
      case "blockquote":
        $msg["body"] = (string) $node->span->pre;
        break;
      case "br":
        // End of one result: emit the row and start over.
        fputcsv($out, array($msg["link"], $msg["subj"],
          $msg["date"], $msg["name"], $msg["body"]));
        $msg = array();
        break;
      default: break;
    }
  }' | tee msgs.csv
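
Once msgs.csv exists it can go straight into a spreadsheet, but for a quick
tally a few lines of Python may be enough. This is only a sketch: it assumes
the header row written by the command above (link, subject, date, name,
message), and the function name count_by_sender is made up for illustration.

```python
import csv
from collections import Counter


def count_by_sender(lines):
    """Count rows per value of the "name" column.

    `lines` is any iterable of CSV lines, e.g. an open file object.
    Assumes the first line is the header row: link,subject,date,name,message.
    """
    reader = csv.DictReader(lines)
    return Counter(row["name"] for row in reader)
```

To use it on the real file, open it with `open("msgs.csv", newline="")` and
pass the file object in; `count_by_sender(fh).most_common(10)` then lists the
ten most frequent senders.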


HTH, HAND,

Dossy

-- 
Dossy Shiobara |  "He realized the fastest way to change
[email protected] |   is to laugh at your own folly -- then you
http://panoptic.com/   |   can let go and quickly move on." (p. 70) 
  * WordPress * jQuery * MySQL * Security * Business Continuity *

