[Anthy-dev 3518] Re: Word download script

Yusuke TABATA Tue, 03 Jul 2007 06:49:08 -0700

Yusuke TABATA wrote:
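> I wrote a script that downloads words from the anthy wiki pages
> "単語収集/未分類語" (word collection / unclassified words) and
> "単語収集/時事" (word collection / current events) and saves them to
> ~/.anthy/imported_words_default.d/wiki_word, so I am attaching it.
All sorts of words have been registered, and the result is rather entertaining (though I doubt it is of much practical use).
The previous script stopped with an error whenever a character encoding conversion failed,
so I am updating it to a version that discards the entries that could not be converted.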
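
The skip-on-error behaviour described above can be sketched in isolation as follows. This is only an illustration written for Python 3 (the attached script targets Python 2); the function name decode_or_skip and the sample byte strings are assumptions made for the example and do not appear in the attached script.

```python
def decode_or_skip(raw_entries, encoding="euc_jp"):
    """Split raw byte strings into those that decode cleanly and those that do not."""
    kept, skipped = [], []
    for raw in raw_entries:
        try:
            kept.append(raw.decode(encoding))
        except UnicodeDecodeError:
            # Drop the entry instead of letting one bad row abort the whole run.
            skipped.append(raw)
    return kept, skipped


# Hypothetical input: one valid EUC-JP entry and one truncated multibyte sequence.
samples = ["単語".encode("euc_jp"), b"\xa4"]
kept, skipped = decode_or_skip(samples)
print(kept)
print(len(skipped), "entries discarded")
```

Collecting the discarded entries rather than only counting them makes it possible to report which rows were lost, much as the script prints the ignored entry.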



-- 
--
 CHAOS AND CHANCE!
  Yusuke TABATA

#! /usr/bin/python

import urllib2
import os
import stat
import re

urls = [
    "http://anthy.sourceforge.jp/cgi-bin/hiki/hiki.cgi?%C3%B1%B8%EC%BC%FD%BD%B8%2F%CC%A4%CA%AC%CE%E0%B8%EC",
    "http://anthy.sourceforge.jp/cgi-bin/hiki/hiki.cgi?%C3%B1%B8%EC%BC%FD%BD%B8%2F%BB%FE%BB%F6",
]

visited_urls = []

words = []

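def procline(line):
    # An entry is a table row: three text cells followed by one more cell.
    m = re.match("<tr><td>([^<]+)</td><td>([^<]+)</td><td>([^<]+)</td><td>.+</td></tr>", line)
    if m:
        (word, idx, wt) = m.groups()
        try:
            # Decode the row as EUC-JP and store it as UTF-8 in "idx wt word" order.
            us = unicode(idx + " " + wt + " " + word, 'EUC-JP').encode('UTF8')
            words.append(us)
        except UnicodeError, s:
            # Discard entries that cannot be converted instead of aborting.
            print idx + ":ignored:"
            print s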

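# Make sure the import directory exists before writing to it.
home = os.getenv("HOME")
dn = home + "/.anthy/imported_words_default.d/"
if os.access(dn, os.F_OK):
    sm = os.stat(dn).st_mode
    if not stat.S_ISDIR(sm):
        # This is a failure, so exit with a non-zero status.
        raise SystemExit(dn + " :is not a directory")
else:
    # makedirs also creates ~/.anthy when it does not exist yet.
    os.makedirs(dn)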

for u in urls:
    if u in visited_urls:
        continue
    print "retrieving words from:" + u
    try:
        f = urllib2.urlopen(u)
    except urllib2.URLError, e:
        # URLError is the base class of HTTPError, so connection failures are skipped too.
        print e
        continue
    visited_urls.append(u)
    line = f.readline()
    while line:
        procline(line)
        line = f.readline()
    f.close()
    
words.sort()
wf = open(dn + "wiki_word", 'w')
for w in words:
    wf.write(w + "\n")
wf.close()
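
The row-matching step of the script above can be tried on its own without fetching the wiki. The following is a minimal Python 3 sketch of it, assuming the line has already been decoded to text; the function name parse_row and the sample row with its placeholder cell values are made up for illustration.

```python
import re

# Same pattern as in the script: three text cells, then one more cell.
ROW_RE = re.compile(
    r"<tr><td>([^<]+)</td><td>([^<]+)</td><td>([^<]+)</td><td>.+</td></tr>"
)


def parse_row(line):
    """Return the entry in "idx wt word" order, or None if the line is not a matching row."""
    m = ROW_RE.match(line)
    if m is None:
        return None
    word, idx, wt = m.groups()
    return idx + " " + wt + " " + word


# Hypothetical table row; real rows come from the wiki pages.
row = "<tr><td>WORD</td><td>IDX</td><td>WT</td><td>note</td></tr>"
print(parse_row(row))
print(parse_row("<p>not a row</p>"))
```

Returning None for non-matching lines lets the caller skip page markup in the same way the script ignores every line that is not a table row.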

_______________________________________________
Anthy-dev mailing list
[email protected]
http://lists.sourceforge.jp/mailman/listinfo/anthy-dev
