[
https://issues.apache.org/jira/browse/LUCENE-3745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13200339#comment-13200339
]
Christian Moen commented on LUCENE-3745:
----------------------------------------
I'm attaching some lexical assets that are useful for building stopword and
stoptag lists.
The frequency lists are made from ~1.5 million segmented Japanese Wikipedia
documents after some scrubbing and handling. I'd prefer to use a more
balanced corpus, but I believe Wikipedia will be fine for this purpose.
The following files are attached in TSV format using UTF-8 encoding:
* {{top-pos.txt}} - Part-of-speech tag distribution
* {{top-100000.txt}} - Top 100,000 most frequent surface forms and their
frequencies
* {{top-1000000-pos.txt}} - Top 1,000,000 most frequent surface form and
part-of-speech tag combinations and their frequencies
There's also an attached tool, {{filter_stoptags.py}}, that reads a set of
stoptags and evaluates it against {{top-1000000-pos.txt}} to give us an idea
of what passes through any given stoptag set.
An example with my current stoptag set is given below.
{noformat}
filter_stoptags.py -s stoptags.txt top-1000000-pos.txt
stop: 、 freq: 14426806 pos: 記号-読点
stop: の freq: 14212851 pos: 助詞-連体化
stop: 。 freq: 10553747 pos: 記号-句点
stop: は freq: 8956177 pos: 助詞-係助詞
stop: に freq: 8757138 pos: 助詞-格助詞-一般
stop: を freq: 7723958 pos: 助詞-格助詞-一般
stop: freq: 7417005 pos: 記号-空白
stop: た freq: 7366368 pos: 助動詞
stop: が freq: 5427730 pos: 助詞-格助詞-一般
stop: て freq: 4874861 pos: 助詞-接続助詞
pass: し freq: 4312613 pos: 動詞-自立
stop: で freq: 3702106 pos: 助詞-格助詞-一般
stop: freq: 3485125 pos: 記号-空白
stop: ) freq: 3049861 pos: 記号-括弧閉
stop: ( freq: 3045461 pos: 記号-括弧開
pass: れ freq: 2722773 pos: 動詞-接尾
pass: さ freq: 2441965 pos: 動詞-自立
stop: で freq: 2403133 pos: 助動詞
stop: ・ freq: 2250725 pos: 記号-一般
stop: も freq: 1962142 pos: 助詞-係助詞
pass: する freq: 1959374 pos: 動詞-自立
pass: いる freq: 1937789 pos: 動詞-非自立
stop: と freq: 1927529 pos: 助詞-格助詞-引用
pass: 年 freq: 1796435 pos: 名詞-接尾-助数詞
stop: 「 freq: 1701848 pos: 記号-括弧開
stop: と freq: 1697926 pos: 助詞-格助詞-一般
stop: 」 freq: 1672052 pos: 記号-括弧閉
stop: から freq: 1414661 pos: 助詞-格助詞-一般
stop: ある freq: 1400235 pos: 助動詞
stop: freq: 1319235 pos: 記号-空白
pass: こと freq: 1272503 pos: 名詞-非自立-一般
stop: な freq: 1254673 pos: 助動詞
stop: が freq: 1110771 pos: 助詞-接続助詞
pass: の freq: 1037815 pos: 名詞-非自立-一般
stop: として freq: 1002940 pos: 助詞-格助詞-連語
stop: freq: 989166 pos: 記号-空白
pass: い freq: 923836 pos: 動詞-非自立
(...)
{noformat}
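For readers without the attachment, the following is a minimal sketch of how
such a stoptag evaluation could work. It is not the attached
{{filter_stoptags.py}}. It assumes that each TSV row holds the surface form,
the part-of-speech tag and the frequency in that order, that the stoptag file
has one tag per line with {{#}} comments, and that a tag stops a row only on
an exact match. The function names {{load_stoptags}} and {{classify}} are
hypothetical.

```python
def load_stoptags(lines):
    """Collect stoptags, skipping blank lines and '#' comments (assumed format)."""
    tags = set()
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            tags.add(line)
    return tags


def classify(rows, stoptags):
    """Yield (verdict, surface, freq, pos) for each TSV row.

    Assumed column order: surface form, part-of-speech tag, frequency.
    Only the trailing newline is removed, so whitespace surface forms survive.
    """
    for line in rows:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split from the right so a surface form is never broken apart.
        surface, pos, freq = line.rsplit("\t", 2)
        verdict = "stop" if pos in stoptags else "pass"
        yield verdict, surface, int(freq), pos


if __name__ == "__main__":
    stoptags = load_stoptags(["# punctuation and particles", "記号-読点", "助詞-連体化"])
    sample = ["、\t記号-読点\t14426806\n", "し\t動詞-自立\t4312613\n"]
    for verdict, surface, freq, pos in classify(sample, stoptags):
        print(f"{verdict}: {surface} freq: {freq} pos: {pos}")
```

Exact matching is the simplest choice here. A prefix match on the tag
hierarchy would be a reasonable alternative if whole categories should be
stopped with a single entry.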
> Need stopwords and stoptags lists for default Japanese configuration
> --------------------------------------------------------------------
>
> Key: LUCENE-3745
> URL: https://issues.apache.org/jira/browse/LUCENE-3745
> Project: Lucene - Java
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: Christian Moen
> Attachments: filter_stoptags.py, top-100000.txt, top-1000000-pos.txt,
> top-pos.txt
>
>
> Stopword and stoptag lists for Japanese need to be developed, tested and
> integrated into Lucene.