This is an automated email from the ASF dual-hosted git repository.
myui pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hivemall-site.git
The following commit(s) were added to refs/heads/asf-site by this push:
new c13e9d3 Updated tokenizer usages
c13e9d3 is described below
commit c13e9d34cbd85e7e076b2f76c250224ab2dbffdc
Author: Makoto Yui <[email protected]>
AuthorDate: Fri Apr 23 19:13:46 2021 +0900
Updated tokenizer usages
---
userguide/misc/funcs.html | 8 ++--
userguide/misc/tokenizer.html | 98 ++++++++++++++++++++++++++++++++++---------
2 files changed, 83 insertions(+), 23 deletions(-)
diff --git a/userguide/misc/funcs.html b/userguide/misc/funcs.html
index 2404931..d92bf2f 100644
--- a/userguide/misc/funcs.html
+++ b/userguide/misc/funcs.html
@@ -3414,19 +3414,19 @@ limit 100;
</li>
<li><p><code>tokenize_cn(String line [, const list<string> stopWords])</code> - returns tokenized strings in array<string></p>
</li>
-<li><p><code>tokenize_ja(String line [, const string mode = "normal", const array<string> stopWords, const array<string> stopTags, const array<string> userDict (or string userDictURL)</code>]) - returns tokenized strings in array<string></p>
+<li><p><code>tokenize_ja(String line [, const string mode = "normal", const array<string> stopWords, const array<string> stopTags, const array<string> userDict (or const string userDictURL)</code>]) - returns tokenized strings in array<string></p>
<pre><code class="lang-sql">select tokenize_ja("kuromojiを使った分かち書きのテストです。第二引数にはnormal/search/extendedを指定できます。デフォルトではnormalモードです。");
> ["kuromoji","使う","分かち書き","テスト","第","二","引数","normal","search","extended","指定","デフォルト","normal","モード"]
</code></pre>
</li>
-<li><p><code>tokenize_ja_neologd(String line [, const string mode = "normal", const array<string> stopWords, const array<string> stopTags, const array<string> userDict (or string userDictURL)</code>]) - returns tokenized strings in array<string></p>
+<li><p><code>tokenize_ja_neologd(String line [, const string mode = "normal", const array<string> stopWords, const array<string> stopTags, const array<string> userDict (or const string userDictURL)</code>]) - returns tokenized strings in array<string></p>
<pre><code class="lang-sql">select tokenize_ja_neologd("kuromojiを使った分かち書きのテストです。第二引数にはnormal/search/extendedを指定できます。デフォルトではnormalモードです。");
> ["kuromoji","使う","分かち書き","テスト","第","二","引数","normal","search","extended","指定","デフォルト","normal","モード"]
</code></pre>
</li>
-<li><p><code>tokenize_ko(String line [, const array<string> userDict, const string mode = "discard", const array<string> stopTags, boolean outputUnknownUnigrams])</code> - returns tokenized strings in array<string></p>
+<li><p><code>tokenize_ko(String line [, const string mode = "discard" (or const string opts)</code>, const array<string> stopWords, const array<string> stopTags, const array<string> userDict (or const string userDictURL)]) - returns tokenized strings in array<string></p>
<pre><code class="lang-sql">select tokenize_ko("소설 무궁화꽃이 피었습니다.");
> ["소설","무궁","화","꽃","피"]
@@ -3505,7 +3505,7 @@ Apache Hivemall is an effort undergoing incubation at The Apache Software Founda
<script>
var gitbook = gitbook || [];
gitbook.push(function() {
- gitbook.page.hasChanged({"page":{"title":"List of
Functions","level":"1.3","depth":1,"next":{"title":"Tips for Effective
Hivemall","level":"1.4","depth":1,"path":"tips/README.md","ref":"tips/README.md","articles":[{"title":"Explicit
add_bias() for better
prediction","level":"1.4.1","depth":2,"path":"tips/addbias.md","ref":"tips/addbias.md","articles":[]},{"title":"Use
rand_amplify() to better prediction
results","level":"1.4.2","depth":2,"path":"tips/rand_amplify.md","ref":"t [...]
+ gitbook.page.hasChanged({"page":{"title":"List of
Functions","level":"1.3","depth":1,"next":{"title":"Tips for Effective
Hivemall","level":"1.4","depth":1,"path":"tips/README.md","ref":"tips/README.md","articles":[{"title":"Explicit
add_bias() for better
prediction","level":"1.4.1","depth":2,"path":"tips/addbias.md","ref":"tips/addbias.md","articles":[]},{"title":"Use
rand_amplify() to better prediction
results","level":"1.4.2","depth":2,"path":"tips/rand_amplify.md","ref":"t [...]
});
</script>
</div>
diff --git a/userguide/misc/tokenizer.html b/userguide/misc/tokenizer.html
index ede3604..bf1abf0 100644
--- a/userguide/misc/tokenizer.html
+++ b/userguide/misc/tokenizer.html
@@ -2391,10 +2391,16 @@
<ul>
<li><a href="#tokenizer-for-english-texts">Tokenizer for English Texts</a></li>
<li><a href="#tokenizer-for-non-english-texts">Tokenizer for Non-English Texts</a><ul>
-<li><a href="#japanese-tokenizer">Japanese Tokenizer</a></li>
+<li><a href="#japanese-tokenizer">Japanese Tokenizer</a><ul>
+<li><a href="#custom-dictionary">Custom dictionary</a></li>
<li><a href="#part-of-speech">Part-of-speech</a></li>
+</ul>
+</li>
<li><a href="#chinese-tokenizer">Chinese Tokenizer</a></li>
-<li><a href="#korean-tokenizer">Korean Tokenizer</a></li>
+<li><a href="#korean-tokenizer">Korean Tokenizer</a><ul>
+<li><a href="#custom-dictionary-1">Custom dictionary</a></li>
+</ul>
+</li>
</ul>
</li>
</ul>
@@ -2463,6 +2469,7 @@ select tokenize_ja_neologd();
詞-形容詞接続","接頭詞-数接","未知語","記号","記号-アルファベット","記号-一般","記号-句点","記号-括弧閉
","記号-括弧開","記号-空白","記号-読点","語断片","連体詞","非言語音"]</p>
</blockquote>
+<h3 id="custom-dictionary">Custom dictionary</h3>
<p>Moreover, the fifth argument <code>userDict</code> enables you to register a user-defined custom dictionary in the <a href="https://github.com/atilika/kuromoji/blob/909fd6b32bf4e9dc86b7599de5c9b50ca8f004a1/kuromoji-core/src/test/resources/userdict.txt" target="_blank">Kuromoji official format</a>:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">select</span> tokenize_ja(<span class="hljs-string">"日本経済新聞&関西国際空港"</span>, <span class="hljs-string">"normal"</span>, <span class="hljs-literal">null</span>, <span class="hljs-literal">null</span>, <span class="hljs-built_in">array</span>(
@@ -2482,7 +2489,7 @@ select tokenize_ja_neologd();
</code></pre>
<div class="panel panel-primary"><div class="panel-heading"><h3 class="panel-title" id="note"><i class="fa fa-edit"></i> Note</h3></div><div class="panel-body"><p>The dictionary SHOULD be accessible through the http/https protocol and SHOULD be compressed using gzip with a <code>.gz</code> suffix, because the maximum dictionary size is limited to 32MB, the read timeout is set to 60 sec, and the connection must be established within 10 sec.</p><p>If you want to use HTTP Basic Authentication, please us [...]
<p>For detailed APIs, please also refer to the Javadoc of <a href="https://lucene.apache.org/core/5_3_1/analyzers-kuromoji/org/apache/lucene/analysis/ja/JapaneseAnalyzer.html" target="_blank">JapaneseAnalyzer</a>.</p>
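<p>To illustrate the <code>userDictURL</code> variant described in the note above, the custom dictionary can also be loaded from a remote site; a minimal sketch, where the URL is a hypothetical placeholder for a gzip-compressed dictionary in the Kuromoji format:</p>
<pre><code class="lang-sql">-- hypothetical URL; it must point to a gzip-compressed (.gz) dictionary served over http/https
select tokenize_ja('日本経済新聞&関西国際空港', 'normal', null, null, 'https://example.com/kuromoji-userdict.txt.gz');
</code></pre>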
-<h2 id="part-of-speech">Part-of-speech</h2>
+<h3 id="part-of-speech">Part-of-speech</h3>
<p>From Hivemall v0.6.0, the second argument can also accept the following
option format:</p>
<pre><code> -mode <arg>    The tokenization mode. One of ['normal', 'search',
                'extended', 'default' (normal)]
@@ -2533,12 +2540,27 @@ select tokenize_ja_neologd();
<h2 id="korean-tokenizer">Korean Tokenizer</h2>
<p>The Korean tokenizer internally uses <a href="https://lucene.apache.org/core/8_8_2/analyzers-nori/org/apache/lucene/analysis/ko/KoreanAnalyzer.html" target="_blank">lucene-analyzers-nori</a> for tokenization.</p>
<p>The signature of the UDF is as follows:</p>
-<pre><code class="lang-sql">tokenize_ko(String line [,
- const array<string> userDict,
- const string mode = "discard",
- const array<string> stopTags,
- boolean outputUnknownUnigrams
- ]) - returns tokenized strings in array<string>
+<pre><code class="lang-sql">tokenize_ko(
+    String line [, const string mode = "discard" (or const string opts),
+    const array<string> stopWords,
+    const array<string> stopTags,
+    const array<string> userDict (or const string userDictURL)]
+) - returns tokenized strings in array<string>
+</code></pre>
+<div class="panel panel-primary"><div class="panel-heading"><h3 class="panel-title" id="note"><i class="fa fa-edit"></i> Note</h3></div><div class="panel-body"><p>Instead of a mode string, the 2nd argument can also take options starting with <code>-</code>.</p></div></div>
+<p>You can get the usage as follows:</p>
+<pre><code class="lang-sql">select tokenize_ko("", "-help");
+
+usage: tokenize_ko(String line [, const string mode = "discard" (or const
+       string opts), const array<string> stopWords, const array<string>
+       stopTags, const array<string> userDict (or const string
+       userDictURL)]) - returns tokenized strings in array<string> [-help]
+       [-mode <arg>] [-outputUnknownUnigrams]
+ -help                    Show function help
+ -mode <arg>              The tokenization mode. One of ['none', 'discard'
+                          (default), 'mixed']
+ -outputUnknownUnigrams   outputs unigrams for unknown words.
</code></pre>
<div class="panel panel-primary"><div class="panel-heading"><h3 class="panel-title" id="note"><i class="fa fa-edit"></i> Note</h3></div><div class="panel-body"><p>For detailed options, please refer to the <a href="https://lucene.apache.org/core/8_8_2/analyzers-nori/org/apache/lucene/analysis/ko/KoreanAnalyzer.html" target="_blank">Lucene API documentation</a>. <code>none</code>, <code>discard</code> (default), or <code>mixed</code> are supported for the mode argument.</p></div></div>
<p>See the following examples for usage.</p>
@@ -2546,27 +2568,65 @@ select tokenize_ja_neologd();
select tokenize_ko();
> 8.8.2
-select tokenize_ko("소설 무궁화꽃이 피었습니다.");
+select tokenize_ko('소설 무궁화꽃이 피었습니다.');
> ["소설","무궁","화","꽃","피"]
-select tokenize_ko("소설 무궁화꽃이 피었습니다.", null, "mixed");
+select tokenize_ko('소설 무궁화꽃이 피었습니다.', '-mode discard');
+> ["소설","무궁","화","꽃","피"]
+
+select tokenize_ko('소설 무궁화꽃이 피었습니다.', 'mixed');
> ["소설","무궁화","무궁","화","꽃","피"]
-select tokenize_ko("소설 무궁화꽃이 피었습니다.", null, "discard", array("E", "VV"));
-> ["소설","무궁","화","꽃","이"]
+select tokenize_ko('소설 무궁화꽃이 피었습니다.', '-mode mixed');
+> ["소설","무궁화","무궁","화","꽃","피"]
-select tokenize_ko("Hello, world.", null, "none", array(), true);
-> ["h","e","l","l","o","w","o","r","l","d"]
+select tokenize_ko('소설 무궁화꽃이 피었습니다.', '-mode none');
+> ["소설","무궁화","꽃","피"]
-select tokenize_ko("Hello, world.", null, "none", array(), false);
+select tokenize_ko('Hello, world.', '-mode none');
> ["hello","world"]
-select tokenize_ko("나는 C++ 언어를 프로그래밍 언어로 사랑한다.", null, "discard", array());
+select tokenize_ko('Hello, world.', '-mode none -outputUnknownUnigrams');
+> ["h","e","l","l","o","w","o","r","l","d"]
+
+-- default stopword (null), with stoptags
+select tokenize_ko('소설 무궁화꽃이 피었습니다.', 'discard', null, array('E'));
+> ["소설","무궁","화","꽃","이","피"]
+
+select tokenize_ko('소설 무궁화꽃이 피었습니다.', 'discard', null, array('E', 'VV'));
+> ["소설","무궁","화","꽃","이"]
+
+select tokenize_ko('나는 C++ 언어를 프로그래밍 언어로 사랑한다.', '-mode discard');
+> ["나","c","언어","프로그래밍","언어","사랑"]
+
+select tokenize_ko('나는 C++ 언어를 프로그래밍 언어로 사랑한다.', '-mode discard', array(), null);
> ["나","는","c","언어","를","프로그래밍","언어","로","사랑","하","ᆫ다"]
-select tokenize_ko("나는 C++ 언어를 프로그래밍 언어로 사랑한다.", array("C++"), "discard", array());
+-- default stopword (null), default stoptags (null)
+select tokenize_ko('나는 C++ 언어를 프로그래밍 언어로 사랑한다.', '-mode discard');
+select tokenize_ko('나는 C++ 언어를 프로그래밍 언어로 사랑한다.', '-mode discard', null, null);
+> ["나","c","언어","프로그래밍","언어","사랑"]
+
+-- no stopword (empty array), default stoptags (null)
+select tokenize_ko('나는 C++ 언어를 프로그래밍 언어로 사랑한다.', '-mode discard', array());
+select tokenize_ko('나는 C++ 언어를 프로그래밍 언어로 사랑한다.', '-mode discard', array(), null);
+> ["나","c","언어","프로그래밍","언어","사랑"]
+
+-- no stopword (empty array), no stoptags (empty array), custom dict
+select tokenize_ko('나는 C++ 언어를 프로그래밍 언어로 사랑한다.', '-mode discard', array(), array(), array('C++'));
> ["나","는","c++","언어","를","프로그래밍","언어","로","사랑","하","ᆫ다"]
+
+-- default stopword (null), default stoptags (null), custom dict
+select tokenize_ko('나는 C++ 언어를 프로그래밍 언어로 사랑한다.', '-mode discard', null, null, array('C++'));
+> ["나","c++","언어","프로그래밍","언어","사랑"]
+</code></pre>
+<h3 id="custom-dictionary-1">Custom dictionary</h3>
+<p>Moreover, the fifth argument <code>userDictURL</code> enables you to register a user-defined custom dictionary hosted on an http/https-accessible external site. The dictionary format follows <a href="https://raw.githubusercontent.com/apache/lucene/main/lucene/analysis/nori/src/test/org/apache/lucene/analysis/ko/userdict.txt" target="_blank">Lucene's userdict.txt</a>.</p>
+<pre><code class="lang-sql">select tokenize_ko('나는 c++ 프로그래밍을 즐긴다.', '-mode discard', null, null, 'https://raw.githubusercontent.com/apache/lucene/main/lucene/analysis/nori/src/test/org/apache/lucene/analysis/ko/userdict.txt');
+
+> ["나","c++","프로그래밍","즐기"]
</code></pre>
<div class="panel panel-primary"><div class="panel-heading"><h3 class="panel-title" id="note"><i class="fa fa-edit"></i> Note</h3></div><div class="panel-body"><p>The dictionary SHOULD be accessible through the http/https protocol and SHOULD be compressed using gzip with a <code>.gz</code> suffix, because the maximum dictionary size is limited to 32MB, the read timeout is set to 60 sec, and the connection must be established within 10 sec.</p></div></div>
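<p>Per the constraints in the note above, the user dictionary can also be served gzip-compressed; a minimal sketch, where the URL is a hypothetical placeholder:</p>
<pre><code class="lang-sql">-- hypothetical URL; the dictionary must be gzip-compressed (.gz), at most 32MB,
-- and served over http/https
select tokenize_ko('나는 c++ 프로그래밍을 즐긴다.', '-mode discard', null, null, 'https://example.com/nori-userdict.txt.gz');
</code></pre>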
<p><div id="page-footer" class="localized-footer"><hr><!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
@@ -2622,7 +2682,7 @@ Apache Hivemall is an effort undergoing incubation at The Apache Software Founda
<script>
var gitbook = gitbook || [];
gitbook.push(function() {
- gitbook.page.hasChanged({"page":{"title":"Text
Tokenizer","level":"2.3","depth":1,"next":{"title":"Approximate Aggregate
Functions","level":"2.4","depth":1,"path":"misc/approx.md","ref":"misc/approx.md","articles":[]},"previous":{"title":"Efficient
Top-K Query
Processing","level":"2.2","depth":1,"path":"misc/topk.md","ref":"misc/topk.md","articles":[]},"dir":"ltr"},"config":{"plugins":["theme-api","edit-link","github","splitter","etoc","callouts","toggle-chapters","anchorjs",
[...]
+ gitbook.page.hasChanged({"page":{"title":"Text
Tokenizer","level":"2.3","depth":1,"next":{"title":"Approximate Aggregate
Functions","level":"2.4","depth":1,"path":"misc/approx.md","ref":"misc/approx.md","articles":[]},"previous":{"title":"Efficient
Top-K Query
Processing","level":"2.2","depth":1,"path":"misc/topk.md","ref":"misc/topk.md","articles":[]},"dir":"ltr"},"config":{"plugins":["theme-api","edit-link","github","splitter","etoc","callouts","toggle-chapters","anchorjs",
[...]
});
</script>
</div>