Added: lucy/site/trunk/content/docs/c/Lucy/Analysis/Normalizer.mdtext URL: http://svn.apache.org/viewvc/lucy/site/trunk/content/docs/c/Lucy/Analysis/Normalizer.mdtext?rev=1737682&view=auto ============================================================================== --- lucy/site/trunk/content/docs/c/Lucy/Analysis/Normalizer.mdtext (added) +++ lucy/site/trunk/content/docs/c/Lucy/Analysis/Normalizer.mdtext Mon Apr 4 12:55:10 2016 @@ -0,0 +1,157 @@ +Title: Lucy::Analysis::Normalizer — C API Documentation + +<div class="c-api"> +<h2>Lucy::Analysis::Normalizer</h2> +<table> +<tr> +<td class="label">parcel</td> +<td><a href="../../lucy.html">Lucy</a></td> +</tr> +<tr> +<td class="label">class variable</td> +<td><code><span class="prefix">LUCY_</span>NORMALIZER</code></td> +</tr> +<tr> +<td class="label">struct symbol</td> +<td><code><span class="prefix">lucy_</span>Normalizer</code></td> +</tr> +<tr> +<td class="label">class nickname</td> +<td><code><span class="prefix">lucy_</span>Normalizer</code></td> +</tr> +<tr> +<td class="label">header file</td> +<td><code>Lucy/Analysis/Normalizer.h</code></td> +</tr> +</table> +<h3>Name</h3> +<p>Lucy::Analysis::Normalizer — Unicode normalization, case folding and accent stripping.</p> +<h3>Description</h3> +<p>Normalizer is an <a href="../../Lucy/Analysis/Analyzer.html">Analyzer</a> which normalizes +tokens to one of the Unicode normalization forms. Optionally, it +performs Unicode case folding and converts accented characters to their +base character.</p> +<p>If you use highlighting, Normalizer should be run after tokenization +because it might add or remove characters.</p> +<h3>Functions</h3> +<dl> +<dt id="func_new">new</dt> +<dd> +<pre><code><span class="prefix">lucy_</span>Normalizer* <span class="comment">// incremented</span> +<span class="prefix">lucy_</span><strong>Normalizer_new</strong>( + <span class="prefix">cfish_</span><a href="../../Clownfish/String.html">String</a> *<strong>normalization_form</strong>, + bool <strong>case_fold</strong>, + bool <strong>strip_accents</strong> +); +</code></pre> +<p>Create a new Normalizer.</p> +<dl> +<dt>normalization_form</dt> +<dd><p>Unicode normalization form, can be one of +“NFC”, “NFKC”, “NFD”, “NFKD”. Defaults to “NFKC”.</p> +</dd> +<dt>case_fold</dt> +<dd><p>Perform case folding, default is true.</p> +</dd> +<dt>strip_accents</dt> +<dd><p>Strip accents, default is false.</p> +</dd> +</dl> +</dd> +<dt id="func_init">init</dt> +<dd> +<pre><code><span class="prefix">lucy_</span>Normalizer* +<span class="prefix">lucy_</span><strong>Normalizer_init</strong>( + <span class="prefix">lucy_</span>Normalizer *<strong>self</strong>, + <span class="prefix">cfish_</span><a href="../../Clownfish/String.html">String</a> *<strong>normalization_form</strong>, + bool <strong>case_fold</strong>, + bool <strong>strip_accents</strong> +); +</code></pre> +<p>Initialize a Normalizer.</p> +<dl> +<dt>normalization_form</dt> +<dd><p>Unicode normalization form, can be one of +“NFC”, “NFKC”, “NFD”, “NFKD”. Defaults to “NFKC”.</p> +</dd> +<dt>case_fold</dt> +<dd><p>Perform case folding, default is true.</p> +</dd> +<dt>strip_accents</dt> +<dd><p>Strip accents, default is false.</p> +</dd> +</dl> +</dd> +</dl> +<h3>Methods</h3> +<dl> +<dt id="func_Transform">Transform</dt> +<dd> +<pre><code><span class="prefix">lucy_</span><a href="../../Lucy/Analysis/Inversion.html">Inversion</a>* <span class="comment">// incremented</span> +<span class="prefix">lucy_</span><strong>Normalizer_Transform</strong>( + <span class="prefix">lucy_</span>Normalizer *<strong>self</strong>, + <span class="prefix">lucy_</span><a href="../../Lucy/Analysis/Inversion.html">Inversion</a> *<strong>inversion</strong> +); +</code></pre> +<p>Takes a single <a href="../../Lucy/Analysis/Inversion.html">Inversion</a> as input +and returns an Inversion, either the same one (presumably transformed +in some way), or a new one.</p> +<dl> +<dt>inversion</dt> +<dd><p>An inversion.</p> +</dd> +</dl> +</dd> +<dt id="func_Dump">Dump</dt> +<dd> +<pre><code><span class="prefix">cfish_</span><a href="../../Clownfish/Hash.html">Hash</a>* <span class="comment">// incremented</span> +<span class="prefix">lucy_</span><strong>Normalizer_Dump</strong>( + <span class="prefix">lucy_</span>Normalizer *<strong>self</strong> +); +</code></pre> +<p>Dump the analyzer as hash.</p> +<p>Subclasses should call <a href="../../Lucy/Analysis/Normalizer.html#func_Dump">Dump()</a> on the superclass.
The returned +object is a hash which should be populated with parameters of +the analyzer.</p> +<p><strong>Returns:</strong> A hash containing a description of the analyzer.</p> +</dd> +<dt id="func_Load">Load</dt> +<dd> +<pre><code><span class="prefix">lucy_</span>Normalizer* <span class="comment">// incremented</span> +<span class="prefix">lucy_</span><strong>Normalizer_Load</strong>( + <span class="prefix">lucy_</span>Normalizer *<strong>self</strong>, + <span class="prefix">cfish_</span><a href="../../Clownfish/Obj.html">Obj</a> *<strong>dump</strong> +); +</code></pre> +<p>Reconstruct an analyzer from a dump.</p> +<p>Subclasses should first call <a href="../../Lucy/Analysis/Normalizer.html#func_Load">Load()</a> on the superclass. The +returned object is an analyzer which should be reconstructed by +setting the dumped parameters from the hash contained in <code>dump</code>.</p> +<p>Note that the invocant analyzer is unused.</p> +<dl> +<dt>dump</dt> +<dd><p>A hash.</p> +</dd> +</dl> +<p><strong>Returns:</strong> An analyzer.</p> +</dd> +<dt id="func_Equals">Equals</dt> +<dd> +<pre><code>bool +<span class="prefix">lucy_</span><strong>Normalizer_Equals</strong>( + <span class="prefix">lucy_</span>Normalizer *<strong>self</strong>, + <span class="prefix">cfish_</span><a href="../../Clownfish/Obj.html">Obj</a> *<strong>other</strong> +); +</code></pre> +<p>Indicate whether two objects are the same. By default, compares the +memory address.</p> +<dl> +<dt>other</dt> +<dd><p>Another Obj.</p> +</dd> +</dl> +</dd> +</dl> +<h3>Inheritance</h3> +<p>Lucy::Analysis::Normalizer is a <a href="../../Lucy/Analysis/Analyzer.html">Lucy::Analysis::Analyzer</a> is a <a href="../../Clownfish/Obj.html">Clownfish::Obj</a>.</p> +</div>
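The constructor documented above can be exercised with a short C program. This is a minimal sketch, not an authoritative example: the documented `lucy_Normalizer_new()` signature is taken from this page, but the Clownfish string constructor `cfish_Str_newf()`, the `CFISH_DECREF` macro, and the `lucy_bootstrap_parcel()` initializer are assumptions that may vary by Clownfish/Lucy version.

```c
/* Sketch: construct a Normalizer with explicit arguments.
 * cfish_Str_newf(), CFISH_DECREF, and lucy_bootstrap_parcel() are
 * assumed Clownfish/Lucy helpers -- verify against your installed headers. */
#include <stdbool.h>
#include "Clownfish/String.h"
#include "Lucy/Analysis/Normalizer.h"

int main(void) {
    lucy_bootstrap_parcel();  /* initialize the Lucy parcel before any class use */

    cfish_String *form = cfish_Str_newf("NFKC");
    /* case_fold = true, strip_accents = false (the documented defaults) */
    lucy_Normalizer *normalizer = lucy_Normalizer_new(form, true, false);

    /* ... hand the analyzer to a FullTextType/Schema or analysis chain here ... */

    CFISH_DECREF(normalizer);
    CFISH_DECREF(form);
    return 0;
}
```

Since `lucy_Normalizer_new()` returns an incremented reference (per the `// incremented` annotation above), the caller owns the object and is responsible for the final decref.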
Added: lucy/site/trunk/content/docs/c/Lucy/Analysis/PolyAnalyzer.mdtext URL: http://svn.apache.org/viewvc/lucy/site/trunk/content/docs/c/Lucy/Analysis/PolyAnalyzer.mdtext?rev=1737682&view=auto ============================================================================== --- lucy/site/trunk/content/docs/c/Lucy/Analysis/PolyAnalyzer.mdtext (added) +++ lucy/site/trunk/content/docs/c/Lucy/Analysis/PolyAnalyzer.mdtext Mon Apr 4 12:55:10 2016 @@ -0,0 +1,216 @@ +Title: Lucy::Analysis::PolyAnalyzer — C API Documentation + +<div class="c-api"> +<h2>Lucy::Analysis::PolyAnalyzer</h2> +<table> +<tr> +<td class="label">parcel</td> +<td><a href="../../lucy.html">Lucy</a></td> +</tr> +<tr> +<td class="label">class variable</td> +<td><code><span class="prefix">LUCY_</span>POLYANALYZER</code></td> +</tr> +<tr> +<td class="label">struct symbol</td> +<td><code><span class="prefix">lucy_</span>PolyAnalyzer</code></td> +</tr> +<tr> +<td class="label">class nickname</td> +<td><code><span class="prefix">lucy_</span>PolyAnalyzer</code></td> +</tr> +<tr> +<td class="label">header file</td> +<td><code>Lucy/Analysis/PolyAnalyzer.h</code></td> +</tr> +</table> +<h3>Name</h3> +<p>Lucy::Analysis::PolyAnalyzer — Multiple Analyzers in series.</p> +<h3>Description</h3> +<p>A PolyAnalyzer is a series of <a href="../../Lucy/Analysis/Analyzer.html">Analyzers</a>, +each of which will be called upon to “analyze” text in turn. You can +either provide the Analyzers yourself, or you can specify a supported +language, in which case a PolyAnalyzer consisting of a +<a href="../../Lucy/Analysis/CaseFolder.html">CaseFolder</a>, a +<a href="../../Lucy/Analysis/RegexTokenizer.html">RegexTokenizer</a>, and a +<a href="../../Lucy/Analysis/SnowballStemmer.html">SnowballStemmer</a> will be generated for you.</p> +<p>The language parameter is DEPRECATED. Use +<a href="../../Lucy/Analysis/EasyAnalyzer.html">EasyAnalyzer</a> instead.</p> +<p>Supported languages:</p> +<pre><code>en => English, +da => Danish, +de => German, +es => Spanish, +fi => Finnish, +fr => French, +hu => Hungarian, +it => Italian, +nl => Dutch, +no => Norwegian, +pt => Portuguese, +ro => Romanian, +ru => Russian, +sv => Swedish, +tr => Turkish, +</code></pre> +<h3>Functions</h3> +<dl> +<dt id="func_new">new</dt> +<dd> +<pre><code><span class="prefix">lucy_</span>PolyAnalyzer* <span class="comment">// incremented</span> +<span class="prefix">lucy_</span><strong>PolyAnalyzer_new</strong>( + <span class="prefix">cfish_</span><a href="../../Clownfish/String.html">String</a> *<strong>language</strong>, + <span class="prefix">cfish_</span><a href="../../Clownfish/Vector.html">Vector</a> *<strong>analyzers</strong> +); +</code></pre> +<p>Create a new PolyAnalyzer.</p> +<dl> +<dt>language</dt> +<dd><p>An ISO code from the list of supported languages. +DEPRECATED, use <a href="../../Lucy/Analysis/EasyAnalyzer.html">EasyAnalyzer</a> instead.</p> +</dd> +<dt>analyzers</dt> +<dd><p>An array of Analyzers. The order of the analyzers +matters. Don’t put a SnowballStemmer before a RegexTokenizer (can’t stem whole +documents or paragraphs — just individual words), or a SnowballStopFilter +after a SnowballStemmer (stemmed words, e.g. “themselv”, will not appear in a +stoplist). In general, the sequence should be: tokenize, normalize, +stopalize, stem.</p> +</dd> +</dl> +</dd> +<dt id="func_init">init</dt> +<dd> +<pre><code><span class="prefix">lucy_</span>PolyAnalyzer* +<span class="prefix">lucy_</span><strong>PolyAnalyzer_init</strong>( + <span class="prefix">lucy_</span>PolyAnalyzer *<strong>self</strong>, + <span class="prefix">cfish_</span><a href="../../Clownfish/String.html">String</a> *<strong>language</strong>, + <span class="prefix">cfish_</span><a href="../../Clownfish/Vector.html">Vector</a> *<strong>analyzers</strong> +); +</code></pre> +<p>Initialize a PolyAnalyzer.</p> +<dl> +<dt>language</dt> +<dd><p>An ISO code from the list of supported languages. +DEPRECATED, use <a href="../../Lucy/Analysis/EasyAnalyzer.html">EasyAnalyzer</a> instead.</p> +</dd> +<dt>analyzers</dt> +<dd><p>An array of Analyzers. The order of the analyzers +matters. Don’t put a SnowballStemmer before a RegexTokenizer (can’t stem whole +documents or paragraphs — just individual words), or a SnowballStopFilter +after a SnowballStemmer (stemmed words, e.g. “themselv”, will not appear in a +stoplist). In general, the sequence should be: tokenize, normalize, +stopalize, stem.</p> +</dd> +</dl> +</dd> +</dl> +<h3>Methods</h3> +<dl> +<dt id="func_Get_Analyzers">Get_Analyzers</dt> +<dd> +<pre><code><span class="prefix">cfish_</span><a href="../../Clownfish/Vector.html">Vector</a>* +<span class="prefix">lucy_</span><strong>PolyAnalyzer_Get_Analyzers</strong>( + <span class="prefix">lucy_</span>PolyAnalyzer *<strong>self</strong> +); +</code></pre> +<p>Getter for “analyzers” member.</p> +</dd> +<dt id="func_Transform">Transform</dt> +<dd> +<pre><code><span class="prefix">lucy_</span><a href="../../Lucy/Analysis/Inversion.html">Inversion</a>* <span class="comment">// incremented</span> +<span class="prefix">lucy_</span><strong>PolyAnalyzer_Transform</strong>( + <span class="prefix">lucy_</span>PolyAnalyzer *<strong>self</strong>, + <span class="prefix">lucy_</span><a href="../../Lucy/Analysis/Inversion.html">Inversion</a> *<strong>inversion</strong> +); +</code></pre> +<p>Takes a single <a href="../../Lucy/Analysis/Inversion.html">Inversion</a> as input +and returns an Inversion, either the same one (presumably transformed +in some way), or a new one.</p> +<dl> +<dt>inversion</dt> +<dd><p>An inversion.</p> +</dd> +</dl> +</dd> +<dt id="func_Transform_Text">Transform_Text</dt> +<dd> +<pre><code><span class="prefix">lucy_</span><a href="../../Lucy/Analysis/Inversion.html">Inversion</a>* <span class="comment">// incremented</span> +<span class="prefix">lucy_</span><strong>PolyAnalyzer_Transform_Text</strong>( + <span class="prefix">lucy_</span>PolyAnalyzer *<strong>self</strong>, + <span class="prefix">cfish_</span><a href="../../Clownfish/String.html">String</a> *<strong>text</strong> +); +</code></pre> +<p>Kick off an analysis chain, creating an Inversion from string input.
+The default implementation simply creates an initial Inversion with a +single Token, then calls <a href="../../Lucy/Analysis/PolyAnalyzer.html#func_Transform">Transform()</a>, but occasionally subclasses will +provide an optimized implementation which minimizes string copies.</p> +<dl> +<dt>text</dt> +<dd><p>A string.</p> +</dd> +</dl> +</dd> +<dt id="func_Equals">Equals</dt> +<dd> +<pre><code>bool +<span class="prefix">lucy_</span><strong>PolyAnalyzer_Equals</strong>( + <span class="prefix">lucy_</span>PolyAnalyzer *<strong>self</strong>, + <span class="prefix">cfish_</span><a href="../../Clownfish/Obj.html">Obj</a> *<strong>other</strong> +); +</code></pre> +<p>Indicate whether two objects are the same. By default, compares the +memory address.</p> +<dl> +<dt>other</dt> +<dd><p>Another Obj.</p> +</dd> +</dl> +</dd> +<dt id="func_Dump">Dump</dt> +<dd> +<pre><code><span class="prefix">cfish_</span><a href="../../Clownfish/Obj.html">Obj</a>* <span class="comment">// incremented</span> +<span class="prefix">lucy_</span><strong>PolyAnalyzer_Dump</strong>( + <span class="prefix">lucy_</span>PolyAnalyzer *<strong>self</strong> +); +</code></pre> +<p>Dump the analyzer as hash.</p> +<p>Subclasses should call <a href="../../Lucy/Analysis/PolyAnalyzer.html#func_Dump">Dump()</a> on the superclass. 
The returned +object is a hash which should be populated with parameters of +the analyzer.</p> +<p><strong>Returns:</strong> A hash containing a description of the analyzer.</p> +</dd> +<dt id="func_Load">Load</dt> +<dd> +<pre><code><span class="prefix">lucy_</span>PolyAnalyzer* <span class="comment">// incremented</span> +<span class="prefix">lucy_</span><strong>PolyAnalyzer_Load</strong>( + <span class="prefix">lucy_</span>PolyAnalyzer *<strong>self</strong>, + <span class="prefix">cfish_</span><a href="../../Clownfish/Obj.html">Obj</a> *<strong>dump</strong> +); +</code></pre> +<p>Reconstruct an analyzer from a dump.</p> +<p>Subclasses should first call <a href="../../Lucy/Analysis/PolyAnalyzer.html#func_Load">Load()</a> on the superclass. The +returned object is an analyzer which should be reconstructed by +setting the dumped parameters from the hash contained in <code>dump</code>.</p> +<p>Note that the invocant analyzer is unused.</p> +<dl> +<dt>dump</dt> +<dd><p>A hash.</p> +</dd> +</dl> +<p><strong>Returns:</strong> An analyzer.</p> +</dd> +<dt id="func_Destroy">Destroy</dt> +<dd> +<pre><code>void +<span class="prefix">lucy_</span><strong>PolyAnalyzer_Destroy</strong>( + <span class="prefix">lucy_</span>PolyAnalyzer *<strong>self</strong> +); +</code></pre> +<p>Generic destructor. 
Frees the struct itself but not any complex +member elements.</p> +</dd> +</dl> +<h3>Inheritance</h3> +<p>Lucy::Analysis::PolyAnalyzer is a <a href="../../Lucy/Analysis/Analyzer.html">Lucy::Analysis::Analyzer</a> is a <a href="../../Clownfish/Obj.html">Clownfish::Obj</a>.</p> +</div> Added: lucy/site/trunk/content/docs/c/Lucy/Analysis/RegexTokenizer.mdtext URL: http://svn.apache.org/viewvc/lucy/site/trunk/content/docs/c/Lucy/Analysis/RegexTokenizer.mdtext?rev=1737682&view=auto ============================================================================== --- lucy/site/trunk/content/docs/c/Lucy/Analysis/RegexTokenizer.mdtext (added) +++ lucy/site/trunk/content/docs/c/Lucy/Analysis/RegexTokenizer.mdtext Mon Apr 4 12:55:10 2016 @@ -0,0 +1,191 @@ +Title: Lucy::Analysis::RegexTokenizer — C API Documentation + +<div class="c-api"> +<h2>Lucy::Analysis::RegexTokenizer</h2> +<table> +<tr> +<td class="label">parcel</td> +<td><a href="../../lucy.html">Lucy</a></td> +</tr> +<tr> +<td class="label">class variable</td> +<td><code><span class="prefix">LUCY_</span>REGEXTOKENIZER</code></td> +</tr> +<tr> +<td class="label">struct symbol</td> +<td><code><span class="prefix">lucy_</span>RegexTokenizer</code></td> +</tr> +<tr> +<td class="label">class nickname</td> +<td><code><span class="prefix">lucy_</span>RegexTokenizer</code></td> +</tr> +<tr> +<td class="label">header file</td> +<td><code>Lucy/Analysis/RegexTokenizer.h</code></td> +</tr> +</table> +<h3>Name</h3> +<p>Lucy::Analysis::RegexTokenizer — Split a string into tokens.</p> +<h3>Description</h3> +<p>Generically, “tokenizing” is a process of breaking up a string into an +array of “tokens”. For instance, the string “three blind mice” might be +tokenized into “three”, “blind”, “mice”.</p> +<p>Lucy::Analysis::RegexTokenizer decides where it should break up the text +based on a regular expression compiled from a supplied <code>pattern</code> +matching one token. If our source string is…</p> +<pre><code>"Eats, Shoots and Leaves." +</code></pre> +<p>… then a “whitespace tokenizer” with a <code>pattern</code> of +<code>"\\S+"</code> produces…</p> +<pre><code>Eats, +Shoots +and +Leaves. +</code></pre> +<p>… while a “word character tokenizer” with a <code>pattern</code> of +<code>"\\w+"</code> produces…</p> +<pre><code>Eats +Shoots +and +Leaves +</code></pre> +<p>… the difference being that the word character tokenizer skips over +punctuation as well as whitespace when determining token boundaries.</p> +<h3>Functions</h3> +<dl> +<dt id="func_new">new</dt> +<dd> +<pre><code><span class="prefix">lucy_</span>RegexTokenizer* <span class="comment">// incremented</span> +<span class="prefix">lucy_</span><strong>RegexTokenizer_new</strong>( + <span class="prefix">cfish_</span><a href="../../Clownfish/String.html">String</a> *<strong>pattern</strong> +); +</code></pre> +<p>Create a new RegexTokenizer.</p> +<dl> +<dt>pattern</dt> +<dd><p>A string specifying a Perl-syntax regular expression +which should match one token. The default value is +<code>\w+(?:[\x{2019}']\w+)*</code>, which matches “it’s” as well as +“it” and “O’Henry’s” as well as “Henry”.</p> +</dd> +</dl> +</dd> +<dt id="func_init">init</dt> +<dd> +<pre><code><span class="prefix">lucy_</span>RegexTokenizer* +<span class="prefix">lucy_</span><strong>RegexTokenizer_init</strong>( + <span class="prefix">lucy_</span>RegexTokenizer *<strong>self</strong>, + <span class="prefix">cfish_</span><a href="../../Clownfish/String.html">String</a> *<strong>pattern</strong> +); +</code></pre> +<p>Initialize a RegexTokenizer.</p> +<dl> +<dt>pattern</dt> +<dd><p>A string specifying a Perl-syntax regular expression +which should match one token. The default value is +<code>\w+(?:[\x{2019}']\w+)*</code>, which matches “it’s” as well as +“it” and “O’Henry’s” as well as “Henry”.</p> +</dd> +</dl> +</dd> +</dl> +<h3>Methods</h3> +<dl> +<dt id="func_Transform">Transform</dt> +<dd> +<pre><code><span class="prefix">lucy_</span><a href="../../Lucy/Analysis/Inversion.html">Inversion</a>* <span class="comment">// incremented</span> +<span class="prefix">lucy_</span><strong>RegexTokenizer_Transform</strong>( + <span class="prefix">lucy_</span>RegexTokenizer *<strong>self</strong>, + <span class="prefix">lucy_</span><a href="../../Lucy/Analysis/Inversion.html">Inversion</a> *<strong>inversion</strong> +); +</code></pre> +<p>Takes a single <a href="../../Lucy/Analysis/Inversion.html">Inversion</a> as input +and returns an Inversion, either the same one (presumably transformed +in some way), or a new one.</p> +<dl> +<dt>inversion</dt> +<dd><p>An inversion.</p> +</dd> +</dl> +</dd> +<dt id="func_Transform_Text">Transform_Text</dt> +<dd> +<pre><code><span class="prefix">lucy_</span><a href="../../Lucy/Analysis/Inversion.html">Inversion</a>* <span class="comment">// incremented</span> +<span class="prefix">lucy_</span><strong>RegexTokenizer_Transform_Text</strong>( + <span class="prefix">lucy_</span>RegexTokenizer *<strong>self</strong>, + <span class="prefix">cfish_</span><a href="../../Clownfish/String.html">String</a> *<strong>text</strong> +); +</code></pre> +<p>Kick off an analysis chain, creating an Inversion from string input.
+The default implementation simply creates an initial Inversion with a +single Token, then calls <a href="../../Lucy/Analysis/RegexTokenizer.html#func_Transform">Transform()</a>, but occasionally subclasses will +provide an optimized implementation which minimizes string copies.</p> +<dl> +<dt>text</dt> +<dd><p>A string.</p> +</dd> +</dl> +</dd> +<dt id="func_Dump">Dump</dt> +<dd> +<pre><code><span class="prefix">cfish_</span><a href="../../Clownfish/Obj.html">Obj</a>* <span class="comment">// incremented</span> +<span class="prefix">lucy_</span><strong>RegexTokenizer_Dump</strong>( + <span class="prefix">lucy_</span>RegexTokenizer *<strong>self</strong> +); +</code></pre> +<p>Dump the analyzer as hash.</p> +<p>Subclasses should call <a href="../../Lucy/Analysis/RegexTokenizer.html#func_Dump">Dump()</a> on the superclass. The returned +object is a hash which should be populated with parameters of +the analyzer.</p> +<p><strong>Returns:</strong> A hash containing a description of the analyzer.</p> +</dd> +<dt id="func_Load">Load</dt> +<dd> +<pre><code><span class="prefix">lucy_</span>RegexTokenizer* <span class="comment">// incremented</span> +<span class="prefix">lucy_</span><strong>RegexTokenizer_Load</strong>( + <span class="prefix">lucy_</span>RegexTokenizer *<strong>self</strong>, + <span class="prefix">cfish_</span><a href="../../Clownfish/Obj.html">Obj</a> *<strong>dump</strong> +); +</code></pre> +<p>Reconstruct an analyzer from a dump.</p> +<p>Subclasses should first call <a href="../../Lucy/Analysis/RegexTokenizer.html#func_Load">Load()</a> on the superclass. 
The +returned object is an analyzer which should be reconstructed by +setting the dumped parameters from the hash contained in <code>dump</code>.</p> +<p>Note that the invocant analyzer is unused.</p> +<dl> +<dt>dump</dt> +<dd><p>A hash.</p> +</dd> +</dl> +<p><strong>Returns:</strong> An analyzer.</p> +</dd> +<dt id="func_Equals">Equals</dt> +<dd> +<pre><code>bool +<span class="prefix">lucy_</span><strong>RegexTokenizer_Equals</strong>( + <span class="prefix">lucy_</span>RegexTokenizer *<strong>self</strong>, + <span class="prefix">cfish_</span><a href="../../Clownfish/Obj.html">Obj</a> *<strong>other</strong> +); +</code></pre> +<p>Indicate whether two objects are the same. By default, compares the +memory address.</p> +<dl> +<dt>other</dt> +<dd><p>Another Obj.</p> +</dd> +</dl> +</dd> +<dt id="func_Destroy">Destroy</dt> +<dd> +<pre><code>void +<span class="prefix">lucy_</span><strong>RegexTokenizer_Destroy</strong>( + <span class="prefix">lucy_</span>RegexTokenizer *<strong>self</strong> +); +</code></pre> +<p>Generic destructor. 
Frees the struct itself but not any complex +member elements.</p> +</dd> +</dl> +<h3>Inheritance</h3> +<p>Lucy::Analysis::RegexTokenizer is a <a href="../../Lucy/Analysis/Analyzer.html">Lucy::Analysis::Analyzer</a> is a <a href="../../Clownfish/Obj.html">Clownfish::Obj</a>.</p> +</div> Added: lucy/site/trunk/content/docs/c/Lucy/Analysis/SnowballStemmer.mdtext URL: http://svn.apache.org/viewvc/lucy/site/trunk/content/docs/c/Lucy/Analysis/SnowballStemmer.mdtext?rev=1737682&view=auto ============================================================================== --- lucy/site/trunk/content/docs/c/Lucy/Analysis/SnowballStemmer.mdtext (added) +++ lucy/site/trunk/content/docs/c/Lucy/Analysis/SnowballStemmer.mdtext Mon Apr 4 12:55:10 2016 @@ -0,0 +1,150 @@ +Title: Lucy::Analysis::SnowballStemmer — C API Documentation + +<div class="c-api"> +<h2>Lucy::Analysis::SnowballStemmer</h2> +<table> +<tr> +<td class="label">parcel</td> +<td><a href="../../lucy.html">Lucy</a></td> +</tr> +<tr> +<td class="label">class variable</td> +<td><code><span class="prefix">LUCY_</span>SNOWBALLSTEMMER</code></td> +</tr> +<tr> +<td class="label">struct symbol</td> +<td><code><span class="prefix">lucy_</span>SnowballStemmer</code></td> +</tr> +<tr> +<td class="label">class nickname</td> +<td><code><span class="prefix">lucy_</span>SnowStemmer</code></td> +</tr> +<tr> +<td class="label">header file</td> +<td><code>Lucy/Analysis/SnowballStemmer.h</code></td> +</tr> +</table> +<h3>Name</h3> +<p>Lucy::Analysis::SnowballStemmer — Reduce related words to a shared root.</p> +<h3>Description</h3> +<p>SnowballStemmer is an <a href="../../Lucy/Analysis/Analyzer.html">Analyzer</a> which reduces +related words to a root form (using the “Snowball” stemming library). For +instance, “horse”, “horses”, and “horsing” all become “hors” — so that a +search for “horse” will also match documents containing “horses” and +“horsing”.</p> +<h3>Functions</h3> +<dl> +<dt id="func_new">new</dt> +<dd> +<pre><code><span class="prefix">lucy_</span>SnowballStemmer* <span class="comment">// incremented</span> +<span class="prefix">lucy_</span><strong>SnowStemmer_new</strong>( + <span class="prefix">cfish_</span><a href="../../Clownfish/String.html">String</a> *<strong>language</strong> +); +</code></pre> +<p>Create a new SnowballStemmer.</p> +<dl> +<dt>language</dt> +<dd><p>A two-letter ISO code identifying a language supported +by Snowball.</p> +</dd> +</dl> +</dd> +<dt id="func_init">init</dt> +<dd> +<pre><code><span class="prefix">lucy_</span>SnowballStemmer* +<span class="prefix">lucy_</span><strong>SnowStemmer_init</strong>( + <span class="prefix">lucy_</span>SnowballStemmer *<strong>self</strong>, + <span class="prefix">cfish_</span><a href="../../Clownfish/String.html">String</a> *<strong>language</strong> +); +</code></pre> +<p>Initialize a SnowballStemmer.</p> +<dl> +<dt>language</dt> +<dd><p>A two-letter ISO code identifying a language supported +by Snowball.</p> +</dd> +</dl> +</dd> +</dl> +<h3>Methods</h3> +<dl> +<dt id="func_Transform">Transform</dt> +<dd> +<pre><code><span class="prefix">lucy_</span><a href="../../Lucy/Analysis/Inversion.html">Inversion</a>* <span class="comment">// incremented</span> +<span class="prefix">lucy_</span><strong>SnowStemmer_Transform</strong>( + <span class="prefix">lucy_</span>SnowballStemmer *<strong>self</strong>, + <span class="prefix">lucy_</span><a href="../../Lucy/Analysis/Inversion.html">Inversion</a> *<strong>inversion</strong> +); +</code></pre> +<p>Takes a single <a href="../../Lucy/Analysis/Inversion.html">Inversion</a> as input +and returns an Inversion, either the same one (presumably transformed +in some way), or a new one.</p> +<dl> +<dt>inversion</dt> +<dd><p>An inversion.</p> +</dd>
+</dl> +</dd> +<dt id="func_Dump">Dump</dt> +<dd> +<pre><code><span class="prefix">cfish_</span><a href="../../Clownfish/Hash.html">Hash</a>* <span class="comment">// incremented</span> +<span class="prefix">lucy_</span><strong>SnowStemmer_Dump</strong>( + <span class="prefix">lucy_</span>SnowballStemmer *<strong>self</strong> +); +</code></pre> +<p>Dump the analyzer as hash.</p> +<p>Subclasses should call <a href="../../Lucy/Analysis/SnowballStemmer.html#func_Dump">Dump()</a> on the superclass. The returned +object is a hash which should be populated with parameters of +the analyzer.</p> +<p><strong>Returns:</strong> A hash containing a description of the analyzer.</p> +</dd> +<dt id="func_Load">Load</dt> +<dd> +<pre><code><span class="prefix">lucy_</span>SnowballStemmer* <span class="comment">// incremented</span> +<span class="prefix">lucy_</span><strong>SnowStemmer_Load</strong>( + <span class="prefix">lucy_</span>SnowballStemmer *<strong>self</strong>, + <span class="prefix">cfish_</span><a href="../../Clownfish/Obj.html">Obj</a> *<strong>dump</strong> +); +</code></pre> +<p>Reconstruct an analyzer from a dump.</p> +<p>Subclasses should first call <a href="../../Lucy/Analysis/SnowballStemmer.html#func_Load">Load()</a> on the superclass. The +returned object is an analyzer which should be reconstructed by +setting the dumped parameters from the hash contained in <code>dump</code>.</p> +<p>Note that the invocant analyzer is unused.</p> +<dl> +<dt>dump</dt> +<dd><p>A hash.</p> +</dd> +</dl> +<p><strong>Returns:</strong> An analyzer.</p> +</dd> +<dt id="func_Equals">Equals</dt> +<dd> +<pre><code>bool +<span class="prefix">lucy_</span><strong>SnowStemmer_Equals</strong>( + <span class="prefix">lucy_</span>SnowballStemmer *<strong>self</strong>, + <span class="prefix">cfish_</span><a href="../../Clownfish/Obj.html">Obj</a> *<strong>other</strong> +); +</code></pre> +<p>Indicate whether two objects are the same. 
By default, compares the +memory address.</p> +<dl> +<dt>other</dt> +<dd><p>Another Obj.</p> +</dd> +</dl> +</dd> +<dt id="func_Destroy">Destroy</dt> +<dd> +<pre><code>void +<span class="prefix">lucy_</span><strong>SnowStemmer_Destroy</strong>( + <span class="prefix">lucy_</span>SnowballStemmer *<strong>self</strong> +); +</code></pre> +<p>Generic destructor. Frees the struct itself but not any complex +member elements.</p> +</dd> +</dl> +<h3>Inheritance</h3> +<p>Lucy::Analysis::SnowballStemmer is a <a href="../../Lucy/Analysis/Analyzer.html">Lucy::Analysis::Analyzer</a> is a <a href="../../Clownfish/Obj.html">Clownfish::Obj</a>.</p> +</div> Added: lucy/site/trunk/content/docs/c/Lucy/Analysis/SnowballStopFilter.mdtext URL: http://svn.apache.org/viewvc/lucy/site/trunk/content/docs/c/Lucy/Analysis/SnowballStopFilter.mdtext?rev=1737682&view=auto ============================================================================== --- lucy/site/trunk/content/docs/c/Lucy/Analysis/SnowballStopFilter.mdtext (added) +++ lucy/site/trunk/content/docs/c/Lucy/Analysis/SnowballStopFilter.mdtext Mon Apr 4 12:55:10 2016 @@ -0,0 +1,182 @@ +Title: Lucy::Analysis::SnowballStopFilter — C API Documentation + +<div class="c-api"> +<h2>Lucy::Analysis::SnowballStopFilter</h2> +<table> +<tr> +<td class="label">parcel</td> +<td><a href="../../lucy.html">Lucy</a></td> +</tr> +<tr> +<td class="label">class variable</td> +<td><code><span class="prefix">LUCY_</span>SNOWBALLSTOPFILTER</code></td> +</tr> +<tr> +<td class="label">struct symbol</td> +<td><code><span class="prefix">lucy_</span>SnowballStopFilter</code></td> +</tr> +<tr> +<td class="label">class nickname</td> +<td><code><span class="prefix">lucy_</span>SnowStop</code></td> +</tr> +<tr> +<td class="label">header file</td> +<td><code>Lucy/Analysis/SnowballStopFilter.h</code></td> +</tr> +</table> +<h3>Name</h3> +<p>Lucy::Analysis::SnowballStopFilter — Suppress a “stoplist” of common words.</p> +<h3>Description</h3> +<p>A “stoplist” is a collection of “stopwords”: words which are common enough to +be of little value when determining search results. For example, so many +documents in English contain “the”, “if”, and “maybe” that it may improve +both performance and relevance to block them.</p> +<p>Before filtering stopwords:</p> +<pre><code>("i", "am", "the", "walrus") +</code></pre> +<p>After filtering stopwords:</p> +<pre><code>("walrus") +</code></pre> +<p>SnowballStopFilter provides default stoplists for several languages, +courtesy of the <a href="http://snowball.tartarus.org">Snowball project</a>, or you may +supply your own.</p> +<pre><code>|-----------------------| +| ISO CODE | LANGUAGE | +|-----------------------| +| da | Danish | +| de | German | +| en | English | +| es | Spanish | +| fi | Finnish | +| fr | French | +| hu | Hungarian | +| it | Italian | +| nl | Dutch | +| no | Norwegian | +| pt | Portuguese | +| sv | Swedish | +| ru | Russian | +|-----------------------| +</code></pre> +<h3>Functions</h3> +<dl> +<dt id="func_new">new</dt> +<dd> +<pre><code><span class="prefix">lucy_</span>SnowballStopFilter* <span class="comment">// incremented</span> +<span class="prefix">lucy_</span><strong>SnowStop_new</strong>( + <span class="prefix">cfish_</span><a href="../../Clownfish/String.html">String</a> *<strong>language</strong>, + <span class="prefix">cfish_</span><a href="../../Clownfish/Hash.html">Hash</a> *<strong>stoplist</strong> +); +</code></pre> +<p>Create a new SnowballStopFilter.</p> +<dl> +<dt>stoplist</dt> +<dd><p>A hash with stopwords as the keys.</p> +</dd> +<dt>language</dt> +<dd><p>The ISO code for a supported language.</p> +</dd> +</dl> +</dd> +<dt id="func_init">init</dt> +<dd> +<pre><code><span class="prefix">lucy_</span>SnowballStopFilter* +<span class="prefix">lucy_</span><strong>SnowStop_init</strong>( + <span class="prefix">lucy_</span>SnowballStopFilter *<strong>self</strong>, + <span class="prefix">cfish_</span><a href="../../Clownfish/String.html">String</a> *<strong>language</strong>, + <span class="prefix">cfish_</span><a href="../../Clownfish/Hash.html">Hash</a> *<strong>stoplist</strong> +); +</code></pre> +<p>Initialize a SnowballStopFilter.</p> +<dl> +<dt>stoplist</dt> +<dd><p>A hash with stopwords as the keys.</p> +</dd> +<dt>language</dt> +<dd><p>The ISO code for a supported language.</p> +</dd> +</dl> +</dd> +</dl> +<h3>Methods</h3> +<dl> +<dt id="func_Transform">Transform</dt> +<dd> +<pre><code><span class="prefix">lucy_</span><a href="../../Lucy/Analysis/Inversion.html">Inversion</a>* <span class="comment">// incremented</span> +<span class="prefix">lucy_</span><strong>SnowStop_Transform</strong>( + <span class="prefix">lucy_</span>SnowballStopFilter *<strong>self</strong>, + <span class="prefix">lucy_</span><a href="../../Lucy/Analysis/Inversion.html">Inversion</a> *<strong>inversion</strong> +); +</code></pre> +<p>Takes a single <a href="../../Lucy/Analysis/Inversion.html">Inversion</a> as input +and returns an Inversion, either the same one (presumably transformed +in some way), or a new one.</p> +<dl> +<dt>inversion</dt> +<dd><p>An inversion.</p> +</dd> +</dl> +</dd> +<dt id="func_Equals">Equals</dt> +<dd> +<pre><code>bool +<span class="prefix">lucy_</span><strong>SnowStop_Equals</strong>( + <span class="prefix">lucy_</span>SnowballStopFilter *<strong>self</strong>, + <span class="prefix">cfish_</span><a href="../../Clownfish/Obj.html">Obj</a> *<strong>other</strong> +); +</code></pre> +<p>Indicate whether two objects are the same.
By default, compares the
+memory address.</p>
+<dl>
+<dt>other</dt>
+<dd><p>Another Obj.</p>
+</dd>
+</dl>
+</dd>
+<dt id="func_Dump">Dump</dt>
+<dd>
+<pre><code><span class="prefix">cfish_</span><a href="../../Clownfish/Obj.html">Obj</a>* <span class="comment">// incremented</span>
+<span class="prefix">lucy_</span><strong>SnowStop_Dump</strong>(
+    <span class="prefix">lucy_</span>SnowballStopFilter *<strong>self</strong>
+);
+</code></pre>
+<p>Dump the analyzer as a hash.</p>
+<p>Subclasses should call <a href="../../Lucy/Analysis/SnowballStopFilter.html#func_Dump">Dump()</a> on the superclass.  The returned
+object is a hash which should be populated with parameters of
+the analyzer.</p>
+<p><strong>Returns:</strong> A hash containing a description of the analyzer.</p>
+</dd>
+<dt id="func_Load">Load</dt>
+<dd>
+<pre><code><span class="prefix">cfish_</span><a href="../../Clownfish/Obj.html">Obj</a>* <span class="comment">// incremented</span>
+<span class="prefix">lucy_</span><strong>SnowStop_Load</strong>(
+    <span class="prefix">lucy_</span>SnowballStopFilter *<strong>self</strong>,
+    <span class="prefix">cfish_</span><a href="../../Clownfish/Obj.html">Obj</a> *<strong>dump</strong>
+);
+</code></pre>
+<p>Reconstruct an analyzer from a dump.</p>
+<p>Subclasses should first call <a href="../../Lucy/Analysis/SnowballStopFilter.html#func_Load">Load()</a> on the superclass.  The
+returned object is an analyzer which should be reconstructed by
+setting the dumped parameters from the hash contained in <code>dump</code>.</p>
+<p>Note that the invocant analyzer is unused.</p>
+<dl>
+<dt>dump</dt>
+<dd><p>A hash.</p>
+</dd>
+</dl>
+<p><strong>Returns:</strong> An analyzer.</p>
+</dd>
+<dt id="func_Destroy">Destroy</dt>
+<dd>
+<pre><code>void
+<span class="prefix">lucy_</span><strong>SnowStop_Destroy</strong>(
+    <span class="prefix">lucy_</span>SnowballStopFilter *<strong>self</strong>
+);
+</code></pre>
+<p>Generic destructor.
Frees the struct itself but not any complex
+member elements.</p>
+</dd>
+</dl>
+<h3>Inheritance</h3>
+<p>Lucy::Analysis::SnowballStopFilter is a <a href="../../Lucy/Analysis/Analyzer.html">Lucy::Analysis::Analyzer</a> is a <a href="../../Clownfish/Obj.html">Clownfish::Obj</a>.</p>
+</div>

Added: lucy/site/trunk/content/docs/c/Lucy/Analysis/StandardTokenizer.mdtext
URL: http://svn.apache.org/viewvc/lucy/site/trunk/content/docs/c/Lucy/Analysis/StandardTokenizer.mdtext?rev=1737682&view=auto
==============================================================================
--- lucy/site/trunk/content/docs/c/Lucy/Analysis/StandardTokenizer.mdtext (added)
+++ lucy/site/trunk/content/docs/c/Lucy/Analysis/StandardTokenizer.mdtext Mon Apr 4 12:55:10 2016
@@ -0,0 +1,111 @@
+Title: Lucy::Analysis::StandardTokenizer – C API Documentation
+
+<div class="c-api">
+<h2>Lucy::Analysis::StandardTokenizer</h2>
+<table>
+<tr>
+<td class="label">parcel</td>
+<td><a href="../../lucy.html">Lucy</a></td>
+</tr>
+<tr>
+<td class="label">class variable</td>
+<td><code><span class="prefix">LUCY_</span>STANDARDTOKENIZER</code></td>
+</tr>
+<tr>
+<td class="label">struct symbol</td>
+<td><code><span class="prefix">lucy_</span>StandardTokenizer</code></td>
+</tr>
+<tr>
+<td class="label">class nickname</td>
+<td><code><span class="prefix">lucy_</span>StandardTokenizer</code></td>
+</tr>
+<tr>
+<td class="label">header file</td>
+<td><code>Lucy/Analysis/StandardTokenizer.h</code></td>
+</tr>
+</table>
+<h3>Name</h3>
+<p>Lucy::Analysis::StandardTokenizer – Split a string into tokens.</p>
+<h3>Description</h3>
+<p>Generically, "tokenizing" is a process of breaking up a string into an
+array of "tokens".  For instance, the string "three blind mice" might be
+tokenized into "three", "blind", "mice".</p>
+<p>Lucy::Analysis::StandardTokenizer breaks up the text at the word
+boundaries defined in Unicode Standard Annex #29.
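+</p>
+<p>A rough, self-contained illustration of the idea (a naive splitter that
+treats any alphanumeric run as a word; real UAX #29 segmentation also handles
+apostrophes, mid-word punctuation, and non-Latin scripts):</p>

```c
#include <ctype.h>
#include <stddef.h>

/* Record the [start, end) byte offsets of each alphanumeric run.
 * Returns the number of tokens found, up to `max`.  Illustrative
 * only: far simpler than Unicode Standard Annex #29. */
static size_t
naive_tokenize(const char *text, size_t starts[], size_t ends[], size_t max) {
    size_t count = 0;
    size_t i = 0;
    while (text[i] != '\0' && count < max) {
        /* Skip separators. */
        while (text[i] != '\0' && !isalnum((unsigned char)text[i])) { i++; }
        if (text[i] == '\0') { break; }
        starts[count] = i;
        /* Consume the word. */
        while (text[i] != '\0' && isalnum((unsigned char)text[i])) { i++; }
        ends[count] = i;
        count++;
    }
    return count;
}
```

+<p>For "three blind mice" this yields three tokens with the offsets
+(0,5), (6,11), and (12,16).</p>
+<p>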
It then returns those +words that contain alphabetic or numeric characters.</p> +<h3>Functions</h3> +<dl> +<dt id="func_new">new</dt> +<dd> +<pre><code><span class="prefix">lucy_</span>StandardTokenizer* <span class="comment">// incremented</span> +<span class="prefix">lucy_</span><strong>StandardTokenizer_new</strong>(void); +</code></pre> +<p>Constructor. Takes no arguments.</p> +</dd> +<dt id="func_init">init</dt> +<dd> +<pre><code><span class="prefix">lucy_</span>StandardTokenizer* +<span class="prefix">lucy_</span><strong>StandardTokenizer_init</strong>( + <span class="prefix">lucy_</span>StandardTokenizer *<strong>self</strong> +); +</code></pre> +<p>Initialize a StandardTokenizer.</p> +</dd> +</dl> +<h3>Methods</h3> +<dl> +<dt id="func_Transform">Transform</dt> +<dd> +<pre><code><span class="prefix">lucy_</span><a href="../../Lucy/Analysis/Inversion.html">Inversion</a>* <span class="comment">// incremented</span> +<span class="prefix">lucy_</span><strong>StandardTokenizer_Transform</strong>( + <span class="prefix">lucy_</span>StandardTokenizer *<strong>self</strong>, + <span class="prefix">lucy_</span><a href="../../Lucy/Analysis/Inversion.html">Inversion</a> *<strong>inversion</strong> +); +</code></pre> +<p>Take a single <a href="../../Lucy/Analysis/Inversion.html">Inversion</a> as input +and returns an Inversion, either the same one (presumably transformed +in some way), or a new one.</p> +<dl> +<dt>inversion</dt> +<dd><p>An inversion.</p> +</dd> +</dl> +</dd> +<dt id="func_Transform_Text">Transform_Text</dt> +<dd> +<pre><code><span class="prefix">lucy_</span><a href="../../Lucy/Analysis/Inversion.html">Inversion</a>* <span class="comment">// incremented</span> +<span class="prefix">lucy_</span><strong>StandardTokenizer_Transform_Text</strong>( + <span class="prefix">lucy_</span>StandardTokenizer *<strong>self</strong>, + <span class="prefix">cfish_</span><a href="../../Clownfish/String.html">String</a> *<strong>text</strong> +); +</code></pre> +<p>Kick 
off an analysis chain, creating an Inversion from string input. +The default implementation simply creates an initial Inversion with a +single Token, then calls <a href="../../Lucy/Analysis/StandardTokenizer.html#func_Transform">Transform()</a>, but occasionally subclasses will +provide an optimized implementation which minimizes string copies.</p> +<dl> +<dt>text</dt> +<dd><p>A string.</p> +</dd> +</dl> +</dd> +<dt id="func_Equals">Equals</dt> +<dd> +<pre><code>bool +<span class="prefix">lucy_</span><strong>StandardTokenizer_Equals</strong>( + <span class="prefix">lucy_</span>StandardTokenizer *<strong>self</strong>, + <span class="prefix">cfish_</span><a href="../../Clownfish/Obj.html">Obj</a> *<strong>other</strong> +); +</code></pre> +<p>Indicate whether two objects are the same. By default, compares the +memory address.</p> +<dl> +<dt>other</dt> +<dd><p>Another Obj.</p> +</dd> +</dl> +</dd> +</dl> +<h3>Inheritance</h3> +<p>Lucy::Analysis::StandardTokenizer is a <a href="../../Lucy/Analysis/Analyzer.html">Lucy::Analysis::Analyzer</a> is a <a href="../../Clownfish/Obj.html">Clownfish::Obj</a>.</p> +</div> Added: lucy/site/trunk/content/docs/c/Lucy/Analysis/Token.mdtext URL: http://svn.apache.org/viewvc/lucy/site/trunk/content/docs/c/Lucy/Analysis/Token.mdtext?rev=1737682&view=auto ============================================================================== --- lucy/site/trunk/content/docs/c/Lucy/Analysis/Token.mdtext (added) +++ lucy/site/trunk/content/docs/c/Lucy/Analysis/Token.mdtext Mon Apr 4 12:55:10 2016 @@ -0,0 +1,204 @@ +Title: Lucy::Analysis::Token â C API Documentation + +<div class="c-api"> +<h2>Lucy::Analysis::Token</h2> +<table> +<tr> +<td class="label">parcel</td> +<td><a href="../../lucy.html">Lucy</a></td> +</tr> +<tr> +<td class="label">class variable</td> +<td><code><span class="prefix">LUCY_</span>TOKEN</code></td> +</tr> +<tr> +<td class="label">struct symbol</td> +<td><code><span class="prefix">lucy_</span>Token</code></td> +</tr> +<tr> 
+<td class="label">class nickname</td>
+<td><code><span class="prefix">lucy_</span>Token</code></td>
+</tr>
+<tr>
+<td class="label">header file</td>
+<td><code>Lucy/Analysis/Token.h</code></td>
+</tr>
+</table>
+<h3>Name</h3>
+<p>Lucy::Analysis::Token – Unit of text.</p>
+<h3>Description</h3>
+<p>Token is the fundamental unit used by Apache Lucy's Analyzer subclasses.
+Each Token has 5 attributes: <code>text</code>, <code>start_offset</code>,
+<code>end_offset</code>, <code>boost</code>, and <code>pos_inc</code>.</p>
+<p>The <code>text</code> attribute is a Unicode string encoded as UTF-8.</p>
+<p><code>start_offset</code> is the start point of the token text, measured in
+Unicode code points from the top of the stored field;
+<code>end_offset</code> delimits the corresponding closing boundary.
+<code>start_offset</code> and <code>end_offset</code> locate the Token
+within a larger context, even if the Token's text attribute gets modified
+– by stemming, for instance.  The Token for "beating" in the text "beating
+a dead horse" begins life with a start_offset of 0 and an end_offset of 7;
+after stemming, the text is "beat", but the start_offset is still 0 and the
+end_offset is still 7.  This allows "beating" to be highlighted correctly
+after a search matches "beat".</p>
+<p><code>boost</code> is a per-token weight.  Use this when you want to assign
+more or less importance to a particular token, as you might for emboldened
+text within an HTML document, for example.  (Note: The field this token
+belongs to must be spec'd to use a posting of type RichPosting.)</p>
+<p><code>pos_inc</code> is the POSition INCrement, measured in Tokens.  This
+attribute, which defaults to 1, is an advanced tool for manipulating
+phrase matching.  Ordinarily, Tokens are assigned consecutive position
+numbers: 0, 1, and 2 for <code>"three blind mice"</code>.
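+</p>
+<p>The position bookkeeping itself is just a running sum: each token's
+position is the previous position plus its <code>pos_inc</code>.  A minimal
+illustrative sketch (not the Lucy API):</p>

```c
#include <stdint.h>
#include <stddef.h>

/* Derive absolute positions from per-token increments.  Starting the
 * accumulator at -1 makes a leading pos_inc of 1 yield position 0. */
static void
assign_positions(const int32_t *pos_incs, int32_t *positions, size_t n) {
    int32_t pos = -1;
    for (size_t i = 0; i < n; i++) {
        pos += pos_incs[i];
        positions[i] = pos;
    }
}
```

+<p>With increments (1, 1, 1) this yields positions 0, 1, 2; bump the second
+increment to 1000 and the positions become 0, 1000, 1001.</p>
+<p>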
However, if you
+set the position increment for "blind" to, say, 1000, then the three tokens
+will end up assigned to positions 0, 1000, and 1001 – and will no longer
+produce a phrase match for the query <code>"three blind mice"</code>.</p>
+<h3>Functions</h3>
+<dl>
+<dt id="func_new">new</dt>
+<dd>
+<pre><code><span class="prefix">lucy_</span>Token* <span class="comment">// incremented</span>
+<span class="prefix">lucy_</span><strong>Token_new</strong>(
+    char *<strong>text</strong>,
+    size_t <strong>len</strong>,
+    uint32_t <strong>start_offset</strong>,
+    uint32_t <strong>end_offset</strong>,
+    float <strong>boost</strong>,
+    int32_t <strong>pos_inc</strong>
+);
+</code></pre>
+<p>Create a new Token.</p>
+<dl>
+<dt>text</dt>
+<dd><p>A UTF-8 string.</p>
+</dd>
+<dt>len</dt>
+<dd><p>Size of the string in bytes.</p>
+</dd>
+<dt>start_offset</dt>
+<dd><p>Start offset into the original document in Unicode
+code points.</p>
+</dd>
+<dt>end_offset</dt>
+<dd><p>End offset into the original document in Unicode
+code points.</p>
+</dd>
+<dt>boost</dt>
+<dd><p>Per-token weight.</p>
+</dd>
+<dt>pos_inc</dt>
+<dd><p>Position increment for phrase matching.</p>
+</dd>
+</dl>
+</dd>
+<dt id="func_init">init</dt>
+<dd>
+<pre><code><span class="prefix">lucy_</span>Token*
+<span class="prefix">lucy_</span><strong>Token_init</strong>(
+    <span class="prefix">lucy_</span>Token *<strong>self</strong>,
+    char *<strong>text</strong>,
+    size_t <strong>len</strong>,
+    uint32_t <strong>start_offset</strong>,
+    uint32_t <strong>end_offset</strong>,
+    float <strong>boost</strong>,
+    int32_t <strong>pos_inc</strong>
+);
+</code></pre>
+<p>Initialize a Token.</p>
+<dl>
+<dt>text</dt>
+<dd><p>A UTF-8 string.</p>
+</dd>
+<dt>len</dt>
+<dd><p>Size of the string in bytes.</p>
+</dd>
+<dt>start_offset</dt>
+<dd><p>Start offset into the original document in Unicode
+code points.</p>
+</dd>
+<dt>end_offset</dt>
+<dd><p>End offset into the original document in Unicode
+code points.</p>
+</dd>
+<dt>boost</dt> +<dd><p>Per-token weight.</p> +</dd> +<dt>pos_inc</dt> +<dd><p>Position increment for phrase matching.</p> +</dd> +</dl> +</dd> +</dl> +<h3>Methods</h3> +<dl> +<dt id="func_Get_Start_Offset">Get_Start_Offset</dt> +<dd> +<pre><code>uint32_t +<span class="prefix">lucy_</span><strong>Token_Get_Start_Offset</strong>( + <span class="prefix">lucy_</span>Token *<strong>self</strong> +); +</code></pre> +</dd> +<dt id="func_Get_End_Offset">Get_End_Offset</dt> +<dd> +<pre><code>uint32_t +<span class="prefix">lucy_</span><strong>Token_Get_End_Offset</strong>( + <span class="prefix">lucy_</span>Token *<strong>self</strong> +); +</code></pre> +</dd> +<dt id="func_Get_Boost">Get_Boost</dt> +<dd> +<pre><code>float +<span class="prefix">lucy_</span><strong>Token_Get_Boost</strong>( + <span class="prefix">lucy_</span>Token *<strong>self</strong> +); +</code></pre> +</dd> +<dt id="func_Get_Pos_Inc">Get_Pos_Inc</dt> +<dd> +<pre><code>int32_t +<span class="prefix">lucy_</span><strong>Token_Get_Pos_Inc</strong>( + <span class="prefix">lucy_</span>Token *<strong>self</strong> +); +</code></pre> +</dd> +<dt id="func_Get_Text">Get_Text</dt> +<dd> +<pre><code>char* +<span class="prefix">lucy_</span><strong>Token_Get_Text</strong>( + <span class="prefix">lucy_</span>Token *<strong>self</strong> +); +</code></pre> +</dd> +<dt id="func_Get_Len">Get_Len</dt> +<dd> +<pre><code>size_t +<span class="prefix">lucy_</span><strong>Token_Get_Len</strong>( + <span class="prefix">lucy_</span>Token *<strong>self</strong> +); +</code></pre> +</dd> +<dt id="func_Set_Text">Set_Text</dt> +<dd> +<pre><code>void +<span class="prefix">lucy_</span><strong>Token_Set_Text</strong>( + <span class="prefix">lucy_</span>Token *<strong>self</strong>, + char *<strong>text</strong>, + size_t <strong>len</strong> +); +</code></pre> +</dd> +<dt id="func_Destroy">Destroy</dt> +<dd> +<pre><code>void +<span class="prefix">lucy_</span><strong>Token_Destroy</strong>( + <span class="prefix">lucy_</span>Token 
*<strong>self</strong> +); +</code></pre> +<p>Generic destructor. Frees the struct itself but not any complex +member elements.</p> +</dd> +</dl> +<h3>Inheritance</h3> +<p>Lucy::Analysis::Token is a <a href="../../Clownfish/Obj.html">Clownfish::Obj</a>.</p> +</div> Added: lucy/site/trunk/content/docs/c/Lucy/Docs/Cookbook.mdtext URL: http://svn.apache.org/viewvc/lucy/site/trunk/content/docs/c/Lucy/Docs/Cookbook.mdtext?rev=1737682&view=auto ============================================================================== --- lucy/site/trunk/content/docs/c/Lucy/Docs/Cookbook.mdtext (added) +++ lucy/site/trunk/content/docs/c/Lucy/Docs/Cookbook.mdtext Mon Apr 4 12:55:10 2016 @@ -0,0 +1,32 @@ +Title: Lucy::Docs::Cookbook + +<div class="c-api"> +<h2>Apache Lucy recipes</h2> +<p>The Cookbook provides thematic documentation covering some of Apache Lucyâs +more sophisticated features. For a step-by-step introduction to Lucy, +see <a href="../../Lucy/Docs/Tutorial.html">Tutorial</a>.</p> +<h3>Chapters</h3> +<ul> +<li> +<p><a href="../../Lucy/Docs/Cookbook/FastUpdates.html">FastUpdates</a> - While index updates are fast on +average, worst-case update performance may be significantly slower. To make +index updates consistently quick, we must manually intervene to control the +process of index segment consolidation.</p> +</li> +<li> +<p><a href="../../Lucy/Docs/Cookbook/CustomQuery.html">CustomQuery</a> - Explore Lucyâs support for +custom query types by creating a âPrefixQueryâ class to handle trailing +wildcards.</p> +</li> +<li> +<p><a href="../../Lucy/Docs/Cookbook/CustomQueryParser.html">CustomQueryParser</a> - Define your own custom +search query syntax using <a href="../../Lucy/Search/QueryParser.html">QueryParser</a> and +Parse::RecDescent.</p> +</li> +</ul> +<h3>Materials</h3> +<p>Some of the recipes in the Cookbook reference the completed +<a href="../../Lucy/Docs/Tutorial.html">Tutorial</a> application. 
These materials can be +found in the <code>sample</code> directory at the root of the Lucy distribution:</p> +<pre><code>Code example for C is missing</code></pre> +</div> Added: lucy/site/trunk/content/docs/c/Lucy/Docs/Cookbook/CustomQuery.mdtext URL: http://svn.apache.org/viewvc/lucy/site/trunk/content/docs/c/Lucy/Docs/Cookbook/CustomQuery.mdtext?rev=1737682&view=auto ============================================================================== --- lucy/site/trunk/content/docs/c/Lucy/Docs/Cookbook/CustomQuery.mdtext (added) +++ lucy/site/trunk/content/docs/c/Lucy/Docs/Cookbook/CustomQuery.mdtext Mon Apr 4 12:55:10 2016 @@ -0,0 +1,102 @@ +Title: Lucy::Docs::Cookbook::CustomQuery + +<div class="c-api"> +<h2>Sample subclass of Query</h2> +<p>Explore Apache Lucyâs support for custom query types by creating a +âPrefixQueryâ class to handle trailing wildcards.</p> +<pre><code>Code example for C is missing</code></pre> +<h3>Query, Compiler, and Matcher</h3> +<p>To add support for a new query type, we need three classes: a Query, a +Compiler, and a Matcher.</p> +<ul> +<li> +<p>PrefixQuery - a subclass of <a href="../../../Lucy/Search/Query.html">Query</a>, and the only class +that client code will deal with directly.</p> +</li> +<li> +<p>PrefixCompiler - a subclass of <a href="../../../Lucy/Search/Compiler.html">Compiler</a>, whose primary +role is to compile a PrefixQuery to a PrefixMatcher.</p> +</li> +<li> +<p>PrefixMatcher - a subclass of <a href="../../../Lucy/Search/Matcher.html">Matcher</a>, which does the +heavy lifting: it applies the query to individual documents and assigns a +score to each match.</p> +</li> +</ul> +<p>The PrefixQuery class on its own isnât enough because a Query objectâs role is +limited to expressing an abstract specification for the search. 
A Query is +basically nothing but metadata; execution is left to the Queryâs companion +Compiler and Matcher.</p> +<p>Hereâs a simplified sketch illustrating how a Searcherâs hits() method ties +together the three classes.</p> +<pre><code>Code example for C is missing</code></pre> +<h4>PrefixQuery</h4> +<p>Our PrefixQuery class will have two attributes: a query string and a field +name.</p> +<pre><code>Code example for C is missing</code></pre> +<p>PrefixQueryâs constructor collects and validates the attributes.</p> +<pre><code>Code example for C is missing</code></pre> +<p>Since this is an inside-out class, weâll need a destructor:</p> +<pre><code>Code example for C is missing</code></pre> +<p>The equals() method determines whether two Queries are logically equivalent:</p> +<pre><code>Code example for C is missing</code></pre> +<p>The last thing weâll need is a make_compiler() factory method which kicks out +a subclass of <a href="../../../Lucy/Search/Compiler.html">Compiler</a>.</p> +<pre><code>Code example for C is missing</code></pre> +<h4>PrefixCompiler</h4> +<p>PrefixQueryâs make_compiler() method will be called internally at search-time +by objects which subclass <a href="../../../Lucy/Search/Searcher.html">Searcher</a> â such as +<a href="../../../Lucy/Search/IndexSearcher.html">IndexSearchers</a>.</p> +<p>A Searcher is associated with a particular collection of documents. 
These +documents may all reside in one index, as with IndexSearcher, or they may be +spread out across multiple indexes on one or more machines, as with +LucyX::Remote::ClusterSearcher.</p> +<p>Searcher objects have access to certain statistical information about the +collections they represent; for instance, a Searcher can tell you how many +documents are in the collectionâ¦</p> +<pre><code>Code example for C is missing</code></pre> +<p>⦠or how many documents a specific term appears in:</p> +<pre><code>Code example for C is missing</code></pre> +<p>Such information can be used by sophisticated Compiler implementations to +assign more or less heft to individual queries or sub-queries. However, weâre +not going to bother with weighting for this demo; weâll just assign a fixed +score of 1.0 to each matching document.</p> +<p>We donât need to write a constructor, as it will suffice to inherit new() from +Lucy::Search::Compiler. The only method we need to implement for +PrefixCompiler is make_matcher().</p> +<pre><code>Code example for C is missing</code></pre> +<p>PrefixCompiler gets access to a <a href="../../../Lucy/Index/SegReader.html">SegReader</a> +object when make_matcher() gets called. 
From the SegReader and its +sub-components <a href="../../../Lucy/Index/LexiconReader.html">LexiconReader</a> and +<a href="../../../Lucy/Index/PostingListReader.html">PostingListReader</a>, we acquire a +<a href="../../../Lucy/Index/Lexicon.html">Lexicon</a>, scan through the Lexiconâs unique +terms, and acquire a <a href="../../../Lucy/Index/PostingList.html">PostingList</a> for each +term that matches our prefix.</p> +<p>Each of these PostingList objects represents a set of documents which match +the query.</p> +<h4>PrefixMatcher</h4> +<p>The Matcher subclass is the most involved.</p> +<pre><code>Code example for C is missing</code></pre> +<p>The doc ids must be in order, or some will be ignored; hence the <code>sort</code> +above.</p> +<p>In addition to the constructor and destructor, there are three methods that +must be overridden.</p> +<p>next() advances the Matcher to the next valid matching doc.</p> +<pre><code>Code example for C is missing</code></pre> +<p>get_doc_id() returns the current document id, or 0 if the Matcher is +exhausted. (<a href="../../../Lucy/Docs/DocIDs.html">Document numbers</a> start at 1, so 0 is +a sentinel.)</p> +<pre><code>Code example for C is missing</code></pre> +<p>score() conveys the relevance score of the current match. 
Weâll just return a +fixed score of 1.0:</p> +<pre><code>Code example for C is missing</code></pre> +<h3>Usage</h3> +<p>To get a basic feel for PrefixQuery, insert the FlatQueryParser module +described in <a href="../../../Lucy/Docs/Cookbook/CustomQueryParser.html">CustomQueryParser</a> (which supports +PrefixQuery) into the search.cgi sample app.</p> +<pre><code>Code example for C is missing</code></pre> +<p>If youâre planning on using PrefixQuery in earnest, though, you may want to +change up analyzers to avoid stemming, because stemming â another approach to +prefix conflation â is not perfectly compatible with prefix searches.</p> +<pre><code>Code example for C is missing</code></pre> +</div> Added: lucy/site/trunk/content/docs/c/Lucy/Docs/Cookbook/CustomQueryParser.mdtext URL: http://svn.apache.org/viewvc/lucy/site/trunk/content/docs/c/Lucy/Docs/Cookbook/CustomQueryParser.mdtext?rev=1737682&view=auto ============================================================================== --- lucy/site/trunk/content/docs/c/Lucy/Docs/Cookbook/CustomQueryParser.mdtext (added) +++ lucy/site/trunk/content/docs/c/Lucy/Docs/Cookbook/CustomQueryParser.mdtext Mon Apr 4 12:55:10 2016 @@ -0,0 +1,77 @@ +Title: Lucy::Docs::Cookbook::CustomQueryParser + +<div class="c-api"> +<h2>Sample subclass of QueryParser.</h2> +<p>Implement a custom search query language using a subclass of +<a href="../../../Lucy/Search/QueryParser.html">QueryParser</a>.</p> +<h3>The language</h3> +<p>At first, our query language will support only simple term queries and phrases +delimited by double quotes. For simplicityâs sake, it will not support +parenthetical groupings, boolean operators, or prepended plus/minus. The +results for all subqueries will be unioned together â i.e. 
joined using an OR
+– which is usually the best approach for small-to-medium-sized document
+collections.</p>
+<p>Later, we'll add support for trailing wildcards.</p>
+<h3>Single-field parser</h3>
+<p>Our initial parser implementation will generate queries against a single fixed
+field, "content", and it will analyze text using a fixed choice of English
+EasyAnalyzer.  We won't subclass Lucy::Search::QueryParser just yet.</p>
+<pre><code>Code example for C is missing</code></pre>
+<p>Some private helper subs for creating TermQuery and PhraseQuery objects will
+help keep the size of our main parse() subroutine down:</p>
+<pre><code>Code example for C is missing</code></pre>
+<p>Our private _tokenize() method treats double-quote delimited material as a
+single token and splits on whitespace everywhere else.</p>
+<pre><code>Code example for C is missing</code></pre>
+<p>The main parsing routine creates an array of tokens by calling _tokenize(),
+runs the tokens through the EasyAnalyzer, creates TermQuery or
+PhraseQuery objects according to how many tokens emerge from the
+EasyAnalyzer's split() method, and adds each of the sub-queries to the primary
+ORQuery.</p>
+<pre><code>Code example for C is missing</code></pre>
+<h3>Multi-field parser</h3>
+<p>Most often, the end user will want their search query to match not only a
+single "content" field, but also "title" and so on.
To make that happen, we +have to turn queries such as thisâ¦</p> +<pre><code>foo AND NOT bar +</code></pre> +<p>⦠into the logical equivalent of this:</p> +<pre><code>(title:foo OR content:foo) AND NOT (title:bar OR content:bar) +</code></pre> +<p>Rather than continue with our own from-scratch parser class and write the +routines to accomplish that expansion, weâre now going to subclass Lucy::Search::QueryParser +and take advantage of some of its existing methods.</p> +<p>Our first parser implementation had the âcontentâ field name and the choice of +English EasyAnalyzer hard-coded for simplicity, but we donât need to do that +once we subclass Lucy::Search::QueryParser. QueryParserâs constructor â +which we will inherit, allowing us to eliminate our own constructor â +requires a Schema which conveys field +and Analyzer information, so we can just defer to that.</p> +<pre><code>Code example for C is missing</code></pre> +<p>Weâre also going to jettison our _make_term_query() and _make_phrase_query() +helper subs and chop our parse() subroutine way down. Our revised parse() +routine will generate Lucy::Search::LeafQuery objects instead of TermQueries +and PhraseQueries:</p> +<pre><code>Code example for C is missing</code></pre> +<p>The magic happens in QueryParserâs expand() method, which walks the ORQuery +object we supply to it looking for LeafQuery objects, and calls expand_leaf() +for each one it finds. expand_leaf() performs field-specific analysis, +decides whether each query should be a TermQuery or a PhraseQuery, and if +multiple fields are required, creates an ORQuery which mults out e.g. 
<code>foo</code> +into <code>(title:foo OR content:foo)</code>.</p> +<h3>Extending the query language</h3> +<p>To add support for trailing wildcards to our query language, we need to +override expand_leaf() to accommodate PrefixQuery, while deferring to the +parent class implementation on TermQuery and PhraseQuery.</p> +<pre><code>Code example for C is missing</code></pre> +<p>Ordinarily, those asterisks would have been stripped when running tokens +through the EasyAnalyzer â query strings containing âfoo*â would produce +TermQueries for the term âfooâ. Our override intercepts tokens with trailing +asterisks and processes them as PrefixQueries before <code>SUPER::expand_leaf</code> can +discard them, so that a search for âfoo*â can match âfoodâ, âfoosballâ, and so +on.</p> +<h3>Usage</h3> +<p>Insert our custom parser into the search.cgi sample app to get a feel for how +it behaves:</p> +<pre><code>Code example for C is missing</code></pre> +</div> Added: lucy/site/trunk/content/docs/c/Lucy/Docs/Cookbook/FastUpdates.mdtext URL: http://svn.apache.org/viewvc/lucy/site/trunk/content/docs/c/Lucy/Docs/Cookbook/FastUpdates.mdtext?rev=1737682&view=auto ============================================================================== --- lucy/site/trunk/content/docs/c/Lucy/Docs/Cookbook/FastUpdates.mdtext (added) +++ lucy/site/trunk/content/docs/c/Lucy/Docs/Cookbook/FastUpdates.mdtext Mon Apr 4 12:55:10 2016 @@ -0,0 +1,163 @@ +Title: Lucy::Docs::Cookbook::FastUpdates + +<div class="c-api"> +<h2>Near real-time index updates</h2> +<p>While index updates are fast on average, worst-case update performance may be +significantly slower. To make index updates consistently quick, we must +manually intervene to control the process of index segment consolidation.</p> +<h3>The problem</h3> +<p>Ordinarily, modifying an index is cheap. 
New data is added to new segments, +and the time to write a new segment scales more or less linearly with the +number of documents added during the indexing session.</p> +<p>Deletions are also cheap most of the time, because we donât remove documents +immediately but instead mark them as deleted, and adding the deletion mark is +cheap.</p> +<p>However, as new segments are added and the deletion rate for existing segments +increases, search-time performance slowly begins to degrade. At some point, +it becomes necessary to consolidate existing segments, rewriting their data +into a new segment.</p> +<p>If the recycled segments are small, the time it takes to rewrite them may not +be significant. Every once in a while, though, a large amount of data must be +rewritten.</p> +<h3>Procrastinating and playing catch-up</h3> +<p>The simplest way to force fast index updates is to avoid rewriting anything.</p> +<p>Indexer relies upon <a href="../../../Lucy/Index/IndexManager.html">IndexManager</a>âs +<a href="../../../Lucy/Index/IndexManager.html#func_Recycle">Recycle()</a> method to tell it which segments should +be consolidated. If we subclass IndexManager and override the method so that +it always returns an empty array, we get consistently quick performance:</p> +<pre><code class="language-c">Vector* +NoMergeManager_Recycle_IMP(IndexManager *self, PolyReader *reader, + DeletionsWriter *del_writer, int64_t cutoff, + bool optimize) { + return Vec_new(0); +} + +void +do_index(Obj *index) { + CFCClass *klass = Class_singleton("NoMergeManager", INDEXMANAGER); + Class_Override(klass, (cfish_method_t)NoMergeManager_Recycle_IMP, + LUCY_IndexManager_Recycle_OFFSET); + + IndexManager *manager = (IndexManager*)Class_Make_Obj(klass); + IxManager_init(manager, NULL, NULL); + + Indexer *indexer = Indexer_new(NULL, index, manager, 0); + ... + Indexer_Commit(indexer); + + DECREF(indexer); + DECREF(manager); +} +</code></pre> +<p>However, we canât procrastinate forever. 
Eventually, weâll have to run an +ordinary, uncontrolled indexing session, potentially triggering a large +rewrite of lots of small and/or degraded segments:</p> +<pre><code class="language-c">void +do_index(Obj *index) { + Indexer *indexer = Indexer_new(NULL, index, NULL /* manager */, 0); + ... + Indexer_Commit(indexer); + DECREF(indexer); +} +</code></pre> +<h3>Acceptable worst-case update time, slower degradation</h3> +<p>Never merging anything at all in the main indexing process is probably +overkill. Small segments are relatively cheap to merge; we just need to guard +against the big rewrites.</p> +<p>Setting a ceiling on the number of documents in the segments to be recycled +allows us to avoid a mass proliferation of tiny, single-document segments, +while still offering decent worst-case update speed:</p> +<pre><code class="language-c">Vector* +LightMergeManager_Recycle_IMP(IndexManager *self, PolyReader *reader, + DeletionsWriter *del_writer, int64_t cutoff, + bool optimize) { + IndexManager_Recycle_t super_recycle + = SUPER_METHOD_PTR(IndexManager, LUCY_IndexManager_Recycle); + Vector *seg_readers = super_recycle(self, reader, del_writer, cutoff, + optimize); + Vector *small_segments = Vec_new(0); + + for (size_t i = 0, max = Vec_Get_Size(seg_readers); i < max; i++) { + SegReader *seg_reader = (SegReader*)Vec_Fetch(seg_readers, i); + + if (SegReader_Doc_Max(seg_reader) < 10) { + Vec_Push(small_segments, INCREF(seg_reader)); + } + } + + DECREF(seg_readers); + return small_segments; +} +</code></pre> +<p>However, we still have to consolidate every once in a while, and while that +happens content updates will be locked out.</p> +<h3>Background merging</h3> +<p>If itâs not acceptable to lock out updates while the index consolidation +process runs, the alternative is to move the consolidation process out of +band, using <a href="../../../Lucy/Index/BackgroundMerger.html">BackgroundMerger</a>.</p> +<p>Itâs never safe to have more than one Indexer attempting to 
modify the content
+of an index at the same time, but a BackgroundMerger and an Indexer can
+operate simultaneously:</p>
+<pre><code class="language-c">typedef struct {
+    Obj *index;
+    Doc *doc;
+} Context;
+
+static void
+S_index_doc(void *arg) {
+    Context *ctx = (Context*)arg;
+
+    Class *klass = Class_singleton("LightMergeManager", INDEXMANAGER);
+    Class_Override(klass, (cfish_method_t)LightMergeManager_Recycle_IMP,
+                   LUCY_IndexManager_Recycle_OFFSET);
+
+    IndexManager *manager = (IndexManager*)Class_Make_Obj(klass);
+    IxManager_init(manager, NULL, NULL);
+
+    Indexer *indexer = Indexer_new(NULL, ctx->index, manager, 0);
+    Indexer_Add_Doc(indexer, ctx->doc, 1.0);
+    Indexer_Commit(indexer);
+
+    DECREF(indexer);
+    DECREF(manager);
+}
+
+void indexing_process(Obj *index, Doc *doc) {
+    Context ctx;
+    ctx.index = index;
+    ctx.doc   = doc;
+
+    for (int i = 0; i < max_retries; i++) {
+        Err *err = Err_trap(S_index_doc, &ctx);
+        if (!err) { break; }
+        if (!Err_is_a(err, LOCKERR)) {
+            RETHROW(err);
+        }
+        WARN("Couldn't get lock (%d retries)", i);
+        DECREF(err);
+    }
+}
+
+void
+background_merge_process(Obj *index) {
+    IndexManager *manager = IxManager_new(NULL, NULL);
+    IxManager_Set_Write_Lock_Timeout(manager, 60000);
+
+    BackgroundMerger *bg_merger = BGMerger_new(index, manager);
+    BGMerger_Commit(bg_merger);
+
+    DECREF(bg_merger);
+    DECREF(manager);
+}
+</code></pre>
+<p>The exception handling code becomes useful once you have more than one index
+modification process happening simultaneously.  By default, Indexer tries
+several times to acquire a write lock over the span of one second, then holds
+it until <a href="../../../Lucy/Index/Indexer.html#func_Commit">Commit()</a> completes.  BackgroundMerger handles
+most of its work without the write lock, but it does need it briefly once at
+the beginning and once again near the end.
Under normal loads, the internal retry logic will
+resolve conflicts, but if it's not acceptable to miss an insert, you probably
+want to catch <a href="../../../Lucy/Store/LockErr.html">LockErr</a> exceptions thrown by Indexer.  In
+contrast, a LockErr from BackgroundMerger probably just needs to be logged.</p>
+</div>

Added: lucy/site/trunk/content/docs/c/Lucy/Docs/DevGuide.mdtext
URL: http://svn.apache.org/viewvc/lucy/site/trunk/content/docs/c/Lucy/Docs/DevGuide.mdtext?rev=1737682&view=auto
==============================================================================
--- lucy/site/trunk/content/docs/c/Lucy/Docs/DevGuide.mdtext (added)
+++ lucy/site/trunk/content/docs/c/Lucy/Docs/DevGuide.mdtext Mon Apr 4 12:55:10 2016
@@ -0,0 +1,36 @@
+Title: Lucy::Docs::DevGuide

+<div class="c-api">
+<h2>Quick-start guide to hacking on Apache Lucy.</h2>
+<p>The Apache Lucy code base is organized into roughly four layers:</p>
+<ul>
+<li>Charmonizer - compiler and OS configuration probing.</li>
+<li>Clownfish - header files.</li>
+<li>C - implementation files.</li>
+<li>Host - binding language.</li>
+</ul>
+<p>Charmonizer is a configuration prober which writes a single header file,
+"charmony.h", describing the build environment and facilitating
+cross-platform development.  It's similar to Autoconf or Metaconfig, but
+written in pure C.</p>
+<p>The ".cfh" files within the Lucy core are Clownfish header files.
+Clownfish is a purpose-built, declaration-only language which superimposes
+a single-inheritance object model on top of C.  It is specifically
+designed to co-exist happily with a variety of "host" languages and to allow
+limited run-time dynamic subclassing.  For more information see the
+Clownfish docs, but if there's one thing you should know about Clownfish OO
+before you start hacking, it's that method calls are differentiated from
+functions by capitalization:</p>
+<pre><code>Indexer_Add_Doc <-- Method, typically uses dynamic dispatch.
+Indexer_add_doc <-- Function, always a direct invocation.
+</code></pre>
+<p>The C files within the Lucy core are where most of Lucy's low-level
+functionality lies.  They implement the interface defined by the Clownfish
+header files.</p>
+<p>The C core is intentionally left incomplete, however; to be usable, it must
+be bound to a "host" language.  (In this context, even C is considered a
+"host" which must implement the missing pieces and be "bound" to the core.)
+Some of the binding code is autogenerated by Clownfish based on a spec customized
+for each language.  Other pieces are hand-coded in either C (using the
+host's C API) or the host language itself.</p>
+</div>

Added: lucy/site/trunk/content/docs/c/Lucy/Docs/DocIDs.mdtext
URL: http://svn.apache.org/viewvc/lucy/site/trunk/content/docs/c/Lucy/Docs/DocIDs.mdtext?rev=1737682&view=auto
==============================================================================
--- lucy/site/trunk/content/docs/c/Lucy/Docs/DocIDs.mdtext (added)
+++ lucy/site/trunk/content/docs/c/Lucy/Docs/DocIDs.mdtext Mon Apr 4 12:55:10 2016
@@ -0,0 +1,20 @@
+Title: Lucy::Docs::DocIDs

+<div class="c-api">
+<h2>Characteristics of Apache Lucy document ids.</h2>
+<h3>Document ids are signed 32-bit integers</h3>
+<p>Document ids in Apache Lucy start at 1.  Because 0 is never a valid doc id, we
+can use it as a sentinel value:</p>
+<pre><code>Code example for C is missing</code></pre>
+<h3>Document ids are ephemeral</h3>
+<p>The document ids used by Lucy are associated with a single index
+snapshot.  The moment an index is updated, the mapping of document ids to
+documents is subject to change.</p>
+<p>Since IndexReader objects represent a point-in-time view of an index, document
+ids are guaranteed to remain static for the life of the reader.  However,
+because they are not permanent, Lucy document ids cannot be used as
+foreign keys to locate records in external data sources.
If you truly need a
+primary key field, you must define it and populate it yourself.</p>
+<p>Furthermore, the order of document ids does not tell you anything about the
+sequence in which documents were added to the index.</p>
+</div>

Added: lucy/site/trunk/content/docs/c/Lucy/Docs/FileFormat.mdtext
URL: http://svn.apache.org/viewvc/lucy/site/trunk/content/docs/c/Lucy/Docs/FileFormat.mdtext?rev=1737682&view=auto
==============================================================================
--- lucy/site/trunk/content/docs/c/Lucy/Docs/FileFormat.mdtext (added)
+++ lucy/site/trunk/content/docs/c/Lucy/Docs/FileFormat.mdtext Mon Apr 4 12:55:10 2016
@@ -0,0 +1,172 @@
+Title: Lucy::Docs::FileFormat

+<div class="c-api">
+<h2>Overview of index file format</h2>
+<p>It is not necessary to understand the current implementation details of the
+index file format in order to use Apache Lucy effectively, but it may be
+helpful if you are interested in tweaking for high performance, exotic usage,
+or debugging and development.</p>
+<p>On a file system, an index is a directory.  The files inside have a
+hierarchical relationship: an index is made up of "segments", each of which is
+an independent inverted index with its own subdirectory; each segment is made
+up of several component parts.</p>
+<pre><code>[index]--|
+         |--snapshot_XXX.json
+         |--schema_XXX.json
+         |--write.lock
+         |
+         |--seg_1--|
+         |         |--segmeta.json
+         |         |--cfmeta.json
+         |         |--cf.dat-------|
+         |                         |--[lexicon]
+         |                         |--[postings]
+         |                         |--[documents]
+         |                         |--[highlight]
+         |                         |--[deletions]
+         |
+         |--seg_2--|
+         |         |--segmeta.json
+         |         |--cfmeta.json
+         |         |--cf.dat-------|
+         |                         |--[lexicon]
+         |                         |--[postings]
+         |                         |--[documents]
+         |                         |--[highlight]
+         |                         |--[deletions]
+         |
+         |--[...]--|
+</code></pre>
+<h3>Write-once philosophy</h3>
+<p>All segment directory names consist of the string "seg_" followed by a number
+in base 36: seg_1, seg_5m, seg_p9s2 and so on, with higher numbers indicating
+more recent segments.
Once a segment is finished and committed, its name is
+never re-used and its files are never modified.</p>
+<p>Old segments become obsolete and can be removed when their data has been
+consolidated into new segments during the process of segment merging and
+optimization.  A fully-optimized index has only one segment.</p>
+<h3>Top-level entries</h3>
+<p>There are a handful of "top-level" files and directories which belong to the
+entire index rather than to a particular segment.</p>
+<h4>snapshot_XXX.json</h4>
+<p>A "snapshot" file, e.g. <code>snapshot_m7p.json</code>, is a list of index files and
+directories.  Because index files, once written, are never modified, the list
+of entries in a snapshot defines a point-in-time view of the data in an index.</p>
+<p>Like segment directories, snapshot files also utilize the
+unique-base-36-number naming convention; the higher the number, the more
+recent the file.  The appearance of a new snapshot file within the index
+directory constitutes an index update.  While a new segment is being written,
+new files may be added to the index directory, but until a new snapshot file
+gets written, a Searcher opening the index for reading won't know about them.</p>
+<h4>schema_XXX.json</h4>
+<p>The schema file is a Schema object describing the index's format, serialized
+as JSON.  It, too, is versioned, and a given snapshot file will reference one
+and only one schema file.</p>
+<h4>locks</h4>
+<p>By default, only one indexing process may safely modify the index at any given
+time.  Processes reserve an index by laying claim to the <code>write.lock</code> file
+within the <code>locks/</code> directory.  A smattering of other lock files may be used
+from time to time, as well.</p>
+<h3>A segment's component parts</h3>
+<p>By default, each segment has up to five logical components: lexicon, postings,
+document storage, highlight data, and deletions.
Binary data from these
+components gets stored in virtual files within the "cf.dat" compound file;
+metadata is stored in a shared "segmeta.json" file.</p>
+<h4>segmeta.json</h4>
+<p>The segmeta.json file is a central repository for segment metadata.  In
+addition to information such as document counts and field numbers, it also
+warehouses arbitrary metadata on behalf of individual index components.</p>
+<h4>Lexicon</h4>
+<p>Each indexed field gets its own lexicon in each segment.  The exact files
+involved depend on the field's type, but generally speaking there will be two
+parts.  First, there's a primary <code>lexicon-XXX.dat</code> file which houses a
+complete term list associating terms with corpus frequency statistics,
+postings file locations, etc.  Second, one or more "lexicon index" files may
+be present which contain periodic samples from the primary lexicon file to
+facilitate fast lookups.</p>
+<h4>Postings</h4>
+<p>"Posting" is a technical term from the field of
+<a href="../../Lucy/Docs/IRTheory.html">information retrieval</a>, defined as a single
+instance of one term indexing one document.  If you are looking at the index
+in the back of a book, and you see that "freedom" is referenced on pages 8,
+86, and 240, that would be three postings, which taken together form a
+"posting list".  The same terminology applies to an index in electronic form.</p>
+<p>Each segment has one postings file per indexed field.  When a search is
+performed for a single term, first that term is looked up in the lexicon.  If
+the term exists in the segment, the record in the lexicon will contain
+information about which postings file to look at and where to look.</p>
+<p>The first thing any posting record tells you is a document id.  By iterating
+over all the postings associated with a term, you can find all the documents
+that match that term, a process which is analogous to looking up page numbers
+in a book's index.
However, each posting record typically contains other
+information in addition to the document id, e.g. the positions at which the
+term occurs within the field.</p>
+<h4>Documents</h4>
+<p>The document storage section is a simple database, organized into two files:</p>
+<ul>
+<li>
+<p><strong>documents.dat</strong> - Serialized documents.</p>
+</li>
+<li>
+<p><strong>documents.ix</strong> - Document storage index, a solid array of 64-bit integers
+where each integer location corresponds to a document id, and the value at
+that location points at a file position in the documents.dat file.</p>
+</li>
+</ul>
+<h4>Highlight data</h4>
+<p>The files which store data used for excerpting and highlighting are organized
+similarly to the files used to store documents.</p>
+<ul>
+<li>
+<p><strong>highlight.dat</strong> - Chunks of serialized highlight data, one per doc id.</p>
+</li>
+<li>
+<p><strong>highlight.ix</strong> - Highlight data index -- as with the <code>documents.ix</code> file, a
+solid array of 64-bit file pointers.</p>
+</li>
+</ul>
+<h4>Deletions</h4>
+<p>When a document is "deleted" from a segment, it is not actually purged right
+away; it is merely marked as "deleted" via a deletions file.  Deletions files
+contain bit vectors with one bit for each document in the segment; if bit
+#254 is set, then document 254 is deleted, and if that document turns up in a
+search it will be masked out.</p>
+<p>It is only when a segment's contents are rewritten to a new segment during the
+segment-merging process that deleted documents truly go away.</p>
+<h3>Compound Files</h3>
+<p>If you peer inside an index directory, you won't actually find any files named
+"documents.dat", "highlight.ix", etc. unless there is an indexing process
+underway.
What you will find instead is one "cf.dat" and one "cfmeta.json"
+file per segment.</p>
+<p>To minimize the need for file descriptors at search-time, all per-segment
+binary data files are concatenated together in "cf.dat" at the close of each
+indexing session.  Information about where each file begins and ends is stored
+in <code>cfmeta.json</code>.  When the segment is opened for reading, a single file
+descriptor per "cf.dat" file can be shared among several readers.</p>
+<h3>A Typical Search</h3>
+<p>Here's a simplified narrative, dramatizing how a search for "freedom" against
+a given segment plays out:</p>
+<ol>
+<li>
+<p>The searcher asks the relevant Lexicon Index, "Do you know anything about
+'freedom'?"  The Lexicon Index replies, "Can't say for sure, but if the main
+Lexicon file does, 'freedom' is probably somewhere around byte 21008".</p>
+</li>
+<li>
+<p>The main Lexicon tells the searcher, "One moment, let me scan our records...
+Yes, we have 2 documents which contain 'freedom'.  You'll find them in
+seg_6/postings-4.dat starting at byte 66991."</p>
+</li>
+<li>
+<p>The Postings file says, "Yep, we have 'freedom', all right!  Document id 40
+has 1 'freedom', and document 44 has 8.  If you need to know more, like if any
+'freedom' is part of the phrase 'freedom of speech', ask me about positions!"</p>
+</li>
+<li>
+<p>If the searcher is only looking for 'freedom' in isolation, that's where it
+stops.
It now knows enough to assign the documents scores against "freedom",
+with the 8-freedom document likely ranking higher than the single-freedom
+document.</p>
+</li>
+</ol>
+</div>

Added: lucy/site/trunk/content/docs/c/Lucy/Docs/FileLocking.mdtext
URL: http://svn.apache.org/viewvc/lucy/site/trunk/content/docs/c/Lucy/Docs/FileLocking.mdtext?rev=1737682&view=auto
==============================================================================
--- lucy/site/trunk/content/docs/c/Lucy/Docs/FileLocking.mdtext (added)
+++ lucy/site/trunk/content/docs/c/Lucy/Docs/FileLocking.mdtext Mon Apr 4 12:55:10 2016
@@ -0,0 +1,56 @@
+Title: Lucy::Docs::FileLocking

+<div class="c-api">
+<h2>Manage indexes on shared volumes.</h2>
+<p>Normally, index locking is an invisible process.  Exclusive write access is
+controlled via lockfiles within the index directory, and problems only arise
+if multiple processes attempt to acquire the write lock simultaneously;
+search-time processes do not ordinarily require locking at all.</p>
+<p>On shared volumes, however, the default locking mechanism fails, and manual
+intervention becomes necessary.</p>
+<p>Both read and write applications accessing an index on a shared volume need
+to identify themselves with a unique <code>host</code> id, e.g. hostname or
+IP address.  Knowing the host id makes it possible to tell which lockfiles
+belong to other machines and therefore must not be removed when the
+lockfile's pid number appears not to correspond to an active process.</p>
+<p>At index-time, the danger is that multiple indexing processes from
+different machines which fail to specify a unique <code>host</code> id can
+delete each other's lockfiles and then attempt to modify the index at the
+same time, causing index corruption.  The search-time problem is more
+complex.</p>
+<p>Once an index file is no longer listed in the most recent snapshot, Indexer
+attempts to delete it as part of a post-<a href="lucy:Indexer.Commit">Commit()</a> cleanup routine.
It is
+possible that at the moment an Indexer is deleting files which it believes are
+no longer needed, a Searcher referencing an earlier snapshot is in fact
+using them.  The more often an index is either updated or searched, the
+more likely it is that this conflict will arise from time to time.</p>
+<p>Ordinarily, the deletion attempts are not a problem.  On a typical unix
+volume, the files will be deleted in name only: any process which holds an
+open filehandle against a given file will continue to have access, and the
+file won't actually get vaporized until the last filehandle is cleared.
+Thanks to "delete on last close" semantics, an Indexer can't truly delete
+the file out from underneath an active Searcher.  On Windows, where file
+deletion fails whenever any process holds an open handle, the situation is
+different but still workable: Indexer just keeps retrying after each commit
+until deletion finally succeeds.</p>
+<p>On NFS, however, the system breaks, because NFS allows files to be deleted
+out from underneath active processes.  Should this happen, the unlucky read
+process will crash with a "Stale NFS filehandle" exception.</p>
+<p>Under normal circumstances, it is neither necessary nor desirable for
+IndexReaders to secure read locks against an index, but for NFS we have to
+make an exception.  LockFactory's <a href="lucy:LockFactory.Make_Shared_Lock">Make_Shared_Lock()</a> method exists for this
+reason; supplying an IndexManager instance to IndexReader's constructor
+activates an internal locking mechanism using <a href="lucy:LockFactory.Make_Shared_Lock">Make_Shared_Lock()</a> which
+prevents concurrent indexing processes from deleting files that are needed
+by active readers.</p>
+<pre><code>Code example for C is missing</code></pre>
+<p>Since shared locks are implemented using lockfiles located in the index
+directory (as are exclusive locks), reader applications must have write
+access for read locking to work.
Stale lock files from crashed processes
+are ordinarily cleared away the next time the same machine -- as identified
+by the <code>host</code> parameter -- opens another IndexReader.  (The
+classic technique of timing out lock files is not feasible because search
+processes may lie dormant indefinitely.)  However, please be aware that if
+the last thing a given machine does is crash, lock files belonging to it
+may persist, preventing deletion of obsolete index data.</p>
+</div>