Added: lucy/site/trunk/content/docs/perl/Lucy.mdtext URL: http://svn.apache.org/viewvc/lucy/site/trunk/content/docs/perl/Lucy.mdtext?rev=1737642&view=auto ============================================================================== --- lucy/site/trunk/content/docs/perl/Lucy.mdtext (added) +++ lucy/site/trunk/content/docs/perl/Lucy.mdtext Mon Apr 4 09:22:30 2016 @@ -0,0 +1,243 @@ +Title: Lucy â Apache Lucy Documentation + +<div> +<a name='___top' class='dummyTopAnchor' ></a> + +<h2><a class='u' +name="NAME" +>NAME</a></h2> + +<p>Lucy - Apache Lucy search engine library.</p> + +<h2><a class='u' +name="VERSION" +>VERSION</a></h2> + +<p>0.5.0</p> + +<h2><a class='u' +name="SYNOPSIS" +>SYNOPSIS</a></h2> + +<p>First, +plan out your index structure, +create the index, +and add documents:</p> + +<pre># indexer.pl + +use Lucy::Index::Indexer; +use Lucy::Plan::Schema; +use Lucy::Analysis::EasyAnalyzer; +use Lucy::Plan::FullTextType; + +# Create a Schema which defines index fields. +my $schema = Lucy::Plan::Schema->new; +my $easyanalyzer = Lucy::Analysis::EasyAnalyzer->new( + language => 'en', +); +my $type = Lucy::Plan::FullTextType->new( + analyzer => $easyanalyzer, +); +$schema->spec_field( name => 'title', type => $type ); +$schema->spec_field( name => 'content', type => $type ); + +# Create the index and add documents. +my $indexer = Lucy::Index::Indexer->new( + schema => $schema, + index => '/path/to/index', + create => 1, +); +while ( my ( $title, $content ) = each %source_docs ) { + $indexer->add_doc({ + title => $title, + content => $content, + }); +} +$indexer->commit;</pre> + +<p>Then, +search the index:</p> + +<pre># search.pl + +use Lucy::Search::IndexSearcher; + +my $searcher = Lucy::Search::IndexSearcher->new( + index => '/path/to/index' +); +my $hits = $searcher->hits( query => "foo bar" ); +while ( my $hit = $hits->next ) { + print "$hit->{title}\n"; +}</pre> + +<h2><a class='u' +name="DESCRIPTION" +>DESCRIPTION</a></h2> + +<p>The Apache Lucy search engine library delivers high-performance, +modular full-text search.</p> + +<h3><a class='u' +name="Features" +>Features</a></h3> + +<ul> +<li>Extremely fast. 
+A single machine can handle millions of documents.</li> + +<li>Scalable to multiple machines.</li> + +<li>Incremental indexing (addition/deletion of documents to/from an existing index).</li> + +<li>Configurable near-real-time index updates.</li> + +<li>Unicode support.</li> + +<li>Support for boolean operators AND, +OR, +and AND NOT; parenthetical groupings; prepended +plus and -minus.</li> + +<li>Algorithmic selection of relevant excerpts and highlighting of search terms within excerpts.</li> + +<li>Highly customizable query and indexing APIs.</li> + +<li>Customizable sorting.</li> + +<li>Phrase matching.</li> + +<li>Stemming.</li> + +<li>Stoplists.</li> +</ul> + +<h3><a class='u' +name="Getting_Started" +>Getting Started</a></h3> + +<p><a href="./Lucy/Simple.html" class="podlinkpod" +>Lucy::Simple</a> provides a stripped down API which may suffice for many tasks.</p> + +<p><a href="./Lucy/Docs/Tutorial.html" class="podlinkpod" +>Lucy::Docs::Tutorial</a> demonstrates how to build a basic CGI search application.</p> + +<p>The tutorial spends most of its time on these five classes:</p> + +<ul> +<li><a href="./Lucy/Plan/Schema.html" class="podlinkpod" +>Lucy::Plan::Schema</a> - Plan out your index.</li> + +<li><a href="./Lucy/Plan/FieldType.html" class="podlinkpod" +>Lucy::Plan::FieldType</a> - Define index fields.</li> + +<li><a href="./Lucy/Index/Indexer.html" class="podlinkpod" +>Lucy::Index::Indexer</a> - Manipulate index content.</li> + +<li><a href="./Lucy/Search/IndexSearcher.html" class="podlinkpod" +>Lucy::Search::IndexSearcher</a> - Search an index.</li> + +<li><a href="./Lucy/Analysis/EasyAnalyzer.html" class="podlinkpod" +>Lucy::Analysis::EasyAnalyzer</a> - A one-size-fits-all parser/tokenizer.</li> +</ul> + +<h3><a class='u' +name="Delving_Deeper" +>Delving Deeper</a></h3> + +<p><a href="./Lucy/Docs/Cookbook.html" class="podlinkpod" +>Lucy::Docs::Cookbook</a> augments the tutorial with more advanced recipes.</p> + +<p>For creating complex queries, +see <a href="./Lucy/Search/Query.html" class="podlinkpod" +>Lucy::Search::Query</a> and its subclasses <a href="./Lucy/Search/TermQuery.html" class="podlinkpod" +>TermQuery</a>, +<a href="./Lucy/Search/PhraseQuery.html" class="podlinkpod" +>PhraseQuery</a>, +<a href="./Lucy/Search/ANDQuery.html" class="podlinkpod" +>ANDQuery</a>, +<a href="./Lucy/Search/ORQuery.html" class="podlinkpod" +>ORQuery</a>, +<a href="./Lucy/Search/NOTQuery.html" class="podlinkpod" +>NOTQuery</a>, +<a href="./Lucy/Search/RequiredOptionalQuery.html" class="podlinkpod" +>RequiredOptionalQuery</a>, +<a href="./Lucy/Search/MatchAllQuery.html" class="podlinkpod" +>MatchAllQuery</a>, +and <a href="./Lucy/Search/NoMatchQuery.html" class="podlinkpod" +>NoMatchQuery</a>, +plus <a href="./Lucy/Search/QueryParser.html" class="podlinkpod" +>Lucy::Search::QueryParser</a>.</p> + +<p>For distributed searching, +see <a href="./LucyX/Remote/SearchServer.html" class="podlinkpod" +>LucyX::Remote::SearchServer</a>, +<a href="./LucyX/Remote/SearchClient.html" class="podlinkpod" +>LucyX::Remote::SearchClient</a>, +and <a href="./LucyX/Remote/ClusterSearcher.html" class="podlinkpod" +>LucyX::Remote::ClusterSearcher</a>.</p> + +<h3><a class='u' +name="Backwards_Compatibility_Policy" +>Backwards Compatibility Policy</a></h3> + +<p>Lucy will spin off stable forks into new namespaces periodically. +The first will be named "Lucy1". 
+Users who require strong backwards compatibility should use a stable fork.</p> + +<p>The main namespace, +"Lucy", +is an API-unstable development branch (as hinted at by its 0.x.x version number). +Superficial interface changes happen frequently. +Hard file format compatibility breaks which require reindexing are rare, +as we generally try to provide continuity across multiple releases, +but we reserve the right to make such changes.</p> + +<h2><a class='u' +name="CLASS_METHODS" +>CLASS METHODS</a></h2> + +<p>The Lucy module itself does not have a large interface, +providing only a single public class method.</p> + +<h3><a class='u' +name="error" +>error</a></h3> + +<pre>my $instream = $folder->open_in( file => 'foo' ) or die Clownfish->error;</pre> + +<p>Access a shared variable which is set by some routines on failure. +It will always be either a <a href="./Clownfish/Err.html" class="podlinkpod" +>Clownfish::Err</a> object or undef.</p> + +<h2><a class='u' +name="SUPPORT" +>SUPPORT</a></h2> + +<p>The Apache Lucy homepage, +where you'll find links to our mailing lists and so on, +is <a href="http://lucy.apache.org" class="podlinkurl" +>http://lucy.apache.org</a>. +Please direct support questions to the Lucy users mailing list.</p> + +<h2><a class='u' +name="BUGS" +>BUGS</a></h2> + +<p>Not thread-safe.</p> + +<p>Some exceptions leak memory.</p> + +<p>If you find a bug, +please inquire on the Lucy users mailing list about it, +then report it on the Lucy issue tracker once it has been confirmed: <a href="https://issues.apache.org/jira/browse/LUCY" class="podlinkurl" +>https://issues.apache.org/jira/browse/LUCY</a>.</p> + +<h2><a class='u' +name="COPYRIGHT" +>COPYRIGHT</a></h2> + +<p>Apache Lucy is distributed under the Apache License, +Version 2.0, +as described in the file <code>LICENSE</code> included with the distribution.</p> + +</div>
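<p>One note on the SYNOPSIS above: it refers to a <code>%source_docs</code> hash without defining it. A minimal, self-contained sketch of that same indexer might look like the following; the sample documents and the <code>/tmp/lucy_demo_index</code> path are illustrative assumptions only, and every API call is taken directly from the SYNOPSIS.</p>

<pre># index_sample.pl - a self-contained sketch of the SYNOPSIS indexer.
use strict;
use warnings;
use Lucy::Index::Indexer;
use Lucy::Plan::Schema;
use Lucy::Analysis::EasyAnalyzer;
use Lucy::Plan::FullTextType;

# Illustrative sample data standing in for %source_docs.
my %source_docs = (
    'Hamlet' => 'To be, or not to be: that is the question.',
    'Walrus' => 'I am the walrus.',
);

# Define the index fields, as in the SYNOPSIS.
my $schema   = Lucy::Plan::Schema->new;
my $analyzer = Lucy::Analysis::EasyAnalyzer->new( language => 'en' );
my $type     = Lucy::Plan::FullTextType->new( analyzer => $analyzer );
$schema->spec_field( name => 'title',   type => $type );
$schema->spec_field( name => 'content', type => $type );

# Create the index and add the sample documents.
my $indexer = Lucy::Index::Indexer->new(
    schema => $schema,
    index  => '/tmp/lucy_demo_index',    # illustrative path
    create => 1,
);
while ( my ( $title, $content ) = each %source_docs ) {
    $indexer->add_doc( { title => $title, content => $content } );
}
$indexer->commit;</pre>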
Added: lucy/site/trunk/content/docs/perl/Lucy/Analysis/Analyzer.mdtext URL: http://svn.apache.org/viewvc/lucy/site/trunk/content/docs/perl/Lucy/Analysis/Analyzer.mdtext?rev=1737642&view=auto ============================================================================== --- lucy/site/trunk/content/docs/perl/Lucy/Analysis/Analyzer.mdtext (added) +++ lucy/site/trunk/content/docs/perl/Lucy/Analysis/Analyzer.mdtext Mon Apr 4 09:22:30 2016 @@ -0,0 +1,143 @@ +Title: Lucy::Analysis::Analyzer â Apache Lucy Documentation + +<div> +<a name='___top' class='dummyTopAnchor' ></a> + +<h2><a class='u' +name="NAME" +>NAME</a></h2> + +<p>Lucy::Analysis::Analyzer - Tokenize/modify/filter text.</p> + +<h2><a class='u' +name="SYNOPSIS" +>SYNOPSIS</a></h2> + +<pre># Abstract base class.</pre> + +<h2><a class='u' +name="DESCRIPTION" +>DESCRIPTION</a></h2> + +<p>An Analyzer is a filter which processes text, +transforming it from one form into another. +For instance, +an analyzer might break up a long text into smaller pieces (<a href="../../Lucy/Analysis/RegexTokenizer.html" class="podlinkpod" +>RegexTokenizer</a>), +or it might perform case folding to facilitate case-insensitive search (<a href="../../Lucy/Analysis/Normalizer.html" class="podlinkpod" +>Normalizer</a>).</p> + +<h2><a class='u' +name="CONSTRUCTORS" +>CONSTRUCTORS</a></h2> + +<h3><a class='u' +name="new" +>new</a></h3> + +<pre>package MyAnalyzer; +use base qw( Lucy::Analysis::Analyzer ); +our %foo; +sub new { + my $self = shift->SUPER::new; + my %args = @_; + $foo{$$self} = $args{foo}; + return $self; +}</pre> + +<p>Abstract constructor. +Takes no arguments.</p> + +<h2><a class='u' +name="ABSTRACT_METHODS" +>ABSTRACT METHODS</a></h2> + +<h3><a class='u' +name="transform" +>transform</a></h3> + +<pre>my $inversion = $analyzer->transform($inversion);</pre> + +<p>Take a single <a href="../../Lucy/Analysis/Inversion.html" class="podlinkpod" +>Inversion</a> as input and returns an Inversion, +either the same one (presumably transformed in some way), +or a new one.</p> + +<ul> +<li><b>inversion</b> - An inversion.</li> +</ul> + +<h2><a class='u' +name="METHODS" +>METHODS</a></h2> + +<h3><a class='u' +name="transform_text" +>transform_text</a></h3> + +<pre>my $inversion = $analyzer->transform_text($text);</pre> + +<p>Kick off an analysis chain, +creating an Inversion from string input. +The default implementation simply creates an initial Inversion with a single Token, +then calls <a href="#transform" class="podlinkpod" +>transform()</a>, +but occasionally subclasses will provide an optimized implementation which minimizes string copies.</p> + +<ul> +<li><b>text</b> - A string.</li> +</ul> + +<h3><a class='u' +name="split" +>split</a></h3> + +<pre>my $arrayref = $analyzer->split($text);</pre> + +<p>Analyze text and return an array of token texts.</p> + +<ul> +<li><b>text</b> - A string.</li> +</ul> + +<h3><a class='u' +name="dump" +>dump</a></h3> + +<pre>my $obj = $analyzer->dump();</pre> + +<p>Dump the analyzer as hash.</p> + +<p>Subclasses should call <a href="#dump" class="podlinkpod" +>dump()</a> on the superclass. +The returned object is a hash which should be populated with parameters of the analyzer.</p> + +<p>Returns: A hash containing a description of the analyzer.</p> + +<h3><a class='u' +name="load" +>load</a></h3> + +<pre>my $obj = $analyzer->load($dump);</pre> + +<p>Reconstruct an analyzer from a dump.</p> + +<p>Subclasses should first call <a href="#load" class="podlinkpod" +>load()</a> on the superclass. 
+The returned object is an analyzer which should be reconstructed by setting the dumped parameters from the hash contained in <code>dump</code>.</p> + +<p>Note that the invocant analyzer is unused.</p> + +<ul> +<li><b>dump</b> - A hash.</li> +</ul> + +<p>Returns: An analyzer.</p> + +<h2><a class='u' +name="INHERITANCE" +>INHERITANCE</a></h2> + +<p>Lucy::Analysis::Analyzer isa Clownfish::Obj.</p> + +</div> Added: lucy/site/trunk/content/docs/perl/Lucy/Analysis/CaseFolder.mdtext URL: http://svn.apache.org/viewvc/lucy/site/trunk/content/docs/perl/Lucy/Analysis/CaseFolder.mdtext?rev=1737642&view=auto ============================================================================== --- lucy/site/trunk/content/docs/perl/Lucy/Analysis/CaseFolder.mdtext (added) +++ lucy/site/trunk/content/docs/perl/Lucy/Analysis/CaseFolder.mdtext Mon Apr 4 09:22:30 2016 @@ -0,0 +1,73 @@ +Title: Lucy::Analysis::CaseFolder â Apache Lucy Documentation + +<div> +<a name='___top' class='dummyTopAnchor' ></a> + +<h2><a class='u' +name="NAME" +>NAME</a></h2> + +<p>Lucy::Analysis::CaseFolder - Normalize case, +facilitating case-insensitive search.</p> + +<h2><a class='u' +name="SYNOPSIS" +>SYNOPSIS</a></h2> + +<pre>my $case_folder = Lucy::Analysis::CaseFolder->new; + +my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new( + analyzers => [ $tokenizer, $case_folder, $stemmer ], +);</pre> + +<h2><a class='u' +name="DESCRIPTION" +>DESCRIPTION</a></h2> + +<p>CaseFolder is DEPRECATED. +Use <a href="../../Lucy/Analysis/Normalizer.html" class="podlinkpod" +>Normalizer</a> instead.</p> + +<p>CaseFolder normalizes text according to Unicode case-folding rules, +so that searches will be case-insensitive.</p> + +<h2><a class='u' +name="CONSTRUCTORS" +>CONSTRUCTORS</a></h2> + +<h3><a class='u' +name="new" +>new</a></h3> + +<pre>my $case_folder = Lucy::Analysis::CaseFolder->new;</pre> + +<p>Constructor. 
+Takes no arguments.</p> + +<h2><a class='u' +name="METHODS" +>METHODS</a></h2> + +<h3><a class='u' +name="transform" +>transform</a></h3> + +<pre>my $inversion = $case_folder->transform($inversion);</pre> + +<p>Take a single <a href="../../Lucy/Analysis/Inversion.html" class="podlinkpod" +>Inversion</a> as input and returns an Inversion, +either the same one (presumably transformed in some way), +or a new one.</p> + +<ul> +<li><b>inversion</b> - An inversion.</li> +</ul> + +<h2><a class='u' +name="INHERITANCE" +>INHERITANCE</a></h2> + +<p>Lucy::Analysis::CaseFolder isa <a href="../../Lucy/Analysis/Analyzer.html" class="podlinkpod" +>Lucy::Analysis::Analyzer</a> isa Clownfish::Obj.</p> + +</div> Added: lucy/site/trunk/content/docs/perl/Lucy/Analysis/EasyAnalyzer.mdtext URL: http://svn.apache.org/viewvc/lucy/site/trunk/content/docs/perl/Lucy/Analysis/EasyAnalyzer.mdtext?rev=1737642&view=auto ============================================================================== --- lucy/site/trunk/content/docs/perl/Lucy/Analysis/EasyAnalyzer.mdtext (added) +++ lucy/site/trunk/content/docs/perl/Lucy/Analysis/EasyAnalyzer.mdtext Mon Apr 4 09:22:30 2016 @@ -0,0 +1,99 @@ +Title: Lucy::Analysis::EasyAnalyzer â Apache Lucy Documentation + +<div> +<a name='___top' class='dummyTopAnchor' ></a> + +<h2><a class='u' +name="NAME" +>NAME</a></h2> + +<p>Lucy::Analysis::EasyAnalyzer - A simple analyzer chain.</p> + +<h2><a class='u' +name="SYNOPSIS" +>SYNOPSIS</a></h2> + +<pre>my $schema = Lucy::Plan::Schema->new; +my $analyzer = Lucy::Analysis::EasyAnalyzer->new( + language => 'en', +); +my $type = Lucy::Plan::FullTextType->new( + analyzer => $analyzer, +); +$schema->spec_field( name => 'title', type => $type ); +$schema->spec_field( name => 'content', type => $type );</pre> + +<h2><a class='u' +name="DESCRIPTION" +>DESCRIPTION</a></h2> + +<p>EasyAnalyzer is an analyzer chain consisting of a <a href="../../Lucy/Analysis/StandardTokenizer.html" class="podlinkpod" +>StandardTokenizer</a>, +a <a href="../../Lucy/Analysis/Normalizer.html" class="podlinkpod" +>Normalizer</a>, +and a <a href="../../Lucy/Analysis/SnowballStemmer.html" class="podlinkpod" +>SnowballStemmer</a>.</p> + +<p>Supported languages:</p> + +<pre>en => English, +da => Danish, +de => German, +es => Spanish, +fi => Finnish, +fr => French, +hu => Hungarian, +it => Italian, +nl => Dutch, +no => Norwegian, +pt => Portuguese, +ro => Romanian, +ru => Russian, +sv => Swedish, +tr => Turkish,</pre> + +<h2><a class='u' +name="CONSTRUCTORS" +>CONSTRUCTORS</a></h2> + +<h3><a class='u' +name="new" +>new</a></h3> + +<pre>my $analyzer = Lucy::Analysis::EasyAnalyzer->new( + language => 'es', +);</pre> + +<p>Create a new EasyAnalyzer.</p> + +<ul> +<li><b>language</b> - An ISO code from the list of supported languages.</li> +</ul> + +<h2><a class='u' +name="METHODS" +>METHODS</a></h2> + +<h3><a class='u' +name="transform" +>transform</a></h3> + +<pre>my $inversion = $easy_analyzer->transform($inversion);</pre> + +<p>Take a single <a href="../../Lucy/Analysis/Inversion.html" class="podlinkpod" +>Inversion</a> as input and returns an Inversion, +either the same one (presumably transformed in some way), +or a new one.</p> + +<ul> +<li><b>inversion</b> - An inversion.</li> +</ul> + +<h2><a class='u' +name="INHERITANCE" +>INHERITANCE</a></h2> + +<p>Lucy::Analysis::EasyAnalyzer isa <a href="../../Lucy/Analysis/Analyzer.html" class="podlinkpod" +>Lucy::Analysis::Analyzer</a> isa Clownfish::Obj.</p> + +</div> Added: 
lucy/site/trunk/content/docs/perl/Lucy/Analysis/Inversion.mdtext URL: http://svn.apache.org/viewvc/lucy/site/trunk/content/docs/perl/Lucy/Analysis/Inversion.mdtext?rev=1737642&view=auto ============================================================================== --- lucy/site/trunk/content/docs/perl/Lucy/Analysis/Inversion.mdtext (added) +++ lucy/site/trunk/content/docs/perl/Lucy/Analysis/Inversion.mdtext Mon Apr 4 09:22:30 2016 @@ -0,0 +1,87 @@ +Title: Lucy::Analysis::Inversion â Apache Lucy Documentation + +<div> +<a name='___top' class='dummyTopAnchor' ></a> + +<h2><a class='u' +name="NAME" +>NAME</a></h2> + +<p>Lucy::Analysis::Inversion - A collection of Tokens.</p> + +<h2><a class='u' +name="SYNOPSIS" +>SYNOPSIS</a></h2> + +<pre>my $result = Lucy::Analysis::Inversion->new; + +while (my $token = $inversion->next) { + $result->append($token); +}</pre> + +<h2><a class='u' +name="DESCRIPTION" +>DESCRIPTION</a></h2> + +<p>An Inversion is a collection of Token objects which you can add to, +then iterate over.</p> + +<h2><a class='u' +name="CONSTRUCTORS" +>CONSTRUCTORS</a></h2> + +<h3><a class='u' +name="new" +>new</a></h3> + +<pre>my $inversion = Lucy::Analysis::Inversion->new( + $seed, # optional +);</pre> + +<p>Create a new Inversion.</p> + +<ul> +<li><b>seed</b> - An initial Token to start things off, +which may be undef.</li> +</ul> + +<h2><a class='u' +name="METHODS" +>METHODS</a></h2> + +<h3><a class='u' +name="append" +>append</a></h3> + +<pre>$inversion->append($token);</pre> + +<p>Tack a token onto the end of the Inversion.</p> + +<ul> +<li><b>token</b> - A Token.</li> +</ul> + +<h3><a class='u' +name="next" +>next</a></h3> + +<pre>my $token = $inversion->next();</pre> + +<p>Return the next token in the Inversion until out of tokens.</p> + +<h3><a class='u' +name="reset" +>reset</a></h3> + +<pre>$inversion->reset();</pre> + +<p>Reset the Inversion’s iterator, +so that the next call to next() returns the first Token in the inversion.</p> + +<h2><a class='u' +name="INHERITANCE" +>INHERITANCE</a></h2> + +<p>Lucy::Analysis::Inversion isa Clownfish::Obj.</p> + +</div> Added: lucy/site/trunk/content/docs/perl/Lucy/Analysis/Normalizer.mdtext URL: http://svn.apache.org/viewvc/lucy/site/trunk/content/docs/perl/Lucy/Analysis/Normalizer.mdtext?rev=1737642&view=auto ============================================================================== --- lucy/site/trunk/content/docs/perl/Lucy/Analysis/Normalizer.mdtext (added) +++ lucy/site/trunk/content/docs/perl/Lucy/Analysis/Normalizer.mdtext Mon Apr 4 09:22:30 2016 @@ -0,0 +1,92 @@ +Title: Lucy::Analysis::Normalizer â Apache Lucy Documentation + +<div> +<a name='___top' class='dummyTopAnchor' ></a> + +<h2><a class='u' +name="NAME" +>NAME</a></h2> + +<p>Lucy::Analysis::Normalizer - Unicode normalization, +case folding and accent stripping.</p> + +<h2><a class='u' +name="SYNOPSIS" +>SYNOPSIS</a></h2> + +<pre>my $normalizer = Lucy::Analysis::Normalizer->new; + +my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new( + analyzers => [ $tokenizer, $normalizer, $stemmer ], +);</pre> + +<h2><a class='u' +name="DESCRIPTION" +>DESCRIPTION</a></h2> + +<p>Normalizer is an <a href="../../Lucy/Analysis/Analyzer.html" class="podlinkpod" +>Analyzer</a> which normalizes tokens to one of the Unicode normalization forms. 
+Optionally, +it performs Unicode case folding and converts accented characters to their base character.</p> + +<p>If you use highlighting, +Normalizer should be run after tokenization because it might add or remove characters.</p> + +<h2><a class='u' +name="CONSTRUCTORS" +>CONSTRUCTORS</a></h2> + +<h3><a class='u' +name="new" +>new</a></h3> + +<pre>my $normalizer = Lucy::Analysis::Normalizer->new( + normalization_form => 'NFKC', + case_fold => 1, + strip_accents => 0, +);</pre> + +<p>Create a new Normalizer.</p> + +<ul> +<li><b>normalization_form</b> - Unicode normalization form, +can be one of ‘NFC’, +‘NFKC’, +‘NFD’, +‘NFKD’. +Defaults to ‘NFKC’.</li> + +<li><b>case_fold</b> - Perform case folding, +default is true.</li> + +<li><b>strip_accents</b> - Strip accents, +default is false.</li> +</ul> + +<h2><a class='u' +name="METHODS" +>METHODS</a></h2> + +<h3><a class='u' +name="transform" +>transform</a></h3> + +<pre>my $inversion = $normalizer->transform($inversion);</pre> + +<p>Take a single <a href="../../Lucy/Analysis/Inversion.html" class="podlinkpod" +>Inversion</a> as input and returns an Inversion, +either the same one (presumably transformed in some way), +or a new one.</p> + +<ul> +<li><b>inversion</b> - An inversion.</li> +</ul> + +<h2><a class='u' +name="INHERITANCE" +>INHERITANCE</a></h2> + +<p>Lucy::Analysis::Normalizer isa <a href="../../Lucy/Analysis/Analyzer.html" class="podlinkpod" +>Lucy::Analysis::Analyzer</a> isa Clownfish::Obj.</p> + +</div> Added: lucy/site/trunk/content/docs/perl/Lucy/Analysis/PolyAnalyzer.mdtext URL: http://svn.apache.org/viewvc/lucy/site/trunk/content/docs/perl/Lucy/Analysis/PolyAnalyzer.mdtext?rev=1737642&view=auto ============================================================================== --- lucy/site/trunk/content/docs/perl/Lucy/Analysis/PolyAnalyzer.mdtext (added) +++ lucy/site/trunk/content/docs/perl/Lucy/Analysis/PolyAnalyzer.mdtext Mon Apr 4 09:22:30 2016 @@ -0,0 +1,134 @@ +Title: Lucy::Analysis::PolyAnalyzer â Apache Lucy Documentation + +<div> +<a name='___top' class='dummyTopAnchor' ></a> + +<h2><a class='u' +name="NAME" +>NAME</a></h2> + +<p>Lucy::Analysis::PolyAnalyzer - Multiple Analyzers in series.</p> + +<h2><a class='u' +name="SYNOPSIS" +>SYNOPSIS</a></h2> + +<pre>my $schema = Lucy::Plan::Schema->new; +my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new( + analyzers => \@analyzers, +); +my $type = Lucy::Plan::FullTextType->new( + analyzer => $polyanalyzer, +); +$schema->spec_field( name => 'title', type => $type ); +$schema->spec_field( name => 'content', type => $type );</pre> + +<h2><a class='u' +name="DESCRIPTION" +>DESCRIPTION</a></h2> + +<p>A PolyAnalyzer is a series of <a href="../../Lucy/Analysis/Analyzer.html" class="podlinkpod" +>Analyzers</a>, +each of which will be called upon to “analyze” text in turn. +You can either provide the Analyzers yourself, +or you can specify a supported language, +in which case a PolyAnalyzer consisting of a <a href="../../Lucy/Analysis/CaseFolder.html" class="podlinkpod" +>CaseFolder</a>, +a <a href="../../Lucy/Analysis/RegexTokenizer.html" class="podlinkpod" +>RegexTokenizer</a>, +and a <a href="../../Lucy/Analysis/SnowballStemmer.html" class="podlinkpod" +>SnowballStemmer</a> will be generated for you.</p> + +<p>The language parameter is DEPRECATED. 
+Use <a href="../../Lucy/Analysis/EasyAnalyzer.html" class="podlinkpod" +>EasyAnalyzer</a> instead.</p> + +<p>Supported languages:</p> + +<pre>en => English, +da => Danish, +de => German, +es => Spanish, +fi => Finnish, +fr => French, +hu => Hungarian, +it => Italian, +nl => Dutch, +no => Norwegian, +pt => Portuguese, +ro => Romanian, +ru => Russian, +sv => Swedish, +tr => Turkish,</pre> + +<h2><a class='u' +name="CONSTRUCTORS" +>CONSTRUCTORS</a></h2> + +<h3><a class='u' +name="new" +>new</a></h3> + +<pre>my $tokenizer = Lucy::Analysis::StandardTokenizer->new; +my $normalizer = Lucy::Analysis::Normalizer->new; +my $stemmer = Lucy::Analysis::SnowballStemmer->new( language => 'en' ); +my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new( + analyzers => [ $tokenizer, $normalizer, $stemmer, ], );</pre> + +<p>Create a new PolyAnalyzer.</p> + +<ul> +<li><b>language</b> - An ISO code from the list of supported languages. +DEPRECATED, +use <a href="../../Lucy/Analysis/EasyAnalyzer.html" class="podlinkpod" +>EasyAnalyzer</a> instead.</li> + +<li><b>analyzers</b> - An array of Analyzers. +The order of the analyzers matters. +Don’t put a SnowballStemmer before a RegexTokenizer (can’t stem whole documents or paragraphs – just individual words), +or a SnowballStopFilter after a SnowballStemmer (stemmed words, +e.g. +“themselv”, +will not appear in a stoplist). +In general, +the sequence should be: tokenize, +normalize, +stopalize, +stem.</li> +</ul> + +<h2><a class='u' +name="METHODS" +>METHODS</a></h2> + +<h3><a class='u' +name="get_analyzers" +>get_analyzers</a></h3> + +<pre>my $arrayref = $poly_analyzer->get_analyzers();</pre> + +<p>Getter for “analyzers” member.</p> + +<h3><a class='u' +name="transform" +>transform</a></h3> + +<pre>my $inversion = $poly_analyzer->transform($inversion);</pre> + +<p>Take a single <a href="../../Lucy/Analysis/Inversion.html" class="podlinkpod" +>Inversion</a> as input and returns an Inversion, +either the same one (presumably transformed in some way), +or a new one.</p> + +<ul> +<li><b>inversion</b> - An inversion.</li> +</ul> + +<h2><a class='u' +name="INHERITANCE" +>INHERITANCE</a></h2> + +<p>Lucy::Analysis::PolyAnalyzer isa <a href="../../Lucy/Analysis/Analyzer.html" class="podlinkpod" +>Lucy::Analysis::Analyzer</a> isa Clownfish::Obj.</p> + +</div> Added: lucy/site/trunk/content/docs/perl/Lucy/Analysis/RegexTokenizer.mdtext URL: http://svn.apache.org/viewvc/lucy/site/trunk/content/docs/perl/Lucy/Analysis/RegexTokenizer.mdtext?rev=1737642&view=auto ============================================================================== --- lucy/site/trunk/content/docs/perl/Lucy/Analysis/RegexTokenizer.mdtext (added) +++ lucy/site/trunk/content/docs/perl/Lucy/Analysis/RegexTokenizer.mdtext Mon Apr 4 09:22:30 2016 @@ -0,0 +1,108 @@ +Title: Lucy::Analysis::RegexTokenizer â Apache Lucy Documentation + +<div> +<a name='___top' class='dummyTopAnchor' ></a> + +<h2><a class='u' +name="NAME" +>NAME</a></h2> + +<p>Lucy::Analysis::RegexTokenizer - Split a string into tokens.</p> + +<h2><a class='u' +name="SYNOPSIS" +>SYNOPSIS</a></h2> + +<pre>my $whitespace_tokenizer + = Lucy::Analysis::RegexTokenizer->new( pattern => '\S+' ); + +# or... +my $word_char_tokenizer + = Lucy::Analysis::RegexTokenizer->new( pattern => '\w+' ); + +# or... +my $apostrophising_tokenizer = Lucy::Analysis::RegexTokenizer->new; + +# Then... 
once you have a tokenizer, put it into a PolyAnalyzer: +my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new( + analyzers => [ $word_char_tokenizer, $normalizer, $stemmer ], );</pre> + +<h2><a class='u' +name="DESCRIPTION" +>DESCRIPTION</a></h2> + +<p>Generically, +“tokenizing” is a process of breaking up a string into an array of “tokens”. +For instance, +the string “three blind mice” might be tokenized into “three”, +“blind”, +“mice”.</p> + +<p>Lucy::Analysis::RegexTokenizer decides where it should break up the text based on a regular expression compiled from a supplied <code>pattern</code> matching one token. +If our source string is…</p> + +<pre>"Eats, Shoots and Leaves."</pre> + +<p>… then a “whitespace tokenizer” with a <code>pattern</code> of <code>"\\S+"</code> produces…</p> + +<pre>Eats, +Shoots +and +Leaves.</pre> + +<p>… while a “word character tokenizer” with a <code>pattern</code> of <code>"\\w+"</code> produces…</p> + +<pre>Eats +Shoots +and +Leaves</pre> + +<p>… the difference being that the word character tokenizer skips over punctuation as well as whitespace when determining token boundaries.</p> + +<h2><a class='u' +name="CONSTRUCTORS" +>CONSTRUCTORS</a></h2> + +<h3><a class='u' +name="new" +>new</a></h3> + +<pre>my $word_char_tokenizer = Lucy::Analysis::RegexTokenizer->new( + pattern => '\w+', # required +);</pre> + +<p>Create a new RegexTokenizer.</p> + +<ul> +<li><b>pattern</b> - A string specifying a Perl-syntax regular expression which should match one token. +The default value is <code>\w+(?:[\x{2019}']\w+)*</code>, +which matches “it’s” as well as “it” and “O’Henry’s” as well as “Henry”.</li> +</ul> + +<h2><a class='u' +name="METHODS" +>METHODS</a></h2> + +<h3><a class='u' +name="transform" +>transform</a></h3> + +<pre>my $inversion = $regex_tokenizer->transform($inversion);</pre> + +<p>Take a single <a href="../../Lucy/Analysis/Inversion.html" class="podlinkpod" +>Inversion</a> as input and returns an Inversion, +either the same one (presumably transformed in some way), +or a new one.</p> + +<ul> +<li><b>inversion</b> - An inversion.</li> +</ul> + +<h2><a class='u' +name="INHERITANCE" +>INHERITANCE</a></h2> + +<p>Lucy::Analysis::RegexTokenizer isa <a href="../../Lucy/Analysis/Analyzer.html" class="podlinkpod" +>Lucy::Analysis::Analyzer</a> isa Clownfish::Obj.</p> + +</div> Added: lucy/site/trunk/content/docs/perl/Lucy/Analysis/SnowballStemmer.mdtext URL: http://svn.apache.org/viewvc/lucy/site/trunk/content/docs/perl/Lucy/Analysis/SnowballStemmer.mdtext?rev=1737642&view=auto ============================================================================== --- lucy/site/trunk/content/docs/perl/Lucy/Analysis/SnowballStemmer.mdtext (added) +++ lucy/site/trunk/content/docs/perl/Lucy/Analysis/SnowballStemmer.mdtext Mon Apr 4 09:22:30 2016 @@ -0,0 +1,78 @@ +Title: Lucy::Analysis::SnowballStemmer â Apache Lucy Documentation + +<div> +<a name='___top' class='dummyTopAnchor' ></a> + +<h2><a class='u' +name="NAME" +>NAME</a></h2> + +<p>Lucy::Analysis::SnowballStemmer - Reduce related words to a shared root.</p> + +<h2><a class='u' +name="SYNOPSIS" +>SYNOPSIS</a></h2> + +<pre>my $stemmer = Lucy::Analysis::SnowballStemmer->new( language => 'es' ); + +my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new( + analyzers => [ $tokenizer, $normalizer, $stemmer ], +);</pre> + +<p>This class is a wrapper around the Snowball stemming library, +so it supports the same languages.</p> + +<h2><a class='u' +name="DESCRIPTION" +>DESCRIPTION</a></h2> + +<p>SnowballStemmer is an <a 
href="../../Lucy/Analysis/Analyzer.html" class="podlinkpod" +>Analyzer</a> which reduces related words to a root form (using the “Snowball” stemming library). +For instance, +“horse”, +“horses”, +and “horsing” all become “hors” – so that a search for ‘horse’ will also match documents containing ‘horses’ and ‘horsing’.</p> + +<h2><a class='u' +name="CONSTRUCTORS" +>CONSTRUCTORS</a></h2> + +<h3><a class='u' +name="new" +>new</a></h3> + +<pre>my $stemmer = Lucy::Analysis::SnowballStemmer->new( language => 'es' );</pre> + +<p>Create a new SnowballStemmer.</p> + +<ul> +<li><b>language</b> - A two-letter ISO code identifying a language supported by Snowball.</li> +</ul> + +<h2><a class='u' +name="METHODS" +>METHODS</a></h2> + +<h3><a class='u' +name="transform" +>transform</a></h3> + +<pre>my $inversion = $snowball_stemmer->transform($inversion);</pre> + +<p>Take a single <a href="../../Lucy/Analysis/Inversion.html" class="podlinkpod" +>Inversion</a> as input and returns an Inversion, +either the same one (presumably transformed in some way), +or a new one.</p> + +<ul> +<li><b>inversion</b> - An inversion.</li> +</ul> + +<h2><a class='u' +name="INHERITANCE" +>INHERITANCE</a></h2> + +<p>Lucy::Analysis::SnowballStemmer isa <a href="../../Lucy/Analysis/Analyzer.html" class="podlinkpod" +>Lucy::Analysis::Analyzer</a> isa Clownfish::Obj.</p> + +</div> Added: lucy/site/trunk/content/docs/perl/Lucy/Analysis/SnowballStopFilter.mdtext URL: http://svn.apache.org/viewvc/lucy/site/trunk/content/docs/perl/Lucy/Analysis/SnowballStopFilter.mdtext?rev=1737642&view=auto ============================================================================== --- lucy/site/trunk/content/docs/perl/Lucy/Analysis/SnowballStopFilter.mdtext (added) +++ lucy/site/trunk/content/docs/perl/Lucy/Analysis/SnowballStopFilter.mdtext Mon Apr 4 09:22:30 2016 @@ -0,0 +1,115 @@ +Title: Lucy::Analysis::SnowballStopFilter â Apache Lucy Documentation + +<div> +<a name='___top' class='dummyTopAnchor' ></a> + +<h2><a class='u' +name="NAME" +>NAME</a></h2> + +<p>Lucy::Analysis::SnowballStopFilter - Suppress a “stoplist” of common words.</p> + +<h2><a class='u' +name="SYNOPSIS" +>SYNOPSIS</a></h2> + +<pre>my $stopfilter = Lucy::Analysis::SnowballStopFilter->new( + language => 'fr', +); +my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new( + analyzers => [ $tokenizer, $normalizer, $stopfilter, $stemmer ], +);</pre> + +<h2><a class='u' +name="DESCRIPTION" +>DESCRIPTION</a></h2> + +<p>A “stoplist” is collection of “stopwords”: words which are common enough to be of little value when determining search results. 
+For example, +so many documents in English contain “the”, +“if”, +and “maybe” that it may improve both performance and relevance to block them.</p> + +<p>Before filtering stopwords:</p> + +<pre>("i", "am", "the", "walrus")</pre> + +<p>After filtering stopwords:</p> + +<pre>("walrus")</pre> + +<p>SnowballStopFilter provides default stoplists for several languages, +courtesy of the <a href="http://snowball.tartarus.org" class="podlinkurl" +>Snowball project</a>, +or you may supply your own.</p> + +<pre>|-----------------------| +| ISO CODE | LANGUAGE | +|-----------------------| +| da | Danish | +| de | German | +| en | English | +| es | Spanish | +| fi | Finnish | +| fr | French | +| hu | Hungarian | +| it | Italian | +| nl | Dutch | +| no | Norwegian | +| pt | Portuguese | +| sv | Swedish | +| ru | Russian | +|-----------------------|</pre> + +<h2><a class='u' +name="CONSTRUCTORS" +>CONSTRUCTORS</a></h2> + +<h3><a class='u' +name="new" +>new</a></h3> + +<pre>my $stopfilter = Lucy::Analysis::SnowballStopFilter->new( + language => 'de', +); + +# or... +my $stopfilter = Lucy::Analysis::SnowballStopFilter->new( + stoplist => \%stoplist, +);</pre> + +<p>Create a new SnowballStopFilter.</p> + +<ul> +<li><b>stoplist</b> - A hash with stopwords as the keys.</li> + +<li><b>language</b> - The ISO code for a supported language.</li> +</ul> + +<h2><a class='u' +name="METHODS" +>METHODS</a></h2> + +<h3><a class='u' +name="transform" +>transform</a></h3> + +<pre>my $inversion = $snowball_stop_filter->transform($inversion);</pre> + +<p>Take a single <a href="../../Lucy/Analysis/Inversion.html" class="podlinkpod" +>Inversion</a> as input and returns an Inversion, +either the same one (presumably transformed in some way), +or a new one.</p> + +<ul> +<li><b>inversion</b> - An inversion.</li> +</ul> + +<h2><a class='u' +name="INHERITANCE" +>INHERITANCE</a></h2> + +<p>Lucy::Analysis::SnowballStopFilter isa <a href="../../Lucy/Analysis/Analyzer.html" class="podlinkpod" +>Lucy::Analysis::Analyzer</a> isa Clownfish::Obj.</p> + +</div> Added: lucy/site/trunk/content/docs/perl/Lucy/Analysis/StandardTokenizer.mdtext URL: http://svn.apache.org/viewvc/lucy/site/trunk/content/docs/perl/Lucy/Analysis/StandardTokenizer.mdtext?rev=1737642&view=auto ============================================================================== --- lucy/site/trunk/content/docs/perl/Lucy/Analysis/StandardTokenizer.mdtext (added) +++ lucy/site/trunk/content/docs/perl/Lucy/Analysis/StandardTokenizer.mdtext Mon Apr 4 09:22:30 2016 @@ -0,0 +1,75 @@ +Title: Lucy::Analysis::StandardTokenizer â Apache Lucy Documentation + +<div> +<a name='___top' class='dummyTopAnchor' ></a> + +<h2><a class='u' +name="NAME" +>NAME</a></h2> + +<p>Lucy::Analysis::StandardTokenizer - Split a string into tokens.</p> + +<h2><a class='u' +name="SYNOPSIS" +>SYNOPSIS</a></h2> + +<pre>my $tokenizer = Lucy::Analysis::StandardTokenizer->new; + +# Then... once you have a tokenizer, put it into a PolyAnalyzer: +my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new( + analyzers => [ $tokenizer, $normalizer, $stemmer ], );</pre> + +<h2><a class='u' +name="DESCRIPTION" +>DESCRIPTION</a></h2> + +<p>Generically, +“tokenizing” is a process of breaking up a string into an array of “tokens”. +For instance, +the string “three blind mice” might be tokenized into “three”, +“blind”, +“mice”.</p> + +<p>Lucy::Analysis::StandardTokenizer breaks up the text at the word boundaries defined in Unicode Standard Annex #29. 
+It then returns those words that contain alphabetic or numeric characters.</p> + +<h2><a class='u' +name="CONSTRUCTORS" +>CONSTRUCTORS</a></h2> + +<h3><a class='u' +name="new" +>new</a></h3> + +<pre>my $tokenizer = Lucy::Analysis::StandardTokenizer->new;</pre> + +<p>Constructor. +Takes no arguments.</p> + +<h2><a class='u' +name="METHODS" +>METHODS</a></h2> + +<h3><a class='u' +name="transform" +>transform</a></h3> + +<pre>my $inversion = $standard_tokenizer->transform($inversion);</pre> + +<p>Take a single <a href="../../Lucy/Analysis/Inversion.html" class="podlinkpod" +>Inversion</a> as input and returns an Inversion, +either the same one (presumably transformed in some way), +or a new one.</p> + +<ul> +<li><b>inversion</b> - An inversion.</li> +</ul> + +<h2><a class='u' +name="INHERITANCE" +>INHERITANCE</a></h2> + +<p>Lucy::Analysis::StandardTokenizer isa <a href="../../Lucy/Analysis/Analyzer.html" class="podlinkpod" +>Lucy::Analysis::Analyzer</a> isa Clownfish::Obj.</p> + +</div> Added: lucy/site/trunk/content/docs/perl/Lucy/Analysis/Token.mdtext URL: http://svn.apache.org/viewvc/lucy/site/trunk/content/docs/perl/Lucy/Analysis/Token.mdtext?rev=1737642&view=auto ============================================================================== --- lucy/site/trunk/content/docs/perl/Lucy/Analysis/Token.mdtext (added) +++ lucy/site/trunk/content/docs/perl/Lucy/Analysis/Token.mdtext Mon Apr 4 09:22:30 2016 @@ -0,0 +1,154 @@ +Title: Lucy::Analysis::Token â Apache Lucy Documentation + +<div> +<a name='___top' class='dummyTopAnchor' ></a> + +<h2><a class='u' +name="NAME" +>NAME</a></h2> + +<p>Lucy::Analysis::Token - Unit of text.</p> + +<h2><a class='u' +name="SYNOPSIS" +>SYNOPSIS</a></h2> + +<pre> my $token = Lucy::Analysis::Token->new( + text => 'blind', + start_offset => 8, + end_offset => 13, + ); + + $token->set_text('mice');</pre> + +<h2><a class='u' +name="DESCRIPTION" +>DESCRIPTION</a></h2> + +<p>Token is the fundamental unit used by Apache Lucy’s Analyzer subclasses. +Each Token has 5 attributes: <code>text</code>, +<code>start_offset</code>, +<code>end_offset</code>, +<code>boost</code>, +and <code>pos_inc</code>.</p> + +<p>The <code>text</code> attribute is a Unicode string encoded as UTF-8.</p> + +<p><code>start_offset</code> is the start point of the token text, +measured in Unicode code points from the top of the stored field; <code>end_offset</code> delimits the corresponding closing boundary. +<code>start_offset</code> and <code>end_offset</code> locate the Token within a larger context, +even if the Token’s text attribute gets modified – by stemming, +for instance. +The Token for “beating” in the text “beating a dead horse” begins life with a start_offset of 0 and an end_offset of 7; after stemming, +the text is “beat”, +but the start_offset is still 0 and the end_offset is still 7. +This allows “beating” to be highlighted correctly after a search matches “beat”.</p> + +<p><code>boost</code> is a per-token weight. +Use this when you want to assign more or less importance to a particular token, +as you might for emboldened text within an HTML document, +for example. +(Note: The field this token belongs to must be spec’d to use a posting of type RichPosting.)</p> + +<p><code>pos_inc</code> is the POSition INCrement, +measured in Tokens. +This attribute, +which defaults to 1, +is a an advanced tool for manipulating phrase matching. +Ordinarily, +Tokens are assigned consecutive position numbers: 0, +1, +and 2 for <code>"three blind mice"</code>. 
+However, +if you set the position increment for “blind” to, +say, +1000, +then the three tokens will end up assigned to positions 0, +1, +and 1001 – and will no longer produce a phrase match for the query <code>"three blind mice"</code>.</p> + +<h2><a class='u' +name="CONSTRUCTORS" +>CONSTRUCTORS</a></h2> + +<h3><a class='u' +name="new" +>new</a></h3> + +<pre>my $token = Lucy::Analysis::Token->new( + text => $text, # required + start_offset => $start_offset, # required + end_offset => $end_offset, # required + boost => 1.0, # optional + pos_inc => 1, # optional +);</pre> + +<ul> +<li><b>text</b> - A string.</li> + +<li><b>start_offset</b> - Start offset into the original document in Unicode code points.</li> + +<li><b>start_offset</b> - End offset into the original document in Unicode code points.</li> + +<li><b>boost</b> - Per-token weight.</li> + +<li><b>pos_inc</b> - Position increment for phrase matching.</li> +</ul> + +<h2><a class='u' +name="METHODS" +>METHODS</a></h2> + +<h3><a class='u' +name="get_text" +>get_text</a></h3> + +<pre>my $text = $token->get_text;</pre> + +<p>Get the token's text.</p> + +<h3><a class='u' +name="set_text" +>set_text</a></h3> + +<pre>$token->set_text($text);</pre> + +<p>Set the token's text.</p> + +<h3><a class='u' +name="get_start_offset" +>get_start_offset</a></h3> + +<pre>my $int = $token->get_start_offset();</pre> + +<h3><a class='u' +name="get_end_offset" +>get_end_offset</a></h3> + +<pre>my $int = $token->get_end_offset();</pre> + +<h3><a class='u' +name="get_boost" +>get_boost</a></h3> + +<pre>my $float = $token->get_boost();</pre> + +<h3><a class='u' +name="get_pos_inc" +>get_pos_inc</a></h3> + +<pre>my $int = $token->get_pos_inc();</pre> + +<h3><a class='u' +name="get_len" +>get_len</a></h3> + +<pre>my $int = $token->get_len();</pre> + +<h2><a class='u' +name="INHERITANCE" +>INHERITANCE</a></h2> + +<p>Lucy::Analysis::Token isa Clownfish::Obj.</p> + +</div> Added: lucy/site/trunk/content/docs/perl/Lucy/Docs/Cookbook.mdtext URL: http://svn.apache.org/viewvc/lucy/site/trunk/content/docs/perl/Lucy/Docs/Cookbook.mdtext?rev=1737642&view=auto ============================================================================== --- lucy/site/trunk/content/docs/perl/Lucy/Docs/Cookbook.mdtext (added) +++ lucy/site/trunk/content/docs/perl/Lucy/Docs/Cookbook.mdtext Mon Apr 4 09:22:30 2016 @@ -0,0 +1,52 @@ +Title: Lucy::Docs::Cookbook â Apache Lucy Documentation + +<div> +<a name='___top' class='dummyTopAnchor' ></a> + +<h2><a class='u' +name="NAME" +>NAME</a></h2> + +<p>Lucy::Docs::Cookbook - Apache Lucy recipes</p> + +<h2><a class='u' +name="DESCRIPTION" +>DESCRIPTION</a></h2> + +<p>The Cookbook provides thematic documentation covering some of Apache Lucy’s more sophisticated features. +For a step-by-step introduction to Lucy, +see <a href="../../Lucy/Docs/Tutorial.html" class="podlinkpod" +>Tutorial</a>.</p> + +<h3><a class='u' +name="Chapters" +>Chapters</a></h3> + +<ul> +<li><a href="../../Lucy/Docs/Cookbook/FastUpdates.html" class="podlinkpod" +>FastUpdates</a> - While index updates are fast on average, +worst-case update performance may be significantly slower. 
+To make index updates consistently quick, +we must manually intervene to control the process of index segment consolidation.</li> + +<li><a href="../../Lucy/Docs/Cookbook/CustomQuery.html" class="podlinkpod" +>CustomQuery</a> - Explore Lucy’s support for custom query types by creating a “PrefixQuery” class to handle trailing wildcards.</li> + +<li><a href="../../Lucy/Docs/Cookbook/CustomQueryParser.html" class="podlinkpod" +>CustomQueryParser</a> - Define your own custom search query syntax using <a href="../../Lucy/Search/QueryParser.html" class="podlinkpod" +>QueryParser</a> and Parse::RecDescent.</li> +</ul> + +<h3><a class='u' +name="Materials" +>Materials</a></h3> + +<p>Some of the recipes in the Cookbook reference the completed <a href="../../Lucy/Docs/Tutorial.html" class="podlinkpod" +>Tutorial</a> application. +These materials can be found in the <code>sample</code> directory at the root of the Lucy distribution:</p> + +<pre>sample/indexer.pl # indexing app +sample/search.cgi # search app +sample/us_constitution # corpus</pre> + +</div> Added: lucy/site/trunk/content/docs/perl/Lucy/Docs/Cookbook/CustomQuery.mdtext URL: http://svn.apache.org/viewvc/lucy/site/trunk/content/docs/perl/Lucy/Docs/Cookbook/CustomQuery.mdtext?rev=1737642&view=auto ============================================================================== --- lucy/site/trunk/content/docs/perl/Lucy/Docs/Cookbook/CustomQuery.mdtext (added) +++ lucy/site/trunk/content/docs/perl/Lucy/Docs/Cookbook/CustomQuery.mdtext Mon Apr 4 09:22:30 2016 @@ -0,0 +1,321 @@ +Title: Lucy::Docs::Cookbook::CustomQuery â Apache Lucy Documentation + +<div> +<a name='___top' class='dummyTopAnchor' ></a> + +<h2><a class='u' +name="NAME" +>NAME</a></h2> + +<p>Lucy::Docs::Cookbook::CustomQuery - Sample subclass of Query</p> + +<h2><a class='u' +name="DESCRIPTION" +>DESCRIPTION</a></h2> + +<p>Explore Apache Lucy’s support for custom query types by creating a “PrefixQuery” class to handle trailing wildcards.</p> + +<pre>my $prefix_query = PrefixQuery->new( + field => 'content', + query_string => 'foo*', +); +my $hits = $searcher->hits( query => $prefix_query ); +...</pre> + +<h3><a class='u' +name="Query,_Compiler,_and_Matcher" +>Query, +Compiler, +and Matcher</a></h3> + +<p>To add support for a new query type, +we need three classes: a Query, +a Compiler, +and a Matcher.</p> + +<ul> +<li>PrefixQuery - a subclass of <a href="../../../Lucy/Search/Query.html" class="podlinkpod" +>Query</a>, +and the only class that client code will deal with directly.</li> + +<li>PrefixCompiler - a subclass of <a href="../../../Lucy/Search/Compiler.html" class="podlinkpod" +>Compiler</a>, +whose primary role is to compile a PrefixQuery to a PrefixMatcher.</li> + +<li>PrefixMatcher - a subclass of <a href="../../../Lucy/Search/Matcher.html" class="podlinkpod" +>Matcher</a>, +which does the heavy lifting: it applies the query to individual documents and assigns a score to each match.</li> +</ul> + +<p>The PrefixQuery class on its own isn’t enough because a Query object’s role is limited to expressing an abstract specification for the search. 
+A Query is basically nothing but metadata; execution is left to the Query’s companion Compiler and Matcher.</p> + +<p>Here’s a simplified sketch illustrating how a Searcher’s hits() method ties together the three classes.</p> + +<pre>sub hits { + my ( $self, $query ) = @_; + my $compiler = $query->make_compiler( + searcher => $self, + boost => $query->get_boost, + ); + my $matcher = $compiler->make_matcher( + reader => $self->get_reader, + need_score => 1, + ); + my @hits = $matcher->capture_hits; + return \@hits; +}</pre> + +<h4><a class='u' +name="PrefixQuery" +>PrefixQuery</a></h4> + +<p>Our PrefixQuery class will have two attributes: a query string and a field name.</p> + +<pre>package PrefixQuery; +use base qw( Lucy::Search::Query ); +use Carp; +use Scalar::Util qw( blessed ); + +# Inside-out member vars and hand-rolled accessors. +my %query_string; +my %field; +sub get_query_string { my $self = shift; return $query_string{$$self} } +sub get_field { my $self = shift; return $field{$$self} }</pre> + +<p>PrefixQuery’s constructor collects and validates the attributes.</p> + +<pre>sub new { + my ( $class, %args ) = @_; + my $query_string = delete $args{query_string}; + my $field = delete $args{field}; + my $self = $class->SUPER::new(%args); + confess("'query_string' param is required") + unless defined $query_string; + confess("Invalid query_string: '$query_string'") + unless $query_string =~ /\*\s*$/; + confess("'field' param is required") + unless defined $field; + $query_string{$$self} = $query_string; + $field{$$self} = $field; + return $self; +}</pre> + +<p>Since this is an inside-out class, +we’ll need a destructor:</p> + +<pre>sub DESTROY { + my $self = shift; + delete $query_string{$$self}; + delete $field{$$self}; + $self->SUPER::DESTROY; +}</pre> + +<p>The equals() method determines whether two Queries are logically equivalent:</p> + +<pre>sub equals { + my ( $self, $other ) = @_; + return 0 unless blessed($other); + return 0 unless $other->isa("PrefixQuery"); + return 0 unless $field{$$self} eq $field{$$other}; + return 0 unless $query_string{$$self} eq $query_string{$$other}; + return 1; +}</pre> + +<p>The last thing we’ll need is a make_compiler() factory method which kicks out a subclass of <a href="../../../Lucy/Search/Compiler.html" class="podlinkpod" +>Compiler</a>.</p> + +<pre>sub make_compiler { + my ( $self, %args ) = @_; + my $subordinate = delete $args{subordinate}; + my $compiler = PrefixCompiler->new( %args, parent => $self ); + $compiler->normalize unless $subordinate; + return $compiler; +}</pre> + +<h4><a class='u' +name="PrefixCompiler" +>PrefixCompiler</a></h4> + +<p>PrefixQuery’s make_compiler() method will be called internally at search-time by objects which subclass <a href="../../../Lucy/Search/Searcher.html" class="podlinkpod" +>Searcher</a> – such as <a href="../../../Lucy/Search/IndexSearcher.html" class="podlinkpod" +>IndexSearchers</a>.</p> + +<p>A Searcher is associated with a particular collection of documents. 
+These documents may all reside in one index, +as with IndexSearcher, +or they may be spread out across multiple indexes on one or more machines, +as with LucyX::Remote::ClusterSearcher.</p> + +<p>Searcher objects have access to certain statistical information about the collections they represent; for instance, +a Searcher can tell you how many documents are in the collection…</p> + +<pre>my $maximum_number_of_docs_in_collection = $searcher->doc_max;</pre> + +<p>… or how many documents a specific term appears in:</p> + +<pre>my $term_appears_in_this_many_docs = $searcher->doc_freq( + field => 'content', + term => 'foo', +);</pre> + +<p>Such information can be used by sophisticated Compiler implementations to assign more or less heft to individual queries or sub-queries. +However, +we’re not going to bother with weighting for this demo; we’ll just assign a fixed score of 1.0 to each matching document.</p> + +<p>We don’t need to write a constructor, +as it will suffice to inherit new() from Lucy::Search::Compiler. +The only method we need to implement for PrefixCompiler is make_matcher().</p> + +<pre>package PrefixCompiler; +use base qw( Lucy::Search::Compiler ); + +sub make_matcher { + my ( $self, %args ) = @_; + my $seg_reader = $args{reader}; + + # Retrieve low-level components LexiconReader and PostingListReader. + my $lex_reader + = $seg_reader->obtain("Lucy::Index::LexiconReader"); + my $plist_reader + = $seg_reader->obtain("Lucy::Index::PostingListReader"); + + # Acquire a Lexicon and seek it to our query string. + my $substring = $self->get_parent->get_query_string; + $substring =~ s/\*.\s*$//; + my $field = $self->get_parent->get_field; + my $lexicon = $lex_reader->lexicon( field => $field ); + return unless $lexicon; + $lexicon->seek($substring); + + # Accumulate PostingLists for each matching term. + my @posting_lists; + while ( defined( my $term = $lexicon->get_term ) ) { + last unless $term =~ /^\Q$substring/; + my $posting_list = $plist_reader->posting_list( + field => $field, + term => $term, + ); + if ($posting_list) { + push @posting_lists, $posting_list; + } + last unless $lexicon->next; + } + return unless @posting_lists; + + return PrefixMatcher->new( posting_lists => \@posting_lists ); +}</pre> + +<p>PrefixCompiler gets access to a <a href="../../../Lucy/Index/SegReader.html" class="podlinkpod" +>SegReader</a> object when make_matcher() gets called. +From the SegReader and its sub-components <a href="../../../Lucy/Index/LexiconReader.html" class="podlinkpod" +>LexiconReader</a> and <a href="../../../Lucy/Index/PostingListReader.html" class="podlinkpod" +>PostingListReader</a>, +we acquire a <a href="../../../Lucy/Index/Lexicon.html" class="podlinkpod" +>Lexicon</a>, +scan through the Lexicon’s unique terms, +and acquire a <a href="../../../Lucy/Index/PostingList.html" class="podlinkpod" +>PostingList</a> for each term that matches our prefix.</p> + +<p>Each of these PostingList objects represents a set of documents which match the query.</p> + +<h4><a class='u' +name="PrefixMatcher" +>PrefixMatcher</a></h4> + +<p>The Matcher subclass is the most involved.</p> + +<pre>package PrefixMatcher; +use base qw( Lucy::Search::Matcher ); + +# Inside-out member vars. +my %doc_ids; +my %tick; + +sub new { + my ( $class, %args ) = @_; + my $posting_lists = delete $args{posting_lists}; + my $self = $class->SUPER::new(%args); + + # Cheesy but simple way of interleaving PostingList doc sets. 
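+    # Collecting the IDs as hash keys de-duplicates documents that appear
+    # in more than one PostingList; sorting the keys numerically below
+    # yields the ascending doc-id order that next() walks through.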
+ my %all_doc_ids; + for my $posting_list (@$posting_lists) { + while ( my $doc_id = $posting_list->next ) { + $all_doc_ids{$doc_id} = undef; + } + } + my @doc_ids = sort { $a <=> $b } keys %all_doc_ids; + $doc_ids{$$self} = \@doc_ids; + + # Track our position within the array of doc ids. + $tick{$$self} = -1; + + return $self; +} + +sub DESTROY { + my $self = shift; + delete $doc_ids{$$self}; + delete $tick{$$self}; + $self->SUPER::DESTROY; +}</pre> + +<p>The doc ids must be in order, +or some will be ignored; hence the <code>sort</code> above.</p> + +<p>In addition to the constructor and destructor, +there are three methods that must be overridden.</p> + +<p>next() advances the Matcher to the next valid matching doc.</p> + +<pre>sub next { + my $self = shift; + my $doc_ids = $doc_ids{$$self}; + my $tick = ++$tick{$$self}; + return 0 if $tick >= scalar @$doc_ids; + return $doc_ids->[$tick]; +}</pre> + +<p>get_doc_id() returns the current document id, +or 0 if the Matcher is exhausted. +(<a href="../../../Lucy/Docs/DocIDs.html" class="podlinkpod" +>Document numbers</a> start at 1, +so 0 is a sentinel.)</p> + +<pre>sub get_doc_id { + my $self = shift; + my $tick = $tick{$$self}; + my $doc_ids = $doc_ids{$$self}; + return $tick < scalar @$doc_ids ? $doc_ids->[$tick] : 0; +}</pre> + +<p>score() conveys the relevance score of the current match. +We’ll just return a fixed score of 1.0:</p> + +<pre>sub score { 1.0 }</pre> + +<h3><a class='u' +name="Usage" +>Usage</a></h3> + +<p>To get a basic feel for PrefixQuery, +insert the FlatQueryParser module described in <a href="../../../Lucy/Docs/Cookbook/CustomQueryParser.html" class="podlinkpod" +>CustomQueryParser</a> (which supports PrefixQuery) into the search.cgi sample app.</p> + +<pre>my $parser = FlatQueryParser->new( schema => $searcher->get_schema ); +my $query = $parser->parse($q);</pre> + +<p>If you’re planning on using PrefixQuery in earnest, +though, +you may want to change up analyzers to avoid stemming, +because stemming – another approach to prefix conflation – is not perfectly compatible with prefix searches.</p> + +<pre># Polyanalyzer with no SnowballStemmer. +my $analyzer = Lucy::Analysis::PolyAnalyzer->new( + analyzers => [ + Lucy::Analysis::StandardTokenizer->new, + Lucy::Analysis::Normalizer->new, + ], +);</pre> + +</div> Added: lucy/site/trunk/content/docs/perl/Lucy/Docs/Cookbook/CustomQueryParser.mdtext URL: http://svn.apache.org/viewvc/lucy/site/trunk/content/docs/perl/Lucy/Docs/Cookbook/CustomQueryParser.mdtext?rev=1737642&view=auto ============================================================================== --- lucy/site/trunk/content/docs/perl/Lucy/Docs/Cookbook/CustomQueryParser.mdtext (added) +++ lucy/site/trunk/content/docs/perl/Lucy/Docs/Cookbook/CustomQueryParser.mdtext Mon Apr 4 09:22:30 2016 @@ -0,0 +1,239 @@ +Title: Lucy::Docs::Cookbook::CustomQueryParser â Apache Lucy Documentation + +<div> +<a name='___top' class='dummyTopAnchor' ></a> + +<h2><a class='u' +name="NAME" +>NAME</a></h2> + +<p>Lucy::Docs::Cookbook::CustomQueryParser - Sample subclass of QueryParser.</p> + +<h2><a class='u' +name="DESCRIPTION" +>DESCRIPTION</a></h2> + +<p>Implement a custom search query language using a subclass of <a href="../../../Lucy/Search/QueryParser.html" class="podlinkpod" +>QueryParser</a>.</p> + +<h3><a class='u' +name="The_language" +>The language</a></h3> + +<p>At first, +our query language will support only simple term queries and phrases delimited by double quotes. 
+For simplicity’s sake,
+it will not support parenthetical groupings,
+boolean operators,
+or prepended plus/minus.
+The results for all subqueries will be unioned together – i.e.
+joined using an OR – which is usually the best approach for small-to-medium-sized document collections.</p>
+
+<p>Later,
+we’ll add support for trailing wildcards.</p>
+
+<h3><a class='u'
+name="Single-field_parser"
+>Single-field parser</a></h3>
+
+<p>Our initial parser implementation will generate queries against a single fixed field,
+“content”,
+and it will analyze text using a fixed choice of English EasyAnalyzer.
+We won’t subclass Lucy::Search::QueryParser just yet.</p>
+
+<pre>package FlatQueryParser;
+use Lucy::Search::TermQuery;
+use Lucy::Search::PhraseQuery;
+use Lucy::Search::ORQuery;
+use Carp;
+
+sub new {
+    my $analyzer = Lucy::Analysis::EasyAnalyzer->new(
+        language => 'en',
+    );
+    return bless {
+        field    => 'content',
+        analyzer => $analyzer,
+    }, __PACKAGE__;
+}</pre>
+
+<p>Some private helper subs for creating TermQuery and PhraseQuery objects will help keep the size of our main parse() subroutine down:</p>
+
+<pre>sub _make_term_query {
+    my ( $self, $term ) = @_;
+    return Lucy::Search::TermQuery->new(
+        field => $self->{field},
+        term  => $term,
+    );
+}
+
+sub _make_phrase_query {
+    my ( $self, $terms ) = @_;
+    return Lucy::Search::PhraseQuery->new(
+        field => $self->{field},
+        terms => $terms,
+    );
+}</pre>
+
+<p>Our private _tokenize() method treats double-quote delimited material as a single token and splits on whitespace everywhere else.</p>
+
+<pre>sub _tokenize {
+    my ( $self, $query_string ) = @_;
+    my @tokens;
+    while ( length $query_string ) {
+        if ( $query_string =~ s/^\s+// ) {
+            next;    # skip whitespace
+        }
+        elsif ( $query_string =~ s/^("[^"]*(?:"|$))// ) {
+            push @tokens, $1;    # double-quoted phrase
+        }
+        else {
+            $query_string =~ s/(\S+)//;
+            push @tokens, $1;    # single word
+        }
+    }
+    return \@tokens;
+}</pre>
+
+<p>The main parsing routine creates an array of tokens by calling _tokenize(),
+runs the tokens through the EasyAnalyzer,
+creates TermQuery or PhraseQuery objects according to how many tokens emerge from the EasyAnalyzer’s split() method,
+and adds each of the sub-queries to the primary ORQuery.</p>
+
+<pre>sub parse {
+    my ( $self, $query_string ) = @_;
+    my $tokens   = $self->_tokenize($query_string);
+    my $analyzer = $self->{analyzer};
+    my $or_query = Lucy::Search::ORQuery->new;
+
+    for my $token (@$tokens) {
+        if ( $token =~ s/^"// ) {
+            $token =~ s/"$//;
+            my $terms = $analyzer->split($token);
+            my $query = $self->_make_phrase_query($terms);
+            $or_query->add_child($query);
+        }
+        else {
+            my $terms = $analyzer->split($token);
+            if ( @$terms == 1 ) {
+                my $query = $self->_make_term_query( $terms->[0] );
+                $or_query->add_child($query);
+            }
+            elsif ( @$terms > 1 ) {
+                my $query = $self->_make_phrase_query($terms);
+                $or_query->add_child($query);
+            }
+        }
+    }
+
+    return $or_query;
+}</pre>
+
+<h3><a class='u'
+name="Multi-field_parser"
+>Multi-field parser</a></h3>
+
+<p>Most often,
+the end user will want their search query to match not only a single ‘content’ field,
+but also ‘title’ and so on.
+
+<h3><a class='u'
+name="Multi-field_parser"
+>Multi-field parser</a></h3>
+
+<p>Most often,
+the end user will want their search query to match not only a single ‘content’ field,
+but also ‘title’ and so on.
+To make that happen,
+we have to turn queries such as this…</p>
+
+<pre>foo AND NOT bar</pre>
+
+<p>… into the logical equivalent of this:</p>
+
+<pre>(title:foo OR content:foo) AND NOT (title:bar OR content:bar)</pre>
+
+<p>Rather than continue with our own from-scratch parser class and write the routines to accomplish that expansion,
+we’re now going to subclass Lucy::Search::QueryParser and take advantage of some of its existing methods.</p>
+
+<p>Our first parser implementation had the “content” field name and the choice of English EasyAnalyzer hard-coded for simplicity,
+but we don’t need to do that once we subclass Lucy::Search::QueryParser.
+QueryParser’s constructor – which we will inherit,
+allowing us to eliminate our own constructor – requires a Schema which conveys field and Analyzer information,
+so we can just defer to that.</p>
+
+<pre>package FlatQueryParser;
+use base qw( Lucy::Search::QueryParser );
+use Lucy::Search::TermQuery;
+use Lucy::Search::PhraseQuery;
+use Lucy::Search::ORQuery;
+use PrefixQuery;
+use Carp;
+
+# Inherit new()</pre>
+
+<p>We’re also going to jettison our _make_term_query() and _make_phrase_query() helper subs and chop our parse() subroutine way down.
+Our revised parse() routine will generate Lucy::Search::LeafQuery objects instead of TermQueries and PhraseQueries:</p>
+
+<pre>sub parse {
+    my ( $self, $query_string ) = @_;
+    my $tokens   = $self->_tokenize($query_string);
+    my $or_query = Lucy::Search::ORQuery->new;
+    for my $token (@$tokens) {
+        my $leaf_query = Lucy::Search::LeafQuery->new( text => $token );
+        $or_query->add_child($leaf_query);
+    }
+    return $self->expand($or_query);
+}</pre>
+
+<p>The magic happens in QueryParser’s expand() method,
+which walks the ORQuery object we supply to it looking for LeafQuery objects,
+and calls expand_leaf() for each one it finds.
+expand_leaf() performs field-specific analysis,
+decides whether each query should be a TermQuery or a PhraseQuery,
+and if multiple fields are required,
+creates an ORQuery which expands e.g.
+<code>foo</code> into <code>(title:foo OR content:foo)</code>.</p>
+
+<h3><a class='u'
+name="Extending_the_query_language"
+>Extending the query language</a></h3>
+
+<p>To add support for trailing wildcards to our query language,
+we need to override expand_leaf() to accommodate PrefixQuery,
+while deferring to the parent class implementation on TermQuery and PhraseQuery.</p>
+
+<pre>sub expand_leaf {
+    my ( $self, $leaf_query ) = @_;
+    my $text = $leaf_query->get_text;
+    if ( $text =~ /\*$/ ) {
+        my $or_query = Lucy::Search::ORQuery->new;
+        for my $field ( @{ $self->get_fields } ) {
+            my $prefix_query = PrefixQuery->new(
+                field        => $field,
+                query_string => $text,
+            );
+            $or_query->add_child($prefix_query);
+        }
+        return $or_query;
+    }
+    else {
+        return $self->SUPER::expand_leaf($leaf_query);
+    }
+}</pre>
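+
+<p>A quick way to sanity-check the override (a sketch, assuming a Schema with
+‘title’ and ‘content’ fields) is to parse a wildcard query and dump the result:</p>
+
+<pre>my $parser = FlatQueryParser->new( schema => $searcher->get_schema );
+my $query  = $parser->parse('fo*');
+print $query->to_string, "\n";    # inspect how the wildcard expanded across fields</pre>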
+
+<p>Ordinarily,
+those asterisks would have been stripped when running tokens through the EasyAnalyzer – query strings containing “foo*” would produce TermQueries for the term “foo”.
+Our override intercepts tokens with trailing asterisks and processes them as PrefixQueries before <code>SUPER::expand_leaf</code> can discard them,
+so that a search for “foo*” can match “food”,
+“foosball”,
+and so on.</p>
+
+<h3><a class='u'
+name="Usage"
+>Usage</a></h3>
+
+<p>Insert our custom parser into the search.cgi sample app to get a feel for how it behaves:</p>
+
+<pre>my $parser = FlatQueryParser->new( schema => $searcher->get_schema );
+my $query  = $parser->parse( decode( 'UTF-8', $cgi->param('q') || '' ) );
+my $hits   = $searcher->hits(
+    query      => $query,
+    offset     => $offset,
+    num_wanted => $page_size,
+);
+...</pre>
+
+</div>

Added: lucy/site/trunk/content/docs/perl/Lucy/Docs/Cookbook/FastUpdates.mdtext
URL: http://svn.apache.org/viewvc/lucy/site/trunk/content/docs/perl/Lucy/Docs/Cookbook/FastUpdates.mdtext?rev=1737642&view=auto
==============================================================================
--- lucy/site/trunk/content/docs/perl/Lucy/Docs/Cookbook/FastUpdates.mdtext (added)
+++ lucy/site/trunk/content/docs/perl/Lucy/Docs/Cookbook/FastUpdates.mdtext Mon Apr 4 09:22:30 2016
@@ -0,0 +1,170 @@
+Title: Lucy::Docs::Cookbook::FastUpdates – Apache Lucy Documentation
+
+<div>
+<a name='___top' class='dummyTopAnchor' ></a>
+
+<h2><a class='u'
+name="NAME"
+>NAME</a></h2>
+
+<p>Lucy::Docs::Cookbook::FastUpdates - Near real-time index updates</p>
+
+<h2><a class='u'
+name="DESCRIPTION"
+>DESCRIPTION</a></h2>
+
+<p>While index updates are fast on average,
+worst-case update performance may be significantly slower.
+To make index updates consistently quick,
+we must manually intervene to control the process of index segment consolidation.</p>
+
+<h3><a class='u'
+name="The_problem"
+>The problem</a></h3>
+
+<p>Ordinarily,
+modifying an index is cheap.
+New data is added to new segments,
+and the time to write a new segment scales more or less linearly with the number of documents added during the indexing session.</p>
+
+<p>Deletions are also cheap most of the time,
+because we don’t remove documents immediately but instead mark them as deleted,
+and adding the deletion mark is cheap.</p>
+
+<p>However,
+as new segments are added and the deletion rate for existing segments increases,
+search-time performance slowly begins to degrade.
+At some point,
+it becomes necessary to consolidate existing segments,
+rewriting their data into a new segment.</p>
+
+<p>If the recycled segments are small,
+the time it takes to rewrite them may not be significant.
+Every once in a while,
+though,
+a large amount of data must be rewritten.</p>
+
+<h3><a class='u'
+name="Procrastinating_and_playing_catch-up"
+>Procrastinating and playing catch-up</a></h3>
+
+<p>The simplest way to force fast index updates is to avoid rewriting anything.</p>
+
+<p>Indexer relies upon <a href="../../../Lucy/Index/IndexManager.html" class="podlinkpod"
+>IndexManager</a>’s <a href="../../../Lucy/Index/IndexManager.html#recycle" class="podlinkpod"
+>recycle()</a> method to tell it which segments should be consolidated.
+If we subclass IndexManager and override the method so that it always returns an empty array,
+we get consistently quick performance:</p>
+
+<pre>package NoMergeManager;
+use base qw( Lucy::Index::IndexManager );
+sub recycle { [] }
+
+package main;
+my $indexer = Lucy::Index::Indexer->new(
+    index   => '/path/to/index',
+    manager => NoMergeManager->new,
+);
+...
+$indexer->commit;</pre>
+
+<p>However,
+we can’t procrastinate forever.
+Eventually,
+we’ll have to run an ordinary,
+uncontrolled indexing session,
+potentially triggering a large rewrite of lots of small and/or degraded segments:</p>
+
+<pre>my $indexer = Lucy::Index::Indexer->new(
+    index => '/path/to/index',
+    # manager => NoMergeManager->new,
+);
+...
+$indexer->commit;</pre>
+
+<h3><a class='u'
+name="Acceptable_worst-case_update_time,_slower_degradation"
+>Acceptable worst-case update time,
+slower degradation</a></h3>
+
+<p>Never merging anything at all in the main indexing process is probably overkill.
+Small segments are relatively cheap to merge; we just need to guard against the big rewrites.</p>
+
+<p>Setting a ceiling on the number of documents in the segments to be recycled allows us to avoid a mass proliferation of tiny,
+single-document segments,
+while still offering decent worst-case update speed:</p>
+
+<pre>package LightMergeManager;
+use base qw( Lucy::Index::IndexManager );
+
+sub recycle {
+    my $self = shift;
+    my $seg_readers = $self->SUPER::recycle(@_);
+    @$seg_readers = grep { $_->doc_max < 10 } @$seg_readers;
+    return $seg_readers;
+}</pre>
+
+<p>However,
+we still have to consolidate every once in a while,
+and while that happens content updates will be locked out.</p>
+
+<h3><a class='u'
+name="Background_merging"
+>Background merging</a></h3>
+
+<p>If it’s not acceptable to lock out updates while the index consolidation process runs,
+the alternative is to move the consolidation process out of band,
+using <a href="../../../Lucy/Index/BackgroundMerger.html" class="podlinkpod"
+>BackgroundMerger</a>.</p>
+
+<p>It’s never safe to have more than one Indexer attempting to modify the content of an index at the same time,
+but a BackgroundMerger and an Indexer can operate simultaneously:</p>
+
+<pre># Indexing process.
+use Scalar::Util qw( blessed );
+my $retries = 0;
+while (1) {
+    eval {
+        my $indexer = Lucy::Index::Indexer->new(
+            index   => '/path/to/index',
+            manager => LightMergeManager->new,
+        );
+        $indexer->add_doc($doc);
+        $indexer->commit;
+    };
+    last unless $@;
+    if ( blessed($@) and $@->isa("Lucy::Store::LockErr") ) {
+        # Catch LockErr.
+        warn "Couldn't get lock ($retries retries)";
+        $retries++;
+    }
+    else {
+        die "Write failed: $@";
+    }
+}
+
+# Background merge process.
+my $manager = Lucy::Index::IndexManager->new;
+$manager->set_write_lock_timeout(60_000);
+my $bg_merger = Lucy::Index::BackgroundMerger->new(
+    index   => '/path/to/index',
+    manager => $manager,
+);
+$bg_merger->commit;</pre>
+
+<p>The exception handling code becomes useful once you have more than one index modification process happening simultaneously.
+By default,
+Indexer tries several times to acquire a write lock over the span of one second,
+then holds it until <a href="../../../Lucy/Index/Indexer.html#commit" class="podlinkpod"
+>commit()</a> completes.
+BackgroundMerger handles most of its work without the write lock,
+but it does need it briefly once at the beginning and once again near the end.
+Under normal loads,
+the internal retry logic will resolve conflicts,
+but if it’s not acceptable to miss an insert,
+you probably want to catch <a href="../../../Lucy/Store/LockErr.html" class="podlinkpod"
+>LockErr</a> exceptions thrown by Indexer.
+In contrast,
+a LockErr from BackgroundMerger probably just needs to be logged.</p>
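+
+<p>For example (a sketch only),
+the background merge process could log a LockErr and treat anything else as fatal:</p>
+
+<pre>use Scalar::Util qw( blessed );
+eval {
+    my $bg_merger = Lucy::Index::BackgroundMerger->new(
+        index   => '/path/to/index',
+        manager => $manager,
+    );
+    $bg_merger->commit;
+};
+if ( blessed($@) and $@->isa("Lucy::Store::LockErr") ) {
+    warn "Skipping background merge: couldn't get lock";    # log and retry later
+}
+elsif ($@) {
+    die "Background merge failed: $@";
+}</pre>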
+
+</div>

Added: lucy/site/trunk/content/docs/perl/Lucy/Docs/DevGuide.mdtext
URL: http://svn.apache.org/viewvc/lucy/site/trunk/content/docs/perl/Lucy/Docs/DevGuide.mdtext?rev=1737642&view=auto
==============================================================================
--- lucy/site/trunk/content/docs/perl/Lucy/Docs/DevGuide.mdtext (added)
+++ lucy/site/trunk/content/docs/perl/Lucy/Docs/DevGuide.mdtext Mon Apr 4 09:22:30 2016
@@ -0,0 +1,54 @@
+Title: Lucy::Docs::DevGuide – Apache Lucy Documentation
+
+<div>
+<a name='___top' class='dummyTopAnchor' ></a>
+
+<h2><a class='u'
+name="NAME"
+>NAME</a></h2>
+
+<p>Lucy::Docs::DevGuide - Quick-start guide to hacking on Apache Lucy.</p>
+
+<h2><a class='u'
+name="DESCRIPTION"
+>DESCRIPTION</a></h2>
+
+<p>The Apache Lucy code base is organized into roughly four layers:</p>
+
+<ul>
+<li>Charmonizer - compiler and OS configuration probing.</li>
+
+<li>Clownfish - header files.</li>
+
+<li>C - implementation files.</li>
+
+<li>Host - binding language.</li>
+</ul>
+
+<p>Charmonizer is a configuration prober which writes a single header file,
+“charmony.h”,
+describing the build environment and facilitating cross-platform development.
+It’s similar to Autoconf or Metaconfig,
+but written in pure C.</p>
+
+<p>The “.cfh” files within the Lucy core are Clownfish header files.
+Clownfish is a purpose-built,
+declaration-only language which superimposes a single-inheritance object model on top of C.
+It is specifically designed to co-exist happily with a variety of “host” languages and to allow limited run-time dynamic subclassing.
+For more information see the Clownfish docs,
+but if there’s one thing you should know about Clownfish OO before you start hacking,
+it’s that method calls are differentiated from functions by capitalization:</p>
+
+<pre>Indexer_Add_Doc <-- Method, typically uses dynamic dispatch.
+Indexer_add_doc <-- Function, always a direct invocation.</pre>
+
+<p>The C files within the Lucy core are where most of Lucy’s low-level functionality lies.
+They implement the interface defined by the Clownfish header files.</p>
+
+<p>The C core is intentionally left incomplete,
+however; to be usable,
+it must be bound to a “host” language.
+(In this context,
+even C is considered a “host” which must implement the missing pieces and be “bound” to the core.)
+Some of the binding code is autogenerated by Clownfish on a spec customized for each language.
+Other pieces are hand-coded in either C (using the host’s C API) or the host language itself.</p>
+
+</div>

Added: lucy/site/trunk/content/docs/perl/Lucy/Docs/DocIDs.mdtext
URL: http://svn.apache.org/viewvc/lucy/site/trunk/content/docs/perl/Lucy/Docs/DocIDs.mdtext?rev=1737642&view=auto
==============================================================================
--- lucy/site/trunk/content/docs/perl/Lucy/Docs/DocIDs.mdtext (added)
+++ lucy/site/trunk/content/docs/perl/Lucy/Docs/DocIDs.mdtext Mon Apr 4 09:22:30 2016
@@ -0,0 +1,47 @@
+Title: Lucy::Docs::DocIDs – Apache Lucy Documentation
+
+<div>
+<a name='___top' class='dummyTopAnchor' ></a>
+
+<h2><a class='u'
+name="NAME"
+>NAME</a></h2>
+
+<p>Lucy::Docs::DocIDs - Characteristics of Apache Lucy document ids.</p>
+
+<h2><a class='u'
+name="DESCRIPTION"
+>DESCRIPTION</a></h2>
+
+<h3><a class='u'
+name="Document_ids_are_signed_32-bit_integers"
+>Document ids are signed 32-bit integers</a></h3>
+
+<p>Document ids in Apache Lucy start at 1.
+Because 0 is never a valid doc id,
+we can use it as a sentinel value:</p>
+
+<pre>while ( my $doc_id = $posting_list->next ) {
+    ...
+}</pre>
+
+<h3><a class='u'
+name="Document_ids_are_ephemeral"
+>Document ids are ephemeral</a></h3>
+
+<p>The document ids used by Lucy are associated with a single index snapshot.
+The moment an index is updated,
+the mapping of document ids to documents is subject to change.</p>
+
+<p>Since IndexReader objects represent a point-in-time view of an index,
+document ids are guaranteed to remain static for the life of the reader.
+However,
+because they are not permanent,
+Lucy document ids cannot be used as foreign keys to locate records in external data sources.
+If you truly need a primary key field,
+you must define it and populate it yourself.</p>
+
+<p>Furthermore,
+the order of document ids does not tell you anything about the sequence in which documents were added to the index.</p>
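+
+<p>As noted above, if you need a permanent key you have to store one yourself.
+A minimal sketch (field, table, and variable names are only illustrative):</p>
+
+<pre># At index time, store your own key as an ordinary field.
+my $id_type = Lucy::Plan::StringType->new;
+$schema->spec_field( name => 'db_id', type => $id_type );
+...
+$indexer->add_doc({ db_id => $row->{id}, content => $row->{body} });
+
+# At search time, join against external data via that field,
+# never via the ephemeral Lucy doc id.
+while ( my $hit = $hits->next ) {
+    my $record = $dbh->selectrow_hashref(
+        'SELECT * FROM articles WHERE id = ?',
+        undef, $hit->{db_id},
+    );
+}</pre>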
+
+</div>

Added: lucy/site/trunk/content/docs/perl/Lucy/Docs/FileFormat.mdtext
URL: http://svn.apache.org/viewvc/lucy/site/trunk/content/docs/perl/Lucy/Docs/FileFormat.mdtext?rev=1737642&view=auto
==============================================================================
--- lucy/site/trunk/content/docs/perl/Lucy/Docs/FileFormat.mdtext (added)
+++ lucy/site/trunk/content/docs/perl/Lucy/Docs/FileFormat.mdtext Mon Apr 4 09:22:30 2016
@@ -0,0 +1,270 @@
+Title: Lucy::Docs::FileFormat – Apache Lucy Documentation
+
+<div>
+<a name='___top' class='dummyTopAnchor' ></a>
+
+<h2><a class='u'
+name="NAME"
+>NAME</a></h2>
+
+<p>Lucy::Docs::FileFormat - Overview of index file format</p>
+
+<h2><a class='u'
+name="DESCRIPTION"
+>DESCRIPTION</a></h2>
+
+<p>It is not necessary to understand the current implementation details of the index file format in order to use Apache Lucy effectively,
+but it may be helpful if you are interested in tweaking for high performance,
+exotic usage,
+or debugging and development.</p>
+
+<p>On a file system,
+an index is a directory.
+The files inside have a hierarchical relationship: an index is made up of “segments”,
+each of which is an independent inverted index with its own subdirectory; each segment is made up of several component parts.</p>
+
+<pre>[index]--|
+         |--snapshot_XXX.json
+         |--schema_XXX.json
+         |--write.lock
+         |
+         |--seg_1--|
+         |         |--segmeta.json
+         |         |--cfmeta.json
+         |         |--cf.dat-------|
+         |                         |--[lexicon]
+         |                         |--[postings]
+         |                         |--[documents]
+         |                         |--[highlight]
+         |                         |--[deletions]
+         |
+         |--seg_2--|
+         |         |--segmeta.json
+         |         |--cfmeta.json
+         |         |--cf.dat-------|
+         |                         |--[lexicon]
+         |                         |--[postings]
+         |                         |--[documents]
+         |                         |--[highlight]
+         |                         |--[deletions]
+         |
+         |--[...]--|</pre>
+
+<h3><a class='u'
+name="Write-once_philosophy"
+>Write-once philosophy</a></h3>
+
+<p>All segment directory names consist of the string “seg_” followed by a number in base 36: seg_1,
+seg_5m,
+seg_p9s2 and so on,
+with higher numbers indicating more recent segments.
+Once a segment is finished and committed,
+its name is never re-used and its files are never modified.</p>
+
+<p>Old segments become obsolete and can be removed when their data has been consolidated into new segments during the process of segment merging and optimization.
+A fully-optimized index has only one segment.</p>
+
+<h3><a class='u'
+name="Top-level_entries"
+>Top-level entries</a></h3>
+
+<p>There are a handful of “top-level” files and directories which belong to the entire index rather than to a particular segment.</p>
+
+<h4><a class='u'
+name="snapshot_XXX.json"
+>snapshot_XXX.json</a></h4>
+
+<p>A “snapshot” file,
+e.g.
+<code>snapshot_m7p.json</code>,
+is a list of index files and directories.
+Because index files,
+once written,
+are never modified,
+the list of entries in a snapshot defines a point-in-time view of the data in an index.</p>
+
+<p>Like segment directories,
+snapshot files also utilize the unique-base-36-number naming convention; the higher the number,
+the more recent the file.
+The appearance of a new snapshot file within the index directory constitutes an index update.
+While a new segment is being written, new files may be added to the index directory,
+but until a new snapshot file gets written,
+a Searcher opening the index for reading won’t know about them.</p>
+
+<h4><a class='u'
+name="schema_XXX.json"
+>schema_XXX.json</a></h4>
+
+<p>The schema file is a Schema object describing the index’s format,
+serialized as JSON.
+It,
+too,
+is versioned,
+and a given snapshot file will reference one and only one schema file.</p>
+
+<h4><a class='u'
+name="locks"
+>locks</a></h4>
+
+<p>By default,
+only one indexing process may safely modify the index at any given time.
+Processes reserve an index by laying claim to the <code>write.lock</code> file within the <code>locks/</code> directory.
+A smattering of other lock files may be used from time to time,
+as well.</p>
+
+<h3><a class='u'
+name="A_segment(8217)s_component_parts"
+>A segment’s component parts</a></h3>
+
+<p>By default,
+each segment has up to five logical components: lexicon,
+postings,
+document storage,
+highlight data,
+and deletions.
+Binary data from these components gets stored in virtual files within the “cf.dat” compound file; metadata is stored in a shared “segmeta.json” file.</p>
+
+<h4><a class='u'
+name="segmeta.json"
+>segmeta.json</a></h4>
+
+<p>The segmeta.json file is a central repository for segment metadata.
+In addition to information such as document counts and field numbers,
+it also warehouses arbitrary metadata on behalf of individual index components.</p>
+
+<h4><a class='u'
+name="Lexicon"
+>Lexicon</a></h4>
+
+<p>Each indexed field gets its own lexicon in each segment.
+The exact files involved depend on the field’s type,
+but generally speaking there will be two parts.
+First,
+there’s a primary <code>lexicon-XXX.dat</code> file which houses a complete term list associating terms with corpus frequency statistics,
+postings file locations,
+etc.
+Second,
+one or more “lexicon index” files may be present which contain periodic samples from the primary lexicon file to facilitate fast lookups.</p>
+
+<h4><a class='u'
+name="Postings"
+>Postings</a></h4>
+
+<p>“Posting” is a technical term from the field of <a href="../../Lucy/Docs/IRTheory.html" class="podlinkpod"
+>information retrieval</a>,
+defined as a single instance of one term indexing one document.
+If you are looking at the index in the back of a book,
+and you see that “freedom” is referenced on pages 8,
+86,
+and 240,
+that would be three postings,
+which taken together form a “posting list”.
+The same terminology applies to an index in electronic form.</p>
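+
+<p>In the Perl API, you can walk a field’s posting list yourself through the segment
+readers. A sketch (assuming an IndexSearcher named $searcher and an indexed field
+named ‘content’):</p>
+
+<pre>for my $seg_reader ( @{ $searcher->get_reader->seg_readers } ) {
+    my $plist_reader = $seg_reader->obtain('Lucy::Index::PostingListReader');
+    my $posting_list = $plist_reader->posting_list(
+        field => 'content',
+        term  => 'freedom',
+    );
+    next unless $posting_list;
+    while ( my $doc_id = $posting_list->next ) {
+        # $doc_id is local to this segment's point-in-time view.
+    }
+}</pre>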
+
+<p>Each segment has one postings file per indexed field.
+When a search is performed for a single term,
+first that term is looked up in the lexicon.
+If the term exists in the segment,
+the record in the lexicon will contain information about which postings file to look at and where to look.</p>
+
+<p>The first thing any posting record tells you is a document id.
+By iterating over all the postings associated with a term,
+you can find all the documents that match that term,
+a process which is analogous to looking up page numbers in a book’s index.
+However,
+each posting record typically contains other information in addition to document id,
+e.g.
+the positions at which the term occurs within the field.</p>
+
+<h4><a class='u'
+name="Documents"
+>Documents</a></h4>
+
+<p>The document storage section is a simple database,
+organized into two files:</p>
+
+<ul>
+<li><b>documents.dat</b> - Serialized documents.</li>
+
+<li><b>documents.ix</b> - Document storage index,
+a solid array of 64-bit integers where each integer location corresponds to a document id,
+and the value at that location points at a file position in the documents.dat file.</li>
+</ul>
+
+<h4><a class='u'
+name="Highlight_data"
+>Highlight data</a></h4>
+
+<p>The files which store data used for excerpting and highlighting are organized similarly to the files used to store documents.</p>
+
+<ul>
+<li><b>highlight.dat</b> - Chunks of serialized highlight data,
+one per doc id.</li>
+
+<li><b>highlight.ix</b> - Highlight data index – as with the <code>documents.ix</code> file,
+a solid array of 64-bit file pointers.</li>
+</ul>
+
+<h4><a class='u'
+name="Deletions"
+>Deletions</a></h4>
+
+<p>When a document is “deleted” from a segment,
+it is not actually purged right away; it is merely marked as “deleted” via a deletions file.
+Deletions files contain bit vectors with one bit for each document in the segment; if bit #254 is set then document 254 is deleted,
+and if that document turns up in a search it will be masked out.</p>
+
+<p>It is only when a segment’s contents are rewritten to a new segment during the segment-merging process that deleted documents truly go away.</p>
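+
+<p>Conceptually, the check is a single bit lookup per doc id. A rough sketch in Perl
+(illustration only; the real deletions files carry additional framing):</p>
+
+<pre># $bits holds the segment's raw deletions bit vector.
+my $is_deleted = vec( $bits, $doc_id, 1 );    # true if the doc is masked out</pre>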
+
+<h3><a class='u'
+name="Compound_Files"
+>Compound Files</a></h3>
+
+<p>If you peer inside an index directory,
+you won’t actually find any files named “documents.dat”,
+“highlight.ix”,
+etc.
+unless there is an indexing process underway.
+What you will find instead is one “cf.dat” and one “cfmeta.json” file per segment.</p>
+
+<p>To minimize the need for file descriptors at search-time,
+all per-segment binary data files are concatenated together in “cf.dat” at the close of each indexing session.
+Information about where each file begins and ends is stored in <code>cfmeta.json</code>.
+When the segment is opened for reading,
+a single file descriptor per “cf.dat” file can be shared among several readers.</p>
+
+<h3><a class='u'
+name="A_Typical_Search"
+>A Typical Search</a></h3>
+
+<p>Here’s a simplified narrative,
+dramatizing how a search for “freedom” against a given segment plays out:</p>
+
+<ul>
+<li>The searcher asks the relevant Lexicon Index,
+“Do you know anything about ‘freedom’?” Lexicon Index replies,
+“Can’t say for sure,
+but if the main Lexicon file does,
+‘freedom’ is probably somewhere around byte 21008”.</li>
+
+<li>The main Lexicon tells the searcher “One moment,
+let me scan our records… Yes,
+we have 2 documents which contain ‘freedom’.
+You’ll find them in seg_6/postings-4.dat starting at byte 66991.”</li>
+
+<li>The Postings file says “Yep,
+we have ‘freedom’,
+all right!
+Document id 40 has 1 ‘freedom’,
+and document 44 has 8.
+If you need to know more,
+like if any ‘freedom’ is part of the phrase ‘freedom of speech’,
+ask me about positions!”</li>
+
+<li>If the searcher is only looking for ‘freedom’ in isolation,
+that’s where it stops.
+It now knows enough to assign the documents scores against “freedom”,
+with the 8-freedom document likely ranking higher than the single-freedom document.</li>
+</ul>
+
+</div>