y has been a pleasure.
pek
--
Peter Karman . he/him/his . 785.337.0405 . https://karpet.github.io/
org/thread.html/7eff8d7202731622d7cfa6731fdf43d82f1c2d261729b07c76393328@%3Cdev.lucy.apache.org%3E
--
Peter Karman . he/him/his . 785.337.0405 . https://karpet.github.io/
resources will be moved to a
read-only state.
You can read more about the Apache Attic and the process of moving to the Attic
at http://attic.apache.org.
Thanks.
pek
[1]
https://lists.apache.org/thread.html/fd5532bbec1fef48894b9f88d6b1b38e68d3875f5dc5db3bd80ce21a@%3Cdev.lucy.apache.org%3E
" if it is an assumption about performance, you might not see much gain by
using RAMFolder.
--
Peter Karman . https://karpet.github.io . https://keybase.io/peterkarman
not being
(intentionally) ignored.
--
Peter Karman . https://karpet.github.io . https://keybase.io/peterkarman
* happier with storing the file extension as a separate field
and searching on that. Far far more efficient at search time than munging a regex.
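
As a rough illustration of the separate-field idea (a sketch only; the helper name `file_extension` and the field names `path` and `ext` are made up for this example and are not part of Lucy), the extension can be split off at index time and carried as its own field in the doc hash:

```perl
use strict;
use warnings;

# Hypothetical helper: return the lowercased extension of a path,
# or an empty string when the final path segment has none.
sub file_extension {
    my ($path) = @_;
    my ($ext) = $path =~ m{\.([^./\\]+)\z};
    return defined $ext ? lc $ext : '';
}

# Build the hash you would hand to the indexer; "ext" is its own field.
my $path = '/var/docs/Report.PDF';
my %doc  = (
    path => $path,
    ext  => file_extension($path),
);
print "$doc{path} -> $doc{ext}\n";
```

At search time the query can then match the `ext` field exactly instead of running a regex over every path. Note that this sketch treats a dotfile such as `.bashrc` as having the extension `bashrc`; adjust if that matters for your data.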
--
Peter Karman . https://karpet.github.io . https://keybase.io/peterkarman
> >
> > Then at search time, perform a search query with every keystroke.
> >
> > h -> (no result)
> > he -> (no result)
> > hel -> "hello world"
> >
> > Once you've got basic functionality running, experiment with minimum
> token
> > length, adding Soundex/Metaphone, performing character normalization,
> etc.
> >
> > Marvin Humphrey
> >
>
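
One way to read the quoted suggestion, sketched in plain Perl: at index time, emit every prefix of each word at or above a minimum length as its own token, so that a partial query such as `hel` already matches. The helper name `prefix_tokens` and the default minimum length of 3 are assumptions for illustration, not Lucy API.

```perl
use strict;
use warnings;

# Hypothetical helper: return all prefixes of each word in $text that are
# at least $min_len characters long (default 3), lowercased.
sub prefix_tokens {
    my ( $text, $min_len ) = @_;
    $min_len = 3 unless defined $min_len;
    my @tokens;
    for my $word ( split ' ', lc $text ) {
        for my $len ( $min_len .. length($word) ) {
            push @tokens, substr( $word, 0, $len );
        }
    }
    return @tokens;
}

print join( ' ', prefix_tokens('hello world') ), "\n";
```

With a minimum length of 3, `h` and `he` produce no tokens and so no hits, while `hel` does, which matches the keystroke table in the quoted message. Lowering the minimum gives earlier suggestions at the cost of a larger index.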
--
Peter Karman . https://peknet.com/ . https://keybase.io/peterkarman
u more concrete advice, we'd need to see your indexing code,
especially how you define your Schema.
--
Peter Karman . https://karpet.github.io . https://keybase.io/peterkarman
years.
--
Peter Karman . https://peknet.com/ . https://keybase.io/peterkarman
other query-parsing niceties, switch your code from the Lucy
QueryParser to
https://metacpan.org/pod/Search::Query::Dialect::Lucy
--
Peter Karman . https://peknet.com/ . https://keybase.io/peterkarman
are three examples, 2 of which are in active use in Dezi as part of
https://github.com/karpet/search-query-dialect-lucy-perl
--
Peter Karman . https://peknet.com/ . https://keybase.io/peterkarman
Nick Wellnhofer wrote on 2/23/17 6:49 AM:
On 23/02/2017 04:52, Peter Karman wrote:
package MyAnalyzer {
    use base qw( Lucy::Analysis::Analyzer );
    sub transform { $_[1] }
}
Every Analyzer needs an `equals` method. For simple Analyzers, it can simply
check whether the class of the other
---
Example above ^^ based on the gist below.
Hao Wu wrote on 2/20/17 11:40 PM:
Hi Peter,
Thanks for spending time on the script.
I cleaned it up a bit, so there are no dependencies now.
https://gist.github.com/swuecho/1b960ae17a1f47466be006fd14
ine 23
Segmentation fault: 11
I would expect the code to work as you wrote it, so maybe someone else can spot
what's going wrong.
Here's what the schema_1.json file looks like after the initial index creation:
{
  "_class": "Lucy::Plan::Schema",
  "analyzers": [
    null,
    {
      "_class": "ChineseAnalyzer"
    }
  ],
  "fields": {
    "body": {
      "analyzer": "1",
      "type": "fulltext"
    }
  }
}
--
Peter Karman . https://peknet.com/ . https://keybase.io/peterkarman
new;
my $raw_type = Lucy::Plan::FullTextType->new(
    analyzer => $tokenizer,
);
So I guess I must miss something in the custom Chinese Analyzer.
since you changed the field definition with a new analyzer, you must create a
new index. You cannot update an existing index with 2 dif
r the stemming
analyzer. Also, Chinese is not among the supported languages listed.
Maybe something wrapped around https://metacpan.org/pod/Lingua::CJK::Tokenizer
would work as a custom analyzer.
You can see an example in the documentation here
https://metacpan.org/pod/Lucy::Analysis::Analyzer#n
`is_phrase`.
HTH,
pek
--
Peter Karman . https://peknet.com/ . https://keybase.io/peterkarman
github.com/apache/lucy/blob/master/core/Lucy/Search/QueryParser.c#L862
--
Peter Karman . https://peknet.com/ . https://keybase.io/peterkarman
i-App-0.014/lib/Dezi/Lucy/Searcher.pm#L406
tl;dr is that Dezi writes its own index metadata header that includes a UUID and
timestamp for the last time the index was updated, and checks that UUID against
the current Searcher to know if it is stale and needs to be re-created.
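
The staleness check described above can be sketched in plain Perl. This is not Dezi's actual code: the file name `header.txt`, the tab-separated layout, and the helper names are all assumptions, and a plain string id stands in for a real UUID.

```perl
use strict;
use warnings;
use File::Temp qw( tempdir );

# Hypothetical header format: one line, "<id>\t<epoch>".
sub write_header {
    my ( $dir, $id ) = @_;
    open my $fh, '>', "$dir/header.txt" or die "can't write header: $!";
    print {$fh} join( "\t", $id, time() ), "\n";
    close $fh or die "can't close header: $!";
    return;
}

# Read back only the id portion of the header line.
sub read_header_id {
    my ($dir) = @_;
    open my $fh, '<', "$dir/header.txt" or die "can't read header: $!";
    my $line = <$fh>;
    close $fh;
    chomp $line;
    my ($id) = split /\t/, $line;
    return $id;
}

# A cached searcher is stale when the id it was opened with no longer matches.
sub is_stale {
    my ( $dir, $cached_id ) = @_;
    return read_header_id($dir) ne $cached_id ? 1 : 0;
}

my $dir = tempdir( CLEANUP => 1 );
write_header( $dir, 'id-1' );
my $cached = read_header_id($dir);    # remembered when the searcher is created
print "stale after open: ", is_stale( $dir, $cached ), "\n";
write_header( $dir, 'id-2' );         # the indexer updated the index
print "stale after update: ", is_stale( $dir, $cached ), "\n";
```

When `is_stale` returns true, the application would discard its cached searcher and open a new one against the updated index.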
--
Peter Karman .
stored, which seems to be confirmed in that URL you
reference.
As far as why it is not compressed, I'm not sure. I expect that decompression
incurs a performance hit.
--
Peter Karman . https://peknet.com/ . https://keybase.io/peterkarman
You can see one example of ProximityQuery usage here (Perl)
https://metacpan.org/source/KARMAN/Search-Query-Dialect-Lucy-0.202/lib/Search/Query/Dialect/Lucy.pm#L701
Of note:
* `within` is like NEAR - it takes an integer argument
* order of terms is respected. It's like a phrase
--
's a Lucy bug.
>
> If you can't provide a test case, it's a good idea to test whether the
> problems are caused by parallel indexing at all. I'd also try to move your
> indices to a local file system to see whether it makes a difference.
>
> > Creating Indexer manager adding overhead to the search process.
>
> You only have to use IndexManagers for searchers to avoid errors like
> "Stale NFS filehandle". If you have another way to handle such errors,
> there might be no need for IndexManagers at all. Again, see
> Lucy::Docs::FileLocking.
>
> Nick
>
>
--
Peter Karman . pe...@peknet.com . http://peknet.com/
ader file are checked on every search, and the searcher is destroyed and a new
one created if the old searcher is stale.
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
but not much churn (not
updating docs constantly). IME the bottleneck is not the search. It's a search
engine; it's pretty fast. The bottleneck is updating the index. That's true
whether you delete first or not.
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
e from how Dezi solves this more generally:
https://metacpan.org/source/KARMAN/Dezi-App-0.014/lib/Dezi/Lucy/Indexer.pm#L451
Lucy isn't an RDBMS. It just tokenizes the fields you shove into it, and
retrieves very quickly.
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
/Indexer.pm#L110
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
rser->parse($query);
my $lucy_query = $parsed_query->as_lucy_query();
my $hits = $searcher->hits( query => $lucy_query );
--------
Something similar is performed in Dezi::Lucy::Searcher:
https://metacpan.org/source/KARMAN/Dezi-App-0.013/lib/Dezi/Lucy/Searcher.pm#L124
See
https://metacpan.org/pod/Search::Query::Dialect::Lucy
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
idiom, as here:
https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Docs/Cookbook/FastUpdates.pod
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
. Not sure if it was the hidden
directory bug Marvin pointed out or not. In any case, it seemed to be related to
@INC because re-running the cpan install command worked, since Clownfish was
then in the expected @INC path.
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
words..
>
Yes, you could use Lucy. Create a virtual "document" for each line in the file,
and fields for the line number and line content. If the line contains "fields"
like a logfile might, you can create a separate field for each segment of the
line. I do that same kind of t
Shahab,
Lucy works on collections of documents, where each hit in a search result set
represents a single document.
That said, people use tools like Lucy to search logs, e.g., by creating a single
"document" for each line in the log file.
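
A minimal sketch of the one-document-per-line idea, in plain Perl so it runs without an index. The helper name `lines_to_docs` and the field names `line_no` and `content` are assumptions for illustration; each resulting hash is the kind of thing you would pass to an indexer's `add_doc`.

```perl
use strict;
use warnings;

# Hypothetical helper: turn each non-blank log line into a "document" hash
# holding its 1-based line number and its content.
sub lines_to_docs {
    my (@lines) = @_;
    my @docs;
    my $line_no = 0;
    for my $line (@lines) {
        $line_no++;
        chomp( my $content = $line );
        next unless length $content;    # skip blank lines, keep numbering
        push @docs, { line_no => $line_no, content => $content };
    }
    return @docs;
}

my @docs = lines_to_docs( "GET /index.html 200\n", "\n", "GET /missing 404\n" );
printf "%d: %s\n", $_->{line_no}, $_->{content} for @docs;
```

Because each hit then corresponds to one line, the stored line number tells you where in the original file the match occurred.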
If you just want to find all the matches for a term
later
* Dezi::App 0.004 or later
* Search::OpenSearch::Engine::Lucy 0.400 or later
You can read more about how Dezi defines fields here:
http://dezi.org/2014/07/18/metanames-and-propertynames/
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
problem, though it doesn't directly answer
the question you posed, is to treat the synonyms as 'suggestions' for
further searches, rather than searching for them automatically.
Something like LucyX::Suggester[1] could be extended to include synonyms
in addition to spellings.
[0]
; {
foo => [qw( foo_lc foo_cs )],
},
);
https://metacpan.org/pod/Search::Query::Field#alias_for
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
If you're doing development on Lucy you should probably subscribe
yourself to d...@lucy.apache.org where a lot of this churn has been
discussed lately.
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
retrieve records.
The doc_id is ephemeral. It can change whenever an index changes
(segments getting merged, etc.).
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
identification. Note that
'keyword' might be a misnomer depending on what Analysis classes you
apply to your documents: i.e., you might have phrases, etc., not just
single terms.
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
" so
many places of the first term? Does each term have to be "within" at max "x"
amount of places from another term?
your guesses as to how it works are correct.
this might help:
http://mail-archives.apache.org/mod_mbox/lucy-user/201206.mbox/%3c4fe54df0.8060...@peknet.com
security"/,
qq/"652 security"/,
);
}
return ($term);
},
);
# run it
# q// rather than qq//, so the \s reaches the parser as a literal backslash-s
my $query = $qp->parse( q/body:'(?:654|656|650|652)\s+Security'/ );
my $lucy_query = $query->as_lucy_query();
my $hits = $lucy_searcher->hits( query => $lucy_query );
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
nd so the
.c-based messages are reporting the compile-time directory. Has nothing
to do with daemonization.
Are you by any chance running more than one indexer at a time?
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
and some blob data use that doesn't
affect this issue, since blobs are for later and I'm not getting any
hits on some strings, that I can grep from the .dat in a seg.
Instead of grep'ing the segment files, you might try seeing what Lucy
reports via the API:
https://metacpan.org/source/KARMAN/SWISH-Prog-Lucy-0.17/bin/lucyx-dump-terms
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
06DAC9 <-- this one returns a result
So my questions are, do I need to do anything special to query multiple
segments? Is Search::Query getting in the way?
show your code, please.
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
',
);
my $query = $qparser->parse('query');
my $hits  = $searcher->hits(
    query      => $query,
    num_wanted => 1_000_000,    # really big number
);
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
On 9/24/13 12:44 PM, Anil Pachuri wrote:
> # something like:
> my $docIDs = $hits->doc_ids;
>
my @doc_ids;
while ( my $hit_doc = $hits->next ) {
    push @doc_ids, $hit_doc->get_doc_id();
}
https://metacpan.org/module/Lucy::Document::Doc
--
Peter Karman . http://
On 9/18/13 8:11 AM, Lee Goddard wrote:
Is it possible to have Lucy return all documents in the index?
I'm sure this must be possible and in the docs, but I can't find it
http://search.cpan.org/~creamyg/Lucy-0.3.3/lib/Lucy/Index/IndexReader.pod
--
Peter Karman . http://
have one similar implementation
here as an example:
https://metacpan.org/module/SWISH::Prog::Lucy::Results#find_relevant_fields-1-0
see http://markmail.org/message/xoqwxofwphlowqxf
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
there a
> way to query multiple fields?
>
sure. join them with OR or AND.
foo:NULL or bar:NULL
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
$parser = Search::Query->parser(
    dialect   => 'Lucy',
    null_term => 'NULL',
);
my $query = $parser->parse('foo:NULL');
my $hits = $lucy_searcher->hits( query => $query->as_lucy_query() );
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
ome platform detection
in the Makefile.PL to fix it.
Thanks.
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
light/Highlighter.pod
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
::Query::Dialect::Lucy to parse your queries, you
get wildcard parsing automatically.
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
up as many docs as possible per $indexer, but only one
commit() call is needed.
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
ge multiple XML files or
> convert these into tabular format?
CPAN has many XML handling tools. I'm sure there's something there that will do
most or all of what you want.
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
Peter Karman wrote on 1/28/13 2:22 PM:
>
> my $query_parser = Search::Query->parser(
> query_class => 'Lucy',
> query_class_opts => {
> ignore_order_in_proximity => 0, # the default
> },
> default_field => 'content
order_in_proximity => 0, # the default
},
default_field => 'content',
);
my $string = qq/"here's looking * * kid"~10/; # proximity within => 10
my $query = $query_parser->parse($string);
my $lucy_query = $query->as_lucy_query();
my $hits = $lucy_searcher->hits( query => $lucy_query );
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
'Ca(.-)',
);
sub extract_chem_names {
    my $text = shift;
    my @matches;
    for my $n (@chem_names) {
        my $esc = quotemeta($n);
        if ( $text =~ m/$esc/ ) {
            push @matches, $n;
        }
    }
    return \@matches;
}
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
Search::Query::Parser with the term_expander feature
and the Lucy dialect:
https://metacpan.org/module/Search::Query::Parser
You probably also want to use LucyX::Search::ProximityQuery instead of
PhraseQuery, since ProximityQuery will allow for different word orders.
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
Aleksandar Radovanovic wrote on 11/15/12 1:51 PM:
> On 11/15/12 9:13 PM, Peter Karman wrote:
>> On 11/15/12 7:25 AM, Aleksandar Radovanovic wrote:
>>> Hi there,
>>>
>>> I was wondering is it possible to extract information (like the most
>>> common wor
http://cpansearch.perl.org/src/KARMAN/SWISH-Prog-Lucy-0.11/bin/lucyx-dump-terms
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
Thomas den Braber wrote on 11/13/12 10:47 AM:
> Peter Karman wrote on 11/12/2012 5:46 PM
>> Swish-e does presort attributes, but rank/score is not one of them. That is
>> always a per-search attribute.
>>
>> ISTR an email exchange about this back when I was first us
Marvin Humphrey wrote on 11/13/12 6:38 PM:
> On Tue, Nov 13, 2012 at 9:47 AM, Peter Karman wrote:
>> On 11/13/12 11:33 AM, Marvin Humphrey wrote:
>>
>>> I would oppose a seek() which runs implicit searches behind the scenes
>>> because it would surprise use
udo-code:
sub seek {
    my ($hits, $new_pos) = @_;
    # some error check here to verify $new_pos
    # isn't beyond the offset+num_wanted
    while ($hits->current_pos < $new_pos) {
        $hits->next or last;
    }
}
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
ill continue my migration and will let you know if there are 'more bumps
> on the road'.
>
> I can also make a more detailed performance comparison if you like.
I, for one, would be interested in hearing your thoughts, Thomas. As you might
expect, I have some experience with both. :)
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
d the implementation in
https://metacpan.org/module/SWISH::Prog::Lucy::Results
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
ts faceted search on top of Lucy. It does
it with https://metacpan.org/module/Search::OpenSearch::Engine::Lucy
There is no built-in facet function as part of Lucy.
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
Ken Youens-Clark wrote on 10/22/12 5:36 PM:
> On Oct 22, 2012, at 10:41 AM, Peter Karman wrote:
>
>> You don't mention, specifically, how much memory your processes are using,
>
> How can I tell how much memory is being used during the indexing process?
> Sorry
er of 100s of MB of RAM during indexing, more during index
optimization when sorting is happening. The number of fields you define
can also make memory use go up.
I've never run out of memory on my 5-year-old CentOS 5.x box with 8 GB of RAM.
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
doc() that checks the field types in %doc and does... what? croak?
delete_by_term (as in your code and my example above)?
Alternately, it might be worth sharing your LucyIndex class on CPAN in the
LucyX::* namespace. Something to consider.
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
more details.
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
uplicate".
A small, reproducible example is best if you are looking for help.
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
0). Otherwise use Peter's
> solution (mmdd - format)
>
I would expect fixed-width with leading zeros to work too. E.g.:
my $epoch_fixed = sprintf("%012d", $epoch);
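
To see why the padding matters, here is a small self-contained sketch comparing string order with and without it. The helper name `fixed_width` and the sample epoch values are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: zero-pad an epoch to 12 digits, as in the line above.
sub fixed_width {
    my ($epoch) = @_;
    return sprintf( '%012d', $epoch );
}

my @epochs = ( 1350000000, 999999999, 86400 );

# Plain string comparison goes character by character, so the unpadded
# values do not come out in numeric order.
my @raw = sort { $a cmp $b } @epochs;

# Padded to a fixed width, string order and numeric order agree.
my @padded = sort { $a cmp $b } map { fixed_width($_) } @epochs;

print "raw:    @raw\n";
print "padded: @padded\n";
```

Since range queries and sorting compare field values as strings, the padded form is the one that behaves like a number line.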
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
has only one public field
storage type, which is a string. So if you want to get coherent results
from a range query, make sure you are searching fixed-width strings.
E.g., I format all my dates as MMDD so that I can do range queries like:
my $all_hits_in_2012 = $parser->parse(
d to trunk as r1360509.
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
jakarta apache"~4
# Hits: 1
# Search time: 0.0253
306 doc.xml ""
.
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
dialect => 'Lucy',
fields => \%fieldtypes,
dialect_opts => { default_field => 'content' }, # just for example
);
my $proximity_query
= $qp->parse('content:"apache jakarta"~4')->as_lucy_query
example docs and indexing code, to
really demonstrate the problem. Since we can't see what's in your index,
it's difficult to help determine if this is a problem in your code or in
Lucy.
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
xed-width strings.
For more on sorting see:
https://metacpan.org/module/Lucy::Search::SortSpec
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
king. I have fixed that on github
and will release a new version soon.
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
/
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
to tell if you've implemented this right or wrong since the use case
you provide isn't complete. E.g., I don't know what is in the index. A fully
runnable piece of code would help.
I use NOT queries all the time with Search::Query::Dialect::Lucy, so I know the
Lucy feature works.
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
query );
my $lucy_proximity_query = $proximity_query->as_lucy_query();
my $proximity_hits = $lucy_searcher->hits( query => $lucy_proximity_query );
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
;,
>)
>
note that you likely don't want to specify multiple languages for a single
index, because the stemming (for example) rules applied will be
confused/confusing. I.e., Lucy doesn't do language *detection* -- it just
performs language-specific analysis based on the kin
Grant McLean wrote on 5/10/12 4:25 PM:
>
> I *think* the attached patch is all that's required to fix this.
Thanks, Grant. Committed to trunk as r1336991 and branches/0.3 as r1336992.
--
Peter Karman . http://peknet.com/ . pe...@peknet.com
hits: 1
search time: 0.06957
build time: 0.14598
query: peknet
[0] http://swish-e.org/swish3/
[1] https://metacpan.org/module/SWISH::Prog::Lucy
[2] http://dezi.org/
--
Peter Karman . http://peknet.com/ . pe...@peknet.com