RE: Upgrade Lucene to latest version (4.0) from 2.4.0

2013-01-09 Thread Paul Hill
My guess is that upgrading to 3.6 to cover the _mostly_ upward compatible changes to that point (Fieldable vs. Field) might make a worthwhile intermediate step. Then test that to make sure it is working using whatever tests you have. Then work out the "real" changes to 4.0. That is only a thought

RE: Is StandardAnalyzer good enough for multi languages...

2013-01-09 Thread Paul Hill
There is often the possibility to put another tokenizer in the chain to create a variant analyzer. This is NOT very hard at all in either Lucene or ElasticSearch. Extra tokenizers can often be used to tweak the overall processing to add a late tokenization to overcome an overlooked tokenization (

RE: Is StandardAnalyzer good enough for multi languages...

2013-01-08 Thread Paul Hill
The ICU project ( http://site.icu-project.org/ ) has Analyzers for Lucene and it has been ported to ElasticSearch. Maybe those integrate better. As to not doing some tokenization, I would think an extra tokenizer in your chain would be just the thing. -Paul > -Original Message- > From:

RE: any good idea for loading fields into memory?

2012-06-25 Thread Paul Hill
OK, fair enough, you want to keep everything very fast. I'm surprised that large documents are slower for searching. I'm way impressed all the time by the search times. Finding good hit fragments on a big document can be slow, but for me (searching human-created documents) it is never slow. > the

RE: Fast way to get the start of document

2012-06-25 Thread Paul Hill
from Tika (the > > "content" metadata.) The latter would store but not necessarily index > > the first 10K or so characters of the full text. Do searches on the > > full body field and highlighting on the limited body field. > > > > -- Jack Krupansky
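
A minimal sketch of the two-field idea quoted above, written in plain Python rather than against the Lucene API. The function name `make_fields`, the field names `body` and `body_preview`, and the word-boundary trimming are all assumptions made for illustration; only the "first 10K or so characters" limit comes from the message.

```python
def make_fields(full_text: str, limit: int = 10_000) -> dict:
    """Return a full field for searching and a shortened one for highlighting.

    Hypothetical helper: the field names are illustrative, not Lucene's.
    """
    preview = full_text[:limit]
    # If the text was actually cut mid-word, back up to the last space
    # so the preview does not end on a partial word.
    if len(full_text) > limit and not full_text[limit].isspace():
        cut = preview.rfind(" ")
        if cut > 0:
            preview = preview[:cut]
    return {"body": full_text, "body_preview": preview}
```

In a real index the `body` value would be indexed for search and the `body_preview` value stored for the highlighter, as the quoted advice suggests.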

RE: any good idea for loading fields into memory?

2012-06-22 Thread Paul Hill
> use collector and field cache is a good idea for ranking by certain > > > field's value. > > > but I just need to return matched documents' fields. and also field > > > cache can't store multi-value fields? > > > I have to store special chars like

RE: any good idea for loading fields into memory?

2012-06-21 Thread Paul Hill
I would ask the question that if you want to look at the whole value of a field during searching, why don't you have just such a field in your index? I have an index with several fields that have 2 versions of the field, both analyzed and unanalyzed. It works great for me in 3.x (not 4.x). Have
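
A toy illustration of why keeping both versions of a field helps, in plain Python and not the Lucene API. The function `index_terms` and its lowercase word-splitting are assumptions standing in for a real analyzer; the point is only that the analyzed version yields many terms while the unanalyzed version keeps the whole value as one term.

```python
import re
from typing import List


def index_terms(value: str, analyzed: bool) -> List[str]:
    """Return the terms a field value would contribute to an index.

    Hypothetical stand-in for an analyzer: lowercases and splits on
    non-word characters when analyzed, else keeps the value verbatim.
    """
    if analyzed:
        return [t.lower() for t in re.findall(r"\w+", value)]
    return [value]
```

With the value `Smith & Sons, Ltd.`, the analyzed version supports word searches such as `sons`, while the unanalyzed version only matches the complete original string.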

RE: Stemming - limited index expansion

2012-06-12 Thread Paul Hill
Thanks for the reply. > -Original Message- > From: Jack Krupansky [mailto:j...@basetechnology.com] > Sent: Tuesday, June 12, 2012 1:14 PM > To: java-user@lucene.apache.org > Subject: Re: Stemming - limited index expansion > > I don't completely follow precisely what you want to do, but th

Stemming - limited index expansion

2012-06-12 Thread Paul Hill
As others have previously proposed on this list, I am interested in inserting a second token at some positions in my index. I'll call this Limited Index Expansion. I want to retain the original token, so that I can score an original word that matches in a text better than just any synonym/stem
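
A rough sketch of the idea described above: keep every original token and stack an extra token at the same position. This is plain Python, not a Lucene TokenFilter. The `Token` type, the `toy_stem` function (which only strips a trailing "s"), and `expand` are hypothetical names for illustration; a position increment of 0 is used here to mean "same position as the previous token".

```python
from typing import Callable, List, NamedTuple, Optional


class Token(NamedTuple):
    term: str
    position_increment: int  # 1 = next position, 0 = stacked on previous token


def toy_stem(term: str) -> Optional[str]:
    """Hypothetical stand-in for a real stemmer: strips a trailing 's'."""
    if len(term) > 3 and term.endswith("s") and not term.endswith("ss"):
        return term[:-1]
    return None


def expand(terms: List[str],
           stem: Callable[[str], Optional[str]] = toy_stem) -> List[Token]:
    """Emit each original term, plus its stem at the same position if different."""
    out: List[Token] = []
    for term in terms:
        out.append(Token(term, 1))          # original token is always retained
        variant = stem(term)
        if variant is not None and variant != term:
            out.append(Token(variant, 0))   # extra token shares the position
    return out
```

Because the original token is still present, a query can in principle weight an exact match above a match on the stacked variant; how that scoring is done is outside this sketch.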

RE: CPU usage increased using 3.4.0

2012-06-11 Thread Paul Hill
Maybe someone already discussed this, but why is 100% usage a problem? I ask, because one interpretation of going from 30% usage to 100% usage is that the program is no longer I/O bound; it is no longer waiting around for I/O to complete. But maybe that is not what you mean. If the total elapse

RE: ORM for Android + Lucene

2012-06-08 Thread Paul Hill
You don't actually need a relational DB when using Lucene, but if you do, did you try searching Google for "ORM for Android SQLite", because there is just such a library. -Paul > -Original Message- > From: GuenterR [mailto:gunt...@gmail.com] > All examples that I have already found so far

RE: CodeMaps updates for Lucene

2012-06-08 Thread Paul Hill
As text retrieval geeks, we hate manual tagging :-) We want you to analyze the content (might I suggest using Lucene and Mahout) and categorize it for us. :-) But jokes aside, a major category of tags would be "(text) analysis" or "tokenization" or "term processing for indexing" -- all that stuff

IndexSearcher.search(query, filter, collector) considered less efficient

2012-06-08 Thread Paul Hill
I noticed today that my code calls IndexSearcher.search(Query query, Filter filter, Collector collector), but I also noticed that the docs say "Applications should only use this if they need all of the matching documents. The high-level search API (Searcher.search(Query, Filter, int) ) is usually
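
For readers wondering how a "top N" call can differ from collecting every match, here is a library-free Python sketch. It is not how Lucene implements either method; `top_n` is a hypothetical function showing one common technique, a bounded min-heap that only ever holds the best `n` hits seen so far.

```python
import heapq
from typing import Iterable, List, Tuple


def top_n(scored_hits: Iterable[Tuple[int, float]], n: int) -> List[Tuple[int, float]]:
    """Return the n best (doc_id, score) pairs, highest score first."""
    if n <= 0:
        return []
    heap: List[Tuple[float, int]] = []  # min-heap keyed on score
    for doc_id, score in scored_hits:
        entry = (score, doc_id)
        if len(heap) < n:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            # Better than the weakest kept hit: swap it in.
            heapq.heapreplace(heap, entry)
    return [(doc_id, score) for score, doc_id in sorted(heap, reverse=True)]
```

Memory stays proportional to `n` rather than to the number of matching documents, which is the usual motivation for asking only for the top hits.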

RE: Similarity coefficient for more exact matching

2012-05-04 Thread Paul Hill
> [use] IndexWriterConfig.setSimilarity() and > IndexSearcher.setSimilarity(), unless you are clever or like being confused. > > SweetSpotSimilarity might also be worth a look. > > -- > Ian. Being even less clever, I just make sure I set: Similarity.setDefault(new MySimilarity()) when crawl

Query Rewrite - Utilities?

2012-04-09 Thread Paul Hill
Just thought I'd throw out a question. What is available in the libraries to help with manipulating and asking questions about queries? So far my best (and worst) efforts have involved combinations of setting up a parser, generating a query object, then looking through the various clauses and re-work

Hit Highlighting which highlighter to use?

2012-04-04 Thread Paul Hill
Using the original org.apache.lucene.search.highlight.Highlighter, should I be able to give it a query like [ My AND Words AND "My Words"^100 ] (the actual phrase in this query is converted to a span query with a slop of 1), and expect it to find the fragment many pages into the file that has span "My

RE: Lucene 4 - POS and Syntactic Tagging

2012-04-02 Thread Paul Hill
> Mark McGuire wrote: > I'm working on a project where I need to tag both the part of speech and > other syntactic information on tokens To pick up on this thread from a few weeks back. I've never done this myself, but I think that your desire to put extra information that is not really a token

RE: More About NOT Optimizing

2012-03-08 Thread Paul Hill
> Uwe Schindler wrote: > TieredMP is already the default in Lucene 3.5, unless you explicitly set > another one! > I was going to add the detail that I was running 3.4 at the moment (I'm looking to upgrade very soon) and thought LogByteSizeMergePolicy was the default there, but I am wrong th

RE: More About NOT Optimizing

2012-03-08 Thread Paul Hill
> I think a good question is whether you are really seeing performance issues > due to the 1/3 deleted- > but-not-yet-reclaimed documents... No, I'm NOT worried about performance. I've got the message about optimize(). I was just looking for something I might do maybe once or twice a year when

More About NOT Optimizing

2012-03-06 Thread Paul Hill
I'm running with 3.4 code and have studied up on all the API related to the optimize() replacements and understand I needn't worry about deleted documents, but I still want to ask a few things about keeping the index in good shape and about merge policy. I have an index with 421163 documents (in

RE: SweetSpotSimilarity

2012-03-05 Thread Paul Hill
> -Original Message- > My only thought is that the new stuff seems to be at the expense of the > formulas listed in the old > class overview for Similarity. > http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/search/Similarity.html Oops, my bad

RE: SweetSpotSimilarity

2012-03-05 Thread Paul Hill
> I would definitely not suggest using SSS for fields like legal brief text or > emails where there is huge > variability in the length of the content -- i can't think of any context > where a "short" email is > definitively better/worse than a "long" email. more traditional TF/IDF seems > like

RE: Building FST-like automaton queries

2012-03-02 Thread Paul Hill
> > Wow, that was quick!  Thanks! > > The power of open source and coffee break, combined... 12 minutes! Wow, that is fast turnaround or a lot of coffee. -Paul

RE: SweetSpotSimilarity

2012-03-01 Thread Paul Hill
Hi Chris, I didn't see your response. Thanks. Actually I was recently playing in fooplot, an online plotting tool (one of many), to examine the various formulas and get a better handle on what they do. Thanks for the discussion of 'sweetspot'. I'm thinking this might help others going