Re: Difference between '-' and 'NOT' in Lucene Query.

2024-05-06 Thread Paul Libbrecht
A weighted OR, of course. On 6 May 2024, at 12:43, Paul Libbrecht wrote: Am I mistaken, or does “ “ make an OR if there’s no other operator? On 6 May 2024, at 12:41, Saha, Rajib wrote: Hi Experts, As per the definition in https://lucene.apache.org/core/2_9_4/queryparsersyntax.html

Re: Difference between '-' and 'NOT' in Lucene Query.

2024-05-06 Thread Paul Libbrecht
Am I mistaken, or does “ “ make an OR if there’s no other operator? On 6 May 2024, at 12:41, Saha, Rajib wrote: Hi Experts, As per the definition in https://lucene.apache.org/core/2_9_4/queryparsersyntax.html '-' and 'NOT' in a query string theoretically serve the same purpose.

Re: Exact KNN

2024-01-30 Thread Paul Libbrecht
Isn’t that what Semantic-Vectors is doing? E.g. https://github.com/Ontotext-AD/semanticvectors Paul On 30 Jan 2024, at 20:50, William Zhou wrote: > Is there a way of directly executing an exact nearest neighbor search? It > seems like the API provides some general functionality, and we can

Re: Search results/criteria validation

2021-03-17 Thread Paul Libbrecht
Explain is a heavyweight thing. Maybe it helps you, maybe you need something high-performance. I was asking a similar question ~10 years ago and got a very interesting answer on this list. If you want I can try to dig this to find it. At the end, and with some limitation in the number of

Re: Document metadata in ranking?

2021-02-25 Thread Paul Libbrecht
Hello Philip, I’ll answer with a possibility that might be outdated and predates the existence of payloads (which I think are non-analysed parts so not appropriate). Lucene has fields and you can include the metadata within fields in form of particular tokens. Then you can enrich every

Re: Using Lucene for technical documentation

2020-11-23 Thread Paul Libbrecht
Hello Trevor, I don’t know of an analyzer for mixes of code and text but I know of an analyzer for mixes of code and formulæ. Clearly, you could build a custom analyzer that would tokenize differently depending on whether you’re in code or in text. That’s not super hard. However, where

Re: [ANN] word2vec for Lucene

2014-11-20 Thread Paul Libbrecht
Hello Koji, how would you compare that to SemanticVectors? paul On 20 nov. 2014, at 10:10, Koji Sekiguchi k...@r.email.ne.jp wrote: Hello, It's my pleasure to share that I have an interesting tool word2vec for Lucene available at https://github.com/kojisekig/word2vec-lucene . As you

Re: [ANN] word2vec for Lucene

2014-11-20 Thread Paul Libbrecht
('France') + vector('Italy') results in a vector that is very close to vector('Rome'), and vector('king') - vector('man') + vector('woman') is close to vector('queen') Thanks, Koji (2014/11/20 20:01), Paul Libbrecht wrote: Hello Koji, how would you compare that to SemanticVectors

Re: Document Term matrix

2014-11-11 Thread Paul Libbrecht
The project semanticvectors might be doing what you are looking for. paul On 11 nov. 2014, at 22:37, parnab kumar parnab.2...@gmail.com wrote: hi, While indexing the documents , store the Term Vectors for the content field. Now for each document you will have an array of terms and their

Re: how to ignore full stop for specific word

2014-11-06 Thread Paul Libbrecht
My trick would be to replace .net with dotNet (or use some funky Unicode-letter to replace the dot). If you use consistently the same analyzer-chain, then it will match cleanly. paul On 6 nov. 2014, at 12:42, Rajendra Rao rajendra@launchship.com wrote: I have some word which contain
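The replacement trick above can be sketched in plain Java, independent of any Lucene version. The class name `DotNetNormalizer` and the exact pattern are assumptions made for illustration only; the idea is to run the same rewrite on both indexed text and query text before it reaches the analyzer chain.

```java
import java.util.regex.Pattern;

// Hypothetical helper: rewrites ".net" to "dotNet" so that a tokenizer which
// treats "." as a separator still sees one searchable word.
final class DotNetNormalizer {
    // Matches ".net" (any case) only when the dot is not preceded by a letter
    // or digit, so host names such as "example.net" are left untouched.
    private static final Pattern DOT_NET =
            Pattern.compile("(?<![\\p{L}\\p{N}])\\.net\\b", Pattern.CASE_INSENSITIVE);

    static String normalize(String text) {
        return DOT_NET.matcher(text).replaceAll("dotNet");
    }

    public static void main(String[] args) {
        System.out.println(normalize("Senior .NET developer"));       // Senior dotNet developer
        System.out.println(normalize("see example.net for details")); // unchanged
    }
}
```

Note that this sketch deliberately leaves forms such as "ASP.NET" alone; whether those should be rewritten as well depends on the data.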

Re: Case sensitivity

2014-09-19 Thread Paul Libbrecht
two fields? paul On 19 sept. 2014, at 15:07, John Cecere john.cec...@oracle.com wrote: Is there a way to set up Lucene so that both case-sensitive and case-insensitive searches can be done without having to generate two indexes? -- John Cecere Principal Engineer - Oracle Corporation

Re: Lucene for Log file indexing and search

2013-09-19 Thread Paul Libbrecht
Ashok, I would look at solr, which has many more field types to support more queries. E.g. there you have a nice query syntax for time spans and fantastic caching. I think there are very few initiatives for indexing logs and I would be interested to see the results of your enterprise. paul

Re: international stop set?

2012-10-27 Thread Paul Libbrecht
On 27 Oct 2012, at 11:43, Tom wrote: Aha! Exactly the problem! And only because the user-agent is one language, doesn't mean all search terms will be! For example, someone might type in the name of an English event (such as Halloween) first, and then type in the name of their home town

Re: Lucene index on NFS

2012-10-02 Thread Paul Libbrecht
My experience in the Lucene 1.x times was a slowdown by a factor of at least four when writing to NFS and about two when reading from there. I'd discourage this as much as possible! (rsync is way more your friend for transporting, and replication à la solr should also be considered) paul On 2 Oct 2012, at

Re: Lucene index on NFS

2012-10-02 Thread Paul Libbrecht
anyone run into such trouble? Or is it strictly just a performance issue? /Jong On Tue, Oct 2, 2012 at 5:17 AM, Paul Libbrecht p...@hoplahup.net wrote: My experience in the Lucene 1.x times were a factor of at least four in writing to NFS and about two when reading from there. I'd discourage

Re: let's use our native language

2012-09-14 Thread Paul Libbrecht
most sentences around Lucene what I searched out aren't compiled correctly. wondering if we build our local mailing list... Which language? paul - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For

Re: Lucene tokenization

2012-03-28 Thread Paul Libbrecht
Nilesh, the StandardAnalyzer is full of generally useful special cases, including emails and numbers detection. I am supposing you met one such special case which has a justification of some sort. I can't tell you why but I can tell it's really hard to change because others rely on this

Re: analyzer per document

2012-02-09 Thread Paul Libbrecht
I would use a different field per language and use PerFieldAnalyzer indeed. This is also important for queries whose language is not always clear. paul Le 9 févr. 2012 à 13:01, Vinaya Kumar Thimmappa a écrit : Hello All, I have a requirement of using different analyzer per document. How

Re: Designing a multilingual index

2012-01-03 Thread Paul Libbrecht
Le 3 janv. 2012 à 13:56, heikki a écrit : In our case, it is known in which language the user is searching (because he tells us, and if he doesn't, we use the current GUI language). On the web it is often hard to trust such (e.g. because of people working in multiple languages, internet

Re: Designing a multilingual index

2012-01-03 Thread Paul Libbrecht
Heikki, it does solve your main concern: a term in lucene is a pair of a token and field name. The term frequency is, thus, the frequency of a token in a field. So the term-frequency of text-stemmed-de:firewall is independent of the term-frequency of text-stemmed-en:firewall (for example).
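The point that a term is a (field, token) pair can be illustrated without Lucene itself. The following is a toy counter, not Lucene's implementation; the class name `PerFieldFrequencies` is made up for the sketch, and the field names are the ones used in the message above.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model: frequencies are keyed by field name plus token text,
// so the same token in two fields is counted separately.
final class PerFieldFrequencies {
    private final Map<String, Integer> counts = new HashMap<>();

    // Builds the "field:token" key; assumes field names contain no ':' themselves.
    private static String key(String field, String token) {
        return field + ":" + token;
    }

    void add(String field, String... tokens) {
        for (String token : tokens) {
            counts.merge(key(field, token), 1, Integer::sum);
        }
    }

    int frequency(String field, String token) {
        return counts.getOrDefault(key(field, token), 0);
    }

    public static void main(String[] args) {
        PerFieldFrequencies index = new PerFieldFrequencies();
        index.add("text-stemmed-de", "firewall", "firewall");
        index.add("text-stemmed-en", "firewall");
        System.out.println(index.frequency("text-stemmed-de", "firewall")); // 2
        System.out.println(index.frequency("text-stemmed-en", "firewall")); // 1
    }
}
```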

Re: Designing a multilingual index

2012-01-03 Thread Paul Libbrecht
indexes, the relevance scoring is more accurate. Kind regards, Heikki Doeleman On Tue, Jan 3, 2012 at 3:29 PM, Paul Libbrecht p...@hoplahup.net wrote: Heikki, it does solve your main concern: a term in lucene is a pair of a token and field name. The term frequency is, thus

Re: Retrieving large numbers of documents from several disks in parallel

2011-12-21 Thread Paul Libbrecht
Michael, from a physical point of view, it would seem like the order in which the documents are read is very significant for the reading speed (feel the random access jump as being the issue). You could: - move to ram-disk or ssd to make a difference? - use something different than a searcher

Re: how to do remote debug on benchmark test or whatever test?

2011-12-09 Thread Paul Libbrecht
hao, this is a java question not a lucene question. Here's a short answer: Those options are to be fed to the java command. Running on the command-line is where you could put them. Running in IDEs there is generally such a feature ready, or the possibility to connect to the socket address.

Re: Phonetic search with Lucene 3.2

2011-11-09 Thread Paul Libbrecht
We've been using http://www.tangentum.biz/en/products/phonetix/ which does double-metaphone. Maybe that helps. paul Le 9 nov. 2011 à 11:29, Felipe Carvalho a écrit : Using PerFieldAnalyzerWrapper seems to be working for what I need! On indexing: PerFieldAnalyzerWrapper

Re: Phonetic search with Lucene 3.2

2011-11-09 Thread Paul Libbrecht
That uses Lucene 2.9.2 indeed. paul Le 9 nov. 2011 à 11:43, Felipe Carvalho a écrit : Which version of Lucene are you using? I had tried it with Lucene 3.3 and had some problems, did you have to do any customizations? On Wed, Nov 9, 2011 at 8:38 AM, Paul Libbrecht p...@hoplahup.net wrote

Re: Phonetic search with Lucene 3.2

2011-11-08 Thread Paul Libbrecht
Felipe, I do not have a tutorial but what you are describing is what I have been doing in ActiveMath. I have a little paper for you if you want that explains how it goes there (http://www.hoplahup.net/paul_pubs/AccessRetrievalAM.html) and the software is open-source

Re: Phonetic search with Lucene 3.2

2011-11-07 Thread Paul Libbrecht
Felipe, in Lucene in Action there's a little bit on that. Basically it's just about using the right analyzer. paul Le 8 nov. 2011 à 01:45, Felipe Carvalho a écrit : Hello, I'm using Lucene 3.2 on a phone book app and phonetic search is a requirement. I've googled up lucene phonetic search

Re: What's the best way to translate a query in multiple languages?

2011-11-02 Thread Paul Libbrecht
Raf, I always do this: query expansion. Take the Lucene QueryParser, default field default, default analyzer whitespace analyzer... feed the query in. You typically get a BooleanQuery which you can now process to perform the query expansion. For example I replace all termQueries by a boolean

Re: LSI

2011-08-29 Thread Paul Libbrecht
Zarrinkalam, have a look at semanticvectors. paul On 29 Aug 2011, at 15:55, zarrinkalam wrote: hi, I want to use LSI for clustering documents indexed with lucene, I don't know how, please help me, thanks,

Re: SSD Experience

2011-08-23 Thread Paul Libbrecht
I think we're getting off topic about Lucene usage for SSDs but I fully agree with the mail below: SSDs are faster than normal disks for development. Actually, one of the things that got really faster with the SSD is IntelliJ indexing and reboot; sadly, I could not tell if it is using Lucene.

Re: SSD Experience (on developer machine)

2011-08-23 Thread Paul Libbrecht
Funnily, I had such an experience: an SSD on the laptop of the brand SanDisk, guaranteed for 80 TB of writes. Well, I had it twice changed under guarantee. Then the shop provided me an OCZ. Maybe that lasts longer... I'm still in guarantee. paul Le 23 août 2011 à 17:11, Toke Eskildsen a écrit

Re: SSD Experience (on developer machine)

2011-08-23 Thread Paul Libbrecht
Sorry Toke, I do not know. The service shop replaced it fairly blindly. paul Le 23 août 2011 à 20:46, Toke Eskildsen a écrit : On Tue, 2011-08-23 at 17:20 +0200, Paul Libbrecht wrote: Funnily, I had such an experience: an SSD on the laptop of the brand SanDisk, guaranteed for 80 TB

Re: Semantic indexing in Lucene

2011-05-24 Thread Paul Libbrecht
Diego, The semanticvectors project has a mailing list and its author, Dominic Widdows, is responding actively there. paul On 24 May 2011, at 02:34, Diego Cavalcanti wrote: Sorry, I thought the blog was yours! I will read the post and see if it helps me. Thank you! About the Semantic

Re: Please help me with a basic question...

2011-05-18 Thread Paul Libbrecht
Richard, in SOLR at least there's an analyzer that avoids duplicates. I think that would solve it. There's also somewhere the option to ignore IDF (in similarity? in solrconfig?). paul Le 18 mai 2011 à 21:30, Rich Heimann a écrit : Hello all, This is my first time on the list and my first

Re: Thoughts on Search Analytics?

2011-05-06 Thread Paul Libbrecht
Le 6 mai 2011 à 00:20, Otis Gospodnetic a écrit : thus far, only search-testing has provided some analytics measures for us (precision and recall ones). We, of course, construct the test-suites from the logs. Interesting. It sounds like you don't currently utilize any sort of

Re: Are Okapi BM25 scores normalized into 0 and 1 ?

2011-04-29 Thread Paul Libbrecht
Patrick, if the question is about the code snippet at the page you mention, which I copy below, I believe the answer is no and the author is aware of it since he is adding a comment about not-normalized in the second example. ScoreDocs and TopDocs are not returning normalized scores. Normalized

Re: file formats: MacRoman and UTF-8...

2011-03-28 Thread Paul Libbrecht
java -Dfile.encoding=utf-8 should do the trick. Or... which java app are you using? paul On 28 Mar 2011, at 09:03, Patrick Diviacco wrote: When I run my Lucene app and parse an xml file I get the following error due to some characters such as é written in the text file. If I save the text
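The flag above changes the JVM-wide default encoding. An alternative, which is an assumption of this sketch and not something the thread proposes, is to name the charset explicitly wherever bytes are decoded, so that the result no longer depends on the platform default. The class name `ExplicitCharset` is illustrative.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class ExplicitCharset {
    // Decodes bytes with a named charset instead of the platform default.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Reads a whole file as UTF-8, regardless of -Dfile.encoding.
    static String readUtf8(Path file) throws IOException {
        return decode(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] eAcute = {(byte) 0xC3, (byte) 0xA9}; // "é" encoded as UTF-8
        // Decoded with the right charset this is one character ...
        System.out.println(decode(eAcute, StandardCharsets.UTF_8).length());      // 1
        // ... decoded with a single-byte charset it becomes two.
        System.out.println(decode(eAcute, StandardCharsets.ISO_8859_1).length()); // 2
    }
}
```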

Re: Indexing of multilingual labels

2011-03-14 Thread Paul Libbrecht
Stephane, I think that you have the freedom to put what you want in the stored value of a field. The simplest would even be to make it that the fields that you want to use for display are stored, preformatted, xml-ished, owl-ified, or json-ized, to be separate from the indexed fields (where

Re: ManifoldCF in Action

2011-03-10 Thread Paul Libbrecht
Erm, google DIH SOLR or http://wiki.apache.org/solr/DataImportHandler paul Le 10 mars 2011 à 14:37, karl.wri...@nokia.com a écrit : Karl, can you give, in one paragraph, the difference between ManifoldCF and DIH? thanks in advance paul I am unfamiliar with DIH as an acronym

Re: Lucene paid support

2011-03-03 Thread Paul Libbrecht
David, I'm sure that if you request something more precise you might get enthusiasts over here easily. I heard several committers of Lucene have gone into LucidImagination and they offer paid services specialized for Lucene. hope it helps. paul Le 3 mars 2011 à 21:13, Jarrin, David a écrit

Re: gracefully interrupting an optimize

2011-01-26 Thread Paul Libbrecht
, then that defaults to IW.close(true) which means wait for all BG merges to finish. So normally IW.close() reserves the right to take a long time. But IW.close(false) should finish relatively quickly... Mike On Fri, Jan 21, 2011 at 9:20 AM, Paul Libbrecht p...@hoplahup.net wrote: Would that happen

Re: gracefully interrupting an optimize

2011-01-21 Thread Paul Libbrecht
Would that happen automagically at finalization? paul Le 21 janv. 2011 à 15:13, Michael McCandless a écrit : If you call optimize(false), that'll return immediately but run the optimize in the background (assuming you are using the default ConcurrentMergeScheduler). Later, when it's time

NOT_ANALYZED... should be an analyzer

2011-01-20 Thread Paul Libbrecht
Hello list, I am hitting a stupid bug where a unit test shows me that QueryParser analyzes fiercely anything it finds hence... I have to tune the analyzer to not decompose the terms with fields that should be non-analyzed. For indexing, you can choose to have something not_analyzed. For

Re: Best practices for multiple languages?

2011-01-20 Thread Paul Libbrecht
Isn't this approach somewhat bad for term-frequency? Words that would appear in several languages would be a lot more frequent (hence less significant). I still prefer the split-field method with a proper query expansion. This way, the term-frequency is evaluated on the corpus of one

Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?

2011-01-19 Thread Paul Libbrecht
Grant Ingersoll gsing...@apache.org wrote: Where do you get your Lucene/Solr downloads from? [x] ASF Mirrors (linked in our release announcements or via the Lucene website) [X] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.) [X] I/we build them from source via an

Re: Best practices for multiple languages?

2011-01-19 Thread Paul Libbrecht
exceeds 3-4 languages, I know of some that handle 10. If you're careful enough, it just works. Hope this helps. Shai On Wed, Jan 19, 2011 at 9:44 AM, Paul Libbrecht p...@hoplahup.net wrote: But for this, you need a skillfully designed: - set of fields - multiplexing analyzer - query

Re: AW: Best practices for multiple languages?

2011-01-19 Thread Paul Libbrecht
So you are only indexing analyzed and querying analyzed. Is that correct? Wouldn't it be better to prefer precise matches (a field that is analyzed with StandardAnalyzer for example) but also allow matches that are stemmed? paul On 19 Jan 2011, at 19:21, Bill Janssen wrote: Clemens Wyss

Re: search on a field that is NOT_ANALYZED

2011-01-19 Thread Paul Libbrecht
I think you should use a TermQuery. paul On 19 Jan 2011, at 20:03, Yuhan Zhang wrote: Hi all, I am trying to use IndexSearcher (http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/search/IndexSearcher.html#IndexSearcher%28org.apache.lucene.store.Directory%29) to retrieve a

Re: AW: Best practices for multiple languages?

2011-01-19 Thread Paul Libbrecht
Le 19 janv. 2011 à 20:56, Bill Janssen a écrit : Paul Libbrecht p...@hoplahup.net wrote: So you are only indexing analyzed and querying analyzed. Is that correct? Yes, that's correct. I fall back to StandardAnalyzer if no language-specific analyzer is available. Wouldn't

Re: Best practices for multiple languages?

2011-01-18 Thread Paul Libbrecht
But for this, you need a skillfully designed: - set of fields - multiplexing analyzer - query expansion In one of my projects, we do not split language by fields and it's a pain... I'm having recurring issues in one sense or the other. - the die example that Otis mentioned is a good one:

lucene-based log searcher?

2011-01-13 Thread Paul Libbrecht
Hello list, has anyone built a log-analyzer based on Lucene? Our logs are so big that grep takes hours to do what I want it to do. I'm sure Lucene would solve it. Thanks in advance paul

Re: Where to find non-English dictionaries, thesaurus, synonyms

2011-01-07 Thread Paul Libbrecht
Somehow, I had the impression that the TrebleCLEF and EuroMatrix european projects are meant to gather this kind of information sources. But honestly, it's not as homogeneous as in OpenOffice. Mozilla also has dictionaries. Wiktionary can also be helpful. paul Le 7 janv. 2011 à 22:26, Robert

two IndexSearchers on one dir?

2010-12-31 Thread Paul Libbrecht
Hello list, is it a good or bad thing to open two index-searchers on FSDirectories of the same path? (namely, one short-lived, one long-lived). thanks in advance paul

Comment in query-parser?

2010-12-30 Thread Paul Libbrecht
I'm more and more involved into preparing dedicated pages that list resources of our servers according to an elaborate query I received in a human description and implement as a query-parser query. Doing this I regularly use indexed-doc views. The implementation is thus a query that could be

Re: Outof memory exception on using Integer.MaxValue

2010-12-28 Thread Paul Libbrecht
I also note that this is a fundamental characteristic of the great performance of Lucene and its related products since it allows cleanly managed resources. This is generally called paging. paul On 28 Dec 2010, at 10:32, Uwe Schindler wrote: The TopDocs returning methods are not intended

Re: java.lang.NoClassDefFoundError: org/apache/lucene/util/CharacterUtils

2010-12-13 Thread Paul Libbrecht
Allow me to recommend a little trick to track the origin of a class which works often: org.apache.lucene.analysis.WhitespaceAnalyzer.class.getResource("WhitespaceAnalyzer.class") will give you a URL that should be the URL of the jar, followed by an exclamation mark, followed by the
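The trick can be tried with nothing but the JDK. The sketch below wraps it in a hypothetical helper, `ClassOrigin.locate`, and uses `String.class` so that it runs without Lucene on the classpath; the same call should work for any top-level class whose origin is in doubt.

```java
import java.net.URL;

final class ClassOrigin {
    // Returns the URL the class file was loaded from, or null if it cannot be found.
    static URL locate(Class<?> type) {
        // A relative resource name is resolved against the class's own package,
        // so the simple name plus ".class" is enough for a top-level class.
        return type.getResource(type.getSimpleName() + ".class");
    }

    public static void main(String[] args) {
        // The exact scheme (jar:, file:, jrt:) depends on the JVM and on how the class is packaged.
        System.out.println(locate(String.class));
    }
}
```

When a NoClassDefFoundError points at a version mismatch, printing the origin of the classes involved usually shows which jar each one actually came from.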

Lucene index exchange format?

2010-11-09 Thread Paul Libbrecht
hello list, more and more I seem to encounter situations where the delivery of a prebuilt lucene index is desirable. The binary format probably works (experience hints would be welcome) but I fear it would be fragile with versioning (it certainly fails at version-downgrading). Did anyone work

Re: Does Lucene compress postings (or posting lists) in its inverted index?

2010-10-17 Thread Paul Libbrecht
Mahmoud, Lucene's documents' fields can be, when stored, compressed on disk. I think that answers your question. paul On 17 oct. 2010, at 09:16, Mahmoud Abdelkader wrote: Hello, We're currently evaluating utilizing Lucene to index a large English corpus and we are optimizing for

Re: trying to use the highlighter

2010-09-06 Thread Paul Libbrecht
ping! Any hope for help here? I'm a bit stuck before deploying a release. thanks in advance paul On 3 sept. 2010, at 14:05, Paul Libbrecht wrote: Hello list, I'm struggling again with the highlighter. I don't understand why I sporadically obtain InvalidTokenOffsetsException

trying to use the highlighter

2010-09-03 Thread Paul Libbrecht
Hello list, I'm struggling again with the highlighter. I don't understand why I sporadically obtain InvalidTokenOffsetsException. The mission: given a query, detect which field was matched, among the names of the concepts: there can be several names for a given concept, also in one language.

Re: Fastest way to get number of matching documents

2010-07-26 Thread Paul Libbrecht
Le 26-juil.-10 à 16:01, Michael McCandless a écrit : You can make a custom Collector? Ie, it'd just increment a counter for each hit. As long as it does not call the Scorer.score() method then no scoring is done. I've done that. Code below. It feels a bit stupid to have to do that

Re: Best practices for searcher memory usage?

2010-07-13 Thread Paul Libbrecht
Le 13-juil.-10 à 23:49, Christopher Condit a écrit : * are there performance optimizations that I haven't thought of? The first and most important one I'd think of is get rid of NFS. You can happily do a local copy which might, even for 10 Gb take less than 30 seconds at server start.

Re: best way to interest two queries?

2010-05-15 Thread Paul Libbrecht
On 12 May 2010, at 10:55, mark harwood wrote: two terminology questions: - is multiplier in the mail mentioned there the same as boost? This factor controls how many decimal places precision is retained in the adjusted scores. Pick too low a multiplier and scores that are only

Re: best way to interest two queries?

2010-05-11 Thread Paul Libbrecht
don't know what to do for b). thanks for hints. paul Le 31-mars-10 à 23:00, Paul Libbrecht a écrit : I've been wandering around but I see no solution yet: I would like to intersect two query results: going through the list of one query and indicating which ones actually match the other query

Re: best way to interest two queries?

2010-05-11 Thread Paul Libbrecht
intended to use prefix and fuzzyqueries. I believe this is contradictory to this or? paul Le 11-mai-10 à 12:02, mark harwood a écrit : See https://issues.apache.org/jira/browse/LUCENE-1999 - Original Message From: Paul Libbrecht p...@activemath.org To: java-user@lucene.apache.org

an analyzer map at hand?

2010-04-26 Thread Paul Libbrecht
Hello Luceners, I am sure I'm not the only one having such a snippet in my dedicated analyzer: m.put("en", new SnowballAnalyzer("English")); m.put("es", new SnowballAnalyzer("Spanish")); m.put("de", new SnowballAnalyzer("German")); m.put("dk", new

Re: Designing a multilingual index

2010-04-02 Thread Paul Libbrecht
Le 01-avr.-10 à 16:29, henrib a écrit : By issuing multiple queries, one against each localized index, results being clustered by locale. You can further refine by translating the end-user input query terms for each locale and issue translated queries against the respective indices. I've

Re: Designing a multilingual index

2010-04-01 Thread Paul Libbrecht
How? paul On 1 Apr 2010, at 14:19, henrib wrote: Finally, query expansion can also be used in the multiple indices case and might even use automated/guided translation.

Re: Designing a multilingual index

2010-03-31 Thread Paul Libbrecht
David, I'm doing exactly that. And I think there's one crucial advantage aside: multilingual queries: if your user requests segment you have no way to know which language he is searching for; erm, well, you have the user-language(s) (through the browser Accept-Language header for example)

best way to interest two queries?

2010-03-31 Thread Paul Libbrecht
Hello list, I've been wandering around but I see no solution yet: I would like to intersect two query results: going through the list of one query and indicating which ones actually match the other query or, even better, indicating that passed this, nothing matches that query anymore.

Re: If you could have one feature in Lucene...

2010-02-24 Thread Paul Libbrecht
I would wish a highlighting feature that's fully integrated. paul On 24-févr.-10, at 14:42, Grant Ingersoll wrote: What would it be?

Re: lucene webinterface

2010-02-16 Thread Paul Libbrecht
On 16 Feb 2010, at 17:40, luciusvorenus wrote: how can I build a webinterface for my application ? I read something with HTML table and php but i had no idea? Can anybody help me? Lucius, try solr. paul

one of the terms

2010-01-29 Thread Paul Libbrecht
Hello luceners, In our project, we are building queries from long list of possible terms (expanded through ontology deduction). I would like, however, that the rank is unaffected by the number of matches: one or thirty occurrences of one of the many words should give the same score. Did

are Lucene queries thread-safe?

2010-01-23 Thread Paul Libbrecht
Hello list, for some strange reason I wish to cache very frequent (and big, ~3000 terms) queries. Now, this might mean that a query is searched for in several threads on the same index. Do I run a risk? thanks in advance paul

Re: a complete solution for building a website search with lucene

2010-01-08 Thread Paul Libbrecht
Zhou, Lucene is a back-end library, it's very useful for developer but it is not a complete site-search-engine. A lucene-based site-search-engine is Nutch, it does crawl. Solr also provides functions close to these with a large amount of thoughts on flexible integration; crawling methods

Re: [ANN] Luke 0.9.9 release

2009-10-23 Thread Paul Libbrecht
Because I like to have Luke always sitting at hand, I have packed this release as a MacOSX disk-image and application. http://www.activemath.org/~paul/tmp/Luke-0.9.9.dmg The icon could be better (I need a hires of Lucene's icon, haven't found it yet). Potentially the packaging

Re: Using org.apache.lucene.analysis.compound

2009-10-21 Thread Paul Libbrecht
Can the dictionary have weights? überwachungsgesetz alone probably needs a higher rank than überwachung and gesetz, right? paul On 21 Oct 2009, at 21:09, Benjamin Douglas wrote: OK, that makes sense. So I just need to add all of the sub-compounds that are real words at posIncr=0, even if

Re: Using org.apache.lucene.analysis.compound

2009-10-21 Thread Paul Libbrecht
, maxDocs=1) 1.6294457 = queryNorm 0.15342641 = (MATCH) fieldWeight(field:gesetz in 0), product of: 1.0 = tf(termFreq(field:gesetz)=1) 0.30685282 = idf(docFreq=1, maxDocs=1) 0.5 = fieldNorm(field=field, doc=0) On Wed, Oct 21, 2009 at 3:16 PM, Paul Libbrecht p

Re: OpenRelevance

2009-10-16 Thread Paul Libbrecht
Not something for the very near future, but I'd be interested in building on such an infrastructure for a mathematical-formulæ search corpus (both semantic and presentation math). I believe the OpenRelevance infrastructure might present a best practice or infrastructure to be based on for

Re: Review and questions about Lucene Java 2.9.0

2009-10-08 Thread Paul Libbrecht
Mehdi, your requirements sound to be fulfilled mostly by Apache Solr which is a web-based packaging of Lucene. paul. Le 08-oct.-09 à 10:11, Mehdi Ben Hamida a écrit : Hello, I'm reviewing and doing some researches on Lucene Java 2.9.0, to check if it meets our needs.

phonetic encoders for other languages?

2009-08-23 Thread Paul Libbrecht
Hello list, I will need to use phonetic analyzers to do phonetic search. I know of the Metaphone analyzers and use them but they're really only known to work for English. Does anyone have pointers to projects that encode phonetically words of other languages? I'm interested in French,

Re: phonetic encoders for other languages?

2009-08-23 Thread Paul Libbrecht
Le 23-août-09 à 17:05, Petite Abeille a écrit : I will need to use phonetic analyzers to do phonetic search. I know of the Metaphone analyzers and use them but they're really only known to work for English. Double Metaphone? http://en.wikipedia.org/wiki/Double_Metaphone thanks, I hadn't

Re: wheres the word

2009-06-25 Thread Paul Libbrecht
Le 25-juin-09 à 01:28, Mark Miller a écrit : im figgering about the following problem. in my index i cant find the word BE, but it exists in two documents. im usinglucene 2.4 with the standardanalyzer. other querys with words like de, et or de la works good. any ideas? be is a stopword. Do

Re: Lucene for the Mac

2009-06-08 Thread Paul Libbrecht
On 8 Jun 2009, at 23:55, Ian Vink wrote: Is there a Mac port of the Lucene engine? I don't get it, are you asking whether Lucene java works on MacOSX? answer is yes. Are you asking for a Cocoa and ObjC port? (don't know) paul

Re: Help Needed...

2009-05-28 Thread Paul Libbrecht
Kumar, you'll have to make your own documents after parsing the HTML yourself (e.g. with Nekohtml to dom). As for the weights of tokens, in addition to IDF, you can do that per field, i.e. when you add a field into the document. paul On 28 May 2009, at 12:22, Gaurav Kumar wrote:

Re: Servlets Sharing Resources

2009-04-21 Thread Paul Libbrecht
Various servlets or various webapps? Various servlets is trivial, indeed using ServletContext.getAttribute(). Various webapps is more difficult: - you need to set cross context so that context.getContext("/otherpath") is accessible (a config of context in tomcat) - you need classes to be shared

Re: Indexing Complex XML

2009-04-18 Thread Paul Libbrecht
daniel, have a look at solr DIH, it has prebuilt tools to do just that. http://wiki.apache.org/solr/DataImportHandler This is based on solr, which is a web-application that is based on lucene. It does not necessarily need to be run as a web application though, it can be embedded. paul Le

Re: semantic vectors

2009-04-06 Thread Paul Libbrecht
I am sorry Nitin, I may have caused your doubt about this... semantic-vectors is a project based on Lucene: http://code.google.com/p/semanticvectors/ you probably want to look there and ask questions on the forum there. paul On 6 Apr 2009, at 22:45, Richard Marr wrote: Hi Nitin,

Re: How to know the matched field?

2009-03-24 Thread Paul Libbrecht
there's TextFragment(stringbuffer) and the pass through the tokenizers but removing any of them breaks my unit test. I guess this is the whole idea behind LUCENE-1522 which I would take up later. paul On 23 Mar 2009, at 11:35, Paul Libbrecht wrote: Thanks Erick, I browsed but no full answer

Re: How to know the matched field?

2009-03-23 Thread Paul Libbrecht
... On Sun, Mar 22, 2009 at 4:30 PM, Paul Libbrecht p...@activemath.org wrote: in an auto-completion task, I would like to show to the user the field that's been matched against the query in the found document. Typically, my documents have multiple fields for each field-name and I would like

Re: Matching query terms

2009-03-23 Thread Paul Libbrecht
searcher.explain definitely seems to do the trick, going through the sub-queries. paul Le 23-mars-09 à 13:12, Wouter Heijke a écrit : I want to know for each term in a query if it matched the result or not. What is the best way to implement this? Highlighter seems to be able to do the

How to know the matched field?

2009-03-22 Thread Paul Libbrecht
Hello list, in an auto-completion task, I would like to show to the user the field that's been matched against the query in the found document. Typically, my documents have multiple fields for each field-name and I would like the index's findings to give me the field used. How can I do

robust inverse of query parser?

2009-03-20 Thread Paul Libbrecht
Hello luceners, query.toString() does a fair job at being reparsed by QueryParser but is there a safe way to do so? I have a lucene query object and want a string that QueryParser will reparse fairly exactly. thanks in advance paul

Re: lsi as indexing algorithm with lucene

2009-03-18 Thread Paul Libbrecht
Nitin, LSI is patented so there has not been a flurry of implementation attempts. However, SemanticVectors is a library that takes similar approaches to LSA/LSI for indexing and is based on Lucene's term-vectors. paul On 18 Mar 2009, at 07:09, nitin gopi wrote: hi all , has any body tried to

Lucene-contrib maven artifact id?

2009-03-17 Thread Paul Libbrecht
Hello Luceners, what is the official pom.xml fragment to be used for the contribs package of lucene? It seems to be only of type pom inside the maven repository... does it mean that I have to fetch sub-contribs ? paul

Re: underscore a word separator in StandardAnalyzer?

2009-03-15 Thread Paul Libbrecht
-09 à 00:03, Daniel Noll a écrit : Paul Libbrecht wrote: Hello fellows of Lucene, I just discovered that the _ character is a word separator in the StandardAnalyzer. Can it be? It broke our usage of a field that stores a comma-separated list of uri-fragments If I were analysing a URI, I

underscore a word separator in StandardAnalyzer?

2009-03-14 Thread Paul Libbrecht
Hello fellows of Lucene, I just discovered that the _ character is a word separator in the StandardAnalyzer. Can it be? It broke our usage of a field that stores a comma-separated list of uri-fragments which, of course, contain _: the standard-analyzer splits these as separate terms which

Re: Google finance-like suggestible search field

2009-01-14 Thread Paul Libbrecht
We have a suggestion engine and we only auto-complete from 3 characters (or a number). http://draft.i2geo.net/SearchI2G/skills-text-box-editor.jsp?language=en What would be nice for your case and maybe for ours is that this expansion done in PrefixQuery is made more explicit so that one

Re: Google finance-like suggestible search field

2009-01-14 Thread Paul Libbrecht
(sorry to respond to myself) Le 15-janv.-09 à 08:13, Paul Libbrecht a écrit : We have a suggestion engine and we only auto-complete from 3 characters (or a number). http://draft.i2geo.net/SearchI2G/skills-text-box-editor.jsp?language=en What would be nice for your case and maybe for ours

Re: Any way to ignore repeated terms in TF calculation?

2008-12-25 Thread Paul Libbrecht
Shouldn't your analyzer also convert Rochelle Rochelle to Rochelle ? paul Le 25-déc.-08 à 14:20, Israel Tsadok a écrit : A recurring problem I have with Lucene results is when a document contains the same word over and over again. If for some reason I have a document containing badger
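What such a conversion would do to the token stream can be sketched in plain Java. This is only an illustration of the idea, under the assumption that whitespace tokenization is good enough for the example; the class name `UniqueTokens` is made up, and a real solution would sit inside the analyzer chain as a token filter.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class UniqueTokens {
    // Keeps the first occurrence of each whitespace-separated token, in order.
    // Comparison is case-sensitive; lowercase first if that is not wanted.
    static String dedupe(String text) {
        Set<String> seen = new LinkedHashSet<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                seen.add(token);
            }
        }
        return String.join(" ", seen);
    }

    public static void main(String[] args) {
        System.out.println(dedupe("Rochelle Rochelle"));             // Rochelle
        System.out.println(dedupe("badger badger mushroom badger")); // badger mushroom
    }
}
```

With every token occurring at most once per field, the term frequency of each term is 1, so repetition no longer influences the score.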
