Re: Subset Matching
Hi Otmar, Shouldn't Occur.SHOULD alone do what you ask? Documents that match all terms in the query would be scored higher than documents that match fewer than all terms. -sujit On Fri, Mar 25, 2016 at 2:20 AM, Otmar Caduff wrote: > Hi all > In Lucene, I know of the possibility of Occur.SHOULD, Occur.MUST and the > “minimum should match” setting on the boolean query. > > Now, when querying, I want to > - (1) match the documents which either contain all the terms of the query > (Occur.MUST for all terms would do that) or, > - (2) if all terms for a given field of a document are a subset of the > query terms, that document should match as well. > > Any clue on how to accomplish this? > > Otmar >
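Sujit's suggestion can be sketched as follows (a minimal sketch assuming the Lucene 5.x-era API; the field and term names are illustrative). Every term is added as Occur.SHOULD, so documents matching more of the query terms score higher; setMinimumNumberShouldMatch can optionally be used to require some or all of them:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class SubsetQueryExample {
    // Build a query where each term is optional (SHOULD); docs matching
    // more terms rank higher under the default scoring.
    public static BooleanQuery build(String field, String[] terms) {
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        for (String t : terms) {
            builder.add(new TermQuery(new Term(field, t)),
                        BooleanClause.Occur.SHOULD);
        }
        // Uncomment to require that ALL terms be present (requirement (1)):
        // builder.setMinimumNumberShouldMatch(terms.length);
        return builder.build();
    }
}
```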
Re: Calculate the score of an arbitrary string vs a query?
Hi Ali, I agree with the others that there is no good way to do what you are looking for if you want to assign lucene-like scores to your external results, but if you have some objective measure of goodness that doesn't depend on your lucene scores, you can apply it to both result sets and merge them that way. One such measure could probably be the number of words in your query that you found in your title, or if you want to take the title length into consideration, the Jaccard similarity between the query words and title words. I once solved a slightly different (but related) problem using a somewhat different approach - mentioning it here in case it gives you some ideas. In my previous job we would concept map documents using our ontology - so each document could be thought of as a (weighted) bag of concepts - our concept search involved querying this bag of concepts. The indexing process was expensive, and we had just migrated to a new Java based annotation pipeline which assigned very different concept scores to documents, but which were intuitively more correct. However, whereas the old system assigned concept scores typically in the 20,000 range, our new system assigned scores to similar documents in the 100 range. We also had a set of huge indexes we had crawled with the old pipeline that would take us weeks/months to get done with the new pipeline, so we decided to merge results from our old index and newly crawled content (much smaller set) for a client. So I calculated the z-score (across all concepts) for both content sets and used that to rescale the concept scores of the old set to the new set. Although the underlying math was a bit sketchy, the merged results looked quite good. 
Hope this helps, -sujit On Fri, Apr 10, 2015 at 2:32 PM, Jack Krupansky jack.krupan...@gmail.com wrote: There is doc for tf*idf scoring in the javadoc: http://lucene.apache.org/core/5_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html The IndexSearcher#explain method returns an Explanation structure which details the scoring for a document: http://lucene.apache.org/core/5_0_0/core/org/apache/lucene/search/IndexSearcher.html#explain(org.apache.lucene.search.Query , int) -- Jack Krupansky On Fri, Apr 10, 2015 at 4:15 PM, Gregory Dearing gregdear...@gmail.com wrote: Hi Ali, The short answer to your question is... there's no good way to create a score from your result string, without using the Lucene index, that will be directly comparable to the Lucene score. The reason is that the score isn't just a function of the query and the contents of the document. It's also (usually) a function of the contents of the entire corpus... or rather how common terms are across the entire corpus. That being said... the default scoring algorithm is based on tf/idf. The implementation isn't in any one class... every query type (e.g. Term Query, Boolean Query, etc...) contains its own code for calculating scores. So the complete scoring formula will depend on the type of queries you're using. Many of those implementations also call into the Similarity API that you mentioned. If you'd like to see representative examples of scoring code, then take a look at TermWeight/TermScorer, and also BooleanWeight, which has several associated scorers. -Greg On Tue, Apr 7, 2015 at 1:32 AM, Ali Akhtar ali.rac...@gmail.com wrote: Hello, I'm in a situation where a search query string is being submitted simultaneously to Lucene, and to an external API. Results are fetched from both sources. I already have a score available for Lucene results, but I don't have a score for the results fetched from the external source. 
I'd like to calculate scores of results from the API, so that I can rank the results by the score, and show the top 5 results from both sources. (i.e., the results would be merged.) Is there any Lucene API method to which I can submit a search string and result string, and get a score back? If not, which class contains the source code for calculating the score, so that I can implement my own scoring class, using the same algorithm? I've looked at the Similarity class Javadocs, but it doesn't include any source code for calculating the score. Any help would be greatly appreciated. Thanks.
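The two merging aids Sujit mentions above — Jaccard similarity between query words and title words, and z-score rescaling of scores from incompatible scoring systems — can be sketched in plain Java (class and method names are illustrative):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class MergeScoring {
    // Jaccard similarity between query words and title words:
    // size(intersection) / size(union), in [0, 1].
    static double jaccard(Set<String> query, Set<String> title) {
        if (query.isEmpty() && title.isEmpty()) return 0.0;
        Set<String> inter = new HashSet<>(query);
        inter.retainAll(title);
        Set<String> union = new HashSet<>(query);
        union.addAll(title);
        return (double) inter.size() / union.size();
    }

    static Set<String> words(String s) {
        return new HashSet<>(Arrays.asList(s.toLowerCase().split("\\s+")));
    }

    // z-score of each score relative to its own collection: (x - mean) / std
    static double[] zScores(double[] values) {
        double mean = Arrays.stream(values).average().orElse(0.0);
        double variance = Arrays.stream(values)
                .map(v -> (v - mean) * (v - mean)).average().orElse(0.0);
        double std = Math.sqrt(variance);
        double[] z = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            z[i] = (std == 0.0) ? 0.0 : (values[i] - mean) / std;
        }
        return z;
    }

    // Project old-system scores onto the new system's scale, as in the
    // z-score merging trick described above.
    static double[] rescale(double[] oldScores, double newMean, double newStd) {
        double[] z = zScores(oldScores);
        double[] out = new double[z.length];
        for (int i = 0; i < z.length; i++) {
            out[i] = newMean + z[i] * newStd;
        }
        return out;
    }
}
```

Either measure can be computed for both result sets independently of Lucene's scores, which is what makes the merge sound.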
Re: Proximity query
I did something like this sometime back. The objective was to find patterns surrounding some keywords of interest so I could find keywords similar to the ones I was looking for, sort of like a poor man's word2vec. It uses SpanQuery as Jigar said, and you can find the code here (I believe it was written against Lucene 3.x so you may have to upgrade it if you are using Lucene 4.x): http://sujitpal.blogspot.com/2011/08/implementing-concordance-with-lucene.html -sujit On Thu, Feb 12, 2015 at 8:57 AM, Maisnam Ns maisnam...@gmail.com wrote: Hi Shah, Thanks for your reply. Will try to google SpanQuery meanwhile if you have some links can you please share Thanks On Thu, Feb 12, 2015 at 10:17 PM, Jigar Shah jigaronl...@gmail.com wrote: This concept is called Proximity Search in general. In Lucene they are achieved using SpanQuery. On Thu, Feb 12, 2015 at 10:10 PM, Maisnam Ns maisnam...@gmail.com wrote: Hi, Can someone help me if this use case is possible or not with lucene Use case: I have a string say 'Japan' appearing in 10 documents and I want to get back , say some results which contain two words before 'Japan' and two words after 'Japan' may be something like this ' Economy of Japan is growing' etc. If it is not possible where should I look for such queries Thanks
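The SpanQuery approach Jigar mentions can be sketched like this (a minimal sketch assuming the Lucene 4.x-era API; the field name and terms are illustrative, and assume an analyzer that lowercases):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class ProximityExample {
    // Matches documents where "economy" occurs within two positions of
    // "japan", in either order -- e.g. "Economy of Japan is growing".
    public static SpanQuery economyNearJapan() {
        return new SpanNearQuery(
            new SpanQuery[] {
                new SpanTermQuery(new Term("text", "economy")),
                new SpanTermQuery(new Term("text", "japan"))
            },
            2,      // slop: max positions allowed between the clauses
            false); // inOrder: false allows either order
    }
}
```

For the concordance use case (extracting the actual surrounding words rather than just matching), the spans returned by the query give the start/end positions needed to pull the context out of the term vector or stored text, as in the blog post linked above.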
Re: Case sensitivity
Hi John, Take a look at the PerFieldAnalyzerWrapper. As the name suggests, it allows you to create different analyzers per field. -sujit On Fri, Sep 19, 2014 at 6:50 AM, John Cecere john.cec...@oracle.com wrote: I've considered this, but there are two problems with it. First of all, it feels like I'm still taking up twice the storage, I'm just doing it using a single index rather than two of them. This doesn't sound like it's buying me anything. The second problem with this is simply that I haven't figured out how to do this. I assume in creating two fields you would implement two separate analyzers on them, one using LowerCaseFilter and the other not. I haven't made the connection on how to tie an Analyzer to a particular field. It seems to be tied to the IndexWriterConfig and the IndexWriter. Thanks, John On 9/19/14 9:36 AM, Paul Libbrecht wrote: two fields? paul On 19 sept. 2014, at 15:07, John Cecere john.cec...@oracle.com wrote: Is there a way to set up Lucene so that both case-sensitive and case-insensitive searches can be done without having to generate two indexes? -- John Cecere Principal Engineer - Oracle Corporation 732-987-4317 / john.cec...@oracle.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
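A minimal sketch of the PerFieldAnalyzerWrapper approach (assuming the Lucene 4.x-era API; field names are illustrative). This answers the question of how to tie an Analyzer to a particular field: the wrapper itself is what gets passed to the IndexWriterConfig, and it dispatches per field:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

public class CaseSensitivityAnalyzers {
    // "content" (the default) is lowercased by StandardAnalyzer;
    // "content_cs" keeps original case because its chain has no
    // LowerCaseFilter. Index the same text into both fields.
    public static Analyzer build() {
        Map<String, Analyzer> perField = new HashMap<>();
        perField.put("content_cs", new WhitespaceAnalyzer(Version.LUCENE_47));
        return new PerFieldAnalyzerWrapper(
            new StandardAnalyzer(Version.LUCENE_47), perField);
    }
}
```

On the storage concern: if both fields are indexed but only one (or neither) is stored, the overhead is an extra posting list, not a doubled index.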
Re: Quickest way to collect one field from the searched docs....
Hi Shouvik, not sure if you have already considered this, but you could put the database primary key for the record into the index - i.e., reverse your insert to do DB first, get the record_id and then add this to the Lucene index as a record_id field. During retrieval you can minimize the network traffic by setting the field list to only this record_id. -sujit On Thu, Sep 18, 2014 at 9:23 PM, Shouvik Bardhan sbard...@gisfederal.com wrote: Pardon the length of the question. I have an index with 100 million docs (lucene not solr) and term queries (A*, A AND B* type queries) return pretty quickly (2-4 secs) and I pick the lucene docIds up pretty quickly with a collector. This is good for us since we take the docIds and do further filtering based on another database we maintain whose record ids match with the stored lucene doc ids and we are able to do what we want. I know that depending on the lucene doc id value is not a good thing, since after delete/merge/optimize, the doc ids may change and if that was to happen, our other datastore will not line up with the lucene doc index and chaos will ensue. Thus we do not optimize the index etc. My question is what is the fastest way I can gather 1 field value from the docs which are found to match the query? Is there any way to do this as fast as (or at least not much slower) I am able to collect the lucene docids? I want to get away from depending on the lucene docids not changing if possible. Thanks for any suggestions.
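Loading just the stored record_id field per hit can be sketched as follows (a sketch assuming the Lucene 4.x-era API; the field name record_id comes from the suggestion above, the rest is illustrative). Passing a fieldsToLoad set to IndexSearcher.doc avoids materializing the whole document:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class RecordIdLookupExample {
    // Collect the stable external record_id for each hit instead of the
    // ephemeral Lucene docId.
    public static List<String> collect(IndexSearcher searcher, Query q, int n)
            throws IOException {
        List<String> recordIds = new ArrayList<>();
        TopDocs hits = searcher.search(q, n);
        for (ScoreDoc sd : hits.scoreDocs) {
            Document doc = searcher.doc(sd.doc,
                    Collections.singleton("record_id"));
            recordIds.add(doc.get("record_id"));
        }
        return recordIds;
    }
}
```

For very large result sets, doing the same field load inside a custom Collector (per segment, via the stored-fields reader) avoids the TopDocs round trip, but the idea is the same.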
Re: How to handle words that stem to stop words
Hi Arjen, This is kind of a spin on your last observation, that your list of stop words doesn't change frequently. You could have a custom filter that attempts to stem the incoming token and, only if it stems to the same form as a stop word, sets the keyword attribute on the original token. That way your reindex frequency is tied to the stop word change frequency, not to the frequency of discovering new words that stem to stop words. -sujit On Thu, Jul 10, 2014 at 11:57 AM, Arjen van der Meijden acmmail...@tweakers.net wrote: I'm reluctant to apply either solution: Emitting both tokens will likely still provide the user with a very long result list. Even though the results with 'vans' in it are likely to be ranked to the top, it's still not very user-friendly due to its overwhelmingly large number of results (nor is it very good for the performance of my application). In our specific case we also boost documents based on their age and popularity, so the extra results will probably interfere even if 'vans'-results are generally ranked higher. The approach with a list of specially treated terms is something we'll have to build and maintain by hand. Every time such a list is adjusted, it'll require a reindex of the database, which is not a huge problem but still not very practical. But I'm getting more and more convinced there isn't really a (reasonably easy) solution that would leave it dynamically changing without requiring database reindexes. Luckily the list of stop words shouldn't change that fast and we already have more than ten years' worth of data, so it should be fairly easy to build a list of terms that are stemmed into stop words. Best regards, Arjen On 7-7-2014 23:06 Tri Cao wrote: I think emitting two tokens for vans is the right (potentially only) way to do it. You could also control the dictionary of terms that require this special treatment. Any reason you are not happy with this approach?
On Jul 06, 2014, at 11:48 AM, Arjen van der Meijden acmmail...@tweakers.net wrote: Hello list, We have a fairly large Lucene database for a 30+ million post forum. Users post and search for all kinds of things. To make sure users don't have to type exact matches, we combine a WordDelimiterFilter with a (Dutch) SnowballFilter. Unfortunately users sometimes find examples of words that get stemmed to a word that's basically a stop word. Or reversely, where a very common word is stemmed so that it becomes the same as a rare word. We do index stop words, so theoretically they could still find their result. But when a rare word is stemmed in such a way it yields a million hits, that makes it very unusable... One example is the Dutch word 'van' which is the equivalent of 'of' in English. A user tried to search for the shoe brand 'vans', which gets stemmed to 'van' and obviously gives useless results. I already noticed the 'KeywordRepeatFilter' to index/search both 'vans' and 'van' and the StemmerOverrideFilter to try and prevent these cases. Are there any other solutions for these kinds of problems? Best regards, Arjen van der Meijden
Re: How to handle words that stem to stop words
Hi Arjen, You could also mark a token as keyword so the stemmer passes it through unchanged. For example, per the Javadocs for PorterStemFilter: http://lucene.apache.org/core/4_6_0/analyzers-common/org/apache/lucene/analysis/en/PorterStemFilter.html Note: This filter is aware of the KeywordAttribute http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/analysis/tokenattributes/KeywordAttribute.html?is-external=true. To prevent certain terms from being passed to the stemmer KeywordAttribute.isKeyword() http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/analysis/tokenattributes/KeywordAttribute.html?is-external=true#isKeyword() should be set to true in a previousTokenStream http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/analysis/TokenStream.html?is-external=true. Note: For including the original term as well as the stemmed version, see KeywordRepeatFilterFactory http://lucene.apache.org/core/4_6_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/KeywordRepeatFilterFactory.html Assuming your stemmer is also keyword attribute aware, you could build a filter that reads a list of words (such as vans) that should be protected from stemming and marks them with the KeywordAttribute before sending to the Porter stemmer and put it into your analysis chain. -sujit On Mon, Jul 7, 2014 at 2:06 PM, Tri Cao tm...@me.com wrote: I think emitting two tokens for vans is the right (potentially only) way to do it. You could also control the dictionary of terms that require this special treatment. Any reason makes you not happy with this approach? On Jul 06, 2014, at 11:48 AM, Arjen van der Meijden acmmail...@tweakers.net wrote: Hello list, We have a fairly large Lucene database for a 30+ million post forum. Users post and search for all kinds of things. To make sure users don't have to type exact matches, we combine a WordDelimiterFilter with a (Dutch) SnowballFilter. 
Unfortunately users sometimes find examples of words that get stemmed to a word that's basically a stop word. Or reversely, where a very common word is stemmed so that it becomes the same as a rare word. We do index stop words, so theoretically they could still find their result. But when a rare word is stemmed in such a way it yields a million hits, that makes it very unusable... One example is the Dutch word 'van' which is the equivalent of 'of' in English. A user tried to search for the shoe brand 'vans', which gets stemmed to 'van' and obviously gives useless results. I already noticed the 'KeywordRepeatFilter' to index/search both 'vans' and 'van' and the StemmerOverrideFilter to try and prevent these cases. Are there any other solutions for these kinds of problems? Best regards, Arjen van der Meijden
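The protected-terms filter Sujit describes can be sketched as follows (a sketch assuming the Lucene 4.x-era API; ProtectedTermFilter is a hypothetical name). It marks listed terms (such as "vans") with the KeywordAttribute, so that a keyword-aware stemmer later in the chain passes them through unchanged:

```java
import java.io.IOException;
import java.util.Set;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;

public final class ProtectedTermFilter extends TokenFilter {
    private final Set<String> protectedTerms;
    private final CharTermAttribute termAtt =
        addAttribute(CharTermAttribute.class);
    private final KeywordAttribute keywordAtt =
        addAttribute(KeywordAttribute.class);

    public ProtectedTermFilter(TokenStream in, Set<String> protectedTerms) {
        super(in);
        this.protectedTerms = protectedTerms;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        if (protectedTerms.contains(termAtt.toString())) {
            // A KeywordAttribute-aware stemmer (e.g. PorterStemFilter,
            // SnowballFilter) will leave this token unstemmed.
            keywordAtt.setKeyword(true);
        }
        return true;
    }
}
```

This filter would sit in the analysis chain just before the stemmer. The variant from the earlier reply in this thread, where the filter stems the token itself and only marks it if the result collides with a stop word, replaces the set lookup with a trial stemming step.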
Re: Securing stored data using Lucene
Hi Rafaela, I built something along these lines as a proof of concept. All data in the index was unstored and only fields which were searchable (tokenized and indexed) were kept in the index. The full record was encrypted and stored in a MongoDB database. A custom Solr component did the search against the index, gathered up unique ids of the results, then pulled out the encrypted data from MongoDB, decrypted it on the fly and rendered the results. You can find the (Scala) code here: https://github.com/sujitpal/solr4-extras (under the src/main/scala/com/mycompany/solr4extras/secure folder). More information (more or less the same as what I wrote but probably a bit more readable with inlined code): http://sujitpal.blogspot.com/2012/12/searching-encrypted-document-collection.html There are some obvious data sync concerns with this sort of setup, but as Adrien points out, you can't index encrypted data. HTH Sujit On Jun 25, 2013, at 4:17 AM, Adrien Grand wrote: On Tue, Jun 25, 2013 at 1:03 PM, Rafaela Voiculescu rafaela.voicule...@gmail.com wrote: Hello, Hi, I am sorry I was not a bit more explicit. I am trying to find an acceptable way to encrypt the data to prevent any access of it in any way unless the person who is trying to access it knows how to decrypt it. As I mentioned, I looked a bit through the patch, but I am not sure of its status. You can encrypt stored fields, but there is no way to do it correctly with fields that have positions indexed: attackers could infer the actual terms based on the order of terms (the encrypted version must sort the same way as the original terms), frequencies and positions. -- Adrien
Re: Payload Matching Query
Hi Michal, Instead of putting the annotation in Payloads, why not put them in as synonyms, i.e. at the same spot as the original string (see SynonymFilter in the LIA book). So your string would look like (to the index): W. A. Mozart __artist__ was born in Salzburg __city__ so you can query as s:__artist__ __city__~slop -sujit On Jun 20, 2013, at 9:27 AM, michal samek wrote: Hi Adrien, thanks for your reply. If payloads cannot be used for searching, is there any workaround how to achieve similar functionality? What I'd like to accomplish is to be able to search documents with contents for example W. A. Mozart[artist] was born in Salzburg[city] just by specifying the payloads [artist] [city]. Thanks Michal 2013/6/20 Adrien Grand jpou...@gmail.com Hi Michal, Although payloads can be used at query time to customize scoring, they can't be used for searching. Lucene only allows to search on terms. -- Adrien
Re: Statically store sub-collections for search (faceted search?)
Hi Uwe, Thanks for the info, I was under the impression that it didn't... I got this info (that filters don't have a limit because they are not scoring) from a document like the one below. Can't say this is the exact doc because its been a while since I saw that, though. http://searchhub.org/2009/06/08/bringing-the-highlighter-back-to-wildcard-queries-in-solr-14/ As a response to this performance pitfall on very large indices’s (and the infamous TooManyClauses exception), new queries were developed that relied on a new Query class called ConstantScoreQuery. ConstantScoreQuerys accept a filter of matching documents and then score with a constant value equal to the boost. Depending on the qualities of your index, this method can be faster than the Boolean expansion method, and more importantly, does not suffer from TooManyClauses exceptions. Rather than matching and scoring n BooleanQuery clauses (potentially thousands of clauses), a single filter is enumerated and then traveled for scoring. On the other hand, constructing and scoring with a BooleanQuery containing a few clauses is likely to be much faster than constructing and traveling a Filter. -sujit On Apr 15, 2013, at 1:04 AM, Uwe Schindler wrote: The limit also applies for filters. If you have a list of terms ORed together, the fastest way is not to use a BooleanQuery at all, but instead a TermsFilter (which has no limits). - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Carsten Schnober [mailto:schno...@ids-mannheim.de] Sent: Monday, April 15, 2013 9:53 AM To: java-user@lucene.apache.org Subject: Re: Statically store sub-collections for search (faceted search?) Am 12.04.2013 20:08, schrieb SUJIT PAL: Hi Carsten, Why not use your idea of the BooleanQuery but wrap it in a Filter instead? Since you are not doing any scoring (only filtering), the max boolean clauses limit should not apply to a filter. 
Hi Sujit, thanks for your suggestion! I wasn't aware that the max clause limit does not apply to a BooleanQuery wrapped in a filter. I suppose the ideal way would be to use a BooleanFilter but not a QueryWrapperFilter, right? However, I am also not sure how to apply a filter in my use case because I perform a SpanQuery. Although SpanQuery#getSpans() does take a Bits object as an argument (acceptDocs), I haven't been able to figure out how to generate this Bits object correctly from a Filter object. Best, Carsten -- Institut für Deutsche Sprache | http://www.ids-mannheim.de Projekt KorAP | http://korap.ids-mannheim.de Tel. +49-(0)621-43740789 | schno...@ids-mannheim.de Korpusanalyseplattform der nächsten Generation Next Generation Corpus Analysis Platform
Re: Statically store sub-collections for search (faceted search?)
Hi Uwe, I see, makes sense, thanks very much for the info. Sorry about giving you wrong info Carsten. -sujit On Apr 15, 2013, at 1:06 PM, Uwe Schindler wrote: Hi, Original Message- From: Sujit Pal [mailto:sujitatgt...@gmail.com] On Behalf Of SUJIT PAL Sent: Monday, April 15, 2013 9:43 PM To: java-user@lucene.apache.org Subject: Re: Statically store sub-collections for search (faceted search?) Hi Uwe, Thanks for the info, I was under the impression that it didn't... I got this info (that filters don't have a limit because they are not scoring) from a document like the one below. Can't say this is the exact doc because its been a while since I saw that, though. http://searchhub.org/2009/06/08/bringing-the-highlighter-back-to-wildcard- queries-in-solr-14/ As a response to this performance pitfall on very large indices’s (and the infamous TooManyClauses exception), new queries were developed that relied on a new Query class called ConstantScoreQuery. ConstantScoreQuerys accept a filter of matching documents and then score with a constant value equal to the boost. Depending on the qualities of your index, this method can be faster than the Boolean expansion method, and more importantly, does not suffer from TooManyClauses exceptions. Rather than matching and scoring n BooleanQuery clauses (potentially thousands of clauses), a single filter is enumerated and then traveled for scoring. On the other hand, constructing and scoring with a BooleanQuery containing a few clauses is likely to be much faster than constructing and traveling a Filter. This is true, but you misunderstood it: This is about MultiTermQueries (which is the superclass of WildcardQuery, Fuzzy-, and range queries). Those queries are no native Lucene queries, so they rewrite to basic/native queries. In earlier Lucene versions, Wildcards were always rewritten to BooleanQueries with many TermQueries (one for each term that matches the wildcard), leading to the problem with too many terms. 
This is still the case, but only within some limits (this mode is only used if the wildcard expands to few terms). Those BooleanQueries are then used with ConstantScoreQuery(Query). The above text talks about another mode (which is used for many terms today): *No* BooleanQuery is built at all; instead all matching terms' documents are marked in a BitSet and this BitSet is used with a Filter to construct a different Query type: ConstantScoreQuery(Filter). The BooleanQuery max clause count does not apply, because no BooleanQuery is involved in the whole process. If you use ConstantScoreQuery(BooleanQuery), the limit still applies, but not for ConstantScoreQuery(internalWildcardFilter). Uwe On Apr 15, 2013, at 1:04 AM, Uwe Schindler wrote: The limit also applies for filters. If you have a list of terms ORed together, the fastest way is not to use a BooleanQuery at all, but instead a TermsFilter (which has no limits). - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Carsten Schnober [mailto:schno...@ids-mannheim.de] Sent: Monday, April 15, 2013 9:53 AM To: java-user@lucene.apache.org Subject: Re: Statically store sub-collections for search (faceted search?) Am 12.04.2013 20:08, schrieb SUJIT PAL: Hi Carsten, Why not use your idea of the BooleanQuery but wrap it in a Filter instead? Since you are not doing any scoring (only filtering), the max boolean clauses limit should not apply to a filter. Hi Sujit, thanks for your suggestion! I wasn't aware that the max clause limit does not apply to a BooleanQuery wrapped in a filter. I suppose the ideal way would be to use a BooleanFilter but not a QueryWrapperFilter, right? However, I am also not sure how to apply a filter in my use case because I perform a SpanQuery. Although SpanQuery#getSpans() does take a Bits object as an argument (acceptDocs), I haven't been able to figure out how to generate this Bits object correctly from a Filter object.
Best, Carsten -- Institut für Deutsche Sprache | http://www.ids-mannheim.de Projekt KorAP | http://korap.ids-mannheim.de Tel. +49-(0)621-43740789 | schno...@ids-mannheim.de Korpusanalyseplattform der nächsten Generation Next Generation Corpus Analysis Platform
Re: Statically store sub-collections for search (faceted search?)
Hi Carsten, Why not use your idea of the BooleanQuery but wrap it in a Filter instead? Since you are not doing any scoring (only filtering), the max boolean clauses limit should not apply to a filter. -sujit On Apr 12, 2013, at 7:34 AM, Carsten Schnober wrote: Dear list, I would like to create a sub-set of the documents in an index that is to be used for further searches. However, the criteria that lead to the creation of that sub-set are not predefined so I think that faceted search cannot be applied my this use case. For instance: A user searches for documents that contain token 'A' in a field 'text'. These results form a set of documents that is persistently stored (in a database). Each document in the index has a field 'id' that identifies it, so these external IDs are stored in the database. Later on, a user loads the document IDs from the database and wants to execute another search on this set of documents only. However, performing a search on the full index and subsequently filtering the results against that list of documents takes very long if there are many matches. This is obvious as I have to retrieve the external id from each matching document and check whether it is part of the desired sub-set. Constructing a BooleanQuery in the style id:Doc1 OR id:Doc2 ... is not suitable either because there could be thousands of documents exceeding any limit for Boolean clauses. Any suggestions how to solve this? I would have gone for the Lucene document numbers and store them as a bit set that I could use as a filter during later searches, but I read that the document numbers are ephemeral. One possible way out seems to be to create another index from the documents that have matched the initial search, but this seems quite an overkill, especially if there are plenty of them... Thanks for any hint! Carsten -- Institut für Deutsche Sprache | http://www.ids-mannheim.de Projekt KorAP | http://korap.ids-mannheim.de Tel. 
+49-(0)621-43740789 | schno...@ids-mannheim.de Korpusanalyseplattform der nächsten Generation Next Generation Corpus Analysis Platform
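Uwe's TermsFilter suggestion elsewhere in this thread fits Carsten's use case directly: the stored external IDs become filter terms, with no maxClauseCount limit because no BooleanQuery is built. A hedged sketch (assuming the Lucene 4.x-era API; the "id" field name comes from the question):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.queries.TermsFilter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

public class SubsetFilterExample {
    // Restrict any query to the previously stored sub-collection,
    // identified by thousands of external document IDs.
    public static TopDocs searchSubset(IndexSearcher searcher, Query q,
                                       List<String> externalIds, int n)
            throws IOException {
        List<Term> terms = new ArrayList<>();
        for (String id : externalIds) {
            terms.add(new Term("id", id));
        }
        TermsFilter filter = new TermsFilter(terms);
        return searcher.search(q, filter, n);
    }
}
```

Because the filter matches on the stable external id field rather than Lucene's internal doc numbers, it survives deletes, merges and optimizes.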
Re: Accent insensitive analyzer
Hi Jerome, How about this one? http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ISOLatin1AccentFilterFactory Regards, Sujit On Mar 22, 2013, at 9:22 AM, Jerome Blouin wrote: Hello, I'm looking for an analyzer that allows performing accent insensitive search in latin languages. I'm currently using the StandardAnalyzer but it doesn't fulfill this need. Could you please point me to the one I need to use? I've checked the javadoc for the various analyzer packages but can't find one. Do I need to implement my own analyzer? Regards, Jerome
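ISOLatin1AccentFilter was later deprecated in favor of ASCIIFoldingFilter, which covers the same Latin-1 folding and more. A minimal custom-analyzer sketch (assuming the Lucene 4.x-era API; the version constant is illustrative):

```java
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class AccentInsensitiveAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName,
                                                     Reader reader) {
        Tokenizer source = new StandardTokenizer(Version.LUCENE_47, reader);
        TokenStream filter = new LowerCaseFilter(Version.LUCENE_47, source);
        // Folds accented characters to ASCII, e.g. "café" -> "cafe"
        filter = new ASCIIFoldingFilter(filter);
        return new TokenStreamComponents(source, filter);
    }
}
```

The same analyzer must be used at both index and query time so that "café" and "cafe" normalize to the same term.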
Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?
Hi Glen, I don't believe you can attach a single payload to multiple tokens. What I did for a similar requirement was to combine the tokens into a single underscore-delimited token and attach the payload to it. For example: The Big Bad Wolf huffed and puffed and blew the house of the Three Little Pigs down. Now assume Big Bad Wolf and Three Little Pigs are spans to which I would like to attach payloads. I run the tokens through a custom tokenizer that produces: The Big_Bad_Wolf$payload1 huffed and puffed and blew the house of the Three_Little_Pigs$payload2 down. In my case this makes sense, i.e. I can treat the span as a single unit. Not sure about your use case. HTH Sujit On Dec 13, 2012, at 2:08 PM, Glen Newton wrote: Cool! Sounds great! :-) Any pointers to a (Lucene) example that attaches a payload to a start..end span that is more than one token? thanks, -Glen On Thu, Dec 13, 2012 at 5:03 PM, Lance Norskog goks...@gmail.com wrote: I should not have added that note. The Opennlp patch gives a concrete example of adding an annotation to text. On 12/13/2012 01:54 PM, Glen Newton wrote: It is not clear this is exactly what is needed/being discussed. From the issue: We are also planning a Tokenizer/TokenFilter that can put parts of speech as either payloads (PartOfSpeechAttribute?) on a token or at the same position. This adds it to a token, not a span. 'same position' does not suggest it also records the end position. -Glen On Thu, Dec 13, 2012 at 4:45 PM, Lance Norskog goks...@gmail.com wrote: Parts-of-speech is available now, in the indexer. LUCENE-2899 adds OpenNLP to the LuceneSolr codebase. It does parts-of-speech, chunking and Named Entity Recognition. OpenNLP is an Apache project for natural-language processing. Some parts are in Solr that could be in Lucene. https://issues.apache.org/jira/browse/lucene-2899 On 12/12/2012 12:02 PM, Wu, Stephen T., Ph.D.
wrote: Is there any (preliminary) code checked in somewhere that I can look at, that would help me understand the practical issues that would need to be addressed? Maybe we can make this more concrete: what new attribute are you needing to record in the postings and access at search time? For example: - part of speech of a token. - syntactic parse subtree (over a span). - semantically normalized phrase (to canonical text or ontological code). - semantic group (of a span). - coreference link. stephen
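The underscore-joining trick described in the first reply above can be sketched as plain string handling, outside of any Lucene filter. `SpanCombiner` and `combineSpan` are illustrative names, not Lucene APIs:

```java
// Sketch of the span-combining idea from the reply above: join the tokens
// of a span with underscores and append the payload after a '$' delimiter,
// so that the whole span can be emitted (and matched) as a single token.
public class SpanCombiner {
    public static String combineSpan(String[] spanTokens, String payload) {
        return String.join("_", spanTokens) + "$" + payload;
    }
}
```

A custom tokenizer would emit `combineSpan(new String[]{"Big", "Bad", "Wolf"}, "payload1")`, i.e. `Big_Bad_Wolf$payload1`, in place of the three original tokens, and a downstream DelimitedPayloadTokenFilter-style step would peel the payload back off.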
Re: Scoring a document using LDA topics
Hi Stephen, We precompute a variant of P(z,d) during indexing, and do the first 3 steps. The resulting documents are ordered by payload score, which is basically z in our case. We don't currently care about P(t,z) but it seems like a good thing to have for disambiguation purposes. So anyway, I have never done what you are looking to do, but I guess the approach you have outlined would be the one you would use to do this, although there may be performance issues when you have a large number of topic matches. An alternative - since you need to know the P(t,z) (the probability of the terms in the query being in a particular topic), and each PayloadTermQuery in the BooleanQuery corresponds to a z (topic), perhaps you could boost each clause by P(t,z)? -sujit On Tue, 2011-11-29 at 10:50 -0500, Stephen Thomas wrote: Sujit, Thanks for your reply, and the link to your blog post, which was helpful and got me thinking about Payloads. I still have one more question. I need to be able to compute the Sim(query q, doc d) similarity function, which is defined below: Sim (query q, doc d) = sum_{t in q} sum_{z} P(t, z) * P(z, d) So, I'm guessing that the only way to do this is the following: - At index time, store the (flattened) topics as a payload for each document, as you suggest in your blog - At query time, find out which topics are in the query - Construct a BooleanQuery, consisting of one PayloadTermQuery per topic in the query - Search on the BooleanQuery. This essentially tells me which documents have the topics in the query - Iterate over the TopDocs returned by the search. For each doc, get the full payload, unflatten it, and use it to compute Sim(query q, doc d). - Reorder the results based on the Sim(query q, doc d) results. Is this the best way? I can't see a way to compute the Sim() metric at any other time, because in scorePayload(), we don't have access to the full payload, nor to the query. 
Thanks again, Steve On Mon, Nov 28, 2011 at 1:51 PM, Sujit Pal sujit@comcast.net wrote: Hi Stephen, We are doing something similar, and we store as a multifield with each document as (d,z) pairs where we store the z's (scores) as payloads for each d (topic). We have had to build a custom similarity which implements the scorePayload function. So to find docs for a given d (topic), we do a simple PayloadTermQuery and the docs come back in descending order of z. Simple boolean term queries also work. We turn off norms (in the ctor for the PayloadTermQuery) to get scores that are identical to the d values. I wrote about this sometime back...maybe this would help you. http://sujitpal.blogspot.com/2011/01/payloads-with-solr.html -sujit On Mon, 2011-11-28 at 12:29 -0500, Stephen Thomas wrote: List, I am trying to incorporate the Latent Dirichlet Allocation (LDA) topic model into Lucene. Briefly, the LDA model extracts topics (distribution over words) from a set of documents, and then represents each document with topic vectors. For example, documents could be represented as: d1 = (0, 0.5, 0, 0.5) d2 = (1, 0, 0, 0) This means that document d1 contains topics 2 and 4, and document d2 contains topic 1. I.e., P(z1, d1) = 0 P(z2, d1) = 0.5 P(z3, d1) = 0 P(z4, d1) = 0.5 P(z1, d2) = 1 P(z2, d2) = 0 ... Also, topics are represented by the probability that a term appears in that topic, so we also have a set of vectors: z1 = (0, 0, .02, ...) meaning that topic z1 does not contain terms 1 or 2, but does contain term 3. I.e., P(t1, z1) = 0 P(t2, z1) = 0 P(t3, z1) = .02 ... Then, the similarity between a query and a document is computed as: Sim (query q, doc d) = sum_{t in q} sum_{z} P(t, z) * P(z, d) Basically, for each term in the query, and each topic in existence, see how relevant that term is in that topic, and how relevant that topic is in the document. I've been thinking about how to do this in Lucene. Assume I already have the topics and the topic vectors for each document. 
I know that I need to write my own Similarity class that extends DefaultSimilarity. I need to override tf(), queryNorm(), coord(), and computeNorm() to all return a constant 1, so that they have no effect. Then, I can override idf() to compute the Sim equation above. Seems simple enough. However, I have a few practical issues: - Storing the topic vectors for each document. Can I store this in the index somehow? If so, how do I retrieve it later in my CustomSimilarity class? - Changing the Boolean model. Instead of only computing the similarity on a documents that contain any of the terms in the query (the default behavior), I need to compute the similarity on all of the documents. (This is the whole idea behind LDA: you don't need an exact term match for there to be a similarity.) I understand that this will result in a performance hit
Re: Scoring a document using LDA topics
Hi Stephen, We are doing something similar, and we store as a multifield with each document as (d,z) pairs where we store the z's (scores) as payloads for each d (topic). We have had to build a custom similarity which implements the scorePayload function. So to find docs for a given d (topic), we do a simple PayloadTermQuery and the docs come back in descending order of z. Simple boolean term queries also work. We turn off norms (in the ctor for the PayloadTermQuery) to get scores that are identical to the d values. I wrote about this sometime back...maybe this would help you. http://sujitpal.blogspot.com/2011/01/payloads-with-solr.html -sujit On Mon, 2011-11-28 at 12:29 -0500, Stephen Thomas wrote: List, I am trying to incorporate the Latent Dirichlet Allocation (LDA) topic model into Lucene. Briefly, the LDA model extracts topics (distribution over words) from a set of documents, and then represents each document with topic vectors. For example, documents could be represented as: d1 = (0, 0.5, 0, 0.5) d2 = (1, 0, 0, 0) This means that document d1 contains topics 2 and 4, and document d2 contains topic 1. I.e., P(z1, d1) = 0 P(z2, d1) = 0.5 P(z3, d1) = 0 P(z4, d1) = 0.5 P(z1, d2) = 1 P(z2, d2) = 0 ... Also, topics are represented by the probability that a term appears in that topic, so we also have a set of vectors: z1 = (0, 0, .02, ...) meaning that topic z1 does not contain terms 1 or 2, but does contain term 3. I.e., P(t1, z1) = 0 P(t2, z1) = 0 P(t3, z1) = .02 ... Then, the similarity between a query and a document is computed as: Sim (query q, doc d) = sum_{t in q} sum_{z} P(t, z) * P(z, d) Basically, for each term in the query, and each topic in existence, see how relevant that term is in that topic, and how relevant that topic is in the document. I've been thinking about how to do this in Lucene. Assume I already have the topics and the topic vectors for each document. I know that I need to write my own Similarity class that extends DefaultSimilarity. 
I need to override tf(), queryNorm(), coord(), and computeNorm() to all return a constant 1, so that they have no effect. Then, I can override idf() to compute the Sim equation above. Seems simple enough. However, I have a few practical issues: - Storing the topic vectors for each document. Can I store this in the index somehow? If so, how do I retrieve it later in my CustomSimilarity class? - Changing the Boolean model. Instead of only computing the similarity on documents that contain any of the terms in the query (the default behavior), I need to compute the similarity on all of the documents. (This is the whole idea behind LDA: you don't need an exact term match for there to be a similarity.) I understand that this will result in a performance hit, but I do not see a way around it. - Turning off fieldNorm(). How can I set the field norm for each doc to a constant 1? Any help is greatly appreciated. Steve
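Once the topic distributions are unflattened from the payloads, the Sim(q, d) formula from the thread is a direct double loop. A minimal sketch in plain Java, assuming the distributions are already in maps (`LdaSim` and its parameter names are hypothetical, not Lucene code):

```java
import java.util.Map;

// Direct computation of Sim(q, d) = sum_{t in q} sum_{z} P(t, z) * P(z, d).
// pTZ maps term -> (topic -> P(t, z)); pZD maps topic -> P(z, d) for one doc.
public class LdaSim {
    public static double sim(String[] queryTerms,
                             Map<String, Map<Integer, Double>> pTZ,
                             Map<Integer, Double> pZD) {
        double sim = 0.0;
        for (String t : queryTerms) {
            Map<Integer, Double> topicProbs = pTZ.get(t);
            if (topicProbs == null) continue;  // term appears in no topic
            for (Map.Entry<Integer, Double> e : topicProbs.entrySet()) {
                sim += e.getValue() * pZD.getOrDefault(e.getKey(), 0.0);
            }
        }
        return sim;
    }
}
```

In the scheme discussed above, this would run in the reordering step, after the BooleanQuery of PayloadTermQuery clauses has narrowed the candidate set.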
Re: Bet you didn't know Lucene can...
Hi Grant, Not sure if this qualifies as a bet you didn't know, but one could use Lucene term vectors to construct document vectors for similarity, clustering and classification tasks. I found this out recently (although I am probably not the first one), and I think this could be quite useful. -sujit On Sat, 2011-10-22 at 11:11 +0200, Grant Ingersoll wrote: Hi All, I'm giving a talk at ApacheCon titled Bet you didn't know Lucene can... (http://na11.apachecon.com/talks/18396). It's based on my observation, that over the years, a number of us in the community have done some pretty cool things using Lucene that don't fit under the core premise of full text search. I've got a fair number of ideas for the talk (easily enough for 1 hour), but I wanted to reach out to hear your stories of ways you've (ab)used Lucene and Solr to see if we couldn't extend the conversation to a bit more than the conference and also see if I can't inject more ideas beyond the ones I have. I don't need deep technical details, but just high level use case and the basic insight that led you to believe Lucene could solve the problem. Thanks in advance, Grant Grant Ingersoll http://www.lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: How do you see if a tokenstream has tokens without consuming the tokens ?
Hi Paul, Since you have modified the StandardAnalyzer (I presume you mean StandardFilter), why not do a check on the term.text() and if it's all punctuation, skip the analysis for that term? Something like this in your StandardFilter: public final boolean incrementToken() throws IOException { CharTermAttribute ta = getAttribute(CharTermAttribute.class); if (isAllPunctuation(ta.buffer())) { return true; } else { ... normal processing here } } If the filters are made keyword attribute aware (I have a bug open on this, LUCENE-3236, although I only asked for Lowercase and Stop filters in there), then it's even simpler: you can plug in your own filter that marks the term as a KeywordAttribute so downstream filters pass it through. -sujit On Mon, 2011-10-17 at 13:12 +0100, Paul Taylor wrote: We have a modified version of a Lucene StandardAnalyzer, we use it for tokenizing music metadata such as artist names and song titles, so typically only a few words. On tokenizing it usually strips out punctuation which is correct, however if the input text consists of only punctuation characters then we end up with nothing; for these particular RARE cases I want to use a mapping filter. So what I try to do is have my analyzer tokenize as normal, then if the result is no tokens, retokenize with the mapping filter. I check it has no token using incrementToken() but then can't see how I decrementToken(). How can I do this, or is there a more efficient way of doing this? Note of maybe 10,000,000 records only a few 100 records will have this problem so I need a solution which doesn't impact performance unreasonably. NormalizeCharMap specialcharConvertMap = new NormalizeCharMap(); specialcharConvertMap.add("!", "Exclamation"); specialcharConvertMap.add("?", "QuestionMark"); ... 
public TokenStream tokenStream(String fieldName, Reader reader) { CharFilter specialCharFilter = new MappingCharFilter(specialcharConvertMap,reader); StandardTokenizer tokenStream = new StandardTokenizer(LuceneVersion.LUCENE_VERSION); try { if(tokenStream.incrementToken()==false) { tokenStream = new StandardTokenizer(LuceneVersion.LUCENE_VERSION, specialCharFilter); } else { //TODO set tokenstream back as it was before increment token } } catch(IOException ioe) { } TokenStream result = new LowercaseFilter(result); return result; } thanks for any help Paul
Re: Is there any Query in Lucene can search the term, which is similar as SQL-LIKE?
Hi Mead, You may want to check out the permuterm index idea. http://www-nlp.stanford.edu/IR-book/html/htmledition/permuterm-indexes-1.html Basically you write a custom filter that takes a term and generates all rotations of it. On the query side, you convert your query so it's always a prefix query, by rotating the characters so * is always at the end, and match against the permuterm indexed field. I have a simple (and currently incomplete) working implementation (it works with queries such as *keyword, keyword*, key*rd and *keyword*, but only a single * and no ?, unlike WildcardQuery). But because it's always a prefix query internally, it does not have the performance penalty of a leading * in WildcardQuery. Maybe it will give you some ideas... http://sujitpal.blogspot.com/2011/10/lucene-wildcard-query-and-permuterm.html -sujit On Thu, 2011-10-13 at 10:10 +0800, Mead Lai wrote: Thank you very much. With your help, finally, I use WildcardQuery to find the right result: BooleanQuery resultQuery = new BooleanQuery(); resultQuery.add(new WildcardQuery(new Term("content", "*keyword*"))); TopDocs topDocs = searcher.search(resultQuery, 1000); But there is also a problem puzzling me: the result can only get 1000 items, which is not enough. I want to have the entire/whole set of items which match that condition (*keyword*). OR, may I put a date condition in the query, e.g: select * from table where start_date = 2011-10-12 Regards, Mead On Tue, Oct 11, 2011 at 11:39 PM, Chris Lu chris...@gmail.com wrote: You need to analyze the search keyword with the same analyzer that's applied on the content field. 
-- Chris Lu - Instant Scalable Full-Text Search On Any Database/Application site: http://www.dbsight.net demo: http://search.dbsight.com Lucene Database Search in 3 minutes: http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes On Tue, Oct 11, 2011 at 12:11 AM, Mead Lai laiqi...@gmail.com wrote: Hello all, *Background: *There are *ONE MILLION* data in a table, and this table has 100 columns inside. The application need to search the data in EVERY column with one 'keyword'. so, I try it in a clumsy way, using a database view, then search the view. Just like the following SQL: *=Step1*: create a view. CREATE OR REPLACE VIEW V_MY_VIEW(id,title,content) as SELECT mv.l_instanceid,mv.c_param1,mv.c_param2||';'||mv.c_param3||';'||mv.c_param4||';'||mv.c_param5||';'||mv.c_param6||';'||mv.c_param7||';'||mv.c_param8||';'||mv.c_param9||';'||mv.c_param10||';'||mv.c_param11||';'||mv.c_param12||';'||mv.c_param13||';'||mv.c_param14||';'||mv.c_param15||';'||mv.c_param16||';'||mv.c_param17||';'||mv.c_param18||';'||mv.c_param19||';'||mv.c_param20||';'||mv.c_param21||';'||mv.c_param22||';'||mv.c_param23||';'||mv.c_param24||';'||mv.c_param25||';'||mv.c_param26||';'||mv.c_param27||';'||mv.c_param28||';'||mv.c_param29||';'||mv.c_param30||';'||mv.c_param31||';'||mv.c_param32||';'||mv.c_param33||';'||mv.c_param34||';'||mv.c_param35||';'||mv.c_param36||';'||mv.c_param37||';'||mv.c_param38||';'||mv.c_param39||';'||mv.c_param40||';'||mv.c_param41||';'||mv.c_param42||';'||mv.c_param43||';'||mv.c_param44||';'||mv.c_param45||';'||mv.c_param46||';'||mv.c_param47||';'||mv.c_param48||';'||mv.c_param49||';'||mv.c_param50||';'||mv.c_param51||';'||mv.c_param52||';'||mv.c_param53||';'||mv.c_param54||';'||mv.c_param55||';'||mv.c_param56||';'||mv.c_param57||';'||mv.c_param58||';'||mv.c_param59||';'||mv.c_param60||';'||mv.c_param61||';'||mv.c_param62||';'||mv.c_param63||';'||mv.c_param64||';'||mv.c_param65||';'||mv.c_param66||';'||mv.c_param67||';'||mv.c_param68||';'||mv.c_param
69||';'||mv.c_param70||';'||mv.c_param71||';'||mv.c_param72||';'||mv.c_param73||';'||mv.c_param74||';'||mv.c_param75||';'||mv.c_param76||';'||mv.c_param77||';'||mv.c_param78||';'||mv.c_param79||';'||mv.c_param80||';'||mv.c_param81||';'||mv.c_param82||';'||mv.c_param83||';'||mv.c_param84||';'||mv.c_param85||';'||mv.c_param86||';'||mv.c_param87||';'||mv.c_param88||';'||mv.c_param89||';'||mv.c_param90||';'||mv.c_param91||';'||mv.c_param92||';'||mv.c_param93||';'||mv.c_param94||';'||mv.c_param95||';'||mv.c_param96||';'||mv.c_param97||';'||mv.c_param98||';'||mv.c_param99||';'||mv.c_param100||';' FROM MyTable mv *=Step2*: search the view with LIKE '%keyword%' SELECT * FROM V_MY_VIEW wcv WHERE wcv.content LIKE '%keyword%' Finally, it works nice, but inefficiency, almost cost 5~7 seconds. cos ONE MILLION rows are tooo huge. *Lucene way:* So, I use the Lucene to store these ONE MILLION data, code:document.add(new Field(content, content, Store.YES, Index.ANALYZED));//variable content, is the strings which jointed from the 100 columns The problem is that: if some keyword is not a word or a term, the search will return nothing. Usually, the keyword would be a person's name
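The permuterm rotation scheme referenced in the reply above (the IR-book link) can be sketched in a few lines. `Permuterm`, `rotations`, and `toPrefix` are illustrative names, not from the linked implementation, and this handles only a single `*`:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the permuterm idea: append an end marker '$' to each term and
// index every rotation of it; at query time, rotate the wildcard query so
// '*' falls at the end, which turns any single-'*' query into a prefix query.
public class Permuterm {
    // All rotations of term + "$", to be indexed alongside the term.
    public static List<String> rotations(String term) {
        String s = term + "$";
        List<String> rots = new ArrayList<String>();
        for (int i = 0; i < s.length(); i++) {
            rots.add(s.substring(i) + s.substring(0, i));
        }
        return rots;
    }

    // Rotate a single-'*' wildcard query into a prefix: "key*rd" -> "rd$key".
    public static String toPrefix(String query) {
        int star = query.indexOf('*');
        return query.substring(star + 1) + "$" + query.substring(0, star);
    }
}
```

For example, `rotations("keyword")` includes `rd$keywo`, which is matched by the prefix `toPrefix("key*rd")` = `rd$key`, so a plain PrefixQuery over the permuterm field answers the wildcard query.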
Payload Query and Document Boosts
Hi, Question about Payload Query and Document Boosts. We are using Lucene 3.2 and Payload queries, with our own PayloadSimilarity class which overrides the scorePayload method like so: {code} @Override public float scorePayload(int docId, String fieldName, int start, int end, byte[] payload, int offset, int length) { if (payload != null) { return PayloadHelper.decodeFloat(payload, offset); } else { return 1.0F; } } {/code} We are injecting payloads as ID$SCORE pairs using the DelimitedPayloadTokenFilter, and life was good - when we ran PayloadTermQuery the scores came back as our score. I have included code below that illustrates the calling pattern; it's this: {code} PayloadTermQuery q = new PayloadTermQuery(new Term("imuids_p", "2790926"), new AveragePayloadFunction(), false); {/code} ie, do not include the span score (the SCORE is calculated as a result of offline processing and we can't change that value). Now we would like to boost each document differently (index time, document.setBoost(boost), based on its content type), and we are running into problems. It looks like the document boost is not applied to the document score during search if includeSpanScore==false. When we set it to true, we see a difference in scores (the original score without document boosts is multiplied by the document boost set), but the original score without boost is not the same as SCORE, ie it's now affected by the span score. My question is - is there some method in DefaultSimilarity that I can override so that my score is my original SCORE * document boost? The Similarity documentation does not provide any clues to my problem - I tried modifying the computeNorm() method to return state.getBoost() but it looks like it's never called. If not, the other option would be to bake the doc boost into the SCORE value, by multiplying them on their way into lucene, so that SCORE *= doc boost. 
Here is my unit test which illustrates the issue: {code} import java.io.Reader; import java.util.HashMap; import java.util.Map; import org.apache.lucene.analysis.Analyzer; import org.apache.lucene.analysis.PerFieldAnalyzerWrapper; import org.apache.lucene.analysis.TokenStream; import org.apache.lucene.analysis.WhitespaceTokenizer; import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter; import org.apache.lucene.analysis.payloads.FloatEncoder; import org.apache.lucene.document.Document; import org.apache.lucene.document.Field; import org.apache.lucene.document.Field.Index; import org.apache.lucene.document.Field.Store; import org.apache.lucene.index.IndexWriter; import org.apache.lucene.index.IndexWriterConfig; import org.apache.lucene.index.Term; import org.apache.lucene.index.IndexWriterConfig.OpenMode; import org.apache.lucene.queryParser.QueryParser; import org.apache.lucene.search.IndexSearcher; import org.apache.lucene.search.Query; import org.apache.lucene.search.ScoreDoc; import org.apache.lucene.search.payloads.AveragePayloadFunction; import org.apache.lucene.search.payloads.PayloadTermQuery; import org.apache.lucene.store.RAMDirectory; import org.junit.Test; import com.healthline.query.kb.ConceptAnalyzer; import com.healthline.solr.HlSolrConstants; import com.healthline.solr.search.PayloadSimilarity; import com.healthline.util.Config; public class DocBoostTest { private class PayloadAnalyzer extends Analyzer { @Override public TokenStream tokenStream(String fieldName, Reader reader) { TokenStream tokens = new WhitespaceTokenizer(HlSolrConstants.CURRENT_VERSION, reader); tokens = new DelimitedPayloadTokenFilter(tokens, '$', new FloatEncoder()); return tokens; } }; private Analyzer getAnalyzer() { Map<String,Analyzer> pfas = new HashMap<String,Analyzer>(); pfas.put("imuids_p", new PayloadAnalyzer()); PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper( new ConceptAnalyzer(), pfas); return analyzer; } private IndexSearcher 
loadTestData(boolean setBoosts) throws Exception { RAMDirectory ramdir = new RAMDirectory(); IndexWriterConfig iwconf = new IndexWriterConfig( HlSolrConstants.CURRENT_VERSION, getAnalyzer()); iwconf.setOpenMode(OpenMode.CREATE); IndexWriter writer = new IndexWriter(ramdir, iwconf); Document doc1 = new Document(); doc1.add(new Field("itemtitle", "Cancer and the Nervous System PARANEOPLASTIC DISORDERS", Store.YES, Index.ANALYZED)); doc1.add(new Field("imuids_p", "2790917$52.01 2790926$53.18", Store.YES, Index.ANALYZED)); doc1.add(new Field("contenttype", "BK", Store.YES, Index.NOT_ANALYZED)); if (setBoosts) doc1.setBoost(1.2F); writer.addDocument(doc1); Document doc2 = new Document(); doc2.add(new Field("itemtitle", "Esophagogastric cancer: Targeted agents", Store.YES, Index.ANALYZED)); doc2.add(new Field("imuids_p", "2790926$52.18 2790981$5.19", Store.YES, Index.ANALYZED)); doc2.add(new Field("contenttype", "JL", Store.YES, Index.NOT_ANALYZED));
Re: How can i index a Java Bean into Lucene application ?
Depending on what you wanted to do with the Javabean (I assume you want to make some or all of its fields searchable since you are writing to Lucene), you could use reflection to break it up into field name/value pairs and write them out to the IndexWriter using something like this: Document d = new Document(); d.add(new Field("fieldname1", "fieldvalue1", Store.YES, Index.ANALYZED)); ... writer.addDocument(d); -sujit On Sat, 2011-08-06 at 18:28 +0530, KARTHIK SHIVAKUMAR wrote: Hi How can i index a Java Bean into Lucene application ? instead of a file API : IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR), new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED); Is there any alternate for the same . ex: package com.web.beans.searchdata; public class SearchIndexHtmlData { public String CONTENT = "NA"; public String DATEOFCREATION = "NA"; public String DATEOFINDEXCREATION = "NA"; public String getCONTENT() { return CONTENT; } public void setCONTENT(String cONTENT) { CONTENT = cONTENT; } public String getDATEOFCREATION() { return DATEOFCREATION; } public void setDATEOFCREATION(String dATEOFCREATION) { DATEOFCREATION = dATEOFCREATION; } public String getDATEOFINDEXCREATION() { return DATEOFINDEXCREATION; } public void setDATEOFINDEXCREATION(String dATEOFINDEXCREATION) { DATEOFINDEXCREATION = dATEOFINDEXCREATION; } }
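The reflection step suggested in the reply above can be sketched in plain Java: walk the bean's public getters and collect (name, value) pairs, which could then be turned into Lucene Fields and added to a Document. `BeanFields` and `toFields` are illustrative names, not Lucene or Commons APIs:

```java
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;

// Collect (field name, value) pairs from a bean's String-returning getters,
// e.g. getCONTENT() -> key "CONTENT". Getters that cannot be invoked are
// silently skipped; non-String getters are ignored for simplicity.
public class BeanFields {
    public static Map<String, String> toFields(Object bean) {
        Map<String, String> fields = new HashMap<String, String>();
        for (Method m : bean.getClass().getMethods()) {
            if (m.getName().startsWith("get")
                    && m.getParameterCount() == 0
                    && m.getReturnType() == String.class) {
                try {
                    fields.put(m.getName().substring(3), (String) m.invoke(bean));
                } catch (ReflectiveOperationException e) {
                    // skip getters we cannot invoke
                }
            }
        }
        return fields;
    }
}
```

Each entry of the returned map would become one `new Field(name, value, Store.YES, Index.ANALYZED)` on the Document.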
Re: Suggestion: make some more TokenFilters KeywordAttribute aware
Thanks Simon, I have opened a JIRA and attached a patch. I have verified that I haven't broken anything, and I have used these patched files in my local application and verified that they work. https://issues.apache.org/jira/browse/LUCENE-3236 -sujit On Thu, 2011-06-23 at 08:21 +0200, Simon Willnauer wrote: On Wed, Jun 22, 2011 at 8:53 PM, Sujit Pal s...@healthline.com wrote: Hello, I am currently in need of a LowerCaseFilter and StopFilter that will recognize KeywordAttribute, similar to the way PorterStemFilter currently does (on trunk). Specifically, in case the term is a KeywordAttribute.isKeyword(), it should not lowercase or remove it, respectively. This can be achieved without breaking backward compatibility by introducing an extra constructor which takes a boolean ignoreKeyword parameter. If this sounds like a good idea, please let me know; I can open a JIRA and attach a patch. Currently, I have created my own versions of KeywordAwareXXX filters that do pretty much the same thing. I think you should open an issue and take it from there. I can't promise this is going to be added but it's worth a try! please go ahead and open an issue. simon Thanks Sujit
Suggestion: make some more TokenFilters KeywordAttribute aware
Hello, I am currently in need of a LowerCaseFilter and StopFilter that will recognize KeywordAttribute, similar to the way PorterStemFilter currently does (on trunk). Specifically, in case the term is a KeywordAttribute.isKeyword(), it should not lowercase or remove it, respectively. This can be achieved without breaking backward compatibility by introducing an extra constructor which takes a boolean ignoreKeyword parameter. If this sounds like a good idea, please let me know; I can open a JIRA and attach a patch. Currently, I have created my own versions of KeywordAwareXXX filters that do pretty much the same thing. Thanks Sujit
Re: Passage retrieval with Lucene-based application
Hi Leroy, Would it make sense to index as Lucene documents the unit to be searched? So if you want paragraphs to be shown in search results, you could parse the source document during indexing into paragraphs and index them as separate Lucene documents. -sujit On Wed, 2011-05-25 at 15:46 -0400, Leroy Stone wrote: Hello! I purchased Lucene in Action, 2nd Ed., and posted the question below at the Manning Forum. Mike McCandless suggested that I send it to you. Thanks in advance for your attention. the question I posted ___ I would like the search program to return segments of a document (paragraphs) that contain my search phrase, rather than simply pointers to the whole document. In searching among applications based upon Lucene, I have found only one that seems to have this functionality. It is at http://www.crosswire.org/bibledesktop/ . Can someone point me to some other Lucene-based applications where the search engine returns text segments from within documents? Thanks in advance. N.B. I know Lucene can be modified to do what I wish. My problem is that my professional obligations do not allow the time for me to build the entire application that I need. Thus I am searching for one that exists already, that I can adapt quickly, and which has all the code with which I must surround Lucene to make a full-blown application. The Bible application I cite requires preprocessing of the documents into SWORD format. I will try that route if that is all that is available. I thought I would look around (with your help) before trying to take on the SWORD-format issue. Thanks.
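The paragraph-per-document suggestion above only needs a splitter at indexing time. A minimal sketch, assuming paragraphs are separated by blank lines (`ParagraphSplitter` is an illustrative name):

```java
// Split source text into paragraphs on blank lines, so each paragraph can
// be indexed as its own Lucene document (typically with an extra stored
// field pointing back to the parent document and the paragraph's position).
public class ParagraphSplitter {
    public static String[] split(String text) {
        return text.trim().split("\\n\\s*\\n");
    }
}
```

At search time the hits are then paragraphs, and the parent-document field lets the application group or deduplicate them.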
Re: FastVectorHighlighter - can FieldFragList expose fragInfo?
Thank you Koji. I opened LUCENE-3141 for this. https://issues.apache.org/jira/browse/LUCENE-3141 -sujit On Tue, 2011-05-24 at 22:33 +0900, Koji Sekiguchi wrote: (11/05/24 3:28), Sujit Pal wrote: Hello, My version: Lucene 3.1.0 I've had to customize the snippet for highlighting based on our application requirements. Specifically, instead of the snippet being a set of relevant fragments in the text, I need it to be the first sentence where a match occurs, with a fixed size from the beginning of the sentence. For this, I built (in my application code, using Lucene jars) a custom FragmentsBuilder (subclassing SimpleFragmentsBuilder and overriding the createFragment(IndexReader reader, int docId, String fieldName, FieldFragList fieldFragList) method). However, the FieldFragList does not allow access to the List<WeightedFragInfo> member variable. I changed this locally to be public so my subclass can access it, ie: public List<WeightedFragInfo> fragInfos = new ArrayList<WeightedFragInfo>(); Once this is done, my createFragment method can get at the fragInfos from the passed-in fieldFragList, iterate through its WeightedFragInfo.SubInfo.Toffs to get the term offsets, which I then use to calculate and highlight my snippet (I can provide the code if it makes things clearer, but that's the gist). So my question is - would it be feasible to make the FieldFragList.fragInfos variable public in a future release? No. Please open a jira ticket and attach a patch, if possible. I'll take a look. koji
FastVectorHighlighter - can FieldFragList expose fragInfo?
Hello, My version: Lucene 3.1.0 I've had to customize the snippet for highlighting based on our application requirements. Specifically, instead of the snippet being a set of relevant fragments in the text, I need it to be the first sentence where a match occurs, with a fixed size from the beginning of the sentence. For this, I built (in my application code, using Lucene jars) a custom FragmentsBuilder (subclassing SimpleFragmentsBuilder and overriding the createFragment(IndexReader reader, int docId, String fieldName, FieldFragList fieldFragList) method). However, the FieldFragList does not allow access to the List<WeightedFragInfo> member variable. I changed this locally to be public so my subclass can access it, ie: public List<WeightedFragInfo> fragInfos = new ArrayList<WeightedFragInfo>(); Once this is done, my createFragment method can get at the fragInfos from the passed-in fieldFragList, iterate through its WeightedFragInfo.SubInfo.Toffs to get the term offsets, which I then use to calculate and highlight my snippet (I can provide the code if it makes things clearer, but that's the gist). So my question is - would it be feasible to make the FieldFragList.fragInfos variable public in a future release? If not, is there some other way that I should do what I need to do? Thanks very much, Sujit
Re: Reg: Query behavior
Hi Deepak,

Would something like this work in your case?

"Arcos Bioscience"^2.0 Arcos Bioscience

ie, a BooleanQuery with the full phrase boosted, OR'd with a query on
each word?

-sujit

On Tue, 2011-04-26 at 14:46 -0400, Deepak Konidena wrote:
> Hi,
>
> Currently when I type in Arcos Bioscience in my lucene search, it
> returns all those documents with either Arcos or Bioscience at the top
> of the search results, and the actual document containing Arcos
> Bioscience somewhere in the middle/bottom. The desired behavior is to
> rank those documents that contain the terms Arcos and Bioscience next
> to each other higher than those that contain either of the terms, or
> contain both the terms but far away from each other.
>
> When I search the same term with quotes, "Arcos Bioscience", it gives
> the exact document that contains the term and nothing else. In
> general, how would I modify the system so that the documents
> containing the exact term are shown first, and the documents matching
> either term are shown later (without just showing one result)?
>
> Thanks
> Deepak Konidena.
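The suggested rewrite can be sketched as a small helper that turns the raw user query into the boosted-phrase-plus-terms form, suitable for handing to Lucene's QueryParser. The boost value 2.0 is illustrative, not tuned:

```java
// Minimal sketch of the query rewrite suggested above: OR the exact
// phrase (boosted) with the individual terms, so exact-phrase matches
// rank first while partial matches still appear further down.
public class PhraseBoostRewriter {

    public static String rewrite(String userQuery) {
        // "Arcos Bioscience" -> "\"Arcos Bioscience\"^2.0 Arcos Bioscience"
        return "\"" + userQuery + "\"^2.0 " + userQuery;
    }
}
```

The rewritten string would then be parsed with the default OR operator, so documents matching only one term still match, but the boosted phrase clause pushes exact matches to the top.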
Re: Searching partial names using Lucene
I don't know if there is already an analyzer available for this, but you
could use GATE or UIMA for Named Entity Extraction against names and
expand the query to include the extra names that are used synonymously.
You could do this outside Lucene, or inline using a custom Lucene
tokenizer that embeds either a GATE or UIMA NER.

If you go the custom route (and you are not familiar with GATE or
UIMA), you may want to take a look at Dr Manu Konchady's book on
LingPipe, Lucene and GATE - there is code in there to embed a GATE NER
into a Lucene tokenizer (although it's not a streaming tokenizer due to
the nature of the NER process). The process would be similar for
embedding a UIMA NER.

GATE (ANNIE) contains data files that list the common synonyms (eg.
Bill == William, Bob == Robert, Tom == Thomas, etc) which you can
leverage with GATE's JAPE rule language. Alternatively, you could use
the same data from UIMA using a custom analysis engine (I prefer this
route because this is all Java - easier learning curve and
maintainability).

-sujit

On Thu, 2011-03-24 at 14:31 -0400, Deepak Konidena wrote:
> Hi,
>
> I would like to build a search system where a search for Dan would
> also search for Daniel, and a search for Will, William. Any ideas on
> how to go about implementing that? I can think of writing a custom
> Analyzer that would map these partial tokens to their full firstname
> or lastnames. But is there an Analyzer in lucene contrib modules or
> elsewhere that does a similar job for me?
>
> Thanks,
> Deepak Konidena.
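A query-side version of the nickname expansion described above can be sketched in plain Java with a hand-maintained map. The map entries here are illustrative; a real system would load the full synonym list from ANNIE's gazetteer data files or an equivalent resource:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of query-side nickname expansion: rewrite each query term to
// (nickname OR fullname) when the term appears in the synonym map.
public class NameExpander {

    private static final Map<String, String> NICKNAMES = new HashMap<>();
    static {
        // illustrative subset of a nickname gazetteer
        NICKNAMES.put("bill", "william");
        NICKNAMES.put("bob", "robert");
        NICKNAMES.put("tom", "thomas");
        NICKNAMES.put("dan", "daniel");
        NICKNAMES.put("will", "william");
    }

    public static String expand(String query) {
        StringBuilder sb = new StringBuilder();
        for (String term : query.toLowerCase().split("\\s+")) {
            if (sb.length() > 0) sb.append(' ');
            String full = NICKNAMES.get(term);
            sb.append(full == null ? term : "(" + term + " OR " + full + ")");
        }
        return sb.toString();
    }
}
```

The same lookup could instead live inside a custom TokenFilter at index time, emitting the full name as an additional token at the same position, which avoids rewriting queries at all.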
Re: How to define different similarity scores per field ?
One way to do this currently is to build a per-field similarity wrapper
(that triggers off the field name). I believe there is some work going
on with Lucene Similarity that would make it pluggable for this sort of
stuff, but in the meantime, this is what I did:

public class MyPerFieldSimilarityWrapper extends Similarity {

  public MyPerFieldSimilarityWrapper() {
    this.defaultSimilarity = new DefaultSimilarity();
    this.fieldSimilarityMap = new HashMap<String,Similarity>();
    this.fieldSimilarityMap.put("fieldA", new FieldASimilarity());
    ...
  }

  @Override
  public float lengthNorm(String fieldName, int numTokens) {
    Similarity sim = fieldSimilarityMap.get(fieldName);
    if (sim == null) {
      return defaultSimilarity.lengthNorm(fieldName, numTokens);
    } else {
      return sim.lengthNorm(fieldName, numTokens);
    }
  }

  // same for scorePayload. For the others, I just delegate
  // to defaultSimilarity (all I really need is scorePayload in
  // my case).
}

and in the schema.xml, I just set this class to be the similarity class:

<similarity class="com.mycompany.MyPerFieldSimilarityWrapper"/>

hth
-sujit

On Tue, 2011-03-01 at 20:41 +0100, Patrick Diviacco wrote:
> I need to define different similarity scores per document field. For
> example, for field A I want to use the Lucene tf.idf score, for the
> numerical field B I want to use a different metric (difference between
> values), and so on...
>
> thanks
Re: How to define different similarity scores per field ?
Yes, for the other methods (except scorePayload), I just delegate to
the corresponding method in DefaultSimilarity. The reason is that I
don't have a way to trigger off the field name for those others. For
me, I really only need to distinguish between DefaultSimilarity and
PayloadSimilarity (which needs to be triggered for certain fields in my
index), so I overrode the scorePayload method also in the same
Map-driven way.

On Tue, 2011-03-01 at 23:28 +0100, Patrick Diviacco wrote:
> I see, but I don't get one thing... you are actually customizing only
> the lengthNorm method but not all the other methods that are
> calculating the similarity scores... those methods are called and they
> have the implementation you have in the DefaultSimilarity class..
> right?
>
> On 1 March 2011 21:12, Sujit Pal sujit@comcast.net wrote:
>> One way to do this currently is to build a per-field similarity
>> wrapper (that triggers off the field name). I believe there is some
>> work going on with Lucene Similarity that would make it pluggable for
>> this sort of stuff, but in the meantime, this is what I did:
>>
>> public class MyPerFieldSimilarityWrapper extends Similarity {
>>
>>   public MyPerFieldSimilarityWrapper() {
>>     this.defaultSimilarity = new DefaultSimilarity();
>>     this.fieldSimilarityMap = new HashMap<String,Similarity>();
>>     this.fieldSimilarityMap.put("fieldA", new FieldASimilarity());
>>     ...
>>   }
>>
>>   @Override
>>   public float lengthNorm(String fieldName, int numTokens) {
>>     Similarity sim = fieldSimilarityMap.get(fieldName);
>>     if (sim == null) {
>>       return defaultSimilarity.lengthNorm(fieldName, numTokens);
>>     } else {
>>       return sim.lengthNorm(fieldName, numTokens);
>>     }
>>   }
>>
>>   // same for scorePayload. For the others, I just delegate
>>   // to defaultSimilarity (all I really need is scorePayload in
>>   // my case).
>> }
>>
>> and in the schema.xml, I just set this class to be the similarity
>> class:
>>
>> <similarity class="com.mycompany.MyPerFieldSimilarityWrapper"/>
>>
>> hth
>> -sujit
>>
>> On Tue, 2011-03-01 at 20:41 +0100, Patrick Diviacco wrote:
>>> I need to define different similarity scores per document field. For
>>> example, for field A I want to use the Lucene tf.idf score, for the
>>> numerical field B I want to use a different metric (difference
>>> between values), and so on...
>>>
>>> thanks
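The Map-driven delegation pattern in the wrapper above can be sketched generically in plain Java, with no Lucene dependency. The class and method names here are illustrative, standing in for the Similarity methods being dispatched per field:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Generic sketch of per-field delegation: look up a per-field override,
// fall back to a default when no override is registered.
public class PerFieldDelegate {

    private final Map<String, Function<Integer, Float>> overrides = new HashMap<>();
    private final Function<Integer, Float> defaultFn;

    public PerFieldDelegate(Function<Integer, Float> defaultFn) {
        this.defaultFn = defaultFn;
    }

    public void register(String field, Function<Integer, Float> fn) {
        overrides.put(field, fn);
    }

    // mirrors lengthNorm(fieldName, numTokens): dispatch on field name
    public float lengthNorm(String field, int numTokens) {
        return overrides.getOrDefault(field, defaultFn).apply(numTokens);
    }
}
```

Each Similarity method that needs per-field behavior (lengthNorm, scorePayload) gets the same lookup-then-delegate treatment, while the rest delegate unconditionally to the default.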