Re: How to create a Lucene in-memory index at webapp deployment time
Follow the instructions here: http://lucene.apache.org/core/discussion.html

-- Jack Krupansky

-----Original Message----- From: Noopur Julka Sent: Monday, September 10, 2012 12:43 PM To: java-user@lucene.apache.org Cc: Dhananjeyan Balaretnaraja Subject: Re: How to create a Lucene in-memory index at webapp deployment time

Please unsubscribe me. Regards, Noopur Julka

On Fri, Sep 7, 2012 at 10:40 AM, Kasun Perera wrote:

I have a Java/JSP web application running on an Apache Tomcat server. In this web application I use Lucene to index and compute similarity between some PDF documents (the PDF documents are stored in the database). My live server doesn't allow the web app to access files, so I create the in-memory Lucene index using the RAMDirectory class. As currently coded, the application builds a new in-memory index every time a user invokes the Lucene functionality. Is there a way to create the in-memory index once, at webapp deployment time, so that I can access it for as long as the web app is live?

-- Regards Kasun Perera
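A minimal sketch of the deployment-time approach (not from the thread; the class name, attribute key, and web.xml registration are assumptions, and the PDF-loading step is elided): build the RAMDirectory once in a ServletContextListener and publish one shared IndexSearcher through the ServletContext.

import java.io.IOException;
import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class IndexBootstrapListener implements ServletContextListener {
    public void contextInitialized(ServletContextEvent sce) {
        try {
            Directory dir = new RAMDirectory();
            IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(
                    Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36)));
            // ... load each PDF from the database, extract its text, and
            // writer.addDocument(...) it here ...
            writer.close();
            // One searcher over the in-memory index, shared by all requests
            // for the lifetime of the webapp.
            sce.getServletContext().setAttribute("searcher",
                    new IndexSearcher(IndexReader.open(dir)));
        } catch (IOException e) {
            throw new RuntimeException("Index build failed at deployment", e);
        }
    }

    public void contextDestroyed(ServletContextEvent sce) {
        // Close the searcher/reader here if you need a clean shutdown.
    }
}

Register the listener in web.xml (or annotate it with @WebListener on Servlet 3.0), and each request can then fetch the "searcher" context attribute instead of rebuilding the index.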
Re: Hibernate Search with Regex based on Table
It sounds as if MappingCharFilter would be sufficient. Unless there is some additional requirement? In Solr we have:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    ...
  </analyzer>
</fieldType>

That mapping-ISOLatin1Accent.txt file maps or "folds" all the accented characters into the base ASCII letter.

-- Jack Krupansky

-----Original Message----- From: Robert Streitberger Sent: Wednesday, September 12, 2012 8:45 AM To: java-user@lucene.apache.org Subject: Hibernate Search with Regex based on Table

Hello, I am currently discussing the possibility of introducing Hibernate Search (Lucene) into an existing Java web project with an existing Hibernate layer. The Hibernate queries are quite complex and mostly done with criteria. For certain properties/columns we are looking for advanced search possibilities.

Example: Assume we have a WHERE clause with a LIKE search looking up names from different languages (we are on a UTF-8 database), say Gomez, which could also be written as Gómez or Gômez, or whatever. The idea is to have a table which provides all alternatives for a certain letter, say o -> ô, ó, ò, ..., and to build a regex from it that finds all possible spellings of Gomez, no matter whether o or one of its UTF-8 variants is used. The problem is that such a regex can become very large, as there are alternatives for nearly all vowels and consonants, and the regexp_like search of the Oracle database is quite restricted. Hence the idea to use some kind of indexed search with Lucene.

In short: Would it be possible to introduce Hibernate Search into the project? (We are on Hibernate 3.0 and JDK 1.5 on Tomcat 6, with hbm.xml files but no annotations.) Would it be possible to use indexed Lucene search by adding Restrictions to Hibernate Criteria? Would it be possible to introduce the matching table to create a complex regex? Or is there a restriction on the length of Lucene regex expressions? Or is there maybe another way that avoids regex entirely, if regex is not possible at this complexity?

Many thanks in advance! kr Rob
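For plain Lucene (outside Solr), the same folding can be wired up programmatically. A minimal sketch against the Lucene 3.x API (exception handling elided; the three rules shown merely illustrate the alternatives table the original poster described):

import java.io.StringReader;
import org.apache.lucene.analysis.CharReader;
import org.apache.lucene.analysis.MappingCharFilter;
import org.apache.lucene.analysis.NormalizeCharMap;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

NormalizeCharMap map = new NormalizeCharMap();
map.add("ó", "o"); // fold each accented variant to the base letter
map.add("ô", "o");
map.add("ò", "o");
// ... one rule per accented character in your alternatives table ...

// The char filter rewrites the raw character stream before tokenization,
// so the tokenizer sees "Gomez" instead of "Gómez".
StandardTokenizer tokens = new StandardTokenizer(Version.LUCENE_36,
        new MappingCharFilter(map, CharReader.get(new StringReader("Gómez"))));

Apply the same chain at both index and query time and a search for "Gomez" matches every accented spelling, with no regex at all.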
Re: Hibernate Search with Regex based on Table
MappingCharFilter can do all of that. The file I referenced already has ae, oe, and ss. That default file handles your umlauts differently, but you can change the rules to suit your exact needs.

-- Jack Krupansky

-----Original Message----- From: Robert Streitberger Sent: Wednesday, September 12, 2012 9:22 AM To: java-user@lucene.apache.org Subject: Re: Hibernate Search with Regex based on Table

Hi, thanks for the hint. It seems to be an interesting solution. Unfortunately I think it will run into problems with German names when umlauts (ö, ä) and the sharp s (ß) are mapped, because there are requirements to map these characters to the usual German representations - say oe, ae, ss - and to honor that in search. kr Rob

From: "Jack Krupansky" To: java-user@lucene.apache.org Date: 12.09.2012 15:02 Subject: Re: Hibernate Search with Regex based on Table

It sounds as if MappingCharFilter would be sufficient. Unless there is some additional requirement? That mapping-ISOLatin1Accent.txt file maps or "folds" all the accented characters into the base ASCII letter. ...
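The mapping file uses a simple quoted "source" => "target" rule per line, so the German conventions are just different rules. A sketch of what a German-oriented mapping file could contain (this rule set is illustrative, not the shipped default):

"ä" => "ae"
"ö" => "oe"
"ü" => "ue"
"Ä" => "Ae"
"Ö" => "Oe"
"Ü" => "Ue"
"ß" => "ss"

Point the MappingCharFilterFactory mapping attribute at a file like this instead of (or in addition to) the default ISOLatin1Accent rules.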
Re: Lucene 4.0 GA time frame
My personal estimate is that it will likely be within a week or two, but there is no official date.

-- Jack Krupansky

-----Original Message----- From: sausarkar Sent: Friday, September 14, 2012 1:05 PM To: java-user@lucene.apache.org Subject: Re: Lucene 4.0 GA time frame

Now that the BETA has been out for a while, any clue when the GA release will take place?

-- View this message in context: http://lucene.472066.n3.nabble.com/Lucene-4-0-GA-time-frame-tp3998264p4007802.html Sent from the Lucene - Java Users mailing list archive at Nabble.com.
Re: how to fully preprocess query before fuzzy search?
" Lucene supports escaping special characters that are part of the query syntax. The current list special characters are + - && || ! ( ) { } [ ] ^ " ~ * ? : \ / " See: http://lucene.apache.org/core/4_0_0-ALPHA/queryparser/org/apache/lucene/queryparser/classic/package-summary.html So, maybe you should escape all special characters, and then add the fuzzy query. Note: In 4.0 the fuzzy query is limited to an editing distance of 2. -- Jack Krupansky -Original Message- From: Ilya Zavorin Sent: Monday, September 17, 2012 10:41 AM To: java-user@lucene.apache.org Subject: how to fully preprocess query before fuzzy search? I am processing a bunch of text coming out of OCR, i.e. it's machine-generated text that contains some errors like garbage characters attached to words, letters replaced with similarly looking characters (e.g. "I" with "1") etc. The text is whitespace-tokenized and I am trying to match each token against an index using a fuzzy match, so that small amounts of occasional garbage in the tokens do not prevent a match. Right now I am preprocessing each query as follows: //term = token Query queryF = parser.Parse(term.Replace("~", "") + "~"); However, searcher.Search still throws "can't parse" exceptions for queries that contain brackets, quotes and other garbage characters. So how should I fully preprocess a query to avoid these exceptions? Looks like I just need to remove a certain set of characters just like the tilde is removed above. What is the complete set of such characters? Do I need to do any other preprocess? Thanks, Ilya Zavorin - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: how to fully preprocess query before fuzzy search?
Either is fine. In fact, just escape based on the individual character, not the context. The multi-character context tells you the places where escaping is not essential, but that doesn't mean escaping there would hurt.

-- Jack Krupansky

-----Original Message----- From: Ilya Zavorin Sent: Monday, September 17, 2012 11:08 AM To: java-user@lucene.apache.org Subject: RE: how to fully preprocess query before fuzzy search?

Thanks. So I do not need to escape the "&" in "dog & cat", but I do need to escape the "&&" in "dog && cat", correct? And do I escape it as "dog \&& cat" or as "dog \&\& cat"?

Ilya

-----Original Message----- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Monday, September 17, 2012 10:55 AM To: java-user@lucene.apache.org Subject: Re: how to fully preprocess query before fuzzy search?

"Lucene supports escaping special characters that are part of the query syntax. ..." ...
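The classic query parser ships a helper that does exactly this per-character escaping, so the preprocessing can be as small as the following sketch (Java shown, although the snippet in the question looks like Lucene.NET; the parser construction is assumed):

// imports: org.apache.lucene.queryparser.classic.QueryParser (4.0), or
// org.apache.lucene.queryParser.QueryParser in 3.x, and org.apache.lucene.search.Query
String term = "ca(t"; // an OCR token with a stray bracket
String escaped = QueryParser.escape(term); // -> ca\(t - every special character escaped
Query fuzzy = parser.parse(escaped + "~"); // append ~ AFTER escaping so it stays an operator

Escape first, then add the one operator you actually want; escaping after appending the tilde would neutralize the fuzzy operator itself.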
Re: how to disable the field cache
Could you suggest the code for a "mock field cache"? I mean, what would an anonymous instance look like?

-- Jack Krupansky

-----Original Message----- From: karsten-s...@gmx.de Sent: Tuesday, September 18, 2012 9:07 AM To: java-user@lucene.apache.org Subject: Re: how to disable the field cache

Hi 惠达 王,

if you do not sort (by field values) and do not use faceted search or joins, the field cache will not be used. If you want to be sure, write a MockFieldCache and call FieldCache.DEFAULT = new MockFieldCache();

Best regards Karsten

By the way, there are other objects in main memory, like the doc frequency; in context: http://lucene.472066.n3.nabble.com/how-to-disable-the-field-cache-td4007384.html

-------- Original Message -------- Date: Thu, 13 Sep 2012 15:34:08 +0800 From: "惠达 王" To: java-user@lucene.apache.org Subject: how to disable the field cache

Hi all, how do I disable the field cache?

王惠达 (PHP Developer, Sysdev Team) E-mail: williamw...@anjuke.com
Re: Using stop words with snowball analyzer and shingle filter
The underscores are due to the fact that the StopFilter defaults to "enable position increments", so there are no terms at the positions where the stop words appeared in the source text. Unfortunately, SnowballAnalyzer does not pass that setting through as a parameter, and it is "final", so you can't subclass it to override the "createComponents" method that creates the StopFilter. You would essentially have to copy the source for SnowballAnalyzer and then add code to invoke StopFilter.setEnablePositionIncrements the way StopFilterFactory does.

-- Jack Krupansky

-----Original Message----- From: Martin O'Shea Sent: Wednesday, September 19, 2012 4:24 AM To: java-user@lucene.apache.org Subject: Using stop words with snowball analyzer and shingle filter

I'm currently giving the user an option to include stop words or not when filtering a body of text for ngram frequencies. Typically, this is done as follows:

snowballAnalyzer = new SnowballAnalyzer(Version.LUCENE_30, "English", stopWords);
shingleAnalyzer = new ShingleAnalyzerWrapper(snowballAnalyzer, this.getnGramLength());

stopWords is set either to a full list of words to include in ngrams or to a list of words to remove from them. this.getnGramLength() simply contains the current ngram length, up to a maximum of three. If I include stop words when filtering the text "satellite is definitely falling to Earth" for trigrams, the output is:

No=1, Key=to, Freq=1
No=2, Key=definitely, Freq=1
No=3, Key=falling to earth, Freq=1
No=4, Key=satellite, Freq=1
No=5, Key=is, Freq=1
No=6, Key=definitely falling to, Freq=1
No=7, Key=definitely falling, Freq=1
No=8, Key=falling, Freq=1
No=9, Key=to earth, Freq=1
No=10, Key=satellite is, Freq=1
No=11, Key=is definitely, Freq=1
No=12, Key=falling to, Freq=1
No=13, Key=is definitely falling, Freq=1
No=14, Key=earth, Freq=1
No=15, Key=satellite is definitely, Freq=1

But if I don't include stop words for trigrams, the output is this:

No=1, Key=satellite, Freq=1
No=2, Key=falling _, Freq=1
No=3, Key=satellite _ _, Freq=1
No=4, Key=_ earth, Freq=1
No=5, Key=falling, Freq=1
No=6, Key=satellite _, Freq=1
No=7, Key=_ _, Freq=1
No=8, Key=_ falling _, Freq=1
No=9, Key=falling _ earth, Freq=1
No=10, Key=_, Freq=3
No=11, Key=earth, Freq=1
No=12, Key=_ _ falling, Freq=1
No=13, Key=_ falling, Freq=1

Why am I seeing underscores? I would have expected simple unigrams, plus "satellite falling" and "falling earth", and "satellite falling earth".
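A minimal sketch of that copy-the-source approach (Lucene 3.0-era API; the class name is made up, and the chain mirrors what SnowballAnalyzer builds internally). The point is the one extra call that stops the StopFilter from leaving position holes, which is what ShingleFilter later renders as "_":

import java.io.Reader;
import java.util.Set;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class NoGapSnowballAnalyzer extends Analyzer {
    private final Set<?> stopWords;

    public NoGapSnowballAnalyzer(Set<?> stopWords) {
        this.stopWords = stopWords;
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = new StandardTokenizer(Version.LUCENE_30, reader);
        ts = new StandardFilter(ts);
        ts = new LowerCaseFilter(ts);
        StopFilter stop = new StopFilter(Version.LUCENE_30, ts, stopWords);
        stop.setEnablePositionIncrements(false); // no holes, so no "_" fillers in shingles
        return new SnowballFilter(stop, "English");
    }
}

Wrap this analyzer in the ShingleAnalyzerWrapper exactly as before. Note the trade-off: with position increments disabled, phrase queries can match across removed stop words.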
Re: Searching for a search string containing a literal slash doesn't work with QueryParser
The escape merely assures that the slash will not be parsed as query syntax and will be passed directly to the analyzer, but the standard analyzer will in fact always remove it. Maybe you want the whitespace analyzer or keyword analyzer (no characters removed).

-- Jack Krupansky

-----Original Message----- From: Jochen Hebbrecht Sent: Monday, October 01, 2012 8:59 AM To: java-user@lucene.apache.org Subject: Searching for a search string containing a literal slash doesn't work with QueryParser

Hi, I'm currently trying to search for the following string in my Lucene index: "2012/0.124.323". The Java code to search for it ('value' is my search string):

QueryParser queryParser = new QueryParser(Version.LUCENE_36, field, new StandardAnalyzer(Version.LUCENE_36));
queryParser.setAllowLeadingWildcard(true);
return queryParser.parse(value);

This returns the query: "2012" "0.124.323". QueryParser is replacing the forward slash with a space. I tried escaping the "/" with a backslash "\", but this doesn't work either.

Maybe this is required to fully understand my scenario: I have the following import XML:

...
<TEXT>Vervaldag</TEXT>
<TEXT>17/07/12</TEXT>
<TEXT>09/07/12</TEXT>
<TEXT>2012/0.124.323</TEXT>
<TEXT>Kapitaals</TEXT>
...

I get all TEXT values with an XPath expression and index them as:

XPathExpression expr = xpath.compile("//TEXT");
Object result = expr.evaluate(document, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
for (int i = 0; i < nodes.getLength(); i++) {
    doc.add(new org.apache.lucene.document.Field("IMAGE", nodes.item(i).getFirstChild().getNodeValue(), Store.NO, Index.ANALYZED));
}

I'm using the StandardAnalyzer. What is the best way to solve my issue? Do I need to switch analyzers? Do I have to use something other than QueryParser? I also want to support searching on 2012/0.*, so I cannot use only TermQuery.

Kind regards, Jochen
Re: Searching for a search string containing a literal slash doesn't work with QueryParser
That's "The escape merely..." -- Jack Krupansky -Original Message----- From: Jack Krupansky Sent: Monday, October 01, 2012 9:58 AM To: java-user@lucene.apache.org Subject: Re: Searching for a search string containing a literal slash doesn't work with QueryParser The scape merely assures that the slash will not be parsed as query syntax and will be passed directly to the analyzer, but the standard analyzer will in fact always remove it. Maybe you want the white space analyzer or keyword analyzer (no characters removed.) -- Jack Krupansky -Original Message- From: Jochen Hebbrecht Sent: Monday, October 01, 2012 8:59 AM To: java-user@lucene.apache.org Subject: Searching for a search string containing a literal slash doesn't work with QueryParser Hi, I'm currently trying to search on the following search string in my Lucene index: "2012/0.124.323". The java code to search for ('value' is my search string) QueryParser queryParser = new QueryParser(Version.LUCENE_36, field, new StandardAnalyzer(Version.LUCENE_36)); queryParser.setAllowLeadingWildcard(true); return queryParser.parse(value); This returns a query result: "2012" "0.124.323". QueryParser is replacing the forward slash by a space. I tried escaping the "/" with a backslash "\", but this doesn't work either. Maybe required to fully understand my scenario. I have the following import XML: --- ... Vervaldag 17/07/12 09/07/12 2012/0.124.323 Kapitaals ... --- I get all TEXT values with an XPath expression and I index them as: --- XPathExpression expr = xpath.compile("//TEXT"); Object result = expr.evaluate(document, XPathConstants.NODESET); NodeList nodes = (NodeList) result; for (int i = 0; i < nodes.getLength(); i++) { doc.add(new org.apache.lucene.document.Field("IMAGE", nodes.item(i).getFirstChild().getNodeValue(), Store.NO, Index.ANALYZED)); } --- I'm using the StandardAnalyzer. What is the best way to solve my issue? Do I need to switch from Analyzer? Do I have to use something else then QueryParser? ... I also want to support searching on 2012/0.*, so I cannot only use TermQuery ... Kind regards, Jochen - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Searching for a search string containing a literal slash doesn't work with QueryParser
You can apply the lower case filter to the whitespace or other analyzer and use that as the analyzer.

-- Jack Krupansky

-----Original Message----- From: Jochen Hebbrecht Sent: Monday, October 01, 2012 10:34 AM To: java-user@lucene.apache.org Subject: Re: Searching for a search string containing a literal slash doesn't work with QueryParser

Hi Jack, I tried analyzing with WhitespaceAnalyzer. Now I can search on my query string AND I can find my document! Great! But all my searches are now case sensitive, so when I index a field as "JavaOne", I also have to enter "JavaOne" in my search, not "javaone" or "javaOne". How do you solve this in a proper way? By bringing all characters toLowerCase() when indexing them?

Jochen

2012/10/1 Jack Krupansky: That's "The escape merely..." ...
Re: Searching for a search string containing a literal slash doesn't work with QueryParser
Sorry, I meant apply the filter to the TOKENIZER that the analyzer uses.

-- Jack Krupansky

-----Original Message----- From: Jack Krupansky Sent: Monday, October 01, 2012 10:44 AM To: java-user@lucene.apache.org Subject: Re: Searching for a search string containing a literal slash doesn't work with QueryParser

You can apply the lower case filter to the whitespace or other analyzer and use that as the analyzer. ...
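Putting the two corrections together, a minimal sketch (Lucene 3.6; the class name is hypothetical) of a whitespace-tokenizing, lowercasing analyzer that leaves slashes intact while keeping search case-insensitive:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

public class LowercaseWhitespaceAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Tokenize on whitespace only, so "2012/0.124.323" stays one token,
        // then lowercase so "JavaOne" also matches "javaone".
        return new LowerCaseFilter(Version.LUCENE_36,
                new WhitespaceTokenizer(Version.LUCENE_36, reader));
    }
}

Use the same analyzer at index time and in the QueryParser constructor so both sides see identical tokens.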
Re: understanding the need to reindex a document
You can in fact add fields, and you can remove the value of a field for a single document. Both can be done simply by re-adding the document with the same unique key value. But that won't add the field to other or all documents, or remove the value of that field for all other documents. Re-indexing is usually required when you change the type of a field, its attributes, or its analyzer. The primary difficulty with automatically re-indexing from the index itself is that Lucene does not require all fields to be "stored". So, replacing a document means you need an external source for all non-stored fields, or at least for non-stored fields that are not copied from another field which is stored.

-- Jack Krupansky

-----Original Message----- From: Shaya Potter Sent: Monday, October 22, 2012 3:47 PM To: java-user@lucene.apache.org Subject: Re: understanding the need to reindex a document

I can understand that (i.e. why in-place won't work), but the question would be: why can't one read in a document, add() or removeField() on that document, and then updateDocument() with it?

On 10/22/2012 03:43 PM, Apostolis Xekoukoulotakis wrote: I am not that familiar with Lucene, so my answer may be a bit off. Search the internet for log-structured storage; there you will find why rewriting an entry is better than updating an existing entry. LevelDB/Cassandra/BigTable use it; maybe search those terms as well.

2012/10/22 Shaya Potter: There are lots of questions asked about wanting to modify a Lucene document (i.e. remove fields, add fields) where the answer is that one needs to reindex. No one ever answers the technical question of why this is, and I'm interested in that. Presumably it's because documents aren't stored as documents, or even in any form that could be reassembled into a document, but I'm interested in what the actual answer is. Thanks.
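For the case where you do have all the field values at hand, the re-add-under-the-same-key pattern is just this (a sketch; the "id" field name and the Lucene 3.x field API are assumptions):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;

Document doc = new Document();
doc.add(new Field("id", "42", Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("body", newBodyText, Field.Store.YES, Field.Index.ANALYZED));
// Deletes any document whose "id" term is 42, then adds this one.
writer.updateDocument(new Term("id", "42"), doc);

The writer never merges old and new field values; whatever the new Document contains is the entire document from then on, which is exactly why every non-stored field needs an external source.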
Re: StandardAnalyzer functionality change
Yes, by design. StandardAnalyzer implements "simple word boundaries" (the technical term is "Unicode text segmentation"), period. As the javadoc says, "As of Lucene version 3.1, this class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29." That is a "standard". See:

http://lucene.apache.org/core/4_0_0-ALPHA/analyzers-common/org/apache/lucene/analysis/standard/StandardTokenizer.html
http://lucene.apache.org/core/4_0_0-BETA/analyzers-common/org/apache/lucene/analysis/standard/ClassicTokenizer.html

-- Jack Krupansky

-----Original Message----- From: kiwi clive Sent: Wednesday, October 24, 2012 6:42 AM To: java-user@lucene.apache.org Subject: StandardAnalyzer functionality change

Hi all, sorry if I'm asking an age-old question, but we have migrated to Lucene 3.6.0 and I see StandardAnalyzer has changed its behaviour, particularly when tokenizing email addresses. From reading the forums, I understand StandardAnalyzer was renamed to ClassicAnalyzer - is this the case? If I pass the string 'u...@domain.com' through these analyzers, I get the following tokens:

Using StandardAnalyzer(Version.LUCENE_23): --> u...@domain.com (one token)
Using StandardAnalyzer(Version.LUCENE_36): --> user domain.com (two tokens)
Using ClassicAnalyzer(Version.LUCENE_36): --> u...@domain.com (one token)

StandardAnalyzer is normally a good compromise as a default analyzer, but the failure to keep an email address intact makes it less fit for purpose than it used to be. Is this a bug or is it by design? If by design, what is the reason for the change, and is ClassicAnalyzer now the de facto analyzer to use?

Thanks, Clive
Re: StandardAnalyzer functionality change
I didn't explicitly say it, but ClassicAnalyzer does do exactly what you want it to do - word break plus email and URL, i.e. StandardAnalyzer plus email and URL.

-- Jack Krupansky

-----Original Message----- From: kiwi clive Sent: Wednesday, October 24, 2012 1:27 PM To: java-user@lucene.apache.org Subject: Re: StandardAnalyzer functionality change

Thanks for the responses chaps, very informative, and most appreciated :-)

From: Ian Lea To: java-user@lucene.apache.org Sent: Wednesday, October 24, 2012 4:19 PM Subject: Re: StandardAnalyzer functionality change

If you want email addresses, UAX29URLEmailAnalyzer is another alternative. -- Ian.

On Wed, Oct 24, 2012 at 3:56 PM, Jack Krupansky wrote: Yes, by design. StandardAnalyzer implements "simple word boundaries" (the technical term is "Unicode text segmentation"), period. ...
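A quick way to verify which analyzer does what is to print the tokens directly. A minimal sketch (Lucene 3.6 API; exception handling elided, field name arbitrary):

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.ClassicAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

Analyzer analyzer = new ClassicAnalyzer(Version.LUCENE_36);
TokenStream ts = analyzer.tokenStream("f", new StringReader("user@domain.com"));
CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
ts.reset();
while (ts.incrementToken()) {
    System.out.println(term.toString()); // ClassicAnalyzer: one token, user@domain.com
}
ts.close();

Swap in StandardAnalyzer on the first line to see the two-token UAX#29 behaviour.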
Re: StandardAnalyzer functionality change
s/work break/word break/

-- Jack Krupansky

-----Original Message----- From: Jack Krupansky Sent: Wednesday, October 24, 2012 3:52 PM To: java-user@lucene.apache.org; kiwi clive Subject: Re: StandardAnalyzer functionality change

I didn't explicitly say it, but ClassicAnalyzer does do exactly what you want it to do - work break plus email and URL, or StandardAnalyzer plus email and URL. ...
Re: Is there anything in Lucene 4.0 that provides 'absolute' scoring so that i can compare the scoring results of different searches ?
Could you provide a more concrete definition of what you mean by "absolute scoring"? I mean, you can implement your own scoring or "similarity", so what exact criteria are you proposing? Try providing a concise example of a couple of documents and a couple of searches and how you propose to score them.

-- Jack Krupansky

-----Original Message----- From: Paul Taylor Sent: Thursday, October 25, 2012 7:11 AM To: java-user@lucene.apache.org Subject: Is there anything in Lucene 4.0 that provides 'absolute' scoring so that I can compare the scoring results of different searches?

Is there anything in Lucene 4.0 that provides 'absolute' scoring, so that I can compare the scoring results of different searches? To explain: if I search for two values, fred OR jane, and there is a document that contains exactly both those words, then that document will score 100, and documents that contain only one word will score less. But if no document contains both words and there is one document that contains fred, then that document will score 100 even though it didn't match jane at all. (I'm clearly ignoring all the complexities, but you get my gist.) So all documents returned from a search are scored relative to each other, but I cannot perform a second search and sensibly compare its scores with the scores of the first search, which is what I would like to do.

Why would I want this? In a music database we have an index of releases and a separate index of artists; usually the user just searches artists or releases. But sometimes they want to search all and interleave the results from the two indexes, and at the moment it isn't sensible for me to interleave them based on their scores.

thanks Paul
Re: App supplied docID in lucene possible?
Have you looked at, or decided against, an approach like Solr's ExternalFileField? See: http://lucene.apache.org/solr/4_0_0/solr-core/org/apache/solr/schema/ExternalFileField.html Is that at least the kind of issue you are trying to deal with?

One final question: how many of a document's field values are stable vs. frequently changing? What are the numbers here - total field count, count of frequently changed fields, and percentage of documents being updated in some period of time? And I don't quite follow why you can't just use a unique key for a document rather than the low-level Lucene document id.

-- Jack Krupansky

-----Original Message----- From: Ravikumar Govindarajan Sent: Thursday, October 25, 2012 6:10 AM To: java-user@lucene.apache.org Subject: App supplied docID in lucene possible?

We need to re-index some fields in our application frequently. Our typical document consists of a) many single-valued {long/int} re-indexable fields, and b) a few large-valued {text/string} static fields. We have to re-index an entire document if a single smallish field changes, and that is turning out to be a problem for us. I have gone through the https://issues.apache.org/jira/browse/LUCENE-3837 proposal, where it tries to work around this limitation using a secondary mapping of new-to-old docids.

As I understand it, Lucene strictly maintains internal doc-id order so that the many queries that depend on it work correctly. Segment merges also maintain order as well as reclaim deleted doc-ids. There must be many applications like ours, which manage index shards by limiting a given shard based on doc-id limits or size, so reclaiming deleted doc-ids is mostly a non-issue for us. That leaves us with changing doc-ids. How about leaving the doc-ids themselves open to applications, at least as an option for the needy? Taking such an approach might interleave doc-ids across segments, but within a segment the docIds would always be in increasing order. There are possibilities of ghost deletes, duplicate docIds, etc., but all should be solvable, I believe. Fronting these doc-ids during search from all segment readers and returning the correct value from one of them should be easy. Will it incur a heavy penalty during search? Another advantage gained is the triviality of cross-joining indexes when docIDs are fixed. There must be many other places where an app-supplied docId might make Lucene behave strangely. I need some help in identifying those areas, at least to understand this problem correctly, if not to solve it altogether.

-- Ravi
Re: How to use/create an alias to a field?
With edismax in Solr 3.6/4.0, field aliases are supported: "The syntax for aliasing is f.myalias.qf=realfield. A user query for myalias:foo will be queried as realfield:foo." See: http://wiki.apache.org/solr/ExtendedDisMax#Field_aliasing_.2BAC8_renaming

-- Jack Krupansky

-----Original Message----- From: Willi Haase Sent: Thursday, October 25, 2012 9:26 AM To: java-user@lucene.apache.org Subject: How to use/create an alias to a field?

Hello, I am using Lucene 3.4 and I have nearly the same question as Jan in his post: http://mail-archives.apache.org/mod_mbox/lucene-java-user/200801.mbox/%3C4791E482.60900%40gmx.de%3E I couldn't find anything helpful in "Lucene in Action" or on the web. What is the recommended way to create an alias to a field? I want to be able to use "au:some_name" or "auth:some_name" or "author:some_name" in the search query, all using the field "author".

Many thanks in advance for any help or recommendations. Willi : )
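Applied to the fields in the question, the edismax request parameters would look something like this (a sketch; host, core, and handler path are assumptions):

http://localhost:8983/solr/select?defType=edismax
    &q=au:some_name
    &f.au.qf=author
    &f.auth.qf=author

Each f.<alias>.qf parameter maps one alias onto the real field, so au:some_name and auth:some_name are both executed as author:some_name.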
Re: How to use/create an alias to a field?
I almost added that you could create a subclass of the Lucene query parser, as Solr does, and add the aliasing that way. There might/should be field aliasing code in Solr that you could easily adapt. There really isn't a great reason why aliasing is only available in Solr and not in Lucene, but unless you are prepared to hack on the query parser, alias support isn't available in Lucene right now.

-- Jack Krupansky

-----Original Message----- From: Willi Haase Sent: Thursday, October 25, 2012 11:59 AM To: java-user@lucene.apache.org Subject: Re: How to use/create an alias to a field?

Hi Jack, thank you for your help. My problem is that I have a Lucene-only setup and cannot switch to Solr : (

Cheers Willi

From: Jack Krupansky To: java-user@lucene.apache.org Sent: Thursday, October 25, 2012, 15:57 Subject: Re: How to use/create an alias to a field?

With edismax in Solr 3.6/4.0, field aliases are supported: "The syntax for aliasing is f.myalias.qf=realfield. ..." ...
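For plain Lucene, that subclassing approach can be quite small. A minimal sketch (Lucene 3.x classic QueryParser; the class name and alias table are made up):

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class AliasingQueryParser extends QueryParser {
    private static final Map<String, String> ALIASES = new HashMap<String, String>();
    static {
        ALIASES.put("au", "author");
        ALIASES.put("auth", "author");
    }

    public AliasingQueryParser(Version v, String defaultField, Analyzer a) {
        super(v, defaultField, a);
    }

    protected Query getFieldQuery(String field, String queryText, boolean quoted)
            throws ParseException {
        // Map the alias to its real field before building the query.
        String real = ALIASES.get(field);
        return super.getFieldQuery(real != null ? real : field, queryText, quoted);
    }
}

getFieldQuery only covers plain term and phrase queries; for full coverage the same alias lookup would also go into overrides of getRangeQuery, getPrefixQuery, getWildcardQuery, and getFuzzyQuery.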
Re: query for documents WITHOUT a field?
"OR allergies IS NULL" would be "OR (*:* -allergies:[* TO *])" in Lucene/Solr. -- Jack Krupansky -Original Message- From: Vitaly Funstein Sent: Thursday, October 25, 2012 8:25 PM To: java-user@lucene.apache.org Subject: Re: query for documents WITHOUT a field? Sorry for resurrecting an old thread, but how would one go about writing a Lucene query similar to this? SELECT * FROM patient WHERE first_name = 'Zed' OR allergies IS NULL An AND case would be easy since one would just use a simple TermQuery with a FieldValueFilter added, but what about other boolean cases? Admittedly, this is a contrived example, but the point here is that it seems that since filters are always applied to results after they are returned, how would one go about making the null-ness of a field part of the query logic? On Thu, Feb 16, 2012 at 1:45 PM, Uwe Schindler wrote: I already mentioned that pseudo NULL term, but the user asked for another solution... -- Uwe Schindler H.-H.-Meier-Allee 63, 28213 Bremen http://www.thetaphi.de Jamie Johnson schrieb: Another possible solution is while indexing insert a custom token which is impossible to show up in the index otherwise, then do the filter based on that token. On Thu, Feb 16, 2012 at 4:41 PM, Uwe Schindler wrote: > As the documentation states: > Lucene is an inverted index that does not have per-document fields. It only > knows terms pointing to documents. The query you are searching is a > query > that returns all documents which have no term. To execute this query, it > will get the term index and iterate all terms of a field, mark those in > a > bitset and negates that. The filter/query I told you uses the FieldCache to > do this. Since 3.6 (also in 3.5, but there it is buggy/API different) there > is another fieldcache that returns exactly that bitset. The filter mentioned > only uses that bitset from this new fieldcache. Fieldcache is populated on > first access and keeps alive as long as the underlying index segment is open > (means as long as IndexReader is open and the parts of the index is not > refreshed). If you are also sorting against your fields or doing other > queries using FieldCache, there is no overhead, otherwise the bitset is > populated on first access to the filter. > > Lucene 3.5 has no easy way to implement that filter, a "NULL" pseudo term is > the only solution (and also much faster on the first access in Lucene 3.6). > Later accesses hitting the cache in 3.6 will be faster, of course. > > Another hacky way to achieve the same results is (works with almost any > Lucene version): > BooleanQuery consisting of: MatchAllDocsQuery() as MUST clause and > PrefixQuery(field, "") as MUST_NOT clause. But the PrefixQuery will do a > full term index scan without caching :-). You may use CachingWrapperFilter > with PrefixFilter instead. > > - > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > > >> -Original Message- >> From: Tim Eck [mailto:tim...@gmail.com] >> Sent: Thursday, February 16, 2012 10:14 PM >> To: java-user@lucene.apache.org >> Subject: RE: query for documents WITHOUT a field? >> >> Thanks for the fast response. I'll certainly have a look at the >> upcoming > 3.6.x >> release. What is the expected performance for using a negated filter? >> In particular does it defeat the index in any way and require a full index > scan? >> Is it different between regular fields and numeric fields? >> >> For 3.5 and earlier though, is there any suggestion other than magic > values? 
>> >> -Original Message- >> From: Uwe Schindler [mailto:u...@thetaphi.de] >> Sent: Thursday, February 16, 2012 1:07 PM >> To: java-user@lucene.apache.org >> Subject: RE: query for documents WITHOUT a field? >> >> Lucene 3.6 will have a FieldValueFilter that can be negated: >> >> Query q = new ConstantScoreQuery(new FieldValueFilter("field", true)); >> >> (see http://goo.gl/wyjxn) >> >> Lucen 3.5 does not yet have it, you can download 3.6 snapshots from > Jenkins: >> http://goo.gl/Ka0gr >> >> - >> Uwe Schindler >> H.-H.-Meier-Allee 63, D-28213 Bremen >> http://www.thetaphi.de >> eMail: u...@thetaphi.de >> >> >> > -Original Message- >> > From: Tim Eck [mailto:t...@terracottatech.com] >> > Sent: Thursday, February 16, 2012 9:59 PM >> > To: java-user@lucene.apache.org >> > Subject: query for documents WITHOUT a field? >> > >> > My apologies if this answer is readily available someplace, I've >> > searched around
Re: query for documents WITHOUT a field?
Right - another level of BooleanQuery attached as a SHOULD clause, with two clauses inside it: a MUST of MatchAllDocsQuery and a MUST_NOT of the TermRangeQuery for "allergies" with null for both start and end.

Actually, there is a new filter that you can use to detect empty fields down at that level. See https://issues.apache.org/jira/browse/LUCENE-4386 I think it is:

new ConstantScoreQuery(new FieldValueFilter(fieldname, true))

Use a SHOULD of that rather than a second level of BooleanQuery. Let us know if it actually works!

-- Jack Krupansky

-----Original Message----- From: Vitaly Funstein Sent: Thursday, October 25, 2012 8:55 PM To: java-user@lucene.apache.org Subject: Re: query for documents WITHOUT a field?

This is the QueryParser syntax, right? So an API equivalent for the not-null case would be something like this?

BooleanQuery q = new BooleanQuery();
q.add(new BooleanClause(new TermQuery(new Term("first_name", "Zed")), Occur.SHOULD));
q.add(new BooleanClause(new TermRangeQuery("allergies", null, null, true, true), Occur.SHOULD));

Whereas for "IS NULL" the TermRangeQuery above would need to be wrapped in another BooleanClause with Occur.MUST_NOT?

On Thu, Oct 25, 2012 at 5:29 PM, Jack Krupansky wrote: "OR allergies IS NULL" would be "OR (*:* -allergies:[* TO *])" in Lucene/Solr. ...
Re: lucene 4.0 indexReader is changed
How about DirectoryReader.openIfChanged? See: http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/index/DirectoryReader.html#openIfChanged(org.apache.lucene.index.DirectoryReader)

-- Jack Krupansky

-----Original Message----- From: Scott Smith Sent: Friday, October 26, 2012 7:54 PM To: java-user@lucene.apache.org Subject: lucene 4.0 indexReader is changed

How do I determine whether the index has been modified in 4.0? The ifChanged() and isChanged() methods appear to have been removed.
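A minimal usage sketch: openIfChanged returns null when nothing changed, so the null check doubles as the old "is changed" test.

import org.apache.lucene.index.DirectoryReader;

DirectoryReader newReader = DirectoryReader.openIfChanged(reader);
if (newReader != null) {
    // The index was modified: switch to the fresh view and release the old one.
    reader.close();
    reader = newReader;
}
// A null return means the index is unchanged and the existing reader is current.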
Re: Running Solr Core/ Tika on Azure
SolrCell includes Tika, and SolrCell is included with Solr - at least with the standard distribution of Solr. You can stream Office and PDF docs directly to the extracting request handler, where Tika will process them. You can also ask SolrCell to "extract only" and return the extracted content. See: http://wiki.apache.org/solr/ExtractingRequestHandler

Whether the Azure distribution is "full" Solr including Solr Cell or not, I cannot answer.

Note: for future reference, Solr questions should be asked on the solr-user mailing list.

-- Jack Krupansky

-----Original Message----- From: Aloke Ghoshal Sent: Monday, October 29, 2012 3:22 AM To: java-user@lucene.apache.org; gene...@lucene.apache.org Subject: Running Solr Core/ Tika on Azure

Hi, I'm looking for feedback on running the Solr core / Tika parsing engine on Azure. There is one offering for Solr within Azure, from LucidWorks; that offering, however, doesn't mention Tika. We are looking at options to make the content of files (doc, excel, pdf, etc.) stored within Azure storage searchable, and whether the parser could run against our Azure store directly to index the content. The other option would be to write a separate connector that streams in the files. Let me know if you have experience along these lines.

Regards, Aloke
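Streaming a document at the handler looks like the examples on that wiki page; a sketch (host, core path, file name, and id are assumptions):

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@report.pdf"

And extraction without indexing, returning the text Tika pulled out:

curl "http://localhost:8983/solr/update/extract?extractOnly=true" -F "myfile=@report.pdf"

A connector for Azure storage would just read each blob and POST it the same way.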
Re: using CharFilter to inject a space
I still think that we're looking at an "XY Problem" here, haggling over a "solution" when the problem has not been clearly and fully stated. In particular, rather than parsing straight natural language text, the data appears to have a structured form. Until the structure is fully defined, detailing a parser, especially by playing games such as "injecting spaces", is an exercise in futility. I mean, you MIGHT come up with a solution that SEEMS to work (at least for SOME cases), and MAY make you happy, but I would hate to see other Lucene users adopt such an approach to problem solving. Tell us the full problem and then we can focus on legitimate "solutions". -- Jack Krupansky -Original Message- From: Erick Erickson Sent: Sunday, November 04, 2012 8:06 AM To: java-user Subject: Re: using CharFilter to inject a space Ahh, I don't know of a better way. I can imagine complex solutions involving something akin to WordDelimiterFilter... and I can imagine that that would be ridiculously expensive to maintain when there are really simple solutions like you're looking at. Mostly I was curious about your use-case Erick On Sat, Nov 3, 2012 at 11:35 PM, Igal @ getRailo.org wrote: well, my main goal is to use a ShingleFilter that will only take shingles that are not separated by commas etc. for example, the phrase: "red apples, green tomatoes, and brown potatoes" should yield the shingles "red apples", "green tomatoes", "and brown", "brown potatoes"; but not "apples green" and not "tomatoes and" as those are separated by commas. the problem with the common tokenizers is that they get rid of the commas so if I use a ShingleFilter after them there's no way to tell if there was a comma there or not. (another option I consider is to add an Attribute to specify if there was a comma before or after a token) if there's a better way -- I'm open to suggestions, Igal On 11/3/2012 8:10 PM, Erick Erickson wrote: So I've gotta ask... _why_ do you want to inject the spaces? If it's just to break this up into tokens, wouldn't something like LetterTokenizer do? Assuming you aren't interested in leaving in numbers Or even StandardTokenizer unless you have e-mail & etc. Or what about PatternReplaceCharFilter? FWIW, Erick On Sat, Nov 3, 2012 at 9:22 PM, Igal Sapir wrote: You're right. I'm not sure what I was thinking. Thanks for all your help, Igal On Nov 3, 2012 5:44 PM, "Robert Muir" wrote: On Sat, Nov 3, 2012 at 8:32 PM, Igal @ getRailo.org wrote: hi Robert, thank you for your replies. I couldn't find much documentation/examples of this, but this is what I came up with (below). is that the way I'm supposed to use the MappingCharFilter? You don't need to extend anything. You also don't want to create a NormalizeCharMap for each reader (that's way too heavy) Just build the NormalizeCharMap once, and pass it to MappingCharFilter's Constructor. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
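A sketch of what Robert describes, against the 4.0 API (later releases replace the mutable NormalizeCharMap with a Builder); the actual mapping of "," to " , " is only an assumption about how Igal might keep commas visible to a downstream ShingleFilter:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

public final class CommaMarkingAnalyzer extends Analyzer {
    // Built once and shared; building a NormalizeCharMap per reader is too heavy.
    private static final NormalizeCharMap MAP = new NormalizeCharMap();
    static {
        MAP.add(",", " , "); // assumption: turn commas into standalone tokens
    }
    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        return new MappingCharFilter(MAP, reader);
    }
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_40, reader);
        return new TokenStreamComponents(source);
    }
}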
Re: content disappears in the index
Maybe... the author names have middle or first initials? Like, maybe the "Arslanagic" dude has an "A" initial in his name, like "A. Arslanagic" or "Arslanagic, A.". In any case, "string" is the proper type for a sorted field, although it would be nice if Lucene/Solr was more developer-friendly when this "mistake" is made. The relevant doc is: "Sorting can be done on the "score" of the document, or on any multiValued="false" indexed="true" field provided that field is either non-tokenized (ie: has no Analyzer) or uses an Analyzer that only produces a single Term (ie: uses the KeywordTokenizer)" ... "The common situation for sorting on a field that you do want to be tokenized for searching is to use a <copyField> to clone your field. Sort on one, search on the other." See: http://wiki.apache.org/solr/CommonQueryParameters For example, have an "author" field that is "text" and an "author_s" (or "author_sorted" or "author_string") field that you copy the name to: Query on "author", but sort on "author_s". -- Jack Krupansky -Original Message- From: Erick Erickson Sent: Monday, November 12, 2012 5:28 AM To: java-user Subject: Re: content disappears in the index First, sorting on tokenized fields is undefined/unsupported. You _might_ get away with it if the author field always reduces to one token, i.e. if you're always indexing only the last name. I should say unsupported/undefined when more than one token is the result of analysis. You can do things like use the KeywordTokenizer followed by transformations on the _entire_ input field (lowercasing is popular for instance). So somehow the analysis chain you have defined for this field grabs "Arslanagic" and translates it into "a". Synonyms? Stemming? Some "interesting" sequence? The fastest way to look at that would be in Solr's admin/analysis page. Just put Arslanagic into the index box and you should see which of the steps does the translation. Although changing it to "a" is really weird, it's almost certainly something you've defined in the indexing analysis chain. FWIW, Erick On Mon, Nov 12, 2012 at 8:19 AM, Bernd Fehling < bernd.fehl...@uni-bielefeld.de> wrote: Hi list, a user reported wrong sorting of our search service running on solr. While chasing this issue I traced it back through lucene into the index. I have a text field for sorting (stored,indexed,tokenized,omitNorms,sortMissingLast) and three docs with author names. If I trace at org.apache.lucene.document.Document.add(IndexableField) while indexing I can see all three author names added as a field to each document. After searching with *:* for the three docs and doing a sort the sorting is wrong because one of the author names is reduced to the first char, all other chars are lost. So having the authors names (Alexander, Arslanagic, Brennmoen) indexed, the result of sorting ascending is (Arslanagic, Alexander, Brennmoen) which is wrong. But this happens because the author "Arslanagic" is reduced to "a" during indexing (???) and if sorted "a" is before "alexander". Currently I use 4.0 but have the same issue with 3.6.1. 
Without tracing through tons of code: - which is the last breakpoint for debugging to see the docs right before they go into the index - which is the first breakpoint for debugging to see the docs coming right out of the index Regards Bernd - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
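In plain Lucene 4.0 code, the clone-the-field pattern Jack describes looks roughly like this (the field names are illustrative):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

public final class SortFieldExample {
    public static Document authorDoc(String name) {
        Document doc = new Document();
        doc.add(new TextField("author", name, Field.Store.YES));    // analyzed: query this one
        doc.add(new StringField("author_s", name, Field.Store.NO)); // single term: sort on this one
        return doc;
    }
    public static Sort byAuthor() {
        return new Sort(new SortField("author_s", SortField.Type.STRING));
    }
}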
Re: Which stemmer?
What is your use case? If you don't have a specific use case in mind, try each of them with some common words that you expect will or won't be stemmed. If you have Solr, you can experiment interactively using the Solr Admin Analysis web page. It would be nice if the javadoc for each stemmer gave a handful of examples that illustrated how some common words are stemmed. -- Jack Krupansky -Original Message- From: Scott Smith Sent: Wednesday, November 14, 2012 10:55 AM To: java-user@lucene.apache.org Subject: Which stemmer? Does anyone have any experience with the stemmers? I know that Porter is what "everyone" uses. Am I better off with KStemFilter (better performance) or ?? Does anyone understand the differences between the various stemmers and how to choose one over another? - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Which stemmer?
Another word set to try: invest, investing, investment, investments, invests, investor, invester, investors, investers. Also, take a look at EnglishMinimalStemmer (EnglishMinimalStemFilterFactory) for minimal stemming. See: http://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/en/EnglishMinimalStemFilterFactory.html http://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/en/EnglishMinimalStemmer.html -- Jack Krupansky -Original Message- From: Scott Smith Sent: Wednesday, November 14, 2012 5:17 PM To: java-user@lucene.apache.org Subject: RE: Which stemmer? Unfortunately, my "use case" is a customer who wants stemming, but has very little knowledge of what that means except they think they want it. I agree with your last comment. So, here's my contribution:

Original      porter    kstem         minStem
-----------   -------   -----------   -----------
country       countri   country       country
run           run       run           run
runs          run       runs          run
running       run       running       running
read          read      read          read
reading       read      reading       reading
reader        reader    reader        reader
association   associ    association   association
associate     associ    associate     associate
listing       list      list          listing
water         water     water         water
watered       water     water         watered
sure          sure      sure          sure
surely        sure      surely        surely
fred's        fred'     fred's        fred'
roses         rose      rose          rose

Still not sure which one to pick. Porter is more aggressive. Min stemmer is pretty minimal. Perhaps the kstemmer is "just right" :-) Cheers Scott -----Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Wednesday, November 14, 2012 4:14 PM To: java-user@lucene.apache.org Subject: Re: Which stemmer? What is your use case? If you don't have a specific use case in mind, try each of them with some common words that you expect will or won't be stemmed. If you have Solr, you can experiment interactively using the Solr Admin Analysis web page. It would be nice if the javadoc for each stemmer gave a handful of examples that illustrated how some common words are stemmed. -- Jack Krupansky -Original Message- From: Scott Smith Sent: Wednesday, November 14, 2012 10:55 AM To: java-user@lucene.apache.org Subject: Which stemmer? Does anyone have any experience with the stemmers? I know that Porter is what "everyone" uses. Am I better off with KStemFilter (better performance) or ?? Does anyone understand the differences between the various stemmers and how to choose one over another? - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
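For experimenting outside Solr, swapping the stemmer is a one-line change in a Lucene analysis chain; a sketch with the 4.0 classes:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.en.KStemFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public final class StemmingAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new StandardTokenizer(Version.LUCENE_40, reader);
        TokenStream sink = new LowerCaseFilter(Version.LUCENE_40, source);
        // Swap in PorterStemFilter or EnglishMinimalStemFilter here to compare.
        sink = new KStemFilter(sink);
        return new TokenStreamComponents(source, sink);
    }
}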
Re: Which stemmer?
One other factor to keep in mind is that the customer should never "look" at the actual stem term - such as "countri" or "gener" because it can freak them out a little, for no good reason. I mean, the goal of stemming is to show what set of words/terms will be treated as equivalent on a query, and this is independent of what gets returned for a stored field. The stem is simply the means to THAT end. The fact that "dog" and "dogs" are not equivalent in KStem is in fact disheartening, at least to me, but it may not be problematic in some use cases. -- Jack Krupansky -Original Message- From: Scott Smith Sent: Thursday, November 15, 2012 11:57 AM To: java-user@lucene.apache.org Subject: RE: Which stemmer? Thanks for the suggestions. I think Erick is correct as well. I'll let the customer decide. Here's an updated list. Fyi--the minStem was the English Minimal Stemmer--I changed the label. Interesting to see where the minimal stemmer and porter agree (and KStemmer doesn't). You may also find the "dog" examples interesting. I also found the "invest*" list entertaining.

original       porter    kstem          EngMinStem
------------   -------   ------------   ------------
country        countri   country        country
countries      countri   country        country
country's      country'  country's      country'
run            run       run            run
runs           run       runs           run
running        run       running        running
read           read      read           read
reading        read      reading        reading
reader         reader    reader         reader
association    associ    association    association
associate      associ    associate      associate
listing        list      list           listing
water          water     water          water
watered        water     water          watered
sure           sure      sure           sure
surely         sure      surely         surely
invest         invest    invest         invest
investing      invest    invest         investing
investment     invest    investment     investment
investments    invest    investment     investment
invests        invest    invest         invest
investor       investor  invest         investor
invester       invest    invest         invester
investors      investor  invest         investor
investers      invest    invest         invester
organization   organ     organization   organization
organize       organ     organize       organize
organic        organ     organic        organic
generous       gener     generous       generous
generic        gener     generic        generic
dog            dog       dog            dog
dog's          dog'      dog's          dog'
dogs           dog       dogs           dog
dogs'          dog       dogs           dog

Now, if someone would answer my question on the Solr list ("Custom Solr Indexer/Search"), my day would be complete ;-). Thanks for the continued help. Scott -Original Message- From: Tom Burton-West [mailto:tburt...@umich.edu] Sent: Thursday, November 15, 2012 11:06 AM To: java-user@lucene.apache.org Subject: Re: Which stemmer? I agree with Erick that you probably need to give your client a list of concrete examples, and perhaps to explain the trade-offs. All stemmers both overstem and understem. Understemming means that some forms of a word won't get searched. For example, without stemming, searching for "dogs" would not retrieve documents containing the word "dog". Generally there is a precision/recall tradeoff where reducing understemming increases overstemming. The problem with aggressive stemmers like the Porter stemmer is that they overstem. 
The original Porter stemmer for example would stem "organization" and "organic" both to "organ", and "generalization", "generous" and "generic" to "gener". For background on the Porter stemmers and lots of examples see these pages: http://snowball.tartarus.org/algorithms/porter/stemmer.html http://snowball.tartarus.org/algorithms/english/stemmer.html This paper on the Kstem stemmer lists cases where the Porter stemmer understems or overstems and explains the logic of Kstem: "Viewing Morphology as an Inference Process" (Krovetz, R., Proceedings of the Sixteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 191-203, 1993). http://ciir.cs.umass.edu/pubfiles/ir-35.pdf
Re: how do re-get the doc after the doc was indexed ?
Sounds like you want path to be a unique key field. So, just do a Lucene search with a TermQuery for the path, which should return one document. No need to mess with Lucene internal doc ids. -- Jack Krupansky -Original Message- From: wgggfiy Sent: Saturday, November 17, 2012 8:08 AM To: java-user@lucene.apache.org Subject: how do re-get the doc after the doc was indexed ? for example: I indexed a doc with path=c:/books/appach.txt, author=Mike After a long time, I wanted to modify the author to John. But the question is how I can get the exact same doc fast? My idea is to traverse the docs from id=0 to id=maxDoc(), retrieve each one with stored fields, and check whether its path equals path=c:/books/appach.txt. Any better idea ?? thx -- View this message in context: http://lucene.472066.n3.nabble.com/how-do-re-get-the-doc-after-the-doc-was-indexed-tp4020865.html Sent from the Lucene - Java Users mailing list archive at Nabble.com. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
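A sketch of that approach, assuming the path was indexed as a single un-analyzed term (e.g. a StringField in 4.0) so the TermQuery matches it exactly:

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public final class LookupByPath {
    public static Document find(IndexSearcher searcher, String path) throws IOException {
        TopDocs hits = searcher.search(new TermQuery(new Term("path", path)), 1);
        return hits.totalHits == 0 ? null : searcher.doc(hits.scoreDocs[0].doc);
    }
    // To change the author, rebuild the document and replace it by its key:
    public static void replace(IndexWriter writer, String path, Document updated) throws IOException {
        writer.updateDocument(new Term("path", path), updated);
    }
}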
Re: Question about ordering rule of SpanNearQuery
Unfortunately, there doesn't appear to be any Javadoc that discusses what factors are used to score spans. For example, how to relate the number of times a span matches in a document vs. the exactness of each span match. -- Jack Krupansky -Original Message- From: 杨光 Sent: Monday, November 19, 2012 5:02 AM To: java-user@lucene.apache.org Subject: Question about ordering rule of SpanNearQuery Hi all, Recently, we are developing a platform with lucene. The ordering rule we specified is that the document with the shortest distance between query terms ranks first. But it may be a little different with SpanNearQuery. It returns all the documents within the qualified distance. So I am confused about the ordering rule of SpanNearQuery. For example, I set the slop in SpanNearQuery to 10. And the results are all the qualified documents. Is it true that any document with a shorter distance between query terms ranks before the one with a longer distance, without considering the tf-idf algorithm? Or, among all the qualified documents, does it still use the tf-idf algorithm to rank the docs? Or is there some complex algorithm blending the distance and tf-idf algorithms? Thanks in advance. Kobe - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Line feed on windows
This doesn't sound like a Lucene issue. It's up to you to read a file and pass it as a string to Lucene. Maybe you're trying to read the file one line at a time, in which case it is up to you to supply line delimiters when combining the lines into a single string. Try reading the full file into a single string, line delimiters and all. Be careful about encoding though. -- Jack Krupansky -Original Message- From: Mansour Al Akeel Sent: Tuesday, November 20, 2012 1:19 PM To: java-user Subject: Line feed on windows Hello all, We are indexing and storing file contents in a lucene index. These files contain the line feed "\n" as the end-of-line character. Lucene is storing the content as is, however when we read it back, the "\n" is removed and we end up with text that is concatenated wherever there's no space. I can re-read the files from the filesystem to avoid this, but I would like to see if there are other alternatives. Thank you. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
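A sketch of the read-the-whole-file approach, assuming Java 7 NIO is available and that the files are UTF-8; the charset must match whatever encoding the files were written with:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public final class FileSlurper {
    public static String slurp(String path) throws IOException {
        // Nothing here splits on line delimiters, so the "\n" characters survive.
        return new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
    }
}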
Re: Question about ordering rule of SpanNearQuery
Add &debugQuery=true to your query and look at the "explain" section to see how the scoring is calculated for each document. Sometimes it is counter-intuitive and some factors may differ but those differences can be overwhelmed by other, unrelated factors. -- Jack Krupansky -Original Message- From: 杨光 Sent: Wednesday, November 21, 2012 10:26 AM To: java-user@lucene.apache.org Subject: Question about ordering rule of SpanNearQuery Hi all, Recently, we are developing a platform with lucene. The ordering rule we specified is that the document with the shortest distance between query terms ranks first. But it may be a little different with SpanNearQuery. It returns all the documents within the qualified distance. So I am confused about the ordering rule of SpanNearQuery. For example, I set the slop in SpanNearQuery to 10. And the results are all the qualified documents. Is it true that any document with a shorter distance between query terms ranks before the one with a longer distance, without considering the tf-idf algorithm? Or, among all the qualified documents, does it still use the tf-idf algorithm to rank the docs? Or is there some complex algorithm blending the distance and tf-idf algorithms? Thanks in advance. -- Guang Yang, Dept. of Computer Science Peking University, 100080 Beijing, China Tel: +86 18631516893 - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Question about ordering rule of SpanNearQuery
Oops... sorry, I just noticed that you are a Lucene, not Solr, user. Call the IndexSearcher#explain method to get the explanation and call the toString method on the explanation to see the readable text. http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/IndexSearcher.html#explain(org.apache.lucene.search.Query, int) -- Jack Krupansky -Original Message- From: Jack Krupansky Sent: Wednesday, November 21, 2012 11:44 AM To: java-user@lucene.apache.org Subject: Re: Question about ordering rule of SpanNearQuery Add &debugQuery=true to your query and look at the "explain" section to see how the scoring is calculated for each document. Sometimes it is counter-intuitive and some factors may differ but those differences can be overwhelmed by other, unrelated factors. -- Jack Krupansky - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
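In code, the Lucene-level equivalent of debugQuery looks like this sketch:

import java.io.IOException;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public final class ExplainExample {
    public static void explainTop(IndexSearcher searcher, Query query) throws IOException {
        TopDocs top = searcher.search(query, 10);
        for (ScoreDoc sd : top.scoreDocs) {
            Explanation why = searcher.explain(query, sd.doc);
            System.out.println(why.toString()); // human-readable scoring breakdown
        }
    }
}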
Re: Which stemmer?
Great! For my favorite example of "invest", "invests", etc. it shows:
SnowballEnglish: investment, invest, invests, investing, invested
kStem: investors, invest, investor, invests, investing, invested
minimalStem: invest, invests
That highlights the distinctions between these stemmers quite well, without highlighting the actual indexed term, which can be quite ugly. -- Jack Krupansky -Original Message- From: Elmer van Chastelet Sent: Wednesday, November 21, 2012 8:49 AM To: java-user@lucene.apache.org Subject: Re: Which stemmer? I've just created a small web application which you might find useful. You can see which words are matched by a query word when using different analyzers (phonetic and stemming analyzers). These include snowball, kstem and minimal stem (the ones on the right). http://dutieq.st.ewi.tudelft.nl/wordsearch/ I can extend the app with more analyzers. Please let me know :) --Elmer On 11/14/2012 07:55 PM, Scott Smith wrote: Does anyone have any experience with the stemmers? I know that Porter is what "everyone" uses. Am I better off with KStemFilter (better performance) or ?? Does anyone understand the differences between the various stemmers and how to choose one over another? - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: How does lucene handle the wildcard and fuzzy queries ?
The proper answer to all of these questions is the same and very simple: If you want "internal" details, read the source code first. If you have specific questions then, fine, ask specific questions - but only after you've checked the code first. Also, questions or issues related to "internals" aren't appropriate on "user" lists. -- Jack Krupansky -Original Message- From: sri krishna Sent: Tuesday, November 27, 2012 12:36 PM To: java-user@lucene.apache.org Subject: How does lucene handle the wildcard and fuzzy queries ? How does lucene handle the prefix queries (wildcard) and fuzzy queries internally? Lucene stores data in the form of an inverted index in segments, i.e. term->doc ids. How does it search a word in the term list efficiently? And how does it handle the advanced queries on the same inverted index? Thanks - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: handling different scores related to queries
Call the IndexSearcher#explain method to get the technical details on how any query is scored. Call Explanation#toString to get the English description for the scoring. Or, using Solr, add the &debugQuery=true parameter to your query request and look at the "explain" section for scoring calculations. Some of these complex queries are "constant score" for performance reasons. -- Jack Krupansky -Original Message- From: sri krishna Sent: Tuesday, November 27, 2012 12:38 PM To: java-user Subject: handling different scores related to queries For a search string like hello*~, how is the scoring calculated? The formula given at http://lucene.apache.org/core/old_versioned_docs/versions/3_0_1/api/core/org/apache/lucene/search/Similarity.html doesn't take edit distance (Levenshtein distance) and prefix-term factors into account. Does lucene add up the scores obtained from each type of query included, i.e. for the above query, actual score = default scoring + 1/(edit distance) + prefix match score? If so, there is no normalization between scores; if not, what is the approach lucene follows, starting from separating each query by identifiers like ~ (edit distance) and * (prefix query), to actual scoring? - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?
"I will probably have to implement my own datastructure and parser/tokenizer/stemmer" Why? I mean, I think the point of the Lucene architecture is that the codec level is completely independent of the analysis level. The end result of analysis is a value to be stored from the application perspective, a "logical value" so to speak, but NOT the bit sequence, the "physical value" so to speak, that the codec will actually store. So, go ahead and have your own codec that does whatever it wants with values, but the input for storage and query should be the output of a standard Lucene analyzer. -- Jack Krupansky -Original Message- From: Johannes.Lichtenberger Sent: Friday, November 30, 2012 10:15 AM To: java-user@lucene.apache.org Cc: Michael McCandless Subject: Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs? On 11/28/2012 01:11 AM, Michael McCandless wrote: Flexible indexing is the ability to make your own codec, which controls the reading and writing of all index parts (postings, stored fields, term vectors, deleted docs, etc.). So for example if you want to store some postings as a bit set instead of the block format that's the default coming up in 4.1, that's easy to do. But what is less easy (as I described below) is changing what is actually stored in the postings, eg adding a new per-position attribute. The original goal was to allow arbitrary attributes beyond the known docs/freqs/positions/offsets that Lucene supports today, so that you could easily make new application-dependent per-term, per-doc, per-position things, pull them from the analyzer, save them to the index, and access them from an IndexReader / query, but while some APIs do expose this, it's not very well explored yet (eg, you'd have to make a custom indexing chain to get the attributes "through" IndexWriter down to your codec). It would be great to make progress making this easier, so ideas are very welcome :) Regarding my questin/thread, is it also possible to change the backend system? I'd like to use Lucene for a versioned DBMS, thus I would need the ability to serialize/deserialize the bytes in my backend whereas keys/values are stored in pages (for instance in an upcoming B+-tree, or in simple "unordered" pages via a record-ID/record mapping). But as no one suggested anything as of now and I've also asked a year ago or so, after implementing the B+-tree I will probably have to implement my own datastructure and parser/tokenizer/stemmer... :-( kind regards, Johannes - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Using alternative scoring mechanism.
I thought it was that simple too, but I couldn't find the "get/setSimilarityProvider" methods listed in that patch, and no mention in the Similarity class. Obviously this feature has morphed a bit since then. To cut to the chase, you still use the old methods for setting the similarity class (IndexSearcher#setSimilarity), but now you need to instantiate a PerFieldSimilarityWrapper that provides a "get" method for each field. It is mentioned in the Similarity javadoc, if you read it carefully enough. See: http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/PerFieldSimilarityWrapper.html This appears to be the change from that original design: https://issues.apache.org/jira/browse/LUCENE-3749 -- Jack Krupansky -Original Message- From: Erick Erickson Sent: Sunday, December 02, 2012 9:54 AM To: java-user Subject: Re: Using alternative scoring mechanism. I think you're looking for per-field similarity, does this help? https://issues.apache.org/jira/browse/LUCENE-2236 Note, in 4.0 only Best Erick On Sat, Dec 1, 2012 at 1:43 PM, Eyal Ben Meir wrote: Can one replace the basic scoring algorithm (TF/IDF) for a specific field, to use a different one? I need to compute similarity for the NAME field. The regular TF/IDF is not good enough, and I want to use a Name Recognition Engine as a scorer. How can it be done? Thanks, Eyal. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
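A sketch of wiring that up; the custom name-scoring Similarity is passed in rather than invented here, since its implementation is the application's business:

import org.apache.lucene.search.similarities.DefaultSimilarity;
import org.apache.lucene.search.similarities.PerFieldSimilarityWrapper;
import org.apache.lucene.search.similarities.Similarity;

public final class NameAwareSimilarity extends PerFieldSimilarityWrapper {
    private final Similarity defaultSim = new DefaultSimilarity();
    private final Similarity nameSim;
    public NameAwareSimilarity(Similarity nameSim) {
        this.nameSim = nameSim; // the custom scorer for the NAME field
    }
    @Override
    public Similarity get(String field) {
        return "name".equals(field) ? nameSim : defaultSim;
    }
}

Then call searcher.setSimilarity(new NameAwareSimilarity(customSim)), and set the same wrapper on the IndexWriterConfig so index-time norms agree with query-time scoring.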
Re: Is Analyzer used when calling IndexWriter.addIndexesNoOptimize()?
These are operations on indexes, so analysis is no longer relevant. Analysis is performed BEFORE data is placed in an index. You still need to perform analysis for queries though. You can use different analyzers as long as they are compatible - if they produce the same results, but if they don't produce the same token stream at both the token and character/byte level, your queries may fail. Rule #1 with Lucene and Solr - always be prepared to completely reindex your data, precisely because ideas about analysis evolve over time. -- Jack Krupansky -Original Message- From: Earl Hood Sent: Wednesday, December 05, 2012 2:33 AM To: java-user@lucene.apache.org Subject: Is Analyzer used when calling IndexWriter.addIndexesNoOptimize()? Lucene version: 3.0.3 Does IndexWriter use the analyzer when adding indexes via addIndexesNoOptimize()? What about for optimize()? I am examining some existing code and trying to determine what effects there may be when combining multiple indexes into a single index, but each index may have had different analyzers used. Thanks, --ewh - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
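For reference, a sketch against the 4.x API (the 3.0 method is addIndexesNoOptimize); the analyzer on the config is only used for later addDocument calls, never for the segments being copied:

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.Version;

public final class MergeIndexes {
    public static void merge(Directory target, Directory... sources) throws IOException {
        IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_40,
                new StandardAnalyzer(Version.LUCENE_40));
        IndexWriter writer = new IndexWriter(target, cfg);
        writer.addIndexes(sources); // raw segment copy; no re-analysis happens
        writer.close();
    }
}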
Re: Boolean and SpanQuery: different results
Can you provide some examples of terms that don't work and the index token stream they fail on? Make sure that the Analyzer you are using doesn't do any magic on the indexed terms - your query term is unanalyzed. Maybe multiple, but distinct, index terms are analyzing to the same, but unexpected term. -- Jack Krupansky -Original Message- From: Carsten Schnober Sent: Thursday, December 13, 2012 10:49 AM To: java-user@lucene.apache.org Subject: Boolean and SpanQuery: different results Hi, I'm following Grant's advice on how to combine BooleanQuery and SpanQuery (http://mail-archives.apache.org/mod_mbox/lucene-java-user/201003.mbox/%3c08c90e81-1c33-487a-9e7d-2f05b2779...@apache.org%3E). The strategy is to perform a BooleanQuery, get the document ID set and perform a SpanQuery restricted by those documents. The purpose is that I need to retrieve Spans for different terms in order to extract their respective payloads separately, but a precondition is that possibly multiple terms occur within the documents. My code looks like this: /* reader and terms are class variables and have been declared final before */ IndexReader reader = ...; List<String> terms = ... /* perform the BooleanQuery and store the document IDs in a BitSet */ BitSet bits = new BitSet(reader.maxDoc()); AllDocCollector collector = new AllDocCollector(); BooleanQuery bq = new BooleanQuery(); for (String term : terms) bq.add(new org.apache.lucene.search.RegexpQuery(new Term(config.getFieldname(), term)), Occur.MUST); IndexSearcher searcher = new IndexSearcher(reader); searcher.search(bq, collector); for (ScoreDoc doc : collector.getHits()) bits.set(doc.doc); /* get the spans for each term separately */ for (String term : terms) { String payloads = retrieveSpans(term, bits); // process and print payloads for term ... } String retrieveSpans(String term, BitSet bits) throws IOException { StringBuilder payloads = new StringBuilder(); Map<Term, TermContext> termContexts = new HashMap<>(); Spans spans; SpanQuery sq = (SpanQuery) new SpanMultiTermQueryWrapper<>(new RegexpQuery(new Term("text", term))).rewrite(reader); for (AtomicReaderContext atomic : reader.leaves()) { spans = sq.getSpans(atomic, new DocIdBitSet(bits), termContexts); while (spans.next()) { // extract and store payloads in 'payloads' StringBuilder } } return payloads.toString(); } This construction seemed to be working fine at first, but I noticed a disturbing behaviour: for many terms, the BooleanQuery, when fed with only one RegexpQuery, matches a larger number of documents than the SpanQuery constructed from the same RegexpQuery. With the BooleanQuery containing only one RegexpQuery, the number should be identical, while with multiple Queries added to the BooleanQuery, the SpanQuery should return an equal number or more results. This behaviour is reproducible reliably even after re-indexing, but not for all tokens. Does anyone have an explanation for that? Best, Carsten -- Institut für Deutsche Sprache | http://www.ids-mannheim.de Projekt KorAP | http://korap.ids-mannheim.de Tel. +49-(0)621-43740789 | schno...@ids-mannheim.de Korpusanalyseplattform der nächsten Generation Next Generation Corpus Analysis Platform - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
precisionStep for days in TrieDate
If I specify a precisionStep of 26 for a TrieDate field, what rough impact should this have on both performance and index size? The input data has time in it, but the milliseconds per day is not needed for the app. Will Lucene store only the top 64 minus 26 bits of data and discard the low 26 bits? I’ve read that a higher precisionStep will lower performance. Will a precisionStep of 26 have dramatically lower performance when referencing days (without time of day)? I suppose that the piece of information I am missing is whether trie precisionStep simply affects some extra index table that trie keeps beyond the raw data values or the data values themselves. -- Jack Krupansky
Re: precisionStep for days in TrieDate
Thanks, you answered the main question - 26 doesn't simply lop off the time of day. Although, I still don't completely follow how trie works (without reading the paper itself.) -- Jack Krupansky -Original Message- From: Uwe Schindler Sent: Friday, December 14, 2012 5:58 PM To: java-user@lucene.apache.org Subject: RE: precisionStep for days in TrieDate Hi, If I specify a precisionStep of 26 for a TrieDate field, what rough impact should this have on both performance and index size? This value is mostly useless; everything > 8 slows down the queries to the speed of TermRangeQuery. The input data has time in it, but the milliseconds per day is not needed for the app. Will Lucene store only the top 64 minus 26 bits of data and discard the low 26 bits? No, you may need to read the Javadocs of NumericRangeQuery, now updated with formulas: http://goo.gl/nyXQR The precisionStep is a count, after how many bits of the indexed value a new term starts. The original value is always indexed in full precision. A precision step of 4 for a 32-bit value (integer) means terms with these bit counts: All 32, left 28, left 24, left 20, left 16, left 12, left 8, left 4 bits of the value (total 8 terms/value). A precision step of 26 would index 2 terms: all 32 bits and one single term with the remaining 6 bits from the left. I’ve read that a higher precisionStep will lower performance. Will a precisionStep of 26 have dramatically lower performance when referencing days (without time of day)? See above. The assumption that 26 will limit precision to days is wrong. I suppose that the piece of information I am missing is whether trie precisionStep simply affects some extra index table that trie keeps beyond the raw data values or the data values themselves. It only affects how the value is indexed (how many terms), but not the value. Uwe - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
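To make the mechanics concrete, a sketch of indexing a date as a trie-encoded long and range-querying it (4.0 API); the step of 8 follows Uwe's guidance, and note the query must be given the same precisionStep the field was indexed with:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.LongField;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;

public final class TrieDateExample {
    private static final int STEP = 8; // a term every 8 bits; full precision is still indexed
    private static final FieldType DATE_TYPE = new FieldType(LongField.TYPE_NOT_STORED);
    static {
        DATE_TYPE.setNumericPrecisionStep(STEP);
        DATE_TYPE.freeze();
    }
    public static Document dateDoc(long millis) {
        Document doc = new Document();
        doc.add(new LongField("date", millis, DATE_TYPE));
        return doc;
    }
    public static Query dayRange(long dayStartMillis, long dayEndMillis) {
        return NumericRangeQuery.newLongRange("date", STEP, dayStartMillis, dayEndMillis, true, true);
    }
}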
Re: Help needed: search is returning no results
Maybe you wanted "text" fields that are analyzed and tokenized, as opposed to string fields which are not analyzed and stored and queried exactly as-is. See: http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/document/TextField.html But, show us some of your indexed data and queries that fail. -- Jack Krupansky -Original Message- From: Ramon Casha Sent: Tuesday, December 18, 2012 9:14 AM To: java-user@lucene.apache.org Subject: Help needed: search is returning no results I have just downloaded and set up Lucene 4.0.0 to implement a search facility for a web app I'm developing. Creating the index seems to be successful - the files created contain the text that I'm indexing. However, search is returning no results. The code I'm using is fairly similar to the examples given. Here is the search code: private static final File index = new File("/tmp/naturopedia-index"); private static final Version VERSION = Version.LUCENE_40; public String search() throws IOException, ParseException { Analyzer analyzer = new StandardAnalyzer(VERSION); Directory directory = FSDirectory.open(index); DirectoryReader ireader = DirectoryReader.open(directory); IndexSearcher isearcher = new IndexSearcher(ireader); QueryParser parser = new QueryParser(VERSION, "labels", analyzer); Query q = parser.parse(getQuery()); ScoreDoc[] hits = isearcher.search(q, 1000).scoreDocs; for (int i = 0; i < hits.length; i++) { Document hitDoc = isearcher.doc(hits[i].doc); System.out.println(hitDoc.getField("id")); } ireader.close(); directory.close(); return "search"; } -- Every time I try this, the search returns zero results. I tried with different fields (text and labels), both of which are indexed, and I tried different words. Any help would be appreciated. Here is the code for producing the index: public String crawl() throws IOException { Analyzer analyzer = new StandardAnalyzer(VERSION); Directory directory = FSDirectory.open(index); IndexWriterConfig config = new IndexWriterConfig(VERSION, analyzer); config.setOpenMode(IndexWriterConfig.OpenMode.CREATE); IndexWriter iwriter = new IndexWriter(directory, config); DB db = new DB(); // JPA interface class. 
for ( Taxonomy t : db.taxonomy.all() ) { LOG.log(Level.INFO, "Scanning {0}", t); Document doc = new Document(); doc.add(new LongField("id", t.getId(), Store.YES)); LOG.log(Level.INFO, " id={0}", t.getId()); StringBuilder text = new StringBuilder(); for(Text l : t.getTexts()) { text.append(l.getWikiText()) .append((' ')); } if(text.length() > 0) { doc.add(new StringField("text", text.toString(), Field.Store.YES)); LOG.log(Level.INFO, " text={0}", text); } StringBuilder labels = new StringBuilder(); toIndex(labels, t.getLabels()); for(Image l : t.getImages()) { toIndex(labels, l.getLabels()); } doc.add(new StringField("labels", labels.toString(), Field.Store.NO)); LOG.log(Level.INFO, " labels={0}", labels); StringBuilder sb = new StringBuilder(); for ( Tag tag : t.getTags() ) { toIndex(sb, tag.getLabels()); } if(!sb.toString().isEmpty()) { doc.add(new StringField("tags", sb.toString(), Field.Store.NO)); LOG.log(Level.INFO, " tags={0}", sb); } iwriter.addDocument(doc); } db.close(); iwriter.close(); return "search"; } private void toIndex(StringBuilder sb, LabelGroup lg) { for(Label l : lg.getLabels()) { sb.append(l.getText()); sb.append(" "); } } -- Ramon Casha - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
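Concretely, the likely fix in the crawl code above is to swap StringField for TextField on anything meant to be searched word-by-word; a sketch (StringField stays appropriate for exact-match keys like the id):

import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;

// inside the loop over taxonomy entries:
doc.add(new TextField("text", text.toString(), Field.Store.YES));
doc.add(new TextField("labels", labels.toString(), Field.Store.NO));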
Re: NGramPhraseQuery with missing terms
"a BooleanQuery, but it requires me to consider every possible pair of terms (since any one of the terms could be missing)" What about setting minMatch and all the terms as "SHOULD" - and then minMatch could be tuned for how many missing terms to tolerate? See: http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/BooleanQuery.html#setMinimumNumberShouldMatch(int) -- Jack Krupansky -Original Message- From: 김한규 Sent: Wednesday, December 19, 2012 2:36 AM To: java-user@lucene.apache.org Subject: NGramPhraseQuery with missing terms Hi. I am trying to make a NGramPhrase query that could tolerate terms missing, so even if one of the NGrams doesn't match it still gets picked up by search. I know I could use the combination of normal SpanNearQuery and a BooleanQuery, but it requires me to consider every possible pair of terms (since any one of the terms could be missing) and it gets too messy and expensive. What I want to try is to use SpanTermQuery to get the positions of the mathcing NGrams and list the spans' position informations in an order, so that I could pick up any two or more spans near each other to score them accordingly, but I can't figure out how can I combine the spans. Any help in solving this issue is appreciated. Also, if there is an example of a simple scoring implementation example that combines multiple queries' results, it would be very nice. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Which token filter can combine 2 terms into 1?
And to be more specific, most query parsers will have already separated the terms and will call the analyzer with only one term at a time, so no term recombination is possible for those parsed terms, at query time. -- Jack Krupansky -Original Message- From: Erick Erickson Sent: Friday, December 21, 2012 8:27 AM To: java-user Subject: Re: Which token filter can combine 2 terms into 1? If it's a fixed list and not excessively long, would synonyms work? But if there's some kind of logic you need to apply, I don't think you're going to find anything OOB. The problem is that by the time a token filter gets called, they are already split up, you'll probably have to write a custom filter that manages that logic. Best Erick On Fri, Dec 21, 2012 at 4:16 AM, Xi Shen wrote: Unfortunately, no...I am not combining every two terms into one. I am combining a specific pair. E.g. the Token Stream: t1 t2 t2a t3 should be rewritten into t1 t2t2a t3 But the TS: t1 t2 t3 t2a should not be rewritten, and it is already correct On Fri, Dec 21, 2012 at 5:00 PM, Alan Woodward < alan.woodw...@romseysoftware.co.uk> wrote: > Have a look at ShingleFilter: > http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/analysis/shingle/ShingleFilter.html > > On 21 Dec 2012, at 08:42, Xi Shen wrote: > > > I have to use the white space and word delimiter to process the input > > first. I tried many combinations, and it seems to me that it is inevitable > > the term will be split into two :( > > > > I think developing my own filter is the only resolution...but I just > cannot > > find a guide to help me understand what I need to do to implement a > > TokenFilter. > > > > > > On Fri, Dec 21, 2012 at 4:03 PM, Danil ŢORIN wrote: > > > >> Easiest way would be to pre-process your input and join those 2 > >> tokens > >> before splitting them by white space. > >> > >> But from given context I might miss some details...still worth a > >> shot. > >> > >> On Fri, Dec 21, 2012 at 9:50 AM, Xi Shen wrote: > >> > >>> Hi, > >>> > >>> I am looking for a token filter that can combine 2 terms into 1? > >>> E.g. > >>> > >>> the input has been tokenized by white space: > >>> > >>> t1 t2 t2a t3 > >>> > >>> I want a filter that outputs: > >>> > >>> t1 t2t2a t3 > >>> > >>> I know it is a very special case, and I am thinking about developing a > >> filter > >>> of my own. But I cannot figure out which API I should use to look > >>> for > >> terms > >>> in a Token Stream. > >>> > >>> -- > >>> Regards, > >>> David Shen > >>> > >>> http://about.me/davidshen > >>> https://twitter.com/#!/davidshen84 > >>> > >> > > > > > > > > -- > > Regards, > > David Shen > > > > http://about.me/davidshen > > https://twitter.com/#!/davidshen84 > > -- Regards, David Shen http://about.me/davidshen https://twitter.com/#!/davidshen84 - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Which token filter can combine 2 terms into 1?
You still have the query parser's parsing before analysis to deal with, no matter what magic you code in your analyzer. -- Jack Krupansky -Original Message- From: Tom Sent: Friday, December 21, 2012 2:24 PM To: java-user@lucene.apache.org Subject: Re: Which token filter can combine 2 terms into 1? On Fri, Dec 21, 2012 at 9:16 AM, Jack Krupansky wrote: And to be more specific, most query parsers will have already separated the terms and will call the analyzer with only one term at a time, so no term recombination is possible for those parsed terms, at query time. Most analyzers will do that, yes. But if Xi writes his own analyzer with his own combiner filter, then he should also use this for query generation and thus get the desired combinations / snippets there as well. Xi, here is the recipe: - SnippetFilter extends TokenFilter - SnippetFilter needs access to your lexicon: a data structure to store your snippets. In the general case this is a tree, and going along a branch will tell you whenever a valid snippet has been built or if the snippet could be longer. (Example: "internal revenue" can be one snippet but, depending on the next token, a larger snippet of "internal revenue service" could be built.) - Logic of the SnippetFilter.incrementToken() goes something like this: You need a loop which retrieves tokens from the input variable until the input is empty. You store each retrieved token in a variable(s) x in SnippetFilter. As long as you have a potential match against your lexicon, you can continue in this loop. Once you realize that there is something within x which cannot possibly become a (longer) snippet, break out of the loop and allow the consumer to retrieve it. - make sure your analyzer inserts SnippetFilter at the correct spot in the filter chain. Cheers FiveMileTom
Re: Retrieving granular scores back from Lucene/SOLR
The Explanation tree returned by IndexSearcher#explain is as good as you are going to get, but is rather expensive. You are asking for a lot, so you should be prepared to pay for it. See: http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/IndexSearcher.html#explain(org.apache.lucene.search.Query, int) -- Jack Krupansky -Original Message- From: Vishwas Goel Sent: Tuesday, December 25, 2012 11:30 PM To: java-user@lucene.apache.org Subject: Retrieving granular scores back from Lucene/SOLR Hi, I am looking to get a bit more information back from SOLR/Lucene about the query/document pair scores. This would include field level scores, overall text relevance score, Boost value, BF value etc. Information could either be encoded in the score itself that Lucene/Solr returns - ideally however i would want to return an array of scores back from SOLR/Lucene. I understand that i could work with the debug output - but i am looking to do this for the production use case. Has anyone explored something like this? Does someone know, where is the final score(not just the relevance score) computed? Thanks, Vishwas - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: TokenFilter state question
You need a "reset" method that calls the super reset to reset the parent state and then reset your own state. http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/TokenStream.html#reset() You probably don't have one, so only the parent state gets reset. -- Jack Krupansky -Original Message- From: Jeremy Long Sent: Wednesday, December 26, 2012 9:08 AM To: java-user@lucene.apache.org Subject: TokenFilter state question Hello, I'm still trying to figure out some of the nuances of Lucene and I have run into a small issue. I have created my own custom analyzer which uses the WhitespaceTokenizer and chains together the LowercaseFilter, StopwordFilter, and my own custom filter (below). I am using this analyzer when searching (i.e. it is the analyzer used in a QueryParser). The custom analyzers purpose is to add tokens by concatenating the previous word with the current word. So that if you were given "Spring Framework Core" the resulting tokens would be "Spring SpringFramework Framework FrameworkCore Core". My problem is that when my query text is "Spring Framework Core" I end up with left-over state in my TokenPairConcatenatingFilter (the previousWord is a member field). So if I end up re-using my query parser on a subsequent search for "Apache Struts" I end up with the token stream of "CoreApache Apache ApacheStruts Struts". The Initial "core" was left over state. The left over state from the initial query appears to arise because in my initial loop that collects all of the tokens from the underlying stream only collects a single token. So the processing is - we collect the token "spring", we write "spring" out to the stream and move it to the previousWord. Next, we are at the end of the stream and we have no more words in the list so the filter returns false. At this time, the filter is called again and "Framework" is collected... repeat until end of tokens from the query is reached; however, "Core" is left in the previousWord field. The filter would work correctly with no state being left over if all of the tokens were collected at the beginning (i.e. the first call to incrementToken). Can anyone explain why all of the tokens would not be collected and/or a work around so that when QueryParser.parse("field:(Spring Framework Core)") is called residual state is not left over in my token filter? I have two hack solutions - 1) don't reuse the analyzer/QueryParser for subsequent queries or 2) build in a reset mechanism to clear the previousWord field. I don't like either solution and was hoping someone from the list might have a suggestion as to what I've done wrong or some feature of Lucene I've missed. The code is below. Thanks in advance, Jeremy // // TokenPairConcatenatingFilter import java.io.IOException; import java.util.LinkedList; import org.apache.lucene.analysis.TokenFilter; import org.apache.lucene.analysis.TokenStream; import org.apache.lucene.analysis.tokenattributes.CharTermAttribute; import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute; /** * Takes a TokenStream and adds additional tokens by concatenating pairs of words. * Example: "Spring Framework Core" -> "Spring SpringFramework Framework FrameworkCore Core". 
* * @author Jeremy Long (jeremy.l...@gmail.com) */ public final class TokenPairConcatenatingFilter extends TokenFilter { private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class); private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class); private String previousWord = null; private LinkedList<String> words = null; public TokenPairConcatenatingFilter(TokenStream stream) { super(stream); words = new LinkedList<String>(); } /** * Increments the underlying TokenStream and sets CharTermAttributes to * construct an expanded set of tokens by concatenating tokens with the * previous token. * * @return whether or not we have hit the end of the TokenStream * @throws IOException is thrown when an IOException occurs */ @Override public boolean incrementToken() throws IOException { // collect all the terms into the words collection while (input.incrementToken()) { String word = new String(termAtt.buffer(), 0, termAtt.length()); words.add(word); } // if we have a previousTerm - write it out as its own token concatenated // with the current word (if one is available). if (previousWord != null && words.size() > 0) { String word = words.getFirst(); clearAttributes(); termAtt.append(previousWord).append(word); posIncAtt.setPositionIncrement(0); previousWord = null;
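The quoted code is cut off above; for completeness, the reset override that Jack recommends would look like this sketch, using the field names from Jeremy's filter:

@Override
public void reset() throws IOException {
    super.reset();       // reset the wrapped stream's state first
    previousWord = null; // then clear this filter's left-over pair state
    words.clear();
}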
Re: Which token filter can combine 2 terms into 1?
Ah! You're quoting full phrases. You weren't clear about that originally. Thanks for the clarification. -- Jack Krupansky -Original Message- From: Tom Sent: Wednesday, December 26, 2012 5:54 PM To: java-user@lucene.apache.org Subject: Re: Which token filter can combine 2 terms into 1? On Fri, Dec 21, 2012 at 2:44 PM, Jack Krupansky wrote: You still have the query parser's parsing before analysis to deal with, no matter what magic you code in your analyzer. Not quite. "query parser's parsing" comes first, you are correct on that. But it is irrelevant for splitting field values into search terms, because this part of the whole process is done by an analyzer. Therefore, if you make sure the correct analyzer is used, then the parsing and splitting into individual search terms will be done by this analyzer, not by the query parser. Try it: Implement an analyzer with the SnippetFilter below. Start Luke and make sure this analyzer is selected in "Analyzer to use for query parsing". In the search expression, type in any length of text for example: body:"word1 word2 word3" and you will get the possibly combined Terms. For example, let's say one snippet in your SnippetFilter is: "word2 word3" you will get Term 0: field=body text=word1 Term 1: field=body text=word2 word3 In this case, word2 and word3 will NOT be split.
Best Erick On Fri, Dec 21, 2012 at 4:16 AM, Xi Shen wrote: Unfortunately, no... I am not combining every two terms into one. I am combining a specific pair. E.g. the token stream: t1 t2 t2a t3 should be rewritten into t1 t2t2a t3 But the TS: t1 t2 t3 t2a should not be rewritten, and it is already correct On Fri, Dec 21, 2012 at 5:00 PM, Alan Woodward <alan.woodward@romseysoftware.co.uk> wrote: > Have a look at ShingleFilter: > http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/analysis/shingle/ShingleFilter.html > > On 21 Dec 2012, at 08:42, Xi Shen wrote: > > > I have to use the whitespace and word delimiter filters to process the > > input > > first. I tried many combinations, and it seems to me that it is inevitable > > the term will be split into two :( > > > > I think developing my own filter is the only resolution... but I just > cannot > > find a guide to help me understand
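A minimal sketch of Tom's recipe, assuming Lucene 4.x and the fixed pair from Xi's example ("t2" followed by "t2a" merges into "t2t2a"). The class name is made up, and the hard-coded string comparison stands in for the lexicon/tree lookup Tom describes:

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public final class PairConcatFilter extends TokenFilter {
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private State pending; // look-ahead token saved for the next call

      public PairConcatFilter(TokenStream input) {
        super(input);
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (pending != null) {          // emit the token we read ahead last time
          restoreState(pending);
          pending = null;
          return true;
        }
        if (!input.incrementToken()) {
          return false;                 // stream exhausted
        }
        if (!"t2".equals(termAtt.toString())) {
          return true;                  // not the start of the pair, pass through
        }
        State first = captureState();   // remember "t2" while we peek ahead
        if (!input.incrementToken()) {
          restoreState(first);          // stream ended: emit "t2" unchanged
          return true;
        }
        if ("t2a".equals(termAtt.toString())) {
          termAtt.setEmpty().append("t2t2a");  // merge the pair into one term
          return true;
        }
        pending = captureState();       // no match: emit "t2" now, buffer this token
        restoreState(first);
        return true;
      }

      @Override
      public void reset() throws IOException {
        super.reset();
        pending = null;                 // never carry look-ahead across streams
      }
    }

For a real lexicon the hard-coded strings become a tree walk, and a production filter would also fix up the offset attribute of the merged token; both are omitted here for brevity.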
Re: TokenFilter state question
If you want the query parser to present a sequence of tokens to your analyzer as one unit, you need to enclose the sequence in quotes. Otherwise, the query parser will pass each of the terms to the analyzer as a single unit, with a reset for each, as you in fact report. So, change: String querystr = "product:(Spring Framework Core) vendor:(SpringSource)"; to String querystr = "product:\"Spring Framework Core\" vendor:(SpringSource)"; -- Jack Krupansky -Original Message- From: Jeremy Long Sent: Wednesday, December 26, 2012 5:52 PM To: java-user@lucene.apache.org Subject: Re: TokenFilter state question Actually, I had thought the same thing and had played around with the reset method. However, in my example where I called the custom analyzer the "reset" method was called after every token, not after the end of the stream (which implies each token was treated as its own TokenStream?). String querystr = "product:(Spring Framework Core) vendor:(SpringSource)"; Map fieldAnalyzers = new HashMap(); fieldAnalyzers.put("product", new SearchFieldAnalyzer(Version.LUCENE_40)); fieldAnalyzers.put("vendor", new SearchFieldAnalyzer(Version.LUCENE_40)); PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper( new StandardAnalyzer(Version.LUCENE_40), fieldAnalyzers); QueryParser parser = new QueryParser(Version.LUCENE_40, field1, wrapper); Query q = parser.parse(querystr); In the above example (from the code snippets in the original email), if I were to add a reset method it would be called 4 times (yes, I have tried this). Reset gets called for each token "Spring" "Framework" "Core" and "SpringSource". Thus, if I reset my internal state I would not achieve the goal of having "Spring Framework Core" result in the tokens "Spring SpringFramework Framework FrameworkCore Core". My question is - why would these be treated as separate token streams? --Jeremy On Wed, Dec 26, 2012 at 10:54 AM, Jack Krupansky wrote: You need a "reset" method that calls the super reset to reset the parent state and then reset your own state. http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/TokenStream.html#reset() You probably don't have one, so only the parent state gets reset. -- Jack Krupansky -Original Message- From: Jeremy Long Sent: Wednesday, December 26, 2012 9:08 AM To: java-user@lucene.apache.org Subject: TokenFilter state question Hello, I'm still trying to figure out some of the nuances of Lucene and I have run into a small issue. I have created my own custom analyzer which uses the WhitespaceTokenizer and chains together the LowercaseFilter, StopwordFilter, and my own custom filter (below). I am using this analyzer when searching (i.e. it is the analyzer used in a QueryParser). The custom analyzer's purpose is to add tokens by concatenating the previous word with the current word, so that if you were given "Spring Framework Core" the resulting tokens would be "Spring SpringFramework Framework FrameworkCore Core". My problem is that when my query text is "Spring Framework Core" I end up with left-over state in my TokenPairConcatenatingFilter (the previousWord is a member field). So if I end up re-using my query parser on a subsequent search for "Apache Struts" I end up with the token stream of "CoreApache Apache ApacheStruts Struts". The initial "core" was left-over state.
The leftover state from the initial query appears to arise because the initial loop that collects tokens from the underlying stream only collects a single token. So the processing is - we collect the token "spring", we write "spring" out to the stream and move it to the previousWord. Next, we are at the end of the stream and we have no more words in the list so the filter returns false. At this time, the filter is called again and "Framework" is collected... repeat until the end of the tokens from the query is reached; however, "Core" is left in the previousWord field. The filter would work correctly with no state being left over if all of the tokens were collected at the beginning (i.e. the first call to incrementToken). Can anyone explain why all of the tokens would not be collected, and/or a workaround so that when QueryParser.parse("field:(Spring Framework Core)") is called, residual state is not left over in my token filter? I have two hack solutions - 1) don't reuse the analyzer/QueryParser for subsequent queries or 2) build in a reset mechanism to clear the previousWord field. I don't like either solution and was hoping for a cleaner approach.
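For the record, the second "hack" is the conventional fix: the query parser analyzes each clause as its own token stream, so per-stream state belongs in reset(). A minimal sketch, assuming the filter keeps its look-behind in a field named previousWord as Jeremy describes:

    @Override
    public void reset() throws IOException {
      super.reset();        // let the wrapped stream clear its own state
      previousWord = null;  // drop the word carried over from the previous clause
    }

Combined with Jack's quoting advice (so that "Spring Framework Core" reaches the analyzer as one stream), the concatenated pairs are produced and no state leaks into the next query.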
Re: Differences in MLT Query Terms Question
The term "arv" is on the first list, but not the second. Maybe it's document frequency fell below the setting for minimum document frequency on the second run. Or, maybe the minimum word length was set to 4 or more on the second run. Are you using MoreLikeThisQuery or directly using MoreLikeThis? Or, possibly "arv" appears later in a document on the second run, after the number of tokens specified by maxNumTokensParsed. -- Jack Krupansky -Original Message- From: Peter Lavin Sent: Tuesday, January 08, 2013 1:46 PM To: java-user@lucene.apache.org Subject: Differences in MLT Query Terms Question Dear Users, I am running some simple experiments with Lucene and am seeing something I don't understand. I have 16 text files on 4 different topics, ranging in size from 50-900 KB. When I index all 16 of these and run an MLT query based on one of the indexed documents, I get an expected result (i.e. similar topics are found). When I reduce the number of text files to 4 and index them (having taken care to overwriting the previous index files), and then run the same MLT query (based on the same document from the index), I get slightly different scores. I'm assuming this is because the IDF is now different because there is less documents. For each run, I have set the max number of terms as... mlt.setMaxQueryTerms(100) However, when I compare the terms which get used for the MLT query on the 16 document index and the 4 document index, they are slightly different. I've printed, parsed and sorted them into two columns of a CSV file. I've pasted a small part of it at the end of this email. My Question(s)... 1) Can anybody explain why the set of terms used for the MLT query is different when a file from an index of 16 documents versus 4 documents is used? 2) Am I right in assuming that the reason for slightly different scores in the IDF, or could it be this slight difference in the sets of terms used (or possibly both)? regards, Peter -- with best regards, Peter Lavin, PhD Candidate, CAG - Computer Architecture & Grid Research Group, Lloyd Institute, 005, Trinity College Dublin, Ireland. +353 1 8961536 "about","about" "affordable","affordable" "agents","agents" "aids","aids" "architecture","architecture" "arv","based" "based","blog" "blog","board" "board","business" "business","care" "care","commemorates" "commemorates","contacts" "contacts","contributions" "contributions","coordinating" "coordinating","core" "core","countries" "countries","country" "country","data" "data","decisions" "decisions","details" "details","disbursements" "disbursements","documents" "documents","donors" - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: FuzzyQuery in lucene 4.0
FWIW, new FuzzyQuery(term, 2, 0) is the same as new FuzzyQuery(term), given the current values of defaultMaxEdits (2) and defaultPrefixLength (0). -- Jack Krupansky -Original Message- From: Ian Lea Sent: Wednesday, January 09, 2013 9:44 AM To: java-user@lucene.apache.org Subject: Re: FuzzyQuery in lucene 4.0 See the javadocs for FuzzyQuery to see what the parameters are. I can't tell you what the comment means. Possible values to try maybe? -- Ian. On Wed, Jan 9, 2013 at 2:34 PM, algebra wrote: It's true Ian, the code is good. The only thing that I don't understand is this line: Query query = new FuzzyQuery(term, 2 ,0); //0-2 What does 0 to 2 mean? -- View this message in context: http://lucene.472066.n3.nabble.com/FuzzyQuery-in-lucene-4-0-tp4031871p4031879.html Sent from the Lucene - Java Users mailing list archive at Nabble.com. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
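The equivalence in code, assuming Lucene 4.0 and an arbitrary term:

    Term term = new Term("field", "lucene");
    Query implicit = new FuzzyQuery(term);        // defaultMaxEdits=2, defaultPrefixLength=0
    Query explicit = new FuzzyQuery(term, 2, 0);  // the same query, spelled out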
Re: Lucene-MoreLikethis
There are lots of parameters you can adjust, but the defaults essentially assume that you have a fairly large corpus and aren't interested in low-frequency terms. So, try MoreLikeThis#setMinDocFreq. The default is 5. You don't have any terms in your example with a doc freq over 2. Also, try setMinTermFreq. The default is 2. You don't have any terms with a term frequency above 1. -- Jack Krupansky -Original Message- From: Thomas Keller Sent: Tuesday, January 15, 2013 3:22 PM To: java-user@lucene.apache.org Subject: Lucene-MoreLikethis Hey, I have a question about "MoreLikeThis" in Lucene, Java. I built up an index and want to find similar documents. But I always get no results for my query, mlt.like(1) is always empty. Can anyone find my mistake? Here is an example. (I use Lucene 4.0) public class HelloLucene { public static void main(String[] args) throws IOException, ParseException { StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40); Directory index = new RAMDirectory(); IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer); IndexWriter w = new IndexWriter(index, config); addDoc(w, "Lucene in Action", "193398817"); addDoc(w, "Lucene for Dummies", "55320055Z"); addDoc(w, "Managing Gigabytes", "55063554A"); addDoc(w, "The Art of Computer Science", "9900333X"); w.close(); // search IndexReader reader = DirectoryReader.open(index); IndexSearcher searcher = new IndexSearcher(reader); MoreLikeThis mlt = new MoreLikeThis(reader); Query query = mlt.like(1); System.out.println(searcher.search(query, 5).totalHits); } private static void addDoc(IndexWriter w, String title, String isbn) throws IOException { Document doc = new Document(); doc.add(new TextField("title", title, Field.Store.YES)); // use a string field for isbn because we don't want it tokenized doc.add(new StringField("isbn", isbn, Field.Store.YES)); w.addDocument(doc); } } - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
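Applying both suggestions to the example, a sketch (Lucene 4.0). Note the setFieldNames call is an extra assumption beyond Jack's two settings: MoreLikeThis defaults to a field named "contents", so it also needs to be pointed at "title" here:

    MoreLikeThis mlt = new MoreLikeThis(reader);
    mlt.setMinTermFreq(1);                      // default 2: every term here occurs only once
    mlt.setMinDocFreq(1);                       // default 5: no term here reaches 5 documents
    mlt.setFieldNames(new String[] {"title"});  // otherwise no terms are extracted from "title"
    Query query = mlt.like(1);                  // should now be non-empty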
Re: Combine two BooleanQueries by a SpanNearQuery.
You need to express the "boolean" query solely in terms of SpanOrQuery and SpanNearQuery. If you can't, ... then it probably can't be done, but you should be able to. How about starting with a plain English description of the problem you are trying to solve? -- Jack Krupansky -Original Message- From: Michel Conrad Sent: Thursday, January 17, 2013 11:01 AM To: java-user@lucene.apache.org Subject: Combine two BooleanQueries by a SpanNearQuery. Hi, I am looking to get a combination of multiple subqueries. What I want to do is to have two queries which have to be near one another. As an example: Query1: (A AND (B OR C)) Query2: D Then I want to use something like a SpanNearQuery to combine both (slop 5): Both would then have to match and D should be within slop 5 of A, B or C. So my question is if there is a query that combines two BooleanQuery trees into a SpanNearQuery. It would have to take the terms that match Query 1 and the terms that match Query 2, and look if there is a combination within the required slop. Can I rewrite the BooleanQuery after parsing the query as a MultiTermQuery, then wrap these in SpanMultiTermQueryWrapper, which can be combined by the SpanNearQuery? Best regards, Michel - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Combine two BooleanQueries by a SpanNearQuery.
Currently there isn't. SpanNearQuery can take only other SpanQuery objects, which includes other spans, span terms, and span-wrapped multi-term queries (e.g., wildcard, fuzzy query), but not Boolean queries. But it does sound like a good feature request. There is SpanNotQuery, so you can exclude terms from a span. -- Jack Krupansky -Original Message- From: Michel Conrad Sent: Thursday, January 17, 2013 12:14 PM To: java-user@lucene.apache.org Subject: Re: Combine two BooleanQueries by a SpanNearQuery. The problem I would like to solve is to have two queries that I will get from the query parser (this could include wildcard queries and phrase queries). Both of these queries would have to match the document, and as an additional restriction I would like to add that a matching term from the first query is near a matching term from the second query. For instance, if you search for matching documents with 'apple AND NOT computer' in your first query and 'monkey' in your second, with a slop of 10 between the two queries, then it would be equivalent to '"apple monkey"~10 AND NOT computer'. I was wondering if there is a method to combine more complicated queries in a similar way. (Some kind of generic solution) Thanks for your help, Michel On Thu, Jan 17, 2013 at 5:14 PM, Jack Krupansky wrote: You need to express the "boolean" query solely in terms of SpanOrQuery and SpanNearQuery. If you can't, ... then it probably can't be done, but you should be able to. How about starting with a plain English description of the problem you are trying to solve? -- Jack Krupansky -Original Message- From: Michel Conrad Sent: Thursday, January 17, 2013 11:01 AM To: java-user@lucene.apache.org Subject: Combine two BooleanQueries by a SpanNearQuery. Hi, I am looking to get a combination of multiple subqueries. What I want to do is to have two queries which have to be near one another. As an example: Query1: (A AND (B OR C)) Query2: D Then I want to use something like a SpanNearQuery to combine both (slop 5): Both would then have to match and D should be within slop 5 of A, B or C. So my question is if there is a query that combines two BooleanQuery trees into a SpanNearQuery. It would have to take the terms that match Query 1 and the terms that match Query 2, and look if there is a combination within the required slop. Can I rewrite the BooleanQuery after parsing the query as a MultiTermQuery, then wrap these in SpanMultiTermQueryWrapper, which can be combined by the SpanNearQuery? Best regards, Michel - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
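For Michel's concrete example, (A AND (B OR C)) near D with slop 5, the spirit of Jack's advice looks roughly like this. This is a sketch for one fixed query, not the generic rewrite of arbitrary BooleanQuery trees (which is the part that doesn't exist); the field name "f" is arbitrary:

    SpanQuery d = new SpanTermQuery(new Term("f", "D"));
    SpanQuery anyOfFirst = new SpanOrQuery(
        new SpanTermQuery(new Term("f", "A")),
        new SpanTermQuery(new Term("f", "B")),
        new SpanTermQuery(new Term("f", "C")));
    // D must appear within 5 positions of some term of the first query
    SpanQuery near = new SpanNearQuery(new SpanQuery[] {anyOfFirst, d}, 5, false);

    // The purely boolean constraints are kept as ordinary MUST clauses
    BooleanQuery bOrC = new BooleanQuery();
    bOrC.add(new TermQuery(new Term("f", "B")), BooleanClause.Occur.SHOULD);
    bOrC.add(new TermQuery(new Term("f", "C")), BooleanClause.Occur.SHOULD);

    BooleanQuery query = new BooleanQuery();
    query.add(new TermQuery(new Term("f", "A")), BooleanClause.Occur.MUST);
    query.add(bOrC, BooleanClause.Occur.MUST);
    query.add(near, BooleanClause.Occur.MUST);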
Re: SpanNearQuery with two boundaries
+1 I think that accurately states the semantics of the operation you want. -- Jack Krupansky -Original Message- From: Alan Woodward Sent: Friday, January 18, 2013 1:08 PM To: java-user@lucene.apache.org Subject: Re: SpanNearQuery with two boundaries Hi Igor, You could try wrapping the two cases in a SpanNotQuery: SpanNot(SpanNear(runs, cat, 10), SpanNear(runs, cat, 3)) That should return documents that have runs within 10 positions of cat, as long as they don't overlap with runs within 3 positions of cat. Alan Woodward www.flax.co.uk On 18 Jan 2013, at 16:13, Igor Shalyminov wrote: Hello! I want to perform search queries like this one: word:"dog" \1 word:"runs" (\3 \10) word:"cat" It is thus something like SpanNearQuery, but with two boundaries - minimum and maximum distance between the terms (which in the \1 case would be equal). The syntax (as above, fictional) itself doesn't matter, I just want to know if one is able to build this type of query based on existing (Lucene 4.0.0) query classes. -- Best Regards, Igor Shalyminov - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
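Alan's suggestion spelled out for the runs/cat example (Lucene 4.0; the field name is taken from Igor's fictional syntax):

    SpanQuery runs = new SpanTermQuery(new Term("word", "runs"));
    SpanQuery cat = new SpanTermQuery(new Term("word", "cat"));
    SpanQuery within10 = new SpanNearQuery(new SpanQuery[] {runs, cat}, 10, false);
    SpanQuery within3  = new SpanNearQuery(new SpanQuery[] {runs, cat}, 3, false);
    // matches of the wide span that do not overlap a tight span
    Query minMax = new SpanNotQuery(within10, within3);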
Re: Indexing multiple fields with one document position
Send the same input text to two different analyzers for two separate fields. The first analyzer emits only the first attribute. The second analyzer emits only the second attribute. The document position in one will correspond to the document position in the other. -- Jack Krupansky -Original Message- From: Igor Shalyminov Sent: Monday, January 21, 2013 3:04 AM To: java-user@lucene.apache.org Subject: Indexing multiple fields with one document position Hello! When indexing text with position data, one just adds a field to a document in the form of its name and value, and the indexer assigns it a unique position in the index. I wonder, if I have an entry with two attributes, say: cat, How do I store in the index two fields, "pos" and "number", with their values, pointing to the same position in the document? -- Best Regards, Igor Shalyminov - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content
You may be able to use Tika directly without needing to choose the specific classes, although the latter may give you the specific data you need without the extra overhead. You could take a look at the Solr Extracting Request Handler source for an example: http://svn.apache.org/viewvc/lucene/dev/tags/lucene_solr_4_1_0/solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/ Basically, Tika extracts a bunch of "metadata" and then you will have to add selected metadata to your Lucene documents. "content" is the main document body text. You could try Solr itself to see how it works: http://wiki.apache.org/solr/ExtractingRequestHandler -- Jack Krupansky -Original Message- From: Adrien Grand Sent: Sunday, January 27, 2013 12:53 PM To: java-user@lucene.apache.org Subject: Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content Have you tried using the PDFParser [1] and the OfficeParser [2] classes from Tika? This question seems to be more appropriate for the Tika user mailing list [3]? [1] http://tika.apache.org/1.3/api/org/apache/tika/parser/pdf/PDFParser.html#parse(java.io.InputStream, org.xml.sax.ContentHandler, org.apache.tika.metadata.Metadata, org.apache.tika.parser.ParseContext) [2] http://tika.apache.org/1.3/api/org/apache/tika/parser/microsoft/OfficeParser.html#parse(java.io.InputStream, org.xml.sax.ContentHandler, org.apache.tika.metadata.Metadata, org.apache.tika.parser.ParseContext) [3] http://tika.apache.org/mail-lists.html -- Adrien - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content
Re-read my last message - and then take a look at that Solr source code, which will give you an idea how to use Tika, even though you are using Lucene only. If you have specific questions, please be specific. To answer your latest question, yes, Tika is good enough. Solr /update/extract uses it, and Solr is based on Lucene. -- Jack Krupansky -Original Message- From: saisantoshi Sent: Sunday, January 27, 2013 2:09 PM To: java-user@lucene.apache.org Subject: Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content We are not using Solr, just the Lucene core 4.0 engine. I am trying to see if we can use the Tika library to extract textual information from pdf/word/excel documents. I am mainly interested in reading the contents inside the documents and indexing them using Lucene. My question here is: is the Tika framework good enough, or is there any other better library? Any issues/experiences in using the Tika framework? Thanks, Sai. -- View this message in context: http://lucene.472066.n3.nabble.com/Readers-for-extracting-textual-info-from-pd-doc-excel-for-indexing-the-actual-content-tp4036379p4036557.html Sent from the Lucene - Java Users mailing list archive at Nabble.com. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
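A minimal sketch of the Lucene-only route, assuming Tika 1.x on the classpath. AutoDetectParser picks the right parser for PDF/Word/Excel; the "title" metadata key is an assumption (the exact keys vary by document type), and the field names are arbitrary:

    import java.io.InputStream;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    Document toLuceneDoc(InputStream in) throws Exception {
      BodyContentHandler handler = new BodyContentHandler(-1); // -1: no write limit
      Metadata metadata = new Metadata();
      new AutoDetectParser().parse(in, handler, metadata, new ParseContext());
      Document doc = new Document();
      doc.add(new TextField("contents", handler.toString(), Field.Store.NO)); // body text
      String title = metadata.get("title");                                   // key varies by format
      if (title != null) {
        doc.add(new TextField("title", title, Field.Store.YES));
      }
      return doc;
    }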
Re: Questions about FuzzyQuery in Lucene 4.x
Let's see your code that calls FuzzyQuery. If you happen to pass a prefixLength (3rd parameter) of 3 or more, then "ster" would not match "star" (but a prefixLength of 2 would match). -- Jack Krupansky -Original Message- From: George Kelvin Sent: Monday, January 28, 2013 5:31 PM To: java-user@lucene.apache.org Subject: Questions about FuzzyQuery in Lucene 4.x Hi All, I'm working on several projects requiring powerful search features. I've been waiting for Lucene 4 since I read Michael McCandless's blog post about Lucene's new FuzzyQuery and finally I got the chance to test it. The improvement on the fuzzy search is really impressive! However, I've encountered a problem today: In my data, there are some records with keywords "star" and "wars". But when I issued a fuzzy query with two keywords "ster" and "wats", the engine failed to find the records. I'm wondering if you can provide any inputs on that. Maybe I'm not doing fuzzy search in the right way. But all my other fuzzy queries with a single keyword and with longer double keywords worked perfectly. Another issue is that I'm also exploring the possibility to do wildcard+fuzzy search using Lucene. I couldn't find any related document for this on Lucene's website, but I found a stackoverflow thread talking about this. http://stackoverflow.com/questions/2631206/lucene-query-bla-match-words-that-start-with-something-fuzzy-how I tried the way suggested by the second answer there, and it worked. However the scoring is strange: all results were assigned exactly the same score. Is there anything I can do to get the scoring right? Can you tell me what's the best way to do wildcard fuzzy search? Any help will be appreciated! Thanks, George - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
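Jack's prefixLength point as code (Lucene 4.x): "ster" and "star" differ at the third character, so any prefixLength up to 2 leaves that edit reachable, while 3 or more demands an exact "ste" prefix:

    Term t = new Term("f", "ster");
    Query hitsStar   = new FuzzyQuery(t, 2, 2); // prefix "st" matches "star", edit allowed
    Query missesStar = new FuzzyQuery(t, 2, 3); // prefix "ste" != "sta", "star" can never match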
Re: Questions about FuzzyQuery in Lucene 4.x
That depends on the value of "ed", and the indexed data. Another factor to take into consideration is that a case change ("Star" vs. "star") also counts as an edit. -- Jack Krupansky -Original Message- From: George Kelvin Sent: Tuesday, January 29, 2013 11:49 AM To: java-user@lucene.apache.org Subject: Re: Questions about FuzzyQuery in Lucene 4.x Hi Jack, Thanks for your reply! I don't think I passed the prefixLength parameter in. Here is the code I used to build the FuzzyQuery: String[] words = str.split("\\+"); BooleanQuery query = new BooleanQuery(); for (int i=0; i< ... George - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Questions about FuzzyQuery in Lucene 4.x
I also noticed that you have "MUST" for your full string of fuzzy terms - that means every one of them must appear in an indexed document for it to be matched. Is it possible that maybe even one term was not in the same indexed document? Try to provide a complete example that shows the input data and the query - all the literals. In other words, construct a minimal test case that shows the failure. -- Jack Krupansky -Original Message- From: George Kelvin Sent: Tuesday, January 29, 2013 12:28 PM To: java-user@lucene.apache.org Subject: Re: Questions about FuzzyQuery in Lucene 4.x Hi Jack, ed is set to 1 here and I have lowercased all the data and queries. Regarding the indexed data factor you mentioned, can you elaborate more? Thanks! George On Tue, Jan 29, 2013 at 9:10 AM, Jack Krupansky wrote: That depends on the value of "ed", and the indexed data. Another factor to take into consideration is that a case change ("Star" vs. "star") also counts as an edit. -- Jack Krupansky -Original Message- From: George Kelvin Sent: Tuesday, January 29, 2013 11:49 AM To: java-user@lucene.apache.org Subject: Re: Questions about FuzzyQuery in Lucene 4.x Hi Jack, Thanks for your reply! I don't think I passed the prefixLength parameter in. Here is the code I used to build the FuzzyQuery: String[] words = str.split("\\+"); BooleanQuery query = new BooleanQuery(); for (int i=0; i< ... - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Questions about FuzzyQuery in Lucene 4.x
I'm sorry, but for anybody to help you here, you really need to be able to provide a concise test case, like 10-20 lines of code, completely self-contained. If you think you need a million documents to repro what you claimed was a simple scenario, then you leave me very, very confused - and unable to help you any further. -- Jack Krupansky -Original Message- From: George Kelvin Sent: Tuesday, January 29, 2013 2:43 PM To: java-user@lucene.apache.org Subject: Re: Questions about FuzzyQuery in Lucene 4.x Hi Jack, The problematic query is "scar"+"wads". There are several (more than 10) documents in the data with the content "star wars", so I think that query should be able to find all these documents. I was trying to provide a minimal test case, but I couldn't reduce the size of the data showing the failure. The size of the minimal data showing the failure I got so far is around 2 million documents. However, I found a suspicious document with content "scor". If I remove it from the 2 million documents, that query can find all the "star wars" documents. If I add it back, then the query can't find any. I tried to reduce the size of the data to 1 million and add that "scor" document, but now the query can still find all the "star wars" documents. Is it possible that Lucene somehow fails to find all the valid terms within the edit distance? Thanks! George On Tue, Jan 29, 2013 at 10:02 AM, Jack Krupansky wrote: I also noticed that you have "MUST" for your full string of fuzzy terms - that means every one of them must appear in an indexed document for it to be matched. Is it possible that maybe even one term was not in the same indexed document? Try to provide a complete example that shows the input data and the query - all the literals. In other words, construct a minimal test case that shows the failure. -- Jack Krupansky -Original Message- From: George Kelvin Sent: Tuesday, January 29, 2013 12:28 PM To: java-user@lucene.apache.org Subject: Re: Questions about FuzzyQuery in Lucene 4.x Hi Jack, ed is set to 1 here and I have lowercased all the data and queries. Regarding the indexed data factor you mentioned, can you elaborate more? Thanks! George On Tue, Jan 29, 2013 at 9:10 AM, Jack Krupansky wrote: That depends on the value of "ed", and the indexed data. Another factor to take into consideration is that a case change ("Star" vs. "star") also counts as an edit. -- Jack Krupansky -Original Message- From: George Kelvin Sent: Tuesday, January 29, 2013 11:49 AM To: java-user@lucene.apache.org Subject: Re: Questions about FuzzyQuery in Lucene 4.x Hi Jack, Thanks for your reply! I don't think I passed the prefixLength parameter in. Here is the code I used to build the FuzzyQuery: String[] words = str.split("\\+"); BooleanQuery query = new BooleanQuery(); for (int i=0; i< ... - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
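One more knob that may be worth ruling out in an index that size (an assumption on my part, it was not raised in the thread): FuzzyQuery rewrites to a bounded set of the most similar terms, and defaultMaxExpansions is 50 in 4.x, so with many competing terms at edit distance 1 a term like "scor" can in principle crowd a wanted variant out of the rewritten query. The long constructor makes the budget explicit; the field name here is hypothetical:

    Term t = new Term("contents", "scar");
    // maxEdits=1, prefixLength=0, maxExpansions=1024, transpositions=true
    FuzzyQuery q = new FuzzyQuery(t, 1, 0, 1024, true);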
Re: How to find related words ?
Take a look at MoreLikeThisQuery: http://lucene.apache.org/core/4_1_0/queries/org/apache/lucene/queries/mlt/MoreLikeThisQuery.html And MoreLikeThis itself: http://lucene.apache.org/core/4_1_0/queries/org/apache/lucene/queries/mlt/MoreLikeThis.html So, the idea is to search for documents using your keyword(s) and ask Lucene to extract relevant terms from the top document(s). -- Jack Krupansky -Original Message- From: wgggfiy Sent: Wednesday, January 30, 2013 12:27 PM To: java-user@lucene.apache.org Subject: How to find related words ? In short, you put in a term like "Lucene", and the ideal output would be "solr", "index", "full-text search", and so on. How do I make it find the related words? Thanks. My idea is to use FuzzyQuery, or MoreLikeThis, or calc the score with all the terms and then sort. Any idea? - -- Email: wuqiu.m...@qq.com -- -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-find-related-words-tp4037462.html Sent from the Lucene - Java Users mailing list archive at Nabble.com. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
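If the goal is the terms themselves rather than similar documents, MoreLikeThis can also hand back its extracted terms directly. A sketch, assuming docNum is a top hit for the seed keyword, loosened frequency cutoffs, and a field name of my choosing:

    MoreLikeThis mlt = new MoreLikeThis(reader);
    mlt.setFieldNames(new String[] {"contents"}); // field name is an assumption
    mlt.setMinTermFreq(1);
    mlt.setMinDocFreq(1);
    String[] related = mlt.retrieveInterestingTerms(docNum); // e.g. "solr", "index", ...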
Re: How to find related words ?
Oh, so you wanted "similar" words! You should have said so... your inquiry said you were looking for "related" words. So, which is it? More specifically, what exactly are you looking for, in terms of the semantics? In any case, "find similar" (MoreLikeThis) is about the best you can do out of the box. -- Jack Krupansky -Original Message- From: Andrew Gilmartin Sent: Thursday, January 31, 2013 9:04 AM To: java-user@lucene.apache.org Subject: Re: How to find related words ? wgggfiy wrote: en, it seems nice, but I'm puzzled by you and Andrew Gilmartin above, what's the difference between you guys? The difference is that similar documents do not give you similar terms. Similar documents can show a correlation of terms -- i.e., wherever Lucene is mentioned so is Solr and Hadoop -- but in no way does this mean that the terms are similar. Accumulating similar and/or synonymous terms is a manual process. I am sure there are text mining tools/algorithms that make such discoveries, but I do not know about these. (I am a journeyman programmer, not a researcher.) If anyone does know about them, please share with this list. -- Andrew - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Lucene vs Glimpse
Generally, all of your example queries should work fine with Lucene, provided that you carefully choose your analyzer, or even use the StandardAnalyzer. The special characters like underscore and dot generally get treated as spaces and the resulting sequence of terms would match as a phrase. It won't be a 100% solution, but it should do reasonably well. Is there a query that was failing to match reasonably for you? -- Jack Krupansky -Original Message- From: Mathias Dahl Sent: Monday, February 04, 2013 1:01 PM To: java-user@lucene.apache.org Subject: Lucene vs Glimpse Hi, I have hacked together a small web front end to the Glimpse text indexing engine (see http://webglimpse.net/ for information). I am very happy with how Glimpse indexes and searches data. If I understand it correctly it uses a combination of an index and searching directly in the files themselves, as grep or other tools do. The problem is that I discovered it is not open source, and now that I want to extend the use from private to company-wide I will run into license problems/costs. So, I decided to try out Lucene. I tried the examples and changed them a bit to use another analyzer. But when I started to think about it I realized that I will not be able to build something like Glimpse. At least not easily. Why? I will try to explain: As stated above, Glimpse uses a combination of index and in-file search. This makes it very powerful in the sense that I can get hits for things that are not necessarily indexed as terms. Let's say I have a file with this content: ... import foo.bar.baz; ... With Glimpse, and without telling it how to index the content, I can find the above file using a search string like "foo" or "bar" but also, and this is important, using foo.bar.baz. Another example: We have a lot of PL/SQL source code, and often you can find code like this: ... My_Nice_API.Some_Method ... Here too, Glimpse is almost magic since it combines index and normal search. I can find the file above using "My_Nice_API" or "My_Nice_API.Some_Method". In a sense I can have the cake and eat it too. If I want to do similar "free" search stuff with Lucene I think I have to create analyzers for the different kinds of source code files, with fields for this and that. Quite an undertaking. Does anyone understand my point here, and am I correct that it would be hard to implement something as "free" as with Glimpse? I am not trying to criticize, just to understand how Lucene (and Glimpse) works. Oh, yes, Glimpse has one big drawback: it only supports search strings up to 32 characters. Thanks! /Mathias - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Wildcard in a text field
That description is too vague. Could you provide a couple of examples of index text and queries and what you expect those queries to match? If you simply want to query for "*" and "?" in "string" fields, escape them with a backslash. But if you want to match them in "text" fields, be sure to use an analyzer that preserves them, since they generally will be treated as spaces. -- Jack Krupansky -Original Message- From: Nicolas Roduit Sent: Friday, February 08, 2013 2:49 AM To: java-user@lucene.apache.org Subject: Wildcard in a text field I'm looking for a way of making a query on words which contain wildcards (* or ?). In general, we use wildcards in the query, not in the text. I haven't found anything in Lucene to build that. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Wildcard in a text field
Ah, okay... some people call that "prospective" search. In any case, there is no direct Lucene support that I know of. There are some references here: http://lucene.apache.org/core/4_0_0/memory/org/apache/lucene/index/memory/MemoryIndex.html -- Jack Krupansky -Original Message- From: Nicolas Roduit Sent: Friday, February 08, 2013 10:14 AM To: java-user@lucene.apache.org Subject: Re: Re: Wildcard in a text field For instance, I have a list of tags related to a text. Each text with its list of tags are put in a document and indexed by Lucene. If we consider that a tag is "buddh*" and I would like to make a query (e.g. "buddha" or "buddhism" or "buddhist") and find the document that contain "buddh*". Thanks, Le 08. 02. 13 13:35, Jack Krupansky a écrit : That description is too vague. Could you provide a couple of examples of index text and queries and what you expect those queries to match. If you simply want to query for "*" and "?" in "string" fields, escape them with a backslash. But if you want to escape them in "text" fields, be sure to use an analyzer that preserves them since they generally will be treated as spaces. -- Jack Krupansky -Original Message- From: Nicolas Roduit Sent: Friday, February 08, 2013 2:49 AM To: java-user@lucene.apache.org Subject: Wildcard in a text field I'm looking for a way of making a query on words which contain wildcards (* or ?). In general, we use wildcards in query, not in the text. I haven't find anything in Lucene to build that. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
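A prospective-search sketch along the lines of that reference, assuming Lucene 4.0: the wildcard tag is parsed once into a stored query, and each incoming text is indexed into a throwaway MemoryIndex and matched against it. Field names are arbitrary:

    MemoryIndex incoming = new MemoryIndex();
    incoming.addField("tags", "buddhism", new StandardAnalyzer(Version.LUCENE_40));
    Query storedTag = new WildcardQuery(new Term("tags", "buddh*"));
    float score = incoming.search(storedTag);
    boolean matches = score > 0.0f;  // "buddhism" matches the stored "buddh*" tag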
Re: fuzzy queries
You probably are not getting this document returned: list.add("strfffing_ m atcbbhing"); because... both terms have an edit distance greater than two. All the other documents have one or the other or both terms with an editing distance of 2 or less. Your query is essentially: Match a document if EITHER term matches. So, if NEITHER matches (within an editing distance of 2), the document is not a match. -- Jack Krupansky -Original Message- From: Pierre Antoine DuBoDeNa Sent: Saturday, February 09, 2013 12:52 PM To: java-user@lucene.apache.org Subject: Re: fuzzy queries With a query like string~ matching~ (without specifying a threshold) I get 14 results back. Could it be a problem with the analyzers? Here is the code: private File indexDir = new File("/a-directory-here"); private StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_35); private IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, analyzer); public static void main(String[] args) throws Exception { IndexProfiles Indexer = new IndexProfiles(); IndexWriter w = Indexer.CreateIndex(); ArrayList list = new ArrayList(); list.add("string matching"); list.add("string123 matching"); list.add("string matching123"); list.add("string123 matching123"); list.add("str4ing match2ing"); list.add("1string 2matching"); list.add("str_ing ma_tching"); list.add("string_matching"); list.add("strang mutching"); list.add("strrring maatchinng"); list.add("strfffing_ m atcbbhing"); list.add("str2ing__mat3ching"); list.add("string_m atching"); list.add("string matching another token"); list.add("strasding matc4hing ano23ther tok3en"); list.add("str4ing maaatching_another 2t oken"); for (String companyname:list) { Indexer.addSingleField(w, companyname); } int numDocs = w.numDocs(); System.out.println("# of Docs in Index: " + numDocs); w.close(); DoIndexQuery("string~ matching~"); } public static void DoIndexQuery(String query) throws IOException, ParseException { IndexProfiles Indexer = new IndexProfiles(); IndexReader reader = Indexer.LoadIndex(); Indexer.SearchIndex(reader, query, 50); reader.close(); } public IndexWriter CreateIndex() throws IOException { Directory index = FSDirectory.open(indexDir); IndexWriter w = new IndexWriter(index, config); return w; } public HashMap SearchIndex(IndexReader w, String query, int topk) throws IOException, ParseException { Query q = new QueryParser(Version.LUCENE_35, "Name", analyzer).parse(query); IndexSearcher searcher = new IndexSearcher(w); TopScoreDocCollector collector = TopScoreDocCollector.create(topk, true); searcher.search(q, collector); ScoreDoc[] hits = collector.topDocs().scoreDocs; System.out.println("Found " + hits.length + " hits."); HashMap map = new HashMap(); for (int i=0; i< ... Can you reduce your test case to indexing one document/field and running a single FuzzyQuery (you seem to be running two at once, OR'ing the results)? And show the complete standalone source code (eg what is topk?) so we can see how you are indexing / building the Query / searching. The default minSim is 0.5. Note that 0.01 is not useful in practice: it (should) match nearly all terms. But I agree it's odd one term is not matching. Mike McCandless http://blog.mikemccandless.com On Sat, Feb 9, 2013 at 5:20 AM, Pierre Antoine DuBoDeNa wrote: >> >> Hello, >> >> I use lucene 3.6 and I try to use fuzzy queries so that I can match >> much >> more results.
>> >> I am adding for example these strings: >> >> list.add("string matching"); >> >> list.add("string123 matching"); >> >> list.add("string matching123"); >> >> list.add("string123 matching123"); >> >> list.add("str4ing match2ing"); >> >> list.add("1string 2matching"); >> >> list.add("str_ing ma_tching"); >> >> list.add("string_matching"); >> >> list.add("strang mutching"); >> >> list.add("strrring maatchinng"); >> >> list.add("strfffing_ m atcbbhing"); >> >> list.add("str2ing__mat3ching"); >> >> list.add("string_m atching"); >> >> list.add("string matching another token"); >> >> list.add("strasding matc4hing ano23ther tok3en"); >> >> list.add("str4ing maaatching_another 2t oken"); >> >> >> >> then i do a query: >> >> >> "string~0.01 matching~0.01" >&
Re: Grouping and tokens
Please clarify exactly what you want to group by - give a specific example that makes it clear what terms should affect grouping and which shouldn't. -- Jack Krupansky -Original Message- From: Ramprakash Ramamoorthy Sent: Monday, February 18, 2013 6:12 AM To: java-user@lucene.apache.org Subject: Grouping and tokens Hello all, From the grouping javadoc, I read that fields that are supposed to be grouped should not be tokenized. I have a use case where the user has the freedom to group by any field at search time. Now that only non-tokenized fields are eligible for grouping, this is creating an issue with my search. Say for instance the book "*Fifty shades of grey*" when tokenized and searched for "*shades*" turns up in the result. However this is not the case when I have it as a non-tokenized field (using StandardAnalyzer-Version4.1). How do I go about this? Is indexing a tokenized and a non-tokenized version of the same field the only way to go? I am afraid it's way too costly! Thanks in advance for your valuable inputs. -- With Thanks and Regards, Ramprakash Ramamoorthy, India, +91 9626975420 - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Grouping and tokens
Okay, so, fields that would normally need to be tokenized must be stored as both raw strings for grouping and tokenized text for keyword search. Simply use copyField to copy from one to the other. -- Jack Krupansky -Original Message- From: Ramprakash Ramamoorthy Sent: Monday, February 18, 2013 11:13 PM To: java-user@lucene.apache.org Subject: Re: Grouping and tokens On Mon, Feb 18, 2013 at 9:47 PM, Jack Krupansky wrote: Please clarify exactly what you want to group by - give a specific example that makes it clear what terms should affect grouping and which shouldn't. Assume I am indexing library data. Say there are the following fields for a particular book. 1. Published 2. Language 3. Genre 4. Author 5. Title 6. ISBN At search time, the user can ask to group by any of the above fields, which means none of them can be tokenized. So as I had told earlier, there is a book titled "Fifty shades of gray" and the user searches for "shades". The result turns up in case the field is tokenized. But here it doesn't, since it isn't tokenized. Hope I am clear? In a nutshell, how do I use a groupby on a field that is also tokenized? -- Jack Krupansky -Original Message- From: Ramprakash Ramamoorthy Sent: Monday, February 18, 2013 6:12 AM To: java-user@lucene.apache.org Subject: Grouping and tokens Hello all, From the grouping javadoc, I read that fields that are supposed to be grouped should not be tokenized. I have a use case where the user has the freedom to group by any field at search time. Now that only non-tokenized fields are eligible for grouping, this is creating an issue with my search. Say for instance the book "*Fifty shades of grey*" when tokenized and searched for "*shades*" turns up in the result. However this is not the case when I have it as a non-tokenized field (using StandardAnalyzer-Version4.1). How do I go about this? Is indexing a tokenized and a non-tokenized version of the same field the only way to go? I am afraid it's way too costly! Thanks in advance for your valuable inputs. -- With Thanks and Regards, Ramprakash Ramamoorthy, India, +91 9626975420 - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org -- With Thanks and Regards, Ramprakash Ramamoorthy, India. +91 9626975420 - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Grouping and tokens
Oops, sorry for the "Solr" answer. In Lucene you need to simply index the same value, once as a raw string and a second time as a tokenized text field. Grouping would use the raw string version of the data. -- Jack Krupansky -Original Message----- From: Jack Krupansky Sent: Monday, February 18, 2013 11:21 PM To: java-user@lucene.apache.org Subject: Re: Grouping and tokens Okay, so, fields that would normally need to be tokenized must be stored as both raw strings for grouping and tokenized text for keyword search. Simply use copyField to copy from one to the other. -- Jack Krupansky -Original Message- From: Ramprakash Ramamoorthy Sent: Monday, February 18, 2013 11:13 PM To: java-user@lucene.apache.org Subject: Re: Grouping and tokens On Mon, Feb 18, 2013 at 9:47 PM, Jack Krupansky wrote: Please clarify exactly what you want to group by - give a specific example that makes it clear what terms should affect grouping and which shouldn't. Assume I am indexing a library data. Say there are the following fields for a particular book. 1. Published 2. Language 3. Genre 4. Author 5. Title 6. ISBN While search time, the user can ask to group by any of the above fields, which means all of them are not supposed to be tokenized. So as I had told earlier, there is a book titled "Fifty shades of gray" and the user searches for "shades". The result turns up in case the field is tokenized. But here it doesn't, since it isn't tokenized. Hope I am clear? In a nutshell, how do I use a groupby on a field that is also tokenized? -- Jack Krupansky -Original Message- From: Ramprakash Ramamoorthy Sent: Monday, February 18, 2013 6:12 AM To: java-user@lucene.apache.org Subject: Grouping and tokens Hello all, From the grouping javadoc, I read that fields that are supposed to be grouped should not be tokenized. I have an use case where the user has the freedom to group by any field during search time. Now that only tokenized fields are eligible for grouping, this is creating an issue with my search. Say for instance the book "*Fifty shades of grey*" when tokenized and searched for "*shades*" turns up in the result. However this is not the case when I have it as a non-tokenized field (using StandardAnalyzer-Version4.1). How do I go about this? Is indexing a tokenized and non-tokenized version of the same field the only go? I am afraid its way too costly! Thanks in advance for your valuable inputs. -- With Thanks and Regards, Ramprakash Ramamoorthy, India, +91 9626975420 --**--**- To unsubscribe, e-mail: java-user-unsubscribe@lucene.**apache.org For additional commands, e-mail: java-user-help@lucene.apache.**org -- With Thanks and Regards, Ramprakash Ramamoorthy, India. +91 9626975420 - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Grouping and tokens
Well, you don't need to "store" both copies since they will be the same. They both need to be "indexed" (string form for grouping, text form for keyword search), but only one needs to be "stored". -- Jack Krupansky -Original Message- From: Ramprakash Ramamoorthy Sent: Tuesday, February 19, 2013 1:07 AM To: java-user@lucene.apache.org Subject: Re: Grouping and tokens On Tue, Feb 19, 2013 at 12:57 PM, Jack Krupansky wrote: Oops, sorry for the "Solr" answer. In Lucene you need to simply index the same value, once as a raw string and a second time as a tokenized text field. Grouping would use the raw string version of the data. Yeah, thanks Jack. Was just wondering if there would be a better alternate rather than 2x storing. But I don't see any. Thanks again. -- Jack Krupansky -Original Message- From: Jack Krupansky Sent: Monday, February 18, 2013 11:21 PM To: java-user@lucene.apache.org Subject: Re: Grouping and tokens Okay, so, fields that would normally need to be tokenized must be stored as both raw strings for grouping and tokenized text for keyword search. Simply use copyField to copy from one to the other. -- Jack Krupansky -Original Message- From: Ramprakash Ramamoorthy Sent: Monday, February 18, 2013 11:13 PM To: java-user@lucene.apache.org Subject: Re: Grouping and tokens On Mon, Feb 18, 2013 at 9:47 PM, Jack Krupansky **wrote: Please clarify exactly what you want to group by - give a specific example that makes it clear what terms should affect grouping and which shouldn't. Assume I am indexing a library data. Say there are the following fields for a particular book. 1. Published 2. Language 3. Genre 4. Author 5. Title 6. ISBN While search time, the user can ask to group by any of the above fields, which means all of them are not supposed to be tokenized. So as I had told earlier, there is a book titled "Fifty shades of gray" and the user searches for "shades". The result turns up in case the field is tokenized. But here it doesn't, since it isn't tokenized. Hope I am clear? In a nutshell, how do I use a groupby on a field that is also tokenized? -- Jack Krupansky -Original Message- From: Ramprakash Ramamoorthy Sent: Monday, February 18, 2013 6:12 AM To: java-user@lucene.apache.org Subject: Grouping and tokens Hello all, From the grouping javadoc, I read that fields that are supposed to be grouped should not be tokenized. I have an use case where the user has the freedom to group by any field during search time. Now that only tokenized fields are eligible for grouping, this is creating an issue with my search. Say for instance the book "*Fifty shades of grey*" when tokenized and searched for "*shades*" turns up in the result. However this is not the case when I have it as a non-tokenized field (using StandardAnalyzer-Version4.1). How do I go about this? Is indexing a tokenized and non-tokenized version of the same field the only go? I am afraid its way too costly! Thanks in advance for your valuable inputs. -- With Thanks and Regards, Ramprakash Ramamoorthy, India, +91 9626975420 --** --**- To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org< java-user-**unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-help@lucene.apache.org< java-user-help@lucene.**apache.org > -- With Thanks and Regards, Ramprakash Ramamoorthy, India. 
+91 9626975420 --**--**- To unsubscribe, e-mail: java-user-unsubscribe@lucene.**apache.org For additional commands, e-mail: java-user-help@lucene.apache.**org --**--**- To unsubscribe, e-mail: java-user-unsubscribe@lucene.**apache.org For additional commands, e-mail: java-user-help@lucene.apache.**org -- With Thanks and Regards, Ramprakash Ramamoorthy, India, +91 9626975420 - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
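The final arrangement as code (Lucene 4.1): tokenized and stored once for search and display, indexed raw a second time for grouping. The raw field's name is my own choice:

    Document doc = new Document();
    String title = "Fifty shades of grey";
    doc.add(new TextField("title", title, Field.Store.YES));        // tokenized: "shades" matches
    doc.add(new StringField("title_group", title, Field.Store.NO)); // single raw term, not stored

    // grouping then runs on the raw field
    GroupingSearch groupingSearch = new GroupingSearch("title_group");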
Re: possible bug on Spellchecker
Any reason that you are not using the DirectSpellChecker? See: http://lucene.apache.org/core/4_0_0/suggest/org/apache/lucene/search/spell/DirectSpellChecker.html -- Jack Krupansky -Original Message- From: Samuel García Martínez Sent: Wednesday, February 20, 2013 3:34 PM To: java-user@lucene.apache.org Subject: possible bug on Spellchecker Hi all, Debugging the Solr spellchecker (IndexBasedSpellchecker, delegating to the Lucene Spellchecker) behaviour, I think I found a bug when the input is a 6-letter word: - george - anthem - argued - fluent Due to getMin() and getMax(), the grams indexed for these terms are 3 and 4. So, the fields would be something like this: - for "*george*" - start3: "geo" - start4: "geor" - end3: "rge" - end4: "orge" - 3: "geo", "eor", "org", "rge" - 4: "geor", "eorg", "orge" - for "*anthem*" - start3: "ant" - start4: "anth" - end3: "tem" - end4: "them" The problem shows up when the user swaps the 3rd and 4th characters, misspelling the word like this: - geroge - anhtem The queries generated for these terms are: (SHOULD boolean queries) - for "*geroge*" - start3: "ger" - start4: "gero" - end3: "oge" - end4: "roge" - 3: "ger", "ero", "rog", "oge" - 4: "gero", "erog", "roge" - for "*anhtem*" - start3: "anh" - start4: "anht" - end3: "tem" - end4: "htem" - 3: "anh", "nht", "hte", "tem" - 4: "anht", "nhte", "htem" So, as you can see, this kind of misspelling never matches the suitable suggestions although the similarity is 0.9556. I think getMin(int l) and getMax(int l) should return 2 and 3, respectively, for l==6. Debugging other values I did not find any problem with any kind of misspelling. Any thoughts about this? -- Un saludo, Samuel García - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
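For comparison, the suggested alternative works straight off the main index, so no n-gram sidecar index (and none of its gram-length edge cases) is involved. A sketch, assuming Lucene 4.0 and a field name of my choosing:

    DirectSpellChecker checker = new DirectSpellChecker();
    SuggestWord[] suggestions =
        checker.suggestSimilar(new Term("body", "geroge"), 5, indexReader);
    for (SuggestWord s : suggestions) {
      System.out.println(s.string + " freq=" + s.freq + " score=" + s.score);
    }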
Re: Boolean Query not working in Lucene 4.0
Try detailing both your expected behavior and the actual behavior. Try providing an actual code snippet and actual index and query data. Is it failing for all types and titles or just for some? -- Jack Krupansky -Original Message- From: saisantoshi Sent: Tuesday, February 26, 2013 6:52 PM To: java-user@lucene.apache.org Subject: Boolean Query not working in Lucene 4.0 The following query does not seem to work after we upgraded from 2.4 to 4.0: +type:sometype +title:sometitle Any ideas as to what are some of the places to look for? Is the above query correct in syntax? Appreciate it if you could advise on the above. Thanks, Sai. -- View this message in context: http://lucene.472066.n3.nabble.com/Boolean-Query-not-working-in-Lucene-4-0-tp4043246.html Sent from the Lucene - Java Users mailing list archive at Nabble.com. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Getting documents from suggestions
Could you give us some examples of what you expect? I mean, how is your suggested set of documents any different from simply executing a query with the list of suggested terms (using q.op=OR)? Or, maybe you want something like MoreLikeThis? -- Jack Krupansky -Original Message- From: Bratislav Stojanovic Sent: Thursday, March 14, 2013 5:36 PM To: java-user@lucene.apache.org Subject: Getting documents from suggestions Hi all, How can I filter suggestions based on some value from the indexed field? I have a stored 'id' field in my index and I want to use that to examine documents where the suggestion was found, but how to get Document from suggestion? SpellChecker class only returns array of strings. What classes should I use? Please help. Thanx in advance. -- Bratislav Stojanovic, M.Sc. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Getting documents from suggestions
Let's refine this... If a top suggestion is X, do you simply want to know a few of the documents which have the highest term frequency for X? Or is there some other term-oriented metric you might propose? -- Jack Krupansky -Original Message- From: Bratislav Stojanovic Sent: Thursday, March 14, 2013 6:14 PM To: java-user@lucene.apache.org Subject: Re: Getting documents from suggestions Wow that was fast :) I have implemented a simple search box with auto-suggestions, so whenever the user types in something, an Ajax call is fired to the SuggestServlet and in return 10 suggestions are shown. It's working fine with the SpellChecker class, but I only get an array of Strings. What I want is to get Lucene Document instances so I can use doc.get("id") to filter those suggestions. This field is my own field; it has nothing to do with the internal doc id Lucene generates. Here's an example: when I type "apache" I get suggestions like "apache.org", "apache2" etc. Now I want to have something like this: Document doc = SomeClass.getDocFromSuggestion("apache.org"); if (doc.get("id") == ...) { //add suggestion into the result } else { //do nothing. } Is MoreLikeThis designed for this? On Thu, Mar 14, 2013 at 10:45 PM, Jack Krupansky wrote: Could you give us some examples of what you expect? I mean, how is your suggested set of documents any different from simply executing a query with the list of suggested terms (using q.op=OR)? Or, maybe you want something like MoreLikeThis? -- Jack Krupansky -Original Message- From: Bratislav Stojanovic Sent: Thursday, March 14, 2013 5:36 PM To: java-user@lucene.apache.org Subject: Getting documents from suggestions Hi all, How can I filter suggestions based on some value from the indexed field? I have a stored 'id' field in my index and I want to use that to examine documents where the suggestion was found, but how to get Document from suggestion? SpellChecker class only returns array of strings. What classes should I use? Please help. Thanx in advance. -- Bratislav Stojanovic, M.Sc. - To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org For additional commands, e-mail: java-user-help@lucene.apache.org -- Bratislav Stojanovic, M.Sc. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Getting documents from suggestions
I don't have time right now to debug your code, but make sure that the analysis is consistent between index and query. For example, "Apache" vs. "apache". -- Jack Krupansky -Original Message- From: Bratislav Stojanovic Sent: Saturday, March 16, 2013 7:29 AM To: java-user@lucene.apache.org Subject: Re: Getting documents from suggestions Hey Jack, I've tried MoreLikeThis, but it always returns me 0 hits. Here's the code, it's very simple: // test2 Index lucene = null; try { lucene = new Index(); MoreLikeThis mlt = new MoreLikeThis(lucene.reader); mlt.setAnalyzer(lucene.analyzer); Reader target = new StringReader("apache"); Query query = mlt.like(target, "contents"); TopDocs results = lucene.searcher.search(query, 10); ScoreDoc[] hits = results.scoreDocs; System.out.println("Total "+hits.length+" hits"); for (int i = 0; i < hits.length; i++) { Document doc = lucene.searcher.doc(hits[i].doc); System.out.println("Hit "+i+" : "+doc.getField("id").stringValue()); } } catch (Exception e) { e.printStackTrace(); } finally { if (lucene != null) lucene.close(); } Here are my fields in the index: ... Field pathField = new StringField("path", file.getPath(), Field.Store.YES); doc.add(pathField); doc.add(new LongField("modified", file.lastModified(), Field.Store.NO)); // id so we can fetch metadata doc.add(new LongField("id", id, Field.Store.YES)); // add default user (guest) doc.add(new LongField("userid", -1L, Field.Store.YES)); doc.add(new TextField("contents", new BufferedReader(new InputStreamReader(fis, "UTF-8")))); ... Do you have any clue what might be wrong? Btw, searching for "apache" returns me 36 hits, so the index is fine. P.S. I even tried like(int) and passing the doc id, but same thing. On Thu, Mar 14, 2013 at 10:45 PM, Jack Krupansky wrote: Could you give us some examples of what you expect? I mean, how is your suggested set of documents any different from simply executing a query with the list of suggested terms (using q.op=OR)? Or, maybe you want something like MoreLikeThis? -- Jack Krupansky -Original Message- From: Bratislav Stojanovic Sent: Thursday, March 14, 2013 5:36 PM To: java-user@lucene.apache.org Subject: Getting documents from suggestions Hi all, How can I filter suggestions based on some value from the indexed field? I have a stored 'id' field in my index and I want to use that to examine documents where the suggestion was found, but how to get Document from suggestion? SpellChecker class only returns array of strings. What classes should I use? Please help. Thanx in advance. -- Bratislav Stojanovic, M.Sc. - To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org For additional commands, e-mail: java-user-help@lucene.apache.org -- Bratislav Stojanovic, M.Sc. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
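One detail worth checking in code like the above (a guess, not a confirmed diagnosis): MoreLikeThis has conservative defaults - a minimum term frequency of 2 and a minimum document frequency of 5 - so a one-word input such as "apache" contributes no query terms at all and yields 0 hits even when the index is fine. A sketch with the thresholds lowered, reusing the poster's Index wrapper:

import java.io.StringReader;
import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.search.Query;

MoreLikeThis mlt = new MoreLikeThis(lucene.reader);
mlt.setAnalyzer(lucene.analyzer);
mlt.setMinTermFreq(1); // default 2: a term must occur twice in the input text
mlt.setMinDocFreq(1);  // default 5: and appear in at least 5 documents
Query query = mlt.like(new StringReader("apache"), "contents");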
Re: Multi-value fields in Lucene 4.1
I don't think there is a way of identifying which of the values of a multivalued field matched. But... I haven't checked the code to be absolutely certain whether there isn't some expert way. Also, realize that multiple values could match, such as if you queried for "B*". -- Jack Krupansky -Original Message- From: Chris Bamford Sent: Friday, March 22, 2013 5:57 AM To: java-user@lucene.apache.org Subject: Multi-value fields in Lucene 4.1 Hi, If I index several similar values in a multivalued field (e.g. many authors to one book), is there any way to know which of these matched during a query? e.g. Book "The art of Stuff", with authors "Bob Thingummy" and "Belinda Bootstrap" If we queried for +(author:Be*) and matched this document, is there a way of drilling down and identifying the specific sub-field that actually triggered the match ("Belinda Bootstrap")? I was wondering what the lowest granularity of matching actually is - document / field / sub-field ... I am happy to index with term vectors and positions if it helps. Thanks, - Chris - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
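A pragmatic workaround at the application level (a sketch, not an index-level answer, and it assumes the values are stored): re-test each stored value of a hit against the same predicate after the search:

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexableField;

// searcher and scoreDoc come from the enclosing search loop
Document doc = searcher.doc(scoreDoc.doc);
for (IndexableField f : doc.getFields("author")) {
    // re-apply the Be* prefix test per value; picks out "Belinda Bootstrap"
    if (f.stringValue().toLowerCase().startsWith("be")) {
        System.out.println("matched value: " + f.stringValue());
    }
}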
Re: Accent insensitive analyzer
Try the ASCII Folding Filter: https://lucene.apache.org/core/4_2_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html -- Jack Krupansky -Original Message- From: Jerome Blouin Sent: Friday, March 22, 2013 12:22 PM To: java-user@lucene.apache.org Subject: Accent insensitive analyzer Hello, I'm looking for an analyzer that allows performing accent-insensitive search in Latin languages. I'm currently using the StandardAnalyzer but it doesn't fulfill this need. Could you please point me to the one I need to use? I've checked the javadoc for the various analyzer packages but can't find one. Do I need to implement my own analyzer? Regards, Jerome - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Accent insensitive analyzer
Start with the Standard Tokenizer: https://lucene.apache.org/core/4_2_0/analyzers-common/org/apache/lucene/analysis/standard/StandardTokenizer.html -- Jack Krupansky -Original Message- From: Jerome Blouin Sent: Friday, March 22, 2013 12:53 PM To: java-user@lucene.apache.org Subject: RE: Accent insensitive analyzer I understand that I can't configure it on an analyzer, so on which class can I apply it? Thanks, Jerome -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Friday, March 22, 2013 12:38 PM To: java-user@lucene.apache.org Subject: Re: Accent insensitive analyzer Try the ASCII Folding Filter: https://lucene.apache.org/core/4_2_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html -- Jack Krupansky -Original Message- From: Jerome Blouin Sent: Friday, March 22, 2013 12:22 PM To: java-user@lucene.apache.org Subject: Accent insensitive analyzer Hello, I'm looking for an analyzer that allows performing accent-insensitive search in Latin languages. I'm currently using the StandardAnalyzer but it doesn't fulfill this need. Could you please point me to the one I need to use? I've checked the javadoc for the various analyzer packages but can't find one. Do I need to implement my own analyzer? Regards, Jerome - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
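Combining the two replies, a minimal accent-insensitive analyzer might look like this (a sketch against the 4.2 API; use the same analyzer at both index and query time):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

Analyzer accentInsensitive = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new StandardTokenizer(Version.LUCENE_42, reader);
        TokenStream result = new LowerCaseFilter(Version.LUCENE_42, source);
        result = new ASCIIFoldingFilter(result); // "Gómez", "Gômez" -> "gomez"
        return new TokenStreamComponents(source, result);
    }
};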
Re: Indexing a long list
The first question is how do you want to access the data? What do you want your queries to look like? What is the larger context? Are these properties of larger documents? Are there more than one per document? Etc. Why not just store the property as a tokenized field? Then you can query whether v(i) or v(j) are or are not present as keywords. -- Jack Krupansky -Original Message- From: Paul Bell Sent: Sunday, March 31, 2013 8:21 AM To: java-user@lucene.apache.org Subject: Indexing a long list Hi All, Suppose I need to index a property whose value is a long list of terms. For example, someProperty = ["v1", "v2", ..., "v100"] Please note that I could drop the leading "v" and index these as numbers instead of strings. But the question is what's the best practice in Lucene when dealing with a case like this? I need to be able to retrieve the list. This makes me think that I need to store it. And I suppose that the list could be stored in the index itself or in the "content" to which the index points. So there are really two parts to this question: 1. Lucene "best practices" for long lists 2. Where to store such a list Thanks for your help. -Paul - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Indexing a long list
Multivalued fields are the other approach to key/value pairs. And if you can denormalize your data, storing structure as separate documents can make sense and support more powerful queries, although the join capabilities are rather limited. -- Jack Krupansky -Original Message- From: Paul Bell Sent: Sunday, March 31, 2013 8:52 AM To: java-user@lucene.apache.org Subject: Re: Indexing a long list Hi Jack, Thanks for the reply. I am very new to Lucene. Your timing is a bit uncanny. I was just coming to the conclusion that there's nothing special about this case for Lucene, i.e., a tokenized field should work, when I looked up and saw your e-mail. In re the larger context: yeah, the properties in question here belong to some kind of node, e.g., maybe a vertex in a graph DB. Possible properties include 'name', 'type', 'inEdges', 'outEdges', etc. Most properties are simple k=v pairs. But a few, notably the 'edge' properties, could be long lists. My intent was to create a Lucene Document for each node. The Fields in this Document would represent all of the node's properties. A generic (not in Lucene syntax) query should be able to ask after any property, e.g., ('name' equals "vol1" AND 'outEdges.name' startsWith "hasMirror") Note that 'outEdges.name' represents multiple elements, where 'name' represents only one. That is, the generic query syntax is trying to match any out-edge whose name property starts with "hasMirror". I haven't quite crystallized the generic query syntax and don't know how best to map it to both a Lucene query and to an appropriate Lucene index structure. Please let me know if you've any suggestions! Thanks again. -Paul On Sun, Mar 31, 2013 at 8:33 AM, Jack Krupansky wrote: The first question is how do you want to access the data? What do you want your queries to look like? What is the larger context? Are these properties of larger documents? Are there more than one per document? Etc. Why not just store the property as a tokenized field? Then you can query whether v(i) or v(j) are or are not present as keywords. -- Jack Krupansky -Original Message- From: Paul Bell Sent: Sunday, March 31, 2013 8:21 AM To: java-user@lucene.apache.org Subject: Indexing a long list Hi All, Suppose I need to index a property whose value is a long list of terms. For example, someProperty = ["v1", "v2", ..., "v100"] Please note that I could drop the leading "v" and index these as numbers instead of strings. But the question is what's the best practice in Lucene when dealing with a case like this? I need to be able to retrieve the list. This makes me think that I need to store it. And I suppose that the list could be stored in the index itself or in the "content" to which the index points. So there are really two parts to this question: 1. Lucene "best practices" for long lists 2. Where to store such a list Thanks for your help. -Paul - To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org For additional commands, e-mail: java-user-help@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
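For the node/edge example, the multivalued approach is just repeated field names (a 4.x sketch; outEdgeNames is an assumed variable, and the lower-cased prefix term assumes an analyzer that lower-cases at index time):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PrefixQuery;

Document node = new Document();
node.add(new TextField("name", "vol1", Field.Store.YES));
// one field instance per out-edge; repeated names form a multivalued field
for (String edgeName : outEdgeNames) {
    node.add(new TextField("outEdges.name", edgeName, Field.Store.YES));
}
// 'outEdges.name' startsWith "hasMirror" maps to a prefix query
PrefixQuery q = new PrefixQuery(new Term("outEdges.name", "hasmirror"));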
Re: MLT Using a Query created in a different index
The heart of MLT is examining the top result of a query (or maybe more than one) and identifying the "top" terms from the top document(s) and then simply using those top terms for a subsequent query. The term ranking would of course depend on term frequency, and other relevancy considerations - for the corpus of the original query. A rich query corpus will give great results, a weak corpus will give weak results - no matter how rich or weak the final target corpus is. OTOH, if the target corpus really is representative of the source corpus, then results should be either good or terrible - the selected/query document may not have any representation in the target corpus. -- Jack Krupansky -Original Message- From: Peter Lavin Sent: Thursday, April 04, 2013 1:06 PM To: java-user@lucene.apache.org Subject: MLT Using a Query created in a different index Dear Users, I am doing some research where Lucene is integrated into agent technology. Part of this work involves using an MLT query in an index which was not created from a document in that index (i.e. the query is created, serialised and sent to the remote agent). Can anyone point me towards any information on what the potential impact of doing this would be? I'm assuming if both indexes have similar sets of documents, the impact would be negligible, but what, for example, would be the impact of creating an MLT query from an index with only one or two documents for use in an index with several (say 100+) documents, with thanks, Peter -- with best regards, Peter Lavin, PhD Candidate, CAG - Computer Architecture & Grid Research Group, Lloyd Institute, 005, Trinity College Dublin, Ireland. +353 1 8961536 - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: MLT Using a Query created in a different index
In a statistical sense, for the majority of documents, yes, but you could probably find quite a few outlier examples where the results from A to B or from B to A are significantly or even completely different, or even non-existent. -- Jack Krupansky -Original Message- From: Peter Lavin Sent: Friday, April 05, 2013 3:49 AM To: java-user@lucene.apache.org Subject: Re: MLT Using a Query created in a different index Thanks for that Jack, so it's fair to say that if both the source and target corpus are large and diverse, then the impact of using a different index to create the query would be negligible. P. On 04/04/2013 06:49 PM, Jack Krupansky wrote: The heart of MLT is examining the top result of a query (or maybe more than one) and identifying the "top" terms from the top document(s) and then simply using those top terms for a subsequent query. The term ranking would of course depend on term frequency, and other relevancy considerations - for the corpus of the original query. A rich query corpus will give great results, a weak corpus will give weak results - no matter how rich or weak the final target corpus is. OTOH, if the target corpus really is representative of the source corpus, then results should be either good or terrible - the selected/query document may not have any representation in the target corpus. -- Jack Krupansky -Original Message- From: Peter Lavin Sent: Thursday, April 04, 2013 1:06 PM To: java-user@lucene.apache.org Subject: MLT Using a Query created in a different index Dear Users, I am doing some research where Lucene is integrated into agent technology. Part of this work involves using an MLT query in an index which was not created from a document in that index (i.e. the query is created, serialised and sent to the remote agent). Can anyone point me towards any information on what the potential impact of doing this would be? I'm assuming if both indexes have similar sets of documents, the impact would be negligible, but what, for example, would be the impact of creating an MLT query from an index with only one or two documents for use in an index with several (say 100+) documents, with thanks, Peter -- with best regards, Peter Lavin, PhD Candidate, CAG - Computer Architecture & Grid Research Group, Lloyd Institute, 005, Trinity College Dublin, Ireland. +353 1 8961536 - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: how to consider special character like + ! @ # in Lucene search
You'll have to switch from using the standard analyzer/tokenizer to using the whitespace analyzer/tokenizer, and make sure not to use any additional token filters that might eliminate some or all special characters (or provide character maps for the ones that do accept character maps.) You will need to completely reindex your data as well. And, in some cases you may need to escape some special characters with backslash in your queries. -- Jack Krupansky -Original Message- From: neeraj shah Sent: Wednesday, April 10, 2013 2:07 AM To: java-user@lucene.apache.org Subject: how to consider special character like + ! @ # in Lucene search Hello, I'm using Lucene 2.9. I have to search for special characters like "/" in a given text, but when I search it gives me 0 hits. I have tried QueryParser.escape("/") but did not get the result. How do I proceed further? Please help. -- With Regards, Neeraj Kumar Shah - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
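A minimal sketch of the switch, using the 2.9-era API to match the question (the field name is a placeholder); note that "/" is not a query-parser special character in 2.9, so once the analyzer stops dropping it, no escaping is needed for that particular case:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

Analyzer analyzer = new WhitespaceAnalyzer(); // splits on whitespace only, keeps "/"
// ... reindex everything with this analyzer, then query with the same one ...
QueryParser parser = new QueryParser(Version.LUCENE_29, "contents", analyzer);
Query q = parser.parse("some/text");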
Re: How to index Sharepoint files with Lucene
The Apache ManifoldCF "connector framework" has a SharePoint connector that can crawl SharePoint repositories. It has an output connector that feeds into Solr/SolrCell, but you can easily develop a connector that outputs whatever you want - like put the crawled files into a file system directory, or maybe even send each file directly into Tika and then directly index the content into Lucene, if that's what you want. In any case, MCF handles the SharePoint access and crawling. See: http://manifoldcf.apache.org/en_US/index.html -- Jack Krupansky -Original Message- From: Álvaro Vargas Quezada Sent: Wednesday, April 10, 2013 5:31 PM To: java-user@lucene.apache.org Subject: How to index Sharepoint files with Lucene Hi everyone! I'm trying to combine Lucene with SharePoint (we use Windows and SP 2010), but I couldn't find good tutorials or proven test cases that demonstrate this integration. Do you know of any tutorial, or can you give me some help with this? I have read all of "Lucene in Action", but it just talks about indexing files, not integration with other software. Thanks in advance. Greetz from Chile - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: WhitespaceTokenizer, incrementToken() ArrayIndexOutOfBoundsException
I didn't read your code, but do you have the "reset" that is now mandatory and throws AIOOBE if not present? -- Jack Krupansky -Original Message- From: andi rexha Sent: Monday, April 15, 2013 10:21 AM To: java-user@lucene.apache.org Subject: WhitespaceTokenizer, incrementToken() ArrayIndexOutOfBoundsException Hi, I have tried to get all the tokens from a TokenStream in the same way as I was doing in the 3.x version of Lucene, but now (at least with WhitespaceTokenizer) I get an exception: Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -1 at java.lang.Character.codePointAtImpl(Character.java:2405) at java.lang.Character.codePointAt(Character.java:2369) at org.apache.lucene.analysis.util.CharacterUtils$Java5CharacterUtils.codePointAt(CharacterUtils.java:164) at org.apache.lucene.analysis.util.CharTokenizer.incrementToken(CharTokenizer.java:166) The code is quite simple, and I thought that it could have worked, but obviously it doesn't (unless I have made some mistakes). Here is the code, in case you spot some bugs in it (although it is trivial): String str = "this is a test"; Reader reader = new StringReader(str); TokenStream tokenStream = new WhitespaceTokenizer(Version.LUCENE_42, reader); //tokenStreamAnalyzer.tokenStream("test", reader); CharTermAttribute attribute = tokenStream.getAttribute(CharTermAttribute.class); while (tokenStream.incrementToken()) { System.out.println(new String(attribute.buffer(), 0, attribute.length())); } Hope you have an idea of why it is happening. Regards, Andi - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: WhitespaceTokenizer, incrementToken() ArrayIndexOutOfBoundsException
Yes, reset was always "mandatory" from an API contract sense, but not always enforced in a practical sense in 3.x (no uniformly extreme negative consequences), as the original emailer indicated. Now, it is "mandatory" in a practical sense as well (extremely annoying consequences in all cases of a contract violation). So, I should have said that the contract was mandatory but not enforced... which from a practical perspective negates its mandatory contractual value. -- Jack Krupansky -Original Message- From: Uwe Schindler Sent: Monday, April 15, 2013 11:53 AM To: java-user@lucene.apache.org Subject: RE: WhitespaceTokenizer, incrementToken() ArrayIndexOutOfBoundsException Hi, It was always mandatory! In Lucene 2.x/3.x some Tokenizers just returned bogus, undefined stuff if not correctly reset before usage, especially when Tokenizers are "reused" by the Analyzer, which is now mandatory in 4.x. So we made it throw some Exception (NPE or AIOOBE) in Lucene 4 by initializing the state fields in Lucene 4.0 with some default values that cause the Exception. The exception is not made more specific, for performance reasons (it's just caused by the new default values set in the ctor previously). - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Monday, April 15, 2013 4:25 PM To: java-user@lucene.apache.org Subject: Re: WhitespaceTokenizer, incrementToken() ArrayIndexOutOfBoundsException I didn't read your code, but do you have the "reset" that is now mandatory and throws AIOOBE if not present? -- Jack Krupansky -Original Message- From: andi rexha Sent: Monday, April 15, 2013 10:21 AM To: java-user@lucene.apache.org Subject: WhitespaceTokenizer, incrementToken() ArrayIndexOutOfBoundsException Hi, I have tried to get all the tokens from a TokenStream in the same way as I was doing in the 3.x version of Lucene, but now (at least with WhitespaceTokenizer) I get an exception: Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -1 at java.lang.Character.codePointAtImpl(Character.java:2405) at java.lang.Character.codePointAt(Character.java:2369) at org.apache.lucene.analysis.util.CharacterUtils$Java5CharacterUtils.codePointAt(CharacterUtils.java:164) at org.apache.lucene.analysis.util.CharTokenizer.incrementToken(CharTokenizer.java:166) The code is quite simple, and I thought that it could have worked, but obviously it doesn't (unless I have made some mistakes). Here is the code, in case you spot some bugs in it (although it is trivial): String str = "this is a test"; Reader reader = new StringReader(str); TokenStream tokenStream = new WhitespaceTokenizer(Version.LUCENE_42, reader); //tokenStreamAnalyzer.tokenStream("test", reader); CharTermAttribute attribute = tokenStream.getAttribute(CharTermAttribute.class); while (tokenStream.incrementToken()) { System.out.println(new String(attribute.buffer(), 0, attribute.length())); } Hope you have an idea of why it is happening. Regards, Andi - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
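Concretely, the loop from the original message works once the contract is honored: call reset() before the first incrementToken(), then end() and close() when done (4.2 API):

import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

Reader reader = new StringReader("this is a test");
TokenStream tokenStream = new WhitespaceTokenizer(Version.LUCENE_42, reader);
CharTermAttribute attribute = tokenStream.addAttribute(CharTermAttribute.class);
tokenStream.reset(); // mandatory in 4.x before the first incrementToken()
while (tokenStream.incrementToken()) {
    System.out.println(attribute.toString());
}
tokenStream.end();   // records the final offset state
tokenStream.close();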
Re: Multiple PositionIncrement attributes
You can use SpanNearQuery to seek matches within a specified distance. Lucene knows nothing about "sentences". But if you have an analyzer or custom code that artificially bumps the position to the next multiple of some number like 100 or 1000 when a sentence boundary pattern is encountered, you could use that number times n to match within n sentences, roughly, plus or minus a sentence or two - there is nothing to cause the nearness to be rounded or truncated exactly to one of those boundaries. Maybe you want two numbers: 1) sentence separation, say 1000, and 2) maximum sentence length, say 500. The SpanNearQuery would use n-1 times the sentence separation plus the maximum sentence length. Well, you have to adjust that for how you count sentences - is 1 the current sentence or is that 0? -- Jack Krupansky -Original Message- From: Igor Shalyminov Sent: Thursday, April 25, 2013 6:54 AM To: java-user@lucene.apache.org Subject: Multiple PositionIncrement attributes Hi all! I use PositionIncrement attribute for finding words at some distance from each other. And I have two problems with that: 1) I want to search words within one sentence. A possible solution would be to set PositionIncrement of +INF (like 30 :) ) to the sentence break tag. 2) I want to use in my search both word-distance and sentence-distance between words (e.g. find the word "Putin" within 3 sentences after the word "Obama" or find the words "cheese" and "bacon" in one sentence within 3 words of each other). For the 2nd problem, is there a way of storing multiple position information sources in the index and using them for searching? Say, at least choosing one of those for a query. -- Best Regards, Igor Shalyminov - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
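A sketch of the "within 3 sentences" case under the numbers suggested above (separation 1000, maximum sentence length 500; the field name is illustrative):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

int sentenceSeparation = 1000, maxSentenceLength = 500;
int n = 3; // "Putin" within 3 sentences after "Obama"
int slop = (n - 1) * sentenceSeparation + maxSentenceLength;
SpanQuery q = new SpanNearQuery(new SpanQuery[] {
        new SpanTermQuery(new Term("body", "obama")),
        new SpanTermQuery(new Term("body", "putin"))
    }, slop, true); // true = in order: "putin" must follow "obama"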
Re: [PhraseQuery] Can "jakarta apache"~10 be searched by offset?
Do you mean the raw character offsets of the starting and ending characters of the terms? No. Although, if you index the text as a raw string, you might be able to come up with a regex query like "jakarta.{1,10}apache" -- Jack Krupansky -Original Message- From: wgggfiy Sent: Monday, May 06, 2013 11:39 PM To: java-user@lucene.apache.org Subject: [PhraseQuery] Can "jakarta apache"~10 be searched by offset? As far as I know, the syntax *"jakarta apache"~10* is a PhraseQuery with a slop of 10 in positions, but what I want is *based on offset*, not on position. Can anyone help me? thx. - -- Email: wuqiu.m...@qq.com -- -- View this message in context: http://lucene.472066.n3.nabble.com/PhraseQuery-Can-jakarta-apache-10-be-searched-by-offset-tp4061243.html Sent from the Lucene - Java Users mailing list archive at Nabble.com. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: [PhraseQuery] Can "jakarta apache"~10 be searched by offset?
You'll have to be more explicit about the actual data and what didn't work. Try developing a simple, self-contained unit test with some simple strings as input that demonstrates the case that you say doesn't work. I mean, regular expressions and field analysis can both be quite tricky - even a tiny typo can break everything. To be clear, my suggested approach to using regular expressions works only on un-tokenized input, so there won't be any positions or even offsets. Other than that, you're on your own until you develop that small, self-contained unit test. Or, you can file a Jira for a new Lucene Query for phrase and/or span queries that measures distance by offsets rather than positions. -- Jack Krupansky -Original Message- From: wgggfiy Sent: Monday, May 13, 2013 3:47 AM To: java-user@lucene.apache.org Subject: Re: [PhraseQuery] Can "jakarta apache"~10 be searched by offset? Jack, according to you, how can I implement this requirement? Could you give me a clue? Thank you very much. The regex query seemed not to work. I got the field such as: FieldType fieldType = new FieldType(); FieldInfo.IndexOptions indexOptions = FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS; fieldType.setIndexOptions(indexOptions); fieldType.setIndexed(true); fieldType.setTokenized(true); fieldType.setStored(true); fieldType.freeze(); return new Field(name, value, fieldType); - -- Email: wuqiu.m...@qq.com -- -- View this message in context: http://lucene.472066.n3.nabble.com/PhraseQuery-Can-jakarta-apache-10-be-searched-by-offset-tp4061243p4062852.html Sent from the Lucene - Java Users mailing list archive at Nabble.com. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
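For completeness, the regex idea from the earlier reply would look roughly like this; it needs a separate un-tokenized copy of the text (note that the field in the snippet above has setTokenized(true), which defeats that), and RegexpQuery matches against whole terms, hence the leading and trailing .* (the field name here is made up):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RegexpQuery;

// "jakarta" followed by "apache" within 1-10 characters, inside one un-tokenized term
Query q = new RegexpQuery(new Term("rawContents", ".*jakarta.{1,10}apache.*"));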
Re: lucene and mongodb
That was tried with Lucandra/Solandra, which stored the Lucene index in Cassandra, but was less than optimal, so that model was discarded in favor of indexing Cassandra data directly into Solr/Lucene, side-by-side in each Cassandra node, but in native Lucene. The latter approach is now available from DataStax as DataStax Enterprise (DSE), which also integrates Hadoop. DSE provides the best of Cassandra integrated tightly with the best of Solr. In DSE, Cassandra takes care of all the cluster management, with Solr indexing the local Cassandra data, handing off incoming updates to Cassandra for normal Cassandra storage, and using a variation of the normal Solr code for distributing queries and merging results from other nodes, but depending on Cassandra for information about cluster configuration. (Note: I had proposed to talk in more detail about the above at Lucene Revolution, but my proposal was not accepted.) See: http://www.datastax.com/what-we-offer/products-services/datastax-enterprise As it says, "DataStax Enterprise is completely free for development work." -- Jack Krupansky -Original Message- From: Rider Carrion Cleger Sent: Tuesday, May 14, 2013 4:35 AM To: java-user-i...@lucene.apache.org ; java-user-...@lucene.apache.org ; java-user@lucene.apache.org Subject: lucene and mongodb Hi team, I'm working with Apache Lucene 4.2.1 and I would like to store the Lucene index in a NoSQL database. So my questions are: - Can I store the Lucene index in a MongoDB database? Thank you, team! - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: classic.QueryParser - bug or new behavior?
Yeah, just go ahead and escape the slash, either with a backslash or by enclosing the whole term in quotes. Otherwise the slash (even embedded in the middle of a term!) indicates the start of a regex query term. -- Jack Krupansky -Original Message- From: Scott Smith Sent: Sunday, May 19, 2013 2:50 PM To: java-user@lucene.apache.org Subject: classic.QueryParser - bug or new behavior? I just upgraded from Lucene 4.1 to 4.2.1. We believe we are seeing some different behavior. I'm using org.apache.lucene.queryparser.classic.QueryParser. If I pass the string "20110920/EXPIRED" (w/o quotes) to the parser, I get: org.apache.lucene.queryparser.classic.ParseException: Cannot parse '20110920/EXPIRED': Lexical error at line 1, column 17. Encountered: <EOF> after : "/EXPIRED" at org.apache.lucene.queryparser.classic.QueryParserBase.parse(QueryParserBase.java:131) We believe this used to work. I tried googling for this and found something that said I should use QueryParser.escape() on the string before passing it to the parser. However, that seems to break phrase queries (e.g., "John Smith" - with the quotes; I assume it's escaping the double-quotes and doesn't realize it's a phrase). Since it is a forward slash, I'm confused why any of the characters in "/EXPIRED" would need escaping. Has anyone seen this? Scott - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
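Both fixes, side by side (4.2 classic query parser; the field name and analyzer are placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

QueryParser parser = new QueryParser(Version.LUCENE_42, "status",
    new StandardAnalyzer(Version.LUCENE_42));
Query q1 = parser.parse("20110920\\/EXPIRED");   // backslash-escape the slash
Query q2 = parser.parse("\"20110920/EXPIRED\""); // or quote the whole term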
Re: Case insensitive StringField?
To be clear, analysis is not supported on StringField (or any non-tokenized field). But the good news is that by using the keyword tokenizer (KeywordTokenizer) on a TextField, you can get the same effect. That will preserve the entire input as a single token. You may want to include filters to trim exterior white space and normalize interior white space. -- Jack Krupansky -Original Message- From: Shahak Nagiel Sent: Tuesday, May 21, 2013 10:06 AM To: java-user@lucene.apache.org Subject: Case insensitive StringField? It appears that StringField instances are treated as literals, even though my analyzer lower-cases (on both write and read sides). So, for example, I can match with a term query (e.g. "NEW YORK"), but only if the case matches. If I use a QueryParser (or MultiFieldQueryParser), it never works because these query values are lowercased and don't match. I've found that using a TextField instead works, presumably because it's tokenized and processed correctly by the write analyzer. However, I would prefer that queries match against the entire/exact phrase ("NEW YORK"), rather than among the tokens ("NEW" or "YORK"). What's the solution here? Thanks in advance. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Query with phrases, wildcards and fuzziness
Just escape embedded spaces with a backslash. -- Jack Krupansky -Original Message- From: Ross Simpson Sent: Tuesday, May 21, 2013 8:08 PM To: java-user@lucene.apache.org Subject: Query with phrases, wildcards and fuzziness Hi all, I'm trying to create a fairly complex query, and having trouble constructing it. My index contains a TextField with place names as strings, e.g.: Port Melbourne, VIC 3207 I'm using an analyzer with just KeywordTokenizer and LowerCaseFilter, so that my strings are not tokenized at all. I want to support end-user searches like the following, and have them match that string above: Port Melbourne, VIC 3207 (exact) Port (prefix) Port Mel (prefix, including a space) Melbo (wildcard) Melburne (fuzzy) I'm trying to get away with not parsing the query myself, and just constructing something like this: parser.parse("(STRING^9) OR (STRING*^7) OR (*STRING*^5) OR (STRING~1^3)"); That doesn't seem to work, neither with QueryParser nor with ComplexPhraseQueryParser. Specifically, I'm having trouble getting appropriate results when there's a space in the input string, notably with the wildcard match part (it ends up returning everything in the index). Is my approach above possible? I also have had a look at using specific Query implementations and combining them in a BooleanQuery, but I'm not quite sure how to replicate the "OR" behavior I want (from reading, Occur.SHOULD is not equivalent to "OR"). Any suggestions would be appreciated. Thanks! Ross - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Case insensitive StringField?
Yes, it is, and it always will; the query parser splits on white space before the analyzer ever sees the text. But... you can escape the spaces with a backslash: Query q = qp.parse("new\\ york"); -- Jack Krupansky -Original Message- From: Shahak Nagiel Sent: Tuesday, May 21, 2013 10:09 PM To: java-user@lucene.apache.org Subject: Re: Case insensitive StringField? Jack / Michael: Thanks, but the query parser still seems to be tokenizing the query? public class StringPhraseAnalyzer extends Analyzer { protected TokenStreamComponents createComponents (String fieldName, Reader reader) { Tokenizer tok = new KeywordTokenizer(reader); TokenFilter filter = new LowerCaseFilter(Version.LUCENE_41, tok); filter = new TrimFilter(filter, true); return new TokenStreamComponents(tok, filter); } } ... Analyzer analyzer = new StringPhraseAnalyzer(); // using this analyzer, add document to index with city TextField (value "NEW YORK") QueryParser qp = new QueryParser(Version.LUCENE_41, "city", analyzer); Query q = qp.parse("new york"); System.out.println ("Query: " + q); results in... Query: city:new city:york // I expected "city:new york" ...and no matches. Is a QueryParser the wrong way to generate the query for this type of analyzer? Thanks again! From: Jack Krupansky To: java-user@lucene.apache.org Sent: Tuesday, May 21, 2013 10:22 AM Subject: Re: Case insensitive StringField? To be clear, analysis is not supported on StringField (or any non-tokenized field). But the good news is that by using the keyword tokenizer (KeywordTokenizer) on a TextField, you can get the same effect. That will preserve the entire input as a single token. You may want to include filters to trim exterior white space and normalize interior white space. -- Jack Krupansky -Original Message- From: Shahak Nagiel Sent: Tuesday, May 21, 2013 10:06 AM To: java-user@lucene.apache.org Subject: Case insensitive StringField? It appears that StringField instances are treated as literals, even though my analyzer lower-cases (on both write and read sides). So, for example, I can match with a term query (e.g. "NEW YORK"), but only if the case matches. If I use a QueryParser (or MultiFieldQueryParser), it never works because these query values are lowercased and don't match. I've found that using a TextField instead works, presumably because it's tokenized and processed correctly by the write analyzer. However, I would prefer that queries match against the entire/exact phrase ("NEW YORK"), rather than among the tokens ("NEW" or "YORK"). What's the solution here? Thanks in advance. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
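If the escaping feels fragile, another option (a sketch) is to skip the query parser for this field and build the term query directly; TermQuery does no analysis, so the text must already be in its indexed form (lower-cased and trimmed, given the analyzer above):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

Query q = new TermQuery(new Term("city", "new york")); // matches the keyword-tokenized value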
Re: Query with phrases, wildcards and fuzziness
Using BooleanQuery and Occur.SHOULD is the way to go. There are some nuances, but you may not run into them. Sometimes the query parser syntax is more the issue than the Lucene BQ itself. For example, with a string of AND and OR, they all get parsed into a single BQ, which is clearly not traditional "Boolean", but if you code multiple BQs that nest (or fully parenthesize your source query), you will get a true "Boolean" query. It's up to your particular application whether you need "true" Boolean or not. -- Jack Krupansky -Original Message- From: Ross Simpson Sent: Wednesday, May 22, 2013 7:44 AM To: java-user@lucene.apache.org Subject: Re: Query with phrases, wildcards and fuzziness One further question: If I wanted to construct my query using Query implementations instead of a QueryParser (e.g. TermQuery, WildcardQuery, etc.), what's the right way to duplicate the "OR" functionality I wrote about below? As I mentioned, I've read that wrapping query objects in a BooleanQuery and using Occur.SHOULD is not necessarily the same. Any suggestions? Ross On 22/05/2013 11:46 AM, Ross Simpson wrote: Jack, thanks very much! I wasn't considering a space a special character for some reason. That has worked perfectly. Cheers, Ross On May 22, 2013, at 10:24 AM, Jack Krupansky wrote: Just escape embedded spaces with a backslash. -- Jack Krupansky -Original Message- From: Ross Simpson Sent: Tuesday, May 21, 2013 8:08 PM To: java-user@lucene.apache.org Subject: Query with phrases, wildcards and fuzziness Hi all, I'm trying to create a fairly complex query, and having trouble constructing it. My index contains a TextField with place names as strings, e.g.: Port Melbourne, VIC 3207 I'm using an analyzer with just KeywordTokenizer and LowerCaseFilter, so that my strings are not tokenized at all. I want to support end-user searches like the following, and have them match that string above: Port Melbourne, VIC 3207 (exact) Port (prefix) Port Mel (prefix, including a space) Melbo (wildcard) Melburne (fuzzy) I'm trying to get away with not parsing the query myself, and just constructing something like this: parser.parse("(STRING^9) OR (STRING*^7) OR (*STRING*^5) OR (STRING~1^3)"); That doesn't seem to work, neither with QueryParser nor with ComplexPhraseQueryParser. Specifically, I'm having trouble getting appropriate results when there's a space in the input string, notably with the wildcard match part (it ends up returning everything in the index). Is my approach above possible? I also have had a look at using specific Query implementations and combining them in a BooleanQuery, but I'm not quite sure how to replicate the "OR" behavior I want (from reading, Occur.SHOULD is not equivalent to "OR"). Any suggestions would be appreciated. Thanks! Ross - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
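Translated into query objects, the OR-of-clauses from the original post might look like the sketch below (4.x API; the field name is assumed, and the boosts mirror the ^ weights). With no MUST clauses present, a BooleanQuery of SHOULD clauses requires at least one clause to match, which is exactly OR semantics:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

BooleanQuery bq = new BooleanQuery();
Query exact = new TermQuery(new Term("place", "port melbourne, vic 3207"));
exact.setBoost(9f);
bq.add(exact, Occur.SHOULD);
Query prefix = new PrefixQuery(new Term("place", "port mel"));
prefix.setBoost(7f);
bq.add(prefix, Occur.SHOULD);
Query fuzzy = new FuzzyQuery(new Term("place", "melburne"), 1);
fuzzy.setBoost(3f);
bq.add(fuzzy, Occur.SHOULD);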
Re: Getting position increments directly from the index
It might be nice to inquire as to the largest position for a field in a document. Is that information kept anywhere? Not that I know of, although I suppose it can be calculated at runtime by running through all the terms of the field. Then he could just divide by 1000. -- Jack Krupansky -Original Message- From: Michael McCandless Sent: Thursday, May 23, 2013 6:28 AM To: Lucene Users Subject: Re: Getting position increments directly from the index Do you actually index the sentence boundary as a token? If so, you could just get the totalTermFreq of that token? Mike McCandless http://blog.mikemccandless.com On Wed, May 22, 2013 at 10:11 AM, Igor Shalyminov wrote: Hello! I'm storing sentence bounds in the index as position increments of 1000. I want to get the total number of sentences in the index, i. e. the number of "1000" increment values. Can I do that some other way rather than just loading each document and extracting position increments with a custom Analyzer? -- Best Regards, Igor Shalyminov - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
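If a literal boundary token is (or can be) indexed, Mike's suggestion is a one-liner ("#sent#" is an invented marker token and "body" an assumed field name):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

// total occurrences of the boundary token across the index = number of sentences
long sentenceCount = reader.totalTermFreq(new Term("body", "#sent#"));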