Re: Handling special characters in Lucene 4.0
The standard analyzer removes those ampersands and pluses, so the core alphabetic terms should still match. To preserve such special characters you would need the whitespace analyzer or a custom analyzer. Please give a specific indexed text string and a specific query that fails against it.

Note that QueryParser.escape also escapes asterisks, so they won't be treated as wildcards. The standard analyzer then removes the asterisks, as it does with most punctuation. If you switch to an analyzer that preserves special characters, you can manually escape the special characters with a backslash and leave the asterisk unescaped to get a wildcard query.

-- Jack Krupansky

-----Original Message-----
From: saisantoshi
Sent: Sunday, October 20, 2013 6:02 PM
To: java-user@lucene.apache.org
Subject: Re: Handling special characters in Lucene 4.0

We use StandardAnalyzer both at index and search time, the default one, and don't have any custom analyzers.

Thanks,
Sai

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
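A minimal sketch of the manual-escaping approach described above: escape the query-syntax characters by hand but leave * (and ?) alone so they still act as wildcards. The helper name and input strings are illustrative, not from the thread; the character set mirrors what QueryParser.escape handles, minus the wildcards:

```java
// Escape Lucene query-syntax characters except * and ?, so wildcard
// searches still work against an index built with an analyzer that
// preserves punctuation (e.g. WhitespaceAnalyzer).
public class EscapeExceptWildcards {
    static String escapeExceptWildcards(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            // Same character set as QueryParser.escape, minus * and ?
            if ("\\+-!():^[]\"{}~|&/".indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "&&" gets escaped; the trailing "*" is left alone so the
        // query parser can still build a wildcard/prefix query from it
        System.out.println(escapeExceptWildcards("&&searchtext*")); // prints \&\&searchtext*
    }
}
```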
Re: Handling special characters in Lucene 4.0
Maybe you are not using the same analyzer at index and query time. Even though you are correctly escaping the special query-syntax characters, either the query analyzer is removing them or your index analyzer already removed them. What analyzer are you using at index time, and what analyzer are you using at query time?

-- Jack Krupansky

-----Original Message-----
From: saisantoshi
Sent: Sunday, October 20, 2013 12:47 PM
To: java-user@lucene.apache.org
Subject: Handling special characters in Lucene 4.0

I have indexed strings like the following: "&&searchtext" and "+sampletext". When I try to search for them using "&&*" or "+*" I get no results. I am using the QueryParser.escape(String s) method to handle the special characters, but it does not look like it did anything.

Also, when I search for something like "title:search*" it works and returns results, but a search like "title:*&&*" gives no result. Is that valid search criteria? If not, can someone suggest what would be appropriate? It seems like StandardAnalyzer is stripping out all the special characters, which is why searching without special characters does work.

Thanks,
Sai.
Re: Wildcard question
You get to decide:

class QueryParser extends QueryParserBase:

  /**
   * Set to true to allow leading wildcard characters.
   * When set, * or ? are allowed as the first character of a
   * PrefixQuery and WildcardQuery. Note that this can produce
   * very slow queries on big indexes.
   * Default: false.
   */
  @Override
  public void setAllowLeadingWildcard(boolean allowLeadingWildcard) {
    this.allowLeadingWildcard = allowLeadingWildcard;
  }

And the default is "false" (leading wildcard not allowed).

-- Jack Krupansky

-----Original Message-----
From: Carlos de Luna Saenz
Sent: Wednesday, October 09, 2013 6:32 PM
To: java-user@lucene.apache.org
Subject: Wildcard question

I've used Lucene 2, 3, and now 4. I used to believe that a leading * wildcard was accepted since version 3 (but I never used it), yet reviewing the documentation I see "Note: You cannot use a * or ? symbol as the first character of a search." Is that correct, or is it an out-of-date note in the http://lucene.apache.org/core/4_4_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description documentation?

Thanks in advance.
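A short sketch of turning this on, assuming Lucene 4.x's classic QueryParser; the field name and analyzer choice are illustrative:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class LeadingWildcardDemo {
    public static void main(String[] args) throws Exception {
        QueryParser parser = new QueryParser(Version.LUCENE_44, "contents",
                new StandardAnalyzer(Version.LUCENE_44));
        // Without this call, parsing "*foo" throws a ParseException.
        parser.setAllowLeadingWildcard(true);
        Query q = parser.parse("*foo");
        System.out.println(q.getClass().getSimpleName()); // WildcardQuery
    }
}
```

As the javadoc warns, leading wildcards force the term enumeration to scan essentially the whole term dictionary, so expect slow queries on big indexes.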
Re: Question about the CompoundWordTokenFilterBase
Out of curiosity, what is your use case? I mean, the normal use of this filter is to permit a "shorthand" reference to a long term, but why would you necessarily want to preclude direct reference to the full term?

-- Jack Krupansky

-----Original Message-----
From: Alex Parvulescu
Sent: Wednesday, September 18, 2013 10:27 AM
To: java-user@lucene.apache.org
Subject: Question about the CompoundWordTokenFilterBase

Hi,

While playing with the CompoundWordTokenFilterBase I noticed that its behavior is to include the original token together with the new sub-tokens. I assume this is expected (I haven't found any relevant docs on this), but I was wondering if it's a hard requirement, or whether I can propose a small change to skip the original token (controlled by a flag)?

If there's interest I can put this in a JIRA issue and we can continue the discussion there. The patch is not too complicated, but I haven't run any of the tests yet :)

thanks,
alex
Re: Can you escape characters you don't want the analyzer to modify
It sounds like you either need a custom analyzer or a field-aware analyzer.

-- Jack Krupansky

-----Original Message-----
From: Scott Smith
Sent: Tuesday, September 17, 2013 4:26 PM
To: java-user@lucene.apache.org
Subject: Can you escape characters you don't want the analyzer to modify

Suppose I have a string like "ab@cd%d". My analyzer will turn this into "ab cd d". Can I pass it "ab\@cd\%d" and force it to treat it as a single word? I want to use the query parser, but I don't want it messing with fields that have not been analyzed.
Re: Query type always Boolean Query even if * and ? are present.
Queries have a syntax, with terms and operators. If there are no operators, an implicit operator applies (AND/OR = mandatory/optional). Generally, a sequence of terms will always generate a BooleanQuery. So any time you use a lexical construct that has meaning to the query parser that you don't want, you need to escape it.

-- Jack Krupansky

-----Original Message-----
From: Ankit Murarka
Sent: Thursday, September 12, 2013 11:36 AM
To: java-user@lucene.apache.org
Subject: Re: Query type always Boolean Query even if * and ? are present.

Bingo! This has solved my case... Thanks a ton..!!

Does this mean that any input containing spaces, parsed using QueryParser, will result in a BooleanQuery UNLESS the white space is escaped? So if the input contains white space and is parsed using QueryParser, it will eventually turn into a BooleanQuery, since a term is created for each separate word and each term is treated as a Boolean clause?

Can you please point out any other possible pitfalls when using QueryParser with input containing special characters...

On 9/12/2013 8:59 PM, Jack Krupansky wrote:

You're not escaping white space, so your input will be a sequence of terms, which should generate a BooleanQuery. What is the last clause of the BQ? It should be your PrefixQuery.

-- Jack Krupansky

-----Original Message-----
From: Ankit Murarka
Sent: Thursday, September 12, 2013 11:25 AM
To: java-user@lucene.apache.org
Subject: Re: Query type always Boolean Query even if * and ? are present.

If I remove the escape call from the function, then it works as expected: Prefix/Boolean/Wildcard. But this is NOT what I want... The escape should be present, or else I will get a lexical error for Prefix/Boolean/Wildcard queries, since my input will definitely contain special characters...

Help needed.. TIA..

On 9/12/2013 8:52 PM, Ankit Murarka wrote:

I also tried it with this query: * -- I am still getting it as a BooleanQuery. It should be Prefix...

On 9/12/2013 8:50 PM, Jack Krupansky wrote:

The trailing asterisk in your query input is escaped with a backslash, so the query parser will not treat it as a wildcard.

-- Jack Krupansky

-----Original Message-----
From: Ankit Murarka
Sent: Thursday, September 12, 2013 10:19 AM
To: java-user@lucene.apache.org
Subject: Query type always Boolean Query even if * and ? are present.

Hello. I am faced with a trivial issue: every time, my query is being fired as a BooleanQuery...

Providing input: \*

Since this input contains special characters, I use the escape method of QueryParser (with escaping removed for * and ?, since they are needed for wildcard and prefix searches):

  public static String escape(String s) {
    System.out.println("CALLED");
    StringBuilder sb = new StringBuilder();
    try {
      for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        // These characters are part of the query syntax and must be escaped
        if (c == '\\' || c == '+' || c == '-' || c == '!'
            || c == '(' || c == ')' || c == ':' || c == '^'
            || c == '[' || c == ']' || c == '\"' || c == '{'
            || c == '}' || c == '~' || c == '|' || c == '&'
            || c == '/') {
          sb.append('\\');
        }
        sb.append(c);
      }
    } catch (Exception e) {
      e.printStackTrace();
    }
    return sb.toString();
  }

  /* The function which is used to provide the HIT: */
  Directory directory = null;
  IndexReader reader = null;
  IndexSearcher searcher = null;
  Analyzer analyzer = null;
  try {
    analyzer = new StandardAnalyzer(Version.LUCENE_44);
    directory = FSDirectory.open(new File(strMainIndexPath.replace("\\", "/")));
    reader = DirectoryReader.open(directory); // location where indexes are
    searcher = new IndexSearcher(reader);
    System.out.println("Searching for '" + searchString + "' using QueryParser");
    QueryParser queryParser = new QueryParser(Version.LUCENE_44, "contents", analyzer);
    searchString = CommonMethods.escape(searchString); // MENTIONED ABOVE
    System.out.println("ESCAPED STRING AFTER THE ESCAPE FUNCTION OF QUERYPARSER >>> " + searchString);
    Query query = queryParser.parse(searchString);
    System.out.println("Type of query: " + query.getClass().getSimpleName()); // GETTING IT AS BOOLEAN ALWAYS

The last output is always BooleanQuery... Even if I am appending * at the end, the query is always Boolean...
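To see Jack's point concretely, here is a small sketch (Lucene 4.x classic QueryParser; field name and query strings are illustrative) showing that whitespace-separated terms produce a BooleanQuery whose last clause is a PrefixQuery when the trailing * is left unescaped, while an escaped * is analyzed away and yields a plain TermQuery:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class ParseTypeDemo {
    public static void main(String[] args) throws Exception {
        QueryParser parser = new QueryParser(Version.LUCENE_44, "contents",
                new StandardAnalyzer(Version.LUCENE_44));

        // Two whitespace-separated terms produce a BooleanQuery; the
        // unescaped trailing * makes the last clause a PrefixQuery.
        BooleanQuery bq = (BooleanQuery) parser.parse("error log*");
        BooleanClause last = bq.clauses().get(bq.clauses().size() - 1);
        System.out.println(last.getQuery().getClass().getSimpleName()); // PrefixQuery

        // With the * escaped, the analyzer strips it from the term text
        // and the result is an ordinary TermQuery for "log".
        Query q = parser.parse("log\\*");
        System.out.println(q.getClass().getSimpleName()); // TermQuery
    }
}
```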
Re: Query type always Boolean Query even if * and ? are present.
You're not escaping white space, so your input will be a sequence of terms, which should generate a BooleanQuery. What is the last clause of the BQ? It should be your PrefixQuery.

-- Jack Krupansky

-----Original Message-----
From: Ankit Murarka
Sent: Thursday, September 12, 2013 11:25 AM
To: java-user@lucene.apache.org
Subject: Re: Query type always Boolean Query even if * and ? are present.

If I remove the escape call from the function, then it works as expected: Prefix/Boolean/Wildcard. But this is NOT what I want... The escape should be present, or else I will get a lexical error for Prefix/Boolean/Wildcard queries, since my input will definitely contain special characters...

Help needed.. TIA..

On 9/12/2013 8:52 PM, Ankit Murarka wrote:

I also tried it with this query: * -- I am still getting it as a BooleanQuery. It should be Prefix...

On 9/12/2013 8:50 PM, Jack Krupansky wrote:

The trailing asterisk in your query input is escaped with a backslash, so the query parser will not treat it as a wildcard.

-- Jack Krupansky

-----Original Message-----
From: Ankit Murarka
Sent: Thursday, September 12, 2013 10:19 AM
To: java-user@lucene.apache.org
Subject: Query type always Boolean Query even if * and ? are present.

Hello. I am faced with a trivial issue: every time, my query is being fired as a BooleanQuery...

Providing input: \*

Since this input contains special characters, I use the escape method of QueryParser (with escaping removed for * and ?, since they are needed for wildcard and prefix searches):

  public static String escape(String s) {
    System.out.println("CALLED");
    StringBuilder sb = new StringBuilder();
    try {
      for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        // These characters are part of the query syntax and must be escaped
        if (c == '\\' || c == '+' || c == '-' || c == '!'
            || c == '(' || c == ')' || c == ':' || c == '^'
            || c == '[' || c == ']' || c == '\"' || c == '{'
            || c == '}' || c == '~' || c == '|' || c == '&'
            || c == '/') {
          sb.append('\\');
        }
        sb.append(c);
      }
    } catch (Exception e) {
      e.printStackTrace();
    }
    return sb.toString();
  }

  /* The function which is used to provide the HIT: */
  Directory directory = null;
  IndexReader reader = null;
  IndexSearcher searcher = null;
  Analyzer analyzer = null;
  try {
    analyzer = new StandardAnalyzer(Version.LUCENE_44);
    directory = FSDirectory.open(new File(strMainIndexPath.replace("\\", "/")));
    reader = DirectoryReader.open(directory); // location where indexes are
    searcher = new IndexSearcher(reader);
    System.out.println("Searching for '" + searchString + "' using QueryParser");
    QueryParser queryParser = new QueryParser(Version.LUCENE_44, "contents", analyzer);
    searchString = CommonMethods.escape(searchString); // MENTIONED ABOVE
    System.out.println("ESCAPED STRING AFTER THE ESCAPE FUNCTION OF QUERYPARSER >>> " + searchString);
    Query query = queryParser.parse(searchString);
    System.out.println("Type of query: " + query.getClass().getSimpleName()); // GETTING IT AS BOOLEAN ALWAYS

The last output is always BooleanQuery... Even if I am appending * at the end, the query is always Boolean... I have to use QueryParser ONLY. Absolutely no manipulation is done on the string between being given as input and being provided to this search function.

Kindly guide.. TIA.

--
Regards
Ankit Murarka

"What lies behind us and what lies before us are tiny matters compared with what lies within us"
Re: Query type always Boolean Query even if * and ? are present.
The trailing asterisk in your query input is escaped with a backslash, so the query parser will not treat it as a wildcard.

-- Jack Krupansky

-----Original Message-----
From: Ankit Murarka
Sent: Thursday, September 12, 2013 10:19 AM
To: java-user@lucene.apache.org
Subject: Query type always Boolean Query even if * and ? are present.

Hello. I am faced with a trivial issue: every time, my query is being fired as a BooleanQuery...

Providing input: \*

Since this input contains special characters, I use the escape method of QueryParser (with escaping removed for * and ?, since they are needed for wildcard and prefix searches):

  public static String escape(String s) {
    System.out.println("CALLED");
    StringBuilder sb = new StringBuilder();
    try {
      for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        // These characters are part of the query syntax and must be escaped
        if (c == '\\' || c == '+' || c == '-' || c == '!'
            || c == '(' || c == ')' || c == ':' || c == '^'
            || c == '[' || c == ']' || c == '\"' || c == '{'
            || c == '}' || c == '~' || c == '|' || c == '&'
            || c == '/') {
          sb.append('\\');
        }
        sb.append(c);
      }
    } catch (Exception e) {
      e.printStackTrace();
    }
    return sb.toString();
  }

  /* The function which is used to provide the HIT: */
  Directory directory = null;
  IndexReader reader = null;
  IndexSearcher searcher = null;
  Analyzer analyzer = null;
  try {
    analyzer = new StandardAnalyzer(Version.LUCENE_44);
    directory = FSDirectory.open(new File(strMainIndexPath.replace("\\", "/")));
    reader = DirectoryReader.open(directory); // location where indexes are
    searcher = new IndexSearcher(reader);
    System.out.println("Searching for '" + searchString + "' using QueryParser");
    QueryParser queryParser = new QueryParser(Version.LUCENE_44, "contents", analyzer);
    searchString = CommonMethods.escape(searchString); // MENTIONED ABOVE
    System.out.println("ESCAPED STRING AFTER THE ESCAPE FUNCTION OF QUERYPARSER >>> " + searchString);
    Query query = queryParser.parse(searchString);
    System.out.println("Type of query: " + query.getClass().getSimpleName()); // GETTING IT AS BOOLEAN ALWAYS

The last output is always BooleanQuery... Even if I am appending * at the end, the query is always Boolean... I have to use QueryParser ONLY. Absolutely no manipulation is done on the string between being given as input and being provided to this search function.

Kindly guide.. TIA.

--
Regards
Ankit Murarka

"What lies behind us and what lies before us are tiny matters compared with what lies within us"
Re: Profiling Solr Lucene for query
Please send Solr-related inquiries to the Solr user list; this is the Lucene (Java) user list.

-- Jack Krupansky

-----Original Message-----
From: Manuel Le Normand
Sent: Sunday, September 08, 2013 7:03 AM
To: java-user@lucene.apache.org
Subject: Profiling Solr Lucene for query

Hello all,

Looking at the 10% slowest queries, I get very bad performance (~60 sec per query). These queries have lots of conditions on my main field (more than a hundred), including phrase queries, and rows=1000, although I do return only ids. I can quite firmly say that this bad performance is due to a slow storage issue (beyond my control for now). Despite this, I want to improve my performance.

As taught in school, I started profiling these queries; the data from a ~1 minute profile is located here: http://picpaste.com/pics/IMG_20130908_132441-ZyrfXeTY.1378637843.jpg

Main observation: most of the time I am waiting on readVInt, whose stack trace (2 out of 2 thread dumps) is:

catalina-exec-3870 - Thread t@6615
  java.lang.Thread.State: RUNNABLE
    at org.apache.lucene.store.DataInput.readVInt(DataInput.java:108)
    at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnumFrame.loadBlock(BlockTreeTermsReader.java:2357)
    at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact(BlockTreeTermsReader.java:1745)
    at org.apache.lucene.index.TermContext.build(TermContext.java:95)
    at org.apache.lucene.search.PhraseQuery$PhraseWeight.<init>(PhraseQuery.java:221)
    at org.apache.lucene.search.PhraseQuery.createWeight(PhraseQuery.java:326)
    at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
    at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
    at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
    at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
    at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
    at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
    at org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:675)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:297)

So I do actually wait for I/O as expected, but I might be page faulting too many times while looking for the term blocks (the .tim file), i.e. locating the term. As I am reindexing now, would it be useful to lower the term interval (default 128)? Since the FSTs (the .tip files) are small (a few tens to hundreds of MB) and there is no memory contention, could I lower this param to 8, for example?

General configs: Solr 4.3; 36 shards, each with a few million docs.

Thanks in advance,
Manu
Re: Expunge deleting using excessive transient disk space
You're in the wrong room! (Lucene/Java-user.) Move the discussion over to the Solr user list.

-- Jack Krupansky

-----Original Message-----
From: Erick Erickson
Sent: Sunday, September 08, 2013 7:56 AM
To: java-user
Subject: Re: Expunge deleting using excessive transient disk space

How much free disk space do you have when you try the merge? Is this a typo? 2 Note name=|

Best,
Erick

On Sun, Sep 8, 2013 at 7:26 AM, Manuel Le Normand <manuel.lenorm...@gmail.com> wrote:

Hi again,

In order to delete part of my index I run a delete-by-query that intends to erase 15% of the docs. I added these params to the solrconfig.xml: 2 2 5000.0 10.0 15.0

The extra params were added in order to promote merging of old segments, but with a restriction on the transient disk that can be used (as I have only 15GB per shard). This procedure failed with a "no space left on device" exception, although proper calculations show that these params should not cause usage in excess of the transient free disk space I have. Looking at the infostream I can see that the first merges do succeed, but older segments are kept referenced and thus cannot be deleted until all the merging is done. Is there any way of overcoming this?
Re: Fuzzy Searching on Lucene / Solr
The limit of 2 is hard-coded precisely because good performance for edit distances above 2 cannot be guaranteed.

-- Jack Krupansky

-----Original Message-----
From: Michael Tobias
Sent: Wednesday, August 14, 2013 1:00 AM
To: java-user@lucene.apache.org
Subject: Fuzzy Searching on Lucene / Solr

My first post, so please be gentle with me.

I am about to start 'playing' with Solr to see if it will be the correct tool for a new searchable database development. One of my requirements is the ability to do 'fuzzy' searches, and I understand that the latest versions of Lucene / Solr use an improved version of indexing and the Levenshtein distance formula (or, if wished for, a modified version of Levenshtein that treats letter transpositions as a single difference rather than 2).

Levenshtein is precisely what I need, but I also understand that the maximum distance currently implemented is just TWO. That is not really adequate for my purposes; I need to be able to handle at least a distance of 3 and probably 4.

Is the current maximum distance of 2 hard-coded in the system? Can it be overridden? How? I understand that performance (both indexing and querying) may be impaired significantly by doing this, but that might be a price worth paying. If it IS possible to change the max distance to 3 or 4, does anybody have any idea what the performance implications might be?

Many thanks for any/all assistance you can provide.

Regards,
Michael
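The hard-coded ceiling shows up directly in the FuzzyQuery API. A small sketch, assuming Lucene 4.x, where maxEdits above 2 is rejected at construction time (the field and term values here are illustrative):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;

public class FuzzyLimitDemo {
    public static void main(String[] args) {
        // maxEdits of 2 is the ceiling for the Levenshtein automaton
        FuzzyQuery ok = new FuzzyQuery(new Term("name", "smith"), 2);
        System.out.println(ok);

        try {
            // Anything above 2 is rejected before the query ever runs
            new FuzzyQuery(new Term("name", "smith"), 3);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The limit comes from the precomputed Levenshtein automata used for fast fuzzy matching; distances beyond 2 would fall back to the old brute-force term scan, which is exactly the performance cliff the limit is guarding against.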
Re: DV limited to 32766 ?
Check out the discussion on https://issues.apache.org/jira/browse/LUCENE-4583, "StraightBytesDocValuesField fails if bytes > 32k".

-- Jack Krupansky

-----Original Message-----
From: Nicolas Guyot
Sent: Friday, August 09, 2013 5:57 PM
To: java-user@lucene.apache.org
Subject: DV limited to 32766 ?

Hi,

When writing a large binary doc value, BinaryDocValuesWriter throws an exception saying: DocValuesField "fieldName" is too large, must be <= 32766. Is there a way to avoid that limit?

Thanks,
Nicolas
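One workaround sketch, not from the thread: keep values that can exceed the 32766-byte doc-values cap out of doc values entirely and put them in a stored field instead. The addBinary helper and the fallback strategy are hypothetical, just illustrating the idea:

```java
import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.util.BytesRef;

public class LargeBinaryDemo {
    // Hypothetical strategy: small payloads go into doc values (fast
    // columnar access at search time); oversized payloads fall back to
    // a stored field, which has no 32766-byte limit but is only
    // retrievable per-document, not usable for sorting/faceting.
    static void addBinary(Document doc, String field, byte[] value) {
        if (value.length <= 32766) {
            doc.add(new BinaryDocValuesField(field, new BytesRef(value)));
        } else {
            doc.add(new StoredField(field, value));
        }
    }
}
```

The trade-off is that consumers must check both places at read time, so this only makes sense when oversized values are rare.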
Re: Query serialization/deserialization
"it would be nice to be able to emit classic Lucene query parser queries where possible"

Yeah, but then we hit the problem of the Query terms having been through analysis. Maybe it would be nice if we had query syntax to indicate that terms had already been analyzed.

-- Jack Krupansky

-----Original Message-----
From: Michael Sokolov
Sent: Sunday, August 04, 2013 4:55 PM
To: java-user@lucene.apache.org
Cc: Denis Bazhenov
Subject: Re: Query serialization/deserialization

On 07/28/2013 07:32 PM, Denis Bazhenov wrote:

"A full JSON query ser/deser would be an especially nice addition to Solr, allowing direct access to all Lucene Query features even if they haven't been integrated into the higher-level query parsers."

There was nothing we could use, so we wrote one, in fact :) I'll try to elaborate with the team on the question of contributing it to OSS.

On Jul 29, 2013, at 1:54 AM, Jack Krupansky wrote:

Yeah, it's a shame such a ser/deser feature isn't available in Lucene. My idea is to have a separate module that the Query classes can delegate to for serialization and deserialization, handling recursion for nested query objects, and then have modules for XML, JSON, and a pseudo-Java functional notation (maybe closer to JavaScript) for the actual formatting and parsing. Developers could then subclass the module to add any custom Query classes of their own.

I struggled with this as well, and ended up implementing a parallel query class hierarchy for the query classes I cared about; they support a toXmlNode method that generates an XML tree that can then be parsed by the Lucene XML query parser. You can see the code here (https://github.com/msokolov/lux/blob/master/src/main/java/lux/query/ParseableQuery.java), but it would be hard to use outside my project. A more general solution would definitely be welcome.

One challenge, of course, is the wealth of different query parsers: it would be nice to be able to emit classic Lucene query parser queries where possible, wouldn't it, in addition to structured output like XML and JSON?

-Mike
Re: How to Index each file and then each Line for Complete Phrase Match. Sample Data shown.
Why not start with something simple? Like, index each log line as a tokenized text field and then run a PhraseQuery against that text field? Is there something else you need beyond that?

-- Jack Krupansky

-----Original Message-----
From: Ankit Murarka
Sent: Saturday, August 03, 2013 3:22 AM
To: java-user@lucene.apache.org
Subject: How to Index each file and then each Line for Complete Phrase Match. Sample Data shown.

Hello All,

I have the following in my log files. Until now I have been indexing the complete directory containing files with data like this. Now I need to index each line of each file to implement complete phrase search. I intend to store phrases in the index and then use the SpellChecker API to suggest similar phrases.

7/20/2013 7:45 | package execution happening-1 | FATAL | check request has been sent for instance | Ip:Port | EXCEPTION
7/20/2013 7:45 | This is not working perfectly | DEBUG | check request for instance being received is status=200 | Ip:Port | EXCEPTION
7/20/2013 7:45 | Encountering a constant error. | DEBUG | response is not proper. Expecting some more information on this detail. | Ip:Port | EXCEPTION
7/20/2013 7:45 | This needs urgent attention | FATAL | I am still trying to ensure it is running perfectly. Encountering some issues. | Ip:Port | EXCEPTION
7/20/2013 8:01 | Job is running fine. | INFO | Exception Occured in ClassFactory Function() java.nullPointerException: Value is null | Should not be null

To implement complete phrase search, I reckon I need to index each line and store the phrases (the free-text columns above, e.g. "package execution happening-1"). If I can index and store these phrases, then when a user searches for "package executing", Lucene would be able to provide "package execution happening-1" as a valid suggestion.

These columns do not have names, so I cannot index based on a column name. Also, as shown above, the first column may contain a time/date or a phrase in itself (as in the last row).

Please suggest how this is possible using Lucene and its API. The Javadoc does not seem to guide me anywhere for this case.

--
Regards
Ankit Murarka

"What lies behind us and what lies before us are tiny matters compared with what lies within us"
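A sketch of the simple approach Jack suggests, assuming Lucene 4.x: one document per log line in a tokenized text field, queried with a PhraseQuery. The field name, directory choice, and phrase terms are illustrative:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class LineIndexDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_44,
                new StandardAnalyzer(Version.LUCENE_44));
        try (IndexWriter writer = new IndexWriter(dir, cfg);
             BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                // One document per log line, tokenized so phrase queries
                // can match word sequences within the line
                Document doc = new Document();
                doc.add(new TextField("line", line, Field.Store.YES));
                writer.addDocument(doc);
            }
        }

        // Matches lines containing "package" immediately followed by "execution"
        PhraseQuery phrase = new PhraseQuery();
        phrase.add(new Term("line", "package"));
        phrase.add(new Term("line", "execution"));
    }
}
```

Because each line is its own document, a phrase hit identifies the exact line; the stored "line" field can then feed the SpellChecker dictionary for did-you-mean suggestions.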
Re: how to bypass the analyzer on some fields in QueryParser?
PerFieldAnalyzerWrapper:
http://lucene.apache.org/core/4_4_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/PerFieldAnalyzerWrapper.html

"This analyzer is used to facilitate scenarios where different fields require different analysis techniques."

-- Jack Krupansky

-----Original Message-----
From: Wenbo Zhao
Sent: Sunday, July 28, 2013 10:41 AM
To: lucene-user-mail-list
Subject: Re: how to bypass the analyzer on some fields in QueryParser?

Sorry guys, I think I made a mistake. The parse string I used was "date:\"2013/07/2*\" text:...". The quotes made the query parser ignore the trailing '*' and analyze the string. I changed it to "date:2013\/07\/2* text:...", and it works fine. Sorry for the noise :-)

2013/7/28 Wenbo Zhao:

Hi all, I'm stuck on one simple question; as the title says, I think it should have a simple solution. Say I use StandardAnalyzer and have two fields in all documents: StringField("date"...), which is not tokenized, in the format 2013/07/28, and TextField("text"...), which is tokenized. QueryParser parses "date:2013/07/2* text:something" into "+date:"2013 07 2" +text:something", which is apparently not what I want. I have other StringFields, not only date, so I'm looking for a general solution, not one specific to this date format.

Any reply will be appreciated. Thanks.

--
Best Regards,
ZHAO, Wenbo
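A sketch of the PerFieldAnalyzerWrapper approach for this situation, assuming Lucene 4.x: untokenized StringFields get a KeywordAnalyzer so the query parser leaves their values intact, while everything else keeps StandardAnalyzer. Field names mirror the thread; the query string is illustrative:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.util.Version;

public class PerFieldDemo {
    public static void main(String[] args) throws Exception {
        // "date" (and any other StringField) is passed through whole;
        // all other fields get normal StandardAnalyzer tokenization
        Map<String, Analyzer> perField = new HashMap<String, Analyzer>();
        perField.put("date", new KeywordAnalyzer());
        PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(
                new StandardAnalyzer(Version.LUCENE_44), perField);

        QueryParser parser = new QueryParser(Version.LUCENE_44, "text", wrapper);
        // The slashes still need escaping for the query *syntax*, but the
        // date value is no longer split into "2013 07 28" by analysis
        System.out.println(parser.parse("date:2013\\/07\\/28 text:something"));
    }
}
```

The same wrapper should be used at index time if the StringFields are ever switched to analyzed fields, so index and query analysis stay in sync.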
Re: Query serialization/deserialization
Yeah, it's a shame such a ser/deser feature isn't available in Lucene. My idea is to have a separate module that the Query classes can delegate to for serialization and deserialization, handling recursion for nested query objects, and then have modules for XML, JSON, and a pseudo-Java functional notation (maybe closer to JavaScript) for the actual formatting and parsing. Developers could then subclass the module to add any custom Query classes of their own.

A full JSON query ser/deser would be an especially nice addition to Solr, allowing direct access to all Lucene Query features even if they haven't been integrated into the higher-level query parsers. And maybe the format should have a flag indicating whether terms have been analyzed or not; deserialization could then optionally do analysis as well.

The Solr QueryParsing.toString method shows a purely external approach to serialization (I've done something similar myself). It is what is output in the "parsedquery" section of debugQuery output for a Solr query response.

-- Jack Krupansky

-----Original Message-----
From: Denis Bazhenov
Sent: Sunday, July 28, 2013 1:59 AM
To: java-user@lucene.apache.org
Subject: Query serialization/deserialization

I'm looking for a tool to serialize and deserialize Lucene queries. We have tried using Query.toString(), but some queries return a string that can't be parsed by a QueryParser afterwards. The alternative is to use the standard Java serialization mechanism. The reason I'm trying to avoid it is that serialization is used for communication in a distributed system, and we can't guarantee that all nodes run the same Lucene version at any particular point in time. Is there some way to perform query serialization in a "Lucene version independent" way?

---
Denis Bazhenov
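A rough sketch of the kind of external serializer module being discussed: a recursive walk over a few Query types emitting a JSON-ish string. This is an illustrative design, not an existing Lucene or Solr API; real use would need handlers for every query type, proper JSON string escaping, and a matching deserializer:

```java
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class QueryJson {
    // Recursively render a Query tree as JSON-like text. Unsupported
    // types fail loudly rather than being silently dropped.
    static String toJson(Query q) {
        if (q instanceof TermQuery) {
            TermQuery tq = (TermQuery) q;
            return "{\"term\":{\"" + tq.getTerm().field() + "\":\""
                    + tq.getTerm().text() + "\"}}";
        } else if (q instanceof PrefixQuery) {
            PrefixQuery pq = (PrefixQuery) q;
            return "{\"prefix\":{\"" + pq.getPrefix().field() + "\":\""
                    + pq.getPrefix().text() + "\"}}";
        } else if (q instanceof BooleanQuery) {
            StringBuilder sb = new StringBuilder("{\"bool\":[");
            boolean first = true;
            for (BooleanClause c : ((BooleanQuery) q).clauses()) {
                if (!first) sb.append(',');
                first = false;
                sb.append("{\"occur\":\"").append(c.getOccur())
                  .append("\",\"query\":").append(toJson(c.getQuery()))
                  .append('}');
            }
            return sb.append("]}").toString();
        }
        throw new IllegalArgumentException("unsupported: " + q.getClass());
    }
}
```

Unlike Query.toString(), a structured format like this can round-trip reliably and, as suggested above, could carry an "already analyzed" flag per term for version-independent exchange between nodes.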
Re: Performance measurements
In addition, although I am a bit beyond my expertise here, I believe you should be able to take any query object, including one returned from a query parser, wrap it with a ConstantScoreQuery, and then search on the CSQ to avoid all the scoring overhead. For example, "*:*" is super fast even though it matches everything - no scoring. -- Jack Krupansky -Original Message- From: Arjen van der Meijden Sent: Thursday, July 25, 2013 3:06 PM To: java-user@lucene.apache.org Subject: Re: Performance measurements Hi Sriram, I don't see any obvious mistakes, although you don't need to create a FilteredQuery: there are plenty of search-methods on the IndexSearcher that accept both a query (your TermQuery) and a filter (your TermsFilter). The way I understand Filters (but I have no advanced in-depth knowledge of them) is that they are very similar to Queries. Queries are used for two tasks: matching an item and giving some measure of how "well" it matched (i.e., the score). Filters are used only for matching, but I doubt there is very much difference from a technical point of view between the two ways of matching items. I'll leave more detailed explanations to others, as I might make too many mistakes or just assume I know something I actually don't :) Best regards, Arjen On 25-7-2013 19:56 Sriram Sankar wrote: Thanks everyone. I'm trying this out: So searching would become: - Create a Query with only your termA - Create a TermsFilter with all your termB's - execute your preferred search-method with both the query and the filter I don't get the same results as before - and am still debugging. But I'm including before and after code in case someone is able to see a problem with what I'm doing. I'm also looking for docs on how filters work (or will read the code). But at a high level, is the filter fully created when the Filter object is created? Or is it incrementally built during traversal (when next() and advance() are called on the filters). 
Reason for this question is related to early termination. The two versions of code (query based and filter based) are shown below - let me know if you see a problem with either. Ignore any minor syntactic errors that may have got introduced as I simplified my code for inclusion here. Thanks, Sriram.

QUERY APPROACH:

    BooleanQuery orTerms = new BooleanQuery();
    for (int i = 0; i < orCount; ++i) {
        TermQuery orArg = new TermQuery(new Term("conn", Integer.toString(connection[i])));
        BooleanClause cl = new BooleanClause(orArg, BooleanClause.Occur.SHOULD);
        orTerms.add(cl);
    }
    TermQuery tq = new TermQuery(new Term("name", name));
    BooleanQuery query = new BooleanQuery();
    query.add(new BooleanClause(tq, BooleanClause.Occur.MUST));
    query.add(new BooleanClause(orTerms, BooleanClause.Occur.MUST));

FILTER APPROACH:

    List<Term> terms = new ArrayList<Term>();
    for (int i = 0; i < orCount; ++i) {
        terms.add(new Term("conn", Integer.toString(connection[i])));
    }
    TermsFilter conns = new TermsFilter(terms);
    TermQuery tq = new TermQuery(new Term("name", name));
    FilteredQuery query = new FilteredQuery(tq, conns);

On Thu, Jul 25, 2013 at 12:14 AM, Arjen van der Meijden < acmmail...@tweakers.net> wrote: On 24-7-2013 21:58 Sriram Sankar wrote: On Wed, Jul 24, 2013 at 10:24 AM, Jack Krupansky wrote: Scoring has been a major focus of Lucene. Non-scored filters are also available, but the query parsers are focused (exclusively) on scored-search. When you say "filter" do you mean a step performed after retrieval? Or is it yet another retrieval operation? He is really referring to the Filters available as an addition to retrieval. 
The ones you supply with the search-method: http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/search/IndexSearcher.html#search%28org.apache.lucene.search.Query,%20org.apache.lucene.search.Filter,%20int%29 Unfortunately the documentation of Lucene is a bit fragmented, but basically they limit the scope of your search domain (i.e. reduce the available set of documents) during the processing of a query. So it basically becomes (query) AND (filters). There are several useful implementations available for the filters. But in your case you can just create a single TermsFilter (it's in the queries module/package), which is simply an OR-list like the one in your example (similar to a basic IN in SQL): http://lucene.apache.org/core/4_4_0/queries/org/apache/lucene/queries/TermsFilter.html So searching would become: - Create a Query with only your termA - Create a TermsFilter with all your termB's - execute your preferred search-method with both the query and the filter
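The "(query) AND (filters)" composition described in this thread can be sketched in plain Java, without Lucene. The SketchDoc class and its field names are hypothetical; the filter here plays the role of a TermsFilter (a match-only membership test, like a SQL IN list), while the query part is the clause that would normally also produce a score:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical minimal "document": a name field and a conn field.
class SketchDoc {
    final String name, conn;
    SketchDoc(String name, String conn) { this.name = name; this.conn = conn; }
}

class FilterSketch {
    // (query) AND (filter): the query part matches on the name term (and would
    // normally be scored); the filter part is a pure membership test that
    // narrows the candidate set, like TermsFilter / SQL IN.
    static List<String> search(List<SketchDoc> index, String name, Set<String> allowedConns) {
        return index.stream()
                .filter(d -> d.name.equals(name) && allowedConns.contains(d.conn))
                .map(d -> d.name + "/" + d.conn)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<SketchDoc> index = Arrays.asList(
                new SketchDoc("sriram", "1"),
                new SketchDoc("sriram", "9"),
                new SketchDoc("jack", "1"));
        Set<String> conns = new HashSet<>(Arrays.asList("1", "2"));
        // Only the doc that passes both the query and the filter survives.
        System.out.println(search(index, "sriram", conns));
    }
}
```

The key property mirrored here is that the filter does no scoring work at all, which is exactly why a TermsFilter beats a large SHOULD-clause BooleanQuery as n grows.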
Re: Performance measurements
I think I've exhausted my expertise in Lucene filters, but I think you can wrap a query with a filter and also wrap a filter with a query. So, for IndexSearcher.search, you could take a filter and wrap it with ConstantScoreQuery. So, if a BooleanQuery got wrapped as a filter, it could be wrapped as a CSQ for search so that no scoring would be done. -- Jack Krupansky -Original Message- From: Sriram Sankar Sent: Wednesday, July 24, 2013 3:58 PM To: java-user@lucene.apache.org Subject: Re: Performance measurements On Wed, Jul 24, 2013 at 10:24 AM, Jack Krupansky wrote: Unicorn sounds like it was optimized for graph search. Specialized search engines can in fact beat out generalized search engines for specific use cases. Yes and no (I worked on it). Yes, there are many aspect of Unicorn that have been optimized for graph search. But the tests I am running have very little to do with those optimizations. I am still learning about Lucene and have suspected that the scoring framework (that has to be very general) may be contributing to the performance issues. With Unicorn, we made a decision to do all scoring after retrieval and not during retrieval. Scoring has been a major focus of Lucene. Non-scored filters are also available, but the query parsers are focused (exclusively) on scored-search. When you say "filter" do you mean a step performed after retrieval? Or is it yet another retrieval operation? As Adrien indicates, try using raw Lucene filters and you should get much better results. Whether even that will compete with a use-case-specific (graph) search engine remains to be seen. Thanks (I will study this more). Sriram. -- Jack Krupansky -Original Message- From: Sriram Sankar Sent: Wednesday, July 24, 2013 1:03 PM To: java-user@lucene.apache.org Subject: Re: Performance measurements No I do not need scoring. 
This is a pure retrieval query - which matches what we used to do with Unicorn in Facebook - something like: (name:sriram AND (friend:1 OR friend:2 ...)) This automatically gives us second degree. With Unicorn, we would always get sub-millisecond performance even for n>500. Should I assume that Lucene is that much worse - or is it that this use case has not been optimized? Sriram. On Wed, Jul 24, 2013 at 9:59 AM, Adrien Grand wrote: Hi, On Wed, Jul 24, 2013 at 6:11 PM, Sriram Sankar wrote: > termA AND (termB1 OR termB2 OR ... OR termBn) Maybe this comment is not appropriate for your use-case, but if you don't actually need scoring from the disjunction on the right of the query, a TermsFilter will be faster when n gets large. -- Adrien - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Performance measurements
Unicorn sounds like it was optimized for graph search. Specialized search engines can in fact beat out generalized search engines for specific use cases. Scoring has been a major focus of Lucene. Non-scored filters are also available, but the query parsers are focused (exclusively) on scored-search. As Adrien indicates, try using raw Lucene filters and you should get much better results. Whether even that will compete with a use-case-specific (graph) search engine remains to be seen. -- Jack Krupansky -Original Message- From: Sriram Sankar Sent: Wednesday, July 24, 2013 1:03 PM To: java-user@lucene.apache.org Subject: Re: Performance measurements No I do not need scoring. This is a pure retrieval query - which matches what we used to do with Unicorn in Facebook - something like: (name:sriram AND (friend:1 OR friend:2 ...)) This automatically gives us second degree. With Unicorn, we would always get sub-millisecond performance even for n>500. Should I assume that Lucene is that much worse - or is it that this use case has not been optimized? Sriram. On Wed, Jul 24, 2013 at 9:59 AM, Adrien Grand wrote: Hi, On Wed, Jul 24, 2013 at 6:11 PM, Sriram Sankar wrote: > termA AND (termB1 OR termB2 OR ... OR termBn) Maybe this comment is not appropriate for your use-case, but if you don't actually need scoring from the disjunction on the right of the query, a TermsFilter will be faster when n gets large. -- Adrien - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Performance measurements
Thanks for the detailed numbers. Nothing seems unexpected to me. Increasing query complexity or term count is simply going to increase query execution time. I think I'll add a new rule to my informal performance guidance - Query complexity of no more than ten to twenty terms is a "slam dunk", but more than that is "uncharted territory" that risks queries taking more than half a second or even multiple seconds and requires a proof of concept implementation to validate reasonable query times. -- Jack Krupansky -Original Message- From: Sriram Sankar Sent: Wednesday, July 24, 2013 12:11 PM To: java-user@lucene.apache.org Subject: Performance measurements I did some performance tests on a real index using a query having the following pattern: termA AND (termB1 OR termB2 OR ... OR termBn) The results were not good and I was wondering if I may be doing something wrong (and what I would need to do to improve performance), or is it just that the OR is very inefficient. The format for the data below is illustrated below by example: 5|10 time: 0.092728962; scored: 18 Here, n=5, and we measure performance for retrieval of 10 results which is 0.0927ms. Had we not early terminated, we would have obtained 18 results. As you will see in the data below, the performance for n=0 is very good, but goes down drastically as n is increased. Sriram. 
0|10 time: 0.007941587; scored: 10887
0|1000 time: 0.018967384; scored: 10887
0|5000 time: 0.061943552; scored: 10887
0|1 time: 0.115327001; scored: 10887
1|10 time: 0.053950965; scored: 0
5|20 time: 0.274681853; scored: 18
10|10 time: 0.14251254; scored: 22
10|20 time: 0.282503313; scored: 22
20|10 time: 0.251964067; scored: 32
20|30 time: 0.52860957; scored: 32
50|10 time: 0.888969702; scored: 57
50|30 time: 1.078579956; scored: 57
50|50 time: 1.601169195; scored: 57
100|10 time: 1.396391061; scored: 79
100|40 time: 1.8083494; scored: 79
100|80 time: 2.921094513; scored: 79
200|10 time: 2.848105701; scored: 119
200|50 time: 3.472198462; scored: 119
200|100 time: 4.722673648; scored: 119
400|10 time: 4.463727049; scored: 235
400|100 time: 6.554119665; scored: 235
400|200 time: 9.591892527; scored: 235
- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: QueryParser for DisjunctionMaxQuery, et al.
I came up with some human-readable BNF rules for the Solr, dismax, and edismax query parsers. They are in my book. In fact they are linked from the "Solr Hot Spots" section in the preface. The main problems with re-parsing from a Lucene Query structure are two-fold: 1. Terms have been analyzed, and there is no guarantee that term analysis is idempotent (repeatable without changing the result.) 2. "Query enrichment" or "query enhancement" may have occurred - additional terms and operators added. Reparsing them may result in a second (or even nth) level of enrichment/enhancement. IOW, query enrichment/enhancement is not guaranteed to be idempotent. That said, sure, somebody could gin up a re-parser. The question is how useful it would be. I suppose if the spec for the reparser was that it was literal and with no term analysis or query enrichment/enhancement, it could work. The fact that nobody has done it is a testament to its marginal utility. Did you have a specific application in mind? A more useful utility would be a "partial parser" that only parses without analysis or enrichment and generates an abstract syntax tree that applications could then access and manipulate and then "regenerate" a true source query that doesn't have analysis or enrichment (except as the application may explicitly have performed on the tree.) -- Jack Krupansky -Original Message- From: Beale, Jim (US-KOP) Sent: Tuesday, July 23, 2013 10:07 AM To: java-user@lucene.apache.org Subject: QueryParser for DisjunctionMaxQuery, et al. Hello all, It seems somewhat odd to me that the Query classes generate strings that the QueryParser won't parse. Does anyone have a QueryParser that will parse the full range of Lucene query strings? Failing that, has the BNF been written down somewhere? I can't seem to find it for the full cases. Thanks for any info/guidance. 
Cheers, Jim Beale Lead Developer Hibu.com The information contained in this email message, including any attachments, is intended solely for use by the individual or entity named above and may be confidential. If the reader of this message is not the intended recipient, you are hereby notified that you must not read, use, disclose, distribute or copy any part of this communication. If you have received this communication in error, please immediately notify me by email and destroy the original message, including any attachments. Thank you. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
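The "partial parser" idea above - parse to an abstract syntax tree with no analysis and no enrichment - can be sketched with a tiny recursive-descent parser. The grammar here (bare terms, AND/OR, parentheses) is a deliberately minimal subset of the Lucene query syntax, and the class is hypothetical, not anything shipped with Lucene:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal AST-only parser for a subset of Lucene query syntax:
//   expr    := operand (("AND" | "OR") operand)*
//   operand := term | "(" expr ")"
class PartialParserSketch {
    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;

    PartialParserSketch(String input) {
        // Tokenize: parentheses become their own tokens; everything else splits on whitespace.
        for (String t : input.replace("(", " ( ").replace(")", " ) ").trim().split("\\s+")) {
            tokens.add(t);
        }
    }

    // Returns a LISP-ish rendering of the AST, e.g. (AND a (OR b c)).
    String parseExpr() {
        String left = parseOperand();
        while (pos < tokens.size()
                && (tokens.get(pos).equals("AND") || tokens.get(pos).equals("OR"))) {
            String op = tokens.get(pos++);
            String right = parseOperand();
            left = "(" + op + " " + left + " " + right + ")";
        }
        return left;
    }

    private String parseOperand() {
        if (tokens.get(pos).equals("(")) {
            pos++;                    // consume "("
            String inner = parseExpr();
            pos++;                    // consume ")"
            return inner;
        }
        return tokens.get(pos++);     // a bare term - note: no analysis is applied
    }

    public static void main(String[] args) {
        System.out.println(new PartialParserSketch("name:sriram AND (conn:1 OR conn:2)").parseExpr());
    }
}
```

Because terms pass through untouched, "regenerating" source text from this tree is trivially faithful - exactly the property that reparsing an analyzed, enriched Query structure cannot guarantee.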
Re: Trying to search java.lang.NullPointerException in log file.
"This is because the StandardAnalyzer must be splitting the words on "SPACES" and since there is no space present here. The entire string is converted into 1 token." Those statements are inconsistent! I mean, what code is converting the entire string to 1 token and eliminating white space? Is that your own code before you hand the string to the standard analyzer??? That makes no sense. I mean, the standard analyzer is using the standard tokenizer that doesn't do that! Are you applying the same analyzer at query time as you do at index time? It is not uncommon for Lucene users to forget to do that. If you don't, then you will have to hand-analyze the query string and simulate exactly what the standard analyzer did at index time. So, please clarify your situation. -- Jack Krupansky -Original Message- From: Ankit Murarka Sent: Monday, July 22, 2013 6:24 AM To: java-user@lucene.apache.org Subject: Trying to search java.lang.NullPointerException in log file. Hello. I am trying to search java.lang.NullPointerException in a log file. The log file is huge. However I am unable to search it. This is because the StandardAnalyzer must be splitting the words on "SPACES" and since there is no space present here. The entire string is converted into 1 token. What can be a possible way of finding "Exception:java.lang.NullPointerException" in a log file. The string may be different also. Suppose "Exception: java.lang.NullPointerException error occured" I am trying to use Phrase Query but I am not sure if that will serve the purpose. Can please someone suggest. -- Regards Ankit - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Trying to search java.lang.NullPointerException in log file.
Standard analyzer/tokenizer will use white space and other punctuation to delimit tokens. The rules are a little complicated (although I tried to summarize them for Solr in my book) - the same rules apply for Lucene. Verify that you are properly constructing a PhraseQuery from your analyzed text at query time. What is the exact query text and what are the exact analyzer tokens for that query text and how many are there? -- Jack Krupansky -Original Message- From: Ankit Murarka Sent: Monday, July 22, 2013 10:29 AM To: java-user@lucene.apache.org Subject: Re: Trying to search java.lang.NullPointerException in log file. First things first: the same analyzer is being used to index and to search. Now, I am not using any custom analyzer to split the string and get the tokens. I was assuming StandardAnalyzer might be using whitespaces to split the content. If that is not the case then I must have got it completely wrong. So for searching "java.lang.NullPointer" how should I proceed? This string might be present after : like ":java.lang.NullPointer" . In both cases I want to search for "java.lang.NullPointer" only. On 7/22/2013 7:51 PM, Jack Krupansky wrote: "This is because the StandardAnalyzer must be splitting the words on "SPACES" and since there is no space present here. The entire string is converted into 1 token." Those statements are inconsistent! I mean, what code is converting the entire string to 1 token and eliminating white space? Is that your own code before you hand the string to the standard analyzer??? That makes no sense. I mean, the standard analyzer is using the standard tokenizer that doesn't do that! Are you applying the same analyzer at query time as you do at index time? It is not uncommon for Lucene users to forget to do that. If you don't, then you will have to hand-analyze the query string and simulate exactly what the standard analyzer did at index time. So, please clarify your situation. 
-- Jack Krupansky -Original Message- From: Ankit Murarka Sent: Monday, July 22, 2013 6:24 AM To: java-user@lucene.apache.org Subject: Trying to search java.lang.NullPointerException in log file. Hello. I am trying to search java.lang.NullPointerException in a log file. The log file is huge. However I am unable to search it. This is because the StandardAnalyzer must be splitting the words on "SPACES" and since there is no space present here. The entire string is converted into 1 token. What can be a possible way of finding "Exception:java.lang.NullPointerException" in a log file. The string may be different also. Suppose "Exception: java.lang.NullPointerException error occured" I am trying to use Phrase Query but I am not sure if that will serve the purpose. Can please someone suggest. -- Regards Ankit Murarka "Peace is found not in what surrounds us, but in what we hold within." - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
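A rough approximation of what the standard tokenizer does to such a log line can be sketched in plain Java. This splits on any non-alphanumeric character and lowercases, which is an approximation only - the real standard tokenizer follows the more complicated Unicode segmentation rules mentioned above - but it shows why "Exception:java.lang.NullPointerException" is never one token:

```java
import java.util.ArrayList;
import java.util.List;

class TokenizeSketch {
    // Approximation of standard-analyzer behavior:
    // split on non-alphanumerics, lowercase, drop empty pieces.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.split("[^A-Za-z0-9]+")) {
            if (!t.isEmpty()) out.add(t.toLowerCase());
        }
        return out;
    }

    public static void main(String[] args) {
        // The colon and dots are token separators, so this yields four terms -
        // which is why a PhraseQuery over the analyzed terms (not a single-term
        // query for the whole string) is the way to search for it.
        System.out.println(tokens("Exception:java.lang.NullPointerException"));
    }
}
```

A PhraseQuery for [exception, java, lang, nullpointerexception] (in order) then matches the indexed positions, whether or not a space follows the colon.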
Re: Searching for words begining with "or"
Just so you know, the presence of a wildcard in a term means that the term will not be analyzed. So, state:OR* should fail since "OR" will not be in the index - because it would index as "or" (lowercase). Hmmm... why does "or" seem familiar...? Ah yeah... right!... The standard analyzer includes the standard stop filter, which defaults to using this set of stopwords: final List<String> stopWords = Arrays.asList( "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with" ); And... "or" is on that list! So, the standard analyzer is removing "or" from the index! That's why the query can't find it. Unless you really want these stop words removed, construct your own analyzer that does not do stop word removal. -- Jack Krupansky -Original Message- From: ABlaise Sent: Friday, July 19, 2013 12:07 AM To: java-user@lucene.apache.org Subject: Re: Searching for words begining with "or" When I make my query, everything goes well until I add the last part : (city:or* OR state:or*). I tried the first solution that was given to me but putting \OR and \AND doesn't seem to be the solution. The query is actually well built, he has no problem with OR or \OR to parse the query since the query looks like that : +(+(areaType:city areaType:neighborhood areaType:county) +areaName:portland*) +(city:or* state:or*). It seems to me as a valid query. It's just that he can't seem to find the 'OR' *in* the index... it's like they don't exist. And I know this because if I retrieve the last dysfunctional part of the query, he finds (among others) the right document, with the state written in it... It's like he can't 'see' the 'or' in the index... As for the upper/lower case, I am using a standard Analyzer to index and to search and I feed him with the states in upper case and he doesn't seem to change it. 
Still, I tried to put them in lower case but it didn't change anything... Thanks in advance for your future answers and for the help you already provided me with. -- View this message in context: http://lucene.472066.n3.nabble.com/Searching-for-words-begining-with-or-tp4079018p4079035.html Sent from the Lucene - Java Users mailing list archive at Nabble.com. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
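The removal described above can be reproduced with a plain-Java sketch of a stop filter over the default stopword list quoted in the reply - "or" never reaches the index, so no query (wildcard or otherwise) can ever match it. The class name is hypothetical; this is not Lucene's actual StopFilter:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class StopFilterSketch {
    // The default stopword set quoted above.
    static final List<String> STOP_WORDS = Arrays.asList(
            "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
            "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
            "the", "their", "then", "there", "these", "they", "this", "to", "was",
            "will", "with");

    // Drop any token that appears in the stopword list.
    static List<String> stopFilter(List<String> tokens) {
        return tokens.stream()
                .filter(t -> !STOP_WORDS.contains(t))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // state:OR - the value lowercases to "or" at index time,
        // and the stop filter then drops it entirely.
        System.out.println(stopFilter(Arrays.asList("state", "or")));
    }
}
```

Since nothing is indexed for the field value, the fix is on the analyzer side (no stopword removal), not on the query side (escaping OR does not help).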
Re: Searching for words begining with "or"
Break your query down into simpler pieces for testing. What pieces seem to have what problems? Be specific about the symptom, and how you "know" that something is wrong. You wrote: stored,indexed,tokenized,omitNorms>. But... the standard analyzer would have lowercased that term. Did it, or are you using some other analyzer? -- Jack Krupansky -Original Message- From: ABlaise Sent: Thursday, July 18, 2013 9:19 PM To: java-user@lucene.apache.org Subject: Searching for words begining with "or" Hi everyone, I am new to this forum, I have made some research for my question but I can't seem to find an answer for it. I am using Lucene for a project and I know for sure that in my lucene index I have somewhere this document with these elements : Document stored,indexed,tokenized,omitNorms stored,indexed,tokenized,omitNorms stored,indexed,tokenized,omitNorms>. I am looking for it but this query doesn't work : "(+areaType:(City OR Neighborhood OR County) +areaName:portland*) AND *(city:or* OR state:or*)*" and I have tried tons of alternatives (o*, o*r, ...). Lucene seems to mistake 'or' for the OR operator. How should I do to be more precise ? To add precision to my question, this String goes through a QueryParser with a StandardAnalyzer before being searched for in the index. Any help would be welcomed ! Thanks in advance, Adrien -- View this message in context: http://lucene.472066.n3.nabble.com/Searching-for-words-begining-with-or-tp4079018.html Sent from the Lucene - Java Users mailing list archive at Nabble.com. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Indexing into SolrCloud
Sorry, but you need to resend this message to the Solr user list - this is the Lucene user list. -- Jack Krupansky -Original Message- From: Beale, Jim (US-KOP) Sent: Thursday, July 18, 2013 12:34 PM To: java-user@lucene.apache.org Subject: Indexing into SolrCloud Hey folks, I've been migrating an application which indexes about 15M documents from straight-up Lucene into SolrCloud. We've set up 5 Solr instances with a 3 zookeeper ensemble using HAProxy for load balancing. The documents are processed on a quad core machine with 6 threads and indexed into SolrCloud through HAProxy using ConcurrentUpdateSolrServer in order to batch the updates. The indexing box is heavily loaded. I've been accepting the default HttpClient with 50K buffered docs and 2 threads, i.e., int solrMaxBufferedDocs = 5; int solrThreadCount = 2; solrServer = new ConcurrentUpdateSolrServer(solrHttpIPAddress, solrMaxBufferedDocs, solrThreadCount); autoCommit is configured in the solrconfig as follows: 60 50 false I'm getting the following errors on the client and server sides respectively: Client side: 2013-07-16 19:02:47,002 [concurrentUpdateScheduler-1-thread-4] INFO SystemDefaultHttpClient - I/O exception (java.net.SocketException) caught when processing request: Software caused connection abort: socket write error 2013-07-16 19:02:47,002 [concurrentUpdateScheduler-1-thread-4] INFO SystemDefaultHttpClient - Retrying request 2013-07-16 19:02:47,002 [concurrentUpdateScheduler-1-thread-5] INFO SystemDefaultHttpClient - I/O exception (java.net.SocketException) caught when processing request: Software caused connection abort: socket write error 2013-07-16 19:02:47,002 [concurrentUpdateScheduler-1-thread-5] INFO SystemDefaultHttpClient - Retrying request Server side: 7988753 [qtp1956653918-23] ERROR org.apache.solr.core.SolrCore - java.lang.RuntimeException: [was class org.eclipse.jetty.io.EofException] early EOF at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18) at 
com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731) at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657) at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809) at org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:393) When I disabled autoCommit on the server side, I didn't see any errors there but I still get the issue client-side after about 2 million documents - which is about 45 minutes. Has anyone seen this issue before? I couldn't find anything useful on the usual places. I suppose I could setup wireshark to see what is happening but I'm hoping that someone has a better suggestion. Thanks in advance for any help! Best regards, Jim Beale hibu.com 2201 Renaissance Boulevard, King of Prussia, PA, 19406 Office: 610-879-3864 Mobile: 610-220-3067 The information contained in this email message, including any attachments, is intended solely for use by the individual or entity named above and may be confidential. If the reader of this message is not the intended recipient, you are hereby notified that you must not read, use, disclose, distribute or copy any part of this communication. If you have received this communication in error, please immediately notify me by email and destroy the original message, including any attachments. Thank you. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Query expansion in Lucene (4.x)
We don't commonly use the term "query expansion" for Lucene and Solr, but I would say that there are two categories of "QE": 1. Lightweight QE, by which I mean things like synonym expansion, stemming, stopword removal, spellcheck, and anything else that modifies the raw query in any way that affects the results. MoreLikeThis/Find Similar could also be considered lightweight QE. In this lightweight sense, Lucene and Solr have lots of QE features. 2. Heavyweight QE, such as the "Unsupervised Feedback" feature that LucidWorks Search offers, which can automatically iterate on the original query, either refining or expanding the results to increase either precision or recall (your choice). Lucene and Solr themselves do not offer this automated capability, although the LucidWorks Search query feature is completely implemented using underlying Lucene and Solr features. Wikipedia: http://en.wikipedia.org/wiki/Query_expansion Lucid: http://docs.lucidworks.com/display/help/Unsupervised+Feedback+Options http://docs.lucidworks.com/display/lweug/Understanding+and+Improving+Relevance#UnderstandingandImprovingRelevance-UnsupervisedFeedback -- Jack Krupansky -Original Message- From: Michael O'Leary Sent: Wednesday, July 17, 2013 5:58 PM To: java-user@lucene.apache.org Subject: Query expansion in Lucene (4.x) I was reading a paper about Query Expansion ( http://search.fub.it/claudio/pdf/CSUR-2012.pdf) and it said: "For instance, Google Enterprise, MySQL and Lucene provide the user with an AQE facility that can be turned on or off." I searched through the Lucene 4.1.0 source code, which is what I have downloaded, and through Lucene forums and the web generally looking for references to query expansion in Lucene, and what I found was some discussion of the SynonymFilter class and external projects called LucQE and LuceneQE. 
Is there other code in Lucene that supports query expansion that I missed (because I didn't expand the queries I was searching with enough or something :-) or are these the primary Lucene and Lucene-compatible components that support query expansion? Thanks, Mike - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: What is text searching algorithm in Lucene 4.3.1
The core tf-idf scoring is described in this Javadoc: http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html That describes the scoring model and cites some papers. Then you can navigate up to the base class and see that BM25 is another derived class. Unfortunately, that has less Javadoc, although it does cite a key paper on that approach. -- Jack Krupansky -Original Message- From: Erick Erickson Sent: Wednesday, July 17, 2013 8:17 AM To: java-user Subject: Re: What is text searching algorithm in Lucene 4.3.1 Note: as of Lucene 4.x, you can plug in your own scoring algorithm, it ships with several variants (e.g. BM25) so you can look at the pluggable scoring where all the code for the various algorithms is concentrated. Erick On Wed, Jul 17, 2013 at 12:40 AM, Jack Krupansky wrote: The source code is what most people use to understand how Lucene actually works. In some cases the Javadoc comments will point to published papers or web sites for algorithms or approaches. -- Jack Krupansky -Original Message- From: Vinh Đặng Sent: Tuesday, July 16, 2013 10:54 PM To: java-user@lucene.apache.org Subject: What is text searching algorithm in Lucene 4.3.1 Hi all, I am trying to apply Lucene for a specific domain, so I need to customize the text searching / text comparing algorithm of Lucene. Is there any guideline / tutorial or article which explains about how Lucene search and answer the query? Thank you very much. -- Thank you very much VINH Dang (Mr) - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
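As a numeric companion to that Javadoc, the heart of the classic scoring model can be sketched with the default tf and idf formulas documented for TFIDFSimilarity (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1))). Treat this as an illustration of the shape of the model only - the full score also involves coord, queryNorm, boosts, and field norms, which are omitted here:

```java
class TfIdfSketch {
    // tf: sublinear growth with in-document term frequency.
    static double tf(int freqInDoc) {
        return Math.sqrt(freqInDoc);
    }

    // idf: rare terms (low docFreq) get a higher weight than common terms.
    static double idf(long numDocs, long docFreq) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        long numDocs = 1_000_000;
        // Same in-document frequency, very different document frequencies:
        // the rare term dominates the per-term weight.
        double rare = tf(3) * idf(numDocs, 10);
        double common = tf(3) * idf(numDocs, 500_000);
        System.out.printf("rare=%.3f common=%.3f%n", rare, common);
    }
}
```

BM25, by contrast, saturates tf and normalizes by document length; comparing these two formulas against BM25's is a quick way to see what the pluggable Similarity classes actually change.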
Re: What is text searching algorithm in Lucene 4.3.1
The source code is what most people use to understand how Lucene actually works. In some cases the Javadoc comments will point to published papers or web sites for algorithms or approaches. -- Jack Krupansky -Original Message- From: Vinh Đặng Sent: Tuesday, July 16, 2013 10:54 PM To: java-user@lucene.apache.org Subject: What is text searching algorithm in Lucene 4.3.1 Hi all, I am trying to apply Lucene for a specific domain, so I need to customize the text searching / text comparing algorithm of Lucene. Is there any guideline / tutorial or article which explains about how Lucene search and answer the query? Thank you very much. -- Thank you very much VINH Dang (Mr) - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: [ANNOUNCE] Web Crawler
Lucene does not provide any capabilities for crawling websites. You would have to contact the Nutch project, the ManifoldCF project, or other web crawling projects. As far as bypassing robots.txt, that is a very unethical thing to do. It is rather offensive that you seem to be suggesting that anybody on this mailing list would engage in such an unethical or unprofessional activity. -- Jack Krupansky -Original Message- From: Ramakrishna Sent: Monday, July 15, 2013 9:13 AM To: java-user@lucene.apache.org Subject: Re: [ANNOUNCE] Web Crawler Hi.. I'm trying nutch to crawl some web-sites. Unfortunately they restricted to crawl their web-site by writing robots.txt. By using crawl-anywhere can I crawl any web-sites irrespective of that web-sites robots.txt??? If yes, plz send me the materials/links to study about crawl-anywhere or else plz suggest me which are the crawlers to use to crawl web-sites without bothering about robots.txt of that particular site. Its urgent plz reply as soon as possible. Thanks in advance -- View this message in context: http://lucene.472066.n3.nabble.com/ANNOUNCE-Web-Crawler-tp2607833p4078039.html Sent from the Lucene - Java Users mailing list archive at Nabble.com. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Lucene in Action
Oh, yeah, ... that was the original book, that got canceled. Long story. O'Reilly should have taken that page down by now - oh well. But now I have the Solr-only book, self-published as an e-book on Lulu.com. Yes, LIA2 is still a valuable resource. Details have changed, but most concepts are still valid. -- Jack Krupansky -Original Message- From: Ivan Brusic Sent: Wednesday, July 10, 2013 10:41 AM To: java-user@lucene.apache.org Subject: Re: Lucene in Action Jack, don't you also have a book coming out on O'Reilly? http://shop.oreilly.com/product/0636920028765.do Lucene in Action might be outdated, but many of the core concepts remain the same. The analysis chain (analyzers/tokenizers) might have a slightly different API, but the concepts are still valid. -- Ivan On Wed, Jul 10, 2013 at 6:55 AM, Vinh Dang wrote: Thank Jack very much. Btw, I still find a details tutorial which guide me step by step, from download lucene until configure IDE (I unzipped lucene and received a complex folder - I don't know what should I do next?) And, a dummy question,there are three files to download in lucene 4.3.1 download page. Which file I should download? Sent from my BlackBerry® smartphone from Viettel -Original Message- From: "Jack Krupansky" Sender: "Jack Krupansky" Date: Wed, 10 Jul 2013 08:50:39 To: Reply-To: java-user@lucene.apache.org Subject: Re: Lucene in Action I would also note that Solr is a great way to start learning Lucene since a lot of the underlying Lucene concepts are visible in Solr and eventually a lot of Lucene users will end up having to re-invent features that Solr adds to Lucene, anyway. Maybe when I finish covering Solr in my e-book I'll start diving into coverage of Lucene as well. The original book idea did include covering both Lucene and Solr, but that deal fell through and I decided to stick with Solr as my initial focus. 
Actually, I've already had some thoughts about adding some chapters on a deeper dive into underlying Lucene concepts for Solr users, such as the structure of Query objects and tokenization and token filtering, mostly since advanced Solr users run into issues there, but those areas are difficult for Lucene users as well. My e-book: http://www.lulu.com/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-2/ebook/product-21099700.html -- Jack Krupansky -Original Message- From: Erick Erickson Sent: Wednesday, July 10, 2013 8:14 AM To: java-user ; dqvin...@gmail.com Subject: Re: Lucene in Action Right, unfortunately, there's nothing that I know of that's super-recent. Jack Krupansky is e-publishing a book on Solr, which will be more up to date but I don't know how thoroughly it dives into the underlying Lucene code. Otherwise, I think the best thing is to tackle a real problem (perhaps try working on a JIRA?) and get into it that way. The 3.1 book has a lot of good background, it's still valuable. But the examples will be out of date. Unfortunately, writing a book is an incredible amount of work, writing code is much more fun ... Best Erick On Tue, Jul 9, 2013 at 11:22 AM, Vinh Dang wrote: > Please try, I am new on Lucene also, and willing to study and share :) > Sent from my BlackBerry(R) smartphone from Viettel > > -Original Message- > From: "Vinh Dang" > Date: Tue, 9 Jul 2013 15:19:41 > To: > Reply-To: dqvin...@gmail.com > Subject: Re: Lucene in Action > > You have my yesterday question :) > > After unzipping lucene, you just need to import the lucene core JAR file into > your project to use it (with eclipse, just drag and drop). 
> > Lucene core.jar (I do not remember exact name, but easy to find this jar > file) provides core functions of lucene > --Original Message-- > From: Šimun Šunjić > To: java-user@lucene.apache.org > ReplyTo: java-user@lucene.apache.org > Subject: Lucene in Action > Sent: Jul 9, 2013 16:08 > > I am learning about Apache Lucene from Manning book: Lucene in Action. > However examples from book is for Lucene v3.0.3 and today Lucene is in > version 4.3.1. I can't find any good newer Lucene tutorial for learning, > can you guys from community suggest me some :) > > Thanks > > -- > mag.inf. Šunjić Šimun > > > Sent from my BlackBerry(R) smartphone from Viettel - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Lucene in Action
I would also note that Solr is a great way to start learning Lucene since a lot of the underlying Lucene concepts are visible in Solr and eventually a lot of Lucene users will end up having to re-invent features that Solr adds to Lucene, anyway. Maybe when I finish covering Solr in my e-book I'll start diving into coverage of Lucene as well. The original book idea did include covering both Lucene and Solr, but that deal fell through and I decided to stick with Solr as my initial focus. Actually, I've already had some thoughts about adding some chapters on a deeper dive into underlying Lucene concepts for Solr users, such as the structure of Query objects and tokenization and token filtering, mostly since advanced Solr users run into issues there, but those areas are difficult for Lucene users as well. My e-book: http://www.lulu.com/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-2/ebook/product-21099700.html -- Jack Krupansky -Original Message- From: Erick Erickson Sent: Wednesday, July 10, 2013 8:14 AM To: java-user ; dqvin...@gmail.com Subject: Re: Lucene in Action Right, unfortunately, there's nothing that I know of that's super-recent. Jack Krupansky is e-publishing a book on Solr, which will be more up to date but I don't know how thoroughly it dives into the underlying Lucene code. Otherwise, I think the best thing is to tackle a real problem (perhaps try working on a JIRA?) and get into it that way. The 3.1 book has a lot of good background, it's still valuable. But the examples will be out of date. Unfortunately, writing a book is an incredible amount of work, writing code is much more fun ... 
Best Erick On Tue, Jul 9, 2013 at 11:22 AM, Vinh Dang wrote: Please try, I am new on Lucene also, and willing to study and share :) Sent from my BlackBerry(R) smartphone from Viettel -Original Message- From: "Vinh Dang" Date: Tue, 9 Jul 2013 15:19:41 To: Reply-To: dqvin...@gmail.com Subject: Re: Lucene in Action You have my yesterday question :) After unzip lucene, you just need to import lucene core. JAR file into your project to use (with eclipse,just drag and drop). Lucene core.jar (I do not remember exact name, but easy to find this jar file) provides core functions of lucene --Original Message-- From: Šimun Šunjić To: java-user@lucene.apache.org ReplyTo: java-user@lucene.apache.org Subject: Lucene in Action Sent: Jul 9, 2013 16:08 I am learning about Apache Lucene from Manning book: Lucene in Action. However examples from book is for Lucene v3.0.3 and today Lucene is in version 4.3.1. I can't find any good newer Lucene tutorial for learning, can you guys from community suggest me some :) Thanks -- mag.inf. Šunjić Šimun Sent from my BlackBerry(R) smartphone from Viettel - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Please Help solve problem of bad read performance in lucene 4.2.1
To be clear, Lucene and Solr are "search" engines, NOT "storage" engines. Has someone claimed otherwise to you? What is your query performance in 4.x vs. 3.x? That's the true, proper measure of Lucene and Solr performance. -- Jack Krupansky -Original Message- From: Chris Zhang Sent: Sunday, July 07, 2013 12:26 PM To: java-user@lucene.apache.org Subject: Re: Please Help solve problem of bad read performance in lucene 4.2.1 Thanks Adrien. In my project, almost all hit docs are supposed to be fetched for every query; that's why I am upset by the poor reading performance. Maybe I should store the field values that need to be stored in a high-performance storage engine instead. In the above test case, reading all docs in Lucene 3.0 takes about 78 sec, a reading speed of approximately 10MB/s, but 700+ sec in Lucene 4.2.1, which indicates a reading speed of less than 1MB/s. So I think the Lucene committers should pay attention to this. On Sun, Jul 7, 2013 at 10:23 PM, Adrien Grand wrote: Indeed, Lucene 4.1+ may be a bit slower for indices that completely fit in your file-system cache. On the other hand, you should see better performance with indices which are larger than the amount of physical memory of your machine. Your reading benchmark only measures IndexReader.get(int) which should only be used to display summary results (that is, only called 10 or 20 times per displayed page). Most of the time, the bottleneck is rather searching, which can be made more efficient on small indices by switching to an in-memory postings format. -- Adrien - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Forcing lucene to use specific field when processing parsed query
Each Query object has getter methods for each of its fields. If a field is a nested Query object, you will then have to recursively process it. Be prepared to do lots of "instanceof" conditionals. None of this is hard, just lots of dog work. Consult the Javadoc for the getter methods. And when you write your instanceof conditionals, be sure to include an "else" that displays the actual class name so that you can then add yet another instanceof. Now, what you are really looking for is the TermQuery, that has a field name - that's what you want to change. After you have called all of the getters, you create a new Query object of the current type (the type you referenced in the "instanceof") and return that new Query object. Recursion would return the new Query object. -- Jack Krupansky -Original Message- From: Puneet Pawaia Sent: Saturday, July 06, 2013 12:54 PM To: java-user@lucene.apache.org Subject: Re: Forcing lucene to use specific field when processing parsed query Hi Uwe Unfortunately, at the stage I am required to do this, I do not have the query text. I only have the parsed query. I thought of iterating through the query clauses etc but could not find how to. What function allows me to do this ? Regards Puneet Pawaia On 6 Jul 2013 20:06, "Uwe Schindler" wrote: Hi, You can only do this manually with instanceof checks and walking through BooleanClauses. The better way to fix your problem would be to change your query parser, e.g. by overriding getFieldQuery and other protected methods to enforce a specific field while parsing the query string. Uwe Puneet Pawaia schrieb: >Hi all, > >I am using Lucene.Net 3.0.3 and need to search in a specific field >(ignoring any fields specified in the query). I am given a parsed >Lucene >Query so I am unable to generate a parsed query with my required field. > >Is there any functionality in Lucene that allows me to loop through the >terms of the query and change the field ? 
The given query would be a >complex query where there would be spans, clauses etc. > >Or perhaps there is some way of forcing Lucene to ignore the fields >given >in the parsed query and used a specified field only. > >Any help would be most appreciated. > >Regards >Puneet Pawaia -- Uwe Schindler H.-H.-Meier-Allee 63, 28213 Bremen http://www.thetaphi.de - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
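A minimal sketch of the recursion Jack describes, assuming Lucene 4.x core classes. Only TermQuery and BooleanQuery are handled here; a real implementation needs an instanceof branch per query type you encounter (PhraseQuery, span queries, etc.), with the else reporting the unexpected class name as suggested above:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Sketch: rebuild a query tree, forcing every TermQuery onto one field.
// Add an instanceof branch for each additional query type you hit.
public class FieldRewriter {
    static Query rewriteField(Query q, String field) {
        if (q instanceof TermQuery) {
            Term t = ((TermQuery) q).getTerm();
            TermQuery rewritten = new TermQuery(new Term(field, t.text()));
            rewritten.setBoost(q.getBoost());
            return rewritten;
        } else if (q instanceof BooleanQuery) {
            // BooleanQuery is Iterable<BooleanClause> in 4.x; recurse
            // into each clause and rebuild with the same occurs.
            BooleanQuery result = new BooleanQuery();
            for (BooleanClause c : (BooleanQuery) q) {
                result.add(rewriteField(c.getQuery(), field), c.getOccur());
            }
            result.setBoost(q.getBoost());
            return result;
        } else {
            // The "else" Jack recommends: surface the class so you know
            // which instanceof branch to add next.
            throw new IllegalArgumentException(
                "Unhandled query type: " + q.getClass().getName());
        }
    }
}
```

For example, rewriting `title:foo AND title:bar` with field `"body"` yields `body:foo AND body:bar` while preserving occurs and boosts.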
Re: handling nonexistent fields in an index
There is a Lucene filter that you can use to check efficiently for whether a field has a value or not. new ConstantScoreQuery(new FieldValueFilter(String field, boolean negate)) -- Jack Krupansky -Original Message- From: David Carlton Sent: Wednesday, July 03, 2013 4:27 PM To: java-user@lucene.apache.org Subject: handling nonexistent fields in an index I have a bunch of Lucene indices lying around, and I want to start adding a new field to documents in new indices that I'm generating. So, for a given index, either every document in the index will have that field or no document will have that field. The new field has a default value; and I would like to write a query that, when applied to old indices, matches all documents, while when applied to new indices, it will only match documents with that specific default value. (Probably the query will include other restrictions, but the other restrictions have nothing to do with the new field, so they'll apply to both indices.) I can, of course, write two different queries, one for the old indices and one for the new indices; for layering reasons, I'd prefer not to do that, but it's a possibility. (I can't, however, go back to the old indices and add the new field in.) Any suggestions for how to write a single query that will work in both places? Basically, what I want is a query that says something like (field IS MISSING) OR (field = DEFAULT_VALUE) If it matters, the new field will only take one of a small number of values, ten or so. The one hint I've turned up when googling is this: http://stackoverflow.com/questions/4365369/solr-search-for-documents-where-a-field-doesnt-exist It talks in terms of Solr, but hopefully I can figure out how to translate that into stock Lucene? 
Thinking out loud about what it suggests, I guess maybe I can generate a WildcardQuery for my field with * (which I hope won't be too expensive, given how few values my field has), and then do something like (field = DEFAULT_VALUE) OR NOT (field matches *) And then I have to translate that into Lucene BooleanQuery syntax; I think I can probably handle that step of things (I've done that sort of thing before), but if anybody has tips, I'm all ears. Basically, any suggestions would be welcome, whether about the basic approach or about the details. And I would in particular very much appreciate advice as to whether or not WildcardQuery(field, *) will have good performance if field only takes a small number of values. -- David Carlton carl...@sumologic.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
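Putting Jack's filter suggestion into the shape David wants, here is a sketch of the "(field IS MISSING) OR (field = DEFAULT_VALUE)" query using FieldValueFilter with negate=true, so no WildcardQuery is needed at all. This assumes the Lucene 4.x core classes; the field name "status" and value "DEFAULT" are made-up placeholders:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.FieldValueFilter;
import org.apache.lucene.search.TermQuery;

// Sketch: match docs where the field is missing OR equals the default.
public class MissingOrDefaultQuery {
    static BooleanQuery build(String field, String defaultValue) {
        BooleanQuery q = new BooleanQuery();
        q.add(new TermQuery(new Term(field, defaultValue)),
              BooleanClause.Occur.SHOULD);
        // FieldValueFilter(field, negate=true) matches documents with NO
        // value for the field (old indices); ConstantScoreQuery wraps the
        // filter so it can be used as a query clause.
        q.add(new ConstantScoreQuery(new FieldValueFilter(field, true)),
              BooleanClause.Occur.SHOULD);
        return q;
    }
}
```

Since both clauses are SHOULD and at least one must match, old-index documents match via the filter clause and new-index documents match only when the field holds the default value, which is exactly the single-query behavior asked for.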
Re: highlighting component to searchComponent
Try asking your question on the “Solr user” email list – this is the Lucene user list! -- Jack Krupansky From: Adrien RUFFIE Sent: Monday, July 01, 2013 4:36 AM To: java-user@lucene.apache.org Subject: highlighting component to searchComponent Hello all I had the following configuration in my solrconfig.xml : But when I start my webapp the following message appears: WARNING: Deprecated syntax found. should move to So I have tried to convert my highlighting to searchComponent with the following configuration: 70 0.5 [-\w ,/\n\"']{20,200} But now the following error appears … do you know the problem and what is the correct class implementation? GRAVE: org.apache.solr.common.SolrException: Error Instantiating SolrFragmentsBuilder, solr.highlight.GapFragmenter is not a org.apache.solr.highlight.SolrFragmentsBuilder If you know the right way to do this conversion, I'm interested. Best regards, Adrien RUFFIE R&D Engineer 40, rue du Village d’Entreprises 31670 Labège www.e-deal.com LD : +33 1 73 03 29 50 Std : +33 1 73 03 29 80 Fax : +33 1 73 01 69 77 a.ruf...@e-deal.com E-DEAL supports the UN Global Compact
Re: Relevance ranking calculation based on filtered document count
The very definition of a "filter" in Lucene is that it doesn't influence relevance/scoring in any way, so your question is a contradiction in terms. If you are finding that the use of a filter is affecting the scores of documents, then that is clearly a bug. -- Jack Krupansky -Original Message- From: Nigel V Thomas Sent: Monday, July 01, 2013 7:38 AM To: java-user@lucene.apache.org Subject: Relevance ranking calculation based on filtered document count Hi, I would like to know if it is possible to calculate the relevance ranks of documents based on filtered document count? The current filter implementations as far as I know, seems to be applied after the query is processed and ranked against the full set of documents. Since system wide IDF values are used to rank documents, the resulting ordering is different from a set whose range is restricted only to the filtered set of documents. Many thanks, Nigel - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: How to Perform a Full Text Search on a Number with Leading Zeros or Decimals?
The user could use a regular expression query to match the numbers, but otherwise, you will have to write a specialized token filter that recognizes numeric tokens and generates extra tokens at the same position for each token variant that you want to search for. -- Jack Krupansky -Original Message- From: Todd Hunt Sent: Friday, June 28, 2013 2:18 PM To: java-user@lucene.apache.org Subject: How to Perform a Full Text Search on a Number with Leading Zeros or Decimals? I have an application that is indexing the text from various reports and forms that are generated from our core system. The reports will contain dollar amounts and various indexes that contain all numbers, but have leading zeros. If a document contains the following text that is stored in one Lucene document field: "Account 0012345 owes $321.98" What analyzer can be used to index this text and allow the user to find this document by searching on: 12345 OR 321 ??? We are currently using a StandardAnalyzer which works well for most of our use cases, but not one like this. I realize that I could create my own token filter to convert any text that can be represented by an Integer or Long, with leading zeros or not, and convert the value to a normal looking integer without leading zeros. But I'd prefer to reuse an existing analyzer or technique to achieve the same results. Thank you. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
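A sketch of the normalization such a filter could apply per token, in plain Java string logic. Wiring it into an actual Lucene TokenFilter (emitting the variant with a position increment of 0 via CharTermAttribute/PositionIncrementAttribute, so both forms are searchable at the same position) is left out; the class and method names here are made up for illustration:

```java
// Sketch: derive a normalized variant for digit-run tokens with leading
// zeros, e.g. "0012345" -> "12345". A custom TokenFilter would emit this
// variant at the same position as the original token (position increment
// 0) so that both "0012345" and "12345" match the document.
public class NumericVariants {
    // Returns the token with leading zeros stripped, or null if the token
    // is not a plain digit run starting with a zero (i.e. nothing to add).
    static String stripLeadingZeros(String token) {
        if (!token.matches("0+[0-9]*")) {
            return null;
        }
        String stripped = token.replaceFirst("^0+", "");
        return stripped.isEmpty() ? "0" : stripped;
    }
}
```

So for "Account 0012345 owes $321.98", the "0012345" token would additionally be indexed as "12345", and a plain search on 12345 then matches without any wildcards.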
Re: Questions about doing a full text search with numeric values
Do continue to experiment with Solr as a "testbed" - all of the analysis filters used by Solr are part of Lucene, so once you figure things out in Solr (using the Solr Admin UI analysis page), you can mechanically translate to raw Lucene API calls. Look at the standard tokenizer; it should do a better job with punctuation. -- Jack Krupansky -Original Message- From: Todd Hunt Sent: Thursday, June 27, 2013 1:14 PM To: java-user@lucene.apache.org Subject: Questions about doing a full text search with numeric values I am working on an application that is using Tika to index text based documents and store the text results in Lucene. These documents can range anywhere from 1 page to thousands of pages. We are currently using Lucene 3.0.3. I am currently using the StandardAnalyzer to index and search for the text that is contained in one Lucene document field. For strictly alpha based, English words, the searches return the results as expected. The problem has to do with searching for numeric values in the indexed documents. Examples of text in the documents that cannot be found unless wildcards are used:
- 1-800-costumes.com: searching 800 does not find it
- $118.30: searching 118 does not find it
- 3tigers: searching 3 does not find it
- 00123456: searching 123456 does not find it
- 123,abc,foo,bar,456 (this is in a CSV file): neither 123 nor 456 finds it; I realize that this is because the text is only separated by commas and so is treated as one token, but I think the issue is no different from the others
The expectation from our users is that if they can open the document in its default application (Word, Adobe, Notepad, etc.) and perform a "find" within that application and find the text, then our application based on Lucene should be able to find the same text. It is not reasonable for us to request that our users surround their search with wildcards. 
Also, it seems like a kludge to programmatically put wildcards around any numeric values the user may enter for searching. Is there some type of numeric parser or filter that would help me out with these scenarios? I've looked at Solr, and we already have a strong foundation of code utilizing Spring, Hibernate, and Lucene. Trying to integrate Solr into our application would take too much refactoring and time that isn't available for this release. Also, since these numeric values are embedded within the documents, I don't think storing them in their own field would make sense, since I want to maintain the context of the numeric values within the document. Thank you. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Language detection
Oops... sorry, I just realized this was on the Lucene-user list. My response was for Solr-ONLY! -- Jack Krupansky -Original Message- From: Jack Krupansky Sent: Thursday, June 27, 2013 1:11 PM To: java-user@lucene.apache.org Subject: Re: Language detection You can use the LangDetectLanguageIdentifierUpdateProcessorFactory update processor to redirect languages to alternate fields, and then set the non-English fields to be "ignored". But, the document would still be indexed, just without the redirected text fields. (Examples of using that update processor are in my book - but not the "ignored" step.) There is also a Tika-specific processor as well: TikaLanguageIdentifierUpdateProcessorFactory If you really want to completely suppress the indexing of documents containing non-English text, you'll have to make an explicit check before sending the document to Solr. Tika also has language detection, so you could call Tika from an external process before sending the document to Solr. -- Jack Krupansky -Original Message- From: Hang Mang Sent: Thursday, June 27, 2013 11:45 AM To: java-user@lucene.apache.org Subject: Language detection Hello, is there some kind of a filter or component that I could use to filter out non-English text? I have a preprocessing step in which I only want to index English documents. Best, Gucko - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Language detection
You can use the LangDetectLanguageIdentifierUpdateProcessorFactory update processor to redirect languages to alternate fields, and then set the non-English fields to be "ignored". But, the document would still be indexed, just without the redirected text fields. (Examples of using that update processor are in my book - but not the "ignored" step.) There is also a Tika-specific processor as well: TikaLanguageIdentifierUpdateProcessorFactory If you really want to completely suppress the indexing of documents containing non-English text, you'll have to make an explicit check before sending the document to Solr. Tika also has language detection, so you could call Tika from an external process before sending the document to Solr. -- Jack Krupansky -Original Message- From: Hang Mang Sent: Thursday, June 27, 2013 11:45 AM To: java-user@lucene.apache.org Subject: Language detection Hello, is there some kind of a filter or component that I could use to filter out non-English text? I have a preprocessing step in which I only want to index English documents. Best, Gucko - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Securing stored data using Lucene
Is your device patient data for a single patient or is it for potentially a large number of patients (on a single device)? Generally, the proper way to "secure" a Lucene/Solr server and its data is to keep the server and data behind a firewall and also behind an application layer so that no outside process can directly talk to Solr. If you are developing a custom server based only on Lucene, then it is up to you to assure that your server is secure and Lucene has no role in that security. Or... are you actually thinking of running Lucene on the device? If so, and security is a concern, then you are in UNCHARTED TERRITORY and really on your own. In that case, you probably need to be talking about custom codecs for encrypted data for all of your fields. Even then, the in-memory representation of the field values would be unencrypted and hence insecure if someone took a stolen device and directly examined the memory. How much "Rich Lucene Search" do you need to do on the device itself, as opposed to just looking for "(encrypted) blob storage"? If you want Lucene to do your searching on field values, the field values will be exposed in memory. If all you want is to retrieve an encrypted blob based on an encrypted key, why are you even considering Lucene? -- Jack Krupansky -Original Message- From: Rafaela Voiculescu Sent: Wednesday, June 26, 2013 5:06 AM To: java-user@lucene.apache.org Subject: Re: Securing stored data using Lucene Hello, Thank you all for your help and the suggestions. They are very useful. I wanted to clarify more aspects of the project, since I overlooked them in my previous mails. To explain the use case exactly, the application should work like this: - The application works with patient data and that's why we want to keep things confidential. We are downloading patient data that can go to the mobile device (it should even work on desktop in a similar way really) - We have to keep the data in the device due to internet connection limitation. 
The device will get, if lucky, internet connection once or twice per week, hence us needing to keep the patient data locally - The thing I forgot to mention is that the structure of the patient data is kept in json format - Currently, there is no plan for using database because the structure of the patient stored locally might need to change (so we want to store the json as document in Lucene). - And we need to achieve the part with not having someone who, for instance steals the device, able to access the data unless they have the encryption key and mechanism and not having someone who's not supposed to access the data do that. This is why we're trying to find a way to encrypt somehow the json documents and still use Lucene or try not to have the index stored as plaintext, if it would be possible. Thank you again for all your help and in case this mail has given more useful details and there are other suggestions or comments, I would be very happy to read them. Have a nice day, Rafaela On 25 June 2013 20:59, SUJIT PAL wrote: Hi Rafaela, I built something along these lines as a proof of concept. All data in the index was unstored and only fields which were searchable (tokenized and indexed) were kept in the index. The full record was encrypted and stored in a MongoDB database. A custom Solr component did the search against the index, gathered up unique ids of the results, then pulled out the encrypted data from MongoDB, decrypted it on the fly and rendered the results. You can find the (Scala) code here: https://github.com/sujitpal/solr4-extras (under the src/main/scala/com/mycompany/solr4extras/secure folder). More information (more or less the same as what I wrote but probably a bit more readable with inlined code): http://sujitpal.blogspot.com/2012/12/searching-encrypted-document-collection.html There are some obvious data sync concerns with this sort of setup, but as Adrian points out, you can't index encrypted data. 
HTH Sujit On Jun 25, 2013, at 4:17 AM, Adrien Grand wrote: > On Tue, Jun 25, 2013 at 1:03 PM, Rafaela Voiculescu > wrote: >> Hello, > > Hi, > >> I am sorry I was not a bit more explicit. I am trying to find an acceptable >> way to encrypt the data to prevent any access of it in any way unless the >> person who is trying to access it knows how to decrypt it. As I mentioned, >> I looked a bit through the patch, but I am not sure of its status. > > You can encrypt stored fields, but there is no way to do it correctly > with fields that have positions indexed: attackers could infer the > actual terms based on the order of terms (the encrypted version must > sort the same way as the original terms), frequencies and positions. > > -- > Adrien > > - > To
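As a concrete illustration of the "encrypted blob" half of Sujit's design (unstored searchable fields in the Lucene index, the full record encrypted in a separate store), here is a minimal AES-GCM round-trip using only the JDK (javax.crypto, Java 8+). Key management, IV storage, and the Lucene indexing side are deliberately out of scope, and the JSON payload is a placeholder:

```java
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Sketch: encrypt/decrypt a patient-record JSON blob with AES-GCM.
// Index only unstored searchable fields in Lucene; store this ciphertext
// (plus its IV) wherever the full record lives, and decrypt on retrieval.
public class BlobCrypto {
    static byte[] encrypt(SecretKey key, byte[] iv, String json) throws Exception {
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        return c.doFinal(json.getBytes(StandardCharsets.UTF_8));
    }

    static String decrypt(SecretKey key, byte[] iv, byte[] blob) throws Exception {
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, iv));
        return new String(c.doFinal(blob), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey key = kg.generateKey();
        byte[] iv = new byte[12];               // 96-bit IV, unique per record
        new SecureRandom().nextBytes(iv);
        byte[] blob = encrypt(key, iv, "{\"patient\":\"...\"}");
        System.out.println(decrypt(key, iv, blob));
    }
}
```

Note this does not address the thread's core caveat: anything Lucene can search on is, by necessity, not encrypted in the index, so the confidentiality boundary is only the stored blob.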
Re: New Lucene User
Try starting with Solr. You can have your search server up and running without writing any code. And Solr's Data Import Handler can load data directly from the database. -- Jack Krupansky -Original Message- From: raghavendra.k@barclays.com Sent: Monday, June 17, 2013 5:03 PM To: java-user@lucene.apache.org Subject: New Lucene User Hi, I have a requirement to perform a full-text search in a new application and I came across Lucene and I want to check if it helps our cause.
Requirement: I have a SQL Server database table with around 70 million records in it. It is not a live table and the data gets appended to it on a daily basis. The table has about 30 columns. The user will provide one string, and this value has to be searched against 20 columns for each record. All matching records need to be displayed in the UI.
My Analysis: Based on what I have read until now about Lucene, I believe I need to convert my database table data into a flat file, generate indexes and then perform the search.
Questions:
- To begin with, is Lucene a good option for this kind of requirement? (Note: let us ignore daily index generation and UI display for this discussion.)
- Should the entire data of 70 million records exist in one flat file?
- How do I define which fields (20 columns) should be searched among the complete list (30 columns)?
As I am just starting off, I may not even know about other dependencies. I kindly request you to provide clarifications / a reference to an example that would suit my case. Please let me know if you have any questions. Thanks, Raghu ___ This message is for information purposes only, it is not a recommendation, advice, offer or solicitation to buy or sell a product or service nor an official confirmation of any transaction. It is directed at persons who are professionals and is not intended for retail customer use. Intended for recipient only. This message is subject to the terms at: www.barclays.com/emaildisclaimer. 
For important disclosures, please see: www.barclays.com/salesandtradingdisclaimer regarding market commentary from Barclays Sales and/or Trading, who are active market participants; and in respect of Barclays Research, including disclosures relating to specific issuers, please see http://publicresearch.barclays.com. ___ - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
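On the "which 20 of 30 columns are searchable" question: the distinction in Lucene is per-field. Columns indexed as analyzed TextFields are searchable; stored-only fields just come back for display. A minimal sketch of indexing rows straight over JDBC, no flat file needed (table name, column names, JDBC URL, and the SEARCHABLE list are all hypothetical; Lucene 4.x API assumed):

```java
import java.io.File;
import java.sql.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class DbIndexer {
    // Hypothetical: the 20 searchable columns out of the table's 30.
    static final String[] SEARCHABLE = { "name", "city", "notes" /* ... */ };

    public static void main(String[] args) throws Exception {
        IndexWriterConfig cfg = new IndexWriterConfig(
                Version.LUCENE_43, new StandardAnalyzer(Version.LUCENE_43));
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(new File("index")), cfg);
             Connection conn = DriverManager.getConnection("jdbc:sqlserver://host;database=db");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT * FROM big_table")) {
            while (rs.next()) {
                Document doc = new Document();
                // Key column: indexed verbatim and stored, useful for later updates.
                doc.add(new StringField("id", rs.getString("id"), Field.Store.YES));
                // Searchable columns: analyzed, and stored for display in the UI.
                for (String col : SEARCHABLE) {
                    String val = rs.getString(col);
                    if (val != null) doc.add(new TextField(col, val, Field.Store.YES));
                }
                // Display-only columns would be added here as stored-only fields.
                writer.addDocument(doc);
            }
        }
    }
}
```

Searching one user string against all 20 fields would then be a job for MultiFieldQueryParser over the same field names.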
Re: compare paragraphs of text - which Query Class to use?
First, start with Solr: use the edismax query parser with the default query operator set to "OR", set pf, pf2, and pf3, and then simply query with the raw text of the paragraph. This will order the results by how closely the indexed paragraphs match the query paragraph. It is also a good technique for detecting plagiarism, where a lot of the text is similar if not identical. Once you have some experience with this technique in Solr, look at the parsed query that edismax generates and do the same in your Lucene Java code.

-- Jack Krupansky

-----Original Message-----
From: Malgorzata Urbanska
Sent: Friday, June 14, 2013 12:23 PM
To: java-user@lucene.apache.org
Subject: compare paragraphs of text - which Query Class to use?

Hello,

I've just started using Lucene and I'm not sure which Query classes I should use in my project. My goal is to compare paragraphs of text: paragraph A is a query, and paragraph B is a document for which I would like to calculate a similarity score. Paragraphs A and B can in some situations be exactly the same, or not; generally I would like to check whether they talk about the same topic. In my project I have a set of paragraphs A and a set of paragraphs B, so I'm looking for a universal solution which allows me to check the similarity score of each paragraph A against all paragraphs B.

Do you have any suggestions? I really appreciate all of the ideas.

--
gosia
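The Lucene-side analogue of what edismax produces with q.op=OR plus pf is, roughly, a disjunction of the paragraph's terms plus a boosted sloppy phrase over the whole paragraph. A hypothetical sketch (field name, slop, and boost values are made up, not anything edismax mandates; Lucene 4.x API):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class ParagraphQueryBuilder {
    // OR of the paragraph's terms, plus a boosted sloppy phrase so documents
    // whose text is close to the whole paragraph rank first.
    public static Query build(String field, String[] terms) {
        BooleanQuery bq = new BooleanQuery();
        for (String t : terms) {
            bq.add(new TermQuery(new Term(field, t)), BooleanClause.Occur.SHOULD);
        }
        PhraseQuery phrase = new PhraseQuery();
        phrase.setSlop(3);          // tolerate small reorderings and gaps
        for (String t : terms) phrase.add(new Term(field, t));
        phrase.setBoost(5.0f);      // hypothetical pf-style boost
        bq.add(phrase, BooleanClause.Occur.SHOULD);
        return bq;
    }
}
```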
Re: Lucene Indexes explanation
Sorry, but you really are going to need to work on your "Lucene basics" before you tackle such an ambitious effort. The Lucene JavaDoc, the Solr wikis, Stack Underflow, the blogs of McCandless et al., and Google search in general will cover a lot of the ground, including those mysterious terms:

- Indexed terms
- Stored values
- Payloads
- DocValues

-- Jack Krupansky

-----Original Message-----
From: nikhil desai
Sent: Monday, June 10, 2013 8:36 PM
To: java-user@lucene.apache.org
Subject: Re: Lucene Indexes explanation

I don't think I could get much from what you said, could you please elaborate? Appreciate.

On Mon, Jun 10, 2013 at 5:20 PM, Jack Krupansky wrote:

> Your stored value could be very different from your indexed (searchable)
> value. You can also associate payloads with an indexed term. And there
> are DocValues as well.
Re: Lucene Indexes explanation
Your stored value could be very different from your indexed (searchable) value. You can also associate payloads with an indexed term. And there are DocValues as well.

-- Jack Krupansky

-----Original Message-----
From: nikhil desai
Sent: Monday, June 10, 2013 8:06 PM
To: java-user@lucene.apache.org
Subject: Re: Lucene Indexes explanation

Sure. Thanks, Jack.

I don't have much experience working with Lucene; however, here is what I am trying to resolve. I learned that custom attributes cannot be used for indexing or searching purposes, but I wanted my attributes to be usable for indexing and searching. So I created custom attributes and inserted them as extra tokens into the token stream, assigning them a positionIncrement attribute of 0. Now, since my new token stream contains the attributes (as tokens) and they are used while indexing, I can search the document based on those attributes (the tokens I newly inserted). However, I still have an issue, and by the way I have a lot of attributes to assign to each individual token.

Example sentence: "LinkedIn is famous". After passing through the custom analyzer and the few filters I have written, and appending attributes to the tokens, the new token stream we get is:

    LinkedIn Noun SocialSite famous JJ Positive

That is, "LinkedIn" is a noun and also a social site, "famous" is an adjective and also a positive word, and "is" is removed since it does not make sense to index it. This is now definitely searchable based on the attributes (here: Noun, SocialSite, JJ, Positive). However, since I have put the entire text "LinkedIn is famous" into a field while adding the document, when I search for, say, "SocialSite", I get back a document that has "LinkedIn is famous" as one of its fields. Is it possible to get only "LinkedIn" as output rather than the entire text, i.e. only the actual token (the token present in the original input)? Another example: if I search for "Positive", I should get "famous" as output, not the entire "LinkedIn is famous".

I know that if I put it as a field in the document I should be able to get it, but how do I add such a field? Only once the tokens have passed through the filters do we know which attributes will be attached to them, so at the time we call indexWriter.addDocument() we have no idea about the attributes. The typical problem I see is that the indexing is done based on the new token stream, which is good, but when the document is retrieved, it carries the older, actual token stream (the actual input), and that is what is given as output.

Does that make any sense? Or do I have a use case that does not go well with Lucene? Any help or comments are appreciated.
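The position-increment-zero injection described in this thread can be sketched as a TokenFilter. (The class name, the tag lookup, and the one-tag-per-token simplification are all mine, not from the thread; Lucene 4.x API.)

```java
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// Injects each token's tag as an extra token at the same position
// (positionIncrement = 0), so tags become searchable terms.
public final class TagInjectingFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posAtt =
            addAttribute(PositionIncrementAttribute.class);
    private String pendingTag;   // at most one pending tag per token, for brevity

    public TagInjectingFilter(TokenStream input) { super(input); }

    @Override
    public boolean incrementToken() throws IOException {
        if (pendingTag != null) {
            // Emit the tag at the same position as the token it annotates.
            termAtt.setEmpty().append(pendingTag);
            posAtt.setPositionIncrement(0);
            pendingTag = null;
            return true;
        }
        if (!input.incrementToken()) return false;
        pendingTag = lookupTag(termAtt.toString());
        return true;
    }

    // Hypothetical tagger; a real one would consult a POS tagger or dictionary.
    private String lookupTag(String term) {
        return "linkedin".equals(term) ? "SocialSite" : null;
    }
}
```

Getting back only the annotated token (rather than the whole stored field) would still require storing the token/tag pairs separately, e.g. in extra stored fields or payloads, as the thread discusses.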
Re: Lucene Indexes explanation
Even though you've posted for Lucene, you might want to consider taking a look at Solr, because Solr has an Admin UI with an Analysis page which gives you a nice display of how index and query text is analyzed into tokens, terms, and attributes, all of which Solr inherits from Lucene. Also check out the unit tests for Lucene (and Solr) for indexing; then you can actually step through code and see it happen. Otherwise, google for blogs on the various sub-topics of interest with specific terms.

OTOH... don't try diving too deeply until you've written and understood a fair amount of Java code using Lucene. Otherwise, you won't have enough context to understand, or even to ask intelligent questions.

-- Jack Krupansky

-----Original Message-----
From: nikhil desai
Sent: Monday, June 10, 2013 1:24 PM
To: java-user@lucene.apache.org
Subject: Lucene Indexes explanation

Hello,

This is my first post in this group. I have been using Lucene recently and I have a question: where can I find a good explanation of indexes? Or rather, how does indexing (not so much the mathematical aspect) happen in Lucene, which attributes (charTerm, offset, etc.) come into play, and how is it implemented? I checked "Lucene in Action" and could not find much on actual indexing or on which classes are used.

Appreciate your help.

Thanks
NIKHIL
Re: Getting position increments directly from the index
If you add a special "end of document" term, some of these calculations might be easier; give that special term a payload of the sentence count. While you're at it, insert "end of sentence" terms that could have a payload of the sentence number.

-- Jack Krupansky

-----Original Message-----
From: Michael McCandless
Sent: Thursday, May 23, 2013 10:39 AM
To: Lucene Users
Subject: Re: Getting position increments directly from the index

On Thu, May 23, 2013 at 9:54 AM, Igor Shalyminov wrote:

> But, just to clarify, is there a way to get, let's say, a vector of
> position increments directly from the index, without re-parsing document
> contents?

Term vectors (as Jack suggested) are one option, but they are very heavy: they slow down indexing, take lots of disk space, and are slow (a seek per document) to load at search time. You can enumerate all positions for each term/doc pair in the postings, but you'd then need to collate by document to get the max position (last term) for that document. I guess an int[maxDoc] would do the trick; then walk that array, dividing each maxPosition by 1000. Or index the sentence token :)

Mike McCandless
http://blog.mikemccandless.com
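If the "end of sentence" marker is indexed as an ordinary term, the corpus-wide sentence count falls straight out of the term statistics, as Mike suggests elsewhere in the thread. A sketch (the marker spelling "$EOS$", the field name, and the index path are hypothetical; Lucene 4.x API):

```java
import java.io.File;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;

public class SentenceStats {
    public static void main(String[] args) throws Exception {
        try (IndexReader reader = DirectoryReader.open(FSDirectory.open(new File("index")))) {
            // Total sentences across the whole index = total occurrences
            // of the boundary marker term.
            long sentences = reader.totalTermFreq(new Term("body", "$EOS$"));
            System.out.println("sentences: " + sentences);
        }
    }
}
```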
Re: Getting position increments directly from the index
Take a look at the Term Vectors Component: http://wiki.apache.org/solr/TermVectorComponent

-- Jack Krupansky

-----Original Message-----
From: Igor Shalyminov
Sent: Thursday, May 23, 2013 9:54 AM
To: java-user@lucene.apache.org
Subject: Re: Getting position increments directly from the index

Thanks, Mike and Jack! Those are really good options. But, just to clarify, is there a way to get, let's say, a vector of position increments directly from the index, without re-parsing document contents?

--
Best Regards,
Igor
Re: Getting position increments directly from the index
It might be nice to inquire as to the largest position for a field in a document. Is that information kept anywhere? Not that I know of, although I suppose it can be calculated at runtime by running through all the terms of the field. Then he could just divide by 1000.

-- Jack Krupansky

-----Original Message-----
From: Michael McCandless
Sent: Thursday, May 23, 2013 6:28 AM
To: Lucene Users
Subject: Re: Getting position increments directly from the index

Do you actually index the sentence boundary as a token? If so, you could just get the totalTermFreq of that token?

Mike McCandless
http://blog.mikemccandless.com

On Wed, May 22, 2013 at 10:11 AM, Igor Shalyminov wrote:

> Hello!
>
> I'm storing sentence bounds in the index as position increments of 1000.
> I want to get the total number of sentences in the index, i.e. the number
> of "1000" increment values. Can I do that some other way rather than just
> loading each document and extracting position increments with a custom
> Analyzer?
>
> --
> Best Regards,
> Igor Shalyminov
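Mike's int[maxDoc] idea reduces to simple per-document arithmetic: with boundaries bumped to multiples of 1000, a document's last term position divided by 1000 approximates its sentence count (assuming no sentence exceeds 1000 tokens; the sample position below is made up):

```java
public class SentenceCount {
    // Approximate sentences in one document from its max term position,
    // given sentence boundaries recorded as position bumps of 1000.
    public static int sentenceCount(int maxPosition) {
        return maxPosition / 1000;
    }

    public static void main(String[] args) {
        System.out.println(sentenceCount(12345)); // prints 12
    }
}
```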
Re: Query with phrases, wildcards and fuzziness
Using BooleanQuery with Should is the way to go. There are some nuances, but you may not run into them; sometimes the query parser syntax is the issue rather than the Lucene BooleanQuery itself. For example, with a string of AND and OR operators, they all get parsed into a single BooleanQuery, which is clearly not traditional "Boolean", but if you code multiple BooleanQuerys that nest (or fully parenthesize your source query), you will get a true "Boolean" query. It's up to your particular application whether you need "true" Boolean or not.

-- Jack Krupansky

-----Original Message-----
From: Ross Simpson
Sent: Wednesday, May 22, 2013 7:44 AM
To: java-user@lucene.apache.org
Subject: Re: Query with phrases, wildcards and fuzziness

One further question: if I wanted to construct my query using Query implementations instead of a QueryParser (e.g. TermQuery, WildcardQuery, etc.), what's the right way to duplicate the "OR" functionality I wrote about below? As I mentioned, I've read that wrapping query objects in a BooleanQuery and using Occur.SHOULD is not necessarily the same. Any suggestions?

Ross

On 22/05/2013 11:46 AM, Ross Simpson wrote:

> Jack, thanks very much! I wasn't considering a space a special character
> for some reason. That has worked perfectly.
>
> Cheers,
> Ross

On May 22, 2013, at 10:24 AM, Jack Krupansky wrote:

> Just escape embedded spaces with a backslash.
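The nesting Jack describes can be sketched directly against the Query API (field and term names hypothetical; Lucene 4.x). Nesting one BooleanQuery inside another keeps (a AND b) OR c from being flattened into a single clause list:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class TrueBoolean {
    // (a AND b) OR c, built as nested BooleanQuerys.
    public static Query aAndBOrC() {
        BooleanQuery inner = new BooleanQuery();
        inner.add(new TermQuery(new Term("f", "a")), BooleanClause.Occur.MUST);
        inner.add(new TermQuery(new Term("f", "b")), BooleanClause.Occur.MUST);

        BooleanQuery outer = new BooleanQuery();
        outer.add(inner, BooleanClause.Occur.SHOULD);
        outer.add(new TermQuery(new Term("f", "c")), BooleanClause.Occur.SHOULD);
        return outer;
    }
}
```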
Re: Case insensitive StringField?
Yes, it is; it always will. But you can escape the spaces with a backslash:

    Query q = qp.parse("new\\ york");

-- Jack Krupansky

-----Original Message-----
From: Shahak Nagiel
Sent: Tuesday, May 21, 2013 10:09 PM
To: java-user@lucene.apache.org
Subject: Re: Case insensitive StringField?

Jack / Michael: Thanks, but the query parser still seems to be tokenizing the query?

    public class StringPhraseAnalyzer extends Analyzer {
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            Tokenizer tok = new KeywordTokenizer(reader);
            TokenFilter filter = new LowerCaseFilter(Version.LUCENE_41, tok);
            filter = new TrimFilter(filter, true);
            return new TokenStreamComponents(tok, filter);
        }
    }

    ...
    Analyzer analyzer = new StringPhraseAnalyzer();
    // using this analyzer, add document to index with city TextField (value "NEW YORK")
    QueryParser qp = new QueryParser(Version.LUCENE_41, "city", analyzer);
    Query q = qp.parse("new york");
    System.out.println("Query: " + q);

results in...

    Query: city:new city:york    // I expected "city:new york"

...and no matches. Is a QueryParser the wrong way to generate the query for this type of analyzer?

Thanks again!
Re: Query with phrases, wildcards and fuzziness
Just escape embedded spaces with a backslash.

-- Jack Krupansky

-----Original Message-----
From: Ross Simpson
Sent: Tuesday, May 21, 2013 8:08 PM
To: java-user@lucene.apache.org
Subject: Query with phrases, wildcards and fuzziness

Hi all,

I'm trying to create a fairly complex query, and having trouble constructing it. My index contains a TextField with place names as strings, e.g.:

    Port Melbourne, VIC 3207

I'm using an analyzer with just KeywordTokenizer and LowerCaseFilter, so that my strings are not tokenized at all. I want to support end-user searches like the following, and have them match the string above:

    Port Melbourne, VIC 3207   (exact)
    Port                       (prefix)
    Port Mel                   (prefix, including a space)
    Melbo                      (wildcard)
    Melburne                   (fuzzy)

I'm trying to get away with not parsing the query myself, and just constructing something like this:

    parser.parse("(STRING^9) OR (STRING*^7) OR (*STRING*^5) OR (STRING~1^3)");

That doesn't seem to work, with either QueryParser or ComplexPhraseQueryParser. Specifically, I'm having trouble getting appropriate results when there's a space in the input string, notably with the wildcard match part (it ends up returning everything in the index). Is my approach above possible?

I have also had a look at using specific Query implementations and combining them in a BooleanQuery, but I'm not quite sure how to replicate the "OR" behavior I want (from what I've read, Occur.SHOULD is not equivalent to "OR").

Any suggestions would be appreciated. Thanks!

Ross
Re: Case insensitive StringField?
To be clear, analysis is not supported on StringField (or any non-tokenized field). But the good news is that by using the keyword tokenizer (KeywordTokenizer) on a TextField, you can get the same effect: it preserves the entire input as a single token. You may want to include filters to trim exterior white space and normalize interior white space.

-- Jack Krupansky

-----Original Message-----
From: Shahak Nagiel
Sent: Tuesday, May 21, 2013 10:06 AM
To: java-user@lucene.apache.org
Subject: Case insensitive StringField?

It appears that StringField instances are treated as literals, even though my analyzer lower-cases (on both the write and read sides). So, for example, I can match with a term query (e.g. "NEW YORK"), but only if the case matches. If I use a QueryParser (or MultiFieldQueryParser), it never works, because the query values are lowercased and don't match.

I've found that using a TextField instead works, presumably because it's tokenized and processed correctly by the write analyzer. However, I would prefer that queries match against the entire/exact phrase ("NEW YORK") rather than among the tokens ("NEW" or "YORK"). What's the solution here?

Thanks in advance.
Re: classic.QueryParser - bug or new behavior?
Yeah, just go ahead and escape the slash, either with a backslash or by enclosing the whole term in quotes. Otherwise the slash (even embedded in the middle of a term!) indicates the start of a regex query term.

-- Jack Krupansky

-----Original Message-----
From: Scott Smith
Sent: Sunday, May 19, 2013 2:50 PM
To: java-user@lucene.apache.org
Subject: classic.QueryParser - bug or new behavior?

I just upgraded from Lucene 4.1 to 4.2.1, and we believe we are seeing some different behavior. I'm using org.apache.lucene.queryparser.classic.QueryParser. If I pass the string "20110920/EXPIRED" (w/o quotes) to the parser, I get:

    org.apache.lucene.queryparser.classic.ParseException: Cannot parse
    '20110920/EXPIRED': Lexical error at line 1, column 17.  Encountered:
    <EOF> after : "/EXPIRED"
        at org.apache.lucene.queryparser.classic.QueryParserBase.parse(QueryParserBase.java:131)

We believe this used to work. I tried googling for this and found something that said I should use QueryParser.escape() on the string before passing it to the parser. However, that seems to break phrase queries (e.g., "John Smith", with the quotes; I assume it escapes the double quotes and doesn't realize it's a phrase). Since it is a forward slash, I'm confused about why any of the characters in "/EXPIRED" would need escaping. Has anyone seen this?

Scott
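One middle ground between escaping nothing and QueryParser.escape() (which also escapes the quotes that delimit phrases) is to escape only the character that actually bites you. A tiny, hypothetical helper for just the slash:

```java
public class SlashEscape {
    // Backslash-escape only '/', leaving quotes (and thus phrase queries)
    // alone, unlike QueryParser.escape which escapes every special character.
    public static String escapeSlashes(String q) {
        return q.replace("/", "\\/");
    }

    public static void main(String[] args) {
        System.out.println(escapeSlashes("20110920/EXPIRED")); // prints 20110920\/EXPIRED
    }
}
```

This is a sketch, not a general solution: any other unescaped special characters in user input would still need handling.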
Re: lucene and mongodb
That was tried with Lucandra/Solandra, which stored the Lucene index in Cassandra, but it was less than optimal, so that model was discarded in favor of indexing Cassandra data directly into Solr/Lucene, side-by-side in each Cassandra node, but in native Lucene. The latter approach is now available from DataStax as DataStax Enterprise (DSE), which also integrates Hadoop. DSE provides the best of Cassandra integrated tightly with the best of Solr.

In DSE, Cassandra takes care of all the cluster management, with Solr indexing the local Cassandra data, handing off incoming updates to Cassandra for normal Cassandra storage, and using a variation of the normal Solr code for distributing queries and merging results from other nodes, but depending on Cassandra for information about cluster configuration. (Note: I had proposed to talk in more detail about the above at Lucene Revolution, but my proposal was not accepted.)

See: http://www.datastax.com/what-we-offer/products-services/datastax-enterprise

As it says, "DataStax Enterprise is completely free for development work."

-- Jack Krupansky

-----Original Message-----
From: Rider Carrion Cleger
Sent: Tuesday, May 14, 2013 4:35 AM
To: java-user-i...@lucene.apache.org; java-user-...@lucene.apache.org; java-user@lucene.apache.org
Subject: lucene and mongodb

Hi team,

I'm working with Apache Lucene 4.2.1 and I would like to store the Lucene index in a NoSQL database. So my question is: can I store the Lucene index in a MongoDB database?

Thank you, team!
Re: [PhraseQuery] Can "jakarta apache"~10 be searched by offset ?
You'll have to be more explicit about the actual data and what didn't work. Try developing a simple, self-contained unit test, with some simple strings as input, that demonstrates the case you say doesn't work. I mean, regular expressions and field analysis can both be quite tricky; even a tiny typo can break everything.

To be clear, my suggested approach using regular expressions works only on un-tokenized input, so there won't be any positions or even offsets. Other than that, you're on your own until you develop that small, self-contained unit test. Or, you can file a Jira for a new Lucene Query for phrase and/or span queries that measures distance by offsets rather than positions.

-- Jack Krupansky

-----Original Message-----
From: wgggfiy
Sent: Monday, May 13, 2013 3:47 AM
To: java-user@lucene.apache.org
Subject: Re: [PhraseQuery] Can "jakarta apache"~10 be searched by offset ?

Jack, according to you, how can I implement this requirement? Could you give me a clue? Thank you very much. The regex query seemed not to work. I set up the field like this:

    FieldType fieldType = new FieldType();
    FieldInfo.IndexOptions indexOptions =
        FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS;
    fieldType.setIndexOptions(indexOptions);
    fieldType.setIndexed(true);
    fieldType.setTokenized(true);
    fieldType.setStored(true);
    fieldType.freeze();
    return new Field(name, value, fieldType);

--
Email: wuqiu.m...@qq.com

--
View this message in context: http://lucene.472066.n3.nabble.com/PhraseQuery-Can-jakarta-apache-10-be-searched-by-offset-tp4061243p4062852.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
Re: [PhraseQuery] Can "jakarta apache"~10 be searched by offset ?
Do you mean the raw character offsets of the starting and ending characters of the terms? No. Although, if you index the text as a raw string, you might be able to come up with a regex query like "jakarta.{1,10}apache" -- Jack Krupansky -Original Message- From: wgggfiy Sent: Monday, May 06, 2013 11:39 PM To: java-user@lucene.apache.org Subject: [PhraseQuery] Can "jakarta apache"~10 be searched by offset ? As I know, the syntax *"jakarta apache"~10*, which is a PhraseQuery with a slop=10 in position, but What I want is *based on offset* not on position? Anyone can help me ? thx. - -- Email: wuqiu.m...@qq.com -- -- View this message in context: http://lucene.472066.n3.nabble.com/PhraseQuery-Can-jakarta-apache-10-be-searched-by-offset-tp4061243.html Sent from the Lucene - Java Users mailing list archive at Nabble.com. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
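In Lucene, the suggested pattern would be used as a RegexpQuery against an un-tokenized (raw string) field, and Lucene's RegexpQuery has its own automaton-based syntax (which also supports `{m,n}` intervals). The stdlib sketch below only illustrates the offset-based semantics of the suggested pattern - "jakarta" and "apache" separated by 1 to 10 characters of anything - it is not the Lucene query class itself.

```java
import java.util.regex.Pattern;

public class OffsetRegexDemo {
    public static void main(String[] args) {
        // Offset-based distance: 1 to 10 characters between the two words,
        // unlike a PhraseQuery slop, which counts positions (whole terms).
        Pattern p = Pattern.compile("jakarta.{1,10}apache");

        System.out.println(p.matcher("jakarta apache").find());              // 1 char between -> true
        System.out.println(p.matcher("jakarta and the apache").find());      // 9 chars between -> true
        System.out.println(p.matcher("jakarta foundation's apache").find()); // 14 chars between -> false
    }
}
```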
Re: Multiple PositionIncrement attributes
You can use SpanNearQuery to seek matches within a specified distance. Lucene knows nothing about "sentences". But if you have an analyzer or custom code that artificially bumps the position to the next multiple of some number like 100 or 1000 when a sentence boundary pattern is encountered, you could use that number times n to match within n sentences, roughly, plus or minus a sentence or two - there is nothing to cause the nearness to be rounded or truncated exactly to one of those boundaries. Maybe you want two numbers: 1) sentence separation, say 1000, and 2) maximum sentence length, say 500. The SpanNearQuery would use n-1 times the sentence separation plus the maximum sentence length. Well, you have to adjust that for how you count sentences - is 1 the current sentence or is that 0? -- Jack Krupansky -Original Message- From: Igor Shalyminov Sent: Thursday, April 25, 2013 6:54 AM To: java-user@lucene.apache.org Subject: Multiple PositionIncrement attributes Hi all! I use PositionIncrement attribute for finding words at some distance from each other. And I have two problems with that: 1) I want to search words within one sentence. A possible solution would be to set PositionIncrement of +INF (like 30 :) ) to the sentence break tag. 2) I want to use in my search both word-distance and sentence-distance between words (e.g. find the word "Putin" within 3 sentences after the word "Obama" or find the words "cheese" and "bacon" in one sentence within 3 words of each other). For the 2nd problem, is there a way of storing multiple position information sources in the index and using them for searching? Say, at least choosing one of those for a query. -- Best Regards, Igor Shalyminov - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
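The scheme above can be sketched with SpanNearQuery. This is a sketch under the stated assumptions (sentence separation 1000, maximum sentence length 500, a hypothetical field name "body", and an analyzer that bumps positions at sentence breaks); the counting convention follows the reply's "n-1 times the sentence separation plus the maximum sentence length".

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class SentenceDistanceQuery {
    // Assumed constants: the indexing analyzer bumps the position to the next
    // multiple of SENTENCE_SEPARATION at each sentence boundary.
    static final int SENTENCE_SEPARATION = 1000;
    static final int MAX_SENTENCE_LENGTH = 500;

    // Match termA and termB within roughly n sentences of each other,
    // counting the current sentence as 1 (hence n - 1 below).
    static SpanQuery withinSentences(String field, String termA, String termB, int n) {
        int slop = (n - 1) * SENTENCE_SEPARATION + MAX_SENTENCE_LENGTH;
        return new SpanNearQuery(
            new SpanQuery[] {
                new SpanTermQuery(new Term(field, termA)),
                new SpanTermQuery(new Term(field, termB))
            },
            slop,
            false); // unordered: either term may come first
    }
}
```

As the reply notes, this is approximate - nearness is not rounded exactly to sentence boundaries, so matches can be off by a sentence or so.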
Re: WhitespaceTokenizer, incrementToke() ArrayOutOfBoundException
Yes, reset was always "mandatory" from an API contract sense, but not always enforced in a practical sense in 3.x (no uniformly extreme negative consequences), as the original emailer indicated. Now, it is "mandatory" in a practical sense as well (extremely annoying consequences in all cases of a contract violation). So, I should have said that the contract was mandatory but not enforced... which from a practical perspective negates its mandatory contractual value. -- Jack Krupansky -Original Message- From: Uwe Schindler Sent: Monday, April 15, 2013 11:53 AM To: java-user@lucene.apache.org Subject: RE: WhitespaceTokenizer, incrementToke() ArrayOutOfBoundException Hi, It was always mandatory! In Lucene 2.x/3.x some Tokenizers just returned bogus, undefined stuff if not correctly reset before usage, especially when Tokenizers are "reused" by the Analyzer, which is now mandatory in 4.x. So we made it throw some Exception (NPE or AIOOBE) in Lucene 4 by initializing the state fields in Lucene 4.0 with some default values that cause the Exception. The Exception is not more specified because of performance reasons (it's just caused by the new default values set in ctor previously). - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -----Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Monday, April 15, 2013 4:25 PM To: java-user@lucene.apache.org Subject: Re: WhitespaceTokenizer, incrementToke() ArrayOutOfBoundException I didn't read your code, but do you have the "reset" that is now mandatory and throws AIOOBE if not present? 
-- Jack Krupansky -Original Message- From: andi rexha Sent: Monday, April 15, 2013 10:21 AM To: java-user@lucene.apache.org Subject: WhitespaceTokenizer, incrementToke() ArrayOutOfBoundException Hi, I have tryed to get all the tokens from a TokenStream in the same way as I was doing in the 3.x version of Lucene, but now (at least with WhitespaceTokenizer) I get an exception: Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -1 at java.lang.Character.codePointAtImpl(Character.java:2405) at java.lang.Character.codePointAt(Character.java:2369) at org.apache.lucene.analysis.util.CharacterUtils$Java5CharacterUtils.codePoint At(CharacterUtils.java:164) at org.apache.lucene.analysis.util.CharTokenizer.incrementToken(CharTokeniz er.java:166) The code is quite simple, and I thought that it could have worked, but obviously it doesn't (unless I have made some mistakes). Here is the code, in case you spot some bugs on it (although it is trivial): String str = "this is a test"; Reader reader = new StringReader(str); TokenStream tokenStream = new WhitespaceTokenizer(Version.LUCENE_42, reader); //tokenStreamAnalyzer.tokenStream("test", reader); CharTermAttribute attribute = tokenStream.getAttribute(CharTermAttribute.class); while (tokenStream.incrementToken()) { System.out.println(new String(attribute.buffer(), 0, attribute.length())); } Hope you have any idea of why it is happening. Regards, Andi - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: WhitespaceTokenizer, incrementToke() ArrayOutOfBoundException
I didn't read your code, but do you have the "reset" that is now mandatory and throws AIOOBE if not present? -- Jack Krupansky -Original Message- From: andi rexha Sent: Monday, April 15, 2013 10:21 AM To: java-user@lucene.apache.org Subject: WhitespaceTokenizer, incrementToke() ArrayOutOfBoundException Hi, I have tryed to get all the tokens from a TokenStream in the same way as I was doing in the 3.x version of Lucene, but now (at least with WhitespaceTokenizer) I get an exception: Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -1 at java.lang.Character.codePointAtImpl(Character.java:2405) at java.lang.Character.codePointAt(Character.java:2369) at org.apache.lucene.analysis.util.CharacterUtils$Java5CharacterUtils.codePointAt(CharacterUtils.java:164) at org.apache.lucene.analysis.util.CharTokenizer.incrementToken(CharTokenizer.java:166) The code is quite simple, and I thought that it could have worked, but obviously it doesn't (unless I have made some mistakes). Here is the code, in case you spot some bugs on it (although it is trivial): String str = "this is a test"; Reader reader = new StringReader(str); TokenStream tokenStream = new WhitespaceTokenizer(Version.LUCENE_42, reader); //tokenStreamAnalyzer.tokenStream("test", reader); CharTermAttribute attribute = tokenStream.getAttribute(CharTermAttribute.class); while (tokenStream.incrementToken()) { System.out.println(new String(attribute.buffer(), 0, attribute.length())); } Hope you have any idea of why it is happening. Regards, Andi - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
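The quoted snippet fails precisely because the stream is never reset. A minimal corrected version of the poster's code, assuming Lucene 4.x, follows the full consume contract (reset, incrementToken loop, end, close):

```java
import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class TokenStreamConsumer {
    public static void main(String[] args) throws Exception {
        Reader reader = new StringReader("this is a test");
        TokenStream ts = new WhitespaceTokenizer(Version.LUCENE_42, reader);
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();                       // mandatory in 4.x before incrementToken()
        while (ts.incrementToken()) {
            System.out.println(term.toString());
        }
        ts.end();                         // record end-of-stream state (e.g. final offset)
        ts.close();                       // release resources
    }
}
```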
Re: How to index Sharepoint files with Lucene
The Apache ManifoldCF "connector framework" has a SharePoint connector that can crawl SharePoint repositories. It has an output connector that feeds into Solr/SolrCell, but you can easily develop a connector that outputs whatever you want - like put the crawled files into a file system directory, or maybe even send each file directly into Tika and then directly index the content into Lucene, if that's what you want. In any case, MCF handles the SharePoint access and crawling. See: http://manifoldcf.apache.org/en_US/index.html -- Jack Krupansky -Original Message- From: Álvaro Vargas Quezada Sent: Wednesday, April 10, 2013 5:31 PM To: java-user@lucene.apache.org Subject: How to index Sharepoint files with Lucene Hi everyone! I'm trying to combine Lucene with Sharepoint (we use Windows and SP 2010), but I couldn't find good tutorials or proven tests cases that demostrate this integration. Do you know any tutorial or can give me some help about this? I have read all the "Lucene in Action" but here just talk about indexing files, and not integration with other softwares. Thanks in advanceGreetz from Chile - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: how to consider special character like + ! @ # in Lucene search
You'll have to switch from using the standard analyzer/tokenizer to using the whitespace analyzer/tokenizer, and make sure not to use any additional token filters that might eliminate some or all special characters (or provide character maps for the ones that do accept character maps.) You will need to completely reindex your data as well. And, in some cases you may need to escape some special characters with backslash in your queries. -- Jack Krupansky -Original Message- From: neeraj shah Sent: Wednesday, April 10, 2013 2:07 AM To: java-user@lucene.apache.org Subject: how to consider special character like + ! @ # in Lucene search hello, im using Lucene2.9. i have to search special character like "/" in given text. but when im searching it gives me 0 hit. I have tried QueryParse.escape("/"). but did not get the result. how to proceed further. please help.. -- With Regards, Neeraj Kumar Shah - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
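As a rough illustration of the escaping step: the sketch below backslash-escapes the query-parser special characters one by one. It is a simplified version of what QueryParser.escape does (the real method handles the exact Lucene special-character set, including the two-character operators && and ||), and escaping only helps if your analyzer - e.g. the whitespace analyzer - actually keeps those characters in the indexed terms.

```java
public class QueryEscaper {
    // Simplified sketch: prefix each query-syntax special character with a
    // backslash so the query parser treats it as a literal.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if ("\\+-!():^[]\"{}~*?|&/".indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("&&searchtext")); // \&\&searchtext
        System.out.println(escape("+sampletext"));  // \+sampletext
    }
}
```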
Re: MLT Using a Query created in a different index
In a statistical sense, for the majority of documents, yes, but you could probably find quite a few outlier examples where the results from A to B or from B to A are significantly or even completely different, or even non-existent. -- Jack Krupansky -Original Message- From: Peter Lavin Sent: Friday, April 05, 2013 3:49 AM To: java-user@lucene.apache.org Subject: Re: MLT Using a Query created in a different index Thanks for that Jack, so it's fair to say that if both the sources and target corpus are large and diverse, then the impact of using a different index to create the query would be negligible. P. On 04/04/2013 06:49 PM, Jack Krupansky wrote: The heart of MLT is examining the top result of a query (or maybe more than one) and identifying the "top" terms from the top document(s) and then simply using those top terms for a subsequent query. The term ranking would of course depend on term frequency, and other relevancy considerations - for the corpus of the original query. A rich query corpus will give great results, a weak corpus will give weak results - no matter how rich or weak the final target corpus is. OTOH, if the target corpus really is representative of the source corpus, then results should be either good or terrible - the selected/query document may not have any representation in the target corpus. -- Jack Krupansky -Original Message- From: Peter Lavin Sent: Thursday, April 04, 2013 1:06 PM To: java-user@lucene.apache.org Subject: MLT Using a Query created in a different index Dear Users, I am doing some research where Lucene is integrated into agent technology. Part of this work involves using an MLT query in an index which was not created from a document in that index (i.e. the query is created, serialised and sent to the remote agent). Can anyone point me towards any information on what the potential impact of doing this would be?
I'm assuming if both indexes have similar sets of documents, the impact would be negligible, but what, for example would be the impact of creating an MLT query from an index with only one or two documents for use in an index with several (say 100+) documents, with thanks, Peter -- with best regards, Peter Lavin, PhD Candidate, CAG - Computer Architecture & Grid Research Group, Lloyd Institute, 005, Trinity College Dublin, Ireland. +353 1 8961536 - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: MLT Using a Query created in a different index
The heart of MLT is examining the top result of a query (or maybe more than one) and identifying the "top" terms from the top document(s) and then simply using those top terms for a subsequent query. The term ranking would of course depend on term frequency, and other relevancy considerations - for the corpus of the original query. A rich query corpus will give great results, a weak corpus will give weak results - no matter how rich or weak the final target corpus is. OTOH, if the target corpus really is representative of the source corpus, then results should be either good or terrible - the selected/query document may not have any representation in the target corpus. -- Jack Krupansky -Original Message- From: Peter Lavin Sent: Thursday, April 04, 2013 1:06 PM To: java-user@lucene.apache.org Subject: MLT Using a Query created in a different index Dear Users, I am doing some research where Lucene is integrated into agent technology. Part of this work involves using an MLT query in an index which was not created from a document in that index (i.e. the query is created, serialised and sent to the remote agent). Can anyone point me towards any information on what the potential impact of doing this would be? I'm assuming if both indexes have similar sets of documents, the impact would be negligible, but what, for example would be the impact of creating an MLT query from an index with only one or two documents for use in an index with several (say 100+) documents, with thanks, Peter -- with best regards, Peter Lavin, PhD Candidate, CAG - Computer Architecture & Grid Research Group, Lloyd Institute, 005, Trinity College Dublin, Ireland. +353 1 8961536 - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
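The cross-index setup described in the thread can be sketched as follows: build the MLT query against the source index (whose term statistics drive the "top terms" ranking), then execute it against the target index. The field name "contents" and the method shape are assumptions, not the poster's actual code.

```java
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

public class CrossIndexMlt {
    // Build the MLT query from the SOURCE index's statistics, then run it
    // against the TARGET index.
    static TopDocs crossIndexMlt(IndexReader source, IndexSearcher target,
                                 Analyzer analyzer, String text) throws Exception {
        MoreLikeThis mlt = new MoreLikeThis(source); // tf/idf term ranking uses this corpus
        mlt.setAnalyzer(analyzer);
        mlt.setFieldNames(new String[] { "contents" });
        Query q = mlt.like(new StringReader(text), "contents");
        return target.search(q, 10);                 // hits depend on the target corpus
    }
}
```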
Re: Indexing a long list
Multivalued fields are the other approach to keyword value pairs. And if you can denormalize your data, storing structure as separate documents can make sense and support more powerful queries. Although the join capabilities are rather limited. -- Jack Krupansky -Original Message- From: Paul Bell Sent: Sunday, March 31, 2013 8:52 AM To: java-user@lucene.apache.org Subject: Re: Indexing a long list Hi Jack, Thanks for the reply. I am very new to Lucene. Your timing is a bit uncanny. I was just coming to the conclusion that there's nothing special about this case for Lucene, i.e., a tokenized field should work, when I looked up and saw your e-mail. In re the larger context: yeah, the properties in question here belong to some kind of node, e.g., maybe a vertex in a graph DB. Possible properties include 'name', 'type', 'inEdges', 'outEdges', etc. Most properties are simple k=v pairs. But a few, notable the 'edge' properties, could be long lists. My intent was to create a Lucene Document for each node. The Fields in this Document would represent all of the node's properties. A generic (not in Lucene syntax) query should be able to ask after any property, e.g., ('name' equals "vol1" AND 'outEdges.name' startsWith "hasMirror") Note that 'outEdges.name' represents multiple elements, where 'name' represents only one. That is, the generic query syntax is trying to match any out-edge whose name property starts with "hasMirror". I haven't quite crystallized the generic query syntax and don't know how best to map it to both a Lucene query and to an appropriate Lucene index structure. Please let me know if you've any suggestions! Thanks again. -Paul On Sun, Mar 31, 2013 at 8:33 AM, Jack Krupansky wrote: The first question is how do you want to access the data? What do you want your queries to look like? What is the larger context? Are these properties of larger documents? Are there more than one per document? Etc. Why not just store the property as a tokenized field? 
Then you can query whether v(i) or v(j) are or are not present as keywords. -- Jack Krupansky -Original Message- From: Paul Bell Sent: Sunday, March 31, 2013 8:21 AM To: java-user@lucene.apache.org Subject: Indexing a long list Hi All, Suppose I need to index a property whose value is a long list of terms. For example, someProperty = ["v1", "v2", , "v100"] Please note that I could drop the leading "v" and index these as numbers instead of strings. But the question is what's the best practice in Lucene when dealing with a case like this? I need to be able to retrieve the list. This makes methink that I need to store it. And I suppose that the list could be stored in the index itself or in the "content" to which the index points. So there are really two parts to this question: 1. Lucene "best practices" for long list 2. Where to store such a list Thanks for your help. -Paul --**--**- To unsubscribe, e-mail: java-user-unsubscribe@lucene.**apache.org For additional commands, e-mail: java-user-help@lucene.apache.**org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Indexing a long list
The first question is how do you want to access the data? What do you want your queries to look like? What is the larger context? Are these properties of larger documents? Are there more than one per document? Etc. Why not just store the property as a tokenized field? Then you can query whether v(i) or v(j) are or are not present as keywords. -- Jack Krupansky -Original Message- From: Paul Bell Sent: Sunday, March 31, 2013 8:21 AM To: java-user@lucene.apache.org Subject: Indexing a long list Hi All, Suppose I need to index a property whose value is a long list of terms. For example, someProperty = ["v1", "v2", , "v100"] Please note that I could drop the leading "v" and index these as numbers instead of strings. But the question is what's the best practice in Lucene when dealing with a case like this? I need to be able to retrieve the list. This makes methink that I need to store it. And I suppose that the list could be stored in the index itself or in the "content" to which the index points. So there are really two parts to this question: 1. Lucene "best practices" for long list 2. Where to store such a list Thanks for your help. -Paul - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
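The "tokenized field" suggestion above can be sketched like this: index the whole list as one stored, tokenized field so each v(i) is searchable as a keyword and the stored value lets you retrieve the list. The field name is the poster's; the joining approach is one option - adding one TextField per value (a multivalued field, as the follow-up reply notes) works too.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;

public class ListProperty {
    // One tokenized, stored field holds the whole list.
    static Document build(String[] values) {
        StringBuilder joined = new StringBuilder();
        for (String v : values) {
            if (joined.length() > 0) joined.append(' ');
            joined.append(v);
        }
        Document doc = new Document();
        // Store.YES keeps the original list retrievable from the hit document.
        doc.add(new TextField("someProperty", joined.toString(), Field.Store.YES));
        return doc;
    }
}
```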
Re: Accent insensitive analyzer
Start with the Standard Tokenizer: https://lucene.apache.org/core/4_2_0/analyzers-common/org/apache/lucene/analysis/standard/StandardTokenizer.html -- Jack Krupansky -Original Message- From: Jerome Blouin Sent: Friday, March 22, 2013 12:53 PM To: java-user@lucene.apache.org Subject: RE: Accent insensitive analyzer I understand that I can't configure it on an analyzer so on which class can I apply it? Thank, Jerome -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Friday, March 22, 2013 12:38 PM To: java-user@lucene.apache.org Subject: Re: Accent insensitive analyzer Try the ASCII Folding FIlter: https://lucene.apache.org/core/4_2_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html -- Jack Krupansky -Original Message- From: Jerome Blouin Sent: Friday, March 22, 2013 12:22 PM To: java-user@lucene.apache.org Subject: Accent insensitive analyzer Hello, I'm looking for an analyzer that allows performing accent insensitive search in latin languages. I'm currently using the StandardAnalyzer but it doesn't fulfill this need. Could you please point me to the one I need to use? I've checked the javadoc for the various analyzer packages but can't find one. Do I need to implement my own analyzer? Regards, Jerome - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Accent insensitive analyzer
Try the ASCII Folding Filter: https://lucene.apache.org/core/4_2_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html -- Jack Krupansky -Original Message- From: Jerome Blouin Sent: Friday, March 22, 2013 12:22 PM To: java-user@lucene.apache.org Subject: Accent insensitive analyzer Hello, I'm looking for an analyzer that allows performing accent insensitive search in latin languages. I'm currently using the StandardAnalyzer but it doesn't fulfill this need. Could you please point me to the one I need to use? I've checked the javadoc for the various analyzer packages but can't find one. Do I need to implement my own analyzer? Regards, Jerome - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
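In a custom analyzer this filter would typically sit after a StandardTokenizer and LowerCaseFilter. The stdlib sketch below only illustrates the folding idea - Unicode NFD decomposition followed by stripping the combining marks - which approximates what ASCIIFoldingFilter does for Latin accents; it is not the Lucene filter itself.

```java
import java.text.Normalizer;

public class AccentFoldDemo {
    // Decompose accented characters into base letter + combining mark,
    // then strip the marks - roughly ASCIIFoldingFilter's effect on Latin text.
    static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                         .replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("Jérôme")); // Jerome
        System.out.println(fold("résumé")); // resume
    }
}
```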
Re: Multi-value fields in Lucene 4.1
I don't think there is a way of identifying which of the values of a multivalued field matched. But... I haven't checked the code to be absolutely certain whether there isn't some expert way. Also, realize that multiple values could match, such as if you queried for "B*". -- Jack Krupansky -Original Message- From: Chris Bamford Sent: Friday, March 22, 2013 5:57 AM To: java-user@lucene.apache.org Subject: Multi-value fields in Lucene 4.1 Hi, If I index several similar values in a multivalued field (e.g. many authors to one book), is there any way to know which of these matched during a query? e.g. Book "The art of Stuff", with authors "Bob Thingummy" and "Belinda Bootstrap" If we queried for +(author:Be*) and matched this document, is there a way of drilling down and identifying the specific sub-field that actually triggered the match ("Belinda Bootstrap") ? I was wondering what the lowest granularity of matching actually is - document / field / sub-field ... I am happy to index with term vectors and positions if it helps. Thanks, - Chris - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Getting documents from suggestions
I don't have time to debug your code right now, but make sure that the analysis is consistent between index and query. For example, "Apache" vs. "apache". -- Jack Krupansky -Original Message- From: Bratislav Stojanovic Sent: Saturday, March 16, 2013 7:29 AM To: java-user@lucene.apache.org Subject: Re: Getting documents from suggestions Hey Jack, I've tried MoreLikeTHis, but it always returns me 0 hits. Here's the code, it's very simple : // test2 Index lucene = null; try { lucene = new Index(); MoreLikeThis mlt = new MoreLikeThis(lucene.reader); mlt.setAnalyzer(lucene.analyzer); Reader target = new StringReader("apache"); Query query = mlt.like(target, "contents"); TopDocs results = lucene.searcher.search(query, 10); ScoreDoc[] hits = results.scoreDocs; System.out.println("Total "+hits.length+" hits"); for (int i = 0; i < hits.length; i++) { Document doc = lucene.searcher.doc(hits[i].doc); System.out.println("Hit "+i+" : "+doc.getField("id").stringValue()); } } catch (Exception e) { e.printStackTrace(); } finally { if (lucene != null) lucene.close(); } Here are my fields in the index : ... Field pathField = new StringField("path", file.getPath(), Field.Store.YES); doc.add(pathField); doc.add(new LongField("modified", file.lastModified(), Field.Store.NO)); // id so we can fetch metadata doc.add(new LongField("id", id, Field.Store.YES)); // add default user (guest) doc.add(new LongField("userid", -1L, Field.Store.YES)); doc.add(new TextField("contents", new BufferedReader(new InputStreamReader(fis, "UTF-8"; ... Do you have any clue what might be wrong? Btw, searching for "apache" returns me 36 hits, so index is fine. P.S. I even tried like(int) and passing Doc. Id. but same thing. On Thu, Mar 14, 2013 at 10:45 PM, Jack Krupansky wrote: Could you give us some examples of what you expect? I mean, how is your suggested set of documents any different from simply executing a query with the list of suggested terms (using q.op=OR)?
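Beyond the analyzer mismatch, a likely cause of the zero hits in the quoted code is MoreLikeThis's frequency cutoffs: by default it requires minTermFreq=2 and minDocFreq=5, so a one-word input like "apache" (term frequency 1 in the input) is filtered out and the generated query ends up empty. A sketch of relaxing those defaults (field name "contents" follows the poster's code; the method shape is an assumption):

```java
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.search.Query;

public class MltFreqFix {
    // With the defaults (minTermFreq=2, minDocFreq=5), a short input text
    // can produce an empty MLT query and therefore zero hits.
    static Query likeText(IndexReader reader, Analyzer analyzer, String text) throws Exception {
        MoreLikeThis mlt = new MoreLikeThis(reader);
        mlt.setAnalyzer(analyzer);
        mlt.setFieldNames(new String[] { "contents" });
        mlt.setMinTermFreq(1); // default 2: drops every term of a one-word input
        mlt.setMinDocFreq(1);  // default 5: drops rarer terms
        return mlt.like(new StringReader(text), "contents");
    }
}
```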
Or, maybe you want something like MoreLikeThis? -- Jack Krupansky -Original Message- From: Bratislav Stojanovic Sent: Thursday, March 14, 2013 5:36 PM To: java-user@lucene.apache.org Subject: Getting documents from suggestions Hi all, How can I filter suggestions based on some value from the indexed field? I have a stored 'id' field in my index and I want to use that to examine documents where the suggestion was found, but how to get Document from suggestion? SpellChecker class only returns array of strings. What classes should I use? Please help. Thanx in advance. -- Bratislav Stojanovic, M.Sc. --**--**- To unsubscribe, e-mail: java-user-unsubscribe@lucene.**apache.org For additional commands, e-mail: java-user-help@lucene.apache.**org -- Bratislav Stojanovic, M.Sc. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Getting documents from suggestions
Let's refine this... If a top suggestion is X, do you simply want to know a few of the documents which have the highest term frequency for X? Or is there some other term-oriented metric you might propose? -- Jack Krupansky -Original Message- From: Bratislav Stojanovic Sent: Thursday, March 14, 2013 6:14 PM To: java-user@lucene.apache.org Subject: Re: Getting documents from suggestions Wow that was fast :) I have implemented a simple search box with auto-suggestions, so whenever user types in something, ajax call is fired to the SuggestServlet and in return 10 suggestions are shown. It's working fine with the SpellChecker class, but I only get array of Strings. What I want is to get lucene Document instances so I can use doc.get("id") to filter those suggestions. This field is my field, doesn't have to do anything with the default Doc. Id field Lucene generates. Here's an example : when I type "apache" I get suggestions like "apache.org", "apache2" etc. Now I want to have something like this : Document doc = SomeClass.getDocFromSuggestion("apache.org"); if (doc.get("id") == ...) { //add suggestion into the result } else { //do nothing. } Is MoreLikeThis designed for this? On Thu, Mar 14, 2013 at 10:45 PM, Jack Krupansky wrote: Could you give us some examples of what you expect? I mean, how is your suggested set of documents any different from simply executing a query with the list of suggested terms (using q.op=OR)? Or, maybe you want something like MoreLikeThis? -- Jack Krupansky -Original Message- From: Bratislav Stojanovic Sent: Thursday, March 14, 2013 5:36 PM To: java-user@lucene.apache.org Subject: Getting documents from suggestions Hi all, How can I filter suggestions based on some value from the indexed field? I have a stored 'id' field in my index and I want to use that to examine documents where the suggestion was found, but how to get Document from suggestion? SpellChecker class only returns array of strings. What classes should I use? Please help. 
Thanx in advance. -- Bratislav Stojanovic, M.Sc. --**--**- To unsubscribe, e-mail: java-user-unsubscribe@lucene.**apache.org For additional commands, e-mail: java-user-help@lucene.apache.**org -- Bratislav Stojanovic, M.Sc. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Getting documents from suggestions
Could you give us some examples of what you expect? I mean, how is your suggested set of documents any different from simply executing a query with the list of suggested terms (using q.op=OR)? Or, maybe you want something like MoreLikeThis? -- Jack Krupansky -Original Message- From: Bratislav Stojanovic Sent: Thursday, March 14, 2013 5:36 PM To: java-user@lucene.apache.org Subject: Getting documents from suggestions Hi all, How can I filter suggestions based on some value from the indexed field? I have a stored 'id' field in my index and I want to use that to examine documents where the suggestion was found, but how to get Document from suggestion? SpellChecker class only returns array of strings. What classes should I use? Please help. Thanx in advance. -- Bratislav Stojanovic, M.Sc. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Boolean Query not working in Lucene 4.0
Try detailing both your expected behavior and the actual behavior. Try providing an actual code snippet and actual index and query data. Is it failing for all types and titles or just for some? -- Jack Krupansky -Original Message- From: saisantoshi Sent: Tuesday, February 26, 2013 6:52 PM To: java-user@lucene.apache.org Subject: Boolean Query not working in Lucene 4.0 The following query does not seems to work after we upgrade from 2.4 - 4.0 *+type:sometype +title:sometitle** Any ideas as to what are some of the places to look for? Is the above Query correct in syntax. Appreciate if you could advise on the above? Thanks, Sai. -- View this message in context: http://lucene.472066.n3.nabble.com/Boolean-Query-not-working-in-Lucene-4-0-tp4043246.html Sent from the Lucene - Java Users mailing list archive at Nabble.com. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: possible bug on Spellchecker
Any reason that you are not using the DirectSpellChecker? See: http://lucene.apache.org/core/4_0_0/suggest/org/apache/lucene/search/spell/DirectSpellChecker.html -- Jack Krupansky -Original Message- From: Samuel García Martínez Sent: Wednesday, February 20, 2013 3:34 PM To: java-user@lucene.apache.org Subject: possible bug on Spellchecker Hi all, Debugging Solr spellchecker (IndexBasedSpellchecker, delegating on lucene Spellchecker) behaviour i think i found a bug when the input is a 6 letter word: - george - anthem - argued - fluent Due to the getMin() and getMax() the grams indexed for these terms are 3 and 4. So, the fields would be something like this: - for "*george*" - start3: "geo" - start4: "geor" - end3: "rge" - end4: "orge" - 3: "geo", "eor", "org", "rge" - 4: "geor", "eorg", "orge" - for "*anthem*" - start3: "ant" - start4: "anth" - end3: "tem" - end4: "them" The problem shows up when the user swap 3rd a 4th characters, misspelling the word like this: - geroge - anhtem The queries generated for this terms are: (SHOULD boolean queries) - for "*geroge*" - start3: "ger" - start4: "gero" - end3: "oge" - end4: "roge" - 3: "ger", "ero", "rog", "oge" - 4: "gero", "erog", "roge" - for "*anhtem*" - start3: "anh" - start4: "anht" - end3: "tem" - end4: "htem" - 3: "anh", "nht", "hte", "tem" - 4: "anht", "nhte", "htem" So, as you can see, this kind of misspelling never matches the suitable suggestions although the edit distance is 0.9556. I think getMin(int l) and getMax(int l) should return 2 and 3, respectively, for l==6. Debugging other values i did not found any problem with any kind of misspelling. Any thoughts about this? -- Un saludo, Samuel García - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
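The reported miss is easy to reproduce outside Lucene: swapping the 3rd and 4th characters of a 6-letter word leaves zero shared 3-grams, so a gram-based spellchecker never even retrieves the candidate, no matter how small the edit distance is. DirectSpellChecker sidesteps this because it enumerates similar terms directly rather than matching grams. A stdlib demonstration:

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class GramOverlapDemo {
    // All contiguous n-character substrings of a word.
    static Set<String> grams(String word, int n) {
        Set<String> out = new LinkedHashSet<String>();
        for (int i = 0; i + n <= word.length(); i++) {
            out.add(word.substring(i, i + n));
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> indexed = grams("george", 3); // [geo, eor, org, rge]
        Set<String> query = grams("geroge", 3);   // [ger, ero, rog, oge]
        query.retainAll(indexed);
        System.out.println(query);                // [] - no shared grams at all
    }
}
```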
Re: Grouping and tokens
Well, you don't need to "store" both copies since they will be the same. They both need to be "indexed" (string form for grouping, text form for keyword search), but only one needs to be "stored". -- Jack Krupansky -Original Message- From: Ramprakash Ramamoorthy Sent: Tuesday, February 19, 2013 1:07 AM To: java-user@lucene.apache.org Subject: Re: Grouping and tokens On Tue, Feb 19, 2013 at 12:57 PM, Jack Krupansky wrote: Oops, sorry for the "Solr" answer. In Lucene you need to simply index the same value, once as a raw string and a second time as a tokenized text field. Grouping would use the raw string version of the data. Yeah, thanks Jack. Was just wondering if there would be a better alternate rather than 2x storing. But I don't see any. Thanks again. -- Jack Krupansky -Original Message- From: Jack Krupansky Sent: Monday, February 18, 2013 11:21 PM To: java-user@lucene.apache.org Subject: Re: Grouping and tokens Okay, so, fields that would normally need to be tokenized must be stored as both raw strings for grouping and tokenized text for keyword search. Simply use copyField to copy from one to the other. -- Jack Krupansky -Original Message- From: Ramprakash Ramamoorthy Sent: Monday, February 18, 2013 11:13 PM To: java-user@lucene.apache.org Subject: Re: Grouping and tokens On Mon, Feb 18, 2013 at 9:47 PM, Jack Krupansky **wrote: Please clarify exactly what you want to group by - give a specific example that makes it clear what terms should affect grouping and which shouldn't. Assume I am indexing a library data. Say there are the following fields for a particular book. 1. Published 2. Language 3. Genre 4. Author 5. Title 6. ISBN While search time, the user can ask to group by any of the above fields, which means all of them are not supposed to be tokenized. So as I had told earlier, there is a book titled "Fifty shades of gray" and the user searches for "shades". The result turns up in case the field is tokenized. But here it doesn't, since it isn't tokenized. 
Hope I am clear? In a nutshell, how do I use a groupby on a field that is also tokenized? -- Jack Krupansky -Original Message- From: Ramprakash Ramamoorthy Sent: Monday, February 18, 2013 6:12 AM To: java-user@lucene.apache.org Subject: Grouping and tokens Hello all, From the grouping javadoc, I read that fields that are supposed to be grouped should not be tokenized. I have a use case where the user has the freedom to group by any field during search time. Now that only non-tokenized fields are eligible for grouping, this is creating an issue with my search. Say for instance the book "*Fifty shades of grey*" when tokenized and searched for "*shades*" turns up in the result. However this is not the case when I have it as a non-tokenized field (using StandardAnalyzer, Version 4.1). How do I go about this? Is indexing a tokenized and a non-tokenized version of the same field the only way to go? I am afraid it's way too costly! Thanks in advance for your valuable inputs. -- With Thanks and Regards, Ramprakash Ramamoorthy, India, +91 9626975420 - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
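A sketch of the double-field approach Jack describes, against the Lucene 4.x field API (the field names here are made up for illustration, and this is untested): index the raw string for grouping and the tokenized text for keyword search, storing only one copy.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class GroupableTitle {
    // Builds a document carrying the same title twice: once untokenized
    // for grouping, once tokenized for search. Only one copy is stored.
    public static Document build(String title) {
        Document doc = new Document();
        // Raw, untokenized copy used as the group key (indexed, not stored).
        doc.add(new StringField("title_group", title, Field.Store.NO));
        // Tokenized copy for keyword search; this is the stored one,
        // so "shades" still matches "Fifty shades of grey".
        doc.add(new TextField("title", title, Field.Store.YES));
        return doc;
    }
}
```

The index-size cost of the extra field is modest in practice, since the stored bytes (usually the dominant cost for short fields) are not duplicated.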
Re: Grouping and tokens
Oops, sorry for the "Solr" answer. In Lucene you need to simply index the same value, once as a raw string and a second time as a tokenized text field. Grouping would use the raw string version of the data. -- Jack Krupansky -Original Message----- From: Jack Krupansky Sent: Monday, February 18, 2013 11:21 PM To: java-user@lucene.apache.org Subject: Re: Grouping and tokens Okay, so, fields that would normally need to be tokenized must be stored as both raw strings for grouping and tokenized text for keyword search. Simply use copyField to copy from one to the other. -- Jack Krupansky -Original Message- From: Ramprakash Ramamoorthy Sent: Monday, February 18, 2013 11:13 PM To: java-user@lucene.apache.org Subject: Re: Grouping and tokens On Mon, Feb 18, 2013 at 9:47 PM, Jack Krupansky wrote: Please clarify exactly what you want to group by - give a specific example that makes it clear what terms should affect grouping and which shouldn't. Assume I am indexing a library data. Say there are the following fields for a particular book. 1. Published 2. Language 3. Genre 4. Author 5. Title 6. ISBN While search time, the user can ask to group by any of the above fields, which means all of them are not supposed to be tokenized. So as I had told earlier, there is a book titled "Fifty shades of gray" and the user searches for "shades". The result turns up in case the field is tokenized. But here it doesn't, since it isn't tokenized. Hope I am clear? In a nutshell, how do I use a groupby on a field that is also tokenized? -- Jack Krupansky -Original Message- From: Ramprakash Ramamoorthy Sent: Monday, February 18, 2013 6:12 AM To: java-user@lucene.apache.org Subject: Grouping and tokens Hello all, From the grouping javadoc, I read that fields that are supposed to be grouped should not be tokenized. I have an use case where the user has the freedom to group by any field during search time. 
Now that only non-tokenized fields are eligible for grouping, this is creating an issue with my search. Say for instance the book "*Fifty shades of grey*" when tokenized and searched for "*shades*" turns up in the result. However this is not the case when I have it as a non-tokenized field (using StandardAnalyzer, Version 4.1). How do I go about this? Is indexing a tokenized and a non-tokenized version of the same field the only way to go? I am afraid it's way too costly! Thanks in advance for your valuable inputs. -- With Thanks and Regards, Ramprakash Ramamoorthy, India, +91 9626975420 - To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org For additional commands, e-mail: java-user-help@lucene.apache.org -- With Thanks and Regards, Ramprakash Ramamoorthy, India. +91 9626975420 - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Grouping and tokens
Okay, so, fields that would normally need to be tokenized must be stored as both raw strings for grouping and tokenized text for keyword search. Simply use copyField to copy from one to the other. -- Jack Krupansky -Original Message- From: Ramprakash Ramamoorthy Sent: Monday, February 18, 2013 11:13 PM To: java-user@lucene.apache.org Subject: Re: Grouping and tokens On Mon, Feb 18, 2013 at 9:47 PM, Jack Krupansky wrote: Please clarify exactly what you want to group by - give a specific example that makes it clear what terms should affect grouping and which shouldn't. Assume I am indexing a library data. Say there are the following fields for a particular book. 1. Published 2. Language 3. Genre 4. Author 5. Title 6. ISBN While search time, the user can ask to group by any of the above fields, which means all of them are not supposed to be tokenized. So as I had told earlier, there is a book titled "Fifty shades of gray" and the user searches for "shades". The result turns up in case the field is tokenized. But here it doesn't, since it isn't tokenized. Hope I am clear? In a nutshell, how do I use a groupby on a field that is also tokenized? -- Jack Krupansky -Original Message- From: Ramprakash Ramamoorthy Sent: Monday, February 18, 2013 6:12 AM To: java-user@lucene.apache.org Subject: Grouping and tokens Hello all, From the grouping javadoc, I read that fields that are supposed to be grouped should not be tokenized. I have an use case where the user has the freedom to group by any field during search time. Now that only tokenized fields are eligible for grouping, this is creating an issue with my search. Say for instance the book "*Fifty shades of grey*" when tokenized and searched for "*shades*" turns up in the result. However this is not the case when I have it as a non-tokenized field (using StandardAnalyzer-Version4.1). How do I go about this? Is indexing a tokenized and non-tokenized version of the same field the only go? I am afraid its way too costly! 
Thanks in advance for your valuable inputs. -- With Thanks and Regards, Ramprakash Ramamoorthy, India, +91 9626975420 - To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org For additional commands, e-mail: java-user-help@lucene.apache.org -- With Thanks and Regards, Ramprakash Ramamoorthy, India. +91 9626975420 - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Grouping and tokens
Please clarify exactly what you want to group by - give a specific example that makes it clear what terms should affect grouping and which shouldn't. -- Jack Krupansky -Original Message- From: Ramprakash Ramamoorthy Sent: Monday, February 18, 2013 6:12 AM To: java-user@lucene.apache.org Subject: Grouping and tokens Hello all, From the grouping javadoc, I read that fields that are supposed to be grouped should not be tokenized. I have an use case where the user has the freedom to group by any field during search time. Now that only tokenized fields are eligible for grouping, this is creating an issue with my search. Say for instance the book "*Fifty shades of grey*" when tokenized and searched for "*shades*" turns up in the result. However this is not the case when I have it as a non-tokenized field (using StandardAnalyzer-Version4.1). How do I go about this? Is indexing a tokenized and non-tokenized version of the same field the only go? I am afraid its way too costly! Thanks in advance for your valuable inputs. -- With Thanks and Regards, Ramprakash Ramamoorthy, India, +91 9626975420 - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: fuzzy queries
You probably are not getting this document returned: list.add("strfffing_ m atcbbhing"); because... both terms have an edit distance greater than two. All the other documents have one or the other or both terms with an editing distance of 2 or less. Your query is essentially: Match a document if EITHER term matches. So, if NEITHER matches (within an editing distance of 2), the document is not a match. -- Jack Krupansky -Original Message- From: Pierre Antoine DuBoDeNa Sent: Saturday, February 09, 2013 12:52 PM To: java-user@lucene.apache.org Subject: Re: fuzzy queries with query like string~ matching~ (without specifying threshold) i get 14 results back.. Can it be problem with the analyzers? Here is the code: private File indexDir = new File("/a-directory-here"); private StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_35); private IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, analyzer); public static void main(String[] args) throws Exception { IndexProfiles Indexer = new IndexProfiles(); IndexWriter w = Indexer.CreateIndex(); ArrayList list = new ArrayList(); list.add("string matching"); list.add("string123 matching"); list.add("string matching123"); list.add("string123 matching123"); list.add("str4ing match2ing"); list.add("1string 2matching"); list.add("str_ing ma_tching"); list.add("string_matching"); list.add("strang mutching"); list.add("strrring maatchinng"); list.add("strfffing_ m atcbbhing"); list.add("str2ing__mat3ching"); list.add("string_m atching"); list.add("string matching another token"); list.add("strasding matc4hing ano23ther tok3en"); list.add("str4ing maaatching_another 2t oken"); for (String companyname:list) { Indexer.addSingleField(w, companyname); } int numDocs = w.numDocs(); System.out.println("# of Docs in Index: " + numDocs); w.close(); DoIndexQuery("string~ matching~"); } public static void DoIndexQuery(String query) throws IOException, ParseException { IndexProfiles Indexer = new IndexProfiles(); 
IndexReader reader = Indexer.LoadIndex(); Indexer.SearchIndex(reader, query, 50); reader.close(); } public IndexWriter CreateIndex() throws IOException { Directory index = FSDirectory.open(indexDir); IndexWriter w = new IndexWriter(index, config); return w; } public HashMap SearchIndex(IndexReader w, String query, int topk) throws IOException, ParseException { Query q = new QueryParser(Version.LUCENE_35, "Name", analyzer).parse(query); IndexSearcher searcher = new IndexSearcher(w); TopScoreDocCollector collector = TopScoreDocCollector.create(topk, true); searcher.search(q, collector); ScoreDoc[] hits = collector.topDocs().scoreDocs; System.out.println("Found " + hits.length + " hits."); HashMap map = new HashMap(); for(int i=0;i Can you reduce your test case to indexing one document/field and running a single FuzzyQuery (you seem to be running two at once, OR'ing the results)? And show the complete standalone source code (eg what is topk?) so we can see how you are indexing / building the Query / searching. The default minSim is 0.5. Note that 0.01 is not useful in practice: it (should) match nearly all terms. But I agree it's odd one term is not matching. Mike McCandless http://blog.mikemccandless.com On Sat, Feb 9, 2013 at 5:20 AM, Pierre Antoine DuBoDeNa wrote: >> >> Hello, >> >> I use Lucene 3.6 and I try to use fuzzy queries so that I can match >> much >> more results.
>> >> I am adding for example these strings: >> >> list.add("string matching"); >> >> list.add("string123 matching"); >> >> list.add("string matching123"); >> >> list.add("string123 matching123"); >> >> list.add("str4ing match2ing"); >> >> list.add("1string 2matching"); >> >> list.add("str_ing ma_tching"); >> >> list.add("string_matching"); >> >> list.add("strang mutching"); >> >> list.add("strrring maatchinng"); >> >> list.add("strfffing_ m atcbbhing"); >> >> list.add("str2ing__mat3ching"); >> >> list.add("string_m atching"); >> >> list.add("string matching another token"); >> >> list.add("strasding matc4hing ano23ther tok3en"); >> >> list.add("str4ing maaatching_another 2t oken"); >> >> >> >> then i do a query: >> >> >> "string~0.01 matching~0.01" >&
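Jack's diagnosis in the reply above can be checked with a plain Levenshtein implementation (a textbook dynamic-programming version, not the automaton machinery Lucene actually uses): both terms of the unmatched document are 3 edits away from the query terms, while the matched documents each have at least one term within 2 edits.

```java
public class EditDistance {
    // Standard Levenshtein distance (insert/delete/substitute, cost 1 each).
    public static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + sub);
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // Terms from matched documents sit within 2 edits of a query term...
        System.out.println(distance("strang", "string"));       // 1
        System.out.println(distance("maatchinng", "matching")); // 2
        // ...but both terms of "strfffing_ m atcbbhing" are more than
        // 2 edits away, so neither SHOULD clause of
        // "string~ matching~" can match that document.
        System.out.println(distance("strfffing", "string"));    // 3
        System.out.println(distance("atcbbhing", "matching"));  // 3
    }
}
```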
Re: Wildcard in a text field
Ah, okay... some people call that "prospective" search. In any case, there is no direct Lucene support that I know of. There are some references here: http://lucene.apache.org/core/4_0_0/memory/org/apache/lucene/index/memory/MemoryIndex.html -- Jack Krupansky -Original Message- From: Nicolas Roduit Sent: Friday, February 08, 2013 10:14 AM To: java-user@lucene.apache.org Subject: Re: Re: Wildcard in a text field For instance, I have a list of tags related to a text. Each text with its list of tags is put in a document and indexed by Lucene. If we consider that a tag is "buddh*" and I would like to make a query (e.g. "buddha" or "buddhism" or "buddhist") and find the document that contains "buddh*". Thanks, On 08.02.13 13:35, Jack Krupansky wrote: That description is too vague. Could you provide a couple of examples of index text and queries and what you expect those queries to match? If you simply want to query for "*" and "?" in "string" fields, escape them with a backslash. But if you want to escape them in "text" fields, be sure to use an analyzer that preserves them since they generally will be treated as spaces. -- Jack Krupansky -Original Message- From: Nicolas Roduit Sent: Friday, February 08, 2013 2:49 AM To: java-user@lucene.apache.org Subject: Wildcard in a text field I'm looking for a way of making a query on words which contain wildcards (* or ?). In general, we use wildcards in the query, not in the text. I haven't found anything in Lucene to build that.
- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
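For reference, a rough, untested sketch of the MemoryIndex-based "prospective" approach against the Lucene 4.x API: turn the problem around by treating the stored wildcard tag as the query and the incoming search word as a one-document in-memory index.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class ProspectiveMatch {
    // Does the stored wildcard tag (e.g. "buddh*") match the incoming
    // word (e.g. "buddhism")? The tag becomes the query; the word becomes
    // a tiny in-memory index. Field name "tag" is illustrative only.
    public static boolean tagMatches(String wildcardTag, String word)
            throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
        MemoryIndex index = new MemoryIndex();
        index.addField("tag", word, analyzer);
        Query q = new QueryParser(Version.LUCENE_40, "tag", analyzer)
                .parse(wildcardTag);
        return index.search(q) > 0.0f; // score > 0 means the query matched
    }
}
```

With many documents and many tags, you would loop the incoming word over each document's tag queries (or precompile the tag queries once), which is exactly the shape of a prospective/alerting system.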
Re: Wildcard in a text field
That description is too vague. Could you provide a couple of examples of index text and queries and what you expect those queries to match? If you simply want to query for "*" and "?" in "string" fields, escape them with a backslash. But if you want to escape them in "text" fields, be sure to use an analyzer that preserves them since they generally will be treated as spaces. -- Jack Krupansky -Original Message- From: Nicolas Roduit Sent: Friday, February 08, 2013 2:49 AM To: java-user@lucene.apache.org Subject: Wildcard in a text field I'm looking for a way of making a query on words which contain wildcards (* or ?). In general, we use wildcards in the query, not in the text. I haven't found anything in Lucene to build that. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Lucene vs Glimpse
Generally, all of your example queries should work fine with Lucene, provided that you carefully choose your analyzer, or even use the StandardAnalyzer. The special characters like underscore and dot generally get treated as spaces and the resulting sequence of terms would match as a phrase. It won't be a 100% solution, but it should do reasonably well. Is there a query that was failing to match reasonably for you? -- Jack Krupansky -Original Message- From: Mathias Dahl Sent: Monday, February 04, 2013 1:01 PM To: java-user@lucene.apache.org Subject: Lucene vs Glimpse Hi, I have hacked together a small web front end to the Glimpse text indexing engine (see http://webglimpse.net/ for information). I am very happy with how Glimpse indexes and searches data. If I understand it correctly it uses a combination of an index and searching directly in the files themselves as grep or other tools. The problem is that I discovered it is not open source and now that I want to extend the use from private to company wide I will run into license problems/costs. So, I decided to try out Lucene. I tried the examples and changed them a bit to use another analyzer. But when I started to think about it I realized that I will not be able to build something like Glimpse. At least not easily. Why? I will try to explain: As stated above, Glimpse uses a combination of index and in-file search. This makes it very powerful in the sense that I can get hits for things that are not necessarily being indexes as terms. Let's say I have a file with this content: ... import foo.bar.baz; ... With Glimpse, and without telling it how to index the content I can find the above file using a search string like "foo" or "bar" but also, and this is important, using foo.bar.baz. Another example: We have a lot of PL/SQL source code, and often you can find code like this: ... My_Nice_API.Some_Method ... Here too, Glimpse is almost magic since it combines index and normal search. 
I can find the file above using "My_Nice_API" or "My_Nice_API.Some_Method". In a sense I can have the cake and eat it too. If I want to do similar "free" search stuff with Lucene I think I have to create analyzers for the different kinds of source code files, with fields for this and that. Quite an undertaking. Does anyone understand my point here and am I correct in that it would be hard to implement something as "free" as with Glimpse? I am not trying to criticize, just understand how Lucene (and Glimpse) works. Oh, yes, Glimpse has one big drawback: it only supports search strings up to 32 characters. Thanks! /Mathias - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
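Jack's point about "foo.bar.baz" can be illustrated with a toy tokenizer. This is only a crude approximation of what StandardAnalyzer does (the real rules are Unicode UAX#29 word boundaries), but it shows why a dotted or underscored identifier still matches: the punctuation splits it into adjacent terms, so both the individual words and the whole sequence (as a phrase) are searchable.

```java
import java.util.ArrayList;
import java.util.List;

public class ToyTokenizer {
    // Crude approximation of StandardAnalyzer for ASCII source code:
    // split on non-alphanumerics and lowercase.
    public static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.split("[^A-Za-z0-9]+")) {
            if (!t.isEmpty()) out.add(t.toLowerCase());
        }
        return out;
    }

    public static void main(String[] args) {
        // "import foo.bar.baz;" indexes as [import, foo, bar, baz], so
        // "foo", "bar", and the phrase "foo bar baz" can all match.
        System.out.println(tokens("import foo.bar.baz;"));
        // In this simplification underscores split too, so a phrase query
        // for "my nice api" lines up with consecutive term positions.
        System.out.println(tokens("My_Nice_API.Some_Method"));
    }
}
```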
Re: How to find related words ?
Oh, so you wanted "similar" words! You should have said so... your inquiry said you were looking for "related" words. So, which is it? More specifically, what exactly are you looking for, in terms of the semantics? In any case, "find similar" (MoreLikeThis) is about the best you can do out of the box. -- Jack Krupansky -Original Message- From: Andrew Gilmartin Sent: Thursday, January 31, 2013 9:04 AM To: java-user@lucene.apache.org Subject: Re: How to find related words ? wgggfiy wrote: en, it seems nice, but I'm puzzled by you and Andrew Gilmartin above, what's the difference between you guys? The difference is that similar documents do not give you similar terms. Similar documents can show a correlation of terms -- i.e., wherever Lucene is mentioned so are Solr and Hadoop -- but in no way does this mean that the terms are similar. Accumulating similar and/or synonymous terms is a manual process. I am sure there are text mining tools/algorithms that make such discoveries, but I do not know about them. (I am a journeyman programmer, not a researcher.) If anyone does know about them, please share with this list. -- Andrew - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: How to find related words ?
Take a look at MoreLikeThisQuery: http://lucene.apache.org/core/4_1_0/queries/org/apache/lucene/queries/mlt/MoreLikeThisQuery.html And MoreLikeThis itself: http://lucene.apache.org/core/4_1_0/queries/org/apache/lucene/queries/mlt/MoreLikeThis.html So, the idea is search for documents using your keyword(s) and ask Lucene to extract relevant terms from the top document(s). -- Jack Krupansky -Original Message- From: wgggfiy Sent: Wednesday, January 30, 2013 12:27 PM To: java-user@lucene.apache.org Subject: How to find related words ? In short, you put in a term like "Lucene", and The ideal output would be "solr", "index", "full-text search", and so on. How to make it ? to find the related words. thx My idea is to use FuzzyQuery, or MoreLikeThis, or calc the score with all the terms and then sort. Any idea ? - -- Email: wuqiu.m...@qq.com -- -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-find-related-words-tp4037462.html Sent from the Lucene - Java Users mailing list archive at Nabble.com. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
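A rough, untested sketch of that idea against the Lucene 4.1 MoreLikeThis API (the "content" field name is hypothetical): hand it a seed keyword, let it mine the index for the most "interesting" co-occurring terms, and use the resulting query (or inspect its terms) as the related-words list.

```java
import java.io.StringReader;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class RelatedTerms {
    // Builds a query of terms the index considers related to the keyword.
    public static Query relatedTo(IndexReader reader, String keyword)
            throws Exception {
        MoreLikeThis mlt = new MoreLikeThis(reader);
        mlt.setAnalyzer(new StandardAnalyzer(Version.LUCENE_41)); // needed for free-text input
        mlt.setFieldNames(new String[] {"content"}); // hypothetical field
        mlt.setMinTermFreq(1);  // keyword appears once in the seed text
        mlt.setMinDocFreq(2);   // ignore terms too rare to be meaningful
        return mlt.like(new StringReader(keyword), "content");
    }
}
```

Printing the returned query shows the extracted terms directly, which for a seed like "Lucene" is the "solr", "index", "full-text search" style of output the poster is after (assuming such co-occurrences exist in the index).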
Re: Questions about FuzzyQuery in Lucene 4.x
I'm sorry, but for anybody to help you here, you really need to be able to provide a concise test case, like 10-20 lines of code, completely self-contained. If you think you need a million documents to repro what you claimed was a simple scenario, then you leave me very, very confused - and unable to help you any further. -- Jack Krupansky -Original Message- From: George Kelvin Sent: Tuesday, January 29, 2013 2:43 PM To: java-user@lucene.apache.org Subject: Re: Questions about FuzzyQuery in Lucene 4.x Hi Jack, The problematic query is "scar"+"wads". There are several (more than 10) documents in the data with the content "star wars", so I think that query should be able to find all these documents. I was trying to provide a minimal test case, but I couldn't reduce the size of the data showing the failure. The size of the minimal data showing the failure I got so far is around 2 million. However, I found a suspicious document with content "scor". If I remove it from the 2 million documents data, that query can find all the "star wars" documents. If I add it back, then the query can't find any. I tried to reduce the size of the data to 1 million further and add that "scor" document, but now the query can still find all the "star wars" documents. Is it possible that Lucene somehow fails to find all the valid terms within the edit distance? Thanks! George On Tue, Jan 29, 2013 at 10:02 AM, Jack Krupansky wrote: I also noticed that you have "MUST" for your full string of fuzzy terms - that means every one of them must appear in an indexed document to be matched. Is it possible that maybe even one term was not in the same indexed document? Try to provide a complete example that shows the input data and the query - all the literals. In other words, construct a minimal test case that shows the failure.
-- Jack Krupansky -Original Message- From: George Kelvin Sent: Tuesday, January 29, 2013 12:28 PM To: java-user@lucene.apache.org Subject: Re: Questions about FuzzyQuery in Lucene 4.x Hi Jack, ed is set to 1 here and I have lowercased all the data and queries. Regarding the indexed data factor you mentioned, can you elaborate more? Thanks! George On Tue, Jan 29, 2013 at 9:10 AM, Jack Krupansky * *wrote: That depends on the value of "ed", and the indexed data. Another factor to take into consideration is that a case change ("Star" vs. "star") also counts as an edit. -- Jack Krupansky -Original Message- From: George Kelvin Sent: Tuesday, January 29, 2013 11:49 AM To: java-user@lucene.apache.org Subject: Re: Questions about FuzzyQuery in Lucene 4.x Hi Jack, Thanks for your reply! I don't think I passed the prefixLength parameter in. Here is the code I used to build the FuzzyQuery: String[] words = str.split("\\+"); BooleanQuery query = new BooleanQuery(); for (int i=0; i > For additional commands, e-mail: java-user-help@lucene.apache.org< java-user-help@lucene.**apache.org > --**--**- To unsubscribe, e-mail: java-user-unsubscribe@lucene.**apache.org For additional commands, e-mail: java-user-help@lucene.apache.**org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Questions about FuzzyQuery in Lucene 4.x
I also noticed that you have "MUST" for your full string of fuzzy terms - that means everyone of them must appear in an indexed document to be matched. Is it possible that maybe even one term was not in the same indexed document? Try to provide a complete example that shows the input data and the query - all the literals. In other words, construct a minimal test case that shows the failure. -- Jack Krupansky -Original Message- From: George Kelvin Sent: Tuesday, January 29, 2013 12:28 PM To: java-user@lucene.apache.org Subject: Re: Questions about FuzzyQuery in Lucene 4.x Hi Jack, ed is set to 1 here and I have lowercased all the data and queries. Regarding the indexed data factor you mentioned, can you elaborate more? Thanks! George On Tue, Jan 29, 2013 at 9:10 AM, Jack Krupansky wrote: That depends on the value of "ed", and the indexed data. Another factor to take into consideration is that a case change ("Star" vs. "star") also counts as an edit. -- Jack Krupansky -Original Message- From: George Kelvin Sent: Tuesday, January 29, 2013 11:49 AM To: java-user@lucene.apache.org Subject: Re: Questions about FuzzyQuery in Lucene 4.x Hi Jack, Thanks for your reply! I don't think I passed the prefixLength parameter in. Here is the code I used to build the FuzzyQuery: String[] words = str.split("\\+"); BooleanQuery query = new BooleanQuery(); for (int i=0; iTo unsubscribe, e-mail: java-user-unsubscribe@lucene.**apache.org For additional commands, e-mail: java-user-help@lucene.apache.**org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Questions about FuzzyQuery in Lucene 4.x
That depends on the value of "ed", and the indexed data. Another factor to take into consideration is that a case change ("Star" vs. "star") also counts as an edit. -- Jack Krupansky -Original Message- From: George Kelvin Sent: Tuesday, January 29, 2013 11:49 AM To: java-user@lucene.apache.org Subject: Re: Questions about FuzzyQuery in Lucene 4.x Hi Jack, Thanks for your reply! I don't think I passed the prefixLength parameter in. Here is the code I used to build the FuzzyQuery: String[] words = str.split("\\+"); BooleanQuery query = new BooleanQuery(); for (int i=0; iGeorge - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Questions about FuzzyQuery in Lucene 4.x
Let's see your code that calls FuzzyQuery . If you happen to pass a prefixLength (3rd parameter) of 3 or more, then "ster" would not match "star" (but prefixLength of 2 would match). -- Jack Krupansky -Original Message- From: George Kelvin Sent: Monday, January 28, 2013 5:31 PM To: java-user@lucene.apache.org Subject: Questions about FuzzyQuery in Lucene 4.x Hi All, I’m working on several projects requiring powerful search features. I’ve been waiting for Lucene 4 since I read Michael McCandless's blog post about Lucene’s new FuzzyQuery and finally I got the chance to test it. The improvement on the fuzzy search is really impressive! However, I’ve encountered a problem today: In my data, there are some records with keywords “star” and “wars”. But when I issued a fuzzy query with two keywords “ster” and “wats”, the engine failed to find the records. I’m wondering if you can provide any inputs on that. Maybe I’m not doing fuzzy search in the right way. But all my other fuzzy queries with single keyword and with longer double keywords worked perfectly. Another issue is that I’m also exploring the possibility to do wildcard+fuzzy search using Lucene. I couldn’t find any related document for this on Lucene's website, but I found a stackoverfow thread talking about this. http://stackoverflow.com/questions/2631206/lucene-query-bla-match-words-that-start-with-something-fuzzy-how I tried the way suggested by the second answer their, and it worked. However the scoring is strange: all results were assigned with the exactly same score. Is there anything I can do to get the scoring right? Can you tell me what’s the best way to do wildcard fuzzy search? Any help will be appreciated! Thanks, George - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
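An untested sketch of the construction being discussed, against the Lucene 4.x FuzzyQuery constructor (the "content" field name is made up): it makes both knobs from this thread explicit, the prefixLength that decides whether "ster" can even reach "star", and the MUST clauses that require every word to match in the same document.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FuzzyQuery;

public class FuzzyWordsQuery {
    public static BooleanQuery build(String[] words) {
        BooleanQuery query = new BooleanQuery();
        for (String w : words) {
            // maxEdits=1, prefixLength=0: "ster" can reach "star"
            // (one substitution). With prefixLength >= 3 the differing
            // 3rd letter ("ste" vs "sta") disqualifies the term before
            // edit distance is even considered; prefixLength=2 still works.
            FuzzyQuery fq = new FuzzyQuery(new Term("content", w), 1, 0);
            // MUST: every word must match in the same document; use
            // Occur.SHOULD instead if any one matching word should suffice.
            query.add(fq, BooleanClause.Occur.MUST);
        }
        return query;
    }
}
```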
Re: Readers for extracting textual info from pdf/doc/excel for indexing the actual content
Re-read my last message - and then take a look at that Solr source code, which will give you an idea how to use Tika, even though you are using Lucene only. If you have specific questions, please be specific. To answer your latest question, yes, Tika is good enough. Solr /update/extract uses it, and Solr is based on Lucene. -- Jack Krupansky -Original Message- From: saisantoshi Sent: Sunday, January 27, 2013 2:09 PM To: java-user@lucene.apache.org Subject: Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content We are not using Solr and using just Lucene core 4.0 engine. I am trying to see if we can use tika library to extract textual information from pdf/word/excel documents. I am mainly interested in reading the contents inside the documents and index using lucene. My question here is , is tika framework good enough or is there any other better library. Any issues/experiences in using the tika framework. Thanks, Sai. -- View this message in context: http://lucene.472066.n3.nabble.com/Readers-for-extracting-textual-info-from-pd-doc-excel-for-indexing-the-actual-content-tp4036379p4036557.html Sent from the Lucene - Java Users mailing list archive at Nabble.com. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content
You may be able to use Tika directly without needing to choose the specific classes, although the latter may give you the specific data you need without the extra overhead. You could take a look at the Solr Extracting Request Handler source for an example: http://svn.apache.org/viewvc/lucene/dev/tags/lucene_solr_4_1_0/solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/ Basically, Tika extracts a bunch of "metadata" and then you will have to add selected metadata to your Lucene documents. "content" is the main document body text. You could try Solr itself to see how it works: http://wiki.apache.org/solr/ExtractingRequestHandler -- Jack Krupansky -Original Message- From: Adrien Grand Sent: Sunday, January 27, 2013 12:53 PM To: java-user@lucene.apache.org Subject: Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content Have you tried using the PDFParser [1] and the OfficeParser [2] classes from Tika? This question seems to be more appropriate for the Tika user mailing list [3]? [1] http://tika.apache.org/1.3/api/org/apache/tika/parser/pdf/PDFParser.html#parse(java.io.InputStream, org.xml.sax.ContentHandler, org.apache.tika.metadata.Metadata, org.apache.tika.parser.ParseContext) [2] http://tika.apache.org/1.3/api/org/apache/tika/parser/microsoft/OfficeParser.html#parse(java.io.InputStream, org.xml.sax.ContentHandler, org.apache.tika.metadata.Metadata, org.apache.tika.parser.ParseContext) [3] http://tika.apache.org/mail-lists.html -- Adrien - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
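A minimal sketch of the Tika-then-Lucene flow described above, assuming Tika 1.x and Lucene 4.x on the classpath. The field names ("content", "title") are illustrative, not required by either library:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaExtractSketch {
    public static Document toLuceneDoc(String path) throws Exception {
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
        Metadata metadata = new Metadata();
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            // AutoDetectParser picks the right parser (PDF, Word, Excel, ...)
            new AutoDetectParser().parse(in, handler, metadata);
        }
        Document doc = new Document();
        // the extracted body text becomes the searchable "content" field
        doc.add(new TextField("content", handler.toString(), Field.Store.NO));
        // copy over selected Tika metadata, e.g. a title if one was found
        String title = metadata.get("title");
        if (title != null) {
            doc.add(new TextField("title", title, Field.Store.YES));
        }
        return doc;
    }
}
```

As the reply notes, which metadata keys Tika fills in varies by document type, so inspect the Metadata object for your own files before deciding which fields to index.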
Re: Indexing multiple fields with one document position
Send the same input text to two different analyzers for two separate fields. The first analyzer emits only the first attribute. The second analyzer emits only the second attribute. The document position in one will correspond to the document position in the other. -- Jack Krupansky -Original Message- From: Igor Shalyminov Sent: Monday, January 21, 2013 3:04 AM To: java-user@lucene.apache.org Subject: Indexing multiple fields with one document position Hello! When indexing text with position data, one just adds a field to a document in the form of its name and value, and the indexer assigns it a unique position in the index. I wonder, if I have an entry with two attributes, say: cat, how do I store in the index two fields, "pos" and "number", with their values, pointing to the same position in the document? -- Best Regards, Igor Shalyminov - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: SpanNearQuery with two boundaries
+1 I think that accurately states the semantics of the operation you want. -- Jack Krupansky -Original Message- From: Alan Woodward Sent: Friday, January 18, 2013 1:08 PM To: java-user@lucene.apache.org Subject: Re: SpanNearQuery with two boundaries Hi Igor, You could try wrapping the two cases in a SpanNotQuery: SpanNot(SpanNear(runs, cat, 10), SpanNear(runs, cat, 3)) That should return documents that have runs within 10 positions of cat, as long as they don't overlap with runs within 3 positions of cat. Alan Woodward www.flax.co.uk On 18 Jan 2013, at 16:13, Igor Shalyminov wrote: Hello! I want to perform search queries like this one: word:"dog" \1 word:"runs" (\3 \10) word:"cat" It is thus something like SpanNearQuery, but with two boundaries - minimum and maximum distance between the terms (which in the \1-case would be equal). Syntax (as above, fictional:) itself doesn't matter, I just want to know if one is able to build this type of query based on existing (Lucene 4.0.0) query classes. -- Best Regards, Igor Shalyminov - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
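Alan's SpanNot(SpanNear(runs, cat, 10), SpanNear(runs, cat, 3)) suggestion can be built with the Lucene 4.x span API roughly as follows. The field name "body" is an assumption for illustration:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanNotQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class MinMaxSlopSketch {
    // "runs" near "cat" at more than 3 but at most 10 positions apart
    public static SpanQuery runsNearCatBetween4And10() {
        SpanQuery runs = new SpanTermQuery(new Term("body", "runs"));
        SpanQuery cat = new SpanTermQuery(new Term("body", "cat"));
        // matches within 10 positions, unordered
        SpanQuery within10 = new SpanNearQuery(new SpanQuery[]{runs, cat}, 10, false);
        // matches within 3 positions, unordered
        SpanQuery within3 = new SpanNearQuery(new SpanQuery[]{runs, cat}, 3, false);
        // keep the wide matches that do not overlap the tight ones
        return new SpanNotQuery(within10, within3);
    }
}
```

The \1 exact-distance case from the question would be the degenerate form: SpanNot over slop-1 and slop-0 spans.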
Re: Combine two BooleanQueries by a SpanNearQuery.
Currently there isn't. SpanNearQuery can take only other SpanQuery objects, which includes other spans, span terms, and span wrapped multi-term queries (e.g., wildcard, fuzzy query), but not Boolean queries. But it does sound like a good feature request. There is SpanNotQuery, so you can exclude terms from a span. -- Jack Krupansky -Original Message- From: Michel Conrad Sent: Thursday, January 17, 2013 12:14 PM To: java-user@lucene.apache.org Subject: Re: Combine two BooleanQueries by a SpanNearQuery. The problem I would like to solve is to have two queries that I will get from the query parser (this could include wildcard queries and phrase queries). Both of these queries would have to match the document and as an additional restriction I would like to add that a matching term from the first query is near a matching term from the second query. So that you can search, for instance, for matching documents with, in your first query, 'apple AND NOT computer' and in your second 'monkey' with a slop of 10 between the two queries; then it would be equivalent to '"apple monkey"~10 AND NOT computer'. I was wondering if there is a method to combine more complicated queries in a similar way. (Some kind of generic solution) Thanks for your help, Michel On Thu, Jan 17, 2013 at 5:14 PM, Jack Krupansky wrote: You need to express the "boolean" query solely in terms of SpanOrQuery and SpanNearQuery. If you can't, ... then it probably can't be done, but you should be able to. How about starting with a plain English description of the problem you are trying to solve? -- Jack Krupansky -Original Message- From: Michel Conrad Sent: Thursday, January 17, 2013 11:01 AM To: java-user@lucene.apache.org Subject: Combine two BooleanQueries by a SpanNearQuery. Hi, I am looking to get a combination of multiple subqueries. What I want to do is to have two queries which have to be near one another. 
As an example: Query1: (A AND (B OR C)) Query2: D Then I want to use something like a SpanNearQuery to combine both (slop 5): Both would then have to match and D should be within slop 5 to A, B or C. So my question is if there is a query that combines two BooleanQuery trees into a SpanNearQuery. It would have to take the terms that match Query 1 and the terms that match Query 2, and look if there is a combination within the required slop. Can I rewrite the BooleanQuery after parsing the query as a MultiTermQuery, then wrap these in SpanMultiTermQueryWrapper, which can be combined by the SpanNearQuery? Best regards, Michel - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
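Within the limits Jack describes, Michel's example can be approximated by re-expressing the Boolean parts as spans: OR becomes SpanOrQuery, a wildcard clause is wrapped in SpanMultiTermQueryWrapper, and SpanNotQuery handles exclusion. A sketch (Lucene 4.x; field name "body" and the concrete terms are illustrative):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.search.spans.SpanMultiTermQueryWrapper;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanNotQuery;
import org.apache.lucene.search.spans.SpanOrQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class SpanCombineSketch {
    public static SpanQuery build() {
        // OR alternatives become a SpanOrQuery: (apple OR fruit)
        SpanQuery aOrB = new SpanOrQuery(
                new SpanTermQuery(new Term("body", "apple")),
                new SpanTermQuery(new Term("body", "fruit")));
        // a wildcard clause can join a span via SpanMultiTermQueryWrapper
        SpanQuery monkeys = new SpanMultiTermQueryWrapper<WildcardQuery>(
                new WildcardQuery(new Term("body", "monkey*")));
        // both sides within 10 positions of each other, unordered
        SpanQuery near = new SpanNearQuery(new SpanQuery[]{aOrB, monkeys}, 10, false);
        // caveat: SpanNot excludes spans that OVERLAP "computer", not whole
        // documents containing it - a document-level NOT still needs a BooleanQuery
        return new SpanNotQuery(near, new SpanTermQuery(new Term("body", "computer")));
    }
}
```

This is not a generic BooleanQuery-to-span rewriter; as Jack says, each clause has to be expressed as a SpanQuery by hand.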
Re: Combine two BooleanQueries by a SpanNearQuery.
You need to express the "boolean" query solely in terms of SpanOrQuery and SpanNearQuery. If you can't, ... then it probably can't be done, but you should be able to. How about starting with a plain English description of the problem you are trying to solve? -- Jack Krupansky -Original Message- From: Michel Conrad Sent: Thursday, January 17, 2013 11:01 AM To: java-user@lucene.apache.org Subject: Combine two BooleanQueries by a SpanNearQuery. Hi, I am looking to get a combination of multiple subqueries. What I want to do is to have two queries which have to be near one another. As an example: Query1: (A AND (B OR C)) Query2: D Then I want to use something like a SpanNearQuery to combine both (slop 5): Both would then have to match and D should be within slop 5 to A, B or C. So my question is if there is a query that combines two BooleanQuery trees into a SpanNearQuery. It would have to take the terms that match Query 1 and the terms that match Query 2, and look if there is a combination within the required slop. Can I rewrite the BooleanQuery after parsing the query as a MultiTermQuery, then wrap these in SpanMultiTermQueryWrapper, which can be combined by the SpanNearQuery? Best regards, Michel - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Lucene-MoreLikethis
There are lots of parameters you can adjust, but the defaults essentially assume that you have a fairly large corpus and aren't interested in low-frequency terms. So, try MoreLikeThis#setMinDocFreq. The default is 5. You don't have any terms in your example with a doc freq over 2. Also, try setMinTermFreq. The default is 2. You don't have any terms with a term frequency above 1. -- Jack Krupansky -Original Message- From: Thomas Keller Sent: Tuesday, January 15, 2013 3:22 PM To: java-user@lucene.apache.org Subject: Lucene-MoreLikethis Hey, I have a question about "MoreLikeThis" in Lucene, Java. I built up an index and want to find similar documents. But I always get no results for my query, mlt.like(1) is always empty. Can anyone find my mistake? Here is an example. (I use Lucene 4.0) public class HelloLucene { public static void main(String[] args) throws IOException, ParseException { StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40); Directory index = new RAMDirectory(); IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer); IndexWriter w = new IndexWriter(index, config); addDoc(w, "Lucene in Action", "193398817"); addDoc(w, "Lucene for Dummies", "55320055Z"); addDoc(w, "Managing Gigabytes", "55063554A"); addDoc(w, "The Art of Computer Science", "9900333X"); w.close(); // search IndexReader reader = DirectoryReader.open(index); IndexSearcher searcher = new IndexSearcher(reader); MoreLikeThis mlt = new MoreLikeThis(reader); Query query = mlt.like(1); System.out.println(searcher.search(query, 5).totalHits); } private static void addDoc(IndexWriter w, String title, String isbn) throws IOException { Document doc = new Document(); doc.add(new TextField("title", title, Field.Store.YES)); // use a string field for isbn because we don't want it tokenized doc.add(new StringField("isbn", isbn, Field.Store.YES)); w.addDocument(doc); } } - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: 
java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
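Applying the suggested fixes to the example above would look roughly like this. It lowers the frequency thresholds so the 4-document toy index can produce query terms; setAnalyzer and setFieldNames are added on the assumption that the example's fields carry no term vectors, in which case MoreLikeThis needs an analyzer to re-tokenize stored field content:

```java
MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setMinTermFreq(1);  // default is 2: no term in the example occurs twice in a doc
mlt.setMinDocFreq(1);   // default is 5: no term appears in 5 of these 4 docs
mlt.setAnalyzer(analyzer);                   // needed when fields lack term vectors
mlt.setFieldNames(new String[] { "title" }); // which fields to mine for terms
Query query = mlt.like(1);
System.out.println(searcher.search(query, 5).totalHits);
```

With the defaults, every candidate term in this tiny index is filtered out, which is exactly why mlt.like(1) came back empty.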
Re: FuzzyQuery in lucene 4.0
FWIW, new FuzzyQuery(term, 2, 0) is the same as new FuzzyQuery(term), given the current values of defaultMaxEdits (2) and defaultPrefixLength (0). -- Jack Krupansky -Original Message- From: Ian Lea Sent: Wednesday, January 09, 2013 9:44 AM To: java-user@lucene.apache.org Subject: Re: FuzzyQuery in lucene 4.0 See the javadocs for FuzzyQuery to see what the parameters are. I can't tell you what the comment means. Possible values to try maybe? -- Ian. On Wed, Jan 9, 2013 at 2:34 PM, algebra wrote: It's true Ian, the code is good. The only thing that I don't understand is this line: Query query = new FuzzyQuery(term, 2, 0); //0-2 What does 0 to 2 mean? -- View this message in context: http://lucene.472066.n3.nabble.com/FuzzyQuery-in-lucene-4-0-tp4031871p4031879.html Sent from the Lucene - Java Users mailing list archive at Nabble.com. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
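In other words, assuming the current defaults Jack cites, the two constructors build the same query (Lucene 4.x; the field and term here are arbitrary):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;

public class FuzzyDefaultsSketch {
    public static void main(String[] args) {
        Term term = new Term("name", "lucene");
        FuzzyQuery implicit = new FuzzyQuery(term);        // uses the defaults
        FuzzyQuery explicit = new FuzzyQuery(term, 2, 0);  // maxEdits=2, prefixLength=0
        // the "2, 0" in the quoted code just spells out those defaults
        System.out.println(implicit.toString("name"));
        System.out.println(explicit.toString("name"));
    }
}
```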
Re: Differences in MLT Query Terms Question
The term "arv" is on the first list, but not the second. Maybe its document frequency fell below the setting for minimum document frequency on the second run. Or, maybe the minimum word length was set to 4 or more on the second run. Are you using MoreLikeThisQuery or directly using MoreLikeThis? Or, possibly "arv" appears later in a document on the second run, after the number of tokens specified by maxNumTokensParsed. -- Jack Krupansky -Original Message- From: Peter Lavin Sent: Tuesday, January 08, 2013 1:46 PM To: java-user@lucene.apache.org Subject: Differences in MLT Query Terms Question Dear Users, I am running some simple experiments with Lucene and am seeing something I don't understand. I have 16 text files on 4 different topics, ranging in size from 50-900 KB. When I index all 16 of these and run an MLT query based on one of the indexed documents, I get an expected result (i.e. similar topics are found). When I reduce the number of text files to 4 and index them (having taken care to overwrite the previous index files), and then run the same MLT query (based on the same document from the index), I get slightly different scores. I'm assuming this is because the IDF is now different because there are fewer documents. For each run, I have set the max number of terms as... mlt.setMaxQueryTerms(100) However, when I compare the terms which get used for the MLT query on the 16-document index and the 4-document index, they are slightly different. I've printed, parsed and sorted them into two columns of a CSV file. I've pasted a small part of it at the end of this email. My Question(s)... 1) Can anybody explain why the set of terms used for the MLT query is different when a file from an index of 16 documents versus 4 documents is used? 2) Am I right in assuming that the reason for slightly different scores is the IDF, or could it be this slight difference in the sets of terms used (or possibly both)? 
regards, Peter -- with best regards, Peter Lavin, PhD Candidate, CAG - Computer Architecture & Grid Research Group, Lloyd Institute, 005, Trinity College Dublin, Ireland. +353 1 8961536 "about","about" "affordable","affordable" "agents","agents" "aids","aids" "architecture","architecture" "arv","based" "based","blog" "blog","board" "board","business" "business","care" "care","commemorates" "commemorates","contacts" "contacts","contributions" "contributions","coordinating" "coordinating","core" "core","countries" "countries","country" "country","data" "data","decisions" "decisions","details" "details","disbursements" "disbursements","documents" "documents","donors" - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: TokenFilter state question
If you want the query parser to present a sequence of tokens to your analyzer as one unit, you need to enclose the sequence in quotes. Otherwise, the query parser will pass each of the terms to the analyzer as a single unit, with a reset for each, as you in fact report. So, change: String querystr = "product:(Spring Framework Core) vendor:(SpringSource)"; to String querystr = "product:\"Spring Framework Core\" vendor:(SpringSource)"; -- Jack Krupansky -Original Message- From: Jeremy Long Sent: Wednesday, December 26, 2012 5:52 PM To: java-user@lucene.apache.org Subject: Re: TokenFilter state question Actually, I had thought the same thing and had played around with the reset method. However, in my example where I called the custom analyzer the "reset" method was called after every token, not after the end of the stream (which implies each token was treated as its own TokenStream?). String querystr = "product:(Spring Framework Core) vendor:(SpringSource)"; Map<String, Analyzer> fieldAnalyzers = new HashMap<String, Analyzer>(); fieldAnalyzers.put("product", new SearchFieldAnalyzer(Version.LUCENE_40)); fieldAnalyzers.put("vendor", new SearchFieldAnalyzer(Version.LUCENE_40)); PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper( new StandardAnalyzer(Version.LUCENE_40), fieldAnalyzers); QueryParser parser = new QueryParser(Version.LUCENE_40, field1, wrapper); Query q = parser.parse(querystr); In the above example (from the code snippets in the original email), if I were to add a reset method it would be called 4 times (yes, I have tried this). Reset gets called for each token "Spring" "Framework" "Core" and "SpringSource". Thus, if I reset my internal state I would not achieve the goal of having "Spring Framework Core" result in the tokens "Spring SpringFramework Framework FrameworkCore Core". My question is - why would these be treated as separate token streams? 
--Jeremy On Wed, Dec 26, 2012 at 10:54 AM, Jack Krupansky wrote: You need a "reset" method that calls the super reset to reset the parent state and then resets your own state. http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/TokenStream.html#reset() You probably don't have one, so only the parent state gets reset. -- Jack Krupansky -Original Message- From: Jeremy Long Sent: Wednesday, December 26, 2012 9:08 AM To: java-user@lucene.apache.org Subject: TokenFilter state question Hello, I'm still trying to figure out some of the nuances of Lucene and I have run into a small issue. I have created my own custom analyzer which uses the WhitespaceTokenizer and chains together the LowercaseFilter, StopwordFilter, and my own custom filter (below). I am using this analyzer when searching (i.e. it is the analyzer used in a QueryParser). The custom analyzer's purpose is to add tokens by concatenating the previous word with the current word, so that if you were given "Spring Framework Core" the resulting tokens would be "Spring SpringFramework Framework FrameworkCore Core". My problem is that when my query text is "Spring Framework Core" I end up with left-over state in my TokenPairConcatenatingFilter (the previousWord is a member field). So if I end up re-using my query parser on a subsequent search for "Apache Struts" I end up with the token stream "CoreApache Apache ApacheStruts Struts". The initial "Core" was left-over state. The left-over state from the initial query appears to arise because the initial loop that collects tokens from the underlying stream only collects a single token. So the processing is - we collect the token "spring", we write "spring" out to the stream and move it to previousWord. Next, we are at the end of the stream and we have no more words in the list so the filter returns false. 
At this time, the filter is called again and "Framework" is collected... repeat until the end of the tokens from the query is reached; however, "Core" is left in the previousWord field. The filter would work correctly with no state left over if all of the tokens were collected at the beginning (i.e. the first call to incrementToken). Can anyone explain why all of the tokens would not be collected, and/or a workaround so that when QueryParser.parse("field:(Spring Framework Core)") is called, residual state is not left over in my token filter? I have two hack solutions - 1) don't reuse the analyzer/QueryParser for subsequent queries or 2) build in a reset mechanism to clear the previousWord field. I don't like either solution and was hoping
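Jack's reset() suggestion from earlier in the thread can be sketched as follows. The filter internals are assumed from the description above (TokenPairConcatenatingFilter and previousWord come from Jeremy's message; the incrementToken body is elided). Note that, per the follow-up reply, quoting the phrase is the real fix for the query-parser behavior - clearing state in reset() just guards against leftovers between streams:

```java
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class TokenPairConcatenatingFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private String previousWord; // per-stream state that must not leak

    public TokenPairConcatenatingFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        // ... pair-concatenation logic as described above ...
        return input.incrementToken();
    }

    @Override
    public void reset() throws IOException {
        super.reset();       // reset the wrapped stream (the parent state)
        previousWord = null; // then clear our own state for the next stream
    }
}
```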
Re: Which token filter can combine 2 terms into 1?
Ah! You're quoting full phrases. You weren't clear about that originally. Thanks for the clarification. -- Jack Krupansky -Original Message- From: Tom Sent: Wednesday, December 26, 2012 5:54 PM To: java-user@lucene.apache.org Subject: Re: Which token filter can combine 2 terms into 1? On Fri, Dec 21, 2012 at 2:44 PM, Jack Krupansky wrote: You still have the query parser's parsing before analysis to deal with, no matter what magic you code in your analyzer. Not quite. "query parser's parsing" comes first, you are correct on that. But it is irrelevant for splitting field values into search terms, because this part of the whole process is done by an analyzer. Therefore, if you make sure the correct analyzer is used, then the parsing and splitting into individual search terms will be done by this analyzer, not by the query parser. Try it: Implement an analyzer with the SnippetFilter below. Start Luke and make sure this analyzer is selected in "Analyzer to use for query parsing". In the search expression, type in any length of text, for example: body:"word1 word2 word3" and you will get the possibly combined Terms. For example, let's say one snippet in your SnippetFilter is: "word2 word3" you will get Term 0: field=body text=word1 Term 1: field=body text=word2 word3 In this case, word2 and word3 will NOT be split. -- Jack Krupansky -Original Message- From: Tom Sent: Friday, December 21, 2012 2:24 PM To: java-user@lucene.apache.org Subject: Re: Which token filter can combine 2 terms into 1? On Fri, Dec 21, 2012 at 9:16 AM, Jack Krupansky * *wrote: And to be more specific, most query parsers will have already separated the terms and will call the analyzer with only one term at a time, so no term recombination is possible for those parsed terms, at query time. Most analyzers will do that, yes. But if Xi writes his own analyzer with his own combiner filter, then he should also use this for query generation and thus get the desired combinations / snippets there as well. 
Xi, here is the recipe: - SnippetFilter extends TokenFilter - SnippetFilter needs access to your lexicon: a data structure to store your snippets. In the general case this is a tree, and going along a branch will tell you whenever a valid snippet has been built or if the snippet could be longer. (Example: "internal revenue" can be one snippet but, depending on the next token, a larger snippet of "internal revenue service" could be built.) - The logic of SnippetFilter.incrementToken() goes something like this: You need a loop which retrieves tokens from the input variable until the input is empty. You store each retrieved token in a variable x in SnippetFilter. As long as you have a potential match against your lexicon, you can continue in this loop. Once you realize that there is something within x which cannot possibly become a (longer) snippet, break out of the loop and allow the consumer to retrieve it. - Make sure your analyzer inserts SnippetFilter at the correct spot in the filter chain. Cheers FiveMileTom -- Jack Krupansky -Original Message- From: Erick Erickson Sent: Friday, December 21, 2012 8:27 AM To: java-user Subject: Re: Which token filter can combine 2 terms into 1? If it's a fixed list and not excessively long, would synonyms work? But if there's some kind of logic you need to apply, I don't think you're going to find anything OOB. The problem is that by the time a token filter gets called, they are already split up; you'll probably have to write a custom filter that manages that logic. Best Erick On Fri, Dec 21, 2012 at 4:16 AM, Xi Shen wrote: Unfortunately, no...I am not combining every two terms into one. I am combining a specific pair. E.g. 
the Token Stream: t1 t2 t2a t3 should be rewritten into t1 t2t2a t3 But the TS: t1 t2 t3 t2a should not be rewritten, and it is already correct On Fri, Dec 21, 2012 at 5:00 PM, Alan Woodward <alan.woodward@romseysoftware.co.uk> wrote: > Have a look at ShingleFilter: > http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/analysis/shingle/ShingleFilter.html > > On 21 Dec 2012, at 08:42, Xi Shen wrote: > > I have to use the white space and word delimiter to process the input first. I tried many combinations, and it seems to me that it is inevitable that the term will be split into two :( > > I think developing my own filter is the only resolution... but I just cannot find a guide to help me unders