Re: Problem of calling indexWriterConfig.clone()
I honestly don't understand what the DWPT pool has to do with IndexWriterConfig instances not being reusable for new IndexWriter instances. If you need to open a new IndexWriter with the same configuration as the one you used before, why not save the original config as a template, then simply do this for every IndexWriter instance you're creating:

private final IndexWriterConfig masterCfg = new IndexWriterConfig(Version.LUCENE_47, null); // set whatever you need on this instance
IndexWriter writer = new IndexWriter(directory, masterCfg.clone());

Wouldn't this just work? If not, could you paste the stack trace of the exception you're getting?

On Mon, Aug 11, 2014 at 9:01 PM, Sheng sheng...@gmail.com wrote: From the source code of DocumentsWriterPerThreadPool, the variable numThreadStatesActive seems to be always increasing, which explains why the assertion numThreadStatesActive == 0 fails before this object is cloned. So what is the most appropriate way of re-opening an IndexWriter if all you have are the index directory plus the IndexWriterConfig that the closed IndexWriter had been using? BTW - I am reasonably sure calling indexWriterConfig.clone() in the middle of indexing documents used to work for my code (same Lucene 4.7). It is only recently, since I had to do faceted indexing as well, that this problem started to emerge. Is it related?

On Mon, Aug 11, 2014 at 11:31 PM, Vitaly Funstein vfunst...@gmail.com wrote: I only have the source to 4.6.1, but if you look at the constructor of IndexWriter there, it looks like this:

public IndexWriter(Directory d, IndexWriterConfig conf) throws IOException {
  conf.setIndexWriter(this); // prevent reuse by other instances

The setter throws an exception if the configuration object has already been used with another instance of IndexWriter. Therefore, it should be cloned before being used in the constructor of IndexWriter.
On Mon, Aug 11, 2014 at 7:12 PM, Sheng sheng...@gmail.com wrote: So the indexWriterConfig.clone() failed at this step:

clone.indexerThreadPool = indexerThreadPool.clone();

which then failed at this step in the indexerThreadPool:

if (numThreadStatesActive != 0) {
  throw new IllegalStateException("clone this object before it is used!");
}

There is a comment right above this: // We should only be cloned before being used. Does this mean that once the indexWriter gets called for commit/prepareCommit, etc., the corresponding indexWriterConfig object cannot be called with .clone() at all?

On Mon, Aug 11, 2014 at 9:52 PM, Vitaly Funstein vfunst...@gmail.com wrote: Looks like you have to clone it prior to using it with any IndexWriter instances.

On Mon, Aug 11, 2014 at 2:49 PM, Sheng sheng...@gmail.com wrote: I tried to create a clone of the IndexWriterConfig with indexWriterConfig.clone() for re-creating a new IndexWriter, but then I got this very annoying IllegalStateException: "clone this object before it is used". Why does this exception happen, and how can I get around it? Thanks!
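For readers skimming the thread, the template-and-clone pattern Vitaly describes can be sketched as follows (a minimal sketch against the Lucene 4.7-era API, not code from the original mails; `analyzer` and `directory` are assumed to exist in scope):

```java
// Keep one config purely as a template; never pass it to a writer directly,
// since a config becomes bound to the first IndexWriter that uses it.
IndexWriterConfig template = new IndexWriterConfig(Version.LUCENE_47, analyzer);
// ... apply any custom settings to the template here ...

// Each writer gets its own clone, so the template itself stays cloneable.
IndexWriter writer = new IndexWriter(directory, template.clone());
writer.close();

// Re-opening later works the same way:
IndexWriter reopened = new IndexWriter(directory, template.clone());
```

The key point is that clone() is called on the never-used template, not on a config that has already been handed to a writer.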
BitSet in Filters
Hi, The current usage of BitSets in filters in Lucene is limited to doc IDs, i.e. I can only construct a filter out of a BitSet if I have the document IDs handy. However, with every update/delete (i.e. any CRUD modification) these will change, and I have to redo the whole process to fetch the latest doc IDs. Assume a scenario where I need to tag millions of documents with a tag like Finance, IT, Legal, etc. Unless I can cache these filters in memory, the cost of constructing this filter at run time per query is not practical. If I could map the documents to a numeric long identifier and put those in a BitMap, I could then cache them, because the size reduces drastically. However, I cannot use this numeric long identifier in Lucene filters because it is not a doc ID but another regular field. Please help with this scenario. Thanks, --- Thanks n Regards, Sandeep Ramesh Khanzode
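One approach worth considering for this scenario (a sketch, not from the original mail, using the Lucene 4.x filter API; the field and tag names are illustrative): wrap a per-tag filter in CachingWrapperFilter, which caches the matching doc IDs per index segment. Unchanged segments keep their cached bits across index reopens, so updates and deletes only force recomputation for new or merged segments, and the doc IDs never have to be tracked by hand.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.queries.TermFilter;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;

// One cached filter per tag; reuse these instances across queries so the
// per-segment bit sets are actually shared.
Filter financeFilter = new CachingWrapperFilter(
    new TermFilter(new Term("tag", "Finance")));

// Later, applied per query:
// TopDocs hits = searcher.search(query, financeFilter, 10);
```

The design choice here is to let the segment-keyed cache do the invalidation work, rather than maintaining a global BitSet of doc IDs that goes stale on every modification.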
Re: Can't get case insensitive keyword analyzer to work
Hello Milind, if you don't set the field to be tokenized, no analyzer will be used and the field's contents will be stored as-is, i.e. case sensitive. It's the analyzer's job to tokenize the input, so if you use an analyzer that does not separate the input into several tokens (like the KeywordAnalyzer), your input will remain untokenized. Regards Christoph

On 12.08.2014 at 03:38, Milind wrote: I found the problem. But it makes no sense to me. If I set the field type to be tokenized, it works. But if I set it to not be tokenized, the search fails. I.e., I have to pass true to the method: theFieldType.setTokenized(storeTokenized); I want the field to be stored un-tokenized, but it seems that I don't need to do that. The LowerCaseKeywordAnalyzer works if the field is tokenized, but not if it's un-tokenized! How can that be?

On Mon, Aug 11, 2014 at 1:49 PM, Milind mili...@gmail.com wrote: It does look like the lowercasing is working. The following code

Document theDoc = theIndexReader.document(0);
System.out.println(theDoc.get("sn"));
IndexableField theField = theDoc.getField("sn");
TokenStream theTokenStream = theField.tokenStream(theAnalyzer);
System.out.println(theTokenStream);

produces the following output:

SN345-B21
LowerCaseFilter@5f70bea5 term=sn345-b21,bytes=[73 6e 33 34 35 2d 62 32 31],startOffset=0,endOffset=9

But the search does not work. Anything obvious popping out for anyone?

On Sat, Aug 9, 2014 at 4:39 PM, Milind mili...@gmail.com wrote: I looked at a couple of examples on how to get the keyword analyzer to be case insensitive, but I think I missed something, since it's not working for me. In the code below, I'm indexing text in upper case and searching in lower case. But I get back no hits. Do I need to do something more while indexing?
private static class LowerCaseKeywordAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String theFieldName, Reader theReader) {
    KeywordTokenizer theTokenizer = new KeywordTokenizer(theReader);
    TokenStreamComponents theTokenStreamComponents = new TokenStreamComponents(
        theTokenizer, new LowerCaseFilter(Version.LUCENE_46, theTokenizer));
    return theTokenStreamComponents;
  }
}

private static void addDocument(IndexWriter theWriter, String theFieldName, String theValue, boolean storeTokenized) throws Exception {
  Document theDocument = new Document();
  FieldType theFieldType = new FieldType();
  theFieldType.setStored(true);
  theFieldType.setIndexed(true);
  theFieldType.setTokenized(storeTokenized);
  theDocument.add(new Field(theFieldName, theValue, theFieldType));
  theWriter.addDocument(theDocument);
}

static void testLowerCaseKeywordAnalyzer() throws Exception {
  Version theVersion = Version.LUCENE_46;
  Directory theIndex = new RAMDirectory();
  Analyzer theAnalyzer = new LowerCaseKeywordAnalyzer();
  IndexWriterConfig theConfig = new IndexWriterConfig(theVersion, theAnalyzer);
  IndexWriter theWriter = new IndexWriter(theIndex, theConfig);
  addDocument(theWriter, "sn", "SN345-B21", false);
  addDocument(theWriter, "sn", "SN445-B21", false);
  theWriter.close();
  QueryParser theParser = new QueryParser(theVersion, "sn", theAnalyzer);
  Query theQuery = theParser.parse("sn:sn345-b21");
  IndexReader theIndexReader = DirectoryReader.open(theIndex);
  IndexSearcher theSearcher = new IndexSearcher(theIndexReader);
  TopScoreDocCollector theCollector = TopScoreDocCollector.create(10, true);
  theSearcher.search(theQuery, theCollector);
  ScoreDoc[] theHits = theCollector.topDocs().scoreDocs;
  System.out.println("Number of results found: " + theHits.length);
}

-- Regards Milind

-- Weil Individualität der beste Standard ist Dipl.-Inf.
Christoph Kaser IconParc GmbH Sophienstraße 1 80333 München iconparc.de Tel: +49 - 89- 15 90 06 - 21 Fax: +49 - 89- 15 90 06 - 19 Geschäftsleitung: Dipl.-Ing. Roland Brückner, Dipl.-Inf. Sven Angerer. HRB 121830, Amtsgericht München
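The fix implied by Christoph's explanation can be sketched like this (a hedged sketch reusing the field setup from the thread, not the original poster's final code; the point is that the analyzer only runs on tokenized fields, and KeywordTokenizer still emits the whole value as a single token, so the field behaves like a keyword field, just lowercased):

```java
FieldType theFieldType = new FieldType();
theFieldType.setStored(true);
theFieldType.setIndexed(true);
// Must be true: with tokenized=false the LowerCaseKeywordAnalyzer is
// bypassed entirely and the raw, case-sensitive value is indexed.
theFieldType.setTokenized(true);
theDocument.add(new Field("sn", "SN345-B21", theFieldType));
```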
Re: Problem of calling indexWriterConfig.clone()
We've removed IndexWriterConfig.clone as of 4.9: https://issues.apache.org/jira/browse/LUCENE-5708 Cloning of those complex / expert classes was buggy and too hairy to get right. You just have to make a new IWC every time you make an IW. Mike McCandless http://blog.mikemccandless.com
- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
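Following Mike's advice to build a brand-new IndexWriterConfig each time instead of cloning, a small factory method keeps the settings in one place (a sketch; `newConfig` is a hypothetical helper name, and `analyzer` and `directory` are assumed to exist):

```java
// Build a fresh, never-before-used config for every writer.
private IndexWriterConfig newConfig() {
    IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_47, analyzer);
    cfg.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
    // ... apply the same custom settings here every time ...
    return cfg;
}

// Usage: every IndexWriter gets its own config, so no clone() is needed.
IndexWriter writer = new IndexWriter(directory, newConfig());
```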
Re: Can't get case insensitive keyword analyzer to work
And unfiltered. So even if you use the keyword tokenizer, which only generates a single token, you still want token filtering, such as lower case. -- Jack Krupansky -Original Message- From: Christoph Kaser Sent: Tuesday, August 12, 2014 3:07 AM To: java-user@lucene.apache.org Subject: Re: Can't get case insensitive keyword analyzer to work Hello Milind, if you don't set the field to be tokenized, no analyzer will be used and the field's contents will be stored as-is, i.e. case sensitive.
RE: escaping characters
Thanks! That worked. We recently upgraded from 2.9 to 4.9 - was true the default in 2.9?

-Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Monday, August 11, 2014 5:54 PM To: java-user@lucene.apache.org Subject: Re: escaping characters

You need to manually enable automatic generation of phrase queries - it defaults to disabled, which simply treats the sub-terms as individual terms subject to the default operator. See: http://lucene.apache.org/core/4_9_0/queryparser/org/apache/lucene/queryparser/classic/QueryParserBase.html#setAutoGeneratePhraseQueries(boolean) -- Jack Krupansky

-Original Message- From: Chris Salem Sent: Monday, August 11, 2014 1:03 PM To: java-user@lucene.apache.org Subject: RE: escaping characters

I'm not using Solr. Here's my code:

FSDirectory fsd = FSDirectory.open(new File("C:\\indexes\\Lucene4"));
IndexReader reader = DirectoryReader.open(fsd);
IndexSearcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_4_9, getStopWords());
BooleanQuery.setMaxClauseCount(10);
QueryParser qptemp = new QueryParser(Version.LUCENE_4_9, "resume_text", analyzer);
qptemp.setAllowLeadingWildcard(true);
qptemp.setDefaultOperator(QueryParser.AND_OPERATOR);
Query querytemp = qptemp.parse("resume_text: (LS\\/MS)");
System.out.println(querytemp.toString());
TopFieldCollector tfcollector = TopFieldCollector.create(new Sort(), 20, false, true, false, true);
ScoreDoc[] hits;
searcher.search(querytemp, tfcollector);
hits = tfcollector.topDocs().scoreDocs;
long resultCount = tfcollector.getTotalHits();
reader.close();

-Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Monday, August 11, 2014 12:27 PM To: java-user Subject: Re: escaping characters

Take a look at the admin/analysis page for the field in question. The next bit of critical information is adding debug=query to the URL. The former will tell you what happens to the input stream at query and index time; the latter will tell you how the query got through the query parsing process. My guess is that you have WordDelimiterFilterFactory in your analysis chain and that's breaking things up. Best, Erick

On Mon, Aug 11, 2014 at 8:54 AM, Chris Salem csa...@mainsequence.net wrote: Hi everyone, I'm trying to escape special characters and it doesn't seem to be working. If I do a search like resume_text: (LS\/MS) it searches for LS AND MS instead of LS/MS. How would I escape the slash so it searches for LS/MS? Thanks
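Jack's fix can be sketched as follows (a minimal sketch; the field name and query are taken from the thread, and `analyzer` is assumed to be the StandardAnalyzer instance from the code above):

```java
QueryParser parser = new QueryParser(Version.LUCENE_4_9, "resume_text", analyzer);
// Restore the pre-3.1 behavior: when a single input word analyzes into
// multiple terms ("LS/MS" -> "ls", "ms" under StandardAnalyzer), they
// become a phrase query "ls ms" instead of ls AND ms.
parser.setAutoGeneratePhraseQueries(true);
Query query = parser.parse("resume_text:(LS\\/MS)");
```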
Problem of calling indexWriterConfig.clone()
I think what you suggest probably will work, and I appreciate that. What I am a little concerned about is whether IndexWriterConfig is completely stateless or not - meaning, if I clone from the very original IndexWriterConfig, will I lose some info from the breakpoint? Maybe I don't need to worry about it, since it is going to be removed in 4.9?
Re: escaping characters
The default changed to false in Lucene 3.1. Before that it was true. -- Jack Krupansky -Original Message- From: Chris Salem Sent: Tuesday, August 12, 2014 8:34 AM To: java-user@lucene.apache.org Subject: RE: escaping characters Thanks! That worked. We recently upgraded from 2.9 to 4.9, was true the default in 2.9?
RE: escaping characters
See Javadocs of QueryParser: NOTE: You must specify the required Version compatibility when creating QueryParser: - As of 3.1, QueryParserBase.setAutoGeneratePhraseQueries(boolean) is false by default. Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Chris Salem [mailto:csa...@mainsequence.net] Sent: Tuesday, August 12, 2014 2:34 PM To: java-user@lucene.apache.org Subject: RE: escaping characters Thanks! That worked. We recently upgraded from 2.9 to 4.9, was true the default in 2.9?
Re: Problem of calling indexWriterConfig.clone()
IWC.clone is/was buggy ... just stop calling it and make a new IWC from scratch each time in your application. Mike McCandless http://blog.mikemccandless.com On Tue, Aug 12, 2014 at 8:37 AM, Sheng sheng...@gmail.com wrote: I think what you suggest probably will work, and I appreciate that. What I am a little concerned about is if Indexwriterconfig is completely stateless or not, meaning if i clone from the very original Indexwriterconfig, will I lose some info from the breakpoint? Maybe I don't need worry about it, since it is going to be removed in 4.9? On Tue, Aug 12, 2014 at 2:29 AM, Vitaly Funstein vfunst...@gmail.com javascript:_e(%7B%7D,'cvml','vfunst...@gmail.com'); wrote: I honestly don't understand what DWPT pool has to do with IndexWriterConfig instances not being reusable for new IndexWriter instances. If you have the need to open a new IndexWriter with the same configuration as the one you used before, why not save the original config as the template, then simply do this for every IndexWriter instance you're creating: private final IndexWriterConfig masterCfg = new IndexWriterConfig(Version.LUCENE_47, null); // set whatever you need on this instance . IndexWriter writer = new IndexWriter(directory, masterCfg.clone()); Wouldn't this just work? If not, could you paste the stack trace of the exception you're getting? On Mon, Aug 11, 2014 at 9:01 PM, Sheng sheng...@gmail.com javascript:_e(%7B%7D,'cvml','sheng...@gmail.com'); wrote: From src code of DocumentsWriterPerThreadPool, the variable numThreadStatesActive seems to be always increasing, which explains why asserting on numThreadStatesActive == 0 before cloning this object fails. So what should be the most appropriate way of re-opening an indexwriter if what you have are the index directory plus the indexWriterConfig that the closed indexWriter has been using? BTW - I am reasonably sure calling indexWriterConfig.clone() in the middle of indexing documents used to work for my code(same Lucene 4.7). 
It is only since I recently added faceted indexing that this problem started to emerge. Is it related?

On Mon, Aug 11, 2014 at 11:31 PM, Vitaly Funstein vfunst...@gmail.com wrote: I only have the source to 4.6.1, but if you look at the constructor of IndexWriter there, it looks like this:

    public IndexWriter(Directory d, IndexWriterConfig conf) throws IOException {
      conf.setIndexWriter(this); // prevent reuse by other instances

The setter throws an exception if the configuration object has already been used with another instance of IndexWriter. Therefore, it should be cloned before being used in the constructor of IndexWriter.

On Mon, Aug 11, 2014 at 7:12 PM, Sheng sheng...@gmail.com wrote: So indexWriterConfig.clone() failed at this step (LiveIndexWriterConfig, Lucene 4.7.0):

    clone.indexerThreadPool = indexerThreadPool.clone();

which then failed at this step inside the indexerThreadPool (DocumentsWriterPerThreadPool):

    if (numThreadStatesActive != 0) {
      throw new IllegalStateException("clone this object before it is used!");
    }

There is a comment right above this: // We should only be cloned before being used. Does this mean that once the IndexWriter gets called for commit/prepareCommit, etc., the corresponding IndexWriterConfig object cannot be called with .clone() at all?

On Mon, Aug 11, 2014 at 9:52 PM, Vitaly Funstein vfunst...@gmail.com wrote: Looks like you have to clone it prior to using it with any IndexWriter instances.

On Mon, Aug 11, 2014 at 2:49 PM, Sheng sheng...@gmail.com wrote: I tried to create a clone of the IndexWriterConfig with indexWriterConfig.clone() for re-creating a new IndexWriter, but then I got this very annoying IllegalStateException: "clone this object before it is used". Why does this exception happen, and how can I get around it? Thanks!
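The clone-before-use contract discussed above can be illustrated without Lucene. The class below is a simplified stand-in (NOT the real IndexWriterConfig), written only to show the failure mode: once a config has been bound to a writer, it can no longer be cloned, so the fix is to keep a pristine template that is never bound directly and clone it for each writer.

```java
// Simplified stand-in for IndexWriterConfig's "clone before use" rule.
// All names here are hypothetical; only the contract mirrors Lucene's.
class OneShotConfig implements Cloneable {
    private boolean inUse = false;

    // Mimics IndexWriter's constructor calling conf.setIndexWriter(this).
    void bindToWriter() {
        if (inUse) {
            throw new IllegalStateException("config already used by another writer");
        }
        inUse = true;
    }

    @Override
    public OneShotConfig clone() {
        if (inUse) {
            // Mirrors the check in DocumentsWriterPerThreadPool.clone().
            throw new IllegalStateException("clone this object before it is used!");
        }
        try {
            OneShotConfig copy = (OneShotConfig) super.clone();
            copy.inUse = false;
            return copy;
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e);
        }
    }

    public static void main(String[] args) {
        OneShotConfig template = new OneShotConfig(); // never bound directly
        OneShotConfig first = template.clone();
        first.bindToWriter();                         // writer #1
        OneShotConfig second = template.clone();      // still fine: template unused
        second.bindToWriter();                        // writer #2
        try {
            first.clone();                            // cloning a used config fails
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

This is exactly why cloning mid-indexing fails in the thread above: the config handed to the live writer is already "used", while a never-bound template clones freely.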
Re: BitSet in Filters
bq: Unless I can cache these filters in memory, the cost of constructing this filter at run time per query is not practical

Why do you say that? Do you have evidence? Because lots and lots of Solr installations do exactly this and they run fine. So I suspect there's something you're not telling us about your setup. Are you, say, soft committing often? Do you have autowarming specified?

You're not going to be able to key your filters off some other field in the document. Internally, Lucene uses the internal doc ID as an index into the bitset. That's baked in at very low levels and isn't going to change, AFAIK. Best, Erick

On Mon, Aug 11, 2014 at 11:53 PM, Sandeep Khanzode sandeep_khanz...@yahoo.com.invalid wrote: Hi, The current usage of BitSets in filters in Lucene is limited to applying them only on doc IDs, i.e. I can only construct a filter out of a BitSet if I have the document IDs handy. However, with every update/delete, i.e. any CRUD modification, these will change, and I have to redo the whole process to fetch the latest doc IDs. Assume a scenario where I need to tag millions of documents with a tag like Finance, IT, Legal, etc. Unless I can cache these filters in memory, the cost of constructing this filter at run time per query is not practical. If I could map the documents to a numeric long identifier and put them in a BitMap, I could then cache them because the size reduces drastically. However, I cannot use this numeric long identifier in Lucene filters because it is not a docID but another regular field. Please help with this scenario. Thanks, --- Thanks n Regards, Sandeep Ramesh Khanzode
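Erick's point that "the internal doc ID is the index into the bitset" can be sketched with plain java.util.BitSet, standing in for Lucene's FixedBitSet. The names and the toy index here are hypothetical; the sketch only shows why an external numeric identifier must first be translated into doc IDs before it can become a filter.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Why Lucene filter bitsets are keyed by internal doc ID: the bit index
// IS the doc ID. An external long identifier (a regular field) needs an
// extra mapping step, and that mapping changes whenever docs are
// updated/merged, which is why such bitsets must be rebuilt per segment.
class DocIdBitSetSketch {
    public static void main(String[] args) {
        // Toy index: internal doc ID -> external long identifier.
        long[] externalIdByDocId = {1001L, 1002L, 1003L, 1004L};

        // Inverse mapping, needed to translate tagged ids into doc IDs.
        Map<Long, Integer> docIdByExternalId = new HashMap<>();
        for (int docId = 0; docId < externalIdByDocId.length; docId++) {
            docIdByExternalId.put(externalIdByDocId[docId], docId);
        }

        // Suppose the "Finance" tag covers external ids 1001 and 1003.
        long[] financeIds = {1001L, 1003L};
        BitSet financeFilter = new BitSet(externalIdByDocId.length);
        for (long id : financeIds) {
            financeFilter.set(docIdByExternalId.get(id)); // bit index = doc ID
        }

        System.out.println(financeFilter); // {0, 2}
    }
}
```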
Re: Can't get case insensitive keyword analyzer to work
Thanks Christoph, So it seems that "tokenized" has been conflated with "analyzed". I just looked at the Javadocs and that's what it mentions. I had read it earlier, but it hadn't registered. I wonder why it's not called setAnalyzed. Thanks again.

On Tue, Aug 12, 2014 at 3:07 AM, Christoph Kaser christoph.ka...@iconparc.de wrote: Hello Milind, if you don't set the field to be tokenized, no analyzer will be used and the field's contents will be stored as-is, i.e. case sensitive. It's the analyzer's job to tokenize the input, so if you use an analyzer that does not separate the input into several tokens (like the KeywordAnalyzer), your input will remain untokenized. Regards Christoph

On 12.08.2014 03:38, Milind wrote: I found the problem. But it makes no sense to me. If I set the field type to be tokenized, it works. But if I set it to not be tokenized, the search fails. I.e. I have to pass in true to the method:

    theFieldType.setTokenized(storeTokenized);

I want the field to be stored un-tokenized. But it seems that I don't need to do that. The LowerCaseKeywordAnalyzer works if the field is tokenized, but not if it's un-tokenized! How can that be?

On Mon, Aug 11, 2014 at 1:49 PM, Milind mili...@gmail.com wrote: It does look like the lowercasing is working. The following code

    Document theDoc = theIndexReader.document(0);
    System.out.println(theDoc.get("sn"));
    IndexableField theField = theDoc.getField("sn");
    TokenStream theTokenStream = theField.tokenStream(theAnalyzer);
    System.out.println(theTokenStream);

produces the following output:

    SN345-B21
    LowerCaseFilter@5f70bea5 term=sn345-b21,bytes=[73 6e 33 34 35 2d 62 32 31],startOffset=0,endOffset=9

But the search does not work. Anything obvious popping out for anyone?

On Sat, Aug 9, 2014 at 4:39 PM, Milind mili...@gmail.com wrote: I looked at a couple of examples of how to get the keyword analyzer to be case insensitive, but I think I missed something since it's not working for me.
In the code below, I'm indexing text in upper case and searching in lower case. But I get back no hits. Do I need to do something more while indexing?

    private static class LowerCaseKeywordAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String theFieldName, Reader theReader) {
            KeywordTokenizer theTokenizer = new KeywordTokenizer(theReader);
            TokenStreamComponents theTokenStreamComponents =
                new TokenStreamComponents(
                    theTokenizer,
                    new LowerCaseFilter(Version.LUCENE_46, theTokenizer));
            return theTokenStreamComponents;
        }
    }

    private static void addDocment(IndexWriter theWriter, String theFieldName,
                                   String theValue, boolean storeTokenized) throws Exception {
        Document theDocument = new Document();
        FieldType theFieldType = new FieldType();
        theFieldType.setStored(true);
        theFieldType.setIndexed(true);
        theFieldType.setTokenized(storeTokenized);
        theDocument.add(new Field(theFieldName, theValue, theFieldType));
        theWriter.addDocument(theDocument);
    }

    static void testLowerCaseKeywordAnalyzer() throws Exception {
        Version theVersion = Version.LUCENE_46;
        Directory theIndex = new RAMDirectory();
        Analyzer theAnalyzer = new LowerCaseKeywordAnalyzer();
        IndexWriterConfig theConfig = new IndexWriterConfig(theVersion, theAnalyzer);
        IndexWriter theWriter = new IndexWriter(theIndex, theConfig);
        addDocment(theWriter, "sn", "SN345-B21", false);
        addDocment(theWriter, "sn", "SN445-B21", false);
        theWriter.close();
        QueryParser theParser = new QueryParser(theVersion, "sn", theAnalyzer);
        Query theQuery = theParser.parse("sn:sn345-b21");
        IndexReader theIndexReader = DirectoryReader.open(theIndex);
        IndexSearcher theSearcher = new IndexSearcher(theIndexReader);
        TopScoreDocCollector theCollector = TopScoreDocCollector.create(10, true);
        theSearcher.search(theQuery, theCollector);
        ScoreDoc[] theHits = theCollector.topDocs().scoreDocs;
        System.out.println("Number of results found: " + theHits.length);
    }

-- Regards Milind

-- Weil Individualität der beste Standard ist (Because individuality is the best standard)
Dipl.-Inf. Christoph Kaser IconParc GmbH Sophienstraße 1 80333 München iconparc.de Tel: +49 - 89 - 15 90 06 - 21 Fax: +49 - 89 - 15 90 06 - 19 Management (Geschäftsleitung): Dipl.-Ing. Roland Brückner, Dipl.-Inf. Sven Angerer.
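Christoph's explanation can be checked without Lucene: the analyzer only runs when the field is tokenized, so with tokenized=false the raw upper-case value is what gets indexed, while the query parser still lowercases the query term. The toy "index" below is a plain map standing in for the inverted index; all names are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Toy model (NOT Lucene) of why the un-tokenized search above fails:
// tokenized=false skips the analyzer at index time, so the index holds
// "SN345-B21" while the query side looks up "sn345-b21".
class TokenizedVsNot {
    // Stand-in for what LowerCaseKeywordAnalyzer produces: one lowercased token.
    static String analyze(String value) {
        return value.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Map<String, Integer> index = new HashMap<>(); // term -> doc id

        boolean tokenized = false;                    // as in the failing code
        String indexedTerm = tokenized ? analyze("SN345-B21") : "SN345-B21";
        index.put(indexedTerm, 0);

        String queryTerm = analyze("SN345-B21");      // query side IS analyzed
        System.out.println(index.containsKey(queryTerm)); // false
    }
}
```

Flip `tokenized` to true and both sides agree on "sn345-b21", which matches Milind's observation that the search works only for the tokenized field.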
Re: BitSet in Filters
Hi Erick, I have mentioned everything that is relevant, I believe :). However, just to give more background: assume documents on the order of more than 300 million, and multiple concurrent users running searches. I may front Lucene with ElasticSearch, and ES basically calls Lucene TermFilters. My filters are broad in nature, so you can take it that any time I filter on a tag, it would easily run into millions of documents to be accepted by the filter. The only filter that uses a BitSet works with document IDs in Lucene. I would have wanted this bitset approach to work on some other regular numeric long field so that we can scale, which does not seem likely if I have to use an ArrayList of Longs for TermFilters. Hope that makes the scenario clearer. Please let me know your thoughts. --- Thanks n Regards, Sandeep Ramesh Khanzode

On Tuesday, August 12, 2014 8:41 PM, Erick Erickson erickerick...@gmail.com wrote: [quoted reply trimmed]
RE: BitSet in Filters
Hi, in general you cannot cache Filters; you can cache their DocIdSets (CachingWrapperFilter, for example, does this). Lucene queries are executed per segment: when you index new documents or update documents, Lucene creates new index segments. Older ones *never* change, so a DocIdSet (e.g. implemented by FixedBitSet) can be linked to a specific segment of the index that never changes - only deletions may be added, but that's transparent to the filter - the deletions (given in acceptDocs to getDocIdSet) and the cached BitSet just need to be ANDed together (btw, deletions in Lucene are just a Filter, too). Of course, after a while Lucene merges segments using its MergePolicy, because otherwise there would be too many of them. In that case several smaller segments (preferably those with many deletions) get merged into larger ones by the indexer. This is the only case when some *new* DocIdSets need to be created. Large segments are unlikely to be merged unless they have many deletions (caused by updates or deletions). This approach is used by Solr and Elasticsearch - CachingWrapperFilter is an example of how to do this in your own code.

To implement this:
- Don't cache a bitset for the whole index; this would indeed require you to recalculate the bitsets over and over.
- In YourFilter.getDocIdSet(), check whether the coreCacheKey of the given AtomicReaderContext.reader() is in your cache. If yes, reuse the cached DocIdSet (deletions are not relevant; you just apply them via BitsFilteredDocIdSet.wrap(cachedDocIdSet, acceptDocs)). If it's not in the cache, recalculate the bitset for the given AtomicReaderContext (not the whole index), cache it, and return it as a DocIdSet instance.
Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de

-----Original Message----- From: Sandeep Khanzode [mailto:sandeep_khanz...@yahoo.com.INVALID] Sent: Tuesday, August 12, 2014 8:53 AM To: Lucene Users Subject: BitSet in Filters [quoted message trimmed]

- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
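Uwe's per-segment caching recipe can be sketched without Lucene's types. The class below is a library-free stand-in: the cache is keyed on a per-segment identity key (what Lucene exposes as the reader's coreCacheKey), the bitset is computed once per segment, and reused until the segment is merged away. All names here are hypothetical simplifications of the Filter/getDocIdSet machinery.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch of per-segment DocIdSet caching (stand-in types, not Lucene's):
// segments are immutable, so a bitset computed for one segment stays
// valid for that segment's lifetime and only new/merged segments require
// a fresh computation.
class PerSegmentCache {
    // coreCacheKey stand-in -> cached bitset for that segment
    private final Map<Object, BitSet> cache = new ConcurrentHashMap<>();

    BitSet getDocIdSet(Object coreCacheKey, Function<Object, BitSet> compute) {
        // computeIfAbsent: recalculate only for segments not seen before
        return cache.computeIfAbsent(coreCacheKey, compute);
    }

    public static void main(String[] args) {
        PerSegmentCache cache = new PerSegmentCache();
        Object segmentKey = new Object(); // stands in for reader.getCoreCacheKey()

        int[] calls = {0};
        Function<Object, BitSet> expensive = k -> {
            calls[0]++;                   // expensive term scan happens here
            BitSet b = new BitSet();
            b.set(3);                     // pretend doc 3 in this segment matches the tag
            return b;
        };

        BitSet first = cache.getDocIdSet(segmentKey, expensive);
        BitSet second = cache.getDocIdSet(segmentKey, expensive); // cache hit
        System.out.println(first == second); // true: same cached instance
        System.out.println(calls[0]);        // 1: computed only once
    }
}
```

In real Lucene code the compute step would walk the segment's terms to fill a FixedBitSet, and deletions would be applied afterwards per Uwe's BitsFilteredDocIdSet.wrap note rather than baked into the cached set.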
Questions for facets search
I actually have 2 questions:

1. Is it possible to get the facet label for a particular document? The reason we want this is that we'd like to allow users to see tags for each hit, in addition to the taxonomy for their search.

2. Is it possible to re-index the facets without reindexing the whole Lucene index, since they are separate? We have a dynamic list of faceted fields, so being able to quickly rebuild the whole facet index would be quite desirable.

Again, I am using Lucene 4.7. Thanks in advance for your answers! Sheng
Re: Questions for facets search
For the 1st: at the Solr level, I guess you could select (only) the document by its unique id. Then you have the facets for that particular document. But this results in one additional query per document.

Sent from my BlackBerry 10 smartphone.

Original message - From: Sheng Sent: Tuesday, August 12, 2014 23:35 To: java-user@lucene.apache.org Reply-to: java-user@lucene.apache.org Subject: Questions for facets search [quoted message trimmed]