Re: Problem of calling indexWriterConfig.clone()

2014-08-12 Thread Vitaly Funstein
I honestly don't understand what DWPT pool has to do with IndexWriterConfig
instances not being reusable for new IndexWriter instances. If you have the
need to open a new IndexWriter with the same configuration as the one you
used before, why not save the original config as the template, then
simply do this for every IndexWriter instance you're creating:

private final IndexWriterConfig masterCfg = new
IndexWriterConfig(Version.LUCENE_47, null);
// set whatever you need on this instance
.

IndexWriter writer = new IndexWriter(directory, masterCfg.clone());

Wouldn't this just work? If not, could you paste the stack trace of the
exception you're getting?


On Mon, Aug 11, 2014 at 9:01 PM, Sheng sheng...@gmail.com wrote:

 From src code of DocumentsWriterPerThreadPool, the variable
 numThreadStatesActive seems to be always increasing, which explains why
 asserting on numThreadStatesActive == 0 before cloning this object
 fails. So what should be the most appropriate way of re-opening an
 indexwriter if what you have are the index directory plus the
 indexWriterConfig that the closed indexWriter has been using?

 BTW - I am reasonably sure calling indexWriterConfig.clone() in the middle
 of indexing documents used to work for my code (same Lucene 4.7). It is
 only since I recently had to do faceted indexing as well that this problem
 started to emerge. Is it related?


 On Mon, Aug 11, 2014 at 11:31 PM, Vitaly Funstein vfunst...@gmail.com
 wrote:

  I only have the source to 4.6.1, but if you look at the constructor of
  IndexWriter there, it looks like this:
 
public IndexWriter(Directory d, IndexWriterConfig conf) throws
  IOException {
  conf.setIndexWriter(this); // prevent reuse by other instances
 
  The setter throws an exception if the configuration object has already
 been
  used with another instance of IndexWriter. Therefore, it should be cloned
  before being used in the constructor of IndexWriter.
 
 
  On Mon, Aug 11, 2014 at 7:12 PM, Sheng sheng...@gmail.com wrote:
 
    So the indexWriterConfig.clone() failed at this step:

      clone.indexerThreadPool = indexerThreadPool.clone();

    (LiveIndexWriterConfig.indexerThreadPool:
    http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-core/4.7.0/org/apache/lucene/index/LiveIndexWriterConfig.java#LiveIndexWriterConfig.0indexerThreadPool
    DocumentsWriterPerThreadPool.clone():
    http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-core/4.7.0/org/apache/lucene/index/DocumentsWriterPerThreadPool.java#DocumentsWriterPerThreadPool.clone%28%29)

    which then failed at this step in the indexerThreadPool:

      if (numThreadStatesActive != 0) {
        throw new IllegalStateException("clone this object before it is used!");
      }

    (DocumentsWriterPerThreadPool.numThreadStatesActive:
    http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-core/4.7.0/org/apache/lucene/index/DocumentsWriterPerThreadPool.java#DocumentsWriterPerThreadPool.0numThreadStatesActive)
  
  
   There is a comment right above this:
   // We should only be cloned before being used:
  
   Does this mean whenever the indexWriter gets called for
   commit/prepareCommit, etc., the corresponding indexWriterConfig object
   cannot be called with .clone() at all?
  
  
   On Mon, Aug 11, 2014 at 9:52 PM, Vitaly Funstein vfunst...@gmail.com
   wrote:
  
Looks like you have to clone it prior to using with any IndexWriter
instances.
   
   
On Mon, Aug 11, 2014 at 2:49 PM, Sheng sheng...@gmail.com wrote:
   
  I tried to create a clone of IndexWriterConfig with
  indexWriterConfig.clone() for re-creating a new IndexWriter, but then I
  got this very annoying IllegalStateException: "clone this object before
  it is used". Why does this exception happen, and how can I get around it?
  Thanks!

   
  
 



BitSet in Filters

2014-08-12 Thread Sandeep Khanzode
Hi,
 
The current usage of BitSets in filters in Lucene is limited to applying only
on docIDs, i.e. I can only construct a filter out of a BitSet if I have the
document IDs handy.

However, with every update/delete, i.e. CRUD modification, these will change,
and I have to redo the whole process again to fetch the latest docIDs.

Assume a scenario where I need to tag millions of documents with a tag like
"Finance", "IT", "Legal", etc.

Unless I can cache these filters in memory, the cost of constructing this
filter at run time per query is not practical. If I could map the documents to
a numeric long identifier and put them in a BitMap, I could then cache them
because the size reduces drastically. However, I cannot use this numeric long
identifier in Lucene filters because it is not a docID but another regular
field.

Please help with this scenario. Thanks,

---
Thanks n Regards,
Sandeep Ramesh Khanzode

Re: Can't get case insensitive keyword analyzer to work

2014-08-12 Thread Christoph Kaser

Hello Milind,

if you don't set the field to be tokenized, no analyzer will be used and 
the field's contents will be stored as-is, i.e. case sensitive.
It's the analyzer's job to tokenize the input, so if you use an analyzer 
that does not separate the input into several tokens (like the 
KeywordAnalyzer), your input will remain untokenized.
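
A minimal sketch of the field setup this boils down to (the helper name
addSerialNumber is an illustrative assumption; the FieldType flags follow
Milind's code quoted below) - the analyzer, and with it the LowerCaseFilter,
only runs over fields whose FieldType is tokenized:

    private static void addSerialNumber(IndexWriter theWriter, String theValue)
    throws Exception
    {
        FieldType theFieldType = new FieldType();
        theFieldType.setStored(true);     // keep the original mixed-case value for display
        theFieldType.setIndexed(true);
        theFieldType.setTokenized(true);  // required: with false, the analyzer never runs
        Document theDocument = new Document();
        theDocument.add(new Field("sn", theValue, theFieldType));
        theWriter.addDocument(theDocument); // writer opened with LowerCaseKeywordAnalyzer
    }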


Regards
Christoph

On 12.08.2014 at 03:38, Milind wrote:

I found the problem.  But it makes no sense to me.

If I set the field type to be tokenized, it works.  But if I set it to not
be tokenized the search fails.  i.e. I have to pass in true to the method.
 theFieldType.setTokenized(storeTokenized);

I want the field to be stored as un-tokenized.  But it seems that I don't
need to do that.  The LowerCaseKeywordAnalyzer works if the field is
tokenized, but not if it's un-tokenized!

How can that be?


On Mon, Aug 11, 2014 at 1:49 PM, Milind mili...@gmail.com wrote:


It does look like the lowercase is working.

The following code

 Document theDoc = theIndexReader.document(0);
 System.out.println(theDoc.get("sn"));
 IndexableField theField = theDoc.getField("sn");
 TokenStream theTokenStream = theField.tokenStream(theAnalyzer);
 System.out.println(theTokenStream);

produces the following output
 SN345-B21
 LowerCaseFilter@5f70bea5 term=sn345-b21,bytes=[73 6e 33 34 35 2d 62
32 31],startOffset=0,endOffset=9

But the search does not work.  Anything obvious popping out for anyone?


On Sat, Aug 9, 2014 at 4:39 PM, Milind mili...@gmail.com wrote:


I looked at a couple of examples on how to get keyword analyzer to be
case insensitive but I think I missed something since it's not working for
me.

In the code below, I'm indexing text in upper case and searching in lower
case.  But I get back no hits.  Do I need to do something more while
indexing?

 private static class LowerCaseKeywordAnalyzer extends Analyzer
 {
 @Override
 protected TokenStreamComponents createComponents(String
theFieldName, Reader theReader)
 {
 KeywordTokenizer theTokenizer = new
KeywordTokenizer(theReader);
 TokenStreamComponents theTokenStreamComponents =
 new TokenStreamComponents(
 theTokenizer,
 new LowerCaseFilter(Version.LUCENE_46,
theTokenizer));
 return theTokenStreamComponents;
 }
 }

 private static void addDocment(IndexWriter theWriter,
   String theFieldName,
   String theValue,
   boolean storeTokenized)
 throws Exception
 {
   Document theDocument = new Document();
   FieldType theFieldType = new FieldType();
   theFieldType.setStored(true);
   theFieldType.setIndexed(true);
   theFieldType.setTokenized(storeTokenized);
   theDocument.add(new Field(theFieldName, theValue,
theFieldType));
   theWriter.addDocument(theDocument);
 }


 static void testLowerCaseKeywordAnalyzer()
 throws Exception
 {
 Version theVersion = Version.LUCENE_46;
 Directory theIndex = new RAMDirectory();

 Analyzer theAnalyzer = new LowerCaseKeywordAnalyzer();

 IndexWriterConfig theConfig = new IndexWriterConfig(theVersion,
 theAnalyzer);
 IndexWriter theWriter = new IndexWriter(theIndex, theConfig);
 addDocment(theWriter, "sn", "SN345-B21", false);
 addDocment(theWriter, "sn", "SN445-B21", false);
 theWriter.close();

 QueryParser theParser = new QueryParser(theVersion, "sn",
theAnalyzer);
 Query theQuery = theParser.parse("sn:sn345-b21");
 IndexReader theIndexReader = DirectoryReader.open(theIndex);
 IndexSearcher theSearcher = new IndexSearcher(theIndexReader);
 TopScoreDocCollector theCollector =
TopScoreDocCollector.create(10, true);
 theSearcher.search(theQuery, theCollector);
 ScoreDoc[] theHits = theCollector.topDocs().scoreDocs;
 System.out.println("Number of results found: " + theHits.length);
 }

--
Regards
Milind


--
Regards
Milind






--


Weil Individualität der beste Standard ist (Because individuality is the best standard)

Dipl.-Inf. Christoph Kaser

IconParc GmbH
Sophienstraße 1
80333 München

iconparc.de

Tel: +49 - 89- 15 90 06 - 21
Fax: +49 - 89- 15 90 06 - 19

Geschäftsleitung: Dipl.-Ing. Roland Brückner, Dipl.-Inf. Sven Angerer. 
HRB 121830, Amtsgericht München




Re: Problem of calling indexWriterConfig.clone()

2014-08-12 Thread Michael McCandless
We've removed IndexWriterConfig.clone as of 4.9:

https://issues.apache.org/jira/browse/LUCENE-5708

Cloning of those complex / expert classes was buggy and too hairy to get right.

You just have to make a new IWC every time you make an IW.
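
A minimal sketch of that (the openWriter helper and the openMode setting are
illustrative assumptions, not prescribed usage):

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.Version;

// Build a fresh IndexWriterConfig for every IndexWriter instead of cloning one.
static IndexWriter openWriter(Directory dir, Analyzer analyzer) throws IOException {
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_47, analyzer);
    // re-apply here whatever settings used to live on the shared "template" config
    iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
    return new IndexWriter(dir, iwc);
}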

Mike McCandless

http://blog.mikemccandless.com


On Tue, Aug 12, 2014 at 2:29 AM, Vitaly Funstein vfunst...@gmail.com wrote:
 I honestly don't understand what DWPT pool has to do with IndexWriterConfig
 instances not being reusable for new IndexWriter instances. If you have the
 need to open a new IndexWriter with the same configuration as the one you
 used before, why not save the original config as the template, then
 simply do this for every IndexWriter instance you're creating:

 private final IndexWriterConfig masterCfg = new
 IndexWriterConfig(Version.LUCENE_47, null);
 // set whatever you need on this instance
 .

 IndexWriter writer = new IndexWriter(directory, masterCfg.clone());

 Wouldn't this just work? If not, could you paste the stack trace of the
 exception you're getting?


 On Mon, Aug 11, 2014 at 9:01 PM, Sheng sheng...@gmail.com wrote:

 From src code of DocumentsWriterPerThreadPool, the variable
 numThreadStatesActive seems to be always increasing, which explains why
 asserting on numThreadStatesActive == 0 before cloning this object
 fails. So what should be the most appropriate way of re-opening an
 indexwriter if what you have are the index directory plus the
 indexWriterConfig that the closed indexWriter has been using?

 BTW - I am reasonably sure calling indexWriterConfig.clone() in the middle
 of indexing documents used to work for my code (same Lucene 4.7). It is
 only since I recently had to do faceted indexing as well that this problem
 started to emerge. Is it related?


 On Mon, Aug 11, 2014 at 11:31 PM, Vitaly Funstein vfunst...@gmail.com
 wrote:

  I only have the source to 4.6.1, but if you look at the constructor of
  IndexWriter there, it looks like this:
 
public IndexWriter(Directory d, IndexWriterConfig conf) throws
  IOException {
  conf.setIndexWriter(this); // prevent reuse by other instances
 
  The setter throws an exception if the configuration object has already
 been
  used with another instance of IndexWriter. Therefore, it should be cloned
  before being used in the constructor of IndexWriter.
 
 
  On Mon, Aug 11, 2014 at 7:12 PM, Sheng sheng...@gmail.com wrote:
 
    So the indexWriterConfig.clone() failed at this step:

      clone.indexerThreadPool = indexerThreadPool.clone();

    (LiveIndexWriterConfig.indexerThreadPool:
    http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-core/4.7.0/org/apache/lucene/index/LiveIndexWriterConfig.java#LiveIndexWriterConfig.0indexerThreadPool
    DocumentsWriterPerThreadPool.clone():
    http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-core/4.7.0/org/apache/lucene/index/DocumentsWriterPerThreadPool.java#DocumentsWriterPerThreadPool.clone%28%29)

    which then failed at this step in the indexerThreadPool:

      if (numThreadStatesActive != 0) {
        throw new IllegalStateException("clone this object before it is used!");
      }

    (DocumentsWriterPerThreadPool.numThreadStatesActive:
    http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-core/4.7.0/org/apache/lucene/index/DocumentsWriterPerThreadPool.java#DocumentsWriterPerThreadPool.0numThreadStatesActive)
  
  
   There is a comment right above this:
   // We should only be cloned before being used:
  
   Does this mean whenever the indexWriter gets called for
   commit/prepareCommit, etc., the corresponding indexWriterConfig object
   cannot be called with .clone() at all?
  
  
   On Mon, Aug 11, 2014 at 9:52 PM, Vitaly Funstein vfunst...@gmail.com
   wrote:
  
Looks like you have to clone it prior to using with any IndexWriter
instances.
   
   
On Mon, Aug 11, 2014 at 2:49 PM, Sheng sheng...@gmail.com wrote:
   
  I tried to create a clone of IndexWriterConfig with
  indexWriterConfig.clone() for re-creating a new IndexWriter, but then I
  got this very annoying IllegalStateException: "clone this object before
  it is used". Why does this exception happen, and how can I get around it?
  Thanks!

   
  
 





Re: Can't get case insensitive keyword analyzer to work

2014-08-12 Thread Jack Krupansky
And unfiltered. So even if you use the keyword tokenizer that only generates 
a single token, you still want token filtering, such as lower case.


-- Jack Krupansky

-Original Message- 
From: Christoph Kaser

Sent: Tuesday, August 12, 2014 3:07 AM
To: java-user@lucene.apache.org
Subject: Re: Can't get case insensitive keyword analyzer to work

Hello Milind,

if you don't set the field to be tokenized, no analyzer will be used and
the field's contents will be stored as-is, i.e. case sensitive.
It's the analyzer's job to tokenize the input, so if you use an analyzer
that does not separate the input into several tokens (like the
KeywordAnalyzer), your input will remain untokenized.

Regards
Christoph

On 12.08.2014 at 03:38, Milind wrote:

I found the problem.  But it makes no sense to me.

If I set the field type to be tokenized, it works.  But if I set it to not
be tokenized the search fails.  i.e. I have to pass in true to the method.
 theFieldType.setTokenized(storeTokenized);

I want the field to be stored as un-tokenized.  But it seems that I don't
need to do that.  The LowerCaseKeywordAnalyzer works if the field is
tokenized, but not if it's un-tokenized!

How can that be?


On Mon, Aug 11, 2014 at 1:49 PM, Milind mili...@gmail.com wrote:


It does look like the lowercase is working.

The following code

 Document theDoc = theIndexReader.document(0);
 System.out.println(theDoc.get("sn"));
 IndexableField theField = theDoc.getField("sn");
 TokenStream theTokenStream = theField.tokenStream(theAnalyzer);
 System.out.println(theTokenStream);

produces the following output
 SN345-B21
 LowerCaseFilter@5f70bea5 term=sn345-b21,bytes=[73 6e 33 34 35 2d 62
32 31],startOffset=0,endOffset=9

But the search does not work.  Anything obvious popping out for anyone?


On Sat, Aug 9, 2014 at 4:39 PM, Milind mili...@gmail.com wrote:


I looked at a couple of examples on how to get keyword analyzer to be
case insensitive but I think I missed something since it's not working for
me.

In the code below, I'm indexing text in upper case and searching in lower
case.  But I get back no hits.  Do I need to do something more while
indexing?

 private static class LowerCaseKeywordAnalyzer extends Analyzer
 {
 @Override
 protected TokenStreamComponents createComponents(String
theFieldName, Reader theReader)
 {
 KeywordTokenizer theTokenizer = new
KeywordTokenizer(theReader);
 TokenStreamComponents theTokenStreamComponents =
 new TokenStreamComponents(
 theTokenizer,
 new LowerCaseFilter(Version.LUCENE_46,
theTokenizer));
 return theTokenStreamComponents;
 }
 }

 private static void addDocment(IndexWriter theWriter,
   String theFieldName,
   String theValue,
   boolean storeTokenized)
 throws Exception
 {
   Document theDocument = new Document();
   FieldType theFieldType = new FieldType();
   theFieldType.setStored(true);
   theFieldType.setIndexed(true);
   theFieldType.setTokenized(storeTokenized);
   theDocument.add(new Field(theFieldName, theValue,
theFieldType));
   theWriter.addDocument(theDocument);
 }


 static void testLowerCaseKeywordAnalyzer()
 throws Exception
 {
 Version theVersion = Version.LUCENE_46;
 Directory theIndex = new RAMDirectory();

 Analyzer theAnalyzer = new LowerCaseKeywordAnalyzer();

 IndexWriterConfig theConfig = new IndexWriterConfig(theVersion,
theAnalyzer);
 IndexWriter theWriter = new IndexWriter(theIndex, theConfig);
 addDocment(theWriter, "sn", "SN345-B21", false);
 addDocment(theWriter, "sn", "SN445-B21", false);
 theWriter.close();

 QueryParser theParser = new QueryParser(theVersion, "sn",
theAnalyzer);
 Query theQuery = theParser.parse("sn:sn345-b21");
 IndexReader theIndexReader = DirectoryReader.open(theIndex);
 IndexSearcher theSearcher = new IndexSearcher(theIndexReader);
 TopScoreDocCollector theCollector =
TopScoreDocCollector.create(10, true);
 theSearcher.search(theQuery, theCollector);
 ScoreDoc[] theHits = theCollector.topDocs().scoreDocs;
 System.out.println("Number of results found: " + theHits.length);

 }

--
Regards
Milind


--
Regards
Milind






--


Weil Individualität der beste Standard ist (Because individuality is the best standard)

Dipl.-Inf. Christoph Kaser

IconParc GmbH
Sophienstraße 1
80333 München

iconparc.de

Tel: +49 - 89- 15 90 06 - 21
Fax: +49 - 89- 15 90 06 - 19

Geschäftsleitung: Dipl.-Ing. Roland Brückner, Dipl.-Inf. Sven Angerer.
HRB 121830, Amtsgericht München



RE: escaping characters

2014-08-12 Thread Chris Salem
Thanks!  That worked.

We recently upgraded from 2.9 to 4.9; was "true" the default in 2.9?

-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com] 
Sent: Monday, August 11, 2014 5:54 PM
To: java-user@lucene.apache.org
Subject: Re: escaping characters

You need to manually enable automatic generation of phrase queries - it 
defaults to disabled, which simply treats the sub-terms as individual terms 
subject to the default operator.

See:
http://lucene.apache.org/core/4_9_0/queryparser/org/apache/lucene/queryparser/classic/QueryParserBase.html#setAutoGeneratePhraseQueries(boolean)
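
For example, a minimal sketch (field name and analyzer taken from the code
quoted below):

QueryParser qp = new QueryParser(Version.LUCENE_4_9, "resume_text", analyzer);
qp.setAutoGeneratePhraseQueries(true); // "LS/MS" now parses as the phrase query "ls ms"
Query q = qp.parse("resume_text: (LS\\/MS)");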

-- Jack Krupansky

-Original Message-
From: Chris Salem
Sent: Monday, August 11, 2014 1:03 PM
To: java-user@lucene.apache.org
Subject: RE: escaping characters

I'm not using Solr.  Here's my code:

FSDirectory fsd = FSDirectory.open(new File("C:\\indexes\\Lucene4"));

IndexReader reader = DirectoryReader.open(fsd);

IndexSearcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_4_9, getStopWords());

BooleanQuery.setMaxClauseCount(10);

QueryParser qptemp = new QueryParser(Version.LUCENE_4_9, "resume_text", analyzer);
qptemp.setAllowLeadingWildcard(true);
qptemp.setDefaultOperator(QueryParser.AND_OPERATOR);

Query querytemp = qptemp.parse("resume_text: (LS\\/MS)");

System.out.println(querytemp.toString());
TopFieldCollector tfcollector = TopFieldCollector.create(new Sort(), 20, false,
true, false, true);

ScoreDoc[] hits;
searcher.search(querytemp, tfcollector);
hits = tfcollector.topDocs().scoreDocs;
long resultCount = tfcollector.getTotalHits();

reader.close();



-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Monday, August 11, 2014 12:27 PM
To: java-user
Subject: Re: escaping characters

Take a look at the admin/analysis page for the field in question.
The next bit of critical information is adding &debug=query to the URL. The 
former will tell you what happens to the input stream at query and index time, 
the latter will tell you how the query got through the query parsing process.

My guess is that you have WordDelimiterFilterFactory in your analysis chain and 
that's breaking things up.

Best,
Erick


On Mon, Aug 11, 2014 at 8:54 AM, Chris Salem csa...@mainsequence.net
wrote:

 Hi everyone,



 I'm trying to escape special characters and it doesn't seem to be working.
 If I do a search like resume_text: (LS\/MS) it searches for LS AND MS 
 instead of LS/MS.  How would I escape the slash so it searches for LS/MS?

 Thanks









Problem of calling indexWriterConfig.clone()

2014-08-12 Thread Sheng
I think what you suggest probably will work, and I appreciate that. What I
am a little concerned about is whether IndexWriterConfig is completely
stateless or not, meaning if I clone from the very original
IndexWriterConfig, will I lose some info from the breakpoint? Maybe I don't
need to worry about it, since it is going to be removed in 4.9?

On Tue, Aug 12, 2014 at 2:29 AM, Vitaly Funstein vfunst...@gmail.com wrote:

 I honestly don't understand what DWPT pool has to do with IndexWriterConfig
 instances not being reusable for new IndexWriter instances. If you have the
 need to open a new IndexWriter with the same configuration as the one you
 used before, why not save the original config as the template, then
 simply do this for every IndexWriter instance you're creating:

 private final IndexWriterConfig masterCfg = new
 IndexWriterConfig(Version.LUCENE_47, null);
 // set whatever you need on this instance
 .

 IndexWriter writer = new IndexWriter(directory, masterCfg.clone());

 Wouldn't this just work? If not, could you paste the stack trace of the
 exception you're getting?


  On Mon, Aug 11, 2014 at 9:01 PM, Sheng sheng...@gmail.com wrote:

  From src code of DocumentsWriterPerThreadPool, the variable
  numThreadStatesActive seems to be always increasing, which explains why
  asserting on numThreadStatesActive == 0 before cloning this object
  fails. So what should be the most appropriate way of re-opening an
  indexwriter if what you have are the index directory plus the
  indexWriterConfig that the closed indexWriter has been using?
 
  BTW - I am reasonably sure calling indexWriterConfig.clone() in the middle
  of indexing documents used to work for my code (same Lucene 4.7). It is
  only since I recently had to do faceted indexing as well that this problem
  started to emerge. Is it related?
 
 
  On Mon, Aug 11, 2014 at 11:31 PM, Vitaly Funstein vfunst...@gmail.com
  wrote:
 
   I only have the source to 4.6.1, but if you look at the constructor of
   IndexWriter there, it looks like this:
  
 public IndexWriter(Directory d, IndexWriterConfig conf) throws
   IOException {
   conf.setIndexWriter(this); // prevent reuse by other instances
  
   The setter throws an exception if the configuration object has already
  been
   used with another instance of IndexWriter. Therefore, it should be
 cloned
   before being used in the constructor of IndexWriter.
  
  
   On Mon, Aug 11, 2014 at 7:12 PM, Sheng sheng...@gmail.com wrote:
  
So the indexWriterConfig.clone() failed at this step:

  clone.indexerThreadPool = indexerThreadPool.clone();

(LiveIndexWriterConfig.indexerThreadPool:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-core/4.7.0/org/apache/lucene/index/LiveIndexWriterConfig.java#LiveIndexWriterConfig.0indexerThreadPool
DocumentsWriterPerThreadPool.clone():
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-core/4.7.0/org/apache/lucene/index/DocumentsWriterPerThreadPool.java#DocumentsWriterPerThreadPool.clone%28%29)

which then failed at this step in the indexerThreadPool:

  if (numThreadStatesActive != 0) {
    throw new IllegalStateException("clone this object before it is used!");
  }

(DocumentsWriterPerThreadPool.numThreadStatesActive:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-core/4.7.0/org/apache/lucene/index/DocumentsWriterPerThreadPool.java#DocumentsWriterPerThreadPool.0numThreadStatesActive)
   
   
There is a comment right above this:
// We should only be cloned before being used:
   
Does this mean whenever the indexWriter gets called for
commit/prepareCommit, etc., the corresponding indexWriterConfig
 object
cannot be called with .clone() at all?
   
   
On Mon, Aug 11, 2014 at 9:52 PM, Vitaly Funstein vfunst...@gmail.com
wrote:
   
 Looks like you have to clone it prior to using with any IndexWriter
 instances.


 On Mon, Aug 11, 2014 at 2:49 PM, Sheng sheng...@gmail.com wrote:

  I tried to create a clone of IndexWriterConfig with
  indexWriterConfig.clone() for re-creating a new IndexWriter, but then I
  got this very annoying IllegalStateException: "clone this object before
  it is used". Why does this exception happen, and how can I get around it?
  Thanks!
 

   
  
 



Re: escaping characters

2014-08-12 Thread Jack Krupansky

The default changed to false in Lucene 3.1. Before that it was true.

-- Jack Krupansky

-Original Message- 
From: Chris Salem

Sent: Tuesday, August 12, 2014 8:34 AM
To: java-user@lucene.apache.org
Subject: RE: escaping characters

Thanks!  That worked.

We recently upgraded from 2.9 to 4.9; was "true" the default in 2.9?

-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Monday, August 11, 2014 5:54 PM
To: java-user@lucene.apache.org
Subject: Re: escaping characters

You need to manually enable automatic generation of phrase queries - it 
defaults to disabled, which simply treats the sub-terms as individual terms 
subject to the default operator.


See:
http://lucene.apache.org/core/4_9_0/queryparser/org/apache/lucene/queryparser/classic/QueryParserBase.html#setAutoGeneratePhraseQueries(boolean)

-- Jack Krupansky

-Original Message-
From: Chris Salem
Sent: Monday, August 11, 2014 1:03 PM
To: java-user@lucene.apache.org
Subject: RE: escaping characters

I'm not using Solr.  Here's my code:

FSDirectory fsd = FSDirectory.open(new File("C:\\indexes\\Lucene4"));

IndexReader reader = DirectoryReader.open(fsd);

IndexSearcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_4_9, getStopWords());

BooleanQuery.setMaxClauseCount(10);

QueryParser qptemp = new QueryParser(Version.LUCENE_4_9, "resume_text", analyzer);
qptemp.setAllowLeadingWildcard(true);
qptemp.setDefaultOperator(QueryParser.AND_OPERATOR);

Query querytemp = qptemp.parse("resume_text: (LS\\/MS)");

System.out.println(querytemp.toString());
TopFieldCollector tfcollector = TopFieldCollector.create(new Sort(), 20, false,
true, false, true);

ScoreDoc[] hits;
searcher.search(querytemp, tfcollector);
hits = tfcollector.topDocs().scoreDocs;
long resultCount = tfcollector.getTotalHits();

reader.close();



-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Monday, August 11, 2014 12:27 PM
To: java-user
Subject: Re: escaping characters

Take a look at the admin/analysis page for the field in question.
The next bit of critical information is adding &debug=query to the URL. The 
former will tell you what happens to the input stream at query and index 
time, the latter will tell you how the query got through the query parsing 
process.


My guess is that you have WordDelimiterFilterFactory in your analysis chain 
and that's breaking things up.


Best,
Erick


On Mon, Aug 11, 2014 at 8:54 AM, Chris Salem csa...@mainsequence.net
wrote:


Hi everyone,



I'm trying to escape special characters and it doesn't seem to be working.
If I do a search like resume_text: (LS\/MS) it searches for LS AND MS
instead of LS/MS.  How would I escape the slash so it searches for LS/MS?

Thanks










RE: escaping characters

2014-08-12 Thread Uwe Schindler
See Javadocs of QueryParser:

NOTE: You must specify the required Version compatibility when creating 
QueryParser:
- As of 3.1, QueryParserBase.setAutoGeneratePhraseQueries(boolean) is false by 
default.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


 -Original Message-
 From: Chris Salem [mailto:csa...@mainsequence.net]
 Sent: Tuesday, August 12, 2014 2:34 PM
 To: java-user@lucene.apache.org
 Subject: RE: escaping characters
 
 Thanks!  That worked.
 
 We recently upgraded from 2.9 to 4.9; was "true" the default in 2.9?
 
 -Original Message-
 From: Jack Krupansky [mailto:j...@basetechnology.com]
 Sent: Monday, August 11, 2014 5:54 PM
 To: java-user@lucene.apache.org
 Subject: Re: escaping characters
 
 You need to manually enable automatic generation of phrase queries - it
 defaults to disabled, which simply treats the sub-terms as individual terms
 subject to the default operator.
 
 See:
 http://lucene.apache.org/core/4_9_0/queryparser/org/apache/lucene/queryparser/classic/QueryParserBase.html#setAutoGeneratePhraseQueries(boolean)
 
 -- Jack Krupansky
 
 -Original Message-
 From: Chris Salem
 Sent: Monday, August 11, 2014 1:03 PM
 To: java-user@lucene.apache.org
 Subject: RE: escaping characters
 
 I'm not using Solr.  Here's my code:
 
 FSDirectory fsd = FSDirectory.open(new File("C:\\indexes\\Lucene4"));
 
 IndexReader reader = DirectoryReader.open(fsd);
 
 IndexSearcher searcher = new IndexSearcher(reader);
 Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_4_9, getStopWords());
 
 BooleanQuery.setMaxClauseCount(10);
 
 QueryParser qptemp = new QueryParser(Version.LUCENE_4_9, "resume_text", analyzer);
 qptemp.setAllowLeadingWildcard(true);
 qptemp.setDefaultOperator(QueryParser.AND_OPERATOR);
 
 Query querytemp = qptemp.parse("resume_text: (LS\\/MS)");
 
 System.out.println(querytemp.toString());
 TopFieldCollector tfcollector = TopFieldCollector.create(new Sort(), 20, false,
 true, false, true);
 
 ScoreDoc[] hits;
 searcher.search(querytemp, tfcollector);
 hits = tfcollector.topDocs().scoreDocs;
 long resultCount = tfcollector.getTotalHits();
 
 reader.close();
 
 
 
 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Monday, August 11, 2014 12:27 PM
 To: java-user
 Subject: Re: escaping characters
 
 Take a look at the admin/analysis page for the field in question.
 The next bit of critical information is adding &debug=query to the URL. The
 former will tell you what happens to the input stream at query and index
 time, the latter will tell you how the query got through the query parsing
 process.
 
 My guess is that you have WordDelimiterFilterFactory in your analysis chain
 and that's breaking things up.
 
 Best,
 Erick
 
 
 On Mon, Aug 11, 2014 at 8:54 AM, Chris Salem csa...@mainsequence.net
 wrote:
 
  Hi everyone,
 
 
 
  I'm trying to escape special characters and it doesn't seem to be working.
  If I do a search like resume_text: (LS\/MS) it searches for LS AND MS
  instead of LS/MS.  How would I escape the slash so it searches for LS/MS?
 
  Thanks
 
 
 
 
 
 



Re: Problem of calling indexWriterConfig.clone()

2014-08-12 Thread Michael McCandless
IWC.clone is/was buggy ... just stop calling it and make a new IWC
from scratch each time in your application.

Mike McCandless

http://blog.mikemccandless.com


On Tue, Aug 12, 2014 at 8:37 AM, Sheng sheng...@gmail.com wrote:
 I think what you suggest probably will work, and I appreciate that. What I
 am a little concerned about is whether IndexWriterConfig is completely
 stateless or not, meaning if I clone from the very original
 IndexWriterConfig, will I lose some info from the breakpoint? Maybe I don't
 need to worry about it, since it is going to be removed in 4.9?

 On Tue, Aug 12, 2014 at 2:29 AM, Vitaly Funstein vfunst...@gmail.com wrote:

 I honestly don't understand what DWPT pool has to do with IndexWriterConfig
 instances not being reusable for new IndexWriter instances. If you have the
 need to open a new IndexWriter with the same configuration as the one you
 used before, why not save the original config as the template, then
 simply do this for every IndexWriter instance you're creating:

 private final IndexWriterConfig masterCfg = new
 IndexWriterConfig(Version.LUCENE_47, null);
 // set whatever you need on this instance
 .

 IndexWriter writer = new IndexWriter(directory, masterCfg.clone());

 Wouldn't this just work? If not, could you paste the stack trace of the
 exception you're getting?


  On Mon, Aug 11, 2014 at 9:01 PM, Sheng sheng...@gmail.com wrote:

  From src code of DocumentsWriterPerThreadPool, the variable
  numThreadStatesActive seems to be always increasing, which explains why
  asserting on numThreadStatesActive == 0 before cloning this object
  fails. So what should be the most appropriate way of re-opening an
  indexwriter if what you have are the index directory plus the
  indexWriterConfig that the closed indexWriter has been using?
 
  BTW - I am reasonably sure calling indexWriterConfig.clone() in the middle
  of indexing documents used to work for my code (same Lucene 4.7). It is
  only since I recently had to do faceted indexing as well that this problem
  started to emerge. Is it related?
 
 
  On Mon, Aug 11, 2014 at 11:31 PM, Vitaly Funstein vfunst...@gmail.com
  wrote:
 
   I only have the source to 4.6.1, but if you look at the constructor of
   IndexWriter there, it looks like this:
  
 public IndexWriter(Directory d, IndexWriterConfig conf) throws
   IOException {
   conf.setIndexWriter(this); // prevent reuse by other instances
  
   The setter throws an exception if the configuration object has already
  been
   used with another instance of IndexWriter. Therefore, it should be
 cloned
   before being used in the constructor of IndexWriter.
  
  
   On Mon, Aug 11, 2014 at 7:12 PM, Sheng sheng...@gmail.com wrote:
  
So the indexWriterConfig.clone() failed at this step:

  clone.indexerThreadPool = indexerThreadPool.clone();

(LiveIndexWriterConfig.indexerThreadPool:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-core/4.7.0/org/apache/lucene/index/LiveIndexWriterConfig.java#LiveIndexWriterConfig.0indexerThreadPool
DocumentsWriterPerThreadPool.clone():
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-core/4.7.0/org/apache/lucene/index/DocumentsWriterPerThreadPool.java#DocumentsWriterPerThreadPool.clone%28%29)

which then failed at this step in the indexerThreadPool:

  if (numThreadStatesActive != 0) {
    throw new IllegalStateException("clone this object before it is used!");
  }

(DocumentsWriterPerThreadPool.numThreadStatesActive:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-core/4.7.0/org/apache/lucene/index/DocumentsWriterPerThreadPool.java#DocumentsWriterPerThreadPool.0numThreadStatesActive)
   
   
There is a comment right above this:
// We should only be cloned before being used:
   
Does this mean whenever the indexWriter gets called for
commit/prepareCommit, etc., the corresponding indexWriterConfig
 object
cannot be called with .clone() at all?
   
   
On Mon, Aug 11, 2014 at 9:52 PM, Vitaly Funstein vfunst...@gmail.com
wrote:
   
 Looks like you have to clone it prior to using with any IndexWriter
 instances.


 On Mon, Aug 11, 2014 at 2:49 PM, Sheng sheng...@gmail.com wrote:

  I tried to create a clone of IndexWriterConfig with
  indexWriterConfig.clone() for re-creating a new IndexWriter, but then I
  got this very annoying IllegalStateException: "clone this object before
  it is used". Why does this exception happen, and how can I get around it?
  Thanks!
   

Re: BitSet in Filters

2014-08-12 Thread Erick Erickson
bq: Unless I can cache these filters in memory, the cost of constructing
this filter at run time per query is not practical

Why do you say that? Do you have evidence? Because lots and lots of Solr
installations do exactly this and they run fine.

So I suspect there's something you're not telling us about your setup. Are
you, say, soft committing often? Do you have autowarming specified?

You're not going to be able to keep your filters based on some other field
in the document. Internally, Lucene uses the internal doc ID as an index
into the bitset. That's baked in to very low levels and isn't going to
change AFAIK.

Best,
Erick


On Mon, Aug 11, 2014 at 11:53 PM, Sandeep Khanzode 
sandeep_khanz...@yahoo.com.invalid wrote:

 Hi,

 The current usage of BitSets in filters in Lucene is limited to applying
 only on docIDs i.e. I can only construct a filter out of a BitSet if I have
 the DocumentIDs handy.

 However, with every update/delete i.e. CRUD modification, these will
 change, and I have to again redo the whole process to fetch the latest
 docIDs.

 Assume a scenario where I need to tag millions of documents with a tag
 like Finance, IT, Legal, etc.

 Unless I can cache these filters in memory, the cost of constructing this
 filter at run time per query is not practical. If I could map the documents
 to a numeric long identifier and put them in a BitMap, I could then cache
 them because the size reduces drastically. However, I cannot use this
 numeric long identifier in Lucene filters because it is not a docID but
 another regular field.

 Please help with this scenario. Thanks,

 ---
 Thanks n Regards,
 Sandeep Ramesh Khanzode


Re: Can't get case insensitive keyword analyzer to work

2014-08-12 Thread Milind
Thanks Christoph,

So it seems that tokenized has been conflated with analyzed.  I just looked
at the Javadocs and that's what it mentions. I had read it earlier, but it
hadn't registered.  I wonder why it's not called setAnalyzed.  Thanks again.


On Tue, Aug 12, 2014 at 3:07 AM, Christoph Kaser 
christoph.ka...@iconparc.de wrote:

 Hello Milind,

 if you don't set the field to be tokenized, no analyzer will be used and
 the field's contents will be stored as-is, i.e. case sensitive.
 It's the analyzer's job to tokenize the input, so if you use an analyzer
 that does not separate the input into several tokens (like the
 KeywordAnalyzer), your input will remain untokenized.

 Regards
 Christoph

 On 12.08.2014 at 03:38, Milind wrote:

  I found the problem.  But it makes no sense to me.

 If I set the field type to be tokenized, it works.  But if I set it to not
 be tokenized the search fails.  i.e. I have to pass in true to the method.
  theFieldType.setTokenized(storeTokenized);

 I want the field to be stored as un-tokenized.  But it seems that I don't
 need to do that.  The LowerCaseKeywordAnalyzer works if the field is
 tokenized, but not if it's un-tokenized!

 How can that be?


 On Mon, Aug 11, 2014 at 1:49 PM, Milind mili...@gmail.com wrote:

  It does look like the lowercase is working.

 The following code

  Document theDoc = theIndexReader.document(0);
  System.out.println(theDoc.get("sn"));
  IndexableField theField = theDoc.getField("sn");
  TokenStream theTokenStream = theField.tokenStream(theAnalyzer);
  System.out.println(theTokenStream);

 produces the following output
  SN345-B21
  LowerCaseFilter@5f70bea5 term=sn345-b21,bytes=[73 6e 33 34 35 2d 62
 32 31],startOffset=0,endOffset=9

 But the search does not work.  Anything obvious popping out for anyone?


 On Sat, Aug 9, 2014 at 4:39 PM, Milind mili...@gmail.com wrote:

  I looked at a couple of examples on how to get keyword analyzer to be
 case insensitive but I think I missed something since it's not working
 for
 me.

 In the code below, I'm indexing text in upper case and searching in
 lower
 case.  But I get back no hits.  Do I need to do something more while
 indexing?

  private static class LowerCaseKeywordAnalyzer extends Analyzer
  {
  @Override
  protected TokenStreamComponents createComponents(String
 theFieldName, Reader theReader)
  {
  KeywordTokenizer theTokenizer = new
 KeywordTokenizer(theReader);
  TokenStreamComponents theTokenStreamComponents =
  new TokenStreamComponents(
  theTokenizer,
  new LowerCaseFilter(Version.LUCENE_46,
 theTokenizer));
  return theTokenStreamComponents;
  }
  }

  private static void addDocment(IndexWriter theWriter,
String theFieldName,
String theValue,
boolean storeTokenized)
  throws Exception
  {
Document theDocument = new Document();
FieldType theFieldType = new FieldType();
theFieldType.setStored(true);
theFieldType.setIndexed(true);
theFieldType.setTokenized(storeTokenized);
theDocument.add(new Field(theFieldName, theValue,
 theFieldType));
theWriter.addDocument(theDocument);
  }


  static void testLowerCaseKeywordAnalyzer()
  throws Exception
  {
  Version theVersion = Version.LUCENE_46;
  Directory theIndex = new RAMDirectory();

  Analyzer theAnalyzer = new LowerCaseKeywordAnalyzer();

  IndexWriterConfig theConfig = new IndexWriterConfig(theVersion,
  theAnalyzer);
  IndexWriter theWriter = new IndexWriter(theIndex, theConfig);
  addDocment(theWriter, "sn", "SN345-B21", false);
  addDocment(theWriter, "sn", "SN445-B21", false);
  theWriter.close();

  QueryParser theParser = new QueryParser(theVersion, "sn",
 theAnalyzer);
  Query theQuery = theParser.parse("sn:sn345-b21");
  IndexReader theIndexReader = DirectoryReader.open(theIndex);
  IndexSearcher theSearcher = new IndexSearcher(theIndexReader);
  TopScoreDocCollector theCollector =
 TopScoreDocCollector.create(10, true);
  theSearcher.search(theQuery, theCollector);
  ScoreDoc[] theHits = theCollector.topDocs().scoreDocs;
  System.out.println("Number of results found: " + theHits.length);
  }

 --
 Regards
 Milind

  --
 Regards
 Milind




 --
 

 Weil Individualität der beste Standard ist (Because individuality is the best standard)

 Dipl.-Inf. Christoph Kaser

 IconParc GmbH
 Sophienstraße 1
 80333 München

 iconparc.de

 Tel: +49 - 89- 15 90 06 - 21
 Fax: +49 - 89- 15 90 06 - 19

 Geschäftsleitung: Dipl.-Ing. Roland Brückner, Dipl.-Inf. Sven Angerer. 

Re: BitSet in Filters

2014-08-12 Thread Sandeep Khanzode
Hi Erick,

I have mentioned everything that is relevant, I believe :).

However, just to give more background: assume documents on the order of more 
than 300 million, and multiple concurrent users running search. I may front 
Lucene with ElasticSearch, and ES basically calls Lucene TermFilters. My 
filters are broad in nature, so you can take it that any time I filter on a 
tag, it would run into, easily, millions of documents to be accepted in the 
filter.

The only filter that uses a BitSet works with Document Ids in Lucene. I would 
have wanted this bitset approach to work on some other regular numeric long 
field so that we can scale, which does not seem likely if I have to use an 
ArrayList of Longs for TermFilters.

Hope that makes the scenario more clear. Please let me know your thoughts.
 
---
Thanks n Regards,
Sandeep Ramesh Khanzode


On Tuesday, August 12, 2014 8:41 PM, Erick Erickson erickerick...@gmail.com 
wrote:
 


bq: Unless I can cache these filters in memory, the cost of constructing this 
filter at run time per query is not practical

Why do you say that? Do you have evidence? Because lots and lots of Solr 
installations do exactly this and they run fine.

So I suspect there's something you're not telling us about your setup. Are you, 
say, soft committing often? Do you have autowarming specified? 

You're not going to be able to keep your filters based on some other field in 
the document. Internally, Lucene uses the internal doc ID as an index into the 
bitset. That's baked in at very low levels and isn't going to change AFAIK.

Best,
Erick



On Mon, Aug 11, 2014 at 11:53 PM, Sandeep Khanzode 
sandeep_khanz...@yahoo.com.invalid wrote:

Hi,
 
The current usage of BitSets in filters in Lucene is limited to applying only 
on docIDs i.e. I can only construct a filter out of a BitSet if I have the 
DocumentIDs handy.

However, with every update/delete i.e. CRUD modification, these will change, 
and I have to again redo the whole process to fetch the latest docIDs. 

Assume a scenario where I need to tag millions of documents with a tag like 
Finance, IT, Legal, etc.

Unless I can cache these filters in memory, the cost of constructing this 
filter at run time per query is not practical. If I could map the documents to 
a numeric long identifier and put them in a BitMap, I could then cache them 
because the size reduces drastically. However, I cannot use this numeric long 
identifier in Lucene filters because it is not a docID but another regular 
field.

Please help with this scenario. Thanks,

---
Thanks n Regards,
Sandeep Ramesh Khanzode

RE: BitSet in Filters

2014-08-12 Thread Uwe Schindler
Hi,

in general you cannot cache Filters; you can cache their DocIdSets
(CachingWrapperFilter is, for example, doing this). Lucene queries are executed
per segment; that means when you index new documents or update existing ones,
Lucene creates new index segments. Older ones *never* change, so a DocIdSet
(e.g. implemented by FixedBitSet) can be linked to a specific segment of the 
index that never changes - only deletions may be added, but that's transparent 
to the filter - the deletions (given in acceptDocs to getDocIdSet) and the 
cached BitSet just need to be anded together (btw, deletions in Lucene are just 
a Filter, too).

Of course, after a while Lucene merges segments using its MergePolicy, because 
otherwise there would be too many of them. In that case several smaller 
segments (preferably those with many deletions) get merged into larger ones by 
the indexer. This is the only case when some *new* DocIdSets need to be
created. Large segments are unlikely to be merged, unless they have many
deletions (caused by updates into new segments or deletions). This approach is
used by Solr and Elasticsearch - CachingWrapperFilter is an example of how to
do this in your own code.

To implement this (see the sketch below):
- Don't cache a bitset for the whole index; this would indeed require you to
recalculate the bitsets over and over.
- In YourFilter.getDocIdSet(), check whether the coreCacheKey of the given
AtomicReaderContext.reader() is in your cache and, if yes, reuse the cached
DocIdSet (deletions are not relevant; you just have to apply them via
BitsFilteredDocIdSet.wrap(cachedDocIdSet, acceptDocs)). If it's not in the
cache, recalculate the bitset for the given AtomicReaderContext (not the whole
index), cache it, and return it as the DocIdSet instance.
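
A minimal sketch of that recipe (assumptions: the Lucene 4.x filter API
discussed in this thread; the class name CachedTagFilter and the wrapped
delegate filter are illustrative, and a production version would want the
extra hardening that CachingWrapperFilter has):

import java.io.IOException;
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.BitsFilteredDocIdSet;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.Filter;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.FixedBitSet;

public class CachedTagFilter extends Filter {
    private final Filter delegate; // e.g. a TermFilter on the tag field
    // keyed by the segment's coreCacheKey; entries disappear when the segment core is closed
    private final Map<Object, DocIdSet> cache =
        Collections.synchronizedMap(new WeakHashMap<Object, DocIdSet>());

    public CachedTagFilter(Filter delegate) {
        this.delegate = delegate;
    }

    @Override
    public DocIdSet getDocIdSet(AtomicReaderContext context, Bits acceptDocs)
            throws IOException {
        Object key = context.reader().getCoreCacheKey();
        DocIdSet cached = cache.get(key);
        if (cached == null) {
            // materialize the delegate's matches once per segment; deletions are
            // deliberately NOT baked in here (pass null instead of acceptDocs)
            FixedBitSet bits = new FixedBitSet(context.reader().maxDoc());
            DocIdSet uncached = delegate.getDocIdSet(context, null);
            if (uncached != null) {
                DocIdSetIterator it = uncached.iterator();
                if (it != null) {
                    bits.or(it);
                }
            }
            cached = bits;
            cache.put(key, cached);
        }
        // apply the segment's current deletions at query time
        return BitsFilteredDocIdSet.wrap(cached, acceptDocs);
    }
}

Usage would then be along the lines of new CachedTagFilter(new TermFilter(new
Term("tag", "Finance"))), created once per tag and reused across queries, so
each segment's bitset is computed on first use and survives until that segment
is merged away.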

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


 -Original Message-
 From: Sandeep Khanzode [mailto:sandeep_khanz...@yahoo.com.INVALID]
 Sent: Tuesday, August 12, 2014 8:53 AM
 To: Lucene Users
 Subject: BitSet in Filters
 
 Hi,
 
 The current usage of BitSets in filters in Lucene is limited to applying only 
 on
 docIDs i.e. I can only construct a filter out of a BitSet if I have the
 DocumentIDs handy.
 
 However, with every update/delete i.e. CRUD modification, these will
 change, and I have to again redo the whole process to fetch the latest
 docIDs.
 
 Assume a scenario where I need to tag millions of documents with a tag like
 Finance, IT, Legal, etc.
 
 Unless I can cache these filters in memory, the cost of constructing this 
 filter
 at run time per query is not practical. If I could map the documents to a
 numeric long identifier and put them in a BitMap, I could then cache them
 because the size reduces drastically. However, I cannot use this numeric long
 identifier in Lucene filters because it is not a docID but another regular 
 field.
 
 Please help with this scenario. Thanks,
 
 ---
 Thanks n Regards,
 Sandeep Ramesh Khanzode





Questions for facets search

2014-08-12 Thread Sheng
I actually have 2 questions:

1. Is it possible to get the facet label for a particular document? The
reason we want this is that we'd like to allow users to see tags for each hit
in addition to the taxonomy for his/her search.

2. Is it possible to re-index the facet index without reindexing the whole
Lucene index, since they are separated? We have a dynamic list of faceted
fields, so being able to quickly rebuild the whole facet index would be
quite desirable.

Again, I am using Lucene 4.7, thanks in advance for your answers!

Sheng


AW: Questions for facets search

2014-08-12 Thread Ralf Heyde
For the 1st: from the Solr level, I guess you could select (only) the document
by its unique ID. Then you have the facets for that particular document. But
this results in one additional query per doc.

Sent from my BlackBerry 10 smartphone.
  Original Message  
From: Sheng
Sent: Tuesday, 12 August 2014 23:35
To: java-user@lucene.apache.org
Reply-To: java-user@lucene.apache.org
Subject: Questions for facets search

I actually have 2 questions:

1. Is it possible to get the facet label for a particular document? The
reason we want this is that we'd like to allow users to see tags for each hit
in addition to the taxonomy for his/her search.

2. Is it possible to re-index the facet index without reindexing the whole
Lucene index, since they are separated? We have a dynamic list of faceted
fields, so being able to quickly rebuild the whole facet index would be
quite desirable.

Again, I am using Lucene 4.7, thanks in advance for your answers!

Sheng
