Don't get results whereas Luke does...
Dear Lucene-users,

I am a bit puzzled over this. I have a query which should return some documents. If I use Luke, I obtain hits using the org.apache.lucene.analysis.KeywordAnalyzer. This is the query: domain:NB-AR* (I have data indexed using: doc.add(new Field("domain", "NB-ARC", Field.Store.YES, Field.Index.NOT_ANALYZED)); ). The Explain structure reveals that Luke is employing a PrefixQuery.

Ok, now I want to obtain these results using my Java application:

// Using the QueryParser, let it decide what to do with it:
Query q = new QueryParser(Version.LUCENE_35, "contents", analyzer).parse("domain:NB-AR*");
System.out.println("Type of query: " + q.getClass().getSimpleName());
// Type of query: PrefixQuery, so that's ok

int hitsPerPage = 1000;
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
System.out.println("Found " + hits.length + " hits.");
// Unfortunately 0 hits.

// Move on and specify a Term and PrefixQuery directly:
Term term = new Term("domain", "NB-AR");
q = new PrefixQuery(term);
collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(q, collector);
hits = collector.topDocs().scoreDocs;
// Found with prefix: 441 hits.

I tried lowercasing the search query, re-indexed and made the field Field.Index.ANALYZED, but nothing worked... I have a feeling it is something very trivial, but I just can't figure it out... Anyone?

EJ Blom

--
View this message in context: http://lucene.472066.n3.nabble.com/Don-t-get-results-wheras-Luke-does-tp3563736p3563736.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Use multiple lucene indices
Hi Guys,

Thank you very much for your answers. I will do some profiling on memory usage, but is there any documentation on how Lucene uses/allocates memory?

Best wishes,
Rui Wang

On 6 Dec 2011, at 06:11, KARTHIK SHIVAKUMAR wrote:

> hi
>
> > would the memory usage go through the roof?
>
> Yup. My past experience got me in pickles there...
>
> with regards
> karthik
>
> On Mon, Dec 5, 2011 at 11:28 PM, Rui Wang <rw...@ebi.ac.uk> wrote:
>
>> Hi All,
>>
>> We are planning to use Lucene in our project, but we are not entirely sure about some of the design decisions we made. Below are the details; any comments/suggestions are more than welcome.
>>
>> The requirements of the project are:
>>
>> 1. We have tens of thousands of files, ranging in size from 500M to a few terabytes, and the majority of the contents in these files will not be accessed frequently.
>> 2. We are planning to keep less-accessed contents outside of our database and store them on the file system.
>> 3. We also have code to get the binary position of these contents in the files. Using these binary positions, we can quickly retrieve the contents and convert them into our domain objects.
>>
>> We think Lucene provides a scalable solution for storing and indexing these binary positions. The idea is that each piece of content in the files will be a document, and each document will have at least an ID field to identify the content and a binary position field containing the start and stop positions of the content. Having done some performance testing, it seems to us that Lucene is well capable of doing this.
>>
>> At the moment, we are planning to create one Lucene index per file, so if we have new files to be added to the system, we can simply generate a new index. The problem is to do with searching: this approach means that we need to create a new IndexSearcher every time a file is accessed through our web service. We know that it is rather expensive to open a new IndexSearcher, and we are thinking of using some kind of pooling mechanism.
>>
>> Our questions are:
>>
>> 1. Is this one-index-per-file approach a viable solution? What do you think about pooling IndexSearchers?
>> 2. If we have many IndexSearchers open at the same time, would the memory usage go through the roof? I couldn't find any documentation on how Lucene uses/allocates memory.
>>
>> Thank you very much for your help.
>>
>> Many thanks,
>> Rui Wang
>
> --
> N.S.KARTHIK
> R.M.S.COLONY
> BEHIND BANK OF INDIA
> R.M.V 2ND STAGE
> BANGALORE 560094
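The pooling mechanism asked about above can be sketched independently of Lucene as a small LRU cache that closes evicted entries when the pool is full. All names here (`SearcherPool`, `Opener`) are illustrative assumptions, not Lucene API; in Lucene 3.5+ the built-in SearcherManager is usually the better starting point for managing searcher lifecycles.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU pool: keeps at most `capacity` searchers open, closing the
// least-recently-used one on eviction. The element type is any Closeable,
// so an IndexSearcher wrapper would fit here.
public class SearcherPool<S extends Closeable> {
    public interface Opener<S> { S open(String path) throws IOException; }

    private final LinkedHashMap<String, S> pool;

    public SearcherPool(final int capacity) {
        // accessOrder = true makes iteration order LRU, so the eldest
        // entry is the least recently used one.
        this.pool = new LinkedHashMap<String, S>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, S> eldest) {
                if (size() > capacity) {
                    try { eldest.getValue().close(); } catch (IOException ignored) {}
                    return true;
                }
                return false;
            }
        };
    }

    public synchronized S get(String indexPath, Opener<S> opener) throws IOException {
        S s = pool.get(indexPath);
        if (s == null) {
            // e.g. opener = path -> new IndexSearcher(IndexReader.open(dir))
            s = opener.open(indexPath);
            pool.put(indexPath, s);
        }
        return s;
    }
}
```

Whether a capacity of tens or thousands is sensible depends on the per-searcher memory question raised in the thread, which is why profiling first is the right call.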
Re: Use multiple lucene indices
How many documents are there in the system? Approximate it by: number of files * avg(docs/file).

From my understanding your queries will be just a lookup for a document ID. (Q: are those IDs unique between files, or do you need to filter by filename as well?) If that will be the only use case, then maybe you should consider some other lookup system; an ehcache offloaded and persisted on disk might work just as well.

If you are anywhere under 200 mln documents, I'd say you should go with a single index that contains all the data on a decent box (2-4 CPUs, 4-8 GB RAM). On a slightly beefier host with Lucene 4 (try various codecs for speed/memory trade-offs) I think you could go to 1 bln documents.

If you plan on more complex queries, like "given a position in a file, identify the document that contains it", then the number of documents should be reconsidered. In the worst-case scenario I would go with a partitioned index (5-10 partitions, but not thousands).
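The partitioned-index suggestion above can be sketched as a simple deterministic router: hash the file name to one of N partitions so documents from the same file always land in, and are searched from, the same index. The class name and partition count below are illustrative assumptions.

```java
// Route a file to one of numPartitions index partitions (e.g. 5-10, per the
// advice above). Math.floorMod keeps the result in [0, numPartitions) even
// when String.hashCode() is negative.
public class IndexPartitioner {
    public static int partitionFor(String fileName, int numPartitions) {
        return Math.floorMod(fileName.hashCode(), numPartitions);
    }
}
```

Because the mapping is a pure function of the file name, no administration table is needed to remember which partition holds which file.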
Re: Use multiple lucene indices
Hi Danil,

Thank you for your suggestions.

We will have approximately half a million documents per file, so using your calculation: 20,000 files * 500,000 docs = 10,000,000,000 documents. And we are likely to get more files in the future, so a scalable solution is most desirable.

The document IDs are not unique between files, so we will have to filter by file name as well.

ehcache is certainly an interesting idea; does it have comparable load speed to a Lucene index, and what about the memory footprint?

Another thing I should have mentioned before: we will add a few files (say 10) per day. This means we need to update indices on a regular basis, which is the reason we were thinking of generating one index per file.

Am I right to say that you would definitely not go for the one-index-per-file solution? Is that also due to memory consumption?

Many thanks,
Rui Wang
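Since the IDs are not unique between files, one option (an assumption on my part, not something proposed in the thread) is to index a single composite primary-key term built from the file name plus the local ID, so a one-term lookup replaces the two-clause filename + ID filter:

```java
// Build a key that is unique across files by joining file name and local ID
// with a separator assumed never to occur in either ('\u0000' here is an
// illustrative choice). Index this as a single NOT_ANALYZED field.
public class DocKeys {
    public static String compositeKey(String fileName, String localId) {
        return fileName + '\u0000' + localId;
    }
}
```

The separator matters: with a plain concatenation, ("a", "bc") and ("ab", "c") would collide.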
Re: Don't get results whereas Luke does...
Try QueryParser.setLowercaseExpandedTerms(false). QueryParser will lowercase terms in prefix etc. queries by default.

If that doesn't work, and it were my problem, I'd just lowercase everything, everywhere. Life's too short to mess around with case issues.

--
Ian.
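The mismatch Ian describes can be reproduced without Lucene at all: the index holds the raw term "NB-ARC" (NOT_ANALYZED, case preserved), while QueryParser silently lowercases the expanded prefix to "nb-ar", which no indexed term starts with. A minimal string-level illustration, where `startsWith` merely stands in for what PrefixQuery does against the term dictionary:

```java
public class PrefixCaseDemo {
    // Stand-in for PrefixQuery's check against the term dictionary.
    public static boolean prefixMatches(String indexedTerm, String prefix) {
        return indexedTerm.startsWith(prefix);
    }

    public static void main(String[] args) {
        String indexedTerm = "NB-ARC";  // indexed NOT_ANALYZED, case preserved
        String parsedPrefix = "nb-ar";  // QueryParser lowercased "NB-AR*" by default
        System.out.println(prefixMatches(indexedTerm, parsedPrefix)); // false -> 0 hits
        System.out.println(prefixMatches(indexedTerm, "NB-AR"));      // true -> hits via explicit PrefixQuery
        // Fix: parser.setLowercaseExpandedTerms(false),
        // or lowercase consistently at both index and query time.
    }
}
```

This also explains why the hand-built `new PrefixQuery(new Term("domain", "NB-AR"))` in the original post found 441 hits: it bypasses QueryParser and keeps the prefix upper-case.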
Re: Don't get results whereas Luke does...
I had a similar problem. The problem was the '-' char, which is a special char for Lucene. You can try indexing the data in lowercase and using WhitespaceAnalyzer for both indexing and searching over the field. One other option is to replace '-' with '_' when indexing and searching. This way, your data won't be indexed with any special chars.

One lesson I've learned is to leave upper-case characters to be used only for operators. Data that will be searched upon should always be lowercase.
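The two suggestions above (lowercase everything; replace '-' with '_') combine into one tiny normalizer, applied identically at index and query time so case and the '-' special character can never disagree. The class and method names are illustrative:

```java
import java.util.Locale;

public class TermNormalizer {
    // Apply the same normalization when indexing a field value and when
    // building the query term for it, so both sides always agree.
    public static String normalize(String raw) {
        return raw.toLowerCase(Locale.ROOT).replace('-', '_');
    }
}
```

With this, "NB-ARC" is indexed as "nb_arc" and a user's "NB-AR*" prefix becomes "nb_ar*", which matches regardless of QueryParser's lowercasing behaviour.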
Re: lucene-core-3.3.0 not optimizing
Try taking a look at the patch, but on a quick glance it doesn't look like the underlying code has changed much. But note the whole point of this is that optimize was overused given its former name; why do you want to keep using it?

Best
Erick

On Tue, Dec 6, 2011 at 1:04 AM, KARTHIK SHIVAKUMAR <nskarthi...@gmail.com> wrote:

> Hi
>
> LUCENE-3454 http://issues.apache.org/jira/browse/LUCENE-3454: so you mean the code has changed with this API... Does anybody have a sample code snippet, or is there a sample to play around with?
>
> with regards
> karthik
>
> On Fri, Dec 2, 2011 at 3:44 PM, Ian Lea <ian@gmail.com> wrote:
>
>> Well, calling optimize(maxNumSegments) will (from the javadocs on recent releases) "Optimize the index down to <= maxNumSegments". So optimize(100) won't get you down to 1 big file, unless you are using compound files perhaps. Maybe it did something different 7 years ago but that seems very unlikely.
>>
>> In 3.5.0 all optimize() calls are deprecated anyway. I suggest you read the release notes and the javadocs, upgrade to 3.5.0 and remove all optimize() calls altogether.
>>
>> --
>> Ian.
>>
>> On Fri, Dec 2, 2011 at 9:58 AM, KARTHIK SHIVAKUMAR <nskarthi...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> I used to index and optimize 5+ million XML docs in Lucene 1.x, 7 years ago, and that IndexWriter.optimize call used to merge all the bits and pieces created into 1 big file. I have not tracked the API changes since then, and with lucene-core-3.3.0 I am not able to find on Google why this is happening.
>>>
>>> with regards
>>> karthik
>>>
>>> On Fri, Dec 2, 2011 at 12:37 PM, Simon Willnauer <simon.willna...@googlemail.com> wrote:
>>>
>>>> What do you understand when you say "optimize"? Unless you tell us what this code does in your case and what you'd expect it to do, it's impossible to give you any reasonable answer.
>>>>
>>>> simon
>>>>
>>>> On Fri, Dec 2, 2011 at 4:54 AM, KARTHIK SHIVAKUMAR <nskarthi...@gmail.com> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> Spec: O/S Win 7, JDK 1.6.0_29, lucene-core-3.3.0.
>>>>>
>>>>> Finally, after indexing successfully, why does this code not optimize (sample code)?
>>>>>
>>>>> INDEX_WRITER.optimize(100);
>>>>> INDEX_WRITER.commit();
>>>>> INDEX_WRITER.close();
Re: Spell check on a subset of an index ( 'namespace' aware spell checker)
I'm still struggling with this. I've tried to implement the solution mentioned in my previous reply, but unfortunately there is a blocking issue: I cannot find a way to create another index from the source index such that the new index has the field values in it. The only way to copy a document's field values from one index to another is to have stored fields. But stored fields hold the original String in its entirety, not the analyzed String, which I need.

Is there another way to copy documents (with at least the spellcheck field) from one index to another?

Recap: I have a source index holding documents for different namespaces. These documents hold one field (analyzed) that should be used for spell checking. I want to construct a spellchecker index for each namespace separately. To accomplish this, I first get the list of namespaces (each document has a namespace field in the original index). Then, for each namespace, I get the list of documents that match this namespace. Then I'd like to use this subset to construct a spellchecker index.

Regards,
Elmer

On 11/23/2011 03:28 PM, E. van Chastelet wrote:

> I currently have an idea to get it done, but it's not a nice solution. If we have an index Q with all documents for all namespaces, we first extract the list of all terms that appear for the field namespace in Q (this field indicates the namespace of the document). Then, for each namespace n in the terms list:
> - Get all docs from Q that match +namespace:n
> - Construct a temporary index from these docs
> - Use this temporary index to construct the dictionary, which the SpellChecker can use as input
> - Call indexDictionary on SpellChecker to create the spellcheck index for the current namespace
> - Delete the temporary index
>
> We now have separate spellcheck indexes for each namespace. Any suggestions for a cleaner solution?
>
> Regards,
> Elmer van Chastelet
>
> On 11/10/2011 01:16 PM, E. van Chastelet wrote:
>
>> Hi all,
>>
>> In our project we'd like the ability to get search results scoped to one 'namespace' (as we call it). This can easily be achieved by using a filter or just an additional must-clause. For the spellchecker (and our autocompletion, which is a modified spellchecker), the story seems different. The spellchecker index is created using a LuceneDictionary, which has an IndexReader as source. We would like to get (spellcheck/autocomplete) suggestions that are scoped to one namespace (i.e. the field 'namespace' should have a particular value). With a single source index containing docs for all namespaces, it seems not possible to create a spellcheck index for each namespace the ordinary way.
>>
>> Q1: Is there a way to construct a LuceneDictionary from a subset of a single source index (all terms where namespace = %value%)?
>>
>> Another, maybe better solution is to customize the spellchecker by adding an additional namespace field to the spellchecker index. At query time, an additional must-clause is added, scoping the suggestions to one (or more) namespace(s). The advantage of this is to have a singleton spellchecker (or at least a single index reader) for all namespaces. This also means fewer open files for our application (imagine if there are over 1000 namespaces).
>>
>> Q2: Will there be a significant penalty (say more than 50% slower) for the additional must-clause at query time?
>> Q3: Or can you think of a better solution for this problem? :)
>>
>> How we currently do it: we currently use Lucene 3.1 with Hibernate Search, and we actually already have autocompletion and spell checking scoped to one namespace. This is achieved by using index sharding, so each namespace has its own index and reader, and another for spellcheck and autocompletion. Unfortunately there are some downsides to this:
>> - Our faceting engine has no good support for multiple indexes, so faceting only works on a single namespace
>> - It needs administration for mapping a namespace identifier (String) to an index number (integer)
>> - The number of shards (and thus namespaces) is currently hardcoded. At this moment it is set to 100, which means Hibernate Search opens up 100 index readers/writers while only n < 100 are in use, and therefore:
>> - Many open file descriptors
>> - A hard limit on the number of namespaces
>>
>> Therefore it seems better to switch back to having a single index for all namespaces.
>>
>> Thanks!
>> Regards,
>> Elmer van Chastelet
Re: Spell check on a subset of an index ( 'namespace' aware spell checker)
There are utilities floating around for getting output from analyzers; would that help? I think there are some in LIA (Lucene in Action), and probably others elsewhere. The idea being that you grab the stored fields from the index, pass them through your analyzer, grab the output and use that.

Or can you do something with TermEnum and/or TermDocs? Not sure exactly what or how though ...

--
Ian.
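Whichever Lucene API ends up extracting the analyzed terms (re-analyzing stored fields, as Ian suggests, or walking TermEnum/TermDocs), the per-namespace step itself reduces to grouping terms by their document's namespace before building each spellchecker dictionary. A pure-Java sketch of that grouping, where whitespace splitting merely stands in for the real analyzer and all names are illustrative:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class NamespaceDictionaries {
    // Input: namespace -> raw spellcheck-field texts of its documents.
    // Output: namespace -> unique terms, i.e. one dictionary per namespace,
    // each of which could be fed to SpellChecker.indexDictionary(...).
    public static Map<String, Set<String>> build(Map<String, List<String>> docs) {
        Map<String, Set<String>> dicts = new HashMap<>();
        for (Map.Entry<String, List<String>> e : docs.entrySet()) {
            Set<String> terms = dicts.computeIfAbsent(e.getKey(), k -> new TreeSet<>());
            for (String text : e.getValue()) {
                // Whitespace split + lowercase as a stand-in for the analyzer.
                for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
                    if (!token.isEmpty()) terms.add(token);
                }
            }
        }
        return dicts;
    }
}
```

This avoids the temporary-index round trip entirely: the intermediate structure is just an in-memory map, and only the final spellchecker indexes are written to disk.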
tokenizing text using language analyzer but preserving stopwords if possible
I need to implement a quick-and-dirty, "poor man's" translation of a foreign-language document by looking up each word in a dictionary and replacing it with the English translation. So what I need is to tokenize the original foreign text into words and then access each word, look it up and get its translation. However, if possible, I also need to preserve non-words, i.e. stopwords, so that I can replicate them in the output stream without translating them. If the latter is not possible, then I just need to preserve the order of the original words so that their translations have the same order in the output.

Can I accomplish this using Lucene components? I presume I'd have to start by creating an analyzer for the foreign language, but then what? How do I (i) tokenize, (ii) access words in the correct order, (iii) also access non-words if possible?

Thanks much,
Ilya Zavorin
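Before reaching for a Lucene analyzer, the word-by-word scheme described above can be done with a plain regex tokenizer: match letter runs as words, copy everything between matches (punctuation, whitespace) through verbatim, and let any word missing from the dictionary, stopwords included, fall through untranslated. A minimal sketch under those assumptions; the dictionary contents here are purely illustrative:

```java
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PoorMansTranslator {
    // Matches runs of letters (any script) as "words".
    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    public static String translate(String text, Map<String, String> dict) {
        StringBuilder out = new StringBuilder();
        Matcher m = WORD.matcher(text);
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start());  // copy separators verbatim
            String word = m.group();
            // Look up the lowercased word; keep the original if unknown.
            out.append(dict.getOrDefault(word.toLowerCase(Locale.ROOT), word));
            last = m.end();
        }
        out.append(text.substring(last));       // trailing separators
        return out.toString();
    }
}
```

A Lucene analyzer buys stemming and language-aware tokenization on top of this, but the order-preserving replace-or-pass-through logic stays the same.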
Lucene 4.0 Index Format Finalization Timetable
Is there a timetable for when it is expected to be finalized? I'm not looking for an exact date, just an approximation (next month, 2 months, 6 months, etc.).
Re: Lucene 4.0 Index Format Finalization Timetable
On Tue, Dec 6, 2011 at 6:41 PM, Jamie Johnson <jej2...@gmail.com> wrote:

> Is there a timetable for when it is expected to be finalized?

It will be finalized when Lucene 4.0 is released.

--
lucidimagination.com
Re: Lucene 4.0 Index Format Finalization Timetable
Thanks Robert. Is there a timetable for that? I'm trying to gauge whether it is appropriate to push for my organization to move to the current Lucene 4.0 implementation (we're using Solr Cloud, which is built against trunk) or whether it's expected there will be changes to what is currently on trunk. I'm not looking for anything hard, just trying to plan as much as possible, understanding that this is one of the implications of using trunk.
Re: Lucene 4.0 Index Format Finalization Timetable
I asked here [1] and it said "Ask again later."

[1] http://8ball.tridelphia.net/
Re: Lucene 4.0 Index Format Finalization Timetable
I suppose that's fair enough. Some quick googling shows that this has been asked many times with pretty much the same response. Sorry to add to the noise.