Don't get results whereas Luke does...

2011-12-06 Thread ejblom
Dear Lucene-users,

I am a bit puzzled over this. I have a query which should return some
documents. If I use Luke, I obtain hits using
org.apache.lucene.analysis.KeywordAnalyzer.

This is the query:

domain:NB-AR*

(I have data indexed using:

doc.add(new Field("domain", "NB-ARC", Field.Store.YES,
Field.Index.NOT_ANALYZED));  )

Explain structure reveals that Luke is employing a PrefixQuery. Ok, now I
want to obtain these results using my Java application:

// Using the QueryParser, let it decide what to do with it:

Query q = new QueryParser(Version.LUCENE_35, "contents",
analyzer).parse("domain:NB-AR*");
System.out.println("Type of query: " + q.getClass().getSimpleName());

// Type of query: PrefixQuery so that's ok

int hitsPerPage = 1000;
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage,
true);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
System.out.println("Found " + hits.length + " hits.");

// Unfortunately 0 hits.

// Move on and specify a Term and PrefixQuery directly:

Term term = new Term("domain", "NB-AR");
q = new PrefixQuery(term);
collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(q, collector);
hits = collector.topDocs().scoreDocs;

// Found 441 hits with the PrefixQuery.



I tried lowercasing the search query, and re-indexing with the field set to
Field.Index.ANALYZED, but nothing worked...

I have a feeling it is something very trivial, but I just can't figure it
out...

Anyone?

EJ Blom


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Don-t-get-results-wheras-Luke-does-tp3563736p3563736.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.




Re: Use multiple lucene indices

2011-12-06 Thread Rui Wang
Hi Guys,

Thank you very much for your answers. 

I will do some profiling on memory usage, but is there any documentation on how 
Lucene uses/allocates the memory? 

Best wishes,
Rui Wang


On 6 Dec 2011, at 06:11, KARTHIK SHIVAKUMAR wrote:

 hi
 
 would the memory usage go through the roof?
 
 Yup 
 
 My past experience got me pickles in there...
 
 
 
 with regards
 karthik
 
 On Mon, Dec 5, 2011 at 11:28 PM, Rui Wang rw...@ebi.ac.uk wrote:
 
 Hi All,
 
 We are planning to use Lucene in our project, but are not entirely sure
 about some of the design decisions we made. Below are the details; any
 comments/suggestions are more than welcome.
 
 The requirements of the project are below:
 
 1. We have tens of thousands of files, their size ranging from 500M to a
 few terabytes, and the majority of the contents in these files will not be
 accessed frequently.
 
 2. We are planning to keep less-accessed contents outside of our database
 and store them on the file system.
 
 3. We also have code to get the binary position of these contents in the
 files. Using these binary positions, we can quickly retrieve the contents
 and convert them into our domain objects.
 
 We think Lucene provides a scalable solution for storing and indexing
 these binary positions, so the idea is that each piece of content in
 the files will be a document; each document will have at least an ID field
 to identify the content and a binary position field containing the start
 and stop positions of the content. Having done some performance testing,
 it seems to us that Lucene is well capable of doing this.
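
 A minimal sketch of what one of these position documents could look like
 (the field names and the start:stop encoding here are made-up
 illustrations, not our actual schema):

 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;

 public class PositionDoc {
     // One document per piece of content: an ID, the owning file, and a
     // stored-only field holding the start/stop byte positions.
     static Document positionDoc(String id, String fileName,
             long start, long stop) {
         Document doc = new Document();
         doc.add(new Field("id", id, Field.Store.YES,
                 Field.Index.NOT_ANALYZED));
         doc.add(new Field("fileName", fileName, Field.Store.YES,
                 Field.Index.NOT_ANALYZED));
         // Stored but not indexed: we only read it back, never search on it.
         doc.add(new Field("position", start + ":" + stop, Field.Store.YES,
                 Field.Index.NO));
         return doc;
     }
 }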
 
 At the moment, we are planning to create one Lucene index per file, so if
 we have new files to be added to the system, we can simply generate a new
 index. The problem is to do with searching: this approach means that we
 need to create a new IndexSearcher every time a file is accessed through
 our web service. We know that it is rather expensive to open a new
 IndexSearcher, and are thinking of using some kind of pooling mechanism.
 Our questions are:
 
 1. Is this one index per file approach a viable solution? What do you
 think about pooling IndexSearcher?
 
 2. If we have many IndexSearchers opened at the same time, would the
 memory usage go through the roof? I couldn't find any documentation on how
 Lucene uses/allocates memory.
 
 Thank you very much for your help.
 
 Many thanks,
 Rui Wang
 
 
 
 
 -- 
 *N.S.KARTHIK
 R.M.S.COLONY
 BEHIND BANK OF INDIA
 R.M.V 2ND STAGE
 BANGALORE
 560094*





Re: Use multiple lucene indices

2011-12-06 Thread Danil ŢORIN
How many documents are there in the system?
Approximate it by: number of files * avg(docs/file)

From my understanding your queries will be just lookups for a document ID
(Q: are those IDs unique between files, or do you need to filter by
filename?). If that will be the only use case then maybe you should consider
some other lookup systems; an ehcache offloaded and persistent on disk might
work just as well.

If you are anywhere < 200 mln documents I'd say you should go with a single
index that contains all the data on a decent box (2-4 CPU, 4-8Gb RAM).
On a slightly beefier host, and with Lucene4 (try various codecs for
speed/memory usage), I think you could go to 1 bln documents.

If you plan on more complex queries (like: given a position in a file,
identify a document that contains it), then the number of documents should
be reconsidered.

In the worst-case scenario I would go with a partitioned index (5-10
partitions, but not thousands).
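
For the single-index route, a lookup would just be two required term
clauses. A rough sketch (both field names here are hypothetical):

import java.io.IOException;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class PositionLookup {
    // IDs are only unique per file, so require both the filename and
    // the content ID to match.
    static TopDocs lookup(IndexSearcher searcher, String fileName,
            String id) throws IOException {
        BooleanQuery q = new BooleanQuery();
        q.add(new TermQuery(new Term("fileName", fileName)), Occur.MUST);
        q.add(new TermQuery(new Term("id", id)), Occur.MUST);
        return searcher.search(q, 1); // expect at most one hit
    }
}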






Re: Use multiple lucene indices

2011-12-06 Thread Rui Wang
Hi Danil,

Thank you for your suggestions.

We will have approximately half a million documents per file, so using your 
calculation, 20,000 files * 500,000 = 10,000,000,000. And we are likely to get 
more files in the future, so a scalable solution is most desirable. 

The document IDs are not unique between files, so we will have to filter by 
file name as well. ehcache is certainly an interesting idea; does it have a 
load speed comparable to a Lucene index, and what about its memory footprint?

Another thing I should have mentioned before: we will add a few files (say 10) 
per day. This means we need to update indices on a regular basis, hence the 
reason we were thinking of generating one index per file. 

Am I right to say that you would definitely not go for the one-index-per-file 
solution? Is it also due to memory consumption? 
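
(On the pooling question earlier in the thread: a minimal sketch of what
searcher reuse could look like with the SearcherManager class added in
Lucene 3.5. This is untested; check its javadocs for the exact constructor.)

import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.store.Directory;

public class PooledSearch {
    private final SearcherManager mgr;

    // One manager per index; it keeps a warm, shared IndexSearcher
    // instead of opening a new one for every web service request.
    public PooledSearch(Directory dir) throws IOException {
        mgr = new SearcherManager(dir, null, null); // no warmer, no executor
    }

    public int maxDoc() throws IOException {
        IndexSearcher s = mgr.acquire();
        try {
            return s.getIndexReader().maxDoc(); // run real searches here
        } finally {
            mgr.release(s); // always release; readers are refcounted
        }
    }
}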

Many thanks,
Rui Wang




Re: Don't get results whereas Luke does...

2011-12-06 Thread Ian Lea
Try QueryParser.setLowercaseExpandedTerms(false).  QueryParser will
lowercase terms in prefix etc. queries by default.
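
Something like this (a minimal sketch against the 3.5 API):

import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class CasePreservingPrefix {
    public static void main(String[] args) throws ParseException {
        QueryParser parser = new QueryParser(Version.LUCENE_35, "contents",
                new KeywordAnalyzer());
        // By default the parser lowercases prefix/wildcard terms, so
        // domain:NB-AR* silently becomes domain:nb-ar* and never matches
        // the NOT_ANALYZED term NB-ARC.
        parser.setLowercaseExpandedTerms(false);
        Query q = parser.parse("domain:NB-AR*");
        System.out.println(q); // domain:NB-AR*
    }
}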

If that doesn't work, and it was my problem, I'd just lowercase
everything, everywhere.  Life's too short to mess around with case
issues.


--
Ian.






Re: Don't get results whereas Luke does...

2011-12-06 Thread Felipe Carvalho
I had a similar problem. The problem was the '-' char, which is a special
char for Lucene. You can try indexing the data in lowercase and using
WhitespaceAnalyzer for both indexing and searching over the field. One
other option is to replace '-' with '_' when indexing and searching. This
way, your data won't be indexed with any special chars. One lesson I've
learned is to leave uppercase characters to be used only for operators.
Data that will be searched upon should always be lowercase.
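
A rough sketch of the lowercase-both-sides approach (untested, Lucene 3.x):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;

public class LowercaseEverywhere {
    public static void main(String[] args) {
        // Index the value lowercased; NOT_ANALYZED keeps '-' as a
        // literal character rather than a query operator.
        Document doc = new Document();
        doc.add(new Field("domain", "NB-ARC".toLowerCase(),
                Field.Store.YES, Field.Index.NOT_ANALYZED));

        // ...and lowercase user input on the query side so both agree.
        Query q = new PrefixQuery(new Term("domain",
                "NB-AR".toLowerCase()));
        System.out.println(q); // domain:nb-ar*
    }
}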





Re: lucene-core-3.3.0 not optimizing

2011-12-06 Thread Erick Erickson
Try taking a look at the patch, but on a quick glance it doesn't
look like the underlying code has changed much.

But note the whole point of this is that optimize was overused
given its former name; why do you want to keep using it?
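
If you do upgrade to 3.5, where LUCENE-3454 renamed optimize(), the
equivalent call is roughly this (a sketch, not tested):

import java.io.IOException;

import org.apache.lucene.index.IndexWriter;

public class MergeDown {
    // forceMerge(1) is what optimize() used to be: merge the whole
    // index down to a single segment. Only worth it on a static index.
    static void mergeDown(IndexWriter writer) throws IOException {
        writer.forceMerge(1);
        writer.commit();
        writer.close();
    }
}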

Best
Erick

On Tue, Dec 6, 2011 at 1:04 AM, KARTHIK SHIVAKUMAR
nskarthi...@gmail.com wrote:
 Hi

 LUCENE-3454: http://issues.apache.org/jira/browse/LUCENE-3454

 So you mean the code has changed with this API...

 Does anybody have a sample code snippet, or is there a sample to play
 around with?


 with regards
 karthik


 On Fri, Dec 2, 2011 at 3:44 PM, Ian Lea ian@gmail.com wrote:

 Well, calling optimize(maxNumSegments) will (from the javadocs on
 recent releases) optimize the index down to <= maxNumSegments.  So
 optimize(100) won't get you down to 1 big file, unless you are using
 compound files perhaps.  Maybe it did something different 7 years ago
 but that seems very unlikely.

 In 3.5.0 all optimize() calls are deprecated anyway.  I suggest you
 read the release notes and the javadocs, upgrade to 3.5.0 and remove
 all optimize() calls altogether.


 --
 Ian.


 On Fri, Dec 2, 2011 at 9:58 AM, KARTHIK SHIVAKUMAR
 nskarthi...@gmail.com wrote:
  Hi
 
  I have indexed and optimized 5+ million XML docs in Lucene 1.x, 7
  years ago,

  And this piece, IndexWriter.optimize, used to merge all the bits and
  pieces of the created index into 1 big file.

  I have not tracked the API changes in 7 years, and with
  lucene-core-3.3.0 I was not able to find on Google why this
  is happening.
 
 
  with regards
  karthik
 
  On Fri, Dec 2, 2011 at 12:37 PM, Simon Willnauer 
  simon.willna...@googlemail.com wrote:
 
  What do you understand when you say optimize? Unless you tell us what
  this code does in your case and what you'd expect it to do, it's
  impossible to give you any reasonable answer.
 
  simon
 
  On Fri, Dec 2, 2011 at 4:54 AM, KARTHIK SHIVAKUMAR
  nskarthi...@gmail.com wrote:
   Hi
  
   Spec:
   O/S: Windows 7
   JDK: 1.6.0_29
   Lucene: lucene-core-3.3.0
  
  
  
   Finally, after indexing successfully, why does this code not optimize?
   (sample code)
  
              INDEX_WRITER.optimize(100);
              INDEX_WRITER.commit();
              INDEX_WRITER.close();
  
  
   *N.S.KARTHIK
   R.M.S.COLONY
   BEHIND BANK OF INDIA
   R.M.V 2ND STAGE
   BANGALORE
   560094*
 
 
 
 
 
  --
  *N.S.KARTHIK
  R.M.S.COLONY
  BEHIND BANK OF INDIA
  R.M.V 2ND STAGE
  BANGALORE
  560094*





 --
 *N.S.KARTHIK
 R.M.S.COLONY
 BEHIND BANK OF INDIA
 R.M.V 2ND STAGE
 BANGALORE
 560094*




Re: Spell check on a subset of an index (a 'namespace'-aware spell checker)

2011-12-06 Thread E. van Chastelet

I'm still struggling with this.

I've tried to implement the solution mentioned in my previous reply, but 
unfortunately there is a blocking issue with it:
I cannot find a way to create another index from the source index in a 
way that the new index has the field values in it. The only way to copy 
a document's field values from one index to another is to have stored 
fields. But stored fields hold the original String in its entirety, 
not the analyzed String, which I need. Is there another way to copy 
documents (with at least the spellcheck field) from one index to 
another?


Recap:
I have a source index holding documents for different namespaces. These 
documents hold one field (analyzed) that should be used for spell 
checking. I want to construct a spellchecker index for each namespace 
separately. To accomplish this, I first get the list of namespaces (each 
document has a namespace field in the original index). Then, for each 
namespace, I get the list of documents that match this namespace. Then 
I'd like to use this subset to construct a spellchecker index.


Regards,
Elmer


On 11/23/2011 03:28 PM, E. van Chastelet wrote:

I currently have an idea to get it done, but it's not a nice solution.

If we have an index Q with all documents for all namespaces, we first 
extract the list of all terms that appear for the field 'namespace' in Q 
(this field indicates the namespace of the document).


Then, for each namespace n in the terms list:
 - Get all docs from Q that match +namespace:n
 - Construct a temporary index from these docs
 - Use this temporary index to construct the dictionary, which the 
SpellChecker can use as input.
 - Call indexDictionary on SpellChecker to create spellcheck index for 
current namespace.

 - Delete temporary index

We now have separate spell check indexes for each namespace.

Any suggestions for a cleaner solution?

Regards,
Elmer van Chastelet



On 11/10/2011 01:16 PM, E. van Chastelet wrote:

Hi all,

In our project we'd like to have the ability to get search results 
scoped to one 'namespace' (as we call it). This can easily be 
achieved by using a filter or just an additional must-clause.
For the spellchecker (and our autocompletion, which is a modified 
spellchecker), the story seems different. The spell checker index is 
created using a LuceneDictionary, which has an IndexReader as source. 
We would like to get (spellcheck/autocomplete) suggestions that are 
scoped to one namespace (i.e. field 'namespace' should have a 
particular value).
With a single source index containing docs for all namespaces, it 
seems not possible to create a spellcheck index for each namespace 
in the ordinary way.
Q1: Is there a way to construct a LuceneDictionary from a subset of a 
single source index (all terms where namespace = %value%) ?


Another, maybe better solution is to customize the spellchecker by 
adding an additional namespace field to the spellchecker index. At 
query-time, an additional must-clause is added, scoping the 
suggestions to one (or more) namespace(s). The advantage of this is 
to have a singleton spellchecker (or at least the index reader) for 
all namespaces. This also means fewer open files held by our application 
(imagine if there are over 1000 namespaces).
Q2: Will there be a significant penalty (say more than 50% slower) 
for the additional must-clause at query time?


Q3: Or can you think of a better solution for this problem? :)

How we currently do it: we currently use Lucene 3.1 with Hibernate 
Search and we actually already have auto completion and spell 
checking scoped to one namespace. This is currently achieved by using 
index sharding, so each namespace has its own index and reader, and 
another for spell check and auto completion. Unfortunately there are 
some downsides to this:
- Our faceting engine has no good support for multiple indexes, so 
faceting only works on a single namespace
- Needs administration for mapping namespace identifier (String) to 
index number (integer)
- The number of shards (and thus namespaces) is currently hardcoded. 
At this moment it is set to 100, and this means Hibernate Search 
opens up 100 index readers/writers, while only n < 100 are in use, and 
therefore:

- Many open file descriptors
- Hard limit on number of namespaces

Therefore it seems better to switch back to having a single index for 
all namespaces.


Thanks!

Regards,
Elmer van Chastelet









Re: Spell check on a subset of an index (a 'namespace'-aware spell checker)

2011-12-06 Thread Ian Lea
There are utilities floating around for getting output from analyzers
- would that help?  I think there are some in LIA, probably others
elsewhere.  The idea being that you grab the stored fields from the
index, pass them through your analyzer, grab the output and use that.
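
Roughly this kind of thing, a sketch assuming the 3.1+ attribute API
(untested):

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ReAnalyze {
    // Push a stored field's text back through the analyzer and collect
    // the analyzed terms, e.g. to feed a per-namespace dictionary.
    static List<String> analyzedTerms(Analyzer analyzer, String field,
            String storedText) throws IOException {
        List<String> terms = new ArrayList<String>();
        TokenStream ts = analyzer.tokenStream(field,
                new StringReader(storedText));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            terms.add(term.toString());
        }
        ts.end();
        ts.close();
        return terms;
    }
}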

Or can you do something with TermEnum and/or TermDocs?  Not sure
exactly what or how, though ...


--
Ian.


tokenizing text using language analyzer but preserving stopwords if possible

2011-12-06 Thread Ilya Zavorin
I need to implement a quick-and-dirty, poor man's translation of a 
foreign-language document by looking up each word in a dictionary and replacing 
it with the English translation. So what I need is to tokenize the original 
foreign text into words and then access each word, look it up and get its 
translation. However, if possible, I also need to preserve non-words, i.e. 
stopwords, so that I can replicate them in the output stream without 
translating. If the latter is not possible then I just need to preserve the 
order of the original words so that their translations have the same order in 
the output.

Can I accomplish this using Lucene components? I presume I'd have to start by 
creating an analyzer for the foreign language, but then what? How do I (i) 
tokenize, (ii) access words in the correct order, (iii) also access non-words 
if possible?

Thanks much


Ilya Zavorin




Lucene 4.0 Index Format Finalization Timetable

2011-12-06 Thread Jamie Johnson
Is there a timetable for when it is expected to be finalized?  I'm not
looking for an exact date, just an approximation (next month, 2
months, 6 months, etc.)




Re: Lucene 4.0 Index Format Finalization Timetable

2011-12-06 Thread Robert Muir
On Tue, Dec 6, 2011 at 6:41 PM, Jamie Johnson jej2...@gmail.com wrote:
 Is there a timetable for when it is expected to be finalized?

it will be finalized when Lucene 4.0 is released.

-- 
lucidimagination.com




Re: Lucene 4.0 Index Format Finalization Timetable

2011-12-06 Thread Jamie Johnson
Thanks Robert.  Is there a timetable for that?  I'm trying to gauge
whether it is appropriate to push for my organization to move to the
current lucene 4.0 implementation (we're using solr cloud which is
built against trunk) or if it's expected there will be changes to what
is currently on trunk.  I'm not looking for anything hard, just trying
to plan as much as possible understanding that this is one of the
implications of using trunk.






Re: Lucene 4.0 Index Format Finalization Timetable

2011-12-06 Thread Darren Govoni

I asked here[1] and it said "Ask again later."

[1] http://8ball.tridelphia.net/




Re: Lucene 4.0 Index Format Finalization Timetable

2011-12-06 Thread Jamie Johnson
I suppose that's fair enough.  Some quick googling shows that this has
been asked many times with pretty much the same response.  Sorry to
add to the noise.


