RE: Obtaining IDF values for the terms in a document set

2011-12-15 Thread Burton-West, Tom
Hi Mike, If you just need the IDF you can run HighFreqTerms.java in contrib against either your sample index or your index to get the N terms with the highest DF values (i.e. lowest IDF). If you have a large index, giving it lots of memory seems to help. Depending on your use case, you may inst
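
As a rough illustration of why the highest-DF terms are also the lowest-IDF terms, here is a minimal sketch. It assumes the classic Lucene-style formula idf = 1 + ln(numDocs / (docFreq + 1)); the terms, document frequencies, and the function names idf and top_terms_by_df are hypothetical and not taken from the thread.

```python
import math


def idf(doc_freq, num_docs):
    """Assumed classic formula: 1 + ln(num_docs / (doc_freq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def top_terms_by_df(doc_freqs, n):
    """Return the n (term, df) pairs with the highest document frequency."""
    return sorted(doc_freqs.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical document frequencies in a 1000-document index.
num_docs = 1000
doc_freqs = {"the": 990, "lucene": 120, "termindexinterval": 3}

ranked = top_terms_by_df(doc_freqs, 3)
for term, df in ranked:
    # Terms are listed from highest DF to lowest, so IDF increases down the list.
    print(term, df, round(idf(df, num_docs), 3))
```

Because IDF falls as DF rises, sorting by DF in descending order is equivalent to sorting by IDF in ascending order, whatever the exact IDF variant.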

RE: Does change to ICU in Lucene/Solr 3.3 require re-indexing?

2011-07-14 Thread Burton-West, Tom
unicode version itself. I would suggest just using your old icu jar and lucene-icu.jar until you yourself want to upgrade... it's not guaranteed to work but I suspect it will :) On Thu, Jul 14, 2011 at 2:08 PM, Burton-West, Tom wrote: > We are about to upgrade to Solr/Lucene 3.3 from a 3.1

Does change to ICU in Lucene/Solr 3.3 require re-indexing?

2011-07-14 Thread Burton-West, Tom
We are about to upgrade to Solr/Lucene 3.3 from a 3.1dev version (Lucene Implementation Version: 3.1-SNAPSHOT 1036094 - 2010-11-19 16:01:10) We have a 6 TB + index that includes somewhere over 200 languages that was indexed with the ICUTokenizer and ICUFoldingFilter from 3.1dev and would like

RE: Non-English Languages Search

2011-05-13 Thread Burton-West, Tom
Hi Ivan and Robert, >> sounds like you should talk to Tom Burton-West! Ok, I'll bite. A few questions: Are you planning to have separate fields for each language or the same fields with contents in different languages? If #2 are you planning to have a field to indicate the language so you can d

RE: Sharding Techniques

2011-05-12 Thread Burton-West, Tom
Hi Samar, Have you looked at top or iostat or other monitoring utilities to see if you are cpu bound vs I/O bound? With 225 term queries, it's possible that you are I/O bound. I suspect you need to think about seek time and caching. For each unique field:term combination lucene has to look up

RE: Sharding Techniques

2011-05-10 Thread Burton-West, Tom
Hi Samar, >>Normal queries go fine under 500 ms but when people start searching >>"anything" some queries take up to > 100 seconds. Don't you think >>distributing smaller indexes on different machines would reduce the average >>search time. (Although I have a feeling that search time for smaller

RE: Link to nightly build test reports on main Lucene site needs updating

2011-05-02 Thread Burton-West, Tom
Thanks for fixing++ Tom -Original Message- From: Uwe Schindler [mailto:u...@thetaphi.de] Sent: Sunday, May 01, 2011 6:05 AM To: d...@lucene.apache.org; simon.willna...@gmail.com; java-user@lucene.apache.org Subject: RE: Link to nightly build test reports on main Lucene site needs updat

Link to nightly build test reports on main Lucene site needs updating

2011-04-29 Thread Burton-West, Tom
Hello, I went to look at the "Hudson nightly builds" and tried to follow the link from the main Lucene page http://lucene.apache.org/java/docs/developer-resources.html#Nightly The links to the Clover Test Coverage Reports point to http://hudson.zones.apache.org/hudson/view/Lucene/job/Lucene-

RE: TermDoc to TermDocsEnum

2011-03-23 Thread Burton-West, Tom
Hi, If I understand correctly what you are trying to do as far as getting corpusTF, you might want to look at the implementation of the "-t" flag in org.apache.lucene.misc/HighFreqTerms.java in contrib. Take a look at the getTotalTermFreq method in trunk. http://svn.apache.org/viewvc/lucene

termIndexInterval, CheckIndex, size of tis file and Lucene index compression

2011-03-21 Thread Burton-West, Tom
I'm trying to get a feel for the impact of changing the termIndexInterval from the default of 128 to 1024 (8 * 128). This reduces the size of the tii file to 1/8th but in the worst case requires doing a linear scan of 1024 terms instead of 128 in memory. I'm not so concerned about the perform
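
The trade-off described in this message can be sketched with some simple arithmetic. This is only an illustration under stated assumptions: the term count is hypothetical, the helper names tii_entries and worst_case_scan are made up for this example, and the model assumes one term-index entry per termIndexInterval terms with a linear scan of at most roughly one interval after the index lookup.

```python
import math


def tii_entries(num_terms, interval):
    """Assumed model: one term-index entry for every `interval` terms."""
    return math.ceil(num_terms / interval)


def worst_case_scan(interval):
    """Upper bound on terms scanned linearly after locating the index entry."""
    return interval


# Hypothetical number of unique terms in the index.
num_terms = 1_024_000_000

for interval in (128, 1024):
    print(
        "interval", interval,
        "index entries", tii_entries(num_terms, interval),
        "worst-case scan", worst_case_scan(interval),
    )
```

Under this model, raising the interval by a factor of 8 cuts the number of index entries to one eighth while multiplying the worst-case scan length by 8.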

Understanding the IndexWriter-Infostream log

2011-03-17 Thread Burton-West, Tom
Hello all, We have very large documents with large numbers of unique terms. Our documents average about 800,000 KB and about 200,000 tokens. In trying to understand how often the ramBuffer gets flushed to disk we turned on the IndexWriter log. true With the Solr default setting of ramBuffer

RE: Bigrams for CJK with ICUTokenizer ?

2011-02-04 Thread Burton-West, Tom
, February 04, 2011 3:19 PM To: java-user@lucene.apache.org Subject: Re: Bigrams for CJK with ICUTokenizer ? On Fri, Feb 4, 2011 at 3:07 PM, Burton-West, Tom wrote: > Thanks Robert, > > Lucene 2740 looks really interesting.  In the meantime a JIRA issue for this > sounds like a good id

RE: Bigrams for CJK with ICUTokenizer ?

2011-02-04 Thread Burton-West, Tom
11 12:58 PM To: java-user@lucene.apache.org Subject: Re: Bigrams for CJK with ICUTokenizer ? On Fri, Feb 4, 2011 at 12:46 PM, Burton-West, Tom wrote: > Hello all, > > We are using the ICUTokenizer because we have documents in about 400 > different languages.   We are also setting autoGenerate

Bigrams for CJK with ICUTokenizer ?

2011-02-04 Thread Burton-West, Tom
Hello all, We are using the ICUTokenizer because we have documents in about 400 different languages. We are also setting autoGeneratePhraseQueries to false so that CJK and other languages that don't use space to separate words won't get tokenized properly by the ICUTokenizer and then the toke

ICUTokenizer and CJK

2010-11-22 Thread Burton-West, Tom
Hi all, I see in the javadoc for the ICUTokenizer that it has special handling for Lao, Myanmar, Khmer word breaking but no details in the javadoc about what it does with CJK, which for C and J appears to be breaking into unigrams. Is this correct? Tom
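
To make the unigram versus bigram distinction concrete, here is a minimal sketch of the two tokenization strategies applied to a run of CJK characters. It is not the ICUTokenizer implementation; the function names and the sample string are assumptions chosen for illustration.

```python
def unigrams(text):
    """Split a run of characters into single-character tokens."""
    return list(text)


def bigrams(text):
    """Split a run of characters into overlapping two-character tokens."""
    if len(text) < 2:
        # A run shorter than two characters cannot form a bigram.
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


# Hypothetical sample: three CJK characters with no spaces between them.
sample = "東京都"

print(unigrams(sample))
print(bigrams(sample))
```

Unigrams match any document containing the individual characters, while overlapping bigrams require adjacent character pairs, which is why bigram indexing tends to give more precise matches for languages written without spaces.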

API access to in-memory tii file (3.x not flex).

2010-11-10 Thread Burton-West, Tom
Hello all, We have an extremely large number of terms in our indexes. I want to be able to extract a sample of the terms, say something like every 128th term. If I use code based on org.apache.lucene.misc.HighFreqTerms or org.apache.lucene.index.CheckIndex I would get a TermsEnum, call term

RE: High frequency term for the searched query

2010-11-04 Thread Burton-West, Tom
Can you give more details about what you want? Perhaps with an example? Do you want the number of documents containing the query term, the number of occurrences of the query term within a document, or the number of occurrences of the term in the entire index? You can use an explain query to get

RE: scalability limit in terms of numbers of large documents

2010-08-16 Thread Burton-West, Tom
Hi Andy, We are currently indexing about 650,000 full-text books per Solr/Lucene index. We have 10 shards for a total of about 6.5 million documents and our average response time is under 2 seconds, but the slowest 1% of queries take between 5 and 30 seconds. If you were searching only on

RE: Question to the writer of MultiPassIndexSplitter

2010-08-05 Thread Burton-West, Tom
The work on MultiPassIndexSplitter is being done by Andrzej Bialecki, the creator of Luke. See http://lucene-eurocon.org/sessions-track1-day1.html#3 http://lucene-eurocon.org/slides/Munching-&-crunching-Lucene-index-post-processing-and-applications_Andrzej-Bialecki.pdf The slides say "SinglePas

RE: on-the-fly "filters" from docID lists

2010-07-23 Thread Burton-West, Tom
Hi all, >>Re scalability of filter construction - the database is likely to hold stable >>primary keys not lucene doc ids >>which are unstable in the face of updates. This is the scalability issue I was concerned about. Assume the database call efficiently retrieves a sorted array of 50,000 s

RE: on-the-fly "filters" from docID lists

2010-07-22 Thread Burton-West, Tom
Hi Mike and Martin, We have a similar use-case. Is there a scalability/performance issue with the getDocIdSet having to iterate through hundreds of thousands of docIDs? Tom Burton-West http://www.hathitrust.org/blogs/large-scale-search -Original Message- From: Michael McCandless [mai

RE: Understanding lucene indexes and disk I/O

2010-04-13 Thread Burton-West, Tom
Thanks Mike, At some point maybe the File Formats Document could be updated to make it clear that the tii has an entry similar to the IndexInterval'th tis entry but instead of holding frq/prx deltas it holds absolute pointers. Is it worth entering a JIRA issue? I would be happy to update the

Understanding lucene indexes and disk I/O

2010-04-12 Thread Burton-West, Tom
Hi all, Please let me know if this should be posted instead to the Lucene java-dev list. We have very large tis files (about 36 GB). I have not been too concerned as I assumed that due to the indexing of the tis file by the tii file, only a small portion of the file needed to be read. However