Hi Mike,
If you just need the IDF you can run HighFreqTerms.java in contrib against
either your sample index or your index to get the N terms with the highest DF
values (i.e., lowest IDF). If you have a large index, giving it lots of memory
seems to help.
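For reference, the inverse relationship between DF and IDF can be sketched in plain Java. This is an illustration using the classic idf formula from Lucene 3.x's DefaultSimilarity (ln(numDocs/(docFreq+1)) + 1), not a substitute for running HighFreqTerms itself:

```java
public class IdfSketch {
    // idf = ln(numDocs / (docFreq + 1)) + 1, as in DefaultSimilarity.idf().
    // The higher a term's document frequency, the lower its idf.
    public static double idf(int numDocs, int docFreq) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        // A term appearing in 900k of 1M docs scores far lower than
        // one appearing in only 10 docs.
        System.out.println(idf(1000000, 900000));
        System.out.println(idf(1000000, 10));
    }
}
```

So the N highest-DF terms that HighFreqTerms reports are exactly the N lowest-IDF terms.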
Depending on your use case, you may inst
unicode version itself.
I would suggest just using your old icu jar and lucene-icu.jar until
you yourself want to upgrade... it's not guaranteed to work, but I
suspect it will :)
On Thu, Jul 14, 2011 at 2:08 PM, Burton-West, Tom wrote:
> We are about to upgrade to Solr/Lucene 3.3 from a 3.1
We are about to upgrade to Solr/Lucene 3.3 from a 3.1dev version (Lucene
Implementation Version: 3.1-SNAPSHOT 1036094 - 2010-11-19 16:01:10)
We have a 6 TB + index that includes somewhere over 200 languages that was
indexed with the ICUTokenizer and ICUFoldingFilter from 3.1dev and would like
Hi Ivan and Robert,
>> sounds like you should talk to Tom Burton-West!
Ok, I'll bite.
A few questions:
Are you planning to have separate fields for each language or the same fields
with contents in different languages?
If #2 are you planning to have a field to indicate the language so you can d
Hi Samar,
Have you looked at top or iostat or other monitoring utilities to see if you
are cpu bound vs I/O bound?
With 225 term queries, it's possible that you are I/O bound.
I suspect you need to think about seek time and caching. For each unique
field:term combination Lucene has to look up
Hi Samar,
>>Normal queries go fine under 500 ms but when people start searching
>>"anything" some queries take up to > 100 seconds. Don't you think
>>distributing smaller indexes on different machines would reduce the average
>>.search time. (Although I have a feeling that search time for smaller
Thanks for fixing++
Tom
-----Original Message-----
From: Uwe Schindler [mailto:u...@thetaphi.de]
Sent: Sunday, May 01, 2011 6:05 AM
To: d...@lucene.apache.org; simon.willna...@gmail.com;
java-user@lucene.apache.org
Subject: RE: Link to nightly build test reports on main Lucene site needs
updat
Hello,
I went to look at the "Hudson nightly builds" and tried to follow the link from
the main Lucene page
http://lucene.apache.org/java/docs/developer-resources.html#Nightly
The links to the Clover Test Coverage Reports point to
http://hudson.zones.apache.org/hudson/view/Lucene/job/Lucene-
Hi,
If I understand correctly what you are trying to do as far as getting corpusTF,
you might want to look at the implementation of the "-t" flag in
org.apache.lucene.misc.HighFreqTerms in contrib.
Take a look at the getTotalTermFreq method in trunk.
http://svn.apache.org/viewvc/lucene
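The distinction the "-t" flag draws can be sketched in plain Java (an illustration of the two counts, not the Lucene implementation): docFreq counts how many documents contain a term, while corpusTF sums the term's occurrences across the whole collection:

```java
import java.util.*;

public class TermCounts {
    // docs: each document represented as its list of tokens.

    // Number of documents containing the term (what HighFreqTerms
    // reports without -t).
    public static int docFreq(List<List<String>> docs, String term) {
        int n = 0;
        for (List<String> d : docs) if (d.contains(term)) n++;
        return n;
    }

    // Total occurrences across all documents (what -t adds, via
    // getTotalTermFreq in trunk).
    public static int corpusTF(List<List<String>> docs, String term) {
        int n = 0;
        for (List<String> d : docs)
            for (String t : d) if (t.equals(term)) n++;
        return n;
    }
}
```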
I'm trying to get a feel for the impact of changing the termIndexInterval from
the default of 128 to 1024 (8 * 128). This reduces the tii file to 1/8th of its
size, but in the worst case requires doing a linear scan of 1024 terms instead
of 128 in memory. I'm not so concerned about the perform
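The tradeoff can be put as back-of-the-envelope arithmetic (a rough model; actual entry sizes vary):

```java
public class TiiTradeoff {
    // Approximate number of in-memory tii entries for a given
    // termIndexInterval: one entry per 'interval' tis entries.
    public static long tiiEntries(long numTerms, int interval) {
        return numTerms / interval;
    }

    // Worst-case number of terms scanned linearly in memory after the
    // tii binary search lands on an index entry.
    public static int worstCaseScan(int interval) {
        return interval;
    }
}
```

Raising the interval from 128 to 1024 cuts the entry count by 8x while multiplying the worst-case scan by the same factor.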
Hello all,
We have very large documents with large numbers of unique terms. Our
documents average about 800,000 KB and about 200,000 tokens. In trying to
understand how often the ramBuffer gets flushed to disk we turned on the
IndexWriter log.
true
With the Solr default setting of ramBuffer
, February 04, 2011 3:19 PM
To: java-user@lucene.apache.org
Subject: Re: Bigrams for CJK with ICUTokenizer ?
On Fri, Feb 4, 2011 at 3:07 PM, Burton-West, Tom wrote:
> Thanks Robert,
>
> Lucene 2740 looks really interesting. In the meantime a JIRA issue for this
> sounds like a good id
11 12:58 PM
To: java-user@lucene.apache.org
Subject: Re: Bigrams for CJK with ICUTokenizer ?
On Fri, Feb 4, 2011 at 12:46 PM, Burton-West, Tom wrote:
> Hello all,
>
> We are using the ICUTokenizer because we have documents in about 400
> different languages. We are also setting autoGenerate
Hello all,
We are using the ICUTokenizer because we have documents in about 400 different
languages. We are also setting autoGeneratePhraseQueries to false so that CJK
and other languages that don't use spaces to separate words won't get tokenized
properly by the ICUTokenizer and then the toke
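A schema.xml fragment for this kind of setup might look like the following (the field type name is illustrative; autoGeneratePhraseQueries is a fieldType attribute in Solr 3.x, and the ICU factories ship in the analysis-extras contrib):

```xml
<!-- Hypothetical multilingual field type -->
<fieldType name="text_icu" class="solr.TextField"
           autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```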
Hi all,
I see in the javadoc for the ICUTokenizer that it has special handling for
Lao, Myanmar, and Khmer word breaking, but no details in the javadoc about what it
does with CJK, which for C and J appears to be breaking into unigrams. Is this
correct?
Tom
Hello all,
We have an extremely large number of terms in our indexes. I want to be able
to extract a sample of the terms, say something like every 128th term. If I
use code based on org.apache.lucene.misc.HighFreqTerms or
org.apache.lucene.index.CheckIndex I would get a TermsEnum, call
term
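The sampling step itself is simple; a minimal sketch over a sorted term list (a stand-in for stepping a TermsEnum and keeping every 128th term it returns):

```java
import java.util.*;

public class TermSampler {
    // Keep every 'interval'-th term from a sorted list of terms.
    public static List<String> sample(List<String> sortedTerms, int interval) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < sortedTerms.size(); i += interval) {
            out.add(sortedTerms.get(i));
        }
        return out;
    }
}
```

With interval 128 this yields roughly numTerms/128 sample points, which is the same spacing the tii index uses by default.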
Can you give more details about what you want? Perhaps with an example?
Do you want the number of documents containing the query term, the number of
occurrences of the query term within a document, or the number of occurrences
of the term in the entire index?
You can use an explain query to get
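For the per-query score breakdown, Solr's debugQuery parameter (which wraps Lucene's IndexSearcher.explain) returns an explanation per matching document; a hypothetical request, with host, port, and term as placeholders:

```
http://localhost:8983/solr/select?q=your_term&debugQuery=true
```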
Hi Andy,
We are currently indexing about 650,000 full-text books per Solr/Lucene
index. We have 10 shards for a total of about 6.5 million documents and our
average response time is under 2 seconds, but the slowest 1% of queries take
between 5-30 seconds. If you were searching only on
The work on MultiPassIndexSplitter is being done by Andrzej Bialecki, the
creator of Luke.
See http://lucene-eurocon.org/sessions-track1-day1.html#3
http://lucene-eurocon.org/slides/Munching-&-crunching-Lucene-index-post-processing-and-applications_Andrzej-Bialecki.pdf
The slides say "SinglePas
Hi all,
>>Re scalability of filter construction - the database is likely to hold stable
>>primary keys not lucene doc ids
>>which are unstable in the face of updates.
This is the scalability issue I was concerned about. Assume the database call
efficiently retrieves a sorted array of 50,000 s
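The construction cost being discussed can be sketched with a plain java.util.BitSet (an illustration of the shape of the work, not Lucene's filter classes): each stable primary key must be mapped to its current, unstable Lucene doc id before its bit can be set:

```java
import java.util.*;

public class KeyFilterSketch {
    // keys: stable primary keys from the database.
    // keyToDocId: the (per-index-version) mapping to Lucene doc ids.
    public static BitSet buildDocIdSet(int[] keys,
                                       Map<Integer, Integer> keyToDocId) {
        BitSet bits = new BitSet();
        for (int key : keys) {
            // One lookup per key: the O(n) cost that must be repaid
            // whenever doc ids shift after updates.
            Integer doc = keyToDocId.get(key);
            if (doc != null) bits.set(doc);
        }
        return bits;
    }
}
```

For 50,000 keys this is 50,000 lookups per filter build, which is the scalability question raised above.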
Hi Mike and Martin,
We have a similar use-case. Is there a scalability/performance issue with the
getDocIdSet having to iterate through hundreds of thousands of docIDs?
Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search
-----Original Message-----
From: Michael McCandless [mai
Thanks Mike,
At some point maybe the File Formats Document could be updated to make it clear
that the tii has an entry similar to the IndexInterval'th tis entry, but instead
of holding frq/prx deltas it holds absolute pointers. Is it worth entering a
JIRA issue? I would be happy to update the
Hi all,
Please let me know if this should be posted instead to the Lucene java-dev list.
We have very large tis files (about 36 GB). I have not been too concerned as I
assumed that due to the indexing of the tis file by the tii file, only a small
portion of the file needed to be read. However
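The back-of-the-envelope arithmetic behind that assumption can be sketched as follows (approximate: tii entries hold absolute pointers rather than deltas, so they aren't exactly the same width as tis entries):

```java
public class TiiSize {
    // Rough in-memory tii footprint: one tii entry per indexInterval
    // tis entries, so approximately tisBytes / indexInterval.
    public static long approxTiiBytes(long tisBytes, int indexInterval) {
        return tisBytes / indexInterval;
    }
}
```

By this estimate a 36 GB tis file with the default interval of 128 implies a tii on the order of a few hundred MB resident in memory, with the tis itself read only in small seeks.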