Thanks, Otis, for your response. I have a few more questions:
1) Is it recommended to do index partitioning for large indexes?
- We index around 35 fields (storing only two of them - simple ids)
- Each document is around 200 bytes
- Our index grows by around 50 GB a week
2) The reaso
Hi all,
I've been tracking down a problem happening in our production environment.
When we switch an index after doing deletes & adds, running some searches,
and finally changing the pointer from the old index to the new one, all the
threads start stacking up, waiting on isDeleted(). The threads seem to fin
Release 2.3.0 of Lucene Java is now available!
Many new features, optimizations, and bug fixes have been added since
2.2, including:
* significantly improved indexing performance
* segment merging in background threads
* refreshable IndexReaders
* faster StandardAnalyzer and improved Toke
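For anyone upgrading, here is a minimal sketch of the refreshable-reader
pattern that announcement bullet refers to, assuming the 2.3
IndexReader.reopen() API (the wrapper class and index path are invented
for illustration):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;

public class RefreshExample {
  // Swaps in a fresh reader only if the index has changed; reopen()
  // shares unchanged segments with the old reader, so this is cheap.
  static IndexReader refresh(IndexReader reader) throws IOException {
    IndexReader refreshed = reader.reopen();
    if (refreshed != reader) {
      reader.close();  // close the old one once in-flight searches finish
    }
    return refreshed;
  }
}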
> I've been poking around the list archives and didn't really come up against
> anything interesting. Anyone using Lucene to index OCR text? Any
> strategies/algorithms/packages you recommend?
>
> I have a large collection (10^7 docs) that's mostly the result of OCR. We
> index/search/etc. with Luc
Lots of luck to you, because I haven't a clue. My company deals with
OCR data and we haven't had a single workable idea. Of course, our
data sets are minuscule compared to what you're talking about, so we
haven't tried to heuristically clean up the data.
But given that Google is scanning the entir
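One heuristic in that "clean up the data" direction, purely as a sketch
(the filter name and the threshold are invented, built against the 2.3
TokenStream API): drop tokens whose ratio of alphabetic characters is low,
since OCR noise like "l1|!i" rarely survives such a test while real words do.

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Drops tokens that are mostly non-letters -- typical OCR garbage.
public class AlphaRatioFilter extends TokenFilter {
  private final float minRatio;

  public AlphaRatioFilter(TokenStream input, float minRatio) {
    super(input);
    this.minRatio = minRatio;  // e.g. 0.7f: at least 70% letters
  }

  public Token next() throws IOException {
    for (Token t = input.next(); t != null; t = input.next()) {
      String text = t.termText();
      int letters = 0;
      for (int i = 0; i < text.length(); i++) {
        if (Character.isLetter(text.charAt(i))) letters++;
      }
      if (text.length() > 0 && (float) letters / text.length() >= minRatio) {
        return t;  // looks like a real word; keep it
      }
      // otherwise skip the token and keep scanning
    }
    return null;
  }
}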
Hi,
I am very new to Lucene & Hadoop, and I have a project where I need to
use Lucene to index some input given either as a huge collection of
Java objects or as one huge Java object.
I read about Hadoop's MapReduce utilities and I want to leverage that feature
in my case described above.
I've been poking around the list archives and didn't really come up against
anything interesting. Anyone using Lucene to index OCR text? Any
strategies/algorithms/packages you recommend?
I have a large collection (10^7 docs) that's mostly the result of OCR. We
index/search/etc. with Lucene withou
> Or, you could just do things twice. That is, send your text through
> a TokenStream, then call next() and count. Then send it all
> through doc.add().
Hm.
This means reading the content twice, no matter whether we use our own
analyzer or override/wrap the main analyzer.
Is there a hook anywhere
Oh, also, I don't think not using CFS would lead to this, unless it's
somehow triggering too many file descriptors...
Mike
Cam Bazz wrote:
no. only after that there was a gc error.
I am also not using the compound index file format in order to increase
indexing speed. could it be because of that?
Hmm, you should have seen an exception before that one from optimize.
Can you post the GC error? Was it an OutOfMemoryError situation?
Mike
On Jan 24, 2008, at 5:32 PM, Cam Bazz wrote:
no. only after that there was a gc error.
I am also not using the compound index file format in order to increase
indexing speed. could it be because of that?
no. only after that there was a gc error.
I am also not using the compound index file format in order to increase
indexing speed. could it be because of that?
I will run the test case again tomorrow. What can I do to increase logging?
Best,
-C.B.
On Jan 24, 2008 11:52 PM, Michael McCandless <[EMAIL PROTECTED]> wrote:
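On the logging question above, a sketch (index path and analyzer choice
are placeholders): IndexWriter.setInfoStream() makes the writer print
segment flushes, merges, and any exception a background merge thread hits,
which is usually the fastest way to see what preceded an optimize() failure.

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class VerboseIndexing {
  public static void main(String[] args) throws IOException {
    IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
    writer.setInfoStream(System.out);  // verbose flush/merge diagnostics to stdout
    // ... addDocument() calls go here ...
    writer.close();
  }
}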
That means that one of the merges, which run in the background by
default with 2.3, hit an unhandled exception.
Did you see another exception logged / printed to stderr before this
one?
Mike
Cam Bazz wrote:
Does anyone have any idea about the error I got while indexing?
Best Regards,
Hi Itamar,
On 01/24/2008 at 2:55 PM, Itamar Syn-Hershko wrote:
> > Lucene does not store proximity relations between data in different
> > fields, only within individual fields
>
> So are two calls to doc->add with the same field but different
> texts considered as one field (the latter call being i
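A sketch of what that looks like in practice (field name and values are
arbitrary): two add() calls with the same field name are indexed as one
logical field, and the analyzer's getPositionIncrementGap() controls the
position gap that phrase and proximity queries see across the boundary.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

Document doc = new Document();
doc.add(new Field("body", "first chunk of text", Field.Store.NO, Field.Index.TOKENIZED));
doc.add(new Field("body", "second chunk of text", Field.Store.NO, Field.Index.TOKENIZED));
// Both values end up in the single indexed field "body". Override
// Analyzer.getPositionIncrementGap("body") with a large value to keep
// phrases from matching across the boundary between the two values.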
Does anyone have any idea about the error I got while indexing?
Best Regards,
-C.B.
Exception in thread "main" java.io.IOException: background merge hit
exception: _kq:C962870 _kr:C2591 into _ks [optimize]
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1749)
at org.apach
OK, I will give this a try.
Now I have the problem that I do not know how to get the offsets (or
positions? What is the difference?) back from the matched document...
There is an IndexReader#termPositions(Term t), but this returns the
positions for the whole index, not for a single document.
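Briefly, on the difference: positions are token counts (first token,
second token, ...) and are what PhraseQuery works with; character offsets
into the original text are only available if you index term vectors with
offsets (Field.TermVector.WITH_POSITIONS_OFFSETS). For positions limited
to one document, a sketch assuming you already know the docId (field and
term are placeholders):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermPositions;

TermPositions tp = reader.termPositions(new Term("content", "foo"));
try {
  if (tp.skipTo(docId) && tp.doc() == docId) {  // jump straight to one document
    int freq = tp.freq();
    for (int i = 0; i < freq; i++) {
      int pos = tp.nextPosition();  // token position within this document
      System.out.println("position: " + pos);
    }
  }
} finally {
  tp.close();
}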
Hi all,
Just FYI, perhaps this is old news for you ... This large corpus is
freely available and it is pairwise sentence-aligned for all language
combinations. This looks like a good resource for linguistic
information, such as frequent words and phrases, n-gram profiles, etc.
http://wt.jrc.
Steve and all,
I didn't know whether to send a detailed description of my case to aid with
seeing the whole picture, or to send a list of short questions which will
require loads of follow-up. I guess I know what is better now, thanks
>> Lucene does not store proximity relations between data
I think you'll have to implement your own Analyzer and count.
That is, every call to next() that returns a token will have to
also increment some counter by 1.
To use this, you must have some way of knowing when a page
ends, and at that point you call your instance of your custom
analyzer to see w
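A sketch of that counting wrapper against the 2.3 Token API (the class
name is invented); because it counts while the indexer pulls tokens, the
content is only read once:

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Counts tokens as they stream past during normal indexing.
public class CountingAnalyzer extends Analyzer {
  private final Analyzer delegate;
  private int count = 0;

  public CountingAnalyzer(Analyzer delegate) { this.delegate = delegate; }

  public int getCount() { return count; }
  public void reset() { count = 0; }  // call between pages/documents

  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new TokenFilter(delegate.tokenStream(fieldName, reader)) {
      public Token next() throws IOException {
        Token t = input.next();
        if (t != null) count++;  // the "hook": bump the counter per token
        return t;
      }
    };
  }
}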
> -Original Message-
> From: Erick Erickson [mailto:[EMAIL PROTECTED]
> Sent: Friday, January 11, 2008 16:16
> To: java-user@lucene.apache.org
> Subject: Re: Design questions
> But you could also vary this scheme by simply storing in your document
> the offsets for the beginning of each page
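A sketch of that offset-storing variant (field names, the int[] of page
starts, and the encode() helper are all invented): keep the character
offset of each page start in a stored field, then map a match offset back
to its page with a binary search.

import java.util.Arrays;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Index time: pageStarts[i] holds the character offset where page i begins.
Document doc = new Document();
doc.add(new Field("body", fullText, Field.Store.NO, Field.Index.TOKENIZED));
doc.add(new Field("pageStarts", encode(pageStarts),  // encode(): made-up int[] -> String helper
        Field.Store.YES, Field.Index.NO));

// Search time: which page does a character offset fall on?
int page = Arrays.binarySearch(pageStarts, matchOffset);
if (page < 0) page = -page - 2;  // not an exact page start: take the preceding page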
Yes, sorry, that's the case.
Thank you!
> -Original Message-
> From: Erick Erickson [mailto:[EMAIL PROTECTED]
> Sent: Thursday, January 24, 2008 19:49
> To: java-user@lucene.apache.org
> Subject: Re: Creating search query
>
> That should work fine, assuming that foo and bar are the untokenized
> fields and content is the tokenized content.
That should work fine, assuming that foo and bar are the untokenized
fields and content is the tokenized content.
Erick
On Jan 24, 2008 1:18 PM, <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I have an index with some fields which are indexed and un_tokenized
> (keywords) and one field which is indexed and tokenized (content).
Thank you.
> -Original Message-
> From: Lukas Vlcek [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, January 23, 2008 08:23
> To: java-user@lucene.apache.org
> Subject: Re: Compass
>
> Hi,
>
> I am using Compass with Spring and JPA. It works pretty nicely.
> I don't store the index in a database
Hi,
I have an index with some fields which are indexed and un_tokenized
(keywords) and one field which is indexed and tokenized (content).
Now I want to create a Query-Object:
TermQuery k1 = new TermQuery(new Term("foo", "some foo"));
TermQuery k2 = new TermQuery(new Term("bar",
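For completeness, one way the combined query might look, assuming all
clauses are required (the "bar" value and the content term are
placeholders, since the message is cut off here):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

BooleanQuery query = new BooleanQuery();
query.add(new TermQuery(new Term("foo", "some foo")), BooleanClause.Occur.MUST);
query.add(new TermQuery(new Term("bar", "some bar")), BooleanClause.Occur.MUST);
// For the tokenized content field, the term must match what the analyzer
// produced at index time (typically lowercased).
query.add(new TermQuery(new Term("content", "word")), BooleanClause.Occur.MUST);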
In general, you just need to denormalize the data and create a list of
Genes, adding each Gene's related information via SQL. Ranking can easily
be adjusted via each field's weight; not a big deal.
This seems like an ideal case for DBSight. It can also do incremental
indexing, which you may need.
--
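On the "each field's weight" point, a small sketch (field name, value,
and boost factor are invented): per-field boosts are set at index time.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

Document doc = new Document();
Field name = new Field("geneName", "some gene", Field.Store.YES, Field.Index.TOKENIZED);
name.setBoost(3.0f);  // matches on this field count 3x toward the score
doc.add(name);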
Thank you Steven and Yonik,
I think I got it. And I can see that LogMergePolicy uses
Math.log() when picking merges. :-)
Thank you again,
Koji
On Jan 24, 2008 8:40 AM, Steven Parkes <[EMAIL PROTECTED]> wrote:
> I'm curious, why is LogMergePolicy named *Log*MergePolicy?
> (Why not ExpMergePolicy? :-)
>
> Well, I guess it's a matter of perspective. When you look at the way the
> algorithm works, the merge decisions are based
I'm curious, why is LogMergePolicy named *Log*MergePolicy?
(Why not ExpMergePolicy? :-)
Well, I guess it's a matter of perspective. When you look at the way the
algorithm works, the merge decisions are based on a concept of level, and
levels are assigned based on the log of the number of documents in a
segment.
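Paraphrasing (not the exact wording of the source), each segment is
assigned a level roughly like

  level = (int) (Math.log(segmentSize) / Math.log(mergeFactor));

so with mergeFactor=10, sizes 1-9 land on level 0, 10-99 on level 1,
100-999 on level 2, and a merge is triggered once mergeFactor segments
accumulate on the same level. The levels are logarithmic while the
segment sizes they describe grow exponentially, hence the naming question
cuts both ways.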
Hello,
I'm curious, why is LogMergePolicy named *Log*MergePolicy?
(Why not ExpMergePolicy? :-)
Thank you,
Koji
Hi,
(Warning: not for the faint-hearted)
I'm currently working on a project where we have a large and complex data
model, related to Genomics. We are trying to build a search engine that
provides "full text" and "field-based text" searches for our customer base
(mostly academic research), and are
vivek sar wrote:
I have a field as NO_NORMS; does it have to be untokenized to be able to
sort on it?
NO_NORMS is the same as UNTOKENIZED + omitNorms, so you can sort on that.
Antony
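A sketch of the combination, assuming one single-token value per document
(field name, value, and the search call are illustrative):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Sort;

Document doc = new Document();
// NO_NORMS: indexed as a single token with norms omitted -- sortable.
doc.add(new Field("optime", "20080124173205", Field.Store.NO, Field.Index.NO_NORMS));

// Later, at search time, sort results on that field:
Hits hits = searcher.search(query, new Sort("optime"));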
Hi all.
I need to check two conditions in a search:
first I need to find documents matching a bank name, and then, among
those, I need to find the documents containing a particular city.
Finally I need the documents which satisfy both conditions,
i.e., documents with bank + city.
Please, can anyone help me?
Thanks,
prathiba.P
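One sketch of such a query (field names and values are made up, since the
schema isn't given): make both conditions mandatory, either with
QueryParser as below or with a BooleanQuery of two MUST clauses as in the
foo/bar thread above.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

QueryParser parser = new QueryParser("bank", new StandardAnalyzer());
// AND requires both clauses: the bank name and the city must both match.
Query query = parser.parse("bank:\"national bank\" AND city:springfield");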
On Thu, 2008-01-24 at 08:18 +1100, Antony Bowesman wrote:
> These are odd. The last case in both of the above shows a slowdown
> compared to the 2.1 index and version, and in the first 50K queries the
> 2.3 index and version is even slower than 2.3 with the 2.1 index. It
> catches up in the longer
Is there anything I can do to make my unit test pass?
Or is it impossible?
Thanks a lot,
Fabrice
Fabrice Robini wrote:
>
> Hi Srikant,
>
> I really thank you for your reply, it's very interesting.
> I have to say I am confused with that now...
> I do not know what I can do to pass this U
I have a field as NO_NORMS; does it have to be untokenized to be able to
sort on it?
On Jan 21, 2008 12:47 PM, Antony Bowesman <[EMAIL PROTECTED]> wrote:
> vivek sar wrote:
> > I need to be able to sort on optime as well, thus need to store it .
>
> Lucene's default sorting does not need the field to be stored, only
> indexed.