RE: Problems indexing large documents

2006-06-09 Thread Rob Staveley (Tom)
I'm trying to come to terms with http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMaxFieldLength(int) too. I've been attempting to index large text files as single Lucene documents, passing them as java.io.Reader to cope with RAM. I was assuming (like - I suspect

Problems indexing large documents

2006-06-09 Thread manu mohedano
Problem solved! Thanks a lot, guys!!!

Re: Indexing question

2006-06-09 Thread Erick Erickson
Couple of things. 1> You can use a different analyzer to NOT remove stopwords. SimpleAnalyzer comes to mind (though watch out for case). Look at LuceneInAction for an explanation of several analyzers that are available. 2> If memory serves, Lucene defaults to indexing only the first 10,000 word
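The effect of a per-field term limit, as described in point 2 above, can be sketched in plain Java. This is a hypothetical simulation, not Lucene's own code: the class `FieldLengthLimitSketch`, the method `indexedTerms`, and the whitespace tokenization are all assumptions made for illustration. In Lucene itself, the limit is the one adjusted through IndexWriter.setMaxFieldLength(int), mentioned elsewhere in this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of a per-field term limit; not Lucene's implementation.
class FieldLengthLimitSketch {

    /** Splits text on whitespace and keeps at most maxTerms tokens, silently dropping the rest. */
    static List<String> indexedTerms(String text, int maxTerms) {
        List<String> kept = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // an empty input yields one empty token; skip it
            }
            if (kept.size() >= maxTerms) {
                break; // everything past the limit never reaches the index
            }
            kept.add(token);
        }
        return kept;
    }
}
```

With a limit of 3, the text "uno dos tres cuatro" keeps only the first three tokens, so a search for "cuatro" would find nothing. That matches the symptom reported for large documents: small files index fully, large ones lose their tail.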

Re: Adding Fields to Documents with UnStored Fields - crazy scheme?

2006-06-09 Thread Yonik Seeley
On 6/8/06, Bob Arens <[EMAIL PROTECTED]> wrote: I've been handed a legacy index containing Documents with two Fields; one is a file ID, the other is contents of the file. The contents field was added using UnStored. Now, we want to add fields. Oh, the humanity! My crazy idea - can we add new Doc

Re: Numbertools and efficient sorting

2006-06-09 Thread Chris Hostetter
: I have an integer field that I've indexed after converting to a string : using NumberTools.longToString(). : Now I want to sort my results using this field. Everything works when : treating the field as a string, but is very slow and memory intensive. : : I want to use INT sorting instead, but

Re: Adding Fields to Documents with UnStored Fields - crazy scheme?

2006-06-09 Thread Chris Hostetter
: > fileID twice .. if you mean you want the list of fileIDs that match : > both : > clauses, you're not going to get any results back -- because no doc : > with a : > contents field is going to have a title field, and no doc with a title : > field is going to have a contents field. : I'd want bot

RE: Problems indexing large documents

2006-06-09 Thread Pasha Bizhan
Hi, > From: manu mohedano [mailto:[EMAIL PROTECTED] > Hi All! I have a problem... When I index text documents in > English, there is no problem, but when I index Spanish text > documents (and they're big), a lot of information from the > document doesn't get indexed (I suppose it is due to

Re: Problems indexing large documents

2006-06-09 Thread Daniel Naber
On Friday 09 June 2006 21:31, manu mohedano wrote: > Hi All! I have a problem... When I index text documents in English, > there is no problem, but when I index Spanish text documents (and > they're big), a lot of information from the document doesn't get > indexed Read the FAQ at http://wiki.a

Problems indexing large documents

2006-06-09 Thread manu mohedano
Hi All! I have a problem... When I index text documents in English, there is no problem, but when I index Spanish text documents (and they're big), a lot of information from the document doesn't get indexed (I suppose it is due to the Analyzer, but if the document is less than 400kb it works per

Re: Adding Fields to Documents with UnStored Fields - crazy scheme?

2006-06-09 Thread Bob Arens
: That kinda would be the point - "contents:germany" would get the same : fileIDs, but "contents:germany title:medicine" would (hopefully) give : us a more specific query. when you say "contents:germany title:medicine" i'm not sure if you are assuming that both clauses are mandatory or option

RE: Different scoring mechanism

2006-06-09 Thread Chris Hostetter
: For example: a query containing two terms: "fast", "car", having : document frequencies 300.000 and 20.000 in the index respectively. In a : worst case scenario this would require 320.000 document scores to be : calculated. I am not really sure how lucene optimizes its search, but I : guess it

Re: Adding Fields to Documents with UnStored Fields - crazy scheme?

2006-06-09 Thread Chris Hostetter
: > : would consist of two Documents, : > : Document X: fileID:, contents: : > : Document Y:fileID:, title:, url:, etc. : > add another document with the same fileID and a title field and a url : > field, and you search for "contents:germany" you're still going to get : > back the same document -

Re: Adding Fields to Documents with UnStored Fields - crazy scheme?

2006-06-09 Thread Bob Arens
If the old index is optimized then you might be able to iterate through all the docs in your old index (sorted by doc id) and for each iteration add the corresponding doc to the new index so it has a matching doc id. The idea being that after searching on one index you could use the doc id

Re: Aggregating category hits

2006-06-09 Thread Peter Keegan
I compared Solr's DocSetHitCollector and counting bitset intersections to get facet counts with a different approach that uses a custom hit collector that tests each docid hit (bit) with each facets' bitset and increments a count in a histogram. My assumption was that for queries with few hits, th
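The two facet-counting strategies compared above can be sketched with java.util.BitSet. This is a minimal illustration under stated assumptions, not Solr's DocSetHitCollector or the poster's custom collector: the class `FacetCountSketch` and both method names are hypothetical, and the hits are assumed to be available as a bitset of doc ids.

```java
import java.util.BitSet;

// Hypothetical illustration of two ways to count facet hits; not Solr's code.
class FacetCountSketch {

    /** Approach A: intersect the hit bitset with each facet bitset and count the set bits. */
    static int[] countByIntersection(BitSet hits, BitSet[] facets) {
        int[] counts = new int[facets.length];
        for (int f = 0; f < facets.length; f++) {
            BitSet overlap = (BitSet) hits.clone(); // clone so the caller's hits are not modified
            overlap.and(facets[f]);
            counts[f] = overlap.cardinality();
        }
        return counts;
    }

    /** Approach B: visit each hit doc id once and test it against every facet bitset. */
    static int[] countByHitIteration(BitSet hits, BitSet[] facets) {
        int[] histogram = new int[facets.length];
        for (int doc = hits.nextSetBit(0); doc >= 0; doc = hits.nextSetBit(doc + 1)) {
            for (int f = 0; f < facets.length; f++) {
                if (facets[f].get(doc)) {
                    histogram[f]++;
                }
            }
        }
        return histogram;
    }
}
```

Both methods return the same counts. The difference is in the work done: approach A scales with the size of the bitsets regardless of how many hits there are, while approach B scales with the number of hits times the number of facets, which is why it may be attractive for queries with few hits.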

Indexing question

2006-06-09 Thread manumohedano
Hi All! I have a problem... When I index text documents in English, there is no problem, but when I index Spanish text documents (and they're big), a lot of information from the document doesn't get indexed (I suppose it is due to the Analyzer). However, I want to index ALL the strings in the docum

Numbertools and efficient sorting

2006-06-09 Thread Benjamin Stein
I have an integer field that I've indexed after converting to a string using NumberTools.longToString(). Now I want to sort my results using this field. Everything works when treating the field as a string, but is very slow and memory intensive. I want to use INT sorting instead, but these strin
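The reason a number gets converted to a string before indexing can be sketched as follows. This is a hypothetical zero-padding scheme, not the actual encoding used by NumberTools.longToString(); the class `SortableLongSketch` and its methods are assumptions for illustration, and negative values are deliberately left out. The point is only that a fixed-width encoding makes lexicographic string order agree with numeric order.

```java
// Hypothetical fixed-width encoding; NOT the format produced by NumberTools.longToString().
class SortableLongSketch {

    static final int WIDTH = 19; // number of decimal digits in Long.MAX_VALUE

    /** Encodes a non-negative long as a zero-padded decimal string of fixed width. */
    static String encode(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values are not handled in this sketch");
        }
        return String.format("%0" + WIDTH + "d", value);
    }

    /** Recovers the original number; Long.parseLong accepts leading zeros. */
    static long decode(String encoded) {
        return Long.parseLong(encoded);
    }
}
```

Without padding, "9" sorts after "10" as a string; with padding, the encoded form of 9 sorts before the encoded form of 10. An encoding of this kind sorts correctly as a string, but a sort that parses each term as an integer presumably needs plain integer text instead, so one option (an assumption, not something stated in the thread) is to index the plain value in a separate untokenized field used only for sorting.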

COMMIT_LOCK_TIMEOUT - IndexSearcher/IndexReader

2006-06-09 Thread Michael Duval
Hi All, Has anyone else out there come across the shortcomings of the new COMMIT_LOCK_TIMEOUT with regard to searching on an actively updated index? It used to be a settable system property and therefore "semi" dynamic across a system with multiple readers/searchers and one writer. I am awar

searching multiple indexes in multiple servers.

2006-06-09 Thread Omar Didi
Hi all, my index size has grown too much and I keep getting OutOfMemoryError after running a few searches. I am using all the RAM that the JVM allows me, 2.6GB. I am left with two solutions now: the easy and expensive solution is to upgrade the hardware to a 64-bit system and use more RAM. the

combining two query calls in one?

2006-06-09 Thread zzzzz shalev
hey, i am using the pmsearcher to retrieve data from a number of ram indexes. i am calling my own search function which calls the indexsearcher.search method and returns the top 100 ids/scores, however, before returning the topdocs i start a separate thread which requeries the index and

Re: adding term information to Index

2006-06-09 Thread Grant Ingersoll
Hi Patricio, As of now, I don't think this is possible. However, we are slowly but surely working on similar problems. Please feel free to add your two cents to http://wiki.apache.org/jakarta-lucene/FlexibleIndexing as we are considering several new ideas related to making indexing more fle

RE: Different scoring mechanism

2006-06-09 Thread Trieschnigg, R.B. \(Dolf\)
> :! If a document does not contain a queryterm this score > can be larger > : or smaller than 0 ! > > if a document doesn't contain a term, then the scorer for > that query will never even try to score that document -- > regardless of what your Similarity class looks like. > > if you real

Re: Multisearch Problem

2006-06-09 Thread Dan Wiggin
My Lucene version is 1.4.3 and it has always worked. Someday I will have to move to Lucene 2.0, but that isn't the problem. The problem is something like this: one index has something indexed and the other index is only created, without any documents. It's very strange because this problem

RE: Adding Fields to Documents with UnStored Fields - crazy scheme?

2006-06-09 Thread Robert Haycock
Hi Bob, No idea if this would work BUT... If the old index is optimized then you might be able to iterate through all the docs in your old index (sorted by doc id) and for each iteration add the corresponding doc to the new index so it has a matching doc id. The idea being that after searching on

RE: Compound / non-compound index files and SIGKILL

2006-06-09 Thread Rob Staveley (Tom)
I am no longer a Jira virgin. http://issues.apache.org/jira/browse/LUCENE-594 Thanks again. -Original Message- From: Chris Hostetter [mailto:[EMAIL PROTECTED] Sent: 09 June 2006 07:13 To: java-user@lucene.apache.org Subject: RE: Compound / non-compound index files and SIGKILL : Whom sh

RE: Property comparison possible??

2006-06-09 Thread Robert Haycock
He he, nice comparison! Cheers for the advice. Rob. -Original Message- From: Chris Hostetter [mailto:[EMAIL PROTECTED] Sent: 09 June 2006 08:00 To: java-user@lucene.apache.org Subject: RE: Property comparison possible?? : Is it possible to perform a search using fields instead of terms

Re: Adding Fields to Documents with UnStored Fields - crazy scheme?

2006-06-09 Thread Bob Arens
On Jun 9, 2006, at 2:10 AM, Chris Hostetter wrote: : 2. Recreating the index from scratch will require the moving of the : heavens and the earth. : : My crazy idea - can we add new Documents to the index with the Fields : we wish to add, and duplicate file IDs? i.e. an entry for file ID Fo

Re: Different scoring mechanism

2006-06-09 Thread Chris Hostetter
:! If a document does not contain a queryterm this score can be larger : or smaller than 0 ! if a document doesn't contain a term, then the scorer for that query will never even try to score that document -- regardless of what your Similarity class looks like. if you really want this kind of

Re: return single document from duplicated documents in index

2006-06-09 Thread Chris Hostetter
take a look at the HitCollector and Filter APIs .. you can implement any logic you want in either of those classes to restrict what results you get -- and the FieldCache gives you an easy way to check what the value of a particular indexed field is. storing the mappings of field value to "best" m

Re: Adding Fields to Documents with UnStored Fields - crazy scheme?

2006-06-09 Thread Chris Hostetter
: 2. Recreating the index from scratch will require the moving of the : heavens and the earth. : : My crazy idea - can we add new Documents to the index with the Fields : we wish to add, and duplicate file IDs? i.e. an entry for file ID Foo : would consist of two Documents, : Document X: fileID:,

RE: Property comparison possible??

2006-06-09 Thread Chris Hostetter
: Is it possible to perform a search using fields instead of terms, eg. : like this sql: : SELECT col1, col2 : FROM table1 : WHERE col1 = col2 presumably "col1" and "col2" are untokenized fields? (otherwise equality is kind of vague) if you really wanted to add a constraint like this to an exist