Re: Stop Word list File processing

2009-06-30 Thread Shayak Sen
Hi, When you make the custom filter for removing stopwords, use the constructor to load the stopwords list, then use it as you were doing earlier. On Wed, Jul 1, 2009 at 1:23 PM, Harsha1<99harsha.h@gmail.com> wrote: > > Hi, > I have a string from which I need to filter out some words (say stop words
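A minimal sketch of the advice above, assuming Lucene 2.4-era APIs (WordlistLoader, StopFilter); the class name and the file layout (one stop word per line) are illustrative:

    import java.io.File;
    import java.io.IOException;
    import java.io.Reader;
    import java.util.Set;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.lucene.analysis.WordlistLoader;

    // Whitespace tokenization plus stop-word removal, with the stop words
    // loaded once from a file in the constructor.
    public class StopwordWhitespaceAnalyzer extends Analyzer {
        private final Set stopWords;

        public StopwordWhitespaceAnalyzer(File stopwordFile) throws IOException {
            this.stopWords = WordlistLoader.getWordSet(stopwordFile);
        }

        public TokenStream tokenStream(String fieldName, Reader reader) {
            return new StopFilter(new WhitespaceTokenizer(reader), stopWords);
        }
    }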

Re: Term Frequency vector consumes memory

2009-06-30 Thread Ganesh
Thanks for your reply. My requirement is to fetch the list of the most frequent terms indexed in a day. I used the logic described in this article: http://stackoverflow.com/questions/195434/how-can-i-get-top-terms-for-a-subset-of-documents-in-a-lucene-index I enabled the term vector for a fi
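For reference, reading a stored term vector back in the 2.4 API looks roughly like this; reader, docId, and the field name "contents" are assumptions:

    // Requires the field to have been indexed with Field.TermVector.YES.
    TermFreqVector tfv = reader.getTermFreqVector(docId, "contents");
    if (tfv != null) {
        String[] terms = tfv.getTerms();        // terms in this document
        int[] freqs = tfv.getTermFrequencies(); // parallel frequencies
    }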

Stop Word list File processing

2009-06-30 Thread Harsha1
Hi, I have a string from which I need to filter out some words (say stop words). But I want to use WhitespaceAnalyzer. So I have created a custom analyzer with the capability of WhitespaceAnalyzer plus filtering of unwanted words. Since the String array of stop words keeps growing, I would like to put this i

Re: Order of fields within a Document in Lucene 2.4+

2009-06-30 Thread Mark Miller
Yeah, I've heard rumblings about this issue before. I can't remember what patch changed it though - one of Mike M's I think? On Tue, Jun 30, 2009 at 8:40 PM, Chris Hostetter wrote: > > Hmmm... I'm not an expert on the internals of indexing, and I don't use > FieldSelectors much, but this seems li

Re: Order of fields within a Document in Lucene 2.4+

2009-06-30 Thread Chris Hostetter
Hmmm... I'm not an expert on the internals of indexing, and I don't use FieldSelectors much, but this seems like a pretty big bug to me ... or at the very least: a change in behavior that completely eliminates the value of LOAD_AND_BREAK. https://issues.apache.org/jira/browse/LUCENE-1727 :

Re: Scaling out/up or a mix

2009-06-30 Thread Marcus Herou
Hi, I like the sound of this. What I am not familiar with in terms of Lucene is how the index gets swapped in and out of memory. When it comes to database tables (non-partitionable tables at least) I know that one should have enough memory to fit the entire index into memory to avoid file-sorts for

Re: Scaling out/up or a mix

2009-06-30 Thread Marcus Herou
Hi. The number of concurrent users today is insignificant, but once we push for the service we will get into trouble... I know that since even one simple faceting query (which we will use to display trend graphs) can take forever (talking about Solr, btw). "Normal" Lucene queries (title:blah OR desc

Re: MultiSegmentReader problems - current is null

2009-06-30 Thread liat oren
Ok, thanks a lot - I will try that tomorrow Best, Liat 2009/6/30 Simon Willnauer > Hi, > On Sun, Jun 28, 2009 at 2:39 PM, liat oren wrote: > > Hi, > > > > I have an index that is a multi-segment index (how come it is created > this > > way?) > > > > When I try to get the freq of a term in the following way: > TermDocs tDocs = this.indexReader.termDocs(term); > tf = tDocs.freq(); > the freq m

Re: Lucene 2.9

2009-06-30 Thread Mark Miller
I hope July. Could easily be August though. I'm kicking and screaming to get it out soon though. It's been hurting my highbrow reputation. On Tue, Jun 30, 2009 at 2:41 PM, Siraj Haider wrote: > is there an ETA for the Lucene 2.9 release? > > -siraj

Re: Query which gives high score proportional to 'distinct term matches'

2009-06-30 Thread Matthew Hall
Well, we have a very similar requirement here, but for us it's every single field where we wanted this kind of behavior. We got this in by eliminating the TF (term frequency) contribution to the score via a custom Similarity (which is very easy to do). I... think in the newer versions of lucen
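The custom Similarity that Matthew describes is commonly written like this against the 2.4 API; the class name is made up:

    import org.apache.lucene.search.DefaultSimilarity;

    // Flattens the term-frequency contribution: any number of occurrences
    // of a term in a document scores like a single occurrence.
    public class FlatTfSimilarity extends DefaultSimilarity {
        public float tf(float freq) {
            return freq > 0 ? 1.0f : 0.0f;
        }
    }

    // usage (search time): searcher.setSimilarity(new FlatTfSimilarity());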

Lucene 2.9

2009-06-30 Thread Siraj Haider
is there an ETA for the Lucene 2.9 release? -siraj

Query which gives high score proportional to 'distinct term matches'

2009-06-30 Thread chandrakant k
I have an index which has fields like title and content. If I search for, let's say, obama fly, then documents containing obama and fly should be given high scores irrespective of the number of times the terms occur. This requirement is for the fields title and content. The implementation whi

Re: Term Frequency vector consumes memory

2009-06-30 Thread Grant Ingersoll
In Lucene, a Term Vector is a specific thing that is stored on disk when creating a Document and Field. It is optional and off by default. It is separate from being able to get the term frequencies for all the docs in a specific field. The former is decided at indexing time and there is
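The second route Grant distinguishes - collection-wide term frequencies for one field, no term vectors needed - can be sketched with TermEnum in the 2.4 API; reader and the field name "contents" are assumptions:

    // Walks the term dictionary for one field, reading document frequencies.
    TermEnum terms = reader.terms(new Term("contents", ""));
    try {
        do {
            Term t = terms.term();
            if (t == null || !"contents".equals(t.field())) break;
            int df = terms.docFreq(); // number of docs containing the term
            // collect (t.text(), df) into a bounded priority queue here
        } while (terms.next());
    } finally {
        terms.close();
    }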

Re: A simple Vector Space Model and TFIDF usage

2009-06-30 Thread Grant Ingersoll
On Jun 29, 2009, at 3:10 PM, Amir Hossein Jadidinejad wrote: Hi, It's my first experiment with Lucene. Please help me. I'm going to index a set of documents and create a feature vector for each of them. This vector contains all terms belonging to the document, weighted using TFIDF. After tha

Re: Scaling out/up or a mix

2009-06-30 Thread Andy Goodell
I have improved date-sorted searching performance pretty dramatically by replacing the two-step "search then sort" operation with a one-step "use the date as the score" algorithm. The main gotcha was making sure not to affect which results get counted as hits in boolean searches, but overall I onl
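One way to express a one-step "date as the score" query against the 2.4 API - not necessarily Andy's exact implementation - is a CustomScoreQuery over an indexed int field; the field name "dateScore", its encoding (e.g. days since epoch), and the parser variable are assumptions:

    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.function.CustomScoreQuery;
    import org.apache.lucene.search.function.FieldScoreQuery;

    // The sub-query alone decides what matches; the per-document date
    // value alone decides the ranking.
    Query userQuery = parser.parse("title:iphone"); // parser assumed
    FieldScoreQuery dateValue =
        new FieldScoreQuery("dateScore", FieldScoreQuery.Type.INT);
    CustomScoreQuery byDate = new CustomScoreQuery(userQuery, dateValue) {
        public float customScore(int doc, float subQueryScore, float valSrcScore) {
            return valSrcScore; // ignore relevance entirely
        }
    };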

Re: Highlighter fails using JapaneseAnalyzer

2009-06-30 Thread Matthew Hall
Does the same thing happen when you use SimpleAnalyzer, or StandardAnalyzer? I have a sneaking suspicion that the : in your contents string is what's causing your issue here, as : is a reserved character that denotes a field specification. But I could be wrong. Try swapping analyzers, if you no l

Highlighter fails using JapaneseAnalyzer

2009-06-30 Thread k.sayama
Hello. I've tried to highlight a string using Highlighter (2.4.1) and JapaneseAnalyzer, but the following code extract shows the problem: String F = "f"; String CONTENTS = "AAA :BBB CCC"; JapaneseAnalyzer analyzer = new JapaneseAnalyzer(); QueryParser qp = new QueryParser( F, analyzer ); Query quer
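A filled-in version of the truncated repro, for readers following along (contrib Highlighter; JapaneseAnalyzer is the third-party lucene-ja analyzer, and the query string is a guess):

    String F = "f";
    String CONTENTS = "AAA :BBB CCC";
    JapaneseAnalyzer analyzer = new JapaneseAnalyzer();
    QueryParser qp = new QueryParser(F, analyzer);
    Query query = qp.parse("BBB");
    // contrib highlighter: scores fragments against the parsed query
    Highlighter highlighter = new Highlighter(new QueryScorer(query));
    String fragment = highlighter.getBestFragment(analyzer, F, CONTENTS);
    System.out.println(fragment); // null would reproduce the report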

Re: optimized searching

2009-06-30 Thread Simon Willnauer
On Tue, Jun 30, 2009 at 3:21 PM, Ian Lea wrote: > Have you read the javadocs? What does collector.getTotalHits() return? > Does it return the same when you use new TopDocCollector(1000) and > some other number? Are you asking basically the same questions in 2 > different threads at the same time?

Re: optimized searching

2009-06-30 Thread Erick Erickson
Are you willing to pay me to do your job for you? Sorry to be snarky, but please be aware that we're volunteers here; it's pretty presumptuous to ask for this. You still haven't answered what it is you're trying to do. Why are you collecting 1,000 titles? What's the purpose? Are you just expe

Re: optimized searching

2009-06-30 Thread Ian Lea
Have you read the javadocs? What does collector.getTotalHits() return? Does it return the same when you use new TopDocCollector(1000) and some other number? Are you asking basically the same questions in 2 different threads at the same time? You are still iterating over many hits and that will s
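Spelled out, the collector Ian is pointing at works like this in 2.4:

    // getTotalHits() reports how many documents matched in total, even
    // though only the requested top N are kept.
    TopDocCollector collector = new TopDocCollector(1000);
    searcher.search(query, collector);
    int totalMatches = collector.getTotalHits(); // full match count
    TopDocs topDocs = collector.topDocs();       // at most 1000 entries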

Re: MultiSegmentReader problems - current is null

2009-06-30 Thread Simon Willnauer
Hi, On Sun, Jun 28, 2009 at 2:39 PM, liat oren wrote: > Hi, > > I have an index that is a multi-segment index (how come it is created this > way?) > > When I try to get the freq of a term in the following way: > TermDocs tDocs = this.indexReader.termDocs(term); > tf = tDocs.freq(); > the freq m
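The likely fix, and presumably what Simon suggested: a TermDocs enumerator starts positioned before the first document, so next() must be called before freq() - until then the multi-segment reader's internal current is null, which matches the NPE reported later in this thread. A sketch:

    TermDocs tDocs = indexReader.termDocs(term);
    try {
        int tf = 0;
        while (tDocs.next()) {   // position the enumerator before calling freq()
            tf += tDocs.freq();  // frequency of the term within the current doc
        }
    } finally {
        tDocs.close();
    }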

Re: MultiSegmentReader problems - current is null

2009-06-30 Thread liat oren
Ohh, right. That resolves the problem I mentioned in the second email I sent. However, in the first mail I sent, the current of the multi-segment reader is null, which causes that problem. Thanks Liat 2009/6/30 Simon Willnauer > On Mon, Jun 29, 2009 at 9:55 AM, liat oren wrote: > > The full er

RE: Read large size index

2009-06-30 Thread Uwe Schindler
There was a code snippet in my mail, just fill in your code. I cannot do everything for you. With some programming experience you should understand what's going on: > searcher.search(query, new HitCollector() { > @Override public void collect(int docid, float score) { > // do

Re: MultiSegmentReader problems - current is null

2009-06-30 Thread Simon Willnauer
On Mon, Jun 29, 2009 at 9:55 AM, liat oren wrote: > The full error is: > Exception in thread "main" java.lang.NullPointerException >        at > Priorart.Lucene.Expert.index.MultiSegmentReader$MultiTermDocs.freq(Mu > ltiSegmentReader.java:709) > I looked at issue > LUCENE-781

RE: Read large size index

2009-06-30 Thread m.harig
Thanks Uwe, can you please give me a code snippet so that I can resolve my issue? The correct way to iterate over all results is to use a custom HitCollector (Collector in 2.9) instance. The HitCollector's method collect(docid, score) is called for every hit. No need to a

Re: MultiSegmentReader problems - current is null

2009-06-30 Thread liat oren
lucene-2.4.1 Thanks, Liat 2009/6/29 Simon Willnauer > Quick question, which version of lucene do you use?! > > simon > > On Mon, Jun 29, 2009 at 9:55 AM, liat oren wrote: > > The full error is: > > Exception in thread "main" java.lang.NullPointerException > >at > > Priorart.Lucene.Exper

RE: Read large size index

2009-06-30 Thread Uwe Schindler
The correct way to iterate over all results is to use a custom HitCollector (Collector in 2.9) instance. The HitCollector's method collect(docid, score) is called for every hit. No need to allocate arrays then: e.g.: searcher.search(query, new HitCollector() { @Override public void collect
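A filled-in version of Uwe's snippet (2.4 API); the body of collect() is whatever per-hit work is needed:

    searcher.search(query, new HitCollector() {
        @Override
        public void collect(int docid, float score) {
            // called once per matching document; no result arrays are
            // allocated, so this scales to hit counts of any size
        }
    });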

Re: optimized searching

2009-06-30 Thread m.harig
Thanks Erick. in Ian's link, particularly see the section "Don't iterate over more hits than necessary". A couple of other things: 1> Loading the entire document just to get a field or two isn't very efficient, think about lazy loading (See FieldSelector) I've done it, but have a couple of ques

Re: Read large size index

2009-06-30 Thread Simon Willnauer
On Tue, Jun 30, 2009 at 2:30 PM, m.harig wrote: > > > > Hi there, > > On Tue, Jun 30, 2009 at 12:41 PM, m.harig wrote: >> >> Thanks Simon, >> >> It's working now, thanks a lot. I've a doubt: >> >> I've got 30,000 PDF files indexed, but if I use the code which you >> sent, it returns only 200 results, because I am setting TopDocs topDocs = >> se

Re: Read large size index

2009-06-30 Thread m.harig
Hi there, On Tue, Jun 30, 2009 at 12:41 PM, m.harig wrote: > > Thanks Simon, > > It's working now, thanks a lot. I've a doubt: > > I've got 30,000 PDF files indexed, but if I use the code which you > sent, it returns only 200 results, because I am setting TopDocs topDocs = > se

Re: Read large size index

2009-06-30 Thread Simon Willnauer
Hi there, On Tue, Jun 30, 2009 at 12:41 PM, m.harig wrote: > > Thanks Simon, > > It's working now, thanks a lot. I've a doubt: > > I've got 30,000 PDF files indexed, but if I use the code which you > sent, it returns only 200 results, because I am setting TopDocs topDocs = > searc

Re: optimized searching

2009-06-30 Thread Erick Erickson
in Ian's link, particularly see the section "Don't iterate over morehits than necessary". A couple of other things: 1> Loading the entire document just to get a field or two isn't very efficient, think about lazy loading (See FieldSelector) 2> What do you mean when you say "not very good"? Us
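Point 1's lazy loading, sketched against the 2.4 API; reader, docId, and the field name "title" are assumptions:

    import java.util.Collections;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.FieldSelector;
    import org.apache.lucene.document.SetBasedFieldSelector;

    // Load only the "title" field; other stored fields are never read.
    FieldSelector titleOnly = new SetBasedFieldSelector(
        Collections.singleton("title"), // fields to load eagerly
        Collections.EMPTY_SET);         // fields to load lazily (none)
    Document doc = reader.document(docId, titleOnly);
    String title = doc.get("title");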

Re: Modifying score based on tf and slop

2009-06-30 Thread Rads2029
Restarting this thread. I did try the solution mentioned by Simon below; however, that did not work. Changing the tf implementation to return 1 adversely affected span scoring, i.e. the slop distance does not affect the score if I make tf return 1. I had found a workaround in some other way, but
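One workaround sometimes suggested for exactly this conflict - flattening plain term frequency while keeping slop-sensitive phrase scores - relies on the (2.4-era) detail that TermScorer goes through Similarity.tf(int) while the sloppy phrase scorer accumulates sloppyFreq() values and calls tf(float). This is an assumption about internals, not necessarily the workaround Rads found:

    import org.apache.lucene.search.DefaultSimilarity;

    public class FlatTermTfSimilarity extends DefaultSimilarity {
        // flattens tf for ordinary term queries only
        public float tf(int freq) {
            return freq > 0 ? 1.0f : 0.0f;
        }
        // tf(float) is inherited unchanged, so slop distance still matters
    }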

Re: Read large size index

2009-06-30 Thread m.harig
Thanks Simon, it's working now, thanks a lot. I've a doubt: I've got 30,000 PDF files indexed, but if I use the code which you sent, it returns only 200 results, because I am setting TopDocs topDocs = searcher.search(query, 200); as I said, if I use Integer.MAX_VALUE, it return

RE: Scaling out/up or a mix

2009-06-30 Thread Uwe Schindler
> On Mon, 2009-06-29 at 09:47 +0200, Marcus Herou wrote: > > Index size(and growing): 16Gx8 = 128G > > Doc size (data): 20k > > Num docs: 90M > > Num users: Few hundred but most critical is that the admin staff which > is > > using the index all day long. > > Query types: Example: title:"Iphone" OR

RE: Scaling out/up or a mix

2009-06-30 Thread Toke Eskildsen
On Tue, 2009-06-30 at 11:29 +0200, Uwe Schindler wrote: > So the simple answer is always: > If 64 bit platform with lots of RAM, use MMapDirectory. Fair enough. That makes the RAM-focused solution much more scalable. My point still stands though, as Marcus is currently examining his hardware optio
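For readers wondering how to actually select MMapDirectory: on 2.4 the FSDirectory implementation is chosen globally via a system property (2.9 later added direct constructors such as new MMapDirectory(File)). A sketch; the index path is a placeholder:

    // Must run before the first FSDirectory.getDirectory() call.
    System.setProperty("org.apache.lucene.FSDirectory.class",
        "org.apache.lucene.store.MMapDirectory");
    Directory dir = FSDirectory.getDirectory("/path/to/index");
    IndexSearcher searcher = new IndexSearcher(dir);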

Re: Read large size index

2009-06-30 Thread Simon Willnauer
Hey there, On Tue, Jun 30, 2009 at 10:41 AM, wrote: > Thanks Simon > > this is my code, but I am getting null: > > IndexReader open = IndexReader.open(indexDir); > > IndexSearcher searcher = new IndexSearcher(open); > > final String fName = "contents"; > >

Re: optimized searching

2009-06-30 Thread Ian Lea
What exactly is the problem? Are you concerned about the time that your code snippet takes to run, or how much memory it uses? If you have a query that matches many documents then iterating through all of them, as your code does, is inevitably going to take time. See http://wiki.apache.org/lucen

Re: Scaling out/up or a mix

2009-06-30 Thread Toke Eskildsen
On Mon, 2009-06-29 at 09:47 +0200, Marcus Herou wrote: > Index size(and growing): 16Gx8 = 128G > Doc size (data): 20k > Num docs: 90M > Num users: Few hundred but most critical is that the admin staff which is > using the index all day long. > Query types: Example: title:"Iphone" OR description:"I

Term Frequency vector consumes memory

2009-06-30 Thread Ganesh
At the end of the day, I build stats of the top indexed terms. I enabled the term frequency vector for a single field. It is working fine; I am able to get the top terms and their frequencies. But it consumes a huge amount of RAM. My index size is 5 GB and has 8 million records. If I didn't enable ter