Re: Multiple indexes

2005-03-01 Thread Otis Gospodnetic
Ben, You do need to use a separate instance of those 3 classes for each index yes. But this is really something like: IndexWriter writer = new IndexWriter(); So it's normal code-writing process you don't really have to create anything new, just use existing Lucene API. As for locking,

Re: Ranking Terms

2005-02-26 Thread Otis Gospodnetic
Make sure you are not indexing your documents using the compound index format (default in the newer versions of Lucene). Then you will see the .frq file. Here is an example from one of Simpy's Lucene indices: -rw-r--r-- 1 simpy simpy 629073 Feb 26 13:14 _1ao.frq Otis -- http://www.si

Re: Not entire document being indexed?

2005-02-24 Thread Otis Gospodnetic
Use Luke to peek in your index and find out what really got indexed. You could also try the extreme case and set that max value to the max Integer. Otis --- "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> wrote: > Hi everyone > > I'm having a bizarre problem with a few of the documents here that do >

Re: Lucene vs. in-DB-full-text-searching

2005-02-18 Thread Otis Gospodnetic
You are right. Since there are C++ and now C ports of Lucene, it would be interesting to integrate them directly with DBs, so that the RDBMS full-text search under the hood is actually powered by one of the Lucene ports. Otis --- David Spencer <[EMAIL PROTECTED]> wrote: > Otis Gospodne

Re: Search Performance

2005-02-18 Thread Otis Gospodnetic
]> wrote: > Wouldn't this leave open file handles? I had a problem where there > were lots of open file handles for deleted index files, because the > old searchers were not being closed. > > On Fri, 18 Feb 2005 13:41:37 -0800 (PST), Otis Gospodnetic > <[EMAIL PROTECTED]>

Re: Document comparison

2005-02-18 Thread Otis Gospodnetic
Matt, Erik and I have some code for this in Lucene in Action, but David Spencer did this since the book was published: http://www.lucenebook.com/blog/announcements/more_like_this.html Otis --- Matt Chaput <[EMAIL PROTECTED]> wrote: > Is there a simple, efficient way to compute similarity of

Re: Search Performance

2005-02-18 Thread Otis Gospodnetic
Or you could just open a new IndexSearcher, forget the old one, and have GC collect it when everyone is done with it. Otis --- Chris Lamprecht <[EMAIL PROTECTED]> wrote: > I should have mentioned, the reason for not doing this the obvious, > simple way (just close the Searcher and reopen it if a
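The "open a new one and forget the old one" idea can be sketched in plain Java as a holder whose reference is swapped atomically. This is only an illustration: `SearcherHolder` is a hypothetical helper, not part of Lucene, and the type parameter `S` stands in for whatever searcher class is in use. Callers that already fetched the old instance keep using it until they finish, after which it becomes unreachable.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical holder: S stands in for a searcher type such as IndexSearcher.
final class SearcherHolder<S> {
    private final AtomicReference<S> current;

    SearcherHolder(S initial) {
        current = new AtomicReference<>(initial);
    }

    // Callers grab the searcher once per request and keep using that instance.
    S get() {
        return current.get();
    }

    // Publishes a freshly opened searcher and returns the one it replaced.
    S swap(S fresh) {
        return current.getAndSet(fresh);
    }

    public static void main(String[] args) {
        SearcherHolder<String> holder = new SearcherHolder<>("searcher-v1");
        String inFlight = holder.get();   // a request that started before the swap
        holder.swap("searcher-v2");       // index changed, publish a new searcher
        System.out.println(inFlight + " still usable; new requests get " + holder.get());
    }
}
```

Whether the replaced searcher should also be closed explicitly is the open-file-handle question raised elsewhere in this thread; the sketch leaves that decision to the caller by returning the old instance from `swap`.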

Re: Lucene vs. in-DB-full-text-searching

2005-02-18 Thread Otis Gospodnetic
The most obvious answer is that the full-text indexing features of RDBMS's are not as good (as fast) as Lucene. MySQL, PostgreSQL, Oracle, MS SQL Server etc. all have full-text indexing/searching features, but I always hear people complaining about the speed. A person from a well-known online boo

Re: Concurrent searching & re-indexing

2005-02-16 Thread Otis Gospodnetic
Hi Paul, If I understand your setup correctly, it looks like you are running multiple threads that create IndexWriter for the same directory. That's a "no no". This section (first hit) describes the various concurrency issues with regard to adds, updates, optimization, and searches: http://www

Re: What does [] do to a query and what's up with lucene.apache.org?

2005-02-14 Thread Otis Gospodnetic
Hi, lucene.apache.org seems to work now. Here is the query syntax: http://lucene.apache.org/queryparsersyntax.html [] is used as [BEGIN-RANGE-STRING TO END-RANGE-STRING] Otis --- Jim Lynch <[EMAIL PROTECTED]> wrote: > First I'm getting a > > > The requested URL could not be retrieved

RE: Multiple Fields with same name

2005-02-11 Thread Otis Gospodnetic
Hi, It's been a while since I've used that feature, but I believe they will always be in the same order, but I seem to recall that they will be in the reverse order. Whichever way they come, you can always reverse it if the other order is better for you. java.util.Collections class has a number
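A minimal sketch of the reversal mentioned above, using `java.util.Collections.reverse`. The `values` array is made-up sample data standing in for whatever the multi-valued field returns; `FieldValueOrder` is a hypothetical helper name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class FieldValueOrder {
    // Returns a reversed copy, so the caller's original array is left untouched.
    static List<String> reversed(String[] values) {
        List<String> copy = new ArrayList<>(Arrays.asList(values));
        Collections.reverse(copy);
        return copy;
    }

    public static void main(String[] args) {
        // Sample data only: pretend the field values came back in reverse order.
        String[] values = {"third", "second", "first"};
        System.out.println(reversed(values)); // prints [first, second, third]
    }
}
```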

Re: behavioral differences between Field.Keyword and Field.UnStored

2005-02-11 Thread Otis Gospodnetic
The QueryParser is analyzing your Field.Keyword (genre field) fields, because it doesn't know that genre is a Keyword field and should not be analyzed. Check section 4.4. here: http://www.lucenebook.com/search?query=queryparser+keyword Otis --- Mike Rose <[EMAIL PROTECTED]> wrote: > Perhaps

Re: Optimize not deleting all files

2005-02-04 Thread Otis Gospodnetic
Get and try Lucene 1.4.3. One of the older versions had a bug that was not deleting old index files. Otis --- [EMAIL PROTECTED] wrote: > Hi, > > When I run an optimize in our production environment, old index are > left in the directory and are not deleted. > > My understanding is that an >

Re: Numbers in the Query String

2005-02-03 Thread Otis Gospodnetic
Using different analyzers for indexing and searching is not recommended. Your numbers are not even in the index because you are using StandardAnalyzer. Use Luke to look at your index. Otis --- Hetan Shah <[EMAIL PROTECTED]> wrote: > Hello, > > How can one search for a document based on the qu

Re: which HTML parser is better?

2005-02-02 Thread Otis Gospodnetic
If you are not married to Java: http://search.cpan.org/~kilinrax/HTML-Strip-1.04/Strip.pm Otis --- sergiu gordea <[EMAIL PROTECTED]> wrote: > Karl Koch wrote: > > >I am in control of the html, which means it is well formated HTML. I > use > >only HTML files which I have transformed from XML. No

RE: carrot2 question too - Re: Fun with the Wikipedia

2005-01-31 Thread Otis Gospodnetic
Adam, Dawid posted some code that lets you use Carrot2 locally with Lucene, without the componentized pipe line system described on Carrot2 site. Otis --- Adam Saltiel <[EMAIL PROTECTED]> wrote: > David, Hi, > Would you be able to comment on coincidentally recent thread " RE: -> > Grouping Sear

Re: Lucene in Action hits desk in UK

2005-01-28 Thread Otis Gospodnetic
; > Just wondering: > > Is Lucene-in-Action being sold anywhere in Singapore? > > > > thanks! > > > > Otis Gospodnetic <[EMAIL PROTECTED]> wrote: Gospodnetić > sounds like Gospodnetich and Eric is Erik :) > > Otis > > --- John Haxby wrote

Re: Disk space used by optimize

2005-01-28 Thread Otis Gospodnetic
Morus, that description of 3 sets of index files is what I was imagining, too. I'll have to test and add to the book errata, it seems. Thanks for the info, Otis --- Morus Walter <[EMAIL PROTECTED]> wrote: > Otis Gospodnetic writes: > > Hello, > > > > Yes, tha

Re: Loading a large index

2005-01-28 Thread Otis Gospodnetic
Edwin, --- Edwin Tang <[EMAIL PROTECTED]> wrote: > I have three indices really that I search via ParallelMultiSearcher. > All three > are being updated constantly. We would like to be able to perform a > search on > the indices and have the results reflect the latest documents > indexed. However,

Re: total number of (unique) terms in the index

2005-01-28 Thread Otis Gospodnetic
I don't think there is a direct way to get the number of (unique) terms in the index, so yes, I think you'll have to loop through TermEnum and count. Otis --- Jonathan Lasko <[EMAIL PROTECTED]> wrote: > I'm looking for the total number of unique terms in the index. I see > > that I can get a T

Re: google mini? who needs it when Lucene is there

2005-01-27 Thread Otis Gospodnetic
500 times the original data? Not true! :) Otis --- "Xiaohong Yang (Sharon)" <[EMAIL PROTECTED]> wrote: > Hi, > > I agree that Google mini is quite expensive. It might be similar to > the desktop version in quality. Anyone knows google's ratio of index > to text? Is it true that Lucene's i

RE: Disk space used by optimize

2005-01-27 Thread Otis Gospodnetic
l" ;) > > Yes the final three files are: the .cfs (46.8MB), deletable (4 > bytes), > and segments (29 bytes). > > --Leto > > > > > -Original Message- > > From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] > > > > Hello, > >

Re: rackmount lucene/nutch - Re: google mini? who needs it when Lucene is there

2005-01-27 Thread Otis Gospodnetic
I discuss this with myself a lot inside my head... :) Seriously, I agree with Erik. I think this is a business opportunity. How many people are hating me now and going "shh"? Raise your hands! Otis --- David Spencer <[EMAIL PROTECTED]> wrote: > This reminds me, has anyone every discuss

Re: Disk space used by optimize

2005-01-27 Thread Otis Gospodnetic
Hello, Yes, that is how optimize works - copies all existing index segments into one unified index segment, thus optimizing it. See hit #1: http://www.lucenebook.com/search?query=optimize+disk+space However, three times the space sounds a bit too much, or I made a mistake in the book. :) You sa

Re: XML index

2005-01-27 Thread Otis Gospodnetic
Hello Karl, Grab the source code for Lucene in Action, it's got code that parses and indexes XML with DOM and SAX. You can see the coverage of that stuff here: http://lucenebook.com/search?query=indexing+XML+section%3A7* I haven't used kXML, but I imagine the LIA code should get you going quickl

Re: Boosting Questions

2005-01-27 Thread Otis Gospodnetic
Luke, Boosting is only one of the factors involved in Document/Query scoring. Assuming that applying your boosts to Document A or a single field of Document A increases the total score enough, yes, that Document A may have the highest score. But just because you boost a single Document and no

Re: Different Documents (with fields) in one index?

2005-01-27 Thread Otis Gospodnetic
Karl, This is completely fine. You can have documents with different fields in the same index. Otis --- Karl Koch <[EMAIL PROTECTED]> wrote: > Hello all, > > perhaps not such a sophisticated question: > > I would like to have a very diverse set of documents in one index. > Depending > on th

Re: Lucene in Action hits desk in UK

2005-01-26 Thread Otis Gospodnetic
Gospodnetić sounds like Gospodnetich and Eric is Erik :) Otis --- John Haxby <[EMAIL PROTECTED]> wrote: > Otis Gospodnetic wrote: > > >I contacted both the US and UK Amazon sites and asked them to fix my > >last name (the last character in my name has a little slash (no

Re: Getting Into Search

2005-01-26 Thread Otis Gospodnetic
Hi Luke, That's not hard with RangeQuery (supported by QueryParser), take a look at this: http://www.lucenebook.com/search?query=date+range The grayed-out text has the section name and page number, so you can quickly locate this stuff in your ebook. Otis P.S. Do you know if Indigo/Chapters has

Re: Lucene in Action hits desk in UK

2005-01-26 Thread Otis Gospodnetic
Publisher -> Amazon information feed seems to be a fairly manual process, and Amazon takes a while to update book information on their site, including prices. I contacted both the US and UK Amazon sites and asked them to fix my last name (the last character in my name has a little slash (not an ac

Re: Search Chinese in Unicode !!!

2005-01-25 Thread Otis Gospodnetic
I don't have a document with Chinese characters to verify this, but it looks right, so I'll add your change to SearchFiles.java. Thanks, Otis --- Eric Chow <[EMAIL PROTECTED]> wrote: > Search not really correct with UTF-8 !!! > > > The following is the search result that I used the SearchFiles

Re: Search on heterogenous index

2005-01-25 Thread Otis Gospodnetic
Hello Simeon, Heterogenous Documents/indices are OK - check out the second hit: http://www.lucenebook.com/search?query=heterogenous+different Otis --- Simeon Koptelov <[EMAIL PROTECTED]> wrote: > Hello all. I'm new to lucene and think about using it in my project. > > I have prices with dyn

Re: keep indexes as files or save them in database

2005-01-23 Thread Otis Gospodnetic
A number of people have tried putting Lucene indices in RDBMS. As far as I know, all were slower than FSDirectory. Otis --- nafise hassani <[EMAIL PROTECTED]> wrote: > Hi > I want to know from the performance point of view it > is better to save lucene indexes in database or use > them as files

Re: English and French documents together / analysis, indexing, searching

2005-01-23 Thread Otis Gospodnetic
That would be a partial solution. Accents will not be a problem any more, but if you use an Analyzer that stems tokens, they will not really be tokenized properly. Searches will probably work, but if you look at the index you will see that some terms were not analyzed properly. But it may be suff

Re: Opening up one large index takes 940M or memory?

2005-01-22 Thread Otis Gospodnetic
Yes, I remember your email about the large number of Terms. If it can be avoided and you figure out how to do it, I'd love to patch something. :) Otis --- "Kevin A. Burton" <[EMAIL PROTECTED]> wrote: > Otis Gospodnetic wrote: > > >It would be interesting

Re: Lucene in Action

2005-01-22 Thread Otis Gospodnetic
Hi Ansi, If you want the print version, I would guess you could order it from the publisher (http://www.manning.com/hatcher2) or from Amazon and they will ship it to you in China. The electronic version (a PDF file) is also available from the above URL. I'll ask Manning Publications and see whet

Re: Opening up one large index takes 940M or memory?

2005-01-22 Thread Otis Gospodnetic
There Kevin, that's what I was referring to, the .tii file. Otis --- Paul Elschot <[EMAIL PROTECTED]> wrote: > On Saturday 22 January 2005 01:39, Kevin A. Burton wrote: > > Kevin A. Burton wrote: > > > > > We have one large index right now... its about 60G ... When I > open it > > > the Java V

Re: Opening up one large index takes 940M or memory?

2005-01-22 Thread Otis Gospodnetic
It would be interesting to know _what_exactly_ uses your memory. Running under a profiler should tell you that. The only thing that comes to mind is... can't remember the details now, but when the index is opened, I believe every 128th term is read into memory. This, I believe, helps with inde
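To illustrate the "every 128th term in memory" idea, here is a rough sketch of a sparse term index in plain Java. It is not Lucene's actual implementation: `SparseTermIndex` is a hypothetical class, the full sorted term list is kept in an array that stands in for the on-disk term file, and only every `interval`-th term is copied into the in-memory sample. A lookup binary-searches the sample and then scans at most `interval` terms.

```java
import java.util.Arrays;

final class SparseTermIndex {
    private final String[] allTerms; // stands in for the sorted on-disk term list
    private final String[] sampled;  // every interval-th term, the part held in memory
    private final int interval;

    SparseTermIndex(String[] sortedTerms, int interval) {
        this.allTerms = sortedTerms;
        this.interval = interval;
        int n = (sortedTerms.length + interval - 1) / interval;
        this.sampled = new String[n];
        for (int i = 0; i < n; i++) {
            sampled[i] = sortedTerms[i * interval];
        }
    }

    int inMemoryTerms() {
        return sampled.length;
    }

    // Returns the position of the term in the full list, or -1 if it is absent.
    int find(String term) {
        int pos = Arrays.binarySearch(sampled, term);
        if (pos < 0) {
            pos = -pos - 2; // index of the last sampled term that sorts before the target
        }
        if (pos < 0) {
            return -1; // the target sorts before the very first term
        }
        int start = pos * interval;
        int end = Math.min(start + interval, allTerms.length);
        for (int i = start; i < end; i++) { // short scan within one block
            if (allTerms[i].equals(term)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] terms = new String[1000];
        for (int i = 0; i < terms.length; i++) {
            terms[i] = String.format("term%04d", i); // already in sorted order
        }
        SparseTermIndex index = new SparseTermIndex(terms, 128);
        System.out.println(index.inMemoryTerms() + " of " + terms.length + " terms held in memory");
        System.out.println("term0500 found at " + index.find("term0500"));
    }
}
```

With a fixed interval the in-memory sample grows linearly with the number of unique terms, which is consistent with a very large index needing a lot of memory just to open.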

RE: Search Chinese in Unicode !!!

2005-01-21 Thread Otis Gospodnetic
If you are hosting the code somewhere (e.g. your site, SF, java.net, etc.), we should link to them from one of the Lucene pages where we link to related external tools, apps, and such. Otis --- "Safarnejad, Ali (AFIS)" <[EMAIL PROTECTED]> wrote: > I've written a Chinese Analyzer for Lucene that

Re: Suggestion needed for extranet search

2005-01-21 Thread Otis Gospodnetic
Free as in orange juice. Otis --- "Ranjan K. Baisak" <[EMAIL PROTECTED]> wrote: > Otis, > Thanks for your help. Is nutch a freeware tool? > > regards, > Ranjan > --- Otis Gospodnetic <[EMAIL PROTECTED]> > wrote: > > > Hi Ranjan, > >

Re: Concurrent read and write

2005-01-21 Thread Otis Gospodnetic
Hello Ashley, You can read/search while modifying the index, but you have to ensure only one thread or only one process is modifying an index at any given time. Both IndexReader and IndexWriter can be used to modify an index. The former to delete Documents and the latter to add them. You have t
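One way to enforce "only one thread modifies at a time" inside a single JVM is to route every index change through one lock. The sketch below is an assumption-laden illustration, not Lucene code: `SingleWriterGate` is a hypothetical helper, the `Runnable` stands in for an add or delete call, and a plain counter plays the role of the index. It does nothing for separate processes.

```java
import java.util.concurrent.locks.ReentrantLock;

final class SingleWriterGate {
    private final ReentrantLock writeLock = new ReentrantLock();

    // Runs one index-modifying action at a time; searches do not need to go through here.
    void modify(Runnable indexChange) {
        writeLock.lock();
        try {
            indexChange.run();
        } finally {
            writeLock.unlock();
        }
    }

    // Demo: several threads "modify" a counter that is not thread-safe on its own.
    static int runDemo(int threads, final int perThread) {
        final SingleWriterGate gate = new SingleWriterGate();
        final int[] fakeIndexSize = {0};
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    gate.modify(() -> fakeIndexSize[0]++);
                }
            });
            workers[i].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return fakeIndexSize[0];
    }

    public static void main(String[] args) {
        System.out.println(runDemo(4, 1000)); // prints 4000
    }
}
```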

Re: Suggestion needed for extranet search

2005-01-21 Thread Otis Gospodnetic
Hi Ranjan, It sounds like you should look at and use Nutch: http://www.nutch.org Otis --- "Ranjan K. Baisak" <[EMAIL PROTECTED]> wrote: > I am planning to move to Lucene but not have much > knowledge on the same. The search engine which I had > developed is searching some extranet URLs e.g.

RE: Filtering w/ Multiple Terms

2005-01-21 Thread Otis Gospodnetic
This: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/BooleanQuery.TooManyClauses.html ? You can control that limit via http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/BooleanQuery.html#maxClauseCount Otis --- Jerry Jalenak <[EMAIL PROTECTED]> wrote: > OK.

Re: Stemming

2005-01-21 Thread Otis Gospodnetic
Hi Kevin, Stemming is an optional operation and is done in the analysis step. Lucene comes with a Porter stemmer and a Filter that you can use in an Analyzer: ./src/java/org/apache/lucene/analysis/PorterStemFilter.java ./src/java/org/apache/lucene/analysis/PorterStemmer.java You can find more a

Re: Closed IndexWriter reuse

2005-01-20 Thread Otis Gospodnetic
No, you can't add documents to an index once you close the IndexWriter. You can re-open the IndexWriter and add more documents, of course. Otis --- Oscar Picasso <[EMAIL PROTECTED]> wrote: > Hi, > > Is it safe to add documents to an IndexWriter that has been closed? > > From what I have seen,

Re: lucene2.0 and transaction support

2005-01-20 Thread Otis Gospodnetic
The Wiki has some info about Lucene 2.0, but that is all there is about 2.0. Regarding transactions - have you tried DbDirectory? I believe that will provide XA support and it won't require Lucene changes. Otis --- John Wang <[EMAIL PROTECTED]> wrote: > Hi: > >When is lucene 2.0 schedule

RE: help in indexing

2005-01-20 Thread Otis Gospodnetic
Hello Chetan, The code that comes with the Lucene book contains a little framework for indexing rich-text documents. It sounds like you may be able to use it as-is, extending it with a parser for Excel files, which we didn't include in the code (should we include it in the next edition?). Wh

Re: Why IndexReader.lastModified(index) is depricated?

2005-01-19 Thread Otis Gospodnetic
Going for the segments file like that is not a recommended practise, or at least not something I'd recommend. The 'segments' file is really something that a caller should not know anything about. One day Lucene may choose to rename the segments file or some such, and the code that uses this trick wi

Re: Demo webapp + pdf

2005-01-19 Thread Otis Gospodnetic
We've used PDFBox for Lucene in Action: http://www.lucenebook.com/search?query=PDFBox If you download the source code for the book you will get ready to use code for parsing and indexing PDF files, as well as Word, XML, and RTF. Otis --- Vlachogiannis Evangelos <[EMAIL PROTECTED]> wrote: > H

Re: QUERYPARSER + LEXECIAL ERROR

2005-01-17 Thread Otis Gospodnetic
Hello, Try: String searchWrd = "kid \"toy\"" OR "kid \"ball\"" You'll have to use a WhitespaceAnalyzer with that, though, or a custom Analyzer that doesn't remove the escape character (\). Otis --- Karthik N S <[EMAIL PROTECTED]> wrote: > > > Hi Guys. > > Apologies. > > > >

Re: How do I unlock?

2005-01-11 Thread Otis Gospodnetic
Eh, that exactly :) When I read my emails in reverse order --- Chris Lamprecht <[EMAIL PROTECTED]> wrote: > What about a shutdown hook? > > Runtime.getRuntime().addShutdownHook(new Thread() { > public void run() { /* whatever */ } > }); > > see also > http://www.onjava.com/pub/a/onja
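A slightly fuller sketch of the shutdown-hook idea quoted above. Everything specific here is an assumption for illustration: `LockCleanup` is a hypothetical helper, and the temp file merely stands in for a stale lock file; it is not Lucene's real lock location or name. Hooks run on normal exit and on most signals, but not if the JVM is killed outright, and blindly deleting a lock that another live process holds would be unsafe.

```java
import java.io.File;
import java.io.IOException;

final class LockCleanup {
    // Deletes the lock file if present; returns true when no lock file remains.
    static boolean release(File lockFile) {
        return !lockFile.exists() || lockFile.delete();
    }

    // Registers a JVM shutdown hook that attempts the cleanup on exit.
    static void registerHook(final File lockFile) {
        Runtime.getRuntime().addShutdownHook(new Thread() {
            @Override
            public void run() {
                release(lockFile);
            }
        });
    }

    public static void main(String[] args) throws IOException {
        // Stand-in lock file for the demo; a real application would point at its own lock.
        File lock = File.createTempFile("demo-write", ".lock");
        registerHook(lock);
        System.out.println("lock present before exit: " + lock.exists());
    }
}
```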

Re: How do I unlock?

2005-01-11 Thread Otis Gospodnetic
I didn't pay full attention to this thread, but it sounds like somebody may be interested in RuntimeShutdownHook (or some similar name) as a place to try to release the locks. Otis --- Joseph Ottinger <[EMAIL PROTECTED]> wrote: > On Tue, 11 Jan 2005, Doug Cutting wrote: > > > Joseph Ottinger wr

Re: Token Characters

2005-01-11 Thread Otis Gospodnetic
The best place to look is: ./src/java/org/apache/lucene/analysis/standard/StandardTokenizer.jj You can see it at: http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/analysis/standard/ Otis --- Shawn Konopinsky <[EMAIL PROTECTED]> wrote: > Hey There, > > Wondering wher

Re: How do I unlock?

2005-01-11 Thread Otis Gospodnetic
Hello, 1) The FAQ has been moved to the Wiki, so feel free to stick it in there. 2) http://www.lucenebook.com/search?query=unlock Otis --- Chris Hostetter <[EMAIL PROTECTED]> wrote: > > : I'm getting > : Lock obtain timed out. > : > : I was developing and forgot to close the writer. How do I

Re: Performance question

2005-01-10 Thread Otis Gospodnetic
Use one index, working with a single index is simpler. Also, once you pull a Document from Hits object, all Fields are read off of the disk. There was some discussion about selective Field reading about a week ago, check the list archives. Also keep in mind Field compression is now possible (onl

Re: Question about sorting and sorted results

2005-01-10 Thread Otis Gospodnetic
Hello Mariella, Check out the first hit here: http://www.lucenebook.com/search?query=sort+tokenize Otis -- http://www.simpy.com - save, tag, index, search, and share your links --- Mariella Di Giacomo <[EMAIL PROTECTED]> wrote: > Hi ALL, > > > I am using a java class to query an index and ret

Re: Duplicate Id

2005-01-07 Thread Otis Gospodnetic
Hello, If you search for India OR Test, you will find both, if you use AND, you will find none. Lucene can search any text, not just files. It sounds like you are using Lucene's demo as a real application (not a good practise). I suggest you take a look at the Resources page on the Lucene Wiki

Re: RemoteSearcher

2005-01-06 Thread Otis Gospodnetic
Nutch (nutch.org) has a pretty sophisticated infrastructure for distributed searching, but it doesn't use RemoteSearcher. Otis --- Yura Smolsky <[EMAIL PROTECTED]> wrote: > Hello. > > Does anyone know application which based on RemoteSearcher to > distribute index on many servers? > > Yura Smo

Re: Lucene Book in UK

2005-01-06 Thread Otis Gospodnetic
The book is $44.95 USD - it's printed on the back cover. Amazon had the correct price (minus their discount) until recently. They are just very slow with their site/book info updates, but I'm sure they'll fix it eventually. Otis --- Erik Hatcher <[EMAIL PROTECTED]> wrote: > > On Jan 6, 2005,

Re: reading fields selectively

2005-01-06 Thread Otis Gospodnetic
Hi John, There is no API for this, but I recall somebody talking about adding support for this a few months back. I even think that somebody might have contributed a patch for this. I am not certain about this, but check the patch queue (link on Lucene site). If there is a patch there, even if

Re: 1.4.3 breaks 1.4.1 QueryParser functionality

2005-01-05 Thread Otis Gospodnetic
Hello Bill, "I feel your pain" ;) But seriously, there was a QueryParser mess-up in the recent minor releases. I think this is the first time we've messed up the backward compatibility in the last ~4 years, I believe. Lucene public API is very 'narrow', and typically very stable. What we did wi

Re: simultaneous index/search/delete

2005-01-05 Thread Otis Gospodnetic
Any index-modifying operations need to be serialized. Searching is read-only and can be done in parallel with anything else. See http://www.lucenebook.com/search?query=concurrent for some hints. Otis --- Alex Kiselevski <[EMAIL PROTECTED]> wrote: > > Concerning the question about simultaneous

Re: Parsing issue

2005-01-04 Thread Otis Gospodnetic
That's the correct place to look and it includes code samples. Yes, it's a Jar file that you add to the CLASSPATH and use ... hm, normally programmatically, yes :). Otis --- Hetan Shah <[EMAIL PROTECTED]> wrote: > Has any one used NekoHTML ? If so how do I use it. Is it a stand > alone > ja

Re: Help for sorting

2005-01-03 Thread Otis Gospodnetic
Hello, --- mahaveer jain <[EMAIL PROTECTED]> wrote: > I am looking out to implement sorting in my lucene application. This > is what my code look like. > > I am using StandardAnalyzer() analyzer. > > Query query = QueryParser.parse(keyword, "contents", analyzer); > > Sort sortCol = new Sort

Re: Is search in lucene commutative?

2004-12-31 Thread Otis Gospodnetic
Replying to lucene-user list. Yes, term1 AND term2, as well as term1 OR term2 should yield the same hits. Otis --- ABDOU Samir <[EMAIL PROTECTED]> wrote: > Hello, > > Does a query such as give the same hits as for the > query ? > Google seems to differentiate the two requests. > > Thanks.
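The point about operand order can be illustrated with plain sets of document ids: AND behaves like intersection and OR like union, and both are commutative. This is only a model of which documents match, using made-up ids and a hypothetical `BooleanHits` class; it says nothing about scores or ranking.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

final class BooleanHits {
    // Documents matching both terms (a AND b).
    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // Documents matching either term (a OR b).
    static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    public static void main(String[] args) {
        // Made-up posting lists: ids of documents containing each term.
        Set<Integer> term1 = new TreeSet<>(Arrays.asList(1, 2, 5, 8));
        Set<Integer> term2 = new TreeSet<>(Arrays.asList(2, 3, 8, 9));
        System.out.println(and(term1, term2).equals(and(term2, term1))); // prints true
        System.out.println(or(term1, term2).equals(or(term2, term1)));   // prints true
    }
}
```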

Re: how often to optimize?

2004-12-28 Thread Otis Gospodnetic
Correct. The self-maintenance you are referring to is Lucene's periodic segment merging. The frequency of that can be controlled through IndexWriter's mergeFactor. Otis --- aurora <[EMAIL PROTECTED]> wrote: > > Are not optimized indices causing you any problems (e.g. slow > searches, > > high n

Re: Need an analyzer that includes numbers.

2004-12-25 Thread Otis Gospodnetic
WhitespaceAnalyzer will let you have it. It just breaks the input on spaces. Otis --- Jim <[EMAIL PROTECTED]> wrote: > I've seen some discussion on this and the answer seems to be "write > your > own". Hasn't someone already done that by now that would share? I > really have to be able to i
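As a rough picture of what "just breaks the input on spaces" means for numbers, the plain-Java sketch below splits on whitespace and changes nothing else. It approximates the behaviour described here and is not the analyzer's actual code; `WhitespaceSplit` and the sample text are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class WhitespaceSplit {
    // Splits on runs of whitespace; no lowercasing, no stop words, digits are kept.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("Model 4500 rev 2.1b")); // prints [Model, 4500, rev, 2.1b]
    }
}
```

Because nothing is normalized, "Model" and "model" would be different terms, so the same splitting has to be applied to queries as well.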

Re: Unable to read TLD "META-INF/c.tld" from JAR file ... standard.jar

2004-12-23 Thread Otis Gospodnetic
Most definitely Jetty. I can't believe you're using Tomcat for Rojo! ;) Otis --- Erik Hatcher <[EMAIL PROTECTED]> wrote: > Wrong list. > > Though perhaps you should be using Jetty ;) > > Erik > > > On Dec 23, 2004, at 4:17 PM, Kevin A. Burton wrote: > > > What in the world is up with

Re: addIndexes() Question

2004-12-22 Thread Otis Gospodnetic
I _think_ you'd be better off doing it all at once, but I wouldn't trust myself on this and would instead construct a small 3-index set and test, looking at a) maximal disk usage, b) time, and c) RAM usage. :) Otis --- Ryan Aslett <[EMAIL PROTECTED]> wrote: > > Hi there, Im about to embark on

Re: retrieve tokens

2004-12-22 Thread Otis Gospodnetic
For simpy.com I store the full text of web pages in Lucene, in order to provide full-text web searches. Nutch (nutch.org) does the same. You can set the maximal number of tokens you want indexed via IndexWriter. You can also compress fields in the newest version of Lucene (or maybe just the one

Re: retrieve tokens

2004-12-22 Thread Otis Gospodnetic
I suspect Martijn really wants that snippet dynamically generated, with KWIC, as on the lucenebook.com screen shot. Thus, he can't generate and store the snippet at index time, and has to construct it at search time. Otis --- Mike Snare <[EMAIL PROTECTED]> wrote: > > But for the other issue on

Re: (Offtopic) The unicode name for a character

2004-12-22 Thread Otis Gospodnetic
If you are not tied to Java, see 'unac' at http://www.senga.org/. It's old, but if nothing else you could see how it works and rewrite it in Java. And if you can, you can donate it to Lucene Sandbox. Otis --- Peter Pimley <[EMAIL PROTECTED]> wrote: > > Hi everyone, > > The Question: > In Java

Re: retrieve tokens

2004-12-22 Thread Otis Gospodnetic
Martijn, have you seen the Highlighter in the Lucene Sandbox? If you've stored your text in the Lucene index, there is no need to go back to DB to pull out the blog, parse it, and highlight it - the Highlighter in the Sandbox will do this for you. Otis --- "M. Smit" <[EMAIL PROTECTED]> wrote: >

Re: how often to optimize?

2004-12-21 Thread Otis Gospodnetic
Hello, I think some of these questions may be answered in the jGuru FAQ > So my question is would it be an overkill to optimize everyday? Only if lots of documents are being added/deleted, and you end up with a lot of index segments. > Is > there > any guideline on how often to optimize? E

Re: index size doubled?

2004-12-21 Thread Otis Gospodnetic
You don't need to optimize to simulate an incremental update. You just have to re-open your index with the IndexSearcher to see newly added documents. Otis --- aurora <[EMAIL PROTECTED]> wrote: > Thanks for the heads up. I'm using Lucene 1.4.2. > > I tried to do optimize() again but it has no

Re: index size doubled?

2004-12-21 Thread Otis Gospodnetic
Another possibility is that you are using an older version of Lucene, which was known to have a bug with similar symptoms. Get the latest version of Lucene. You shouldn't really have multiple .cfs files after optimizing your index. Also, optimize only at the end, if you care about indexing speed

RE: Queries difference

2004-12-20 Thread Otis Gospodnetic
Alex, I think you want this: +city:London +city:Amsterdam +address:1_street +address:2_street Otis --- Alex Kiselevski <[EMAIL PROTECTED]> wrote: > > Thanks Morus > So if I understand right > If the second query is : > +city(London) +city(Amsterdam) +address(1_street) +address(2_street) > >

Re: analyzer effecting phrases?

2004-12-20 Thread Otis Gospodnetic
When searching for phrases, what's important is the position of each token/word extracted by the Analyzer. WhitespaceAnalyzer/LowerCaseFilter don't do anything with the positional information. There is nothing else in your Analyzer? In any case, the following should help you see what your Analyz

Re: Indexing with Lucene 1.4.3

2004-12-17 Thread Otis Gospodnetic
The only place where you have to specify that you are using the compound index format is on IndexWriter instance. Nothing needs to be done at search time on IndexSearcher. Otis --- Hetan Shah <[EMAIL PROTECTED]> wrote: > Thanks Chuck, > > I now understand why I see only one file. Another quest

Re: Why does the StandardTokenizer split hyphenated words?

2004-12-16 Thread Otis Gospodnetic
Hello, As Erik already said - that Analyzer is really there to get people going quickly and as a 'does pretty good' Analyzer. There is no Analyzer that will work for everyone, and Analyzers are meant to be custom-made. It looks like you already got that figured out and have your own Analyzer. O

Re: Disk space needed for indexing???

2004-12-16 Thread Otis Gospodnetic
The exact disk space usage depends on the number of fields in the index and on how many of them store the original text. You should also keep in mind that the call to IndexWriter's optimize() will result in your index directory size doubling while the optimization is in progress, so if you want to

Re: A question about scoring function in Lucene

2004-12-15 Thread Otis Gospodnetic
There is one case that I can think of where this 'constant' scoring would be useful, and I think Chuck already mentioned this 1-2 months ago. For instance, having such scores would allow one to create alert applications where queries run by some scheduler would trigger an alert whenever the score i

RE: Indexing a large number of DB records

2004-12-15 Thread Otis Gospodnetic
Note that this really includes some extra steps. You don't need a temp index. Add everything to a single index using a single IndexWriter instance. No need to call addIndexes nor optimize until the end. Adding Documents to an index takes a constant amount of time, regardless of the index size, b

Re: Indexing a large number of DB records

2004-12-15 Thread Otis Gospodnetic
ry > 20,000 documents to flush memory structures to disk. > There doesn't seem to be an equivalent in Lucene. > > -- Homam > > > > > > > --- Otis Gospodnetic <[EMAIL PROTECTED]> > wrote: > > > Hello, > > > > There ar

Re: Indexing a large number of DB records

2004-12-14 Thread Otis Gospodnetic
Hello, There are a few things you can do: 1) Don't just pull all rows from the DB at once. Do that in batches. 2) If you can get a Reader from your SqlDataReader, consider this: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Field.html#Text(java.lang.String,%20java.io.Read
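The batching advice in point 1 can be sketched as computing offset/size windows and fetching one window at a time instead of loading every row. The code below only does the window arithmetic; `BatchRanges` is a hypothetical helper, and how each window is turned into a database query (paging clause, key range, and so on) is left open because it depends on the database.

```java
import java.util.ArrayList;
import java.util.List;

final class BatchRanges {
    // Returns {offset, size} pairs that cover totalRows in batches of at most batchSize.
    static List<int[]> ranges(int totalRows, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<int[]> out = new ArrayList<>();
        for (int offset = 0; offset < totalRows; offset += batchSize) {
            out.add(new int[] {offset, Math.min(batchSize, totalRows - offset)});
        }
        return out;
    }

    public static void main(String[] args) {
        // Each window would be fetched, indexed, and released before the next one is read.
        for (int[] range : ranges(45, 20)) {
            System.out.println("offset=" + range[0] + " size=" + range[1]);
        }
    }
}
```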

RE: Opinions: Using Lucene as a thin database

2004-12-14 Thread Otis Gospodnetic
Well, one could always partition an index, distribute pieces of it horizontally across multiple 'search servers' and use the built-in RMI-based and Parallel search feature. Nutch uses something similar for search scaling. Otis --- Monsur Hossain <[EMAIL PROTECTED]> wrote: > > My concern is tha

RE: TFIDF Implementation

2004-12-14 Thread Otis Gospodnetic
You can also see 'Books like this' example from here https://secure.manning.com/catalog/view.php?book=hatcher2&item=source Otis --- Bruce Ritchie <[EMAIL PROTECTED]> wrote: > Christoph, > > I'm not entirely certain if this is what you want, but a while back > David Spencer did code up a 'More L

RE: Opinions: Using Lucene as a thin database

2004-12-14 Thread Otis Gospodnetic
You can see a Flickr-like tag (lookup) system at my Simpy site ( http://www.simpy.com ). It uses Lucene as the backend for lookups, but still uses a RDBMS as the primary storage. I find that keeping both the RDBMS and Lucene indices is a bit of a pain and error prone, so _thin_ storage layer with sim

Re: finalize delete without optimize

2004-12-14 Thread Otis Gospodnetic
is, how would I go about submitting a patch? > > thanks > > -John > > > On Mon, 13 Dec 2004 22:24:12 -0800 (PST), Otis Gospodnetic > <[EMAIL PROTECTED]> wrote: > > Hello John, > > > > I believe you didn't get any replies to this. What you are >

Re: finalize delete without optimize

2004-12-14 Thread Otis Gospodnetic
Hello John, I believe you didn't get any replies to this. What you are describing cannot be done using the public API, but maaay (no source code on this machine, so I can't double-check that) be doable if you use some of the 'internal' methods. I don't have the need for this, but others might, so

Re: Finding unused segment files?

2004-12-12 Thread Otis Gospodnetic
into Lucene index directories and removes * unwanted files. In its more radical mode, this tool can be used to * remove all non-Lucene index files from a directory. The other * option is to remove unused Lucene segment files, should the index * directory get polluted. * * TODO: this tool

Re: Indexing HTML files give following message

2004-12-12 Thread Otis Gospodnetic
Hello, This is probably due to some bad HTML. The application you are using is just a demo, and uses a JavaCC-based HTML parser, which may not be resilient to invalid HTML. For Lucene in Action we developed a little extensible indexing framework, and for HTML indexing we used 2 tools to handle H

Re: Sorting based on calculations at search time

2004-12-10 Thread Otis Gospodnetic
Guru (I thought my first name was OK until now), Have you tried using boosts for that? You can boost individual Document Fields when indexing, and/or you can boost individual Documents, thus giving some more and some less 'weight', which will have an effect on the final score. Otis --- Gu

RE: OutOfMemoryError with Lucene 1.4 final

2004-12-10 Thread Otis Gospodnetic
Ying, You should follow this finally block advice below. In addition, I think you can just close the reader, and it will close the underlying stream (I'm not sure about that, double-check it). You are not running out of file handles, though. Your JVM is running out of memory. You can play with
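A small sketch of the finally-block advice, and a way to double-check the point about the underlying stream, using only `java.io`. `ReaderCloseDemo` and `TrackingStream` are hypothetical names made up for this illustration. It assumes the reader is an `InputStreamReader`, whose `close()` also closes the stream it wraps; other `Reader` implementations may behave differently.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ReaderCloseDemo {

    /** Hypothetical helper stream that records whether close() was called. */
    static class TrackingStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackingStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Reads everything from the reader and always closes it in a finally block. */
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        try {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } finally {
            reader.close(); // runs even if read() throws
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        TrackingStream stream =
                new TrackingStream("some document text".getBytes(StandardCharsets.UTF_8));
        String text = readAll(new InputStreamReader(stream, StandardCharsets.UTF_8));
        System.out.println(text);
        System.out.println("underlying stream closed: " + stream.closed);
    }
}
```

Closing readers promptly releases file handles; it does not by itself address heap exhaustion, which is the separate issue raised in the message.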

Re: maxDoc()

2004-12-09 Thread Otis Gospodnetic
Hello Garrett, Share some code; it will be easier for others to help you that way. Obviously, this would be a huge bug if the problem were within Lucene. Otis --- Garrett Heaver <[EMAIL PROTECTED]> wrote: > Can anyone please explain to me why maxDoc returns 0 when Luke shows > 239,473 > docume

Re: 'IN' type search

2004-12-08 Thread Otis Gospodnetic
Hello, You can use BooleanQuery for that. Otis --- Ravi <[EMAIL PROTECTED]> wrote: > > Hi > How do you get all documents in lucene where a particular field > value > is in a given list of values (like SQL IN). What kind of Query class > should I use? > > Thanks in advance. > Ravi. > > > -

Re: Updating indexes incrementaly including replacnig old documents

2004-12-08 Thread Otis Gospodnetic
Both options are good, and which one you choose depends on which one you feel more comfortable with, I'd say. The searcher won't see duplicates or missing documents until it is reopened. So use a separate IndexSearcher for searching, and reinstantiate it only after you are completely done with ei

Re: Empty/non-empty field indexing question

2004-12-08 Thread Otis Gospodnetic
if > the > field is not there, correct? > But then is there a point putting an empty value in it, if an > application will never search for empty values? > > > thanks > > -pedja > > > Otis Gospodnetic said the following on 12/8/2004 1:31 AM: > >

Re: searchig with special characters

2004-12-08 Thread Otis Gospodnetic
A leading wildcard character (*) is not allowed if you use the QueryParser that comes with Lucene. Reason: performance. See the many discussions about this on the lucene-user mailing list. Also see the search syntax document on the Lucene site. What other characters are you having trouble with? Otis --- Sa

Re: problem restoring index

2004-12-08 Thread Otis Gospodnetic
There is no need to reindex. However, I also don't quite get what the problem is :) Otis --- Santosh <[EMAIL PROTECTED]> wrote: > hi, > > when I restart the tomcat . the Index is getting corrupted. If I take > the backup of Index and then restarting tomcat. the Index is not > working properly.
