Re: document diversity

2009-10-01 Thread Phil Whelan
Hi Mike, I'd simply store a field doctype with values pdf, txt, html and perform a separate search for each type. Although, I'd be interested if anyone has a cooler way of doing this. Cheers, Phil On Thu, Oct 1, 2009 at 9:56 AM, Michael Masters mmast...@gmail.com wrote: I was wondering if

Re: [ANNOUNCEMENT] LucidGaze for Lucene released

2009-09-14 Thread Phil Whelan
Hi Mark, Are there any Lucene 2.9 versions of this in development that I could get my hands on? I'd be happy to be an alpha tester. Cheers, Phil LucidGaze for Lucene works as a drop-in replacement for the Lucene JAR; it requires no changes to the source code of the application, or even

Re: Problems with IndexReader.reopen()

2009-09-14 Thread Phil Whelan
Sorry, just realised my mistake. I should read the docs more carefully. IndexReader.reopen() does not reopen the existing IndexReader, but returns a new one. Phil On Mon, Sep 14, 2009 at 3:20 PM, Phil Whelan phil...@gmail.com wrote: Hi, I'm not sure why my IndexReader.reopen() call

Problems with IndexReader.reopen()

2009-09-14 Thread Phil Whelan
Hi, I'm not sure why my IndexReader.reopen() call is not working. The latest results are not coming back, meaning the reader / searcher has not been re-opened for the new Documents that have been added. IndexReader openReader = searcher.getIndexReader(); searcher.close();

Re: Enumerating NumericField using TermEnum?

2009-09-13 Thread Phil Whelan
Hi Uwe, Thanks for the explanation! It really helps. It makes sense that for a small number of values, such as hour, NumericField is not going to help me. I'm experimenting with using epoch NumericField for sorting, which funnily is where I started with 2.4.1, before going down the usual

Enumerating NumericField using TermEnum?

2009-09-11 Thread Phil Whelan
Hi, I've used NumericField to store my hour field. Example... doc.add(new NumericField("hour").setIntValue(Integer.parseInt("12"))); Before, I was using a plain string Field and enumerating the values with TermEnum, which worked fine. Now that I'm using NumericFields, I'm not sure how to port this

Re: Why does this search succeed with web app, but not Luke?

2009-08-06 Thread Phil Whelan
Hi Jim, Are you using the same Analyzer for indexing and searching? .yyy will be seen as a HOSTNAME by StandardAnalyzer, which will keep it as one term, whereas another indexer might split this into 2 terms. This should not matter either way as long as you are using the same Analyzer for both

Re: Why does this search succeed with web app, but not Luke?

2009-08-06 Thread Phil Whelan
select path is just the filename part, without the extension, i.e., the part. That's why I said in my original post that I was kind of surprised that doing a web query for path:.yyy succeeded, i.e, in the path field in the index, there is no .yyy, just . Jim Phil

Re: Searching doubt

2009-08-04 Thread Phil Whelan
On Tue, Aug 4, 2009 at 8:31 AM, Shai Erera ser...@gmail.com wrote: Hi Darren, The question was, how given a string "aboutus" in a document, you can return that document as a result to the query "about us" (note the space). So we're mostly discussing how to detect and then break the word "aboutus" to

Re: Searching doubt

2009-08-04 Thread Phil Whelan
(sorry, tangent. I'll be quick) On Tue, Aug 4, 2009 at 8:42 AM, Shai Erera ser...@gmail.com wrote: Interesting ... I don't have access to a Japanese dictionary, so I just extract bi-grams. Shai - if you're interested in parsing Japanese, check out Kakasi. It can split into words and convert

Re: Weird discrepancy with term counts vs. terms (off by 1)

2009-08-02 Thread Phil Whelan
Hi Jim, On Sun, Aug 2, 2009 at 1:32 AM, oh...@cox.net wrote: I first noticed the problem that I'm seeing while working on this latter app. Basically, what I noticed was that while I was adding 13 documents to the index, when I listed the path terms, there were only 12 of them. Field text

Re: Weird discrepancy with term counts vs. terms (off by 1)

2009-08-02 Thread Phil Whelan
Hi Jim, On Sun, Aug 2, 2009 at 9:08 AM, Phil Whelan phil...@gmail.com wrote: So then, I reviewed the index using Luke, and what I saw with that was that there were indeed only 12 path terms (under Term Count on the left), but, when I clicked the Show Top Terms in Luke, there were 13 terms

Re: Weird behaviour

2009-08-02 Thread Phil Whelan
Hi Prashant, I agree with Shai that using Luke, and printing out what the Document looks like before it goes into the index, are going to be your best bet for debugging this problem. The problem you're having is that StandardAnalyzer does not break up the hostname into separate terms, as it has

Re: Weird discrepancy with term counts vs. terms (off by 1)

2009-08-02 Thread Phil Whelan
On Sun, Aug 2, 2009 at 10:58 AM, Andrzej Bialecki a...@getopt.org wrote: Thank you Phil for spotting this bug - this fix will be included in the next release of Luke. Glad to help. Thanks for building this great tool! Phil

Re: Weird discrepancy with term counts vs. terms (off by 1)

2009-08-02 Thread Phil Whelan
Hi Jim, On Sun, Aug 2, 2009 at 12:12 PM, oh...@cox.net wrote: i.e., I was ignoring the 1st term in the TermEnum (since the .next() bumps the TermEnum to the 2nd term, initially). Great! Glad you found the problem. I couldn't see it. Phil
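
The off-by-one described in the quoted message can be reproduced without Lucene. The sketch below is only an illustration: `TermCursor` is a hypothetical stand-in (not a Lucene class) for an enumeration that, as the quoted message describes, is already positioned on its first element, so that calling `next()` before reading moves straight to the second one. A `while (cursor.next())` loop then misses the first term, while a `do ... while` loop visits all of them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TermCursorDemo {
    // Hypothetical stand-in for a term enumeration; NOT a Lucene class.
    // Assumption: it starts out positioned on the first element.
    static class TermCursor {
        private final List<String> terms;
        private int pos = 0;

        TermCursor(List<String> terms) {
            this.terms = terms;
        }

        // Returns the element at the current position, or null when exhausted.
        String term() {
            return pos < terms.size() ? terms.get(pos) : null;
        }

        // Advances one position; returns true if an element exists there.
        boolean next() {
            pos++;
            return pos < terms.size();
        }
    }

    // Buggy loop: next() runs before the first read, so the first term is skipped.
    static List<String> skippingFirst(List<String> terms) {
        TermCursor cursor = new TermCursor(terms);
        List<String> seen = new ArrayList<String>();
        while (cursor.next()) {
            seen.add(cursor.term());
        }
        return seen;
    }

    // Fixed loop: read the current term first, then advance.
    static List<String> visitingAll(List<String> terms) {
        TermCursor cursor = new TermCursor(terms);
        List<String> seen = new ArrayList<String>();
        if (cursor.term() != null) {
            do {
                seen.add(cursor.term());
            } while (cursor.next());
        }
        return seen;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList("a.txt", "b.txt", "c.txt");
        System.out.println("while loop:    " + skippingFirst(paths));
        System.out.println("do/while loop: " + visitingAll(paths));
    }
}
```

With 13 terms in place of the three sample values, the first loop would report 12, which matches the discrepancy discussed in this thread.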

Re: How to improve search time?

2009-08-02 Thread Phil Whelan
Hi Prashant, Take a look at this... http://wiki.apache.org/lucene-java/ImproveSearchingSpeed Cheers, Phil On Sun, Aug 2, 2009 at 9:33 PM, prashant ullegaddi prashullega...@gmail.com wrote: Hi, I've a single index of size 87GB containing around 50M documents. When I search for any query,

Re: ThreadedIndexWriter vs. IndexWriter

2009-08-01 Thread Phil Whelan
Hi Mike, It's Jibo, not me, having the problem. But thanks for the link. I was interested to look at the code. Will be buying the book soon. Phil On Sat, Aug 1, 2009 at 2:08 AM, Michael McCandless luc...@mikemccandless.com wrote: (Please note that ThreadedIndexWriter is source code available

Re: java.io.IOException when trying to list terms in index (IndexReader)

2009-08-01 Thread Phil Whelan
Hi Jim, I cannot see anything obvious, but both open() and terms() throw IOException. You could try putting these in separate try/catch blocks to see which one it's coming from. Alternatively, calling e.printStackTrace() in the catch block will give more info to help you debug what's happening. On Sat, Aug
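
The debugging suggestion in this reply can be sketched in plain Java. `openIndex` and `listTerms` are hypothetical stand-ins for the two calls that declare `IOException` (they are not Lucene methods), and the second one is made to fail on purpose so that the result shows which block caught the exception.

```java
import java.io.IOException;

class SeparateCatchDemo {
    // Hypothetical stand-in for the call that opens the index; succeeds here.
    static void openIndex() throws IOException {
        // nothing to do in this sketch
    }

    // Hypothetical stand-in for the call that lists terms; always fails here.
    static void listTerms() throws IOException {
        throw new IOException("simulated failure while listing terms");
    }

    // Runs each step in its own try/catch and reports which step failed.
    static String findFailingStep() {
        try {
            openIndex();
        } catch (IOException e) {
            e.printStackTrace(); // prints the full stack trace for debugging
            return "open";
        }
        try {
            listTerms();
        } catch (IOException e) {
            e.printStackTrace();
            return "terms";
        }
        return "none";
    }

    public static void main(String[] args) {
        System.out.println("Failing step: " + findFailingStep());
    }
}
```

A single try block around both calls would also work if the stack trace is printed, since the trace names the method that threw; separate blocks simply make the origin obvious without reading the trace.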

Re: indexing multiple email addresses in one field

2009-07-31 Thread Phil Whelan
include stop word removal in the processing of your token stream. Matt Phil Whelan wrote: Hi Matthew / Paul, On Thu, Jul 30, 2009 at 4:32 PM, Paul Cowan co...@aconex.com wrote: Matthew Hall wrote: Place a delimiter between the email addresses that doesn't get removed in your

Re: Seeking guidance for updating indexes

2009-07-31 Thread Phil Whelan
Hi Jim, There should not be much difference from the Lucene end between a new index and an index you want to update (add more documents to). As stated in the Lucene docs, IndexWriter will create the index if it does not already exist.

Re: ThreadedIndexWriter vs. IndexWriter

2009-07-31 Thread Phil Whelan
Hi Jibo, Have you tried optimizing indexes? I do not know anything about the implementation of ThreadedIndexWriter, but if they both optimize down to the same size, it could just mean that ThreadedIndexWriter is not as optimized. Thanks, Phil On Fri, Jul 31, 2009 at 11:38 AM, Jibo

Re: ThreadedIndexWriter vs. IndexWriter

2009-07-31 Thread Phil Whelan
Hi Jibo, Your mergeFactor is different, and the resulting numFiles (segment files) is different. Maybe each thread is responsible for a segment file. Just curious - do you have 3 threads? Phil

Is it possible to retrieve Terms from a Document?

2009-07-31 Thread Phil Whelan
Hi, I know you can use Field.Store.YES, but I want to inspect the terms / tokens and their order related to the field name at search time. Is this possible? Obviously this information is stored in the index, but I can not find any API to access it. I'm guessing the answer might be that Terms

indexing multiple email addresses in one field

2009-07-30 Thread Phil Whelan
Hi, We have a very large Lucene index that we're developing that has a field of email addresses. (Actually multiple fields with multiple email addresses, but I'll simplify here) Each document will have one email field containing multiple email addresses. I am indexing email addresses only

Re: indexing multiple email addresses in one field

2009-07-30 Thread Phil Whelan
On Thu, Jul 30, 2009 at 11:22 AM, Matthew Hall mh...@informatics.jax.org wrote: 1. Sure, just have an analyzer that splits on all non letter characters. 2. Phrase queries keep the order intact.  (And yes, the positional information for the terms is kept, which is what allows span queries to

Re: indexing multiple email addresses in one field

2009-07-30 Thread Phil Whelan
Hi Matthew / Paul, On Thu, Jul 30, 2009 at 4:32 PM, Paul Cowan co...@aconex.com wrote: Matthew Hall wrote: Place a delimiter between the email addresses that doesn't get removed in your analyzer. (preferably something you know will never be searched on) Or add them separately (rather than:

Re: Is there a list of special characters for standard analyzer?

2009-07-30 Thread Phil Whelan
On Thu, Jul 30, 2009 at 7:12 PM, oh...@cox.net wrote: I was wondering if there is a list of special characters for the standard analyzer? What I mean by special is characters that the analyzer considers break characters. For example, if I have something like foo=something, apparently the

Re: Querying across object relationships

2009-07-29 Thread Phil Whelan
Hi Don, On Wed, Jul 29, 2009 at 1:42 PM, Donal Murtagh domur...@yahoo.co.uk wrote: Course.name Attendance.mandatory Student.name - cooking N Bob art Y

Re: Exclusion search

2009-07-22 Thread Phil Whelan
If there are only a few thousand documents, and the number of results is quite small, is this a case where post-search filtering can be done? I have not done anything like this myself with Lucene, so is this a bad idea? If not, what would be the best way to do this?

Re: indexing 100GB of data

2009-07-22 Thread Phil Whelan
On Wed, Jul 22, 2009 at 5:46 AM, m.harigm.ha...@gmail.com wrote: Is there any article or forum for using Hadoop with lucene? Please any1 help me Hi M, Katta is a project that is combining Lucene and Hadoop. Check it out here... http://katta.sourceforge.net/ Thanks, Phil

Re: Alternative way to simulate sorting without doing actual sort

2009-07-22 Thread Phil Whelan
Hi Ganesh, I'm not sure whether this will work for you, but one way I got around this was with multiple searches. I only needed the first 50 results, but wanted to sort by date,hour,min,sec. This could result in 5 results or millions of results. I added the date to the query, so I'd search for

Re: Batch searching

2009-07-22 Thread Phil Whelan
On Wed, Jul 22, 2009 at 12:28 PM, Matthew Hall mh...@informatics.jax.org wrote: Not sure if this helps you, but some of the issues you are facing seem similar to those in the real time search threads. Hi Matthew, Do you have a pointer of where to go to see the real time threads? Thanks, Phil