Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

2011-03-11 Thread Bill Janssen
shrinath.m shrinat...@webyog.com wrote: Consider we have offline HTML pages, with no parsing done while crawling; now what? Has anyone built a tokenizer for this? In UpLib, which uses PyLucene, I use BeautifulSoup to simplify Web pages by selecting only text between certain tags, before indexing

Re: about pdf search

2011-03-07 Thread Bill Janssen
James Wilson james_wil...@nmcourt.fed.us wrote: I have completed a project to do the exact same thing. I put the pdf text in XML files. Then after I do a Lucene search I read the text from the XML files. I do not store the text in the Lucene index. That would bloat the index and slow down

Re: AW: Best practices for multiple languages?

2011-01-20 Thread Bill Janssen
languages. Bill With this solution: 1. I only need one field (or two if I want both stemmed and unstemmed processing) 2. The user can search in all documents regardless of their language. I hope this helps. Dominique www.zoonix.fr www.crawl-anywhere.com Le 20/01/11 00:29, Bill Janssen

Re: AW: Best practices for multiple languages?

2011-01-19 Thread Bill Janssen
Clemens Wyss clemens...@mysign.ch wrote: 1) Docs in different languages -- every document is one language 2) Each document has fields in different languages We mainly have 1)-models I've recently done this for UpLib. I run a language-guesser over the document to identify the primary

Re: AW: Best practices for multiple languages?

2011-01-19 Thread Bill Janssen
reasonable corpus to be convinced it would be worth it. Bill paul Le 19 janv. 2011 à 19:21, Bill Janssen a écrit : Clemens Wyss clemens...@mysign.ch wrote: 1) Docs in different languages -- every document is one language 2) Each document has fields in different languages We mainly

Re: AW: Best practices for multiple languages?

2011-01-19 Thread Bill Janssen
Paul Libbrecht p...@hoplahup.net wrote: I did several changes of this sort and the precision and recall measures went better in particular in presence of language-indication failure which happened to be very common in our authoring environment. There are two kinds of failures: no language,

Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?

2011-01-18 Thread Bill Janssen
Grant Ingersoll gsing...@apache.org wrote: Where do you get your Lucene/Solr downloads from? [x] ASF Mirrors (linked in our release announcements or via the Lucene website) [] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.) [x] I/we build them from source via an

Re: Indexing is hung or doesn't complete

2010-10-13 Thread Bill Janssen
Ching zchin...@gmail.com wrote: I use PDFBox version 1.1.0; I did find a workaround now. Just wondering which tools do you use to extract text from pdf? Thanks. Ching, in UpLib I use a patched version of xpdf which reports the bounding box and font information for each word (as well as the

Re: finding the analyzer for a language...

2010-09-25 Thread Bill Janssen
Robert Muir rcm...@gmail.com wrote: On Fri, Sep 24, 2010 at 9:58 PM, Bill Janssen jans...@parc.com wrote: I thought that since I'm updating UpLib's Lucene code, I should tackle the issue of document languages, as well. Right now I'm using an off-the-shelf language identifier, textcat

finding the analyzer for a language...

2010-09-24 Thread Bill Janssen
I thought that since I'm updating UpLib's Lucene code, I should tackle the issue of document languages, as well. Right now I'm using an off-the-shelf language identifier, textcat, to figure out which language a Web page or PDF is (mainly) written in. I then want to analyze that document with an
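The language-guess-then-pick-an-analyzer step described here can be sketched roughly as follows. This is a minimal sketch, assuming a Lucene 3.1-era core plus contrib-analyzers classpath; the two-entry map and the `AnalyzerChooser` class name are illustrative, not part of any Lucene API.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

// Map ISO 639-1 codes (as reported by a guesser such as textcat) to
// per-language analyzers, falling back to StandardAnalyzer for unknowns.
public class AnalyzerChooser {
    private static final Map<String, Analyzer> BY_LANGUAGE = new HashMap<String, Analyzer>();
    static {
        BY_LANGUAGE.put("de", new GermanAnalyzer(Version.LUCENE_31));
        BY_LANGUAGE.put("fr", new FrenchAnalyzer(Version.LUCENE_31));
        // ... one entry per language the corpus actually contains
    }

    public static Analyzer forLanguage(String iso639Code) {
        Analyzer chosen = BY_LANGUAGE.get(iso639Code);
        return (chosen != null) ? chosen : new StandardAnalyzer(Version.LUCENE_31);
    }
}
```

The fallback matters in practice: as the later replies in this thread note, language identification fails often enough that the unknown-language path gets real traffic.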

Re: recommended way to identify a version to pass to StandardAnalyzer constructor?

2010-09-19 Thread Bill Janssen
Simon Willnauer simon.willna...@googlemail.com wrote: On Fri, Sep 17, 2010 at 11:45 PM, Bill Janssen jans...@parc.com wrote: Simon Willnauer simon.willna...@googlemail.com wrote: On Fri, Sep 17, 2010 at 8:14 PM, Bill Janssen jans...@parc.com wrote: Simon Willnauer simon.willna

Re: recommended way to identify a version to pass to StandardAnalyzer constructor?

2010-09-17 Thread Bill Janssen
Simon Willnauer simon.willna...@googlemail.com wrote: Hey Bill, let me clarify what Version is used for since I think that caused little confusion. Thanks. The Version constant was mainly introduced to help users with backwards compatibility and upgrading their codebase to a new version

Re: recommended way to identify a version to pass to StandardAnalyzer constructor?

2010-09-17 Thread Bill Janssen
Simon Willnauer simon.willna...@googlemail.com wrote: On Fri, Sep 17, 2010 at 8:14 PM, Bill Janssen jans...@parc.com wrote: Simon Willnauer simon.willna...@googlemail.com wrote: Hey Bill, let me clarify what Version is used for since I think that caused little confusion. Thanks

Re: recommended way to identify a version to pass to StandardAnalyzer constructor?

2010-09-17 Thread Bill Janssen
Bill Janssen jans...@parc.com wrote: ...is there any attribute or static method somewhere in Lucene which will return a value of org.apache.lucene.util.Version that corresponds to the version of the code? That's what I'm looking for. Version.LUCENE_CURRENT looks good, but it's deprecated

recommended way to identify a version to pass to StandardAnalyzer constructor?

2010-09-16 Thread Bill Janssen
So, in version 3, I have to pass a version parameter to the constructor for StandardAnalyzer. Since Version.LUCENE_CURRENT is deprecated, I'd like this to be the same as the version of the index I'm using. Is there a way of getting a value of Version for an index? I don't see obvious methods on
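As far as I can tell there is no public API that inspects an index and hands back the matching `Version` constant, so the usual workaround is to pin the constant explicitly at the one place where the index format is decided. A sketch, assuming a Lucene 3.x classpath; the constant and class name shown are illustrative:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

public class PinnedVersion {
    // Pin the constant matching the release the index was built with,
    // instead of the deprecated LUCENE_CURRENT, so analysis behavior
    // does not silently shift when the jar is upgraded.
    public static final Version INDEX_VERSION = Version.LUCENE_31;

    public static StandardAnalyzer newAnalyzer() {
        return new StandardAnalyzer(INDEX_VERSION);
    }
}
```

Bumping `INDEX_VERSION` then becomes a deliberate act, normally paired with re-indexing, rather than a side effect of upgrading the library.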

Re: Using lucene as a database... good idea or bad idea?

2008-07-29 Thread Bill Janssen
I do this with uplib (http://uplib.parc.com/) with fair success. Originally I thought I'd need Lucene plus a relational database to store metadata about the documents for metadata searches. So far, though, I've been able to store the metadata in Lucene and use the same Lucene DB for both metadata
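The single-index approach described here can be sketched as follows, assuming a Lucene 3.x-era classpath; the field names and values are illustrative. Analyzed fields carry the full text, while stored, unanalyzed fields carry the metadata, so the same index answers both kinds of query:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class MetadataIndex {
    // Index one document carrying both full text and metadata, then answer
    // a metadata query from the same Lucene index -- no relational DB needed.
    public static int countByAuthor(String author) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_31),
                IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("contents", "quarterly report full text ...",
                Field.Store.NO, Field.Index.ANALYZED));
        doc.add(new Field("author", "janssen",
                Field.Store.YES, Field.Index.NOT_ANALYZED));
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(dir);
        TopDocs hits = searcher.search(new TermQuery(new Term("author", author)), 10);
        searcher.close();
        return hits.totalHits;
    }
}
```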

Re: text extraction from pdf

2008-05-15 Thread Bill Janssen
Problem I am having is that some of them has multiple columns. and multiple word boxes. Does the xpdf patch extract different columns and wordboxes? It tells you where each word is. Columns you have to do for yourself. Bill In UpLib, I use xpdf-3.02pl2 with a patch which gives me position

Re: text extraction from pdf

2008-05-14 Thread Bill Janssen
the unix program pdf2text can convert keeping the text places, but I wanted to ask you guys if you know something better, AFAIK, PDFBox has a lower-level API that allows you to get hold of text positions. In UpLib, I use xpdf-3.02pl2 with a patch which gives me position and font

Re: Does Lucene save an offline version of web pages?

2008-04-27 Thread Bill Janssen
- Fetch and index some pages (containing word and pdf documents) on daily basis. - Extract all pages that contain some provided keywords after fetching the pages. - Create some bulletin from fetched pages, bulletin will be in pdf format and are categorized based on keywords. - provide

Re: lucene-core-2.2.0.jar broken? CorruptIndexException?

2007-12-02 Thread Bill Janssen
I'll see if I can get back to this over the weekend. I got a chance to copy my corpus to another G4 and try indexing with Lucene 2.2. This one seems OK! Same texts. So now I'm inclined to believe that it *is* the machine, rather than the code. Whew! Though that doesn't explain why 2.0 works

Re: lucene-core-2.2.0.jar broken? CorruptIndexException?

2007-12-02 Thread Bill Janssen
Hmmm, it still sounds like you are hitting a threading issue that is probably exacerbated by the multicore platform of the newer machine. Exactly what I was thinking. What are the details of the CPUs of these two systems? Ah, good point. The bad machine is a dual-processor 1GHz G4 wind

Re: lucene-core-2.2.0.jar broken? CorruptIndexException?

2007-11-30 Thread Bill Janssen
Your errors seem to happen around the same area (~20K docs). If you skip the first say ~18K docs does the error still happen? We need to somehow narrow this down. I'm trying to boil down the documents to a set which I can deploy on a DVD-ROM, so I can move the same set around from machine to

Re: lucene-core-2.2.0.jar broken? CorruptIndexException?

2007-11-29 Thread Bill Janssen
Do you have another PPC machine to reproduce this on? (To rule out bad RAM/hard-drive on the first one). I'll dig up an old laptop and try it there. Another thing to try is turning on the infoStream (IndexWriter.setInfoStream(...)) and capture and post the resulting log. It will be very large

Re: lucene-core-2.2.0.jar broken? CorruptIndexException?

2007-11-29 Thread Bill Janssen
Another thing to try is turning on the infoStream (IndexWriter.setInfoStream(...)) and capture and post the resulting log. It will be very large since it takes quite a while for the error to occur... I can do that. Here's what I see: Optimizing... merging segments _ram_a (1 docs) _ram_b
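Turning on the infoStream looks roughly like this. A sketch against a Lucene 3.x-era API; here the diagnostics go to an in-memory buffer for brevity, though for a real debugging session you would point the `PrintStream` at a file on a disk with plenty of room, since the log grows quickly:

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class InfoStreamDemo {
    // Capture IndexWriter's flush/merge diagnostics and return their size.
    public static int logBytes() throws Exception {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        IndexWriter writer = new IndexWriter(new RAMDirectory(),
                new StandardAnalyzer(Version.LUCENE_31),
                IndexWriter.MaxFieldLength.UNLIMITED);
        writer.setInfoStream(new PrintStream(buffer));
        Document doc = new Document();
        doc.add(new Field("contents", "some text",
                Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.optimize();   // segment merging is traced to the stream
        writer.close();
        return buffer.size();
    }
}
```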

Re: lucene-core-2.2.0.jar broken? CorruptIndexException?

2007-11-29 Thread Bill Janssen
Another thing to try is turning on the infoStream (IndexWriter.setInfoStream(...)) and capture post the resulting log. It will be very large since it takes quite a while for the error to occur... I can do that. Here's a more complete dump. I've modified the code so that I now remove

Re: lucene-core-2.2.0.jar broken? CorruptIndexException?

2007-11-29 Thread Bill Janssen
Can you try running with the trunk version of Lucene (2.3-dev) and see if the error still occurs? EG you can download this AM's build here: http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/288/artifact/artifacts Still there. Here's the dump with last night's build:

Re: lucene-core-2.2.0.jar broken? CorruptIndexException?

2007-11-29 Thread Bill Janssen
Are you still getting the original exception too or just the Array out of bounds one now? Also, are you doing anything else to the index while this is happening? The code at the point in the exception below is trying to properly handle deleted documents. Just the

Re: lucene-core-2.2.0.jar broken? CorruptIndexException?

2007-11-29 Thread Bill Janssen
Could you post this part of the code (deleting) too? Here it is:
    private static void remove (File index_file, String[] doc_ids, int start) {
        String number;
        String list;
        Term term;
        TermDocs matches;
        if (debug_mode)

Re: lucene-core-2.2.0.jar broken? CorruptIndexException?

2007-11-29 Thread Bill Janssen
Have you tried another PPC machine? No. It's in another location, but perhaps I can get it tomorrow. On the other hand, the success when using 2.0 makes it likely to me that the machine isn't the problem. OK, I've reverted to my original codebase (where I first create a reader and do the

Re: lucene-core-2.2.0.jar broken? CorruptIndexException?

2007-11-29 Thread Bill Janssen
Also, could you try out the CheckIndex tool in 2.3-dev before and after the deletes? Great idea! I don't suppose there's a jar file of it? Bill - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail:

Re: lucene-core-2.2.0.jar broken? CorruptIndexException?

2007-11-29 Thread Bill Janssen
So, it's a little clearer. I get the Array-out-of-bounds exception if I'm re-indexing some already indexed documents -- if there are deletions involved. I get the CorruptIndexException if I'm indexing freshly -- no deletions. Here's an example of that (with the latest nightly). I removed the

Re: lucene-core-2.2.0.jar broken? CorruptIndexException?

2007-11-28 Thread Bill Janssen
Here's the code I'm using:
    try {
        // Now add the documents to the index
        IndexWriter writer = new IndexWriter(index_loc, new StandardAnalyzer(), !index_loc.exists());
        writer.setMaxFieldLength(Integer.MAX_VALUE);
        try {
            for

Re: lucene-core-2.2.0.jar broken? CorruptIndexException?

2007-11-28 Thread Bill Janssen
I just tried re-indexing with lucene-core-2.0.0.jar and the same indexing code; works great. So what am I doing wrong with 2.2? Bill

Re: lucene-core-2.2.0.jar broken? CorruptIndexException?

2007-11-28 Thread Bill Janssen
Are you really sure in your 2.2 test you are starting with no prior index? I'd ask that too, but yes, I'm really really sure. Building a completely new index each time. Works with 2.0.0. Fails with 2.2.0. Works with 2.2.0 *if* I remove the optimization step. Bill

Re: lucene-core-2.2.0.jar broken? CorruptIndexException?

2007-11-28 Thread Bill Janssen
You are not hitting any other exception before this one right? Can you change your test case so that the catch clause is run before the finally clause? I wonder if you are hitting some interesting exception and then trying to optimize, which then masks the original exception. Yes, I

Re: lucene-core-2.2.0.jar broken? CorruptIndexException?

2007-11-28 Thread Bill Janssen
I'm going to run the same software on an Intel machine and see what happens. So, I ran the same codebase with lucene-core-2.2.0.jar on an Intel Mac Pro, OS X 10.5.0, Java 1.5, and no exception is raised. Different corpus, about 5 pages instead of 2. This is reinforcing my thinking

Re: lucene-core-2.2.0.jar broken? CorruptIndexException?

2007-11-28 Thread Bill Janssen
Hmmm ... how many chunks of about 50 pages do you do before hitting this? Roughly how many docs are in the index when it happens? Oh, gosh, not sure. I'm guessing it's about half done. Can you describe the docs/fields you're adding? I've got 1735 documents, 18969 pages -- average page size

Re: Lucille, a (new) Python port of Lucene

2007-08-28 Thread Bill Janssen
Lucille apparently doesn't require gcj. Bill Why Lucille in light of PyLucene? Erik On Aug 28, 2007, at 10:55 AM, Dan Callaghan wrote: Dear list, I have recently begun a Python port of Lucene, named Lucille. It is still very much a work in progress, but I hope to have a

Re: Keyphrase Extraction (via Lingo)

2007-05-09 Thread Bill Janssen
Dawid Weiss wrote: You could also try splitting the document into paragraphs and use Carrot2's Lingo algorithm (www.carrot2.org) on a paragraph-level to extract clusters. Labelling routine in Lingo should extract 'key' phrases; this analysis is heavily frequency-based, but... you know,

Re: Keyphrase Extraction

2007-05-08 Thread Bill Janssen
Dawid Weiss wrote: You could also try splitting the document into paragraphs and use Carrot2's Lingo algorithm (www.carrot2.org) on a paragraph-level to extract clusters. Labelling routine in Lingo should extract 'key' phrases; this analysis is heavily frequency-based, but... you know, you

Re: strange idf in Lucene 2.1

2007-04-12 Thread Bill Janssen
docfreqs (idfs) do not take into account deleted docs. This is more of an engineering tradeoff rather than a feature. If we could cheaply and easily update idfs when documents are deleted from an index, we would. Wow. So is it fair to say that the stored IDF is really the cumulative IDF for

Re: strange idf in Lucene 2.1

2007-04-12 Thread Bill Janssen
The difference between IndexReader.maxDoc() and numDocs() tells you how many documents have been marked for deletion but still take up space in the index. But not which terms have an odd IDF value because of those deleted documents. How much does the IDF value contribute to the score in
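The maxDoc/numDocs bookkeeping mentioned here can be demonstrated in a few lines. A sketch assuming a Lucene 3.x-era classpath; the field names are illustrative:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class DeletedDocCount {
    // maxDoc() counts documents still occupying slots in the index;
    // numDocs() excludes those marked deleted. The difference is the
    // number of deleted-but-unreclaimed documents still feeding docFreq.
    public static int deletedCount() throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_31),
                IndexWriter.MaxFieldLength.UNLIMITED);
        for (int i = 0; i < 3; i++) {
            Document doc = new Document();
            doc.add(new Field("id", "doc" + i,
                    Field.Store.YES, Field.Index.NOT_ANALYZED));
            writer.addDocument(doc);
        }
        writer.deleteDocuments(new Term("id", "doc1"));  // marked, not reclaimed
        writer.close();                                  // no optimize, so no merge

        IndexReader reader = IndexReader.open(dir);
        int deleted = reader.maxDoc() - reader.numDocs();
        reader.close();
        return deleted;
    }
}
```

After an optimize, the deleted slots are reclaimed and the difference drops back to zero, which is also when docFreq stops counting the deleted documents.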

Re: keywords in a document

2007-04-09 Thread Bill Janssen
Try looking at the retrieveInterestingTerms method on the class MoreLikeThis. http://lucene.apache.org/java/2_0_0/api/org/apache/lucene/search/similar/MoreLikeThis.html Bill
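Usage of `retrieveInterestingTerms` looks roughly like this. A sketch assuming a Lucene 3.x-era core plus the contrib-queries jar (where `MoreLikeThis` lives); the tiny three-document corpus and the relaxed frequency thresholds are illustrative:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.similar.MoreLikeThis;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class KeywordDemo {
    // Pull the "interesting" (high tf-idf) terms out of document 0.
    public static String[] keywordsOfFirstDoc() throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_31),
                IndexWriter.MaxFieldLength.UNLIMITED);
        String[] texts = {
            "lucene is a text search engine library",
            "lucene indexes documents into segments",
            "cooking pasta requires boiling water"
        };
        for (String text : texts) {
            Document doc = new Document();
            // term vectors let MoreLikeThis read terms without re-analyzing
            doc.add(new Field("contents", text, Field.Store.YES,
                    Field.Index.ANALYZED, Field.TermVector.YES));
            writer.addDocument(doc);
        }
        writer.close();

        IndexReader reader = IndexReader.open(dir);
        MoreLikeThis mlt = new MoreLikeThis(reader);
        mlt.setFieldNames(new String[] {"contents"});
        mlt.setMinTermFreq(1);   // the defaults assume a much larger corpus
        mlt.setMinDocFreq(1);
        String[] keywords = mlt.retrieveInterestingTerms(0);
        reader.close();
        return keywords;
    }
}
```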

Re: Reduction based more like this?

2007-02-09 Thread Bill Janssen
For example, given terms female, John and London - all 3 may have equal IDF but should a document representing a female in London be given equal weighting to a document representing the rarer example of a female who happens to be called John? Not to mention multi-word phrase tokenization,

Re: using a document as a query?

2007-01-31 Thread Bill Janssen
MoreLikeThis is just what I wanted. Thanks. Bill Yes, I believe Dave did something like that on searchmorph.org and somebody else did this somewhere with RFCs. What's that called? Query by example? I think so, try define:Query By Example on Google. Take a look at MoreLikeThis

using a document as a query?

2007-01-30 Thread Bill Janssen
I was thinking of trying something, and wondered if someone else already had it working... I'd like to take a document, and use it as a query to find other documents in my index that 'match' it. I'm talking about short documents, like newspaper articles or email messages. Seems to me that there
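The query-by-example idea described here is what `MoreLikeThis.like(int)` does: it turns an already-indexed document into a `Query`. A sketch assuming a Lucene 3.x-era core plus the contrib-queries jar; the two-document corpus is illustrative:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.similar.MoreLikeThis;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class QueryByExample {
    // Build a query from document 0 and count the documents that match it
    // (document 0 itself will normally be among the hits).
    public static int hitsLikeFirstDoc() throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_31),
                IndexWriter.MaxFieldLength.UNLIMITED);
        String[] texts = {
            "lucene text search for email messages",
            "lucene indexing of newspaper articles"
        };
        for (String text : texts) {
            Document doc = new Document();
            doc.add(new Field("contents", text, Field.Store.YES,
                    Field.Index.ANALYZED, Field.TermVector.YES));
            writer.addDocument(doc);
        }
        writer.close();

        IndexReader reader = IndexReader.open(dir);
        MoreLikeThis mlt = new MoreLikeThis(reader);
        mlt.setFieldNames(new String[] {"contents"});
        mlt.setMinTermFreq(1);   // relaxed thresholds for a toy corpus
        mlt.setMinDocFreq(1);
        Query query = mlt.like(0);          // the document *is* the query

        IndexSearcher searcher = new IndexSearcher(reader);
        int hits = searcher.search(query, 10).totalHits;
        searcher.close();
        reader.close();
        return hits;
    }
}
```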

overriding addClause()?

2006-10-23 Thread Bill Janssen
I'd like to suggest a minor change in the QueryParser.jj. I thought I'd describe it here and get some feedback before posting a patch. The issue is that I can't get my hands on some clauses (typically PhraseQuery instances) in my subclass of MultiFieldQueryParser, which I'd like to do to

IMAP server that uses Lucene?

2006-05-28 Thread Bill Janssen
Hi! I've got oodles of email stored in MH (one file per message, hierarchical directories) format. I'm looking for an IMAP server that will use Lucene to index that mail and perform the various search parts of the IMAP protocol. Ideally, the mail would not have to be converted to another email

Re: Fetch Documents Without Retrieveing All Fields

2006-04-10 Thread Bill Janssen
In case anyone else was wondering: I got curious about how one would replace FieldCache, and discovered that you can create an instance of a class which implements FieldCache, and then simply assign it to org.apache.lucene.search.FieldCache.DEFAULT. 2) your use case sounds like it could best be

Re: WRITE_LOCK_TIMEOUT

2006-04-05 Thread Bill Janssen
Hi. Is it correct that in Release 1.9.1 a WRITE_LOCK_TIMEOUT is hardcoded and there is no way to set it from outside? I've seen a check-in in the CVS from a few days ago which added getters/setters for this, but ... there is no release containing this, right? So, my question is:

Any plans for a 1.9.2 release? Need timeout setting!

2006-03-30 Thread Bill Janssen
I presume the patch that gives us a way of overriding the default timeout for write locks has made it into the source DB, but I really need a jar file to point people at which contains it. Any chance of a 1.9.2 release? Bill -

Re: Can i use lucene to search the internet.

2006-03-23 Thread Bill Janssen
Let's stop this thread. Can i use lucene to search the internet. No. You may be able to use Lucene to *index* the internet, and then search the resulting index. Read the book Lucene in Action for a better idea of what this would entail. Bill

Re: Setting the COMMIT lock timeout.

2006-03-13 Thread Bill Janssen
Daniel Naber ponders: Seems these have been forgotten. They can easily be added, but I still wonder what the use case is to set these values? The default value isn't magic. The appropriate value is context-specific. I've got some people using Lucene on machines with slow disks, and we need

Adjusting WRITE_LOCK_TIMEOUT in 1.9.1

2006-03-09 Thread Bill Janssen
I don't see how to adjust the value of IndexWriter's WRITE_LOCK_TIMEOUT in 1.9. Since the property org.apache.lucene.writeLockTimeout is no longer consulted, the value of IndexWriter.WRITE_LOCK_TIMEOUT is final, and there's no setter, what's the deal? Bill
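For what it's worth, releases after 1.9.1 restored a process-wide setter for this. A sketch assuming a Lucene 2.1+/3.x classpath, where `IndexWriter.setDefaultWriteLockTimeout` exists; in 1.9.1 itself the constant remains final with no setter, which is exactly the complaint above:

```java
import org.apache.lucene.index.IndexWriter;

public class LockTimeoutConfig {
    // Set the default write-lock timeout (in milliseconds) process-wide;
    // it applies to IndexWriter instances created afterwards.
    public static long configure(long millis) {
        IndexWriter.setDefaultWriteLockTimeout(millis);
        return IndexWriter.getDefaultWriteLockTimeout();
    }
}
```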

notification of active IndexSearchers when index is modified?

2006-01-19 Thread Bill Janssen
I've got a daemon process which keeps an IndexSearcher open on an index and responds to query requests by sending back document identifiers. I've also got other processes updating the index by re-indexing existing documents, deleting obsolete documents, and adding new documents. Is there any way
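There is no push notification in this era of Lucene, so the usual answer is poll-before-search: the daemon asks its reader whether the index changed underneath it and reopens if so. A sketch assuming a Lucene 2.4+/3.x classpath (where `IndexReader.reopen()` exists); the helper and field names are illustrative:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class FreshnessCheck {
    // Simulate another process updating the index after the reader opened,
    // then detect the staleness and reopen before serving the next query.
    public static int numDocsAfterReopen() throws Exception {
        RAMDirectory dir = new RAMDirectory();
        addDoc(dir, "first");
        IndexReader reader = IndexReader.open(dir);
        addDoc(dir, "second");            // "another process" updates the index

        if (!reader.isCurrent()) {        // cheap staleness check per query
            IndexReader fresh = reader.reopen();  // shares unchanged segments
            if (fresh != reader) {
                reader.close();
                reader = fresh;
            }
        }
        int n = reader.numDocs();
        reader.close();
        return n;
    }

    private static void addDoc(RAMDirectory dir, String text) throws Exception {
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_31),
                IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("contents", text, Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();
    }
}
```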

multi-field query parser with AND operator?

2006-01-04 Thread Bill Janssen
I've got some code, developed for Lucene 1.4.1, that works around the problem of having both (1) multiple default fields and (2) the AND operator for query elements. In 1.4.1, MultiFieldQueryParser effectively only allowed the OR operator. I'm wondering if this has changed in 1.9. Will I be
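In later releases this combination works out of the box: the default operator is settable on the parser itself. A sketch assuming a Lucene 2.9+/3.x classpath; the field names are illustrative:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class AndAcrossFields {
    // Parse a query over several default fields with AND semantics:
    // every term must appear in at least one of the fields.
    public static String parse(String userQuery) throws Exception {
        MultiFieldQueryParser parser = new MultiFieldQueryParser(
                Version.LUCENE_31,
                new String[] {"title", "contents"},
                new StandardAnalyzer(Version.LUCENE_31));
        parser.setDefaultOperator(QueryParser.AND_OPERATOR);
        Query q = parser.parse(userQuery);
        return q.toString();
    }
}
```

Parsing "female london" this way yields a query of the shape `+(title:female contents:female) +(title:london contents:london)`: each term is required, but may match in either field.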

Re: Lucene does NOT use UTF-8.

2005-08-27 Thread Bill Janssen
Thanks for pointing this out, Marvin. I wish Sun (or someone) would document and register this particular character set encoding with IANA, so that it could be used outside of Java. As it stands now, it's essentially a bastard encoding, good for nothing, and one of the warts of Java. Lucene

Normalizing search scores over multiple indices

2005-04-04 Thread Bill Janssen
I've got a situation where I'm searching over a number of different repositories, each containing a different set of documents. I'd like to run searches over, say, 4 different indices, then combine the results outside of Java to present to the user. Is there any way of normalizing search scores
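One in-Java option for this is `MultiSearcher`, which combines docFreqs across its sub-indices so the resulting scores are computed against merged statistics rather than per-index ones. A sketch assuming a Lucene 3.x-era classpath (before `MultiSearcher` was removed in 4.0); the two tiny indices are illustrative:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class CrossIndexSearch {
    // Search two separate indices through one MultiSearcher; docFreqs are
    // combined across the sub-indices, making the scores comparable.
    public static int totalHits() throws Exception {
        RAMDirectory dir1 = newIndex("alpha normalize scores");
        RAMDirectory dir2 = newIndex("beta normalize ranking");
        MultiSearcher all = new MultiSearcher(new Searchable[] {
                new IndexSearcher(dir1), new IndexSearcher(dir2)});
        int n = all.search(new TermQuery(new Term("contents", "normalize")), 10).totalHits;
        all.close();
        return n;
    }

    private static RAMDirectory newIndex(String text) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_31),
                IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("contents", text, Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();
        return dir;
    }
}
```

If the results must instead be merged outside of Java, as the question asks, the scores from independent searchers are not directly comparable, which is exactly why the combined-statistics approach above exists.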