RE: HTML text extraction

2006-06-22 Thread Liao Xuefeng
Hi all, I wrote my own HTML parser because it meets my requirements and does not depend on third-party libraries, and I'd like to share it (attached). This class provides some static methods for HTML <-> text conversion: HtmlUtil.html2text(String html); HtmlUtil.text2html(String text); a
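
The attachment is not part of this listing, so the following is only a minimal sketch of what two such static methods could look like, not the code from the message. The class name HtmlUtilSketch, the regex-based approach, and the short list of entities are all assumptions; only the method names html2text and text2html come from the message.

```java
import java.util.regex.Pattern;

/** Hypothetical stand-in for the attached class; not the original code. */
final class HtmlUtilSketch {
    // Whole <script>...</script> and <style>...</style> blocks, case-insensitive.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    // Any remaining tag, including ones that span several lines.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    private HtmlUtilSketch() {}

    /** Strips tags and decodes a handful of common entities. */
    static String html2text(String html) {
        String s = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        s = TAG.matcher(s).replaceAll(" ");
        s = s.replace("&nbsp;", " ")
             .replace("&lt;", "<")
             .replace("&gt;", ">")
             .replace("&quot;", "\"")
             .replace("&amp;", "&"); // decode &amp; last so "&amp;lt;" becomes "&lt;"
        return WHITESPACE.matcher(s).replaceAll(" ").trim();
    }

    /** Escapes the characters that are significant in HTML and keeps line breaks. */
    static String text2html(String text) {
        return text.replace("&", "&amp;") // escape & first to avoid double escaping
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\"", "&quot;")
                   .replace("\n", "<br>\n");
    }
}
```

A regex pass like this is fine for feeding an indexer but is not a real parser: it replaces every tag with a space (so inline markup in the middle of a word splits it) and it will mishandle a literal ">" inside an attribute value.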

Re: A Special SpanQuery

2006-06-22 Thread Chris Hostetter
I don't really use Span queries, but this strikes me as being very similar to past discussions about using Span queries along with "sentinel" terms to find words in the same sentence or paragraph. If you had a special Term indexed at the end of every document, you could do something like this..

RE: lucene in combination with pattern recognition...

2006-06-22 Thread bruce
hi simon, like a hole in my head. what i really need is a way to recursively iterate through a site, and to be able to selectively iterate through the 'form' elements on a given page. ie, if i visually analyze a site and determine that the 1st level (page) has a form, and i need to set t

Re: lucene in combination with pattern recognition...

2006-06-22 Thread Simon Courtenage
You might also check out an old paper by Kruger, Giles, Lawrence et al. on a search engine called Deadliner (see here at http://clgiles.ist.psu.edu/papers/CIKM-2000-deadliner.pdf). Deadliner crawled for Calls for Papers for conferences, using Support Vector Machines trained to recognise relevant

A Special SpanQuery

2006-06-22 Thread Ben Knear
I am trying to make a SpanNearQuery that will contain a SpanNotQuery and running into a bit of difficulty. Has anyone worked with creating a variation of a SpanQuery or using special logic to make this work? For example - (A B !C) in order with a slop of 1 should return results with A and B with

Re: lucene in combination with pattern recognition...

2006-06-22 Thread Bob Carpenter
Check out Andrew McCallum's paper: http://www.cs.umass.edu/~mccallum/papers/acm-queue-ie.pdf It mentions this very problem. There are also some more technical presentations around. He was part of the Whiz-Bang team that took on the problem. The fact that the company's out of business is a tes

Re: Phrase Frequency For Analysis

2006-06-22 Thread Bob Carpenter
Adding to this growing thread, there's really no reason to index all the term bigrams, trigrams, etc. It's not only slow, it's very memory/disk intensive. All you need to do is two passes over the collection. Pass One: Collect counts of bigrams (or trigrams, or whatever -- if size is an
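
The snippet is cut off before the second pass is described, so only the first pass is sketched here. This is an illustration under assumptions, not the poster's code: the class name NGramCounter, whitespace tokenization, lower-casing, and holding the whole collection in a List are all choices made for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: one pass over a collection, counting word n-grams. */
final class NGramCounter {
    private NGramCounter() {}

    /**
     * Counts every run of n consecutive tokens across all documents.
     * N-grams never span a document boundary.
     */
    static Map<String, Integer> countNGrams(List<String> docs, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : docs) {
            String trimmed = doc.trim().toLowerCase(Locale.ROOT);
            if (trimmed.isEmpty()) {
                continue; // nothing to count in an empty document
            }
            String[] tokens = trimmed.split("\\s+");
            for (int i = 0; i + n <= tokens.length; i++) {
                String gram = String.join(" ", Arrays.copyOfRange(tokens, i, i + n));
                counts.merge(gram, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

With n = 2 this yields collection-wide bigram counts in a single in-memory map, without adding any n-gram terms to the index.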

Re: Searching repeating fields

2006-06-22 Thread Chris Hostetter
: Here, the 'revenue-info' is a repeating node, so we can have records like : : Record 1 : ---financial-data : --revenue-info : year = 2000 : amount = 100 : --revenue-info : year = 2001 : amount = 200 : : Record 2 : ---financial-data : --revenue-

Searching repeating fields

2006-06-22 Thread Subodh Damle
Hi all. We've been using Lucene to index our dynamic data structure and so far Lucene has been flexible enough to accommodate our requirements. Now we have this requirement about searching repeating fields, whose implementation is not clear. Our data records have a dynamic tree-like structu

Re: Lucene and SIPs

2006-06-22 Thread Bob Carpenter
Time to pull out the chalkboard. :-) SIPs, at least in the Amazon sense, are usually found by means of statistical independence testing. You can find more info in Chris Manning's and Hinrich Schuetze's statistical NLP book (heads-up: they're now working on an IR book with more of a focus on sear
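
The visible part of the message does not say which independence test is meant. As one common choice (an assumption here), Pearson's chi-squared statistic for a 2x2 contingency table can be computed from four counts: how often the pair occurs, how often each word occurs in its position, and the total number of bigrams. The class and parameter names below are hypothetical.

```java
/** Hypothetical helper: chi-squared association score for a word pair. */
final class BigramAssociation {
    private BigramAssociation() {}

    /**
     * @param pairCount   occurrences of "w1 w2"
     * @param firstCount  bigrams whose first word is w1
     * @param secondCount bigrams whose second word is w2
     * @param total       total number of bigrams in the collection
     */
    static double chiSquared(long pairCount, long firstCount, long secondCount, long total) {
        double o11 = pairCount;                                    // w1 followed by w2
        double o12 = firstCount - pairCount;                       // w1 followed by something else
        double o21 = secondCount - pairCount;                      // something else followed by w2
        double o22 = total - firstCount - secondCount + pairCount; // neither
        double denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22);
        if (denominator == 0.0) {
            return 0.0; // a zero row or column total: the statistic is undefined, report no evidence
        }
        double diff = o11 * o22 - o12 * o21;
        return total * diff * diff / denominator;
    }
}
```

A score near zero means the two words co-occur about as often as chance predicts; larger scores indicate stronger association. The statistic is unreliable when expected counts are very small, which is common for rare phrases.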

Re: Restricting search space to a large number of ids

2006-06-22 Thread Erick Erickson
How many documents are you getting in your result set? And how are you dealing with those results? If you're looking at more than a hundred or so using a Hits object, you are actually re-executing the query every 100 results or so you examine. This has been discussed several times, you might want

Re: lucene and maven2

2006-06-22 Thread sfryxell
that's kinda what i was thinking. i'll just upload the correct jar to my company's repository. thanks. On Jun 22, 2006, at 11:43 AM, Chris Hostetter wrote: : http://www.ibiblio.org/maven2/org/apache/lucene/lucene-core/2.0.0/ : : my classes won't compile against this jar as it doesn't contain

Re: lucene and maven2

2006-06-22 Thread Chris Hostetter
: http://www.ibiblio.org/maven2/org/apache/lucene/lucene-core/2.0.0/ : : my classes won't compile against this jar as it doesn't contain any : class files. there is a pom in the manifest directory. I don't know much about maven, but that certainly doesn't look like a valid lucene-core jar. Perha

Re: Phrase Frequency For Analysis

2006-06-22 Thread Andrzej Bialecki
Nader Akhnoukh wrote: Yes, Chris is correct, the goal is to determine the most frequently occurring phrases in a document compared to the frequency of that phrase in the index. So there are only output phrases, no inputs. Also performance is not really an issue, this would take place on an irre

Re: Phrase Frequency For Analysis

2006-06-22 Thread Kamal Abou Mikhael
I may be coming into this thread without knowing enough. I have implemented a phrase filter, which indexes all token sequences that are 2 to N tokens long. N is defined in the constructor. It takes a stopword Trie for input because the policy I used, based on a published work I read, was that a

Re: Phrase Frequency For Analysis

2006-06-22 Thread Nader Akhnoukh
Yes, Chris is correct, the goal is to determine the most frequently occurring phrases in a document compared to the frequency of that phrase in the index. So there are only output phrases, no inputs. Also performance is not really an issue, this would take place on an irregular basis and could ru

Restricting search space to a large number of ids

2006-06-22 Thread Jonathan Taylor
Hi, I have an index of 3 million documents. Document id is stored but not indexed and document contents are indexed but not stored. Searches are quite slow, but for each document I have a list of 50,000 or so relevant documents. I would like Lucene to search only in these. I can see I can restr

Re: addIndexes() is taking infinite time ...

2006-06-22 Thread Otis Gospodnetic
It can't be ignored. But look in JIRA, I believe there is a patch there that changes the code so that two optimize() calls are not needed. If that works for you, please let us know. Otis - Original Message From: heritrix.lucene <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent

lucene and maven2

2006-06-22 Thread sfryxell
what up g. trying to use the lucene-core.2.0.0.jar that is in the maven repository at http://www.ibiblio.org/maven2/org/apache/lucene/lucene-core/2.0.0/ my classes won't compile against this jar as it doesn't contain any class files. there is a pom in the manifest directory. was this int

Re: Can't open index

2006-06-22 Thread James Pine
Hey Thomas, It looks like your index file(s) are being stored on a remote file system. Is it possible that the network connection fails sometimes during your indexing/searching operation? If that's not the issue, you mention that you're creating your index file at the same time that you're search

Re: HTML text extraction

2006-06-22 Thread Michael Wechner
John Wang wrote: Hi Xuefeng: Can you please send me your htmlparser too? Xuefeng, would it be possible to open source your parser? Thanks Michi thanks -John On 6/21/06, Daniel Noll <[EMAIL PROTECTED]> wrote: Simon Courtenage wrote: > I also use htmlparser, which is rather good. I'

Re: HTML text extraction

2006-06-22 Thread John Wang
Hi Xuefeng: Can you please send me your htmlparser too? thanks -John On 6/21/06, Daniel Noll <[EMAIL PROTECTED]> wrote: Simon Courtenage wrote: > I also use htmlparser, which is rather good. I've had to customize it, > though, to parse strings containing > html source rather than accept

RE: Lucene and SIPs

2006-06-22 Thread Larry Ogrodnek
I didn't make too much progress, and kind of ended up dropping it. One thing that I played with was creating multiple phrase indexes, one each for 2, 3, 4, and 5 words. I wrote a tokenizer that would batch up the words, so, for the input string: The quick brown fox jumps over the slow lazy
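
The example output in the message is cut off, so the exact batching scheme is not visible. One reading, assumed here, is a sliding window that emits every run of consecutive words of a given size, called once per phrase index (2, 3, 4 and 5 words). The class and method names are hypothetical, and this is a plain-string sketch rather than a Lucene Tokenizer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: groups consecutive words into fixed-size phrases. */
final class WordBatcher {
    private WordBatcher() {}

    /** Returns every run of {@code size} consecutive words, joined by single spaces. */
    static List<String> batches(String input, int size) {
        List<String> out = new ArrayList<>();
        String trimmed = input.trim();
        if (trimmed.isEmpty() || size < 1) {
            return out;
        }
        String[] words = trimmed.split("\\s+");
        // Slide a window of `size` words one position at a time.
        for (int i = 0; i + size <= words.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(words, i, i + size)));
        }
        return out;
    }
}
```

Calling batches(text, 2) through batches(text, 5) gives the token streams for the four separate phrase indexes; an input shorter than the window simply produces no phrases for that size.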

Re: Modifying the stored norm type

2006-06-22 Thread Yonik Seeley
On 6/22/06, karl wettin <[EMAIL PROTECTED]> wrote: I tried to make a quick and dirty proof of concept, but noticed that no matter what order TermDocs return the documents, the collector gets ascending document number order. TermDocs should also always return documents in ascending order for a si

Can't open index

2006-06-22 Thread WATHELET Thomas
I'm creating my index and, at the same time, trying to run some searches against it. Sometimes I get this error message: "\\tradluxstmp01\JavaIndex\tra\index_FR\_335.fnm (The system cannot find the file specified)" What do I have to do, or what is happening?

Re: Search within multiple different subfolders

2006-06-22 Thread Erick Erickson
Perhaps for privacy reasons? that only specific users should be able to search the whole index. Is there a best practice approach to realize this? Good point. But I still think you could get the same effect with less complexity by including a "source" tag (to extend the example) and munging

SV: Modifying the stored norm type

2006-06-22 Thread Marcus Falck
But that doesn't solve my problem since I can't guarantee that articles are added in a special order to the index. However, it seems to work nicely using a float as norm value. / Marcus From: Paul Elschot [mailto:[EMAIL PROTECTED] Sent: Wed 2006-06-21 19:32 T

Re: Modifying the stored norm type

2006-06-22 Thread karl wettin
On Wed, 2006-06-21 at 19:32 +0200, Paul Elschot wrote: > > > TermDocs in reversed chronological order > > There is no need to write extra code for that, the documents would be > collected oldest first, newest last. I tried to make a quick and dirty proof of concept, but noticed that no matter wha

Re: Phrase Frequency For Analysis

2006-06-22 Thread Andrzej Bialecki
Chris Hostetter wrote: I think either you misunderstood Nader's question or I did: I believe the goal is to determine what the most frequently occurring phrases are -- not determine how frequently a particular input phrase appears. Isn't the latter a pre-requisite for the former ? ;) Regardi

Re: What is a "Lazy Field"...

2006-06-22 Thread Chris Hostetter
Searching the mailing list archives can be helpful for understanding new concepts like this; in particular this is something that has been discussed on java-dev... http://www.nabble.com/forum/Search.jtp?forum=44&local=y&query=Lazy+Field http://www.nabble.com/Lazy-Field-Loading-t1362158.html#a3649

Re: Phrase Frequency For Analysis

2006-06-22 Thread Chris Hostetter
: > I am trying to get the most frequently occurring phrases in a document and : > in the index as a whole. The goal is compare the two to get something like : > Amazon's SIPs. : Other than indexing the phrases directly, you could use a SpanNearQuery : over the words, use getSpans() on its SpanS

Re: addIndexes() is taking infinite time ...

2006-06-22 Thread heritrix . lucene
so how can it be ignored? On 6/22/06, Mike Streeton <[EMAIL PROTECTED]> wrote: From memory, addIndexes() also does an optimization beforehand, this might be what is taking the time. Mike www.ardentia.com the home of NetSearch -Original Message- From: heritrix.lucene [mailto:[EMAIL

Re: Search within multiple different subfolders

2006-06-22 Thread Martin Braun
hi, > > I'm hardly the lucene expert, but I don't think you can search just a > portion of the index. But that's effectively what you're doing if you > restrict the search to "son and.". I think there is also the possibility to write a custom search filter (org.apache.lucene.search.Filter), an

Re: Phrase Frequency For Analysis

2006-06-22 Thread Paul Elschot
On Thursday 22 June 2006 01:33, Nader Akhnoukh wrote: > Hi, I've looked through the archives and it looks like this question has > been asked in one form or another a few times, but without a satisfactory > solution. > > I am trying to get the most frequently occurring phrases in a document and >