Hi all,
I wrote my own HTML parser because it meets my requirements exactly and does
not depend on any third-party library. I'd like to share it (see attachment).
This class provides some static methods for HTML <-> text conversion:
HtmlUtil.html2text(String html);
HtmlUtil.text2html(String text);
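The attachment may not survive the list archive, so here is a minimal sketch of what such a converter could look like. This is not the attached class: the class name, the tag-stripping rule, and the handful of entities handled are all assumptions for illustration.

```java
// Minimal sketch of an HTML <-> text converter (hypothetical; not the
// attached class). html2text strips tags and decodes a few common
// entities; text2html escapes characters significant in HTML.
public class HtmlUtilSketch {

    // Remove tags, then decode entities (&amp; last, so that decoded
    // ampersands are not re-interpreted).
    public static String html2text(String html) {
        String text = html.replaceAll("<[^>]*>", "");
        return text.replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");
    }

    // Escape in the opposite order: & first, so that the escapes we
    // introduce are not themselves re-escaped.
    public static String text2html(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\"", "&quot;");
    }
}
```

This is a rough sketch; a real parser also has to handle comments, scripts, and malformed markup.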
I don't really use Span queries, but this strikes me as being very similar
to past discussions about using Span queries along with "sentinel" terms
to find words in the same sentence or paragraph.
If you had a special Term indexed at the end of every document, you
could do something like this..
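The sentinel idea can be shown at the analysis stage: inject a reserved token at every sentence boundary, so that a span query can later reject matches that cross a boundary. The token name and the boundary rule below are illustrative assumptions, not an established convention.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the "sentinel term" idea: emit a reserved token at every
// sentence boundary so a SpanNotQuery can later reject spans that
// cross it. The sentinel name and boundary regex are assumptions.
public class SentinelTokenizer {
    public static final String SENTINEL = "_SENT_";

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        // Split into sentences at ./!/? followed by whitespace.
        for (String sentence : text.split("(?<=[.!?])\\s+")) {
            for (String word : sentence.split("\\s+")) {
                String w = word.replaceAll("[.!?]$", "").toLowerCase();
                if (w.length() > 0) tokens.add(w);
            }
            tokens.add(SENTINEL);  // marks the sentence boundary
        }
        return tokens;
    }
}
```

At query time, a SpanNotQuery excluding the sentinel term would then restrict near-matches to a single sentence.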
hi simon
like a hole in my head
what i really need is a way to recursively iterate through a site, and to
be able to selectively iterate through the 'form' elements on a given page.
ie, if i visually analyze a site and determine that the 1st level (page) has
a form, and i need to set t
You might also check out an old paper by Kruger, Giles, Lawrence et al. on
a search engine called Deadliner (see here at
http://clgiles.ist.psu.edu/papers/CIKM-2000-deadliner.pdf).
Deadliner crawled for Calls for Papers for conferences, using Support
Vector Machines trained to
recognise relevant
I am trying to make a SpanNearQuery that will contain a SpanNotQuery and
running into a bit of difficulty. Has anyone worked with creating a
variation of a SpanQuery or using special logic to make this work?
For example - (A B !C) in order with a slop of 1 should return results
with A and B with
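One way to express (A B !C) with the Span API, sketched against the Lucene 2.0 span classes (the field name and the literal terms "a"/"b"/"c" are placeholders): build the in-order SpanNearQuery over A and B first, then wrap it in a SpanNotQuery so that any A..B span overlapping an occurrence of C is dropped.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanNotQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

// (A B !C): "a" followed by "b" within slop 1, in order, but reject
// spans that overlap an occurrence of "c".
public class SpanNotExample {
    public static SpanQuery build(String field) {
        SpanQuery a = new SpanTermQuery(new Term(field, "a"));
        SpanQuery b = new SpanTermQuery(new Term(field, "b"));
        SpanQuery c = new SpanTermQuery(new Term(field, "c"));
        SpanQuery aNearB = new SpanNearQuery(new SpanQuery[] { a, b }, 1, true);
        return new SpanNotQuery(aNearB, c);
    }
}
```

Because SpanNotQuery removes include-spans that overlap the exclude-spans, a C sitting between A and B falls inside the A..B span and eliminates that match, which appears to be the behavior asked for.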
Check out Andrew McCallum's paper:
http://www.cs.umass.edu/~mccallum/papers/acm-queue-ie.pdf
It mentions this very problem. There are
also some more technical presentations around.
He was part of the Whiz-Bang team that took
on the problem. The fact that the company's
out of business is a tes
Adding to this growing thread, there's really no reason to
index all the term bigrams, trigrams, etc. It's not
only slow, it's very memory/disk intensive. All you need
to do is two passes over the collection.
Pass One
Collect counts of bigrams (or trigrams, or whatever -- if
size is an
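The two-pass scheme can be sketched in plain Java: pass one counts n-grams across the collection; pass two would then re-read the collection and keep (or index) only those above a frequency cutoff. The corpus shape and cutoff below are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// Two-pass sketch: pass one counts bigrams over the whole collection;
// pass two keeps only those above a frequency cutoff, so the index
// never sees the long tail of rare bigrams.
public class BigramCounter {

    // Pass one: count every adjacent word pair in every document.
    public static Map<String, Integer> countBigrams(String[] docs) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String doc : docs) {
            String[] words = doc.toLowerCase().split("\\s+");
            for (int i = 0; i + 1 < words.length; i++) {
                String bigram = words[i] + " " + words[i + 1];
                Integer n = counts.get(bigram);
                counts.put(bigram, n == null ? 1 : n + 1);
            }
        }
        return counts;
    }

    // Pass two keeps only the bigrams whose count reaches the cutoff.
    public static Map<String, Integer> frequent(Map<String, Integer> counts, int cutoff) {
        Map<String, Integer> kept = new HashMap<String, Integer>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= cutoff) kept.put(e.getKey(), e.getValue());
        }
        return kept;
    }
}
```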
: Here, the 'revenue-info' is a repeating node, so we can have records like :
: Record 1
: ---financial-data
: --revenue-info
: year = 2000
: amount = 100
: --revenue-info
: year = 2001
: amount = 200
:
: Record 2
: ---financial-data
: --revenue-
Hi all.
We've been using Lucene to index our dynamic data structure and so far
Lucene has been flexible enough to accommodate our requirements.
Now we have this requirement about searching repeating fields, whose
implementation is not clear.
Our data records have a dynamic tree-like structu
Time to pull out the chalkboard. :-)
SIPs, at least in the Amazon sense, are usually found
by means of statistical independence testing. You
can find more info in Chris Manning's and Hinrich
Schuetze's statistical NLP book (heads-up: they're
now working on an IR book with more of a focus on
sear
How many documents are you getting in your result set? And how are you
dealing with those results? If you're looking at more than a hundred or so
using a Hits object, you are actually re-executing the query every 100
results or so you examine. This has been discussed several times, you might
want
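One way to avoid that re-execution, against the Lucene 2.0 API, is to collect everything in a single pass with a HitCollector instead of paging through Hits. This is a sketch; what you do with each doc id inside collect() depends on your application.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

// Collect all matching doc ids in one pass instead of iterating a Hits
// object, which re-runs the query roughly every 100 results examined.
public class CollectAllDocs {
    public static List collect(IndexSearcher searcher, Query query) throws IOException {
        final List docIds = new ArrayList();
        searcher.search(query, new HitCollector() {
            public void collect(int doc, float score) {
                docIds.add(new Integer(doc));
            }
        });
        return docIds;
    }
}
```

Note that collect() is called for every match, so this trades Hits' lazy paging for one full sweep; that is usually the right trade when you really do need all results.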
thats kinda what i was thinking. i'll just upload the correct jar to
my company's repository.
thanks.
On Jun 22, 2006, at 11:43 AM, Chris Hostetter wrote:
: http://www.ibiblio.org/maven2/org/apache/lucene/lucene-core/2.0.0/
:
: my classes won't compile against this jar as it doesn't contain any
: class files. there is a pom in the manifest directory.
I don't know much about maven, but that certainly doesn't look like a
valid lucene-core jar. Perha
Nader Akhnoukh wrote:
Yes, Chris is correct, the goal is to determine the most frequently
occurring
phrases in a document compared to the frequency of that phrase in the
index. So there are only output phrases, no inputs.
Also performance is not really an issue, this would take place on an
irre
I may be coming into this thread without knowing enough. I have implemented a
phrase filter, which indexes all token sequences that are 2 to N tokens long.
The n is defined in the constructor.
It takes a stopword Trie for input because the policy I used, based on a
published work I read, was that a
Yes, Chris is correct, the goal is to determine the most frequently occurring
phrases in a document compared to the frequency of that phrase in the
index. So there are only output phrases, no inputs.
Also performance is not really an issue, this would take place on an
irregular basis and could ru
Hi,
I have an index of 3 million documents. Document id is stored but not
indexed and document contents is indexed but not stored.
Searches are quite slow, but for each document I have a list of 50,000 or
so relevant documents. I would like Lucene to search only within these. I
can see I can restr
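One common way to restrict a search to a known subset is a custom Filter. A caveat for the setup described above: this only works if the id field is indexed (untokenized), whereas the poster's ids are stored but not indexed, so it would require reindexing. The field name "docId" below is an assumption, and the bits() signature matches the Lucene 2.0 Filter API.

```java
import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.Filter;

// Sketch of a Filter that limits a search to a known set of ids.
// Requires the id field ("docId" here, an assumed name) to be indexed
// untokenized, not merely stored.
public class IdSetFilter extends Filter {
    private final String[] ids;

    public IdSetFilter(String[] ids) { this.ids = ids; }

    public BitSet bits(IndexReader reader) throws IOException {
        BitSet bits = new BitSet(reader.maxDoc());
        for (int i = 0; i < ids.length; i++) {
            TermDocs td = reader.termDocs(new Term("docId", ids[i]));
            try {
                while (td.next()) {
                    bits.set(td.doc());  // allow this internal doc number
                }
            } finally {
                td.close();
            }
        }
        return bits;
    }
}
```

With 50,000 ids, building the BitSet is itself costly, so if the same subset is reused across queries, wrapping the filter in a CachingWrapperFilter should help.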
It can't be ignored. But look in JIRA, I believe there is a patch there that
changes the code so that two optimize() calls are not needed. If that works
for you, please let us know.
Otis
- Original Message
From: heritrix.lucene <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent
what up g.
trying to use the lucene-core-2.0.0.jar that is in the maven
repository at
http://www.ibiblio.org/maven2/org/apache/lucene/lucene-core/2.0.0/
my classes won't compile against this jar as it doesn't contain any
class files. there is a pom in the manifest directory.
was this int
Hey Thomas,
It looks like your index file(s) are being stored on a
remote file system. Is it possible that the network
connection fails sometimes during your
indexing/searching operation?
If that's not the issue, you mention that you're
creating your index file at the same time that you're
search
John Wang wrote:
Hi Xuefeng:
Can you please send me your htmlparser too?
Xuefeng, would it be possible to open source your parser?
Thanks
Michi
thanks
-John
On 6/21/06, Daniel Noll <[EMAIL PROTECTED]> wrote:
Simon Courtenage wrote:
> I also use htmlparser, which is rather good. I'
Hi Xuefeng:
Can you please send me your htmlparser too?
thanks
-John
On 6/21/06, Daniel Noll <[EMAIL PROTECTED]> wrote:
Simon Courtenage wrote:
> I also use htmlparser, which is rather good. I've had to customize it,
> though, to parse strings containing
> html source rather than accept
I didn't make too much progress, and kind of ended up dropping it.
One thing that I played with was creating multiple phrase indexes, one
each for 2, 3, 4, and 5 words. I wrote a tokenizer that would batch up
the words, so, for the input string:
The quick brown fox jumps over the slow lazy
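The batching idea above can be sketched in plain Java: for each phrase length n (2 through 5, one pass per phrase index), slide a window over the word stream and emit every n-word phrase. This is an illustrative reconstruction, not the original tokenizer.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the batching tokenizer: emit every phrase of exactly n
// consecutive words from the input, one pass per phrase index.
public class PhraseWindower {
    public static List<String> phrases(String text, int n) {
        String[] words = text.split("\\s+");
        List<String> out = new ArrayList<String>();
        // Slide an n-word window across the token stream.
        for (int i = 0; i + n <= words.length; i++) {
            StringBuilder sb = new StringBuilder(words[i]);
            for (int j = 1; j < n; j++) {
                sb.append(' ').append(words[i + j]);
            }
            out.add(sb.toString());
        }
        return out;
    }
}
```

For the example sentence, n = 2 yields "The quick", "quick brown", "brown fox", and so on; each n gets its own index so phrase frequency lookups become simple term lookups.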
On 6/22/06, karl wettin <[EMAIL PROTECTED]> wrote:
I tried to make a quick and dirty proof of concept, but noticed that no
matter what order TermDocs return the documents, the collector gets
them in ascending document number order.
TermDocs should also always return documents in ascending order for a
si
I'm creating my index file and at the same time I'm trying to run some
searches against it.
Sometimes I get this error message:
"\\tradluxstmp01\JavaIndex\tra\index_FR\_335.fnm (The system cannot find
the file specified)"
What should I do, or what is happening?
Perhaps for privacy reasons, so that only specific users should be able to
search the whole index.
Is there a best-practice approach to realize this?
Good point. But I still think you could get the same effect with less
complexity by including a "source" tag (to extend the example) and munging
But that doesn't solve my problem since I can't guarantee that articles are
added in a special order to the index.
However, it seems to work nicely using a float as the norm value.
/
Marcus
From: Paul Elschot [mailto:[EMAIL PROTECTED]
Sent: Wednesday 2006-06-21 19:32
T
On Wed, 2006-06-21 at 19:32 +0200, Paul Elschot wrote:
>
> > TermDocs in reversed chronological order
>
> There is no need to write extra code for that, the documents would be
> collected oldest first, newest last.
I tried to make a quick and dirty proof of concept, but noticed that no
matter wha
Chris Hostetter wrote:
I think either you misunderstood Nader's question or I did: I believe the
goal is to determine what the most frequently occurring phrases are -- not
determine how frequently a particular input phrase appears.
Isn't the latter a pre-requisite for the former ? ;)
Regardi
Searching the mailing list archives can be helpful for understanding new
concepts like this; in particular this is something that has been
discussed on java-dev...
http://www.nabble.com/forum/Search.jtp?forum=44&local=y&query=Lazy+Field
http://www.nabble.com/Lazy-Field-Loading-t1362158.html#a3649
: > I am trying to get the most frequently occurring phrases in a document and
: > in the index as a whole. The goal is compare the two to get something like
: > Amazon's SIPs.
: Other than indexing the phrases directly, you could use a SpanNearQuery
: over the words, use getSpans() on its SpanS
so how can it be ignored?
On 6/22/06, Mike Streeton <[EMAIL PROTECTED]> wrote:
From memory addIndexes() also does an optimization beforehand; this
might be what is taking the time.
Mike
www.ardentia.com the home of NetSearch
-Original Message-
From: heritrix.lucene [mailto:[EMAIL
hi,
>
> I'm hardly the lucene expert, but I don't think you can search just a
> portion of the index. But that's effectively what you're doing if you
> restrict the search to "son and.".
I think there is also the possibility to write a custom search filter
(org.apache.lucene.search.Filter), an
On Thursday 22 June 2006 01:33, Nader Akhnoukh wrote:
> Hi, I've looked through the archives and it looks like this question has
> been asked in one form or another a few times, but without a satisfactory
> solution.
>
> I am trying to get the most frequently occurring phrases in a document and
>