Customizing termFreq

2004-12-12 Thread Vikas Gupta
Hi developers,

I am indexing HTML documents in lucene as:

H1:text in H1 font
H2:text in H2 font
...
H6:text in H6 font
content:all the text

The problem is that query of a type
+(H1:xyz)
is getting scored with the termFreq of xyz in the H1 field, whereas I want
it to be scored using the termFreq of xyz in the entire document (i.e. the
content field).

Can you point me to how to achieve this?

I took a look at Similarity class. It does have a tf() function but it is
actually passed a termFreq value.

Thanks a lot.

PS: I am using lucene for a class project where I am trying to utilize
font information of HTML documents. I am boosting the scores for matches
in H6 field over matches in H5 and so on.



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexing HTML files give following message

2004-12-12 Thread Otis Gospodnetic
Hello,

This is probably due to some bad HTML.  The application you are using
is just a demo, and uses a JavaCC-based HTML parser, which may not be
resilient to invalid HTML.  For Lucene in Action we developed a little
extensible indexing framework, and for HTML indexing we used 2 tools to
handle HTML parsing: JTidy and NekoHTML.  Since the code for the book
is freely available... http://www.manning.com.  NekoHTML knows how
to deal with some bad HTML, which is why I'm suggesting it.
The indexing framework could come in handy for those working on various
'desktop search' applications (Roosster, LDesktop (if that's really
happening), Lucidity, etc.)

Otis


--- Hetan Shah [EMAIL PROTECTED] wrote:

 java org.apache.lucene.demo.IndexHTML -create -index
 /source/workarea/hs152827/newIndex ..
 adding ../0/10037.html
 adding ../0/10050.html
 adding ../0/1006132.html
 adding ../0/1013223.html
 Parse Aborted: Encountered \ at line 5, column 1.
 Was expecting one of:
 ArgName ...
 = ...
 TagEnd ...
 
 And then the indexing hangs on this line.  Earlier it used to go on
 and index the remaining pages in the directory.  Any idea why the
 indexer would stop at this error?
 
 Pointers are much needed and appreciated.
 -H
 
 



RE: HITCOLLECTOR+SCORE+DELIMA

2004-12-12 Thread Karthik N S

Hi Guys

Apologies..


So you're saying I have to build a Filter to collect all the scores within
the range [0.2f to 1.0f]?


So the API for this would be:

 Hits hits = search(Query query, Filter filterToGetScore)


 But while writing the Filter, the score again depends on Hits: score =
 hits.score(x);



 How do I solve this, or am I going about it the wrong way?


Any simple source example for the same would be greatly appreciated.  :)

Thx in advance



-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Friday, December 10, 2004 6:54 PM
To: Lucene Users List
Subject: Re: HITCOLLECTOR+SCORE+DELIMA


On Dec 10, 2004, at 7:39 AM, Karthik N S wrote:
 I am still in a dilemma about how to use the HitCollector for returning
 hits with scores between 0.2f and 1.0f,

 There is not a simple example for this, yet there's a lot of talk about
 its usage on the forum.

Unfortunately there isn't a clean way to stop a HitCollector - it will
simply collect all hits.

Also, scores are _not_ normalized when passed to a HitCollector, so you
may get scores > 1.0.  Hits, however, does normalize and you're
guaranteed that scores will be <= 1.0.  Hits are in descending score
order, so you may just want to use Hits and filter based on the score
provided by hits.score(i).

Erik
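Erik's suggestion can be sketched without Lucene at all: since Hits returns normalized scores in descending order, a [min, max] range filter can stop at the first score below the lower bound. The `ScoreRangeFilter` class below is an illustrative stand-in, with a float array in place of looping over `hits.score(i)`.

```java
// Self-contained sketch (not Lucene code): Hits returns normalized
// scores in descending order, so a [min, max] range filter can stop
// at the first score below the lower bound.  The float array stands
// in for iterating over hits.score(i).
import java.util.ArrayList;
import java.util.List;

public class ScoreRangeFilter {

    /** Indices whose score lies in [min, max], given descending scores. */
    public static List<Integer> collect(float[] scores, float min, float max) {
        List<Integer> kept = new ArrayList<Integer>();
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] < min) {
                break;  // descending order: no later hit can reach min
            }
            if (scores[i] <= max) {
                kept.add(i);
            }
        }
        return kept;
    }
}
```

The early `break` is what makes this cheaper than a HitCollector, which would visit every hit regardless.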





Re: Customizing termFreq

2004-12-12 Thread Vikas Gupta
Hoss,

 so why not query for +(content:xyz) .. or is the problem that you only
 want to get back docs with xyz in an H1, but you want the score based on
 the whole doc?

That is correct. I want to score documents which match +(H1:xyz) higher
than those which match +(H2:xyz), but the tf and idf computation for xyz
should be based on the entire text and not limited to the H1 or H2 fields.


 if that's the case, then construct a Filter with the requirement of
 (H1:xyz) and make your query (content:xyz)

I will take a look at this. Thanks a lot for your prompt response.

-Vikas



Re: Customizing termFreq

2004-12-12 Thread Chris Hostetter
: H1:text in H1 font
: H2:text in H2 font

: content:all the text
:
: The problem is that query of a type
: +(H1:xyz)
: is getting scored with the termFreq of xyz in the H1 field whereas I want
: it be scored using the termFreq of xyz in the entire document (i.e.
: content field)

so why not query for +(content:xyz) .. or is the problem that you only
want to get back docs with xyz in an H1, but you want the score based on
the whole doc?

if that's the case, then construct a Filter with the requirement of
(H1:xyz) and make your query (content:xyz)


-Hoss
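
Hoss's filter-then-score idea can be modeled with a small self-contained sketch. This is NOT the Lucene API: `FilteredScoring` and its maps are hypothetical stand-ins showing the principle (match on H1, score on content). In Lucene 1.4 itself this would roughly be a QueryFilter wrapping a TermQuery on H1, passed to IndexSearcher.search() together with a query on the content field; check your version's Javadoc for the exact signatures.

```java
// Illustrative sketch only -- NOT the Lucene API.  It models Hoss's
// suggestion: restrict matches to documents whose H1 field contains
// the term, but score each match by the term's frequency in the
// whole content field.
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class FilteredScoring {

    /** Returns docId -> termFreq-in-content for docs whose H1 holds the term. */
    public static Map<String, Integer> search(Map<String, String> h1Fields,
                                              Map<String, String> contentFields,
                                              String term) {
        Map<String, Integer> results = new LinkedHashMap<String, Integer>();
        for (String docId : contentFields.keySet()) {
            String h1 = h1Fields.get(docId);
            // The "filter": only docs with the term in their H1 field match.
            if (h1 == null || !Arrays.asList(h1.split("\\s+")).contains(term)) {
                continue;
            }
            // The "score": term frequency over the entire content field.
            int tf = 0;
            for (String token : contentFields.get(docId).split("\\s+")) {
                if (token.equals(term)) {
                    tf++;
                }
            }
            results.put(docId, tf);
        }
        return results;
    }
}
```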



Re: Incremental Search experiment with Lucene, sort of like the new Google Suggestion page

2004-12-12 Thread David Spencer
Chris Lamprecht wrote:
 Very cool, thanks for posting this!

 Google's feature doesn't seem to do a search on every keystroke
 necessarily.  Instead, it waits until you haven't typed a character
 for a short period (I'm guessing about 100 or 150 milliseconds).  So
 if you type fast, it doesn't hit the server until you pause.  There
 are some more detailed postings on slashdot about how it works.

Thx again for the tip; I updated my experiment
(http://www.searchmorph.com/kat/isearch.html, as per below) to use a
150ms delay to avoid some needless searches...TBD is more intelligent
guidance or suggestions for the user.
On Fri, 10 Dec 2004 16:36:27 -0800, David Spencer
[EMAIL PROTECTED] wrote:
Google just came out with a page that gives you feedback as to how many
pages will match your query and variations on it:
http://www.google.com/webhp?complete=1&hl=en
I had an unexposed experiment I had done with Lucene a few months ago
that this has inspired me to expose - it's not the same, but it's
similar in that as you type in a query you're given *immediate* feedback
as to how many pages match.
Try it here: http://www.searchmorph.com/kat/isearch.html
This is my SearchMorph site which has an index of ~90k pages of open
source javadoc packages.
As you type in a query, on every keystroke it does at least one Lucene
search to show results in the bottom part of the page.
It also gives spelling corrections (using my NGramSpeller
contribution) and also suggests popular tokens that start the same way
as your search query.
For one way to see corrections in action, type in rollback character
by character (don't do a cut and paste).
Note that:
-- this is not how the Google page works - just similar to it
-- I do single word suggestions while google does the more useful whole
phrase suggestions (TBD I'll try to copy them)
-- They do lots of javascript magic, whereas I use old school frames mostly
-- this is relatively expensive, as it does 1 query per character, and
when it's doing spelling correction there is even more work going on
-- this is just an experiment and the page may be unstable as I fool w/ it
What's nice is when you get used to immediate results, going back to the
batch way of searching seems backward, slow, and old fashioned.
There are too many idle CPUs in the world - this is one way to keep them
busier :)
-- Dave
PS Weblog entry updated too:
http://www.searchmorph.com/weblog/index.php?id=26
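
The pause-before-searching trick Chris describes (and David adopted) can be sketched as a simple timestamp check. The `Debouncer` class and millisecond timestamps below are assumptions for the sketch; neither site's actual code is shown here.

```java
// Illustrative sketch of the keystroke-debounce idea: only fire a
// search once 150ms have elapsed since the last keystroke, so fast
// typists don't hit the server on every character.
public class Debouncer {

    public static final long DELAY_MS = 150;

    private long lastKeystrokeMs = Long.MIN_VALUE;

    /** Record a keystroke at the given time (milliseconds). */
    public void keystroke(long nowMs) {
        lastKeystrokeMs = nowMs;
    }

    /** True once the user has paused for at least DELAY_MS. */
    public boolean shouldSearch(long nowMs) {
        return lastKeystrokeMs != Long.MIN_VALUE
                && nowMs - lastKeystrokeMs >= DELAY_MS;
    }
}
```

In a browser the same effect usually comes from resetting a timer on each keystroke; passing timestamps explicitly just keeps the sketch deterministic.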


Re: Finding unused segment files?

2004-12-12 Thread Otis Gospodnetic
Hello George,

Here is a quick hack (with a few TODOs).  I only tested it a bit, so
the actual delete calls are still commented out.  If this works for
you, and especially if you take care of TODOs, I may put this in the
Lucene Sandbox.

Otis
P.S.
Usage example showing how the tool found some unused segments (this was
caused by a bug in one of the earlier 1.4 versions of Lucene).

[EMAIL PROTECTED] java]$ java org.apache.lucene.index.SegmentPurger
/simpy/users/1/index
Candidate non-Lucene file found: _1b2.del
Candidate unused Lucene file found: _1b2.cfs
Candidate unused Lucene file found: _1bm.cfs
Candidate unused Lucene file found: _1c6.cfs
Candidate unused Lucene file found: _1cq.cfs
Candidate unused Lucene file found: _1da.cfs
Candidate unused Lucene file found: _1du.cfs
Candidate unused Lucene file found: _1ee.cfs
Candidate unused Lucene file found: _1ey.cfs
[EMAIL PROTECTED] java]$
[EMAIL PROTECTED] java]$ strings /simpy/users/1/index/segments
_3o0
[EMAIL PROTECTED] java]$ ls -al /simpy/users/1/index/
total 647
drwxrwsr-x  2 otis simpy    1024 Dec  7 14:39 .
drwxrwsr-x  3 otis simpy    1024 Sep 16 20:39 ..
-rw-rw-r--  1 otis simpy  212815 Nov 17 18:36 _1b2.cfs
-rw-rw-r--  1 otis simpy     104 Nov 17 18:40 _1b2.del
-rw-rw-r--  1 otis simpy    3380 Nov 17 18:40 _1bm.cfs
-rw-rw-r--  1 otis simpy    3533 Nov 17 18:40 _1c6.cfs
-rw-rw-r--  1 otis simpy    4774 Nov 17 18:40 _1cq.cfs
-rw-rw-r--  1 otis simpy    3389 Nov 17 18:40 _1da.cfs
-rw-rw-r--  1 otis simpy    3809 Nov 17 18:40 _1du.cfs
-rw-rw-r--  1 otis simpy    3423 Nov 17 18:40 _1ee.cfs
-rw-rw-r--  1 otis simpy    4016 Nov 17 18:40 _1ey.cfs
-rw-rw-r--  1 otis simpy  410299 Dec  7 14:39 _3o0.cfs
-rw-rw-r--  1 otis simpy       4 Dec  7 14:39 deletable
-rw-rw-r--  1 otis simpy      29 Dec  7 14:39 segments
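
The detection step shown in the listing above can be illustrated with a self-contained sketch: read the live segment names from the segments file (here passed in directly), then flag any `_*.cfs` whose base name is not live. The class and method names are hypothetical; the real SegmentPurger also has to handle .del files and the multi-file (non-compound) extensions.

```java
// Illustrative sketch: given the live segment names from the
// "segments" file, flag compound files that no live segment owns.
// In the listing above only _3o0 is live, so every older .cfs file
// is a removal candidate.
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class UnusedSegmentFinder {

    /** Returns .cfs files whose segment name is absent from the live set. */
    public static List<String> findUnused(Set<String> liveSegments, String[] files) {
        List<String> candidates = new ArrayList<String>();
        for (String file : files) {
            if (file.startsWith("_") && file.endsWith(".cfs")) {
                String base = file.substring(0, file.length() - ".cfs".length());
                if (!liveSegments.contains(base)) {
                    candidates.add(file);  // e.g. _1b2.cfs when only _3o0 is live
                }
            }
        }
        return candidates;
    }
}
```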


--- [EMAIL PROTECTED] wrote:

 Hello all.
 
  
 
 I recently ran into a problem where errors during indexing or
 optimization
 (perhaps related to running out of disk space) left me with a working
 index
 in a directory but with additional segment files (partial) that were
 unneeded.  The solution for finding the ~40 files to keep out of the
 ~900
 files in the directory amounted to dumping the segments file and
 noting that
 only 5 segments were in fact live.  The index is a non-compound
 index
 using FSDirectory.
 
  
 
 Is there (or would it be possible to add (and I'd be willing to
 submit code
 if it made sense to people)) some sort of interrogation on the index
 of what
 files belonged to it?  I looked first as FSDirectory itself thinking
 that
 it's list() method should return the subset of index-related files
 but
 looking deeper it looks like Directory is at a lower level
 abstracting
 simple I/O and thus wouldn't know.
 
  
 
 So any thoughts?  Would it make sense to have a form of clean on
 IndexWriter()?  I hesitate since it seems there isn't a charter that
 only
 Lucene files could exist in the directory thus what is ideal for my
 application (since I know I won't mingle other files) might not be
 ideal for
 all.  Would it be fair to look for Lucene known extensions and file
 naming
 signatures to identify unused files that might be failed or dead
 segments?
 
  
 
 Thanks,
 
 -George
 
 package org.apache.lucene.index;

import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.FSDirectory;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Iterator;
import java.io.File;


/**
 * A tool that peeks into Lucene index directories and removes
 * unwanted files.  In its more radical mode, this tool can be used to
 * remove all non-Lucene index files from a directory.  The other
 * option is to remove unused Lucene segment files, should the index
 * directory get polluted.
 *
 * TODO: this tool should really lock the directory for writing before
 * removing any Lucene segment files, otherwise this tool itself may
 * corrupt the index.
 *
 * @author Otis Gospodnetic
 * @version $Id$
 */
public class SegmentPurger
{
    // TODO: copied from SegmentMerger - should probably be made public
    // static final, to make it reusable
    // TODO: add .del extension

    // File extensions of old-style index files
    public static final String MULTIFILE_EXTENSIONS[] = new String[] {
        "fnm", "frq", "prx", "fdx", "fdt", "tii", "tis"
    };
    public static final String VECTOR_EXTENSIONS[] = new String[] {
        "tvx", "tvd", "tvf"
    };
    public static final String COMPOUNDFILE_EXTENSIONS[] = new String[] {
        "cfs"
    };
    public static final String INDEX_FILES[] = new String[] {
        "segments", "deletable"
    };

    public static final String[][] SEGMENT_EXTENSIONS = new String[][] {
        MULTIFILE_EXTENSIONS, COMPOUNDFILE_EXTENSIONS, VECTOR_EXTENSIONS
    };

    /** The file format version, a negative number. */
    /* Works since counter, the old