Re: auto-generate uid?

2004-11-22 Thread Terry Steichen
Not exactly sure what you're trying to do.  You can easily generate a number 
when you index each Document and insert it in a uid field (which is, BTW, what 
I do), and if you base it on a timestamp plus some characteristic of the 
document (which is also what I do), it should always be unique.  As you add 
more documents, they will each get their own unique id.  When you delete 
documents and optimize, these ids won't be affected.
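
For what it's worth, a minimal sketch of that indexing-time approach, assuming 
the Lucene 1.x API (Field.Keyword gives a stored, untokenized field); the title 
hash is just one hypothetical "characteristic of the document":

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // Sketch: timestamp plus a per-document characteristic => unique uid.
    static Document makeDoc(String title, String body) {
        Document doc = new Document();
        String uid = System.currentTimeMillis() + "-"
                   + Integer.toHexString(title.hashCode());
        doc.add(Field.Keyword("uid", uid));  // stored, indexed, not tokenized
        doc.add(Field.Text("title", title));
        doc.add(Field.Text("body", body));
        return doc;
    }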

However, in your subsequent clarification, you indicated you already had a 
unique id, and want to find the maximum value.  So why did you say you want one 
auto-generated?
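
If it's the maximum existing uid you're after, there's no maxTerm() in the API, 
but terms are kept in sorted order, so you can seek a TermEnum to the field and 
keep the last matching term.  A sketch against the Lucene 1.x API; note it only 
yields the numerically highest uid if the values are zero-padded so that 
lexicographic order matches numeric order:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    // Sketch: last (highest) term of a field, in term sort order.
    static String maxTerm(IndexReader reader, String field) throws IOException {
        TermEnum terms = reader.terms(new Term(field, ""));
        String max = null;
        try {
            while (terms.term() != null && field.equals(terms.term().field())) {
                max = terms.term().text();
                if (!terms.next()) break;
            }
        } finally {
            terms.close();
        }
        return max;
    }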

Terry
  - Original Message - 
  From: aurora 
  To: [EMAIL PROTECTED] 
  Sent: Monday, November 22, 2004 4:39 PM
  Subject: Re: auto-generate uid?


  Just to clarify. I have a Field 'uid' whose value is a unique integer. I  
  use it as a key to the document stored externally. I don't mean Lucene's  
  internal document number.

  I was wondering if there is a method to query the highest value of a field,  
  perhaps something like:

 IndexReader.maxTerm('uid')


  > What would the purpose of an auto-generated UID be?
  >
  > But no, Lucene does not generate UID's for you.  Documents are numbered  
  > internally by their insertion order.  This number changes, however, when  
  > documents are deleted in the middle and the index is optimized.
  >
  > Erik
  >
  > On Nov 22, 2004, at 1:50 PM, aurora wrote:
  >
  >> Is there a way to auto-generate uid in Lucene? Even just a way to  
  >> query the highest uid and let the application add one to it would do.
  >>
  >> Thanks.
  >>
  >>
  >> -
  >> To unsubscribe, e-mail: [EMAIL PROTECTED]
  >> For additional commands, e-mail: [EMAIL PROTECTED]



  -- 
  Using Opera's revolutionary e-mail client: http://www.opera.com/m2/


  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]



Re: disadvantages

2004-11-21 Thread Terry Steichen
Compared to what?
  - Original Message - 
  From: Miguel Angel 
  To: [EMAIL PROTECTED] 
  Sent: Sunday, November 21, 2004 12:00 PM
  Subject: disadvantages


  What are the disadvantages of Lucene? 
  -- 
  Miguel Angel Angeles R.
  Asesoria en Conectividad y Servidores
  Telf. 97451277

  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene external field storage contribution.

2004-11-09 Thread Terry Steichen
Kevin,

Sorry for the delay in replying.  I think your idea for an external field 
storage mechanism is excellent.  I'd love to see it, and if I can, will be 
willing to help make that happen.

Regards,

Terry
  - Original Message - 
  From: Kevin A. Burton 
  To: Lucene Users List 
  Sent: Sunday, November 07, 2004 4:47 PM
  Subject: Lucene external field storage contribution.


  About 3 months ago I developed an external storage engine which ties into 
  Lucene. 

  I'd like to discuss making a contribution so that this is integrated 
  into a future version of Lucene.

  I'm going to paste my original PROPOSAL in this email. 

  There wasn't a ton of feedback first time around but I figure squeaky 
  wheel gets the grease...


  >>
  >>
  >> I created this proposal because we need this fixed at work. I want to 
  >> go ahead and work on a vertical fix for our version of lucene and then 
  >> submit this back to Jakarta.
  >> There seems to be a lot of interest here and I wanted to get feedback 
  >> from the list before moving forward ...
  >>
  >> Should I put this in the wiki?!
  >>
  >> Kevin
  >>
  >> ** OVERVIEW **
  >>
  >> Currently Lucene supports 'stored fields', where the content of these 
  >> fields is
  >> kept within the Lucene index for use in the future.
  >>
  >> While acceptable for small indexes, larger amounts of stored fields 
  >> prevent:
  >>
  >> - Fast index merges since the full content has to be continually merged.
  >>
  >> - Storing the indexes in memory (since a LOT of memory would be 
  >> required and
  >> this is cost prohibitive)
  >>
  >> - Fast queries since block caching can't be used on the index data.
  >>
  >> For example in our current setup our index size is 20G.  Nearly 90% of 
  >> this is
  >> content.  If we could store the content outside of Lucene our merges and
  >> searches would be MUCH faster.  If we could store the index in MEMORY 
  >> this could
  >> be orders of magnitude faster.
  >>
  >> ** PROPOSAL **
  >>
  >> Provide an external field storage mechanism which supports legacy indexes
  >> without modification.  Content is stored in a "content segment". The only
  >> changes would be a field with 3 (or 4 if checksum is enabled) values.
  >>
  >> - CS_SEGMENT
  >>
  >>   Logical ID of the content segment.  This is an integer value.  
  >> There is
  >>   a global Lucene property named CS_ROOT which stores all the 
  >> content.
  >>   The segments are just flat files with pointers.  Segments are 
  >> broken
  >>   into logical pieces by time and size.  Usually 100M of content 
  >> would be
  >>   in one segment.
  >>
  >> - CS_OFFSET
  >>
  >>   The byte offset of the field.
  >>
  >> - CS_LENGTH
  >>
  >>   The length of the field.
  >>
  >> - CS_CHECKSUM
  >>
  >>   Optional checksum to verify that the content is correct when 
  >> fetched
  >>   from the index.
  >>
  >> - The field value here would be exactly 'N:O:L' where N is the segment 
  >> number,
  >>   O is the offset, and L is the length.  O and L are 64bit values.  N 
  >> is a 32
  >>   bit value (though 64bit wouldn't really hurt).
  >>
  >> This mechanism allows for the external storage of any named field.
  >>  
  >> CS_OFFSET, and CS_LENGTH allow use with RandomAccessFile and new NIO 
  >> code for
  >> efficient content lookup.  (Though filehandle caching should probably 
  >> be used).
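
  A sketch (not from the proposal itself) of how a reader might resolve one of
  these 'N:O:L' pointers with RandomAccessFile; the CS_ROOT layout and the
  "segment-N" file naming are illustrative assumptions:

      import java.io.File;
      import java.io.IOException;
      import java.io.RandomAccessFile;

      // Resolve a stored "N:O:L" (segment:offset:length) pointer.
      static byte[] readExternalField(File csRoot, String pointer)
              throws IOException {
          String[] parts = pointer.split(":");
          int segment = Integer.parseInt(parts[0]);
          long offset = Long.parseLong(parts[1]);
          int length = Integer.parseInt(parts[2]);
          RandomAccessFile raf =
              new RandomAccessFile(new File(csRoot, "segment-" + segment), "r");
          try {
              raf.seek(offset);
              byte[] buf = new byte[length];
              raf.readFully(buf);
              return buf;
          } finally {
              raf.close();  // in practice, cache filehandles as noted above
          }
      }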
  >>
  >> Since content is broken into logical 100M segments the underlying 
  >> filesystem can
  >> organize the file into contiguous blocks for efficient non-fragmented 
  >> lookup.
  >>
  >> File manipulation is easy and indexes can be merged by simply 
  >> concatenating the
  >> second file to the end of the first.  (Though the segment, offset, and 
  >> length
  >> need to be updated).  (FIXME: I think I need to think about this more 
  >> since I
  >> will have < 100M per syncs)
  >>
  >> Supporting full unicode is important.  Full java.lang.String storage 
  >> is used
  >> with String.getBytes() so we should be able to avoid unicode issues.  
  >> If Java
  >> has a correct java.lang.String representation it's possible to easily add 
  >> unicode
  >> support just by serializing the byte representation. (Note that the 
  >> JDK says
  >> that the DEFAULT system char encoding is used so if this is ever 
  >> changed it
  >> might break the index)
  >>
  >> While Linux and modern versions of Windows (not sure about OSX) 
  >> support 64bit
  >> filesystems the 4G storage boundary of 32bit filesystems (ext2 is an 
  >> example)
  >> is an issue.  Using smaller indexes can prevent this but eventually 
  >> segment
  >> lookup in the filesystem will be slow.  This will only happen within 
  >> terabyte
  >> storage systems so hopefully the developer has migrated to another 
  >> (modern)
  >> filesystem such as XFS.
  >>
  >> ** FEATURES **
  >>
  >>   - Must be able to replicate indexes easily to other hosts.
  >>
  >>   - Adding content to the inde

Re: BooleanQuery - TooManyClauses

2004-10-26 Thread Terry Steichen
I think what Erik's asking is whether you can live with expressing your indexed date 
in the form of YYYYMMDD, without the hour and minute extension.  That will sharply 
reduce the number of range query expansion terms.  If you're using the timestamp as a 
unique identifier, you might consider creating two fields, one for the unique 
identifier (YYYYMMDDHHmmssZ) and one for the date (YYYYMMDD), and only use the range 
on the date field (not on the timestamp field).
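
A sketch of that two-field scheme against the Lucene 1.x API (the field names 
here are illustrative, not from the thread):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    Document doc = new Document();
    // Unique identifier: full timestamp, never used in range queries.
    doc.add(Field.Keyword("timestamp", "20041026114300Z"));
    // Date only (YYYYMMDD): ranges over this stay small.
    doc.add(Field.Keyword("pub_date", "20041026"));

A year-long query then becomes pub_date:[20040101 TO 20041231], which expands 
to at most ~366 terms.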

Regards,

Terry
  - Original Message - 
  From: Angelov, Rossen 
  To: 'Lucene Users List' 
  Sent: Tuesday, October 26, 2004 11:43 AM
  Subject: RE: BooleanQuery - TooManyClauses 


  >
  >On Oct 25, 2004, at 6:35 PM, Angelov, Rossen wrote:
  >> Why there is a limit on the number of clauses? and is there any harm in
  >> setting MaxClauseCount to Integer.MAX_VALUE?
  >
  >The harm is in performance and resource utilization.  Rather than do 
  >this, though, read on...
  >
  >> I'm using a Range Query on a field that represents dates and getting
  >> BooleanQuery$TooManyClauses exception.
  >> This is the query -  +/article/createddateiso8601:[2003010100 TO
  >> 2003123199]
  >
  >Do you really need to do ranges down to that time level?  Or are you 
  >really just concerned with date?  If you indexed using YYYYMMDD 
  >instead, there would only be a maximum of 365 terms in that range, 
  >whereas you've got zillions (ok, I was too lazy to do the math!  But 
  >far more than 1,024).

  I need to do range searches. They are part of the requirements and even
  worse, the range can be as big as up to 10 years for now. It will get
  bigger. I'm indexing using YYYYMMDDHHmmssZ format and as you said there will
  be more than just 365 terms per year. This number changes every day as new
  documents are indexed daily. The only limit I can see is the number of
  documents that were indexed. I guess maxClauseCount can't be more than the
  number of indexed documents.

  >I recommend changing how you index dates, or at least use a different 
  >field for queries that do not need to concern themselves with the 
  >timestamp aspect.

  What do you mean change how the dates are indexed? By the way this field is
  indexed as a string.

  >
  > Erik
  >
  >

  Ross

  "This communication is intended solely for the addressee and is
  confidential and not for third party unauthorized distribution."



Re: Multisearcher question

2004-10-12 Thread Terry Steichen
I think what Sreedhar is asking for is the capability to form a "join" across multiple 
indices - and if so, I could sure use that capability myself.  However, I think 
Lucene's logic focuses only on a single query, so I doubt if that's easily done.

  - Original Message - 
  From: Otis Gospodnetic 
  To: Lucene Users List 
  Sent: Tuesday, October 12, 2004 9:04 AM
  Subject: Re: Multisearcher question


  Hello Sreedhar,

  This is the expected behaviour.  The query is run against each index,
  and it won't have any matches in either index, because neither index
  has both fields.

  Otis
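
  As an aside, the code quoted below also won't compile as posted, because the
  Searcher[] array and the MultiSearcher share the name "searcher".  A minimal
  corrected sketch (paths are placeholders), assuming the Lucene 1.x API:

      import org.apache.lucene.search.IndexSearcher;
      import org.apache.lucene.search.MultiSearcher;
      import org.apache.lucene.search.Searchable;

      Searchable[] searchers = new Searchable[] {
          new IndexSearcher("/path/to/index_a"),  // placeholder path
          new IndexSearcher("/path/to/index_b"),  // placeholder path
      };
      MultiSearcher multi = new MultiSearcher(searchers);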

  --- "Sreedhar, Dantam" <[EMAIL PROTECTED]> wrote:

  > Hi,
  > 
  > Index side information:
  > 
  > No. of indexes: Two (to explain better I call these as index_a and
  > index_b).
  > 
  > Fields in index_a: x and y.
  > Fields in index_b: y and z.
  > 
  > I have written a multisearch code like this.
  > 
  > Searcher search_a = new IndexSearcher(LOCATION_OF_INDEX_A);
  > Searcher search_b = new IndexSearcher(LOCATION_OF_INDEX_B);
  > Searcher[] searcher = new Searcher[2];
  > searcher[0] = search_a;
  > searcher[1] = search_b;
  > MultiSearcher searcher = new MultiSearcher(searcher);
  > 
  > I am getting the following results,
  > 
  > x:<term> - WORKS
  > x:<term> AND y:<term> - WORKS
  > x:<term> AND z:<term> - DOESN'T WORK
  > 
  > Is this expected behavior?
  > 
  > My question is, Can MultiSearcher be used to search on indexes with
  > different fields? If yes, could you please correct the above code.
  > 
  > Thanks,
  > -Sreedhar
  > 
  > 
  > -
  > To unsubscribe, e-mail: [EMAIL PROTECTED]
  > For additional commands, e-mail: [EMAIL PROTECTED]
  > 
  > 


  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]



Nutch vs Lucene

2004-09-16 Thread Terry Steichen
So Nutch uses (but doesn't enhance) Lucene?  Or, does it enhance Lucene in its ability 
to operate in a distributed fashion?

Regards,

Terry

PS: I'm aware of Doug's involvement in both - which is partly why I'm puzzled.

  - Original Message - 
  From: Otis Gospodnetic 
  To: Lucene Users List 
  Sent: Thursday, September 16, 2004 8:21 PM
  Subject: Re: Concurent operations with Lucene


  Nutch is a robust, multi-threaded Java web crawler and a (distributed)
  search engine.
  Nutch uses Lucene to index web pages and search the resulting indices.
  Doug Cutting is the padre of both Nutch and Lucene.

  Otis


  --- Terry Steichen <[EMAIL PROTECTED]> wrote:

  > Otis,
  > 
  > What's the relationship between Nutch and Lucene?
  > 
  > Terry
  >   - Original Message - 
  >   From: Otis Gospodnetic 
  >   To: Lucene Users List 
  >   Sent: Wednesday, September 15, 2004 7:29 AM
  >   Subject: Re: Concurent operations with Lucene
  > 
  > 
  >   Hello
  > 
  >   Only 1 process can modify (add/delete) an index at a time.
  >   Have you seen Nutch (http://nutch.org/)?
  > 
  >   Otis
  > 
  >   --- Daniel CHAN <[EMAIL PROTECTED]> wrote:
  > 
  >   > Hi,
  >   > 
  >   > I'm currently developing a search engine for a few websites and
  >   > would
  >   > like to use Lucene to do so. After reading some docs, a post on
  > jGuru
  >   > states that some concurrent operations are forbidden with Lucene
  >   > (http://www.jguru.com/faq/view.jsp?EID=913302). However, the post
  >   > dated from 2 years ago.
  >   > 
  >   > What I would like to know: is Lucene able to handle query
  >   > concurrently
  >   > with delete operation ? (You can check the table on the jGuru
  > page
  >   > and
  >   > the posts at the bottom).
  >   > 
  >   > Cheers
  >   > 
  >   > -- 
  >   > Daniel CHAN <[EMAIL PROTECTED]>
  >   > Free Software supporter
  >   > GnuPG : FFEC 70DD 9B2D D10A E161 79B5 3EDB CB9B A3C3 F6F3
  >   > 
  >   > 
  > 
  >   > ATTACHMENT part 2 application/pgp-signature name=signature.asc
  > 
  > 
  > 
  >  
  > -
  >   To unsubscribe, e-mail: [EMAIL PROTECTED]
  >   For additional commands, e-mail:
  > [EMAIL PROTECTED]
  > 
  > 


  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]



Re: Term highlighting and Term vector patch

2004-09-16 Thread Terry Steichen
Christoph,

Just curious - how are you currently using Term Vectors?  They seem to be a neat 
feature with lots of future promise, but I'm not sure how to best use them now.

Regards,

Terry
  - Original Message - 
  From: Christoph Goller 
  To: Lucene Developers List 
  Sent: Thursday, September 16, 2004 5:01 AM
  Subject: Re: Term highlighting and Term vector patch


  Hi Grant,

  I'll try to look into your latest code by the end of September but I probably
  won't find time earlier. I am using the current TermVectors very successfully.
  Thanks for the excellent code.



Re: Concurent operations with Lucene

2004-09-15 Thread Terry Steichen
Otis,

What's the relationship between Nutch and Lucene?

Terry
  - Original Message - 
  From: Otis Gospodnetic 
  To: Lucene Users List 
  Sent: Wednesday, September 15, 2004 7:29 AM
  Subject: Re: Concurent operations with Lucene


  Hello

  Only 1 process can modify (add/delete) an index at a time.
  Have you seen Nutch (http://nutch.org/)?

  Otis

  --- Daniel CHAN <[EMAIL PROTECTED]> wrote:

  > Hi,
  > 
  > I'm currently developing a search engine for a few websites and
  > would
  > like to use Lucene to do so. After reading some docs, a post on jGuru
  > states that some concurrent operations are forbidden with Lucene
  > (http://www.jguru.com/faq/view.jsp?EID=913302). However, the post
  > dated from 2 years ago.
  > 
  > What I would like to know: is Lucene able to handle query
  > concurrently
  > with delete operation ? (You can check the table on the jGuru page
  > and
  > the posts at the bottom).
  > 
  > Cheers
  > 
  > -- 
  > Daniel CHAN <[EMAIL PROTECTED]>
  > Free Software supporter
  > GnuPG : FFEC 70DD 9B2D D10A E161 79B5 3EDB CB9B A3C3 F6F3
  > 
  > 

  > ATTACHMENT part 2 application/pgp-signature name=signature.asc



  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene Book

2004-09-07 Thread Terry Steichen
Jeez, Erik!  Where's your sense of public spirit ;-)

Terry

PS: Glad to hear you're (finally!) nearing publication.  

  - Original Message - 
  From: Erik Hatcher 
  To: Lucene Users List 
  Sent: Tuesday, September 07, 2004 6:43 AM
  Subject: Re: Lucene Book


  On Sep 7, 2004, at 3:00 AM, [EMAIL PROTECTED] wrote:
  > I am new to Lucene. Can anyone guide me from where i can download free
  > Lucene book.

  Free?!

  http://www.manning.com/hatcher2 is the book Otis and I have spent the 
  last year laboring on.  It has been a long hard effort that is about to 
  come to fruition.  Lucene in Action is in copy/tech editing right now 
  and will be pushed into production very shortly.  There will be some 
  chapters, as always with Manning, available for free download once the 
  book has been typeset (probably even before physical copies are 
  available).  We have not decided which chapters we'll make available 
  for free yet.

  I hope a few folks buy it - it would be a shame for my kids to go 
  without food ;)

  Erik


  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene Search Applet

2004-08-18 Thread Terry Steichen
I suspect it has to do with the security restrictions of the applet, 'cause it doesn't 
appear to be finding your Lucene jar file.  Also, regarding the lock files, I believe 
you can disable the locking stuff just for purposes like yours (read-only index).
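
If memory serves, that's done with a system property that must be set before 
any index is opened; verify the property name against your Lucene version:

    // Assumed 1.3/1.4-era behavior: FSDirectory reads this property.
    // Only safe for read-only indexes (e.g. on CD-ROM).
    System.setProperty("disableLuceneLocks", "true");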

Regards,

Terry
  - Original Message - 
  From: Simon mcIlwaine 
  To: Lucene Users List 
  Sent: Wednesday, August 18, 2004 11:03 AM
  Subject: Lucene Search Applet


  I'm developing a Lucene CD-ROM based search which will search HTML pages on CD-ROM, 
using an applet as the UI. I know that there's a problem with lock files and also 
security restrictions on applets, so I am using the RAMDirectory. I have it working in 
a Swing application; however, when I put it into an applet it gives me problems. It 
compiles, but when I go to run the applet I get the error below. Can anyone help? 
Thanks in advance.
  Simon

  Error:

  java.lang.NoClassDefFoundError: org/apache/lucene/store/Directory
  at java.lang.Class.getDeclaredConstructors0(Native Method)
  at java.lang.Class.privateGetDeclaredConstructors(Class.java:1610)
  at java.lang.Class.getConstructor0(Class.java:1922)
  at java.lang.Class.newInstance0(Class.java:278)
  at java.lang.Class.newInstance(Class.java:261)
  at sun.applet.AppletPanel.createApplet(AppletPanel.java:617)
  at sun.applet.AppletPanel.runLoader(AppletPanel.java:546)
  at sun.applet.AppletPanel.run(AppletPanel.java:298)
  at java.lang.Thread.run(Thread.java:534)

  Code:

  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.TermQuery;
  import org.apache.lucene.store.RAMDirectory;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.Hits;
  import java.awt.*;
  import java.awt.event.*;
  import javax.swing.*;
  import java.io.*;

  public class MemorialApp2 extends JApplet implements ActionListener {

      JLabel prompt;
      JTextField input;
      JButton search;
      JPanel panel;
      String indexDir = "C:/Java/lucene/index-list";
      private static RAMDirectory idx;

      public void init() {
          Container cp = getContentPane();
          panel = new JPanel();
          panel.setLayout(new FlowLayout(FlowLayout.CENTER, 4, 4));
          prompt = new JLabel("Keyword search:");
          input = new JTextField("", 20);
          search = new JButton("Search");
          search.addActionListener(this);
          panel.add(prompt);
          panel.add(input);
          panel.add(search);
          cp.add(panel);
      }

      public void actionPerformed(ActionEvent e) {
          if (e.getSource() == search) {
              String surname = input.getText();
              try {
                  findSurname(indexDir, surname);
              } catch (Exception ex) {
                  System.err.println(ex);
              }
          }
      }

      public static void findSurname(String indexDir, String surname) throws Exception {
          idx = new RAMDirectory(indexDir);
          IndexSearcher searcher = new IndexSearcher(idx);
          Query query = new TermQuery(new Term("surname", surname));
          Hits hits = searcher.search(query);
          for (int i = 0; i < hits.length(); i++) {
              //Document doc = hits.doc(i);
              System.out.println("Surname: " + hits.doc(i).get("surname"));
          }
      }
  }


Re: Negative Boost

2004-08-04 Thread Terry Steichen
Well, I'm not too confident of my JavaCC skills, and when I've messed around with this 
stuff in the past, I sometimes ended up inadvertently creating problems in other areas 
of the query syntax. 

But if, in the future, I or someone else took on this task of enhancing QueryParser, 
I'd like to be assured that the underlying Lucene engine will accept and support 
negative boosting.  Is that the case?

Regards,

Terry

  - Original Message - 
  From: Erik Hatcher 
  To: Lucene Users List 
  Sent: Wednesday, August 04, 2004 9:12 AM
  Subject: Re: Negative Boost


  On Aug 4, 2004, at 7:19 AM, Terry Steichen wrote:
  > I can't get negative boosts to work with QueryParser.  Is it possible 
  > to do so?

  Closer inspection on the parsing:

  <Boost> TOKEN : {
    <NUMBER: (<_NUM_CHAR>)+ ( "." (<_NUM_CHAR>)+ )? > : DEFAULT
  }

  where

 <#_NUM_CHAR:   ["0"-"9"] >

  So, no, negative boosts don't appear possible with QueryParser 
  currently.  I have no objections if you'd like to enhance the grammar 
  to allow for it (provided sufficient unit tests, of course).

  Erik


  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]



Re: Negative Boost

2004-08-04 Thread Terry Steichen
Near as I can tell, setting the boost to, say, 0.10, doesn't seem to do anything.

Regards,

Terry
  - Original Message - 
  From: Otis Gospodnetic 
  To: Lucene Users List 
  Sent: Wednesday, August 04, 2004 9:38 AM
  Subject: Re: Negative Boost


  You can just use a boost that is < 1.0, no?

  Otis

  --- Terry Steichen <[EMAIL PROTECTED]> wrote:

  > I can't get negative boosts to work with QueryParser.  Is it possible
  > to do so?
  > 
  > TIA,
  > 
  > Terry
  > 
  > 
  > 


  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]


Negative Boost

2004-08-04 Thread Terry Steichen
I can't get negative boosts to work with QueryParser.  Is it possible to do so?

TIA,

Terry




Re: Underscore character and case issue

2004-07-05 Thread Terry Steichen
Luke runs just fine with 1.3.1.  If you're using Windows, try highlighting
it with Windows Explorer, right-clicking on it, choosing the "Open with.."
menu option and selecting "javaw".

Regards,

Terry

- Original Message - 
From: "Andrzej Bialecki" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, July 05, 2004 1:45 PM
Subject: Re: Underscore character and case issue


> Robert Brown wrote:
>
> > F:\Apache\Lucene\AddOns\Luke\v0.5>java -fullversion
> > java full version "1.3.1_10-b03"
> >
> > F:\Lucene\AddOns\Luke\v0.5>
>
> I never tested it with anything below 1.4 ...
>
> -- 
> Best regards,
> Andrzej Bialecki
>
> -
> Software Architect, System Integration Specialist
> CEN/ISSS EC Workshop, ECIMF project chair
> EU FP6 E-Commerce Expert/Evaluator
> -
> FreeBSD developer (http://www.freebsd.org)
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: NullAnalyzer

2004-06-11 Thread Terry Steichen
+1

- Original Message - 
From: "Eric Jain" <[EMAIL PROTECTED]>
To: "lucene-user" <[EMAIL PROTECTED]>
Sent: Friday, June 11, 2004 4:24 AM
Subject: NullAnalyzer


> There doesn't seem to be an Analyzer that doesn't do anything included 
> with Lucene, is there? This would seem useful to prevent tokenization of 
> certain fields in queries, together with the PerFieldAnalyzerWrapper. 
> But perhaps there is a better way to accomplish this?
> 
>private static class NullAnalyzer
>  extends Analyzer
>{
>  public TokenStream tokenStream(String fieldName, Reader reader)
>  {
>return new CharTokenizer(reader)
>{
>  protected boolean isTokenChar(char c)
>  {
>return true;
>  }
>};
>  }
>}
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 
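
For context, the intended wiring with PerFieldAnalyzerWrapper might look like 
this; a sketch against the Lucene 1.4 API, with an illustrative field name:

    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    // Tokenize everything normally, but pass the "uid" field through whole.
    PerFieldAnalyzerWrapper analyzer =
        new PerFieldAnalyzerWrapper(new StandardAnalyzer());
    analyzer.addAnalyzer("uid", new NullAnalyzer());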

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Open-ended range queries

2004-06-10 Thread Terry Steichen
Speaking for myself, only a small number of my code modules currently treat
"null" as the open-ended range query term parameter.  If the syntax change
from 'null' --> '*' was deemed otherwise desirable and the syntax transition
made very clearly, I could personally adjust to it without too much
difficulty.

I agree that the proposed '*' syntax does seem more logical.  If a change to
that syntax were made such that the old "null" syntax for the upper bound
was retained for backward compatibility, such a transition would be
completely painless.

Regards,

Terry

- Original Message - 
From: "Scott ganyo" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Thursday, June 10, 2004 8:57 PM
Subject: Re: Open-ended range queries


> Well, I do like the *, but apparently there are some people that are
> using this with the null...
>
> Scott
>
> On Jun 10, 2004, at 7:15 PM, Erik Hatcher wrote:
>
> > On Jun 10, 2004, at 4:54 PM, Scott ganyo wrote:
> >> It looks to me like Revision 1.18 broke it.
> >
> > It seems this could be it:
> >
> > revision 1.18
> > date: 2002/06/25 00:05:31;  author: briangoetz;  state: Exp;  lines:
> > +62 -33
> > Support for new range query syntax.  The delimiter is " TO ", but is
> > optional
> > for backward compatibility with previous syntax.  If the range
> > arguments
> > match the format supported by
> > DateFormat.getDateInstance(DateFormat.SHORT),
> > then they will be converted into the appropriate date strings a la
> > DateField.
> >
> > Added Field.Keyword "constructor" for Date-valued arguments.
> >
> > Optimized DateField.timeToString function.
> >
> >
> > But geez June 2002 and no one has complained since?
> >
> > Given that this is so outdated, I'm not sure what the right course of
> > action is.  There are lots more Lucene users now than there were then.
> >  Would adding NULL back be what folks want?  What about simply an
> > asterisk to denote open ended-ness?  [* TO term] or [term TO *]
> >
> > For completeness, here is the diff:
> >
> > % cvs diff -u -r 1.17 -r 1.18 QueryParser.jj
> > Index: QueryParser.jj
> > ===
> > RCS file:
> > /home/cvs/jakarta-lucene/src/java/org/apache/lucene/queryParser/
> > QueryParser.jj,v
> > retrieving revision 1.17
> > retrieving revision 1.18
> > diff -u -r1.17 -r1.18
> > --- QueryParser.jj  20 May 2002 15:45:43 -  1.17
> > +++ QueryParser.jj  25 Jun 2002 00:05:31 -  1.18
> > @@ -65,8 +65,11 @@
> >
> >  import java.util.Vector;
> >  import java.io.*;
> > +import java.text.*;
> > +import java.util.*;
> >  import org.apache.lucene.index.Term;
> >  import org.apache.lucene.analysis.*;
> > +import org.apache.lucene.document.*;
> >  import org.apache.lucene.search.*;
> >
> >  /**
> > @@ -218,35 +221,30 @@
> >
> >private Query getRangeQuery(String field,
> >Analyzer analyzer,
> > -  String queryText,
> > +  String part1,
> > +  String part2,
> >boolean inclusive)
> >{
> > -// Use the analyzer to get all the tokens.  There should be 1 or
> > 2.
> > -TokenStream source = analyzer.tokenStream(field,
> > -  new
> > StringReader(queryText));
> > -Term[] terms = new Term[2];
> > -org.apache.lucene.analysis.Token t;
> > +boolean isDate = false, isNumber = false;
> >
> > -for (int i = 0; i < 2; i++)
> > -{
> > -  try
> > -  {
> > -t = source.next();
> > -  }
> > -  catch (IOException e)
> > -  {
> > -t = null;
> > -  }
> > -  if (t != null)
> > -  {
> > -String text = t.termText();
> > -if (!text.equalsIgnoreCase("NULL"))
> > -{
> > -  terms[i] = new Term(field, text);
> > -}
> > -  }
> > +try {
> > +  DateFormat df = DateFormat.getDateInstance(DateFormat.SHORT);
> > +  df.setLenient(true);
> > +  Date d1 = df.parse(part1);
> > +  Date d2 = df.parse(part2);
> > +  part1 = DateField.dateToString(d1);
> > +  part2 = DateField.dateToString(d2);
> > +  isDate = true;
> >  }
> > -return new RangeQuery(terms[0], terms[1], inclusive);
> > +catch (Exception e) { }
> > +
> > +if (!isDate) {
> > +  // @@@ Add number support
> > +}
> > +
> > +return new RangeQuery(new Term(field, part1),
> > +  new Term(field, part2),
> > +  inclusive);
> >}
> >
> >public static void main(String[] args) throws Exception {
> > @@ -282,7 +280,7 @@
> >  | <#_WHITESPACE: ( " " | "\t" ) >
> >  }
> >
  > > -<DEFAULT> SKIP : {
  > > +<DEFAULT, RangeIn, RangeEx> SKIP : {
  > >    <<_WHITESPACE>>
  > >  }
> >
> > @@ -303,14 +301,28 @@
  > >  | <PREFIXTERM: <_TERM_START_CHAR> (<_TERM_CHAR>)* "*" >
  > >  | <WILDTERM:  <_TERM_START_CHAR>
  > >      (<_TERM_CHAR> | ( [ "*", "?" ] ))* >
  > > -| <RANGEIN:   "[" ( ~["]"] )+ "]">
  > > -| <RANGEEX:   "{" ( ~["}"] )+ "}">
  > > +| <RANGEIN_START: "[" > : RangeIn
  > > +| <RANGEEX_START: "{" > : RangeEx
  > >  }
> 

Re: Open-ended range queries

2004-06-10 Thread Terry Steichen
Well, I'm using 1.4 RC3 and the "null" range upper limit works just fine for
searches in two of my fields; one is in the form of a canonical date (e.g.,
20040610) and the other is in the form of a padded word count (e.g., 01500
for 1500).  The syntax would be pub_date:[20040501 TO null] (dates later
than April 30, 2004) and s_words:[01000 TO null] (articles with 1000 or more
words).

Regards,

Terry

PS: This use of "null" has worked this way since at least 1.2.  As I recall,
way back when, "null" also worked as the first term limit (but no longer
does).

- Original Message - 
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Thursday, June 10, 2004 2:24 PM
Subject: Re: Open-ended range queries


> On Jun 10, 2004, at 2:13 PM, Terry Steichen wrote:
> > Actually, QueryParser does support open-ended ranges like :  [term TO
> > null].
> > Doesn't work for the lower end of the range (though that's usually
> > less of a
> > problem).
>
> It supports "null"?  Are you sure?  If so, I'm very confused about it
> because I don't see where in the grammar it has any special handling
> like that.  Could you show an example that demonstrates this?
>
> Erik
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Open-ended range queries

2004-06-10 Thread Terry Steichen
Actually, QueryParser does support open-ended ranges like :  [term TO null].
Doesn't work for the lower end of the range (though that's usually less of a
problem).

Regards,

Terry

- Original Message - 
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Thursday, June 10, 2004 11:32 AM
Subject: Re: Open-ended range queries


> On Jun 10, 2004, at 9:38 AM, Eric Jain wrote:
> > I see that with RangeQueries either the upperTerm or the lowerTerm are
> > optional - very useful. However, it seems the QueryParser doesn't
> > support this, or is there a syntax trick I have overlooked?
>
> Correct, QueryParser does not support open-ended range queries.  A
> hack, of course, is to use some text that comes ahead of the first or
> beyond the last term on either end of the range.  [term TO Z] for
> example.
>
> What would you suggest as a way to denote an open end?
>
> FYI - You could override getRangeQuery on a custom QueryParser subclass
> and implement this yourself.
>
> Erik
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
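
PS: A sketch of Erik's override suggestion, treating '*' as the open end.  It
assumes the 1.4-era protected getRangeQuery(field, part1, part2, inclusive)
hook and relies on RangeQuery accepting a null Term as an unbounded end:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.RangeQuery;

    public class OpenEndedQueryParser extends QueryParser {
        public OpenEndedQueryParser(String field, Analyzer analyzer) {
            super(field, analyzer);
        }
        protected Query getRangeQuery(String field, String part1,
                                      String part2, boolean inclusive) {
            // '*' on either side means that end of the range is open.
            Term lower = "*".equals(part1) ? null : new Term(field, part1);
            Term upper = "*".equals(part2) ? null : new Term(field, part2);
            return new RangeQuery(lower, upper, inclusive);
        }
    }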


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: extensible query parser - Re: Proximity Searches behavior

2004-06-10 Thread Terry Steichen
Erik,

When is "Lucene in Action" scheduled to be out?

Regards,

Terry

- Original Message - 
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Thursday, June 10, 2004 5:04 AM
Subject: Re: extensible query parser - Re: Proximity Searches behavior


> On Jun 9, 2004, at 4:39 PM, David Spencer wrote:
> >> I like the idea of a flexible run-time grammar, but it sounds too 
> >> good to be true in a general purpose kinda way.
> >
> > My idea isn't perfect for humans, but at least lets you use queries 
> > not hard coded.
> 
> But in my idealistic view, getting something (near) perfect for humans 
> is what a QueryParser is all about.  And, of course, this is domain and 
> application specific in a lot of ways.
> 
> > [5] the point
> >
> > Be backward  compatible and "natural" for existing query syntax, but 
> > leave a hook so that if you innovate and define new query expansion 
> > code there's some hope of someone using it as they can in theory drop 
> > it in and use it w/o coding. Right now if you create some code in this 
> > area I suspect there's little chance people will try it out as there's 
> > too much friction to try it out.
> 
> I'm still grasping for a happy medium between the current QueryParser 
> and this idea of an awkward syntax general purpose pluggable parser.
> 
> Interestingly the current QueryParser is pluggable in some interesting 
> ways thanks to the getters for each query time being overridable.  For 
> example, disallowing wildcard and fuzzy queries, enhancing range query 
> to handle different formats, and changing PhraseQuery into a 
> SpanNearQuery are all tricks I'm including in Lucene in Action.
> 
> Erik
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Proximity Searches behavior

2004-06-09 Thread Terry Steichen
This poses a couple of additional questions:

1) If you set the default slop factor in QueryParser to something greater
than 1, can you also use wildcards?  (I ask that question because, to my
understanding, you can't combine the explicit proximity query syntax with
wildcards.  That is, something like "quick fox*"~3 is not legal.)

2) Regarding the SpanQuery family, do we have any documentation on (a) what
led to their emergence (what problem they solve), (b) what their syntax is
(other than what can be discerned from the JUnit tests), and (c) examples of
their use?

3) Is there a plan for adding QueryParser support for the SpanQuery family?

Regards,

Terry

- Original Message - 
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, June 09, 2004 7:34 AM
Subject: Re: Proximity Searches behavior


> On Jun 9, 2004, at 4:26 AM, gaudinat wrote:
> > What does exactly happen with three words or more when we do a
> > proximity search?
> > such as:  "lucene jakarta best"~10
> > Is each word can be at a distance of  10 of each others, or is there
> > an other behaviour?
>
> The total number of "hops" to put the words in order is calculated and
> if it is less than or equal to 10 there is a match.
>
> For example, a field was indexed with "the quick brown fox jumps...".
>
> "quick brown" matches
> "quick fox"~1 matches (# of hops = 1)
> "quick jumps"~1 does not match (# hops = 2)
> "the brown jumps"~1 does not match (# hops = 2)
> "fox brown"~1 does NOT match (# hops is actually 2)
>
> > By the way, I would like to know  if someone use this lucene feature
> > regularly?
>
> I suspect many people do.  You can also set a default slop factor on
> QueryParser so users don't have to use the ~10 syntax.
>
> > For my part I would like to use this feature to improve the precision
> > of the finding documents by using the word position.
>
> Also have a look at the new (in Lucene 1.4) SpanQuery family.  It has a
> less confusing "slop factor" and you can control whether the terms must
> be in order or not (with SpanNearQuery).  QueryParser does not support
> it currently, but subclassing and overriding getFieldQuery makes it
> possible.
>
> Erik
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
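
PS: For anyone wanting to experiment with question (2) in the meantime, a
sketch of a SpanNearQuery (Lucene 1.4's org.apache.lucene.search.spans
package; the field name is illustrative):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    // "quick" within 3 positions of "fox", required to appear in order.
    SpanQuery[] clauses = new SpanQuery[] {
        new SpanTermQuery(new Term("contents", "quick")),
        new SpanTermQuery(new Term("contents", "fox")),
    };
    SpanNearQuery near = new SpanNearQuery(clauses, 3, true); // slop, inOrder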


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: why the score is not always 1.0 when comparing two identical strings?

2004-06-04 Thread Terry Steichen
Nothing is wrong.  When the maximum relevance score is greater than one, all
hit scores are normalized (making the highest score 1.0).  When the maximum
score is less than 1, normalization does not occur.  The more complex the
query, the more likely that the raw (non-normalized) score will be less than
one.  Use searcher.explain() to get the details.
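
For example, a minimal sketch against the Lucene 1.x API:

    import org.apache.lucene.search.Explanation;
    import org.apache.lucene.search.Hits;

    Hits hits = searcher.search(query);
    for (int i = 0; i < hits.length(); i++) {
        // hits.id(i) is the internal document number for hit i.
        Explanation e = searcher.explain(query, hits.id(i));
        System.out.println(hits.score(i) + "\n" + e.toString());
    }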

HTH,

Terry

- Original Message - 
From: "uddam chukmol" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Friday, June 04, 2004 8:39 AM
Subject: why the score is not always 1.0 when comparing two identical
strings?


> hi,
>
> i'm not so convinced by the way Lucene compute the score.
>
> I tried to compare two string by using a program. In the program, i index
the first string as if i indexed a document and use the queryParser with the
same analyzer that I used to index the first string to analyze my second
string and to form a query from it.
>
> I run the program for the first time with the first string as:
> "This is the text to index with Lucene CREATE TABLE Elements ( TYPELEMENT
varchar (255) NULL , CLEELEMENT varchar (255) NULL , LIBELEM varchar (255)
NULL , CODENTITE varchar (255) NULL , CLEENTITE varchar (255) NULL ,
DONNNEEA1 varchar (255) NULL , DONNEEB1 varchar (255) NULL , DONNEEA2
varchar (255) NULL , DONNEEB2 varchar (255) NULL , DONNEEA3 varchar (255)
NULL , DONNEEB3 varchar (255) NULL , DONNEEA4 varchar (255) NULL , DONNEEB4
varchar (255) NULL , DONNEEA5 varchar (255) NULL , DONNEEB5 varchar (255)
NULL , TOP1 varchar (255) NULL , TOP2 varchar (255) NULL , TOP3 varchar
(255) NULL , TOP4 varchar (255) NULL , TOP5 varchar (255) NULL , QTE1
varchar (255) NULL , QTE2 varchar (255) NULL , QTE3 varchar (255) NULL ,
MONTANT1 varchar (255) NULL , MONTANT2 varchar (255) NULL , MONTANT3 varchar
(255) NULL , DATE1 varchar (255) NULL , DATE2 varchar (255) NULL , DATE3
varchar (255) NULL , STATUT varchar (255) NULL , DATPRISENCPTSTAT varchar
(255) NULL )".
>
> I used the same string as to form my query and i got the final score of
these two string which is 1.0.
>
> Then something suprised me when i changed to two strings into "All work
and no play makes Jack a dull boy" and compared them by using one as a
document and the other to form the query. The result was just not 1.0; it was
0.3033... instead.
>
> I used Eclipse as my Java Editor. Any conflict with Lucene?
>
> Any idea/suggestion of what went wrong over here?
>
> Uddam
>
>
> -
> Do you Yahoo!?
> Friends.  Fun. Try the all-new Yahoo! Messenger


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: FileNotFoundException when trying to indexing.

2004-06-03 Thread Terry Steichen
Prasad,

I think you'll have to provide more code so we can see what's actually going
on.  BTW, I don't see you calling the UseCompoundFile method (unless you do
it inside indexFile/Directory) - I wonder if that could be an issue?

Regards,

Terry

PS: I run on XP/Pro just fine, so there's nothing intrinsically wrong with
the platform.

- Original Message - 
From: "Prasad Ganguri" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Thursday, June 03, 2004 12:59 PM
Subject: FileNotFoundException when trying to indexing.


I am using Lucene for buiding our document management system. I tested it in
Windows2000 Professional and got successful execution.

Recently, when we ported the code onto an WindowsXP Professional, we are
getting the following exception. I tried to create segments folder using my
code, but throwing Access denied error.

Could some one help me, what is wrong with my code?

java.io.FileNotFoundException: C:\cms\index\segments (The system cannot find
the file specified)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:204)
at org.apache.lucene.store.FSInputStream$Descriptor.<init>(Unknown
Source)
at org.apache.lucene.store.FSInputStream.<init>(Unknown Source)
at org.apache.lucene.store.FSDirectory.openFile(Unknown Source)
at org.apache.lucene.index.SegmentInfos.read(Unknown Source)
at org.apache.lucene.index.IndexWriter$1.doBody(Unknown Source)
at org.apache.lucene.store.Lock$With.run(Unknown Source)
at org.apache.lucene.index.IndexWriter.<init>(Unknown Source)
at org.apache.lucene.index.IndexWriter.<init>(Unknown Source)
at
com.ganguri.cms.contentmanagement.index.FileIndexer.index(FileIndexer.java:6
2)
at
com.ganguri.cms.contentmanagement.filemanager.Document.moveFileToRepository(
Document.java:215)
at
jsp_servlet._content._indexcardprocess._jspService(_indexcardprocess.java:19
3)
at com.ganguri.cms.jsp.CMSJSPPage.service(CMSJSPPage.java:20)
at
weblogic.servlet.internal.ServletStubImpl.invokeServlet(ServletStubImpl.java
:105)
at
weblogic.servlet.internal.ServletStubImpl.invokeServlet(ServletStubImpl.java
:123)
at
weblogic.servlet.internal.ServletContextImpl.invokeServlet(ServletContextImp
l.java:742)
at
weblogic.servlet.internal.ServletContextImpl.invokeServlet(ServletContextImp
l.java:686)
at
weblogic.servlet.internal.ServletContextManager.invokeServlet(ServletContext
Manager.java:247)
at
weblogic.socket.MuxableSocketHTTP.invokeServlet(MuxableSocketHTTP.java:361)
at
weblogic.socket.MuxableSocketHTTP.execute(MuxableSocketHTTP.java:261)
at weblogic.kernel.ExecuteThread.run(ExecuteThread.java:120)

The corresponding code is as follows:

public static void index(File indexDir, File dataDir, boolean isNew)
        throws Exception
{
    if (!dataDir.exists())
        throw new IOException(dataDir.getName() + " does not exist.");
    System.out.println(" indexDir existing.? " + indexDir.exists());
    IndexWriter writer = null;
    if (!indexDir.exists())
    {
        indexDir.mkdirs();
    }
    try
    {
        // Here the exception is thrown
        writer = new IndexWriter(indexDir, getAnalyzer(), isNew);
        if (dataDir.isFile())
            indexFile(writer, dataDir);
        else if (dataDir.isDirectory())
            indexDirectory(writer, dataDir);
        else
            return;
        writer.optimize();
        writer.close();
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
    finally
    {
        if (writer != null)
            writer.close();
    }
}

Thanks in advance..


Prasad


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: similarity of two texts

2004-06-02 Thread Terry Steichen
Erik,

Could you expand on this just a wee bit, perhaps with an example of how to
compute this vector angle?

TIA,

Terry

- Original Message - 
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, June 01, 2004 9:39 AM
Subject: Re: similarity of two texts


> On Jun 1, 2004, at 9:24 AM, Grant Ingersoll wrote:
> > Hey Eric,
>
> Eri*K*  :)
>
> > What did you do to calc similarity?
>
> I computed the angle between two vectors.  The vectors are obtained
> from IndexReader.getTermFreqVector(docId, "field").
>
> >   I haven't had time, but was thinking of ways to add the ability to
> > get the similarity score (as calculated when doing a search) given a
> > term vector (or just a document id).
>
> It would be quite compute-intensive to do something like this.  This
> could be done through a custom sort as well, if applying it at the
> scoring level doesn't work.  I haven't given any thought to how this
> could work for scoring or sorting before, but does sound quite
> interesting.
>
> >   Any ideas on how to approach this would be appreciated.  The scoring
> > in Lucene has always been a bit confusing to me, despite looking at
> > the code several times, especially once you get into boolean queries,
> > etc.
>
> No doubt that it is confusing - to me also.  But Explanation is your
> friend.
>
> Erik
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
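
PS: Pending Erik's fuller answer, here is my hedged reading of the angle
computation over two term-frequency vectors (Lucene 1.4 TermFreqVector API,
raw term frequencies; a sketch, not Erik's actual code):

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermFreqVector;

    // Cosine between the tf vectors of two docs; angle = Math.acos(result).
    static double cosine(IndexReader reader, int docA, int docB, String field)
            throws IOException {
        TermFreqVector a = reader.getTermFreqVector(docA, field);
        TermFreqVector b = reader.getTermFreqVector(docB, field);
        String[] termsA = a.getTerms();
        int[] tfA = a.getTermFrequencies();
        Map freqsA = new HashMap();
        double normA = 0.0;
        for (int i = 0; i < termsA.length; i++) {
            freqsA.put(termsA[i], new Integer(tfA[i]));
            normA += (double) tfA[i] * tfA[i];
        }
        String[] termsB = b.getTerms();
        int[] tfB = b.getTermFrequencies();
        double dot = 0.0, normB = 0.0;
        for (int i = 0; i < termsB.length; i++) {
            Integer fa = (Integer) freqsA.get(termsB[i]);
            if (fa != null) dot += fa.intValue() * (double) tfB[i];
            normB += (double) tfB[i] * tfB[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }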


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Lockfile Problem Solved

2004-05-31 Thread Terry Steichen
Just thought I'd pass on some info I just discovered.  I've been successfully using 
the CVS head version of Lucene as of about 2 months ago.  I then got the formal 
release (1.4-rc3) and tried it with my application, but it failed.  I tried it with 
some commandline test routines and they worked fine, but not my application (runs 
under Tomcat 4.1.24, on a Windows XP/Pro).

After a lot of debugging, I discovered that IndexReader was trying to create a 
lockfile in a non-existent directory (called "temp" at the same level as the index 
directory).  After I manually created this directory, everything seems to be working 
fine.  

(Why this happens only with the Tomcat application, and not commandline apps, I don't 
know.  Probably has something to do with JVM properties.)

Regards,

Terry

Re: Internal full content store within Lucene

2004-05-18 Thread Terry Steichen
+1

- Original Message - 
From: "Kevin Burton" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, May 18, 2004 2:43 PM
Subject: Internal full content store within Lucene


> Per the discussion the other day about storing content external to 
> Lucene I think we have an opportunity to improve the lucene core and 
> bring a lot of functionality to future developers.
> 
> Right now Lucene allows you to have a 'stored' field which keeps the 
> content with a segment along with your inverted index.
> 
> While this is flexible for small indexes in production environments it 
> falls down because index merges take FOREVER.
> 
> A thread the other day opened up and suggesting storing just a pointer 
> to a file on the filesystem.  This got me thinking about a long term 
> mechanism I wanted for our cluster where we store content outside of the 
> index in a high performance flat-file database.
> 
> The Lucene index would only maintain FILENO:OFFSET:LENGTH info within 
> the index and this would allow us to point to our flat file database. 
> 
> This would allow Lucene index merges to be FAST, support native field 
> storage, and allow the filesystem optimize contiguous blocks for the 
> flat content store.  Everyone wins.
> 
> This is what the Internet archive uses:
> 
> http://www.archive.org/web/researcher/ArcFileFormat.php
> 
> I propose that Lucene support a new form of stored field that allows 
> external storage engine to keep the content in a flat text store.
> 
> How much interest is there for this?  I have to do this for work and 
> will certainly take the extra effort into making this a standard Lucene 
> feature. 
> 
> I can come up with a requirements doc and a more formal proposal in 
> another email if I get enough +1s...
> 
> Kevin
> 
> -- 
> 
> Please reply using PGP.
> 
> http://peerfear.org/pubkey.asc
> 
> NewsMonster - http://www.newsmonster.org/
> 
> Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
>AIM/YIM - sfburtonator,  Web - http://peerfear.org/
> GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
>   IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Java 1.4 (was: new Lucene release: 1.4 RC3)

2004-05-13 Thread Terry Steichen
For my $.02, I think it is a mistake to leave this single dependency, if
there's some reasonably easy way to remove it.

Regards,

Terry

- Original Message - 
From: "Tim Jones" <[EMAIL PROTECTED]>
To: "'Lucene Users List'" <[EMAIL PROTECTED]>
Sent: Thursday, May 13, 2004 4:10 PM
Subject: Re: Java 1.4 (was: new Lucene release: 1.4 RC3)


> yes - it has to do with the anonymous inner classes - see
>
> http://issues.apache.org/bugzilla/show_bug.cgi?id=27638
>
> did we decide to leave this as a "compile in 1.4, run in
> 1.3" work around, or to convert the anonymous inner classes
> to named inner classes?
>
> this is the only 1.4 dependency that I know of.
>
>
> > -Original Message-
> > From: Terry Steichen [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, May 12, 2004 9:42 AM
> > To: Lucene Users List
> > Subject: Re: new Lucene release: 1.4 RC3
> >
> >
> > Last time I checked, JDK 1.4 was needed to compile the
> > classes implementing
> > the new sorting features.  Part of the issue was the
> > inclusion of the regex
> > classes, but the other dependency had to do (as I recall)
> > with some kind of
> > inner class constructs (that JDK 1.3 won't compile).  I
> > believe that the
> > contributor, Tim Jones, fixed some of then to work with JDK
> > 1.3, but to the
> > best of my knowledge, not the inner class stuff.
> >
> > Regards,
> >
> > Terry
> >
> > - Original Message - 
> > From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
> > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > Sent: Wednesday, May 12, 2004 8:04 AM
> > Subject: Re: new Lucene release: 1.4 RC3
> >
> >
> > > I don't recall any JDK 1.4 methods/classes being used, and
> > I just saw
> > > Doug replacing one AssertException (1.4) with RuntimeException.
> > >
> > > Are there some 1.4 dependencies I'm not aware of?
> > >
> > > Otis
> > >
> > > --- Terry Steichen <[EMAIL PROTECTED]> wrote:
> > > > I presume this still requires Java 1.4 to build, but will run with
> > > > Java 1.3?
> > > >
> > > > Regards,
> > > >
> > > > Terry
> > > >
> > > > - Original Message - 
> > > > From: "Doug Cutting" <[EMAIL PROTECTED]>
> > > > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > > > Sent: Tuesday, May 11, 2004 4:51 PM
> > > > Subject: new Lucene release: 1.4 RC3
> > > >
> > > >
> > > > > Version 1.4 RC3 of Lucene is available for download from:
> > > > >
> > > > > http://cvs.apache.org/dist/jakarta/lucene/v1.4-rc3/
> > > > >
> > > > > Changes are described at:
> > > > >
> > > > >
> > > >
> > >
> > http://cvs.apache.org/viewcvs.cgi/*checkout*/jakarta-lucene/CH
> ANGES.txt?rev=1.85
> > > >
> > > > Doug
> > > >
> > > >
> > > -
> > > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > > For additional commands, e-mail:
> > > [EMAIL PROTECTED]
> > > >
> > > >
> > >
> > >
> > > -
> > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > For additional commands, e-mail: [EMAIL PROTECTED]
> > >
> >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: new Lucene release: 1.4 RC3

2004-05-12 Thread Terry Steichen
Last time I checked, JDK 1.4 was needed to compile the classes implementing
the new sorting features.  Part of the issue was the inclusion of the regex
classes, but the other dependency had to do (as I recall) with some kind of
inner class constructs (that JDK 1.3 won't compile).  I believe that the
contributor, Tim Jones, fixed some of them to work with JDK 1.3, but to the
best of my knowledge, not the inner class stuff.

Regards,

Terry

- Original Message - 
From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, May 12, 2004 8:04 AM
Subject: Re: new Lucene release: 1.4 RC3


> I don't recall any JDK 1.4 methods/classes being used, and I just saw
> Doug replacing one AssertException (1.4) with RuntimeException.
>
> Are there some 1.4 dependencies I'm not aware of?
>
> Otis
>
> --- Terry Steichen <[EMAIL PROTECTED]> wrote:
> > I presume this still requires Java 1.4 to build, but will run with
> > Java 1.3?
> >
> > Regards,
> >
> > Terry
> >
> > - Original Message - 
> > From: "Doug Cutting" <[EMAIL PROTECTED]>
> > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > Sent: Tuesday, May 11, 2004 4:51 PM
> > Subject: new Lucene release: 1.4 RC3
> >
> >
> > > Version 1.4 RC3 of Lucene is available for download from:
> > >
> > > http://cvs.apache.org/dist/jakarta/lucene/v1.4-rc3/
> > >
> > > Changes are described at:
> > >
> > >
> >
>
http://cvs.apache.org/viewcvs.cgi/*checkout*/jakarta-lucene/CHANGES.txt?rev=1.85
> > >
> > > Doug
> > >
> > >
> > -
> > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > For additional commands, e-mail:
> > [EMAIL PROTECTED]
> > >
> > >
> >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: new Lucene release: 1.4 RC3

2004-05-12 Thread Terry Steichen
I presume this still requires Java 1.4 to build, but will run with Java 1.3?

Regards,

Terry

- Original Message - 
From: "Doug Cutting" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, May 11, 2004 4:51 PM
Subject: new Lucene release: 1.4 RC3


> Version 1.4 RC3 of Lucene is available for download from:
>
> http://cvs.apache.org/dist/jakarta/lucene/v1.4-rc3/
>
> Changes are described at:
>
>
http://cvs.apache.org/viewcvs.cgi/*checkout*/jakarta-lucene/CHANGES.txt?rev=1.85
>
> Doug
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Problems From the Word Go

2004-04-30 Thread Terry Steichen
Erik,

Maybe you could donate some of those demo modules (and the accompanying
article/text) to Lucene, so they'd be incorporated officially in the
website?

Regards,

Terry

- Original Message - 
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Friday, April 30, 2004 8:48 AM
Subject: Re: Problems From the Word Go


> Unfortunately the demo that comes with Lucene is harder to run than it
> really should be.  My suggestion is to just get the Lucene JAR, and try
> out examples from the many articles available.  My intro Lucene article
> at java.net should be easy to get up and running in only a few minutes
> of having the JAR (and basic Java know-how with classpath and such).
>
> Erik
>
> On Apr 29, 2004, at 11:53 AM, Alex Wybraniec wrote:
>
> > I'm sorry if this is not the correct place to post this, but I'm very
> > confused, and getting towards the end of my tether.
> >
> > I need to install/compile and run Lucene on a Windows XP Pro based
> > machine,
> > running J2SE 1.4.2, with ANT.
> >
> > I downloaded both the source code and the pre-compile versions, and as
> > yet
> > have not been able to get either running. I've been through the
> > documentation, and still I can find little to help me set it up
> > properly.
> >
> > All I want to do (to start with) is compile and run the demo version.
> >
> > I'm sorry to ask such a newbie question, but I'm really stuck.
> >
> > So if anyone can point me to an idiots guide, or offer me some help, I
> > would
> > be most grateful.
> >
> > Once I get past this stage, I'll have all sorts of juicer questions
> > for you,
> > but at the minute, I can't even get past stage 1
> >
> > Thank you in advance
> > Alex
> > ---
> > Outgoing mail is certified Virus Free.
> > Checked by AVG anti-virus system (http://www.grisoft.com).
> > Version: 6.0.672 / Virus Database: 434 - Release Date: 28/04/2004
> >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: searching only part of an index

2004-04-27 Thread Terry Steichen
I think that if you include the indexing timestamp in the Document you
create when indexing, you could sort on this and only pick the first 100.

Regards,

Terry
- Original Message - 
From: "Alan Smith" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, April 27, 2004 8:02 AM
Subject: searching only part of an index


> Hi
>
> I wondered if anyone knows whether it is possible to search ONLY the 100
(or
> whatever) most recently added documents to a lucene index? I know that
once
> I have all my results ordered by ID number in Hits I could then just
display
> the required amount, but I wondered if there is a way to avoid searching
all
> documents in the index in the first place?
>
> Many thanks
>
> Alan
>
> _
> Express yourself with cool new emoticons
http://www.msn.co.uk/specials/myemo
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Stemmer Benefits/Costs

2004-04-22 Thread Terry Steichen
Andrzej,

Sorry for misspelling your name.  My Polish sucks.

Terry

- Original Message - 
From: "Terry Steichen" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Thursday, April 22, 2004 7:56 PM
Subject: Re: Stemmer Benefits/Costs


> So, Andrez - Thank you for your comments - what you say makes a good deal
of
> sense.  When you have lots of different inflections that all share the
same
> root, stemming can clearly provide significant (recall) benefits (in terms
> of catching hidden words and/or simplifying the query).
>
> However, would you say that "from the perspective of English" ("with its
> minimal inflection") the points I raise are correct?  (You seem to say so
> with the statement that stemming "usually improves recall, but lowers
> precision.")
>
> And, would you expect significant benefits from the Egothor project code
> (versus Snowball/Porter) when the text is in English (as opposed to a
highly
> inflectional language like Polish)?
>
> Regards,
>
> Terry
>
> - Original Message - 
> From: "Andrzej Bialecki" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Thursday, April 22, 2004 5:37 PM
> Subject: Re: Stemmer Benefits/Costs
>
>
> > Terry Steichen wrote:
> >
> > > I've been experimenting with the Porter and Snowball stemmers.  It
> > > seems to me that one of the most valuable benefits these provide is
> > > the capability to generalize phrase terms.  As a very simple example,
> > > without the stemmer, I might need to include three phrase terms in my
> > > query: "north korea", "north korean", "north koreans".  But with the
> > > stemmer only one will suffice.  To me, that's a huge advantage.  (For
> > > non-phrases, the advantage doesn't seem to be so great, because much
> > > the same effect can be achieved with wildcards.)
> >
> > That's because you look at it from the perspective of English language
> > with its minimal inflection... My mother tongue is Polish - a highly
> > inflectional language from the Slavic family of languages. It is normal
> > for a single Polish word to have as many as 20+ different inflected
> > forms (plural/singular/dual, tense, gender, mood, case, infinitive...
> > enough? ;-) ). For this type of language studies show that stemming (or
> > rather lemmatization - bringing words to their base grammatical forms)
> > significantly improves recall in IR systems.
> >
> > >
> > > But there seems to be a price that you also pay, in that
> > > discrimination may be adversely affected.  If you want to
> > > discriminate between two terms that the stemmer views as derived from
> > > the same root, you're out of luck (I think).  The problem with this
> >
> > Stemming usually improves recall, but lowers precision. For some systems
> > it is more desirable to provide any results, even if they are not quite
> > correct, than to provide none.
> >
> > > is that you may start with a set of terms that don't have this
> > > problem, but over time as new content is added to the index, such
> > > problems may gradually get introduced - often unpredictably.  And to
> > > the best of my (admittedly limited) knowledge, once you've indexed
> > > using a stemmer, there's no way to override it in specific instances.
> >
> > You can always store in your index stemmed/non-stemmed terms alongside.
> >
> > >
> > > Appreciate any comments, thoughts on the above.
> >
> > For highly-inflectional languages I had _very_ good results with
> > stemmers built using the code from Egothor project
> > (http://www.egothor.org) - much more sophisticated than simple
> > rule-based stemmers like Snowball or Porter. In fact, after proper
> > training on a large corpus I was getting ~70% of correct lemmas for
> > previously unseen words, and over 90% of correct (unique) stems.
> >
> > -- 
> > Best regards,
> > Andrzej Bialecki
> >
> > -
> > Software Architect, System Integration Specialist
> > CEN/ISSS EC Workshop, ECIMF project chair
> > EU FP6 E-Commerce Expert/Evaluator
> > -
> > FreeBSD developer (http://www.freebsd.org)
> >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Stemmer Benefits/Costs

2004-04-22 Thread Terry Steichen
So, Andrez - Thank you for your comments - what you say makes a good deal of
sense.  When you have lots of different inflections that all share the same
root, stemming can clearly provide significant (recall) benefits (in terms
of catching hidden words and/or simplifying the query).

However, would you say that "from the perspective of English" ("with its
minimal inflection") the points I raise are correct?  (You seem to say so
with the statement that stemming "usually improves recall, but lowers
precision.")

And, would you expect significant benefits from the Egothor project code
(versus Snowball/Porter) when the text is in English (as opposed to a highly
inflectional language like Polish)?

Regards,

Terry

- Original Message - 
From: "Andrzej Bialecki" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Thursday, April 22, 2004 5:37 PM
Subject: Re: Stemmer Benefits/Costs


> Terry Steichen wrote:
>
> > I've been experimenting with the Porter and Snowball stemmers.  It
> > seems to me that one of the most valuable benefits these provide is
> > the capability to generalize phrase terms.  As a very simple example,
> > without the stemmer, I might need to include three phrase terms in my
> > query: "north korea", "north korean", "north koreans".  But with the
> > stemmer only one will suffice.  To me, that's a huge advantage.  (For
> > non-phrases, the advantage doesn't seem to be so great, because much
> > the same effect can be achieved with wildcards.)
>
> That's because you look at it from the perspective of English language
> with its minimal inflection... My mother tongue is Polish - a highly
> inflectional language from the Slavic family of languages. It is normal
> for a single Polish word to have as many as 20+ different inflected
> forms (plural/singular/dual, tense, gender, mood, case, infinitive...
> enough? ;-) ). For this type of language studies show that stemming (or
> rather lemmatization - bringing words to their base grammatical forms)
> significantly improves recall in IR systems.
>
> >
> > But there seems to be a price that you also pay, in that
> > discrimination may be adversely affected.  If you want to
> > discriminate between two terms that the stemmer views as derived from
> > the same root, you're out of luck (I think).  The problem with this
>
> Stemming usually improves recall, but lowers precision. For some systems
> it is more desirable to provide any results, even if they are not quite
> correct, than to provide none.
>
> > is that you may start with a set of terms that don't have this
> > problem, but over time as new content is added to the index, such
> > problems may gradually get introduced - often unpredictably.  And to
> > the best of my (admittedly limited) knowledge, once you've indexed
> > using a stemmer, there's no way to override it in specific instances.
>
> You can always store in your index stemmed/non-stemmed terms alongside.
>
> >
> > Appreciate any comments, thoughts on the above.
>
> For highly-inflectional languages I had _very_ good results with
> stemmers built using the code from Egothor project
> (http://www.egothor.org) - much more sophisticated than simple
> rule-based stemmers like Snowball or Porter. In fact, after proper
> training on a large corpus I was getting ~70% of correct lemmas for
> previously unseen words, and over 90% of correct (unique) stems.
>
> -- 
> Best regards,
> Andrzej Bialecki
>
> -
> Software Architect, System Integration Specialist
> CEN/ISSS EC Workshop, ECIMF project chair
> EU FP6 E-Commerce Expert/Evaluator
> -
> FreeBSD developer (http://www.freebsd.org)
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Stemmer Benefits/Costs

2004-04-22 Thread Terry Steichen
I've been experimenting with the Porter and Snowball stemmers.  It seems to me that 
one of the most valuable benefits these provide is the capability to generalize phrase 
terms.  As a very simple example, without the stemmer, I might need to include three 
phrase terms in my query: "north korea", "north korean", "north koreans".  But with 
the stemmer only one will suffice.  To me, that's a huge advantage.  (For non-phrases, 
the advantage doesn't seem to be so great, because much the same effect can be 
achieved with wildcards.)
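For concreteness, a minimal Porter-based analyzer of the kind under discussion
(a sketch only; the choice of tokenizer is illustrative):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.TokenStream;

class StemmingAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // "korea", "korean" and "koreans" all reduce to the same stem.
    return new PorterStemFilter(new LowerCaseTokenizer(reader));
  }
}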

But there seems to be a price that you also pay, in that discrimination may be 
adversely affected.  If you want to discriminate between two terms that the stemmer 
views as derived from the same root, you're out of luck (I think).  The problem with 
this is that you may start with a set of terms that don't have this problem, but over 
time as new content is added to the index, such problems may gradually get introduced 
- often unpredictably.  And to the best of my (admittedly limited) knowledge, once 
you've indexed using a stemmer, there's no way to override it in specific instances.
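One partial workaround (Andrzej suggests it in the replies archived above) is
to index stemmed and unstemmed forms side by side. A sketch, assuming the
PerFieldAnalyzerWrapper from 1.4-era builds and the StemmingAnalyzer sketched
earlier; field names are illustrative:

import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

class DualFieldIndexing {
  static Document makeDoc(String body) {
    Document doc = new Document();
    doc.add(Field.Text("contents", body));        // analyzed with stemming
    doc.add(Field.Text("contents_exact", body));  // analyzed without stemming
    return doc;
  }

  static PerFieldAnalyzerWrapper analyzer() {
    // Stem by default, but leave "contents_exact" untouched.
    PerFieldAnalyzerWrapper wrapper =
        new PerFieldAnalyzerWrapper(new StemmingAnalyzer());
    wrapper.addAnalyzer("contents_exact", new StandardAnalyzer());
    return wrapper;
  }
}

Queries can then hit contents: for stemmed matching and contents_exact: when
discrimination matters.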

Appreciate any comments, thoughts on the above.

Regards,

Terry
 

Re: Weird Search Behavior

2004-04-01 Thread Terry Steichen
I did some more checking and uncovered what appears to be a serious Lucene
problem. (Either that or my merge code - below - is wrong.)  Appreciate any
help in figuring out what's wrong.  Here are the facts as I see them:

1) I put together a large number of canned queries (some rather complex) for
routine testing purposes.
2) I created a new compound file index and tested the queries.  All worked
fine.
3) I then indexed some new documents and merged the new index with the
original index.
4) I then tried the queries again.  Each time I did this, about 1-3% of the
queries no longer worked - the actual number appears to vary with each
merge.
5) The specific queries that fail change with each merge. Ones that failed
after the previous merge almost always appear to work again with the next
merge (which produces a new batch of failures).
6) In all cases I've so far examined, the offending part of the affected
queries is a single quoted phrase (even though there may be several such
phrases in the query) - remove it, and the (now modified) query works fine.
7) I tried the same thing using the original multi-file index format, with
the same results.
8) About a week and a half ago, I migrated from 1.3final to the latest CVS
head.
9) I've only just started checking this, so I don't know how long this
behavior has been going on.  The small percentage of errors and (apparent)
randomness of which query is affected make it hard to detect.
10) I have about 32 fields per document, most of which are tokenized,
indexed and stored.
11) My merge code (for the multi-file index format) is this:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

class MergeIndices {
  public static void main(String[] args) {

    // args[0]: relative path to main index
    // args[1]: relative path to new index (to be merged with main)

    try {
      IndexWriter writer =
          new IndexWriter(args[0], new StandardAnalyzer(), false);
      // writer.setUseCompoundFile(true); // used for compound format
      FSDirectory dir = FSDirectory.getDirectory(args[1], false);
      FSDirectory[] dirs = new FSDirectory[1];
      dirs[0] = dir;
      writer.addIndexes(dirs);
      writer.optimize();
      writer.close();
    } catch (Exception e) {
      System.out.println(" caught a " + e.getClass() +
          "\n with message: " + e.getMessage());
    }
  }
}



- Original Message -
From: "Terry Steichen" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, March 31, 2004 11:47 AM
Subject: Re: Weird Search Behavior


> No, they're typos in the e-mail.  In the application, all the colons are
> properly placed.  (Guess I was/am so frustrated I can't write right any
> more).
>
> Terry
>
> - Original Message -
> From: "Erik Hatcher" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Wednesday, March 31, 2004 9:55 AM
Subject: Re: Weird Search Behavior
>
>
> > On Mar 31, 2004, at 9:49 AM, Terry Steichen wrote:
> > > I'm experiencing some very puzzling search behavior.  I am using the
> > > CVS head I pulled about a week ago.  I use the StandardAnalyzer and
> > > QueryParser.  I have a collection of XML documents indexed.  One field
> > > is "subhead", and here's what I find with different queries:
> > > subhead:(missile defense)- works fine
> > > subhead("missile" "defense") - works fine
> > > subhead("missile defense") - fails
> > > subhead(missile defense "missile defense") - fails
> > > subhead(missile defense "missile dork") - works fine
> > > subhead(missile defense "missile defens") - works fine (note
> > > misspelling)
> >
> > I presume the missing colons on all but the first example is just a
> > typo in your e-mail?  If not, might that be the problem?
> >
> > Erik
> >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Weird Search Behavior

2004-03-31 Thread Terry Steichen
No, they're typos in the e-mail.  In the application, all the colons are
properly placed.  (Guess I was/am so frustrated I can't write right any
more).

Terry

- Original Message -
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, March 31, 2004 9:55 AM
Subject: Re: Weird Search Behavior


> On Mar 31, 2004, at 9:49 AM, Terry Steichen wrote:
> > I'm experiencing some very puzzling search behavior.  I am using the
> > CVS head I pulled about a week ago.  I use the StandardAnalyzer and
> > QueryParser.  I have a collection of XML documents indexed.  One field
> > is "subhead", and here's what I find with different queries:
> > subhead:(missile defense)- works fine
> > subhead("missile" "defense") - works fine
> > subhead("missile defense") - fails
> > subhead(missile defense "missile defense") - fails
> > subhead(missile defense "missile dork") - works fine
> > subhead(missile defense "missile defens") - works fine (note
> > misspelling)
>
> I presume the missing colons on all but the first example is just a
> typo in your e-mail?  If not, might that be the problem?
>
> Erik
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Weird Search Behavior

2004-03-31 Thread Terry Steichen
I'm experiencing some very puzzling search behavior.  I am using the CVS head I pulled 
about a week ago.  I use the StandardAnalyzer and QueryParser.  I have a collection of 
XML documents indexed.  One field is "subhead", and here's what I find with different 
queries:
subhead:(missile defense)- works fine 
subhead("missile" "defense") - works fine
subhead("missile defense") - fails
subhead(missile defense "missile defense") - fails
subhead(missile defense "missile dork") - works fine
subhead(missile defense "missile defens") - works fine (note misspelling)

At the moment, I can't find any other field or phrase that does this.  However, 
according to my notes (as I'm no longer trusting my mind on this), about a week ago 
(about the time I started using the new CVS version) I noticed similar behavior with 
the query 'subhead:"al qaeda"' - but that now works perfectly fine! Same thing with the 
query 'summary:"heart disease"'; it failed to work and then a day or so later, it 
worked.  (I merge new documents into the master index each day.)

Any ideas on what might possibly be going on would be very much appreciated.

Regards,

Terry



Re: Similarity - position in Field[] effects scoring - how to change?

2004-03-23 Thread Terry Steichen
Joachim,

I believe you'll have to replace the default Similarity class with one of
your own.  Not sure exactly what the settings should be - maybe some other
list members can give you specifics.  Otherwise, you'll probably have to
experiment with it.
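A starting point might be to flatten the length norm, so the number of values
in the field stops influencing the score - a sketch only, using the
DefaultSimilarity class from 1.4-era builds:

import org.apache.lucene.search.DefaultSimilarity;

class FlatLengthSimilarity extends DefaultSimilarity {
  // The default is 1/sqrt(numTerms); a constant removes the length effect.
  public float lengthNorm(String fieldName, int numTerms) {
    return 1.0f;
  }
}

Note that lengthNorm is baked into the index-time norms, so it has to be set
on the IndexWriter (setSimilarity) and the documents reindexed, as well as set
on the Searcher.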

Regards,

Terry

- Original Message -
From: "Joachim Schreiber" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, March 23, 2004 10:05 AM
Subject: Similarity - position in Field[] effects scoring - how to change?


> Hello,
>
> I ran into the following problem; perhaps somebody can help me.
>
> I have an index with different ids in the same field,
> something like
>
> 
> 45678565
> 87854546
>
> Situation: I have different documents with the entry  in the same
> index.
>
>
> document 1)
>
> 324235678565
> 324dssd5678565
> 45678324565
> 
> 8785454324326
>
>
> document 2)
>
> 324235678565
> 
> 45678324565
> 8785454324326
>
>
>
> when I search for "  s: "  I receive both docs, but document 1 has a
> better score than document 2.
> The position of the entry in doc 1 is Field[4] and in doc 2 it's Field[2],
> so this seems to affect scoring.
>
> How can I disable this behaviour, so doc 1 has the same score as doc 2?
> Which method do I have to override in DefaultSimilarity?
> Has anybody any ideas? Any help is welcome.
>
> Thanks
>
> yo
>
>
>
>
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Final Hits

2004-03-22 Thread Terry Steichen
Erik,

There are a number of different possibilities which I'm still evaluating.
But if there is some significant reason for *not* subclassing Hits
(performance?), that will have a major bearing on whether the approach I'm
evaluating makes sense.

So, let me rephrase my question: Is the "final" nature of Hits due to some
performance reason, or simply because no one has previously expressed any
interest in subclassing it?  Or, putting it in reverse, is there any
technical problem likely to arise from removing the "final" attribute(s)?

Regards,

Terry

- Original Message -
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, March 22, 2004 7:06 AM
Subject: Re: Final Hits


> How exactly would you take advantage of a subclassable Hits class?
>
>
> On Mar 21, 2004, at 6:01 AM, Terry Steichen wrote:
>
> > Does anyone know why the Hits class is final (thus preventing it from
> > being subclassed)?
> >
> > Regards,
> >
> > Terry
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: SpanXXQuery Usage

2004-03-22 Thread Terry Steichen
Otis,

Can you give me/us a rough idea of what these are supposed to do?  It's hard
to extrapolate the terse unit test code into much of a general notion.  I
searched the archives with little success.
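From the unit tests, the rough idea seems to be that span queries match terms
by position and can be nested; a sketch (signatures as in the CVS tree of the
time; field name illustrative):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanFirstQuery;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

class SpanSketch {
  static SpanQuery build() {
    // "missile" within 3 positions of "defense", in that order...
    SpanQuery near = new SpanNearQuery(new SpanQuery[] {
        new SpanTermQuery(new Term("contents", "missile")),
        new SpanTermQuery(new Term("contents", "defense"))
      }, 3, true);
    // ...and only if the match falls within the first 50 positions.
    return new SpanFirstQuery(near, 50);
  }
}

SpanOrQuery similarly takes a SpanQuery[] and matches any of them, and
SpanNotQuery(include, exclude) matches spans of the first that don't overlap
the second.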

Regards,

Terry

- Original Message -
From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, March 22, 2004 2:46 AM
Subject: Re: SpanXXQuery Usage


> Only in unit tests, so far.
>
> Otis
>
> --- Terry Steichen <[EMAIL PROTECTED]> wrote:
> > Is there any documentation (other than that in the source) on how to
> > use the new SpanxxQuery features?  Specifically: SpanNearQuery,
> > SpanNotQuery, SpanFirstQuery and SpanOrQuery?
> >
> > Regards,
> >
> > Terry
> >
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Final Hits

2004-03-21 Thread Terry Steichen
Does anyone know why the Hits class is final (thus preventing it from being 
subclassed)? 

Regards,

Terry


SpanXXQuery Usage

2004-03-19 Thread Terry Steichen
Is there any documentation (other than that in the source) on how to use the new 
SpanxxQuery features?  Specifically: SpanNearQuery, SpanNotQuery, SpanFirstQuery and 
SpanOrQuery?  

Regards,

Terry


Re: Sys properties Was: java.io.tmpdir as lock dir .... once again

2004-03-08 Thread Terry Steichen
I tend to agree (but with the same uncertainty as to why I feel that way).
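Either way, the choice can be made explicitly per writer rather than relying
on the default (method as in 1.4-era builds; path illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

class CompoundChoice {
  public static void main(String[] args) throws Exception {
    IndexWriter writer =
        new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
    writer.setUseCompoundFile(false); // classic multi-file format
    // writer.setUseCompoundFile(true); // compound format: fewer file handles
    writer.close();
  }
}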

Regards,

Terry
- Original Message - 
From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, March 08, 2004 2:34 PM
Subject: Re: Sys properties Was: java.io.tmpdir as lock dir  once again


> I can't explain why, but I feel like the old index format should stay
> by default.  I feel like I'd rather have a (slightly) faster index, and
> switch to the compound one when/IF I encounter problems, than have a
> safer, but slower index, and never realize that there is a faster
> option available.
> 
> Weak argument, I know, but some instinct in me thinks that the current
> mode should remain.
> 
> Otis
> 
> 
> --- Doug Cutting <[EMAIL PROTECTED]> wrote:
> > hui wrote:
> > > Index time: 
> > > compound format is 89 seconds slower.
> > > 
> > > compound format:
> > > 1389507 total milliseconds
> > > non-compound format:
> > > 1300534 total milliseconds
> > > 
> > > The index size is 85m with 4 fields only. The files are stored in
> > the index.
> > > The compound format has only 3 files and the other has 13 files. 
> > 
> > Thanks for performing this benchmark!
> > 
> > It looks like the compound format is around 7% slower when indexing. 
> > To 
> > my thinking that's acceptable, given the dramatic reduction in file 
> > handles.  If folks really need maximal indexing performance, then
> > they 
> > can explicitly disable the compound format.
> > 
> > Would anyone object to making compound format the default for Lucene 
> > 1.4?  This is an incompatible change, but I don't think it should
> > break 
> > applications.
> > 
> > Doug
> > 
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> > 
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: SubstringQuery -- Re: Leading Wild Card Search

2004-02-17 Thread Terry Steichen
Doug,

What you say makes a good deal of sense to me.  Could you give us a relative
sense of the "slowness" of different operators?

Regards

Terry

- Original Message -
From: "Doug Cutting" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, February 17, 2004 1:16 PM
Subject: Re: SubstringQuery -- Re: Leading Wild Card Search


> David Spencer wrote:
> > 2 files attached, SubstringQuery (which you'll use) and
> > SubstringTermEnum ( used by the former to be
> > consistent w/ other Query code).
> >
> > I find this kind of query useful to have and think that the query parser
> > should allow it in spite of the perception
> > of this being slow, however I think the debate is the "user centric
> > view" (say mine, allow substring queries)
> > vs the "protect the engines performance" view which says not to allow
> > expensive queries.
>
> I think the argument is more complex.
>
> One issue is cost of execution: very slow queries can be used to
> implement a denial-of-service attack.  Maybe that's an overstatement,
> but in a web server setting, once a few slow searches are running, no
> others may complete.  When folks hit "Stop" in their browser the server
> does not stop processing the query.  If they hit "Reload" then another
> new search is started.  So these can be very problematic.  This is real.
>   Lots of folks have deployed Lucene with large indexes and then found
> that their server randomly crashes.  Closer scrutiny shows that they
> were permitting operators that are too slow for their combination of
> index size and query traffic.  The BooleanQuery.TooManyClauses exception
> was added to address this, but it can still be too late, if the problem
> is caused before the query is built, e.g., while enumerating all terms.
>
> A related issue is that users (and even most developers) don't
> understand the relative costs of different query operators.  Some things
> are fast, others are surprisingly slow.  That's not a great user
> experience, and triggers problems like those described above.  People
> think that the rare slow cases are network problems or something, and
> hit "Reload".
>
> I have no problem with including slow operators with Lucene, but they
> should be well documented as such, at least for developers.  Perhaps we
> should make a pass through the existing Query classes, in particular
> those which expand into other queries, and add some performance notes,
> so that folks don't blindly start using things which may bite them.  By
> default I think it would be safest if the QueryParser only permitted
> operators which are efficient.  Folks can then, at their own risk,
> enable other operators.
>
> In summary, removing operators can be user-centric, if it removes
> unpredictability.  And the reason for protecting engine performance is
> not miserly, it's to guarantee availability.  And finally, an issue
> dear to me, a predictable search engine results in fewer spurious bug
> reports, saving developer time for real bugs.
>
> Doug
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Query Term Questions

2004-01-21 Thread Terry Steichen
Right you are - the leading zero was the key.  Thanks.

Regards,

Terry

PS: Is this in the docs?  If not, maybe it should be mentioned.

- Original Message - 
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, January 21, 2004 2:04 PM
Subject: Re: Query Term Questions


> On Jan 21, 2004, at 1:07 PM, Terry Steichen wrote:
> > Unfortunately, using positive boost factors less than 1 causes the 
> > parser to
> > barf the same as do negative boost factors.
> 
> Are you sure about that?  Works for me.  QueryParser just isn't set up 
> to deal with a minus sign, but "term^0.5" should work fine.  You'll 
> need the leading zero.
> 
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Query Term Questions

2004-01-21 Thread Terry Steichen
Morus,

Unfortunately, using positive boost factors less than 1 causes the parser to
barf the same as do negative boost factors.

Regards,

Terry

- Original Message -
From: "Morus Walter" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, January 21, 2004 10:54 AM
Subject: Re: Query Term Questions


> Erik Hatcher writes:
> > >
> > > TS==>I've not been able to get negative boosting to work at all.
Maybe
> > > there's a problem with my syntax.
> > > If, for example, I do a search with "green beret"^10, it works just
> > > fine.
> > > But "green beret"^-2 gives me a
> > > ParseException showing a lexical error.
> >
> > Have you tried it without using QueryParser and boosting a Query using
> > setBoost on it?  QueryParser is a double-edged sword and it looks like
> > it only allows numeric characters (plus "." followed by numeric
> > characters).  So QueryParser has the problem with negative boosts, but
> > not Query itself.
>
> He said he wants to have one term less important than others (at least
> that's what I understood).
> That's done by positive boost factors smaller than 1.0 (e.g. 0.5 or 0.1)
> and might be called 'negative boosting' (just as braking is a form of
> negative acceleration).
>
> If you use negative boost factors you would even decrease the score of
> a match (not only increase it less) and risk of ending with a negative
> score. I don't think that would be a good idea.
>
> Morus
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Query Term Questions

2004-01-21 Thread Terry Steichen
Erik,

Thanks for your response.  My specific comments (TS==>) are inserted below.
I should make clear that I'm using
fairly complex, embedded queries - not ones that the user is expected to
enter.

Regards,

Terry

- Original Message -
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, January 21, 2004 9:31 AM
Subject: Re: Query Term Questions


> On Jan 20, 2004, at 10:22 AM, Terry Steichen wrote:
> > 1) Is there a way to set the query boost factor depending not on the
> > presence of a term, but on the presence of two specific terms?  For
> > example, I may want to boost the relevance of a document that contains
> > both "iraq" and "clerics", but not boost the relevance of documents
> > that contain only one or the other terms. (The idea is better
> > discrimination than if I simply boosted both terms.)
>
> But doesn't the query itself take this into account?  If there are
> multiple matching terms then the overlap (coord) factor kicks in.

TS==>Except that I'd like to be able to choose to do this on a
query-by-query basis.  In other words,
it's desirable that some specific queries significantly increase their
discrimination based on this multiple matching,
relative to the normal extra boost given by the coord factor.  However, I
take it from your answer that
there's not a way to do this in the query itself (at least using the
unmodified, standard Lucene version).

>
> > 2) Is it possible to apply (or simulate) a negative query boost
> > factor?  For example, I may have a complex query with lots of terms
> > but want to reduce the relevance of a matching document that also
> > included the term "iowa". ( The idea is for an easier and more
> > discriminating way than simply increasing the relevance of all other
> > terms besides "iowa").
>
> Another reply mentioned negative boosting.  Is that not working as
> you'd like?

TS==>I've not been able to get negative boosting to work at all.  Maybe
there's a problem with my syntax.
If, for example, I do a search with "green beret"^10, it works just fine.
But "green beret"^-2 gives me a
ParseException showing a lexical error.

>
> > 3) Is there a way to handle variants of a phrase without OR'ing
> > together the variants?  For example, I may want to find documents
> > dealing with North Korea; the terms might be "north korea" or "north
> > korean" or "north koreans" - is there a way to handle this with a
> > single term using wildcards?
>
> Sounds like what you're really after is fancier analysis.  This is one
> of the purposes of analysis, to do stemming.

TS==>Well, I hope I'm not trying to be fancy.  It's just that listing all
the different variants, particularly when (as in my
case) I have to do this for multiple fields, gets tedious and error-prone.
The example above is simply one such case
for a particular query - other queries may have entirely different desired
combinations.  Constructing a single stemmer
to handle all such cases would be (for me, at least) very difficult.
Besides, I tend to stay away from stemming because
I believe it can introduce some rather unpredictable side-effects.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Query Term Questions

2004-01-21 Thread Terry Steichen
By the silence, I gather that the answers to my questions are "no", "no" and
"no".

Regards,

Terry

- Original Message -
From: "Terry Steichen" <[EMAIL PROTECTED]>
To: "Lucene Users Group" <[EMAIL PROTECTED]>
Sent: Tuesday, January 20, 2004 10:22 AM
Subject: Query Term Questions


1) Is there a way to set the query boost factor depending not on the
presence of a term, but on the presence of two specific terms?  For example,
I may want to boost the relevance of a document that contains both "iraq"
and "clerics", but not boost the relevance of documents that contain only
one or the other terms. (The idea is better discrimination than if I simply
boosted both terms.)

2) Is it possible to apply (or simulate) a negative query boost factor?  For
example, I may have a complex query with lots of terms but want to reduce
the relevance of a matching document that also included the term "iowa". (
The idea is for an easier and more discriminating way than simply increasing
the relevance of all other terms besides "iowa").

3) Is there a way to handle variants of a phrase without OR'ing together the
variants?  For example, I may want to find documents dealing with North
Korea; the terms might be "north korea" or "north korean" or "north
koreans" - is there a way to handle this with a single term using wildcards?

Regards,

Terry


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Query Term Questions

2004-01-20 Thread Terry Steichen
1) Is there a way to set the query boost factor depending not on the presence of a 
term, but on the presence of two specific terms?  For example, I may want to boost the 
relevance of a document that contains both "iraq" and "clerics", but not boost the 
relevance of documents that contain only one or the other terms. (The idea is better 
discrimination than if I simply boosted both terms.)

2) Is it possible to apply (or simulate) a negative query boost factor?  For example, 
I may have a complex query with lots of terms but want to reduce the relevance of a 
matching document that also included the term "iowa". ( The idea is for an easier and 
more discriminating way than simply increasing the relevance of all other terms 
besides "iowa").  

3) Is there a way to handle variants of a phrase without OR'ing together the variants? 
 For example, I may want to find documents dealing with North Korea; the terms might 
be "north korea" or "north korean" or "north koreans" - is there a way to handle this 
with a single term using wildcards?

Regards,

Terry

Re: setMaxClauseCount ??

2004-01-18 Thread Terry Steichen
Maybe you're using wildcards (which cause the query to get expanded).  Just
go in and set the variable to something very large (provided that doing so
doesn't give you an OutOfMemory error - which is why that limit was set).
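For the record, the limit is a static (global) setting on BooleanQuery, so it
should be raised once before any queries are built, not on a particular
instance; and the "More than 32 required/prohibited clauses" error appears to
come from a separate hard limit in the 1.3-era BooleanScorer that this setting
does not lift. A sketch:

import org.apache.lucene.search.BooleanQuery;

class RaiseClauseLimit {
  public static void main(String[] args) {
    // Static, even though it can be called through an instance.
    BooleanQuery.setMaxClauseCount(10000); // default is 1024
  }
}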

HTH,

Terry

- Original Message -
From: "Karl Koch" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Sunday, January 18, 2004 3:54 PM
Subject: setMaxClauseCount ??


> Hi group,
>
> I run over a IndexOutOfBoundsException:
>
> -> java.lang.IndexOutOfBoundsException: More than 32 required/prohibited
> clauses in query.
>
> The reason: I have more then 32 BooleanCauses. From the Mailinglist I got
> the info how to set the maxiumum number of clauses higher before a loop:
>
> ...
> myBooleanQuery.setMaxClauseCount(Integer.MAX_VALUE);
> while (true){
>   Token token = tokenStream.next();
>   if (token == null) {
> break;
>   }
>   myBooleanQuery.add(new TermQuery(new Term("bla", token.termText())),
true,
> false);
> } ...
>
> However the error still remains, why?
>
> Karl
>
> --
> +++ GMX - the first address for Mail, Message, More +++
> Until 31 Jan: TopMail + digital camera for only 29 EUR http://www.gmx.net/topmail
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Peculiar (?) Indexing Performance

2004-01-13 Thread Terry Steichen
I just aborted a re-indexing operation (because it was taking too much time - will run 
it overnight instead).  But I was surprised by what I found in the index directory, 
which contained a total of 1,402 index files!  It started out with 36 files with the 
name of "_I9a.*", followed by groups of 72 files with names like "_17si.*" and so 
forth.

Is this normal?

Also, I noticed that during the indexing it would chug along, indexing at a pretty 
decent rate, and then, every so often (I would estimate every several hundred added 
files) it would stop for perhaps 10 - 30 seconds (occasionally longer), doing a bunch 
of disk activity.  Then it would resume again - almost like it was optimizing.  (I'm 
doing this on a notebook, so the disk IO is probably fairly slow.)

Is this normal?

Regards,

Terry

PS: The code I'm using to do the indexing is below:

import npg1.search.WebExecAnalyzer;
import org.apache.lucene.index.IndexWriter;
import npg1.search.WESimilarity2;
import npg1.search.WPDocument2a;

import java.io.File;
import java.util.Date;

class IndexWPFiles2a {
  public static void main(String[] args) {

    // args[0] = location of target directory to be indexed
    // args[1] = location of index directory (in which to create index files)

    System.out.println("starting");
    try {
      Date start = new Date();

      String target = "c:/master_db/master_xml";
      if (args.length > 0 && args[0] != null) {
        target = args[0];
      }
      String index = "c:/master_db/master_index";
      if (args.length > 1 && args[1] != null) {
        index = args[1];
      }

      IndexWriter writer;
      if (args.length < 3) {
        writer = new IndexWriter(index, new WebExecAnalyzer(), true);
        writer.mergeFactor = 50;
        writer.setSimilarity(new WESimilarity2());
        indexDocs(writer, new File(target));
      } else {
        writer = new IndexWriter(index, new WebExecAnalyzer(), false);
        writer.setSimilarity(new WESimilarity2());
      }
      writer.optimize();
      writer.close();

      Date end = new Date();

      System.out.print(end.getTime() - start.getTime());
      System.out.println(" total milliseconds");

    } catch (Exception e) {
      System.out.println(" caught a " + e.getClass() +
          "\n with message: " + e.getMessage());
    }
  }

  public static void indexDocs(IndexWriter writer, File file)
      throws Exception {
    if (file.isDirectory()) {
      String[] files = file.list();
      for (int i = 0; i < files.length; i++) {
        indexDocs(writer, new File(file, files[i]));
      }
    } else {
      try {
        System.out.println("adding " + file);
        writer.addDocument(WPDocument2a.Document(file));
      } catch (Exception e) {
        System.out.println("error adding " + file +
            " -  Exception: " + e.getMessage());
      }
    }
  }
}


Re: Performance question

2004-01-07 Thread Terry Steichen
Scott,

FYI - I happen to be using dom4j.  I might add, however, that I parse each
document into about 18 fields.  If, as Dror says, your document is simpler,
(or if you increase the merge factor) you can probably better that.
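Incidentally, the external-DTD fetching Scott describes below can be switched
off on a Xerces SAX reader; a sketch (the feature URI is Xerces-specific):

import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

class NoExternalDtd {
  public static void main(String[] args) throws Exception {
    XMLReader reader = XMLReaderFactory.createXMLReader();
    // Skip fetching external DTDs during non-validating parses.
    reader.setFeature(
        "http://apache.org/xml/features/nonvalidating/load-external-dtd",
        false);
  }
}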

Regards,

Terry

- Original Message -
From: "Dror Matalon" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, January 07, 2004 10:48 PM
Subject: Re: Performance question


> On Wed, Jan 07, 2004 at 07:24:22PM -0700, Scott Smith wrote:
> > After two rather frustrating days, I find I need to apologize to Lucene.
My
> > last run of 225 messages averaged around 25 milliseconds per
message--that's
> > parsing the xml, creating the Document, and putting it in the index
(2.5Ghz
> > cpu, 1G ram).  Turns out the performance problem was xerces sax "helping
me"
> > by loading the DTD before it parsed each message and the DTD wasn't
local to
> > our site.  After seeing Terry's response, I knew there had to be more
going
> > on than what I was assuming.
> >
> > Thanks for the suggestions.  I wonder how much faster I can go if I
> > implement some of those?
>
> 25 msecs to insert a document is on the high side, but it depends of
> course on the size of your document. You're probably spending 90% of
> your time in the XML parsing. I believe that there are other parsers
> that are faster than xerces, you might want to look at these. You might
> want to look at http://dom4j.org/.
>
> Dror
>
>
> >
> > Regards
> >
> > Scott
> >
> > -Original Message-
> > From: Terry Steichen [mailto:[EMAIL PROTECTED]
> > Sent: Tuesday, January 06, 2004 5:48 AM
> > To: Lucene Users List
> > Subject: Re: Performance question
> >
> >
> > Scott,
> >
> > Here are some figures to use for comparison.  Using the latest Lucene
> > release, I index about 200 similar-sized XML files at a time, on a
Windows
> > XP machine (2Ghz).  First I create a new index, which adds the documents
at
> > a rate of about 8 per second (I don't recall what the cpu % is during
this).
> > Then I merge this new index with the master one (using, I think, the
default
> > merge factor), which takes about 4.5 minutes (during which time the cpu
> > utilization stays near 100%).  The master index currently holds about
> > 115,000 such documents.
> >
> > HTH,
> >
> > Regards,
> >
> > Terry
> >
> > - Original Message -
> > From: "Scott Smith" <[EMAIL PROTECTED]>
> > To: <[EMAIL PROTECTED]>
> > Sent: Monday, January 05, 2004 10:26 PM
> > Subject: Performance question
> >
> >
> > > I have an application that is reading in XML files and indexing them.
> > Each
> > > XML file is 3K-6K bytes.  This application preloads a database that I
> > > will add to "on the fly" later.  However, all I want it to do
> > > initially is take some existing files and create the initial index as
> > > quick as I can.
> > >
> > > Since I want to index "on the fly" later, I set the merge factor to
> > > 10.
> > I'm
> > > assuming that I can't create the index initially with one merge factor
> > > (e.g., 100) and then change the merge factor later (true?).
> > >
> > > What I see is that it takes 1-3 seconds per xml file to do the index.
> > This
> > > means I'm indexing around 150k bytes per minute.  I also notice that
> > > the
> > CPU
> > > utilization rarely exceeds 5% (looking at task manager on a Windows
> > > box).
> > I
> > > use Xerces to read in the files (SAX interface) and I don't close or
> > > optimize the index between stories nor do I sleep anyplace.  I've
> > > looked
> > at
> > > the page fault numbers and they aren't changing much.  I guess I would
> > have
> > > expected that I would have pretty much pegged the CPU and seen much
> > > faster indexing.
> > >
> > > Any ideas/suggestions?
> > >
> > > -
> > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > For additional commands, e-mail: [EMAIL PROTECTED]
> > >
> > >
> >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
>
> --
> Dror Matalon
> Zapatec Inc
> 1700 MLK Way
> Berkeley, CA 94709
> http://www.fastbuzz.com
> http://www.zapatec.com
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Performance question

2004-01-06 Thread Terry Steichen
Scott,

Here are some figures to use for comparison.  Using the latest Lucene
release, I index about 200 similar-sized XML files at a time, on a Windows
XP machine (2Ghz).  First I create a new index, which adds the documents at
a rate of about 8 per second (I don't recall what the cpu % is during this).
Then I merge this new index with the master one (using, I think, the default
merge factor), which takes about 4.5 minutes (during which time the cpu
utilization stays near 100%).  The master index currently holds about
115,000 such documents.
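On the merge-factor question from the original post: mergeFactor is a property
of each IndexWriter instance, not of the index on disk, so as far as I can
tell you can bulk-load with one value and reopen later with another. A sketch
(paths illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

class TwoPhaseIndexing {
  public static void main(String[] args) throws Exception {
    // Phase 1: bulk load with a high merge factor (faster, more open files).
    IndexWriter bulk =
        new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
    bulk.mergeFactor = 100;
    // ... addDocument() calls ...
    bulk.optimize();
    bulk.close();

    // Phase 2: reopen later with the default merge factor (10) for
    // incremental "on the fly" additions.
    IndexWriter incremental =
        new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
    // ... addDocument() calls ...
    incremental.close();
  }
}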

HTH,

Regards,

Terry

- Original Message -
From: "Scott Smith" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Monday, January 05, 2004 10:26 PM
Subject: Performance question


> I have an application that is reading in XML files and indexing them.
Each
> XML file is 3K-6K bytes.  This application preloads a database that I will
> add to "on the fly" later.  However, all I want it to do initially is take
> some existing files and create the initial index as quick as I can.
>
> Since I want to index "on the fly" later, I set the merge factor to 10.
I'm
> assuming that I can't create the index initially with one merge factor
> (e.g., 100) and then change the merge factor later (true?).
>
> What I see is that it takes 1-3 seconds per xml file to do the index.
This
> means I'm indexing around 150k bytes per minute.  I also notice that the
CPU
> utilization rarely exceeds 5% (looking at task manager on a Windows box).
I
> use Xerces to read in the files (SAX interface) and I don't close or
> optimize the index between stories nor do I sleep anyplace.  I've looked
at
> the page fault numbers and they aren't changing much.  I guess I would
have
> expected that I would have pretty much pegged the CPU and seen much faster
> indexing.
>
> Any ideas/suggestions?
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Error deleting a document when using the compound file index

2003-12-30 Thread Terry Steichen
Paul,

I just started using 1.3 final (labeled 1.4 RC1) and ran into a similar
problem (though I'm not using the compound file option).  My code ran just
fine all the way through 1.3RC3, but with the latest release, the
reader.delete() threw a "Lock obtain timed out" IOException.  What I finally
did was locate the place earlier in my code where the FSDirectory was first
created, and immediately followed it with a call to
IndexReader.unlock(fsdir).  So far, that appears to have done the trick (but
I haven't had much time to fully test it yet).

Regards,

Terry

- Original Message -
From: "Paul Williams" <[EMAIL PROTECTED]>
To: "'Lucene Users List'" <[EMAIL PROTECTED]>
Sent: Tuesday, December 30, 2003 12:07 PM
Subject: Error deleting a document when using the compound file index


> Hi,
>
> I am just testing Lucene 1.3 RC with the Compound index option on. When I
> come to delete an existing document in the index to re-update a document,
I
> get an unable to obtain lock error.
>
> java.io.IOException: Lock obtain timed out
> at org.apache.lucene.store.Lock.obtain(Lock.java:97)
> at org.apache.lucene.index.IndexReader.delete(IndexReader.java:279)
> at org.apache.lucene.index.IndexReader.delete(IndexReader.java:312)
> at objects.CH_Lucene.updateDocument(CH_Lucene.java:1065)
> at
> servlets.CH_SearchConnection.updateDocument(CH_SearchConnection.java:1048)
> at
> functions.CH_UpdateThread.updateDocument(CH_UpdateThread.java:255)
> at functions.CH_UpdateThread.run(CH_UpdateThread.java:132)
>
> I am using the following snippet of code to remove the previous document
> entry.
>
> DecimalFormat df = new DecimalFormat("");
> df.setMinimumIntegerDigits(8);
>
> Term docNumberTerm = new Term("Field10", df.format(ldoc));
>
> IndexReader reader = IndexReader.open(location);
>
> // Delete old term if present
> if (reader.docFreq(docNumberTerm) > 0)
> {
> reader.delete(docNumberTerm);
> }
> reader.close();
>
> I did not get this when I was using the normal index file structure. Is
> there a way round this?
>
> Regards,
>
> Paul
>
>
> DISCLAIMER: The information in this message is confidential and may be
> legally privileged. It is intended solely for the addressee.  Access to
this
> message by anyone else is unauthorised.  If you are not the intended
> recipient, any disclosure, copying, or distribution of the message, or any
> action or omission taken by you in reliance on it, is prohibited and may
be
> unlawful.  Please immediately contact the sender if you have received this
> message in error. Thank you. Copyright Valid Information Systems Limited.
> http://www.valinf.com Address Morline House, 160 London Road, Barking,
> Essex, IG11 8BB. Tel: 020 8215 1414 Fax: 020 8215 2040.
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Term out of order.

2003-10-29 Thread Terry Steichen
What kind of response is this? (e.g. "apparently so.")   Is this a problem
or not?

Regards,

Terry Steichen

- Original Message -
From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, October 29, 2003 7:09 PM
Subject: Re: Term out of order.


> Apparently so :(
> http://www.google.com/search?q=lucene+%22term+out+of+order%22
>
> Otis
>
> --- Victor Hadianto <[EMAIL PROTECTED]> wrote:
> > Hi all,
> >
> > I'm using Lucene.Net but seems appropriate to post here as well. I
> > have been
> > getting this exception "Term out of order" every now and then while
> > doing a
> > bulk indexing.
> >
> > I have been searching on the mailing list for this specific issue,
> > but none
> > to avail. Does this occur on the Java version of Lucene?
> >
> > thanks
> >
> > /victor
> >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
>
>
> __
> Do you Yahoo!?
> Exclusive Video Premiere - Britney Spears
> http://launch.yahoo.com/promos/britneyspears/
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Confusion over wildcard search logic

2003-09-23 Thread Terry Steichen
Erik's analysis is comprehensive and useful.  I think this example reflects
a common (and understandable) oversight - that wildcards do *not* work with
a phrase.  Got caught on that many times myself.  Also there may be
confusion about the format -> field:(term1 term2), in that the examples
provided don't seem to make use of parentheses.  Finally, as I recall, there
were some bugs with certain wildcard patterns in 1.2.
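For the wildcard-in-a-phrase case Erik describes below, his pointer to
PhrasePrefixQuery can be fleshed out roughly like this (a sketch adapted from
the idea, not from the thread; enumeration details simplified):

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.search.PhrasePrefixQuery;

class AmbPropQuery {
  static PhrasePrefixQuery build(IndexReader reader) throws Exception {
    PhrasePrefixQuery query = new PhrasePrefixQuery();
    query.add(new Term("name", "amb"));  // the fixed first word

    // Expand "prop*" by hand into every matching term.
    List expansions = new ArrayList();
    TermEnum terms = reader.terms(new Term("name", "prop"));
    do {
      Term t = terms.term();
      if (t == null || !t.field().equals("name")
          || !t.text().startsWith("prop")) {
        break;
      }
      expansions.add(t);
    } while (terms.next());
    terms.close();

    query.add((Term[]) expansions.toArray(new Term[expansions.size()]));
    return query;
  }
}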

Regards,

Terry

- Original Message -
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, September 22, 2003 10:33 PM
Subject: Re: Confusion over wildcard search logic


> Ah, this is a fun one... lots of fiddly issues with how queries work
> and how QueryParser works.  I'll take a stab at some of these inline
> below
>
> On Monday, September 22, 2003, at 08:26  PM, Dan Quaroni wrote:
> > I have a simple command line interface for testing.
>
> Interesting interface.  Looks like something that if made generic
> enough would be handy to have at least in the sandbox.
>
> >   I'm getting some odd
> > results, though, with certain logic of wildcard searches.
>
> not all your queries are truly "WildcardQuery"'s though.  look at the
> class it constructed to get a better idea of what is happening.
>
> >   It seems like
> > depending on what order I put the fields of the query in alters the
> > results
> > drastically when I AND them together.
>
> Not quite the right explanation of what is happening.  More below
>
> > ***
> > This one makes sense
> >
> > Query> name:amb*
> > State> california
> > name:amb*
> > [EMAIL PROTECTED]
> > amb*
> > 2819 total matching documents
>
> Right QueryParser does a little optimization here and anything with
> a simple trailing * turns into a PrefixQuery, meaning all name fields
> that begin with "amb".
>
> > ***
> > This is the REALLY confusing one.  We know there's a company named AMB
> > Property Corporation.  Why do I get NO hits?
> >
> > Query> name:"amb prop*"
> > State> california
> > name:"amb prop*"
> > [EMAIL PROTECTED]
> > "amb prop"
> > 0 total matching documents
>
> Notice you're now in PhraseQuery land.  Wildcards don't work like you
> seem to expect here.  What is really happening here is a query for
> documents that have "amb" and "prop" terms side by side in that order.
> The asterisk got axed by the analyzer.  If you said "name:amb
> name:prop*" you'd get some hits I believe, as it would turn into a
> boolean query with a term and wildcard queries either OR'd or AND'd
> together.  PhraseQuery does not support wildcards.  A custom subclass
> of QueryParser could do some interesting things here and expand
> wildcard-like terms like this in a phrase into PhrasePrefixQuery, but
> that is probably overkill here (although maybe not).  Look at the test
> case for PhrasePrefixQuery for some hints.
>
> > Ok, so I get some results with this (I know the * isn't neccessary at
> > the
> > end of property, but bear with me for the next example where it goes
> > all
> > screwy)
> >
> > Query> name:amb property*
> > State> california
> > name:amb property*
> > [EMAIL PROTECTED]
> > amb name:amb property*:property*
> > 56 total matching documents
>
> your default field for QueryParser is "property*"?  Odd field name, or
> is the output fishy?  I'm a bit confused by the "property*:" there.
> I'm assuming you're outputting the Query.toString here.
>
> See above for a different way to phrase the query.
>
> > ***
> > south san francisco is an exact match to the city.  Why does this find
> > 0
> > results??!
> >
> > Query> name:amb property* AND city:south san francisco
> > State> california
> > name:amb property* AND city:south san francisco
> > [EMAIL PROTECTED]
> > amb +name:amb property* AND city:south san francisco:property*
> > +city:south
> > name:
> > amb property* AND city:south san francisco:san name:amb property* AND
> > city:south
> >  san francisco:francisco
> > 0 total matching documents
>
> with all the AND's going on, this makes sense because "san" and
> "francisco" end up as separate term queries.  you'd have to say
> city:"south san francisco" to turn it into a PhraseQuery.
>
> > 
> > Do this and suddenly I get matches
> >
> > Query> name:amb propert* and city:"south san fran*"
> > State> california
> > name:amb propert* and city:"south san fran*"
> > [EMAIL PROTECTED]
> > amb name:amb propert* and city:"south san fran*":propert* city:"south
> > san
> > fran"56 total matching documents
>
> you're getting hits on the wildcard match at least, and probably on
> name field "amb" as well.  again, phrase queries don't support
> wildcards like you've done here with "south san fran*" so you're not
> matching anything with that.
>
> > *
> > And look, this gets matches too:
> >
> > Query> name:"amb propert*" and city:"south san*"
> > State> california
> > name:"amb prope

Re: Is it possible in lucene for numeric search

2003-09-22 Thread Terry Steichen
You can also use a RangeQuery.  If you index the field of numeric data, say
'score', as a string, then you can do things like: score:[75 TO 80].  The only
extra work is that you need to pad the actual score with enough 0's (such
that 9 becomes 09, etc.) to cover the expected range.
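A sketch of the padding at index time (three digits assumed as the expected
range; DecimalFormat is one convenient way to do it):

import java.text.DecimalFormat;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

class PaddedScoreDoc {
  static Document create(int score) {
    DecimalFormat df = new DecimalFormat("000"); // fixed width
    Document doc = new Document();
    // 9 -> "009", 75 -> "075", so string order matches numeric order.
    doc.add(Field.Keyword("score", df.format(score)));
    return doc;
  }
}

The query side pads the same way, e.g. score:[009 TO 080].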

Regards,

Terry

- Original Message -
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, September 22, 2003 9:08 AM
Subject: Re: Is it possible in lucene for numeric search


> Yes, you can do numeric searches as long as you realize its really just
> text that is indexed.  You will need to ensure the Analyzer you use
> indexes numbers appropriately as well.
>
> Erik
>
>
> On Monday, September 22, 2003, at 02:06  AM, Senthil Kumar K wrote:
>
> > Hi,
> >
> >   I found that lucene is a full-featured text search engine. Is it
> > possible to make it for numeric search. In my appl. I need to look
> > in to the index for score=75.
> >
> > regards,
> > k.senthilkumar
> >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene Scoring Behavior

2003-09-21 Thread Terry Steichen
Doug,

Well, my good intentions (to reindex on Thursday night) were interrupted by
Hurricane Isabel (followed by a 44-hour power outage).

Well, excuses aside, I did get the reindex done today and the scores for all
hits from a single date query come out to be the same score (as they
should).  Don't have any idea what screwed up the previous index - though as
promised, I'll keep an eye on it as I continue to merge new stuff over the
next few days/weeks.

Is there a way, using standard Lucene configuration parameters and/or APIs,
to force the hit scores to come out so the highest one is set to 1, and the
others are proportionately lower?
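
(If nothing standard exists, the brute-force client-side version would
presumably be something like the following - just divide every score by the
top score:)

Hits hits = searcher.search(query);
float top = (hits.length() > 0) ? hits.score(0) : 1.0f;
for (int i = 0; i < hits.length(); i++) {
  // the top hit becomes 1.0 and the rest come out proportionately lower
  float normalized = hits.score(i) / top;
  System.out.println(normalized + "  " + hits.doc(i).get("headline"));
}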

Regards,

Terry

- Original Message -----
From: "Terry Steichen" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Thursday, September 18, 2003 10:10 AM
Subject: Re: Lucene Scoring Behavior


> Doug,
>
> I just extracted a portion of the database, reindexed and found the scores
> to come out much more like we'd expect.  Appears this may be an indexing
> issue - I index new stuff each day and merge the new index with the master
> index.  Only redo the master when I can't avoid it (because it takes so
> long).  I probably merge 100 times or more before reindexing.  This evening
> I'll reindex and let you know if the apparent problem clears up.  If so,
> I'll keep track of it as I continue to merge and see if there's any issue
> there.
>
> Thanks for the input (and from Erik, pointing me to the Explanation - it's
> pretty neat).
>
> Question: The new scores for the test database portion mentioned above all
> seem to come out in the range of .06 to .07.  I assume this is because they
> never get normalized.  If this is the case, (a) would it hurt anything to
> "normalize up" (so the scores range up to 1), and if so (b) is there an
> easy, non-disruptive (to the source code) way to do this?
>
> Regards,
>
> Terry
>
>
> - Original Message -
> From: "Doug Cutting" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Wednesday, September 17, 2003 11:15 PM
> Subject: Re: Lucene Scoring Behavior
>
>
> > Hmm.  This makes no sense to me.  Can you supply a reproducible
> > standalone test case?
> >
> > Doug
> >
> > Terry Steichen wrote:
> > > Doug,
> > >
> > > (1) No, I did *not* boost the pub_date field, either in the indexing
> process
> > > or in the query itself.
> > >
> > > (2) And, each pub_date field of each document (which is in XML format)
> > > contains only one instance of the date string.
> > >
> > > (3) And only the pub_date field itself is indexed.  There are other
> > > attributes of this field that may contain the date string, but they
> aren't
> > > indexed - that is, they are not included in the instantiated Document
> class.
> > >
> > > Regards,
> > >
> > > Terry
> > >
> > > - Original Message -
> > > From: "Doug Cutting" <[EMAIL PROTECTED]>
> > > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > > Sent: Wednesday, September 17, 2003 5:51 PM
> > > Subject: Re: Lucene Scoring Behavior
> > >
> > >
> > >
> > >>Terry Steichen wrote:
> > >>
> > >>>  0.03125 = fieldNorm(field=pub_date, doc=90992)
> > >>>  1.0 = fieldNorm(field=pub_date, doc=90970)
> > >>
> > >>It looks like the fieldNorm's are what differ, not the IDFs.  These are
> > >>the product of the document and/or field boost, and 1/sqrt(numTerms)
> > >>where numTerms is the number of terms in the "pub_date" field of the
> > >>document.  Thus if each document is only assigned one date, and you
> > >>didn't boost the field or the document when you indexed it, this should
> > >>be 1.0.  But if the document has two dates, then this would be
> > >>1/sqrt(2).  Or if you boosted this document pub_date field, then this
> > >>will have whatever boost you provided.
> > >>
> > >>So, did you boost anything when indexing?  Or could a single document
> > >>have two or more different values for pub_date?  Either would explain
> > >
> > > this.
> > >
> > >>Doug
> > >>
> > >>
> > >>-
> > >>To unsubscribe, e-mail: [EMAIL PROTECTED]
> > >>For additional commands, e-mail: [EMAIL PROTECTED]
> > >>
> > >
> > >
> > >
> > > -
> > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > For additional commands, e-mail: [EMAIL PROTECTED]
> > >
> >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Broken by Lock

2003-09-21 Thread Terry Steichen
About a month ago, timeouts were added to Lock (and they seem to make a lot of good 
sense).  However, because of this enhancement, in using the latest CVS my application 
broke - I keep getting the message "Lock obtain timed out".  I looked through the 
source in an attempt to figure out a quick way to set something to provide backward 
compatibility, but it didn't leap out at me.  

Does anyone know of a simple way to do this?
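
(One thing I may try, assuming the new timeout constants ended up as public
statics on IndexWriter - I believe they did, but I haven't verified that
against today's CVS - is simply raising them:)

IndexWriter.WRITE_LOCK_TIMEOUT = 60000;   // milliseconds
IndexWriter.COMMIT_LOCK_TIMEOUT = 60000;  // milliseconds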

Regards,

Terry



Re: Lucene Scoring Behavior

2003-09-18 Thread Terry Steichen
Doug,

I just extracted a portion of the database, reindexed and found the scores
to come out much more like we'd expect.  Appears this may be an indexing
issue - I index new stuff each day and merge the new index with the master
index.  Only redo the master when I can't avoid it (because it takes so
long).  I probably merge 100 times or more before reindexing.  This evening
I'll reindex and let you know if the apparent problem clears up.  If so,
I'll keep track of it as I continue to merge and see if there's any issue
there.
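
(The merge itself is just IndexWriter.addIndexes - a sketch, with the paths
and analyzer purely illustrative:)

IndexWriter master = new IndexWriter("master_index", analyzer, false); // false = append
try {
  master.addIndexes(new Directory[] {
    FSDirectory.getDirectory("daily_index", false)
  });
} finally {
  master.close();
}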

Thanks for the input (and from Erik, pointing me to the Explanation - it's
pretty neat).

Question: The new scores for the test database portion mentioned above all
seem to come out in the range of .06 to .07.  I assume this is because they
never get normalized.  If this is the case, (a) would it hurt anything to
"normalize up" (so the scores range up to 1), and if so (b) is there an
easy, non-disruptive (to the source code) way to do this?

Regards,

Terry


- Original Message -
From: "Doug Cutting" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, September 17, 2003 11:15 PM
Subject: Re: Lucene Scoring Behavior


> Hmm.  This makes no sense to me.  Can you supply a reproducible
> standalone test case?
>
> Doug
>
> Terry Steichen wrote:
> > Doug,
> >
> > (1) No, I did *not* boost the pub_date field, either in the indexing
> > process
> > or in the query itself.
> >
> > (2) And, each pub_date field of each document (which is in XML format)
> > contains only one instance of the date string.
> >
> > (3) And only the pub_date field itself is indexed.  There are other
> > attributes of this field that may contain the date string, but they aren't
> > indexed - that is, they are not included in the instantiated Document
> > class.
> >
> > Regards,
> >
> > Terry
> >
> > - Original Message -
> > From: "Doug Cutting" <[EMAIL PROTECTED]>
> > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > Sent: Wednesday, September 17, 2003 5:51 PM
> > Subject: Re: Lucene Scoring Behavior
> >
> >
> >
> >>Terry Steichen wrote:
> >>
> >>>  0.03125 = fieldNorm(field=pub_date, doc=90992)
> >>>  1.0 = fieldNorm(field=pub_date, doc=90970)
> >>
> >>It looks like the fieldNorm's are what differ, not the IDFs.  These are
> >>the product of the document and/or field boost, and 1/sqrt(numTerms)
> >>where numTerms is the number of terms in the "pub_date" field of the
> >>document.  Thus if each document is only assigned one date, and you
> >>didn't boost the field or the document when you indexed it, this should
> >>be 1.0.  But if the document has two dates, then this would be
> >>1/sqrt(2).  Or if you boosted this document pub_date field, then this
> >>will have whatever boost you provided.
> >>
> >>So, did you boost anything when indexing?  Or could a single document
> >>have two or more different values for pub_date?  Either would explain
> >>this.
> >
> >>Doug
> >>
> >>
> >>-
> >>To unsubscribe, e-mail: [EMAIL PROTECTED]
> >>For additional commands, e-mail: [EMAIL PROTECTED]
> >>
> >
> >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene Scoring Behavior

2003-09-17 Thread Terry Steichen
Doug,

(1) No, I did *not* boost the pub_date field, either in the indexing process
or in the query itself.

(2) And, each pub_date field of each document (which is in XML format)
contains only one instance of the date string.

(3) And only the pub_date field itself is indexed.  There are other
attributes of this field that may contain the date string, but they aren't
indexed - that is, they are not included in the instantiated Document class.

Regards,

Terry

- Original Message -
From: "Doug Cutting" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, September 17, 2003 5:51 PM
Subject: Re: Lucene Scoring Behavior


> Terry Steichen wrote:
> >   0.03125 = fieldNorm(field=pub_date, doc=90992)
> >   1.0 = fieldNorm(field=pub_date, doc=90970)
>
> It looks like the fieldNorm's are what differ, not the IDFs.  These are
> the product of the document and/or field boost, and 1/sqrt(numTerms)
> where numTerms is the number of terms in the "pub_date" field of the
> document.  Thus if each document is only assigned one date, and you
> didn't boost the field or the document when you indexed it, this should
> be 1.0.  But if the document has two dates, then this would be
> 1/sqrt(2).  Or if you boosted this document pub_date field, then this
> will have whatever boost you provided.
>
> So, did you boost anything when indexing?  Or could a single document
> have two or more different values for pub_date?  Either would explain
> this.
>
> Doug
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene Scoring Behavior

2003-09-17 Thread Terry Steichen
Doug/Erik,

I do use RangeQuery to get a range of dates, but in this case I'm just
getting a single date (string), so I believe it's just a regular query I'm
using.

Per Erik's suggestion, I checked out the Explanation for some of these
anomalies.  I've included a condensation of the data it generated below
(which I frankly don't understand).  Perhaps that will give you or
Erik some insight into what's happening?

Regards,

Terry

PS: I note that the 'docFreq' parameters displayed below correspond exactly
to the number of hits for the query.  Also, here's the Similarity class I'm
using (per an earlier suggestion of Doug):

public class WESimilarity2 extends org.apache.lucene.search.DefaultSimilarity {

    // Score every field as if it held at least 750 terms; on top of that,
    // give the short display fields a 4x length-norm boost.
    public float lengthNorm(String fieldName, int numTerms) {
        if (fieldName.equals("headline") || fieldName.equals("summary")
                || fieldName.equals("ssummary")) {
            return 4.0f * super.lengthNorm(fieldName, Math.max(numTerms, 750));
        } else {
            return super.lengthNorm(fieldName, Math.max(numTerms, 750));
        }
    }
}




Query #1: pub_date:20030917
All items: Score: .23000652
0.23000652 = weight(pub_date:20030917 in 91197), product of:
  0.9994 = queryWeight(pub_date:20030917), product of:
7.360209 = idf(docFreq=157)
0.1358657 = queryNorm
  0.23000653 = fieldWeight(pub_date:20030917 in 91197), product of:
1.0 = tf(termFreq(pub_date:20030917)=1)
7.360209 = idf(docFreq=157)
0.03125 = fieldNorm(field=pub_date, doc=91197)

Query #2: pub_date:20030916
All items: Score: .22295427
0.22295427 = fieldWeight(pub_date:20030916 in 90992), product of:
  1.0 = tf(termFreq(pub_date:20030916)=1)
  7.1345367 = idf(docFreq=197)
  0.03125 = fieldNorm(field=pub_date, doc=90992)


Query #3: pub_date:20030915
Items 1&2: Score: 1.0
7.2580175 = weight(pub_date:20030915 in 90970), product of:
  0.9994 = queryWeight(pub_date:20030915), product of:
7.258018 = idf(docFreq=174)
0.13777865 = queryNorm
  7.258018 = fieldWeight(pub_date:20030915 in 90970), product of:
1.0 = tf(termFreq(pub_date:20030915)=1)
7.258018 = idf(docFreq=174)
1.0 = fieldNorm(field=pub_date, doc=90970)

Query #3 (same as above): pub_date:20030915
Other items: Score: .03125
0.22681305 = weight(pub_date:20030915 in 90826), product of:
  0.9994 = queryWeight(pub_date:20030915), product of:
7.258018 = idf(docFreq=174)
0.13777865 = queryNorm
  0.22681306 = fieldWeight(pub_date:20030915 in 90826), product of:
1.0 = tf(termFreq(pub_date:20030915)=1)
7.258018 = idf(docFreq=174)
0.03125 = fieldNorm(field=pub_date, doc=90826)

Query #4: pub_date:20030914
0.21384604 = weight(pub_date:20030914 in 90417), product of:
  0.9994 = queryWeight(pub_date:20030914), product of:
6.843074 = idf(docFreq=264)
0.14613315 = queryNorm
  0.21384606 = fieldWeight(pub_date:20030914 in 90417), product of:
1.0 = tf(termFreq(pub_date:20030914)=1)
6.843074 = idf(docFreq=264)
0.03125 = fieldNorm(field=pub_date, doc=90417)

Query #5: pub_date:20030913
Items 1&2: Score: 1.0
7.366558 = fieldWeight(pub_date:20030913 in 90591), product of:
  1.0 = tf(termFreq(pub_date:20030913)=1)
  7.366558 = idf(docFreq=156)
  1.0 = fieldNorm(field=pub_date, doc=90591)

Query #5 (same as above): pub_date:20030913
Other items: Score: .03125
0.23020494 = fieldWeight(pub_date:20030913 in 90383), product of:
  1.0 = tf(termFreq(pub_date:20030913)=1)
  7.366558 = idf(docFreq=156)
  0.03125 = fieldNorm(field=pub_date, doc=90383)


- Original Message -
From: "Doug Cutting" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, September 17, 2003 4:55 PM
Subject: Re: Lucene Scoring Behavior


> If you're using RangeQuery to do date searching, then you'll likely see
> unusual scoring.  The IDF of a date, like any other term, is inversely
> related to the number of documents with that date.  So documents whose
> dates are rare will score higher, which is probably not what you intend.
>
> Using a Filter for date searching is one way to remove dates from the
> scoring calculation.  Another is to provide a Similarity implementation
> that gives an IDF of 1.0 for terms from your date field, e.g., something
> like:
>
> public class MySimilarity extends DefaultSimilarity {
>public float idf(Term term, Searcher searcher) throws IOException {
>  if (term.field() == "date") {
>return 1.0f;
>  } else {
>return super.idf(term, searcher);
>  }
>}
> }
>
> Or you could just give date clauses of your query a very small boost
> (e.g., .0001) so that other clauses dominate the scoring.
>
> Doug
>
> Terry Steichen wrote:
> > I've run across some puzzling behavior regarding scoring.  I have a set
> > of documents which contain, among others, a date field

Lucene Scoring Behavior

2003-09-17 Thread Terry Steichen
I've run across some puzzling behavior regarding scoring.  I have a set of documents 
which contain, among others, a date field (whose content is a string in the YYYYMMDD 
format).  When I query on the date 20030917 (that is, today), I get 157 hits, all of 
which have a score of .23000652.  If I use 20030916 (yesterday), I get 197 hits, each 
of which has a score of .22295427.

So far, all seems logical.  However, when I search for all records for the date 
20030915, the first two (of 174 hits) have a score of 1.0, while all the rest of the 
hits have a score of .03125.  Here is a tabulation of these and a few more queries:

Query Date  Result
===
20030917all have a score of .23000652 (157)
20030916all have a score of .22295427 (197)
20030915first 2 have a 1.0 score, all rest are .03125 (174)
20030914all have a score of .21384604 (264)
20030913first 2 have a 1.0 score, all rest are .03125 (156)
20030912all have a score .2166833 (241)
20030911first 3 have a 1.0 score, all rest are .03125 (244)
20030910all have a score of  .2208193 (211)

I would expect that all the hits would have the same score, and I would expect it to 
be normalized to 1 (unless, I guess, the top score was less than 1, in which case 
normalization presumably doesn't occur).  

Does anyone have any ideas as to what might be going on here?  (I'm using the latest 
CVS sources, obtained this afternoon.)

Regards,

Terry


Negative boosting?

2003-09-11 Thread Terry Steichen
I've often found the use of query-based boosting to be very beneficial.  This is 
particularly so when it's easy to identify the term that I want to stand out as a 
primary selector.  

However, I've come across quite a few other cases where it would be easier (and more 
logical) to apply a negative boost - to de-emphasize the match when the term is 
present.  

Is it possible to apply a negative boost (it doesn't seem to work), and if not, would 
it break anything significant if that were added?
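
(One partial workaround - not a true negative boost - would be a fractional
positive boost on the term to be de-emphasized, either as term^0.1 in the
query string or programmatically; the term and field here are made up:)

TermQuery reprint = new TermQuery(new Term("body", "reprint"));
reprint.setBoost(0.1f);  // matches still count, but contribute much less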

Regards,

Terry


Lucene documentation

2003-08-30 Thread Terry Steichen
Tatu,

My comments were not intended to be directed specifically toward you.  And I
don't think you worded your reply badly.


Rather, it is my general observation, after participating in this list for a
year and a half, that there seems to be a consistently big gap between
Lucene newbies and "old hands".  Bridging that gap isn't simply a matter
of RTFM, because the available "manual" just scratches the surface of what
can be done.  It is my experience that you have to become immersed in the
details quite a bit before important aspects become evident.  Some will say
that's OK, because everyone is expected to "do their own homework."  I would
agree, if there was sufficient material available to study.  But simply
because Lucene is so powerful and capable and (IMHO) cutting edge, people
could use a bit more help not only in getting started, but in tapping some
of its more refined capabilities.


Regards,

Terry

PS: If there is general interest in doing some documentation enhancement,
I'd be happy to participate/contribute.

- Original Message -
From: "Tatu Saloranta" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Saturday, August 30, 2003 12:09 AM
Subject: Re: Keyword search with space and wildcard


> On Friday 29 August 2003 10:02, Terry Steichen wrote:
> > I agree.  One problem, however, that new (and not-so-new) Lucene users
> > face
> > is a learning curve when they want to get past the simplest and most
> > obvious uses of Lucene.  For example, I don't think any of the docs
> > mention the fact that you can't combine a phrase and a wildcard query.
> > Other things that are obviously quite well understood by many members of
> > the list, are still less-than-clear to others.  For example, I found (and
> > still
> > find) it a bit difficult to find concrete examples/advice of how to get
> > good benefit from filters.
> >
> > My whole point is that this is a *very* powerful and flexible technology.
> > But I think it's often very difficult for those most experienced in using
> > Lucene to fully appreciate how it looks from the "newbie" point of view.
>
> I agree completely. Perhaps I worded my reply badly; I didn't mean to sound
> hostile towards new users at all -- after all I consider myself to be one (I
> just happened to work on simple improvements to QueryParser and learnt how it
> works). I wish documentation was more complete; perhaps some section could
> list common workarounds or insights. And perhaps incompatibility of phrase
> and wild card queries could be added to document that lists current
> limitations.
>
> I guess the reason I think it's valuable to document the flexibility of query
> construction is that I have been working on something similar (although
> working with database queries) in a system I'm working on, and I have also
> seen systems that have query syntax that's too intertwined with backend
> implementation (for example, while Hibernate is a good ORM, its queries don't
> seem to have a backend-independent intermediate representation... which makes
> it hard to develop different kinds of backends). So, it's useful to know that
> there are 2 levels of interfaces to Lucene's query functionality.
>
> -+ Tatu +-
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Keyword search with space and wildcard

2003-08-29 Thread Terry Steichen
Tatu,

I agree.  One problem, however, that new (and not-so-new) Lucene users face
is a learning curve when they want to get past the simplest and most obvious
uses of Lucene.  For example, I don't think any of the docs mention the fact
that you can't combine a phrase and a wildcard query.  Other things that are
obviously quite well understood by many members of the list, are still
less-than-clear to others.  For example, I found (and still find) it a bit
difficult to find concrete examples/advice of how to get good benefit from
filters.

My whole point is that this is a *very* powerful and flexible technology.
But I think it's often very difficult for those most experienced in using
Lucene to fully appreciate how it looks from the "newbie" point of view.

Just my $0.02.

Regards,

Terry


- Original Message -
From: "Tatu Saloranta" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Friday, August 29, 2003 11:14 AM
Subject: Re: Keyword search with space and wildcard


> On Thursday 28 August 2003 21:54, Brian Campbell wrote:
> > Basically, yes, I am trying to put a wildcard in a phrase.  My field (a
> > Keyword) is the name of a project.  It can be 40 characters long (I'm
> > basically indexing some database columns).  Since it is a Keyword and not a
> > Text field, it doesn't get tokenized (I do this on purpose) and must match
> > up exactly.  I would like for users to be able to search on partial phrases
> > such as "Hello w*" and match up to "Hello world" and "Hello washington",
> > etc.  Is this not possible?  Is it documented anywhere?
>
> This can be done, AFAIK.
>
> This is one thing that many people seem unaware of: you don't HAVE to use
> QueryParser to build queries. In your case it seems like you should be able
> to construct the query you want if you either by-pass QueryParser, or create
> a dummy analyzer (one that does no tokenization but returns all input as
> one token).
>
> Since QueryParser is a fairly simple class, you should be able to see how
> wild
> card queries are constructed. You can not (and need not) create a phrase
> query since it does not allow wild cards (like someone pointed out), but
> since the whole phrase is just one token for keyword fields, you can use
> normal wild card query (or prefix for cases like "Hello w*").
>
> It would be nice if FAQ could point out that QueryParser is higher-level
> interface to query part, but it is possible and sometimes necessary to do
> your own query construction. I think it's very cool Lucene queries were
> properly modularized this way -- too many open source projects have
> components too tightly coupled.
>
> -+ Tatu +-
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: RC2 requires reindexing?

2003-08-29 Thread Terry Steichen
I agree.  When I said I was using RC2, that wasn't quite accurate.  I was
using the latest CVS offering, which could have other changes beyond RC2.  A
convenient way to get RC2 would be beneficial.

Regards,

Terry

- Original Message -
From: "Jan Agermose" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Friday, August 29, 2003 8:46 AM
Subject: Re: RC2 requires reindexing?


> ok - but is it then RC2 or simply HEAD? I'm not that interested in cvs
> access/updating; if it's an RC it really should be on the website, as a
> binary stable download?
>
> Jan
>
>
>
> - Original Message -
> From: "Aviran Mordo" <[EMAIL PROTECTED]>
> To: "'Lucene Users List'" <[EMAIL PROTECTED]>
> Sent: Friday, August 29, 2003 2:34 PM
> Subject: RE: RC2 requires reindexing?
>
>
> > You can find RC2 in CVS
> >
> > -Original Message-
> > From: Jan Agermose [mailto:[EMAIL PROTECTED]
> > Sent: Friday, August 29, 2003 6:32 AM
> > To: Lucene Users List
> > Subject: Re: RC2 requires reindexing?
> >
> >
> > Ok, on the first posting about RC2 I looked for it, but as I did not
> > find any RC2 I guessed he was mistaken... but now? What RC2 are you
> > talking about, and if it's 1.3RC2 where do I find it, and why does the
> > webpage not mention it (or the download area hold it) ?
> >
> > Jan
> >
> > - Original Message -
> > From: "Lukas Zapletal" <[EMAIL PROTECTED]>
> > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > Sent: Friday, August 29, 2003 12:14 PM
> > Subject: Re: RC2 requires reindexing?
> >
> >
> > > it does not need reindexing, it works fine for me
> > >
> > > On Thu, 28 Aug 2003 11:27:34 -0400, Terry Steichen
> > > <[EMAIL PROTECTED]>
> > > wrote:
> > >
> > > > I just switched to RC2 and found that a number of queries now don't
> > > > work.  (When I switch back to RC1 they work fine.)  Can't seem to figure
> > > > out a pattern regarding those that don't work versus those (the vast
> > > > majority) that still work fine.  I looked in the RC2 source and
> > > > noticed that the dates on IndexWriter and IndexReader and a bunch of
> > > > related modules seem to have been changed.
> > > >
> > > > Is it necessary to reindex (a major task for my stuff) to use RC2?
> > > >
> > > > Regards,
> > > >
> > > > Terry
> > > >
> > >
> > >
> > >
> > > --
> > > Lukas Zapletal
> > >
> > > http://www.tanecni-olomouc.cz/lzap   icq: 17569735
> > > mail: lzap_at_root.czjabber: lzap_at_njs.netlab.cz
> > > pgp: 715B 5502 4FB3 65E7 266B 927E CE9F 1D04 0EE2 4DB7
> > >
> > > -
> > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > For additional commands, e-mail: [EMAIL PROTECTED]
> > >
> > >
> >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Keyword search with space and wildcard

2003-08-29 Thread Terry Steichen
If I understand your issue correctly, I think what you're experiencing is
the fact that you can have a phrase query "hello world", or a wildcard query
+hell* +wor*, but you can't mix the two together.  As far as I've found,
that's a basic limitation you just have to live with.  (Of course, if
someone on the list can show me where I'm wrong, I'll be delighted.)  You
can add boosting to any kind of term (such as wor*^10 or "world order"^10),
but (I don't think) you can't mix wildcards and phrases.

HTH,

Terry

- Original Message -
From: "Brian Campbell" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Thursday, August 28, 2003 4:45 PM
Subject: Keyword search with space and wildcard


> I've created an index that has a Keyword field in it.  I'm trying to do a
> search on that field where my term has a space and the wildcard character in
> it.  For example, I'll issue the following search:  project_name:"Hello w*".
>   I have an entry in the project_name field of "Hello world".  I would
> expect to get a hit on this but I don't.  Is this not the way Lucene
> behaves? Am I doing something wrong?  Thanks.
>
> -Brian
>
> _
> Help protect your PC: Get a free online virus scan at McAfee.com.
> http://clinic.mcafee.com/clinic/ibuy/campaign.asp?cid=3963
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RC2 requires reindexing?

2003-08-28 Thread Terry Steichen
I just switched to RC2 and found that a number of queries now don't work.  (When I 
switch back to RC1 they work fine.)  Can't seem to figure out a pattern regarding 
those that don't work versus those (the vast majority) that still work fine.  I looked 
in the RC2 source and noticed that the dates on IndexWriter and IndexReader and a 
bunch of related modules seem to have been changed.

Is it necessary to reindex (a major task for my stuff) to use RC2?

Regards,

Terry


Re: Similar Document Search

2003-08-21 Thread Terry Steichen
Hi Peter,

I took a look at Mark's thesis and briefly at some of his code.  It appears
to me that what he's done with the so-called forward indexing is to (a)
include a unique id with each document (allowing retrieval by id rather than
by a standard query), and to (b) include a frequency map class with each
document (allowing easier retrieval of term frequency information).

Now I may be missing something very obvious, but it seems to me that both of
these functions can be done rather easily with the standard (unmodified)
version of Lucene.  Moreover, I don't understand how use of these functions
will facilitate retrieval of documents that are "similar" to a selected
document, as outlined in my original question on this topic.

Could you (or anyone else, of course) perhaps elaborate just a bit on how
using this approach will help achieve that end?

Regards,

Terry

- Original Message -
From: "Peter Becker" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Thursday, August 21, 2003 1:37 AM
Subject: Re: Similar Document Search


> Hi all,
>
> it seems there are quite a few people looking for similar features, i.e.
> (a) document identity and (b) forward indexing. So far we hacked (a) by
> using a wrapper implementing equals/hashcode based on a unique field,
> but of course that assumes maintaining a unique field in the index. (b)
> is something we haven't tackled yet, but plan to.
>
> The source code for Mark's thesis seems to be part of the Haystack
> distribution. The comments in the files put it under Apche-license. This
> seems to make it a good candidate to be included at least in the Lucene
> sandbox -- although I haven't tried it myself yet. But it sounds like a
> good candidate for us to use.
>
> Since the haystack source is a bit larger and I actually couldn't get
> the download at the moment, here is a copy of the relevant bit grabbed
> from one of my colleague's machines:
>
>   http://www.itee.uq.edu.au/~pbecker/luceneHaystack.tar.gz (22kb)
>
> Note that this is just a tarball of src/org/apache/lucene out of some
> Haystack source. Untested, unmodified.
>
> I'd love to see something like this supported in the Lucene context were
> people might actually find it :-)
>
>   Peter
>
>
> Gregor Heinrich wrote:
>
> >Hello Terry,
> >
> >Lucene can do forward indexing, as Mark Rosen outlines in his Master's
> >thesis: http://citeseer.nj.nec.com/rosen03email.html.
> >
> >We use a similar approach for (probabilistic) latent semantic analysis and
> >vector space searches. However, the solution is not really completely fixed
> >yet, therefore no code at this time...
> >
> >Best regards,
> >
> >Gregor
> >
> >
> >
> >
> >-Original Message-
> >From: Peter Becker [mailto:[EMAIL PROTECTED]
> >Sent: Tuesday, August 19, 2003 3:06 AM
> >To: Lucene Users List
> >Subject: Re: Similar Document Search
> >
> >
> >Hi Terry,
> >
> >we have been thinking about the same problem and in the end we decided
> >that most likely the only good solution to this is to keep a
> >non-inverted index, i.e. a map from the documents to the terms. Then you
> >can query the most terms for the documents and query other documents
> >matching parts of this (where you get the usual question of what is
> >actually interesting: high frequency, low frequency or the mid range).
> >
> >Indexing would probably be quite expensive since Lucene doesn't seem to
> >support changes in the index, and the index for the terms would change
> >all the time. We haven't implemented it yet, but it shouldn't be hard to
> >code. I just wouldn't expect good performance when indexing large
> >collections.
> >
> >  Peter
> >
> >
> >Terry Steichen wrote:
> >
> >
> >
> >>Is it possible without extensive additional coding to use Lucene to conduct
> >>a search based on a document rather than a query?  (One use of this would be
> >>to refine a search by selecting one of the hits returned from the initial
> >>query and subsequently retrieving other documents "like" the selected one.)
> >
> >
> >>Regards,
> >>
> >>Terry
> >>
> >>
> >>
> >>
> >>
> >
> >
> >
> >-
> >To unsubscribe, e-mail: [EMAIL PROTECTED]
> >For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
> >
> >-
> >To unsubscribe, e-mail: [EMAIL PROTECTED]
> >For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Similar Document Search

2003-08-18 Thread Terry Steichen
Hi Peter,

What got me thinking about this was the way that Lucene computes similarity
(or scoring).  After the boolean keyword matches have been found, Lucene
then computes relevance.  What Lucene does, I think, is to process the query
into some intermediate internal representation and computes the similarity
between the query (now a kind of a pseudo-document) and each of the matching
hits.

I was wondering if there might not be a way to internally process a selected
document (rather than the query per se) and then, in effect, compute the
similarity between that document and all the other documents (which have
already been pre-processed in the indexing process).  So, what you'd be
doing is not a boolean keyword match, but a ranking of all the documents in
the repository on the basis of relevance or similarity to the target
document.
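
As a crude approximation of this with the existing public API, one could
re-tokenize a stored field of the selected document and turn its terms into
one big OR query - a sketch only (it is not the internal scoring path, and a
very long document would hit the BooleanQuery clause limit):

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class LikeThisQuery {
  // Builds an OR query out of the tokens of one stored field of a document.
  public static Query build(String field, String storedText, Analyzer analyzer)
      throws IOException {
    BooleanQuery query = new BooleanQuery();
    TokenStream tokens = analyzer.tokenStream(field, new StringReader(storedText));
    for (Token t = tokens.next(); t != null; t = tokens.next()) {
      // not required, not prohibited: a plain optional (OR) clause
      query.add(new TermQuery(new Term(field, t.termText())), false, false);
    }
    tokens.close();
    return query;
  }
}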

(If that's not too far off in terms of reality, maybe Doug could comment?)

Regards,

Terry

- Original Message -
From: "Peter Becker" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, August 18, 2003 9:05 PM
Subject: Re: Similar Document Search


> Hi Terry,
>
> we have been thinking about the same problem and in the end we decided
> that most likely the only good solution to this is to keep a
> non-inverted index, i.e. a map from the documents to the terms. Then you
> can query the most terms for the documents and query other documents
> matching parts of this (where you get the usual question of what is
> actually interesting: high frequency, low frequency or the mid range).
>
> Indexing would probably be quite expensive since Lucene doesn't seem to
> support changes in the index, and the index for the terms would change
> all the time. We haven't implemented it yet, but it shouldn't be hard to
> code. I just wouldn't expect good performance when indexing large
> collections.
>
>   Peter
>
>
> Terry Steichen wrote:
>
> >Is it possible without extensive additional coding to use Lucene to
> >conduct a search based on a document rather than a query?  (One use of this
> >would be to refine a search by selecting one of the hits returned from the
> >initial query and subsequently retrieving other documents "like" the
> >selected one.)
> >
> >Regards,
> >
> >Terry
> >
> >
> >
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Similar Document Search

2003-08-18 Thread Terry Steichen
Is it possible without extensive additional coding to use Lucene to conduct a search 
based on a document rather than a query?  (One use of this would be to refine a search 
by selecting one of the hits returned from the initial query and subsequently 
retrieving other documents "like" the selected one.)

Regards,

Terry


Re: searching data indexed from database??

2003-06-01 Thread Terry Steichen
Well, if you don't store the data in the index it probably isn't too bad.
Alternatively, if you don't need to do any field-specific searching, then
you *only* index the combined field (and *not* the individual ones).  Then
there's no additional impact.

Regards,

Terry

- Original Message -
From: "Venkatraman, Shiv" <[EMAIL PROTECTED]>
To: "'Terry Steichen '" <[EMAIL PROTECTED]>; "'Lucene Users List '"
<[EMAIL PROTECTED]>
Sent: Saturday, May 31, 2003 11:31 AM
Subject: RE: searching data indexed from database??


> Thanks. That's what I suspected and was hoping that wasn't the case. Won't
> this lead to duplication of data during indexing -- one piece under the
> specific field ("Product") and the same one (along with others) under the
> default field ("contents")?
>
> -Original Message-
> From: Terry Steichen
> To: Lucene Users List
> Sent: 5/31/03 8:06 AM
> Subject: Re: searching data indexed from database??
>
> Shiv,
>
> Searching in Lucene is field-based.  Thus you must specify the field to
> be
> searched - the only 'exception' is that one field is defined as default.
> If
> you want to search across multiple fields, I believe you must create a
> concatenation of the individual fields into a single one during the
> indexing
> process (eg. productName+" "+productDesc), and then use that as the
> basis of
> your subsequent searches.
>
> HTH,
>
> Terry
>
> - Original Message -
> From: "Venkatraman, Shiv" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Sent: Saturday, May 31, 2003 10:33 AM
> Subject: searching data indexed from database??
>
>
> > I have an indexer that reads data from database and indexes the data.
> >   foreach(db_row) {
> >   Document doc = new Document();
> >   doc.add(Field.Text("Product", productName));
> >   doc.add(Field.Text("Description", productDesc));
> > ...
> >   writer.addDocument(doc);
> >   }
> >
> >
> > Once indexed, I would like to do a search that spans across multiple
> > fields, i.e. the user may enter "lawnmower" and it should perform a search
> > across all the indexed fields. Also, how do I pass user queries like
> > "lawnmower -grass" to the query API?
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: searching data indexed from database??

2003-06-01 Thread Terry Steichen
Shiv,

Searching in Lucene is field-based.  Thus you must specify the field to be
searched - the only 'exception' is that one field is defined as default.  If
you want to search across multiple fields, I believe you must create a
concatenation of the individual fields into a single one during the indexing
process (eg. productName+" "+productDesc), and then use that as the basis of
your subsequent searches.
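
Roughly, with the standard Field factories (a sketch; Field.UnStored indexes
and tokenizes without storing, so the combined text isn't duplicated in the
stored part of the index):

Document doc = new Document();
doc.add(Field.Text("Product", productName));
doc.add(Field.Text("Description", productDesc));
// combined default-search field: indexed but not stored
doc.add(Field.UnStored("contents", productName + " " + productDesc));
writer.addDocument(doc);

// queries like "lawnmower -grass" then go against the combined field
Query q = QueryParser.parse("lawnmower -grass", "contents", analyzer);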

HTH,

Terry

- Original Message -
From: "Venkatraman, Shiv" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Saturday, May 31, 2003 10:33 AM
Subject: searching data indexed from database??


> I have an indexer that reads data from database and indexes the data.
>   foreach(db_row) {
>   Document doc = new Document();
>   doc.add(Field.Text("Product", productName));
>   doc.add(Field.Text("Description", productDesc));
> ...
>   writer.addDocument(doc);
>   }
>
>
> Once indexed, I would like to do a search that spans across multiple
> fields.
> i.e. the user may enter "lawnmower" and it should perform a search across
> all the indexed fields. Also, how do I pass user queries like  "lawnmower
> -grass" to the query API?
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Problem while indexing

2003-04-03 Thread Terry Steichen
Amit,

I don't exactly know what your problem is, but I'm using a configuration not
too different from yours with no problems - so at least you know it's
possible.

I have an index of about 125MB which I use on various machines, including an
old Windows98/SE 400MHz notebook.  I used the default MergeFactor (10, I
think) and do a daily merge (the daily addition represents about 200
documents added to a total of over 58,000).  Each document (XML format) has
about 15 fields of various types.  I'm using release 1.3 dev 1.

At one point I too had a problem of too many open files - turned out that I
wasn't closing the IndexReader.  Fixed that, and the number of open files
usually stays below 500 (without Lucene, there are typically about 300-400
open files just for the system).
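
(For anyone hitting the same thing, the fix amounts to making sure every
reader gets closed, e.g. in a finally block - the path is illustrative:)

IndexReader reader = IndexReader.open("master_db/index");
try {
  // ... search / read ...
} finally {
  reader.close();  // releases the underlying file handles
}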

HTH,

Terry



- Original Message -
From: "Amit Kapur" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Thursday, April 03, 2003 12:13 AM
Subject: Problem while indexing


>
> hi all
>
> I'm facing problems like those mentioned below while indexing. If anyone has
> any help to offer I would be obliged:
>  couldn't rename segments.new to segments 
>  F:\Program Files\OmniDocs Server\ftstest\_3cf.fnm (Too many open
> files)
>
> I am trying to index documents using Lucene, generating about 30 MB of
> index (optimized), which can be raised to about 100 MB or more (but that
> would be on a high-end server machine).
>
> Description of Current Case:
> #---Each Document has four fields (One Text field, and 3 other Keyword
> Fields).
> #---The analyzer is based on a StopFilter and a PorterStemFilter.
> #---I am using a Compaq PIII, 128 MB RAM, 650 MHz.
> #---mergeFactor is set to 25, and I am optimizing the index after adding
> about 20 Documents.
> #---Using Lucene Release 1.2
>
> Problem Faced
> After adding about 4000 documents, generating an index of 30 MB, I initially
> got an error saying  couldn't rename segments.new to segments ,
> after which the IndexReader or the IndexWriter for the current index
> could not be opened.
>
> Then I changed a couple of settings,
> #---mergeFactor=20 and Optimize was called after ever 10 documents.
> #---Using Lucene Release 1.3
>
> Problem Faced
> After adding about 1500 documents, generating an index of 10 MB, I initially
> got an error saying  F:\Program Files\OmniDocs Server\ftstest\_3cf.fnm
> (Too many open files) , after which the IndexWriter for the current index
> could not be opened.
>
> Now my requirement is for a much, much larger index (practically speaking),
> and I am actually at the point where these errors are coming unpredictably.
>
> Please if anyone could guide me on this ASAP.
> Thanx in advance
>
> Regards
> Amit
>
> PS: I have already read articles in the mail archive
> http://www.mail-archive.com/[EMAIL PROTECTED]/msg02815.html.
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Tokenize negative number

2003-03-25 Thread Terry Steichen
It probably tokenized '1234' as a term and treated the '-' as a separator.  See
the previous discussion on "query".
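
You can see it directly by running the analyzer by hand (a quick sketch;
the field name is arbitrary):

TokenStream ts = new StandardAnalyzer()
    .tokenStream("f", new StringReader(" -1234 "));
for (Token t = ts.next(); t != null; t = ts.next())
  System.out.println(t.termText());   // prints just "1234"; the '-' is gone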

Regards,

Terry

- Original Message -
From: "Lixin Meng" <[EMAIL PROTECTED]>
To: "'Lucene Users List'" <[EMAIL PROTECTED]>
Sent: Tuesday, March 25, 2003 9:16 PM
Subject: Tokenize negative number


> I have a document with content ' -1234 '. However, after calling the
> StandardTokenizer, the token only has '1234' (missing the '-') as its
> termText.
>
> Did anyone experience the similar problem and is there a work around?
>
> Regards,
> Lixin
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: query

2003-03-25 Thread Terry Steichen
Arsineh,

There was some discussion on this list about this topic earlier.  As I
recall, escaping a '-' doesn't work (for reasons I don't recall -
something about the interaction of the analyzer and tokenizer, I think).  To handle
this for my own purposes, I believe I modified the QueryParser.jj source to
add '-' as a valid alphanumeric character.  However, you have to be careful
because this may cause ordinary words not to match if they are hyphenated
from word-wrapping.

HTH,

Terry

- Original Message -
From: <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, March 25, 2003 5:43 AM
Subject: query


> Hi everyone,
>
> I have indexed a table in the database.
> the table has a column named TagNr. It contains values like 25-XX8569,
> 41-VL451   etc.
> By indexing the table I use the factory method Field.Keyword for this
> column. So the values are not tokenised in this field.
> Now, when I'm searching for a value containing '-' in the field TagNr I
> don't get any results.
> I have escaped '-' using '\' like 25\-XX8569, 25\\-XX8569 and 25\-\XX8569.
> But I still don't get anything.
>
> Has someone any suggestions?
>
> Thanks for the help
> Arsineh
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexing and searching database data

2003-03-10 Thread Terry Steichen
+1
- Original Message -
From: <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Monday, March 10, 2003 10:38 AM
Subject: Indexing and searching database data


> Hello,
>
> Would anyone be interested in ability to use Lucene search on the data
from
> a database?
> I've written a small framework that allows you to create Lucene index files
> out of the database data, and then helps with the retrieval of that data
> (if needed) while searching.
> This way you can merge data from files, web pages, databases etc. into one
> central index.
>
> Greetings.
>
> Tom S
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene Turbine Service

2003-03-04 Thread Terry Steichen
Samuel,

Not exactly sure of your question.  But, if the path is known at the time of
indexing, you just insert it in the Document that is created as part of the
indexing.  If you don't know the path till later, you might insert a partial
path at index time and add the exact location when you use it.  For example,
in my own system, I use a structure called "master_db", which has a standard
structure underneath it.  However, this whole database might be located
anywhere for any given deployment.  So, when I index, I define my document's
location within this "master_db" structure as the "path" field.  Then when I
retrieve it, I simply doc.get("path") and tack it on to the actual location.
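
In code it's nothing fancier than this (a hypothetical sketch; the path and
the root variable are made up for illustration):

// index time: store a path relative to the master_db root
doc.add(Field.Keyword("path", "news/2003/0304/story123.xml"));

// retrieval time: prepend whatever this deployment's root happens to be
String actualLocation = masterDbRoot + "/" + hits.doc(i).get("path");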

As I said, I'm not sure this is exactly what you're looking for, but HTH
anyway.

Regards,

Terry

- Original Message -
From: "Samuel Alfonso Velázquez Díaz" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, March 04, 2003 11:34 AM
Subject: RE: Lucene Turbine Service


>
> Hi, I'm a newbie to Lucene; I'm having trouble creating an index usable
> by my web application.
>
> The problem is that, following the guidelines in the documentation, I created
> an index with URLs relative to my local file system:
> doc.get("url");   // returns paths like
> C:/CopiaSite20030228/Legislacion/reglamentos/index.htm
> How can I create an index and specify that the ROOT of my documents points
> to mydomain.com?
>
> Please help, I've been searching the docs and mailing list, but I've been
> without luck!
>
> Regards!
>
>
> Samuel Alfonso Velázquez Díaz
> http://www.geocities.com/samuelvd
> [EMAIL PROTECTED]
>
>
> -
> Do you Yahoo!?
> Yahoo! Tax Center - forms, calculators, tips, and more


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Computing Relevancy Differently

2003-02-28 Thread Terry Steichen
Doug,

I'll put a test case together shortly.  In the meanwhile, here's the code in
the attachment that didn't get through (and BTW, is there some special way
to get attachments through?):

public class WESimilarity extends DefaultSimilarity {

    // Boost the length norm for the two short display fields; everything else
    // is scored as if it held at least 300 terms.
    public float lengthNorm(String fieldName, int numTerms) {
        if (fieldName.equals("headline") || fieldName.equals("summary")) {
            System.out.println("WES - special");
            return 4.0f * super.lengthNorm(fieldName, numTerms);
        } else {
            System.out.println("WES - normal");
            return super.lengthNorm(fieldName, Math.max(numTerms, 300));
        }
    }
}

I just ran a test indexing - but neither of the debug statements was
displayed.  I again verified that if I renamed WESimilarity.class, I got an
exception (just to ensure it was being picked up).

Regards,

Terry

- Original Message -
From: "Doug Cutting" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Friday, February 28, 2003 5:52 PM
Subject: Re: Computing Relevancy Differently


> Your attachment did not make it, so I cannot see your code.
>
> If you think there's a bug, could you please provide a complete,
> self-contained test case?  You could, for example, model this after the
> TestSimilarity class in the test code hierarchy.
>
> The lengthNorm(String,int) method is called when you index the document.
>
> Doug
>
> Terry Steichen wrote:
> > Doug,
> >
> > I've implemented a subclass of DefaultSimilarity (called WESimilarity.java,
> > copy attached) which defines a new lengthNorm() method more or less as you
> > suggested.  I then added a line prior to using my IndexWriter:
> > writer.setSimilarity(new WESimilarity()), and a similar line prior to using
> > my IndexSearcher: searcher.setSimilarity(new WESimilarity()).
> >
> > The result:
> > 1) There's no change whatsoever in the computed scores, and
> > 2) The debugging messages never get printed out.
> >
> > I know the WESimilarity is being used (because if I rename it I get an
> > exception), but it does not appear that the new lengthNorm() method is
being
> > called.
> >
> > It's probably some silly goof, but I can't figure out where it is.
> >
> > If you (or anyone else, of course) have any ideas/suggestions, I'd
> > appreciate them.
> >
> > Regards,
> >
> > Terry
> >
> > - Original Message -
> > From: "Terry Steichen" <[EMAIL PROTECTED]>
> > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > Sent: Monday, February 10, 2003 2:28 PM
> > Subject: Re: Computing Relevancy Differently
> >
> >
> >
> >>Doug,
> >>
> >>That's excellent.  Just what I've been looking for.  I'll start
> >>experimenting shortly.
> >>
> >>Regards,
> >>
> >>Terry
> >>
> >>- Original Message -
> >>From: "Doug Cutting" <[EMAIL PROTECTED]>
> >>To: "Lucene Users List" <[EMAIL PROTECTED]>
> >>Sent: Monday, February 10, 2003 1:57 PM
> >>Subject: Re: Computing Relevancy Differently
> >>
> >>
> >>
> >>>Terry Steichen wrote:
> >>>
> >>>>Can you give me an idea of what to replace the lengthNorm() method with
> >>>>to, for example, remove any special weight given to shorter matching
> >>>>documents?
> >>>
> >>>The goal of the default implementation is not to give any special weight
> >>>to shorter documents, but rather to remove the advantage longer
> >>>documents have.  Longer documents are likely to have more matches simply
> >>>because they contain more terms.  Also, for the query "foo", a document
> >>>containing just "foo" is a better match than a longer one containing
> >>>"foo bar baz", since the match is more exact.
> >>>
> >>>However, one problem with this approach can be that very short documents
> >>>are in fact not very informative.  Thus a bias against very short
> >>>documents is sometimes useful.
> >>>
> >>>
> >>>>I can certainly go through a bunch of trial-and-error efforts, but it
> >>>>would help if I had some grasp of the logic initially.
> >>>>
> >>>>For example, from DefaultSimilarity, here's the lengthNorm() method:
> >>>>
> >>>> 

Re: Computing Relevancy Differently

2003-02-28 Thread Terry Steichen
Doug,

I've implemented a subclass of DefaultSimilarity (called WESimilarity.java,
copy attached) which defines a new lengthNorm() method more or less as you
suggested.  I then added a line prior to using my IndexWriter:
writer.setSimilarity(new WESimilarity()), and a similar line prior to using
my IndexSearcher: searcher.setSimilarity(new WESimilarity()).

The result:
1) There's no change whatsoever in the computed scores, and
2) The debugging messages never get printed out.

I know the WESimilarity is being used (because if I rename it I get an
exception), but it does not appear that the new lengthNorm() method is being
called.

It's probably some silly goof, but I can't figure out where it is.

If you (or anyone else, of course) have any ideas/suggestions, I'd
appreciate them.

Regards,

Terry

- Original Message -----
From: "Terry Steichen" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, February 10, 2003 2:28 PM
Subject: Re: Computing Relevancy Differently


> Doug,
>
> That's excellent.  Just what I've been looking for.  I'll start
> experimenting shortly.
>
> Regards,
>
> Terry
>
> - Original Message -
> From: "Doug Cutting" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Monday, February 10, 2003 1:57 PM
> Subject: Re: Computing Relevancy Differently
>
>
> > Terry Steichen wrote:
> > > Can you give me an idea of what to replace the lengthNorm() method
with
> to,
> > > for example, remove any special weight given to shorter matching
> documents?
> >
> > The goal of the default implementation is not to give any special weight
> > to shorter documents, but rather to remove the advantage longer
> > documents have.  Longer documents are likely to have more matches simply
> > because they contain more terms.  Also, for the query "foo", a document
> > containing just "foo" is a better match than a longer one containing
> > "foo bar baz", since the match is more exact.
> >
> > However, one problem with this approach can be that very short documents
> > are in fact not very informative.  Thus a bias against very short
> > documents is sometimes useful.
> >
> > > I can certainly go through a bunch of trial-and-error efforts, but it
> > > would help if I had some grasp of the logic initially.
> > >
> > > For example, from DefaultSimilarity, here's the lengthNorm() method:
> > >
> > >   public float lengthNorm(String fieldName, int numTerms) {
> > > return (float)(1.0 / Math.sqrt(numTerms));
> > >   }
> > >
> > > Should I (for the purpose of eliminating any size bias) override it to
> > > always return a 1?
> >
> > That's something to try, although, as mentioned above, I suspect your
> > top hits will be dominated by long documents.  Try it.  It's really not
> > a difficult experiment!
> >
> > One trick I've used to keep very short documents from dominating
> > results, that, while good matches, are not informative documents, is to
> > override this with something like:
> >
> > public float lengthNorm(String fieldName, int numTerms) {
> >   return super.lengthNorm(fieldName, Math.max(numTerms, 100));
> > }
> >
> > This way all fields shorter than 100 terms are scored like fields
> > containing 100 terms.  Long documents are still normalized, but search
> > is biased a bit against very short documents.
> >
> > > How would I boost the headline field here? Is that how you are supposed
> > > to use the (presently unused) fieldName parameter?  If that's the case, I
> > > assume I would logically (to do what I'm trying to do) make this factor
> > > greater than 1 for the 'headline' field, and 1 for all other fields?
> >
> > You could do that here too.  So, for example, you could do something
like:
> >
> > public float lengthNorm(String fieldName, int numTerms) {
> >   float n = super.lengthNorm(fieldName, Math.max(numTerms, 100));
> >   if (fieldName.equals("headline"))
> > n *= 4.0f;
> >   return n;
> > }
> >
> > Equivalently, you could create your documents with something like:
> >
> >Document d = new Document();
> >Field f = new Field.Text("headline", headline);
> >f.setBoost(4.0f);
> >...
> >
> > But headlines tend to be short, and naturally benefit from the default
> > lengthNorm implementation.  S

Re: MAX Index Size POLL

2003-02-27 Thread Terry Steichen
Samir,

The size of the index depends on (a) the size of the documents, (b) the
number of fields per document, (c) the fields that are kept in the index.
The time taken to index depends on the same plus the characteristics of the
processor and storage i/o.  With so many variables, I don't think the simple
listing you're requesting will be of much use.

Regards,

Terry

- Original Message -
From: "Samir Satam" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Thursday, February 27, 2003 12:22 PM
Subject: MAX Index Size POLL


Hello friends,
If it is not much of a trouble, I would like to ask as many of you as
possible, to post some statistics.
This would preferably include

1. Size of the index.
2. No of documents indexed.
3. Time taken to index new documents.
4. Time taken for a typical query.


thank you,
Samir

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexing Tips and Hints

2003-02-24 Thread Terry Steichen
Hi Andrzej,

Thanks for the code.  I'll try it as soon as I have time.  If you had a copy
of the modified FSDirectory implementation you could also share, that would
make testing it a bit quicker and easier.  BTW, when you said it "supposedly
increases I/O", I gather that you are not the author?

Regards,

Terry

- Original Message -
From: "Andrzej Bialecki" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, February 24, 2003 3:59 PM
Subject: Re: Indexing Tips and Hints


> Hello,
>
> Since you are trying this anyway, and looking for ways to improve
> indexing times... Could you perhaps try to replace use of
> java.io.RandomAccessFile in FSDirectory implementation, with the
> attached implementation? It supposedly increases I/O throughput by
> orders of magnitude, by using partial buffering.
>
> Terry Steichen wrote:
> > Mike,
> >
> > By way of comparison, I've got a collection of about 50,000 XML files,
> > each of which averages about 8K.  It takes about 1.25 hours to index (on
> > a 1.8GHz machine).  I use basically the standard configuration
> > (mergeFactor, etc.) and I've got about 30 fields per document.  I add
> > about 200 new ones per day.  I don't recall how long it takes to index
> > the 200 (I do it through a background task), but it takes a couple of
> > minutes to merge the new 200 document index with the master index.
> >
> > HTH,
> >
> > Terry
> >
> > - Original Message -
> > From: "Michael Barry" <[EMAIL PROTECTED]>
> > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > Sent: Monday, February 24, 2003 2:00 PM
> > Subject: Indexing Tips and Hints
> >
> >
> >
> >>All,
> >>   I'm in need of some pointers, hints or tips on indexing large
> >
> > collections
> >
> >>of data. I know I saw some tips on this list before but when I tried
> >>searching
> >>the list, I came up blank.
> >>   I have a large collection of XML files (336000 files around 5K
> >>apiece) that I'm
> >>indexing and it's taking quite a bit of time (27 hours). I've played
> >>around with the
> >>mergeFactor, RAMDirectories and multiple threads (X number of threads
> >>indexing
> >>a subset of the data and then merging the indexes at the end) but I
> >>cannot seem
> >>to bring the time down. I'm probably not doing these things properly but
> >>from
> >>what I read I believe I am.  Maybe this is the best I can do with this
> >>data but I
> >>would be really grateful to hear how others have tackled this same
issue.
> >>   As always pointers to places in the mailing list archive or other
> >>places would be
> >>appreciated.
> >>
> >>Thanks, Mike.
> >>
> >>-
> >>To unsubscribe, e-mail: [EMAIL PROTECTED]
> >>For additional commands, e-mail: [EMAIL PROTECTED]
> >>
> >>
> >
> >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
> >
>
>
> --
>
> --
> Best regards,
> Andrzej Bialecki
>
> -
> Software Architect, System Integration Specialist
> -
> FreeBSD developer (http://www.freebsd.org)
>
>






> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
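
Below is a minimal sketch of the partial-buffering idea Andrzej describes -
wrapping java.io.RandomAccessFile so that single-byte reads are served from
a block buffer instead of going to the OS each time. The class name, the 8K
buffer size, and the read-only mode are illustrative assumptions; this is
not the implementation he attached.

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Sketch of partial buffering over RandomAccessFile (illustrative).
    public class BufferedRandomAccessFile {
        private static final int BUF_SIZE = 8192;      // assumed block size
        private final RandomAccessFile raf;
        private final byte[] buf = new byte[BUF_SIZE];
        private long bufStart = -1;                    // file offset of buf[0]
        private int bufLen = 0;                        // valid bytes in buf
        private long pos = 0;                          // logical read position

        public BufferedRandomAccessFile(String name) throws IOException {
            raf = new RandomAccessFile(name, "r");     // read-only for brevity
        }

        public void seek(long p) {
            pos = p;                                   // no OS call until read
        }

        public int read() throws IOException {
            if (pos < bufStart || pos >= bufStart + bufLen) {
                raf.seek(pos);                         // buffer miss: refill
                bufStart = pos;
                bufLen = raf.read(buf, 0, BUF_SIZE);
                if (bufLen <= 0) return -1;            // end of file
            }
            return buf[(int) (pos++ - bufStart)] & 0xFF;
        }

        public void close() throws IOException {
            raf.close();
        }
    }

The win comes from seek() becoming free until the next read, and from
sequential reads mostly hitting the buffer.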



Re: Indexing Tips and Hints

2003-02-24 Thread Terry Steichen
Mike,

By way of comparison, I've got a collection of about 50,000 XML files, each
of which averages about 8K.  It takes about 1.25 hours to index (on a 1.8GHz
machine).  I use basically the standard configuration (mergeFactor, etc.)
and I've got about 30 fields per document.  I add about 200 new ones per
day.  I don't recall how long it takes to index the 200 (I do it
through a background task), but it takes a couple of minutes to merge the
new 200 document index with the master index.

HTH,

Terry

- Original Message -
From: "Michael Barry" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, February 24, 2003 2:00 PM
Subject: Indexing Tips and Hints


> All,
>I'm in need of some pointers, hints or tips on indexing large
collections
> of data. I know I saw some tips on this list before but when I tried
> searching
> the list, I came up blank.
>I have a large collection of XML files (336000 files around 5K
> apiece) that I'm
> indexing and it's taking quite a bit of time (27 hours). I've played
> around with the
> mergeFactor, RAMDirectories and multiple threads (X number of threads
> indexing
> a subset of the data and then merging the indexes at the end) but I
> cannot seem
> to bring the time down. I'm probably not doing these things properly but
> from
> what I read I believe I am.  Maybe this is the best I can do with this
> data but I
> would be really grateful to hear how others have tackled this same issue.
>As always pointers to places in the mailing list archive or other
> places would be
> appreciated.
>
> Thanks, Mike.
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Syntax Problem - Maybe solved

2003-02-16 Thread Terry Steichen
Responding to my own problem - I think it may not be a Lucene issue, but a
web one.  I'm passing the query to Lucene via a web browser, and I now
believe that what's happening is that the automatic URL encoding/decoding is
stripping off the leading (at least) "+" sign (treating it as an encoded
space).  That would explain the behavior.  I will confirm this later on
today.

Regards,

Terry

- Original Message -
From: "Terry Steichen" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Saturday, February 15, 2003 8:56 PM
Subject: Re: Syntax Problem


> Christoph,
>
> Same basic result:
>
> +(cloning clone) +animal yields 1072 hits
> (cloning OR clone) AND animal yields 19 hits.
> (cloning clone) AND animal yields 19 hits.
>
> Regards,
>
> Terry
>
> - Original Message -
> From: "Christoph Kiehl" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Sent: Saturday, February 15, 2003 7:41 PM
> Subject: Re: Syntax Problem
>
>
> > Terry Steichen wrote:
> > > I have an index which, when searched with this query ("cloning clone
> > > animal") produces 1103 hits.  A different, more narrow query
> > > ("(cloning clone) AND animal") produces only 19 hits.
> >
> > AFAIK the terms in your queries are by default concatenated by OR. This
> > means "cloning clone animal" == "cloning OR clone OR animal".
> >
> > > What's puzzling to me is that if I try a different (but supposedly
> > > identical) form of the more narrow query ("+(cloning clone)
> > > +animal"), it produces 1103 hits rather than the 19 that I expect.
> > >
> > > In other words, "+(cloning clone) +animal" appears to be the
> > > equivalent of "cloning OR clone OR animal" rather than "(cloning OR
> > > clone) AND animal".
> >
> > Hm, strange. I would expect "+(cloning clone) +animal" being translated
> > to "(cloning OR clone) AND animal". I just tried it here. The translation
> > is done as I expected. Perhaps you could try the last query ("(cloning OR
> > clone) AND animal") and compare the result size with the one from
> > "+(cloning clone) +animal" (even if both seem to be the same as "(cloning
> > clone) AND animal" ;)?
> >
> > Christoph
> >
> >
> >
> >
> >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
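
For anyone hitting the same symptom: a raw "+" in a URL query string is
decoded as a space on the server side, so the Lucene query has to be
URL-encoded before it goes into the link. A hedged sketch - the query value
and the parameter name are illustrative:

    import java.net.URLEncoder;

    public class EncodeQuery {
        public static void main(String[] args) throws Exception {
            // Encode so "+" survives the HTTP round trip as %2B.
            String rawQuery = "+(cloning clone) +animal";
            String encoded = URLEncoder.encode(rawQuery, "UTF-8");
            System.out.println("search?query=" + encoded);
        }
    }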




Re: Syntax Problem

2003-02-15 Thread Terry Steichen
Christoph,

Same basic result:

+(cloning clone) +animal yields 1072 hits
(cloning OR clone) AND animal yields 19 hits.
(cloning clone) AND animal yields 19 hits.

Regards,

Terry

- Original Message -
From: "Christoph Kiehl" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Saturday, February 15, 2003 7:41 PM
Subject: Re: Syntax Problem


> Terry Steichen wrote:
> > I have an index which, when searched with this query ("cloning clone
> > animal") produces 1103 hits.  A different, more narrow query
> > ("(cloning clone) AND animal") produces only 19 hits.
>
> AFAIK the terms in your queries are by default concatenated by OR. This
> means "cloning clone animal" == "cloning OR clone OR animal".
>
> > What's puzzling to me is that if I try a different (but supposedly
> > identical) form of the more narrow query ("+(cloning clone)
> > +animal"), it produces 1103 hits rather than the 19 that I expect.
> >
> > In other words, "+(cloning clone) +animal" appears to be the
> > equivalent of "cloning OR clone OR animal" rather than "(cloning OR
> > clone) AND animal".
>
> Hm, strange. I would expect "+(cloning clone) +animal" being translated to
> "(cloning OR clone) AND animal". I just tried it here. The translation is
> done as I expected. Perhaps you could try the last query ("(cloning OR
> clone) AND animal") and compare the result size with the one from
> "+(cloning clone) +animal" (even if both seem to be the same as "(cloning
> clone) AND animal" ;)?
>
> Christoph
>
>
>
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Syntax Problem

2003-02-15 Thread Terry Steichen
I have an index which, when searched with this query ("cloning clone animal") produces 
1103 hits.  A different, more narrow query ("(cloning clone) AND animal") produces 
only 19 hits.

What's puzzling to me is that if I try a different (but supposedly identical) form of 
the more narrow query ("+(cloning clone) +animal"), it produces 1103 hits rather than 
the 19 that I expect.

In other words, "+(cloning clone) +animal" appears to be the equivalent of "cloning OR 
clone OR animal" rather than "(cloning OR clone) AND animal".

Am I misunderstanding something about the "+ -" syntax, or is this some kind of bug?

Regards,

Terry




Re: Computing Relevancy Differently

2003-02-10 Thread Terry Steichen
Doug,

That's excellent.  Just what I've been looking for.  I'll start
experimenting shortly.

Regards,

Terry

- Original Message -
From: "Doug Cutting" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, February 10, 2003 1:57 PM
Subject: Re: Computing Relevancy Differently


> Terry Steichen wrote:
> > Can you give me an idea of what to replace the lengthNorm() method with
> > to, for example, remove any special weight given to shorter matching
> > documents?
>
> The goal of the default implementation is not to give any special weight
> to shorter documents, but rather to remove the advantage longer
> documents have.  Longer documents are likely to have more matches simply
> because they contain more terms.  Also, for the query "foo", a document
> containing just "foo" is a better match than a longer one containing
> "foo bar baz", since the match is more exact.
>
> However, one problem with this approach can be that very short documents
> are in fact not very informative.  Thus a bias against very short
> documents is sometimes useful.
>
> > I can certainly go through a bunch of trial-and-error efforts, but it
> > would help if I had some grasp of the logic initially.
> >
> > For example, from DefaultSimilarity, here's the lengthNorm() method:
> >
> >   public float lengthNorm(String fieldName, int numTerms) {
> > return (float)(1.0 / Math.sqrt(numTerms));
> >   }
> >
> > Should I (for the purpose of eliminating any size bias) override it to
> > always return a 1?
>
> That's something to try, although, as mentioned above, I suspect your
> top hits will be dominated by long documents.  Try it.  It's really not
> a difficult experiment!
>
> One trick I've used to keep very short documents from dominating
> results, that, while good matches, are not informative documents, is to
> override this with something like:
>
> public float lengthNorm(String fieldName, int numTerms) {
>   return super.lengthNorm(fieldName, Math.max(numTerms, 100));
> }
>
> This way all fields shorter than 100 terms are scored like fields
> containing 100 terms.  Long documents are still normalized, but search
> is biased a bit against very short documents.
>
> > How would I boost the headline field here? Is that how you are supposed
> > to use the (presently unused) fieldName parameter?  If that's the case, I
> > assume I would logically (to do what I'm trying to do) make this factor
> > greater than 1 for the 'headline' field, and 1 for all other fields?
>
> You could do that here too.  So, for example, you could do something like:
>
> public float lengthNorm(String fieldName, int numTerms) {
>   float n = super.lengthNorm(fieldName, Math.max(numTerms, 100));
>   if (fieldName.equals("headline"))
>     n *= 4.0f;
>   return n;
> }
>
> Equivalently, you could create your documents with something like:
>
>    Document d = new Document();
>    Field f = Field.Text("headline", headline);
>    f.setBoost(4.0f);
>    ...
>
> But headlines tend to be short, and naturally benefit from the default
> lengthNorm implementation.  So what you really might want is something
> like:
>
> public float lengthNorm(String fieldName, int numTerms) {
>   if (fieldName.equals("headline"))
>     return 4.0f * super.lengthNorm(fieldName, numTerms);
>   else
>     return super.lengthNorm(fieldName, Math.max(numTerms, 100));
> }
>
> This is probably what I'd try first.
>
> Doug
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
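
Pulling Doug's pieces together, a complete Similarity subclass might look
like the sketch below. The 4.0f headline factor and the 100-term floor are
the illustrative values from the message above, and the wiring comments
assume the setSimilarity() hooks present in later development builds.

    import org.apache.lucene.search.DefaultSimilarity;

    // Floor every field at 100 terms, and give "headline" a flat 4x boost.
    public class NewsSimilarity extends DefaultSimilarity {
        public float lengthNorm(String fieldName, int numTerms) {
            if (fieldName.equals("headline"))
                return 4.0f * super.lengthNorm(fieldName, numTerms);
            else
                return super.lengthNorm(fieldName, Math.max(numTerms, 100));
        }
    }

    // Wiring sketch (method names assume later builds):
    //   writer.setSimilarity(new NewsSimilarity());    // at index time
    //   searcher.setSimilarity(new NewsSimilarity());  // at search time

Note that the same Similarity should be in effect at both index and search
time, since lengthNorm() is baked into the index when documents are added.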




Re: Computing Relevancy Differently

2003-02-08 Thread Terry Steichen
Doug,

Can you give me an idea of what to replace the lengthNorm() method with to,
for example, remove any special weight given to shorter matching documents?
I can certainly go through a bunch of trial-and-error efforts, but it would
help if I had some grasp of the logic initially.

For example, from DefaultSimilarity, here's the lengthNorm() method:

  public float lengthNorm(String fieldName, int numTerms) {
    return (float)(1.0 / Math.sqrt(numTerms));
  }

Should I (for the purpose of eliminating any size bias) override it to
always return a 1?

How would I boost the headline field here? Is that how you are supposed to
use the (presently unused) fieldName parameter?  If that's the case, I
assume I would logically (to do what I'm trying to do) make this factor
greater than 1 for the 'headline' field, and 1 for all other fields?


Regards,

Terry

- Original Message -
From: "Doug Cutting" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Friday, February 07, 2003 2:37 PM
Subject: Re: Computing Relevancy Differently


> Terry Steichen wrote:
> > I read all the relevant references I could find in the Users (not
> > Developers) list, and I still don't exactly know what to do.
> >
> > What I'd like to do is get a relevancy-based order in which (a) longer
> > documents tend to get more weight than shorter ones, (b) a document body
> > with 'X' instances of a query term gets a higher ranking than one with
> > fewer than 'X' instances, and (c) a term found in the headline (usually
> > in addition to finding the same term in the body) is more highly ranked
> > than one with the term only in the body.
>
> In the latest sources this can all be done by defining your own
> Similarity implementation.  You can make longer documents score higher
> by overriding the lengthNorm() method.  You can boost headlines there,
> or with Field.setBoost(), or at query time with Query.setBoost().
>
> Doug
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Score-Limited Hits?

2003-02-03 Thread Terry Steichen
Is there an existing API that allows you to conduct a search such that only hits with 
a score greater than X are returned?

Regards,

Terry
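
There was no such cutoff in the API at the time, but since Hits come back in
descending score order, a loop can simply stop at the first hit below the
threshold. A sketch - the threshold value and the "headline" field name are
illustrative:

    import java.io.IOException;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Searcher;

    public class ScoreCutoff {
        // Print only hits scoring at or above 'threshold'.
        public static void printAbove(Searcher searcher, Query query,
                                      float threshold) throws IOException {
            Hits hits = searcher.search(query);
            for (int i = 0; i < hits.length()
                            && hits.score(i) >= threshold; i++) {
                System.out.println(hits.doc(i).get("headline")
                                   + "  (" + hits.score(i) + ")");
            }
        }
    }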




Re: regarding Query parser for relational operators

2003-02-03 Thread Terry Steichen
Nellai,

That depends on how you've represented your date fields.  Please check
through the archives, 'cause there was quite a bit of discussion on this not
too long ago.

Regards,

Terry

- Original Message -
From: "Nellai" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, February 03, 2003 9:38 AM
Subject: Re: regarding Query parser for relational operators


> Hi Terry,
> Yup! You're right.
> I need the results between a range of two dates only. Is there any way to
> implement it?
>
> Thanks in advance
> Nellai...
> - Original Message -
> From: "Terry Steichen" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Monday, February 03, 2003 7:50 PM
> Subject: Re: regarding Query parser for relational operators
>
>
> > Nellai,
> >
> > Sounds like you want to use a range query.
> >
> > Regards,
> >
> > Terry
> >
> > - Original Message -
> > From: "Nellai" <[EMAIL PROTECTED]>
> > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > Sent: Monday, February 03, 2003 5:10 AM
> > Subject: regarding Query parser for relational operators
> >
> >
> > Hi,
> >
> > Is there any way to filter the search based on the modified date? For
> > example, I need to fetch only those documents whose modified date is >
> > or < a given date, or between two dates. Can anyone help me solve this?
> >
> > Thanks a ton
> > Nellai...
> >
> >
> >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: regarding Query parser for relational operators

2003-02-03 Thread Terry Steichen
Nellai,

Sounds like you want to use a range query.

Regards,

Terry

- Original Message -
From: "Nellai" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, February 03, 2003 5:10 AM
Subject: regarding Query parser for relational operators


Hi,

Is there any way to filter the search based on the modified date? For
example, I need to fetch only those documents whose modified date is > or <
a given date, or between two dates. Can anyone help me solve this?

Thanks a ton
Nellai...




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
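
If the modified date is indexed as a lexicographically sortable keyword
string such as "20030203", a RangeQuery does exactly this. A sketch - the
field name "mod_date" and the inclusive endpoints are illustrative:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.RangeQuery;

    public class DateRange {
        // Documents whose mod_date falls between 'from' and 'to' inclusive.
        // Works because "yyyymmdd" strings sort the same way the dates do.
        public static Query between(String from, String to) {
            return new RangeQuery(new Term("mod_date", from),
                                  new Term("mod_date", to),
                                  true);  // true = endpoints included
        }
    }

For a one-sided > or < query, passing null for one of the two terms should
give an open-ended range, though that is worth verifying against the Lucene
version in use.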




Re: '-' character not interpreted correctly in field names

2003-02-03 Thread Terry Steichen
I believe that the tokenizer treats a dash as a token separator.  Hence, the
only way, as I recall, to eliminate this behavior is to modify
QueryParser.jj so it doesn't do this.  However, doing this can cause some
other problems, like hyphenated words at a line break and the like.

(Of course, if you do make such a change, you'll have to go back and reindex
after such a change.)

I've run into this problem myself and I've 'punted' -  on certain fields,
when I index, I replace the dash with an underscore.  This isn't a really
good solution, and it does require me to keep remembering in which fields I have
to do this substitution in the search.  But, for the moment it works.  I'll
probably go back and make some kind of change later, when I have more time.

HTH,

Terry

- Original Message -
From: "hermit" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Monday, February 03, 2003 2:39 AM
Subject: '-' character not interpreted correctly in field names


> Hello!
>
> I have a problem, a big one. I have successfully indexed 600 MB of XML
> data, but the search can't give any results if the field contains any
> '-' characters.
> For example: compound@cgx-code:[2 - 5] must match at least two results
> based on my XML data but it gives nothing.
>
> Can you suggest a simple solution? Or is it a bug?
>
> The Hermit
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
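
For what it's worth, a one-line sketch of the substitution described above.
The key point is that the same normalization must be applied to the field
value at index time and to the user's input at query time, or the terms will
never match (the method name and example value are illustrative):

    public class DashFix {
        // Apply identically when indexing and when building the query.
        public static String normalize(String value) {
            return value.replace('-', '_');  // "cgx-code" -> "cgx_code"
        }
    }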




Re: Wildchars in phrase

2003-02-02 Thread Terry Steichen
Lukas,

I believe that "this" is a stop word, so it is stripped out.

Regards,

Terry

- Original Message - 
From: "Lukas Zapletal" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Sunday, February 02, 2003 11:47 AM
Subject: Wildchars in phrase


> Hello all!
> 
> Why am I not able to use wildcards in a phrase? Something like this doesn't
> work:
> 
> "Th* is a phrase."
> 
> This works properly: "This is a phrase."
> 
> Can you help me? I guess you can ;-)
> 
> -- 
> Lukas Zapletal  [[EMAIL PROTECTED]]
> http://www.tanecni-olomouc.cz/lzap
> 
> 
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Wildchar based search?? |

2003-02-02 Thread Terry Steichen
Leo,

From my experience, as I update the index (without optimizing), the number
of physical index files grows.  I typically use the number of files as an
indicator as to when optimization is required.  While I don't think Lucene
itself has any API to check this, a shell script or the application can do
so easily.

HTH,

Terry

- Original Message -
From: "Leo Galambos" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Sunday, February 02, 2003 9:48 AM
Subject: Re: Wildchar based search?? |


> On Sat, 1 Feb 2003, Rishabh Bajpai wrote:
>
> > Also, I remember reading somewhere that one had to build the index in
> > some special way, but since you say no, I will take that. I don't
> > remember where I read it anyway, so there's no point asking about
> > something I am myself not sure of.
>
> I remember only one problem related to the indexing phase - the
> ``optimize'' function. If you update your index, no one can tell you
> whether you must also call optimize() or not.
>
> If you do not call it, it may slow down queries (I do not know by how much,
> but Otis said so). If you call it, it slows down the indexing phase (I
> have tested it and the cost is significant).
>
> AFAIK Lucene cannot tell you when the index becomes dirty so that you must
> call optimize. On the other hand it does not affect small indexes, where
> optimize() costs nothing.
>
> Otis, I think that this still holds. Right?
>
> -g-
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
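
A sketch of the file-count heuristic Terry describes, with the threshold and
the index path purely illustrative (the right number depends on mergeFactor
and how often the index is updated):

    import java.io.File;
    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class MaybeOptimize {
        // Optimize once the index directory accumulates too many files.
        public static void run(String path) throws IOException {
            File indexDir = new File(path);
            if (indexDir.list().length > 30) {          // arbitrary threshold
                IndexWriter writer =
                    new IndexWriter(indexDir, new StandardAnalyzer(), false);
                writer.optimize();                      // collapse segments
                writer.close();
            }
        }
    }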




Re: Computing Relevancy Differently

2003-01-26 Thread Terry Steichen
I admit to a bit of frustration.

With the past several messages, I simply asked (or, more accurately, tried
to ask) how to alter the way that Lucene ranks relevancy, and I asked
whether the selective boost mechanism might do the trick.  I admitted that I
don't know (nor care to know) the theory behind how relevancy is computed.

So far I've been told to review the archives (which I've done), and then
this (which I don't understand - see my embedded [==>]comments below).

What's next? Seems that I'm getting a message: "Figure it out on your own,
you dummy." Maybe I've gotten on the wrong list by mistake?

Terry

- Original Message -
From: "Leo Galambos" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Sunday, January 26, 2003 11:56 AM
Subject: Re: Computing Relevancy Differently


> 1) Lucene uses the Vector model; if you want to use a different model

==>I have no idea of what that means, nor what the alternative to the
"Vector model" might be.

>you must understand what you are doing

==>which I don't, as I've already stated several times.

>and you must change similarity calculations.

==>which means what? Is that part of Lucene?

>AFAIK you would set the normalization factor to a constant value (1.0 or
so).

==>Does this mean not to use boost?

> 2) you are trying to search for DATA, not INFORMATION. It is a big
> difference. For your task, you could rather use simpler engine that is
> based on RDBMS and B+.

==>I didn't know I was excluding one for the other.  Do I interpret all this
to mean Lucene can't be adjusted to do what I was asking?  That it's too
complicated?





--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




Re: Computing Relevancy Differently

2003-01-26 Thread Terry Steichen
I read all the relevant references I could find in the Users (not
Developers) list, and I still don't exactly know what to do.

Let me explain a bit more.  The documents I index are all news stories.  The
typical document body ranges in size from 200 to 2000 words.  The document
is structured into a couple of dozen indexed fields, but nearly all
searching is done in two: the headline and the body.

What I'd like to do is get a relevancy-based order in which (a) longer
documents tend to get more weight than shorter ones, (b) a document body
with 'X' instances of a query term gets a higher ranking than one with fewer
than 'X' instances. and (c) a term found in the headline (usually in
addition to finding the same term in the body) is more highly ranked than
one with the term only in the body.

But that's not what happens with the default scoring, and I'd like to change
that.

I'm guessing, but maybe if I check the document length at indexing time and
boost longer documents, that will help.  Maybe I could also (at index time)
give an extra boost to the headline field.  Would that be the most I could
do without changing the Lucene core source?

Regards,

Terry

PS: I'm also wondering whether the fact that I have so many other fields
may affect the ranking in a way that diminishes the relevance of the
headline and/or body fields.

PPS: I'd just like to clarify another point.  Much of the background
information on the scoring algorithms is beyond me and I have no interest
whatsoever in pushing the boundaries of this part of the technology.  All I
want to do is use it so it comes out in a way that seems reasonable (without
having to become an expert in the complex theory behind this).

- Original Message -
From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Saturday, January 25, 2003 2:09 AM
Subject: Re: Computing Relevancy Differently


> Check the lucene-user archives, search for subject "custom scoring api
> questions"
> I think that may give you the answer
>
> Otis
>
>
> --- Terry Steichen <[EMAIL PROTECTED]> wrote:
> > How would one go about altering the formula for relevancy?  (That is,
> > which modules and which code?)  I'm certain that the current
> > algorithm is well founded in logic and probably works well in many
> > environments.
> >
> > However, I find that, as I index news stories, the current algorithm
> > frequently doesn't produce meaningful rankings.  In previous
> > discussions in this list about relevancy, the algorithm seemed to be
> > very complex, possibly too complex for my poor brain to fully grasp.
> > But I'd like to try some other options and see if they result in
> > rankings more in line with what my average viewer would expect.
> >
> > Regards,
> >
> > Terry
> >
> >
>
>
> __
> Do you Yahoo!?
> Yahoo! Mail Plus - Powerful. Affordable. Sign up now.
> http://mailplus.yahoo.com
>
> --
> To unsubscribe, e-mail:
<mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail:
<mailto:[EMAIL PROTECTED]>
>
>


--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
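
On the index-time guesses in the message above: with the per-field boost
support Otis mentions (in the nightly builds of the time), the headline
boost and a crude long-document boost could be sketched as below. The 4.0f
and 1.2f factors and the 5000-character cutoff are illustrative, not tested
values:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class NewsDoc {
        // Sketch only: assumes the setBoost() methods from the development
        // builds discussed in this thread.
        public static Document build(String headlineText, String bodyText) {
            Document doc = new Document();
            Field headline = Field.Text("headline", headlineText);
            headline.setBoost(4.0f);            // favor headline matches
            doc.add(headline);
            doc.add(Field.Text("body", bodyText));
            if (bodyText.length() > 5000)       // crude "longer is better"
                doc.setBoost(1.2f);
            return doc;
        }
    }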




Computing Relevancy Differently

2003-01-24 Thread Terry Steichen
How would one go about altering the formula for relevancy?  (That is, which modules 
and which code?)  I'm certain that the current algorithm is well founded in logic and 
probably works well in many environments.  

However, I find that, as I index news stories, the current algorithm frequently 
doesn't produce meaningful rankings.  In previous discussions in this list about 
relevancy, the algorithm seemed to be very complex, possibly too complex for my poor 
brain to fully grasp.  But I'd like to try some other options and see if they result 
in rankings more in line with what my average viewer would expect.

Regards,

Terry




Re: Interpreting the score associated with the Term? |

2003-01-23 Thread Terry Steichen
Otis,

I think the effort you made in your previous message (to describe the basic
relevance measures in simple, non-algorithmic terms) is very important.  If
you think that list is reasonably comprehensive (that is, it captures most
of what relevance means), I'd urge you to insert this into the
documentation.  I think it is very valuable.

Regards,

Terry

- Original Message -
From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Thursday, January 23, 2003 12:02 PM
Subject: Re: Interpreting the score associated with the Term? |


> Yes, I believe so.
>
> --- Terry Steichen <[EMAIL PROTECTED]> wrote:
> > Otis,
> >
> > Didn't somebody (Doug?) also mention that a keyword in a shorter
> > document is
> > deemed more significant than in a longer one (because, I guess, it
> > represents a larger percentage of the document)?
> >
> > Regards,
> >
> > Terry
> > - Original Message -
> > From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
> > To: "Lucene Users List" <[EMAIL PROTECTED]>;
> > <[EMAIL PROTECTED]>
> > Sent: Thursday, January 23, 2003 10:58 AM
> > Subject: Re: Interpreting the score associated with the Term? |
> >
> >
> > > Here is a simplified explanation of some basic stuff.
> > >
> > > 1. the more frequent the term (in a collection) the lower its weight
> > > (significance).  Makes sense - very popular words don't distinguish
> > > one document from the other much, because they are present in so many
> > > docs.
> > >
> > > 2. the more frequent a word in a single document, the higher the
> > > document's 'value' when the query contains that word.  So the score
> > > goes up for frequent words in a document, esp. if they are not
> > > frequent in other documents in the collection.
> > >
> > > 3. there is a boost factor which allows you to boost certain terms
> > > at query time (e.g. you value matches in the title field more than
> > > the body field?  boost title field queries)
> > >
> > > 4. the normalization factor, I believe, normalizes things so that
> > > longer documents don't have an advantage over shorter ones.
> > >
> > > There is more to this but I am already not 100% sure about all of
> > > the above, so I'll stop here :)
> > >
> > > Also note that you can boost fields at index time (you'll have to
> > > use the nightly build for that instead of the 1.2 release to get
> > > this, I believe).
> > >
> > > Otis
> > >
> > >
> > > --- Rishabh Bajpai <[EMAIL PROTECTED]> wrote:
> > > >
> > > > Hi All,
> > > >
> > > > I am using Lucene as a Search Engine for my work. I am new to
> > > > this, so forgive me if I am asking a cliched question!
> > > >
> > > > I need to understand how the SCORE for the search TERMs is
> > > > calculated for Lucene, so that indexing can appropriately be
> > > > designed to return the most relevant results, when searched.
> > > >
> > > > On the official FAQ page of the Lucene site, a formula is listed
> > > > as
> > > > score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t *
> > > > boost_t) * coord_q_d
> > > > where:
> > > >   score_d   : score for document d
> > > >   sum_t     : sum for all terms t
> > > >   tf_q      : the square root of the frequency of t in the query
> > > >   tf_d      : the square root of the frequency of t in d
> > > >   idf_t     : log(numDocs/docFreq_t+1) + 1.0
> > > >   numDocs   : number of documents in index
> > > >   docFreq_t : number of documents containing t
> > > >   norm_q    : sqrt(sum_t((tf_q*idf_t)^2))
> > > >   norm_d_t  : square root of number of tokens in d in the same
> > > >               field as t
> > > >   boost_t   : the user-specified boost for term t
> > > >   coord_q_d : number of terms in both query and document / number
> > > >               of terms in query
> > > >
> > > > I did not find the formula too helpful in figuring out what
> > > > exactly the score is trying to calculate.
> > > >
> > > > I want to know of a logic that can be used for translating this
> > > > score into something that can be used for determining which Terms
> > > > are more relevant for a given Search Request.

Re: Interpreting the score asociated with the Term? |

2003-01-23 Thread Terry Steichen
Otis,

Didn't somebody (Doug?) also mention that a keyword in a shorter document is
deemed more significant than in a longer one (because, I guess, it
represents a larger percentage of the document)?

Regards,

Terry
- Original Message -
From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>;
<[EMAIL PROTECTED]>
Sent: Thursday, January 23, 2003 10:58 AM
Subject: Re: Interpreting the score asociated with the Term? |


> Here is a simplified explanation of some basic stuff.
>
> 1. the more frequent the term (in a collection) the lower its weight
> (significance).  Makes sense - very popular words don't distinguish one
> document from the other much, because they are present in so many docs.
>
> 2. the more frequent a word in a single document, the higher the
> document's 'value' when the query contains that word.  So the score goes
> up for frequent words in a document, esp. if they are not frequent in
> other documents in the collection.
>
> 3. there is a boost factor which allows you to boost certain terms at
> query time (e.g. you value matches in the title field more than the body
> field?  boost title field queries)
>
> 4. the normalization factor, I believe, normalizes things so that longer
> documents don't have an advantage over shorter ones.
>
> There is more to this but I am already not 100% sure about all of the
> above, so I'll stop here :)
>
> Also note that you can boost fields at index time (you'll have to use
> the nightly build for that instead of the 1.2 release to get this, I
> believe).
>
> Otis
>
>
> --- Rishabh Bajpai <[EMAIL PROTECTED]> wrote:
> >
> > Hi All,
> >
> > I am using Lucene as a Search Engine for my work. I am new to this,
> > so forgive me if I am asking a cliched question!
> >
> > I need to understand how the SCORE for the search TERMs is calculated
> > for Lucene, so that indexing can appropriately be designed to
> > return the most relevant results, when searched.
> >
> > On the official FAQ page of the Lucene site, a formula is listed as
> > score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t *
> > boost_t) * coord_q_d
> > where:
> >   score_d   : score for document d
> >   sum_t     : sum for all terms t
> >   tf_q      : the square root of the frequency of t in the query
> >   tf_d      : the square root of the frequency of t in d
> >   idf_t     : log(numDocs/docFreq_t+1) + 1.0
> >   numDocs   : number of documents in index
> >   docFreq_t : number of documents containing t
> >   norm_q    : sqrt(sum_t((tf_q*idf_t)^2))
> >   norm_d_t  : square root of number of tokens in d in the same field
> >               as t
> >   boost_t   : the user-specified boost for term t
> >   coord_q_d : number of terms in both query and document / number of
> >               terms in query
> >
> > I did not find the formula too helpful in figuring out what exactly
> > the score is trying to calculate.
> >
> > I want to know of a logic that can be used for translating this score
> > into something that can be used for determining which Terms are more
> > relevant for a given Search Request.
> >
> > One way would be to just assume that - the higher the score, the more
> > relevant the result. But is this assumption really valid? Or are
> > there any possible caveats to this?
> >
> > -Rishabh
> >
> >
> >
> > _
> > Get 25MB, POP3, Spam Filtering with LYCOS MAIL PLUS for $19.95/year.
> > http://login.mail.lycos.com/brandPage.shtml?pageId=plus&ref=lmtplus
> >
> > --
> > To unsubscribe, e-mail:
> > 
> > For additional commands, e-mail:
> > 
> >
>
>
> __
> Do you Yahoo!?
> Yahoo! Mail Plus - Powerful. Affordable. Sign up now.
> http://mailplus.yahoo.com
>
> --
> To unsubscribe, e-mail:

> For additional commands, e-mail:

>
>


--
To unsubscribe, e-mail:   
For additional commands, e-mail: 
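
To make the FAQ formula quoted above a little more concrete, here is a small
numeric illustration of the idf_t and tf_d pieces; all of the counts are
invented for the example:

    public class ScoreBits {
        public static void main(String[] args) {
            int numDocs = 100000;  // documents in the index (invented)
            int docFreq = 250;     // documents containing the term (invented)
            int freqInDoc = 9;     // occurrences of the term in this document

            // idf_t = log(numDocs/(docFreq_t+1)) + 1.0: rare terms weigh more
            double idf = Math.log((double) numDocs / (docFreq + 1)) + 1.0;

            // tf_d = square root of the term's frequency in the document
            double tf = Math.sqrt(freqInDoc);

            System.out.println("idf = " + idf + ", tf = " + tf
                               + ", tf*idf = " + tf * idf);
        }
    }

Doubling freqInDoc raises tf only by a factor of sqrt(2), which is why a
document can't climb the rankings just by repeating a term many times.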




Re: Range queries

2003-01-23 Thread Terry Steichen
Erik,

That's good.  Now I don't have to keep proving what is, is.  Glad it finally
made sense.

Regards,

Terry

- Original Message -
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, January 22, 2003 11:43 PM
Subject: Re: Range queries


> I just realized how dense I've been on this thread.  All along Terry
> has been saying he's indexing date fields as "YYYYMMDD" String fields
> and I just wasn't getting it.  I had my brain locked into thinking the
> fields were being indexed as Date fields.
>
> My apologies for the run around on this thread.  Back to your regular
> programming...
>
> On Wednesday, January 22, 2003, at 11:28  PM, Erik Hatcher wrote:
> > Ah, maybe its how we are indexing our fields differently?  How are you
> > indexing your my_date_field?  I'm using this syntax:
> >
> > Field.Keyword(fieldName, new Date())
> >
> > Maybe you are indexing it as a String with "YYYYMMDD" format?  If so,
> > that explains it.
> >
> > Erik
>
>
> --
> To unsubscribe, e-mail:

> For additional commands, e-mail:

>


--
To unsubscribe, e-mail:   
For additional commands, e-mail: 
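
For the record, the two indexing styles being contrasted in this thread,
side by side (the field name is illustrative, and in practice you would pick
one style per field, not both):

    import java.util.Date;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class DateFields {
        public static void addDate(Document doc) {
            // Style 1: a real Date, stored via Lucene's internal encoding.
            doc.add(Field.Keyword("my_date_field", new Date()));

            // Style 2: a sortable "YYYYMMDD" string, which is what makes
            // hand-typed range queries over the field behave as expected.
            doc.add(Field.Keyword("my_date_field", "20030122"));
        }
    }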



