Re: Zip Files

2005-03-01 Thread Luke Shannon
Thanks Ernesto.

The issue I'm working with now (this is more lack of experience than
anything) is getting an input I can index. All my indexing classes (doc,
pdf, xml, ppt) take a File object as a parameter and return a Lucene
Document containing all the fields I need.

I'm struggling with how I can work with an array of bytes instead of a
Java File.

It would be easier to unzip the zip to a temp directory, parse the files and
then delete the directory. But this would greatly slow indexing and use up
disk space.
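
Something like this is what I have in mind (an untested sketch; it assumes my
indexing classes are refactored to accept an InputStream instead of a File):

import java.io.*;
import java.util.zip.*;

ZipInputStream zis = new ZipInputStream(new FileInputStream(zipFile));
ZipEntry entry;
while ((entry = zis.getNextEntry()) != null) {
    // copy the current entry into memory
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    int n;
    while ((n = zis.read(buf)) != -1) {
        baos.write(buf, 0, n);
    }
    // wrap the bytes so a parser can read them like a stream
    InputStream entryStream = new ByteArrayInputStream(baos.toByteArray());
    // pick the parser based on the extension in entry.getName()
}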

Luke

- Original Message - 
From: "Ernesto De Santis" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Tuesday, March 01, 2005 10:48 AM
Subject: Re: Zip Files


> Hello
>
> First, you need a parser for each file type: pdf, txt, word, etc.
> Then use the Java API to iterate over the zip contents; see:
>
> http://java.sun.com/j2se/1.4.2/docs/api/java/util/zip/ZipInputStream.html
>
> Use the getNextEntry() method.
>
> little example:
>
> ZipInputStream zis = new ZipInputStream(fileInputStream);
> ZipEntry zipEntry;
> while ((zipEntry = zis.getNextEntry()) != null) {
> //use zipEntry to get the name, etc.
> //get the proper parser for the current entry
> //use the parser with zis (the ZipInputStream)
> }
>
> good luck
> Ernesto
>
> Luke Shannon wrote:
>
> >Hello;
> >
> >Anyone have any ideas on how to index the contents within zip files?
> >
> >Thanks,
> >
> >Luke
> >
> >
> >
> >
> >
> >
>
> -- 
> Ernesto De Santis - Colaborativa.net
> Córdoba 1147 Piso 6 Oficinas 3 y 4
> (S2000AWO) Rosario, SF, Argentina.
>
>
>
>






Zip Files

2005-03-01 Thread Luke Shannon
Hello;

Anyone have any ideas on how to index the contents within zip files?

Thanks,

Luke





Filtering Question

2005-02-23 Thread Luke Shannon
Hello;

I'm trying to create a Filter that only retrieves documents with a path
field containing one or more substrings.

I can get the Filter to work if the BooleanQuery below (used to create the
Filter) contains only TermQueries (this requires me to know the exact path),
but not if it contains WildcardQueries.

Here is the code to create the filter:

//if the paths parameter is null we don't use the filter
boolean useFilter = false;
BooleanQuery filterParams = new BooleanQuery();
if (paths != null) {
    useFilter = true;
    Trace.DEBUG("The query will have a filter with " + paths.size()
            + " terms.");
    Iterator path = paths.iterator();
    while (path.hasNext()) {
        String strPath = "*" + (String) path.next() + "*";
        Trace.DEBUG(strPath + " is one of the params");
        filterParams.add(new WildcardQuery(new Term("path", strPath)),
                false, false);
    }
}
Trace.DEBUG("The filter is created using this: " + filterParams);
Filter pathFilter = new QueryFilter(filterParams);
Trace.DEBUG("The filter is " + pathFilter.toString());

When useFilter is true, the search is executed with pathFilter as the second
parameter -> hits = searcher.search(query, pathFilter);

Here is a small sample of the output. Without the filter this query returns
all the documents in the index; with it, 6 should come back.

*testing* is one of the params
The filter is created using this: path:*testing*
The filter is QueryFilter(path:*testing*)
The query: olfaithfull:*stillhere* returned 0

Why won't this work?

Thanks,

Luke






Re: MultiField Queries without the QueryParser

2005-02-22 Thread Luke Shannon
Responding to my own post. Please disregard.

Sorry.

- Original Message - 
From: "Luke Shannon" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Tuesday, February 22, 2005 5:16 PM
Subject: MultiField Queries without the QueryParser


> Hello;
> 
> The book mentions the MultiFieldQueryParser as one way of dealing with
> multifield queries. Can someone point me in the direction of other ways?
> 
> Thanks,
> 
> Luke
> 
> 
> 
> 





MultiField Queries without the QueryParser

2005-02-22 Thread Luke Shannon
Hello;

The book mentions the MultiFieldQueryParser as one way of dealing with
multifield queries. Can someone point me in the direction of other ways?
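
(One alternative I have been looking at, sketched from the BooleanQuery API
rather than anything confirmed on the list; the field names are made up:)

BooleanQuery multi = new BooleanQuery();
multi.add(new TermQuery(new Term("title", "lucene")), false, false); // optional (OR)
multi.add(new TermQuery(new Term("body", "lucene")), false, false);  // optional (OR)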

Thanks,

Luke






Re: Optional Terms in a single query

2005-02-21 Thread Luke Shannon
Hi Todd;

Thanks for your help.

I was able to do what you said, though in a much uglier way, using a
BooleanQuery and adding WildcardQueries.

The end result looks like this:

The query: +(type:138) +((-name:*tim* -name:*bill* -name:*harry*
+olfaithfull:stillhere))

But this one works as expected.

Thanks!

Luke
- Original Message - 
From: "Todd VanderVeen" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Monday, February 21, 2005 6:26 PM
Subject: Re: Optional Terms in a single query


> Luke Shannon wrote:
>
> >The API I'm working with combines a series of queries into one larger one
> >using a boolean query.
> >
> >Queries on the same field get OR'd into one big query. All remaining
> >queries are AND'd with this big one.
> >
> >Working within this system I have:
> >
> >arg = (mario luigi bobby joe) //i do have control of how this list is
> >created
> >
> >I pass this to the QueryParser:
> >
> >Query query1 = QueryParser.parse(arg, "name", new StandardAnalyzer());
> >Query query2 = QueryParser.parse("stillhere", "olfaithfull", new
> >StandardAnalyzer());
> >BooleanQuery typeNegativeSearch = new BooleanQuery();
> >typeNegativeSearch.add(query1, false, true);
> >typeNegativeSearch.add(query2, true, false);
> >
> >This is half the query.
> >
> >It gets AND'd with the other half, to create what you see below:
> >
> >+(type:181) +((-(name:tim name:harry name:bill) +olfaithfull:stillhere))
> >
> >What I am having trouble with is getting the QueryParser to create
> >this: -name:(tim bill harry)
> >
> >I feel like this is something simple, but for some reason I can't figure
> >it out.
> >
> >Thanks,
> >
> >Luke
> >
> >
> >
> Is the API something you control?
>
> Let's call the other half of your query query3. To avoid the extra nesting
> you need to do the composition in a single boolean query.
>
> Query query1 = QueryParser.parse(arg, "name", new StandardAnalyzer());
> Query query2 = QueryParser.parse("stillhere", "olfaithfull", new
StandardAnalyzer());
> Query query3 = 
>
> BooleanQuery finalQuery = new BooleanQuery();
> finalQuery.add(query1, false, true);
> finalQuery.add(query2, true, false);
> finalQuery.add(query3, true, false);
>
> Cheers,
> Todd VanderVeen
>
>






Re: Optional Terms in a single query

2005-02-21 Thread Luke Shannon
The API I'm working with combines a series of queries into one larger one
using a boolean query.

Queries on the same field get OR'd into one big query. All remaining queries
are AND'd with this big one.

Working within this system I have:

arg = (mario luigi bobby joe) //i do have control of how this list is created

I pass this to the QueryParser:

Query query1 = QueryParser.parse(arg, "name", new StandardAnalyzer());
Query query2 = QueryParser.parse("stillhere", "olfaithfull", new
StandardAnalyzer());
BooleanQuery typeNegativeSearch = new BooleanQuery();
typeNegativeSearch.add(query1, false, true);
typeNegativeSearch.add(query2, true, false);

This is half the query.

It gets AND'd with the other half, to create what you see below:

+(type:181) +((-(name:tim name:harry name:bill) +olfaithfull:stillhere))

What I am having trouble with is getting the QueryParser to create
this: -name:(tim bill harry)

I feel like this is something simple, but for some reason I can't figure it
out.
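
(A guess at the missing piece, not something confirmed in the thread:
QueryParser's grouped field syntax might produce that clause directly when it
is combined with a required term:)

Query q = QueryParser.parse("+olfaithfull:stillhere -name:(tim bill harry)",
        "name", new StandardAnalyzer());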

Thanks,

Luke

- Original Message - 
From: "Todd VanderVeen" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Monday, February 21, 2005 5:33 PM
Subject: Re: Optional Terms in a single query


> Luke Shannon wrote:
>
> >Hi;
> >
> >I'm trying to create a query that looks for documents where type is 181
> >and name doesn't contain tim, bill or harry.
> >
> >+(type: 181) +((-name: tim -name:bill -name:harry +oldfaith:stillHere))
> >+(type: 181) +((-name: tim OR bill OR harry +oldfaith:stillHere))
> >+(type: 181) +((-name:*(tim bill harry)* +olfaithfull:stillhere))
> >+(type:1 81) +((-name:*(tim OR bill OR harry)* +olfaithfull:stillhere))
> >
> >I would really like to do this all in one Query. Is this even possible?
> >
> >Thanks,
> >
> >Luke
> >
> >
> >
> >
> >
> >
> Are all the queries listed attempts at the same thing?
>
> I'm guessing you want this:
>
> +type:181 -name:(tim bill harry) +oldfaith:stillHere
>
>
>
>






Re: Optional Terms in a single query

2005-02-21 Thread Luke Shannon
Sorry about the typos.

What I would like is a document with a type field = 181,
olfaithfull=stillHere and a name field not containing tim, bill or harry.

Thanks,

Luke

- Original Message - 
From: "Paul Elschot" <[EMAIL PROTECTED]>
To: 
Sent: Monday, February 21, 2005 5:31 PM
Subject: Re: Optional Terms in a single query


> On Monday 21 February 2005 23:23, Luke Shannon wrote:
> > Hi;
> >
> > I'm trying to create a query that looks for documents where type is 181
> > and name doesn't contain tim, bill or harry.
>
> type: 181  -(name: tim name:bill name:harry)
>
> > +(type: 181) +((-name: tim -name:bill -name:harry +oldfaith:stillHere))
>
> stillHere is normally lowercased before searching. Is that ok?
>
> > +(type: 181) +((-name: tim OR bill OR harry +oldfaith:stillHere))
> > +(type: 181) +((-name:*(tim bill harry)* +olfaithfull:stillhere))
>
> typo? olfaithfull
>
> > +(type:1 81) +((-name:*(tim OR bill OR harry)* +olfaithfull:stillhere))
>
> typo? (type:1 81)
>
> > I would really like to do this all in one Query. Is this even possible?
>
> How would you want to combine the results?
>
> Regards,
> Paul Elschot
>
>
>






Optional Terms in a single query

2005-02-21 Thread Luke Shannon
Hi;

I'm trying to create a query that looks for documents where type is 181 and
name doesn't contain tim, bill or harry.

+(type: 181) +((-name: tim -name:bill -name:harry +oldfaith:stillHere))
+(type: 181) +((-name: tim OR bill OR harry +oldfaith:stillHere))
+(type: 181) +((-name:*(tim bill harry)* +olfaithfull:stillhere))
+(type:1 81) +((-name:*(tim OR bill OR harry)* +olfaithfull:stillhere))

I would really like to do this all in one Query. Is this even possible?

Thanks,

Luke






Handling Synonyms

2005-02-21 Thread Luke Shannon
Hello;

Does anyone see a problem with the following approach?

For synonyms, rather than putting them in the index, I put the original term
and all the synonyms in the query.

Every time I create a query, I check if the term has any synonyms. If it
does, I create a BooleanQuery OR'ing one Query object for each synonym.

So if I have a synonym list:

red = colour, primary, stop

And if someone wants to search the desc field for red, I would end up with
something like:

( (desc:*red*) (desc:*colour*) (desc:*primary*) (desc:*stop*) )

Now the synonyms wouldn't be in the index; the Query would account for all
the possible synonym terms.
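
(For reference, a minimal sketch of that expansion, using the wildcard style
above and the add(query, required, prohibited) flags; the synonym lookup
itself is assumed:)

BooleanQuery expanded = new BooleanQuery();
String[] terms = { "red", "colour", "primary", "stop" }; // original term + synonyms
for (int i = 0; i < terms.length; i++) {
    // every clause optional, so the clauses OR together
    expanded.add(new WildcardQuery(new Term("desc", "*" + terms[i] + "*")),
            false, false);
}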

Luke






More Analyzer Question

2005-02-18 Thread Luke Shannon
I have created an Analyzer that I think should just convert to lower
case and add synonyms in the index (it is at the end of the email).

The problem is, after running it I get one more result than I was expecting
(Document 1 should not be there):

Running testNameCombination1, expecting: 1 result
The query: +(type:138) +(name:mario*) returned 2

Start Listing documents:

Document: 0 contains:
Name: Text
Desc: Text


Document: 1 contains:
Name: Text
Desc: Text

End Listing documents

Those same 2 documents in Luke look like this:

Document 0
Text
Text

Document 1
Text
Text

That looks correct to me. The query shouldn't match Document 1.

The analyzer used on this field is below and is applied like so:

//set the default
PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(
        new SynonymAnalyzer(new FBSynonymEngine()));

//the analyzer for the name field (only converts to lower case and adds
//synonyms)
analyzer.addAnalyzer("name", new KeywordSynonymAnalyzer(
        new FBSynonymEngine()));

Any help would be appreciated.

Thanks,

Luke


import org.apache.lucene.analysis.*;
import java.io.Reader;

public class KeywordSynonymAnalyzer extends Analyzer {
    private SynonymEngine engine;

    public KeywordSynonymAnalyzer(SynonymEngine engine) {
        this.engine = engine;
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new SynonymFilter(new LowerCaseTokenizer(reader), engine);
        return result;
    }
}







Luke Shannon | Software Developer
FutureBrand Toronto

207 Queen's Quay, Suite 400
Toronto, ON, M5J 1A7
416 642 7935 (office)






Re: Analyzing Advice

2005-02-18 Thread Luke Shannon
This is exactly what I was looking for.

Thanks

- Original Message - 
From: "Steven Rowe" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Friday, February 18, 2005 4:41 PM
Subject: Re: Analyzing Advice


> Luke Shannon wrote:
> > But now that I'm looking at the API I'm not sure I can specifiy a
> > different analyzer when creating a field.
>
> Is PerFieldAnalyzerWrapper what you're looking for?
>
>
> <http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html>
>
> Steve
>
>






Analyzing Advice

2005-02-18 Thread Luke Shannon
Hi;

I have a situation where my synonyms aren't working for a particular
field.

When I looked at the indexing I noticed it was a Keyword, thus not
tokenized.

The problem is that when I switched that field to Text (now tokenized with my
SynonymAnalyzer) a bunch of queries broke that were testing for starting
with or ending with a specific string. My SynonymAnalyzer wraps a
StandardAnalyzer, which acts as I would like for all fields but this one. I
don't want to change the tokenizing behavior everywhere; only this one
field's data must remain unaltered.

I was hoping to make an Analyzer, one that just applied the synonyms, that I
could use on the one field when I added it to the Document. But now
that I'm looking at the API I'm not sure I can specify a different analyzer
when creating a field.

Any tips?

Thanks,

Luke






Re: Lucene in the Humanities

2005-02-18 Thread Luke Shannon
Nice work, Erik. I would like to spend more time playing with it, but I saw a
few things I really liked. When a specific query turns up no results you
prompt the client to perform a free-form search. Less savvy search users
will benefit from this strategy. I also like the display of information when
you select a result. Everything is at your fingertips without clutter.

I did get this error when a name search failed to turn up results and I
clicked 'help' in the free-form search row (the second row).

Here is my browser info:

Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.5) Gecko/20041107
Firefox/1.0

Below are the details from the error:

 Page 'help-freeform.html' not found in application namespace.

 Stack Trace:
   org.apache.tapestry.resolver.PageSpecificationResolver.resolve(PageSpecificationResolver.java:120)
   org.apache.tapestry.pageload.PageSource.getPage(PageSource.java:144)
   org.apache.tapestry.engine.RequestCycle.getPage(RequestCycle.java:195)
   org.apache.tapestry.engine.PageService.service(PageService.java:73)
   org.apache.tapestry.engine.AbstractEngine.service(AbstractEngine.java:872)
   org.apache.tapestry.ApplicationServlet.doService(ApplicationServlet.java:197)
   org.apache.tapestry.ApplicationServlet.doGet(ApplicationServlet.java:158)
   javax.servlet.http.HttpServlet.service(HttpServlet.java:740)
   javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
   org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:247)
   org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:193)
   org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:256)
   org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
   org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
   org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
   org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
   org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
   org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
   org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
   org.apache.catalina.core.StandardContext.invoke(StandardContext.java:2422)
   org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:180)
   org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
   org.apache.catalina.valves.ErrorDispatcherValve.invoke(ErrorDispatcherValve.java:171)
   org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:641)
   org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:163)
   org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:641)
   org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
   org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
   org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:174)
   org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
   org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
   org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
   org.apache.ajp.tomcat4.Ajp13Processor.process(Ajp13Processor.java:457)
   org.apache.ajp.tomcat4.Ajp13Processor.run(Ajp13Processor.java:576)
   java.lang.Thread.run(Thread.java:534)

Luke

- Original Message - 
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: "Lucene User" 
Sent: Friday, February 18, 2005 2:46 PM
Subject: Lucene in the Humanities


> It's about time I actually did something real with Lucene  :)
>
> I have been working with the Applied Research in Patacriticism group at
> the University of Virginia for a few months and finally ready to
> present what I've been doing.  The primary focus of my group is working
> with the Rossetti Archive - poems, artwork, interpretations,
> collections, and so on of Dante Gabriel Rossetti.  I was initially
> brought on to build a collection and exhibit system, though I got
> detoured a bit as I got involved in applying Lucene to the archive to
> replace their existing search system.  The existing system used an old
> version of Tamino with XPath queries.  Tamino is not at fault here, at
> least not entirely, because our data is in a very complicated set of
> XML files with a lot of non-normalized and legacy metadata - getting at
> things via XPath is challenging and practically impossible in many
> cases.
>
> My work is now presentable at
>
> http://www.rossettiarchive.org/rose
>
> (rose is for ROsetti SEarch

Re: Query Question

2005-02-18 Thread Luke Shannon
Thanks Erik. Option 2 sounds like the path of least resistance.
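
(A minimal sketch of what option 2 might look like; the '#' placeholder is an
arbitrary choice, not anything Erik specified:)

// index time: swap literal '*' for a character with no wildcard meaning
String stored = rawName.replace('*', '#'); // "home*" becomes "home#"
doc.add(Field.Keyword("name", stored));

// query time: apply the same substitution before building the WildcardQuery
Query q = new WildcardQuery(new Term("name", "*home#*"));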

Luke
- Original Message - 
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Thursday, February 17, 2005 9:05 PM
Subject: Re: Query Question


> On Feb 17, 2005, at 5:51 PM, Luke Shannon wrote:
> > My manager is now totally stuck on being able to query data with * 
> > in it.
> 
> He's gonna have to wait a bit longer, you've got a slightly tricky 
> situation on your hands
> 
> > WildcardQuery(new Term("name", "*home\**"));
> 
> The \* is the problem.  WildcardQuery doesn't deal with escaping like 
> you're trying.  Your query is essentially this now:
> 
> home\*
> 
> Where backslash has no special meaning at all... you're literally 
> looking for all terms that start with home followed by a backslash.  
> Two asterisks at the end really collapse into a single one logically.
> 
> > Any theories as to why the it would not match:
> >
> > Document (relevant fields):
> > Keyword
> > Keyword
> >
> > Is the \ escaping both * characters?
> 
> So, again, no escaping is being done here.  You're a bit stuck in this 
> situation because * (and ?) are special to WildcardQuery, and it does 
> no escaping.  Two options I think of:
> 
> - Build your own clone of WildcardQuery that does escaping - or 
> perhaps change the wildcard characters to something you do not index 
> and use those instead.
> 
> - Replace asterisks in the terms indexed with some other non-wildcard 
> character, then replace it on your queries as appropriate.
> 
> Erik
> 
> 
> 





Re: Query Question

2005-02-17 Thread Luke Shannon
Hello;

My manager is now totally stuck on being able to query data with * in it.

Here are two queries.

TermQuery(new Term("type", "203"));
WildcardQuery(new Term("name", "*home\**"));

They are joined in a boolean query. That query gives this result when you
call the toString():

+(type:203) +(name:*home\**)

This looks right to me.

Any theories as to why it would not match:

Document (relevant fields):
Keyword
Keyword

Is the \ escaping both * characters?

Thanks,

Luke




----- Original Message - 
From: "Luke Shannon" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Thursday, February 17, 2005 2:44 PM
Subject: Query Question


> Hello;
>
> Why won't this query find the document below?
>
> Query:
> +(type:203) +(name:*home\**)
>
> Document (relevant fields):
> Keyword
> Keyword
>
> I was hoping by escaping the * it would be treated as a string. What am I
> doing wrong?
>
> Thanks,
>
> Luke
>
>
>
>






Re: Query Question

2005-02-17 Thread Luke Shannon
That is the query toString(). I created the Query using a WildcardQuery
object.

Luke

- Original Message - 
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Thursday, February 17, 2005 3:00 PM
Subject: Re: Query Question


>
> On Feb 17, 2005, at 2:44 PM, Luke Shannon wrote:
>
> > Hello;
> >
> > Why won't this query find the document below?
> >
> > Query:
> > +(type:203) +(name:*home\**)
>
> Is that what the query toString is?  Or is that what you handed to
> QueryParser?
>
> Depending on your analyzer, 203 may go away.  QueryParser doesn't
> support leading asterisks, so "*home" would fail to parse.
>
> > Document (relevant fields):
> > Keyword
> > Keyword
> >
> > I was hoping by escaping the * it would be treated as a string. What
> > am I
> > doing wrong?
>
>
>






Query Question

2005-02-17 Thread Luke Shannon
Hello;

Why won't this query find the document below?

Query:
+(type:203) +(name:*home\**)

Document (relevant fields):
Keyword
Keyword

I was hoping by escaping the * it would be treated as a string. What am I
doing wrong?

Thanks,

Luke






Searches Contain Special Characters

2005-02-17 Thread Luke Shannon
Hi All;

How could I handle doing a wildcard search on the input *mario?

Basically I would be interested in finding all the Documents containing
*mario

Here is an example of such a Query generated:

+(type:138) +(name:**mario*)

How can I let Lucene know that the star closest to mario on the left is to
be treated as a literal character, not a wildcard?

Thanks,

Luke






Re: Negative Match

2005-02-11 Thread Luke Shannon
Thanks Erik. This is indeed the way to go.
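
(A minimal sketch of that dummy-field idea, index side and query side, using
the add(query, required, prohibited) flags:)

// index time: record which fields a document actually carries
doc.add(Field.Keyword("fields", "kcfileupload")); // only for docs that have one

// query time: require the marker term, prohibit the wildcard match
BooleanQuery q = new BooleanQuery();
q.add(new TermQuery(new Term("fields", "kcfileupload")), true, false);    // required
q.add(new WildcardQuery(new Term("kcfileupload", "*jpg*")), false, true); // prohibited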


- Original Message - 
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Friday, February 11, 2005 10:25 AM
Subject: Re: Negative Match


> 
> On Feb 11, 2005, at 9:52 AM, Luke Shannon wrote:
> 
> > Hey Erik;
> >
> > The problem with that approach is I get documents that don't have a
> > kcfileupload field. This makes sense because these documents don't
> > match the prohibited clause, but doesn't fit with the requirements of
> > the system.
> 
> Ok, so instead of using the dummy field with a single dummy value, use 
> a dummy field to list the field names.  
> Field.Keyword("fields","kcfileupload"), but only for the documents that 
> should have it, of course.  Then use a query like (using QueryParser 
> syntax, but do it with the API as you have since QueryParser doesn't 
> support leading wildcards):
> 
> +fields:kcfileupload -kcfileupload:*jpg*
> 
> Again, your approach is risky with term expansion.  Get more than 1,024 
> unique kcfileupload values and you'll see!
> 
> Erik
> 
> 
> >
> > What I like best about this approach is it doesn't require a filter.
> > The system I integrate with is presently designed to accept a query
> > object. I wasn't looking forward to having to add the possibility that
> > queries might require filters. I may have to still do this, but for now
> > I would like to try this and see how it goes.
> >
> > Thanks,
> >
> > Luke
> >
> > - Original Message -
> > From: "Erik Hatcher" <[EMAIL PROTECTED]>
> > To: "Lucene Users List" 
> > Sent: Thursday, February 10, 2005 7:23 PM
> > Subject: Re: Negative Match
> >
> >
> >>
> >> On Feb 10, 2005, at 4:06 PM, Luke Shannon wrote:
> >>
> >>> I think I found a pretty good way to do a negative match.
> >>>
> >>> In this query I am looking for all the Documents that have a
> >>> kcfileupload
> >>> field with any value except for jpg.
> >>>
> >>> Query negativeMatch = new WildcardQuery(new
> >>> Term("kcfileupload",
> >>> "*jpg*"));
> >>>  BooleanQuery typeNegAll = new BooleanQuery();
> >>> Query allResults = new WildcardQuery(new Term("kcfileupload",
> >>> "*"));
> >>> IndexSearcher searcher = new IndexSearcher(fsDir);
> >>> BooleanClause clause = new BooleanClause(negativeMatch, 
> >>> false,
> >>> true);
> >>> typeNegAll.add(allResults, true, false);
> >>> typeNegAll.add(clause);
> >>> Hits hits = searcher.search(typeNegAll);
> >>>
> >>> With the little testing I have done this *seems* to work. Does anyone
> >>> see a
> >>> problem with this approach?
> >>
> >> Sure... do you realize what WildcardQuery does under the covers?  It
> >> literally expands to a BooleanQuery for all terms that match the
> >> pattern.  There is an adjustable limit built-in of 1,024 clauses to
> >> BooleanQuery.  You obviously have not hit that limit ... yet!
> >>
> >> You're better off using the advice offered on this thread
> >> previously: create a single dummy field with a fixed value for all
> >> documents.  Combine a TermQuery for that dummy value with a prohibited
> >> clause like your negativeMatch above.
> >>
> >> Erik
> >>
> >>



Re: Negative Match

2005-02-11 Thread Luke Shannon
Hey Erik;

The problem with that approach is I get documents that don't have a
kcfileupload field. This makes sense because these documents don't match the
prohibited clause, but it doesn't fit with the requirements of the system.

What I like best about this approach is it doesn't require a filter. The
system I integrate with is presently designed to accept a query object. I
wasn't looking forward to having to add the possibility that queries might
require filters. I may have to still do this, but for now I would like to
try this and see how it goes.

Thanks,

Luke

- Original Message - 
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Thursday, February 10, 2005 7:23 PM
Subject: Re: Negative Match


>
> On Feb 10, 2005, at 4:06 PM, Luke Shannon wrote:
>
> > I think I found a pretty good way to do a negative match.
> >
> > In this query I am looking for all the Documents that have a
> > kcfileupload
> > field with any value except for jpg.
> >
> > Query negativeMatch = new WildcardQuery(new
> > Term("kcfileupload",
> > "*jpg*"));
> >  BooleanQuery typeNegAll = new BooleanQuery();
> > Query allResults = new WildcardQuery(new Term("kcfileupload",
> > "*"));
> > IndexSearcher searcher = new IndexSearcher(fsDir);
> > BooleanClause clause = new BooleanClause(negativeMatch, false,
> > true);
> > typeNegAll.add(allResults, true, false);
> > typeNegAll.add(clause);
> > Hits hits = searcher.search(typeNegAll);
> >
> > With the little testing I have done this *seems* to work. Does anyone
> > see a
> > problem with this approach?
>
> Sure... do you realize what WildcardQuery does under the covers?  It
> literally expands to a BooleanQuery for all terms that match the
> pattern.  There is an adjustable limit built-in of 1,024 clauses to
> BooleanQuery.  You obviously have not hit that limit ... yet!
>
> You're better off using the advice offered on this thread
> previously: create a single dummy field with a fixed value for all
> documents.  Combine a TermQuery for that dummy value with a prohibited
> clause like your negativeMatch above.
>
> Erik
>
>



Negative Match

2005-02-10 Thread Luke Shannon
I think I found a pretty good way to do a negative match.

In this query I am looking for all the Documents that have a kcfileupload
field with any value except for jpg.

Query negativeMatch = new WildcardQuery(new Term("kcfileupload", "*jpg*"));
BooleanQuery typeNegAll = new BooleanQuery();
Query allResults = new WildcardQuery(new Term("kcfileupload", "*"));
IndexSearcher searcher = new IndexSearcher(fsDir);
BooleanClause clause = new BooleanClause(negativeMatch, false, true);
typeNegAll.add(allResults, true, false);
typeNegAll.add(clause);
Hits hits = searcher.search(typeNegAll);

With the little testing I have done this *seems* to work. Does anyone see a
problem with this approach?

Thanks,

Luke






Re: Problem searching Field.Keyword field

2005-02-10 Thread Luke Shannon
Are there any issues with having a bunch of BooleanQueries and then adding
them to one big BooleanQuery (making them all required)?

Or should I be looking at Query.combine()?

Thanks,

Luke
- Original Message - 
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Tuesday, February 08, 2005 12:02 PM
Subject: Re: Problem searching Field.Keyword field


Kelvin - I respectfully disagree - could you elaborate on why this is
not an appropriate use of Field.Keyword?

If the category is "How To", Field.Text would split this (depending on
the Analyzer) into "how" and "to".

If the user is selecting a category from a drop-down, though, you
shouldn't be using QueryParser on it, but instead aggregating a
TermQuery("category", "How To") into a BooleanQuery with the rest of
it.  The rest may be other API created clauses and likely a piece from
QueryParser.

Erik


On Feb 8, 2005, at 11:28 AM, Kelvin Tan wrote:

> As I posted previously, Field.Keyword is appropriate in only certain
> situations. For your use-case, I believe Field.Text is more suitable.
>
> k
>
> On Tue, 8 Feb 2005 10:02:19 -0600, Mike Miller wrote:
>> This may or may not be correct, but I am indexing it as a keyword
>> because I provide a (required) radio button on the add screen for
>> the user to determine which category the document should be
>> assigned. Then in the search, provide a dropdown that can be used
>> in the advanced search so that they can search only for a specific
>> category of documents (like HowTo, Troubleshooting, etc).
>>
>> -Original Message-
>> From: Kelvin Tan [mailto:[EMAIL PROTECTED] Sent: Tuesday,
>> February 08, 2005 9:32 AM To: Lucene Users List
>> Subject: RE: Problem searching Field.Keyword field
>>
>> Mike, is there a reason why you're indexing "category" as keyword
>> not text?
>>
>> k
>>
>> On Tue, 8 Feb 2005 08:26:13 -0600, Mike Miller wrote:
>>
>>> Thanks for the quick response.
>>>
>>> Sorry for my lack of understanding, but I am learning! Won't the
>>> query parser still handle this query? My limited understanding
>>> was that the search call provides the 'all' field as default
>>> field for query terms in the case where fields aren't specified.
>>> Using the current code, searches like author:Mike and
>>> title:Lucene work fine.
>>>
>>> -Original Message-
>>> From: Miles Barr [mailto:[EMAIL PROTECTED] Sent:
>>> Tuesday, February 08, 2005 8:08 AM To: Lucene Users List Subject:
>>> Re: Problem searching Field.Keyword field
>>>
>>> You're using the query parser with the standard analyser. You
>>> should construct a term query manually instead.
>>>
>>>
>>> --
>>> Miles Barr <[EMAIL PROTECTED]> Runtime Collective Ltd.
>>>
>
>
>






Re: Starts With x and Ends With x Queries

2005-02-07 Thread Luke Shannon
I implemented this concept for my ends with query. It works very well!
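
(A minimal sketch of the reversed-field trick described below; the field
names and the Field.Keyword calls are just illustrative:)

// index time: store the term and its reverse
String name = "Adam Smith";
String rname = new StringBuffer(name).reverse().toString(); // "htimS madA"
doc.add(Field.Keyword("name", name));   // "starts with" = PrefixQuery on name
doc.add(Field.Keyword("rname", rname)); // "ends with" = PrefixQuery on rname

// query time: reverse the suffix and prefix-match the reversed field
Query endsWithSmith = new PrefixQuery(
        new Term("rname", new StringBuffer("Smith").reverse().toString()));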

- Original Message - 
From: "Chris Hostetter" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Friday, February 04, 2005 9:37 PM
Subject: Re: Starts With x and Ends With x Queries


>
> : Also keep in mind that QueryParser only allows a trailing asterisk,
> : creating a PrefixQuery.  However, if you use a WildcardQuery directly,
> : you can use an asterisk as the starting character (at the risk of
> : performance).
>
> On the issue of "ends with" wildcard queries, I wanted to throw out an
> idea that I've seen used to deal with matches like this in other systems.
> I've never actually tried this with Lucene, but I've seen it used
> effectively with other systems where the goal is to "sort" strings by the
> least significant (ie: right most) characters first.  I think it could
> apply nicely to people who have compelling needs for efficent 'ends with'
> queries.
>
>
>
> Imagine you have a field call name, which you can already do efficient
> prefix matching on using the PrefixQuery class.  Your docs and query may
> look something like this...
>
>D1> name:"Adam Smith" age:13 state:CA ...
>D2> name:"Joe Bob" age:42 state:WA ...
>D3> name:"John Adams" age:35 state:NV ...
>D3> name:"Sue Smith" age:33 state:CA ...
>
> ...and your queries may look something like...
>
>Query q1 = new PrefixQuery(new Term("name","J*"));
>Query q2 = new PrefixQuery(new Term("name","Sue*"));
>
> If you want to start doing suffix queries (ie: all names ending with
> "s", or all names ending with "Smith") one approach would be to use
> WildcardQuery, which as Erik mentioned, will allow you to use a query Term
> that starts with a "*". ie...
>
>Query q3 = new WildcardQuery(new Term("name","*s"));
>Query q4 = new WildcardQuery(new Term("name","*Smith"));
>
> (NOTE: Erik says you can do this, but the docs for WildcardQuery say you
> can't. I'll assume the docs are wrong and Erik is correct.)
>
> The problem is that this is horrendously inefficient.  In order to find
> the docs that contain Terms which match your suffix, WildcardQuery must
> first identify what all of those Terms are, by iterating over every Term
> in your index to see if they match the suffix.  This is much slower than a
> PrefixQuery, or even a WildcardQuery that has just 1 initial character
> before a "*" (ie: "s*foobar"), because it can then seek to directly to the
> first Term that starts with that character, and also stop iterating as
> soon as it encounters a Term that no longer begins with that character.
>
> Which leads me to my point: if you denormalize your data so that you store
> both the Term you want, and the *reverse* of the term you want, then a
> Suffix query is just a Prefix query on a reversed field -- by sacrificing
> space, you can get all the speed efficiencies of a PrefixQuery when doing
> a SuffixQuery...
>
>D1> name:"Adam Smith" rname:"htimS madA" age:13 state:CA ...
>D2> name:"Joe Bob" rname:"boB oeJ" age:42 state:WA ...
>D3> name:"John Adams" rname:"smadA nhoJ" age:35 state:NV ...
>D3> name:"Sue Smith" rname:"htimS euS" age:33 state:CA ...
>
>Query q1 = new PrefixQuery(new Term("name","J*"));
>Query q2 = new PrefixQuery(new Term("name","Sue*"));
>Query q3 = new PrefixQuery(new Term("rname","s*"));
>Query q4 = new PrefixQuery(new Term("rname","htimS*"));
>
>
> (If anyone sees a flaw in my theory, please chime in)
>
>
> -Hoss
>
>
>
>






Re: RangeQuery With Date

2005-02-07 Thread Luke Shannon
Bingo. Thanks!
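
(A minimal sketch of the indexing side this implies; the SimpleDateFormat
pattern and the Field.Keyword call are assumptions on my part:)

import java.text.SimpleDateFormat;
import java.util.Date;

// store dates as yyyyMMdd so lexicographic term order equals date order
String modified = new SimpleDateFormat("yyyyMMdd").format(new Date());
doc.add(Field.Keyword("modified", modified)); // e.g. "20041111"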

Luke

- Original Message - 
From: "Chris Hostetter" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Monday, February 07, 2005 5:10 PM
Subject: Re: RangeQuery With Date


> : Your dates need to be stored in lexicographical order for the RangeQuery
> : to work.
> :
> : Index them using this date format: YYYYMMDD.
> :
> : Also, I'm not sure if the QueryParser can handle range queries with only
> : one end point. You may need to create this query programmatically.
>
> and when creating them programmatically, you need to use the exact same
> format they were indexed in.  Assuming I've correctly guessed what your
> indexing code looks like, you probably want...
>
> Query query = new RangeQuery(null, new Term("modified", "20041111"), false);
>
>
>
>
> -Hoss
>
>
>






RangeQuery With Date

2005-02-07 Thread Luke Shannon
Hi;

I am working on a set of queries that allow you to find modification dates
before, after and equal to a given date.

Here are some of the "before" queries I have been playing with. I want a
query that pulls up documents modified before Nov 11 2004:

Query query = new RangeQuery(null, new Term("modified", "11/11/04"), false);

This one doesn't work. It turns up all the documents in the index.

Query query = QueryParser.parse("modified:[1/1/00 TO 11/11/04]", "subject",
new StandardAnalyzer());

This works but I don't like having to specify the begin date like this.

Query query = QueryParser.parse("modified:[null TO 11/11/04]", "subject",
new StandardAnalyzer());

This throws an exception.

How are others doing a Query like this?

Thanks,

Luke






Starts With x and Ends With x Queries

2005-02-04 Thread Luke Shannon
Hello;

I have these two documents:

Text
Keyword
Text
Text
Text
Text
Text
Text
Text
Text
Text


Text
Text
Text
Keyword
Keyword
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text

I would like to be able to match name fields that start with testing
(specifically) and those that end with it.

I thought the code below would parse to a PrefixQuery that would satisfy my
"starts with" requirement (maybe I don't understand what this query is for).
But this matches both.

Query query = QueryParser.parse("testing*", "name", new StandardAnalyzer());
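
(A guess worth trying, not something from the thread: construct the
PrefixQuery directly so no analyzer rewrites the term.)

Query startsWith = new PrefixQuery(new Term("name", "testing"));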

Has anyone done this before? Any tips?

Thanks,

Luke






Re: Parsing The Query: Every document that doesn't have a field containing x (but still has the field)

2005-02-04 Thread Luke Shannon
Hello;

I think Chris's approach might be helpful, but I can't seem to get it to
work.

So since I'm running out of time and I still need to figure out "starts with"
and "ends with" queries, I have implemented a hacky solution for getting all
documents with a kcfileupload field present that does not contain jpg:
query1 = QueryParser.parse("jpg", "kcfileupload", new StandardAnalyzer());
query2 = QueryParser.parse("stillhere", "olfaithfull",
        new StandardAnalyzer()); //each document contains this
BooleanQuery typeNegativeSearch = new BooleanQuery();
typeNegativeSearch.add(query1, false, true);
typeNegativeSearch.add(query2, true, false);

What gets returned is all the documents without kcfileupload = jpg. This
includes documents that don't even have a kcfileupload field.

When I go through the results before displaying I check to make sure there
is a "kcfileupload" field.

This is not a good solution, and I hope to replace it soon. If anyone has
ideas please let me know.

Luke

- Original Message - 
From: "Chris Hostetter" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Friday, February 04, 2005 3:03 PM
Subject: Re: Parsing The Query: Every document that doesn't have a field
containing x



Another approach...

You can make a Filter that is the inverse of the output from another
filter, which means you can make a QueryFilter on the search, then wrap it
in your inverse Filter.

you can't execute a query on a filter without having a Query object, but
you can just apply the Filter directly to an IndexReader yourself, and get
back a BitSet containing the docIds of every document that does not contain
your term.

something like this should work...

class NotFilter extends Filter {
    private Filter wrapped;
    public NotFilter(Filter w) {
        wrapped = w;
    }
    public BitSet bits(IndexReader r) throws IOException {
        BitSet b = wrapped.bits(r);
        b.flip(0, b.size());
        return b;
    }
}
...
BitSet results = (new NotFilter
    (new QueryFilter
        (new TermQuery(new Term("f", "x"))))).bits(reader);




: Date: Thu, 3 Feb 2005 19:51:36 +0100
: From: Kelvin Tan <[EMAIL PROTECTED]>
: Reply-To: Lucene Users List 
: To: Lucene Users List 
: Subject: Re: Parsing The Query: Every document that doesn't have a field
: containing x
:
: Alternatively, add a dummy field-value to all documents, like
: doc.add(Field.Keyword("foo", "bar"))
:
: Waste of space, but allows you to perform negated queries.
:
: On Thu, 03 Feb 2005 19:19:15 +0100, Maik Schreiber wrote:
: >> Negating a term must be combined with at least one nonnegated
: >> term to return documents; in other words, it isn't possible to
: >> use a query like NOT term to find all documents that don't
: >> contain a term.
: >>
: >> So does that mean the above example wouldn't work?
: >>
: > Exactly. You cannot search for "-kcfileupload:jpg", you need at
: > least one clause that actually _includes_ documents.
: >
: > Do you by chance have a field with known contents? If so, you could
: > misuse that one and include it in your query (perhaps by doing
: > range or wildcard/prefix search). If not, try IndexReader.terms()
: > for building a Query yourself, then use that one for search.
:
:
:



-Hoss





Re: Parsing The Query: Every document that doesn't have a field containing x

2005-02-04 Thread Luke Shannon
Hi Chris;

So the result would contain all documents that don't have field f containing
x?

What I need to figure out how to do is return all documents that have a
field f where f does not contain x.

Thanks for your post.

Luke


- Original Message - 
From: "Chris Hostetter" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Friday, February 04, 2005 3:03 PM
Subject: Re: Parsing The Query: Every document that doesn't have a field
containing x



Another approach...

You can make a Filter that is the inverse of the output from another
filter, which means you can make a QueryFilter on the search, then wrap it
in your inverse Filter.

you can't execute a query on a filter without having a Query object, but
you can just apply the Filter directly to an IndexReader yourself, and get
back a BitSet containing the docIds of every document that does not contain
your term.

something like this should work...

class NotFilter extends Filter {
    private Filter wrapped;
    public NotFilter(Filter w) {
        wrapped = w;
    }
    public BitSet bits(IndexReader r) throws IOException {
        BitSet b = wrapped.bits(r);
        b.flip(0, b.size());
        return b;
    }
}
...
BitSet results = (new NotFilter
    (new QueryFilter
        (new TermQuery(new Term("f", "x"))))).bits(reader);




: Date: Thu, 3 Feb 2005 19:51:36 +0100
: From: Kelvin Tan <[EMAIL PROTECTED]>
: Reply-To: Lucene Users List 
: To: Lucene Users List 
: Subject: Re: Parsing The Query: Every document that doesn't have a field
: containing x
:
: Alternatively, add a dummy field-value to all documents, like
: doc.add(Field.Keyword("foo", "bar"))
:
: Waste of space, but allows you to perform negated queries.
:
: On Thu, 03 Feb 2005 19:19:15 +0100, Maik Schreiber wrote:
: >> Negating a term must be combined with at least one nonnegated
: >> term to return documents; in other words, it isn't possible to
: >> use a query like NOT term to find all documents that don't
: >> contain a term.
: >>
: >> So does that mean the above example wouldn't work?
: >>
: > Exactly. You cannot search for "-kcfileupload:jpg", you need at
: > least one clause that actually _includes_ documents.
: >
: > Do you by chance have a field with known contents? If so, you could
: > misuse that one and include it in your query (perhaps by doing
: > range or wildcard/prefix search). If not, try IndexReader.terms()
: > for building a Query yourself, then use that one for search.
:
:
:



-Hoss





Re: Parsing The Query: Every document that doesn't have a field containing x

2005-02-04 Thread Luke Shannon
- Original Message - 
From: <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Friday, February 04, 2005 2:12 AM
Subject: Re: Parsing The Query: Every document that doesn't have a field
containing x


I think you can use a filter to get the right result!
See the example below:
package lia.advsearching;

import junit.framework.TestCase;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;

public class SecurityFilterTest extends TestCase {
  private RAMDirectory directory;

  protected void setUp() throws Exception {
directory = new RAMDirectory();
IndexWriter writer = new IndexWriter(directory,
new WhitespaceAnalyzer(), true);

// Elwood
Document document = new Document();
document.add(Field.Keyword("owner", "elwood"));
document.add(Field.Text("keywords", "elwoods sensitive info"));
writer.addDocument(document);

// Jake
document = new Document();
document.add(Field.Keyword("owner", "jake"));
document.add(Field.Text("keywords", "jakes sensitive info"));
writer.addDocument(document);

writer.close();
  }

  public void testSecurityFilter() throws Exception {
TermQuery query = new TermQuery(new Term("keywords", "info"));

IndexSearcher searcher = new IndexSearcher(directory);
Hits hits = searcher.search(query);
assertEquals("Both documents match", 2, hits.length());

QueryFilter jakeFilter = new QueryFilter(
new TermQuery(new Term("owner", "jake")));

hits = searcher.search(query, jakeFilter);
assertEquals(1, hits.length());
assertEquals("elwood is safe",
"jakes sensitive info", hits.doc(0).get("keywords"));
  }

}


On Thu, 3 Feb 2005 13:04:50 -0500, Luke Shannon
<[EMAIL PROTECTED]> wrote:
> Hello;
>
> I have a query that finds documents that contain fields with a specific
> value.
>
> query1 = QueryParser.parse("jpg", "kcfileupload", new StandardAnalyzer());
>
> This works well.
>
> I would like a query that finds documents containing all kcfileupload
> fields that don't contain jpg.
>
> The example I found in the book that seems to relate shows me how to find
> documents without a specific term:
>
> QueryParser parser = new QueryParser("contents", analyzer);
> parser.setOperator(QueryParser.DEFAULT_OPERATOR_AND);
>
> But then it says:
>
> Negating a term must be combined with at least one nonnegated term to
> return documents; in other words, it isn't possible to use a query like
> NOT term to find all documents that don't contain a term.
>
> So does that mean the above example wouldn't work?
>
> The API says:
>
>  a plus (+) or a minus (-) sign, indicating that the clause is required or
> prohibited respectively;
>
> I have been playing around with using the minus character without much
> luck.
>
> Can someone point me in the right direction to figure this out?
>
> Thanks,
>
> Luke
>
>
>





Re: Parsing The Query: Every document that doesn't have a field containing x

2005-02-04 Thread Luke Shannon
Very Nice. Thanks!

Luke

- Original Message - 
From: <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Friday, February 04, 2005 2:12 AM
Subject: Re: Parsing The Query: Every document that doesn't have a field
containing x


I think you can use a filter to get the right result!
See the example below:
package lia.advsearching;

import junit.framework.TestCase;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;

public class SecurityFilterTest extends TestCase {
  private RAMDirectory directory;

  protected void setUp() throws Exception {
directory = new RAMDirectory();
IndexWriter writer = new IndexWriter(directory,
new WhitespaceAnalyzer(), true);

// Elwood
Document document = new Document();
document.add(Field.Keyword("owner", "elwood"));
document.add(Field.Text("keywords", "elwoods sensitive info"));
writer.addDocument(document);

// Jake
document = new Document();
document.add(Field.Keyword("owner", "jake"));
document.add(Field.Text("keywords", "jakes sensitive info"));
writer.addDocument(document);

writer.close();
  }

  public void testSecurityFilter() throws Exception {
TermQuery query = new TermQuery(new Term("keywords", "info"));

IndexSearcher searcher = new IndexSearcher(directory);
Hits hits = searcher.search(query);
assertEquals("Both documents match", 2, hits.length());

QueryFilter jakeFilter = new QueryFilter(
new TermQuery(new Term("owner", "jake")));

hits = searcher.search(query, jakeFilter);
assertEquals(1, hits.length());
assertEquals("elwood is safe",
"jakes sensitive info", hits.doc(0).get("keywords"));
  }

}


On Thu, 3 Feb 2005 13:04:50 -0500, Luke Shannon
<[EMAIL PROTECTED]> wrote:
> Hello;
>
> I have a query that finds documents that contain fields with a specific
> value.
>
> query1 = QueryParser.parse("jpg", "kcfileupload", new StandardAnalyzer());
>
> This works well.
>
> I would like a query that finds documents containing all kcfileupload
fields
> that don't contain jpg.
>
> The example I found in the book that seems to relate shows me how to find
> documents without a specific term:
>
> QueryParser parser = new QueryParser("contents", analyzer);
> parser.setOperator(QueryParser.DEFAULT_OPERATOR_AND);
>
> But then it says:
>
> Negating a term must be combined with at least one nonnegated term to
return
> documents; in other words, it isn't possible to use a query like NOT term
to
> find all documents that don't contain a term.
>
> So does that mean the above example wouldn't work?
>
> The API says:
>
>  a plus (+) or a minus (-) sign, indicating that the clause is required or
> prohibited respectively;
>
> I have been playing around with using the minus character without much
luck.
>
> Can someone point me in the right direction to figure this out?
>
> Thanks,
>
> Luke
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Parsing The Query: Every document that doesn't have a field containing x

2005-02-03 Thread Luke Shannon
Bingo! Nice catch. That was it. Made everything lower case when I set the
field. Works great now.

Thanks!

Luke

- Original Message - 
From: "Kauler, Leto S" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Thursday, February 03, 2005 6:48 PM
Subject: RE: Parsing The Query: Every document that doesn't have a field
containing x


Because you are building the query with QueryParser rather than a TermQuery, all
search terms in the query are being lowercased by StandardAnalyzer.

So your query of "olFaithFull:stillhere" requires that there is an exact
index term of "stillhere" in that field.  It depends on how you built
the index (index and stored fields are different), but I would check on
that.  Also maybe try out TermQuery and see if that does anything for
you.
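
To see the difference side by side, here is a quick untested sketch that
prints both forms (only the two query constructions matter):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class CaseCheck {
    public static void main(String[] args) throws Exception {
        // QueryParser runs the term through StandardAnalyzer, which lowercases it
        Query parsed = QueryParser.parse("stillHere", "olFaithFull",
                new StandardAnalyzer());
        System.out.println(parsed); // prints: olFaithFull:stillhere

        // TermQuery takes the term verbatim -- no analysis at all
        Query exact = new TermQuery(new Term("olFaithFull", "stillHere"));
        System.out.println(exact); // prints: olFaithFull:stillHere
    }
}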



> -Original Message-
> From: Luke Shannon [mailto:[EMAIL PROTECTED]
> Sent: Friday, 4 February 2005 10:47 AM
> To: Lucene Users List
> Subject: Re: Parsing The Query: Every document that doesn't
> have a field containing x
>
>
> "stillHere"
>
> Capital H.
>
> - Original Message - 
> From: "Kauler, Leto S" <[EMAIL PROTECTED]>
> To: "Lucene Users List" 
> Sent: Thursday, February 03, 2005 6:40 PM
> Subject: RE: Parsing The Query: Every document that doesn't
> have a field containing x
>
>
> First thing that jumps out is case-sensitivity.  Does your
> olFaithFull field contain "stillHere" or "stillhere"?
>
> --Leto
>
>
> > -Original Message-
> > From: Luke Shannon [mailto:[EMAIL PROTECTED]
> > This works:
> >
> > query1 = QueryParser.parse("jpg", "kcfileupload", new
> > StandardAnalyzer()); query2 = QueryParser.parse("stillHere",
> > "olFaithFull", new StandardAnalyzer()); BooleanQuery
> > typeNegativeSearch = new BooleanQuery();
> > typeNegativeSearch.add(query1, false, false);
> > typeNegativeSearch.add(query2, false, false);
> >
> > It returns 9 results. And in string form is: kcfileupload:jpg
> > olFaithFull:stillhere
> >
> > But this:
> >
> > query1 = QueryParser.parse("jpg", "kcfileupload", new
> > StandardAnalyzer());
> > query2 = QueryParser.parse("stillHere",
> "olFaithFull", new
> > StandardAnalyzer());
> > BooleanQuery typeNegativeSearch = new BooleanQuery();
> > typeNegativeSearch.add(query1, true, false);
> > typeNegativeSearch.add(query2, true, false);
> >
> > Returns 0 results and is in string form: +kcfileupload:jpg
> > +olFaithFull:stillhere
> >
> > If I do the query kcfileupload:jpg in Luke I get 9 docs, each doc
> > containing a olFaithFull:stillHere. Why would
> > +kcfileupload:jpg +olFaithFull:stillhere return no results?
> >
> > Thanks,
> >
> > Luke

CONFIDENTIALITY NOTICE AND DISCLAIMER

Information in this transmission is intended only for the person(s) to whom
it is addressed and may contain privileged and/or confidential information.
If you are not the intended recipient, any disclosure, copying or
dissemination of the information is unauthorised and you should
delete/destroy all copies and notify the sender. No liability is accepted
for any unauthorised use of the information contained in this transmission.

This disclaimer has been automatically added.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Parsing The Query: Every document that doesn't have a field containing x

2005-02-03 Thread Luke Shannon
"stillHere"

Capital H.

- Original Message - 
From: "Kauler, Leto S" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Thursday, February 03, 2005 6:40 PM
Subject: RE: Parsing The Query: Every document that doesn't have a field
containing x


First thing that jumps out is case-sensitivity.  Does your olFaithFull
field contain "stillHere" or "stillhere"?

--Leto


> -Original Message-
> From: Luke Shannon [mailto:[EMAIL PROTECTED]
> This works:
>
> query1 = QueryParser.parse("jpg", "kcfileupload", new
> StandardAnalyzer()); query2 = QueryParser.parse("stillHere",
> "olFaithFull", new StandardAnalyzer()); BooleanQuery
> typeNegativeSearch = new BooleanQuery();
> typeNegativeSearch.add(query1, false, false);
> typeNegativeSearch.add(query2, false, false);
>
> It returns 9 results. And in string form is: kcfileupload:jpg
> olFaithFull:stillhere
>
> But this:
>
> query1 = QueryParser.parse("jpg", "kcfileupload", new
> StandardAnalyzer());
> query2 = QueryParser.parse("stillHere",
> "olFaithFull", new StandardAnalyzer());
> BooleanQuery typeNegativeSearch = new BooleanQuery();
> typeNegativeSearch.add(query1, true, false);
> typeNegativeSearch.add(query2, true, false);
>
> Returns 0 results and is in string form: +kcfileupload:jpg
> +olFaithFull:stillhere
>
> If I do the query kcfileupload:jpg in Luke I get 9 docs, each
> doc containing a olFaithFull:stillHere. Why would
> +kcfileupload:jpg +olFaithFull:stillhere return no results?
>
> Thanks,
>
> Luke

CONFIDENTIALITY NOTICE AND DISCLAIMER

Information in this transmission is intended only for the person(s) to whom
it is addressed and may contain privileged and/or confidential information.
If you are not the intended recipient, any disclosure, copying or
dissemination of the information is unauthorised and you should
delete/destroy all copies and notify the sender. No liability is accepted
for any unauthorised use of the information contained in this transmission.

This disclaimer has been automatically added.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Parsing The Query: Every document that doesn't have a field containing x

2005-02-03 Thread Luke Shannon
This works:

query1 = QueryParser.parse("jpg", "kcfileupload", new StandardAnalyzer());
query2 = QueryParser.parse("stillHere", "olFaithFull", new
StandardAnalyzer());
BooleanQuery typeNegativeSearch = new BooleanQuery();
typeNegativeSearch.add(query1, false, false);
typeNegativeSearch.add(query2, false, false);

It returns 9 results. In string form it is: kcfileupload:jpg
olFaithFull:stillhere

But this:

query1 = QueryParser.parse("jpg", "kcfileupload", new StandardAnalyzer());
query2 = QueryParser.parse("stillHere", "olFaithFull", new
StandardAnalyzer());
BooleanQuery typeNegativeSearch = new BooleanQuery();
typeNegativeSearch.add(query1, true, false);
typeNegativeSearch.add(query2, true, false);

Returns 0 results and is in string form: +kcfileupload:jpg
+olFaithFull:stillhere

If I do the query kcfileupload:jpg in Luke I get 9 docs, each doc containing
an olFaithFull:stillHere. Why would +kcfileupload:jpg +olFaithFull:stillhere
return no results?

Thanks,

Luke

- Original Message - 
From: "Maik Schreiber" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Thursday, February 03, 2005 4:55 PM
Subject: Re: Parsing The Query: Every document that doesn't have a field
containing x


> > Yes. There should be 119 with stillHere,
>
> You have double-checked that, haven't you? :)
>
> > and if I run a query in Luke on
> > kcfileupload = ppt, it returns one result. I am thinking I should at
least
> > get this result back with: -kcfileupload:jpg +olFaithFull:stillhere?
>
> You really should.
>
> -- 
> Maik Schreiber   *   http://www.blizzy.de <-- Get GMail invites here!
>
> GPG public key:
http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x1F11D713
> Key fingerprint: CF19 AFCE 6E3D 5443 9599 18B5 5640 1F11 D713
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Parsing The Query: Every document that doesn't have a field containing x

2005-02-03 Thread Luke Shannon
I did; I have run both queries in Luke.

kcfileupload:ppt

returns 1

olFaithfull:stillhere

returns 119

Luke

- Original Message - 
From: "Maik Schreiber" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Thursday, February 03, 2005 4:55 PM
Subject: Re: Parsing The Query: Every document that doesn't have a field
containing x


> > Yes. There should be 119 with stillHere,
>
> You have double-checked that, haven't you? :)
>
> > and if I run a query in Luke on
> > kcfileupload = ppt, it returns one result. I am thinking I should at
least
> > get this result back with: -kcfileupload:jpg +olFaithFull:stillhere?
>
> You really should.
>
> -- 
> Maik Schreiber   *   http://www.blizzy.de <-- Get GMail invites here!
>
> GPG public key:
http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x1F11D713
> Key fingerprint: CF19 AFCE 6E3D 5443 9599 18B5 5640 1F11 D713
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Parsing The Query: Every document that doesn't have a field containing x

2005-02-03 Thread Luke Shannon
Yes. There should be 119 with stillHere, and if I run a query in Luke on
kcfileupload = ppt, it returns one result. I am thinking I should at least
get this result back with: -kcfileupload:jpg +olFaithFull:stillhere?

Luke

- Original Message - 
From: "Maik Schreiber" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Thursday, February 03, 2005 4:27 PM
Subject: Re: Parsing The Query: Every document that doesn't have a field
containing x


> > -kcfileupload:jpg +olFaithFull:stillhere
> >
> > This looks right to me. Why the 0 results?
>
> Looks good to me, too. You sure all your documents have
> olFaithFull:stillhere and there is at least a document with kcfileupload
not
> being "jpg"?
>
> -- 
> Maik Schreiber   *   http://www.blizzy.de <-- Get GMail invites here!
>
> GPG public key:
http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x1F11D713
> Key fingerprint: CF19 AFCE 6E3D 5443 9599 18B5 5640 1F11 D713
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Parsing The Query: Every document that doesn't have a field containing x

2005-02-03 Thread Luke Shannon
Hello,

Still working on the same query, here is the code I am currently working
with.

I am thinking this should bring up all the documents that have
olFaithFull=stillHere and kcfileupload!=jpg (so anything else)

query1 = QueryParser.parse("jpg", "kcfileupload", new StandardAnalyzer());
query2 = QueryParser.parse("stillHere", "olFaithFull", new
StandardAnalyzer());
BooleanQuery typeNegativeSearch = new BooleanQuery();
typeNegativeSearch.add(query1, false, true); // prohibited: must not match
typeNegativeSearch.add(query2, true, false); // required: must match

The toString() of the query is:

-kcfileupload:jpg +olFaithFull:stillhere

This looks right to me. Why the 0 results?

Thanks,

Luke

- Original Message - 
From: "Maik Schreiber" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Thursday, February 03, 2005 1:19 PM
Subject: Re: Parsing The Query: Every document that doesn't have a field
containing x


> > Negating a term must be combined with at least one nonnegated term to
return
> > documents; in other words, it isn't possible to use a query like NOT
term to
> > find all documents that don't contain a term.
> >
> > So does that mean the above example wouldn't work?
>
> Exactly. You cannot search for "-kcfileupload:jpg", you need at least one
> clause that actually _includes_ documents.
>
> Do you by chance have a field with known contents? If so, you could misuse
> that one and include it in your query (perhaps by doing range or
> wildcard/prefix search). If not, try IndexReader.terms() for building a
> Query yourself, then use that one for search.
>
> -- 
> Maik Schreiber   *   http://www.blizzy.de
>
> GPG public key:
http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x1F11D713
> Key fingerprint: CF19 AFCE 6E3D 5443 9599 18B5 5640 1F11 D713
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Parsing The Query: Every document that doesn't have a field containing x

2005-02-03 Thread Luke Shannon
Ok.

I have added the following to every document:

doc.add(Field.UnIndexed("olFaithfull", "stillHere"));

The plan is a query that says: olFaithfull = stillHere and kcfileupload != jpg.

I have been experimenting with the MultiFieldQueryParser, but this is not
working out for me. Syntax-wise, how is this done? Does someone have an
example of a query similar to the one I am trying?

Thanks,

Luke

- Original Message - 
From: "Maik Schreiber" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Thursday, February 03, 2005 1:19 PM
Subject: Re: Parsing The Query: Every document that doesn't have a field
containing x


> > Negating a term must be combined with at least one nonnegated term to
return
> > documents; in other words, it isn't possible to use a query like NOT
term to
> > find all documents that don't contain a term.
> >
> > So does that mean the above example wouldn't work?
>
> Exactly. You cannot search for "-kcfileupload:jpg", you need at least one
> clause that actually _includes_ documents.
>
> Do you by chance have a field with known contents? If so, you could misuse
> that one and include it in your query (perhaps by doing range or
> wildcard/prefix search). If not, try IndexReader.terms() for building a
> Query yourself, then use that one for search.
>
> -- 
> Maik Schreiber   *   http://www.blizzy.de
>
> GPG public key:
http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x1F11D713
> Key fingerprint: CF19 AFCE 6E3D 5443 9599 18B5 5640 1F11 D713
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Synonyms Not Showing In The Index

2005-02-03 Thread Luke Shannon
Thanks!

I can wait for the release.

Luke

- Original Message - 
From: "Andrzej Bialecki" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Thursday, February 03, 2005 2:53 PM
Subject: Re: Synonyms Not Showing In The Index


> Andrzej Bialecki wrote:
> > Luke Shannon wrote:
> > 
> >> Hello;
> >>
> >> It seems my Synonym analyzer is working (based on some successful 
> >> queries).
> >> But I can't see the synonyms in the index using Luke. Is this correct?
> >>
> > 
> > Did you use the combined JAR to run? It contains an oldish version of 
> > Lucene... Other than that, I'm not sure - if you can't find the reason 
> > you could send me a small test index...
> > 
> > 
> 
> Got the bug. Your index is ok, and your synonym analyzer works as 
> expected. The Doc #16, field "name" has the content "luigi|mario test", 
> where tokens "luigi" and "mario" occupy the same position.
> 
> This was a deficiency with the current version of Luke, where if you 
> press "Reconstruct" it tries to reconstruct only unstored fields, but 
> shows you the stored fields verbatim (without actually checking how 
> their content was tokenized, and what tokens ended up in the index).
> 
> This is fixed in the new (yet unreleased) version of Luke. This new 
> version restores all fields (no matter if they are stored or only 
> indexed), and then displays both the stored content, and the restored 
> tokenized content. There was also a bug in GrowableStringsArray - the 
> values of tokens with the same position were being overwritten instead 
> of appended. This is also fixed now.
> 
> You should expect a new release within a week or two. If you can't wait, 
> let me know and I'll send you the patches.
> 
> -- 
> Best regards,
> Andrzej Bialecki
>   ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Parsing The Query: Every document that doesn't have a field containing x

2005-02-03 Thread Luke Shannon
Hello;

I have a query that finds documents that contain fields with a specific
value.

query1 = QueryParser.parse("jpg", "kcfileupload", new StandardAnalyzer());

This works well.

I would like a query that finds documents containing all kcfileupload fields
that don't contain jpg.

The example I found in the book that seems to relate shows me how to find
documents without a specific term:

QueryParser parser = new QueryParser("contents", analyzer);
parser.setOperator(QueryParser.DEFAULT_OPERATOR_AND);

But then it says:

Negating a term must be combined with at least one nonnegated term to return
documents; in other words, it isn't possible to use a query like NOT term to
find all documents that don't contain a term.

So does that mean the above example wouldn't work?

The API says:

 a plus (+) or a minus (-) sign, indicating that the clause is required or
prohibited respectively;

I have been playing around with using the minus character without much luck.

Can someone point me in the right direction to figure this out?

Thanks,

Luke




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lock failure recovery

2005-02-03 Thread Luke Shannon
The indexing process is totally synchronized in our system. Thus if an
indexing thread starts up and the index exists, but is locked, I know this
to be the only indexing process running, so the lock must be from a
process that got stopped before it could finish.

So right before I begin writing to the index I have this check:

//if we have gotten to here then this is the only index running.
//the index should not be locked. if it is, the lock is "stale"
//and must be released before we can continue
try {
if (index.exists() && IndexReader.isLocked(indexFileLocation)) {
Trace.ERROR("INDEX INFO: Had to clear a stale index lock");
IndexReader.unlock(FSDirectory.getDirectory(index, false));
}
} catch (IOException e3) {
Trace.ERROR("INDEX ERROR: Was unable to clear a stale index
lock: " + e3);
}

Luke

- Original Message - 
From: "Claes Holmerson" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Thursday, February 03, 2005 12:02 PM
Subject: Lock failure recovery


> Hello
>
> A commit.lock can get left by a process that dies in the middle of
> reading the index, for example because of an OutOfMemoryError. How can I
> handle such a left lock gracefully the next time the process runs?
> Checking if there is a lock is straight forward - but how can I be sure
> that it is not just a current lock created by another thread? The only
> methods I find to deal with the lock is IndexReader.isLocked() and
> IndexReader.unlock(). I would like to know the lock age - if it is older
> than a certain age then I can remove it. How do other people deal with
> left over locks?
>
> Claes
> -- 
>
> Claes Holmerson
> Polopoly - Cultivating the information garden
> Kungsgatan 88, SE-112 27 Stockholm, SWEDEN
> Direct: +46 8 506 782 59
> Mobile: +46 704 47 82 59
> Fax:  +46 8 506 782 51
> [EMAIL PROTECTED], http://www.polopoly.com
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Synonyms Not Showing In The Index

2005-02-02 Thread Luke Shannon
Hello;

It seems my Synonym analyzer is working (based on some successful queries).
But I can't see the synonyms in the index using Luke. Is this correct?

Thanks,

Luke



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: QueryParser Help

2005-02-02 Thread Luke Shannon
Actually now that I am looking at it, I think I am already accomplishing it.

I wanted all the documents with Mario in either field to show up.

There are two, but one has them in both fields in the Document. This is
correct.

Thanks for the help. It would have taken me a while to catch that.

Luke

- Original Message - 
From: "Maik Schreiber" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Wednesday, February 02, 2005 6:26 PM
Subject: Re: QueryParser Help


> > Not sure how to handle this yet, I still don't know enough about
> > QueryParsing.
>
> What is it you're trying to accomplish?
>
> -- 
> Maik Schreiber   *   http://www.blizzy.de
>
> GPG public key:
http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x1F11D713
> Key fingerprint: CF19 AFCE 6E3D 5443 9599 18B5 5640 1F11 D713
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: QueryParser Help

2005-02-02 Thread Luke Shannon
This is it. Thanks Maik. One of the docs had the result in both name and
desc.

Not sure how to handle this yet, I still don't know enough about
QueryParsing.

Luke

- Original Message - 
From: "Maik Schreiber" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Wednesday, February 02, 2005 6:15 PM
Subject: Re: QueryParser Help


> > But, I know "name" contains 2 documents, I also know "desc" contains
one.
> > This may be a dumb question but why does Hits not contain pointers to 3
> > results (2 from name, 1 from desc)?
>
> Your search is an OR search, which is why you get a union of search hits.
> Consider these documents (which I think you have in your index):
>
> Document 1:
> - name=mario
> - desc=mario
>
> Document 2:
> - name=mario
> - desc=foo
>
>
> - Searching for "mario" in field "name" would return 2 hits.
> - Searching for "mario" in field "desc" would return 1 hit.
> - Searching for "mario" in both fields would return 2 hits (which is what
> you're seeing).
>
> -- 
> Maik Schreiber   *   http://www.blizzy.de
>
> GPG public key:
http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x1F11D713
> Key fingerprint: CF19 AFCE 6E3D 5443 9599 18B5 5640 1F11 D713
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



QueryParser Help

2005-02-02 Thread Luke Shannon
Hello;

Getting acquainted with query parsing.  I have a question:

Query query = MultiFieldQueryParser
.parse("mario", new String[] { "name", "desc" }, new int[] {
MultiFieldQueryParser.NORMAL_FIELD,
MultiFieldQueryParser.NORMAL_FIELD }, new StandardAnalyzer());
IndexSearcher searcher = new IndexSearcher(fsDir);
Hits hits = searcher.search(query);
System.out.println("Keywords: " + hits.length() + " " +
query.toString());
assertEquals(2, hits.length());

This test is successful.

But, I know "name" contains 2 documents, I also know "desc" contains one.
This may be a dumb question but why does Hits not contain pointers to 3
results (2 from name, 1 from desc)?

Thanks

Luke



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: which HTML parser is better?

2005-02-02 Thread Luke Shannon
In our application I use regular expressions to strip all tags in one
situation and specific ones in another situation. Here is sample code for
both:

This strips all html 4.0 tags except , , , , , ,
:

html_source =
Pattern.compile("",
Pattern.CASE_INSENSITIVE).matcher(html_source).replaceAll("");

When I want to strip anything in a tag I use the following pattern with the
code above:

String strPattern1 = "<\\s?(.|\n)*?\\s?>";
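
For instance, the strip-everything approach wrapped up as a complete (if
minimal) class -- the class name and sample input are just illustrative:

import java.util.regex.Pattern;

public class TagStripper {

    // matches any tag, including ones whose attributes wrap across lines
    private static final String TAG_PATTERN = "<\\s?(.|\n)*?\\s?>";

    public static String stripAllTags(String html) {
        return Pattern.compile(TAG_PATTERN,
                Pattern.CASE_INSENSITIVE).matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // prints: Hello world
        System.out.println(stripAllTags("<p>Hello <b>world</b></p>"));
    }
}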

HTH

Luke



- Original Message - 
From: "sergiu gordea" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Wednesday, February 02, 2005 1:23 PM
Subject: Re: which HTML parser is better?


> Karl Koch wrote:
>
> >I am in control of the html, which means it is well formated HTML. I use
> >only HTML files which I have transformed from XML. No external HTML (e.g.
> >the web).
> >
> >Are there any very-short solutions for that?
> >
> >
> if you are using only correctly formatted HTML pages and you are in control
> of these pages,
> you can use a regular expression to remove the tags.
>
> something like
> replaceAll("<*>","");
>
> This is the ideea behind the operation. If you will search on google you
> will find a more robust
> regular expression.
>
> Using a simple regular expression will be a very cheap solution, that
> can cause you a lot of problems in the future.
>
>  It's up to you to use it 
>
>  Best,
>
>  Sergiu
>
> >Karl
> >
> >
> >
> >>Karl Koch wrote:
> >>
> >>
> >>
> >>>Hi,
> >>>
> >>>yes, but the library your are using is quite big. I was thinking that a
> >>>
> >>>
> >>5kB
> >>
> >>
> >>>code could actually do that. That sourceforge project is doing much
more
> >>>than that but I do not need it.
> >>>
> >>>
> >>>
> >>>
> >>you need just the htmlparser.jar 200k.
> >>... you know ... the functionality is strongly correclated with the
size.
> >>
> >>  You can use 3 lines of code with a good regular expresion to eliminate
> >>the html tags,
> >>but this won't give you any guarantie that the text from the bad
> >>fromated html files will be
> >>correctly extracted...
> >>
> >>  Best,
> >>
> >>  Sergiu
> >>
> >>
> >>
> >>>Karl
> >>>
> >>>
> >>>
> >>>
> >>>
>  Hi Karl,
> 
> I already submitted a piece of code that removes the html tags.
> Search for my previous answer in this thread.
> 
>  Best,
> 
>   Sergiu
> 
> Karl Koch wrote:
> 
> 
> 
> 
> 
> >Hello,
> >
> >I have  been following this thread and have another question.
> >
> >Is there a piece of sourcecode (which is preferably very short and
> >
> >
> >>simple
> >>
> >>
> >(KISS)) which allows to remove all HTML tags from HTML content? HTML
> >
> >
> >>3.2
> >>
> >>
> >would be enough...also no frames, CSS, etc.
> >
> >I do not need to have the HTML strucutre tree or any other structure
> >
> >
> >>but
> >>
> >>
> >need a facility to clean up HTML into its normal underlying content
> >
> >
> >
> >
> before
> 
> 
> 
> 
> >indexing that content as a whole.
> >
> >Karl
> >
> >
> >
> >
> >
> >
> >
> >
> >>I think that depends on what you want to do.  The Lucene demo parser
> >>
> >>
> >>
> >>
> does
> 
> 
> 
> 
> >>simple mapping of HTML files into Lucene Documents; it does not give
> >>
> >>
> >>you
> >>
> >>
> >>
> >>
> >>
> >>
> a
> 
> 
> 
> 
> >>parse tree for the HTML doc.  CyberNeko is an extension of Xerces
> >>
> >>
> >>(uses
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >the
> >
> >
> >
> >
> >
> >
> >>same API; will likely become part of Xerces), and so maps an HTML
> >>
> >>
> >>
> >>
> document
> 
> 
> 
> 
> >>into a full DOM that you can manipulate easily for a wide range of
> >>purposes.  I haven't used JTidy at an API level and so don't know it
> >>
> >>
> >>as
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >well --
> >
> >
> >
> >
> >
> >
> >>based on its UI, it appears to be focused primarily on HTML
validation
> >>
> >>
> >>
> >>
> and
> 
> 
> 
> 
> >>error detection/correction.
> >>
> >>I use CyberNeko for a range of operations on HTML documents that go
> >>
> >>
> >>
> >>
> beyond
> 
> 
> 
> 
> >>indexing them in Lucene, and really like it.  It has been robust for
> >>
> >>
> >>me
> >>
> >>
> >>
> >>
> >>
> >>
> so
> 
> 
> 
> 
> >>far.
> >>
> >>Chuck
> >>
> >>
> >>
> >>>-Original Message-
> >>>From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
> >>>Sent: Tuesday, February 01, 2005 1:15 AM
> >>>To: lucene-user@jakarta.apache.org
> >>>Subject: which HTML parser is better?

Combining Documents

2005-02-01 Thread Luke Shannon
Hello;

I have a situation where I need to add the fields returned from one
document to an existing document.

Is there something in the API for this that I'm missing or is this the best
way:

//add the fields contained in the PDF document to the existing doc
Document attachedDoc = LucenePDFDocument.getDocument(attached);
Enumeration docFields = attachedDoc.fields();
while (docFields.hasMoreElements()) {
    doc.add((Field) docFields.nextElement());
}
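
Wrapped up as a helper it might look like this (a sketch; the method name is
illustrative, and the imports needed are org.apache.lucene.document.Document,
org.apache.lucene.document.Field and java.util.Enumeration):

// copies every field of the source document onto the target document
private static void mergeFields(Document target, Document source) {
    Enumeration fields = source.fields();
    while (fields.hasMoreElements()) {
        target.add((Field) fields.nextElement());
    }
}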

Luke



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: How to get document count?

2005-02-01 Thread Luke Shannon
Not sure if the API provides a method for this, but you could use Luke:

http://www.getopt.org/luke/

It gives you a count and lets you step through each Doc looking at their
fields.

- Original Message - 
From: "Jim Lynch" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Tuesday, February 01, 2005 11:28 AM
Subject: How to get document count?


> I've indexed a large set of documents and think that something may have
> gone wrong somewhere in the middle.  Is there a way I can display the
> count of documents in the index?
>
> Thanks,
> Jim.
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Boosting Questions

2005-01-27 Thread Luke Shannon
Thanks Otis.

- Original Message - 
From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Thursday, January 27, 2005 12:11 PM
Subject: Re: Boosting Questions


> Luke,
> 
> Boosting is only one of the factors involved in Document/Query scoring.
>  Assuming that by applying your boosts to Document A or a single field
> of Document A increases the total score enough, yes, that Document A
> may have the highest score.  But just because you boost a single
> Document and not others, it does not mean it will emerge at the top.
> You should check out the Explanation class, which can dump all scoring
> factors in text or HTML format.
> 
> Otis
> 
> 
> --- Luke Shannon <[EMAIL PROTECTED]> wrote:
> 
> > Hi All;
> > 
> > I just want to make sure I have the right idea about boosting.
> > 
> > So if I boost a document (Document A) after I index it (lets say a
> > score of
> > 2.0) Lucene will now consider this document relativly more important
> > than
> > other documents in the index with a boost factor less than 2.0. This
> > boost
> > factor will also be applied to all the fields in the Document A.
> > Therefore,
> > if I do a TermQuery on a field that all my documents share ("title"),
> > in the
> > returned Hits (assuming Document A was among the return documents),
> > Document
> > A will score higher than other documents with a lower boost factor
> > because
> > the "title" field in A would have been boosted with all its other
> > fields.
> > Correct?
> > 
> > Now if at indexing time I decided to boost a particular field, lets
> > say
> > "address" in Document A (this is a field which all documents have)
> > the boost
> > factor is only applied to the "address" field of Document A. Nothing
> > else is
> > boosted by this operation. This means if a TermQuery on the "address"
> > field
> > returns Document A along with a collection of other documents,
> > Document A
> > will score higher than the others because of boosting. Correct?
> > 
> > Thanks,
> > 
> > Luke
> > 
> > 
> > 
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> > 
> > 
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Boosting Questions

2005-01-27 Thread Luke Shannon
Hi All;

I just want to make sure I have the right idea about boosting.

So if I boost a document (Document A) after I index it (let's say a score of
2.0) Lucene will now consider this document relatively more important than
other documents in the index with a boost factor less than 2.0. This boost
factor will also be applied to all the fields in Document A. Therefore,
if I do a TermQuery on a field that all my documents share ("title"), in the
returned Hits (assuming Document A was among the return documents), Document
A will score higher than other documents with a lower boost factor because
the "title" field in A would have been boosted with all its other fields.
Correct?

Now if at indexing time I decided to boost a particular field, let's say
"address" in Document A (this is a field which all documents have) the boost
factor is only applied to the "address" field of Document A. Nothing else is
boosted by this operation. This means if a TermQuery on the "address" field
returns Document A along with a collection of other documents, Document A
will score higher than the others because of boosting. Correct?
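
For reference, the calls I mean are along these lines (just a sketch; the
field values are made up):

Document docA = new Document();
docA.add(Field.Text("title", "some title"));
docA.setBoost(2.0f); // applied to every field of Document A at index time

// versus boosting a single field:
Field address = Field.Text("address", "123 Main St.");
address.setBoost(2.0f); // only this field gets the boost
docA.add(address);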

Thanks,

Luke



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Getting Into Search

2005-01-26 Thread Luke Shannon
Thanks Otis. Haven't seen it in either store (at least the ones in downtown
Toronto I usually shop at).

Their website says it ships in 24 hrs. It was cheaper on Amazon.ca so I went
that route for my printed version.

Luke

- Original Message - 
From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Wednesday, January 26, 2005 1:13 PM
Subject: Re: Getting Into Search


> Hi Luke,
>
> That's not hard with RangeQuery (supported by QueryParser), take a look
> at this:
>   http://www.lucenebook.com/search?query=date+range
>
> The grayed-out text has the section name and page number, so you can
> quickly locate this stuff in your ebook.
>
> Otis
> P.S.
> Do you know if Indigo/Chapters has Lucene in Action on their book
> shelves yet?
>
>
> --- Luke Shannon <[EMAIL PROTECTED]> wrote:
>
> > Hello;
> >
> > My lucene application has been performing well in our company's CMS
> > application. The plan now is to offer "advanced searching".
> >
> > I just bought the eBook version of Lucene in Action to help with my
> > research
> > (it is taking Amazon for ever to ship the printed version to Canada).
> >
> > The book looks great and will certainly deepen my understanding. But
> > I am
> > suffering a bit of information overload.
> >
> > I was hoping I could post the rough requirements I was given this
> > morning and
> > perhaps some more experienced Luceners could help direct my research
> > (this
> > can even be pointing me to relevant sections of the book).
> >
> > 1. Documents in the system contain the following fields,
> > ModificationDate,
> > CreationDate. A query is required that allows users to search for
> > documents
> > created/modified on a certain date or within a certain date range.
> >
> > 2. Documents in the system also contain fields: Title, Path. A query
> > is
> > required that allows users to search for Titles or Path starting
> > with,
> > ending with, containing (this is all the system currently does) or
> > matching
> > specific term(s).
> >
> > Later today I will get more specific requirements. For now I am
> > looking
> > through Analysis section of the eBook for ideas on how to handle
> > this. Any
> > tips anyone can give would be appreciated.
> >
> > Thanks,
> >
> > Luke
> >
> >
> >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Getting Into Search

2005-01-26 Thread Luke Shannon
Hello;

My lucene application has been performing well in our company's CMS
application. The plan now is too offer "advanced searching".

I just bought the eBook version of Lucene in Action to help with my research
(it is taking Amazon for ever to ship the printed version to Canada).

The book looks great and will certainly deepen my understanding. But I am
suffering a bit of information over load.

I was hoping I could post the rough requirements I was given this morning and
perhaps some more experienced Luceners could help direct my research (this
can even be pointing me to relevant sections of the book).

1. Documents in the system contain the following fields, ModificationDate,
CreationDate. A query is required that allows users to search for documents
created/modified on a certain date or within a certain date range.

2. Documents in the system also contain fields: Title, Path. A query is
required that allows users to search for Titles or Path starting with,
ending with, containing (this is all the system currently does) or matching
specific term(s).
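
The kinds of queries I picture for these, just to make the requirements
concrete (untested; startMillis/endMillis are placeholder longs, and this
assumes the date fields were indexed with DateField):

// 1. documents modified within a date range (inclusive)
Term start = new Term("ModificationDate", DateField.timeToString(startMillis));
Term end = new Term("ModificationDate", DateField.timeToString(endMillis));
Query dateRange = new RangeQuery(start, end, true);

// 2. titles starting with, containing, or exactly matching a term
Query titleStarts = new PrefixQuery(new Term("Title", "report"));
Query titleContains = new WildcardQuery(new Term("Title", "*report*"));
Query titleExact = new TermQuery(new Term("Title", "report"));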

Later today I will get more specific requirements. For now I am looking
through Analysis section of the eBook for ideas on how to handle this. Any
tips anyone can give would be appreciated.

Thanks,

Luke




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: FOP Generated PDF and PDFBox

2005-01-21 Thread Luke Shannon
Thanks Ben. I know of no related issues now. For the time being I will be
using path. Once I get a chance I will try this on the command line as you
have recommended.

Luke

- Original Message - 
From: "Ben Litchfield" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Friday, January 21, 2005 1:05 PM
Subject: Re: FOP Generated PDF and PDFBox


>
>
> Ya, when calling LucenePDFDocument.getDocument( File ) then it should be
> the same as the path.
>
> This is the code that the class uses to set those fields.
>
> document.add( Field.UnIndexed("path", file.getPath() ) );
> document.add(Field.UnIndexed("url", file.getPath().replace(FILE_SEPARATOR,
> '/')));
>
> I have no idea why an FOP PDF would be any different than another PDF.
>
> You can also run it from the command line, this is just for debugging
> purposes like this.
>
> java org.pdfbox.searchengine.lucene.LucenePDFDocument 
>
> and it should print out the fields of the lucene Document object.  Is the
> url there and is it correct?
>
> Ben
>
> On Fri, 21 Jan 2005, Luke Shannon wrote:
>
> > That is correct. No difference with how other PDFs are handled.
> >
> > I am looking at the index in Luke now. The FOP generated documents have
a
> > path but no URL? I would guess that these would be the same?
> >
> > Thanks for the speedy reply.
> >
> > Luke
> >
> >
> > - Original Message -
> > From: "Ben Litchfield" <[EMAIL PROTECTED]>
> > To: "Lucene Users List" 
> > Sent: Friday, January 21, 2005 12:34 PM
> > Subject: Re: FOP Generated PDF and PDFBox
> >
> >
> > >
> > >
> > > Are you indexing the FOP PDF's differently than other PDF documents?
> > >
> > > Can I assume that you are using PDFBox's
LucenePDFDocument.getDocument()
> > > method?
> > >
> > > Ben
> > >
> > > On Fri, 21 Jan 2005, Luke Shannon wrote:
> > >
> > > > Hello;
> > > >
> > > > Our CMS now allows users to create PDF documents (uses FOP) and then
> > search
> > > > them.
> > > >
> > > > I seem to be able to index these documents ok. But when I am
generating
> > the
> > > > results to display I get a Null Pointer Exception while trying to
use a
> > > > variable that should contain the url keyword for one of these
documents
> > in
> > > > the index:
> > > >
> > > > Document doc = hits.doc(i);
> > > > String path = doc.get("url");
> > > >
> > > > Path contains null.
> > > >
> > > > The interesting thing is this only happens with PDFs that are generated
> > > > with FOP. Other PDFs are fine.
> > > >
> > > > What I find weird is shouldn't the "url" field just contain the path
of
> > the
> > > > file?
> > > >
> > > > Anyone else seen this before?
> > > >
> > > > Any ideas?
> > > >
> > > > Thanks,
> > > >
> > > > Luke
> > > >
> > > >
> > > >
> > >
> -
> > > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > > For additional commands, e-mail: [EMAIL PROTECTED]
> > > >
> > >
> > > -
> > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > For additional commands, e-mail: [EMAIL PROTECTED]
> > >
> > >
> >
> >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: FOP Generated PDF and PDFBox

2005-01-21 Thread Luke Shannon
That is correct. No difference with how other PDF are handled.

I am looking at the index in Luke now. The FOP generated documents have a
path but no URL? I would guess that these would be the same?

Thanks for the speedy reply.

Luke


- Original Message - 
From: "Ben Litchfield" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Friday, January 21, 2005 12:34 PM
Subject: Re: FOP Generated PDF and PDFBox


>
>
> Are you indexing the FOP PDF's differently than other PDF documents?
>
> Can I assume that you are using PDFBox's LucenePDFDocument.getDocument()
> method?
>
> Ben
>
> On Fri, 21 Jan 2005, Luke Shannon wrote:
>
> > Hello;
> >
> > Our CMS now allows users to create PDF documents (uses FOP) and then
search
> > them.
> >
> > I seem to be able to index these documents ok. But when I am generating
the
> > results to display I get a Null Pointer Exception while trying to use a
> > variable that should contain the url keyword for one of these documents
in
> > the index:
> >
> > Document doc = hits.doc(i);
> > String path = doc.get("url");
> >
> > Path contains null.
> >
> > The interesting thing is this only happens with PDFs that are generated
> > with FOP. Other PDFs are fine.
> >
> > What I find weird is shouldn't the "url" field just contain the path of
the
> > file?
> >
> > Anyone else seen this before?
> >
> > Any ideas?
> >
> > Thanks,
> >
> > Luke
> >
> >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



FOP Generated PDF and PDFBox

2005-01-21 Thread Luke Shannon
Hello;

Our CMS now allows users to create PDF documents (uses FOP) and then search
them.

I seem to be able to index these documents ok. But when I am generating the
results to display I get a Null Pointer Exception while trying to use a
variable that should contain the url keyword for one of these documents in
the index:

Document doc = hits.doc(i);
String path = doc.get("url");

Path contains null.

The interesting thing is this only happens with PDFs that are generated with
FOP. Other PDFs are fine.

What I find weird is shouldn't the "url" field just contain the path of the
file?

Anyone else seen this before?

Any ideas?

Thanks,

Luke



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: where to place the index directory

2005-01-14 Thread Luke Shannon
Hello Philippe;

Things have gotten busy here for me. I can help you trouble shoot this a
little later. Please email me directly if you still need help with this.

Luke
- Original Message - 
From: "philippe" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Friday, January 14, 2005 12:45 PM
Subject: Re: where to place the index directory


> Luke,
>
> the jsp is in :
> /var/www/html/capoeira
>
> and the index directory, called indexHtmlCapoeira,
> /home/quilombo/indexHtmlCapoeira/index
>
> in the jsp, i'm giving the path,
>  String indexLocation = "/home/quilombo/indexHtmlCapoeira/index";
>
> thanks for your help
>
> philippe
>
>
> On Friday 14 January 2005 18:33, Luke Shannon wrote:
> > The jsp is having some trouble locating the index folder. It is probably
> > the path you are supplying when you create the File object for the
index.
> > When you create the File object, what path are you passing in?
> >
> > - Original Message -
> > From: "philippe" <[EMAIL PROTECTED]>
> > To: "Lucene Users List" 
> > Sent: Friday, January 14, 2005 12:17 PM
> > Subject: Re: where to place the index directory
> >
> > > yes Luke,
> > > thank you for your help,
> > >
> > > the message is :
> > > "indexHtml is not a directory"
> > >
> > > during some experimentations, the message has been
> > > "unable to open the directory"
> > >
> > > thanks
> > > philippe
> > >
> > > On Friday 14 January 2005 17:56, Luke Shannon wrote:
> > > > Does it give some sort of error message?
> > > >
> > > > Luke
> > > >
> > > > - Original Message -
> > > > From: "philippe" <[EMAIL PROTECTED]>
> > > > To: 
> > > > Sent: Friday, January 14, 2005 11:39 AM
> > > > Subject: where to place the index directory
> > > >
> > > > > Hi everybody,
> > > > >
> > > > > can someone help me ?
> > > > >
> > > > > i have a problem with my index ?
> > > > >
> > > > > on my localhost, everything is ok,
> > > > > i can put my index directory in different places, it is accessed
by
> > > > > my
> > > >
> > > > jsp.
> > > >
> > > > > But on my hosting tomcat 4, my jsp can't open this directory
> > > > >
> > > > > have an idea ?
> > > > >
> > > > > thanks in advance
> > > > >
> > > > > philippe
> > > > >
> > > >
> -
> > > > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > > > For additional commands, e-mail:
[EMAIL PROTECTED]
> > > >
> > >
> -
> > > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > > For additional commands, e-mail: [EMAIL PROTECTED]
> > >
> > > -
> > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: where to place the index directory

2005-01-14 Thread Luke Shannon
The jsp is having some trouble locating the index folder. It is probably the
path you are supplying when you create the File object for the index. When
you create the File object, what path are you passing in?
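
Something like this quick check would show what the jsp actually sees (the
path below is just an example):

File index = new File("/path/to/index");
System.out.println(index.getAbsolutePath());
System.out.println("exists: " + index.exists()
        + ", directory: " + index.isDirectory());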

- Original Message - 
From: "philippe" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Friday, January 14, 2005 12:17 PM
Subject: Re: where to place the index directory


> yes Luke,
> thank you for your help,
>
> the message is :
> "indexHtml is not a directory"
>
> during some experimentations, the message has been
> "unable to open the directory"
>
> thanks
> philippe
>
> On Friday 14 January 2005 17:56, Luke Shannon wrote:
> > Does it give some sort of error message?
> >
> > Luke
> >
> > - Original Message -
> > From: "philippe" <[EMAIL PROTECTED]>
> > To: 
> > Sent: Friday, January 14, 2005 11:39 AM
> > Subject: where to place the index directory
> >
> > > Hi everybody,
> > >
> > > can someone help me ?
> > >
> > > i have a problem with my index ?
> > >
> > > on my localhost, everything is ok,
> > > i can put my index directory in different places, it is accessed by my
> >
> > jsp.
> >
> > > But on my hosting tomcat 4, my jsp can't open this directory
> > >
> > > have an idea ?
> > >
> > > thanks in advance
> > >
> > > philippe
> > >
> > > -
> > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: where to place the index directory

2005-01-14 Thread Luke Shannon
Does it give some sort of error message?

Luke

- Original Message - 
From: "philippe" <[EMAIL PROTECTED]>
To: 
Sent: Friday, January 14, 2005 11:39 AM
Subject: where to place the index directory


> Hi everybody,
>
> can someone help me ?
>
> i have a problem with my index ?
>
> on my localhost, everything is ok,
> i can put my index directory in different places, it is accessed by my
jsp.
>
> But on my hosting tomcat 4, my jsp can't open this directory
>
> have an idea ?
>
> thanks in advance
>
> philippe
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: what if the IndexReader crashes, after delete, before close.

2005-01-11 Thread Luke Shannon
Here is how I handle it.

The Indexer is a Runnable. All the members it uses are static. The run()
method calls a synchronized method called go(). This kicks off the indexing.

Before you even get to here, the method in the CMS code that created the
thread object and instantiated the index is also synchronized.
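
In outline (simplified, with the real work omitted):

public class Indexer implements Runnable {

    public void run() {
        go();
    }

    // static and synchronized: only one indexing pass can run at a time
    private static synchronized void go() {
        // ... incremental indexing work ...
    }
}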

Here is the code that handles the potential lock file that may be left
behind from a Reader or Writer.

Note: I found I had to check if the index existed before checking if it was
locked. If I checked if it was locked and the index had not been created yet
I got an error.

//if we have gotten to here then this is the only index running.
//the index should not be locked. if it is the lock is "stale"
//and must be released before we can continue
try {
if (index.exists() && IndexReader.isLocked(indexFileLocation)) {
Trace.ERROR("INDEX INFO: Had to clear a stale index lock");
IndexReader.unlock(FSDirectory.getDirectory(index, false));
}
} catch (IOException e3) {
Trace.ERROR("INDEX ERROR: IMPORTANT. Was unable to clear a stale index lock:
" + e3);
}

HTH

Luke

- Original Message - 
From: "Peter Veentjer - Anchor Men" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Tuesday, January 11, 2005 3:24 AM
Subject: RE: what if the IndexReader crashes, after delete, before close.




-Original Message-
From: Luke Shannon [mailto:[EMAIL PROTECTED]
Sent: Monday, January 10, 2005 15:46
To: Lucene Users List
Subject: Re: what if the IndexReader crashes, after delete, before
close.


>>One thing that will happen is the lock file
>>will get left behind. This means when you start
>>back up and try to create another Reader you will
>>get a file lock error.

I have figured out that part the hard way ;) Why can`t I access my index
anymore?? Ahh.. The lock file

>>Our system is threaded and synchronized.
>>Thus when a Reader is being created I know
>>it is the only one (the Writer comes after
>>the reader has been closed). Before creating
>>it I check if the Index is locked. If it is,
>>I forcefully clear it. This prevents the above
>>problem from happening.

You can have more than 1 reader open at any time. Even while a delete or
add is in progress. But you can`t use a reader where documents are
deleted (IndexReader) and added (IndexWriter) at the same time. If you
don`t have other threads doing delete/add you won`t have to synchronize
anything.

And how do you synchronize on it? I have applied the ReadWriteLock from
Doug Lea`s concurrency library after I had built my own
synchronization brick and somebody pointed out that I was implementing
the ReadWriteLock. But at the moment I don`t do any synchronization.

And I want to have a component that is executed when the system is started
and knows what to do if there is rubbish in the index directory. I want
that component to restore my index to a usable version (even a small
loss of information is acceptable, because everything is checked once in
a while, and user-added information is going to be stored in the
database, so nothing gets lost; the index can be rebuilt).




Luke

- Original Message -
From: "Peter Veentjer - Anchor Men" <[EMAIL PROTECTED]>
To: 
Sent: Saturday, January 08, 2005 4:08 AM
Subject: what if the IndexReader crashes, after delete, before close.


What happens to the Index if the IndexReader crashes, after I have
deleted
documents, and before I have called close. Are the deletes ignored? Is
the
Index screwed up? Is the filesystem screwed up (if a document is deleted
new
delete-files appear) so are the delete-files still there (and can these
be
ignored the next time?). Can I restore the index to the previous state,
just
by removing those delete-files?



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: How do you handle dynamic html pages?

2005-01-10 Thread Luke Shannon
I run the indexer in our CMS every time a content change has occurred. It is
an incremental update, so only documents that generate a different UID than
the corresponding UID in the index get processed.
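
The UID is built along the lines of the Lucene demo's HTMLDocument (a sketch
of the idea; imports would be java.io.File and
org.apache.lucene.document.DateField):

// path plus last-modified time; if either changes, the UIDs differ
String uid = file.getPath().replace(File.separatorChar, '\u0000')
        + '\u0000'
        + DateField.timeToString(file.lastModified());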

Luke

- Original Message - 
From: "Kevin L. Cobb" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Monday, January 10, 2005 11:26 AM
Subject: RE: How do you handle dynamic html pages?


I don't like to periodically re-index everything because 1) you can't be
confident that your searches are as up to date as they could be, and 2)
you are wasting cycles either checking for documents that may or may not
need to be updated, or re-indexing documents that don't need updated.

Ideally, I think that you want an event driven system where the content
management system or the like indicates to your searcher engine when a
page/document gets updated. That way, you know that documents are as up
to date as possible in terms of searches, and you know that you aren't
doing unnecessary work.



-Original Message-
From: Luke Francl [mailto:[EMAIL PROTECTED]
Sent: Monday, January 10, 2005 11:09 AM
To: Lucene Users List
Subject: Re: How do you handle dynamic html pages?

On Mon, 2005-01-10 at 10:03, Jim Lynch wrote:
> How is anyone managing reindexing of pages that change?  Just
> periodically reindex everything or do you try to determine frequency
of
> each changes to each page and/or site?

If you are using a CMS, your best bet is to integrate Lucene with the
CMS's content update mechanism. That way, your index will always be
up-to-date.

Otherwise, I would say reindexing everything is easiest, provided it
doesn't take too long. If it's ~15 minutes or less, you could schedule a
processes to do it at a low activity period (2 AM or whenever) every day
and that would probably handle your needs.

Regards,
Luke Francl


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: what if the IndexReader crashes, after delete, before close.

2005-01-10 Thread Luke Shannon
One thing that will happen is the lock file will get left behind. This means
when you start back up and try to create another Reader you will get a file
lock error.

Our system is threaded and synchronized. Thus when a Reader is being created
I know it is the only one (the Writer comes after the reader has been
closed). Before creating it I check if the Index is locked. If it is, I
forcefully clear it. This prevents the above problem from happening.
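
The check-and-clear step might look something like this (a sketch using
Lucene 1.4's static lock helpers; indexPath is a placeholder, and forcing a
lock clear is only safe when no other process can legitimately hold it):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LockGuard {
    // Clears a leftover write lock before opening a new reader/writer.
    public static void clearStaleLock(String indexPath) throws IOException {
        Directory dir = FSDirectory.getDirectory(indexPath, false);
        if (IndexReader.isLocked(dir)) {
            IndexReader.unlock(dir);   // forcefully remove the lock file
        }
    }
}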

Luke

- Original Message - 
From: "Peter Veentjer - Anchor Men" <[EMAIL PROTECTED]>
To: 
Sent: Saturday, January 08, 2005 4:08 AM
Subject: what if the IndexReader crashes, after delete, before close.


What happens to the Index if the IndexReader crashes, after I have deleted
documents, and before I have called close. Are the deletes ignored? Is the
Index screwed up? Is the filesystem screwed up (if a document is deleted new
delete-files appear) so are the delete-files still there (and can these be
ignored the next time?). Can I restore the index to the previous state, just
by removing those delete-files?



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Check to see if index is optimized

2005-01-07 Thread Luke Shannon
This may not be a simple way, but you could just do a quick check on the
index folder to see if there is more than one file whose name contains
"segment".
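
A rough sketch of that idea, adapted for the compound file format where each
segment is a single .cfs file, so a freshly optimized index should show
exactly one (this is a heuristic only, not an official API; indexDir is a
placeholder):

import java.io.File;

public class OptimizedCheck {
    // Heuristic: count segment (.cfs) files; an optimized compound-format
    // index has been merged down to a single segment.
    public static boolean looksOptimized(File indexDir) {
        String[] names = indexDir.list();
        int segments = 0;
        for (int i = 0; i < names.length; i++) {
            if (names[i].endsWith(".cfs")) {
                segments++;
            }
        }
        return segments == 1;
    }
}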

Luke

- Original Message - 
From: "Crump, Michael" <[EMAIL PROTECTED]>
To: 
Sent: Friday, January 07, 2005 2:24 PM
Subject: Check to see if index is optimized


Hello,



Lucene is great!  I just have a question.



Is there a simple way to check and see if an index is already optimized?
What happens if optimize is called on an already optimized index - does
the call basically do a noop?  Or is it still an expensive call?





Regards,



Michael



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: questions

2005-01-07 Thread Luke Shannon
Hello Jac;

If you have verified that the index folder is indeed being created and there
is a segments file in it, check that the IndexSearcher in the demo is
pointing to that location. This is an easy error to make and would account
for the "no segments" error message.

Luke


- Original Message - 
From: "jac jac" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Friday, January 07, 2005 2:03 AM
Subject: questions


>
> Hi, I am a newbie and I just installed Tomcat on my machine.
> May I know, when I placed the Luceneweb folder in the webapps folder of
Tomcat, why I couldn't conduct the search operation when I test the
website? Did I miss out anything?
>
> It prompts me that there is no c:\opt\index\segment folder...
> I created it but I still couldn't get Lucene to work...
>
> At http://jakarta.apache.org/lucene/docs/demo.html:
> under the Indexing file instruction where should I do the following "type
"java org.apache.lucene.demo.IndexFiles {full-path-to-lucene}/src". "???
> Is it a must to install ant?
>
> Please kindly help!!! Thanks very much in advance
>
> regards,
> jac
>
>
>
> -
> Do you Yahoo!?
>  The all-new My Yahoo! - What will yours do?


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: how to create a long lasting unique key?

2005-01-04 Thread Luke Shannon
This is taken from the example code written by Doug Cutting that ships with
Lucene.

It is the key our system uses. It also comes in handy when incrementally
updating.

Luke

public static String uid(File f) {
  // Append path and date into a string in such a way that lexicographic
  // sorting gives the same results as a walk of the file hierarchy. Thus
  // null (\u0000) is used both to separate directory components and to
  // separate the path from the date.
  return f.getPath().replace(dirSep, '\u0000') + "\u0000"
      + DateField.timeToString(f.lastModified());
}

- Original Message - 
From: "Peter Veentjer - Anchor Men" <[EMAIL PROTECTED]>
To: 
Sent: Tuesday, January 04, 2005 2:43 PM
Subject: how to create a long lasting unique key?


What is the best way to create a key for a document? I know the id (from
hits) can not be used, but what is a good way to create a key

I need this key for a webapplication. At the moment every document can be
identified with the filelocation key, but I would rather some kind of
integer for the Job (nobody needs to know the file location).


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Problems...

2005-01-04 Thread Luke Shannon
I had a similar situation with the same problem.

I found the previous system was creating all the objects (including the
Searcher) and then updating the index.

The result was the Searcher was not able to find any of the data just added
to the Index.

The solution for me was to move the creation of the Searcher to after the
Index had been updated and the Reader and Writer objects had been closed.
Also ensure the Searcher uses the same Analyzer as the IndexWriter used to
create the Index.

This is a good tool for checking what is in your index. It may help with the
troubleshooting:

http://www.getopt.org/luke/

Luke

- Original Message - 
From: "Ross Rankin" <[EMAIL PROTECTED]>
To: 
Sent: Tuesday, January 04, 2005 10:53 AM
Subject: Problems...


> (Bear with me; I have inherited this system from another developer who is
no
> longer with the company.  So I am not familiar with Lucene at all.  I just
> have got the task of "Fixing the search".)
>
>
>
> I have servlet that runs every 10 minutes and indexes and I can see files
> being created in the index path on that interval (fdt,fdx,fnm,frq, etc.)
> however the search function is no longer working.  I'm not getting
anything
> in the log that I can point to that says what is not working, the search
or
> the index.  But since the index files seem to change size/date stamp as
they
> have in the past, I'm leaning towards the search function.
>
>
>
> I'm not sure where or how to troubleshoot.  Can I examine the indexes with
> anything to see what is there and that it's meaningful.  Is there
something
> simple I can do to track down what doesn't work in the process?  Thanks.
>
>
>
> Ross
>
>
>
> Here's the search function:
>
> public Hits search(String searchString, String resellerId) {
>
> int currentOffset = 0;
>
> try {
>
> currentOffset = Integer.parseInt(paramOffset);
>
> } catch (NumberFormatException e) {}
>
>
>
> System.out.println("\n\t\tSearch for " + searchString + " off = "
+
> currentOffset);
>
> if (currentOffset > 0) {
>
> // if the user only requested the next n items from the search
> returns
>
> return hits;
>
> }
>
>
>
> // performs a new search
>
> try {
>
> hits = null;
>
> try {
>
> searcher.close();
>
> } catch (Exception e){}
>
>
>
> searcher = new IndexSearcher(pathToIndex);
>
> Analyzer analyzer = new StandardAnalyzer();
>
>
>
> String searchQuery = LuceneConstants.FIELD_RESELLER_IDS + ":"
>
> + resellerId
>
> + " AND "
>
> + LuceneConstants.FIELD_FULL_DESCRIPTION + ":" +
> searchString;
>
>
>
>
>
> Query query = null;
>
> try {
>
> query = QueryParser.parse(searchQuery,
>
> LuceneConstants.FIELD_FULL_DESCRIPTION,
> analyzer);
>
> } catch (ParseException e) {
>
> // if an excepption occures parsing the search string
> entered by the user
>
> // escapes all the special lucene chars and try to make
the
> query again.
>
> searchQuery = LuceneConstants.FIELD_RESELLER_IDS + ":"
>
> + resellerId + " AND "
>
> + LuceneConstants.FIELD_FULL_DESCRIPTION + ":"
>
> + escape(searchString);
>
> query = QueryParser.parse(searchQuery,
>
> LuceneConstants.FIELD_FULL_DESCRIPTION,
> analyzer);
>
> }
>
> System.out.println("Searching for: " +
> query.toString(LuceneConstants.FIELD_FULL_DESCRIPTION));
>
>
>
> hits = searcher.search(query);
>
> System.out.println(hits.length() + " total matching
documents");
>
> //searcher.close();
>
>
>
> } catch (Exception e) {
>
> e.printStackTrace();
>
> }
>
>
>
> return hits;
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Deleting an index

2005-01-04 Thread Luke Shannon
If you opened an IndexReader, has it also been closed before you attempt to
delete?

- Original Message - 
From: "Scott Smith" <[EMAIL PROTECTED]>
To: 
Sent: Monday, January 03, 2005 7:39 PM
Subject: Deleting an index


I'm writing some junit tests for my search code (which layers on top of
Lucene).  The tests all follow the same pattern:

1. setUp(): Create some directories; write some files to be indexed
2. someTest: Call the indexer to create an index on the generated
files; do several searches and verify counts, expected hits, etc.;
3. tearDown(): Delete all of the directories and associated files
including the just-created index.



My problem is that I am unable to delete the index.  I've narrowed it
down to something in the search routine not letting go of the index file
(i.e., if I do the indexing and comment out the search, then everything
deletes fine).  The search code is pretty straight forward.  It creates
a new IndexSearcher (which it caches and hence uses for all searches in
the test).  Each individual search simply creates several QueryParsers
and then combines them to do a search using the cached IndexSearcher.
After the last search, I close() the IndexSearcher.  But something still
seems to have hold of the index.  I've tried nulling the hits object,
but that didn't seem to affect anything.



Any ideas?



Scott





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: LIMO problems

2004-12-13 Thread Luke Shannon
This is a good place to start for extracting the content from power point
files:

http://www.mail-archive.com/poi-user@jakarta.apache.org/msg04809.html

Luke

- Original Message - 
From: "Daniel Cortes" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, December 13, 2004 10:46 AM
Subject: Re: LIMO problems


>
> Hi, I want to know which library you use to search in PPT files?
> Does POI support this?
> thanks
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Lucene in Action e-book now available!

2004-12-10 Thread Luke Shannon
Nice Work!

Congratulations Guys.

- Original Message - 
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: "Lucene User" <[EMAIL PROTECTED]>; "Lucene List"
<[EMAIL PROTECTED]>
Sent: Friday, December 10, 2004 3:52 AM
Subject: Lucene in Action e-book now available!


> The Lucene in Action e-book is now available at Manning's site:
>
> http://www.manning.com/hatcher2
>
> Manning also put lots of other goodies there, the table of contents,
> "about this book", preface, the foreward from Doug Cutting himself
> (thanks Doug!!!), and a couple of sample chapters.  The complete source
> code is there as well.
>
> Now comes the exciting part to find out what others think of the work
> Otis and I spent 14+ months of our lives on.
>
> Erik
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: LIMO problems

2004-12-09 Thread Luke Shannon
I use "Luke". It is pretty good.

http://www.getopt.org/luke/

Luke
- Original Message - 
From: "Daniel Cortes" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Thursday, December 09, 2004 8:32 AM
Subject: LIMO problems


> Hi, I'm trying Limo (Index Monitor for Lucene) and I have a problem,
> obviously it will be a silly problem but right now I don't
> have a solution.
> Can someone tell me what structure the limo.properties file has?
> Because I don't have any example, thanks.
> If you know another web application for administering Lucene indexes, tell
me.
> Thanks for all, and excuse me for my silly questions.
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Help to remove document

2004-12-08 Thread Luke Shannon
Hi;

The indexReader has a delete method that can do this:

public final void delete(int docNum)
throws IOException
Deletes the document numbered docNum. Once a document is deleted it will not
appear in TermDocs or TermPositions enumerations. Attempts to read its
field with the document(int) method will result in an error. The presence of
this document may still be reflected in the
docFreq(org.apache.lucene.index.Term) statistic, though this will be
corrected eventually as the index is further modified.

There is an example of how it can be used in the Lucene demo. Ensure you
re-create the indexSearcher for the change to be reflected in your search
queries.
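
Put together, a delete followed by refreshing the searcher might look like
this (a minimal sketch; indexPath and docNum are placeholders):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

public class DeleteExample {
    public static IndexSearcher deleteAndReopen(String indexPath, int docNum)
            throws IOException {
        IndexReader reader = IndexReader.open(indexPath);
        reader.delete(docNum);   // mark the document as deleted
        reader.close();          // flush the deletion to disk
        // A searcher opened before the delete still sees the old index,
        // so open a fresh one.
        return new IndexSearcher(indexPath);
    }
}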

Luke
- Original Message - 
From: "Alex Kiselevski" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, December 08, 2004 9:34 AM
Subject: Help to remove document



Hello,
Help me please, I want to know how to remove a document from the index

Alex Kiselevsky
 Speech Technology Tel: 972-9-776-43-46
R&D, Amdocs - Israel Mobile: 972-53-63 50 38
mailto:[EMAIL PROTECTED]




The information contained in this message is proprietary of Amdocs,
protected from disclosure, and may be privileged.
The information is intended to be conveyed only to the designated
recipient(s)
of the message. If the reader of this message is not the intended recipient,
you are hereby notified that any dissemination, use, distribution or copying
of
this communication is strictly prohibited and may be unlawful.
If you have received this communication in error, please notify us
immediately
by replying to the message and deleting it from your computer.
Thank you.



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Weird Behavior On Windows

2004-12-07 Thread Luke Shannon
Hey Ottis;

You're right again. Turned out there was an exception around the usage of the
Digester class that wasn't being written to the log. This exception was
being thrown as a result of a configuration issue with the server.

Everything is back to normal.

Thanks!

Luke
- Original Message - 
From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, December 07, 2004 6:27 PM
Subject: Re: Weird Behavior On Windows


> The index has been modified, so you need a new IndexSearcher.  Could
> there be logic in the flaw (swap that), or could you be catching an
> Exception that is thrown only on Winblows due to Windows not letting
> you do certain things with referenced files and dirs?
>
> Otis
>
> --- Luke Shannon <[EMAIL PROTECTED]> wrote:
>
> > Hello All;
> >
> > Things have been running smoothly on Linux for sometime. We set up a
> > version
> > of the site on a Win2K machine, this is when all the "fun" started.
> >
> > A pdf would be added to the system. The indexer would run, find the
> > new
> > file, index it and successfully complete the update of the index
> > folder. No
> > IO error, no errors of any kind. Just like on the Linux box.
> >
> > Now we would try to search for a term in the document. 0 results
> > would be
> > returned? To make matters worse if I run a search on a term that
> > shows up in
> > a bunch of documents on windows it only find 2 results, where in
> > Linux it
> > would find 50 (same content).
> >
> > Using "Luke" I was able to verify that the pdf in question is in the
> > index.
> > Why can't the searcher find it?
> >
> > Any ideas would be welcome.
> >
> > Luke
> >
> >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Weird Behavior On Windows

2004-12-07 Thread Luke Shannon
Hi Otis;

Each time a search request comes in I create a new searcher (same analyzer
as used during indexing). The idea about catching an error somewhere is
interesting, although in most of the cases where I catch an exception I
write to a log file. Anyway, this is all I have to go on so I am looking
into exceptions now...

Luke
- Original Message - 
From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, December 07, 2004 6:27 PM
Subject: Re: Weird Behavior On Windows


> The index has been modified, so you need a new IndexSearcher.  Could
> there be logic in the flaw (swap that), or could you be catching an
> Exception that is thrown only on Winblows due to Windows not letting
> you do certain things with referenced files and dirs?
>
> Otis
>
> --- Luke Shannon <[EMAIL PROTECTED]> wrote:
>
> > Hello All;
> >
> > Things have been running smoothly on Linux for sometime. We set up a
> > version
> > of the site on a Win2K machine, this is when all the "fun" started.
> >
> > A pdf would be added to the system. The indexer would run, find the
> > new
> > file, index it and successfully complete the update of the index
> > folder. No
> > IO error, no errors of any kind. Just like on the Linux box.
> >
> > Now we would try to search for a term in the document. 0 results
> > would be
> > returned? To make matters worse if I run a search on a term that
> > shows up in
> > a bunch of documents on windows it only find 2 results, where in
> > Linux it
> > would find 50 (same content).
> >
> > Using "Luke" I was able to verify that the pdf in question is in the
> > index.
> > Why can't the searcher find it?
> >
> > Any ideas would be welcome.
> >
> > Luke
> >
> >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Weird Behavior On Windows

2004-12-07 Thread Luke Shannon
Hello All;

Things have been running smoothly on Linux for some time. We set up a version
of the site on a Win2K machine; this is when all the "fun" started.

A pdf would be added to the system. The indexer would run, find the new
file, index it and successfully complete the update of the index folder. No
IO error, no errors of any kind. Just like on the Linux box.

Now we would try to search for a term in the document. 0 results would be
returned. To make matters worse, if I run a search on a term that shows up in
a bunch of documents, on Windows it only finds 2 results where on Linux it
would find 50 (same content).

Using "Luke" I was able to verify that the pdf in question is in the index.
Why can't the searcher find it?

Any ideas would be welcome.

Luke



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Read locks on indexes

2004-12-07 Thread Luke Shannon
I think the read locks are preventing you from deleting from the index with
your reader and writing to the index with a writer at the same time.

If you never use a writer then I guess you don't need to worry about this.
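
If the copy really is read-only, I believe Lucene 1.4's FSDirectory also
honours a system property that turns file locking off entirely. A sketch,
only safe when nothing in this JVM ever writes to the index:

public class ReadOnlySearch {
    public static void main(String[] args) throws Exception {
        // FSDirectory reads this property in a static initializer, so set
        // it before touching any Lucene class.
        System.setProperty("disableLuceneLocks", "true");
        org.apache.lucene.search.IndexSearcher searcher =
            new org.apache.lucene.search.IndexSearcher(args[0]);
        // ... run read-only queries against the rsync'ed copy ...
        searcher.close();
    }
}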

But how do you create the indexes?

Luke

- Original Message - 
From: "Shawn Konopinsky" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, December 07, 2004 4:17 PM
Subject: Read locks on indexes


> Hi,
>
> I have a question regarding read locks on indexes. I have the situation
> where I have n applications (separated jvms) running queries. These
> applications are read-only, and never use an IndexWriter.
>
> The index is only ever updated using rsync. The applications don't need
> up the minute updates, only the data from when the reader was created is
> fine.
>
> My question is whether it's ok to disable read locks in this scenario?
> What are read locks protecting?
>
> Best,
> Shawn.
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: PDF Indexing Error

2004-12-03 Thread Luke Shannon
Hi Ben;

Actually I think I did update PDFBox. I will put it back to the version I
previously had.

Luke

- Original Message - 
From: "Ben Litchfield" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Thursday, December 02, 2004 8:20 PM
Subject: Re: PDF Indexing Error


>
> This error is because of security settings that have been applied to the
> PDF document which disallow text extraction.
>
> Not sure why you would all of a sudden get this error, unless you upgraded
> recently.  Older versions of PDFBox did not fully support PDF security.
>
> Ben
>
> On Thu, 2 Dec 2004, Luke Shannon wrote:
>
> > Hello All;
> >
> > Perhaps this should be on the PDFBox forum but I was curious if anyone
has
> > seen this error parsing PDF documents using packages other than PDFBox.
> >
> > /usr/tomcat/fb_hub/GM/Administration/Document/java/java_io.pdf
> > java.io.IOException: You do not have permission to extract text
> >
> > The weird thing is it gave this error on a document I have indexed a
million
> > times over the last 3 weeks.
> >
> > Thanks,
> >
> > Luke
> >
> >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



PDF Indexing Error

2004-12-02 Thread Luke Shannon
Hello All;

Perhaps this should be on the PDFBox forum but I was curious if anyone has
seen this error parsing PDF documents using packages other than PDFBox.

/usr/tomcat/fb_hub/GM/Administration/Document/java/java_io.pdf
java.io.IOException: You do not have permission to extract text

The weird thing is it gave this error on a document I have indexed a million
times over the last 3 weeks.

Thanks,

Luke



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Optimized??

2004-11-22 Thread Luke Shannon
As I understand it optimization is when you merge several segments into one
allowing for faster queries.

The FAQs and API have further details.

http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q24

Luke

- Original Message - 
From: "Miguel Angel" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Saturday, November 20, 2004 5:19 PM
Subject: Optimized??


What does an "optimized index" mean in Lucene?
-- 
Miguel Angel Angeles R.
Asesoria en Conectividad y Servidores
Telf. 97451277

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: How much time indexing doc ??

2004-11-22 Thread Luke Shannon
PDF(s) can definitely slow things down, depending on their size.

If there are a few larger PDF documents that time is definitely possible.

Luke

- Original Message - 
From: "Miguel Angel" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Saturday, November 20, 2004 11:25 AM
Subject: How much time indexing doc ??


> Hi, I have 1000 docs (Word, PDF and HTML); those documents indexed
> in 5 min.  Is this correct, or do I have a problem with my Analyzer? I
> used StandardAnalyzer
> -- 
> Miguel Angel Angeles R.
> Asesoria en Conectividad y Servidores
> Telf. 97451277
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



False Locking Conflict?

2004-11-19 Thread Luke Shannon
Hey All;

Is it possible for there to be a situation where the lock file is in place
after the reader has been closed?

I have extra logging in place and have followed the code execution. The reader 
finishes deleting old content and closes (I know this for sure). This is the 
only reader instance I have for the class (it is a static member). The reader 
is not re-opened. I try to open the writer and I get my old friend:

java.io.IOException: Lock obtain timed out: 
Lock@/usr/tomcat/jakarta-tomcat-5.0.19/temp/lucene-398fbd170a5457d05e2f4d43210f7fe8-write.lock

This code is synchronized so I am sure there are no other processes trying to do
the same thing. It looks to me like the reader is closing but the lock file is
not being removed. Is this possible?

Luke

Re: DOC, PPT index???

2004-11-18 Thread Luke Shannon
Check out:

http://jakarta.apache.org/poi/

- Original Message - 
From: "Miguel Angel" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Thursday, November 18, 2004 4:49 PM
Subject: DOC, PPT index???


> Hi !!!
> Lucene can index the files (do, ppt the MS OFFICE ??)
> How do you can this index (doc, ppt)
> -- 
> Miguel Angel Angeles R.
> Asesoria en Conectividad y Servidores
> Telf. 97451277
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: version documents

2004-11-18 Thread Luke Shannon
Thank you for the suggestion.

I ended up biting the bullet and re-working my indexing logic. Luckily the
system itself knows, for any given folder, what the "current" version of a
document is (otherwise it wouldn't know which one to display to the user).

I was able to get a static method I could call, passing in a folder name. The
method returns the file name of the current version for that folder.

Each time I am doing an incremental update, if I find that a document from a
folder hasn't changed, I make sure it is the current version before moving
on. If it isn't, I remove it from the index.

Then, when I am creating a new index or adding files to an existing one, for
each file I have to check that it is the current version for the folder
before adding it.

As you can imagine this slows down indexing (creating a new index or updating
an existing one) but it ensures content from an old version will never show
up in a query.
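
The per-file guard might be sketched like this (shouldIndex() and the Cms
interface are hypothetical stand-ins for the CMS call described above):

import java.io.File;

public class VersionGuard {
    /** Hypothetical stand-in for the CMS's current-version lookup. */
    public interface Cms {
        String currentVersionFor(String folderName);
    }

    // Index f only if the CMS says it is the folder's current version.
    public static boolean shouldIndex(File f, Cms cms) {
        String current = cms.currentVersionFor(f.getParentFile().getName());
        return f.getName().equals(current);
    }
}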

Luke

- Original Message - 
From: "Yonik Seeley" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>; "Justin Swanhart"
<[EMAIL PROTECTED]>
Sent: Thursday, November 18, 2004 1:32 PM
Subject: Re: version documents


> This won't fully work.  You still need to delete the
> original out of the lucene index to avoid it showing
> up in searches.
>
> Example:
> myfile v1:  "I want a cat"
> myfile v2:  "I want a dog"
>
> If you change "cat" to "dog" in myfile, and then do a
> search for "cat", you will *only* get v1 and hence the
> sort on version doesn't help.
>
> -Yonik
>
>
> --- Justin Swanhart <[EMAIL PROTECTED]> wrote:
> > Split the filename into "basefilename" and "version"
> > and make each a keyword.
> >
> > Sort your query by version descending, and only use
> > the first
> > "basefile" you encounter.
>
>
> __
> Do You Yahoo!?
> Tired of spam?  Yahoo! Mail has the best spam protection around
> http://mail.yahoo.com
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: PDF Index Time

2004-11-18 Thread Luke Shannon
Hi Ben;

Thank you for creating such an easy-to-use package for indexing PDFs.

I will keep PDFBox in the system and wait for the next release.

Thanks for the update.

Luke

- Original Message - 
From: "Ben Litchfield" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Thursday, November 18, 2004 12:33 PM
Subject: Re: PDF Index Time


> 
> PDFBox is slow, there is an open issue for it on the sourceforge site and
> I am actively working on improving speed and should see significant
> improvements in the next release.
> 
> I have not extensively tried the snowtide package but they have a trial
> download and the docs show that it should be just as easy to integrate as
> PDFBox is.  They list pricing on their site as well, which is nice, as it
> is not hidden the way some software companies hide it.
> 
> Ben
> 
> On Thu, 18 Nov 2004, Luke Shannon wrote:
> 
> > Hi;
> >
> > I am using the PDFBox's getLuceneDocument method to parse my PDF
> > documents. It returns good results and was very easy to integrate into
> > the project. However it is slow.
> >
> > Does anyone know of a faster package? Someone mentioned snowtide on an
> > earlier post. Anyone have experience with this package?
> >
> > Luke
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



PDF Index Time

2004-11-18 Thread Luke Shannon
Hi;

I am using the PDFBox's getLuceneDocument method to parse my PDF documents. It 
returns good results and was very easy to integrate into the project. However 
it is slow.

Does anyone know of a faster package? Someone mentioned snowtide on an earlier 
post. Anyone have experience with this package?

Luke

Re: urgent help needed

2004-11-18 Thread Luke Shannon
These are the ones I think. They were the first things I read on Lucene and
were very helpful.

http://www.onjava.com/pub/a/onjava/2003/03/05/lucene.html
http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html

- Original Message - 
From: "Neelam Bhatnagar" <[EMAIL PROTECTED]>
To: "Otis Gospodnetic" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Thursday, November 18, 2004 10:45 AM
Subject: RE: urgent help needed


Hello,

Thank you for your help. Could you tell us the URL of the online version
of these articles?

Thanks and regards
Neelam Bhatnagar


-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Thursday, November 18, 2004 9:12 PM
To: [EMAIL PROTECTED]
Cc: Neelam Bhatnagar
Subject: Re: urgent help needed

Redirecting to the more appropriate lucene-user list.

Hello,

About 2 years ago I wrote 2 articles for O'Reilly Network, where I
believe I mentioned this issue and provided some context.  Make sure
your index is optimized.  If that doesn't help, switch to the compound
index format (1 set call on IndexWriter instance).  You can also adjust
your OS's limits - the articles I mentioned cover this for a few UNIX
shells.
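
For reference, the compound-format switch mentioned above is, assuming
Lucene 1.4's IndexWriter, a single call (indexPath is a placeholder):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class CompoundFormat {
    public static void compact(String indexPath) throws java.io.IOException {
        IndexWriter writer =
            new IndexWriter(indexPath, new StandardAnalyzer(), false);
        writer.setUseCompoundFile(true); // per-segment files become one .cfs
        writer.optimize();               // rewrite the index in compound form
        writer.close();
    }
}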

Otis

--- Neelam Bhatnagar <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I have posted this several times before but there has been no
> response.
> We really need to resolve this as soon as possible. Kindly help us.
>
> We have been using Lucene 3.1 version with Tomcat 4.0 and jdk1.4.
> It seems that sometimes we see a "Too many files open" exception
> which
> completely garbles the whole index and whole search functionality
> crashes on the web site. It has also been known to crash the complete
> JSP container of tomcat.
>
> After looking at the bug list, we found out that it has been reported
> as
> a bug in the Lucene bug list as Bug#29774, #30049, #30452 which
> claims
> to have been resolved with the new version of Lucene.
>
> We have tried everything to reproduce the problem ourselves to figure
> out the exact circumstances under which it occurs but with out any
> luck.
>
>
> We would be installing the new version of Lucene but we need to be
> able
> to reproduce the problem to test it.
>
> We would really appreciate it if someone could point us to the root
> cause behind this so we can devise a solution around that.
>
> Thanks and regards
> Neelam Bhatnagar
>
> Technology| Sapient
> Presidency Building
> Mehrauli-Gurgaon Road
> Sector-14, Gurgaon-122001
> Haryana, India
>
> Tel: 91.124.2826299
> Cell: 91.9899591054
> Email: [EMAIL PROTECTED]
>
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: version documents

2004-11-17 Thread Luke Shannon
That is a good idea. Thanks!

- Original Message - 
From: "Justin Swanhart" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, November 17, 2004 3:38 PM
Subject: Re: version documents


> Split the filename into "basefilename" and "version" and make each a
keyword.
>
> Sort your query by version descending, and only use the first
> "basefile" you encounter.
>
> On Wed, 17 Nov 2004 15:05:19 -0500, Luke Shannon
> <[EMAIL PROTECTED]> wrote:
> > Hey all;
> >
> > I have ran into an interesting case.
> >
> > Our system has notes. These need to be indexed. They are xml files
called default.xml and are easily parsed and indexed. No problem, have been
doing it all week.
> >
> > The problem is if someone edits the note, the system doesn't update the
default.xml. It creates a new file, default_1.xml (every edit creates a new
file with an incremented number; the system only displays the content from
the highest number).
> >
> > My problem is I index all the documents and end up with terms that were
taken out of the note several versions ago still showing up in the query. From
my point of view this makes sense because the files are still in the content.
But to a user it is confusing because they have no idea every change they
make to a note spawns a new file and now they are seeing a term they removed
from their note 2 weeks ago showing up in a query.
> >
> > I have started modifying my incremental update to look for multiple
versions of the default.xml but it is more work than I thought and is going to
make things complex.
> >
> > Maybe there is an easier way? If I just let it run and create the index,
can somebody suggest a way I could easily scan the index folder ensuring
only the default.xml with the highest number in its filename remains (only
for folders where there is more than one default.xml file)? Or is this
wishful thinking?
> >
> > Thanks,
> >
> > Luke
> >
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



version documents

2004-11-17 Thread Luke Shannon
Hey all;

I have ran into an interesting case.

Our system has notes. These need to be indexed. They are xml files called 
default.xml and are easily parsed and indexed. No problem, have been doing it 
all week.

The problem is if someone edits the note, the system doesn't update the
default.xml. It creates a new file, default_1.xml (every edit creates a new
file with an incremented number; the system only displays the content from the
highest number).

My problem is I index all the documents and end up with terms that were taken
out of the note several versions ago still showing up in the query. From my
point of view this makes sense because the files are still in the content. But
to a user it is confusing because they have no idea every change they make to a
note spawns a new file, and now they are seeing a term they removed from their
note 2 weeks ago showing up in a query.

I have started modifying my incremental update to look for multiple versions
of the default.xml, but it is more work than I thought and is going to make
things complex.

Maybe there is an easier way? If I just let it run and create the index, can
somebody suggest a way I could easily scan the index folder ensuring only the
default.xml with the highest number in its filename remains (only for folders
where there is more than one default.xml file)? Or is this wishful thinking?
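
One way to attack the scan, as a sketch: resolve the current
(highest-numbered) note file per content folder, then delete every indexed
document from that folder whose path points at a different file. The helper
below only does the first half; class and method names are illustrative:

import java.io.File;

public class CurrentNote {
    // Returns the highest-numbered default_N.xml in a folder
    // (plain default.xml counts as version 0), or null if none exist.
    public static File currentVersion(File folder) {
        File best = null;
        int bestVersion = -1;
        File[] files = folder.listFiles();
        for (int i = 0; files != null && i < files.length; i++) {
            String name = files[i].getName();
            if (!name.startsWith("default") || !name.endsWith(".xml")) {
                continue;
            }
            int version = 0; // plain "default.xml"
            int underscore = name.indexOf('_');
            if (underscore != -1) {
                try {
                    version = Integer.parseInt(
                        name.substring(underscore + 1, name.length() - 4));
                } catch (NumberFormatException e) {
                    continue; // not a default_N.xml file
                }
            }
            if (version > bestVersion) {
                bestVersion = version;
                best = files[i];
            }
        }
        return best;
    }
}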

Thanks,

Luke

Re: index document pdf

2004-11-17 Thread Luke Shannon
Hello;

Hopfully I understand the question.

1. Modify the indexDoc(file) method to consider the file type pdf:

else if (file.getPath().endsWith(".html") ||
file.getPath().endsWith(".pdf")) {

2. Create a specific branch of code to create the lucene document from the
file type and than add it to the index:

if (file.getPath().endsWith(".pdf")) {
    try {
        Document doc = LucenePDFDocument.getDocument(file);
        writer.addDocument(doc);
    } catch (Exception e) {
        System.out.println("INDEXING ERROR: Unable to index pdf document: "
            + file.getPath() + " " + e.getMessage());
    }
}

Note: Ensure you do step 2 for the case when uidIter != null and when it is
equal to null.

That should do it.

Concerning pdfbox make sure you have all the jars required. I had a little
trouble getting this going at first. It needs log4j.jar to run. If you have
any problems with the appenders I found this message thread helpful.
http://java2.5341.com/msg/32909.html

Luke

- Original Message - 
From: "Miguel Angel" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, November 17, 2004 12:28 PM
Subject: index document pdf


> Hi, i downloading pdfbox 0.6.4  , what add in the source code the
> demo`s lucene 
>
> -- 
> Miguel Angel Angeles R.
> Asesoria en Conectividad y Servidores
> Telf. 97451277
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: tool to check the index field

2004-11-17 Thread Luke Shannon
Try this:

http://www.getopt.org/luke/

Luke
- Original Message - 
From: "lingaraju" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, November 17, 2004 10:00 AM
Subject: tool to check the index field


> HI ALL
> 
> I have an index file created by other people.
> Now I want to know how many fields there are in the index.
> Is there any third-party tool to do this?
> I saw a GUI tool for this somewhere but forgot the name.
> 
> Regards
> LingaRaju 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Index Locking Issues Resolved...I hope

2004-11-16 Thread Luke Shannon
Hello;

I think I have solved my locking issues. I just made it through the set of
test cases that previously resulted in index locking errors, after removing
the method from my code that checks for an index lock and forcefully removes
it after 1 minute. Hopefully it never needs to be put back in.

Here is what I changed:

I moved all my indexer logic into a class called Index.java that implements
Runnable. Starting the thread calls a method named go(), which is static and
synchronized. go() kicks off all the logic to update the index (the reader,
writer and other members involved with incremental updates are also static).
I put logging in place that records when a thread has executed the method and
what the thread's name is.

Every time a client class changes the content, it can create a thread
reference and pass it the runnable Index. The convention I have requested
for naming the thread is a toString() of the current date. Then they start
the thread.
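
A skeleton of that arrangement (the class and method names match the
description above; the indexing steps are elided and everything else is a
sketch):

public class Index implements Runnable {

    public void run() {
        go();
    }

    // Static and synchronized: only one update runs at a time, no matter
    // how many threads are started.
    private static synchronized void go() {
        System.out.println(Thread.currentThread().getName() + " started indexing");
        // 1. open the IndexReader, delete stale documents, close it
        // 2. open the IndexWriter, add new/changed documents, close it
    }

    public static void main(String[] args) {
        // Clients name the thread after the request time, as described.
        new Thread(new Index(), new java.util.Date().toString()).start();
    }
}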

How it worked:

A few users just tested the system; half added documents while the other
half deleted documents at the same time. No locking issues were seen, and
the index was current with the changes a short time after the last
operation (in my previous code this test resulted in an index locking
issue).

I was able to go through the log file and find the start of the synchronized
go() method and the successful completion of the indexing operations for
every request made.

The only performance issue I noticed was if someone added a very large PDF
it took a while before the thread handling the request could finish. If this
is the first operation of many it means the operations following this large
file take that much longer. Luckily for me search results don't need to be
instant.

Things are looking much better. For now...

Thanks to all that helped me up till now.

Luke

- Original Message - 
From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, November 16, 2004 4:01 PM
Subject: Re: _4c.fnm missing


> 'Concurrent' and 'updates' in the same sentence sounds like a possible
> source of the problem.  You have to use a single IndexWriter and it
> should not overlap with an IndexReader that is doing deletes.
>
> Otis
>
> --- Luke Shannon <[EMAIL PROTECTED]> wrote:
>
> > It consistently breaks when I run more than 10 concurrent incremental
> > updates.
> >
> > I can post the code on Bugzilla (hopefully when I get to the site it
> > will be
> > obvious how I can post things).
> >
> > Luke
> >
> > - Original Message - 
> > From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
> > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > Sent: Tuesday, November 16, 2004 3:20 PM
> > Subject: Re: _4c.fnm missing
> >
> >
> > > Field names are stored in the field info file, with suffix .fnm. -
> > see
> > > http://jakarta.apache.org/lucene/docs/fileformats.html
> > >
> > > The .fnm should be inside the .cfs file (cfs files are compound
> > files
> > > that contain all index files described at the above URL).  Maybe
> > you
> > > can provide the code that causes this error in Bugzilla for
> > somebody to
> > > look at.  Does it consistently break?
> > >
> > > Otis
> > >
> > >
> > > --- Luke Shannon <[EMAIL PROTECTED]> wrote:
> > >
> > > > I received the error below when I was attempting to over whelm my
> > > > system with incremental update requests.
> > > >
> > > > What is this file it is looking for? I checked the index. It
> > > > contains:
> > > >
> > > > _4c.del
> > > > _4d.cfs
> > > > deletable
> > > > segments
> > > >
> > > > Where does _4c.fnm come from?
> > > >
> > > > Here is the error:
> > > >
> > > > Unable to create the create the writer and/or index new content
> > > > /usr/tomcat/fb_hub/WEB-INF/index/_4c.fnm (No such file or
> > directory).
> > > >
> > > > Thanks,
> > > >
> > > > Luke
> > >
> > >
> > >
> > -
> > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > For additional commands, e-mail:
> > [EMAIL PROTECTED]
> > >
> > >
> >
> >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: _4c.fnm missing

2004-11-16 Thread Luke Shannon
It doesn't have to be to the second. If things take a few minutes it's ok.

It looks like the first lock issue I'm hitting in my program is when I try
to delete from the index for the first time. No writer has been created
yet, only the reader, so I am not sure why it thinks it's locked.

- Original Message - 
From: "Nader Henein" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, November 16, 2004 4:18 PM
Subject: Re: _4c.fnm missing


> That's it, you need to batch your updates. It comes down to: do you need to
give your users search accuracy to the second? Take your database, put
an is_dirty flag on the master table of the object you're indexing, run a
scheduled task every x minutes, and have your process read the objects that
are set to dirty and then reset the flag once they've been indexed
correctly.
>
> my two cents
>
> Nader
>
>
>
> Otis Gospodnetic wrote:
>
> >'Concurrent' and 'updates' in the same sentence sounds like a possible
> >source of the problem.  You have to use a single IndexWriter and it
> >should not overlap with an IndexReader that is doing deletes.
> >
> >Otis
> >
> >--- Luke Shannon <[EMAIL PROTECTED]> wrote:
> >
> >
> >
> >>It consistently breaks when I run more than 10 concurrent incremental
> >>updates.
> >>
> >>I can post the code on Bugzilla (hopefully when I get to the site it
> >>will be
> >>obvious how I can post things).
> >>
> >>Luke
> >>
> >>- Original Message - 
> >>From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
> >>To: "Lucene Users List" <[EMAIL PROTECTED]>
> >>Sent: Tuesday, November 16, 2004 3:20 PM
> >>Subject: Re: _4c.fnm missing
> >>
> >>
> >>
> >>
> >>>Field names are stored in the field info file, with suffix .fnm. -
> >>>
> >>>
> >>see
> >>
> >>
> >>>http://jakarta.apache.org/lucene/docs/fileformats.html
> >>>
> >>>The .fnm should be inside the .cfs file (cfs files are compound
> >>>
> >>>
> >>files
> >>
> >>
> >>>that contain all index files described at the above URL).  Maybe
> >>>
> >>>
> >>you
> >>
> >>
> >>>can provide the code that causes this error in Bugzilla for
> >>>
> >>>
> >>somebody to
> >>
> >>
> >>>look at.  Does it consistently break?
> >>>
> >>>Otis
> >>>
> >>>
> >>>--- Luke Shannon <[EMAIL PROTECTED]> wrote:
> >>>
> >>>
> >>>
> >>>>I received the error below when I was attempting to overwhelm my
> >>>>system with incremental update requests.
> >>>>
> >>>>What is this file it is looking for? I checked the index. It
> >>>>contains:
> >>>>
> >>>>_4c.del
> >>>>_4d.cfs
> >>>>deletable
> >>>>segments
> >>>>
> >>>>Where does _4c.fnm come from?
> >>>>
> >>>>Here is the error:
> >>>>
> >>>>Unable to create the create the writer and/or index new content
> >>>>/usr/tomcat/fb_hub/WEB-INF/index/_4c.fnm (No such file or
> >>>>
> >>>>
> >>directory).
> >>
> >>
> >>>>Thanks,
> >>>>
> >>>>Luke
> >>>>
> >>>>
> >>>
> >>>
> >>>
> >>-
> >>
> >>
> >>>To unsubscribe, e-mail: [EMAIL PROTECTED]
> >>>For additional commands, e-mail:
> >>>
> >>>
> >>[EMAIL PROTECTED]
> >>
> >>
> >>>
> >>>
> >>
> >>-
> >>To unsubscribe, e-mail: [EMAIL PROTECTED]
> >>For additional commands, e-mail: [EMAIL PROTECTED]
> >>
> >>
> >>
> >>
> >
> >
> >-
> >To unsubscribe, e-mail: [EMAIL PROTECTED]
> >For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
> >
> >
> >
> >
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: _4c.fnm missing

2004-11-16 Thread Luke Shannon
This is the latest error I have received:

IndexReader out of date and no longer valid for delete, undelete, or setNorm
operations

I need to synchronize this process more carefully. I think this goes back to
the point that during my incremental update I sometimes need to forcefully
clear the lock on the index. I am not managing the deleting and writing to
the index correctly.

The first thing I am doing is tracking down the cause of this situation so I
don't need to forcefully clear locks anymore.

- Original Message - 
From: "Nader Henein" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, November 16, 2004 3:39 PM
Subject: Re: _4c.fnm missing


> What kind of incremental updates are you doing? We update our
index every 15 minutes with 100 ~ 200 documents and we're writing to a 6 GB
memory-resident index; the IndexWriter runs one instance at a time. So what
kind of increments are we talking about? It takes a bit of doing to overwhelm
Lucene.
>
> What's your update schedule, how big is the index, and after how many
updates does the system crash?
>
> Nader Henein
>
>
>
> Luke Shannon wrote:
>
> >It consistently breaks when I run more than 10 concurrent incremental
> >updates.
> >
> >I can post the code on Bugzilla (hopefully when I get to the site it will
be
> >obvious how I can post things).
> >
> >Luke
> >
> >- Original Message - 
> >From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
> >To: "Lucene Users List" <[EMAIL PROTECTED]>
> >Sent: Tuesday, November 16, 2004 3:20 PM
> >Subject: Re: _4c.fnm missing
> >
> >
> >
> >
> >>Field names are stored in the field info file, with suffix .fnm. - see
> >>http://jakarta.apache.org/lucene/docs/fileformats.html
> >>
> >>The .fnm should be inside the .cfs file (cfs files are compound files
> >>that contain all index files described at the above URL).  Maybe you
> >>can provide the code that causes this error in Bugzilla for somebody to
> >>look at.  Does it consistently break?
> >>
> >>Otis
> >>
> >>
> >>--- Luke Shannon <[EMAIL PROTECTED]> wrote:
> >>
> >>
> >>
> >>>I received the error below when I was attempting to overwhelm my
> >>>system with incremental update requests.
> >>>
> >>>What is this file it is looking for? I checked the index. It
> >>>contains:
> >>>
> >>>_4c.del
> >>>_4d.cfs
> >>>deletable
> >>>segments
> >>>
> >>>Where does _4c.fnm come from?
> >>>
> >>>Here is the error:
> >>>
> >>>Unable to create the create the writer and/or index new content
> >>>/usr/tomcat/fb_hub/WEB-INF/index/_4c.fnm (No such file or directory).
> >>>
> >>>Thanks,
> >>>
> >>>Luke
> >>>
> >>>
> >>-
> >>To unsubscribe, e-mail: [EMAIL PROTECTED]
> >>For additional commands, e-mail: [EMAIL PROTECTED]
> >>
> >>
> >>
> >>
> >
> >
> >
> >-
> >To unsubscribe, e-mail: [EMAIL PROTECTED]
> >For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
> >
> >
> >
> >
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: _4c.fnm missing

2004-11-16 Thread Luke Shannon
The schedule is determined by the users of the system. Basically, when the
user(s) change the content (adding/deleting a folder or file, modifying a
file's content) through a web-based interface, a re-index of the content is
required. This could happen 20 times in the span of a few seconds or once in
an hour.

I doubt I am overwhelming Lucene; I think the problem is in my code and how
I am managing the deleting and writing to the index.

- Original Message - 
From: "Nader Henein" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, November 16, 2004 3:39 PM
Subject: Re: _4c.fnm missing


> What kind of incremental updates are you doing? We update our
index every 15 minutes with 100 ~ 200 documents and we're writing to a 6 GB
memory-resident index; the IndexWriter runs one instance at a time. So what
kind of increments are we talking about? It takes a bit of doing to overwhelm
Lucene.
>
> What's your update schedule, how big is the index, and after how many
updates does the system crash?
>
> Nader Henein
>
>
>
> Luke Shannon wrote:
>
> >It consistently breaks when I run more than 10 concurrent incremental
> >updates.
> >
> >I can post the code on Bugzilla (hopefully when I get to the site it will
be
> >obvious how I can post things).
> >
> >Luke
> >
> >- Original Message - 
> >From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
> >To: "Lucene Users List" <[EMAIL PROTECTED]>
> >Sent: Tuesday, November 16, 2004 3:20 PM
> >Subject: Re: _4c.fnm missing
> >
> >
> >
> >
> >>Field names are stored in the field info file, with suffix .fnm. - see
> >>http://jakarta.apache.org/lucene/docs/fileformats.html
> >>
> >>The .fnm should be inside the .cfs file (cfs files are compound files
> >>that contain all index files described at the above URL).  Maybe you
> >>can provide the code that causes this error in Bugzilla for somebody to
> >>look at.  Does it consistently break?
> >>
> >>Otis
> >>
> >>
> >>--- Luke Shannon <[EMAIL PROTECTED]> wrote:
> >>
> >>
> >>
> >>>I received the error below when I was attempting to overwhelm my
> >>>system with incremental update requests.
> >>>
> >>>What is this file it is looking for? I checked the index. It
> >>>contains:
> >>>
> >>>_4c.del
> >>>_4d.cfs
> >>>deletable
> >>>segments
> >>>
> >>>Where does _4c.fnm come from?
> >>>
> >>>Here is the error:
> >>>
> >>>Unable to create the create the writer and/or index new content
> >>>/usr/tomcat/fb_hub/WEB-INF/index/_4c.fnm (No such file or directory).
> >>>
> >>>Thanks,
> >>>
> >>>Luke
> >>>
> >>>
> >>-
> >>To unsubscribe, e-mail: [EMAIL PROTECTED]
> >>For additional commands, e-mail: [EMAIL PROTECTED]
> >>
> >>
> >>
> >>
> >
> >
> >
> >-
> >To unsubscribe, e-mail: [EMAIL PROTECTED]
> >For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
> >
> >
> >
> >
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: _4c.fnm missing

2004-11-16 Thread Luke Shannon
It consistently breaks when I run more than 10 concurrent incremental
updates.

I can post the code on Bugzilla (hopefully when I get to the site it will be
obvious how I can post things).

Luke

- Original Message - 
From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, November 16, 2004 3:20 PM
Subject: Re: _4c.fnm missing


> Field names are stored in the field info file, with suffix .fnm. - see
> http://jakarta.apache.org/lucene/docs/fileformats.html
>
> The .fnm should be inside the .cfs file (cfs files are compound files
> that contain all index files described at the above URL).  Maybe you
> can provide the code that causes this error in Bugzilla for somebody to
> look at.  Does it consistently break?
>
> Otis
>
>
> --- Luke Shannon <[EMAIL PROTECTED]> wrote:
>
> > I received the error below when I was attempting to overwhelm my
> > system with incremental update requests.
> >
> > What is this file it is looking for? I checked the index. It
> > contains:
> >
> > _4c.del
> > _4d.cfs
> > deletable
> > segments
> >
> > Where does _4c.fnm come from?
> >
> > Here is the error:
> >
> > Unable to create the create the writer and/or index new content
> > /usr/tomcat/fb_hub/WEB-INF/index/_4c.fnm (No such file or directory).
> >
> > Thanks,
> >
> > Luke
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



_4c.fnm missing

2004-11-16 Thread Luke Shannon
I received the error below when I was attempting to overwhelm my system with 
incremental update requests.

What is this file it is looking for? I checked the index. It contains:

_4c.del
_4d.cfs
deletable
segments

Where does _4c.fnm come from?

Here is the error:

Unable to create the create the writer and/or index new content 
/usr/tomcat/fb_hub/WEB-INF/index/_4c.fnm (No such file or directory).

Thanks,

Luke

Re: IndexSearcher Refresh

2004-11-16 Thread Luke Shannon
Yes it will. Thanks.
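
For reference, a version-checked searcher cache built on the
getCurrentVersion() call linked below might look like this (a sketch; it
assumes single-threaded use, since closing the old searcher while another
thread is still searching would need more care):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

public class SearcherCache {
    private String indexPath;
    private long version = -1;
    private IndexSearcher searcher;

    public SearcherCache(String indexPath) {
        this.indexPath = indexPath;
    }

    // Reopen the searcher only when the index version has changed.
    public synchronized IndexSearcher get() throws IOException {
        long current = IndexReader.getCurrentVersion(indexPath);
        if (searcher == null || current != version) {
            if (searcher != null) {
                searcher.close();
            }
            searcher = new IndexSearcher(indexPath);
            version = current;
        }
        return searcher;
    }
}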

- Original Message - 
From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, November 16, 2004 10:28 AM
Subject: Re: IndexSearcher Refresh


> This will help:
>
>
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#getCurrentVersion(org.apache.lucene.store.Directory)
>
> Otis
>
>
> --- Luke Shannon <[EMAIL PROTECTED]> wrote:
>
> > It would be nice if the IndexSearcher contained a method that could
> > return
> > the last modified date of the index folder it was created with.
> >
> > This would make it easier to know when you need to create a new
> > Searcher.
> >
> > - Original Message - 
> > From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
> > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > Sent: Tuesday, November 16, 2004 8:23 AM
> > Subject: Re: IndexSearcher Refresh
> >
> >
> > > I don't think so, you have to forget or close the old one and
> > create a
> > > new instance.
> > >
> > > Otis
> > >
> > > --- Ravi <[EMAIL PROTECTED]> wrote:
> > >
> > > > Is there a way to refresh the IndexSearcher object with the newly
> > > > added
> > > > documents to the index instead of creating a new object?
> > > >
> > > > Thanks in advance,
> > > > Ravi.
> > > >
> > > >
> > > >
> > -
> > > > To unsubscribe, e-mail:
> > [EMAIL PROTECTED]
> > > > For additional commands, e-mail:
> > [EMAIL PROTECTED]
> > > >
> > > >
> > >
> > >
> > >
> > -
> > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > For additional commands, e-mail:
> > [EMAIL PROTECTED]
> > >
> > >
> >
> >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
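
To make the getCurrentVersion pattern from the link above concrete, here is a
minimal sketch of version-based searcher refreshing. The class and its path
handling are illustrative assumptions, not code from the thread:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SearcherCache {
    private Directory dir;
    private IndexSearcher searcher;
    private long version;

    public SearcherCache(String path) throws IOException {
        dir = FSDirectory.getDirectory(path, false);
        searcher = new IndexSearcher(dir);
        version = IndexReader.getCurrentVersion(dir);
    }

    // hand back the cached searcher, replacing it first if the index changed
    public synchronized IndexSearcher getSearcher() throws IOException {
        long current = IndexReader.getCurrentVersion(dir);
        if (current != version) {
            searcher.close(); // assumes no other thread is mid-search on it
            searcher = new IndexSearcher(dir);
            version = current;
        }
        return searcher;
    }
}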



Re: how do you work with PDF

2004-11-16 Thread Luke Shannon
www.pdfbox.org

Once you have the package installed, the code you can use is:

Document doc = LucenePDFDocument.getDocument(file);
writer.addDocument(doc);

This method returns the PDF in Lucene document format.

Luke
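
For context, those two lines might be wired into a complete indexing pass
roughly as follows; a sketch assuming PDFBox's
org.pdfbox.searchengine.lucene.LucenePDFDocument and hypothetical ./docs and
./index paths:

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.pdfbox.searchengine.lucene.LucenePDFDocument;

public class PdfIndexer {
    public static void main(String[] args) throws Exception {
        // create a fresh index and add every PDF found in ./docs
        IndexWriter writer = new IndexWriter("./index", new StandardAnalyzer(), true);
        File[] files = new File("./docs").listFiles();
        for (int i = 0; i < files.length; i++) {
            if (files[i].getName().toLowerCase().endsWith(".pdf")) {
                Document doc = LucenePDFDocument.getDocument(files[i]);
                writer.addDocument(doc);
            }
        }
        writer.optimize();
        writer.close();
    }
}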

- Original Message - 
From: "Miguel Angel" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, November 16, 2004 10:19 AM
Subject: how do you work with PDF


> Hi, I need to know how you work with PDF files; please describe the process.
> Thanks...
> 
> -- 
> Miguel Angel Angeles R.
> Asesoria en Conectividad y Servidores
> Telf. 97451277
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: IndexSearcher Refresh

2004-11-16 Thread Luke Shannon
It would be nice if the IndexSearcher contained a method that could return
the last modified date of the index folder it was created with.

This would make it easier to know when you need to create a new Searcher.

- Original Message - 
From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, November 16, 2004 8:23 AM
Subject: Re: IndexSearcher Refresh


> I don't think so, you have to forget or close the old one and create a
> new instance.
>
> Otis
>
> --- Ravi <[EMAIL PROTECTED]> wrote:
>
> > Is there a way to refresh the IndexSearcher object with the newly
> > added
> > documents to the index instead of creating a new object?
> >
> > Thanks in advance,
> > Ravi.
> >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Is opening IndexReader multiple times safe?

2004-11-15 Thread Luke Shannon
Hi Satoshi;

I troubleshot a similar problem by moving an
IndexReader.isLocked(indexFileLocation) call around to determine exactly when
the reader was closed.

Note: the method throws an error if the index file you are checking on doesn't
exist.

Luke

- Original Message - 
From: "Satoshi Hasegawa" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, November 15, 2004 8:25 PM
Subject: Is opening IndexReader multiple times safe?


> Hello,
>
> I need to handle IOExceptions that arise from index access
> (IndexReader#open, #delete, IndexWriter#optimize etc.), and I'm not sure if
> the IndexReader is open when the exception is thrown/caught. Specifically,
> my code is as follows.
>
> try {
>     indexReader.delete(term);
>     indexReader.close();
>     IndexWriter indexWriter = new IndexWriter(fsDirectory,
>         new JapaneseAnalyzer(), false);
>     indexWriter.optimize();
>     indexWriter.close();
> } catch (Exception e) {
>     // IndexReader may or may not be open
>     indexReader = IndexReader.open(path);
>     indexReader.undeleteAll();
> }
>
> Is the above code safe? IndexReader may already be open at the beginning of
> the catch clause if the exception was thrown before closing the IndexReader.
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
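
A defensive variant of the code in question (a sketch, not from the thread):
track whether the reader actually closed, so the catch block knows what state
it is in. StandardAnalyzer stands in for the JapaneseAnalyzer above, and the
wrapper method is hypothetical:

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class SafeDelete {
    // delete all documents matching term, then optimize; undelete on failure
    public static void deleteAndOptimize(String path, Term term) throws IOException {
        IndexReader reader = IndexReader.open(path);
        boolean readerClosed = false;
        try {
            reader.delete(term);
            reader.close();
            readerClosed = true;
            IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), false);
            writer.optimize();
            writer.close();
        } catch (IOException e) {
            if (!readerClosed) {
                try { reader.close(); } catch (IOException ignored) {}
            }
            // roll back the deletes so the index is left as we found it
            IndexReader undo = IndexReader.open(path);
            undo.undeleteAll();
            undo.close();
            throw e;
        }
    }
}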

Re: Lucene : avoiding locking (incremental indexing)

2004-11-15 Thread Luke Shannon
I like the sound of the Queue approach. I also don't like that I have to
forcefully unlock the index.

I'm not the most experienced programmer and am on a tight deadline. The
approach I ended up with was the best I could do with the experience I've
got and the time I had.

My indexer works so far and doesn't have to forcefully release the lock on
the Index too often (the case is most likely to occur when someone removes
content file(s) and the reader needs to delete from the existing index for
the first time). We will see what happens as more people use the system with
large content directories.

As I learn more I plan to expand the functionality of my class.

Luke S

- Original Message - 
From: <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, November 15, 2004 5:50 PM
Subject: Re: Lucene : avoiding locking (incremental indexing)


> It really seems like I am not the only person having this issue.
>
> So far I am seeing 2 solutions and honestly I don't love either totally.
I am thinking that without changes to Lucene itself, the best "general" way
to implement this might be to have a queue of changes and have Lucene work
off this queue in a single thread using a time-settable batch method.   This
is similar to what you are using below, but I don't like that you forcibly
unlock Lucene if it shows itself locked.   Using the Queue approach, only
that one thread could be accessing Lucene for writes/deletes anyway so there
should be no "unknown" locking.
>
> I can imagine this being a very good addition to Lucene - creating a high
level interface to Lucene that manages incremental updates in such a manner.
If anybody has such a general piece of code, please post it!!!   I would use
it tonight rather than create my own.
>
> I am not sure if there is anything that can be done to Lucene itself to
help with this need people seem to be having.  I realize the likely reasons
why Lucene might need to only have one Index writer and the additional load
that might be caused by locking off pieces of the database rather than the
whole database.  I think I need to look in the developer archives.
>
> JohnE
>
>
>
> - Original Message -
> From: Luke Shannon <[EMAIL PROTECTED]>
> Date: Monday, November 15, 2004 5:14 pm
> Subject: Re: Lucene : avoiding locking (incremental indexing)
>
> > Hi Luke;
> >
> > I have a similar system (except people don't need to see results
> > immediately). The approach I took is a little different.
> >
> > I made my Indexer a thread, with the indexing operations occurring in
> > the run method. When the IndexWriter is to be created or the IndexReader
> > needs to execute a delete, I call the following method:
> >
> > private void manageIndexLock() {
> >     try {
> >         // check if the index is locked and deal with it if it is
> >         if (index.exists() && IndexReader.isLocked(indexFileLocation)) {
> >             System.out.println("INDEXING INFO: There is more than one "
> >                 + "process trying to write to the index folder. Will "
> >                 + "wait for index to become available.");
> >             // loop until the lock is released or 3 minutes have expired
> >             int indexChecks = 0;
> >             while (IndexReader.isLocked(indexFileLocation)
> >                     && indexChecks < 6) {
> >                 // increment the number of times we check the index files
> >                 indexChecks++;
> >                 try {
> >                     // sleep for 30 seconds
> >                     Thread.sleep(30000L);
> >                 } catch (InterruptedException e2) {
> >                     System.out.println("INDEX ERROR: There was a problem "
> >                         + "waiting for the lock to release. "
> >                         + e2.getMessage());
> >                 }
> >             } // closes the while loop checking on the index directory
> >             // if we are still locked we need to do something about it
> >             if (IndexReader.isLocked(indexFileLocation)) {
> >                 System.out.println("INDEXING INFO: Index locked after 3 "
> >                     + "minutes of waiting. Forcefully releasing lock.");
> >                 IndexReader.unlock(FSDirectory.getDirectory(index, false));
> >                 System.out.println("INDEXING INFO: Index lock released");
> >             } // closes the if that actually releases the lock
> >         } // closes the if ensuring the index file exists
> >     } catch (IOException e1) {
> >         System.out.println("INDEX ERROR: There was a problem waiting "
> >             + "for the lock to release. " + e1.getMessage());
> >     }
> > } // closes the manageIndexLock method