Re: How to get all field values from a Hits object?

2005-01-17 Thread Chris Hostetter

: is it possible to get all different values for a
: field from a Hits object and how to do this?

The wording of your question suggests that the Field you are interested in
isn't a field which will have a fairly unique value for every doc (ie: not
a "title", more likely an "author" or "category" field).  Starting with
that assumption, there is a fairly efficient way to get the information
you want...

Assuming the total set of values for the Field you are interested in is
small (relative to your index size), you can pre-compute a BitSet for
each value indicating which docs match that value in the Field (using a
TermFilter).  Then store those BitSets in a Map (keyed by field value).

Every time a search is performed, use a HitCollector that generates a
BitSet containing the documents in your result; AND that BitSet against (a
copy of) each BitSet in your Map.  All of the resulting BitSets with a
non-zero cardinality represent values in your results.  (As an added bonus,
the cardinality() of each BitSet is the total number of docs in your
result that contain that value.)

Two caveats:
   1) Every time you modify your index, you have to regenerate the
  BitSets in your Map.
   2) You have to know the set of all values for the field you are
  interested in.  In many cases, this is easy to determine from the
  source data while building the index, but it's also possible to
  get it using IndexReader.termDocs(Term).


(I'm doing something like this to provide ancillary information about which
categories of documents are most common in the user's search result, and
what the exact number of documents in those categories is.)
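The per-value intersection described above can be sketched with plain java.util.BitSet; the Lucene plumbing (prebuilding the value BitSets from TermDocs, collecting the result BitSet with a HitCollector) is omitted, and all names below are my own, not from the mail:

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

public class FacetCounts {

    /**
     * For each known field value, AND a *copy* of its precomputed doc
     * BitSet against the BitSet of docs in the current result.  A value
     * occurs in the result iff the intersection is non-empty, and its
     * cardinality is the number of matching docs with that value.
     */
    public static Map<String, Integer> count(Map<String, BitSet> valueBits,
                                             BitSet resultDocs) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (Map.Entry<String, BitSet> e : valueBits.entrySet()) {
            BitSet intersection = (BitSet) e.getValue().clone(); // don't clobber the cache
            intersection.and(resultDocs);
            if (intersection.cardinality() > 0) {
                counts.put(e.getKey(), intersection.cardinality());
            }
        }
        return counts;
    }
}
```

The clone() matters: BitSet.and() is destructive, and the cached per-value sets must survive to the next search.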


-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Best way to find if a document exists, using Reader ...

2005-01-17 Thread Chris Hostetter

: 1) Adding 250K documents took half an hour for lucene.
: 2) Deleting and adding same 250K documents took more than 50 minutes. In my
: test all 250K objects are new so there is nothing to delete.
:
: Looks like there is no other way to make it fast.

I bet you can find an improvement in the specific case -- but probably not
in the general case.

Let's summarize your current process in pseudocode:

   open an existing index (which may be empty)
   foreach item in very large set of items:
  id = item.getId()
  index.delete(id)
  index.add(id)

...except that 99% of the time, that delete isn't necessary, right?

so what if you traded space for time, and kept an in memory Cache of all
IDs in your index?

   open an existing index (which may be empty)
   cache = all ids in TermDoc iterator of id field;
   foreach item in very large set of items:
  id = item.getId()
  if cache.contains(id):
 index.delete(id)
  cache.add(id)
  index.add(id)

... assuming you have enough RAM to keep a HashMap of every id in your
index around, I'm fairly confident that would be faster than doing the
delete every time.
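The gain comes from a plain membership test before each delete. A minimal sketch of just that bookkeeping (the actual index.add/index.delete calls are left out, and the class name is hypothetical):

```java
import java.util.HashSet;
import java.util.Set;

public class IdCache {

    private final Set<String> ids = new HashSet<String>();

    /** Seed the cache once, e.g. by walking the TermDocs of the id field. */
    public void seed(Iterable<String> existingIds) {
        for (String id : existingIds) {
            ids.add(id);
        }
    }

    /**
     * Record that this id is about to be (re)indexed and report whether a
     * delete is needed first.  Set.add returns false when the id was
     * already present, which is exactly the "delete first" case.
     */
    public boolean addNeedsDelete(String id) {
        return !ids.add(id);
    }
}
```

The indexing loop then becomes: if (cache.addNeedsDelete(id)) index.delete(id); followed unconditionally by index.add(...).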


-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Question about Analyzer and words spelled in different languages

2005-01-17 Thread David Spencer
markharw00d wrote:
 >>Writing this kind of an analyzer can be a bit of a hassle and the 
position increment of 0 might affect highlighting code

The highlighter in the sandbox was refactored to support these kinds of 
analyzers some time ago, so it shouldn't be a problem.
Sorry - I should have noted that as I think I was the one that requested 
this change! I meant to say that position increments are tricky and an 
increment of 0 could affect all kinds of other code that doesn't know to 
look for such things.

 The JUnit test 
that comes with the highlighter includes a SynonymAnalyzer class which 
demonstrates this.

Cheers
Mark


MoreLikeThis and other similarity query generators checked in + online demo

2005-01-17 Thread David Spencer
Based on mail from Doug I wrote a "more like this" query generator, 
named, well, MoreLikeThis. Bruce Ritchie and Mark Harwood made changes 
to it (esp term vector support) and bug fixes. Thanks to everyone.

I've checked in the code to the sandbox under contributions/similarity.
The package it ends up in is org.apache.lucene.search.similar -- hope 
that makes sense.

I also created a class, SimilarityQueries, to hold other methods of 
similarity query generation. The 2 methods in there are "dumber" 
variations that use the entire source of the target doc to form a large 
query.

Javadoc is here:
http://www.searchmorph.com/pub/jakarta-lucene-sandbox/contributions/similarity/build/docs/api/org/apache/lucene/search/similar/package-summary.html
Online demo here - this page below compares the 3 variations on 
detecting similar docs. The timing info (3 numbers w/ "(ms)") may be 
suspect. Also note if you scroll to the bottom you can see the queries 
that were generated.

Here's a page showing docs similar to the entry for Iraq:
http://www.searchmorph.com/kat/wikipedia-compare.jsp?s=Iraq
And here's one for docs similar to the one on Garry Kasparov (he knows 
how to play chess :) ):

http://www.searchmorph.com/kat/wikipedia-compare.jsp?s=Garry_Kasparov
To get to it you start here:
http://www.searchmorph.com/kat/wikipedia.jsp
And search for something - on the search results page follow a "cmp" link
http://www.searchmorph.com/kat/wikipedia.jsp?s=iraq
Make sense? Useful? Has anyone done any other variations (e.g. cosine 
measure)?

- Dave


Re: Question about Analyzer and words spelled in different languages

2005-01-17 Thread markharw00d
>>Writing this kind of an analyzer can be a bit of a hassle and the 
position increment of 0 might affect highlighting code

The highlighter in the sandbox was refactored to support these kinds of 
analyzers some time ago, so it shouldn't be a problem.  The JUnit test 
that comes with the highlighter includes a SynonymAnalyzer class which 
demonstrates this.

Cheers
Mark


Re: Question about Analyzer and words spelled in different languages

2005-01-17 Thread David Spencer
Mariella Di Giacomo wrote:
Hi ALL,
We are trying to index scientific articles written in English, but whose 
authors' names can be spelled in any language (depending on the author's 
nationality)

E.g.
Schäffer
In the XML document that we provide to Lucene the author name is written 
in the following way (using HTML entities)

Sch&auml;ffer

So in practice that is the name that would be given to a Lucene 
analyzer/filter

Is there any already written analyzer that would take that name 
(Sch&auml;ffer or any other name that has entities) so that the 
Lucene index could be searched (once the field has been indexed) for the 
real version of the name, which is

Schäffer
and the english spelled version of the name which is
Schaffer
Thanks a lot in advance for your help,
If I understand the question then I think there are 2 ways of doing it.
[1] Write a custom analyzer that uses Token.setPositionIncrement(0) to 
put alternate spellings at the same place in the token stream. This way 
phrase matches work right (so the query "Jonathan Schaffer" and 
"Jonathan Schäffer" will match the same phrase in the doc).

[2] Do not use a special analyzer - instead do query expansion, so if 
they search for "Schaffer" then the generated query is (Schaffer Schäffer).

I've used both techniques before - I use #1 w/ a "JavadocAnalyzer" on 
searchmorph.com so that if you search for "hash" you'll see matches for 
"HashMap", as "HashMap" is tokenized into 3 tokens at the same location 
('hash', 'map', 'hashmap').  Writing this kind of an analyzer can be a 
bit of a hassle, and the position increment of 0 might affect 
highlighting code or other (say, summarizing) code that uses the Analyzer.
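The token-expansion idea behind #1 can be sketched without the Lucene TokenFilter plumbing: each word also yields its diacritic-free form at position increment 0 when the two differ. The folding via java.text.Normalizer and all names here are my own choices, not from the original mail:

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

public class SpellingExpander {

    /** A token term plus its position increment (0 = same slot as the previous token). */
    public static final class Tok {
        public final String term;
        public final int posIncr;
        Tok(String term, int posIncr) { this.term = term; this.posIncr = posIncr; }
    }

    // Strip combining marks so "Schäffer" folds to "Schaffer".
    static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                         .replaceAll("\\p{M}", "");
    }

    /**
     * Emit each word, followed by its folded variant at the same position
     * when folding actually changes it, so phrase queries match either form.
     */
    public static List<Tok> expand(String[] words) {
        List<Tok> out = new ArrayList<Tok>();
        for (String w : words) {
            out.add(new Tok(w, 1));
            String folded = fold(w);
            if (!folded.equals(w)) {
                out.add(new Tok(folded, 0));
            }
        }
        return out;
    }
}
```

In a real analyzer the same pairs would be emitted as Tokens via setPositionIncrement(0).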

For an example of #2 see my Wordnet/Synonym query expansion example in 
the Lucene sandbox. You prebuild an index of synonyms (or in your case 
maybe just rules are fine). Then you need query expansion code that 
takes "Schaffer" and expands it to something like "Schaffer 
Schäffer^0.9" (if you want to assume the user probably spells the name 
right). Simple enough to code; the only hassle then is if you want to use 
the standard QueryParser...
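The expansion in #2 is just string rewriting before the query is parsed; a minimal sketch with a hypothetical variant lookup (names and the boost handling are my own):

```java
import java.util.Map;

public class QueryExpander {

    /**
     * Rewrite each query term to "(term variant^boost)" when a variant is
     * known; the boost keeps the user's own spelling ranked higher.
     */
    public static String expand(String query, Map<String, String> variants,
                                double boost) {
        StringBuilder out = new StringBuilder();
        for (String term : query.split("\\s+")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            String alt = variants.get(term);
            if (alt != null) {
                out.append('(').append(term).append(' ')
                   .append(alt).append('^').append(boost).append(')');
            } else {
                out.append(term);
            }
        }
        return out.toString();
    }
}
```

The rewritten string can then be handed to the normal query parser, so no custom analyzer is needed.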

thx,
 Dave



Mariella



Re: How to add a Lucene index to a jar file?

2005-01-17 Thread Doug Cutting
David Spencer wrote:
Isn't "ZipDirectory" the thing to search for?
I think it's actually URLDirectory:
http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg02453.html
Doug


Re: How to add a Lucene index to a jar file?

2005-01-17 Thread Luke Francl
On Mon, 2005-01-17 at 13:10, Doug Cutting wrote:

> The problem is that a jar file entry becomes an InputStream, but 
> InputStream is not random access, and Lucene requires random access.  So 
> you need to extract the index either to disk or RAM in order to get 
> random access.  I think folks have posted code for this to the list 
> previously.

I think it would be cool if Lucene included a class that facilitated
creating a RAMDirectory from an index in a JAR file, for the kind of
point-and-click startup that Bill is talking about.

Regards,
Luke Francl





Re: How to add a Lucene index to a jar file?

2005-01-17 Thread David Spencer
Doug Cutting wrote:
Miles Barr wrote:
You'll have to implement org.apache.lucene.store.Directory to load the
index from the JAR file. Take a look at FSDirectory and RAMDirectory for
some more details.
Then you either have to load the JAR file with java.util.jar.JarFile to get
to the files, or you can use ClassLoader#getResourceAsStream to get to
them.

The problem is that a jar file entry becomes an InputStream, but 
InputStream is not random access, and Lucene requires random access.  So 
you need to extract the index either to disk or RAM in order to get 
random access.  I think folks have posted code for this to the list 
previously.
Doug

Isn't "ZipDirectory" the thing to search for?


Re: How to add a Lucene index to a jar file?

2005-01-17 Thread Doug Cutting
Miles Barr wrote:
You'll have to implement org.apache.lucene.store.Directory to load the
index from the JAR file. Take a look at FSDirectory and RAMDirectory for
some more details.
Then you either have to load the JAR file with java.util.jar.JarFile to get
to the files, or you can use ClassLoader#getResourceAsStream to get to
them.
The problem is that a jar file entry becomes an InputStream, but 
InputStream is not random access, and Lucene requires random access.  So 
you need to extract the index either to disk or RAM in order to get 
random access.  I think folks have posted code for this to the list 
previously.

Doug
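The extract-to-RAM step Doug describes can be sketched with the standard java.util.jar API alone: copy each index file's bytes into an in-memory map that does support random access. Feeding those bytes into a RAMDirectory is omitted since it is Lucene-specific; the class and method names are my own:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Enumeration;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

public class JarIndexLoader {

    /** Read every entry under the given prefix fully into memory. */
    public static Map<String, byte[]> load(String jarPath, String prefix)
            throws IOException {
        Map<String, byte[]> files = new LinkedHashMap<String, byte[]>();
        JarFile jar = new JarFile(jarPath);
        try {
            for (Enumeration<JarEntry> e = jar.entries(); e.hasMoreElements();) {
                JarEntry entry = e.nextElement();
                if (entry.isDirectory() || !entry.getName().startsWith(prefix)) {
                    continue;
                }
                // Copy the stream into a byte[]; the array is random access.
                InputStream in = jar.getInputStream(entry);
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                for (int n; (n = in.read(chunk)) != -1;) {
                    buf.write(chunk, 0, n);
                }
                in.close();
                files.put(entry.getName(), buf.toByteArray());
            }
        } finally {
            jar.close();
        }
        return files;
    }
}
```

From here each byte[] could be written into an in-memory Directory implementation so IndexReader can seek freely.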


Re: carrot2 question too - Re: Fun with the Wikipedia

2005-01-17 Thread David Spencer
Dawid Weiss wrote:
Hi David,
I apologize about the delay in answering this one; Lucene is a busy 
mailing list and I had a hectic last week... Again, sorry for the belated 
answer, hope you still find it useful.
Oh no problem, and yes carrot2 is useful and fun.  It's a rich package 
so it takes a while to understand all that it can do.

That is awesome and very inspirational!

Yes, I admit what you've done with Wikipedia is quite interesting and 
looks very good. I'm also glad you spent some time working out Carrot 
integration with Lucene. It works quite nicely.
Thanks but I just took code that I think you wrote(!) and made minor 
mods to it - here's one link:
http://www.newsarch.com/archive/mailinglist/jakarta/lucene/user/msg03928.html

I'd like to do more w/ Carrot2- that's where things get harder.

Carrot2 looks very interesting. Wondering if anybody has a list of 
all the

Technically I don't think Carrot2 uses Lucene per se - it's just that 
you can integrate the two, and ditto for Nutch - it has code that uses 
Carrot2.

Yes, this is true. Carrot2 doesn't use all of Lucene's potential -- it 
merely takes the output from a query (titles, urls and snippets) and 
attempts to cluster them into some sensible groups. I think many things 
could be improved, the most important of which is fast snippet retrieval 
from Lucene, because right now it takes 50% of the clustering time; I've 
seen a post a while ago describing a faster snippet generation technique, 
and I'm sure that would give clustering a huge boost speed-wise.

And here's my question. I reread the Carrot2<->Lucene code, esp 
Demo.java, and there's this fragment:

// warm-up round (stemmer tables must be read etc).
List clusters = clusterer.clusterHits(docs);
long clusteringStartTime = System.currentTimeMillis();
clusters = clusterer.clusterHits(docs);
long clusteringEndTime = System.currentTimeMillis();
Thus it calls clusterHits() twice.
I don't really understand how to use Carrot2 - but I think the above 
is just for the sake of benchmarking clusterHits() w/o the effect of 
1-time initialization - and that there's no benefit of repeatedly 
calling clusterHits (where a benefit might be that it can find nested 
clusters or whatever) - is that right (that there's no benefit)?

No, there is absolutely no benefit from it. It was merely to show people 
that the clustering needs to be warmed up a bit. I should not have put 
it in the code knowing people would be confused by it. You can safely 
use clusterHits just once. It will just have a small delay at the first 
invocation.

Thanks for experimenting. Please BCC me if you have any urgent projects 
-- I read Lucene's list in batches and my personal e-mail I try to keep 
up to date with.

Dawid
thx,
 Dave


Re: How to add a Lucene index to a jar file?

2005-01-17 Thread Miles Barr
On Mon, 2005-01-17 at 10:25 -0800, Bill Janssen wrote:
> I'd like to package up a Lucene index with the Lucene class files and
> my own application classes into a single jar file, so that it forms a
> "double-clickable" single-file Java application that supports
> searching over the index.  However, I'm not sure how to read the index
> (create an IndexReader) if the files that form the index are packaged
> into the same jar file with the code.
> 
> If anyone could shed some light on how to do this, I'd appreciate it.


You'll have to implement org.apache.lucene.store.Directory to load the
index from the JAR file. Take a look at FSDirectory and RAMDirectory for
some more details.

Then you either have to load the JAR file with java.util.jar.JarFile to get
to the files, or you can use ClassLoader#getResourceAsStream to get to
them.




-- 
Miles Barr <[EMAIL PROTECTED]>
Runtime Collective Ltd.




How to add a Lucene index to a jar file?

2005-01-17 Thread Bill Janssen
Hi.

I'd like to package up a Lucene index with the Lucene class files and
my own application classes into a single jar file, so that it forms a
"double-clickable" single-file Java application that supports
searching over the index.  However, I'm not sure how to read the index
(create an IndexReader) if the files that form the index are packaged
into the same jar file with the code.

If anyone could shed some light on how to do this, I'd appreciate it.

Thanks!

Bill




RE: QUERYPARSER + LEXICAL ERROR

2005-01-17 Thread Vanlerberghe, Luc
Hi,

There's nothing wrong with the searchWrd string itself (I ran a few
variations of it through my test setup).
The '\' characters in the source are escapes for the Java compiler and
will never be seen by Lucene.

The only way I found to produce a similar Exception
(org.apache.lucene.queryParser.ParseException: Lexical error at line 1,
column 15.  Encountered: <EOF> after : "")
is by passing a string without a matching closing quote like:
String  searchWrd = "kid \"toy\" OR \"";

The exception is thrown as soon as you pass the string to
QueryParser.parse().

I tested using lucene 1.4.3 and jdk 1.5.0

Luc

> -Original Message-
> From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
> Sent: maandag 17 januari 2005 16:23
> To: Lucene Users List
> Subject: Re: QUERYPARSER + LEXICAL ERROR
> 
> Hello,
> 
> Try:
>   String  searchWrd = "kid \"toy\"" OR "kid \"ball\""
> 
> You'll have to use a WhitespaceAnalyzer with that, though, or 
> a custom Analyzer that doesn't remove the escape character (\).
> 
> Otis
> 
> 
> --- Karthik N S <[EMAIL PROTECTED]> wrote:
> 
> > 
> > 
> > Hi  Guys.
> > 
> > Apologies.
> > 
> > 
> > 
> > The Query Parser is giving me a Lexical Error
> > 
> > String  searchWrd = "kid \"toy\" OR \"ball\" "
> > 
> > org.apache.lucene.queryParser.TokenMgrError: Lexical error 
> at line 1, 
> > column 26. Encountered:  after : ""
> > at
> >
> org.apache.lucene.queryParser.QueryParserTokenManager.getNextT
> oken(QueryPars
> > erTokenManager.java:1050)
> > 
> > Why is this happening?
> > 
> > Lucene version :  Lucene 1.4.3.jar
> > Jdk version:  Jdk 1.4.2
> > O/s  :  Win2000
> > 
> > Some body Please Reply
> > 
> > 
> > 
> > 
> > 
> > 
> > WITH WARM REGARDS
> > HAVE A NICE DAY
> > [ N.S.KARTHIK]
> > 
> > 
> > 
> > 
> -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> > 
> > 
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: StandardAnalyzer unit tests?

2005-01-17 Thread Daan Hoogland
€0.02: When indexing code, "++" is a stop term, but it might appear in 
English text as well. 'C' is a not very descriptive but perfectly valid 
variable name. '#' is used in some old Morse transcripts, I think. I am 
not going to die or get fired over this, but I'd suggest not including 
those tokens in a standard anything.

Erik Hatcher wrote:

> I personally don't have a problem with that change, however I don't 
> like changing such things as they can lead to unexpected and confusing 
> issues later. Suppose someone upgrades their version of Lucene without 
> re-indexing and now queries that used to work no longer work? (sure, I 
> agree it is wise to re-index if you upgrade Lucene).
>
> Perhaps others could chime in on whether this change would adversely 
> affect them or if this is a desirable change?
>
> Erik
>
>
>
> On Jan 17, 2005, at 4:51 AM, Chris Lamprecht wrote:
>
>> Erik, Paul, Daniel,
>>
>> I submitted a testcase --
>> http://issues.apache.org/bugzilla/show_bug.cgi?id=33134
>>
>> On a related note, what do you all think about updating the
>> StandardAnalyzer grammar to treat "C#" and "C++" as tokens? It's a
>> small modification to the grammar -- NutchAnalysis.jj has it.
>>
>> -Chris
>>
>> On Mon, 17 Jan 2005 03:23:41 -0500, Erik Hatcher
>> <[EMAIL PROTECTED]> wrote:
>>
>>> I don't see any tests of StandardAnalyzer either. Your contribution
>>> would be most welcome. There are tests that use StandardAnalyzer, but
>>> not to test it directly.
>>>
>>





Re: StandardAnalyzer unit tests?

2005-01-17 Thread Erik Hatcher
I personally don't have a problem with that change, however I don't 
like changing such things as they can lead to unexpected and confusing 
issues later.  Suppose someone upgrades their version of Lucene without 
re-indexing and now queries that used to work no longer work?  (sure, I 
agree it is wise to re-index if you upgrade Lucene).

Perhaps others could chime in on whether this change would adversely 
affect them or if this is a desirable change?

Erik

On Jan 17, 2005, at 4:51 AM, Chris Lamprecht wrote:
Erik, Paul, Daniel,
I submitted a testcase --
http://issues.apache.org/bugzilla/show_bug.cgi?id=33134
On a related note, what do you all think about updating the
StandardAnalyzer grammar to treat "C#" and "C++" as tokens?  It's a
small modification to the grammar -- NutchAnalysis.jj has it.
-Chris
On Mon, 17 Jan 2005 03:23:41 -0500, Erik Hatcher
<[EMAIL PROTECTED]> wrote:
I don't see any tests of StandardAnalyzer either.  Your contribution
would be most welcome.  There are tests that use StandardAnalyzer, but
not to test it directly.


Re: QUERYPARSER + LEXICAL ERROR

2005-01-17 Thread Otis Gospodnetic
Hello,

Try:
  String  searchWrd = "kid \"toy\"" OR "kid \"ball\""

You'll have to use a WhitespaceAnalyzer with that, though, or a custom
Analyzer that doesn't remove the escape character (\).

Otis


--- Karthik N S <[EMAIL PROTECTED]> wrote:

> 
> 
> Hi  Guys.
> 
> Apologies.
> 
> 
> 
> The Query Parser is giving me a Lexical Error
> 
> String  searchWrd = "kid \"toy\" OR \"ball\" "
> 
> org.apache.lucene.queryParser.TokenMgrError: Lexical error at line 1,
> column
> 26. Encountered:  after : ""
> at
>
org.apache.lucene.queryParser.QueryParserTokenManager.getNextToken(QueryPars
> erTokenManager.java:1050)
> 
> Why is this happening?
> 
> Lucene version :  Lucene 1.4.3.jar
> Jdk version:  Jdk 1.4.2
> O/s  :  Win2000
> 
> Some body Please Reply
> 
> 
> 
> 
> 
> 
> WITH WARM REGARDS
> HAVE A NICE DAY
> [ N.S.KARTHIK]
> 
> 
> 



Re: How to get all field values from a Hits object?

2005-01-17 Thread Miles Barr
On Mon, 2005-01-17 at 13:36 +0100, Tim Lebedkov (UPK) wrote:
> is it possible to get all different values for a
> field from a Hits object and how to do this?

Not unless you iterate through all the documents in the Hits object,
e.g.

Set fields = new HashSet();
for (int i = 0; i < hits.length(); i++) {
Document doc = hits.doc(i);
fields.add(doc.get("fieldName"));
}

but you probably don't want to do this on a per-search basis, because
retrieving the document is an expensive operation.



-- 
Miles Barr <[EMAIL PROTECTED]>
Runtime Collective Ltd.




Suggestions for remoting Lucene?

2005-01-17 Thread Joseph Ottinger
I just realised that the Hits object isn't Serializable, although Document
and Field are. I can easily build a Hits equivalent that *is*
Serializable, but should that be on my end, or at the Lucene API level?

---
Joseph B. Ottinger http://enigmastation.com
IT Consultant[EMAIL PROTECTED]
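A serializable snapshot along the lines Joseph describes is straightforward to build on the application side: copy whatever you need out of Hits into a plain Serializable holder. The class below is a hypothetical sketch, not a Lucene API:

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

/** A detached, serializable snapshot of a page of search results. */
public class SerializableHits implements Serializable {

    private static final long serialVersionUID = 1L;

    private final int totalHits;
    private final List<String[]> rows = new ArrayList<String[]>(); // {id, title} per hit

    public SerializableHits(int totalHits) {
        this.totalHits = totalHits;
    }

    /** Typically called while iterating hits.doc(i) on the server side. */
    public void add(String id, String title) {
        rows.add(new String[] {id, title});
    }

    public int totalHits() {
        return totalHits;
    }

    public List<String[]> rows() {
        return rows;
    }
}
```

Because it holds only plain strings, the snapshot survives RMI or any ObjectOutputStream round trip without dragging the index along.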





How to get all field values from a Hits object?

2005-01-17 Thread Tim Lebedkov \(UPK\)
Hi,

is it possible to get all different values for a
field from a Hits object and how to do this?

Thank You
--Tim




Re: StandardAnalyzer unit tests?

2005-01-17 Thread Chris Lamprecht
Erik, Paul, Daniel,

I submitted a testcase --
http://issues.apache.org/bugzilla/show_bug.cgi?id=33134

On a related note, what do you all think about updating the
StandardAnalyzer grammar to treat "C#" and "C++" as tokens?  It's a
small modification to the grammar -- NutchAnalysis.jj has it.

-Chris

On Mon, 17 Jan 2005 03:23:41 -0500, Erik Hatcher
<[EMAIL PROTECTED]> wrote:
> I don't see any tests of StandardAnalyzer either.  Your contribution
> would be most welcome.  There are tests that use StandardAnalyzer, but
> not to test it directly.
>




QUERYPARSER + LEXICAL ERROR

2005-01-17 Thread Karthik N S


Hi  Guys.

Apologies.



The Query Parser is giving me a Lexical Error

String  searchWrd = "kid \"toy\" OR \"ball\" "

org.apache.lucene.queryParser.TokenMgrError: Lexical error at line 1, column
26. Encountered: <EOF> after : ""
at
org.apache.lucene.queryParser.QueryParserTokenManager.getNextToken(QueryPars
erTokenManager.java:1050)

Why is this happening?

Lucene version :  Lucene 1.4.3.jar
Jdk version:  Jdk 1.4.2
O/s  :  Win2000

Somebody, please reply.






WITH WARM REGARDS
HAVE A NICE DAY
[ N.S.KARTHIK]






Re: carrot2 question too - Re: Fun with the Wikipedia

2005-01-17 Thread Dawid Weiss
Hi David,
I apologize about the delay in answering this one; Lucene is a busy 
mailing list and I had a hectic last week... Again, sorry for the belated 
answer, hope you still find it useful.

That is awesome and very inspirational!
Yes, I admit what you've done with Wikipedia is quite interesting and 
looks very good. I'm also glad you spent some time working out Carrot 
integration with Lucene. It works quite nicely.

Carrot2 looks very interesting. Wondering if anybody has a list of all 
the
Technically I don't think Carrot2 uses Lucene per se - it's just that you 
can integrate the two, and ditto for Nutch - it has code that uses Carrot2.
Yes, this is true. Carrot2 doesn't use all of Lucene's potential -- it 
merely takes the output from a query (titles, urls and snippets) and 
attempts to cluster them into some sensible groups. I think many things 
could be improved, the most important of which is fast snippet retrieval 
from Lucene, because right now it takes 50% of the clustering time; I've 
seen a post a while ago describing a faster snippet generation technique, 
and I'm sure that would give clustering a huge boost speed-wise.

And here's my question. I reread the Carrot2<->Lucene code, esp 
Demo.java, and there's this fragment:

// warm-up round (stemmer tables must be read etc).
List clusters = clusterer.clusterHits(docs);
long clusteringStartTime = System.currentTimeMillis();
clusters = clusterer.clusterHits(docs);
long clusteringEndTime = System.currentTimeMillis();
Thus it calls clusterHits() twice.
I don't really understand how to use Carrot2 - but I think the above is 
just for the sake of benchmarking clusterHits() w/o the effect of 1-time 
initialization - and that there's no benefit of repeatedly calling 
clusterHits (where a benefit might be that it can find nested clusters 
or whatever) - is that right (that there's no benefit)?
No, there is absolutely no benefit from it. It was merely to show people 
that the clustering needs to be warmed up a bit. I should not have put 
it in the code knowing people would be confused by it. You can safely 
use clusterHits just once. It will just have a small delay at the first 
invocation.

Thanks for experimenting. Please BCC me if you have any urgent projects 
-- I read Lucene's list in batches and my personal e-mail I try to keep 
up to date with.

Dawid


Re: StandardAnalyzer unit tests?

2005-01-17 Thread Erik Hatcher
I don't see any tests of StandardAnalyzer either.  Your contribution 
would be most welcome.  There are tests that use StandardAnalyzer, but 
not to test it directly.

Erik
On Jan 16, 2005, at 11:48 PM, Chris Lamprecht wrote:
Does anyone have a unit test for StandardAnalyzer?  I've modified the
StandardAnalyzer javacc grammar to tokenize "c#" and "c++" without
removing the "#" and "++" parts, using pieces of the grammar from
Nutch.  Now I'd like to make sure I didn't change the way it parses
any other tokens.  thanks,
-Chris


Re: StandardAnalyzer unit tests?

2005-01-17 Thread Paul Elschot
Chris,

On Monday 17 January 2005 05:49, Chris Lamprecht wrote:
> PS-I didn't find any in lucene CVS head, and I'd be glad to contribute
> some unit tests.

Under Unix this will give you the cvs head:

cvs -d :pserver:[EMAIL PROTECTED]:/home/cvspublic checkout jakarta-lucene

The tests are in the jakarta-lucene/src/test directory.
There are some tests that might be interesting in the queryParser and analysis 
directories below the obligatory org/apache/lucene.

In case these tests are not covering what you need, or you need help to run 
the tests, could you continue on lucene-dev?

Regards,
Paul Elschot

