Modify the StandardTokenizerFactory to concatenate all words

2013-11-05 Thread Kevin
Currently I'm using StandardTokenizerFactory, which tokenizes the words
based on spaces. For Toy Story it will create the tokens toy and story.
Ideally, I would want to extend the functionality of StandardTokenizerFactory to
create the tokens toy, story, and toy story. How do I do that?
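
A minimal sketch of one way to get there, assuming the shingle module (ShingleFilter, which Solr exposes as ShingleFilterFactory) and a Lucene 4.x-style custom Analyzer; the class name and Version constant are illustrative. With min/max shingle size 2 and output-unigrams enabled, "Toy Story" comes out as toy, story, and toy story:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

class ShinglingAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new StandardTokenizer(Version.LUCENE_45, reader);
        // keep the single-word tokens and also emit two-word shingles
        ShingleFilter shingles =
                new ShingleFilter(new LowerCaseFilter(Version.LUCENE_45, source), 2, 2);
        shingles.setOutputUnigrams(true);
        return new TokenStreamComponents(source, shingles);
    }
}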


Re: Index-Format difference between 1.4.3 and 2.0

2006-07-18 Thread kevin

Hi,
How do I highlight the keyword in the search result summary? Can I use
the highlight package?

Thanks!




HELP: how to highlight the search key word in lucene's search results?

2006-08-11 Thread kevin

Hi,
How do I highlight the search keyword in Lucene's search results? Please
advise, thanks!
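
The contrib highlight package (org.apache.lucene.search.highlight) is the usual way to do this. A minimal sketch, where the "contents" field name and the <b> tags are illustrative:

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;

class HighlightSketch {
    // Returns the best fragment of 'text' with the query terms wrapped in <b> tags.
    static String highlight(Query query, Analyzer analyzer, String text) throws Exception {
        Highlighter highlighter =
                new Highlighter(new SimpleHTMLFormatter("<b>", "</b>"), new QueryScorer(query));
        TokenStream tokens = analyzer.tokenStream("contents", new StringReader(text));
        return highlighter.getBestFragment(tokens, text);
    }
}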





Urgent! Forgot to close IndexWriter after adding Documents to the index.

2011-03-20 Thread Kevin Tse
Hi, experts

I had a program running for 2 days to build an index for around 160 million
text files, and after the program ended, I tried searching the index and found
that the index was not correctly built: indexReader.numDocs() returns 0. I
checked the index directory and it looked good; all the index data seemed to be
there, and the directory is 1.5 GB in size.

I checked my code and found that I forgot to call indexWriter.optimize() and
indexWriter.close(). I want to know if it is possible to re-optimize() the
index so I don't need to rebuild the whole index from scratch? I don't
really want the program to take another 2 days.

Thanks!
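
For reference, a minimal sketch of the pattern that was skipped, assuming a Lucene 3.x-era API (whether the existing 1.5 GB directory can be salvaged depends on what was actually committed): close(), ideally in a finally block, flushes the buffered documents and publishes the segments file, and optimize() is an optional merge-down step before it.

import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.Version;

class IndexBuildSketch {
    static void build(Directory dir, Analyzer analyzer, List<Document> docs) throws Exception {
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_31, analyzer));
        try {
            for (Document doc : docs) {
                writer.addDocument(doc);
            }
            writer.optimize();  // optional: merge segments down
        } finally {
            writer.close();     // without this the new segments are never published
        }
    }
}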

-- 
Neevek Est


Possible to cause documents to be contiguous after forceMerge?

2016-11-15 Thread Kevin Burton
I have a large index (say 500GB) that has a large percentage of near-duplicate
documents.

I have to keep the documents there (can't delete them) as the metadata is
important.

Is it possible to get the documents to be contiguous somehow?

Once they are contiguous then they will compress very well - which I've
already confirmed by writing the exact same document N times.

IDEALLY I could use two fields and have a unique document ID but then a
group_id so that they can be located on disk by the group_id... but I don't
think this is possible.

Can I just create a synthetic "id" field for this and assume that "id" is
ordered on disk in the lucene index?
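
One hedged sketch of the two-field idea, assuming Lucene 6.x: index the unique id as a StringField and the group_id both as a StringField and as a SortedDocValuesField, so the index can then be sorted by group_id (the sort itself is configured on the IndexWriter; see the sort merge strategy thread below). Field names are illustrative.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.util.BytesRef;

class GroupedDocSketch {
    // A document carrying a unique id plus a group_id; the doc-values copy of
    // group_id is what index sorting can key on.
    static Document makeDoc(String id, String groupId) {
        Document doc = new Document();
        doc.add(new StringField("id", id, Field.Store.YES));
        doc.add(new StringField("group_id", groupId, Field.Store.YES));
        doc.add(new SortedDocValuesField("group_id", new BytesRef(groupId)));
        return doc;
    }
}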





Re: Possible to cause documents to be contiguous after forceMerge?

2016-11-15 Thread Kevin Burton
On Tue, Nov 15, 2016 at 6:16 PM, Erick Erickson 
wrote:

> You can make no assumptions about locality in terms of where separate
> documents land on disk. I suppose if you have the whole corpus at index
> time you
> could index these "similar" documents contiguously. T
>

Wow.. that's shockingly frightening. There are a ton of optimizations if
you can trick the underlying content store into performing locality.

Not trying to be overly negative, so another way to phrase it is that at
least there's room for improvement!


> My base question is why you'd care about compressing 500G. Disk space
> is so cheap that the expense of trying to control this dwarfs any
> imaginable
> $avings, unless you're talking about a lot of 500G indexes. In other words
> this seems like an
> XY problem, you're asking about compressing when you are really concerned
> with something else.
>

500GB per day... additionally, disk is cheap, but IOPS are not. The more we
can keep in RAM and on SSD the better.

And we're trying to get as much in RAM, then SSD, as possible... plus we have
about 2 years of content.  It adds up ;)

Kevin



Sort merge strategy ?

2016-11-16 Thread Kevin Burton
What's the current status of the sort merge strategy?

I want to sort an index by a given field and keep it in that order on disk.

It seems to have evolved over the years and I can't easily figure out the
current status via the Javadoc in 6.x
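
A minimal sketch, assuming Lucene 6.2+ where index sorting is configured via IndexWriterConfig.setIndexSort; the field name and path are illustrative, and the sort field has to be indexed as doc values:

import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

class SortedIndexSketch {
    // Opens a writer whose segments (including merged ones) stay ordered by "group_id".
    static IndexWriter openSortedWriter(String indexPath) throws IOException {
        Directory dir = FSDirectory.open(Paths.get(indexPath));
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        config.setIndexSort(new Sort(new SortField("group_id", SortField.Type.STRING)));
        return new IndexWriter(dir, config);
    }
}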




Question about BytesRef and BinaryDocValues

2018-08-23 Thread Kevin Manuel
Hi,

I'm using Lucene version 4.3.1 and I've implemented a custom score query.
I'm trying to read the value for a field from the field cache. It's a text
field, so I'm using getTerms, which returns a BinaryDocValues object.

However, on trying to get the BytesRef object for a document and converting
it to a string using utf8ToString, I think characters after a whitespace are
not being returned in the string. For instance, if the field has 'hey tom',
the string only returns 'hey'.

I tried this with version 4.10.0 too and I see the same thing. I was
wondering if there's something wrong with the way I'm accessing it or it
was an issue in these versions.

Thanks,
Kevin
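
One way around it, sketched under the assumption of the Lucene 4.x doc-values API: keep the analyzed TextField for matching, and additionally store the raw, untokenized string in a BinaryDocValuesField that the custom score query reads back. Field names are illustrative.

import java.io.IOException;

import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.util.BytesRef;

class RawValueSketch {
    // Index time: analyzed field for matching, doc-values field with the whole value.
    static Document makeDoc(String value) {
        Document doc = new Document();
        doc.add(new TextField("name", value, Field.Store.NO));
        doc.add(new BinaryDocValuesField("name_raw", new BytesRef(value)));
        return doc;
    }

    // Query time (per segment): read the full original string back.
    static String readRaw(AtomicReader reader, int docId) throws IOException {
        BinaryDocValues raw = reader.getBinaryDocValues("name_raw");
        BytesRef scratch = new BytesRef();
        raw.get(docId, scratch);          // 4.x signature fills the scratch ref
        return scratch.utf8ToString();    // e.g. "hey tom"
    }
}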


Re: Question about BytesRef and BinaryDocValues

2018-08-23 Thread Kevin Manuel
Hi Vadim,

Thank you so much for your reply. I think you were right.

So if a field is 'analyzed' how can I get both terms 'hey' and 'tom'?

Thanks,
Kevin

On Thu, Aug 23, 2018, 20:26 Vadim Gindin  wrote:

> Hi Kevin!
>
> I think that your field is "analyzed" and so your field value is divided to
> 2 terms "hey" and "tom". So docvalue is written for each of them.
>
> Regards
> Vadim Gindin
>
>
> пт, 24 авг. 2018, 5:19 Kevin Manuel :
>
> > Hi,
> >
> > I'm using lucene version 4.3.1 and I've implemented a custom score query.
> > I'm trying to read the value for a field from the field cache. It's a
> text
> > field so I'm using getTerms which returns a binarydocvalues object.
> >
> > However on trying to get the bytes ref object for a document and
> converting
> > it to a string using utf8ToString I think characters after a whitespace
> and
> > not being returned in the string. For instance if the field has 'hey
> tom',
> > the string only returns 'hey'.
> >
> > I tried this with version 4.10.0 too and I see the same thing. I was
> > wondering if there's something wrong with the way I'm accessing it or it
> > was an issue in these versions.
> >
> > Thanks,
> > Kevin
> >
>


Upper limit on Score

2019-04-17 Thread Kevin Manuel
Hi,

I was just wondering: is there an upper limit to the score that can be
generated for a non-constant-score query?

Thanks,
Kevin


Using FunctionScoreQuery vs CustomScoreQuery

2020-02-20 Thread Kevin Manuel
Hi,

I'm working on upgrading Lucene from 7.0.0 to the latest 8.4.1. We're using
a CustomScoreQuery to score docs/results differently based on their
distance from the user's location and also some other factors like the
type of the document (i.e. say if we stored documents of places to eat, a
restaurant would have a different boost value vs. say a bar).

In the change log it was suggested to use FunctionScoreQuery instead. From
my understanding, it looks like FunctionScoreQuery can only be used for
index-time boosts (i.e. by using a field value present in the document) but
the above use case probably needs something more dynamic due to the
distance calculation.

Was wondering if you had any suggestions on how to achieve this or if maybe
I'm misunderstanding something?

Thanks,
Kevin


Re: Using FunctionScoreQuery vs CustomScoreQuery

2020-02-25 Thread Kevin Manuel
I see, thank you Adrien! I'll look into it and get back to you if I have
any questions.

On Fri, Feb 21, 2020 at 1:45 AM Adrien Grand  wrote:

> Hi Kevin,
>
> FunctionScoreQuery can also work with dynamically-computed values, you just
> need to provide it with a DoubleValuesSource that computes values
> dynamically. The factory methods that exist in the DoubleValuesSource class
> all work with indexed data, but it is also possible to write a custom
> implementation.
>
> I also wonder whether you saw LatLonPointDistanceFeatureQuery, which is an
> efficient way to boost hits based on geo-distance.
>
> On Thu, Feb 20, 2020 at 10:55 PM Kevin Manuel 
> wrote:
>
> > Hi,
> >
> > I'm working on upgrading Lucene from 7.0.0 to the latest 8.4.1. We're
> using
> > a CustomScoreQuery to score docs/results differently based on their
> > distance from a the user's location and also some other factors like the
> > type of the document (i.e. say if we stored documents of places to eat, a
> > restaurant would have a different boost value vs say a bar).
> >
> > In the change log it was suggested to use FunctionScoreQuery instead.
> From
> > my understanding, it looks like FunctionScoreQuery can only be used for
> > index-time boosts (i.e. by using a field value present in the document)
> but
> > the above use case probably needs something more dynamic due to the
> > distance calculation.
> >
> > Was wondering if you had any suggestions on how to achieve this or if
> maybe
> > I'm misunderstanding something?
> >
> > Thanks,
> > Kevin
> >
>
>
> --
> Adrien
>


Custom DoubleValuesSource to Read from Multiple Indexed DocValue Fields

2020-07-16 Thread Kevin Manuel
Hi,

I'm trying to write a custom DoubleValuesSource for use with a
FunctionScoreQuery instance.

To generate the final score of a document I need to:
1) Read from three indexed docValue fields and
2) Use the score of the wrapped query passed in to the FunctionScoreQuery
instance

For example, a document A would be scored using a formula like:
((docA's_score_from_wrapped_query * some_value_x) + (docA's_field1_value *
some_value_y) + (docA's_field2_value * some_value_z)) * docA's_field3_value

How do I accomplish this?
Can this be done using one custom DoubleValuesSource or do I need one for
reading from each of the indexed docValue fields and then use a combination
of MultiFloatFunctions and SumFloatFunctions to achieve this?

Appreciate your time and help.

Thanks,
Kevin
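
A hedged sketch of one way to do this with a single custom DoubleValuesSource, assuming Lucene 8.x and that field1/field2/field3 were indexed as DoubleDocValuesField (doubles stored as their long bits); the class name, field names, and weights are illustrative:

import java.io.IOException;
import java.util.Objects;

import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.DoubleValues;
import org.apache.lucene.search.DoubleValuesSource;
import org.apache.lucene.search.IndexSearcher;

// Per document: ((score * x) + (field1 * y) + (field2 * z)) * field3
public class WeightedScoreSource extends DoubleValuesSource {

    private final double x, y, z;

    public WeightedScoreSource(double x, double y, double z) {
        this.x = x; this.y = y; this.z = z;
    }

    @Override
    public DoubleValues getValues(LeafReaderContext ctx, DoubleValues scores) throws IOException {
        final NumericDocValues f1 = DocValues.getNumeric(ctx.reader(), "field1");
        final NumericDocValues f2 = DocValues.getNumeric(ctx.reader(), "field2");
        final NumericDocValues f3 = DocValues.getNumeric(ctx.reader(), "field3");
        return new DoubleValues() {
            @Override
            public double doubleValue() throws IOException {
                return (scores.doubleValue() * x
                        + Double.longBitsToDouble(f1.longValue()) * y
                        + Double.longBitsToDouble(f2.longValue()) * z)
                       * Double.longBitsToDouble(f3.longValue());
            }

            @Override
            public boolean advanceExact(int doc) throws IOException {
                // only score documents that carry all three doc-values fields;
                // 'scores' is already positioned on the current hit by FunctionScoreQuery
                return f1.advanceExact(doc) && f2.advanceExact(doc) && f3.advanceExact(doc);
            }
        };
    }

    @Override
    public boolean needsScores() {
        return true; // we read the wrapped query's score in doubleValue()
    }

    @Override
    public DoubleValuesSource rewrite(IndexSearcher searcher) {
        return this;
    }

    @Override
    public boolean isCacheable(LeafReaderContext ctx) {
        return DocValues.isCacheable(ctx, "field1", "field2", "field3");
    }

    @Override
    public int hashCode() { return Objects.hash(x, y, z); }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof WeightedScoreSource)) return false;
        WeightedScoreSource w = (WeightedScoreSource) o;
        return w.x == x && w.y == y && w.z == z;
    }

    @Override
    public String toString() { return "WeightedScoreSource(" + x + "," + y + "," + z + ")"; }
}

It would then be wrapped as new FunctionScoreQuery(wrappedQuery, new WeightedScoreSource(x, y, z)), so no MultiFloatFunction/SumFloatFunction chain is needed.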


Re: [VOTE] Lucene logo contest, third time's a charm

2020-09-03 Thread Kevin Risden
A1, A2, D (binding)

Kevin Risden



On Thu, Sep 3, 2020 at 4:44 PM jim ferenczi  wrote:

> A1 (binding)
>
> Le jeu. 3 sept. 2020 à 07:09, Noble Paul  a écrit :
>
>> A1, A2, D binding
>>
>> On Thu, Sep 3, 2020 at 7:22 AM Jason Gerlowski 
>> wrote:
>> >
>> > A1, A2, D (binding)
>> >
>> > On Wed, Sep 2, 2020 at 10:47 AM Michael McCandless
>> >  wrote:
>> > >
>> > > A2, A1, C5, D (binding)
>> > >
>> > > Thank you to everyone for working so hard to make such cool looking
>> possible future Lucene logos!  And to Ryan for the challenging job of
>> calling this VOTE :)
>> > >
>> > > Mike McCandless
>> > >
>> > > http://blog.mikemccandless.com
>> > >
>> > >
>> > > On Tue, Sep 1, 2020 at 4:21 PM Ryan Ernst  wrote:
>> > >>
>> > >> Dear Lucene and Solr developers!
>> > >>
>> > >> Sorry for the multiple threads. This should be the last one.
>> > >>
>> > >> In February a contest was started to design a new logo for Lucene
>> [jira-issue]. The initial attempt [first-vote] to call a vote resulted in
>> some confusion on the rules, as well the request for one additional
>> submission. The second attempt [second-vote] yesterday had incorrect links
>> for one of the submissions. I would like to call a new vote, now with more
>> explicit instructions on how to vote, and corrected links.
>> > >>
>> > >> Please read the following rules carefully before submitting your
>> vote.
>> > >>
>> > >> Who can vote?
>> > >>
>> > >> Anyone is welcome to cast a vote in support of their favorite
>> submission(s). Note that only PMC member's votes are binding. If you are a
>> PMC member, please indicate with your vote that the vote is binding, to
>> ease collection of votes. In tallying the votes, I will attempt to verify
>> only those marked as binding.
>> > >>
>> > >> How do I vote?
>> > >>
>> > >> Votes can be cast simply by replying to this email. It is a
>> ranked-choice vote [rank-choice-voting]. Multiple selections may be made,
>> where the order of preference must be specified. If an entry gets more than
>> half the votes, it is the winner. Otherwise, the entry with the lowest
>> number of votes is removed, and the votes are retallied, taking into
>> account the next preferred entry for those whose first entry was removed.
>> This process repeats until there is a winner.
>> > >>
>> > >> The entries are broken up by variants, since some entries have
>> multiple color or style variations. The entry identifiers are first a
>> capital letter, followed by a variation id (described with each entry
>> below), if applicable. As an example, if you prefer variant 1 of entry A,
>> followed by variant 2 of entry A, variant 3 of entry C, entry D, and lastly
>> variant 4e of entry B, the following should be in your reply:
>> > >>
>> > >> (binding)
>> > >> vote: A1, A2, C3, D, B4e
>> > >>
>> > >> Entries
>> > >>
>> > >> The entries are as follows:
>> > >>
>> > >> A. Submitted by Dustin Haver. This entry has two variants, A1 and A2.
>> > >>
>> > >> [A1]
>> https://issues.apache.org/jira/secure/attachment/12999548/Screen%20Shot%202020-04-10%20at%208.29.32%20AM.png
>> > >> [A2]
>> https://issues.apache.org/jira/secure/attachment/12997172/LuceneLogo.png
>> > >>
>> > >> B. Submitted by Stamatis Zampetakis. This has several variants.
>> Within the linked entry there are 7 patterns and 7 color palettes. Any vote
>> for B should contain the pattern number followed by the lowercase letter of
>> the color palette. For example, B3e or B1a.
>> > >>
>> > >> [B]
>> https://issues.apache.org/jira/secure/attachment/12997768/zabetak-1-7.pdf
>> > >>
>> > >> C. Submitted by Baris Kazar. This entry has 8 variants.
>> > >>
>> > >> [C1]
>> https://issues.apache.org/jira/secure/attachment/13006392/lucene_logo1_full.pdf
>> > >> [C2]
>> https://issues.apache.org/jira/secure/attachment/13006393/lucene_logo2_full.pdf
>> > >> [C3]
>> https://issues.apache.org/jira/secure/attachment/13006394/lucene_logo3_full.pdf
>> > >> [C4]
>> https://issues.apache.org/jira/s

Implementing Custom DoubleValues

2021-02-15 Thread Kevin Manuel
Hi,

I'm trying to write a custom DoubleValuesSource for use with a
FunctionScoreQuery instance. (I'm trying to migrate from an older version
of Lucene so I need to replace CustomScoreQuery).

To generate the final score of a document I need to:
1) Read from three indexed DocValues fields and
2) Use the score of the wrapped query passed in to the FunctionScoreQuery
instance

For example, a document A would be scored using a formula like:
((docA's_score_from_wrapped_query * some_value_x) + (docA's_field1_value *
some_value_y) + (docA's_field2_value * some_value_z)) * docA's_field3_value
I know that I have to use SimpleBindings and Expressions to achieve this.

Q) How do I implement the following methods when writing a custom
DoubleValues inside a custom DoubleValuesSource implementation ?
a) isCacheable
b) needsScores (I'm aware that this should return true if I need the doc's
score value after running the query that's passed into FunctionScoreQuery's
constructor. However, I do not use the score in the custom DoubleValues
implementation itself, but rather in an outer method that makes use of
this. Would this still need to be 'true' in this case?)

I've read the Javadocs as well as multiple other questions on this topic on
this channel, but it's still confusing to me.

Appreciate your time and help.

Thanks,
Kevin
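
For those two methods specifically, a hedged fragment continuing the kind of custom DoubleValuesSource sketched after the 2020-07-16 message above (field names illustrative):

    @Override
    public boolean needsScores() {
        // Return true only if the DoubleValues built in getValues() actually reads
        // the 'scores' argument it was given. If the wrapped query's score is never
        // consulted there, returning false lets Lucene skip computing it.
        return true;
    }

    @Override
    public boolean isCacheable(LeafReaderContext ctx) {
        // Safe to cache per segment only if the doc-values fields cannot change
        // underneath us (no in-place doc-values updates on this segment).
        return DocValues.isCacheable(ctx, "field1", "field2", "field3");
    }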


Java 17 and Lucene

2021-10-18 Thread Kevin Rosendahl
Hello,

We are using Lucene 8 and planning to upgrade from Java 11 to Java 17. We
are curious:

   - How is Lucene tested against Java versions? Are there correctness and
     performance tests using Java 17?
   - Additionally, besides Java 17, how are new Java releases tested?
   - Are there any other orgs using Java 17 with Lucene?
   - Any other considerations we should be aware of?


Best,
Kevin Rosendahl


Re: Java 17 and Lucene

2021-10-19 Thread Kevin Rosendahl
Thank you all for the information, it's very useful. Seems like it's best
to hold off on upgrading for now, but great to know that different JDK
versions are at least being exercised in CI.

I'm wondering, is there a better way to assess the production readiness of
a Lucene/JDK combination than just emailing the user group, or is this our
best bet in the future as well?

Thanks again!
Kevin

On Tue, Oct 19, 2021 at 5:07 AM Michael Sokolov  wrote:

> > I would a bit careful: On our Jenkins server running with AMD Ryzen CPU
> it happens quite often that JDK 16, JDK 17 and JDK 18 hang during tests and
> stay unkillable (only a hard kill with" kill -9"). Previous Java versions
> don't hang. It happens not all the time (about 1/4th of all builds) and due
> to the fact that the JVM is unresponsible it is not possible to get a stack
> trace with "jstack". If you know a way to get the stack trace, I'd happy to
> get help.
>
> ooh that sounds scary. I suppose one could maybe get core dumps using
> the right signal and debug that way? Oh wait you said only 9 works,
> darn! How about attaching using gdb? Do we maintain GC logs for these
> Jenkins builds? Maybe something suspicious would show up there.
>
> By the way the JDK is absolutely "responsible" in this situation! Not
> responsive maybe ...
>
> On Tue, Oct 19, 2021 at 4:46 AM Uwe Schindler  wrote:
> >
> > Hi,
> >
> > > Hey,
> > >
> > > Our team at Amazon Product Search recently ran our internal benchmarks
> with
> > > JDK 17.
> > > We saw a ~5% increase in throughput and are in the process of
> > > experimenting/enabling it in production.
> > > We also plan to test the new Corretto Generational Shenandoah GC.
> >
> > I would a bit careful: On our Jenkins server running with AMD Ryzen CPU
> it happens quite often that JDK 16, JDK 17 and JDK 18 hang during tests and
> stay unkillable (only a hard kill with" kill -9"). Previous Java versions
> don't hang. It happens not all the time (about 1/4th of all builds) and due
> to the fact that the JVM is unresponsible it is not possible to get a stack
> trace with "jstack". If you know a way to get the stack trace, I'd happy to
> get help.
> >
> > Once I figured out what makes it hang, I will open issues in OpenJDK (I
> am OpenJDK member/editor). I have now many stuck JVMs running to analyze on
> the server, so you're invited to help! At the moment, I have no time to
> take care, so any help is useful.
> >
> > > On a side note, the Lucene codebase still uses the deprecated (as of
> > > JDK17) AccessController
> > > in the RamUsageEstimator class.
> > > We suppressed the warning for now (based on recommendations
> > > <http://mail-archives.apache.org/mod_mbox/db-derby-
> > > dev/202106.mbox/%3CJIRA.13369440.1617476525000.615331.16239514800
> > > 5...@atlassian.jira%3E>
> > > from the Apache Derby mailing list).
> >
> > This should not be an issue, because we compile Lucene with javac
> parameter "--release 11", so it won't show any warning that you need to
> suppress. Looks like your build system at Amazon is not the original one by
> Lucene's Gradle, which shows no warnings at all.
> >
> > Uwe
> >
> > > Gautam Worah.
> > >
> > >
> > > On Mon, Oct 18, 2021 at 3:02 PM Michael McCandless <
> > > luc...@mikemccandless.com> wrote:
> > >
> > > > Also, I try to semi-aggressively upgrade Lucene's nightly benchmarks
> to new
> > > > JDK releases and leave an annotation on the nightly charts:
> > > > https://home.apache.org/~mikemccand/lucenebench/
> > > >
> > > > I just now upgraded to JDK 17 and kicked off a new benchmark run ...
> in a
> > > > few hours it should show the new data points and then I'll try to
> remember
> > > > to annotate it tomorrow.
> > > >
> > > > So let's see whether nightly benchmarks uncover any performance
> changes
> > > > from JDK17 :)
> > > >
> > > > Mike McCandless
> > > >
> > > > http://blog.mikemccandless.com
> > > >
> > > >
> > > > On Mon, Oct 18, 2021 at 5:36 PM Robert Muir 
> wrote:
> > > >
> > > > > We test different releases on different platforms (e.g. Linux,
> Windows,
> > > > > Mac).
> > > > > We also test EA (Early Access) releases of openjdk versions during
> the
> > > > > developme

index update problems with Linux

2008-01-18 Thread Kevin Dewi

Hello,

I have a problem with this code (updating a Lucene index by deleting and
adding documents):



IndexReader reader = IndexReader.open(directory);
while (i.hasNext()) {
    reader.deleteDocuments(i.next());
}
reader.close();

...

IndexWriter writer = new IndexWriter(directory,
        new GermanAnalyzer(), create);
while (i2.hasNext()) {
    writer.addDocument(i2.next());
}

When creating the IndexWriter I get this exception on Linux (Ubuntu
Dapper):
java.io.IOException: Lock obtain timed out: Lock@/home/picard/develop/apache-tomcat-6.0.14/temp/lucene-1763c549e0e952256392217dac3f3bdb-write.lock

at org.apache.lucene.store.Lock.obtain(Lock.java:56)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:254)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:244)
at de.gesichterparty.DatabaseManager.processQueue(DatabaseManager.java:345)
at de.gesichterparty.LuceneServlet.run(LuceneServlet.java:140)
at java.lang.Thread.run(Thread.java:595)

java.io.FileNotFoundException: /home/picard/develop/apache-tomcat-6.0.14/webapps/Lucene/WEB-INF/databases/user/segments (No such file or directory)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
at org.apache.lucene.store.FSIndexInput$Descriptor.<init>(FSDirectory.java:430)
at org.apache.lucene.store.FSIndexInput.<init>(FSDirectory.java:439)
at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:329)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:45)
at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:146)
at org.apache.lucene.store.Lock$With.run(Lock.java:99)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:141)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:136)
at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:47)
at de.gesichterparty.DatabaseManager.processQueue(DatabaseManager.java:374)
at de.gesichterparty.LuceneServlet.run(LuceneServlet.java:140)
at java.lang.Thread.run(Thread.java:595)


On Mac OS X Leopard this code works fine.

Thanks
Kevin
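
A hedged sketch of the usual way around that lock clash, assuming Lucene 2.1+ where IndexWriter.deleteDocuments(Term) exists: do both the deletes and the adds through a single IndexWriter, so no IndexReader is holding the write lock when the writer is opened (iterator and variable names are illustrative):

import java.io.IOException;
import java.util.Iterator;

import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;

class UpdateSketch {
    static void update(Directory directory, Iterator deleteTerms, Iterator newDocs)
            throws IOException {
        // false = append to the existing index instead of recreating it
        IndexWriter writer = new IndexWriter(directory, new GermanAnalyzer(), false);
        try {
            while (deleteTerms.hasNext()) {
                writer.deleteDocuments((Term) deleteTerms.next());
            }
            while (newDocs.hasNext()) {
                writer.addDocument((Document) newDocs.next());
            }
        } finally {
            writer.close();
        }
    }
}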



The localized Languages.

2007-06-20 Thread sejourne kevin
Hi,

It seems that all the localized-language Analyzers are absent from
org.apache.lucene.analysis.* in the latest 2.2
source release of Lucene. Is this normal or not?
Regards,


Kévin.





  




Re : The localized Languages.

2007-06-21 Thread sejourne kevin
Thanks,

I found it. I wasn't aware of those two source trees.

Kévin.

----- Original Message -----
From: Doron Cohen <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, 20 June 2007, 23:42:17
Subject: Re: The localized Languages.

Hi Kevin, are you looking for the sources under
contrib\analyzers?

Javadocs have both "core" and "contrib" together,
but they are separated in the source tree, (and
separated jars are created for them in the binary
dist).

Doron

sejourne kevin <[EMAIL PROTECTED]> wrote on 20/06/2007 15:31:20:

> Hi,
>
> It seem that all localized languages Analyser are absent from
> org.apache.lucene.analysis.* in the lastest 2.2
> source release of Lucene. Is this normal or not ?
> regards,
>
>
> Kévin.








  




TermFreqVector

2007-07-19 Thread Kevin Chen
I need to use getTermFreqVector on a subset of docs that belong to the hits for
a query. I understand I need to pass the docNumber as an argument in this case.
How do I access that?

For ex .

doc = hits.doc(0);
TermFreqVector vector = reader.getTermFreqVector(docId, "field");


How do I get docId? 
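
Assuming the Hits API of that era, the internal document number comes from Hits.id(n):

int docId = hits.id(0);   // internal doc number of the first hit
TermFreqVector vector = reader.getTermFreqVector(docId, "field");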

   

term location in doc

2007-08-08 Thread Kevin Chen
I can see that TermPositions gives an enum with all positions of a term in a
document. I want to do the opposite: given a position, can I query the
document for the term at that position?
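
There is no direct position-to-term lookup, but if the field was indexed with term vectors that include positions, one hedged sketch (Lucene 2.x-era API, field name illustrative) is to walk the TermPositionVector:

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermPositionVector;

class TermAtPositionSketch {
    // Returns the term indexed at 'position' in the given document's field,
    // or null if nothing was indexed there. Requires Field.TermVector.WITH_POSITIONS.
    static String termAt(IndexReader reader, int docId, String field, int position)
            throws IOException {
        TermPositionVector tpv = (TermPositionVector) reader.getTermFreqVector(docId, field);
        String[] terms = tpv.getTerms();
        for (int i = 0; i < terms.length; i++) {
            int[] positions = tpv.getTermPositions(i);
            for (int j = 0; j < positions.length; j++) {
                if (positions[j] == position) {
                    return terms[i];
                }
            }
        }
        return null;
    }
}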



   

RE: Search clustering question

2005-11-23 Thread Runde, Kevin
Does anyone have examples of using Carrot2? I've been looking into it
lately and am not finding good documentation. 

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, November 23, 2005 2:23 PM
To: java-user@lucene.apache.org
Subject: Re: Search clustering question

Have you looked into using Carrot2 (it is on sourceforge...) 
 
-Original Message-
From: Supreet Sethi <[EMAIL PROTECTED]>
To: Java lucene list 
Sent: Wed, 23 Nov 2005 17:34:22 +0530
Subject: Search clustering question


Hi,

For final finish up on work for my project. We intend to do search
clustering. Now I have already read that there is no clear cut way of
doing that in lucene.

Wondering, if anyone has tackled this problem with time constraint as
one issue.

With a turnaround time of 3 seconds, clustering 5000 search results by
different criteria seems like a tough nut

regards

Supreet





Too many required clauses for a BooleanQuery

2006-02-08 Thread Kevin Dutcher
Hey Everyone,

I'm running into the "More than 32 required/prohibited clauses in query"
exception when running a query. I thought I understood the problem but the
following two scenarios confuse me.

1st - No Error
33 required clauses plus additional clauses that are left off b/c they
are the same as the second scenario
=
(categorization:10102617 AND categorization:10102621 AND
categorization:10102625 AND categorization:10102629 AND
categorization:10102633 AND categorization:10102637 AND
categorization:10102641 AND categorization:10102645 AND
categorization:10102649 AND categorization:10102653 AND
categorization:10102657 AND categorization:10102661 AND
categorization:10102665 AND categorization:10102669 AND
categorization:10102673 AND categorization:10102677 AND
categorization:10102681 AND categorization:10102685 AND
categorization:10102689 AND categorization:10102693 AND
categorization:10102697 AND categorization:10102701 AND
categorization:10102705 AND categorization:10102709 AND
categorization:10102713 AND categorization:10102717 AND
categorization:10102721 AND categorization:10102725 AND
categorization:10102729 AND categorization:10102733 AND
categorization:10102737 AND categorization:10102741 AND
categorization:10102745) AND ...

2nd - Error
The 33 required clauses above with the addition of a required
clause that is 3 OR'd clauses

((categorization:10102405 OR categorization:10102409 OR
categorization:10102413) AND categorization:10102617 AND
categorization:10102621 AND categorization:10102625 AND
categorization:10102629 AND categorization:10102633 AND
categorization:10102637 AND categorization:10102641 AND
categorization:10102645 AND categorization:10102649 AND
categorization:10102653 AND categorization:10102657 AND
categorization:10102661 AND categorization:10102665 AND
categorization:10102669 AND categorization:10102673 AND
categorization:10102677 AND categorization:10102681 AND
categorization:10102685 AND categorization:10102689 AND
categorization:10102693 AND categorization:10102697 AND
categorization:10102701 AND categorization:10102705 AND
categorization:10102709 AND categorization:10102713 AND
categorization:10102717 AND categorization:10102721 AND
categorization:10102725 AND categorization:10102729 AND
categorization:10102733 AND categorization:10102737 AND
categorization:10102741 AND categorization:10102745) AND ...

I can add additional required clauses to the 1st scenario without any
problems. So why am I seeing the error in the second scenario and not the
first? After discovering the error, I expected to see it in the first
scenario also. Is there any way around this error?

As a side note, it is very unlikely that this will be encountered in the
real world, but b/c we are dealing with content categorization it is still
possible.

Thanks in advance,

Kevin


Re: Too many required clauses for a BooleanQuery

2006-02-09 Thread Kevin Dutcher
> I don't know a lot about the error your encountering (or not encountering
> as the case may be) but please for hte love of all that is sane use a
> Filter instead of putting all those categories in your Query.
>
> Your search performance and your scores will thank you.


I need all the documents returned from the search and am manipulating the
results with a custom HitCollector, therefore I can't use filters.

Kevin


Re: Too many required clauses for a BooleanQuery

2006-02-09 Thread Kevin Dutcher
>
> One more thing: in case these queries are generated, you might
> consider building the corresponding (nested) BooleanQuery yourself
> instead of using the QueryParser.
>
> Regards,
> Paul Elschot



I'll give that a try.  Thanks Paul.


Re: Too many required clauses for a BooleanQuery

2006-02-09 Thread Kevin Dutcher
Thanks Hoss... You're absolutely right!

Kevin


On 2/9/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
>
>
> : I need all the documents returned from the search and am manipulating
> the
> : results with a custom HitCollector, therefore I can't use filters.
>
> I don't understand this comment.  There are certianly methods in the
> Searchble interface that allow you to use both a Filter and a HitCollector
> together -- as for "need all the documents returned from the search" ...
> I'm not suggesting you filter out any docs your query doesn't allready
> restrict out because of hte required clauses.  I'm just saying that
> instead of a few dozen required clauses, you use a Filter like the one
> previously posted in this thread.  if you need to combine those "required"
> filters with other optional condition,s you cna do that using a
> ChainedFilter (or writting your own custom Filter that unions the BitSets
> yourself)
>
>
> -Hoss
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
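
For reference, a hedged sketch of the kind of Filter plus HitCollector combination suggested above (Lucene 1.9/2.0-era API; the categorization field name matches the queries quoted earlier, everything else is illustrative):

import java.io.IOException;
import java.util.BitSet;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.Filter;

public class CategoryFilter extends Filter {
    private final String[] requiredCategories;

    public CategoryFilter(String[] requiredCategories) {
        this.requiredCategories = requiredCategories;
    }

    public BitSet bits(IndexReader reader) throws IOException {
        BitSet result = null;
        for (int i = 0; i < requiredCategories.length; i++) {
            BitSet current = new BitSet(reader.maxDoc());
            TermDocs td = reader.termDocs(new Term("categorization", requiredCategories[i]));
            while (td.next()) {
                current.set(td.doc());
            }
            td.close();
            if (result == null) {
                result = current;
            } else {
                result.and(current); // every category is required
            }
        }
        return result == null ? new BitSet(reader.maxDoc()) : result;
    }
}

// Usage: searcher.search(query, new CategoryFilter(categories), hitCollector);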


RE: Vector Space Model <-> Probabilistic Model

2006-03-15 Thread Runde, Kevin
Hello,

I recently came across this email in the Lucene user list and am
interested in this article. I tried to access it from the link you
provided, but couldn't find any link to access it. Do you still have an
electronic copy?

Thanks,
Kevin Runde 

-Original Message-
From: Malcolm [mailto:[EMAIL PROTECTED] 
Sent: Friday, February 17, 2006 8:47 AM
To: java-user@lucene.apache.org
Subject: Re: Vector Space Model <-> Probabilistic Model

I know of one I used for my Thesis. The REF is:
Fuhr, N. 2001, "Models in information retrieval", , pp. 21-50.

http://portal.acm.org/citation.cfm?id=567294

I may have a electronic version. If you need it give me an email address
as 
this service doesn't allow attachments.

Hope this helps,

Malcolm Clark





RE: Commercial vendors monitoring this ML? was: Lucene Performance Issues

2006-03-28 Thread Runde, Kevin
Of course they are monitoring this mailing list; Lucene rocks and it is
beating them. Do yourself a favor and dedicate some time to testing
Lucene vs. any commercial application. A little time spent up front
testing the tools can save you significant time later optimizing,
hacking in a new tool, or refactoring your program because you didn't
understand how to "really" use the tool. We did that here and were
amazed. We found index size was 1/4 and query speed was 4 times faster
when comparing Lucene to several commercial tools. This was on indexes
that were much larger than physical RAM on the box.

-Kevin
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, March 28, 2006 12:47 PM
To: java-user@lucene.apache.org
Subject: Commercial vendors monitoring this ML? was: Lucene Performance
Issues

Weird, I was just about to comment on the fact that since posting that
my organization has decided to use Lucene, I got calls from two
commercial vendors that didn't give me the time of the day while I was
doing my comparison analysis.

Both of them referred to some random "colleague" in the business
referring them to me.

Jeff Wang
diCarta, Inc.

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, March 28, 2006 8:39 AM
To: java-user@lucene.apache.org
Subject: Re: Lucene Performance Issues

Hi Thomas,

Sound like FUD to me.  No concrete numbers, and the benchmark they
mention eh, haven't we all seen "funny" benchmarks before?  Lucene
is used in many large operations (e.g. Technorati, Simpy) that involve a
LOT of indexing and searching, large indices, etc.  I suggest you try
both and see which one suits your needs. 

Otis

- Original Message 
From: thomasg <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Tuesday, March 28, 2006 5:06:54 AM
Subject: Lucene Performance Issues


Hi, we are currently intending to implement a document storage / search
tool
using Jackrabbit and Lucene. We have been approached by a commercial
search
and indexing organisation called ISYS who are suggesting the following
problems with using Lucene. We do have a requirement to store and search
large documents and the total document store will be large too. Any
comments
on the following would be greatly appreciated.

1) By default, Lucene only indexes the first 10,000 words from each
document. When increasing this default out-of-memory errors can occur.
This
implies that documents, or large sections thereof, are loaded into
memory.
ISYS has a very small memory footprint which is not affected by document
size nor number of documents.

 
2) Lucene appears to be slow at indexing, at least by ISYS' standards.
Published performance benchmarks seem to vary between almost acceptable,
down to very poor. ISYS' file readers are already optimized for the
fastest
text extraction possible.

 
3) The Lucene documentation suggests it can be slow at searching and can
get
slower and slower the larger your indexes get. The tipping point is
where
the index size exceeds the amount of free memory in your machine. This
also
implies that whole indexes, or large portions of them, are loaded into
memory. The bigger the index, the more powerful the machine required.
ISYS'
search speed is always proportional to the size of the result set. Index
size does not materially affect search speed and the index is never
loaded
into memory. It also appears that Lucene requires hands-on tuning to
keep
its search speed acceptable. ISYS' indexes are self-managing and do not
require any maintenance to keep them searchable at full speed.


Thanks, Thomas
--
View this message in context:
http://www.nabble.com/Lucene-Performance-Issues-t1354811.html#a3626992
Sent from the Lucene - Java Users forum at Nabble.com.





Ability to load a document with ONLY a few fields for performance?

2005-05-28 Thread Kevin Burton

I have a Document with about 15 fields.  I only need two of them.

How much faster would lucene be if I only fetched the two fields?  Each 
field is a separate file and this would almost certainly slow down just 
the basic IO.


I think I looked at this a long time ago and there was no high level API 
for doing this and that I'd have to dive into SegmentReader stuff.


Any idea?
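
The FieldSelector API that appeared in later releases covers exactly this; a hedged sketch, with illustrative field names, loading only two fields eagerly and skipping the rest:

import java.io.IOException;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.SetBasedFieldSelector;
import org.apache.lucene.index.IndexReader;

class PartialLoadSketch {
    // Loads only the two named fields; everything else is never read.
    static Document loadTwoFields(IndexReader reader, int docId) throws IOException {
        Set wanted = new HashSet();
        wanted.add("url");
        wanted.add("title");
        return reader.document(docId, new SetBasedFieldSelector(wanted, Collections.EMPTY_SET));
    }
}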




Possible to find min and max values for a Date field?

2005-05-30 Thread Kevin Burton
Is it possible to find the minimum and maximum values for a date field 
with a given reader?


I guess I could use TermEnum to do a binary search until I get a hit but 
this seems a bit kludgy.


Thoughts?

I don't see any APIs for doing this and a google/grep of the source 
doesn't help


Kevin




Finding minimum and maximum value of a field?

2005-05-31 Thread Kevin Burton
I have an index with a date field.  I want to quickly find the minimum 
and maximum values in the index.


Is there a quick way to do this?  I looked at using TermInfos and
finding the first one, but how do I find the last?


I also tried the new sort API and the performance was horrible :-/

Any ideas?

Kevin
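
One hedged sketch using the old TermEnum API: terms are stored sorted per field, so the first term of the field is the minimum and the last one before the enumeration crosses into another field is the maximum (field name illustrative; this does scan every term of the field):

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

class MinMaxSketch {
    // Returns { min, max } term text for the given field, or nulls if the field is empty.
    static String[] minMax(IndexReader reader, String field) throws IOException {
        TermEnum terms = reader.terms(new Term(field, ""));
        String min = null, max = null;
        try {
            do {
                Term t = terms.term();
                if (t == null || !t.field().equals(field)) {
                    break;  // walked past the last term of this field
                }
                if (min == null) {
                    min = t.text();
                }
                max = t.text();
            } while (terms.next());
        } finally {
            terms.close();
        }
        return new String[] { min, max };
    }
}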




Re: Finding minimum and maximum value of a field?

2005-05-31 Thread Kevin Burton

Andrew Boyd wrote:


How about using range query?

private Term begin, end;

begin = new Term("dateField",  
DateTools.dateToString(Date.valueOf(<"backInTimeStringDate">)));
end = new Term("dateField",  
DateTools.dateToString(Date.valueOf(<"farFutureStringDate">)));

RangeQuery query = new RangeQuery(begin, end, true);

IndexSearcher searcher = new IndexSearcher(directory);

Hits hits = searcher.search(query);

Document minDoc = hits.doc(0);
Document maxDoc = hits.doc(hits.length()-1);

String minDateString = minDoc.get("dateField");
String maxDateString = maxDoc.get("dateField");

 

This certainly is an interesting solution.  How would lucene score this 
result set?  The first and last will depend on the score...


I  guess I can build up a quick test

Kevin




Re: Finding minimum and maximum value of a field?

2005-05-31 Thread Kevin Burton

Andrew Boyd wrote:


How about using range query?

private Term begin, end;

begin = new Term("dateField",  
DateTools.dateToString(Date.valueOf(<"backInTimeStringDate">)));
end = new Term("dateField",  
DateTools.dateToString(Date.valueOf(<"farFutureStringDate">)));

 

Ha.. crap.  That won't work either.  We have too many values and I get 
the dreaded:


Exception in thread "main" 
org.apache.lucene.search.BooleanQuery$TooManyClauses


Fun.




Re: Ability to load a document with ONLY a few fields for performance?

2005-06-01 Thread Kevin Burton

Andrew Boyd wrote:


The numbers look impressive.  If I build from the 1.9 trunck will I get the 
patch?

 

Funny... I went ahead and implemented this myself and it didn't work.  
Of course I may have implemented it incorrectly.  I'll look at the patch 
source and try it out!


Something fun to do tomorrow!  w00t!

Kevin




Performance tuning and org.apache.lucene.store.InputStream.BUFFER_SIZE

2005-06-01 Thread Kevin Burton


I was doing a JProfiler install of our webapp/lucene last week and of 
course a large part of our app is spent in RandomAccessFile.readBytes ...


This is called by InputStream.readByte which internally uses a 
BUFFER_SIZE of 1024 (which is the default).


This value seems too small for a default. Most filesystem block sizes 
are 4096 and a lot of controllers will refuse to read in block sizes 
smaller than 16384.


On top of that you also have the the physical disks that have a minimum 
block size and also RAID controller block sizes.


I tried to tune this variable this weekend with different settings and 
had strange results. Mostly because (I'm assuming) that my Linux 
filesystem cache and RAID controller cache were getting in the way (and 
right now I have no way to fix this).


So my questions are:

1.  What do people think about changing this value?  Does anyone else do it?

2.  Has anyone run any benchmarks here?  Does anyone have any solid 
lucene benchmarks they can run with before/after runs with this var changed?


3.  Does anyone else think that 1024 is too small of a default value for 
modern systems?


4.  Does anyone know what the default block size is for other filesystems? I 
know that XFS is 4096.  What about ext2? ext3? JFS? ReiserFS? NTFS? UFS? 
etc....


Kevin




Fastest way to fetch N documents with unique keys within large numbers of indexes..

2005-06-06 Thread Kevin Burton

Hey.

I'm trying to figure out the FASTEST way to solve this problem.

We have a system where I'll be given 10 or 20 unique keys, which are 
stored as non-tokenized fields within Lucene.  Each key represents a 
unique document.


Internally I'm creating a new Term and then calling 
IndexReader.termDocs() on this term.  Then if termdocs.next() matches 
then I'll return this document.


The problem is that this doesn't work very fast either.  This is not an 
academic debate as I've put the system in a profiler and Lucene is the 
top bottleneck (by far).


I don't think there's anything faster than this right?  Could I maybe 
cache a TermEnum and keep it as a pointer to the FIRST field for these 
IDs and then reuse that?  This might allow me to search faster to the 
start of my terms?


Does Lucene internally do a binary search for my term?

I could of course do an index merge of all this content but thats a 
separate problem.  We have a lot of indexes and often have more than 40 
and constantly merging these into a multigig index just takes FOREVER.


It seems that internally IO is the problem. I'm about as fast on IO as I 
can get as I'm on a SCSI RAID array at RAID0 on FAST scsi disks...  I 
also tried tweaking InputStream.BUFFER_SIZE with no visible change in 
performance.


Kevin




Re: Fastest way to fetch N documents with unique keys within large numbers of indexes..

2005-06-06 Thread Kevin Burton

Chris Hostetter wrote:


I haven't profiled either of thse suggestions but:

1) have you tried constructing a BooleanQuery of all 10-20 terms? Is the
  total time to execute the search, and access each Hit slower then your
  termDocs approach?
 

Actually, using any type of query was very slow.  The problem was when it 
was computing the score.  The termDocs approach was a big performance gain - about 2x, and 
since it's the slowest part of our app it was a nice one. :)


We were using a TermQuery though. 

I wonder if there's a way to tell lucene not to score.  Maybe I could 
then use a BooleanQuery with internal TermQueries and then scan the 
indexes once each. 


2) have you tried sorting your terms first, then opening a TermDocs on the
  first one, and seeking to each of the remaining terms?  it seems like
  that would be faster then opening a new TermDocs for each Term.
 


How do I do this?  I just assumed that termDocs was already sorted...

I don't see any mention of this in the API...

Kevin
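
A hedged sketch of that seek idea (Lucene 1.x/2.x TermDocs API; the "key" field name is illustrative): sort the keys first, then reuse one TermDocs and seek() it from term to term instead of opening a new one per key:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

class KeyLookupSketch {
    static List fetchByKeys(IndexReader reader, String[] keys) throws IOException {
        Arrays.sort(keys);                       // walk the term dictionary in order
        List docs = new ArrayList();
        TermDocs termDocs = reader.termDocs();
        try {
            for (int i = 0; i < keys.length; i++) {
                termDocs.seek(new Term("key", keys[i]));
                if (termDocs.next()) {           // unique key: at most one match
                    docs.add(reader.document(termDocs.doc()));
                }
            }
        } finally {
            termDocs.close();
        }
        return docs;
    }
}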




Re: Fastest way to fetch N documents with unique keys within large numbers of indexes..

2005-06-06 Thread Kevin Burton

Matt Quail wrote:


We have a system where I'll be given 10 or 20 unique keys.



I assume you mean you have one unique-key field, and you are given  
10-20 values to find for this one field?




Internally I'm creating a new Term and then calling  
IndexReader.termDocs() on this term.  Then if termdocs.next()  
matches then I'll return this document.



Are you calling reader.termDocs() inside a (tight) loop? It might be  
better to create one TermEnum, and use "seek". Something like this:


Yes.. this is another approach I was thinking of taking.  I was thinking 
of building up a list of indexes which had a high probability of holding 
the given document and then searching for each of them.


What I'm worried about though is that it would be a bit slower...  I'm 
just going to have to test out different implementations to see







I'm pretty sure that will work. And if you can avoid the multi- 
threading issues, you might try and use the same TermDocs object for  
as long as possible (that is, move it up out of as many tight loops  
as you can).


Well... that doesn't look like the biggest overhead.  The bottleneck 
seems to be in seek() and the fact that it's using an InputStream to read 
bytes off disk.  I actually tried to speed that up by cranking the 
InputStream.BUFFER_SIZE var higher but that didn't work either, though I'm 
not sure if it's a caching issue.  I sent an email to the list about this 
earlier but no one responded.


So it seems like my bottleneck is in seek(), so it would make sense to 
figure out how to limit this.


Is this O(log(N))  btw or is it O(N) ?

Kevin




use of LinkedList in ConjunctionScorer hurting performance?

2005-06-07 Thread Kevin Burton

This is a strange anomaly I wanted to point out:

http://www.flickr.com/photos/burtonator/18030919/

This is a jprofiler screenshot.  I can give you a jprofiler "snapshot" 
if you want but it requires the clientside app.


I'm not sure why this should be hot... in a linked list this should be 
fast ... maybe we're calling it too often?


I didn't have much time to look at it but I wanted to illuminate the issue.

Kevin




Re: Fastest way to fetch N documents with unique keys within large numbers of indexes..

2005-06-07 Thread Kevin Burton

Chris Hostetter wrote:


: was computing the score.  This was a big performance gain.  About 2x and
: since its the slowest part of our app it was a nice one. :)
:
: We were using a TermQuery though.

I believe that one search on one BooleanQuery containing 20
TermQueries should be faster then 20 searches on 20 TermQueries.
 


Actually.. it wasn't... :-/

It was about 4x slower.

Ug...

Kevin




Re: Fastest way to fetch N documents with unique keys within large numbers of indexes..

2005-06-07 Thread Kevin Burton

Paul Elschot wrote:


For a large number of indexes, it may be necessary to do this over
multiple indexes by first getting the doc numbers for all indexes,
then sorting these per index, then retrieving them
from all indexes, and repeating the whole thing using terms determined
from the retrieved docs.
 

Well this was a BIG win.  Just benchmarking it out shows a 10x -> 50x 
performance increase.


Times in milliseconds:

Before:

duration: 1127
duration: 449
duration: 394
duration: 564

After:

duration: 182
duration: 39
duration: 12
duration: 11

The values of 2-4  I'm sure are due to the filesystem buffer cache but I 
can't imagine why they'd be faster in the second round.  It might be 
that Linux is deciding not to buffer the document blocks.


Kevin




Re: Fastest way to fetch N documents with unique keys within large numbers of indexes..

2005-06-09 Thread Kevin Burton

Andrew Boyd wrote:


Kevin,
 Those results are awsome.  Could you please give those of us that were 
following but not quite understanding everything some pseudo code or some more 
explaination?

 

Ug.. I hate to say this but ignore these numbers.  Turns out that I was 
hitting a cache ... I thought I had disabled this but I set the wrong 
var. :-/


If only it was true ;)

Kevin




Optimizing indexes with mulitiple processors?

2005-06-09 Thread Kevin Burton
Is it possible to get Lucene to do an index optimize on multiple 
processors?


It's a single-threaded algorithm currently, right?

It's a shame since I have a quad machine but I'm only using 1/4th of the 
capacity.  That's a heck of a performance hit.


Kevin




Re: Optimizing indexes with mulitiple processors?

2005-06-09 Thread Kevin Burton

Bill Au wrote:


Optimize is disk I/O bound.  So I am not sure what multiple CPUs will buy you.
 



Now on my system with large indexes... I often have the CPU at 100%...

Kevin




Re: Optimizing indexes with mulitiple processors?

2005-06-09 Thread Kevin Burton

Chris Collins wrote:


To follow up.  I was surprised to find that from the experiment of indexing 4k
documents to local disk (Dell PE with onboard RAID with 256MB cache). I got the
following data from my profile:

70 % time was spent in inverting the document
30 % in merge

 

Oh.. yeah.. that's indexing.  I'm more interested in merging multiple 
indexes...


Kevin




Re: Optimizing indexes with mulitiple processors?

2005-06-10 Thread Kevin Burton

Chris Collins wrote:


Well I am currently looking at merging too.  In my application merging will
occur against a filer (read as higher latency device).  I am currently working
on how to stage indices on local disk before moving to a filer.  Assume I must
move to a filer eventually for whatever crazzy reason I need todont ask it
aint funny :-}

In that case I have a different performance issue, that is that FSInputStream
and FSOutputStream inherit the buffer size of 1k from OS and IS  This would be
useful to increase to reduce the amount of RPC's to the filer when doing merges
. assuming that reads and writes are sequential (CIFS supports a 64k block
and NFS supports upto I think 32k). 

Yeah.. I already did this actually ... on local disks the performance 
benefit wasn't noticeable.  The variables are private/final ... I made 
them public and non-final and it worked.


Note that OutputStream has a bug when I set it higher... I don't have 
the trace I'm afraid...



I haven't spent much time on this so far,
so it's not like I know it's hard to do.  From preliminary experiments it's obvious
that changing the OS buffer size is not the thing to do. 


If anyone has successfully increased the FSOutputStream and FSInputStream
buffers and got it not to blow up on array copies I would love to know the
short cut


Maybe that was my problem...

Kevin

--


Use Rojo (RSS/Atom aggregator)! - visit http://rojo.com. 
See irc.freenode.net #rojo if you want to chat.


Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html

  Kevin A. Burton, Location - San Francisco, CA
 AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Optimizing indexes with multiple processors?

2005-06-10 Thread Kevin Burton

Peter A. Friend wrote:



I changed that value to 8k, and based on the truss output from an  
index run, it is working. Haven't gotten much beyond that to see if  
it causes problems elsewhere. The value also needs to be altered on  
the read end of things. Ideally, this will be made settable via a  
system property.


Has anyone tried to tweak this on a RAID array on XFS?  It's confusing to figure 
out the ideal read size.

My performance benchmarks didn't show any benefit to setting this variable 
higher but I'm worried this is due to caching.

I tried to flush the caches by creating a 5G file and then cating that to 
/dev/null but I have no way to verify that this actually works.

I just made the BUFFER_SIZE variables non-final so that I can set them at any time. 


Kevin

--


Use Rojo (RSS/Atom aggregator)! - visit http://rojo.com. 
See irc.freenode.net #rojo if you want to chat.


Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html

  Kevin A. Burton, Location - San Francisco, CA
 AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



NGram Language Categorization Source

2005-08-20 Thread Kevin Burton
Hey lucene guys.

I know for a fact that a bunch of you have been curious about language
categorization for a long time now and Java has lacked a solid way to
solve this problem.

Anyway.  This new library that I just released should be easy to tie
into your Lucene indexers.  Just run the library on your text (strip the
HTML first), then create a new field in Lucene called LANG (or something)
and filter on JUST that language code before you search.
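
Something like this is what I have in mind, roughly (1.4-style API; the
detectLanguage() call is just a stand-in for whatever the library exposes):

  // index time: tag each document with the detected language code
  String lang = detectLanguage(plainText);        // stand-in for the library call
  Document doc = new Document();
  doc.add(Field.UnStored("contents", plainText)); // indexed and tokenized, not stored
  doc.add(Field.Keyword("LANG", lang));           // indexed as a single untokenized token
  writer.addDocument(doc);

  // search time: require the language code alongside the user's query
  BooleanQuery filtered = new BooleanQuery();
  filtered.add(userQuery, true, false);                              // required
  filtered.add(new TermQuery(new Term("LANG", "en")), true, false);  // required
  Hits hits = searcher.search(filtered);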

I'd love some help with filling out missing languages if anyone has
some spare time.  That would help make up for all the hard work I've done
here (nudge.. nudge).

I did a full survey of the language categorization space for Java and I
think this is basically the only library out there.

Good luck
...

I'm working on a blog post describing how blog search engines like
Technorati, PubSub, and Feedster could/should use language
categorization to help deal with the chaos of tagging and full-text
search. Google has done this for a long time now and Technorati has it
in beta.

http://www.feedblog.org/2005/08/ngram_language_.html

-- 
 Kevin A. Burton, Location - San Francisco, CA
  AIM/YIM - sfburtonator,  Web - http://www.feedblog.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: NGram Language Categorization Source

2005-08-21 Thread Kevin Burton
> ftp://ftp.software.ibm.com/software/globalization/documents/linguini.pdf
> >
> > Linguini: Language Identification for Multilingual Documents
> > John M. Prager
> 
> Prager also uses an n-gram approach, so you might be able to take
> advantage of some of his research into optimal values for N.

Yeah.. though to be honest, as long as you're on the long tail
portion of N the values won't matter much, I think.

All you'll do is waste a bit of memory (like 1k)
 
> The code to Linguini doesn't seem to be available (you have to
> purchase some IBM product(s) to get it) so what you've done is great
> for the open source community - thanks!
> 
> Also I could post to the Unicode list re training data in multiple
> languages, as that's a good place to find out about multilingual
> corpora.

Yeah. That was my biggest problem. This area had never really been
solved in the OSS world.

-- 
 Kevin A. Burton, Location - San Francisco, CA
  AIM/YIM - sfburtonator,  Web - http://www.feedblog.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: NGram Language Categorization Source

2005-08-21 Thread Kevin Burton
> Erhm... Not to rain on your parade, but Googling for "ngram java" gives
> a lot of hits. http://sourceforge.net/projects/ngramj and also
> "languageidentifier" in Nutch are two examples of Open Source Java
> implementations. Each can be used with Lucene.

I think I've played with ngramj and found it very lacking. 
 
I haven't played with 'languageidentifier' in Nutch ... 

> A lot depends on the reference profiles (which in turn depend on the
> quality of your training corpus - in this case, your corpus is not the
> best choice, because each text contains a lot of foreign words).

I realize that my corpus isn't the best.  That's one of the reasons
I've open-sourced it.  The main improvement in ngramcat (my code) is
that if the result isn't obvious we throw an Exception, so
theoretically we won't see any false positives unless the language
categorization is WAY off.

> It was
> also found that the way you create ngram profiles (e.g. with or without
> surrounding spaces, single length or mixed length) affects the LI
> performance. 

LI???

I haven't benchmarked it but I'd be interested in any suggestions you have.

> For documents with mixed languages it was also found that
> methods, which combine ngrams with stopwords, work better.

Hm.. interesting.. where?  A URL I can read?
 
> Additionally, simple methods based on cosine similarity (or delta
> ranking) don't give correct results for documents with mixed languages.
> In such cases input texts are chunked, and each chunk is analyzed
> separately, and then the scores are combined... etc, etc... millions of
> ways you can do this - and of course no method is perfect. :-)

Yes.  We don't handle the mixed language case very well.  The chunking
method is something I wanted to approach.

> So, there is still a lot to do in this area, if you come up with some
> unique way of improving LI performance...

Maybe I'm being dense but what is LI performance?

Thanks.

Kevin

-- 
 Kevin A. Burton, Location - San Francisco, CA
  AIM/YIM - sfburtonator,  Web - http://www.feedblog.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: NGram Language Categorization Source

2005-08-21 Thread Kevin Burton
> * A Nutch implementation:
> http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/plugin/languageidentifier/
> 
> * A Lucene patch: http://issues.apache.org/bugzilla/show_bug.cgi?id=26763

A step in the right direction. It doesn't have other language
categories created though.

> * JTextCat (http://www.jedi.be/JTextCat/index.html),  a Java wrapper
> for libtextcat

Yes. I saw JTextCat.. I didn't want any JNI used. 

> * NGramJ (http://ngramj.sourceforge.net/), a general n-gram Java library

LGPL.. yuck. That said, I think I reviewed this package and found it
lacking.  I started off just trying to find a library to use in our
crawler but never found anything, which is why I ended up writing my
own.

> Of these, the Nutch one is certainly under active development, the
> others don't seem to be as far as I can tell.

They should just use ngramcat :)

Kevin

-- 
 Kevin A. Burton, Location - San Francisco, CA
  AIM/YIM - sfburtonator,  Web - http://www.feedblog.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene and Xanga.com

2005-08-25 Thread Kevin Burton
On 8/24/05, Monsur Hossain <[EMAIL PROTECTED]> wrote:
> 
> Otis, we've been continually impressed with the performance of Lucene.
> We've been ever increasing the load we are putting on it (from our small
> help section, to our slightly larger metros, to our big groups, and then
> our gigantic weblogs), and it has met each of these challenges
> wonderfully.
> 
> We are currently indexing only a few weeks worth of data (about a 6.5
> gig index)..  I think we can increase that even more and over the next

I've only found Lucene to fall down when optimizing REALLY huge
indexes like 300G or so.  Then you run out of available system memory
(4G on 32bit machines) and you hit disk.  Then it starts to take weeks
to optimize :-)

Of course you could use multiple machines or get more memory.  

Kevin

-- 
 Kevin A. Burton, Location - San Francisco, CA
  AIM/YIM - sfburtonator,  Web - http://www.feedblog.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Is Lucene right for my app?

2005-09-18 Thread Kevin Stembridge

Hi folks,
I'd like to add seach functionality to a homegrown webapp I'm building 
that will store and display news articles. I've been looking through the 
Lucene Wiki, FAQ and tutorials and it looks like it will be able to 
provide the functionality I'll need. But before I commit to a given 
technology I was hoping to get a bit of advice from the people who use 
Lucene to see if it is the right search technology for my app and also 
some hints on how I should design my app to best make use of Lucene.


First of all, a few details about the application might help. It is 
basically an archive of news articles. Users will be able to perform 
searches to query the articles, which are stored locally, using criteria 
such as full text search, query of headline, author, date range etc. I'm 
able to decide how the articles will be stored and I was planning on 
just creating them and storing them as static HTML pages. For every 
article there will be a database record containing fields such as title, 
author, date published.


So with that in mind, I have a few questions:

Would Lucene be a good choice for my app?
What is the best format to store documents in given that Lucene needs to 
search them but they still need to be rendered to a browser quickly?
How much development effort is usually involved in integrating Lucene 
with an application?


I hope this mailing list is the right place to be asking the questions. 
If not, just point me in the right direction. Either way I would be very 
grateful for any advice.


Cheers,
Kevin


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Is Lucene right for my app?

2005-09-18 Thread Kevin Stembridge

Hi Jeff and Doug,
I've just finished reading Chapter 1 of Lucene in Action (sample chapter 
from Manning) and I've ordered a copy. The sample chapter was pretty 
useful in giving me an idea of how Lucene will fit into my app and I 
think it'll do the trick nicely with a bit of elbow grease on my part.


Thanks very much for the advice.

Cheers,
Kevin

Jeff Rodenburg wrote:


Kevin -

You've come to the right list to get information to help you make a 
decision.  That said, the responsible answer to your question will be 
"it depends".  The supporter in me says Lucene is your best choice, 
hands down.


Your questions aren't as straightforward as you might expect.  Lucene 
is an API, not a full-fledged search engine.  It's up to you to put it 
to work within the confines of your operation, so determining what's 
*best* can normally only be determined by yourself.


My suggestion to you: pick up a copy of Lucene in Action.  You'll get 
plenty of support on this mailing list, but you can educate yourself 
much more effectively with that book.  The authors lurk on this list.  
It's the cheapest consulting ($40) you can get.


Cheers,
jeff



On 9/18/05, Kevin Stembridge <[EMAIL PROTECTED]> wrote:



Would Lucene be a good choice for my app?
What is the best format to store documents in given that Lucene
needs to
search them but they still need to be rendered to a browser quickly?
How much development effort is usually involved in integrating Lucene
with an application?




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



CLucene and Lucene

2008-05-16 Thread Kevin Daly (kedaly)
I have a question concerning interop between CLucene and Lucene. Is it
possible to have a C++ application using CLucene acting as an
IndexWriter, and then have a web application using Lucene to query the
index? Could there be issues with locking under load, for example?
 
I have done some basic tests where I can write/read to/from the index using
CLucene and Lucene.
 
- Kevin.

Kevin Daly
Software Engineer
IP Communications Business Unit

[EMAIL PROTECTED]
Phone :+35391384651



Block 10
Parkmore
Galway
Ireland
www.cisco.com/



This e-mail may contain confidential and privileged material for the
sole use of the intended recipient. Any review, use, distribution or
disclosure by others is strictly prohibited. If you are not the intended
recipient (or authorized to receive for the recipient), please contact
the sender by reply e-mail and delete all copies of this message.   
 


 



Help with Search Java Code set up

2005-10-26 Thread Kevin L. Cobb
I've been using Lucene happily for a couple of years now. But this new
search functionality I'm trying to add is somewhat different than what
I'm used to doing. It would help if the smart folks on this list would
steer me in the right direction.
 
I have several "searchable" fields and one keyword field in my index. I
usually work with EITHER the keyword or the searchable (non-keyword)
fields at a time, but this time I want to deal with them together. I
need to be able to do a term search in the "searchable" fields but at
the same time apply another term to the keyword field.
 
At this point, I'm thinking that I'll need to do two distinct searches,
one using the search term in what I'm calling my searchable fields, and
the other using the other term in the keyword field. Then join the two
HIT lists together.
 
Looking for some advice. 
 
Thanks,
 
Kevin
 

 


RE: Help with Search Java Code set up

2005-10-26 Thread Kevin L. Cobb
It well could be that I'm lacking in setting up my queries. Here's the
gist of what I'm trying to do, it a little pseudocode. 

1. inputs: 1) termToSearch 2) keywordField
2. Use MultiFieldQueryParser to build the query for the termToSearch in
the searchable fields
3. Use QueryParser to build the query for the keywordField (only one
field to search)
4. Can I combine these separate queries together into one? 
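
Something like this (the field names and the keywordValue variable are
just examples; this is the 1.4-style BooleanQuery.add(query, required,
prohibited) form):

  String[] searchableFields = {"title", "body", "author"};   // example field names
  Query textQuery = MultiFieldQueryParser.parse(termToSearch, searchableFields,
                                                new StandardAnalyzer());
  Query keywordQuery = new TermQuery(new Term(keywordField, keywordValue));

  // one BooleanQuery with both clauses required
  BooleanQuery combined = new BooleanQuery();
  combined.add(textQuery, true, false);
  combined.add(keywordQuery, true, false);
  Hits hits = searcher.search(combined);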

-Kevin



 

-Original Message-
From: Jeff Rodenburg [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, October 26, 2005 1:04 PM
To: java-user@lucene.apache.org
Subject: Re: Help with Search Java Code set up

Kevin -

Maybe I'm misunderstanding, but how is this not a BooleanQuery with two
clauses?

- j

On 10/26/05, Kevin L. Cobb <[EMAIL PROTECTED]> wrote:
>
> I've been using Lucene happily for a couple of years now. But, this 
> new search functionality I'm trying to add is somewhat different that 
> what I'm used to doing. Would help if the smart folks on this list 
> would drive me in the right direction.
>
> I have several "searchable" fields and one keyword field in my index. 
> I usually work with EITHER the keyword or the searchable (non-keyword)

> fields at a time, but this time I want to deal with them together. I 
> need to be able to do a term search in the "searchable" fields but at 
> the same time apply another term to the keyword field.
>
> At this point, I'm thinking that I'll need to do two distinct 
> searches, one using the search term in what I'm calling my searchable 
> fields, and the other using the other term in the keyword field. Then 
> join the two HIT lists together.
>
> Looking for some advice.
>
> Thanks,
>
> Kevin
>
>
>
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Help with Search Java Code set up

2005-10-26 Thread Kevin L. Cobb
Yeah. The last post got me reading more about BooleanQuery and this
opened up the floodgates. 

A question on the heels of this one though. I have documents indexed in
multiple fields. Let's say Name, Synonym, and Definition. Let's say the
search phrase is "big green cat". What I'm building using the
BooleanQuery object is:

+((Name:big Name:green Name:cat) (Synonym:big Synonym:green
Synonym:cat))
(Definition:big Definition:green Definition:cat)) 

I'm getting too many 100% hits back when the search term is, say, "cat". I
want a Query that is more discerning, so that the term "cat" returns a
hit but less than 100%. The term "big green cat" should return 100%, the
term "big green" or "green big" should return something less than 100%,
and the term "big" or "green" or "cat" something less than the previous
one. Hope this makes sense. 
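
In case it helps frame the question, here is roughly the kind of query I'm
imagining might get there: a boosted PhraseQuery clause per field alongside
the individual term clauses, so the full phrase outranks single terms (just
a sketch, using the 1.4-style add(query, required, prohibited) form, and I
may be off base):

  String[] fields = {"Name", "Synonym", "Definition"};
  String[] terms = {"big", "green", "cat"};
  BooleanQuery query = new BooleanQuery();
  for (int i = 0; i < fields.length; i++) {
      // each individual term is optional, so "cat" alone still matches...
      for (int j = 0; j < terms.length; j++) {
          query.add(new TermQuery(new Term(fields[i], terms[j])), false, false);
      }
      // ...but the exact phrase gets a boost, so "big green cat" scores highest
      PhraseQuery phrase = new PhraseQuery();
      for (int j = 0; j < terms.length; j++) {
          phrase.add(new Term(fields[i], terms[j]));
      }
      phrase.setBoost(5.0f);
      query.add(phrase, false, false);
  }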

-Thanks, 

Kevin


 

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, October 26, 2005 2:38 PM
To: java-user@lucene.apache.org
Subject: Re: Help with Search Java Code set up

Are you simply looking to use multiple terms in your search?  In that
case, simply use BooleanQuery instead of TermQuery.  QueryParser will
recognize strings like "foo AND bar" or "+foo +bar" and turn
that into a BooleanQuery for you.

Otis

--- "Kevin L. Cobb" <[EMAIL PROTECTED]> wrote:

> I've been using Lucene happily for a couple of years now. But, this 
> new search functionality I'm trying to add is somewhat different that 
> what I'm used to doing. Would help if the smart folks on this list 
> would drive me in the right direction.
>  
> I have several "searchable" fields and one keyword field in my index.
> I
> usually work with EITHER the keyword or the searchable (non-keyword) 
> fields at a time, but this time I want to deal with them together. I 
> need to be able to do a term search in the "searchable" fields but at 
> the same time apply another term to the keyword field.
>  
> At this point, I'm thinking that I'll need to do two distinct 
> searches, one using the search term in what I'm calling my searchable 
> fields, and the other using the other term in the keyword field. Then 
> join the two HIT lists together.
>  
> Looking for some advice. 
>  
> Thanks,
>  
> Kevin
>  
> 
>  
> 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: How do you make "protected content" searchable by Google?

2005-03-17 Thread Kevin L. Cobb
I worked on a website that had the same issue. We made a "search engine"
page that listed all the documents we wanted to index, as links to
summary pages for those documents; each summary then linked to the full
document on the limited-access site. Google won't be able to follow those
last links because they require user sign-on, but the link will still be
there for enticement when Googlers find the page.

Love the idea about stripping the stop words and stemming the text for
Google indexing. But Google is pretty picky, I am told, so I would not
be surprised if they detected this sort of scheme and decided not to
index your pages. 
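
If you do try it, Lucene's analysis classes will do most of the skimming
work. A rough sketch using the older TokenStream.next()/termText() API
(untested; articleText is a placeholder, and the classes come from
org.apache.lucene.analysis):

  // lower-case, drop stop words, Porter-stem; the result is the "skimmed" text
  TokenStream stream = new PorterStemFilter(
          new StopFilter(new LowerCaseTokenizer(new StringReader(articleText)),
                         StopAnalyzer.ENGLISH_STOP_WORDS));
  StringBuffer skimmed = new StringBuffer();
  for (Token t = stream.next(); t != null; t = stream.next()) {
      skimmed.append(t.termText()).append(' ');
  }
  // skimmed.toString() is what would go on the bot-visible static page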

Good luck. 
 
KLCobb

-Original Message-
From: Chakra Yadavalli [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, March 16, 2005 11:44 PM
To: java-user@lucene.apache.org
Subject: How do you make "protected content" searchable by Google?

Hello, I am not sure if this is the right question for this list, but
it is in regard to search engines.

Suppose you have a website that hosts some protected content that is
accessible only to registered users. How do you make the content searchable
by Google and other popular web search engines? The idea is not to
reveal the content even via the "Google cache."

Here is what I am thinking... 
Using Lucene (or its derivatives), skim through the "protected content",
remove all the common stop words, stem the words, and place the
resulting text files in a directory available to the search bots (via
robots.txt rules). That way, even if the content is cached by the
search engines, it does not make much sense to humans but it still
will enable them to search it. When they click on the link to the
skimmed files, we need to redirect them to the login/register page and,
upon successful login, they should be redirected to the actual
human-readable page that corresponds to the "skimmed content." Note
that the "protected content" may be living in a Content Management
System or a database.

Am I overthinking/engineering it? Any ideas are really appreciated.

Thanks in advance,
Chakra
-- 
Visit my weblog: http://www.jroller.com/page/cyblogue

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Lucene bulk indexing

2005-04-19 Thread Kevin L. Cobb
I think your bottleneck is most likely the DB hit. I assume by 2
products you mean 2 distinct entries into the Lucene Index, i.e.
2 rows in the DB to select from. 

I index about 1.5 million rows from a SQL Server 2000 database with
several fields for each entry and it finishes in about twenty minutes so
you should be able to index 2 rows in a few seconds. 

Make sure your database table(s) are indexed appropriately according to
your select statements. Indexing correctly will be the biggest
performance improvement you will see. 
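
The Lucene side of that loop is the easy part. The basic shape is something
like this (1.4-style Field factories; the table, column, and path names are
only placeholders, and conn is an open JDBC connection):

  // pull the rows with JDBC and add one Document per row
  IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
  Statement stmt = conn.createStatement();
  ResultSet rs = stmt.executeQuery("SELECT id, name, description FROM products");
  while (rs.next()) {
      Document doc = new Document();
      doc.add(Field.Keyword("id", rs.getString("id")));                     // stored, untokenized
      doc.add(Field.Text("name", rs.getString("name")));                    // stored and tokenized
      doc.add(Field.UnStored("description", rs.getString("description")));  // indexed only
      writer.addDocument(doc);
  }
  rs.close();
  stmt.close();
  writer.optimize();
  writer.close();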

Best of luck. 

KLCobb

-Original Message-
From: Mufaddal Khumri [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, April 19, 2005 2:11 PM
To: java-user@lucene.apache.org
Subject: Lucene bulk indexing

Hi,

I am sure this question must have been raised before and maybe it has even
been answered. I would be grateful if someone could point me in the right
direction or give their thoughts on this topic.

The problem:

I have approximately over 2 products that I need to index. At the
moment I get X number of products at a time and index them. This process
takes about 26 minutes (I'm indexing the database id, product name, and
product description).

I was thinking of ways to make this indexing faster. For this I was
thinking about writing a threaded module that would index X number of
products simultaneously. For instance I could spawn (Number of
products/X) number of threads and do the indexing. I am guessing this
would be faster but by what factor would this be faster? (I understand
the writes to the index are synchronized by lucene).
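
Roughly, what I have in mind is something like this (just a sketch; the
Product bean and the fetchSlice() helper are placeholders for my own DB
code, and exception handling is trimmed):

  final IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
  final int numThreads = 4;
  Thread[] workers = new Thread[numThreads];
  for (int i = 0; i < numThreads; i++) {
      final int slice = i;
      workers[i] = new Thread() {
          public void run() {
              // fetchSlice() stands in for paging one slice of the products out of the DB
              for (Iterator it = fetchSlice(slice, numThreads).iterator(); it.hasNext();) {
                  Product p = (Product) it.next();
                  Document doc = new Document();
                  doc.add(Field.Keyword("id", p.getId()));
                  doc.add(Field.Text("name", p.getName()));
                  doc.add(Field.UnStored("description", p.getDescription()));
                  try {
                      writer.addDocument(doc);   // addDocument is safe across threads
                  } catch (IOException e) {
                      // log and carry on
                  }
              }
          }
      };
      workers[i].start();
  }
  for (int i = 0; i < numThreads; i++) {
      workers[i].join();
  }
  writer.optimize();
  writer.close();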

Is there any other approach by which I could speed up the indexing?
Thoughts? Suggestions?

Thanks,
Mufaddal.



--
This email and any files transmitted with it are confidential 
and intended solely for the use of the individual or entity 
to whom they are addressed. If you have received this 
email in error please notify the system manager. Please
note that any views or opinions presented in this email 
are solely those of the author and do not necessarily
represent those of the company. Finally, the recipient
should check this email and any attachments for the 
presence of viruses. The company accepts no liability for
any damage caused by any virus transmitted by this email.
Consult your physician prior to the use of any medical
supplies or product.

--


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Best way to purposely corrupt an index?

2005-04-20 Thread Kevin L. Cobb
My policy on this type of exception handling is to only bite off what
you can chew. If you catch an IOException, then you simply report to the
user that an unexpected error has occurred and the search engine is
unavailable at the moment. Errors should be logged and developers
should look at the specifics of the error to solve the issue. As you
implied, it's either a corrupted index, a permission problem, or another
access problem. 
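
In code terms that policy is just a broad catch around the search call; a
rough sketch (the path, logger, and error-page call are placeholders):

  try {
      IndexSearcher searcher = new IndexSearcher("/path/to/index");   // placeholder path
      Hits hits = searcher.search(query);
      // ... render the hits ...
  } catch (IOException e) {
      // could be a corrupt index, a permissions problem, a disk error...
      log.error("Search index unavailable", e);                       // placeholder logger
      showErrorPage("Search is temporarily unavailable, please try again later.");
  }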

Trying to attack the issue much deeper than this will sacrifice
development/maintenance time for very little payback in the end if you
expect this error to occur infrequently. 



-Original Message-
From: Andy Roberts [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, April 20, 2005 5:43 AM
To: java-user@lucene.apache.org
Subject: Re: Best way to purposely corrupt an index?

On Wednesday 20 Apr 2005 08:27, Maik Schreiber wrote:
> > As the index is rather critical to my program, I just wanted to make
it
> > really robust, and able to cope should a problem occur with the
index
> > itself. Otherwise, the user will be left with a non-functioning
program
> > with no explanation. That's my reasoning anyway.
>
> You should perhaps go about implementing an automatic index backup
feature
> of some sort. In the case of index corruption you would at least be
able to
> go back to the latest backup.

Don't worry, I know what I intend to do *should* an error exist. My
original 
post was about how to detect corrupt indexes, and also how to purposely 
corrupt an index for the purposes of testing.

Note, IndexReader throws IOExceptions, but this could be for a
multitude of 
reasons, not just a corrupt index. I was rather hoping for a 
CorruptIndexException of some sort!

It looks to me that if I do get an IOException, I will then have to
perform a 
number of additional checks to eliminate the other possible causes of 
IOExceptions (such as permissions issues), and by a process of
elimination, 
determine a corrupt index.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: New Site Live Using Lucene

2005-08-08 Thread Kevin L. Cobb
Open Source C/C++ only? When are you going to include Open Source Java?
We demand fair treatment ;)


 

-Original Message-
From: Robert Schultz [mailto:[EMAIL PROTECTED] 
Sent: Sunday, August 07, 2005 6:18 PM
To: java-user@lucene.apache.org
Subject: New Site Live Using Lucene

Not sure if this is appropriate or not, but I just put live a web site
that I have been working on for over a year, and it uses Lucene for all
its searching.

I have 46 million documents in 15 Lucene indexes, although the vast
majority of those consist of only a few words.
The Lucene indexes take up about 6GB of space.

I wrote a Java daemon to listen on a socket, and accept connections from
my PHP scripts in order to do the searching.

The results from Lucene include ID numbers that are linked up with MySQL
records thus forming the resulting web page.

You can see the site here: http://csourcesearch.net

It's a website that allows you to search over 99 million lines of open
source C/C++ code :)

Anyways, just wanted to say thanks a lot for such a great product (even
if it is java *snicker*)

Thanks again Lucene! :)


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Serialized Java Objects

2005-08-25 Thread Kevin L. Cobb
I just had a thought this morning. Does Lucene have the ability to store
serialized Java objects for return during a search? I was thinking that
this would be a nifty way to package up all of the return values for a
search. Of course, I wouldn't expect the serialized objects to be
searchable. 
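
Something along these lines is what I'm picturing (just a sketch; binary
stored fields only exist in later Lucene releases, so on older versions the
bytes would need to be Base64-encoded into a stored, un-indexed string field;
resultBean and its getters are placeholders, and it must implement
Serializable):

  // index time: serialize the bean and keep the bytes in a stored-only field
  ByteArrayOutputStream bos = new ByteArrayOutputStream();
  ObjectOutputStream oos = new ObjectOutputStream(bos);
  oos.writeObject(resultBean);
  oos.close();

  Document doc = new Document();
  doc.add(new Field("payload", bos.toByteArray(), Field.Store.YES));   // stored, never indexed
  doc.add(new Field("title", resultBean.getTitle(), Field.Store.YES, Field.Index.TOKENIZED));
  writer.addDocument(doc);

  // search time: pull the bytes back out of the hit and deserialize
  byte[] bytes = hitDoc.getBinaryValue("payload");   // getBinaryValue() is in later releases
  Object restored = new ObjectInputStream(new ByteArrayInputStream(bytes)).readObject();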
 
Thanks,
 
-Kevin