delete/reset the index

2008-09-04 Thread simon litwan

hi all

i would like to delete the index to allow reindexing from
scratch.

is there a way to delete all entries in an index?

any hint is much appreciated.

simon




Re: delete/reset the index

2008-09-04 Thread 叶双明
Delete the index Directory in the File System; I think this is the simplest!!!

2008/9/4 simon litwan <[EMAIL PROTECTED]>

> hi all
>
> i would like to delete the index to allow reindexing from
> scratch.
> is there a way to delete all entries in an index?
>
> any hint is much appreciated.
>
> simon
>


Re: Similarity percentage between two Strings

2008-09-04 Thread Ian Lea
Googling for "java string similarity" throws up some stuff you might
find useful.


--
Ian.


On Wed, Sep 3, 2008 at 11:58 PM, Thiago Moreira <[EMAIL PROTECTED]> wrote:
>
> Well, the similar definition that I'm looking for is the number 2, maybe
> the number 3, but to start the number 2 is enough. If you guys think that is
> not a Lucene problem, what other tool can I use to implement this
> requirement??
>
> Thanks
> 
> Thiago Moreira
> Software Engineer
> [EMAIL PROTECTED]
> Liferay, Inc.
> Enterprise. Open Source. For Life.
>
>
> N. Hira wrote:
>
> I don't know how much of this is a Lucene problem, but -- as I'm sure you
> will inevitably hear from others on the list -- it depends on what your
> definition of "similar" is.
>
> By similar, do you mean:
> 1.  Identical, except for variations in case (upper/lower)
> 2.  Allow 1., but also allow prefixes/suffixes (e.g., "FW:  " or "...
> (summary")
> 3.  Allow 1., 2. and permit some new terms ... how many?
> 4.  Allow all of the above and allow some changes to terms using stemming
> (E.g., "Google releases Chrome" is similar to "Google announces the release
> of its new Chrome web browser")
> 
>
> I'm sure you see where this is going.  So ... how do you define similar?
>
> Good luck!
>
> -h
> --
> Hira, N.R.
> Cognocys, Inc.
>
> On 03-Sep-2008, at 2:52 PM, Thiago Moreira wrote:
>
>
> Hey all,
>
> I want to know how much two Strings are similar! The thing is: I'm
> processing an email box and I want to group all messages that have
> similar subjects, makes sense?? I looked at the documentation but I didn't
> find how to accomplish this. It's not necessary to add the messages or the
> subjects to some kind of index. I'm using the 2.3.2 version of Lucene.
>
> Anyone has some idea?
>
> Thanks in advance.
> --
> Thiago Moreira
> Software Engineer
> [EMAIL PROTECTED]
> Liferay, Inc.
> Enterprise. Open Source. For Life.




Re: Pre-filtering for expensive query

2008-09-04 Thread Andrzej Bialecki

Grant Ingersoll wrote:

> On Aug 30, 2008, at 3:14 PM, Andrzej Bialecki wrote:
>
>> I think you can use a FilteredQuery in a BooleanClause. This may be
>> faster than the filtering code in the Searcher, because the evaluation
>> is done during scoring and not afterwards. FilteredQuery internally makes
>
> FYI, not sure if this is exactly what you are talking about Andrzej, but
> IndexSearcher no longer filters after scoring.  This was changed in
> https://issues.apache.org/jira/browse/LUCENE-584

Ah, indeed - I was working with 2.3.0 release ... then there should be
no visible performance difference if using the trunk version of
IndexSearcher.

The only difference now between the IndexSearcher method and
ConjunctionScorer would be when the supplied filter would match many
documents. IndexSearcher always runs skipTo on the filter first, so
potentially it would stop at many docIds that aren't matching in the
scorer - whereas the ConjunctionScorer tries to order sub-scorers so
that "sparse" scorers are checked first.


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



QueryParser vs. BooleanQuery

2008-09-04 Thread bogdan71

  Hello,

  I am experiencing a strange behaviour when trying to query the same thing
via
BooleanQuery vs. via the know-it-all QueryParser class. Precisely, the index
contains
the document:
   "12,Visual C++,4.2" with the field layout: ID,name,version(thus, "12" is
the ID field, "Visual C++"
is the name field and "4.2" is the version field). 
  The search string is "Visual C++" for the name field.

  The following test, using QueryParser, goes fine:

public final void testUsingQueryParser()
{
    IndexSearcher recordSearcher;
    Query q;
    QueryParser parser = new QueryParser("name", new StandardAnalyzer());
    try
    {
        q = parser.parse("name:visual +name:c++");

        Directory directory = FSDirectory.getDirectory();
        recordSearcher = new IndexSearcher(directory);

        Hits h = recordSearcher.search(q);

        assertEquals(1, h.length());
        assertEquals(12, Integer.parseInt(h.doc(0).get("ID")));
    }
    catch(Exception exn)
    {
        fail("Exception occurred.");
    }
}

  But this one, using a BooleanQuery, fails.

public final void testUsingTermQuery()
{
    IndexSearcher recordSearcher;
    BooleanQuery bq = new BooleanQuery();

    bq.add(new TermQuery(new Term("name", "visual")),
           BooleanClause.Occur.SHOULD);
    bq.add(new TermQuery(new Term("name", "c++")),
           BooleanClause.Occur.MUST);

    try
    {
        Directory directory = FSDirectory.getDirectory();
        recordSearcher = new IndexSearcher(directory);

        Hits h = recordSearcher.search(bq);

        assertEquals(1, h.length());   // fails, saying it expects 0 !!!
        assertEquals(12, Integer.parseInt(h.doc(0).get("ID")));
    }
    catch(Exception exn)
    {
        fail("Exception occurred.");
    }
}

   Rewriting the BooleanQuery and taking toString() yields the same String
given to QueryParser.parse() in the first test. I am using Lucene 2.3.0. Can
somebody explain the difference?
-- 
View this message in context: 
http://www.nabble.com/QueryParser-vs.-BooleanQuery-tp19306087p19306087.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: How can we know if 2 lucene indexes are same?

2008-09-04 Thread 叶双明
No documents can be added to the index while the index is optimizing, or
optimizing can't run while documents are being added to the index.
So, without other errors, I think we can believe the two indexes are indeed
the same.

:)

2008/9/4 Noble Paul നോബിള്‍ नोब्ळ् <[EMAIL PROTECTED]>

> The use case is as follows
>
> I have two indexes. One at the master and one at the slave. The user
> occasionally keeps committing on the master and the delta is
> replicated every time. But when the optimize happens the transfer size
> can be really large. So I am thinking of doing the optimize
> separately on master and slave.
>
> So far, so good. But how can I really know that after the optimize the
> indexes are indeed the same, or whether documents got added in between?
>
>
>
> On Fri, Aug 29, 2008 at 3:13 PM, Karl Wettin <[EMAIL PROTECTED]>
> wrote:
> >
> > 29 aug 2008 kl. 11.35 skrev Noble Paul നോബിള്‍ नोब्ळ्:
> >
> >> hi,
> >> I wish to know if the contents of two indexes have same data.
> >> will all the files be exactly same if I put same set of documents to
> both?
> >
> > If you insert the documents in the same order with the same settings and
> > both indices are optimized, then the files ought to be identical. I'm
> > however not sure.
> >
> > The instantiated index contrib module contains a test that asserts two
> > index readers are identical. You could use this to be really sure, but
> > it is a rather long running process for a large index:
> >
> >
> http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestIndicesEquals.java
> >
> >
> > Perhaps you should explain why you need to do this.
> >
> >
> >  karl
>
>
>
> --
> --Noble Paul
>


Re: QueryParser vs. BooleanQuery

2008-09-04 Thread Ian Lea
Have a look at the index with Luke to see what has actually been
indexed. StandardAnalyzer may well be removing the pluses, or you may
need to escape them.  And watch out for case - Visual != visual in
term query land.


--
Ian.
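
A quick way to check is to dump what the analyzer produces; a minimal sketch
against the 2.3-era analysis API (the class name is just for illustration):

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class ShowTokens {
    public static void main(String[] args) throws Exception {
        TokenStream ts = new StandardAnalyzer()
                .tokenStream("name", new StringReader("Visual C++"));
        // Likely prints "visual" then "c": the pluses are stripped and the
        // text is lowercased, which is why TermQuery("name", "c++") misses.
        for (Token t = ts.next(); t != null; t = ts.next()) {
            System.out.println(t.termText());
        }
    }
}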


On Thu, Sep 4, 2008 at 9:46 AM, bogdan71 <[EMAIL PROTECTED]> wrote:
>
>  Hello,
>
>  I am experiencing a strange behaviour when trying to query the same thing
> via
> BooleanQuery vs. via the know-it-all QueryParser class. Precisely, the index
> contains
> the document:
>   "12,Visual C++,4.2" with the field layout: ID,name,version(thus, "12" is
> the ID field, "Visual C++"
> is the name field and "4.2" is the version field).
>  The search string is "Visual C++" for the name field.
>
>  The following test, using QueryParser, goes fine:
>
> public final void testUsingQueryParser()
>{
>IndexSearcher recordSearcher;
>Query q;
>QueryParser parser = new QueryParser("name", new 
> StandardAnalyzer());
>try
>{
>   q = parser.parse("name:visual +name:c++");
>
>Directory directory =
> FSDirectory.getDirectory();
>recordSearcher = new IndexSearcher(directory);
>
>Hits h = recordSearcher.search(q);
>
>assertEquals(1, h.length());
>assertEquals(12, Integer.parseInt(h.doc(0).get("ID")));
>}
>catch(Exception exn)
>{
>fail("Exception occurred.");
>}
>}
>
>  But this one, using a BooleanQuery, fails.
>
> public final void testUsingTermQuery()
>{
>IndexSearcher recordSearcher;
>BooleanQuery bq = new BooleanQuery();
>
>bq.add(new TermQuery(new Term("name", "visual")),
> BooleanClause.Occur.SHOULD);
>bq.add(new TermQuery(new Term("name", "c++")), 
> BooleanClause.Occur.MUST);
>
>try
>{
>Directory directory =
> FSDirectory.getDirectory();
>recordSearcher = new IndexSearcher(directory);
>
>Hits h = recordSearcher.search(bq);
>
>assertEquals(1, h.length());   // fails, saying it 
> expects 0 !!!
>assertEquals(12, Integer.parseInt(h.doc(0).get("ID")));
>}
>catch(Exception exn)
>{
>fail("Eexception occurred.");
>}
>}
>
>   Rewriting the BooleanQuery and taking toString() yields the same String
> given to QueryParser.parse() in the first test. I am using Lucene 2.3.0. Can
> somebody explain the difference ?
> --
> View this message in context: 
> http://www.nabble.com/QueryParser-vs.-BooleanQuery-tp19306087p19306087.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>




Re: delete/reset the index

2008-09-04 Thread Michael McCandless


If you're on Windows, the safest way to do this in general, if there  
is any possibility that readers are still using the index, is to  
create a new IndexWriter with create=true.  Windows does not let you  
remove open files.  IndexWriter will gracefully handle failed deletes  
by retrying them over time...


Mike
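
A minimal sketch of that suggestion (2.3-era API; the index path is just a
placeholder):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class ResetIndex {
    public static void main(String[] args) throws Exception {
        // create=true truncates the existing index in place; readers that
        // are still open keep searching the old snapshot until reopened.
        IndexWriter writer = new IndexWriter("/path/to/index",
                new StandardAnalyzer(), true);
        writer.close();
    }
}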

simon litwan wrote:


hi all

i would like to delete the index to allow reindexing
from scratch.

is there a way to delete all entries in an index?

any hint is much appreciated.

simon








Re: How can we know if 2 lucene indexes are same?

2008-09-04 Thread Michael McCandless


Actually, as of 2.3, this is no longer true: merges and optimizing run  
in the background, and allow add/update/delete documents to run at the  
same time.


I think it's probably best to use application logic (outside of  
Lucene) to keep track of what updates happened to the master while the  
slave was optimizing.


Mike

叶双明 wrote:

> No documents can be added to the index while the index is optimizing, or
> optimizing can't run while documents are being added to the index.
> So, without other errors, I think we can believe the two indexes are
> indeed the same.
>
> :)
>
> 2008/9/4 Noble Paul നോബിള്‍ नोब्ळ् <[EMAIL PROTECTED]>
>
>> The use case is as follows
>>
>> I have two indexes. One at the master and one at the slave. The user
>> occasionally keeps committing on the master and the delta is
>> replicated every time. But when the optimize happens the transfer size
>> can be really large. So I am thinking of doing the optimize
>> separately on master and slave.
>>
>> So far, so good. But how can I really know that after the optimize
>> the indexes are indeed the same, or whether documents got added in
>> between?
>>
>> On Fri, Aug 29, 2008 at 3:13 PM, Karl Wettin <[EMAIL PROTECTED]>
>> wrote:
>>
>>> 29 aug 2008 kl. 11.35 skrev Noble Paul നോബിള്‍ नोब्ळ्:
>>>
>>>> hi,
>>>> I wish to know if the contents of two indexes have same data.
>>>> will all the files be exactly same if I put same set of documents
>>>> to both?
>>>
>>> If you insert the documents in the same order with the same settings
>>> and both indices are optimized, then the files ought to be identical.
>>> I'm however not sure.
>>>
>>> The instantiated index contrib module contains a test that asserts
>>> two index readers are identical. You could use this to be really
>>> sure, but it is a rather long running process for a large index:
>>>
>>> http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestIndicesEquals.java
>>>
>>> Perhaps you should explain why you need to do this.
>>>
>>>  karl
>>
>> --
>> --Noble Paul







Re: getTimestamp method in IndexCommit

2008-09-04 Thread Michael McCandless


Noble Paul നോബിള്‍ नोब्ळ् wrote:

> On Wed, Sep 3, 2008 at 2:06 PM, Michael McCandless
> <[EMAIL PROTECTED]> wrote:
>
>> Noble Paul നോബിള്‍ नोब्ळ् wrote:
>>
>>> On Tue, Sep 2, 2008 at 1:56 PM, Michael McCandless
>>> <[EMAIL PROTECTED]> wrote:
>>>
>>>> Are you thinking this would just fall back to Directory.fileModified
>>>> on the segments_N file for that commit?
>>>>
>>>> You could actually do that without any API change, because
>>>> IndexCommit exposes a getSegmentsFileName().
>>>
>>> If it is a RAMDirectory how can we get the lastModified?
>>
>> RAMDirectory will report the System.currentTimeMillis() when the file
>> was last changed.  Is that not sufficient?
>>
>>> Isn't it a lot of overhead to read the file modified time every time
>>> the timestamp is to be obtained?
>>
>> I would think this method does not need to be super fast -- how
>> frequently are you planning to call it?
>
> Only during an onCommit() or an onInit(). So if the commit point is
> passed over multiple times it would call this as many times. Not a big
> deal in terms of performance. But it is still some 3-4 lines of code
> which could very well be added to the API and exposed as a method
> getTimestamp().

OK I'll commit this -- it's trivial.  It's simply convenience for
calling Directory.fileModified.

>> Note that the segments_N file has no other means of extracting a
>> timestamp for itself; it does not store a timestamp internally or
>> anything.
>>
>> Mike
>
> --
> --Noble Paul
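
What the convenience boils down to, as a sketch (using the IndexCommit and
Directory names discussed above; this is not the committed code itself):

import java.io.IOException;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.store.Directory;

public class CommitTimestamp {
    public static long get(Directory dir, IndexCommit commit)
            throws IOException {
        // segments_N stores no timestamp internally, so fall back to the
        // file modification time reported by the Directory.
        return dir.fileModified(commit.getSegmentsFileName());
    }
}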






Re: getTimestamp method in IndexCommit

2008-09-04 Thread Noble Paul നോബിള്‍ नोब्ळ्
YOU ARE FAST
thanks.

--Noble

On Thu, Sep 4, 2008 at 2:54 PM, Michael McCandless
<[EMAIL PROTECTED]> wrote:
>
> Noble Paul നോബിള്‍ नोब्ळ् wrote:
>
>> On Wed, Sep 3, 2008 at 2:06 PM, Michael McCandless
>> <[EMAIL PROTECTED]> wrote:
>>>
>>> Noble Paul നോബിള്‍ नोब्ळ् wrote:
>>>
>>>> On Tue, Sep 2, 2008 at 1:56 PM, Michael McCandless
>>>> <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>> Are you thinking this would just fall back to Directory.fileModified on
>>>>> the segments_N file for that commit?
>>>>>
>>>>> You could actually do that without any API change, because IndexCommit
>>>>> exposes a getSegmentsFileName().
>>>>
>>>> If it is a RAMDirectory how can we get the lastModified?
>>>
>>> RAMDirectory will report the System.currentTimeMillis() when the file was
>>> last changed.  Is that not sufficient?
>>>
>>>> Isn't it a lot of overhead to read the file modified time every time
>>>> the timestamp is to be obtained?
>>>
>>> I would think this method does not need to be super fast -- how
>>> frequently
>>> are you planning to call it?
>>
>> Only during an onCommit() or an onInit(). So if the commit point is
>> passed over multiple times it would call this as many times. Not a big
>> deal in terms of performance. But it is still some 3-4 lines of code
>> which could very well be added to the API and exposed as a method
>> getTimestamp()
>
> OK I'll commit this -- it's trivial.  It's simply convenience for calling
> Directory.fileModified.
>
>>>
>>> Note that the segments_N file has no other means of extracting a
>>> timestamp
>>> for itself; it does not store a timestamp internally or anything.
>>>
>>> Mike
>>
>>
>>
>> --
>> --Noble Paul
>
>



-- 
--Noble Paul


Re: getTimestamp method in IndexCommit

2008-09-04 Thread Michael McCandless


Thanks for raising it!

It's through requests like this that Lucene's API improves.

Mike

Noble Paul നോബിള്‍ नोब्ळ् wrote:

> YOU ARE FAST
> thanks.
>
> --Noble
>
> On Thu, Sep 4, 2008 at 2:54 PM, Michael McCandless
> <[EMAIL PROTECTED]> wrote:
>
>> Noble Paul നോബിള്‍ नोब्ळ् wrote:
>>
>>> On Wed, Sep 3, 2008 at 2:06 PM, Michael McCandless
>>> <[EMAIL PROTECTED]> wrote:
>>>
>>>> Noble Paul നോബിള്‍ नोब्ळ् wrote:
>>>>
>>>>> On Tue, Sep 2, 2008 at 1:56 PM, Michael McCandless
>>>>> <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> Are you thinking this would just fall back to Directory.fileModified
>>>>>> on the segments_N file for that commit?
>>>>>>
>>>>>> You could actually do that without any API change, because
>>>>>> IndexCommit exposes a getSegmentsFileName().
>>>>>
>>>>> If it is a RAMDirectory how can we get the lastModified?
>>>>
>>>> RAMDirectory will report the System.currentTimeMillis() when the file
>>>> was last changed.  Is that not sufficient?
>>>>
>>>>> Isn't it a lot of overhead to read the file modified time every time
>>>>> the timestamp is to be obtained?
>>>>
>>>> I would think this method does not need to be super fast -- how
>>>> frequently are you planning to call it?
>>>
>>> Only during an onCommit() or an onInit(). So if the commit point is
>>> passed over multiple times it would call this as many times. Not a big
>>> deal in terms of performance. But it is still some 3-4 lines of code
>>> which could very well be added to the API and exposed as a method
>>> getTimestamp().
>>
>> OK I'll commit this -- it's trivial.  It's simply convenience for
>> calling Directory.fileModified.
>>
>>>> Note that the segments_N file has no other means of extracting a
>>>> timestamp for itself; it does not store a timestamp internally or
>>>> anything.
>>>>
>>>> Mike
>
> --
> --Noble Paul






Re: delete/reset the index

2008-09-04 Thread 叶双明
Agree with Michael McCandless!!  That way, it is handled gracefully.

2008/9/4 Michael McCandless <[EMAIL PROTECTED]>

>
> If you're on Windows, the safest way to do this in general, if there is any
> possibility that readers are still using the index, is to create a new
> IndexWriter with create=true.  Windows does not let you remove open files.
>  IndexWriter will gracefully handle failed deletes by retrying them over
> time...
>
> Mike
>
>
> simon litwan wrote:
>
>  hi all
>>
>> i would like to delete the index to allow reindexing from
>> scratch.
>> is there a way to delete all entries in an index?
>>
>> any hint is much appreciated.
>>
>> simon
>>


string similarity measures

2008-09-04 Thread Cam Bazz
Hello,
This came up before but - if we were to make a swear word filter, string
edit distances are no good. for example words like `shot` get confused with
`shit`. there is also a problem with words like hitchcock. apparently i need
something like soundex or double metaphone. the thing is - these are
language specific, and i am not operating in english.

I need a fuzzy like curse word filter for turkish, simply.

Best regards,
-C.B.


Re: Realtime Search for Social Networks Collaboration

2008-09-04 Thread Cam Bazz
Hello Jason,
I have been trying to do this for a long time on my own. keep up the good
work.

What I tried was a document cache using apache collections. and before an
index write/delete i would sync the cache with the index.

I am waiting for lucene 2.4 to proceed. (delete by query)

Best.

On Wed, Sep 3, 2008 at 10:20 PM, Jason Rutherglen <
[EMAIL PROTECTED]> wrote:

> Hello all,
>
> I don't mean this to sound like a solicitation.  I've been working on
> realtime search and created some Lucene patches etc.  I am wondering
> if there are social networks (or anyone else) out there who would be
> interested in collaborating with Apache on realtime search to get it
> to the point it can be used in production.  It is a challenging
> problem that only Google has solved and made to scale.  I've been
> working on the problem for a while and though a lot has been
> completed, there is still a lot more to do and collaboration amongst
> the most probable users (social networks) seems like a good thing to
> try to do at this point.  I guess I'm saying it seems like a hard
> enough problem that perhaps it's best to work together on it rather
> than each company try to complete their own.  However I could be
> wrong.
>
> Realtime search benefits social networks by providing a scalable
> searchable alternative to large Mysql implementations.  Mysql I have
> heard is difficult to scale at a certain point.  Apparently Google has
> created things like BigTable (a large database) and an online service
> called GData (which Google has not published any whitepapers on the
> technology underneath) to address scaling large database systems.
> BigTable does not offer search.   GData does and is used by all of
> Google's web services instead of something like Mysql (this is at
> least how I understand it).  Social networks usually grow and so
> scaling is continually an issue.  It is possible to build a realtime
> search system that scales linearly, something that I have heard
> becomes difficult with Mysql.  There is an article that discusses some
> of these issues
> http://acmqueue.com/modules.php?name=Content&pa=showpage&pid=337  I
> don't think the current GData implementation is perfect and there is a
> lot that can be improved on.  It might be helpful to figure out
> together what helpful things can be added.
>
> If this sounds like something of interest to anyone feel free to send
> your input.
>
> Take care,
> Jason
>


Re: string similarity measures

2008-09-04 Thread Karl Wettin


4 sep 2008 kl. 14.38 skrev Cam Bazz:


Hello,
This came up before but - if we were to make a swear word filter, string
edit distances are no good. for example words like `shot` get confused with
`shit`. there is also a problem with words like hitchcock. apparently i need
something like soundex or double metaphone. the thing is - these are
language specific, and i am not operating in english.

I need a fuzzy like curse word filter for turkish, simply.


You probably need to make a large list of words. I would try to learn
from the users that do swear, perhaps even trust my users to report
each other. I would probably also look at storing in what context the
word is used, perhaps by adding the surrounding words (ngrams,
shingles, markov chains). Compare "go to hell" and "when hell freezes
over". The first is rather derogatory while the second doesn't have to
be bad at all.


I'm thinking Hidden Markov Models and Neural Networks.


  karl




Re: Similarity percentage between two Strings

2008-09-04 Thread Karl Wettin
I would create 1-5 ngram sized shingles and measure the distance using  
Tanimoto coefficient. That would probably work out just fine. You  
might want to add more weight the greater the size of the shingle.


There are shingle filters in lucene/java/contrib/analyzers and there  
is a Tanimoto distance in lucene/mahout/.


Feel free to report back on how well it works.


 karl
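
A rough, plain-Java stand-in for that recipe (the contrib ShingleFilter and
Mahout's Tanimoto distance mentioned above would be the real tools; this
class is only an illustration and leaves out the per-size weighting):

import java.util.HashSet;
import java.util.Set;

public class SubjectSimilarity {
    // Word shingles of size 1..5, lowercased.
    static Set<String> shingles(String text) {
        String[] words = text.toLowerCase().split("\\s+");
        Set<String> out = new HashSet<String>();
        for (int n = 1; n <= 5; n++) {
            for (int i = 0; i + n <= words.length; i++) {
                StringBuilder sb = new StringBuilder(words[i]);
                for (int j = i + 1; j < i + n; j++) {
                    sb.append(' ').append(words[j]);
                }
                out.add(sb.toString());
            }
        }
        return out;
    }

    // Tanimoto coefficient: intersection size / union size, in [0, 1].
    public static double similarity(String a, String b) {
        Set<String> sa = shingles(a);
        Set<String> sb = shingles(b);
        Set<String> common = new HashSet<String>(sa);
        common.retainAll(sb);
        int union = sa.size() + sb.size() - common.size();
        return union == 0 ? 0.0 : (double) common.size() / union;
    }

    public static void main(String[] args) {
        System.out.println(similarity("Google releases Chrome",
                                      "FW: Google releases Chrome"));
    }
}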

4 sep 2008 kl. 00.58 skrev Thiago Moreira:



Well, the similar definition that I'm looking for is the number  
2, maybe the number 3, but to start the number 2 is enough. If you  
guys think that is not a Lucene problem what else tool can I use to  
implement this requirement??


Thanks
Thiago Moreira
Software Engineer
[EMAIL PROTECTED]
Liferay, Inc.
Enterprise. Open Source. For Life.


N. Hira wrote:


I don't know how much of this is a Lucene problem, but -- as I'm  
sure you will inevitably hear from others on the list -- it depends  
on what your definition of "similar" is.


By similar, do you mean:
1.  Identical, except for variations in case (upper/lower)
2.  Allow 1., but also allow prefixes/suffixes (e.g., "FW:  " or
"... (summary)")

3.  Allow 1., 2. and permit some new terms ... how many?
4.  Allow all of the above and allow some changes to terms using  
stemming (E.g., "Google releases Chrome" is similar to "Google  
announces the release of its new Chrome web browser")



I'm sure you see where this is going.  So ... how do you define  
similar?


Good luck!

-h
--
Hira, N.R.
Cognocys, Inc.

On 03-Sep-2008, at 2:52 PM, Thiago Moreira wrote:



Hey all,

I want to know how much two Strings are similar! The thing is:
I'm processing an email box and I want to group all messages that
have similar subjects, makes sense?? I looked at the documentation
but I didn't find how to accomplish this. It's not necessary to add
the messages or the subjects to some kind of index. I'm using the
2.3.2 version of Lucene.


Anyone has some idea?

Thanks in advance.
--
Thiago Moreira
Software Engineer
[EMAIL PROTECTED]
Liferay, Inc.
Enterprise. Open Source. For Life.



Re: How can we know if 2 lucene indexes are same?

2008-09-04 Thread 叶双明
I don't agree with Michael McCandless. :)

I know that after 2.3, add and delete can run in one IndexWriter at one
time, and also lucene has an update method which deletes documents by term
and then adds the new document.

In my test, I get either a LockObtainFailedException with the thread sleep statement:

org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out:
[EMAIL PROTECTED]:\index\write.lock
 at org.apache.lucene.store.Lock.obtain(Lock.java:85)
 at
org.apache.lucene.index.DirectoryIndexReader.acquireWriteLock(DirectoryIndexReader.java:298)
 at org.apache.lucene.index.IndexReader.deleteDocument(IndexReader.java:750)
 at
org.apache.lucene.index.IndexReader.deleteDocuments(IndexReader.java:786)
 at org.test.IndexThread.run(IndexThread.java:33)

or a StaleReaderException without the thread sleep statement:

org.apache.lucene.index.StaleReaderException: IndexReader out of date and no
longer valid for delete, undelete, or setNorm operations
 at
org.apache.lucene.index.DirectoryIndexReader.acquireWriteLock(DirectoryIndexReader.java:308)
 at org.apache.lucene.index.IndexReader.deleteDocument(IndexReader.java:750)
 at
org.apache.lucene.index.IndexReader.deleteDocuments(IndexReader.java:786)
 at org.test.IndexThread.run(IndexThread.java:31)

My test code:


public class Main {

    public static void main(String[] args) throws IOException {
        Directory directory = FSDirectory.getDirectory("e:/index");
        IndexWriter writer = new IndexWriter(directory, null, false);
        Document document = new Document();
        document.add(new Field("bbb", "bbb", Store.YES, Index.UN_TOKENIZED));
        writer.addDocument(document);

        Thread t = new IndexThread();
        t.start();

        try {
            Thread.sleep(1000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

        writer.optimize();
        writer.close();
        System.out.println("out");
    }
}

public class IndexThread extends Thread {

    @Override
    public void run() {
        Directory directory;
        try {
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }

            directory = FSDirectory.getDirectory("e:/index");
            System.out.println("thread begin");
            //IndexWriter reader = new IndexWriter(directory, null, false);
            IndexReader reader = IndexReader.open(directory);
            Term term = new Term("bbb", "bbb");
            reader.deleteDocuments(term);
            reader.close();
            System.out.println("thread end");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}



2008/9/4, Michael McCandless <[EMAIL PROTECTED]>:
>
>
> Actually, as of 2.3, this is no longer true: merges and optimizing run in
> the background, and allow add/update/delete documents to run at the same
> time.
>
> I think it's probably best to use application logic (outside of Lucene) to
> keep track of what updates happened to the master while the slave was
> optimizing.
>
> Mike
>
> 叶双明 wrote:
>
>> No documents can be added to the index while the index is optimizing, or
>> optimizing can't run while documents are being added to the index.
>> So, without other errors, I think we can believe the two indexes are
>> indeed the same.
>>
>> :)
>>
>> 2008/9/4 Noble Paul നോബിള്‍ नोब्ळ् <[EMAIL PROTECTED]>
>>
>> The use case is as follows
>>>
>>> I have two indexes . One at the master and one at the slave. The user
>>> occasionally keeps committing on the master and the delta is
>>> replicated everytime. But when the optimize happens the transfer size
>>> can be really large. So I am thinking of  doing the optimize
>>> separately on master and slave .
>>>
>>> So far, so good. But how can I really know that after the optimize the
>>> indexes are indeed the same or no documents got added in between.?
>>>
>>>
>>>
>>> On Fri, Aug 29, 2008 at 3:13 PM, Karl Wettin <[EMAIL PROTECTED]>
>>> wrote:
>>>

 29 aug 2008 kl. 11.35 skrev Noble Paul നോബിള്‍ नोब्ळ्:

 hi,
> I wish to know if the contents of two indexes have same data.
> will all the files be exactly same if I put same set of documents to
>
 both?
>>>

 If you insert the documents in the same order with the same settings and
 both indices are optimized, then the files ought to be identical. I'm
 however not sure.

 The instantiated index contrib module contains a test that asserts two
 index readers are identical. You could use this to be really sure, but
 it is a rather long running process for a large index:



>>> http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestIndicesEquals.java
>>>


 Perhaps you should explain why you need to do this.


karl



>>>
>>>
>>> --
>>> --Noble Paul
>>>
>

Re: string similarity measures

2008-09-04 Thread Cam Bazz
yes, I already have a system for users reporting words. they fall on an
operator screen and if the operator approves, or if 3 other people mark it
as a curse, then it is filtered.
in the other thread you wrote:

> I would create 1-5 ngram sized shingles and measure the distance using
> Tanimoto coefficient. That would probably work out just fine. You might
> want to add more weight the greater the size of the shingle.
>
> There are shingle filters in lucene/java/contrib/analyzers and there is a
> Tanimoto distance in lucene/mahout/.

would that apply to my case? tanimoto coefficient over shingles?

Best,


On Thu, Sep 4, 2008 at 4:12 PM, Karl Wettin <[EMAIL PROTECTED]> wrote:

>
> 4 sep 2008 kl. 14.38 skrev Cam Bazz:
>
>
>  Hello,
>> This came up before but - if we were to make a swear word filter, string
>> edit distances are no good. for example words like `shot` is confused with
>> `shit`. there is also problem with words like hitchcock. appearently i
>> need
>> something like soundex or double metaphone. the thing is - these are
>> language specific, and i am not operating in english.
>>
>> I need a fuzzy like curse word filter for turkish, simply.
>>
>
> You probably need to make a large list of words. I would try to learn from
> the users that do swear, perhaps even trust my users to report each other. I
> would probably also look at storing in what context the word is used,
> perhaps by adding the surrounding words (ngrams, shingles, markov chains).
> Compare "go to hell" and "when hell frezes over". The first is rather
> derogatory while the second doen't have to be bad at all.
>
> I'm thinking Hidden Markov Models and Neural Networks.
>
>
>  karl
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


Re: string similarity measures

2008-09-04 Thread Karl Wettin


4 sep 2008 kl. 15.54 skrev Cam Bazz:

> yes, I already have a system for users reporting words. they fall on an
> operator screen and if the operator approves, or if 3 other people mark
> it as a curse, then it is filtered.
> in the other thread you wrote:
>
>> I would create 1-5 ngram sized shingles and measure the distance using
>> Tanimoto coefficient. That would probably work out just fine. You might
>> want to add more weight the greater the size of the shingle.
>>
>> There are shingle filters in lucene/java/contrib/analyzers and there is a
>> Tanimoto distance in lucene/mahout/.
>
> would that apply to my case? tanimoto coefficient over shingles?

Not really, no.


 karl


> Best,
>
> On Thu, Sep 4, 2008 at 4:12 PM, Karl Wettin <[EMAIL PROTECTED]> wrote:
>
>> 4 sep 2008 kl. 14.38 skrev Cam Bazz:
>>
>>> Hello,
>>> This came up before but - if we were to make a swear word filter, string
>>> edit distances are no good. for example words like `shot` get confused
>>> with `shit`. there is also a problem with words like hitchcock.
>>> apparently i need something like soundex or double metaphone. the thing
>>> is - these are language specific, and i am not operating in english.
>>>
>>> I need a fuzzy like curse word filter for turkish, simply.
>>
>> You probably need to make a large list of words. I would try to learn
>> from the users that do swear, perhaps even trust my users to report each
>> other. I would probably also look at storing in what context the word is
>> used, perhaps by adding the surrounding words (ngrams, shingles, markov
>> chains). Compare "go to hell" and "when hell freezes over". The first is
>> rather derogatory while the second doesn't have to be bad at all.
>>
>> I'm thinking Hidden Markov Models and Neural Networks.
>>
>>  karl



Re: How can we know if 2 lucene indexes are same?

2008-09-04 Thread Michael McCandless


Sorry, I should have said: you must always use the same writer, ie as  
of 2.3, while IndexWriter.optimize (or normal segment merging) is  
running, under one thread, another thread can use that *same* writer  
to add/delete/update documents, and both are free to make changes to  
the index.


Before 2.3, optimize() was fully synchronized and blocked
add/update/delete documents from changing the index until the
optimize() call completed.


So, your test is expected to fail: you're not allowed to open 2  
"writers" on a single index at the same time, where "writer" includes  
an IndexReader that deletes documents; so those exceptions  
(LockObtainFailed, StaleReader) are expected.


Mike
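
A sketch of the single-writer rule (2.3-era API), reworking the test above so
the delete goes through the same IndexWriter rather than a separate
IndexReader (the path comes from the test; the rest of the names are
illustrative):

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class OneWriter {
    public static void main(String[] args) throws Exception {
        Directory directory = FSDirectory.getDirectory("e:/index");
        final IndexWriter writer = new IndexWriter(directory, null, false);

        // The second thread deletes through the *same* writer; no second
        // "writer" (and no deleting IndexReader) ever touches the index.
        Thread deleter = new Thread() {
            public void run() {
                try {
                    writer.deleteDocuments(new Term("bbb", "bbb"));
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        };
        deleter.start();

        writer.optimize();   // may run while the other thread deletes
        deleter.join();
        writer.close();
    }
}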

叶双明 wrote:


I don't agreed with Michael McCandless. :)

I konw that after 2.3, add and delete can run in one IndexWriter at  
one
time, and also lucene has a update method which delete documents by  
term

then add the new document.

In my test, either LockObtainFailedException with thread sleep  
sentence:


org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out:
[EMAIL PROTECTED]:\index\write.lock
 at org.apache.lucene.store.Lock.obtain(Lock.java:85)
 at org.apache.lucene.index.DirectoryIndexReader.acquireWriteLock(DirectoryIndexReader.java:298)
 at org.apache.lucene.index.IndexReader.deleteDocument(IndexReader.java:750)
 at org.apache.lucene.index.IndexReader.deleteDocuments(IndexReader.java:786)
 at org.test.IndexThread.run(IndexThread.java:33)

or StaleReaderException without thread sleep sentence:

org.apache.lucene.index.StaleReaderException: IndexReader out of date and no
longer valid for delete, undelete, or setNorm operations
 at org.apache.lucene.index.DirectoryIndexReader.acquireWriteLock(DirectoryIndexReader.java:308)
 at org.apache.lucene.index.IndexReader.deleteDocument(IndexReader.java:750)
 at org.apache.lucene.index.IndexReader.deleteDocuments(IndexReader.java:786)
 at org.test.IndexThread.run(IndexThread.java:31)

My test code:


public class Main {

    public static void main(String[] args) throws IOException {
        Directory directory = FSDirectory.getDirectory("e:/index");
        IndexWriter writer = new IndexWriter(directory, null, false);
        Document document = new Document();
        document.add(new Field("bbb", "bbb", Store.YES, Index.UN_TOKENIZED));
        writer.addDocument(document);

        Thread t = new IndexThread();
        t.start();

        try {
            Thread.sleep(1000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

        writer.optimize();
        writer.close();
        System.out.println("out");
    }
}

public class IndexThread extends Thread {

    @Override
    public void run() {
        Directory directory;
        try {
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }

            directory = FSDirectory.getDirectory("e:/index");
            System.out.println("thread begin");
            //IndexWriter reader = new IndexWriter(directory, null, false);
            IndexReader reader = IndexReader.open(directory);
            Term term = new Term("bbb", "bbb");
            reader.deleteDocuments(term);
            reader.close();
            System.out.println("thread end");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}



2008/9/4, Michael McCandless <[EMAIL PROTECTED]>:



Actually, as of 2.3, this is no longer true: merges and optimizing  
run in
the background, and allow add/update/delete documents to run at the  
same

time.

I think it's probably best to use application logic (outside of  
Lucene) to

keep track of what updates happened to the master while the slave was
optimizing.

Mike

叶双明 wrote:

No documents can be added to the index while the index is optimizing, or
optimizing can't run while documents are being added to the index.
So, without other errors, I think we can believe the two indexes are
indeed the same.

:)

2008/9/4 Noble Paul നോബിള്‍ नोब्ळ्  
<[EMAIL PROTECTED]>


The use case is as follows


I have two indexes . One at the master and one at the slave. The  
user

occasionally keeps committing on the master and the delta is
replicated everytime. But when the optimize happens the transfer  
size

can be really large. So I am thinking of  doing the optimize
separately on master and slave .

So far, so good. But how can I really know that after the  
optimize the

indexes are indeed the same or no documents got added in between.?



On Fri, Aug 29, 2008 at 3:13 PM, Karl Wettin  
<[EMAIL PROTECTED]>

wrote:



29 aug 2008 kl. 11.35 skrev Noble Paul നോബിള്‍  
नोब्ळ्:


hi,
I wish to know if the contents of two indexes have same data.
will all the files be exactly same if I put same set of documents to
both?




If you insert the documents in the same order with the same settings and
both indices are optimized, then the files ought to be identical. I'm
however not sure.

The instantiated index contrib module contains a test that asserts two
index readers are identical. You could use this to be really sure, but it
is a rather long running process for a large index.

Re: string similarity measures

2008-09-04 Thread Cam Bazz
let me rephrase the problem. I already have a set of bad words. I want to
avoid people inputting typos of the bad words.
for example 'shit' is banned, but someone may enter sh1t.

how can i flag those phonetically similar inputs as matching the marked bad words?

Best.

On Thu, Sep 4, 2008 at 5:02 PM, Karl Wettin <[EMAIL PROTECTED]> wrote:

>
> 4 sep 2008 kl. 15.54 skrev Cam Bazz:
>
>  yes, I already have a system for users reporting words. they fall on an
>> operator screen and if operator approves, or if 3 other people marked it
>> as
>> curse, then it is filtered.
>> in the other thread you wrote:
>>
>>  I would create 1-5 ngram sized shingles and measure the distance using
>>>
>> Tanimoto coefficient. That would probably work out just fine. ?>You might
>> want to add more weight the greater the size of the shingle.
>>
>>>
>>> There are shingle filters in lucene/java/contrib/analyzers and there is a
>>>
>> Tanimoto distance in lucene/mahout/.
>>
>> would that apply to my case? tanimoto coefficient over shingles?
>>
>
> Not really, no.
>
>
> karl
>
>
>
>
>>
>> Best,
>>
>>
>> On Thu, Sep 4, 2008 at 4:12 PM, Karl Wettin <[EMAIL PROTECTED]>
>> wrote:
>>
>>
>>> 4 sep 2008 kl. 14.38 skrev Cam Bazz:
>>>
>>>
>>> Hello,
>>>
 This came up before but - if we were to make a swear word filter, string
 edit distances are no good. for example words like `shot` get confused
 with `shit`. there is also a problem with words like hitchcock.
 apparently i need something like soundex or double metaphone. the thing
 is - these are language specific, and i am not operating in english.

 I need a fuzzy like curse word filter for turkish, simply.


>>> You probably need to make a large list of words. I would try to learn
>>> from the users that do swear, perhaps even trust my users to report
>>> each other. I would probably also look at storing in what context the
>>> word is used, perhaps by adding the surrounding words (ngrams,
>>> shingles, markov chains). Compare "go to hell" and "when hell freezes
>>> over". The first is rather derogatory while the second doesn't have to
>>> be bad at all.
>>>
>>> I'm thinking Hidden Markov Models and Neural Networks.
>>>
>>>
>>>karl
>>>
>>> -
>>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>>> For additional commands, e-mail: [EMAIL PROTECTED]
>>>
>>>
>>>
>


lucene ram buffering

2008-09-04 Thread Cam Bazz
hello,
I was reading the performance optimization guides, then I found:
writer.setRAMBufferSizeMB()

combined with: writer.setMaxBufferedDocs(IndexWriter.DISABLE_AUTO_FLUSH);

this can be used to flush automatically so if the ram buffer size is over a
certain limit it will flush.

now the question: i would like to manage all flushes myself. yet, i still
need a large ram buffer (for search performance)

how can I set the ram buffer size to a large value, yet not use auto flush? I
just want to flush every 32 documents added - and manage that myself.

if the ram buffer size is high, but we have a small number of documents,
does lucene try to write the entire contents of the ram buffer - thus
resulting in a higher flush time?

usually in oodbms systems, you use a larger ram buffer for search, and a
smaller ram buffer for write optimization, the reason being that a smaller
ram buffer can be written to disk faster.

is that the case with lucene?

Best.
-C.B.
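
For reference, a sketch of the configuration being asked about (2.3-era API;
the path, buffer size and document loop are placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class ManualFlush {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/path/to/index",
                new StandardAnalyzer(), true);
        // Disable the doc-count trigger and give the writer a large buffer;
        // with a buffer this big the MB trigger should rarely fire either.
        writer.setMaxBufferedDocs(IndexWriter.DISABLE_AUTO_FLUSH);
        writer.setRAMBufferSizeMB(256.0);

        for (int i = 0; i < 1000; i++) {
            Document doc = new Document();
            doc.add(new Field("id", Integer.toString(i),
                    Field.Store.YES, Field.Index.UN_TOKENIZED));
            writer.addDocument(doc);
            if ((i + 1) % 32 == 0) {
                writer.flush();   // push buffered docs to disk ourselves
            }
        }
        writer.close();
    }
}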


Re: How can we know if 2 lucene indexes are same?

2008-09-04 Thread 叶双明
I see now, thanks Michael McCandless, good explanation!!

2008/9/4, Michael McCandless <[EMAIL PROTECTED]>:
>
>
> Sorry, I should have said: you must always use the same writer, ie as of
> 2.3, while IndexWriter.optimize (or normal segment merging) is running,
> under one thread, another thread can use that *same* writer to
> add/delete/update documents, and both are free to make changes to the index.
>
> Before 2.3, optimize() was fully synchronized and blocked add/update/delete
> documents from changing the index until the optimize() call completed.
>
> So, your test is expected to fail: you're not allowed to open 2 "writers"
> on a single index at the same time, where "writer" includes an IndexReader
> that deletes documents; so those exceptions (LockObtainFailed, StaleReader)
> are expected.
>
> Mike
>
> 叶双明 wrote:
>
> I don't agreed with Michael McCandless. :)
>>
>> I konw that after 2.3, add and delete can run in one IndexWriter at one
>> time, and also lucene has a update method which delete documents by term
>> then add the new document.
>>
>> In my test, either LockObtainFailedException with thread sleep sentence:
>>
>> org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out:
>> [EMAIL PROTECTED]:\index\write.lock
>> at org.apache.lucene.store.Lock.obtain(Lock.java:85)
>> at
>>
>> org.apache.lucene.index.DirectoryIndexReader.acquireWriteLock(DirectoryIndexReader.java:298)
>> at
>> org.apache.lucene.index.IndexReader.deleteDocument(IndexReader.java:750)
>> at
>> org.apache.lucene.index.IndexReader.deleteDocuments(IndexReader.java:786)
>> at org.test.IndexThread.run(IndexThread.java:33)
>>
>> or StaleReaderException without thread sleep sentence:
>>
>> org.apache.lucene.index.StaleReaderException: IndexReader out of date and
>> no
>> longer valid for delete, undelete, or setNorm operations
>> at
>>
>> org.apache.lucene.index.DirectoryIndexReader.acquireWriteLock(DirectoryIndexReader.java:308)
>> at
>> org.apache.lucene.index.IndexReader.deleteDocument(IndexReader.java:750)
>> at
>> org.apache.lucene.index.IndexReader.deleteDocuments(IndexReader.java:786)
>> at org.test.IndexThread.run(IndexThread.java:31)
>>
>> My test code:
>>
>>
>> public class Main {
>>
>> public static void main(String[] args) throws IOException {
>>  Directory directory = FSDirectory.getDirectory("e:/index");
>>  IndexWriter writer = new IndexWriter(directory, null, false);
>>  Document document = new Document();
>>  document.add(new Field("bbb", "bbb", Store.YES, Index.UN_TOKENIZED));
>>  writer.addDocument(document);
>>
>>  Thread t = new IndexThread();
>>  t.start();
>>
>>  try {
>>  Thread.sleep(1000);
>>  } catch (InterruptedException e) {
>>  // TODO Auto-generated catch block
>>  e.printStackTrace();
>>  }
>>
>>  writer.optimize();
>>  writer.close();
>>  System.out.println("out");
>> }
>> }
>>
>> public class IndexThread extends Thread {
>>
>> @Override
>> public void run() {
>>  Directory directory;
>>  try {
>>  try {
>>   Thread.sleep(10);
>>  } catch (InterruptedException e) {
>>   // TODO Auto-generated catch block
>>   e.printStackTrace();
>>  }
>>
>>  directory = FSDirectory.getDirectory("e:/index");
>>  System.out.println("thread begin");
>>  //IndexWriter reader = new IndexWriter(directory, null, false);
>>  IndexReader reader = IndexReader.open(directory);
>>  Term term = new Term("bbb", "bbb");
>>  reader.deleteDocuments(term);
>>  reader.close();
>>  System.out.println("thread end");
>>  } catch (IOException e) {
>>  // TODO Auto-generated catch block
>>  e.printStackTrace();
>>  }
>> }
>> }
>>
>>
>>
>> 2008/9/4, Michael McCandless <[EMAIL PROTECTED]>:
>>
>>>
>>>
>>> Actually, as of 2.3, this is no longer true: merges and optimizing run in
>>> the background, and allow add/update/delete documents to run at the same
>>> time.
>>>
>>> I think it's probably best to use application logic (outside of Lucene)
>>> to
>>> keep track of what updates happened to the master while the slave was
>>> optimizing.
>>>
>>> Mike
>>>
>>> 叶双明 wrote:
>>>
>>> No documents can be added to the index while the index is optimizing, or
>>> optimizing can't run while documents are being added to the index.
>>> So, without other errors, I think we can believe the two indexes are
>>> indeed the same.

 :)

 2008/9/4 Noble Paul നോബിള്‍ नोब्ळ् <[EMAIL PROTECTED]>

 The use case is as follows

>
> I have two indexes . One at the master and one at the slave. The user
> occasionally keeps committing on the master and the delta is
> replicated everytime. But when the optimize happens the transfer size
> can be really large. So I am thinking of  doing the optimize
> separately on master and slave .
>
> So far, so good. But how can I really know that after the optimize the
> indexes are indeed the same or no documents got added in between.?
>
>
>
> On Fri, Aug 29, 2008 at 3:13 PM, Karl Wettin <[EMAIL PROTECTED]>
> wrote:
>
>
>> 29 aug 2008 kl. 11.35 skrev Noble Paul നോബിള്‍ नोब्ळ्:

Re: Realtime Search for Social Networks Collaboration

2008-09-04 Thread Jason Rutherglen
Hi Cam,

Thanks!  It has not been easy, probably has taken 3 years or so to get
this far.  At first I thought the new reopen code would be the
solution.  I used it, but then needed to modify it to do a clone
instead of referencing the old deleted docs.  Then as I iterated, I
realized that just using reopen on a RAMDirectory would not be quite
fast enough because of the merging.  Then started using
InstantiatedIndex which provides an in memory version of the document,
without the overhead of merging during the transaction.  There are
other complexities as well.  The basic code works if you are
interested in trying it out.

Take care,
Jason

On Thu, Sep 4, 2008 at 9:08 AM, Cam Bazz <[EMAIL PROTECTED]> wrote:
> Hello Jason,
> I have been trying to do this for a long time on my own. keep up the good
> work.
>
> What I tried was a document cache using apache collections. and before a
> indexwrite/delete i would sync the cache with index.
>
> I am waiting for lucene 2.4 to proceed. (query by delete)
>
> Best.
>
> On Wed, Sep 3, 2008 at 10:20 PM, Jason Rutherglen <
> [EMAIL PROTECTED]> wrote:
>
>> Hello all,
>>
>> I don't mean this to sound like a solicitation.  I've been working on
>> realtime search and created some Lucene patches etc.  I am wondering
>> if there are social networks (or anyone else) out there who would be
>> interested in collaborating with Apache on realtime search to get it
>> to the point it can be used in production.  It is a challenging
>> problem that only Google has solved and made to scale.  I've been
>> working on the problem for a while and though a lot has been
>> completed, there is still a lot more to do and collaboration amongst
>> the most probable users (social networks) seems like a good thing to
>> try to do at this point.  I guess I'm saying it seems like a hard
>> enough problem that perhaps it's best to work together on it rather
>> than each company try to complete their own.  However I could be
>> wrong.
>>
>> Realtime search benefits social networks by providing a scalable
>> searchable alternative to large Mysql implementations.  Mysql I have
>> heard is difficult to scale at a certain point.  Apparently Google has
>> created things like BigTable (a large database) and an online service
>> called GData (which Google has not published any whitepapers on the
>> technology underneath) to address scaling large database systems.
>> BigTable does not offer search.   GData does and is used by all of
>> Google's web services instead of something like Mysql (this is at
>> least how I understand it).  Social networks usually grow and so
>> scaling is continually an issue.  It is possible to build a realtime
>> search system that scales linearly, something that I have heard
>> becomes difficult with Mysql.  There is an article that discusses some
>> of these issues
>> http://acmqueue.com/modules.php?name=Content&pa=showpage&pid=337  I
>> don't think the current GData implementation is perfect and there is a
>> lot that can be improved on.  It might be helpful to figure out
>> together what helpful things can be added.
>>
>> If this sounds like something of interest to anyone feel free to send
>> your input.
>>
>> Take care,
>> Jason
>>




Re: string similarity measures

2008-09-04 Thread mathieu

I submitted a patch to handle Aspell phonetic rules. You can find it in JIRA.
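
As an illustration of the phonetic-matching idea (this uses Commons Codec's
DoubleMetaphone rather than the Aspell patch itself, and -- as noted
upthread -- its rules are English-oriented, so treat it as only a rough fit
for Turkish):

import org.apache.commons.codec.language.DoubleMetaphone;

public class BadWordCheck {
    public static void main(String[] args) {
        DoubleMetaphone dm = new DoubleMetaphone();
        // Compare the phonetic codes of a banned word and a user's input;
        // non-letters such as the '1' in "sh1t" are skipped by the encoder.
        System.out.println(dm.doubleMetaphone("shit"));
        System.out.println(dm.doubleMetaphone("sh1t"));
        System.out.println(dm.isDoubleMetaphoneEqual("shit", "sh1t"));
    }
}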

On Thu, 4 Sep 2008 17:07:09 +0300, "Cam Bazz" <[EMAIL PROTECTED]> wrote:
> let me rephrase the problem. I already have a set of bad words. I want to
> avoid people inputting typos of the bad words.
> for example 'shit' is banned, but someone may enter sh1t.
> 
> how can i flag those phonetically similar bad words to the marked bad
> words?
> 
> Best.
> 
> On Thu, Sep 4, 2008 at 5:02 PM, Karl Wettin <[EMAIL PROTECTED]> wrote:
> 
>>
>> 4 sep 2008 kl. 15.54 skrev Cam Bazz:
>>
>>  yes, I already have a system for users reporting words. they fall on an
>>> operator screen and if operator approves, or if 3 other people marked
> it
>>> as
>>> curse, then it is filtered.
>>> in the other thread you wrote:
>>>
>>>  I would create 1-5 ngram sized shingles and measure the distance using

>>> Tanimoto coefficient. That would probably work out just fine. ?>You
> might
>>> want to add more weight the greater the size of the shingle.
>>>

 There are shingle filters in lucene/java/contrib/analyzers and there
> is a

>>> Tanimoto distance in lucene/mahout/.
>>>
>>> would that apply to my case? tanimoto coefficient over shingles?
>>>
>>
>> Not really, no.
>>
>>
>> karl
>>
>>
>>
>>
>>>
>>> Best,
>>>
>>>
>>> On Thu, Sep 4, 2008 at 4:12 PM, Karl Wettin <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>>
 4 sep 2008 kl. 14.38 skrev Cam Bazz:


 Hello,

> This came up before but - if we were to make a swear word filter,
> string edit distances are no good. for example words like `shot` get
> confused with `shit`. there is also a problem with words like
> hitchcock. apparently i need something like soundex or double
> metaphone. the thing is - these are language specific, and i am not
> operating in english.
>
> I need a fuzzy like curse word filter for turkish, simply.
>
>
 You probably need to make a large list of words. I would try to learn
 from
 the users that do swear, perhaps even trust my users to report each
 other. I
 would probably also look at storing in what context the word is used,
 perhaps by adding the surrounding words (ngrams, shingles, markov
 chains).
 Compare "go to hell" and "when hell frezes over". The first is rather
 derogatory while the second doen't have to be bad at all.

 I'm thinking Hidden Markov Models and Neural Networks.


karl




>>
> 
> 





ramdisks

2008-09-04 Thread Cam Bazz
hello,
anyone using ramdisks for storage? there is RamSan and there is also Fusion-io,
but they are kinda expensive. any other alternatives, I wonder?

Best.


PhraseQuery issues - differences with SpanNearQuery

2008-09-04 Thread Yannis Pavlidis
Hi,

I am having an issue when using the PhraseQuery which is best illustrated with 
this example:

I have created 2 documents to emulate URLs. One with a URL of: 
"http://www.airballoon.com"; and title "air balloon" and the second one with URL
"http://www.balloonair.com"; and title: "balloon air".

Test1 (PhraseQuery)
==
Now when I use the phrase query with - title: "air balloon" ~2
I get back:

url: "http://www.airballoon.com"; - score: 1.0
url: "http://www.balloonair.com"; - score: 0.57

Test2 (PhraseQuery)
==
Now when I use the phrase query with - title: "balloon air" ~2
I get back:
url: "http://www.balloonair.com"; - score: 1.0
url: "http://www.airballoon.com"; - score: 0.57

Test3 (PhraseQuery)
==
Now when I use the phrase query with - title: "air balloon" ~2 title: "balloon 
air" ~2
I get back:
url: "http://www.airballoon.com"; - score: 1.0
url: "http://www.balloonair.com"; - score: 1.0

Test4 (SpanNearQuery)
===
spanNear([title:air, title:balloon], 2, false)
I get back:
url: "http://www.airballoon.com"; - score: 1.0
url: "http://www.balloonair.com"; - score: 1.0

I would have expected that Test1 and Test2 would actually return both URLs with
a score of 1.0 since I am setting the slop to 2. It seems, though, that lucene
really favors an absolute exact match.

Is it safe to assume that for what I am looking for (basically scoring the docs
the same regardless of whether someone searches for "air balloon" or "balloon
air") it would be better to use the SpanNearQuery rather than the PhraseQuery?

Any input would be appreciated. 

Thanks in advance,

Yannis.
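
For reference, a minimal sketch of the two query forms being compared; the
field name "title" follows the example above, and searcher setup is omitted:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class QueryComparison {
    public static void main(String[] args) {
        // PhraseQuery: slop 2, but sloppier matches are scored lower.
        PhraseQuery phrase = new PhraseQuery();
        phrase.add(new Term("title", "air"));
        phrase.add(new Term("title", "balloon"));
        phrase.setSlop(2);

        // SpanNearQuery: slop 2, unordered (third argument false),
        // so "air balloon" and "balloon air" match alike.
        SpanNearQuery span = new SpanNearQuery(
                new SpanTermQuery[] {
                        new SpanTermQuery(new Term("title", "air")),
                        new SpanTermQuery(new Term("title", "balloon")) },
                2, false);

        System.out.println(phrase);  // title:"air balloon"~2
        System.out.println(span);    // spanNear([title:air, title:balloon], 2, false)
    }
}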


Re: PhraseQuery issues - differences with SpanNearQuery

2008-09-04 Thread Mark Miller
Sounds like it's more in line with what you are looking for. If I 
remember correctly, the phrase query factors the edit distance into 
scoring, but SpanNearQuery will just use the combined idf for each 
of the terms in it, so distance shouldn't matter with spans (I'm sure 
Paul will correct me if I am wrong).


- Mark




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Newbie question: using Lucene to index hierarchical information.

2008-09-04 Thread Leonid Maslov
Hi all,
Thanks a lot for such a quick reply.

Both scenarios sound good to me. I would like to do my best and try to
implement one of them (as a proof of concept) and then incrementally
improve, retest, investigate and rewrite it :)

So, from the soap opera to the question part then:

   - How do I implement those things (a and b) on the Lucene and Lucene
   contribs codebase?
   - I looked at
   http://xtf.wiki.sourceforge.net/tagRef_textIndexer_PreFilter#toctagRef_textIndexer_PreFilter7
   and didn't like it (too big, too heavy, a ready-to-use solution instead
   of a toolkit). And I didn't understand how to implement the "Normal
   scenario" on top of that.
   - Any suggestions on how I could begin implementing these things? Gently
   moving from the "Normal" scenario to some more advanced "Complex" one?
   What should I be afraid of, and what are the possible impacts, if any?

Has anybody tried to use Lucene to analyse things like that? What would be
possible solutions to store the indexed data and perform queries on it? If
Lucene isn't the right tool for this job, maybe some other toolkit would be
more useful (possibly on top of Lucene)?

Thanks in advance for any suggestions and comments. I would appreciate any
ideas and directions to look into.


On Tue, Sep 2, 2008 at 11:46 AM, Karsten F.
<[EMAIL PROTECTED]>wrote:

>
> Hi Leonid,
>
> what kind of query is your use case?
>
> Complex scenario:
> You need all the hierarchical structure information in one query. This
> means
> you want to search with xpath in a real xml-Database. (like: All Documents
> with a subtitle XY which contains directly after this subtitle a table with
> the same column like ...)
>
> Normal scenario:
> You want to search for only one part of your hierarchical information like
> 'Document with word xy in title' and 'Documents with word xy in table'.
>
> I am not familiar with lucene use in xml-Databases, but I can advice for
> "normal scenario":
>
> Take a look to the xml-aware search in xtf (
>
> http://xtf.wiki.sourceforge.net/tagRef_textIndexer_PreFilter#toctagRef_textIndexer_PreFilter7
> ).
> The idea is to use one lucene-document for each section with only two
> fields: "text" and "sectionType".
> But to collect all hits belonging to one hierarchical information (e.g. one
> html-File) and compress this to one representative hit in lucene.
>
> Best regards
>  Karsten
>
>
> leonardinius wrote:
> >
> > Any comments, suggestions? Maybe I should rephrase my original message or
> > describe it in detail?
> > I really would like to get any response if possible.
> >
> > Thanks a lot in advance!
> >
> > On Mon, Sep 1, 2008 at 10:25 AM, Leonid Maslov <[EMAIL PROTECTED]>
> wrote:
> >
> >> Hi all,
> >>
> >> First of all, sorry for my poor English. It's not my native language.
> >>
> >> I'm trying to use Lucene to index hierarchical kind of information: I
> >> have
> >> structured html and pdf/word documents and I want to index them in ways
> >> to
> >> perform search in titles, text, paragraphs or tables only, or any
> >> combinations of items mentioned above. At the moment I see 3 possible
> >> solutions:
> >>
> >>- Create the set of all possible fields, like: contents, title,
> >>heading, table etc... And index the data in all of them accordingly.
> >>Possible impacts:
> >>   - a big count of fields
> >>   - data duplication (because making a search in the paragraphs look
> >>   inside all the inner elements means every outer element indexed
> >>   will contain all the inner element content as well)
> >>- Create a hierarchy of fields, like "title", "paragraph/title",
> >>"paragraph/title/subparagraph/table". Possible impacts:
> >>   - count of fields remains the same
> >>   - soft set of fields (not consistent)
> >>   - I'm not sure about the ways I could process required information
> >>   and perform search.
> >>   - Performance issues?
> >>- Use one field for content and just add a location prefix to the
> >>content. For example "contents:*paragraph/heading:*token1 token2".
> >>*paragraph/heading:* here is used as an additional information prefix.
> >>So, I (possibly?) could reuse PrefixQuery functionality or smth.
> >>Impacts:
> >>   - Strong set of index fields (small)
> >>   - Additional information processing - all the queries I'll use
> >>   will have to work as PrefixQuery
> >>   - Performance issues?
> >>
> >>
> >> So, has anyone tried to make things work like that? Or am I trying to
> >> use a wrench to hammer in nails? I assume Lucene wasn't meant to be
> >> used like that, but it's worth trying (at least asking).
> >> Any results / suggestions are welcome!
> >>
> >> --
> >> Bests regards,
> >> Leonid Maslov!
> >> Adrienne Gusoff  - "Opportunity knocked. My doorman threw him out."
> >>
> >
> >
> >
> > --
> > Bests regards,
> > Leonid Maslov!
> > Adrienne Gusoff  - "Opportunity knocked. My doorman threw him out."
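
For reference, a minimal sketch of Karsten's "normal scenario": one Lucene
Document per section, with "text" and "sectionType" fields as described. The
"source" field is an illustrative addition for grouping hits back to one file:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class SectionDoc {
    // One document per section: "sectionType" says where the text came from
    // (title, paragraph, table, ...), "text" holds the section content.
    static Document section(String sectionType, String text, String sourceFile) {
        Document doc = new Document();
        doc.add(new Field("sectionType", sectionType,
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("text", text, Field.Store.YES, Field.Index.TOKENIZED));
        // Hypothetical extra field so hits can be collapsed per source file.
        doc.add(new Field("source", sourceFile,
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        return doc;
    }
}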

Lucene debug logging?

2008-09-04 Thread Justin Grunau
Is there a way to turn on debug logging / trace logging for Lucene?



  


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Problem with lucene search starting to return 0 hits when a few seconds earlier it was returning hundreds

2008-09-04 Thread Justin Grunau
We have some code that uses lucene which has been working perfectly well for 
several months.

Recently, a QA team in our organization has set up a server with a much larger 
data set than we have ever tested with in the past:  the resulting lucene index 
is about 3G in size.

On this particular server, the same lucene code which has been reliable in the 
past is now exhibiting erratic behavior.  The first time you do a search, it 
returns the correct number of hits.  The second time you do a search, it may or 
may not return the correct set.  By the third time you do a search, it will 
return 0 hits even for a search that was returning hundreds of hits only a few 
seconds earlier.  All subsequent searches will return 0 hits until you stop and 
restart the java process.

A snippet of the relevant code follows:

// getReader() returns the singleton IndexReader object
final IndexReader reader = getReader();

// ANALYZER is another singleton
final QueryParser queryParser = new QueryParser("text", ANALYZER);
queryParser.setDefaultOperator(spec.getDefaultOp());
final Query query = queryParser.parse(spec.getSearchText()).rewrite(reader);
final IndexSearcher searcher = new IndexSearcher(reader);

final Hits hits = searcher.search(query,
        new CachingWrapperFilter(new QueryWrapperFilter(visibilityFilter)));
total = hits.length();



I understand that Lucene should be able to handle very large datasets, so I'd 
be surprised if this were an actual Lucene bug.  I'm hoping it's just that I'm 
doing something "wrong" which has gone unnoticed so far for several months 
because we've never had an index this large.

We're using lucene version 2.2.0.

Thanks!

Justin Grunau



  


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Problem with lucene search starting to return 0 hits when a few seconds earlier it was returning hundreds

2008-09-04 Thread Leonid M.
And what about the visibility filter? Are you sure no one else accesses the
IndexReader and modifies the index? Check reader.maxDoc() to be sure.

On Fri, Sep 5, 2008 at 12:19 AM, Justin Grunau <[EMAIL PROTECTED]> wrote:

> We have some code that uses lucene which has been working perfectly well
> for several months.
>
> Recently, a QA team in our organization has set up a server with a much
> larger data set than we have ever tested with in the past:  the resulting
> lucene index is about 3G in size.
>
> On this particular server, the same lucene code which has been reliable in
> the past is now exhibiting erratic behavior.  The first time you do a
> search, it returns the correct number of hits.  The second time you do a
> search, it may or may not return the correct set.  By the third time you do
> a search, it will return 0 hits even for a search that was returning
> hundreds of hits only a few seconds earlier.  All subsequent searches will
> return 0 hits until you stop and restart the java process.
>
> A snippet of the relevant code follows:
>
>// getReader() returns the singleton IndexReader object
>final IndexReader reader = getReader();
>
>// ANALYZER is another singleton
>final QueryParser queryParser = new QueryParser("text",
> ANALYZER);
>queryParser.setDefaultOperator(spec.getDefaultOp());
>final Query query =
> queryParser.parse(spec.getSearchText()).rewrite(
>reader);
>final IndexSearcher searcher = new IndexSearcher(reader);
>
>final Hits hits = searcher.search(query, new
> CachingWrapperFilter(
>new QueryWrapperFilter(visibilityFilter)));
>total = hits.length();
>
>
>
> I understand that Lucene should be able to handle very large datasets, so
> I'd be surprised if this were an actual Lucene bug.  I'm hoping it's just
> that I'm doing something "wrong" which has gone unnoticed so far for several
> months because we've never had an index this large.
>
> We're using lucene version 2.2.0.
>
> Thanks!
>
> Justin Grunau
>
>
>
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-- 
 Bests regards,
 Leonid Maslov!
 Personal blog: http://leonardinius.blogspot.com/

Random thought:
Princess Margaret  - "I have as much privacy as a goldfish in a bowl."


Re: Problem with lucene search starting to return 0 hits when a few seconds earlier it was returning hundreds

2008-09-04 Thread Justin Grunau
Sorry, I forgot to include the visibility filters:

final BooleanQuery visibilityFilter = new BooleanQuery();
visibilityFilter.add(new TermQuery(new Term("isPublic", "true")),
        Occur.SHOULD);
visibilityFilter.add(new TermQuery(new Term("reader", user.getId())),
        Occur.SHOULD);


These visibility filters ensure that a user only sees files which he or she has 
access to see.

I am pretty certain nobody else has modified the index in the meantime, but why 
is that important?  We have several other servers -- whose only difference is a 
smaller data set -- with dozens of concurrent users, and the index on those 
servers gets modified and read concurrently all the time, but none of these 
other servers have ever exhibited this bug.



- Original Message 
From: Leonid M. <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Thursday, September 4, 2008 5:35:47 PM
Subject: Re: Problem with lucene search starting to return 0 hits when a few 
seconds earlier it was returning hundreds

And what about the visibility filter? Are you sure no one else accesses the
IndexReader and modifies the index? Check reader.maxDoc() to be sure.

On Fri, Sep 5, 2008 at 12:19 AM, Justin Grunau <[EMAIL PROTECTED]> wrote:

> We have some code that uses lucene which has been working perfectly well
> for several months.
>
> Recently, a QA team in our organization has set up a server with a much
> larger data set than we have ever tested with in the past:  the resulting
> lucene index is about 3G in size.
>
> On this particular server, the same lucene code which has been reliable in
> the past is now exhibiting erratic behavior.  The first time you do a
> search, it returns the correct number of hits.  The second time you do a
> search, it may or may not return the correct set.  By the third time you do
> a search, it will return 0 hits even for a search that was returning
> hundreds of hits only a few seconds earlier.  All subsequent searches will
> return 0 hits until you stop and restart the java process.
>
> A snippet of the relevant code follows:
>
>// getReader() returns the singleton IndexReader object
>final IndexReader reader = getReader();
>
>// ANALYZER is another singleton
>final QueryParser queryParser = new QueryParser("text",
> ANALYZER);
>queryParser.setDefaultOperator(spec.getDefaultOp());
>final Query query =
> queryParser.parse(spec.getSearchText()).rewrite(
>reader);
>final IndexSearcher searcher = new IndexSearcher(reader);
>
>final Hits hits = searcher.search(query, new
> CachingWrapperFilter(
>new QueryWrapperFilter(visibilityFilter)));
>total = hits.length();
>
>
>
> I understand that Lucene should be able to handle very large datasets, so
> I'd be surprised if this were an actual Lucene bug.  I'm hoping it's just
> that I'm doing something "wrong" which has gone unnoticed so far for several
> months because we've never had an index this large.
>
> We're using lucene version 2.2.0.
>
> Thanks!
>
> Justin Grunau
>
>
>
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-- 
Bests regards,
Leonid Maslov!
Personal blog: http://leonardinius.blogspot.com/

Random thought:
Princess Margaret  - "I have as much privacy as a goldfish in a bowl."



  


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene debug logging?

2008-09-04 Thread Daniel Naber
On Donnerstag, 4. September 2008, Justin Grunau wrote:

> Is there a way to turn on debug logging / trace logging for Lucene?

You can use IndexWriter's setInfoStream(). Besides that, Lucene doesn't do 
any logging AFAIK. Are you experiencing any problems that you want to 
diagnose with debugging?

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene debug logging?

2008-09-04 Thread Michael McCandless


For IndexWriter there's setInfoStream, which logs details about when
flushing & merging happen.


Mike
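
A minimal sketch of wiring that up (Lucene 2.x; the index path is
illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class InfoStreamDemo {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
                FSDirectory.getDirectory("/tmp/index"),
                new StandardAnalyzer(), true);
        // Route Lucene's flush/merge diagnostics to stdout.
        writer.setInfoStream(System.out);
        // ... add documents here ...
        writer.close();
    }
}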

Justin Grunau wrote:


Is there a way to turn on debug logging / trace logging for Lucene?






-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Problem with lucene search starting to return 0 hits when a few seconds earlier it was returning hundreds

2008-09-04 Thread Leonid M.
Anyway, it is worth trying (to ensure docs aren't removed between searches).
What if you run a MatchAllDocsQuery or something similar? Do you still get a
different hit count when you rerun the query?

PS. I'm kinda newbie with Lucene and Lucene API. So don't take my notes too
seriously :)

On Fri, Sep 5, 2008 at 12:46 AM, Justin Grunau <[EMAIL PROTECTED]> wrote:

> Sorry, I forgot to include the visibility filters:
>
>final BooleanQuery visibilityFilter = new BooleanQuery();
>visibilityFilter.add(new TermQuery(new Term("isPublic",
> "true")),
>Occur.SHOULD);
>visibilityFilter.add(new TermQuery(new Term("reader",
> user.getId())),
>Occur.SHOULD);
>
>
> These visibility filters ensure that a user only sees files which he or she
> has access to see.
>
> I am pretty certain nobody else has modified the index in the meantime, but
> why is that important?  We have several other servers -- whose only
> difference is a smaller data set -- with dozens of concurrent users, and the
> index on those servers gets modified and read concurrently all the time, but
> none of these other servers have ever exhibited this bug.
>
>
>
> - Original Message 
> From: Leonid M. <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Thursday, September 4, 2008 5:35:47 PM
> Subject: Re: Problem with lucene search starting to return 0 hits when a
> few seconds earlier it was returning hundreds
>
> And what about the visibility filter? Are you sure no one else accesses the
> IndexReader and modifies the index? Check reader.maxDoc() to be sure.
>
> On Fri, Sep 5, 2008 at 12:19 AM, Justin Grunau <[EMAIL PROTECTED]> wrote:
>
> > We have some code that uses lucene which has been working perfectly well
> > for several months.
> >
> > Recently, a QA team in our organization has set up a server with a much
> > larger data set than we have ever tested with in the past:  the resulting
> > lucene index is about 3G in size.
> >
> > On this particular server, the same lucene code which has been reliable
> in
> > the past is now exhibiting erratic behavior.  The first time you do a
> > search, it returns the correct number of hits.  The second time you do a
> > search, it may or may not return the correct set.  By the third time you
> do
> > a search, it will return 0 hits even for a search that was returning
> > hundreds of hits only a few seconds earlier.  All subsequent searches
> will
> > return 0 hits until you stop and restart the java process.
> >
> > A snippet of the relevant code follows:
> >
> >// getReader() returns the singleton IndexReader
> object
> >final IndexReader reader = getReader();
> >
> >// ANALYZER is another singleton
> >final QueryParser queryParser = new QueryParser("text",
> > ANALYZER);
> >queryParser.setDefaultOperator(spec.getDefaultOp());
> >final Query query =
> > queryParser.parse(spec.getSearchText()).rewrite(
> >reader);
> >final IndexSearcher searcher = new IndexSearcher(reader);
> >
> >final Hits hits = searcher.search(query, new
> > CachingWrapperFilter(
> >new QueryWrapperFilter(visibilityFilter)));
> >total = hits.length();
> >
> >
> >
> > I understand that Lucene should be able to handle very large datasets, so
> > I'd be surprised if this were an actual Lucene bug.  I'm hoping it's just
> > that I'm doing something "wrong" which has gone unnoticed so far for
> several
> > months because we've never had an index this large.
> >
> > We're using lucene version 2.2.0.
> >
> > Thanks!
> >
> > Justin Grunau
> >
> >
> >
> >
> >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
>
>
> --
> Bests regards,
> Leonid Maslov!
> Personal blog: http://leonardinius.blogspot.com/
>
> Random thought:
> Princess Margaret  - "I have as much privacy as a goldfish in a bowl."
>
>
>
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-- 
Bests regards,
Leonid Maslov!
Personal blog: http://leonardinius.blogspot.com/

Random thought:
John Belushi  - "I owe it all to little chocolate donuts."
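
A minimal sketch of the sanity check Leonid suggests, assuming the same
singleton reader as in the original code:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;

public class SanityCheck {

    // Run this between the real searches: if maxDoc()/numDocs() or the
    // MatchAllDocsQuery hit count suddenly drops, the reader's view of
    // the index itself has changed, and the problem is not in the query.
    static void check(IndexReader reader) throws Exception {
        System.out.println("maxDoc=" + reader.maxDoc()
                + " numDocs=" + reader.numDocs());
        Hits hits = new IndexSearcher(reader).search(new MatchAllDocsQuery());
        System.out.println("matchAll=" + hits.length());
    }
}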


Re: Lucene debug logging?

2008-09-04 Thread Justin Grunau
Daniel, yes, please see my "Problem with lucene search starting to return 0 
hits when a few seconds earlier it was returning hundreds" thread.



- Original Message 
From: Daniel Naber <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Thursday, September 4, 2008 6:10:56 PM
Subject: Re: Lucene debug logging?

On Donnerstag, 4. September 2008, Justin Grunau wrote:

> Is there a way to turn on debug logging / trace logging for Lucene?

You can use IndexWriter's setInfoStream(). Besides that, Lucene doesn't do 
any logging AFAIK. Are you experiencing any problems that you want to 
diagnose with debugging?

Regards
Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


  


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: PhraseQuery issues - differences with SpanNearQuery

2008-09-04 Thread Paul Elschot
Op Thursday 04 September 2008 20:39:13 schreef Mark Miller:
> Sounds like it's more in line with what you are looking for. If I
> remember correctly, the phrase query factors the edit distance into
> scoring, but SpanNearQuery will just use the combined idf for
> each of the terms in it, so distance shouldn't matter with spans (I'm
> sure Paul will correct me if I am wrong).

SpanScorer will use the similarity slop factor for each matching
span size to adjust the effective frequency.
The span size is the difference in position between the first
and last matching term, and idf is not used for scoring Spans.
The reason why idf is not used could be that there is no basic
score value associated with inner spans; only top level spans
are scored by SpanScorer.
For more details, please consult the SpanScorer code.

Regards,
Paul Elschot
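
For illustration (this is not from the thread): since SpanScorer, like the
sloppy PhraseScorer, consults Similarity.sloppyFreq for each match, overriding
that method is one way to stop distance from influencing the score at all:

import org.apache.lucene.search.DefaultSimilarity;

// Illustrative: make every matching span contribute equally to the
// frequency, no matter how far apart the terms are.
public class FlatSlopSimilarity extends DefaultSimilarity {
    public float sloppyFreq(int distance) {
        return 1.0f;  // DefaultSimilarity returns 1.0f / (distance + 1)
    }
}

Install it on the searcher with searcher.setSimilarity(new FlatSlopSimilarity()).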

>
> - Mark
>
> Yannis Pavlidis wrote:
> > Hi,
> >
> > I am having an issue when using the PhraseQuery which is best
> > illustrated with this example:
> >
> > I have created 2 documents to emulate URLs. One with a URL of:
> > "http://www.airballoon.com"; and title "air balloon" and the second
> > one with URL "http://www.balloonair.com"; and title: "balloon air".
> >
> > Test1 (PhraseQuery)
> > ==
> > Now when I use the phrase query with - title: "air balloon" ~2
> > I get back:
> >
> > url: "http://www.airballoon.com"; - score: 1.0
> > url: "http://www.balloonair.com"; - score: 0.57
> >
> > Test2 (PhraseQuery)
> > ==
> > Now when I use the phrase query with - title: "balloon air" ~2
> > I get back:
> > url: "http://www.balloonair.com"; - score: 1.0
> > url: "http://www.airballoon.com"; - score: 0.57
> >
> > Test3 (PhraseQuery)
> > ==
> > Now when I use the phrase query with - title: "air balloon" ~2
> > title: "balloon air" ~2 I get back:
> > url: "http://www.airballoon.com"; - score: 1.0
> > url: "http://www.balloonair.com"; - score: 1.0
> >
> > Test4 (SpanNearQuery)
> > ===
> > spanNear([title:air, title:balloon], 2, false)
> > I get back:
> > url: "http://www.airballoon.com"; - score: 1.0
> > url: "http://www.balloonair.com"; - score: 1.0
> >
> > I would have expected that Test1 and Test2 would actually return both
> > URLs with a score of 1.0 since I am setting the slop to 2. It seems,
> > though, that lucene really favors an absolute exact match.
> >
> > Is it safe to assume that for what I am looking for (basically
> > scoring the docs the same regardless of whether someone searches for
> > "air balloon" or "balloon air") it would be better to use the
> > SpanNearQuery rather than the PhraseQuery?
> >
> > Any input would be appreciated.
> >
> > Thanks in advance,
> >
> > Yannis.
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: QueryParser vs. BooleanQuery

2008-09-04 Thread 叶双明
Indeed, StandardAnalyzer removes the pluses, so 'c++' is analyzed to 'c'.
QueryParser builds its Terms from analyzed text, while a hand-built
BooleanQuery contains Terms that have not been analyzed.
I think that is the difference between them.
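
A minimal sketch that makes the difference visible, assuming Lucene 2.3 and
StandardAnalyzer; it prints the tokens actually indexed for "Visual C++":

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class AnalyzeTerm {
    public static void main(String[] args) throws Exception {
        TokenStream ts = new StandardAnalyzer()
                .tokenStream("name", new StringReader("Visual C++"));
        // Prints "visual" and "c": the pluses are stripped and the text is
        // lowercased, so the index only ever contains "c", never "c++".
        for (Token t = ts.next(); t != null; t = ts.next()) {
            System.out.println(t.termText());
        }
    }
}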



2008/9/4 Ian Lea <[EMAIL PROTECTED]>

> Have a look at the index with Luke to see what has actually been
> indexed. StandardAnalyzer may well be removing the pluses, or you may
> need to escape them.  And watch out for case - Visual != visual in
> term query land.
>
>
> --
> Ian.
>
>
> On Thu, Sep 4, 2008 at 9:46 AM, bogdan71 <[EMAIL PROTECTED]> wrote:
> >
> >  Hello,
> >
> >  I am experiencing a strange behaviour when trying to query the same
> > thing via BooleanQuery vs. via the know-it-all QueryParser class.
> > Precisely, the index contains the document:
> >   "12,Visual C++,4.2" with the field layout: ID,name,version (thus, "12"
> > is the ID field, "Visual C++" is the name field, and "4.2" is the version
> > field).
> >  The search string is "Visual C++" for the name field.
> >
> >  The following test, using QueryParser, goes fine:
> >
> > public final void testUsingQueryParser()
> >{
> >IndexSearcher recordSearcher;
> >Query q;
> >QueryParser parser = new QueryParser("name", new
> StandardAnalyzer());
> >try
> >{
> >   q = parser.parse("name:visual +name:c++");
> >
> >Directory directory =
> > FSDirectory.getDirectory();
> >recordSearcher = new IndexSearcher(directory);
> >
> >Hits h = recordSearcher.search(q);
> >
> >assertEquals(1, h.length());
> >assertEquals(12,
> Integer.parseInt(h.doc(0).get("ID")));
> >}
> >catch(Exception exn)
> >{
> >fail("Exception occurred.");
> >}
> >}
> >
> >  But this one, using a BooleanQuery, fails.
> >
> > public final void testUsingTermQuery()
> >{
> >IndexSearcher recordSearcher;
> >BooleanQuery bq = new BooleanQuery();
> >
> >bq.add(new TermQuery(new Term("name", "visual")),
> > BooleanClause.Occur.SHOULD);
> >bq.add(new TermQuery(new Term("name", "c++")),
> BooleanClause.Occur.MUST);
> >
> >try
> >{
> >Directory directory =
> > FSDirectory.getDirectory();
> >recordSearcher = new IndexSearcher(directory);
> >
> >Hits h = recordSearcher.search(bq);
> >
> >assertEquals(1, h.length());   // fails, saying it
> expects 0 !!!
> >assertEquals(12,
> Integer.parseInt(h.doc(0).get("ID")));
> >}
> >catch(Exception exn)
> >{
> >fail("Exception occurred.");
> >}
> >}
> >
> >   Rewriting the BooleanQuery and taking toString() yields the same String
> > given to QueryParser.parse() in the first test. I am using Lucene 2.3.0.
> > Can somebody explain the difference?
> > --
> > View this message in context:
> http://www.nabble.com/QueryParser-vs.-BooleanQuery-tp19306087p19306087.html
> > Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


Re: Beginner: Specific indexing

2008-09-04 Thread Chris Hostetter

Honestly: your problem doesn't sound like a Lucene problem to me at all 
... I would write custom code to check your files for the pattern you are 
looking for.  If you find it *then* construct a Document object, and add 
your 3 fields.  I probably wouldn't even use an analyzer.


-Hoss
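
A minimal sketch of what that could look like; the field names are
illustrative, and the pattern check itself is assumed to happen elsewhere:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class BuildDoc {
    // Hypothetical: called only when the custom pattern check has matched.
    static Document makeDocument(String path, String name, String version) {
        Document doc = new Document();
        // UN_TOKENIZED stores the exact value without running an analyzer.
        doc.add(new Field("path", path, Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("name", name, Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("version", version, Field.Store.YES, Field.Index.UN_TOKENIZED));
        return doc;
    }
}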


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Javadoc wording in IndexWriter.addIndexesNoOptimize()

2008-09-04 Thread Antony Bowesman

The Javadoc for this method has the following comment:

"This requires this index not be among those to be added, and the upper bound* 
of those segment doc counts not exceed maxMergeDocs. "


What does the second part of that mean, which is especially confusing given that 
MAX_MERGE_DOCS is deprecated.


Thanks
Antony



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Hits document offset information

2008-09-04 Thread Chris Hostetter


: Now, I would like to access the best fragment offsets from each
: document (hits.doc(i)).

I seem to recall that the recommended method for doing this is to subclass 
your favorite Formatter and record the information from each TokenGroup 
before delegating to the super class.

but there may be an easier way.

-Hoss
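
A minimal sketch of the subclass idea, wrapping the contrib highlighter's
SimpleHTMLFormatter; recording offsets in a plain list is illustrative:

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.TokenGroup;

// Records the offsets of every highlighted token group while still
// producing the normal <B>...</B> markup via the superclass.
public class OffsetRecordingFormatter extends SimpleHTMLFormatter {
    public final List<int[]> offsets = new ArrayList<int[]>();  // {start, end}

    public String highlightTerm(String originalText, TokenGroup tokenGroup) {
        if (tokenGroup.getTotalScore() > 0) {  // only groups that actually matched
            offsets.add(new int[] {
                    tokenGroup.getStartOffset(), tokenGroup.getEndOffset() });
        }
        return super.highlightTerm(originalText, tokenGroup);
    }
}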


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Merging indexes - which is best option?

2008-09-04 Thread Antony Bowesman
I am creating several temporary batches of indexes in separate indices and 
will periodically merge those batches into a set of master indices.  I'm using 
IndexWriter#addIndexesNoOptimize(), but the problem that gives me is that the 
master may already contain an entry for that document, so I get a duplicate.


Duplicates are prevented in the temporary index, because when adding Documents, 
I call IndexWriter#deleteDocuments(Term) with my UID, before I add the Document.


I have two choices

a) merge indexes then clean up any duplicates in the master (or vice versa). 
Probably IndexWriter.deleteDocuments(Term[]) would suit here with all the UIDs 
of the incoming documents.


b) iterate through the Documents in the temporary index and add them to the 
master

(b) sounds worse, as it seems an IndexWriter's Analyzer cannot be null, and I 
guess there's a penalty in assembling each Document from the reader.


Any views?
Antony
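
For reference, a minimal sketch of option (a); "uid" stands in for the
unique-ID field described above, and the names are illustrative:

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;

public class MergeBatch {
    // Hypothetical: delete any existing copies of the incoming UIDs from the
    // master first, then merge the temporary index in, so no duplicates remain.
    static void merge(IndexWriter master, Directory temp, String[] uids)
            throws Exception {
        Term[] terms = new Term[uids.length];
        for (int i = 0; i < uids.length; i++) {
            terms[i] = new Term("uid", uids[i]);
        }
        master.deleteDocuments(terms);               // remove would-be duplicates
        master.addIndexesNoOptimize(new Directory[] { temp });
    }
}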







-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: How can we know if 2 lucene indexes are same?

2008-09-04 Thread Noble Paul നോബിള്‍ नोब्ळ्
I am not using the same index with different writers. These are two
separate indexes both have their own reader/writer

I just wanted to minimize the network load by avoiding the download of
an optimized index if the contents are indeed same.
--noble
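
One hedged way to check, purely illustrative: walk the terms of the unique-id
field in both indexes and compare them pairwise. Note that terms of
deleted-but-unmerged documents still show up, so this is only reliable right
after an optimize:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class IndexCompare {

    // Compare the unique-id terms of two indexes in sorted order.
    // Equal id sets mean the same logical documents, even when the
    // segment files differ because one side was optimized.
    static boolean sameIds(IndexReader a, IndexReader b, String idField)
            throws Exception {
        TermEnum ta = a.terms(new Term(idField, ""));
        TermEnum tb = b.terms(new Term(idField, ""));
        try {
            while (true) {
                Term x = ta.term();
                Term y = tb.term();
                boolean xDone = (x == null || !x.field().equals(idField));
                boolean yDone = (y == null || !y.field().equals(idField));
                if (xDone && yDone) return true;   // both exhausted together
                if (xDone || yDone) return false;  // one index has extra ids
                if (!x.text().equals(y.text())) return false;
                ta.next();
                tb.next();
            }
        } finally {
            ta.close();
            tb.close();
        }
    }
}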

On Thu, Sep 4, 2008 at 7:36 PM, Michael McCandless
<[EMAIL PROTECTED]> wrote:
>
> Sorry, I should have said: you must always use the same writer, i.e. as of
> 2.3, while IndexWriter.optimize (or normal segment merging) is running,
> under one thread, another thread can use that *same* writer to
> add/delete/update documents, and both are free to make changes to the index.
>
> Before 2.3, optimize() was fully synchronized and blocked add/update/delete
> documents from changing the index until the optimize() call completed.
>
> So, your test is expected to fail: you're not allowed to open 2 "writers" on
> a single index at the same time, where "writer" includes an IndexReader that
> deletes documents; so those exceptions (LockObtainFailed, StaleReader) are
> expected.
>
> Mike
>
> 叶双明 wrote:
>
>> I don't agree with Michael McCandless. :)
>>
>> I know that after 2.3, adds and deletes can run in one IndexWriter at the
>> same time, and lucene also has an update method which deletes documents by
>> term and then adds the new document.
>>
>> In my test, I get either a LockObtainFailedException with the Thread.sleep
>> statement:
>>
>> org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out:
>> [EMAIL PROTECTED]:\index\write.lock
>> at org.apache.lucene.store.Lock.obtain(Lock.java:85)
>> at
>>
>> org.apache.lucene.index.DirectoryIndexReader.acquireWriteLock(DirectoryIndexReader.java:298)
>> at
>> org.apache.lucene.index.IndexReader.deleteDocument(IndexReader.java:750)
>> at
>> org.apache.lucene.index.IndexReader.deleteDocuments(IndexReader.java:786)
>> at org.test.IndexThread.run(IndexThread.java:33)
>>
>> or a StaleReaderException without the Thread.sleep statement:
>>
>> org.apache.lucene.index.StaleReaderException: IndexReader out of date and
>> no
>> longer valid for delete, undelete, or setNorm operations
>> at
>>
>> org.apache.lucene.index.DirectoryIndexReader.acquireWriteLock(DirectoryIndexReader.java:308)
>> at
>> org.apache.lucene.index.IndexReader.deleteDocument(IndexReader.java:750)
>> at
>> org.apache.lucene.index.IndexReader.deleteDocuments(IndexReader.java:786)
>> at org.test.IndexThread.run(IndexThread.java:31)
>>
>> My test code:
>>
>>
>> public class Main {
>>
>> public static void main(String[] args) throws IOException {
>>  Directory directory = FSDirectory.getDirectory("e:/index");
>>  IndexWriter writer = new IndexWriter(directory, null, false);
>>  Document document = new Document();
>>  document.add(new Field("bbb", "bbb", Store.YES, Index.UN_TOKENIZED));
>>  writer.addDocument(document);
>>
>>  Thread t = new IndexThread();
>>  t.start();
>>
>>  try {
>>  Thread.sleep(1000);
>>  } catch (InterruptedException e) {
>>  // TODO Auto-generated catch block
>>  e.printStackTrace();
>>  }
>>
>>  writer.optimize();
>>  writer.close();
>>  System.out.println("out");
>> }
>> }
>>
>> public class IndexThread extends Thread {
>>
>> @Override
>> public void run() {
>>  Directory directory;
>>  try {
>>  try {
>>   Thread.sleep(10);
>>  } catch (InterruptedException e) {
>>   // TODO Auto-generated catch block
>>   e.printStackTrace();
>>  }
>>
>>  directory = FSDirectory.getDirectory("e:/index");
>>  System.out.println("thread begin");
>>  //IndexWriter reader = new IndexWriter(directory, null, false);
>>  IndexReader reader = IndexReader.open(directory);
>>  Term term = new Term("bbb", "bbb");
>>  reader.deleteDocuments(term);
>>  reader.close();
>>  System.out.println("thread end");
>>  } catch (IOException e) {
>>  // TODO Auto-generated catch block
>>  e.printStackTrace();
>>  }
>> }
>> }
>>
>>
>>
>> 2008/9/4, Michael McCandless <[EMAIL PROTECTED]>:
>>>
>>>
>>> Actually, as of 2.3, this is no longer true: merges and optimizing run in
>>> the background, and allow add/update/delete documents to run at the same
>>> time.
>>>
>>> I think it's probably best to use application logic (outside of Lucene)
>>> to
>>> keep track of what updates happened to the master while the slave was
>>> optimizing.
>>>
>>> Mike
>>>
>>> 叶双明 wrote:
>>>
>>>> No documents can be added to the index while the index is optimizing, or
>>>> optimizing can't run while documents are being added to the index.
>>>> So, without any other error, I think we can believe the two indexes are
>>>> indeed the same.
>>>>
>>>> :)
>>>>
>>>> 2008/9/4 Noble Paul നോബിള്‍ नोब्ळ् <[EMAIL PROTECTED]>
>>>>
>>>>> The use case is as follows
>>>>>
>>>>> I have two indexes. One at the master and one at the slave. The user
>>>>> occasionally keeps committing on the master and the delta is replicated
>>>>> every time. But when the optimize happens, the transfer size can be
>>>>> really large. So I am thinking of doing the optimize separately on the
>>>>> master and the slave.
>>>>>
>>>>> So far, so good. But how can I really know that after the op