Re: How large an index can Lucene handle per server?
It depends what you call a server:
- 4 dual-Xeon CPUs, 64 GB RAM, and 1 TB of 15,000 rpm RAID-10 disks is one thing
- 1 P4, 512 MB RAM, a 40 GB 5,400 rpm disk, and Win2K is something else entirely

It depends on the index structure and the size of the documents you index/store.

It depends on the way you query your index:
- a simple TermQuery, top 500 by relevance, should be fast
- a complicated fuzzy and prefix query, sorted by a string field and retrieving 10k stored documents, will definitely be slow

It depends on what "slow" means for you... 1 ms, 50 ms, 1 s, 1 min?

I have seen indexes with 100 million documents and tens of GB in size with reasonable performance (on reasonable hardware).

On Tue, Mar 3, 2009 at 05:40, buddha1021 wrote:
>
> hi:
> How large an index can Lucene handle per server? I mean, up to what size
> does search stay fast, and beyond what per-server limit does a huge index
> make search slow?
> thank you!

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
Analyze other language using English Analyzer
Hello all, I am using the default English Snowball analyzer to index and search English documents. There is a chance I will also need to index European and Chinese documents. What would the impact be of using the English analyzer on European- or Chinese-language documents? Would indexing and search still work as expected? The application will be installed on an English OS, but the chance of receiving documents in other languages is high, and I will not be able to detect the language of a document. Regards, Ganesh
How to index Named Entities
I want to index document contents in two ways: once as plain content, and once as named entities. The scenario is this: given the document "the source of Nile is Ethiopia", I want to index "source" as normal content, "Nile" as a river name, and "Ethiopia" as a country name, so that if a question like "where is the source of Nile" is asked later, it retrieves "Ethiopia" as the answer. Note: I will have lists of river names, country names, etc., so that during indexing I can compare every word of a document against my lists. Thanks a lot, Seid M
--
"RABI ZIDNI ILMA"
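The list-lookup step described above can be sketched in a few lines. This is only a toy model with hypothetical field names and tiny in-line gazetteers (the real lists would come from the poster's own data); during indexing each token would be added to the plain contents field and, when a lookup hits, to an extra entity field as well:

```java
import java.util.Locale;
import java.util.Set;

// Sketch: route each token to an extra Lucene field name based on gazetteer
// lookup. "river" and "country" are hypothetical field names; the real lists
// of river and country names would be loaded from external data.
public class EntityFieldRouter {
    private static final Set<String> RIVERS = Set.of("nile", "amazon", "danube");
    private static final Set<String> COUNTRIES = Set.of("ethiopia", "egypt", "brazil");

    // Returns the extra field this token should be indexed under, or null
    // when the token is plain content only.
    public static String entityFieldFor(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        if (RIVERS.contains(t)) return "river";
        if (COUNTRIES.contains(t)) return "country";
        return null;
    }

    public static void main(String[] args) {
        for (String word : "the source of Nile is Ethiopia".split(" ")) {
            String field = entityFieldFor(word);
            System.out.println(word + " -> "
                + (field == null ? "contents only" : "contents + " + field));
        }
    }
}
```

A question such as "where is the source of Nile" could then be answered by searching the entity fields for documents that also match the plain-content terms.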
How large an index can Lucene handle per server?
hi:
How large an index can Lucene handle per server? I mean, up to what index size does search stay fast and unaffected by the index being huge? What is the per-server limit beyond which search becomes slow? Does any expert have experience to share? thank you!
Re: Indexing synonyms for multiple words
Thanks for your suggestion Michael and thanks to Uwe for clarifying. Payload is currently used to store only the start positions. What I gathered from your suggestion is that we could possibly store the end position, or span, or some other complex encoding in order to store the extra information. Am I right? --Sumukh Michael McCandless-2 wrote: > > > Since Lucene doesn't represent/store end position for a token, I don't > think the index can properly represent SYN spanning two positions? > > I suppose you could encode this into payloads, and create a custom > query that would look at the payload to enforce the constraint. > > Or, if you switch to doing SYN expansion only at runtime (not adding > it to the index), that might work. > > Mike > > Uwe Schindler wrote: > >> I think his problem is, that "SYN" is a synonym for the phrase "WORD1 >> WORD2". Using these positions, a phrase like "SYN WORD2" would also >> match >> (or other problems in queries that depend on order of words). >> >> Uwe >> >> - >> Uwe Schindler >> H.-H.-Meier-Allee 63, D-28213 Bremen >> http://www.thetaphi.de >> eMail: u...@thetaphi.de >> >>> -Original Message- >>> From: Michael McCandless [mailto:luc...@mikemccandless.com] >>> Sent: Monday, March 02, 2009 4:07 PM >>> To: java-user@lucene.apache.org >>> Subject: Re: Indexing synonyms for multiple words >>> >>> >>> Shouldn't WORD2's position be 1 more than your SYN? >>> >>> Ie, don't you want these positions?: >>> >>>WORD1 2 >>>WORD2 3 >>>SYN 2 >>> >>> The position is the starting position of the token; Lucene doesn't >>> store an ending position >>> >>> Mike >>> >>> Sumukh wrote: >>> Hi, I'm fairly new to Lucene. I'd like to know how we can index synonyms for multiple words. This is the scenario: Consider a sentence: AAA BBB WORD1 WORD2 EEE FFF GGG. Now assume the two words combined WORD1 WORD2 can be replaced by another word SYN. 
If I place SYN after WORD1 with positionIncrement set to 0, WORD2 will follow SYN, which is incorrect; and the other way round if I place it after WORD2. If any of you have solved a similar problem, I'd be thankful if you could shed some light on the solution. Regards, Sumukh
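Mike's payload suggestion above could start from something as small as encoding the synonym's span (how many source positions it covers, e.g. 2 for SYN replacing "WORD1 WORD2") into a few bytes. A minimal sketch of just that encoding, assuming the bytes would then be wrapped in a Lucene Payload on the SYN token and checked by a custom query:

```java
// Sketch: encode the number of positions a synonym token spans into a byte[]
// (big-endian int). In the indexing chain these bytes would become the SYN
// token's payload; a custom query could read them back and reject matches
// whose next term falls inside the spanned region (e.g. "SYN WORD2").
public class SpanPayload {
    public static byte[] encodeSpan(int span) {
        return new byte[] {
            (byte) (span >>> 24), (byte) (span >>> 16),
            (byte) (span >>> 8), (byte) span
        };
    }

    public static int decodeSpan(byte[] payload) {
        return ((payload[0] & 0xFF) << 24) | ((payload[1] & 0xFF) << 16)
             | ((payload[2] & 0xFF) << 8) | (payload[3] & 0xFF);
    }

    public static void main(String[] args) {
        System.out.println(decodeSpan(encodeSpan(2))); // prints 2
    }
}
```

The custom-query side is the harder part, as the thread notes; runtime expansion of synonyms avoids the problem entirely.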
Re: term position in phrase query using queryparser
On Feb 25, 2009, at 2:52 PM, Tim Williams wrote:

Is there a syntax to set the term position in a query built with the QueryParser? For example, I would like something like:

PhraseQuery q = new PhraseQuery();
q.add(t1, 0);
q.add(t2, 0);
q.setSlop(0);

As I understand it, the slop defaults to 0, but I don't know how to search for two tokens at the same term position using the QueryParser syntax.

I don't think this is available from the QueryParser. You could make a subclass that handles this for the phrase-query syntax: if you see something like "term1 term2" you can build your own Query and return it, but then you can't use normal phrase queries anymore... Either that, or write your own parser...
--
Matt Ronge
mro...@mronge.com
http://www.mronge.com
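The semantics Tim is after (two terms required at the same position, which `q.add(t1, 0); q.add(t2, 0)` expresses programmatically) can at least be pinned down with a toy model. This is pure Java with hypothetical names, not Lucene's implementation:

```java
import java.util.List;
import java.util.Map;

// Toy model of "two tokens at the same term position": a tiny index maps each
// term to the positions it occupies in a document, and a match requires both
// terms to share at least one position. This is what the programmatic
// PhraseQuery above asks for and what plain query syntax cannot express.
public class SamePositionMatch {
    public static boolean matchSamePosition(Map<String, List<Integer>> index,
                                            String t1, String t2) {
        List<Integer> p1 = index.get(t1);
        List<Integer> p2 = index.get(t2);
        if (p1 == null || p2 == null) return false;
        for (int p : p1) {
            if (p2.contains(p)) return true;  // shared position found
        }
        return false;
    }

    public static void main(String[] args) {
        // "automobile" indexed as a synonym at the same position as "car"
        Map<String, List<Integer>> index = Map.of(
            "car", List.of(3), "automobile", List.of(3), "red", List.of(2));
        System.out.println(matchSamePosition(index, "car", "automobile")); // true
        System.out.println(matchSamePosition(index, "car", "red"));        // false
    }
}
```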
Re: Confidence scores at search time
On 3/2/09 4:23 PM, "Ken Williams" wrote:
> On 3/2/09 1:58 PM, "Erik Hatcher" wrote:
>
>> On Mar 2, 2009, at 2:47 PM, Ken Williams wrote:
>>> In the output, I get explanations like "0.88922405 = (MATCH) product
>>> of:" with no details. Perhaps I need to do something different in
>>> indexing?
>>
>> Explanation.toString() only returns the first line. You can use
>> toString(int depth) or loop over all the getDetails(). toHtml()
>> returns a decently formatted tree of <div>s of the whole explanation
>> also.
>
> It looks like toString(int) is a protected method, and toHtml() only seems
> to return a single <div> with no content. I can start writing a recursive
> routine to dive down into getDetails(), but I thought there must be
> something easier.

Okay, silly me - notice that in my code I was printing the string with println(). I didn't realize println() truncated strings that contain newline characters (nor was I aware that the string had any newlines, I guess!). Once I ran it through replaceAll("\n", "\\n") I'm getting the output I need.

Thanks,
--
Ken Williams
Research Scientist
The Thomson Reuters Corporation
Eagan, MN
Re: Confidence scores at search time
On 3/2/09 1:58 PM, "Erik Hatcher" wrote:
> On Mar 2, 2009, at 2:47 PM, Ken Williams wrote:
>> In the output, I get explanations like "0.88922405 = (MATCH) product
>> of:" with no details. Perhaps I need to do something different in
>> indexing?
>
> Explanation.toString() only returns the first line. You can use
> toString(int depth) or loop over all the getDetails(). toHtml()
> returns a decently formatted tree of <div>s of the whole explanation
> also.

It looks like toString(int) is a protected method, and toHtml() only seems to return a single <div> with no content. I can start writing a recursive routine to dive down into getDetails(), but I thought there must be something easier.
--
Ken Williams
Research Scientist
The Thomson Reuters Corporation
Eagan, MN
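The recursive routine Ken mentions is only a few lines. A sketch against a stand-in tree class (hypothetical names; the real code would walk Lucene's Explanation via getValue(), getDescription(), and getDetails()):

```java
import java.util.List;

// Stand-in for Lucene's Explanation tree: a value, a description, and child
// details. dump() indents each level, mirroring what a recursive loop over
// Explanation.getDetails() would print.
public class ExplanationDump {
    final float value;
    final String description;
    final List<ExplanationDump> details;

    ExplanationDump(float value, String description, List<ExplanationDump> details) {
        this.value = value;
        this.description = description;
        this.details = details;
    }

    public static String dump(ExplanationDump e, int depth) {
        StringBuilder sb = new StringBuilder();
        sb.append("  ".repeat(depth))
          .append(e.value).append(" = ").append(e.description).append("\n");
        for (ExplanationDump child : e.details) {
            sb.append(dump(child, depth + 1));  // recurse into the details
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ExplanationDump leaf = new ExplanationDump(0.5f, "tf(termFreq=1)", List.of());
        ExplanationDump root =
            new ExplanationDump(0.889f, "(MATCH) product of:", List.of(leaf));
        System.out.print(dump(root, 0));
    }
}
```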
Re: Confidence scores at search time
On 3/2/09 4:19 PM, "Steven A Rowe" wrote: > On 3/2/2009 at 4:22 PM, Grant Ingersoll wrote: >> On Mar 2, 2009, at 2:47 PM, Ken Williams wrote: >>> Also, while perusing the threads you refer to below, I saw a >>> reference to the following link, which seems to have gone dead: >>> >>> https://issues.apache.org/bugzilla/show_bug.cgi?id=31841 >> >> Hmm, bugzilla has moved to JIRA. I'm not sure where the mapping is >> anymore. There used to be a Bugzilla Id in JIRA, I think. Sorry. > > http://issues.apache.org/jira/browse/LUCENE-295 > Great, thanks! -- Ken Williams Research Scientist The Thomson Reuters Corporation Eagan, MN
RE: Confidence scores at search time
On 3/2/2009 at 4:22 PM, Grant Ingersoll wrote: > On Mar 2, 2009, at 2:47 PM, Ken Williams wrote: > > Also, while perusing the threads you refer to below, I saw a > > reference to the following link, which seems to have gone dead: > > > > https://issues.apache.org/bugzilla/show_bug.cgi?id=31841 > > Hmm, bugzilla has moved to JIRA. I'm not sure where the mapping is > anymore. There used to be a Bugzilla Id in JIRA, I think. Sorry. http://issues.apache.org/jira/browse/LUCENE-295 I found this by looking up the issue number in the map of Bugzilla -> JIRA issue numbers I put into the changes2html.pl script[1], so that linkification of old Bugzilla issues would continue to work in the Changes.html[2] it generates from CHANGES.txt[3]. Bug 31841 is mentioned (and now linked to LUCENE-295 in Changes.html) as item #4 under the "Changes in runtime behavior" section of the release notes for Release 1.9 RC1 - see [2]. Steve [1] changes2html.pl (look for "setup_bugzilla_jira_map" at the bottom of the file): http://svn.apache.org/viewvc/lucene/java/trunk/src/site/changes/changes2html.pl?view=markup [2] Changes.html: http://lucene.apache.org/java/2_4_0/changes/Changes.html [3] CHANGES.txt: http://svn.apache.org/viewvc/lucene/java/trunk/CHANGES.txt?view=markup
Re: Faceted Search using Lucene
So then all is good; we were only pursuing this to explain it. Now that we know your directories are empty at startup, that explains it. So you should call maybeReopen() inside get(), as long as it does not slow queries down.

Mike

Amin Mohammed-Coleman wrote:

I think that is the case. When my SearchManager is initialised the directories are empty, so when I do a get() nothing is present. Subsequent calls seem to work. Is there something I can do, or do I accept this, or just do a maybeReopen() and then a get()? As you mentioned it depends on timing, but I would be keen to know what the best practice would be in this situation...

Cheers

On Mon, Mar 2, 2009 at 8:43 PM, Michael McCandless <luc...@mikemccandless.com> wrote:

Well the code looks fine. I can't explain why you see no search results if you don't call maybeReopen() in get(), unless at the time you first create SearcherManager the Directories each have an empty index in them.

Mike

Amin Mohammed-Coleman wrote:

Hi, Here is the code that I am using; I've modified the get() method to include the maybeReopen() call. Again I'm not sure if this is a good idea.

public Summary[] search(final SearchRequest searchRequest) throws SearchExecutionException {
final String searchTerm = searchRequest.getSearchTerm();
if (StringUtils.isBlank(searchTerm)) {
throw new SearchExecutionException("Search string cannot be empty. 
There will be too many results to process."); } List summaryList = new ArrayList(); StopWatch stopWatch = new StopWatch("searchStopWatch"); stopWatch.start(); MultiSearcher multiSearcher = get(); try { LOGGER.debug("Ensuring all index readers are up to date..."); Query query = queryParser.parse(searchTerm); LOGGER.debug("Search Term '" + searchTerm +"' > Lucene Query '" + query.toString() +"'"); Sort sort = null; sort = applySortIfApplicable(searchRequest); Filter[] filters =applyFiltersIfApplicable(searchRequest); ChainedFilter chainedFilter = null; if (filters != null) { chainedFilter = new ChainedFilter(filters, ChainedFilter.OR); } TopDocs topDocs = multiSearcher.search(query,chainedFilter , 100,sort); ScoreDoc[] scoreDocs = topDocs.scoreDocs; LOGGER.debug("total number of hits for [" + query.toString() + " ] = "+topDocs. totalHits); for (ScoreDoc scoreDoc : scoreDocs) { final Document doc = multiSearcher.doc(scoreDoc.doc); float score = scoreDoc.score; final BaseDocument baseDocument = new BaseDocument(doc, score); Summary documentSummary = new DocumentSummaryImpl(baseDocument); summaryList.add(documentSummary); } } catch (Exception e) { throw new IllegalStateException(e); } finally { if (multiSearcher != null) { release(multiSearcher); } } stopWatch.stop(); LOGGER.debug("total time taken for document seach: " + stopWatch.getTotalTimeMillis() + " ms"); return summaryList.toArray(new Summary[] {}); } @Autowired public void setDirectories(@Qualifier("directories")ListFactoryBean listFactoryBean) throws Exception { this.directories = (List) listFactoryBean.getObject(); } @PostConstruct public void initialiseDocumentSearcher() { StopWatch stopWatch = new StopWatch("document-search-initialiser"); stopWatch.start(); PerFieldAnalyzerWrapper analyzerWrapper = new PerFieldAnalyzerWrapper( analyzer); analyzerWrapper.addAnalyzer(FieldNameEnum.TYPE.getDescription(), newKeywordAnalyzer()); queryParser = newMultiFieldQueryParser(FieldNameEnum.fieldNameDescriptions(), 
analyzerWrapper); try { LOGGER.debug("Initialising document searcher "); documentSearcherManagers = new DocumentSearcherManager[directories.size()]; for (int i = 0; i < directories.size() ;i++) { Directory directory = directories.get(i); DocumentSearcherManager documentSearcherManager = newDocumentSearcherManager(directory); documentSearcherManagers[i]=documentSearcherManager; } LOGGER.debug("Document searcher initialised"); } catch (IOException e) { throw new IllegalStateException(e); } stopWatch.stop(); LOGGER.debug("Total time taken to initialise DocumentSearcher '" + stopWatch.getTotalTimeMillis() +"' ms."); } private void maybeReopen() throws SearchExecutionException { LOGGER.debug("Initiating reopening of index readers..."); for (DocumentSearcherManager documentSearcherManager : documentSearcherManagers) { try { documentSearcherManager.maybeReopen(); } catch (InterruptedException e) { throw new SearchExecutionException(e); } catch (IOException e) { throw new SearchExecutionException(e); } } LOGGER.debug("reopening of index readers complete."); } private void release(MultiSearcher multiSeacher) { IndexSearcher[] indexSearchers = (IndexSearcher[]) multiSeacher.getSearchables(); for(int i =0 ; i < indexSearchers.length;i++) { try { documentSearcherManagers[i].release(indexSearchers[i]); } catch (IOException e) { throw new IllegalStateException(e); } } } private MultiSearcher get() throws SearchExecutionException { maybeReopen(); MultiSearcher multiSearcher =
Re: Faceted Search using Lucene
I think that is the case. When my SearchManager is initialised the directories are empty so when I do a get() nothing is present. Subsequent calls seem to work. Is there something I can do? or do I accept this or just do a maybeReopen and do a get(). As you mentioned it depends on timiing but I would be keen to know what the best practice would be in this situation... Cheers On Mon, Mar 2, 2009 at 8:43 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > > Well the code looks fine. > > I can't explain why you see no search results if you don't call > maybeReopen() in get, unless at the time you first create SearcherManager > the Directories each have an empty index in them. > > Mike > > Amin Mohammed-Coleman wrote: > > Hi >> Here is the code that I am using, I've modified the get() method to >> include >> the maybeReopen() call. Again I'm not sure if this is a good idea. >> >> public Summary[] search(final SearchRequest searchRequest) >> throwsSearchExecutionException { >> >> final String searchTerm = searchRequest.getSearchTerm(); >> >> if (StringUtils.isBlank(searchTerm)) { >> >> throw new SearchExecutionException("Search string cannot be empty. 
There >> will be too many results to process."); >> >> } >> >> List summaryList = new ArrayList(); >> >> StopWatch stopWatch = new StopWatch("searchStopWatch"); >> >> stopWatch.start(); >> >> MultiSearcher multiSearcher = get(); >> >> try { >> >> LOGGER.debug("Ensuring all index readers are up to date..."); >> >> Query query = queryParser.parse(searchTerm); >> >> LOGGER.debug("Search Term '" + searchTerm +"' > Lucene Query '" + >> query.toString() +"'"); >> >> Sort sort = null; >> >> sort = applySortIfApplicable(searchRequest); >> >> Filter[] filters =applyFiltersIfApplicable(searchRequest); >> >> ChainedFilter chainedFilter = null; >> >> if (filters != null) { >> >> chainedFilter = new ChainedFilter(filters, ChainedFilter.OR); >> >> } >> >> TopDocs topDocs = multiSearcher.search(query,chainedFilter ,100,sort); >> >> ScoreDoc[] scoreDocs = topDocs.scoreDocs; >> >> LOGGER.debug("total number of hits for [" + query.toString() + " ] = >> "+topDocs. >> totalHits); >> >> for (ScoreDoc scoreDoc : scoreDocs) { >> >> final Document doc = multiSearcher.doc(scoreDoc.doc); >> >> float score = scoreDoc.score; >> >> final BaseDocument baseDocument = new BaseDocument(doc, score); >> >> Summary documentSummary = new DocumentSummaryImpl(baseDocument); >> >> summaryList.add(documentSummary); >> >> } >> >> } catch (Exception e) { >> >> throw new IllegalStateException(e); >> >> } finally { >> >> if (multiSearcher != null) { >> >> release(multiSearcher); >> >> } >> >> } >> >> stopWatch.stop(); >> >> LOGGER.debug("total time taken for document seach: " + >> stopWatch.getTotalTimeMillis() + " ms"); >> >> return summaryList.toArray(new Summary[] {}); >> >> } >> >> >> @Autowired >> >> public void setDirectories(@Qualifier("directories")ListFactoryBean >> listFactoryBean) throws Exception { >> >> this.directories = (List) listFactoryBean.getObject(); >> >> } >> >> @PostConstruct >> >> public void initialiseDocumentSearcher() { >> >> StopWatch stopWatch = new 
StopWatch("document-search-initialiser"); >> >> stopWatch.start(); >> >> PerFieldAnalyzerWrapper analyzerWrapper = new PerFieldAnalyzerWrapper( >> analyzer); >> >> analyzerWrapper.addAnalyzer(FieldNameEnum.TYPE.getDescription(), >> newKeywordAnalyzer()); >> >> queryParser = >> newMultiFieldQueryParser(FieldNameEnum.fieldNameDescriptions(), >> analyzerWrapper); >> >> try { >> >> LOGGER.debug("Initialising document searcher "); >> >> documentSearcherManagers = new >> DocumentSearcherManager[directories.size()]; >> >> for (int i = 0; i < directories.size() ;i++) { >> >> Directory directory = directories.get(i); >> >> DocumentSearcherManager documentSearcherManager = >> newDocumentSearcherManager(directory); >> >> documentSearcherManagers[i]=documentSearcherManager; >> >> } >> >> LOGGER.debug("Document searcher initialised"); >> >> } catch (IOException e) { >> >> throw new IllegalStateException(e); >> >> } >> >> stopWatch.stop(); >> >> LOGGER.debug("Total time taken to initialise DocumentSearcher '" + >> stopWatch.getTotalTimeMillis() +"' ms."); >> >> } >> >> private void maybeReopen() throws SearchExecutionException { >> >> LOGGER.debug("Initiating reopening of index readers..."); >> >> for (DocumentSearcherManager documentSearcherManager : >> documentSearcherManagers) { >> >> try { >> >> documentSearcherManager.maybeReopen(); >> >> } catch (InterruptedException e) { >> >> throw new SearchExecutionException(e); >> >> } catch (IOException e) { >> >> throw new SearchExecutionException(e); >> >> } >> >> } >> >> LOGGER.debug("reopening of index readers complete."); >> >> } >> >> >> >> private void release(MultiSearcher multiSeacher) { >> >> IndexSearcher[] indexSearchers = (IndexSearcher[]) >> multiSeacher.getSearchables(); >> >> for(int i =0 ; i < indexSearchers.length;i++) { >> >> try { >> >> documentSearcherMa
Re: Confidence scores at search time
On Mar 2, 2009, at 2:47 PM, Ken Williams wrote:

Hi Grant, It's true, I may have an X-Y problem here. =) My basic need is to sacrifice recall to achieve greater precision. Rather than always presenting the user with the top N documents, I need to return *only* the documents that seem relevant. For some searches this may be 3 documents, for some it may be none.

Therein lies the rub. How are you determining what is relevant? In some sense, you are asking Lucene to determine what is relevant and then turning around and telling it you are not happy with it doing what you told it to do (I'm exaggerating a bit, I know), namely telling you which documents are relevant for a given query and a set of documents, based on its scoring model.

As an alternate tack, I usually look at this type of thing and try to figure out a way to make my queries more precise (e.g. replace OR with AND, introduce phrase queries, add filter or NOT clauses or some other qualifiers) or apply some other relevance tricks [1], [2].

That being said, I could see determining a delta value such that if the distance between any two scores is more than the delta, you cut off the rest of the docs. This takes into account the relative spread of the scores and is not some arbitrary absolute value (although the delta itself is arbitrary, of course). Since you are allowing the user to "explore", it may be reasonable to cut off at some point, too, but I still don't know of a good way to determine what that point is in a generic way. Maybe with some specific knowledge about how you are creating your queries and which query terms matched you could come up with something, but still, I am uncertain.

The other thing that strikes me is that you could add some type of learning/memory component that tracks your click-through information and gives feedback into the system about relevance. 
My user interface in this case isn't the standard "type words in a box and we'll show you the best docs" - I'm using Lucene as a tool in the background to do some exploration of how I could augment a set of traditional results with a few alternative results gleaned from a different path. Not sure if this helps with the X-Y problem, but that's my task at hand.

Yes. Also, keep in mind there are other techniques for encouraging exploration: clustering, faceting, information extraction (identifying named entities, etc., and presenting them). Just throwing out some food for thought.

Also, while perusing the threads you refer to below, I saw a reference to the following link, which seems to have gone dead: https://issues.apache.org/bugzilla/show_bug.cgi?id=31841

Hmm, Bugzilla has moved to JIRA. I'm not sure where the mapping is anymore. There used to be a Bugzilla Id in JIRA, I think. Sorry.

-Grant

[1] http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Debugging-Relevance-Issues-in-Search/
[2] http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Optimizing-Findability-in-Lucene-and-Solr/
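Grant's delta idea above can be sketched directly: walk the scores in descending order and stop as soon as the gap between two consecutive scores exceeds a threshold. The 0.2 delta here is an arbitrary placeholder, not a recommended value:

```java
import java.util.ArrayList;
import java.util.List;

// Score-delta cutoff: keep the leading results until the score "falls off a
// cliff", i.e. the drop between two consecutive (descending) scores exceeds
// delta. The top hit is always kept.
public class ScoreDeltaCutoff {
    public static List<Float> cutoff(List<Float> descendingScores, float delta) {
        List<Float> kept = new ArrayList<>();
        for (int i = 0; i < descendingScores.size(); i++) {
            if (i > 0 && descendingScores.get(i - 1) - descendingScores.get(i) > delta) {
                break;  // big gap: everything from here on is cut off
            }
            kept.add(descendingScores.get(i));
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(cutoff(List.of(0.9f, 0.85f, 0.4f, 0.39f), 0.2f));
        // prints [0.9, 0.85]
    }
}
```

In a real application the floats would come from TopDocs.scoreDocs, and as the thread notes, choosing the delta is itself a judgment call.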
Re: Extracting TFIDF vectors
Have a look at the MoreLikeThis contrib module in the contrib section of Lucene. You can start with that, and then do the additions and subtractions from there. On Mar 2, 2009, at 9:35 AM, Gregory Gay wrote: Hi, I'm a complete novice at Lucene, and I'm looking for a little bit of help with something. How can I extract the TF*IDF vector for each document in the indexed collection? Also for the query? I need to build a user-feedback system which manipulates the query based on the liked and disliked documents from the local collection. This query modification uses the TF*IDF vectors. Thanks for your help! -- Gregory Gay Editor - 4 Color Rebellion (http://www.4colorrebellion.com) Research Assistant - WVU CSEE -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
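For reference, the vector Gregory asks about is just per-term tf x idf. A self-contained sketch of the arithmetic (in Lucene the raw counts would come from a TermFreqVector and IndexReader.docFreq(); Lucene's own Similarity uses a slightly different idf formula, so treat this as the general shape, not Lucene's exact scoring):

```java
import java.util.HashMap;
import java.util.Map;

// tf-idf per term: tf = raw term frequency in the document,
// idf = log(numDocs / docFreq). Common terms ("the") get weight near zero,
// rare discriminative terms get high weight.
public class TfIdf {
    public static Map<String, Double> vector(Map<String, Integer> termFreqs,
                                             Map<String, Integer> docFreqs,
                                             int numDocs) {
        Map<String, Double> v = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            int df = docFreqs.getOrDefault(e.getKey(), 1);
            v.put(e.getKey(), e.getValue() * Math.log((double) numDocs / df));
        }
        return v;
    }

    public static void main(String[] args) {
        Map<String, Double> v = vector(Map.of("lucene", 3, "the", 3),
                                       Map.of("lucene", 10, "the", 1000), 1000);
        System.out.println(v.get("lucene") > v.get("the")); // prints true
    }
}
```

The relevance-feedback step (adding liked-document terms, subtracting disliked ones) then becomes component-wise addition and subtraction of these vectors, which is essentially what MoreLikeThis automates.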
Re: Faceted Search using Lucene
Well the code looks fine. I can't explain why you see no search results if you don't call maybeReopen() in get, unless at the time you first create SearcherManager the Directories each have an empty index in them. Mike Amin Mohammed-Coleman wrote: Hi Here is the code that I am using, I've modified the get() method to include the maybeReopen() call. Again I'm not sure if this is a good idea. public Summary[] search(final SearchRequest searchRequest) throwsSearchExecutionException { final String searchTerm = searchRequest.getSearchTerm(); if (StringUtils.isBlank(searchTerm)) { throw new SearchExecutionException("Search string cannot be empty. There will be too many results to process."); } List summaryList = new ArrayList(); StopWatch stopWatch = new StopWatch("searchStopWatch"); stopWatch.start(); MultiSearcher multiSearcher = get(); try { LOGGER.debug("Ensuring all index readers are up to date..."); Query query = queryParser.parse(searchTerm); LOGGER.debug("Search Term '" + searchTerm +"' > Lucene Query '" + query.toString() +"'"); Sort sort = null; sort = applySortIfApplicable(searchRequest); Filter[] filters =applyFiltersIfApplicable(searchRequest); ChainedFilter chainedFilter = null; if (filters != null) { chainedFilter = new ChainedFilter(filters, ChainedFilter.OR); } TopDocs topDocs = multiSearcher.search(query,chainedFilter ,100,sort); ScoreDoc[] scoreDocs = topDocs.scoreDocs; LOGGER.debug("total number of hits for [" + query.toString() + " ] = "+topDocs. 
totalHits); for (ScoreDoc scoreDoc : scoreDocs) { final Document doc = multiSearcher.doc(scoreDoc.doc); float score = scoreDoc.score; final BaseDocument baseDocument = new BaseDocument(doc, score); Summary documentSummary = new DocumentSummaryImpl(baseDocument); summaryList.add(documentSummary); } } catch (Exception e) { throw new IllegalStateException(e); } finally { if (multiSearcher != null) { release(multiSearcher); } } stopWatch.stop(); LOGGER.debug("total time taken for document seach: " + stopWatch.getTotalTimeMillis() + " ms"); return summaryList.toArray(new Summary[] {}); } @Autowired public void setDirectories(@Qualifier("directories")ListFactoryBean listFactoryBean) throws Exception { this.directories = (List) listFactoryBean.getObject(); } @PostConstruct public void initialiseDocumentSearcher() { StopWatch stopWatch = new StopWatch("document-search-initialiser"); stopWatch.start(); PerFieldAnalyzerWrapper analyzerWrapper = new PerFieldAnalyzerWrapper( analyzer); analyzerWrapper.addAnalyzer(FieldNameEnum.TYPE.getDescription(), newKeywordAnalyzer()); queryParser = newMultiFieldQueryParser(FieldNameEnum.fieldNameDescriptions(), analyzerWrapper); try { LOGGER.debug("Initialising document searcher "); documentSearcherManagers = new DocumentSearcherManager[directories.size()]; for (int i = 0; i < directories.size() ;i++) { Directory directory = directories.get(i); DocumentSearcherManager documentSearcherManager = newDocumentSearcherManager(directory); documentSearcherManagers[i]=documentSearcherManager; } LOGGER.debug("Document searcher initialised"); } catch (IOException e) { throw new IllegalStateException(e); } stopWatch.stop(); LOGGER.debug("Total time taken to initialise DocumentSearcher '" + stopWatch.getTotalTimeMillis() +"' ms."); } private void maybeReopen() throws SearchExecutionException { LOGGER.debug("Initiating reopening of index readers..."); for (DocumentSearcherManager documentSearcherManager : documentSearcherManagers) { try { 
documentSearcherManager.maybeReopen(); } catch (InterruptedException e) { throw new SearchExecutionException(e); } catch (IOException e) { throw new SearchExecutionException(e); } } LOGGER.debug("reopening of index readers complete."); } private void release(MultiSearcher multiSeacher) { IndexSearcher[] indexSearchers = (IndexSearcher[]) multiSeacher.getSearchables(); for(int i =0 ; i < indexSearchers.length;i++) { try { documentSearcherManagers[i].release(indexSearchers[i]); } catch (IOException e) { throw new IllegalStateException(e); } } } private MultiSearcher get() throws SearchExecutionException { maybeReopen(); MultiSearcher multiSearcher = null; List listOfIndexSeachers = new ArrayList(); for (DocumentSearcherManager documentSearcherManager : documentSearcherManagers) { listOfIndexSeachers.add(documentSearcherManager.get()); } try { multiSearcher = new MultiSearcher(listOfIndexSeachers.toArray(newIndexSearcher[] {})); } catch (IOException e) { throw new SearchExecutionException(e); } return multiSearcher; } Hope there is enough information. Cheers Amin P.S. I will continue to debug. On Mon, Mar 2, 2009 at 6:55 PM, Michael McCandless < luc...@mikemccandless.com> wrote: It makes perfect sense to call maybeReopen() followed by get(), as long as maybeReopen() is never slow enough to be noticeable to an end user (because you are mak
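The maybeReopen()/get()/release() discipline running through this thread boils down to reference counting. A minimal pure-Java model with hypothetical names (the real SearcherManager-style code wraps an IndexSearcher, and maybeReopen() swaps in a fresh one while old searchers drain):

```java
// Model of the get()/release() pattern: get() hands out the current resource
// and bumps its reference count; release() drops it; the resource is only
// closed once the owner AND every outstanding get() have released it. This is
// why an in-flight search keeps working across a reopen.
public class RefCountedResource {
    private int refCount = 1;   // the owner's (manager's) reference
    private boolean closed = false;

    public synchronized RefCountedResource get() {
        refCount++;
        return this;
    }

    public synchronized void release() {
        if (--refCount == 0) {
            closed = true;      // real code: searcher.close()
        }
    }

    public synchronized boolean isClosed() {
        return closed;
    }

    public static void main(String[] args) {
        RefCountedResource current = new RefCountedResource();
        RefCountedResource searcher = current.get(); // a search in flight
        current.release();  // maybeReopen() swapped in a new resource
        System.out.println(current.isClosed()); // prints false: search still holds it
        searcher.release();
        System.out.println(current.isClosed()); // prints true: last reference gone
    }
}
```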
Re: Marking commit points as deleted does not clean up on IW.close
You mean on calling IndexWriter.close, with a deletion policy that's functionally equivalent to KeepOnlyLastCommitDeletionPolicy, you somehow see the last 2 commits remaining in the Directory once IndexWriter is done closing? That's odd. Are you sure onCommit() is really calling delete() on all the IndexCommits except the last one? Can you post the source for the deletion policy?

Mike

Shalin Shekhar Mangar wrote:

Hello, In Solr, when a user calls commit, the IndexWriter is closed (causing a commit). It is opened again only when another document is added or a delete is performed. In order to support replication, Solr trunk now uses a deletion policy. The default policy is (should be?) equivalent to KeepOnlyLastCommitDeletionPolicy. However, once a commit is performed, we see that the last two commit points are being kept. The second-to-last one is cleaned up only once the IndexWriter is opened again. It'd be great if someone could suggest what we might be doing wrong. For the time being, we can work around this by using IW.commit and keeping the IW open.
--
Regards,
Shalin Shekhar Mangar.
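What the default policy is supposed to do fits in a few lines. A pure-Java model of the onCommit() contract (the real hook receives a List of IndexCommit objects in chronological order and calls delete() on each stale one; names here are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Model of KeepOnlyLastCommitDeletionPolicy.onCommit(): given commits in
// chronological order, delete every commit except the newest. A policy that
// instead preserved the last two entries would reproduce the symptom Shalin
// describes (two commit points surviving IndexWriter.close()).
public class KeepOnlyLastCommit {
    public static List<String> onCommit(List<String> commits) {
        List<String> deleted = new ArrayList<>();
        for (int i = 0; i < commits.size() - 1; i++) {
            deleted.add(commits.get(i)); // real code: commits.get(i).delete()
        }
        return deleted;
    }

    public static void main(String[] args) {
        System.out.println(onCommit(List.of("segments_1", "segments_2", "segments_3")));
        // prints [segments_1, segments_2]
    }
}
```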
Re: Faceted Search using Lucene
Hi, here is the code that I am using; I've modified the get() method to include the maybeReopen() call. Again, I'm not sure if this is a good idea.

public Summary[] search(final SearchRequest searchRequest) throws SearchExecutionException {
  final String searchTerm = searchRequest.getSearchTerm();
  if (StringUtils.isBlank(searchTerm)) {
    throw new SearchExecutionException("Search string cannot be empty. There will be too many results to process.");
  }
  List summaryList = new ArrayList();
  StopWatch stopWatch = new StopWatch("searchStopWatch");
  stopWatch.start();
  MultiSearcher multiSearcher = get();
  try {
    LOGGER.debug("Ensuring all index readers are up to date...");
    Query query = queryParser.parse(searchTerm);
    LOGGER.debug("Search Term '" + searchTerm + "' > Lucene Query '" + query.toString() + "'");
    Sort sort = applySortIfApplicable(searchRequest);
    Filter[] filters = applyFiltersIfApplicable(searchRequest);
    ChainedFilter chainedFilter = null;
    if (filters != null) {
      chainedFilter = new ChainedFilter(filters, ChainedFilter.OR);
    }
    TopDocs topDocs = multiSearcher.search(query, chainedFilter, 100, sort);
    ScoreDoc[] scoreDocs = topDocs.scoreDocs;
    LOGGER.debug("total number of hits for [" + query.toString() + " ] = " + topDocs.totalHits);
    for (ScoreDoc scoreDoc : scoreDocs) {
      final Document doc = multiSearcher.doc(scoreDoc.doc);
      float score = scoreDoc.score;
      final BaseDocument baseDocument = new BaseDocument(doc, score);
      Summary documentSummary = new DocumentSummaryImpl(baseDocument);
      summaryList.add(documentSummary);
    }
  } catch (Exception e) {
    throw new IllegalStateException(e);
  } finally {
    if (multiSearcher != null) {
      release(multiSearcher);
    }
  }
  stopWatch.stop();
  LOGGER.debug("total time taken for document search: " + stopWatch.getTotalTimeMillis() + " ms");
  return summaryList.toArray(new Summary[] {});
}

@Autowired
public void setDirectories(@Qualifier("directories") ListFactoryBean listFactoryBean) throws Exception {
  this.directories = (List) listFactoryBean.getObject();
}

@PostConstruct
public void initialiseDocumentSearcher() {
  StopWatch stopWatch = new StopWatch("document-search-initialiser");
  stopWatch.start();
  PerFieldAnalyzerWrapper analyzerWrapper = new PerFieldAnalyzerWrapper(analyzer);
  analyzerWrapper.addAnalyzer(FieldNameEnum.TYPE.getDescription(), new KeywordAnalyzer());
  queryParser = new MultiFieldQueryParser(FieldNameEnum.fieldNameDescriptions(), analyzerWrapper);
  try {
    LOGGER.debug("Initialising document searcher");
    documentSearcherManagers = new DocumentSearcherManager[directories.size()];
    for (int i = 0; i < directories.size(); i++) {
      Directory directory = directories.get(i);
      documentSearcherManagers[i] = new DocumentSearcherManager(directory);
    }
    LOGGER.debug("Document searcher initialised");
  } catch (IOException e) {
    throw new IllegalStateException(e);
  }
  stopWatch.stop();
  LOGGER.debug("Total time taken to initialise DocumentSearcher '" + stopWatch.getTotalTimeMillis() + "' ms.");
}

private void maybeReopen() throws SearchExecutionException {
  LOGGER.debug("Initiating reopening of index readers...");
  for (DocumentSearcherManager documentSearcherManager : documentSearcherManagers) {
    try {
      documentSearcherManager.maybeReopen();
    } catch (InterruptedException e) {
      throw new SearchExecutionException(e);
    } catch (IOException e) {
      throw new SearchExecutionException(e);
    }
  }
  LOGGER.debug("reopening of index readers complete.");
}

private void release(MultiSearcher multiSeacher) {
  IndexSearcher[] indexSearchers = (IndexSearcher[]) multiSeacher.getSearchables();
  for (int i = 0; i < indexSearchers.length; i++) {
    try {
      documentSearcherManagers[i].release(indexSearchers[i]);
    } catch (IOException e) {
      throw new IllegalStateException(e);
    }
  }
}

private MultiSearcher get() throws SearchExecutionException {
  maybeReopen();
  MultiSearcher multiSearcher = null;
  List listOfIndexSeachers = new ArrayList();
  for (DocumentSearcherManager documentSearcherManager : documentSearcherManagers) {
    listOfIndexSeachers.add(documentSearcherManager.get());
  }
  try {
    multiSearcher = new MultiSearcher(listOfIndexSeachers.toArray(new IndexSearcher[] {}));
  } catch (IOException e) {
    throw new SearchExecutionException(e);
  }
  return multiSearcher;
}

Hope there is enough information. Cheers Amin P.S. I will continue to debug. On Mon, Mar 2, 2009 at 6:55 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > > It makes perfect sense to call maybeReopen() followed by get(), as long as > maybeReopen() is never slow enough to be noticeable to an end user (because > you are making random queries pay the reopen/warming cost). > > If you call maybeReopen() after get(), then that search will not see the > newly opened readers, but the next search will. > > I'm just thinking that since you see no results with get() alone, debug > that case first. Then put back the maybeReopen().
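The get()/release() pairing in the code above follows the reference-counting discipline behind the SearcherManager pattern being discussed in this thread. A minimal, Lucene-free sketch of that contract (RefCounted is a hypothetical stand-in; a real manager would hold an IndexSearcher and close it when the count reaches zero):

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Minimal reference-counted holder, mimicking the get()/release() contract. */
class RefCounted<T> {
    private final T resource;
    // starts at 1: the manager itself holds one reference
    private final AtomicInteger refs = new AtomicInteger(1);

    RefCounted(T resource) { this.resource = resource; }

    /** Callers must pair every get() with exactly one release(). */
    T get() {
        refs.incrementAndGet();
        return resource;
    }

    /** Returns true when the last reference is dropped and the resource can be closed. */
    boolean release() {
        return refs.decrementAndGet() == 0;
    }

    int refCount() { return refs.get(); }
}
```

The point of the pattern is that a searcher handed out by get() stays valid for the duration of one search even if a reopen swaps in a new one concurrently; release() lets the old searcher be closed only after the last in-flight search finishes.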
Marking commit points as deleted does not clean up on IW.close
Hello, In Solr, when a user calls commit, the IndexWriter is closed (causing a commit). It is opened again only when another document is added or, a delete is performed. In order to support replication, Solr trunk now uses a deletion policy. The default policy is (should be?) equivalent to KeepOnlyLastCommitDeletionPolicy. However, once a commit is performed, we see that the last two commit points are being kept back. The 2nd last one is cleaned up once the IndexWriter is opened again. It'd be great if someone can suggest on what we might be doing wrong. For the time being, we can work around this by using IW.commit and keeping the IW open. -- Regards, Shalin Shekhar Mangar.
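For reference, the behaviour a KeepOnlyLastCommitDeletionPolicy-equivalent is expected to have can be sketched without Lucene classes. FakeCommit and KeepLastOnly below are hypothetical stand-ins for IndexCommit and the policy's onCommit() callback, which receives the commit points ordered oldest first; if a policy does this and two commits still survive IndexWriter.close(), something else is retaining them:

```java
import java.util.List;

/** Stand-in for Lucene's IndexCommit; delete() flags the commit point for removal. */
class FakeCommit {
    boolean deleted;
    void delete() { deleted = true; }
}

/** The logic of keep-only-last: every commit except the newest gets deleted. */
class KeepLastOnly {
    static void onCommit(List<FakeCommit> commits) {
        // commits arrive ordered oldest-first; keep only the final entry
        for (int i = 0; i < commits.size() - 1; i++) {
            commits.get(i).delete();
        }
    }
}
```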
Re: Confidence scores at search time
On Mar 2, 2009, at 2:47 PM, Ken Williams wrote: Finally, I seem unable to get Searcher.explain() to do much useful - my code looks like: Searcher searcher = new IndexSearcher(reader); QueryParser parser = new QueryParser(LuceneIndex.CONTENT, analyzer); Query query = parser.parse(queryString); TopDocCollector collector = new TopDocCollector(n); searcher.search(query, collector); for ( ScoreDoc d : collector.topDocs().scoreDocs ) { String explanation = searcher.explain(query, d.doc).toString(); Field id = searcher.doc( d.doc ).getField( LuceneIndex.ID ); System.out.println(id + "\t" + d.score + "\t" + explanation); } In the output, I get explanations like "0.88922405 = (MATCH) product of:" with no details. Perhaps I need to do something different in indexing? Explanation.toString() only returns the first line. You can use toString(int depth) or loop over all the getDetails(). toHtml() returns a decently formatted HTML tree of the whole explanation also. Erik - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Confidence scores at search time
Hi Grant, It's true, I may have an X-Y problem here. =) My basic need is to sacrifice recall to achieve greater precision. Rather than always presenting the user with the top N documents, I need to return *only* the documents that seem relevant. For some searches this may be 3 documents, for some it may be none. My user interface in this case isn't the standard "type words in a box and we'll show you the best docs" - I'm using Lucene as a tool in the background to do some exploration about how I could augment a set of traditional results with a few alternative results gleaned from a different path. Not sure if this helps with the X-Y problem, but that's my task at hand. Also, while perusing the threads you refer to below, I saw a reference to the following link, which seems to have gone dead: https://issues.apache.org/bugzilla/show_bug.cgi?id=31841 (in http://www.lucidimagination.com/search/document/1618ce933c8ebd6b ) Has the issue tracker moved somewhere else? Finally, I seem unable to get Searcher.explain() to do much useful - my code looks like: Searcher searcher = new IndexSearcher(reader); QueryParser parser = new QueryParser(LuceneIndex.CONTENT, analyzer); Query query = parser.parse(queryString); TopDocCollector collector = new TopDocCollector(n); searcher.search(query, collector); for ( ScoreDoc d : collector.topDocs().scoreDocs ) { String explanation = searcher.explain(query, d.doc).toString(); Field id = searcher.doc( d.doc ).getField( LuceneIndex.ID ); System.out.println(id + "\t" + d.score + "\t" + explanation); } In the output, I get explanations like "0.88922405 = (MATCH) product of:" with no details. Perhaps I need to do something different in indexing? Thanks, -Ken On 2/26/09 10:36 AM, "Grant Ingersoll" wrote: > I don't know of anyone doing work on it in the Lucene community. My > understanding to date is that it is not really worth trying, but that > may in fact be an outdated view. 
I haven't stayed up on the > literature on this subject, so background info on what you are > interested in would be helpful. > > Digging around in the archives a bit more, I come up with some more > relevant emails: > http://www.lucidimagination.com/search/?q=comparing+scores+across+searches#/ > p:lucene,solr/s:email > > What is the bigger problem that you are trying to solve? That is, you > imply that score comparison is the solution, but you haven't said the > problem you are trying to solve. > > Cheers, > Grant > > > On Feb 25, 2009, at 11:38 AM, Ken Williams wrote: > >> Hi all, >> >> I didn't get a response to this - not sure whether the question was >> ill-posed, or too-frequently-asked, or just not interesting. But if >> anyone >> could take a stab at it or let me know a different place to look, >> I'd really >> appreciate it. >> >> Thanks, >> >> -Ken >> >> >> On 2/20/09 12:00 PM, "Ken Williams" >> wrote: >> >>> Hi, >>> >>> Has there been any work done on getting confidence scores at >>> runtime, so >>> that scores of documents can be compared across queries? I found one >>> reference in the mailing list to some work in 2003, but couldn't >>> find any >>> follow-up: >>> >>> http://osdir.com/ml/jakarta.lucene.user/2003-12/msg00093.html >>> >>> Thanks. 
>> >> -- >> Ken Williams >> Research Scientist >> The Thomson Reuters Corporation >> Eagan, MN >> >> >> - >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> > > -- > Grant Ingersoll > http://www.lucidimagination.com/ > > Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) > using Solr/Lucene: > http://www.lucidimagination.com/search > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > -- Ken Williams Research Scientist The Thomson Reuters Corporation Eagan, MN - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Restricting the result set with hierarchical ACL
There are two ways to handle this: 1) At indexing time, expand the group tree and store all of a document's groups in the document, like "groups:1 2 3" 2) At indexing time, store only the exact group the document belongs to. Then at search time, expand the group tree to search all the groups the user belongs to, including the sub-groups. Approach 2 should be more flexible. I don't think a user will belong to so many groups that the default maxClauseCount of 1024 is exceeded. -- Chris Lu - Instant Scalable Full-Text Search On Any Database/Application site: http://www.dbsight.net demo: http://search.dbsight.com Lucene Database Search in 3 minutes: http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes DBSight customer, a shopping comparison site, (anonymous per request) got 2.6 Million Euro funding! On Mon, Mar 2, 2009 at 7:58 AM, wrote: > Dear list > > I need to restrict the result list to the appropriate rights of the user > who is searching the index. > > A document may belong to several groups. > > A user must belong to all groups of the document to find it. There's one > additional problem: The groups are a tree. A user is automatically > in every parent group of his groups. For example A is a child of B, so a > user in group A would also be allowed to see documents of group B. > > And now I have no idea how to get a restricted search result from > lucene. There are about 10000 documents, so I'm not very happy to filter > them after the index was searched. > > I tried to get all allowed document ids (there's a field for the id) and > put them into a BooleanQuery (id1 or id2, ...), but then I get a > BooleanQuery$TooManyClauses: maxClauseCount is set to 1024 > > So how can I restrict my search results with lucene? > > Markus Malkusch > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
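The search-time expansion of approach 2 can be sketched in plain Java. GroupExpander and the child-to-parent map are illustrative, not Lucene API; the expanded set would then feed a filter or query clause:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Sketch of approach 2: expand the user's groups up the tree at search time. */
class GroupExpander {
    /**
     * Returns the user's groups plus every ancestor group, given a
     * child -> parent map describing the group tree.
     */
    static Set<String> expand(Set<String> userGroups, Map<String, String> parentOf) {
        Set<String> all = new HashSet<>(userGroups);
        for (String g : userGroups) {
            String p = parentOf.get(g);
            while (p != null && all.add(p)) {  // climb until the root or an already-seen group
                p = parentOf.get(p);
            }
        }
        return all;
    }
}
```

A document is then visible if its stored groups are a subset of the expanded user groups, which matches the "user must belong to all groups of the document" rule.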
Re: Faceted Search using Lucene
It makes perfect sense to call maybeReopen() followed by get(), as long as maybeReopen() is never slow enough to be noticeable to an end user (because you are making random queries pay the reopen/warming cost). If you call maybeReopen() after get(), then that search will not see the newly opened readers, but the next search will. I'm just thinking that since you see no results with get() alone, debug that case first. Then put back the maybeReopen(). Can you post your full code at this point? Mike Amin Mohammed-Coleman wrote: Hi Just out of curiosity does it not make sense to call maybeReopen and then call get()? If I call get() then I have a new mulitsearcher, so a call to maybeopen won't reinitialise the multi searcher. Unless I pass the multi searcher into the maybereopen method. But somehow that doesn't make sense. I maybe missing something here. Cheers Amin On 2 Mar 2009, at 15:48, Amin Mohammed-Coleman wrote: I'm seeing some interesting behviour when i do get() first followed by maybeReopen then there are no documents in the directory (directory that i am interested in. When i do the maybeReopen and then get() then the doc count is correct. I can post stats later. Weird... On Mon, Mar 2, 2009 at 2:17 PM, Amin Mohammed-Coleman > wrote: oh dear...i think i may cry...i'll debug. On Mon, Mar 2, 2009 at 2:15 PM, Michael McCandless > wrote: Or even just get() with no call to maybeReopen(). That should work fine as well. Mike Amin Mohammed-Coleman wrote: In my test case I have a set up method that should populate the indexes before I start using the document searcher. I will start adding some more debug statements. So basically I should be able to do: get() followed by maybeReopen. I will let you know what the outcome is. Cheers Amin On Mon, Mar 2, 2009 at 1:39 PM, Michael McCandless < luc...@mikemccandless.com> wrote: Is it possible that when you first create the SearcherManager, there is no index in each Directory? If not... you better start adding diagnostics. 
EG inside your get(), print out the numDocs() of each IndexReader you get from the SearcherManager? Something is wrong and it's best to explain it... Mike Amin Mohammed-Coleman wrote: Nope. If i remove the maybeReopen the search doesn't work. It only works when i cal maybeReopen followed by get(). Cheers Amin On Mon, Mar 2, 2009 at 12:56 PM, Michael McCandless < luc...@mikemccandless.com> wrote: That's not right; something must be wrong. get() before maybeReopen() should simply let you search based on the searcher before reopening. If you just do get() and don't call maybeReopen() does it work? Mike Amin Mohammed-Coleman wrote: I noticed that if i do the get() before the maybeReopen then I get no results. But otherwise I can change it further. On Mon, Mar 2, 2009 at 11:46 AM, Michael McCandless < luc...@mikemccandless.com> wrote: There is no such thing as final code -- code is alive and is always changing ;) It looks good to me. Though one trivial thing is: I would move the code in the try clause up to and including the multiSearcher=get() out above the try. I always attempt to "shrink wrap" what's inside a try clause to the minimum that needs to be there. Ie, your code that creates a query, finds the right sort & filter to use, etc, can all happen outside the try, because you have not yet acquired the multiSearcher. If you do that, you also don't need the null check in the finally clause, because multiSearcher must be non-null on entering the try. Mike Amin Mohammed-Coleman wrote: Hi there Good morning! Here is the final search code: public Summary[] search(final SearchRequest searchRequest) throwsSearchExecutionException { final String searchTerm = searchRequest.getSearchTerm(); if (StringUtils.isBlank(searchTerm)) { throw new SearchExecutionException("Search string cannot be empty. 
There will be too many results to process."); } List summaryList = new ArrayList(); StopWatch stopWatch = new StopWatch("searchStopWatch"); stopWatch.start(); MultiSearcher multiSearcher = null; try { LOGGER.debug("Ensuring all index readers are up to date..."); maybeReopen(); Query query = queryParser.parse(searchTerm); LOGGER.debug("Search Term '" + searchTerm +"' > Lucene Query '" + query.toString() +"'"); Sort sort = null; sort = applySortIfApplicable(searchRequest); Filter[] filters =applyFiltersIfApplicable(searchRequest); ChainedFilter chainedFilter = null; if (filters != null) { chainedFilter = new ChainedFilter(filters, ChainedFilter.OR); } multiSearcher = get(); TopDocs topDocs = multiSearcher.search(query,chainedFilter , 100,sort); ScoreDoc[] scoreDocs = topDocs.scoreDocs; LOGGER.debug("total number of hits for [" + query.toString() + " ] = "+topDocs. totalHits); for (ScoreDoc scoreDoc : scoreDocs) { final Document doc = multiSearcher.doc(scoreDoc.doc); float score = scoreDoc.score; final BaseDocu
Re: Restricting the result set with hierarchical ACL
Hi Markus, I need to restrict the result set to the appropriate rights of the user who is searching the index. A document may belong to several groups. A user must belong to all groups of the document to find it. There's one additional problem: The groups are a tree. A user is automatically in every parent group of his groups. For example A is a child of B, so a user in group A would also be allowed to see documents of group B. And now I have no idea how to get a restricted search result from lucene. There are about 10000 documents, so I'm not very happy to filter them after the index was searched. Well, 10K is actually a small number of docs. And the real question is how many documents will typically be part of the found set, and thus in the set that needs to be filtered. So try that first, as that's the obvious approach (to me, at least). Note that for this type of filtering, the way that you do the calculation will have a performance impact - e.g. you might want to use bitfields versus iterating over group names in the stored field. Since the set of a document's groups has to be a complete subset of the user's groups, you can't use the typical approach of having a doc field with every group in it, then adding a required subclause to your query with every group as a boolean OR term. -- Ken -- Ken Krugler +1 530-210-6378 - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Question on Proximity Search in Lucene Query
See page 88 in Lucene In Action for a fuller explanation, including ordering considerations. But basically, phrase query slop is the maximum number of "moves" required to get all the words next to each other in the proper order. If you can get all the words next to each other within slop moves, you succeed. So it's not pairwise: the slop applies to the phrase as a whole, not to each pair of words. I don't want to reproduce the example in the book, but that'd be the place to start. Best Erick On Mon, Mar 2, 2009 at 1:07 PM, Vasudevan Comandur wrote: > Hi All, > > I had posted the below-mentioned query a week back and I have not > received any response from the group so far. > I was wondering if this is a trivial question to the group or it has been > answered previously. > > I'd appreciate your answers; any pointers to the answers are also welcome. > > Regards > Vasu > > ** > > > Hi, > > I have a question on the proximity query usage in Lucene Query Syntax. > > The documentation says "W1 W2"~5 means W1 and W2 can occur within 5 words. > Here W1 & W2 represent words. > > What happens when I give "W1 W2 W3 W4"~25 as a proximity query? > > Does it mean that each word pair (W1, W2), (W1, W3), (W1, W4), (W2, W3), > (W2, W4), (W3, W4) can occur within 25 words? > > Looking forward to your reply. > > Regards > Vasu > > *** >
Question on Proximity Search in Lucene Query
Hi All, I had posted the below-mentioned query a week back and I have not received any response from the group so far. I was wondering if this is a trivial question to the group or it has been answered previously. I'd appreciate your answers; any pointers to the answers are also welcome. Regards Vasu ** Hi, I have a question on the proximity query usage in Lucene Query Syntax. The documentation says "W1 W2"~5 means W1 and W2 can occur within 5 words. Here W1 & W2 represent words. What happens when I give "W1 W2 W3 W4"~25 as a proximity query? Does it mean that each word pair (W1, W2), (W1, W3), (W1, W4), (W2, W3), (W2, W4), (W3, W4) can occur within 25 words? Looking forward to your reply. Regards Vasu ***
Re: Restricting the result set with hierarchical ACL
If you have a reasonable way of getting the doc IDs that your user is allowed to see (and it appears you do), you probably want a Filter. At root a Filter is just a BitSet where you turn on the bit for each document that *could* be allowed in the results and pass that filter to the appropriate search routine. CachingWrapperFilter may be your friend if you want to keep some of these filters around after you've created them. Erick On Mon, Mar 2, 2009 at 10:58 AM, wrote: > Dear list > > I need to restrict the result list to the appropriate rights of the user > who is searching the index. > > A document may belong to several groups. > > A user must belong to all groups of the document to find it. There's one > additional problem: The groups are a tree. A user is automatically > in every parent group of his groups. For example A is a child of B, so a > user in group A would also be allowed to see documents of group B. > > And now I have no idea how to get a restricted search result from > lucene. There are about 10000 documents, so I'm not very happy to filter > them after the index was searched. > > I tried to get all allowed document ids (there's a field for the id) and > put them into a BooleanQuery (id1 or id2, ...), but then I get a > BooleanQuery$TooManyClauses: maxClauseCount is set to 1024 > > So how can I restrict my search results with lucene? > > Markus Malkusch > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
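Erick's point that a Filter is at root just a BitSet can be illustrated without any Lucene classes. AclBits below is a hypothetical sketch; in a real implementation the allowed-docs BitSet is what the Filter hands to the searcher, and the intersection happens inside Lucene:

```java
import java.util.BitSet;

/** Minimal sketch of BitSet-based restriction, independent of Lucene's Filter API. */
class AclBits {
    /** One bit per document; set bits mark docs the user may see. */
    static BitSet allowed(int maxDoc, int[] allowedDocIds) {
        BitSet bits = new BitSet(maxDoc);
        for (int id : allowedDocIds) bits.set(id);
        return bits;
    }

    /** Intersect raw hits with the allowed set; this is all a security filter does. */
    static BitSet restrict(BitSet hits, BitSet allowed) {
        BitSet out = (BitSet) hits.clone();
        out.and(allowed);
        return out;
    }
}
```

Because the BitSet is built once per user (and can be cached, e.g. via CachingWrapperFilter), this sidesteps the 1024-clause BooleanQuery limit entirely.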
Restricting the result set with hierarchical ACL
Dear list I need to restrict the result list to the appropriate rights of the user who is searching the index. A document may belong to several groups. A user must belong to all groups of the document to find it. There's one additional problem: The groups are a tree. A user is automatically in every parent group of his groups. For example A is a child of B, so a user in group A would also be allowed to see documents of group B. And now I have no idea how to get a restricted search result from lucene. There are about 10000 documents, so I'm not very happy to filter them after the index was searched. I tried to get all allowed document ids (there's a field for the id) and put them into a BooleanQuery (id1 or id2, ...), but then I get a BooleanQuery$TooManyClauses: maxClauseCount is set to 1024 So how can I restrict my search results with lucene? Markus Malkusch - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Faceted Search using Lucene
Hi Just out of curiosity, does it not make sense to call maybeReopen and then call get()? If I call get() then I have a new MultiSearcher, so a call to maybeReopen won't reinitialise the MultiSearcher. Unless I pass the MultiSearcher into the maybeReopen method. But somehow that doesn't make sense. I may be missing something here. Cheers Amin On 2 Mar 2009, at 15:48, Amin Mohammed-Coleman wrote: I'm seeing some interesting behaviour: when I do get() first followed by maybeReopen, then there are no documents in the directory (the directory that I am interested in). When I do the maybeReopen and then get(), the doc count is correct. I can post stats later. Weird... On Mon, Mar 2, 2009 at 2:17 PM, Amin Mohammed-Coleman > wrote: oh dear...I think I may cry...I'll debug. On Mon, Mar 2, 2009 at 2:15 PM, Michael McCandless > wrote: Or even just get() with no call to maybeReopen(). That should work fine as well. Mike Amin Mohammed-Coleman wrote: In my test case I have a set up method that should populate the indexes before I start using the document searcher. I will start adding some more debug statements. So basically I should be able to do: get() followed by maybeReopen. I will let you know what the outcome is. Cheers Amin On Mon, Mar 2, 2009 at 1:39 PM, Michael McCandless < luc...@mikemccandless.com> wrote: Is it possible that when you first create the SearcherManager, there is no index in each Directory? If not... you better start adding diagnostics. EG inside your get(), print out the numDocs() of each IndexReader you get from the SearcherManager? Something is wrong and it's best to explain it... Mike Amin Mohammed-Coleman wrote: Nope. If I remove the maybeReopen the search doesn't work. It only works when I call maybeReopen followed by get(). Cheers Amin On Mon, Mar 2, 2009 at 12:56 PM, Michael McCandless < luc...@mikemccandless.com> wrote: That's not right; something must be wrong.
get() before maybeReopen() should simply let you search based on the searcher before reopening. If you just do get() and don't call maybeReopen() does it work? Mike Amin Mohammed-Coleman wrote: I noticed that if i do the get() before the maybeReopen then I get no results. But otherwise I can change it further. On Mon, Mar 2, 2009 at 11:46 AM, Michael McCandless < luc...@mikemccandless.com> wrote: There is no such thing as final code -- code is alive and is always changing ;) It looks good to me. Though one trivial thing is: I would move the code in the try clause up to and including the multiSearcher=get() out above the try. I always attempt to "shrink wrap" what's inside a try clause to the minimum that needs to be there. Ie, your code that creates a query, finds the right sort & filter to use, etc, can all happen outside the try, because you have not yet acquired the multiSearcher. If you do that, you also don't need the null check in the finally clause, because multiSearcher must be non-null on entering the try. Mike Amin Mohammed-Coleman wrote: Hi there Good morning! Here is the final search code: public Summary[] search(final SearchRequest searchRequest) throwsSearchExecutionException { final String searchTerm = searchRequest.getSearchTerm(); if (StringUtils.isBlank(searchTerm)) { throw new SearchExecutionException("Search string cannot be empty. 
There will be too many results to process."); } List summaryList = new ArrayList(); StopWatch stopWatch = new StopWatch("searchStopWatch"); stopWatch.start(); MultiSearcher multiSearcher = null; try { LOGGER.debug("Ensuring all index readers are up to date..."); maybeReopen(); Query query = queryParser.parse(searchTerm); LOGGER.debug("Search Term '" + searchTerm +"' > Lucene Query '" + query.toString() +"'"); Sort sort = null; sort = applySortIfApplicable(searchRequest); Filter[] filters =applyFiltersIfApplicable(searchRequest); ChainedFilter chainedFilter = null; if (filters != null) { chainedFilter = new ChainedFilter(filters, ChainedFilter.OR); } multiSearcher = get(); TopDocs topDocs = multiSearcher.search(query,chainedFilter ,100,sort); ScoreDoc[] scoreDocs = topDocs.scoreDocs; LOGGER.debug("total number of hits for [" + query.toString() + " ] = "+topDocs. totalHits); for (ScoreDoc scoreDoc : scoreDocs) { final Document doc = multiSearcher.doc(scoreDoc.doc); float score = scoreDoc.score; final BaseDocument baseDocument = new BaseDocument(doc, score); Summary documentSummary = new DocumentSummaryImpl(baseDocument); summaryList.add(documentSummary); } } catch (Exception e) { throw new IllegalStateException(e); } finally { if (multiSearcher != null) { release(multiSearcher); } } stopWatch.stop(); LOGGER.debug("total time taken for document seach: " + stopWatch.getTotalTimeMillis() + " ms"); return summaryList.toArray(new Summary[] {}); } I hope this makes sense...thanks again! Cheers Amin On Sun, Mar 1, 2009 at 8:09 PM, Michael Mc
Re: N-grams with numbers and Shinglefilters
Yes, I understand now that I don't need a ShingleFilter. Yes, I will have many of these phrases in the documents... this is why I thought I shouldn't use Lucene fields. I will investigate further; your keyword approach sounds feasible, thanks for the tip. However, I presume I may need to normalize the phrases for the search phase, so it may not work. Keep in touch, -RB- On Mon, Mar 2, 2009 at 5:23 PM, Steven A Rowe wrote: > Hi Raymond, > > On 3/2/2009 at 10:09 AM, Raymond Balmès wrote: > > suppose I have a tri-gram, what I want to do is index the tri-gram > > "string digit1 digit2" as one indexing phrase, and not index each token > > separately. > > As long as you don't want any transformation performed on the phrase or its > components, you can add your phrase as a "keyword", i.e. a non-analyzed > string that will be indexed as-is. > > Unless your phrase field will be the only field on this document (pretty > unlikely), you'll want to use PerFieldAnalyzerWrapper[1] over > KeywordAnalyzer[2] for the phrase field, and whatever other analyzer you > like for the other document field(s). > > AFAICT, you don't need ShingleFilter. > > Steve > > [1] PerFieldAnalyzerWrapper: > http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html > [2] KeywordAnalyzer: > http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/KeywordAnalyzer.html > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
Re: Indexing synonyms for multiple words
Since Lucene doesn't represent/store an end position for a token, I don't think the index can properly represent SYN spanning two positions? I suppose you could encode this into payloads, and create a custom query that would look at the payload to enforce the constraint. Or, if you switch to doing SYN expansion only at runtime (not adding it to the index), that might work. Mike

Uwe Schindler wrote: I think his problem is that "SYN" is a synonym for the phrase "WORD1 WORD2". Using these positions, a phrase like "SYN WORD2" would also match (or cause other problems in queries that depend on the order of words). Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de

-Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Monday, March 02, 2009 4:07 PM To: java-user@lucene.apache.org Subject: Re: Indexing synonyms for multiple words

Shouldn't WORD2's position be 1 more than your SYN? Ie, don't you want these positions?: WORD1 2 WORD2 3 SYN 2 The position is the starting position of the token; Lucene doesn't store an ending position. Mike

Sumukh wrote: Hi, I'm fairly new to Lucene. I'd like to know how we can index synonyms for multiple words. This is the scenario: Consider a sentence: AAA BBB WORD1 WORD2 EEE FFF GGG. Now assume the two words combined WORD1 WORD2 can be replaced by another word SYN. If I place SYN after WORD1 with positionIncrement set to 0, WORD2 will follow SYN, which is incorrect; and the other way round if I place it after WORD2. If any of you have solved a similar problem, I'd be thankful if you could shed some light on the solution.
Regards, Sumukh - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
RE: Sort Collection of ScoreDocs
Perfect Thanks. Was also looking at org.apache.lucene.search.ScoreDocComparator Uwe Schindler wrote: > > How about java.util.Arrays.sort() on the array using a simple > Comparator with a compare() that returns -Float.compare(a.score, > b.score)? This is just about 7 lines of Java code. > > - > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > >> -Original Message- >> From: Chetan Shah [mailto:chetankrs...@gmail.com] >> Sent: Monday, March 02, 2009 4:47 PM >> To: java-user@lucene.apache.org >> Subject: Sort Collection of ScoreDocs >> >> >> Is there an existing Utility class which will sort a collection of >> ScoreDocs >> ? I have a result set (array of ScoreDocs) stored in JVM and want to sort >> them by relevanceScore. I do not want to execute the query again. The >> stored >> result set is sorted by another term and hence the need. >> >> Would highly appreciate if you would please let me know how do I do so? >> >> Thanks, >> >> -Chetan >> -- >> View this message in context: http://www.nabble.com/Sort-Collection-of- >> ScoreDocs-tp22290563p22290563.html >> Sent from the Lucene - Java Users mailing list archive at Nabble.com. > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > -- View this message in context: http://www.nabble.com/Sort-Collection-of-ScoreDocs-tp22290563p22291550.html Sent from the Lucene - Java Users mailing list archive at Nabble.com. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: queryNorm affect on score
If I set the boost=0 at query time and the query contains only terms with boost=0, the scores are NaN (because weight.queryNorm = 1/0 = infinity), instead of 0. Peter On Sun, Mar 1, 2009 at 9:27 PM, Erick Erickson wrote: > FWIW, Hossman pointed out that the difference between index and > query time boosts is that index time boosts on title, for instance, > express "I care about this document's title more than other documents' > titles [when it matches]" Query time boosts express "I care about matches > on the title field more than matches on other fields". > > Best > Erick > > On Sun, Mar 1, 2009 at 8:57 PM, Peter Keegan > wrote: > > > As suggested, I added a query-time boost of 0.0f to the 'literals' field > > (with index-time boost still there) and I did get the same scores for > both > > queries :) (there is a subtlety between index-time and query-time > boosting > > that I missed.) > > > > I also tried disabling the coord factor, but that had no effect on the > > score, when combined with the above. This seems ok in this example since > > the matching terms had boost = 0. > > > > Thanks Yonik, > > Peter > > > > > > > > On Sat, Feb 28, 2009 at 6:02 PM, Yonik Seeley < > yo...@lucidimagination.com > > >wrote: > > > > > On Sat, Feb 28, 2009 at 3:02 PM, Peter Keegan > > > wrote: > > > >> in situations where you deal with simple query types, and matching > > > query > > > > structures, the queryNorm > > > >> *can* be used to make scores semi-comparable. > > > > > > > > Hmm. My example used matching query structures. The only difference > was > > a > > > > single term in a field with zero weight that didn't exist in the > > matching > > > > document. But one score was 3X the other. > > > > > > But the zero boost was an index-time boost, and the queryNorm takes > > > into account query-time boosts and idfs. 
You might get closer to what > > > you expect with a query time boost of 0.0f > > > > > > The other thing affecting the score is the coord factor - the fact > > > that fewer of the optional terms matched (1/2) lowers the score. The > > > coordination factor can be disabled on any BooleanQuery. > > > > > > If you do both of the above, I *think* you would get the same scores > > > for this specific example. > > > > > > -Yonik > > > http://www.lucidimagination.com > > > > > > - > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > > > > > > >
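Peter's NaN observation at the top of this thread follows directly from float arithmetic in the scoring formula, where queryNorm = 1/sqrt(sum of squared query weights). A minimal pure-Java sketch (no Lucene classes involved; the variable names are illustrative) reproduces it:

```java
public class QueryNormNaN {
    public static void main(String[] args) {
        // All query-time boosts are 0, so the sum of squared weights is 0.
        float sumOfSquaredWeights = 0.0f;
        float queryNorm = (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
        System.out.println(queryNorm);          // Infinity

        // Each term's weight is 0 (boost of 0), and 0 * Infinity is NaN.
        float termWeight = 0.0f;
        float score = termWeight * queryNorm;
        System.out.println(score);              // NaN
        System.out.println(Float.isNaN(score)); // true
    }
}
```

This is why the scores come out as NaN rather than 0: the zero boosts zero out both the numerator and the normalizer.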
RE: N-grams with numbers and Shinglefilters
Hi Raymond, On 3/2/2009 at 10:09 AM, Raymond Balmès wrote: > suppose I have a tri-gram, what I want to do is index the tri-gram > "string digit1 digit2" as one indexing phrase, and not index each token > separately. As long as you don't want any transformation performed on the phrase or its components, you can add your phrase as a "keyword", i.e. a non-analyzed string that will be indexed as-is. Unless your phrase field will be the only field on this document (pretty unlikely), you'll want to use PerFieldAnalyzerWrapper[1] over KeywordAnalyzer[2] for the phrase field, and whatever other analyzer you like for the other document field(s). AFAICT, you don't need ShingleFilter. Steve [1] PerFieldAnalyzerWrapper: http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html [2] KeywordAnalyzer: http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/KeywordAnalyzer.html - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
RE: Sort Collection of ScoreDocs
How about java.util.Arrays.sort() on the array using a simple Comparator with a compare() that returns -Float.compare(a.score, b.score)? This is just about 7 lines of Java code. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Chetan Shah [mailto:chetankrs...@gmail.com] > Sent: Monday, March 02, 2009 4:47 PM > To: java-user@lucene.apache.org > Subject: Sort Collection of ScoreDocs > > > Is there an existing Utility class which will sort a collection of > ScoreDocs > ? I have a result set (array of ScoreDocs) stored in JVM and want to sort > them by relevanceScore. I do not want to execute the query again. The > stored > result set is sorted by another term and hence the need. > > Would highly appreciate if you would please let me know how do I do so? > > Thanks, > > -Chetan > -- > View this message in context: http://www.nabble.com/Sort-Collection-of- > ScoreDocs-tp22290563p22290563.html > Sent from the Lucene - Java Users mailing list archive at Nabble.com. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
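Uwe's suggestion can be sketched in plain Java. The ScoreDoc class below is a minimal stand-in for Lucene's (doc id plus relevance score), just enough to show the descending sort by score with a negated Float.compare:

```java
import java.util.Arrays;
import java.util.Comparator;

public class SortByScore {
    // Minimal stand-in for Lucene's ScoreDoc: a doc id and its score.
    static class ScoreDoc {
        final int doc;
        final float score;
        ScoreDoc(int doc, float score) { this.doc = doc; this.score = score; }
    }

    public static void main(String[] args) {
        ScoreDoc[] hits = {
            new ScoreDoc(1, 0.3f), new ScoreDoc(2, 0.9f), new ScoreDoc(3, 0.5f)
        };
        // Descending by score: negate Float.compare, as suggested on the list.
        Arrays.sort(hits, new Comparator<ScoreDoc>() {
            public int compare(ScoreDoc a, ScoreDoc b) {
                return -Float.compare(a.score, b.score);
            }
        });
        for (ScoreDoc sd : hits) {
            System.out.println(sd.doc + " " + sd.score); // docs 2, 3, 1 in that order
        }
    }
}
```

Since the hits are already in memory, this re-sorts them by relevance without re-executing the query, which is exactly what Chetan asked for.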
Restricting the result set with hierarchical ACL
Dear list, I need to restrict the result list to the rights of the user who is searching the index. A document may belong to several groups, and a user must belong to all groups of a document to find it. There's one additional problem: the groups form a tree, and a user is automatically in every parent group of his groups. For example, A is a child of B, so a user in group A would also be allowed to see documents of group B. And now I have no idea how to get a restricted search result from Lucene. There are about 1 documents, so I'm not very happy to filter them after the index was searched. I tried to get all allowed document ids (there's a field for the id) and put them into a BooleanQuery (id1 OR id2, ...), but then I get a BooleanQuery$TooManyClauses: maxClauseCount is set to 1024. So how can I restrict my search results with Lucene? Markus Malkusch - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
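One way to approach the group-tree part is to expand the user's groups with all their ancestors up front, and then constrain the search with a cached Lucene Filter built from the allowed group terms rather than a huge BooleanQuery of document ids, which avoids TooManyClauses entirely. Below is a pure-Java sketch of only the ancestor expansion (class and method names are hypothetical, not from any Lucene API):

```java
import java.util.*;

public class GroupClosure {
    // parentOf maps child group -> parent group (the groups form a tree).
    static Set<String> withAncestors(Set<String> userGroups, Map<String, String> parentOf) {
        Set<String> closure = new HashSet<String>();
        for (String g : userGroups) {
            // Walk up the tree from each group until the root is reached.
            for (String cur = g; cur != null; cur = parentOf.get(cur)) {
                closure.add(cur);
            }
        }
        return closure;
    }

    public static void main(String[] args) {
        Map<String, String> parentOf = new HashMap<String, String>();
        parentOf.put("A", "B"); // A is a child of B
        Set<String> effective = withAncestors(Collections.singleton("A"), parentOf);
        // A user in group A may also see documents of the parent group B.
        System.out.println(effective.containsAll(Arrays.asList("A", "B"))); // true
    }
}
```

The expanded set is small (bounded by tree depth times group count), so it can be turned into a per-user filter over an indexed group field instead of enumerating document ids.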
Sort Collection of ScoreDocs
Is there an existing Utility class which will sort a collection of ScoreDocs ? I have a result set (array of ScoreDocs) stored in JVM and want to sort them by relevanceScore. I do not want to execute the query again. The stored result set is sorted by another term and hence the need. Would highly appreciate if you would please let me know how do I do so? Thanks, -Chetan -- View this message in context: http://www.nabble.com/Sort-Collection-of-ScoreDocs-tp22290563p22290563.html Sent from the Lucene - Java Users mailing list archive at Nabble.com.
RE: Indexing synonyms for multiple words
I think his problem is, that "SYN" is a synonym for the phrase "WORD1 WORD2". Using these positions, a phrase like "SYN WORD2" would also match (or other problems in queries that depend on order of words). Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Michael McCandless [mailto:luc...@mikemccandless.com] > Sent: Monday, March 02, 2009 4:07 PM > To: java-user@lucene.apache.org > Subject: Re: Indexing synonyms for multiple words > > > Shouldn't WORD2's position be 1 more than your SYN? > > Ie, don't you want these positions?: > > WORD1 2 > WORD2 3 > SYN 2 > > The position is the starting position of the token; Lucene doesn't > store an ending position > > Mike > > Sumukh wrote: > > > Hi, > > > > I'm fairly new to Lucene. I'd like to know how we can index synonyms > > for > > multiple words. > > > > This is the scenario: > > > > Consider a sentence: AAA BBB WORD1 WORD2 EEE FFF GGG. > > > > Now assume the two words combined WORD1 WORD2 can be replaced by > > another > > word SYN. > > > > If I place SYN after WORD1 with positionIncrement set to 0, WORD2 will > > follow SYN, > > which is incorrect; and the other way round if I place it after WORD2. > > > > If any of you have solved a similar problem, I'd be thankful if you > > could > > share some light on > > the solution. > > > > Regards, > > Sumukh > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
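Uwe's point can be demonstrated with a toy positional index and a naive exact-phrase check (plain Java, not Lucene's PhraseQuery): with SYN stacked at WORD1's position and no stored end position, the phrase "SYN WORD2" matches even though it never occurs in the text.

```java
import java.util.*;

public class PhraseOverPositions {
    // index maps term -> set of positions, mimicking one document's postings.
    static boolean phraseMatches(Map<String, Set<Integer>> index, String... phrase) {
        Set<Integer> starts = index.getOrDefault(phrase[0], Collections.emptySet());
        for (int start : starts) {
            boolean ok = true;
            // An exact phrase requires term i at position start + i.
            for (int i = 1; i < phrase.length; i++) {
                if (!index.getOrDefault(phrase[i], Collections.emptySet()).contains(start + i)) {
                    ok = false;
                    break;
                }
            }
            if (ok) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // "AAA BBB WORD1 WORD2 ..." with SYN stacked at WORD1's position 2:
        Map<String, Set<Integer>> index = new HashMap<>();
        index.put("WORD1", Set.of(2));
        index.put("WORD2", Set.of(3));
        index.put("SYN",   Set.of(2));
        System.out.println(phraseMatches(index, "WORD1", "WORD2")); // true (intended)
        System.out.println(phraseMatches(index, "SYN", "WORD2"));   // true (the false positive Uwe describes)
    }
}
```

Because a token carries only a start position, there is no way for this index layout to record that SYN "covers" positions 2 and 3, which is exactly the gap Mike suggests filling with payloads or query-time expansion.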
Re: N-grams with numbers and Shinglefilters
Well, In the mean time I've looked at the details of the implementation and it gave me an idea for what I'm looking for : suppose I have a tri-gram, what I want to do is index the tri-gram "string digit1 digit2" as one indexing phrase, and not index each token separately. In the shingler filter, if I understood it correctly, tokens are separated by '_' whilst n-grams are separated by " ", that is the mechanism which I was missing. And of course I need my logic around to filter valid tri-grams but I don't need help for this, I can easily do that using regex for instance. My documents look like regular html or pdf pages although some of them contains those specific tri-grams. Thx, -RB- On Mon, Mar 2, 2009 at 2:37 PM, Steven A Rowe wrote: > Hi Raymond, > > On 3/1/2009, Raymond Balmès wrote: > > I'm trying to index (& search later) documents that contain tri-grams > > however they have the following form: > > > > <2 digit> <2 digit> > > > > Does the ShingleFilter work with numbers in the match ? > > Yes, though it is the tokenizer and previous filters in the chain that will > be the (potential) source of difficulties, not ShingleFilter. > > > Another complication, in future features I'd like to add optional > > digits like > > > > [<1 digit>] <2 digit> <2 digit> > > > > I suppose the ShingleFilter won't do it ? > > ShingleFilter just pastes together the tokens produced by the previous > component in the analysis chain, in a sliding window. As currently written, > it doesn't provide the sort of functionality you seem to be asking for. > > > Any better advice ? > > What do your documents look like? What do you hope to accomplish using > ShingleFilter? It's tough to give advice without knowing what you want to > do. > > Steve > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
Re: Indexing synonyms for multiple words
Shouldn't WORD2's position be 1 more than your SYN? Ie, don't you want these positions?: WORD1 2 WORD2 3 SYN 2 The position is the starting position of the token; Lucene doesn't store an ending position Mike Sumukh wrote: Hi, I'm fairly new to Lucene. I'd like to know how we can index synonyms for multiple words. This is the scenario: Consider a sentence: AAA BBB WORD1 WORD2 EEE FFF GGG. Now assume the two words combined WORD1 WORD2 can be replaced by another word SYN. If I place SYN after WORD1 with positionIncrement set to 0, WORD2 will follow SYN, which is incorrect; and the other way round if I place it after WORD2. If any of you have solved a similar problem, I'd be thankful if you could share some light on the solution. Regards, Sumukh - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Indexing synonyms for multiple words
This has been discussed in the user list, so searching there might get you answer quicker. See: http://wiki.apache.org/lucene-java/MailingListArchives I don't remember the results, but... Best Erick On Mon, Mar 2, 2009 at 9:13 AM, Sumukh wrote: > Hi, > > I'm fairly new to Lucene. I'd like to know how we can index synonyms for > multiple words. > > This is the scenario: > > Consider a sentence: AAA BBB WORD1 WORD2 EEE FFF GGG. > > Now assume the two words combined WORD1 WORD2 can be replaced by another > word SYN. > > If I place SYN after WORD1 with positionIncrement set to 0, WORD2 will > follow SYN, > which is incorrect; and the other way round if I place it after WORD2. > > If any of you have solved a similar problem, I'd be thankful if you could > share some light on > the solution. > > Regards, > Sumukh >
Extracting TFIDF vectors
Hi, I'm a complete novice at Lucene, and I'm looking for a little bit of help with something. How can I extract the TF*IDF vector for each document in the indexed collection? Also for the query? I need to build a user-feedback system which manipulates the query based on the liked and disliked documents from the local collection. This query modification uses the TF*IDF vectors. Thanks for your help! -- Gregory Gay Editor - 4 Color Rebellion (http://www.4colorrebellion.com) Research Assistant - WVU CSEE
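In Lucene 2.4 one common route is to index the field with term vectors enabled and read the raw counts back via IndexReader.getTermFreqVector; the TF*IDF weighting itself is then a small computation. Below is a self-contained pure-Java sketch of that computation over an in-memory toy corpus, using the textbook idf = ln(N/df) variant (not Lucene's exact Similarity formula; class and method names are illustrative):

```java
import java.util.*;

public class TfIdf {
    // Build the tf*idf vector for one document in a tiny corpus.
    static Map<String, Double> vector(List<String[]> docs, int docIndex) {
        int n = docs.size();

        // Document frequency: in how many documents does each term occur?
        Map<String, Integer> df = new HashMap<>();
        for (String[] doc : docs) {
            for (String t : new HashSet<>(Arrays.asList(doc))) {
                df.merge(t, 1, Integer::sum);
            }
        }

        // Term frequency within the chosen document.
        Map<String, Integer> tf = new HashMap<>();
        for (String t : docs.get(docIndex)) tf.merge(t, 1, Integer::sum);

        // Weight each term by tf * ln(N / df).
        Map<String, Double> v = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            v.put(e.getKey(), e.getValue() * Math.log((double) n / df.get(e.getKey())));
        }
        return v;
    }

    public static void main(String[] args) {
        List<String[]> docs = Arrays.asList(
            new String[] {"lucene", "search", "search"},
            new String[] {"lucene", "index"});
        // "search" occurs twice in doc 0 and nowhere else, so it dominates;
        // "lucene" occurs everywhere, so its idf (and weight) is 0.
        System.out.println(vector(docs, 0));
    }
}
```

For relevance feedback, the query can be treated as one more term-count map and nudged toward the vectors of liked documents (Rocchio-style), which is the manipulation Gregory describes.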
Indexing synonyms for multiple words
Hi, I'm fairly new to Lucene. I'd like to know how we can index synonyms for multiple words. This is the scenario: Consider a sentence: AAA BBB WORD1 WORD2 EEE FFF GGG. Now assume the two words combined WORD1 WORD2 can be replaced by another word SYN. If I place SYN after WORD1 with positionIncrement set to 0, WORD2 will follow SYN, which is incorrect; and the other way round if I place it after WORD2. If any of you have solved a similar problem, I'd be thankful if you could share some light on the solution. Regards, Sumukh
Re: Faceted Search using Lucene
In my test case I have a set up method that should populate the indexes before I start using the document searcher. I will start adding some more debug statements. So basically I should be able to do: get() followed by maybeReopen. I will let you know what the outcome is. Cheers Amin On Mon, Mar 2, 2009 at 1:39 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > > Is it possible that when you first create the SearcherManager, there is no > index in each Directory? > > If not... you better start adding diagnostics. EG inside your get(), print > out the numDocs() of each IndexReader you get from the SearcherManager? > > Something is wrong and it's best to explain it... > > > Mike > > Amin Mohammed-Coleman wrote: > > Nope. If i remove the maybeReopen the search doesn't work. It only works >> when i cal maybeReopen followed by get(). >> >> Cheers >> Amin >> >> On Mon, Mar 2, 2009 at 12:56 PM, Michael McCandless < >> luc...@mikemccandless.com> wrote: >> >> >>> That's not right; something must be wrong. >>> >>> get() before maybeReopen() should simply let you search based on the >>> searcher before reopening. >>> >>> If you just do get() and don't call maybeReopen() does it work? >>> >>> >>> Mike >>> >>> Amin Mohammed-Coleman wrote: >>> >>> I noticed that if i do the get() before the maybeReopen then I get no >>> results. But otherwise I can change it further. On Mon, Mar 2, 2009 at 11:46 AM, Michael McCandless < luc...@mikemccandless.com> wrote: There is no such thing as final code -- code is alive and is always > changing ;) > > It looks good to me. > > Though one trivial thing is: I would move the code in the try clause up > to > and including the multiSearcher=get() out above the try. I always > attempt > to "shrink wrap" what's inside a try clause to the minimum that needs > to > be > there. 
Ie, your code that creates a query, finds the right sort & > filter > to > use, etc, can all happen outside the try, because you have not yet > acquired > the multiSearcher. > > If you do that, you also don't need the null check in the finally > clause, > because multiSearcher must be non-null on entering the try. > > Mike > > Amin Mohammed-Coleman wrote: > > Hi there > > Good morning! Here is the final search code: >> >> public Summary[] search(final SearchRequest searchRequest) >> throwsSearchExecutionException { >> >> final String searchTerm = searchRequest.getSearchTerm(); >> >> if (StringUtils.isBlank(searchTerm)) { >> >> throw new SearchExecutionException("Search string cannot be empty. >> There >> will be too many results to process."); >> >> } >> >> List summaryList = new ArrayList(); >> >> StopWatch stopWatch = new StopWatch("searchStopWatch"); >> >> stopWatch.start(); >> >> MultiSearcher multiSearcher = null; >> >> try { >> >> LOGGER.debug("Ensuring all index readers are up to date..."); >> >> maybeReopen(); >> >> Query query = queryParser.parse(searchTerm); >> >> LOGGER.debug("Search Term '" + searchTerm +"' > Lucene Query '" + >> query.toString() +"'"); >> >> Sort sort = null; >> >> sort = applySortIfApplicable(searchRequest); >> >> Filter[] filters =applyFiltersIfApplicable(searchRequest); >> >> ChainedFilter chainedFilter = null; >> >> if (filters != null) { >> >> chainedFilter = new ChainedFilter(filters, ChainedFilter.OR); >> >> } >> >> multiSearcher = get(); >> >> TopDocs topDocs = multiSearcher.search(query,chainedFilter ,100,sort); >> >> ScoreDoc[] scoreDocs = topDocs.scoreDocs; >> >> LOGGER.debug("total number of hits for [" + query.toString() + " ] = >> "+topDocs. 
>> totalHits); >> >> for (ScoreDoc scoreDoc : scoreDocs) { >> >> final Document doc = multiSearcher.doc(scoreDoc.doc); >> >> float score = scoreDoc.score; >> >> final BaseDocument baseDocument = new BaseDocument(doc, score); >> >> Summary documentSummary = new DocumentSummaryImpl(baseDocument); >> >> summaryList.add(documentSummary); >> >> } >> >> } catch (Exception e) { >> >> throw new IllegalStateException(e); >> >> } finally { >> >> if (multiSearcher != null) { >> >> release(multiSearcher); >> >> } >> >> } >> >> stopWatch.stop(); >> >> LOGGER.debug("total time taken for document seach: " + >> stopWatch.getTotalTimeMillis() + " ms"); >> >> return summaryList.toArray(new Summary[] {}); >> >> } >> >> >> >> I hope this makes sense...thanks again! >> >> >> Cheers >> >> Amin >> >> >> >> On Sun, Mar 1, 2009 at 8:09 PM, Michael McCandless < >> luc.
Re: Faceted Search using Lucene
Is it possible that when you first create the SearcherManager, there is no index in each Directory? If not... you better start adding diagnostics. EG inside your get(), print out the numDocs() of each IndexReader you get from the SearcherManager? Something is wrong and it's best to explain it... Mike Amin Mohammed-Coleman wrote: Nope. If i remove the maybeReopen the search doesn't work. It only works when i cal maybeReopen followed by get(). Cheers Amin On Mon, Mar 2, 2009 at 12:56 PM, Michael McCandless < luc...@mikemccandless.com> wrote: That's not right; something must be wrong. get() before maybeReopen() should simply let you search based on the searcher before reopening. If you just do get() and don't call maybeReopen() does it work? Mike Amin Mohammed-Coleman wrote: I noticed that if i do the get() before the maybeReopen then I get no results. But otherwise I can change it further. On Mon, Mar 2, 2009 at 11:46 AM, Michael McCandless < luc...@mikemccandless.com> wrote: There is no such thing as final code -- code is alive and is always changing ;) It looks good to me. Though one trivial thing is: I would move the code in the try clause up to and including the multiSearcher=get() out above the try. I always attempt to "shrink wrap" what's inside a try clause to the minimum that needs to be there. Ie, your code that creates a query, finds the right sort & filter to use, etc, can all happen outside the try, because you have not yet acquired the multiSearcher. If you do that, you also don't need the null check in the finally clause, because multiSearcher must be non-null on entering the try. Mike Amin Mohammed-Coleman wrote: Hi there Good morning! Here is the final search code: public Summary[] search(final SearchRequest searchRequest) throwsSearchExecutionException { final String searchTerm = searchRequest.getSearchTerm(); if (StringUtils.isBlank(searchTerm)) { throw new SearchExecutionException("Search string cannot be empty. 
There will be too many results to process."); } List summaryList = new ArrayList(); StopWatch stopWatch = new StopWatch("searchStopWatch"); stopWatch.start(); MultiSearcher multiSearcher = null; try { LOGGER.debug("Ensuring all index readers are up to date..."); maybeReopen(); Query query = queryParser.parse(searchTerm); LOGGER.debug("Search Term '" + searchTerm +"' > Lucene Query '" + query.toString() +"'"); Sort sort = null; sort = applySortIfApplicable(searchRequest); Filter[] filters =applyFiltersIfApplicable(searchRequest); ChainedFilter chainedFilter = null; if (filters != null) { chainedFilter = new ChainedFilter(filters, ChainedFilter.OR); } multiSearcher = get(); TopDocs topDocs = multiSearcher.search(query,chainedFilter , 100,sort); ScoreDoc[] scoreDocs = topDocs.scoreDocs; LOGGER.debug("total number of hits for [" + query.toString() + " ] = "+topDocs. totalHits); for (ScoreDoc scoreDoc : scoreDocs) { final Document doc = multiSearcher.doc(scoreDoc.doc); float score = scoreDoc.score; final BaseDocument baseDocument = new BaseDocument(doc, score); Summary documentSummary = new DocumentSummaryImpl(baseDocument); summaryList.add(documentSummary); } } catch (Exception e) { throw new IllegalStateException(e); } finally { if (multiSearcher != null) { release(multiSearcher); } } stopWatch.stop(); LOGGER.debug("total time taken for document seach: " + stopWatch.getTotalTimeMillis() + " ms"); return summaryList.toArray(new Summary[] {}); } I hope this makes sense...thanks again! Cheers Amin On Sun, Mar 1, 2009 at 8:09 PM, Michael McCandless < luc...@mikemccandless.com> wrote: You're calling get() too many times. For every call to get() you must match with a call to release(). So, once at the front of your search method you should: MultiSearcher searcher = get(); then use that searcher to do searching, retrieve docs, etc. Then in the finally clause, pass that searcher to release. So, only one call to get() and one matching call to release(). 
Mike Amin Mohammed-Coleman wrote: Hi The searchers are injected into the class via Spring. So when a client calls the class it is fully configured with a list of index searchers. However I have removed this list and instead injecting a list of directories which are passed to the DocumentSearchManager. DocumentSearchManager is SearchManager (should've mentioned that earlier). So finally I have modified by release code to do the following: private void release(MultiSearcher multiSeacher) throws Exception { IndexSearcher[] indexSearchers = (IndexSearcher[]) multiSeacher.getSearchables(); for(int i =0 ; i < indexSearchers.length;i++) { documentSearcherManagers[i].release(indexSearchers[i]); } } and it's use looks like this: public Summary[] search(final SearchRequest searchRequest) throwsSearchExecutionException { final String searchTerm = searchRequest.getSearchTerm(); if (StringUtils.isB
RE: N-grams with numbers and Shinglefilters
Hi Raymond, On 3/1/2009, Raymond Balmès wrote: > I'm trying to index (& search later) documents that contain tri-grams > however they have the following form: > > <2 digit> <2 digit> > > Does the ShingleFilter work with numbers in the match ? Yes, though it is the tokenizer and previous filters in the chain that will be the (potential) source of difficulties, not ShingleFilter. > Another complication, in future features I'd like to add optional > digits like > > [<1 digit>] <2 digit> <2 digit> > > I suppose the ShingleFilter won't do it ? ShingleFilter just pastes together the tokens produced by the previous component in the analysis chain, in a sliding window. As currently written, it doesn't provide the sort of functionality you seem to be asking for. > Any better advice ? What do your documents look like? What do you hope to accomplish using ShingleFilter? It's tough to give advice without knowing what you want to do. Steve - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Faceted Search using Lucene
Nope. If i remove the maybeReopen the search doesn't work. It only works when i cal maybeReopen followed by get(). Cheers Amin On Mon, Mar 2, 2009 at 12:56 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > > That's not right; something must be wrong. > > get() before maybeReopen() should simply let you search based on the > searcher before reopening. > > If you just do get() and don't call maybeReopen() does it work? > > > Mike > > Amin Mohammed-Coleman wrote: > > I noticed that if i do the get() before the maybeReopen then I get no >> results. But otherwise I can change it further. >> >> On Mon, Mar 2, 2009 at 11:46 AM, Michael McCandless < >> luc...@mikemccandless.com> wrote: >> >> >>> There is no such thing as final code -- code is alive and is always >>> changing ;) >>> >>> It looks good to me. >>> >>> Though one trivial thing is: I would move the code in the try clause up >>> to >>> and including the multiSearcher=get() out above the try. I always >>> attempt >>> to "shrink wrap" what's inside a try clause to the minimum that needs to >>> be >>> there. Ie, your code that creates a query, finds the right sort & filter >>> to >>> use, etc, can all happen outside the try, because you have not yet >>> acquired >>> the multiSearcher. >>> >>> If you do that, you also don't need the null check in the finally clause, >>> because multiSearcher must be non-null on entering the try. >>> >>> Mike >>> >>> Amin Mohammed-Coleman wrote: >>> >>> Hi there >>> Good morning! Here is the final search code: public Summary[] search(final SearchRequest searchRequest) throwsSearchExecutionException { final String searchTerm = searchRequest.getSearchTerm(); if (StringUtils.isBlank(searchTerm)) { throw new SearchExecutionException("Search string cannot be empty. 
There will be too many results to process."); } List summaryList = new ArrayList(); StopWatch stopWatch = new StopWatch("searchStopWatch"); stopWatch.start(); MultiSearcher multiSearcher = null; try { LOGGER.debug("Ensuring all index readers are up to date..."); maybeReopen(); Query query = queryParser.parse(searchTerm); LOGGER.debug("Search Term '" + searchTerm +"' > Lucene Query '" + query.toString() +"'"); Sort sort = null; sort = applySortIfApplicable(searchRequest); Filter[] filters =applyFiltersIfApplicable(searchRequest); ChainedFilter chainedFilter = null; if (filters != null) { chainedFilter = new ChainedFilter(filters, ChainedFilter.OR); } multiSearcher = get(); TopDocs topDocs = multiSearcher.search(query,chainedFilter ,100,sort); ScoreDoc[] scoreDocs = topDocs.scoreDocs; LOGGER.debug("total number of hits for [" + query.toString() + " ] = "+topDocs. totalHits); for (ScoreDoc scoreDoc : scoreDocs) { final Document doc = multiSearcher.doc(scoreDoc.doc); float score = scoreDoc.score; final BaseDocument baseDocument = new BaseDocument(doc, score); Summary documentSummary = new DocumentSummaryImpl(baseDocument); summaryList.add(documentSummary); } } catch (Exception e) { throw new IllegalStateException(e); } finally { if (multiSearcher != null) { release(multiSearcher); } } stopWatch.stop(); LOGGER.debug("total time taken for document seach: " + stopWatch.getTotalTimeMillis() + " ms"); return summaryList.toArray(new Summary[] {}); } I hope this makes sense...thanks again! Cheers Amin On Sun, Mar 1, 2009 at 8:09 PM, Michael McCandless < luc...@mikemccandless.com> wrote: You're calling get() too many times. For every call to get() you must > match with a call to release(). > > So, once at the front of your search method you should: > > MultiSearcher searcher = get(); > > then use that searcher to do searching, retrieve docs, etc. > > Then in the finally clause, pass that searcher to release. > > So, only one call to get() and one matching call to release(). 
> > Mike > > Amin Mohammed-Coleman wrote: > > Hi > > The searchers are injected into the class via Spring. So when a >> client >> calls the class it is fully configured with a list of index searchers. >> However I have removed this list and instead injecting a list of >> directories which are passed to the DocumentSearchManager. >> DocumentSearchManager is SearchManager (should've mentioned that >> earlier). >> So finally I have modified by release code to do the following: >> >> private void release(MultiSearcher multiSeacher) throws
Re: Faceted Search using Lucene
That's not right; something must be wrong. get() before maybeReopen() should simply let you search based on the searcher as it was before reopening. If you just do get() and don't call maybeReopen(), does it work? Mike

Amin Mohammed-Coleman wrote: I noticed that if I do the get() before the maybeReopen then I get no results. But otherwise I can change it further. On Mon, Mar 2, 2009 at 11:46 AM, Michael McCandless < luc...@mikemccandless.com> wrote: There is no such thing as final code -- code is alive and is always changing ;) It looks good to me. Though one trivial thing is: I would move the code in the try clause up to and including the multiSearcher=get() out above the try. I always attempt to "shrink wrap" what's inside a try clause to the minimum that needs to be there. I.e., your code that creates a query, finds the right sort & filter to use, etc., can all happen outside the try, because you have not yet acquired the multiSearcher. If you do that, you also don't need the null check in the finally clause, because multiSearcher must be non-null on entering the try. Mike

Amin Mohammed-Coleman wrote: Hi there Good morning! Here is the final search code:

    public Summary[] search(final SearchRequest searchRequest) throws SearchExecutionException {
      final String searchTerm = searchRequest.getSearchTerm();
      if (StringUtils.isBlank(searchTerm)) {
        throw new SearchExecutionException("Search string cannot be empty. There will be too many results to process.");
      }
      List summaryList = new ArrayList();
      StopWatch stopWatch = new StopWatch("searchStopWatch");
      stopWatch.start();
      MultiSearcher multiSearcher = null;
      try {
        LOGGER.debug("Ensuring all index readers are up to date...");
        maybeReopen();
        Query query = queryParser.parse(searchTerm);
        LOGGER.debug("Search Term '" + searchTerm + "' > Lucene Query '" + query.toString() + "'");
        Sort sort = applySortIfApplicable(searchRequest);
        Filter[] filters = applyFiltersIfApplicable(searchRequest);
        ChainedFilter chainedFilter = null;
        if (filters != null) {
          chainedFilter = new ChainedFilter(filters, ChainedFilter.OR);
        }
        multiSearcher = get();
        TopDocs topDocs = multiSearcher.search(query, chainedFilter, 100, sort);
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        LOGGER.debug("total number of hits for [" + query.toString() + "] = " + topDocs.totalHits);
        for (ScoreDoc scoreDoc : scoreDocs) {
          final Document doc = multiSearcher.doc(scoreDoc.doc);
          float score = scoreDoc.score;
          final BaseDocument baseDocument = new BaseDocument(doc, score);
          Summary documentSummary = new DocumentSummaryImpl(baseDocument);
          summaryList.add(documentSummary);
        }
      } catch (Exception e) {
        throw new IllegalStateException(e);
      } finally {
        if (multiSearcher != null) {
          release(multiSearcher);
        }
      }
      stopWatch.stop();
      LOGGER.debug("total time taken for document search: " + stopWatch.getTotalTimeMillis() + " ms");
      return summaryList.toArray(new Summary[] {});
    }

I hope this makes sense... thanks again! Cheers Amin

On Sun, Mar 1, 2009 at 8:09 PM, Michael McCandless < luc...@mikemccandless.com> wrote: You're calling get() too many times. For every call to get() you must match with a call to release(). So, once at the front of your search method you should: MultiSearcher searcher = get(); then use that searcher to do searching, retrieve docs, etc. Then in the finally clause, pass that searcher to release. So, only one call to get() and one matching call to release(). Mike

Amin Mohammed-Coleman wrote: Hi The searchers are injected into the class via Spring, so when a client calls the class it is fully configured with a list of index searchers. However, I have removed this list and am instead injecting a list of directories which are passed to the DocumentSearchManager. DocumentSearchManager is a SearchManager (should've mentioned that earlier). So finally I have modified my release code to do the following:

    private void release(MultiSearcher multiSearcher) throws Exception {
      IndexSearcher[] indexSearchers = (IndexSearcher[]) multiSearcher.getSearchables();
      for (int i = 0; i < indexSearchers.length; i++) {
        documentSearcherManagers[i].release(indexSearchers[i]);
      }
    }

and its use looks like this:

    public Summary[] search(final SearchRequest searchRequest) throws SearchExecutionException {
      final String searchTerm = searchRequest.getSearchTerm();
      if (StringUtils.isBlank(searchTerm)) {
        throw new SearchExecutionException("Search string cannot be empty. There will be too many results to process.");
      }
      List summaryList = new ArrayList();
      StopWatch stopWatch = new StopWatch("searchStopWatch");
      stopWatch.start();
      List indexSearchers = new ArrayList();
      try {
        LOGGER.debug("Ensuring all index readers are up to date...");
        maybeReopen();
        LOGGER.debug("All Index Searchers are up to date. No of index searchers '" + indexSearchers.size() + "'");
        Query query = queryParser.parse(searchTerm);
        LOGGER.debug("Search Term '"
Re: Faceted Search using Lucene
I noticed that if i do the get() before the maybeReopen then I get no results. But otherwise I can change it further. On Mon, Mar 2, 2009 at 11:46 AM, Michael McCandless < luc...@mikemccandless.com> wrote: > > There is no such thing as final code -- code is alive and is always > changing ;) > > It looks good to me. > > Though one trivial thing is: I would move the code in the try clause up to > and including the multiSearcher=get() out above the try. I always attempt > to "shrink wrap" what's inside a try clause to the minimum that needs to be > there. Ie, your code that creates a query, finds the right sort & filter to > use, etc, can all happen outside the try, because you have not yet acquired > the multiSearcher. > > If you do that, you also don't need the null check in the finally clause, > because multiSearcher must be non-null on entering the try. > > Mike > > Amin Mohammed-Coleman wrote: > > Hi there >> Good morning! Here is the final search code: >> >> public Summary[] search(final SearchRequest searchRequest) >> throwsSearchExecutionException { >> >> final String searchTerm = searchRequest.getSearchTerm(); >> >> if (StringUtils.isBlank(searchTerm)) { >> >> throw new SearchExecutionException("Search string cannot be empty. 
There >> will be too many results to process."); >> >> } >> >> List summaryList = new ArrayList(); >> >> StopWatch stopWatch = new StopWatch("searchStopWatch"); >> >> stopWatch.start(); >> >> MultiSearcher multiSearcher = null; >> >> try { >> >> LOGGER.debug("Ensuring all index readers are up to date..."); >> >> maybeReopen(); >> >> Query query = queryParser.parse(searchTerm); >> >> LOGGER.debug("Search Term '" + searchTerm +"' > Lucene Query '" + >> query.toString() +"'"); >> >> Sort sort = null; >> >> sort = applySortIfApplicable(searchRequest); >> >> Filter[] filters =applyFiltersIfApplicable(searchRequest); >> >> ChainedFilter chainedFilter = null; >> >> if (filters != null) { >> >> chainedFilter = new ChainedFilter(filters, ChainedFilter.OR); >> >> } >> >> multiSearcher = get(); >> >> TopDocs topDocs = multiSearcher.search(query,chainedFilter ,100,sort); >> >> ScoreDoc[] scoreDocs = topDocs.scoreDocs; >> >> LOGGER.debug("total number of hits for [" + query.toString() + " ] = >> "+topDocs. >> totalHits); >> >> for (ScoreDoc scoreDoc : scoreDocs) { >> >> final Document doc = multiSearcher.doc(scoreDoc.doc); >> >> float score = scoreDoc.score; >> >> final BaseDocument baseDocument = new BaseDocument(doc, score); >> >> Summary documentSummary = new DocumentSummaryImpl(baseDocument); >> >> summaryList.add(documentSummary); >> >> } >> >> } catch (Exception e) { >> >> throw new IllegalStateException(e); >> >> } finally { >> >> if (multiSearcher != null) { >> >> release(multiSearcher); >> >> } >> >> } >> >> stopWatch.stop(); >> >> LOGGER.debug("total time taken for document seach: " + >> stopWatch.getTotalTimeMillis() + " ms"); >> >> return summaryList.toArray(new Summary[] {}); >> >> } >> >> >> >> I hope this makes sense...thanks again! >> >> >> Cheers >> >> Amin >> >> >> >> On Sun, Mar 1, 2009 at 8:09 PM, Michael McCandless < >> luc...@mikemccandless.com> wrote: >> >> >>> You're calling get() too many times. 
For every call to get() you must >>> match with a call to release(). >>> >>> So, once at the front of your search method you should: >>> >>> MultiSearcher searcher = get(); >>> >>> then use that searcher to do searching, retrieve docs, etc. >>> >>> Then in the finally clause, pass that searcher to release. >>> >>> So, only one call to get() and one matching call to release(). >>> >>> Mike >>> >>> Amin Mohammed-Coleman wrote: >>> >>> Hi >>> The searchers are injected into the class via Spring. So when a client calls the class it is fully configured with a list of index searchers. However I have removed this list and instead injecting a list of directories which are passed to the DocumentSearchManager. DocumentSearchManager is SearchManager (should've mentioned that earlier). So finally I have modified by release code to do the following: private void release(MultiSearcher multiSeacher) throws Exception { IndexSearcher[] indexSearchers = (IndexSearcher[]) multiSeacher.getSearchables(); for(int i =0 ; i < indexSearchers.length;i++) { documentSearcherManagers[i].release(indexSearchers[i]); } } and it's use looks like this: public Summary[] search(final SearchRequest searchRequest) throwsSearchExecutionException { final String searchTerm = searchRequest.getSearchTerm(); if (StringUtils.isBlank(searchTerm)) { throw new SearchExecutionException("Search string cannot be empty. There will be too many results to process."); } List summaryList = new ArrayList(); StopWatch stopWatch = new StopWatch("searchStopWatch"); stopWatch.start(); List indexSearchers =
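Mike's "shrink wrap" advice above can be sketched generically (the names below are illustrative, not the poster's real classes): only the work that needs the acquired searcher lives inside the try, and because acquisition happens immediately before the try, the finally clause can release unconditionally with no null check.

```java
/**
 * Sketch of "shrink wrapping" a try clause: query preparation happens
 * before the try, the resource is acquired just before entering it, so
 * on entry the reference is known non-null and finally releases it
 * unconditionally. The log records the order of operations.
 */
public class ShrinkWrappedSearch {
    static final StringBuilder log = new StringBuilder();

    static String prepareQuery(String term) { log.append("parse;"); return term.trim(); }
    static Object acquire()                 { log.append("get;");   return new Object(); }
    static void release(Object searcher)    { log.append("release;"); }

    static String search(String term) {
        // Work that cannot leak a searcher goes first.
        String query = prepareQuery(term);
        // Acquire immediately before the try.
        Object searcher = acquire();
        try {
            log.append("search;");
            return "results for " + query;
        } finally {
            release(searcher); // no null check needed
        }
    }
}
```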
Re: Merging database index with fulltext index
Hi: The key to avoiding bad performance when merging a database result is to reduce the number of rows visited by your first query. As an example, take a look at these two queries using Lucene Domain Index; the two are equivalent:

Option A:

    select * from
      (select rownum as ntop_pos, q.* from
        (select extractValue(object_value,'/page/revision/timestamp'),
                extractValue(object_value,'/page/title')
         from pages
         where lcontains(object_value, 'musica') > 0
           and extractValue(object_value,'/page/revision/timestamp')
               between TO_TIMESTAMP_TZ('06-JAN-07 12.20.05.0 PM +00:00')
                   and TO_TIMESTAMP_TZ('17-JUL-07 11.47.38.0 AM +00:00')
         order by extractValue(object_value,'/page/revision/timestamp')) q)
    where ntop_pos >= 20 and ntop_pos <= 30;

Option B:

    select /*+ DOMAIN_INDEX_SORT */
           extractValue(object_value,'/page/revision/timestamp'),
           extractValue(object_value,'/page/title')
    from pages
    where lcontains(object_value,
           'rownum:[20 TO 30] AND musica AND revisionDate:[20070101 TO 20070718]',
           'revisionDate') > 0;

The first query uses traditional SQL syntax for filtering, sorting and pagination (Oracle Top-N syntax); the second resolves the filtering (revisionDate:[20070101 TO 20070718]), sorting (revisionDate) and pagination (rownum:[20 TO 30], Lucene Domain Index syntax) inside the Lucene Domain Index. Run over a subset (around 32,000 pages) of the Wikipedia dumps uploaded into an Oracle 11g, the first option takes 4 minutes and the second 55 milliseconds. The big difference is how many rows the DB needs to visit and then discard: the first option performs 2,900,671 buffer gets (disk blocks loaded into memory) versus 21 for the second. In the second execution plan the optimizer receives exactly the 10 rows to return from the Domain Index. So, no matter the technology used, the more you can filter in the index, the faster the query will be.
Obviously there will be queries for which this rule does not hold; for example, if you have a bitmap index on some column, querying the bitmap index first could be faster than a Domain Index scan, but the optimizer knows the truth. Best regards, Marcelo. PS: If you need more information about how to use Lucene Domain Index or how it works inside Oracle, please take a look at: http://docs.google.com/Doc?id=ddgw7sjp_54fgj9kg On Sat, Feb 28, 2009 at 5:07 PM, wrote: > Hi, > > what is the best approach to merge a database index with a lucene fulltext > index? Both databases store a unique ID per doc. This is the join criteria. > > requirements: > > * both resultsets may be very big (100.000 and much more) > * the merged resultset must be sorted by database index and/or relevance > * optional paging the merged resultset, a page has a size of 1000 docs max. > > example: > > select a, b from dbtable where c = 'foo' and content='bar' order by > relevance, a desc, d > > I would split this into: > > database: select ID, a, b from dbtable where c = 'foo' order by a desc, d > lucene: content:bar (sort:relevance) > merge: loop over the lucene resultset and add the db record into a new list > if the ID matches. > > If the resultset must be paged: > > database: select ID from dbtable where c = 'foo' order by a desc, d > lucene: content:bar (sort:relevance) > merge: loop over the lucene resultset and add the db record into a new list > if the ID matches. > page 1: select a,b from dbtable where ID IN (list of the ID's of page 1) > page 2: select a,b from dbtable where ID IN (list of the ID's of page 2) > ... > > > Is there a better way? > > Thank you. > -- Marcelo F. Ochoa http://marceloochoa.blogspot.com/ http://marcelo.ochoa.googlepages.com/home __ Want to integrate Lucene and Oracle?
http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html Is Oracle 11g REST ready? http://marceloochoa.blogspot.com/2008/02/is-oracle-11g-rest-ready.html
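The merge step sketched in the question (loop over the Lucene result set in score order, keep rows whose ID also appears in the database result, then cut a page out of the merged list) can be written out as a small self-contained sketch; the IDs and page size below are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

/**
 * Sketch of merging a Lucene hit list with a database result by shared
 * ID: keep Lucene's (relevance) order, filter by membership in the DB
 * result, then page. With very large result sets a HashSet makes the
 * membership test O(1) per hit.
 */
public class ResultMerger {
    /** luceneIdsInScoreOrder: hit IDs, best first; dbIds: IDs matching the SQL WHERE clause. */
    static List<String> merge(List<String> luceneIdsInScoreOrder, Set<String> dbIds) {
        List<String> merged = new ArrayList<>();
        for (String id : luceneIdsInScoreOrder) {
            if (dbIds.contains(id)) {
                merged.add(id);
            }
        }
        return merged;
    }

    /** pageNo is 1-based; returns an empty list past the end. */
    static List<String> page(List<String> merged, int pageNo, int pageSize) {
        int from = (pageNo - 1) * pageSize;
        if (from >= merged.size()) {
            return Collections.emptyList();
        }
        return merged.subList(from, Math.min(from + pageSize, merged.size()));
    }
}
```

Each page of IDs would then feed a `select ... where ID IN (...)` to fetch the row data, as in the question.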
Re: Faceted Search using Lucene
There is no such thing as final code -- code is alive and is always changing ;) It looks good to me. Though one trivial thing is: I would move the code in the try clause up to and including the multiSearcher=get() out above the try. I always attempt to "shrink wrap" what's inside a try clause to the minimum that needs to be there. I.e., your code that creates a query, finds the right sort & filter to use, etc., can all happen outside the try, because you have not yet acquired the multiSearcher. If you do that, you also don't need the null check in the finally clause, because multiSearcher must be non-null on entering the try. Mike

Amin Mohammed-Coleman wrote: Hi there Good morning! Here is the final search code:

    public Summary[] search(final SearchRequest searchRequest) throws SearchExecutionException {
      final String searchTerm = searchRequest.getSearchTerm();
      if (StringUtils.isBlank(searchTerm)) {
        throw new SearchExecutionException("Search string cannot be empty. There will be too many results to process.");
      }
      List summaryList = new ArrayList();
      StopWatch stopWatch = new StopWatch("searchStopWatch");
      stopWatch.start();
      MultiSearcher multiSearcher = null;
      try {
        LOGGER.debug("Ensuring all index readers are up to date...");
        maybeReopen();
        Query query = queryParser.parse(searchTerm);
        LOGGER.debug("Search Term '" + searchTerm + "' > Lucene Query '" + query.toString() + "'");
        Sort sort = applySortIfApplicable(searchRequest);
        Filter[] filters = applyFiltersIfApplicable(searchRequest);
        ChainedFilter chainedFilter = null;
        if (filters != null) {
          chainedFilter = new ChainedFilter(filters, ChainedFilter.OR);
        }
        multiSearcher = get();
        TopDocs topDocs = multiSearcher.search(query, chainedFilter, 100, sort);
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        LOGGER.debug("total number of hits for [" + query.toString() + "] = " + topDocs.totalHits);
        for (ScoreDoc scoreDoc : scoreDocs) {
          final Document doc = multiSearcher.doc(scoreDoc.doc);
          float score = scoreDoc.score;
          final BaseDocument baseDocument = new BaseDocument(doc, score);
          Summary documentSummary = new DocumentSummaryImpl(baseDocument);
          summaryList.add(documentSummary);
        }
      } catch (Exception e) {
        throw new IllegalStateException(e);
      } finally {
        if (multiSearcher != null) {
          release(multiSearcher);
        }
      }
      stopWatch.stop();
      LOGGER.debug("total time taken for document search: " + stopWatch.getTotalTimeMillis() + " ms");
      return summaryList.toArray(new Summary[] {});
    }

I hope this makes sense... thanks again! Cheers Amin

On Sun, Mar 1, 2009 at 8:09 PM, Michael McCandless < luc...@mikemccandless.com> wrote: You're calling get() too many times. For every call to get() you must match with a call to release(). So, once at the front of your search method you should: MultiSearcher searcher = get(); then use that searcher to do searching, retrieve docs, etc. Then in the finally clause, pass that searcher to release. So, only one call to get() and one matching call to release(). Mike

Amin Mohammed-Coleman wrote: Hi The searchers are injected into the class via Spring, so when a client calls the class it is fully configured with a list of index searchers. However, I have removed this list and am instead injecting a list of directories which are passed to the DocumentSearchManager. DocumentSearchManager is a SearchManager (should've mentioned that earlier). So finally I have modified my release code to do the following:

    private void release(MultiSearcher multiSearcher) throws Exception {
      IndexSearcher[] indexSearchers = (IndexSearcher[]) multiSearcher.getSearchables();
      for (int i = 0; i < indexSearchers.length; i++) {
        documentSearcherManagers[i].release(indexSearchers[i]);
      }
    }

and its use looks like this:

    public Summary[] search(final SearchRequest searchRequest) throws SearchExecutionException {
      final String searchTerm = searchRequest.getSearchTerm();
      if (StringUtils.isBlank(searchTerm)) {
        throw new SearchExecutionException("Search string cannot be empty. There will be too many results to process.");
      }
      List summaryList = new ArrayList();
      StopWatch stopWatch = new StopWatch("searchStopWatch");
      stopWatch.start();
      List indexSearchers = new ArrayList();
      try {
        LOGGER.debug("Ensuring all index readers are up to date...");
        maybeReopen();
        LOGGER.debug("All Index Searchers are up to date. No of index searchers '" + indexSearchers.size() + "'");
        Query query = queryParser.parse(searchTerm);
        LOGGER.debug("Search Term '" + searchTerm + "' > Lucene Query '" + query.toString() + "'");
        Sort sort = applySortIfApplicable(searchRequest);
        Filter[] filters = applyFiltersIfApplicable(searchRequest);
        ChainedFilter chainedFilter = null;
        if (filters != null) {
          chainedFilter = new ChainedFilter(filters, ChainedFilter.OR);
        }
        TopDocs topDocs = get().search(query, chainedFilter, 100, sort);
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        LOGGER.debug("total number of h
Re: Adding another factor to Lucene search
Hi Document.setBoost(float boost), where boost is either your score as is or a value based on that score, might do the trick for you. Other boosting and custom score options include BoostingQuery, BoostingTermQuery and CustomScoreQuery. A Google search for "lucene boosting" throws up lots of hits. -- Ian. On Mon, Mar 2, 2009 at 10:05 AM, liat oren wrote: > Hi, > > I would like to add to lucene's score another factor - a score between > words. > I have an index that holds couple of words with their score. > How can I take it into account when using Lucene search? > > Many thanks, > Liat >
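As a purely conceptual sketch of what a custom score does (plain Java, not the CustomScoreQuery API; the combination rule here is an assumption, not Lucene's): treat the external word-association score as a multiplicative boost on top of Lucene's relevance score, so documents the external index also likes are promoted.

```java
/**
 * Illustrative combination of a Lucene relevance score with an external
 * per-word score. An external score of 0 leaves the Lucene score
 * unchanged; larger values boost it proportionally.
 */
public class ScoreCombiner {
    /** externalScore >= 0; returned value preserves Lucene's ordering when externalScore is constant. */
    static float combine(float luceneScore, float externalScore) {
        return luceneScore * (1.0f + externalScore);
    }
}
```

With CustomScoreQuery the same idea is expressed by overriding the score-combination hook instead of post-processing hits.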
Re: search by word offset
Not sure what you are asking about, but you might want to take a look at http://lucene.apache.org/java/2_4_0/api/contrib-surround/index.html The Surround parser offers many features around the span query (which I suspect is what you are looking for) Shashi On Mon, Mar 2, 2009 at 4:57 AM, shb wrote: > > hi i need help. > > i need to search by word in sentences with lucene. for example by the word > "bbb" i got the right results of all the sentences : > > "text ok ok ok bbb" , "text 2 bbb text " , "bbb text 4...". > > but i need the result by the word offset in the sentence like this: > > "bbb text 4...". , "text 2 bbb text " , "text 1 ok ok ok bbb" .. > > waiting for ideas.. thanks.. > > > -- > View this message in context: > http://www.nabble.com/search-by-word-offset-tp22284787p22284787.html > Sent from the Lucene - Java Users mailing list archive at Nabble.com. > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
Adding another factor to Lucene search
Hi, I would like to add another factor to Lucene's score - a score between words. I have an index that holds pairs of words with their score. How can I take it into account when using Lucene search? Many thanks, Liat
search by word offset
Hi, I need help. I need to search by word in sentences with Lucene. For example, for the word "bbb" I get the right results, all the sentences: "text ok ok ok bbb", "text 2 bbb text", "bbb text 4...". But I need the results ordered by the word's offset in the sentence, like this: "bbb text 4...", "text 2 bbb text", "text 1 ok ok ok bbb". Waiting for ideas... thanks.
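Conceptually, the ordering asked for here ranks matching sentences by the token offset of the query word's first occurrence. In Lucene itself this is span-query territory (e.g. SpanFirstQuery, as suggested in the reply above), but the desired ordering can be shown in a few lines of plain Java:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/**
 * Illustrative ranking of sentences by the position of the query word:
 * keep only sentences containing the word, earliest occurrence first.
 * Tokenization here is a naive whitespace split, for demonstration only.
 */
public class OffsetRanker {
    /** Token offset of the first occurrence of word, or Integer.MAX_VALUE if absent. */
    static int firstOffset(String sentence, String word) {
        String[] tokens = sentence.split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(word)) {
                return i;
            }
        }
        return Integer.MAX_VALUE;
    }

    /** Sentences containing word, ordered by ascending first-occurrence offset. */
    static List<String> rank(List<String> sentences, String word) {
        List<String> hits = new ArrayList<>();
        for (String s : sentences) {
            if (firstOffset(s, word) != Integer.MAX_VALUE) {
                hits.add(s);
            }
        }
        hits.sort(Comparator.comparingInt(s -> firstOffset(s, word)));
        return hits;
    }
}
```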
Re: Faceted Search using Lucene
Hi there Good morning! Here is the final search code:

    public Summary[] search(final SearchRequest searchRequest) throws SearchExecutionException {
      final String searchTerm = searchRequest.getSearchTerm();
      if (StringUtils.isBlank(searchTerm)) {
        throw new SearchExecutionException("Search string cannot be empty. There will be too many results to process.");
      }
      List summaryList = new ArrayList();
      StopWatch stopWatch = new StopWatch("searchStopWatch");
      stopWatch.start();
      MultiSearcher multiSearcher = null;
      try {
        LOGGER.debug("Ensuring all index readers are up to date...");
        maybeReopen();
        Query query = queryParser.parse(searchTerm);
        LOGGER.debug("Search Term '" + searchTerm + "' > Lucene Query '" + query.toString() + "'");
        Sort sort = applySortIfApplicable(searchRequest);
        Filter[] filters = applyFiltersIfApplicable(searchRequest);
        ChainedFilter chainedFilter = null;
        if (filters != null) {
          chainedFilter = new ChainedFilter(filters, ChainedFilter.OR);
        }
        multiSearcher = get();
        TopDocs topDocs = multiSearcher.search(query, chainedFilter, 100, sort);
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        LOGGER.debug("total number of hits for [" + query.toString() + "] = " + topDocs.totalHits);
        for (ScoreDoc scoreDoc : scoreDocs) {
          final Document doc = multiSearcher.doc(scoreDoc.doc);
          float score = scoreDoc.score;
          final BaseDocument baseDocument = new BaseDocument(doc, score);
          Summary documentSummary = new DocumentSummaryImpl(baseDocument);
          summaryList.add(documentSummary);
        }
      } catch (Exception e) {
        throw new IllegalStateException(e);
      } finally {
        if (multiSearcher != null) {
          release(multiSearcher);
        }
      }
      stopWatch.stop();
      LOGGER.debug("total time taken for document search: " + stopWatch.getTotalTimeMillis() + " ms");
      return summaryList.toArray(new Summary[] {});
    }

I hope this makes sense... thanks again! Cheers Amin On Sun, Mar 1, 2009 at 8:09 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > > You're calling get() too many times.
For every call to get() you must > match with a call to release(). > > So, once at the front of your search method you should: > > MultiSearcher searcher = get(); > > then use that searcher to do searching, retrieve docs, etc. > > Then in the finally clause, pass that searcher to release. > > So, only one call to get() and one matching call to release(). > > Mike > > Amin Mohammed-Coleman wrote: > > Hi >> The searchers are injected into the class via Spring. So when a client >> calls the class it is fully configured with a list of index searchers. >> However I have removed this list and instead injecting a list of >> directories which are passed to the DocumentSearchManager. >> DocumentSearchManager is SearchManager (should've mentioned that earlier). >> So finally I have modified by release code to do the following: >> >> private void release(MultiSearcher multiSeacher) throws Exception { >> >> IndexSearcher[] indexSearchers = (IndexSearcher[]) >> multiSeacher.getSearchables(); >> >> for(int i =0 ; i < indexSearchers.length;i++) { >> >> documentSearcherManagers[i].release(indexSearchers[i]); >> >> } >> >> } >> >> >> and it's use looks like this: >> >> >> public Summary[] search(final SearchRequest searchRequest) >> throwsSearchExecutionException { >> >> final String searchTerm = searchRequest.getSearchTerm(); >> >> if (StringUtils.isBlank(searchTerm)) { >> >> throw new SearchExecutionException("Search string cannot be empty. There >> will be too many results to process."); >> >> } >> >> List summaryList = new ArrayList(); >> >> StopWatch stopWatch = new StopWatch("searchStopWatch"); >> >> stopWatch.start(); >> >> List indexSearchers = new ArrayList(); >> >> try { >> >> LOGGER.debug("Ensuring all index readers are up to date..."); >> >> maybeReopen(); >> >> LOGGER.debug("All Index Searchers are up to date. 
No of index searchers '" >> + >> indexSearchers.size() +"'"); >> >> Query query = queryParser.parse(searchTerm); >> >> LOGGER.debug("Search Term '" + searchTerm +"' > Lucene Query '" + >> query.toString() +"'"); >> >> Sort sort = null; >> >> sort = applySortIfApplicable(searchRequest); >> >> Filter[] filters =applyFiltersIfApplicable(searchRequest); >> >> ChainedFilter chainedFilter = null; >> >> if (filters != null) { >> >> chainedFilter = new ChainedFilter(filters, ChainedFilter.OR); >> >> } >> >> TopDocs topDocs = get().search(query,chainedFilter ,100,sort); >> >> ScoreDoc[] scoreDocs = topDocs.scoreDocs; >> >> LOGGER.debug("total number of hits for [" + query.toString() + " ] = >> "+topDocs. >> totalHits); >> >> for (ScoreDoc scoreDoc : scoreDocs) { >> >> final Document doc = get().doc(scoreDoc.doc); >> >> float score = scoreDoc.score; >> >> final BaseDocument baseDocument = new BaseDocument(doc, score); >> >> Summary documentSummary = new DocumentSummaryImpl(baseDocument); >> >> summaryList.add(documentSummary); >> >> } >> >> } catch (Exception e) { >> >> throw new IllegalStateException(e); >> >> } finally { >> >>