Re: Problems with ItemBasedRecommender with Lucene

2009-09-17 Thread Thomas Rewig
Oh, I overlooked the simplest way to do that. You're right, tokens are 
the key to this problem. It works pretty well.
It would be perfect if I could use payloads. I read your advice at 
http://www.lucidimagination.com/blog/category/payloads/.


You store the payloads with your PayloadAnalyzer in this way:

   //Store both position and offset information
   Field text = new Field("body", DOCS[i], Field.Store.NO, Field.Index.ANALYZED);


Is there a chance to use

   Field.Index.ANALYZED_NO_NORMS

because otherwise my index would be much too big, or are norms necessary 
for payloads?


You use Lucene 2.9; is there a way to do this with Lucene 2.4.1? I 
can't find e.g. the "PayloadEncoder". Or do I have to wait for the release?


Regards Thomas


You might want to ask on mahout-user, but I'm guessing Ted didn't mean 
a new field for every item-item pair, but instead to represent them as 
tokens and then create the corresponding appropriate queries (seems 
like payloads may be useful, or function queries).  That to me is the 
only way you would achieve the sparseness savings you are after.


-Grant

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) 
using Solr/Lucene:

http://www.lucidimagination.com/search


Hello,

I built a "real time ItemBasedRecommender" based on a user's history 
and a (sparse) item similarity matrix with Lucene. Some time ago Ted 
Dunning recommended this approach to me on the Mahout mailing list for 
creating an ItemBasedRecommender:


"It is actually very easy to do. The output of the recommendation 
off-line process is generally a sparse matrix of item-item links. 
Each line of this sparse matrix can be considered a document in 
creating a Lucene index. You will have to use a correct analyzer and 
a line by line document segmenter, but that is trivial. Then 
recommendation is a simple query step."


So for 10 items it works fine - but for 1 million items the 
indexing fails and I have no idea how to avoid this. Maybe you can 
give me a hint.


First I create an item-item similarity matrix with Mahout's Taste, and 
in the second step I index it. The matrix is sparse because only 
item-item relations with a high correlation are saved.


Here are the code snippets for this indexing:
    CachedRowSetImpl rowSetMainItemList = null; // mapping of items
    ArrayList listBelongingItems = null; // highest-correlating items for a main item

    Document aDocument = null;
    Field aField = null;
    Field aField1 = null;
    Analyzer aAnalyzer = new StandardAnalyzer();
    IndexWriter aWriter = new IndexWriter(this.indexDirectory,
            aAnalyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);

    aWriter.setRAMBufferSizeMB(48);
    rowSetMainItemList = getRowSetItemList(); // get all items
    aField1 = new Field("Item1", "", Field.Store.YES, Field.Index.ANALYZED); // reuse this field

    while (rowSetMainItemList.next()) {
        aDocument = new Document();

        aField1.setValue(rowSetMainItemList.getString(1));
        aDocument.add(aField1);

        // get the most similar items for an item
        listBelongingItems = getRowSetBelongingItems(rowSetMainItemList.getString(1));
        Iterator itrBelongingItems = listBelongingItems.iterator();

        while (itrBelongingItems.hasNext()) {
            String strBelongingItem = (String) itrBelongingItems.next();
            // no reuse of the Field possible because of the different field names:
            aField = new Field(strBelongingItem, "1",
                    Field.Store.NO, Field.Index.ANALYZED_NO_NORMS);
            aDocument.add(aField);
        }

        aWriter.addDocument(aDocument);
    }

    aWriter.optimize();
    aWriter.close();
    aAnalyzer.close();
Actually the field of the belonging item would have to be boosted with 
the main-item/belonging-item correlation value to get accurate 
recommendations, but then the index would be about 80 GByte for 
6 million items... without it, it will only be about 2 GByte.
But under the condition that only relevant correlations are saved in 
the similarity matrix, the recommendation quality will be good enough.
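
(For reference, a minimal sketch of that boosting variant, with 
getCorrelation as a hypothetical lookup into the similarity matrix. Note 
that an index-time field boost is folded into the norm, so it needs 
ANALYZED instead of ANALYZED_NO_NORMS, which is presumably where the 
size explosion comes from:)

    // hypothetical lookup of the main-item/belonging-item correlation
    float correlation = getCorrelation(rowSetMainItemList.getString(1),
            strBelongingItem);
    aField = new Field(strBelongingItem, "1",
            Field.Store.NO, Field.Index.ANALYZED); // norms must stay on
    aField.setBoost(correlation); // folded into the norm at index time
    aDocument.add(aField);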


The item recommendation for a user is a simple BooleanQuery built from 
TermQuerys boosted by the user history. Here I search for the documents 
with the largest correspondence to the user history: I look in which 
documents the most fields named after a belonging item are set (with 
value 1) and recommend the "key" value that was stored in 
aField1 ("Item1").
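
(Sketched out, with userHistory and weightFor as hypothetical stand-ins 
for the user's item ids and their weights, the query step looks roughly 
like this:)

    BooleanQuery query = new BooleanQuery();
    for (String historyItem : userHistory) {
        TermQuery termQuery = new TermQuery(new Term(historyItem, "1"));
        termQuery.setBoost(weightFor(historyItem)); // e.g. rating or recency
        query.add(termQuery, BooleanClause.Occur.SHOULD);
    }
    TopDocs topDocs = searcher.search(query, null, 10);
    // the stored "Item1" value of each hit is a recommended item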
Anyway, as I mentioned, it worked for a number of 10 items. But if 
there are 1 million items, the indexing crashes after a while with


Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
  at java.util.HashMap.resize(HashMap.

Re: Problems with ItemBasedRecommender with Lucene

2009-09-17 Thread Grant Ingersoll


On Sep 17, 2009, at 5:06 AM, Thomas Rewig wrote:

Oh, I overlooked the simplest way to do that. You're right, tokens 
are the key to this problem. It works pretty well.
It would be perfect if I could use payloads. I read your advice at 
http://www.lucidimagination.com/blog/category/payloads/.


You store the payloads with your PayloadAnalyzer in this way:

  //Store both position and offset information
  Field text = new Field("body", DOCS[i], Field.Store.NO, Field.Index.ANALYZED);


Is there a chance to use

  Field.Index.ANALYZED_NO_NORMS


I don't see why not.



because otherwise my index would be much too big, or are norms 
necessary for payloads?


You use Lucene 2.9; is there a way to do this with Lucene 2.4.1? I 
can't find e.g. the "PayloadEncoder". Or do I have to wait for the 
release?


I'd bet that patch wouldn't be too hard to backport, since it lives in 
contrib/analyzers.  All it does anyway is provide a generic way of 
adding a payload based on a data type.  Payloads are in 2.4.1 and all 
they are is a byte array, so it should be easy enough to write a 
simple TokenFilter that does what you want.
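
A rough sketch of such a filter, written against the 2.4-era TokenStream 
API (untested, and how you obtain the weight for each token is left out):

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.index.Payload;

    public class FloatPayloadFilter extends TokenFilter {
      private final float weight;

      public FloatPayloadFilter(TokenStream input, float weight) {
        super(input);
        this.weight = weight;
      }

      public Token next(Token reusableToken) throws IOException {
        Token token = input.next(reusableToken);
        if (token == null) return null;
        // a payload is just a byte array; encode the float into 4 bytes
        int bits = Float.floatToIntBits(weight);
        byte[] bytes = new byte[] {
            (byte) (bits >>> 24), (byte) (bits >>> 16),
            (byte) (bits >>> 8), (byte) bits };
        token.setPayload(new Payload(bytes));
        return token;
      }
    }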




Re: Displaying search result data - stored fields vs external source

2009-09-17 Thread Savvas-Andreas Moysidis
Hello,

I would also prefer to store the content in the index because, as Erick
points out, this leads to a simpler design, but also because it allows
me to preserve the relevance sort.



If you store only the item id in the index, then when extracting all the
other required data from, say, a database, you will probably execute a

  select * from item where id in (id_1, id_2, id_3, ...)

which will probably not retain your relevance sort. So unless you sort by a
business field or apply some kind of convoluted sort strategy which maps back
to your original Lucene result set, you will have lost your ranking.
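
To make that concrete, here is a small sketch (ItemRow, getId, and the
"id" field are made-up names; scoreDocs are the hits from a TopDocs):
record each id's rank from the hits, then sort the fetched rows by it.

  final Map<String, Integer> rankById = new HashMap<String, Integer>();
  for (int i = 0; i < scoreDocs.length; i++) {
      Document doc = searcher.doc(scoreDocs[i].doc);
      rankById.put(doc.get("id"), Integer.valueOf(i)); // rank = hit position
  }
  // after the bulk "select ... where id in (...)" fetch:
  Collections.sort(itemRows, new Comparator<ItemRow>() {
      public int compare(ItemRow a, ItemRow b) {
          return rankById.get(a.getId()).compareTo(rankById.get(b.getId()));
      }
  });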



Cheers,

savvas


2009/9/15 Erick Erickson 

> Categorically I store everything in the index unless/until I *know* it
> doesn't work. With some things, it's easy to know from the outset, like
> if I have 20T of data to store.
>
> First, storing fields has minimal impact on the search speed; the stored
> text isn't interleaved with the search tokens, so they're pretty much
> disjoint.
>
> Second, any scheme storing data separately is inherently more complex
> and difficult to maintain. From the eXtreme Programming folks: "Do the
> simplest thing that could possibly work".
>
> Third, there isn't much work in trying it and seeing. I mean, you have
> to write the retrieval code, and if you encapsulate fetching the data
> you can switch it out later pretty easily if it comes to that. So you
> don't lose much at all by "just trying it".
>
> HTH
> Erick
>
> On Tue, Sep 15, 2009 at 4:19 AM, Joel Halbert 
> wrote:
>
> > Hi,
> >
> > When using Lucene I always consider two approaches to displaying search
> > result data to users:
> >
> > 1. Store any fields that we index and display to users in the Lucene
> > documents themselves. When we perform a search, simply retrieve the data
> > to be displayed from the Lucene documents.
> >
> > or
> >
> > 2. Index fields in Lucene but reference data to be displayed from
> > another source, such as a database. So, when searching I would search
> > for documents, then use a (stored) reference key on the documents to
> > look up the fields to display from another source, e.g. a
> > database.
> >
> > With regards to the number and size of stored fields, I am looking at
> > indexing and displaying approximately 4 relatively small fields for each
> > document (e.g. name, age, short description, URL; approx. 500 bytes in
> > total). In any query about 10 hits will be displayed to the user.
> > Approximately 10 million documents to index and search.
> >
> > I am interested in the differences between both approaches with regards to:
> >
> > 1) Indexing time performance (how long it might take to index with and
> > without stored fields)
> > 2) Search time performance (total time taken to search for matching
> > documents and then display fields to users)
> >
> > I am less interested in differences arising from
> > maintainability/increased storage requirements.
> >
> > I would be interested to see what others think of each approach.
> >
> > Cheers,
> > Joel
> >
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >
>


Re: Counting search results

2009-09-17 Thread Mathias Bank
Hello,

I have tried your method, but it doesn't work.

set will be null after applying

BitSet set = filter.bits(reader);

I haven't found any reason for this.

Additionally, the bits method is deprecated and the documentation says to
use "getDocIdSet". But that set only provides an iterator, so no
random-access checks are possible.
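
(For what it's worth, a sketch of counting via the iterator instead, using
the 2.4-style skipTo/doc API and leap-frogging the filter against the
TermDocs; untested:)

  DocIdSetIterator filterIt = filter.getDocIdSet(reader).iterator();
  TermDocs termDocs = reader.termDocs(new Term("myField", "myTerm"));
  int numDocs = 0;
  boolean filterHasMore = filterIt.next(); // position on first filtered doc
  while (filterHasMore && termDocs.next()) {
      int doc = termDocs.doc();
      if (filterIt.doc() < doc) {
          filterHasMore = filterIt.skipTo(doc); // jump ahead to the term's doc
      }
      if (filterHasMore && filterIt.doc() == doc) {
          numDocs++; // doc matches both the term and the filter
      }
  }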

Are there any other possibilities to improve speed?

Mathias


On 15.09.2009 17:13, Simon Willnauer wrote:
> Hmm, so if you wanna use the Filter to narrow down the search results
> you could use it in the while loop like this:
>
> BitSet set = filter.bits(reader);
>
> int numDocs = 0;
> TermDocs termDocs = reader.termDocs(new Term("myField", "myTerm"));
> while (termDocs.next()) {
>   if (set.get(termDocs.doc()))
>     numDocs++;
> }
>
> would that help?
>
> simon
>
>
> On Tue, Sep 15, 2009 at 5:01 PM, Mathias Bank <mathias.b...@gmail.com> wrote:
> > Hello,
> >
> > This seems to be a similar solution like:
> >
> > Term t = new Term(fieldname, term);
> > int count = searcher.docFreq(t);
> >
> > The problem is that in this situation it is not possible to apply a
> > filter object. If I don't wanna use this filter object, I would have
> > to use a complex search query, which is, again, very slow. So,
> > unfortunately, your solution does not help.
> >
> > Mathias
> >
> > 2009/9/15 Simon Willnauer <simon.willna...@googlemail.com>:
> >> Did you try:
> >>
> >> int numDocs = 0;
> >> TermDocs termDocs = reader.termDocs(new Term("myField", "myTerm"));
> >> while (termDocs.next()) { numDocs++; }
> >>
> >> simon
> >>
> >> On Tue, Sep 15, 2009 at 2:19 PM, Mathias Bank <mathias.b...@gmail.com> wrote:
> >>> Hello,
> >>>
> >>> I'm trying to find the number of documents for a specific term to
> >>> create text statistics. I'm not interested in ordering the results or
> >>> even receiving the first result. I just need the number of results.
> >>>
> >>> Currently, I'm trying to do this by using the Lucene searcher class:
> >>>
> >>> IndexSearcher searcher = new IndexSearcher(reader);
> >>> String queryString = fieldname + ":" + term;
> >>> QueryParser parser = new QueryParser(fieldname, new GermanAnalyzer());
> >>> TopDocs d = searcher.search(parser.parse(queryString), filter, 1);
> >>> int count = d.totalHits;
> >>>
> >>> The problem is that there is a large index (optimized) with > 8 million
> >>> entries. One search could return a large number of search results
> >>> (> 1 million). Currently these search tasks take more than 15 seconds.
> >>>
> >>> The question is: is there any way to get the number of search results
> >>> faster? I think that it could be optimized by not using a Weight
> >>> object (order is not interesting), but I haven't seen a way to do
> >>> this.
> >>>
> >>> I hope someone has already solved this problem.
> >>>
> >>> Mathias

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



How to perform a phrase "begins with" query?

2009-09-17 Thread Paul_Murdoch
Hi all,

 

Since you can't (and it doesn't make sense to) use wildcards in phrase
queries, how do you construct a query to get results for phrases that
begin with a certain set of terms?  Here are some theoretical
examples...

 

Example 1 - I have an index where each document contains the contents of
short stories.  I want to return each document that begins with the
words "Once upon a time".  I know this is not valid Lucene syntax, but
what I would like to do is query for "Once upon a time"*

 

Example 2 - I have an index where each document contains numbered test
results, say test 1 - test 5000.  I want to return each document where
the test starts with the number 5.  So the query here would be (again I
know this isn't valid) something like "test 5"*

 

How can this be accomplished?

 

Thanks

Paul  

 



Re: How to perform a phrase "begins with" query?

2009-09-17 Thread AHMET ARSLAN
> Since you can't (and it doesn't make sense to) use
> wildcards in phrase
> queries, how do you construct a query to get results for
> phrases that begin with a certain set of terms?  
> Here are some theoretical examples...
> 
> 
> Example 1 - I have an index where each document contains
> the contents of
> short stories.  I want to return each document that
> begins with the
> words "Once upon a time".  I know this is not valid
> Lucene syntax, but
> what I would like to do is query for "Once upon a time"*

You are trying to retrieve documents that begin with "Once upon a time", 
right? You want your phrase at the beginning of the document. You can 
retrieve them programmatically using the SpanQuery family.

I am not sure about the value of (int end) in the SpanFirstQuery 
constructor, but it will be something like this:

SpanQuery s1 = new SpanTermQuery(new Term("story", "once"));
SpanQuery s2 = new SpanTermQuery(new Term("story", "upon"));
SpanQuery s3 = new SpanTermQuery(new Term("story", "time"));

SpanQuery s4 = new SpanNearQuery(new SpanQuery[] { s1, s2, s3 }, 0, true);

SpanQuery s5 = new SpanFirstQuery(s4, 3);

Note that you need to use the analyzed form of the terms in this approach.
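
For completeness, running it would look something like this (assuming an
IndexSearcher named searcher is already open):

  TopDocs topDocs = searcher.search(s5, null, 10); // docs whose opening words match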

Hope this helps.




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: New "Stream closed" exception with Java 6 - solved

2009-09-17 Thread Chris Hostetter

: It turns out that the cause of the exceptions is in fact adding an item 
: twice - so you were correct right at the start :-)  I ran a test where I 

glad to see it all worked out.

: Just a minor point: isn't Lucene in a position to detect the duplicate 
: insertion attempt and flag it with something less vague than "Stream 
: closed"?  :-)

not really ... adding a document multiple times is a perfectly legal use 
case; adding a document with a "Reader" based field where the reader is 
already closed ... that's not legal (and Lucene doesn't really have any 
way of knowing if the Reader is closed because *it* closed it).


-Hoss


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to perform a phrase "begins with" query?

2009-09-17 Thread Mark Harwood

Since you can't (and it doesn't make sense to) use wildcards in phrase
queries,



You can with this:  
http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/misc/src/java/org/apache/lucene/queryParser/complexPhrase/

Discussion here: http://tinyurl.com/lrnage
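
Usage is roughly like this (a sketch; the field name and analyzer are
placeholders):

  ComplexPhraseQueryParser parser =
      new ComplexPhraseQueryParser("body", new StandardAnalyzer());
  Query q = parser.parse("\"Once upon a time*\""); // wildcard inside the phrase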

Cheers,
 Mark

Re: Combining hits from multiple documents into a single hit

2009-09-17 Thread Chris Hostetter

Assuming i understand you correctly, then...
  1. properties only exist as part of a single article (no articles share 
a complex property) 
  2. you don't have any need to ever return search results on 
properties; they exist just to aid in searching for articles.

IF that's correct, then the idea i would try is to only index 1 document 
per article, with all of the text included, and use payloads to annotate 
which text is secured by which property.  Then use SpanQueries to search 
for your docs, and in a custom HitCollector check the matching spans for 
each doc to get the corresponding property, and test that 
triple against your security mechanism -- if any 
fail, skip that doc.


It's not something i've ever tried (or thought through very hard) but 
based on other comments i've seen from people about payloads it sounds 
like it should work pretty well and give you decent scores.
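
Roughly the shape of it, sketched against the Lucene 2.9 Spans payload API
(securityCheck, user, and the idea that the payload bytes name the
property are all made up for illustration):

  SpanTermQuery query = new SpanTermQuery(new Term("body", "foo"));
  Spans spans = query.getSpans(reader);
  Set<Integer> visibleDocs = new HashSet<Integer>();
  while (spans.next()) {
      if (spans.isPayloadAvailable()) {
          for (byte[] payload : spans.getPayload()) {
              String property = new String(payload); // property securing this span
              if (securityCheck(user, property)) {
                  visibleDocs.add(Integer.valueOf(spans.doc()));
              }
          }
      }
  }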


: [I originally posted this to the Lucene.net mailing list,but it was suggested
: that I might have more luck here]
: 
: I am trying to get a particular search to work and it is proving problematic.
: The actual source data is quite complex but can be summarised by the following
: example:
: 
: I have articles that are indexed so that they can be searched. Each article
: also has multiple properties associated with it which are also indexed and
: searchable. When users search, they can get hits in either the main article or
: the associated properties. Regardless of where a hit is achieved, the article
: is returned as a search hit (ie. the properties are never a hit in their own
: right).
: 
: Now for the complexity:
: 
: Each property has security on it, which means that for any given user, they
: may or may not be able to see the property. If a user cannot see a property,
: they obviously do not get a search hit in it. This security check is
: proprietary and cannot be done using the typical mechanism of storing a role
: in the index alongside the other fields in the document.
: 
: I currently have a index that contains the articles and properties indexed
: separately (ie. an article is indexed as a document, and each property has its
: own document). When a search happens, a hit in article A or a hit in any of
: the properties of article A should be classed as hit for article A alone, with
: the scores combined.
: 
: Whether or not a user can see a property is not based on the property itself,
: but on the value of the property. I cannot therefore put the extra security
: conditions into the query upfront as I don't know the value to filter by.
: 
: As an example:
: 
: +-+++
: | Article | Property 1 | Property 2 |
: +-+++
: |A| X  | J  |
: |B| Y  | K  |
: |C| Z  | L  |
: +-+++
: 
: If a user can see everything, then searching for "B and Y" will return a
: single search result for article B.
: 
: If another user cannot see a property if its value contains Y, then searching
: for "B and Y" will return no hits.
: 
: I have no way of knowing what values a user can and cannot see upfront. They
: only way to tell is to perform the security check (currently done at the time
: of filtering a hit from a field in the document), which I obviously cannot do
: for every possible data value for each user.
: 
: To achieve this originally, Lucene v1.3 was modified to allow this to happen
: by changing BooleanQuery to have a custom Scorer that could apply the logic of
: the security check and the combination of two hits in different documents
: being classed as a hit in a single document. I am trying to upgrade this
: version to the latest (v2.3.2 - I am using Lucene.Net), but ideally without
: having to modify Lucene in any way.
: 
: An additional problem occurs if I do an AND search. If an article contains the
: word foo and one of its properties contains the word bar, then searching for
: "foo AND bar" will return the article as a hit. My current code deals with
: this inside the custom Scorer.
: 
: Any ideas how/if this can be done?
: 
: I am thinking along the lines of using a custom HitCollector and passing that
: into the search, but when doing the boolean search "foo AND bar", execution
: never reaches my HitCollector as the ConjunctionScorer filters out all of the
: results from the sub-queries before getting there.
: 
: Thanks,
: 
: Adrian
: 
: 
: 
: -
: To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
: For additional commands, e-mail: java-user-h...@lucene.apache.org



-Hoss


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Filtering question/advice

2009-09-17 Thread Chris Hostetter

FWIW: a test case with multiple asserts is more useful if you clarify 
where it fails ... ie: show us the failure message, or put a comment on 
the line of the assert that fails.

i didn't run your testcase, but skimming it a few things jumped out at me 
that might explain whatever problem you are seeing...

:Field uw1 = new Field("uw-refernce", "hello", Field.Store.NO,
: Field.Index.ANALYZED);
:Field uw2 = new Field("uw-refernce", "bye", Field.Store.NO,
: Field.Index.ANALYZED);
...
:layerDocumentA = new Document();
:layerDocumentA.add(uw1);
:layerDocumentA.add(uw1);

...did you really mean to add uw1 twice? or did you mean to add uw2 as 
well (it's never used)...

: public void testUWBCanSeeResultIfSearchTermMatchesOnSomethingElse()
: throws Exception {
...
: UnderwriterReferenceFilter filter = new
: UnderwriterReferenceFilter();

...you never set any properties on this Filter before you use it. Reading 
its implementation, that should cause an IllegalArgumentException.


-Hoss


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org