Re: Can I still search documents once updated?
It indeed is not stored, but this is still unexpected behavior. It's a stored and indexed field, why has the index data been lost?

On Wed, Jul 13, 2011 at 12:44 AM, Erick Erickson erickerick...@gmail.com wrote:

> Unless you stored your content field, the value you put in there won't be fetched from the index. Verify that the doc you retrieve from the index has values for content, I bet it doesn't.
>
> Best
> Erick
>
> On Tue, Jul 12, 2011 at 9:38 AM, Gabriele Kahlout gabri...@mysimpatico.com wrote:
>
>> @Test
>> public void testUpdateLoseTermsSimplified() throws Exception {
>>     IndexWriter writer = indexDoc();
>>     assertEquals(1, writer.numDocs());
>>     IndexSearcher searcher = getSearcher(writer);
>>     final TermQuery termQuery = new TermQuery(new Term("content", "essen"));
>>
>>     TopDocs docs = searcher.search(termQuery, 1);
>>     assertEquals(1, docs.totalHits);
>>     Document doc = searcher.doc(0);
>>
>>     writer.updateDocument(new Term("id", doc.get("id")), doc);
>>
>>     searcher = getSearcher(writer);
>>     docs = searcher.search(termQuery, 1);
>>     assertEquals(1, docs.totalHits); // docs.totalHits == 0 !
>> }
>>
>> testUpdateLosesTerms(com.mysimpatico.me.indexplugins.WcTest) Time elapsed: 0.346 sec FAILURE!
>> java.lang.AssertionError: expected:<1> but was:<0>
>>     at org.junit.Assert.fail(Assert.java:91)
>>     at org.junit.Assert.failNotEquals(Assert.java:645)
>>     at org.junit.Assert.assertEquals(Assert.java:126)
>>     at org.junit.Assert.assertEquals(Assert.java:470)
>>     at org.junit.Assert.assertEquals(Assert.java:454)
>>     at com.mysimpatico.me.indexplugins.WcTest.testUpdateLosesTerms(WcTest.java:271)
>>
>> I have not changed anything (as you can see) during the update. I just retrieve a document and then update it. But then the termQuery that worked before doesn't work anymore (while the id field wasn't changed). Is this to be expected when the content field is not stored?
>>
>> --
>> Regards,
>> K. Gabriele
>>
>> --- unchanged since 20/9/10 ---
>> P.S. If the subject contains [LON] or the addressee acknowledges the receipt within 48 hours then I don't resend the email.
>> subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x) < Now + 48h) ⇒ ¬resend(I, this).
>>
>> If an email is sent by a sender that is not a trusted contact or the email does not contain a valid code then the email is not received. A valid code starts with a hyphen and ends with X.
>> ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈ L(-[a-z]+[0-9]X)).
Re: Can I still search documents once updated?
Wait, you directly contradicted yourself <G> You say it's not stored, then you say it's stored and indexed, which is it?

When you fetch a document, only stored fields are returned, and the returned data is a verbatim copy of the original data. No attempt is made to return un-stored fields. This has always been the behavior.

If you attempted to return indexed but not stored data, you'd get stemmed versions, stop words would be removed, synonyms would be in place, etc. Not to mention it would be very slow.

If the field is stored, then there's another problem; you might want to dump the document after reading it from the IR.

Best
Erick
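The behavior described above can be sketched with a toy in-memory index. This is a minimal illustration under stated assumptions, not Lucene code: `ToyIndex`, `update`, `fetch` and `hits` are hypothetical names, and the tokenizer is a plain whitespace split. It only shows why writing back a fetched document drops the terms of a field that was indexed but not stored.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only. This is NOT Lucene code; every name here is hypothetical.
final class ToyIndex {
    // Indexed side: term -> ids of the documents that contain it.
    private final Map<String, Set<String>> postings = new HashMap<>();
    // Stored side: id -> verbatim copies of the stored fields.
    private final Map<String, Map<String, String>> stored = new HashMap<>();

    /** Replaces the document with this id, like a delete followed by an add. */
    void update(String id, Map<String, String> storedFields, String indexedOnlyContent) {
        for (Set<String> ids : postings.values()) {
            ids.remove(id); // drop every posting of the old version
        }
        stored.put(id, new HashMap<String, String>(storedFields));
        for (String token : indexedOnlyContent.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new HashSet<>()).add(id);
            }
        }
    }

    /** Returns only what was stored; indexed-only content cannot be recovered. */
    Map<String, String> fetch(String id) {
        Map<String, String> fields = stored.get(id);
        return fields == null
                ? new HashMap<String, String>()
                : new HashMap<String, String>(fields);
    }

    /** Number of documents whose indexed content contains the term. */
    int hits(String term) {
        Set<String> ids = postings.get(term);
        return ids == null ? 0 : ids.size();
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        Map<String, String> fields = new HashMap<String, String>();
        fields.put("id", "1");
        index.update("1", fields, "wir essen heute");
        System.out.println(index.hits("essen")); // prints 1

        // Round trip: fetch the document and write it back unchanged.
        Map<String, String> fetched = index.fetch("1");
        String content = fetched.containsKey("content") ? fetched.get("content") : "";
        index.update("1", fetched, content);
        System.out.println(index.hits("essen")); // prints 0
    }
}
```

The fetched map carries only the `id` field, so the second `update` has no content to re-tokenize, which mirrors the failing assertion in the test.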
Re: Can I still search documents once updated?
On Wed, Jul 13, 2011 at 1:57 PM, Erick Erickson erickerick...@gmail.com wrote:

> Wait, you directly contradicted yourself <G> You say it's not stored, then you say it's stored and indexed, which is it?

Yes, I meant indexed and not stored.

> When you fetch a document, only stored fields are returned, and the returned data is a verbatim copy of the original data. No attempt is made to return un-stored fields. This has always been the behavior.
>
> If you attempted to return indexed but not stored data, you'd get stemmed versions, stop words would be removed, synonyms would be in place, etc. Not to mention it would be very slow.

This is what I was expecting. Otherwise, updating a field of a document that has an unstored but indexed field is impossible without losing the unstored but indexed field. I call this updating a field of a document AND deleting/updating all its unstored but indexed fields.

> If the field is stored, then there's another problem; you might want to dump the document after reading it from the IR.
>
> Best
> Erick
Re: Can I still search documents once updated?
On 13.07.2011 14:05, Gabriele Kahlout wrote:

> This is what I was expecting. Otherwise, updating a field of a document that has an unstored but indexed field is impossible without losing the unstored but indexed field. I call this updating a field of a document AND deleting/updating all its unstored but indexed fields.

Not necessarily. The usual use case is that you have some kind of existing data source from which you fill your Solr index. When you want to update a field of a document, you simply re-index from that source. There's no need to fetch data from Solr first.

Otherwise, if you really don't have such an existing data source because a horde of typewriting monkeys filled your Solr index, then you had better declare all your fields as stored. Otherwise you'll never have a chance to get that data back.

Greetings,
Kuli
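The re-index-from-source approach can be sketched as follows. This is a hedged illustration, not Solr or Lucene API: `ReindexSketch`, `SourceRecord` and all method names are made up, and the "source" is just a map standing in for whatever database or crawl store feeds the index. The point is that the update path reads from the source, never from the index.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these names come from Solr or Lucene.
final class ReindexSketch {
    /** One row of the external data source, e.g. a crawl database. */
    static final class SourceRecord {
        final String content;
        final int inboundLinks;

        SourceRecord(String content, int inboundLinks) {
            this.content = content;
            this.inboundLinks = inboundLinks;
        }
    }

    private final Map<String, SourceRecord> source = new HashMap<>();   // source of truth
    private final Map<String, Set<String>> postings = new HashMap<>();  // indexed, not stored
    private final Map<String, Integer> storedLinks = new HashMap<>();   // stored field

    void putInSource(String id, SourceRecord record) {
        source.put(id, record);
    }

    /** Rebuilds the whole document from the source; nothing is read back from the index. */
    void reindex(String id) {
        SourceRecord record = source.get(id);
        if (record == null) {
            throw new IllegalArgumentException("unknown id: " + id);
        }
        for (Set<String> ids : postings.values()) {
            ids.remove(id); // remove the old version's postings
        }
        for (String token : record.content.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new HashSet<>()).add(id);
            }
        }
        storedLinks.put(id, record.inboundLinks);
    }

    /** Changes one field in the source, then re-indexes the full document. */
    void updateInboundLinks(String id, int inboundLinks) {
        SourceRecord old = source.get(id);
        if (old == null) {
            throw new IllegalArgumentException("unknown id: " + id);
        }
        source.put(id, new SourceRecord(old.content, inboundLinks));
        reindex(id);
    }

    int hits(String term) {
        Set<String> ids = postings.get(term);
        return ids == null ? 0 : ids.size();
    }

    int storedInboundLinks(String id) {
        Integer links = storedLinks.get(id);
        return links == null ? -1 : links;
    }
}
```

Because the content is taken from the source record on every `reindex`, the term query keeps matching after the unrelated field changes. The cost is that the source must keep the content, which is the trade-off debated in the rest of the thread.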
Re: Can I still search documents once updated?
Well, I'm not sure how usual this scenario would be:

1. In general, those using Solr with Nutch don't store the content field, to avoid storing the whole web/intranet in their index twice (once in the form of stored data, and once in the form of indexed data).

Now every time they need to update a field unrelated to content (the number of inbound links, for example) they would have to re-crawl the page. This is at least not intuitive.
Re: Can I still search documents once updated?
On 13.07.2011 15:37, Gabriele Kahlout wrote:

> Well, I'm not sure how usual this scenario would be:
>
> 1. In general, those using Solr with Nutch don't store the content field, to avoid storing the whole web/intranet in their index twice (once in the form of stored data, and once in the form of indexed data).

Not exactly. The indexed form is quite different from the stored form; only the tokens are stored, each token only once, plus some additional data like the document count and, maybe, shingle information etc. Hence, indexed data usually needs much less space on disk than the original data.

There's no practical alternative to storing the content in a stored field. What would you otherwise display as a search result? "The following web pages have your search term somewhere in their contents, don't know where, take a look on your own"?

Greetings,
Kuli
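The "each token only once" point can be illustrated with a tiny inverted structure. This is only a sketch under simplifying assumptions: `TokenCountSketch` and `invert` are hypothetical names, tokens are split on non-word characters, and nothing here reflects Lucene's actual on-disk format or says anything about real index sizes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; not how Lucene lays out its index on disk.
final class TokenCountSketch {
    /** Maps each distinct token to the positions at which it occurs in the text. */
    static Map<String, List<Integer>> invert(String text) {
        Map<String, List<Integer>> terms = new TreeMap<>();
        int position = 0;
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue; // split() can yield a leading empty string
            }
            terms.computeIfAbsent(token, t -> new ArrayList<>()).add(position);
            position++;
        }
        return terms;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> terms = invert("to be or not to be");
        // Six tokens in the text, but only four distinct terms as keys.
        System.out.println(terms); // prints {be=[1, 5], not=[3], or=[2], to=[0, 4]}
    }
}
```

Each distinct token is a key exactly once, with the occurrences hanging off it. The original word order and spelling cannot be read back directly, which is why the stored copy is a separate thing.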
Re: Can I still search documents once updated?
On Wed, Jul 13, 2011 at 3:54 PM, Michael Kuhlmann s...@kuli.org wrote:

> On 13.07.2011 15:37, Gabriele Kahlout wrote:
>
>> Well, I'm not sure how usual this scenario would be:
>>
>> 1. In general, those using Solr with Nutch don't store the content field, to avoid storing the whole web/intranet in their index twice (once in the form of stored data, and once in the form of indexed data).
>
> Not exactly. The indexed form is quite different from the stored form; only the tokens are stored, each token only once, plus some additional data like the document count and, maybe, shingle information etc. Hence, indexed data usually needs much less space on disk than the original data.

I realized that. Maybe I should have said 1.X (1 in the form of stored data and 0.X in the form of indexed data).

> There's no practical alternative to storing the content in a stored field. What would you otherwise display as a search result? "The following web pages have your search term somewhere in their contents, don't know where, take a look on your own"?

Display the title and URL (and implicitly say "The following web pages have your search term somewhere in their contents, don't REMEMBER where, take a look on your own").

Solr is already configured by default not to store more than a maxFieldLength anyway. Usually one stores content only to display snippets.

> Greetings,
> Kuli
Re: Can I still search documents once updated?
On 13.07.2011 16:09, Gabriele Kahlout wrote:

> Solr is already configured by default not to store more than a maxFieldLength anyway. Usually one stores content only to display snippets.

Yes, but the snippets must come from somewhere. For instance, if you're using Solr's highlighting feature, all highlighted fields must be stored.

See http://www.intellog.com/blog/?p=208 for an explanation from someone else. ;)

Greetings,
Kuli
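Why a snippet needs the stored text can be shown with a few lines. This is a hedged sketch, not Solr's highlighter: `SnippetSketch` and `snippet` are hypothetical, the match is a plain case-sensitive substring search, and `context` is simply a number of characters on each side.

```java
// Hypothetical sketch; Solr's real highlighting component works differently.
final class SnippetSketch {
    /**
     * Returns the text around the first occurrence of the term,
     * or null when there is no stored text or no match.
     */
    static String snippet(String storedText, String term, int context) {
        if (storedText == null) {
            return null; // field was not stored: nothing to cut a snippet from
        }
        int at = storedText.indexOf(term);
        if (at < 0) {
            return null;
        }
        int from = Math.max(0, at - context);
        int to = Math.min(storedText.length(), at + term.length() + context);
        return storedText.substring(from, to);
    }

    public static void main(String[] args) {
        System.out.println(snippet("we eat lunch at noon", "lunch", 3)); // prints "at lunch at"
        System.out.println(snippet(null, "lunch", 3));                   // prints "null"
    }
}
```

With an indexed-only field the index can tell that the term matched, but there is no original string to pass as `storedText`, so the result page can show only other stored fields such as title and URL.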
Re: Can I still search documents once updated?
Unless you stored your content field, the value you put in there won't be fetched from the index. Verify that the doc you retrieve from the index has values for content, I bet it doesn't.

Best
Erick