Re: Can I still search documents once updated?

2011-07-13 Thread Gabriele Kahlout
It indeed is not stored, but this is still unexpected behavior. It's a
stored and indexed field, why has the index data been lost?


On Wed, Jul 13, 2011 at 12:44 AM, Erick Erickson erickerick...@gmail.com wrote:

> Unless you stored your content field, the value you put in there won't
> be fetched from the index. Verify that the doc you retrieve from the index
> has values for content; I bet it doesn't.
>
> Best
> Erick
>
> On Tue, Jul 12, 2011 at 9:38 AM, Gabriele Kahlout
> gabri...@mysimpatico.com wrote:
> > @Test
> > public void testUpdateLoseTermsSimplified() throws Exception {
> >     // indexDoc() and getSearcher() are helpers from the test class (not shown).
> >     IndexWriter writer = indexDoc();
> >     assertEquals(1, writer.numDocs());
> >     IndexSearcher searcher = getSearcher(writer);
> >     final TermQuery termQuery = new TermQuery(new Term("content", "essen"));
> >
> >     TopDocs docs = searcher.search(termQuery, 1);
> >     assertEquals(1, docs.totalHits);
> >     // Only stored fields come back here.
> >     Document doc = searcher.doc(0);
> >
> >     writer.updateDocument(new Term("id", doc.get("id")), doc);
> >
> >     searcher = getSearcher(writer);
> >     docs = searcher.search(termQuery, 1);
> >     assertEquals(1, docs.totalHits); // docs.totalHits == 0 !
> > }
 
> > testUpdateLosesTerms(com.mysimpatico.me.indexplugins.WcTest)  Time elapsed: 0.346 sec  <<< FAILURE!
> > java.lang.AssertionError: expected:<1> but was:<0>
> >     at org.junit.Assert.fail(Assert.java:91)
> >     at org.junit.Assert.failNotEquals(Assert.java:645)
> >     at org.junit.Assert.assertEquals(Assert.java:126)
> >     at org.junit.Assert.assertEquals(Assert.java:470)
> >     at org.junit.Assert.assertEquals(Assert.java:454)
> >     at com.mysimpatico.me.indexplugins.WcTest.testUpdateLosesTerms(WcTest.java:271)
> >
> > I have not changed anything (as you can see) during the update. I just
> > retrieve a document and then update it. But then the termQuery that worked
> > before doesn't work anymore (while the id field wasn't changed). Is this
> > to be expected when the content field is not stored?
 
> > --
> > Regards,
> > K. Gabriele
> >
> > --- unchanged since 20/9/10 ---
> > P.S. If the subject contains [LON] or the addressee acknowledges the
> > receipt within 48 hours then I don't resend the email.
> > subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
> > < Now + 48h) ⇒ ¬resend(I, this).
> >
> > If an email is sent by a sender that is not a trusted contact or the email
> > does not contain a valid code then the email is not received. A valid code
> > starts with a hyphen and ends with X.
> > ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
> > L(-[a-z]+[0-9]X)).
 






Re: Can I still search documents once updated?

2011-07-13 Thread Erick Erickson
Wait, you directly contradicted yourself <G>. You say it's
not stored, then you say it's stored and indexed. Which is it?

When you fetch a document, only stored fields are returned,
and the returned data is a verbatim copy of the original
data. No attempt is made to return un-stored fields. This
has always been the behavior. If you attempted to return
indexed but not stored data, you'd get stemmed versions,
stop words would be removed, synonyms would be in place,
etc. Not to mention it would be very slow.

If the field is stored, then there's another problem; you might
want to dump the document after reading it from the IndexReader.

Best
Erick
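
To make the point above concrete, here is a minimal sketch of why fetching a document and writing it back loses terms from an indexed-but-unstored field. It is a toy model in plain Java, not Lucene: the class ToyIndex, its Field type, and its methods are hypothetical names invented for illustration, and tokenization is simplified to lower-casing and splitting on whitespace.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model only (not Lucene): stored values and the inverted index are separate structures. */
class ToyIndex {
    /** A field value plus flags saying whether it is stored and/or indexed. */
    static final class Field {
        final String name;
        final String value;
        final boolean stored;
        final boolean indexed;

        Field(String name, String value, boolean stored, boolean indexed) {
            this.name = name;
            this.value = value;
            this.stored = stored;
            this.indexed = indexed;
        }
    }

    // docId -> (field name -> verbatim value); only stored fields land here.
    private final Map<Integer, Map<String, String>> storedFields = new HashMap<>();
    // "field:token" -> ids of docs containing the token; only indexed fields land here.
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private int nextId = 0;

    int add(List<Field> fields) {
        int id = nextId++;
        Map<String, String> stored = new HashMap<>();
        for (Field f : fields) {
            if (f.stored) {
                stored.put(f.name, f.value);
            }
            if (f.indexed) {
                for (String token : f.value.toLowerCase().split("\\s+")) {
                    postings.computeIfAbsent(f.name + ":" + token, k -> new HashSet<>()).add(id);
                }
            }
        }
        storedFields.put(id, stored);
        return id;
    }

    void delete(int id) {
        storedFields.remove(id);
        postings.values().forEach(ids -> ids.remove(id));
    }

    /** Returns only what was stored; unstored fields are simply absent. */
    List<Field> fetch(int id) {
        List<Field> out = new ArrayList<>();
        // Simplification: fetched fields are treated as stored and indexed again.
        storedFields.get(id).forEach((name, value) -> out.add(new Field(name, value, true, true)));
        return out;
    }

    int hits(String field, String token) {
        return postings.getOrDefault(field + ":" + token, Collections.emptySet()).size();
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        int id = index.add(Arrays.asList(
                new Field("id", "doc1", true, true),
                new Field("content", "wir essen", false, true))); // indexed, not stored
        System.out.println(index.hits("content", "essen")); // prints 1

        // "Update" = delete + add of whatever fetch() handed back.
        List<Field> fetched = index.fetch(id);
        index.delete(id);
        index.add(fetched);
        System.out.println(index.hits("content", "essen")); // prints 0
    }
}
```

The second lookup finds nothing because the fetched document never contained the content field, so the re-added document has nothing to tokenize for it.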




Re: Can I still search documents once updated?

2011-07-13 Thread Gabriele Kahlout
On Wed, Jul 13, 2011 at 1:57 PM, Erick Erickson erickerick...@gmail.com wrote:

> Wait, you directly contradicted yourself <G>. You say it's
> not stored, then you say it's stored and indexed. Which is it?

Yes, I meant indexed and not stored.

> When you fetch a document, only stored fields are returned,
> and the returned data is a verbatim copy of the original
> data. No attempt is made to return un-stored fields. This
> has always been the behavior. If you attempted to return
> indexed but not stored data, you'd get stemmed versions,
> stop words would be removed, synonyms would be in place,
> etc. Not to mention it would be very slow.

This is what I was expecting. Otherwise, updating a field of a document that
has an unstored but indexed field is impossible without losing the unstored
but indexed field. (I call this updating a field of a document AND
deleting/updating all its unstored but indexed fields.)




Re: Can I still search documents once updated?

2011-07-13 Thread Michael Kuhlmann
On 13.07.2011 14:05, Gabriele Kahlout wrote:
> This is what I was expecting. Otherwise, updating a field of a document that
> has an unstored but indexed field is impossible without losing the unstored
> but indexed field. (I call this updating a field of a document AND
> deleting/updating all its unstored but indexed fields.)

Not necessarily. The usual use case is that you have some kind of
existing data source from which you fill your Solr index. When you want
to update a field of a document, you simply re-index from that
source. There's no need to fetch data from Solr first.

Otherwise, if you really don't have such an existing data source because
a horde of typewriting monkeys filled your Solr index, then you had
better declare all your fields as stored. Otherwise you'll never have a
chance to get that data back.

Greetings,
Kuli
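
The re-index-from-source approach described above can be sketched roughly as follows. This is a toy model in plain Java, not Solr: ReindexFromSource, SourceRow, and the in-memory maps are hypothetical stand-ins for a real data source and a real index, and "inlinks" is just an example of a field unrelated to content.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy sketch (not Solr): every update rebuilds the whole document from the original source. */
class ReindexFromSource {
    /** A row in the hypothetical system of record that feeds the index. */
    static final class SourceRow {
        final String content;
        int inlinks;

        SourceRow(String content, int inlinks) {
            this.content = content;
            this.inlinks = inlinks;
        }
    }

    final Map<String, SourceRow> source = new HashMap<>();       // system of record
    final Map<String, Set<String>> postings = new HashMap<>();   // token -> doc ids (content: indexed, not stored)
    final Map<String, Integer> storedInlinks = new HashMap<>();  // the only stored field

    /** Delete-then-add, always reading field values from the source, never from the index. */
    void reindex(String id) {
        postings.values().forEach(ids -> ids.remove(id));
        SourceRow row = source.get(id);
        for (String token : row.content.toLowerCase().split("\\s+")) {
            postings.computeIfAbsent(token, k -> new HashSet<>()).add(id);
        }
        storedInlinks.put(id, row.inlinks);
    }

    boolean matches(String token, String id) {
        return postings.getOrDefault(token, Collections.emptySet()).contains(id);
    }

    public static void main(String[] args) {
        ReindexFromSource r = new ReindexFromSource();
        r.source.put("doc1", new SourceRow("wir essen", 3));
        r.reindex("doc1");

        r.source.get("doc1").inlinks = 4; // change an unrelated field at the source
        r.reindex("doc1");                // content terms are re-derived, not lost

        System.out.println(r.matches("essen", "doc1") + " " + r.storedInlinks.get("doc1")); // prints "true 4"
    }
}
```

The trade-off is that the source must stay available and cheap to read, which is exactly the objection raised in the next message for crawled content.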


Re: Can I still search documents once updated?

2011-07-13 Thread Gabriele Kahlout
Well, I'm not sure how usual this scenario would be:
1. In general, those using Solr with Nutch don't store the content field, to
avoid storing the whole web/intranet in their index twice (1 in the form of
stored data, and 1 in the form of indexed data).

Now every time they need to update a field unrelated to content (the number of
inbound links, for example) they would have to re-crawl the page.
This is unintuitive, at the least.




Re: Can I still search documents once updated?

2011-07-13 Thread Michael Kuhlmann
On 13.07.2011 15:37, Gabriele Kahlout wrote:
> Well, I'm not sure how usual this scenario would be:
> 1. In general, those using Solr with Nutch don't store the content field, to
> avoid storing the whole web/intranet in their index twice (1 in the form of
> stored data, and 1 in the form of indexed data).

Not exactly. The indexed form is quite different from the stored form:
only the tokens are stored, each token only once, plus some additional
data like the document count and, maybe, shingle information, etc.

Hence, indexed data usually needs much less space on disk than the
original data.

There's no practical alternative to storing the content in a stored
field. What would you otherwise display as a search result? "The
following web pages have your search term somewhere in their contents,
don't know where, take a look on your own"?

Greetings,
Kuli
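
As a rough illustration of the "each token only once" point above, the sketch below counts total versus distinct tokens in a small text. It is plain Java, not Lucene: TokenStats and its crude tokenizer are assumptions for illustration, and a real index also keeps postings lists, frequencies, and possibly positions, so actual space savings depend on the data and the field configuration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Toy sketch: a term dictionary holds each distinct token once, however often it occurs. */
class TokenStats {
    /** Lower-cases and splits on non-letters; a crude stand-in for a real analyzer. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** The distinct tokens, sorted: what a term dictionary would list once each. */
    static Set<String> dictionary(List<String> tokens) {
        return new TreeSet<>(tokens);
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("the cat sat on the mat and the dog sat on the cat");
        Set<String> terms = dictionary(tokens);
        System.out.println(tokens.size() + " tokens, " + terms.size() + " distinct"); // prints "13 tokens, 7 distinct"
    }
}
```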


Re: Can I still search documents once updated?

2011-07-13 Thread Gabriele Kahlout
On Wed, Jul 13, 2011 at 3:54 PM, Michael Kuhlmann s...@kuli.org wrote:

> Not exactly. The indexed form is quite different from the stored form:
> only the tokens are stored, each token only once, plus some additional
> data like the document count and, maybe, shingle information, etc.
>
> Hence, indexed data usually needs much less space on disk than the
> original data.

I realized that. Maybe I should have said 1.X (1 in the form of stored data
and 0.X in the form of indexed data).

> There's no practical alternative to storing the content in a stored
> field. What would you otherwise display as a search result? "The
> following web pages have your search term somewhere in their contents,
> don't know where, take a look on your own"?

Display the title and URL (and implicitly say "The
following web pages have your search term somewhere in their contents, don't
REMEMBER where, take a look on your own").

Solr is already configured by default not to store more than a
maxFieldLength anyway. Usually one stores content only to display
snippets.





Re: Can I still search documents once updated?

2011-07-13 Thread Michael Kuhlmann
On 13.07.2011 16:09, Gabriele Kahlout wrote:
> Solr is already configured by default not to store more than a
> maxFieldLength anyway. Usually one stores content only to display
> snippets.

Yes, but the snippets must come from somewhere.

For instance, if you're using Solr's highlighting feature, all
highlighted fields must be stored.

See http://www.intellog.com/blog/?p=208 for an explanation from someone
else. ;)

Greetings,
Kuli


Re: Can I still search documents once updated?

2011-07-12 Thread Erick Erickson
Unless you stored your content field, the value you put in there won't
be fetched from the index. Verify that the doc you retrieve from the index
has values for content; I bet it doesn't.

Best
Erick
