Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/
Yes, this should definitely be mentioned somewhere (in the documentation :). At least we left a trace on the mailing list, so it will be possible to refer to it.

D.

Jérôme Charron wrote:
>> You're right -- changing anything in the input (snippet length, number of documents, etc.) will alter the clusters. This is basically how it works. If you want clustering in your search engine then, depending on the type of data you serve, you'll have to experiment with the settings a bit and see which give you satisfactory results. I don't think there is any particular reason to provide different data to the clusterer. Moreover, it would complicate things quite badly.
>
> Thanks Dawid for your response. In fact, I don't really want to change this; I just want to be sure that everybody is aware of it, and to gather some opinions.
>
> Regards
> Jérôme
Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/
> You're right -- changing anything in the input (snippet length, number of documents, etc.) will alter the clusters. This is basically how it works. If you want clustering in your search engine then, depending on the type of data you serve, you'll have to experiment with the settings a bit and see which give you satisfactory results. I don't think there is any particular reason to provide different data to the clusterer. Moreover, it would complicate things quite badly.

Thanks Dawid for your response. In fact, I don't really want to change this; I just want to be sure that everybody is aware of it, and to gather some opinions.

Regards
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/
Hi Jerome,

> Yes Dawid, but it is already committed => the clustering now uses the plain text version returned by the toString() method.

Ugh, yes, sorry about that; it uses Summary.toStrings(summaries) to be specific, and that uses toString() internally.

> Currently, the clustering uses the summaries as input. I assume it would provide better results if it took the whole document content, no? I assume that clustering uses the summaries instead of the document content for performance reasons.

Not always. Or rather: it depends on what your goals are. Full-document clustering will take longer (word segmentation, feature extraction, etc.), but since you have more data to work with, document similarity should be more accurate and hence the clusters more sensible. In practice, however, similarity between documents and "cluster quality" are just mathematical concepts that are never shown to the user. What the user sees is the representation of a cluster, which in the case of full-document clustering is usually quite inconvenient to build and has only a weak relationship with the actual mathematical model of the clusters.

Contextual (keyword-in-context) snippets have a great advantage: they are shorter and carry the neighborhood of your query's terms. This very neighborhood (or rather: repeated sequences of terms) can be used first to determine "clusters" of documents and then to describe them to the user. This is how most Web clustering algorithms work (excuse me if I explained it in a very imprecise way).

> But there is a (bad) side effect: since the size of the summaries is configurable, the clustering "quality" will vary depending on the summary size configuration. I find this very confusing: when folks adjust this parameter, it is only for front-end reasons (they want to display a long or a short summary), certainly not for clustering reasons.

You're right -- changing anything in the input (snippet length, number of documents, etc.) will alter the clusters. This is basically how it works. If you want clustering in your search engine then, depending on the type of data you serve, you'll have to experiment with the settings a bit and see which give you satisfactory results. I don't think there is any particular reason to provide different data to the clusterer. Moreover, it would complicate things quite badly.

D.
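To make the point about snippet length concrete, here is a toy sketch (not the clustering plugin's actual algorithm, and not Nutch code). It assumes that "repeated sequences of terms" can be approximated by two-word phrases shared by at least two snippets; the class name `SnippetPhraseSketch`, its methods, and the sample snippets are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration only: shared two-word phrases as cluster candidates.
class SnippetPhraseSketch {

  // Counts, for every two-word phrase, how many snippets contain it,
  // and keeps only the phrases shared by at least two snippets.
  static Map<String, Integer> sharedBigrams(List<String> snippets) {
    Map<String, Integer> docFreq = new TreeMap<>();
    for (String snippet : snippets) {
      String[] words = snippet.toLowerCase(Locale.ROOT).split("\\W+");
      Set<String> seen = new HashSet<>();
      for (int i = 0; i + 1 < words.length; i++) {
        if (!words[i].isEmpty()) {
          seen.add(words[i] + " " + words[i + 1]);
        }
      }
      for (String phrase : seen) {
        docFreq.merge(phrase, 1, Integer::sum);
      }
    }
    docFreq.values().removeIf(count -> count < 2);
    return docFreq;
  }

  // Keeps only the first maxWords words, imitating a shorter summary setting.
  static String truncate(String snippet, int maxWords) {
    String[] words = snippet.trim().split("\\s+");
    int n = Math.min(maxWords, words.length);
    return String.join(" ", Arrays.copyOfRange(words, 0, n));
  }

  public static void main(String[] args) {
    List<String> snippets = Arrays.asList(
        "Apache Nutch is an open source web crawler built on Lucene",
        "The open source web crawler Nutch uses Lucene for indexing",
        "Heritrix is another open source web crawler for archiving");

    // Full-length snippets: {open source=3, source web=3, web crawler=3}
    System.out.println(sharedBigrams(snippets));

    // Same snippets cut to five words: {open source=2}
    List<String> shorter = new ArrayList<>();
    for (String s : snippets) {
      shorter.add(truncate(s, 5));
    }
    System.out.println(sharedBigrams(shorter));
  }
}
```

With the full snippets all three documents share the phrase chain "open source web crawler"; after truncation only two of them still share "open source", so any grouping built on these phrases would change even though the documents did not.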
Re: [Nutch-dev] Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/
> Bob Carpenter of alias-i had this to say when I brought up this very idea:
> http://article.gmane.org/gmane.comp.jakarta.lucene.devel/12599

Thanks for your response, Marvin. But in the end my question is: shouldn't the Nutch clustering use fixed-size snippets instead of the configurable display size?

Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/
>> (but if nutch-site.xml overrides the plugin.includes property and doesn't
>> include it, it will not be activated, like any other plugin)
> yes, that's what I meant; I guess that's the default case for people hacking plugins.

Oh, yes Sami, I understand what you mean... Sorry, I just forgot to mention this point on the list (so, plugin hackers, you need to add one of the new summary plugins if you want summaries to be displayed).

Sorry, I also forgot to add the summary plugins to the default webapp context file (nutch.xml)... I will add this once svn write access is available again.

And sorry once more, because I also forgot to carry the summary API changes over to the web2 module...

Regards
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
Re: [Nutch-dev] Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/
On May 11, 2006, at 3:36 AM, Jérôme Charron wrote:

> Currently, the clustering uses the summaries as input. I assume it would provide better results if it took the whole document content, no? I assume that clustering uses the summaries instead of the document content for performance reasons.
> But there is a (bad) side effect: since the size of the summaries is configurable, the clustering "quality" will vary depending on the summary size configuration. I find this very confusing: when folks adjust this parameter, it is only for front-end reasons (they want to display a long or a short summary), certainly not for clustering reasons.
> What do you and others think about this?

Bob Carpenter of alias-i had this to say when I brought up this very idea:

http://article.gmane.org/gmane.comp.jakarta.lucene.devel/12599

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/
Jérôme Charron wrote:
> (but if nutch-site.xml overrides the plugin.includes property and doesn't include it, it will not be activated, like any other plugin)

yes, that's what I meant; I guess that's the default case for people hacking plugins.

--
Sami Siren
Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/
> Add 3. Clustering would benefit from a plain text version.

Yes Dawid, but it is already committed => the clustering now uses the plain text version returned by the toString() method.

Dawid, I have a question about clustering. Currently, the clustering uses the summaries as input. I assume it would provide better results if it took the whole document content, no? I assume that clustering uses the summaries instead of the document content for performance reasons.

But there is a (bad) side effect: since the size of the summaries is configurable, the clustering "quality" will vary depending on the summary size configuration. I find this very confusing: when folks adjust this parameter, it is only for front-end reasons (they want to display a long or a short summary), certainly not for clustering reasons.

What do you and others think about this?

Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/
> The reason is that they should not use the same HTML code:
> 1. OpenSearch should only use [...] around highlights
> 2. search.jsp should use some more complicated HTML code ([...])

Add 3. Clustering would benefit from a plain text version.

D.
Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/
Jérôme Charron wrote:
> Yes Doug, but in fact the idea is to add the toString(Formatter) method in a common place (Summary), and to add one specific Formatter implementation for OpenSearch and another one for search.jsp.
> The reason is that they should not use the same HTML code:
> 1. OpenSearch should only use [...] around highlights
> 2. search.jsp should use some more complicated HTML code ([...])
> In fact, I don't know whether the "Formatter" solution is the right one, but toString() or toHtml() must be parameterized, since the two pieces of code that use this method should have distinct outputs.

This all sounds fine. I'm just remarking that, at present, the OpenSearch output has changed incompatibly, which is a bad thing, and that I wish, until this is fully worked out, OpenSearch returned what it did before (markup, although perhaps exceeding what's advised).

Doug
Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/
>> String toString(Encoder, Formatter) like in Lucene's Highlighter, and
>> provide some basic implementations of Encoder and Formatter.
> That sounds fine, but in the meantime, let's not reproduce the html-specific code in lots of places. We need it in both search.jsp and in OpenSearchServlet.java. So we should have it in a common place. A method on Summary seems like a good place. If we subsequently add a more general API then we could re-implement the toHtml() method using that API, but I think a generic toHtml() method will be useful for quite a while yet.

Yes Doug, but in fact the idea is to add the toString(Formatter) method in a common place (Summary), and to add one specific Formatter implementation for OpenSearch and another one for search.jsp.

The reason is that they should not use the same HTML code:
1. OpenSearch should only use [...] around highlights
2. search.jsp should use some more complicated HTML code ([...])

In fact, I don't know whether the "Formatter" solution is the right one, but toString() or toHtml() must be parameterized, since the two pieces of code that use this method should have distinct outputs.

Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
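As a rough sketch of the parameterized rendering discussed above: everything here is hypothetical. The `Encoder` and `Formatter` interfaces below are simplified stand-ins written for this note, not Lucene's Highlighter interfaces and not a committed Nutch API, and the markup each formatter emits (`<b>` for OpenSearch, `<span>` with CSS classes for search.jsp) is an assumption, since the markup mentioned in the thread did not survive in the archive.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: one rendering routine, two output flavours.
class FormatterSketch {

  // Assumed interface: escapes raw fragment text for the target format.
  interface Encoder {
    String encode(String raw);
  }

  // Assumed interface: decides how highlights and ellipses are marked up.
  interface Formatter {
    String highlight(String encodedText);
    String ellipsis();
  }

  // Simplified stand-in for a summary fragment.
  static final class Fragment {
    final String text;
    final boolean highlight;
    final boolean ellipsis;

    private Fragment(String text, boolean highlight, boolean ellipsis) {
      this.text = text;
      this.highlight = highlight;
      this.ellipsis = ellipsis;
    }

    static Fragment plain(String text) { return new Fragment(text, false, false); }
    static Fragment hit(String text) { return new Fragment(text, true, false); }
    static Fragment gap() { return new Fragment("", false, true); }
  }

  // Minimal HTML escaping; '&' must be replaced first.
  static final Encoder HTML = raw ->
      raw.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");

  // Assumed "simple markup" flavour for OpenSearch descriptions.
  static final Formatter OPEN_SEARCH = new Formatter() {
    public String highlight(String t) { return "<b>" + t + "</b>"; }
    public String ellipsis() { return " ... "; }
  };

  // Assumed "richer markup" flavour for the search page.
  static final Formatter SEARCH_PAGE = new Formatter() {
    public String highlight(String t) { return "<span class=\"highlight\">" + t + "</span>"; }
    public String ellipsis() { return "<span class=\"ellipsis\"> ... </span>"; }
  };

  // The shared loop: callers differ only in the Encoder and Formatter they pass.
  static String render(List<Fragment> fragments, Encoder encoder, Formatter formatter) {
    StringBuilder out = new StringBuilder();
    for (Fragment f : fragments) {
      if (f.ellipsis) {
        out.append(formatter.ellipsis());
      } else if (f.highlight) {
        out.append(formatter.highlight(encoder.encode(f.text)));
      } else {
        out.append(encoder.encode(f.text));
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    List<Fragment> fragments = Arrays.asList(
        Fragment.plain("open & "), Fragment.hit("search"), Fragment.gap());
    System.out.println(render(fragments, HTML, OPEN_SEARCH));
    System.out.println(render(fragments, HTML, SEARCH_PAGE));
  }
}
```

The fragment loop lives in one place, which addresses the duplication concern, while each front end still chooses its own markup.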
Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/
>> Also a friendly hint to all plugin hackers, you need to enable
>> summary-basic in your existing nutch-site.xml to get things working.
>> Took me some time to realize this fact :)
> I think we should add this to nutch-default.xml,

Did I miss something? summary-basic is activated in nutch-default.xml ... no?

> if omitting this results in a non-working installation ...

During my tests, it only results in no summaries on the results pages... Isn't that the case?

Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/
>> Also a friendly hint to all plugin hackers, you need to enable
>> summary-basic in your existing nutch-site.xml to get things working.
>> Took me some time to realize this fact :)
> Sounds like we should enable it by default, no?

The summary-basic plugin is already enabled by default in nutch-default.xml (but if nutch-site.xml overrides the plugin.includes property and doesn't include it, it will not be activated, like any other plugin).

Jérôme
Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/
Sami Siren wrote:
> Also a friendly hint to all plugin hackers, you need to enable summary-basic in your existing nutch-site.xml to get things working. Took me some time to realize this fact :)

Sounds like we should enable it by default, no?

Doug
Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/
Sami Siren wrote:
>> Doesn't this break any existing application that uses OpenSearch and displays summaries in a web browser? This is an incompatible change which we should avoid.
> Also a friendly hint to all plugin hackers, you need to enable summary-basic in your existing nutch-site.xml to get things working. Took me some time to realize this fact :)

I think we should add this to nutch-default.xml, if omitting this results in a non-working installation ...

--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _ __
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/
> Doesn't this break any existing application that uses OpenSearch and displays summaries in a web browser? This is an incompatible change which we should avoid.

Also a friendly hint to all plugin hackers, you need to enable summary-basic in your existing nutch-site.xml to get things working. Took me some time to realize this fact :)

> That sounds fine, but in the meantime, let's not reproduce the html-specific code in lots of places. We need it in both search.jsp and in OpenSearchServlet.java. So we should have it in a common place. A method on Summary seems like a good place. If we subsequently add a more general API then we could re-implement the toHtml() method using that API, but I think a generic toHtml() method will be useful for quite a while yet.

+1

--
Sami Siren
Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/
Jérôme Charron wrote:
>> This means there's no markup in the OpenSearch output?
> Yes, no markup for now.

Doesn't this break any existing application that uses OpenSearch and displays summaries in a web browser? This is an incompatible change which we should avoid.

>> Shouldn't there be?
> The restriction on the description field is:
> "Can contain simple escaped HTML markup, such as <b>, <i>, <a>, and <img> elements."
> So, ya, why not. We can add [...] around highlights. What do you and others think?

+1

>> Perhaps this should be a method on Summary, to render it as html?
> I had some hesitations about this while coding. In fact, as suggested in the issue's comments, I would like to add a generic method on Summary: String toString(Encoder, Formatter), like in Lucene's Highlighter, and provide some basic implementations of Encoder and Formatter.

That sounds fine, but in the meantime, let's not reproduce the html-specific code in lots of places. We need it in both search.jsp and in OpenSearchServlet.java. So we should have it in a common place. A method on Summary seems like a good place. If we subsequently add a more general API then we could re-implement the toHtml() method using that API, but I think a generic toHtml() method will be useful for quite a while yet.

Doug
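For illustration, a minimal sketch of what a shared toHtml() method could look like. The class `SummarySketch` is a hypothetical stand-in, not Nutch's `Summary`; the `<b>` tag around highlights and the hand-rolled `escape()` helper are assumptions made for this example (the committed search.jsp code uses `Entities.encode` and markup that is not recoverable from this archive).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a summary made of text, highlight and ellipsis fragments.
class SummarySketch {

  enum Kind { TEXT, HIGHLIGHT, ELLIPSIS }

  static final class Fragment {
    final Kind kind;
    final String text;

    Fragment(Kind kind, String text) {
      this.kind = kind;
      this.text = text;
    }
  }

  private final List<Fragment> fragments = new ArrayList<>();

  SummarySketch add(Kind kind, String text) {
    fragments.add(new Fragment(kind, text));
    return this;
  }

  // Minimal HTML escaping, standing in for an entity encoder.
  static String escape(String s) {
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      switch (c) {
        case '&': out.append("&amp;"); break;
        case '<': out.append("&lt;"); break;
        case '>': out.append("&gt;"); break;
        case '"': out.append("&quot;"); break;
        default: out.append(c);
      }
    }
    return out.toString();
  }

  // HTML rendering kept in one place so that every caller shares it.
  String toHtml() {
    StringBuilder sb = new StringBuilder();
    for (Fragment f : fragments) {
      switch (f.kind) {
        case HIGHLIGHT:
          sb.append("<b>").append(escape(f.text)).append("</b>"); // assumed markup
          break;
        case ELLIPSIS:
          sb.append(" ... ");
          break;
        default:
          sb.append(escape(f.text));
      }
    }
    return sb.toString();
  }

  // Plain-text rendering: no markup, no escaping.
  @Override
  public String toString() {
    StringBuilder sb = new StringBuilder();
    for (Fragment f : fragments) {
      sb.append(f.kind == Kind.ELLIPSIS ? " ... " : f.text);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    SummarySketch s = new SummarySketch()
        .add(Kind.TEXT, "a <fast> ")
        .add(Kind.HIGHLIGHT, "crawler")
        .add(Kind.ELLIPSIS, "");
    System.out.println(s.toHtml());   // a &lt;fast&gt; <b>crawler</b> ... 
    System.out.println(s);            // a <fast> crawler ... 
  }
}
```

With something along these lines, both the servlet and the JSP could call the same method instead of each carrying its own copy of the fragment loop.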
Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/
> This means there's no markup in the OpenSearch output?

Yes, no markup for now.

> Shouldn't there be?

The restriction on the description field is:
"Can contain simple escaped HTML markup, such as <b>, <i>, <a>, and <img> elements."
So, ya, why not. We can add [...] around highlights. What do you and others think?

> Perhaps this should be a method on Summary, to render it as html?

I had some hesitations about this while coding. In fact, as suggested in the issue's comments, I would like to add a generic method on Summary: String toString(Encoder, Formatter), like in Lucene's Highlighter, and provide some basic implementations of Encoder and Formatter.

Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/
Thanks for making this change! A few comments:

[EMAIL PROTECTED] wrote:
> ==
> --- lucene/nutch/trunk/src/java/org/apache/nutch/searcher/OpenSearchServlet.java (original)
> +++ lucene/nutch/trunk/src/java/org/apache/nutch/searcher/OpenSearchServlet.java Tue May 9 16:04:40 2006
> [...]
> -addNode(doc, item, "description", summaries[i]);
> +addNode(doc, item, "description", summaries[i].toString());

This means there's no markup in the OpenSearch output? Shouldn't there be?

> Modified: lucene/nutch/trunk/src/web/jsp/search.jsp
> URL: http://svn.apache.org/viewcvs/lucene/nutch/trunk/src/web/jsp/search.jsp?rev=405565&r1=405564&r2=405565&view=diff
> ==
> +
> +// Build the summary
> +StringBuffer sum = new StringBuffer();
> +Fragment[] fragments = summaries[i].getFragments();
> +for (int j=0; j<fragments.length; j++) {
> +  if (fragments[j].isHighlight()) {
> +    sum.append("[...]")
> +       .append(Entities.encode(fragments[j].getText()))
> +       .append("[...]");
> +  } else if (fragments[j].isEllipsis()) {
> +    sum.append(" ... ");
> +  } else {
> +    sum.append(Entities.encode(fragments[j].getText()));
> +  }
> +}
> +String summary = sum.toString();

Perhaps this should be a method on Summary, to render it as html?

Doug