[jira] Commented: (NUTCH-88) Enhance ParserFactory plugin selection policy
[ http://issues.apache.org/jira/browse/NUTCH-88?page=comments#action_12323001 ]

Dawid Weiss commented on NUTCH-88:
----------------------------------

Hi. I share your opinion -- this is an important issue. If I may add my two cents, the crawler should try to mimic a browser in handling MIME types. This, of course, gets quite complex, since Internet Explorer has a very confusing and unnecessarily complex MIME type handling heuristic... which happens to change from version to version as well. Anyway, if you care to look, there are a few articles that explain the steps IE performs to resolve the MIME type of a Web page:

http://msdn.microsoft.com/workshop/networking/moniker/overview/appendix_a.asp
http://msdn.microsoft.com/workshop/networking/moniker/overview/mime_handling.asp

D.

> Enhance ParserFactory plugin selection policy
> ---------------------------------------------
>
> Key: NUTCH-88
> URL: http://issues.apache.org/jira/browse/NUTCH-88
> Project: Nutch
> Type: Improvement
> Components: indexer
> Versions: 0.7, 0.8-dev
> Reporter: Jerome Charron
> Fix For: 0.8-dev
>
> The ParserFactory chooses the Parser plugin to use based on the content types
> and path suffixes defined in each parser's plugin.xml file.
> The selection policy is as follows:
> Content type has priority: the first plugin found whose "contentType"
> attribute matches the beginning of the content's type is used.
> If none match, then the first whose "pathSuffix" attribute matches the end of
> the url's path is used.
> If neither of these match, then the first plugin whose "pathSuffix" is the
> empty string is used.
> This policy causes a lot of problems when no match is found, because a random
> parser is used (and there is a good chance this parser can't handle the
> content).
> On the other hand, the content type associated with a parser plugin is
> specified in the plugin.xml of each plugin (this is the value used by the
> ParserFactory), AND the code of each parser itself checks whether the
> content type is ok (it uses a hard-coded content-type value rather than
> the value specified in plugin.xml, so the hard-coded content type and the
> one declared in plugin.xml can mismatch).
> A complete list of problems and discussion about this point is available in:
> * http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg00744.html
> * http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00789.html

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
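The three-step selection policy quoted in the issue description above can be sketched roughly as follows. This is only an illustration of the described behaviour, not the actual ParserFactory code: the PluginEntry class, the select method and the plugin ids are hypothetical, and the sketch assumes that an empty "contentType" or "pathSuffix" attribute means "not declared" in the first two steps.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the contentType/pathSuffix attributes in a parser's plugin.xml.
final class PluginEntry {
    final String id;
    final String contentType;
    final String pathSuffix;

    PluginEntry(String id, String contentType, String pathSuffix) {
        this.id = id;
        this.contentType = contentType;
        this.pathSuffix = pathSuffix;
    }
}

class ParserSelectionSketch {
    // Returns the id of the selected plugin, or null if nothing matches.
    static String select(List<PluginEntry> plugins, String contentType, String urlPath) {
        // Step 1: content type has priority (prefix match on the content's type).
        for (PluginEntry p : plugins) {
            if (!p.contentType.isEmpty() && contentType.startsWith(p.contentType)) {
                return p.id;
            }
        }
        // Step 2: otherwise the first plugin whose pathSuffix matches the end of the URL path.
        for (PluginEntry p : plugins) {
            if (!p.pathSuffix.isEmpty() && urlPath.endsWith(p.pathSuffix)) {
                return p.id;
            }
        }
        // Step 3: otherwise the first plugin with an empty pathSuffix, whatever it parses.
        for (PluginEntry p : plugins) {
            if (p.pathSuffix.isEmpty()) {
                return p.id;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<PluginEntry> plugins = Arrays.asList(
                new PluginEntry("parse-html", "text/html", "html"),
                new PluginEntry("parse-pdf", "application/pdf", "pdf"),
                new PluginEntry("parse-text", "text/plain", ""));
        // A PNG image falls through to step 3 and is handed to the plain-text parser.
        System.out.println(select(plugins, "image/png", "/logo.png"));
    }
}
```

The last call shows the problem the issue complains about: content that no plugin declares still ends up with whichever parser happens to have an empty path suffix.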
[jira] Commented: (NUTCH-88) Enhance ParserFactory plugin selection policy
[ http://issues.apache.org/jira/browse/NUTCH-88?page=comments#action_12323009 ]

Dawid Weiss commented on NUTCH-88:
----------------------------------

Yep, I know about the byte-magic MIME detector. I'm just pointing out that Internet Explorer doesn't use it... or at least, it doesn't always use it the way you would expect it to. Whether Nutch should mimic IE in this behaviour is another question.
[jira] Commented: (NUTCH-82) Nutch Commands should run on Windows without external tools
[ http://issues.apache.org/jira/browse/NUTCH-82?page=comments#action_12332559 ]

Dawid Weiss commented on NUTCH-82:
----------------------------------

I personally disagree that Perl is a better alternative to Cygwin... Most people familiar with Unix/Windows development will have no problems modifying a bash script, whereas a Perl script... hmm.. Perl is Perl :)

As for a pure Java solution, I agree this would be handy. However, Java is quite a pain to invoke, especially with multiple JVM switches such as -Xmx... So you'd probably have to fall back to a 'boot' script anyway at some point. The only pure Java thing that comes to my mind is using Ant to spawn a JVM and then writing commons-cli equivalents of the command line tools... but this, as much as I hate to have platform-dependent scripts, seems like overkill compared to the bash solution.

> Nutch Commands should run on Windows without external tools
> -----------------------------------------------------------
>
> Key: NUTCH-82
> URL: http://issues.apache.org/jira/browse/NUTCH-82
> Project: Nutch
> Type: New Feature
> Environment: Windows 2000
> Reporter: AJ Banck
> Attachments: nutch.bat, nutch.bat, nutch.pl
>
> Currently there is only a shell script to run the Nutch commands. This should
> be platform independent.
> Best would be Ant tools, or scripts generated by a template tool to avoid
> replication.
[jira] Created: (NUTCH-217) InstantiationException when deserializing Query (no parameterless constructor)
InstantiationException when deserializing Query (no parameterless constructor)
------------------------------------------------------------------------------

Key: NUTCH-217
URL: http://issues.apache.org/jira/browse/NUTCH-217
Project: Nutch
Type: Bug
Components: searcher
Versions: 0.8-dev
Reporter: Dawid Weiss

I've been playing with the trunk. The distributed searcher complains with an instantiation exception when deserializing Query. A quick code inspection shows that Query doesn't have any parameterless constructor.
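The failure mode reported above can be reproduced in isolation. The sketch below assumes that the deserializer creates the object reflectively through Class.newInstance() before reading its fields; that is an assumption about the mechanism, and QueryLike and FixedQueryLike are hypothetical stand-ins, not Nutch classes.

```java
// Hypothetical stand-in for a class that only offers a constructor with arguments.
class QueryLike {
    final String terms;

    QueryLike(String terms) {
        this.terms = terms;
    }
}

// The same class with the parameterless constructor that reflective creation needs.
class FixedQueryLike {
    String terms = "";

    FixedQueryLike() {
    }

    FixedQueryLike(String terms) {
        this.terms = terms;
    }
}

class InstantiationSketch {
    // Returns true if the class can be created the way a reflective deserializer would do it.
    @SuppressWarnings("deprecation")
    static boolean canInstantiate(Class<?> type) {
        try {
            // Class.newInstance() throws InstantiationException when there is
            // no nullary constructor to call.
            type.newInstance();
            return true;
        } catch (InstantiationException | IllegalAccessException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("QueryLike: " + canInstantiate(QueryLike.class));
        System.out.println("FixedQueryLike: " + canInstantiate(FixedQueryLike.class));
    }
}
```

Under that assumption, adding a parameterless constructor (and populating the fields while reading) would be the usual remedy.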
[jira] Created: (NUTCH-228) Clustering plugin descriptor broken (fix included)
Clustering plugin descriptor broken (fix included)
--------------------------------------------------

Key: NUTCH-228
URL: http://issues.apache.org/jira/browse/NUTCH-228
Project: Nutch
Type: Bug
Reporter: Dawid Weiss
Priority: Minor

The plugin descriptor for clustering-carrot2 is currently broken (points to a missing JAR). I'm adding a patch fixing this to this issue in a minute.
[jira] Updated: (NUTCH-228) Clustering plugin descriptor broken (fix included)
[ http://issues.apache.org/jira/browse/NUTCH-228?page=all ]

Dawid Weiss updated NUTCH-228:
------------------------------

Attachment: clustering.patch

This patch fixes the plugin descriptor and a typo in cluster.jsp that caused the wrong number of milliseconds to be dumped in the output log file.
[jira] Created: (NUTCH-234) Clustering extension code cleanups and a real JUnit test case for the current implementation.
Clustering extension code cleanups and a real JUnit test case for the current implementation.
---------------------------------------------------------------------------------------------

Key: NUTCH-234
URL: http://issues.apache.org/jira/browse/NUTCH-234
Project: Nutch
Type: Test
Reporter: Dawid Weiss
Priority: Minor

I've cleaned up the code a bit and added a real test case for the clustering extension. This is in preparation for upgrading to the most recent Carrot2 codebase and I didn't want to mix these two patches together. I'd appreciate it if somebody could review this patch so that I can integrate the newest C2 code this weekend. Thanks.
[jira] Updated: (NUTCH-234) Clustering extension code cleanups and a real JUnit test case for the current implementation.
[ http://issues.apache.org/jira/browse/NUTCH-234?page=all ]

Dawid Weiss updated NUTCH-234:
------------------------------

Attachment: patch.diff

The patch adds:
- a JUnit test case for the clustering extension,
- minor code cleanups,
- the ".settings" file to svn:ignore on the main Nutch folder -- this is Eclipse's project settings file.
[jira] Created: (NUTCH-237) Carrot2 clustering plugin upgrade.
Carrot2 clustering plugin upgrade.
----------------------------------

Key: NUTCH-237
URL: http://issues.apache.org/jira/browse/NUTCH-237
Project: Nutch
Type: Improvement
Reporter: Dawid Weiss
Priority: Trivial

This is an upgrade of the clustering plugin to the newest release (1.0.2).
[jira] Updated: (NUTCH-237) Carrot2 clustering plugin upgrade.
[ http://issues.apache.org/jira/browse/NUTCH-237?page=all ]

Dawid Weiss updated NUTCH-237:
------------------------------

Attachment: c2.patch
            svn-stat.txt

Note the two deleted files (I attached the result of svn stat). I didn't know how to include this info in the diff file; I don't think it's possible with plain svn.
[jira] Updated: (NUTCH-237) Carrot2 clustering plugin upgrade.
[ http://issues.apache.org/jira/browse/NUTCH-237?page=all ]

Dawid Weiss updated NUTCH-237:
------------------------------

Attachment: libs.zip

Libraries that need to be replaced.
[jira] Commented: (NUTCH-237) Carrot2 clustering plugin upgrade.
[ http://issues.apache.org/jira/browse/NUTCH-237?page=comments#action_12371687 ]

Dawid Weiss commented on NUTCH-237:
-----------------------------------

Yes and no. I removed the "support" for foreign languages from the constructor code:

    // We initialize Lingo with English stemming and stopwords. Lingo has
    // a simple language detection filter, but you'll be better off hardcoding
    // the language according to your needs. If you have bilingual indices,
    // then there is a possibility of creating a more complex process that assigns
    // a language tag before the clustering is actually started.
    return new LingoLocalFilterComponent(
        new Language[] { new English() }, defaults);
    }

Language detection is not really brilliant in the open source Lingo so I thought it wouldn't make sense to give people false hopes. Now, all the stemmers and stopword lists are still included in the release (look inside carrot2-util-tokenizer.jar$/com/dawidweiss/carrot/util/tokenizer/languages/...) so you can freely switch to another language by changing the instantiated language.

I have a better idea though -- how about you apply this patch (because I've tested it and know it works) and I'll make the language configurable via ISO codes set in the Nutch configuration? The default would be English and you could set your own language in there if you wanted to. All right?
[jira] Updated: (NUTCH-237) Carrot2 clustering plugin upgrade.
[ http://issues.apache.org/jira/browse/NUTCH-237?page=all ]

Dawid Weiss updated NUTCH-237:
------------------------------

Attachment: NUTCH-237.DWEISS.patch.zip

Hi Andrzej.

The ZIP file contains a patch and svn stat with the improved code:
- The primary language for hits without an explicit langid and a list of enabled languages in the clustering component can be specified in the configuration file (readme.txt gives the details).
- By default all languages in Carrot2 (except for Polish) are enabled. English is the default.
- I removed the dependency on Neko in favor of the simpler routine we have in the Carrot2 codebase anyway. The change shouldn't affect the results (I checked on my local installation and it seems to be fine).

I haven't played with the language identifier yet because I don't have a crawl with documents containing langid codes. The code should work without problems though -- details.getValue("lang") is converted to Carrot2's property RawDocument.PROPERTY_LANGUAGE and this is taken into account when clustering.

I couldn't delete the previously attached files. This ZIP file contains only the patch and svn stat -- you'll have to remove a few JARs manually and replace others with their new counterparts from the ZIP file I attached to this issue earlier (they haven't changed). Let me know if you need anything.
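The language handling described above (a configured list of enabled languages plus a default for hits without a langid) could look roughly like the sketch below. It does not use the plugin's real configuration: the two property names are hypothetical placeholders (the actual ones are documented in the plugin's readme.txt), and java.util.Properties stands in for the Nutch configuration object.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Properties;

class ClusteringLanguageSketch {
    // Hypothetical property names, used only for this illustration.
    static final String LANGUAGES_KEY = "example.clustering.languages";
    static final String DEFAULT_KEY = "example.clustering.default-language";

    // Parses a comma-separated list of ISO codes; falls back to English only.
    static List<String> enabledLanguages(Properties conf) {
        List<String> codes = new ArrayList<>();
        for (String code : conf.getProperty(LANGUAGES_KEY, "en").split(",")) {
            String trimmed = code.trim().toLowerCase(Locale.ROOT);
            if (!trimmed.isEmpty()) {
                codes.add(trimmed);
            }
        }
        return codes;
    }

    // Picks the language for one hit: its own langid if enabled, else the default.
    static String languageFor(String hitLang, Properties conf) {
        String fallback = conf.getProperty(DEFAULT_KEY, "en");
        if (hitLang == null) {
            return fallback;
        }
        String normalized = hitLang.trim().toLowerCase(Locale.ROOT);
        return enabledLanguages(conf).contains(normalized) ? normalized : fallback;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(LANGUAGES_KEY, "en, de, fr");
        System.out.println(languageFor(null, conf));
        System.out.println(languageFor("DE", conf));
    }
}
```

In the real plugin the chosen code would then be passed on as the document's language property before clustering starts.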
[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets
[ http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12378387 ]

Dawid Weiss commented on NUTCH-134:
-----------------------------------

(back from holidays, so a bit delayed, but) I confirm Andrzej's suggestion -- a plain-text-only summarizer is ideal for clustering, for example. HTML is quite uncomfortable to work with.

> Summarizer doesn't select the best snippets
> -------------------------------------------
>
> Key: NUTCH-134
> URL: http://issues.apache.org/jira/browse/NUTCH-134
> Project: Nutch
> Type: Bug
> Components: searcher
> Versions: 0.7.2, 0.7.1, 0.7, 0.8-dev
> Reporter: Andrzej Bialecki
> Attachments: summarizer.060506.patch
>
> Summarizer.java tries to select the best fragments from the input text, where
> the frequency of query terms is the highest. However, the logic in line 223
> is flawed in that the excerptSet.add() operation will add new excerpts only
> if they are not already present - the test is performed using the Comparator
> that compares only the numUniqueTokens. This means that if there are two or
> more excerpts which score equally high, only the first of them will be
> retained, and the rest of the equally-scoring excerpts will be discarded, in
> favor of other excerpts (possibly lower scoring).
> To fix this the Set should be replaced with a List + a sort operation. To
> keep the relative position of excerpts in the original order the Excerpt
> class should be extended with an "int order" field, and the collected
> excerpts should be sorted in that order prior to adding them to the summary.
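The flaw and the proposed fix from the quoted description can be shown with a small stand-alone sketch. The Excerpt class below is a simplified, hypothetical model (only numUniqueTokens and the suggested "order" field), not the class from Summarizer.java, and selectBest is just one way to implement the "List + sort" idea.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Simplified, hypothetical model of an excerpt.
final class Excerpt {
    final String text;
    final int numUniqueTokens; // the score
    final int order;           // position in the original text

    Excerpt(String text, int numUniqueTokens, int order) {
        this.text = text;
        this.numUniqueTokens = numUniqueTokens;
        this.order = order;
    }
}

class ExcerptSelectionSketch {
    static final Comparator<Excerpt> BY_SCORE =
            Comparator.comparingInt((Excerpt e) -> e.numUniqueTokens);

    // The flawed approach: a sorted set treats equally scoring excerpts as duplicates.
    static int keptBySet(List<Excerpt> excerpts) {
        TreeSet<Excerpt> set = new TreeSet<>(BY_SCORE);
        set.addAll(excerpts);
        return set.size();
    }

    // The proposed fix: sort a list by score, keep the best, then restore text order.
    static List<Excerpt> selectBest(List<Excerpt> excerpts, int max) {
        List<Excerpt> sorted = new ArrayList<>(excerpts);
        sorted.sort(BY_SCORE.reversed()); // stable sort, highest score first
        List<Excerpt> best = new ArrayList<>(sorted.subList(0, Math.min(max, sorted.size())));
        best.sort(Comparator.comparingInt((Excerpt e) -> e.order));
        return best;
    }

    public static void main(String[] args) {
        List<Excerpt> excerpts = Arrays.asList(
                new Excerpt("X", 2, 0), new Excerpt("Y", 5, 1),
                new Excerpt("Z", 5, 2), new Excerpt("W", 1, 3));
        System.out.println("kept by set: " + keptBySet(excerpts));
        for (Excerpt e : selectBest(excerpts, 3)) {
            System.out.println(e.text);
        }
    }
}
```

With the sample data, the set silently drops "Z" although it ties for the best score, while the list-based version keeps both top excerpts and returns the selection in original text order.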
[jira] Commented: (NUTCH-265) Getting Clustered results in better form.
[ http://issues.apache.org/jira/browse/NUTCH-265?page=comments#action_12378425 ]

Dawid Weiss commented on NUTCH-265:
-----------------------------------

The clustering interface is very simple in Nutch because it usually needs to be adjusted to the needs of a particular application. Maintaining a complex user interface is not among Nutch's objectives, so I doubt it's possible. Carrot2, which Nutch uses internally, has a JavaScript-powered interface which could be added to Nutch if there are folks who really think it is worth the effort. See this one:

http://carrot.cs.put.poznan.pl/carrot2-remote-controller/newsearch.do?query=nutch&processingChain=carrot2.process.lingo-yahooapi&resultsRequested=100

> Getting Clustered results in better form.
> -----------------------------------------
>
> Key: NUTCH-265
> URL: http://issues.apache.org/jira/browse/NUTCH-265
> Project: Nutch
> Type: Improvement
> Components: searcher
> Versions: 0.7.2
> Reporter: Kris K
>
> The cluster results are coming with title and link to URL. For improvement it
> should be clustered keyword phrases (Like Vivisimo type). Any person can
> share their views on it.
[jira] Commented: (NUTCH-265) Getting Clustered results in better form.
[ http://issues.apache.org/jira/browse/NUTCH-265?page=comments#action_12413072 ]

Dawid Weiss commented on NUTCH-265:
-----------------------------------

Chris, the current clusterer in Nutch _does_ discover phrases for clusters, so I don't know what you really mean. Did you take a look at my previous post? Would that kind of user interface make you happy?
[jira] Commented: (NUTCH-265) Getting Clustered results in better form.
[ http://issues.apache.org/jira/browse/NUTCH-265?page=comments#action_12413220 ]

Dawid Weiss commented on NUTCH-265:
-----------------------------------

If you just mean the user interface, then you can simply take the XSLT stylesheet from Carrot2 and reuse it in Nutch with the opensearch XML -- I believe there is even an example in Carrot2 of using opensearch, so you shouldn't have much trouble.

Now, the phrases you wish to see on your screen won't always be so beautiful because search results clustering works on snippets extracted from search results. If you want clean and accurate labels then you'd need to use a predefined ontology or something -- I can't help you with that.

Try playing around with the Carrot2 demo and see if the results satisfy your needs. If so, then rewriting Nutch's user interface to suit your needs shouldn't be a problem. If your expectations are more demanding then you'll need to think of some other solution.
[jira] Commented: (NUTCH-294) Topic-maps of related searchwords
[ http://issues.apache.org/jira/browse/NUTCH-294?page=comments#action_12414960 ]

Dawid Weiss commented on NUTCH-294:
-----------------------------------

Ehm, sorry I'm so late with this -- tons of work.

1) Stefan, if you can't get it working, say what is not working (exceptions? anything else?). The only thing you need to do is enable the clustering plugin in your configuration -- there should be a checkbox next to your search box; tick that and you should be able to see clustered results when you perform a query.

2) Now, having said that, I don't think that's what you're after. Carrot2 performs clustering of search results based solely on the information contained in snippets retrieved from documents (in other words, there is NO ontology and NO predefined information, everything is constructed dynamically). If you're looking for topic-maps then I guess you're after a certain type of classification engine that could pick relevant categories and display them along with search results. It's not what (the open source) Carrot2 does.

> Topic-maps of related searchwords
> ---------------------------------
>
> Key: NUTCH-294
> URL: http://issues.apache.org/jira/browse/NUTCH-294
> Project: Nutch
> Type: New Feature
> Components: searcher
> Reporter: Stefan Neufeind
>
> Would it be possible to offer a user "topic-maps"? It's when you search for
> something and get topic-related words that might also be of interest for you.
> I wonder if that's somehow possible with the ngram-index for "did you mean"
> (see separate feature-enhancement-bug for this), but we'd need to have a
> relation between words (in what context do they occur).
> For the webfrontend usually trees are used - which for some users offer
> quite impressive eye-candy :-) E.g. see this advertisement by Novell where
> I've just seen a similar "topic-map" as well:
> http://www.novell.com/de-de/company/advertising/defineyouropen.html
[jira] Commented: (NUTCH-294) Topic-maps of related searchwords
[ http://issues.apache.org/jira/browse/NUTCH-294?page=comments#action_12415094 ]

Dawid Weiss commented on NUTCH-294:
-----------------------------------

Well, you certainly have something wrong in your configuration then. I just tried with the head revision. My nutch-site looks like this:

  [...]
  <property>
    <name>plugin.includes</name>
    <value>clustering-carrot2|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
    <description>Regular expression naming plugin directory names to [...]</description>
  </property>
  [...]

Start Tomcat and issue any query that returns results. Look in the log files for:

  2006-06-07 09:29:35 org.apache.nutch.plugin.PluginRepository displayStatus
  INFO: Online Search Results Clustering using Carrot2's Lingo component (clustering-carrot2)
  2006-06-07 09:29:35 org.apache.nutch.clustering.OnlineClustererFactory getOnlineClusterer
  INFO: Using the first clustering extension found: Carrot2-Lingo

Ok, the results page should show a "clustering" option next to the "Search" button (it does on my installation). Select it and rerun the query. On the right side you'll have clusters (titles and three sample documents from each cluster are shown).

As for your idea, I still don't think Lingo is what you need... Of course you can try feeding it with unrelated keywords and then see what comes out, but I don't think it's the right approach.
[jira] Commented: (NUTCH-309) Uses commons logging Code Guards
[ http://issues.apache.org/jira/browse/NUTCH-309?page=comments#action_12418396 ]

Dawid Weiss commented on NUTCH-309:
-----------------------------------

Painful job, Jerome, but in most cases (non-critical loops) the gain will not be significant, and proliferating if statements makes the code harder to read. Wrapping logging statements with code guards is a perfect aspect -- I'm sure it'd be possible to postprocess the binaries and do it automatically (with AspectJ or even a simple implementation of an observer in asmlib). Just a thought.

> Uses commons logging Code Guards
> --------------------------------
>
> Key: NUTCH-309
> URL: http://issues.apache.org/jira/browse/NUTCH-309
> Project: Nutch
> Type: Improvement
> Versions: 0.8-dev
> Reporter: Jerome Charron
> Assignee: Jerome Charron
> Priority: Minor
> Fix For: 0.8-dev
>
> "Code guards are typically used to guard code that only needs to execute in
> support of logging, that otherwise introduces undesirable runtime overhead in
> the general case (logging disabled). Examples are multiple parameters, or
> expressions (e.g. string + " more") for parameters. Use the guard methods of
> the form log.is<Priority>() to verify that logging should be performed,
> before incurring the overhead of the logging method call. Yes, the logging
> methods will perform the same check, but only after resolving parameters."
> (description extracted from
> http://jakarta.apache.org/commons/logging/guide.html#Code_Guards)
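For readers unfamiliar with the code-guard pattern discussed above, here is a minimal sketch. To stay free of external dependencies it uses java.util.logging and its isLoggable(Level) check in place of the commons-logging guard methods such as isDebugEnabled(); the describe helper and the call counter are hypothetical and exist only to make the cost visible.

```java
import java.util.Arrays;
import java.util.logging.Level;
import java.util.logging.Logger;

class CodeGuardSketch {
    // Counts how often the "expensive" message construction actually runs.
    static int expensiveCalls = 0;

    // Hypothetical helper standing in for costly message building.
    static String describe(int[] data) {
        expensiveCalls++;
        return Arrays.toString(data);
    }

    // Without a guard the argument is built even when FINE is disabled.
    static void unguarded(Logger log, int[] data) {
        log.fine("data: " + describe(data));
    }

    // With a guard the argument is only built when FINE is enabled.
    static void guarded(Logger log, int[] data) {
        if (log.isLoggable(Level.FINE)) {
            log.fine("data: " + describe(data));
        }
    }

    public static void main(String[] args) {
        Logger log = Logger.getLogger("codeguard.sketch");
        log.setLevel(Level.INFO); // FINE messages are disabled
        int[] data = {1, 2, 3};
        unguarded(log, data);
        guarded(log, data);
        System.out.println("expensive calls: " + expensiveCalls);
    }
}
```

The trade-off raised in the comment is visible here too: the guard saves work only when the message is costly to build, and it adds an extra if statement at every call site.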
[jira] Commented: (NUTCH-300) Clustering API improvements
[ http://issues.apache.org/jira/browse/NUTCH-300?page=comments#action_12419708 ]

Dawid Weiss commented on NUTCH-300:
-----------------------------------

Hi. I just took a look at it -- I don't see anything wrong with the code and Andrzej has used Carrot2 before.

We're in the middle of major refactorings to simplify things within Carrot2 -- the internals won't change much, but we're dropping obsolete APIs etc. The new web application has a shiny new user interface (at the moment XSLT-filtered from XML, so not suitable for huge user loads, but very convenient to work with for customizations). Stay tuned.

> Clustering API improvements
> ---------------------------
>
> Key: NUTCH-300
> URL: http://issues.apache.org/jira/browse/NUTCH-300
> Project: Nutch
> Type: Improvement
> Versions: 0.8-dev
> Reporter: Andrzej Bialecki
> Priority: Minor
> Attachments: patch.txt
>
> This patch adds support for retrieving original document scores (from
> NutchBean), as well as cluster-level relevance scores (from Clusterer). Both
> methods may improve visual representation of the clusters, where individual
> items may be visually differentiated depending on their query relevance and
> cluster relevance. A modified cluster.jsp illustrates this feature.
[jira] Commented: (NUTCH-397) porting clustering-carrot2 plugin to carrot2 v2.0
[ http://issues.apache.org/jira/browse/NUTCH-397?page=comments#action_12450146 ]

Dawid Weiss commented on NUTCH-397:
-----------------------------------

I'll review this patch and commit all the necessary code as soon as possible (it may be around the end of the week though, since I have a few urgent papers to review).

> porting clustering-carrot2 plugin to carrot2 v2.0
> -------------------------------------------------
>
> Key: NUTCH-397
> URL: http://issues.apache.org/jira/browse/NUTCH-397
> Project: Nutch
> Issue Type: Improvement
> Reporter: Doğacan Güney
> Priority: Trivial
> Attachments: carrot2-nutch-plugin.patch, clustering-carrot2-lib.tar.gz, clustering.patch
>
> A rather trivial port of clustering-carrot2 to the new carrot2. I also added the
> necessary jars for Polish, so that nutch will not give the annoying
> exceptions when it is initializing clustering-carrot2.
> There is a small problem, though. AFAICS, a small patch has to be applied to
> carrot2, otherwise nutch cannot start the plugin. (I am also attaching that
> here.)
[jira] Created: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)
Upgrade Carrot2 clustering plugin to the newest stable release (2.1) Key: NUTCH-544 URL: https://issues.apache.org/jira/browse/NUTCH-544 Project: Nutch Issue Type: Improvement Reporter: Dawid Weiss Priority: Minor This issue upgrades Carrot2 search results clustering plugin to the newest stable version. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)
[ https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12521784 ] Dawid Weiss commented on NUTCH-544: --- I've started working on this -- will send a patch for revision soon (tested against the current trunk -- didn't know which version to set for "Affects version", please feel free to edit this field on this issue). > Upgrade Carrot2 clustering plugin to the newest stable release (2.1) > > > Key: NUTCH-544 > URL: https://issues.apache.org/jira/browse/NUTCH-544 > Project: Nutch > Issue Type: Improvement >Reporter: Dawid Weiss >Priority: Minor > > This issue upgrades Carrot2 search results clustering plugin to the newest > stable version. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)
[ https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12521791 ] Dawid Weiss commented on NUTCH-544: --- Yes, absolutely -- it's actually my fault I didn't notice these tasks, apologies. > Upgrade Carrot2 clustering plugin to the newest stable release (2.1) > > > Key: NUTCH-544 > URL: https://issues.apache.org/jira/browse/NUTCH-544 > Project: Nutch > Issue Type: Improvement >Reporter: Dawid Weiss >Priority: Minor > > This issue upgrades Carrot2 search results clustering plugin to the newest > stable version. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)
[ https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12521792 ] Dawid Weiss commented on NUTCH-544: --- Doğacan, would it be a problem if we threw in the BeanShell and Dom4j JARs? We have been talking about this with Staszek -- this would allow us to instantiate clustering algorithms dynamically and would effectively let Nutch users choose between Lingo, STC or Lingo3G (our commercial clusterer). I'm asking because I remember at the beginning there were concerns about the size of Nutch when compiled with all plugin dependencies etc. > Upgrade Carrot2 clustering plugin to the newest stable release (2.1) > > > Key: NUTCH-544 > URL: https://issues.apache.org/jira/browse/NUTCH-544 > Project: Nutch > Issue Type: Improvement >Reporter: Dawid Weiss >Priority: Minor > > This issue upgrades Carrot2 search results clustering plugin to the newest > stable version. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)
[ https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated NUTCH-544: -- Attachment: clustering-upgrade-2.1.patch svn diff of the patch. Binary files are not included (is there a way to do it with Subversion?), I'll post them in a separate bundle. > Upgrade Carrot2 clustering plugin to the newest stable release (2.1) > > > Key: NUTCH-544 > URL: https://issues.apache.org/jira/browse/NUTCH-544 > Project: Nutch > Issue Type: Improvement >Reporter: Dawid Weiss >Priority: Minor > Attachments: clustering-upgrade-2.1.patch > > > This issue upgrades Carrot2 search results clustering plugin to the newest > stable version. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)
[ https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated NUTCH-544: -- Attachment: libs-packed.tar.gz lib folder (binary files to be replaced). > Upgrade Carrot2 clustering plugin to the newest stable release (2.1) > > > Key: NUTCH-544 > URL: https://issues.apache.org/jira/browse/NUTCH-544 > Project: Nutch > Issue Type: Improvement >Reporter: Dawid Weiss >Priority: Minor > Attachments: clustering-upgrade-2.1.patch, libs-packed.tar.gz > > > This issue upgrades Carrot2 search results clustering plugin to the newest > stable version. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)
[ https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12521842 ] Dawid Weiss commented on NUTCH-544: --- Ok, this patch does the following: - upgrades Carrot2 libs to 2.1 (the most recent stable version) - fixes issues with tests not run properly, - fixes some multiple-initialization issues. It is ready for review/ commit. > Upgrade Carrot2 clustering plugin to the newest stable release (2.1) > > > Key: NUTCH-544 > URL: https://issues.apache.org/jira/browse/NUTCH-544 > Project: Nutch > Issue Type: Improvement >Reporter: Dawid Weiss >Priority: Minor > Attachments: clustering-upgrade-2.1.patch, libs-packed.tar.gz > > > This issue upgrades Carrot2 search results clustering plugin to the newest > stable version. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)
[ https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12521843 ] Dawid Weiss commented on NUTCH-544: --- Not exactly; the initialization issue is still present, but I'll create another JIRA entry for it and fix it there (it's not related to the upgrade, but rather to the webapp). > Upgrade Carrot2 clustering plugin to the newest stable release (2.1) > > > Key: NUTCH-544 > URL: https://issues.apache.org/jira/browse/NUTCH-544 > Project: Nutch > Issue Type: Improvement >Reporter: Dawid Weiss >Priority: Minor > Attachments: clustering-upgrade-2.1.patch, libs-packed.tar.gz > > > This issue upgrades Carrot2 search results clustering plugin to the newest > stable version. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-545) Configuration and OnlineClusterer get initialized in every request.
Configuration and OnlineClusterer get initialized in every request. --- Key: NUTCH-545 URL: https://issues.apache.org/jira/browse/NUTCH-545 Project: Nutch Issue Type: Bug Components: web gui Reporter: Dawid Weiss The initialization code block in search.jsp is invoked in every request (it's part of the request block). This is unnecessary and actually slows down the request cycle -- Configuration and OnlineClusterer can (and should) be reused. The attached patch moved initialization code to jspInit(). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
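The difference between per-request and one-time initialization described above can be sketched in plain Java. This is only an illustration, not the attached patch: `SearchPageSketch`, `ExpensiveResource`, `init()` and the two `service...` methods are hypothetical stand-ins for the JSP page, for Configuration/OnlineClusterer, for jspInit() and for the request block. It also assumes the shared object is safe to use from concurrent requests, which the issue does not discuss.

```java
// Hypothetical stand-in for a JSP page; this is not the real search.jsp.
class SearchPageSketch {
    // Counts how often the costly setup runs (kept only for illustration).
    static int setupCount = 0;

    // Hypothetical stand-in for objects such as Configuration or OnlineClusterer.
    static final class ExpensiveResource {
        ExpensiveResource() {
            setupCount++; // pretend this constructor is slow
        }

        String handle(String query) {
            return "results for " + query;
        }
    }

    private ExpensiveResource shared;

    // Plays the role of jspInit(): the container calls it once per page instance.
    void init() {
        shared = new ExpensiveResource();
    }

    // "Before": the resource is rebuilt inside the request block on every request.
    String servicePerRequest(String query) {
        return new ExpensiveResource().handle(query);
    }

    // "After": every request reuses the instance created in init().
    String serviceShared(String query) {
        return shared.handle(query);
    }

    public static void main(String[] args) {
        SearchPageSketch page = new SearchPageSketch();
        page.init();
        for (int i = 0; i < 3; i++) {
            page.serviceShared("nutch");
        }
        System.out.println("setups with shared instance: " + setupCount);
        for (int i = 0; i < 3; i++) {
            page.servicePerRequest("nutch");
        }
        System.out.println("setups after per-request calls: " + setupCount);
    }
}
```

The counter makes the cost visible: the shared variant pays for setup once regardless of the number of requests, while the per-request variant pays for it every time.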
[jira] Updated: (NUTCH-545) Configuration and OnlineClusterer get initialized in every request.
[ https://issues.apache.org/jira/browse/NUTCH-545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated NUTCH-545: -- Attachment: search.jsp.patch Patch of search.jsp that moves initialization code to jspInit(). > Configuration and OnlineClusterer get initialized in every request. > --- > > Key: NUTCH-545 > URL: https://issues.apache.org/jira/browse/NUTCH-545 > Project: Nutch > Issue Type: Bug > Components: web gui >Reporter: Dawid Weiss > Attachments: search.jsp.patch > > > The initialization code block in search.jsp is invoked in every request (it's > part of the request block). This is unnecessary and actually slows down the > request cycle -- Configuration and OnlineClusterer can (and should) be reused. > The attached patch moved initialization code to jspInit(). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)
[ https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated NUTCH-544: -- Attachment: (was: clustering-upgrade-2.1.patch) > Upgrade Carrot2 clustering plugin to the newest stable release (2.1) > > > Key: NUTCH-544 > URL: https://issues.apache.org/jira/browse/NUTCH-544 > Project: Nutch > Issue Type: Improvement >Reporter: Dawid Weiss >Priority: Minor > Attachments: libs-packed.tar.gz > > > This issue upgrades Carrot2 search results clustering plugin to the newest > stable version. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)
[ https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated NUTCH-544: -- Attachment: clustering-upgrade-2.1.patch Same patch, but I added an optional parameter that allows custom clustering processes to be used. > Upgrade Carrot2 clustering plugin to the newest stable release (2.1) > > > Key: NUTCH-544 > URL: https://issues.apache.org/jira/browse/NUTCH-544 > Project: Nutch > Issue Type: Improvement >Reporter: Dawid Weiss >Priority: Minor > Attachments: clustering-upgrade-2.1.patch, libs-packed.tar.gz > > > This issue upgrades Carrot2 search results clustering plugin to the newest > stable version. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)
[ https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12522047 ] Dawid Weiss commented on NUTCH-544: --- This parameter is in the code. It is specific to the plugin, not the extension point, so I didn't add it to nutch-defaults.xml. I'll write the configuration/ process switching info on the Wiki -- I guess it makes more sense to have it there. http://wiki.apache.org/nutch/ClusteringPlugin Switching clustering algorithms isn't very intuitive because they come with their own JARs and Nutch's plugin system requires all JARs to be explicitly defined in the plugin's descriptor. I finally decided to go for a workaround -- there is a default clustering algorithm embedded with the clustering plugin (which uses the Lingo algorithm), if another clustering process is to be used, all its required classes must be present in classpath (for example by placing them in the container's shared classes). Worked for me quite well since you don't have to modify Nutch's WAR at all. As I said, I'll write a longer explanation of this on the Wiki. > Upgrade Carrot2 clustering plugin to the newest stable release (2.1) > > > Key: NUTCH-544 > URL: https://issues.apache.org/jira/browse/NUTCH-544 > Project: Nutch > Issue Type: Improvement >Reporter: Dawid Weiss >Priority: Minor > Attachments: clustering-upgrade-2.1.patch, libs-packed.tar.gz > > > This issue upgrades Carrot2 search results clustering plugin to the newest > stable version. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
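The workaround described above (an embedded default, with any alternative resolved from the classpath at runtime) might look roughly like the sketch below. Everything named here is hypothetical: `Clusterer`, `DefaultClusterer`, `AltClusterer` and `load` are not the plugin's or Carrot2's API, and falling back to the default when a class cannot be loaded is an assumption; the actual plugin may report an error instead.

```java
// Hypothetical sketch of "embedded default, optional alternative from the classpath".
class ClustererLoaderSketch {
    // Hypothetical interface; the real plugin's types are not shown in this thread.
    interface Clusterer {
        String name();
    }

    // Stands in for the algorithm bundled with the plugin.
    static final class DefaultClusterer implements Clusterer {
        public String name() {
            return "default";
        }
    }

    // Stands in for an alternative whose classes are merely present on the classpath.
    static final class AltClusterer implements Clusterer {
        public String name() {
            return "alternative";
        }
    }

    // Returns the configured implementation, or the embedded default when the
    // class name is absent, cannot be found, or is not a Clusterer.
    // The fallback behaviour is an assumption made for this sketch.
    static Clusterer load(String className) {
        if (className == null || className.isEmpty()) {
            return new DefaultClusterer();
        }
        try {
            Class<?> type = Class.forName(className);
            return (Clusterer) type.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            return new DefaultClusterer();
        }
    }

    public static void main(String[] args) {
        System.out.println(load(null).name());
        System.out.println(load(AltClusterer.class.getName()).name());
        System.out.println(load("no.such.Clusterer").name());
    }
}
```

Because the alternative is looked up by name at runtime, its classes only need to be visible to the class loader (for example via the container's shared classes), which is why the WAR itself does not have to change.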
[jira] Commented: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)
[ https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12522992 ] Dawid Weiss commented on NUTCH-544: --- Hey, Doğacan will you find a spare minute to commit this patch some time this week? Thanks a bunch, > Upgrade Carrot2 clustering plugin to the newest stable release (2.1) > > > Key: NUTCH-544 > URL: https://issues.apache.org/jira/browse/NUTCH-544 > Project: Nutch > Issue Type: Improvement >Reporter: Dawid Weiss >Priority: Minor > Attachments: clustering-upgrade-2.1.patch, libs-packed.tar.gz > > > This issue upgrades Carrot2 search results clustering plugin to the newest > stable version. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)
[ https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated NUTCH-544: -- Attachment: clustering-upgrade-2.1.patch2 The same patch, one extra line of logging info added (specifying the clustering algorithm used). > Upgrade Carrot2 clustering plugin to the newest stable release (2.1) > > > Key: NUTCH-544 > URL: https://issues.apache.org/jira/browse/NUTCH-544 > Project: Nutch > Issue Type: Improvement >Reporter: Dawid Weiss >Priority: Minor > Attachments: clustering-upgrade-2.1.patch2, libs-packed.tar.gz > > > This issue upgrades Carrot2 search results clustering plugin to the newest > stable version. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)
[ https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated NUTCH-544: -- Attachment: (was: clustering-upgrade-2.1.patch) > Upgrade Carrot2 clustering plugin to the newest stable release (2.1) > > > Key: NUTCH-544 > URL: https://issues.apache.org/jira/browse/NUTCH-544 > Project: Nutch > Issue Type: Improvement >Reporter: Dawid Weiss >Priority: Minor > Attachments: clustering-upgrade-2.1.patch2, libs-packed.tar.gz > > > This issue upgrades Carrot2 search results clustering plugin to the newest > stable version. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-567) Proper (?) handling of URIs in TagSoup.
Proper (?) handling of URIs in TagSoup. --- Key: NUTCH-567 URL: https://issues.apache.org/jira/browse/NUTCH-567 Project: Nutch Issue Type: Improvement Reporter: Dawid Weiss Priority: Minor Attachments: uri-entities.patch Doug Cook reported that TagSoup incorrectly handles some URI parameters. More discussion on the list and at TagSoup's mailing list. http://tech.groups.yahoo.com/group/tagsoup-friends/message/838 I looked at the sources of TagSoup because I'm using it myself (although the URIs are not relevant for me). It seems like you can implement a naive workaround by remembering the parsing state and just avoiding entity resolution. Attached is the patch that does this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-567) Proper (?) handling of URIs in TagSoup.
[ https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated NUTCH-567: -- Attachment: uri-entities.patch A patch against tagsoup-1.1.3 fixing the entities-in-URIs problem. Hopefully it works; I didn't test much. You'll have to fix paths in the patch file to apply it locally. > Proper (?) handling of URIs in TagSoup. > --- > > Key: NUTCH-567 > URL: https://issues.apache.org/jira/browse/NUTCH-567 > Project: Nutch > Issue Type: Improvement >Reporter: Dawid Weiss >Priority: Minor > Attachments: uri-entities.patch > > > Doug Cook reported that TagSoup incorrectly handles some URI parameters. More > discussion on the list and at TagSoup's mailing list. > http://tech.groups.yahoo.com/group/tagsoup-friends/message/838 > I looked at the sources of TagSoup because I'm using it myself (although the > URIs are not relevant for me). It seems like you can implement a naive > workaround by remembering the parsing state and just avoiding entity > resolution. Attached is the patch that does this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-567) Proper (?) handling of URIs in TagSoup.
[ https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated NUTCH-567: -- Attachment: tagsoup-1.1.3-uripatched.jar Binary of tagsoup with the patch compiled in. > Proper (?) handling of URIs in TagSoup. > --- > > Key: NUTCH-567 > URL: https://issues.apache.org/jira/browse/NUTCH-567 > Project: Nutch > Issue Type: Improvement >Reporter: Dawid Weiss >Priority: Minor > Attachments: tagsoup-1.1.3-uripatched.jar , uri-entities.patch > > > Doug Cook reported that TagSoup incorrectly handles some URI parameters. More > discussion on the list and at TagSoup's mailing list. > http://tech.groups.yahoo.com/group/tagsoup-friends/message/838 > I looked at the sources of TagSoup because I'm using it myself (although the > URIs are not relevant for me). It seems like you can implement a naive > workaround by remembering the parsing state and just avoiding entity > resolution. Attached is the patch that does this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-567) Proper (?) handling of URIs in TagSoup.
[ https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535853 ] Dawid Weiss commented on NUTCH-567: --- Don't mention it. Happy birthday and I hope it'll work for you. If you take a look at the patch (source) you'll see it's really a trivial change to the source... I actually looked at how browsers handle such "illegal" URIs (because all URIs should have &amp; in them to separate parameters, not just a bare ampersand) and it seems they use some heuristics to determine what is an entity and what is not. Look at the test case -- it shows such a nasty situation. The current patch attempts to resolve URIs in a way similar to how my Firefox does it. > Proper (?) handling of URIs in TagSoup. > --- > > Key: NUTCH-567 > URL: https://issues.apache.org/jira/browse/NUTCH-567 > Project: Nutch > Issue Type: Improvement >Reporter: Dawid Weiss >Priority: Minor > Attachments: tagsoup-1.1.3-uripatched.jar , uri-entities.patch > > > Doug Cook reported that TagSoup incorrectly handles some URI parameters. More > discussion on the list and at TagSoup's mailing list. > http://tech.groups.yahoo.com/group/tagsoup-friends/message/838 > I looked at the sources of TagSoup because I'm using it myself (although the > URIs are not relevant for me). It seems like you can implement a naive > workaround by remembering the parsing state and just avoiding entity > resolution. Attached is the patch that does this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-567) Proper (?) handling of URIs in TagSoup.
[ https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539025 ] Dawid Weiss commented on NUTCH-567: --- Hi Doğacan. I have sent an e-mail to Tagsoup's mailing list, but it seems like the project has been inactive for some time. (http://tech.groups.yahoo.com/group/tagsoup-friends/). I guess we could patch TagSoup locally so that people can use it with Nutch. I didn't do any extensive tests though, so if Doug has done some testing this would be valuable. If this patch were to be integrated with Nutch I can prepare some more tests to cover border cases. D. > Proper (?) handling of URIs in TagSoup. > --- > > Key: NUTCH-567 > URL: https://issues.apache.org/jira/browse/NUTCH-567 > Project: Nutch > Issue Type: Improvement >Reporter: Dawid Weiss >Priority: Minor > Attachments: tagsoup-1.1.3-uripatched.jar , uri-entities.patch > > > Doug Cook reported that TagSoup incorrectly handles some URI parameters. More > discussion on the list and at TagSoup's mailing list. > http://tech.groups.yahoo.com/group/tagsoup-friends/message/838 > I looked at the sources of TagSoup because I'm using it myself (although the > URIs are not relevant for me). It seems like you can implement a naive > workaround by remembering the parsing state and just avoiding entity > resolution. Attached is the patch that does this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-567) Proper (?) handling of URIs in TagSoup.
[ https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539162 ] Dawid Weiss commented on NUTCH-567: --- I agree. What we used to do in Carrot2 was to include the patch (against the original version of the sources) along with the recompiled binary. This way you keep track of what's been changed locally compared to the publicly available version. > Proper (?) handling of URIs in TagSoup. > --- > > Key: NUTCH-567 > URL: https://issues.apache.org/jira/browse/NUTCH-567 > Project: Nutch > Issue Type: Improvement >Reporter: Dawid Weiss >Priority: Minor > Attachments: tagsoup-1.1.3-uripatched.jar , uri-entities.patch > > > Doug Cook reported that TagSoup incorrectly handles some URI parameters. More > discussion on the list and at TagSoup's mailing list. > http://tech.groups.yahoo.com/group/tagsoup-friends/message/838 > I looked at the sources of TagSoup because I'm using it myself (although the > URIs are not relevant for me). It seems like you can implement a naive > workaround by remembering the parsing state and just avoiding entity > resolution. Attached is the patch that does this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-567) Proper (?) handling of URIs in TagSoup.
[ https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541074 ] Dawid Weiss commented on NUTCH-567: --- I didn't put the feather because I wasn't sure about licensing; I'll look into it, but I have to leave in a minute -- it'll be tomorrow. If you want to go ahead with it, just check the license and, if it conforms to Apache's, re-submit it on your own. I'll do it tomorrow if it's not done by then. > Proper (?) handling of URIs in TagSoup. > --- > > Key: NUTCH-567 > URL: https://issues.apache.org/jira/browse/NUTCH-567 > Project: Nutch > Issue Type: Improvement >Reporter: Dawid Weiss >Priority: Minor > Attachments: README-tagsoup-patched.txt, tagsoup-1.1.3-uripatched.jar > , uri-entities.patch > > > Doug Cook reported that TagSoup incorrectly handles some URI parameters. More > discussion on the list and at TagSoup's mailing list. > http://tech.groups.yahoo.com/group/tagsoup-friends/message/838 > I looked at the sources of TagSoup because I'm using it myself (although the > URIs are not relevant for me). It seems like you can implement a naive > workaround by remembering the parsing state and just avoiding entity > resolution. Attached is the patch that does this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-567) Proper (?) handling of URIs in TagSoup.
[ https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated NUTCH-567: -- Attachment: (was: uri-entities.patch) > Proper (?) handling of URIs in TagSoup. > --- > > Key: NUTCH-567 > URL: https://issues.apache.org/jira/browse/NUTCH-567 > Project: Nutch > Issue Type: Improvement >Reporter: Dawid Weiss >Priority: Minor > Attachments: README-tagsoup-patched.txt > > > Doug Cook reported that TagSoup incorrectly handles some URI parameters. More > discussion on the list and at TagSoup's mailing list. > http://tech.groups.yahoo.com/group/tagsoup-friends/message/838 > I looked at the sources of TagSoup because I'm using it myself (although the > URIs are not relevant for me). It seems like you can implement a naive > workaround by remembering the parsing state and just avoiding entity > resolution. Attached is the patch that does this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-567) Proper (?) handling of URIs in TagSoup.
[ https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated NUTCH-567: -- Attachment: (was: tagsoup-1.1.3-uripatched.jar ) > Proper (?) handling of URIs in TagSoup. > --- > > Key: NUTCH-567 > URL: https://issues.apache.org/jira/browse/NUTCH-567 > Project: Nutch > Issue Type: Improvement >Reporter: Dawid Weiss >Priority: Minor > Attachments: README-tagsoup-patched.txt > > > Doug Cook reported that TagSoup incorrectly handles some URI parameters. More > discussion on the list and at TagSoup's mailing list. > http://tech.groups.yahoo.com/group/tagsoup-friends/message/838 > I looked at the sources of TagSoup because I'm using it myself (although the > URIs are not relevant for me). It seems like you can implement a naive > workaround by remembering the parsing state and just avoiding entity > resolution. Attached is the patch that does this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-567) Proper (?) handling of URIs in TagSoup.
[ https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated NUTCH-567: -- Attachment: tagsoup-1.1.3-uripatched.jar Attached is a patched version of tagsoup. TagSoup's Web site states that: "TagSoup is free and Open Source software, licensed under the Academic Free License version 3.0, a cleaned-up and patent-safe BSD-style license which allows proprietary re-use." I haven't found any information about incompatibilities between the Apache and AFL licenses. > Proper (?) handling of URIs in TagSoup. > --- > > Key: NUTCH-567 > URL: https://issues.apache.org/jira/browse/NUTCH-567 > Project: Nutch > Issue Type: Improvement >Reporter: Dawid Weiss >Priority: Minor > Attachments: README-tagsoup-patched.txt, tagsoup-1.1.3-uripatched.jar > > > Doug Cook reported that TagSoup incorrectly handles some URI parameters. More > discussion on the list and at TagSoup's mailing list. > http://tech.groups.yahoo.com/group/tagsoup-friends/message/838 > I looked at the sources of TagSoup because I'm using it myself (although the > URIs are not relevant for me). It seems like you can implement a naive > workaround by remembering the parsing state and just avoiding entity > resolution. Attached is the patch that does this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-567) Proper (?) handling of URIs in TagSoup.
[ https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556261#action_12556261 ] Dawid Weiss commented on NUTCH-567: --- John Cowan apparently released a fixed version of TagSoup (1.2). This is good news for several reasons (quoting): - As noted above, I have changed the license to Apache 2.0. - The processing of entity references in attribute values has finally been fixed to do what browsers do. That is, a reference is only recognized if it is properly terminated by a semicolon; otherwise it is treated as plain text. This means that URIs like "foo?cdown=32&cup=42" are no longer seen as containing an instance of the cup character. I guess this issue is no longer applicable and an upgrade to the newer TagSoup would be appropriate. > Proper (?) handling of URIs in TagSoup. > --- > > Key: NUTCH-567 > URL: https://issues.apache.org/jira/browse/NUTCH-567 > Project: Nutch > Issue Type: Improvement >Reporter: Dawid Weiss >Priority: Minor > Attachments: README-tagsoup-patched.txt, tagsoup-1.1.3-uripatched.jar > > > Doug Cook reported that TagSoup incorrectly handles some URI parameters. More > discussion on the list and at TagSoup's mailing list. > http://tech.groups.yahoo.com/group/tagsoup-friends/message/838 > I looked at the sources of TagSoup because I'm using it myself (although the > URIs are not relevant for me). It seems like you can implement a naive > workaround by remembering the parsing state and just avoiding entity > resolution. Attached is the patch that does this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
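The rule quoted above (a reference is recognized only when terminated by a semicolon, otherwise it is plain text) can be illustrated with a small decoder. This is a minimal sketch and not TagSoup's code: `EntitySketch`, `lookup` and `decode` are hypothetical names, the entity table is deliberately tiny, and numeric references such as `&#38;` are not handled.

```java
// Minimal sketch of "decode an entity only if it ends with a semicolon".
class EntitySketch {
    // Tiny illustrative table; real HTML defines many more named entities.
    static String lookup(String name) {
        switch (name) {
            case "amp": return "&";
            case "lt": return "<";
            case "gt": return ">";
            case "quot": return "\"";
            case "cup": return "\u222A"; // the "cup" (union) character
            default: return null;
        }
    }

    static String decode(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            char ch = s.charAt(i);
            if (ch == '&') {
                // Scan the candidate entity name following the ampersand.
                int j = i + 1;
                while (j < s.length() && Character.isLetterOrDigit(s.charAt(j))) {
                    j++;
                }
                // Accept only a non-empty, known name that is closed by ';'.
                if (j > i + 1 && j < s.length() && s.charAt(j) == ';') {
                    String replacement = lookup(s.substring(i + 1, j));
                    if (replacement != null) {
                        out.append(replacement);
                        i = j + 1;
                        continue;
                    }
                }
            }
            // Anything else, including an unterminated "&cup", is copied verbatim.
            out.append(ch);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("foo?cdown=32&cup=42"));
        System.out.println(decode("foo?cdown=32&amp;cup=42"));
    }
}
```

With this rule the URI from the release note passes through untouched, while the properly escaped form with `&amp;` decodes to the same URI.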
[jira] Commented: (NUTCH-673) Upgrade the Carrot2 plug-in to release 3.0
[ https://issues.apache.org/jira/browse/NUTCH-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830051#action_12830051 ] Dawid Weiss commented on NUTCH-673: --- Hi guys. I'd be willing to proceed with this and upgrade to the Carrot2 3.x line. The first issue I have encountered is Lucene incompatibilities between 2.9 (currently in Nutch) and 3.0 (currently in Carrot2). Any plans or reasons not to upgrade to Lucene 3.0? It's been with us for quite a while. If there are no objections, I can prepare a patch replacing Lucene 2.9 with Lucene 3.0 (as a separate issue). > Upgrade the Carrot2 plug-in to release 3.0 > -- > > Key: NUTCH-673 > URL: https://issues.apache.org/jira/browse/NUTCH-673 > Project: Nutch > Issue Type: Improvement > Components: web gui >Affects Versions: 0.9.0 > Environment: All Nutch deployments. >Reporter: Sean Dean >Priority: Minor > Fix For: 1.1 > > > Release 3.0 of the Carrot2 plug-in was released recently. > We currently have version 2.1 in the source tree and upgrading it to the > latest version before 1.0-release might make sense. > Details on the release can be found here: > http://project.carrot2.org/release-3.0-notes.html > One major change in requirements is for JDK 1.5 to be used, but this is also > now required for Hadoop 0.19 so this wouldn't be the only reason for the > switch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-787) Upgrade Lucene to 3.0.0.
Upgrade Lucene to 3.0.0. Key: NUTCH-787 URL: https://issues.apache.org/jira/browse/NUTCH-787 Project: Nutch Issue Type: Task Components: build Reporter: Dawid Weiss Priority: Trivial -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-673) Upgrade the Carrot2 plug-in to release 3.0
[ https://issues.apache.org/jira/browse/NUTCH-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830078#action_12830078 ] Dawid Weiss commented on NUTCH-673: --- O.K., I'll look into the complexity of upgrading to 3.0 first then. Filing a separate issue. > Upgrade the Carrot2 plug-in to release 3.0 > -- > > Key: NUTCH-673 > URL: https://issues.apache.org/jira/browse/NUTCH-673 > Project: Nutch > Issue Type: Improvement > Components: web gui >Affects Versions: 0.9.0 > Environment: All Nutch deployments. >Reporter: Sean Dean >Priority: Minor > Fix For: 1.1 > > > Release 3.0 of the Carrot2 plug-in was released recently. > We currently have version 2.1 in the source tree and upgrading it to the > latest version before 1.0-release might make sense. > Details on the release can be found here: > http://project.carrot2.org/release-3.0-notes.html > One major change in requirements is for JDK 1.5 to be used, but this is also > now required for Hadoop 0.19 so this wouldn't be the only reason for the > switch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-787) Upgrade Lucene to 3.0.0.
[ https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830085#action_12830085 ] Dawid Weiss commented on NUTCH-787: --- Just did an initial check -- this should be doable, although it will result in a sizeable patch due to API changes and removed deprecations. I think it still makes sense to try and push the 3.0 version of Lucene into Nutch, so I will keep working on this and seek help in reviewing the patch (and its incompatible changes) once it's ready. > Upgrade Lucene to 3.0.0. > > Key: NUTCH-787 > URL: https://issues.apache.org/jira/browse/NUTCH-787 > Project: Nutch > Issue Type: Task > Components: build > Reporter: Dawid Weiss > Priority: Trivial
[jira] Updated: (NUTCH-787) Upgrade Lucene to 3.0.0.
[ https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated NUTCH-787: -- Attachment: NUTCH-787.patch Text-patch of changes porting the code to Lucene 3.0.0. > Upgrade Lucene to 3.0.0. > > Key: NUTCH-787 > URL: https://issues.apache.org/jira/browse/NUTCH-787 > Project: Nutch > Issue Type: Task > Components: build > Reporter: Dawid Weiss > Priority: Trivial > Attachments: NUTCH-787.patch
[jira] Commented: (NUTCH-787) Upgrade Lucene to 3.0.0.
[ https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830534#action_12830534 ] Dawid Weiss commented on NUTCH-787: --- Definitely not an easy thing to do. I need to finish for today; the code compiles. Here's a brief summary of changes:
- modified all filters and streams to use token attributes instead of raw Tokens. In many places I tried to be as unintrusive as possible so that the patch can be easily reviewed and accepted; improvements resulting from the new API can follow,
- replaced deprecated constants with their new equivalents (UN_TOKENIZED, etc.),
- there are no compressed fields any more, so this stuff is commented out.
If I may ask as many people with Lucene/Nutch knowledge as possible to go through the patch and point out potential problems, it would be great. At the moment one core test fails for me -- TestIndexSorter. I don't know if the difference in boosts is a result of Lucene changes or of a bug I introduced somewhere along the way.
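To make the first item in the summary above concrete, here is a minimal, self-contained sketch of the attribute-based streaming pattern: instead of handing a fresh token object to each consumer, the stream advances with incrementToken() and every stage reads and writes one shared, mutable holder. All class names here (TermHolder, AttributeStream, WhitespaceSource, LowerCasingFilter, AttributeStreamDemo) are hypothetical stand-ins invented for illustration. They are not Lucene classes and are not taken from the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical stand-in for a term attribute: one mutable holder reused for every token. */
final class TermHolder {
    String term;
}

/** Hypothetical stand-in for an attribute-based stream; this is not Lucene's TokenStream. */
abstract class AttributeStream {
    final TermHolder termAttr;

    AttributeStream(TermHolder shared) {
        this.termAttr = shared;
    }

    /** Advances to the next token; returns false when the stream is exhausted. */
    abstract boolean incrementToken();
}

/** Source stream: splits on whitespace and writes each piece into the shared holder. */
final class WhitespaceSource extends AttributeStream {
    private final String[] parts;
    private int pos = 0;

    WhitespaceSource(String text) {
        super(new TermHolder());
        String trimmed = text.trim();
        // Avoid the single empty token that split() would yield for blank input.
        this.parts = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    @Override
    boolean incrementToken() {
        if (pos >= parts.length) {
            return false;
        }
        termAttr.term = parts[pos++];
        return true;
    }
}

/** Filter: shares the holder of its input and rewrites the term in place. */
final class LowerCasingFilter extends AttributeStream {
    private final AttributeStream input;

    LowerCasingFilter(AttributeStream input) {
        super(input.termAttr);
        this.input = input;
    }

    @Override
    boolean incrementToken() {
        if (!input.incrementToken()) {
            return false;
        }
        termAttr.term = termAttr.term.toLowerCase(Locale.ROOT);
        return true;
    }
}

final class AttributeStreamDemo {
    /** Drains a stream, copying each term out of the shared holder before advancing. */
    static List<String> collect(AttributeStream stream) {
        List<String> terms = new ArrayList<>();
        while (stream.incrementToken()) {
            terms.add(stream.termAttr.term);
        }
        return terms;
    }
}
```

Calling AttributeStreamDemo.collect(new LowerCasingFilter(new WhitespaceSource("Hello Lucene WORLD"))) drains the chain one token at a time. Because the holder is overwritten on every step, a consumer has to copy the term out before calling incrementToken() again.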
[jira] Commented: (NUTCH-787) Upgrade Lucene to 3.0.0.
[ https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830900#action_12830900 ] Dawid Weiss commented on NUTCH-787: --- The failing test in TestIndexSorter is caused by a change of implementation inside Lucene. In Lucene 2.9, SegmentMerger calls IndexReader#document(int, FieldSelector), but in 3.0 this has been changed to a call to document(int): Document doc = reader.document(docCount); Now, IndexSorter in Nutch overrides both methods and delegates to the superclass (IndexReader) with a mapping from old ids to new ids, but IndexReader re-delegates back to the overridden method, so the IDs are effectively remapped back to their original values.
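The double-remapping effect described in the comment above can be reproduced in a small, self-contained sketch. BaseReader, DoubleRemapReader and SingleRemapReader are hypothetical stand-ins written for illustration only. They are not the Lucene IndexReader or the Nutch IndexSorter, and the thread does not say how the patch actually resolves the problem. The sketch assumes, as the comment states, that the one-argument document method of the base class re-delegates to the two-argument one.

```java
/** Stand-in for the base reader; this is NOT Lucene's IndexReader. */
class BaseReader {
    private final String[] docs;

    BaseReader(String[] docs) {
        this.docs = docs;
    }

    /** One-argument form re-delegates to the two-argument form through virtual dispatch. */
    String document(int id) {
        return document(id, null);
    }

    /** Two-argument form does the actual lookup; the selector is ignored in this sketch. */
    String document(int id, Object fieldSelector) {
        return docs[id];
    }
}

/** Overrides both methods and remaps in both, so document(int) applies the mapping twice. */
class DoubleRemapReader extends BaseReader {
    private final int[] newToOld;

    DoubleRemapReader(String[] docs, int[] newToOld) {
        super(docs);
        this.newToOld = newToOld;
    }

    @Override
    String document(int id) {
        // super.document(int) dispatches to this.document(int, Object), which remaps again.
        return super.document(newToOld[id]);
    }

    @Override
    String document(int id, Object fieldSelector) {
        return super.document(newToOld[id], fieldSelector);
    }
}

/** Remaps in exactly one place; the one-argument form is inherited unchanged. */
class SingleRemapReader extends BaseReader {
    private final int[] newToOld;

    SingleRemapReader(String[] docs, int[] newToOld) {
        super(docs);
        this.newToOld = newToOld;
    }

    @Override
    String document(int id, Object fieldSelector) {
        return super.document(newToOld[id], fieldSelector);
    }
}
```

With documents {"a", "b", "c"} and the mapping {1, 0, 2}, DoubleRemapReader.document(0) maps 0 to 1 and then 1 back to 0, so it returns the unsorted document "a", while SingleRemapReader.document(0) returns "b" as intended. Keeping the remapping in only one of the two methods is one possible way to avoid the double application.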
[jira] Updated: (NUTCH-787) Upgrade Lucene to 3.0.0.
[ https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated NUTCH-787: -- Attachment: (was: NUTCH-787.patch)
[jira] Updated: (NUTCH-787) Upgrade Lucene to 3.0.0.
[ https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated NUTCH-787: -- Attachment: NUTCH-787.patch This patch moves Nutch from Lucene 2.9.1 to Lucene 3.0.0. All tests pass. The patch does not contain binary files (Lucene JARs); these changes should be applied manually:
D src/plugin/summary-lucene/lib/lucene-highlighter-2.9.1.jar
A src/plugin/summary-lucene/lib/lucene-highlighter-3.0.0.jar
D src/plugin/lib-lucene-analyzers/lib/lucene-analyzers-2.9.1.jar
A src/plugin/lib-lucene-analyzers/lib/lucene-analyzers-3.0.0.jar
D lib/lucene-misc-2.9.1.jar
A lib/lucene-core-3.0.0.jar
D lib/lucene-core-2.9.1.jar
A lib/lucene-misc-3.0.0.jar
[jira] Commented: (NUTCH-787) Upgrade Lucene to 3.0.0.
[ https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830902#action_12830902 ] Dawid Weiss commented on NUTCH-787: --- O.K. I think this is ready for review, testing and integration. All built-in tests pass; it would be good if people could test it against their indexes.
[jira] Commented: (NUTCH-787) Upgrade Lucene to 3.0.0.
[ https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846434#action_12846434 ] Dawid Weiss commented on NUTCH-787: --- I'll be happy to help if I can. I admit I only ran the build tests -- some empirical crawls and other types of jobs would be more than desirable, but I don't have the infrastructure to do it. > Upgrade Lucene to 3.0.0. > > Key: NUTCH-787 > URL: https://issues.apache.org/jira/browse/NUTCH-787 > Project: Nutch > Issue Type: Task > Components: build > Reporter: Dawid Weiss > Priority: Trivial > Fix For: 1.1 > Attachments: NUTCH-787.patch
[jira] Commented: (NUTCH-787) Upgrade Lucene to 3.0.1.
[ https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847325#action_12847325 ] Dawid Weiss commented on NUTCH-787: --- Thanks, Andrzej. > Upgrade Lucene to 3.0.1. > > Key: NUTCH-787 > URL: https://issues.apache.org/jira/browse/NUTCH-787 > Project: Nutch > Issue Type: Task > Components: build > Reporter: Dawid Weiss > Assignee: Andrzej Bialecki > Priority: Trivial > Fix For: 1.1 > Attachments: NUTCH-787.patch