[jira] Commented: (NUTCH-798) Upgrade to SOLR1.4
[ https://issues.apache.org/jira/browse/NUTCH-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843546#action_12843546 ] Sami Siren commented on NUTCH-798: -- +1 Upgrade to SOLR1.4 -- Key: NUTCH-798 URL: https://issues.apache.org/jira/browse/NUTCH-798 Project: Nutch Issue Type: Improvement Components: indexer Reporter: Julien Nioche Fix For: 1.1 in particular SOLR1.4 has a StreamingUpdateSolrServer which would simplify the way we buffer the docs before sending them to the SOLR instance -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
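The attraction of StreamingUpdateSolrServer is that it buffers and batches documents internally, so the indexer no longer has to. A minimal sketch of the manual buffering pattern it would replace (BufferingClient and its methods are hypothetical stand-ins, not SolrJ API):

```java
import java.util.ArrayList;
import java.util.List;

// Conceptual sketch of manual doc buffering before bulk sends to Solr.
// "send" is simulated by a counter; nothing here is real SolrJ code.
class BufferingClient {
    private final List<String> buffer = new ArrayList<String>();
    private final int batchSize;
    private int sent = 0;

    BufferingClient(int batchSize) { this.batchSize = batchSize; }

    // add() queues the doc; a full buffer triggers one bulk send,
    // which is the bookkeeping the streaming server takes over.
    void add(String doc) {
        buffer.add(doc);
        if (buffer.size() >= batchSize) flush();
    }

    void flush() {
        sent += buffer.size();   // stand-in for one bulk update request
        buffer.clear();
    }

    int totalSent() { return sent; }
}
```

With SolrJ 1.4 this responsibility would move into the server object itself, which drains an internal queue on background threads.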
[jira] Created: (NUTCH-793) search.jsp compile errors
search.jsp compile errors - Key: NUTCH-793 URL: https://issues.apache.org/jira/browse/NUTCH-793 Project: Nutch Issue Type: Bug Components: web gui Reporter: Sami Siren Assignee: Sami Siren Fix For: 1.1 In the searcher interface changes recently committed I broke search.jsp, which does not currently compile. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-793) search.jsp compile errors
[ https://issues.apache.org/jira/browse/NUTCH-793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-793. -- Resolution: Fixed committed a fix search.jsp compile errors - Key: NUTCH-793 URL: https://issues.apache.org/jira/browse/NUTCH-793 Project: Nutch Issue Type: Bug Components: web gui Reporter: Sami Siren Assignee: Sami Siren Fix For: 1.1 In the searcher interface changes recently committed I broke search.jsp, which does not currently compile. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-788) search.jsp typo causing searches to fail
[ https://issues.apache.org/jira/browse/NUTCH-788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-788. -- Resolution: Fixed Fix Version/s: 1.1 Assignee: Sami Siren Thanks Sammy for the fix, I did not realize you had spotted this too. It's now fixed in trunk. search.jsp typo causing searches to fail Key: NUTCH-788 URL: https://issues.apache.org/jira/browse/NUTCH-788 Project: Nutch Issue Type: Bug Components: web gui Affects Versions: 1.1 Environment: On trunk Reporter: Sammy Yu Assignee: Sami Siren Fix For: 1.1 Attachments: 0001-Fix-up-servlet.patch Call to initialize the servlet parameter is missing parentheses. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-789) Improvements to Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833714#action_12833714 ] Sami Siren commented on NUTCH-789: -- It would be really useful to include the improvements in the functionality since that way almost all (minus flash?) parsers would be covered. Improvements to Tika parser --- Key: NUTCH-789 URL: https://issues.apache.org/jira/browse/NUTCH-789 Project: Nutch Issue Type: Improvement Components: fetcher Environment: reported by Sami, in NUTCH-766 Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Minor Fix For: 1.1 Attachments: NutchTikaConfig.java, TikaParser.java As reported by Sami in NUTCH-766, Sami has a few improvements he made to the Tika parser. We'll track that progress here. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-790) Some external javadoc links are broken
Some external javadoc links are broken -- Key: NUTCH-790 URL: https://issues.apache.org/jira/browse/NUTCH-790 Project: Nutch Issue Type: Improvement Components: build Reporter: Sami Siren Assignee: Sami Siren Priority: Trivial Nutch javadoc links for lucene and hadoop are broken. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-790) Some external javadoc links are broken
[ https://issues.apache.org/jira/browse/NUTCH-790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-790: - Attachment: NUTCH-790.patch proposed patch, fixes links for lucene and hadoop, also updates j2se link to version 1.6 Some external javadoc links are broken -- Key: NUTCH-790 URL: https://issues.apache.org/jira/browse/NUTCH-790 Project: Nutch Issue Type: Improvement Components: build Reporter: Sami Siren Assignee: Sami Siren Priority: Trivial Attachments: NUTCH-790.patch Nutch javadoc links for lucene and hadoop are broken. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-791) External links for published javadocs are partially broken
External links for published javadocs are partially broken -- Key: NUTCH-791 URL: https://issues.apache.org/jira/browse/NUTCH-791 Project: Nutch Issue Type: Bug Components: documentation Reporter: Sami Siren Lucene and Hadoop links point to non-existing URLs. For some versions of apidocs the links are just broken and for some they do not exist at all. Basically what is required is that the javadocs are generated again with proper URLs for external packages. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-790) Some external javadoc links are broken
[ https://issues.apache.org/jira/browse/NUTCH-790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-790. -- Resolution: Fixed Fix Version/s: 1.1 committed Some external javadoc links are broken -- Key: NUTCH-790 URL: https://issues.apache.org/jira/browse/NUTCH-790 Project: Nutch Issue Type: Improvement Components: build Reporter: Sami Siren Assignee: Sami Siren Priority: Trivial Fix For: 1.1 Attachments: NUTCH-790.patch Nutch javadoc links for lucene and hadoop are broken. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-792) Nutch version still contains 1.0
[ https://issues.apache.org/jira/browse/NUTCH-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-792: - Attachment: NUTCH-792.patch bump version to 1.1-dev Nutch version still contains 1.0 Key: NUTCH-792 URL: https://issues.apache.org/jira/browse/NUTCH-792 Project: Nutch Issue Type: Task Components: build Reporter: Sami Siren Assignee: Sami Siren Attachments: NUTCH-792.patch Should be 1.1-dev now in trunk. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-792) Nutch version still contains 1.0
Nutch version still contains 1.0 Key: NUTCH-792 URL: https://issues.apache.org/jira/browse/NUTCH-792 Project: Nutch Issue Type: Task Components: build Reporter: Sami Siren Assignee: Sami Siren Attachments: NUTCH-792.patch Should be 1.1-dev now in trunk. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-792) Nutch version still contains 1.0
[ https://issues.apache.org/jira/browse/NUTCH-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-792. -- Resolution: Fixed committed Nutch version still contains 1.0 Key: NUTCH-792 URL: https://issues.apache.org/jira/browse/NUTCH-792 Project: Nutch Issue Type: Task Components: build Reporter: Sami Siren Assignee: Sami Siren Attachments: NUTCH-792.patch Should be 1.1-dev now in trunk. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832406#action_12832406 ] Sami Siren commented on NUTCH-766: -- I suggest that we still drive this a bit further: currently this patch does not use Tika for pkg formats or html. Julien: was there a reason not to use the AutoDetect parser? The only thing that I could come up with was that the mime type detection would be done twice. We could get around this by implementing something similar to what the composite parser does (it uses a parser (AutoDetectParser) class from the context to do further parsing) to cover all supported pkg formats. Also, was there a reason not to parse html with Tika? I have a patch nearby to demonstrate some of the improvements that I will try to post shortly. Tika parser --- Key: NUTCH-766 URL: https://issues.apache.org/jira/browse/NUTCH-766 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Assignee: Chris A. Mattmann Fix For: 1.1 Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, sample.tar.gz Tika handles a lot of different formats under the bonnet and exposes them nicely via SAX events. What is described here is a tika-parser plugin which delegates the parsing mechanism to Tika but can still coexist with the existing parsing plugins, which is useful for formats partially handled by Tika (or not at all). Some of the elements below have already been discussed on the mailing lists. Note that this is work in progress; your feedback is welcome. Tika is already used by Nutch for its MimeType implementations. 
Tika comes as different jar files (core and parsers); in the work described here we decided to put the libs in 2 different places: NUTCH_HOME/lib : tika-core.jar NUTCH_HOME/tika-plugin/lib : tika-parsers.jar Tika being used by the core only for its Mimetype functionality, we only need to put tika-core at the main lib level, whereas the tika plugin obviously needs tika-parsers.jar + all the jars used internally by Tika. Due to limitations in the way Tika loads its classes, we had to duplicate the TikaConfig class in the tika-plugin. This might be fixed in the future in Tika itself or avoided by refactoring the mimetype part of Nutch using extension points. Unlike most other parsers, Tika handles more than one Mime-type, which is why we are using * as its mimetype value in the plugin descriptor and have modified ParserFactory.java so that it considers the tika parser as potentially suitable for all mime-types. In practice this means that the associations between a mime type and a parser plugin as defined in parse-plugins.xml are useful only for the cases where we want to handle a mime type with a different parser than Tika. The general approach I chose was to convert the SAX events returned by the Tika parsers into DOM objects and reuse the utilities that come with the current HTML parser, i.e. link detection and metatag handling; this also means that we can use the HTMLParseFilters in exactly the same way. The main difference though is that HTMLParseFilters are not limited to HTML documents anymore, as the XHTML tags returned by Tika can correspond to a different format for the original document. There is a duplication of code with the html-plugin which will be resolved by either a) getting rid of the html-plugin altogether or b) exporting its jar and making the tika parser depend on it. 
The following libraries are required in the lib/ directory of the tika-parser:
<library name="asm-3.1.jar"/>
<library name="bcmail-jdk15-144.jar"/>
<library name="commons-compress-1.0.jar"/>
<library name="commons-logging-1.1.1.jar"/>
<library name="dom4j-1.6.1.jar"/>
<library name="fontbox-0.8.0-incubator.jar"/>
<library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/>
<library name="hamcrest-core-1.1.jar"/>
<library name="jce-jdk13-144.jar"/>
<library name="jempbox-0.8.0-incubator.jar"/>
<library name="metadata-extractor-2.4.0-beta-1.jar"/>
<library name="mockito-core-1.7.jar"/>
<library name="objenesis-1.0.jar"/>
<library name="ooxml-schemas-1.0.jar"/>
<library name="pdfbox-0.8.0-incubating.jar"/>
<library name="poi-3.5-FINAL.jar"/>
<library name="poi-ooxml-3.5-FINAL.jar"/>
<library name="poi-scratchpad-3.5-FINAL.jar"/>
<library name="tagsoup-1.2.jar"/>
<library name="tika-parsers-0.5-SNAPSHOT.jar"/>
<library name="xml-apis-1.0.b2.jar"/>
<library name="xmlbeans-2.3.0.jar"/>
There is a small test suite which needs to be improved. We will need to have a look at each individual format and check that it is covered by Tika, and if so to the same extent; the Wiki is probably the right place for this. The language identifier (which is a HTMLParseFilter) seemed to work fine. Again, your comments are welcome. Please bear in mind that this is just a first step. Julien http://www.digitalpebble.com -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
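The detect-once idea from the comment above (do mime detection a single time, then route to a per-format parser the way Tika's AutoDetectParser does) can be sketched with hypothetical stand-in names; this is not Tika's actual API:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of detect-then-dispatch. AutoDetectDispatch, MiniParser, and the
// toy detect() are hypothetical; real code would use Tika's MimeTypes
// repository and parser classes.
class AutoDetectDispatch {
    interface MiniParser { String parse(byte[] content); }

    private final Map<String, MiniParser> registry = new HashMap<String, MiniParser>();

    void register(String mimeType, MiniParser p) { registry.put(mimeType, p); }

    // Detection runs once here; handing this dispatcher to parsers via a
    // context object would let pkg-format parsers recurse on embedded
    // documents without re-detecting the outer type.
    String parse(byte[] content) {
        String mime = detect(content);
        MiniParser p = registry.get(mime);
        return p == null ? "" : p.parse(content);
    }

    // Toy magic-byte detection, for illustration only.
    private String detect(byte[] content) {
        return content.length > 0 && content[0] == '<' ? "text/html" : "text/plain";
    }
}
```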
[jira] Updated: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-766: - Attachment: NutchTikaConfig.java Extended TikaConfig that is able to load parsers and can be used with existing tika classes. The call to (super) cannot load parsers, but the config is then processed again locally. This is a hack and hopefully at some point we can drop the class altogether. Tika parser --- Key: NUTCH-766 URL: https://issues.apache.org/jira/browse/NUTCH-766 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Assignee: Chris A. Mattmann Fix For: 1.1 Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, sample.tar.gz -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-766: - Attachment: TikaParser.java Modified parser that can process package formats too. To get rid of the mime type detection happening twice we have to extend AutoDetectParser so that it skips the initial detection but does the detection for the rest of the content (in pkg formats). Tika parser --- Key: NUTCH-766 URL: https://issues.apache.org/jira/browse/NUTCH-766 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Assignee: Chris A. Mattmann Fix For: 1.1 Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, sample.tar.gz, TikaParser.java -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-673) Upgrade the Carrot2 plug-in to release 3.0
[ https://issues.apache.org/jira/browse/NUTCH-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830053#action_12830053 ] Sami Siren commented on NUTCH-673: -- {quote} Any plans or reasons not to upgrade to Lucene 3.0? {quote} I see no reason to stick with 2.9 {quote} I can prepare a patch replacing Lucene 2.9 with Lucene 3.0 (as a separate issue). {quote} +1 Upgrade the Carrot2 plug-in to release 3.0 -- Key: NUTCH-673 URL: https://issues.apache.org/jira/browse/NUTCH-673 Project: Nutch Issue Type: Improvement Components: web gui Affects Versions: 0.9.0 Environment: All Nutch deployments. Reporter: Sean Dean Priority: Minor Fix For: 1.1 Release 3.0 of the Carrot2 plug-in came out recently. We currently have version 2.1 in the source tree, and upgrading it to the latest version before the 1.0 release might make sense. Details on the release can be found here: http://project.carrot2.org/release-3.0-notes.html One major change in requirements is for JDK 1.5 to be used, but this is also now required for Hadoop 0.19, so this wouldn't be the only reason for the switch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-781) Update Tika to v0.6 for the MimeType detection
[ https://issues.apache.org/jira/browse/NUTCH-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828561#action_12828561 ] Sami Siren commented on NUTCH-781: -- {quote} the version we had was the same as the one provided by Tika 0.4 so I suppose we could safely rely on the Tika defaults. MimeUtil currently needs tika-mimetypes.xml to be available in the classpath but we could modify that so that it uses the default version from the tika jar if nothing can be found in conf. Let's put that in a separate JIRA issue if we really want it; in the meantime I'll commit v0.6 of tika-mimetypes.xml {quote} ok. thanks. Update Tika to v0.6 for the MimeType detection --- Key: NUTCH-781 URL: https://issues.apache.org/jira/browse/NUTCH-781 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1 [from announcement] Apache Tika, a subproject of Apache Lucene, is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. Apache Tika 0.6 contains a number of improvements and bug fixes. Details can be found in the changes file: http://www.apache.org/dist/lucene/tika/CHANGES-0.6.txt -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
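The fallback quoted above (use a tika-mimetypes.xml from conf when present, otherwise the copy bundled in the tika jar) amounts to a classpath-resource lookup with a default. A sketch under those assumptions, with a hypothetical MimeTypesLocator class:

```java
import java.io.InputStream;

// Sketch of a conf-first lookup with a jar-bundled default; the class and
// the returned labels are hypothetical, not Nutch's MimeUtil code.
class MimeTypesLocator {
    static String locate() {
        // A conf/ directory on the classpath would expose the file at the
        // classpath root; if it is there, the local override wins.
        InputStream custom = MimeTypesLocator.class
            .getResourceAsStream("/tika-mimetypes.xml");
        if (custom != null) return "custom";
        // Otherwise fall back to the default copy that ships inside the
        // tika jar (under org/apache/tika/mime/ in Tika's layout).
        return "tika-default";
    }
}
```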
[jira] Resolved: (NUTCH-775) Enhance Searcher interface
[ https://issues.apache.org/jira/browse/NUTCH-775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-775. -- Resolution: Fixed I committed this Enhance Searcher interface -- Key: NUTCH-775 URL: https://issues.apache.org/jira/browse/NUTCH-775 Project: Nutch Issue Type: Improvement Components: searcher Reporter: Sami Siren Assignee: Sami Siren Fix For: 1.1 Attachments: NUTCH-775.patch The current Searcher interface is too limited for many purposes: Hits search(Query query, int numHits, String dedupField, String sortField, boolean reverse) throws IOException; It would be nice if we had an interface that allowed adding different features without changing the interface. I am proposing that we deprecate the current search method and introduce something like: Hits search(Query query, Metadata context) throws IOException; Also, at the same time we should enhance the QueryFilter interface to look something like: BooleanQuery filter(Query input, BooleanQuery translation, Metadata context) throws QueryException; I would like to hear your comments before proceeding with a patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
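The proposed signature can be sketched as a self-contained interface. Here Metadata is a minimal stand-in for the real Nutch class, and String/String[] stand in for Query and Hits; everything below is an illustration, not the committed code:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Minimal stand-in for org.apache.nutch.metadata.Metadata (hypothetical).
class Metadata {
    private final Map<String, String> data = new HashMap<String, String>();
    void set(String name, String value) { data.put(name, value); }
    String get(String name) { return data.get(name); }
}

interface Searcher {
    // All search parameters (numHits, dedupField, sortField, reverse, and
    // anything added later) travel in the Metadata context, so new
    // features need no signature change.
    String[] search(String query, Metadata context) throws IOException;
}
```

The design point is that callers pass an open-ended bag of parameters instead of a fixed argument list, which is why the old five-argument search method can be deprecated without breaking extensibility.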
[jira] Commented: (NUTCH-781) Update Tika to v0.6 for the MimeType detection
[ https://issues.apache.org/jira/browse/NUTCH-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828275#action_12828275 ] Sami Siren commented on NUTCH-781: -- did you forget to update conf/tika-mimetypes.xml? Related question: do we actually need our own version of the tika config anymore? I saw there were some old issues that were fixed in the custom version but I would guess those changes, if important, have already made their way into Tika? Update Tika to v0.6 for the MimeType detection --- Key: NUTCH-781 URL: https://issues.apache.org/jira/browse/NUTCH-781 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-775) Enhance Searcher interface
[ https://issues.apache.org/jira/browse/NUTCH-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12806019#action_12806019 ] Sami Siren commented on NUTCH-775: -- If there are no objections I'll commit the proposed patch within a few days. Enhance Searcher interface -- Key: NUTCH-775 URL: https://issues.apache.org/jira/browse/NUTCH-775 Project: Nutch Issue Type: Improvement Components: searcher Reporter: Sami Siren Assignee: Sami Siren Fix For: 1.1 Attachments: NUTCH-775.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-775) Enhance Searcher interface
[ https://issues.apache.org/jira/browse/NUTCH-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12806051#action_12806051 ] Sami Siren commented on NUTCH-775: -- {quote} IMHO this could go as it is ... one suggestion though: this Query/QueryContext now resembles SolrQuery/SolrParams. Perhaps we could rename QueryContext to QueryParams? {quote} That sounds reasonable, I will change the name before committing. Also, I forgot to change the web gui to use the new API; will do that too. Enhance Searcher interface -- Key: NUTCH-775 URL: https://issues.apache.org/jira/browse/NUTCH-775 Project: Nutch Issue Type: Improvement Components: searcher Reporter: Sami Siren Assignee: Sami Siren Fix For: 1.1 Attachments: NUTCH-775.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12805661#action_12805661 ] Sami Siren commented on NUTCH-766: -- {quote} Sure, it's more of a configuration backwards-compat issue. For those folks who have gone to the trouble of customizing their nutch configuration (nuch-site.xml, or nutch-default.xml, or even parse-plugins), to remove out the parsing plugins (e.g., basically say they don't exist anymore and update your deployed configuration to use the tika-plugin), this patch would require a configuration update in their deployed environments. Because of that, why don't we ease them into that upgrade with at least one released version before the plugins go away. It would make it easier from a configuration backwards-compat perspective. {quote} Ok, so you mean that we need to have duplicate parser plugins because we don't want to ask people already using nutch to reconfigure the bits this involves now even though we have to do it later? How is postponing going to ease the task they need to do anyway at some point? I still don't understand the (longer term) benefit. I am not strongly against the idea of keeping duplicate plugins, I mean it's just another ~20M in the .job, what I am worried about is that the history will repeat itself and we will end up having one more case of duplicate components (in this case many of them) doing the same work and no interest in cleaning up afterwards. Doing it the way I suggested would guarantee that this will not happen. Tika parser --- Key: NUTCH-766 URL: https://issues.apache.org/jira/browse/NUTCH-766 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Assignee: Chris A. Mattmann Fix For: 1.1 Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch Tika handles a lot of different formats under the bonnet and exposes them nicely via SAX events. 
What is described here is a tika-parser plugin which delegates the parsing mechanism to Tika but can still coexist with the existing parsing plugins, which is useful for formats partially handled by Tika (or not at all). Some of the elements below have already been discussed on the mailing lists. Note that this is work in progress; your feedback is welcome. Tika is already used by Nutch for its MimeType implementations. Tika comes as different jar files (core and parsers); in the work described here we decided to put the libs in 2 different places: NUTCH_HOME/lib : tika-core.jar NUTCH_HOME/tika-plugin/lib : tika-parsers.jar Since Tika is used by the core only for its MimeType functionality, we only need to put tika-core at the main lib level, whereas the tika plugin obviously needs tika-parsers.jar + all the jars used internally by Tika. Due to limitations in the way Tika loads its classes, we had to duplicate the TikaConfig class in the tika-plugin. This might be fixed in the future in Tika itself or avoided by refactoring the mimetype part of Nutch using extension points. Unlike most other parsers, Tika handles more than one Mime-type, which is why we are using * as its mimetype value in the plugin descriptor and have modified ParserFactory.java so that it considers the tika parser as potentially suitable for all mime-types. In practice this means that the associations between a mime type and a parser plugin as defined in parse-plugins.xml are useful only for the cases where we want to handle a mime type with a different parser than Tika. The general approach I chose was to convert the SAX events returned by the Tika parsers into DOM objects and reuse the utilities that come with the current HTML parser, i.e. link detection and metatag handling; it also means that we can use the HTMLParseFilters in exactly the same way. The main difference though is that HTMLParseFilters are not limited to HTML documents anymore, as the XHTML tags returned by Tika can correspond to a different format for the original document. There is a duplication of code with the html-plugin which will be resolved by either a) getting rid of the html-plugin altogether or b) exporting its jar and making the tika parser depend on it. The following libraries are required in the lib/ directory of the tika-parser: <library name="asm-3.1.jar"/> <library name="bcmail-jdk15-144.jar"/> <library name="commons-compress-1.0.jar"/> <library name="commons-logging-1.1.1.jar"/> <library name="dom4j-1.6.1.jar"/> <library name="fontbox-0.8.0-incubator.jar"/> <library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/> <library name="hamcrest-core-1.1.jar"/> <library name="jce-jdk13-144.jar"/> <library name="jempbox-0.8.0-incubator.jar"/> <library name="metadata-extractor-2.4.0-beta-1.jar"/> <library name="mockito-core-1.7.jar"/> <library name="objenesis-1.0.jar"/> <library name="ooxml-schemas-1.0.jar"/> <library name="pdfbox-0.8.0-incubating.jar"/> <library name="poi-3.5-FINAL.jar"/> <library name="poi-ooxml-3.5-FINAL.jar"/> <library name="poi-scratchpad-3.5-FINAL.jar"/> <library name="tagsoup-1.2.jar"/> <library name="tika-parsers-0.5-SNAPSHOT.jar"/> <library name="xml-apis-1.0.b2.jar"/> <library name="xmlbeans-2.3.0.jar"/> There is a small test suite which needs to be improved. We will need to have a look at each individual format and check that it is covered by Tika and, if so, to the same extent; the Wiki is probably the right place for this. The language identifier (which is a HTMLParseFilter) seemed to work fine. Again, your comments are welcome. Please bear in mind that this is just a first step. Julien
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12804448#action_12804448 ] Sami Siren commented on NUTCH-766: -- {quote} +1, I'm going to agree on this one here Julien. Other communities have convinced me of the need for backwards compat and unobtrusiveness when bringing in new functionality or results. +1 to at least in Nutch 1.1 leaving the old plugins (perhaps mentioning they should be deprecated and replaced by the Tika functionality) and then removing them in 1.2 or 1.3. {quote} Chris, can you please explain to me how keeping two components doing identical work would be more backwards compatible than having only one? Tika parser --- Key: NUTCH-766 URL: https://issues.apache.org/jira/browse/NUTCH-766 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Assignee: Chris A. Mattmann Fix For: 1.1 Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
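The core trick described in the issue above is converting Tika's SAX event stream into DOM objects so that the existing DOM-based link-detection and metatag utilities (and the HTMLParseFilters) keep working. That bridge can be illustrated with JDK classes alone; this is a self-contained sketch, not the plugin's actual code, and in the real plugin the events would come from Tika's parsers rather than from an XML parser:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Minimal SAX handler that rebuilds a DOM tree from the event stream:
// the same shape of bridge the tika-parser plugin needs for Tika's
// XHTML SAX output.
class DomBuilderHandler extends DefaultHandler {
  private Document doc;
  private Node current;

  @Override public void startDocument() {
    try {
      doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
      current = doc; // elements are appended under the document node first
    } catch (Exception e) { throw new RuntimeException(e); }
  }
  @Override public void startElement(String uri, String local, String qName, Attributes atts) {
    Element el = doc.createElement(qName);
    for (int i = 0; i < atts.getLength(); i++) {
      el.setAttribute(atts.getQName(i), atts.getValue(i));
    }
    current.appendChild(el);
    current = el; // descend into the new element
  }
  @Override public void endElement(String uri, String local, String qName) {
    current = current.getParentNode(); // climb back up when the element closes
  }
  @Override public void characters(char[] ch, int start, int length) {
    current.appendChild(doc.createTextNode(new String(ch, start, length)));
  }
  public Document getDocument() { return doc; }
}
```

Once the handler has consumed the event stream, anything that walks a `org.w3c.dom.Document` (e.g. outlink extraction via `getElementsByTagName("a")`) runs unchanged, regardless of the original document format.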
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12803664#action_12803664 ] Sami Siren commented on NUTCH-766: -- I took a brief look into the proposed patch; some comments: The public API footprint of the new classes should be smaller, e.g. use private, package-private or protected methods/classes as much as possible. I think the end result of this plugin should be replacing all Tika-supported parsers (or the parsers we choose to replace) with the TikaParser, not building a parallel way to parse the same formats. So I think we need to copy all of the existing test files and move/adapt the existing testcases fully before committing this. That is a good way of seeing that the parse result is what is expected, and also of finding out about possible differences between the old and Tika versions. Tika parser --- Key: NUTCH-766 URL: https://issues.apache.org/jira/browse/NUTCH-766 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Assignee: Chris A. Mattmann Fix For: 1.1 Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12803673#action_12803673 ] Sami Siren commented on NUTCH-766: -- {quote} Sure, but it would be silly to block the whole Tika plugin because Tika does not support such or such format as well as the original Nutch plugins. As I explained above we can configure which parser to use for which mimetype and use the Tika-plugin by default. Hopefully the Tika implementation will get better and better and there will be no need for keeping the old plugins. {quote} I meant test files for the parsers we replace, not all. {quote} BTW http://wiki.apache.org/nutch/TikaPlugin lists the differences between the current version of Tika and the existing Nutch parsers {quote} Ok, I had missed that one. Tika parser --- Key: NUTCH-766 URL: https://issues.apache.org/jira/browse/NUTCH-766 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Assignee: Chris A. Mattmann Fix For: 1.1 Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-775) Enhance Searcher interface
[ https://issues.apache.org/jira/browse/NUTCH-775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-775: - Attachment: NUTCH-775.patch I ended up changing the Query API instead, since the changes were smaller from an API perspective that way. Enhance Searcher interface -- Key: NUTCH-775 URL: https://issues.apache.org/jira/browse/NUTCH-775 Project: Nutch Issue Type: Improvement Components: searcher Reporter: Sami Siren Assignee: Sami Siren Fix For: 1.1 Attachments: NUTCH-775.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool
[ https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12791829#action_12791829 ] Sami Siren commented on NUTCH-666: -- Should we also consider switching to Tika for language identification and routing the proposed improvements in that area through Tika? Analysis plugins for multiple language and new Language Identifier Tool --- Key: NUTCH-666 URL: https://issues.apache.org/jira/browse/NUTCH-666 Project: Nutch Issue Type: Improvement Affects Versions: 1.1 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.1 Attachments: NUTCH-666-1-20081126.patch Add analysis plugins for czech, greek, japanese, chinese, korean, dutch, russian, and thai. Also includes a new Language Identifier tool that uses the new indexing framework in NUTCH-646. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
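For context, the Nutch language identifier (and the Tika identifier proposed as its replacement) is based on comparing character n-gram profiles of the input against per-language profiles. The following is a toy, self-contained sketch of that general technique only; it is not the Nutch or Tika code, and the class and method names are invented for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Toy character-trigram language identifier: build a frequency profile per
// language from sample text, then pick the trained profile whose trigrams
// overlap most with the input's profile.
class NGramLanguageId {
  private final Map<String, Map<String, Integer>> profiles = new HashMap<>();

  // Count overlapping character trigrams of the (whitespace-normalized) text.
  private static Map<String, Integer> profile(String text) {
    Map<String, Integer> counts = new HashMap<>();
    String t = text.toLowerCase().replaceAll("\\s+", " ");
    for (int i = 0; i + 3 <= t.length(); i++) {
      counts.merge(t.substring(i, i + 3), 1, Integer::sum);
    }
    return counts;
  }

  public void train(String lang, String sampleText) {
    profiles.put(lang, profile(sampleText));
  }

  /** Returns the trained language whose profile best matches the input. */
  public String identify(String text) {
    Map<String, Integer> p = profile(text);
    String best = null;
    long bestScore = -1;
    for (Map.Entry<String, Map<String, Integer>> e : profiles.entrySet()) {
      long score = 0; // dot product of trigram counts as a crude similarity
      for (Map.Entry<String, Integer> g : p.entrySet()) {
        score += (long) g.getValue() * e.getValue().getOrDefault(g.getKey(), 0);
      }
      if (score > bestScore) { bestScore = score; best = e.getKey(); }
    }
    return best;
  }
}
```

Real identifiers use large training corpora and rank-based profile distances rather than this crude dot product, but the shape of the computation is the same, which is why routing improvements through a shared Tika implementation is attractive.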
[jira] Created: (NUTCH-775) Enhance Searcher interface
Enhance Searcher interface -- Key: NUTCH-775 URL: https://issues.apache.org/jira/browse/NUTCH-775 Project: Nutch Issue Type: Improvement Components: searcher Reporter: Sami Siren Assignee: Sami Siren Fix For: 1.1 The current Searcher interface is too limited for many purposes: Hits search(Query query, int numHits, String dedupField, String sortField, boolean reverse) throws IOException; It would be nice if we had an interface that allowed adding different features without changing the interface. I am proposing that we deprecate the current search method and introduce something like: Hits search(Query query, Metadata context) throws IOException; Also, at the same time, we should enhance the QueryFilter interface to look something like: BooleanQuery filter(Query input, BooleanQuery translation, Metadata context) throws QueryException; I would like to hear your comments before proceeding with a patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-743) Site search powered by Lucene/Solr
[ https://issues.apache.org/jira/browse/NUTCH-743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-743. -- Resolution: Fixed committed Site search powered by Lucene/Solr -- Key: NUTCH-743 URL: https://issues.apache.org/jira/browse/NUTCH-743 Project: Nutch Issue Type: New Feature Components: documentation Reporter: Sami Siren Assignee: Sami Siren Priority: Minor Attachments: NUTCH-743.patch Replace the current Nutch site search with a Lucene/Solr-powered search hosted by Lucid Imagination (http://www.lucidimagination.com/search). It allows one to search all of the Nutch content (content from other parts of the Lucene ecosystem is also available) from a single place, including web, wiki, JIRA and mail archives. Lucid has a fault-tolerant setup with replication and failover as well as monitoring services in place. A preview of the site with the new search enabled is available at http://people.apache.org/~siren/site/ -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-743) Site search powered by Lucene/Solr
Site search powered by Lucene/Solr -- Key: NUTCH-743 URL: https://issues.apache.org/jira/browse/NUTCH-743 Project: Nutch Issue Type: New Feature Components: documentation Reporter: Sami Siren Assignee: Sami Siren Priority: Minor Replace the current Nutch site search with a Lucene/Solr-powered search hosted by Lucid Imagination (http://www.lucidimagination.com/search). It allows one to search all of the Nutch content (content from other parts of the Lucene ecosystem is also available) from a single place, including web, wiki, JIRA and mail archives. Lucid has a fault-tolerant setup with replication and failover as well as monitoring services in place. A preview of the site with the new search enabled is available at http://people.apache.org/~siren/site/ -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-743) Site search powered by Lucene/Solr
[ https://issues.apache.org/jira/browse/NUTCH-743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-743: - Attachment: NUTCH-743.patch If there are no objections I will commit this within a week or so. Site search powered by Lucene/Solr -- Key: NUTCH-743 URL: https://issues.apache.org/jira/browse/NUTCH-743 Project: Nutch Issue Type: New Feature Components: documentation Reporter: Sami Siren Assignee: Sami Siren Priority: Minor Attachments: NUTCH-743.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-730) NPE in LinkRank if no nodes with which to create the WebGraph
[ https://issues.apache.org/jira/browse/NUTCH-730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-730: - Fix Version/s: (was: 1.0.0) NPE in LinkRank if no nodes with which to create the WebGraph - Key: NUTCH-730 URL: https://issues.apache.org/jira/browse/NUTCH-730 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.1 Attachments: NUTCH-730-1-20090325.patch For LinkRank, if there are no nodes to process, then a NullPointerException is thrown when trying to count number of nodes. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-722) Nutch contains jars that we cannot redistribute
[ https://issues.apache.org/jira/browse/NUTCH-722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-722. -- Resolution: Fixed removed the jars and added note about this in README.txt Nutch contains jars that we cannot redistribute --- Key: NUTCH-722 URL: https://issues.apache.org/jira/browse/NUTCH-722 Project: Nutch Issue Type: Bug Reporter: Sami Siren Priority: Blocker Fix For: 1.0.0 It seems that we have some jars (as part of pdf parser) that we cannot redistribute. Jukkas comment from email: The release contains the Java Advanced Imaging libraries (jai_core.jar and jai_codec.jar) which are licensed under Sun's Binary Code License. We can't redistribute those libraries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-728) Improve nutch release packaging
[ https://issues.apache.org/jira/browse/NUTCH-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683814#action_12683814 ] Sami Siren commented on NUTCH-728: -- not really, it just happens to be the mirror I use. Improve nutch release packaging --- Key: NUTCH-728 URL: https://issues.apache.org/jira/browse/NUTCH-728 Project: Nutch Issue Type: Improvement Reporter: Sami Siren Attachments: NUTCH-728.patch see the discussion from http://www.lucidimagination.com/search/document/aa4d52cbd9af026a/discuss_contents_of_nutch_release_artifact -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-722) Nutch contains jars that we cannot redistribute
Nutch contains jars that we cannot redistribute --- Key: NUTCH-722 URL: https://issues.apache.org/jira/browse/NUTCH-722 Project: Nutch Issue Type: Bug Reporter: Sami Siren Priority: Blocker Fix For: 1.0.0 It seems that we have some jars (as part of pdf parser) that we cannot redistribute. Jukkas comment from email: The release contains the Java Advanced Imaging libraries (jai_core.jar and jai_codec.jar) which are licensed under Sun's Binary Code License. We can't redistribute those libraries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-723) LICENCE.txt is lacking info that should be there
LICENCE.txt is lacking info that should be there Key: NUTCH-723 URL: https://issues.apache.org/jira/browse/NUTCH-723 Project: Nutch Issue Type: Bug Components: build Affects Versions: 1.0.0 Reporter: Sami Siren Jukkas comment from email: * The LICENSE.txt file should have at least references to the licenses of the bundled libraries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-725) NOTICE.txt is lacking info that should be there
NOTICE.txt is lacking info that should be there --- Key: NUTCH-725 URL: https://issues.apache.org/jira/browse/NUTCH-725 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter: Sami Siren Jukkas comment from email: * The NOTICE.txt file should start with the following lines: Apache Nutch Copyright 2009 The Apache Software Foundation * The NOTICE.txt file should contain the required copyright notices from all bundled libraries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-726) README.txt is lacking info that should be there
README.txt is lacking info that should be there --- Key: NUTCH-726 URL: https://issues.apache.org/jira/browse/NUTCH-726 Project: Nutch Issue Type: Bug Components: build Affects Versions: 1.0.0 Reporter: Sami Siren from Jukkas email: * The README.txt should start with Apache Nutch instead of Nutch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-727) Add KEYS file to release artifact
Add KEYS file to release artifact - Key: NUTCH-727 URL: https://issues.apache.org/jira/browse/NUTCH-727 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter: Sami Siren comment from Grant: Where's the KEYS file for Nutch? hi, the keys file is at the top level nutch directory (eg: http://www.nic.funet.fi/pub/mirrors/apache.org/lucene/nutch/KEYS) OK, I think it should be in the tarball too, at the top -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-726) README.txt is lacking info that should be there
[ https://issues.apache.org/jira/browse/NUTCH-726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-726. -- Resolution: Fixed Fix Version/s: 1.0.0 committed README.txt is lacking info that should be there --- Key: NUTCH-726 URL: https://issues.apache.org/jira/browse/NUTCH-726 Project: Nutch Issue Type: Bug Components: build Affects Versions: 1.0.0 Reporter: Sami Siren Fix For: 1.0.0 from Jukkas email: * The README.txt should start with Apache Nutch instead of Nutch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-724) Drop the JAI libraries
[ https://issues.apache.org/jira/browse/NUTCH-724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-724. -- Resolution: Duplicate Drop the JAI libraries -- Key: NUTCH-724 URL: https://issues.apache.org/jira/browse/NUTCH-724 Project: Nutch Issue Type: Bug Reporter: Jukka Zitting Priority: Blocker Fix For: 1.0.0 The PDF parser plugin contains Java Advanced Imaging (JAI) libraries (jai_core.jar and jai_codec.jar) that are licensed under the Sun Binary Code License. The license is incompatible with Apache policies, so we need to drop those libraries. AFAIK (see PDFBOX-381) PDFBox only uses the JAI libraries for handling page rotations and tiff images, so simply dropping the JAI jars shouldn't have too much impact. A better solution would be to switch to using Apache PDFBox that has a proper workaround for this issue, but the first Apache PDFBox release has not yet been made. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-722) Nutch contains jars that we cannot redistribute
[ https://issues.apache.org/jira/browse/NUTCH-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683482#action_12683482 ] Sami Siren commented on NUTCH-722: -- +1, I am fine with this solution too. Nutch contains jars that we cannot redistribute --- Key: NUTCH-722 URL: https://issues.apache.org/jira/browse/NUTCH-722 Project: Nutch Issue Type: Bug Reporter: Sami Siren Priority: Blocker Fix For: 1.0.0 It seems that we have some jars (as part of pdf parser) that we cannot redistribute. Jukkas comment from email: The release contains the Java Advanced Imaging libraries (jai_core.jar and jai_codec.jar) which are licensed under Sun's Binary Code License. We can't redistribute those libraries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-725) NOTICE.txt is lacking info that should be there
[ https://issues.apache.org/jira/browse/NUTCH-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-725. -- Resolution: Fixed went through the libs and added copyright notices NOTICE.txt is lacking info that should be there --- Key: NUTCH-725 URL: https://issues.apache.org/jira/browse/NUTCH-725 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter: Sami Siren Jukkas comment from email: * The NOTICE.txt file should start with the following lines: Apache Nutch Copyright 2009 The Apache Software Foundation * The NOTICE.txt file should contain the required copyright notices from all bundled libraries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-723) LICENCE.txt is lacking info that should be there
[ https://issues.apache.org/jira/browse/NUTCH-723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-723. -- Resolution: Fixed added licenses of 4rd party software LICENCE.txt is lacking info that should be there Key: NUTCH-723 URL: https://issues.apache.org/jira/browse/NUTCH-723 Project: Nutch Issue Type: Bug Components: build Affects Versions: 1.0.0 Reporter: Sami Siren Jukka's comment from email: * The LICENSE.txt file should have at least references to the licenses of the bundled libraries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Issue Comment Edited: (NUTCH-723) LICENCE.txt is lacking info that should be there
[ https://issues.apache.org/jira/browse/NUTCH-723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683618#action_12683618 ] Sami Siren edited comment on NUTCH-723 at 3/19/09 2:11 PM: --- added licenses of 3rd party software was (Author: siren): added licenses of 4rd party software LICENCE.txt is lacking info that should be there Key: NUTCH-723 URL: https://issues.apache.org/jira/browse/NUTCH-723 Project: Nutch Issue Type: Bug Components: build Affects Versions: 1.0.0 Reporter: Sami Siren Jukka's comment from email: * The LICENSE.txt file should have at least references to the licenses of the bundled libraries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-728) Improve nutch release packaging
[ https://issues.apache.org/jira/browse/NUTCH-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-728: - Attachment: NUTCH-728.patch Added a simple target to generate a source release tgz from an svn tag; did not touch the binary one. Improve nutch release packaging --- Key: NUTCH-728 URL: https://issues.apache.org/jira/browse/NUTCH-728 Project: Nutch Issue Type: Improvement Reporter: Sami Siren Attachments: NUTCH-728.patch See the discussion at http://www.lucidimagination.com/search/document/aa4d52cbd9af026a/discuss_contents_of_nutch_release_artifact -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-722) Nutch contains jars that we cannot redistribute
[ https://issues.apache.org/jira/browse/NUTCH-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683634#action_12683634 ] Sami Siren commented on NUTCH-722: -- If there are no objections I will commit this change tomorrow morning (EET). Nutch contains jars that we cannot redistribute --- Key: NUTCH-722 URL: https://issues.apache.org/jira/browse/NUTCH-722 Project: Nutch Issue Type: Bug Reporter: Sami Siren Priority: Blocker Fix For: 1.0.0 It seems that we have some jars (as part of the PDF parser) that we cannot redistribute. Jukka's comment from email: The release contains the Java Advanced Imaging libraries (jai_core.jar and jai_codec.jar) which are licensed under Sun's Binary Code License. We can't redistribute those libraries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-715) Subcollection plugin doesn't work with default subcollections.xml file
[ https://issues.apache.org/jira/browse/NUTCH-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-715. -- Resolution: Fixed committed, thanks Dmitry! Subcollection plugin doesn't work with default subcollections.xml file -- Key: NUTCH-715 URL: https://issues.apache.org/jira/browse/NUTCH-715 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.0.0 Reporter: Dmitry Lihachev Assignee: Sami Siren Fix For: 1.0.0 Attachments: NUTCH-715-testcase.patch, NUTCH-715_subcollections_fix.patch The Subcollection plugin can't parse its configuration file because the file contains a top-level comment (the ASF notice) and DomUtil doesn't take top-level comments into account. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
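The ticket turns on a small DOM detail: a leading XML comment becomes the document's first child node, so a reader that walks firstChild picks up the comment instead of the root element. A minimal sketch of the comment-tolerant approach, using plain JAXP rather than Nutch's DomUtil (the class and method names here are illustrative, not the committed fix):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class CommentTolerantParse {
    // Returns the root tag of an XML document that may begin with a
    // top-level comment, such as an ASF license notice.
    public static String rootTagOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            xml.getBytes(StandardCharsets.UTF_8)));
            // A naive walk starting at doc.getFirstChild() would land on
            // the comment node; getDocumentElement() skips straight to
            // the root element regardless of preceding comments.
            return doc.getDocumentElement().getTagName();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<!-- Licensed to the Apache Software Foundation -->"
                + "<subcollections/>";
        System.out.println(rootTagOf(xml)); // prints "subcollections"
    }
}
```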
[jira] Commented: (NUTCH-705) parse-rtf plugin
[ https://issues.apache.org/jira/browse/NUTCH-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12680411#action_12680411 ] Sami Siren commented on NUTCH-705: -- I think we should start looking at Apache Tika for most (or all) of our parsers. parse-rtf plugin Key: NUTCH-705 URL: https://issues.apache.org/jira/browse/NUTCH-705 Project: Nutch Issue Type: New Feature Components: fetcher Affects Versions: 1.0.0 Reporter: Dmitry Lihachev Priority: Minor Fix For: 1.1 Attachments: NUTCH-705.patch Demoting this issue and moving to 1.1 - current patch is not suitable due to LGPL licensed parts. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-717) Make Nutch Solr integration easier
Make Nutch Solr integration easier -- Key: NUTCH-717 URL: https://issues.apache.org/jira/browse/NUTCH-717 Project: Nutch Issue Type: New Feature Reporter: Sami Siren Fix For: 1.1 Erik Hatcher proposed we should provide a full solr config dir to be used with Nutch-Solr. Now we only provide index schema. It would be considerably easier to setup nutch-solr if we provided the whole conf dir that you could use with solr like: java -Dsolr.solr.home=Nutch's Solr Home -jar start.jar -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-711) Indexer failing after upgrade to Hadoop 0.19.1
[ https://issues.apache.org/jira/browse/NUTCH-711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678691#action_12678691 ] Sami Siren commented on NUTCH-711: -- +1 Indexer failing after upgrade to Hadoop 0.19.1 -- Key: NUTCH-711 URL: https://issues.apache.org/jira/browse/NUTCH-711 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Priority: Blocker Fix For: 1.0.0 Attachments: patch.txt After the upgrade to Hadoop 0.19.1 the Reducer is initialized in a different order than before (see http://svn.apache.org/viewvc?view=rev&revision=736239). IndexingFilters populate the current JobConf with field options that are required for IndexerOutputFormat to function properly. However, the filters are instantiated in Reducer.configure(), which is now called after the OutputFormat is initialized, and not before as previously. The workaround for now is to instantiate IndexingFilters once again inside IndexerOutputFormat. This issue should be revisited before 1.1 in order to find a better solution. See this thread for more information: http://www.lucidimagination.com/search/document/7c62c625c7ea17fe/problem_with_crawling_using_the_latest_1_0_trunk -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
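The init-order hazard can be illustrated without Hadoop: when one component's configure() populates options that another component reads at initialization time, flipping the framework's call order silently loses the options. The toy sketch below (hypothetical names, not the actual Indexer code) mirrors the ticket's workaround of defensively re-running the population step in the consumer:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal illustration of the NUTCH-711 hazard: the output format reads
// options that only the reducer's configure() writes, which breaks once
// the framework initializes the output format first.
public class InitOrder {
    static Map<String, String> jobConf = new HashMap<>();

    // Stand-in for Reducer.configure(): populates the field options.
    static void reducerConfigure() {
        jobConf.put("indexer.fields", "title,content");
    }

    // Stand-in for OutputFormat initialization, with the defensive
    // workaround the ticket describes: if the expected option is
    // missing, recompute it locally instead of failing.
    static String outputFormatInit() {
        String fields = jobConf.get("indexer.fields");
        if (fields == null) {
            reducerConfigure(); // re-run the population step ourselves
            fields = jobConf.get("indexer.fields");
        }
        return fields;
    }

    public static void main(String[] args) {
        // Hadoop 0.19.1 order: output format first, reducer second.
        System.out.println(outputFormatInit()); // prints "title,content"
    }
}
```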
[jira] Updated: (NUTCH-700) Neko1.9.11 goes into a loop
[ https://issues.apache.org/jira/browse/NUTCH-700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-700: - Fix Version/s: 1.0.0 Assignee: Sami Siren This one just bit me - the effect is that parsing hangs forever. I am promoting it to be fixed in 1.0. Neko1.9.11 goes into a loop --- Key: NUTCH-700 URL: https://issues.apache.org/jira/browse/NUTCH-700 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter: julien nioche Assignee: Sami Siren Priority: Critical Fix For: 1.0.0 Neko1.9.11 goes into a loop on some documents e.g. http://mediacet.com/Archive/FourYorkshiremen/bb/post.htm http://cizel.co.kr/main.php reverting to 0.9.4 seems to fix the problem The approach mentioned in https://issues.apache.org/jira/browse/NUTCH-696 could be a way to alleviate similar issues PS: haven't had time to report to the Neko people yet, will do at some stage -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-700) Neko1.9.11 goes into a loop
[ https://issues.apache.org/jira/browse/NUTCH-700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-700. -- Resolution: Fixed reverted to 0.9.4 Neko1.9.11 goes into a loop --- Key: NUTCH-700 URL: https://issues.apache.org/jira/browse/NUTCH-700 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter: julien nioche Assignee: Sami Siren Priority: Critical Fix For: 1.0.0 Neko1.9.11 goes into a loop on some documents e.g. http://mediacet.com/Archive/FourYorkshiremen/bb/post.htm http://cizel.co.kr/main.php reverting to 0.9.4 seems to fix the problem The approach mentioned in https://issues.apache.org/jira/browse/NUTCH-696 could be a way to alleviate similar issues PS: haven't had time to report to the Neko people yet, will do at some stage -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-669) Consolidate code for Fetcher and Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-669. -- Resolution: Fixed replaced fetcher with fetcher2 Consolidate code for Fetcher and Fetcher2 - Key: NUTCH-669 URL: https://issues.apache.org/jira/browse/NUTCH-669 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 0.9.0 Reporter: Todd Lipcon Assignee: Sami Siren Fix For: 1.0.0 I'd like to consolidate a lot of the common code between Fetcher and Fetcher2.java. It seems to me like there are the following differences: - Fetcher relies on the Protocol to obey robots.txt and crawl delay settings whereas Fetcher2 implements them itself - Fetcher2 uses a different queueing model (queue per crawl host) to accomplish the per-host limiting without making the Protocol do it. I've begun work on this but want to check with people on the following: - What reason is there for Fetcher existing at all since Fetcher2 seems to be a superset of functionality? - Is it on the road map to remove the robots/delay logic from the Http protocol and make Fetcher2's delegation of duties the standard? - Any other improvements wanted for Fetcher while I am in and around the code? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
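For readers unfamiliar with the queueing difference the ticket describes, here is a toy sketch of the "queue per crawl host" idea (illustrative only, not Fetcher2's actual classes): URLs are bucketed by host, so per-host politeness limits can be enforced centrally instead of inside each Protocol implementation.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative per-host queueing: group fetch candidates by host so a
// scheduler can apply crawl-delay and per-host limits to each queue.
public class HostQueues {
    public static Map<String, List<String>> byHost(List<String> urls) {
        Map<String, List<String>> queues = new HashMap<>();
        for (String u : urls) {
            String host;
            try {
                host = new URL(u).getHost();
            } catch (MalformedURLException e) {
                continue; // skip unparseable URLs
            }
            queues.computeIfAbsent(host, h -> new ArrayList<>()).add(u);
        }
        return queues;
    }

    public static void main(String[] args) {
        Map<String, List<String>> q = byHost(List.of(
                "http://apache.org/a", "http://apache.org/b",
                "http://example.com/c"));
        System.out.println(q.get("apache.org").size()); // prints 2
    }
}
```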
[jira] Commented: (NUTCH-705) parse-rtf plugin
[ https://issues.apache.org/jira/browse/NUTCH-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12677508#action_12677508 ] Sami Siren commented on NUTCH-705: -- I think that the patch contains some LGPL code that we cannot commit into the Apache repository. parse-rtf plugin Key: NUTCH-705 URL: https://issues.apache.org/jira/browse/NUTCH-705 Project: Nutch Issue Type: New Feature Components: fetcher Affects Versions: 1.0.0 Reporter: Dmitry Lihachev Fix For: 1.0.0 Attachments: NUTCH-705.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-699) Add an official solr schema for solr integration
[ https://issues.apache.org/jira/browse/NUTCH-699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-699. -- Resolution: Fixed committed Add an official solr schema for solr integration -- Key: NUTCH-699 URL: https://issues.apache.org/jira/browse/NUTCH-699 Project: Nutch Issue Type: New Feature Components: indexer Reporter: Doğacan Güney Assignee: Doğacan Güney Fix For: 1.0.0 See Andrzej's comments on NUTCH-684 for more info. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (NUTCH-669) Consolidate code for Fetcher and Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren reassigned NUTCH-669: Assignee: Sami Siren Consolidate code for Fetcher and Fetcher2 - Key: NUTCH-669 URL: https://issues.apache.org/jira/browse/NUTCH-669 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 0.9.0 Reporter: Todd Lipcon Assignee: Sami Siren Fix For: 1.0.0 I'd like to consolidate a lot of the common code between Fetcher and Fetcher2.java. It seems to me like there are the following differences: - Fetcher relies on the Protocol to obey robots.txt and crawl delay settings whereas Fetcher2 implements them itself - Fetcher2 uses a different queueing model (queue per crawl host) to accomplish the per-host limiting without making the Protocol do it. I've begun work on this but want to check with people on the following: - What reason is there for Fetcher existing at all since Fetcher2 seems to be a superset of functionality? - Is it on the road map to remove the robots/delay logic from the Http protocol and make Fetcher2's delegation of duties the standard? - Any other improvements wanted for Fetcher while I am in and around the code? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-703) Upgrade to Hadoop 0.19.1
[ https://issues.apache.org/jira/browse/NUTCH-703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12677266#action_12677266 ] Sami Siren commented on NUTCH-703: -- Andrzej, are you working on this now? Upgrade to Hadoop 0.19.1 Key: NUTCH-703 URL: https://issues.apache.org/jira/browse/NUTCH-703 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Priority: Blocker Fix For: 1.0.0 From the release notes: Release 0.19.1 fixes many critical bugs in 0.19.0, including ***some data loss issues***. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-247) robot parser to restrict.
[ https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-247. -- Resolution: Fixed Assignee: Sami Siren (was: Dennis Kubes) committed this - added checking to F2 (which is soon to be Fetcher) robot parser to restrict. - Key: NUTCH-247 URL: https://issues.apache.org/jira/browse/NUTCH-247 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 0.8 Reporter: Stefan Groschupf Assignee: Sami Siren Priority: Minor Fix For: 1.0.0 Attachments: agent-names.patch, agent-names3.patch.txt If the agent name and the robots agents are not properly configured, the robot rules parser uses LOG.severe to log the problem even though it also resolves it. Later on the fetcher thread checks for severe errors and stops if there is one. RobotRulesParser: if (agents.size() == 0) { agents.add(agentName); LOG.severe("No agents listed in 'http.robots.agents' property!"); } else if (!((String)agents.get(0)).equalsIgnoreCase(agentName)) { agents.add(0, agentName); LOG.severe("Agent we advertise (" + agentName + ") not listed first in 'http.robots.agents' property!"); } Fetcher.FetcherThread: if (LogFormatter.hasLoggedSevere()) // something bad happened break; I suggest using warn or something similar instead of severe to log this problem. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
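The suggested fix can be sketched with java.util.logging; the class and method names below are illustrative, not the committed patch. The point is to log the misconfiguration at warning level, since the code repairs the agent list itself and should not trip the fetcher thread's severe-error check:

```java
import java.util.List;
import java.util.logging.Logger;

// Sketch of the agent-list check from the ticket, logging at WARNING
// instead of SEVERE so a recoverable misconfiguration does not make
// the fetcher thread bail out.
public class AgentCheck {
    static final Logger LOG = Logger.getLogger("RobotRulesParser");

    // Ensures the advertised agent name is present and listed first,
    // repairing the list in place and warning about the misconfiguration.
    public static List<String> ensureAgentFirst(List<String> agents,
                                                String agentName) {
        if (agents.isEmpty()) {
            agents.add(agentName);
            LOG.warning("No agents listed in 'http.robots.agents' property!");
        } else if (!agents.get(0).equalsIgnoreCase(agentName)) {
            agents.add(0, agentName);
            LOG.warning("Agent we advertise (" + agentName
                    + ") not listed first in 'http.robots.agents' property!");
        }
        return agents;
    }
}
```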
[jira] Created: (NUTCH-701) replace Fetcher with Fetcher2
replace Fetcher with Fetcher2 - Key: NUTCH-701 URL: https://issues.apache.org/jira/browse/NUTCH-701 Project: Nutch Issue Type: Bug Components: fetcher Reporter: Sami Siren Assignee: Sami Siren Fix For: 1.0.0 Currently there are two fetcher implementations within Nutch, one too many. This task tracks the process of promoting Fetcher2. My plan is basically to remove Fetcher altogether and rename Fetcher2 to Fetcher, and to fix the Crawl class so it works with the F2 API. If there are no objections I will proceed with this soon. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-701) Replace Fetcher with Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-701: - Summary: Replace Fetcher with Fetcher2 (was: replace Fetcher with Fetcher2) Replace Fetcher with Fetcher2 - Key: NUTCH-701 URL: https://issues.apache.org/jira/browse/NUTCH-701 Project: Nutch Issue Type: Bug Components: fetcher Reporter: Sami Siren Assignee: Sami Siren Fix For: 1.0.0 Currently there are two fetcher implementations within Nutch, one too many. This task tracks the process of promoting Fetcher2. My plan is basically to remove Fetcher altogether and rename Fetcher2 to Fetcher, and to fix the Crawl class so it works with the F2 API. If there are no objections I will proceed with this soon. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-698) CrawlDb is corrupted after a few crawl cycles
[ https://issues.apache.org/jira/browse/NUTCH-698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-698. -- Resolution: Fixed committed. thanks guys CrawlDb is corrupted after a few crawl cycles - Key: NUTCH-698 URL: https://issues.apache.org/jira/browse/NUTCH-698 Project: Nutch Issue Type: Bug Reporter: Doğacan Güney Assignee: Doğacan Güney Priority: Blocker Fix For: 1.0.0 Attachments: NUTCH-698_v1.patch After change to hadoop's MapWritable, crawldb becomes corrupted after some fetch cycles. For more details see this discussion thread: http://www.nabble.com/Fetcher2-crashes-with-current-trunk-td21978049.html -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-699) Add an official solr schema for solr integration
[ https://issues.apache.org/jira/browse/NUTCH-699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12676233#action_12676233 ] Sami Siren commented on NUTCH-699: -- We could put it under conf/ ? Add an official solr schema for solr integration -- Key: NUTCH-699 URL: https://issues.apache.org/jira/browse/NUTCH-699 Project: Nutch Issue Type: New Feature Components: indexer Reporter: Doğacan Güney Assignee: Doğacan Güney Fix For: 1.0.0 See Andrzej's comments on NUTCH-684 for more info. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-701) Replace Fetcher with Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-701. -- Resolution: Duplicate Replace Fetcher with Fetcher2 - Key: NUTCH-701 URL: https://issues.apache.org/jira/browse/NUTCH-701 Project: Nutch Issue Type: Bug Components: fetcher Reporter: Sami Siren Assignee: Sami Siren Fix For: 1.0.0 Currently there are two fetcher implementations within Nutch, one too many. This task tracks the process of promoting Fetcher2. My plan is basically to remove Fetcher altogether and rename Fetcher2 to Fetcher, and to fix the Crawl class so it works with the F2 API. If there are no objections I will proceed with this soon. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-669) Consolidate code for Fetcher and Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-669: - Fix Version/s: (was: 1.1) 1.0.0 Moving this back to 1.0. Are you close with your patch? As discussed in this thread we should just replace Fetcher with Fetcher2, change the Crawl class and check that the tests pass. Other issues we can deal with in their own tickets. I can also help with this if you don't have the time. Consolidate code for Fetcher and Fetcher2 - Key: NUTCH-669 URL: https://issues.apache.org/jira/browse/NUTCH-669 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 0.9.0 Reporter: Todd Lipcon Fix For: 1.0.0 I'd like to consolidate a lot of the common code between Fetcher and Fetcher2.java. It seems to me like there are the following differences: - Fetcher relies on the Protocol to obey robots.txt and crawl delay settings whereas Fetcher2 implements them itself - Fetcher2 uses a different queueing model (queue per crawl host) to accomplish the per-host limiting without making the Protocol do it. I've begun work on this but want to check with people on the following: - What reason is there for Fetcher existing at all since Fetcher2 seems to be a superset of functionality? - Is it on the road map to remove the robots/delay logic from the Http protocol and make Fetcher2's delegation of duties the standard? - Any other improvements wanted for Fetcher while I am in and around the code? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-694) Distributed Search Server fails
[ https://issues.apache.org/jira/browse/NUTCH-694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-694. -- Resolution: Fixed Committed. Thanks for testing it. Distributed Search Server fails --- Key: NUTCH-694 URL: https://issues.apache.org/jira/browse/NUTCH-694 Project: Nutch Issue Type: Bug Components: searcher Affects Versions: 1.0.0 Environment: Single Server with one Nutch instance in DistributedSearchServerMode, not in PseudoDistirubutedMode Reporter: Dr. Nadine Hochstotter Assignee: Sami Siren Priority: Blocker Fix For: 1.0.0 Attachments: NUTCH-694-2.patch, NUTCH-694.patch I run Nutch on a single server, I have two crawl directories, that's why I use Nutch in distributed search server mode as described in the hadoop manual. But since I have a new Trunk Version (04.02.2009) it fails. Local search on one index works fine. But distributed search throws following exception: In catalina.out (server) 2009-02-18 17:08:14,906 ERROR NutchBean - org.apache.hadoop.ipc.RemoteException: java.io.IOException: Unknown Protocol classname:org.apache.nutch.searcher.RPCSegmentBean at org.apache.nutch.searcher.NutchBean.getProtocolVersion(NutchBean.java:403) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892) at org.apache.hadoop.ipc.Client.call(Client.java:696) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216) at $Proxy4.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:306) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:343) at 
org.apache.nutch.searcher.DistributedSegmentBean.init(DistributedSegmentBean.java:103) at org.apache.nutch.searcher.NutchBean.init(NutchBean.java:111) at org.apache.nutch.searcher.NutchBean.init(NutchBean.java:80) at org.apache.nutch.searcher.NutchBean$NutchBeanConstructor.contextInitialized(NutchBean.java:422) at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:3843) at org.apache.catalina.core.StandardContext.start(StandardContext.java:4350) at org.apache.catalina.core.StandardContext.reload(StandardContext.java:3099) at org.apache.catalina.manager.ManagerServlet.reload(ManagerServlet.java:913) at org.apache.catalina.manager.HTMLManagerServlet.reload(HTMLManagerServlet.java:536) at org.apache.catalina.manager.HTMLManagerServlet.doGet(HTMLManagerServlet.java:114) at javax.servlet.http.HttpServlet.service(HttpServlet.java:690) at javax.servlet.http.HttpServlet.service(HttpServlet.java:803) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175) at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:525) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.valves.RequestFilterValve.process(RequestFilterValve.java:269) at org.apache.catalina.valves.RemoteAddrValve.invoke(RemoteAddrValve.java:81) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844) at 
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583) at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447) at java.lang.Thread.run(Thread.java:619) And in Hadoop.log: 2009-02-18 17:07:52,847 INFO ipc.Server - IPC Server handler 48 on 13001: starting 2009-02-18 17:07:52,847 INFO ipc.Server - IPC Server handler 49 on 13001: starting 2009-02-18 17:07:52,847 INFO ipc.Server - IPC Server handler 40 on 13001: starting 2009-02-18
[jira] Commented: (NUTCH-477) Extend URLFilters to support different filtering chains
[ https://issues.apache.org/jira/browse/NUTCH-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675793#action_12675793 ] Sami Siren commented on NUTCH-477: -- It's your call. IMO the whole URLFilters - URLFilter, URLNormalizers - URLNormalizer setup is a bit too complex as it is now; we can make it cleaner but it's probably not worth the trouble pre 1.0. Extend URLFilters to support different filtering chains --- Key: NUTCH-477 URL: https://issues.apache.org/jira/browse/NUTCH-477 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 1.0.0 Attachments: urlfilters.patch I propose to make the following changes to URLFilters: * extend URLFilters so that they support different filtering rules depending on the context where they are executed. This functionality mirrors the one that URLNormalizers already support. * change their return value to an int code, in order to support early termination of long filtering chains. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
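The second proposed change can be sketched as follows; the constant names and interface are illustrative assumptions, not the API from urlfilters.patch. An int code lets a filter short-circuit the chain in either direction, with a hard reject or a final accept that skips the remaining filters:

```java
// Sketch of a URL filter chain whose filters return an int code so the
// chain can terminate early, as proposed in the ticket. ACCEPT lets the
// chain continue; REJECT and FINAL_ACCEPT stop it immediately.
public class FilterChain {
    static final int REJECT = 0;
    static final int ACCEPT = 1;
    static final int FINAL_ACCEPT = 2;

    interface UrlFilter {
        int filter(String url);
    }

    // Runs the chain; a URL passes unless some filter rejects it before
    // any filter final-accepts it.
    public static boolean accepts(String url, UrlFilter... chain) {
        for (UrlFilter f : chain) {
            int code = f.filter(url);
            if (code == REJECT) return false;      // early termination
            if (code == FINAL_ACCEPT) return true; // skip remaining filters
        }
        return true;
    }

    public static void main(String[] args) {
        UrlFilter noFtp = u -> u.startsWith("ftp:") ? REJECT : ACCEPT;
        UrlFilter whitelist = u -> u.contains("apache.org") ? FINAL_ACCEPT : ACCEPT;
        System.out.println(accepts("http://apache.org/x", noFtp, whitelist)); // true
        System.out.println(accepts("ftp://example.com/y", noFtp, whitelist)); // false
    }
}
```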
[jira] Updated: (NUTCH-694) Distributed Search Server fails
[ https://issues.apache.org/jira/browse/NUTCH-694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-694: - Attachment: NUTCH-694-2.patch I rechecked this again and there was also something else wrong; I am attaching a new patch that is now manually tested (we lost the testcase somewhere) with local and Nutch RPC search. Distributed Search Server fails --- Key: NUTCH-694 URL: https://issues.apache.org/jira/browse/NUTCH-694 Project: Nutch Issue Type: Bug Components: searcher Affects Versions: 1.0.0 Environment: Single Server with one Nutch instance in DistributedSearchServerMode, not in PseudoDistributedMode Reporter: Dr. Nadine Hochstotter Priority: Blocker Fix For: 1.0.0 Attachments: NUTCH-694-2.patch, NUTCH-694.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-694) Distributed Search Server fails
[ https://issues.apache.org/jira/browse/NUTCH-694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-694: - Patch Info: [Patch Available] Assignee: Sami Siren Distributed Search Server fails --- Key: NUTCH-694 URL: https://issues.apache.org/jira/browse/NUTCH-694 Project: Nutch Issue Type: Bug Components: searcher Affects Versions: 1.0.0 Environment: Single Server with one Nutch instance in DistributedSearchServerMode, not in PseudoDistributedMode Reporter: Dr. Nadine Hochstotter Assignee: Sami Siren Priority: Blocker Fix For: 1.0.0 Attachments: NUTCH-694-2.patch, NUTCH-694.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-573) Multiple Domains - Query Search
[ https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-573: - Patch Info: [Patch Available] Multiple Domains - Query Search --- Key: NUTCH-573 URL: https://issues.apache.org/jira/browse/NUTCH-573 Project: Nutch Issue Type: Improvement Components: searcher Affects Versions: 0.9.0 Environment: All Reporter: Rajasekar Karthik Assignee: Enis Soztutar Fix For: 1.0.0 Attachments: multiTermQuery_v1.patch Searching multiple domains can be done on Lucene - but not that efficiently on Nutch. Query: +content:abc +(site:www.aaa.com site:www.bbb.com) works on Lucene but the same concept does not work on Nutch. In Lucene, it works with org.apache.lucene.analysis.KeywordAnalyzer and org.apache.lucene.analysis.standard.StandardAnalyzer but NOT with org.apache.lucene.analysis.SimpleAnalyzer. Is the Nutch analyzer based on SimpleAnalyzer? In this case, is there a workaround to make this work? Is there an option to change what analyzer Nutch is using? Just FYI, another solution (inefficient, I believe) which seems to work on Nutch: query -site:ccc.com -site:ddd.com -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
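The query form discussed in NUTCH-573 can be assembled mechanically. Below is a minimal plain-Java sketch that builds the +content:... +(site:a site:b) string from a term and a list of domains; the class and method names (SiteQueryBuilder, buildQuery) are illustrative, not part of Nutch or Lucene.

```java
import java.util.List;
import java.util.stream.Collectors;

/**
 * Builds a Lucene-style query string that restricts a content query to a
 * set of site domains, mirroring the +content:abc +(site:a site:b) form
 * from the report. All names here are illustrative only.
 */
public class SiteQueryBuilder {

    /** Joins each domain as a site: clause inside one required group. */
    public static String buildQuery(String term, List<String> sites) {
        String siteClause = sites.stream()
                .map(s -> "site:" + s)
                .collect(Collectors.joining(" "));
        return "+content:" + term + " +(" + siteClause + ")";
    }

    public static void main(String[] args) {
        // Reproduces the query quoted in the issue.
        System.out.println(buildQuery("abc",
                List.of("www.aaa.com", "www.bbb.com")));
    }
}
```

Whether such a grouped clause is honored depends on the analyzer and query filters in use, which is exactly the question the issue raises.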
[jira] Updated: (NUTCH-477) Extend URLFilters to support different filtering chains
[ https://issues.apache.org/jira/browse/NUTCH-477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-477: - Patch Info: [Patch Available] Extend URLFilters to support different filtering chains --- Key: NUTCH-477 URL: https://issues.apache.org/jira/browse/NUTCH-477 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 1.0.0 Attachments: urlfilters.patch I propose to make the following changes to URLFilters: * extend URLFilters so that they support different filtering rules depending on the context where they are executed. This functionality mirrors the one that URLNormalizers already support. * change their return value to an int code, in order to support early termination of long filtering chains. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
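The int-return-code idea from NUTCH-477 can be sketched as follows. This is a hypothetical illustration of early chain termination, not the actual patch: the real Nutch URLFilter interface returns a String (null meaning rejected), and all names and codes below (UrlFilter, ACCEPT, REJECT, ACCEPT_FINAL) are invented for the example.

```java
/**
 * Sketch of a filter chain with int return codes so the chain can stop
 * early instead of always running every rule. Illustrative only; not the
 * actual Nutch URLFilters API.
 */
public class FilterChainSketch {

    static final int ACCEPT = 0;       // keep URL, continue down the chain
    static final int REJECT = 1;       // drop URL, stop immediately
    static final int ACCEPT_FINAL = 2; // keep URL, skip remaining filters

    interface UrlFilter {
        int filter(String url);
    }

    /** Runs the chain, honoring the early-termination codes. */
    static boolean accepts(String url, UrlFilter... chain) {
        for (UrlFilter f : chain) {
            int code = f.filter(url);
            if (code == REJECT) return false;
            if (code == ACCEPT_FINAL) return true;
        }
        return true;
    }

    public static void main(String[] args) {
        UrlFilter noFtp = u -> u.startsWith("ftp:") ? REJECT : ACCEPT;
        UrlFilter whitelist = u -> u.contains("apache.org") ? ACCEPT_FINAL : ACCEPT;
        System.out.println(accepts("http://apache.org/x", noFtp, whitelist));
        System.out.println(accepts("ftp://example.com/y", noFtp, whitelist));
    }
}
```

With long rule chains, a REJECT or ACCEPT_FINAL early in the chain saves evaluating every remaining rule, which is the motivation stated in the issue.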
[jira] Updated: (NUTCH-694) Distributed Search Server fails
[ https://issues.apache.org/jira/browse/NUTCH-694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-694: - Attachment: NUTCH-694.patch This fixed the problem for me. Distributed Search Server fails --- Key: NUTCH-694 URL: https://issues.apache.org/jira/browse/NUTCH-694 Project: Nutch Issue Type: Bug Components: searcher Affects Versions: 1.0.0 Environment: Single Server with one Nutch instance in DistributedSearchServerMode, not in PseudoDistirubutedMode Reporter: Dr. Nadine Hochstotter Priority: Blocker Fix For: 1.0.0 Attachments: NUTCH-694.patch I run Nutch on a single server, I have two crawl directories, that's why I use Nutch in distributed search server mode as described in the hadoop manual. But since I have a new Trunk Version (04.02.2009) it fails. Local search on one index works fine. But distributed search throws following exception: In catalina.out (server) 2009-02-18 17:08:14,906 ERROR NutchBean - org.apache.hadoop.ipc.RemoteException: java.io.IOException: Unknown Protocol classname:org.apache.nutch.searcher.RPCSegmentBean at org.apache.nutch.searcher.NutchBean.getProtocolVersion(NutchBean.java:403) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892) at org.apache.hadoop.ipc.Client.call(Client.java:696) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216) at $Proxy4.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:306) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:343) at org.apache.nutch.searcher.DistributedSegmentBean.init(DistributedSegmentBean.java:103) at 
org.apache.nutch.searcher.NutchBean.init(NutchBean.java:111) [remainder of stack trace identical to the copy quoted above] And in Hadoop.log: 2009-02-18 17:07:52,847 INFO ipc.Server - IPC Server handler 48 on 13001: starting 2009-02-18 17:07:52,847 INFO ipc.Server - IPC Server handler 49 on 13001: starting 2009-02-18 17:07:52,847 INFO ipc.Server - IPC Server handler 40 on 13001: starting 2009-02-18 17:08:14,675 INFO ipc.RPC - Call:
[jira] Resolved: (NUTCH-695) incorrect mime type detection by MoreIndexingFilter plugin
[ https://issues.apache.org/jira/browse/NUTCH-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-695. -- Resolution: Fixed Assignee: Sami Siren committed, thanks incorrect mime type detection by MoreIndexingFilter plugin -- Key: NUTCH-695 URL: https://issues.apache.org/jira/browse/NUTCH-695 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.0.0 Reporter: Dmitry Lihachev Assignee: Sami Siren Fix For: 1.0.0 Attachments: NUTCH-695_MoreIndexingFilter.patch, NUTCH-695_TestMoreIndexingFilter.patch When server sends {{Content-Type}} header with optional params like {{Content-Type: text/html; charset=UTF-8}} MoreIndexingFilter returns null in {{type}} field. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
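The NUTCH-695 bug boils down to optional Content-Type parameters breaking mime-type lookup. A minimal sketch of the normalization step is below; this is not the committed patch, and the class/method names (ContentTypeNormalizer, stripParameters) are illustrative.

```java
/**
 * Sketch of normalizing a Content-Type header before mime-type detection:
 * optional parameters such as "; charset=UTF-8" are stripped so
 * "text/html; charset=UTF-8" resolves the same as "text/html".
 * Illustrative only; not the actual NUTCH-695 fix.
 */
public class ContentTypeNormalizer {

    public static String stripParameters(String contentType) {
        if (contentType == null) return null;
        int semi = contentType.indexOf(';');
        // Keep only the media type itself, dropping any parameters.
        return (semi >= 0 ? contentType.substring(0, semi) : contentType).trim();
    }

    public static void main(String[] args) {
        System.out.println(stripParameters("text/html; charset=UTF-8"));
    }
}
```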
[jira] Commented: (NUTCH-694) Distributed Search Server fails
[ https://issues.apache.org/jira/browse/NUTCH-694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12674964#action_12674964 ] Sami Siren commented on NUTCH-694: -- Strange, did you update both ends (the server and the client?), normally the web application (.war) is the client. After patching you should run 1. ant clean job 2. deploy run server + client Distributed Search Server fails --- Key: NUTCH-694 URL: https://issues.apache.org/jira/browse/NUTCH-694 Project: Nutch Issue Type: Bug Components: searcher Affects Versions: 1.0.0 Environment: Single Server with one Nutch instance in DistributedSearchServerMode, not in PseudoDistirubutedMode Reporter: Dr. Nadine Hochstotter Priority: Blocker Fix For: 1.0.0 Attachments: NUTCH-694.patch I run Nutch on a single server, I have two crawl directories, that's why I use Nutch in distributed search server mode as described in the hadoop manual. But since I have a new Trunk Version (04.02.2009) it fails. Local search on one index works fine. 
But distributed search throws following exception: In catalina.out (server) 2009-02-18 17:08:14,906 ERROR NutchBean - org.apache.hadoop.ipc.RemoteException: java.io.IOException: Unknown Protocol classname:org.apache.nutch.searcher.RPCSegmentBean at org.apache.nutch.searcher.NutchBean.getProtocolVersion(NutchBean.java:403) [remainder of stack trace identical to the copy quoted above] And in Hadoop.log: 2009-02-18 17:07:52,847 INFO ipc.Server - IPC Server handler 48 on 13001: starting 2009-02-18 17:07:52,847 INFO ipc.Server - IPC
[jira] Resolved: (NUTCH-687) Add RAT
[ https://issues.apache.org/jira/browse/NUTCH-687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-687. -- Resolution: Fixed Fix Version/s: 1.0.0 committed Add RAT --- Key: NUTCH-687 URL: https://issues.apache.org/jira/browse/NUTCH-687 Project: Nutch Issue Type: Improvement Reporter: Sami Siren Assignee: Sami Siren Priority: Minor Fix For: 1.0.0 Attachments: NUTCH-687.patch Add apache rat so we can easily see the situation with required headers -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-689) Swf parser doesn't seem to handle relative links
[ https://issues.apache.org/jira/browse/NUTCH-689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12674520#action_12674520 ] Sami Siren commented on NUTCH-689: -- for some reason I cannot apply the patch: patching file src/java/org/apache/nutch/parse/swf/SWFParser.java Hunk #2 FAILED at 94. Swf parser doesn't seem to handle relative links Key: NUTCH-689 URL: https://issues.apache.org/jira/browse/NUTCH-689 Project: Nutch Issue Type: Bug Affects Versions: 0.9.0 Reporter: Peter Sparks Attachments: parse-swf.patch I was using the swf parser to extract links from flash files on the site www.arnoldworldwide.com and I was getting a malformed URL exception because an outlink was found and it was a relative link that wasn't being resolved. I was able to fix it by resolving all links as they are added to the list of outlinks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
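The fix described in NUTCH-689 is resolving each extracted link against the page URL before adding it as an outlink. A sketch of that idea using java.net.URL is below; it is an assumption about what the patch does, not its literal contents, and the names (OutlinkResolver, resolve) are hypothetical.

```java
import java.net.MalformedURLException;
import java.net.URL;

/**
 * Sketch of resolving every extracted link against the URL of the page
 * (here, the .swf file) before adding it as an outlink, so relative links
 * no longer raise MalformedURLException. Illustrative only; not the
 * literal parse-swf.patch.
 */
public class OutlinkResolver {

    public static String resolve(String baseUrl, String link) {
        try {
            // new URL(base, spec) resolves relative specs against the base
            // and leaves absolute ones untouched.
            return new URL(new URL(baseUrl), link).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(
            resolve("http://www.example.com/flash/site.swf", "/about.html"));
    }
}
```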
[jira] Resolved: (NUTCH-591) StringIndexOutOfBoundsException when extracting text from a Word document.
[ https://issues.apache.org/jira/browse/NUTCH-591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-591. -- Resolution: Duplicate duplicate of NUTCH-691 StringIndexOutOfBoundsException when extracting text from a Word document. -- Key: NUTCH-591 URL: https://issues.apache.org/jira/browse/NUTCH-591 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 0.9.0 Environment: linux redhat as4u4 x86 kernel 2.6.9 Reporter: frank ling see http://issues.apache.org/bugzilla/show_bug.cgi?id=41076+ -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-688) Fix missing/wrong headers in source files
[ https://issues.apache.org/jira/browse/NUTCH-688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-688. -- Resolution: Fixed I think we are done with this. Fix missing/wrong headers in source files - Key: NUTCH-688 URL: https://issues.apache.org/jira/browse/NUTCH-688 Project: Nutch Issue Type: Bug Reporter: Sami Siren Assignee: Sami Siren Priority: Blocker Fix For: 1.0.0 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-691) Update jakarta poi jars to the most relevant version
[ https://issues.apache.org/jira/browse/NUTCH-691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-691. -- Resolution: Fixed Fix Version/s: 1.0.0 committed, Thanks Dmitry Update jakarta poi jars to the most relevant version Key: NUTCH-691 URL: https://issues.apache.org/jira/browse/NUTCH-691 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 1.0.0 Reporter: Dmitry Lihachev Fix For: 1.0.0 Attachments: NUTCH-691-v1-poi.patch, NUTCH-691-v1-test.patch Original Estimate: 0.25h Remaining Estimate: 0.25h Update jakarta poi jars to the most relevant version closes bug NUTCH-591. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-563) Include custom fields in BasicQueryFilter
[ https://issues.apache.org/jira/browse/NUTCH-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-563. -- Resolution: Fixed Assignee: Sami Siren committed, thanks Include custom fields in BasicQueryFilter - Key: NUTCH-563 URL: https://issues.apache.org/jira/browse/NUTCH-563 Project: Nutch Issue Type: New Feature Components: searcher Reporter: julien nioche Assignee: Sami Siren Priority: Minor Fix For: 1.0.0 Attachments: diff.BasicQueryFilter.dynamicFields.txt, NUTCH-563.patch This patch allows to include additional fields in the BasicQueryFilter by specifying runtime parameters. Any parameter matching the regular expression (query\\.basic\\.(.+)\\.boost) will be added to the list of fields to be used by the BQF and the specified float value will be used as boost. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19
[ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12674603#action_12674603 ] Sami Siren commented on NUTCH-692: -- Have you seen this outside of EC2? Only in multinode setup? AlreadyBeingCreatedException with Hadoop 0.19 - Key: NUTCH-692 URL: https://issues.apache.org/jira/browse/NUTCH-692 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter: julien nioche I have been using the SVN version of Nutch on an EC2 cluster and got some AlreadyBeingCreatedException during the reduce phase of a parse. For some reason one of my tasks crashed and then I ran into this AlreadyBeingCreatedException when other nodes tried to pick it up. There was recently a discussion on the Hadoop user list on similar issues with Hadoop 0.19 (see http://markmail.org/search/after+upgrade+to+0%2E19%2E0). I have not tried using 0.18.2 yet but will do if the problems persist with 0.19 I was wondering whether anyone else had experienced the same problem. Do you think 0.19 is stable enough to use it for Nutch 1.0? I will be running a crawl on a super large cluster in the next couple of weeks and I will confirm this issue J. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-583) FeedParser empty links for items
[ https://issues.apache.org/jira/browse/NUTCH-583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-583: - Fix Version/s: (was: 1.0.0) 1.1 pushing this to 1.1 FeedParser empty links for items Key: NUTCH-583 URL: https://issues.apache.org/jira/browse/NUTCH-583 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter: Enis Soztutar Assignee: Enis Soztutar Fix For: 1.1 FeedParser in feed plugin just discards the item if it does not have link element. However Rss 2.0 does not necessitate the link element for each item. Moreover sometimes the link is given in the guid element which is a globally unique identifier for the item. I think we can search the url for an item first, then if it is still not found, we can use the feed's url, but with merging all the parse texts into one Parse object. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
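The fallback order suggested in NUTCH-583 (item link, then guid, then the feed's own URL) can be sketched as a single method. This is a hypothetical illustration; the feed plugin's actual types and null/empty handling will differ.

```java
/**
 * Sketch of the URL fallback order suggested in the issue: use the item's
 * link if present, else its guid, else fall back to the feed URL.
 * Names and empty-string handling are illustrative, not the plugin's code.
 */
public class FeedLinkFallback {

    public static String resolveItemUrl(String link, String guid, String feedUrl) {
        if (link != null && !link.isEmpty()) return link;
        if (guid != null && !guid.isEmpty()) return guid;
        return feedUrl;
    }

    public static void main(String[] args) {
        // An item with no <link> but a usable <guid>.
        System.out.println(resolveItemUrl(null,
                "http://example.com/post/42", "http://example.com/feed.xml"));
    }
}
```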
[jira] Updated: (NUTCH-631) MoreIndexingFilter fails with NoSuchElementException
[ https://issues.apache.org/jira/browse/NUTCH-631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-631: - Attachment: NUTCH-631.patch Attaching a patch that fixes the problem as proposed, If there are no objections I will commit this soon. MoreIndexingFilter fails with NoSuchElementException Key: NUTCH-631 URL: https://issues.apache.org/jira/browse/NUTCH-631 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.0.0 Environment: Verified on CentOS and OSX Reporter: Stefan Will Assignee: Chris A. Mattmann Priority: Blocker Fix For: 1.0.0 Attachments: NUTCH-631.patch I did a simple crawl and started the indexer with the index-more plugin activated. The index job fails with the following stack trace in the task log: java.util.NoSuchElementException at java.util.TreeMap.key(TreeMap.java:433) at java.util.TreeMap.firstKey(TreeMap.java:287) at java.util.TreeSet.first(TreeSet.java:407) at java.util.Collections$UnmodifiableSortedSet.first(Collections.java:1114) at org.apache.nutch.indexer.more.MoreIndexingFilter.addType(MoreIndexingFilter.java:207) at org.apache.nutch.indexer.more.MoreIndexingFilter.filter(MoreIndexingFilter.java:90) at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:111) at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:249) at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:52) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:333) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:164) I traced this down to the part in MoreIndexingFilter where the mime type is split into primary type and subtype for indexing: contentType = mimeType.getName(); String primaryType = mimeType.getSuperType().getName(); String subType = mimeType.getSubTypes().first().getName(); Apparently Tika does not have a subtype for text/html. Furthermore, the supertype for text/html is set as application/octet-stream, which I doubt is what we want indexed. 
Don't we want primaryType to be text and subType to be html ? So I changed the code to: contentType = mimeType.getName(); String[] split = contentType.split("/"); String primaryType = split[0]; String subType = (split.length > 1) ? split[1] : null; This does what I think it should do, but perhaps I'm missing something ? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
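The content-type split proposed in NUTCH-631 can be packaged as a self-contained method, deriving primary type and subtype from the string itself rather than Tika's getSuperType()/getSubTypes(). The class and method names below are illustrative, not the committed code.

```java
/**
 * Sketch of the split proposed in the report: derive primary type and
 * subtype directly from the content-type string, so "text/html" yields
 * primaryType "text" and subType "html". Illustrative names only.
 */
public class MimeTypeSplitter {

    /** Returns {primaryType, subType}; subType is null when absent. */
    public static String[] split(String contentType) {
        String[] parts = contentType.split("/");
        String primaryType = parts[0];
        String subType = (parts.length > 1) ? parts[1] : null;
        return new String[] { primaryType, subType };
    }

    public static void main(String[] args) {
        String[] t = split("text/html");
        System.out.println(t[0] + " / " + t[1]); // text / html
    }
}
```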
[jira] Created: (NUTCH-687) Add RAT
Add RAT --- Key: NUTCH-687 URL: https://issues.apache.org/jira/browse/NUTCH-687 Project: Nutch Issue Type: Improvement Reporter: Sami Siren Assignee: Sami Siren Priority: Minor Attachments: NUTCH-687.patch Add apache rat so we can easily see the situation with required headers -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-687) Add RAT
[ https://issues.apache.org/jira/browse/NUTCH-687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-687: - Attachment: NUTCH-687.patch Add RAT --- Key: NUTCH-687 URL: https://issues.apache.org/jira/browse/NUTCH-687 Project: Nutch Issue Type: Improvement Reporter: Sami Siren Assignee: Sami Siren Priority: Minor Attachments: NUTCH-687.patch Add apache rat so we can easily see the situation with required headers -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-631) MoreIndexingFilter fails with NoSuchElementException
[ https://issues.apache.org/jira/browse/NUTCH-631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-631. -- Resolution: Fixed Assignee: Sami Siren (was: Chris A. Mattmann) committed, thanks MoreIndexingFilter fails with NoSuchElementException Key: NUTCH-631 URL: https://issues.apache.org/jira/browse/NUTCH-631 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.0.0 Environment: Verified on CentOS and OSX Reporter: Stefan Will Assignee: Sami Siren Priority: Blocker Fix For: 1.0.0 Attachments: NUTCH-631.patch I did a simple crawl and started the indexer with the index-more plugin activated. The index job fails with the following stack trace in the task log: java.util.NoSuchElementException at java.util.TreeMap.key(TreeMap.java:433) at java.util.TreeMap.firstKey(TreeMap.java:287) at java.util.TreeSet.first(TreeSet.java:407) at java.util.Collections$UnmodifiableSortedSet.first(Collections.java:1114) at org.apache.nutch.indexer.more.MoreIndexingFilter.addType(MoreIndexingFilter.java:207) at org.apache.nutch.indexer.more.MoreIndexingFilter.filter(MoreIndexingFilter.java:90) at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:111) at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:249) at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:52) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:333) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:164) I traced this down to the part in MoreIndexingFilter where the mime type is split into primary type and subtype for indexing: contentType = mimeType.getName(); String primaryType = mimeType.getSuperType().getName(); String subType = mimeType.getSubTypes().first().getName(); Apparently Tika does not have a subtype for text/html. Furthermore, the supertype for text/html is set as application/octet-stream, which I doubt is what we want indexed. Don't we want primaryType to be text and subType to be html ? 
So I changed the code to: contentType = mimeType.getName(); String[] split = contentType.split("/"); String primaryType = split[0]; String subType = (split.length > 1) ? split[1] : null; This does what I think it should do, but perhaps I'm missing something ? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-582) Add missing type parameters
[ https://issues.apache.org/jira/browse/NUTCH-582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-582. -- Resolution: Fixed yep, all of this has been committed Add missing type parameters --- Key: NUTCH-582 URL: https://issues.apache.org/jira/browse/NUTCH-582 Project: Nutch Issue Type: Improvement Reporter: Sami Siren Assignee: Sami Siren Priority: Minor Attachments: typeparams.patch Hadoop 0.15 added possibility to use type parameters with several interfaces and makes it easier to use correct types in Mappers, Reducers et al. in addition to improved readability. Following patch will add type parameters to Mappers, Reducers, OutputCollectors, MapRunnables, InputFormats and OutputFormats. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-86) LanguageIdentifier API enhancements
[ https://issues.apache.org/jira/browse/NUTCH-86?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-86: Fix Version/s: (was: 1.0.0) removing from 1.0 queue since there has been no activity lately LanguageIdentifier API enhancements --- Key: NUTCH-86 URL: https://issues.apache.org/jira/browse/NUTCH-86 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 0.6, 0.7, 0.8 Reporter: Jerome Charron Assignee: Jerome Charron Priority: Minor More informations can be found on the following thread on Nutch-Dev mailing list: http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00569.html Summary: 1. LanguageIdentifier API changes. The similarity methods should return an ordered array of language-code/score pairs instead of a simple String containing the language-code. 2. Ensure consistency between LanguageIdentifier scoring and NGramProfile.getSimilarity(). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-609) Allow Plugins to be Loaded from Jar File(s)
[ https://issues.apache.org/jira/browse/NUTCH-609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-609: - Fix Version/s: (was: 1.0.0) 1.1 pushing this to 1.1, feel free to put back if there is traction Allow Plugins to be Loaded from Jar File(s) --- Key: NUTCH-609 URL: https://issues.apache.org/jira/browse/NUTCH-609 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Priority: Minor Fix For: 1.1 Attachments: NUTCH-609-1-20080212.patch Currently plugins cannot be loaded from a jar file. Plugins must be unzipped in one or more directories specified by the plugin.folders config. I have been thinking about an extension to PluginRepository or PluginManifestParser (or both) that would allow plugins to packaged into multiple independent jar files and placed on the classpath. The system would search the classpath for resources with the correct folder name and would load any plugins in those jars. This functionality would be very useful in making the nutch core more flexible in terms of packaging. It would also help with web applications where we don't want to have a plugins directory included in the webapp. Thoughts so far are unzipping those plugin jars into a common temp directory before loading. Another option is using something like commons vfs to interact with the jar files. VFS essential uses a disk based temporary cache for jar files, so it is pretty much the same solution. What are everyone else's thoughts on this? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-469) changes to geoPosition plugin to make it work on nutch 0.9
[ https://issues.apache.org/jira/browse/NUTCH-469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-469: - Fix Version/s: (was: 1.0.0) 1.1 pushing this to 1.1 changes to geoPosition plugin to make it work on nutch 0.9 -- Key: NUTCH-469 URL: https://issues.apache.org/jira/browse/NUTCH-469 Project: Nutch Issue Type: Improvement Components: indexer, searcher Affects Versions: 0.9.0 Reporter: Mike Schwartz Fix For: 1.1 Attachments: geoPosition-0.5.tgz, geoPosition0.6_cdiff.zip, NUTCH-469-2007-05-09.txt.gz I have modified the geoPosition plugin (http://wiki.apache.org/nutch/GeoPosition) code to work with nutch 0.9. (The code was built originally using nutch 0.7.) I'd like to contribute my changes back to the nutch project. I already communicated with the code's author (Matthias Jaekle), and he agrees with my mods. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-309) Uses commons logging Code Guards
[ https://issues.apache.org/jira/browse/NUTCH-309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-309: - Fix Version/s: (was: 1.0.0) 1.1 pushing this to 1.1 Uses commons logging Code Guards Key: NUTCH-309 URL: https://issues.apache.org/jira/browse/NUTCH-309 Project: Nutch Issue Type: Improvement Affects Versions: 0.8 Reporter: Jerome Charron Assignee: Chris A. Mattmann Priority: Minor Fix For: 1.1 Code guards are typically used to guard code that only needs to execute in support of logging, that otherwise introduces undesirable runtime overhead in the general case (logging disabled). Examples are multiple parameters, or expressions (e.g. string + more) for parameters. Use the guard methods of the form log.isPriority() to verify that logging should be performed, before incurring the overhead of the logging method call. Yes, the logging methods will perform the same check, but only after resolving parameters. (description extracted from http://jakarta.apache.org/commons/logging/guide.html#Code_Guards) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
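The guard pattern the quoted guide describes looks like this. A minimal sketch using JDK logging as a stdlib stand-in; commons-logging's actual guard methods are log.isDebugEnabled(), log.isInfoEnabled(), and so on:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class CodeGuardDemo {
    static final Logger LOG = Logger.getLogger(CodeGuardDemo.class.getName());
    static int expensiveCalls = 0;

    // Stands in for string concatenation or other work done
    // only to build a log message.
    static String expensiveSummary() {
        expensiveCalls++;
        return "summary of a large crawl segment";
    }

    public static void main(String[] args) {
        LOG.setLevel(Level.INFO); // FINE (debug-level) logging is disabled

        // Unguarded: the argument is evaluated even though the record is dropped.
        LOG.fine("state: " + expensiveSummary());

        // Guarded: the check is cheap, so the expensive call is skipped entirely.
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine("state: " + expensiveSummary());
        }

        System.out.println("expensive calls: " + expensiveCalls);
    }
}
```

With logging disabled, only the unguarded call pays for building the message, which is exactly the overhead the issue proposes to remove.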
[jira] Commented: (NUTCH-689) Swf parser doesn't seem to handle relative links
[ https://issues.apache.org/jira/browse/NUTCH-689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12674360#action_12674360 ] Sami Siren commented on NUTCH-689: -- About development: see http://wiki.apache.org/nutch/Becoming%20A%20Nutch%20Developer for instructions on developing Nutch, in particular the section "Step Three: Using the JIRA and Developing". You should attach patches instead of full java source files because it's much easier to see what changed by looking at diffs. Swf parser doesn't seem to handle relative links Key: NUTCH-689 URL: https://issues.apache.org/jira/browse/NUTCH-689 Project: Nutch Issue Type: Bug Affects Versions: 0.9.0 Reporter: Peter Sparks Attachments: SWFParser.java I was using the swf parser to extract links from flash files on the site www.arnoldworldwide.com and I was getting a MalformedURLException because an outlink was found that was a relative link and wasn't being resolved. I was able to fix it by resolving all links as they are added to the list of outlinks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
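The fix the reporter describes, resolving relative outlinks against the containing page's base URL as they are added, can be sketched with java.net.URL. The URLs below are illustrative, not taken from the actual SWF:

```java
import java.net.URL;

public class ResolveOutlink {
    public static void main(String[] args) throws Exception {
        // Base URL of the page (or SWF file) the link was found in.
        URL base = new URL("http://www.arnoldworldwide.com/flash/movie.swf");

        // A relative outlink like this one triggers a MalformedURLException
        // if passed to new URL(String) on its own.
        String relative = "about/contact.html";

        // The two-argument constructor resolves it against the base.
        URL resolved = new URL(base, relative);
        System.out.println(resolved);
    }
}
```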
[jira] Commented: (NUTCH-621) Nutch needs to declare its crypto usage
[ https://issues.apache.org/jira/browse/NUTCH-621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12603890#action_12603890 ] Sami Siren commented on NUTCH-621: -- I agree; it seems to me that we're in the same situation as Jackrabbit? I think we do not provide the BouncyCastle libraries with Nutch, only PDFBox. Nutch needs to declare its crypto usage Key: NUTCH-621 URL: https://issues.apache.org/jira/browse/NUTCH-621 Project: Nutch Issue Type: Task Reporter: Grant Ingersoll Assignee: Chris A. Mattmann Priority: Blocker Per the ASF board direction outlined at http://www.apache.org/dev/crypto.html, Nutch needs to declare its use of crypto libraries (i.e. BouncyCastle, via PDFBox/Tika). See TIKA-118. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-602) Allow configurable number of handlers for search servers
[ https://issues.apache.org/jira/browse/NUTCH-602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12566779#action_12566779 ] Sami Siren commented on NUTCH-602: -- +1 Allow configurable number of handlers for search servers Key: NUTCH-602 URL: https://issues.apache.org/jira/browse/NUTCH-602 Project: Nutch Issue Type: Improvement Components: searcher Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 Attachments: NUTCH-602-1-20080205.patch This improvement changes the distributed search server to allow a configurable number of RPC handlers. Previously the number was hardcoded at 10 handlers. In high-volume environments that limit is quickly reached and overall search slows down. The patch changes nutch-default.xml with the configuration parameter searchers.num.handlers and changes DistributedSearch to pull the number of handlers from the configuration. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
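The change amounts to reading the handler count from configuration with the old hardcoded value as the default. A minimal sketch of that pattern, using java.util.Properties as a stand-in for Hadoop's Configuration (HandlerCountDemo and its getInt helper are hypothetical names, not the patch's code):

```java
import java.util.Properties;

public class HandlerCountDemo {
    static final int DEFAULT_HANDLERS = 10; // the old hardcoded value

    // Mimics Hadoop's Configuration.getInt(name, defaultValue).
    static int getInt(Properties conf, String name, int defaultValue) {
        String v = conf.getProperty(name);
        return v == null ? defaultValue : Integer.parseInt(v.trim());
    }

    public static void main(String[] args) {
        Properties conf = new Properties();

        // Unset: fall back to the previous hardcoded behavior.
        System.out.println(getInt(conf, "searchers.num.handlers", DEFAULT_HANDLERS));

        // Set, e.g. via nutch-site.xml, for a high-volume deployment.
        conf.setProperty("searchers.num.handlers", "64");
        System.out.println(getInt(conf, "searchers.num.handlers", DEFAULT_HANDLERS));
    }
}
```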
[jira] Resolved: (NUTCH-580) Remove deprecated hadoop api calls (FS)
[ https://issues.apache.org/jira/browse/NUTCH-580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-580. -- Resolution: Fixed Fix Version/s: 1.0.0 Committed. Remove deprecated hadoop api calls (FS) --- Key: NUTCH-580 URL: https://issues.apache.org/jira/browse/NUTCH-580 Project: Nutch Issue Type: Improvement Affects Versions: 0.9.0 Reporter: Sami Siren Assignee: Sami Siren Priority: Minor Fix For: 1.0.0 Attachments: hadoopfsdeprecated.patch There are quite a lot of calls to deprecated Hadoop API functionality. The following patch takes care of the FS-related ones. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-580) Remove deprecated hadoop api calls (FS)
Remove deprecated hadoop api calls (FS) --- Key: NUTCH-580 URL: https://issues.apache.org/jira/browse/NUTCH-580 Project: Nutch Issue Type: Improvement Affects Versions: 0.9.0 Reporter: Sami Siren Assignee: Sami Siren Priority: Minor Attachments: hadoopfsdeprecated.patch There are quite a lot of calls to deprecated Hadoop API functionality. The following patch takes care of the FS-related ones. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-582) Add missing type parameters
[ https://issues.apache.org/jira/browse/NUTCH-582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-582: - Attachment: typeparams.patch Add missing type parameters --- Key: NUTCH-582 URL: https://issues.apache.org/jira/browse/NUTCH-582 Project: Nutch Issue Type: Improvement Reporter: Sami Siren Assignee: Sami Siren Priority: Minor Attachments: typeparams.patch Hadoop 0.15 added the possibility of using type parameters with several interfaces, which makes it easier to use correct types in Mappers, Reducers et al., in addition to improving readability. The following patch adds type parameters to Mappers, Reducers, OutputCollectors, MapRunnables, InputFormats and OutputFormats. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-582) Add missing type parameters
Add missing type parameters --- Key: NUTCH-582 URL: https://issues.apache.org/jira/browse/NUTCH-582 Project: Nutch Issue Type: Improvement Reporter: Sami Siren Assignee: Sami Siren Priority: Minor Hadoop 0.15 added the possibility of using type parameters with several interfaces, which makes it easier to use correct types in Mappers, Reducers et al., in addition to improving readability. The following patch adds type parameters to Mappers, Reducers, OutputCollectors, MapRunnables, InputFormats and OutputFormats. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
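The kind of type-parameter change described above can be sketched with a toy Mapper-style interface. This is a simplified stand-in, not Hadoop's actual interface; the point is that the compiler now catches key/value type mismatches that raw interfaces only surfaced as a ClassCastException at runtime:

```java
import java.util.ArrayList;
import java.util.List;

public class TypedMapperDemo {
    // A minimal stand-in for a Mapper-style interface, with type
    // parameters for input and output keys and values.
    interface Mapper<K1, V1, K2, V2> {
        void map(K1 key, V1 value, List<V2> output);
    }

    // The type arguments document and enforce what this mapper consumes
    // and produces; e.g. passing a String key here fails to compile.
    static class LineLengthMapper implements Mapper<Long, String, Long, Integer> {
        @Override
        public void map(Long key, String value, List<Integer> output) {
            output.add(value.length());
        }
    }

    public static void main(String[] args) {
        List<Integer> out = new ArrayList<>();
        new LineLengthMapper().map(0L, "hello nutch", out);
        System.out.println(out);
    }
}
```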
[jira] Commented: (NUTCH-568) Indexer does not update the Lucene TITLE field
[ https://issues.apache.org/jira/browse/NUTCH-568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12536756 ] Sami Siren commented on NUTCH-568: -- There is a BOM (Byte Order Mark) at the beginning of the file [feff] that seems to confuse Nutch. I did not track down the change that caused this. Indexer does not update the Lucene TITLE field Key: NUTCH-568 URL: https://issues.apache.org/jira/browse/NUTCH-568 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.0.0 Environment: Windows XP Reporter: smorales Attachments: RN-071018-24.html Hi, The indexer is unable to update the TITLE field of the Lucene index when processing specific HTML documents. This issue has been reproduced using Nutch-Nightly Build #241 (Oct 19, 2007 4:01:28 AM). The problem does not occur using Nutch 0.9. Workflow: 1.- Extract the package and copy across the following configuration files from Nutch 0.9: - {nutch_home_0.9}/bin/url folder, containing the urls - {nutch_home_0.9}/conf/nutch-site.xml - {nutch_home_0.9}/conf/crawl-urlfilter.txt 2.- To reproduce the issue, copy the attached HTML document to your webserver/filesystem. 3.- Run the crawl. For example: ./nutch crawl urls -dir crawl -depth 22 4.- Open the index using Luke. For this test, I used lukeall-0.7.1.jar 5.- Select the Documents tab and move through the docs until you find our HTML document. You will see that the TITLE field is empty -- INCORRECT because this HTML document contains a title. 6.- Now open the HTML document, add a space anywhere, then save it again. 7.- Repeat steps 3 and 4. You will notice that this time the TITLE field contains the correct information. Please advise. Many thanks in advance for your support. Sergio -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
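If the BOM is indeed the culprit, one defensive fix is to strip a leading U+FEFF from the decoded document before extracting the title. A hedged sketch; stripBom is a hypothetical helper, not Nutch's actual code:

```java
public class StripBom {
    // U+FEFF at the start of a decoded document is a byte order mark,
    // not content, and can confuse downstream parsing.
    static String stripBom(String text) {
        return (!text.isEmpty() && text.charAt(0) == '\uFEFF')
            ? text.substring(1)
            : text;
    }

    public static void main(String[] args) {
        String html = "\uFEFF<html><head><title>Release Notes</title></head></html>";
        System.out.println(stripBom(html).startsWith("<html>"));
    }
}
```

This also explains the "add a space and re-save" workaround in the report: re-saving the file without a BOM removes the troublesome first character.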
[jira] Commented: (NUTCH-565) Arc File to Nutch Segments Converter
[ https://issues.apache.org/jira/browse/NUTCH-565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534364 ] Sami Siren commented on NUTCH-565: -- I didn't actually test this, but it looks like a useful addition to Nutch, so +1 from me. Arc File to Nutch Segments Converter Key: NUTCH-565 URL: https://issues.apache.org/jira/browse/NUTCH-565 Project: Nutch Issue Type: Improvement Environment: all Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 Attachments: arcsegments2.patch, nutch-565-1-20071009.patch Functionality that allows arc files, such as those produced by the Internet Archive project or by the Grub distributed crawler, to be parsed into Nutch segments. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.