Re: [VOTE 2] Board resolution for Nutch as TLP
On 04/12/2010 02:08 PM, Andrzej Bialecki wrote: Hi, Take two, after s/crawling/search/ ... Following the discussion, below is the text of the proposed Board Resolution to vote upon. [X] +1. Request the Board make Nutch a TLP -- Sami Siren
Re: [DISCUSS] Board resolution for Nutch as TLP
Looks good to me after the proposed changes. -- Sami Siren On Sat, Apr 10, 2010 at 6:09 PM, Andrzej Bialecki a...@getopt.org wrote: On 2010-04-10 15:32, Jukka Zitting wrote: Hi, On Fri, Apr 9, 2010 at 6:52 PM, Andrzej Bialecki a...@getopt.org wrote: WHEREAS, the Board of Directors deems it to be in the best interests of the Foundation and consistent with the Foundation's purpose to establish a Project Management Committee charged with the creation and maintenance of open-source software related to a large-scale web crawling platform for distribution at no charge to the public. Would it make sense to simplify the scope to ... open-source software related to large-scale web crawling for distribution at no charge to the public? Yes, that's a good change too. -- Best regards, Andrzej Bialecki Information Retrieval, Semantic Web Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: [DISCUSS] Nutch as a top level project (TLP)?
My opinion is neutral on this matter. I don't see any technical benefit in becoming a top level project, and exposure-wise I think the impact is probably negative. So for me the reason would be strictly political. But the fact is that Nutch is pretty independent from Lucene/Solr and there is not much overlap between the dev communities. -- Sami Siren
[jira] Commented: (NUTCH-798) Upgrade to SOLR1.4
[ https://issues.apache.org/jira/browse/NUTCH-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843546#action_12843546 ] Sami Siren commented on NUTCH-798: -- +1 Upgrade to SOLR1.4 -- Key: NUTCH-798 URL: https://issues.apache.org/jira/browse/NUTCH-798 Project: Nutch Issue Type: Improvement Components: indexer Reporter: Julien Nioche Fix For: 1.1 In particular, SOLR1.4 has a StreamingUpdateSolrServer which would simplify the way we buffer the docs before sending them to the SOLR instance. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
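For readers unfamiliar with what "buffering the docs before sending them" involves, here is a minimal sketch of hand-rolled batching, the kind of code a streaming client could make unnecessary. It is not the Nutch indexer code and it does not call SolrJ; `BatchingBuffer` and its sink callback are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical illustration: collects documents and hands them to a sink
// in fixed-size batches. The sink stands in for "send to the Solr instance".
class BatchingBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink;
    private final List<T> pending = new ArrayList<>();

    BatchingBuffer(int batchSize, Consumer<List<T>> sink) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    // Adds one document and flushes as soon as a full batch is available.
    void add(T doc) {
        pending.add(doc);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    // Sends whatever is pending; must be called once more at the end of a job.
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(pending));
        pending.clear();
    }

    public static void main(String[] args) {
        BatchingBuffer<String> buffer =
            new BatchingBuffer<>(2, batch -> System.out.println("sending " + batch));
        for (String doc : new String[] {"doc1", "doc2", "doc3"}) {
            buffer.add(doc);
        }
        buffer.flush(); // do not forget the final partial batch
    }
}
```

The easy-to-miss part is the final `flush()`: forgetting it silently drops the last partial batch, which is one reason delegating the buffering to the client library is attractive.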
Re: need advice trouble shooting zero results problem
I think we should add some logging to the initialization code of the various back ends. Currently they log next to nothing, and it's hard to find out what's happening (especially when something is wrong). You can go ahead and propose a patch in jira that adds proper logging statements so that it's easier to diagnose situations like that. -- Sami Siren On Fri, Feb 19, 2010 at 5:24 AM, Jesse Hires jhi...@gmail.com wrote: I am getting zero results when I search, but have no idea where to look for clues as to why. Is there a log that shows failure to find search-servers.txt, or failures to connect to the searchers? Is there a way I can verify the searchers can find the indexes? There seem to be very few breadcrumbs to follow when the configuration is not quite correct. I have had this working at one point, but decided to start over with the latest version from the trunk. I have a feeling I missed something, but I just don't know where to look. Jesse int GetRandomNumber() { return 4; // Chosen by fair roll of dice // Guaranteed to be random } // xkcd.com
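As a rough idea of what such logging could look like, the sketch below reads a search-servers.txt style file and logs each decision it makes. It is not the Nutch code: `SearchServerConfig` is a hypothetical helper, java.util.logging is used only to keep the example self-contained, and the assumption that the file holds one server entry per line is mine.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.logging.Logger;

// Hypothetical helper: loads a list of search servers and logs what it
// finds, so a missing or empty file leaves a trace in the log.
class SearchServerConfig {
    private static final Logger LOG =
        Logger.getLogger(SearchServerConfig.class.getName());

    // Keeps every non-blank line, trimmed (assumed format: one entry per line).
    static List<String> parse(List<String> lines) {
        List<String> servers = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                servers.add(trimmed);
            }
        }
        return servers;
    }

    // Returns the configured servers, or an empty list if the file is unusable.
    static List<String> load(Path file) {
        if (!Files.isReadable(file)) {
            LOG.warning("Server list not found or unreadable: " + file.toAbsolutePath());
            return Collections.emptyList();
        }
        try {
            List<String> servers = parse(Files.readAllLines(file));
            if (servers.isEmpty()) {
                LOG.warning("No search servers configured in " + file.toAbsolutePath());
            }
            for (String server : servers) {
                LOG.info("Configured search server: " + server);
            }
            return servers;
        } catch (IOException e) {
            LOG.severe("Failed to read " + file.toAbsolutePath() + ": " + e);
            return Collections.emptyList();
        }
    }

    public static void main(String[] args) {
        load(Paths.get(args.length > 0 ? args[0] : "search-servers.txt"));
    }
}
```

With something like this in place, a zero-results situation caused by a missing file would at least produce a warning naming the absolute path that was tried.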
[jira] Created: (NUTCH-793) search.jsp compile errors
search.jsp compile errors - Key: NUTCH-793 URL: https://issues.apache.org/jira/browse/NUTCH-793 Project: Nutch Issue Type: Bug Components: web gui Reporter: Sami Siren Assignee: Sami Siren Fix For: 1.1 Related to the searcher interface changes recently committed I broke search.jsp which does not currently compile. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: exception in search.jsp
Hi Jesse, thanks for spotting this. I fixed the problem in trunk, see https://issues.apache.org/jira/browse/NUTCH-793

--
Sami Siren

Jesse Hires wrote:

I am seeing the following and am unable to find any notes anywhere on it.

org.apache.jasper.JasperException: Unable to compile class for JSP:

An error occurred at line: 207 in the jsp file: /search.jsp
query.getParams cannot be resolved or is not a field
204: // position this is good, bad?... ugly?
205: Hits hits;
206: try{
207: query.getParams.initFrom(start + hitsToRetrieve, hitsPerSite, site, sort, reverse);
208: hits = bean.search(query);
209: } catch (IOException e){
210: hits = new Hits(0,new Hit[0]);

It looks like this change came in recently to SVN

--- lucene/nutch/trunk/src/web/jsp/search.jsp 2009/10/09 17:02:32 823614
+++ lucene/nutch/trunk/src/web/jsp/search.jsp 2010/02/01 20:47:34 905410
@@ -204,8 +204,8 @@
 // position this is good, bad?... ugly?
 Hits hits;
 try{
- hits = bean.search(query, start + hitsToRetrieve, hitsPerSite, site,
-sort, reverse);
+ query.getParams.initFrom(start + hitsToRetrieve, hitsPerSite, site, sort, reverse);
+ hits = bean.search(query);
 } catch (IOException e){
 hits = new Hits(0,new Hit[0]);
 }

Has anyone else run into this, or did I miss something when updating to the latest version?

Jesse

int GetRandomNumber()
{
    return 4; // Chosen by fair roll of dice
              // Guaranteed to be random
} // xkcd.com
http://xkcd.com
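The compiler message says that `query.getParams` is being read as a field access because the call parentheses are missing. The sketch below shows the corrected call shape. `Query` and `QueryParams` here are hypothetical stand-ins modeled only far enough to compile, and the parameter names and types of `initFrom` are guesses based on the arguments in the JSP, not the actual Nutch signatures.

```java
// Hypothetical stand-ins for the classes named in the error message.
class SearchJspFix {
    // Mirrors the two lines from search.jsp, with the accessor actually called.
    static QueryParams configure(Query query, int start, int hitsToRetrieve,
                                 int hitsPerSite, String site, String sort,
                                 boolean reverse) {
        // Broken:  query.getParams.initFrom(...)   -> "is not a field"
        // Fixed:   query.getParams().initFrom(...) -> method call
        query.getParams().initFrom(start + hitsToRetrieve, hitsPerSite, site, sort, reverse);
        return query.getParams();
    }

    public static void main(String[] args) {
        QueryParams params = configure(new Query(), 0, 10, 2, "site", null, false);
        System.out.println("numHits=" + params.numHits);
    }
}

class Query {
    private final QueryParams params = new QueryParams();

    QueryParams getParams() {
        return params;
    }
}

class QueryParams {
    int numHits;
    int maxHitsPerDup;
    String dedupField;
    String sortField;
    boolean reverse;

    // Assumed parameter meaning; copies the arguments into the fields.
    void initFrom(int numHits, int maxHitsPerDup, String dedupField,
                  String sortField, boolean reverse) {
        this.numHits = numHits;
        this.maxHitsPerDup = maxHitsPerDup;
        this.dedupField = dedupField;
        this.sortField = sortField;
        this.reverse = reverse;
    }
}
```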
[jira] Resolved: (NUTCH-793) search.jsp compile errors
[ https://issues.apache.org/jira/browse/NUTCH-793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-793. -- Resolution: Fixed committed a fix search.jsp compile errors - Key: NUTCH-793 URL: https://issues.apache.org/jira/browse/NUTCH-793 Project: Nutch Issue Type: Bug Components: web gui Reporter: Sami Siren Assignee: Sami Siren Fix For: 1.1 Related to the searcher interface changes recently committed I broke search.jsp which does not currently compile. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-788) search.jsp typo causing searches to fail
[ https://issues.apache.org/jira/browse/NUTCH-788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-788. -- Resolution: Fixed Fix Version/s: 1.1 Assignee: Sami Siren Thanks Sammy for the fix, I did not realize you had spotted this too. It's now fixed in trunk. search.jsp typo causing searches to fail Key: NUTCH-788 URL: https://issues.apache.org/jira/browse/NUTCH-788 Project: Nutch Issue Type: Bug Components: web gui Affects Versions: 1.1 Environment: On trunk Reporter: Sammy Yu Assignee: Sami Siren Fix For: 1.1 Attachments: 0001-Fix-up-servlet.patch Call to initialize the servlet parameter is missing parentheses. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-789) Improvements to Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833714#action_12833714 ] Sami Siren commented on NUTCH-789: -- It would be really useful to include the improvements in the functionality since that way almost all (-flash ?) parsers would be covered. Improvements to Tika parser --- Key: NUTCH-789 URL: https://issues.apache.org/jira/browse/NUTCH-789 Project: Nutch Issue Type: Improvement Components: fetcher Environment: reported by Sami, in NUTCH-766 Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Minor Fix For: 1.1 Attachments: NutchTikaConfig.java, TikaParser.java As reported by Sami in NUTCH-766, Sami has a few improvements he made to the Tika parser. We'll track that progress here. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-790) Some external javadoc links are broken
Some external javadoc links are broken -- Key: NUTCH-790 URL: https://issues.apache.org/jira/browse/NUTCH-790 Project: Nutch Issue Type: Improvement Components: build Reporter: Sami Siren Assignee: Sami Siren Priority: Trivial Nutch javadoc links for lucene and hadoop are broken. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-790) Some external javadoc links are broken
[ https://issues.apache.org/jira/browse/NUTCH-790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-790: - Attachment: NUTCH-790.patch proposed patch, fixes links for lucene and hadoop, also updates j2se link to version 1.6 Some external javadoc links are broken -- Key: NUTCH-790 URL: https://issues.apache.org/jira/browse/NUTCH-790 Project: Nutch Issue Type: Improvement Components: build Reporter: Sami Siren Assignee: Sami Siren Priority: Trivial Attachments: NUTCH-790.patch Nutch javadoc links for lucene and hadoop are broken. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-791) External links for published javadocs are partially broken
External links for published javadocs are partially broken -- Key: NUTCH-791 URL: https://issues.apache.org/jira/browse/NUTCH-791 Project: Nutch Issue Type: Bug Components: documentation Reporter: Sami Siren Lucene and Hadoop links point to non existing urls. For some versions of apidocs the links are just broken and for some they do not exist at all. Basically what is required is that the javadocs are generated again with proper urls for external packages. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-790) Some external javadoc links are broken
[ https://issues.apache.org/jira/browse/NUTCH-790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-790. -- Resolution: Fixed Fix Version/s: 1.1 committed Some external javadoc links are broken -- Key: NUTCH-790 URL: https://issues.apache.org/jira/browse/NUTCH-790 Project: Nutch Issue Type: Improvement Components: build Reporter: Sami Siren Assignee: Sami Siren Priority: Trivial Fix For: 1.1 Attachments: NUTCH-790.patch Nutch javadoc links for lucene and hadoop are broken. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-792) Nutch version still contains 1.0
[ https://issues.apache.org/jira/browse/NUTCH-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-792: - Attachment: NUTCH-792.patch bump version to 1.1-dev Nutch version still contains 1.0 Key: NUTCH-792 URL: https://issues.apache.org/jira/browse/NUTCH-792 Project: Nutch Issue Type: Task Components: build Reporter: Sami Siren Assignee: Sami Siren Attachments: NUTCH-792.patch Should be 1.1-dev now in trunk. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-792) Nutch version still contains 1.0
Nutch version still contains 1.0 Key: NUTCH-792 URL: https://issues.apache.org/jira/browse/NUTCH-792 Project: Nutch Issue Type: Task Components: build Reporter: Sami Siren Assignee: Sami Siren Attachments: NUTCH-792.patch Should be 1.1-dev now in trunk. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-792) Nutch version still contains 1.0
[ https://issues.apache.org/jira/browse/NUTCH-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-792. -- Resolution: Fixed committed Nutch version still contains 1.0 Key: NUTCH-792 URL: https://issues.apache.org/jira/browse/NUTCH-792 Project: Nutch Issue Type: Task Components: build Reporter: Sami Siren Assignee: Sami Siren Attachments: NUTCH-792.patch Should be 1.1-dev now in trunk. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832406#action_12832406 ] Sami Siren commented on NUTCH-766: -- I suggest that we drive this a bit further. Currently this patch does not use Tika for package formats or for HTML. Julien: was there a reason not to use AutoDetectParser? The only thing I could come up with was that the mime type detection would be done twice. We could get around this by implementing something similar to what the composite parser does (it uses a parser (AutoDetectParser) class from the context to do further parsing) to cover all supported package formats. Also, was there a reason not to parse HTML with Tika? I have a patch nearby to demonstrate some of the improvements, which I will try to post shortly. Tika parser --- Key: NUTCH-766 URL: https://issues.apache.org/jira/browse/NUTCH-766 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Assignee: Chris A. Mattmann Fix For: 1.1 Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, sample.tar.gz Tika handles a lot of different formats under the bonnet and exposes them nicely via SAX events. What is described here is a tika-parser plugin which delegates the parsing to Tika but can still coexist with the existing parsing plugins, which is useful for formats partially handled by Tika (or not at all). Some of the elements below have already been discussed on the mailing lists. Note that this is work in progress, your feedback is welcome. Tika is already used by Nutch for its MimeType implementations. 
Tika comes as different jar files (core and parsers), in the work described here we decided to put the libs in 2 different places NUTCH_HOME/lib : tika-core.jar NUTCH_HOME/tika-plugin/lib : tika-parsers.jar Tika being used by the core only for its Mimetype functionalities we only need to put tika-core at the main lib level whereas the tika plugin obviously needs the tika-parsers.jar + all the jars used internally by Tika Due to limitations in the way Tika loads its classes, we had to duplicate the TikaConfig class in the tika-plugin. This might be fixed in the future in Tika itself or avoided by refactoring the mimetype part of Nutch using extension points. Unlike most other parsers, Tika handles more than one Mime-type which is why we are using * as its mimetype value in the plugin descriptor and have modified ParserFactory.java so that it considers the tika parser as potentially suitable for all mime-types. In practice this means that the associations between a mime type and a parser plugin as defined in parse-plugins.xml are useful only for the cases where we want to handle a mime type with a different parser than Tika. The general approach I chose was to convert the SAX events returned by the Tika parsers into DOM objects and reuse the utilities that come with the current HTML parser i.e. link detection, metatag handling but also means that we can use the HTMLParseFilters in exactly the same way. The main difference though is that HTMLParseFilters are not limited to HTML documents anymore as the XHTML tags returned by Tika can correspond to a different format for the original document. There is a duplication of code with the html-plugin which will be resolved by either a) getting rid of the html-plugin altogether or b) exporting its jar and make the tika parser depend on it. 
The following libraries are required in the lib/ directory of the tika-parser:
<library name="asm-3.1.jar"/>
<library name="bcmail-jdk15-144.jar"/>
<library name="commons-compress-1.0.jar"/>
<library name="commons-logging-1.1.1.jar"/>
<library name="dom4j-1.6.1.jar"/>
<library name="fontbox-0.8.0-incubator.jar"/>
<library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/>
<library name="hamcrest-core-1.1.jar"/>
<library name="jce-jdk13-144.jar"/>
<library name="jempbox-0.8.0-incubator.jar"/>
<library name="metadata-extractor-2.4.0-beta-1.jar"/>
<library name="mockito-core-1.7.jar"/>
<library name="objenesis-1.0.jar"/>
<library name="ooxml-schemas-1.0.jar"/>
<library name="pdfbox-0.8.0-incubating.jar"/>
<library name="poi-3.5-FINAL.jar"/>
<library name="poi-ooxml-3.5-FINAL.jar"/>
<library name="poi-scratchpad-3.5-FINAL.jar"/>
<library name="tagsoup-1.2.jar"/>
<library name="tika-parsers-0.5-SNAPSHOT.jar"/>
<library name="xml-apis-1.0.b2.jar"/>
<library name="xmlbeans-2.3.0.jar"/>
There is a small test suite which needs to be improved. We will need to have a look at each individual format and check that it is covered by Tika and if so to the same extent; the Wiki is probably the right place for this. The language identifier (which is a HTMLParseFilter) seemed
[jira] Updated: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-766: - Attachment: NutchTikaConfig.java Extended TikaConfig that is able to load parsers and can be used with existing Tika classes. The call to (super) cannot load parsers, so the config is then processed again locally. This is a hack and hopefully at some point we can drop the class altogether. Tika parser --- Key: NUTCH-766 URL: https://issues.apache.org/jira/browse/NUTCH-766 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Assignee: Chris A. Mattmann Fix For: 1.1 Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, sample.tar.gz Tika handles a lot of different formats under the bonnet and exposes them nicely via SAX events. What is described here is a tika-parser plugin which delegates the parsing to Tika but can still coexist with the existing parsing plugins, which is useful for formats partially handled by Tika (or not at all). Some of the elements below have already been discussed on the mailing lists. Note that this is work in progress, your feedback is welcome. Tika is already used by Nutch for its MimeType implementations. Tika comes as different jar files (core and parsers), in the work described here we decided to put the libs in 2 different places NUTCH_HOME/lib : tika-core.jar NUTCH_HOME/tika-plugin/lib : tika-parsers.jar Tika being used by the core only for its Mimetype functionalities we only need to put tika-core at the main lib level whereas the tika plugin obviously needs the tika-parsers.jar + all the jars used internally by Tika Due to limitations in the way Tika loads its classes, we had to duplicate the TikaConfig class in the tika-plugin. This might be fixed in the future in Tika itself or avoided by refactoring the mimetype part of Nutch using extension points. 
Unlike most other parsers, Tika handles more than one Mime-type which is why we are using * as its mimetype value in the plugin descriptor and have modified ParserFactory.java so that it considers the tika parser as potentially suitable for all mime-types. In practice this means that the associations between a mime type and a parser plugin as defined in parse-plugins.xml are useful only for the cases where we want to handle a mime type with a different parser than Tika. The general approach I chose was to convert the SAX events returned by the Tika parsers into DOM objects and reuse the utilities that come with the current HTML parser i.e. link detection, metatag handling but also means that we can use the HTMLParseFilters in exactly the same way. The main difference though is that HTMLParseFilters are not limited to HTML documents anymore as the XHTML tags returned by Tika can correspond to a different format for the original document. There is a duplication of code with the html-plugin which will be resolved by either a) getting rid of the html-plugin altogether or b) exporting its jar and make the tika parser depend on it. 
The following libraries are required in the lib/ directory of the tika-parser:
<library name="asm-3.1.jar"/>
<library name="bcmail-jdk15-144.jar"/>
<library name="commons-compress-1.0.jar"/>
<library name="commons-logging-1.1.1.jar"/>
<library name="dom4j-1.6.1.jar"/>
<library name="fontbox-0.8.0-incubator.jar"/>
<library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/>
<library name="hamcrest-core-1.1.jar"/>
<library name="jce-jdk13-144.jar"/>
<library name="jempbox-0.8.0-incubator.jar"/>
<library name="metadata-extractor-2.4.0-beta-1.jar"/>
<library name="mockito-core-1.7.jar"/>
<library name="objenesis-1.0.jar"/>
<library name="ooxml-schemas-1.0.jar"/>
<library name="pdfbox-0.8.0-incubating.jar"/>
<library name="poi-3.5-FINAL.jar"/>
<library name="poi-ooxml-3.5-FINAL.jar"/>
<library name="poi-scratchpad-3.5-FINAL.jar"/>
<library name="tagsoup-1.2.jar"/>
<library name="tika-parsers-0.5-SNAPSHOT.jar"/>
<library name="xml-apis-1.0.b2.jar"/>
<library name="xmlbeans-2.3.0.jar"/>
There is a small test suite which needs to be improved. We will need to have a look at each individual format and check that it is covered by Tika and if so to the same extent; the Wiki is probably the right place for this. The language identifier (which is a HTMLParseFilter) seemed to work fine. Again, your comments are welcome. Please bear in mind that this is just a first step. Julien http://www.digitalpebble.com -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
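The "SAX events into DOM objects" step from the issue description can be sketched with JDK classes alone. This is not the code from the attached patch: the JDK's own SAX parser stands in for a Tika parser as the source of events, and `SaxToDom` with its `links` helper is a hypothetical name. It only shows that any SAX event stream can be captured as a DOM tree and then handed to DOM-based utilities such as link detection.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical illustration: turns a stream of SAX events into a DOM tree.
class SaxToDom {
    // Feeds SAX events for the given XHTML into an identity TransformerHandler
    // whose result is a DOM document.
    static Document toDom(String xhtml) {
        try {
            SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            DOMResult result = new DOMResult();
            handler.setResult(result);

            // Stand-in event source; in the plugin the events would come from Tika.
            SAXParserFactory parserFactory = SAXParserFactory.newInstance();
            parserFactory.setNamespaceAware(true);
            XMLReader reader = parserFactory.newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xhtml)));
            return (Document) result.getNode();
        } catch (Exception e) {
            throw new IllegalStateException("SAX to DOM conversion failed", e);
        }
    }

    // A DOM-based utility of the kind the description wants to reuse:
    // collects the href value of every <a> element.
    static List<String> links(Document doc) {
        List<String> hrefs = new ArrayList<>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element anchor = (Element) anchors.item(i);
            if (anchor.hasAttribute("href")) {
                hrefs.add(anchor.getAttribute("href"));
            }
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String page = "<html><body><p><a href=\"http://example.org/a\">A</a></p></body></html>";
        System.out.println(links(toDom(page)));
    }
}
```

Because the DOM utilities only see elements and attributes, they do not care whether the events originally came from an HTML page or from another format that the parser rendered as XHTML, which is the point the description makes about HTMLParseFilters.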
[jira] Updated: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-766: - Attachment: TikaParser.java Modified parser that can process package formats too. To get rid of the mime type detection happening twice we have to extend AutoDetectParser so that it skips the initial detection but still does the detection for the rest of the content (in package formats). Tika parser --- Key: NUTCH-766 URL: https://issues.apache.org/jira/browse/NUTCH-766 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Assignee: Chris A. Mattmann Fix For: 1.1 Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, sample.tar.gz, TikaParser.java -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
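The NUTCH-766 description says the Tika plugin registers `*` as its mime type, so that explicit entries in parse-plugins.xml only matter when a type should go to a different parser. A minimal sketch of that lookup rule follows. It is not the modified ParserFactory.java: `ParserLookup` is a hypothetical class and the plugin ids in the usage are placeholders.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of "explicit mapping first, wildcard otherwise".
class ParserLookup {
    static final String WILDCARD = "*";

    private final Map<String, String> pluginByMimeType = new HashMap<>();

    // Associates a mime type (or the wildcard) with a parser plugin id.
    void register(String mimeType, String pluginId) {
        pluginByMimeType.put(mimeType, pluginId);
    }

    // Returns the plugin explicitly mapped to the mime type, else the plugin
    // registered under the wildcard, else null if neither exists.
    String pluginFor(String mimeType) {
        String explicit = pluginByMimeType.get(mimeType);
        if (explicit != null) {
            return explicit;
        }
        return pluginByMimeType.get(WILDCARD);
    }

    public static void main(String[] args) {
        ParserLookup lookup = new ParserLookup();
        lookup.register(WILDCARD, "tika-plugin");                      // placeholder id
        lookup.register("application/x-shockwave-flash", "flash-plugin"); // placeholder id
        System.out.println(lookup.pluginFor("application/pdf"));
        System.out.println(lookup.pluginFor("application/x-shockwave-flash"));
    }
}
```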
[jira] Commented: (NUTCH-673) Upgrade the Carrot2 plug-in to release 3.0
[ https://issues.apache.org/jira/browse/NUTCH-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830053#action_12830053 ] Sami Siren commented on NUTCH-673: -- {quote} Any plans or reasons not to upgrade to Lucene 3.0? {quote} I see no reason to stick with 2.9 {quote} I can prepare a patch replacing Lucene 2.9 with Lucene 3.0 (as a separate issue). {quote} +1 Upgrade the Carrot2 plug-in to release 3.0 -- Key: NUTCH-673 URL: https://issues.apache.org/jira/browse/NUTCH-673 Project: Nutch Issue Type: Improvement Components: web gui Affects Versions: 0.9.0 Environment: All Nutch deployments. Reporter: Sean Dean Priority: Minor Fix For: 1.1 Release 3.0 of the Carrot2 plug-in was released recently. We currently have version 2.1 in the source tree and upgrading it to the latest version before 1.0-release might make sense. Details on the release can be found here: http://project.carrot2.org/release-3.0-notes.html One major change in requirements is for JDK 1.5 to be used, but this is also now required for Hadoop 0.19 so this wouldn't be the only reason for the switch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-781) Update Tika to v0.6 for the MimeType detection
[ https://issues.apache.org/jira/browse/NUTCH-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828561#action_12828561 ] Sami Siren commented on NUTCH-781: -- {quote} The version we had was the same as the one provided by Tika 0.4, so I suppose we could safely rely on the Tika defaults. MimeUtil currently requires tika-mimetypes.xml to be available in the classpath, but we could modify that so that it uses the default version from the Tika jar if nothing can be found in conf. Let's put that in a separate JIRA issue if we really want it; in the meantime I'll commit the v 0.6 of tika-mimetypes.xml {quote} ok. thanks. Update Tika to v0.6 for the MimeType detection --- Key: NUTCH-781 URL: https://issues.apache.org/jira/browse/NUTCH-781 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1 [from announcement] Apache Tika, a subproject of Apache Lucene, is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. Apache Tika 0.6 contains a number of improvements and bug fixes. Details can be found in the changes file: http://www.apache.org/dist/lucene/tika/CHANGES-0.6.txt -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
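The fallback suggested in the quoted comment (use the copy in conf if present, otherwise the default shipped in the Tika jar) amounts to trying a list of resource names in order. Below is a minimal sketch of that idea, not the MimeUtil code. `ResourceFallback` is a hypothetical helper, and the second resource path in the usage is my assumption about where the Tika jar keeps its default file.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical helper: returns the first candidate that the lookup resolves.
class ResourceFallback {
    // Tries each name in order and returns the first non-null result.
    static <T> Optional<T> firstAvailable(Function<String, T> lookup, String... names) {
        for (String name : names) {
            T found = lookup.apply(name);
            if (found != null) {
                return Optional.of(found);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Simulated classpath with only the bundled default present.
        Map<String, String> classpath = new HashMap<>();
        classpath.put("org/apache/tika/mime/tika-mimetypes.xml", "bundled default"); // assumed path

        Optional<String> chosen = firstAvailable(classpath::get,
            "tika-mimetypes.xml",                       // site-specific copy in conf
            "org/apache/tika/mime/tika-mimetypes.xml"); // assumed location in the jar
        System.out.println(chosen.orElse("nothing found"));
    }
}
```

Against a real class loader the lookup could be `loader::getResourceAsStream`, which returns null when a resource is missing and therefore fits the same contract.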
[jira] Resolved: (NUTCH-775) Enhance Searcher interface
[ https://issues.apache.org/jira/browse/NUTCH-775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-775. -- Resolution: Fixed I committed this Enhance Searcher interface -- Key: NUTCH-775 URL: https://issues.apache.org/jira/browse/NUTCH-775 Project: Nutch Issue Type: Improvement Components: searcher Reporter: Sami Siren Assignee: Sami Siren Fix For: 1.1 Attachments: NUTCH-775.patch Current Searcher interface is too limited for many purposes: Hits search(Query query, int numHits, String dedupField, String sortField, boolean reverse) throws IOException; It would be nice that we had an interface that allowed adding different features without changing the interface. I am proposing that we deprecate the current search method and introduce something like: Hits search(Query query, Metadata context) throws IOException; Also at the same time we should enhance the QueryFilter interface to look something like: BooleanQuery filter(Query input, BooleanQuery translation, Metadata context) throws QueryException; I would like to hear your comments before proceeding with a patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
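The motivation in the issue description is that a method taking a context object can grow new options without its signature changing. The sketch below illustrates that design under simplifying assumptions: strings stand in for Query and Hits, a plain `Map<String, String>` stands in for the Metadata context, and the names `ContextSearcher`, `InMemorySearcher` and the keys `numHits` and `reverse` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo entry point for the interface and implementation below.
class SearcherDemo {
    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        docs.add("nutch crawler");
        docs.add("solr server");
        docs.add("nutch indexer");

        Map<String, String> context = new HashMap<>();
        context.put("numHits", "1");
        context.put("reverse", "true");
        System.out.println(new InMemorySearcher(docs).search("nutch", context));
    }
}

// One stable method; optional features travel in the context.
interface ContextSearcher {
    List<String> search(String query, Map<String, String> context);
}

class InMemorySearcher implements ContextSearcher {
    private final List<String> docs;

    InMemorySearcher(List<String> docs) {
        this.docs = new ArrayList<>(docs);
    }

    // Returns matching documents in sorted order, honoring the optional
    // "numHits" (default 10) and "reverse" (default false) context keys.
    @Override
    public List<String> search(String query, Map<String, String> context) {
        int numHits = Integer.parseInt(context.getOrDefault("numHits", "10"));
        boolean reverse = Boolean.parseBoolean(context.getOrDefault("reverse", "false"));

        List<String> hits = new ArrayList<>();
        for (String doc : docs) {
            if (doc.contains(query)) {
                hits.add(doc);
            }
        }
        Collections.sort(hits);
        if (reverse) {
            Collections.reverse(hits);
        }
        return new ArrayList<>(hits.subList(0, Math.min(numHits, hits.size())));
    }
}
```

Adding, say, a dedup option later would mean reading one more key inside the implementation; callers that do not set it keep working unchanged, which is the backwards-compatibility property the proposal is after.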
[jira] Commented: (NUTCH-781) Update Tika to v0.6 for the MimeType detection
[ https://issues.apache.org/jira/browse/NUTCH-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828275#action_12828275 ] Sami Siren commented on NUTCH-781: -- Did you forget to update conf/tika-mimetypes.xml? Related question: do we actually need our own version of the Tika config anymore? I saw there were some old issues that were fixed in the custom version, but I would guess those changes, if important, have already made their way into Tika? Update Tika to v0.6 for the MimeType detection --- Key: NUTCH-781 URL: https://issues.apache.org/jira/browse/NUTCH-781 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1 [from announcement] Apache Tika, a subproject of Apache Lucene, is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. Apache Tika 0.6 contains a number of improvements and bug fixes. Details can be found in the changes file: http://www.apache.org/dist/lucene/tika/CHANGES-0.6.txt -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-775) Enhance Searcher interface
[ https://issues.apache.org/jira/browse/NUTCH-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12806019#action_12806019 ] Sami Siren commented on NUTCH-775: -- If there are no objections I'll commit the proposed patch within a few days. Enhance Searcher interface -- Key: NUTCH-775 URL: https://issues.apache.org/jira/browse/NUTCH-775 Project: Nutch Issue Type: Improvement Components: searcher Reporter: Sami Siren Assignee: Sami Siren Fix For: 1.1 Attachments: NUTCH-775.patch Current Searcher interface is too limited for many purposes: Hits search(Query query, int numHits, String dedupField, String sortField, boolean reverse) throws IOException; It would be nice that we had an interface that allowed adding different features without changing the interface. I am proposing that we deprecate the current search method and introduce something like: Hits search(Query query, Metadata context) throws IOException; Also at the same time we should enhance the QueryFilter interface to look something like: BooleanQuery filter(Query input, BooleanQuery translation, Metadata context) throws QueryException; I would like to hear your comments before proceeding with a patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-775) Enhance Searcher interface
[ https://issues.apache.org/jira/browse/NUTCH-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12806051#action_12806051 ] Sami Siren commented on NUTCH-775: -- {quote}IMHO this could go as it is ... one suggestion though: this Query/QueryContext now resembles SolrQuery/SolrParams. Perhaps we could rename QueryContext to QueryParams? {quote} That sounds reasonable, I will change the name before committing. Also I forgot to change web gui to use the new api, will do that also. Enhance Searcher interface -- Key: NUTCH-775 URL: https://issues.apache.org/jira/browse/NUTCH-775 Project: Nutch Issue Type: Improvement Components: searcher Reporter: Sami Siren Assignee: Sami Siren Fix For: 1.1 Attachments: NUTCH-775.patch Current Searcher interface is too limited for many purposes: Hits search(Query query, int numHits, String dedupField, String sortField, boolean reverse) throws IOException; It would be nice that we had an interface that allowed adding different features without changing the interface. I am proposing that we deprecate the current search method and introduce something like: Hits search(Query query, Metadata context) throws IOException; Also at the same time we should enhance the QueryFilter interface to look something like: BooleanQuery filter(Query input, BooleanQuery translation, Metadata context) throws QueryException; I would like to hear your comments before proceeding with a patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12805661#action_12805661 ] Sami Siren commented on NUTCH-766: -- {quote} Sure, it's more of a configuration backwards-compat issue. For those folks who have gone to the trouble of customizing their nutch configuration (nutch-site.xml, or nutch-default.xml, or even parse-plugins), to remove out the parsing plugins (e.g., basically say they don't exist anymore and update your deployed configuration to use the tika-plugin), this patch would require a configuration update in their deployed environments. Because of that, why don't we ease them into that upgrade with at least one released version before the plugins go away. It would make it easier from a configuration backwards-compat perspective. {quote} Ok, so you mean that we need to have duplicate parser plugins because we don't want to ask people already using nutch to reconfigure the bits this involves now even though we have to do it later? How is postponing going to ease the task they need to do anyway at some point? I still don't understand the (longer term) benefit. I am not strongly against the idea of keeping duplicate plugins; I mean, it's just another ~20M in the .job. What I am worried about is that history will repeat itself and we will end up having one more case of duplicate components (in this case many of them) doing the same work and no interest in cleaning up afterwards. Doing it the way I suggested would guarantee that this will not happen. Tika parser --- Key: NUTCH-766 URL: https://issues.apache.org/jira/browse/NUTCH-766 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Assignee: Chris A. Mattmann Fix For: 1.1 Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch Tika handles a lot of different formats under the bonnet and exposes them nicely via SAX events.
What is described here is a tika-parser plugin which delegates the parsing mechanism to Tika but can still coexist with the existing parsing plugins which is useful for formats partially handled by Tika (or not at all). Some of the elements below have already been discussed on the mailing lists. Note that this is work in progress, your feedback is welcome. Tika is already used by Nutch for its MimeType implementations. Tika comes as different jar files (core and parsers), in the work described here we decided to put the libs in 2 different places: NUTCH_HOME/lib : tika-core.jar NUTCH_HOME/tika-plugin/lib : tika-parsers.jar. Tika being used by the core only for its Mimetype functionalities we only need to put tika-core at the main lib level whereas the tika plugin obviously needs the tika-parsers.jar + all the jars used internally by Tika. Due to limitations in the way Tika loads its classes, we had to duplicate the TikaConfig class in the tika-plugin. This might be fixed in the future in Tika itself or avoided by refactoring the mimetype part of Nutch using extension points. Unlike most other parsers, Tika handles more than one Mime-type which is why we are using * as its mimetype value in the plugin descriptor and have modified ParserFactory.java so that it considers the tika parser as potentially suitable for all mime-types. In practice this means that the associations between a mime type and a parser plugin as defined in parse-plugins.xml are useful only for the cases where we want to handle a mime type with a different parser than Tika. The general approach I chose was to convert the SAX events returned by the Tika parsers into DOM objects and reuse the utilities that come with the current HTML parser i.e. link detection, metatag handling but also means that we can use the HTMLParseFilters in exactly the same way.
The main difference though is that HTMLParseFilters are not limited to HTML documents anymore as the XHTML tags returned by Tika can correspond to a different format for the original document. There is a duplication of code with the html-plugin which will be resolved by either a) getting rid of the html-plugin altogether or b) exporting its jar and make the tika parser depend on it. The following libraries are required in the lib/ directory of the tika-parser : library name=asm-3.1.jar/ library name=bcmail-jdk15-144.jar/ library name=commons-compress-1.0.jar/ library name=commons-logging-1.1.1.jar/ library name=dom4j-1.6.1.jar/ library name=fontbox-0.8.0-incubator.jar/ library name=geronimo-stax-api_1.0_spec-1.0.1.jar/ library name=hamcrest-core-1.1.jar/ library name=jce-jdk13-144.jar/ library name=jempbox-0.8.0-incubator.jar/ library name=metadata-extractor-2.4.0-beta-1.jar/ library
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12804448#action_12804448 ] Sami Siren commented on NUTCH-766: -- +1, I'm going to agree on this one here Julien. Other communities have convinced me of the need for backwards compat and unobtrusiveness when bringing in new functionality or results. +1 to at least in Nutch 1.1 leaving the old plugins (perhaps mentioning they should be deprecated and replaced by the Tika functionality) and then removing them in 1.2 or 1.3. Chris, can you please explain to me how keeping two components doing identical work would be more backwards compatible than having only one? Tika parser --- Key: NUTCH-766 URL: https://issues.apache.org/jira/browse/NUTCH-766 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Assignee: Chris A. Mattmann Fix For: 1.1 Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch Tika handles a lot of different formats under the bonnet and exposes them nicely via SAX events. What is described here is a tika-parser plugin which delegates the parsing mechanism to Tika but can still coexist with the existing parsing plugins which is useful for formats partially handled by Tika (or not at all). Some of the elements below have already been discussed on the mailing lists. Note that this is work in progress, your feedback is welcome. Tika is already used by Nutch for its MimeType implementations.
Tika comes as different jar files (core and parsers), in the work described here we decided to put the libs in 2 different places NUTCH_HOME/lib : tika-core.jar NUTCH_HOME/tika-plugin/lib : tika-parsers.jar Tika being used by the core only for its Mimetype functionalities we only need to put tika-core at the main lib level whereas the tika plugin obviously needs the tika-parsers.jar + all the jars used internally by Tika Due to limitations in the way Tika loads its classes, we had to duplicate the TikaConfig class in the tika-plugin. This might be fixed in the future in Tika itself or avoided by refactoring the mimetype part of Nutch using extension points. Unlike most other parsers, Tika handles more than one Mime-type which is why we are using * as its mimetype value in the plugin descriptor and have modified ParserFactory.java so that it considers the tika parser as potentially suitable for all mime-types. In practice this means that the associations between a mime type and a parser plugin as defined in parse-plugins.xml are useful only for the cases where we want to handle a mime type with a different parser than Tika. The general approach I chose was to convert the SAX events returned by the Tika parsers into DOM objects and reuse the utilities that come with the current HTML parser i.e. link detection, metatag handling but also means that we can use the HTMLParseFilters in exactly the same way. The main difference though is that HTMLParseFilters are not limited to HTML documents anymore as the XHTML tags returned by Tika can correspond to a different format for the original document. There is a duplication of code with the html-plugin which will be resolved by either a) getting rid of the html-plugin altogether or b) exporting its jar and make the tika parser depend on it. 
The following libraries are required in the lib/ directory of the tika-parser : library name=asm-3.1.jar/ library name=bcmail-jdk15-144.jar/ library name=commons-compress-1.0.jar/ library name=commons-logging-1.1.1.jar/ library name=dom4j-1.6.1.jar/ library name=fontbox-0.8.0-incubator.jar/ library name=geronimo-stax-api_1.0_spec-1.0.1.jar/ library name=hamcrest-core-1.1.jar/ library name=jce-jdk13-144.jar/ library name=jempbox-0.8.0-incubator.jar/ library name=metadata-extractor-2.4.0-beta-1.jar/ library name=mockito-core-1.7.jar/ library name=objenesis-1.0.jar/ library name=ooxml-schemas-1.0.jar/ library name=pdfbox-0.8.0-incubating.jar/ library name=poi-3.5-FINAL.jar/ library name=poi-ooxml-3.5-FINAL.jar/ library name=poi-scratchpad-3.5-FINAL.jar/ library name=tagsoup-1.2.jar/ library name=tika-parsers-0.5-SNAPSHOT.jar/ library name=xml-apis-1.0.b2.jar/ library name=xmlbeans-2.3.0.jar/ There is a small test suite which needs to be improved. We will need to have a look at each individual format and check that it is covered by Tika and if so to the same extent; the Wiki is probably the right place for this. The language identifier (which is a HTMLParseFilter) seemed to work fine. Again, your comments are welcome. Please bear in mind that this is just a first step. Julien http
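For readers following the "convert the SAX events into DOM objects" step in the description above, here is a minimal sketch of one way to do it with the JDK's own identity TransformerHandler. It is not taken from the NUTCH-766 patch: the class name SaxToDomSketch, the helper emitSampleEvents (a stand-in for a Tika parser emitting XHTML) and the sample link are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

public class SaxToDomSketch {

    // Hypothetical stand-in for a parser that reports a document as XHTML SAX events.
    public static void emitSampleEvents(ContentHandler handler) throws SAXException {
        AttributesImpl none = new AttributesImpl();
        handler.startDocument();
        handler.startElement("", "html", "html", none);
        handler.startElement("", "body", "body", none);
        AttributesImpl href = new AttributesImpl();
        href.addAttribute("", "href", "href", "CDATA", "http://example.org/");
        handler.startElement("", "a", "a", href);
        char[] text = "example".toCharArray();
        handler.characters(text, 0, text.length);
        handler.endElement("", "a", "a");
        handler.endElement("", "body", "body");
        handler.endElement("", "html", "html");
        handler.endDocument();
    }

    // Collects the SAX events into a DOM tree via an identity transformation.
    public static Document buildDom() {
        try {
            // Assumes the default factory supports SAX, as the JDK's built-in one does.
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            DOMResult result = new DOMResult();
            handler.setResult(result);
            emitSampleEvents(handler);
            return (Document) result.getNode();
        } catch (Exception e) {
            throw new IllegalStateException("SAX to DOM conversion failed", e);
        }
    }

    // DOM-based link detection, in the spirit of reusing the HTML parser utilities.
    public static List<String> extractLinks(Document doc) {
        List<String> links = new ArrayList<>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element anchor = (Element) anchors.item(i);
            if (anchor.hasAttribute("href")) {
                links.add(anchor.getAttribute("href"));
            }
        }
        return links;
    }

    public static void main(String[] args) {
        System.out.println(extractLinks(buildDom()));
    }
}
```

Once the events are in a DOM tree, code written against DOM nodes (link detection, metatag handling, parse filters) no longer needs to care which original format produced the XHTML.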
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12803664#action_12803664 ] Sami Siren commented on NUTCH-766: -- I took a brief look at the proposed patch; some comments: The public API footprint of the new classes should be smaller, e.g. use private, package-private or protected methods/classes as much as possible. I think the end result of this plugin should be to replace all Tika-supported parsers (or the parsers we choose to replace) with the TikaParser, not to build parallel ways to parse the same formats. So I think we need to copy all of the existing test files and move/adapt the existing test cases fully before committing this. That is a good way of checking that the parse result is what is expected and of finding possible differences between the old and the Tika versions. Tika parser --- Key: NUTCH-766 URL: https://issues.apache.org/jira/browse/NUTCH-766 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Assignee: Chris A. Mattmann Fix For: 1.1 Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch Tika handles a lot of different formats under the bonnet and exposes them nicely via SAX events. What is described here is a tika-parser plugin which delegates the parsing mechanism to Tika but can still coexist with the existing parsing plugins which is useful for formats partially handled by Tika (or not at all). Some of the elements below have already been discussed on the mailing lists. Note that this is work in progress, your feedback is welcome. Tika is already used by Nutch for its MimeType implementations.
Tika comes as different jar files (core and parsers), in the work described here we decided to put the libs in 2 different places NUTCH_HOME/lib : tika-core.jar NUTCH_HOME/tika-plugin/lib : tika-parsers.jar Tika being used by the core only for its Mimetype functionalities we only need to put tika-core at the main lib level whereas the tika plugin obviously needs the tika-parsers.jar + all the jars used internally by Tika Due to limitations in the way Tika loads its classes, we had to duplicate the TikaConfig class in the tika-plugin. This might be fixed in the future in Tika itself or avoided by refactoring the mimetype part of Nutch using extension points. Unlike most other parsers, Tika handles more than one Mime-type which is why we are using * as its mimetype value in the plugin descriptor and have modified ParserFactory.java so that it considers the tika parser as potentially suitable for all mime-types. In practice this means that the associations between a mime type and a parser plugin as defined in parse-plugins.xml are useful only for the cases where we want to handle a mime type with a different parser than Tika. The general approach I chose was to convert the SAX events returned by the Tika parsers into DOM objects and reuse the utilities that come with the current HTML parser i.e. link detection, metatag handling but also means that we can use the HTMLParseFilters in exactly the same way. The main difference though is that HTMLParseFilters are not limited to HTML documents anymore as the XHTML tags returned by Tika can correspond to a different format for the original document. There is a duplication of code with the html-plugin which will be resolved by either a) getting rid of the html-plugin altogether or b) exporting its jar and make the tika parser depend on it. 
The following libraries are required in the lib/ directory of the tika-parser : library name=asm-3.1.jar/ library name=bcmail-jdk15-144.jar/ library name=commons-compress-1.0.jar/ library name=commons-logging-1.1.1.jar/ library name=dom4j-1.6.1.jar/ library name=fontbox-0.8.0-incubator.jar/ library name=geronimo-stax-api_1.0_spec-1.0.1.jar/ library name=hamcrest-core-1.1.jar/ library name=jce-jdk13-144.jar/ library name=jempbox-0.8.0-incubator.jar/ library name=metadata-extractor-2.4.0-beta-1.jar/ library name=mockito-core-1.7.jar/ library name=objenesis-1.0.jar/ library name=ooxml-schemas-1.0.jar/ library name=pdfbox-0.8.0-incubating.jar/ library name=poi-3.5-FINAL.jar/ library name=poi-ooxml-3.5-FINAL.jar/ library name=poi-scratchpad-3.5-FINAL.jar/ library name=tagsoup-1.2.jar/ library name=tika-parsers-0.5-SNAPSHOT.jar/ library name=xml-apis-1.0.b2.jar/ library name=xmlbeans-2.3.0.jar/ There is a small test suite which needs to be improved. We will need to have a look at each individual format and check that it is covered by Tika and if so to the same extent; the Wiki is probably the right place for this. The language
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12803673#action_12803673 ] Sami Siren commented on NUTCH-766: -- Sure, but it would be silly to block the whole Tika plugin because Tika does not support such or such format as well as the original Nutch plugins. As I explained above we can configure which parser to use for which mimetype and use the Tika-plugin by default. Hopefully the Tika implementation will get better and better and there will be no need for keeping the old plugins. I meant test files for the parsers we replace, not all. BTW http://wiki.apache.org/nutch/TikaPlugin lists the differences between the current version of Tika and the existing Nutch parsers. OK, I had missed that one. Tika parser --- Key: NUTCH-766 URL: https://issues.apache.org/jira/browse/NUTCH-766 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Assignee: Chris A. Mattmann Fix For: 1.1 Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch Tika handles a lot of different formats under the bonnet and exposes them nicely via SAX events. What is described here is a tika-parser plugin which delegates the parsing mechanism to Tika but can still coexist with the existing parsing plugins which is useful for formats partially handled by Tika (or not at all). Some of the elements below have already been discussed on the mailing lists. Note that this is work in progress, your feedback is welcome. Tika is already used by Nutch for its MimeType implementations.
Tika comes as different jar files (core and parsers), in the work described here we decided to put the libs in 2 different places NUTCH_HOME/lib : tika-core.jar NUTCH_HOME/tika-plugin/lib : tika-parsers.jar Tika being used by the core only for its Mimetype functionalities we only need to put tika-core at the main lib level whereas the tika plugin obviously needs the tika-parsers.jar + all the jars used internally by Tika Due to limitations in the way Tika loads its classes, we had to duplicate the TikaConfig class in the tika-plugin. This might be fixed in the future in Tika itself or avoided by refactoring the mimetype part of Nutch using extension points. Unlike most other parsers, Tika handles more than one Mime-type which is why we are using * as its mimetype value in the plugin descriptor and have modified ParserFactory.java so that it considers the tika parser as potentially suitable for all mime-types. In practice this means that the associations between a mime type and a parser plugin as defined in parse-plugins.xml are useful only for the cases where we want to handle a mime type with a different parser than Tika. The general approach I chose was to convert the SAX events returned by the Tika parsers into DOM objects and reuse the utilities that come with the current HTML parser i.e. link detection, metatag handling but also means that we can use the HTMLParseFilters in exactly the same way. The main difference though is that HTMLParseFilters are not limited to HTML documents anymore as the XHTML tags returned by Tika can correspond to a different format for the original document. There is a duplication of code with the html-plugin which will be resolved by either a) getting rid of the html-plugin altogether or b) exporting its jar and make the tika parser depend on it. 
The following libraries are required in the lib/ directory of the tika-parser : library name=asm-3.1.jar/ library name=bcmail-jdk15-144.jar/ library name=commons-compress-1.0.jar/ library name=commons-logging-1.1.1.jar/ library name=dom4j-1.6.1.jar/ library name=fontbox-0.8.0-incubator.jar/ library name=geronimo-stax-api_1.0_spec-1.0.1.jar/ library name=hamcrest-core-1.1.jar/ library name=jce-jdk13-144.jar/ library name=jempbox-0.8.0-incubator.jar/ library name=metadata-extractor-2.4.0-beta-1.jar/ library name=mockito-core-1.7.jar/ library name=objenesis-1.0.jar/ library name=ooxml-schemas-1.0.jar/ library name=pdfbox-0.8.0-incubating.jar/ library name=poi-3.5-FINAL.jar/ library name=poi-ooxml-3.5-FINAL.jar/ library name=poi-scratchpad-3.5-FINAL.jar/ library name=tagsoup-1.2.jar/ library name=tika-parsers-0.5-SNAPSHOT.jar/ library name=xml-apis-1.0.b2.jar/ library name=xmlbeans-2.3.0.jar/ There is a small test suite which needs to be improved. We will need to have a look at each individual format and check that it is covered by Tika and if so to the same extent; the Wiki is probably the right place for this. The language identifier (which is a HTMLParseFilter) seemed to work fine. Again, your
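The parser selection rule discussed in this thread (an explicit mime type association wins, otherwise the parser registered under * is used) can be sketched roughly as follows. This is only an illustration under assumptions, not the actual ParserFactory.java change: the class ParserSelectionSketch, the method selectParser and the plugin ids "tika-parser" and "legacy-pdf-parser" are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

public class ParserSelectionSketch {

    // Mime type value under which a catch-all parser is registered.
    public static final String WILDCARD = "*";

    // Returns the parser id for a mime type: an explicit association wins,
    // otherwise the wildcard parser (if any) is used. May return null.
    public static String selectParser(Map<String, String> associations, String mimeType) {
        String explicit = associations.get(mimeType);
        if (explicit != null) {
            return explicit;
        }
        return associations.get(WILDCARD);
    }

    public static void main(String[] args) {
        // Hypothetical associations, in the spirit of parse-plugins.xml.
        Map<String, String> associations = new HashMap<>();
        associations.put(WILDCARD, "tika-parser");
        associations.put("application/pdf", "legacy-pdf-parser");

        System.out.println(selectParser(associations, "application/pdf"));
        System.out.println(selectParser(associations, "application/msword"));
    }
}
```

With a rule like this, an explicit entry is only needed for the mime types that should not go to the catch-all parser, which matches the "use the Tika-plugin by default" position above.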
[jira] Updated: (NUTCH-775) Enhance Searcher interface
[ https://issues.apache.org/jira/browse/NUTCH-775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-775: - Attachment: NUTCH-775.patch I ended up changing the Query API instead, since the changes were smaller from an API perspective that way. Enhance Searcher interface -- Key: NUTCH-775 URL: https://issues.apache.org/jira/browse/NUTCH-775 Project: Nutch Issue Type: Improvement Components: searcher Reporter: Sami Siren Assignee: Sami Siren Fix For: 1.1 Attachments: NUTCH-775.patch Current Searcher interface is too limited for many purposes: Hits search(Query query, int numHits, String dedupField, String sortField, boolean reverse) throws IOException; It would be nice that we had an interface that allowed adding different features without changing the interface. I am proposing that we deprecate the current search method and introduce something like: Hits search(Query query, Metadata context) throws IOException; Also at the same time we should enhance the QueryFilter interface to look something like: BooleanQuery filter(Query input, BooleanQuery translation, Metadata context) throws QueryException; I would like to hear your comments before proceeding with a patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool
[ https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12791829#action_12791829 ] Sami Siren commented on NUTCH-666: -- We should also consider switching to Tika for language identification and route the proposed improvements in that area through Tika? Analysis plugins for multiple language and new Language Identifier Tool --- Key: NUTCH-666 URL: https://issues.apache.org/jira/browse/NUTCH-666 Project: Nutch Issue Type: Improvement Affects Versions: 1.1 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.1 Attachments: NUTCH-666-1-20081126.patch Add analysis plugins for czech, greek, japanese, chinese, korean, dutch, russian, and thai. Also includes a new Language Identifier tool that used the new indexing framework in NUTCH-646. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-775) Enhance Searcher interface
Enhance Searcher interface -- Key: NUTCH-775 URL: https://issues.apache.org/jira/browse/NUTCH-775 Project: Nutch Issue Type: Improvement Components: searcher Reporter: Sami Siren Assignee: Sami Siren Fix For: 1.1 Current Searcher interface is too limited for many purposes: Hits search(Query query, int numHits, String dedupField, String sortField, boolean reverse) throws IOException; It would be nice that we had an interface that allowed adding different features without changing the interface. I am proposing that we deprecate the current search method and introduce something like: Hits search(Query query, Metadata context) throws IOException; Also at the same time we should enhance the QueryFilter interface to look something like: BooleanQuery filter(Query input, BooleanQuery translation, Metadata context) throws QueryException; I would like to hear your comments before proceeding with a patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
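To make the proposal above more concrete, here is a small sketch of why a context-style second argument lets new options be added without touching the method signature. It does not use the real Nutch classes: Context and Hits are simplified hypothetical stand-ins for Metadata and Hits, a plain String stands in for Query, and the option names "numHits" and "reverse" are assumptions chosen to mirror the old positional parameters.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SearcherSketch {

    // Hypothetical stand-in for Metadata: a bag of named string options.
    public static class Context {
        private final Map<String, String> values = new HashMap<>();

        public Context set(String key, String value) {
            values.put(key, value);
            return this;
        }

        public String get(String key, String defaultValue) {
            return values.getOrDefault(key, defaultValue);
        }
    }

    // Hypothetical stand-in for Hits: just the matching documents.
    public static class Hits {
        public final List<String> docs;

        public Hits(List<String> docs) {
            this.docs = docs;
        }
    }

    // Shape of the proposed method: options travel in the context.
    public interface Searcher {
        Hits search(String query, Context context) throws IOException;
    }

    // Toy implementation: substring match over an in-memory list.
    public static class InMemorySearcher implements Searcher {
        private final List<String> docs;

        public InMemorySearcher(List<String> docs) {
            this.docs = docs;
        }

        @Override
        public Hits search(String query, Context context) {
            int numHits = Integer.parseInt(context.get("numHits", "10"));
            boolean reverse = Boolean.parseBoolean(context.get("reverse", "false"));
            List<String> matches = new ArrayList<>();
            for (String doc : docs) {
                if (doc.contains(query)) {
                    matches.add(doc);
                }
            }
            Collections.sort(matches);
            if (reverse) {
                Collections.reverse(matches);
            }
            int end = Math.min(numHits, matches.size());
            return new Hits(new ArrayList<>(matches.subList(0, end)));
        }
    }

    public static void main(String[] args) {
        InMemorySearcher searcher = new InMemorySearcher(
                List.of("nutch b", "nutch a", "solr", "nutch c"));
        Context context = new Context().set("numHits", "2").set("reverse", "true");
        System.out.println(searcher.search("nutch", context).docs);
    }
}
```

A new feature (say, a dedup field) would then be one more key that interested implementations read from the context, while existing callers and implementations keep compiling unchanged. The trade-off is that option names and types are no longer checked by the compiler.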
[jira] Resolved: (NUTCH-743) Site search powered by Lucene/Solr
[ https://issues.apache.org/jira/browse/NUTCH-743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-743. -- Resolution: Fixed committed Site search powered by Lucene/Solr -- Key: NUTCH-743 URL: https://issues.apache.org/jira/browse/NUTCH-743 Project: Nutch Issue Type: New Feature Components: documentation Reporter: Sami Siren Assignee: Sami Siren Priority: Minor Attachments: NUTCH-743.patch Replace current Nutch site search with Lucene/Solr powered search hosted by Lucid Imagination (http://www.lucidimagination.com/search). It allows one to search all of the Nutch (content from other parts of the Lucene ecosystem is also available) content from a single place, including web, wiki, JIRA and mail archives. Lucid has a fault tolerant setup with replication and fail over as well as monitoring services in place. A preview of the site with the new search enabled is available at http://people.apache.org/~siren/site/ -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-743) Site search powered by Lucene/Solr
Site search powered by Lucene/Solr -- Key: NUTCH-743 URL: https://issues.apache.org/jira/browse/NUTCH-743 Project: Nutch Issue Type: New Feature Components: documentation Reporter: Sami Siren Assignee: Sami Siren Priority: Minor Replace current Nutch site search with Lucene/Solr powered search hosted by Lucid Imagination (http://www.lucidimagination.com/search). It allows one to search all of the Nutch (content from other parts of the Lucene ecosystem is also available) content from a single place, including web, wiki, JIRA and mail archives. Lucid has a fault tolerant setup with replication and fail over as well as monitoring services in place. A preview of the site with the new search enabled is available at http://people.apache.org/~siren/site/ -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-743) Site search powered by Lucene/Solr
[ https://issues.apache.org/jira/browse/NUTCH-743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-743: - Attachment: NUTCH-743.patch If there are no objections I will commit this within a week or so. Site search powered by Lucene/Solr -- Key: NUTCH-743 URL: https://issues.apache.org/jira/browse/NUTCH-743 Project: Nutch Issue Type: New Feature Components: documentation Reporter: Sami Siren Assignee: Sami Siren Priority: Minor Attachments: NUTCH-743.patch Replace current Nutch site search with Lucene/Solr powered search hosted by Lucid Imagination (http://www.lucidimagination.com/search). It allows one to search all of the Nutch (content from other parts of the Lucene ecosystem is also available) content from a single place, including web, wiki, JIRA and mail archives. Lucid has a fault tolerant setup with replication and fail over as well as monitoring services in place. A preview of the site with the new search enabled is available at http://people.apache.org/~siren/site/ -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[ANNOUNCE] Apache Nutch 1.0
I am pleased to announce the availability of Apache Nutch 1.0. Apache Nutch, a subproject of Apache Lucene, is open source web-search software. It builds on Lucene Java, adding web-specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats. Apache Nutch 1.0 contains a number of bug fixes and improvements such as Solr Integration, new indexing framework and new scoring framework just to mention a few. Details can be found in the changes file: http://svn.apache.org/repos/asf/lucene/nutch/tags/release-1.0/CHANGES.txt Apache Nutch is available for download from the following download page: http://www.apache.org/dyn/closer.cgi/lucene/nutch/nutch-1.0.tar.gz When downloading from a mirror site, please remember to verify the downloads using signatures found on the Apache site: http://www.apache.org/dist/lucene/nutch/KEYS For more information on Apache Nutch, visit the project home page: http://lucene.apache.org/nutch -- Sami Siren (on behalf of the Apache Nutch community)
Re: [VOTE] Release Apache Nutch 1.0
Thanks Andrzej, This vote has passed, we now have a release with three binding +1 votes from: -Andrzej Bialecki -Dennis Kubes -Sami Siren I'll finalize the remaining tasks and do the announcement after the package has been mirrored. ps. we should perhaps create jira issues for all the findings, small and big, so we can take care of them before next release. -- Sami Siren Andrzej Bialecki wrote: Sami Siren wrote: Hello, I have packaged the third release candidate for Apache Nutch 1.0 release at http://people.apache.org/~siren/nutch-1.0/rc2/ See the CHANGES.txt[1] file for details on release contents and latest changes. The release was made from tag: http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc2/ The following issues that were discovered during the review of last rc have been fixed: https://issues.apache.org/jira/browse/NUTCH-722 https://issues.apache.org/jira/browse/NUTCH-723 https://issues.apache.org/jira/browse/NUTCH-725 https://issues.apache.org/jira/browse/NUTCH-726 https://issues.apache.org/jira/browse/NUTCH-727 Please vote on releasing this package as Apache Nutch 1.0. The vote is open for the next 72 hours. Only votes from Lucene PMC members are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. [ ] +1 Release the packages as Apache Nutch 1.0 [ ] -1 Do not release the packages because... +1. There's a minor issue when using the supplied build.xml to rebuild the sources - there are no conf/*.template files in the package, so Ant fails with an error. Creating an empty conf/dummy.template fixes this. IMHO this is a minor thing, so I vote for releasing the package as is.
[jira] Updated: (NUTCH-730) NPE in LinkRank if no nodes with which to create the WebGraph
[ https://issues.apache.org/jira/browse/NUTCH-730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-730: - Fix Version/s: (was: 1.0.0) NPE in LinkRank if no nodes with which to create the WebGraph - Key: NUTCH-730 URL: https://issues.apache.org/jira/browse/NUTCH-730 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.1 Attachments: NUTCH-730-1-20090325.patch For LinkRank, if there are no nodes to process, then a NullPointerException is thrown when trying to count number of nodes. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-722) Nutch contains jars that we cannot redistribute
[ https://issues.apache.org/jira/browse/NUTCH-722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-722. -- Resolution: Fixed removed the jars and added note about this in README.txt Nutch contains jars that we cannot redistribute --- Key: NUTCH-722 URL: https://issues.apache.org/jira/browse/NUTCH-722 Project: Nutch Issue Type: Bug Reporter: Sami Siren Priority: Blocker Fix For: 1.0.0 It seems that we have some jars (as part of pdf parser) that we cannot redistribute. Jukkas comment from email: The release contains the Java Advanced Imaging libraries (jai_core.jar and jai_codec.jar) which are licensed under Sun's Binary Code License. We can't redistribute those libraries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
NUTCH-722 is resolved
I think we are good to go for rc2, and it also seems that the smartest thing to do with the package contents at this point is to not touch them. I will roll out the new rc later today. -- Sami Siren
[VOTE] Release Apache Nutch 1.0
Hello, I have packaged the third release candidate for Apache Nutch 1.0 release at http://people.apache.org/~siren/nutch-1.0/rc2/ See the CHANGES.txt[1] file for details on release contents and latest changes. The release was made from tag: http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc2/ The following issues that were discovered during the review of last rc have been fixed: https://issues.apache.org/jira/browse/NUTCH-722 https://issues.apache.org/jira/browse/NUTCH-723 https://issues.apache.org/jira/browse/NUTCH-725 https://issues.apache.org/jira/browse/NUTCH-726 https://issues.apache.org/jira/browse/NUTCH-727 Please vote on releasing this package as Apache Nutch 1.0. The vote is open for the next 72 hours. Only votes from Lucene PMC members are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. [ ] +1 Release the packages as Apache Nutch 1.0 [ ] -1 Do not release the packages because... Here's my +1 Thanks! [1] http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc2/CHANGES.txt?revision=757511 -- Sami Siren
[jira] Commented: (NUTCH-728) Improve nutch release packaging
[ https://issues.apache.org/jira/browse/NUTCH-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683814#action_12683814 ] Sami Siren commented on NUTCH-728: -- not really, it just happens to be the mirror I use. Improve nutch release packaging --- Key: NUTCH-728 URL: https://issues.apache.org/jira/browse/NUTCH-728 Project: Nutch Issue Type: Improvement Reporter: Sami Siren Attachments: NUTCH-728.patch see the discussion from http://www.lucidimagination.com/search/document/aa4d52cbd9af026a/discuss_contents_of_nutch_release_artifact -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: [VOTE] Release Apache Nutch 1.0
Fellow PMC members, As you might know already we have posted a release candidate for Nutch 1.0 some time ago. However we have so far received only two +1 votes from Lucene PMC members and one more is required before we can actually finalize the release. The vote thread as it currently is can be seen from: http://www.lucidimagination.com/search/document/33b2a26db25db492/vote_release_apache_nutch_1_0 We (as a Nutch community) would really appreciate if somebody from the PMC had the time to check it out. Thanks for your time, Sami Siren Sami Siren wrote: We're lacking one +1, could someone please take a look? Thanks, Sami Siren Sami Siren wrote: Hello, I have packaged the second release candidate for Apache Nutch 1.0 release at http://people.apache.org/~siren/nutch-1.0/rc1/ See the CHANGES.txt[1] file for details on release contents and latest changes. The release was made from tag: http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc1/?pathrev=752004 Please vote on releasing this package as Apache Nutch 1.0. The vote is open for the next 72 hours. Only votes from Lucene PMC members are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. [ ] +1 Release the packages as Apache Nutch 1.0 [ ] -1 Do not release the packages because... Here's my +1 Thanks! [1] *http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc1/CHANGES.txt?view=logpathrev=752004 *-- Sami Siren
Re: [VOTE] Release Apache Nutch 1.0
thanks Jukka, Jukka Zitting wrote: Hi, On Thu, Mar 19, 2009 at 10:32 AM, Sami Siren ssi...@gmail.com wrote: We (as a Nutch community) would really appreciate if somebody from the PMC had the time to check it out. -1 The release contains the Java Advanced Imaging libraries (jai_core.jar and jai_codec.jar) which are licensed under Sun's Binary Code License. We can't redistribute those libraries. ok, we need to address that somehow. Other comments based on a quick look: * The LICENSE.txt file should have at least references to the licenses of the bundled libraries. * The NOTICE.txt file should start with the following lines: Apache Nutch Copyright 2009 The Apache Software Foundation * The NOTICE.txt file should contain the required copyright notices from all bundled libraries. * The README.txt should start with Apache Nutch instead of Nutch * Why does the release package contain pre-built documentation and binaries? Downloading the 90MB package takes much longer than checking out and building the 40MB tag from svn. IMHO it would be a service to users to make the release contain just the svn export with instruction on how to build the rest. I see your point about the fat artifact but I am not totally convinced that users (as in end users) would prefer the idea of fetching the development tools and compiling the software before they use it, at least I am not doing that with the software I use. I will discuss this with the rest of the devs and see what we can do here. One solution could be to split the release in two parts, binary only and source (they would both be about the same size since our build process currently copies jars around; I think that's mostly the reason for the gigantic size), as you propose below. We can also still provide pre-built binaries as separate downloads. More notably: how am I to verify that the release came from the sources in our svn when it contains stuff that doesn't exist in the svn?
It may be that I don't understand what you're trying to say here, but isn't that always the case with binary releases (the difficulty of verifying that the binary was built from a certain tag in svn)? -- Sami Siren
[jira] Created: (NUTCH-722) Nutch contains jars that we cannot redistribute
Nutch contains jars that we cannot redistribute --- Key: NUTCH-722 URL: https://issues.apache.org/jira/browse/NUTCH-722 Project: Nutch Issue Type: Bug Reporter: Sami Siren Priority: Blocker Fix For: 1.0.0 It seems that we have some jars (as part of pdf parser) that we cannot redistribute. Jukkas comment from email: The release contains the Java Advanced Imaging libraries (jai_core.jar and jai_codec.jar) which are licensed under Sun's Binary Code License. We can't redistribute those libraries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-723) LICENCE.txt is lacking info that should be there
LICENCE.txt is lacking info that should be there Key: NUTCH-723 URL: https://issues.apache.org/jira/browse/NUTCH-723 Project: Nutch Issue Type: Bug Components: build Affects Versions: 1.0.0 Reporter: Sami Siren Jukkas comment from email: * The LICENSE.txt file should have at least references to the licenses of the bundled libraries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-725) NOTICE.txt is lacking info that should be there
NOTICE.txt is lacking info that should be there --- Key: NUTCH-725 URL: https://issues.apache.org/jira/browse/NUTCH-725 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter: Sami Siren Jukkas comment from email: * The NOTICE.txt file should start with the following lines: Apache Nutch Copyright 2009 The Apache Software Foundation * The NOTICE.txt file should contain the required copyright notices from all bundled libraries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-726) README.txt is lacking info that should be there
README.txt is lacking info that should be there --- Key: NUTCH-726 URL: https://issues.apache.org/jira/browse/NUTCH-726 Project: Nutch Issue Type: Bug Components: build Affects Versions: 1.0.0 Reporter: Sami Siren from Jukkas email: * The README.txt should start with Apache Nutch instead of Nutch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-727) Add KEYS file to release artifact
Add KEYS file to release artifact - Key: NUTCH-727 URL: https://issues.apache.org/jira/browse/NUTCH-727 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter: Sami Siren comment from Grant: Where's the KEYS file for Nutch? hi, the keys file is at the top level nutch directory (eg: http://www.nic.funet.fi/pub/mirrors/apache.org/lucene/nutch/KEYS) OK, I think it should be in the tarball, too., at the top -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[DISCUSS] contents of nutch release artifact
Jukka Zitting was suggesting we should rethink the Nutch release packaging because of its size. I don't see this as a blocker for 1.0 but we could perhaps start the discussion about this anyway so throw in your opinions... the related snippet from email discussion: Sami Siren wrote: Jukka Zitting wrote: * Why does the release package contain pre-built documentation and binaries? Downloading the 90MB package takes much longer than checking out and building the 40MB tag from svn. IMHO it would be a service to users to make the release contain just the svn export with instruction on how to build the rest. I see your point about the fat artifact but I am not totally convinced that users (as in end users) would prefer the idea of fetching the development tools and compiling the software before they use it, at least I am not doing that with the software I use. I will discuss this with the rest of the devs and see what we can do here. One solution could be to split the release in two parts, binary only and source (they would both be about the same size since our build process currently copies jars around; I think that's mostly the reason for the gigantic size), as you propose below. -- Sami Siren
[jira] Resolved: (NUTCH-726) README.txt is lacking info that should be there
[ https://issues.apache.org/jira/browse/NUTCH-726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-726. -- Resolution: Fixed Fix Version/s: 1.0.0 committed README.txt is lacking info that should be there --- Key: NUTCH-726 URL: https://issues.apache.org/jira/browse/NUTCH-726 Project: Nutch Issue Type: Bug Components: build Affects Versions: 1.0.0 Reporter: Sami Siren Fix For: 1.0.0 from Jukkas email: * The README.txt should start with Apache Nutch instead of Nutch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-724) Drop the JAI libraries
[ https://issues.apache.org/jira/browse/NUTCH-724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-724. -- Resolution: Duplicate Drop the JAI libraries -- Key: NUTCH-724 URL: https://issues.apache.org/jira/browse/NUTCH-724 Project: Nutch Issue Type: Bug Reporter: Jukka Zitting Priority: Blocker Fix For: 1.0.0 The PDF parser plugin contains Java Advanced Imaging (JAI) libraries (jai_core.jar and jai_codec.jar) that are licensed under the Sun Binary Code License. The license is incompatible with Apache policies, so we need to drop those libraries. AFAIK (see PDFBOX-381) PDFBox only uses the JAI libraries for handling page rotations and tiff images, so simply dropping the JAI jars shouldn't have too much impact. A better solution would be to switch to using Apache PDFBox that has a proper workaround for this issue, but the first Apache PDFBox release has not yet been made. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-722) Nutch contains jars that we cannot redistribute
[ https://issues.apache.org/jira/browse/NUTCH-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683482#action_12683482 ] Sami Siren commented on NUTCH-722: -- +1, I am fine with this solution too Nutch contains jars that we cannot redistribute --- Key: NUTCH-722 URL: https://issues.apache.org/jira/browse/NUTCH-722 Project: Nutch Issue Type: Bug Reporter: Sami Siren Priority: Blocker Fix For: 1.0.0 It seems that we have some jars (as part of pdf parser) that we cannot redistribute. Jukkas comment from email: The release contains the Java Advanced Imaging libraries (jai_core.jar and jai_codec.jar) which are licensed under Sun's Binary Code License. We can't redistribute those libraries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: [DISCUSS] contents of nutch release artifact
Andrzej Bialecki wrote: Sami Siren wrote: Jukka Zitting was suggesting we should rethink the Nutch release packaging because of it's size. I don't see this as a blocker for 1.0 but we could perhaps start the discussion about this anyway so throw in your opinions... I agree with you and Jukka that we should provide separate tarballs of source and binaries. This likely won't result in significant size reductions (anyway, what's a measly 90MB nowadays .. ;) but it would help other parties to deploy clean binaries and/or track the officially released sources. The source package is straight forward one. Size of source package would be about 30GB. but the binary package will still remain quite big if we need to allow it to run on local and distributed mode (plugins as exploded format and also the .job + .war), size of such binary package would still be nearly 80G. We could split the binary to yet smaller pieces: one for local mode, one for distributed mode, and the .war separately but I am not sure if that's worth the effort. -- Sami Siren
Re: [DISCUSS] contents of nutch release artifact
The source package is straight forward one. Size of source package would be about 30GB. but the binary package will still remain quite big if we Now, this is big, indeed ;) heh, some serious software, need to buy more disc just to download it (yes I was thinking of M not G) :) -- Sami Siren
Re: [DISCUSS] contents of nutch release artifact
Andrzej Bialecki wrote: How about the following: we build just 2 packages: * binary: this includes only base hadoop libs in lib/ (enough to start a local job, no optional filesystems etc), the *.job and *.war files and scripts. Scripts would check for the presence of plugins/ dir, and offer an option to create it from *.job. Assumption here is that this should be enough to run full cycle in local mode, and that people who want to run a distributed cluster will first install a plain Hadoop release, and then just put the *.job and bin/nutch on the master. * source: no build artifacts, no .svn (equivalent to svn export), simple tgz. This sounds good to me. Additionally, some new documentation needs to be written too. -- Sami Siren
[jira] Resolved: (NUTCH-725) NOTICE.txt is lacking info that should be there
[ https://issues.apache.org/jira/browse/NUTCH-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-725. -- Resolution: Fixed went through the libs and added copyright notices NOTICE.txt is lacking info that should be there --- Key: NUTCH-725 URL: https://issues.apache.org/jira/browse/NUTCH-725 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter: Sami Siren Jukkas comment from email: * The NOTICE.txt file should start with the following lines: Apache Nutch Copyright 2009 The Apache Software Foundation * The NOTICE.txt file should contain the required copyright notices from all bundled libraries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-723) LICENCE.txt is lacking info that should be there
[ https://issues.apache.org/jira/browse/NUTCH-723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-723. -- Resolution: Fixed added licenses of 4rd party software LICENCE.txt is lacking info that should be there Key: NUTCH-723 URL: https://issues.apache.org/jira/browse/NUTCH-723 Project: Nutch Issue Type: Bug Components: build Affects Versions: 1.0.0 Reporter: Sami Siren Jukkas comment from email: * The LICENSE.txt file should have at least references to the licenses of the bundled libraries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Issue Comment Edited: (NUTCH-723) LICENCE.txt is lacking info that should be there
[ https://issues.apache.org/jira/browse/NUTCH-723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683618#action_12683618 ] Sami Siren edited comment on NUTCH-723 at 3/19/09 2:11 PM: --- added licenses of 3rd party software was (Author: siren): added licenses of 4rd party software LICENCE.txt is lacking info that should be there Key: NUTCH-723 URL: https://issues.apache.org/jira/browse/NUTCH-723 Project: Nutch Issue Type: Bug Components: build Affects Versions: 1.0.0 Reporter: Sami Siren Jukkas comment from email: * The LICENSE.txt file should have at least references to the licenses of the bundled libraries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-728) Improve nutch release packaging
[ https://issues.apache.org/jira/browse/NUTCH-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-728: - Attachment: NUTCH-728.patch added a simple target to generate a source release tgz from an svn tag; did not touch the binary one Improve nutch release packaging --- Key: NUTCH-728 URL: https://issues.apache.org/jira/browse/NUTCH-728 Project: Nutch Issue Type: Improvement Reporter: Sami Siren Attachments: NUTCH-728.patch see the discussion from http://www.lucidimagination.com/search/document/aa4d52cbd9af026a/discuss_contents_of_nutch_release_artifact -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-722) Nutch contains jars that we cannot redistribute
[ https://issues.apache.org/jira/browse/NUTCH-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683634#action_12683634 ] Sami Siren commented on NUTCH-722: -- if there are no objections I will commit this change tomorrow morning (EET) Nutch contains jars that we cannot redistribute --- Key: NUTCH-722 URL: https://issues.apache.org/jira/browse/NUTCH-722 Project: Nutch Issue Type: Bug Reporter: Sami Siren Priority: Blocker Fix For: 1.0.0 It seems that we have some jars (as part of pdf parser) that we cannot redistribute. Jukkas comment from email: The release contains the Java Advanced Imaging libraries (jai_core.jar and jai_codec.jar) which are licensed under Sun's Binary Code License. We can't redistribute those libraries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: [DISCUSS] contents of nutch release artifact
Sami Siren wrote: Andrzej Bialecki wrote: How about the following: we build just 2 packages: * binary: this includes only base hadoop libs in lib/ (enough to start a local job, no optional filesystems etc), the *.job and *.war files and scripts. Scripts would check for the presence of plugins/ dir, and offer an option to create it from *.job. Assumption here is that this should be enough to run full cycle in local mode, and that people who want to run a distributed cluster will first install a plain Hadoop release, and then just put the *.job and bin/nutch on the master. * source: no build artifacts, no .svn (equivalent to svn export), simple tgz. this sounds good to me. additionally some new documentation needs to be written too. I added a simple patch to NUTCH-728 to make a plain source release from svn. What do people think: should we add the plain source package to the next rc? I would not like to make changes to the binary package now, but propose that we make those changes post 1.0. -- Sami Siren
Re: [VOTE] Release Apache Nutch 1.0
Grant Ingersoll wrote: Where's the KEYS file for Nutch? hi, the keys file is at the top level nutch directory (eg: http://www.nic.funet.fi/pub/mirrors/apache.org/lucene/nutch/KEYS) I don't see it in the tarball and I don't think Sami's key is on a public server that I am aware of (at least not pgp.mit.edu). http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x0B7E6CFA -- Sami Siren On Mar 11, 2009, at 10:13 AM, Andrzej Bialecki wrote: Sami Siren wrote: Hello, I have packaged the second release candidate for Apache Nutch 1.0 release at http://people.apache.org/~siren/nutch-1.0/rc1/
Re: [VOTE] Release Apache Nutch 1.0
This vote has been cancelled due to some last minute additions. I will post another RC soon. Sami Siren wrote: -- Sami Siren Hello, I have packaged the first release candidate for Apache Nutch 1.0 release at http://people.apache.org/~siren/nutch-1.0/rc0/ See the included CHANGES.txt file for details on release contents and latest changes. The release was made from tag: http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc0/?pathrev=751480 Please vote on releasing this package as Apache Nutch 1.0. The vote is open for the next 72 hours. Only votes from Lucene PMC members are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. [ ] +1 Release the packages as Apache Nutch 1.0 [ ] -1 Do not release the packages because... Thanks! -- Sami Siren
Re: Nutch ML cleanup
Like I suspected: I have no power to do or view any admin stuff there. Btw. I am not seeing any spam, perhaps google takes care of that for me? -- Sami Siren Sami Siren wrote: I'll take a look at this, I am pretty sure we have to ask Doug at the end :) -- Sami Siren Otis Gospodnetic wrote: Hi, This has been bugging me for a while now. For some reason Nutch MLs get the most junk emails - both rude/rudeish emails, as well as clear spam (with SPAM in the subject - something must be detecting it). I just looked at the headers of the clearly labeled spam messages and found that they all seem to come from SF: To: nutch-...@lists.sourceforge.net To: nutch-gene...@lists.sourceforge.net I assume there is some kind of a mail forward from the old Nutch MLs on SF to the new Nutch MLs at ASF. Do you think we could remove this forwarding and get rid of this spam? Sami and Andrzej seem to be members who might be able to make this change: http://sourceforge.net/project/memberlist.php?group_id=59548 Otis
[jira] Resolved: (NUTCH-715) Subcollection plugin doesn't work with default subcollections.xml file
[ https://issues.apache.org/jira/browse/NUTCH-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-715. -- Resolution: Fixed committed, thanks Dmitry! Subcollection plugin doesn't work with default subcollections.xml file -- Key: NUTCH-715 URL: https://issues.apache.org/jira/browse/NUTCH-715 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.0.0 Reporter: Dmitry Lihachev Assignee: Sami Siren Fix For: 1.0.0 Attachments: NUTCH-715-testcase.patch, NUTCH-715_subcollections_fix.patch The subcollection plugin can't parse its configuration file because the file contains a top-level comment (the ASF notice) and DomUtil doesn't take top-level comments into account -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
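The kind of failure described in NUTCH-715 can be reproduced with the standard JDK DOM API alone. The sketch below is not the DomUtil code or the committed fix; it only shows, under the assumption that the root element was being looked up as the document's first child, why a leading comment gets in the way. The XML content and class name are illustrative.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class TopLevelCommentDemo {
    // Illustrative config file: a comment (like a license header) precedes the root element.
    static final String XML =
        "<?xml version=\"1.0\"?>\n"
        + "<!-- license notice -->\n"
        + "<subcollections><subcollection/></subcollections>";

    // Parses the string with the JDK's default DOM parser; checked exceptions are wrapped.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(XML);

        // The first child of the document is the comment, not the root element.
        Node first = doc.getFirstChild();
        System.out.println(first.getNodeType() == Node.COMMENT_NODE);

        // getDocumentElement() skips comments and returns the root element.
        Element root = doc.getDocumentElement();
        System.out.println(root.getTagName());
    }
}
```

Code that casts the first child to an element would fail on such a file, while `getDocumentElement()` is unaffected by the comment.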
[VOTE] Release Apache Nutch 1.0
Hello, I have packaged the second release candidate for Apache Nutch 1.0 release at http://people.apache.org/~siren/nutch-1.0/rc1/ See the CHANGES.txt[1] file for details on release contents and latest changes. The release was made from tag: http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc1/?pathrev=752004 Please vote on releasing this package as Apache Nutch 1.0. The vote is open for the next 72 hours. Only votes from Lucene PMC members are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. [ ] +1 Release the packages as Apache Nutch 1.0 [ ] -1 Do not release the packages because... Here's my +1 Thanks! [1] *http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc1/CHANGES.txt?view=logpathrev=752004 *-- Sami Siren
Re: [VOTE] Release Apache Nutch 1.0
!!!NOTE!!! There was a faulty link in the message I sent earlier, hopefully I get it right this time: Hello, I have packaged the second release candidate for Apache Nutch 1.0 release at http://people.apache.org/~siren/nutch-1.0/rc1/ See the CHANGES.txt[1] file for details on release contents and latest changes. The release was made from tag: http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc1/?pathrev=752004 Please vote on releasing this package as Apache Nutch 1.0. The vote is open for the next 72 hours. Only votes from Lucene PMC members are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. [ ] +1 Release the packages as Apache Nutch 1.0 [ ] -1 Do not release the packages because... Here's my +1 Thanks! [1] http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc1/CHANGES.txt?view=log&pathrev=752004 -- Sami Siren
[jira] Commented: (NUTCH-705) parse-rtf plugin
[ https://issues.apache.org/jira/browse/NUTCH-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12680411#action_12680411 ] Sami Siren commented on NUTCH-705: -- I think we should start looking at Apache Tika for most (or all) of our parsers. parse-rtf plugin Key: NUTCH-705 URL: https://issues.apache.org/jira/browse/NUTCH-705 Project: Nutch Issue Type: New Feature Components: fetcher Affects Versions: 1.0.0 Reporter: Dmitry Lihachev Priority: Minor Fix For: 1.1 Attachments: NUTCH-705.patch Demoting this issue and moving to 1.1 - current patch is not suitable due to LGPL licensed parts. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-717) Make Nutch Solr integration easier
Make Nutch Solr integration easier -- Key: NUTCH-717 URL: https://issues.apache.org/jira/browse/NUTCH-717 Project: Nutch Issue Type: New Feature Reporter: Sami Siren Fix For: 1.1 Erik Hatcher proposed we should provide a full solr config dir to be used with Nutch-Solr. Now we only provide the index schema. It would be considerably easier to set up nutch-solr if we provided the whole conf dir that you could use with solr like: java -Dsolr.solr.home=<Nutch's Solr Home> -jar start.jar -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Moving Nutch parsers to Tika
Andrzej Bialecki wrote: Hi all, I've been debating this for a while, too, what Sami suggested in another thread: I think we should start looking at Apache Tika for most (or all) of our parsers. This is actually a part of my broader vision for Nutch, that this project should not duplicate functionality of other well-established projects by re-implementing the same functionality, only poorly - because our focus is not on parsers, plugins, mime/charset detection, distributed RPC, but on building a robust platform for crawling. I share that same vision. We could start working on this particular issue by donating the Nutch parsers to Tika, those that are not already present there, and start using Tika's parsers in Nutch where it's already possible. Once Tika supports all types of parsers that we have, we should switch completely to Tika. I think that the only parser that is totally missing from Tika is swf (https://issues.apache.org/jira/browse/TIKA-147). Tika also supports some formats that Nutch currently does not (in addition to providing more advanced parsing on some formats). -- Sami Siren
NUTCH-684 [was: Re: [VOTE] Release Apache Nutch 1.0]
Doğacan Güney wrote: On Sun, Mar 8, 2009 at 20:25, Sami Siren ssi...@gmail.com wrote: Hello, I have packaged the first release candidate for Apache Nutch 1.0 release at http://people.apache.org/~siren/nutch-1.0/rc0/ See the included CHANGES.txt file for details on release contents and latest changes. The release was made from tag: http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc0/?pathrev=751480 Please vote on releasing this package as Apache Nutch 1.0. The vote is open for the next 72 hours. Only votes from Lucene PMC members are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. [ ] +1 Release the packages as Apache Nutch 1.0 [ ] -1 Do not release the packages because... Thanks! That's great! I would like to see NUTCH-684 in but I guess I was too late :) Anyway, my non-binding +1. uh, I missed that one, sorry. Do you think it's ready to be included? (IMO that's an important feature) It's not a big deal for me to rebuild the package with that feature included. -- Sami Siren
Re: NUTCH-684 [was: Re: [VOTE] Release Apache Nutch 1.0]
Doğacan Güney wrote: On 09.Mar.2009, at 11:05, Sami Siren ssi...@gmail.com wrote: Doğacan Güney wrote: On Sun, Mar 8, 2009 at 20:25, Sami Siren ssi...@gmail.com wrote: Hello, I have packaged the first release candidate for Apache Nutch 1.0 release at http://people.apache.org/~siren/nutch-1.0/rc0/ See the included CHANGES.txt file for details on release contents and latest changes. The release was made from tag: http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc0/?pathrev=751480 Please vote on releasing this package as Apache Nutch 1.0. The vote is open for the next 72 hours. Only votes from Lucene PMC members are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. [ ] +1 Release the packages as Apache Nutch 1.0 [ ] -1 Do not release the packages because... Thanks! That's great! I would like to see NUTCH-684 in but I guess I was too late :) Anyway, my non-binding +1. uh, I missed that one, sorry. Do you think it's ready to be included? (IMO that's an important feature) It's not a big deal for me to rebuild the package with that feature included. I only tested it on a small crawl. Still, I believe it is important too so I would like to include it. Worst case we release a 1.0.1 soon after:) I am fine either way. So if you think it's good enough to go in just commit it and I'll build another rc. If not then we can release it later too when it's ready. -- Sami Siren
[VOTE] Release Apache Nutch 1.0
Hello, I have packaged the first release candidate for Apache Nutch 1.0 release at http://people.apache.org/~siren/nutch-1.0/rc0/ See the included CHANGES.txt file for details on release contents and latest changes. The release was made from tag: http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc0/?pathrev=751480 Please vote on releasing this package as Apache Nutch 1.0. The vote is open for the next 72 hours. Only votes from Lucene PMC members are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. [ ] +1 Release the packages as Apache Nutch 1.0 [ ] -1 Do not release the packages because... Thanks! -- Sami Siren
Re: planning for nutch-1.0-rc1
I am sure all of you noticed that the release planned to be cut during this week was delayed because of a new discovery right before the deadline (NUTCH-711). That has now been fixed so it's time to move on. I am now going to build the first RC during the weekend. -- Sami Siren Sami Siren wrote: I am planning to build the first rc for nutch 1.0 at Tue 3.3.2009 morning (EET). There are still some issues marked as fix for 1.0 in Jira. Neither of the two remaining _bugs_ seems too important to me, actually I only count the issues assigned to developers as real candidates to be included in 1.0: NUTCH-578 (kubes) NUTCH-477 (ab) NUTCH-669 (siren) I am also volunteering to push all open issues to 1.1 before starting the RC build on Tuesday. Any objections on the proposed procedure or timing? -- Sami Siren
[jira] Commented: (NUTCH-711) Indexer failing after upgrade to Hadoop 0.19.1
[ https://issues.apache.org/jira/browse/NUTCH-711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678691#action_12678691 ] Sami Siren commented on NUTCH-711: -- +1 Indexer failing after upgrade to Hadoop 0.19.1 -- Key: NUTCH-711 URL: https://issues.apache.org/jira/browse/NUTCH-711 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Priority: Blocker Fix For: 1.0.0 Attachments: patch.txt After upgrade to Hadoop 0.19.1 Reducer is initialized in a different order than before (see http://svn.apache.org/viewvc?view=rev&revision=736239). IndexingFilters populate current JobConf with field options that are required for IndexerOutputFormat to function properly. However, the filters are instantiated in Reducer.configure(), which is now called after the OutputFormat is initialized, and not before as previously. The workaround for now is to instantiate IndexingFilters once again inside IndexerOutputFormat. This issue should be revisited before 1.1 in order to find a better solution. See this thread for more information: http://www.lucidimagination.com/search/document/7c62c625c7ea17fe/problem_with_crawling_using_the_latest_1_0_trunk -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
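The ordering problem described in NUTCH-711 can be modeled without Hadoop at all. The sketch below uses hypothetical stand-in classes (Conf, Filters, OutputFormat, Reducer) and an invented "field.options" key; none of them are the real Nutch or Hadoop types. It only illustrates how a component that reads configuration written by another component breaks when the initialization order flips, and how re-instantiating the writer inside the reader works around it.

```java
import java.util.HashMap;
import java.util.Map;

class InitOrderDemo {
    // Hypothetical stand-in for a shared job configuration.
    static class Conf {
        final Map<String, String> values = new HashMap<>();
    }

    // Stand-in for the filters: constructing them writes options into the configuration.
    static class Filters {
        Filters(Conf conf) {
            conf.values.put("field.options", "title:stored");
        }
    }

    // Stand-in for the output format: it reads the options once, when it is initialized.
    static class OutputFormat {
        final String fieldOptions;

        OutputFormat(Conf conf, boolean workaround) {
            if (workaround) {
                // Workaround as described in the issue: instantiate the filters here as well.
                new Filters(conf);
            }
            fieldOptions = conf.values.get("field.options");
        }
    }

    // Stand-in for the reducer: its configure step is where the filters are normally created.
    static class Reducer {
        void configure(Conf conf) {
            new Filters(conf);
        }
    }

    // Runs both initialization steps in the requested order and returns what the output format saw.
    static String run(boolean reducerFirst, boolean workaround) {
        Conf conf = new Conf();
        Reducer reducer = new Reducer();
        if (reducerFirst) {
            reducer.configure(conf);
        }
        OutputFormat out = new OutputFormat(conf, workaround);
        if (!reducerFirst) {
            reducer.configure(conf);
        }
        return out.fieldOptions;
    }

    public static void main(String[] args) {
        System.out.println(run(true, false));   // old order: options are visible
        System.out.println(run(false, false));  // new order: options are missing (null)
        System.out.println(run(false, true));   // new order with workaround: visible again
    }
}
```

The workaround costs a second instantiation of the filters, which matches the issue's remark that a better solution should be found later.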
Re: [jira] Resolved: (NUTCH-711) Indexer failing after upgrade to Hadoop 0.19.1
Alternatively you could create another issue to track the proper fix and let this close during the release process. -- Sami Siren Andrzej Bialecki (JIRA) wrote: [ https://issues.apache.org/jira/browse/NUTCH-711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-711. - Resolution: Fixed Applied the patch in rev. 750037. I'm not closing this issue, because this needs to be solved in a better way after 1.0. Indexer failing after upgrade to Hadoop 0.19.1 -- Key: NUTCH-711 URL: https://issues.apache.org/jira/browse/NUTCH-711 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Priority: Blocker Fix For: 1.0.0 Attachments: patch.txt After the upgrade to Hadoop 0.19.1, the Reducer is initialized in a different order than before (see http://svn.apache.org/viewvc?view=rev&revision=736239). IndexingFilters populate the current JobConf with field options that are required for IndexerOutputFormat to function properly. However, the filters are instantiated in Reducer.configure(), which is now called after the OutputFormat is initialized, and not before as previously. The workaround for now is to instantiate IndexingFilters once again inside IndexerOutputFormat. This issue should be revisited before 1.1 in order to find a better solution. See this thread for more information: http://www.lucidimagination.com/search/document/7c62c625c7ea17fe/problem_with_crawling_using_the_latest_1_0_trunk
[jira] Updated: (NUTCH-700) Neko1.9.11 goes into a loop
[ https://issues.apache.org/jira/browse/NUTCH-700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-700: - Fix Version/s: 1.0.0 Assignee: Sami Siren This one just bit me - the effect is that parsing hangs forever. I am promoting it to be fixed in 1.0. Neko1.9.11 goes into a loop --- Key: NUTCH-700 URL: https://issues.apache.org/jira/browse/NUTCH-700 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter: julien nioche Assignee: Sami Siren Priority: Critical Fix For: 1.0.0 Neko1.9.11 goes into a loop on some documents e.g. http://mediacet.com/Archive/FourYorkshiremen/bb/post.htm http://cizel.co.kr/main.php reverting to 0.9.4 seems to fix the problem The approach mentioned in https://issues.apache.org/jira/browse/NUTCH-696 could be a way to alleviate similar issues PS: haven't had time to report to the Neko people yet, will do at some stage
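A parser that never returns, as reported here, can in general be contained by running it under a time limit. The sketch below is a generic illustration using only java.util.concurrent. Whether it matches the approach discussed in NUTCH-696 is an assumption, and TimedParse and parseWithTimeout are hypothetical names, not Nutch classes.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimedParse {
    /** Runs the parse task and returns its result, or null if it fails or does not finish within timeoutMs. */
    static String parseWithTimeout(Callable<String> parseTask, long timeoutMs) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parser");
            t.setDaemon(true); // a stuck parser thread must not keep the JVM alive
            return t;
        });
        Future<String> future = executor.submit(parseTask);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker; a tight CPU loop may ignore the interrupt
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return null;
        } catch (ExecutionException e) {
            return null; // the parse task itself threw an exception
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Note the limitation stated in the comment: cancelling only interrupts the worker thread, so a parser spinning in a loop that never checks for interruption keeps consuming CPU even though the caller has moved on.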
Re: planning for nutch-1.0-rc1
Andrzej Bialecki wrote: Sami Siren wrote: I am planning to build the first rc for nutch 1.0 at Tue 3.3.2009 morning (EET). There are still some issues marked as fix for 1.0 in Jira. Neither of the two remaining _bugs_ seems too important to me, actually I only count the issues assigned to developers as real candidates to be included in 1.0: NUTCH-578 (kubes) NUTCH-477 (ab) NUTCH-669 (siren) There's one Critical issue reported, related to NekoHTML (NUTCH-700). I'm not sure what are the feature differences (pertinent to Nutch) between 0.9.4 and 1.9.11 - perhaps downgrading is the safest course of action. I will take care of that. I am also volunteering to push all open issues to 1.1 before starting the RC build on Tuesday. Any objections on the proposed procedure or timing? Sounds good. great! -- Sami Siren
[jira] Resolved: (NUTCH-700) Neko1.9.11 goes into a loop
[ https://issues.apache.org/jira/browse/NUTCH-700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-700. -- Resolution: Fixed reverted to 0.9.4 Neko1.9.11 goes into a loop --- Key: NUTCH-700 URL: https://issues.apache.org/jira/browse/NUTCH-700 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter: julien nioche Assignee: Sami Siren Priority: Critical Fix For: 1.0.0 Neko1.9.11 goes into a loop on some documents e.g. http://mediacet.com/Archive/FourYorkshiremen/bb/post.htm http://cizel.co.kr/main.php reverting to 0.9.4 seems to fix the problem The approach mentioned in https://issues.apache.org/jira/browse/NUTCH-696 could be a way to alleviate similar issues PS: haven't had time to report to the Neko people yet, will do at some stage
[jira] Resolved: (NUTCH-669) Consolidate code for Fetcher and Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-669. -- Resolution: Fixed replaced fetcher with fetcher2 Consolidate code for Fetcher and Fetcher2 - Key: NUTCH-669 URL: https://issues.apache.org/jira/browse/NUTCH-669 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 0.9.0 Reporter: Todd Lipcon Assignee: Sami Siren Fix For: 1.0.0 I'd like to consolidate a lot of the common code between Fetcher and Fetcher2.java. It seems to me like there are the following differences: - Fetcher relies on the Protocol to obey robots.txt and crawl delay settings whereas Fetcher2 implements them itself - Fetcher2 uses a different queueing model (queue per crawl host) to accomplish the per-host limiting without making the Protocol do it. I've begun work on this but want to check with people on the following: - What reason is there for Fetcher existing at all since Fetcher2 seems to be a superset of functionality? - Is it on the road map to remove the robots/delay logic from the Http protocol and make Fetcher2's delegation of duties the standard? - Any other improvements wanted for Fetcher while I am in and around the code?
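The "queue per crawl host" model mentioned in the issue description can be sketched as follows. This is a simplified illustration, not Fetcher2's actual code: HostQueues, its methods and the fixed per-host delay are assumptions made for the example. Each host gets its own queue plus a timestamp before which it must not be contacted again, so politeness is enforced by the scheduler instead of by the protocol plugin.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

class HostQueues {
    private final long crawlDelayMs;
    // One FIFO queue per host, kept in insertion order of the hosts.
    private final Map<String, Deque<String>> queues = new LinkedHashMap<>();
    // Earliest time (in ms) at which each host may be fetched again.
    private final Map<String, Long> nextAllowed = new HashMap<>();

    HostQueues(long crawlDelayMs) {
        this.crawlDelayMs = crawlDelayMs;
    }

    /** Adds a URL to the queue of its host. */
    void add(String url) {
        String host = URI.create(url).getHost();
        queues.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
    }

    /** Returns a URL whose host may be fetched at nowMs, or null if every non-empty queue is still waiting. */
    String poll(long nowMs) {
        for (Map.Entry<String, Deque<String>> entry : queues.entrySet()) {
            Deque<String> queue = entry.getValue();
            if (queue.isEmpty()) {
                continue;
            }
            if (nextAllowed.getOrDefault(entry.getKey(), 0L) <= nowMs) {
                nextAllowed.put(entry.getKey(), nowMs + crawlDelayMs);
                return queue.poll();
            }
        }
        return null;
    }
}
```

With a delay of 1000 ms and two URLs queued for one host and one for another, two consecutive polls at time 0 return one URL from each host, and the second URL of the first host only becomes available at time 1000.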
Re: [jira] Resolved: (NUTCH-669) Consolidate code for Fetcher and Fetcher2
Andrzej Bialecki wrote: Sami Siren (JIRA) wrote: [ https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-669. -- Resolution: Fixed replaced fetcher with fetcher2 I'm puzzled .. it seemed the goal was to integrate Todd's patch, which effectively replaces both Fetchers. Does this mean that Todd's version was not ready, or is the current code based on Todd's version? There was no patch from Todd that I could see; he never provided one even after being asked multiple times, first by you in Dec 2008, then by Dogacan in Jan 2009, and finally by me last week. My motivation to get this fixed was, as I understood most of the developers thought too, to get rid of the burden of supporting two classes providing roughly the same piece of functionality. I opened a jira for this but closed it soon after, as you told me it was a duplicate of this one. So, what I did was: replaced the original Fetcher with Fetcher2. The Fetcher is still there to be improved by Todd and others at will. -- Sami Siren
Re: Release 1.0?
dealmaker wrote: Hi, Is there going to be a delay of the 1.0 release? Today is almost Feb 28. You said that 1.0 will come in Feb. I am customizing Nutch 0.9, and I am wondering if I should wait a couple more days for the 1.0 release. I think that no one else but me made any guesses about the release date? (since it is virtually impossible due to the fact that this is not a paid project). The general consensus seems to be that we should get the next release out sooner rather than later. I personally still think that the first release candidate is not that far away - we have no blocker issues left and it seems (judging by the lack of activity on the remaining issues) that the ones still there are not too important. I am going to commit NUTCH-669 soon and after that I am fine with starting the release process. Other devs might have different opinions. -- Sami Siren Thanks. Andrzej Bialecki wrote: Marko Bauhardt wrote: Hi, is there anybody out there? ;) Is there a plan for when version 1.0 will be released? thanks marko On Jan 28, 2009, at 9:45 AM, Marko Bauhardt wrote: Hi all, is there a timeline for the release 1.0? Currently there are 33 issues (9 Bugs). Is there a plan for a feature freeze? Maybe some big issues can be moved to version 1.1? We do exist. ;) We plan to release in February - I can't tell you yet when exactly, we need to review the (few) remaining issues that we want to resolve before the release. -- Best regards, Andrzej Bialecki, Information Retrieval, Semantic Web, Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Release 1.0?
Sami Siren wrote: I think that no one else but me made any guesses about the release date? (since it is virtually impossible due to fact that this is not a paid project). Andrzej Bialecki wrote: We do exist. ;) We plan to release in February - I can't tell you yet when exactly, we need to review the (few) remaining issues that we want to resolve before the release. oh I now see Andrzej made some also :) . -- Sami Siren
planning for nutch-1.0-rc1
I am planning to build the first rc for nutch 1.0 at Tue 3.3.2009 morning (EET). There are still some issues marked as fix for 1.0 in Jira. Neither of the two remaining _bugs_ seems too important to me, actually I only count the issues assigned to developers as real candidates to be included in 1.0: NUTCH-578 (kubes) NUTCH-477 (ab) NUTCH-669 (siren) I am also volunteering to push all open issues to 1.1 before starting the RC build on Tuesday. Any objections on the proposed procedure or timing? -- Sami Siren
[jira] Commented: (NUTCH-705) parse-rtf plugin
[ https://issues.apache.org/jira/browse/NUTCH-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12677508#action_12677508 ] Sami Siren commented on NUTCH-705: -- I think that the patch contains some LGPL code that we cannot commit into the Apache repository. parse-rtf plugin Key: NUTCH-705 URL: https://issues.apache.org/jira/browse/NUTCH-705 Project: Nutch Issue Type: New Feature Components: fetcher Affects Versions: 1.0.0 Reporter: Dmitry Lihachev Fix For: 1.0.0 Attachments: NUTCH-705.patch
Re: Url regex normalizer
Meghna Kukreja wrote: Thanks Andrzej. Here is the issue that I created in JIRA: https://issues.apache.org/jira/browse/NUTCH-706. I have suggested an alternative regular expression but would appreciate it if someone could verify this as I am not very great with those :) Perhaps you could write some JUnit test to verify it behaves as expected? -- Sami Siren
[jira] Resolved: (NUTCH-699) Add an official solr schema for solr integration
[ https://issues.apache.org/jira/browse/NUTCH-699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-699. -- Resolution: Fixed committed Add an official solr schema for solr integration -- Key: NUTCH-699 URL: https://issues.apache.org/jira/browse/NUTCH-699 Project: Nutch Issue Type: New Feature Components: indexer Reporter: Doğacan Güney Assignee: Doğacan Güney Fix For: 1.0.0 See Andrzej's comments on NUTCH-684 for more info.
[jira] Assigned: (NUTCH-669) Consolidate code for Fetcher and Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren reassigned NUTCH-669: Assignee: Sami Siren Consolidate code for Fetcher and Fetcher2 - Key: NUTCH-669 URL: https://issues.apache.org/jira/browse/NUTCH-669 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 0.9.0 Reporter: Todd Lipcon Assignee: Sami Siren Fix For: 1.0.0 I'd like to consolidate a lot of the common code between Fetcher and Fetcher2.java. It seems to me like there are the following differences: - Fetcher relies on the Protocol to obey robots.txt and crawl delay settings whereas Fetcher2 implements them itself - Fetcher2 uses a different queueing model (queue per crawl host) to accomplish the per-host limiting without making the Protocol do it. I've begun work on this but want to check with people on the following: - What reason is there for Fetcher existing at all since Fetcher2 seems to be a superset of functionality? - Is it on the road map to remove the robots/delay logic from the Http protocol and make Fetcher2's delegation of duties the standard? - Any other improvements wanted for Fetcher while I am in and around the code?
[jira] Commented: (NUTCH-703) Upgrade to Hadoop 0.19.1
[ https://issues.apache.org/jira/browse/NUTCH-703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12677266#action_12677266 ] Sami Siren commented on NUTCH-703: -- Andrzej, are you working with this now? Upgrade to Hadoop 0.19.1 Key: NUTCH-703 URL: https://issues.apache.org/jira/browse/NUTCH-703 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Priority: Blocker Fix For: 1.0.0 From release notes: Release 0.19.1 fixes many critical bugs in 0.19.0, including ***some data loss issues***..
[jira] Resolved: (NUTCH-247) robot parser to restrict.
[ https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-247. -- Resolution: Fixed Assignee: Sami Siren (was: Dennis Kubes) committed this - added checking to F2 (which is soon to be Fetcher) robot parser to restrict. - Key: NUTCH-247 URL: https://issues.apache.org/jira/browse/NUTCH-247 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 0.8 Reporter: Stefan Groschupf Assignee: Sami Siren Priority: Minor Fix For: 1.0.0 Attachments: agent-names.patch, agent-names3.patch.txt If the agent name and the robots agents are not properly configured, the robot rules parser uses LOG.severe to log the problem but also resolves it. Later on, the fetcher thread checks for severe errors and stops if there is one. RobotRulesParser: if (agents.size() == 0) { agents.add(agentName); LOG.severe("No agents listed in 'http.robots.agents' property!"); } else if (!((String)agents.get(0)).equalsIgnoreCase(agentName)) { agents.add(0, agentName); LOG.severe("Agent we advertise (" + agentName + ") not listed first in 'http.robots.agents' property!"); } Fetcher.FetcherThread: if (LogFormatter.hasLoggedSevere()) // something bad happened break; I suggest using warn or something similar instead of severe to log this problem.
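The suggestion in the issue description, logging the misconfiguration as a warning because the parser already corrects it, could look roughly like the sketch below. It uses java.util.logging and mirrors the logic quoted in the report. AgentNames and normalize are hypothetical names for illustration, and this is not the committed patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

class AgentNames {
    private static final Logger LOG = Logger.getLogger(AgentNames.class.getName());

    /**
     * Returns a copy of the configured agent list with the advertised agent name first.
     * Misconfiguration is repaired and logged as a warning, so a caller that aborts on
     * severe log entries is not stopped by a problem that has already been fixed.
     */
    static List<String> normalize(String agentName, List<String> configured) {
        List<String> agents = new ArrayList<>(configured);
        if (agents.isEmpty()) {
            agents.add(agentName);
            LOG.warning("No agents listed in 'http.robots.agents' property!");
        } else if (!agents.get(0).equalsIgnoreCase(agentName)) {
            agents.add(0, agentName);
            LOG.warning("Agent we advertise (" + agentName
                + ") not listed first in 'http.robots.agents' property!");
        }
        return agents;
    }
}
```

As in the quoted code, the advertised name is simply put in front. An existing later occurrence of the same name is left in place.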
[jira] Created: (NUTCH-701) replace Fetcher with Fetcher2
replace Fetcher with Fetcher2 - Key: NUTCH-701 URL: https://issues.apache.org/jira/browse/NUTCH-701 Project: Nutch Issue Type: Bug Components: fetcher Reporter: Sami Siren Assignee: Sami Siren Fix For: 1.0.0 Currently there are two fetcher implementations within Nutch, one too many. This task tracks the process of promoting Fetcher2. My plan is basically to: - remove Fetcher altogether and rename Fetcher2 to Fetcher - fix the Crawl class so it works with the F2 API. If there are no objections I will proceed with this soon.
[jira] Updated: (NUTCH-701) Replace Fetcher with Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-701: - Summary: Replace Fetcher with Fetcher2 (was: replace Fetcher with Fetcher2) Replace Fetcher with Fetcher2 - Key: NUTCH-701 URL: https://issues.apache.org/jira/browse/NUTCH-701 Project: Nutch Issue Type: Bug Components: fetcher Reporter: Sami Siren Assignee: Sami Siren Fix For: 1.0.0 Currently there are two fetcher implementations within Nutch, one too many. This task tracks the process of promoting Fetcher2. My plan is basically to: - remove Fetcher altogether and rename Fetcher2 to Fetcher - fix the Crawl class so it works with the F2 API. If there are no objections I will proceed with this soon.
[jira] Resolved: (NUTCH-698) CrawlDb is corrupted after a few crawl cycles
[ https://issues.apache.org/jira/browse/NUTCH-698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-698. -- Resolution: Fixed committed. thanks guys CrawlDb is corrupted after a few crawl cycles - Key: NUTCH-698 URL: https://issues.apache.org/jira/browse/NUTCH-698 Project: Nutch Issue Type: Bug Reporter: Doğacan Güney Assignee: Doğacan Güney Priority: Blocker Fix For: 1.0.0 Attachments: NUTCH-698_v1.patch After change to hadoop's MapWritable, crawldb becomes corrupted after some fetch cycles. For more details see this discussion thread: http://www.nabble.com/Fetcher2-crashes-with-current-trunk-td21978049.html
[jira] Commented: (NUTCH-699) Add an official solr schema for solr integration
[ https://issues.apache.org/jira/browse/NUTCH-699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12676233#action_12676233 ] Sami Siren commented on NUTCH-699: -- We could put it under conf/ ? Add an official solr schema for solr integration -- Key: NUTCH-699 URL: https://issues.apache.org/jira/browse/NUTCH-699 Project: Nutch Issue Type: New Feature Components: indexer Reporter: Doğacan Güney Assignee: Doğacan Güney Fix For: 1.0.0 See Andrzej's comments on NUTCH-684 for more info.
[jira] Resolved: (NUTCH-701) Replace Fetcher with Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-701. -- Resolution: Duplicate Replace Fetcher with Fetcher2 - Key: NUTCH-701 URL: https://issues.apache.org/jira/browse/NUTCH-701 Project: Nutch Issue Type: Bug Components: fetcher Reporter: Sami Siren Assignee: Sami Siren Fix For: 1.0.0 Currently there are two fetcher implementations within Nutch, one too many. This task tracks the process of promoting Fetcher2. My plan is basically to: - remove Fetcher altogether and rename Fetcher2 to Fetcher - fix the Crawl class so it works with the F2 API. If there are no objections I will proceed with this soon.
[jira] Updated: (NUTCH-669) Consolidate code for Fetcher and Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-669: - Fix Version/s: (was: 1.1) 1.0.0 Moving this back to 1.0 Are you close with your patch? As discussed in this thread we should just replace Fetcher With Fetcher2, change Crawl class and check that the tests pass. other issues we can deal within their own tickets. I can also help with this if you don't have the time. Consolidate code for Fetcher and Fetcher2 - Key: NUTCH-669 URL: https://issues.apache.org/jira/browse/NUTCH-669 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 0.9.0 Reporter: Todd Lipcon Fix For: 1.0.0 I'd like to consolidate a lot of the common code between Fetcher and Fetcher2.java. It seems to me like there are the following differences: - Fetcher relies on the Protocol to obey robots.txt and crawl delay settings whereas Fetcher2 implements them itself - Fetcher2 uses a different queueing model (queue per crawl host) to accomplish the per-host limiting without making the Protocol do it. I've begun work on this but want to check with people on the following: - What reason is there for Fetcher existing at all since Fetcher2 seems to be a superset of functionality? - Is it on the road map to remove the robots/delay logic from the Http protocol and make Fetcher2's delegation of duties the standard? - Any other improvements wanted for Fetcher while I am in and around the code?
[jira] Resolved: (NUTCH-694) Distributed Search Server fails
[ https://issues.apache.org/jira/browse/NUTCH-694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-694. -- Resolution: Fixed Committed. Thanks for testing it. Distributed Search Server fails --- Key: NUTCH-694 URL: https://issues.apache.org/jira/browse/NUTCH-694 Project: Nutch Issue Type: Bug Components: searcher Affects Versions: 1.0.0 Environment: Single Server with one Nutch instance in DistributedSearchServerMode, not in PseudoDistributedMode Reporter: Dr. Nadine Hochstotter Assignee: Sami Siren Priority: Blocker Fix For: 1.0.0 Attachments: NUTCH-694-2.patch, NUTCH-694.patch I run Nutch on a single server, I have two crawl directories, that's why I use Nutch in distributed search server mode as described in the hadoop manual. But since I have a new Trunk Version (04.02.2009) it fails. Local search on one index works fine. But distributed search throws the following exception: In catalina.out (server) 2009-02-18 17:08:14,906 ERROR NutchBean - org.apache.hadoop.ipc.RemoteException: java.io.IOException: Unknown Protocol classname:org.apache.nutch.searcher.RPCSegmentBean at org.apache.nutch.searcher.NutchBean.getProtocolVersion(NutchBean.java:403) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892) at org.apache.hadoop.ipc.Client.call(Client.java:696) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216) at $Proxy4.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:306) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:343) at 
org.apache.nutch.searcher.DistributedSegmentBean.init(DistributedSegmentBean.java:103) at org.apache.nutch.searcher.NutchBean.init(NutchBean.java:111) at org.apache.nutch.searcher.NutchBean.init(NutchBean.java:80) at org.apache.nutch.searcher.NutchBean$NutchBeanConstructor.contextInitialized(NutchBean.java:422) at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:3843) at org.apache.catalina.core.StandardContext.start(StandardContext.java:4350) at org.apache.catalina.core.StandardContext.reload(StandardContext.java:3099) at org.apache.catalina.manager.ManagerServlet.reload(ManagerServlet.java:913) at org.apache.catalina.manager.HTMLManagerServlet.reload(HTMLManagerServlet.java:536) at org.apache.catalina.manager.HTMLManagerServlet.doGet(HTMLManagerServlet.java:114) at javax.servlet.http.HttpServlet.service(HttpServlet.java:690) at javax.servlet.http.HttpServlet.service(HttpServlet.java:803) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175) at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:525) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.valves.RequestFilterValve.process(RequestFilterValve.java:269) at org.apache.catalina.valves.RemoteAddrValve.invoke(RemoteAddrValve.java:81) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844) at 
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583) at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447) at java.lang.Thread.run(Thread.java:619) And in Hadoop.log: 2009-02-18 17:07:52,847 INFO ipc.Server - IPC Server handler 48 on 13001: starting 2009-02-18 17:07:52,847 INFO ipc.Server - IPC Server handler 49 on 13001: starting 2009-02-18 17:07:52,847 INFO ipc.Server - IPC Server handler 40 on 13001: starting 2009-02-18 17
[jira] Commented: (NUTCH-477) Extend URLFilters to support different filtering chains
[ https://issues.apache.org/jira/browse/NUTCH-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675793#action_12675793 ] Sami Siren commented on NUTCH-477: -- It's your call. IMO the whole URLFilters - URLFilter, URLNormalizers - URLNormalizer arrangement is a bit too complex as it is now; we can make it cleaner, but it's probably not worth the trouble before 1.0. Extend URLFilters to support different filtering chains --- Key: NUTCH-477 URL: https://issues.apache.org/jira/browse/NUTCH-477 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 1.0.0 Attachments: urlfilters.patch I propose to make the following changes to URLFilters: * extend URLFilters so that they support different filtering rules depending on the context where they are executed. This functionality mirrors the one that URLNormalizers already support. * change their return value to an int code, in order to support early termination of long filtering chains.
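The two changes proposed in NUTCH-477, per-context filter chains and an int return code that allows a chain to stop early, can be sketched as below. This is an illustration only: ScopedUrlFilter, FilterChains, the scope names and the three codes are assumptions made for the example and are not taken from urlfilters.patch or the Nutch API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical filter contract returning an int code instead of the filtered URL. */
interface ScopedUrlFilter {
    int ACCEPT = 0;       // no objection, continue with the next filter
    int REJECT = 1;       // drop the URL and stop the chain
    int ACCEPT_FINAL = 2; // keep the URL and stop the chain

    int filter(String url);
}

class FilterChains {
    // One ordered chain of filters per scope (context), e.g. "inject" or "fetcher".
    private final Map<String, List<ScopedUrlFilter>> chains = new HashMap<>();

    void register(String scope, ScopedUrlFilter filter) {
        chains.computeIfAbsent(scope, s -> new ArrayList<>()).add(filter);
    }

    /** Runs the chain registered for the scope; a scope without filters accepts everything. */
    boolean accepts(String scope, String url) {
        for (ScopedUrlFilter filter : chains.getOrDefault(scope, Collections.emptyList())) {
            int code = filter.filter(url);
            if (code == ScopedUrlFilter.REJECT) {
                return false; // early termination: later filters are never called
            }
            if (code == ScopedUrlFilter.ACCEPT_FINAL) {
                return true;  // early termination with a positive result
            }
        }
        return true;
    }
}
```

Compared with a contract that returns the URL or null, the int code lets a cheap filter at the front of a long chain decide the outcome without the remaining filters being run.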