Re: [ANNOUNCEMENT] Apache Nutch 1.8 Release
Thanks Lewis!

Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

Good Evening,

The Apache Nutch PMC are pleased to announce the immediate release of Apache Nutch v1.8.

Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene, the project has diversified and now comprises two codebases, namely:

Nutch 1.x: A well matured, production ready crawler. 1.x enables fine grained configuration, relying on Apache Hadoop data structures, which are great for batch processing.

Nutch 2.x: An emerging alternative taking direct inspiration from 1.x, but which differs in one key area: storage is abstracted away from any specific underlying data store by using Apache Gora for handling object to persistent mappings. This means we can implement an extremely flexible model/stack for storing everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) into a number of NoSQL storage solutions.

We advise all current users and developers of the 1.X series to upgrade to this release. In addition to library upgrades to Crawler Commons 0.3 and Apache Tika 1.4, it provides over 30 bug fixes as well as 18 improvements. Please see the list of changes for a full breakdown, or see the release report.

As usual in the 1.X series, this release is made available both as source and binary. Additionally, developers can find Maven artifacts within Maven Central. The release is available here.

Thank you
Lewis
(On behalf of the Nutch PMC)

--
Lewis
[jira] [Created] (NUTCH-1738) Expose number of URLs generated per batch in GeneratorJob
Lewis John McGibbney created NUTCH-1738:

Summary: Expose number of URLs generated per batch in GeneratorJob
Key: NUTCH-1738
URL: https://issues.apache.org/jira/browse/NUTCH-1738
Project: Nutch
Issue Type: Bug
Components: generator
Affects Versions: 2.2.1
Reporter: Lewis John McGibbney
Fix For: 2.3

GeneratorJob contains one trivial line of logging
{code:title=GeneratorJob.java|borderStyle=solid}
LOG.info("GeneratorJob: generated batch id: " + batchId);
{code}
I propose to improve this logging by exposing how many URLs are contained within the generated batch. Something like
{code:title=GeneratorJob.java|borderStyle=solid}
LOG.info("GeneratorJob: generated batch id: " + batchId + " containing " + $numOfURLs + " URLs");
{code}

--
This message was sent by Atlassian JIRA (v6.2#6252)
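A minimal, self-contained sketch of the proposed message format follows. The class name BatchLogSketch and the helper batchMessage are hypothetical and are not part of GeneratorJob; how the URL count would actually be obtained (for example from a job counter) is left open here and is not something the issue specifies.

```java
public class BatchLogSketch {

    // Hypothetical helper: builds the log line proposed in the issue.
    // In GeneratorJob the result would be passed to LOG.info(...).
    static String batchMessage(String batchId, long numOfURLs) {
        return "GeneratorJob: generated batch id: " + batchId
                + " containing " + numOfURLs + " URLs";
    }

    public static void main(String[] args) {
        // The batch id and count below are made-up illustration values.
        System.out.println(batchMessage("example-batch", 250));
    }
}
```

Keeping the string construction in one place makes it easy to reuse the same wording if the idea is later ported to another generator.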
[jira] [Commented] (NUTCH-1738) Expose number of URLs generated per batch in GeneratorJob
[ https://issues.apache.org/jira/browse/NUTCH-1738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13937642#comment-13937642 ]

Lewis John McGibbney commented on NUTCH-1738:

This concept could also be ported to 1.X, as AFAIK we do not know the number of URLs generated explicitly but rely upon a restrictive value being set for the generate.max.count property in nutch-default.xml. It is of course advised to use smaller, more frequent fetchlists*; however, the logging is still valuable as it indicates how many URLs _should/could_ have been fetched per round.

*Please note I am referring to fetchlists and BatchIds as equivalent entities here.
[jira] [Assigned] (NUTCH-1738) Expose number of URLs generated per batch in GeneratorJob
[ https://issues.apache.org/jira/browse/NUTCH-1738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney reassigned NUTCH-1738:

Assignee: Lewis John McGibbney
Re: How do I customize Nutch to cater to existing SOLR schema
Hi Lajos,

Appreciate your help in providing the patch, which would definitely improve the usability of the product. For now we have resolved the unique field issue using the following changes to solrindex-mapping.xml:

<field dest="_uniqueid" source="url"/>
<copyField source="url" dest="_uniqueid"/>

For the other Nutch fields, we have added the corresponding fields to the SOLR schema, which are copied into the respective target CMS schema fields. Will wait for your patch to get a more robust and flexible solution.

Thanks

--
View this message in context: http://lucene.472066.n3.nabble.com/How-do-I-customize-Nutch-to-cater-to-existing-SOLR-schema-tp4123062p4124742.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.
[jira] [Resolved] (NUTCH-1671) indexchecker to add digest field
[ https://issues.apache.org/jira/browse/NUTCH-1671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel resolved NUTCH-1671.

Resolution: Fixed

Committed to trunk r1578616 and 2.x r1578620.

indexchecker to add digest field
Key: NUTCH-1671
URL: https://issues.apache.org/jira/browse/NUTCH-1671
Project: Nutch
Issue Type: Bug
Affects Versions: 1.7, 2.2.1
Reporter: Sebastian Nagel
Priority: Trivial
Fix For: 2.3, 1.9
Attachments: NUTCH-1671-2x.patch, NUTCH-1671-trunk.patch

IndexingFiltersChecker does not add the field digest as done by IndexerMapReduce. Digest/signature could also be used by indexing filters, which then may fail.
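To illustrate why the missing field matters, here is a small standalone sketch; it is not the committed patch. The class DigestFieldSketch, the map standing in for the indexed document, the use of MD5 as the signature, and the helper names are all assumptions made for illustration only. It shows a checker-style step that adds a digest field before a filter that depends on it is run.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

public class DigestFieldSketch {

    // Hex-encoded MD5 of the raw content; MD5 is an assumption of this sketch.
    static String md5Hex(byte[] content) {
        try {
            byte[] hash = MessageDigest.getInstance("MD5").digest(content);
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Hypothetical stand-in for the document handed to indexing filters:
    // the digest field is filled in before any filter sees the document.
    static Map<String, String> buildDoc(String url, String content) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", url);
        doc.put("digest", md5Hex(content.getBytes(StandardCharsets.UTF_8)));
        return doc;
    }

    // Hypothetical filter that relies on the digest field being present.
    static String digestPrefix(Map<String, String> doc) {
        String digest = doc.get("digest");
        if (digest == null) {
            throw new IllegalStateException("digest field missing");
        }
        return digest.substring(0, 8);
    }

    public static void main(String[] args) {
        Map<String, String> doc = buildDoc("http://example.com/", "hello");
        System.out.println(doc.get("digest"));
        System.out.println(digestPrefix(doc));
    }
}
```

If buildDoc skipped the digest line, digestPrefix would throw, which mirrors the failure mode described in the issue for filters that expect the field.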
[jira] [Commented] (NUTCH-1671) indexchecker to add digest field
[ https://issues.apache.org/jira/browse/NUTCH-1671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938653#comment-13938653 ]

Hudson commented on NUTCH-1671:

SUCCESS: Integrated in Nutch-nutchgora #957 (See [https://builds.apache.org/job/Nutch-nutchgora/957/])
NUTCH-1671 indexchecker to add digest field (snagel: http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1578620)
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
[jira] [Commented] (NUTCH-1671) indexchecker to add digest field
[ https://issues.apache.org/jira/browse/NUTCH-1671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938662#comment-13938662 ]

Hudson commented on NUTCH-1671:

SUCCESS: Integrated in Nutch-trunk #2568 (See [https://builds.apache.org/job/Nutch-trunk/2568/])
NUTCH-1671 indexchecker to add digest field (snagel: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1578616)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
[GitHub] nutch pull request: Patch for fixing coding bug
Github user ysc closed the pull request at: https://github.com/apache/nutch/pull/2