[jira] [Commented] (NUTCH-1502) Test for CrawlDatum state transitions
[ https://issues.apache.org/jira/browse/NUTCH-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053592#comment-14053592 ] Julien Nioche commented on NUTCH-1502: -- great stuff! I'd rename the *TODO class into org.apache.nutch.crawl.TODOTestCrawlDbStates: this way it will not get included in the test suite. It currently fails as the corresponding issues are not fixed. Test for CrawlDatum state transitions - Key: NUTCH-1502 URL: https://issues.apache.org/jira/browse/NUTCH-1502 Project: Nutch Issue Type: Improvement Components: crawldb Affects Versions: 1.7, 2.2 Reporter: Sebastian Nagel Fix For: 2.4, 1.9 Attachments: NUTCH-1502-trunk-v1.patch An exhaustive test to check the matrix of CrawlDatum state transitions (CrawlStatus in 2.x) would be useful to detect errors, especially for continuous crawls where the number of possible transitions is quite large. Additional factors with impact on state transitions (retry counters, static and dynamic intervals) are also tested. The tests will help to address NUTCH-578 and NUTCH-1245. See the latter for a first sketchy patch. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (NUTCH-578) URL fetched with 403 is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-578. - Resolution: Fixed Committed revision 1608431. Let's track the overflowing issue in https://issues.apache.org/jira/browse/NUTCH-1247 Thanks! URL fetched with 403 is generated over and over again - Key: NUTCH-578 URL: https://issues.apache.org/jira/browse/NUTCH-578 Project: Nutch Issue Type: Bug Components: generator Affects Versions: 1.0.0 Environment: Ubuntu Gutsy Gibbon (7.10) running on VMware server. I have checked out the most recent version of the trunk as of Nov 20, 2007 Reporter: Nathaniel Powell Assignee: Markus Jelsma Priority: Critical Fix For: 1.9 Attachments: NUTCH-578.patch, NUTCH-578_v2.patch, NUTCH-578_v3.patch, NUTCH-578_v4.patch, NUTCH-578_v5.patch, crawl-urlfilter.txt, nutch-site.xml, regex-normalize.xml, urls.txt I have not changed the following parameter in the nutch-default.xml: <property> <name>db.fetch.retry.max</name> <value>3</value> <description>The maximum number of times a url that has encountered recoverable errors is generated for fetch.</description> </property> However, there is a URL which is on the site that I'm crawling, www.teachertube.com, which keeps being generated over and over again for almost every segment (many more times than 3): fetch of http://www.teachertube.com/images/ failed with: Http code=403, url=http://www.teachertube.com/images/ This is a bug, right? Thanks. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (NUTCH-1088) Write Solr XML documents
[ https://issues.apache.org/jira/browse/NUTCH-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1088: - Fix Version/s: (was: 1.9) 1.10 Write Solr XML documents Key: NUTCH-1088 URL: https://issues.apache.org/jira/browse/NUTCH-1088 Project: Nutch Issue Type: New Feature Components: indexer Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Minor Fix For: 1.10 Documents need to be reindexed when index-time analysis is modified. Indexing individual segments from Nutch is tedious, especially for small segments. This issue should add a feature that can write XML batches. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (NUTCH-881) Good quality documentation for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-881: Fix Version/s: (was: 1.9) 1.10 Good quality documentation for Nutch Key: NUTCH-881 URL: https://issues.apache.org/jira/browse/NUTCH-881 Project: Nutch Issue Type: Improvement Components: documentation Affects Versions: nutchgora Reporter: Andrzej Bialecki Assignee: Lewis John McGibbney Fix For: 1.10 This is, and has been, a long standing request from Nutch users. This becomes an acute need as we redesign Nutch 2.0, because the collective knowledge and the Wiki will no longer be useful without massive amount of editing. IMHO the reference documentation should be in SVN, and not on the Wiki - the Wiki is good for casual information and recipes but I think it's too messy and not reliable enough as a reference. I propose to start with the following: 1. let's decide on the format of the docs. Each format has its own pros and cons: * HTML: easy to work with, but formatting may be messy unless we edit it by hand, at which point it's no longer so easy... Good toolchains to convert to other formats, but limited expressiveness of larger structures (e.g. book, chapters, TOC, multi-column layouts, etc). * Docbook: learning curve is higher, but not insurmountable... Naturally yields very good structure. Figures/diagrams may be problematic - different renderers (html, pdf) like to treat the scaling and placing somewhat differently. * Wiki-style (Confluence or TWiki): easy to use, but limited control over larger structures. Maven Doxia can format cwiki, twiki, and a host of other formats to e.g. html and pdf. * other? 2. start documenting the main tools and the main APIs (e.g. the plugins and all the extension points). We can of course reuse material from the Wiki and from various presentations (e.g. the ApacheCon slides). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (NUTCH-1201) Allow for different FetcherThread impls
[ https://issues.apache.org/jira/browse/NUTCH-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1201: - Fix Version/s: (was: 1.9) 1.10 Allow for different FetcherThread impls --- Key: NUTCH-1201 URL: https://issues.apache.org/jira/browse/NUTCH-1201 Project: Nutch Issue Type: New Feature Components: fetcher Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.10 Attachments: CustomFetcher.java, NUTCH-1201-1.5-wip.patch For certain cases we need to modify parts of FetcherThread and make it pluggable. This introduces a new config directive fetcher.impl that takes a FQCN; Fetcher.fetch uses that setting to load a class to pass to job.setMapRunnerClass(). This new class has to extend Fetcher and its inner class FetcherThread. This allows for overriding methods in FetcherThread but also methods in Fetcher itself if required. A follow-up on this issue would be to refactor parts of FetcherThread to make it easier to override small sections instead of copying the entire method body for a small change, which is now the case. -- This message was sent by Atlassian JIRA (v6.2#6252)
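For illustration, a nutch-site.xml entry for the directive described above might look as follows. This is a sketch only: the property name fetcher.impl comes from the issue description, while the class name org.example.CustomFetcher is a hypothetical placeholder.

{code}
<!-- Hypothetical sketch; org.example.CustomFetcher is a placeholder class name -->
<property>
  <name>fetcher.impl</name>
  <value>org.example.CustomFetcher</value>
  <description>Fully qualified class name of the Fetcher subclass that
  Fetcher.fetch would load and pass to job.setMapRunnerClass().</description>
</property>
{code}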
[jira] [Resolved] (NUTCH-933) Fetcher does not save a page's Last-Modified value in CrawlDatum
[ https://issues.apache.org/jira/browse/NUTCH-933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-933. - Resolution: Not a Problem Fix Version/s: (was: 1.9) Marked as not a problem. Please reopen if necessary Fetcher does not save a page's Last-Modified value in CrawlDatum --- Key: NUTCH-933 URL: https://issues.apache.org/jira/browse/NUTCH-933 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.2 Reporter: Joe Kemp I added the following code in the output method just after the if (content != null) statement. String lastModified = metadata.get("Last-Modified"); if (lastModified != null && !lastModified.equals("")) { try { Date lastModifiedDate = DateUtil.parseDate(lastModified); datum.setModifiedTime(lastModifiedDate.getTime()); } catch (DateParseException e) { } } I now get 304 for pages that haven't changed when I recrawl. Need to do further testing. Might also need a configuration parameter to turn off this behavior, allowing pages to be forced to be refreshed. -- This message was sent by Atlassian JIRA (v6.2#6252)
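The snippet in the report above can be sketched as a standalone method, outside of Nutch: parse an HTTP Last-Modified header into epoch milliseconds and fall back on failure. The class and method names, and the -1 fallback, are assumptions for illustration; java.text.SimpleDateFormat stands in for the DateUtil helper the reporter used.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

// Standalone sketch of the reporter's idea: turn a Last-Modified
// header into epoch millis so it can be stored on the CrawlDatum.
public class LastModifiedSketch {
    static long parseLastModified(String lastModified) {
        if (lastModified == null || lastModified.isEmpty()) {
            return -1L; // assumption: caller keeps the existing modified time
        }
        // RFC 1123 date format used by HTTP headers, e.g.
        // "Sun, 06 Nov 1994 08:49:37 GMT"
        SimpleDateFormat fmt =
            new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss zzz", Locale.US);
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        try {
            return fmt.parse(lastModified).getTime();
        } catch (ParseException e) {
            return -1L; // unparsable header: leave the datum unchanged
        }
    }
}
```

A configuration flag to disable this, as the reporter suggests, would then simply bypass the call.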
[jira] [Updated] (NUTCH-1218) Improve trunk API documentation
[ https://issues.apache.org/jira/browse/NUTCH-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1218: - Fix Version/s: (was: 1.9) 1.10 Improve trunk API documentation --- Key: NUTCH-1218 URL: https://issues.apache.org/jira/browse/NUTCH-1218 Project: Nutch Issue Type: Sub-task Components: documentation Affects Versions: 1.4 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.10 Attachments: NUTCH-1218.patch The trunk API Java documentation could do with some improving. This issue should track that. It should however not seek to change any functionality within the codebase, only to substantiate and improve the existing documentation. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (NUTCH-1219) Upgrade all jobs to new MapReduce API
[ https://issues.apache.org/jira/browse/NUTCH-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1219: - Fix Version/s: (was: 1.9) 1.10 Upgrade all jobs to new MapReduce API - Key: NUTCH-1219 URL: https://issues.apache.org/jira/browse/NUTCH-1219 Project: Nutch Issue Type: Task Reporter: Markus Jelsma Fix For: 1.10 We should upgrade to the new Hadoop API for Nutch trunk as has already been done for the Nutchgora branch. If I'm not mistaken we can already upgrade to the latest 0.20.5 version that still carries the legacy API, so we can port the jobs to the new API without immediately upgrading to 0.21 or higher and without the need for a separate branch to work on. To the committers who created/ported jobs in NutchGora, please write down your advice and experience. http://www.slideshare.net/sh1mmer/upgrading-to-the-new-map-reduce-api -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (NUTCH-1076) Solrindex has no documents following bin/nutch solrindex when using protocol-file
[ https://issues.apache.org/jira/browse/NUTCH-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1076: - Fix Version/s: (was: 1.9) 1.10 Solrindex has no documents following bin/nutch solrindex when using protocol-file - Key: NUTCH-1076 URL: https://issues.apache.org/jira/browse/NUTCH-1076 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.3 Environment: Ubuntu Linux 10.04 server JDK 1.6 Nutch 1.3 Solr 3.1.0 Reporter: Seth Griffin Assignee: Markus Jelsma Labels: nutch, protocol-file, solrindex Fix For: 1.10 Note: When using protocol-http I am able to update solr effortlessly. To test this I have a single pdf file that I am trying to index in my urls directory. I execute: bin/nutch crawl urls Output: solrUrl is not set, indexing will be skipped... crawl started in: crawl-20110805151045 rootUrlDir = urls threads = 10 depth = 5 solrUrl=null Injector: starting at 2011-08-05 15:10:45 Injector: crawlDb: crawl-20110805151045/crawldb Injector: urlDir: urls Injector: Converting injected urls to crawl db entries. Injector: Merging injected urls into crawl db. Injector: finished at 2011-08-05 15:10:48, elapsed: 00:00:02 Generator: starting at 2011-08-05 15:10:48 Generator: Selecting best-scoring urls due for fetch. Generator: filtering: true Generator: normalizing: true Generator: jobtracker is 'local', generating exactly one partition. Generator: Partitioning selected urls for politeness. Generator: segment: crawl-20110805151045/segments/20110805151050 Generator: finished at 2011-08-05 15:10:51, elapsed: 00:00:03 Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property. 
Fetcher: starting at 2011-08-05 15:10:51 Fetcher: segment: crawl-20110805151045/segments/20110805151050 Fetcher: threads: 10 QueueFeeder finished: total 1 records + hit by time limit :0 fetching file:///home/nutch/nutch-1.3/runtime/local/indexdir/Altec.pdf -finishing thread FetcherThread, activeThreads=9 -finishing thread FetcherThread, activeThreads=8 -finishing thread FetcherThread, activeThreads=7 -finishing thread FetcherThread, activeThreads=6 -finishing thread FetcherThread, activeThreads=5 -finishing thread FetcherThread, activeThreads=4 -finishing thread FetcherThread, activeThreads=3 -finishing thread FetcherThread, activeThreads=2 -finishing thread FetcherThread, activeThreads=1 -finishing thread FetcherThread, activeThreads=0 -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0 -activeThreads=0 Fetcher: finished at 2011-08-05 15:10:53, elapsed: 00:00:02 ParseSegment: starting at 2011-08-05 15:10:53 ParseSegment: segment: crawl-20110805151045/segments/20110805151050 ParseSegment: finished at 2011-08-05 15:10:56, elapsed: 00:00:03 CrawlDb update: starting at 2011-08-05 15:10:56 CrawlDb update: db: crawl-20110805151045/crawldb CrawlDb update: segments: [crawl-20110805151045/segments/20110805151050] CrawlDb update: additions allowed: true CrawlDb update: URL normalizing: true CrawlDb update: URL filtering: true CrawlDb update: Merging segment data into db. CrawlDb update: finished at 2011-08-05 15:10:57, elapsed: 00:00:01 Generator: starting at 2011-08-05 15:10:57 Generator: Selecting best-scoring urls due for fetch. Generator: filtering: true Generator: normalizing: true Generator: jobtracker is 'local', generating exactly one partition. Generator: 0 records selected for fetching, exiting ... Stopping at depth=1 - no more URLs to fetch. 
LinkDb: starting at 2011-08-05 15:10:58 LinkDb: linkdb: crawl-20110805151045/linkdb LinkDb: URL normalize: true LinkDb: URL filter: true LinkDb: adding segment: file:/home/nutch/nutch-1.3/runtime/local/crawl-20110805151045/segments/20110805151050 LinkDb: finished at 2011-08-05 15:10:59, elapsed: 00:00:01 crawl finished: crawl-20110805151045 Then with a clean solr index (stats output from stats.jsp below): searcherName : Searcher@14dd758 main caching : true numDocs : 0 maxDoc : 0 reader : SolrIndexReader{this=1ee148b,r=ReadOnlyDirectoryReader@1ee148b,refCnt=1,segments=0} readerDir : org.apache.lucene.store.NIOFSDirectory@/home/solr/apache-solr-3.1.0/example/solr/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@987197 indexVersion : 1312575204101 openedAt : Fri Aug 05 15:13:24 CDT 2011 registeredAt : Fri Aug 05 15:13:24 CDT 2011 warmupTime : 0 I then execute: bin/nutch solrindex http://localhost:8983/solr/ crawl-20110805151045/crawldb/ crawl-20110805151045/linkdb/ crawl-20110805151045/segments/* bin/nutch output: SolrIndexer: starting at 2011-08-05 15:15:48 SolrIndexer: finished at 2011-08-05 15:15:50,
[jira] [Updated] (NUTCH-1179) Option to restrict generated records by metadata
[ https://issues.apache.org/jira/browse/NUTCH-1179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1179: - Fix Version/s: (was: 1.9) 1.10 Option to restrict generated records by metadata Key: NUTCH-1179 URL: https://issues.apache.org/jira/browse/NUTCH-1179 Project: Nutch Issue Type: New Feature Components: generator Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Minor Fix For: 1.10 The generator should be able to select entries based on a metadata key/value pair. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (NUTCH-1179) Option to restrict generated records by metadata
[ https://issues.apache.org/jira/browse/NUTCH-1179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14053630#comment-14053630 ] Julien Nioche commented on NUTCH-1179: -- Why not. An alternative strategy is to write a custom Scoringfilter that returns a generate score based on a metadata K/V and set a min score for the generation. Option to restrict generated records by metadata Key: NUTCH-1179 URL: https://issues.apache.org/jira/browse/NUTCH-1179 Project: Nutch Issue Type: New Feature Components: generator Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Minor Fix For: 1.10 The generator should be able to select entries based on a metadata key/value pair. -- This message was sent by Atlassian JIRA (v6.2#6252)
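The alternative Julien suggests above could be sketched as follows: a scoring-filter-style function that boosts the generate sort value when a required metadata key/value pair is present, combined with a minimum-score cutoff at generation time. Everything here is a hypothetical illustration (class name, method names, and the boost constant); real Nutch scoring filters operate on CrawlDatum objects, not plain maps.

```java
import java.util.Map;

// Hypothetical sketch of metadata-driven generation via scoring:
// entries carrying the wanted key/value get a large boost, and the
// generator only selects entries above a minimum sort value.
public class MetadataGenerateScore {
    // Assumed boost large enough to lift matching entries over the cutoff.
    static final float BOOST = 100.0f;

    static float generatorSortValue(float initialScore,
                                    Map<String, String> metadata,
                                    String requiredKey,
                                    String requiredValue) {
        if (requiredValue.equals(metadata.get(requiredKey))) {
            return initialScore + BOOST;
        }
        return initialScore; // non-matching entries keep their base score
    }

    static boolean shouldGenerate(float sortValue, float minScore) {
        return sortValue >= minScore;
    }
}
```

With a min score of, say, 50, only records carrying the key/value pair would be generated, which approximates the feature requested in the issue without a new generator option.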
[jira] [Updated] (NUTCH-1539) Implement the Hypertext Induced Topic Search (HITS) algorithm in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1539: - Fix Version/s: (was: 1.9) 1.10 Implement the Hypertext Induced Topic Search (HITS) algorithm in Nutch -- Key: NUTCH-1539 URL: https://issues.apache.org/jira/browse/NUTCH-1539 Project: Nutch Issue Type: Bug Components: linkdb Environment: CSCI 572: Search Engines and Information Retrieval @ USC, http://sunset.usc.edu/classes/cs572_2010/ Nutch 1.1 Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.10 Attachments: CS572CourseProjectReport_Yongqiang.pdf, NUTCH-1538.yongqiang.Mattmann.030413.patch.txt, csci572CourseProject_Yongqiang.rar In my Summer 2010 CSCI 572: Search Engines and Information Retrieval class, my student Yongqiang Li and I implemented the HITS algorithm in Nutch based on Jon Kleinberg's paper: Authoritative Sources in a Hyperlinked Environment http://dl.acm.org/citation.cfm?id=324140 I'll put up the code we had shortly. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (NUTCH-849) different versions of the same library in nutch-2.0-dev.job and local\lib directory
[ https://issues.apache.org/jira/browse/NUTCH-849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-849: Fix Version/s: (was: 1.9) 1.10 different versions of the same library in nutch-2.0-dev.job and local\lib directory Key: NUTCH-849 URL: https://issues.apache.org/jira/browse/NUTCH-849 Project: Nutch Issue Type: Task Affects Versions: 1.4, nutchgora Environment: Window XP SP3, Cygwin Reporter: Pham Tuan Minh Priority: Minor Fix For: 1.10 Hi, I found that after building the runtime, the nutch-2.0-dev.job and local\lib directories contain different versions of the same library: ant-1.7.1.jar / ant-1.6.5.jar and servlet-api-2.5-20081211.jar / servlet-api-2.5-6.1.14.jar. I suspect these libraries come from different dependency branches. Can anyone help me fix it? Thanks, -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (NUTCH-1040) Backport REST-API from 2.0
[ https://issues.apache.org/jira/browse/NUTCH-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1040: - Fix Version/s: (was: 1.9) 1.10 Backport REST-API from 2.0 -- Key: NUTCH-1040 URL: https://issues.apache.org/jira/browse/NUTCH-1040 Project: Nutch Issue Type: New Feature Components: REST_api Reporter: Julien Nioche Fix For: 1.10 See https://issues.apache.org/jira/browse/NUTCH-880 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (NUTCH-1267) urlmeta to delegate indexing to index-metadata
[ https://issues.apache.org/jira/browse/NUTCH-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1267: - Fix Version/s: (was: 1.9) 1.10 urlmeta to delegate indexing to index-metadata -- Key: NUTCH-1267 URL: https://issues.apache.org/jira/browse/NUTCH-1267 Project: Nutch Issue Type: Sub-task Components: indexer Affects Versions: 1.6 Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.10 Ideally we should get rid of urlmeta altogether and add the transmission of the meta to the outlinks in the core classes - not as a plugin. URLMeta is also a terrible name :-( -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (NUTCH-1371) Replace Ivy with Maven Ant tasks
[ https://issues.apache.org/jira/browse/NUTCH-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1371: - Fix Version/s: (was: 1.9) 1.10 Replace Ivy with Maven Ant tasks Key: NUTCH-1371 URL: https://issues.apache.org/jira/browse/NUTCH-1371 Project: Nutch Issue Type: Improvement Components: build Affects Versions: 1.7, 2.2.1 Reporter: Julien Nioche Assignee: Lewis John McGibbney Fix For: 1.10 Attachments: NUTCH-1371-2x.patch, NUTCH-1371-plugins.trunk.patch, NUTCH-1371-pom.patch, NUTCH-1371-r1461140.patch, NUTCH-1371.patch We might move to Maven altogether but a good intermediate step could be to rely on the maven ant tasks for managing the dependencies. Ivy does a good job but we need to have a pom file anyway for publishing the artefacts which means keeping the pom.xml and ivy.xml contents in sync. Most devs are also more familiar with Maven, and it is well integrated in IDEs. Going the ANT+MVN way also means that we don't have to rewrite the whole building process and can rely on our existing script -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job
[ https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1712: - Fix Version/s: (was: 1.9) 1.10 Use MultipleInputs in Injector to make it a single mapreduce job Key: NUTCH-1712 URL: https://issues.apache.org/jira/browse/NUTCH-1712 Project: Nutch Issue Type: Improvement Components: injector Affects Versions: 1.7 Reporter: Tejas Patil Assignee: Tejas Patil Fix For: 1.10 Attachments: NUTCH-1712-trunk.v1.patch Currently Injector creates two mapreduce jobs: 1. sort job: get the urls from seeds file, emit CrawlDatum objects. 2. merge job: read CrawlDatum objects from both crawldb and output of sort job. Merge and emit final CrawlDatum objects. Using MultipleInputs, we can read CrawlDatum objects from crawldb and urls from seeds file simultaneously and perform inject in a single map-reduce job. Also, here are additional things covered with this jira: 1. Pushed filtering and normalization above metadata extraction so that the unwanted records are ruled out quickly. 2. Migrated to new mapreduce API 3. Improved documentation 4. New junits with better coverage Relevant discussion over nutch-dev can be found here: http://mail-archives.apache.org/mod_mbox/nutch-dev/201401.mbox/%3ccafkhtfyxo6wl7gyuv+a5y1pzntdcoqpz4jz_up_bkp9cje8...@mail.gmail.com%3E -- This message was sent by Atlassian JIRA (v6.2#6252)
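The merge semantics of the single-job inject described above can be illustrated with a toy simulation: existing crawldb entries and freshly injected seed URLs meet in one pass, and an existing datum wins over an injected one for the same URL. Statuses are simplified to strings here; real Nutch merges CrawlDatum objects in a reducer, so this is an assumption-laden sketch, not the patch's implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the single-mapreduce-job inject: one merge pass over
// crawldb entries and seed URLs, mirroring the "merge and emit final
// CrawlDatum objects" step described in the issue.
public class InjectMergeSketch {
    static Map<String, String> merge(Map<String, String> crawldb,
                                     Iterable<String> seeds) {
        Map<String, String> merged = new HashMap<>(crawldb);
        for (String url : seeds) {
            // Only add a seed when the crawldb has no entry for it yet;
            // an already-known URL keeps its existing status.
            merged.putIfAbsent(url, "db_unfetched");
        }
        return merged;
    }
}
```

MultipleInputs makes this possible in one job by letting the crawldb SequenceFile input and the text seed-file input feed different mappers into the same reduce.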
[jira] [Updated] (NUTCH-1561) improve usability of parse-metatags and index-metadata
[ https://issues.apache.org/jira/browse/NUTCH-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1561: - Fix Version/s: (was: 1.10) 1.9 improve usability of parse-metatags and index-metadata -- Key: NUTCH-1561 URL: https://issues.apache.org/jira/browse/NUTCH-1561 Project: Nutch Issue Type: Improvement Affects Versions: 1.6 Reporter: Sebastian Nagel Assignee: Sebastian Nagel Priority: Minor Fix For: 1.9 Attachments: NUTCH-1561-trunk-v2.patch, NUTCH-1561-v1.patch Usually, the plugins parse-metatags and index-metadata are used in combination: the former extracts meta tags, the latter adds the extracted tags as fields to the index. Configuration of the two plugins differs which causes pitfalls and reduces the usability (see example config): * the property metatags.names of parse-metatags uses ';' as separator instead of ',' used by index-metadata * meta tags have to be lowercased in index-metadata {code}
<property>
  <name>metatags.names</name>
  <value>DC.creator;DCTERMS.bibliographicCitation</value>
</property>
<property>
  <name>index.parse.md</name>
  <value>metatag.dc.creator,metatag.dcterms.bibliographiccitation</value>
</property>
{code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (NUTCH-926) Redirections from META tag don't get filtered
[ https://issues.apache.org/jira/browse/NUTCH-926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-926: Fix Version/s: (was: 1.10) 1.9 Redirections from META tag don't get filtered - Key: NUTCH-926 URL: https://issues.apache.org/jira/browse/NUTCH-926 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.2 Environment: gnu/linux centOs Reporter: Marco Novo Fix For: 1.9 Attachments: NUTCH-926-trunk.patch, ParseOutputFormat.java.patch We have Nutch set to crawl a list of domains and we want to fetch only the listed domains (hosts), not subdomains. So WWW.DOMAIN1.COM .. .. .. WWW.RIGHTDOMAIN.COM .. .. .. .. WWW.DOMAIN.COM We set Nutch to NOT FOLLOW EXTERNAL LINKS. During crawling of WWW.RIGHTDOMAIN.COM, if a page contains <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <html> <head> <title></title> <META http-equiv="refresh" content="0; url=http://WRONG.RIGHTDOMAIN.COM"> </head> <body> </body> </html> Nutch continues to crawl the WRONG subdomain! But it should not do this! Likewise, if a page contains the same META refresh pointing to http://WWW.WRONGDOMAIN.COM, Nutch continues to crawl the WRONG domain! But it should not do this! Otherwise we will spider the whole web. We think the problem is in org.apache.nutch.parse.ParseOutputFormat. We have done a patch and will attach it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (NUTCH-1708) use same id when indexing and deleting redirects
[ https://issues.apache.org/jira/browse/NUTCH-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1708: - Fix Version/s: 1.9 use same id when indexing and deleting redirects Key: NUTCH-1708 URL: https://issues.apache.org/jira/browse/NUTCH-1708 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.7 Reporter: Sebastian Nagel Fix For: 1.9 Attachments: NUTCH-1708-2x-v1.patch, NUTCH-1708-trunk-v1.patch Redirect targets are indexed using the representative URL: * in Fetcher the repr URL is determined by URLUtil.chooseRepr() and stored in the CrawlDatum (CrawlDb). The repr URL is either the source or the target URL of the redirect pair. * NutchField url is filled by the basic indexing filter with the repr URL * the id field used as unique key is filled from url per solrindex-mapping.xml Deletion of redirects is done in IndexerMapReduce.reduce() by key, which is the URL of the redirect source. If the source URL is chosen as repr URL, a redirect target may get erroneously deleted. Test crawl with seed {{http://wiki.apache.org/nutch}} which redirects to {{http://wiki.apache.org/nutch/}}. DummyIndexWriter (NUTCH-1707) indicates that the same URL is deleted and added: {code}
delete http://wiki.apache.org/nutch
add http://wiki.apache.org/nutch
{code} -- This message was sent by Atlassian JIRA (v6.2#6252)
Re: Nearing a 1.9 release?
Hi, I've moved all the open issues that were marked with fix version = 1.9 to 1.10 except for the ones that Seb mentioned earlier. Please go through the issues listed for 1.10 https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%201.10%20AND%20project%20%3D%20NUTCH%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20updated%20DESC and change their fix version back to 1.9 if you think they should be included in the next release. Thanks Julien On 29 June 2014 10:20, Julien Nioche lists.digitalpeb...@gmail.com wrote: Hi guys, We've done loads of good work on the trunk since the last release, in particular: - NUTCH-1736 https://issues.apache.org/jira/browse/NUTCH-1736 - NUTCH-1647 https://issues.apache.org/jira/browse/NUTCH-1647 - NUTCH-1793 https://issues.apache.org/jira/browse/NUTCH-1793 which are important bug fixes (NUTCH-578 https://issues.apache.org/jira/browse/NUTCH-578 will also be an important one). If you want to help make the new release happen, could you please go through the issues listed for 1.9 https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%201.9%20AND%20project%20%3D%20NUTCH%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20updated%20DESC and vote for the ones you think should be included in the next release / comment on issues opened by others / review the patches / contribute to the discussions? Thanks! Julien -- Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble -- Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble
[jira] [Created] (NUTCH-1812) Create Vagrant artifacts for 2.X branch
Lewis John McGibbney created NUTCH-1812: --- Summary: Create Vagrant artifacts for 2.X branch Key: NUTCH-1812 URL: https://issues.apache.org/jira/browse/NUTCH-1812 Project: Nutch Issue Type: Improvement Components: build Reporter: Lewis John McGibbney Fix For: 2.4 Vagrant [0] is a very useful tool which I believe we can use to create VM containers to help aid the process of setting up and running a Nutch 2.X build. Right now, provisioning Nutch 2.X is a PITA for new users. We've been working with Vagrant a lot. I feel that it would be really nice to just drop in an image into virtualbox (or something similar) and then just start. [0] http://www.vagrantup.com/ -- This message was sent by Atlassian JIRA (v6.2#6252)