[jira] [Commented] (NUTCH-1502) Test for CrawlDatum state transitions

2014-07-07 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053592#comment-14053592
 ] 

Julien Nioche commented on NUTCH-1502:
--

Great stuff! I'd rename the *TODO class to
org.apache.nutch.crawl.TODOTestCrawlDbStates: this way it will not get included
in the test suite. It currently fails because the corresponding issues are not yet fixed.

 Test for CrawlDatum state transitions
 -

 Key: NUTCH-1502
 URL: https://issues.apache.org/jira/browse/NUTCH-1502
 Project: Nutch
  Issue Type: Improvement
  Components: crawldb
Affects Versions: 1.7, 2.2
Reporter: Sebastian Nagel
 Fix For: 2.4, 1.9

 Attachments: NUTCH-1502-trunk-v1.patch


 An exhaustive test to check the matrix of CrawlDatum state transitions 
 (CrawlStatus in 2.x) would be useful to detect errors esp. for continuous 
 crawls where the number of possible transitions is quite large. Additional 
 factors with impact on state transitions (retry counters, static and dynamic 
 intervals) are also tested.
 The tests will help to address NUTCH-578 and NUTCH-1245. See the latter 
 for a first sketchy patch.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (NUTCH-578) URL fetched with 403 is generated over and over again

2014-07-07 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-578.
-

Resolution: Fixed

Committed revision 1608431.

Let's track the overflowing issue in 
https://issues.apache.org/jira/browse/NUTCH-1247

Thanks!

 URL fetched with 403 is generated over and over again
 -

 Key: NUTCH-578
 URL: https://issues.apache.org/jira/browse/NUTCH-578
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 1.0.0
 Environment: Ubuntu Gutsy Gibbon (7.10) running on VMware server. I 
 have checked out the most recent version of the trunk as of Nov 20, 2007
Reporter: Nathaniel Powell
Assignee: Markus Jelsma
Priority: Critical
 Fix For: 1.9

 Attachments: NUTCH-578.patch, NUTCH-578_v2.patch, NUTCH-578_v3.patch, 
 NUTCH-578_v4.patch, NUTCH-578_v5.patch, crawl-urlfilter.txt, nutch-site.xml, 
 regex-normalize.xml, urls.txt


 I have not changed the following parameter in the nutch-default.xml:
 <property>
   <name>db.fetch.retry.max</name>
   <value>3</value>
   <description>The maximum number of times a url that has encountered
   recoverable errors is generated for fetch.</description>
 </property>
 However, there is a URL which is on the site that I'm crawling, 
 www.teachertube.com, which keeps being generated over and over again for 
 almost every segment (many more times than 3):
 fetch of http://www.teachertube.com/images/ failed with: Http code=403, 
 url=http://www.teachertube.com/images/
 This is a bug, right?
 Thanks.
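
The behaviour the reporter expects from db.fetch.retry.max can be sketched as below. This is only an illustration under stated assumptions, not the Nutch generator code: the class RetryGate and its method are hypothetical, and only the property name and its default value of 3 come from the report. Whether a 403 response counts as a "recoverable error" is not settled by the report and is not modelled here.

```java
/** Hypothetical sketch: stop generating a URL once its retry count reaches db.fetch.retry.max. */
final class RetryGate {
    private final int retryMax; // value of db.fetch.retry.max (3 in the report)

    RetryGate(int retryMax) {
        this.retryMax = retryMax;
    }

    /** A URL stays eligible for generation only while it has failed fewer than retryMax times. */
    boolean shouldGenerate(int retriesSoFar) {
        return retriesSoFar < retryMax;
    }

    public static void main(String[] args) {
        RetryGate gate = new RetryGate(3);
        for (int retries = 0; retries <= 4; retries++) {
            System.out.println("retries=" + retries + " generate=" + gate.shouldGenerate(retries));
        }
    }
}
```

With a limit of 3, the URL from the report would be generated for at most three attempts and then skipped.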





[jira] [Updated] (NUTCH-1088) Write Solr XML documents

2014-07-07 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1088:
-

Fix Version/s: (was: 1.9)
   1.10

 Write Solr XML documents
 

 Key: NUTCH-1088
 URL: https://issues.apache.org/jira/browse/NUTCH-1088
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.10


 Documents need to be reindexed when index-time analysis is modified. Indexing 
 individual segments from Nutch is tedious, especially for small segments. 
 This issue should add a feature that can write XML batches.





[jira] [Updated] (NUTCH-881) Good quality documentation for Nutch

2014-07-07 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-881:


Fix Version/s: (was: 1.9)
   1.10

 Good quality documentation for Nutch
 

 Key: NUTCH-881
 URL: https://issues.apache.org/jira/browse/NUTCH-881
 Project: Nutch
  Issue Type: Improvement
  Components: documentation
Affects Versions: nutchgora
Reporter: Andrzej Bialecki 
Assignee: Lewis John McGibbney
 Fix For: 1.10


 This is, and has been, a long standing request from Nutch users. This becomes 
 an acute need as we redesign Nutch 2.0, because the collective knowledge and 
 the Wiki will no longer be useful without massive amount of editing.
 IMHO the reference documentation should be in SVN, and not on the Wiki - the 
 Wiki is good for casual information and recipes but I think it's too messy 
 and not reliable enough as a reference.
 I propose to start with the following:
  1. let's decide on the format of the docs. Each format has its own pros and 
 cons:
   * HTML: easy to work with, but formatting may be messy unless we edit it by 
 hand, at which point it's no longer so easy... Good toolchains to convert to 
 other formats, but limited expressiveness of larger structures (e.g. book, 
 chapters, TOC, multi-column layouts, etc).
   * Docbook: learning curve is higher, but not insurmountable... Naturally 
 yields very good structure. Figures/diagrams may be problematic - different 
 renderers (html, pdf) like to treat the scaling and placing somewhat 
 differently.
   * Wiki-style (Confluence or TWiki): easy to use, but limited control over 
 larger structures. Maven Doxia can format cwiki, twiki, and a host of other 
 formats to e.g. html and pdf.
   * other?
  2. start documenting the main tools and the main APIs (e.g. the plugins and 
 all the extension points). We can of course reuse material from the Wiki and 
 from various presentations (e.g. the ApacheCon slides).





[jira] [Updated] (NUTCH-1201) Allow for different FetcherThread impls

2014-07-07 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1201:
-

Fix Version/s: (was: 1.9)
   1.10

 Allow for different FetcherThread impls
 ---

 Key: NUTCH-1201
 URL: https://issues.apache.org/jira/browse/NUTCH-1201
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.10

 Attachments: CustomFetcher.java, NUTCH-1201-1.5-wip.patch


 For certain cases we need to modify parts of FetcherThread and make it 
 pluggable. This introduces a new config directive fetcher.impl that takes a 
 FQCN; Fetcher.fetch uses that setting to load a class to use for 
 job.setMapRunnerClass(). This new class has to extend Fetcher and its inner 
 class FetcherThread. This allows for overriding methods in FetcherThread but 
 also methods in Fetcher itself if required.
 A follow up on this issue would be to refactor parts of FetcherThread to make 
 it easier to override small sections instead of copying the entire method 
 body for a small change, which is now the case.
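
Loading an implementation from a fully qualified class name, as proposed for fetcher.impl, might look roughly like the sketch below. All names here are hypothetical stand-ins: BaseFetcher and CustomFetcher are not the Nutch classes, and the attached patch may do this differently.

```java
/** Hypothetical sketch of loading a pluggable implementation by fully qualified class name. */
final class PluggableLoader {
    /** Stand-in for a base class such as Fetcher; not the Nutch class. */
    static class BaseFetcher {
        String describe() { return "base"; }
    }

    /** Stand-in for a user-supplied subclass that overrides behaviour. */
    static class CustomFetcher extends BaseFetcher {
        @Override String describe() { return "custom"; }
    }

    /** Resolves fqcn, checks that it extends BaseFetcher, and instantiates it. */
    static BaseFetcher create(String fqcn) {
        try {
            // asSubclass throws ClassCastException if the class does not extend BaseFetcher.
            return Class.forName(fqcn).asSubclass(BaseFetcher.class)
                    .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot instantiate " + fqcn, e);
        }
    }

    public static void main(String[] args) {
        // In the proposal the class name would come from the fetcher.impl property.
        String fqcn = CustomFetcher.class.getName();
        System.out.println(create(fqcn).describe());
    }
}
```

The subclass check up front means a misconfigured class name fails at load time rather than later in the job.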





[jira] [Resolved] (NUTCH-933) Fetcher does not save a pages Last-Modified value in CrawlDatum

2014-07-07 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-933.
-

   Resolution: Not a Problem
Fix Version/s: (was: 1.9)

Marked as not a problem. Please reopen if necessary.

 Fetcher does not save a pages Last-Modified value in CrawlDatum
 ---

 Key: NUTCH-933
 URL: https://issues.apache.org/jira/browse/NUTCH-933
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.2
Reporter: Joe Kemp

 I added the following code in the output method just after the if (content 
 != null) statement.
 String lastModified = metadata.get("Last-Modified");
 if (lastModified != null && !lastModified.equals("")) {
   try {
     Date lastModifiedDate = DateUtil.parseDate(lastModified);
     datum.setModifiedTime(lastModifiedDate.getTime());
   } catch (DateParseException e) {
     // unparseable Last-Modified header: leave the modified time unset
   }
 }
 I now get 304 for pages that haven't changed when I recrawl.  Need to do 
 further testing.  Might also need a configuration parameter to turn off this 
 behavior, allowing pages to be forced to be refreshed.
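
For readers who want to try the header parsing in isolation, here is a self-contained sketch. It swaps in java.time for the DateUtil helper used in the report, and the class LastModifiedParser and its -1 sentinel are assumptions made for illustration only. It accepts only RFC 1123 dates, whereas HTTP servers may send other legacy formats.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

/** Hypothetical sketch: turn a Last-Modified header value into epoch milliseconds. */
final class LastModifiedParser {
    /** Returns epoch milliseconds for an RFC 1123 date, or -1 if missing or unparseable. */
    static long parseMillis(String lastModified) {
        if (lastModified == null || lastModified.isEmpty()) {
            return -1L;
        }
        try {
            return ZonedDateTime.parse(lastModified, DateTimeFormatter.RFC_1123_DATE_TIME)
                    .toInstant()
                    .toEpochMilli();
        } catch (DateTimeParseException e) {
            return -1L; // unparseable header: caller leaves the modified time unset
        }
    }

    public static void main(String[] args) {
        System.out.println(parseMillis("Thu, 01 Jan 1970 00:00:01 GMT"));
        System.out.println(parseMillis("not a date"));
    }
}
```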





[jira] [Updated] (NUTCH-1218) Improve trunk API documentation

2014-07-07 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1218:
-

Fix Version/s: (was: 1.9)
   1.10

 Improve trunk API documentation
 ---

 Key: NUTCH-1218
 URL: https://issues.apache.org/jira/browse/NUTCH-1218
 Project: Nutch
  Issue Type: Sub-task
  Components: documentation
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.10

 Attachments: NUTCH-1218.patch


 The trunk API Java documentation could do with some improving. This issue 
 should track that. It should however not seek to change any functionality 
 within the codebase, only to substantiate and improve the existing 
 documentation.  





[jira] [Updated] (NUTCH-1219) Upgrade all jobs to new MapReduce API

2014-07-07 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1219:
-

Fix Version/s: (was: 1.9)
   1.10

 Upgrade all jobs to new MapReduce API
 -

 Key: NUTCH-1219
 URL: https://issues.apache.org/jira/browse/NUTCH-1219
 Project: Nutch
  Issue Type: Task
Reporter: Markus Jelsma
 Fix For: 1.10


 We should upgrade to the new Hadoop API for Nutch trunk, as has already been 
 done for the Nutchgora branch. If I'm not mistaken we can already upgrade to 
 the latest 0.20.5 version, which still carries the legacy API, so we can port 
 the jobs to the new API without immediately upgrading to 0.21 or higher and 
 without needing a separate branch to work on.
 To the committers who created/ported jobs in NutchGora, please write down 
 your advice and experience.
 http://www.slideshare.net/sh1mmer/upgrading-to-the-new-map-reduce-api





[jira] [Updated] (NUTCH-1076) Solrindex has no documents following bin/nutch solrindex when using protocol-file

2014-07-07 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1076:
-

Fix Version/s: (was: 1.9)
   1.10

 Solrindex has no documents following bin/nutch solrindex when using 
 protocol-file
 -

 Key: NUTCH-1076
 URL: https://issues.apache.org/jira/browse/NUTCH-1076
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.3
 Environment: Ubuntu Linux 10.04 server
 JDK 1.6
 Nutch 1.3
 Solr 3.1.0
Reporter: Seth Griffin
Assignee: Markus Jelsma
  Labels: nutch, protocol-file, solrindex
 Fix For: 1.10


 Note: When using protocol-http I am able to update solr effortlessly.
 To test this I have a single pdf file that I am trying to index in my urls 
 directory.
 I execute:
 bin/nutch crawl urls
 Output:
 solrUrl is not set, indexing will be skipped...
 crawl started in: crawl-20110805151045
 rootUrlDir = urls
 threads = 10
 depth = 5
 solrUrl=null
 Injector: starting at 2011-08-05 15:10:45
 Injector: crawlDb: crawl-20110805151045/crawldb
 Injector: urlDir: urls
 Injector: Converting injected urls to crawl db entries.
 Injector: Merging injected urls into crawl db.
 Injector: finished at 2011-08-05 15:10:48, elapsed: 00:00:02
 Generator: starting at 2011-08-05 15:10:48
 Generator: Selecting best-scoring urls due for fetch.
 Generator: filtering: true
 Generator: normalizing: true
 Generator: jobtracker is 'local', generating exactly one partition.
 Generator: Partitioning selected urls for politeness.
 Generator: segment: crawl-20110805151045/segments/20110805151050
 Generator: finished at 2011-08-05 15:10:51, elapsed: 00:00:03
 Fetcher: Your 'http.agent.name' value should be listed first in 
 'http.robots.agents' property.
 Fetcher: starting at 2011-08-05 15:10:51
 Fetcher: segment: crawl-20110805151045/segments/20110805151050
 Fetcher: threads: 10
 QueueFeeder finished: total 1 records + hit by time limit :0
 fetching file:///home/nutch/nutch-1.3/runtime/local/indexdir/Altec.pdf
 -finishing thread FetcherThread, activeThreads=9
 -finishing thread FetcherThread, activeThreads=8
 -finishing thread FetcherThread, activeThreads=7
 -finishing thread FetcherThread, activeThreads=6
 -finishing thread FetcherThread, activeThreads=5
 -finishing thread FetcherThread, activeThreads=4
 -finishing thread FetcherThread, activeThreads=3
 -finishing thread FetcherThread, activeThreads=2
 -finishing thread FetcherThread, activeThreads=1
 -finishing thread FetcherThread, activeThreads=0
 -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
 -activeThreads=0
 Fetcher: finished at 2011-08-05 15:10:53, elapsed: 00:00:02
 ParseSegment: starting at 2011-08-05 15:10:53
 ParseSegment: segment: crawl-20110805151045/segments/20110805151050
 ParseSegment: finished at 2011-08-05 15:10:56, elapsed: 00:00:03
 CrawlDb update: starting at 2011-08-05 15:10:56
 CrawlDb update: db: crawl-20110805151045/crawldb
 CrawlDb update: segments: [crawl-20110805151045/segments/20110805151050]
 CrawlDb update: additions allowed: true
 CrawlDb update: URL normalizing: true
 CrawlDb update: URL filtering: true
 CrawlDb update: Merging segment data into db.
 CrawlDb update: finished at 2011-08-05 15:10:57, elapsed: 00:00:01
 Generator: starting at 2011-08-05 15:10:57
 Generator: Selecting best-scoring urls due for fetch.
 Generator: filtering: true
 Generator: normalizing: true
 Generator: jobtracker is 'local', generating exactly one partition.
 Generator: 0 records selected for fetching, exiting ...
 Stopping at depth=1 - no more URLs to fetch.
 LinkDb: starting at 2011-08-05 15:10:58
 LinkDb: linkdb: crawl-20110805151045/linkdb
 LinkDb: URL normalize: true
 LinkDb: URL filter: true
 LinkDb: adding segment: 
 file:/home/nutch/nutch-1.3/runtime/local/crawl-20110805151045/segments/20110805151050
 LinkDb: finished at 2011-08-05 15:10:59, elapsed: 00:00:01
 crawl finished: crawl-20110805151045
 Then with a clean solr index (stats output from stats.jsp below):
 searcherName : Searcher@14dd758 main
 caching : true
 numDocs : 0
 maxDoc : 0
 reader : 
 SolrIndexReader{this=1ee148b,r=ReadOnlyDirectoryReader@1ee148b,refCnt=1,segments=0}
 readerDir : 
 org.apache.lucene.store.NIOFSDirectory@/home/solr/apache-solr-3.1.0/example/solr/data/index
  lockFactory=org.apache.lucene.store.NativeFSLockFactory@987197
 indexVersion : 1312575204101
 openedAt : Fri Aug 05 15:13:24 CDT 2011
 registeredAt : Fri Aug 05 15:13:24 CDT 2011
 warmupTime : 0 
 I then execute:
 bin/nutch solrindex http://localhost:8983/solr/ crawl-20110805151045/crawldb/ 
 crawl-20110805151045/linkdb/ crawl-20110805151045/segments/*
 bin/nutch output:
 SolrIndexer: starting at 2011-08-05 15:15:48
 SolrIndexer: finished at 2011-08-05 15:15:50, 

[jira] [Updated] (NUTCH-1179) Option to restrict generated records by metadata

2014-07-07 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1179:
-

Fix Version/s: (was: 1.9)
   1.10

 Option to restrict generated records by metadata
 

 Key: NUTCH-1179
 URL: https://issues.apache.org/jira/browse/NUTCH-1179
 Project: Nutch
  Issue Type: New Feature
  Components: generator
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.10


 The generator should be able to select entries based on a metadata key/value 
 pair.





[jira] [Commented] (NUTCH-1179) Option to restrict generated records by metadata

2014-07-07 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053630#comment-14053630
 ] 

Julien Nioche commented on NUTCH-1179:
--

Why not. An alternative strategy is to write a custom ScoringFilter that 
returns a generate score based on a metadata K/V and set a min score for the 
generation.
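
The alternative can be sketched in plain Java as follows. This is only an illustration of the idea: MetadataGenerateScore, its methods, and the 0.5 threshold are hypothetical and do not reproduce the Nutch ScoringFilter interface.

```java
import java.util.Collections;
import java.util.Map;

/** Hypothetical sketch: derive a generate score from one metadata key/value pair. */
final class MetadataGenerateScore {
    /** Keeps the incoming score when metadata[key] equals value, otherwise returns 0. */
    static float generateScore(Map<String, String> metadata, String key, String value, float score) {
        return value.equals(metadata.get(key)) ? score : 0.0f;
    }

    /** A record is selected for generation only if its score reaches the minimum. */
    static boolean isSelected(float score, float minScore) {
        return score >= minScore;
    }

    public static void main(String[] args) {
        Map<String, String> meta = Collections.singletonMap("lang", "en");
        float minScore = 0.5f; // assumed threshold, for illustration only
        System.out.println(isSelected(generateScore(meta, "lang", "en", 1.0f), minScore));
        System.out.println(isSelected(generateScore(meta, "lang", "fr", 1.0f), minScore));
    }
}
```

Records whose metadata does not match get a score of zero and fall below the minimum, which has the same effect as filtering them out in the generator.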

 Option to restrict generated records by metadata
 

 Key: NUTCH-1179
 URL: https://issues.apache.org/jira/browse/NUTCH-1179
 Project: Nutch
  Issue Type: New Feature
  Components: generator
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.10


 The generator should be able to select entries based on a metadata key/value 
 pair.





[jira] [Updated] (NUTCH-1539) Implement the Hypertext Induced Topic Search (HITS) algorithm in Nutch

2014-07-07 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1539:
-

Fix Version/s: (was: 1.9)
   1.10

 Implement the Hypertext Induced Topic Search (HITS) algorithm in Nutch
 --

 Key: NUTCH-1539
 URL: https://issues.apache.org/jira/browse/NUTCH-1539
 Project: Nutch
  Issue Type: Bug
  Components: linkdb
 Environment: CSCI 572: Search Engines and Information Retrieval @ 
 USC, http://sunset.usc.edu/classes/cs572_2010/
 Nutch 1.1
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.10

 Attachments: CS572CourseProjectReport_Yongqiang.pdf, 
 NUTCH-1538.yongqiang.Mattmann.030413.patch.txt, 
 csci572CourseProject_Yongqiang.rar


 In my Summer 2010 CSCI 572: Search Engines and Information Retrieval class, 
 my student Yongqiang Li and I implemented the HITS algorithm in Nutch based 
 on Jon Kleinberg's paper:
 Authoritative Sources in a Hyperlinked Environment
 http://dl.acm.org/citation.cfm?id=324140
 I'll put up the code we had shortly.
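
For orientation, the core HITS iteration from Kleinberg's paper can be sketched on a small in-memory graph. This is not the code attached to the issue: HitsSketch, the adjacency-matrix representation, and the fixed iteration count are assumptions chosen to keep the example short.

```java
import java.util.Arrays;

/** Hypothetical sketch of the HITS hub/authority iteration on a tiny link graph. */
final class HitsSketch {
    /** Runs HITS; adj[i][j] is true when page i links to page j. Returns {authority, hub}. */
    static double[][] run(boolean[][] adj, int iterations) {
        int n = adj.length;
        double[] auth = new double[n];
        double[] hub = new double[n];
        Arrays.fill(auth, 1.0);
        Arrays.fill(hub, 1.0);
        for (int it = 0; it < iterations; it++) {
            // Authority update: sum of hub scores of pages linking in.
            double[] newAuth = new double[n];
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    if (adj[i][j]) newAuth[j] += hub[i];
                }
            }
            normalize(newAuth);
            // Hub update: sum of the new authority scores of pages linked to.
            double[] newHub = new double[n];
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    if (adj[i][j]) newHub[i] += newAuth[j];
                }
            }
            normalize(newHub);
            auth = newAuth;
            hub = newHub;
        }
        return new double[][] {auth, hub};
    }

    /** Scales v to unit Euclidean length; leaves an all-zero vector unchanged. */
    static void normalize(double[] v) {
        double sumSquares = 0.0;
        for (double x : v) sumSquares += x * x;
        double norm = Math.sqrt(sumSquares);
        if (norm > 0.0) {
            for (int i = 0; i < v.length; i++) v[i] /= norm;
        }
    }

    public static void main(String[] args) {
        // Pages 0 and 1 both link to page 2.
        boolean[][] adj = {
            {false, false, true},
            {false, false, true},
            {false, false, false}
        };
        double[][] scores = run(adj, 10);
        System.out.println("authority=" + Arrays.toString(scores[0]));
        System.out.println("hub=" + Arrays.toString(scores[1]));
    }
}
```

In the example graph, page 2 ends up as the only authority and pages 0 and 1 as equally weighted hubs.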





[jira] [Updated] (NUTCH-849) different versions of the same library in nutch-2.0-dev.job and local\lib directory

2014-07-07 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-849:


Fix Version/s: (was: 1.9)
   1.10

 different versions of the same library in nutch-2.0-dev.job and local\lib 
 directory 
 

 Key: NUTCH-849
 URL: https://issues.apache.org/jira/browse/NUTCH-849
 Project: Nutch
  Issue Type: Task
Affects Versions: 1.4, nutchgora
 Environment: Window XP SP3, Cygwin
Reporter: Pham Tuan Minh
Priority: Minor
 Fix For: 1.10


 Hi,
 I found that after building the runtime, nutch-2.0-dev.job and the local\lib 
 directory contain different versions of the same library:
 ant-1.7.1.jar
 ant-1.6.5.jar
 servlet-api-2.5-20081211.jar
 servlet-api-2.5-6.1.14.jar
 I suspect these libraries come from different dependency branches. Can anyone 
 help me fix it?
 Thanks,
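
A quick way to spot such conflicts is to group jar file names by artifact and list those with more than one version. The sketch below is a hypothetical helper, not part of the Nutch build; it assumes file names of the form name-version.jar where the version starts with a digit.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: report artifacts that appear with more than one version. */
final class DuplicateJars {
    // Artifact name is everything before the first "-<digit>"; the rest is the version.
    private static final Pattern JAR = Pattern.compile("^(.+?)-(\\d.*)\\.jar$");

    /** Maps each artifact that has at least two distinct versions to its sorted versions. */
    static Map<String, SortedSet<String>> conflicts(List<String> jarNames) {
        Map<String, SortedSet<String>> versions = new TreeMap<>();
        for (String jar : jarNames) {
            Matcher m = JAR.matcher(jar);
            if (m.matches()) {
                versions.computeIfAbsent(m.group(1), k -> new TreeSet<>()).add(m.group(2));
            }
        }
        versions.values().removeIf(v -> v.size() < 2);
        return versions;
    }

    public static void main(String[] args) {
        List<String> jars = Arrays.asList(
                "ant-1.7.1.jar", "ant-1.6.5.jar",
                "servlet-api-2.5-20081211.jar", "servlet-api-2.5-6.1.14.jar");
        System.out.println(conflicts(jars));
    }
}
```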





[jira] [Updated] (NUTCH-1040) Backport REST-API from 2.0

2014-07-07 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1040:
-

Fix Version/s: (was: 1.9)
   1.10

 Backport REST-API from 2.0
 --

 Key: NUTCH-1040
 URL: https://issues.apache.org/jira/browse/NUTCH-1040
 Project: Nutch
  Issue Type: New Feature
  Components: REST_api
Reporter: Julien Nioche
 Fix For: 1.10


 See https://issues.apache.org/jira/browse/NUTCH-880 





[jira] [Updated] (NUTCH-1267) urlmeta to delegate indexing to index-metadata

2014-07-07 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1267:
-

Fix Version/s: (was: 1.9)
   1.10

 urlmeta to delegate indexing to index-metadata
 --

 Key: NUTCH-1267
 URL: https://issues.apache.org/jira/browse/NUTCH-1267
 Project: Nutch
  Issue Type: Sub-task
  Components: indexer
Affects Versions: 1.6
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.10


 Ideally we should get rid of urlmeta altogether and add the transmission of 
 the meta to the outlinks in the core classes - not as a plugin. URLMeta is 
 also a terrible name :-(





[jira] [Updated] (NUTCH-1371) Replace Ivy with Maven Ant tasks

2014-07-07 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1371:
-

Fix Version/s: (was: 1.9)
   1.10

 Replace Ivy with Maven Ant tasks
 

 Key: NUTCH-1371
 URL: https://issues.apache.org/jira/browse/NUTCH-1371
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 1.7, 2.2.1
Reporter: Julien Nioche
Assignee: Lewis John McGibbney
 Fix For: 1.10

 Attachments: NUTCH-1371-2x.patch, NUTCH-1371-plugins.trunk.patch, 
 NUTCH-1371-pom.patch, NUTCH-1371-r1461140.patch, NUTCH-1371.patch


 We might move to Maven altogether but a good intermediate step could be to 
 rely on the maven ant tasks for managing the dependencies. Ivy does a good 
 job but we need to have a pom file anyway for publishing the artefacts which 
 means keeping the pom.xml and ivy.xml contents in sync. Most devs are also 
 more familiar with Maven, and it is well integrated in IDEs. Going the 
 ANT+MVN way also means that we don't have to rewrite the whole building 
 process and can rely on our existing script





[jira] [Updated] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job

2014-07-07 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1712:
-

Fix Version/s: (was: 1.9)
   1.10

 Use MultipleInputs in Injector to make it a single mapreduce job
 

 Key: NUTCH-1712
 URL: https://issues.apache.org/jira/browse/NUTCH-1712
 Project: Nutch
  Issue Type: Improvement
  Components: injector
Affects Versions: 1.7
Reporter: Tejas Patil
Assignee: Tejas Patil
 Fix For: 1.10

 Attachments: NUTCH-1712-trunk.v1.patch


 Currently Injector creates two mapreduce jobs:
 1. sort job: get the urls from seeds file, emit CrawlDatum objects.
 2. merge job: read CrawlDatum objects from both crawldb and output of sort 
 job. Merge and emit final CrawlDatum objects.
 Using MultipleInputs, we can read CrawlDatum objects from crawldb and urls 
 from seeds file simultaneously and perform inject in a single map-reduce job.
 Also, here are additional things covered with this jira:
 1. Pushed filtering and normalization above metadata extraction so that the 
 unwanted records are ruled out quickly.
 2. Migrated to new mapreduce API
 3. Improved documentation 
 4. New junits with better coverage
 Relevant discussion over nutch-dev can be found here:
 http://mail-archives.apache.org/mod_mbox/nutch-dev/201401.mbox/%3ccafkhtfyxo6wl7gyuv+a5y1pzntdcoqpz4jz_up_bkp9cje8...@mail.gmail.com%3E





[jira] [Updated] (NUTCH-1561) improve usability of parse-metatags and index-metadata

2014-07-07 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1561:
-

Fix Version/s: (was: 1.10)
   1.9

 improve usability of parse-metatags and index-metadata
 --

 Key: NUTCH-1561
 URL: https://issues.apache.org/jira/browse/NUTCH-1561
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.6
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
Priority: Minor
 Fix For: 1.9

 Attachments: NUTCH-1561-trunk-v2.patch, NUTCH-1561-v1.patch


 Usually, the plugins parse-metatags and index-metadata are used in 
 combination: the former extracts meta tags, the latter adds the extracted 
 tags as fields to the index. 
 Configuration of the two plugins differs which causes pitfalls and reduces 
 the usability (see example config):
 * the property metatags.names of parse-metatags uses ';' as separator 
 instead of ',' used by index-metadata
 * meta tags have to be lowercased in index-metadata
 {code}
 <property>
   <name>metatags.names</name>
   <value>DC.creator;DCTERMS.bibliographicCitation</value>
 </property>
 <property>
   <name>index.parse.md</name>
   <value>metatag.dc.creator,metatag.dcterms.bibliographiccitation</value>
 </property>
 {code}
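
The mismatch between the two property values is mechanical, which the following sketch makes explicit. MetatagConfig is a hypothetical helper, not part of either plugin; it only encodes the two rules listed above (';' versus ',' and lowercasing) plus the metatag. prefix visible in the example config.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.stream.Collectors;

/** Hypothetical sketch: derive an index.parse.md value from a metatags.names value. */
final class MetatagConfig {
    /** Splits on ';', lowercases each tag, prefixes it with "metatag." and joins with ','. */
    static String toIndexParseMd(String metatagsNames) {
        return Arrays.stream(metatagsNames.split(";"))
                .map(String::trim)
                .filter(tag -> !tag.isEmpty())
                .map(tag -> "metatag." + tag.toLowerCase(Locale.ROOT))
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        System.out.println(toIndexParseMd("DC.creator;DCTERMS.bibliographicCitation"));
    }
}
```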





[jira] [Updated] (NUTCH-926) Redirections from META tag don't get filtered

2014-07-07 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-926:


Fix Version/s: (was: 1.10)
   1.9

 Redirections from META tag don't get filtered
 -

 Key: NUTCH-926
 URL: https://issues.apache.org/jira/browse/NUTCH-926
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.2
 Environment: gnu/linux centOs
Reporter: Marco Novo
 Fix For: 1.9

 Attachments: NUTCH-926-trunk.patch, ParseOutputFormat.java.patch


 We have nutch set to crawl a domain urllist and we want to fetch only passed 
 domains (hosts) not subdomains.
 So
 WWW.DOMAIN1.COM
 ..
 ..
 ..
 WWW.RIGHTDOMAIN.COM
 ..
 ..
 ..
 ..
 WWW.DOMAIN.COM
 We set Nutch to:
 NOT FOLLOW EXTERNAL LINKS
 During crawling of WWW.RIGHTDOMAIN.COM
 if a page contains
 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
 <html>
 <head>
 <title></title>
 <META http-equiv="refresh" content="0;
 url=http://WRONG.RIGHTDOMAIN.COM">
 </head>
 <body>
 </body>
 </html>
 Nutch continues to crawl the WRONG subdomains! But it should not do this!!
 During crawling of WWW.RIGHTDOMAIN.COM
 if a page contains
 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
 <html>
 <head>
 <title></title>
 <META http-equiv="refresh" content="0;
 url=http://WWW.WRONGDOMAIN.COM">
 </head>
 <body>
 </body>
 </html>
 Nutch continues to crawl the WRONG domain! But it should not do this! Otherwise 
 we will spider the whole web.
 We think the problem is in org.apache.nutch.parse.ParseOutputFormat. We have 
 written a patch and will attach it.
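
The check the reporters expect for a META refresh target can be sketched as a plain host comparison. This is a hypothetical illustration, not the attached patch: RedirectHostFilter and its method are invented here, and real Nutch filtering goes through its URL filter and normalizer plugins.

```java
import java.net.URI;

/** Hypothetical sketch: accept a redirect target only if it stays on the source host. */
final class RedirectHostFilter {
    /** Returns true when both URLs have the same host, compared case-insensitively. */
    static boolean sameHost(String fromUrl, String toUrl) {
        String from = URI.create(fromUrl).getHost();
        String to = URI.create(toUrl).getHost();
        return from != null && from.equalsIgnoreCase(to);
    }

    public static void main(String[] args) {
        String page = "http://WWW.RIGHTDOMAIN.COM/index.html";
        System.out.println(sameHost(page, "http://www.rightdomain.com/other.html"));
        System.out.println(sameHost(page, "http://WRONG.RIGHTDOMAIN.COM"));
        System.out.println(sameHost(page, "http://WWW.WRONGDOMAIN.COM"));
    }
}
```

Both redirect targets from the report are rejected by this comparison, because neither the subdomain nor the foreign domain matches the source host.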





[jira] [Updated] (NUTCH-1708) use same id when indexing and deleting redirects

2014-07-07 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1708:
-

Fix Version/s: 1.9

 use same id when indexing and deleting redirects
 

 Key: NUTCH-1708
 URL: https://issues.apache.org/jira/browse/NUTCH-1708
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.7
Reporter: Sebastian Nagel
 Fix For: 1.9

 Attachments: NUTCH-1708-2x-v1.patch, NUTCH-1708-trunk-v1.patch


 Redirect targets are indexed using representative URL
 * in Fetcher repr URL is determined by URLUtil.chooseRepr() and stored in 
 CrawlDatum (CrawlDb). Repr URL is either source or target URL of the redirect 
 pair.
 * NutchField url is filled by basic indexing filter with repr URL
 * id field used as unique key is filled from url per solrindex-mapping.xml
 Deletion of redirects is done in IndexerMapReduce.reduce() by key which is 
 the URL of the redirect source. If the source URL is chosen as repr URL a 
 redirect target may get erroneously deleted.
 Test crawl with seed {{http://wiki.apache.org/nutch}} which redirects to 
 {{http://wiki.apache.org/nutch/}}. DummyIndexWriter (NUTCH-1707) indicates 
 that same URL is deleted and added:
 {code}
 delete  http://wiki.apache.org/nutch
 add http://wiki.apache.org/nutch
 {code}





Re: Nearing a 1.9 release?

2014-07-07 Thread Julien Nioche
Hi,

I've moved all the open issues that were marked with fix version = 1.9 to
1.10 except for the ones that Seb mentioned earlier.

Please go through the issues listed for 1.10
https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%201.10%20AND%20project%20%3D%20NUTCH%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20updated%20DESC
and
change their fix version back to 1.9 if you think they should be included
in the next release.

Thanks

Julien



On 29 June 2014 10:20, Julien Nioche lists.digitalpeb...@gmail.com wrote:

 Hi guys,

 We've done loads of good work on the trunk since the last release, in
 particular :

- NUTCH-1736 https://issues.apache.org/jira/browse/NUTCH-1736
- NUTCH-1647 https://issues.apache.org/jira/browse/NUTCH-1647
- NUTCH-1793 https://issues.apache.org/jira/browse/NUTCH-1793

 which are important bug fixes (NUTCH-578
 https://issues.apache.org/jira/browse/NUTCH-578 will also be an
 important one).

 If you want to help make the new release happen, could you please go
 through the issues listed for 1.9
 https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%201.9%20AND%20project%20%3D%20NUTCH%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20updated%20DESC
 and vote for the ones you think should be included in the next release /
 comment on issues opened by others / review the patches / contribute to the
 discussions?

 Thanks!

 Julien

 --

 Open Source Solutions for Text Engineering

 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble




-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


[jira] [Created] (NUTCH-1812) Create Vagrant artifacts for 2.X branch

2014-07-07 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-1812:
---

 Summary: Create Vagrant artifacts for 2.X branch
 Key: NUTCH-1812
 URL: https://issues.apache.org/jira/browse/NUTCH-1812
 Project: Nutch
  Issue Type: Improvement
  Components: build
Reporter: Lewis John McGibbney
 Fix For: 2.4


Vagrant [0] is a very useful tool which I believe we can use to create VM 
containers to help aid the process of setting up and running a Nutch 2.X build. 
Right now, provisioning Nutch 2.X is a PITA for new users.
We've been working with Vagrant a lot. I feel that it would be really nice to 
just drop an image into VirtualBox (or something similar) and then just 
start.

[0] http://www.vagrantup.com/


