[jira] [Updated] (NUTCH-882) Design a Host table in GORA

2012-04-19 Thread Julien Nioche (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-882:


Assignee: (was: Julien Nioche)

 Design a Host table in GORA
 ---

 Key: NUTCH-882
 URL: https://issues.apache.org/jira/browse/NUTCH-882
 Project: Nutch
  Issue Type: New Feature
Affects Versions: nutchgora
Reporter: Julien Nioche
 Fix For: nutchgora

 Attachments: NUTCH-882-v1.patch, hostdb.patch


 Having a separate GORA table for storing information about hosts (and 
 domains?) would be very useful for: 
 * customising the fetching behaviour on a per-host basis, e.g. number of 
 threads, min time between threads, etc.
 * storing stats
 * keeping metadata and possibly propagating it to the webpages 
 * keeping a copy of the robots.txt and possibly using it later to filter the 
 webtable
 * storing sitemap files and updating the webtable accordingly
 I'll try to come up with a GORA schema for such a host table, but any comments 
 are of course already welcome.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (NUTCH-1331) limit crawler to defined depth

2012-04-18 Thread Julien Nioche (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1331:
-

Attachment: NUTCH-1331-v2.patch

Attached is an implementation of what I described earlier. This has been 
generously donated by www.ant.com

This makes it possible to track the depth of a URL and remove its outlinks 
based on a global or per-seed setting.

 

 limit crawler to defined depth
 --

 Key: NUTCH-1331
 URL: https://issues.apache.org/jira/browse/NUTCH-1331
 Project: Nutch
  Issue Type: New Feature
  Components: generator, parser, storage
Affects Versions: 1.4
Reporter: behnam nikbakht
 Attachments: NUTCH-1331-v2.patch, NUTCH-1331.patch


 There is a need to limit the crawler to a defined depth. The purpose of this 
 option is to avoid crawling infinite loops of dynamically generated URLs, 
 which occur on some sites, and to let the crawler select important URLs.
 One option is to define an iteration limit on the generate, fetch, parse, 
 updatedb cycle, but this works only if all unfetched URLs are fetched in each 
 cycle (without recrawling them, and with some other considerations).
 We can define a new parameter in CrawlDatum, named depth, and, like the 
 score-opic algorithm, compute the depth of a link after parse and, in 
 generate, only select URLs with a valid depth.
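
 The depth rule described above can be sketched as follows. This is only an illustration, not the code in the attached patches: the class DepthLimiter, its methods and the convention that seeds start at depth 1 are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch or of the attached patches.
class DepthLimiter {
    private final int maxDepth;

    DepthLimiter(int maxDepth) {
        this.maxDepth = maxDepth;
    }

    /** An outlink is one level deeper than the page it was found on. */
    static int childDepth(int parentDepth) {
        return parentDepth + 1;
    }

    /** Keeps the outlinks only if they would still be within the limit. */
    List<String> filterOutlinks(int parentDepth, List<String> outlinks) {
        List<String> kept = new ArrayList<>();
        if (childDepth(parentDepth) <= maxDepth) {
            kept.addAll(outlinks);
        }
        return kept;
    }
}
```

 With a limit of 2 and seeds at depth 1, the outlinks of a seed are kept, while the outlinks of those pages are dropped.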





[jira] [Updated] (NUTCH-1334) NPE in FetcherOutputFormat

2012-04-16 Thread Julien Nioche (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1334:
-

Attachment: NUTCH-1334.patch

Will commit post 1.5

 NPE in FetcherOutputFormat 
 ---

 Key: NUTCH-1334
 URL: https://issues.apache.org/jira/browse/NUTCH-1334
 Project: Nutch
  Issue Type: Bug
Reporter: Julien Nioche
 Attachments: NUTCH-1334.patch


 If fetcher.parse or fetcher.store.content are set to false AND the write 
 method receives an instance of Parse or Content, a NPE will be thrown.
 This usually does not happen as the Fetcher does not output a Parse or 
 Content based on the configuration, however this class is also used by the 
 ArcSegmentCreator which is unaware of these parameters and will output a 
 Parse or Content instance regardless of the configuration. One option would 
 be to make the ArcSegmentCreator aware of the fetcher.* parameters and output 
 things accordingly but it also makes sense to modify the FetcherOutputFormat 
 so that it checks whether a subWriter has been created before trying to use 
 it.
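
 The second option, checking that a sub-writer exists before using it, might look like the sketch below. It does not reproduce FetcherOutputFormat: the class GuardedOutput is hypothetical, and plain lists stand in for the real sub-writers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an output format with optional sub-writers.
class GuardedOutput {
    // A sub-writer is null when the matching option is switched off.
    private final List<String> contentOut;
    private final List<String> parseOut;

    GuardedOutput(boolean storeContent, boolean parse) {
        this.contentOut = storeContent ? new ArrayList<>() : null;
        this.parseOut = parse ? new ArrayList<>() : null;
    }

    /** Silently skips the record when content storage is disabled. */
    void writeContent(String key) {
        if (contentOut != null) {
            contentOut.add(key);
        }
    }

    /** Silently skips the record when parsing output is disabled. */
    void writeParse(String key) {
        if (parseOut != null) {
            parseOut.add(key);
        }
    }

    int contentCount() {
        return contentOut == null ? 0 : contentOut.size();
    }

    int parseCount() {
        return parseOut == null ? 0 : parseOut.size();
    }
}
```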





[jira] [Updated] (NUTCH-809) Parse-metatags plugin

2012-03-21 Thread Julien Nioche (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-809:


Attachment: NUTCH-809-trunk.patch

Patch for Nutch-809 against trunk. Delegates the indexing to index-metatags

 Parse-metatags plugin
 -

 Key: NUTCH-809
 URL: https://issues.apache.org/jira/browse/NUTCH-809
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.4, nutchgora
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.5

 Attachments: NUTCH-809-trunk.patch, NUTCH-809.patch, 
 NUTCH-809_metatags_1.3.patch, metatags-plugin+tutorial.zip


 h2. Parse-metatags plugin
 The parse-metatags plugin consists of a HTMLParserFilter which takes as 
 parameter a list of metatag names with '*' as default value. The values are 
 separated by ';'.
 In order to extract the values of the metatags description and keywords, you 
 must specify in nutch-site.xml
 {code:xml}
 <property>
   <name>metatags.names</name>
   <value>description;keywords</value>
 </property>
 {code}
 The MetatagIndexer uses the output of the parsing above to create two fields 
 'keywords' and 'description'. Note that keywords is multivalued.
 The query-basic plugin is used to include these fields in the search e.g. in 
 nutch-site.xml
 {code:xml}
 <property>
   <name>query.basic.description.boost</name>
   <value>2.0</value>
 </property>
 <property>
   <name>query.basic.keywords.boost</name>
   <value>2.0</value>
 </property>
 {code}
 This code has been developed by DigitalPebble Ltd and offered to the 
 community by ANT.com
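
 As an illustration of how a ';'-separated list of metatag names with '*' as default could be interpreted, here is a minimal sketch. It is not taken from the plugin: the class MetatagNames and the case-insensitive matching are assumptions made for this example.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not the plugin's own code.
class MetatagNames {
    /** Splits the configured value on ';' and falls back to '*' when unset. */
    static Set<String> parse(String value) {
        if (value == null || value.trim().isEmpty()) {
            value = "*";
        }
        Set<String> names = new LinkedHashSet<>();
        for (String name : value.split(";")) {
            String trimmed = name.trim().toLowerCase(Locale.ROOT);
            if (!trimmed.isEmpty()) {
                names.add(trimmed);
            }
        }
        return names;
    }

    /** '*' accepts every metatag; otherwise the name must be listed. */
    static boolean accepts(Set<String> names, String metatag) {
        return names.contains("*")
            || names.contains(metatag.toLowerCase(Locale.ROOT));
    }
}
```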





[jira] [Updated] (NUTCH-1259) Store detected content type in crawldatum metadata

2012-02-13 Thread Julien Nioche (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1259:
-

Summary: Store detected content type in crawldatum metadata  (was: 
TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata)

 Store detected content type in crawldatum metadata
 --

 Key: NUTCH-1259
 URL: https://issues.apache.org/jira/browse/NUTCH-1259
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1259-1.5-1.patch


 The MIME-type detected by Tika's Detect() API is never added to a Parse's 
 ContentMetaData or ParseMetaData. Because of this, bad Content-Types will end 
 up in the documents. 





[jira] [Updated] (NUTCH-1258) MoreIndexingFilter should be able to read Content-Type from both parse metadata and content metadata

2012-02-13 Thread Julien Nioche (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1258:
-

Attachment: NUTCH-1258-v2.patch

We now have access to the detected content-type from the crawldatum metadata as 
of NUTCH-1259. This patch tries to get this first then goes in the parse 
metadata.
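
The lookup order described above (crawldatum metadata first, then parse metadata) can be sketched as follows. This is not the code of the patch: the class ContentTypeResolver, the use of plain maps and the "Content-Type" key for the crawldatum entry are assumptions made for this example.

```java
import java.util.Map;

// Hypothetical helper, not taken from NUTCH-1258-v2.patch.
class ContentTypeResolver {
    static final String KEY = "Content-Type";

    /** Prefers the detected type, then falls back to the parse metadata. */
    static String resolve(Map<String, String> crawlDatumMeta,
                          Map<String, String> parseMeta) {
        String type = crawlDatumMeta.get(KEY);
        if (type == null || type.isEmpty()) {
            type = parseMeta.get(KEY);
        }
        return type;
    }
}
```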


 MoreIndexingFilter should be able to read Content-Type from both parse 
 metadata and content metadata
 

 Key: NUTCH-1258
 URL: https://issues.apache.org/jira/browse/NUTCH-1258
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5

 Attachments: NUTCH-1258-1.5-1.patch, NUTCH-1258-v2.patch


 The MoreIndexingFilter reads the Content-Type from parse metadata. However, 
 this usually contains a lot of crap because web developers can set it to 
 anything they like. The filter must be able to read the Content-Type field 
 from content metadata as well because that contains the type detected by 
 Tika's Detector.





[jira] [Updated] (NUTCH-1264) Configurable indexing plugin (index-extra)

2012-02-06 Thread Julien Nioche (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1264:
-

Attachment: NUTCH-1264-trunk-v2.patch

 Configurable indexing plugin (index-extra) 
 ---

 Key: NUTCH-1264
 URL: https://issues.apache.org/jira/browse/NUTCH-1264
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.5
Reporter: Julien Nioche
 Attachments: NUTCH-1264-trunk-v2.patch, NUTCH-1264-trunk.patch


 We currently have several plugins already distributed or proposed which do 
 very comparable things : 
 - parse-meta [NUTCH-809] to generate metadata fields in parse-metadata and 
 index them
 - headings [NUTCH-1005] to generate headings fields in parse-metadata and 
 index them
 - index-extra [NUTCH-422] to index configurable fields 
 - urlmeta [NUTCH-855] to propagate metadata from the seeds to the outlinks 
 and index them
 - index-static [NUTCH-940] to generate configurable static fields 
 All these plugins have in common that they allow to extract information from 
 various sources and generate fields from them and are largely redundant. 
 Instead this issue proposes to have a single plugin allowing to generate 
 configurable fields from : 
 - static values
 - parse metadata
 - content metadata
 - crawldb metadata
 and let the other plugins focus on the parsing and extraction of the values 
 to index. This will make the addition of new fields simpler by relying on a 
 stable common plugin instead of multiplying the code in various plugins.
 This plugin will replace index-static [NUTCH-940] and index-extra [NUTCH-422] 
 and will serve as a basis for further improvements.
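
 To make the idea concrete, the sketch below generates fields from the four sources listed above. It is not the attached patch: the class FieldGenerator, the Rule type and the map-based metadata are hypothetical names chosen for this example, and the real plugin's configuration may look quite different.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the code in NUTCH-1264-trunk.patch.
class FieldGenerator {
    /** The four sources named in the issue. */
    enum Source { STATIC, PARSE, CONTENT, CRAWLDB }

    /** Maps one output field to a key in a source (or to a literal value). */
    static final class Rule {
        final String field;
        final Source source;
        final String key; // for STATIC, this holds the literal value

        Rule(String field, Source source, String key) {
            this.field = field;
            this.source = source;
            this.key = key;
        }
    }

    /** Builds the document fields; rules whose key is missing are skipped. */
    static Map<String, String> generate(
            List<Rule> rules, Map<Source, Map<String, String>> metadata) {
        Map<String, String> doc = new LinkedHashMap<>();
        for (Rule rule : rules) {
            String value;
            if (rule.source == Source.STATIC) {
                value = rule.key;
            } else {
                Map<String, String> src = metadata.get(rule.source);
                value = (src == null) ? null : src.get(rule.key);
            }
            if (value != null) {
                doc.put(rule.field, value);
            }
        }
        return doc;
    }
}
```

 With this split, parsing plugins only need to put values into the metadata, and a single rule list decides which of them become fields.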





[jira] [Updated] (NUTCH-1264) Configurable indexing plugin (index-metadata)

2012-02-06 Thread Julien Nioche (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1264:
-

Description: 
We currently have several plugins already distributed or proposed which do very 
comparable things : 
- parse-meta [NUTCH-809] to generate metadata fields in parse-metadata and 
index them
- headings [NUTCH-1005] to generate headings fields in parse-metadata and index 
them
- index-extra [NUTCH-422] to index configurable fields 
- urlmeta [NUTCH-855] to propagate metadata from the seeds to the outlinks and 
index them
- index-static [NUTCH-940] to generate configurable static fields 

All these plugins have in common that they allow to extract information from 
various sources and generate fields from them and are largely redundant. 
Instead this issue proposes to have a single plugin allowing to generate 
configurable fields from : 
- static values
- parse metadata
- content metadata
- crawldb metadata

and let the other plugins focus on the parsing and extraction of the values to 
index. This will make the addition of new fields simpler by relying on a stable 
common plugin instead of multiplying the code in various plugins.

This plugin will replace index-extra [NUTCH-422] and will serve as a basis for 
further improvements.




  was:
We currently have several plugins already distributed or proposed which do very 
comparable things : 
- parse-meta [NUTCH-809] to generate metadata fields in parse-metadata and 
index them
- headings [NUTCH-1005] to generate headings fields in parse-metadata and index 
them
- index-extra [NUTCH-422] to index configurable fields 
- urlmeta [NUTCH-855] to propagate metadata from the seeds to the outlinks and 
index them
- index-static [NUTCH-940] to generate configurable static fields 

All these plugins have in common that they allow to extract information from 
various sources and generate fields from them and are largely redundant. 
Instead this issue proposes to have a single plugin allowing to generate 
configurable fields from : 
- static values
- parse metadata
- content metadata
- crawldb metadata

and let the other plugins focus on the parsing and extraction of the values to 
index. This will make the addition of new fields simpler by relying on a stable 
common plugin instead of multiplying the code in various plugins.

This plugin will replace index-static [NUTCH-940] and index-extra [NUTCH-422] 
and will serve as a basis for further improvements.




Summary: Configurable indexing plugin (index-metadata)   (was: 
Configurable indexing plugin (index-extra) )

 Configurable indexing plugin (index-metadata) 
 --

 Key: NUTCH-1264
 URL: https://issues.apache.org/jira/browse/NUTCH-1264
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.5
Reporter: Julien Nioche
 Attachments: NUTCH-1264-trunk-v2.patch, NUTCH-1264-trunk.patch


 We currently have several plugins already distributed or proposed which do 
 very comparable things : 
 - parse-meta [NUTCH-809] to generate metadata fields in parse-metadata and 
 index them
 - headings [NUTCH-1005] to generate headings fields in parse-metadata and 
 index them
 - index-extra [NUTCH-422] to index configurable fields 
 - urlmeta [NUTCH-855] to propagate metadata from the seeds to the outlinks 
 and index them
 - index-static [NUTCH-940] to generate configurable static fields 
 All these plugins have in common that they allow to extract information from 
 various sources and generate fields from them and are largely redundant. 
 Instead this issue proposes to have a single plugin allowing to generate 
 configurable fields from : 
 - static values
 - parse metadata
 - content metadata
 - crawldb metadata
 and let the other plugins focus on the parsing and extraction of the values 
 to index. This will make the addition of new fields simpler by relying on a 
 stable common plugin instead of multiplying the code in various plugins.
 This plugin will replace index-extra [NUTCH-422] and will serve as a basis 
 for further improvements.





[jira] [Updated] (NUTCH-1264) Configurable indexing plugin (index-extra)

2012-02-01 Thread Julien Nioche (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1264:
-

Attachment: NUTCH-1264-trunk.patch

 Configurable indexing plugin (index-extra) 
 ---

 Key: NUTCH-1264
 URL: https://issues.apache.org/jira/browse/NUTCH-1264
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.5
Reporter: Julien Nioche
 Attachments: NUTCH-1264-trunk.patch


 We currently have several plugins already distributed or proposed which do 
 very comparable things : 
 - parse-meta [NUTCH-809] to generate metadata fields in parse-metadata and 
 index them
 - headings [NUTCH-1005] to generate headings fields in parse-metadata and 
 index them
 - index-extra [NUTCH-422] to index configurable fields 
 - urlmeta [NUTCH-855] to propagate metadata from the seeds to the outlinks 
 and index them
 - index-static [NUTCH-940] to generate configurable static fields 
 All these plugins have in common that they allow to extract information from 
 various sources and generate fields from them and are largely redundant. 
 Instead this issue proposes to have a single plugin allowing to generate 
 configurable fields from : 
 - static values
 - parse metadata
 - content metadata
 - crawldb metadata
 and let the other plugins focus on the parsing and extraction of the values 
 to index. This will make the addition of new fields simpler by relying on a 
 stable common plugin instead of multiplying the code in various plugins.
 This plugin will replace index-static [NUTCH-940] and index-extra [NUTCH-422] 
 and will serve as a basis for further improvements.





[jira] [Updated] (NUTCH-1243) Junit jar removed from lib

2012-01-05 Thread Julien Nioche (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1243:
-

Description: 
When calling 'ant test' the junit jar is added to the lib dir by Ivy but gets 
removed before the test classes are compiled.

-This seems to happen with Ivy 2.1 but not with Ivy 2.2.-
-We do have 2.2 in the /ivy directory but the ant script uses whatever is found 
in ~/.ant/lib - ideally we would like to be able to force the location of the 
jar file.-

Actually the issue also happens with Ivy 2.2. I will commit a quick fix 
consisting of adding junit in the default ivy configuration, however it will be 
good to get to the bottom of this. 

  was:
When calling 'ant test' the junit jar is added to the lib dir by Ivy but gets 
removed before the test classes are compiled. This seems to happen with Ivy 2.1 
but not with Ivy 2.2.
We do have 2.2 in the /ivy directory but the ant script uses whatever is found 
in ~/.ant/lib - ideally we would like to be able to force the location of the 
jar file.
As seen in [NUTCH-995] a workaround is to call : 'ant -lib ivy test' but having 
the value coded in the build script would be better


 Junit jar removed from lib
 --

 Key: NUTCH-1243
 URL: https://issues.apache.org/jira/browse/NUTCH-1243
 Project: Nutch
  Issue Type: Bug
  Components: build
Affects Versions: 1.5
 Environment: Ivy 2.1.0 - 20090925235825
Reporter: Julien Nioche

 When calling 'ant test' the junit jar is added to the lib dir by Ivy but gets 
 removed before the test classes are compiled.
 -This seems to happen with Ivy 2.1 but not with Ivy 2.2.-
 -We do have 2.2 in the /ivy directory but the ant script uses whatever is 
 found in ~/.ant/lib - ideally we would like to be able to force the location 
 of the jar file.-
 Actually the issue also happens with Ivy 2.2. I will commit a quick fix 
 consisting of adding junit in the default ivy configuration, however it will 
 be good to get to the bottom of this. 





[jira] [Updated] (NUTCH-1053) Parsing of RSS feeds fails

2011-10-11 Thread Julien Nioche (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1053:
-

Fix Version/s: 1.5

I'd happily give an example or fix it myself if only I could find it :-)
Moved to 1.5 and left open for now

 Parsing of RSS feeds fails 
 ---

 Key: NUTCH-1053
 URL: https://issues.apache.org/jira/browse/NUTCH-1053
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.5

 Attachments: seed.txt


 See discussion on 
 http://lucene.472066.n3.nabble.com/RSS-feed-parsing-on-Nutch-1-3-td3166487.html
 Have been able to reproduce the problem and will look into it





[jira] [Updated] (NUTCH-1053) Parsing of RSS feeds fails

2011-10-11 Thread Julien Nioche (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1053:
-

Fix Version/s: (was: 1.4)

 Parsing of RSS feeds fails 
 ---

 Key: NUTCH-1053
 URL: https://issues.apache.org/jira/browse/NUTCH-1053
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.5

 Attachments: seed.txt


 See discussion on 
 http://lucene.472066.n3.nabble.com/RSS-feed-parsing-on-Nutch-1-3-td3166487.html
 Have been able to reproduce the problem and will look into it





[jira] [Updated] (NUTCH-1046) Add tests for indexing to SOLR

2011-09-29 Thread Julien Nioche (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1046:
-

Affects Version/s: (was: 1.4)
   (was: 2.0)
Fix Version/s: (was: 1.4)
   (was: 2.0)
   1.5

 Add tests for indexing to SOLR
 --

 Key: NUTCH-1046
 URL: https://issues.apache.org/jira/browse/NUTCH-1046
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Julien Nioche
 Fix For: 1.5


 We currently have no tests for checking that the indexing to SOLR works as 
 expected. Running an embedded SOLR Server within the tests would be good.





[jira] [Updated] (NUTCH-1064) o.a.n.util.MimeUtil uses deprecated Tika methods

2011-09-29 Thread Julien Nioche (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1064:
-

Fix Version/s: (was: 1.4)
   1.5

Postpone to 1.5. Should have new Tika version available in the meantime

 o.a.n.util.MimeUtil uses deprecated Tika methods
 

 Key: NUTCH-1064
 URL: https://issues.apache.org/jira/browse/NUTCH-1064
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.5


 this class is in serious need of refactoring as the underlying Tika API has 
 changed a lot. The logic around what strategies to use e.g. trust the 
 metadata returned by the server? trust Tika's detection? etc... should be 
 reimplemented using the Detector implementations





[jira] [Updated] (NUTCH-1090) LinkDb (invertlinks) should inform the user when it ignores internal links

2011-09-28 Thread Julien Nioche (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1090:
-

Fix Version/s: (was: 1.3)
   1.5

 LinkDb (invertlinks) should inform the user when it ignores internal links
 --

 Key: NUTCH-1090
 URL: https://issues.apache.org/jira/browse/NUTCH-1090
 Project: Nutch
  Issue Type: Improvement
  Components: linkdb
Affects Versions: 1.3
Reporter: Marek Bachmann
Priority: Trivial
  Labels: configuration, information, log
 Fix For: 1.5

 Attachments: LinkDb.patch


 I used Nutch to crawl sites on a single domain. After the crawl was complete 
 I tried to build a LinkDb. The LinkDb was empty. 
 It turns out that this happens because the invertlinks command ignores 
 internal links to the same domain by default. 
 Unfortunately the LinkDb class doesn't report anything about that, so it was 
 hard to find out why the LinkDb was empty. 
 I suggest adding a message to inform the user when the invertlinks command is 
 ignoring internal links.





[jira] [Updated] (NUTCH-1040) Backport REST-API from 2.0

2011-09-28 Thread Julien Nioche (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1040:
-

Affects Version/s: (was: 1.4)
Fix Version/s: 1.5
   Issue Type: New Feature  (was: Task)

 Backport REST-API from 2.0
 --

 Key: NUTCH-1040
 URL: https://issues.apache.org/jira/browse/NUTCH-1040
 Project: Nutch
  Issue Type: New Feature
  Components: REST_api
Reporter: Julien Nioche
 Fix For: 1.4, 1.5


 See https://issues.apache.org/jira/browse/NUTCH-880 





[jira] [Updated] (NUTCH-1129) Any23 Nutch plugin

2011-09-28 Thread Julien Nioche (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1129:
-

Affects Version/s: (was: 1.4)
Fix Version/s: (was: 1.4)
   1.5

 Any23 Nutch plugin
 --

 Key: NUTCH-1129
 URL: https://issues.apache.org/jira/browse/NUTCH-1129
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: 1.5


 This plugin should build on the Any23 library to provide us with a plugin 
 which extracts RDF data from HTTP and file resources. Although as of writing 
 Any23 is not part of the ASF, the project is working towards integration into 
 the Apache Incubator. Once the project proves its value, this would be an 
 excellent addition to the Nutch 1.X codebase. 





[jira] [Updated] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

2011-09-28 Thread Julien Nioche (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-585:


Fix Version/s: (was: 1.4)
   1.5

Marking for 1.5. Needs reviewing and won't make it into 1.4

 [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
 ---

 Key: NUTCH-585
 URL: https://issues.apache.org/jira/browse/NUTCH-585
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
 Environment: All operating systems
Reporter: Andrea Spinelli
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5

 Attachments: blacklist_whitelist_plugin.patch, 
 nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch


 We are using nutch to index our own web sites; we would like not to index 
 certain parts of our pages, because we know they are not relevant (for 
 instance, there are several links to change the background color) and 
 generate spurious matches.
 We have modified the plugin so that it ignores HTML code between certain HTML 
 comments, like
 <!-- START-IGNORE -->
 ... ignored part ...
 <!-- STOP-IGNORE -->
 We feel this might be useful to someone else, maybe factoring out the comment 
 strings as constants in the configuration files (say parser.html.ignore.start 
 and parser.html.ignore.stop in nutch-site.xml).
 We are almost ready to contribute our code snippet. Looking forward to any 
 expression of interest, or to an explanation of why what we are doing is 
 plain wrong!
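
 A minimal sketch of the idea, working on the raw HTML string, is shown below. It is not the contributed code: the class IgnoreMarkers is hypothetical, the markers are passed in as parameters instead of being read from the configuration, and dropping everything after an unterminated start marker is an assumption made for this example.

```java
// Hypothetical helper, not taken from any of the attached patches.
class IgnoreMarkers {
    /** Removes every span from a start marker up to the next stop marker. */
    static String strip(String html, String start, String stop) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (true) {
            int s = html.indexOf(start, pos);
            if (s < 0) {
                // No more ignored blocks: keep the remainder.
                out.append(html, pos, html.length());
                break;
            }
            out.append(html, pos, s);
            int e = html.indexOf(stop, s + start.length());
            if (e < 0) {
                // Unterminated block: drop the rest of the page.
                break;
            }
            pos = e + stop.length();
        }
        return out.toString();
    }
}
```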





[jira] [Updated] (NUTCH-1079) StringBuffer converted to StringBuilder

2011-09-28 Thread Julien Nioche (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1079:
-

 Priority: Minor  (was: Major)
Affects Version/s: (was: 1.3)
Fix Version/s: (was: 1.4)
   1.5
   Issue Type: Improvement  (was: Bug)

Not a bug but an improvement. Moved from 1.4 to 1.5

 StringBuffer converted to StringBuilder
 ---

 Key: NUTCH-1079
 URL: https://issues.apache.org/jira/browse/NUTCH-1079
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher, indexer
Reporter: Kay Kay
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5

 Attachments: NUTCH-1079.patch, NUTCH-rel_14-1079.patch


 The codebase uses StringBuffer throughout, even where thread-safety is 
 probably not needed. 
 This patch replaces StringBuffer with StringBuilder, where applicable. 
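
 For a buffer that never leaves the method that creates it, the replacement is a one-word change, as in the sketch below. The class JoinExample is hypothetical and is not taken from the patch.

```java
// Hypothetical example, not taken from NUTCH-1079.patch.
class JoinExample {
    /** Joins the parts with a separator using an unsynchronized builder. */
    static String join(String[] parts, char sep) {
        // The builder is local to this call, so the synchronization
        // that StringBuffer provides is not needed here.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                sb.append(sep);
            }
            sb.append(parts[i]);
        }
        return sb.toString();
    }
}
```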





[jira] [Updated] (NUTCH-1047) Pluggable indexing backends

2011-09-28 Thread Julien Nioche (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1047:
-

Affects Version/s: (was: 1.4)
Fix Version/s: (was: 1.4)
   1.5
 Assignee: Julien Nioche
   Issue Type: New Feature  (was: Improvement)

 Pluggable indexing backends
 ---

 Key: NUTCH-1047
 URL: https://issues.apache.org/jira/browse/NUTCH-1047
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Julien Nioche
Assignee: Julien Nioche
  Labels: indexing
 Fix For: 1.5


 One possible feature would be to add a new endpoint for indexing-backends and 
 make the indexing pluggable. At the moment we are hardwired to SOLR - which is 
 OK - but as other resources like ElasticSearch are becoming more popular it 
 would be better to handle this as plugins. Not sure about the name of the 
 endpoint though : we already have indexing-plugins (which are about 
 generating fields sent to the backends) and moreover the backends are not 
 necessarily for indexing / searching but could be just an external storage 
 e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this 
 could be pertaining to the storage in GORA. 'indexing-backend' is the best 
 name that came to my mind so far - please suggest better ones.
 We should come up with generic map/reduce jobs for indexing, deduplicating 
 and cleaning and maybe add a Nutch extension point there so we can easily 
 hook up indexing, cleaning and deduplicating for various backends.
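
To make the idea above more concrete, here is a rough sketch of what such an extension point could look like. Everything in it is an assumption for discussion: the interface name IndexingBackend, its methods, and the toy in-memory implementation are hypothetical and do not exist in Nutch. The point is only that a generic job talks to the interface, and SOLR, ElasticSearch or CouchDB would each be one plugin implementing it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class IndexingBackendSketch {

    /** Hypothetical extension point: one implementation per backend. */
    public interface IndexingBackend {
        void open(Map<String, String> config);
        void write(Map<String, Object> doc);
        void delete(String id);
        void commit();
        void close();
    }

    /** Toy backend that keeps documents in memory, keyed by their "id" field. */
    public static class InMemoryBackend implements IndexingBackend {
        private final Map<String, Map<String, Object>> docs = new LinkedHashMap<>();
        private int commits = 0;

        public void open(Map<String, String> config) { /* nothing to connect to */ }
        public void write(Map<String, Object> doc) {
            docs.put(String.valueOf(doc.get("id")), doc);
        }
        public void delete(String id) { docs.remove(id); }
        public void commit() { commits++; }
        public void close() { /* nothing to release */ }

        public int size() { return docs.size(); }
        public int commits() { return commits; }
    }

    /** Generic indexing step: it only knows the interface, not the backend. */
    public static void indexAll(IndexingBackend backend,
                                Map<String, String> config,
                                List<Map<String, Object>> docs) {
        backend.open(config);
        try {
            for (Map<String, Object> doc : docs) {
                backend.write(doc);
            }
            backend.commit();
        } finally {
            backend.close();
        }
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("id", "http://example.org/");
        doc.put("title", "Example");
        List<Map<String, Object>> docs = new ArrayList<>();
        docs.add(doc);

        InMemoryBackend backend = new InMemoryBackend();
        indexAll(backend, new HashMap<String, String>(), docs);
        System.out.println(backend.size() + " document(s), " + backend.commits() + " commit(s)");
    }
}
```

Cleaning and deduplicating could reuse the same interface through delete(), so the generic map/reduce jobs would not need to know which backend is configured.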





[jira] [Updated] (NUTCH-1088) Write Solr XML documents

2011-09-28 Thread Julien Nioche (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1088:
-

 Priority: Minor  (was: Major)
Fix Version/s: (was: 1.4)
   1.5

Could do that with the pluggable indexing framework in NUTCH-1047?

 Write Solr XML documents
 

 Key: NUTCH-1088
 URL: https://issues.apache.org/jira/browse/NUTCH-1088
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5


 Documents need to be reindexed when index-time analysis is modified. Indexing 
 individual segments from Nutch is tedious, especially for small segments. 
 This issue should add a feature that can write XML batches.
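
As a starting point for discussion, a minimal sketch of writing such a batch follows. It assumes the target is Solr's XML update message format (an add element containing doc elements with named field elements); the issue itself does not say which format is intended. The class name, method names and the example field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SolrXmlBatchSketch {

    // Escape the characters that are not allowed verbatim in XML text
    // or attribute values. The ampersand must be replaced first.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Serialise a batch of documents as one Solr-style add message. */
    public static String toAddXml(List<Map<String, String>> docs) {
        StringBuilder xml = new StringBuilder("<add>\n");
        for (Map<String, String> doc : docs) {
            xml.append("  <doc>\n");
            for (Map.Entry<String, String> field : doc.entrySet()) {
                xml.append("    <field name=\"")
                   .append(escape(field.getKey()))
                   .append("\">")
                   .append(escape(field.getValue()))
                   .append("</field>\n");
            }
            xml.append("  </doc>\n");
        }
        return xml.append("</add>\n").toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the fields in insertion order.
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", "http://example.org/?a=1&b=2");
        doc.put("title", "Example <page>");
        List<Map<String, String>> batch = new ArrayList<>();
        batch.add(doc);
        System.out.print(toAddXml(batch));
    }
}
```

A batch written this way could be stored on disk and posted to Solr again later, so reindexing after an analysis change would not require going back to the segments.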





[jira] [Updated] (NUTCH-1117) JUnit test for index-anchor

2011-09-28 Thread Julien Nioche (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1117:
-

Fix Version/s: (was: 1.4)
   1.5

 JUnit test for index-anchor
 ---

 Key: NUTCH-1117
 URL: https://issues.apache.org/jira/browse/NUTCH-1117
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.5


 This issue is part of the larger attempt to provide a JUnit test case for 
 every Nutch plugin.





[jira] [Updated] (NUTCH-1119) JUnit test for index-static

2011-09-28 Thread Julien Nioche (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1119:
-

Fix Version/s: (was: 1.4)
   1.5

 JUnit test for index-static
 ---

 Key: NUTCH-1119
 URL: https://issues.apache.org/jira/browse/NUTCH-1119
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.5


 This issue is part of the larger attempt to provide a JUnit test case for 
 every Nutch plugin.





[jira] [Updated] (NUTCH-1118) JUnit test for index-basic

2011-09-28 Thread Julien Nioche (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1118:
-

Fix Version/s: (was: 1.4)
   1.5

 JUnit test for index-basic
 --

 Key: NUTCH-1118
 URL: https://issues.apache.org/jira/browse/NUTCH-1118
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.5


 This issue is part of the larger attempt to provide a JUnit test case for 
 every Nutch plugin.





[jira] [Updated] (NUTCH-1124) JUnit test for scoring-opic

2011-09-28 Thread Julien Nioche (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1124:
-

Fix Version/s: (was: 1.4)
   1.5

 JUnit test for scoring-opic
 ---

 Key: NUTCH-1124
 URL: https://issues.apache.org/jira/browse/NUTCH-1124
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.5


 This issue is part of the larger attempt to provide a JUnit test case for 
 every Nutch plugin.





[jira] [Updated] (NUTCH-1123) JUnit test for scoring-link

2011-09-28 Thread Julien Nioche (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1123:
-

Fix Version/s: (was: 1.4)
   1.5

 JUnit test for scoring-link
 ---

 Key: NUTCH-1123
 URL: https://issues.apache.org/jira/browse/NUTCH-1123
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.5


 This issue is part of the larger attempt to provide a JUnit test case for 
 every Nutch plugin.





[jira] [Updated] (NUTCH-1128) JUnit test for urlmeta

2011-09-28 Thread Julien Nioche (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1128:
-

Fix Version/s: (was: 1.4)
   1.5

 JUnit test for urlmeta
 --

 Key: NUTCH-1128
 URL: https://issues.apache.org/jira/browse/NUTCH-1128
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.5


 This issue is part of the larger attempt to provide a JUnit test case for 
 every Nutch plugin.





[jira] [Updated] (NUTCH-1127) JUnit test for urlfilter-validator

2011-09-28 Thread Julien Nioche (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1127:
-

Fix Version/s: (was: 1.4)
   1.5

 JUnit test for urlfilter-validator
 --

 Key: NUTCH-1127
 URL: https://issues.apache.org/jira/browse/NUTCH-1127
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.5


 This issue is part of the larger attempt to provide a JUnit test case for 
 every Nutch plugin.





[jira] [Updated] (NUTCH-1130) JUnit test for Any23 RDF plugin

2011-09-28 Thread Julien Nioche (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1130:
-

Fix Version/s: (was: 1.4)
   1.5

 JUnit test for Any23 RDF plugin
 ---

 Key: NUTCH-1130
 URL: https://issues.apache.org/jira/browse/NUTCH-1130
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: 1.5


 The JUnit test should be written before work on the Any23 Nutch plugin 
 progresses.





[jira] [Updated] (NUTCH-1120) JUnit test for microformats-reltag

2011-09-28 Thread Julien Nioche (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1120:
-

Fix Version/s: (was: 1.4)
   1.5

 JUnit test for microformats-reltag
 --

 Key: NUTCH-1120
 URL: https://issues.apache.org/jira/browse/NUTCH-1120
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.5


 This issue is part of the larger attempt to provide a JUnit test case for 
 every Nutch plugin.





[jira] [Updated] (NUTCH-1125) JUnit test for tld

2011-09-28 Thread Julien Nioche (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1125:
-

Fix Version/s: (was: 1.4)
   1.5

 JUnit test for tld
 --

 Key: NUTCH-1125
 URL: https://issues.apache.org/jira/browse/NUTCH-1125
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.5


 This issue is part of the larger attempt to provide a JUnit test case for 
 every Nutch plugin.





[jira] [Updated] (NUTCH-1122) JUnit test for protocol-ftp

2011-09-28 Thread Julien Nioche (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1122:
-

Fix Version/s: (was: 1.4)
   1.5

 JUnit test for protocol-ftp
 ---

 Key: NUTCH-1122
 URL: https://issues.apache.org/jira/browse/NUTCH-1122
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.5


 This issue is part of the larger attempt to provide a JUnit test case for 
 every Nutch plugin.





[jira] [Updated] (NUTCH-1121) JUnit test for parse-js

2011-09-28 Thread Julien Nioche (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1121:
-

Fix Version/s: (was: 1.4)
   1.5

 JUnit test for parse-js
 ---

 Key: NUTCH-1121
 URL: https://issues.apache.org/jira/browse/NUTCH-1121
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.5


 This issue is part of the larger attempt to provide a JUnit test case for 
 every Nutch plugin.
