from:"JIRA"

Parse-metatags plugin
-

 Key: NUTCH-809
 URL: https://issues.apache.org/jira/browse/NUTCH-809
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche
 Attachments: NUTCH-809.patch

h2. Parse-metatags plugin

*NOTE: THIS PLUGIN DOES NOT WORK WITH THE CURRENT VERSION OF PARSE-TIKA (see 
[TIKA-379]).* 

To use the legacy HTML parser, specify the following in parse-plugins.xml:

{code:xml}
<mimeType name="text/html">
  <plugin id="parse-html" />
</mimeType>
{code}

The parse-metatags plugin consists of an HTMLParserFilter which takes as 
parameter a list of metatag names, with '*' as the default value. The names are 
separated by ';'.

To extract the values of the description and keywords metatags, specify the 
following in nutch-site.xml:

{code:xml}
<property>
  <name>metatags.names</name>
  <value>description;keywords</value>
</property>
{code}
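
As a rough illustration of how a ';'-separated value such as the one above could be interpreted, here is a minimal Java sketch. It is not the plugin's actual code: the class name MetatagNames, the lower-casing of names, and the reading of '*' as "keep every metatag" are all assumptions made for the example.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

class MetatagNames {
    // Parses a ';'-separated list of metatag names.
    // A null or blank value falls back to "*" (the default mentioned above).
    static Set<String> parse(String configured) {
        String raw = (configured == null || configured.trim().isEmpty()) ? "*" : configured;
        Set<String> names = new LinkedHashSet<>();
        for (String part : raw.split(";")) {
            // Assumption: names are compared case-insensitively.
            String name = part.trim().toLowerCase(Locale.ROOT);
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    // Assumption: "*" means that every metatag is kept.
    static boolean isWanted(Set<String> names, String metatag) {
        return names.contains("*") || names.contains(metatag.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Set<String> names = parse("description;keywords");
        System.out.println(names);
        System.out.println(isWanted(names, "Keywords"));
        System.out.println(isWanted(names, "author"));
    }
}
```

With the value description;keywords only those two metatags would be kept, while an unset property would keep all of them.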

The MetatagIndexer uses the output of the parsing above to create two fields, 
'keywords' and 'description'. Note that keywords is multivalued.
The MetaTagsQueryFilter allows the fields above to be included in Nutch queries.

This code has been developed by DigitalPebble Ltd and offered to the community 
by ANT.com



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (NUTCH-809) Parse-metatags plugin

[
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Julien Nioche updated NUTCH-809:

Attachment: NUTCH-809.patch


[jira] Updated: (NUTCH-809) Parse-metatags plugin


 [ 
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-809:


Attachment: (was: NUTCH-809.patch)


[jira] Updated: (NUTCH-809) Parse-metatags plugin

[
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Julien Nioche updated NUTCH-809:

Attachment: NUTCH-809.patch

Modified version of the plugin which is compatible with parse-tika


[jira] Updated: (NUTCH-809) Parse-metatags plugin

[
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Julien Nioche updated NUTCH-809:

Description:
h2. Parse-metatags plugin

The parse-metatags plugin consists of an HTMLParserFilter which takes as
parameter a list of metatag names, with '*' as the default value. The names are
separated by ';'.

To extract the values of the description and keywords metatags, specify the
following in nutch-site.xml:

{code:xml}
<property>
  <name>metatags.names</name>
  <value>description;keywords</value>
</property>
{code}

The MetatagIndexer uses the output of the parsing above to create two fields
'keywords' and 'description'. Note that keywords is multivalued.
The MetaTagsQueryFilter allows the fields above to be included in Nutch queries.

This code has been developed by DigitalPebble Ltd and offered to the community
by ANT.com

was:
h2. Parse-metatags plugin

*NOTE: THIS PLUGIN DOES NOT WORK WITH THE CURRENT VERSION OF PARSE-TIKA (see
[TIKA-379]).*

To use the legacy HTML parser specify in parse-plugins.xml

{code:xml}
<mimeType name="text/html">
  <plugin id="parse-html" />
</mimeType>
{code}

The parse-metatags plugin consists of a HTMLParserFilter which takes as
parameter a list of metatag names with '*' as default value. The values are
separated by ';'.

In order to extract the values of the metatags description and keywords, you
must specify in nutch-site.xml

{code:xml}
<property>
  <name>metatags.names</name>
  <value>description;keywords</value>
</property>
{code}

This code has been developed by DigitalPebble Ltd and offered to the community
by ANT.com


[jira] Commented: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

2010-04-02 Thread Enis Soztutar (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852840#action_12852840
 ] 

Enis Soztutar commented on NUTCH-808:
-

A candidate framework is DataNucleus. It has the following benefits. 

- Apache 2 license. 
- JDO support 
- HBase, RDBMS, XML persistence. 

I will further investigate whether we can integrate Hadoop writables/Avro 
serialization so that objects can be passed from Mapred. 


 Evaluate ORM Frameworks which support non-relational column-oriented 
 datastores and RDBMs 
 --

 Key: NUTCH-808
 URL: https://issues.apache.org/jira/browse/NUTCH-808
 Project: Nutch
  Issue Type: Task
Reporter: Enis Soztutar
Assignee: Enis Soztutar

 We have an ORM layer in the NutchBase branch, which uses Avro Specific 
 Compiler to compile class definitions given in JSON. Before moving on with 
 this, we might benefit from evaluating other frameworks, whether they suit 
 our needs. 
 We want at least the following capabilities:
 - Using POJOs 
 - Able to persist objects to at least HBase, Cassandra, and RDBMs 
 - Able to efficiently serialize objects as task outputs from Hadoop jobs
 - Allow native queries, along with standard queries 
 Any comments or suggestions for other frameworks are welcome.


[jira] Updated: (NUTCH-706) Url regex normalizer

2010-03-31 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-706:


Fix Version/s: (was: 1.1)

Both variants of the substitution rule above break existing tests. More work 
will be needed to get a pattern which covers the case described by Meghna *and* 
is compatible with the existing test cases.
Moving it to post-1.1

 Url regex normalizer
 

 Key: NUTCH-706
 URL: https://issues.apache.org/jira/browse/NUTCH-706
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Meghna Kukreja
Priority: Minor

 Hey,
 I encountered the following problem while trying to crawl a site using
 nutch-trunk. In the file regex-normalize.xml, the following regex is
 used to remove session ids:
 <pattern>([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>.
 This pattern also transforms a url, such as,
 newsId=2000484784794&newsLang=en into new&newsLang=en (since it
 matches 'sId' in the 'newsId'), which is incorrect and hence does not
 get fetched. This expression needs to be changed to prevent this.
 Thanks,
 Meghna
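
The over-match described above can be reproduced with a few lines of Java. This is only a sketch: it assumes the XML entity in the pattern decodes to a plain '&' and that the rule's substitution is the trailing delimiter ($4), which the report does not state. The class name SessionIdRegexDemo is made up for the example.

```java
import java.util.regex.Pattern;

class SessionIdRegexDemo {
    // The session-id pattern quoted in the report, with &amp; decoded to '&'.
    static final Pattern SESSION_ID = Pattern.compile(
        "([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)");

    // Assumption: each match is replaced by its trailing delimiter (group 4).
    static String normalize(String url) {
        return SESSION_ID.matcher(url).replaceAll("$4");
    }

    public static void main(String[] args) {
        // "sId" inside "newsId" matches the case-insensitive "sid" alternative.
        System.out.println(normalize("newsId=2000484784794&newsLang=en"));
        // An intended case: a jsessionid path parameter is removed.
        System.out.println(normalize("page.jsp;jsessionid=ABC123?x=1"));
    }
}
```

Nothing in the pattern requires the session parameter name to start at a parameter boundary, which is why the tail of newsId is consumed.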


[jira] Commented: (NUTCH-706) Url regex normalizer

2010-03-31 Thread Ken Krugler (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851923#action_12851923
 ] 

Ken Krugler commented on NUTCH-706:
---

Two comments about this:

1. From my experience with Nutch and Bixo, I think that URL normalization 
ultimately needs to be more structured, i.e. first break the URL into pieces, 
then apply rules against the pieces. Trying to craft regular expressions to 
handle target cases leads to big, hairy, hard-to-understand strings.

2. URL normalization is something that makes a lot of sense for 
crawler-commons. If somebody from the Nutch side wants to define a target API, 
I could look at porting existing Bixo code to crawler-commons.
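
To make the "break the URL into pieces first" idea concrete, here is a minimal Java sketch. It is not Bixo or crawler-commons code: the class name, the list of session parameter names and the simplified query parsing are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StructuredSessionIdStripper {
    // Hypothetical list of session parameter names, for illustration only.
    static final Set<String> SESSION_PARAMS = new HashSet<>(Arrays.asList(
        "sid", "lsid", "jsid", "bv_sid", "phpsessid", "sessionid", "jsessionid"));

    // Splits the query into name=value pairs and drops a pair only when its
    // whole name is a known session parameter. Simplified: assumes the first
    // '?' starts the query and any '#' after it starts the fragment.
    static String stripSessionParams(String url) {
        int q = url.indexOf('?');
        if (q < 0) {
            return url;
        }
        String base = url.substring(0, q);
        String rest = url.substring(q + 1);
        String fragment = "";
        int hash = rest.indexOf('#');
        if (hash >= 0) {
            fragment = rest.substring(hash);
            rest = rest.substring(0, hash);
        }
        List<String> kept = new ArrayList<>();
        for (String pair : rest.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String name = (eq < 0 ? pair : pair.substring(0, eq)).toLowerCase(Locale.ROOT);
            if (!SESSION_PARAMS.contains(name)) {
                kept.add(pair);
            }
        }
        String query = String.join("&", kept);
        return base + (query.isEmpty() ? "" : "?" + query) + fragment;
    }

    public static void main(String[] args) {
        System.out.println(stripSessionParams("http://example.com/news?newsId=2000484784794&newsLang=en"));
        System.out.println(stripSessionParams("http://example.com/a?sid=abc&x=1"));
    }
}
```

Because the rule is applied to whole parameter names, a parameter such as newsId is left alone even though it ends in "sId".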



[jira] Updated: (NUTCH-249) black- white list url filtering


 [ 
https://issues.apache.org/jira/browse/NUTCH-249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-249:


Fix Version/s: (was: 1.1)

- push out per http://bit.ly/c7tBv9

 black- white list url filtering
 ---

 Key: NUTCH-249
 URL: https://issues.apache.org/jira/browse/NUTCH-249
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.8
Reporter: Stefan Groschupf
Assignee: Dennis Kubes
Priority: Trivial
 Attachments: blackWhiteListV2.patch, blackWhiteListV3.patch, bw.patch


 Existing url filter mechanisms need to process each url against each filter 
 pattern. For very large filter sets this may not scale very well.


[jira] Updated: (NUTCH-309) Uses commons logging Code Guards


 [ 
https://issues.apache.org/jira/browse/NUTCH-309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-309:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Uses commons logging Code Guards
 

 Key: NUTCH-309
 URL: https://issues.apache.org/jira/browse/NUTCH-309
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Jerome Charron
Assignee: Chris A. Mattmann
Priority: Minor

 Code guards are typically used to guard code that only needs to execute in 
 support of logging, that otherwise introduces undesirable runtime overhead in 
 the general case (logging disabled). Examples are multiple parameters, or 
 expressions (e.g. "string" + " more") for parameters. Use the guard methods of 
 the form log.is<Priority>() to verify that logging should be performed, 
 before incurring the overhead of the logging method call. Yes, the logging 
 methods will perform the same check, but only after resolving parameters.
 (description extracted from 
 http://jakarta.apache.org/commons/logging/guide.html#Code_Guards)
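
A code guard of the kind described above might look like the following Java sketch. To keep it runnable without extra jars it uses java.util.logging (Logger.isLoggable) in place of commons-logging; with commons-logging the guard would be a call of the log.is<Priority>() form. The class name and the describe helper are made up for the example.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class CodeGuardDemo {
    static final Logger LOG = Logger.getLogger(CodeGuardDemo.class.getName());

    // Counts how often the expensive message argument is actually built.
    static int describeCalls = 0;

    // Stands in for a log argument that is costly to compute.
    static String describe(int[] values) {
        describeCalls++;
        StringBuilder sb = new StringBuilder();
        for (int v : values) {
            sb.append(v).append(' ');
        }
        return sb.toString().trim();
    }

    static void process(int[] values) {
        // Guard: the concatenation and describe() only run when FINE is enabled.
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine("processing values: " + describe(values));
        }
    }

    public static void main(String[] args) {
        LOG.setLevel(Level.INFO); // FINE is disabled here
        process(new int[] {1, 2, 3});
        System.out.println("describe() calls: " + describeCalls);
    }
}
```

Without the guard, the string would be built on every call and then discarded by the logger whenever FINE is disabled.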


[jira] Updated: (NUTCH-763) Separate configuration files from resources to be included in the job file


 [ 
https://issues.apache.org/jira/browse/NUTCH-763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-763:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Separate configuration files from resources to be included in the job file
 --

 Key: NUTCH-763
 URL: https://issues.apache.org/jira/browse/NUTCH-763
 Project: Nutch
  Issue Type: Wish
Reporter: Julien Nioche
Priority: Minor

 One of the things I found confusing when I was learning Nutch was the fact 
 that the conf/ directory contains at the same time : 
 - configuration files for Hadoop / Nutch which are put in the jar files but 
 not used there
 - resource files (e.g. filtering rules) which MUST be up to date in the job 
 file
 I would separate the conf/ directory from say a resources/ directory which 
 would contain the rule files and other things to put in the job file. Unless 
 I am mistaken none of the configuration files need to be in the job file. I 
 know it is a very minor point, but that would probably simplify things and 
 make it easier for beginners to understand what has to be modified where. 


[jira] Updated: (NUTCH-577) Use explicit tika-config.xml file to enable mime magic detection to be turned on and off


 [ 
https://issues.apache.org/jira/browse/NUTCH-577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-577:


 Due Date: 30/Nov/07  (was: 30/Nov/07)
Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Use explicit tika-config.xml file to enable mime magic detection to be turned 
 on and off
 

 Key: NUTCH-577
 URL: https://issues.apache.org/jira/browse/NUTCH-577
 Project: Nutch
  Issue Type: Improvement
  Components: mime_type_detector
Affects Versions: 1.0.0
 Environment: Mac Book Pro Intel Core Duo 2.0 GHz, 2.0 GB RAM, Mac OS 
 X 10.4, although improvement is indep. of env.
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Minor

 Currently, there is a configuration file for Tika (which the trunk in Nutch 
 uses for its mime type detection) called tika-config.xml left unexposed (a 
 default one lives in the tika-0.1-dev.jar file). Tika's mime system has two 
 config files it relies on: tika-mimetypes.xml (which Nutch has its own 
 version of, that overrides the version that comes with the tika jar file), 
 and tika-config.xml (to turn on or off magic char detection). We should 
 probably have a nutch version of tika-config.xml, so that Nutch users can 
 employ magic char mime detection. I'll get going on this in the next day or 
 so.


[jira] Updated: (NUTCH-310) Review Log Levels


 [ 
https://issues.apache.org/jira/browse/NUTCH-310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-310:


Fix Version/s: (was: 1.1)
 Assignee: Chris A. Mattmann  (was: Jerome Charron)

- pushing this out per http://bit.ly/c7tBv9 (and assign to me, I think this can 
be closed but will wait until after 1.1 to revisit)

 Review Log Levels
 -

 Key: NUTCH-310
 URL: https://issues.apache.org/jira/browse/NUTCH-310
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Jerome Charron
Assignee: Chris A. Mattmann
Priority: Minor

 Review of log content and log levels (see Commons Logging Best Practices: 
 http://jakarta.apache.org/commons/logging/guide.html#Message_Priorities_Levels)


[jira] Updated: (NUTCH-673) Upgrade the Carrot2 plug-in to release 3.0


 [ 
https://issues.apache.org/jira/browse/NUTCH-673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-673:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Upgrade the Carrot2 plug-in to release 3.0
 --

 Key: NUTCH-673
 URL: https://issues.apache.org/jira/browse/NUTCH-673
 Project: Nutch
  Issue Type: Improvement
  Components: web gui
Affects Versions: 0.9.0
 Environment: All Nutch deployments.
Reporter: Sean Dean
Priority: Minor

 Release 3.0 of the Carrot2 plug-in was released recently.
 We currently have version 2.1 in the source tree and upgrading it to the 
 latest version before 1.0-release might make sense.
 Details on the release can be found here: 
 http://project.carrot2.org/release-3.0-notes.html
 One major change in requirements is for JDK 1.5 to be used, but this is also 
 now required for Hadoop 0.19 so this wouldn't be the only reason for the 
 switch.


[jira] Updated: (NUTCH-664) Possibility to update already stored documents.


 [ 
https://issues.apache.org/jira/browse/NUTCH-664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-664:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Possibility to update already stored documents.
 ---

 Key: NUTCH-664
 URL: https://issues.apache.org/jira/browse/NUTCH-664
 Project: Nutch
  Issue Type: Wish
Reporter: Sergey Khilkov
Priority: Minor

 We have a huge index of stored documents. It is a costly procedure to fetch a 
 page and merge indexes every time we update some information about a page. The 
 information can change 1-3 times per day. At the moment we have to store the 
 changed info in a database, but in this case we have lots of problems with 
 sorting, search restrictions and so on. Lucene itself allows deleting a single 
 document and adding a new one to an existing index. But there is a problem with 
 Hadoop... As I understand it, the Hadoop filesystem cannot write at 
 random positions. But it would be a great feature if Nutch were able to 
 update an existing index.


[jira] Updated: (NUTCH-750) HtmlParser plugin - page title extraction


 [ 
https://issues.apache.org/jira/browse/NUTCH-750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-750:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 HtmlParser plugin - page title extraction
 -

 Key: NUTCH-750
 URL: https://issues.apache.org/jira/browse/NUTCH-750
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.0.0
Reporter: Alexey Torochkov
Priority: Minor
 Attachments: SkipBody.patch


 A small improvement that tries to extract the title tag from the body if it 
 does not exist in the head.
 In the current version, DOMContentUtils skips everything after body in the 
 getTitle() method.
 The attached patch allows this behavior to be changed (by default it changes 
 nothing) and can cope with webmasters' mistakes.


[jira] Updated: (NUTCH-564) External parser supports encoding attribute


 [ 
https://issues.apache.org/jira/browse/NUTCH-564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-564:


   Patch Info: [Patch Available]
Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 External parser supports encoding attribute
 ---

 Key: NUTCH-564
 URL: https://issues.apache.org/jira/browse/NUTCH-564
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 0.9.0
 Environment: All
Reporter: Antony Bowesman
Priority: Minor
 Attachments: ExtParser_0.9.0.patch, ExtParser_1.0.0.patch


 When an external component generates text, which is returned to the external 
 parser, it always converts the text using the default character set.  
 (os.toString()).  For example, the returned text may be utf-8, but will not 
 be converted to a String correctly.
 I added the attribute encoding to the implementation XML in plugin.xml 
 and this is then used to convert the text.
 I have tested my original fix on my local 0.9 and include a patch, but have 
 also made an untested patch for trunk.
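
The difference between decoding with the platform default and decoding with a configured encoding can be shown with a small Java sketch. This is not the code from the attached patches: the class and method names are invented, and the byte stream only stands in for the output captured from an external command.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExternalOutputDecoding {
    // Simulates output captured from an external command in a given charset.
    static ByteArrayOutputStream capture(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        os.write(bytes, 0, bytes.length);
        return os;
    }

    // Decodes the captured bytes with an explicitly configured encoding,
    // instead of os.toString(), which uses the platform default charset.
    static String decode(ByteArrayOutputStream os, String encoding) {
        return new String(os.toByteArray(), Charset.forName(encoding));
    }

    public static void main(String[] args) {
        ByteArrayOutputStream os = capture("caf\u00e9", StandardCharsets.UTF_8);
        // Correct: the bytes are interpreted as UTF-8.
        System.out.println(decode(os, "UTF-8").equals("caf\u00e9"));
        // Wrong charset: the two UTF-8 bytes of the accented letter become two characters.
        System.out.println(decode(os, "ISO-8859-1").length());
    }
}
```

On a machine whose default charset is not UTF-8, a plain os.toString() behaves like the second call, which is the problem the encoding attribute is meant to address.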


[jira] Updated: (NUTCH-477) Extend URLFilters to support different filtering chains


 [ 
https://issues.apache.org/jira/browse/NUTCH-477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-477:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Extend URLFilters to support different filtering chains
 ---

 Key: NUTCH-477
 URL: https://issues.apache.org/jira/browse/NUTCH-477
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.1
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
Priority: Minor
 Attachments: urlfilters.patch


 I propose to make the following changes to URLFilters:
 * extend URLFilters so that they support different filtering rules depending 
 on the context where they are executed. This functionality mirrors the one 
 that URLNormalizers already support.
 * change their return value to an int code, in order to support early 
 termination of long filtering chains.
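
The two proposed changes could fit together roughly as in the Java sketch below. It does not reflect the attached urlfilters.patch: the interface, the scope names, the return codes and the default of accepting an undecided URL are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class FilterChainDemo {
    // Hypothetical return codes: the proposal only says "an int code".
    static final int ACCEPT = 0; // accept the URL and stop the chain
    static final int REJECT = 1; // reject the URL and stop the chain
    static final int PASS = 2;   // no decision, let the next filter look

    // Hypothetical scope-aware filter; "scope" names the calling context.
    interface ScopedUrlFilter {
        int filter(String scope, String url);
    }

    // Runs the filters in order and stops at the first definite answer.
    static boolean accepts(List<ScopedUrlFilter> chain, String scope, String url) {
        for (ScopedUrlFilter f : chain) {
            int code = f.filter(scope, url);
            if (code == ACCEPT) {
                return true;
            }
            if (code == REJECT) {
                return false;
            }
        }
        return true; // assumption: an undecided URL is accepted
    }

    // Two toy filters: seeds are trusted in the "inject" scope, GIFs are rejected.
    static List<ScopedUrlFilter> sampleChain() {
        ScopedUrlFilter trustSeeds = (scope, url) -> "inject".equals(scope) ? ACCEPT : PASS;
        ScopedUrlFilter noImages = (scope, url) -> url.endsWith(".gif") ? REJECT : PASS;
        return Arrays.asList(trustSeeds, noImages);
    }

    public static void main(String[] args) {
        List<ScopedUrlFilter> chain = sampleChain();
        System.out.println(accepts(chain, "inject", "http://example.com/img.gif"));
        System.out.println(accepts(chain, "fetcher", "http://example.com/img.gif"));
    }
}
```

In the "inject" scope the first filter answers immediately, so the remaining filters are never consulted, which is the early termination the proposal is after.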


[jira] Updated: (NUTCH-251) Administration GUI


 [ 
https://issues.apache.org/jira/browse/NUTCH-251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-251:


   Patch Info: [Patch Available]
Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9 (comment from me: would be nice to 
get this into 1.2)

 Administration GUI
 --

 Key: NUTCH-251
 URL: https://issues.apache.org/jira/browse/NUTCH-251
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Stefan Groschupf
Priority: Minor
 Attachments: hadoop_nutch_gui_v1.patch, Nutch-251-AdminGUI.tar.gz, 
 nutch_gui_plugins_v1.zip, nutch_gui_v1.patch


 Having a web based administration interface would help to make nutch 
 administration and management much more user friendly.


[jira] Updated: (NUTCH-609) Allow Plugins to be Loaded from Jar File(s)

[
https://issues.apache.org/jira/browse/NUTCH-609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Chris A. Mattmann updated NUTCH-609:

Due Date: 13/Feb/08 (was: 13/Feb/08)
Patch Info: [Patch Available]
Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

Allow Plugins to be Loaded from Jar File(s)
---

Key: NUTCH-609
URL: https://issues.apache.org/jira/browse/NUTCH-609
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.0.0
Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Priority: Minor
Attachments: NUTCH-609-1-20080212.patch

Currently plugins cannot be loaded from a jar file. Plugins must be unzipped
in one or more directories specified by the plugin.folders config. I have
been thinking about an extension to PluginRepository or PluginManifestParser
(or both) that would allow plugins to be packaged into multiple independent jar
files and placed on the classpath. The system would search the classpath for
resources with the correct folder name and would load any plugins in those
jars.
This functionality would be very useful in making the nutch core more
flexible in terms of packaging. It would also help with web applications
where we don't want to have a plugins directory included in the webapp.
Thoughts so far are unzipping those plugin jars into a common temp directory
before loading. Another option is using something like commons vfs to
interact with the jar files. VFS essentially uses a disk based temporary cache
for jar files, so it is pretty much the same solution. What are everyone
else's thoughts on this?


[jira] Resolved: (NUTCH-794) Language Identification must use check the parse metadata for language values


 [ 
https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-794.
-

Resolution: Fixed

@julien -- I think this issue has been fixed in Tika right? If not, feel free 
to reopen, or better yet, re-file the issue against a post 1.1 Nutch release. 
Thanks!

 Language Identification must use check the parse metadata for language values 
 --

 Key: NUTCH-794
 URL: https://issues.apache.org/jira/browse/NUTCH-794
 Project: Nutch
  Issue Type: Bug
  Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-794.patch


 The following HTML document : 
 <html lang="fi"><head>document 1 title</head><body>jotain 
 suomeksi</body></html>
 is rendered as the following xhtml by Tika : 
 <?xml version="1.0" encoding="UTF-8"?><html 
 xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 
 titlejotain suomeksi</body></html>
 with the lang attribute getting lost.  The lang is not stored in the metadata 
 either.
 I will open an issue on Tika and modify TestHTMLLanguageParser so that the 
 tests don't break anymore 


[jira] Updated: (NUTCH-578) URL fetched with 403 is generated over and over again


 [ 
https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-578:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 URL fetched with 403 is generated over and over again
 -

 Key: NUTCH-578
 URL: https://issues.apache.org/jira/browse/NUTCH-578
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 1.0.0
 Environment: Ubuntu Gutsy Gibbon (7.10) running on VMware server. I 
 have checked out the most recent version of the trunk as of Nov 20, 2007
Reporter: Nathaniel Powell
Assignee: Dennis Kubes
 Attachments: crawl-urlfilter.txt, NUTCH-578.patch, 
 NUTCH-578_v2.patch, NUTCH-578_v3.patch, NUTCH-578_v4.patch, nutch-site.xml, 
 regex-normalize.xml, urls.txt


 I have not changed the following parameter in the nutch-default.xml:
 <property>
   <name>db.fetch.retry.max</name>
   <value>3</value>
   <description>The maximum number of times a url that has encountered
   recoverable errors is generated for fetch.</description>
 </property>
 However, there is a URL which is on the site that I'm crawling, 
 www.teachertube.com, which keeps being generated over and over again for 
 almost every segment (many more times than 3):
 fetch of http://www.teachertube.com/images/ failed with: Http code=403, 
 url=http://www.teachertube.com/images/
 This is a bug, right?
 Thanks.


[jira] Updated: (NUTCH-540) some problem about the Nutch cache


 [ 
https://issues.apache.org/jira/browse/NUTCH-540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-540:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 some problem about the Nutch cache
 --

 Key: NUTCH-540
 URL: https://issues.apache.org/jira/browse/NUTCH-540
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 0.9.0
 Environment: Red hat AS4 + Tomcat5.5 + Nutch0.9
Reporter: crossany
 Attachments: 1.gif, 1186733525.jpg


 I am Chinese.
 I am testing searches for Chinese words in Nutch. I installed Nutch 0.9 in 
 Tomcat 5 on Linux; the Tomcat charset is UTF-8. I used Nutch to crawl a 
 Chinese website whose charset is also UTF-8. When I search for a Chinese word 
 through Nutch on Tomcat, the title and description of the search results are 
 displayed correctly. But when I click the cache link, the cached page is 
 displayed with a wrong character encoding, even though the cached page's 
 charset is also UTF-8. I found another website that uses Nutch, 
 http://www.synoo.com:8080/zh/, and searching for a Chinese word there shows 
 the same error.
 When I use Luke to inspect the segments, the Chinese words are displayed 
 correctly, so I think this may be a bug.


[jira] Updated: (NUTCH-455) dedup on tokenized fields is faulty

[
https://issues.apache.org/jira/browse/NUTCH-455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Chris A. Mattmann updated NUTCH-455:

Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

dedup on tokenized fields is faulty
---

Key: NUTCH-455
URL: https://issues.apache.org/jira/browse/NUTCH-455
Project: Nutch
Issue Type: Bug
Components: searcher
Affects Versions: 0.9.0
Reporter: Enis Soztutar
Attachments: IndexSearcherCacheWarm.patch

(From LUCENE-252)
nutch uses several index servers, and the search results from these servers
are merged using a dedup field for deleting duplicates. The values from
this field are cached by Lucene's FieldCacheImpl. The default is the site
field, which is indexed and tokenized. However for a Tokenized Field (for
example url in nutch), FieldCacheImpl returns an array of Terms rather than
an array of field values, so dedup'ing becomes faulty. The current FieldCache
implementation does not respect tokenized fields, and as described above
caches only terms.
So in the situation that we are searching using url as the dedup field,
when a Hit is constructed in IndexSearcher, the dedupValue becomes a token of
the url (such as "www" or "com") rather than the whole url. This prevents
using tokenized fields in the dedup field.
I have written a patch for lucene and attached it in
http://issues.apache.org/jira/browse/LUCENE-252, this patch fixes the
aforementioned issue about tokenized field caching. However building such a
cache for about 1.5M documents takes 20+ secs. The code in
IndexSearcher.translateHits() starts with
if (dedupField != null)
dedupValues = FieldCache.DEFAULT.getStrings(reader, dedupField);
and for the first call of search in IndexSearcher, cache is built.
Long story short, i have written a patch against IndexSearcher, which in
constructor warms-up the caches of wanted fields(configurable). I think we
should vote for LUCENE-252, and then commit the above patch with the last
version of lucene.
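
The effect described above can be illustrated with a small, self-contained
Java sketch. It does not use Lucene; DedupCacheSketch, tokenize, termCache and
valueCache are hypothetical names, the tokenizer is a crude stand-in for an
analyzer, and the choice of which token survives per document is an assumption
made only for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class DedupCacheSketch {

    // Crude stand-in for an analyzer: lower-case and split on non-alphanumerics.
    static List<String> tokenize(String url) {
        List<String> tokens = new ArrayList<>();
        for (String t : url.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Term-based cache: only one token per document survives. Which token
    // survives is an assumption of this sketch (the lexicographically greatest).
    static String[] termCache(String[] urls) {
        String[] cache = new String[urls.length];
        for (int doc = 0; doc < urls.length; doc++) {
            List<String> tokens = tokenize(urls[doc]);
            cache[doc] = tokens.isEmpty() ? null : Collections.max(tokens);
        }
        return cache;
    }

    // Value-based cache: the whole field value is kept per document.
    static String[] valueCache(String[] urls) {
        return urls.clone();
    }
}
```

For the two urls http://www.apache.org/a and http://www.apache.org/b, termCache
yields the same dedup value (www) for both documents, so they would be treated
as duplicates, while valueCache keeps them distinct.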


[jira] Updated: (NUTCH-747) injectIndex metadatas and inherit these metadatas to all matching suburls


 [ 
https://issues.apache.org/jira/browse/NUTCH-747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-747:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 injectIndex metadatas and inherit these metadatas to all matching suburls
 --

 Key: NUTCH-747
 URL: https://issues.apache.org/jira/browse/NUTCH-747
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, injector
Reporter: Marko Bauhardt
 Attachments: index-metadata.patch, metadata.patch


 Hi.
 The following two patches support:
 + injecting metadata for urls into a metadatadb, one line per url:
   url.com TAB METAKEY : TAB METAVALUE TAB METAVALUE METAKEY : METAVALUE ...
   ...
 + updating the parse_data metadata of a shard, writing the metadata to all 
   fetched urls that start with a url from the metadatadb
 + inheritance of metadata by all matching suburls
 The second patch implements an index-metadata plugin.
 + This plugin extracts metadata from the parse_data of a shard and indexes 
   it. Which metadata to index is configured in plugin.properties.
 + To index, for example, the lang, configure plugin.properties with: 
   lang=STORE,UNTOKENIZED
 + This means that the index plugin extracts the metadata values with key 
   lang; if any exist, all values are indexed stored and untokenized.
 Example
 Create start urls in /tmp/urls/start/urls.txt
   http://lucene.apache.org/nutch/apidocs-1.0/index.html
   http://lucene.apache.org/nutch/apidocs-0.9/index.html
 Create metadata urls in /tmp/urls/metadata/urls.txt
   http://lucene.apache.org/nutch/apidocs-1.0/ version:1.0
   http://lucene.apache.org/nutch/apidocs-0.9/ version:0.9
 Inject urls
   bin/nutch inject crawldb /tmp/urls/start/
   bin/nutch org.apache.nutch.crawl.metadata.MetadataInjector metadatadb 
   /tmp/urls/metadata/
 Fetch, Parse, Update
   bin/nutch generate crawldb segments
   bin/nutch fetch segments/20090806105717/
   bin/nutch org.apache.nutch.crawl.metadata.ParseDataUpdater metadatadb 
   segments/20090806105717
   bin/nutch updatedb crawldb/ segments/20090806105717/
 Fetch, Parse, Update again
   ...
 Index
   bin/nutch invertlinks linkdb -dir segments/
   bin/nutch index index crawldb/ linkdb/ segments/20090806105717 
   segments/20090806110127
 Check your index
   All urls starting with http://lucene.apache.org/nutch/apidocs-1.0/ are 
   indexed with version:1.0.
   All urls starting with http://lucene.apache.org/nutch/apidocs-0.9/ are 
   indexed with version:0.9.
 This issue is somewhat related to http://issues.apache.org/jira/browse/NUTCH-655
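
 The prefix-based inheritance described above can be sketched in plain Java. 
 This is only an illustration under assumptions, not the code from the 
 attached patches: PrefixMetadataSketch, inject and metadataFor are 
 hypothetical names, and an in-memory map stands in for the metadatadb.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PrefixMetadataSketch {

    // url prefix -> metadata injected for that prefix (stand-in for the metadatadb)
    private final Map<String, Map<String, String>> byPrefix = new LinkedHashMap<>();

    // Record one metadata key/value pair for a url prefix.
    void inject(String urlPrefix, String key, String value) {
        byPrefix.computeIfAbsent(urlPrefix, p -> new LinkedHashMap<>()).put(key, value);
    }

    // Collect the metadata of every injected prefix that the fetched url starts with.
    Map<String, String> metadataFor(String fetchedUrl) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, String>> e : byPrefix.entrySet()) {
            if (fetchedUrl.startsWith(e.getKey())) {
                result.putAll(e.getValue());
            }
        }
        return result;
    }
}
```

 With the two metadata urls from the example injected, a lookup for 
 http://lucene.apache.org/nutch/apidocs-1.0/index.html returns version=1.0, 
 and a url outside both prefixes returns an empty map.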


[jira] Updated: (NUTCH-479) Support for OR queries


 [ 
https://issues.apache.org/jira/browse/NUTCH-479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-479:


   Patch Info: [Patch Available]
Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Support for OR queries
 --

 Key: NUTCH-479
 URL: https://issues.apache.org/jira/browse/NUTCH-479
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Affects Versions: 1.0.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: nutch_0.9_OR.patch, or.patch, or.patch


 There have been many requests from users to extend Nutch query syntax to add 
 support for OR queries, in addition to the implicit AND and NOT queries 
 supported now.


[jira] Updated: (NUTCH-677) Segment merge filering based on segment content


 [ 
https://issues.apache.org/jira/browse/NUTCH-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-677:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Segment merge filering based on segment content
 ---

 Key: NUTCH-677
 URL: https://issues.apache.org/jira/browse/NUTCH-677
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
Reporter: Marcin Okraszewski
 Attachments: MergeFilter.patch, MergeFilter_for_1.0.patch, 
 SegmentMergeFilter.java, SegmentMergeFilter.java, SegmentMergeFilters.java, 
 SegmentMergeFilters.java


 I needed a segment filtering based on meta data detected during parse phase. 
 Unfortunately current URL based filtering does not allow for this. So I have 
 created a new SegmentMergeFilter extension which receives segment entry which 
 is being merged and decides if it should be included or not. Even though I 
 needed only ParseData for my purpose I have done it a bit more general 
 purpose, so the filter receives all merged data.
 The attached patch is for version 0.9 which I use. Unfortunately I didn't 
 have time to check how it fits to trunk version. Sorry :(
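
 The idea of a content-based merge filter can be sketched as follows. This is 
 a simplified illustration, not the interface from the attached patch: 
 MergeFilterSketch, EntryFilter and keepEntry are hypothetical names, and a 
 plain string map stands in for the ParseData metadata.

```java
import java.util.List;
import java.util.Map;

final class MergeFilterSketch {

    // Hypothetical filter: decides from the parse metadata whether an entry is kept.
    interface EntryFilter {
        boolean keep(String url, Map<String, String> parseMetadata);
    }

    // An entry is merged only if every configured filter accepts it.
    static boolean keepEntry(List<EntryFilter> filters, String url,
                             Map<String, String> parseMetadata) {
        for (EntryFilter f : filters) {
            if (!f.keep(url, parseMetadata)) {
                return false;
            }
        }
        return true;
    }
}
```

 A filter such as (url, meta) -> "en".equals(meta.get("lang")) would then 
 keep only entries whose parse metadata marks them as English, something a 
 purely URL-based filter cannot express.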


[jira] Updated: (NUTCH-774) Retry interval in crawl date is set to 0


 [ 
https://issues.apache.org/jira/browse/NUTCH-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-774:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Retry interval in crawl date is set to 0
 

 Key: NUTCH-774
 URL: https://issues.apache.org/jira/browse/NUTCH-774
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Reinhard Schwab
Assignee: Andrzej Bialecki 
 Attachments: NUTCH-774.patch, NUTCH-774_2.patch


 When I fetch and parse a feed with the feed plugin,
 http://www.wachauclimbing.net/home/impressum-disclaimer/feed/
 another crawl date is generated:
 http://www.wachauclimbing.net/home/impressum-disclaimer/comment-page-1/
 After fetching a second round,
 the dump of the crawl db still shows a retry interval with value 0.
 http://www.wachauclimbing.net/home/impressum-disclaimer/comment-page-1/ 
 Version: 7
 Status: 2 (db_fetched)
 Fetch time: Wed Dec 02 12:48:22 CET 2009
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 0 seconds (0 days)
 Score: 1.084
 Signature: db9ab2193924cd2d0b53113a500ca604
 Metadata: _pst_: success(1), lastModified=0
 A check should be done in the setFetchSchedule method of 
 DefaultFetchSchedule (or AbstractFetchSchedule).
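
 A minimal sketch of the kind of check suggested above, in plain Java. It is 
 not the code from the attached patches and does not use the Nutch API: 
 RetryIntervalGuardSketch and effectiveInterval are hypothetical names, and 
 the 30-day default is an assumption.

```java
final class RetryIntervalGuardSketch {

    // Assumed default interval: 30 days, expressed in seconds.
    static final int DEFAULT_INTERVAL_SECONDS = 30 * 24 * 60 * 60;

    // Fall back to the default when the stored interval was never initialised
    // (0 or negative), so the entry does not look permanently due for refetch.
    static int effectiveInterval(int storedIntervalSeconds, int defaultIntervalSeconds) {
        return storedIntervalSeconds > 0 ? storedIntervalSeconds : defaultIntervalSeconds;
    }
}
```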


[jira] Updated: (NUTCH-460) RDF parser plugin


 [ 
https://issues.apache.org/jira/browse/NUTCH-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-460:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 RDF parser plugin
 -

 Key: NUTCH-460
 URL: https://issues.apache.org/jira/browse/NUTCH-460
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Ricardo J. Méndez
 Attachments: rubyspider-rdf.zip


 I've written a couple plugins that I'd like to contribute.  
 RDFLinkParseFilter looks for links on the pages that point towards RDF 
 information, and tags the pages with metadata about the type of links they 
 hold. RDFLinkIndexingFilter indexes said metadata.  RDFParser parses RDF 
 information from several possible formats using Jena, and extracts the links 
 that the file points to as Outlinks so that they can be fetched as well.


[jira] Updated: (NUTCH-460) RDF parser plugin


 [ 
https://issues.apache.org/jira/browse/NUTCH-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-460:


Patch Info: [Patch Available]

- pushing this out per http://bit.ly/c7tBv9

 RDF parser plugin
 -

 Key: NUTCH-460
 URL: https://issues.apache.org/jira/browse/NUTCH-460
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Ricardo J. Méndez
 Attachments: rubyspider-rdf.zip


 I've written a couple plugins that I'd like to contribute.  
 RDFLinkParseFilter looks for links on the pages that point towards RDF 
 information, and tags the pages with metadata about the type of links they 
 hold. RDFLinkIndexingFilter indexes said metadata.  RDFParser parses RDF 
 information from several possible formats using Jena, and extracts the links 
 that the file points to as Outlinks so that they can be fetched as well.


[jira] Updated: (NUTCH-729) NPE in FieldIndexer when BasicFields url doesn't exist


 [ 
https://issues.apache.org/jira/browse/NUTCH-729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-729:


 Due Date: 26/Mar/09  (was: 26/Mar/09)
   Patch Info: [Patch Available]
Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 NPE in FieldIndexer when BasicFields url doesn't exist
 --

 Key: NUTCH-729
 URL: https://issues.apache.org/jira/browse/NUTCH-729
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 0.9.0, 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Attachments: NUTCH-729-1-20090235.patch


 There is a NullPointerException during a logging call in FieldIndexer when 
 there isn't a url for a document.  Documents shouldn't be without urls but 
 since the FieldIndexer doesn't validate fields it is possible for it to 
 occur.  Most often this happens when BasicFields is run with the wrong 
 segments directory and doesn't complain.  It could also occur if using the 
 FieldIndexer to index things other than basic fields.


[jira] Updated: (NUTCH-573) Multiple Domains - Query Search


 [ 
https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-573:



- pushing this out per http://bit.ly/c7tBv9

 Multiple Domains - Query Search
 ---

 Key: NUTCH-573
 URL: https://issues.apache.org/jira/browse/NUTCH-573
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Affects Versions: 0.9.0
 Environment: All
Reporter: Rajasekar Karthik
Assignee: Enis Soztutar
 Attachments: multiTermQuery_v1.patch


 Searching multiple domains can be done in Lucene, but not that efficiently 
 in Nutch.
 Query:
 +content:abc +(site:www.aaa.com site:www.bbb.com)
 works in Lucene, but the same concept does not work in Nutch.
 In Lucene, it works with 
 org.apache.lucene.analysis.KeywordAnalyzer
 org.apache.lucene.analysis.standard.StandardAnalyzer 
 but NOT with
 org.apache.lucene.analysis.SimpleAnalyzer 
 Is the Nutch analyzer based on SimpleAnalyzer? If so, is there a 
 workaround to make this work? Is there an option to change which analyzer 
 Nutch is using? 
 Just FYI, another solution (inefficient, I believe) which seems to work 
 in Nutch:
 query -site:ccc.com -site:ddd.com 


[jira] Updated: (NUTCH-717) Make Nutch Solr integration easier


 [ 
https://issues.apache.org/jira/browse/NUTCH-717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-717:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Make Nutch Solr integration easier
 --

 Key: NUTCH-717
 URL: https://issues.apache.org/jira/browse/NUTCH-717
 Project: Nutch
  Issue Type: New Feature
Reporter: Sami Siren

 Erik Hatcher proposed that we provide a full Solr config dir to be used 
 with Nutch-Solr. Currently we only provide the index schema. It would be 
 considerably easier to set up Nutch-Solr if we provided the whole conf dir, 
 which you could use with Solr like:
 java -Dsolr.solr.home=<Nutch's Solr Home> -jar start.jar


[jira] Updated: (NUTCH-541) Index url field untokenized


 [ 
https://issues.apache.org/jira/browse/NUTCH-541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-541:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Index url field untokenized
 ---

 Key: NUTCH-541
 URL: https://issues.apache.org/jira/browse/NUTCH-541
 Project: Nutch
  Issue Type: New Feature
  Components: indexer, searcher
Affects Versions: 1.0.0
Reporter: Enis Soztutar
Assignee: Enis Soztutar

 The url field is indexed as Store.YES, Index.TOKENIZED. We also need the 
 untokenized version of the url field in some contexts: 
 1. For deleting duplicates by url (at search time); see NUTCH-455.
 2. For restricting the search to a certain url (may be used in the case of 
 RSS search, where each entry in the RSS feed is added as a distinct document 
 with (possibly) the same url). 
 query-url extends FieldQueryFilter, so: 
 Query: url:http://www.apache.org/
 Parsed: url:http http-www http-www-apache www www-apache apache org
 Translated: +url:http-http-www http-www-http-www-apache 
 http-www-apache-www www-www-apache www-apache apache org
 3. For accessing a document (or documents) in the search servers 
 (using a query plugin).
 I suggest we index url as follows in index-basic and implement a 
 query-url-untoken plugin: 
 doc.add(new Field("url", url.toString(), Field.Store.YES, 
 Field.Index.TOKENIZED));
 doc.add(new Field("url_untoken", url.toString(), Field.Store.NO, 
 Field.Index.UN_TOKENIZED));


[jira] Updated: (NUTCH-628) Host database to keep track of host-level information

[
https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Chris A. Mattmann updated NUTCH-628:

Patch Info: [Patch Available]
Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

Host database to keep track of host-level information
-

Key: NUTCH-628
URL: https://issues.apache.org/jira/browse/NUTCH-628
Project: Nutch
Issue Type: New Feature
Components: fetcher, generator
Reporter: Otis Gospodnetic
Attachments: domain_statistics_v2.patch,
NUTCH-628-DomainStatistics.patch, NUTCH-628-HostDb.patch

Nutch would benefit from having a DB with per-host/domain/TLD information.
For instance, Nutch could detect hosts that are timing out, store information
about that in this DB. Segment/fetchlist Generator could then skip such
hosts, so they don't slow down the fetch job. Another good use for such a DB
is keeping track of various host scores, e.g. spam score.
From the recent thread on nutch-u...@lucene:
Otis asked:
> While we are at it, how would one go about implementing this DB, as far as
> its structures go?
Andrzej said:
> The easiest I can imagine is to use something like <Text, MapWritable>.
> This way you could store arbitrary information under arbitrary keys.
> I.e. a single database then could keep track of aggregate statistics at
> different levels, e.g. TLD, domain, host, ip range, etc. The basic set
> of statistics could consist of a few predefined gauges, totals and averages.
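
The structure suggested in the thread can be sketched in plain Java, with a
nested map standing in for a Text key and a MapWritable value. This is an
illustration under assumptions, not code from the attached patches:
HostDbSketch, increment, get and shouldSkip are hypothetical names, and the
"timeouts" gauge and skip threshold are made up for the example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

final class HostDbSketch {

    // key (TLD, domain, host, ...) -> gauge name -> value
    private final Map<String, Map<String, Long>> stats = new HashMap<>();

    // Add one to a named gauge under the given key.
    void increment(String key, String gauge) {
        stats.computeIfAbsent(key, k -> new HashMap<>()).merge(gauge, 1L, Long::sum);
    }

    // Read a gauge; unknown keys and gauges count as 0.
    long get(String key, String gauge) {
        Map<String, Long> gauges = stats.getOrDefault(key, Collections.<String, Long>emptyMap());
        return gauges.getOrDefault(gauge, 0L);
    }

    // A generator could consult this to leave out hosts that keep timing out.
    boolean shouldSkip(String host, long maxTimeouts) {
        return get(host, "timeouts") >= maxTimeouts;
    }
}
```

Because the keys are arbitrary strings, the same structure can hold entries
for a host (slow.example.com), a domain (example.com) and a TLD (com) side by
side.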


[jira] Updated: (NUTCH-650) Hbase Integration


 [ 
https://issues.apache.org/jira/browse/NUTCH-650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-650:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Hbase Integration
 -

 Key: NUTCH-650
 URL: https://issues.apache.org/jira/browse/NUTCH-650
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 1.0.0
Reporter: Doğacan Güney
Assignee: Doğacan Güney
 Attachments: hbase-integration_v1.patch, hbase_v2.patch, 
 malformedurl.patch, meta.patch, meta2.patch, nofollow-hbase.patch, 
 NUTCH-650.patch, nutch-habase.patch, searching.diff, slash.patch


 This issue will track nutch/hbase integration


[jira] Updated: (NUTCH-583) FeedParser empty links for items


 [ 
https://issues.apache.org/jira/browse/NUTCH-583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-583:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 FeedParser empty links for items
 

 Key: NUTCH-583
 URL: https://issues.apache.org/jira/browse/NUTCH-583
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Enis Soztutar
Assignee: Enis Soztutar

 FeedParser in the feed plugin just discards an item if it does not have a 
 link element. However, RSS 2.0 does not require the link element for each 
 item. 
 Moreover, sometimes the link is given in the guid element, which is a 
 globally unique identifier for the item. I think we can first search the 
 item for a url; if none is found, we can use the feed's url, but 
 with all the parse texts merged into one Parse object. 
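
 The fallback order proposed above (link, then guid, then the feed's own url) 
 can be sketched in plain Java. This is only an illustration, not the 
 FeedParser code: FeedItemUrlSketch, looksLikeUrl and resolveItemUrl are 
 hypothetical names, the http/https test is an assumption, and the merging of 
 parse texts is not modelled.

```java
import java.util.Locale;

final class FeedItemUrlSketch {

    // Assumed heuristic: a guid is usable as a link only if it looks like an http(s) url.
    static boolean looksLikeUrl(String s) {
        if (s == null) {
            return false;
        }
        String t = s.trim().toLowerCase(Locale.ROOT);
        return t.startsWith("http://") || t.startsWith("https://");
    }

    // Prefer the item's link, then a url-like guid, and finally the feed's url.
    static String resolveItemUrl(String link, String guid, String feedUrl) {
        if (link != null && !link.trim().isEmpty()) {
            return link.trim();
        }
        if (looksLikeUrl(guid)) {
            return guid.trim();
        }
        return feedUrl;
    }
}
```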


[jira] Updated: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool


 [ 
https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-666:


 Due Date: 27/Nov/08  (was: 27/Nov/08)
Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Analysis plugins for multiple language and new Language Identifier Tool
 ---

 Key: NUTCH-666
 URL: https://issues.apache.org/jira/browse/NUTCH-666
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.1
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Attachments: NUTCH-666-1-20081126.patch, NUTCH-666-2-20091217-nf.patch


 Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch, 
 Russian, and Thai.  Also includes a new Language Identifier tool that uses 
 the new indexing framework in NUTCH-646.


[jira] Updated: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool


 [ 
https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-666:


Patch Info: [Patch Available]

 Analysis plugins for multiple language and new Language Identifier Tool
 ---

 Key: NUTCH-666
 URL: https://issues.apache.org/jira/browse/NUTCH-666
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.1
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Attachments: NUTCH-666-1-20081126.patch, NUTCH-666-2-20091217-nf.patch


 Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch, 
 Russian, and Thai.  Also includes a new Language Identifier tool that uses 
 the new indexing framework in NUTCH-646.


[jira] Updated: (NUTCH-475) Adaptive crawl delay


 [ 
https://issues.apache.org/jira/browse/NUTCH-475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-475:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Adaptive crawl delay
 

 Key: NUTCH-475
 URL: https://issues.apache.org/jira/browse/NUTCH-475
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Reporter: Doğacan Güney
 Attachments: adaptive-delay_draft.patch


 The current fetcher implementation waits a default interval before making 
 another request to the same server (if crawl-delay is not specified in 
 robots.txt). IMHO, an adaptive implementation would be better. If the server 
 is under little load and can serve requests fast, then the fetcher can ask 
 for more pages in a given interval. Similarly, if the server is suffering 
 from heavy load, the fetcher can slow down (w.r.t. that host), easing the 
 load on the server.
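
 One possible adaptive policy can be sketched in plain Java. This is an 
 illustration under assumptions, not the logic of the attached draft patch: 
 AdaptiveDelaySketch and nextDelayMillis are hypothetical names, and the 
 factor of 10 and the averaging step are arbitrary choices for the example.

```java
final class AdaptiveDelaySketch {

    // Assumed policy: aim for a delay of 10x the last response time.
    static final long RESPONSE_FACTOR = 10;

    // Move the per-host delay halfway towards the target, then clamp it so the
    // fetcher never becomes impolite (min) or stalls on one host (max).
    static long nextDelayMillis(long currentDelayMillis, long responseMillis,
                                long minDelayMillis, long maxDelayMillis) {
        long target = responseMillis * RESPONSE_FACTOR;
        long smoothed = (currentDelayMillis + target) / 2;
        return Math.max(minDelayMillis, Math.min(maxDelayMillis, smoothed));
    }
}
```

 With a current delay of 5000 ms and bounds of 1000 to 30000 ms, a 100 ms 
 response lowers the delay to 3000 ms, while a 2000 ms response raises it to 
 12500 ms.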


[jira] Updated: (NUTCH-771) Add WebGraph classes to the bin/nutch script


 [ 
https://issues.apache.org/jira/browse/NUTCH-771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-771:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Add WebGraph classes to the bin/nutch script
 

 Key: NUTCH-771
 URL: https://issues.apache.org/jira/browse/NUTCH-771
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.1
 Environment: All, shell script
Reporter: Dennis Kubes
Assignee: Dennis Kubes

 Currently the webgraph jobs are called on the command line by calling main 
 methods on their classes.  I propose to upgrade the bin/nutch shell script to 
 allow calling these jobs as well.  This would include the webgraphdb, 
 linkrank, scoreupdater, and nodedumper jobs.


[jira] Commented: (NUTCH-673) Upgrade the Carrot2 plug-in to release 3.0


[ 
https://issues.apache.org/jira/browse/NUTCH-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852047#action_12852047
 ] 

Chris A. Mattmann commented on NUTCH-673:
-

Folks: if you get time to put together a patch for 1.1 or feel that this should 
go into 1.1, please see:  http://bit.ly/c7tBv9 and comment in the next 48 hrs...

 Upgrade the Carrot2 plug-in to release 3.0
 --

 Key: NUTCH-673
 URL: https://issues.apache.org/jira/browse/NUTCH-673
 Project: Nutch
  Issue Type: Improvement
  Components: web gui
Affects Versions: 0.9.0
 Environment: All Nutch deployments.
Reporter: Sean Dean
Priority: Minor

 Version 3.0 of the Carrot2 plug-in was released recently.
 We currently have version 2.1 in the source tree, and upgrading it to the 
 latest version before the 1.0 release might make sense.
 Details on the release can be found here: 
 http://project.carrot2.org/release-3.0-notes.html
 One major change in requirements is that JDK 1.5 must be used, but this is 
 also now required for Hadoop 0.19, so this wouldn't be the only reason for 
 the switch.


[jira] Commented: (NUTCH-789) Improvements to Tika parser


[ 
https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852048#action_12852048
 ] 

Chris A. Mattmann commented on NUTCH-789:
-

Folks, I'm going to put together an RC for Tika 0.7 and take care of JIRA now. 
Once I do that, we can try and close out this issue for 1.1. I should be able 
to do this before the 48 hr deadline I threw up for Nutch 1.1...

 Improvements to Tika parser
 ---

 Key: NUTCH-789
 URL: https://issues.apache.org/jira/browse/NUTCH-789
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
 Environment: reported by Sami, in NUTCH-766
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.1

 Attachments: NutchTikaConfig.java, TikaParser.java


 As reported by Sami in NUTCH-766, Sami has a few improvements he made to the 
 Tika parser. We'll track that progress here.


[jira] Commented: (NUTCH-794) Language Identification must use check the parse metadata for language values


[ 
https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852101#action_12852101
 ] 

Chris A. Mattmann commented on NUTCH-794:
-

Hey Julien, yepper, I posted an RC of Tika 0.7, see: http://bit.ly/c7FZRc. If 
the VOTE passes on that in say the next 72 hours, I will push out a Tika 0.7 
release to the mirrors. If everyone is OK with that, we can release Nutch 1.1 
after...thoughts?

 Language Identification must use check the parse metadata for language values 
 --

 Key: NUTCH-794
 URL: https://issues.apache.org/jira/browse/NUTCH-794
 Project: Nutch
  Issue Type: Bug
  Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-794.patch


 The following HTML document: 
 <html lang="fi"><head>document 1 title</head><body>jotain 
 suomeksi</body></html>
 is rendered as the following xhtml by Tika: 
 <?xml version="1.0" encoding="UTF-8"?><html 
 xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 
 titlejotain suomeksi</body></html>
 with the lang attribute getting lost.  The lang is not stored in the metadata 
 either.
 I will open an issue on Tika and modify TestHTMLLanguageParser so that the 
 tests don't break anymore 


[jira] Updated: (NUTCH-570) Improvement of URL Ordering in Generator.java

2010-03-30 Thread Serykh Evgeniy (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Serykh Evgeniy updated NUTCH-570:
-

Attachment: GeneratorDiff_v1.out

Improvement of URL Ordering in Generator.java
-


[jira] Resolved: (NUTCH-779) Mechanism for passing metadata from parse to crawldb


 [ 
https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-779.
-

   Resolution: Fixed
Fix Version/s: 1.1

Committed revision 929038.

Thanks Andrzej for your feedback

 Mechanism for passing metadata from parse to crawldb
 

 Key: NUTCH-779
 URL: https://issues.apache.org/jira/browse/NUTCH-779
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-779, NUTCH-779-v2.patch


 The attached patch allows parse metadata to be passed to the corresponding 
 entry of the crawldb.  
 Comments are welcome


[jira] Closed: (NUTCH-785) Fetcher : copy metadata from origin URL when redirecting + call scfilters.initialScore on newly created URL


 [ 
https://issues.apache.org/jira/browse/NUTCH-785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-785.
---

Resolution: Fixed

Committed revision 929039

Thanks Andrzej for reviewing it

 Fetcher : copy metadata from origin URL when redirecting + call 
 scfilters.initialScore on newly created URL
 ---

 Key: NUTCH-785
 URL: https://issues.apache.org/jira/browse/NUTCH-785
 Project: Nutch
  Issue Type: Bug
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-785.patch


 When following redirections, the Fetcher neither copies the metadata from 
 the original URL to the new one nor calls the method scfilters.initialScore.


[jira] Commented: (NUTCH-789) Improvements to Tika parser


[ 
https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851316#action_12851316
 ] 

Julien Nioche commented on NUTCH-789:
-

Shall we postpone the work on this issue to after 1.1?

 Improvements to Tika parser
 ---

 Key: NUTCH-789
 URL: https://issues.apache.org/jira/browse/NUTCH-789
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
 Environment: reported by Sami, in NUTCH-766
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.1

 Attachments: NutchTikaConfig.java, TikaParser.java


 As reported by Sami in NUTCH-766, Sami has a few improvements he made to the 
 Tika parser. We'll track that progress here.


[jira] Commented: (NUTCH-789) Improvements to Tika parser

2010-03-30 Thread Andrzej Bialecki (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851331#action_12851331
 ] 

Andrzej Bialecki  commented on NUTCH-789:
-

There are no diffs, so it's difficult to figure out what's changed ... I think 
that Tika will soon release v. 0.7 which may also impact this patch if we 
decide to upgrade before our release. I asked the Tika guys about their 
release, let's wait a couple days more.

 Improvements to Tika parser
 ---

 Key: NUTCH-789
 URL: https://issues.apache.org/jira/browse/NUTCH-789
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
 Environment: reported by Sami, in NUTCH-766
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.1

 Attachments: NutchTikaConfig.java, TikaParser.java


 As reported by Sami in NUTCH-766, Sami has a few improvements he made to the 
 Tika parser. We'll track that progress here.


[jira] Commented: (NUTCH-570) Improvement of URL Ordering in Generator.java

2010-03-30 Thread Otis Gospodnetic (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851461#action_12851461
]

Otis Gospodnetic commented on NUTCH-570:

Serykh, what does your version of the patch do differently? (maybe it's just an
update so it applies to trunk?)

Julien, want to take this?

Improvement of URL Ordering in Generator.java
-


[jira] Commented: (NUTCH-570) Improvement of URL Ordering in Generator.java

[
https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851545#action_12851545
]

Julien Nioche commented on NUTCH-570:
-

{quote}Julien, want to take this?{quote}

Not particularly. I am busy with short-term issues for 1.1, so feel free to take
it if you have a particular interest in this.
I would be curious to see some figures on the improvements from this patch; my
impression is that NUTCH-776 would be quicker to implement and maintain and
might give similar gains.

Improvement of URL Ordering in Generator.java
-

[jira] Commented: (NUTCH-570) Improvement of URL Ordering in Generator.java

2010-03-30 Thread Dmitry Lihachev (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851710#action_12851710
]

Dmitry Lihachev commented on NUTCH-570:
---

Yeah, Otis. It's just an update so it applies to trunk.

Improvement of URL Ordering in Generator.java
-

[jira] Commented: (NUTCH-779) Mechanism for passing metadata from parse to crawldb

2010-03-30 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851719#action_12851719
 ] 

Hudson commented on NUTCH-779:
--

Integrated in Nutch-trunk #1112 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1112/])
 Mechanism for passing metadata from parse to crawldb


 Mechanism for passing metadata from parse to crawldb
 

 Key: NUTCH-779
 URL: https://issues.apache.org/jira/browse/NUTCH-779
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-779, NUTCH-779-v2.patch


 The attached patch allows parse metadata to be passed to the corresponding 
 entry of the crawldb.
 Comments are welcome.

[jira] Closed: (NUTCH-784) CrawlDBScanner


 [ 
https://issues.apache.org/jira/browse/NUTCH-784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-784.
---

Resolution: Fixed

Committed revision 928746

 CrawlDBScanner 
 ---

 Key: NUTCH-784
 URL: https://issues.apache.org/jira/browse/NUTCH-784
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-784.patch


 The patch file contains a utility which dumps all the crawldb entries whose 
 URL matches a regular expression. The dump mechanism of the crawldb reader is 
 not very useful on large crawldbs, as the output can be extremely large, and 
 the -url function can't help if we don't know which URL we want to look at.
 The CrawlDBScanner can generate either a text representation of the 
 CrawlDatum-s or binary objects which can then be used as a new CrawlDB. 
 Usage: CrawlDBScanner crawldb output regex [-s status] [-text]
 regex: regular expression on the crawldb key
 -s status : constraint on the status of the crawldb entries, e.g. db_fetched, 
 db_unfetched
 -text : if this parameter is used, the output will be of TextOutputFormat; 
 otherwise it generates a 'normal' crawldb with the MapFileOutputFormat
 For instance, the command below: 
 ./nutch com.ant.CrawlDBScanner crawl/crawldb /tmp/amazon-dump .+amazon.com.* 
 -s db_fetched -text
 will generate a text file /tmp/amazon-dump containing all the entries of the 
 crawldb matching the regexp .+amazon.com.* and having a status of db_fetched.
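
The selection rule described above (regular expression on the URL key, plus an optional status constraint) can be sketched roughly as follows. This is an illustration only, not the code from NUTCH-784.patch: the class ScanSketch, the method keep and the in-memory map standing in for the crawldb are all hypothetical, and it assumes the regex must match the whole URL, which is what the leading .+ and trailing .* in the example suggest.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class ScanSketch {

    /**
     * Hypothetical stand-in for the scanner's filter (not the Nutch API).
     * Keeps an entry when the regex matches the whole URL and, if a status
     * constraint was given, when the entry's status equals it.
     */
    static boolean keep(String url, String status, Pattern regex, String wantedStatus) {
        if (!regex.matcher(url).matches()) {
            return false;
        }
        // A null constraint means "-s" was not given: accept any status.
        return wantedStatus == null || wantedStatus.equals(status);
    }

    public static void main(String[] args) {
        // Toy "crawldb": URL key -> status name (assumed sample data).
        Map<String, String> crawldb = new LinkedHashMap<>();
        crawldb.put("http://www.amazon.com/books", "db_fetched");
        crawldb.put("http://www.amazon.com/music", "db_unfetched");
        crawldb.put("http://www.example.org/", "db_fetched");

        Pattern regex = Pattern.compile(".+amazon.com.*");
        for (Map.Entry<String, String> e : crawldb.entrySet()) {
            if (keep(e.getKey(), e.getValue(), regex, "db_fetched")) {
                // Text-style output: one matching entry per line.
                System.out.println(e.getKey() + "\t" + e.getValue());
            }
        }
    }
}
```

With the sample data above, only the fetched amazon.com entry passes both the regex and the status check.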

[jira] Updated: (NUTCH-784) CrawlDBScanner


 [ 
https://issues.apache.org/jira/browse/NUTCH-784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-784:


Fix Version/s: 1.1

[jira] Commented: (NUTCH-784) CrawlDBScanner

2010-03-29 Thread Andrzej Bialecki (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850896#action_12850896
 ] 

Andrzej Bialecki  commented on NUTCH-784:
-

This should have been reviewed first - I don't question the usefulness of this 
class, but I think that this should have been added as an option to 
CrawlDbReader. As it is now we get a new tool with a cryptic name that performs 
a function that is a variant of another existing tool...

[jira] Created: (NUTCH-806) Merge CrawlDBScanner with CrawlDBReader

Merge CrawlDBScanner with CrawlDBReader
---

 Key: NUTCH-806
 URL: https://issues.apache.org/jira/browse/NUTCH-806
 Project: Nutch
  Issue Type: Improvement
Reporter: Julien Nioche
Assignee: Julien Nioche


The CrawlDBScanner [NUTCH-784] should be merged with the CrawlDBReader. I will 
do that after the 1.1 release.

[jira] Updated: (NUTCH-783) IndexerChecker Utility


 [ 
https://issues.apache.org/jira/browse/NUTCH-783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-783:


Fix Version/s: (was: 1.1)

Removed tag 1.1
Will rename to IndexingPluginsChecker later

 IndexerChecker Utility
 -

 Key: NUTCH-783
 URL: https://issues.apache.org/jira/browse/NUTCH-783
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Julien Nioche
Assignee: Julien Nioche
 Attachments: NUTCH-783.patch


 This patch contains a new utility for checking the configuration of the 
 indexing filters. The IndexerChecker reads and parses a URL and runs the 
 indexers on it, then displays the fields obtained and the first 
 100 characters of their values.
 Example usage: ./nutch org.apache.nutch.indexer.IndexerChecker 
 http://www.lemonde.fr/
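
As a rough illustration of the display step described above (field name plus the first 100 characters of its value), here is a minimal sketch. It is not taken from NUTCH-783.patch: the class FieldPreview, the method preview and the sample field map are hypothetical, and the parsing and indexing-filter stages are assumed to have already produced the fields.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FieldPreview {

    /** Number of characters shown per field value, as stated in the issue. */
    static final int MAX_CHARS = 100;

    /** Returns the value unchanged if short enough, else its first MAX_CHARS characters. */
    static String preview(String value) {
        return value.length() <= MAX_CHARS ? value : value.substring(0, MAX_CHARS);
    }

    public static void main(String[] args) {
        // Build a 300-character value to show the truncation (assumed sample data).
        StringBuilder longValue = new StringBuilder();
        for (int i = 0; i < 30; i++) {
            longValue.append("0123456789");
        }

        // Hypothetical output of the indexing filters: field name -> value.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", "Le Monde.fr");
        fields.put("content", longValue.toString());

        for (Map.Entry<String, String> f : fields.entrySet()) {
            System.out.println(f.getKey() + " :\t" + preview(f.getValue()));
        }
    }
}
```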

[jira] Commented: (NUTCH-785) Fetcher : copy metadata from origin URL when redirecting + call scfilters.initialScore on newly created URL


[ 
https://issues.apache.org/jira/browse/NUTCH-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850912#action_12850912
 ] 

Julien Nioche commented on NUTCH-785:
-

Could anyone please review this issue? I would like to commit it in time for 
the 1.1 release

 Fetcher : copy metadata from origin URL when redirecting + call 
 scfilters.initialScore on newly created URL
 ---

 Key: NUTCH-785
 URL: https://issues.apache.org/jira/browse/NUTCH-785
 Project: Nutch
  Issue Type: Bug
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-785.patch


 When following redirections, the Fetcher neither copies the metadata from 
 the original URL to the new one nor calls the method scfilters.initialScore.
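
The behaviour the issue asks for could look roughly like the sketch below. It is a simplified illustration under stated assumptions, not the Fetcher code or the contents of NUTCH-785.patch: Entry, InitialScorer and followRedirect are hypothetical stand-ins for the crawl entry, the scoring filters and the redirect handling, and the order of the two steps (copy first, then score) is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

public class RedirectSketch {

    /** Hypothetical, simplified stand-in for a crawl entry. */
    static class Entry {
        final Map<String, String> metadata = new HashMap<>();
        float score;
    }

    /** Hypothetical stand-in for the scoring filters' initial-score hook. */
    interface InitialScorer {
        void initialScore(String url, Entry entry);
    }

    /** Creates the entry for the redirect target from the originating entry. */
    static Entry followRedirect(Entry origin, String newUrl, InitialScorer scorer) {
        Entry target = new Entry();
        // Step 1: carry the metadata of the original URL over to the new one.
        target.metadata.putAll(origin.metadata);
        // Step 2: give the newly created entry its initial score.
        scorer.initialScore(newUrl, target);
        return target;
    }

    public static void main(String[] args) {
        Entry origin = new Entry();
        origin.metadata.put("seed-label", "news");

        Entry target = followRedirect(origin, "http://example.org/new",
                (url, e) -> e.score = 1.0f);
        System.out.println(target.metadata + " score=" + target.score);
    }
}
```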

[jira] Commented: (NUTCH-779) Mechanism for passing metadata from parse to crawldb