[jira] [Resolved] (NUTCH-1394) backport NUTCH-1232 Remove site field from index-basic

2012-12-08 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-1394.
-

Resolution: Fixed
  Assignee: Lewis John McGibbney

Committed @revision 1418750 in Nutch 2.x branch

 backport NUTCH-1232 Remove site field from index-basic
 --

 Key: NUTCH-1394
 URL: https://issues.apache.org/jira/browse/NUTCH-1394
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, storage
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 2.2


 This is a simple backport. The 2.0 Solr schema and mappings still contain the
 field site, which was removed in 1.x (NUTCH-1232). It should also be removed
 in 2.0: it is easier to maintain only one Solr installation for all Nutch
 versions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1038) Port IndexingFiltersChecker to 2.0

2012-12-08 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13527265#comment-13527265
 ] 

Lewis John McGibbney commented on NUTCH-1038:
-

Hi Seb, this works great; however, I think there is a bug lingering in
BasicIndexingFilter's doc.add(tstamp, tstamp) call. It doesn't seem right. I
have posted it to the user@ list.
I am +1 for committing though. The bug (if it is one) is not related to this
patch.

 Port IndexingFiltersChecker to 2.0
 --

 Key: NUTCH-1038
 URL: https://issues.apache.org/jira/browse/NUTCH-1038
 Project: Nutch
  Issue Type: New Feature
Affects Versions: nutchgora
Reporter: Markus Jelsma
 Fix For: 2.2

 Attachments: NUTCH-1038.patch






[ANNOUNCE] Apache Nutch 1.6 Released

2012-12-08 Thread Lewis John Mcgibbney
Hi All,

The Apache Nutch PMC are extremely pleased to announce the release of
Apache Nutch v1.6. This release includes over 20 bug fixes and as many
improvements, as well as new functionality including a new
HostNormalizer, the ability to dynamically set fetchInterval by
MIME-type and functional enhancements to the Indexer API, including the
normalization of URLs and the deletion of robots noIndex documents.
Other notable improvements include the upgrade of key dependencies to
Tika 1.2 and Automaton 1.11-8.

A full PMC statement can be found here [0]

The release can be found on official Apache mirrors [1] as well as
sources in Maven Central [2]

Thank you

Lewis
On Behalf of the Nutch PMC

[0] http://s.apache.org/NFp
[1] http://www.apache.org/dyn/closer.cgi/nutch/
[2] http://search.maven.org/#artifactdetails|org.apache.nutch|nutch|1.6|jar

-- 
Lewis


[jira] [Resolved] (NUTCH-1183) Summary task for adding command line usage instructions to webgraph classes

2012-12-08 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-1183.
-

Resolution: Not A Problem

This issue is invalid. Thinking back I really haven't got a clue what I 
registered it for...
Sorry troops.
Closing.

 Summary task for adding command line usage instructions to webgraph classes
 ---

 Key: NUTCH-1183
 URL: https://issues.apache.org/jira/browse/NUTCH-1183
 Project: Nutch
  Issue Type: Improvement
  Components: documentation
Affects Versions: 1.4, nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: 1.7


 The following files should print usage output when called incorrectly from the
 command line, something similar to
 {code}
 Usage: class -arg1, -arg2, etc etc
 {code}
 * webgraph
 * linkrank
 * scoreupdater
 * nodedumper
 * nodereader
 If anyone would like to see further classes included in this task please add 
 to the above list.
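
For illustration only, a minimal sketch of the kind of usage check described above. The class name UsageSketch and the options -arg1 and -arg2 are hypothetical placeholders taken from the example usage line; the real options of the webgraph classes are not given in this issue.

```java
public class UsageSketch {
    // Hypothetical usage text; the real option names are not stated in this issue.
    static final String USAGE = "Usage: UsageSketch -arg1 <value> -arg2 <value>";

    /** Returns 0 when both options are present, otherwise prints the usage line and returns -1. */
    static int run(String[] args) {
        String arg1 = null;
        String arg2 = null;
        // Each option expects a value, so stop one short of the end of the array.
        for (int i = 0; i < args.length - 1; i++) {
            if ("-arg1".equals(args[i])) {
                arg1 = args[++i];
            } else if ("-arg2".equals(args[i])) {
                arg2 = args[++i];
            }
        }
        if (arg1 == null || arg2 == null) {
            System.err.println(USAGE);
            return -1;
        }
        return 0;
    }

    public static void main(String[] args) {
        // A real tool would probably hand this return code to System.exit.
        run(args);
    }
}
```

Returning a code from run rather than exiting inside it keeps the check easy to call from a test.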



[jira] [Commented] (NUTCH-1140) index-more plugin, resetTitle method creates multiple values in the Title field

2012-12-08 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13527276#comment-13527276
 ] 

Lewis John McGibbney commented on NUTCH-1140:
-

Do we want to integrate this into trunk and 2.x? If so, I can write the trivial
test case.

 index-more plugin, resetTitle method creates multiple values in the Title 
 field
 ---

 Key: NUTCH-1140
 URL: https://issues.apache.org/jira/browse/NUTCH-1140
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.3
Reporter: Joe Liedtke
Priority: Minor
 Fix For: 1.7

 Attachments: MoreIndexingFilter.093011.patch


 From the comments in MoreIndexingFilter.java, the index-more plugin is meant 
 to reset the Title field of a document if it contains a Content-Disposition 
 header. The current behavior is to add a Title regardless of whether one 
 exists or not, which can cause issues down the line with the Solr Indexing 
 process, and based on a thread in the nutch user list it appears that this is 
 causing some users to mark the title as multi-valued in the schema:
   
 http://www.lucidimagination.com/search/document/9440ff6b5deb285b/multiple_values_encountered_for_non_multivalued_field_title#17736c5807826be8
 The following patch removes the title field before adding a new one, which 
 has resolved the issue for me:
 --- MoreIndexingFilter.old2011-09-30 11:44:35.0 +
 +++ MoreIndexingFilter.java   2011-09-30 09:58:48.0 +
 @@ -276,6 +276,7 @@
      for (int i=0; i<patterns.length; i++) {
        if (matcher.contains(contentDisposition,patterns[i])) {
          result = matcher.getMatch();
 +        doc.removeField("title");
          doc.add("title", result.group(1));
          break;
        }
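
To make the effect of the change easier to see, here is a small self-contained sketch. MiniDoc is a hypothetical stand-in for a multi-valued index document, not the Nutch class, and resetTitle here only mirrors the remove-then-add idea from the patch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TitleResetSketch {
    /** Hypothetical stand-in for a multi-valued index document; not the Nutch class. */
    static class MiniDoc {
        private final Map<String, List<String>> fields = new HashMap<>();

        /** Appends a value, so calling this twice for one name yields two values. */
        void add(String name, String value) {
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }

        /** Drops every value stored under the given name. */
        void removeField(String name) {
            fields.remove(name);
        }

        List<String> values(String name) {
            return fields.getOrDefault(name, new ArrayList<>());
        }
    }

    /** Replaces any existing title instead of appending a second value. */
    static void resetTitle(MiniDoc doc, String newTitle) {
        doc.removeField("title");
        doc.add("title", newTitle);
    }

    public static void main(String[] args) {
        MiniDoc doc = new MiniDoc();
        doc.add("title", "Title from HTML");
        resetTitle(doc, "report.pdf");
        System.out.println(doc.values("title"));
    }
}
```

Without the removeField call the document would carry two title values, which is what a single-valued Solr field rejects.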



[jira] [Commented] (NUTCH-1409) Remove deprecated properties in nutch-default.xml

2012-12-08 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13527278#comment-13527278
 ] 

Lewis John McGibbney commented on NUTCH-1409:
-

I am not sure about the logging for this one. If we are to remove the 
properties, thereby denying users the choice to override them, then why do we 
need to log that they should use some other settings instead? That said, this
one is nearly ready.
@Matthias, off the top of your head, I wonder if you were able to check out 2.x 
and comment on the code? Thank you very much for the patch.

 Remove deprecated properties in nutch-default.xml
 -

 Key: NUTCH-1409
 URL: https://issues.apache.org/jira/browse/NUTCH-1409
 Project: Nutch
  Issue Type: Improvement
Reporter: Matthias Agethle
Priority: Minor
 Fix For: 1.7

 Attachments: NUTCH-1409.patch


 1) Remove deprecated properties from nutch-default.xml (generate.max.per.host 
 and db.default.fetch.interval).
 2) The already removed properties generate.max.per.host.by.ip and 
 db.max.fetch.interval are still used in source code.
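
As a rough illustration, a sketch of how one might detect whether any of the four properties named above are still set. It uses plain java.util.Properties as an assumed stand-in for the real configuration object, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;

public class DeprecatedPropertySketch {
    // Property names taken from the issue description.
    static final List<String> DEPRECATED = Arrays.asList(
        "generate.max.per.host",
        "db.default.fetch.interval",
        "generate.max.per.host.by.ip",
        "db.max.fetch.interval");

    /** Returns the deprecated keys that are still set in the given configuration. */
    static List<String> findDeprecated(Properties conf) {
        return DEPRECATED.stream()
            .filter(key -> conf.getProperty(key) != null)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("db.default.fetch.interval", "30");
        conf.setProperty("http.agent.name", "example");
        for (String key : findDeprecated(conf)) {
            System.err.println("Deprecated property still set: " + key);
        }
    }
}
```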



[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika

2012-12-08 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-840:
---

Attachment: NUTCH-840v2.patch

This is for trunk.
There is a problem here where the new tests (for parse-tika) also seem to be
executed against (within?) other plugin testing scenarios... I am stuck at the
moment as to why this is.
Once we fix it, we will port to 2.x.

 Port tests from parse-html to parse-tika
 

 Key: NUTCH-840
 URL: https://issues.apache.org/jira/browse/NUTCH-840
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.2

 Attachments: NUTCH-840.patch, NUTCH-840.patch, NUTCH-840v2.patch


 We don't have tests for HTML in parse-tika, so I'll copy them from the old
 parse-html plugin.



Re: [ANNOUNCE] Apache Nutch 1.6 Released

2012-12-08 Thread Julien Nioche
Great stuff! Thanks Lewis

On 8 December 2012 21:50, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

 Hi All,

 The Apache Nutch PMC are extremely pleased to announce the release of
 Apache Nutch v1.6. This release includes over 20 bug fixes and as many
 improvements, as well as new functionality including a new
 HostNormalizer, the ability to dynamically set fetchInterval by
 MIME-type and functional enhancements to the Indexer API, including the
 normalization of URLs and the deletion of robots noIndex documents.
 Other notable improvements include the upgrade of key dependencies to
 Tika 1.2 and Automaton 1.11-8.

 A full PMC statement can be found here [0]

 The release can be found on official Apache mirrors [1] as well as
 sources in Maven Central [2]

 Thank you

 Lewis
 On Behalf of the Nutch PMC

 [0] http://s.apache.org/NFp
 [1] http://www.apache.org/dyn/closer.cgi/nutch/
 [2]
 http://search.maven.org/#artifactdetails|org.apache.nutch|nutch|1.6|jar

 --
 Lewis




-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika

2012-12-08 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-840:


Attachment: NUTCH-840-trunk.patch

Modified version of the patch to fix the tests post NUTCH-797

 Port tests from parse-html to parse-tika
 

 Key: NUTCH-840
 URL: https://issues.apache.org/jira/browse/NUTCH-840
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.2

 Attachments: NUTCH-840.patch, NUTCH-840.patch, NUTCH-840-trunk.patch, 
 NUTCH-840v2.patch


 We don't have tests for HTML in parse-tika, so I'll copy them from the old
 parse-html plugin.



[jira] [Commented] (NUTCH-840) Port tests from parse-html to parse-tika

2012-12-08 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13527362#comment-13527362
 ] 

Julien Nioche commented on NUTCH-840:
-

The tests now run OK with the patch I just attached.

bq. There is a problem here where the new tests (for parse-tika) also seem to 
be executed against (within?) other plugin testing scenarios

Can you give more detail on this please, Lewis?

 Port tests from parse-html to parse-tika
 

 Key: NUTCH-840
 URL: https://issues.apache.org/jira/browse/NUTCH-840
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.2

 Attachments: NUTCH-840.patch, NUTCH-840.patch, NUTCH-840-trunk.patch, 
 NUTCH-840v2.patch


 We don't have tests for HTML in parse-tika, so I'll copy them from the old
 parse-html plugin.



[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika

2012-12-08 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-840:


Affects Version/s: 1.6
Fix Version/s: 1.7

 Port tests from parse-html to parse-tika
 

 Key: NUTCH-840
 URL: https://issues.apache.org/jira/browse/NUTCH-840
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.1, 1.6
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.7, 2.2

 Attachments: NUTCH-840.patch, NUTCH-840.patch, NUTCH-840-trunk.patch, 
 NUTCH-840v2.patch


 We don't have tests for HTML in parse-tika, so I'll copy them from the old
 parse-html plugin.



[jira] [Updated] (NUTCH-891) Nutch build should not depend on unversioned local deps

2012-12-08 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-891:


Affects Version/s: 2.1

Probably not an issue anymore. Marking it as 2.x to triage unversioned issues;
will check later.

 Nutch build should not depend on unversioned local deps
 ---

 Key: NUTCH-891
 URL: https://issues.apache.org/jira/browse/NUTCH-891
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.1
Reporter: Andrzej Bialecki 
 Attachments: gora-49_v1.patch, gora.build.patch


 The fix in NUTCH-873 introduces an unknown variable to the build process. 
 Since local ivy artifacts are unversioned, different people that install Gora 
 jars at different points in time will use the same artifact id but in fact 
 the artifacts (jars) will differ because they will come from different 
 revisions of Gora sources. Therefore Nutch builds based on the same svn rev. 
 won't be repeatable across different environments.
 As much as it pains the ivy purists ;) until Gora publishes versioned 
 artifacts I'd like to revert the fix in NUTCH-873 and add again Gora jars 
 built from a known external rev. We can add a README that contains commit id 
 from Gora.



[jira] [Closed] (NUTCH-807) JSParseFilter produces malformed URL

2012-12-08 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-807.
---

Resolution: Won't Fix

Closing old issues. The JSParseFilter is known to generate noisy URLs and is
not used by default anymore. This won't get fixed.

 JSParseFilter produces malformed URL
 

 Key: NUTCH-807
 URL: https://issues.apache.org/jira/browse/NUTCH-807
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.0.0
 Environment: Redhat 2.6.18-128.1.6.el5PAE  i686 i686 i386 GNU/Linux
Reporter: Minyao Zhu

 This was found when crawling the site http://zhidao.baidu.com/ (a Chinese
 language site).
 It appears this page contains JavaScript that confused JSParseFilter, which
 produced a URL like this:
 http://zhidao.baidu.com/){if(A===46){baidu.hide(
 I am not sure of the impact/scope of this issue in general. The observation for
 this specific site is that far fewer pages were crawled.
 Thanks.
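
As an aside, a small sketch of one way such a candidate could be screened before being emitted as an outlink. This is an assumption for illustration, not what JSParseFilter does; the class and method names are hypothetical, and the check relies only on java.net.URI rejecting characters such as '{'.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class OutlinkCheckSketch {
    /** Returns true only if the candidate parses as a URI that has both a scheme and a host. */
    static boolean isPlausibleUrl(String candidate) {
        try {
            URI uri = new URI(candidate);
            return uri.getScheme() != null && uri.getHost() != null;
        } catch (URISyntaxException e) {
            // Characters such as '{' are not legal in a URI, so script fragments end up here.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isPlausibleUrl("http://zhidao.baidu.com/"));
        System.out.println(isPlausibleUrl("http://zhidao.baidu.com/){if(A===46){baidu.hide("));
    }
}
```

A strict syntax check like this would only catch fragments containing illegal characters; noisy but well-formed URLs would still pass.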



[jira] [Resolved] (NUTCH-62) Add html META tag information into metaData in index-more plugin

2012-12-08 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-62?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-62.


Resolution: Implemented

This can be done in a more flexible way using index-metadata
https://issues.apache.org/jira/browse/NUTCH-1264

 Add html META tag information into metaData in index-more plugin
 

 Key: NUTCH-62
 URL: https://issues.apache.org/jira/browse/NUTCH-62
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Jack Tang
Priority: Trivial
 Attachments: index-more.patch.zip


 Currently (version dev-0.7), only some metadata from the HTTP response, such as
 type, date and content-length, is available in the index-more plugin. We cannot
 index/store the metadata from the HTML header (META tags, to be exact).



[jira] [Updated] (NUTCH-1267) urlmeta to delegate indexing to index-metadata

2012-12-08 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1267:
-

Assignee: Julien Nioche

 urlmeta to delegate indexing to index-metadata
 --

 Key: NUTCH-1267
 URL: https://issues.apache.org/jira/browse/NUTCH-1267
 Project: Nutch
  Issue Type: Sub-task
  Components: indexer
Reporter: Julien Nioche
Assignee: Julien Nioche


