from:"Julien Nioche"


 [ 
https://issues.apache.org/jira/browse/NUTCH-834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-834.
---

Resolution: Fixed

Committed revision 959228.

Thanks Chris for your comments and help with this.

 Separate the Nutch web site from trunk
 --

 Key: NUTCH-834
 URL: https://issues.apache.org/jira/browse/NUTCH-834
 Project: Nutch
  Issue Type: Task
  Components: documentation
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.0


 As discussed on dev@, it would be useful to move the Nutch web site 
 sources from .../asf/nutch/trunk to .../asf/nutch/site and to use the 
 svnpubsub mechanism for instant deployment of site changes.
 The related issue for infra is 
 https://issues.apache.org/jira/browse/INFRA-2822
 See also https://issues.apache.org/jira/browse/PDFBOX-623

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (NUTCH-650) Hbase Integration


[ 
https://issues.apache.org/jira/browse/NUTCH-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883880#action_12883880
 ] 

Julien Nioche commented on NUTCH-650:
-

The patch has been committed in revision 959259. The content of 
https://svn.apache.org/repos/asf/nutch/branches/nutchbase is now the same as 
the version on GitHub.

 Hbase Integration
 -

 Key: NUTCH-650
 URL: https://issues.apache.org/jira/browse/NUTCH-650
 Project: Nutch
  Issue Type: New Feature
Reporter: Doğacan Güney
Assignee: Doğacan Güney
 Fix For: 2.0

 Attachments: hbase-integration_v1.patch, hbase_v2.patch, 
 latest-nutchbase-vs-original-branch-point.patch, 
 latest-nutchbase-vs-svn-nutchbase.patch, malformedurl.patch, meta.patch, 
 meta2.patch, nb-design.txt, nb-installusage.txt, nofollow-hbase.patch, 
 NUTCH-650.patch, nutch-habase.patch, searching.diff, slash.patch


 This issue will track nutch/hbase integration


[jira] Updated: (NUTCH-836) Remove deprecated parse plugins


 [ 
https://issues.apache.org/jira/browse/NUTCH-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-836:


Attachment: NUTCH-836.patch

 Remove deprecated parse plugins
 ---

 Key: NUTCH-836
 URL: https://issues.apache.org/jira/browse/NUTCH-836
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.0

 Attachments: NUTCH-836.patch


 Some of the parser plugins in 1.1 are covered by the parse-tika plugin. These 
 plugins have been kept in 1.1 but should be removed from 2.0 where we'll rely 
 on parse-tika almost exclusively. Some existing plugins might be kept when 
 there is no equivalent in Tika (to be discussed). The following plugins are 
 removed : 
 * parse-html
 * parse-msexcel
 * parse-mspowerpoint
 * parse-msword
 * parse-pdf
 * parse-oo
 * parse-text
 * lib-jakarta-poi
 * lib-parsems
 The patch does not (yet) remove :
 * parse-js
 * parse-rss
 * parse-swf
 * parse-zip
 * feed
 Please review the patch and vote for its inclusion in the trunk.


[jira] Created: (NUTCH-836) Remove deprecated parse plugins

Remove deprecated parse plugins
---

 Key: NUTCH-836
 URL: https://issues.apache.org/jira/browse/NUTCH-836
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.0
 Attachments: NUTCH-836.patch

Some of the parser plugins in 1.1 are covered by the parse-tika plugin. These 
plugins have been kept in 1.1 but should be removed from 2.0 where we'll rely 
on parse-tika almost exclusively. Some existing plugins might be kept when 
there is no equivalent in Tika (to be discussed). The following plugins are 
removed : 
* parse-html
* parse-msexcel
* parse-mspowerpoint
* parse-msword
* parse-pdf
* parse-oo
* parse-text
* lib-jakarta-poi
* lib-parsems

The patch does not (yet) remove :

* parse-js
* parse-rss
* parse-swf
* parse-zip
* feed

Please review the patch and vote for its inclusion in the trunk.





[jira] Commented: (NUTCH-836) Remove deprecated parse plugins

[
https://issues.apache.org/jira/browse/NUTCH-836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883891#action_12883891
]

Julien Nioche commented on NUTCH-836:
-

Actually, creative-commons and languageidentifier currently depend on
parse-html, and parse-zip depends on parse-text, in their build scripts.
The tests for the Fetcher and ParserFactory also fail without parse-html and
parse-text.

I will modify the patch to prevent these issues.

Remove deprecated parse plugins
---

Key: NUTCH-836
URL: https://issues.apache.org/jira/browse/NUTCH-836
Project: Nutch
Issue Type: Task
Components: parser
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Julien Nioche
Fix For: 2.0

Attachments: NUTCH-836.patch


[jira] Updated: (NUTCH-836) Remove deprecated parse plugins


 [ 
https://issues.apache.org/jira/browse/NUTCH-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-836:


Description: 
Some of the parser plugins in 1.1 are covered by the parse-tika plugin. These 
plugins have been kept in 1.1 but should be removed from 2.0 where we'll rely 
on parse-tika almost exclusively. Some existing plugins might be kept when 
there is no equivalent in Tika (to be discussed). The following plugins are 
removed : 
* parse-html
* parse-msexcel
* parse-mspowerpoint
* parse-msword
* parse-pdf
* parse-oo
* parse-text
* lib-jakarta-poi
* lib-parsems

The patch does not (yet) remove :
* parse-ext
* parse-js
* parse-rss
* parse-swf
* parse-zip
* feed

Please review the patch and vote for its inclusion in the trunk.




  was:
Some of the parser plugins in 1.1 are covered by the parse-tika plugin. These 
plugins have been kept in 1.1 but should be removed from 2.0 where we'll rely 
on parse-tika almost exclusively. Some existing plugins might be kept when 
there is no equivalent in Tika (to be discussed). The following plugins are 
removed : 
* parse-html
* parse-msexcel
* parse-mspowerpoint
* parse-msword
* parse-pdf
* parse-oo
* parse-text
* lib-jakarta-poi
* lib-parsems

The patch does not (yet) remove :

* parse-js
* parse-rss
* parse-swf
* parse-zip
* feed

Please review the patch and vote for its inclusion in the trunk.





 Remove deprecated parse plugins
 ---

 Key: NUTCH-836
 URL: https://issues.apache.org/jira/browse/NUTCH-836
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.0

 Attachments: NUTCH-836.patch



[jira] Created: (NUTCH-837) Remove search servers and Lucene dependencies

Remove search servers and Lucene dependencies 
--

 Key: NUTCH-837
 URL: https://issues.apache.org/jira/browse/NUTCH-837
 Project: Nutch
  Issue Type: Task
  Components: searcher, web gui
Affects Versions: 1.1
Reporter: Julien Nioche
 Fix For: 2.0


One of the main aspects of 2.0 is the delegation of indexing and search to 
external resources like SOLR. We can simplify the code a lot by getting rid of: 
* search servers
* indexing and analysis with Lucene
* search-side functionality: ontologies, clustering, etc.
In the short term only SOLR / SOLRCloud will be supported, but the plan is 
to add other systems as well. 


[jira] Updated: (NUTCH-836) Remove deprecated parse plugins


 [ 
https://issues.apache.org/jira/browse/NUTCH-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-836:


Attachment: (was: NUTCH-836.patch)

 Remove deprecated parse plugins
 ---

 Key: NUTCH-836
 URL: https://issues.apache.org/jira/browse/NUTCH-836
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.0

 Attachments: NUTCH-836-2.patch



[jira] Commented: (NUTCH-835) document deduplication (exact duplicates) failed using MD5Signature


[ 
https://issues.apache.org/jira/browse/NUTCH-835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884624#action_12884624
 ] 

Julien Nioche commented on NUTCH-835:
-

This patch has been marked for 1.2 but has been committed to trunk only (2.0). 
Shall we also apply it to /nutch/branches/branch-1.2 ?

 document deduplication (exact duplicates) failed using MD5Signature
 ---

 Key: NUTCH-835
 URL: https://issues.apache.org/jira/browse/NUTCH-835
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0, 1.1
 Environment: Linux, Ubuntu 10.04, Java 1.6.0_20
Reporter: Sebastian Nagel
Assignee: Andrzej Bialecki 
 Fix For: 1.2, 2.0


 The MD5Signature class calculates different signatures for identical 
 documents.
 The reason is that
   byte[] data = content.getContent();
   ... StringBuilder().append(data) ...
 uses java.lang.Object.toString() to get a string representation of the 
 (binary) content, which results in unique hash codes (e.g., [B@30dc9065) 
 even for two byte arrays with identical content.
 A solution would be to take the MD5 sum of the binary content as first part 
 of the final signature calculation (the parsed content is the second part):
   ... .append(StringUtil.toHexString(MD5Hash.digest(data).getDigest())).append(parse.getText());
 Of course, there are many other solutions...
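
 The problem described above can be reproduced outside Nutch. The following 
 is a minimal sketch, not the committed fix: it uses the JDK's MessageDigest 
 in place of Hadoop's MD5Hash, and the class and method names 
 (SignatureSketch, brokenKey, fixedKey, toHex) are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class SignatureSketch {

    // Hex-encode a byte array (hypothetical stand-in for StringUtil.toHexString).
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Buggy variant: append(byte[]) resolves to append(Object), so the array's
    // toString() is appended ("[B@" plus an identity hash), not its content.
    static String brokenKey(byte[] data, String text) {
        return new StringBuilder().append(data).append(text).toString();
    }

    // Variant along the lines suggested in the issue: digest the raw bytes
    // first, then append the parsed text.
    static String fixedKey(byte[] data, String text) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            return new StringBuilder().append(toHex(digest)).append(text).toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] a = "same content".getBytes(StandardCharsets.UTF_8);
        byte[] b = "same content".getBytes(StandardCharsets.UTF_8);
        // Depends on object identity, so the two keys normally differ.
        System.out.println(brokenKey(a, "text").equals(brokenKey(b, "text")));
        // Depends only on the bytes, so the two keys are equal.
        System.out.println(fixedKey(a, "text").equals(fixedKey(b, "text")));
    }
}
```

 The broken key starts with "[B@", which is how java.lang.Object.toString() 
 renders a byte array; the fixed key depends only on the array's content.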


[jira] Created: (NUTCH-840) Port tests from parse-html to parse-tika

Port tests from parse-html to parse-tika


 Key: NUTCH-840
 URL: https://issues.apache.org/jira/browse/NUTCH-840
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.0


We don't have tests for HTML in parse-tika so I'll copy them from the old 
parse-html plugin.


[jira] Updated: (NUTCH-840) Port tests from parse-html to parse-tika


 [ 
https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-840:


Attachment: NUTCH-840.patch

Patch which adds the HTML tests to the Tika parser.

The tests currently rely on some DOM-related code from Neko-HTML, which 
introduces a dependency on the plugin lib-nekohtml.
Apart from parse-tika, lib-nekohtml is used only in clustering-carrot, which 
will be removed shortly. Once this is done we can delete lib-nekohtml as well 
and then either: 
a) add the neko jar to the parse-tika lib via Ivy
b) replace it with another implementation already available from the Tika 
dependencies or the main Nutch dependencies (e.g. dom4j)





 Port tests from parse-html to parse-tika
 

 Key: NUTCH-840
 URL: https://issues.apache.org/jira/browse/NUTCH-840
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.0

 Attachments: NUTCH-840.patch



[jira] Commented: (NUTCH-837) Remove search servers and Lucene dependencies


[ 
https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884671#action_12884671
 ] 

Julien Nioche commented on NUTCH-837:
-

I think we can also get rid of  :

* docs/
* WAR related tasks in ANT
* src/web/
* src/xmlcatalog/
* src/engines/


 Remove search servers and Lucene dependencies 
 --

 Key: NUTCH-837
 URL: https://issues.apache.org/jira/browse/NUTCH-837
 Project: Nutch
  Issue Type: Task
  Components: searcher, web gui
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: NUTCH-837.patch



[jira] Commented: (NUTCH-837) Remove search servers and Lucene dependencies


[ 
https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884734#action_12884734
 ] 

Julien Nioche commented on NUTCH-837:
-

:-)

 Remove search servers and Lucene dependencies 
 --

 Key: NUTCH-837
 URL: https://issues.apache.org/jira/browse/NUTCH-837
 Project: Nutch
  Issue Type: Task
  Components: searcher, web gui
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: NUTCH-837.patch



[jira] Updated: (NUTCH-821) Use ivy in nutch builds


 [ 
https://issues.apache.org/jira/browse/NUTCH-821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-821:


Attachment: NUTCH-821.patch

Adds Ivy support for dependencies.

The lib/ directory is kept and will be used to store dependencies which are 
not accessible via Ivy (e.g. GORA). The libs managed by Ivy are put in the 
directory build/lib. 

This patch also differentiates the _build_ path from the _dist_ path.



 Use ivy in nutch builds
 ---

 Key: NUTCH-821
 URL: https://issues.apache.org/jira/browse/NUTCH-821
 Project: Nutch
  Issue Type: New Feature
  Components: build
Affects Versions: 2.0
Reporter: Enis Soztutar
Assignee: Enis Soztutar
 Fix For: 2.0

 Attachments: NUTCH-821.patch, nutchbase-ivy_v1.patch


 Ivy is the de-facto dependency management tool used in conjunction with Ant. 
 It would be nice if we switch to using Ivy in Nutch builds. 
 Maven is also an alternative, but I think Nutch will benefit more with an 
 Ant+Ivy architecture. 


[jira] Resolved: (NUTCH-791) External links for published javadocs are partially broken


 [ 
https://issues.apache.org/jira/browse/NUTCH-791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-791.
-

Fix Version/s: 1.1
   Resolution: Duplicate

Duplicates 790?

 External links for published javadocs are partially broken
 --

 Key: NUTCH-791
 URL: https://issues.apache.org/jira/browse/NUTCH-791
 Project: Nutch
  Issue Type: Bug
  Components: documentation
Reporter: Sami Siren
 Fix For: 1.1


 Lucene and Hadoop links point to non existing urls. For some versions of 
 apidocs the links are just broken and for some they do not exist at all. 
 Basically what is required is that the javadocs are generated again with 
 proper urls for external packages.


[jira] Commented: (NUTCH-821) Use ivy in nutch builds

[
https://issues.apache.org/jira/browse/NUTCH-821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885207#action_12885207
]

Julien Nioche commented on NUTCH-821:
-

{QUOTE}
I think this patch refers to some parts that were already removed in NUTCH-837
...
{QUOTE}

I applied NUTCH-837 before but indeed it does remove references to parts
deleted in NUTCH-837. Maybe I should have done it in a separate issue.

{QUOTE}
Also, it would be nice to have a target that sets up an Eclipse project - after
this patch is applied the lib/ is nearly empty and you need to run build at
least once to bring dependencies - this may be confusing.
{QUOTE}

The jars are put in the build/lib directory so this assumes that the project
has been built in order to get the dependencies. I think there are resources in
Eclipse for dealing with Ivy configurations. If anyone has any pointers they
will be most welcome

Use ivy in nutch builds
---

Key: NUTCH-821
URL: https://issues.apache.org/jira/browse/NUTCH-821
Project: Nutch
Issue Type: New Feature
Components: build
Affects Versions: 2.0
Reporter: Enis Soztutar
Assignee: Enis Soztutar
Fix For: 2.0

Attachments: NUTCH-821.patch, nutchbase-ivy_v1.patch


[jira] Commented: (NUTCH-821) Use ivy in nutch builds


[ 
https://issues.apache.org/jira/browse/NUTCH-821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885244#action_12885244
 ] 

Julien Nioche commented on NUTCH-821:
-

I found [http://ant.apache.org/ivy/ivyde/], which allows managing Ivy 
dependencies in Eclipse. 
I had to rewrite ivy/ivy.xml to make the version numbers explicit, as IvyDE was 
not able to load the properties in ivy/library.properties, but it worked fine 
after that. The beauty of it is that we don't rely on the content of build/lib 
at all.

 Use ivy in nutch builds
 ---

 Key: NUTCH-821
 URL: https://issues.apache.org/jira/browse/NUTCH-821
 Project: Nutch
  Issue Type: New Feature
  Components: build
Affects Versions: 2.0
Reporter: Enis Soztutar
Assignee: Enis Soztutar
 Fix For: 2.0

 Attachments: NUTCH-821.patch, nutchbase-ivy_v1.patch



[jira] Commented: (NUTCH-696) Timeout for Parser


[ 
https://issues.apache.org/jira/browse/NUTCH-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885260#action_12885260
 ] 

Julien Nioche commented on NUTCH-696:
-

+1 : this is definitely useful. Hopefully the underlying parsers in Tika are 
constantly improved to prevent loops and crashes, but having the parser timeout 
on top would be great.

 Timeout for Parser
 --

 Key: NUTCH-696
 URL: https://issues.apache.org/jira/browse/NUTCH-696
 Project: Nutch
  Issue Type: Wish
  Components: fetcher
Reporter: Julien Nioche
Priority: Minor
 Attachments: timeout.patch


 I found that the parsing sometimes crashes due to a problem on a specific 
 document, which is a bit of a shame as this blocks the rest of the segment 
 and Hadoop ends up finding that the node does not respond. I was wondering 
 about whether it would make sense to have a timeout mechanism for the parsing 
 so that if a document is not parsed after a time t, it is simply treated as 
 an exception and we can get on with the rest of the process.
 Does that make sense? Where do you think we should implement that, in 
 ParseUtil?
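
 One possible shape for such a timeout, as a rough sketch only: run the parse 
 in a separate thread and wait on a Future for at most t. This is not the 
 attached timeout.patch. The class and method names (ParseTimeoutSketch, 
 callWithTimeout) and the fallback value are hypothetical, and the Callable 
 stands in for whatever call ParseUtil would make to the parser.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ParseTimeoutSketch {

    // Runs the task in a worker thread and waits at most timeoutMillis.
    // Returns the fallback if the task times out, fails, or is interrupted.
    static <T> T callWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Ask the worker to stop; this only helps if the task reacts to
            // interruption, otherwise the thread keeps running in the background.
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            // The task itself threw: treat it like any other failed parse.
            return fallback;
        } catch (InterruptedException e) {
            // Restore the interrupt flag for the caller.
            Thread.currentThread().interrupt();
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A fast task completes within the limit.
        System.out.println(callWithTimeout(() -> "parsed", 1000, "failed"));
        // A slow task is abandoned after 100 ms.
        System.out.println(callWithTimeout(() -> {
            Thread.sleep(5000);
            return "parsed";
        }, 100, "failed"));
    }
}
```

 A real implementation would probably return a failed parse status rather 
 than a plain fallback value, and would reuse a thread pool instead of 
 creating an executor per document.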


[jira] Issue Comment Edited: (NUTCH-696) Timeout for Parser