Re: Nearing a 1.9 release?

2014-06-29 Thread Sebastian Nagel
+1 for a release during the next month

I plan to address before a release:
- 2 issues related to redirects
  NUTCH-926 
  NUTCH-1708 

- issues ready for commit:
  NUTCH-1605 
  NUTCH-1561 
  NUTCH-1566 
  NUTCH-1776 

Sebastian

On 06/29/2014 11:20 AM, Julien Nioche wrote:
> Hi guys, 
> 
> We've done loads of good work on the trunk since the last release, in 
> particular : 
> 
>   * NUTCH-1736 
>   * NUTCH-1647 
>   * NUTCH-1793 
> 
> which are important bug fixes (NUTCH-578 will also be an important one).
> 
> If you want to help make the new release happen, could you please go through
> the issues listed for 1.9 and vote for the ones you think should be included
> in the next release / comment on issues opened by others / review the
> patches / contribute to the discussions?
> 
> Thanks!
> 
> Julien
> 
> -- 
> Open Source Solutions for Text Engineering
> 
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble



[jira] [Updated] (NUTCH-1561) improve usability of parse-metatags and index-metadata

2014-06-29 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1561:
---

Attachment: NUTCH-1561-trunk-v2.patch

Yes, of course, the wiki needs to be updated.
Improved patch for trunk:
* do lower case conversion already in setConf if possible
* do not use default locale (cf. NUTCH-1807)
* bundled addition of found metatags in one method

> improve usability of parse-metatags and index-metadata
> --
>
> Key: NUTCH-1561
> URL: https://issues.apache.org/jira/browse/NUTCH-1561
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.6
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.9
>
> Attachments: NUTCH-1561-trunk-v2.patch, NUTCH-1561-v1.patch
>
>
> Usually, the plugins parse-metatags and index-metadata are used in 
> combination: the former "extracts" meta tags, the latter adds the extracted 
> tags as fields to the index. 
> Configuration of the two plugins differs which causes pitfalls and reduces 
> the usability (see example config):
> * the property "metatags.names" of parse-metatags uses ';' as separator 
> instead of ',' used by index-metadata
> * meta tags have to be lowercased in index-metadata
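> {code}
> <property>
>   <name>metatags.names</name>
>   <value>DC.creator;DCTERMS.bibliographicCitation</value>
> </property>
> <property>
>   <name>index.parse.md</name>
>   <value>metatag.dc.creator,metatag.dcterms.bibliographiccitation</value>
> </property>
> {code}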
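
The two pitfalls described above (different separators, manual lowercasing)
could be removed by normalizing the configured names in one place. The
following is only a minimal sketch of that idea, not the code in the attached
patch: the class MetatagNames and its methods are hypothetical, and the
"metatag." prefix is taken from the example config above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper, not part of Nutch: normalizes configured meta tag names. */
public class MetatagNames {

  /** Prefix seen in the index.parse.md value of the example config. */
  static final String PREFIX = "metatag.";

  /**
   * Splits a property value on ';' or ',' and lowercases each name with
   * Locale.ROOT, so the result does not depend on the system locale.
   */
  public static List<String> parse(String propertyValue) {
    List<String> names = new ArrayList<>();
    for (String raw : propertyValue.split("[;,]")) {
      String name = raw.trim().toLowerCase(Locale.ROOT);
      if (!name.isEmpty()) {
        names.add(name);
      }
    }
    return names;
  }

  /** Maps the parsed names to the field names used on the indexing side. */
  public static List<String> toIndexFields(String propertyValue) {
    List<String> fields = new ArrayList<>();
    for (String name : parse(propertyValue)) {
      fields.add(PREFIX + name);
    }
    return fields;
  }

  public static void main(String[] args) {
    // prints [metatag.dc.creator, metatag.dcterms.bibliographiccitation]
    System.out.println(toIndexFields("DC.creator;DCTERMS.bibliographicCitation"));
  }
}
```

With such a helper, both separators are accepted and the user would not have
to lowercase the names by hand.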



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (NUTCH-1801) Improve handling of test dependencies in ANT+Ivy

2014-06-29 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047256#comment-14047256
 ] 

Sebastian Nagel commented on NUTCH-1801:


+1 (including sub-tasks)

> Improve handling of test dependencies in ANT+Ivy
> 
>
> Key: NUTCH-1801
> URL: https://issues.apache.org/jira/browse/NUTCH-1801
> Project: Nutch
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.8
>Reporter: Julien Nioche
> Fix For: 1.9
>
>
> The chain of dependencies between ANT tasks needs fixing. The main issue is 
> that the dependencies with a 'test' scope in Ivy are not resolved properly or 
> rather the resolution task works fine but is not called from the upper level 
> 'test' tasks. This can easily be reproduced by marking the junit dependency 
> in ivy.xml as conf="test->default".
> Ideally we'd want to have a separate lib dir for the test dependencies so 
> that they do not get copied into the job file where they are absolutely not 
> needed.





[jira] [Updated] (NUTCH-1693) TextMD5Signature compute on textual content

2014-06-29 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1693:
---

Attachment: NUTCH-1693-2x-v2.patch

Hi [~markus17], +1 for your patch (a late review).
Updated and completed patch for 2.x: the signature calculation also needs to 
be called after the text has been stored in WebPage.
Opened NUTCH-1807 to address the problem with default locales/charsets.

> TextMD5Signature compute on textual content
> --
>
> Key: NUTCH-1693
> URL: https://issues.apache.org/jira/browse/NUTCH-1693
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Tien Nguyen Manh
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.9
>
> Attachments: NUTCH-1693-2x-v2.patch, NUTCH-1693-trunk.patch, 
> NUTCH-1693-trunk.patch, NUTCH-1693.patch
>
>
> I created a new MD5Signature that is based on textual content. In our case we 
> use boilerpipe to extract the main text from the content, so this signature is 
> more effective for deduplication.
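
The idea of a signature computed on textual content can be sketched as follows.
This is a rough illustration under stated assumptions, not the code in the
attached patches: the class TextSignatureSketch and its method are
hypothetical, and the text is assumed to have been extracted already (for
example by boilerpipe). The charset is given explicitly, in line with the
default-charset concern raised in NUTCH-1807.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical sketch, not Nutch's own class: MD5 over extracted text. */
public class TextSignatureSketch {

  /** Returns the MD5 digest of the extracted text as a lowercase hex string. */
  public static String signature(String extractedText) {
    try {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      // Explicit charset: the digest must not depend on the platform default.
      byte[] digest = md5.digest(extractedText.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder();
      for (byte b : digest) {
        hex.append(Character.forDigit((b >> 4) & 0xF, 16));
        hex.append(Character.forDigit(b & 0xF, 16));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 not available", e);
    }
  }

  public static void main(String[] args) {
    // Two pages with different markup but the same extracted main text
    // get the same signature and can be treated as duplicates.
    String textFromPageA = "Main article text";
    String textFromPageB = "Main article text";
    System.out.println(signature(textFromPageA).equals(signature(textFromPageB))); // true
  }
}
```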





[jira] [Created] (NUTCH-1807) avoid methods relying on system-specific default locale / charset

2014-06-29 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1807:
--

 Summary: avoid methods relying on system-specific default locale / 
charset
 Key: NUTCH-1807
 URL: https://issues.apache.org/jira/browse/NUTCH-1807
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.2.1, 1.8
Reporter: Sebastian Nagel
Priority: Minor
 Fix For: 2.4, 1.9


Many methods in Java (and libraries) used to convert Strings, Numbers, Dates 
rely on the system-specific default locale / character set. This may cause 
strange behaviour and errors impossible to reproduce on other systems, see 
[~thetaphi]'s [blog 
post|http://blog.thetaphi.de/2012/07/default-locales-default-charsets-and.html],
 and discussions in NUTCH-1693 and NUTCH-1554.

A search with the [forbidden-apis 
client|https://code.google.com/p/forbidden-apis/wiki/CliUsage] shows 120 calls 
of such methods in trunk (without test classes):
{code}
# compile Nutch before check: all tested class files
# are then located in build/ (including plugins)
% CLASSPATH=`find build/ -name '*.jar' | tr '\n' ':'`
% java -jar forbiddenapis-1.5.1.jar -d build/ -c $CLASSPATH \
  -b jdk-unsafe-1.8 -b commons-io-unsafe-2.4
{code}
It is also possible to [integrate the check into the ant 
build|https://code.google.com/p/forbidden-apis/wiki/AntUsage] (to prevent 
"forbidden" calls from slipping into the code again).
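
To illustrate the kind of call this issue is about, here is a small sketch
that only uses standard JDK methods. The class name LocaleSafeConversions and
its helper methods are made up for the example; the Turkish locale is a
common demonstration case and is not taken from the Nutch code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical example class: locale- and charset-independent conversions. */
public class LocaleSafeConversions {

  /** Lowercases independently of the JVM's default locale. */
  public static String lower(String s) {
    return s.toLowerCase(Locale.ROOT);
  }

  /** Encodes with an explicit charset instead of the platform default. */
  public static byte[] utf8(String s) {
    return s.getBytes(StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    Locale turkish = Locale.forLanguageTag("tr");
    // Under Turkish casing rules 'I' lowercases to dotless 'ı' (U+0131),
    // so the result is not equal to "title".
    System.out.println("TITLE".toLowerCase(turkish).equals("title")); // false
    System.out.println(lower("TITLE").equals("title"));               // true
    // 'é' takes two bytes in UTF-8; String.getBytes() without a charset
    // may give a different length depending on the platform encoding.
    System.out.println(utf8("\u00e9").length);                         // 2
  }
}
```

A call such as s.toLowerCase() without a Locale argument behaves like the
first line on a system whose default locale is Turkish, which is the sort of
hard-to-reproduce difference described above.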






[jira] [Updated] (NUTCH-578) URL fetched with 403 is generated over and over again

2014-06-29 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-578:


Priority: Critical  (was: Major)

> URL fetched with 403 is generated over and over again
> -
>
> Key: NUTCH-578
> URL: https://issues.apache.org/jira/browse/NUTCH-578
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 1.0.0
> Environment: Ubuntu Gutsy Gibbon (7.10) running on VMware server. I 
> have checked out the most recent version of the trunk as of Nov 20, 2007
>Reporter: Nathaniel Powell
>Assignee: Markus Jelsma
>Priority: Critical
> Fix For: 1.9
>
> Attachments: NUTCH-578.patch, NUTCH-578_v2.patch, NUTCH-578_v3.patch, 
> NUTCH-578_v4.patch, NUTCH-578_v5.patch, crawl-urlfilter.txt, nutch-site.xml, 
> regex-normalize.xml, urls.txt
>
>
> I have not changed the following parameter in the nutch-default.xml:
> <property>
>   <name>db.fetch.retry.max</name>
>   <value>3</value>
>   <description>The maximum number of times a url that has encountered
>   recoverable errors is generated for fetch.</description>
> </property>
> However, there is a URL which is on the site that I'm crawling, 
> www.teachertube.com, which keeps being generated over and over again for 
> almost every segment (many more times than 3):
> fetch of http://www.teachertube.com/images/ failed with: Http code=403, 
> url=http://www.teachertube.com/images/
> This is a bug, right?
> Thanks.
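
The behaviour the reporter expects from db.fetch.retry.max can be sketched as
a simple retry cap. This is only an illustration of the expectation, assuming
that every failed fetch counts against the limit; it is not how the Nutch
generator or CrawlDb is implemented, and all class and method names below are
hypothetical.

```java
/** Hypothetical sketch of a retry cap; not Nutch's generator logic. */
public class RetryCapSketch {

  /** Hypothetical per-URL state; not Nutch's own data structure. */
  static class UrlRecord {
    final String url;
    int retries = 0;
    boolean gone = false;

    UrlRecord(String url) {
      this.url = url;
    }
  }

  /** A failed fetch counts against the limit; at the limit the URL is retired. */
  public static void recordFailure(UrlRecord record, int retryMax) {
    record.retries++;
    if (record.retries >= retryMax) {
      record.gone = true;
    }
  }

  /** A retired URL is no longer selected for a new segment. */
  public static boolean shouldGenerate(UrlRecord record) {
    return !record.gone;
  }

  /** Counts how often a URL that always fails would be generated. */
  public static int generationsUntilRetired(String url, int retryMax, int segments) {
    UrlRecord record = new UrlRecord(url);
    int generated = 0;
    for (int segment = 0; segment < segments; segment++) {
      if (shouldGenerate(record)) {
        generated++;
        recordFailure(record, retryMax);
      }
    }
    return generated;
  }

  public static void main(String[] args) {
    int retryMax = 3; // mirrors the db.fetch.retry.max value quoted above
    // prints 3: after three failed fetches the URL is no longer generated
    System.out.println(
        generationsUntilRetired("http://www.teachertube.com/images/", retryMax, 10));
  }
}
```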





Nearing a 1.9 release?

2014-06-29 Thread Julien Nioche
Hi guys,

We've done loads of good work on the trunk since the last release, in
particular :

   - NUTCH-1736 
   - NUTCH-1647 
   - NUTCH-1793 

which are important bug fixes (NUTCH-578 will also be an important
one).

If you want to help make the new release happen, could you please go
through the issues listed for 1.9 and vote for the ones you think should be
included in the next release / comment on issues opened by others / review
the patches / contribute to the discussions?

Thanks!

Julien

-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


[jira] [Closed] (NUTCH-1285) Debian Packaging for Nutch

2014-06-29 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-1285.
---


> Debian Packaging for Nutch
> --
>
> Key: NUTCH-1285
> URL: https://issues.apache.org/jira/browse/NUTCH-1285
> Project: Nutch
>  Issue Type: New Feature
>  Components: build
>Affects Versions: nutchgora, 1.5
>Reporter: Lewis John McGibbney
>Priority: Minor
>
> This is a utopian type of issue which will not be addressed for some time due 
> to many factors outwith our control that exist within the Debian policy 
> ecosystem. 
> I've been in touch with Ioan over @ Apache James and they have recently 
> (after a number of years) made some real progress with this. Some links are 
> below
> [0] http://svn.apache.org/repos/asf/james/app
> [1] http://svn.apache.org/viewvc/james/app/trunk/pom.xml?view=markup
> [2] https://issues.apache.org/jira/browse/JAMES-1343
> [3] http://www.mail-archive.com/server-dev@james.apache.org/
> [4] http://www.debian.org/doc/debian-policy/
> [5] http://www.debian.org/doc/manuals/maint-guide/index.en.html





[jira] [Updated] (NUTCH-1285) Debian Packaging for Nutch

2014-06-29 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1285:
-

Fix Version/s: (was: 1.9)

> Debian Packaging for Nutch
> --
>
> Key: NUTCH-1285
> URL: https://issues.apache.org/jira/browse/NUTCH-1285
> Project: Nutch
>  Issue Type: New Feature
>  Components: build
>Affects Versions: nutchgora, 1.5
>Reporter: Lewis John McGibbney
>Priority: Minor
>
> This is a utopian type of issue which will not be addressed for some time due 
> to many factors outwith our control that exist within the Debian policy 
> ecosystem. 
> I've been in touch with Ioan over @ Apache James and they have recently 
> (after a number of years) made some real progress with this. Some links are 
> below
> [0] http://svn.apache.org/repos/asf/james/app
> [1] http://svn.apache.org/viewvc/james/app/trunk/pom.xml?view=markup
> [2] https://issues.apache.org/jira/browse/JAMES-1343
> [3] http://www.mail-archive.com/server-dev@james.apache.org/
> [4] http://www.debian.org/doc/debian-policy/
> [5] http://www.debian.org/doc/manuals/maint-guide/index.en.html


