[Nutch Wiki] Update of "Release_HOWTO" by SebastianNagel

2016-06-29 Thread Apache Wiki
The "Release_HOWTO" page has been changed by SebastianNagel:
https://wiki.apache.org/nutch/Release_HOWTO?action=diff&rev1=52&rev2=53

Comment:
added for release candidate: check copyright year in NOTICE.txt

1. Update version numbers (from X.Y-dev to X.Y) for release in:
* nutch-default.xml - http.agent.version property
* default.properties - version property and year property
+   1. Check the copyright year in NOTICE.txt, update to current year if 
necessary
1. Update CHANGES.txt with release date and (if needed) add additional 
changelog entries (from Jira Report). It's also good practice to include a link 
to the Jira report. 
1. Check if documentation needs an update. Although this may be a huge 
task at any given time, any minor contribution is better than nothing at all.
1. Commit all these changes to the branch you are releasing.
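
The NOTICE.txt check added in this revision could be automated. The following is a minimal sketch only, not part of the Nutch build: the class name, the regular expression, and the assumed wording of the copyright line ("Copyright 2016" or a range such as "Copyright 2005-2016") are illustrative assumptions.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.time.Year;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: checks the year in a NOTICE.txt copyright line.
class NoticeYearCheck {
  // Assumed wording: "Copyright 2016 ..." or a range such as "Copyright 2005-2016 ...".
  private static final Pattern COPYRIGHT = Pattern.compile(
      "Copyright\\s+(?:\\(c\\)\\s+)?(\\d{4})(?:\\s*-\\s*(\\d{4}))?",
      Pattern.CASE_INSENSITIVE);

  /** Returns the latest year of the first copyright statement, or -1 if none is found. */
  static int latestCopyrightYear(String noticeText) {
    Matcher m = COPYRIGHT.matcher(noticeText);
    if (!m.find()) {
      return -1;
    }
    // For a range, the second group holds the end year; otherwise use the single year.
    String end = m.group(2) != null ? m.group(2) : m.group(1);
    return Integer.parseInt(end);
  }

  /** True if the latest copyright year equals the given current year. */
  static boolean isUpToDate(String noticeText, int currentYear) {
    return latestCopyrightYear(noticeText) == currentYear;
  }

  public static void main(String[] args) throws IOException {
    // The path defaults to NOTICE.txt in the working directory (an assumption).
    String path = args.length > 0 ? args[0] : "NOTICE.txt";
    String text = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
    int current = Year.now().getValue();
    if (isUpToDate(text, current)) {
      System.out.println("NOTICE.txt copyright year is current.");
    } else {
      System.out.println("NOTICE.txt needs an update to " + current + ".");
    }
  }
}
```

Run it from the root of the release branch before building the release candidate; it only reports and does not modify the file.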


[jira] [Created] (NUTCH-2290) Update licenses of bundled libraries

2016-06-29 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2290:
--

 Summary: Update licenses of bundled libraries
 Key: NUTCH-2290
 URL: https://issues.apache.org/jira/browse/NUTCH-2290
 Project: Nutch
  Issue Type: Bug
  Components: deployment
Affects Versions: 1.12, 2.3.1
Reporter: Sebastian Nagel
 Fix For: 2.4, 1.13


The files LICENSE.txt and NOTICE.txt were last edited 5 years ago and should be 
updated to include all licenses of dependencies (and their dependencies) in 
accordance with the [Assembling LICENSE and NOTICE 
HOWTO|http://www.apache.org/dev/licensing-howto.html]:
# check for missing or obsolete licenses due to added or removed dependencies
# update the year in NOTICE.txt; according to the licensing HOWTO it should be 
a range
# bundled libraries are referenced with path and version number, e.g. 
{{lib/icu4j-4_0_1.jar}}. This requires updating LICENSE.txt with 
every dependency upgrade. A more generic reference ("ICU4J") would be easier to 
maintain, but the HOWTO requires us to "specify the version of the dependency as 
licenses are sometimes changed".
# try to reduce the size of LICENSE.txt (currently 5800 lines). In particular, 
according to the HOWTO there is no need to repeat the Apache license again and 
again.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


ApacheCon EU Sevilla

2016-06-29 Thread Julien Nioche
Hi,

Sorry for cross posting. As you are probably aware, the ApacheCon Europe,
and Apache Big Data conferences will take place in Seville, Spain, November
14-18, 2016.

http://events.linuxfoundation.org/events/apache-big-data-europe/

I just submitted a talk on StormCrawler (which
will touch on Apache Nutch as well) and I know that at least 1 other fellow
Nutch committer will be there.

Is anyone else planning on going? It would be interesting not only to catch
up within each respective project but also meet people from other crawl
related projects.

Best regards

Julien

-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble 


Re: ApacheCon EU Sevilla

2016-06-29 Thread BlackIce
I might go!
On 29/6/2016 16:06, "Julien Nioche" wrote:

> Hi,
>
> Sorry for cross posting. As you are probably aware, the ApacheCon Europe,
> and Apache Big Data conferences will take place in Seville, Spain,
> November 14-18, 2016.
>
> http://events.linuxfoundation.org/events/apache-big-data-europe/
>
> I just submitted a talk on StormCrawler (which
> will touch on Apache Nutch as well) and I know that at least 1 other fellow
> Nutch committer will be there.
>
> Is anyone else planning on going? It would be interesting not only to
> catch up within each respective project but also meet people from other
> crawl related projects.
>
> Best regards
>
> Julien
>
> --
>
> *Open Source Solutions for Text Engineering*
>
> http://www.digitalpebble.com
> http://digitalpebble.blogspot.com/
> #digitalpebble 
>


[jira] [Commented] (NUTCH-2234) Upgrade to elasticsearch 2.1.1

2016-06-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15355396#comment-15355396
 ] 

ASF GitHub Bot commented on NUTCH-2234:
---

Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/118


> Upgrade to elasticsearch 2.1.1
> --
>
> Key: NUTCH-2234
> URL: https://issues.apache.org/jira/browse/NUTCH-2234
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.11
>Reporter: Tien Nguyen Manh
>Assignee: Markus Jelsma
> Fix For: 1.13
>
> Attachments: NUTCH-2234.patch
>
>
> Currently we use elasticsearch 1.x; we should upgrade to 2.x.





[GitHub] nutch pull request #118: fix for NUTCH-2234 and NUTCH-2236

2016-06-29 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/118


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Resolved] (NUTCH-2234) Upgrade to elasticsearch 2.3.3

2016-06-29 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2234.
-
Resolution: Fixed

Nice work on this one [~naegelejd] :)

> Upgrade to elasticsearch 2.3.3
> --
>
> Key: NUTCH-2234
> URL: https://issues.apache.org/jira/browse/NUTCH-2234
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.11
>Reporter: Tien Nguyen Manh
>Assignee: Markus Jelsma
> Fix For: 1.13
>
> Attachments: NUTCH-2234.patch
>
>
> Currently we use elasticsearch 1.x; we should upgrade to 2.x.





[jira] [Updated] (NUTCH-2236) Upgrade to Hadoop 2.7.2

2016-06-29 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2236:

Summary: Upgrade to Hadoop 2.7.2  (was: Upgrade to Hadoop 2.7.1)

> Upgrade to Hadoop 2.7.2
> ---
>
> Key: NUTCH-2236
> URL: https://issues.apache.org/jira/browse/NUTCH-2236
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.11
>Reporter: Tien Nguyen Manh
>Assignee: Markus Jelsma
> Fix For: 1.13
>
> Attachments: NUTCH-2236.patch
>
>
> Upgrade to Hadoop 2.7.1





[jira] [Resolved] (NUTCH-2236) Upgrade to Hadoop 2.7.2

2016-06-29 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2236.
-
Resolution: Fixed

Nice work on this one [~naegelejd] :)

> Upgrade to Hadoop 2.7.2
> ---
>
> Key: NUTCH-2236
> URL: https://issues.apache.org/jira/browse/NUTCH-2236
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.11
>Reporter: Tien Nguyen Manh
>Assignee: Markus Jelsma
> Fix For: 1.13
>
> Attachments: NUTCH-2236.patch
>
>
> Upgrade to Hadoop 2.7.1





[GitHub] nutch pull request #112: NUTCH-2262 Utilize parameterized logging notation a...

2016-06-29 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/112




[jira] [Commented] (NUTCH-2262) Utilize parameterized logging notation across Fetcher

2016-06-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15355410#comment-15355410
 ] 

ASF GitHub Bot commented on NUTCH-2262:
---

Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/112


> Utilize parameterized logging notation across Fetcher
> -
>
> Key: NUTCH-2262
> URL: https://issues.apache.org/jira/browse/NUTCH-2262
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.11
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.13
>
>
> This issue was something I have had lying around for a wee while. It merely 
> consists of implementing the parameterized logging for slf4j which improves 
> logging speed. PR coming up. 
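
To illustrate why the parameterized notation helps, here is a minimal, self-contained sketch. It does not use slf4j itself: the class, its `{}` substitution, and the `debugEnabled` flag are stand-ins written for illustration. In real code the equivalent call would go through slf4j's `Logger.debug(String, Object...)`. The point is that the message string is only built when the level is enabled, whereas `"fetching " + url` is concatenated before the logger is even called.

```java
// Hypothetical stand-in for a logger; it mimics the "{}" placeholder notation only.
class ParameterizedLoggingSketch {
  static boolean debugEnabled = false; // assumed level switch
  static int formatCalls = 0;          // counts how often a message was actually built
  static String lastMessage = null;    // last message that would have been written

  /** Replaces each "{}" placeholder with the next argument, left to right. */
  static String format(String pattern, Object... args) {
    formatCalls++;
    StringBuilder sb = new StringBuilder();
    int from = 0;
    int argIdx = 0;
    int at;
    while (argIdx < args.length && (at = pattern.indexOf("{}", from)) >= 0) {
      sb.append(pattern, from, at).append(args[argIdx++]);
      from = at + 2;
    }
    return sb.append(pattern.substring(from)).toString();
  }

  /** Builds the message only if debug output is enabled. */
  static void debug(String pattern, Object... args) {
    if (!debugEnabled) {
      return; // no string building, no toString() calls on the arguments
    }
    lastMessage = format(pattern, args);
  }

  public static void main(String[] args) {
    debugEnabled = true;
    debug("fetching {} ({} retries)", "http://example.org/", 3);
    System.out.println(lastMessage);
  }
}
```

The varargs array and any boxing of primitives are still paid on every call in this sketch; the saving is in skipping the string construction when the level is off.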





[jira] [Resolved] (NUTCH-2262) Utilize parameterized logging notation across Fetcher

2016-06-29 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2262.
-
Resolution: Fixed

> Utilize parameterized logging notation across Fetcher
> -
>
> Key: NUTCH-2262
> URL: https://issues.apache.org/jira/browse/NUTCH-2262
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.11
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.13
>
>
> This issue was something I have had lying around for a wee while. It merely 
> consists of implementing the parameterized logging for slf4j which improves 
> logging speed. PR coming up. 





[jira] [Updated] (NUTCH-2234) Upgrade to elasticsearch 2.3.3

2016-06-29 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2234:

Summary: Upgrade to elasticsearch 2.3.3  (was: Upgrade to elasticsearch 
2.1.1)

> Upgrade to elasticsearch 2.3.3
> --
>
> Key: NUTCH-2234
> URL: https://issues.apache.org/jira/browse/NUTCH-2234
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.11
>Reporter: Tien Nguyen Manh
>Assignee: Markus Jelsma
> Fix For: 1.13
>
> Attachments: NUTCH-2234.patch
>
>
> Currently we use elasticsearch 1.x; we should upgrade to 2.x.





Build failed in Jenkins: Nutch-trunk #3376

2016-06-29 Thread Apache Jenkins Server
See 

Changes:

[lewis.mcgibbney] NUTCH-2262 Utilize parameterized logging notation across 
Fetcher

[jnaegele] fix for NUTCH-2234

--
[...truncated 4684 lines...]
[mkdir] Created dir: 

[mkdir] Created dir: 


init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-regex
[javac] Compiling 2 source files to 

[javac] Creating empty 


jar:
  [jar] Building jar: 


deps-test:

init:

init-plugin:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:

jar:

deps-test:

deploy:

copy-generated-lib:

deploy:
 [copy] Copying 1 file to 


copy-generated-lib:
 [copy] Copying 1 file to 

[mkdir] Created dir: 

 [copy] Copying 1 file to 


init:
[mkdir] Created dir: 

[mkdir] Created dir: 

[mkdir] Created dir: 


init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-slash
[javac] Compiling 1 source file to 


jar:
  [jar] Building jar: 


deps-test:

deploy:
 [copy] Copying 1 file to 


copy-generated-lib:
 [copy] Copying 1 file to 


compile:

job:
  [jar] Building jar: 


test-core:
[mkdir] Created dir: 

 [copy] Copying 48 files to 

 [copy] Copying 1 file to 

 [copy] Copying 1 file to 

 [copy] Copying 1 file to 

 [copy] Copying 1 file to 

[junit] Running org.apache.nutch.crawl.TestAdaptiveFetchSchedule
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.523 sec
[junit] Running org.apache.nutch.crawl.TestCrawlDbFilter
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
4.269 sec
[junit] Running org.apache.nutch.crawl.TestCrawlDbMerger
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
4.17 sec
[junit] Running org.apache.nutch.crawl.TestCrawlDbStates
[junit] Tests run: 6, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
4.058 sec
[junit] Test org.apache.nutch.crawl.TestCrawlDbStates FAILED
[junit] Running org.apache.nutch.crawl.TestGenerator
[junit] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
45.988 sec
[junit] Running org.apache.nutch.crawl.TestInjector
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
6.483 sec
[junit] Running org.apache.nutch.crawl.TestLinkDbMerger
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
4.219 sec
[junit] Running org.apache.nutch.crawl.TestSignatureFactory
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.465 sec
[junit] Running org.apache.nutch.fetcher.TestFetcher
[junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed

Re: ApacheCon EU Sevilla

2016-06-29 Thread Mattmann, Chris A (3980)
I’m thinking about it :) Would be great to go.

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++

On 6/29/16, 7:05 AM, "Julien Nioche" wrote:

>Hi,
>
>Sorry for cross posting. As you are probably aware, the ApacheCon Europe, and 
>Apache Big Data conferences will take place in Seville, Spain, November
>14-18, 2016.
>
>http://events.linuxfoundation.org/events/apache-big-data-europe/
>
>I just submitted a talk on StormCrawler (which will 
>touch on Apache Nutch as well) and I know that at least 1 other fellow Nutch 
>committer will be there.
>
>Is anyone else planning on going? It would be interesting not only to catch up 
>within each respective project but also meet people from other crawl related 
>projects.
>
>Best regards
>
>Julien
>
>-- 
>
>Open Source Solutions for Text Engineering
>
>http://www.digitalpebble.com 
>http://digitalpebble.blogspot.com/
>#digitalpebble 


[jira] [Assigned] (NUTCH-1553) Property 'indexer.delete.robots.noindex' not working when using parser-html.

2016-06-29 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-1553:
--

Assignee: Sebastian Nagel

> Property 'indexer.delete.robots.noindex' not working when using parser-html.
> 
>
> Key: NUTCH-1553
> URL: https://issues.apache.org/jira/browse/NUTCH-1553
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer, parser
>Affects Versions: 1.6
>Reporter: Alfonso Presa
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.13
>
> Attachments: NUTCH-1553-trunk-1.patch
>
>
> Maybe I'm doing something wrong, but it seems to me that the +NUTCH-1434+ patch 
> only works when using Tika's parser. When using parser-html, the "robots" metatag 
> is only populated if the parse-metatags plugin is enabled, and then it is stored 
> with the prefix "metatag.". So parseData.getMeta("robots") returns nothing if not 
> using Tika.
> I guess the simplest solution would be to provide a fallback: if 
> parseData.getMeta("robots") is null, get 
> parseData.getMeta("metatag.robots") instead.
> Also, the dependency of this property on the parse-metadata plugin when using 
> parse-html would be something worth documenting somewhere... 
> (nutch-default.xml?)
> Thanks!
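
The fallback proposed in the description can be sketched as follows. This is an illustration under stated assumptions, not the patch that was committed: a plain `Map` stands in for the parse metadata behind `parseData.getMeta(...)`, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; a Map stands in for the parse metadata behind parseData.getMeta(...).
class RobotsMetaFallback {
  /** Returns the "robots" value, falling back to "metatag.robots" if it is absent. */
  static String robotsDirective(Map<String, String> parseMeta) {
    String robots = parseMeta.get("robots");
    if (robots == null) {
      // parse-html with parse-metatags stores the value under the "metatag." prefix
      robots = parseMeta.get("metatag.robots");
    }
    return robots;
  }

  /** True if either key carries a noindex directive (case-insensitive). */
  static boolean isNoIndex(Map<String, String> parseMeta) {
    String robots = robotsDirective(parseMeta);
    return robots != null && robots.toLowerCase(Locale.ROOT).contains("noindex");
  }

  public static void main(String[] args) {
    Map<String, String> meta = new HashMap<>();
    meta.put("metatag.robots", "NOINDEX, nofollow");
    System.out.println(isNoIndex(meta));
  }
}
```

With this shape the indexer-side check no longer depends on which parser populated the metadata, as long as one of the two keys is set.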





[jira] [Updated] (NUTCH-1553) Property 'indexer.delete.robots.noindex' not working when using parser-html.

2016-06-29 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1553:
---
Fix Version/s: 1.13

> Property 'indexer.delete.robots.noindex' not working when using parser-html.
> 
>
> Key: NUTCH-1553
> URL: https://issues.apache.org/jira/browse/NUTCH-1553
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer, parser
>Affects Versions: 1.6
>Reporter: Alfonso Presa
>Priority: Minor
> Fix For: 1.13
>
> Attachments: NUTCH-1553-trunk-1.patch
>
>
> Maybe I'm doing something wrong, but it seems to me that the +NUTCH-1434+ patch 
> only works when using Tika's parser. When using parser-html, the "robots" metatag 
> is only populated if the parse-metatags plugin is enabled, and then it is stored 
> with the prefix "metatag.". So parseData.getMeta("robots") returns nothing if not 
> using Tika.
> I guess the simplest solution would be to provide a fallback: if 
> parseData.getMeta("robots") is null, get 
> parseData.getMeta("metatag.robots") instead.
> Also, the dependency of this property on the parse-metadata plugin when using 
> parse-html would be something worth documenting somewhere... 
> (nutch-default.xml?)
> Thanks!





[jira] [Resolved] (NUTCH-1553) Property 'indexer.delete.robots.noindex' not working when using parser-html.

2016-06-29 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-1553.

Resolution: Fixed

Thanks [~alfonso.presa]! Verified the solution and committed.

> Property 'indexer.delete.robots.noindex' not working when using parser-html.
> 
>
> Key: NUTCH-1553
> URL: https://issues.apache.org/jira/browse/NUTCH-1553
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer, parser
>Affects Versions: 1.6
>Reporter: Alfonso Presa
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.13
>
> Attachments: NUTCH-1553-trunk-1.patch
>
>
> Maybe I'm doing something wrong, but it seems to me that the +NUTCH-1434+ patch 
> only works when using Tika's parser. When using parser-html, the "robots" metatag 
> is only populated if the parse-metatags plugin is enabled, and then it is stored 
> with the prefix "metatag.". So parseData.getMeta("robots") returns nothing if not 
> using Tika.
> I guess the simplest solution would be to provide a fallback: if 
> parseData.getMeta("robots") is null, get 
> parseData.getMeta("metatag.robots") instead.
> Also, the dependency of this property on the parse-metadata plugin when using 
> parse-html would be something worth documenting somewhere... 
> (nutch-default.xml?)
> Thanks!





[jira] [Comment Edited] (NUTCH-1553) Property 'indexer.delete.robots.noindex' not working when using parser-html.

2016-06-29 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15356625#comment-15356625
 ] 

Sebastian Nagel edited comment on NUTCH-1553 at 6/30/16 6:48 AM:
-

Thanks [~alfonso.presa] and [~Fengtan]! Verified [~Fengtan]'s solution and 
committed.


was (Author: wastl-nagel):
Thanks [~alfonso.presa]! Verified the solution and committed.

> Property 'indexer.delete.robots.noindex' not working when using parser-html.
> 
>
> Key: NUTCH-1553
> URL: https://issues.apache.org/jira/browse/NUTCH-1553
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer, parser
>Affects Versions: 1.6
>Reporter: Alfonso Presa
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.13
>
> Attachments: NUTCH-1553-trunk-1.patch
>
>
> Maybe I'm doing something wrong, but it seems to me that the +NUTCH-1434+ patch 
> only works when using Tika's parser. When using parser-html, the "robots" metatag 
> is only populated if the parse-metatags plugin is enabled, and then it is stored 
> with the prefix "metatag.". So parseData.getMeta("robots") returns nothing if not 
> using Tika.
> I guess the simplest solution would be to provide a fallback: if 
> parseData.getMeta("robots") is null, get 
> parseData.getMeta("metatag.robots") instead.
> Also, the dependency of this property on the parse-metadata plugin when using 
> parse-html would be something worth documenting somewhere... 
> (nutch-default.xml?)
> Thanks!


