date:20181119

[jira] [Commented] (NUTCH-2676) Update to the latest selenium and add code to use chrome and firefox headless mode with the remote web driver

2018-11-19 Thread Stas Batururimi (JIRA)



[ 
https://issues.apache.org/jira/browse/NUTCH-2676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692799#comment-16692799
 ] 

Stas Batururimi commented on NUTCH-2676:


Running the build twice didn't help. It looks like the ${packaging.type} 
property is still unresolved at build time:
```
resolve-default:
[ivy:resolve] :: loading settings :: file = /root/nutch_source/ivy/ivysettings.xml
[ivy:resolve] 
[ivy:resolve] :: problems summary ::
[ivy:resolve]  WARNINGS
[ivy:resolve]   [FAILED ] javax.ws.rs#javax.ws.rs-api;2.1!javax.ws.rs-api.${packaging.type}:  (0ms)
[ivy:resolve]    local: tried
[ivy:resolve]     /root/.ivy2/local/javax.ws.rs/javax.ws.rs-api/2.1/${packaging.type}s/javax.ws.rs-api.${packaging.type}
[ivy:resolve]    maven2: tried
[ivy:resolve]     http://repo1.maven.org/maven2/javax/ws/rs/javax.ws.rs-api/2.1/javax.ws.rs-api-2.1.${packaging.type}
[ivy:resolve]    apache-snapshot: tried
[ivy:resolve]     https://repository.apache.org/content/repositories/snapshots/javax/ws/rs/javax.ws.rs-api/2.1/javax.ws.rs-api-2.1.${packaging.type}
[ivy:resolve]    sonatype: tried
[ivy:resolve]     http://oss.sonatype.org/content/repositories/releases/javax/ws/rs/javax.ws.rs-api/2.1/javax.ws.rs-api-2.1.${packaging.type}
[ivy:resolve]   ::
[ivy:resolve]   ::  FAILED DOWNLOADS::
[ivy:resolve]   :: ^ see resolution messages for details  ^ ::
[ivy:resolve]   ::
[ivy:resolve]   :: javax.ws.rs#javax.ws.rs-api;2.1!javax.ws.rs-api.${packaging.type}
[ivy:resolve]   ::
[ivy:resolve] 
[ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
```
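
The symptom in this log, a literal ${packaging.type} surviving into the artifact file name and every repository URL, is what placeholder expansion produces when the property has no value. A minimal sketch of that behaviour follows. It is an illustration only and not Ivy's resolver code: the class name PlaceholderSketch, its methods, and the value "jar" are assumptions made for the example.

```java
import java.util.Collections;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustration only; this is not Ivy's resolver code. */
final class PlaceholderSketch {
    // Matches ${name} placeholders such as ${packaging.type}.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Expands known placeholders; unknown ones are left verbatim. */
    static String expand(String template, Map<String, String> props) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = props.get(m.group(1));
            // Fall back to the untouched placeholder text when no value is known.
            String replacement = (value != null) ? value : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** True if at least one ${...} placeholder is still present. */
    static boolean hasUnresolved(String s) {
        return PLACEHOLDER.matcher(s).find();
    }

    public static void main(String[] args) {
        String artifact = "javax.ws.rs-api-2.1.${packaging.type}";
        // Property missing: the placeholder stays in the file name, as in the log.
        System.out.println(expand(artifact, Collections.<String, String>emptyMap()));
        // Property defined (the value "jar" is an assumption for this example).
        System.out.println(expand(artifact, Collections.singletonMap("packaging.type", "jar")));
    }
}
```

A check like hasUnresolved on the computed artifact name would flag the problem before any repository is tried.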

> Update to the latest selenium and add code to use chrome and firefox headless 
> mode with the remote web driver
> -
>
> Key: NUTCH-2676
> URL: https://issues.apache.org/jira/browse/NUTCH-2676
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Affects Versions: 1.15
>Reporter: Stas Batururimi
>Priority: Major
> Fix For: 1.16
>
> Attachments: Screenshot 2018-11-16 at 18.15.52.png
>
>
> * Selenium needs to be updated
>  * missing remote web driver for chrome 
>  * necessity to add headless mode for both remote WebDriverBase Firefox & 
> Chrome
>  * use case with Selenium grid using docker (1 hub docker container, several 
> nodes in different docker containers, Nutch in another docker container, 
> streaming to Apache Solr in docker container, that is at least 4 different 
> docker containers)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (NUTCH-2668) Integrate OWASP dependency checks as ant target

2018-11-19 Thread Hudson (JIRA)



[ 
https://issues.apache.org/jira/browse/NUTCH-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692352#comment-16692352
 ] 

Hudson commented on NUTCH-2668:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3591 (See 
[https://builds.apache.org/job/Nutch-trunk/3591/])
NUTCH-2668 Integrate OWASP dependency checks as ant target - relax ant (snagel: 
[https://github.com/apache/nutch/commit/a965cd21fbc2e6037c792eb60919a8b0dc240103])
* (edit) build.xml


> Integrate OWASP dependency checks as ant target
> ---
>
> Key: NUTCH-2668
> URL: https://issues.apache.org/jira/browse/NUTCH-2668
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 2.4, 1.16
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 2.4, 1.16
>
> Attachments: 1x-dependency-check-report.html, 
> 1x-dependency-check-vulnerability.html, 2x-dependency-check-report.html, 
> 2x-dependency-check-vulnerability.html
>
>
> [OWASP|http://www.owasp.org/] provides the [ant tool 
> "dependency-check"|https://jeremylong.github.io/DependencyCheck/dependency-check-ant/index.html]
>  which lists potential vulnerabilities of library dependencies. We should 
> integrate the generation of vulnerability reports into our build system as an 
> optional task/target recommended to be run from time to time and especially 
> shortly before releases are prepared.




Jenkins build is back to normal : Nutch-trunk #3591

2018-11-19 Thread Apache Jenkins Server

See

[jira] [Commented] (NUTCH-2668) Integrate OWASP dependency checks as ant target

2018-11-19 Thread Hudson (JIRA)



[ 
https://issues.apache.org/jira/browse/NUTCH-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692349#comment-16692349
 ] 

Hudson commented on NUTCH-2668:
---

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1623 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1623/])
NUTCH-2668 Integrate OWASP dependency checks as ant target - relax ant (snagel: 
[https://github.com/apache/nutch/commit/6adca89c01cc846a361a99594a53cae40ee632bf])
* (edit) build.xml



Jenkins build is back to normal : Nutch-nutchgora #1623

2018-11-19 Thread Apache Jenkins Server

See

[jira] [Commented] (NUTCH-2606) MIME detection is wrong for plain-text documents send as Content-Type "application/msword"

2018-11-19 Thread Hudson (JIRA)



[ 
https://issues.apache.org/jira/browse/NUTCH-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692320#comment-16692320
 ] 

Hudson commented on NUTCH-2606:
---

FAILURE: Integrated in Jenkins build Nutch-trunk #3590 (See 
[https://builds.apache.org/job/Nutch-trunk/3590/])
NUTCH-2606 MIME detection is wrong for plain-text documents send as (snagel: 
[https://github.com/apache/nutch/commit/5f53fd4807f62d002d24f6cfe4b3fae5c0e62741])
* (edit) src/test/org/apache/nutch/util/TestMimeUtil.java
* (edit) src/java/org/apache/nutch/util/MimeUtil.java


> MIME detection is wrong for plain-text documents send as Content-Type 
> "application/msword"
> --
>
> Key: NUTCH-2606
> URL: https://issues.apache.org/jira/browse/NUTCH-2606
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.14
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.16
>
>
> Plain-text documents sent with Content-Type "application/msword" are parsed 
> as Word documents. The MIME detection should be fixed so that these are 
> correctly identified as plain-text documents. See NUTCH-2603 and 
> https://www.atnf.csiro.au/computing/software/gipsy/doc/update.doc
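
One way to picture the intended behaviour: legacy Word files start with the OLE2 compound-file signature, so content that is declared as "application/msword" but lacks that signature and contains only text bytes can be treated as plain text. The sketch below shows this idea under stated assumptions. It is not the change made to MimeUtil, and the class and method names (MswordSniffSketch, resolveType) and the 512-byte sample size are made up for illustration.

```java
/** Illustration only; not the MimeUtil implementation in Nutch. */
final class MswordSniffSketch {
    // OLE2 compound-file signature found at the start of legacy .doc files.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    static boolean startsWithOle2Magic(byte[] content) {
        if (content.length < OLE2_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < OLE2_MAGIC.length; i++) {
            if (content[i] != OLE2_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Crude text check: no control bytes except tab, LF, CR in the first 512 bytes. */
    static boolean looksLikePlainText(byte[] content) {
        int limit = Math.min(content.length, 512);
        for (int i = 0; i < limit; i++) {
            int b = content[i] & 0xFF;
            if (b < 0x20 && b != '\t' && b != '\n' && b != '\r') {
                return false;
            }
        }
        return limit > 0;
    }

    /** Overrides the declared type only for the mislabelled plain-text case. */
    static String resolveType(String declaredType, byte[] content) {
        if ("application/msword".equals(declaredType)
                && !startsWithOle2Magic(content)
                && looksLikePlainText(content)) {
            return "text/plain";
        }
        return declaredType;
    }
}
```

The override is deliberately narrow: real Word documents and all other declared types pass through unchanged.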




[jira] [Commented] (NUTCH-2668) Integrate OWASP dependency checks as ant target

2018-11-19 Thread Hudson (JIRA)



[ 
https://issues.apache.org/jira/browse/NUTCH-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692315#comment-16692315
 ] 

Hudson commented on NUTCH-2668:
---

FAILURE: Integrated in Jenkins build Nutch-nutchgora #1622 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1622/])
NUTCH-2668 Integrate OWASP dependency checks as ant target - add ant (snagel: 
[https://github.com/apache/nutch/commit/f88d73d07db4f12fb009240d0f76977fb4cb50cf])
* (add) ivy/dependency-check-ant/dependency-check-suppressions.xml
* (edit) build.xml



[jira] [Reopened] (NUTCH-2668) Integrate OWASP dependency checks as ant target

2018-11-19 Thread Sebastian Nagel (JIRA)



 [ 
https://issues.apache.org/jira/browse/NUTCH-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reopened NUTCH-2668:

  Assignee: Sebastian Nagel

Ok, the nightly builds fail with
{noformat}
ivy/dependency-check-ant/lib does not exist
{noformat}
Sorry, this folder exists only after the OWASP dependency checker has been 
installed. It is not installed for the Jenkins builds, as the check needs 
manual verification anyway.


[jira] [Commented] (NUTCH-1842) crawl.gen.delay has a wrong default value in nutch-default.xml or is being parsed incorrectly

2018-11-19 Thread Hudson (JIRA)



[ 
https://issues.apache.org/jira/browse/NUTCH-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692322#comment-16692322
 ] 

Hudson commented on NUTCH-1842:
---

FAILURE: Integrated in Jenkins build Nutch-trunk #3590 (See 
[https://builds.apache.org/job/Nutch-trunk/3590/])
NUTCH-1842: crawl.gen.delay value is read incorrectly from config - add 
(snagel: 
[https://github.com/apache/nutch/commit/a37bde1c03bd355c25edf6a240bac6079cb3cdc7])
* (edit) CHANGES.txt


> crawl.gen.delay has a wrong default value in nutch-default.xml or is being 
> parsed incorrectly 
> --
>
> Key: NUTCH-1842
> URL: https://issues.apache.org/jira/browse/NUTCH-1842
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 1.9
>Reporter: kaveh minooie
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.16
>
>
> This is from nutch-default.xml:
> <property>
>   <name>crawl.gen.delay</name>
>   <value>604800000</value>
>   <description>
>    This value, expressed in milliseconds, defines how long we should keep the 
> lock on records 
>    in CrawlDb that were just selected for fetching. If these records are not 
> updated 
>    in the meantime, the lock is canceled, i.e. they become eligible for 
> selecting. 
>    Default value of this is 7 days (604800000 ms).
>   </description>
> </property>
> This is from o.a.n.crawl.Generator.configure(JobConf job):
> genDelay = job.getLong(GENERATOR_DELAY, 7L) * 3600L * 24L * 1000L;
> The value in the config file is in milliseconds, but the code expects it to 
> be in days. I reported this a couple of years ago on the mailing list as 
> well. I didn't post a patch because I am not sure which one needs to be 
> fixed. Since all the other values in the config file are in milliseconds, it 
> can be argued that consistency matters, but 'day' is a much more reasonable 
> unit for this property.
> Also, is this value not being used in 2.x?
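
The unit mismatch can be checked with a few lines of arithmetic. This is a minimal sketch that assumes only the expression quoted from Generator.configure. The class and method names (GenDelayUnits, asQuotedCode) are made up for illustration and are not Nutch code.

```java
import java.util.concurrent.TimeUnit;

/** Illustration only; reproduces the arithmetic, not the Generator class. */
final class GenDelayUnits {
    /** Seven days expressed in milliseconds, as the property description intends. */
    static final long SEVEN_DAYS_MS = TimeUnit.DAYS.toMillis(7);

    /** Mirrors the quoted expression: the configured value is treated as days. */
    static long asQuotedCode(long configuredValue) {
        return configuredValue * 3600L * 24L * 1000L;
    }

    public static void main(String[] args) {
        // 7 days in milliseconds.
        System.out.println(SEVEN_DAYS_MS);                       // 604800000
        // The fallback default 7L is read as days and gives the same delay.
        System.out.println(asQuotedCode(7L));                    // 604800000
        // A value configured in milliseconds is multiplied again.
        long inflated = asQuotedCode(SEVEN_DAYS_MS);
        System.out.println(TimeUnit.MILLISECONDS.toDays(inflated)); // 604800000
    }
}
```

With the millisecond value from the config file, the resulting lock would last 604800000 days instead of 7, so either the description or the code has to change units.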




Build failed in Jenkins: Nutch-trunk #3590

2018-11-19 Thread Apache Jenkins Server

See 


Changes:

[snagel] NUTCH-2606 MIME detection is wrong for plain-text documents send as

[snagel] NUTCH-2668 Integrate OWASP dependency checks as ant target - add ant

[snagel] NUTCH-1842: crawl.gen.delay value is read incorrectly from config - add

--
Started by an SCM change
[EnvInject] - Loading node environment variables.
Building remotely on ubuntu-4 (ubuntu trusty) in workspace 

 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/apache/nutch.git # timeout=10
Fetching upstream changes from https://github.com/apache/nutch.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/apache/nutch.git +refs/heads/*:refs/remotes/origin/*
 > git rev-parse refs/remotes/origin/master^{commit} # timeout=10
 > git rev-parse refs/remotes/origin/origin/master^{commit} # timeout=10
Checking out Revision 785a52f897cab00711417be8fd002b32f8b2c93e (refs/remotes/origin/master)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 785a52f897cab00711417be8fd002b32f8b2c93e
Commit message: "Merge pull request #401 from sebastian-nagel/dependency-check"
 > git rev-list --no-walk f861c8203c8544b91e061964441485bd2f6de145 # timeout=10
[Nutch-trunk] $ /home/jenkins/tools/ant/latest/bin/ant -file build.xml -Dtest.junit.output.format=xml clean nightly javadoc
Buildfile: 
Trying to override old definition of task javac

BUILD FAILED
:627: 
 
does not exist.

Total time: 0 seconds
Build step 'Invoke Ant' marked build as failure
Publishing Javadoc
INFO: Starting to record.
INFO: Processing JUnit
INFO: [JUnit] - 36 test report file(s) were found with the pattern 
'build/test/TEST-*.xml' relative to 
' for the testing framework 
'JUnit'.
ERROR: Step ‘Publish xUnit test result report’ aborted due to exception: 
Also:   hudson.remoting.Channel$CallSiteStackTrace: Remote call to ubuntu-4
        at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1741)
        at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:357)
        at hudson.remoting.Channel.call(Channel.java:955)
        at hudson.FilePath.act(FilePath.java:1036)
        at hudson.FilePath.act(FilePath.java:1025)
        at org.jenkinsci.plugins.xunit.XUnitProcessor.processTestsReport(XUnitProcessor.java:174)
        at org.jenkinsci.plugins.xunit.XUnitProcessor.process(XUnitProcessor.java:144)
        at org.jenkinsci.plugins.xunit.XUnitPublisher.perform(XUnitPublisher.java:127)
        at hudson.tasks.BuildStepCompatibilityLayer.perform(BuildStepCompatibilityLayer.java:81)
        at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
        at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:744)
        at hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:690)
        at hudson.model.Build$BuildExecution.post2(Build.java:186)
        at hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:635)
        at hudson.model.Run.execute(Run.java:1819)
        at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
        at hudson.model.ResourceController.execute(ResourceController.java:97)
        at hudson.model.Executor.run(Executor.java:429)
org.jenkinsci.plugins.xunit.service.NoNewTestReportException: Test reports were found but not all of them are new. Did all the tests run?
  * 

 is 5 days 11 hr old
  * 

 is 5 days 11 hr old
  * 

 is 5 days 11 hr old
  * 

 is 5 days 11 hr old
  * 

 is 5 days 11 hr old
  * 

 is 5 days 11 hr old
  * 

 is 5 days 11 hr old
  *

[jira] [Commented] (NUTCH-2668) Integrate OWASP dependency checks as ant target

2018-11-19 Thread Hudson (JIRA)



[ 
https://issues.apache.org/jira/browse/NUTCH-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692321#comment-16692321
 ] 

Hudson commented on NUTCH-2668:
---

FAILURE: Integrated in Jenkins build Nutch-trunk #3590 (See 
[https://builds.apache.org/job/Nutch-trunk/3590/])
NUTCH-2668 Integrate OWASP dependency checks as ant target - add ant (snagel: 
[https://github.com/apache/nutch/commit/3e9a6e42b240535f114b4a3c0864b269c449d2a1])
* (edit) build.xml
* (add) ivy/dependency-check-ant/dependency-check-suppressions.xml



Build failed in Jenkins: Nutch-nutchgora #1622

2018-11-19 Thread Apache Jenkins Server

See 


Changes:

[snagel] NUTCH-2668 Integrate OWASP dependency checks as ant target - add ant

--
Started by an SCM change
[EnvInject] - Loading node environment variables.
Building remotely on H38 (ubuntu xenial) in workspace 

Cloning the remote Git repository
Cloning repository https://github.com/apache/nutch.git
 > git init  # timeout=10
Fetching upstream changes from https://github.com/apache/nutch.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/apache/nutch.git +refs/heads/*:refs/remotes/origin/*
 > git config remote.origin.url https://github.com/apache/nutch.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/apache/nutch.git # timeout=10
Fetching upstream changes from https://github.com/apache/nutch.git
 > git fetch --tags --progress https://github.com/apache/nutch.git +refs/heads/*:refs/remotes/origin/*
 > git rev-parse refs/remotes/origin/2.x^{commit} # timeout=10
 > git rev-parse refs/remotes/origin/origin/2.x^{commit} # timeout=10
Checking out Revision 5013b9e56cc128d10b31e32e771bd7b0c4aec9b2 (refs/remotes/origin/2.x)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 5013b9e56cc128d10b31e32e771bd7b0c4aec9b2
Commit message: "Merge pull request #404 from sebastian-nagel/dependency-check-2x"
 > git rev-list --no-walk 855e650f1bd72dc38f9ddccde051d95967cc95a2 # timeout=10
[Nutch-nutchgora] $ /home/jenkins/tools/ant/latest/bin/ant nightly javadoc
Buildfile: 
Trying to override old definition of task javac

BUILD FAILED
:614: 
 
does not exist.

Total time: 0 seconds
Build step 'Invoke Ant' marked build as failure
Publishing Javadoc
[JIRA] Updating issue NUTCH-2668

[jira] [Commented] (NUTCH-2675) Give parsers the capability to read and write CrawlDatum

2018-11-19 Thread Sebastian Nagel (JIRA)



[ 
https://issues.apache.org/jira/browse/NUTCH-2675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692303#comment-16692303
 ] 

Sebastian Nagel commented on NUTCH-2675:


Yes, that could be done, but it would also require changing the interfaces of 
the parser and/or ParseFilter plugins.

[~aquaticwater], did you consider implementing a scoring filter to do this job? 
Although the 
[ScoringFilter|http://nutch.apache.org/apidocs/apidocs-1.15/index.html] 
interface was originally intended to transfer and distribute the score from the 
CrawlDb via the fetch datum and the parsed page back to the CrawlDb (both via 
outlinks and the CrawlDatum of the fetched page), it can also be used to 
transfer metadata. The 
[DepthScoringFilter|https://gitbox.apache.org/repos/asf?p=nutch.git;a=blob;f=src/plugin/scoring-depth/src/java/org/apache/nutch/scoring/depth/DepthScoringFilter.java;h=07e0e3f04effe6526088a0c088ec506952d55424;hb=HEAD]
 is a good example of this approach. It does not look straightforward at 
first glance, and you need to pass the information along over multiple 
hops/methods, but it has the advantage of working under any conditions.
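
The "multiple hops" idea can be pictured with plain maps standing in for the metadata containers. This is a rough sketch, not the ScoringFilter interface or the DepthScoringFilter code: the class MetadataHopsSketch, its two methods, and the key name are hypothetical, and java.util.Map replaces the Nutch metadata types.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustration only; plain maps stand in for the Nutch metadata classes. */
final class MetadataHopsSketch {
    /** Hypothetical metadata key used for the example. */
    static final String KEY = "example.custom.key";

    /** Hop 1 (before parsing): copy the value from the datum to the content metadata. */
    static void passBeforeParsing(Map<String, String> datumMeta,
                                  Map<String, String> contentMeta) {
        String value = datumMeta.get(KEY);
        if (value != null) {
            contentMeta.put(KEY, value);
        }
    }

    /** Hop 2 (after parsing): copy the value from the content to the parse metadata. */
    static void passAfterParsing(Map<String, String> contentMeta,
                                 Map<String, String> parseMeta) {
        String value = contentMeta.get(KEY);
        if (value != null) {
            parseMeta.put(KEY, value);
        }
    }

    public static void main(String[] args) {
        Map<String, String> datumMeta = new HashMap<>();
        Map<String, String> contentMeta = new HashMap<>();
        Map<String, String> parseMeta = new HashMap<>();
        datumMeta.put(KEY, "seed-42");

        passBeforeParsing(datumMeta, contentMeta);
        passAfterParsing(contentMeta, parseMeta);
        System.out.println(parseMeta.get(KEY)); // seed-42
    }
}
```

Each hop only needs the two containers available at that stage, which is why the approach does not depend on the parser seeing the CrawlDatum.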

> Give parsers the capability to read and write CrawlDatum
> 
>
> Key: NUTCH-2675
> URL: https://issues.apache.org/jira/browse/NUTCH-2675
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.15
>Reporter: Junqiang Zhang
>Priority: Minor
> Fix For: 1.16
>
>
> Parsers are called inside org.apache.nutch.parse.ParseSegment,
> (Line 127 for version 1.15) parseResult = parseUtil.parse(content);
> and inside org.apache.nutch.fetcher.FetcherThread.
> (Line 640 for version 1.15) parseResult = this.parseUtil.parse(content);
> The current version of Nutch does not give parsers access to CrawlDatum. If 
> users want to customize the parsing process using some metadata of 
> CrawlDatum, it is difficult to read the required metadata. 
> On the other hand, if users want to save metadata generated during parsing, 
> the metadata can only be saved as parseMeta of 
> org.apache.nutch.parse.ParseData. Only the parseMeta entries selected by 
> db.parsemeta.to.crawldb in nutch-site.xml are then added to CrawlDatum, 
> inside org.apache.nutch.parse.ParseOutputFormat and 
> org.apache.nutch.crawl.CrawlDbReducer. If parsers had direct access to 
> CrawlDatum, they could add the metadata generated during parsing to 
> CrawlDatum directly.
> I use Nutch to fetch and parse web pages. To read the required metadata from 
> CrawlDatum during parsing, I use the following workaround.
> (1) During web page fetching, inside 
> org.apache.nutch.protocol.http.api.HttpBase of the lib-http plugin, read the 
> required metadata from CrawlDatum and save it, together with the Headers 
> metadata of org.apache.nutch.net.protocols.Response, to the metadata of 
> org.apache.nutch.protocol.Content. This can be done at line 334 of the code 
> by replacing "response.getHeaders()" with a new metadata object containing 
> both the required metadata from CrawlDatum and the Headers metadata.
> The code that needs to be modified inside 
> org.apache.nutch.protocol.http.api.HttpBase of the lib-http plugin is
> (Line 332 for version 1.15)  Content c = new Content(u.toString(), 
> u.toString(),
> (Line 333 for version 1.15)   (content == null ? EMPTY_CONTENT : 
> content),
> (Line 334 for version 1.15)   response.getHeader("Content-Type"), 
> response.getHeaders(), mimeTypes);
> (2) During HTML page parsing, inside org.apache.nutch.parse.html.HtmlParser 
> of the parse-html plugin, read the required metadata from the metadata of 
> org.apache.nutch.protocol.Content, and customize the parsing process using 
> the required metadata.
> If parsers had direct access to CrawlDatum, the above workaround would not be 
> needed. To give parsers the capability to read and write CrawlDatum directly, 
> I would like to suggest adding a new method "public ParseResult parse(Content 
> content, CrawlDatum datum)" to org.apache.nutch.parse.ParseUtil in future 
> versions of Nutch.
> To stay compatible with 1.15 and previous versions, I would like to suggest 
> adding a new configuration property to nutch-default.xml. By default, the 
> property would select the current method "public ParseResult parse(Content 
> content)". Users who want to use "public ParseResult parse(Content content, 
> CrawlDatum datum)" can change the property in nutch-site.xml.
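
The proposal, an additional parse(Content, CrawlDatum) overload selected by a configuration property, could look roughly like the sketch below. Everything in it is hypothetical: the nested Content, CrawlDatum and ParseResult classes are simplified stand-ins for the Nutch types, the property name "parser.pass.crawldatum" does not exist in Nutch, and a plain Map replaces the Hadoop Configuration.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustration only; stand-in types, not the Nutch ParseUtil API. */
final class ParseOverloadSketch {
    /** Hypothetical property name; the issue does not name one. */
    static final String PASS_DATUM = "parser.pass.crawldatum";

    static final class Content {
        final String text;
        Content(String text) { this.text = text; }
    }

    static final class CrawlDatum {
        final Map<String, String> metaData = new HashMap<>();
    }

    static final class ParseResult {
        final String text;
        ParseResult(String text) { this.text = text; }
    }

    /** Existing behaviour: the parser sees only the content. */
    static ParseResult parse(Content content) {
        return new ParseResult(content.text.trim());
    }

    /** Proposed overload: the parser can also read and write the datum. */
    static ParseResult parse(Content content, CrawlDatum datum) {
        ParseResult base = parse(content);
        // Read: customize parsing with metadata taken from the datum.
        String prefix = datum.metaData.getOrDefault("example.prefix", "");
        // Write: store metadata generated during parsing on the datum.
        datum.metaData.put("example.parsed.length", String.valueOf(base.text.length()));
        return new ParseResult(prefix + base.text);
    }

    /** Chooses the overload from the configuration; the default keeps the old path. */
    static ParseResult dispatch(Map<String, String> conf, Content content, CrawlDatum datum) {
        boolean passDatum = Boolean.parseBoolean(conf.getOrDefault(PASS_DATUM, "false"));
        return passDatum ? parse(content, datum) : parse(content);
    }
}
```

Defaulting the property to false keeps existing callers on the single-argument path, which is the compatibility concern raised in the description.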




[jira] [Updated] (NUTCH-2675) Give parsers the capability to read and write CrawlDatum

2018-11-19 Thread Sebastian Nagel (JIRA)



 [ 
https://issues.apache.org/jira/browse/NUTCH-2675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2675:
---
Fix Version/s: (was: 1.15)
   1.16


[jira] [Resolved] (NUTCH-2668) Integrate OWASP dependency checks as ant target

2018-11-19 Thread Sebastian Nagel (JIRA)



 [ 
https://issues.apache.org/jira/browse/NUTCH-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2668.

Resolution: Fixed

Thanks! Merged into master/1.x and 2.x


[jira] [Commented] (NUTCH-2668) Integrate OWASP dependency checks as ant target

2018-11-19 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/NUTCH-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692266#comment-16692266
 ] 

ASF GitHub Bot commented on NUTCH-2668:
---

sebastian-nagel closed pull request #404: NUTCH-2668 Integrate OWASP dependency 
checks as ant target
URL: https://github.com/apache/nutch/pull/404
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/build.xml b/build.xml
index c21ced9a3..1f5a0d9b2 100644
--- a/build.xml
+++ b/build.xml
@@ -599,6 +599,35 @@
   
  

+  
+  
+  
+  
+  
+  
+  
+
+
+  
+
+  
+  
+
+  
+  
+
+
+
+
+  
+  
+
+
+  
+
+
   
 
+https://jeremylong.github.io/DependencyCheck/dependency-suppression.1.1.xsd";>
+   
+  only applies to tika-server < 1.18
+  ^org\.(apache\.tika:tika-(core|parsers)|gagravarr:vorbis-java-tika):.*$
+  CVE-2018-1335
+   
+


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (NUTCH-2668) Integrate OWASP dependency checks as ant target

2018-11-19 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/NUTCH-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692265#comment-16692265
 ] 

ASF GitHub Bot commented on NUTCH-2668:
---

sebastian-nagel closed pull request #401: NUTCH-2668 Integrate OWASP dependency 
checks as ant target
URL: https://github.com/apache/nutch/pull/401
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/build.xml b/build.xml
index e19179e12..928ffaa0e 100644
--- a/build.xml
+++ b/build.xml
@@ -612,6 +612,34 @@
 
   
 
+  
+  
+  
+  
+  
+  
+  
+
+
+  
+
+  
+  
+
+  
+  
+
+
+
+
+  
+  
+
+
+  
+
   
   
   
diff --git a/ivy/dependency-check-ant/dependency-check-suppressions.xml 
b/ivy/dependency-check-ant/dependency-check-suppressions.xml
new file mode 100644
index 0..e7de8febb
--- /dev/null
+++ b/ivy/dependency-check-ant/dependency-check-suppressions.xml
@@ -0,0 +1,8 @@
+
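+<suppressions xmlns="https://jeremylong.github.io/DependencyCheck/dependency-suppression.1.1.xsd">
+   <suppress>
+      <notes>only applies to tika-server &lt; 1.18</notes>
+      <gav regex="true">^org\.(apache\.tika:tika-(core|parsers)|gagravarr:vorbis-java-tika):.*$</gav>
+      <cve>CVE-2018-1335</cve>
+   </suppress>
+</suppressions>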
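
To see which artifacts the suppression pattern in the diff above would cover, the regular expression can be tried against a few Maven coordinates. This is only a minimal sketch: the class name, the helper method and the version numbers are made up for illustration, and it assumes the pattern is matched against a full `group:artifact:version` string.

```java
import java.util.regex.Pattern;

class SuppressionRegexSketch {
  // Pattern copied verbatim from the suppression entry in the diff.
  static final Pattern GAV = Pattern.compile(
      "^org\\.(apache\\.tika:tika-(core|parsers)|gagravarr:vorbis-java-tika):.*$");

  // Hypothetical helper: true if the coordinate would be covered by the
  // suppression, assuming a whole-string match on group:artifact:version.
  static boolean isSuppressed(String gav) {
    return GAV.matcher(gav).matches();
  }

  public static void main(String[] args) {
    // Illustrative coordinates only; the versions are not taken from the build.
    String[] samples = {
        "org.apache.tika:tika-core:1.19.1",
        "org.apache.tika:tika-parsers:1.19.1",
        "org.gagravarr:vorbis-java-tika:0.8",
        "org.apache.tika:tika-server:1.17"
    };
    for (String gav : samples) {
      System.out.println(gav + " suppressed: " + isSuppressed(gav));
    }
  }
}
```

Under these assumptions tika-core, tika-parsers and vorbis-java-tika are covered, while tika-server itself is not, which fits the note that CVE-2018-1335 only applies to tika-server.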


 



[jira] [Resolved] (NUTCH-2606) MIME detection is wrong for plain-text documents send as Content-Type "application/msword"

2018-11-19 Thread Sebastian Nagel (JIRA)



 [ 
https://issues.apache.org/jira/browse/NUTCH-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2606.

Resolution: Fixed

> MIME detection is wrong for plain-text documents send as Content-Type 
> "application/msword"
> --
>
> Key: NUTCH-2606
> URL: https://issues.apache.org/jira/browse/NUTCH-2606
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.14
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.16
>
>
> Plain-text documents sent with Content-Type "application/msword" are parsed 
> as Word documents. The MIME detection should be fixed so that these 
> are correctly identified as plain-text documents. See NUTCH-2603 and 
> https://www.atnf.csiro.au/computing/software/gipsy/doc/update.doc




[jira] [Commented] (NUTCH-2606) MIME detection is wrong for plain-text documents send as Content-Type "application/msword"

2018-11-19 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/NUTCH-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692254#comment-16692254
 ] 

ASF GitHub Bot commented on NUTCH-2606:
---

sebastian-nagel closed pull request #392: NUTCH-2606 MIME detection is wrong 
for plain-text documents send as Content-Type "application/msword"
URL: https://github.com/apache/nutch/pull/392
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/src/java/org/apache/nutch/util/MimeUtil.java 
b/src/java/org/apache/nutch/util/MimeUtil.java
index d380427ae..443341ecd 100644
--- a/src/java/org/apache/nutch/util/MimeUtil.java
+++ b/src/java/org/apache/nutch/util/MimeUtil.java
@@ -200,8 +200,7 @@ public String autoResolveContentType(String typeName, 
String url, byte[] data) {
   }
 
   if (magicType != null && !magicType.equals(MimeTypes.OCTET_STREAM)
-  && !magicType.equals(MimeTypes.PLAIN_TEXT) && retType != null
-  && !retType.equals(magicType)) {
+  && retType != null && !retType.equals(magicType)) {
 
 // If magic enabled and the current mime type differs from that of the
 // one returned from the magic, take the magic mimeType
diff --git a/src/test/org/apache/nutch/util/TestMimeUtil.java 
b/src/test/org/apache/nutch/util/TestMimeUtil.java
index d0b45dbac..72a42b457 100644
--- a/src/test/org/apache/nutch/util/TestMimeUtil.java
+++ b/src/test/org/apache/nutch/util/TestMimeUtil.java
@@ -68,7 +68,16 @@
   "\nhttp://www.w3.org/1999/xhtml\";>"
   + "\n\n"
   + ""
-  + "\nHello, World!" } };
+  + "\nHello, World!" },
+  { /*
+ * test detection of plain-text documents with erroneous Content-Type
+ * sent in HTTP header (NUTCH-2606)
+ */
+  "text/plain", // correct MIME type
+  "test.doc", // erroneously indicates MS-Word document
+  "application/msword", // erroneous Content-Type
+  "This is a plain text document",
+  "requires-mime-magic" } };
 
   public static String[][] binaryFiles = { {
   "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
@@ -99,6 +108,9 @@ public void testWithMimeMagic() {
   /** use only HTTP Content-Type (if given) and URL pattern */
   public void testWithoutMimeMagic() {
 for (String[] testPage : textBasedFormats) {
+  if (testPage.length > 4 && "requires-mime-magic".equals(testPage[4])) {
+continue;
+  }
   String mimeType = getMimeType(urlPrefix + testPage[1],
   testPage[3].getBytes(defaultCharset), testPage[2], false);
   assertEquals("", testPage[0], mimeType);
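
The effect of the changed condition in MimeUtil can be illustrated in isolation. The following is a simplified sketch, not the actual `autoResolveContentType` method: the class and method names are hypothetical, MIME types are plain strings instead of Tika constants, and only the `if` condition touched by the diff is modelled.

```java
class MagicOverrideSketch {
  static final String OCTET_STREAM = "application/octet-stream";
  static final String PLAIN_TEXT = "text/plain";

  // Condition before the patch: a magic result of text/plain never
  // overrides the type derived from Content-Type and URL.
  static String resolveBefore(String retType, String magicType) {
    if (magicType != null && !magicType.equals(OCTET_STREAM)
        && !magicType.equals(PLAIN_TEXT) && retType != null
        && !retType.equals(magicType)) {
      return magicType;
    }
    return retType;
  }

  // Condition after the patch: only application/octet-stream is ignored,
  // so a magic result of text/plain can override a wrong Content-Type.
  static String resolveAfter(String retType, String magicType) {
    if (magicType != null && !magicType.equals(OCTET_STREAM)
        && retType != null && !retType.equals(magicType)) {
      return magicType;
    }
    return retType;
  }

  public static void main(String[] args) {
    // Case from NUTCH-2606: plain text served as application/msword.
    System.out.println(resolveBefore("application/msword", PLAIN_TEXT));
    System.out.println(resolveAfter("application/msword", PLAIN_TEXT));
  }
}
```

In this sketch the old condition keeps "application/msword" for the NUTCH-2606 case, while the new one returns "text/plain". Because the outcome depends on MIME magic, the new test entry is tagged "requires-mime-magic" and skipped in testWithoutMimeMagic().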


 



[jira] [Commented] (NUTCH-2669) Reliable solution for javax.ws packaging.type

2018-11-19 Thread Sebastian Nagel (JIRA)



[ 
https://issues.apache.org/jira/browse/NUTCH-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16691844#comment-16691844
 ] 

Sebastian Nagel commented on NUTCH-2669:


Rolled back NUTCH-2671 because of NUTCH-2672. Waiting for IVY-1586 to be fixed. 
From NUTCH-2676 it's clear that the work-around of defining the property 
packaging.type works only if the ivy cache already contains the javax.ws 
dependencies. That explains why only the second build on each Jenkins 
machine succeeds.

> Reliable solution for javax.ws packaging.type
> -
>
> Key: NUTCH-2669
> URL: https://issues.apache.org/jira/browse/NUTCH-2669
> Project: Nutch
>  Issue Type: Bug
>  Components: build
>Affects Versions: 2.4, 1.16
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 2.4, 1.16
>
>
> The upgrade of Tika to v1.19.1 (NUTCH-2651, NUTCH-2665, NUTCH-2667) raises an 
> ant/ivy issue during build when resolving/fetching dependencies:
> {noformat}
> [ivy:resolve] [FAILED ] 
> javax.ws.rs#javax.ws.rs-api;2.1!javax.ws.rs-api.${packaging.type}:  (0ms)
> [ivy:resolve]  local: tried
> [ivy:resolve]   
> /home/jenkins/.ivy2/local/javax.ws.rs/javax.ws.rs-api/2.1/${packaging.type}s/javax.ws.rs-api.${packaging.type}
> [ivy:resolve]  maven2: tried
> [ivy:resolve]   
> http://repo1.maven.org/maven2/javax/ws/rs/javax.ws.rs-api/2.1/javax.ws.rs-api-2.1.${packaging.type}
> [ivy:resolve]  apache-snapshot: tried
> [ivy:resolve]   
> https://repository.apache.org/content/repositories/snapshots/javax/ws/rs/javax.ws.rs-api/2.1/javax.ws.rs-api-2.1.${packaging.type}
> [ivy:resolve]  sonatype: tried
> [ivy:resolve]   
> http://oss.sonatype.org/content/repositories/releases/javax/ws/rs/javax.ws.rs-api/2.1/javax.ws.rs-api-2.1.${packaging.type}
> [ivy:resolve] ::
> [ivy:resolve] ::  FAILED DOWNLOADS::
> [ivy:resolve] :: ^ see resolution messages for details  ^ ::
> [ivy:resolve] ::
> [ivy:resolve] :: 
> javax.ws.rs#javax.ws.rs-api;2.1!javax.ws.rs-api.${packaging.type}
> [ivy:resolve] ::
> [ivy:resolve]  ERRORS
> ...
> BUILD FAILED
> {noformat}
> More information about this issue is linked on 
> [jax-rs#576|https://github.com/jax-rs/api/pull/576]. 
> A work-around is to define a property {{packaging.type}} and set it to 
> {{jar}}. This can be done
> - on the command line: {{ant -Dpackaging.type=jar ...}}
> - in default.properties
> - in ivysettings.xml
> The last work-around is active in current master/1.x. However, some 
> Jenkins builds still fail while a few succeed:
> ||#build||status jax-rs||machine||work-around||
> |3578|success|H28|ivysettings.xml|
> |3577|failed|H28|ivysettings.xml|
> |3576|failed|H33|ivysettings.xml|
> |3575|success|ubuntu-4|ivysettings.xml|
> |3574|failed|ubuntu-4|-Dpackaging.type=jar + default.properties|
> |3571|failed|?|-Dpackaging.type=jar + default.properties|
> |3568|failed|?|-Dpackaging.type=jar + default.properties|
> Builds which failed for other reasons are left out. The only pattern I see 
> is that only the second build on each of the Jenkins machines succeeds. A 
> possible reason could be that the build environments on the machines persist 
> state (the Nutch build directory, local ivy cache, etc.). If this is the 
> case, it may take some time until builds succeed on all Jenkins machines.
> The ivysettings.xml work-around was the first which succeeded on a Jenkins 
> build, but it may be the case that all three work-arounds apply.
> The issue is supposed to be resolved (without work-arounds) by IVY-1577. 
> However, it looks like it isn't:
> - get rc2 of ivy 2.5.0 (the URL may change):
> {noformat}
> % wget -O ivy/ivy-2.5.0-rc2-test.jar \
>     https://builds.apache.org/job/Ivy/lastSuccessfulBuild/artifact/build/artifact/org.apache.ivy_2.5.0.cr2_20181023065327.jar
> {noformat}
> - edit default.properties and set {{ivy.version=2.5.0-rc2-test}}
> - remove work-around in ivysettings.xml (or default.properties)
> - run {{ant clean runtime}} and check whether the build fails or whether the 
> javax.ws lib is in place: {{ls build/lib/javax.ws.rs-api*.jar}}
> This solution fails for 
> [ivy-2.5.0-rc1.jar|http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.5.0-rc1/ivy-2.5.0-rc1.jar]
>  and the mentioned rc2 jar as of 2018-10-23. But maybe the procedure is 
> wrong. I'll contact the ant/ivy team to solve this.
>  




[jira] [Commented] (NUTCH-2676) Update to the latest selenium and add code to use chrome and firefox headless mode with the remote web driver

2018-11-19 Thread Sebastian Nagel (JIRA)



[ 
https://issues.apache.org/jira/browse/NUTCH-2676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16691835#comment-16691835
 ] 

Sebastian Nagel commented on NUTCH-2676:


Hi [~virt], thanks for the detailed description of how the Docker image is built. 
Now I understand why our nightly builds still fail sometimes with the 
"javax.ws.rs-api.packaging.type" error, see NUTCH-2669. The work-around works 
only when the local ivy cache already contains the javax.ws.rs jars. When 
building the Docker image or working in a container, the cache is not persisted 
and cannot contain the dependencies. As a "work-around to the work-around" you 
could call the build twice if it fails the first time:
{noformat}
RUN ant runtime || ant runtime
{noformat}


> Update to the latest selenium and add code to use chrome and firefox headless 
> mode with the remote web driver
> -
>
> Key: NUTCH-2676
> URL: https://issues.apache.org/jira/browse/NUTCH-2676
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Affects Versions: 1.15
>Reporter: Stas Batururimi
>Priority: Major
> Fix For: 1.16
>
> Attachments: Screenshot 2018-11-16 at 18.15.52.png
>
>
> * Selenium needs to be updated
>  * missing remote web driver for chrome 
>  * necessity to add headless mode for both remote WebDriverBase Firefox & 
> Chrome
>  * use case with Selenium grid using docker (1 hub docker container, several 
> nodes in different docker containers, Nutch in another docker container, 
> streaming to Apache Solr in docker container, that is at least 4 different 
> docker containers)




[jira] [Commented] (NUTCH-2676) Update to the latest selenium and add code to use chrome and firefox headless mode with the remote web driver

2018-11-19 Thread Stas Batururimi (JIRA)



[ 
https://issues.apache.org/jira/browse/NUTCH-2676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16691570#comment-16691570
 ] 

Stas Batururimi commented on NUTCH-2676:


Quite strange. It works with Tika 1.18 but not with Tika 1.19+: the 
specified packaging.type seems to be missing. I see the following commit:
https://github.com/apache/nutch/blob/65c4fedfacdb873a050e97a50602ed366c7b5a98/ivy/ivysettings.xml
But it is not helping...


[jira] [Commented] (NUTCH-2676) Update to the latest selenium and add code to use chrome and firefox headless mode with the remote web driver

2018-11-19 Thread Stas Batururimi (JIRA)



[ 
https://issues.apache.org/jira/browse/NUTCH-2676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16691525#comment-16691525
 ] 

Stas Batururimi commented on NUTCH-2676:


The problem lies in the dependencies section here:
https://github.com/apache/nutch/blob/master/src/plugin/parse-tika/ivy.xml

