[jira] [Comment Edited] (NUTCH-2191) Add protocol-htmlunit

2016-03-19 Thread Karanjeet Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15199116#comment-15199116
 ] 

Karanjeet Singh edited comment on NUTCH-2191 at 3/17/16 7:58 AM:
-

[~markus17]
Although I have started working on this, there is still a lot to cover and 
test. 

This has been a busy week for me. I will try to work on this over the weekend. 
Sorry for the delay.


was (Author: karanjeets):
[~markus17]
Although I started working on this, there is still a lot to cover and test. 

This has been a busy week for me. I will try to work on this over the weekend. 
Sorry for the delay.

> Add protocol-htmlunit
> -
>
> Key: NUTCH-2191
> URL: https://issues.apache.org/jira/browse/NUTCH-2191
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Chris A. Mattmann
> Fix For: 1.12
>
> Attachments: NUTCH-2191.patch, NUTCH-2191.patch, NUTCH-2191.patch
>
>
> HtmlUnit is, as opposed to other JavaScript-enabled headless browsers, a 
> portable library and should therefore be better suited to very large-scale 
> crawls. This issue is an attempt to implement protocol-htmlunit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] nutch pull request: Add the boilerpipe parsing adapted from NUTCH-...

2016-03-19 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/92


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2016-03-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203025#comment-15203025
 ] 

ASF GitHub Bot commented on NUTCH-961:
--

Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/92


> Expose Tika's boilerpipe support
> 
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: BoilerpipeExtractorRepository.java, 
> NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch, 
> NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, 
> NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, 
> NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch, 
> NUTCH-961.patch, NUTCH-961.patch, NUTCH-961v2.patch, 
> nutch-2.x-boilerpipe.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler, which can be used to 
> remove boilerplate content from HTML pages. We should see how we can expose 
> Boilerpipe in the Nutch configuration.
> Use the following properties to enable and control Boilerpipe.
> {code}
> <property>
>   <name>tika.extractor</name>
>   <value>none</value>
>   <description>
>   Which text extraction algorithm to use. Valid values are: boilerpipe or none.
>   </description>
> </property>
> <property>
>   <name>tika.extractor.boilerpipe.algorithm</name>
>   <value>ArticleExtractor</value>
>   <description>
>   Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor, ArticleExtractor
>   or CanolaExtractor.
>   </description>
> </property>
> {code}
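
As a rough illustration of how the two properties above could be switched on, here is a minimal Python sketch that generates an override document in the usual configuration/property/name/value layout. The helper name build_site_overrides is hypothetical, and writing the result to nutch-site.xml is an assumption about the deployment, not something stated in the issue.

```python
import xml.etree.ElementTree as ET


def build_site_overrides(props):
    """Return a <configuration> document with one <property> per entry.

    Hypothetical helper for illustration only; it is not part of Nutch.
    """
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Enable Boilerpipe and pick one of the algorithms listed in the issue.
overrides = build_site_overrides({
    "tika.extractor": "boilerpipe",
    "tika.extractor.boilerpipe.algorithm": "ArticleExtractor",
})
print(overrides)
```

The property names and values come from the issue text; everything else is a sketch and would need checking against the Nutch version in use.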





[jira] [Resolved] (NUTCH-2241) Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration

2016-03-19 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-2241.
--
Resolution: Fixed

Merged, thanks [~karanjeets]!

{noformat}
[chipotle:~/tmp/nutch1.12] mattmann% git pull https://github.com/karanjeets/nutch/ NUTCH-2241
remote: Counting objects: 18, done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 18 (delta 1), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (18/18), done.
From https://github.com/karanjeets/nutch
 * branch            NUTCH-2241 -> FETCH_HEAD
Updating a3e7420..a9b2491
Fast-forward
 CHANGES.txt                                                                            |  2 ++
 conf/nutch-default.xml                                                                 | 50 ++
 src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java | 52
 3 files changed, 88 insertions(+), 16 deletions(-)
[chipotle:~/tmp/nutch1.12] mattmann% git branch
  2.x
  NUTCH-2213
* master
  merge-branch
[chipotle:~/tmp/nutch1.12] mattmann% git push -u origin master
Counting objects: 96, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (11/11), done.
Writing objects: 100% (18/18), 2.53 KiB | 0 bytes/s, done.
Total 18 (delta 9), reused 0 (delta 0)
remote: nutch git commit: fix for NUTCH-2241 contributed by karanjeets
remote: nutch git commit: fix for NUTCH-2241 contributed by karanjeets
To https://git-wip-us.apache.org/repos/asf/nutch.git
   a3e7420..a9b2491  master -> master
Branch master set up to track remote branch master from origin.
[chipotle:~/tmp/nutch1.12] mattmann% 
{noformat}


> Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration
> 
>
> Key: NUTCH-2241
> URL: https://issues.apache.org/jira/browse/NUTCH-2241
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Affects Versions: 1.12
> Environment: Fixed for Firefox browser with version 25 and above.
>Reporter: Karanjeet Singh
>Assignee: Chris A. Mattmann
>  Labels: firefox, interactiveselenium, lib-selenium, memex, 
> nutch, nutch-default.xml, plugin, protocol, selenium
> Fix For: 1.12
>
>
> Issues:
> (a) Firefox browser doesn't close gracefully.
> (b) The property libselenium.page.load.delay is not working. No matter how 
> much delay you give, the driver is not waiting for the page to load.
> (c) There is no timeout configured for the firefox binary.
> (d) A lot of selenium configuration is hard-coded which can be exposed 
> through nutch-default.xml or nutch-site.xml
> All these issues are part of "lib-selenium" plugin which is being used by two 
> other protocols "protocol-selenium" and "protocol-interactiveselenium".





[jira] [Commented] (NUTCH-2241) Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration

2016-03-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203022#comment-15203022
 ] 

ASF GitHub Bot commented on NUTCH-2241:
---

Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/98


> Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration
> 
>
> Key: NUTCH-2241
> URL: https://issues.apache.org/jira/browse/NUTCH-2241
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Affects Versions: 1.12
> Environment: Fixed for Firefox browser with version 25 and above.
>Reporter: Karanjeet Singh
>Assignee: Chris A. Mattmann
>  Labels: firefox, interactiveselenium, lib-selenium, memex, 
> nutch, nutch-default.xml, plugin, protocol, selenium
> Fix For: 1.12
>
>
> Issues:
> (a) Firefox browser doesn't close gracefully.
> (b) The property libselenium.page.load.delay is not working. No matter how 
> much delay you give, the driver is not waiting for the page to load.
> (c) There is no timeout configured for the firefox binary.
> (d) A lot of selenium configuration is hard-coded which can be exposed 
> through nutch-default.xml or nutch-site.xml
> All these issues are part of "lib-selenium" plugin which is being used by two 
> other protocols "protocol-selenium" and "protocol-interactiveselenium".





[GitHub] nutch pull request: fix for NUTCH-2241 contributed by karanjeets

2016-03-19 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/98




[jira] [Work started] (NUTCH-2241) Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration

2016-03-19 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2241 started by Chris A. Mattmann.

> Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration
> 
>
> Key: NUTCH-2241
> URL: https://issues.apache.org/jira/browse/NUTCH-2241
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Affects Versions: 1.12
> Environment: Fixed for Firefox browser with version 25 and above.
>Reporter: Karanjeet Singh
>Assignee: Chris A. Mattmann
>  Labels: firefox, interactiveselenium, lib-selenium, memex, 
> nutch, nutch-default.xml, plugin, protocol, selenium
> Fix For: 1.12
>
>
> Issues:
> (a) Firefox browser doesn't close gracefully.
> (b) The property libselenium.page.load.delay is not working. No matter how 
> much delay you give, the driver is not waiting for the page to load.
> (c) There is no timeout configured for the firefox binary.
> (d) A lot of selenium configuration is hard-coded which can be exposed 
> through nutch-default.xml or nutch-site.xml
> All these issues are part of "lib-selenium" plugin which is being used by two 
> other protocols "protocol-selenium" and "protocol-interactiveselenium".





[jira] [Updated] (NUTCH-2241) Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration

2016-03-19 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2241:
-
Labels: firefox interactiveselenium lib-selenium memex nutch 
nutch-default.xml plugin protocol selenium  (was: firefox interactiveselenium 
lib-selenium nutch nutch-default.xml plugin protocol selenium)

> Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration
> 
>
> Key: NUTCH-2241
> URL: https://issues.apache.org/jira/browse/NUTCH-2241
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Affects Versions: 1.12
> Environment: Fixed for Firefox browser with version 25 and above.
>Reporter: Karanjeet Singh
>  Labels: firefox, interactiveselenium, lib-selenium, memex, 
> nutch, nutch-default.xml, plugin, protocol, selenium
> Fix For: 1.12
>
>
> Issues:
> (a) Firefox browser doesn't close gracefully.
> (b) The property libselenium.page.load.delay is not working. No matter how 
> much delay you give, the driver is not waiting for the page to load.
> (c) There is no timeout configured for the firefox binary.
> (d) A lot of selenium configuration is hard-coded which can be exposed 
> through nutch-default.xml or nutch-site.xml
> All these issues are part of "lib-selenium" plugin which is being used by two 
> other protocols "protocol-selenium" and "protocol-interactiveselenium".





[jira] [Assigned] (NUTCH-2241) Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration

2016-03-19 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-2241:


Assignee: Chris A. Mattmann

> Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration
> 
>
> Key: NUTCH-2241
> URL: https://issues.apache.org/jira/browse/NUTCH-2241
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Affects Versions: 1.12
> Environment: Fixed for Firefox browser with version 25 and above.
>Reporter: Karanjeet Singh
>Assignee: Chris A. Mattmann
>  Labels: firefox, interactiveselenium, lib-selenium, memex, 
> nutch, nutch-default.xml, plugin, protocol, selenium
> Fix For: 1.12
>
>
> Issues:
> (a) Firefox browser doesn't close gracefully.
> (b) The property libselenium.page.load.delay is not working. No matter how 
> much delay you give, the driver is not waiting for the page to load.
> (c) There is no timeout configured for the firefox binary.
> (d) A lot of selenium configuration is hard-coded which can be exposed 
> through nutch-default.xml or nutch-site.xml
> All these issues are part of "lib-selenium" plugin which is being used by two 
> other protocols "protocol-selenium" and "protocol-interactiveselenium".





[GitHub] nutch pull request: fix for NUTCH-2241 contributed by karanjeets

2016-03-19 Thread karanjeets
GitHub user karanjeets opened a pull request:

https://github.com/apache/nutch/pull/98

fix for NUTCH-2241 contributed by karanjeets



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/karanjeets/nutch NUTCH-2241

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/98.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #98


commit 230693d6dc648f587e88e59817eea934166c9247
Author: Karanjeet Singh 
Date:   2016-03-19T23:55:40Z

fix for NUTCH-2241 contributed by karanjeets






[jira] [Commented] (NUTCH-2241) Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration

2016-03-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203017#comment-15203017
 ] 

ASF GitHub Bot commented on NUTCH-2241:
---

GitHub user karanjeets opened a pull request:

https://github.com/apache/nutch/pull/98

fix for NUTCH-2241 contributed by karanjeets



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/karanjeets/nutch NUTCH-2241

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/98.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #98


commit 230693d6dc648f587e88e59817eea934166c9247
Author: Karanjeet Singh 
Date:   2016-03-19T23:55:40Z

fix for NUTCH-2241 contributed by karanjeets




> Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration
> 
>
> Key: NUTCH-2241
> URL: https://issues.apache.org/jira/browse/NUTCH-2241
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Affects Versions: 1.12
> Environment: Fixed for Firefox browser with version 25 and above.
>Reporter: Karanjeet Singh
>  Labels: firefox, interactiveselenium, lib-selenium, nutch, 
> nutch-default.xml, plugin, protocol, selenium
> Fix For: 1.12
>
>
> Issues:
> (a) Firefox browser doesn't close gracefully.
> (b) The property libselenium.page.load.delay is not working. No matter how 
> much delay you give, the driver is not waiting for the page to load.
> (c) There is no timeout configured for the firefox binary.
> (d) A lot of selenium configuration is hard-coded which can be exposed 
> through nutch-default.xml or nutch-site.xml
> All these issues are part of "lib-selenium" plugin which is being used by two 
> other protocols "protocol-selenium" and "protocol-interactiveselenium".





[jira] [Created] (NUTCH-2241) Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration

2016-03-19 Thread Karanjeet Singh (JIRA)
Karanjeet Singh created NUTCH-2241:
--

 Summary: Unstable Selenium plugin in Nutch. Fixed bugs and 
enhanced configuration
 Key: NUTCH-2241
 URL: https://issues.apache.org/jira/browse/NUTCH-2241
 Project: Nutch
  Issue Type: Bug
  Components: plugin, protocol
Affects Versions: 1.12
 Environment: Fixed for Firefox browser with version 25 and above.
Reporter: Karanjeet Singh
 Fix For: 1.12


Issues:
(a) Firefox browser doesn't close gracefully.
(b) The property libselenium.page.load.delay is not working. No matter how much 
delay you give, the driver is not waiting for the page to load.
(c) There is no timeout configured for the firefox binary.
(d) A lot of selenium configuration is hard-coded which can be exposed through 
nutch-default.xml or nutch-site.xml

All these issues are part of "lib-selenium" plugin which is being used by two 
other protocols "protocol-selenium" and "protocol-interactiveselenium".
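
Point (d) relies on the usual layering in which nutch-site.xml overrides nutch-default.xml. The following is a minimal Python sketch of that lookup, not the lib-selenium code itself. The function names and the values 3 and 10 are illustrative assumptions; only the property name libselenium.page.load.delay comes from the report.

```python
import xml.etree.ElementTree as ET


def load_properties(xml_text):
    """Parse a <configuration> document into a name -> value dict."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value")
            for p in root.findall("property")}


def effective_config(default_xml, site_xml):
    """Apply site overrides on top of defaults (simplified: no 'final' handling)."""
    config = load_properties(default_xml)
    config.update(load_properties(site_xml))
    return config


# Illustrative contents only; the real files hold many more properties.
DEFAULT_XML = """<configuration>
  <property><name>libselenium.page.load.delay</name><value>3</value></property>
</configuration>"""

SITE_XML = """<configuration>
  <property><name>libselenium.page.load.delay</name><value>10</value></property>
</configuration>"""

config = effective_config(DEFAULT_XML, SITE_XML)
# Fall back to an assumed default when the property is absent from both files.
delay_seconds = int(config.get("libselenium.page.load.delay", "3"))
print(delay_seconds)
```

Reading every tunable through one lookup like this is what lets a hard-coded value become configurable without touching the plugin code again.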





[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit

2016-03-19 Thread Karanjeet Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15199116#comment-15199116
 ] 

Karanjeet Singh commented on NUTCH-2191:


[~markus17]
Although I started working on this, there is still a lot to cover and test. 

This has been a busy week for me. I will try to work on this over the weekend. 
Sorry for the delay.

> Add protocol-htmlunit
> -
>
> Key: NUTCH-2191
> URL: https://issues.apache.org/jira/browse/NUTCH-2191
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Chris A. Mattmann
> Fix For: 1.12
>
> Attachments: NUTCH-2191.patch, NUTCH-2191.patch, NUTCH-2191.patch
>
>
> HtmlUnit is, as opposed to other JavaScript-enabled headless browsers, a 
> portable library and should therefore be better suited to very large-scale 
> crawls. This issue is an attempt to implement protocol-htmlunit.





[jira] [Commented] (NUTCH-1492) Support gora-dynamodb in Nutch 2.x

2016-03-19 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15198869#comment-15198869
 ] 

Lewis John McGibbney commented on NUTCH-1492:
-

[~renato2099] what about this shit?

> Support gora-dynamodb in Nutch 2.x
> --
>
> Key: NUTCH-1492
> URL: https://issues.apache.org/jira/browse/NUTCH-1492
> Project: Nutch
>  Issue Type: New Feature
>  Components: storage
>Affects Versions: 2.2
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 2.4
>
>
> We recently committed GORA-103. With the introduction of this module, it is 
> essential that it is thoroughly tested over at Nutch HQ. The primary purpose 
> of this issue is to provide all GORA configuration and ivy/ivy.xml 
> dependencies, however it should also act as a parent issue for any immediate 
> problem encountered in making GORA-103 functionality available through Nutch 
> 2.x  





[jira] [Created] (NUTCH-2240) ava.lang.NoSuchFieldError: INSTANCE selenium nutch

2016-03-19 Thread lq (JIRA)
lq created NUTCH-2240:
-

 Summary: ava.lang.NoSuchFieldError: INSTANCE   selenium nutch 
 Key: NUTCH-2240
 URL: https://issues.apache.org/jira/browse/NUTCH-2240
 Project: Nutch
  Issue Type: Bug
Reporter: lq


java.lang.NoSuchFieldError: INSTANCE
    at org.apache.http.conn.ssl.SSLConnectionSocketFactory.<clinit>(SSLConnectionSocketFactory.java:144)
    at com.gargoylesoftware.htmlunit.HttpWebConnection.configureHttpsScheme(HttpWebConnection.java:597)
    at com.gargoylesoftware.htmlunit.HttpWebConnection.createHttpClient(HttpWebConnection.java:532)
    at com.gargoylesoftware.htmlunit.HttpWebConnection.getHttpClientBuilder(HttpWebConnection.java:494)
    at com.gargoylesoftware.htmlunit.HttpWebConnection.getResponse(HttpWebConnection.java:158)
    at org.apache.nutch.protocol.htmlunit.RegexHttpWebConnection.getResponse(RegexHttpWebConnection.java:63)
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseFromWebConnection(WebClient.java:1321)
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponse(WebClient.java:1238)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:346)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:432)
    at org.apache.nutch.protocol.htmlunit.HttpWebClient.getPage(HttpWebClient.java:58)
    at org.apache.nutch.protocol.htmlunit.HttpWebClient.getHtmlPage(HttpWebClient.java:67)
    at org.apache.nutch.protocol.s2jh.HttpResponse.readPlainContentByHtmlunit(HttpResponse.java:345)
    at org.apache.nutch.protocol.s2jh.HttpResponse.<init>(HttpResponse.java:222)
    at org.apache.nutch.protocol.s2jh.Http.getResponse(Http.java:79)
    at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:245)
    at org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:530)
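
A NoSuchFieldError raised while a class is initializing often means two incompatible versions of the same library are on the classpath. That is only a common cause and is not confirmed for this report. As a first check under that assumption, the Python sketch below lists artifacts that appear with more than one version. The function names and the example jar names are hypothetical.

```python
import re
from collections import defaultdict
from pathlib import Path

# Matches names such as "httpclient-4.3.1.jar": an artifact name,
# a hyphen, then a version that starts with a digit.
JAR_PATTERN = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def find_version_conflicts(jar_names):
    """Return {artifact: [versions]} for artifacts seen with several versions."""
    versions = defaultdict(set)
    for name in jar_names:
        match = JAR_PATTERN.match(name)
        if match:
            versions[match.group("artifact")].add(match.group("version"))
    return {artifact: sorted(found)
            for artifact, found in versions.items() if len(found) > 1}


def scan_directory(root):
    """Collect jar file names below root and report version conflicts."""
    return find_version_conflicts(p.name for p in Path(root).rglob("*.jar"))


# Hypothetical jar names, used only to show the shape of the result.
example = find_version_conflicts([
    "httpclient-4.2.5.jar",
    "httpclient-4.3.1.jar",
    "httpcore-4.3.jar",
])
print(example)
```

Pointing scan_directory at the runtime's lib and plugins directories would show whether any library is present in more than one version. An empty result does not rule out a conflict coming from elsewhere on the classpath.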





[jira] [Comment Edited] (NUTCH-2138) Tika cannot OCR embedded images from PDF

2016-03-19 Thread eldk (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200034#comment-15200034
 ] 

eldk edited comment on NUTCH-2138 at 3/17/16 6:40 PM:
--

2016-03-17 18:44:29,656 INFO  parse.ParserFactory - The parsing plugins: 
[org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes 
system property, and all claim to support the content type application/pdf, but 
they are not mapped to it  in the parse-plugins.xml file

DEBUG tika.TikaParser - Using Tika parser org.apache.tika.parser.pdf.PDFParser 
for mime-type application/pdf


was (Author: eldk):
2016-03-17 18:44:29,656 INFO  parse.ParserFactory - The parsing plugins: 
[org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes 
system property, and all claim to support the content type application/pdf, but 
they are not mapped to it  in the parse-plugins.xml file

> Tika cannot OCR embedded images from PDF
> 
>
> Key: NUTCH-2138
> URL: https://issues.apache.org/jira/browse/NUTCH-2138
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.10
> Environment: Nutch v1.10
> openjdk version "1.8.0_60-internal"
> Debian 7.8
> Tika 1.8 or Tika 1.10
>Reporter: jean blue
>
> Tika 1.10 is able to OCR embedded images if PDFParser.properties is modified 
> accordingly in tika-app-1.10.jar but parse-tika doesn't if same modifications 
> are made in runtime/local/plugins/parse-tika/tika-parsers-1.10.jar





1.11 branch/tag

2016-03-19 Thread Markus Jelsma
Hi guys - 1.11 is missing in Git, or I am stupid :)

https://github.com/apache/nutch
https://git-wip-us.apache.org/repos/asf?p=nutch.git;a=summary

Did I miss something?
Markus


[jira] [Comment Edited] (NUTCH-2138) Tika cannot OCR embedded images from PDF

2016-03-19 Thread eldk (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200034#comment-15200034
 ] 

eldk edited comment on NUTCH-2138 at 3/18/16 4:12 PM:
--

2016-03-17 18:44:29,656 INFO  parse.ParserFactory - The parsing plugins: 
[org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes 
system property, and all claim to support the content type application/pdf, but 
they are not mapped to it  in the parse-plugins.xml file

DEBUG tika.TikaParser - Using Tika parser org.apache.tika.parser.pdf.PDFParser 
for mime-type application/pdf

https://issues.apache.org/jira/browse/TIKA-93


was (Author: eldk):
2016-03-17 18:44:29,656 INFO  parse.ParserFactory - The parsing plugins: 
[org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes 
system property, and all claim to support the content type application/pdf, but 
they are not mapped to it  in the parse-plugins.xml file

DEBUG tika.TikaParser - Using Tika parser org.apache.tika.parser.pdf.PDFParser 
for mime-type application/pdf

> Tika cannot OCR embedded images from PDF
> 
>
> Key: NUTCH-2138
> URL: https://issues.apache.org/jira/browse/NUTCH-2138
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.10
> Environment: Nutch v1.10
> openjdk version "1.8.0_60-internal"
> Debian 7.8
> Tika 1.8 or Tika 1.10
>Reporter: jean blue
>
> Tika 1.10 is able to OCR embedded images if PDFParser.properties is modified 
> accordingly in tika-app-1.10.jar but parse-tika doesn't if same modifications 
> are made in runtime/local/plugins/parse-tika/tika-parsers-1.10.jar





Re: 1.11 branch/tag

2016-03-19 Thread Mattmann, Chris A (3980)
try: release-1.11-rc2 :)


++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++





-Original Message-
From: Markus Jelsma 
Reply-To: "dev@nutch.apache.org" 
Date: Thursday, March 17, 2016 at 2:43 AM
To: "dev@nutch.apache.org" 
Subject: 1.11 branch/tag

>Hi guys - 1.11 is missing in Git, or I am stupid :)
>
>https://github.com/apache/nutch
>https://git-wip-us.apache.org/repos/asf?p=nutch.git;a=summary
>
>Did I miss something?
>Markus