[jira] [Comment Edited] (NUTCH-2191) Add protocol-htmlunit
[ https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15199116#comment-15199116 ] Karanjeet Singh edited comment on NUTCH-2191 at 3/17/16 7:58 AM: - [~markus17] Although I have started working on this but there is still a lot to cover and test. This has been a busy week for me. I will try to work on this over the weekend. Sorry for the delay. was (Author: karanjeets): [~markus17] Although I started working on this but there is still a lot to cover and test. This has been a busy week for me. I will try to work on this over the weekend. Sorry for the delay. > Add protocol-htmlunit > - > > Key: NUTCH-2191 > URL: https://issues.apache.org/jira/browse/NUTCH-2191 > Project: Nutch > Issue Type: New Feature > Components: protocol >Affects Versions: 1.11 >Reporter: Markus Jelsma >Assignee: Chris A. Mattmann > Fix For: 1.12 > > Attachments: NUTCH-2191.patch, NUTCH-2191.patch, NUTCH-2191.patch > > > HtmlUnit is, opposed to other Javascript enabled headless browsers, a > portable library and should therefore be better suited for very large scale > crawls. This issue is an attempt to implement protocol-htmlunit. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] nutch pull request: Add the boilerpipe parsing adapted from NUTCH-...
Github user asfgit closed the pull request at: https://github.com/apache/nutch/pull/92 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203025#comment-15203025 ]

ASF GitHub Bot commented on NUTCH-961:
--------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/nutch/pull/92

> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.11
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.12
>
>         Attachments: BoilerpipeExtractorRepository.java,
> NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch,
> NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch,
> NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch,
> NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch,
> NUTCH-961.patch, NUTCH-961.patch, NUTCH-961v2.patch,
> nutch-2.x-boilerpipe.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to
> extract boilerplate content from HTML pages. We should see how we can expose
> Boilerpipe in the Nutch configuration.
> Use the following properties to enable and control Boilerpipe.
> {code}
> <property>
>   <name>tika.extractor</name>
>   <value>none</value>
>   <description>
>   Which text extraction algorithm to use. Valid values are: boilerpipe or none.
>   </description>
> </property>
> <property>
>   <name>tika.extractor.boilerpipe.algorithm</name>
>   <value>ArticleExtractor</value>
>   <description>
>   Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor, ArticleExtractor
>   or CanolaExtractor.
>   </description>
> </property>
> {code}
[jira] [Resolved] (NUTCH-2241) Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration
[ https://issues.apache.org/jira/browse/NUTCH-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann resolved NUTCH-2241.
--------------------------------------
    Resolution: Fixed

Merged, thanks [~karanjeets]!

{noformat}
[chipotle:~/tmp/nutch1.12] mattmann% git pull https://github.com/karanjeets/nutch/ NUTCH-2241
remote: Counting objects: 18, done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 18 (delta 1), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (18/18), done.
From https://github.com/karanjeets/nutch
 * branch            NUTCH-2241 -> FETCH_HEAD
Updating a3e7420..a9b2491
Fast-forward
 CHANGES.txt | 2 ++
 conf/nutch-default.xml | 50 ++
 src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java | 52
 3 files changed, 88 insertions(+), 16 deletions(-)
[chipotle:~/tmp/nutch1.12] mattmann% git branch
  2.x
  NUTCH-2213
* master
  merge-branch
[chipotle:~/tmp/nutch1.12] mattmann% git push -u origin master
Counting objects: 96, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (11/11), done.
Writing objects: 100% (18/18), 2.53 KiB | 0 bytes/s, done.
Total 18 (delta 9), reused 0 (delta 0)
remote: nutch git commit: fix for NUTCH-2241 contributed by karanjeets
remote: nutch git commit: fix for NUTCH-2241 contributed by karanjeets
To https://git-wip-us.apache.org/repos/asf/nutch.git
   a3e7420..a9b2491  master -> master
Branch master set up to track remote branch master from origin.
[chipotle:~/tmp/nutch1.12] mattmann%
{noformat}

> Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration
> ------------------------------------------------------------------------
>
>                 Key: NUTCH-2241
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2241
>             Project: Nutch
>          Issue Type: Bug
>          Components: plugin, protocol
>    Affects Versions: 1.12
>         Environment: Fixed for Firefox browser with version 25 and above.
>            Reporter: Karanjeet Singh
>            Assignee: Chris A. Mattmann
>              Labels: firefox, interactiveselenium, lib-selenium, memex,
> nutch, nutch-default.xml, plugin, protocol, selenium
>             Fix For: 1.12
>
>
> Issues:
> (a) Firefox browser doesn't close gracefully.
> (b) The property libselenium.page.load.delay is not working. No matter how
> much delay you give, the driver is not waiting for the page to load.
> (c) There is no timeout configured for the firefox binary.
> (d) A lot of selenium configuration is hard-coded which can be exposed
> through nutch-default.xml or nutch-site.xml
> All these issues are part of "lib-selenium" plugin which is being used by two
> other protocols "protocol-selenium" and "protocol-interactiveselenium".
[jira] [Commented] (NUTCH-2241) Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration
[ https://issues.apache.org/jira/browse/NUTCH-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203022#comment-15203022 ] ASF GitHub Bot commented on NUTCH-2241: --- Github user asfgit closed the pull request at: https://github.com/apache/nutch/pull/98 > Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration > > > Key: NUTCH-2241 > URL: https://issues.apache.org/jira/browse/NUTCH-2241 > Project: Nutch > Issue Type: Bug > Components: plugin, protocol >Affects Versions: 1.12 > Environment: Fixed for Firefox browser with version 25 and above. >Reporter: Karanjeet Singh >Assignee: Chris A. Mattmann > Labels: firefox, interactiveselenium, lib-selenium, memex, > nutch, nutch-default.xml, plugin, protocol, selenium > Fix For: 1.12 > > > Issues: > (a) Firefox browser doesn't close gracefully. > (b) The property libselenium.page.load.delay is not working. No matter how > much delay you give, the driver is not waiting for the page to load. > (c) There is no timeout configured for the firefox binary. > (d) A lot of selenium configuration is hard-coded which can be exposed > through nutch-default.xml or nutch-site.xml > All these issues are part of "lib-selenium" plugin which is being used by two > other protocols "protocol-selenium" and "protocol-interactiveselenium". -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] nutch pull request: fix for NUTCH-2241 contributed by karanjeets
Github user asfgit closed the pull request at: https://github.com/apache/nutch/pull/98
[jira] [Work started] (NUTCH-2241) Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration
[ https://issues.apache.org/jira/browse/NUTCH-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-2241 started by Chris A. Mattmann. > Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration > > > Key: NUTCH-2241 > URL: https://issues.apache.org/jira/browse/NUTCH-2241 > Project: Nutch > Issue Type: Bug > Components: plugin, protocol >Affects Versions: 1.12 > Environment: Fixed for Firefox browser with version 25 and above. >Reporter: Karanjeet Singh >Assignee: Chris A. Mattmann > Labels: firefox, interactiveselenium, lib-selenium, memex, > nutch, nutch-default.xml, plugin, protocol, selenium > Fix For: 1.12 > > > Issues: > (a) Firefox browser doesn't close gracefully. > (b) The property libselenium.page.load.delay is not working. No matter how > much delay you give, the driver is not waiting for the page to load. > (c) There is no timeout configured for the firefox binary. > (d) A lot of selenium configuration is hard-coded which can be exposed > through nutch-default.xml or nutch-site.xml > All these issues are part of "lib-selenium" plugin which is being used by two > other protocols "protocol-selenium" and "protocol-interactiveselenium". -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2241) Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration
[ https://issues.apache.org/jira/browse/NUTCH-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-2241: - Labels: firefox interactiveselenium lib-selenium memex nutch nutch-default.xml plugin protocol selenium (was: firefox interactiveselenium lib-selenium nutch nutch-default.xml plugin protocol selenium) > Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration > > > Key: NUTCH-2241 > URL: https://issues.apache.org/jira/browse/NUTCH-2241 > Project: Nutch > Issue Type: Bug > Components: plugin, protocol >Affects Versions: 1.12 > Environment: Fixed for Firefox browser with version 25 and above. >Reporter: Karanjeet Singh > Labels: firefox, interactiveselenium, lib-selenium, memex, > nutch, nutch-default.xml, plugin, protocol, selenium > Fix For: 1.12 > > > Issues: > (a) Firefox browser doesn't close gracefully. > (b) The property libselenium.page.load.delay is not working. No matter how > much delay you give, the driver is not waiting for the page to load. > (c) There is no timeout configured for the firefox binary. > (d) A lot of selenium configuration is hard-coded which can be exposed > through nutch-default.xml or nutch-site.xml > All these issues are part of "lib-selenium" plugin which is being used by two > other protocols "protocol-selenium" and "protocol-interactiveselenium". -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (NUTCH-2241) Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration
[ https://issues.apache.org/jira/browse/NUTCH-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann reassigned NUTCH-2241: Assignee: Chris A. Mattmann > Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration > > > Key: NUTCH-2241 > URL: https://issues.apache.org/jira/browse/NUTCH-2241 > Project: Nutch > Issue Type: Bug > Components: plugin, protocol >Affects Versions: 1.12 > Environment: Fixed for Firefox browser with version 25 and above. >Reporter: Karanjeet Singh >Assignee: Chris A. Mattmann > Labels: firefox, interactiveselenium, lib-selenium, memex, > nutch, nutch-default.xml, plugin, protocol, selenium > Fix For: 1.12 > > > Issues: > (a) Firefox browser doesn't close gracefully. > (b) The property libselenium.page.load.delay is not working. No matter how > much delay you give, the driver is not waiting for the page to load. > (c) There is no timeout configured for the firefox binary. > (d) A lot of selenium configuration is hard-coded which can be exposed > through nutch-default.xml or nutch-site.xml > All these issues are part of "lib-selenium" plugin which is being used by two > other protocols "protocol-selenium" and "protocol-interactiveselenium". -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] nutch pull request: fix for NUTCH-2241 contributed by karanjeets
GitHub user karanjeets opened a pull request:

    https://github.com/apache/nutch/pull/98

    fix for NUTCH-2241 contributed by karanjeets

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/karanjeets/nutch NUTCH-2241

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nutch/pull/98.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #98

commit 230693d6dc648f587e88e59817eea934166c9247
Author: Karanjeet Singh
Date:   2016-03-19T23:55:40Z

    fix for NUTCH-2241 contributed by karanjeets
[jira] [Commented] (NUTCH-2241) Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration
[ https://issues.apache.org/jira/browse/NUTCH-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203017#comment-15203017 ] ASF GitHub Bot commented on NUTCH-2241: --- GitHub user karanjeets opened a pull request: https://github.com/apache/nutch/pull/98 fix for NUTCH-2241 contributed by karanjeets You can merge this pull request into a Git repository by running: $ git pull https://github.com/karanjeets/nutch NUTCH-2241 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/nutch/pull/98.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #98 commit 230693d6dc648f587e88e59817eea934166c9247 Author: Karanjeet Singh Date: 2016-03-19T23:55:40Z fix for NUTCH-2241 contributed by karanjeets > Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration > > > Key: NUTCH-2241 > URL: https://issues.apache.org/jira/browse/NUTCH-2241 > Project: Nutch > Issue Type: Bug > Components: plugin, protocol >Affects Versions: 1.12 > Environment: Fixed for Firefox browser with version 25 and above. >Reporter: Karanjeet Singh > Labels: firefox, interactiveselenium, lib-selenium, nutch, > nutch-default.xml, plugin, protocol, selenium > Fix For: 1.12 > > > Issues: > (a) Firefox browser doesn't close gracefully. > (b) The property libselenium.page.load.delay is not working. No matter how > much delay you give, the driver is not waiting for the page to load. > (c) There is no timeout configured for the firefox binary. > (d) A lot of selenium configuration is hard-coded which can be exposed > through nutch-default.xml or nutch-site.xml > All these issues are part of "lib-selenium" plugin which is being used by two > other protocols "protocol-selenium" and "protocol-interactiveselenium". -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (NUTCH-2241) Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration
Karanjeet Singh created NUTCH-2241: -- Summary: Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration Key: NUTCH-2241 URL: https://issues.apache.org/jira/browse/NUTCH-2241 Project: Nutch Issue Type: Bug Components: plugin, protocol Affects Versions: 1.12 Environment: Fixed for Firefox browser with version 25 and above. Reporter: Karanjeet Singh Fix For: 1.12 Issues: (a) Firefox browser doesn't close gracefully. (b) The property libselenium.page.load.delay is not working. No matter how much delay you give, the driver is not waiting for the page to load. (c) There is no timeout configured for the firefox binary. (d) A lot of selenium configuration is hard-coded which can be exposed through nutch-default.xml or nutch-site.xml All these issues are part of "lib-selenium" plugin which is being used by two other protocols "protocol-selenium" and "protocol-interactiveselenium". -- This message was sent by Atlassian JIRA (v6.3.4#6332)
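Points (a), (b) and (d) of the report above can be illustrated with a small, self-contained Java sketch. It is not the lib-selenium code and does not use the Selenium API: `Browser` and `StubBrowser` are hypothetical stand-ins, `java.util.Properties` stands in for the Nutch configuration, and the default of 3 seconds is an assumption. Only the property name `libselenium.page.load.delay` comes from the issue text.

```java
import java.util.Properties;

class SeleniumSettings {
    // Property name taken from the issue description.
    static final String PAGE_LOAD_DELAY = "libselenium.page.load.delay";
    // Assumed fallback value, used when the property is missing or invalid.
    static final int DEFAULT_DELAY_SECONDS = 3;

    /** Reads the delay from configuration instead of hard-coding it (point d). */
    static int pageLoadDelaySeconds(Properties conf) {
        String raw = conf.getProperty(PAGE_LOAD_DELAY);
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_DELAY_SECONDS;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            return value < 0 ? DEFAULT_DELAY_SECONDS : value;
        } catch (NumberFormatException e) {
            return DEFAULT_DELAY_SECONDS;
        }
    }

    /** Hypothetical stand-in for a browser driver; not the Selenium API. */
    interface Browser extends AutoCloseable {
        void open(String url);
        String pageSource();
        @Override
        void close();
    }

    /** Opens the page, waits for the configured delay (point b), always closes (point a). */
    static String fetch(Browser browser, String url, Properties conf) {
        try (Browser b = browser) {
            b.open(url);
            try {
                Thread.sleep(1000L * pageLoadDelaySeconds(conf));
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and continue with what has loaded so far.
                Thread.currentThread().interrupt();
            }
            return b.pageSource();
        }
    }

    /** In-memory stub so the sketch runs without a real browser. */
    static class StubBrowser implements Browser {
        boolean closed;
        private String url = "";

        @Override
        public void open(String url) { this.url = url; }

        @Override
        public String pageSource() { return "<html><!-- " + url + " --></html>"; }

        @Override
        public void close() { closed = true; }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(PAGE_LOAD_DELAY, "0");
        StubBrowser browser = new StubBrowser();
        System.out.println(fetch(browser, "http://example.org/", conf));
        System.out.println("closed=" + browser.closed);
    }
}
```

The try-with-resources block is one way to make sure the browser is released even when loading fails. A fixed sleep mirrors what the property name suggests, but whether the actual fix waits this way is not stated in the issue.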
[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit
[ https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15199116#comment-15199116 ] Karanjeet Singh commented on NUTCH-2191: [~markus17] Although I started working on this but there is still a lot to cover and test. This has been a busy week for me. I will try to work on this over the weekend. Sorry for the delay. > Add protocol-htmlunit > - > > Key: NUTCH-2191 > URL: https://issues.apache.org/jira/browse/NUTCH-2191 > Project: Nutch > Issue Type: New Feature > Components: protocol >Affects Versions: 1.11 >Reporter: Markus Jelsma >Assignee: Chris A. Mattmann > Fix For: 1.12 > > Attachments: NUTCH-2191.patch, NUTCH-2191.patch, NUTCH-2191.patch > > > HtmlUnit is, opposed to other Javascript enabled headless browsers, a > portable library and should therefore be better suited for very large scale > crawls. This issue is an attempt to implement protocol-htmlunit. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1492) Support gora-dynamodb in Nutch 2.x
[ https://issues.apache.org/jira/browse/NUTCH-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15198869#comment-15198869 ] Lewis John McGibbney commented on NUTCH-1492: - [~renato2099] what about this shit? > Support gora-dynamodb in Nutch 2.x > -- > > Key: NUTCH-1492 > URL: https://issues.apache.org/jira/browse/NUTCH-1492 > Project: Nutch > Issue Type: New Feature > Components: storage >Affects Versions: 2.2 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney > Fix For: 2.4 > > > We recently committed GORA-103. With the introduction of this module, it is > essential that it is thoroughly tested over at Nutch HQ. The primary purpose > of this issue is to provide all GORA configuration and ivy/ivy.xml > dependencies, however it should also act as a parent issue for any immediate > problem encountered in making GORA-103 functionality available through Nutch > 2.x -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (NUTCH-2240) ava.lang.NoSuchFieldError: INSTANCE selenium nutch
lq created NUTCH-2240:
-------------------------

             Summary: ava.lang.NoSuchFieldError: INSTANCE selenium nutch
                 Key: NUTCH-2240
                 URL: https://issues.apache.org/jira/browse/NUTCH-2240
             Project: Nutch
          Issue Type: Bug
            Reporter: lq

java.lang.NoSuchFieldError: INSTANCE
	at org.apache.http.conn.ssl.SSLConnectionSocketFactory.(SSLConnectionSocketFactory.java:144)
	at com.gargoylesoftware.htmlunit.HttpWebConnection.configureHttpsScheme(HttpWebConnection.java:597)
	at com.gargoylesoftware.htmlunit.HttpWebConnection.createHttpClient(HttpWebConnection.java:532)
	at com.gargoylesoftware.htmlunit.HttpWebConnection.getHttpClientBuilder(HttpWebConnection.java:494)
	at com.gargoylesoftware.htmlunit.HttpWebConnection.getResponse(HttpWebConnection.java:158)
	at org.apache.nutch.protocol.htmlunit.RegexHttpWebConnection.getResponse(RegexHttpWebConnection.java:63)
	at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseFromWebConnection(WebClient.java:1321)
	at com.gargoylesoftware.htmlunit.WebClient.loadWebResponse(WebClient.java:1238)
	at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:346)
	at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:432)
	at org.apache.nutch.protocol.htmlunit.HttpWebClient.getPage(HttpWebClient.java:58)
	at org.apache.nutch.protocol.htmlunit.HttpWebClient.getHtmlPage(HttpWebClient.java:67)
	at org.apache.nutch.protocol.s2jh.HttpResponse.readPlainContentByHtmlunit(HttpResponse.java:345)
	at org.apache.nutch.protocol.s2jh.HttpResponse.(HttpResponse.java:222)
	at org.apache.nutch.protocol.s2jh.Http.getResponse(Http.java:79)
	at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:245)
	at org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:530)
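A `NoSuchFieldError` like the one in this report often means that a class was compiled against one version of a library but a different version was loaded at run time. The report does not confirm this, so treat it as an assumption. A minimal Java sketch for checking which jar a class is actually loaded from is given below; `WhichJar` and `locate` are hypothetical names, and the two default class names are simply taken from the stack trace.

```java
import java.security.CodeSource;

class WhichJar {
    /** Returns the location a class is loaded from, without initializing the class. */
    static String locate(String className) {
        try {
            // initialize=false avoids running static initializers that might themselves fail.
            Class<?> c = Class.forName(className, false, WhichJar.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            return src == null
                ? "(JDK/bootstrap class path)"
                : String.valueOf(src.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not loadable: " + e + ")";
        }
    }

    public static void main(String[] args) {
        // Defaults are class names that appear in the stack trace above.
        String[] names = args.length > 0 ? args : new String[] {
            "org.apache.http.conn.ssl.SSLConnectionSocketFactory",
            "com.gargoylesoftware.htmlunit.HttpWebConnection"
        };
        for (String name : names) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```

Run with the same classpath as the failing job, this prints one location per class. If the printed jar versions differ from the ones the plugin was built against, that would support the version-conflict explanation.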
[jira] [Comment Edited] (NUTCH-2138) Tika cannot OCR embedded images from PDF
[ https://issues.apache.org/jira/browse/NUTCH-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200034#comment-15200034 ] eldk edited comment on NUTCH-2138 at 3/17/16 6:40 PM: -- 2016-03-17 18:44:29,656 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/pdf, but they are not mapped to it in the parse-plugins.xml file DEBUG tika.TikaParser - Using Tika parser org.apache.tika.parser.pdf.PDFParser for mime-type application/pdf was (Author: eldk): 2016-03-17 18:44:29,656 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/pdf, but they are not mapped to it in the parse-plugins.xml file > Tika cannot OCR embedded images from PDF > > > Key: NUTCH-2138 > URL: https://issues.apache.org/jira/browse/NUTCH-2138 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.10 > Environment: Nutch v1.10 > openjdk version "1.8.0_60-internal" > Debian 7.8 > Tika 1.8 or Tika 1.10 >Reporter: jean blue > > Tika 1.10 is able to OCR embedded images if PDFParser.properties is modified > accordingly in tika-app-1.10.jar but parse-tika doesn't if same modifications > are made in runtime/local/plugins/parse-tika/tika-parsers-1.10.jar -- This message was sent by Atlassian JIRA (v6.3.4#6332)
1.11 branch/tag
Hi guys - 1.11 is missing in Git, or I am stupid :)

https://github.com/apache/nutch
https://git-wip-us.apache.org/repos/asf?p=nutch.git;a=summary

Did I miss something?
Markus
[jira] [Comment Edited] (NUTCH-2138) Tika cannot OCR embedded images from PDF
[ https://issues.apache.org/jira/browse/NUTCH-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200034#comment-15200034 ] eldk edited comment on NUTCH-2138 at 3/18/16 4:12 PM: -- 2016-03-17 18:44:29,656 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/pdf, but they are not mapped to it in the parse-plugins.xml file DEBUG tika.TikaParser - Using Tika parser org.apache.tika.parser.pdf.PDFParser for mime-type application/pdf https://issues.apache.org/jira/browse/TIKA-93 was (Author: eldk): 2016-03-17 18:44:29,656 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/pdf, but they are not mapped to it in the parse-plugins.xml file DEBUG tika.TikaParser - Using Tika parser org.apache.tika.parser.pdf.PDFParser for mime-type application/pdf > Tika cannot OCR embedded images from PDF > > > Key: NUTCH-2138 > URL: https://issues.apache.org/jira/browse/NUTCH-2138 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.10 > Environment: Nutch v1.10 > openjdk version "1.8.0_60-internal" > Debian 7.8 > Tika 1.8 or Tika 1.10 >Reporter: jean blue > > Tika 1.10 is able to OCR embedded images if PDFParser.properties is modified > accordingly in tika-app-1.10.jar but parse-tika doesn't if same modifications > are made in runtime/local/plugins/parse-tika/tika-parsers-1.10.jar -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: 1.11 branch/tag
try: release-1.11-rc2 :)

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++

-Original Message-
From: Markus Jelsma
Reply-To: "dev@nutch.apache.org"
Date: Thursday, March 17, 2016 at 2:43 AM
To: "dev@nutch.apache.org"
Subject: 1.11 branch/tag

>Hi guys - 1.11 is missing in Git, or I am stupid :)
>
>https://github.com/apache/nutch
>https://git-wip-us.apache.org/repos/asf?p=nutch.git;a=summary
>
>Did I miss something?
>Markus