[jira] [Commented] (NUTCH-1449) Optionally delete documents skipped by IndexingFilters
[ https://issues.apache.org/jira/browse/NUTCH-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083109#comment-15083109 ] Markus Jelsma commented on NUTCH-1449: -- We have had it running nicely for some years. I will commit this some time soon unless there are objections. > Optionally delete documents skipped by IndexingFilters > -- > > Key: NUTCH-1449 > URL: https://issues.apache.org/jira/browse/NUTCH-1449 > Project: Nutch > Issue Type: Improvement > Components: indexer >Affects Versions: 1.5.1 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Attachments: NUTCH-1449.patch, NUTCH-1449.patch, NUTCH-1449.patch > > > Add configuration option to delete documents instead of skipping them if the > indexing filters return null. This is useful to delete documents with new > business logic in the indexing filter chain. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
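The behavior being added is a small decision at the end of the filter chain; a minimal sketch of the idea (the switch and class names are illustrative, not the attached patch):

```java
public class SkipOrDeleteSketch {
    // filtered == null means the indexing filter chain skipped the document.
    // The boolean stands in for the proposed (hypothetically named)
    // configuration switch that turns a skip into a delete.
    public static String dispose(Object filtered, boolean deleteSkipped) {
        if (filtered != null) return "index";
        return deleteSkipped ? "delete" : "skip";
    }
}
```

With the option off, a null result from the filters is silently dropped; with it on, a delete is sent to the index instead, so documents newly rejected by business logic also disappear from the search index.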
[jira] [Commented] (NUTCH-2168) Parse-tika fails to retrieve parser
[ https://issues.apache.org/jira/browse/NUTCH-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083285#comment-15083285 ] Sebastian Nagel commented on NUTCH-2168: Hi [~kalanya], looks like the indexed raw content of the JPEGs is causing the invalid UTF-8 character. The index-html plugin tries to treat any raw content as readable content, converting it to a String based on the platform-dependent charset. What happens if the field content is specified as "binary" in schema.xml (cf. patch for NUTCH-2130)? Without the patch applied, non-HTML documents simply fail to parse and are never indexed. That's probably the reason why the fix for this issue causes the problem with index-html. I would suggest opening a separate issue to address the indexer-solr problem with raw content from index-html. > Parse-tika fails to retrieve parser > --- > > Key: NUTCH-2168 > URL: https://issues.apache.org/jira/browse/NUTCH-2168 > Project: Nutch > Issue Type: Bug > Components: parser >Affects Versions: 2.3.1 >Reporter: Sebastian Nagel > Fix For: 2.3.1 > > Attachments: NUTCH-2168.patch > > > The plugin parse-tika fails to parse most (all?) kinds of document types > (PDF, xlsx, ...) when run via ParserChecker or ParserJob: > {noformat} > 2015-11-12 19:14:30,903 INFO parse.ParserJob - Parsing > http://localhost/pdftest.pdf > 2015-11-12 19:14:30,905 INFO parse.ParserFactory - ... > 2015-11-12 19:14:30,907 ERROR tika.TikaParser - Can't retrieve Tika parser > for mime-type application/pdf > 2015-11-12 19:14:30,913 WARN parse.ParseUtil - Unable to successfully parse > content http://localhost/pdftest.pdf of type application/pdf > {noformat} > The same document is successfully parsed by TestPdfParser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
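The pitfall Sebastian describes can be reproduced in isolation. The demo below is illustrative, not index-html code: decoding arbitrary binary bytes as if they were text can yield the non-character U+FFFE that Solr's XML layer then rejects.

```java
import java.nio.charset.StandardCharsets;

public class CharsetPitfall {
    // Decodes raw bytes as if they were text and checks for U+FFFE,
    // the non-character reported in the hadoop.log stack trace.
    public static boolean containsNonCharacter(byte[] raw) {
        // A fixed 16-bit charset is used so the demo is deterministic;
        // with the platform default charset the result varies by system.
        String s = new String(raw, StandardCharsets.UTF_16LE);
        return s.indexOf('\uFFFE') >= 0;
    }
}
```

JPEG data contains essentially random byte pairs, so some of them will decode to exactly such characters, which is why treating image bytes as a text field fails only intermittently.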
[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit
[ https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083214#comment-15083214 ] Chris A. Mattmann commented on NUTCH-2191: -- Very nice, Markus! Beat me to implementing this one. My suggestions for future work here are to implement 10-15 Ajax pattern handlers like we started to do with selenium. > Add protocol-htmlunit > - > > Key: NUTCH-2191 > URL: https://issues.apache.org/jira/browse/NUTCH-2191 > Project: Nutch > Issue Type: New Feature > Components: protocol >Affects Versions: 1.11 >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.12 > > Attachments: NUTCH-2191.patch > > > HtmlUnit is, opposed to other Javascript enabled headless browsers, a > portable library and should therefore be better suited for very large scale > crawls. This issue is an attempt to implement protocol-htmlunit. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-1321) IDNNormalizer
[ https://issues.apache.org/jira/browse/NUTCH-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1321: - Patch Info: Patch Available > IDNNormalizer > - > > Key: NUTCH-1321 > URL: https://issues.apache.org/jira/browse/NUTCH-1321 > Project: Nutch > Issue Type: New Feature >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Attachments: idnNormalizer.patch > > > Right now, IDNs are indexed as ASCII. An IDNNormalizer is to be used with an > indexer so it will encode ASCII URLs to their proper Unicode equivalent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
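A hedged sketch of what such a normalizer could do at indexing time (illustrative, not the attached idnNormalizer.patch): decode the punycode (ASCII) host of a URL back to its Unicode form using the JDK's built-in `java.net.IDN`.

```java
import java.net.IDN;
import java.net.URL;
import java.util.regex.Pattern;

public class IdnNormalizerSketch {
    // Returns the URL with its host decoded from punycode to Unicode;
    // returns the input unchanged if it cannot be parsed.
    public static String normalize(String url) {
        try {
            URL u = new URL(url);
            String unicodeHost = IDN.toUnicode(u.getHost());
            return url.replaceFirst(Pattern.quote(u.getHost()), unicodeHost);
        } catch (Exception e) {
            return url;
        }
    }
}
```

For example, `normalize("http://xn--mnchen-3ya.de/page")` yields the Unicode form `http://münchen.de/page`, while URLs with plain ASCII hosts pass through untouched.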
[jira] [Created] (NUTCH-2196) IndexingFilterChecker to optionally normalize
Markus Jelsma created NUTCH-2196: Summary: IndexingFilterChecker to optionally normalize Key: NUTCH-2196 URL: https://issues.apache.org/jira/browse/NUTCH-2196 Project: Nutch Issue Type: Improvement Reporter: Markus Jelsma Fix For: 1.12 As mentioned in NUTCH-2194, we sometimes use it as a backend for a web application. If so, then end users are obviously going to input bad URLs, so having a normalizer running would improve user satisfaction. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit
[ https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083273#comment-15083273 ] Markus Jelsma commented on NUTCH-2191: -- Hey Chris! An Ajax pattern handler is new to me. Can you summarize it and/or point to one or two relevant issues? And please, suggest a solution for the plugin dependency hell.
[jira] [Created] (NUTCH-2195) IndexingFilterChecker to optionally follow N redirects
Markus Jelsma created NUTCH-2195: Summary: IndexingFilterChecker to optionally follow N redirects Key: NUTCH-2195 URL: https://issues.apache.org/jira/browse/NUTCH-2195 Project: Nutch Issue Type: Improvement Reporter: Markus Jelsma Fix For: 1.12 As mentioned in NUTCH-2194, we sometimes use it as a backend for a web application. If so, it should at least be able to follow N redirects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083092#comment-15083092 ] Markus Jelsma commented on NUTCH-2184: -- Hello Lewis! * It should be no problem. But since IndexerMapReduce is complicated, I would love to have a simple unit test for it so we can guard against breaking things. * We should make sure the fetchDatum also carries the desired parseData fields that we usually store in the CrawlDatum. This is not always true; see the fix I did for NUTCH-2093. I think it should be possible to index as many fields as with the CrawlDatum. If you implement it as such, then custom indexing filters that use CrawlDatum still work :) * Well yes, I think I already answered that question myself indeed, silly. This feature would be very handy for small segments but large CrawlDBs! > Enable IndexingJob to function with no crawldb > -- > > Key: NUTCH-2184 > URL: https://issues.apache.org/jira/browse/NUTCH-2184 > Project: Nutch > Issue Type: Improvement > Components: indexer >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney > Fix For: 1.12 > > Attachments: NUTCH-2184.patch > > > Sometimes when working with distributed team(s), we have found that we can > 'lose' data structures which are currently considered as critical e.g. > crawldb, linkdb and/or segments. > In my current scenario I have a requirement to index segment data with no > accompanying crawldb or linkdb. > Absence of the latter is OK as linkdb is optional however currently in > [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java] > crawldb is mandatory. > This ticket should enhance the IndexerMapReduce code to support the use case > where you ONLY have segments and want to force an index for every record > present. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2191) Add protocol-htmlunit
[ https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2191: - Patch Info: Patch Available
[jira] [Commented] (NUTCH-1838) Host and domain based regex and automaton filtering
[ https://issues.apache.org/jira/browse/NUTCH-1838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083101#comment-15083101 ] Markus Jelsma commented on NUTCH-1838: -- If there are no objections, I'll get this one in soon
> Host and domain based regex and automaton filtering
> ---
>
> Key: NUTCH-1838
> URL: https://issues.apache.org/jira/browse/NUTCH-1838
> Project: Nutch
> Issue Type: New Feature
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Attachments: NUTCH-1838.patch, NUTCH-1838.patch, NUTCH-1838.patch, NUTCH-1838.patch, NUTCH-1838.patch, NUTCH-1838.patch, NUTCH-1838.patch
>
> Both regex and automaton filters pass all URLs through all rules, although this makes little sense if you have a lot of generated rules for many different hosts or domains. This patch allows users to configure specific rules for a specific host or domain only, making filtering much more efficient.
> Each rule has an optional hostOrDomain field; the filter is applied for rules that have no hostOrDomain and for URLs that match the rule's host name and domain name.
> The following line enables hostOrDomain specific rules:
> {code}
> > www.example.org
> {code}
> The following line disables/resets it again:
> {code}
> <
> {code}
> full example:
> {code}
> -some generic filter
> +another generic filter
> > www.example.org
> -rule only applied to URLs of www.example.org
> +another rule only applied to URLs of www.example.org
> > apache.org
> -rule only applied to URLs of apache.org
> +another rule only applied to URLs of apache.org
> <
> -more generic rules
> +and another one
> {code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
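The scoped-rule format described above can be sketched with a toy matcher. Everything here is illustrative (plain substring rules instead of regex/automaton matching, hypothetical class name); only the ">host" / "<" scoping mirrors the description.

```java
import java.net.URL;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ScopedRuleSketch {
    // Parses rule lines into scopes: "" holds generic rules; a ">host" line
    // opens a host/domain scope and "<" closes it again.
    public static Map<String, List<String>> parse(List<String> lines) {
        Map<String, List<String>> rules = new HashMap<>();
        String scope = "";
        for (String line : lines) {
            if (line.startsWith(">")) scope = line.substring(1).trim();
            else if (line.startsWith("<")) scope = "";
            else rules.computeIfAbsent(scope, k -> new ArrayList<>()).add(line);
        }
        return rules;
    }

    // A URL is checked only against generic rules plus the rules scoped to
    // its host, instead of against every rule in the file.
    public static boolean rejected(Map<String, List<String>> rules, String url) {
        String host;
        try {
            host = new URL(url).getHost();
        } catch (Exception e) {
            return true; // unparsable URLs are filtered out
        }
        List<String> applicable = new ArrayList<>(rules.getOrDefault("", List.of()));
        applicable.addAll(rules.getOrDefault(host, List.of()));
        // "-substring" rejects; anything else is treated as a pass-through here.
        return applicable.stream().anyMatch(r -> r.startsWith("-") && url.contains(r.substring(1)));
    }
}
```

The efficiency gain in the patch comes from exactly this lookup shape: instead of running every URL through every rule, only the generic bucket plus the URL's own host/domain bucket is consulted.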
[jira] [Created] (NUTCH-2194) Run IndexingFilterChecker as simple Telnet server
Markus Jelsma created NUTCH-2194: Summary: Run IndexingFilterChecker as simple Telnet server Key: NUTCH-2194 URL: https://issues.apache.org/jira/browse/NUTCH-2194 Project: Nutch Issue Type: New Feature Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.12 We have used a customized IndexingFilterChecker running as a server to be able to quickly test/check pages from web applications. I'll add this feature back by letting IndexingFilterChecker run optionally as a simple server. Something like: bin/nutch indexchecker -listen -port The port param only makes sense if -listen is given. Remove -listen, or have the port as an argument to -listen? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
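A minimal shape for the proposed -listen mode might look like the following; this is purely a sketch (class and method names are hypothetical, and the real checker would run the indexing filter chain instead of the check stub):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

public class CheckerServerSketch {
    // Accepts one connection at a time, reads one URL per connection, and
    // replies with the result of check(). A plain telnet client then works.
    public static void listen(int port) throws IOException {
        try (ServerSocket server = new ServerSocket(port)) {
            while (true) {
                try (Socket client = server.accept();
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(client.getInputStream()));
                     PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
                    out.println(check(in.readLine()));
                }
            }
        }
    }

    // Stand-in for running the URL through the indexing filters.
    static String check(String url) {
        return url == null ? "no input" : "checked: " + url;
    }
}
```

A line-oriented protocol like this is what makes the feature convenient from a web application: the backend just opens a socket, writes a URL, and reads the filter output back.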
[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit
[ https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083104#comment-15083104 ] Markus Jelsma commented on NUTCH-2191: -- Does anyone have an idea on how to force the plugin to use its own included deps? I've spent a good deal of time on this crazy problem :)
[jira] [Commented] (NUTCH-1257) Support for the x-robots-tag HTTP Header
[ https://issues.apache.org/jira/browse/NUTCH-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083114#comment-15083114 ] Markus Jelsma commented on NUTCH-1257: -- Hmm, there is no patch, but I remember having had this support in our older customized Nutch builds. I'll see if I can find it again. > Support for the x-robots-tag HTTP Header > > > Key: NUTCH-1257 > URL: https://issues.apache.org/jira/browse/NUTCH-1257 > Project: Nutch > Issue Type: New Feature > Components: fetcher >Reporter: Mike >Assignee: Markus Jelsma > Labels: http, privacy, robots > Fix For: 2.4 > > > Google and Bing both currently support the x-robots-tag HTTP header. This is > important, because they have a policy of not *crawling* links that are in a > robots.txt file, and not *indexing* links that are set to noindex. In the > case that a page is indexed but not crawled, Google and Bing will show the > page in their results, but it will lack a snippet (since they didn't crawl > it, they can't generate one). > As a result, the only way to block Google and Bing from having a page in > their index is to use the robots meta tag in HTML pages and the x-robots-tag > in other mimetypes. > As a site owner that needs to keep specific pages private, I *cannot* trust > robots.txt to keep my pages out of Google and Bing, and I have to use the two > robots standards. Since Nutch doesn't support the HTTP header, I have to > block it from crawling ALL non-HTML pages on my site. > This is not an ideal state of affairs, and it would be great if Nutch > supported the x-robots-tag HTTP header. > I've done more research on this topic on my blog: > - > http://michaeljaylissner.com/blog/support-for-x-robots-tag-http-header-and-robots-HTML-meta-tag > - > http://michaeljaylissner.com/blog/respecting-privacy-while-providing-hundreds-of-thousands-of-public-documents -- This message was sent by Atlassian JIRA (v6.3.4#6332)
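Honoring the header ultimately comes down to parsing its comma-separated directive list; a hedged sketch (hypothetical helper, not part of Nutch in the version discussed here):

```java
import java.util.Arrays;
import java.util.List;

public class XRobotsSketch {
    // Returns true when an X-Robots-Tag value forbids indexing the page.
    public static boolean noindex(String headerValue) {
        if (headerValue == null) return false;
        List<String> directives =
                Arrays.asList(headerValue.toLowerCase().trim().split("\\s*,\\s*"));
        return directives.contains("noindex") || directives.contains("none");
    }
}
```

A fetcher-side check like this would let Nutch skip indexing any mimetype, which is exactly the gap the reporter describes for non-HTML documents that cannot carry a robots meta tag.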
[jira] [Updated] (NUTCH-2178) DeduplicationJob to optionally group on host or domain
[ https://issues.apache.org/jira/browse/NUTCH-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2178: - Patch Info: Patch Available > DeduplicationJob to optionally group on host or domain > - > > Key: NUTCH-2178 > URL: https://issues.apache.org/jira/browse/NUTCH-2178 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.10 >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.12 > > Attachments: NUTCH-2178.patch > > > Add optional grouping to DeduplicationJob. > Usage: DeduplicationJob [-group] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
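What a "-group" option might key on, sketched below; duplicates would then only be compared within the same group. The domain extraction (last two labels) is a crude illustration, not Nutch's actual domain logic, and all names are hypothetical.

```java
import java.net.URL;

public class DedupGroupSketch {
    // Duplicates are compared only within the same group key.
    public static String groupKey(String url, String mode) {
        String host;
        try {
            host = new URL(url).getHost();
        } catch (Exception e) {
            return "all";
        }
        switch (mode) {
            case "host":
                return host;
            case "domain": // crude: keep the last two dot-separated labels
                String[] labels = host.split("\\.");
                int n = labels.length;
                return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
            default:
                return "all"; // no grouping: one global comparison pool
        }
    }
}
```

In a MapReduce dedup job this key would become the reduce key, so identical content on different hosts (e.g. mirrors) would no longer be collapsed when host grouping is chosen.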
[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1932: - Patch Info: Patch Available > Automatically remove orphaned pages > --- > > Key: NUTCH-1932 > URL: https://issues.apache.org/jira/browse/NUTCH-1932 > Project: Nutch > Issue Type: New Feature >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Attachments: NUTCH-1932-add.patch, NUTCH-1932.patch, > NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, > NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, > NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch > > > Orphan scoring filter that determines whether a page has become orphaned, > i.e. no other pages link to it anymore. If a page hasn't been linked > to after markGoneAfter seconds, the page is marked as gone and is then > removed by an indexer. If a page hasn't been linked to after markOrphanAfter > seconds, the page is removed from the CrawlDB. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
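The two thresholds in the description reduce to a simple time comparison; a sketch under the assumption that markGoneAfter < markOrphanAfter (values in seconds, parameter names taken from the description, everything else illustrative):

```java
public class OrphanSketch {
    public enum Action { KEEP, MARK_GONE, REMOVE }

    // Decides a page's fate from how long ago it was last linked to.
    public static Action decide(long nowSec, long lastInlinkSec,
                                long markGoneAfter, long markOrphanAfter) {
        long unlinkedFor = nowSec - lastInlinkSec;
        if (unlinkedFor > markOrphanAfter) return Action.REMOVE;    // drop from CrawlDB
        if (unlinkedFor > markGoneAfter)   return Action.MARK_GONE; // indexer deletes it
        return Action.KEEP;
    }
}
```

The two-stage design gives the indexer a window to delete the document (gone) before the record itself disappears from the CrawlDB (orphan).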
[jira] [Commented] (NUTCH-1186) FreeGenerator always normalizes
[ https://issues.apache.org/jira/browse/NUTCH-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083138#comment-15083138 ] Lewis John McGibbney commented on NUTCH-1186: - Will scope and test [~markus17] > FreeGenerator always normalizes > --- > > Key: NUTCH-1186 > URL: https://issues.apache.org/jira/browse/NUTCH-1186 > Project: Nutch > Issue Type: Bug > Components: generator >Affects Versions: 1.3 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Attachments: NUTCH-1186-1.7-1.patch, NUTCH-1186-1.7-2.patch > > > The FreeGenerator does not honor the -normalize option, it always normalizes > all URL's in the input directory. The -filter option is respected. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit
[ https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083603#comment-15083603 ] Sebastian Nagel commented on NUTCH-2191: As [~haraldk] mentioned in [this discussion|https://mail-archives.apache.org/mod_mbox/nutch-user/201404.mbox/%3c53576563.1030...@raytion.com%3E] there is isolation only between plugins (i.e. between their class loaders), but not between a plugin class loader and its parent, which loads classes from $NUTCH_HOME/lib/.
[jira] [Commented] (NUTCH-2178) DeduplicationJob to optionally group on host or domain
[ https://issues.apache.org/jira/browse/NUTCH-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083096#comment-15083096 ] Markus Jelsma commented on NUTCH-2178: -- Will commit in a few if no further objections.
[jira] [Updated] (NUTCH-1449) Optionally delete documents skipped by IndexingFilters
[ https://issues.apache.org/jira/browse/NUTCH-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1449: - Patch Info: Patch Available
[jira] [Updated] (NUTCH-1186) FreeGenerator always normalizes
[ https://issues.apache.org/jira/browse/NUTCH-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1186: - Patch Info: Patch Available
[jira] [Commented] (NUTCH-2143) GeneratorJob ignores batch id passed as argument
[ https://issues.apache.org/jira/browse/NUTCH-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085043#comment-15085043 ] liuqibj commented on NUTCH-2143: I have a fix and can deliver it > GeneratorJob ignores batch id passed as argument > > > Key: NUTCH-2143 > URL: https://issues.apache.org/jira/browse/NUTCH-2143 > Project: Nutch > Issue Type: Bug > Components: generator >Affects Versions: 2.3.1 >Reporter: Sebastian Nagel >Assignee: Lewis John McGibbney >Priority: Blocker > Fix For: 2.3.1 > > > The batch id passed to GeneratorJob by option/argument -batchId is > ignored and a generated batch id is used to mark the current batch. Log > snippets from a run of bin/crawl: > {noformat} > bin/nutch generate ... -batchId 1444941073-14208 > ... > GeneratorJob: generated batch id: 1444941074-858443668 containing 1 URLs > Fetching : > bin/nutch fetch ... 1444941073-14208 ... > ... > QueueFeeder finished: total 0 records. Hit by time limit :0 > {noformat} > The generated URLs are marked with the wrong batch id: > {noformat} > hbase(main):010:0> scan 'test_webpage' > ROWCOLUMN+CELL > org.apache.nutch:http/column=f:bid, timestamp=1444941077080, > value=1444941074-858443668 > ... > org.apache.nutch:http/column=mk:_gnmrk_, timestamp=1444941077080, > value=1444941074-858443668 > {noformat} > and fetcher will not fetch anything. This problem was reported by Sherban > Drulea > [[1|https://www.mail-archive.com/user@nutch.apache.org/msg13894.html]], > [[2|https://www.mail-archive.com/user@nutch.apache.org/msg13912.html]]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
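The bug pattern boils down to always generating a fresh id instead of honoring the one passed in; a miniature illustration (not the actual GeneratorJob code, names are hypothetical):

```java
public class BatchIdSketch {
    // Fixed behavior: prefer the id supplied via -batchId; only generate
    // one when the caller did not pass any.
    public static String resolveBatchId(String fromArgs) {
        return (fromArgs != null && !fromArgs.isEmpty()) ? fromArgs : generate();
    }

    // Mimics the "<epoch-seconds>-<random>" shape seen in the log snippet.
    static String generate() {
        return (System.currentTimeMillis() / 1000) + "-" + (int) (Math.random() * 100000);
    }
}
```

When the generator marks rows with a freshly generated id while the fetcher queries for the id the user passed on the command line, the two never match, which is exactly why the QueueFeeder reports 0 records.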
[jira] [Commented] (NUTCH-2168) Parse-tika fails to retrieve parser
[ https://issues.apache.org/jira/browse/NUTCH-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082904#comment-15082904 ] Auro Miralles commented on NUTCH-2168: -- Hello. I have no idea which document fails... I can crawl without problems with the index-html plugin disabled, but Nutch fails at the third iteration when I enable the plugin. Only two URLs in my seed.txt, and ignore.external.links set to true:

http://ujiapps.uji.es/
https://wiki.apache.org/nutch/

:~/ /bin/crawl urls/ testCrawl http://localhost:8983/solr/ 3
Parsing https://wiki.apache.org/nutch/bin/nutch%20mergelinkdb
Parsing https://wiki.apache.org/nutch/GettingNutchRunningWithDebian
Parsing http://ujiapps.uji.es/serveis/scp/accp/carnetUJI/avantatges/formacio/index.jpg
Parsing http://ujiapps.uji.es/serveis/scp/accp/carnetUJI/avantatges/compres/index.jpg
Parsing http://ujiapps.uji.es/serveis/scp/accp/carnetUJI/avantatges/productesfinancers/index.jpg
ParserJob: success
ParserJob: finished at 2016-01-05 12:27:42, time elapsed: 00:00:14
CrawlDB update for testCrawl
/home/kalanya/apache-nutch-2.3.1/runtime/local/bin/nutch updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -all -crawlId testCrawl
DbUpdaterJob: starting at 2016-01-05 12:27:42
DbUpdaterJob: updatinging all
DbUpdaterJob: finished at 2016-01-05 12:27:49, time elapsed: 00:00:06
Indexing testCrawl on SOLR index -> http://localhost:8983/solr/
/home/kalanya/apache-nutch-2.3.1/runtime/local/bin/nutch index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D solr.server.url=http://localhost:8983/solr/ -all -crawlId testCrawl
IndexingJob: starting
SolrIndexerJob: java.lang.RuntimeException: job failed: name=[testCrawl]Indexer, jobid=job_local1207147570_0001
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:119)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)
Error running: /home/kalanya/apache-nutch-2.3.1/runtime/local/bin/nutch index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D solr.server.url=http://localhost:8983/solr/ -all -crawlId testCrawl
Failed with exit value 255.

HADOOP.LOG
2016-01-05 12:28:00,151 INFO html.HtmlIndexingFilter - Html indexing for: http://ujiapps.uji.es/serveis/alumnisauji/jasoc/alumnisaujipremium/instal_lacions/se/
2016-01-05 12:28:00,152 INFO html.HtmlIndexingFilter - Html indexing for: http://ujiapps.uji.es/com/investigacio/img/ciencia11.jpg
2016-01-05 12:28:00,163 INFO html.HtmlIndexingFilter - Html indexing for: http://ujiapps.uji.es/serveis/cd/bib/reservori/2015/e-llibres/
2016-01-05 12:28:00,164 INFO solr.SolrIndexWriter - Adding 250 documents
2016-01-05 12:28:00,531 INFO solr.SolrIndexWriter - Adding 250 documents
2016-01-05 12:28:00,842 WARN mapred.LocalJobRunner - job_local1207147570_0001
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xfffe at char #137317, byte #139263)
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xfffe at char #137317, byte #139263)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
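One common workaround for a U+FFFE failure like the one in the stack trace above (an assumption on my part, not something proposed in this thread) is to strip code points that Solr's XML layer rejects before adding documents:

```java
public class SolrSanitizeSketch {
    // Removes XML-illegal control characters and the non-characters
    // U+FFFE/U+FFFF from a field value before it is sent to Solr.
    public static String stripInvalid(String s) {
        StringBuilder out = new StringBuilder(s.length());
        s.codePoints()
         .filter(cp -> cp != 0xFFFE && cp != 0xFFFF
                 && (cp >= 0x20 || cp == '\n' || cp == '\r' || cp == '\t'))
         .forEach(out::appendCodePoint);
        return out.toString();
    }
}
```

This only masks the symptom, though; the cleaner fix discussed elsewhere in the thread is not to decode binary content into a text field in the first place.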
[jira] [Comment Edited] (NUTCH-2168) Parse-tika fails to retrieve parser
[ https://issues.apache.org/jira/browse/NUTCH-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082904#comment-15082904 ] Auro Miralles edited comment on NUTCH-2168 at 1/5/16 11:50 AM: --- Hello. I have no idea which document fails... I can crawl without problems with the index-html plugin disabled, but Nutch fails at the third iteration when I enable the plugin. EDIT: There is no error if I disable the tika plugin and keep the index-html plugin enabled. Only two URLs in my seed.txt, and ignore.external.links set to true: http://ujiapps.uji.es/ https://wiki.apache.org/nutch/
[jira] [Commented] (NUTCH-1946) Upgrade to Gora 0.6.1
[ https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082951#comment-15082951 ] ASF GitHub Bot commented on NUTCH-1946: --- Github user lewismc commented on the pull request: https://github.com/apache/gora/pull/21#issuecomment-168985170 @jeroenvlek can you please close off this pull request? We are in the process of addressing [GORA-443](https://issues.apache.org/jira/browse/GORA-443). Thank you > Upgrade to Gora 0.6.1 > - > > Key: NUTCH-1946 > URL: https://issues.apache.org/jira/browse/NUTCH-1946 > Project: Nutch > Issue Type: Improvement > Components: storage >Affects Versions: 2.3.1 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Critical > Fix For: 2.3.1 > > Attachments: NUTCH-1946.patch, NUTCH-1946_Gora_fixes.patch, > NUTCH-1946v2.patch, NUTCH-1946v3.patch, NUTCH-1946v4.patch > > > Apache Gora 0.6.1 was released recently. > We should upgrade before pushing Nutch 2.3.1 as it will come in very handy > for the new Docker containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1946) Upgrade to Gora 0.6.1
[ https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082966#comment-15082966 ] ASF GitHub Bot commented on NUTCH-1946: --- Github user jeroenvlek closed the pull request at: https://github.com/apache/gora/pull/21