[jira] [Commented] (NUTCH-1449) Optionally delete documents skipped by IndexingFilters

2016-01-05 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083109#comment-15083109
 ] 

Markus Jelsma commented on NUTCH-1449:
--

We have had this running nicely for some years. I will commit it soon unless 
there are objections.

> Optionally delete documents skipped by IndexingFilters
> --
>
> Key: NUTCH-1449
> URL: https://issues.apache.org/jira/browse/NUTCH-1449
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.5.1
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Attachments: NUTCH-1449.patch, NUTCH-1449.patch, NUTCH-1449.patch
>
>
> Add configuration option to delete documents instead of skipping them if the 
> indexing filters return null. This is useful to delete documents with new 
> business logic in the indexing filter chain.
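The behaviour described above can be sketched roughly as follows. This is an illustrative Python sketch, not Nutch's Java code: the names `run_filters`, `index_action`, `delete_skipped`, and `drop_drafts` are hypothetical, and the only thing taken from the issue is the rule that a null result from the filter chain leads to either a skip or a delete.

```python
from typing import Callable, Iterable, Optional

Doc = dict
Filter = Callable[[Doc], Optional[Doc]]


def run_filters(doc: Doc, filters: Iterable[Filter]) -> Optional[Doc]:
    """Pass the document through each filter; stop once a filter returns None."""
    current: Optional[Doc] = doc
    for f in filters:
        current = f(current)
        if current is None:
            return None
    return current


def index_action(doc: Doc, filters: Iterable[Filter], delete_skipped: bool) -> str:
    """Return 'add', 'delete', or 'skip' for one document.

    delete_skipped stands in for the proposed configuration option
    (hypothetical name, not the real property).
    """
    if run_filters(doc, filters) is not None:
        return "add"
    return "delete" if delete_skipped else "skip"


def drop_drafts(doc: Doc) -> Optional[Doc]:
    """Example filter carrying 'new business logic': reject draft pages."""
    return None if doc.get("draft") else doc
```

With the option off, a rejected document is only skipped and any copy already in the index stays there; with it on, the indexer would send a delete instead.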



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2168) Parse-tika fails to retrieve parser

2016-01-05 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083285#comment-15083285
 ] 

Sebastian Nagel commented on NUTCH-2168:


Hi [~kalanya], it looks like the indexed raw content of the JPEGs is causing the 
invalid UTF-8 character. The index-html plugin tries to treat any raw content 
as readable content, converting it to a String based on the platform-dependent 
charset. What happens if the field content is specified as "binary" in 
schema.xml (cf. patch for NUTCH-2130)?

Without the patch applied, non-HTML documents simply fail to parse and are 
never indexed. That's probably the reason why the fix for this issue causes the 
problem with index-html. I would suggest opening a separate issue to address 
the indexer-solr problem with raw content from index-html.
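To illustrate why raw image bytes are a problem when treated as text, here is a small Python sketch (not the index-html code). The byte sequence is the start of a typical JPEG/JFIF file; the helper name `is_valid_utf8` is hypothetical, and base64 is shown only as an assumption about how binary content could be carried safely in a text-based update request.

```python
import base64

# First bytes of a typical JPEG/JFIF file (SOI marker, APP0 segment header).
jpeg_head = bytes([0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10]) + b"JFIF\x00"


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A single-byte charset "decodes" anything, so no error is raised here,
# but the resulting string is meaningless as text.
as_text = jpeg_head.decode("latin-1")

# Base64 keeps the payload intact and is plain ASCII.
as_base64 = base64.b64encode(jpeg_head).decode("ascii")
```

The point of the sketch is that decoding with a permissive charset hides the problem until a stricter consumer sees the data.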

> Parse-tika fails to retrieve parser
> ---
>
> Key: NUTCH-2168
> URL: https://issues.apache.org/jira/browse/NUTCH-2168
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.3.1
>Reporter: Sebastian Nagel
> Fix For: 2.3.1
>
> Attachments: NUTCH-2168.patch
>
>
> The plugin parse-tika fails to parse most (all?) kinds of document types 
> (PDF, xlsx, ...) when run via ParserChecker or ParserJob:
> {noformat}
> 2015-11-12 19:14:30,903 INFO  parse.ParserJob - Parsing 
> http://localhost/pdftest.pdf
> 2015-11-12 19:14:30,905 INFO  parse.ParserFactory - ...
> 2015-11-12 19:14:30,907 ERROR tika.TikaParser - Can't retrieve Tika parser 
> for mime-type application/pdf
> 2015-11-12 19:14:30,913 WARN  parse.ParseUtil - Unable to successfully parse 
> content http://localhost/pdftest.pdf of type application/pdf
> {noformat}
> The same document is successfully parsed by TestPdfParser.





[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit

2016-01-05 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083214#comment-15083214
 ] 

Chris A. Mattmann commented on NUTCH-2191:
--

Very nice, Markus! Beat me to implementing this one.

My suggestions for future work here are to implement 10-15 Ajax pattern 
handlers like we started to do with selenium.

> Add protocol-htmlunit
> -
>
> Key: NUTCH-2191
> URL: https://issues.apache.org/jira/browse/NUTCH-2191
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-2191.patch
>
>
> HtmlUnit is, as opposed to other JavaScript-enabled headless browsers, a 
> portable library and should therefore be better suited for very large scale 
> crawls. This issue is an attempt to implement protocol-htmlunit.





[jira] [Updated] (NUTCH-1321) IDNNormalizer

2016-01-05 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1321:
-
Patch Info: Patch Available

> IDNNormalizer
> -
>
> Key: NUTCH-1321
> URL: https://issues.apache.org/jira/browse/NUTCH-1321
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Attachments: idnNormalizer.patch
>
>
> Right now, IDNs are indexed as ASCII. An IDNNormalizer is to be used with an 
> indexer so it will convert ASCII URLs to their proper Unicode equivalent.
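As a rough illustration of the conversion described above, the following Python sketch turns a Punycode host name back into Unicode. It is not the attached patch: `idn_to_unicode` is a hypothetical name, the sketch relies on Python's built-in `idna` codec, and it deliberately ignores user-info in the URL.

```python
from urllib.parse import urlsplit, urlunsplit


def idn_to_unicode(url: str) -> str:
    """Return the URL with an ASCII (Punycode) host converted to Unicode.

    Assumption: the URL has no user-info part; scheme, port, path, query
    and fragment are passed through unchanged.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    try:
        unicode_host = host.encode("ascii").decode("idna")
    except UnicodeError:
        # Host is not pure ASCII or not valid IDNA; leave the URL alone.
        return url
    netloc = unicode_host if parts.port is None else f"{unicode_host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

For example, `idn_to_unicode("http://xn--mnchen-3ya.de/path?q=1")` yields a URL whose host is `münchen.de`, while a plain ASCII host such as `example.org` comes back unchanged.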





[jira] [Created] (NUTCH-2196) IndexingFilterChecker to optionally normalize

2016-01-05 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2196:


 Summary: IndexingFilterChecker to optionally normalize
 Key: NUTCH-2196
 URL: https://issues.apache.org/jira/browse/NUTCH-2196
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
 Fix For: 1.12


As mentioned in NUTCH-2194, we sometimes use IndexingFilterChecker as a backend 
for a web application. End users are bound to enter malformed URLs, so having a 
normalizer running would improve the user experience.





[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit

2016-01-05 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083273#comment-15083273
 ] 

Markus Jelsma commented on NUTCH-2191:
--

Hey Chris! An Ajax pattern handler is new to me. Can you summarize it and/or 
point to one or two relevant issues? And please, suggest a solution for the 
plugin dependency hell.

> Add protocol-htmlunit
> -
>
> Key: NUTCH-2191
> URL: https://issues.apache.org/jira/browse/NUTCH-2191
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-2191.patch
>
>
> HtmlUnit is, as opposed to other JavaScript-enabled headless browsers, a 
> portable library and should therefore be better suited for very large scale 
> crawls. This issue is an attempt to implement protocol-htmlunit.





[jira] [Created] (NUTCH-2195) IndexingFilterChecker to optionally follow N redirects

2016-01-05 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2195:


 Summary: IndexingFilterChecker to optionally follow N redirects
 Key: NUTCH-2195
 URL: https://issues.apache.org/jira/browse/NUTCH-2195
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
 Fix For: 1.12


As mentioned in NUTCH-2194, we sometimes use it as a backend for a web 
application. If so, it should at least be able to follow N redirects.





[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2016-01-05 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083092#comment-15083092
 ] 

Markus Jelsma commented on NUTCH-2184:
--

Hello Lewis!

* It should be no problem. But since IndexerMapReduce is complicated, I would 
love to have a simple unit test for it so we can guard against breaking things.
* We should make sure the fetchDatum also carries the desired parseData fields 
that we usually store in the CrawlDatum. This is not always true, see the fix I 
did for NUTCH-2093. I think it should be possible to index as many fields as 
with the CrawlDatum. If you implement it as such, then custom indexing filters 
that use CrawlDatum still work :)
* Well yes, I think I already answered that question myself indeed, silly.

This feature would be very handy for small segments but large CrawlDBs! 

> Enable IndexingJob to function with no crawldb
> --
>
> Key: NUTCH-2184
> URL: https://issues.apache.org/jira/browse/NUTCH-2184
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.12
>
> Attachments: NUTCH-2184.patch
>
>
> Sometimes when working with distributed team(s), we have found that we can 
> 'lose' data structures which are currently considered as critical e.g. 
> crawldb, linkdb and/or segments.
> In my current scenario I have a requirement to index segment data with no 
> accompanying crawldb or linkdb. 
> Absence of the latter is OK as linkdb is optional however currently in 
> [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java]
>  crawldb is mandatory. 
> This ticket should enhance the IndexerMapReduce code to support the use case 
> where you ONLY have segments and want to force an index for every record 
> present.





[jira] [Updated] (NUTCH-2191) Add protocol-htmlunit

2016-01-05 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2191:
-
Patch Info: Patch Available

> Add protocol-htmlunit
> -
>
> Key: NUTCH-2191
> URL: https://issues.apache.org/jira/browse/NUTCH-2191
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-2191.patch
>
>
> HtmlUnit is, as opposed to other JavaScript-enabled headless browsers, a 
> portable library and should therefore be better suited for very large scale 
> crawls. This issue is an attempt to implement protocol-htmlunit.





[jira] [Commented] (NUTCH-1838) Host and domain based regex and automaton filtering

2016-01-05 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083101#comment-15083101
 ] 

Markus Jelsma commented on NUTCH-1838:
--

If there are no objections, I'll get this one in soon.

> Host and domain based regex and automaton filtering
> ---
>
> Key: NUTCH-1838
> URL: https://issues.apache.org/jira/browse/NUTCH-1838
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Attachments: NUTCH-1838.patch, NUTCH-1838.patch, NUTCH-1838.patch, 
> NUTCH-1838.patch, NUTCH-1838.patch, NUTCH-1838.patch, NUTCH-1838.patch
>
>
> Both the regex and automaton filters pass all URLs through all rules, although 
> this makes little sense if you have a lot of generated rules for many 
> different hosts or domains. This patch allows users to configure specific 
> rules for a specific host or domain only, making filtering much more 
> efficient.
> Each rule has an optional hostOrDomain field. The filter is applied for rules 
> that have no hostOrDomain and for URLs that match the rule's host name or 
> domain name.
> The following line enables hostOrDomain specific rules:
> {code}
> > www.example.org
> {code}
> The following line disables/resets it again:
> {code}
> <
> {code}
> full example:
> {code}
> -some generic filter
> +another generic filter
> > www.example.org
> -rule only applied to URL's of www.example.org
> +another rule only applied to URL's of www.example.org
> > apache.org
> -rule only applied to URL's of apache.org
> +another rule only applied to URL's of apache.org
> <
> -more generic rules
> +and another one
> {code}
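A minimal Python sketch of how such a rule file could be read and applied follows. It is not the attached patch: the function names are hypothetical, the "first matching rule wins, no match means reject" behaviour is an assumption modelled on how the existing regex URL filter is commonly described, and "match the host or domain" is interpreted here as an exact host match or a sub-domain suffix match.

```python
import re
from urllib.parse import urlsplit

EXAMPLE = r"""
-\.gif$
> www.example.org
-/private/
<
+.
"""


def parse_rules(text):
    """Return a list of (host_or_domain, allow, pattern) tuples."""
    rules = []
    scope = None  # None means the rule applies to every URL
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            scope = line[1:].strip()      # enable host/domain-specific rules
        elif line.startswith("<"):
            scope = None                  # reset to generic rules
        elif line[0] in "+-":
            rules.append((scope, line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """Apply generic rules and rules scoped to the URL's host or domain."""
    host = urlsplit(url).hostname or ""
    for scope, allow, pattern in rules:
        if scope is not None and host != scope and not host.endswith("." + scope):
            continue  # scoped rule for a different host: skip without matching
        if pattern.search(url):
            return allow
    return False
```

The efficiency gain the issue describes comes from the `continue` branch: scoped rules for other hosts are skipped without evaluating their patterns.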





[jira] [Created] (NUTCH-2194) Run IndexingFilterChecker as simple Telnet server

2016-01-05 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2194:


 Summary: Run IndexingFilterChecker as simple Telnet server
 Key: NUTCH-2194
 URL: https://issues.apache.org/jira/browse/NUTCH-2194
 Project: Nutch
  Issue Type: New Feature
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.12


We have used a customized IndexingFilterChecker running as server to be able to 
quickly test/check pages from web applications. I'll add this feature back by 
letting IndexingFilterChecker run optionally as a simple server.

Something like:
bin/nutch indexchecker -listen -port 

The port param only makes sense if -listen is there. Remove -listen or have port 
as argument for listen?





[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit

2016-01-05 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083104#comment-15083104
 ] 

Markus Jelsma commented on NUTCH-2191:
--

Does anyone have an idea on how to force the plugin to use its own included 
deps? I've spent a good while on this crazy problem :)

> Add protocol-htmlunit
> -
>
> Key: NUTCH-2191
> URL: https://issues.apache.org/jira/browse/NUTCH-2191
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-2191.patch
>
>
> HtmlUnit is, as opposed to other JavaScript-enabled headless browsers, a 
> portable library and should therefore be better suited for very large scale 
> crawls. This issue is an attempt to implement protocol-htmlunit.





[jira] [Commented] (NUTCH-1257) Support for the x-robots-tag HTTP Header

2016-01-05 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083114#comment-15083114
 ] 

Markus Jelsma commented on NUTCH-1257:
--

Hmm, there is no patch, but I remember having had this support in our older 
customized Nutch versions. I'll see if I can find it again.

> Support for the x-robots-tag HTTP Header
> 
>
> Key: NUTCH-1257
> URL: https://issues.apache.org/jira/browse/NUTCH-1257
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher
>Reporter: Mike
>Assignee: Markus Jelsma
>  Labels: http, privacy, robots
> Fix For: 2.4
>
>
> Google and Bing both currently support the x-robots-tag HTTP header. This is 
> important, because they have a policy of not *crawling* links that are in a 
> robots.txt file, and not *indexing* links that are set to noindex. In the 
> case that a page is indexed but not crawled, Google and Bing will show the 
> page in their results, but it will lack a snippet (since they didn't crawl 
> it, they can't generate one). 
> As a result, the only way to block Google and Bing from having a page in 
> their index is to use the robots meta tag in HTML pages and the x-robots-tag 
> in other mimetypes.
> As a site owner that needs to keep specific pages private, I *cannot* trust 
> robots.txt to keep my pages out of Google and Bing, and I have to use the two 
> robots standards. Since Nutch doesn't support the HTTP header, I have to 
> block it from crawling ALL non-HTML pages on my site.
> This is not an ideal state of affairs, and it would be great if Nutch 
> supported the x-robots-tag HTTP header.
> I've done more research on this topic on my blog:
>  - 
> http://michaeljaylissner.com/blog/support-for-x-robots-tag-http-header-and-robots-HTML-meta-tag
>  - 
> http://michaeljaylissner.com/blog/respecting-privacy-while-providing-hundreds-of-thousands-of-public-documents
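For illustration, the header value could be interpreted along these lines. This Python sketch is not Nutch code: `robots_directives` is a hypothetical helper, and it only handles a plain comma-separated directive list (`noindex`, `nofollow`, `none`), ignoring agent-specific prefixes and other directives.

```python
def robots_directives(header_value: str) -> dict:
    """Interpret a simple X-Robots-Tag value.

    Assumption: the value is a comma-separated list of directives without
    a user-agent prefix; unknown directives are ignored.
    """
    tokens = {t.strip().lower() for t in header_value.split(",") if t.strip()}
    block_all = "none" in tokens  # "none" is shorthand for noindex, nofollow
    return {
        "index": not (block_all or "noindex" in tokens),
        "follow": not (block_all or "nofollow" in tokens),
    }
```

A crawler honouring the header would apply the result the same way it applies the robots meta tag in HTML, which is what makes it usable for PDFs, images, and other non-HTML responses.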





[jira] [Updated] (NUTCH-2178) DeduplicationJob to optionally group on host or domain

2016-01-05 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2178:
-
Patch Info: Patch Available

> DeduplicationJob to optionally group on host or domain
> -
>
> Key: NUTCH-2178
> URL: https://issues.apache.org/jira/browse/NUTCH-2178
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.10
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-2178.patch
>
>
> Add optional grouping to DeduplicationJob.
> Usage: DeduplicationJob  [-group ]





[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages

2016-01-05 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1932:
-
Patch Info: Patch Available

> Automatically remove orphaned pages
> ---
>
> Key: NUTCH-1932
> URL: https://issues.apache.org/jira/browse/NUTCH-1932
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Attachments: NUTCH-1932-add.patch, NUTCH-1932.patch, 
> NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, 
> NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, 
> NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch
>
>
> Orphan scoring filter that determines whether a page has become orphaned, 
> i.e. no other pages link to it any more. If a page hasn't been linked 
> to after markGoneAfter seconds, the page is marked as gone and is then 
> removed by an indexer.  If a page hasn't been linked to after markOrphanAfter 
> seconds, the page is removed from the CrawlDB.
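The two thresholds can be read as a simple three-state decision, sketched below in Python. This is only an illustration of the description, not the patch itself: the function name and the status strings are hypothetical, and it assumes markOrphanAfter is at least as large as markGoneAfter.

```python
def orphan_status(seconds_since_last_inlink: int,
                  mark_gone_after: int,
                  mark_orphan_after: int) -> str:
    """Classify a page by how long ago it was last linked to.

    Assumption: mark_orphan_after >= mark_gone_after, so a page is first
    marked gone (removed from the index) and later orphaned (removed from
    the CrawlDB).
    """
    if seconds_since_last_inlink > mark_orphan_after:
        return "orphan"   # drop from the CrawlDB
    if seconds_since_last_inlink > mark_gone_after:
        return "gone"     # indexer removes it from the index
    return "linked"       # still referenced recently enough
```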





[jira] [Commented] (NUTCH-1186) FreeGenerator always normalizes

2016-01-05 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083138#comment-15083138
 ] 

Lewis John McGibbney commented on NUTCH-1186:
-

Will scope and test [~markus17]


> FreeGenerator always normalizes
> ---
>
> Key: NUTCH-1186
> URL: https://issues.apache.org/jira/browse/NUTCH-1186
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 1.3
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Attachments: NUTCH-1186-1.7-1.patch, NUTCH-1186-1.7-2.patch
>
>
> The FreeGenerator does not honor the -normalize option; it always normalizes 
> all URLs in the input directory. The -filter option is respected.





[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit

2016-01-05 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083603#comment-15083603
 ] 

Sebastian Nagel commented on NUTCH-2191:


As [~haraldk] mentioned in [this 
discussion|https://mail-archives.apache.org/mod_mbox/nutch-user/201404.mbox/%3c53576563.1030...@raytion.com%3E]
 there is only isolation between plugins (resp. their class loaders), but not 
between the plugin class loader and its parent loading classes from 
$NUTCH_HOME/lib/.

> Add protocol-htmlunit
> -
>
> Key: NUTCH-2191
> URL: https://issues.apache.org/jira/browse/NUTCH-2191
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-2191.patch
>
>
> HtmlUnit is, as opposed to other JavaScript-enabled headless browsers, a 
> portable library and should therefore be better suited for very large scale 
> crawls. This issue is an attempt to implement protocol-htmlunit.





[jira] [Commented] (NUTCH-2178) DeduplicationJob to optionally group on host or domain

2016-01-05 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083096#comment-15083096
 ] 

Markus Jelsma commented on NUTCH-2178:
--

Will commit in a few if no further objections.


> DeduplicationJob to optionally group on host or domain
> -
>
> Key: NUTCH-2178
> URL: https://issues.apache.org/jira/browse/NUTCH-2178
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.10
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-2178.patch
>
>
> Add optional grouping to DeduplicationJob.
> Usage: DeduplicationJob  [-group ]





[jira] [Updated] (NUTCH-1449) Optionally delete documents skipped by IndexingFilters

2016-01-05 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1449:
-
Patch Info: Patch Available

> Optionally delete documents skipped by IndexingFilters
> --
>
> Key: NUTCH-1449
> URL: https://issues.apache.org/jira/browse/NUTCH-1449
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.5.1
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Attachments: NUTCH-1449.patch, NUTCH-1449.patch, NUTCH-1449.patch
>
>
> Add configuration option to delete documents instead of skipping them if the 
> indexing filters return null. This is useful to delete documents with new 
> business logic in the indexing filter chain.





[jira] [Updated] (NUTCH-1186) FreeGenerator always normalizes

2016-01-05 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1186:
-
Patch Info: Patch Available

> FreeGenerator always normalizes
> ---
>
> Key: NUTCH-1186
> URL: https://issues.apache.org/jira/browse/NUTCH-1186
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 1.3
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Attachments: NUTCH-1186-1.7-1.patch, NUTCH-1186-1.7-2.patch
>
>
> The FreeGenerator does not honor the -normalize option; it always normalizes 
> all URLs in the input directory. The -filter option is respected.





[jira] [Commented] (NUTCH-2143) GeneratorJob ignores batch id passed as argument

2016-01-05 Thread liuqibj (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085043#comment-15085043
 ] 

liuqibj commented on NUTCH-2143:


I have a fix and can deliver it

> GeneratorJob ignores batch id passed as argument
> 
>
> Key: NUTCH-2143
> URL: https://issues.apache.org/jira/browse/NUTCH-2143
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 2.3.1
>Reporter: Sebastian Nagel
>Assignee: Lewis John McGibbney
>Priority: Blocker
> Fix For: 2.3.1
>
>
> The batch id passed to GeneratorJob by option/argument -batchId  is 
> ignored and a generated batch id is used to mark the current batch. Log 
> snippets from a run of bin/crawl:
> {noformat}
> bin/nutch generate ... -batchId 1444941073-14208
> ...
> GeneratorJob: generated batch id: 1444941074-858443668 containing 1 URLs
> Fetching : 
> bin/nutch fetch ... 1444941073-14208 ...
> ...
> QueueFeeder finished: total 0 records. Hit by time limit :0
> {noformat}
> The generated URLs are marked with the wrong batch id:
> {noformat}
> hbase(main):010:0> scan 'test_webpage'
> ROWCOLUMN+CELL
>  org.apache.nutch:http/column=f:bid, timestamp=1444941077080, 
> value=1444941074-858443668
>  ...
>  org.apache.nutch:http/column=mk:_gnmrk_, timestamp=1444941077080, 
> value=1444941074-858443668
> {noformat}
> and fetcher will not fetch anything. This problem was reported by Sherban 
> Drulea 
> [[1|https://www.mail-archive.com/user@nutch.apache.org/msg13894.html]], 
> [[2|https://www.mail-archive.com/user@nutch.apache.org/msg13912.html]].
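The expected behaviour amounts to "use the batch id that was passed in, and only generate one when none was given". A minimal Python sketch of that rule follows; it is not the fix mentioned in the comment. The name `resolve_batch_id` is hypothetical, and the generated format (epoch seconds, a dash, a random integer) is merely guessed from the log snippet above.

```python
import random
import time
from typing import Optional


def resolve_batch_id(requested: Optional[str] = None) -> str:
    """Honour an explicitly passed batch id; otherwise generate one.

    The generated format mimics the ids seen in the log output and is an
    assumption, not the actual GeneratorJob algorithm.
    """
    if requested:
        return requested
    return f"{int(time.time())}-{random.randint(0, 2**31 - 1)}"
```

If the generator marks rows with this resolved id, the id later handed to the fetcher refers to the same batch, which is exactly what goes wrong in the log above.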





[jira] [Commented] (NUTCH-2168) Parse-tika fails to retrieve parser

2016-01-05 Thread Auro Miralles (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082904#comment-15082904
 ] 

Auro Miralles commented on NUTCH-2168:
--

Hello. I have no idea which document fails... I can crawl without problems with 
the index-html plugin disabled, but Nutch fails at the third iteration when I 
enable the plugin. Only two URLs in my seed.txt and ignore.external.links set to 
true.

http://ujiapps.uji.es/
https://wiki.apache.org/nutch/

:~/ /bin/crawl urls/ testCrawl http://localhost:8983/solr/ 3



Parsing https://wiki.apache.org/nutch/bin/nutch%20mergelinkdb
Parsing https://wiki.apache.org/nutch/GettingNutchRunningWithDebian
Parsing 
http://ujiapps.uji.es/serveis/scp/accp/carnetUJI/avantatges/formacio/index.jpg
Parsing 
http://ujiapps.uji.es/serveis/scp/accp/carnetUJI/avantatges/compres/index.jpg
Parsing 
http://ujiapps.uji.es/serveis/scp/accp/carnetUJI/avantatges/productesfinancers/index.jpg
ParserJob: success
ParserJob: finished at 2016-01-05 12:27:42, time elapsed: 00:00:14
CrawlDB update for testCrawl
/home/kalanya/apache-nutch-2.3.1/runtime/local/bin/nutch updatedb -D 
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D 
mapred.reduce.tasks.speculative.execution=false -D 
mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 
-all -crawlId testCrawl
DbUpdaterJob: starting at 2016-01-05 12:27:42
DbUpdaterJob: updatinging all
DbUpdaterJob: finished at 2016-01-05 12:27:49, time elapsed: 00:00:06
Indexing testCrawl on SOLR index -> http://localhost:8983/solr/
/home/kalanya/apache-nutch-2.3.1/runtime/local/bin/nutch index -D 
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D 
mapred.reduce.tasks.speculative.execution=false -D 
mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 
-D solr.server.url=http://localhost:8983/solr/ -all -crawlId testCrawl
IndexingJob: starting
SolrIndexerJob: java.lang.RuntimeException: job failed: 
name=[testCrawl]Indexer, jobid=job_local1207147570_0001
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:119)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)

Error running:
  /home/kalanya/apache-nutch-2.3.1/runtime/local/bin/nutch index -D 
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D 
mapred.reduce.tasks.speculative.execution=false -D 
mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 
-D solr.server.url=http://localhost:8983/solr/ -all -crawlId testCrawl
Failed with exit value 255.



HADOOP.LOG



2016-01-05 12:28:00,151 INFO  html.HtmlIndexingFilter - Html indexing for: 
http://ujiapps.uji.es/serveis/alumnisauji/jasoc/alumnisaujipremium/instal_lacions/se/
2016-01-05 12:28:00,152 INFO  html.HtmlIndexingFilter - Html indexing for: 
http://ujiapps.uji.es/com/investigacio/img/ciencia11.jpg
2016-01-05 12:28:00,163 INFO  html.HtmlIndexingFilter - Html indexing for: 
http://ujiapps.uji.es/serveis/cd/bib/reservori/2015/e-llibres/
2016-01-05 12:28:00,164 INFO  solr.SolrIndexWriter - Adding 250 documents
2016-01-05 12:28:00,531 INFO  solr.SolrIndexWriter - Adding 250 documents
2016-01-05 12:28:00,842 WARN  mapred.LocalJobRunner - job_local1207147570_0001
java.lang.Exception: 
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: [was 
class java.io.CharConversionException] Invalid UTF-8 character 0xfffe at 
char #137317, byte #139263)
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: 
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: [was 
class java.io.CharConversionException] Invalid UTF-8 character 0xfffe at char 
#137317, byte #139263)
at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
at 
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
at 
org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
at 
org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
at 
org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
at 

[jira] [Comment Edited] (NUTCH-2168) Parse-tika fails to retrieve parser

2016-01-05 Thread Auro Miralles (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082904#comment-15082904
 ] 

Auro Miralles edited comment on NUTCH-2168 at 1/5/16 11:50 AM:
---

Hello. I have no idea which document fails... I can crawl without problems with 
the index-html plugin disabled, but Nutch fails at the third iteration when I 
enable the plugin. 

EDIT: There is no error if I disable the tika plugin and keep the index-html 
plugin enabled.

Only two URLs in my seed.txt and ignore.external.links set to true.

http://ujiapps.uji.es/
https://wiki.apache.org/nutch/

:~/ /bin/crawl urls/ testCrawl http://localhost:8983/solr/ 3



Parsing https://wiki.apache.org/nutch/bin/nutch%20mergelinkdb
Parsing https://wiki.apache.org/nutch/GettingNutchRunningWithDebian
Parsing 
http://ujiapps.uji.es/serveis/scp/accp/carnetUJI/avantatges/formacio/index.jpg
Parsing 
http://ujiapps.uji.es/serveis/scp/accp/carnetUJI/avantatges/compres/index.jpg
Parsing 
http://ujiapps.uji.es/serveis/scp/accp/carnetUJI/avantatges/productesfinancers/index.jpg
ParserJob: success
ParserJob: finished at 2016-01-05 12:27:42, time elapsed: 00:00:14
CrawlDB update for testCrawl
/home/kalanya/apache-nutch-2.3.1/runtime/local/bin/nutch updatedb -D 
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D 
mapred.reduce.tasks.speculative.execution=false -D 
mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 
-all -crawlId testCrawl
DbUpdaterJob: starting at 2016-01-05 12:27:42
DbUpdaterJob: updatinging all
DbUpdaterJob: finished at 2016-01-05 12:27:49, time elapsed: 00:00:06
Indexing testCrawl on SOLR index -> http://localhost:8983/solr/
/home/kalanya/apache-nutch-2.3.1/runtime/local/bin/nutch index -D 
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D 
mapred.reduce.tasks.speculative.execution=false -D 
mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 
-D solr.server.url=http://localhost:8983/solr/ -all -crawlId testCrawl
IndexingJob: starting
SolrIndexerJob: java.lang.RuntimeException: job failed: 
name=[testCrawl]Indexer, jobid=job_local1207147570_0001
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:119)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)

Error running:
  /home/kalanya/apache-nutch-2.3.1/runtime/local/bin/nutch index -D 
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D 
mapred.reduce.tasks.speculative.execution=false -D 
mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 
-D solr.server.url=http://localhost:8983/solr/ -all -crawlId testCrawl
Failed with exit value 255.



HADOOP.LOG



2016-01-05 12:28:00,151 INFO  html.HtmlIndexingFilter - Html indexing for: 
http://ujiapps.uji.es/serveis/alumnisauji/jasoc/alumnisaujipremium/instal_lacions/se/
2016-01-05 12:28:00,152 INFO  html.HtmlIndexingFilter - Html indexing for: 
http://ujiapps.uji.es/com/investigacio/img/ciencia11.jpg
2016-01-05 12:28:00,163 INFO  html.HtmlIndexingFilter - Html indexing for: 
http://ujiapps.uji.es/serveis/cd/bib/reservori/2015/e-llibres/
2016-01-05 12:28:00,164 INFO  solr.SolrIndexWriter - Adding 250 documents
2016-01-05 12:28:00,531 INFO  solr.SolrIndexWriter - Adding 250 documents
2016-01-05 12:28:00,842 WARN  mapred.LocalJobRunner - job_local1207147570_0001
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xfffe at char #137317, byte #139263)
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xfffe at char #137317, byte #139263)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
at 
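
The trace above shows Solr rejecting a document because its text contains the code point 0xfffe, which is a Unicode noncharacter. As a rough illustration only, and not the fix that was applied in Nutch, field values could be cleaned before they are sent to Solr. The class name NonCharStripper and its methods are hypothetical and are not part of Nutch or SolrJ; the sketch assumes that dropping noncharacters (U+FDD0..U+FDEF and any code point ending in FFFE or FFFF) is acceptable for the indexed field.

```java
public class NonCharStripper {

    // Hypothetical helper: true for Unicode noncharacters, i.e.
    // U+FDD0..U+FDEF and every code point whose low 16 bits are FFFE or FFFF.
    static boolean isNonCharacter(int codePoint) {
        return (codePoint >= 0xFDD0 && codePoint <= 0xFDEF)
            || (codePoint & 0xFFFE) == 0xFFFE;
    }

    // Hypothetical helper: copies the input code point by code point
    // and leaves out the noncharacters.
    static String stripNonCharacters(String input) {
        StringBuilder cleaned = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int codePoint = input.codePointAt(i);
            i += Character.charCount(codePoint);
            if (!isNonCharacter(codePoint)) {
                cleaned.appendCodePoint(codePoint);
            }
        }
        return cleaned.toString();
    }

    public static void main(String[] args) {
        // Stand-in for binary data that was decoded as if it were text.
        String dirty = "JFIF\uFFFEdata\uFFFF";
        System.out.println(stripNonCharacters(dirty)); // prints "JFIFdata"
    }
}
```

This only hides the symptom. Binary content such as the JPEGs listed in the log would still be indexed as text, so treating the field as binary or skipping non-HTML content addresses the cause more directly.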

[jira] [Commented] (NUTCH-1946) Upgrade to Gora 0.6.1

2016-01-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082951#comment-15082951
 ] 

ASF GitHub Bot commented on NUTCH-1946:
---

Github user lewismc commented on the pull request:

https://github.com/apache/gora/pull/21#issuecomment-168985170
  
@jeroenvlek can you please close off this pull request? We are in the 
process of addressing 
[GORA-443](https://issues.apache.org/jira/browse/GORA-443). Thank you


> Upgrade to Gora 0.6.1
> -
>
> Key: NUTCH-1946
> URL: https://issues.apache.org/jira/browse/NUTCH-1946
> Project: Nutch
> Issue Type: Improvement
> Components: storage
> Affects Versions: 2.3.1
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Critical
> Fix For: 2.3.1
>
> Attachments: NUTCH-1946.patch, NUTCH-1946_Gora_fixes.patch, 
> NUTCH-1946v2.patch, NUTCH-1946v3.patch, NUTCH-1946v4.patch
>
>
> Apache Gora 0.6.1 was released recently.
> We should upgrade before pushing Nutch 2.3.1 as it will come in very handy 
> for the new Docker containers.





[jira] [Commented] (NUTCH-1946) Upgrade to Gora 0.6.1

2016-01-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082966#comment-15082966
 ] 

ASF GitHub Bot commented on NUTCH-1946:
---

Github user jeroenvlek closed the pull request at:

https://github.com/apache/gora/pull/21


> Upgrade to Gora 0.6.1
> -
>
> Key: NUTCH-1946
> URL: https://issues.apache.org/jira/browse/NUTCH-1946
> Project: Nutch
> Issue Type: Improvement
> Components: storage
> Affects Versions: 2.3.1
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Critical
> Fix For: 2.3.1
>
> Attachments: NUTCH-1946.patch, NUTCH-1946_Gora_fixes.patch, 
> NUTCH-1946v2.patch, NUTCH-1946v3.patch, NUTCH-1946v4.patch
>
>
> Apache Gora 0.6.1 was released recently.
> We should upgrade before pushing Nutch 2.3.1 as it will come in very handy 
> for the new Docker containers.


