[jira] [Created] (NUTCH-2248) CSS parser plugin

2016-04-07 Thread Joseph Naegele (JIRA)
Joseph Naegele created NUTCH-2248:
-

 Summary: CSS parser plugin
 Key: NUTCH-2248
 URL: https://issues.apache.org/jira/browse/NUTCH-2248
 Project: Nutch
  Issue Type: New Feature
  Components: parser, plugin
Affects Versions: 1.12
Reporter: Joseph Naegele


This plugin collects {{uri}} links from CSS stylesheets. This is useful for 
collecting parent stylesheets, fonts, and images needed to display web pages 
as intended.

Parsed Outlinks do not have associated anchors, and no additional text/content 
is parsed from the stylesheet.
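The plugin uses its own parser, but the core idea can be sketched with plain regexes. A minimal, stdlib-only illustration (the class and patterns here are hypothetical, not the plugin's actual code) of pulling {{url(...)}} and {{@import}} links out of a stylesheet:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CssLinkExtractor {
    // Matches url(...) tokens, with optional single or double quotes.
    private static final Pattern URL_PATTERN = Pattern.compile(
        "url\\(\\s*['\"]?([^'\")]+)['\"]?\\s*\\)", Pattern.CASE_INSENSITIVE);
    // Matches @import "..." / @import '...' forms that don't use url().
    private static final Pattern IMPORT_PATTERN = Pattern.compile(
        "@import\\s+['\"]([^'\"]+)['\"]", Pattern.CASE_INSENSITIVE);

    /** Collect link targets referenced by a stylesheet, in order of match. */
    public static List<String> extractLinks(String css) {
        List<String> links = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(css);
        while (m.find()) links.add(m.group(1).trim());
        m = IMPORT_PATTERN.matcher(css);
        while (m.find()) links.add(m.group(1).trim());
        return links;
    }
}
```

A real CSS tokenizer handles more edge cases (escapes, comments, unquoted data URIs); this only shows the shape of the extraction.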



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1687) Pick queue in Round Robin

2016-05-24 Thread Joseph Naegele (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15298757#comment-15298757
 ] 

Joseph Naegele commented on NUTCH-1687:
---

Any issues with Tien's updated patch?

> Pick queue in Round Robin
> -
>
> Key: NUTCH-1687
> URL: https://issues.apache.org/jira/browse/NUTCH-1687
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Tien Nguyen Manh
>Priority: Minor
> Attachments: NUTCH-1687-2.patch, NUTCH-1687.patch, 
> NUTCH-1687.tejasp.v1.patch
>
>
> Currently we choose the queue to pick a URL from starting at the head of the 
> queues list, so queues at the start of the list have a greater chance of 
> being picked first. That can cause a long-tail problem: only a few queues 
> remain available at the end, and those hold many URLs.
> public synchronized FetchItem getFetchItem() {
>   final Iterator<Map.Entry<String, FetchItemQueue>> it =
>     queues.entrySet().iterator(); ==> always resets to scan from the start
>   while (it.hasNext()) {
> 
> I think it is better to pick queues in round robin: that reduces the time to 
> find an available queue, ensures every queue is picked in turn, and, if we 
> use topN during the generate step, leaves no long-tail queues at the end.
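For illustration, the round-robin selection proposed above can be sketched with stdlib collections (a hypothetical stand-in, not the actual Fetcher code): take a queue from the head and immediately rotate it to the tail, so no queue is starved by repeatedly scanning from the front.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class RoundRobinQueues<Q> {
    private final Deque<Q> queues = new ArrayDeque<>();

    public void add(Q queue) {
        queues.addLast(queue);
    }

    /** Take the head queue and rotate it to the tail, so successive calls
     *  visit every queue in turn instead of always starting at the front. */
    public Q next() {
        Q q = queues.pollFirst();
        if (q != null) queues.addLast(q);
        return q;
    }
}
```

In the real fetcher a queue would also be skipped (or removed) when empty or blocked by crawl-delay; this only shows the rotation.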





[jira] [Commented] (NUTCH-2234) Upgrade to elasticsearch 2.1.1

2016-05-24 Thread Joseph Naegele (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15298791#comment-15298791
 ] 

Joseph Naegele commented on NUTCH-2234:
---

Since this also adds support for multiple, comma-separated Elasticsearch hosts 
in {{elastic.host}}, the description in {{nutch-default.xml}} should be updated 
accordingly. Is there any reason not to update this to use the most recent 
version of Elasticsearch (2.3.3)? I can update the patch or open a PR on GitHub.
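Splitting a comma-separated {{elastic.host}} value into individual host entries might look like the following (an illustrative sketch; the class name and parsing details are assumptions, not the patch's code):

```java
import java.util.ArrayList;
import java.util.List;

public class HostListParser {
    /** Split a comma-separated host property, trimming whitespace and
     *  dropping empty entries, e.g. "es1, es2,es3" -> [es1, es2, es3]. */
    public static List<String> parseHosts(String hostProperty) {
        List<String> hosts = new ArrayList<>();
        if (hostProperty == null) return hosts;
        for (String h : hostProperty.split(",")) {
            String trimmed = h.trim();
            if (!trimmed.isEmpty()) hosts.add(trimmed);
        }
        return hosts;
    }
}
```

Each resulting entry would then be registered with the Elasticsearch client as a transport address.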

> Upgrade to elasticsearch 2.1.1
> --
>
> Key: NUTCH-2234
> URL: https://issues.apache.org/jira/browse/NUTCH-2234
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.11
>Reporter: Tien Nguyen Manh
>Assignee: Markus Jelsma
> Fix For: 1.13
>
> Attachments: NUTCH-2234.patch
>
>
> Currently we use Elasticsearch 1.x; we should upgrade to 2.x.





[jira] [Commented] (NUTCH-2234) Upgrade to elasticsearch 2.1.1

2016-05-24 Thread Joseph Naegele (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15298924#comment-15298924
 ] 

Joseph Naegele commented on NUTCH-2234:
---

Hmm, I'm a bit confused. ES 2.3.3 depends on Lucene 5.5.0 libraries. It appears 
indexer-solr does not depend on Lucene, only SolrJ. lucene-analyzers-common 
4.10.2 is a Nutch-wide dependency in ivy/ivy.xml, but it appears to be used 
only by three plugins: indexer-elastic, parsefilter-naivebayes, and 
scoring-similarity. Of those, indexer-elastic and parsefilter-naivebayes specify 
their Lucene dependencies in their own plugin.xml (scoring-similarity appears 
to rely on lucene-core 4.10.2 being a transitive dependency through 
lucene-analyzers-common). Changing the Lucene version in ivy/ivy.xml requires 
changes to the scoring-similarity plugin, which I think should be its own issue.



> Upgrade to elasticsearch 2.1.1
> --
>
> Key: NUTCH-2234
> URL: https://issues.apache.org/jira/browse/NUTCH-2234
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.11
>Reporter: Tien Nguyen Manh
>Assignee: Markus Jelsma
> Fix For: 1.13
>
> Attachments: NUTCH-2234.patch
>
>
> Currently we use Elasticsearch 1.x; we should upgrade to 2.x.





[jira] [Commented] (NUTCH-2234) Upgrade to elasticsearch 2.1.1

2016-05-25 Thread Joseph Naegele (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15300590#comment-15300590
 ] 

Joseph Naegele commented on NUTCH-2234:
---

Understood. The update to Lucene analyzers requires minor programmatic API 
changes in scoring-similarity, but nothing big. None of the indexers have 
tests, so I'm testing indexer-elastic manually for now. Unfortunately updating 
Elasticsearch breaks the plugin due to differences in guava versions: 
indexer-elastic depends on guava-18.0, which it declares in its plugin.xml, but 
guava-16.0.1 is a Nutch-wide dependency (for Hadoop). We avoided this issue in 
the past by also updating Nutch's Hadoop dependency from 2.4.0 -> 2.7.1, which 
is why Tien created NUTCH-2246. I'll open the PR with all aforementioned 
dependency updates.

> Upgrade to elasticsearch 2.1.1
> --
>
> Key: NUTCH-2234
> URL: https://issues.apache.org/jira/browse/NUTCH-2234
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.11
>Reporter: Tien Nguyen Manh
>Assignee: Markus Jelsma
> Fix For: 1.13
>
> Attachments: NUTCH-2234.patch
>
>
> Currently we use Elasticsearch 1.x; we should upgrade to 2.x.





[jira] [Updated] (NUTCH-1611) Elastic Search Indexer Creates field in elastic search "boost" as a string value, so cannot be used in custom boost queries

2016-05-26 Thread Joseph Naegele (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Naegele updated NUTCH-1611:
--
Attachment: NUTCH-1611-1.x.patch

This is an issue in 1.x. I've attached a patch for 1.x for which all tests pass.
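The underlying fix is to emit {{boost}} as a numeric value rather than a string when building the index document. A minimal, stdlib-only sketch of that conversion (illustrative; the class name and fallback behavior are assumptions, not the attached patch):

```java
public class BoostField {
    /** Convert a string-typed boost value to a float so the search engine
     *  stores it numerically; fall back to a default when unparsable. */
    public static float toNumericBoost(String raw, float defaultBoost) {
        if (raw == null) return defaultBoost;
        try {
            return Float.parseFloat(raw.trim());
        } catch (NumberFormatException e) {
            return defaultBoost;
        }
    }
}
```

Once the field is indexed as a float, script expressions like {{doc.boost.doubleValue}} in a custom_score query work as expected.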

> Elastic Search Indexer Creates field in elastic search "boost" as a string 
> value, so cannot be used in custom boost queries
> ---
>
> Key: NUTCH-1611
> URL: https://issues.apache.org/jira/browse/NUTCH-1611
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 2.2.1
> Environment: All
>Reporter: Nicholas Waltham
> Attachments: NUTCH-1611-1.x.patch
>
>
> Ordinarily, one can use a boost field in a custom_score query in Elasticsearch 
> to affect the ranking, and Nutch creates such a field. However, it is stored 
> in Elasticsearch as a string, so it cannot be used. Attempting to use the 
> boost field in a query therefore produces the following error:
>  PropertyAccessException[[Error: could not access: floatValue; in class: 
> org.elasticsearch.index.field.data.strings.StringDocFieldData]\n[Near : {... 
> _score + (1 * doc.boost.floatValue / 100) }]   
> Example test query:
> {
> "query" : {
>   "custom_score" : {  
> "query" : {
>   "query_string" : {
>  "query" : "something"
>   }},
>   "script" : "_score + (1 * doc.boost.doubleValue / 100)"
>   }
>}
> }   





[jira] [Created] (NUTCH-2279) LinkRank fails when using Hadoop MR output compression

2016-06-14 Thread Joseph Naegele (JIRA)
Joseph Naegele created NUTCH-2279:
-

 Summary: LinkRank fails when using Hadoop MR output compression
 Key: NUTCH-2279
 URL: https://issues.apache.org/jira/browse/NUTCH-2279
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.11
Reporter: Joseph Naegele


When using MapReduce job output compression, i.e. 
{{mapreduce.output.fileoutputformat.compress=true}}, LinkRank can't read the 
results of its {{Counter}} MR job due to the additional, generated file 
extension.

For example, using the default compression codec (which appears to be DEFLATE), 
the counter file is written to 
{{crawl/webgraph/_num_nodes_/part-0.deflate}}. Then, the LinkRank job 
attempts to manually read this file to obtain the number of links using the 
following code:
{code}
FSDataInputStream readLinks = fs.open(new Path(numLinksPath, "part-0"));
{code}
which fails because the file {{part-0}} doesn't exist:
{code}
LinkAnalysis: java.io.FileNotFoundException: File 
crawl/webgraph/_num_nodes_/part-0 does not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:606)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:819)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:596)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
at 
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
at 
org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
at 
org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:124)
at org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:633)
at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:713)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:680)
{code}

To reproduce, add {{-D mapreduce.output.fileoutputformat.compress=true}} to the 
properties for {{bin/nutch linkrank ...}}
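One possible fix is to resolve the part file by name prefix rather than by exact name, tolerating whatever suffix the output codec appended. A stdlib-only sketch of that lookup (hypothetical helper, not LinkRank's code; on Hadoop the matched file would additionally need to be read through the codec's decompression stream):

```java
import java.io.File;

public class PartFileResolver {
    /** Find the job output part file in a directory, tolerating a
     *  codec-generated suffix such as ".deflate" or ".gz" appended to
     *  the expected base name (e.g. "part-0" vs "part-0.deflate"). */
    public static File resolve(File outputDir, String baseName) {
        File exact = new File(outputDir, baseName);
        if (exact.exists()) return exact;
        File[] candidates = outputDir.listFiles(
            (dir, name) -> name.startsWith(baseName + "."));
        return (candidates != null && candidates.length > 0) ? candidates[0] : null;
    }
}
```

In Hadoop itself, `CompressionCodecFactory.getCodec(path)` can pick the right codec from the file extension before opening the stream.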





[jira] [Updated] (NUTCH-2279) LinkRank fails when using Hadoop MR output compression

2016-06-23 Thread Joseph Naegele (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Naegele updated NUTCH-2279:
--
Affects Version/s: (was: 1.11)
   1.12

> LinkRank fails when using Hadoop MR output compression
> --
>
> Key: NUTCH-2279
> URL: https://issues.apache.org/jira/browse/NUTCH-2279
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.12
>Reporter: Joseph Naegele
>
> When using MapReduce job output compression, i.e. 
> {{mapreduce.output.fileoutputformat.compress=true}}, LinkRank can't read the 
> results of its {{Counter}} MR job due to the additional, generated file 
> extension.
> For example, using the default compression codec (which appears to be 
> DEFLATE), the counter file is written to 
> {{crawl/webgraph/_num_nodes_/part-0.deflate}}. Then, the LinkRank job 
> attempts to manually read this file to obtain the number of links using the 
> following code:
> {code}
> FSDataInputStream readLinks = fs.open(new Path(numLinksPath, "part-0"));
> {code}
> which fails because the file {{part-0}} doesn't exist:
> {code}
> LinkAnalysis: java.io.FileNotFoundException: File 
> crawl/webgraph/_num_nodes_/part-0 does not exist
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:606)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:819)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:596)
> at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
> at 
> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
> at 
> org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
> at 
> org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:124)
> at 
> org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:633)
> at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:713)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:680)
> {code}
> To reproduce, add {{-D mapreduce.output.fileoutputformat.compress=true}} to 
> the properties for {{bin/nutch linkrank ...}}





[jira] [Created] (NUTCH-2287) Indexer-elastic plugin should use Elasticsearch BulkProcessor and BackoffPolicy

2016-06-24 Thread Joseph Naegele (JIRA)
Joseph Naegele created NUTCH-2287:
-

 Summary: Indexer-elastic plugin should use Elasticsearch 
BulkProcessor and BackoffPolicy
 Key: NUTCH-2287
 URL: https://issues.apache.org/jira/browse/NUTCH-2287
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, plugin
Affects Versions: 1.12
Reporter: Joseph Naegele


Elasticsearch's API (since at least v2.0) includes the {{BulkProcessor}}, which 
automatically handles flushing bulk requests given a max doc count and/or max 
bulk size. It also now (I believe since 2.2.0) offers a {{BackoffPolicy}} 
option, allowing the BulkProcessor/Client to retry bulk requests when the 
Elasticsearch cluster is saturated. Using the {{BulkProcessor}} was originally 
suggested 
[here|https://issues.apache.org/jira/browse/NUTCH-1527?focusedCommentId=1316&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-1316].

Refactoring the {{indexer-elastic}} plugin to use the {{BulkProcessor}} will 
greatly simplify the existing plugin at the cost of slightly less debug 
logging. Additionally, it will allow the plugin to handle cluster saturation 
gracefully (rather than raising a RuntimeException and killing the reduce 
task), by using a configurable "exponential back-off policy".

https://www.elastic.co/guide/en/elasticsearch/client/java-api/2.3/java-docs-bulk-processor.html
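Conceptually, an exponential back-off policy grows the delay between retries geometrically up to a cap, which is what keeps retries against a saturated cluster bounded. A stdlib-only sketch of that delay schedule (illustrative of the policy's shape; not Elasticsearch's actual {{BackoffPolicy}} implementation, whose defaults may differ):

```java
public class ExponentialBackoff {
    /** Delay before the nth retry (0-based): initialDelayMillis * 2^attempt,
     *  capped so retries settle at a bounded rate. The shift is clamped to
     *  avoid long overflow for large attempt counts. */
    public static long delayMillis(long initialDelayMillis, int attempt, long capMillis) {
        long delay = initialDelayMillis << Math.min(attempt, 30);
        return Math.min(delay, capMillis);
    }
}
```

With the real API, the policy is handed to the {{BulkProcessor}} builder so failed bulk requests are retried automatically instead of surfacing as a RuntimeException in the reduce task.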





[jira] [Commented] (NUTCH-2287) Indexer-elastic plugin should use Elasticsearch BulkProcessor and BackoffPolicy

2016-06-24 Thread Joseph Naegele (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348659#comment-15348659
 ] 

Joseph Naegele commented on NUTCH-2287:
---

I can have a PR ready early next week. This will require upgrading the 
Elasticsearch dependency to at least version 2.2.0 (potentially handled by 
NUTCH-2234, which upgrades it to 2.3.3).

> Indexer-elastic plugin should use Elasticsearch BulkProcessor and 
> BackoffPolicy
> ---
>
> Key: NUTCH-2287
> URL: https://issues.apache.org/jira/browse/NUTCH-2287
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer, plugin
>Affects Versions: 1.12
>Reporter: Joseph Naegele
>
> Elasticsearch's API (since at least v2.0) includes the {{BulkProcessor}}, 
> which automatically handles flushing bulk requests given a max doc count 
> and/or max bulk size. It also now (I believe since 2.2.0) offers a 
> {{BackoffPolicy}} option, allowing the BulkProcessor/Client to retry bulk 
> requests when the Elasticsearch cluster is saturated. Using the 
> {{BulkProcessor}} was originally suggested 
> [here|https://issues.apache.org/jira/browse/NUTCH-1527?focusedCommentId=1316&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-1316].
> Refactoring the {{indexer-elastic}} plugin to use the {{BulkProcessor}} will 
> greatly simplify the existing plugin at the cost of slightly less debug 
> logging. Additionally, it will allow the plugin to handle cluster saturation 
> gracefully (rather than raising a RuntimeException and killing the reduce 
> task), by using a configurable "exponential back-off policy".
> https://www.elastic.co/guide/en/elasticsearch/client/java-api/2.3/java-docs-bulk-processor.html


