from:"Julien Nioche"


[ 
https://issues.apache.org/jira/browse/NUTCH-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15033491#comment-15033491
 ] 

Julien Nioche commented on NUTCH-2177:
--

Do you mean 'mapreduce.framework.name' ?

> Generator produces only one partition even in distributed mode
> --
>
> Key: NUTCH-2177
> URL: https://issues.apache.org/jira/browse/NUTCH-2177
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>    Reporter: Julien Nioche
>Priority: Blocker
> Fix For: 1.11
>
>
> See 
> [https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/crawl/Generator.java#L542]
> 'mapred.job.tracker' is deprecated and has been replaced by 
> 'mapreduce.jobtracker.address', however when running Nutch on EMR 
> mapreduce.jobtracker.address has local as a value. As a result we generate a 
> single partition i.e. have a single map fetching later on (which defeats the 
> object of having a distributed crawler).
> We should probably detect whether we are running on YARN instead, see 
> [http://stackoverflow.com/questions/29680155/why-there-is-a-mapreduce-jobtracker-address-configuration-on-yarn]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (NUTCH-2177) Generator produces only one partition even in distributed mode


 [ 
https://issues.apache.org/jira/browse/NUTCH-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-2177:
-
Attachment: NUTCH-2177.patch

> Generator produces only one partition even in distributed mode
> --
>
> Key: NUTCH-2177
> URL: https://issues.apache.org/jira/browse/NUTCH-2177
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>    Reporter: Julien Nioche
>Priority: Blocker
> Fix For: 1.11
>
> Attachments: NUTCH-2177.patch
>
>
> See 
> [https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/crawl/Generator.java#L542]
> 'mapred.job.tracker' is deprecated and has been replaced by 
> 'mapreduce.jobtracker.address', however when running Nutch on EMR 
> mapreduce.jobtracker.address has local as a value. As a result we generate a 
> single partition i.e. have a single map fetching later on (which defeats the 
> object of having a distributed crawler).
> We should probably detect whether we are running on YARN instead, see 
> [http://stackoverflow.com/questions/29680155/why-there-is-a-mapreduce-jobtracker-address-configuration-on-yarn]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Comment Edited] (NUTCH-2177) Generator produces only one partition even in distributed mode


[ 
https://issues.apache.org/jira/browse/NUTCH-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15033491#comment-15033491
 ] 

Julien Nioche edited comment on NUTCH-2177 at 12/1/15 11:43 AM:


Do you mean 'mapreduce.framework.name' ?

Should work indeed - here are the values I am getting on EMR

15/12/01 11:02:00 INFO crawl.Generator: mapred.job.tracker local
15/12/01 11:02:00 INFO crawl.Generator: mapreduce.jobtracker.address local
15/12/01 11:02:00 INFO crawl.Generator: mapreduce.framework.name yarn

where in local mode I get 

2015-12-01 11:42:16,622 INFO  crawl.Generator - mapred.job.tracker local
2015-12-01 11:42:16,622 INFO  crawl.Generator - mapreduce.jobtracker.address 
local
2015-12-01 11:42:16,622 INFO  crawl.Generator - mapreduce.framework.name local

Will send a patch shortly


was (Author: jnioche):
Do you mean 'mapreduce.framework.name' ?

> Generator produces only one partition even in distributed mode
> --
>
> Key: NUTCH-2177
> URL: https://issues.apache.org/jira/browse/NUTCH-2177
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>    Reporter: Julien Nioche
>Priority: Blocker
> Fix For: 1.11
>
>
> See 
> [https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/crawl/Generator.java#L542]
> 'mapred.job.tracker' is deprecated and has been replaced by 
> 'mapreduce.jobtracker.address', however when running Nutch on EMR 
> mapreduce.jobtracker.address has local as a value. As a result we generate a 
> single partition i.e. have a single map fetching later on (which defeats the 
> object of having a distributed crawler).
> We should probably detect whether we are running on YARN instead, see 
> [http://stackoverflow.com/questions/29680155/why-there-is-a-mapreduce-jobtracker-address-configuration-on-yarn]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Resolved] (NUTCH-2177) Generator produces only one partition even in distributed mode


 [ 
https://issues.apache.org/jira/browse/NUTCH-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-2177.
--
Resolution: Fixed

Committed revision 1717412.

Thanks [~wastl-nagel] and [~markus17]

> Generator produces only one partition even in distributed mode
> --
>
> Key: NUTCH-2177
> URL: https://issues.apache.org/jira/browse/NUTCH-2177
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>    Reporter: Julien Nioche
>Priority: Blocker
> Fix For: 1.11
>
> Attachments: NUTCH-2177.patch
>
>
> See 
> [https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/crawl/Generator.java#L542]
> 'mapred.job.tracker' is deprecated and has been replaced by 
> 'mapreduce.jobtracker.address', however when running Nutch on EMR 
> mapreduce.jobtracker.address has local as a value. As a result we generate a 
> single partition i.e. have a single map fetching later on (which defeats the 
> object of having a distributed crawler).
> We should probably detect whether we are running on YARN instead, see 
> [http://stackoverflow.com/questions/29680155/why-there-is-a-mapreduce-jobtracker-address-configuration-on-yarn]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (NUTCH-2177) Generator produces only one partition even in distributed mode

2015-11-26 Thread Julien Nioche (JIRA)

Julien Nioche created NUTCH-2177:


 Summary: Generator produces only one partition even in distributed 
mode
 Key: NUTCH-2177
 URL: https://issues.apache.org/jira/browse/NUTCH-2177
 Project: Nutch
  Issue Type: Bug
  Components: generator
Reporter: Julien Nioche
Priority: Blocker
 Fix For: 1.11


See 
[https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/crawl/Generator.java#L542]

'mapred.job.tracker' is deprecated and has been replaced by 
'mapreduce.jobtracker.address', however when running Nutch on EMR 
mapreduce.jobtracker.address has local as a value. As a result we generate a 
single partition i.e. have a single map fetching later on (which defeats the 
object of having a distributed crawler).

We should probably detect whether we are running on YARN instead, see 
[http://stackoverflow.com/questions/29680155/why-there-is-a-mapreduce-jobtracker-address-configuration-on-yarn]




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (NUTCH-2177) Generator produces only one partition even in distributed mode

2015-11-26 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15029037#comment-15029037
 ] 

Julien Nioche commented on NUTCH-2177:
--

I am on  
Hadoop version: 2.4.0-amzn-7

not clear which version of Yarn it comes with. 

mapred.job.tracker does not appear in the conf at all but 
mapreduce.jobtracker.address does and has 'local' as value

I can see

15/11/26 15:58:13 INFO Configuration.deprecation: mapred.job.tracker is 
deprecated. Instead, use mapreduce.jobtracker.address
15/11/26 15:58:13 INFO crawl.Generator: Generator: jobtracker is 'local', 
generating exactly one partition.

so my guess is that the Configuration maps the old key (mapred.job.tracker) to 
the new one (mapreduce.jobtracker.address) 

@markus17 what do you get for mapreduce.jobtracker.address? 


> Generator produces only one partition even in distributed mode
> --
>
> Key: NUTCH-2177
> URL: https://issues.apache.org/jira/browse/NUTCH-2177
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>    Reporter: Julien Nioche
>Priority: Blocker
> Fix For: 1.11
>
>
> See 
> [https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/crawl/Generator.java#L542]
> 'mapred.job.tracker' is deprecated and has been replaced by 
> 'mapreduce.jobtracker.address', however when running Nutch on EMR 
> mapreduce.jobtracker.address has local as a value. As a result we generate a 
> single partition i.e. have a single map fetching later on (which defeats the 
> object of having a distributed crawler).
> We should probably detect whether we are running on YARN instead, see 
> [http://stackoverflow.com/questions/29680155/why-there-is-a-mapreduce-jobtracker-address-configuration-on-yarn]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (NUTCH-2069) Ignore external links based on domain

2015-11-20 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15018232#comment-15018232
 ] 

Julien Nioche commented on NUTCH-2069:
--

no probs. Would be good to find a way to format based on the Eclipse XML config 
with an ANT task. There is a way to do it with Maven but haven't seen one for 
ANT.

> Ignore external links based on domain
> -
>
> Key: NUTCH-2069
> URL: https://issues.apache.org/jira/browse/NUTCH-2069
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher, parser
>Affects Versions: 1.10
>    Reporter: Julien Nioche
> Fix For: 1.11
>
> Attachments: NUTCH-2069.patch, NUTCH-2069.v2.patch
>
>
> We currently have `db.ignore.external.links` which is a nice way of 
> restricting the crawl based on the hostname. This adds a new parameter 
> 'db.ignore.external.links.domain' to do the same based on the domain.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Resolved] (NUTCH-2069) Ignore external links based on domain

2015-11-20 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-2069.
--
Resolution: Fixed

Trunk committed revision 1715386.

Thanks everyone for comments and reviews

> Ignore external links based on domain
> -
>
> Key: NUTCH-2069
> URL: https://issues.apache.org/jira/browse/NUTCH-2069
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher, parser
>Affects Versions: 1.10
>    Reporter: Julien Nioche
> Fix For: 1.11
>
> Attachments: NUTCH-2069.patch, NUTCH-2069.v2.patch
>
>
> We currently have `db.ignore.external.links` which is a nice way of 
> restricting the crawl based on the hostname. This adds a new parameter 
> 'db.ignore.external.links.domain' to do the same based on the domain.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Closed] (NUTCH-2069) Ignore external links based on domain

2015-11-20 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-2069.


> Ignore external links based on domain
> -
>
> Key: NUTCH-2069
> URL: https://issues.apache.org/jira/browse/NUTCH-2069
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher, parser
>Affects Versions: 1.10
>    Reporter: Julien Nioche
> Fix For: 1.11
>
> Attachments: NUTCH-2069.patch, NUTCH-2069.v2.patch
>
>
> We currently have `db.ignore.external.links` which is a nice way of 
> restricting the crawl based on the hostname. This adds a new parameter 
> 'db.ignore.external.links.domain' to do the same based on the domain.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (NUTCH-2069) Ignore external links based on domain

2015-11-19 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-2069:
-
Attachment: NUTCH-2069.v2.patch

new patch introducing 'db.ignore.external.links.mode' 
this is compatible with the existing behaviour and will use byHost by default.

> Ignore external links based on domain
> -
>
> Key: NUTCH-2069
> URL: https://issues.apache.org/jira/browse/NUTCH-2069
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher, parser
>Affects Versions: 1.10
>    Reporter: Julien Nioche
> Attachments: NUTCH-2069.patch, NUTCH-2069.v2.patch
>
>
> We currently have `db.ignore.external.links` which is a nice way of 
> restricting the crawl based on the hostname. This adds a new parameter 
> 'db.ignore.external.links.domain' to do the same based on the domain.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (NUTCH-2064) URLNormalizer basic to encode reserved chars and decode non-reserved chars


[ 
https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998467#comment-14998467
 ] 

Julien Nioche commented on NUTCH-2064:
--

FYI have ported the code to Crawler-Commons 
[https://github.com/crawler-commons/crawler-commons/pull/106] where the 
provenance is acknowledged. Comments on that PR are more than  welcome.

> URLNormalizer basic to encode reserved chars and decode non-reserved chars
> --
>
> Key: NUTCH-2064
> URL: https://issues.apache.org/jira/browse/NUTCH-2064
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.10
>Reporter: Markus Jelsma
> Fix For: 1.11
>
> Attachments: NUTCH-1098.patch, NUTCH-1098.patch, NUTCH-2064-v3.patch, 
> NUTCH-2064-v5.patch, NUTCH-2064.patch
>
>
> NUTCH-1098 rewritten to work on trunk. Unit test is identical to 1098.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Resolved] (NUTCH-2064) URLNormalizer basic to encode reserved chars and decode non-reserved chars


 [ 
https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-2064.
--
   Resolution: Fixed
Fix Version/s: (was: 1.12)
   1.11

Trunk  : Committed revision 1713615.

Nice one! thanks Markus and Sebastian


> URLNormalizer basic to encode reserved chars and decode non-reserved chars
> --
>
> Key: NUTCH-2064
> URL: https://issues.apache.org/jira/browse/NUTCH-2064
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.10
>Reporter: Markus Jelsma
> Fix For: 1.11
>
> Attachments: NUTCH-1098.patch, NUTCH-1098.patch, NUTCH-2064-v3.patch, 
> NUTCH-2064-v5.patch, NUTCH-2064.patch
>
>
> NUTCH-1098 rewritten to work on trunk. Unit test is identical to 1098.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Assigned] (NUTCH-2158) Upgrade to Tika 1.11


 [ 
https://issues.apache.org/jira/browse/NUTCH-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche reassigned NUTCH-2158:


Assignee: Julien Nioche  (was: Chris A. Mattmann)

> Upgrade to Tika 1.11
> 
>
> Key: NUTCH-2158
> URL: https://issues.apache.org/jira/browse/NUTCH-2158
> Project: Nutch
>  Issue Type: Task
>  Components: parser
>Reporter: Chris A. Mattmann
>    Assignee: Julien Nioche
> Fix For: 1.11
>
>
> Upgrade parse-tika to 1.11 release for Tika.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (NUTCH-2158) Upgrade to Tika 1.11


 [ 
https://issues.apache.org/jira/browse/NUTCH-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-2158:
-
Attachment: NUTCH-2158.patch

Patch which upgrades to Tika 1.11

tests fail for protocol-http 

{code}
Testcase: testStatusCode took 3.648 sec
FAILED
ContentType http://127.0.0.1:47504/basic-http.jsp 
expected:<[application/xhtml+x]ml> but was:<[text/ht]ml>
junit.framework.AssertionFailedError: ContentType 
http://127.0.0.1:47504/basic-http.jsp expected:<[application/xhtml+x]ml> but 
was:<[text/ht]ml>
at 
org.apache.nutch.protocol.http.TestProtocolHttp.fetchPage(TestProtocolHttp.java:136)
at 
org.apache.nutch.protocol.http.TestProtocolHttp.testStatusCode(TestProtocolHttp.java:80)
{code}

mimetype detected is different but probably correct. Will fix later.


> Upgrade to Tika 1.11
> 
>
> Key: NUTCH-2158
> URL: https://issues.apache.org/jira/browse/NUTCH-2158
> Project: Nutch
>  Issue Type: Task
>  Components: parser
>Reporter: Chris A. Mattmann
>Assignee: Julien Nioche
> Fix For: 1.11
>
> Attachments: NUTCH-2158.patch
>
>
> Upgrade parse-tika to 1.11 release for Tika.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Re: [VOTE] Apache Nutch 1.11 Release Candidate #1

2015-10-26 Thread Julien Nioche

Chris

-1  We usually release tar.gz as well as zip.  More importantly we need to
release the sources as well as the binary. We can't even test that it
compiles OK

Since you released Tika, why don't we include it before cutting 1.11?

Thanks

Julien


On 26 October 2015 at 05:53, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Hi Folks,
>
> A first candidate for the Nutch 1.11 release is available at:
>
>   https://dist.apache.org/repos/dist/dev/nutch/1.11/
>
> The release candidate is a zip archive of the sources in:
> http://svn.apache.org/repos/asf/nutch/tags/release-1.11-rc1/
>
>
> The SHA1 checksum of the archive is
> 6adebaca0504be69a9e6c67ae1eb3a8487b1806f
>
>
> In addition, a staged maven repository is available here:
>
> https://repository.apache.org/content/repositories/orgapachenutch-1006/
>
>
> Please vote on releasing this package as Apache Nutch 1.11.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Nutch PMC votes are cast.
>
> [ ] +1 Release this package as Apache Nutch 1.11
> [ ] -1 Do not release this package because…
>
> Cheers,
> Chris
>
> P.S. Of course here is my +1.
>
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>
>


-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble

[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events

2015-10-05 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943757#comment-14943757
 ] 

Julien Nioche commented on NUTCH-2132:
--

Looking at it from a slightly different angle, couldn't you use Logstash to 
aggregate and push to ElasticSearch? Most events are already present in the log 
files. You'd be able to query ES (or any other backend supported by Logstash or 
similar) for instance with Kibana and graph it all with 0 modifications to the 
code and 0 overhead. 

> Publisher/Subscriber model for Nutch to emit events 
> 
>
> Key: NUTCH-2132
> URL: https://issues.apache.org/jira/browse/NUTCH-2132
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher, REST_api
>Reporter: Sujen Shah
>  Labels: memex
> Fix For: 1.11
>
> Attachments: NUTCH-2132.patch
>
>
> It would be nice to have a Pub/Sub model in Nutch to emit certain events (ex- 
> Fetcher events like fetch-start, fetch-end, a fetch report which may contain 
> data like outlinks of the current fetched url, score, etc). 
> A consumer of this functionality could use this data to generate real time 
> visualization and generate statics of the crawl without having to wait for 
> the fetch round to finish. 
> The REST API could contain an endpoint which would respond with a url to 
> which a client could subscribe to get the fetcher events. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events

2015-10-05 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943856#comment-14943856
 ] 

Julien Nioche commented on NUTCH-2132:
--

bq.  but that locks us into using Kibana, etc. Ideally one goal of this would 
be to enable it to work with multiple downstream front ends

I mentioned logstash as an example, my point was more generally about 
leveraging the log files instead of modifying the code and possibly add 
overhead and complexity. There are probably other tools doing similar things.

Having said that Logstash is pluggable and supports various backends, it could 
be probably be possible to push things into a queue for instance.

Talking about dependencies, this introduces a hard one on RabbitMQ. Isn't there 
a neutral API that could be programmed against? (JMS? AMQP?) - this would allow 
users to chose their favourite messaging queue.


 

> Publisher/Subscriber model for Nutch to emit events 
> 
>
> Key: NUTCH-2132
> URL: https://issues.apache.org/jira/browse/NUTCH-2132
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher, REST_api
>Reporter: Sujen Shah
>  Labels: memex
> Fix For: 1.11
>
> Attachments: NUTCH-2132.patch
>
>
> It would be nice to have a Pub/Sub model in Nutch to emit certain events (ex- 
> Fetcher events like fetch-start, fetch-end, a fetch report which may contain 
> data like outlinks of the current fetched url, score, etc). 
> A consumer of this functionality could use this data to generate real time 
> visualization and generate statics of the crawl without having to wait for 
> the fetch round to finish. 
> The REST API could contain an endpoint which would respond with a url to 
> which a client could subscribe to get the fetcher events. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Re: Nutch not recognizing html pages/images retrieved via php

2015-10-05 Thread Julien Nioche

Hi

What happens is that parse-tika is used by default but doesn't know what to
do with that mime type.

You can edit parse-plugins.xml
 and add






to map the mime type to the html parser. Obviously you'll need parse-html
to be active.

HTH

Julien



On 4 October 2015 at 03:01, Girish Rao  wrote:

> Hi,
>
> I am running a crawl on a website that serves pages and images via php.
> Nutch doesn’t seem to crawl these pages.
>
> I see the below in the hadoop.log
> 015-10-03 12:48:31,091 INFO  parse.ParserFactory - The parsing plugins:
> [org.apache.nutch.parse.tika.TikaParser] are enabled via the
> plugin.includes system property, and all claim to support the content type
> text/x-php, but they are not mapped to it  in the parse-plugins.xml file
> 2015-10-03 12:48:31,712 ERROR tika.TikaParser - Can't retrieve Tika parser
> for mime-type text/x-php
> 2015-10-03 12:48:31,713 WARN  parse.ParseSegment - Error parsing:
> http://www.arguntrader.com/ucp.php?mode=login: failed(2,0): Can't
> retrieve Tika parser for mime-type text/x-php
>
> Can anyone help with identifying what is to be done to crawl a site which
> serves pages via php?
>
> Regards
> Girish




-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble

[jira] [Commented] (NUTCH-2129) Track Protocol Status in Crawl Datum

2015-10-01 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14939503#comment-14939503
 ] 

Julien Nioche commented on NUTCH-2129:
--

I'd rather keep it simple and not modify the CrawlDatum so much. Why don't you 
simply add a config element and optionally store the code in the metadata?
BTW we already have the option to store the response headers - see 
[https://github.com/apache/nutch/commit/23c7761aff830db82a1e44b84bf81265639c9a26].
 You could use that and simply reparse the first line to get the code.


> Track Protocol Status in Crawl Datum
> 
>
> Key: NUTCH-2129
> URL: https://issues.apache.org/jira/browse/NUTCH-2129
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.3, 1.10
>Reporter: Michael Joyce
> Fix For: 2.4, 1.11
>
>
> It's become necessary on a few crawls that I run to get protocol status code 
> stats. After speaking with [~lewismc] it seemed that there might not be a 
> super convenient way of doing this as is, but it would be great to be able to 
> add the functionality necessary to pull this information out.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Webcast : Apache Nutch on EMR

2015-09-23 Thread Julien Nioche

Hi again,

I have uploaded at webcast explaining how to run Nutch on AWS Elastic Map
Reduce

https://www.youtube.com/watch?v=v9zjcTjjjyU

Please excuse the sound quality, hesitations and stuttering. I hope you
find it useful nonetheless.

Julien

-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble

Tutorial : Index the web with AWS CloudSearch

2015-09-23 Thread Julien Nioche

Hi everyone,

Just to let you know that we've just published a new tutorial on how to use
Nutch (and StormCrawler) to crawl and index documents into AWS CloudSearch.

This is related to the recent addition of NUTCH-1517
 in the trunk codebase.
The tutorial is aimed at beginners and gives step by step instructions on
how to use Nutch, including in distributed mode. It should also be relevant
for more advanced users as it provides an introduction to CloudSearch and a
comparison with StormCrawler.

The tutorial is on
http://digitalpebble.blogspot.co.uk/2015/09/index-web-with-aws-cloudsearch.html

Please retweet the announcement if you use Twitter [
https://twitter.com/digitalpebble/status/646614555192336384].

I hope you find it useful

Julien

-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

[jira] [Commented] (NUTCH-2095) WARC exporter for the CommonCrawlDataDumper


[ 
https://issues.apache.org/jira/browse/NUTCH-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902651#comment-14902651
 ] 

Julien Nioche commented on NUTCH-2095:
--

Thanks [~jorgelbg]. Please add a line to CHANGES.txt to describe what you did 
with this. Could you also edit 
[https://wiki.apache.org/nutch/CommonCrawlDataDumper] and describe what you 
added to the CCDD? Thanks

BTW the basic tests fail on my machine - do you get this too? e.g. for 
TestInjector

{code}
tried to access method com.google.common.base.Stopwatch.()V from class 
org.apache.hadoop.mapred.FileInputFormat
java.lang.IllegalAccessError: tried to access method 
com.google.common.base.Stopwatch.()V from class 
org.apache.hadoop.mapred.FileInputFormat
 {code}



> WARC exporter for the CommonCrawlDataDumper
> ---
>
> Key: NUTCH-2095
> URL: https://issues.apache.org/jira/browse/NUTCH-2095
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, tool
>Affects Versions: 1.11
>Reporter: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>  Labels: tools, warc
> Attachments: NUTCH-2095.patch
>
>
> Adds the possibility of exporting the nutch segments to a WARC files.
> From the usage point of view a couple of new command line options are 
> available:
> {{-warc}}: enables the functionality to export into WARC files, if not 
> specified the default JACKSON formatter is used.
> {{-warcSize}}: enable the option to define a max file size for each WARC 
> file, if not specified a default of 1GB per file is used as recommended by 
> the WARC ISO standard.
> The usual {{-gzip}} flag can be used to enable compression on the WARC files.
> Some changes to the default {{CommonCrawlDataDumper}} were done, essentially 
> some changes to the Factory and to the Formats. This changes avoid creating a 
> new instance of a {{CommmonCrawlFormat}} on each URL read from the segments. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (NUTCH-2095) WARC exporter for the CommonCrawlDataDumper


[ 
https://issues.apache.org/jira/browse/NUTCH-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902715#comment-14902715
 ] 

Julien Nioche commented on NUTCH-2095:
--

See [https://issues.apache.org/jira/browse/HADOOP-10961]. This is due to Guava 
17 which is inherited from webarchive-common version 1.1.5
I've excluded guava from it - in revision 1704641 and it has fixed the problem.

[~jorgelbg] please remember to run 'ant clean test' before committing something.

> WARC exporter for the CommonCrawlDataDumper
> ---
>
> Key: NUTCH-2095
> URL: https://issues.apache.org/jira/browse/NUTCH-2095
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, tool
>Affects Versions: 1.11
>Reporter: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>  Labels: tools, warc
> Attachments: NUTCH-2095.patch
>
>
> Adds the possibility of exporting the nutch segments to a WARC files.
> From the usage point of view a couple of new command line options are 
> available:
> {{-warc}}: enables the functionality to export into WARC files, if not 
> specified the default JACKSON formatter is used.
> {{-warcSize}}: enable the option to define a max file size for each WARC 
> file, if not specified a default of 1GB per file is used as recommended by 
> the WARC ISO standard.
> The usual {{-gzip}} flag can be used to enable compression on the WARC files.
> Some changes to the default {{CommonCrawlDataDumper}} were done, essentially 
> some changes to the Factory and to the Formats. This changes avoid creating a 
> new instance of a {{CommmonCrawlFormat}} on each URL read from the segments. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (NUTCH-2095) WARC exporter for the CommonCrawlDataDumper


[ 
https://issues.apache.org/jira/browse/NUTCH-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902578#comment-14902578
 ] 

Julien Nioche commented on NUTCH-2095:
--

[~jorgelbg] could you please fix the test. See below

{code}
Index: src/test/org/apache/nutch/tools/TestCommonCrawlDataDumper.java
===
--- src/test/org/apache/nutch/tools/TestCommonCrawlDataDumper.java  
(revision 1704612)
+++ src/test/org/apache/nutch/tools/TestCommonCrawlDataDumper.java  
(working copy)
@@ -101,8 +101,9 @@
 
CommonCrawlDataDumper dumper = new CommonCrawlDataDumper(
new CommonCrawlConfig());
-   dumper.dump(tempDir, sampleSegmentDir, false, null, false, "");
 
+   dumper.dump(tempDir, sampleSegmentDir, false, null, false, "", 
false);
+
Collection tempFiles = FileUtils.listFiles(tempDir,
FileFilterUtils.fileFileFilter(),
FileFilterUtils.directoryFileFilter());
{code}

> WARC exporter for the CommonCrawlDataDumper
> ---
>
> Key: NUTCH-2095
> URL: https://issues.apache.org/jira/browse/NUTCH-2095
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, tool
>Affects Versions: 1.11
>Reporter: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>  Labels: tools, warc
> Attachments: NUTCH-2095.patch
>
>
> Adds the possibility of exporting the nutch segments to a WARC files.
> From the usage point of view a couple of new command line options are 
> available:
> {{-warc}}: enables the functionality to export into WARC files, if not 
> specified the default JACKSON formatter is used.
> {{-warcSize}}: enable the option to define a max file size for each WARC 
> file, if not specified a default of 1GB per file is used as recommended by 
> the WARC ISO standard.
> The usual {{-gzip}} flag can be used to enable compression on the WARC files.
> Some changes to the default {{CommonCrawlDataDumper}} were done, essentially 
> some changes to the Factory and to the Formats. This changes avoid creating a 
> new instance of a {{CommmonCrawlFormat}} on each URL read from the segments. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Resolved] (NUTCH-2102) WARC Exporter


 [ 
https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-2102.
--
Resolution: Fixed

Committed revision 1704634.

Thanks for the reviews

> WARC Exporter
> -
>
> Key: NUTCH-2102
> URL: https://issues.apache.org/jira/browse/NUTCH-2102
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, dumpers
>Affects Versions: 1.10
>    Reporter: Julien Nioche
> Attachments: NUTCH-2102.patch
>
>
> This patch adds a WARC exporter 
> [http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike 
> the code submitted in [https://github.com/apache/nutch/pull/55] which is 
> based on the CommonCrawlDataDumper, this exporter is a MapReduce job and 
> hence should be able to cope with large segments in a timely fashion and also 
> is not limited to the local file system.
> Later on we could have a WARCImporter to generate segments from WARC files, 
> which is outside the scope of the CCDD anyway. Also WARC is not specific to 
> CommonCrawl, which is why the package name does not reflect it.
> I don't think it would be a problem to have both the modified CCDD and this 
> class providing similar functionalities.
> This class is called in the following way 
> ./nutch org.apache.nutch.tools.warc.WARCExporter 
> /data/nutch-dipe/1kcrawl/warc -dir /data/nutch-dipe/1kcrawl/segments/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (NUTCH-2102) WARC Exporter


 [ 
https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-2102:
-
Fix Version/s: 1.11

> WARC Exporter
> -
>
> Key: NUTCH-2102
> URL: https://issues.apache.org/jira/browse/NUTCH-2102
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, dumpers
>Affects Versions: 1.10
>    Reporter: Julien Nioche
> Fix For: 1.11
>
> Attachments: NUTCH-2102.patch
>
>
> This patch adds a WARC exporter 
> [http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike 
> the code submitted in [https://github.com/apache/nutch/pull/55] which is 
> based on the CommonCrawlDataDumper, this exporter is a MapReduce job and 
> hence should be able to cope with large segments in a timely fashion and also 
> is not limited to the local file system.
> Later on we could have a WARCImporter to generate segments from WARC files, 
> which is outside the scope of the CCDD anyway. Also WARC is not specific to 
> CommonCrawl, which is why the package name does not reflect it.
> I don't think it would be a problem to have both the modified CCDD and this 
> class providing similar functionalities.
> This class is called in the following way 
> ./nutch org.apache.nutch.tools.warc.WARCExporter 
> /data/nutch-dipe/1kcrawl/warc -dir /data/nutch-dipe/1kcrawl/segments/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Closed] (NUTCH-2114) kkk

2015-09-20 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-2114.

Resolution: Invalid

> kkk
> ---
>
> Key: NUTCH-2114
> URL: https://issues.apache.org/jira/browse/NUTCH-2114
> Project: Nutch
>  Issue Type: Bug
>  Components: administration gui, commoncrawl, injector
>Reporter: Badreddine Ahmed
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Fwd: Job Opening at Common Crawl - Crawl Engineer / Data Scientist

2015-09-18 Thread Julien Nioche

Nutch people,

Just in case you missed the announcement below. As you probably know CC use
Nutch for their crawls, this is a fantastic opportunity to put your Nutch
skills to great use!

Julien

-- Forwarded message --
From: Sara Crouse 
Date: 17 September 2015 at 22:51
Subject: Job Opening at Common Crawl - Crawl Engineer / Data Scientist
To: Common Crawl 


Hello again CC community,

In addition to my appointment, another staff transition is on the horizon,
and I would like to ask for your help finding candidates to fill a critical
role. At the end of this month, Stephen Merity (data scientist, crawl
engineer, and much more!) will leave Common Crawl to work on image
recognition and language understanding using deep learning at MetaMind, a
new startup. Stephen, has been a great asset to Common Crawl, and we are
grateful that he wishes to remain engaged with us in a volunteer capacity
going forward.

This week, we therefore launch a search to fill the role of Crawl
Engineer/Data Scientist. Below and posted here https://commoncrawl.org/jobs/
is the job description. We appreciate any help you can provide in spreading
the word about this unique opportunity. If you have specific referrals, or
wish to apply, please contact j...@commoncrawl.org.

Many thanks,

Sara

---

_CRAWL ENGINEER / DATA SCIENTIST at THE COMMON CRAWL FOUNDATION_

*Location*
San Francisco or Remote


*Job Summary*
Common Crawl (CC) is the non-profit organization that builds and maintains
the single largest publicly accessible dataset of the world’s knowledge,
encompassing petabytes of web crawl data.

If democratizing access to web information and tackling the engineering
challenges of working with data at the scale of the web sounds exciting to
you, we would love to hear from you. If you have worked on open source
projects before or can share code samples with us, please don't hesitate to
send relevant links along with your application.


*Description*

/Primary Responsibilities/
_Running the crawl_
* Spinning up and managing Hadoop clusters on Amazon EC2
* Running regular comprehensive crawls of the web using Nutch
* Preparing and publishing crawl data to data hosting partner, Amazon Web
Services
* Incident response and diagnosis of crawl issues as they occur, e.g.
** Replacing lost instances due to EC2 problems / spot instance losses
** Responding to and remedying webmaster queries and issues

_Crawl engineering_
* Maintaining, developing, and deploying new features as required by
running the Nutch crawler, e.g.:
** Providing netiquette features, such as following robots.txt, as
required, and load balancing a crawl across millions of domains
** Implementing and improving ranking algorithms to prioritize the crawling
of popular pages
* Extending existing tools to work efficiently with large datasets
* Working with the Nutch community to push improvements to the crawler to
the public

/Other Responsibilities/
* Building support tools and artifacts, including documentation, tutorials,
and example code or supporting frameworks for processing CC data using
different tools.
* Identifying and reporting on research and innovations that result from
analysis and derivative use of CC data.
* Community evangelism:
** Collaborating with partners in academia and industry
** Engaging regularly with user discussion group and responding to frequent
inquiries about how to use CC data
** Writing technical blog posts
** Presenting on or representing CC at conferences, meetups, etc.


*Qualifications*
/Minimum qualifications/
* Fluent in Java (Nutch and Hadoop are core to our mission)
* Familiarity with the JVM big data ecosystem (Hadoop, HDFS, ...)
* Knowledge the Amazon Web Services (AWS) ecosystem
* Experience with Python
* Basic command line Unix knowledge
* BS Computer Science or equivalent work experience

/Preferred qualifications/
* Experience with running web crawlers
* Cluster computing experience (Hadoop preferred)
* Running parallel jobs over dozens of terabytes of data
* Experience committing to open source projects and participating in open
source forums


*About Common Crawl*
The Common Crawl Foundation is a California 501(c)(3) registered non-profit
with the goal of democratizing access to web information by producing and
maintaining an open repository of web crawl data that is universally
accessible and analyzable.

Our vision is of a truly open web that allows open access to information
and enables greater innovation in research, business and education. We
level the playing field by making wholesale extraction, transformation and
analysis of web data cheap and easy.

The Common Crawl Foundation is an Equal Opportunity Employer.


*To Apply*
Please send your cover letter and resumé to j...@commoncrawl.org.

-- 
You received this message because you are subscribed to the Google Groups
"Common Crawl" group.
To unsubscribe

[jira] [Commented] (NUTCH-2102) WARC Exporter


[ 
https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14747300#comment-14747300
 ] 

Julien Nioche commented on NUTCH-2102:
--

The only modification to existing code is in the class 
'src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java'
 where we added two new config elements :
* store.http.request
* store.http.headers
which are used to keep the request and http headers verbatim in the content 
metadata. Both are set to false by default.

Note that this is also used by [#55](https://github.com/apache/nutch/pull/55)


> WARC Exporter
> -
>
> Key: NUTCH-2102
> URL: https://issues.apache.org/jira/browse/NUTCH-2102
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, dumpers
>Affects Versions: 1.10
>    Reporter: Julien Nioche
> Attachments: NUTCH-2102.patch
>
>
> This patch adds a WARC exporter 
> [http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike 
> the code submitted in [https://github.com/apache/nutch/pull/55] which is 
> based on the CommonCrawlDataDumper, this exporter is a MapReduce job and 
> hence should be able to cope with large segments in a timely fashion and also 
> is not limited to the local file system.
> Later on we could have a WARCImporter to generate segments from WARC files, 
> which is outside the scope of the CCDD anyway. Also WARC is not specific to 
> CommonCrawl, which is why the package name does not reflect it.
> I don't think it would be a problem to have both 
> [https://github.com/apache/nutch/pull/55] and this class providing similar 
> functionalities.
> This class is called in the following way 
> ./nutch org.apache.nutch.tools.warc.WARCExporter 
> /data/nutch-dipe/1kcrawl/warc -dir /data/nutch-dipe/1kcrawl/segments/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (NUTCH-2102) WARC Exporter


 [ 
https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-2102:
-
Description: 
This patch adds a WARC exporter 
[http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike the 
code submitted in [https://github.com/apache/nutch/pull/55] which is based on 
the CommonCrawlDataDumper, this exporter is a MapReduce job and hence should be 
able to cope with large segments in a timely fashion and also is not limited to 
the local file system.

Later on we could have a WARCImporter to generate segments from WARC files, 
which is outside the scope of the CCDD anyway. Also WARC is not specific to 
CommonCrawl, which is why the package name does not reflect it.

I don't think it would be a problem to have both the modified CCDD and this 
class providing similar functionalities.

This class is called in the following way 

./nutch org.apache.nutch.tools.warc.WARCExporter /data/nutch-dipe/1kcrawl/warc 
-dir /data/nutch-dipe/1kcrawl/segments/

  was:
This patch adds a WARC exporter 
[http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike the 
code submitted in [https://github.com/apache/nutch/pull/55] which is based on 
the CommonCrawlDataDumper, this exporter is a MapReduce job and hence should be 
able to cope with large segments in a timely fashion and also is not limited to 
the local file system.

Later on we could have a WARCImporter to generate segments from WARC files, 
which is outside the scope of the CCDD anyway. Also WARC is not specific to 
CommonCrawl, which is why the package name does not reflect it.

I don't think it would be a problem to have both 
[https://github.com/apache/nutch/pull/55] and this class providing similar 
functionalities.

This class is called in the following way 

./nutch org.apache.nutch.tools.warc.WARCExporter /data/nutch-dipe/1kcrawl/warc 
-dir /data/nutch-dipe/1kcrawl/segments/


> WARC Exporter
> -
>
> Key: NUTCH-2102
> URL: https://issues.apache.org/jira/browse/NUTCH-2102
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, dumpers
>Affects Versions: 1.10
>    Reporter: Julien Nioche
> Attachments: NUTCH-2102.patch
>
>
> This patch adds a WARC exporter 
> [http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike 
> the code submitted in [https://github.com/apache/nutch/pull/55] which is 
> based on the CommonCrawlDataDumper, this exporter is a MapReduce job and 
> hence should be able to cope with large segments in a timely fashion and also 
> is not limited to the local file system.
> Later on we could have a WARCImporter to generate segments from WARC files, 
> which is outside the scope of the CCDD anyway. Also WARC is not specific to 
> CommonCrawl, which is why the package name does not reflect it.
> I don't think it would be a problem to have both the modified CCDD and this 
> class providing similar functionalities.
> This class is called in the following way 
> ./nutch org.apache.nutch.tools.warc.WARCExporter 
> /data/nutch-dipe/1kcrawl/warc -dir /data/nutch-dipe/1kcrawl/segments/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (NUTCH-2102) WARC Exporter


[ 
https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14747301#comment-14747301
 ] 

Julien Nioche commented on NUTCH-2102:
--

Please review

> WARC Exporter
> -
>
> Key: NUTCH-2102
> URL: https://issues.apache.org/jira/browse/NUTCH-2102
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, dumpers
>Affects Versions: 1.10
>    Reporter: Julien Nioche
> Attachments: NUTCH-2102.patch
>
>
> This patch adds a WARC exporter 
> [http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike 
> the code submitted in [https://github.com/apache/nutch/pull/55] which is 
> based on the CommonCrawlDataDumper, this exporter is a MapReduce job and 
> hence should be able to cope with large segments in a timely fashion and also 
> is not limited to the local file system.
> Later on we could have a WARCImporter to generate segments from WARC files, 
> which is outside the scope of the CCDD anyway. Also WARC is not specific to 
> CommonCrawl, which is why the package name does not reflect it.
> I don't think it would be a problem to have both the modified CCDD and this 
> class providing similar functionalities.
> This class is called in the following way 
> ./nutch org.apache.nutch.tools.warc.WARCExporter 
> /data/nutch-dipe/1kcrawl/warc -dir /data/nutch-dipe/1kcrawl/segments/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Comment Edited] (NUTCH-2102) WARC Exporter


[ 
https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14747327#comment-14747327
 ] 

Julien Nioche edited comment on NUTCH-2102 at 9/16/15 11:21 AM:


Hi Markus

>  I believe this warc format is the updated arc format, for which we already 
> have an importer? 

The importer for WARC could be done later on and would leverage the same 
library as the exporter. But yes, it would look pretty similar to the ARC one .

> you meant to use StringBuilder instead of the synchronized StringBuffer in 
> HttpResponse
could do, will wait for other comments before amending the patch

> A bin/nutch entry is also missing, or not 
yes, why not. There's already far too much stuff in there :-) though. Again, I 
can amend it if ppl are +1 for committing this

Thanks for reviewing it




was (Author: jnioche):
Hi Markus

>  I believe this warc format is the updated arc format, for which we already 
> have an importer? 

The importer for WARC could be done later on and would leverage the same 
library as the exporter. But yes, it would look pretty similar to the ARC one .

> you meant to use StringBuilder instead of the synchronized StringBuffer in 
> HttpResponse
could do, will wait for other comments before amending the patch

> A bin/nutch entry is also missing, or not 
yes, why not. There's already far too much stuff in there :-) though. Again, I 
can amend it if ppl are +1 for committing this




> WARC Exporter
> -
>
> Key: NUTCH-2102
> URL: https://issues.apache.org/jira/browse/NUTCH-2102
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, dumpers
>Affects Versions: 1.10
>Reporter: Julien Nioche
> Attachments: NUTCH-2102.patch
>
>
> This patch adds a WARC exporter 
> [http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike 
> the code submitted in [https://github.com/apache/nutch/pull/55] which is 
> based on the CommonCrawlDataDumper, this exporter is a MapReduce job and 
> hence should be able to cope with large segments in a timely fashion and also 
> is not limited to the local file system.
> Later on we could have a WARCImporter to generate segments from WARC files, 
> which is outside the scope of the CCDD anyway. Also WARC is not specific to 
> CommonCrawl, which is why the package name does not reflect it.
> I don't think it would be a problem to have both the modified CCDD and this 
> class providing similar functionalities.
> This class is called in the following way 
> ./nutch org.apache.nutch.tools.warc.WARCExporter 
> /data/nutch-dipe/1kcrawl/warc -dir /data/nutch-dipe/1kcrawl/segments/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (NUTCH-2102) WARC Exporter


 [ 
https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-2102:
-
Description: 
This patch adds a WARC exporter 
[http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike the 
code submitted in [https://github.com/apache/nutch/pull/55] which is based on 
the CommonCrawlDataDumper, this exporter is a MapReduce job and hence should be 
able to cope with large segments in a timely fashion and also is not limited to 
the local file system.

Later on we could have a WARCImporter to generate segments from WARC files, 
which is outside the scope of the CCDD anyway. Also WARC is not specific to 
CommonCrawl, which is why the package name does not reflect it.

I don't think it would be a problem to have both 
[https://github.com/apache/nutch/pull/55] and this class providing similar 
functionalities.

This class is called in the following way 

./nutch org.apache.nutch.tools.warc.WARCExporter /data/nutch-dipe/1kcrawl/warc 
-dir /data/nutch-dipe/1kcrawl/segments/

  was:
This patch adds a WARC exporter 
[http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike the 
code submitted in [https://github.com/apache/nutch/pull/55] which is based on 
the CommonCrawlDataDumper, this exporter is a MapReduce job and hence should be 
able to cope with large segments in a timely fashion and also is not limited to 
the local file system.

Later on we could have a WARCImporter to generate segments from WARC files, 
which is outside the scope of the CCDD anyway. Also WARC is not specific to 
CommonCrawl, which is why the package name does not reflect it.

I don't think it would be a problem to have both 
[https://github.com/apache/nutch/pull/55] and this class providing similar 
functionalities.


> WARC Exporter
> -
>
> Key: NUTCH-2102
> URL: https://issues.apache.org/jira/browse/NUTCH-2102
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, dumpers
>Affects Versions: 1.10
>    Reporter: Julien Nioche
> Attachments: NUTCH-2102.patch
>
>
> This patch adds a WARC exporter 
> [http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike 
> the code submitted in [https://github.com/apache/nutch/pull/55] which is 
> based on the CommonCrawlDataDumper, this exporter is a MapReduce job and 
> hence should be able to cope with large segments in a timely fashion and also 
> is not limited to the local file system.
> Later on we could have a WARCImporter to generate segments from WARC files, 
> which is outside the scope of the CCDD anyway. Also WARC is not specific to 
> CommonCrawl, which is why the package name does not reflect it.
> I don't think it would be a problem to have both 
> [https://github.com/apache/nutch/pull/55] and this class providing similar 
> functionalities.
> This class is called in the following way 
> ./nutch org.apache.nutch.tools.warc.WARCExporter 
> /data/nutch-dipe/1kcrawl/warc -dir /data/nutch-dipe/1kcrawl/segments/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (NUTCH-2102) WARC Exporter


 [ 
https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-2102:
-
Attachment: (was: NUTCH-2102.patch)

> WARC Exporter
> -
>
> Key: NUTCH-2102
> URL: https://issues.apache.org/jira/browse/NUTCH-2102
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, dumpers
>Affects Versions: 1.10
>    Reporter: Julien Nioche
>
> This patch adds a WARC exporter 
> [http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike 
> the code submitted in [https://github.com/apache/nutch/pull/55] which is 
> based on the CommonCrawlDataDumper, this exporter is a MapReduce job and 
> hence should be able to cope with large segments in a timely fashion and also 
> is not limited to the local file system.
> Later on we could have a WARCImporter to generate segments from WARC files, 
> which is outside the scope of the CCDD anyway. Also WARC is not specific to 
> CommonCrawl, which is why the package name does not reflect it.
> I don't think it would be a problem to have both 
> [https://github.com/apache/nutch/pull/55] and this class providing similar 
> functionalities.
> This class is called in the following way 
> ./nutch org.apache.nutch.tools.warc.WARCExporter 
> /data/nutch-dipe/1kcrawl/warc -dir /data/nutch-dipe/1kcrawl/segments/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (NUTCH-2102) WARC Exporter


[ 
https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14747327#comment-14747327
 ] 

Julien Nioche commented on NUTCH-2102:
--

Hi Markus

>  I believe this warc format is the updated arc format, for which we already 
> have an importer? 

The importer for WARC could be done later on and would leverage the same 
library as the exporter. But yes, it would look pretty similar to the ARC one .

> you meant to use StringBuilder instead of the synchronized StringBuffer in 
> HttpResponse
could do, will wait for other comments before amending the patch

> A bin/nutch entry is also missing, or not 
yes, why not. There's already far too much stuff in there :-) though. Again, I 
can amend it if ppl are +1 for committing this




> WARC Exporter
> -
>
> Key: NUTCH-2102
> URL: https://issues.apache.org/jira/browse/NUTCH-2102
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, dumpers
>Affects Versions: 1.10
>Reporter: Julien Nioche
> Attachments: NUTCH-2102.patch
>
>
> This patch adds a WARC exporter 
> [http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike 
> the code submitted in [https://github.com/apache/nutch/pull/55] which is 
> based on the CommonCrawlDataDumper, this exporter is a MapReduce job and 
> hence should be able to cope with large segments in a timely fashion and also 
> is not limited to the local file system.
> Later on we could have a WARCImporter to generate segments from WARC files, 
> which is outside the scope of the CCDD anyway. Also WARC is not specific to 
> CommonCrawl, which is why the package name does not reflect it.
> I don't think it would be a problem to have both the modified CCDD and this 
> class providing similar functionalities.
> This class is called in the following way 
> ./nutch org.apache.nutch.tools.warc.WARCExporter 
> /data/nutch-dipe/1kcrawl/warc -dir /data/nutch-dipe/1kcrawl/segments/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (NUTCH-2102) WARC Exporter


 [ 
https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-2102:
-
Attachment: NUTCH-2102.patch

> WARC Exporter
> -
>
> Key: NUTCH-2102
> URL: https://issues.apache.org/jira/browse/NUTCH-2102
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, dumpers
>Affects Versions: 1.10
>    Reporter: Julien Nioche
> Attachments: NUTCH-2102.patch
>
>
> This patch adds a WARC exporter 
> [http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike 
> the code submitted in [https://github.com/apache/nutch/pull/55] which is 
> based on the CommonCrawlDataDumper, this exporter is a MapReduce job and 
> hence should be able to cope with large segments in a timely fashion and also 
> is not limited to the local file system.
> Later on we could have a WARCImporter to generate segments from WARC files, 
> which is outside the scope of the CCDD anyway. Also WARC is not specific to 
> CommonCrawl, which is why the package name does not reflect it.
> I don't think it would be a problem to have both 
> [https://github.com/apache/nutch/pull/55] and this class providing similar 
> functionalities.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (NUTCH-2102) WARC Exporter

Julien Nioche created NUTCH-2102:


 Summary: WARC Exporter
 Key: NUTCH-2102
 URL: https://issues.apache.org/jira/browse/NUTCH-2102
 Project: Nutch
  Issue Type: Improvement
  Components: commoncrawl, dumpers
Affects Versions: 1.10
Reporter: Julien Nioche


This patch adds a WARC exporter 
[http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike the 
code submitted in [https://github.com/apache/nutch/pull/55] which is based on 
the CommonCrawlDataDumper, this exporter is a MapReduce job and hence should be 
able to cope with large segments in a timely fashion and also is not limited to 
the local file system.

Later on we could have a WARCImporter to generate segments from WARC files, 
which is outside the scope of the CCDD anyway. Also WARC is not specific to 
CommonCrawl, which is why the package name does not reflect it.

I don't think it would be a problem to have both 
[https://github.com/apache/nutch/pull/55] and this class providing similar 
functionalities.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (NUTCH-2102) WARC Exporter


 [ 
https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-2102:
-
Attachment: NUTCH-2102.patch

> WARC Exporter
> -
>
> Key: NUTCH-2102
> URL: https://issues.apache.org/jira/browse/NUTCH-2102
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, dumpers
>Affects Versions: 1.10
>    Reporter: Julien Nioche
> Attachments: NUTCH-2102.patch
>
>
> This patch adds a WARC exporter 
> [http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike 
> the code submitted in [https://github.com/apache/nutch/pull/55] which is 
> based on the CommonCrawlDataDumper, this exporter is a MapReduce job and 
> hence should be able to cope with large segments in a timely fashion and also 
> is not limited to the local file system.
> Later on we could have a WARCImporter to generate segments from WARC files, 
> which is outside the scope of the CCDD anyway. Also WARC is not specific to 
> CommonCrawl, which is why the package name does not reflect it.
> I don't think it would be a problem to have both 
> [https://github.com/apache/nutch/pull/55] and this class providing similar 
> functionalities.
> This class is called in the following way 
> ./nutch org.apache.nutch.tools.warc.WARCExporter 
> /data/nutch-dipe/1kcrawl/warc -dir /data/nutch-dipe/1kcrawl/segments/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (NUTCH-2064) URLNormalizer basic to properly encode non-ASCII characters

2015-09-14 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744078#comment-14744078
 ] 

Julien Nioche commented on NUTCH-2064:
--

yep, can discuss that post 1.11

> URLNormalizer basic to properly encode non-ASCII characters
> ---
>
> Key: NUTCH-2064
> URL: https://issues.apache.org/jira/browse/NUTCH-2064
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.10
>Reporter: Markus Jelsma
> Fix For: 1.11
>
> Attachments: NUTCH-1098.patch, NUTCH-1098.patch, NUTCH-2064-v3.patch, 
> NUTCH-2064.patch
>
>
> NUTCH-1098 rewritten to work on trunk. Unit test is identical to 1098.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Re: [ANNOUNCE] New Nutch committer and PMC - Asitang Mishra

2015-09-10 Thread Julien Nioche

Congratulations Asitang and welcome!

Julien

On 9 September 2015 at 23:01, Sebastian Nagel 
wrote:

> Dear all,
>
> on behalf of the Nutch PMC it is my pleasure to announce
> that Asitang Mishra has joined the Nutch team as committer
> and PMC member. Asitang, please feel free to introduce
> yourself and to tell the Nutch community about your
> interests and your relation to Nutch.
>
> Congratulations and welcome on board!
>
> Regards,
> Sebastian (on behalf of the Nutch PMC)
>



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

[jira] [Commented] (NUTCH-2064) URLNormalizer basic to properly encode non-ASCII characters

2015-09-04 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731114#comment-14731114
 ] 

Julien Nioche commented on NUTCH-2064:
--

What about moving the basic URL normalizer to Crawler-Commons? see 
[https://github.com/crawler-commons/crawler-commons/issues/74]
We already rely on it for robots parsing and it other projects would be able to 
reuse it (as well as improve it). Any views on this?

> URLNormalizer basic to properly encode non-ASCII characters
> ---
>
> Key: NUTCH-2064
> URL: https://issues.apache.org/jira/browse/NUTCH-2064
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.10
>Reporter: Markus Jelsma
> Fix For: 1.11
>
> Attachments: NUTCH-1098.patch, NUTCH-1098.patch, NUTCH-2064-v3.patch, 
> NUTCH-2064.patch
>
>
> NUTCH-1098 rewritten to work on trunk. Unit test is identical to 1098.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Re: [DISCUSS] Release Nutch trunk 1.11

2015-08-26 Thread Julien Nioche

Hi Lewis

I'd love to see https://issues.apache.org/jira/browse/NUTCH-1517 being part
of 1.11. It is a separate indexing plugin which should not impact any
existing code. It's been reviewed by Jorge and I'll to commit it soon
unless someone objects.

Thanks

J.

On 26 August 2015 at 03:23, Lewis John Mcgibbney lewis.mcgibb...@gmail.com
wrote:

 Hi Folks,
 What do you all think about getting a release candidate out for Nutch
 1.11? I am happy to do RM role.
 Thanks
 Lewis


 --
 *Lewis*




-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: [DISCUSS] Release Nutch trunk 1.11

2015-08-26 Thread Julien Nioche

Done. Thanks Markus

On 26 August 2015 at 13:08, Markus Jelsma markus.jel...@openindex.io
wrote:

 Yes Julien, please commit. I do think
 https://issues.apache.org/jira/browse/NUTCH-2064 should also be included.
 But i have my hands full atm.

 -Original message-
 From: Julien Niochelists.digitalpeb...@gmail.com
 Sent: Wednesday 26th August 2015 13:51
 To: dev@nutch.apache.org
 Subject: Re: [DISCUSS] Release Nutch trunk 1.11

 Hi Lewis

 Id love to see https://issues.apache.org/jira/browse/NUTCH-1517 
 https://issues.apache.org/jira/browse/NUTCH-1517 being part of 1.11. It
 is a separate indexing plugin which should not impact any existing code.
 Its been reviewed by Jorge and Ill to commit it soon unless someone objects.

 Thanks

 J.

 On 26 August 2015 at 03:23, Lewis John Mcgibbney 
 lewis.mcgibb...@gmail.com mailto:lewis.mcgibb...@gmail.com wrote:

 Hi Folks,

 What do you all think about getting a release candidate out for Nutch
 1.11? I am happy to do RM role.

 Thanks

 Lewis

 --

 Lewis

 --

 Open Source Solutions for Text Engineering

 http://digitalpebble.blogspot.com/ http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com http://www.digitalpebble.com
 http://twitter.com/digitalpebble http://twitter.com/digitalpebble





-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

[jira] [Resolved] (NUTCH-1517) CloudSearch indexer

2015-08-26 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-1517.
--
Resolution: Fixed

trunk committed revision 1697911.

Thanks for comments and review

 CloudSearch indexer
 ---

 Key: NUTCH-1517
 URL: https://issues.apache.org/jira/browse/NUTCH-1517
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.11

 Attachments: 0023883254_1377197869_indexer-cloudsearch.patch, 
 0025666929_1382393138_indexer-cloudsearch.20131021.patch, NUTCH-1517.v2.patch


 Once we have made the indexers pluggable, we should add a plugin for Amazon 
 CloudSearch. See http://aws.amazon.com/cloudsearch/. Apparently it uses a 
 JSON based representation Search Data Format (SDF), which we could reuse for 
 a file based indexer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (NUTCH-1517) CloudSearch indexer

2015-08-26 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14712988#comment-14712988
 ] 

Julien Nioche commented on NUTCH-1517:
--

Thanks [~jorgelbg]. Will commit soon unless someone objects.

 CloudSearch indexer
 ---

 Key: NUTCH-1517
 URL: https://issues.apache.org/jira/browse/NUTCH-1517
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.11

 Attachments: 0023883254_1377197869_indexer-cloudsearch.patch, 
 0025666929_1382393138_indexer-cloudsearch.20131021.patch, NUTCH-1517.v2.patch


 Once we have made the indexers pluggable, we should add a plugin for Amazon 
 CloudSearch. See http://aws.amazon.com/cloudsearch/. Apparently it uses a 
 JSON based representation Search Data Format (SDF), which we could reuse for 
 a file based indexer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Resolved] (NUTCH-2049) Upgrade Trunk to Hadoop 2.4 stable

2015-08-24 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-2049.
--
Resolution: Fixed

Committed revision 1697466.

Thanks to everyone involved.

 Upgrade Trunk to Hadoop  2.4 stable
 

 Key: NUTCH-2049
 URL: https://issues.apache.org/jira/browse/NUTCH-2049
 Project: Nutch
  Issue Type: Improvement
  Components: build
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
  Labels: memex
 Fix For: 1.11

 Attachments: NUTCH-2049.patch, NUTCH-2049v2.patch, NUTCH-2049v3.patch


 Convo here - http://www.mail-archive.com/dev%40nutch.apache.org/msg18225.html
 I am +1 for taking trunk (or a branch of trunk) to explicit dependency on  
 Hadoop 2.6.
 We can run our tests, we can validate, we can fix.
 I will be doing validation on 2.X in paralegal as this is what I use on my 
 own projects. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop 2.4 stable

2015-08-21 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706402#comment-14706402
 ] 

Julien Nioche commented on NUTCH-2049:
--

Fantastic work [~lewismc]! I think this is one of the most important changes to 
Nutch in recent years. Well done.
Compilation and tests all fine, crawl in local mode OK. 

+1 to commit 

 Upgrade Trunk to Hadoop  2.4 stable
 

 Key: NUTCH-2049
 URL: https://issues.apache.org/jira/browse/NUTCH-2049
 Project: Nutch
  Issue Type: Improvement
  Components: build
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
  Labels: memex
 Fix For: 1.11

 Attachments: NUTCH-2049.patch, NUTCH-2049v2.patch, NUTCH-2049v3.patch


 Convo here - http://www.mail-archive.com/dev%40nutch.apache.org/msg18225.html
 I am +1 for taking trunk (or a branch of trunk) to explicit dependency on  
 Hadoop 2.6.
 We can run our tests, we can validate, we can fix.
 I will be doing validation on 2.X in paralegal as this is what I use on my 
 own projects. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (NUTCH-1517) CloudSearch indexer

2015-08-19 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1517:
-
Attachment: (was: NUTCH-1517.patch)

 CloudSearch indexer
 ---

 Key: NUTCH-1517
 URL: https://issues.apache.org/jira/browse/NUTCH-1517
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.11

 Attachments: 0023883254_1377197869_indexer-cloudsearch.patch, 
 0025666929_1382393138_indexer-cloudsearch.20131021.patch, NUTCH-1517.v2.patch


 Once we have made the indexers pluggable, we should add a plugin for Amazon 
 CloudSearch. See http://aws.amazon.com/cloudsearch/. Apparently it uses a 
 JSON based representation Search Data Format (SDF), which we could reuse for 
 a file based indexer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (NUTCH-1517) CloudSearch indexer

2015-08-19 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1517:
-
Flags: Patch

 CloudSearch indexer
 ---

 Key: NUTCH-1517
 URL: https://issues.apache.org/jira/browse/NUTCH-1517
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.11

 Attachments: 0023883254_1377197869_indexer-cloudsearch.patch, 
 0025666929_1382393138_indexer-cloudsearch.20131021.patch, NUTCH-1517.patch


 Once we have made the indexers pluggable, we should add a plugin for Amazon 
 CloudSearch. See http://aws.amazon.com/cloudsearch/. Apparently it uses a 
 JSON based representation Search Data Format (SDF), which we could reuse for 
 a file based indexer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (NUTCH-1517) CloudSearch indexer

2015-08-19 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1517:
-
Attachment: NUTCH-1517.patch

New implementation of the CloudSearchIndexWriter, uses the latest version of 
the CloudSearch API. See README file for instructions

 CloudSearch indexer
 ---

 Key: NUTCH-1517
 URL: https://issues.apache.org/jira/browse/NUTCH-1517
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.11

 Attachments: 0023883254_1377197869_indexer-cloudsearch.patch, 
 0025666929_1382393138_indexer-cloudsearch.20131021.patch, NUTCH-1517.patch


 Once we have made the indexers pluggable, we should add a plugin for Amazon 
 CloudSearch. See http://aws.amazon.com/cloudsearch/. Apparently it uses a 
 JSON based representation Search Data Format (SDF), which we could reuse for 
 a file based indexer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (NUTCH-2069) Ignore external links based on domain

2015-07-30 Thread Julien Nioche (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647467#comment-14647467
]

Julien Nioche commented on NUTCH-2069:
--

Hi [~wastl-nagel] and [~markus17]. BTW did not mean to be short in my previous
message but was typing from my phone ;-)
I know the difficulties of enforcing the code formatting systematically, but I
thought I might as well fix it while I was working on that part of the code.
Feel free to remove the bits from the patch that are about the formatting only.

bq. we could define this as two properties `db.ignore.external.links` +
`db.ignore.external.links.mode`. The latter can be host or domain, similar
to other properties (partition.url.mode, generator.count.mode,
fetcher.queue.mode). That would be extensible and can make the code leaner.

yes that would be more elegant

on vacation for the next few weeks as of today, will update the code based on
your suggestion when I am back unless one of you beats me to it of course.

Ignore external links based on domain
-

Key: NUTCH-2069
URL: https://issues.apache.org/jira/browse/NUTCH-2069
Project: Nutch
Issue Type: Improvement
Components: fetcher, parser
Affects Versions: 1.10
Reporter: Julien Nioche
Fix For: 1.11

Attachments: NUTCH-2069.patch

We currently have `db.ignore.external.links` which is a nice way of
restricting the crawl based on the hostname. This adds a new parameter
'db.ignore.external.links.domain' to do the same based on the domain.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (NUTCH-2069) Ignore external links based on domain


[ 
https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646543#comment-14646543
 ] 

Julien Nioche commented on NUTCH-2069:
--

What code restyle? I applied the formatting rules from 2.x as expected. They 
should be copied to trunk BTW. Looks like Lewis did not use  them

 Ignore external links based on domain
 -

 Key: NUTCH-2069
 URL: https://issues.apache.org/jira/browse/NUTCH-2069
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher, parser
Affects Versions: 1.10
Reporter: Julien Nioche
 Fix For: 1.11

 Attachments: NUTCH-2069.patch


 We currently have `db.ignore.external.links` which is a nice way of 
 restricting the crawl based on the hostname. This adds a new parameter 
 'db.ignore.external.links.domain' to do the same based on the domain.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (NUTCH-2069) Ignore external links based on domain

Julien Nioche created NUTCH-2069:


 Summary: Ignore external links based on domain
 Key: NUTCH-2069
 URL: https://issues.apache.org/jira/browse/NUTCH-2069
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher, parser
Affects Versions: 1.10
Reporter: Julien Nioche
 Fix For: 1.11


We currently have `db.ignore.external.links` which is a nice way of restricting 
the crawl based on the hostname. This adds a new parameter 
'db.ignore.external.links.domain' to do the same based on the domain.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (NUTCH-2069) Ignore external links based on domain


 [ 
https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-2069:
-
Attachment: NUTCH-2069.patch

 Ignore external links based on domain
 -

 Key: NUTCH-2069
 URL: https://issues.apache.org/jira/browse/NUTCH-2069
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher, parser
Affects Versions: 1.10
Reporter: Julien Nioche
 Fix For: 1.11

 Attachments: NUTCH-2069.patch


 We currently have `db.ignore.external.links` which is a nice way of 
 restricting the crawl based on the hostname. This adds a new parameter 
 'db.ignore.external.links.domain' to do the same based on the domain.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (NUTCH-2069) Ignore external links based on domain


 [ 
https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-2069:
-
Patch Info: Patch Available

 Ignore external links based on domain
 -

 Key: NUTCH-2069
 URL: https://issues.apache.org/jira/browse/NUTCH-2069
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher, parser
Affects Versions: 1.10
Reporter: Julien Nioche
 Fix For: 1.11

 Attachments: NUTCH-2069.patch


 We currently have `db.ignore.external.links` which is a nice way of 
 restricting the crawl based on the hostname. This adds a new parameter 
 'db.ignore.external.links.domain' to do the same based on the domain.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (NUTCH-2048) parse-tika: fix dependencies in plugin.xml

2015-07-24 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14640138#comment-14640138
 ] 

Julien Nioche commented on NUTCH-2048:
--

howto_upgrade_tika.txt has been around for 2 years. One possible explanation is 
that whoever committed the change already had a lib directory with the old 
dependencies in it.  The build-ivy.xml script should be modified to remove any 
existing content in the lib dir it creates.

 parse-tika: fix dependencies in plugin.xml
 --

 Key: NUTCH-2048
 URL: https://issues.apache.org/jira/browse/NUTCH-2048
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.10
Reporter: Sebastian Nagel
Priority: Trivial
 Fix For: 1.11

 Attachments: NUTCH-2048_Joyce_20150723.patch, 
 NUTCH-2048_Joyce_20150723_2.patch


 Duplicate library dependencies listed in parse-tika's plugin.xml should be 
 cleaned up. There are a duplicates, only the version differs, e.g.:
 {noformat}
 tika-parsers-1.7.jar
 tika-parsers-1.8.jar
 {noformat}
 Not critical because libs which are not present should be just ignored.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Assigned] (NUTCH-1517) CloudSearch indexer

2015-07-24 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche reassigned NUTCH-1517:


Assignee: Julien Nioche

 CloudSearch indexer
 ---

 Key: NUTCH-1517
 URL: https://issues.apache.org/jira/browse/NUTCH-1517
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.11

 Attachments: 0023883254_1377197869_indexer-cloudsearch.patch, 
 0025666929_1382393138_indexer-cloudsearch.20131021.patch


 Once we have made the indexers pluggable, we should add a plugin for Amazon 
 CloudSearch. See http://aws.amazon.com/cloudsearch/. Apparently it uses a 
 JSON based representation Search Data Format (SDF), which we could reuse for 
 a file based indexer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (NUTCH-2016) Remove OldFetcher from trunk


[ 
https://issues.apache.org/jira/browse/NUTCH-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600946#comment-14600946
 ] 

Julien Nioche commented on NUTCH-2016:
--

+1

 Remove OldFetcher from trunk
 

 Key: NUTCH-2016
 URL: https://issues.apache.org/jira/browse/NUTCH-2016
 Project: Nutch
  Issue Type: Wish
  Components: fetcher
Affects Versions: 1.11
Reporter: Sebastian Nagel
Priority: Trivial
 Fix For: 1.11


 The class OldFetcher is not actively maintained and lacks all features added 
 to the new threaded Fetcher (started in 2007, used as default fetcher since 
 2009). Time to remove it from the code base (trunk/1.x only)?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (NUTCH-2036) Adding some continuous crawl goodies to the crawl script


 [ 
https://issues.apache.org/jira/browse/NUTCH-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-2036:
-
Affects Version/s: (was: 1.11)

 Adding some continuous crawl goodies to the crawl script
 

 Key: NUTCH-2036
 URL: https://issues.apache.org/jira/browse/NUTCH-2036
 Project: Nutch
  Issue Type: Improvement
  Components: bin, tool, util
Affects Versions: 1.10
Reporter: Jorge Luis Betancourt Gonzalez
Priority: Minor
  Labels: crawl, script
 Fix For: 1.11

 Attachments: NUTCH-2036.patch


 Although Nutch does not support continuous crawling out of the box, and yes 
 this is somehow doable using cron or even sometimes irrelevant due the size 
 of the crawl its a nice feature to have. 
 This patch basically just adds a new parameter option to the {{bin/crawl}} 
 script (-w|--wait) which adds a time to wait if the generator returns 0 (when 
 no URLs are scheduled for fetching). 
 This new parameter has the {{NUMBER\[SUFFIX\]}} format, if no suffix is 
 provided the amount of time is assumed to be in seconds. Other valid suffixes 
 are: 
 s - second
 m - minutes
 h - hours
 d - days
 If a {{-1}} value is passed to the parameter or its not used at all the 
 default behaviour of exciting the script is used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (NUTCH-2036) Adding some continuous crawl goodies to the crawl script


[ 
https://issues.apache.org/jira/browse/NUTCH-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600949#comment-14600949
 ] 

Julien Nioche commented on NUTCH-2036:
--

Any thoughts on this? This is useful and should be committed I think.

 Adding some continuous crawl goodies to the crawl script
 

 Key: NUTCH-2036
 URL: https://issues.apache.org/jira/browse/NUTCH-2036
 Project: Nutch
  Issue Type: Improvement
  Components: bin, tool, util
Affects Versions: 1.10
Reporter: Jorge Luis Betancourt Gonzalez
Priority: Minor
  Labels: crawl, script
 Fix For: 1.11

 Attachments: NUTCH-2036.patch


 Although Nutch does not support continuous crawling out of the box, and yes 
 this is somehow doable using cron or even sometimes irrelevant due the size 
 of the crawl its a nice feature to have. 
 This patch basically just adds a new parameter option to the {{bin/crawl}} 
 script (-w|--wait) which adds a time to wait if the generator returns 0 (when 
 no URLs are scheduled for fetching). 
 This new parameter has the {{NUMBER\[SUFFIX\]}} format, if no suffix is 
 provided the amount of time is assumed to be in seconds. Other valid suffixes 
 are: 
 s - second
 m - minutes
 h - hours
 d - days
 If a {{-1}} value is passed to the parameter or its not used at all the 
 default behaviour of exciting the script is used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (NUTCH-2036) Adding some continuous crawl goodies to the crawl script


 [ 
https://issues.apache.org/jira/browse/NUTCH-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-2036:
-
Fix Version/s: 1.11

 Adding some continuous crawl goodies to the crawl script
 

 Key: NUTCH-2036
 URL: https://issues.apache.org/jira/browse/NUTCH-2036
 Project: Nutch
  Issue Type: Improvement
  Components: bin, tool, util
Affects Versions: 1.10
Reporter: Jorge Luis Betancourt Gonzalez
Priority: Minor
  Labels: crawl, script
 Fix For: 1.11

 Attachments: NUTCH-2036.patch


 Although Nutch does not support continuous crawling out of the box, and yes 
 this is somehow doable using cron or even sometimes irrelevant due the size 
 of the crawl its a nice feature to have. 
 This patch basically just adds a new parameter option to the {{bin/crawl}} 
 script (-w|--wait) which adds a time to wait if the generator returns 0 (when 
 no URLs are scheduled for fetching). 
 This new parameter has the {{NUMBER\[SUFFIX\]}} format, if no suffix is 
 provided the amount of time is assumed to be in seconds. Other valid suffixes 
 are: 
 s - second
 m - minutes
 h - hours
 d - days
 If a {{-1}} value is passed to the parameter or its not used at all the 
 default behaviour of exciting the script is used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (NUTCH-2046) The crawl script should be able to skip an initial injection.

2015-06-24 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599840#comment-14599840
 ] 

Julien Nioche commented on NUTCH-2046:
--

re-script : what about a positive parameter instead of a negative one (like we 
do for the indexing with -i)? Could have -s followed by the path to the seed.

 The crawl script should be able to skip an initial injection.
 -

 Key: NUTCH-2046
 URL: https://issues.apache.org/jira/browse/NUTCH-2046
 Project: Nutch
  Issue Type: Improvement
  Components: crawldb, injector
Affects Versions: 1.10
Reporter: Luis Lopez
  Labels: crawl, injection
 Fix For: 1.11


 When our crawl gets really big a new injection takes considerable time as it 
 updates crawldb, the crawl script should be able to skip the injection and go 
 directly to the generate call.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (NUTCH-2000) Link inversion fails with .locked already exists.

2015-06-17 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14589951#comment-14589951
 ] 

Julien Nioche commented on NUTCH-2000:
--

Hi Seb, +1 to commit. Not sure I'll be able to reproduce it but let's assume it 
fixes it. Can always reopen later if still a problem

 Link inversion fails with .locked already exists.
 -

 Key: NUTCH-2000
 URL: https://issues.apache.org/jira/browse/NUTCH-2000
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.9
Reporter: Julien Nioche
 Fix For: 1.11

 Attachments: NUTCH-2000-v1.patch


 using standard crawl script with a brand new test dir in local mode I am 
 getting 
 Link inversion
 /data/BLABLABLA/runtime/local/bin/nutch invertlinks 
 /data/BLABLABLA/testCrawl2//linkdb 
 /data/BLABLABLA/testCrawl2//segments/20150423114335
 LinkDb: java.io.IOException: lock file 
 /data/BLABLABLA/testCrawl2/linkdb/.locked already exists.
 PS: 2000!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

crawler-commons 0.6 released

2015-06-11 Thread Julien Nioche

[Apologies for cross posting]crawler-commons 0.6 is released

We are glad to announce the 0.6 release of Crawler Commons. See the
CHANGES.txt
https://github.com/crawler-commons/crawler-commons/releases/tag/crawler-commons-0.6
file
included with the release for a full list of details.

We suggest all users to upgrade to this version. Details of how to do so
can be found on Maven Central
http://search.maven.org/#artifactdetails%7Ccom.github.crawler-commons%7Ccrawler-commons%7C0.6%7Cjar.
Please note that the groupId has changed to *com.github.crawler-commons*.
Thanks to all contributors

Julien

https://github.com/crawler-commons/crawler-commons

[jira] [Resolved] (NUTCH-2006) IndexingFiltersChecker to take custom metadata as input

2015-05-15 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-2006.
--
   Resolution: Fixed
Fix Version/s: 1.11

Committed revision 1679567.

Thanks Seb

 IndexingFiltersChecker  to take custom metadata as input
 

 Key: NUTCH-2006
 URL: https://issues.apache.org/jira/browse/NUTCH-2006
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.10
Reporter: Julien Nioche
Priority: Minor
 Fix For: 1.11

 Attachments: NUTCH-2006.patch


 Similar to [NUTCH-1757] but for IndexingFiltersChecker.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (NUTCH-2012) Merge parsechecker and indexchecker

2015-05-15 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545534#comment-14545534
 ] 

Julien Nioche commented on NUTCH-2012:
--

+1 to merging them into a more generic tool. Most of the code in these 2 
classes is the same. We could add a few options e.g. not to display the fields 
generated for the indexing

 Merge parsechecker and indexchecker
 ---

 Key: NUTCH-2012
 URL: https://issues.apache.org/jira/browse/NUTCH-2012
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.10
Reporter: Sebastian Nagel
Priority: Minor
 Fix For: 1.11


 ParserChecker and IndexingFiltersChecker have evolved from simple tools to 
 check parsers and parsefilters resp. indexing filters to powerful tools which 
 emulate the crawling of a single URL/document:
 - check robots.txt (NUTCH-2002)
 - follow redirects (NUTCH-2004)
 Keeping both tools in sync takes extra work (cf. NUTCH-1757/NUTCH-2006, also 
 NUTCH-2002, NUTCH-2004 are done only for parsechecker). It's time to merge 
 them
 * either into one general debugging tool, keeping parsechecker and 
 indexchecker as aliases
 * centralize common code in one utility class



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (NUTCH-2008) IndexerMapReduce to use single instance of NutchIndexAction for deletions

2015-05-13 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541843#comment-14541843
 ] 

Julien Nioche commented on NUTCH-2008:
--

Makes total sense. +1
Could also make it static while we are at it?

 IndexerMapReduce to use single instance of NutchIndexAction for deletions
 -

 Key: NUTCH-2008
 URL: https://issues.apache.org/jira/browse/NUTCH-2008
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.10
Reporter: Sebastian Nagel
Priority: Trivial
 Fix For: 1.11

 Attachments: NUTCH-2008-trunk-v1.patch


 For every URL/document to be deleted a new instance of NutchIndexAction is 
 created in IndexerMapReduce (in multiple positions):
 {code}
 NutchIndexAction action = new NutchIndexAction(null,
NutchIndexAction.DELETE);
 output.collect(key, action);
 {code}
 Since the index action does not hold any data specific to any URL/document it 
 would be more efficient to re-use a single instance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (NUTCH-2006) IndexingFiltersChecker to take custom metadata as input

Julien Nioche created NUTCH-2006:


 Summary: IndexingFiltersChecker  to take custom metadata as input
 Key: NUTCH-2006
 URL: https://issues.apache.org/jira/browse/NUTCH-2006
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.10
Reporter: Julien Nioche
Priority: Minor


Similar to [NUTCH-1757] but for IndexingFiltersChecker.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (NUTCH-2006) IndexingFiltersChecker to take custom metadata as input


 [ 
https://issues.apache.org/jira/browse/NUTCH-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-2006:
-
Attachment: NUTCH-2006.patch

Patch which allows to take custom metadata into account + improved used of 
slf4j and Configured.

 IndexingFiltersChecker  to take custom metadata as input
 

 Key: NUTCH-2006
 URL: https://issues.apache.org/jira/browse/NUTCH-2006
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.10
Reporter: Julien Nioche
Priority: Minor
 Attachments: NUTCH-2006.patch


 Similar to [NUTCH-1757] but for IndexingFiltersChecker.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (NUTCH-2006) IndexingFiltersChecker to take custom metadata as input


 [ 
https://issues.apache.org/jira/browse/NUTCH-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-2006:
-
Patch Info: Patch Available

 IndexingFiltersChecker  to take custom metadata as input
 

 Key: NUTCH-2006
 URL: https://issues.apache.org/jira/browse/NUTCH-2006
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.10
Reporter: Julien Nioche
Priority: Minor
 Attachments: NUTCH-2006.patch


 Similar to [NUTCH-1757] but for IndexingFiltersChecker.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (NUTCH-1999) Add http://nutch.apache.org/robots.txt