[jira] [Created] (NUTCH-3025) urlfilter-fast to filter based on the length of the URL

2023-11-06 Thread Julien Nioche (Jira)
Julien Nioche created NUTCH-3025:


 Summary: urlfilter-fast to filter based on the length of the URL
 Key: NUTCH-3025
 URL: https://issues.apache.org/jira/browse/NUTCH-3025
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.19
Reporter: Julien Nioche
 Fix For: 1.20


There currently is no filter implementation to remove URLs based on their 
length or the length of their path / query.
Doing so with the regex filter would be inefficient; instead we could implement 
it in _urlfilter-fast_.
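
A minimal sketch of what such a check could look like (hypothetical, not the 
actual NUTCH-3025 implementation; the threshold fields are illustrative names 
for values that would come from the plugin's rule file):

{code}
// Hypothetical sketch only -- not the committed NUTCH-3025 code.
public class UrlLengthCheck {
  private int maxUrlLength = 512;
  private int maxPathLength = 256;
  private int maxQueryLength = 256;

  /** Returns the URL if accepted, or null to filter it out. */
  public String filter(String urlString) {
    if (maxUrlLength > 0 && urlString.length() > maxUrlLength) {
      return null; // reject: overall URL too long
    }
    try {
      java.net.URL url = new java.net.URL(urlString);
      String path = url.getPath();
      String query = url.getQuery();
      if (maxPathLength > 0 && path != null && path.length() > maxPathLength) {
        return null; // reject: path too long
      }
      if (maxQueryLength > 0 && query != null && query.length() > maxQueryLength) {
        return null; // reject: query too long
      }
    } catch (java.net.MalformedURLException e) {
      return null; // unparsable URLs are filtered out as well
    }
    return urlString; // keep
  }
}
{code}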



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-3017) Allow fast-urlfilter to load from HDFS/S3 and support gzipped input

2023-10-30 Thread Julien Nioche (Jira)
Julien Nioche created NUTCH-3017:


 Summary: Allow fast-urlfilter to load from HDFS/S3 and support 
gzipped input
 Key: NUTCH-3017
 URL: https://issues.apache.org/jira/browse/NUTCH-3017
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.19
Reporter: Julien Nioche


This provides an easier way to refresh the resources since no rebuild of the jar 
is needed. The path can point to either HDFS or S3. Additionally, .gz 
files should be handled automatically.
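
A rough sketch of the idea, assuming the rules are read through Hadoop's 
FileSystem API (the class and method names below are illustrative, not the 
committed code):

{code}
// Illustrative sketch, not the committed NUTCH-3017 code: open the rule file
// through Hadoop's FileSystem abstraction so the same path works for HDFS,
// S3 (s3a://) or the local filesystem, and unwrap gzip transparently.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FastUrlFilterRulesLoader {
  public static BufferedReader openRules(String location, Configuration conf)
      throws IOException {
    Path path = new Path(location);            // e.g. s3a://bucket/rules.txt.gz
    FileSystem fs = path.getFileSystem(conf);  // resolves hdfs://, s3a://, file://
    InputStream in = fs.open(path);
    if (location.endsWith(".gz")) {
      in = new GZIPInputStream(in);            // gzipped rules handled automatically
    }
    return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
  }
}
{code}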



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [ANNOUNCE] New Nutch committer and PMC - Tim Allison

2023-07-20 Thread Julien Nioche
What a fantastic addition to the Nutch team! Congrats to Tim

On Thu, 20 Jul 2023 at 10:20, Sebastian Nagel  wrote:

> Dear all,
>
> It is my pleasure to announce that Tim Allison has joined us
> as a committer and member of the Nutch PMC.
>
> You may already know Tim as a maintainer of and contributor to
> Apache Tika. So, it was great to see contributions to the
> Nutch source code from an experienced developer who is also
> active in a related Apache project. Among other contributions
> Tim recently implemented the indexer-opensearch plugin.
>
> Thank you, Tim Allison, and congratulations on your new role
> in the Apache Nutch community! And welcome on board!
>
> Sebastian
> (on behalf of the Nutch PMC)
>


-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble 


[jira] [Commented] (NUTCH-2648) Make configurable whether TLS/SSL certificates are checked by protocol plugins

2018-10-09 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16643035#comment-16643035
 ] 

Julien Nioche commented on NUTCH-2648:
--

[~wastl-nagel]

?? (code borrowed from 
[storm-crawler#615|https://github.com/DigitalPebble/storm-crawler/issues/615], 
thanks [~jnioche]!)??

You are welcome. Would be fab if you could find the time to add the same 
behaviour to the httpclient protocol in SC :)

> Make configurable whether TLS/SSL certificates are checked by protocol plugins
> --
>
> Key: NUTCH-2648
> URL: https://issues.apache.org/jira/browse/NUTCH-2648
> Project: Nutch
>  Issue Type: Improvement
>  Components: protocol
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.16
>
>
> (see discussion in NUTCH-2647)
> It should be possible to enable/disable TLS/SSL certificate validation 
> centrally for all http/https protocol plugins by a single configuration 
> property.
> Some use cases (e.g. crawling a site to detect insecure pages) may require that 
> TLS/SSL certificates are checked. Also, a broader, unrestricted web crawl may 
> skip sites with invalid certificates as this can be an indicator of the 
> quality of a site.
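
For illustration, this is how such a switch is typically wired up in Java (a 
hedged sketch only; the property name below is made up and not necessarily the 
one adopted in NUTCH-2648):

{code}
// Illustrative only: build an SSLContext that skips certificate validation
// when a (hypothetical) configuration flag is set to false.
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;
import org.apache.hadoop.conf.Configuration;

public class TlsContextFactory {
  public static SSLContext create(Configuration conf) throws Exception {
    // hypothetical property name, for illustration only
    boolean checkCerts = conf.getBoolean("http.tls.certificates.check", true);
    if (checkCerts) {
      return SSLContext.getDefault(); // normal certificate validation
    }
    TrustManager trustAll = new X509TrustManager() {
      public void checkClientTrusted(X509Certificate[] chain, String authType) {}
      public void checkServerTrusted(X509Certificate[] chain, String authType) {}
      public X509Certificate[] getAcceptedIssuers() { return new X509Certificate[0]; }
    };
    SSLContext ctx = SSLContext.getInstance("TLS");
    ctx.init(null, new TrustManager[] { trustAll }, null);
    return ctx;
  }
}
{code}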



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Crawler-Commons 0.10 released

2018-06-07 Thread Julien Nioche
Hi

We are glad to announce the 0.10 release of Crawler-Commons. See the CHANGES.txt
file included with the release for a full list of details. This version contains,
among other things, improvements to the Sitemap parsing and the removal of
the Tika dependency.

As usual, this latest version contains numerous improvements and bugfixes
and all users are invited to upgrade to it.

Thanks to all committers, contributors and users.

Julien

-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble 


Re: [VOTE] Release Apache Nutch 1.14 RC#1

2017-12-19 Thread Julien Nioche
+1 to release, thanks Seb

On 18 December 2017 at 22:12, Sebastian Nagel 
wrote:

> Hi Folks,
>
> A first candidate for the Nutch 1.14 release is available at:
>
>   https://dist.apache.org/repos/dist/dev/nutch/1.14/
>
> The release candidate is a zip and tar.gz archive of the binary and
> sources in:
>   https://github.com/apache/nutch/tree/release-1.14
> The SHA1 checksum of the release commit is
>   a8e60bdfb79b368612f068ed5aeeb690e29b448d
>
> In addition, a staged maven repository is available here:
>   https://repository.apache.org/content/repositories/orgapachenutch-1014/
>
> We addressed 79 Issues:
>https://issues.apache.org/jira/secure/ReleaseNote.jspa?
> projectId=10680=12340218
>
> Please vote on releasing this package as Apache Nutch 1.14.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Nutch PMC votes are cast.
>
> [ ] +1 Release this package as Apache Nutch 1.14.
> [ ] -1 Do not release this package because…
>
> Cheers,
> Sebastian
> (On behalf of the Nutch PMC)
>
> P.S. Here is my +1.
>



-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble 


Re: [DISCUSS] Release 1.14?

2017-12-14 Thread Julien Nioche
FYI Tika 1.17 has just been released
http://www.apache.org/dist/tika/CHANGES-1.17.txt

On 12 December 2017 at 12:36, Sebastian Nagel <wastl.na...@googlemail.com>
wrote:

> Hi Julien,
>
> yes, I know there's an open issue by Markus which depends on Tika 1.17.
> If the Tika release happens this week, I'll make sure that it's included.
>
> Thanks,
> Sebastian
>
>
> On 12/11/2017 10:22 AM, Julien Nioche wrote:
> > Tika 1.17 will be released shortly; maybe it would be worth waiting a
> > bit and integrating it first?
> >
> > On 8 December 2017 at 22:53, Sebastian Nagel <wastl.na...@googlemail.com
> > <mailto:wastl.na...@googlemail.com>> wrote:
> >
> > Hi all,
> >
> > 50+ issues fixed
> >   https://issues.apache.org/jira/projects/NUTCH/versions/12340218
> > <https://issues.apache.org/jira/projects/NUTCH/versions/12340218>
> >
> > Of course, as always and still many open issues. But maybe it's time
> to
> > push a release now and try to integrate the next features and
> > fixes early next year. What do you think?
> >
> > The last release (1.13) dates 8 months back (April 2017).
> >
> > I would be ready to push a release candidate next week.
> >
> >
> > Sebastian
> >
> >
> >
> >
> > --
> >
> > *Open Source Solutions for Text Engineering*
> >
> > http://www.digitalpebble.com <http://www.digitalpebble.com/>
> > http://digitalpebble.blogspot.com/
> > #digitalpebble <http://twitter.com/digitalpebble>
>
>


-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble <http://twitter.com/digitalpebble>


Re: [DISCUSS] Release 1.14?

2017-12-11 Thread Julien Nioche
Tika 1.17 will be released shortly; maybe it would be worth waiting a bit
and integrating it first?

On 8 December 2017 at 22:53, Sebastian Nagel 
wrote:

> Hi all,
>
> 50+ issues fixed
>   https://issues.apache.org/jira/projects/NUTCH/versions/12340218
>
> Of course, as always, there are still many open issues. But maybe it's time to
> push a release now and try to integrate the next features and
> fixes early next year. What do you think?
>
> The last release (1.13) dates 8 months back (April 2017).
>
> I would be ready to push a release candidate next week.
>
>
> Sebastian
>



-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble 


Crawler-Commons 0.9 released

2017-10-31 Thread Julien Nioche
Happy Halloween!

We are glad to announce the 0.9 release of Crawler-Commons. See the
CHANGES.txt

file
included with the release for a full list of details. The main changes are
the removal of DOM-based sitemap parser as the SAX equivalent introduced in
the previous version has better performance and is also more robust.

You might need to change your code to replace SiteMapParserSAX with
SiteMapParser. The parser is now aware of namespaces, and by default does
not force the namespace to be the one recommended in the specification (
http://www.sitemaps.org/schemas/sitemap/0.9) as variants can be found in
the wild. You can set the behaviour using the method
*setStrictNamespace(boolean)*.
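
For illustration, a minimal usage sketch against the 0.9 API (assuming a locally
saved sitemap file and the URL shown here; error handling omitted):

import crawlercommons.sitemaps.AbstractSiteMap;
import crawlercommons.sitemaps.SiteMapParser;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SitemapExample {
  public static void main(String[] args) throws Exception {
    SiteMapParser parser = new SiteMapParser();
    // only accept the official sitemaps.org namespace; the default is lenient
    parser.setStrictNamespace(true);
    byte[] content = Files.readAllBytes(Paths.get("sitemap.xml"));
    AbstractSiteMap sm = parser.parseSiteMap("application/xml", content,
        new URL("http://example.com/sitemap.xml"));
    System.out.println(sm);
  }
}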

As usual, version 0.9 contains numerous improvements and bugfixes, and
all users are invited to upgrade to this version.
Thanks to all committers, contributors and users.

Julien


Re: Establishment of Static Source Code Analysis

2017-06-16 Thread Julien Nioche
More seriously, I have no idea who has done it, but it is useful feedback. A similar
company (DevFactory) contributed to StormCrawler
<https://github.com/DigitalPebble/storm-crawler/commits?author=AymanDF> some
time ago. Also reminds me of the discussion we had around Sonar in
crawler-commons
<https://github.com/crawler-commons/crawler-commons/pull/127>.

On 16 June 2017 at 08:55, Julien Nioche <lists.digitalpeb...@gmail.com>
wrote:

>  Russian compatriots
>
>
> Are we all Russian then?
>
> On 16 June 2017 at 04:29, lewis john mcgibbney <lewi...@apache.org> wrote:
>
>> Hi Folks,
>> I don't know if anyone else noticed... some of our Russian compatriots
>> have set up a static auto bot to notify us of source code issues...
>> An example is as follows
>> https://issues.apache.org/jira/browse/NUTCH-2394
>> I think this is great to be honest... with some peer review I think we
>> could take this seriously.
>> Out of curiosity is anyone responsible for this?
>> Lewis
>>
>> --
>> http://home.apache.org/~lewismc/
>> @hectorMcSpector
>> http://www.linkedin.com/in/lmcgibbney
>>
>
>
>
> --
>
> *Open Source Solutions for Text Engineering*
>
> http://www.digitalpebble.com
> http://digitalpebble.blogspot.com/
> #digitalpebble <http://twitter.com/digitalpebble>
>



-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble <http://twitter.com/digitalpebble>


Re: Establishment of Static Source Code Analysis

2017-06-16 Thread Julien Nioche
>
>  Russian compatriots


Are we all Russian then?

On 16 June 2017 at 04:29, lewis john mcgibbney  wrote:

> Hi Folks,
> I don't know if anyone else noticed... some of our Russian compatriots
> have set up a static auto bot to notify us of source code issues...
> An example is as follows
> https://issues.apache.org/jira/browse/NUTCH-2394
> I think this is great to be honest... with some peer review I think we
> could take this seriously.
> Out of curiosity is anyone responsible for this?
> Lewis
>
> --
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney
>



-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble 


Crawler-Commons 0.8 released

2017-06-09 Thread Julien Nioche
Apologies for cross-posting

The Crawler-Commons project is pleased to announce its 0.8 release.

https://github.com/crawler-commons/crawler-commons/releases/tag/crawler-commons-0.8

If you are wondering what Crawler-Commons is about :

*Crawler-Commons is a set of reusable Java components that implement
functionality common to any web crawler. These components benefit from
collaboration among various existing web crawler projects and reduce
duplication of effort. *

The artefacts are available from Maven Central; simply add the following to
your project's POM file.

<dependency>
  <groupId>com.github.crawler-commons</groupId>
  <artifactId>crawler-commons</artifactId>
  <version>0.8</version>
</dependency>


Thanks to all contributors and users and happy crawling!


Julien (on behalf of the Crawler-Commons committers)

-- 

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble 


[jira] [Resolved] (NUTCH-2046) The crawl script should be able to skip an initial injection.

2017-04-18 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-2046.
--
Resolution: Fixed
  Assignee: Julien Nioche  (was: Lewis John McGibbney)

> The crawl script should be able to skip an initial injection.
> -
>
> Key: NUTCH-2046
> URL: https://issues.apache.org/jira/browse/NUTCH-2046
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb, injector
>Affects Versions: 1.10
>Reporter: Luis Lopez
>    Assignee: Julien Nioche
>  Labels: crawl, injection
> Fix For: 1.14
>
> Attachments: crawl.patch
>
>
> When our crawl gets really big, a new injection takes considerable time as it 
> updates the crawldb; the crawl script should be able to skip the injection and go 
> directly to the generate call.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Closed] (NUTCH-1371) Replace Ivy with Maven Ant tasks

2017-04-07 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-1371.

Resolution: Duplicate

> Replace Ivy with Maven Ant tasks
> 
>
> Key: NUTCH-1371
> URL: https://issues.apache.org/jira/browse/NUTCH-1371
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.7, 2.2.1
>    Reporter: Julien Nioche
>Assignee: Lewis John McGibbney
> Attachments: NUTCH-1371-2x.patch, NUTCH-1371.patch, 
> NUTCH-1371-plugins.trunk.patch, NUTCH-1371-pom.patch, 
> NUTCH-1371-r1461140.patch
>
>
> We might move to Maven altogether, but a good intermediate step could be to 
> rely on the Maven Ant tasks for managing the dependencies. Ivy does a good 
> job but we need to have a pom file anyway for publishing the artefacts, which 
> means keeping the pom.xml and ivy.xml contents in sync. Most devs are also 
> more familiar with Maven, and it is well integrated in IDEs. Going the 
> ANT+MVN way also means that we don't have to rewrite the whole build 
> process and can rely on our existing script.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


Re: [VOTE] Release Apache Nutch 1.13 RC#1

2017-03-29 Thread Julien Nioche
Hi Lewis

+1 compiled from source and ran a small crawl in local mode. All good!

Thanks

Julien

On 29 March 2017 at 05:20, lewis john mcgibbney  wrote:

> Hi Folks,
>
> A first candidate for the Nutch 1.13 release is available at:
>
>   https://dist.apache.org/repos/dist/dev/nutch/1.13/
>
> The release candidate is a zip and tar.gz archive of the binary and sources
> in:
> https://github.com/apache/nutch/tree/release-1.13
>
> The SHA1 checksum of the archive is
> bd0da3569aa14105799ed39204d4f0a31c77b42c
>
> In addition, a staged maven repository is available here:
>
> https://repository.apache.org/content/repositories/orgapachenutch-1013
>
> We addressed 29 Issues - https://s.apache.org/wq3x
>
> Please vote on releasing this package as Apache Nutch 1.13.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Nutch PMC votes are cast.
>
> [ ] +1 Release this package as Apache Nutch 1.13.
> [ ] -1 Do not release this package because…
>
> Cheers,
> Lewis
> (On behalf of the Nutch PMC)
>
> P.S. Here is my +1.
>
> --
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney
>



-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble 


[jira] [Commented] (NUTCH-2363) Fetcher support for reading and setting cookies

2017-03-01 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890043#comment-15890043
 ] 

Julien Nioche commented on NUTCH-2363:
--

Got it! Thanks for the explanation [~markus17]! Had missed [NUTCH-2355]

> Fetcher support for reading and setting cookies
> ---
>
> Key: NUTCH-2363
> URL: https://issues.apache.org/jira/browse/NUTCH-2363
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.13
>
> Attachments: NUTCH-2363.patch
>
>
> Patch adds basic support for cookies in the fetcher, and a scoring plugin 
> that passes cookies to its outlinks, within the domain. Sub-domain or path 
> based is not supported.
> This is useful if you want to maintain sessions or need to get around a 
> cookie wall.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


Crawler-Commons 0.7 released

2016-11-24 Thread Julien Nioche
Apologies for cross-posting

The Crawler-Commons project is pleased to announce its 0.7 release.

https://github.com/crawler-commons/crawler-commons#24th-november-2016crawler-commons-07-released

The list of changes can be found here, the main one being that the project
requires Java 8.

If you are wondering what Crawler-Commons is about :

*Crawler-Commons is a set of reusable Java components that implement
functionality common to any web crawler. These components benefit from
collaboration among various existing web crawler projects and reduce
duplication of effort. *

Thanks

Julien (on behalf of the Crawler-Commons committers)

-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble 


[jira] [Resolved] (NUTCH-1531) URL filtering takes long time for very long URLs

2016-10-24 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-1531.
--
Resolution: Duplicate

No follow-up on this one, and the same functionality is discussed elsewhere.

> URL filtering takes long time for very long URLs
> 
>
> Key: NUTCH-1531
> URL: https://issues.apache.org/jira/browse/NUTCH-1531
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.6, 2.1, 1.7, 2.2
>Reporter: Fırat KÜÇÜK
>Priority: Minor
> Attachments: max_url_length.diff, test_case.txt
>
>
> Filtering very long URLs (such as base64 image generators) takes a long time 
> (hours). During the reduce phase it locks down the whole system for hours. Therefore 
> some URL length limitation is needed. We attached a small patch for this 
> improvement.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2320) URLFilterChecker to run as TCP Telnet service

2016-10-05 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15549206#comment-15549206
 ] 

Julien Nioche commented on NUTCH-2320:
--

Hi @markus17, you haven't left much time for people to comment and review your 
change. You opened the issue then committed it an hour later!

> URLFilterChecker to run as TCP Telnet service
> -
>
> Key: NUTCH-2320
> URL: https://issues.apache.org/jira/browse/NUTCH-2320
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.12
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.13
>
> Attachments: NUTCH-2320.patch
>
>
> Allow testing URL filters for web applications, just like the indexing filter 
> checker.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1371) Replace Ivy with Maven Ant tasks

2016-07-01 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15359504#comment-15359504
 ] 

Julien Nioche commented on NUTCH-1371:
--

None whatsoever [~lewismc]. Maybe mark it as duplicate and link to the new 
issue. Out of curiosity are you planning to revamp the plugin mechanism as 
well? 

> Replace Ivy with Maven Ant tasks
> 
>
> Key: NUTCH-1371
> URL: https://issues.apache.org/jira/browse/NUTCH-1371
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.7, 2.2.1
>    Reporter: Julien Nioche
>Assignee: Lewis John McGibbney
> Attachments: NUTCH-1371-2x.patch, NUTCH-1371-plugins.trunk.patch, 
> NUTCH-1371-pom.patch, NUTCH-1371-r1461140.patch, NUTCH-1371.patch
>
>
> We might move to Maven altogether, but a good intermediate step could be to 
> rely on the Maven Ant tasks for managing the dependencies. Ivy does a good 
> job but we need to have a pom file anyway for publishing the artefacts, which 
> means keeping the pom.xml and ivy.xml contents in sync. Most devs are also 
> more familiar with Maven, and it is well integrated in IDEs. Going the 
> ANT+MVN way also means that we don't have to rewrite the whole build 
> process and can rely on our existing script.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


ApacheCon EU Sevilla

2016-06-29 Thread Julien Nioche
Hi,

Sorry for cross-posting. As you are probably aware, the ApacheCon Europe
and Apache Big Data conferences will take place in Seville, Spain, on November
14-18, 2016.

http://events.linuxfoundation.org/events/apache-big-data-europe/

I just submitted a talk on StormCrawler (which
will touch on Apache Nutch as well) and I know that at least one other fellow
Nutch committer will be there.

Is anyone else planning on going? It would be interesting not only to catch
up within each respective project but also meet people from other crawl
related projects.

Best regards

Julien

-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble 


Re: [VOTE] Release Apache Nutch 1.12

2016-06-15 Thread Julien Nioche
+1

Thanks Lewis and team!

On 15 June 2016 at 06:14, lewis john mcgibbney  wrote:

> Hi Folks,
>
> A first candidate for the Nutch 1.12 release is available at:
>
> https://dist.apache.org/repos/dist/dev/nutch/1.12/
>
> The release candidate is a zip and tar archive of the sources tag available
> at:
>
> https://git-wip-us.apache.org/repos/asf?p=nutch.git;a=tag;h=2d6f6de656c60c0b04890c5d3db20805ca39cfd5
>
> In addition, a staged maven repository is available here:
>
> https://repository.apache.org/content/repositories/orgapachenutch-1012/
>
> Please vote on releasing this package as Apache Nutch 1.12.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Nutch PMC votes are cast.
>
> [ ] +1 Release this package as Apache Nutch 1.12.
> [ ] -1 Do not release this package because…
>
> Cheers,
> Lewis
>
> P.S. Here is my +1.
>



-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble 


[jira] [Commented] (NUTCH-2046) The crawl script should be able to skip an initial injection.

2016-02-11 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15142863#comment-15142863
 ] 

Julien Nioche commented on NUTCH-2046:
--

I agree with the objective, but I'd rather have a consistent approach and deal 
with that in the same way as we do for indexing, i.e. [-s seedPath]. It shouldn't 
be difficult to do.

> The crawl script should be able to skip an initial injection.
> -
>
> Key: NUTCH-2046
> URL: https://issues.apache.org/jira/browse/NUTCH-2046
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb, injector
>Affects Versions: 1.10
>Reporter: Luis Lopez
>Assignee: Lewis John McGibbney
>  Labels: crawl, injection
> Fix For: 1.12
>
> Attachments: crawl.patch
>
>
> When our crawl gets really big, a new injection takes considerable time as it 
> updates the crawldb; the crawl script should be able to skip the injection and go 
> directly to the generate call.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (NUTCH-2213) CommonCrawlDataDumper saves gzipped body in extracted form

2016-02-10 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche reopened NUTCH-2213:
--
  Assignee: Julien Nioche

The WARC exporter actually has the same issue as its CommonCrawl counterpart. 
When storing the HTTP headers verbatim we report an incorrect Content-Length 
for documents which were compressed.

No idea whether this affects CommonCrawlDataDumper.

@jrsr thanks for reporting this. Regarding keeping the original compressed content 
instead of decompressing it: this would have an impact on the subsequent 
processes, e.g. parsing. We'll probably just avoid storing the HTTP headers when 
the content is compressed and generate a WARC entity of type resource instead 
of response. See 
[https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/tools/warc/WARCExporter.java#L184]
 

> CommonCrawlDataDumper saves gzipped body in extracted form
> --
>
> Key: NUTCH-2213
> URL: https://issues.apache.org/jira/browse/NUTCH-2213
> Project: Nutch
>  Issue Type: Bug
>  Components: commoncrawl, dumpers
>Reporter: Joris Rau
>    Assignee: Julien Nioche
>Priority: Critical
>  Labels: easyfix
>
> I have downloaded [a WARC 
> file|https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2015-40/segments/1443738099622.98/warc/CC-MAIN-20151001222139-00240-ip-10-137-6-227.ec2.internal.warc.gz]
>  from the common crawl data. This file contains several gzipped responses 
> which are stored plaintext (without the gzip encoding).
> I used [warctools|https://github.com/internetarchive/warctools] from Internet 
> Archive to extract the responses out of the WARC file. However this tool 
> expects the Content-Length field to match the actual length of the body in 
> the WARC ([See the issue on 
> github|https://github.com/internetarchive/warctools/pull/14#issuecomment-182048962]).
>  warctools uses a more up to date version of hanzo warctools which is 
> recommended on the [Common Crawl 
> website|https://commoncrawl.org/the-data/get-started/] under "Processing the 
> file format".
> I have not been using Nutch and can therefore not say which versions are 
> affected by this.
> After reading [the official WARC 
> draft|http://archive-access.sourceforge.net/warc/warc_file_format-0.9.html] I 
> could not find out how gzipped content is supposed to be stored. However 
> probably multiple WARC file parsers will have an issue with this.
> It would be nice to know whether you consider this a bug and plan on fixing 
> this and whether this is a major issue which concerns most WARC files of the 
> Common Crawl data or only a small part.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (NUTCH-2213) CommonCrawlDataDumper saves gzipped body in extracted form

2016-02-10 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15140608#comment-15140608
 ] 

Julien Nioche edited comment on NUTCH-2213 at 2/10/16 10:36 AM:


Hi Joris

The WARC files provided by CommonCrawl are currently NOT generated by any Nutch 
code - and definitely not the CommonCrawlDataDumper which simply would not 
scale. The code used by CC for generating the WARCs has been written by them.

We've recently released a WarcExporter class (which is scalable). The files it 
generates should be compliant with the WARC specs, feel free to open an issue 
if this is not the case.


was (Author: jnioche):
Hi Joris

The WARC files provided by CommonCrawl are currently NOT generated by any Nutch 
code - and definitely not the CommonCrawlDataDumper which simply would not 
scale.

We've recently released a WarcExporter class (which is scalable). The files it 
generates should be compliant with the WARC specs, feel free to open an issue 
if this is not the case.

> CommonCrawlDataDumper saves gzipped body in extracted form
> --
>
> Key: NUTCH-2213
> URL: https://issues.apache.org/jira/browse/NUTCH-2213
> Project: Nutch
>  Issue Type: Bug
>  Components: commoncrawl, dumpers
>Reporter: Joris Rau
>Priority: Critical
>  Labels: easyfix
>
> I have downloaded [a WARC 
> file|https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2015-40/segments/1443738099622.98/warc/CC-MAIN-20151001222139-00240-ip-10-137-6-227.ec2.internal.warc.gz]
>  from the common crawl data. This file contains several gzipped responses 
> which are stored plaintext (without the gzip encoding).
> I used [warctools|https://github.com/internetarchive/warctools] from Internet 
> Archive to extract the responses out of the WARC file. However this tool 
> expects the Content-Length field to match the actual length of the body in 
> the WARC ([See the issue on 
> github|https://github.com/internetarchive/warctools/pull/14#issuecomment-182048962]).
>  warctools uses a more up to date version of hanzo warctools which is 
> recommended on the [Common Crawl 
> website|https://commoncrawl.org/the-data/get-started/] under "Processing the 
> file format".
> I have not been using Nutch and can therefore not say which versions are 
> affected by this.
> After reading [the official WARC 
> draft|http://archive-access.sourceforge.net/warc/warc_file_format-0.9.html] I 
> could not find out how gzipped content is supposed to be stored. However 
> probably multiple WARC file parsers will have an issue with this.
> It would be nice to know whether you consider this a bug and plan on fixing 
> this and whether this is a major issue which concerns most WARC files of the 
> Common Crawl data or only a small part.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2204) remove junit lib from runtime

2016-01-22 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15113021#comment-15113021
 ] 

Julien Nioche commented on NUTCH-2204:
--

+1

> remove junit lib from runtime
> -
>
> Key: NUTCH-2204
> URL: https://issues.apache.org/jira/browse/NUTCH-2204
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.11
>Reporter: Sebastian Nagel
>Priority: Trivial
> Fix For: 1.12
>
> Attachments: NUTCH-2204.patch
>
>
> The junit library is shipped in the Nutch bin package as an unnecessary 
> dependency (apache-nutch-1.11/lib/junit-3.8.1.jar). Unit tests use a 
> different library version:
> {noformat}
> % ls build/lib/junit* build/test/lib/junit*
> build/lib/junit-3.8.1.jar  build/test/lib/junit-4.11.jar
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [VOTE] Moving to Git

2016-01-08 Thread Julien Nioche
+1 to move to Git

Note: I don't think Dennis is on the PMC anymore.

Ju

On 8 January 2016 at 08:46, Chris Mattmann  wrote:

> Hi Everyone,
>
> I proposed this earlier, and we said we’d wait until after the
> 1.11 release. So it’s time to VOTE to move Nutch to Git. So
> far, the following people have expressed +1s and if I don’t hear
> otherwise, I will implicitly count their VOTE from the DISCUSS
> thread:
>
> +1 PMC
>
> Chris Mattmann*
> Sebastien Nagel*
> Michael Joyce*
> Asitang Mishra*
> Dennis Kubes*
> BlackIce
>
> Everyone else (or those above that would like to amend their VOTE),
> please VOTE below. I will leave the VOTE open for at least 72 hours.
>
> [ ] +1 Move the Nutch SCM to Writeable Git repositories at the ASF.
> [ ] +0 No opinion.
> [ ] -1 Don’t move the Nutch SCM to Writeable Git repositories at the
> ASF because…
>
> Please note, I created a page for Tika that is worth checking out and
> perhaps copying over to the Nutch wiki:
>
> http://wiki.apache.org/tika/UsingGit
>
> Please have a look as I think it will help with our workflows too.
>
> Cheers,
> Chris
>
>
>
>
> -Original Message-
> From: jpluser 
> Reply-To: "dev@nutch.apache.org" 
> Date: Wednesday, November 18, 2015 at 7:39 PM
> To: "dev@nutch.apache.org" 
> Subject: [DISCUSS] Moving to Git
>
> >Hi All,
> >
> >I propose that we consider moving to ASF supported writeable git
> >repos for Nutch. This would entail moving Nutch’s canonical repo
> >from:
> >
> >https://svn.apache.org/repos/asf/nutch
> >
> >TO
> >
> >https://git-wip-us.apache.org/repos/asf/nutch.git
> >
> >
> >We are already accepting PRs and so forth from Github and I think
> >many of us are using Git in our regular day to day workflows.
> >
> >Thoughts?
> >
> >Cheers,
> >Chris
> >
> >++
> >Chris Mattmann, Ph.D.
> >Chief Architect
> >Instrument Software and Science Data Systems Section (398)
> >NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >Office: 168-519, Mailstop: 168-527
> >Email: chris.a.mattm...@nasa.gov
> >WWW:  http://sunset.usc.edu/~mattmann/
> >++
> >Adjunct Associate Professor, Computer Science Department
> >University of Southern California, Los Angeles, CA 90089 USA
> >++
> >
> >
> >
>
>
>


-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble 


Re: [RELEASE] Apache Nutch 1.11

2015-12-08 Thread Julien Nioche
Thanks Lewis for taking care of the release and everyone involved.

Julien

On 8 December 2015 at 01:34, lewis john mcgibbney 
wrote:

> Hello Folks,
>
> 07 December 2015 - Nutch 1.11 Release
>
> The Apache Nutch PMC are pleased to announce the immediate release of
> Apache Nutch v1.11, we advise all current users and developers of the 1.X
> series to upgrade to this release.
>
> What is Apache Nutch?
>
> Nutch is a well matured, production ready Web crawler. Nutch 1.x enables
> fine grained configuration, relying on Apache Hadoop™
>  data structures, which are great for batch
> processing.
>
> This release is the result of many months of work and around 100 issues
> addressed. For a complete overview of these issues please see the release
> report.
>
> As usual in the 1.X series, release artifacts are made available as both
> source and binary and also available within Maven Central
> as a Maven dependency. The release is available from our DOWNLOADS PAGE.
>
> Please also see the Nutch DOAP file -
> https://svn.apache.org/repos/asf/nutch/cms_site/trunk/content/doap.rdf
>
> Best
>
> Lewis
>
> (on behalf of the Apache Nutch PMC)
>
>


-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble 


Re: [VOTE] Release Apache Nutch 1.11 RC#2

2015-12-05 Thread Julien Nioche
+1

Thanks Lewis

On 4 December 2015 at 18:03, Lewis John Mcgibbney  wrote:

> Hi Folks,
>
> A second candidate for the Nutch 1.11 release is available at:
>
>   https://dist.apache.org/repos/dist/dev/nutch/1.11rc2/
>
> The release candidate consists of zip and tar binaries as well as zip and
> tar sources archives of the sources in:
> http://svn.apache.org/repos/asf/nutch/tags/release-1.11-rc2/
>
> All artifacts have been signed with the following signature as present
> within KEYS
> 48BAEBF6 2013-10-28 Lewis John McGibbney (CODE SIGNING KEY) <
> lewi...@apache.org>
>
> In addition, a staged maven repository is available here:
>
> https://repository.apache.org/content/repositories/orgapachenutch-1007/
>
> Please vote on releasing this package as Apache Nutch 1.11.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Nutch PMC votes are cast.
>
> [ ] +1 Release this package as Apache Nutch X.Y.
> [ ] -1 Do not release this package because…
>
> Cheers,
> Lewis John McGibbney
>
> P.S. Here is my +1.
>
> --
> *Lewis*
>



-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble 


[jira] [Commented] (NUTCH-2177) Generator produces only one partition even in distributed mode

2015-12-01 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15033491#comment-15033491
 ] 

Julien Nioche commented on NUTCH-2177:
--

Do you mean 'mapreduce.framework.name' ?

> Generator produces only one partition even in distributed mode
> --
>
> Key: NUTCH-2177
> URL: https://issues.apache.org/jira/browse/NUTCH-2177
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>    Reporter: Julien Nioche
>Priority: Blocker
> Fix For: 1.11
>
>
> See 
> [https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/crawl/Generator.java#L542]
> 'mapred.job.tracker' is deprecated and has been replaced by 
> 'mapreduce.jobtracker.address', however when running Nutch on EMR 
> mapreduce.jobtracker.address has local as a value. As a result we generate a 
> single partition i.e. have a single map fetching later on (which defeats the 
> object of having a distributed crawler).
> We should probably detect whether we are running on YARN instead, see 
> [http://stackoverflow.com/questions/29680155/why-there-is-a-mapreduce-jobtracker-address-configuration-on-yarn]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2177) Generator produces only one partition even in distributed mode

2015-12-01 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-2177:
-
Attachment: NUTCH-2177.patch

> Generator produces only one partition even in distributed mode
> --
>
> Key: NUTCH-2177
> URL: https://issues.apache.org/jira/browse/NUTCH-2177
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>    Reporter: Julien Nioche
>Priority: Blocker
> Fix For: 1.11
>
> Attachments: NUTCH-2177.patch
>
>
> See 
> [https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/crawl/Generator.java#L542]
> 'mapred.job.tracker' is deprecated and has been replaced by 
> 'mapreduce.jobtracker.address', however when running Nutch on EMR 
> mapreduce.jobtracker.address has local as a value. As a result we generate a 
> single partition i.e. have a single map fetching later on (which defeats the 
> object of having a distributed crawler).
> We should probably detect whether we are running on YARN instead, see 
> [http://stackoverflow.com/questions/29680155/why-there-is-a-mapreduce-jobtracker-address-configuration-on-yarn]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (NUTCH-2177) Generator produces only one partition even in distributed mode

2015-12-01 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15033491#comment-15033491
 ] 

Julien Nioche edited comment on NUTCH-2177 at 12/1/15 11:43 AM:


Do you mean 'mapreduce.framework.name' ?

Should work indeed - here are the values I am getting on EMR

15/12/01 11:02:00 INFO crawl.Generator: mapred.job.tracker local
15/12/01 11:02:00 INFO crawl.Generator: mapreduce.jobtracker.address local
15/12/01 11:02:00 INFO crawl.Generator: mapreduce.framework.name yarn

where in local mode I get 

2015-12-01 11:42:16,622 INFO  crawl.Generator - mapred.job.tracker local
2015-12-01 11:42:16,622 INFO  crawl.Generator - mapreduce.jobtracker.address 
local
2015-12-01 11:42:16,622 INFO  crawl.Generator - mapreduce.framework.name local

Will send a patch shortly


was (Author: jnioche):
Do you mean 'mapreduce.framework.name' ?

> Generator produces only one partition even in distributed mode
> --
>
> Key: NUTCH-2177
> URL: https://issues.apache.org/jira/browse/NUTCH-2177
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>    Reporter: Julien Nioche
>Priority: Blocker
> Fix For: 1.11
>
>
> See 
> [https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/crawl/Generator.java#L542]
> 'mapred.job.tracker' is deprecated and has been replaced by 
> 'mapreduce.jobtracker.address', however when running Nutch on EMR 
> mapreduce.jobtracker.address has local as a value. As a result we generate a 
> single partition i.e. have a single map fetching later on (which defeats the 
> object of having a distributed crawler).
> We should probably detect whether we are running on YARN instead, see 
> [http://stackoverflow.com/questions/29680155/why-there-is-a-mapreduce-jobtracker-address-configuration-on-yarn]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (NUTCH-2177) Generator produces only one partition even in distributed mode

2015-12-01 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-2177.
--
Resolution: Fixed

Committed revision 1717412.

Thanks [~wastl-nagel] and [~markus17]

> Generator produces only one partition even in distributed mode
> --
>
> Key: NUTCH-2177
> URL: https://issues.apache.org/jira/browse/NUTCH-2177
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>    Reporter: Julien Nioche
>Priority: Blocker
> Fix For: 1.11
>
> Attachments: NUTCH-2177.patch
>
>
> See 
> [https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/crawl/Generator.java#L542]
> 'mapred.job.tracker' is deprecated and has been replaced by 
> 'mapreduce.jobtracker.address', however when running Nutch on EMR 
> mapreduce.jobtracker.address has local as a value. As a result we generate a 
> single partition i.e. have a single map fetching later on (which defeats the 
> object of having a distributed crawler).
> We should probably detect whether we are running on YARN instead, see 
> [http://stackoverflow.com/questions/29680155/why-there-is-a-mapreduce-jobtracker-address-configuration-on-yarn]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (NUTCH-2177) Generator produces only one partition even in distributed mode

2015-11-26 Thread Julien Nioche (JIRA)
Julien Nioche created NUTCH-2177:


 Summary: Generator produces only one partition even in distributed 
mode
 Key: NUTCH-2177
 URL: https://issues.apache.org/jira/browse/NUTCH-2177
 Project: Nutch
  Issue Type: Bug
  Components: generator
Reporter: Julien Nioche
Priority: Blocker
 Fix For: 1.11


See 
[https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/crawl/Generator.java#L542]

'mapred.job.tracker' is deprecated and has been replaced by 
'mapreduce.jobtracker.address', however when running Nutch on EMR 
mapreduce.jobtracker.address has local as a value. As a result we generate a 
single partition i.e. have a single map fetching later on (which defeats the 
object of having a distributed crawler).

We should probably detect whether we are running on YARN instead, see 
[http://stackoverflow.com/questions/29680155/why-there-is-a-mapreduce-jobtracker-address-configuration-on-yarn]
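
A sketch of the check being proposed (illustrative only, not necessarily the 
final patch):

{code}
// Illustrative sketch: decide the number of partitions from
// "mapreduce.framework.name" instead of the deprecated job-tracker keys,
// so YARN/EMR runs are treated as distributed.
import org.apache.hadoop.conf.Configuration;

public class GeneratorPartitionCheck {
  static int numPartitions(Configuration conf, int requestedNumLists) {
    String framework = conf.get("mapreduce.framework.name", "local");
    if ("local".equals(framework)) {
      return 1;                  // local mode: exactly one partition
    }
    return requestedNumLists;    // distributed mode (e.g. "yarn")
  }
}
{code}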




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2177) Generator produces only one partition even in distributed mode

2015-11-26 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15029037#comment-15029037
 ] 

Julien Nioche commented on NUTCH-2177:
--

I am on Hadoop version 2.4.0-amzn-7; it is not clear which version of YARN it 
comes with.

mapred.job.tracker does not appear in the conf at all, but 
mapreduce.jobtracker.address does and has 'local' as its value.

I can see

15/11/26 15:58:13 INFO Configuration.deprecation: mapred.job.tracker is 
deprecated. Instead, use mapreduce.jobtracker.address
15/11/26 15:58:13 INFO crawl.Generator: Generator: jobtracker is 'local', 
generating exactly one partition.

so my guess is that the Configuration maps the old key (mapred.job.tracker) to 
the new one (mapreduce.jobtracker.address) 

@markus17 what do you get for mapreduce.jobtracker.address? 


> Generator produces only one partition even in distributed mode
> --
>
> Key: NUTCH-2177
> URL: https://issues.apache.org/jira/browse/NUTCH-2177
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>    Reporter: Julien Nioche
>Priority: Blocker
> Fix For: 1.11
>
>
> See 
> [https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/crawl/Generator.java#L542]
> 'mapred.job.tracker' is deprecated and has been replaced by 
> 'mapreduce.jobtracker.address', however when running Nutch on EMR 
> mapreduce.jobtracker.address has local as a value. As a result we generate a 
> single partition i.e. have a single map fetching later on (which defeats the 
> object of having a distributed crawler).
> We should probably detect whether we are running on YARN instead, see 
> [http://stackoverflow.com/questions/29680155/why-there-is-a-mapreduce-jobtracker-address-configuration-on-yarn]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2069) Ignore external links based on domain

2015-11-20 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15018232#comment-15018232
 ] 

Julien Nioche commented on NUTCH-2069:
--

No problem. It would be good to find a way to format based on the Eclipse XML config 
with an Ant task. There is a way to do it with Maven, but I haven't seen one for 
Ant.

> Ignore external links based on domain
> -
>
> Key: NUTCH-2069
> URL: https://issues.apache.org/jira/browse/NUTCH-2069
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher, parser
>Affects Versions: 1.10
>    Reporter: Julien Nioche
> Fix For: 1.11
>
> Attachments: NUTCH-2069.patch, NUTCH-2069.v2.patch
>
>
> We currently have `db.ignore.external.links` which is a nice way of 
> restricting the crawl based on the hostname. This adds a new parameter 
> 'db.ignore.external.links.domain' to do the same based on the domain.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (NUTCH-2069) Ignore external links based on domain

2015-11-20 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-2069.
--
Resolution: Fixed

Trunk committed revision 1715386.

Thanks everyone for comments and reviews

> Ignore external links based on domain
> -
>
> Key: NUTCH-2069
> URL: https://issues.apache.org/jira/browse/NUTCH-2069
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher, parser
>Affects Versions: 1.10
>    Reporter: Julien Nioche
> Fix For: 1.11
>
> Attachments: NUTCH-2069.patch, NUTCH-2069.v2.patch
>
>
> We currently have `db.ignore.external.links` which is a nice way of 
> restricting the crawl based on the hostname. This adds a new parameter 
> 'db.ignore.external.links.domain' to do the same based on the domain.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (NUTCH-2069) Ignore external links based on domain

2015-11-20 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-2069.


> Ignore external links based on domain
> -
>
> Key: NUTCH-2069
> URL: https://issues.apache.org/jira/browse/NUTCH-2069
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher, parser
>Affects Versions: 1.10
>    Reporter: Julien Nioche
> Fix For: 1.11
>
> Attachments: NUTCH-2069.patch, NUTCH-2069.v2.patch
>
>
> We currently have `db.ignore.external.links` which is a nice way of 
> restricting the crawl based on the hostname. This adds a new parameter 
> 'db.ignore.external.links.domain' to do the same based on the domain.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2069) Ignore external links based on domain

2015-11-19 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-2069:
-
Attachment: NUTCH-2069.v2.patch

New patch introducing 'db.ignore.external.links.mode'.
This is compatible with the existing behaviour and will use byHost by default.
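
For illustration, the property could be set in nutch-site.xml along these lines 
(a sketch based on the patch description; the byDomain value is assumed from the 
issue summary, byHost being the default):

{code}
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
<property>
  <name>db.ignore.external.links.mode</name>
  <!-- byHost (default, current behaviour) or byDomain (added by this patch) -->
  <value>byDomain</value>
</property>
{code}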

> Ignore external links based on domain
> -
>
> Key: NUTCH-2069
> URL: https://issues.apache.org/jira/browse/NUTCH-2069
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher, parser
>Affects Versions: 1.10
>    Reporter: Julien Nioche
> Attachments: NUTCH-2069.patch, NUTCH-2069.v2.patch
>
>
> We currently have `db.ignore.external.links` which is a nice way of 
> restricting the crawl based on the hostname. This adds a new parameter 
> 'db.ignore.external.links.domain' to do the same based on the domain.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2064) URLNormalizer basic to encode reserved chars and decode non-reserved chars

2015-11-10 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998467#comment-14998467
 ] 

Julien Nioche commented on NUTCH-2064:
--

FYI, I have ported the code to Crawler-Commons 
[https://github.com/crawler-commons/crawler-commons/pull/106], where the 
provenance is acknowledged. Comments on that PR are more than welcome.

> URLNormalizer basic to encode reserved chars and decode non-reserved chars
> --
>
> Key: NUTCH-2064
> URL: https://issues.apache.org/jira/browse/NUTCH-2064
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.10
>Reporter: Markus Jelsma
> Fix For: 1.11
>
> Attachments: NUTCH-1098.patch, NUTCH-1098.patch, NUTCH-2064-v3.patch, 
> NUTCH-2064-v5.patch, NUTCH-2064.patch
>
>
> NUTCH-1098 rewritten to work on trunk. Unit test is identical to 1098.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (NUTCH-2064) URLNormalizer basic to encode reserved chars and decode non-reserved chars

2015-11-10 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-2064.
--
   Resolution: Fixed
Fix Version/s: (was: 1.12)
   1.11

Trunk  : Committed revision 1713615.

Nice one! thanks Markus and Sebastian


> URLNormalizer basic to encode reserved chars and decode non-reserved chars
> --
>
> Key: NUTCH-2064
> URL: https://issues.apache.org/jira/browse/NUTCH-2064
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.10
>Reporter: Markus Jelsma
> Fix For: 1.11
>
> Attachments: NUTCH-1098.patch, NUTCH-1098.patch, NUTCH-2064-v3.patch, 
> NUTCH-2064-v5.patch, NUTCH-2064.patch
>
>
> NUTCH-1098 rewritten to work on trunk. Unit test is identical to 1098.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (NUTCH-2158) Upgrade to Tika 1.11

2015-11-10 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche reassigned NUTCH-2158:


Assignee: Julien Nioche  (was: Chris A. Mattmann)

> Upgrade to Tika 1.11
> 
>
> Key: NUTCH-2158
> URL: https://issues.apache.org/jira/browse/NUTCH-2158
> Project: Nutch
>  Issue Type: Task
>  Components: parser
>Reporter: Chris A. Mattmann
>    Assignee: Julien Nioche
> Fix For: 1.11
>
>
> Upgrade parse-tika to 1.11 release for Tika.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2158) Upgrade to Tika 1.11

2015-11-10 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-2158:
-
Attachment: NUTCH-2158.patch

Patch which upgrades to Tika 1.11

tests fail for protocol-http 

{code}
Testcase: testStatusCode took 3.648 sec
FAILED
ContentType http://127.0.0.1:47504/basic-http.jsp 
expected:<[application/xhtml+x]ml> but was:<[text/ht]ml>
junit.framework.AssertionFailedError: ContentType 
http://127.0.0.1:47504/basic-http.jsp expected:<[application/xhtml+x]ml> but 
was:<[text/ht]ml>
at 
org.apache.nutch.protocol.http.TestProtocolHttp.fetchPage(TestProtocolHttp.java:136)
at 
org.apache.nutch.protocol.http.TestProtocolHttp.testStatusCode(TestProtocolHttp.java:80)
{code}

The detected mime type is different but probably correct. Will fix later.


> Upgrade to Tika 1.11
> 
>
> Key: NUTCH-2158
> URL: https://issues.apache.org/jira/browse/NUTCH-2158
> Project: Nutch
>  Issue Type: Task
>  Components: parser
>Reporter: Chris A. Mattmann
>Assignee: Julien Nioche
> Fix For: 1.11
>
> Attachments: NUTCH-2158.patch
>
>
> Upgrade parse-tika to 1.11 release for Tika.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [VOTE] Apache Nutch 1.11 Release Candidate #1

2015-10-26 Thread Julien Nioche
Chris

-1. We usually release tar.gz as well as zip. More importantly, we need to
release the sources as well as the binary. We can't even test that it
compiles OK.

Since you released Tika, why don't we include it before cutting 1.11?

Thanks

Julien


On 26 October 2015 at 05:53, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Hi Folks,
>
> A first candidate for the Nutch 1.11 release is available at:
>
>   https://dist.apache.org/repos/dist/dev/nutch/1.11/
>
> The release candidate is a zip archive of the sources in:
> http://svn.apache.org/repos/asf/nutch/tags/release-1.11-rc1/
>
>
> The SHA1 checksum of the archive is
> 6adebaca0504be69a9e6c67ae1eb3a8487b1806f
>
>
> In addition, a staged maven repository is available here:
>
> https://repository.apache.org/content/repositories/orgapachenutch-1006/
>
>
> Please vote on releasing this package as Apache Nutch 1.11.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Nutch PMC votes are cast.
>
> [ ] +1 Release this package as Apache Nutch 1.11
> [ ] -1 Do not release this package because…
>
> Cheers,
> Chris
>
> P.S. Of course here is my +1.
>
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>
>


-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble 


[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events

2015-10-05 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943757#comment-14943757
 ] 

Julien Nioche commented on NUTCH-2132:
--

Looking at it from a slightly different angle, couldn't you use Logstash to 
aggregate and push to ElasticSearch? Most events are already present in the log 
files. You'd be able to query ES (or any other backend supported by Logstash or 
similar) for instance with Kibana and graph it all with 0 modifications to the 
code and 0 overhead. 

> Publisher/Subscriber model for Nutch to emit events 
> 
>
> Key: NUTCH-2132
> URL: https://issues.apache.org/jira/browse/NUTCH-2132
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher, REST_api
>Reporter: Sujen Shah
>  Labels: memex
> Fix For: 1.11
>
> Attachments: NUTCH-2132.patch
>
>
> It would be nice to have a Pub/Sub model in Nutch to emit certain events (ex- 
> Fetcher events like fetch-start, fetch-end, a fetch report which may contain 
> data like outlinks of the current fetched url, score, etc). 
> A consumer of this functionality could use this data to generate real time 
> visualization and generate statics of the crawl without having to wait for 
> the fetch round to finish. 
> The REST API could contain an endpoint which would respond with a url to 
> which a client could subscribe to get the fetcher events. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events

2015-10-05 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943856#comment-14943856
 ] 

Julien Nioche commented on NUTCH-2132:
--

bq.  but that locks us into using Kibana, etc. Ideally one goal of this would 
be to enable it to work with multiple downstream front ends

I mentioned logstash as an example, my point was more generally about 
leveraging the log files instead of modifying the code and possibly add 
overhead and complexity. There are probably other tools doing similar things.

Having said that, Logstash is pluggable and supports various backends, so it 
would probably be possible to push things into a queue, for instance.

Talking about dependencies, this introduces a hard one on RabbitMQ. Isn't there 
a neutral API that could be programmed against? (JMS? AMQP?) - this would allow 
users to choose their favourite messaging queue.
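
To illustrate, publishing a fetch event through the vendor-neutral JMS API could be 
as small as the sketch below (class name, topic name and payload are made up for the 
example; any provider's ConnectionFactory would do):

{code}
import javax.jms.*;

public class FetchEventPublisher {
  // Sketch only: "nutch.fetcher.events" and the JSON payload are illustrative,
  // the ConnectionFactory comes from whichever JMS provider is configured.
  public static void publish(ConnectionFactory factory, String json) throws JMSException {
    Connection connection = factory.createConnection();
    try {
      Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
      MessageProducer producer = session.createProducer(session.createTopic("nutch.fetcher.events"));
      producer.send(session.createTextMessage(json));
    } finally {
      connection.close();
    }
  }
}
{code}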


 

> Publisher/Subscriber model for Nutch to emit events 
> 
>
> Key: NUTCH-2132
> URL: https://issues.apache.org/jira/browse/NUTCH-2132
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher, REST_api
>Reporter: Sujen Shah
>  Labels: memex
> Fix For: 1.11
>
> Attachments: NUTCH-2132.patch
>
>
> It would be nice to have a Pub/Sub model in Nutch to emit certain events (ex- 
> Fetcher events like fetch-start, fetch-end, a fetch report which may contain 
> data like outlinks of the current fetched url, score, etc). 
> A consumer of this functionality could use this data to generate real time 
> visualization and generate statistics of the crawl without having to wait for 
> the fetch round to finish. 
> The REST API could contain an endpoint which would respond with a url to 
> which a client could subscribe to get the fetcher events. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Nutch not recognizing html pages/images retrieved via php

2015-10-05 Thread Julien Nioche
Hi

What happens is that parse-tika is used by default but doesn't know what to
do with that mime type.

You can edit parse-plugins.xml and add a mapping along these lines:

    <mimeType name="text/x-php">
        <plugin id="parse-html" />
    </mimeType>

to map the mime type to the html parser. Obviously you'll need parse-html
to be active.

HTH

Julien



On 4 October 2015 at 03:01, Girish Rao  wrote:

> Hi,
>
> I am running a crawl on a website that serves pages and images via php.
> Nutch doesn’t seem to crawl these pages.
>
> I see the below in the hadoop.log
> 015-10-03 12:48:31,091 INFO  parse.ParserFactory - The parsing plugins:
> [org.apache.nutch.parse.tika.TikaParser] are enabled via the
> plugin.includes system property, and all claim to support the content type
> text/x-php, but they are not mapped to it  in the parse-plugins.xml file
> 2015-10-03 12:48:31,712 ERROR tika.TikaParser - Can't retrieve Tika parser
> for mime-type text/x-php
> 2015-10-03 12:48:31,713 WARN  parse.ParseSegment - Error parsing:
> http://www.arguntrader.com/ucp.php?mode=login: failed(2,0): Can't
> retrieve Tika parser for mime-type text/x-php
>
> Can anyone help with identifying what is to be done to crawl a site which
> serves pages via php?
>
> Regards
> Girish




-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble 


[jira] [Commented] (NUTCH-2129) Track Protocol Status in Crawl Datum

2015-10-01 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14939503#comment-14939503
 ] 

Julien Nioche commented on NUTCH-2129:
--

I'd rather keep it simple and not modify the CrawlDatum so much. Why don't you 
simply add a config element and optionally store the code in the metadata?
BTW we already have the option to store the response headers - see 
[https://github.com/apache/nutch/commit/23c7761aff830db82a1e44b84bf81265639c9a26].
 You could use that and simply reparse the first line to get the code.
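
For what it's worth, getting the code back out of the stored headers is trivial - 
something along these lines (method and variable names are just for illustration):

{code}
// Sketch only: the verbatim headers start with a status line such as "HTTP/1.1 200 OK"
static int parseStatusCode(String rawHeaders) {
  String statusLine = rawHeaders.split("\r?\n", 2)[0].trim();
  String[] tokens = statusLine.split("\\s+");
  return Integer.parseInt(tokens[1]); // second token is the status code, e.g. 200
}
{code}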


> Track Protocol Status in Crawl Datum
> 
>
> Key: NUTCH-2129
> URL: https://issues.apache.org/jira/browse/NUTCH-2129
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.3, 1.10
>Reporter: Michael Joyce
> Fix For: 2.4, 1.11
>
>
> It's become necessary on a few crawls that I run to get protocol status code 
> stats. After speaking with [~lewismc] it seemed that there might not be a 
> super convenient way of doing this as is, but it would be great to be able to 
> add the functionality necessary to pull this information out.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Webcast : Apache Nutch on EMR

2015-09-23 Thread Julien Nioche
Hi again,

I have uploaded a webcast explaining how to run Nutch on AWS Elastic MapReduce:

https://www.youtube.com/watch?v=v9zjcTjjjyU

Please excuse the sound quality, hesitations and stuttering. I hope you
find it useful nonetheless.

Julien

-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble 


Tutorial : Index the web with AWS CloudSearch

2015-09-23 Thread Julien Nioche
Hi everyone,

Just to let you know that we've just published a new tutorial on how to use
Nutch (and StormCrawler) to crawl and index documents into AWS CloudSearch.

This is related to the recent addition of NUTCH-1517
 in the trunk codebase.
The tutorial is aimed at beginners and gives step by step instructions on
how to use Nutch, including in distributed mode. It should also be relevant
for more advanced users as it provides an introduction to CloudSearch and a
comparison with StormCrawler.

The tutorial is on
http://digitalpebble.blogspot.co.uk/2015/09/index-web-with-aws-cloudsearch.html

Please retweet the announcement if you use Twitter [
https://twitter.com/digitalpebble/status/646614555192336384].

I hope you find it useful

Julien

-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


[jira] [Commented] (NUTCH-2095) WARC exporter for the CommonCrawlDataDumper

2015-09-22 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902651#comment-14902651
 ] 

Julien Nioche commented on NUTCH-2095:
--

Thanks [~jorgelbg]. Please add a line to CHANGES.txt to describe what you did 
with this. Could you also edit 
[https://wiki.apache.org/nutch/CommonCrawlDataDumper] and describe what you 
added to the CCDD? Thanks

BTW the basic tests fail on my machine - do you get this too? e.g. for 
TestInjector

{code}
tried to access method com.google.common.base.Stopwatch.<init>()V from class 
org.apache.hadoop.mapred.FileInputFormat
java.lang.IllegalAccessError: tried to access method 
com.google.common.base.Stopwatch.<init>()V from class 
org.apache.hadoop.mapred.FileInputFormat
{code}



> WARC exporter for the CommonCrawlDataDumper
> ---
>
> Key: NUTCH-2095
> URL: https://issues.apache.org/jira/browse/NUTCH-2095
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, tool
>Affects Versions: 1.11
>Reporter: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>  Labels: tools, warc
> Attachments: NUTCH-2095.patch
>
>
> Adds the possibility of exporting the Nutch segments to WARC files.
> From the usage point of view a couple of new command line options are 
> available:
> {{-warc}}: enables the functionality to export into WARC files, if not 
> specified the default JACKSON formatter is used.
> {{-warcSize}}: enable the option to define a max file size for each WARC 
> file, if not specified a default of 1GB per file is used as recommended by 
> the WARC ISO standard.
> The usual {{-gzip}} flag can be used to enable compression on the WARC files.
> Some changes to the default {{CommonCrawlDataDumper}} were done, essentially 
> some changes to the Factory and to the Formats. These changes avoid creating a 
> new instance of a {{CommonCrawlFormat}} on each URL read from the segments. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2095) WARC exporter for the CommonCrawlDataDumper

2015-09-22 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902715#comment-14902715
 ] 

Julien Nioche commented on NUTCH-2095:
--

See [https://issues.apache.org/jira/browse/HADOOP-10961]. This is due to Guava 
17, which is pulled in transitively by webarchive-commons version 1.1.5.
I've excluded Guava from that dependency in revision 1704641 and it has fixed the problem.
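
For the record, the exclusion boils down to something like this in the ivy file that 
pulls in webarchive-commons (coordinates written from memory, double-check them against 
the actual entry):

{code}
<dependency org="org.netpreserve.commons" name="webarchive-commons" rev="1.1.5" conf="*->default">
  <!-- Guava 17 clashes with the version Hadoop expects, see HADOOP-10961 -->
  <exclude org="com.google.guava" module="guava" />
</dependency>
{code}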

[~jorgelbg] please remember to run 'ant clean test' before committing something.

> WARC exporter for the CommonCrawlDataDumper
> ---
>
> Key: NUTCH-2095
> URL: https://issues.apache.org/jira/browse/NUTCH-2095
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, tool
>Affects Versions: 1.11
>Reporter: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>  Labels: tools, warc
> Attachments: NUTCH-2095.patch
>
>
> Adds the possibility of exporting the Nutch segments to WARC files.
> From the usage point of view a couple of new command line options are 
> available:
> {{-warc}}: enables the functionality to export into WARC files, if not 
> specified the default JACKSON formatter is used.
> {{-warcSize}}: enable the option to define a max file size for each WARC 
> file, if not specified a default of 1GB per file is used as recommended by 
> the WARC ISO standard.
> The usual {{-gzip}} flag can be used to enable compression on the WARC files.
> Some changes to the default {{CommonCrawlDataDumper}} were done, essentially 
> some changes to the Factory and to the Formats. These changes avoid creating a 
> new instance of a {{CommonCrawlFormat}} on each URL read from the segments. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2095) WARC exporter for the CommonCrawlDataDumper

2015-09-22 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902578#comment-14902578
 ] 

Julien Nioche commented on NUTCH-2095:
--

[~jorgelbg] could you please fix the test. See below

{code}
Index: src/test/org/apache/nutch/tools/TestCommonCrawlDataDumper.java
===
--- src/test/org/apache/nutch/tools/TestCommonCrawlDataDumper.java  
(revision 1704612)
+++ src/test/org/apache/nutch/tools/TestCommonCrawlDataDumper.java  
(working copy)
@@ -101,8 +101,9 @@
 
CommonCrawlDataDumper dumper = new CommonCrawlDataDumper(
new CommonCrawlConfig());
-   dumper.dump(tempDir, sampleSegmentDir, false, null, false, "");
 
+   dumper.dump(tempDir, sampleSegmentDir, false, null, false, "", false);
+
Collection tempFiles = FileUtils.listFiles(tempDir,
FileFilterUtils.fileFileFilter(),
FileFilterUtils.directoryFileFilter());
{code}

> WARC exporter for the CommonCrawlDataDumper
> ---
>
> Key: NUTCH-2095
> URL: https://issues.apache.org/jira/browse/NUTCH-2095
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, tool
>Affects Versions: 1.11
>Reporter: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>  Labels: tools, warc
> Attachments: NUTCH-2095.patch
>
>
> Adds the possibility of exporting the Nutch segments to WARC files.
> From the usage point of view a couple of new command line options are 
> available:
> {{-warc}}: enables the functionality to export into WARC files, if not 
> specified the default JACKSON formatter is used.
> {{-warcSize}}: enable the option to define a max file size for each WARC 
> file, if not specified a default of 1GB per file is used as recommended by 
> the WARC ISO standard.
> The usual {{-gzip}} flag can be used to enable compression on the WARC files.
> Some changes to the default {{CommonCrawlDataDumper}} were done, essentially 
> some changes to the Factory and to the Formats. These changes avoid creating a 
> new instance of a {{CommonCrawlFormat}} on each URL read from the segments. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (NUTCH-2102) WARC Exporter

2015-09-22 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-2102.
--
Resolution: Fixed

Committed revision 1704634.

Thanks for the reviews

> WARC Exporter
> -
>
> Key: NUTCH-2102
> URL: https://issues.apache.org/jira/browse/NUTCH-2102
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, dumpers
>Affects Versions: 1.10
>    Reporter: Julien Nioche
> Attachments: NUTCH-2102.patch
>
>
> This patch adds a WARC exporter 
> [http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike 
> the code submitted in [https://github.com/apache/nutch/pull/55] which is 
> based on the CommonCrawlDataDumper, this exporter is a MapReduce job and 
> hence should be able to cope with large segments in a timely fashion and also 
> is not limited to the local file system.
> Later on we could have a WARCImporter to generate segments from WARC files, 
> which is outside the scope of the CCDD anyway. Also WARC is not specific to 
> CommonCrawl, which is why the package name does not reflect it.
> I don't think it would be a problem to have both the modified CCDD and this 
> class providing similar functionalities.
> This class is called in the following way 
> ./nutch org.apache.nutch.tools.warc.WARCExporter 
> /data/nutch-dipe/1kcrawl/warc -dir /data/nutch-dipe/1kcrawl/segments/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2102) WARC Exporter

2015-09-22 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-2102:
-
Fix Version/s: 1.11

> WARC Exporter
> -
>
> Key: NUTCH-2102
> URL: https://issues.apache.org/jira/browse/NUTCH-2102
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, dumpers
>Affects Versions: 1.10
>    Reporter: Julien Nioche
> Fix For: 1.11
>
> Attachments: NUTCH-2102.patch
>
>
> This patch adds a WARC exporter 
> [http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike 
> the code submitted in [https://github.com/apache/nutch/pull/55] which is 
> based on the CommonCrawlDataDumper, this exporter is a MapReduce job and 
> hence should be able to cope with large segments in a timely fashion and also 
> is not limited to the local file system.
> Later on we could have a WARCImporter to generate segments from WARC files, 
> which is outside the scope of the CCDD anyway. Also WARC is not specific to 
> CommonCrawl, which is why the package name does not reflect it.
> I don't think it would be a problem to have both the modified CCDD and this 
> class providing similar functionalities.
> This class is called in the following way 
> ./nutch org.apache.nutch.tools.warc.WARCExporter 
> /data/nutch-dipe/1kcrawl/warc -dir /data/nutch-dipe/1kcrawl/segments/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (NUTCH-2114) kkk

2015-09-20 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-2114.

Resolution: Invalid

> kkk
> ---
>
> Key: NUTCH-2114
> URL: https://issues.apache.org/jira/browse/NUTCH-2114
> Project: Nutch
>  Issue Type: Bug
>  Components: administration gui, commoncrawl, injector
>Reporter: Badreddine Ahmed
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Fwd: Job Opening at Common Crawl - Crawl Engineer / Data Scientist

2015-09-18 Thread Julien Nioche
Nutch people,

Just in case you missed the announcement below. As you probably know, CC uses
Nutch for their crawls, so this is a fantastic opportunity to put your Nutch
skills to great use!

Julien

-- Forwarded message --
From: Sara Crouse 
Date: 17 September 2015 at 22:51
Subject: Job Opening at Common Crawl - Crawl Engineer / Data Scientist
To: Common Crawl 


Hello again CC community,

In addition to my appointment, another staff transition is on the horizon,
and I would like to ask for your help finding candidates to fill a critical
role. At the end of this month, Stephen Merity (data scientist, crawl
engineer, and much more!) will leave Common Crawl to work on image
recognition and language understanding using deep learning at MetaMind, a
new startup. Stephen, has been a great asset to Common Crawl, and we are
grateful that he wishes to remain engaged with us in a volunteer capacity
going forward.

This week, we therefore launch a search to fill the role of Crawl
Engineer/Data Scientist. Below and posted here https://commoncrawl.org/jobs/
is the job description. We appreciate any help you can provide in spreading
the word about this unique opportunity. If you have specific referrals, or
wish to apply, please contact j...@commoncrawl.org.

Many thanks,

Sara

---

_CRAWL ENGINEER / DATA SCIENTIST at THE COMMON CRAWL FOUNDATION_

*Location*
San Francisco or Remote


*Job Summary*
Common Crawl (CC) is the non-profit organization that builds and maintains
the single largest publicly accessible dataset of the world’s knowledge,
encompassing petabytes of web crawl data.

If democratizing access to web information and tackling the engineering
challenges of working with data at the scale of the web sounds exciting to
you, we would love to hear from you. If you have worked on open source
projects before or can share code samples with us, please don't hesitate to
send relevant links along with your application.


*Description*

/Primary Responsibilities/
_Running the crawl_
* Spinning up and managing Hadoop clusters on Amazon EC2
* Running regular comprehensive crawls of the web using Nutch
* Preparing and publishing crawl data to data hosting partner, Amazon Web
Services
* Incident response and diagnosis of crawl issues as they occur, e.g.
** Replacing lost instances due to EC2 problems / spot instance losses
** Responding to and remedying webmaster queries and issues

_Crawl engineering_
* Maintaining, developing, and deploying new features as required by
running the Nutch crawler, e.g.:
** Providing netiquette features, such as following robots.txt, as
required, and load balancing a crawl across millions of domains
** Implementing and improving ranking algorithms to prioritize the crawling
of popular pages
* Extending existing tools to work efficiently with large datasets
* Working with the Nutch community to push improvements to the crawler to
the public

/Other Responsibilities/
* Building support tools and artifacts, including documentation, tutorials,
and example code or supporting frameworks for processing CC data using
different tools.
* Identifying and reporting on research and innovations that result from
analysis and derivative use of CC data.
* Community evangelism:
** Collaborating with partners in academia and industry
** Engaging regularly with user discussion group and responding to frequent
inquiries about how to use CC data
** Writing technical blog posts
** Presenting on or representing CC at conferences, meetups, etc.


*Qualifications*
/Minimum qualifications/
* Fluent in Java (Nutch and Hadoop are core to our mission)
* Familiarity with the JVM big data ecosystem (Hadoop, HDFS, ...)
* Knowledge of the Amazon Web Services (AWS) ecosystem
* Experience with Python
* Basic command line Unix knowledge
* BS Computer Science or equivalent work experience

/Preferred qualifications/
* Experience with running web crawlers
* Cluster computing experience (Hadoop preferred)
* Running parallel jobs over dozens of terabytes of data
* Experience committing to open source projects and participating in open
source forums


*About Common Crawl*
The Common Crawl Foundation is a California 501(c)(3) registered non-profit
with the goal of democratizing access to web information by producing and
maintaining an open repository of web crawl data that is universally
accessible and analyzable.

Our vision is of a truly open web that allows open access to information
and enables greater innovation in research, business and education. We
level the playing field by making wholesale extraction, transformation and
analysis of web data cheap and easy.

The Common Crawl Foundation is an Equal Opportunity Employer.


*To Apply*
Please send your cover letter and resumé to j...@commoncrawl.org.


[jira] [Commented] (NUTCH-2102) WARC Exporter

2015-09-16 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14747300#comment-14747300
 ] 

Julien Nioche commented on NUTCH-2102:
--

The only modification to existing code is in the class 
'src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java'
 where we added two new config elements:
* store.http.request
* store.http.headers
which are used to keep the request and http headers verbatim in the content 
metadata. Both are set to false by default.
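
To turn them on, e.g. for a WARC-producing crawl, add the usual Hadoop-style properties 
to nutch-site.xml (values below are just the obvious ones):

{code}
<property>
  <name>store.http.request</name>
  <value>true</value>
  <description>Store the verbatim HTTP request in the content metadata.</description>
</property>
<property>
  <name>store.http.headers</name>
  <value>true</value>
  <description>Store the verbatim HTTP response headers in the content metadata.</description>
</property>
{code}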

Note that this is also used by [#55](https://github.com/apache/nutch/pull/55)


> WARC Exporter
> -
>
> Key: NUTCH-2102
> URL: https://issues.apache.org/jira/browse/NUTCH-2102
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, dumpers
>Affects Versions: 1.10
>    Reporter: Julien Nioche
> Attachments: NUTCH-2102.patch
>
>
> This patch adds a WARC exporter 
> [http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike 
> the code submitted in [https://github.com/apache/nutch/pull/55] which is 
> based on the CommonCrawlDataDumper, this exporter is a MapReduce job and 
> hence should be able to cope with large segments in a timely fashion and also 
> is not limited to the local file system.
> Later on we could have a WARCImporter to generate segments from WARC files, 
> which is outside the scope of the CCDD anyway. Also WARC is not specific to 
> CommonCrawl, which is why the package name does not reflect it.
> I don't think it would be a problem to have both 
> [https://github.com/apache/nutch/pull/55] and this class providing similar 
> functionalities.
> This class is called in the following way 
> ./nutch org.apache.nutch.tools.warc.WARCExporter 
> /data/nutch-dipe/1kcrawl/warc -dir /data/nutch-dipe/1kcrawl/segments/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2102) WARC Exporter

2015-09-16 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-2102:
-
Description: 
This patch adds a WARC exporter 
[http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike the 
code submitted in [https://github.com/apache/nutch/pull/55] which is based on 
the CommonCrawlDataDumper, this exporter is a MapReduce job and hence should be 
able to cope with large segments in a timely fashion and also is not limited to 
the local file system.

Later on we could have a WARCImporter to generate segments from WARC files, 
which is outside the scope of the CCDD anyway. Also WARC is not specific to 
CommonCrawl, which is why the package name does not reflect it.

I don't think it would be a problem to have both the modified CCDD and this 
class providing similar functionalities.

This class is called in the following way 

./nutch org.apache.nutch.tools.warc.WARCExporter /data/nutch-dipe/1kcrawl/warc 
-dir /data/nutch-dipe/1kcrawl/segments/

  was:
This patch adds a WARC exporter 
[http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike the 
code submitted in [https://github.com/apache/nutch/pull/55] which is based on 
the CommonCrawlDataDumper, this exporter is a MapReduce job and hence should be 
able to cope with large segments in a timely fashion and also is not limited to 
the local file system.

Later on we could have a WARCImporter to generate segments from WARC files, 
which is outside the scope of the CCDD anyway. Also WARC is not specific to 
CommonCrawl, which is why the package name does not reflect it.

I don't think it would be a problem to have both 
[https://github.com/apache/nutch/pull/55] and this class providing similar 
functionalities.

This class is called in the following way 

./nutch org.apache.nutch.tools.warc.WARCExporter /data/nutch-dipe/1kcrawl/warc 
-dir /data/nutch-dipe/1kcrawl/segments/


> WARC Exporter
> -
>
> Key: NUTCH-2102
> URL: https://issues.apache.org/jira/browse/NUTCH-2102
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, dumpers
>Affects Versions: 1.10
>    Reporter: Julien Nioche
> Attachments: NUTCH-2102.patch
>
>
> This patch adds a WARC exporter 
> [http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike 
> the code submitted in [https://github.com/apache/nutch/pull/55] which is 
> based on the CommonCrawlDataDumper, this exporter is a MapReduce job and 
> hence should be able to cope with large segments in a timely fashion and also 
> is not limited to the local file system.
> Later on we could have a WARCImporter to generate segments from WARC files, 
> which is outside the scope of the CCDD anyway. Also WARC is not specific to 
> CommonCrawl, which is why the package name does not reflect it.
> I don't think it would be a problem to have both the modified CCDD and this 
> class providing similar functionalities.
> This class is called in the following way 
> ./nutch org.apache.nutch.tools.warc.WARCExporter 
> /data/nutch-dipe/1kcrawl/warc -dir /data/nutch-dipe/1kcrawl/segments/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2102) WARC Exporter

2015-09-16 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14747301#comment-14747301
 ] 

Julien Nioche commented on NUTCH-2102:
--

Please review

> WARC Exporter
> -
>
> Key: NUTCH-2102
> URL: https://issues.apache.org/jira/browse/NUTCH-2102
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, dumpers
>Affects Versions: 1.10
>    Reporter: Julien Nioche
> Attachments: NUTCH-2102.patch
>
>
> This patch adds a WARC exporter 
> [http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike 
> the code submitted in [https://github.com/apache/nutch/pull/55] which is 
> based on the CommonCrawlDataDumper, this exporter is a MapReduce job and 
> hence should be able to cope with large segments in a timely fashion and also 
> is not limited to the local file system.
> Later on we could have a WARCImporter to generate segments from WARC files, 
> which is outside the scope of the CCDD anyway. Also WARC is not specific to 
> CommonCrawl, which is why the package name does not reflect it.
> I don't think it would be a problem to have both the modified CCDD and this 
> class providing similar functionalities.
> This class is called in the following way 
> ./nutch org.apache.nutch.tools.warc.WARCExporter 
> /data/nutch-dipe/1kcrawl/warc -dir /data/nutch-dipe/1kcrawl/segments/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (NUTCH-2102) WARC Exporter

2015-09-16 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14747327#comment-14747327
 ] 

Julien Nioche edited comment on NUTCH-2102 at 9/16/15 11:21 AM:


Hi Markus

>  I believe this warc format is the updated arc format, for which we already 
> have an importer? 

The importer for WARC could be done later on and would leverage the same 
library as the exporter. But yes, it would look pretty similar to the ARC one .

> you meant to use StringBuilder instead of the synchronized StringBuffer in 
> HttpResponse
could do, will wait for other comments before amending the patch

> A bin/nutch entry is also missing, or not 
yes, why not. There's already far too much stuff in there :-) though. Again, I 
can amend it if ppl are +1 for committing this

Thanks for reviewing it




was (Author: jnioche):
Hi Markus

>  I believe this warc format is the updated arc format, for which we already 
> have an importer? 

The importer for WARC could be done later on and would leverage the same 
library as the exporter. But yes, it would look pretty similar to the ARC one .

> you meant to use StringBuilder instead of the synchronized StringBuffer in 
> HttpResponse
could do, will wait for other comments before amending the patch

> A bin/nutch entry is also missing, or not 
yes, why not. There's already far too much stuff in there :-) though. Again, I 
can amend it if ppl are +1 for committing this




> WARC Exporter
> -
>
> Key: NUTCH-2102
> URL: https://issues.apache.org/jira/browse/NUTCH-2102
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, dumpers
>Affects Versions: 1.10
>Reporter: Julien Nioche
> Attachments: NUTCH-2102.patch
>
>
> This patch adds a WARC exporter 
> [http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike 
> the code submitted in [https://github.com/apache/nutch/pull/55] which is 
> based on the CommonCrawlDataDumper, this exporter is a MapReduce job and 
> hence should be able to cope with large segments in a timely fashion and also 
> is not limited to the local file system.
> Later on we could have a WARCImporter to generate segments from WARC files, 
> which is outside the scope of the CCDD anyway. Also WARC is not specific to 
> CommonCrawl, which is why the package name does not reflect it.
> I don't think it would be a problem to have both the modified CCDD and this 
> class providing similar functionalities.
> This class is called in the following way 
> ./nutch org.apache.nutch.tools.warc.WARCExporter 
> /data/nutch-dipe/1kcrawl/warc -dir /data/nutch-dipe/1kcrawl/segments/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2102) WARC Exporter

2015-09-16 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-2102:
-
Description: 
This patch adds a WARC exporter 
[http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike the 
code submitted in [https://github.com/apache/nutch/pull/55] which is based on 
the CommonCrawlDataDumper, this exporter is a MapReduce job and hence should be 
able to cope with large segments in a timely fashion and also is not limited to 
the local file system.

Later on we could have a WARCImporter to generate segments from WARC files, 
which is outside the scope of the CCDD anyway. Also WARC is not specific to 
CommonCrawl, which is why the package name does not reflect it.

I don't think it would be a problem to have both 
[https://github.com/apache/nutch/pull/55] and this class providing similar 
functionalities.

This class is called in the following way 

./nutch org.apache.nutch.tools.warc.WARCExporter /data/nutch-dipe/1kcrawl/warc 
-dir /data/nutch-dipe/1kcrawl/segments/

  was:
This patch adds a WARC exporter 
[http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike the 
code submitted in [https://github.com/apache/nutch/pull/55] which is based on 
the CommonCrawlDataDumper, this exporter is a MapReduce job and hence should be 
able to cope with large segments in a timely fashion and also is not limited to 
the local file system.

Later on we could have a WARCImporter to generate segments from WARC files, 
which is outside the scope of the CCDD anyway. Also WARC is not specific to 
CommonCrawl, which is why the package name does not reflect it.

I don't think it would be a problem to have both 
[https://github.com/apache/nutch/pull/55] and this class providing similar 
functionalities.


> WARC Exporter
> -
>
> Key: NUTCH-2102
> URL: https://issues.apache.org/jira/browse/NUTCH-2102
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, dumpers
>Affects Versions: 1.10
>    Reporter: Julien Nioche
> Attachments: NUTCH-2102.patch
>
>
> This patch adds a WARC exporter 
> [http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike 
> the code submitted in [https://github.com/apache/nutch/pull/55] which is 
> based on the CommonCrawlDataDumper, this exporter is a MapReduce job and 
> hence should be able to cope with large segments in a timely fashion and also 
> is not limited to the local file system.
> Later on we could have a WARCImporter to generate segments from WARC files, 
> which is outside the scope of the CCDD anyway. Also WARC is not specific to 
> CommonCrawl, which is why the package name does not reflect it.
> I don't think it would be a problem to have both 
> [https://github.com/apache/nutch/pull/55] and this class providing similar 
> functionalities.
> This class is called in the following way 
> ./nutch org.apache.nutch.tools.warc.WARCExporter 
> /data/nutch-dipe/1kcrawl/warc -dir /data/nutch-dipe/1kcrawl/segments/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2102) WARC Exporter

2015-09-16 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-2102:
-
Attachment: (was: NUTCH-2102.patch)

> WARC Exporter
> -
>
> Key: NUTCH-2102
> URL: https://issues.apache.org/jira/browse/NUTCH-2102
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, dumpers
>Affects Versions: 1.10
>    Reporter: Julien Nioche
>
> This patch adds a WARC exporter 
> [http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike 
> the code submitted in [https://github.com/apache/nutch/pull/55] which is 
> based on the CommonCrawlDataDumper, this exporter is a MapReduce job and 
> hence should be able to cope with large segments in a timely fashion and also 
> is not limited to the local file system.
> Later on we could have a WARCImporter to generate segments from WARC files, 
> which is outside the scope of the CCDD anyway. Also WARC is not specific to 
> CommonCrawl, which is why the package name does not reflect it.
> I don't think it would be a problem to have both 
> [https://github.com/apache/nutch/pull/55] and this class providing similar 
> functionalities.
> This class is called in the following way 
> ./nutch org.apache.nutch.tools.warc.WARCExporter 
> /data/nutch-dipe/1kcrawl/warc -dir /data/nutch-dipe/1kcrawl/segments/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2102) WARC Exporter

2015-09-16 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14747327#comment-14747327
 ] 

Julien Nioche commented on NUTCH-2102:
--

Hi Markus

>  I believe this warc format is the updated arc format, for which we already 
> have an importer? 

The importer for WARC could be done later on and would leverage the same 
library as the exporter. But yes, it would look pretty similar to the ARC one .

> you meant to use StringBuilder instead of the synchronized StringBuffer in 
> HttpResponse
could do, will wait for other comments before amending the patch

> A bin/nutch entry is also missing, or not 
yes, why not. There's already far too much stuff in there :-) though. Again, I 
can amend it if ppl are +1 for committing this




> WARC Exporter
> -
>
> Key: NUTCH-2102
> URL: https://issues.apache.org/jira/browse/NUTCH-2102
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, dumpers
>Affects Versions: 1.10
>Reporter: Julien Nioche
> Attachments: NUTCH-2102.patch
>
>
> This patch adds a WARC exporter 
> [http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike 
> the code submitted in [https://github.com/apache/nutch/pull/55] which is 
> based on the CommonCrawlDataDumper, this exporter is a MapReduce job and 
> hence should be able to cope with large segments in a timely fashion and also 
> is not limited to the local file system.
> Later on we could have a WARCImporter to generate segments from WARC files, 
> which is outside the scope of the CCDD anyway. Also WARC is not specific to 
> CommonCrawl, which is why the package name does not reflect it.
> I don't think it would be a problem to have both the modified CCDD and this 
> class providing similar functionalities.
> This class is called in the following way 
> ./nutch org.apache.nutch.tools.warc.WARCExporter 
> /data/nutch-dipe/1kcrawl/warc -dir /data/nutch-dipe/1kcrawl/segments/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2102) WARC Exporter

2015-09-16 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-2102:
-
Attachment: NUTCH-2102.patch

> WARC Exporter
> -
>
> Key: NUTCH-2102
> URL: https://issues.apache.org/jira/browse/NUTCH-2102
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, dumpers
>Affects Versions: 1.10
>    Reporter: Julien Nioche
> Attachments: NUTCH-2102.patch
>
>
> This patch adds a WARC exporter 
> [http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike 
> the code submitted in [https://github.com/apache/nutch/pull/55] which is 
> based on the CommonCrawlDataDumper, this exporter is a MapReduce job and 
> hence should be able to cope with large segments in a timely fashion and also 
> is not limited to the local file system.
> Later on we could have a WARCImporter to generate segments from WARC files, 
> which is outside the scope of the CCDD anyway. Also WARC is not specific to 
> CommonCrawl, which is why the package name does not reflect it.
> I don't think it would be a problem to have both 
> [https://github.com/apache/nutch/pull/55] and this class providing similar 
> functionalities.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (NUTCH-2102) WARC Exporter

2015-09-16 Thread Julien Nioche (JIRA)
Julien Nioche created NUTCH-2102:


 Summary: WARC Exporter
 Key: NUTCH-2102
 URL: https://issues.apache.org/jira/browse/NUTCH-2102
 Project: Nutch
  Issue Type: Improvement
  Components: commoncrawl, dumpers
Affects Versions: 1.10
Reporter: Julien Nioche


This patch adds a WARC exporter 
[http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike the 
code submitted in [https://github.com/apache/nutch/pull/55] which is based on 
the CommonCrawlDataDumper, this exporter is a MapReduce job and hence should be 
able to cope with large segments in a timely fashion and also is not limited to 
the local file system.

Later on we could have a WARCImporter to generate segments from WARC files, 
which is outside the scope of the CCDD anyway. Also WARC is not specific to 
CommonCrawl, which is why the package name does not reflect it.

I don't think it would be a problem to have both 
[https://github.com/apache/nutch/pull/55] and this class providing similar 
functionalities.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2102) WARC Exporter

2015-09-16 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-2102:
-
Attachment: NUTCH-2102.patch

> WARC Exporter
> -
>
> Key: NUTCH-2102
> URL: https://issues.apache.org/jira/browse/NUTCH-2102
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, dumpers
>Affects Versions: 1.10
>    Reporter: Julien Nioche
> Attachments: NUTCH-2102.patch
>
>
> This patch adds a WARC exporter 
> [http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike 
> the code submitted in [https://github.com/apache/nutch/pull/55] which is 
> based on the CommonCrawlDataDumper, this exporter is a MapReduce job and 
> hence should be able to cope with large segments in a timely fashion and also 
> is not limited to the local file system.
> Later on we could have a WARCImporter to generate segments from WARC files, 
> which is outside the scope of the CCDD anyway. Also WARC is not specific to 
> CommonCrawl, which is why the package name does not reflect it.
> I don't think it would be a problem to have both 
> [https://github.com/apache/nutch/pull/55] and this class providing similar 
> functionalities.
> This class is called in the following way 
> ./nutch org.apache.nutch.tools.warc.WARCExporter 
> /data/nutch-dipe/1kcrawl/warc -dir /data/nutch-dipe/1kcrawl/segments/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2064) URLNormalizer basic to properly encode non-ASCII characters

2015-09-14 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744078#comment-14744078
 ] 

Julien Nioche commented on NUTCH-2064:
--

yep, can discuss that post 1.11

> URLNormalizer basic to properly encode non-ASCII characters
> ---
>
> Key: NUTCH-2064
> URL: https://issues.apache.org/jira/browse/NUTCH-2064
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.10
>Reporter: Markus Jelsma
> Fix For: 1.11
>
> Attachments: NUTCH-1098.patch, NUTCH-1098.patch, NUTCH-2064-v3.patch, 
> NUTCH-2064.patch
>
>
> NUTCH-1098 rewritten to work on trunk. Unit test is identical to 1098.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [ANNOUNCE] New Nutch committer and PMC - Asitang Mishra

2015-09-10 Thread Julien Nioche
Congratulations Asitang and welcome!

Julien

On 9 September 2015 at 23:01, Sebastian Nagel 
wrote:

> Dear all,
>
> on behalf of the Nutch PMC it is my pleasure to announce
> that Asitang Mishra has joined the Nutch team as committer
> and PMC member. Asitang, please feel free to introduce
> yourself and to tell the Nutch community about your
> interests and your relation to Nutch.
>
> Congratulations and welcome on board!
>
> Regards,
> Sebastian (on behalf of the Nutch PMC)
>



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


[jira] [Commented] (NUTCH-2064) URLNormalizer basic to properly encode non-ASCII characters

2015-09-04 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731114#comment-14731114
 ] 

Julien Nioche commented on NUTCH-2064:
--

What about moving the basic URL normalizer to Crawler-Commons? see 
[https://github.com/crawler-commons/crawler-commons/issues/74]
We already rely on it for robots parsing, and other projects would be able to 
reuse it (as well as improve it). Any views on this?

> URLNormalizer basic to properly encode non-ASCII characters
> ---
>
> Key: NUTCH-2064
> URL: https://issues.apache.org/jira/browse/NUTCH-2064
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.10
>Reporter: Markus Jelsma
> Fix For: 1.11
>
> Attachments: NUTCH-1098.patch, NUTCH-1098.patch, NUTCH-2064-v3.patch, 
> NUTCH-2064.patch
>
>
> NUTCH-1098 rewritten to work on trunk. Unit test is identical to 1098.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [DISCUSS] Release Nutch trunk 1.11

2015-08-26 Thread Julien Nioche
Hi Lewis

I'd love to see https://issues.apache.org/jira/browse/NUTCH-1517 being part
of 1.11. It is a separate indexing plugin which should not impact any
existing code. It's been reviewed by Jorge and I'll to commit it soon
unless someone objects.

Thanks

J.

On 26 August 2015 at 03:23, Lewis John Mcgibbney lewis.mcgibb...@gmail.com
wrote:

 Hi Folks,
 What do you all think about getting a release candidate out for Nutch
 1.11? I am happy to do RM role.
 Thanks
 Lewis


 --
 *Lewis*




-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: [DISCUSS] Release Nutch trunk 1.11

2015-08-26 Thread Julien Nioche
Done. Thanks Markus

On 26 August 2015 at 13:08, Markus Jelsma markus.jel...@openindex.io
wrote:

 Yes Julien, please commit. I do think
 https://issues.apache.org/jira/browse/NUTCH-2064 should also be included.
 But i have my hands full atm.

 -Original message-
 From: Julien Nioche <lists.digitalpeb...@gmail.com>
 Sent: Wednesday 26th August 2015 13:51
 To: dev@nutch.apache.org
 Subject: Re: [DISCUSS] Release Nutch trunk 1.11

 Hi Lewis

 I'd love to see https://issues.apache.org/jira/browse/NUTCH-1517 being part of 1.11. It
 is a separate indexing plugin which should not impact any existing code.
 It's been reviewed by Jorge and I'll commit it soon unless someone objects.

 Thanks

 J.

 On 26 August 2015 at 03:23, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:

 Hi Folks,

 What do you all think about getting a release candidate out for Nutch
 1.11? I am happy to do RM role.

 Thanks

 Lewis

 --

 Lewis

 --

 Open Source Solutions for Text Engineering

 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble





-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


[jira] [Resolved] (NUTCH-1517) CloudSearch indexer

2015-08-26 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-1517.
--
Resolution: Fixed

trunk committed revision 1697911.

Thanks for comments and review

 CloudSearch indexer
 ---

 Key: NUTCH-1517
 URL: https://issues.apache.org/jira/browse/NUTCH-1517
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.11

 Attachments: 0023883254_1377197869_indexer-cloudsearch.patch, 
 0025666929_1382393138_indexer-cloudsearch.20131021.patch, NUTCH-1517.v2.patch


 Once we have made the indexers pluggable, we should add a plugin for Amazon 
 CloudSearch. See http://aws.amazon.com/cloudsearch/. Apparently it uses a 
 JSON based representation Search Data Format (SDF), which we could reuse for 
 a file based indexer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1517) CloudSearch indexer

2015-08-26 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14712988#comment-14712988
 ] 

Julien Nioche commented on NUTCH-1517:
--

Thanks [~jorgelbg]. Will commit soon unless someone objects.

 CloudSearch indexer
 ---

 Key: NUTCH-1517
 URL: https://issues.apache.org/jira/browse/NUTCH-1517
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.11

 Attachments: 0023883254_1377197869_indexer-cloudsearch.patch, 
 0025666929_1382393138_indexer-cloudsearch.20131021.patch, NUTCH-1517.v2.patch


 Once we have made the indexers pluggable, we should add a plugin for Amazon 
 CloudSearch. See http://aws.amazon.com/cloudsearch/. Apparently it uses a 
 JSON based representation Search Data Format (SDF), which we could reuse for 
 a file based indexer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (NUTCH-2049) Upgrade Trunk to Hadoop 2.4 stable

2015-08-24 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-2049.
--
Resolution: Fixed

Committed revision 1697466.

Thanks to everyone involved.

 Upgrade Trunk to Hadoop  2.4 stable
 

 Key: NUTCH-2049
 URL: https://issues.apache.org/jira/browse/NUTCH-2049
 Project: Nutch
  Issue Type: Improvement
  Components: build
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
  Labels: memex
 Fix For: 1.11

 Attachments: NUTCH-2049.patch, NUTCH-2049v2.patch, NUTCH-2049v3.patch


 Convo here - http://www.mail-archive.com/dev%40nutch.apache.org/msg18225.html
 I am +1 for taking trunk (or a branch of trunk) to explicit dependency on  
 Hadoop 2.6.
 We can run our tests, we can validate, we can fix.
 I will be doing validation on 2.X in parallel as this is what I use on my 
 own projects. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop 2.4 stable

2015-08-21 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706402#comment-14706402
 ] 

Julien Nioche commented on NUTCH-2049:
--

Fantastic work [~lewismc]! I think this is one of the most important changes to 
Nutch in recent years. Well done.
Compilation and tests all fine, crawl in local mode OK. 

+1 to commit 

 Upgrade Trunk to Hadoop  2.4 stable
 

 Key: NUTCH-2049
 URL: https://issues.apache.org/jira/browse/NUTCH-2049
 Project: Nutch
  Issue Type: Improvement
  Components: build
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
  Labels: memex
 Fix For: 1.11

 Attachments: NUTCH-2049.patch, NUTCH-2049v2.patch, NUTCH-2049v3.patch


 Convo here - http://www.mail-archive.com/dev%40nutch.apache.org/msg18225.html
 I am +1 for taking trunk (or a branch of trunk) to explicit dependency on  
 Hadoop 2.6.
 We can run our tests, we can validate, we can fix.
 I will be doing validation on 2.X in parallel as this is what I use on my 
 own projects. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-1517) CloudSearch indexer

2015-08-19 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1517:
-
Attachment: (was: NUTCH-1517.patch)

 CloudSearch indexer
 ---

 Key: NUTCH-1517
 URL: https://issues.apache.org/jira/browse/NUTCH-1517
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.11

 Attachments: 0023883254_1377197869_indexer-cloudsearch.patch, 
 0025666929_1382393138_indexer-cloudsearch.20131021.patch, NUTCH-1517.v2.patch


 Once we have made the indexers pluggable, we should add a plugin for Amazon 
 CloudSearch. See http://aws.amazon.com/cloudsearch/. Apparently it uses a 
 JSON based representation Search Data Format (SDF), which we could reuse for 
 a file based indexer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-1517) CloudSearch indexer

2015-08-19 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1517:
-
Flags: Patch

 CloudSearch indexer
 ---

 Key: NUTCH-1517
 URL: https://issues.apache.org/jira/browse/NUTCH-1517
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.11

 Attachments: 0023883254_1377197869_indexer-cloudsearch.patch, 
 0025666929_1382393138_indexer-cloudsearch.20131021.patch, NUTCH-1517.patch


 Once we have made the indexers pluggable, we should add a plugin for Amazon 
 CloudSearch. See http://aws.amazon.com/cloudsearch/. Apparently it uses a 
 JSON based representation Search Data Format (SDF), which we could reuse for 
 a file based indexer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-1517) CloudSearch indexer

2015-08-19 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1517:
-
Attachment: NUTCH-1517.patch

New implementation of the CloudSearchIndexWriter, uses the latest version of 
the CloudSearch API. See README file for instructions

 CloudSearch indexer
 ---

 Key: NUTCH-1517
 URL: https://issues.apache.org/jira/browse/NUTCH-1517
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.11

 Attachments: 0023883254_1377197869_indexer-cloudsearch.patch, 
 0025666929_1382393138_indexer-cloudsearch.20131021.patch, NUTCH-1517.patch


 Once we have made the indexers pluggable, we should add a plugin for Amazon 
 CloudSearch. See http://aws.amazon.com/cloudsearch/. Apparently it uses a 
 JSON based representation Search Data Format (SDF), which we could reuse for 
 a file based indexer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2069) Ignore external links based on domain

2015-07-30 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647467#comment-14647467
 ] 

Julien Nioche commented on NUTCH-2069:
--

Hi [~wastl-nagel] and [~markus17].  BTW did not mean to be short in my previous 
message but was typing from my phone ;-)
I know the difficulties of enforcing the code formatting systematically, but I 
thought I might as well fix it while I was working on that part of the code. 
Feel free to remove the bits from the patch that are about the formatting only.

bq. we could define this as two properties `db.ignore.external.links` + 
`db.ignore.external.links.mode`. The latter can be host or domain, similar 
to other properties (partition.url.mode, generator.count.mode, 
fetcher.queue.mode). That would be extensible and can make the code leaner.

yes that would be more elegant
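
Something like this in nutch-site.xml (sketch only, property names exactly as suggested 
above; whether the mode value ends up as 'domain' or 'byDomain' in the style of 
partition.url.mode is still open):

{code}
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
<property>
  <name>db.ignore.external.links.mode</name>
  <!-- 'host' (current behaviour) or 'domain' -->
  <value>domain</value>
</property>
{code}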

On vacation for the next few weeks as of today - will update the code based on 
your suggestion when I am back, unless one of you beats me to it of course.

J.  



 Ignore external links based on domain
 -

 Key: NUTCH-2069
 URL: https://issues.apache.org/jira/browse/NUTCH-2069
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher, parser
Affects Versions: 1.10
Reporter: Julien Nioche
 Fix For: 1.11

 Attachments: NUTCH-2069.patch


 We currently have `db.ignore.external.links` which is a nice way of 
 restricting the crawl based on the hostname. This adds a new parameter 
 'db.ignore.external.links.domain' to do the same based on the domain.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2069) Ignore external links based on domain

2015-07-29 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646543#comment-14646543
 ] 

Julien Nioche commented on NUTCH-2069:
--

What code restyle? I applied the formatting rules from 2.x as expected. They 
should be copied to trunk BTW. Looks like Lewis did not use them.

 Ignore external links based on domain
 -

 Key: NUTCH-2069
 URL: https://issues.apache.org/jira/browse/NUTCH-2069
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher, parser
Affects Versions: 1.10
Reporter: Julien Nioche
 Fix For: 1.11

 Attachments: NUTCH-2069.patch


 We currently have `db.ignore.external.links` which is a nice way of 
 restricting the crawl based on the hostname. This adds a new parameter 
 'db.ignore.external.links.domain' to do the same based on the domain.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (NUTCH-2069) Ignore external links based on domain

2015-07-29 Thread Julien Nioche (JIRA)
Julien Nioche created NUTCH-2069:


 Summary: Ignore external links based on domain
 Key: NUTCH-2069
 URL: https://issues.apache.org/jira/browse/NUTCH-2069
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher, parser
Affects Versions: 1.10
Reporter: Julien Nioche
 Fix For: 1.11


We currently have `db.ignore.external.links` which is a nice way of restricting 
the crawl based on the hostname. This adds a new parameter 
'db.ignore.external.links.domain' to do the same based on the domain.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2069) Ignore external links based on domain

2015-07-29 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-2069:
-
Attachment: NUTCH-2069.patch

 Ignore external links based on domain
 -

 Key: NUTCH-2069
 URL: https://issues.apache.org/jira/browse/NUTCH-2069
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher, parser
Affects Versions: 1.10
Reporter: Julien Nioche
 Fix For: 1.11

 Attachments: NUTCH-2069.patch


 We currently have `db.ignore.external.links` which is a nice way of 
 restricting the crawl based on the hostname. This adds a new parameter 
 'db.ignore.external.links.domain' to do the same based on the domain.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2069) Ignore external links based on domain

2015-07-29 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-2069:
-
Patch Info: Patch Available

 Ignore external links based on domain
 -

 Key: NUTCH-2069
 URL: https://issues.apache.org/jira/browse/NUTCH-2069
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher, parser
Affects Versions: 1.10
Reporter: Julien Nioche
 Fix For: 1.11

 Attachments: NUTCH-2069.patch


 We currently have `db.ignore.external.links` which is a nice way of 
 restricting the crawl based on the hostname. This adds a new parameter 
 'db.ignore.external.links.domain' to do the same based on the domain.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2048) parse-tika: fix dependencies in plugin.xml

2015-07-24 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14640138#comment-14640138
 ] 

Julien Nioche commented on NUTCH-2048:
--

howto_upgrade_tika.txt has been around for 2 years. One possible explanation is 
that whoever committed the change already had a lib directory with the old 
dependencies in it.  The build-ivy.xml script should be modified to remove any 
existing content in the lib dir it creates.

 parse-tika: fix dependencies in plugin.xml
 --

 Key: NUTCH-2048
 URL: https://issues.apache.org/jira/browse/NUTCH-2048
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.10
Reporter: Sebastian Nagel
Priority: Trivial
 Fix For: 1.11

 Attachments: NUTCH-2048_Joyce_20150723.patch, 
 NUTCH-2048_Joyce_20150723_2.patch


 Duplicate library dependencies listed in parse-tika's plugin.xml should be 
 cleaned up. There are duplicates where only the version differs, e.g.:
 {noformat}
 tika-parsers-1.7.jar
 tika-parsers-1.8.jar
 {noformat}
 Not critical because libs which are not present should just be ignored.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (NUTCH-1517) CloudSearch indexer

2015-07-24 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche reassigned NUTCH-1517:


Assignee: Julien Nioche

 CloudSearch indexer
 ---

 Key: NUTCH-1517
 URL: https://issues.apache.org/jira/browse/NUTCH-1517
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.11

 Attachments: 0023883254_1377197869_indexer-cloudsearch.patch, 
 0025666929_1382393138_indexer-cloudsearch.20131021.patch


 Once we have made the indexers pluggable, we should add a plugin for Amazon 
 CloudSearch. See http://aws.amazon.com/cloudsearch/. Apparently it uses a 
 JSON-based representation, the Search Data Format (SDF), which we could reuse 
 for a file-based indexer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2016) Remove OldFetcher from trunk

2015-06-25 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600946#comment-14600946
 ] 

Julien Nioche commented on NUTCH-2016:
--

+1

 Remove OldFetcher from trunk
 

 Key: NUTCH-2016
 URL: https://issues.apache.org/jira/browse/NUTCH-2016
 Project: Nutch
  Issue Type: Wish
  Components: fetcher
Affects Versions: 1.11
Reporter: Sebastian Nagel
Priority: Trivial
 Fix For: 1.11


 The class OldFetcher is not actively maintained and lacks all features added 
 to the new threaded Fetcher (started in 2007, used as default fetcher since 
 2009). Time to remove it from the code base (trunk/1.x only)?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2036) Adding some continuous crawl goodies to the crawl script

2015-06-25 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-2036:
-
Affects Version/s: (was: 1.11)

 Adding some continuous crawl goodies to the crawl script
 

 Key: NUTCH-2036
 URL: https://issues.apache.org/jira/browse/NUTCH-2036
 Project: Nutch
  Issue Type: Improvement
  Components: bin, tool, util
Affects Versions: 1.10
Reporter: Jorge Luis Betancourt Gonzalez
Priority: Minor
  Labels: crawl, script
 Fix For: 1.11

 Attachments: NUTCH-2036.patch


 Although Nutch does not support continuous crawling out of the box, and yes, 
 this is somehow doable using cron or even sometimes irrelevant due to the 
 size of the crawl, it is a nice feature to have. 
 This patch basically just adds a new parameter option to the {{bin/crawl}} 
 script (-w|--wait) which adds a time to wait if the generator returns 0 (when 
 no URLs are scheduled for fetching). 
 This new parameter has the {{NUMBER\[SUFFIX\]}} format; if no suffix is 
 provided, the amount of time is assumed to be in seconds. Valid suffixes are: 
 s - seconds
 m - minutes
 h - hours
 d - days
 If a {{-1}} value is passed to the parameter, or it is not used at all, the 
 default behaviour of exiting the script is used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2036) Adding some continuous crawl goodies to the crawl script

2015-06-25 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600949#comment-14600949
 ] 

Julien Nioche commented on NUTCH-2036:
--

Any thoughts on this? This is useful and should be committed, I think.
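
For reviewers, the NUMBER[SUFFIX] conversion described below boils down to something like this (an illustrative sketch in Java only; the attached patch implements the same logic inside the bin/crawl shell script):

{code}
/** Sketch: convert a wait value in NUMBER[SUFFIX] form (s, m, h, d) into seconds. */
public class WaitTime {
  public static long toSeconds(String value) {
    char suffix = value.charAt(value.length() - 1);
    long multiplier;
    switch (suffix) {
      case 's': multiplier = 1L; break;
      case 'm': multiplier = 60L; break;
      case 'h': multiplier = 3600L; break;
      case 'd': multiplier = 86400L; break;
      default:  return Long.parseLong(value); // no suffix: value is already in seconds
    }
    return Long.parseLong(value.substring(0, value.length() - 1)) * multiplier;
  }

  public static void main(String[] args) {
    System.out.println(toSeconds("30m")); // prints 1800
  }
}
{code}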

 Adding some continuous crawl goodies to the crawl script
 

 Key: NUTCH-2036
 URL: https://issues.apache.org/jira/browse/NUTCH-2036
 Project: Nutch
  Issue Type: Improvement
  Components: bin, tool, util
Affects Versions: 1.10
Reporter: Jorge Luis Betancourt Gonzalez
Priority: Minor
  Labels: crawl, script
 Fix For: 1.11

 Attachments: NUTCH-2036.patch


 Although Nutch does not support continuous crawling out of the box, and yes, 
 this is somehow doable using cron or even sometimes irrelevant due to the 
 size of the crawl, it is a nice feature to have. 
 This patch basically just adds a new parameter option to the {{bin/crawl}} 
 script (-w|--wait) which adds a time to wait if the generator returns 0 (when 
 no URLs are scheduled for fetching). 
 This new parameter has the {{NUMBER\[SUFFIX\]}} format; if no suffix is 
 provided, the amount of time is assumed to be in seconds. Valid suffixes are: 
 s - seconds
 m - minutes
 h - hours
 d - days
 If a {{-1}} value is passed to the parameter, or it is not used at all, the 
 default behaviour of exiting the script is used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2036) Adding some continuous crawl goodies to the crawl script

2015-06-25 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-2036:
-
Fix Version/s: 1.11

 Adding some continuous crawl goodies to the crawl script
 

 Key: NUTCH-2036
 URL: https://issues.apache.org/jira/browse/NUTCH-2036
 Project: Nutch
  Issue Type: Improvement
  Components: bin, tool, util
Affects Versions: 1.10
Reporter: Jorge Luis Betancourt Gonzalez
Priority: Minor
  Labels: crawl, script
 Fix For: 1.11

 Attachments: NUTCH-2036.patch


 Although Nutch does not support continuous crawling out of the box, and yes, 
 this is somehow doable using cron or even sometimes irrelevant due to the 
 size of the crawl, it is a nice feature to have. 
 This patch basically just adds a new parameter option to the {{bin/crawl}} 
 script (-w|--wait) which adds a time to wait if the generator returns 0 (when 
 no URLs are scheduled for fetching). 
 This new parameter has the {{NUMBER\[SUFFIX\]}} format; if no suffix is 
 provided, the amount of time is assumed to be in seconds. Valid suffixes are: 
 s - seconds
 m - minutes
 h - hours
 d - days
 If a {{-1}} value is passed to the parameter, or it is not used at all, the 
 default behaviour of exiting the script is used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2046) The crawl script should be able to skip an initial injection.

2015-06-24 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599840#comment-14599840
 ] 

Julien Nioche commented on NUTCH-2046:
--

Re the script: what about a positive parameter instead of a negative one (like 
we do for the indexing with -i)? We could have -s followed by the path to the 
seed.

 The crawl script should be able to skip an initial injection.
 -

 Key: NUTCH-2046
 URL: https://issues.apache.org/jira/browse/NUTCH-2046
 Project: Nutch
  Issue Type: Improvement
  Components: crawldb, injector
Affects Versions: 1.10
Reporter: Luis Lopez
  Labels: crawl, injection
 Fix For: 1.11


 When our crawl gets really big, a new injection takes considerable time as it 
 updates the crawldb; the crawl script should be able to skip the injection and 
 go directly to the generate call.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2000) Link inversion fails with .locked already exists.

2015-06-17 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14589951#comment-14589951
 ] 

Julien Nioche commented on NUTCH-2000:
--

Hi Seb, +1 to commit. Not sure I'll be able to reproduce it, but let's assume it 
fixes it. We can always reopen later if it's still a problem.

 Link inversion fails with .locked already exists.
 -

 Key: NUTCH-2000
 URL: https://issues.apache.org/jira/browse/NUTCH-2000
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.9
Reporter: Julien Nioche
 Fix For: 1.11

 Attachments: NUTCH-2000-v1.patch


 Using the standard crawl script with a brand new test dir in local mode I am 
 getting: 
 Link inversion
 /data/BLABLABLA/runtime/local/bin/nutch invertlinks 
 /data/BLABLABLA/testCrawl2//linkdb 
 /data/BLABLABLA/testCrawl2//segments/20150423114335
 LinkDb: java.io.IOException: lock file 
 /data/BLABLABLA/testCrawl2/linkdb/.locked already exists.
 PS: 2000!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


crawler-commons 0.6 released

2015-06-11 Thread Julien Nioche
[Apologies for cross posting]

crawler-commons 0.6 is released

We are glad to announce the 0.6 release of Crawler Commons. See the CHANGES.txt
file included with the release for a full list of details:
https://github.com/crawler-commons/crawler-commons/releases/tag/crawler-commons-0.6

We suggest that all users upgrade to this version. Details of how to do so
can be found on Maven Central
http://search.maven.org/#artifactdetails%7Ccom.github.crawler-commons%7Ccrawler-commons%7C0.6%7Cjar.
Please note that the groupId has changed to *com.github.crawler-commons*.
Thanks to all contributors.

Julien

https://github.com/crawler-commons/crawler-commons


[jira] [Resolved] (NUTCH-2006) IndexingFiltersChecker to take custom metadata as input

2015-05-15 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-2006.
--
   Resolution: Fixed
Fix Version/s: 1.11

Committed revision 1679567.

Thanks Seb

 IndexingFiltersChecker  to take custom metadata as input
 

 Key: NUTCH-2006
 URL: https://issues.apache.org/jira/browse/NUTCH-2006
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.10
Reporter: Julien Nioche
Priority: Minor
 Fix For: 1.11

 Attachments: NUTCH-2006.patch


 Similar to [NUTCH-1757] but for IndexingFiltersChecker.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2012) Merge parsechecker and indexchecker

2015-05-15 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545534#comment-14545534
 ] 

Julien Nioche commented on NUTCH-2012:
--

+1 to merging them into a more generic tool. Most of the code in these two 
classes is the same. We could add a few options, e.g. one not to display the 
fields generated for indexing.

 Merge parsechecker and indexchecker
 ---

 Key: NUTCH-2012
 URL: https://issues.apache.org/jira/browse/NUTCH-2012
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.10
Reporter: Sebastian Nagel
Priority: Minor
 Fix For: 1.11


 ParserChecker and IndexingFiltersChecker have evolved from simple tools to 
 check parsers and parse filters, respectively indexing filters, into powerful 
 tools which emulate the crawling of a single URL/document:
 - check robots.txt (NUTCH-2002)
 - follow redirects (NUTCH-2004)
 Keeping both tools in sync takes extra work (cf. NUTCH-1757/NUTCH-2006, also 
 NUTCH-2002, NUTCH-2004 are done only for parsechecker). It's time to merge 
 them
 * either into one general debugging tool, keeping parsechecker and 
 indexchecker as aliases
 * or centralize common code in one utility class



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2008) IndexerMapReduce to use single instance of NutchIndexAction for deletions

2015-05-13 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541843#comment-14541843
 ] 

Julien Nioche commented on NUTCH-2008:
--

Makes total sense. +1
Could we also make it static while we are at it?
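
That is, something along these lines (a sketch only, based on the snippet quoted below):

{code}
// one shared instance, created once, since it carries no per-document data
private static final NutchIndexAction DELETE_ACTION =
    new NutchIndexAction(null, NutchIndexAction.DELETE);

// ... then wherever a deletion is emitted:
output.collect(key, DELETE_ACTION);
{code}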

 IndexerMapReduce to use single instance of NutchIndexAction for deletions
 -

 Key: NUTCH-2008
 URL: https://issues.apache.org/jira/browse/NUTCH-2008
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.10
Reporter: Sebastian Nagel
Priority: Trivial
 Fix For: 1.11

 Attachments: NUTCH-2008-trunk-v1.patch


 For every URL/document to be deleted, a new instance of NutchIndexAction is 
 created in IndexerMapReduce (in multiple places):
 {code}
 NutchIndexAction action = new NutchIndexAction(null, NutchIndexAction.DELETE);
 output.collect(key, action);
 {code}
 Since the index action does not hold any data specific to any URL/document, it 
 would be more efficient to re-use a single instance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (NUTCH-2006) IndexingFiltersChecker to take custom metadata as input

2015-05-11 Thread Julien Nioche (JIRA)
Julien Nioche created NUTCH-2006:


 Summary: IndexingFiltersChecker  to take custom metadata as input
 Key: NUTCH-2006
 URL: https://issues.apache.org/jira/browse/NUTCH-2006
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.10
Reporter: Julien Nioche
Priority: Minor


Similar to [NUTCH-1757] but for IndexingFiltersChecker.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2006) IndexingFiltersChecker to take custom metadata as input

2015-05-11 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-2006:
-
Attachment: NUTCH-2006.patch

Patch which allows custom metadata to be taken into account + improved use of 
slf4j and Configured.

 IndexingFiltersChecker  to take custom metadata as input
 

 Key: NUTCH-2006
 URL: https://issues.apache.org/jira/browse/NUTCH-2006
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.10
Reporter: Julien Nioche
Priority: Minor
 Attachments: NUTCH-2006.patch


 Similar to [NUTCH-1757] but for IndexingFiltersChecker.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2006) IndexingFiltersChecker to take custom metadata as input

2015-05-11 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-2006:
-
Patch Info: Patch Available

 IndexingFiltersChecker  to take custom metadata as input
 

 Key: NUTCH-2006
 URL: https://issues.apache.org/jira/browse/NUTCH-2006
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.10
Reporter: Julien Nioche
Priority: Minor
 Attachments: NUTCH-2006.patch


 Similar to [NUTCH-1757] but for IndexingFiltersChecker.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-1999) Add http://nutch.apache.org/robots.txt

2015-05-11 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1999:
-
Assignee: (was: Julien Nioche)

 Add http://nutch.apache.org/robots.txt
 --

 Key: NUTCH-1999
 URL: https://issues.apache.org/jira/browse/NUTCH-1999
 Project: Nutch
  Issue Type: Improvement
  Components: website
Reporter: Julien Nioche

 http://nutch.apache.org/robots.txt = 404 not found
 Aren't we funny! Go and tell webmasters to have a robots.txt after that!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

