Re: [DISCUSS] Release Trunk

2013-11-28 Thread Julien Nioche
Hi Lewis

We've done quite a few things in 1.x since the previous release (e.g.
generic deduplication, removing the indexer.solr package), and the next
2.x release will only come after the changes to GORA have been made, tested
and used on the Nutch side, so that could be quite a while.

I am neutral as to whether we should do a 1.x release now. There are some
minor issues that we could fix in 1.x before the next release, such as:
* https://issues.apache.org/jira/browse/NUTCH-1360
* https://issues.apache.org/jira/browse/NUTCH-1676
and probably a few others but they could also be done later.

Let's hear what others think.

Thanks

Julien




On 28 November 2013 16:34, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:

 Hi Folks,
 Thread says it all.
 There are some hot tickets over in Gora right now so I think holding off
 for a while on a 2.x release would be wise.
 I can spin the RC for trunk tonight/tomorrow/weekend if we get the thumbs
 up.
 Ta
 Lewis

 --
 *Lewis*




-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: Nutch with YARN (aka Hadoop 2.0)

2013-12-09 Thread Julien Nioche
I don't think Nutch has been fully ported to the new mapreduce API, which is
a prerequisite for running it on Hadoop 2.
I can't think of a reason why the performance would be any different
with YARN.

Julien


On 9 December 2013 06:42, Tejas Patil tejas.patil...@gmail.com wrote:

 Has anyone tried running Nutch over YARN? If so, were there any
 performance gains?

 Thanks,
 Tejas






Re: Nightly builds

2014-01-08 Thread Julien Nioche
Great stuff, thanks Lewis


On 8 January 2014 12:00, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

 Hi Folks,

 On Wed, Jan 8, 2014 at 4:06 AM, dev-digest-h...@nutch.apache.org wrote:

 I'm working on getting the Jenkins job configuration stable again.
 Something seems to have been reset or is not correct.
 I'll update here once we are back to stable builds.


 Seems that there was an upgrade to the Jenkins servers we run the builds
 on... which unfortunately resulted in this bug [0].

 I made some tweaks to the job config and the good news is that builds are
 stable again.

 Ta

 [0] https://issues.jenkins-ci.org/browse/JENKINS-21250






Re: Renovating Nutch Hadoop Tutorial wiki page

2014-01-21 Thread Julien Nioche
Hi

The whole thing has been replaced with
http://wiki.apache.org/nutch/NutchHadoopSingleNodeTutorial which
does exactly what you described. +1 to remove the old
nutchhadooptutorial page.

J.


On 21 January 2014 17:44, Tejas Patil tejas.patil...@gmail.com wrote:

 Hi nutch-dev,

 I was looking at [0] and realized that with the massive number of Hadoop
 setup tutorials out there on the internet, we need not repeat the same on the
 Nutch wiki page and can instead assume that the user has already done the
 Hadoop setup. For convenience, we could direct users to the Hadoop wiki page
 which has Hadoop setup details.
 Plus, I propose the following:

 - Section Downloading Hadoop and Nutch : Remove the Hadoop portions and
 let the Nutch stuff stay.
 - Section Setting Up The Deployment Architecture must be removed.
 - Section Deploy Nutch to Single Machine and Deploy Nutch to Multiple
 Machines can be merged together.
 - Section Performing a Nutch Crawl, Testing the Crawl and Performing
 a Search must be merged, its contents must be updated.
 - Section Rsyncing Code to Slaves and Updates can be completely
 removed.

 Any comments ?

 [0] : http://wiki.apache.org/nutch/NutchHadoopTutorial

 Thanks,
 Tejas






Re: Renovating Nutch Hadoop Tutorial wiki page

2014-01-22 Thread Julien Nioche
Thanks Tejas!


On 22 January 2014 11:51, Tejas Patil tejas.patil...@gmail.com wrote:

 Moved the old nutchhadooptutorial page from Nutch wiki Front page to
 Archive and Legacy.

 ~tejas


 On Wed, Jan 22, 2014 at 5:09 PM, Tejas Patil tejas.patil...@gmail.com wrote:

 Thanks *Julien* for pointing me to the new NutchHadoopSingleNodeTutorial
 wiki page [0]. I will soon remove the old nutchhadooptutorial page from the
 wiki.

 [0] : http://wiki.apache.org/nutch/NutchHadoopSingleNodeTutorial

 *@d_k*, there are already tutorials for running Nutch 2.x. See [1] and
 [2]. Those are not as extensive as the tutorial for 1.x [3] but carry the
 steps which are different for 2.x. The remaining steps after the datastore
 setup are similar - the only difference being in the command params, which can
 be figured out from the usage, so they were not duplicated in those 2.x
 tutorials to avoid maintenance overhead. Do you think that the 2.x
 tutorials are inadequate in some regards?

 [1] : http://wiki.apache.org/nutch/Nutch2Tutorial
 [2] : http://wiki.apache.org/nutch/Nutch2Cassandra
 [3] : http://wiki.apache.org/nutch/NutchTutorial

 Thanks,
 Tejas


 On Wed, Jan 22, 2014 at 2:47 AM, d_k mail...@gmail.com wrote:

 Actually what I would like to see is a Nutch 2.x tutorial at the same
 level of detail as the http://wiki.apache.org/nutch/NutchHadoopTutorial
 What is the process of contributing to that wiki page?




Nutch meetup / hackathon at BerlinBuzzwords next May?

2014-01-24 Thread Julien Nioche
Hi guys,

I'll certainly be at BerlinBuzzwords and have submitted a talk on Nutch.
What about having a Nutch meetup / hackathon / workshop?

http://berlinbuzzwords.de/news/hackathons-workshops-berlin-buzzwords

Julien



Re: [DISCUSS] Release Trunk

2014-02-12 Thread Julien Nioche
Hi guys,

At least 2 of the issues that Seb and I had mentioned have now been
committed. What about releasing 1.8 from trunk? If so, any volunteers?

Julien


On 2 December 2013 21:02, Sebastian Nagel wastl.na...@googlemail.com wrote:

 Hi,

 +1 to release soon (this year, or early next year)

  and probably a few others but they could also be done later.
 At least, these should be done before releasing:
 NUTCH-1646 IndexerMapReduce to consider DB status
 NUTCH-1413 Record response time

 Sebastian



Re: [DISCUSS] Release 1.8?

2014-03-11 Thread Julien Nioche
+1
Thanks for your work on these issues guys!

Julien


On 11 March 2014 18:24, Markus Jelsma markus.jel...@openindex.io wrote:

 Yes! Agreed!




 Sebastian Nagel wastl.na...@googlemail.com wrote:

 Hi everyone,

 NUTCH-1113 and NUTCH-1706 are fixed,
 broken HostDb (NUTCH-1325) has been removed for now from trunk.
 No open issues marked for 1.8 are left
 and everything seems to work!

 Time to spin a new release candidate?

 Sebastian






Re: Who is moderating Nutch lists?

2014-03-13 Thread Julien Nioche
I don't think these lists are moderated. Don't think they should be either

J

On Thursday, 13 March 2014, Markus Jelsma markus.jel...@openindex.io
wrote:

 Well, that's not me, perhaps Chris?

 -Original message-
 From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
 Sent: Wednesday 12th March 2014 15:56
 To: dev@nutch.apache.org
 Subject: Who is moderating Nutch lists?

 Hi Folks,

 Is anyone moderating these lists?
 I understand that it is a bit of a pain in the neck as both user@ and dev@ are
 reasonably busy but I am just curious to find out.

 Thanks

 Lewis

 --
 Lewis






Re: [VOTE] Release Apache Nutch 1.8RC#2

2014-03-16 Thread Julien Nioche
+1 from me. Thanks everyone

On Sunday, 16 March 2014, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:

 +1 from me!

 SIGS pass, CHECKSUMS pass:

 [chipotle:~/tmp/apache-nutch-1.8] mattmann% $HOME/bin/stage_apache_rc
 apache-nutch 1.8-bin https://dist.apache.org/repos/dist/dev/nutch/
 [chipotle:~/tmp/apache-nutch-1.8] mattmann% $HOME/bin/stage_apache_rc
 apache-nutch 1.8-src https://dist.apache.org/repos/dist/dev/nutch/
 [chipotle:~/tmp/apache-nutch-1.8] mattmann% $HOME/bin/verify_gpg_sigs
 Verifying Signature for file apache-nutch-1.8-bin.tar.gz.asc
 gpg: Signature made Tue Mar 11 14:23:44 2014 PDT using RSA key ID 48BAEBF6
 gpg: Good signature from Lewis John McGibbney (CODE SIGNING KEY)
 lewi...@apache.org
 gpg: WARNING: This key is not certified with a trusted signature!
 gpg:  There is no indication that the signature belongs to the
 owner.
 Primary key fingerprint: DB7B 5199 121C 08A5 C8F4  052B 3A47 17F0 48BA EBF6
 Verifying Signature for file apache-nutch-1.8-bin.zip.asc
 gpg: Signature made Tue Mar 11 14:25:56 2014 PDT using RSA key ID 48BAEBF6
 gpg: Good signature from Lewis John McGibbney (CODE SIGNING KEY)
 lewi...@apache.org
 gpg: WARNING: This key is not certified with a trusted signature!
 gpg:  There is no indication that the signature belongs to the
 owner.
 Primary key fingerprint: DB7B 5199 121C 08A5 C8F4  052B 3A47 17F0 48BA EBF6
 Verifying Signature for file apache-nutch-1.8-src.tar.gz.asc
 gpg: Signature made Tue Mar 11 14:26:28 2014 PDT using RSA key ID 48BAEBF6
 gpg: Good signature from Lewis John McGibbney (CODE SIGNING KEY)
 lewi...@apache.org
 gpg: WARNING: This key is not certified with a trusted signature!
 gpg:  There is no indication that the signature belongs to the
 owner.
 Primary key fingerprint: DB7B 5199 121C 08A5 C8F4  052B 3A47 17F0 48BA EBF6
 Verifying Signature for file apache-nutch-1.8-src.zip.asc
 gpg: Signature made Tue Mar 11 14:26:44 2014 PDT using RSA key ID 48BAEBF6
 gpg: Good signature from Lewis John McGibbney (CODE SIGNING KEY)
 lewi...@apache.org 
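
The checksum part of the verification above can also be reproduced by hand. The sketch below, in Python, hashes a downloaded file and compares the result with a published value. The digest algorithm is an assumption (the thread does not say which one was published alongside the release candidate), and `file_digest` and `checksum_matches` are illustrative names, not part of any release tooling.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large archives do not have to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_hex, algorithm="sha256"):
    """Compare the computed digest with the published one, ignoring case and whitespace."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

Checking the GPG signatures is a separate step and is not covered by this sketch.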

Re: Pushing content to Solr from Nutch

2014-04-10 Thread Julien Nioche
Hi Xavier

Your config file looks a bit outdated. Here are the values set by default
(see http://svn.apache.org/repos/asf/nutch/trunk/conf/nutch-default.xml)

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

Your problem comes from the fact that you are missing indexer-solr.

You should not need
query-(basic|site|url)|response-(json|xml)|summary-basic as they
date back to times immemorial when we used to manage the indexing and
search ourselves.
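
A quick way to see the difference between the two plugin.includes values in this thread is to treat each value as a regular expression and test plugin ids against it. The Python sketch below assumes that a plugin is activated only when its whole id matches the expression; the helper name `is_included` is made up for the illustration and is not part of Nutch.

```python
import re

# Value quoted from nutch-default.xml in this thread.
DEFAULT_INCLUDES = (
    "protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)"
    "|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)"
)

# Value from the older configuration quoted in the original question.
OLD_INCLUDES = (
    "protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)"
    "|query-(basic|site|url)|response-(json|xml)|summary-basic"
    "|scoring-opic|urlnormalizer-(pass|regex|basic)"
)


def is_included(plugin_id, includes):
    """Assumption: a plugin is active only if its whole id matches the regex."""
    return re.fullmatch(includes, plugin_id) is not None
```

Under that assumption, `is_included("indexer-solr", DEFAULT_INCLUDES)` is true while `is_included("indexer-solr", OLD_INCLUDES)` is false, which would be consistent with the "No IndexWriters activated" message.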

HTH

Julien


On 10 April 2014 18:05, Xavier Morera xav...@familiamorera.com wrote:

 Hi,

 I have followed several Nutch tutorials - including the main one
 http://wiki.apache.org/nutch/NutchTutorial - to crawl sites (which works,
 I can see in the console as the pages get crawled and the directories built
 with the data) but for the life of me I can't get anything posted to Solr.
 The Solr console doesn't even squint, therefore Nutch is not sending
 anything.

 This is the command that I send over that crawls and in theory should also
 post
 bin/crawl urls/seed.txt TestCrawl http://localhost:8983/solr 2

 But I found that I could also use this one when it is already crawled
 bin/nutch solrindex http://localhost:8983/solr crawl/crawldb crawl/linkdb
 crawl/segments/*

 But no luck.

 The only thing that caught my attention is the message below. I read that
 adding the property below would fix it, but it doesn't work.
 *No IndexWriters activated - check your configuration*

 This is the property
 <property>
   <name>plugin.includes</name>
   <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
 </property>

 Any idea? Apache Nutch 1.8 running Java 1.6 via Cygwin on Windows.

 --
 *Xavier Morera*
 email: xav...@familiamorera.com
 CR: +(506) 8849 8866
 US: +1 (305) 600 4919
 skype: xmorera









Re: [DISCUSS] Roadmap for 2.3 Release

2014-05-01 Thread Julien Nioche
I'd exclude NUTCH-1741 for now and focus on the core updates (GORA,
filters, etc...). See comments on NUTCH-1714:
https://issues.apache.org/jira/browse/NUTCH-1714


On 1 May 2014 07:27, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

 Hi Alparslan and Folks,

 OK, so you can see the roadmap here:

 http://s.apache.org/Xqk

 As you can see, in the 2.3 development drive we've addressed 66 of 71 issues.
 The remaining ones are as follows:

 * NUTCH-1741 Support of Sitemaps in Nutch 2.x
   https://issues.apache.org/jira/browse/NUTCH-1741
 * NUTCH-1714 Nutch 2.x upgrade to Gora 0.4
   https://issues.apache.org/jira/browse/NUTCH-1714
 * NUTCH-1709 Generated classes o.a.n.storage.Host and
   o.a.n.storage.ProtocolStatus contain methods not defined in source .avsc
   https://issues.apache.org/jira/browse/NUTCH-1709
 * NUTCH-1674 Use batchId filter to enable scan (GORA-119) for
   Fetch,Parse,Update,Index
   https://issues.apache.org/jira/browse/NUTCH-1674
 * NUTCH-1570 Add filtering capability to Datastore Queries
   https://issues.apache.org/jira/browse/NUTCH-1570

 I think if we addressed the above then we could push an RC.
 Any comments?
 I'll be able to crack on with this final push relatively soon.

 On Tue, Apr 29, 2014 at 1:09 PM, dev-digest-h...@nutch.apache.org wrote:


 I think we can also add https://issues.apache.org/jira/browse/NUTCH-1674.
 This issue was waiting for the stable release of gora-0.4.

 And IMHO, we can add https://issues.apache.org/jira/browse/NUTCH-1741,
 if anyone could review and test it.

 Thanks,
 Alparslan








Re: [DISCUSS] Roadmap for 2.3 Release

2014-05-01 Thread Julien Nioche
Hi Talat

Not clear what you mean here. "I need them" is not really an explanation as
to why they should be part of the next release. [If you want your own
repository then open an account on GitHub (or somewhere else) and clone the
2.x branch to add the patches of your choice].

Lewis suggested a roadmap for the next release and the changes he made
reflect his suggestions. If you think some of the issues should be part of
the 2.3 release then please explain why. BTW I don't think you agree with
me as I was suggesting we stick to the ones already listed minus 1741.

Thanks

Julien


On 1 May 2014 08:40, Talat Uyarer ta...@uyarer.com wrote:

 I agree with you Julien. Today Lewis changed the fix version of some issues
 from 2.3 to 2.4. Most of them are my issues :) May I ask: if I update these
 issues, can I change the fix version back to 2.3? I need them.

 Thanks
 Talat






 --
 Talat UYARER
 Website: http://talat.uyarer.com
 Twitter: http://twitter.com/talatuyarer
 Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304






Re: [DISCUSS] Roadmap for 2.3 Release

2014-05-01 Thread Julien Nioche
Hi Talat,

Comments below :

 NUTCH-1753 Eclipse dependecy problem for 2.x

=> trivial, please see my comments on it

 NUTCH-1748 urlfilter-validator to allow .. (two dots) inside file names
 (path elements)

=> still under discussion - leave it for 2.4

 NUTCH-1740 BatchId parameter is not set in DbUpdaterJob

=> duplicate

 NUTCH-1728 indexer-solr plugin is not delete docs from solr

=> trivial enough to be committed for 2.3

 NUTCH-1725 CleaningJob's reducer does not commit deleted docs.

=> trivial enough to be committed for 2.3

 NUTCH-1662 NUTCH-1568 Indexer Plugin for Solr Cloud

=> I think we did something pretty similar in 1.x and would like to make
sure that both versions are as similar as possible.

 NUTCH-1661 Language based crawling

=> This is definitely not being committed. You haven't replied to Otis's
questions and this has to be properly reviewed first and discussed.

 NUTCH-1660 Index filter for Page's latitude and longitude

=> same. You haven't replied to the comments on this one.

 NUTCH-1657 ORIGINAL_CHAR_ENCODING and CHAR_ENCODING_FOR_CONVERSION never
 set in HTMLParser

=> trivial indeed, +1 thanks

 NUTCH-1643 Unnecessary fetching with http.content.limit when using
 protocol-http

=> needs reviewing first, let's leave it for later

 NUTCH-1618 Fetches some websites multiple times for long lasting queues

=> trivial indeed, please change the title to something more explicit like
"Turn speculative execution off for Fetching"

I have added NUTCH-1679 (UpdateDb using batchId, link may override crawled
page.) to 2.3 as it must be fixed ASAP:
https://issues.apache.org/jira/browse/NUTCH-1679

Thanks for pointing out these issues. I think the focus for 2.3 should be
to get everything as robust as possible; we can always add new
functionality in another release after that (release often etc...). One
thing we should definitely do though is leverage the brand new GORA
filtering so that we get only the entries marked for a given job - see the
discussion on NUTCH-1714 (https://issues.apache.org/jira/browse/NUTCH-1714).
This should make Nutch 2.x a lot faster.
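
To illustrate why store-side filtering should help, here is a toy sketch in Python. It does not use the Gora API: the in-memory `ROWS` list, the `batch_id` field and all three functions are made up for the illustration. The point is only that a job interested in one batch has to read far fewer entries when the filter is applied by the store rather than by the job itself.

```python
# Toy model only: ROWS stands in for a datastore table, it is not Gora.
ROWS = [
    {"url": "http://example.org/a", "batch_id": "batch-1"},
    {"url": "http://example.org/b", "batch_id": "batch-2"},
    {"url": "http://example.org/c", "batch_id": "batch-1"},
    {"url": "http://example.org/d", "batch_id": "batch-3"},
]


def scan_everything(rows):
    """Without a store-side filter the job receives every entry."""
    return list(rows)


def scan_with_filter(rows, batch_id):
    """With a store-side filter only entries marked for the batch come back."""
    return [row for row in rows if row["batch_id"] == batch_id]


def entries_for_job(rows, batch_id, use_store_filter):
    """Return (entries the job processes, number of entries it had to read)."""
    if use_store_filter:
        received = scan_with_filter(rows, batch_id)
    else:
        received = scan_everything(rows)
    # The job still checks the mark itself, so both paths give the same result.
    wanted = [row for row in received if row["batch_id"] == batch_id]
    return wanted, len(received)
```

Both paths process the same entries; only the number of entries read differs, and that gap grows with the size of the table.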

We haven't released 2.x for some time and loads of interesting stuff has
been done to it. It will be an exciting release!

Thanks for your contributions and pushing things forward!

Julien





Re: Post process Nutch data

2014-05-05 Thread Julien Nioche
Hi

As mentioned earlier in a different discussion on this list, Behemoth would
be the right tool for this.

Julien

On Monday, 5 May 2014, Srikanth Shankara Rao srikant...@aditi.com wrote:


 Hi All,

 I have crawled data using Nutch 1.8. The data is in HDFS. I would like to
 post-process this data before indexing it into Solr. The idea is to transform
 the data based on the content and add a few additional fields that describe
 the content.

 I would like to do this as part of a hadoop job. What would be the best
 place to add code?

 Thanks
 Srikanth





Re: Creating Windows bash files for nutch

2014-05-18 Thread Julien Nioche
Hi


 Currently nutch isn't very friendly to windows users as it requires cygwin
 to run and there are a lot of issues with Hadoop 1.x branch, which nutch
 bundles with it, due to the set tmp permission issue.

 What do you think about doing two things:
 1. Move to Hadoop 2.4 to support windows/linux and the new map reduce api


It already works on Linux. I am pretty sure there is already a JIRA issue for
the port to the new map reduce API. As for Windows, feel free to contribute an
alternative set of scripts if you want to.


 2. Create bash scripts to run crawls with


what's wrong with src/bin/crawl.sh?

Julien



 Relevant JIRA Issues:






Re: Creating Windows bash files for nutch

2014-05-18 Thread Julien Nioche
Hi Diaa

That could be useful when running in local mode at least but am not sure
about the distributed mode. Doesn't Hadoop rely on Cygwin in order to be
used on Windows (at least the Apache distro) anyway?

Julien




On 18 May 2014 20:47, Diaa Abdallah diaa.abdelmon...@gmail.com wrote:

 I meant writing batch/cmd scripts for windows that don't require Cygwin.

 I was thinking of writing those scripts but wanted to check if people
 think it's a good idea.




Nutch survey

2014-05-21 Thread Julien Nioche
Hi everyone!

I have written a survey about Nutch and its uses and would be very grateful
if you could take a couple of minutes to contribute:

https://docs.google.com/forms/d/15Jg7dGoU2I1aHur3g5ia9qshCMES8hB1OLpf5q6sGXg/viewform

This should help to get a clearer picture of the wider Nutch community,
which version people use, what for etc...

I will definitely share the conclusions of this survey and might include
this in the talk I am planning to give at the next ApacheCon.

Best,

Julien



Re: Nutch survey

2014-05-22 Thread Julien Nioche
Hi guys

Thanks to all who have participated so far. You should be able to edit your
answers if you want to, in particular I have added a few fields that were
missing at the beginning. So if you have participated in the survey, feel
free to have another look at it and add any missing details.

For those of you who haven't done the survey yet, please do take part. It
will definitely help getting a better picture of who we are / what we do as
a community. The survey will be online for a few weeks.

Thanks

Julien




On 21 May 2014 16:07, Julien Nioche lists.digitalpeb...@gmail.com wrote:

 Hi everyone!

 I had written a survey about Nutch and its uses and would be very grateful
 if you could take a couple of minutes to contribute :


 https://docs.google.com/forms/d/15Jg7dGoU2I1aHur3g5ia9qshCMES8hB1OLpf5q6sGXg/viewform

 This should help getting a clearer picture of the wider Nutch community,
 which version people use, what for etc...

 I will definitely share the conclusions of this survey and might include
 this in the talk I am planning to give at the next ApacheCon.

 Best,

 Julien





-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: Nutch survey

2014-05-27 Thread Julien Nioche
Hi guys

Thanks to the 35 of you who took the time to fill in the survey. It will show
interesting things for sure. Now, there are definitely more than 35 people on
these lists who use Nutch: if you haven't done it yet, could you please
fill in the survey? It shouldn't take much time and will be very useful for
getting a clearer picture of who we are as a community, what we like or
dislike about Nutch, etc.

Survey = https://t.co/Xod5Z3Mm5E

Please RT : https://twitter.com/digitalpebble/status/469130285284466688

Thanks

Julien


On 22 May 2014 08:10, Julien Nioche lists.digitalpeb...@gmail.com wrote:

 Hi guys

 Thanks to all who have participated so far. You should be able to edit
 your answers if you want to, in particular I have added a few fields that
 were missing at the beginning. So if you have participated in the survey,
 feel free to have another look at it and add any missing details.

 For those of you who haven't done the survey yet, please do take part. It
 will definitely help getting a better picture of who we are / what we do as
 a community. The survey will be online for a few weeks.

 Thanks

 Julien




 On 21 May 2014 16:07, Julien Nioche lists.digitalpeb...@gmail.com wrote:

 Hi everyone!

 I had written a survey about Nutch and its uses and would be very
 grateful if you could take a couple of minutes to contribute :


 https://docs.google.com/forms/d/15Jg7dGoU2I1aHur3g5ia9qshCMES8hB1OLpf5q6sGXg/viewform

 This should help getting a clearer picture of the wider Nutch community,
 which version people use, what for etc...

 I will definitely share the conclusions of this survey and might include
 this in the talk I am planning to give at the next ApacheCon.

 Best,

 Julien









-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


ApacheCon CFP closes June 25

2014-06-10 Thread Julien Nioche
Dear Nutch enthusiast,

As you may be aware, ApacheCon will be held this year in Budapest, on
November 17-23. (See http://apachecon.eu for more info.)

The Call For Papers for that conference is still open, but will be
closing soon. We need your talk proposals, to represent Nutch at
ApacheCon. We need all kinds of talks - deep technical talks, hands-on
tutorials, introductions for beginners, or case studies about the
awesome stuff you're doing with Nutch.

Please consider submitting a proposal, at
http://events.linuxfoundation.org//events/apachecon-europe/program/cfp

Thanks!

PS: 2 talks about Nutch have already been submitted and there are at least
3 committers planning to do a tutorial. Please make yourself known if you'd
like to help organising the tutorial, alternatively do submit a talk as
described above.

Julien

-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Travel assistance for ApacheCon EU, Budapest November 17-21 2014

2014-06-11 Thread Julien Nioche
The Travel Assistance Committee (TAC) is happy to announce that we now
accept applications for ApacheCon Europe 2014, 17-21 November in Budapest,
Hungary.

Applications are welcome from individuals within the Apache community
at-large, users, developers, educators, students, Committers, and Members,
who need financial support to attend ApacheCon.

Please be aware the seats are very limited, and all applicants will be
scored on their individual merit.

More information can be found at http://www.apache.org/travel including a
link to the online application and detailed instructions for submitting.

Applications will close on 25 July 2014 at 23:00 UTC/GMT.

Please help spread the word among your community.

On behalf of TAC

-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: nutch elpais.com

2014-06-16 Thread Julien Nioche
Salut Yann,

Not really answering your question but where did you get this config from?
Some of its elements have been long deprecated (query-*, response-*,
summary-*)
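
For reference, the plugin list quoted below is a regular expression that is matched against plugin ids. The following is a minimal sketch of that matching, assuming a full match against each id; the class and method names are made up for illustration and are not Nutch code.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Nutch): checks plugin ids against a
// plugin.includes style expression, assuming the whole id must match.
final class PluginIncludesCheck {

    // Returns true when the plugin id fully matches the expression.
    static boolean isIncluded(String pluginIncludes, String pluginId) {
        return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String includes = "protocol-http|parse-html|index-(basic|anchor)"
                + "|query-(basic|site|url)|response-(json|xml)|summary-basic"
                + "|scoring-opic|urlnormalizer-(pass|regex|basic)";
        // Print whether each id would be selected by the expression.
        for (String id : Arrays.asList("parse-html", "index-anchor", "parse-tika")) {
            System.out.println(id + " -> " + isIncluded(includes, id));
        }
    }
}
```

An alternative that names a plugin which no longer exists simply never matches anything, so stale entries are harmless but misleading.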

Julien


On 15 June 2014 10:20, Yann Levreau yann.levr...@gmail.com wrote:

 hi everyone !

 I'm sorry to disturb you but I need some assistance getting the
 outlinks of http://elpais.com.
 I use Nutch 2.2.1.

 The web page is well parsed, in debug I have all the outlinks in the Parse
 object.
 I use these basic plugins :


 protocol-http|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)

 But outlinks are never injected into HBase (with http://elpais.com or
 http://www.elpais.com).
 If I try to parse www.nytimes.com, outlinks are injected normally and
 added to the fetch list.

 Any idea ?
 Thanks
 Yann

 == I have the same issue with http://www.lemonde.fr





-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Version of Java in Jenkins

2014-06-17 Thread Julien Nioche
Lewis,

https://issues.apache.org/jira/browse/NUTCH-1590 requires Java 1.7 for
building the Javadoc. Does something need changing in Jenkins? BTW is there
a WIKI page somewhere on how to configure Jenkins?

Thanks

Julien

-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: Nutch Extension for realtime processing

2014-06-18 Thread Julien Nioche
Hi Jake

Great to hear about your ideas. Sounds like what you are proposing would be
only near realtime, as much would depend on the generation, which is a
batch step. How / when would the update step be called? Would this be a
fetcher only, i.e. one that does not recursively discover links? If so, why
not go 100% real time, as done with something like
https://github.com/DigitalPebble/storm-crawler?

I hinted at similar things in my recent talks: see for instance
http://www.slideshare.net/digitalpebble/j-nioche-lucenerevoeu2013 slide #40
(video at http://youtu.be/KyHPBtRlo80?t=42m). As mentioned in the talk above,
I can see a hybrid model mixing batch and real time processes as a good
solution. I recently looked at Apache Spark, as it allows mixing batch,
micro-batch and graph computation and sounded like a good framework for doing
this, but I haven't had the time to go very far. In an ideal world, we'd be
continuously fetching, parsing and updating, wouldn't we?

Did I get your suggestion right?

Julien



On 17 June 2014 23:54, Jake Dodd j...@ontopic.io wrote:

 Markus: The indexer plugin idea definitely works if the goal is only to
 pass Nutch-collected data to realtime frameworks. However, there are some
 cool things that you can do in “real” realtime (heh), as opposed to the
 batch nature of Nutch’s indexing plugins and the FetcherOutputFormat.
 Moreover, it would be cool to have Nutch working as designed (with
 fetching, parsing, indexing and all) while basically gaining the realtime
 capabilities for free.

 Chris: Glad to hear you’re interested, and thanks for the link! Today I
 was actually able to finish a prototype version of this, along with two
 example Disseminator plugins (one to stdout, the other to a Kafka
 topic—both working beautifully). I’d be happy to create a New Feature JIRA
 and start working on this.

 Cheers

 Jake



 On Jun 17, 2014, at 11:02 AM, Mattmann, Chris A (3980) 
 chris.a.mattm...@jpl.nasa.gov wrote:

 Jake I am totally interested in this. Contributing to Nutch (and more
 generally to Apache projects) is described really well (by Dennis Kubes)
 here:

 http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer


 Looking forward to seeing your contributions!

 ++
 Chris Mattmann, Ph.D.
 Chief Architect
 Instrument Software and Science Data Systems Section (398)
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 168-519, Mailstop: 168-527
 Email: chris.a.mattm...@nasa.gov
 WWW:  http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Associate Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++






 -Original Message-
 From: Markus Jelsma markus.jel...@openindex.io
 Reply-To: dev@nutch.apache.org dev@nutch.apache.org
 Date: Tuesday, June 17, 2014 10:55 AM
 To: dev@nutch.apache.org dev@nutch.apache.org
 Subject: RE: Nutch Extension for realtime processing

 Hi Jake,

 It would be more pluggable if you just implement an indexer backend
 plugin for your target (storm, spark) so you can use the existing
 indexing filtering framework and plugins to enrich the data. If you then
 couple the indexing logic to FetcherOutputFormat, you can skip the parse
 (because this requires a parsing fetcher) and updatedb jobs, as well as
 the separate indexing job. This is certainly not real time but the delay
 is much smaller, especially if you keep to (many) small fetch jobs. In
 our environment we can guarantee a fetched document is always indexed
 within 15 minutes.

 Markus

 -Original message-

 From:Jake Dodd j...@ontopic.io
 Sent: Tuesday 17th June 2014 19:30
 To: dev@nutch.apache.org
 Subject: Nutch Extension for realtime processing

 Hi all,

 My organization is mulling the creation of a Nutch Extension Point that
 would enable realtime processing of Nutch documents as they're fetched.
 We have the desire to pass Nutch-fetched documents to a realtime
 framework such as Storm or Spark. Currently, it's trivial to implement a
 custom Indexer plugin that sort of gets the job done. However, this
 doesn't really meet the realtime requirement: you must wait for the
 fetch, parse, updatedb, index cycle to complete.

 Our idea is to create a FetcherDisseminator extension point. A
 FetcherDisseminator would implement a disseminate() method that would
 take care of serialization (JSON, Avro, etc) and disseminating the data
 to an external entity (for example a REST interface, or a Kafka broker).

 The FetcherDisseminators would be called from within the
 org.apache.nutch.fetcher.Fetcher.FetcherThread class. The implementation
 would be such that the normal fetch-parse-update-index cycle would be
 unaffected, even in the case of disseminator failure.
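
The proposal above could be sketched roughly as follows. This is only an illustration under stated assumptions: the interface name comes from the proposal, but the disseminate() signature, the DisseminatorRunner class and the way failures are counted are hypothetical and not taken from the prototype or from Nutch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical extension point; the signature is an assumption.
interface FetcherDisseminator {
    void disseminate(String url, byte[] content) throws Exception;
}

// Hypothetical helper that a fetcher thread could call after collecting output.
final class DisseminatorRunner {
    private final List<FetcherDisseminator> disseminators;

    DisseminatorRunner(List<FetcherDisseminator> disseminators) {
        this.disseminators = disseminators;
    }

    // Calls every disseminator and returns the number that failed.
    // Exceptions are caught here so the fetch cycle itself is unaffected.
    int disseminateAll(String url, byte[] content) {
        int failures = 0;
        for (FetcherDisseminator d : disseminators) {
            try {
                d.disseminate(url, content);
            } catch (Exception e) {
                failures++;
                System.err.println("Disseminator failed for " + url + ": " + e);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        List<FetcherDisseminator> ds = new ArrayList<>();
        // A stdout disseminator and one that always fails.
        ds.add((url, content) -> System.out.println(url + " (" + content.length + " bytes)"));
        ds.add((url, content) -> { throw new IllegalStateException("broker unavailable"); });
        int failures = new DisseminatorRunner(ds)
                .disseminateAll("http://example.com/", new byte[] {1, 2, 3});
        System.out.println("failures: " + failures);
    }
}
```

Catching per disseminator, rather than around the whole loop, means one broken plugin does not stop the others from receiving the document.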

 My first question is whether something like this has been 

Re: Nutch Extension for realtime processing

2014-06-19 Thread Julien Nioche
Hi Jake,

Thanks for taking the time to explain what you want to do with the
Disseminator. It does make sense but to be valuable would have to be quite
generic so that people can have different uses of it.

For instance one thing I did for a customer was to hack the Fetcher so that
we send statistics to a monitoring tool e.g. statsd / Graphite / Librato.
This way I can see the evolution of metrics during the Fetching (active
threads / number of queues / number of URLS in queues/ URLs fetched, IO
etc... ). We have these figures on the task UI but it helps to see how they
evolve over time. The fetching step is the only place where it makes sense
to have this as the other steps are quite linear in the way they work.
Anyway, would that fit with the Disseminator? Not sure, we could have yet
another plugin to do that.
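
To give an idea of what sending such statistics involves, here is a minimal sketch using the plain statsd line format (name:value|g for a gauge) over UDP. The metric names, the class name and the localhost:8125 target are assumptions for illustration; this is not the code used in the hacked Fetcher.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: formats and sends fetcher gauges in statsd format.
final class FetcherStats {

    // Builds a statsd gauge line, e.g. "nutch.fetcher.activeThreads:10|g".
    static String gauge(String name, long value) {
        return name + ":" + value + "|g";
    }

    // Sends one line as a single UDP datagram; fire and forget.
    static void send(String host, int port, String line) throws Exception {
        byte[] payload = line.getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.send(new DatagramPacket(payload, payload.length,
                    InetAddress.getByName(host), port));
        }
    }

    public static void main(String[] args) throws Exception {
        // Metric names are made up; 8125 is the usual statsd port.
        send("localhost", 8125, gauge("nutch.fetcher.activeThreads", 10));
        send("localhost", 8125, gauge("nutch.fetcher.queues", 250));
    }
}
```

Because UDP is fire and forget, a missing or slow monitoring server should not hold up the fetcher threads.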

The logging use case with Kibana is a good example: I did something
similar with the StormCrawler (its logging mechanisms being a lot
worse than what Hadoop gives us). A nice thing to have would be an slf4j
extension that sends the logs to Elasticsearch, but this is a different
subject.

The question is : can we do all these things within a single plugin
extension?

Julien



On 18 June 2014 17:18, Jake Dodd j...@ontopic.io wrote:

 Hi Julien,

 Yep, you’re correct about the generation step being a limiting factor in
 getting new content in realtime—i.e. nearly as soon as it appears on the
 web. But that isn’t *quite* what I meant, so I’ll clarify what I mean by
 “realtime”, in the context of Nutch as it exists today: getting access to
 data as soon as it’s fetched, rather than waiting for the fetch job (and
 any subsequent jobs) to finish.

 The update and index steps would be called as usual, with no modifications
 to any existing Nutch workflows. In the prototype that I banged out
 yesterday, the dissemination occurs in the
 org.apache.nutch.fetcher.Fetcher.FetcherThread.output() method, after the
 output has been collected. More specifically, a Disseminator isn’t a
 substitute for an indexer. It simply makes online data about the fetch
 available to outside sources and services. The Disseminator would be
 invisible to people who choose not to use dissemination plugins.

 I can think of a couple of example use cases to illustrate. One use would
 be to create a Disseminator plugin that would collect fetch metadata for
 each URL (response code, content type, number of outlinks, host, domain,
 etc), format it as a Logstash event, and send it to an Elasticsearch
 cluster. A user could then use the Logstash/Kibana/Elasticsearch stack for
 detailed and highly visual monitoring of a Nutch crawl, with very little
 engineering involved, and no modification of the Nutch source—only
 development of a plugin. For smaller fetches, as Markus suggested, a
 Logstash “Indexer” could work for this; but for a longer fetch, it would be
 cool to monitor the crawl (not just the process, but the fetch data itself)
 in realtime without digging through Hadoop logs.
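
As a rough illustration of the monitoring use case above, the per-URL fetch metadata could be serialized as one JSON object per event. The field names and the FetchEvent class are hypothetical, and the escaping is deliberately minimal (quotes and backslashes only); a real plugin would more likely use a JSON library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical serializer for per-URL fetch metadata.
final class FetchEvent {

    // Quotes a string, escaping backslashes and double quotes only.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Renders the fields as a flat JSON object, keeping insertion order.
    // Numbers are written bare; everything else is written as a string.
    static String toJson(Map<String, Object> fields) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, Object> e : fields.entrySet()) {
            if (!first) {
                sb.append(',');
            }
            first = false;
            sb.append(quote(e.getKey())).append(':');
            Object v = e.getValue();
            sb.append(v instanceof Number ? v.toString() : quote(String.valueOf(v)));
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        Map<String, Object> event = new LinkedHashMap<>();
        event.put("url", "http://example.com/");
        event.put("status", 200);
        event.put("contentType", "text/html");
        event.put("outlinks", 42);
        event.put("host", "example.com");
        System.out.println(toJson(event));
    }
}
```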

 A more advanced use case would be to disseminate the actual page content
 (either raw, or after parsed if parsing during fetch is enabled) to Apache
 Spark. From there, pages could be classified using SVM or a Bayes
 classifier in Spark’s MLLib. Once the fetch is finished, during indexing, a
 custom IndexingFilter could read the classifications—already generated—and
 filter indexing according to the classifications. If anybody has a
 classification-based IndexingFilter, this could greatly speed up their
 workflow.

 These are just off the top of my head—people are creative, I’m sure there
 are even cooler things someone could think up!

  As you mentioned in the talk that you shared, doing everything
 continuously would be an enormous undertaking, requiring a major overhaul
 of Nutch and a migration from MR. But creating a plugin-based hook to the
 Fetcher seems to be relatively trivial.

 The storm-crawler project looks neat! We’ve contemplated building
 something similar that would reuse elements from Nutch where possible.

 Cheers

 Jake

 On Jun 18, 2014, at 1:34 AM, Julien Nioche lists.digitalpeb...@gmail.com
 wrote:

 Hi Jake

 Great to hear about your ideas. Sounds like what you are proposing would
 be only near realtime as much would depend on the generation which is a
 batch step. How / when would the update step be called? Would this be a
 fetcher only i.e. does not recursively discover links. If so why not going
 100% real time as done with something like
 https://github.com/DigitalPebble/storm-crawler?

 I hinted at similar things in my recent talks : see for instance
 http://www.slideshare.net/digitalpebble/j-nioche-lucenerevoeu2013 slide
 #40 (video on  *http://youtu.be/KyHPBtRlo80?t=42m
 http://youtu.be/KyHPBtRlo80?t=42m*. As mentioned in the talk above, I
 can see a hybrid model mixing batch and real time processes as a good
 solution. I recently looked at Apache Spark as it allows to mix batch,
 micro-batch and graph computation

Nearing a 1.9 release?

2014-06-29 Thread Julien Nioche
Hi guys,

We've done loads of good work on the trunk since the last release, in
particular :

   - NUTCH-1736 https://issues.apache.org/jira/browse/NUTCH-1736
   - NUTCH-1647 https://issues.apache.org/jira/browse/NUTCH-1647
   - NUTCH-1793 https://issues.apache.org/jira/browse/NUTCH-1793

which are important bug fixes (NUTCH-578
https://issues.apache.org/jira/browse/NUTCH-578 will also be an important
one).

If you want to help make the new release happen, could you please go
through the issues listed for 1.9
https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%201.9%20AND%20project%20%3D%20NUTCH%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20updated%20DESC
and vote for the ones you think should be included in the next release /
comment on issues opened by others / review the patches / contribute to the
discussions?

Thanks!

Julien

-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: Nearing a 1.9 release?

2014-07-07 Thread Julien Nioche
Hi,

I've moved all the open issues that were marked with fix version = 1.9 to
1.10 except for the ones that Seb mentioned earlier.

Please go through the issues listed for 1.10
https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%201.10%20AND%20project%20%3D%20NUTCH%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20updated%20DESC
and change their fix version back to 1.9 if you think they should be included
in the next release.

Thanks

Julien



On 29 June 2014 10:20, Julien Nioche lists.digitalpeb...@gmail.com wrote:

 Hi guys,

 We've done loads of good work on the trunk since the last release, in
 particular :

- NUTCH-1736 https://issues.apache.org/jira/browse/NUTCH-1736
- NUTCH-1647 https://issues.apache.org/jira/browse/NUTCH-1647
- NUTCH-1793 https://issues.apache.org/jira/browse/NUTCH-1793

 which are important bug fixes (NUTCH-578
 https://issues.apache.org/jira/browse/NUTCH-578 will also be an
 important one).

 If you want to help make the new release happen, could you please go
 through the issues listed for 1.9
 https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%201.9%20AND%20project%20%3D%20NUTCH%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20updated%20DESC
 and vote for the ones you think should be included in the next release /
 comment on issues opened by others / review the patches / contribute to the
 discussions?

 Thanks!

 Julien





-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


[VOTE] Remove pom.xml from source

2014-07-15 Thread Julien Nioche
Hi,

One of the frequent issues on the mailing list / JIRA is that users can be
led to think that Nutch is built with Maven as they can see what looks like
a perfectly valid pom.xml at the root of the project. It becomes clearer
when reading the WIKI or FAQ that ANT should be used instead but it isn't
an unreasonable assumption, is it?

As we know this pom.xml is generated automatically when we publish the
Maven artefacts with the deploy task i.e. it is never done by end users and
only when we release a new version. This pom.xml is generated from a
template file in the ivy dir and uses the ivy dependencies.

This pom.xml file cannot be used to build Nutch core nor the plugins but
was used by Eclipse users to easily import the project and get the
dependencies, which can be done very neatly with the 'ant eclipse' task or
by using the IvyDE plugin for Eclipse.  Moreover there is no guarantee that
it is in sync with the content of the Ivy deps.

I suggest that we remove the pom.xml file from the source (and the
releases) to get rid of this source of confusion. Apart from the solutions
I just mentioned to get the dependencies in Eclipse etc... there is always
the option of calling 'ant deploy' to generate a fresh new pom.xml if you
really need one.

Can we please have your views on this?

[+1] yes
[-1] no, here is why...
[0] don't mind

Thanks

Julien

-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: [VOTE] Remove pom.xml from source

2014-07-15 Thread Julien Nioche
Hi Talat

Don't you remember https://issues.apache.org/jira/browse/NUTCH-1371? ;-)

Julien


On 15 July 2014 11:50, Talat Uyarer ta...@uyarer.com wrote:

 Hi Julien,

 [+1] We can remove pom.xml

 I wonder why we don't switch our dependency management from ivy+ant to
 Maven? Most IDEs work very well with Maven.

 Talat


 2014-07-15 13:36 GMT+03:00 Julien Nioche lists.digitalpeb...@gmail.com:

 Hi,

 One of the frequent issues on the mailing list / JIRA is that users can
 be led to think that Nutch is built with Maven as they can see what looks
 like a perfectly valid pom.xml at the root of the project. It becomes
 clearer when reading the WIKI or FAQ that ANT should be used instead but it
 isn't an unreasonable assumption, is it?

 As we know this pom.xml is generated automatically when we publish the
 Maven artefacts with the deploy task i.e. it is never done by end users and
 only when we release a new version. This pom.xml is generated from a
 template file in the ivy dir and uses the ivy dependencies.

 This pom.xml file cannot be used to build Nutch core nor the plugins but
 was used by Eclipse users to easily import the project and get the
 dependencies, which can be done very neatly with the 'ant eclipse' task or
 by using the IvyDE plugin for Eclipse.  Moreover there is no guarantee that
 it is in sync with the content of the Ivy deps.

 I suggest that we remove the pom.xml file from the source (and the
 releases) to get rid of this source of confusion. Apart from the solutions
 I just mentioned to get the dependencies in Eclipse etc... there is always
 the option of calling 'ant deploy' to generate a fresh new pom.xml if you
 really need one.

 Can we please have your views on this?

 [+1] yes
 [-1] no, here is why...
 [0] don't mind

 Thanks

 Julien





 --
 Talat UYARER
 Websitesi: http://talat.uyarer.com
 Twitter: http://twitter.com/talatuyarer
 Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304




-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: [DISCUSS] [VOTE] Remove pom.xml from source

2014-07-15 Thread Julien Nioche
Hi Chris,

It does. See https://github.com/apache/nutch/blob/trunk/ivy/mvn.template,
which contains everything we need to generate the pom.xml, including the dev
list, deps etc... The pom contains nothing else and is fully regenerated
from the template at every release. We can remove it.

Thanks

Julien

On 15 July 2014 19:07, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:

 Hey Julien,

 Does the ant deploy generate a full POM though? I don't think it does.
 I think it just generates the dependencies, and not e.g., the developer
 list, etc. So, we need the pom.xml as the template that has that stuff,
 until someone cooks up an XSL combining solution with that original template
 and then what ant deploy spits out, no?

 Cheers,
 Chris





 -Original Message-
 From: Julien Nioche lists.digitalpeb...@gmail.com
 Reply-To: u...@nutch.apache.org u...@nutch.apache.org
 Date: Tuesday, July 15, 2014 3:36 AM
 To: dev@nutch.apache.org dev@nutch.apache.org, u...@nutch.apache.org
 u...@nutch.apache.org
 Subject: [VOTE] Remove pom.xml from source

 Hi,
 
 One of the frequent issues on the mailing list / JIRA is that users can be
 led to think that Nutch is built with Maven as they can see what looks
 like
 a perfectly valid pom.xml at the root of the project. It becomes clearer
 when reading the WIKI or FAQ that ANT should be used instead but it isn't
 an unreasonable assumption, is it?
 
 As we know this pom.xml is generated automatically when we publish the
 Maven artefacts with the deploy task i.e. it is never done by end users
 and
 only when we release a new version. This pom.xml is generated from a
 template file in the ivy dir and uses the ivy dependencies.
 
 This pom.xml file cannot be used to build Nutch core nor the plugins but
 was used by Eclipse users to easily import the project and get the
 dependencies, which can be done very neatly with the 'ant eclipse' task or
 by using the IvyDE plugin for Eclipse.  Moreover there is no guarantee
 that
 it is in sync with the content of the Ivy deps.
 
 I suggest that we remove the pom.xml file from the source (and the
 releases) to get rid of this source of confusion. Apart from the solutions
 I just mentioned to get the dependencies in Eclipse etc... there is always
 the option of calling 'ant deploy' to generate a fresh new pom.xml if you
 really need one.
 
 Can we please have your views on this?
 
 [+1] yes
 [-1] no, here is why...
 [0] don't mind
 
 Thanks
 
 Julien
 




-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: Problems running some ant targets on recent trunk

2014-07-17 Thread Julien Nioche
Hi

This is probably due to some of the recent changes I made e.g.
https://issues.apache.org/jira/browse/NUTCH-1804

I'll have a look at this.

Thanks

Julien


On 16 July 2014 23:10, Sebastian Nagel wastl.na...@googlemail.com wrote:

 Hi,

 I have some problems running ant targets on recent trunk:

 % ant runtime
 fails if run from scratch (after ant clean)
 but it succeeds after ant test or ant nightly.

 in a plugin folder, e.g., src/plugin/parse-metatags
 % ant test


 The error causing the failure is always:
  .../trunk/src/plugin/build-plugin.xml:190: .../trunk/build/test/lib does
 not exist.
 e.g. within the chain of calls:
 BUILD FAILED
 .../trunk/build.xml:112: The following error occurred while executing this
 line:
 .../trunk/src/plugin/build.xml:63: The following error occurred while
 executing this line:
 .../trunk/src/plugin/urlfilter-automaton/build.xml:25: The following error
 occurred while executing
 this line:
 .../trunk/src/plugin/build-plugin.xml:190: .../trunk/build/test/lib does
 not exist.

 Indeed the directory does not exist because it's removed by target
 clean-lib.
 In this case it is the target compile-test of lib-regex-filter which
 fails.
 Should it really be called for target runtime?

   <target name="deps-jar">
     <ant target="jar" inheritall="false" dir="../lib-regex-filter"/>
     <ant target="compile-test" inheritall="false" dir="../lib-regex-filter"/>
   </target>


 Thanks,
 Sebastian




-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: Problems running some ant targets on recent trunk

2014-07-17 Thread Julien Nioche

 In this case it is the target compile-test of lib-regex-filter which
 fails.
 Should it really be called for target runtime?
   <target name="deps-jar">
     <ant target="jar" inheritall="false" dir="../lib-regex-filter"/>
     <ant target="compile-test" inheritall="false" dir="../lib-regex-filter"/>
   </target>


This is the source of the problem indeed, the second line should not be
there : the test classes are not required at that stage.
Both urlfilter-automaton and urlfilter-regex have the same problem, which
was not apparent until I introduced a cleaner separation between the
compilation and test deps.

I've fixed that for trunk in revision 1611303.

As for the issue with calling 'ant test' from a plugin dir, this is not a
new issue : we get the same thing with a fresh copy of Nutch 1.8. It's just
that the test task for the plugins assumes that the core classes and ivy
jars have already been resolved.

Thanks

Julien


On 16 July 2014 23:10, Sebastian Nagel wastl.na...@googlemail.com wrote:

 Hi,

 I have some problems running ant targets on recent trunk:

 % ant runtime
 fails if run from scratch (after ant clean)
 but it succeeds after ant test or ant nightly.

 in a plugin folder, e.g., src/plugin/parse-metatags
 % ant test


 The error causing the failure is always:
  .../trunk/src/plugin/build-plugin.xml:190: .../trunk/build/test/lib does
 not exist.
 e.g. within the chain of calls:
 BUILD FAILED
 .../trunk/build.xml:112: The following error occurred while executing this
 line:
 .../trunk/src/plugin/build.xml:63: The following error occurred while
 executing this line:
 .../trunk/src/plugin/urlfilter-automaton/build.xml:25: The following error
 occurred while executing
 this line:
 .../trunk/src/plugin/build-plugin.xml:190: .../trunk/build/test/lib does
 not exist.

 Indeed the directory does not exist because it's removed by target
 clean-lib.
 In this case it is the target compile-test of lib-regex-filter which
 fails.
 Should it really be called for target runtime?

   <target name="deps-jar">
     <ant target="jar" inheritall="false" dir="../lib-regex-filter"/>
     <ant target="compile-test" inheritall="false" dir="../lib-regex-filter"/>
   </target>


 Thanks,
 Sebastian




-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: Problems running some ant targets on recent trunk

2014-07-21 Thread Julien Nioche
It's actually a bit more twisted than that : see
https://issues.apache.org/jira/browse/NUTCH-1818

This separation of the test and runtime dependencies has actually been very
good for exposing inconsistencies in the way the existing build worked. The
issue should be solved now, thanks for reporting it.


On 17 July 2014 10:18, Julien Nioche lists.digitalpeb...@gmail.com wrote:

 In this case it is the target compile-test of lib-regex-filter which
 fails.
 Should it really be called for target runtime?
   <target name="deps-jar">
     <ant target="jar" inheritall="false" dir="../lib-regex-filter"/>
     <ant target="compile-test" inheritall="false" dir="../lib-regex-filter"/>
   </target>


 This is the source of the problem indeed, the second line should not be
 there : the test classes are not required at that stage.
 Both urlfilter-automaton and urlfilter-regex have the same problem, which
 was not apparent until I introduced a cleaner separation between the
 compilation and test deps.

 I've fixed that for trunk in revision 1611303.
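
 As a rough sketch of the change described above (not copied from revision
 1611303), the deps-jar target in urlfilter-automaton/build.xml and
 urlfilter-regex/build.xml would keep only the jar call once the
 compile-test line is dropped. The enclosing project element and its name
 attribute are assumptions added so that the fragment is well-formed:

```xml
<?xml version="1.0"?>
<!-- Sketch only: the project wrapper and its name are illustrative assumptions. -->
<project name="urlfilter-automaton" default="jar">

  <!-- Build the jar of the plugin this one depends on. The test classes of
       lib-regex-filter are not needed at this stage, so compile-test is
       no longer called from here. -->
  <target name="deps-jar">
    <ant target="jar" inheritall="false" dir="../lib-regex-filter"/>
  </target>

</project>
```

 With this shape, 'ant runtime' after 'ant clean' no longer reaches the
 compile-test target of lib-regex-filter, so it no longer needs the
 build/test/lib directory to exist.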

 As for the issue with calling 'ant test' from a plugin dir, this is not a
 new issue: we get the same thing with a fresh copy of Nutch 1.8. It's just
 that the test task for the plugins assumes that the core classes and ivy
 jars have already been resolved.

 Thanks

 Julien


 On 16 July 2014 23:10, Sebastian Nagel wastl.na...@googlemail.com wrote:

 Hi,

 I have some problems running ant targets on recent trunk:

 % ant runtime
 fails if run from scratch (after ant clean)
 but it succeeds after ant test or ant nightly.

 in a plugin folder, e.g., src/plugin/parse-metatags
 % ant test


 The error causing the failure is always:
  .../trunk/src/plugin/build-plugin.xml:190: .../trunk/build/test/lib does
 not exist.
 e.g. within the chain of calls:
 BUILD FAILED
 .../trunk/build.xml:112: The following error occurred while executing
 this line:
 .../trunk/src/plugin/build.xml:63: The following error occurred while
 executing this line:
 .../trunk/src/plugin/urlfilter-automaton/build.xml:25: The following
 error occurred while executing
 this line:
 .../trunk/src/plugin/build-plugin.xml:190: .../trunk/build/test/lib does
 not exist.

 Indeed the directory does not exist because it's removed by target
 clean-lib.
 In this case it is the target compile-test of lib-regex-filter which
 fails.
 Should it be really called for target runtime?

   <target name="deps-jar">
     <ant target="jar" inheritall="false" dir="../lib-regex-filter"/>
     <ant target="compile-test" inheritall="false" dir="../lib-regex-filter"/>
   </target>


 Thanks,
 Sebastian




 --

 Open Source Solutions for Text Engineering

 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble




-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: Push Nutch 1.9

2014-08-07 Thread Julien Nioche
Lewis,

Any chance you'd have time to spin a RC?

Thanks

Julien


On 30 July 2014 21:14, Sebastian Nagel wastl.na...@googlemail.com wrote:

 +1

 sebastian


 2014-07-30 10:56 GMT+02:00 Julien Nioche lists.digitalpeb...@gmail.com:

 Hi Lewis

 https://issues.apache.org/jira/browse/NUTCH-1755 is more at a
 discussion stage and can be done
 later. I have moved it to 1.10

 I've just committed https://issues.apache.org/jira/browse/NUTCH-1561
 - there are no more issues
 flagged for 1.9.

 +1 for a RC. This will be a terrific release with loads of bugfixes
 and improvements.

 Thanks

 Julien



 On 30 July 2014 07:55, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

 Hi Folks,
 Let's go with this?
 I am sorry for not being here so much recently. I really would
 like to see 1.9 come out of
 the box.
 Last time I saw that 1.9 had 2 issues left to resolve?
 I am MORE than willing to push this any time this coming week for
 a 1.9 RC if we can agree
 on issues to resolve the release.
 2.3 still needs some work... which we are trying to iron out
 (amongst other things) over on
 Goraland.
 Thanks folks.
 Lewis

 --
 /Lewis/




 --
 Open Source Solutions for Text Engineering

 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble





-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: [VOTE] Apache Nutch 1.9 Release Candidate #1

2014-08-13 Thread Julien Nioche
Hi,

+1 to release. Compilation and tests run fine. Signatures look good.

Thanks Lewis!

Julien


On 13 August 2014 06:32, Lewis John Mcgibbney lewis.mcgibb...@gmail.com
wrote:

 VOTE'ing will be open for 'at-least' 72 hours to allow people enough time
 to cast their VOTE's.
 Thanks
 Lewis


 On Tue, Aug 12, 2014 at 10:31 PM, Lewis John Mcgibbney 
 lewis.mcgibb...@gmail.com wrote:

 Hi user@ & dev@,

 This thread is a VOTE for releasing Apache Nutch 1.9. The release candidate
 comprises the following components.

 * A staging repository [0] containing various Maven artifacts
 * A branch-1.9 of the trunk code [1]
 * The tagged source upon which we are VOTE'ing [2]
 * Finally, the release artifacts [3] which I would encourage you to verify
   for signatures and also test if possible. Some advice on signing release
   artifacts can be found here [4].

 You should use the following KEYS [5] file to verify the signatures of all
 release artifacts. The artifacts have been signed by KEY 48BAEBF6
 2013-10-28 Lewis John McGibbney (CODE SIGNING KEY) lewi...@apache.org

 Please VOTE as follows

 [ ] +1 Push the release, I am happy :)
 [ ] +0 I am not bothered either way
 [ ] -1 I am not happy with this release candidate (please state why)

 Firstly thank you to everyone that contributed to Nutch, it is greatly
 appreciated by members of the community at large. Secondly, thank you to
 everyone that VOTE's. Finally, thank you to everyone that uses Nutch. It is
 appreciated.

 Lewis
 (on behalf of Nutch PMC)

 p.s. Here's my +1

 [0] https://repository.apache.org/content/repositories/orgapachenutch-1002
 [1] https://svn.apache.org/repos/asf/nutch/branches/branch-1.9
 [2] https://svn.apache.org/repos/asf/nutch/tags/release-1.9
 [3] https://dist.apache.org/repos/dist/dev/nutch/1.9/
 [4] http://nutch.apache.org/downloads.html
 [5] http://www.apache.org/dist/nutch/KEYS



 --
 *Lewis*




 --
 *Lewis*




-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: Incorrect download links for Nutch-1.9

2014-08-28 Thread Julien Nioche
Thanks for reporting this Jake, I'll fix this tomorrow (unless a fellow
committer beats me to it)

Julien


On 27 August 2014 17:37, Jake Dodd j...@ontopic.io wrote:

 Hi all,

 I noticed that following the download links for Nutch 1.9 (from
 http://nutch.apache.org/downloads.html) takes users to a series of pages
 all with the pattern
 http://www.apache.org/dyn/closer.cgi/nutch/1.9/apache-nutch-1.8-*.  The
 end of the URI has apache-nutch-1.8, rather than apache-nutch-1.9. I
 haven’t tested any others, but at least the primary mirror for the 1.9
 source .zip is broken.

 Has anybody caught this yet, and are there plans to fix it?

 Cheers

 Jake




-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: Title of the page Version Control

2014-08-28 Thread Julien Nioche
Thanks for reporting this Alfonso, I'll fix this tomorrow (unless a fellow
committer beats me to it)

Julien



On 28 August 2014 10:13, Alfonso Nishikawa alfonso.nishik...@gmail.com
wrote:

 Greetings,

 I found that the page https://nutch.apache.org/version_control.html
 states in its title: "Apache Nutch™ - Gora Version Control System"

 Regards,

 Alfonso Nishikawa




-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: Incorrect download links for Nutch-1.9

2014-08-29 Thread Julien Nioche
Thanks Lewis


On 28 August 2014 22:41, Lewis John Mcgibbney lewis.mcgibb...@gmail.com
wrote:

 Hi Jake,
 Thank you so much for reporting.
 Fixed.
 Thank you, have a great day.
 Lewis


 On Wed, Aug 27, 2014 at 9:37 AM, dev-digest-h...@nutch.apache.org wrote:


 Hi all,

 I noticed that following the download links for Nutch 1.9 (from
 http://nutch.apache.org/downloads.html) takes users to a series of pages
 all with the pattern
 http://www.apache.org/dyn/closer.cgi/nutch/1.9/apache-nutch-1.8-*.  The
 end of the URI has apache-nutch-1.8, rather than apache-nutch-1.9. I
 haven’t tested any others, but at least the primary mirror for the 1.9
 source .zip is broken.

 Has anybody caught this yet, and are there plans to fix it?

 Cheers

 Jake




 --
 *Lewis*




-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: Jump to 3.X WAS [RELEASE] Apache Nutch 1.9

2014-09-01 Thread Julien Nioche
Hi chaps,

-1 from me. IMHO moving the trunk code to 3.x does not really solve the
issue. I'd rather make it more explicit that the standard Nutch (1.x) and
Nutch-GORA (2.x) are two separate beasts, for instance by referring to 2.x
as Nutch-GORA in the artifacts we release. This way users won't assume
that one is superior to the other. We can keep the same SVN
branches (trunk + 2.x) and use the minor version numbers as a reflection of
the amount of changes produced in the code.

Changing to 3.x would imply a major change of architecture or
functionality, which certainly won't be the case for the next release of
the trunk. When users ask "what is the difference between 3.x and 1.x?"
we'd have to answer "not much", and more importantly when asked "what is
the difference between 3.x and 2.x?" we'd reply "same as between 1.x and
2.x" ;-) Changing the name of the artefacts would clarify things.

This reminds me that our FAQ does not really answer these questions (and
other basic ones), will post about this separately.

Julien




On 29 August 2014 17:34, Lewis John Mcgibbney lewis.mcgibb...@gmail.com
wrote:

 Hi Chris,

 N.B. move to dev@

 On Fri, Aug 29, 2014 at 7:40 AM, user-digest-h...@nutch.apache.org
 wrote:

 +1, great.

 I'd like to have a conversation about versioning.

 Since we're at 1.9, my suggestion would be to have the
 next in the trunk series (1.x) move to version 3.x post
 1.9 for the release.


 Based on the discussion from which this new thread stems I would totally
 be behind this. It breathes new life into trunk. Which is a bonnie feather
 in the Nutch bonnet. Here is my +1 on that one.



 Nutch2 remains Nutch and can be worked on there. That
 would give us a nice split in the diversionary branch
 paths for Nutch.


 +1




-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: Jump to 3.X WAS [RELEASE] Apache Nutch 1.9

2014-09-01 Thread Julien Nioche
Hi Talat

See
http://mail-archives.apache.org/mod_mbox/nutch-user/201408.mbox/%3cd025dc24.1793f7%25chris.a.mattm...@jpl.nasa.gov%3E
for
some background.

Julien


On 1 September 2014 14:25, Talat Uyarer ta...@uyarer.com wrote:

 Hi all,

 Sorry, I was away for a long time, so I may have missed some discussions.
 If so, please let me know. But I wonder why you are reconsidering our
 version numbering. Why not continue with 1.10 as the next release after
 1.9? IMHO the 2.x branch is the online version of Nutch 1.x. If they have
 some feature differences, this is our mistake. I will try to close this gap
 between 1.x and 2.x.

 "Changing to 3.x would imply a major change of architecture or
 functionality, which certainly won't be the case for the next release of
 the trunk." I agree with Julien.

 IMHO we do not need any changes.

 Talat


 2014-09-01 12:23 GMT+03:00 Julien Nioche lists.digitalpeb...@gmail.com:

 Hi chaps,

 -1 from me. IMHO moving the trunk code to 3.x does not really solve the
 issue. I'd rather make it more explicit that the standard Nutch (1.x) and
 Nutch-GORA (2.x) are two separate beasts, for instance by referring to 2.x
 as Nutch-GORA in the artifacts we release. This way users won't assume
 that one is superior to the other. We can keep the same SVN
 branches (trunk + 2.x) and use the minor version numbers as a reflection of
 the amount of changes produced in the code.

 Changing to 3.x would imply a major change of architecture or
 functionality, which certainly won't be the case for the next release of
 the trunk. When users ask "what is the difference between 3.x and 1.x?"
 we'd have to answer "not much", and more importantly when asked "what is
 the difference between 3.x and 2.x?" we'd reply "same as between 1.x and
 2.x" ;-) Changing the name of the artefacts would clarify things.

 This reminds me that our FAQ does not really answer these questions (and
 other basic ones), will post about this separately.

 Julien




 On 29 August 2014 17:34, Lewis John Mcgibbney lewis.mcgibb...@gmail.com
 wrote:

 Hi Chris,

 N.B. move to dev@

 On Fri, Aug 29, 2014 at 7:40 AM, user-digest-h...@nutch.apache.org
 wrote:

 +1, great.

 I'd like to have a conversation about versioning.

 Since we're at 1.9, my suggestion would be to have the
 next in the trunk series (1.x) move to version 3.x post
 1.9 for the release.


 Based on the discussion from which this new thread stems I would totally
 be behind this. It breathes new life into trunk. Which is a bonnie feather
 in the Nutch bonnet. Here is my +1 on that one.



 Nutch2 remains Nutch and can be worked on there. That
 would give us a nice split in the diversionary branch
 paths for Nutch.


 +1




 --

 Open Source Solutions for Text Engineering

 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble




 --
 Talat UYARER
 Websitesi: http://talat.uyarer.com
 Twitter: http://twitter.com/talatuyarer
 Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304




-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: Jump to 3.X WAS [RELEASE] Apache Nutch 1.9

2014-09-01 Thread Julien Nioche
Let's wait a couple of weeks before voting on this. I know Sebastian is on
holiday until the 12th and there might be more people in this case.

On 1 September 2014 17:34, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:

 Hi Julien,



 -Original Message-
 From: Julien Nioche lists.digitalpeb...@gmail.com
 Reply-To: dev@nutch.apache.org dev@nutch.apache.org
 Date: Monday, September 1, 2014 2:23 AM
 To: dev@nutch.apache.org dev@nutch.apache.org
 Cc: Chris Mattmann mattm...@apache.org
 Subject: Re: Jump to 3.X WAS [RELEASE] Apache Nutch 1.9

 Hi chaps,
 
 
 -1 from me. IMHO moving the trunk code to 3.x does not really solve the
 issue. I'd rather make it more explicit that the standard Nutch (1.x) and
 Nutch-GORA (2.x) are two separate beasts, for instance by referring to 2.x
 as Nutch-GORA in the artifacts we release. This way users won't assume
 that one is superior to the
 other. We can keep the same SVN branches (trunk + 2.x) and use the minor
 version numbers as a reflection of the amount of changes produced in the
 code.

 It has nothing to do with being superior? Was Apache Tomcat 6 superior to
 Apache Tomcat 5? No, it had nothing to do with it - they were completely
 separate architectures. Heck Apache Tomcat 7 was a place where some of
 the architectural concepts from 5 and 6 met in the middle - that's
 precisely what I am proposing here.

 We've just completed the development line of the 1.x series by releasing
 1.9. 2.x is still going. They each do different things - 1.x is more
 scalable.
 2.x has more flexibility but is harder to install. It's not about one being
 superior to one another.

 
 
 
 Changing to 3.x would imply a major change of architecture or
 functionality, which certainly won't be the case for the next release of
 the trunk.

 Not really - all it would imply is the end of the 1.x branch-line, without
 merging into the 2.x branch line.

 When users ask "what is the difference between 3.x and 1.x?" we'd have to
 answer "not much", and more importantly when asked "what is the difference
 between 3.x and 2.x?" we'd reply "same as between 1.x and 2.x" ;-)
 Changing the name of the artefacts would clarify things.

 So what? Answering user questions from time to time is not a huge deal. I
 answer
 them from my students all the time in teaching them Apache Nutch in my
 search
 engines class, or more recently with the JPL folks deploying it for our
 internal
 CIO search.

 
 
 This reminds me that our FAQ does not really answer these questions (and
 other basic ones), will post about this separately.

 Well if you are -1 on the renaming to 3.x, we'll have to figure something
 out.
 I'm -1 on renaming the artifacts to Nutch-Gora - so maybe what we need is
 a
 ballot with a few options and we can put it to a VOTE for the committee.

 I'll wait a few days to let this settle before calling such a VOTE.

 Cheers,
 Chris



 On 29 August 2014 17:34, Lewis John Mcgibbney
 lewis.mcgibb...@gmail.com wrote:
 
 Hi Chris,
 
 
 N.B. move to dev@
 
 
 On Fri, Aug 29, 2014 at 7:40 AM, user-digest-h...@nutch.apache.org
 wrote:
 
 +1, great.
 
 I'd like to have a conversation about versioning.
 
 Since we're at 1.9, my suggestion would be to have the
 next in the trunk series (1.x) move to version 3.x post
 1.9 for the release.
 
 
 
 
 Based on the discussion from which this new thread stems I would totally
 be behind this. It breathes new life into trunk. Which is a bonnie
 feather in the Nutch bonnet. Here is my +1 on that one.
 
 
 
 
 Nutch2 remains Nutch and can be worked on there. That
 would give us a nice split in the diversionary branch
 paths for Nutch.
 
 
 
 
 +1
 
 --
 
 Open Source Solutions for Text Engineering
 
 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble
 




-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: Nutch won't fetch the whole page if the Transfer-Encoding is chunked

2014-09-17 Thread Julien Nioche
Hi

Isn't that an effect of

<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content using the http://
  protocol, in bytes. If this value is nonnegative (>=0), content longer than
  it will be truncated; otherwise, no truncation at all. Do not confuse this
  setting with the file.content.limit setting.</description>
</property>
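
If the truncation does come from this limit, a minimal sketch of an override
in conf/nutch-site.xml could look like the fragment below. The value -1 is
an assumption picked to switch truncation off completely (the description
above says any negative value does so); a larger positive byte count would
be the more conservative choice.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Overrides the 65536-byte default quoted above. -->
  <property>
    <name>http.content.limit</name>
    <!-- Negative value: HTTP content is not truncated at all. -->
    <value>-1</value>
  </property>
</configuration>
```

The page would need to be refetched after the change, since content already
stored in a segment stays truncated.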


I can't reproduce the problem as http://search.dangdang.com/ seems to be
down.

Do you have another URL to illustrate the issue?

J.

On 16 September 2014 15:59, zeroleaf zeroleaf...@gmail.com wrote:

 Recently, while using Nutch, I found that if the Transfer-Encoding is
 chunked, Nutch does not fetch the whole page but only part of it. Is this
 the expected behaviour in Nutch or is it a bug? If it is expected, how can
 I configure Nutch to fetch the whole page?

 For example, add the URL below to the seed dir

 http://search.dangdang.com/?key=%CA%FD%BE%DD%BF%E2

 then look at the fetched HTML in the content: it is only a part of the
 page. In addition, the versions I tested are Nutch 1.x (1.9 and 1.10).

 Thanks.




-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: Generic xsl parser plugin

2014-09-25 Thread Julien Nioche
Hi Albin,

You don't have to have a separate plugin for each html structure you want
to parse. You can have a single plugin with multiple HTMLParseFilters.

Having a generic extractor with the extraction logic configured in an
external file is definitely a good idea and would make a great contribution
to the project. In a nutshell, you haven't missed anything and that wheel
definitely needs inventing ;-)
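
To make the idea of an externally configured extractor concrete, here is a
purely hypothetical rules file. No such format exists in Nutch today: the
element names, the attribute names, the example.com pattern and the field
names are all invented for illustration. A single HTMLParseFilter could load
a file like this once and apply the matching XPath expressions to each
parsed page.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical format: every element and attribute name here is an
     assumption, not an existing Nutch schema. -->
<extraction-rules>
  <!-- Rules in this group apply only to URLs matching the pattern. -->
  <site url-pattern="^https?://www\.example\.com/products/.*">
    <!-- Each field maps one XPath expression to a metadata key. -->
    <field name="product_title" xpath="//div[@id='title']"/>
    <field name="product_price" xpath="//td[@id='price']"/>
  </site>
</extraction-rules>
```

Supporting a new site would then mean adding another site element rather
than writing another plugin.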

Best

Julien


On 25 September 2014 09:24, Albin Vigier albinsc...@gmail.com wrote:

 Hello everybody,

 I'm just wondering if it is possible to fetch specific metadata with
 an existing nutch plugin.

 Let's take an example.
 I want to extract some metadata from <div> or <td> tags from html
 pages that have specific ids and name them the way I like (this is
 done at parser time).
 Then, at indexer time, I would use index-metadata (a very good plugin)
 to add my custom metadata.

 Currently from what I've seen on the wiki and by quickly analyzing
 plugins I suppose I have to code my own plugin each time I've got a
 new site (with a new html structure). I've already done that by using
 a node walker in a custom htmlParseFilter but the extraction can be a
 little bit boring :)

 So on my side I've coded a little plugin that enables me to specify
 xpaths in an xml file. But before diving into more functionalities I'm
 just wondering if I have missed something.
 This work allowed me to explore some nutch aspects but I don't want to
 reinvent the wheel or miss something.

 Albin




-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: Generic xsl parser plugin

2014-09-26 Thread Julien Nioche
Hi Nima

Thanks for reminding me about this JIRA issue, it hasn't been commented on
for some time and I'd forgotten about it. Judging by the discussion on
NUTCH-978 (https://issues.apache.org/jira/browse/NUTCH-978) things got
stuck when Emmanuel tried to get in touch with Emir (who in the meantime
seems to have stopped using Nutch - see
http://www.atlantbh.com/book-review-web-crawling-and-data-mining-with-apache-nutch/
).

It would be a good thing to get in touch with him indeed, alternatively
Albin's plugin could be a good starting point. There clearly is a need for
such a functionality and quite a few people keen to make it happen.

Thanks

Julien


On 25 September 2014 18:19, Nima Falaki nfal...@popsugar.com wrote:

 And the reason why I think this is because of this ticket (Look at the
 conversation at the bottom between Emmanuel and Lewis John)

 https://issues.apache.org/jira/browse/NUTCH-978

 On Thu, Sep 25, 2014 at 8:44 AM, Nima Falaki nfal...@popsugar.com wrote:

 Hi Julien:

 I was under the impression that the nutch community was going to use a
 generic xsl parser? This one:
 http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/ Is
 the nutch community going to use this?



 On Thu, Sep 25, 2014 at 5:49 AM, Julien Nioche 
 lists.digitalpeb...@gmail.com wrote:

 Hi Albin,

 You don't have to have a separate plugin for each html structure you
 want to parse. You can have a single plugin with multiple HTMLParseFilters.

 Having a generic extractor with the extraction logic configured in an
 external file is definitely a good idea and would make a great contribution
 to the project. In a nutshell, you haven't missed anything and that wheel
 definitely needs inventing ;-)

 Best

 Julien


 On 25 September 2014 09:24, Albin Vigier albinsc...@gmail.com wrote:

 Hello everybody,

 I'm just wondering if it is possible to fetch specific metadata with
 an existing nutch plugin.

 Let's take an example.
 I want to extract some metadata from <div> or <td> tags from html
 pages that have specific ids and name them the way I like (this is
 done at parser time).
 Then, at indexer time, I would use index-metadata (a very good plugin)
 to add my custom metadata.

 Currently from what I've seen on the wiki and by quickly analyzing
 plugins I suppose I have to code my own plugin each time I've got a
 new site (with a new html structure). I've already done that by using
 a node walker in a custom htmlParseFilter but the extraction can be a
 little bit boring :)

 So on my side I've coded a little plugin that enables me to specify
 xpaths in an xml file. But before diving into more functionalities I'm
 just wondering if I have missed something.
 This work allowed me to explore some nutch aspects but I don't want to
 reinvent the wheel or miss something.

 Albin




 --

 Open Source Solutions for Text Engineering

 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble




 --



 Nima Falaki
 Software Engineer
 nfal...@popsugar.com




 --



 Nima Falaki
 Software Engineer
 nfal...@popsugar.com




-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: GSoC 2015

2015-02-04 Thread Julien Nioche
Moving to Hadoop 2.x ?

On 4 February 2015 at 14:42, Lewis John Mcgibbney lewis.mcgibb...@gmail.com
 wrote:

 Hi Folks,
 Does anyone have any good ideas for GSoC?
 Seb mentioned moving Nutch towards Spark so potentially a pluggable
 runtime execution engine abstraction?
 I am currently working on a lot of security and authentication related
 work so I would possibly be tempted to overhaul and improve that aspect of
 Nutch.
 Any other ideas?
 Thanks folks
 Lewis

 --
 *Lewis*




-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: [ANNOUNCE] New Nutch committer and PMC - Jorge Luis Betancourt Gonzalez

2015-02-19 Thread Julien Nioche
Congratulations and welcome Jorge! Great to have you with us

Julien

On 19 February 2015 at 17:20, Sebastian Nagel wastl.na...@googlemail.com
wrote:

 Dear all,

 on behalf of the Nutch PMC it is my pleasure to announce that
 Jorge Luis Betancourt Gonzalez has been voted in as committer
 and member of the Nutch PMC. Jorge, would you mind telling us
 about yourself, what you've done so far with Nutch, which areas
 you think you'd like to get involved, etc...

 Congratulations and welcome on board!

 Regards,
 Sebastian




-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: Review Request 31579: Patch for NUTCH-1949: Dump out the Nutch data into the Common Crawl format

2015-03-02 Thread Julien Nioche

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/31579/#review74783
---


Any reason why you can't have this in a separate plugin as an extension of 
IndexWriter? See 
[https://issues.apache.org/jira/browse/NUTCH-1949?focusedCommentId=14336272&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14336272]

- Julien Nioche


On March 2, 2015, 5:58 p.m., Giuseppe Totaro wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/31579/
 ---
 
 (Updated March 2, 2015, 5:58 p.m.)
 
 
 Review request for nutch, Lewis McGibbney and Chris Mattmann.
 
 
 Bugs: NUTCH-1949
 https://issues.apache.org/jira/browse/NUTCH-1949
 
 
 Repository: nutch
 
 
 Description
 ---
 
 Patch for NUTCH-1949: first version of the CommonCrawlDataDumper tool that 
 maps Nutch data into Common Crawl format.
 
 
 Diffs
 -
 
   trunk/src/bin/nutch 1662875 
   trunk/src/java/org/apache/nutch/tools/AbstractCommonCrawlFormat.java 
 PRE-CREATION 
   trunk/src/java/org/apache/nutch/tools/CommonCrawlDataDumper.java 
 PRE-CREATION 
   trunk/src/java/org/apache/nutch/tools/CommonCrawlFormat.java PRE-CREATION 
   trunk/src/java/org/apache/nutch/tools/CommonCrawlFormatFactory.java 
 PRE-CREATION 
   trunk/src/java/org/apache/nutch/tools/CommonCrawlFormatJackson.java 
 PRE-CREATION 
   trunk/src/java/org/apache/nutch/tools/CommonCrawlFormatJettinson.java 
 PRE-CREATION 
   trunk/src/java/org/apache/nutch/tools/CommonCrawlFormatSimple.java 
 PRE-CREATION 
   trunk/src/java/org/apache/nutch/tools/FileDumper.java 1662875 
 
 Diff: https://reviews.apache.org/r/31579/diff/
 
 
 Testing
 ---
 
 Tested locally against Nutch segments.
 
 
 Thanks,
 
 Giuseppe Totaro
 




Re: Unsubscribe

2015-02-26 Thread Julien Nioche
Massimo,

http://nutch.apache.org/mailing_lists.html

=> dev-unsubscr...@nutch.apache.org

Thanks

On 26 February 2015 at 19:11, Massimo Miccoli mmicc...@iltrovatore.it
wrote:



 Massimo

  Il giorno 26/feb/2015, alle ore 19:31, lewi...@apache.org ha scritto:
 
  Author: lewismc
  Date: Thu Feb 26 18:31:39 2015
  New Revision: 1662530
 
  URL: http://svn.apache.org/r1662530
  Log:
  NUTCH-1933 nutch-selenium plugin
 
  Added:
 nutch/trunk/src/plugin/lib-selenium/
 nutch/trunk/src/plugin/lib-selenium/build.xml
 nutch/trunk/src/plugin/lib-selenium/ivy.xml
 nutch/trunk/src/plugin/lib-selenium/plugin.xml
 nutch/trunk/src/plugin/lib-selenium/src/
 nutch/trunk/src/plugin/lib-selenium/src/java/
 nutch/trunk/src/plugin/lib-selenium/src/java/org/
 nutch/trunk/src/plugin/lib-selenium/src/java/org/apache/
 nutch/trunk/src/plugin/lib-selenium/src/java/org/apache/nutch/
 
 nutch/trunk/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/
 
 nutch/trunk/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/
 
 nutch/trunk/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java
 nutch/trunk/src/plugin/protocol-selenium/
 nutch/trunk/src/plugin/protocol-selenium/build-ivy.xml
 nutch/trunk/src/plugin/protocol-selenium/build.xml
 nutch/trunk/src/plugin/protocol-selenium/ivy.xml
 nutch/trunk/src/plugin/protocol-selenium/plugin.xml
 nutch/trunk/src/plugin/protocol-selenium/src/
 nutch/trunk/src/plugin/protocol-selenium/src/java/
 nutch/trunk/src/plugin/protocol-selenium/src/java/org/
 nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/
 nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/
 
 nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/
 
 nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/
 
 nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/Http.java
 
 nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/HttpResponse.java
 
 nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/package.html
 nutch/trunk/src/plugin/protocol-selenium/src/target/
 nutch/trunk/src/plugin/protocol-selenium/src/target/classes/
 nutch/trunk/src/plugin/protocol-selenium/src/target/classes/org/
 
 nutch/trunk/src/plugin/protocol-selenium/src/target/classes/org/apache/
 
 nutch/trunk/src/plugin/protocol-selenium/src/target/classes/org/apache/nutch/
 
 nutch/trunk/src/plugin/protocol-selenium/src/target/classes/org/apache/nutch/protocol/
 
 nutch/trunk/src/plugin/protocol-selenium/src/target/classes/org/apache/nutch/protocol/htmlunit/
 
 nutch/trunk/src/plugin/protocol-selenium/src/target/classes/org/apache/nutch/protocol/htmlunit/package.html
  Modified:
 nutch/trunk/CHANGES.txt
 nutch/trunk/build.xml
 nutch/trunk/ivy/ivy.xml
 nutch/trunk/src/plugin/build.xml
 
  Modified: nutch/trunk/CHANGES.txt
  URL:
 http://svn.apache.org/viewvc/nutch/trunk/CHANGES.txt?rev=1662530&r1=1662529&r2=1662530&view=diff
 
 ==
  --- nutch/trunk/CHANGES.txt (original)
  +++ nutch/trunk/CHANGES.txt Thu Feb 26 18:31:39 2015
  @@ -2,6 +2,8 @@ Nutch Change Log
 
  Nutch Current Development 1.10-SNAPSHOT
 
  +* NUTCH-1933 nutch-selenium plugin (Mo Omer, Mohammad Al-Moshin,
 lewismc)
  +
  * NUTCH-827 HTTP POST Authentication (Jasper van Veghel, yuanyun.cn,
 snagel, lewismc)
 
  * NUTCH-1724 LinkDBReader to support regex output filtering (markus)
 
  Modified: nutch/trunk/build.xml
  URL:
 http://svn.apache.org/viewvc/nutch/trunk/build.xml?rev=1662530&r1=1662529&r2=1662530&view=diff
 
 ==
  --- nutch/trunk/build.xml (original)
  +++ nutch/trunk/build.xml Thu Feb 26 18:31:39 2015
  @@ -184,6 +184,7 @@
 <packageset dir="${plugins.dir}/indexer-solr/src/java"/>
 <packageset dir="${plugins.dir}/language-identifier/src/java"/>
 <packageset dir="${plugins.dir}/lib-http/src/java"/>
  +  <packageset dir="${plugins.dir}/lib-selenium/src/java"/>
 <packageset dir="${plugins.dir}/lib-regex-filter/src/java"/>
 <packageset dir="${plugins.dir}/microformats-reltag/src/java"/>
 <packageset dir="${plugins.dir}/parse-ext/src/java"/>
  @@ -197,6 +198,7 @@
 <packageset dir="${plugins.dir}/protocol-ftp/src/java"/>
 <packageset dir="${plugins.dir}/protocol-http/src/java"/>
 <packageset dir="${plugins.dir}/protocol-httpclient/src/java"/>
  +  <packageset dir="${plugins.dir}/protocol-selenium/src/java"/>
 <packageset dir="${plugins.dir}/scoring-depth/src/java"/>
 <packageset dir="${plugins.dir}/scoring-link/src/java"/>
 <packageset dir="${plugins.dir}/scoring-opic/src/java"/>
  @@ -591,6 +593,7 @@
 <packageset dir="${plugins.dir}/indexer-solr/src/java"/>
packageset 

Re: [ANNOUNCE] New Nutch committer and PMC - Mo Omer

2015-03-23 Thread Julien Nioche
Welcome Mo!

On 22 March 2015 at 19:31, Markus Jelsma markus.jel...@openindex.io wrote:

 Welcome Mohammad!

 -Original message-
 From: Mohammed Omer beancinemat...@gmail.com
 Sent: Sunday 22nd March 2015 18:55
 To: u...@nutch.apache.org
 Cc: dev@nutch.apache.org
 Subject: Re: [ANNOUNCE] New Nutch committer and PMC - Mo Omer

 Hello all,

 First, and most importantly, I'd like to send out a thank you to Chris,
 Sebastian, Lewis (all of whom I've interacted with), and the rest of the
 Nutch PMC for the opportunity to work and learn with them. Nutch is such a
 dope (love that word) project, and I'm happy to help where I can.

 I'm a 26 year old Minneapolis MN, USA software engineer and generally
 inquisitive person, who's happy that this winter was pretty mild and that
 the local groups I help out with (GoMN, RailsMN) are very alive and well.
 Figured a picture is worth 1000 words, and probably more-so in a
 self-summary, so I snapped one of my messy home office (
 http://imgur.com/klKOZBr).

 Most of what I write/do for my day (Mithun) job are internal applications,
 data warehousing, analytics systems/research, back-end APIs, etc. As I'm the
 sole developer on our Analytics/Media team, there's a lot to research,
 build, deploy, and maintain.

 My usage so far with Nutch was mainly for one such analytics project that
 I built and deployed last year, during which time I put together a couple
 of Nutch plugins [0, 1, 2] which relied on Selenium [3] to retrieve
 web-browser rendered pages.

 Nutch development areas currently of interest to me are whichever the PMC
 has identified and prioritized in order to keep the project moving forward
 and maintaining its effectiveness as a large-scale web crawler. Everyone on
 the PMC is extremely familiar with the project, and I'd first like to learn
 as much as I can while helping out where I can.

 Thank you again all,

 Mo

 [0]: Lewis, Markus, Mohammad, Jorge and Chris worked on and merged
 Nutch-Selenium into 1.10 trunk at
 https://issues.apache.org/jira/browse/NUTCH-1933
 [1]: https://github.com/momer/nutch-selenium
 [2]: https://github.com/momer/nutch-selenium-grid-plugin
 [3]: http://www.seleniumhq.org/

 On Sun, Mar 22, 2015 at 4:40 AM, Sebastian Nagel
 wastl.na...@googlemail.com wrote:
 Dear all,

 it is my pleasure to announce that Mo Omer has been voted in
 as committer and member of the Nutch PMC. Mo, would you mind
 telling us about yourself, what you've done so far with Nutch,
 which areas you think you'd like to get involved, etc...?

 Congratulations and welcome on board!

 Regards,
 Sebastian
 (on behalf of the Nutch PMC)




-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: [ANNOUNCE] New Nutch committer and PMC - Guiseppe Totaro

2015-04-26 Thread Julien Nioche
Congrats and welcome Giuseppe!

On 25 April 2015 at 22:43, Giuseppe Totaro totarope...@gmail.com wrote:

 Thanks a lot Sebastian.
 I am very proud to be part of this project as committer and member of the
 Nutch PMC.

 I am working on Information Retrieval at scale under the supervision of
 Professor Chris Mattmann at NASA JPL.
 I developed the CommonCrawlDataDumper
 https://wiki.apache.org/nutch/CommonCrawlDataDumper tool for Nutch and
 I am working on extending it as indexing plugin and implementing more
 conversion tools for Nutch.
 I take this opportunity to thank Chris Mattmann and Lewis McGibbney for
 kindly supporting me on this work.

 I would really like to get your feedback on my work. Feel free to ask me
 any question.

 Cheers,
 Giuseppe

 On Fri, Apr 24, 2015 at 1:00 PM, Sebastian Nagel 
 wastl.na...@googlemail.com wrote:

 Dear all,

 it is my pleasure to announce that Giuseppe Totaro has joined us
 as committer and member of the Nutch PMC.  Congratulations on your
 new role within the Apache Nutch community!

 Giuseppe, would you mind telling us about yourself, and what you
 are doing with Nutch, what you plan to do, etc.

 Cheers and welcome on board!
 Sebastian
 (on behalf of the Nutch PMC)





-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: [VOTE] Release Apache Nutch 1.10

2015-04-30 Thread Julien Nioche
Thanks Lewis

+1 : compiled on Linux + ran a small crawl and indexed with ES

j

On 29 April 2015 at 22:54, Lewis John Mcgibbney lewis.mcgibb...@gmail.com
wrote:


 Hi user@ & dev@,

 This thread is a VOTE for releasing Apache Nutch 1.10. The release
 candidate comprises the following components.

 * A staging repository [0] containing various Maven artifacts
 * A branch-1.10 of the trunk code [1]
 * The tagged source upon which we are VOTE'ing [2]
 * Finally, the release artifacts [3] which I would encourage you to
   verify for signatures and test.

 You should use the following KEYS [4] file to verify the signatures of
 all release artifacts.

 Please VOTE as follows

 [ ] +1 Push the release, I am happy :)
 [ ] +0 I am not bothered either way
 [ ] -1 I am not happy with this release candidate (please state why)

 Firstly thank you to everyone that contributed to Nutch 1.10.
 Secondly, thank you to everyone that VOTE's. It is highly appreciated.

 Thanks
 Lewis
 (on behalf of Nutch PMC)

 p.s. Here's my +1

 [0] https://repository.apache.org/content/repositories/orgapachenutch-1004
 [1] https://svn.apache.org/repos/asf/nutch/branches/branch-1.10
 [2] https://svn.apache.org/repos/asf/nutch/tags/release-1.10
 [3] https://dist.apache.org/repos/dist/dev/nutch/1.10/
 [4] http://www.apache.org/dist/nutch/KEYS



 --
 *Lewis*




-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


crawler-commons 0.6 released

2015-06-11 Thread Julien Nioche
[Apologies for cross posting]

crawler-commons 0.6 is released

We are glad to announce the 0.6 release of Crawler Commons. See the
CHANGES.txt
https://github.com/crawler-commons/crawler-commons/releases/tag/crawler-commons-0.6
file
included with the release for a full list of details.

We suggest that all users upgrade to this version. Details of how to do so
can be found on Maven Central
http://search.maven.org/#artifactdetails%7Ccom.github.crawler-commons%7Ccrawler-commons%7C0.6%7Cjar.
Please note that the groupId has changed to *com.github.crawler-commons*.
Thanks to all contributors
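
For a Maven build, the upgrade should amount to a dependency entry along the following lines. This is only a sketch: the groupId and version are the ones given in this announcement, while the artifactId `crawler-commons` is read off the Maven Central link, so please check the coordinates there before relying on them.

```xml
<!-- Sketch of a pom.xml dependency entry for crawler-commons 0.6.
     groupId and version are taken from the announcement; the artifactId
     is assumed from the Maven Central URL. -->
<dependency>
    <groupId>com.github.crawler-commons</groupId>
    <artifactId>crawler-commons</artifactId>
    <version>0.6</version>
</dependency>
```

Projects that still declare the previous groupId would need to replace that entry rather than add a second one, otherwise both versions could end up on the classpath.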

Julien

https://github.com/crawler-commons/crawler-commons


Re: [VOTE] Apache Nutch 1.11 Release Candidate #1

2015-10-26 Thread Julien Nioche
Chris

-1  We usually release tar.gz as well as zip.  More importantly we need to
release the sources as well as the binary. We can't even test that it
compiles OK.

Since you released Tika, why don't we include it before cutting 1.11?

Thanks

Julien


On 26 October 2015 at 05:53, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Hi Folks,
>
> A first candidate for the Nutch 1.11 release is available at:
>
>   https://dist.apache.org/repos/dist/dev/nutch/1.11/
>
> The release candidate is a zip archive of the sources in:
> http://svn.apache.org/repos/asf/nutch/tags/release-1.11-rc1/
>
>
> The SHA1 checksum of the archive is
> 6adebaca0504be69a9e6c67ae1eb3a8487b1806f
>
>
> In addition, a staged maven repository is available here:
>
> https://repository.apache.org/content/repositories/orgapachenutch-1006/
>
>
> Please vote on releasing this package as Apache Nutch 1.11.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Nutch PMC votes are cast.
>
> [ ] +1 Release this package as Apache Nutch 1.11
> [ ] -1 Do not release this package because…
>
> Cheers,
> Chris
>
> P.S. Of course here is my +1.
>
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>
>


-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble 


Re: [DISCUSS] Release Nutch trunk 1.11

2015-08-26 Thread Julien Nioche
Hi Lewis

I'd love to see https://issues.apache.org/jira/browse/NUTCH-1517 being part
of 1.11. It is a separate indexing plugin which should not impact any
existing code. It's been reviewed by Jorge and I'll commit it soon
unless someone objects.

Thanks

J.

On 26 August 2015 at 03:23, Lewis John Mcgibbney lewis.mcgibb...@gmail.com
wrote:

 Hi Folks,
 What do you all think about getting a release candidate out for Nutch
 1.11? I am happy to do RM role.
 Thanks
 Lewis


 --
 *Lewis*




-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: [DISCUSS] Release Nutch trunk 1.11

2015-08-26 Thread Julien Nioche
Done. Thanks Markus

On 26 August 2015 at 13:08, Markus Jelsma markus.jel...@openindex.io
wrote:

 Yes Julien, please commit. I do think
 https://issues.apache.org/jira/browse/NUTCH-2064 should also be included.
 But i have my hands full atm.

 -Original message-
 From: Julien Nioche lists.digitalpeb...@gmail.com
 Sent: Wednesday 26th August 2015 13:51
 To: dev@nutch.apache.org
 Subject: Re: [DISCUSS] Release Nutch trunk 1.11

 Hi Lewis

 I'd love to see https://issues.apache.org/jira/browse/NUTCH-1517
 being part of 1.11. It is a separate indexing plugin which should not
 impact any existing code. It's been reviewed by Jorge and I'll commit it
 soon unless someone objects.

 Thanks

 J.

 On 26 August 2015 at 03:23, Lewis John Mcgibbney
 lewis.mcgibb...@gmail.com wrote:

 Hi Folks,

 What do you all think about getting a release candidate out for Nutch
 1.11? I am happy to do RM role.

 Thanks

 Lewis

 --

 Lewis

 --

 Open Source Solutions for Text Engineering

 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble





-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: [ANNOUNCE] New Nutch committer and PMC - Asitang Mishra

2015-09-10 Thread Julien Nioche
Congratulations Asitang and welcome!

Julien

On 9 September 2015 at 23:01, Sebastian Nagel 
wrote:

> Dear all,
>
> on behalf of the Nutch PMC it is my pleasure to announce
> that Asitang Mishra has joined the Nutch team as committer
> and PMC member. Asitang, please feel free to introduce
> yourself and to tell the Nutch community about your
> interests and your relation to Nutch.
>
> Congratulations and welcome on board!
>
> Regards,
> Sebastian (on behalf of the Nutch PMC)
>



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: Nutch not recognizing html pages/images retrieved via php

2015-10-05 Thread Julien Nioche
Hi

What happens is that parse-tika is used by default but doesn't know what to
do with that mime type.

You can edit parse-plugins.xml
 and add






to map the mime type to the html parser. Obviously you'll need parse-html
to be active.
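
A mapping of roughly this shape should do it. This is a sketch that assumes the stock layout of conf/parse-plugins.xml (a `mimeType` element wrapping a `plugin` id) and that `parse-html` is declared in the `aliases` section of that file, as it is in the default configuration:

```xml
<!-- conf/parse-plugins.xml: route text/x-php to the HTML parser.
     Assumes the default file layout and that the "parse-html" alias
     already exists in the <aliases> section. -->
<mimeType name="text/x-php">
    <plugin id="parse-html" />
</mimeType>
```

For parse-html to be active it also has to match the plugin.includes property in nutch-site.xml; a pattern such as parse-(html|tika) in that value would cover both parsers.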

HTH

Julien



On 4 October 2015 at 03:01, Girish Rao  wrote:

> Hi,
>
> I am running a crawl on a website that serves pages and images via php.
> Nutch doesn’t seem to crawl these pages.
>
> I see the below in the hadoop.log
> 2015-10-03 12:48:31,091 INFO  parse.ParserFactory - The parsing plugins:
> [org.apache.nutch.parse.tika.TikaParser] are enabled via the
> plugin.includes system property, and all claim to support the content type
> text/x-php, but they are not mapped to it  in the parse-plugins.xml file
> 2015-10-03 12:48:31,712 ERROR tika.TikaParser - Can't retrieve Tika parser
> for mime-type text/x-php
> 2015-10-03 12:48:31,713 WARN  parse.ParseSegment - Error parsing:
> http://www.arguntrader.com/ucp.php?mode=login: failed(2,0): Can't
> retrieve Tika parser for mime-type text/x-php
>
> Can anyone help with identifying what is to be done to crawl a site which
> serves pages via php?
>
> Regards
> Girish




-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble 


Fwd: Job Opening at Common Crawl - Crawl Engineer / Data Scientist

2015-09-18 Thread Julien Nioche
Nutch people,

Just in case you missed the announcement below. As you probably know, CC use
Nutch for their crawls, so this is a fantastic opportunity to put your Nutch
skills to great use!

Julien

-- Forwarded message --
From: Sara Crouse 
Date: 17 September 2015 at 22:51
Subject: Job Opening at Common Crawl - Crawl Engineer / Data Scientist
To: Common Crawl 


Hello again CC community,

In addition to my appointment, another staff transition is on the horizon,
and I would like to ask for your help finding candidates to fill a critical
role. At the end of this month, Stephen Merity (data scientist, crawl
engineer, and much more!) will leave Common Crawl to work on image
recognition and language understanding using deep learning at MetaMind, a
new startup. Stephen has been a great asset to Common Crawl, and we are
grateful that he wishes to remain engaged with us in a volunteer capacity
going forward.

This week, we therefore launch a search to fill the role of Crawl
Engineer/Data Scientist. Below and posted here https://commoncrawl.org/jobs/
is the job description. We appreciate any help you can provide in spreading
the word about this unique opportunity. If you have specific referrals, or
wish to apply, please contact j...@commoncrawl.org.

Many thanks,

Sara

---

_CRAWL ENGINEER / DATA SCIENTIST at THE COMMON CRAWL FOUNDATION_

*Location*
San Francisco or Remote


*Job Summary*
Common Crawl (CC) is the non-profit organization that builds and maintains
the single largest publicly accessible dataset of the world’s knowledge,
encompassing petabytes of web crawl data.

If democratizing access to web information and tackling the engineering
challenges of working with data at the scale of the web sounds exciting to
you, we would love to hear from you. If you have worked on open source
projects before or can share code samples with us, please don't hesitate to
send relevant links along with your application.


*Description*

/Primary Responsibilities/
_Running the crawl_
* Spinning up and managing Hadoop clusters on Amazon EC2
* Running regular comprehensive crawls of the web using Nutch
* Preparing and publishing crawl data to data hosting partner, Amazon Web
Services
* Incident response and diagnosis of crawl issues as they occur, e.g.
** Replacing lost instances due to EC2 problems / spot instance losses
** Responding to and remedying webmaster queries and issues

_Crawl engineering_
* Maintaining, developing, and deploying new features as required by
running the Nutch crawler, e.g.:
** Providing netiquette features, such as following robots.txt, as
required, and load balancing a crawl across millions of domains
** Implementing and improving ranking algorithms to prioritize the crawling
of popular pages
* Extending existing tools to work efficiently with large datasets
* Working with the Nutch community to push improvements to the crawler to
the public

/Other Responsibilities/
* Building support tools and artifacts, including documentation, tutorials,
and example code or supporting frameworks for processing CC data using
different tools.
* Identifying and reporting on research and innovations that result from
analysis and derivative use of CC data.
* Community evangelism:
** Collaborating with partners in academia and industry
** Engaging regularly with user discussion group and responding to frequent
inquiries about how to use CC data
** Writing technical blog posts
** Presenting on or representing CC at conferences, meetups, etc.


*Qualifications*
/Minimum qualifications/
* Fluent in Java (Nutch and Hadoop are core to our mission)
* Familiarity with the JVM big data ecosystem (Hadoop, HDFS, ...)
* Knowledge of the Amazon Web Services (AWS) ecosystem
* Experience with Python
* Basic command line Unix knowledge
* BS Computer Science or equivalent work experience

/Preferred qualifications/
* Experience with running web crawlers
* Cluster computing experience (Hadoop preferred)
* Running parallel jobs over dozens of terabytes of data
* Experience committing to open source projects and participating in open
source forums


*About Common Crawl*
The Common Crawl Foundation is a California 501(c)(3) registered non-profit
with the goal of democratizing access to web information by producing and
maintaining an open repository of web crawl data that is universally
accessible and analyzable.

Our vision is of a truly open web that allows open access to information
and enables greater innovation in research, business and education. We
level the playing field by making wholesale extraction, transformation and
analysis of web data cheap and easy.

The Common Crawl Foundation is an Equal Opportunity Employer.


*To Apply*
Please send your cover letter and resumé to j...@commoncrawl.org.


Webcast : Apache Nutch on EMR

2015-09-23 Thread Julien Nioche
Hi again,

I have uploaded a webcast explaining how to run Nutch on AWS Elastic
MapReduce

https://www.youtube.com/watch?v=v9zjcTjjjyU

Please excuse the sound quality, hesitations and stuttering. I hope you
find it useful nonetheless.

Julien

-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble 


Tutorial : Index the web with AWS CloudSearch

2015-09-23 Thread Julien Nioche
Hi everyone,

Just to let you know that we've just published a new tutorial on how to use
Nutch (and StormCrawler) to crawl and index documents into AWS CloudSearch.

This is related to the recent addition of NUTCH-1517 in the trunk codebase.
The tutorial is aimed at beginners and gives step by step instructions on
how to use Nutch, including in distributed mode. It should also be relevant
for more advanced users as it provides an introduction to CloudSearch and a
comparison with StormCrawler.

The tutorial is on
http://digitalpebble.blogspot.co.uk/2015/09/index-web-with-aws-cloudsearch.html

Please retweet the announcement if you use Twitter [
https://twitter.com/digitalpebble/status/646614555192336384].

I hope you find it useful

Julien

-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: [RELEASE] Apache Nutch 1.11

2015-12-08 Thread Julien Nioche
Thanks Lewis for taking care of the release and everyone involved.

Julien

On 8 December 2015 at 01:34, lewis john mcgibbney 
wrote:

> Hello Folks,
>
> 07 December 2015 - Nutch 1.11 Release
>
> The Apache Nutch PMC are pleased to announce the immediate release of
> Apache Nutch v1.11, we advise all current users and developers of the 1.X
> series to upgrade to this release.
>
> What is Apache Nutch?
>
> Nutch is a well matured, production ready Web crawler. Nutch 1.x enables
> fine grained configuration, relying on Apache Hadoop™ data structures,
> which are great for batch processing.
>
> This release is the result of many months of work and around 100 issues
> addressed. For a complete overview of these issues please see the release
> report.
>
> As usual in the 1.X series, release artifacts are made available as both
> source and binary and also available within Maven Central as a Maven
> dependency. The release is available from our DOWNLOADS PAGE.
>
> Please also see the Nutch DOAP file -
> https://svn.apache.org/repos/asf/nutch/cms_site/trunk/content/doap.rdf
>
> Best
>
> Lewis
>
> (on behalf of the Apache Nutch PMC)
>
>


-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble 


Re: [VOTE] Release Apache Nutch 1.11 RC#2

2015-12-05 Thread Julien Nioche
+1

Thanks Lewis

On 4 December 2015 at 18:03, Lewis John Mcgibbney  wrote:

> Hi Folks,
>
> A second candidate for the Nutch 1.11 release is available at:
>
>   https://dist.apache.org/repos/dist/dev/nutch/1.11rc2/
>
> The release candidate consists of zip and tar binary archives as well as
> zip and tar source archives of the sources in:
> http://svn.apache.org/repos/asf/nutch/tags/release-1.11-rc2/
>
> All artifacts have been signed with the following signature as present
> within KEYS
> 48BAEBF6 2013-10-28 Lewis John McGibbney (CODE SIGNING KEY) <
> lewi...@apache.org>
>
> In addition, a staged maven repository is available here:
>
> https://repository.apache.org/content/repositories/orgapachenutch-1007/
>
> Please vote on releasing this package as Apache Nutch 1.11.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Nutch PMC votes are cast.
>
> [ ] +1 Release this package as Apache Nutch 1.11.
> [ ] -1 Do not release this package because…
>
> Cheers,
> Lewis John McGibbney
>
> P.S. Here is my +1.
>
> --
> *Lewis*
>



-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble 


Re: [VOTE] Moving to Git

2016-01-08 Thread Julien Nioche
+1 to move to Git

Note : I don't think Dennis is on the PMC anymore

Ju

On 8 January 2016 at 08:46, Chris Mattmann  wrote:

> Hi Everyone,
>
> I proposed this earlier, and we said we’d wait until after the
> 1.11 release. So it’s time to VOTE to move Nutch to Git. So
> far, the following people have expressed +1s and if I don’t hear
> otherwise, I will implicitly count their VOTE from the DISCUSS
> thread:
>
> +1 PMC
>
> Chris Mattmann*
> Sebastian Nagel*
> Michael Joyce*
> Asitang Mishra*
> Dennis Kubes*
> BlackIce
>
> Everyone else (or those above that would like to amend their VOTE),
> please VOTE below. I will leave the VOTE open for at least 72 hours.
>
> [ ] +1 Move the Nutch SCM to Writeable Git repositories at the ASF.
> [ ] +0 No opinion.
> [ ] -1 Don’t move the Nutch SCM to Writeable Git repositories at the
> ASF because…
>
> Please note, I created a page for Tika that is worth checking out and
> perhaps copying over to the Nutch wiki:
>
> http://wiki.apache.org/tika/UsingGit
>
> Please have a look as I think it will help with our workflows too.
>
> Cheers,
> Chris
>
>
>
>
> -Original Message-
> From: jpluser 
> Reply-To: "dev@nutch.apache.org" 
> Date: Wednesday, November 18, 2015 at 7:39 PM
> To: "dev@nutch.apache.org" 
> Subject: [DISCUSS] Moving to Git
>
> >Hi All,
> >
> >I propose that we consider moving to ASF supported writeable git
> >repos for Nutch. This would entail moving Nutch’s canonical repo
> >from:
> >
> >https://svn.apache.org/repos/asf/nutch
> >
> >TO
> >
> >https://git-wip-us.apache.org/repos/asf/nutch.git
> >
> >
> >We are already accepting PRs and so forth from Github and I think
> >many of us are using Git in our regular day to day workflows.
> >
> >Thoughts?
> >
> >Cheers,
> >Chris
> >
> >++
> >Chris Mattmann, Ph.D.
> >Chief Architect
> >Instrument Software and Science Data Systems Section (398)
> >NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >Office: 168-519, Mailstop: 168-527
> >Email: chris.a.mattm...@nasa.gov
> >WWW:  http://sunset.usc.edu/~mattmann/
> >++
> >Adjunct Associate Professor, Computer Science Department
> >University of Southern California, Los Angeles, CA 90089 USA
> >++
> >
> >
> >
>
>
>


-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble 


Re: [VOTE] Release Apache Nutch 1.12

2016-06-15 Thread Julien Nioche
+1

Thanks Lewis and team!

On 15 June 2016 at 06:14, lewis john mcgibbney  wrote:

> Hi Folks,
>
> A first candidate for the Nutch 1.12 release is available at:
>
> https://dist.apache.org/repos/dist/dev/nutch/1.12/
>
> The release candidate is a zip and tar archive of the sources tag available
> at:
>
> https://git-wip-us.apache.org/repos/asf?p=nutch.git;a=tag;h=2d6f6de656c60c0b04890c5d3db20805ca39cfd5
>
> In addition, a staged maven repository is available here:
>
> https://repository.apache.org/content/repositories/orgapachenutch-1012/
>
> Please vote on releasing this package as Apache Nutch 1.12.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Nutch PMC votes are cast.
>
> [ ] +1 Release this package as Apache Nutch 1.12.
> [ ] -1 Do not release this package because…
>
> Cheers,
> Lewis
>
> P.S. Here is my +1.
>



-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble 


ApacheCon EU Sevilla

2016-06-29 Thread Julien Nioche
Hi,

Sorry for cross posting. As you are probably aware, the ApacheCon Europe,
and Apache Big Data conferences will take place in Seville, Spain, November
14-18, 2016.

http://events.linuxfoundation.org/events/apache-big-data-europe/

I just submitted a talk on StormCrawler (which will touch on Apache Nutch
as well) and I know that at least 1 other fellow Nutch committer will be
there.

Is anyone else planning on going? It would be interesting not only to catch
up within each respective project but also meet people from other crawl
related projects.

Best regards

Julien

-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble 


Crawler-Commons 0.7 released

2016-11-24 Thread Julien Nioche
Apologies for cross-posting

The Crawler-Commons project is pleased to announce its 0.7 release.

https://github.com/crawler-commons/crawler-commons#24th-november-2016crawler-commons-07-released

The list of changes can be found here, the main one being that the project
requires Java 8.

If you are wondering what Crawler-Commons is about :

*Crawler-Commons is a set of reusable Java components that implement
functionality common to any web crawler. These components benefit from
collaboration among various existing web crawler projects and reduce
duplication of effort. *

Thanks

Julien (on behalf of the Crawler-Commons committers)

-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble 


Re: [VOTE] Release Apache Nutch 1.13 RC#1

2017-03-29 Thread Julien Nioche
Hi Lewis

+1 compiled from source and ran a small crawl in local mode. All good!

Thanks

Julien

On 29 March 2017 at 05:20, lewis john mcgibbney  wrote:

> Hi Folks,
>
> A first candidate for the Nutch 1.13 release is available at:
>
>   https://dist.apache.org/repos/dist/dev/nutch/1.13/
>
> The release candidate is a zip and tar.gz archive of the binary and sources
> in:
> https://github.com/apache/nutch/tree/release-1.13
>
> The SHA1 checksum of the archive is
> bd0da3569aa14105799ed39204d4f0a31c77b42c
>
> In addition, a staged maven repository is available here:
>
> https://repository.apache.org/content/repositories/orgapachenutch-1013
>
> We addressed 29 Issues - https://s.apache.org/wq3x
>
> Please vote on releasing this package as Apache Nutch 1.13.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Nutch PMC votes are cast.
>
> [ ] +1 Release this package as Apache Nutch 1.13.
> [ ] -1 Do not release this package because…
>
> Cheers,
> Lewis
> (On behalf of the Nutch PMC)
>
> P.S. Here is my +1.
>
> --
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney
>



-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble 


Crawler-Commons 0.8 released

2017-06-09 Thread Julien Nioche
Apologies for cross-posting

The Crawler-Commons project is pleased to announce its 0.8 release.

https://github.com/crawler-commons/crawler-commons/releases/tag/crawler-commons-0.8

If you are wondering what Crawler-Commons is about :

*Crawler-Commons is a set of reusable Java components that implement
functionality common to any web crawler. These components benefit from
collaboration among various existing web crawler projects and reduce
duplication of effort. *

The artefacts are available from Maven central, simply add the following to
your project's POM file.

<dependency>
    <groupId>com.github.crawler-commons</groupId>
    <artifactId>crawler-commons</artifactId>
    <version>0.8</version>
</dependency>


Thanks to all contributors and users and happy crawling!


Julien (on behalf of the Crawler-Commons committers)

-- 

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble 


Re: Establishment of Static Source Code Analysis

2017-06-16 Thread Julien Nioche
>
>  Russian compatriots


Are we all Russian then?

On 16 June 2017 at 04:29, lewis john mcgibbney  wrote:

> Hi Folks,
> I don't know if anyone else noticed... some of our Russian compatriots
> have set up a static auto bot to notify us of source code issues...
> An example is as follows
> https://issues.apache.org/jira/browse/NUTCH-2394
> I think this is great to be honest... with some peer review I think we
> could take this seriously.
> Out of curiosity is anyone responsible for this?
> Lewis
>
> --
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney
>



-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble 


Re: Establishment of Static Source Code Analysis

2017-06-16 Thread Julien Nioche
More seriously, no idea who's done it but it is useful feedback. A similar
company (DevFactory)  contributed to StormCrawler
<https://github.com/DigitalPebble/storm-crawler/commits?author=AymanDF> some
time ago. Also reminds me of the discussion we had around Sonar in
crawler-commons
<https://github.com/crawler-commons/crawler-commons/pull/127>.

On 16 June 2017 at 08:55, Julien Nioche <lists.digitalpeb...@gmail.com>
wrote:

>  Russian compatriots
>
>
> Are we all Russian then?
>
> On 16 June 2017 at 04:29, lewis john mcgibbney <lewi...@apache.org> wrote:
>
>> Hi Folks,
>> I don't know if anyone else noticed... some of our Russian compatriots
>> have set up a static auto bot to notify us of source code issues...
>> An example is as follows
>> https://issues.apache.org/jira/browse/NUTCH-2394
>> I think this is great to be honest... with some peer review I think we
>> could take this seriously.
>> Out of curiosity is anyone responsible for this?
>> Lewis
>>
>> --
>> http://home.apache.org/~lewismc/
>> @hectorMcSpector
>> http://www.linkedin.com/in/lmcgibbney
>>
>
>
>
> --
>
> *Open Source Solutions for Text Engineering*
>
> http://www.digitalpebble.com
> http://digitalpebble.blogspot.com/
> #digitalpebble <http://twitter.com/digitalpebble>
>



-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble <http://twitter.com/digitalpebble>


Re: [DISCUSS] Release 1.14?

2017-12-14 Thread Julien Nioche
FYI Tika 1.17 has just been released
http://www.apache.org/dist/tika/CHANGES-1.17.txt

On 12 December 2017 at 12:36, Sebastian Nagel <wastl.na...@googlemail.com>
wrote:

> Hi Julien,
>
> yes, I know there's an open issue by Markus which depends on Tika 1.17.
> If the Tika release happens this week, I'll make sure that it's included.
>
> Thanks,
> Sebastian
>
>
> On 12/11/2017 10:22 AM, Julien Nioche wrote:
> > Tika 1.17 will be released shortly, maybe it would be worth waiting a
> bit and integrate it first?
> >
> > On 8 December 2017 at 22:53, Sebastian Nagel <wastl.na...@googlemail.com
> > <mailto:wastl.na...@googlemail.com>> wrote:
> >
> > Hi all,
> >
> > 50+ issues fixed
> >   https://issues.apache.org/jira/projects/NUTCH/versions/12340218
> > <https://issues.apache.org/jira/projects/NUTCH/versions/12340218>
> >
> > Of course, as always, there are still many open issues. But maybe it's
> > time to push a release now and try to integrate the next features and
> > fixes early next year. What do you think?
> >
> > The last release (1.13) dates 8 months back (April 2017).
> >
> > I would be ready to push a release candidate next week.
> >
> >
> > Sebastian
> >
> >
> >
> >
>
>




Re: [VOTE] Release Apache Nutch 1.14 RC#1

2017-12-19 Thread Julien Nioche
+1 to release, thanks Seb

On 18 December 2017 at 22:12, Sebastian Nagel 
wrote:

> Hi Folks,
>
> A first candidate for the Nutch 1.14 release is available at:
>
>   https://dist.apache.org/repos/dist/dev/nutch/1.14/
>
> The release candidate is a zip and tar.gz archive of the binary and
> sources in:
>   https://github.com/apache/nutch/tree/release-1.14
> The SHA1 checksum of the release commit is
>   a8e60bdfb79b368612f068ed5aeeb690e29b448d
>
> In addition, a staged maven repository is available here:
>   https://repository.apache.org/content/repositories/orgapachenutch-1014/
>
> We addressed 79 Issues:
>   https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=10680&version=12340218
>
> Please vote on releasing this package as Apache Nutch 1.14.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Nutch PMC votes are cast.
>
> [ ] +1 Release this package as Apache Nutch 1.14.
> [ ] -1 Do not release this package because…
>
> Cheers,
> Sebastian
> (On behalf of the Nutch PMC)
>
> P.S. Here is my +1.
>





Re: [DISCUSS] Release 1.14?

2017-12-11 Thread Julien Nioche
Tika 1.17 will be released shortly, maybe it would be worth waiting a bit
and integrate it first?

On 8 December 2017 at 22:53, Sebastian Nagel 
wrote:

> Hi all,
>
> 50+ issues fixed
>   https://issues.apache.org/jira/projects/NUTCH/versions/12340218
>
> Of course, as always, there are still many open issues. But maybe it's
> time to push a release now and try to integrate the next features and
> fixes early next year. What do you think?
>
> The last release (1.13) dates 8 months back (April 2017).
>
> I would be ready to push a release candidate next week.
>
>
> Sebastian
>





Crawler-Commons 0.9 released

2017-10-31 Thread Julien Nioche
Happy Halloween!

We are glad to announce the 0.9 release of Crawler-Commons. See the
CHANGES.txt file included with the release for a full list of details.
The main change is the removal of the DOM-based sitemap parser, as the SAX
equivalent introduced in the previous version has better performance and is
also more robust.

You might need to change your code to replace SiteMapParserSAX with
SiteMapParser. The parser is now aware of namespaces, and by default does
not force the namespace to be the one recommended in the specification (
http://www.sitemaps.org/schemas/sitemap/0.9) as variants can be found in
the wild. You can set the behaviour using the method
*setStrictNamespace(boolean)*.

As usual, the version 0.9 contains numerous improvements and bugfixes and
all users are invited to upgrade to this version.
Thanks to all committers, contributors and users.

Julien


Crawler-Commons 0.10 released

2018-06-07 Thread Julien Nioche
Hi

We are glad to announce the 0.10 release of Crawler-Commons. See the
CHANGES.txt file included with the release for a full list of details.
This version contains, among other things, improvements to the Sitemap
parsing and the removal of the Tika dependency.

As usual, this latest version contains numerous improvements and bugfixes
and all users are invited to upgrade to it.

Thanks to all committers, contributors and users.

Julien



Re: [ANNOUNCE] New Nutch committer and PMC - Tim Allison

2023-07-20 Thread Julien Nioche
What a fantastic addition to the Nutch team! Congrats to Tim

On Thu, 20 Jul 2023 at 10:20, Sebastian Nagel  wrote:

> Dear all,
>
> It is my pleasure to announce that Tim Allison has joined us
> as a committer and member of the Nutch PMC.
>
> You may already know Tim as a maintainer of and contributor to
> Apache Tika. So, it was great to see contributions to the
> Nutch source code from an experienced developer who is also
> active in a related Apache project. Among other contributions
> Tim recently implemented the indexer-opensearch plugin.
>
> Thank you, Tim Allison, and congratulations on your new role
> in the Apache Nutch community! And welcome on board!
>
> Sebastian
> (on behalf of the Nutch PMC)
>




[jira] Commented: (NUTCH-826) Mailing list is broken.

2010-05-24 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870528#action_12870528
 ] 

Julien Nioche commented on NUTCH-826:
-

Nutch has recently become a TLP and some of the info on the website needs 
updating.

To subscribe to the list, send a message to:
  user-subscr...@nutch.apache.org

To remove your address from the list, send a message to:
  user-unsubscr...@nutch.apache.org

Send mail to the following for info and FAQ for this list:
  user-i...@nutch.apache.org
  user-...@nutch.apache.org

PS : this is hardly a blocker 

 Mailing list is broken.
 ---

 Key: NUTCH-826
 URL: https://issues.apache.org/jira/browse/NUTCH-826
 Project: Nutch
  Issue Type: Bug
Reporter: John Sherwood
Priority: Blocker

 All of the following addresses are failing:
 nutch-u...@nutch.apache.org
 nutch-user-subscr...@nutch.apache.org
 nutch-user-subscr...@lucene.apache.org
 For the last one, the mailer daemon said 
 This mailing list has moved to user at nutch.apache.org.
 Below is the message I tried to send:
 Hi people,
 I've been banging my head against this problem for two days now.
 Simply, I want to add a field with the value of a given meta tag.
 I've been trying the parse-xml plugin, but it seems that it doesn't
 work with version 1.0.  I've tried the code at
 http://sujitpal.blogspot.com/2009/07/nutch-getting-my-feet-wet.html
 and it hasn't worked.  I don't even know why.  I don't even know if my
 plugin is being used... or even looked for!  Nutch seems to have an
 infuriating "fail silently" policy for plugins.  I put a
 System.exit(1) in my filters just to see if my code is even being
 encountered.  It has not, in spite of my config telling it to.
 Here's my config:
 nutch-site.xml
 ...
 <property>
   <name>plugin.includes</name>
   <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|metadata</value>
 </property>
 ...
 parse-plugins.xml
 ...
 <mimeType name="application/xhtml+xml">
   <plugin id="parse-html" />
   <plugin id="metadata" />
 </mimeType>
 <mimeType name="text/html">
   <plugin id="parse-html" />
   <plugin id="metadata" />
 </mimeType>
 <mimeType name="text/sgml">
   <plugin id="parse-html" />
   <plugin id="metadata" />
 </mimeType>
 <mimeType name="text/xml">
   <plugin id="parse-html" />
   <plugin id="parse-rss" />
   <plugin id="metadata" />
   <plugin id="feed" />
 </mimeType>
 ...
 <alias name="metadata"
   extension-id="com.example.website.nutch.parsing.MetaTagExtractorParseFilter" />
 ...
 I've also copied the plugin.xml and jar from my build/metadata to the
 plugins root dir.
 Nonetheless, Nutch runs and puts data in solr for me.  Afaik, Nutch is
 completely unaware of my plugin despite my config options.  Is there
 some other place I need to tell Nutch to use my plugin?  Is there some
 other approach to do this without having to write a plugin?  This does
 seem like a lot of work to simply get a meta tag into a field.  Any
 help would be appreciated.
 Sincerely,
 John Sherwood

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-826) Mailing list is broken.

2010-05-24 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-826.
-

Fix Version/s: 1.1
   Resolution: Fixed

Committed revision 947569.

The changes should be visible on the website within a few hours

 Mailing list is broken.
 ---

 Key: NUTCH-826
 URL: https://issues.apache.org/jira/browse/NUTCH-826
 Project: Nutch
  Issue Type: Bug
Reporter: John Sherwood
Assignee: Julien Nioche
Priority: Blocker
 Fix For: 1.1


 All of the following addresses are failing:
 nutch-u...@nutch.apache.org
 nutch-user-subscr...@nutch.apache.org
 nutch-user-subscr...@lucene.apache.org
 For the last one, the mailer daemon said 
 This mailing list has moved to user at nutch.apache.org.
 Below is the message I tried to send:
 Hi people,
 I've been banging my head against this problem for two days now.
 Simply, I want to add a field with the value of a given meta tag.
 I've been trying the parse-xml plugin, but it seems that it doesn't
 work with version 1.0.  I've tried the code at
 http://sujitpal.blogspot.com/2009/07/nutch-getting-my-feet-wet.html
 and it hasn't worked.  I don't even know why.  I don't even know if my
 plugin is being used... or even looked for!  Nutch seems to have an
 infuriating "fail silently" policy for plugins.  I put a
 System.exit(1) in my filters just to see if my code is even being
 encountered.  It has not, in spite of my config telling it to.
 Here's my config:
 nutch-site.xml
 ...
 <property>
   <name>plugin.includes</name>
   <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|metadata</value>
 </property>
 ...
 parse-plugins.xml
 ...
 <mimeType name="application/xhtml+xml">
   <plugin id="parse-html" />
   <plugin id="metadata" />
 </mimeType>
 <mimeType name="text/html">
   <plugin id="parse-html" />
   <plugin id="metadata" />
 </mimeType>
 <mimeType name="text/sgml">
   <plugin id="parse-html" />
   <plugin id="metadata" />
 </mimeType>
 <mimeType name="text/xml">
   <plugin id="parse-html" />
   <plugin id="parse-rss" />
   <plugin id="metadata" />
   <plugin id="feed" />
 </mimeType>
 ...
 <alias name="metadata"
   extension-id="com.example.website.nutch.parsing.MetaTagExtractorParseFilter" />
 ...
 I've also copied the plugin.xml and jar from my build/metadata to the
 plugins root dir.
 Nonetheless, Nutch runs and puts data in solr for me.  Afaik, Nutch is
 completely unaware of my plugin despite my config options.  Is there
 some other place I need to tell Nutch to use my plugin?  Is there some
 other approach to do this without having to write a plugin?  This does
 seem like a lot of work to simply get a meta tag into a field.  Any
 help would be appreciated.
 Sincerely,
 John Sherwood




[jira] Commented: (NUTCH-828) Fetch Filter

2010-06-08 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876576#action_12876576
 ] 

Julien Nioche commented on NUTCH-828:
-

Shall we postpone this until after the release of 1.1? This is new functionality 
and at this stage we probably just want to iron out bugs in what we currently 
have. Does that make sense? 

 Fetch Filter
 

 Key: NUTCH-828
 URL: https://issues.apache.org/jira/browse/NUTCH-828
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.1

 Attachments: NUTCH-828-1-20100608.patch


 Adds a Nutch extension point for a fetch filter.  The fetch filter allows 
 filtering content and parse data/text after it is fetched but before it is 
 written to segments.  The filter can return true if content is to be written 
 or false if it is not.  
 Some use cases for this filter would be topical search engines that only want 
 to fetch/index certain types of content, for example a news or sports only 
 search engine.  In these types of situations the only way to determine if 
 content belongs to a particular set is to fetch the page and then analyze the 
 content.  If the content passes, meaning belongs to the set of say sports 
 pages, then we want to include it.  If it doesn't then we want to ignore it, 
 never fetch that same page in the future, and ignore any urls on that page.  
 If content is rejected due to a fetch filter then its status is written to 
 the CrawlDb as gone and its content is ignored and not written to segments.  
 This effectively stops crawling along the crawl path of that page and the urls 
 from that page.  An example filter, fetch-safe, is provided that allows 
 fetching content that does not contain a list of bad words.
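
 The accept/reject contract described above can be sketched in plain Java.
 This is only an illustration under stated assumptions: the class name
 BadWordFetchFilterDemo and its accept method are hypothetical and are not the
 extension point from the attached patch. The sketch takes already-parsed text
 and returns true when the content would be written to segments, false when it
 would be rejected.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for a "fetch-safe" style filter; not the Nutch API.
class BadWordFetchFilterDemo {
    private final Set<String> badWords = new HashSet<>();

    BadWordFetchFilterDemo(String... words) {
        for (String w : words) {
            badWords.add(w.toLowerCase(Locale.ROOT));
        }
    }

    // Returns true if the text may be written, false if it must be rejected.
    boolean accept(String parseText) {
        // Split on runs of non-word characters (ASCII word characters only).
        for (String token : parseText.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (badWords.contains(token)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BadWordFetchFilterDemo filter = new BadWordFetchFilterDemo("casino", "poker");
        System.out.println(filter.accept("Latest football scores"));
        System.out.println(filter.accept("Online CASINO bonus offers"));
    }
}
```

 In the scheme described in the issue, a false result would additionally mark
 the page as gone in the CrawlDb; that bookkeeping is left out of the sketch.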




[jira] Updated: (NUTCH-830) ScoringFilter to restrict the crawl to the hosts/domains listed in the seeds

2010-06-23 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-830:


Attachment: NUTCH-830.patch

 ScoringFilter to restrict the crawl to the hosts/domains listed in the seeds
 

 Key: NUTCH-830
 URL: https://issues.apache.org/jira/browse/NUTCH-830
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.0

 Attachments: NUTCH-830.patch


 The DomainURLFilter allows specifying the domains to consider for a crawl. 
 This works fine but requires editing a list of domains / hosts manually. The 
 patch presented here offers the same functionality but uses a different 
 mechanism: a custom scoring filter that filters the outlinks. 
 1. Add a metadata entry to your seed list, e.g. '_origin_', with the seed 
 URL as its value
 e.g. http://www.cnn.com/ _origin_=http://www.cnn.com/
 2. The custom scoring filter then takes care of:
 * transmitting the origin metadata to its outlinks
 * removing from the outlinks the ones which do not have the same host / 
 domain as the origin
 The parameter _scoring.insite.mode_ specifies whether to restrict on 
 the host or the domain. The parameter _scoring.insite.addOriginOnInject_ 
 allows the metadata to be added during the injection step, reusing the seed 
 URL automatically.
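
 The outlink restriction in step 2 can be sketched with the JDK alone. This is
 a minimal illustration, not the code from NUTCH-830.patch: the class
 InsiteOutlinkFilterDemo and its methods are hypothetical names, and only the
 host mode is shown (comparing domains would need a public-suffix lookup,
 which is out of scope here).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; it only mirrors the idea of the scoring filter.
class InsiteOutlinkFilterDemo {

    // Returns the host part of a URL, or null if it cannot be parsed.
    static String hostOf(String url) {
        try {
            return URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    // Keeps only the outlinks whose host matches the host of the origin URL.
    static List<String> sameHostOutlinks(String origin, List<String> outlinks) {
        List<String> kept = new ArrayList<>();
        String originHost = hostOf(origin);
        if (originHost == null) {
            return kept;
        }
        for (String link : outlinks) {
            // equalsIgnoreCase(null) is false, so unparsable links are dropped.
            if (originHost.equalsIgnoreCase(hostOf(link))) {
                kept.add(link);
            }
        }
        return kept;
    }
}
```

 Calling sameHostOutlinks with the '_origin_' value of a page and its outlinks
 yields the subset that stays on the seed's host; the real filter would also
 copy the origin metadata onto each kept outlink.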




[jira] Closed: (NUTCH-834) Separate the Nutch web site from trunk

2010-06-30 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-834.
---

Resolution: Fixed

Committed revision 959228.

Thanks Chris for your comments and help with this

 Separate the Nutch web site from trunk
 --

 Key: NUTCH-834
 URL: https://issues.apache.org/jira/browse/NUTCH-834
 Project: Nutch
  Issue Type: Task
  Components: documentation
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.0


 As discussed on dev@, it would be useful to move the Nutch web site 
 sources from .../asf/nutch/trunk to .../asf/nutch/site and to use the 
 svnpubsub mechanism for instant deployment of site changes.
 The related issue for infra is 
 https://issues.apache.org/jira/browse/INFRA-2822
 See also https://issues.apache.org/jira/browse/PDFBOX-623




[jira] Commented: (NUTCH-650) Hbase Integration

2010-06-30 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883880#action_12883880
 ] 

Julien Nioche commented on NUTCH-650:
-

The patch has been committed with revision # 959259. The content of 
https://svn.apache.org/repos/asf/nutch/branches/nutchbase is now the same as 
github.

 Hbase Integration
 -

 Key: NUTCH-650
 URL: https://issues.apache.org/jira/browse/NUTCH-650
 Project: Nutch
  Issue Type: New Feature
Reporter: Doğacan Güney
Assignee: Doğacan Güney
 Fix For: 2.0

 Attachments: hbase-integration_v1.patch, hbase_v2.patch, 
 latest-nutchbase-vs-original-branch-point.patch, 
 latest-nutchbase-vs-svn-nutchbase.patch, malformedurl.patch, meta.patch, 
 meta2.patch, nb-design.txt, nb-installusage.txt, nofollow-hbase.patch, 
 NUTCH-650.patch, nutch-habase.patch, searching.diff, slash.patch


 This issue will track nutch/hbase integration




[jira] Updated: (NUTCH-836) Remove deprecated parse plugins

2010-06-30 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-836:


Attachment: NUTCH-836.patch

 Remove deprecated parse plugins
 ---

 Key: NUTCH-836
 URL: https://issues.apache.org/jira/browse/NUTCH-836
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.0

 Attachments: NUTCH-836.patch


 Some of the parser plugins in 1.1 are covered by the parse-tika plugin. These 
 plugins have been kept in 1.1 but should be removed from 2.0 where we'll rely 
 on parse-tika almost exclusively. Some existing plugins might be kept when 
 there is no equivalent in Tika (to be discussed). The following plugins are 
 removed : 
 * parse-html
 * parse-msexcel
 * parse-mspowerpoint
 * parse-msword
 * parse-pdf
 * parse-oo
 * parse-text
 * lib-jakarta-poi
 * lib-parsems
 The patch does not (yet) remove :
 * parse-js
 * parse-rss
 * parse-swf
 * parse-zip
 * feed
 Please review the patch and vote for its inclusion in the trunk.




[jira] Created: (NUTCH-836) Remove deprecated parse plugins

2010-06-30 Thread Julien Nioche (JIRA)
Remove deprecated parse plugins
---

 Key: NUTCH-836
 URL: https://issues.apache.org/jira/browse/NUTCH-836
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.0
 Attachments: NUTCH-836.patch

Some of the parser plugins in 1.1 are covered by the parse-tika plugin. These 
plugins have been kept in 1.1 but should be removed from 2.0 where we'll rely 
on parse-tika almost exclusively. Some existing plugins might be kept when 
there is no equivalent in Tika (to be discussed). The following plugins are 
removed : 
* parse-html
* parse-msexcel
* parse-mspowerpoint
* parse-msword
* parse-pdf
* parse-oo
* parse-text
* lib-jakarta-poi
* lib-parsems

The patch does not (yet) remove :

* parse-js
* parse-rss
* parse-swf
* parse-zip
* feed

Please review the patch and vote for its inclusion in the trunk.







[jira] Commented: (NUTCH-836) Remove deprecated parse plugins

2010-06-30 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883891#action_12883891
 ] 

Julien Nioche commented on NUTCH-836:
-

Actually creative-commons + languageidentifier currently have a dependency on 
parse-html and parse-zip has one on parse-text in their build script.
The tests for the Fetcher and ParserFactory also fail without parse-html and 
parse-text. 

I will modify the patch to prevent these issues

 Remove deprecated parse plugins
 ---

 Key: NUTCH-836
 URL: https://issues.apache.org/jira/browse/NUTCH-836
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.0

 Attachments: NUTCH-836.patch


 Some of the parser plugins in 1.1 are covered by the parse-tika plugin. These 
 plugins have been kept in 1.1 but should be removed from 2.0 where we'll rely 
 on parse-tika almost exclusively. Some existing plugins might be kept when 
 there is no equivalent in Tika (to be discussed). The following plugins are 
 removed : 
 * parse-html
 * parse-msexcel
 * parse-mspowerpoint
 * parse-msword
 * parse-pdf
 * parse-oo
 * parse-text
 * lib-jakarta-poi
 * lib-parsems
 The patch does not (yet) remove :
 * parse-js
 * parse-rss
 * parse-swf
 * parse-zip
 * feed
 Please review the patch and vote for its inclusion in the trunk.




[jira] Updated: (NUTCH-836) Remove deprecated parse plugins

2010-06-30 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-836:


Description: 
Some of the parser plugins in 1.1 are covered by the parse-tika plugin. These 
plugins have been kept in 1.1 but should be removed from 2.0 where we'll rely 
on parse-tika almost exclusively. Some existing plugins might be kept when 
there is no equivalent in Tika (to be discussed). The following plugins are 
removed : 
* parse-html
* parse-msexcel
* parse-mspowerpoint
* parse-msword
* parse-pdf
* parse-oo
* parse-text
* lib-jakarta-poi
* lib-parsems

The patch does not (yet) remove :
* parse-ext
* parse-js
* parse-rss
* parse-swf
* parse-zip
* feed

Please review the patch and vote for its inclusion in the trunk.




  was:
Some of the parser plugins in 1.1 are covered by the parse-tika plugin. These 
plugins have been kept in 1.1 but should be removed from 2.0 where we'll rely 
on parse-tika almost exclusively. Some existing plugins might be kept when 
there is no equivalent in Tika (to be discussed). The following plugins are 
removed : 
* parse-html
* parse-msexcel
* parse-mspowerpoint
* parse-msword
* parse-pdf
* parse-oo
* parse-text
* lib-jakarta-poi
* lib-parsems

The patch does not (yet) remove :

* parse-js
* parse-rss
* parse-swf
* parse-zip
* feed

Please review the patch and vote for its inclusion in the trunk.





 Remove deprecated parse plugins
 ---

 Key: NUTCH-836
 URL: https://issues.apache.org/jira/browse/NUTCH-836
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.0

 Attachments: NUTCH-836.patch


 Some of the parser plugins in 1.1 are covered by the parse-tika plugin. These 
 plugins have been kept in 1.1 but should be removed from 2.0 where we'll rely 
 on parse-tika almost exclusively. Some existing plugins might be kept when 
 there is no equivalent in Tika (to be discussed). The following plugins are 
 removed : 
 * parse-html
 * parse-msexcel
 * parse-mspowerpoint
 * parse-msword
 * parse-pdf
 * parse-oo
 * parse-text
 * lib-jakarta-poi
 * lib-parsems
 The patch does not (yet) remove :
 * parse-ext
 * parse-js
 * parse-rss
 * parse-swf
 * parse-zip
 * feed
 Please review the patch and vote for its inclusion in the trunk.




[jira] Created: (NUTCH-837) Remove search servers and Lucene dependencies

2010-06-30 Thread Julien Nioche (JIRA)
Remove search servers and Lucene dependencies 
--

 Key: NUTCH-837
 URL: https://issues.apache.org/jira/browse/NUTCH-837
 Project: Nutch
  Issue Type: Task
  Components: searcher, web gui
Affects Versions: 1.1
Reporter: Julien Nioche
 Fix For: 2.0


One of the main aspects of 2.0 is the delegation of the indexing and search to 
external resources like SOLR. We can simplify the code a lot by getting rid of 
the : 
* search servers
* indexing and analysis with Lucene
* search side functionalities : ontologies / clustering etc...
In the short term only SOLR / SOLRCloud will be supported but the plan would be 
to add other systems as well. 




[jira] Updated: (NUTCH-836) Remove deprecated parse plugins

2010-06-30 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-836:


Attachment: (was: NUTCH-836.patch)

 Remove deprecated parse plugins
 ---

 Key: NUTCH-836
 URL: https://issues.apache.org/jira/browse/NUTCH-836
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.0

 Attachments: NUTCH-836-2.patch


 Some of the parser plugins in 1.1 are covered by the parse-tika plugin. These 
 plugins have been kept in 1.1 but should be removed from 2.0 where we'll rely 
 on parse-tika almost exclusively. Some existing plugins might be kept when 
 there is no equivalent in Tika (to be discussed). The following plugins are 
 removed : 
 * parse-html
 * parse-msexcel
 * parse-mspowerpoint
 * parse-msword
 * parse-pdf
 * parse-oo
 * parse-text
 * lib-jakarta-poi
 * lib-parsems
 The patch does not (yet) remove :
 * parse-ext
 * parse-js
 * parse-rss
 * parse-swf
 * parse-zip
 * feed
 Please review the patch and vote for its inclusion in the trunk.




[jira] Commented: (NUTCH-835) document deduplication (exact duplicates) failed using MD5Signature

2010-07-02 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884624#action_12884624
 ] 

Julien Nioche commented on NUTCH-835:
-

This patch has been marked for 1.2 but has been committed to trunk only (2.0). 
Shall we also apply it to /nutch/branches/branch-1.2 ?

 document deduplication (exact duplicates) failed using MD5Signature
 ---

 Key: NUTCH-835
 URL: https://issues.apache.org/jira/browse/NUTCH-835
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0, 1.1
 Environment: Linux, Ubuntu 10.04, Java 1.6.0_20
Reporter: Sebastian Nagel
Assignee: Andrzej Bialecki 
 Fix For: 1.2, 2.0


 The MD5Signature class calculates different signatures for identical 
 documents.
 The reason is that
   byte[] data = content.getContent();
   ... StringBuilder().append(data) ...
 uses java.lang.Object.toString() to get a string representation of the 
 (binary) content
 which results in unique hash codes (e.g., [B@30dc9065) even for two byte 
 arrays
 with identical content.
 A solution would be to take the MD5 sum of the binary content as first part 
 of the
 final signature calculation (the parsed content is the second part):
   ... 
 .append(StringUtil.toHexString(MD5Hash.digest(data).getDigest())).append(parse.getText());
 Of course, there are many other solutions...
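
 The behaviour described above is easy to reproduce outside Nutch. The sketch
 below is an illustration only: the class ByteArraySignatureDemo and its
 methods are hypothetical, and java.security.MessageDigest stands in for the
 MD5Hash and StringUtil helpers named in the report. It contrasts appending a
 byte[] to a StringBuilder (which falls back to Object.toString()) with
 hashing the bytes themselves.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical demo class; not part of Nutch.
class ByteArraySignatureDemo {

    // The problematic pattern: append(Object) uses Object.toString(), which
    // for a byte[] gives "[B@<identity hash>" rather than the contents.
    static String identityBased(byte[] data) {
        return new StringBuilder().append(data).toString();
    }

    // Content-based alternative: hex-encoded MD5 digest of the bytes.
    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3};
        byte[] b = {1, 2, 3};
        // Depends on object identity, so it normally differs between a and b.
        System.out.println(identityBased(a) + " vs " + identityBased(b));
        // Depends only on the bytes, so it is the same for a and b.
        System.out.println(md5Hex(a) + " vs " + md5Hex(b));
    }
}
```

 Any signature built from md5Hex is stable across identical documents, which
 is the property the deduplication step relies on.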




[jira] Created: (NUTCH-840) Port tests from parse-html to parse-tika

2010-07-02 Thread Julien Nioche (JIRA)
Port tests from parse-html to parse-tika


 Key: NUTCH-840
 URL: https://issues.apache.org/jira/browse/NUTCH-840
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.0


We don't have tests for HTML in parse-tika so I'll copy them from the old 
parse-html plugin




[jira] Updated: (NUTCH-840) Port tests from parse-html to parse-tika

2010-07-02 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-840:


Attachment: NUTCH-840.patch

Patch which adds the HTML tests to the Tika Parser

The tests currently rely on some DOM related code from Neko-HTML which 
introduces a dependency to the plugin lib-nekohtml.
Apart from parse-tika lib-nekohtml is used only in clustering-carrot which will 
be removed shortly. Once this is done we can delete lib-nekohtml as well then 
either : 
a) add the neko jar to the parse-tika lib via IVY
b) replace it with another implementation already available from the tika 
dependencies or the main Nutch dependencies (e.g. dom4j)





 Port tests from parse-html to parse-tika
 

 Key: NUTCH-840
 URL: https://issues.apache.org/jira/browse/NUTCH-840
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.0

 Attachments: NUTCH-840.patch


 We don't have tests for HTML in parse-tika so I'll copy them from the old 
 parse-html plugin




[jira] Commented: (NUTCH-837) Remove search servers and Lucene dependencies

2010-07-02 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884671#action_12884671
 ] 

Julien Nioche commented on NUTCH-837:
-

I think we can also get rid of  :

* docs/
* WAR related tasks in ANT
* src/web/
* src/xmlcatalog/
* src/engines/


 Remove search servers and Lucene dependencies 
 --

 Key: NUTCH-837
 URL: https://issues.apache.org/jira/browse/NUTCH-837
 Project: Nutch
  Issue Type: Task
  Components: searcher, web gui
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: NUTCH-837.patch


 One of the main aspects of 2.0 is the delegation of the indexing and search 
 to external resources like SOLR. We can simplify the code a lot by getting 
 rid of the : 
 * search servers
 * indexing and analysis with Lucene
 * search side functionalities : ontologies / clustering etc...
 In the short term only SOLR / SOLRCloud will be supported but the plan would 
 be to add other systems as well. 




[jira] Commented: (NUTCH-837) Remove search servers and Lucene dependencies

2010-07-02 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884734#action_12884734
 ] 

Julien Nioche commented on NUTCH-837:
-

:-)

 Remove search servers and Lucene dependencies 
 --

 Key: NUTCH-837
 URL: https://issues.apache.org/jira/browse/NUTCH-837
 Project: Nutch
  Issue Type: Task
  Components: searcher, web gui
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: NUTCH-837.patch


 One of the main aspects of 2.0 is the delegation of indexing and search 
 to external resources like SOLR. We can simplify the code a lot by getting 
 rid of: 
 * search servers
 * indexing and analysis with Lucene
 * search-side functionality: ontologies, clustering, etc.
 In the short term only SOLR / SOLRCloud will be supported, but the plan is 
 to add other systems as well. 




[jira] Updated: (NUTCH-821) Use ivy in nutch builds

2010-07-05 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-821:


Attachment: NUTCH-821.patch

Adds Ivy support for dependencies.

The lib/ directory is kept and will be used to store dependencies which are 
not accessible via Ivy (e.g. GORA). The libraries managed by Ivy are put in the 
build/lib directory. 

This patch also differentiates the _build_ path from the _dist_ path.



 Use ivy in nutch builds
 ---

 Key: NUTCH-821
 URL: https://issues.apache.org/jira/browse/NUTCH-821
 Project: Nutch
  Issue Type: New Feature
  Components: build
Affects Versions: 2.0
Reporter: Enis Soztutar
Assignee: Enis Soztutar
 Fix For: 2.0

 Attachments: NUTCH-821.patch, nutchbase-ivy_v1.patch


 Ivy is the de facto dependency management tool used in conjunction with Ant. 
 It would be nice if we switched to using Ivy in Nutch builds. 
 Maven is also an alternative, but I think Nutch will benefit more from an 
 Ant+Ivy architecture. 




[jira] Resolved: (NUTCH-791) External links for published javadocs are partially broken

2010-07-05 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-791.
-

Fix Version/s: 1.1
   Resolution: Duplicate

Duplicates NUTCH-790?

 External links for published javadocs are partially broken
 --

 Key: NUTCH-791
 URL: https://issues.apache.org/jira/browse/NUTCH-791
 Project: Nutch
  Issue Type: Bug
  Components: documentation
Reporter: Sami Siren
 Fix For: 1.1


 Lucene and Hadoop links point to non-existent URLs. For some versions of 
 the apidocs the links are just broken, and for some they do not exist at all. 
 Basically, what is required is that the javadocs are generated again with 
 proper URLs for external packages.




[jira] Commented: (NUTCH-821) Use ivy in nutch builds

2010-07-05 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885207#action_12885207
 ] 

Julien Nioche commented on NUTCH-821:
-

{QUOTE}
I think this patch refers to some parts that were already removed in NUTCH-837 
...
{QUOTE}

I applied NUTCH-837 first, but indeed this patch does remove references to parts 
deleted in NUTCH-837. Maybe I should have done that in a separate issue. 

{QUOTE}
Also, it would be nice to have a target that sets up an Eclipse project - after 
this patch is applied the lib/ is nearly empty and you need to run build at 
least once to bring dependencies - this may be confusing.
{QUOTE}

The jars are put in the build/lib directory, so this assumes that the project 
has been built in order to get the dependencies. I think there are resources in 
Eclipse for dealing with Ivy configurations. If anyone has any pointers, they 
will be most welcome.

 Use ivy in nutch builds
 ---

 Key: NUTCH-821
 URL: https://issues.apache.org/jira/browse/NUTCH-821
 Project: Nutch
  Issue Type: New Feature
  Components: build
Affects Versions: 2.0
Reporter: Enis Soztutar
Assignee: Enis Soztutar
 Fix For: 2.0

 Attachments: NUTCH-821.patch, nutchbase-ivy_v1.patch


 Ivy is the de facto dependency management tool used in conjunction with Ant. 
 It would be nice if we switched to using Ivy in Nutch builds. 
 Maven is also an alternative, but I think Nutch will benefit more from an 
 Ant+Ivy architecture. 




[jira] Commented: (NUTCH-821) Use ivy in nutch builds

2010-07-05 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885244#action_12885244
 ] 

Julien Nioche commented on NUTCH-821:
-

I found [http://ant.apache.org/ivy/ivyde/], which allows managing Ivy 
dependencies in Eclipse. 
I had to rewrite ivy/ivy.xml to make the version numbers explicit, as IvyDE was 
not able to load the properties in ivy/library.properties, but it worked fine 
after that. The beauty of it is that we don't rely on the content of build/lib 
at all.

 Use ivy in nutch builds
 ---

 Key: NUTCH-821
 URL: https://issues.apache.org/jira/browse/NUTCH-821
 Project: Nutch
  Issue Type: New Feature
  Components: build
Affects Versions: 2.0
Reporter: Enis Soztutar
Assignee: Enis Soztutar
 Fix For: 2.0

 Attachments: NUTCH-821.patch, nutchbase-ivy_v1.patch


 Ivy is the de facto dependency management tool used in conjunction with Ant. 
 It would be nice if we switched to using Ivy in Nutch builds. 
 Maven is also an alternative, but I think Nutch will benefit more from an 
 Ant+Ivy architecture. 




[jira] Commented: (NUTCH-696) Timeout for Parser

2010-07-05 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885260#action_12885260
 ] 

Julien Nioche commented on NUTCH-696:
-

+1: this is definitely useful. Hopefully the underlying parsers in Tika are 
constantly being improved to prevent loops and crashes, but having a parser 
timeout on top would be great.

 Timeout for Parser
 --

 Key: NUTCH-696
 URL: https://issues.apache.org/jira/browse/NUTCH-696
 Project: Nutch
  Issue Type: Wish
  Components: fetcher
Reporter: Julien Nioche
Priority: Minor
 Attachments: timeout.patch


 I found that parsing sometimes crashes due to a problem with a specific 
 document, which is a bit of a shame as this blocks the rest of the segment 
 and Hadoop ends up finding that the node does not respond. I was wondering 
 whether it would make sense to have a timeout mechanism for the parsing, 
 so that if a document is not parsed after a time t, it is simply treated as 
 an exception and we can get on with the rest of the process.
 Does that make sense? Where do you think we should implement that, in 
 ParseUtil?
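
A minimal sketch of the timeout idea described above, assuming the parse call
can be wrapped in a java.util.concurrent.Callable. The names ParseTimeoutSketch,
parseWithTimeout and Outcome are hypothetical; they are not taken from
timeout.patch or from Nutch's ParseUtil. The sketch only illustrates the general
technique of running the parse in a separate thread and giving up after a time t.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ParseTimeoutSketch {

    /** Hypothetical stand-in for a parse result; not a Nutch class. */
    public static final class Outcome {
        public final boolean success;
        public final String detail;

        Outcome(boolean success, String detail) {
            this.success = success;
            this.detail = detail;
        }
    }

    /**
     * Runs the given parse task in a separate thread and waits at most
     * timeoutMillis for it. A timeout is reported as a failed outcome, so the
     * caller can move on to the next document.
     */
    public static Outcome parseWithTimeout(Callable<String> parseTask, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<String> future = executor.submit(parseTask);
        try {
            return new Outcome(true, future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            // Ask the worker to stop; this only helps if the parser reacts to interrupts.
            future.cancel(true);
            return new Outcome(false, "timeout");
        } catch (ExecutionException e) {
            // The parser itself threw an exception.
            return new Outcome(false, "error: " + e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return new Outcome(false, "interrupted");
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        Outcome ok = parseWithTimeout(() -> "parsed text", 1000);
        Outcome slow = parseWithTimeout(() -> {
            Thread.sleep(5000); // simulates a document the parser gets stuck on
            return "never returned";
        }, 100);
        System.out.println(ok.success + " " + slow.detail);
    }
}
```

One caveat with this approach: Future.cancel(true) only interrupts the worker
thread, so a parser stuck in a tight loop that never checks for interruption
would keep running in the background even though the caller has moved on.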




[jira] Issue Comment Edited: (NUTCH-696) Timeout for Parser

2010-07-05 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885260#action_12885260
 ] 

Julien Nioche edited comment on NUTCH-696 at 7/5/10 11:13 AM:
--

+1: this is definitely useful. Hopefully the underlying parsers in Tika are 
constantly being improved to prevent loops and crashes, but having a parser 
timeout on top would be great.

I suggest we mark it for 2.0 and 1.2

  was (Author: jnioche):
+1 : this is definitely useful. Hopefully the underlying parsers in Tika 
are constantly improved to prevent loops and crashes but having the parser 
timeout on top would be great 
  
 Timeout for Parser
 --

 Key: NUTCH-696
 URL: https://issues.apache.org/jira/browse/NUTCH-696
 Project: Nutch
  Issue Type: Wish
  Components: fetcher
Reporter: Julien Nioche
Priority: Minor
 Attachments: timeout.patch


 I found that parsing sometimes crashes due to a problem with a specific 
 document, which is a bit of a shame as this blocks the rest of the segment 
 and Hadoop ends up finding that the node does not respond. I was wondering 
 whether it would make sense to have a timeout mechanism for the parsing, 
 so that if a document is not parsed after a time t, it is simply treated as 
 an exception and we can get on with the rest of the process.
 Does that make sense? Where do you think we should implement that, in 
 ParseUtil?



