Re: Future of Nutch 2.0 [Was: Unresolved dependencies org.apache.gora#gora-hbase;0.1: not found in Nutch trunk]

2011-09-17 Thread Mattmann, Chris A (388J)
Hey Julien,

My option E was pretty much equivalent to B except I specified a time frame
(next 6 months). Are we just saying that we'll accelerate the time frame to
say, umm, next week or the week after? :)

If so, fine by me. Since I moved nutchbase into the trunk at one point, I'd be
happy once we've VOTEd and decided to be the one to execute moving it out.

And yes, PMC votes will be binding and we'll do majority takes it, fine by me.
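
The procedure discussed in this thread (only PMC votes are binding, majority takes it, otherwise narrow to the top 2-3 options and vote again) could be sketched roughly as below. This is only an illustration: the voter names, the `tally` and `shortlist` helpers, and the reading of "majority" as more than half of the binding votes cast are assumptions, not anything the project has formally defined.

```python
from collections import Counter


def tally(votes, pmc_members):
    """Count binding votes and return (winner_or_None, counts).

    votes is a list of (voter, option) pairs. Only votes cast by PMC members
    are treated as binding (assumption: one vote per person).
    """
    binding = Counter(option for voter, option in votes if voter in pmc_members)
    total = sum(binding.values())
    for option, count in binding.most_common():
        # Assumed meaning of "majority": strictly more than half of binding votes.
        if count * 2 > total:
            return option, binding
    return None, binding


def shortlist(counts, keep=3):
    """Return the top `keep` options to put to a second round of voting."""
    return [option for option, _ in counts.most_common(keep)]


if __name__ == "__main__":
    # Hypothetical voters; the options a-e are the ones listed in the thread.
    pmc = {"alice", "bob", "carol"}
    votes = [("alice", "b"), ("bob", "b"), ("carol", "e"), ("dave", "e")]
    winner, counts = tally(votes, pmc)
    if winner is None:
        print("no majority, vote again on:", shortlist(counts))
    else:
        print("majority for option", winner)
```

Non-binding votes are simply ignored here; a real tally would probably still report them separately.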

Cheers,
Chris

On Sep 17, 2011, at 1:45 AM, Julien Nioche wrote:

 Let's keep it simple. Let's vote for option B (i.e. shelve 2.0), if most 
 people are in favour then we don't need to look into other options at all. If 
 not, we'll see what alternatives or arguments come up and vote on these later.
 
 I assume that only PMC votes will be binding and the majority takes it?
 
 Julien
 
 On 16 September 2011 22:30, Mattmann, Chris A (388J) 
 chris.a.mattm...@jpl.nasa.gov wrote:
 Why don't we just collect VOTEs for each of the options a-e, and then
 figure out based on that if there is a majority. If there's no majority, we
 can whittle it down to say the top 2-3, and then VOTE on those, looking
 for majority again.
 
 Cheers,
 Chris
 
 On Sep 16, 2011, at 11:44 AM, Markus Jelsma wrote:
 
  Option B) Shelve trunk in a branch and promote 1.4 to trunk. We can always
  choose to hardwire HBASE (option D) later.
 
  Markus
 
  Am happy to call for a vote on the future of Nutch 2.0 if you want. Shall
  we reduce the various options described before to a single one?
 
  Julien
 
  On 15 September 2011 19:55, Markus Jelsma 
  markus.jel...@openindex.io wrote:
  Hi Guys,
 
  I thought I'd chime in on this thread. My comments below:
  I understand and share your frustration, however you need to bear in mind
  that things are done only if people volunteer and have time - usually
  taken from their holiday, weekends, evenings. Chris (who is the de facto
  release master for Nutch and Gora) has not had the time and nobody
  else has volunteered to do it.
 
  Yep I haven't had the time to push a Gora 0.1.1-incubating release that
  will address the Maven issues. However it is on my roadmap for open source
  stuff to get done in the next month, so that's a good thing. But yes, that
  portion of my open source work is all volunteer time, so sometimes
  other things take priority.
 
  As it happens, yesterday was the 1 year anniversary of the last
  successful Hudson/Jenkins build...  If that actually worked, we
  could point people towards it as a useful recipe for how to get a
  build working off trunk.  I haven't been following Nutch too
  closely, but it always strikes me as really odd, that there's a
  nightly build and it doesn't bother anybody that it fails all the
  time (and that there isn't a nightly build for the stable
  branches).
 
  The real issue behind all this is what we should do with Nutch 2.0. What
  follows is only my opinion and I would love to hear what others have
  to say on this subject.
 
  Since we (actually mostly Dogacan) wrote 2.0 and delegated the storage to
  Gora, the latter hasn't really taken off since incubation. There have
  been some modest contributions to it but it does not seem to be used
  much and there is virtually nothing happening on it in terms of
  development. More worryingly, the people who initially contributed to it
  are not very active on the project (such is life, new jobs, different
  projects, etc...) anymore. As for Nutch 2.0, it hasn't made any
  progress in the last 12 months : we still have the same bugs, the tests
  do not work, the build has to be done manually etc...
 
  Yep.
 
  At the same time, there has been a new lease of life into Nutch as a
  whole : there is definitely more activity on the mailing lists, new
  users, new active committers etc... and quite a few bugfixes and
  improvements - most of them backported from what had been done in the
  trunk and people seem fairly happy with what we can do with 1.4
 
  Totally agreed. I'm actually not super surprised -- ever since 1.1, I kind
  of felt that maintaining a stable 1.X branch of Nutch (in parallel to
  the 2.0 efforts) was really going to pay off since there was renewed
  interest from users in leveraging (and furthermore accepting) the
  nuances of 1.X.
 
  So the question is : what shall we do with 2.0? Here are a few
  possibilities
 
  a) put some effort into it, fix the bugs and make so that it can be used
  instead of 1.x
  b) shelve it and leave it for enthusiasts to play with + make 1.x the
  trunk again
  c) do nothing : keep 2.0 and 1.x in parallel (but having to maintain two
  branches is quite a pain)
  d) abandon the idea of a neutral storage layer with Gora and hardwire it
  to e.g. HBase
 
  Option (a) has not happened in the last 12 months and I am not very
  hopeful about it.
 
  What do you guys think?
 
  I'd suggest an option e). Evolve and keep releasing 1.X 

Re: Future of Nutch 2.0 [Was: Unresolved dependencies org.apache.gora#gora-hbase;0.1: not found in Nutch trunk]

2011-09-17 Thread Markus Jelsma
Hi Chris,

I initially respawned this thread with the suggestion not to wait until 
January or so before the vote. Hence my apologies for being impatient and 
pessimistic about trunk :)

Cheers,

 Hey Julien,
 
 My option E was pretty much equivalent to B except I specified a time frame
 (next 6 months). Are we just saying that we'll accelerate the time frame
 to say, umm, next week or the week after? :)
 
 If so, fine by me. Since I moved nutchbase into the trunk at one point, I'd
 be happy once we've VOTEd and decided to be the one to execute moving it
 out.
 
 And yes, PMC votes will be binding and we'll do majority takes it, fine by
 me.
 
 Cheers,
 Chris
 
 On Sep 17, 2011, at 1:45 AM, Julien Nioche wrote:
  Let's keep it simple. Let's vote for option B (i.e. shelve 2.0), if most
  people are in favour then we don't need to look into other options at
  all. If not, we'll see what alternatives or arguments come up and vote
  on these later.
  
  I assume that only PMC votes will be binding and the majority takes it?
  
  Julien
  
  On 16 September 2011 22:30, Mattmann, Chris A (388J)
  chris.a.mattm...@jpl.nasa.gov wrote: Why don't we just collect VOTEs
  for each of the options a-e, and then figure out based on that if there
  is a majority. If there's no majority, we can whittle it down to say the
  top 2-3, and then VOTE on those, looking for majority again.
  
  Cheers,
  Chris
  
  On Sep 16, 2011, at 11:44 AM, Markus Jelsma wrote:
   Option B) Shelve trunk in a branch and promote 1.4 to trunk. We can
   always choose to hardwire HBASE (option D) later.
   
   Markus
   
   Am happy to call for a vote on the future of Nutch 2.0 if you want.
   Shall we reduce the various options described before to a single one?
   
   Julien
   
   On 15 September 2011 19:55, Markus Jelsma
   markus.jel...@openindex.io wrote:
   Hi Guys,
   
   I thought I'd chime in on this thread. My comments below:
   I understand and share your frustration, however you need to bear
   in mind that things are done only if people volunteer and have time -
   usually taken from their holiday, weekends, evenings. Chris (who
   is the de facto release master for Nutch and Gora) has not had the
   time and nobody else has volunteered to do it.
   
   Yep I haven't had the time to push a Gora 0.1.1-incubating release
   that will address the Maven issues. However it is on my roadmap for
   open source stuff to get done in the next month, so that's a good
   thing. But yes, that portion of my open source work is all volunteer
   time, so sometimes other things take priority.
   
   As it happens, yesterday was the 1 year anniversary of the last
   successful Hudson/Jenkins build...  If that actually worked, we
   could point people towards it as a useful recipe for how to get a
   build working off trunk.  I haven't been following Nutch too
   closely, but it always strikes me as really odd, that there's a
   nightly build and it doesn't bother anybody that it fails all the
   time (and that there isn't a nightly build for the stable
   branches).
   
   The real issue behind all this is what we should do with Nutch 2.0.
   What follows is only my opinion and I would love to hear what others
   have to say on this subject.
   
   Since we (actually mostly Dogacan) wrote 2.0 and delegated the
   storage to Gora, the latter hasn't really taken off since incubation.
   There have been some modest contributions to it but it does not seem
   to be used much and there is virtually nothing happening on it in
   terms of development. More worryingly, the people who initially
   contributed to it are not very active on the project (such is life,
   new jobs, different projects, etc...) anymore. As for Nutch 2.0, it
   hasn't made any progress in the last 12 months : we still have the
   same bugs, the tests do not work, the build has to be done manually
   etc...
   
   Yep.
   
   At the same time, there has been a new lease of life into Nutch as
   a whole : there is definitely more activity on the mailing lists,
   new users, new active committers etc... and quite a few bugfixes
   and improvements - most of them backported from what had been done
   in the trunk and people seem fairly happy with what we can do with
   1.4
   
   Totally agreed. I'm actually not super surprised -- ever since 1.1,
   I kind of felt that maintaining a stable 1.X branch of Nutch (in
   parallel to the 2.0 efforts) was really going to pay off since there
   was renewed interest from users in leveraging (and furthermore
   accepting) the nuances of 1.X.
   
   So the question is : what shall we do with 2.0? Here are a few
   possibilities
   
   a) put some effort into it, fix the bugs and make so that it can be
   used instead of 1.x
   b) shelve it and leave it for enthusiasts to play with + make 1.x
   the trunk again
   

Re: Future of Nutch 2.0 [Was: Unresolved dependencies org.apache.gora#gora-hbase;0.1: not found in Nutch trunk]

2011-09-17 Thread Mattmann, Chris A (388J)
Hey Markus,

No worries. I actually have no dog in this fight to be honest. 

I want Gora to be successful, and I want Nutch to be successful. 
I haven't contributed much to Nutch 2.0 trunk but I have been 
to the 1.x series branch. I wish I knew more about Gora's internals (and 
am trying to learn) so I could help more with it. I think it will make a lot 
of sense to use it at some point.

At the same time, I'm all for making 1.x releases and naturally getting to 
2.0 over time based on our current progress and understanding. I'm also 
super excited about the 1.x versions of Nutch and when I think about it
the reality is that they've always been Nutch trunk even though we 
artificially tried to turn the nutchbase branch into it. 

So to wrap it up, I'm totally fine with 1.x moving into trunk and with 
executing the plan I proposed a while back:

---snip
1. branch the current trunk as 
https://svn.apache.org/repos/asf/nutch/branches/nutchgora
2. grab latest stable branch (e.g., 
https://svn.apache.org/repos/asf/nutch/branches/branch-1.6) and 
*replace* the Nutch trunk with it, and bump the version # to 1.7-dev
3. active development on stable becomes active development in trunk and 
nutchgora still exists in case anyone ever resurrects it.
---snip

Of course, it's not 1.6 (I was optimistic about getting there in 6 months ;) ),
but it's really 1.4. And we don't need to bump to -dev since we're already in
full dev with the 1.4
cycle. 
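
For anyone wanting to see the branch/replace plan above spelled out, here is a rough dry-run sketch that only prints the Subversion commands and does not run them. The `plan_trunk_swap` helper, the `trunk` path, the `branch-1.4` branch name and the commit messages are assumptions that follow the URL pattern quoted in the plan; a delete followed by a copy is just one possible way to "replace" trunk.

```python
import shlex

# Repository root as it appears in the URLs quoted in the plan.
REPO_ROOT = "https://svn.apache.org/repos/asf/nutch"


def plan_trunk_swap(stable_branch, shelved_name="nutchgora", repo_root=REPO_ROOT):
    """Return the svn commands (as argument lists) for the three-step plan.

    Nothing is executed; the caller decides what to do with the commands.
    """
    trunk = f"{repo_root}/trunk"  # assumed standard trunk location
    shelved = f"{repo_root}/branches/{shelved_name}"
    stable = f"{repo_root}/branches/{stable_branch}"
    return [
        # 1. branch the current trunk
        ["svn", "copy", trunk, shelved, "-m", f"Shelve current trunk as {shelved_name}"],
        # 2. replace trunk with the latest stable branch
        ["svn", "delete", trunk, "-m", "Remove trunk before promoting the stable branch"],
        ["svn", "copy", stable, trunk, "-m", f"Promote {stable_branch} to trunk"],
    ]


if __name__ == "__main__":
    # "branch-1.4" is a guess at the branch name, following the branch-1.6 example.
    for command in plan_trunk_swap("branch-1.4"):
        print(shlex.join(command))
```

Printing rather than executing keeps this safe to try; the actual move would of course only happen once the VOTE has passed.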

So, I'm ready for a VOTE. Feel free to call one (or have Julien do it), and 
I'll VOTE +1.

Cheers,
Chris


On Sep 17, 2011, at 10:18 AM, Markus Jelsma wrote:

 Hi Chris,
 
 I initially respawned this thread with the suggestion not to wait until
 January or so before the vote. Hence my apologies for being impatient and
 pessimistic about trunk :)
 
 Cheers,
 
 Hey Julien,
 
 My option E was pretty much equivalent to B except I specified a time frame
 (next 6 months). Are we just saying that we'll accelerate the time frame
 to say, umm, next week or the week after? :)
 
 If so, fine by me. Since I moved nutchbase into the trunk at one point, I'd
 be happy once we've VOTEd and decided to be the one to execute moving it
 out.
 
 And yes, PMC votes will be binding and we'll do majority takes it, fine by
 me.
 
 Cheers,
 Chris
 
 On Sep 17, 2011, at 1:45 AM, Julien Nioche wrote:
 Let's keep it simple. Let's vote for option B (i.e. shelve 2.0), if most
 people are in favour then we don't need to look into other options at
 all. If not, we'll see what alternatives or arguments come up and vote
 on these later.
 
 I assume that only PMC votes will be binding and the majority takes it?
 
 Julien
 
 On 16 September 2011 22:30, Mattmann, Chris A (388J)
 chris.a.mattm...@jpl.nasa.gov wrote: Why don't we just collect VOTEs
 for each of the options a-e, and then figure out based on that if there
 is a majority. If there's no majority, we can whittle it down to say the
 top 2-3, and then VOTE on those, looking for majority again.
 
 Cheers,
 Chris
 
 On Sep 16, 2011, at 11:44 AM, Markus Jelsma wrote:
 Option B) Shelve trunk in a branch and promote 1.4 to trunk. We can
 always choose to hardwire HBASE (option D) later.
 
 Markus
 
 Am happy to call for a vote on the future of Nutch 2.0 if you want.
 Shall we reduce the various options described before to a single one?
 
 Julien
 
 On 15 September 2011 19:55, Markus Jelsma
 markus.jel...@openindex.io wrote:
 Hi Guys,
 
 I thought I'd chime in on this thread. My comments below:
 I understand and share your frustration, however you need to bear
 in mind that things are done only if people volunteer and have time -
 usually taken from their holiday, weekends, evenings. Chris (who
 is the de facto release master for Nutch and Gora) has not had the
 time and nobody else has volunteered to do it.
 
 Yep I haven't had the time to push a Gora 0.1.1-incubating release
 that will address the Maven issues. However it is on my roadmap for
 open source stuff to get done in the next month, so that's a good
 thing. But yes, that portion of my open source work is all volunteer
 time, so sometimes other things take priority.
 
 As it happens, yesterday was the 1 year anniversary of the last
 successful Hudson/Jenkins build...  If that actually worked, we
 could point people towards it as a useful recipe for how to get a
 build working off trunk.  I haven't been following Nutch too
 closely, but it always strikes me as really odd, that there's a
 nightly build and it doesn't bother anybody that it fails all the
 time (and that there isn't a nightly build for the stable
 branches).
 
 The real issue behind all this is what we should do with Nutch 2.0.
 What follows is only my opinion and I would love to hear what others
 have to say on this subject.
 
 Since we (actually mostly Dogacan) wrote 2.0 and delegated the storage to
 Gora, the latter hasn't really taken off since incubation. There
 have been some modest 

Re: Future of Nutch 2.0 [Was: Unresolved dependencies org.apache.gora#gora-hbase;0.1: not found in Nutch trunk]

2011-09-17 Thread lewis john mcgibbney
Glad to see we're making progress here.

Same with me, I am ready to move on with the project and move out of this
'rut' we have been in with trunk.

Thanks

On Sat, Sep 17, 2011 at 6:56 PM, Mattmann, Chris A (388J) 
chris.a.mattm...@jpl.nasa.gov wrote:

 Hey Markus,

 No worries. I actually have no dog in this fight to be honest.

 I want Gora to be successful, and I want Nutch to be successful.
 I haven't contributed much to Nutch 2.0 trunk but I have been
 to the 1.x series branch. I wish I knew more about Gora's internals (and
 am trying to learn) so I could help more with it. I think it will make a
 lot of sense to use it at some point.

 At the same time, I'm all for making 1.x releases and naturally getting to
 2.0 over time based on our current progress and understanding. I'm also
 super excited about the 1.x versions of Nutch and when I think about it
 the reality is that they've always been Nutch trunk even though we
 artificially tried to turn the nutchbase branch into it.

 So to wrap it up, I'm totally fine with 1.x moving into trunk and with
 executing the plan I proposed a while back:

 ---snip
 1. branch the current trunk as
 https://svn.apache.org/repos/asf/nutch/branches/nutchgora
 2. grab latest stable branch (e.g.,
 https://svn.apache.org/repos/asf/nutch/branches/branch-1.6) and
 *replace* the Nutch trunk with it, and bump the version # to 1.7-dev
 3. active development on stable becomes active development in trunk and
 nutchgora still exists in case anyone ever resurrects it.
 ---snip

 Of course, it's not 1.6 (I was optimistic about getting there in 6 months
 ;) ), but it's really 1.4.
 And we don't need to bump to -dev since we're already in full dev with the
 1.4 cycle.

 So, I'm ready for a VOTE. Feel free to call one (or have Julien do it), and
 I'll VOTE +1.

 Cheers,
 Chris


 On Sep 17, 2011, at 10:18 AM, Markus Jelsma wrote:

  Hi Chris,
 
  I initially respawned this thread with the suggestion not to wait
  until January or so before the vote. Hence my apologies for being
  impatient and pessimistic about trunk :)
 
  Cheers,
 
  Hey Julien,
 
  My option E was pretty much equivalent to B except I specified a time
  frame (next 6 months). Are we just saying that we'll accelerate the time
  frame to say, umm, next week or the week after? :)
 
  If so, fine by me. Since I moved nutchbase into the trunk at one point,
  I'd be happy once we've VOTEd and decided to be the one to execute moving
  it out.
 
  And yes, PMC votes will be binding and we'll do majority takes it, fine
  by me.
 
  Cheers,
  Chris
 
  On Sep 17, 2011, at 1:45 AM, Julien Nioche wrote:
  Let's keep it simple. Let's vote for option B (i.e. shelve 2.0), if
  most people are in favour then we don't need to look into other options
  at all. If not, we'll see what alternatives or arguments come up and vote
  on these later.
 
  I assume that only PMC votes will be binding and the majority takes it?
 
  Julien
 
  On 16 September 2011 22:30, Mattmann, Chris A (388J)
  chris.a.mattm...@jpl.nasa.gov wrote: Why don't we just collect VOTEs
  for each of the options a-e, and then figure out based on that if there
  is a majority. If there's no majority, we can whittle it down to say the
  top 2-3, and then VOTE on those, looking for majority again.
 
  Cheers,
  Chris
 
  On Sep 16, 2011, at 11:44 AM, Markus Jelsma wrote:
  Option B) Shelve trunk in a branch and promote 1.4 to trunk. We can
  always choose to hardwire HBASE (option D) later.
 
  Markus
 
  Am happy to call for a vote on the future of Nutch 2.0 if you want.
  Shall we reduce the various options described before to a single one?
 
  Julien
 
  On 15 September 2011 19:55, Markus Jelsma
  markus.jel...@openindex.io wrote:
  Hi Guys,
 
  I thought I'd chime in on this thread. My comments below:
  I understand and share your frustration, however you need to bear
  in mind that things are done only if people volunteer and have time -
  usually taken from their holiday, weekends, evenings. Chris (who
  is the de facto release master for Nutch and Gora) has not had the
  time and nobody else has volunteered to do it.
 
  Yep I haven't had the time to push a Gora 0.1.1-incubating release
  that will address the Maven issues. However it is on my roadmap for
  open source stuff to get done in the next month, so that's a good
  thing. But yes, that portion of my open source work is all volunteer
  time, so sometimes other things take priority.
 
  As it happens, yesterday was the 1 year anniversary of the last
  successful Hudson/Jenkins build...  If that actually worked, we
  could point people towards it as a useful recipe for how to get a
  build working off trunk.  I haven't been following Nutch too
  closely, but it always strikes me as really odd, that there's a
  nightly build and it doesn't bother anybody that it fails all the
  time (and that there isn't a nightly build for the stable
  branches).
 
  The real issue 

[jira] [Commented] (NUTCH-1092) overhaul FAQ's and publish to Nutch site

2011-09-17 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107206#comment-13107206
 ] 

Lewis John McGibbney commented on NUTCH-1092:
-

Committed revision 1172043.
I have completely updated our FAQ's, archiving all material not of relevance to 
the Nutch wiki archive.
There is a small error with some SVG's not being handled correctly when the pdf 
is produced from index.html. Unfortunately this breaks the site build and I 
will be working towards getting this fixed when the guys from Forrest get back 
to me. The reason for committing when I knew it would break the build was that 
it adds greatly to the site functionality and directly addresses our lack of 
documentation for the project. Hopefully this trivial discrepancy can be dealt 
with very soon and I can update the commit.

The next step here is to push all html and pdf to a /docs directory which we 
can ship with the next stable release of Nutch. This will address our 
long-standing lack of comprehensive documentation.

 overhaul FAQ's and publish to Nutch site
 

                 Key: NUTCH-1092
                 URL: https://issues.apache.org/jira/browse/NUTCH-1092
             Project: Nutch
          Issue Type: Sub-task
          Components: documentation
    Affects Versions: 1.4, 2.0
            Reporter: Lewis John McGibbney
             Fix For: 1.4, 2.0


 We require a complete overhaul of the FAQ's on the Wiki. Once this is 
 accomplished they need to be pushed into the Nutch site. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Issue Comment Edited] (NUTCH-1092) overhaul FAQ's and publish to Nutch site

2011-09-17 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107206#comment-13107206
 ] 

Lewis John McGibbney edited comment on NUTCH-1092 at 9/17/11 6:49 PM:
--

Committed revision 1172043.
I have completely updated our FAQ's, archiving all material not of relevance to 
the Nutch wiki archive.
There is a small error with some SVG's not being handled correctly when the pdf 
is produced from index.html. Unfortunately this breaks the site build and I 
will be working towards getting this fixed when the guys from Forrest get back 
to me. The reason for committing when I knew it would break the build was that 
it adds greatly to the site functionality and directly addresses our lack of 
documentation for the project. Hopefully this trivial discrepancy can be dealt 
with very soon and I can update the commit.

The next step here is to push all html and pdf to a /docs directory which we 
can ship with the next stable release of Nutch. This will address our 
long-standing lack of comprehensive documentation.

  was (Author: lewismc):
Committed revision 1172043.
I have completely updated our FAQ's, archiving all material not of relevance to 
the Nutch wiki archive.
There is a small error with some SVG's not being handled correctly when the pdf 
is produced from index.html. Unfortunately this breaks the site build and I 
will be working towards getting this fixed when the guys from Forrest get back 
to me. The reason for committing when I know it would break the build, was that 
it add greatly to the site functionality and directly addressed our lack of 
documentation for the project. Hopefully this trivial discrepancy can be dealt 
with very soon and I can update the commit.

The next step here is to push all html and pdf to a /docs directory which we 
can ship with the next stable release of Nutch. This will address our ling 
standing issue of some comprehensive documentation.
  
 overhaul FAQ's and publish to Nutch site
 

                 Key: NUTCH-1092
                 URL: https://issues.apache.org/jira/browse/NUTCH-1092
             Project: Nutch
          Issue Type: Sub-task
          Components: documentation
    Affects Versions: 1.4, 2.0
            Reporter: Lewis John McGibbney
             Fix For: 1.4, 2.0


 We require a complete overhaul of the FAQ's on the Wiki. Once this is 
 accomplished they need to be pushed into the Nutch site. 





[jira] [Commented] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

2011-09-17 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107294#comment-13107294
 ] 

Rui Araújo commented on NUTCH-585:
--

I can also confirm that the patch works on Nutch 1.3.

However, it didn't work for my use-case as I need to filter a diverse set of tags
based on different attributes. Besides, I needed the links from the filtered 
area, which did not happen. 

So I altered Hira's patch and I am publishing my work here.

This is the new changed property.
{code:xml} 
<property>
  <name>parser.html.NodesToExclude</name>
  <value>table;summary;header|div;id;navigation</value>
  <description>
  A list of nodes whose content will not be indexed separated by |.  Use this to tell
  the HTML parser to ignore, for example, site navigation text.
  Each node has three elements: the first one is the tag name, the second one the
  attribute name, the third one the value of the attribute.
  Note that nodes with these attributes, and their children, will be silently ignored by the parser
  so verify the indexed content with Luke to confirm results.
  </description>
</property>
{code} 
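
To make the value format concrete, here is a small sketch of how such a property value could be split into (tag, attribute, value) rules. It is written in Python purely for illustration and is not the code from the attached patch; the helper names and the exact-match comparison of attribute values are assumptions.

```python
def parse_nodes_to_exclude(value):
    """Split a parser.html.NodesToExclude value into (tag, attribute, value) rules.

    Nodes are separated by "|" and each node has three ";"-separated elements,
    as stated in the property description.
    """
    rules = []
    for entry in value.split("|"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing or doubled separator
        parts = [part.strip() for part in entry.split(";")]
        if len(parts) != 3:
            raise ValueError(f"expected tag;attribute;value, got {entry!r}")
        rules.append(tuple(parts))
    return rules


def is_excluded(tag, attributes, rules):
    """Return True if an element matches any rule (assumed exact matching)."""
    return any(
        tag == rule_tag and attributes.get(attr_name) == attr_value
        for rule_tag, attr_name, attr_value in rules
    )


if __name__ == "__main__":
    rules = parse_nodes_to_exclude("table;summary;header|div;id;navigation")
    print(rules)
    print(is_excluded("div", {"id": "navigation"}, rules))
    print(is_excluded("div", {"id": "content"}, rules))
```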

I really think this should be present in Nutch. I am available to improve the 
patch until it is ready for inclusion. Also I am looking for comments on how I 
implemented my improvements.

Thanks,
Rui

 [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
 ---

                 Key: NUTCH-585
                 URL: https://issues.apache.org/jira/browse/NUTCH-585
             Project: Nutch
          Issue Type: Improvement
    Affects Versions: 0.9.0
         Environment: All operating systems
            Reporter: Andrea Spinelli
            Priority: Minor
         Attachments: nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch


 We are using nutch to index our own web sites; we would like not to index 
 certain parts of our pages, because we know they are not relevant (for 
 instance, there are several links to change the background color) and 
 generate spurious matches.
 We have modified the plugin so that it ignores HTML code between certain HTML 
 comments, like
 <!-- START-IGNORE -->
 ... ignored part ...
 <!-- STOP-IGNORE -->
 We feel this might be useful to someone else, maybe factorizing the comment 
 strings as constants in the configuration files (say parser.html.ignore.start 
 and parser.html.ignore.stop in nutch-site.xml).
 We are almost ready to contribute our code snippet.  Looking forward to any 
 expression of interest - or for an explanation why what we are doing is 
 plain wrong!
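
As a rough illustration of the comment-marker idea in this description, the sketch below drops everything between a START-IGNORE and a STOP-IGNORE comment before the text would be indexed. It is a regex-based Python sketch under assumptions (marker spelling as shown in the description, no nesting of ignored regions), not the reporter's actual plugin change.

```python
import re

# Matches one ignored region, markers included. DOTALL lets the region span
# several lines and the non-greedy ".*?" stops at the first STOP-IGNORE.
IGNORE_BLOCK = re.compile(
    r"<!--\s*START-IGNORE\s*-->.*?<!--\s*STOP-IGNORE\s*-->",
    re.DOTALL,
)


def strip_ignored(html):
    """Return html with every START-IGNORE ... STOP-IGNORE region removed."""
    return IGNORE_BLOCK.sub("", html)


if __name__ == "__main__":
    page = (
        "<p>Real content</p>"
        "<!-- START-IGNORE --><a href='?bg=red'>red background</a><!-- STOP-IGNORE -->"
        "<p>More content</p>"
    )
    print(strip_ignored(page))
```

Making the two marker strings configurable, as the description suggests, would only mean building the pattern from those settings instead of hard-coding it.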





[jira] [Updated] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

2011-09-17 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-585:


       Patch Info: [Patch Available]
    Fix Version/s: 1.4
         Assignee: Markus Jelsma

Marked for 1.4. Thanks!

 [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
 ---

                 Key: NUTCH-585
                 URL: https://issues.apache.org/jira/browse/NUTCH-585
             Project: Nutch
          Issue Type: Improvement
    Affects Versions: 0.9.0
         Environment: All operating systems
            Reporter: Andrea Spinelli
            Assignee: Markus Jelsma
            Priority: Minor
             Fix For: 1.4
         Attachments: nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch


 We are using nutch to index our own web sites; we would like not to index 
 certain parts of our pages, because we know they are not relevant (for 
 instance, there are several links to change the background color) and 
 generate spurious matches.
 We have modified the plugin so that it ignores HTML code between certain HTML 
 comments, like
 <!-- START-IGNORE -->
 ... ignored part ...
 <!-- STOP-IGNORE -->
 We feel this might be useful to someone else, maybe factorizing the comment 
 strings as constants in the configuration files (say parser.html.ignore.start 
 and parser.html.ignore.stop in nutch-site.xml).
 We are almost ready to contribute our code snippet.  Looking forward to any 
 expression of interest - or for an explanation why what we are doing is 
 plain wrong!





Build failed in Jenkins: Nutch-trunk #1607

2011-09-17 Thread Apache Jenkins Server
See https://builds.apache.org/job/Nutch-trunk/1607/

--
Started by timer
Building remotely on solaris1
hudson.util.IOException2: remote file operation failed: 
https://builds.apache.org/job/Nutch-trunk/ws/ at 
hudson.remoting.Channel@488e2f8:solaris1
at hudson.FilePath.act(FilePath.java:754)
at hudson.FilePath.act(FilePath.java:740)
at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:731)
at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:676)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1193)
at 
hudson.model.AbstractBuild$AbstractRunner.checkout(AbstractBuild.java:555)
at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:443)
at hudson.model.Run.run(Run.java:1376)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
at hudson.model.ResourceController.execute(ResourceController.java:88)
at hudson.model.Executor.run(Executor.java:230)
Caused by: java.io.IOException: Remote call on solaris1 failed
at hudson.remoting.Channel.call(Channel.java:677)
at hudson.FilePath.act(FilePath.java:747)
... 10 more
Caused by: java.lang.LinkageError: duplicate class definition: 
hudson/model/Descriptor
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:621)
at java.lang.ClassLoader.defineClass(ClassLoader.java:466)
at 
hudson.remoting.RemoteClassLoader.loadClassFile(RemoteClassLoader.java:151)
at 
hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:131)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
at java.lang.Class.getDeclaredFields0(Native Method)
at java.lang.Class.privateGetDeclaredFields(Class.java:2259)
at java.lang.Class.getDeclaredField(Class.java:1852)
at 
java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1582)
at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:52)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:408)
at java.security.AccessController.doPrivileged(Native Method)
at java.io.ObjectStreamClass.init(ObjectStreamClass.java:400)
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:297)
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:531)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1552)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1466)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1552)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1466)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1699)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:348)
at hudson.remoting.UserRequest.deserialize(UserRequest.java:182)
at hudson.remoting.UserRequest.perform(UserRequest.java:98)
at hudson.remoting.UserRequest.perform(UserRequest.java:48)
at hudson.remoting.Request$2.run(Request.java:287)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:417)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
at java.util.concurrent.FutureTask.run(FutureTask.java:123)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:651)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:676)
at java.lang.Thread.run(Thread.java:595)
Archiving artifacts
Recording test results
Publishing Javadoc



Build failed in Jenkins: Nutch-branch-1.4 #8

2011-09-17 Thread Apache Jenkins Server
See https://builds.apache.org/job/Nutch-branch-1.4/8/

--
[...truncated 4182 lines...]

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-regex
[junit] WARNING: multiple versions of ant detected in path for junit
[junit]  jar:file:/home/jenkins/jenkins-slave/tools/ant-1.8.2/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and jar:https://builds.apache.org/job/Nutch-branch-1.4/ws/branch-1.4/build/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Running org.apache.nutch.net.urlnormalizer.regex.TestRegexURLNormalizer
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.062 sec

test:

jar:
 [copy] Copying 1 file to https://builds.apache.org/job/Nutch-branch-1.4/ws/branch-1.4/build/classes
 [copy] Copying 1 file to https://builds.apache.org/job/Nutch-branch-1.4/ws/branch-1.4/build/classes
  [jar] Building jar: https://builds.apache.org/job/Nutch-branch-1.4/ws/branch-1.4/build/nutch-1.4-snapshot.jar

runtime:
[mkdir] Created dir: https://builds.apache.org/job/Nutch-branch-1.4/ws/branch-1.4/runtime
[mkdir] Created dir: https://builds.apache.org/job/Nutch-branch-1.4/ws/branch-1.4/runtime/local
[mkdir] Created dir: https://builds.apache.org/job/Nutch-branch-1.4/ws/branch-1.4/runtime/deploy
 [copy] Copying 1 file to https://builds.apache.org/job/Nutch-branch-1.4/ws/branch-1.4/runtime/deploy
 [copy] Copying 1 file to https://builds.apache.org/job/Nutch-branch-1.4/ws/branch-1.4/runtime/deploy/bin
 [copy] Copying 1 file to https://builds.apache.org/job/Nutch-branch-1.4/ws/branch-1.4/runtime/local/lib
 [copy] Copying 1 file to https://builds.apache.org/job/Nutch-branch-1.4/ws/branch-1.4/runtime/local/lib/native
 [copy] Copying 19 files to https://builds.apache.org/job/Nutch-branch-1.4/ws/branch-1.4/runtime/local/conf
 [copy] Copying 1 file to https://builds.apache.org/job/Nutch-branch-1.4/ws/branch-1.4/runtime/local/bin
 [copy] Copying 52 files to https://builds.apache.org/job/Nutch-branch-1.4/ws/branch-1.4/runtime/local/lib
 [copy] Copying 109 files to https://builds.apache.org/job/Nutch-branch-1.4/ws/branch-1.4/runtime/local/plugins

javadoc:
[mkdir] Created dir: https://builds.apache.org/job/Nutch-branch-1.4/ws/branch-1.4/build/docs/api
  [javadoc] Generating Javadoc
  [javadoc] Javadoc execution
  [javadoc] Loading source files for package org.apache.nutch.crawl...
  [javadoc] Loading source files for package org.apache.nutch.fetcher...
  [javadoc] Loading source files for package org.apache.nutch.indexer...
  [javadoc] Loading source files for package org.apache.nutch.indexer.solr...
  [javadoc] Loading source files for package org.apache.nutch.metadata...
  [javadoc] Loading source files for package org.apache.nutch.net...
  [javadoc] Loading source files for package org.apache.nutch.net.protocols...
  [javadoc] Loading source files for package org.apache.nutch.parse...
  [javadoc] Loading source files for package org.apache.nutch.plugin...
  [javadoc] Loading source files for package org.apache.nutch.protocol...
  [javadoc] Loading source files for package org.apache.nutch.scoring...
  [javadoc] Loading source files for package org.apache.nutch.scoring.webgraph...
  [javadoc] Loading source files for package org.apache.nutch.segment...
  [javadoc] Loading source files for package org.apache.nutch.tools...
  [javadoc] Loading source files for package org.apache.nutch.tools.arc...
  [javadoc] Loading source files for package org.apache.nutch.tools.proxy...
  [javadoc] Loading source files for package org.apache.nutch.util...
  [javadoc] Loading source files for package org.apache.nutch.util.domain...
  [javadoc] Loading source files for package org.apache.nutch.protocol.http.api...
  [javadoc] Loading source files for package org.apache.nutch.urlfilter.api...
  [javadoc] Loading source files for package org.apache.nutch.microformats.reltag...
  [javadoc] Loading source files for package org.apache.nutch.protocol.file...
  [javadoc] Loading source files for package org.apache.nutch.protocol.ftp...
  [javadoc] Loading source files for package org.apache.nutch.protocol.http...
  [javadoc] Loading source files for package org.apache.nutch.protocol.httpclient...
  [javadoc] Loading source files for package org.apache.nutch.parse.tika...
  [javadoc] Loading source files for package org.apache.nutch.parse.ext...
  [javadoc] Loading source files for package org.apache.nutch.parse.js...
  [javadoc] Loading source files for package org.apache.nutch.parse.swf...
  [javadoc] Loading source files for package org.apache.nutch.parse.zip...
  [javadoc] Loading source files for package org.apache.nutch.indexer.basic...
  [javadoc] Loading source files for package org.apache.nutch.indexer.more...
  [javadoc] Loading source files for package org.apache.nutch.scoring.opic...
  [javadoc] Loading source files for package