Filter the urls from search results.

2007-03-28 Thread inalasuresh

Hi,

I want to filter particular URLs out of the search results.

How can I use filters for this situation?

Could anyone please give me a solution, with an example?

Thanks,
Suresh


-- 
View this message in context: 
http://www.nabble.com/Filter-the-urls-from-search-results.-tf3478724.html#a9708491
Sent from the Nutch - Dev mailing list archive at Nabble.com.



Re: [VOTE] Release Apache Nutch 0.9

2007-03-28 Thread Dennis Kubes
Yes.  This seems to have fixed the problem.  All, do we want to create a 
JIRA and commit this for the 0.9 release?


Dennis

Andrzej Bialecki wrote:

Doğacan Güney wrote:

Hi,

On 3/28/07, Dennis Kubes [EMAIL PROTECTED] wrote:


This is definitely a Hadoop problem.  This is similar to the classpath
issues that we were encountering before with Hadoop and the
ReduceTaskRunner.  When I include the nutch-*.jar in the Hadoop class
path the errors go away.  Not a fix, but it proves the point that this is
an issue with Hadoop class loading.

Dennis Kubes



Dennis, you were running SegmentMerger, I presume? This occurs probably
because in SegmentMerger and SegmentReader's dump Nutch uses JobConf
instead of NutchJob. Because of this Hadoop can't find the necessary job file.

I put a simple patch at
http://www.ceng.metu.edu.tr/~e1345172/use-nutch-job.patch . Can you
try it with this?



Duh, the patch seems to be exactly what's needed - thanks Doğacan!

In the future we should rework the test suite to execute using a clean 
Hadoop installation, i.e. one where Hadoop daemons are started without 
Nutch classes on the classpath.





Re: [VOTE] Release Apache Nutch 0.9

2007-03-28 Thread Michael Stack

Dennis Kubes wrote:
Yes.  This seems to have fixed the problem.  All, do we want to create
a JIRA and commit this for the 0.9 release?

FYI, this looks like NUTCH-333:
http://issues.apache.org/jira/browse/NUTCH-333.

St.Ack



Dennis

Andrzej Bialecki wrote:

Doğacan Güney wrote:

Hi,

On 3/28/07, Dennis Kubes [EMAIL PROTECTED] wrote:


This is definitely a Hadoop problem.  This is similar to the classpath
issues that we were encountering before with Hadoop and the
ReduceTaskRunner.  When I include the nutch-*.jar in the Hadoop class
path the errors go away.  Not a fix, but it proves the point that this is
an issue with Hadoop class loading.

Dennis Kubes



Dennis, you were running SegmentMerger, I presume? This occurs probably
because in SegmentMerger and SegmentReader's dump Nutch uses JobConf
instead of NutchJob. Because of this Hadoop can't find the necessary job file.

I put a simple patch at
http://www.ceng.metu.edu.tr/~e1345172/use-nutch-job.patch . Can you
try it with this?



Duh, the patch seems to be exactly what's needed - thanks Doğacan!

In the future we should rework the test suite to execute using a 
clean Hadoop installation, i.e. one where Hadoop daemons are 
started without Nutch classes on the classpath.







Re: [VOTE] Release Apache Nutch 0.9

2007-03-28 Thread Andrzej Bialecki

Dennis Kubes wrote:
Yes.  This seems to have fixed the problem.  All, do we want to create a 
JIRA and commit this for the 0.9 release?


It should definitely go into the release, and we need a patch for the 
trunk/ .


Actually, I'm somewhat surprised that we have tags/release-0.9 but we 
don't yet have branches/branch-0.9 ...


I think I'm confused, or the release procedure is confused. My 
understanding so far was that we first create a branch-0.9, we test the 
build from that branch and if it passes all tests and the wait period is 
over, then we copy it to tags/release-0.9 and proclaim a release - which 
is really a read-only branch, i.e. we don't ever commit any patches to 
it ... If that were the case, then we still wouldn't have the 
release-0.9 tag, we could have applied the patch in branch-0.9, plus 
possibly other patches, and then finally tag this tree as tags/release-0.9.


As it is now we are in an awkward situation that we have to patch 
tags/release-0.9 ..


One solution would be now to delete this tag, apply the patch to trunk, 
create branches/branch-0.9, and continue applying any other patches that 
may come up during this testing period - and when we are finally happy 
with the codebase then take a snapshot into tags/release-0.9, and keep 
it read-only.


Another solution is to bend the rules and apply the patch to trunk/ and 
then merge from the trunk to tags/release-0.9 .


What do you think?


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Next release - 0.10.0 or 1.0.0 ?

2007-03-28 Thread Andrzej Bialecki

Hi all,

I know it's a trivial issue, but still ... When this release is out, I 
propose that we should name the next release 1.0.0, and not 0.10.0. The 
effect is purely psychological, but it also reflects our confidence in 
the platform.


Many Open Source projects are afraid of going to 1.0.0 and seem to be 
unable to ever reach this level, as if it were a magic step beyond which 
they are obliged to make some implied but unjustified promises ... 
Perhaps it's because in the commercial world everyone knows what a 1.0.0 
release means :) The downside of the version numbering that never 
reaches 1.0.0 is that casual users don't know how usable the software is 
- e.g. Nutch 0.10.0 could possibly mean that there are still 90 releases 
to go before it becomes usable.


Therefore I propose the following:

* shorten the release cycle, so that we can make a release at least once 
every quarter. This was discussed before, and I hope we can make it 
happen, especially with the help of new forces that joined the team ;)


* call the next version 1.0.0, and continue in increments of 0.1.0 for 
each bi-monthly or quarterly release,


* make critical bugfix / maintenance releases using increments of 0.0.1 
- although the need for such would be greatly diminished with the 
shorter release cycle.


* once we arrive at versions greater than x.5.0 we should plan for a big 
release (increment of 1.0.0).


* we should use only single digits for small increments, i.e. limit them 
to values between 0-9.
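
The numbering rules proposed above can be sketched in a few lines of Python. This is only an illustration of one possible reading of the proposal, not an agreed policy: the names `Version`, `parse`, `next_version` and `should_plan_major` are hypothetical, "greater than x.5.0" is read as any version after x.5.0 within the same major line, and running past digit 9 is treated as an error because the proposal does not say what should happen then.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.patch}"


def parse(text: str) -> Version:
    """Parse a 'major.minor.patch' string such as '1.0.0'."""
    return Version(*(int(part) for part in text.split(".")))


def next_version(current: Version, kind: str) -> Version:
    """Return the version that follows `current`.

    kind is 'bugfix' (+0.0.1), 'regular' (+0.1.0) or 'major' (+1.0.0).
    Small increments are limited to single digits (0-9); going past 9
    raises ValueError (an assumption, the proposal leaves this open).
    """
    if kind == "bugfix":
        if current.patch >= 9:
            raise ValueError("patch digit exhausted, make a regular release")
        return Version(current.major, current.minor, current.patch + 1)
    if kind == "regular":
        if current.minor >= 9:
            raise ValueError("minor digit exhausted, make a major release")
        return Version(current.major, current.minor + 1, 0)
    if kind == "major":
        return Version(current.major + 1, 0, 0)
    raise ValueError(f"unknown release kind: {kind}")


def should_plan_major(current: Version) -> bool:
    """True once the version is past x.5.0 (one reading of the proposal)."""
    return (current.minor, current.patch) > (5, 0)
```

For example, `next_version(parse("1.0.0"), "regular")` gives 1.1.0, and `should_plan_major(parse("1.6.0"))` is true while `should_plan_major(parse("1.5.0"))` is not.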


What do you think?


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [VOTE] Release Apache Nutch 0.9

2007-03-28 Thread Chris Mattmann
Well, it's just going to add more work for me, but in the end, it's probably
something that needs to be in there. I could go either way on this though,
as in, if we don't commit it, 0.9.1 shouldn't be far off. Here's my +1 for
going ahead and committing it...


On 3/28/07 10:21 AM, Dennis Kubes [EMAIL PROTECTED] wrote:

 Yes.  This seems to have fixed the problem.  All, do we want to create a
 JIRA and commit this for the 0.9 release?
 
 Dennis
 
 Andrzej Bialecki wrote:
 Doğacan Güney wrote:
 Hi,
 
 On 3/28/07, Dennis Kubes [EMAIL PROTECTED] wrote:
 
 This is definitely a Hadoop problem.  This is similar to the classpath
 issues that we were encountering before with Hadoop and the
 ReduceTaskRunner.  When I include the nutch-*.jar in the Hadoop class
 path the errors go away.  Not a fix, but it proves the point that this is
 an issue with Hadoop class loading.
 
 Dennis Kubes
 
 
 Dennis, you were running SegmentMerger, I presume? This occurs probably
 because in SegmentMerger and SegmentReader's dump Nutch uses JobConf
 instead of NutchJob. Because of this Hadoop can't find the necessary job file.
 
 I put a simple patch at
 http://www.ceng.metu.edu.tr/~e1345172/use-nutch-job.patch . Can you
 try it with this?
 
 
 Duh, the patch seems to be exactly what's needed - thanks Doğacan!
 
 In the future we should rework the test suite to execute using a clean
 Hadoop installation, i.e. one where Hadoop daemons are started without
 Nutch classes on the classpath.
 
 




RE: Next release - 0.10.0 or 1.0.0 ?

2007-03-28 Thread Steve Severance
Another way of looking at it might be to ask the question what would make a 
great 1.0 release? What new features would be awesome? What might get people 
more excited?

Having a 1.0 might make the project look like it has attained a real milestone.

Steve
 -Original Message-
 From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, March 28, 2007 2:38 PM
 To: nutch-dev@lucene.apache.org
 Subject: Next release - 0.10.0 or 1.0.0 ?
 
 Hi all,
 
 I know it's a trivial issue, but still ... When this release is out, I
 propose that we should name the next release 1.0.0, and not 0.10.0. The
 effect is purely psychological, but it also reflects our confidence in
 the platform.
 
 Many Open Source projects are afraid of going to 1.0.0 and seem to be
 unable to ever reach this level, as if it were a magic step beyond which
 they are obliged to make some implied but unjustified promises ...
 Perhaps it's because in the commercial world everyone knows what a 1.0.0
 release means :) The downside of the version numbering that never
 reaches 1.0.0 is that casual users don't know how usable the software is
 - e.g. Nutch 0.10.0 could possibly mean that there are still 90 releases
 to go before it becomes usable.
 
 Therefore I propose the following:
 
 * shorten the release cycle, so that we can make a release at least once
 every quarter. This was discussed before, and I hope we can make it
 happen, especially with the help of new forces that joined the team ;)
 
 * call the next version 1.0.0, and continue in increments of 0.1.0 for
 each bi-monthly or quarterly release,
 
 * make critical bugfix / maintenance releases using increments of 0.0.1
 - although the need for such would be greatly diminished with the
 shorter release cycle.
 
 * once we arrive at versions greater than x.5.0 we should plan for a big
 release (increment of 1.0.0).
 
 * we should use only single digits for small increments, i.e. limit them
 to values between 0-9.
 
 What do you think?
 
 
 --
 Best regards,
 Andrzej Bialecki 
   ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com



Re: Next release - 0.10.0 or 1.0.0 ?

2007-03-28 Thread Dennis Kubes


+1

Andrzej Bialecki wrote:

Hi all,

I know it's a trivial issue, but still ... When this release is out, I 
propose that we should name the next release 1.0.0, and not 0.10.0. The 
effect is purely psychological, but it also reflects our confidence in 
the platform.


I think that a 1.0 release signifies maturity.  And while I think there 
are areas that Nutch can and will improve, I think that it has reached 
the necessary maturity level.




Many Open Source projects are afraid of going to 1.0.0 and seem to be 
unable to ever reach this level, as if it were a magic step beyond which 
they are obliged to make some implied but unjustified promises ... 
Perhaps it's because in the commercial world everyone knows what a 1.0.0 
release means :) The downside of the version numbering that never 
reaches 1.0.0 is that casual users don't know how usable the software is 
- e.g. Nutch 0.10.0 could possibly mean that there are still 90 releases 
to go before it becomes usable.


Personally, I don't like the x.10.x release structure.  I guess I 
think that if you can't get what you need done in 10 releases (x.0.x - 
x.9.x) then some rework needs to be done.  Think about this: Eclipse is 
still only on 3.2.2 / 3.3 and they use this type of structure.




Therefore I propose the following:

* shorten the release cycle, so that we can make a release at least once 
every quarter. This was discussed before, and I hope we can make it 
happen, especially with the help of new forces that joined the team ;)


I agree.



* call the next version 1.0.0, and continue in increments of 0.1.0 for 
each bi-monthly or quarterly release,


I agree with bi-monthly or monthly.  I think quarterly is too long 
especially considering how fast Hadoop is moving.


* make critical bugfix / maintenance releases using increments of 0.0.1 
- although the need for such would be greatly diminished with the 
shorter release cycle.


Yes but some bug fixes will still be necessary even with shortened 
release cycles.




* once we arrive at versions greater than x.5.0 we should plan for a big 
release (increment of 1.0.0).


I am fine having 10 releases x.0 - x.9 per major release.  Maybe I don't 
understand the reason for limiting it to 5.  If we do a 
release every month or so then about once a year we should have a major 
X release.




* we should use only single digits for small increments, i.e. limit them 
to values between 0-9.


Agree.


What do you think?




Sequence File Question

2007-03-28 Thread Steve Severance
Hey guys,
I have a MapReduce job that sets up a directory for PageRank. It iterates
over all the segments and then outputs a MapFile containing the data. When I
try to open the output directory with another MapReduce job, it fails,
saying that it cannot find the path. The path that it is trying to
open does not include the part-0 directory. My directory (like all
other directories, for that matter) has the same structure, which is
/path/part-0/whatever. I feel like this is a really stupid error and that I
have forgotten something that is easily fixed. Any ideas?

Steve



RE: Sequence File Question

2007-03-28 Thread Steve Severance
Let me actually refine that question: why do some directories, like the linkdb,
have a current subdirectory, while others, like parse_data, do not? Is there a
convention on this?

Steve

 -Original Message-
 From: Steve Severance [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, March 28, 2007 4:11 PM
 To: nutch-dev@lucene.apache.org
 Subject: Sequence File Question
 
 Hey guys,
 I have a mapreduce job that sets up a directory for pagerank. It iterates
 over all the segments and then outputs a MapFile containing the data. When I
 go to open the outputted directory with another MapReduce job it fails
 saying that it cannot find the path. The path that it thinks it is trying to
 open does not include the part-0 directory. Both my directory (and all
 other directories for that matter) have the same structure which is
 /path/part-0/whatever. I feel like this is a really stupid error and I
 have forgotten something that is easily fixed. Any ideas?
 
 Steve



Re: Sequence File Question

2007-03-28 Thread Andrzej Bialecki

Steve Severance wrote:

Let me actually refine that question: why do some directories, like the linkdb,
have a current subdirectory, while others, like parse_data, do not? Is there a
convention on this?


First, to answer your original question: you should use 
MapFileOutputFormat class for reading such output. It handles these 
part- subdirectories automatically.
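
The directory layout behind this answer can be illustrated without Hadoop. The following is a minimal Python sketch, assuming that a job output directory holds one subdirectory per reduce task whose name starts with "part-", and that each of them is a MapFile, i.e. a directory containing a `data` and an `index` file. The helper `find_part_dirs` is hypothetical and is not part of the Hadoop API; it only shows the kind of enumeration that a reader has to do instead of opening /path directly.

```python
from pathlib import Path
from typing import List


def find_part_dirs(output_dir: Path) -> List[Path]:
    """Return the part-* MapFile directories under a job output directory.

    Illustrative helper, not a Hadoop API. A subdirectory counts as a
    MapFile here if it contains both a `data` and an `index` file.
    """
    if not output_dir.is_dir():
        raise FileNotFoundError(f"no such output directory: {output_dir}")
    parts = [
        candidate
        for candidate in sorted(output_dir.glob("part-*"))
        if (candidate / "data").is_file() and (candidate / "index").is_file()
    ]
    if not parts:
        raise FileNotFoundError(f"no part-* MapFiles under {output_dir}")
    return parts
```

A lookup for a given key would then go to whichever part directory the partitioner assigned that key to, which is why opening the parent path as if it were a single MapFile fails.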


Second, the current subdirectory is there in order to properly handle 
DB updates - or actually replacements - see e.g. CrawlDb.install() 
method for details. This is not needed in case of segments, which are 
created once and never updated.
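
The replacement idea can be sketched as a rename-based swap. This is a local-filesystem illustration in Python, not the actual code of CrawlDb.install(): the function name `install`, the `old` directory name and the exact order of steps are assumptions made for the example.

```python
import shutil
from pathlib import Path


def install(new_db: Path, db_root: Path) -> None:
    """Swap a freshly written DB directory in as db_root/current.

    Sketch only: the previous `current` is parked as `old`, the new
    output is renamed to `current`, and `old` is removed afterwards.
    """
    current = db_root / "current"
    old = db_root / "old"
    if current.exists():
        if old.exists():
            shutil.rmtree(old)  # drop leftovers from an earlier update
        current.rename(old)     # keep the previous version until the swap is done
    db_root.mkdir(parents=True, exist_ok=True)
    new_db.rename(current)      # the new data becomes the live version
    if old.exists():
        shutil.rmtree(old)
```

Readers always look at `current`, so an update job can write its output elsewhere and only swap it in once it has finished. Segments have no such step because nothing ever replaces them.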


Thirdly, although you didn't ask about it ;) the latest version of 
Hadoop contains a handy facility called Counters - if you use the PR 
PowerMethod you need to collect PR from dangling nodes in order to 
redistribute it later. You can use Counters for this, and save on a 
separate aggregation step.
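
The dangling-node bookkeeping mentioned here can be illustrated with a single-machine Python sketch of one power-method iteration. It is not Hadoop code: the variable `dangling_mass` merely plays the role that a job-wide counter would play, and the function name, the damping value of 0.85 and the uniform redistribution are assumptions chosen for the example. Every link target is assumed to be a key of `rank`.

```python
from typing import Dict, List


def pagerank_step(
    links: Dict[str, List[str]],
    rank: Dict[str, float],
    damping: float = 0.85,
) -> Dict[str, float]:
    """Run one power-method iteration and redistribute dangling rank.

    Nodes without outlinks add their score to `dangling_mass` during the
    same pass that distributes the other scores, so no separate
    aggregation pass over the data is needed.
    """
    n = len(rank)
    incoming = {node: 0.0 for node in rank}
    dangling_mass = 0.0  # stands in for a job-wide counter
    for node, score in rank.items():
        targets = links.get(node, [])
        if targets:
            share = score / len(targets)
            for target in targets:
                incoming[target] += share
        else:
            dangling_mass += score
    return {
        node: (1.0 - damping) / n + damping * (incoming[node] + dangling_mass / n)
        for node in rank
    }
```

With the dangling mass spread evenly over all nodes, the scores keep summing to 1 from one iteration to the next.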



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Next release - 0.10.0 or 1.0.0 ?

2007-03-28 Thread Chris Mattmann
My +1 for 1.0.0. I already changed it to 0.10.0, but this can be easily
reverted, and was probably something that I should have brought to the
attention of the dev list before I did that (sorry about that). In any case,
I think 1.0.0 makes a lot of sense, politically, and software wise. Nutch is
production quality software (we use it in production environments here at
JPL), and deserves to have a 1.0.0 release...

My 2 cents,
  Chris



On 3/28/07 11:38 AM, Andrzej Bialecki [EMAIL PROTECTED] wrote:

 Hi all,
 
 I know it's a trivial issue, but still ... When this release is out, I
 propose that we should name the next release 1.0.0, and not 0.10.0. The
 effect is purely psychological, but it also reflects our confidence in
 the platform.
 
 Many Open Source projects are afraid of going to 1.0.0 and seem to be
 unable to ever reach this level, as if it were a magic step beyond which
 they are obliged to make some implied but unjustified promises ...
 Perhaps it's because in the commercial world everyone knows what a 1.0.0
 release means :) The downside of the version numbering that never
 reaches 1.0.0 is that casual users don't know how usable the software is
 - e.g. Nutch 0.10.0 could possibly mean that there are still 90 releases
 to go before it becomes usable.
 
 Therefore I propose the following:
 
 * shorten the release cycle, so that we can make a release at least once
 every quarter. This was discussed before, and I hope we can make it
 happen, especially with the help of new forces that joined the team ;)
 
 * call the next version 1.0.0, and continue in increments of 0.1.0 for
 each bi-monthly or quarterly release,
 
 * make critical bugfix / maintenance releases using increments of 0.0.1
 - although the need for such would be greatly diminished with the
 shorter release cycle.
 
 * once we arrive at versions greater than x.5.0 we should plan for a big
 release (increment of 1.0.0).
 
 * we should use only single digits for small increments, i.e. limit them
 to values between 0-9.
 
 What do you think?
 




Re: [VOTE] Release Apache Nutch 0.9

2007-03-28 Thread 吕召刚

Hello, 

Sounds good. One question: we have already tagged release-0.9, this release is 
already available on some Nutch mirror sites, and people have downloaded it. 
So there would be two different nutch-0.9 versions in the world. How could 
people tell them apart?

Thanks
David 

Chris Mattmann :
 Folks,

  Discussing this with Andrzej, and reading his email below, I tend to agree
 more with the procedure he describes. I would like to call for a vote to change
 the existing as-documented procedure (on the wiki) to branch first, do
 testing in the branch (apply patches where needed), and then when the branch
 is blessed (e.g., 3 binding votes from committers in favor of it), tag it,
 and make a release. Sound good?

  In terms of next steps with what we have now, that boils down to:

 1. delete tags/release-0.9
 2. apply patch to trunk
 3. create branches/branch-0.9
 4. have dennis test again (large scale)
 5. if all goes well, finish release process
 6. tag tags/release-0.9

 Thoughts?

 Thanks!

 Cheers,
   Chris

 On 3/28/07 10:35 AM, Andrzej Bialecki [EMAIL PROTECTED] wrote:
  Dennis Kubes wrote:
  Yes.  This seems to have fixed the problem.  All, do we want to create a
  JIRA and commit this for the 0.9 release?
 
  It should definitely go into the release, and we need a patch for the
  trunk/ .
 
  Actually, I'm somewhat surprised that we have tags/release-0.9 but we
  don't yet have branches/branch-0.9 ...
 
  I think I'm confused, or the release procedure is confused. My
  understanding so far was that we first create a branch-0.9, we test the
  build from that branch and if it passes all tests and the wait period is
  over, then we copy it to tags/release-0.9 and proclaim a release - which
  is really a read-only branch, i.e. we don't ever commit any patches to
  it ... If that were the case, then we still wouldn't have the
  release-0.9 tag, we could have applied the patch in branch-0.9, plus
  possibly other patches, and then finally tag this tree as
  tags/release-0.9.
 
  As it is now we are in an awkward situation that we have to patch
  tags/release-0.9 ..
 
  One solution would be now to delete this tag, apply the patch to trunk,
  create branches/branch-0.9, and continue applying any other patches that
  may come up during this testing period - and when we are finally happy
  with the codebase then take a snapshot into tags/release-0.9, and keep
  it read-only.
 
  Another solution is to bend the rules and apply the patch to trunk/ and
  then merge from the trunk to tags/release-0.9 .
 
  What do you think?



Re: [VOTE] Release Apache Nutch 0.9

2007-03-28 Thread Sami Siren

2007/3/28, Andrzej Bialecki [EMAIL PROTECTED]:


Dennis Kubes wrote:
 Yes.  This seems to have fixed the problem.  All, do we want to create a
 JIRA and commit this for the 0.9 release?

It should definitely go into the release, and we need a patch for the
trunk/ .



+1

Actually, I'm somewhat surprised that we have tags/release-0.9 but we
don't yet have branches/branch-0.9 ...



IMO there's no need for a branch before a release.

I think I'm confused, or the release procedure is confused. My
understanding so far was that we first create a branch-0.9, we test the
build from that branch and if it passes all tests and the wait period is
over, then we copy it to tags/release-0.9 and proclaim a release - which
is really a read-only branch, i.e. we don't ever commit any patches to
it ... If that were the case, then we still wouldn't have the
release-0.9 tag, we could have applied the patch in branch-0.9, plus
possibly other patches, and then finally tag this tree as tags/release-0.9.



IMO we should have had a 0.9-rc1 tag, applied the patch to trunk, had a
0.9-rc2 tag, and so on until we were satisfied.

Then, when we are actually satisfied, create the tag for 0.9 (copied from the
rc that got promoted).

What is the benefit of using a branch before a release?

--
Sami Siren