Filter the urls from search results.
Hi,

I want to filter particular URLs out of the search results. How can I use the filters in this situation? Could anyone please give me a solution, with an example?

Thanks,
Suresh

--
View this message in context: http://www.nabble.com/Filter-the-urls-from-search-results.-tf3478724.html#a9708491
Sent from the Nutch - Dev mailing list archive at Nabble.com.
Re: [VOTE] Release Apache Nutch 0.9
Yes. This seems to have fixed the problem. All, do we want to create a JIRA and commit this for the 0.9 release?

Dennis

Andrzej Bialecki wrote:
> Doğacan Güney wrote:
>> Hi,
>>
>> On 3/28/07, Dennis Kubes [EMAIL PROTECTED] wrote:
>>> This is definitely a Hadoop problem. This is similar to the classpath issues that we were encountering before with Hadoop and the ReduceTaskRunner. When I include the nutch-*.jar in the Hadoop classpath the errors go away. Not a fix, but it proves the point that this is an issue with Hadoop class loading.
>>>
>>> Dennis Kubes
>>
>> Dennis, you were running SegmentMerger, I presume? This probably occurs because in SegmentMerger and in SegmentReader's dump, Nutch uses JobConf instead of NutchJob. Because of this, Hadoop can't find the necessary job file. I put a simple patch at http://www.ceng.metu.edu.tr/~e1345172/use-nutch-job.patch . Can you try it with this?
>
> Duh, the patch seems to be exactly what's needed - thanks Doğacan!
>
> In the future we should rework the test suite to execute using a clean Hadoop installation, i.e. one where Hadoop daemons are started without Nutch classes on the classpath.
Re: [VOTE] Release Apache Nutch 0.9
Dennis Kubes wrote:
> Yes. This seems to have fixed the problem. All, do we want to create a JIRA and commit this for the 0.9 release?

FYI, this looks like NUTCH-333: http://issues.apache.org/jira/browse/NUTCH-333.

St.Ack
Re: [VOTE] Release Apache Nutch 0.9
Dennis Kubes wrote:
> Yes. This seems to have fixed the problem. All, do we want to create a JIRA and commit this for the 0.9 release?

It should definitely go into the release, and we need a patch for the trunk/ .

Actually, I'm somewhat surprised that we have tags/release-0.9 but we don't yet have branches/branch-0.9 ...

I think I'm confused, or the release procedure is confused. My understanding so far was that we first create a branch-0.9, we test the build from that branch and if it passes all tests and the wait period is over, then we copy it to tags/release-0.9 and proclaim a release - which is really a read-only branch, i.e. we don't ever commit any patches to it ... If that were the case, then we still wouldn't have the release-0.9 tag, we could have applied the patch in branch-0.9, plus possibly other patches, and then finally tag this tree as tags/release-0.9.

As it is now we are in an awkward situation that we have to patch tags/release-0.9 ..

One solution would be now to delete this tag, apply the patch to trunk, create branches/branch-0.9, and continue applying any other patches that may come up during this testing period - and when we are finally happy with the codebase then take a snapshot into tags/release-0.9, and keep it read-only.

Another solution is to bend the rules and apply the patch to trunk/ and then merge from the trunk to tags/release-0.9 .

What do you think?

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
Next release - 0.10.0 or 1.0.0 ?
Hi all,

I know it's a trivial issue, but still ... When this release is out, I propose that we should name the next release 1.0.0, and not 0.10.0. The effect is purely psychological, but it also reflects our confidence in the platform.

Many Open Source projects are afraid of going to 1.0.0 and seem to be unable to ever reach this level, as if it were a magic step beyond which they are obliged to make some implied but unjustified promises ... Perhaps it's because in the commercial world everyone knows what a 1.0.0 release means :) The downside of the version numbering that never reaches 1.0.0 is that casual users don't know how usable the software is - e.g. Nutch 0.10.0 could possibly mean that there are still 90 releases to go before it becomes usable.

Therefore I propose the following:

* shorten the release cycle, so that we can make a release at least once every quarter. This was discussed before, and I hope we can make it happen, especially with the help of new forces that joined the team ;)

* call the next version 1.0.0, and continue in increments of 0.1.0 for each bi-monthly or quarterly release,

* make critical bugfix / maintenance releases using increments of 0.0.1 - although the need for such would be greatly diminished with the shorter release cycle.

* once we arrive at versions greater than x.5.0 we should plan for a big release (increment of 1.0.0).

* we should use only single digits for small increments, i.e. limit them to values between 0-9.

What do you think?

--
Best regards,
Andrzej Bialecki
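The numbering rules proposed above can be read as a small set of operations on a three-part version number. The following is only a rough sketch of that reading, not anything that exists in Nutch: the class name ReleaseVersion and its methods are hypothetical. It also assumes that a regular release resets the bugfix digit to 0, and it simply rejects an increment past 9, since the proposal does not say what should happen in that case.

```java
// Illustrative sketch only: ReleaseVersion and its methods are hypothetical
// names, not part of Nutch. It models the numbering rules from the proposal.
final class ReleaseVersion {
    final int major;
    final int minor;
    final int patch;

    ReleaseVersion(int major, int minor, int patch) {
        // Proposal: small increments use single digits only, i.e. 0-9.
        if (major < 0 || minor < 0 || minor > 9 || patch < 0 || patch > 9) {
            throw new IllegalArgumentException("minor and patch must stay within 0-9");
        }
        this.major = major;
        this.minor = minor;
        this.patch = patch;
    }

    /** Critical bugfix / maintenance release: increment of 0.0.1. */
    ReleaseVersion nextBugfix() {
        return new ReleaseVersion(major, minor, patch + 1);
    }

    /** Regular release: increment of 0.1.0 (assumption: bugfix digit resets to 0). */
    ReleaseVersion nextRegular() {
        return new ReleaseVersion(major, minor + 1, 0);
    }

    /** Big release: increment of 1.0.0. */
    ReleaseVersion nextMajor() {
        return new ReleaseVersion(major + 1, 0, 0);
    }

    /** Proposal: once the version is greater than x.5.0, plan for a big release. */
    boolean shouldPlanMajor() {
        return minor > 5 || (minor == 5 && patch > 0);
    }

    @Override
    public String toString() {
        return major + "." + minor + "." + patch;
    }

    public static void main(String[] args) {
        ReleaseVersion v = new ReleaseVersion(1, 0, 0);
        // One regular release followed by one bugfix release.
        System.out.println(v.nextRegular().nextBugfix()); // prints 1.1.1
    }
}
```

Under this reading, a version such as 1.9.0 has no further regular increment available, which is where the "plan for a big release after x.5.0" rule would have to kick in.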
Re: [VOTE] Release Apache Nutch 0.9
Well, it's just going to add more work for me, but in the end, it's probably something that needs to be in there. I could go either way on this though, as in, if we don't commit it, 0.9.1 shouldn't be far off.

Here's my +1 for going ahead and committing it...

On 3/28/07 10:21 AM, Dennis Kubes [EMAIL PROTECTED] wrote:
> Yes. This seems to have fixed the problem. All, do we want to create a JIRA and commit this for the 0.9 release?
>
> Dennis
RE: Next release - 0.10.0 or 1.0.0 ?
Another way of looking at it might be to ask: what would make a great 1.0 release? What new features would be awesome? What might get people more excited? Having a 1.0 might make the project look like it has attained a real milestone.

Steve

-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Wednesday, March 28, 2007 2:38 PM
To: nutch-dev@lucene.apache.org
Subject: Next release - 0.10.0 or 1.0.0 ?

> Hi all,
>
> I know it's a trivial issue, but still ... When this release is out, I propose that we should name the next release 1.0.0, and not 0.10.0.
Re: Next release - 0.10.0 or 1.0.0 ?
+1

Andrzej Bialecki wrote:
> Hi all,
>
> I know it's a trivial issue, but still ... When this release is out, I propose that we should name the next release 1.0.0, and not 0.10.0. The effect is purely psychological, but it also reflects our confidence in the platform.

I think that a 1.0 release signifies maturity. And while I think there are areas that Nutch can and will improve, I think that it has reached the necessary maturity level.

> Many Open Source projects are afraid of going to 1.0.0 and seem to be unable to ever reach this level, as if it were a magic step beyond which they are obliged to make some implied but unjustified promises ... Perhaps it's because in the commercial world everyone knows what a 1.0.0 release means :) The downside of the version numbering that never reaches 1.0.0 is that casual users don't know how usable the software is - e.g. Nutch 0.10.0 could possibly mean that there are still 90 releases to go before it becomes usable.

Personally, I don't like the x.10.x release structure. I guess I think that if you can't get what you need done in 10 releases, x.0.x - x.9.x, then some rework needs to be done. Think about this: Eclipse is still only on 3.2.2 / 3.3 and they use this type of structure.

> Therefore I propose the following:
>
> * shorten the release cycle, so that we can make a release at least once every quarter. This was discussed before, and I hope we can make it happen, especially with the help of new forces that joined the team ;)

I agree.

> * call the next version 1.0.0, and continue in increments of 0.1.0 for each bi-monthly or quarterly release,

I agree with bi-monthly or monthly. I think quarterly is too long, especially considering how fast Hadoop is moving.

> * make critical bugfix / maintenance releases using increments of 0.0.1 - although the need for such would be greatly diminished with the shorter release cycle.

Yes, but some bug fixes will still be necessary even with shortened release cycles.

> * once we arrive at versions greater than x.5.0 we should plan for a big release (increment of 1.0.0).

I am fine having 10 releases, x.0 - x.9, per major release. Maybe I don't understand the reason for limiting it to 5 other than. If we do a release every month or so, then about once a year we should have a major X release.

> * we should use only single digits for small increments, i.e. limit them to values between 0-9.

Agree.

> What do you think?
Sequence File Question
Hey guys,

I have a MapReduce job that sets up a directory for PageRank. It iterates over all the segments and then outputs a MapFile containing the data. When I try to open the output directory with another MapReduce job, it fails, saying that it cannot find the path. The path that it is trying to open does not include the part-0 directory. My directory (like all the other directories, for that matter) has the same structure, which is /path/part-0/whatever.

I feel like this is a really stupid error and I have forgotten something that is easily fixed. Any ideas?

Steve
RE: Sequence File Question
Let me actually refine that question: why do some directories, like the linkdb, have a "current" subdirectory, and why do others, like parse_data, not? Is there a convention on this?

Steve

-----Original Message-----
From: Steve Severance [mailto:[EMAIL PROTECTED]
Sent: Wednesday, March 28, 2007 4:11 PM
To: nutch-dev@lucene.apache.org
Subject: Sequence File Question

> Hey guys, I have a mapreduce job that sets up a directory for pagerank. It iterates over all the segments and then outputs a MapFile containing the data.
Re: Sequence File Question
Steve Severance wrote:
> Let me actually refine that question we do some directories like the linkdb have a current, and why do others like parse_data not? Is there a convention on this?

First, to answer your original question: you should use MapFileOutputFormat class for reading such output. It handles these part- subdirectories automatically.

Second, the "current" subdirectory is there in order to properly handle DB updates - or actually replacements - see e.g. CrawlDb.install() method for details. This is not needed in case of segments, which are created once and never updated.

Thirdly, although you didn't ask about it ;) the latest version of Hadoop contains a handy facility called Counters - if you use the PR PowerMethod you need to collect PR from dangling nodes in order to redistribute it later. You can use Counters for this, and save on a separate aggregation step.

--
Best regards,
Andrzej Bialecki
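To illustrate why a "current" subdirectory helps with DB replacements, here is a minimal sketch of a write-new-then-swap update. It is not a copy of CrawlDb.install(): it uses plain java.nio on a local filesystem instead of the Hadoop FileSystem API, the class name CurrentDirSwap is hypothetical, the "old" backup directory name is an assumption, and a plain file named part-00000 stands in for real job output.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustrative sketch only: local-filesystem stand-in for a DB that is
// replaced as a whole on every update. All names here are hypothetical.
final class CurrentDirSwap {

    /** Replaces db/current with freshOutput, parking the previous copy in db/old during the swap. */
    static void install(Path db, Path freshOutput) throws IOException {
        Path current = db.resolve("current");
        Path old = db.resolve("old");
        Files.createDirectories(db);
        if (Files.exists(current)) {
            deleteRecursively(old);       // drop a stale backup, if any
            Files.move(current, old);     // keep the previous version aside
        }
        Files.move(freshOutput, current); // readers of db/current now see the new data
        deleteRecursively(old);           // the backup is no longer needed
    }

    /** Deletes a directory tree, children before parents; does nothing if it is absent. */
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    /** Installs two generations of output and returns what db/current holds afterwards. */
    static String demo() {
        try {
            Path db = Files.createTempDirectory("linkdb-demo");
            for (String generation : new String[] {"v1", "v2"}) {
                Path out = Files.createTempDirectory("job-output");
                Files.write(out.resolve("part-00000"), generation.getBytes(StandardCharsets.UTF_8));
                install(db, out);
            }
            byte[] data = Files.readAllBytes(db.resolve("current").resolve("part-00000"));
            return new String(data, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

A segment, by contrast, is written once and never replaced, so in this picture it has no need for the extra level of indirection.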
Re: Next release - 0.10.0 or 1.0.0 ?
My +1 for 1.0.0. I already changed it to 0.10.0, but this can be easily reverted, and was probably something that I should have brought to the attention of the dev list before I did that (sorry about that). In any case, I think 1.0.0 makes a lot of sense, politically and software-wise. Nutch is production-quality software (we use it in production environments here at JPL), and deserves to have a 1.0.0 release...

My 2 cents,
Chris

On 3/28/07 11:38 AM, Andrzej Bialecki [EMAIL PROTECTED] wrote:
> Hi all,
>
> I know it's a trivial issue, but still ... When this release is out, I propose that we should name the next release 1.0.0, and not 0.10.0.
Re: [VOTE] Release Apache Nutch 0.9
Hello,

Sounds good. One question: we have already tagged release-0.9, and this release is already on some Nutch mirror sites, where people have downloaded it. So there would be two different nutch-0.9 versions in the world; how could people tell them apart?

Thanks,
David

Chris Mattmann wrote:
> Folks,
>
> Discussing this with Andrzej, and reading his email below, I tend to agree more with this procedure below. I would like to call for a vote to change the existing as-documented procedure (on the wiki) to branch first, do testing in branch (apply patches where needed), and then when the branch is blessed (e.g., 3 binding votes from committers in favor of it), tag it, and make a release.
>
> Sound good?
>
> In terms of next steps with what we have now, that boils down to:
>
> 1. delete tags/release-0.9
> 2. apply patch to trunk
> 3. create branches/branch-0.9
> 4. have dennis test again (large scale)
> 5. if all goes well, finish release process
> 6. tag tags/release-0.9
>
> Thoughts?
>
> Thanks!
>
> Cheers,
> Chris
Re: [VOTE] Release Apache Nutch 0.9
2007/3/28, Andrzej Bialecki [EMAIL PROTECTED]:
> Dennis Kubes wrote:
>> Yes. This seems to have fixed the problem. All, do we want to create a JIRA and commit this for the 0.9 release?
>
> It should definitely go into the release, and we need a patch for the trunk/ .

+1

> Actually, I'm somewhat surprised that we have tags/release-0.9 but we don't yet have branches/branch-0.9 ...

IMO there's no need for a branch before a release.

> I think I'm confused, or the release procedure is confused. My understanding so far was that we first create a branch-0.9, we test the build from that branch and if it passes all tests and the wait period is over, then we copy it to tags/release-0.9 and proclaim a release - which is really a read-only branch, i.e. we don't ever commit any patches to it ... If that were the case, then we still wouldn't have the release-0.9 tag, we could have applied the patch in branch-0.9, plus possibly other patches, and then finally tag this tree as tags/release-0.9 .

IMO we should have had a 0.9-rc1 tag, applied the patch to trunk, tagged 0.9-rc2, and so on until we are satisfied. Then, when we're actually satisfied, create the tag for 0.9 (copied from the rc that got promoted).

What is the benefit of using a branch before a release?

--
Sami Siren