Re: Pending Commits for Nutch Issues
I agree with John too. You probably meant $0.02, since 0.02 cents is too little. It is usually 2 cents. :-P

Regards,
Susam Pal

On Tue, Dec 2, 2008 at 6:09 PM, John Martyniak <[EMAIL PROTECTED]> wrote:
> Is NUTCH-442 going to be part of the 1.0 release? I hope so, Nutch/Solr
> integration would be huge.
>
> just my .02 cents.
>
> -John
>
> On Nov 27, 2008, at 12:10 PM, Doğacan Güney wrote:
>
>> And here is a list of issues from me that need more discussion/review:
>>
>> NUTCH-442 - Integrate Nutch/Solr: If NUTCH-442 is too complex for
>> people to review, for now we can just write a SolrIndexer like Sami
>> Siren's and deal with 442 after 1.0. I would be happy to provide such
>> a patch.
>>
>> NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException: I
>> don't know how to fix this one but indexing almost always fails with
>> index-more enabled.
>>
>> NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate
>> fetch interval correctly: I botched it once so now I am afraid to
>> commit it :D
>>
>> NUTCH-626 - fetcher2 breaks out the domain with
>> db.ignore.external.links set at cross domain redirects: I am going to
>> update the patch and commit it if there are no objections.
>>
>> Also, I think NUTCH-658 would be a nice feature for 1.0.
>>
>> There are some others but these are the most recent and we really
>> should push 1.0 out the door already :D
>>
>> Oh and finally we should do a review of all libraries in Nutch
>> (libraries in plugins included) and update them to latest versions. I
>> am going to open an issue with the intention of updating all the
>> libraries that do not require code changes.
>>
>> --
>> Doğacan Güney
Re: Pending Commits for Nutch Issues
I agree with John. NUTCH-442 is by far the most popular/watched item in JIRA and, I think, has already been used by enough different people to be deemed reliable.

Julien

2008/12/2 John Martyniak <[EMAIL PROTECTED]>
> Is NUTCH-442 going to be part of the 1.0 release? I hope so, Nutch/Solr
> integration would be huge.
>
> just my .02 cents.
>
> -John
>
> On Nov 27, 2008, at 12:10 PM, Doğacan Güney wrote:
>
>> And here is a list of issues from me that need more discussion/review:
>>
>> NUTCH-442 - Integrate Nutch/Solr: If NUTCH-442 is too complex for
>> people to review, for now we can just write a SolrIndexer like Sami
>> Siren's and deal with 442 after 1.0. I would be happy to provide such
>> a patch.
>>
>> NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException: I
>> don't know how to fix this one but indexing almost always fails with
>> index-more enabled.
>>
>> NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate
>> fetch interval correctly: I botched it once so now I am afraid to
>> commit it :D
>>
>> NUTCH-626 - fetcher2 breaks out the domain with
>> db.ignore.external.links set at cross domain redirects: I am going to
>> update the patch and commit it if there are no objections.
>>
>> Also, I think NUTCH-658 would be a nice feature for 1.0.
>>
>> There are some others but these are the most recent and we really
>> should push 1.0 out the door already :D
>>
>> Oh and finally we should do a review of all libraries in Nutch
>> (libraries in plugins included) and update them to latest versions. I
>> am going to open an issue with the intention of updating all the
>> libraries that do not require code changes.
>>
>> --
>> Doğacan Güney

--
DigitalPebble Ltd
http://www.digitalpebble.com
Re: Pending Commits for Nutch Issues
Is NUTCH-442 going to be part of the 1.0 release? I hope so, Nutch/Solr integration would be huge.

just my .02 cents.

-John

On Nov 27, 2008, at 12:10 PM, Doğacan Güney wrote:

> And here is a list of issues from me that need more discussion/review:
>
> NUTCH-442 - Integrate Nutch/Solr: If NUTCH-442 is too complex for
> people to review, for now we can just write a SolrIndexer like Sami
> Siren's and deal with 442 after 1.0. I would be happy to provide such
> a patch.
>
> NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException: I
> don't know how to fix this one but indexing almost always fails with
> index-more enabled.
>
> NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate
> fetch interval correctly: I botched it once so now I am afraid to
> commit it :D
>
> NUTCH-626 - fetcher2 breaks out the domain with
> db.ignore.external.links set at cross domain redirects: I am going to
> update the patch and commit it if there are no objections.
>
> Also, I think NUTCH-658 would be a nice feature for 1.0.
>
> There are some others but these are the most recent and we really
> should push 1.0 out the door already :D
>
> Oh and finally we should do a review of all libraries in Nutch
> (libraries in plugins included) and update them to latest versions. I
> am going to open an issue with the intention of updating all the
> libraries that do not require code changes.
>
> --
> Doğacan Güney
Re: Pending Commits for Nutch Issues
Doğacan Güney wrote:
> Hi Dennis,
>
> On Wed, Nov 26, 2008 at 11:42 PM, Dennis Kubes <[EMAIL PROTECTED]> wrote:
>> If nobody has a problem with them I would like to commit the following
>> issues in the next day or two:
>>
>> NUTCH-663: Upgrade Nutch to the most recent Hadoop version (0.19)
>> NUTCH-662: Upgrade Nutch to the most recent Lucene version (2.4)
>> NUTCH-647: Resolve URLs tool
>> NUTCH-665: Search Load Testing Tool
>> NUTCH-667: Input Format for working with Content in Hadoop Streaming
>>
>> And I would like to commit these in < a week:
>>
>> NUTCH-635: LinkAnalysis Tool for Nutch
>> NUTCH-646: New Indexing framework for Nutch
>> NUTCH-594: Serve Nutch search results in XML and JSON
>> NUTCH-666: Analysis plugins and new language identifier.
>>
>> There are others too but these are the ones I am trying to get moved into
>> trunk right now.
>
> I am OK with all but NUTCH-666... Why a new language identifier? (or
> if a new one, why keep old one around?)

I haven't pushed the code out yet. I do have a production version running, but I need to make it play nice with the Apache licensing requirements. The library I am currently using is under the GPL.

The reason I switched was that the old one wasn't working correctly for me. I don't know the accuracy levels of the old language identifier, but I found that it would often classify pages containing both English and another language as English. The new language identifier I am currently using has an accuracy rate of 97% and is trainable, as before, for multiple languages. Currently we have models for 20-30 languages.

The new language identifier also works with the new indexing framework and with the new functionality for custom fields. The only reason I would keep the old one around would be backwards compatibility for people currently using it.

I will push out a patch shortly and we can review it. If we don't want it to make it into this release, I am OK with that.

>> Dennis

Dennis
Re: Pending Commits for Nutch Issues
Doğacan Güney wrote:
> I forgot: I think there is a huge bug with MapWritable in nutch. I
> didn't yet figure out what it is exactly but it has something to do
> with the fact that id->class maps are static.

Hadoop now has its own implementation of MapWritable, which doesn't use static mappings. We should probably switch to this implementation, although we would have to solve the back-compat issues of accessing old data produced with Nutch's MapWritable.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
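To make the concern about static id->class maps a little more concrete, here is a minimal sketch of one way a process-wide id table could go wrong. The thread does not confirm that this is the actual bug, and this is not the code of Nutch's or Hadoop's MapWritable: `SharedIdRegistry`, `write_record`, `read_record` and the self-describing variants are hypothetical names, and the class names are plain illustrative strings.

```python
class SharedIdRegistry:
    """Models a process-wide (static) table where ids depend on registration order."""

    def __init__(self):
        self._name_to_id = {}
        self._id_to_name = {}

    def id_for(self, class_name):
        # Assign the next free id the first time a class name is seen.
        if class_name not in self._name_to_id:
            new_id = len(self._name_to_id) + 1
            self._name_to_id[class_name] = new_id
            self._id_to_name[new_id] = class_name
        return self._name_to_id[class_name]

    def name_for(self, class_id):
        return self._id_to_name[class_id]


def write_record(registry, class_name, value):
    # Hypothetical format: only the id is stored, not the id table itself.
    return (registry.id_for(class_name), value)


def read_record(registry, record):
    class_id, value = record
    return (registry.name_for(class_id), value)


def write_self_describing(class_name, value):
    # Alternative: every record carries its own id table, so no shared state is needed.
    return ({1: class_name}, 1, value)


def read_self_describing(record):
    table, class_id, value = record
    return (table[class_id], value)


# "Writer" process: happens to see Text first, then IntWritable.
writer = SharedIdRegistry()
writer.id_for("Text")
record = write_record(writer, "IntWritable", 42)

# "Reader" process: registered the same classes in the opposite order.
reader = SharedIdRegistry()
reader.id_for("IntWritable")
reader.id_for("Text")
decoded = read_record(reader, record)  # the id now resolves to the wrong class name
```

In this toy model the reader resolves the stored id to a different class than the writer intended, while the self-describing variant round-trips correctly because nothing depends on shared registration order.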
Re: Pending Commits for Nutch Issues
OK one last thing: Get rid of Fetcher and promote Fetcher2 to be the default fetcher.

On Thu, Nov 27, 2008 at 7:15 PM, Doğacan Güney <[EMAIL PROTECTED]> wrote:
> I forgot: I think there is a huge bug with MapWritable in Nutch. I
> haven't yet figured out what it is exactly but it has something to do
> with the fact that id->class maps are static.
>
> On Thu, Nov 27, 2008 at 7:10 PM, Doğacan Güney <[EMAIL PROTECTED]> wrote:
>> And here is a list of issues from me that need more discussion/review:
>>
>> NUTCH-442 - Integrate Nutch/Solr: If NUTCH-442 is too complex for
>> people to review, for now we can just write a SolrIndexer like Sami
>> Siren's and deal with 442 after 1.0. I would be happy to provide such
>> a patch.
>>
>> NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException: I
>> don't know how to fix this one but indexing almost always fails with
>> index-more enabled.
>>
>> NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate
>> fetch interval correctly: I botched it once so now I am afraid to
>> commit it :D
>>
>> NUTCH-626 - fetcher2 breaks out the domain with
>> db.ignore.external.links set at cross domain redirects: I am going to
>> update the patch and commit it if there are no objections.
>>
>> Also, I think NUTCH-658 would be a nice feature for 1.0.
>>
>> There are some others but these are the most recent and we really
>> should push 1.0 out the door already :D
>>
>> Oh and finally we should do a review of all libraries in Nutch
>> (libraries in plugins included) and update them to latest versions. I
>> am going to open an issue with the intention of updating all the
>> libraries that do not require code changes.
>>
>> --
>> Doğacan Güney
>
> --
> Doğacan Güney

--
Doğacan Güney
Re: Pending Commits for Nutch Issues
I forgot: I think there is a huge bug with MapWritable in Nutch. I haven't yet figured out what it is exactly but it has something to do with the fact that id->class maps are static.

On Thu, Nov 27, 2008 at 7:10 PM, Doğacan Güney <[EMAIL PROTECTED]> wrote:
> And here is a list of issues from me that need more discussion/review:
>
> NUTCH-442 - Integrate Nutch/Solr: If NUTCH-442 is too complex for
> people to review, for now we can just write a SolrIndexer like Sami
> Siren's and deal with 442 after 1.0. I would be happy to provide such
> a patch.
>
> NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException: I
> don't know how to fix this one but indexing almost always fails with
> index-more enabled.
>
> NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate
> fetch interval correctly: I botched it once so now I am afraid to
> commit it :D
>
> NUTCH-626 - fetcher2 breaks out the domain with
> db.ignore.external.links set at cross domain redirects: I am going to
> update the patch and commit it if there are no objections.
>
> Also, I think NUTCH-658 would be a nice feature for 1.0.
>
> There are some others but these are the most recent and we really
> should push 1.0 out the door already :D
>
> Oh and finally we should do a review of all libraries in Nutch
> (libraries in plugins included) and update them to latest versions. I
> am going to open an issue with the intention of updating all the
> libraries that do not require code changes.
>
> --
> Doğacan Güney

--
Doğacan Güney
Re: Pending Commits for Nutch Issues
And here is a list of issues from me that need more discussion/review:

NUTCH-442 - Integrate Nutch/Solr: If NUTCH-442 is too complex for people to review, for now we can just write a SolrIndexer like Sami Siren's and deal with 442 after 1.0. I would be happy to provide such a patch.

NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException: I don't know how to fix this one but indexing almost always fails with index-more enabled.

NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate fetch interval correctly: I botched it once so now I am afraid to commit it :D

NUTCH-626 - fetcher2 breaks out the domain with db.ignore.external.links set at cross domain redirects: I am going to update the patch and commit it if there are no objections.

Also, I think NUTCH-658 would be a nice feature for 1.0.

There are some others but these are the most recent and we really should push 1.0 out the door already :D

Oh and finally we should do a review of all libraries in Nutch (libraries in plugins included) and update them to latest versions. I am going to open an issue with the intention of updating all the libraries that do not require code changes.

--
Doğacan Güney
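The NUTCH-626 item above concerns redirects that leave the original site while db.ignore.external.links is set. As a rough illustration only, the sketch below shows the kind of check a fetcher could apply to a redirect target, assuming "external" means "on a different host". `should_follow_redirect` is a hypothetical helper, not Fetcher2's code and not the NUTCH-626 patch.

```python
from urllib.parse import urljoin, urlsplit


def should_follow_redirect(from_url, to_url, ignore_external_links):
    """Return True if a redirect from from_url to to_url may be followed.

    Assumption: with ignore_external_links enabled, a redirect is allowed
    only when the target stays on the same host as the source.
    """
    if not ignore_external_links:
        return True
    # Resolve relative redirect targets against the source URL first.
    resolved = urljoin(from_url, to_url)
    from_host = urlsplit(from_url).hostname
    to_host = urlsplit(resolved).hostname
    return from_host is not None and from_host == to_host
```

For example, `should_follow_redirect("http://example.com/a", "http://other.example.org/", True)` is rejected under this assumption, while a relative target such as `"/b"` is resolved against the source URL and stays on the same host.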
Re: Pending Commits for Nutch Issues
Hi Dennis,

On Wed, Nov 26, 2008 at 11:42 PM, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> If nobody has a problem with them I would like to commit the following
> issues in the next day or two:
>
> NUTCH-663: Upgrade Nutch to the most recent Hadoop version (0.19)
> NUTCH-662: Upgrade Nutch to the most recent Lucene version (2.4)
> NUTCH-647: Resolve URLs tool
> NUTCH-665: Search Load Testing Tool
> NUTCH-667: Input Format for working with Content in Hadoop Streaming
>
> And I would like to commit these in < a week:
>
> NUTCH-635: LinkAnalysis Tool for Nutch
> NUTCH-646: New Indexing framework for Nutch
> NUTCH-594: Serve Nutch search results in XML and JSON
> NUTCH-666: Analysis plugins and new language identifier.
>
> There are others too but these are the ones I am trying to get moved into
> trunk right now.

I am OK with all but NUTCH-666... Why a new language identifier? (or if a new one, why keep old one around?)

> Dennis

--
Doğacan Güney
Pending Commits for Nutch Issues
If nobody has a problem with them I would like to commit the following issues in the next day or two:

NUTCH-663: Upgrade Nutch to the most recent Hadoop version (0.19)
NUTCH-662: Upgrade Nutch to the most recent Lucene version (2.4)
NUTCH-647: Resolve URLs tool
NUTCH-665: Search Load Testing Tool
NUTCH-667: Input Format for working with Content in Hadoop Streaming

And I would like to commit these in < a week:

NUTCH-635: LinkAnalysis Tool for Nutch
NUTCH-646: New Indexing framework for Nutch
NUTCH-594: Serve Nutch search results in XML and JSON
NUTCH-666: Analysis plugins and new language identifier.

There are others too but these are the ones I am trying to get moved into trunk right now.

Dennis