[jira] Created: (NUTCH-664) Possibility to update already stored documents.
Possibility to update already stored documents. --- Key: NUTCH-664 URL: https://issues.apache.org/jira/browse/NUTCH-664 Project: Nutch Issue Type: New Feature Reporter: Sergey Khilkov We have a huge index of stored documents. Fetching a page and merging indexes every time some information about a page changes is an expensive procedure, and that information can change 1-3 times per day. At the moment we have to store the changed info in a database, but then we run into lots of problems with sorting, search restrictions, and so on. Lucene itself allows deleting a single document and adding a new one to an existing index, but there is a problem with Hadoop: as I understand it, the Hadoop filesystem does not support writing at random positions. It would be a great feature if Nutch were able to update an already created index. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
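The constraint described above is not actually fatal: an append-only filesystem like HDFS can still support document updates if an "update" is modeled the way Lucene models it, as a logical delete plus an append of the new version into a fresh, immutable segment. The following is a toy sketch of that idea (class and method names are mine, not Nutch or Lucene code), with lookups scanning segments newest-first so a re-added document shadows its older versions:

```java
import java.util.*;

// Toy sketch: updating documents without ever rewriting a file in place.
// Segments are immutable once written (like Lucene segments on HDFS);
// deletes live in a small side set; updates append a new segment.
public class AppendOnlyIndex {
    private final List<Map<String, String>> segments = new ArrayList<>();
    private final Set<String> deleted = new HashSet<>();

    /** "Update" = clear any pending delete, append the new version. */
    public void update(String id, String content) {
        deleted.remove(id);
        Map<String, String> segment = new HashMap<>();
        segment.put(id, content);
        segments.add(segment);            // append-only write, no rewrite
    }

    /** Pure deletion is just a logical marker. */
    public void delete(String id) {
        deleted.add(id);
    }

    /** Scan newest-first so the latest appended version wins. */
    public String get(String id) {
        if (deleted.contains(id)) return null;
        for (int i = segments.size() - 1; i >= 0; i--) {
            String v = segments.get(i).get(id);
            if (v != null) return v;
        }
        return null;
    }

    public static void main(String[] args) {
        AppendOnlyIndex ix = new AppendOnlyIndex();
        ix.update("page-1", "old content");
        ix.update("page-1", "new content");
        System.out.println(ix.get("page-1"));
    }
}
```

A background merge of small segments (dropping shadowed and deleted entries) would keep lookups cheap, which is essentially what Lucene's segment merging does.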
Re: NUTCH-92
This method of calculating global IDF values certainly sounds more efficient than the currently proposed method. Eliminating one RPC call during the search query (so that only one RPC call is made in total) should reduce the overall load on each search server. I prefer the idea of having network broadcasts go out during initial startup and only thereafter during a topology-changing event. To me this sounds like network routing tables: the initial table is set up during startup and checked periodically for changes. When a change is detected the table is modified (sometimes regenerated completely) and the network continues to operate. The alternative (based on the current patch) is to check the table every time a packet (or perhaps connection) is sent to one of the devices listed inside. That method may detect problems faster, but the additional load would be substantial. With all this said, though, the time needed to research and develop this new method may be substantial, depending on developer availability. We have a proposed solution (albeit not as nice) that did work on older code and may only need a quick refresh to work with trunk (and the future 1.0 release). I would personally like to see NUTCH-92 (or some form of it) included in trunk for a legitimate evaluation before the next release. Sean Dean From: Andrzej Bialecki <[EMAIL PROTECTED]> To: nutch-dev@lucene.apache.org Sent: Tuesday, November 25, 2008 8:04:22 PM Subject: NUTCH-92 Hi all, After reading this paper: http://wortschatz.uni-leipzig.de/~fwitschel/papers/ipm1152.pdf I came up with the following idea for implementing global IDF in Nutch. The upside of the approach I propose is that it brings the cost of making a search query back down to 1 RPC call.
The downside is that the search servers need to cache global IDF estimates as computed by the DS.Client, which either ties them to a single query front-end (DistributedSearch.Client) or requires keeping a per-client map of this data on each search server.

- First, as the paper above claims, we don't really need exact IDF values of all terms from every index. We should get acceptable quality if we only learn the top-N frequent terms, and for the rest apply a smoothing function based on global characteristics of each index (such as the number of terms in the index). This means that the data collected by the query integrator (DS.Client in Nutch) from shard servers (DS.Server in Nutch) would consist of a list of e.g. the top 500 local terms with their frequencies, plus the local smoothing factor as a single value. We could further reduce the amount of data sent from/to shard servers by encoding this information in a counted Bloom filter with single-byte resolution (or a spectral Bloom filter, whichever yields better precision per bit in our case). The query integrator would ask all active shard servers to provide their local IDF data, compute global IDFs for these terms plus a global smoothing factor, and send the updated information back to each shard server. This would happen once per lifetime of a local shard, and is needed because of the local query rewriting (the expansion of terms from a Nutch Query to a Lucene Query). Shard servers would then process incoming queries using the IDF estimates for terms included in the global IDF data, and the global smoothing factor (or local IDFs) for terms missing from that data.
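The aggregation step above — merge per-shard top-N document frequencies, derive global IDFs, and fall back to a single smoothing value for everything else — can be sketched in a few lines. This is a minimal illustration with assumed names and an assumed IDF formula (log(N / (1 + df)), with the smoothing value taken as the IDF of a term seen in one document); the actual Nutch/Lucene scoring formula may differ:

```java
import java.util.*;

// Sketch of the query integrator's aggregation step (names are assumptions,
// not the DistributedSearch API): combine per-shard top-N document
// frequencies into global IDF estimates, plus one smoothing fallback.
public class GlobalIdfEstimator {

    /** Merge shard reports (term -> local doc frequency) into global IDFs. */
    public static Map<String, Double> estimate(List<Map<String, Integer>> shardTopTerms,
                                               long totalDocs) {
        Map<String, Long> globalDf = new HashMap<>();
        for (Map<String, Integer> shard : shardTopTerms) {
            // A term's global df is the sum of its per-shard dfs.
            shard.forEach((term, df) -> globalDf.merge(term, (long) df, Long::sum));
        }
        Map<String, Double> idf = new HashMap<>();
        globalDf.forEach((term, df) ->
            idf.put(term, Math.log((double) totalDocs / (1 + df))));
        return idf;
    }

    /** Smoothing factor for terms outside every shard's top-N: df assumed 1. */
    public static double smoothing(long totalDocs) {
        return Math.log(totalDocs / 2.0);
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> shards =
            List.of(Map.of("nutch", 3), Map.of("nutch", 2, "hadoop", 5));
        System.out.println(estimate(shards, 100));
    }
}
```

In the real proposal the shard reports would travel as counted Bloom filters rather than explicit maps, but the merge-and-smooth logic stays the same.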
The global IDF data would have to be recomputed each time the set of shards available to a DS.Client changes, and then broadcast back from the client to all servers - which is the downside of this solution, because servers need to keep a cache of this information for every DS.Client (each of them possibly having a different list of shard servers, hence different IDFs). Also, as shard servers come and go, the IDF data keeps being recomputed and broadcast, which increases the traffic between the client and servers. Still, I believe the amount of additional traffic should be minimal in the typical scenario, where changes to the shards are much less frequent than user queries. :) -- Now, if this approach seems viable (please comment on this), what should we do with the patches in NUTCH-92? 1. Skip them for now, wait until the above approach is implemented, and pay the penalty of using skewed local IDFs. 2. Apply them now, pay the penalty of an additional RPC call per search, and replace this mechanism with the one described above whenever that becomes available. -- Best regards, Andrzej Bialecki <>< Information Retrieval, Semantic Web | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
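On the server side, the per-client caching that this scheme requires amounts to one IDF snapshot per query front-end, swapped out wholesale whenever that client broadcasts a recomputed table after a shard-set change. A minimal sketch, with assumed names (this is not the DistributedSearch API):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a search server's per-DS.Client IDF cache (assumed names).
// Each client may see a different shard set, hence a different IDF table;
// a broadcast replaces that client's whole snapshot atomically.
public class IdfCache {
    /** One immutable IDF table per query front-end, keyed by client id. */
    private final Map<String, Map<String, Double>> byClient = new ConcurrentHashMap<>();

    /** Called when a client broadcasts freshly recomputed global IDF data. */
    public void replace(String clientId, Map<String, Double> idfTable) {
        byClient.put(clientId, Map.copyOf(idfTable));  // swap in new snapshot
    }

    /** IDF for a term, or the smoothing fallback when the term is unknown
     *  or no broadcast from this client has arrived yet. */
    public double idf(String clientId, String term, double smoothing) {
        Map<String, Double> table = byClient.get(clientId);
        if (table == null) return smoothing;
        return table.getOrDefault(term, smoothing);
    }
}
```

Because each broadcast replaces the snapshot as a single map reference, query threads never observe a half-updated table, and a stale snapshot is simply used until the next broadcast lands.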
[jira] Commented: (NUTCH-663) Upgrade Nutch to use Hadoop 0.18.2
[ https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650713#action_12650713 ] Dennis Kubes commented on NUTCH-663: @buddha1021 The 1.0 release for Nutch has some of the features for Nutch 2, but it is not a complete Nutch 2 architecture. We felt it was best to add some needed features into the current version of Nutch and get them deployed to the community quickly. A lot of people have been asking about the development of Nutch and releasing. The truth is we have just been busy adding in needed features and patches. We should have a release out in the next couple of weeks. That will be a 1.0 release for Nutch but will probably contain an 0.18.2 or 0.19 release of Hadoop. We aren't waiting for Hadoop to go to 1.0. @Doğacan Güney I am not opposed to waiting for 0.19 as long as it will be released soon. I was looking and it seems they tried to release a little while back and didn't finish because of some big errors. > Upgrade Nutch to use Hadoop 0.18.2 > -- > > Key: NUTCH-663 > URL: https://issues.apache.org/jira/browse/NUTCH-663 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.0.0 > Environment: All >Reporter: Dennis Kubes >Assignee: Dennis Kubes > Fix For: 1.0.0 > > > Upgrade Nutch to use a newer hadoop, version 0.18.2. This includes > performance improvements, bug fixes, and new functionality. Changes some > current APIs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[Nutch Wiki] Update of "johnroman" by johnroman
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification. The following page has been changed by johnroman: http://wiki.apache.org/nutch/johnroman New page: John Roman is a sysadmin for the R&D arm of Lexmark International. Some of his contributions include bugfix documentation and troubleshooting, as well as an attempt to clean up a lot of the tutorials.
[jira] Commented: (NUTCH-663) Upgrade Nutch to use Hadoop 0.18.2
[ https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650505#action_12650505 ] buddha1021 commented on NUTCH-663: -- Hi: I found the Nutch2Architecture page in the wiki, which says that the next release will remove the plugin framework. Is that true? Also, Nutch has not been updated in a long time; when will the next release be available? I think the next release will be the stable version, and Nutch builds on Hadoop and Lucene. Lucene has been updated to 2.4.0. So, is Nutch waiting for Hadoop to update to 1.0.0? Thanks!!! > Upgrade Nutch to use Hadoop 0.18.2 > -- > > Key: NUTCH-663 > URL: https://issues.apache.org/jira/browse/NUTCH-663 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.0.0 > Environment: All >Reporter: Dennis Kubes >Assignee: Dennis Kubes > Fix For: 1.0.0 > > > Upgrade Nutch to use a newer hadoop, version 0.18.2. This includes > performance improvements, bug fixes, and new functionality. Changes some > current APIs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.