[jira] Created: (NUTCH-664) Possibility to update already stored documents.

2008-11-25 Thread Sergey Khilkov (JIRA)
Possibility to update already stored documents.
---

 Key: NUTCH-664
 URL: https://issues.apache.org/jira/browse/NUTCH-664
 Project: Nutch
  Issue Type: New Feature
Reporter: Sergey Khilkov


We have a huge index of stored documents, and it is a high-cost procedure 
to fetch a page and merge indexes every time we update some information 
about that page. The information can change 1-3 times per day. At the 
moment we have to store the changed info in a database, but in that case 
we have lots of problems with sorting, search restrictions, and so on. 
Lucene itself allows deleting a single document and adding a new one to an 
existing index, but there is a problem with Hadoop... As I understand it, 
the Hadoop filesystem provides no way to write at random positions. Still, 
it would be a great feature if Nutch were able to update an already 
created index.
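
For reference, the single-document update Lucene supports looks roughly 
like this (a minimal sketch against the Lucene 2.x API; the "url" and 
"content" field names are assumptions, not Nutch's actual schema, and the 
catch remains that the index must live on a filesystem allowing in-place 
writes):

  import java.io.IOException;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.Term;

  // Atomically replace the document keyed by its "url" field:
  // updateDocument() deletes the old document and adds the new one.
  public static void replaceDocument(String indexPath, String url,
      String content) throws IOException {
    IndexWriter writer =
        new IndexWriter(indexPath, new StandardAnalyzer(), false);
    Document doc = new Document();
    doc.add(new Field("url", url, Field.Store.YES,
        Field.Index.UN_TOKENIZED));
    doc.add(new Field("content", content, Field.Store.NO,
        Field.Index.TOKENIZED));
    writer.updateDocument(new Term("url", url), doc);
    writer.close();
  }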

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: NUTCH-92

2008-11-25 Thread Sean Dean
This method of calculating global IDF values certainly sounds more 
efficient than the currently proposed method. Eliminating one RPC call 
during the search query (so that only 1 RPC call is made in total) should 
reduce the overall load on each search server. I prefer the idea of having 
network broadcasts go out during the initial startup and thereafter only 
during a topology-changing event.

To me this sounds a bit like network routing tables: the initial table is 
set up during startup and checked periodically for changes. When a change 
is detected the table is modified (sometimes regenerated completely) and 
the network continues to operate. The alternative (based on the current 
patch) is to check the table every time a packet (or maybe connection) is 
sent to one of the devices listed inside. That method may detect problems 
faster, but the additional load would be substantial.

With all this said, though, researching and developing this new method may 
take an extended period of time depending on developer availability. We 
have a proposed solution (albeit not as nice) that did work on older code 
and may only need a quick refresh to work with trunk (and the future 1.0 
release). I would personally like to see NUTCH-92 (or some form of it) 
included in trunk for a legitimate evaluation before the next release.


Sean Dean





From: Andrzej Bialecki <[EMAIL PROTECTED]>
To: nutch-dev@lucene.apache.org
Sent: Tuesday, November 25, 2008 8:04:22 PM
Subject: NUTCH-92

Hi all,

After reading this paper:

http://wortschatz.uni-leipzig.de/~fwitschel/papers/ipm1152.pdf

I came up with the following idea for implementing global IDF in Nutch. 
The upside of the approach I propose is that it brings the cost of making 
a search query back down to 1 RPC call. The downside is that the search 
servers need to cache global IDF estimates as computed by the DS.Client, 
which ties them to a single query front-end (DistributedSearch.Client), 
or requires keeping a map of <DS.Client, IDF data> on each search server.

-

First, as the paper above claims, we don't really need exact IDF values 
of all terms from every index. We should get acceptable quality if we 
only learn the top-N frequent terms, and for the rest of them we apply a 
smoothing function that is based on global characteristics of each index 
(such as the number of terms in the index).

This means that the data that needs to be collected by the query 
integrator (DS.Client in Nutch) from shard servers (DS.Server in Nutch) 
would consist of a list of e.g. top 500 local terms with their 
frequency, plus the local smoothing factor as a single value.
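
As a rough illustration, the payload each shard returns might look like 
this (hypothetical names, not existing Nutch classes):

  import java.util.HashMap;
  import java.util.Map;

  // Hypothetical per-shard summary: document frequencies of the top-N
  // terms plus one smoothing factor derived from index-wide statistics.
  public class LocalIdfSummary {
    public final Map<String, Integer> topTermDocFreqs =
        new HashMap<String, Integer>();
    public final int numDocs;            // total documents in this shard
    public final float smoothingFactor;  // e.g. based on total term count

    public LocalIdfSummary(int numDocs, float smoothingFactor) {
      this.numDocs = numDocs;
      this.smoothingFactor = smoothingFactor;
    }
  }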

We could further reduce the amount of data to be sent from/to shard 
servers by encoding this information in a counted Bloom filter with a 
single-byte resolution (or a spectral Bloom filter, whichever yields a 
better precision / bit in our case).
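
For illustration, a counted Bloom filter with single-byte counters could 
be as simple as the following (a from-scratch sketch, not an existing 
Nutch or Hadoop class; frequencies would need to be quantized, e.g. 
log-scaled, to fit in one byte):

  // k counter slots per key; add() saturates at 255, estimate() takes
  // the minimum over the slots (collisions overestimate, never under).
  public class CountedBloomFilter {
    private final byte[] counters;
    private final int numHashes;

    public CountedBloomFilter(int numSlots, int numHashes) {
      this.counters = new byte[numSlots];
      this.numHashes = numHashes;
    }

    // Simple double hashing; real code would use stronger hash functions.
    private int slot(String key, int i) {
      int h1 = key.hashCode();
      int h2 = (h1 >>> 16) | 1;                  // odd, so probes differ
      int idx = (h1 + i * h2) % counters.length;
      return idx < 0 ? idx + counters.length : idx;
    }

    public void add(String key, int count) {
      for (int i = 0; i < numHashes; i++) {
        int s = slot(key, i);
        counters[s] = (byte) Math.min((counters[s] & 0xFF) + count, 255);
      }
    }

    public int estimate(String key) {
      int min = 255;
      for (int i = 0; i < numHashes; i++) {
        min = Math.min(min, counters[slot(key, i)] & 0xFF);
      }
      return min;
    }
  }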

The query integrator would ask all active shard servers to provide their 
local IDF data, and it would compute global IDFs for these terms, plus a 
global smoothing factor, and send back the updated information to each 
shard server. This would happen once per lifetime of a local shard, and 
is needed because of the local query rewriting (and expansion of terms 
from Nutch Query to Lucene Query).
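
The merge step on the query integrator could then be little more than 
summing document frequencies across shards and applying a Lucene-style 
formula (a sketch reusing the hypothetical LocalIdfSummary above; the 
exact smoothing is open for discussion):

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  // Combine per-shard summaries into global IDF estimates.
  public static Map<String, Float> mergeGlobalIdf(
      List<LocalIdfSummary> shards) {
    long totalDocs = 0;
    Map<String, Long> globalDf = new HashMap<String, Long>();
    for (LocalIdfSummary s : shards) {
      totalDocs += s.numDocs;
      for (Map.Entry<String, Integer> e : s.topTermDocFreqs.entrySet()) {
        Long df = globalDf.get(e.getKey());
        globalDf.put(e.getKey(), (df == null ? 0L : df) + e.getValue());
      }
    }
    Map<String, Float> idf = new HashMap<String, Float>();
    for (Map.Entry<String, Long> e : globalDf.entrySet()) {
      // same shape as Lucene's DefaultSimilarity.idf()
      idf.put(e.getKey(),
          (float) (Math.log(totalDocs / (double) (e.getValue() + 1)) + 1.0));
    }
    return idf;
  }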

Shard servers would then process incoming queries using the IDF 
estimates for terms included in the global IDF data, or the global 
smoothing factors for terms missing from that data (or use local IDFs).
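
On the shard side, per-term resolution would then be a simple fallback 
chain (again just a sketch with made-up names):

  import java.util.Map;

  // Resolve the IDF estimate for one query term on a shard server.
  public static float idfFor(String term, Map<String, Float> globalIdf,
                             float globalSmoothingFactor) {
    Float idf = globalIdf.get(term);
    if (idf != null) {
      return idf.floatValue();      // term is in the global top-N data
    }
    return globalSmoothingFactor;   // rare term: fall back to smoothing
                                    // (or to the local IDF, as noted above)
  }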

The global IDF data would have to be recomputed each time the set of 
shards available to a DS.Client changes, and then it needs to be 
broadcast back from the client to all servers - which is the downside of 
this solution, because servers need to keep a cache of this information 
for every DS.Client (each of them possibly having a different list of 
shard servers, hence different IDFs). Also, as shard servers come and 
go, the IDF data keeps being recomputed and broadcast, which increases 
the traffic between the client and servers.
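
On the client side, the trigger for recomputation could be as simple as 
fingerprinting the shard list (sketch, hypothetical names):

  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.List;

  // Recompute and re-broadcast global IDF data only when the set of
  // reachable shard servers actually changes.
  public class IdfRefreshTrigger {
    private int lastFingerprint = 0;

    public void onShardListUpdate(List<String> liveShardAddresses) {
      List<String> sorted = new ArrayList<String>(liveShardAddresses);
      Collections.sort(sorted);        // order-independent fingerprint
      int fp = sorted.hashCode();
      if (fp != lastFingerprint) {
        lastFingerprint = fp;
        recomputeAndBroadcast(sorted); // collect, merge, push back out
      }
    }

    private void recomputeAndBroadcast(List<String> shards) {
      // placeholder: fetch each shard's LocalIdfSummary, merge them
      // (e.g. via mergeGlobalIdf above), send the result to every shard
    }
  }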

Still, I believe the amount of additional traffic should be minimal in a 
typical scenario, where changes to the set of shards are much less 
frequent than user queries. :)

--

Now, if this approach seems viable (please comment on this), what should 
we do with the patches in NUTCH-92?

1. Skip them for now, wait until the above approach is implemented, and 
pay the penalty of using skewed local IDFs.

2. Apply them now, pay the penalty of an additional RPC call per search, 
and replace this mechanism with the one described above whenever it 
becomes available.

-- 
Best regards,
Andrzej Bialecki <><
  ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[jira] Commented: (NUTCH-663) Upgrade Nutch to use Hadoop 0.18.2

2008-11-25 Thread Dennis Kubes (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650713#action_12650713 ]

Dennis Kubes commented on NUTCH-663:


@buddha1021
The 1.0 release for Nutch has some of the features planned for Nutch 2, 
but it is not a complete Nutch 2 architecture.  We felt it was best to add 
some needed features to the current version of Nutch and get them deployed 
to the community quickly.  A lot of people have been asking about Nutch 
development and releases.  Truth is we have just been busy adding in 
needed features and patches.  We should have a release out in the next 
couple of weeks.  That will be a 1.0 release for Nutch but will probably 
depend on an 0.18.2 or 0.19 release of Hadoop. We aren't waiting for 
Hadoop to go to 1.0.

@Doğacan Güney
I am not opposed to waiting for 0.19 as long as it will be released soon.  
I was looking, and it seems they tried to release a little while back and 
didn't finish because of some big errors.

> Upgrade Nutch to use Hadoop 0.18.2
> --
>
> Key: NUTCH-663
> URL: https://issues.apache.org/jira/browse/NUTCH-663
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.0.0
> Environment: All
>Reporter: Dennis Kubes
>Assignee: Dennis Kubes
> Fix For: 1.0.0
>
>
> Upgrade Nutch to use a newer hadoop, version 0.18.2.  This includes 
> performance improvements, bug fixes, and new functionality.  Changes some 
> current APIs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[Nutch Wiki] Update of "johnroman" by johnroman

2008-11-25 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by johnroman:
http://wiki.apache.org/nutch/johnroman

New page:
John Roman is a sysadmin for the R&D arm of Lexmark International.  
Some of his contributions include bugfix documentation and 
troubleshooting, as well as an attempt to clean up a lot of the tutorials.


[jira] Commented: (NUTCH-663) Upgrade Nutch to use Hadoop 0.18.2

2008-11-25 Thread buddha1021 (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650505#action_12650505 ]

buddha1021 commented on NUTCH-663:
--

Hi:
I found the Nutch2Architecture page in the wiki, which says that the next 
release will remove the plugin framework. Is that true?
Also, Nutch has not been updated in a long time; when will the next 
release be available? I assume the next release will be the stable 
version. Nutch builds on Hadoop and Lucene, and Lucene has been updated to 
2.4.0, so is Nutch waiting for Hadoop to reach 1.0.0?
Thanks!

> Upgrade Nutch to use Hadoop 0.18.2
> --
>
> Key: NUTCH-663
> URL: https://issues.apache.org/jira/browse/NUTCH-663
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.0.0
> Environment: All
>Reporter: Dennis Kubes
>Assignee: Dennis Kubes
> Fix For: 1.0.0
>
>
> Upgrade Nutch to use a newer hadoop, version 0.18.2.  This includes 
> performance improvements, bug fixes, and new functionality.  Changes some 
> current APIs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.