[jira] Assigned: (NUTCH-3) multi values of header discarded
[ http://issues.apache.org/jira/browse/NUTCH-3?page=all ]

Stefan Groschupf reassigned NUTCH-3:

    Assign To: Stefan Groschupf

> multi values of header discarded
>
>          Key: NUTCH-3
>          URL: http://issues.apache.org/jira/browse/NUTCH-3
>      Project: Nutch
>         Type: Bug
>     Reporter: Stefan Groschupf
>     Assignee: Stefan Groschupf
>
> original by: phoebe
> http://sourceforge.net/tracker/index.php?func=detail&aid=185&group_id=59548&atid=491356
>
> multi values of header discarded
> Each successive setting of a header value deletes the previous one.
> This patch allows multiple values to be retained, such as cookies, using CR LF as a delimiter between values.
>
> --- /tmp/HttpResponse.java  2005-01-27 19:57:55.0 -0500
> +++ HttpResponse.java       2005-01-27 20:45:01.0 -0500
> @@ -324,7 +324,19 @@
>        }
>        String value = line.substring(valueStart);
> -      headers.put(key, value);
> +      // Spec allows multiple values, such as Set-Cookie - using CR LF as delimiter
> +      if (headers.containsKey(key)) {
> +        try {
> +          Object obj = headers.get(key);
> +          if (obj != null) {
> +            String oldvalue = headers.get(key).toString();
> +            value = oldvalue + "\r\n" + value;
> +          }
> +        } catch (Exception e) {
> +          e.printStackTrace();
> +        }
> +      }
> +      headers.put(key, value);
>      }
>
>    private Map parseHeaders(PushbackInputStream in, StringBuffer line)
>
> @@ -399,5 +411,3 @@
>    }
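For readers skimming the patch, here is a small self-contained sketch of the behaviour it introduces (the class and method names below are made up for illustration and are not part of HttpResponse): repeated values for the same key are concatenated with "\r\n", and a consumer splits them back into the individual header values.

    import java.util.HashMap;
    import java.util.Map;

    public class MultiValueHeaderDemo {

      // Append a header value, keeping any earlier value separated by CR LF,
      // mirroring what the NUTCH-3 patch above does inside HttpResponse.
      static void putHeader(Map<String, String> headers, String key, String value) {
        String old = headers.get(key);
        if (old != null) {
          value = old + "\r\n" + value;
        }
        headers.put(key, value);
      }

      public static void main(String[] args) {
        Map<String, String> headers = new HashMap<String, String>();
        putHeader(headers, "Set-Cookie", "a=1");
        putHeader(headers, "Set-Cookie", "b=2");

        // A consumer splits the combined value back into the individual headers.
        for (String v : headers.get("Set-Cookie").split("\r\n")) {
          System.out.println(v); // prints "a=1" then "b=2"
        }
      }
    }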
[jira] Commented: (NUTCH-135) http header meta data are case insensitive in the real world (e.g. Content-Type or content-type)
[ http://issues.apache.org/jira/browse/NUTCH-135?page=comments#action_12360025 ]

Stefan Groschupf commented on NUTCH-135:

Andrzej, that is easy to add to the ContentProperties object, and sure, I can do that. However, I would first love to get an OK for this patch before I invest more time in it, since I have already spent too much time writing code just for the issue archive. As soon as this patch is in the sources I will write a small new patch (as Doug suggested, doing it in small steps) to solve NUTCH-3.
[jira] Commented: (NUTCH-135) http header meta data are case insensitive in the real world (e.g. Content-Type or content-type)
[ http://issues.apache.org/jira/browse/NUTCH-135?page=comments#action_12359961 ]

Andrzej Bialecki commented on NUTCH-135:

Since you are already working on this issue, I'd like to ask you to take a look at NUTCH-3 and see if you can solve that too. The problem described there is that if there are several headers with the same name, only the last value is preserved, but in some cases multiple headers make sense (see any of the existing Java models for handling HTTP or RFC822 mail messages - all of them handle multiple values per single key).
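For comparison with the CR LF approach proposed in NUTCH-3, here is a minimal sketch of the multi-valued style Andrzej refers to (the helper below is illustrative, not an existing Nutch or JDK class): each key maps to a list of values instead of a single string.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class MultiValueMapDemo {

      // Store every value seen for a key, in insertion order.
      static void addHeader(Map<String, List<String>> headers, String key, String value) {
        List<String> values = headers.get(key);
        if (values == null) {
          values = new ArrayList<String>();
          headers.put(key, values);
        }
        values.add(value);
      }

      public static void main(String[] args) {
        Map<String, List<String>> headers = new HashMap<String, List<String>>();
        addHeader(headers, "Set-Cookie", "a=1");
        addHeader(headers, "Set-Cookie", "b=2");
        System.out.println(headers.get("Set-Cookie")); // [a=1, b=2]
      }
    }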
[jira] Updated: (NUTCH-135) http header meta data are case insensitive in the real world (e.g. Content-Type or content-type)
[ http://issues.apache.org/jira/browse/NUTCH-135?page=all ]

Stefan Groschupf updated NUTCH-135:

    Attachment: contentProperties_patch.txt

As Doug suggested, a patch using a TreeMap with String.CASE_INSENSITIVE_ORDER that solves the problem of case-insensitive http headers, or case-insensitive content meta data in general.

In general I see two different ways to solve the problem. The first is to leave the API as it is and extend a Properties object, overriding its methods to use a TreeMap behind the scenes. This solution would also require copying data back and forth between the Properties object and the TreeMap several times, since the Nutch code uses a Properties object in the Content constructor. The other choice is to change the API of the Content object to cleanly document that another object, with different behavior than the Properties object, is used. The downside of this solution is that it requires many small changes across the Nutch code base.

However, I decided on the clean way, the latter, since I don't like code that does things behind the scenes that developers would not expect. So I introduced a tiny ContentProperties object and changed the Content constructor to use the ContentProperties object instead of the java.util.Properties object. The new ContentProperties has a similar API to the Properties class but uses case-insensitive keys. I changed all classes that use the Content object to use the new ContentProperties, up to object instantiation, and I also extended the Content test case to check that case-insensitive keys are now supported.

Feel free to give constructive improvement suggestions, but please let us get this done as soon as possible, since from my point of view this is a critical issue. All test cases pass on my box, but please double check before committing.
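To illustrate the core idea of the attachment (a sketch only - the class below is made up for illustration and is not the actual ContentProperties from contentProperties_patch.txt): a TreeMap built with String.CASE_INSENSITIVE_ORDER makes lookups ignore the case of the key.

    import java.util.Map;
    import java.util.TreeMap;

    public class CaseInsensitiveProperties {

      // TreeMap with a case-insensitive comparator: "Content-Type" and
      // "content-type" resolve to the same entry.
      private final Map<String, String> map =
          new TreeMap<String, String>(String.CASE_INSENSITIVE_ORDER);

      public void setProperty(String key, String value) {
        map.put(key, value);
      }

      public String getProperty(String key) {
        return map.get(key);
      }

      public static void main(String[] args) {
        CaseInsensitiveProperties props = new CaseInsensitiveProperties();
        props.setProperty("content-type", "application/pdf");
        // The lookup succeeds regardless of the casing used by the web server.
        System.out.println(props.getProperty("Content-Type")); // application/pdf
      }
    }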
[jira] Created: (NUTCH-135) http header meta data are case insensitive in the real world (e.g. Content-Type or content-type)
http header meta data are case insensitive in the real world (e.g. Content-Type or content-type)

         Key: NUTCH-135
         URL: http://issues.apache.org/jira/browse/NUTCH-135
     Project: Nutch
        Type: Bug
  Components: fetcher
    Versions: 0.7.1, 0.7
    Reporter: Stefan Groschupf
    Priority: Critical
     Fix For: 0.8-dev, 0.7.2-dev

As described in issue NUTCH-133, some web servers return http header meta data whose key case does not conform to the standard. This has many negative side effects: for example, querying the content type from the meta data returns null even when the web server did return a content type, because the key is not in the standard case, e.g. all lower case. This also affects the pdf parser, which queries the content length, etc.
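To make the symptom concrete, here is a tiny demonstration (illustrative, not Nutch code) of why a case-sensitive container such as java.util.Properties loses the value when the server picks a different casing:

    import java.util.Properties;

    public class CaseSensitivityDemo {
      public static void main(String[] args) {
        Properties meta = new Properties();
        // A non-conformant server sends the header key in lower case.
        meta.setProperty("content-type", "application/pdf");
        // A parser looking it up with the standard casing gets null.
        System.out.println(meta.getProperty("Content-Type")); // null
      }
    }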
Re: parse.getData().getMetadata().get("propName") is NULL?
Jack,

discussed here in detail: http://issues.apache.org/jira/browse/NUTCH-133
I will provide a patch fixing just this issue very soon.

Stefan

On 09.12.2005 at 20:04, Jack Tang wrote:

Hi

I am going to standardize some fields which I stored in my parser plugin. But I found that sometimes parse.getData().getMetadata().get("propertyName") is NULL. In fact, when I stepped into the source code, the value of propertyName was not NULL. So can someone explain this?

Thanks
/Jack
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

---
company: http://www.media-style.com
forum: http://www.text-mining.org
blog: http://www.find23.net
parse.getData().getMetadata().get("propName") is NULL?
Hi

I am going to standardize some fields which I stored in my parser plugin. But I found that sometimes parse.getData().getMetadata().get("propertyName") is NULL. In fact, when I stepped into the source code, the value of propertyName was not NULL. So can someone explain this?

Thanks
/Jack
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
Re: nutch questions
Ken,

> Thanks Stefan. I'll resend this to the user list as well. Just thought the dev list might be better since we're using the map/reduce version.

It is just that many other users would be interested in such information as well, and a lot of the developers are also subscribed to the user list.

Cheers,
Stefan
Re: nutch questions
Thanks Stefan. I'll resend this to the user list as well. Just thought the dev list might be better since we're using the map/reduce version.

Thanks!

Stefan Groschupf wrote:

Ken,

maybe the user mailing list would be a better place for such questions.

The size of your index depends on your configuration (what kind of index filter plugins you use). You can say a document in the index needs 10 KB plus meta data like date, content type, or category of the page. Storing the page content takes around 64 KB per page. You also need to store a link graph and a list of known urls - the web db. I would say each 100 million documents require 1 TB of storage. Information about query speed can be found in the index; as a rule of thumb, a box with 4 GB of RAM can handle 20 queries per second over 2 million documents. So in general you need many boxes, but the more expensive part of such a project is bandwidth.

Nutch 0.8 works well, however you have to write some custom jobs to get some standard jobs done; also, storing the index on the distributed filesystem and searching it from there is very, very slow. Besides that, Nutch has serious problems with spam detection in very large indexes.

HTH
Stefan

On 09.12.2005 at 00:59, Ken van Mulder wrote:

Hey folks,

We're looking at launching a search engine in the beginning of the new year that will eventually grow to a multi-billion page index. Three questions:

First, and most important for now, does anyone have any useful numbers for what the hardware requirements are to run such an engine? I have numbers for how fast I can get the crawlers working, but not for how many pages can be served off of each search node and how much processing power is required for the indexing, etc.

Second, what still needs to be done to Nutch in order for it to be able to handle billions of pages? Is there a general list of requirements?

Third, if Nutch isn't capable of doing what we need, what is the expected upper limit for it? Using the map/reduce version.

Thanks,

--
Ken van Mulder
Wavefire Technologies Corporation
http://www.wavefire.com
250.717.0200 (ext 113)

--
Ken van Mulder
Wavefire Technologies Corporation
http://www.wavefire.com
250.717.0200 (ext 113)
Re: nutch questions
Ken,

maybe the user mailing list would be a better place for such questions.

The size of your index depends on your configuration (what kind of index filter plugins you use). You can say a document in the index needs 10 KB plus meta data like date, content type, or category of the page. Storing the page content takes around 64 KB per page. You also need to store a link graph and a list of known urls - the web db. I would say each 100 million documents require 1 TB of storage. Information about query speed can be found in the index; as a rule of thumb, a box with 4 GB of RAM can handle 20 queries per second over 2 million documents. So in general you need many boxes, but the more expensive part of such a project is bandwidth.

Nutch 0.8 works well, however you have to write some custom jobs to get some standard jobs done; also, storing the index on the distributed filesystem and searching it from there is very, very slow. Besides that, Nutch has serious problems with spam detection in very large indexes.

HTH
Stefan

On 09.12.2005 at 00:59, Ken van Mulder wrote:

Hey folks,

We're looking at launching a search engine in the beginning of the new year that will eventually grow to a multi-billion page index. Three questions:

First, and most important for now, does anyone have any useful numbers for what the hardware requirements are to run such an engine? I have numbers for how fast I can get the crawlers working, but not for how many pages can be served off of each search node and how much processing power is required for the indexing, etc.

Second, what still needs to be done to Nutch in order for it to be able to handle billions of pages? Is there a general list of requirements?

Third, if Nutch isn't capable of doing what we need, what is the expected upper limit for it? Using the map/reduce version.

Thanks,

--
Ken van Mulder
Wavefire Technologies Corporation
http://www.wavefire.com
250.717.0200 (ext 113)
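Taking these rules of thumb at face value, here is a rough back-of-the-envelope sketch (the constants are simply the estimates quoted in this thread, not measured values, and the 100 million page target is hypothetical):

    public class CapacityEstimate {
      public static void main(String[] args) {
        long pages = 100000000L;               // hypothetical target collection size
        long indexBytesPerDoc = 10L * 1024;    // ~10 KB per document in the index
        long contentBytesPerDoc = 64L * 1024;  // ~64 KB of stored page content

        double indexTb = pages * indexBytesPerDoc / 1e12;
        double contentTb = pages * contentBytesPerDoc / 1e12;
        System.out.printf("index: %.1f TB, stored content: %.1f TB%n", indexTb, contentTb);

        // Rule of thumb from the thread: a 4 GB box serves ~20 queries/sec over ~2M docs,
        // so spreading the whole collection over such boxes needs roughly:
        long searchBoxes = pages / 2000000L;
        System.out.println("search boxes (at ~20 qps each): " + searchBoxes);
      }
    }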
Re: Google performance bottlenecks ;-) (Re: Lucene performance bottlenecks)
> The total number of hits (approx) is 2,780,000,000. BTW, I find it
> curious that the last 3 or 6 digits always seem to be zeros ... there's
> some clever guesstimation involved here. The fact that Google Suggest is
> able to return results so quickly would support this suspicion.

For more information about "fake" Google counts, I suggest you take a look at some tests performed by Jean Véronis, a French academic:
http://aixtal.blogspot.com/2005/02/web-googles-missing-pages-mystery.html

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
Google performance bottlenecks ;-) (Re: Lucene performance bottlenecks)
Hi,

I made an experiment with Google, to see if they use a similar approach. I find the results to be most interesting.

I selected a query which is guaranteed to give large result sets, but is more complicated than a single term query: http com. The total number of hits (approx) is 2,780,000,000. BTW, I find it curious that the last 3 or 6 digits always seem to be zeros ... there's some clever guesstimation involved here. The fact that Google Suggest is able to return results so quickly would support this suspicion.

When I ran the query for the first time, the response time was 0.29 sec. All subsequent queries retrieving the first 10 results are on the order of 0.07 sec. This is for retrieving just the first page (first 10 results). Retrieving results 10-20 also takes 0.08 sec, which suggests that this result was cached somewhere. Starting from results 20+ the response time increases (linearly?), although it varies wildly between requests, sometimes returning quicker, sometimes taking the max time - which suggests that I'm hitting different servers each time. Also, if I wait ~30 sec to 1 minute, the response times are back to the values for the first-time run.

  start   first response   repeated response
     30   0.14             0.08-0.21
     50   0.29             0.11-0.22
    100   0.36             0.22-0.45
    200   0.73             0.49-0.65
    300   0.96             0.64-0.98
    500   1.36             1.43-1.87
    650   2.24             1.49-1.85

The last range was the maximum in this case - Google wouldn't display any hit above 652 (which I find curious, too - because the total number of hits is, well, significantly higher - and Google claims to return up to the first 1000 results).

My impressions from this exercise are perhaps not so surprising: Google is highly optimized for retrieving the first couple of results, and the more results you want to retrieve the worse the performance. Finally, you won't be able to retrieve any results above a couple hundred, quite often less than the claimed 1000-result threshold.

As for the exact techniques of this optimization, we'll never know for sure, but it seems like something similar is going on to what you outlined in your email. I think it would be great to try it out.

Andrzej

Doug Cutting wrote:

Doug Cutting wrote:

Implementing something like this for Lucene would not be too difficult. The index would need to be re-sorted by document boost: documents would be re-numbered so that highly-boosted documents had low document numbers.

In particular, one could:

1. Create an array of int[maxDoc], with a[i] = i.
2. Sort the array with order(i,j) = boost(i) - boost(j).
3. Implement a FilterIndexReader that re-numbers using the sorted array. So, for example, the document numbers in the TermPositions will be a[old.doc()]. Each term's positions will need to be read entirely into memory and sorted to perform this renumbering.

The IndexOptimizer.java class in the searcher package was an old attempt to create something like what Suel calls "fancy postings". It creates an index with the top 10% scoring postings. Since documents are not renumbered one can intermix postings from this with the full index. So, for example, one can first try searching using this index for terms that occur more than, e.g., 10k times, and use the full index for rarer words. If that does not find 1000 hits then the full index must be searched. Such an approach can be combined with using a pre-sorted index.

I think the first thing to implement would be something like what Suel calls first-1000. Then we need to evaluate this and determine, for a query log, how different the results are.
Then a HitCollector can simply stop searching once a given number of hits are found.

Doug

--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
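As a rough standalone sketch of steps 1 and 2 quoted above (it assumes per-document boosts are already available as a float[] - it is not Lucene's FilterIndexReader, only the permutation-building part - and it sorts so that highly-boosted documents receive the low numbers, matching the stated goal):

    import java.util.Arrays;
    import java.util.Comparator;

    public class BoostRenumber {

      // Build an old->new document number mapping: the highest-boosted
      // document receives the lowest new number. boost[i] is the boost of old doc i.
      static int[] renumberByBoost(final float[] boost) {
        int maxDoc = boost.length;
        Integer[] order = new Integer[maxDoc];
        for (int i = 0; i < maxDoc; i++) order[i] = i;
        // Sort old doc ids by descending boost.
        Arrays.sort(order, new Comparator<Integer>() {
          public int compare(Integer i, Integer j) {
            return Float.compare(boost[j], boost[i]);
          }
        });
        // Invert the order: map[oldDoc] = newDoc. A renumbering reader
        // (Doug's step 3) would translate postings through this array.
        int[] map = new int[maxDoc];
        for (int newDoc = 0; newDoc < maxDoc; newDoc++) {
          map[order[newDoc]] = newDoc;
        }
        return map;
      }

      public static void main(String[] args) {
        float[] boost = {0.2f, 1.5f, 0.7f};
        // Old doc 1 has the highest boost, so it becomes new doc 0.
        System.out.println(Arrays.toString(renumberByBoost(boost))); // [2, 0, 1]
      }
    }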
Re: [C2-devel] about the question of clustering-carrot2
Hi Charlie,

Don't cross-post to two lists at once. The question you asked is relevant to C2, not Nutch, so I'll reply to it there.

Dawid

charlie wrote:

Dear all,

Currently I’m using the Nutch plug-in “clustering-carrot2” and would like to ask for some help. When I build the search result clusters, only the search results that occur twice or more are grouped into one cluster. At the same time, results (keywords) that occur only once are put into the “Other” group. What I’m trying to do now is to change this behavior so that even a result that occurs only once can still be grouped into its own cluster. Does anyone have a clue how this could be accomplished?

Thanks in advance!
Charlie