[jira] Assigned: (NUTCH-3) multi values of header discarded

2005-12-09 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-3?page=all ]

Stefan Groschupf reassigned NUTCH-3:


Assign To: Stefan Groschupf

> multi values of header discarded
> 
>
>  Key: NUTCH-3
>  URL: http://issues.apache.org/jira/browse/NUTCH-3
>  Project: Nutch
> Type: Bug
> Reporter: Stefan Groschupf
> Assignee: Stefan Groschupf

>
> original by: phoebe
> http://sourceforge.net/tracker/index.php?func=detail&aid=185&group_id=59548&atid=491356
> multi values of header discarded
> Each successive setting of a header value deletes the previous one.
> This patch allows multiple values to be retained, such as cookies, using CR LF
> as a delimiter between values.
> --- /tmp/HttpResponse.java 2005-01-27 19:57:55.0 -0500
> +++ HttpResponse.java 2005-01-27 20:45:01.0 -0500
> @@ -324,7 +324,19 @@
>        }
>        String value = line.substring(valueStart);
> -      headers.put(key, value);
> +      // Spec allows multiple values, such as Set-Cookie - using CR LF as delimiter
> +      if (headers.containsKey(key)) {
> +        try {
> +          Object obj = headers.get(key);
> +          if (obj != null) {
> +            String oldvalue = headers.get(key).toString();
> +            value = oldvalue + "\r\n" + value;
> +          }
> +        } catch (Exception e) {
> +          e.printStackTrace();
> +        }
> +      }
> +      headers.put(key, value);
>      }
>    private Map parseHeaders(PushbackInputStream in, StringBuffer line)
> @@ -399,5 +411,3 @@
>    }
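A minimal companion sketch of how a consumer of the patched code would recover
the individual values again; the header values here are invented for the
example, not taken from Nutch:

    public class SplitJoinedHeaders {
        public static void main(String[] args) {
            // What the patched parser stores after seeing two Set-Cookie lines:
            String joined = "sessionid=abc123\r\ntheme=dark";
            // A consumer recovers the individual values by splitting on the delimiter:
            String[] values = joined.split("\r\n");
            for (int i = 0; i < values.length; i++) {
                System.out.println(values[i]); // sessionid=abc123, then theme=dark
            }
        }
    }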

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-135) http header meta data are case insensitive in the real world (e.g. Content-Type or content-type)

2005-12-09 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-135?page=comments#action_12360025 ] 

Stefan Groschupf commented on NUTCH-135:


Andrzej, that is easy to add to the ContentProperties object, and sure, I can do 
that. However, I would first love to get an OK for this patch before I invest 
more time in it, since I already spend too much time writing stuff just for the 
issue archive. 
As soon as this patch is in the sources, I will write a small new patch (as Doug 
suggested, doing it in small steps) to solve NUTCH-3.

> http header meta data are case insensitive in the real world (e.g. 
> Content-Type or content-type)
> 
>
>  Key: NUTCH-135
>  URL: http://issues.apache.org/jira/browse/NUTCH-135
>  Project: Nutch
> Type: Bug
>   Components: fetcher
> Versions: 0.7, 0.7.1
> Reporter: Stefan Groschupf
> Priority: Critical
>  Fix For: 0.8-dev, 0.7.2-dev
>  Attachments: contentProperties_patch.txt
>
> As described in issue NUTCH-133, some web servers return HTTP header metadata 
> with non-standard casing.
> This has many negative side effects: for example, querying the content type 
> from the metadata returns null even when the web server does send a content 
> type, because the key is not in the standard case (e.g. lower case). This 
> also affects the PDF parser, which queries the content length, etc.




[jira] Commented: (NUTCH-135) http header meta data are case insensitive in the real world (e.g. Content-Type or content-type)

2005-12-09 Thread Andrzej Bialecki (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-135?page=comments#action_12359961 ] 

Andrzej Bialecki  commented on NUTCH-135:
-

Since you are already working on this issue, I'd like to ask you to take a look 
at NUTCH-3, and see if you can solve this too. The problem described there is 
that if there are several headers with the same name, only the last value is 
preserved, but in some cases multiple headers make sense (see any of the 
existing Java models for handling HTTP or RFC822 mail messages - all of them 
handle multiple values per single key).
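A hedged sketch of that model in Java, along the lines of
java.net.URLConnection.getHeaderFields(); the class and method names are
invented for illustration, not the committed fix:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    // Each header name maps to the list of all values seen, so no
    // Set-Cookie line is lost. Keys also compare case-insensitively.
    public class MultiValueHeaders {
        private final Map<String, List<String>> headers =
            new TreeMap<String, List<String>>(String.CASE_INSENSITIVE_ORDER);

        public void add(String name, String value) {
            List<String> values = headers.get(name);
            if (values == null) {
                values = new ArrayList<String>();
                headers.put(name, values);
            }
            values.add(value);
        }

        // Returns every value seen for a header, e.g. all Set-Cookie lines.
        public List<String> getAll(String name) {
            List<String> values = headers.get(name);
            return values == null ? Collections.<String>emptyList() : values;
        }

        public static void main(String[] args) {
            MultiValueHeaders h = new MultiValueHeaders();
            h.add("Set-Cookie", "a=1");
            h.add("Set-Cookie", "b=2");
            System.out.println(h.getAll("set-cookie")); // prints [a=1, b=2]
        }
    }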

> http header meta data are case insensitive in the real world (e.g. 
> Content-Type or content-type)
> 
>
>  Key: NUTCH-135
>  URL: http://issues.apache.org/jira/browse/NUTCH-135
>  Project: Nutch
> Type: Bug
>   Components: fetcher
> Versions: 0.7.1, 0.7
> Reporter: Stefan Groschupf
> Priority: Critical
>  Fix For: 0.8-dev, 0.7.2-dev
>  Attachments: contentProperties_patch.txt
>
> As described in issue NUTCH-133, some web servers return HTTP header metadata 
> with non-standard casing.
> This has many negative side effects: for example, querying the content type 
> from the metadata returns null even when the web server does send a content 
> type, because the key is not in the standard case (e.g. lower case). This 
> also affects the PDF parser, which queries the content length, etc.




[jira] Updated: (NUTCH-135) http header meta data are case insensitive in the real world (e.g. Content-Type or content-type)

2005-12-09 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-135?page=all ]

Stefan Groschupf updated NUTCH-135:
---

Attachment: contentProperties_patch.txt

As Doug suggested, here is a patch using a TreeMap with 
String.CASE_INSENSITIVE_ORDER that solves the problem of case-insensitive HTTP 
headers, and of case-insensitive content metadata in general. 
I see two different ways to solve the problem. The first is to leave the API as 
it is and extend a Properties object, overriding its methods to use a TreeMap 
behind the scenes. This solution would also require copying data back and forth 
between the Properties object and the TreeMap several times, since the Nutch 
code uses a Properties object in the Content constructor. The other choice is 
to change the API of the Content object to document cleanly that another 
object, with different behavior than the Properties object, is used. The 
downside of this solution is that it requires many small changes across the 
Nutch code base. 
However, I decided on the clean way, the latter, since I don't like code that 
does things behind the scenes that developers would not expect. So I introduced 
a tiny ContentProperties object and changed the Content constructor to use the 
ContentProperties object instead of the java.util.Properties object. The new 
ContentProperties has a similar API to the Properties class but uses 
case-insensitive keys. I changed all classes that use the Content object to use 
the new ContentProperties, including object instantiation, and I also extended 
the Content test case to verify that case-insensitive keys are now supported. 
Feel free to give constructive improvement suggestions, but please also let us 
get this done as soon as possible, since from my point of view this is a 
critical issue. All test cases pass on my box, but please double-check before 
committing.
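The core of the idea, as a minimal sketch; the method names mirror
java.util.Properties, and the real ContentProperties in the attached patch may
differ:

    import java.util.Map;
    import java.util.TreeMap;

    // A Properties-like container whose keys compare case-insensitively,
    // backed by a TreeMap with String.CASE_INSENSITIVE_ORDER.
    public class ContentPropertiesSketch {
        private final Map<String, String> map =
            new TreeMap<String, String>(String.CASE_INSENSITIVE_ORDER);

        public String getProperty(String key) {
            return map.get(key);
        }

        public Object setProperty(String key, String value) {
            return map.put(key, value);
        }

        public static void main(String[] args) {
            ContentPropertiesSketch p = new ContentPropertiesSketch();
            p.setProperty("content-type", "application/pdf"); // as a lower-casing server sends it
            System.out.println(p.getProperty("Content-Type")); // application/pdf
        }
    }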

> http header meta data are case insensitive in the real world (e.g. 
> Content-Type or content-type)
> 
>
>  Key: NUTCH-135
>  URL: http://issues.apache.org/jira/browse/NUTCH-135
>  Project: Nutch
> Type: Bug
>   Components: fetcher
> Versions: 0.7.1, 0.7
> Reporter: Stefan Groschupf
> Priority: Critical
>  Fix For: 0.8-dev, 0.7.2-dev
>  Attachments: contentProperties_patch.txt
>
> As described in issue NUTCH-133, some web servers return HTTP header metadata 
> with non-standard casing.
> This has many negative side effects: for example, querying the content type 
> from the metadata returns null even when the web server does send a content 
> type, because the key is not in the standard case (e.g. lower case). This 
> also affects the PDF parser, which queries the content length, etc.




[jira] Created: (NUTCH-135) http header meta data are case insensitive in the real world (e.g. Content-Type or content-type)

2005-12-09 Thread Stefan Groschupf (JIRA)
http header meta data are case insensitive in the real world (e.g. Content-Type 
or content-type)


 Key: NUTCH-135
 URL: http://issues.apache.org/jira/browse/NUTCH-135
 Project: Nutch
Type: Bug
  Components: fetcher  
Versions: 0.7.1, 0.7
Reporter: Stefan Groschupf
Priority: Critical
 Fix For: 0.8-dev, 0.7.2-dev


As described in issue NUTCH-133, some web servers return HTTP header metadata 
with non-standard casing.
This has many negative side effects: for example, querying the content type 
from the metadata returns null even when the web server does send a content 
type, because the key is not in the standard case (e.g. lower case). This also 
affects the PDF parser, which queries the content length, etc.




Re: parse.getData().getMetadata().get("propName") is NULL?

2005-12-09 Thread Stefan Groschupf

Jack,
discussed here in detail:
http://issues.apache.org/jira/browse/NUTCH-133

I will provide a patch just fixing this issue very soon.
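In a nutshell, the failure mode looks like this; the key and value are invented
for the example:

    import java.util.Properties;

    // java.util.Properties keys are case sensitive, so a header stored as
    // "content-type" is invisible to a "Content-Type" lookup.
    public class CaseSensitiveLookup {
        public static void main(String[] args) {
            Properties meta = new Properties();
            meta.setProperty("content-type", "text/html"); // as some servers send it
            System.out.println(meta.getProperty("Content-Type")); // prints null
        }
    }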

Stefan

On 09.12.2005 at 20:04, Jack Tang wrote:


Hi

I am going to standardize some fields that I store in my parser
plugin. But I found that parse.getData().getMetadata().get("propertyName")
is sometimes NULL. In fact, when I stepped through the source code, the
value of propertyName was not NULL.

So can someone explain this? Thanks

/Jack


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars



---
company:http://www.media-style.com
forum:http://www.text-mining.org
blog:http://www.find23.net




parse.getData().getMetadata().get("propName") is NULL?

2005-12-09 Thread Jack Tang
Hi

I am going to standardize some fields that I store in my parser
plugin. But I found that parse.getData().getMetadata().get("propertyName")
is sometimes NULL. In fact, when I stepped through the source code, the
value of propertyName was not NULL.

So can someone explain this? Thanks

/Jack


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars


Re: nutch questions

2005-12-09 Thread Stefan Groschupf

Ken,
Thanks Stefan. I'll resend this to the user list as well. Just thought the dev 
list might be better since we're using the map/reduce version.
It is just that many other users would be interested in such information as 
well, and a lot of the developers are also subscribed to the user list.


Cheers,
Stefan



Re: nutch questions

2005-12-09 Thread Ken van Mulder
Thanks Stefan. I'll resend this to the user list as well. Just thought 
the dev list might be better since we're using the map/reduce version.


Thanks!

Stefan Groschupf wrote:

Ken,
maybe the user mailing list would be a better place for such questions.
The size of your index depends on your configuration (which index filter 
plugins you use).

You can figure a document in the index needs about 10 KB, plus the metadata 
like date, content type or category of the page.

Storing the page content takes around 64 KB per page.
You also need to store a link graph and a list of known URLs - the web db.
I would say each 100 million documents require about 1 TB of storage.

As a rule of thumb for query speed, 4 GB of RAM can handle 20 queries per 
second against 2 million documents per box.
So in general you need many boxes, but the more expensive part of such a 
project is bandwidth.

Nutch 0.8 works well; however, you have to write some custom jobs to get some 
standard jobs done. Also, storing the index on the distributed filesystem and 
searching it from there is very, very slow. Besides that, Nutch has serious 
problems with spam detection in very large indexes.


HTH
Stefan




On 09.12.2005 at 00:59, Ken van Mulder wrote:


Hey folks,

We're looking at launching a search engine in the beginning of the new year 
that will eventually grow to a multi-billion page index. Three questions:


First, and most important for now, does anyone have any useful numbers for 
the hardware requirements to run such an engine? I have numbers for how fast 
I can get the crawlers working, but not for how many pages can be served off 
of each search node or how much processing power is required for the 
indexing, etc.


Second, what still needs to be done to Nutch for it to be able to handle 
billions of pages? Is there a general list of requirements?


Third, if Nutch isn't capable of doing what we need, what is its expected 
upper limit? Using the map/reduce version.


Thanks,

--
Ken van Mulder
Wavefire Technologies Corporation

http://www.wavefire.com
250.717.0200 (ext 113)







--
Ken van Mulder
Wavefire Technologies Corporation

http://www.wavefire.com
250.717.0200 (ext 113)


Re: nutch questions

2005-12-09 Thread Stefan Groschupf

Ken,
maybe the user mailing list would be a better place for such questions.
The size of your index depends on your configuration (which index filter 
plugins you use).

You can figure a document in the index needs about 10 KB, plus the metadata 
like date, content type or category of the page.

Storing the page content takes around 64 KB per page.
You also need to store a link graph and a list of known URLs - the web db.
I would say each 100 million documents require about 1 TB of storage.
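Making that arithmetic explicit (a sketch using the per-document figures above; 
the 1 TB appears to cover the index alone, while stored page content at 64 KB 
per page would add several terabytes more):

    public class StorageEstimate {
        public static void main(String[] args) {
            double docs = 100e6;                        // 100 million documents
            double indexTB = docs * 10 * 1024 / 1e12;   // ~10 KB per indexed document
            double contentTB = docs * 64 * 1024 / 1e12; // ~64 KB of stored content per page
            System.out.printf("index:   %.1f TB%n", indexTB);   // ~1.0 TB
            System.out.printf("content: %.1f TB%n", contentTB); // ~6.6 TB
        }
    }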

As a rule of thumb for query speed, 4 GB of RAM can handle 20 queries per 
second against 2 million documents per box.
So in general you need many boxes, but the more expensive part of such a 
project is bandwidth.

Nutch 0.8 works well; however, you have to write some custom jobs to get some 
standard jobs done. Also, storing the index on the distributed filesystem and 
searching it from there is very, very slow. Besides that, Nutch has serious 
problems with spam detection in very large indexes.


HTH
Stefan




On 09.12.2005 at 00:59, Ken van Mulder wrote:


Hey folks,

We're looking at launching a search engine in the beginning of the new year 
that will eventually grow to a multi-billion page index. Three questions:


First, and most important for now, does anyone have any useful numbers for 
the hardware requirements to run such an engine? I have numbers for how fast 
I can get the crawlers working, but not for how many pages can be served off 
of each search node or how much processing power is required for the 
indexing, etc.


Second, what still needs to be done to Nutch for it to be able to handle 
billions of pages? Is there a general list of requirements?


Third, if Nutch isn't capable of doing what we need, what is its expected 
upper limit? Using the map/reduce version.


Thanks,

--
Ken van Mulder
Wavefire Technologies Corporation

http://www.wavefire.com
250.717.0200 (ext 113)





Re: Google performance bottlenecks ;-) (Re: Lucene performance bottlenecks)

2005-12-09 Thread Jérôme Charron
> The total number of hits (approx) is 2,780,000,000. BTW, I find it
> curious that the last 3 or 6 digits always seem to be zeros ... there's
> some clever guesstimation involved here. The fact that Google Suggest is
> able to return results so quickly would support this suspicion.
>
For more information about "fake" Google counts, I suggest you take a look at 
some tests performed by Jean Véronis, a French academic:
http://aixtal.blogspot.com/2005/02/web-googles-missing-pages-mystery.html

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Google performance bottlenecks ;-) (Re: Lucene performance bottlenecks)

2005-12-09 Thread Andrzej Bialecki

Hi,

I ran an experiment with Google to see if they use a similar approach.

I find the results to be most interesting. I selected a query which is 
guaranteed to give large result sets, but is more complicated than a 
single term query: http com.


The total number of hits (approx) is 2,780,000,000. BTW, I find it 
curious that the last 3 or 6 digits always seem to be zeros ... there's 
some clever guesstimation involved here. The fact that Google Suggest is 
able to return results so quickly would support this suspicion.


When I ran the query for the first time, the response time was 0.29 sec. 
All subsequent queries retrieving the first 10 results are in the order 
of 0.07 sec.


This is for retrieving just the first page (first 10 results). 
Retrieving results 10-20 also takes 0.08 sec, which suggests that this 
result was cached somewhere. Starting from results 20+ the response time 
increases (linearly?), although it varies wildly between requests, 
sometimes returning quicker, sometimes taking the max time - which 
suggests that I'm hitting different servers each time. Also, if I wait 
~30 sec to 1 minute, the response times are back to the values for the 
first-time run.


start   first (s)   repeated (s)
 30     0.14        0.08-0.21
 50     0.29        0.11-0.22
100     0.36        0.22-0.45
200     0.73        0.49-0.65
300     0.96        0.64-0.98
500     1.36        1.43-1.87
650     2.24        1.49-1.85

The last range was the maximum in this case - Google wouldn't display 
any hit above 652 (which I find curious, too - because the total number 
of hits is, well, significantly higher - and Google claims to return up 
to the first 1000 results).


My impressions from this exercise are perhaps not so surprising: Google 
is highly optimized for retrieving the first couple of results, and the 
more results you want to retrieve the worse the performance. Finally, 
you won't be able to retrieve any results above a couple hundred, quite 
often less than the claimed 1000 results threshold.


As for the exact techniques of this optimization, we'll never know for 
sure, but it seems like something similar is going on to what you 
outlined in your email. I think it would be great to try it out.


Andrzej


Doug Cutting wrote:


Doug Cutting wrote:

Implementing something like this for Lucene would not be too 
difficult. The index would need to be re-sorted by document boost: 
documents would be re-numbered so that highly-boosted documents had 
low document numbers.



In particular, one could:

1. Create an array of int[maxDoc], with a[i] = i.
2. Sort the array with order(i,j) = boost(i) - boost(j);
3. Implement a FilterIndexReader that re-numbers using the sorted 
array.  So, for example, the document numbers in the TermPositions 
will be a[old.doc()].  Each term's positions will need to be read 
entirely into memory and sorted to perform this renumbering.
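A toy rendering of these steps in plain Java; the boosts array stands in for 
whatever per-document boost an index would supply, so this is an illustration 
of the permutation, not Lucene code:

    import java.util.Arrays;
    import java.util.Comparator;

    public class BoostRenumber {
        public static void main(String[] args) {
            final float[] boosts = {0.5f, 2.0f, 1.0f, 3.0f}; // boost(i) per old doc id
            int maxDoc = boosts.length;

            // Step 1: a[i] = i.
            Integer[] a = new Integer[maxDoc];
            for (int i = 0; i < maxDoc; i++) a[i] = i;

            // Step 2: sort by boost, descending, so a[0] is the most boosted
            // doc (matching the goal of low numbers for high boosts).
            Arrays.sort(a, new Comparator<Integer>() {
                public int compare(Integer i, Integer j) {
                    return Float.compare(boosts[j.intValue()], boosts[i.intValue()]);
                }
            });

            // Step 3: the renumbering table a FilterIndexReader would consult;
            // newId[oldDoc] is the document number after re-sorting.
            int[] newId = new int[maxDoc];
            for (int n = 0; n < maxDoc; n++) newId[a[n].intValue()] = n;

            System.out.println(Arrays.toString(newId)); // prints [3, 1, 2, 0]
        }
    }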


The IndexOptimizer.java class in the searcher package was an old 
attempt to create something like what Suel calls "fancy postings".  It 
creates an index with the top 10% scoring postings.  Since documents 
are not renumbered one can intermix postings from this with the full 
index.  So for example, one can first try searching using this index 
for terms that occur more than, e.g., 10k times, and use the full 
index for rarer words.  If that does not find 1000 hits then the full 
index must be searched.  Such an approach can be combined with using a 
pre-sorted index.


I think the first thing to implement would be something like what Suel calls 
first-1000.  Then we need to evaluate this and determine, for a query log, how 
different the results are.


Then a HitCollector can simply stop searching once a given number of 
hits are found.
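A sketch of that collector against the HitCollector API of the era; the old 
API offers no cooperative way to stop a search, so the termination via an 
unchecked exception is an assumed trick, not part of the API:

    import org.apache.lucene.search.HitCollector;

    public class FirstNHitCollector extends HitCollector {

        // Thrown to break out of the search once enough hits are collected.
        public static class EnoughHits extends RuntimeException {}

        private final int limit;
        private int count = 0;

        public FirstNHitCollector(int limit) {
            this.limit = limit;
        }

        public void collect(int doc, float score) {
            count++;
            // ... record (doc, score) for the result page here ...
            if (count >= limit) {
                throw new EnoughHits(); // abort the rest of the scan
            }
        }

        public int hits() {
            return count;
        }
    }

The caller wraps searcher.search(query, new FirstNHitCollector(1000)) in a 
try/catch for EnoughHits; with a boost-sorted index, the first N collected 
documents are also the best N.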


Doug








--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: [C2-devel] about the question of clustering-carrot2

2005-12-09 Thread Dawid Weiss


Hi Charlie,

Don't cross-post to two lists at once. The question you asked is 
relevant to C2, not Nutch, so I'll reply to it there.


Dawid

charlie wrote:

Dear all,

 

Currently I’m using the Nutch plug-in “clustering-carrot2” and would like to 
ask for some help. When I build the search result clusters, only the search 
results that occur twice or more are grouped into one cluster, while results 
(keywords) that occur only once are put into the “Other” group. What I’m 
trying to do now is to change this behavior so that even a result that occurs 
only once can still be grouped into a unique cluster. Does anyone have a clue 
how this could be accomplished?


 


Thanks in advance!

Charlie