[jira] Commented: (NUTCH-271) Meta-data per URL/site/section

2006-05-18 Thread Gal Nitzan (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-271?page=comments#action_12412436 ] 

Gal Nitzan commented on NUTCH-271:
--

Sorry for the short comment.

Actually the meta tags functionality is already available in the 0.8 version 
along with a CrawlDatum object.

You can build the required functionality just by developing plugins for parsing, 
indexing, and querying.

HTH.

> Meta-data per URL/site/section
> --
>
>  Key: NUTCH-271
>  URL: http://issues.apache.org/jira/browse/NUTCH-271
>  Project: Nutch
> Type: New Feature

> Versions: 0.7.2
> Reporter: Stefan Neufeind

>
> We have the need to index sites and attach additional meta-data-tags to them. 
> Afaik this is not yet possible, or is there a "workaround" I don't see? What 
> I think of is using meta-tags per start-url, only indexing content below that 
> URL, and have the ability to limit searches upon those meta-tags. E.g.
> http://www.example1.com/something1/   -> meta-tag "companybranch1"
> http://www.example2.com/something2/   -> meta-tag "companybranch2"
> http://www.example3.com/something3/   -> meta-tag "companybranch1"
> http://www.example4.com/something4/   -> meta-tag "companybranch3"
> search for everything in companybranch1 or across 1 and 3 or similar

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-271) Meta-data per URL/site/section

2006-05-18 Thread Gal Nitzan (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-271?page=comments#action_12412435 ] 

Gal Nitzan commented on NUTCH-271:
--

This functionality is already available in Nutch-0.8





[jira] Created: (NUTCH-271) Meta-data per URL/site/section

2006-05-18 Thread Stefan Neufeind (JIRA)
Meta-data per URL/site/section
--

 Key: NUTCH-271
 URL: http://issues.apache.org/jira/browse/NUTCH-271
 Project: Nutch
Type: New Feature

Versions: 0.7.2
Reporter: Stefan Neufeind


We have the need to index sites and attach additional meta-data-tags to them. 
Afaik this is not yet possible, or is there a "workaround" I don't see? What I 
think of is using meta-tags per start-url, only indexing content below that 
URL, and have the ability to limit searches upon those meta-tags. E.g.

http://www.example1.com/something1/   -> meta-tag "companybranch1"
http://www.example2.com/something2/   -> meta-tag "companybranch2"
http://www.example3.com/something3/   -> meta-tag "companybranch1"
http://www.example4.com/something4/   -> meta-tag "companybranch3"

search for everything in companybranch1 or across 1 and 3 or similar
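
A minimal sketch of the mapping described above, not an existing Nutch feature. The class StartUrlTags and its methods are hypothetical names introduced only for illustration: each start URL is registered with a tag, a page gets the tag of the longest start URL it falls under, and a search filter accepts a set of wanted tags.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper (not part of Nutch): maps start-URL prefixes to a tag. */
final class StartUrlTags {
    private final Map<String, String> tagByPrefix = new LinkedHashMap<>();

    /** Registers a start URL and the tag that everything below it should carry. */
    void add(String startUrl, String tag) {
        tagByPrefix.put(startUrl, tag);
    }

    /** Returns the tag of the longest matching start URL, or null if none matches. */
    String tagFor(String url) {
        String bestPrefix = null;
        String bestTag = null;
        for (Map.Entry<String, String> e : tagByPrefix.entrySet()) {
            String prefix = e.getKey();
            boolean longer = bestPrefix == null || prefix.length() > bestPrefix.length();
            if (url.startsWith(prefix) && longer) {
                bestPrefix = prefix;
                bestTag = e.getValue();
            }
        }
        return bestTag;
    }

    /** True if the URL carries one of the wanted tags (e.g. branch 1 or branch 3). */
    boolean matches(String url, Set<String> wantedTags) {
        String tag = tagFor(url);
        return tag != null && wantedTags.contains(tag);
    }
}
```

A URL outside every registered start URL gets no tag, which corresponds to "only indexing content below that URL".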




Re: Fetcher.java reporting incorrect kb/s?

2006-05-18 Thread Andrzej Bialecki

Greg Kim wrote:

Hi,

I was just looking at the Fetcher.java code on trunk (r 407599), 
snippet below.

The total # of bytes is getting multiplied by 8 and the division by
8.0 is missing;

 private void reportStatus() throws IOException {
   String status;
   synchronized (this) {
     long elapsed = (System.currentTimeMillis() - start)/1000;
     status =
       pages+" pages, "+errors+" errors, "
       + Math.round(((float)pages*10)/elapsed)/10.0+" pages/s, "
       + Math.round(((((float)bytes)*8)/1024)/elapsed)+" kb/s, ";
                                    ^^^
   }
   reporter.setStatus(status);
 }


Funny you should mention that just now, I was looking at this 
calculation today. If you take what is printed literally, it tells 
the truth: it converts bytes (apparently defined as octets) into 
bits (multiply by 8) and then into kilobits. One could argue that 
even this is not strictly true, since dividing by 1024 yields kibibits 
(see http://en.wikipedia.org/wiki/Kibibyte).


So the calculation is correct, and the unit name "kb/s" correctly uses a 
lower-case "b" to signify bits rather than bytes, although many 
people tend to read it as bytes.
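
To make the arithmetic concrete, here is a small sketch that isolates the rate expression from reportStatus(). The class FetchRate and its method names are made up for illustration; the first method repeats the /1024 expression from the snippet, the second shows the same value with a decimal (/1000) divisor.

```java
/** Hypothetical helper isolating the throughput arithmetic; not Nutch code. */
final class FetchRate {

    /** bytes -> bits (*8) -> units of 1024 bits, per second, as in the snippet. */
    static long kibibitsPerSecond(long bytes, long elapsedSeconds) {
        return Math.round(((((float) bytes) * 8) / 1024) / elapsedSeconds);
    }

    /** Same conversion with a decimal divisor: units of 1000 bits per second. */
    static long kilobitsPerSecond(long bytes, long elapsedSeconds) {
        return Math.round(((((float) bytes) * 8) / 1000) / elapsedSeconds);
    }

    public static void main(String[] args) {
        long bytes = 1_280_000L;   // example value, chosen so the results are round
        long elapsed = 10L;        // seconds
        System.out.println(kibibitsPerSecond(bytes, elapsed) + " Kibit/s"); // 1000 Kibit/s
        System.out.println(kilobitsPerSecond(bytes, elapsed) + " kbit/s");  // 1024 kbit/s
    }
}
```

For 1,280,000 bytes in 10 seconds the two divisors give 1000 and 1024 respectively, a difference of 2.4%.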


--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Fetcher.java reporting incorrect kb/s?

2006-05-18 Thread Ken Krugler
kb/s is kilobits/second, not kilobytes/second. See 
.


I agree that using the more explicit kbits/s would be better.

Related micro-nit (at least according to 
http://en.wikipedia.org/wiki/Kilobit_per_second): it should be /1000, 
not /1024.


-- Ken

I was just looking at the Fetcher.java code on trunk (r 407599), 
snippet below.

The total # of bytes is getting multiplied by 8 and the division by
8.0 is missing;

 private void reportStatus() throws IOException {
   String status;
   synchronized (this) {
     long elapsed = (System.currentTimeMillis() - start)/1000;
     status =
       pages+" pages, "+errors+" errors, "
       + Math.round(((float)pages*10)/elapsed)/10.0+" pages/s, "
       + Math.round(((((float)bytes)*8)/1024)/elapsed)+" kb/s, ";
                                    ^^^
   }
   reporter.setStatus(status);
 }

thanks
greg



--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"


[jira] Created: (NUTCH-270) Apply just the applicable portions of the patch to protocol.httpclient.Http.java

2006-05-18 Thread Jeremy Calvert (JIRA)
Apply just the applicable portions of the patch to protocol.httpclient.Http.java


 Key: NUTCH-270
 URL: http://issues.apache.org/jira/browse/NUTCH-270
 Project: Nutch
Type: Sub-task

  Components: fetcher  
Versions: 0.8-dev
Reporter: Jeremy Calvert


This seems to be two issues in one.  Adaptive scheduling AND content change 
detection.

I don't see any reason not to apply the patch to allow content change 
detection.  That is, the parts of the patch to support changing the signature 
HttpResponse(URL url, long lastModified).  It'd be especially useful for those 
of us who refetch feeds fairly frequently.
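
As a rough illustration of what passing lastModified into the request buys, the sketch below issues a conditional GET. It uses plain java.net.HttpURLConnection rather than the protocol-httpclient plugin's own classes, and ConditionalFetch and its method names are hypothetical, so treat it as an outline of the idea and not as the patch itself.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

/** Hypothetical sketch of a conditional fetch; not the NUTCH-270 patch. */
final class ConditionalFetch {

    /** True if the status code says the previously fetched copy is still current. */
    static boolean notModified(int statusCode) {
        return statusCode == HttpURLConnection.HTTP_NOT_MODIFIED; // 304
    }

    /**
     * Sends a GET with If-Modified-Since set from lastModified (milliseconds
     * since the epoch). A value of 0 is treated here as "never fetched", so
     * no condition is sent and the server returns the full content.
     */
    static boolean unchangedSince(URL url, long lastModified) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try {
            conn.setRequestMethod("GET");
            if (lastModified > 0) {
                conn.setIfModifiedSince(lastModified);
            }
            return notModified(conn.getResponseCode());
        } finally {
            conn.disconnect();
        }
    }
}
```

When the server answers 304 the body is not transferred, which is where the saving for frequently refetched feeds would come from.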




Nutch 'Help Wanted' page on wiki

2006-05-18 Thread Gordon Mohr
To complement the existing 'Support' (experts available) page and at 
Doug's suggestion, I've added a 'Help Wanted' page to the Nutch wiki:


  http://wiki.apache.org/nutch/Help_Wanted

There's also a first listing to get things started. :)

- Gordon @ IA



Fetcher.java reporting incorrect kb/s?

2006-05-18 Thread Greg Kim

Hi,

I was just looking at the Fetcher.java code on trunk (r 407599), snippet below.
The total # of bytes is getting multiplied by 8 and the division by
8.0 is missing;

 private void reportStatus() throws IOException {
   String status;
   synchronized (this) {
     long elapsed = (System.currentTimeMillis() - start)/1000;
     status =
       pages+" pages, "+errors+" errors, "
       + Math.round(((float)pages*10)/elapsed)/10.0+" pages/s, "
       + Math.round(((((float)bytes)*8)/1024)/elapsed)+" kb/s, ";
                                    ^^^
   }
   reporter.setStatus(status);
 }

thanks
greg


Re: Following <form> tags

2006-05-18 Thread Andrzej Bialecki

Chris Schneider wrote:

Gang,

I had a webmaster complain that our crawler was following his <form> links. 
Although he admits that his use of the GET method is a bit unorthodox, he feels strongly 
that form submissions with input fields shouldn't be followed by crawlers. Would it make 
sense to modify the HTML parser so that it checked to see whether such input fields exist 
before following <form> links?

  


I read through your email exchange, and setting aside all emotional 
content I think this is a valid request - indeed, as far as I can tell 
other major crawlers don't follow these links. We could either remove 
this, or make it optional (default not to use them).
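
The "optional, off by default" variant could look roughly like the sketch below. It is not the Nutch HTML parser: OutlinkSketch and the useFormAction flag are hypothetical names, and the input is assumed to be well-formed XHTML so that the JDK's XML parser can stand in for a real HTML parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical sketch: collect links, following form actions only on request. */
final class OutlinkSketch {

    /** Returns anchor hrefs, plus form actions when useFormAction is true. */
    static List<String> outlinks(String xhtml, boolean useFormAction) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
        List<String> links = new ArrayList<>();
        collect(doc.getElementsByTagName("a"), "href", links);
        if (useFormAction) {
            collect(doc.getElementsByTagName("form"), "action", links);
        }
        return links;
    }

    /** Appends the given attribute of every element in the list, if present. */
    private static void collect(NodeList nodes, String attr, List<String> out) {
        for (int i = 0; i < nodes.getLength(); i++) {
            Element el = (Element) nodes.item(i);
            if (el.hasAttribute(attr)) {
                out.add(el.getAttribute(attr));
            }
        }
    }
}
```

With the flag left at false, a page containing both an anchor and a GET form yields only the anchor's target.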


--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com