Fwd: links in db and pagerank calculation

2005-07-12 Thread Orkunt Sabuncu
Hi,

I found a setting that solves my first problem. Setting
db.ignore.internal.links to false makes Nutch store the internal links of a
web site as well.

I still couldn't find any clue about the second one. Why does the Nutch page
analysis module compute contributionForOutlinkers? There is nothing like this
in the usual PageRank algorithm. Any ideas? I am forwarding the first mail I
sent to nutch-user.

Thanks in advance,
-orkunt.

--  Forwarded Message  --

Subject: links in db and pagerank calculation
Date: Monday 11 July 2005 11:17
From: Orkunt Sabuncu <[EMAIL PROTECTED]>
To: nutch-user@lucene.apache.org

Hi,

Let's say we have a site with a diamond-like link structure. There are 4
pages: r (root), 1, 2, and 3. r has outlinks to 1 and 2, and both 1 and 2
have outlinks to 3. When we crawl this site, the webdb omits the link from 2
to 3. In the end there are only 3 links in the db: two from r pointing to 1
and 2, and one from 1 to 3.

This will surely affect the PageRank calculations. Is this a bug, or am I
misunderstanding something?
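
To illustrate the effect of the missing link, here is a minimal sketch of
basic PageRank (plain power iteration) run on the diamond site twice: once
with all 4 links and once with the 3 links that end up in the webdb. This is
not the code in DistributedAnalysisTool.java. The names pagerank, full_links
and stored_links are hypothetical, and the damping factor of 0.85 and the
even redistribution of rank from pages without outlinks are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Basic PageRank by power iteration (illustrative sketch only)."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Assumption: rank held by pages without outlinks is spread evenly.
        dangling = sum(rank[page] for page in pages if not links.get(page))
        new_rank = {page: (1.0 - damping) / n + damping * dangling / n
                    for page in pages}
        # Each page splits its rank equally among its outlinks.
        for source, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank


# The diamond site as described: r -> 1, r -> 2, 1 -> 3, 2 -> 3.
full_links = {"r": ["1", "2"], "1": ["3"], "2": ["3"], "3": []}
# The links reported in the webdb: the link 2 -> 3 is missing.
stored_links = {"r": ["1", "2"], "1": ["3"], "2": [], "3": []}

full_rank = pagerank(full_links)
stored_rank = pagerank(stored_links)

for page in ["r", "1", "2", "3"]:
    print(page, round(full_rank[page], 4), round(stored_rank[page], 4))
```

Under these assumptions page 3 receives a lower score when the link from 2
to 3 is missing, because page 2 no longer passes its rank on to it.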

Also, in the link analysis module (DistributedAnalysisTool.java) there are
some extra score contributions named contributionForOutlinkers. This
contribution considers the links to pages that themselves link to other
pages. I couldn't find any reference to this way of calculating PageRank in
the literature. The basic PageRank calculation considers only the outlinks,
so Nutch's calculation will produce scores that differ from basic PageRank.
So, what is the use of the contribution for outlinkers? Do you have any
ideas or references that explain this?

I am using Nutch-0.6

Thanks,
-orkunt.

---


[jira] Created: (NUTCH-71) Search web page doesn't not focus on query input

2005-07-12 Thread Christophe Noel (JIRA)
Search web page doesn't not focus on query input


 Key: NUTCH-71
 URL: http://issues.apache.org/jira/browse/NUTCH-71
 Project: Nutch
Type: Bug
  Components: searcher  
Reporter: Christophe Noel
Priority: Minor
 Attachments: searchQueryFocus.patch

In search.html and search.jsp, the keyboard cursor is not focused on the
form's query input.

I've made a patch for en and fr search.html and for search.jsp.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-71) Search web page doesn't not focus on query input

2005-07-12 Thread Christophe Noel (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-71?page=all ]

Christophe Noel updated NUTCH-71:
-

Attachment: searchQueryFocus.patch

Search.html (fr,en) and search.jsp focus patch.





Re: [jira] Created: (NUTCH-70) duplicate pages - virtual hosts in db.

2005-07-12 Thread [EMAIL PROTECTED]

Dear Piotr,

These pages are not identical. They have different links and, of
course, advertisements.

I use your great patch for nutch-7 ;), which removes identical pages.
I am waiting for your new patch (www.cnn.com, cnn.com), because it will
solve 90% of these problems. I don't think there is any other way to
solve the nutch-70 problem.

I don't think losing the anchor texts is a problem.
Thanks for your great patches, Ferenc

Piotr Kosiorowski wrote:


Hello Ferenc,
If the pages are really identical they can be removed using the "nutch dedup"
command. If not (sometimes such pages differ by some date, counter or
advertisement), there is currently no tool that can remove them. I am
working on a simple tool to remove duplicates like
http://www.cnn.com/ and http://cnn.com (that differ only in "www"), but
at this stage it is rather a hack: it removes the page from the Lucene index,
but all anchor text for the removed page is lost and the WebDB is not updated.

Regards
Piotr
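
The "www" case mentioned above can be sketched as a grouping of URLs by a
normalized key. This is only an illustration, not the tool being described
and not part of Nutch. The names dedup_key and group_duplicates are
hypothetical, and treating an empty path as "/" is an assumption.

```python
from urllib.parse import urlsplit


def dedup_key(url):
    """Return (host without one leading "www.", path) for a URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # urlsplit already lowercases the hostname
    if host.startswith("www."):
        host = host[len("www."):]
    # Assumption: "http://cnn.com" and "http://cnn.com/" name the same page.
    return host, parts.path or "/"


def group_duplicates(urls):
    """Group URLs that differ only in a leading "www." on the host."""
    groups = {}
    for url in urls:
        groups.setdefault(dedup_key(url), []).append(url)
    return groups


groups = group_duplicates([
    "http://www.cnn.com/",
    "http://cnn.com",
    "http://origo.hu/",
    "http://origo.matav.hu/",
])
for key, members in groups.items():
    print(key, members)
```

Note that a key like this only catches the "www" variants. Virtual hosts
with unrelated names, such as origo.hu and origo.matav.hu, stay in separate
groups and would still need a content-based comparison.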


Lutischán Ferenc (JIRA) wrote:


duplicate pages - virtual hosts in db.
--

 Key: NUTCH-70
 URL: http://issues.apache.org/jira/browse/NUTCH-70
 Project: Nutch
Type: Bug
 Environment: 0,7 dev
Reporter: Lutischán Ferenc


Dear Developers,

I have a problem with Nutch:
- There are many duplicate sites in the webdb and in the segments.
The source of this problem is:
- If a site uses 'virtual hosts' (as in Apache), e.g. www.origo.hu,
origo.hu, origo.matav.hu, origo.matavnet.hu etc., the result pages
are the same; only the inlinks differ.

- The IP address is the same.
- When searching, all virtual hosts appear in the results.

Google shows only one of these virtual hosts, while Nutch shows all of
them. The resulting Nutch db is larger, and in this case slower, than
Google's.


Do you have any idea how to remove these duplicates?

Regards,
Ferenc










[jira] Commented: (NUTCH-71) Search web page doesn't not focus on query input

2005-07-12 Thread Christophe Noel (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-71?page=comments#action_12315559 ] 

Christophe Noel commented on NUTCH-71:
--

(The patch was built with svn diff >>)





Re: [Nutch-dev] Exception "Could not obtain new output block"

2005-07-12 Thread [EMAIL PROTECTED]

Hello,

NDFS is still under development; I don't think it should be used in
production. You can use 'bin/nutch server'.


Regards,
Ferenc

reetesh chandran wrote:


Hello,

We are running Nutch on 3 networked machines running
Linux. We have Apache Tomcat running on all 3
machines. We are able to create a folder in NDFS, but
when we try to put a local file into NDFS through
the NDFS client, we get an exception “Could not obtain
new output block for file”. Sometimes the exception is
“NullPointer at java.net.Socket”. Could you please
share your thoughts on why such an exception could
occur?

Thanks and regards,
Reetesh.






___
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers