[jira] Commented: (NUTCH-16) boost documents matching a url pattern

2006-01-28 Thread byron miller (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-16?page=comments#action_12364354 ] 

byron miller commented on NUTCH-16:
---

Cool

An inverse of this plugin would be great, or an enhancement of this one to 
support +/- values based on patterns, as I think lowering the score of domains 
like i.like.to.spam.with.keywords.in.my.url.pretending.im.a.good.site.dot.com 
would be useful.

 boost documents matching a url pattern
 --

  Key: NUTCH-16
  URL: http://issues.apache.org/jira/browse/NUTCH-16
  Project: Nutch
 Type: New Feature
   Components: indexer
 Reporter: Stefan Groschupf
 Priority: Trivial
  Attachments: boost-url-src_and_bin.zip, boostingPluginPatch.txt

 The attached patch is a plugin that boosts documents matching a URL 
 pattern. 
 This could be useful for ranking documents from an intranet higher than 
 external pages.
 A README comes with the patch.
 Any comments are welcome.
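
A minimal sketch of pattern-based score adjustment, covering both the boost described in the issue and the +/- idea from the comment. This is not the code from the attached patch: the class name, the rule format, and both example patterns are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical sketch: multiply a document score by a per-pattern factor. */
class UrlBoostSketch {
    // Pattern -> factor. Factors above 1 boost, factors between 0 and 1 demote.
    private final Map<Pattern, Float> rules = new LinkedHashMap<>();

    void addRule(String regex, float factor) {
        rules.put(Pattern.compile(regex), factor);
    }

    /** Applies the first matching rule; URLs that match no rule keep their score. */
    float adjust(String url, float score) {
        for (Map.Entry<Pattern, Float> rule : rules.entrySet()) {
            if (rule.getKey().matcher(url).find()) {
                return score * rule.getValue();
            }
        }
        return score;
    }

    public static void main(String[] args) {
        UrlBoostSketch boost = new UrlBoostSketch();
        // Assumed intranet host, ranked higher than external pages.
        boost.addRule("^https?://intranet\\.example\\.com/", 2.0f);
        // Assumed spam heuristic: host names with seven or more dot-separated labels.
        boost.addRule("^https?://([^/.]+\\.){6,}[^/.]+(/|$)", 0.5f);

        System.out.println(boost.adjust("http://intranet.example.com/wiki", 1.0f));
        System.out.println(boost.adjust(
            "http://i.like.to.spam.with.keywords.in.my.url.pretending.im.a.good.site.dot.com/", 1.0f));
        System.out.println(boost.adjust("http://www.example.org/", 1.0f));
    }
}
```

Using a multiplicative factor keeps boosting and demoting in one rule table; an additive +/- value would work the same way with a different combine step.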

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-95) DeleteDuplicates depends on the order of input segments

2006-01-28 Thread Andrzej Bialecki (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-95?page=comments#action_12364355 ] 

Andrzej Bialecki  commented on NUTCH-95:


Yes, it should. SegmentMergeTool should handle this correctly in 0.7. For 0.8 
it is not (yet) supported...

 DeleteDuplicates depends on the order of input segments
 ---

  Key: NUTCH-95
  URL: http://issues.apache.org/jira/browse/NUTCH-95
  Project: Nutch
 Type: Bug
   Components: indexer
 Versions: 0.8-dev, 0.6, 0.7
 Reporter: Andrzej Bialecki 
 Assignee: Andrzej Bialecki 


 DeleteDuplicates depends on what order the input segments are processed, 
 which in turn depends on the order of segment dirs returned from 
 NutchFileSystem.listFiles(File). In most cases this is undesired and may lead 
 to deleting wrong records from indexes. The silent assumption that segments 
 at the end of the listing are more recent is not always true.
 Here's the explanation:
 * Dedup first deletes the URL duplicates by computing MD5 hashes for each 
 URL, and then sorting all records by (hash, segmentIdx, docIdx). SegmentIdx 
 is just an int index to the array of open IndexReaders - and if segment dirs 
 are moved/copied/renamed then entries in that array may change their order. 
 And then for all equal triples Dedup keeps just the first entry. Naturally, 
 if segmentIdx is changed due to dir renaming, a different record will be kept 
 and different ones will be deleted...
 * then Dedup deletes content duplicates, again by computing hashes for each 
 content, and then sorting records by (hash, segmentIdx, docIdx). However, by 
 now we already have a different set of undeleted docs depending on the order 
 of input segments. On top of that, the same factor acts here, i.e. segmentIdx 
 changes when you re-shuffle the input segment dirs - so again, when identical 
 entries are compared the one with the lowest (segmentIdx, docIdx) is picked.
 Solution: use the fetched date from the first record in each segment to 
 determine the order of segments. Alternatively, modify DeleteDuplicates to 
 use the newer algorithm from SegmentMergeTool. This algorithm works by 
 sorting records using tuples of (urlHash, contentHash, fetchDate, score, 
 urlLength). Then:
 1. If urlHash is the same, keep the doc with the highest fetchDate (the 
 latest version, as recorded by Fetcher).
 2. If contentHash is the same, keep the doc with the highest score, and then 
 if the scores are the same, keep the doc with the shortest url.
 Initial fix will be prepared for the trunk/ and then backported to the 
 release branch.
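
The two selection rules above can be sketched as follows. This is an illustration only, not the SegmentMergeTool code: the `Doc` type and its field names are hypothetical (taken from the tuple named in the issue), and hash maps are used here in place of the sort the issue describes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DedupSketch {
    /** Hypothetical record type; fields mirror the tuple in the issue text. */
    static final class Doc {
        final String urlHash, contentHash, url;
        final long fetchDate;
        final float score;

        Doc(String urlHash, String contentHash, long fetchDate, float score, String url) {
            this.urlHash = urlHash;
            this.contentHash = contentHash;
            this.fetchDate = fetchDate;
            this.score = score;
            this.url = url;
        }
    }

    /** Rule 1: among docs with the same urlHash, keep the highest fetchDate. */
    static Collection<Doc> dedupByUrl(Collection<Doc> docs) {
        Map<String, Doc> best = new HashMap<>();
        for (Doc d : docs) {
            best.merge(d.urlHash, d, (kept, cand) -> cand.fetchDate > kept.fetchDate ? cand : kept);
        }
        return best.values();
    }

    /** Rule 2: among docs with the same contentHash, keep the highest score, then the shortest url. */
    static Collection<Doc> dedupByContent(Collection<Doc> docs) {
        Map<String, Doc> best = new HashMap<>();
        for (Doc d : docs) {
            best.merge(d.contentHash, d, (kept, cand) -> {
                if (cand.score != kept.score) {
                    return cand.score > kept.score ? cand : kept;
                }
                return cand.url.length() < kept.url.length() ? cand : kept;
            });
        }
        return best.values();
    }

    static List<Doc> dedup(Collection<Doc> docs) {
        return new ArrayList<>(dedupByContent(dedupByUrl(docs)));
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
            new Doc("u1", "c1", 100L, 1.0f, "http://a/"),
            new Doc("u1", "c1", 200L, 1.0f, "http://a/"),       // newer fetch of the same URL
            new Doc("u2", "c1", 150L, 2.0f, "http://b/long"));  // same content, higher score
        for (Doc d : dedup(docs)) {
            System.out.println(d.url + " fetched at " + d.fetchDate);
        }
    }
}
```

Because the choice depends only on fetchDate, score, and URL length, the result does not depend on segment order, except when two records tie on every compared field.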




[jira] Commented: (NUTCH-79) Fault tolerant searching.

2006-01-28 Thread byron miller (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-79?page=comments#action_12364357 ] 

byron miller commented on NUTCH-79:
---

Piotr,

Any update on this? Have you been able to run with this or still working out 
the kinks?

 Fault tolerant searching.
 -

  Key: NUTCH-79
  URL: http://issues.apache.org/jira/browse/NUTCH-79
  Project: Nutch
 Type: New Feature
   Components: searcher
 Reporter: Piotr Kosiorowski
  Attachments: patch

 I have finally managed to prepare the first version of the fault-tolerant 
 searching I promised a long time ago. 
 It reads the server configuration from the search-groups.txt file (in the 
 startup directory or the directory specified by searcher.dir) if no 
 search-servers.txt file is present. If search-servers.txt is present, it is 
 read and handled as before.
 ---
 Format of search-groups.txt:
   search.group.count=[int]
   search.group.name.[i]=[string] (for i=0 to count-1)

   For each name:
   [name].part.count=[int] partitionCount
   [name].part.[i].host=[string] (for i=0 to partitionCount-1)
   [name].part.[i].port=[int] (for i=0 to partitionCount-1)

   Example:
   search.group.count=2
   search.group.name.0=master
   search.group.name.1=backup

   master.part.count=2
   master.part.0.host=host1
   master.part.0.port=
   master.part.1.host=host2
   master.part.1.port=

   backup.part.count=2
   backup.part.0.host=host3
   backup.part.0.port=
   backup.part.1.host=host4
   backup.part.1.port=
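
Since the format is plain key=value pairs, it could be read with `java.util.Properties`. The sketch below is not the code from the attached patch: the class and method names are hypothetical, and the port numbers in the sample text are made-up values chosen only so the example runs.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

class SearchGroupsSketch {
    /** Returns group name -> list of "host:port" strings, in group index order. */
    static Map<String, List<String>> parse(String text) throws IOException {
        Properties p = new Properties();
        p.load(new StringReader(text));
        Map<String, List<String>> groups = new LinkedHashMap<>();
        int groupCount = Integer.parseInt(p.getProperty("search.group.count").trim());
        for (int g = 0; g < groupCount; g++) {
            String name = p.getProperty("search.group.name." + g).trim();
            int partCount = Integer.parseInt(p.getProperty(name + ".part.count").trim());
            List<String> parts = new ArrayList<>();
            for (int i = 0; i < partCount; i++) {
                String host = p.getProperty(name + ".part." + i + ".host").trim();
                int port = Integer.parseInt(p.getProperty(name + ".part." + i + ".port").trim());
                parts.add(host + ":" + port);
            }
            groups.put(name, parts);
        }
        return groups;
    }

    public static void main(String[] args) throws IOException {
        // Port values below are placeholders, not taken from the issue.
        String text =
            "search.group.count=1\n"
            + "search.group.name.0=master\n"
            + "master.part.count=2\n"
            + "master.part.0.host=host1\n"
            + "master.part.0.port=9001\n"
            + "master.part.1.host=host2\n"
            + "master.part.1.port=9002\n";
        System.out.println(parse(text));
    }
}
```

The sketch does no validation; a missing key or an empty port would fail with an exception, which a real reader would need to report more helpfully.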
 
 If more than one search group is defined in the configuration file, requests 
 are distributed among the groups in round-robin fashion. If one of the 
 servers in a group fails to respond, the whole group is treated as inactive 
 and removed from the pool used to distribute requests. A separate recovery 
 thread checks every searcher.recovery.delay seconds (default 60) whether an 
 inactive group has come back to life and, if so, adds it back to the pool of 
 active groups.
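
The failover behaviour described above can be sketched roughly as follows. This is an assumption-laden illustration, not the patch itself: the class and method names are hypothetical, groups are represented by name only, and the recovery thread is reduced to a `recover` method that such a thread would call every searcher.recovery.delay seconds.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

class GroupPoolSketch {
    private final List<String> groups;
    private final Set<String> inactive = ConcurrentHashMap.newKeySet();
    private final AtomicInteger next = new AtomicInteger();

    GroupPoolSketch(List<String> groups) {
        this.groups = new ArrayList<>(groups);
    }

    /** Returns the next active group in round-robin order, or null if none is active. */
    String pick() {
        for (int tries = 0; tries < groups.size(); tries++) {
            int i = Math.floorMod(next.getAndIncrement(), groups.size());
            String group = groups.get(i);
            if (!inactive.contains(group)) {
                return group;
            }
        }
        return null;
    }

    /** Called when any server of the group fails to respond. */
    void markFailed(String group) {
        inactive.add(group);
    }

    /** Re-activates every inactive group for which the probe reports success. */
    void recover(Predicate<String> isAlive) {
        inactive.removeIf(isAlive);
    }

    public static void main(String[] args) {
        GroupPoolSketch pool = new GroupPoolSketch(Arrays.asList("master", "backup"));
        System.out.println(pool.pick());
        System.out.println(pool.pick());
        pool.markFailed("backup");
        System.out.println(pool.pick());
        pool.recover(group -> true); // pretend the probe succeeded
        System.out.println(pool.pick());
    }
}
```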




[jira] Commented: (NUTCH-14) NullPointerException NutchBean.getSummary

2006-01-28 Thread byron miller (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-14?page=comments#action_12364358 ] 

byron miller commented on NUTCH-14:
---

Are you still hitting this, Stefan?

 NullPointerException NutchBean.getSummary
 -

  Key: NUTCH-14
  URL: http://issues.apache.org/jira/browse/NUTCH-14
  Project: Nutch
 Type: Bug
   Components: searcher
 Reporter: Stefan Groschupf
 Priority: Minor


 In heavy load scenarios this may happen when a connection breaks.
 java.lang.NullPointerException
 at java.util.Hashtable.get(Hashtable.java:333)
 at net.nutch.ipc.Client.getConnection(Client.java:276)
 at net.nutch.ipc.Client.call(Client.java:251)
 at 
 net.nutch.searcher.DistributedSearch$Client.getSummary(DistributedSearch.java:418)
 at net.nutch.searcher.NutchBean.getSummary(NutchBean.java:236)
 at 
 org.apache.jsp.search_jsp._jspService(org.apache.jsp.search_jsp:396)
 at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:99)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
 at 
 org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:325)
 at 
 org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:295)
 at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:245)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
 at 
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
 at 
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
 at 
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214)
 at 
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
 at 
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
 at 
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
 at 
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
 at 
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
 at 
 org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:825)
 at 
 org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:738)
 at 
 org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:526)
 at 
 org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:80)
 at 
 org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
 at java.lang.Thread.run(Thread.java:552)
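
One reading of the trace, not a confirmed diagnosis: the exception is raised inside `Hashtable.get` itself, and `java.util.Hashtable` throws a NullPointerException when the key is null. The sketch below only demonstrates that JDK behaviour and a possible guard; the `lookup` helper and the map contents are hypothetical and are not taken from `net.nutch.ipc.Client`.

```java
import java.util.Hashtable;

class HashtableNullKeySketch {
    /** Hypothetical guarded lookup: returns null instead of throwing on a null key. */
    static String lookup(Hashtable<String, String> connections, String address) {
        if (address == null) {
            return null;
        }
        return connections.get(address);
    }

    public static void main(String[] args) {
        Hashtable<String, String> connections = new Hashtable<>();
        connections.put("host1:9001", "connection-1"); // placeholder entry

        try {
            connections.get(null); // Hashtable does not permit null keys
        } catch (NullPointerException e) {
            System.out.println("Hashtable.get(null) threw NullPointerException");
        }

        System.out.println(lookup(connections, null));
        System.out.println(lookup(connections, "host1:9001"));
    }
}
```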




Re: svn commit: r359822 - in /lucene/nutch/trunk: bin/ conf/ src/java/org/apache/nutch/crawl/ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/indexer/ src/java/org/apache/nutch/parse/ src

2006-01-28 Thread Sami Siren

[EMAIL PROTECTED] wrote:

Added: lucene/nutch/trunk/src/java/org/apache/nutch/crawl/SignatureFactory.java
URL: 
http://svn.apache.org/viewcvs/lucene/nutch/trunk/src/java/org/apache/nutch/crawl/SignatureFactory.java?rev=359822&view=auto
==
--- lucene/nutch/trunk/src/java/org/apache/nutch/crawl/SignatureFactory.java 
(added)
+++ lucene/nutch/trunk/src/java/org/apache/nutch/crawl/SignatureFactory.java 
Thu Dec 29 07:28:30 2005

...

+  public static Signature getSignature(NutchConf conf) {
+    String clazz = conf.get("db.signature.class", 
MD5Signature.class.getName());
+    Signature impl = (Signature)conf.getObject(clazz);
+    if (impl == null) {
+      try {
+        LOG.info("Using Signature impl: " + clazz);
+        Class implClass = Class.forName(clazz);
+        impl = (Signature)implClass.newInstance();
+        impl.setConf(conf);
+      } catch (Exception e) {


should there be a

conf.setObject(clazz,impl);

inside that try ?
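
To illustrate the point of the question: without storing the new instance back, every call would find `getObject` returning null and create the implementation again. The sketch below shows the get, create, then store pattern in isolation. `Conf` here is a hypothetical stand-in with the two methods named in this message, not the real NutchConf, and `java.lang.StringBuilder` is used only as a convenient class with a public no-argument constructor.

```java
import java.util.HashMap;
import java.util.Map;

class SignatureCacheSketch {
    /** Hypothetical stand-in for the object cache on NutchConf. */
    static final class Conf {
        private final Map<String, Object> objects = new HashMap<>();

        Object getObject(String name) {
            return objects.get(name);
        }

        void setObject(String name, Object value) {
            objects.put(name, value);
        }
    }

    /** Creates the instance once, then serves it from the cache on later calls. */
    static Object getCached(Conf conf, String clazz) throws Exception {
        Object impl = conf.getObject(clazz);
        if (impl == null) {
            impl = Class.forName(clazz).getDeclaredConstructor().newInstance();
            conf.setObject(clazz, impl); // the step the question asks about
        }
        return impl;
    }

    public static void main(String[] args) throws Exception {
        Conf conf = new Conf();
        Object first = getCached(conf, "java.lang.StringBuilder");
        Object second = getCached(conf, "java.lang.StringBuilder");
        System.out.println(first == second);
    }
}
```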

--
 Sami Siren