[jira] Commented: (NUTCH-16) boost documents matching a url pattern
[ http://issues.apache.org/jira/browse/NUTCH-16?page=comments#action_12364354 ]

byron miller commented on NUTCH-16:
-----------------------------------

Cool. An inverse of this plugin would be great, or an enhancement of it that takes +/- values based on patterns, since I think lowering the score of domains like i.like.to.spam.with.keywords.in.my.url.pretending.im.a.good.site.dot.com would be useful.

> boost documents matching a url pattern
> --------------------------------------
>
>          Key: NUTCH-16
>          URL: http://issues.apache.org/jira/browse/NUTCH-16
>      Project: Nutch
>         Type: New Feature
>   Components: indexer
>     Reporter: Stefan Groschupf
>     Priority: Trivial
>  Attachments: boost-url-src_and_bin.zip, boostingPluginPatch.txt
>
> The attached patch is a plugin that allows boosting documents that match a URL pattern. This could be useful to rank documents from an intranet higher than external pages. A README comes with the patch. Any comments are welcome.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
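The +/- idea above can be sketched as a simple rule table: patterns mapped to multiplicative boost factors, where values above 1.0 promote and values below 1.0 demote. This is a minimal illustration of the concept, not the attached plugin's actual code; the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical sketch of +/- URL-pattern boosting (not the attached plugin). */
public class UrlPatternBoost {
    // Pattern -> multiplicative factor; > 1.0 promotes, < 1.0 demotes.
    private final Map<Pattern, Float> rules = new LinkedHashMap<>();

    public void addRule(String regex, float factor) {
        rules.put(Pattern.compile(regex), factor);
    }

    /** Returns the base score scaled by the first matching rule, else unchanged. */
    public float score(String url, float baseScore) {
        for (Map.Entry<Pattern, Float> e : rules.entrySet()) {
            if (e.getKey().matcher(url).find()) {
                return baseScore * e.getValue();
            }
        }
        return baseScore;
    }
}
```

A demotion rule for keyword-stuffed hostnames then becomes just another entry in the same table, which is the "inverse" behavior requested in the comment.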
[jira] Commented: (NUTCH-95) DeleteDuplicates depends on the order of input segments
[ http://issues.apache.org/jira/browse/NUTCH-95?page=comments#action_12364355 ]

Andrzej Bialecki commented on NUTCH-95:
---------------------------------------

Yes, it should. SegmentMergeTool should handle this correctly in 0.7. For 0.8 it is not (yet) supported...

> DeleteDuplicates depends on the order of input segments
> -------------------------------------------------------
>
>          Key: NUTCH-95
>          URL: http://issues.apache.org/jira/browse/NUTCH-95
>      Project: Nutch
>         Type: Bug
>   Components: indexer
>     Versions: 0.8-dev, 0.6, 0.7
>     Reporter: Andrzej Bialecki
>     Assignee: Andrzej Bialecki
>
> DeleteDuplicates depends on the order in which the input segments are processed, which in turn depends on the order of segment dirs returned from NutchFileSystem.listFiles(File). In most cases this is undesirable and may lead to deleting the wrong records from indexes. The silent assumption that segments at the end of the listing are more recent is not always true. Here's the explanation:
>
> * Dedup first deletes the URL duplicates by computing MD5 hashes for each URL, and then sorting all records by (hash, segmentIdx, docIdx). SegmentIdx is just an int index into the array of open IndexReaders - and if segment dirs are moved/copied/renamed, entries in that array may change their order. For each run of equal triples, Dedup then keeps just the first entry. Naturally, if segmentIdx changes due to dir renaming, a different record will be kept and different ones will be deleted...
> * Then Dedup deletes content duplicates, again by computing hashes for each content and sorting records by (hash, segmentIdx, docIdx). However, by now we already have a different set of undeleted docs depending on the order of input segments. On top of that, the same factor acts here, i.e. segmentIdx changes when you re-shuffle the input segment dirs - so again, when identical entries are compared, the one with the lowest (segmentIdx, docIdx) is picked.
>
> Solution: use the fetch date from the first record in each segment to determine the order of segments.
>
> Alternatively, modify DeleteDuplicates to use the newer algorithm from SegmentMergeTool. This algorithm works by sorting records using tuples of (urlHash, contentHash, fetchDate, score, urlLength). Then:
>
> 1. If urlHash is the same, keep the doc with the highest fetchDate (the latest version, as recorded by Fetcher).
> 2. If contentHash is the same, keep the doc with the highest score, and then, if the scores are the same, the doc with the shortest URL.
>
> Initial fix will be prepared for trunk/ and then backported to the release branch.
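The SegmentMergeTool-style ordering described above can be expressed as a single comparator: sort by (urlHash, contentHash, fetchDate desc, score desc, urlLength asc), so that the first record in every run of equal hashes is the one to keep. This is a sketch with stand-in types, not the actual Nutch record classes.

```java
import java.util.Comparator;

/** Stand-in record for the (urlHash, contentHash, fetchDate, score, urlLength)
 *  tuple used by the SegmentMergeTool dedup ordering (sketch, not Nutch code). */
public class DedupRecord {
    final String urlHash, contentHash;
    final long fetchDate;
    final float score;
    final int urlLength;

    public DedupRecord(String urlHash, String contentHash,
                       long fetchDate, float score, int urlLength) {
        this.urlHash = urlHash; this.contentHash = contentHash;
        this.fetchDate = fetchDate; this.score = score; this.urlLength = urlLength;
    }

    /** After sorting with this order, keep the first record of each equal-hash run:
     *  latest fetch wins for URL duplicates; highest score, then shortest URL,
     *  wins for content duplicates. */
    public static final Comparator<DedupRecord> ORDER =
        Comparator.comparing((DedupRecord r) -> r.urlHash)
            .thenComparing(r -> r.contentHash)
            .thenComparing(Comparator.comparingLong((DedupRecord r) -> r.fetchDate).reversed())
            .thenComparing(Comparator.comparingDouble((DedupRecord r) -> r.score).reversed())
            .thenComparingInt(r -> r.urlLength);
}
```

Because the ordering depends only on record contents, it is stable under any renaming or re-shuffling of segment directories, unlike the segmentIdx-based sort.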
[jira] Commented: (NUTCH-79) Fault tolerant searching.
[ http://issues.apache.org/jira/browse/NUTCH-79?page=comments#action_12364357 ]

byron miller commented on NUTCH-79:
-----------------------------------

Piotr,

Any update on this? Have you been able to run with this, or are you still working out the kinks?

> Fault tolerant searching.
> -------------------------
>
>          Key: NUTCH-79
>          URL: http://issues.apache.org/jira/browse/NUTCH-79
>      Project: Nutch
>         Type: New Feature
>   Components: searcher
>     Reporter: Piotr Kosiorowski
>  Attachments: patch
>
> I have finally managed to prepare the first version of the fault-tolerant searching I promised a long time ago. It reads the server configuration from a search-groups.txt file (in the startup directory or the directory specified by searcher.dir) if no search-servers.txt file is present. If search-servers.txt is present, it is read and handled as previously.
>
> Format of search-groups.txt:
>
>   search.group.count=[int]
>   search.group.name.[i]=[string] (for i=0 to count-1)
>
>   For each name:
>   [name].part.count=[int] partitionCount
>   [name].part.[i].host=[string] (for i=0 to partitionCount-1)
>   [name].part.[i].port=[int] (for i=0 to partitionCount-1)
>
>   Example:
>   search.group.count=2
>   search.group.name.0=master
>   search.group.name.1=backup
>
>   master.part.count=2
>   master.part.0.host=host1
>   master.part.0.port=
>   master.part.1.host=host2
>   master.part.1.port=
>
>   backup.part.count=2
>   backup.part.0.host=host3
>   backup.part.0.port=
>   backup.part.1.host=host4
>   backup.part.1.port=
>
> If more than one search group is defined in the configuration file, requests are distributed among the groups in round-robin fashion. If one of the servers from a group fails to respond, the whole group is treated as inactive and removed from the pool used to distribute requests. A separate recovery thread checks every searcher.recovery.delay seconds (default 60) whether an inactive group has become alive again and, if so, adds it back to the pool of active groups.
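The pool behavior described in the issue (round-robin selection, whole-group failover, periodic recovery) can be sketched roughly as below. The class and method names are hypothetical; the actual patch's implementation details are in the attachment, not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the fault-tolerant group pool described in NUTCH-79:
 *  round-robin over active groups, whole-group removal on failure, and a
 *  recovery hook meant to be driven by a periodic recovery thread. */
public class SearchGroupPool {
    private final List<String> active = new ArrayList<>();
    private final List<String> inactive = new ArrayList<>();
    private int next = 0;

    public SearchGroupPool(List<String> groups) { active.addAll(groups); }

    /** Pick the next group round-robin; null when every group is inactive. */
    public synchronized String nextGroup() {
        if (active.isEmpty()) return null;
        return active.get(next++ % active.size());
    }

    /** Any server in the group failed to respond: drop the whole group. */
    public synchronized void markFailed(String group) {
        if (active.remove(group)) inactive.add(group);
    }

    /** Called by the recovery thread (every searcher.recovery.delay seconds);
     *  'alive' is the result of probing the group's servers. */
    public synchronized void recover(String group, boolean alive) {
        if (alive && inactive.remove(group)) active.add(group);
    }
}
```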
[jira] Commented: (NUTCH-14) NullPointerException NutchBean.getSummary
[ http://issues.apache.org/jira/browse/NUTCH-14?page=comments#action_12364358 ]

byron miller commented on NUTCH-14:
-----------------------------------

Are you still hitting this, Stefan?

> NullPointerException NutchBean.getSummary
> -----------------------------------------
>
>          Key: NUTCH-14
>          URL: http://issues.apache.org/jira/browse/NUTCH-14
>      Project: Nutch
>         Type: Bug
>   Components: searcher
>     Reporter: Stefan Groschupf
>     Priority: Minor
>
> In heavy-load scenarios this may happen when a connection breaks:
>
> java.lang.NullPointerException
>         at java.util.Hashtable.get(Hashtable.java:333)
>         at net.nutch.ipc.Client.getConnection(Client.java:276)
>         at net.nutch.ipc.Client.call(Client.java:251)
>         at net.nutch.searcher.DistributedSearch$Client.getSummary(DistributedSearch.java:418)
>         at net.nutch.searcher.NutchBean.getSummary(NutchBean.java:236)
>         at org.apache.jsp.search_jsp._jspService(org.apache.jsp.search_jsp:396)
>         at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:99)
>         at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
>         at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:325)
>         at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:295)
>         at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:245)
>         at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
>         at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
>         at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
>         at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214)
>         at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
>         at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
>         at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
>         at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
>         at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
>         at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:825)
>         at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:738)
>         at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:526)
>         at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:80)
>         at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
>         at java.lang.Thread.run(Thread.java:552)
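The top frame, java.util.Hashtable.get, throws NullPointerException for a null key (unlike HashMap, Hashtable permits neither null keys nor null values). So a plausible reading of the trace is that a broken connection leaves the lookup key null inside Client.getConnection. The guard below is a hypothetical illustration of that failure mode, not the Nutch fix.

```java
import java.util.Hashtable;

/** Hypothetical sketch: Hashtable.get(null) throws NPE, so a connection
 *  cache keyed by address must guard against a null key left behind by a
 *  broken connection. Not the actual net.nutch.ipc.Client code. */
public class ConnectionCache {
    private final Hashtable<String, Object> connections = new Hashtable<>();

    public Object getConnection(String address) {
        if (address == null) {
            return null; // broken connection: report "no connection" instead of NPE
        }
        return connections.get(address);
    }

    public void put(String address, Object connection) {
        connections.put(address, connection);
    }
}
```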
Re: svn commit: r359822 - in /lucene/nutch/trunk: bin/ conf/ src/java/org/apache/nutch/crawl/ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/indexer/ src/java/org/apache/nutch/parse/ src
[EMAIL PROTECTED] wrote:

> Added: lucene/nutch/trunk/src/java/org/apache/nutch/crawl/SignatureFactory.java
> URL: http://svn.apache.org/viewcvs/lucene/nutch/trunk/src/java/org/apache/nutch/crawl/SignatureFactory.java?rev=359822&view=auto
> ==============================================================================
> --- lucene/nutch/trunk/src/java/org/apache/nutch/crawl/SignatureFactory.java (added)
> +++ lucene/nutch/trunk/src/java/org/apache/nutch/crawl/SignatureFactory.java Thu Dec 29 07:28:30 2005
...
> +  public static Signature getSignature(NutchConf conf) {
> +    String clazz = conf.get("db.signature.class", MD5Signature.class.getName());
> +    Signature impl = (Signature)conf.getObject(clazz);
> +    if (impl == null) {
> +      try {
> +        LOG.info("Using Signature impl: " + clazz);
> +        Class implClass = Class.forName(clazz);
> +        impl = (Signature)implClass.newInstance();
> +        impl.setConf(conf);
> +      } catch (Exception e) {

Should there be a conf.setObject(clazz, impl); inside that try?

--
Sami Siren
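The point of Sami's question can be shown in miniature: without storing the instance back via conf.setObject, conf.getObject(clazz) stays null on every call, so the factory re-instantiates the class each time instead of caching it. The sketch below stands in for NutchConf with a plain map (an assumption for illustration only) and includes the proposed cache-write.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the factory-caching pattern at issue. A plain Map stands in
 *  for NutchConf's getObject/setObject (assumption for illustration). */
public class SignatureFactorySketch {
    public interface Signature {}
    public static class MD5Signature implements Signature {}

    /** Counts instantiations to show the effect of the cache-write. */
    public static int instantiations = 0;

    public static Signature getSignature(Map<String, Object> conf) {
        String clazz = MD5Signature.class.getName();
        Signature impl = (Signature) conf.get(clazz); // conf.getObject(clazz)
        if (impl == null) {
            impl = new MD5Signature();                // Class.forName(clazz).newInstance()
            instantiations++;
            conf.put(clazz, impl);                    // the missing conf.setObject(clazz, impl)
        }
        return impl;
    }
}
```

With the cache-write in place, repeated calls return the same configured instance; without it, every call pays the reflective instantiation and setConf cost again.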