Re: nightly build broken?
I get the nightly to run, but it never completes anything; it always gets stuck at 98% here and there. I'll try today's build and see what happens. --- Stefan Groschupf [EMAIL PROTECTED] wrote: Hi, looks like the latest nightly build is broken. Looks like the jar that comes with the nightly build contains some patches that are not yet in the svn sources. Is someone able to get the latest nutch nightly to run? Thanks. Stefan
Re: scalability limits getDetails, mapFile Readers?
I would like to see something organized as active, in-process and inbound. Active data is live and on the query servers (both indexes and correlating segments), in-process is the set of tasks currently being mapped out, and inbound is the processes/data pending to be processed. Active nodes report as in the search pool. In-process nodes are really data nodes doing all of the number crunching/import/merging/indexing, and inbound is everything in fetch/pre-processing. The cycle would be a pull cycle: active nodes pull from the corresponding data nodes, which in turn pull from the corresponding inbound nodes. Events/batches could trigger the pull so that it is a complete or usable data set. Some lightweight workflow engine could allow you to process/manage the cycle of data. I would also like to see this be DFS block aware - able to process the data on the active data server where that data resides (as much as possible). Such a file table could be used to associate the data through the entire process stream and allow for fairly linear growth. Such a system could also be aware of its own capacity, in that the inbound processes would fail/halt if disk space on the DFS system isn't capable of handling new tasks, and vice versa: if the active nodes are at capacity, tasks could be told to stop/hold. You could use this logic to add more nodes where necessary, resume processing, and chart your growth. I come from the ERP/Oracle world, so I have very much learned to appreciate the distributed architecture and the concept of an aware system - such as concurrent processing that is distributed across many nodes, aware of the status of each task, able to hold/wait or act upon the condition of the system, and able to grow fairly linearly as needed. -byron --- Stefan Groschupf [EMAIL PROTECTED] wrote: Hi Andrzej, * merge 80 segments into 1. A lot of IO involved... and you have to repeat it from time to time. Ugly. I agree. * implement a search server as a map task. Several challenges: it needs to partition the Lucene index, and it has to copy all parts of segments and indexes from DFS to the local storage, otherwise performance will suffer. However, the number of open files per machine would be reduced, because (ideally) each machine would deal with few or a single part of segment and a single part of index... Well, I played around and already had a kind of prototype. I had seen the following problems: + having a kind of repository of active search servers. Possibility A: find all tasktrackers running a specific task (already discussed in the hadoop mailing list). Possibility B: having an rpc server running in the jvm that runs the search server client; add the hostname to the jobconf and, similar to task - jobtracker, the search server announces itself via heartbeat to the search server 'repository'. + having the index locally and the segment in the dfs. ++ adding to NutchBean init a dfs for the index and one for segments could fix this, or more generally add support for stream handlers like dfs:// vs file://. (very long term) + downloading an index from dfs when the mapper starts, or just index the segment data to local hdd and let the mapper run for the next 30 days? Stefan
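Purely to illustrate Stefan's "dfs:// vs file://" point, here is a minimal sketch of routing index/segment paths to local disk or DFS by URI scheme. The class and method names are hypothetical, not an actual Nutch API.

import java.net.URI;

// Hypothetical sketch: route index/segment paths to local disk or DFS by URI scheme.
// None of these types exist in Nutch as-is; this only illustrates the proposal.
public class SchemeRouter {

  public enum Backend { LOCAL, DFS }

  /** Decide where a path lives based on its scheme (file:// vs dfs://). */
  public static Backend backendFor(String path) {
    URI uri = URI.create(path);
    String scheme = uri.getScheme();
    if (scheme == null || "file".equals(scheme)) {
      return Backend.LOCAL;          // default: local file system
    }
    if ("dfs".equals(scheme)) {
      return Backend.DFS;            // segment data stays in DFS
    }
    throw new IllegalArgumentException("Unknown scheme: " + scheme);
  }

  public static void main(String[] args) {
    System.out.println(backendFor("file:///data/index"));   // LOCAL
    System.out.println(backendFor("dfs:///segments/2006")); // DFS
  }
}

With something like this, NutchBean could in principle be pointed at a local index and a DFS-resident segment at the same time, which is the split Stefan describes.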
Re: Carrot2 v. 1.0.1. [clustering plugin]
I would love to see it continue as a plugin. I'm moving to mapreduce myself so i would be interested in utilizing it there. thanks for the great work! look forward to trying out your updates. feel free to contact me directly if you wish. -byron --- Dawid Weiss [EMAIL PROTECTED] wrote: Hi there, We've been quite busy with putting things together at Carrot2. Version 1.0.1 is out -- it is a stable release with a few tweaks and tunings that appeared after 1.0. We also have a Web site ;) http://www.carrot2.org So... I think it's time for reintegrating that code into Nutch clustering plugin. If folks still consider it valuable enough to be kept in the codebase, that is (I realize it is a big plugin). Summarizing, my questions are: - Is there still enough interest in maintaining this functionality in the Nutch codebase? I admit I'm really busy and I don't have enough time to keep up with Nutch's development. My contributions will be there, but their frequency might be insufficient to keep the plugin usable at all times. - Which Nutch version you want me to work with if there's still interest -- map-reduce, maintenance line or both? Dawid
indexSorter - applied to SVN or patch in Jira?
Has the IndexSorter code discussed a while back been pushed to Jira or put in SVN? I'd like to give it a whirl on some of my indexes, but the archive copy I can find cuts off the post with the code attached.
[jira] Commented: (NUTCH-16) boost documents matching a url pattern
[ http://issues.apache.org/jira/browse/NUTCH-16?page=comments#action_12364354 ] byron miller commented on NUTCH-16: --- Cool. An inverse of this plugin would be great, or an enhancement of it that takes +/- values based on patterns, since I think lowering the score of domains like i.like.to.spam.with.keywords.in.my.url.pretending.im.a.good.site.dot.com would be just as useful. boost documents matching a url pattern -- Key: NUTCH-16 URL: http://issues.apache.org/jira/browse/NUTCH-16 Project: Nutch Type: New Feature Components: indexer Reporter: Stefan Groschupf Priority: Trivial Attachments: boost-url-src_and_bin.zip, boostingPluginPatch.txt The attached patch is a plugin that allows boosting documents matching a url pattern. This could be useful to rank documents from an intranet higher than external pages. A README comes with the patch. Any comments are welcome. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-79) Fault tolerant searching.
[ http://issues.apache.org/jira/browse/NUTCH-79?page=comments#action_12364357 ] byron miller commented on NUTCH-79: --- Piotr, Any update on this? Have you been able to run with this, or are you still working out the kinks? Fault tolerant searching. - Key: NUTCH-79 URL: http://issues.apache.org/jira/browse/NUTCH-79 Project: Nutch Type: New Feature Components: searcher Reporter: Piotr Kosiorowski Attachments: patch I have finally managed to prepare the first version of the fault tolerant searching I promised a long time ago. It reads the server configuration from a search-groups.txt file (in the startup directory or the directory specified by searcher.dir) if no search-servers.txt file is present. If search-servers.txt is present, it is read and handled as previously. --- Format of search-groups.txt:

search.group.count=[int]
search.group.name.[i]=[string] (for i=0 to count-1)

For each name:
[name].part.count=[int] partitionCount
[name].part.[i].host=[string] (for i=0 to partitionCount-1)
[name].part.[i].port=[int] (for i=0 to partitionCount-1)

Example:
search.group.count=2
search.group.name.0=master
search.group.name.1=backup

master.part.count=2
master.part.0.host=host1
master.part.0.port=
master.part.1.host=host2
master.part.1.port=

backup.part.count=2
backup.part.0.host=host3
backup.part.0.port=
backup.part.1.host=host4
backup.part.1.port=

If more than one search group is defined in the configuration file, requests are distributed among groups in round-robin fashion. If one of the servers from a group fails to respond, the whole group is treated as inactive and removed from the pool used to distribute requests. There is a separate recovery thread that every searcher.recovery.delay seconds (default 60) checks whether an inactive group became alive again and, if so, adds it back to the pool of active groups. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
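As a rough illustration of the round-robin/failover behaviour described above (hypothetical class names; the real logic lives in Piotr's attached patch, not in this sketch):

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the behaviour described in NUTCH-79; not the patch itself.
public class SearchGroupPool {

  public static class Group {
    final String name;
    Group(String name) { this.name = name; }
  }

  private final List<Group> active = new CopyOnWriteArrayList<Group>();
  private final List<Group> inactive = new CopyOnWriteArrayList<Group>();
  private final AtomicInteger next = new AtomicInteger(0);

  public SearchGroupPool(List<Group> groups) { active.addAll(groups); }

  /** Round-robin over the currently active groups. */
  public Group pick() {
    Object[] snapshot = active.toArray();
    if (snapshot.length == 0) throw new IllegalStateException("no active search groups");
    int i = (next.getAndIncrement() & 0x7fffffff) % snapshot.length;
    return (Group) snapshot[i];
  }

  /** A failing server takes its whole group out of the pool. */
  public void markFailed(Group g) {
    if (active.remove(g)) inactive.add(g);
  }

  /** Called by a recovery thread every searcher.recovery.delay seconds. */
  public void markRecovered(Group g) {
    if (inactive.remove(g)) active.add(g);
  }
}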
[jira] Commented: (NUTCH-14) NullPointerException NutchBean.getSummary
[ http://issues.apache.org/jira/browse/NUTCH-14?page=comments#action_12364358 ] byron miller commented on NUTCH-14: --- Are you still hitting this Stefan? NullPointerException NutchBean.getSummary - Key: NUTCH-14 URL: http://issues.apache.org/jira/browse/NUTCH-14 Project: Nutch Type: Bug Components: searcher Reporter: Stefan Groschupf Priority: Minor In heavy load scenarios this may happens when connection broke. java.lang.NullPointerException at java.util.Hashtable.get(Hashtable.java:333) at net.nutch.ipc.Client.getConnection(Client.java:276) at net.nutch.ipc.Client.call(Client.java:251) at net.nutch.searcher.DistributedSearch$Client.getSummary(DistributedSearch.java:418) at net.nutch.searcher.NutchBean.getSummary(NutchBean.java:236) at org.apache.jsp.search_jsp._jspService(org.apache.jsp.search_jsp:396) at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:99) at javax.servlet.http.HttpServlet.service(HttpServlet.java:802) at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:325) at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:295) at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:245) at javax.servlet.http.HttpServlet.service(HttpServlet.java:802) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:825) at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:738) at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:526) at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:80) at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684) at java.lang.Thread.run(Thread.java:552) -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: need volunteer to develop search for apache.org
I'll be happy to do it. --- Doug Cutting [EMAIL PROTECTED] wrote: Would someone volunteer to develop Nutch-based site-search engine for all apache.org domains? We now have a Solaris zone to host this. Thanks, Doug
[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets
[ http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12363400 ] byron miller commented on NUTCH-134: Thanks Erik, I was able to pull down the highlighter and i'll be loading it up on mozdex.com to test out over the weekend (1/21/2006). i'll let people know if my cpu skyrockets :) Summarizer doesn't select the best snippets --- Key: NUTCH-134 URL: http://issues.apache.org/jira/browse/NUTCH-134 Project: Nutch Type: Bug Components: searcher Versions: 0.7, 0.8-dev, 0.7.1, 0.7.2-dev Reporter: Andrzej Bialecki Summarizer.java tries to select the best fragments from the input text, where the frequency of query terms is the highest. However, the logic in line 223 is flawed in that the excerptSet.add() operation will add new excerpts only if they are not already present - the test is performed using the Comparator that compares only the numUniqueTokens. This means that if there are two or more excerpts, which score equally high, only the first of them will be retained, and the rest of equally-scoring excerpts will be discarded, in favor of other excerpts (possibly lower scoring). To fix this the Set should be replaced with a List + a sort operation. To keep the relative position of excerpts in the original order the Excerpt class should be extended with an int order field, and the collected excerpts should be sorted in that order prior to adding them to the summary. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
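A minimal sketch of the fix described in the issue, using stand-in classes rather than the real Summarizer/Excerpt code: keep equally-scoring excerpts in a List, pick the top ones by score, then re-sort the survivors by their original position before building the summary.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for the fix described in NUTCH-134; field and class names
// do not match the real Summarizer/Excerpt classes.
public class ExcerptSelection {

  static class Excerpt {
    final int numUniqueTokens; // score proxy used by the current Comparator
    final int order;           // position in the original text
    Excerpt(int numUniqueTokens, int order) {
      this.numUniqueTokens = numUniqueTokens;
      this.order = order;
    }
  }

  /** Keep the best maxExcerpts, but emit them in original document order. */
  static List<Excerpt> select(List<Excerpt> all, int maxExcerpts) {
    List<Excerpt> byScore = new ArrayList<Excerpt>(all);
    // A List keeps equally-scoring excerpts, where a Set keyed on the score dropped them.
    Collections.sort(byScore, new Comparator<Excerpt>() {
      public int compare(Excerpt a, Excerpt b) {
        return b.numUniqueTokens - a.numUniqueTokens; // best first
      }
    });
    List<Excerpt> kept = byScore.subList(0, Math.min(maxExcerpts, byScore.size()));
    List<Excerpt> inOrder = new ArrayList<Excerpt>(kept);
    Collections.sort(inOrder, new Comparator<Excerpt>() {
      public int compare(Excerpt a, Excerpt b) { return a.order - b.order; }
    });
    return inOrder;
  }
}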
[jira] Commented: (NUTCH-183) MapReduce has a series of problems concerning task-allocation to worker nodes
[ http://issues.apache.org/jira/browse/NUTCH-183?page=comments#action_12363477 ] byron miller commented on NUTCH-183: As Mr Burns would say eggcelent I'll give this a try. BTW, is it possible to implement functionality that would start jobs that are lagging on nodes that have completed tasks like google does? for example if your 90% done and the last 10 jobs are hung because of bad hardware, slow response or failure and have the ability to redo the long running jobs in parallel on alternate nodes and complete the first one that finishes? this way if you have a huge crawl and certain nodes slow or fail those jobs can be alternated on completed nodes to try and wrap up and terminate any dead jobs when done? hope that makes sense.. MapReduce has a series of problems concerning task-allocation to worker nodes - Key: NUTCH-183 URL: http://issues.apache.org/jira/browse/NUTCH-183 Project: Nutch Type: Improvement Environment: All Reporter: Mike Cafarella Attachments: jobtracker.patch The MapReduce JobTracker is not great at allocating tasks to TaskTracker worker nodes. Here are the problems: 1) There is no speculative execution of tasks 2) Reduce tasks must wait until all map tasks are completed before doing any work 3) TaskTrackers don't distinguish between Map and Reduce jobs. Also, the number of tasks at a single node is limited to some constant. That means you can get weird deadlock problems upon machine failure. The reduces take up all the available execution slots, but they don't do productive work, because they're waiting for a map task to complete. Of course, that map task won't even be started until the reduce tasks finish, so you can see the problem... 4) The JobTracker is so complicated that it's hard to fix any of these. The right solution is a rewrite of the JobTracker to be a lot more flexible in task handling. It has to be a lot simpler. One way to make it simpler is to add an abstraction I'll call TaskInProgress. Jobs are broken into chunks called TasksInProgress. All the TaskInProgress objects must be complete, somehow, before the Job is complete. A single TaskInProgress can be executed by one or more Tasks. TaskTrackers are assigned Tasks. If a Task fails, we report it back to the JobTracker, where the TaskInProgress lives. The TIP can then decide whether to launch additional Tasks or not. Speculative execution is handled within the TIP. It simply launches multiple Tasks in parallel. The TaskTrackers have no idea that these Tasks are actually doing the same chunk of work. The TIP is complete when any one of its Tasks are complete. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
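Byron's question about re-running lagging jobs in parallel is essentially the speculative execution that the proposed TaskInProgress abstraction allows. A toy sketch of that idea follows; it is hypothetical code, not the attached jobtracker.patch.

import java.util.ArrayList;
import java.util.List;

// Toy model of the TaskInProgress idea from NUTCH-183; not the actual patch.
public class TaskInProgress {

  private final String chunkId;              // the chunk of work this TIP covers
  private final List<String> attempts = new ArrayList<String>();
  private boolean complete = false;

  public TaskInProgress(String chunkId) { this.chunkId = chunkId; }

  /** Launch another Task for the same chunk, e.g. when the first attempt lags. */
  public String launchAttempt() {
    String taskId = chunkId + "_attempt_" + attempts.size();
    attempts.add(taskId);
    return taskId;                            // handed to some TaskTracker
  }

  /** Any one attempt finishing completes the TIP; the others can be killed. */
  public void attemptFinished(String taskId) {
    if (attempts.contains(taskId)) complete = true;
  }

  /** A failed attempt does not fail the job; the TIP may simply launch another. */
  public void attemptFailed(String taskId) {
    attempts.remove(taskId);
  }

  public boolean isComplete() { return complete; }
}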
Re: Problem with latest SVN during reduce phase
I'll pull it down today and give it a shot. thanks, -byron --- Lukas Vlcek [EMAIL PROTECTED] wrote: Hi, Get the latest svn version. Andrzej commited some patches yesterday and now this issue is gone (at least it warks fine for me). I believe that revision# 368167 is what we were about. Regards, Lukas On 1/13/06, Pashabhai [EMAIL PROTECTED] wrote: Hi , You are right, Parse object is not null even though page has no content and title. Could it be FetcherOutput Object ??? P --- Lukas Vlcek [EMAIL PROTECTED] wrote: Hi, I think this issue can be more complex. If I remember my test correctly then parse object was not null. Also parse.getText() was not null (it just contained empty String). If document is not parsed correctly then empty parse is returned instead: parseStatus.getEmptyParse(); which should be OK, but I didn't have a chance to check if this can cause any troubles during index index optimization. Lukas On 1/12/06, Pashabhai [EMAIL PROTECTED] wrote: Hi , The very similar exception occurs while indexing a page which do not have body content (and title sometimes). 051223 194717 Optimizing index. java.lang.NullPointerException at org.apache.nutch.indexer.basic.BasicIndexingFilter.filter(BasicIndexingFilter.java:75) at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:63) at org.apache.nutch.crawl.Indexer.reduce(Indexer.java:217) at org.apache.nutch.mapred.ReduceTask.run(ReduceTask.java:260) at Looking into the source code of BasicIndexingFilter. it is trying to doc.add(Field.UnStored(content, parse.getText())); I guess adding check for null on parse object if(parse!=null) should solve the problem. Can confirm when tested locally. Thanks P --- Lukas Vlcek [EMAIL PROTECTED] wrote: Hi, I am facing this error as well. Now I located one particular document which is causing it (it is msword document which can't be properly parsed by parser). I have sent it to Andrzej in separed email. Let's see if that helps... Lukas On 1/11/06, Dominik Friedrich [EMAIL PROTECTED] wrote: I got this exception a lot, too. I haven't tested the patch by Andrzej yet but instead I just put the doc.add() lines in the indexer reduce function in a try-catch block . This way the indexing finishes even with a null value and i can see which documents haven't been indexed in the log file. Wouldn't it be a good idea to catch every exceptions that only affect one document in loops like this? At least I don't like it if an indexing process dies after a few hours because one document triggers such an exception. best regards, Dominik Byron Miller wrote: 60111 103432 reduce reduce 060111 103432 Optimizing index. 060111 103433 closing reduce 060111 103434 closing reduce 060111 103435 closing reduce java.lang.NullPointerException: value cannot be null at org.apache.lucene.document.Field.init(Field.java:469) at org.apache.lucene.document.Field.init(Field.java:412) at org.apache.lucene.document.Field.UnIndexed(Field.java:195) at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:198) at org.apache.nutch.mapred.ReduceTask.run(ReduceTask.java:260) at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:90) Exception in thread main java.io.IOException: Job failed! at === message truncated ===
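The two workarounds discussed in this thread, sketched with stand-in types (not the real Nutch/Lucene classes): guard against a missing parse before adding the content field, or catch per-document failures so one bad page cannot kill the whole reduce.

// Illustrative sketch of the two defensive options discussed above; Parse and Doc
// are stand-ins here, not the real Nutch/Lucene classes.
public class SafeIndexing {

  interface Parse { String getText(); }
  interface Doc   { void add(String field, String value); }

  /** Option 1: skip the field (or the document) when the parse is missing or empty. */
  static void addContent(Doc doc, Parse parse) {
    if (parse == null || parse.getText() == null) {
      return;                       // nothing to index for this page
    }
    doc.add("content", parse.getText());
  }

  /** Option 2: let one broken document fail alone instead of killing the reduce. */
  static void addContentLogged(Doc doc, Parse parse, String url) {
    try {
      doc.add("content", parse.getText());
    } catch (RuntimeException e) {
      // e.g. NullPointerException from a document with no body/title;
      // log the url and move on so the indexing job can finish.
      System.err.println("skipping " + url + ": " + e);
    }
  }
}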
RE: MapReduce and segment merging
I was thinking that Nutch needs some sort of workflow manager. This way you could build jobs off specific workflows and hopefully recover jobs based upon the portion of the workflow they are stuck in (or restart a job if it failed, has been processing for x hours, or trips other such workflow rules). Something like that could also send notifications when jobs are done, trigger other events, and give you a management interface to what your cluster is up to, or apply configuration types defined per batch job/workflow process. For example, if I'm building a blog index I may want more, smaller segments based upon daily fetches, while for other jobs I may want fewer, larger segments. Does something like that make much sense for where the mapred branch is going? Is workflow the right term for such a beast? -byron --- Goldschmidt, Dave [EMAIL PROTECTED] wrote: Could you also just copy segments out of NDFS to local -- perform merges in local -- then copy segments back into NDFS? DaveG -Original Message- From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] Sent: Thursday, January 12, 2006 2:14 PM To: nutch-dev@lucene.apache.org Subject: Re: MapReduce and segment merging Mike Alulin wrote: Then how people uses the new version if they need let's say daily crawls of the new/updated pages? I crawl updated pages every 24 hours and if I do not merge the segments, soon I will have hundreds of them. What is the best solution in this case? Full recrawl is not a good option as i have millions of documents and I DO know which of them were updated without requesting them. This is a development version, nobody said it's feature complete. Patience, my friend... or spend some effort to improve it. ;-) -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
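A bare-bones sketch of what such a workflow manager could look like; this is entirely hypothetical and nothing like it exists in the mapred branch. Each job is a named step with a retry limit, so a stuck step can be re-run without restarting the whole pipeline.

import java.util.LinkedHashMap;
import java.util.Map;

// Entirely hypothetical workflow sketch; nothing like this exists in Nutch.
public class CrawlWorkflow {

  interface Step { boolean run() throws Exception; }

  private final Map<String, Step> steps = new LinkedHashMap<String, Step>();

  public void addStep(String name, Step step) { steps.put(name, step); }

  /** Run steps in order; retry a failed step up to maxRetries before giving up. */
  public void run(int maxRetries) throws Exception {
    for (Map.Entry<String, Step> e : steps.entrySet()) {
      int tries = 0;
      while (true) {
        try {
          if (e.getValue().run()) break;      // step finished, move to the next one
        } catch (Exception ex) {
          // fall through to the retry bookkeeping below
        }
        if (++tries > maxRetries) {
          throw new Exception("workflow stuck in step: " + e.getKey());
        }
      }
    }
  }
}

Steps such as generate, fetch, updatedb, merge and index would each be registered once, and the per-collection configuration Byron mentions (blog index vs. bulk index) could simply be a different set of steps.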
Problem with latest SVN during reduce phase
60111 103432 reduce reduce 060111 103432 Optimizing index. 060111 103433 closing reduce 060111 103434 closing reduce 060111 103435 closing reduce java.lang.NullPointerException: value cannot be null at org.apache.lucene.document.Field.init(Field.java:469) at org.apache.lucene.document.Field.init(Field.java:412) at org.apache.lucene.document.Field.UnIndexed(Field.java:195) at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:198) at org.apache.nutch.mapred.ReduceTask.run(ReduceTask.java:260) at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:90) Exception in thread main java.io.IOException: Job failed! at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308) at org.apache.nutch.indexer.Indexer.index(Indexer.java:259) at org.apache.nutch.crawl.Crawl.main(Crawl.java:121) [EMAIL PROTECTED]:/data/nutch/trunk$ Pulled today's build and got the above error. No problems running out of disk space or anything like that. This is a single instance, local file systems. Any way to recover the crawl/finish the reduce job from where it failed?
Re: Per-page crawling policy
Excellent ideas - this is where I'm hoping to use some of the social bookmarking ideas to build the starter sites and link maps from. I hope to work with Simpy or other bookmarking projects to build somewhat of a popularity map (human-edited authority) to merge and calculate against a computer-generated map (via standard link processing, anchor results and such). My only continuing question is how to manage the merge/index process of staging and processing your crawl/fetch jobs in a setup like this. It seems all of our theories assume a single crawl and publish of that index rather than a living/breathing corpus. Unless we map/bucket the segments to have some purpose, it's difficult to manage how we process them, sort them or analyze them to define or extract more meaning from them. Brain is exploding :) -byron --- Andrzej Bialecki [EMAIL PROTECTED] wrote: Hi, I've been toying with the following idea, which is an extension of the existing URLFilter mechanism and the concept of a crawl frontier. Let's suppose we have several initial seed urls, each with a different subjective quality. We would like to crawl these, and expand the crawling frontier using outlinks. However, we don't want to do it uniformly for every initial url, but rather propagate certain crawling policy through the expanding trees of linked pages. This crawling policy could consist of url filters, scoring methods, etc - basically anything configurable in Nutch could be included in this policy. Perhaps it could even be the new version of non-static NutchConf ;-) Then, if a given initial url is a known high-quality source, we would like to apply a favor policy, where we e.g. add pages linked from that url, and in doing so we give them a higher score. Recursively, we could apply the same policy for the next generation pages, or perhaps only for pages belonging to the same domain. So, in a sense the original notion of high-quality would cascade down to other linked pages. The important aspect of this to note is that all newly discovered pages would be subject to the same policy - unless we have compelling reasons to switch the policy (from favor to default or to distrust), which at that point would essentially change the shape of the expanding frontier. If a given initial url is a known spammer, we would like to apply a distrust policy for adding pages linked from that url (e.g. adding or not adding, if adding then lowering their score, or applying different score calculation). And recursively we could apply a similar policy of distrust to any pages discovered this way. We could also change the policy on the way, if there are compelling reasons to do so. This means that we could follow some high-quality links from low-quality pages, without drilling down the sites which are known to be of low quality. Special care needs to be taken if the same page is discovered from pages with different policies, I haven't thought about this aspect yet... ;-) What would be the benefits of such approach? * the initial page + policy would both control the expanding crawling frontier, and it could be differently defined for different starting pages. I.e. in a single web database we could keep different collections or areas of interest with differently specified policies. But still we could reap the benefits of a single web db, namely the link information. * URLFilters could be grouped into several policies, and it would be easy to switch between them, or edit them.
* if the crawl process realizes it ended up on a spam page, it can switch the page policy to distrust, or the other way around, and stop crawling unwanted content. From now on the pages linked from that page will follow the new policy. In other words, if a crawling frontier reaches pages with known quality problems, it would be easy to change the policy on-the-fly to avoid them or pages linked from them, without resorting to modifications of URLFilters. Some of the above you can do even now with URLFilters, but any change you do now has global consequences. You may also end up with awfully complicated rules if you try to cover all cases in one rule set. How to implement it? Surprisingly, I think that it's very simple - just adding a CrawlDatum.policyId field would suffice, assuming we have a means to store and retrieve these policies by ID; and then instantiate it and call appropriate methods whenever we use today the URLFilters and do the score calculations. Any comments? -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
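To make the proposal concrete, here is a rough sketch with hypothetical names (CrawlDatum has no policyId field today): a small registry keyed by policy id, where each policy carries a score factor and decides which policy its outlinks inherit.

import java.util.HashMap;
import java.util.Map;

// Rough sketch of the per-page policy idea; hypothetical, not existing Nutch code.
public class CrawlPolicies {

  public static class Policy {
    final String id;
    final float scoreFactor;     // e.g. favor > 1.0, default = 1.0, distrust < 1.0
    Policy(String id, float scoreFactor) {
      this.id = id;
      this.scoreFactor = scoreFactor;
    }
    /** Outlinks inherit the parent's policy unless something forces a switch. */
    public String policyForOutlink(String url, boolean looksLikeSpam) {
      return looksLikeSpam ? "distrust" : id;
    }
  }

  private final Map<String, Policy> byId = new HashMap<String, Policy>();

  public CrawlPolicies() {
    byId.put("favor",    new Policy("favor",    2.0f));
    byId.put("default",  new Policy("default",  1.0f));
    byId.put("distrust", new Policy("distrust", 0.1f));
  }

  /** Looked up via the policyId that would be stored on each CrawlDatum. */
  public Policy get(String policyId) {
    Policy p = byId.get(policyId);
    return p != null ? p : byId.get("default");
  }

  public float adjustScore(String policyId, float score) {
    return score * get(policyId).scoreFactor;
  }
}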
Re: mapred crawling exception - Job failed!
Fixed in the copy I run, as I've been able to get my 100k pages indexed without hitting that error. -byron --- Andrzej Bialecki [EMAIL PROTECTED] wrote: Lukas Vlcek wrote: Hi, I am trying to use the latest nutch-trunk version but I am facing unexpected Job failed! exception. It seems that all crawling work has been already done but some threads are hung, which results in an exception after some timeout. This was fixed (or should be fixed :) in the revision r365576. Please report if it doesn't fix it for you. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: IndexSorter optimizer
On optimizing performance, does anyone know whether Google exports its entire dataset as an index, or only somehow indexes the top N% (since they only show the first 1000 or so results anyway)? With this patch and a top-result limit set in the xml file, does that mean it will stop scanning the index at that point? Is there a methodology to actually prune the index on some scaling factor, so that a 4-billion-page index can be searchable only 1k results deep on average? It seems like some method of doing the above would cut your search processing/index size down fairly well. But it may be more expensive to post-process at that scale than it is to simply push everything and let the query optimizer ignore it as needed; after all, disk space is getting rather cheap compared to CPU and memory. --- Doug Cutting [EMAIL PROTECTED] wrote: Andrzej Bialecki wrote: I'm happy to report that further tests performed on a larger index seem to show that the overall impact of the IndexSorter is definitely positive: performance improvements are significant, and the overall quality of results seems at least comparable, if not actually better. Great news! I will submit the Lucene patches ASAP, now that we know they're useful. Doug
Adding some theory publication links into the Wiki..
I figured since I'm in research mode I would start compiling available information resources and putting them up on the wiki: http://wiki.apache.org/nutch/Search_Theory Sorry about all the cvs messages on edits.. I'm not used to the touchpad on this darned laptop :) Anyhow, if you have any resources to share, or care to look at all sorts of maps, theory and discussions on search, relevance and ranking, it is a gold mine of info!
[jira] Created: (NUTCH-159) Specify temp/working directory for crawl
Specify temp/working directory for crawl Key: NUTCH-159 URL: http://issues.apache.org/jira/browse/NUTCH-159 Project: Nutch Type: Bug Components: fetcher, indexer Versions: 0.8-dev Environment: Linux/Debian Reporter: byron miller I ran a crawl of 100k web pages and got: org.apache.nutch.fs.FSError: java.io.IOException: No space left on device at org.apache.nutch.fs.LocalFileSystem$LocalNFSFileOutputStream.write(LocalFileSystem.java:149) at org.apache.nutch.fs.FileUtil.copyContents(FileUtil.java:65) at org.apache.nutch.fs.LocalFileSystem.renameRaw(LocalFileSystem.java:178) at org.apache.nutch.fs.NutchFileSystem.rename(NutchFileSystem.java:224) at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:80) Caused by: java.io.IOException: No space left on device at java.io.FileOutputStream.writeBytes(Native Method) at java.io.FileOutputStream.write(FileOutputStream.java:260) at org.apache.nutch.fs.LocalFileSystem$LocalNFSFileOutputStream.write(LocalFileSystem.java:147) ... 4 more Exception in thread main java.io.IOException: Job failed! at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308) at org.apache.nutch.crawl.Fetcher.fetch(Fetcher.java:335) at org.apache.nutch.crawl.Crawl.main(Crawl.java:107) [EMAIL PROTECTED]:/data/nutch$ df -k It appears crawl created a /tmp/nutch directory that filled up even though i specified a db directory. Need to add a parameter to the command line or make a globaly configurable /tmp (work area) for the nutch instance so that crawls won't fail. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-123) Cache.jsp some times generate NullPointerException
[ http://issues.apache.org/jira/browse/NUTCH-123?page=comments#action_12361473 ] byron miller commented on NUTCH-123: Perhaps you should try the cache servlet as it dumps out the data as it sees it. Cache.jsp some times generate NullPointerException -- Key: NUTCH-123 URL: http://issues.apache.org/jira/browse/NUTCH-123 Project: Nutch Type: Bug Components: web gui Environment: All systems Reporter: YourSoft Priority: Critical There is a problem with the following line in the cached.jsp: String contentType = (String) metaData.get(Content-Type); In the segments data there is some times not equals Content-Type, there are content-type or Content-type etc. The solution, insert these lines over the above line: for (Enumeration eNum = metaData.propertyNames(); eNum.hasMoreElements();) { content = (String) eNum.nextElement(); if (content-type.equalsIgnoreCase (content)) { break; } } final String contentType = (String) metaData.get(content); Regards, Ferenc -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
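For readability, here is the fix Ferenc suggests with the string quoting restored (the archive stripped the literals); metaData is shown as a plain Properties object rather than the real segment metadata.

import java.util.Enumeration;
import java.util.Properties;

// Ferenc's suggested cached.jsp fix from the issue text, with quotes restored;
// metaData here is a stand-in Properties object rather than the real segment metadata.
public class ContentTypeLookup {

  static String contentType(Properties metaData) {
    String key = "Content-Type";
    for (Enumeration eNum = metaData.propertyNames(); eNum.hasMoreElements();) {
      String name = (String) eNum.nextElement();
      if ("content-type".equalsIgnoreCase(name)) {
        key = name;                  // use whatever casing the segment stored
        break;
      }
    }
    return (String) metaData.get(key);
  }
}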
[jira] Commented: (NUTCH-42) enhance search.jsp such that it can also returns XML
[ http://issues.apache.org/jira/browse/NUTCH-42?page=comments#action_12361474 ] byron miller commented on NUTCH-42: --- Safe to close. (done) We have XML/OpenSearch in latest trunk and other branches. enhance search.jsp such that it can also returns XML Key: NUTCH-42 URL: http://issues.apache.org/jira/browse/NUTCH-42 Project: Nutch Type: Wish Components: web gui Reporter: Michael Wechner Priority: Trivial Attachments: NutchRssSearch.zip, NutchRssSearch.zip, search.jsp.diff, search.jsp.diff Enhance search.jsp such that by specifying a parameter format=xml the JSP will return an XML, whereas if no format is being specified then it will return HTML -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Created: (NUTCH-158) Process Sitemap data in text, rss or xml format as well as OAI-PMH
Process Sitemap data in text, rss or xml format as well as OAI-PMH -- Key: NUTCH-158 URL: http://issues.apache.org/jira/browse/NUTCH-158 Project: Nutch Type: New Feature Components: fetcher Versions: 0.8-dev Reporter: byron miller Priority: Minor Add support to the fetcher to look for sitemap files, download them and process them into the webdb. Perhaps create a robots.txt directive that can be used to point at a standard format for sitemaps in RSS, XML or text format (one line per url) and process that. I would love to see someone stomp out proprietary sitemap features rather than making things as Google-specific as they are today :) * RSS format/Atom format (standard) * XML meta description * OAI-PMH meta description (http://www.openarchives.org/OAI/openarchivesprotocol.html) Perhaps even a pre-crawler that will scour for these to inject into the web db to help build your link map, so you could even just index the topN. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
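A hypothetical example of what such a robots.txt directive and a one-URL-per-line sitemap might look like (the directive is only proposed here; the fetcher does not understand it today, and example.com is just a placeholder):

# robots.txt
User-agent: *
Sitemap: http://www.example.com/sitemap.txt

# sitemap.txt -- one URL per line, as suggested above
http://www.example.com/
http://www.example.com/about.html
http://www.example.com/news/2006/01/index.html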
[jira] Commented: (NUTCH-155) Remove web gui from the distribution to contrib and use OpenSearch Servlet
[ http://issues.apache.org/jira/browse/NUTCH-155?page=comments#action_12361398 ] byron miller commented on NUTCH-155: I don't know how i feel about removing the JSP stuff into a contrib and then fluffing it up more with the potential to support other web languages. moving out jsp doesn't negate the need for an app server. but making a 3rd party contrib for everything that can use nutch may be worthwhile. Remove web gui from the distribution to contrib and use OpenSearch Servlet Key: NUTCH-155 URL: http://issues.apache.org/jira/browse/NUTCH-155 Project: Nutch Type: Wish Components: web gui Versions: 0.8-dev Reporter: nutch.newbie Web gui JSP search pages should be moved to a contrib folder. It would be better to focus on OpenSearch Servlet based XML results. For example in the current tutorial at - http://lucene.apache.org/nutch/tutorial.html under the searching section one could imagine to add a script OpenSearch. (i.e. bin/nutch OpenSearch search term-- Bingo XML results. ) Therefore I suggest - It is better that web gui moves to contrib. I also forsee posting PHP or Perl, Ruby, XSLT or other language based GUI being developed and have it under the contrib as an addition to JSP pages. - Current implementation focuses on JSP pages, tomcat, etc. has nothing to do with Nutch. But has everything to do with How Nutch needs to be deployed. And to my mind Nutch can be deployed in many ways. So why just JSP and tomcat will get the core attention. The above wish is not new, I have seen others in Jira having similler thinking. Furthermore Nutch is becoming big in size, the plugins are also growing it would be good idea to have a contrib directory just like Lucene. Some of the plugin could also move there. Plugins like clustering, ontology (i.e. not required for basic indexing/searching) etc are not given that it should be part of the distribution. The point I try to make here is its up to the search engine operator to download the plugins rather then everyone gets everything.tar model. Above is still a wish :-) -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: Mega-cleanup in trunk/
I'll pull a build down tonight and let you know how it goes! -byron --- Andrzej Bialecki [EMAIL PROTECTED] wrote: Hi, I just commited a large patch to cleanup the trunk/ of obsolete and broken classes remaining from the 0.7.x development line. Please test that things still work as they should ... -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
[jira] Commented: (NUTCH-92) DistributedSearch incorrectly scores results
[ http://issues.apache.org/jira/browse/NUTCH-92?page=comments#action_12361348 ] byron miller commented on NUTCH-92: --- Has there been any advancement on this front? DistributedSearch incorrectly scores results Key: NUTCH-92 URL: http://issues.apache.org/jira/browse/NUTCH-92 Project: Nutch Type: Bug Components: searcher Versions: 0.8-dev, 0.7 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki When running search servers in a distributed setup, using DistributedSearch$Server and Client, total scores are incorrectly calculated. The symptoms are that scores differ depending on how segments are deployed to Servers, i.e. if there is uneven distribution of terms in segment indexes (due to segment size or content differences) then scores will differ depending on how many and which segments are deployed on a particular Server. This may lead to prioritizing of non-relevant results over more relevant ones. The underlying reason for this is that each IndexSearcher (which uses local index on each Server) calculates scores based on the local IDFs of query terms, and not the global IDFs from all indexes together. This means that scores arriving from different Servers to the Client cannot be meaningfully compared, unless all indexes have similar distribution of Terms and similar numbers of documents in them. However, currently the Client mixes all scores together, sorts them by absolute values and picks top hits. These absolute values will change if segments are un-evenly deployed to Servers. Currently the workaround is to deploy the same number of documents in segments per Server, and to ensure that segments contain well-randomized content so that term frequencies for common terms are very similar. The solution proposed here (as a result of discussion between ab and cutting, patches are coming) is to calculate global IDFs prior to running the query, and pre-boost query Terms with these global IDFs. This will require one more RPC call per each query (this can be optimized later, e.g. through caching). Then the scores will become normalized according to the global IDFs, and Client will be able to meaningfully compare them. Scores will also become independent of the segment content or local number of documents per Server. This will involve at least the following changes: * change NutchSimilarity.idf(Term, Searcher) to always return 1.0f. This enables us to manipulate scores independently of local IDFs. * add a new method to Searcher interface, int[] getDocFreqs(Term[]), which will return document frequencies for query terms. * modify getSegmentNames() so that it returns also the total number of documents in each segment, or implement this as a separate method (this will be called once during segment init) * in DistributedSearch$Client.search() first make a call to servers to return local IDFs for the current query, and calculate global IDFs for each relevant Term in that query. * multiply the TermQuery boosts by idf(totalDocFreq, totalIndexedDocs), and PhraseQuery boosts by the sum of the idf(totalDocFreqs, totalIndexedDocs) for all of its terms This solution should be applicable with only minor changes to all branches, but initially the patches will be relative to trunk/ . Comments, suggestions and review are welcome! -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
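A small sketch of the global-IDF step the fix describes (hypothetical helper, not the actual DistributedSearch code): sum each term's document frequency across all servers, derive one global idf from the totals, and use that value to boost the query terms before the real search call.

// Hypothetical sketch of the global-IDF calculation described in NUTCH-92;
// the real change touches NutchSimilarity and DistributedSearch$Client.
public class GlobalIdf {

  /** Lucene-style idf, computed from global counts instead of local ones. */
  static float idf(int totalDocFreq, long totalIndexedDocs) {
    return (float) (Math.log(totalIndexedDocs / (double) (totalDocFreq + 1)) + 1.0);
  }

  /**
   * docFreqsPerServer[s][t] = docFreq of term t on server s
   * (what the proposed Searcher.getDocFreqs(Term[]) call would return).
   */
  static float[] termBoosts(int[][] docFreqsPerServer, long totalIndexedDocs) {
    int numTerms = docFreqsPerServer[0].length;
    float[] boosts = new float[numTerms];
    for (int t = 0; t < numTerms; t++) {
      int total = 0;
      for (int s = 0; s < docFreqsPerServer.length; s++) {
        total += docFreqsPerServer[s][t];     // global document frequency
      }
      boosts[t] = idf(total, totalIndexedDocs);
    }
    return boosts;
  }
}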
[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets
[ http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12361350 ] byron miller commented on NUTCH-134: Where is the lucene summarizer from the contrib? i'm not seeing anything obvious (unless it's under a different name) Summarizer doesn't select the best snippets --- Key: NUTCH-134 URL: http://issues.apache.org/jira/browse/NUTCH-134 Project: Nutch Type: Bug Components: searcher Versions: 0.7.1, 0.7, 0.7.2-dev, 0.8-dev Reporter: Andrzej Bialecki Summarizer.java tries to select the best fragments from the input text, where the frequency of query terms is the highest. However, the logic in line 223 is flawed in that the excerptSet.add() operation will add new excerpts only if they are not already present - the test is performed using the Comparator that compares only the numUniqueTokens. This means that if there are two or more excerpts, which score equally high, only the first of them will be retained, and the rest of equally-scoring excerpts will be discarded, in favor of other excerpts (possibly lower scoring). To fix this the Set should be replaced with a List + a sort operation. To keep the relative position of excerpts in the original order the Excerpt class should be extended with an int order field, and the collected excerpts should be sorted in that order prior to adding them to the summary. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-95) DeleteDuplicates depends on the order of input segments
[ http://issues.apache.org/jira/browse/NUTCH-95?page=comments#action_12361300 ] byron miller commented on NUTCH-95: --- Number 2 sounds great, but wouldn't you always want the latest scoring document since that should reflect the latest updatedb and rank of the page even if it's lower or higher? DeleteDuplicates depends on the order of input segments --- Key: NUTCH-95 URL: http://issues.apache.org/jira/browse/NUTCH-95 Project: Nutch Type: Bug Components: indexer Versions: 0.8-dev, 0.6, 0.7 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki DeleteDuplicates depends on what order the input segments are processed, which in turn depends on the order of segment dirs returned from NutchFileSystem.listFiles(File). In most cases this is undesired and may lead to deleting wrong records from indexes. The silent assumption that segments at the end of the listing are more recent is not always true. Here's the explanation: * Dedup first deletes the URL duplicates by computing MD5 hashes for each URL, and then sorting all records by (hash, segmentIdx, docIdx). SegmentIdx is just an int index to the array of open IndexReaders - and if segment dirs are moved/copied/renamed then entries in that array may change their order. And then for all equal triples Dedup keeps just the first entry. Naturally, if segmentIdx is changed due to dir renaming, a different record will be kept and different ones will be deleted... * then Dedup deletes content duplicates, again by computing hashes for each content, and then sorting records by (hash, segmentIdx, docIdx). However, by now we already have a different set of undeleted docs depending on the order of input segments. On top of that, the same factor acts here, i.e. segmentIdx changes when you re-shuffle the input segment dirs - so again, when identical entries are compared the one with the lowest (segmentIdx, docIdx) is picked. Solution: use the fetched date from the first record in each segment to determine the order of segments. Alternatively, modify DeleteDuplicates to use the newer algorithm from SegmentMergeTool. This algorithm works by sorting records using tuples of (urlHash, contentHash, fetchDate, score, urlLength). Then: 1. If urlHash is the same, keep the doc with the highest fetchDate (the latest version, as recorded by Fetcher). 2. If contentHash is the same, keep the doc with the highest score, and then if the scores are the same, keep the doc with the shortest url. Initial fix will be prepared for the trunk/ and then backported to the release branch. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
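A sketch of the SegmentMergeTool-style ordering the fix proposes, using a stand-in record class (not the actual DeleteDuplicates code): sort on (urlHash, contentHash, fetchDate, score, urlLength) so the record to keep always sorts first within its duplicate group.

import java.util.Comparator;

// Stand-in record for the dedup ordering described in NUTCH-95; illustrative only.
public class DedupRecord {
  String urlHash;
  String contentHash;
  long   fetchDate;   // newer is better for the same URL
  float  score;       // higher is better for the same content
  int    urlLength;   // shorter is better as the last tie-breaker

  static final Comparator<DedupRecord> ORDER = new Comparator<DedupRecord>() {
    public int compare(DedupRecord a, DedupRecord b) {
      int c = a.urlHash.compareTo(b.urlHash);
      if (c != 0) return c;
      c = a.contentHash.compareTo(b.contentHash);
      if (c != 0) return c;
      // within a duplicate group, order so the record to keep comes first:
      if (a.fetchDate != b.fetchDate) return a.fetchDate > b.fetchDate ? -1 : 1;
      if (a.score != b.score)         return a.score > b.score ? -1 : 1;
      return a.urlLength - b.urlLength;
    }
  };
}

Byron's question still applies on top of this ordering: for duplicate URLs the newest fetch wins regardless of score, while score only decides among distinct URLs with identical content.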
[jira] Commented: (NUTCH-55) Create dmoz.org search plugin - incorporate the dmoz.org title/category/description if available
[ http://issues.apache.org/jira/browse/NUTCH-55?page=comments#action_12361301 ] byron miller commented on NUTCH-55: --- You can close this ticket, duplicate of ticket NUTCH-59 Create dmoz.org search plugin - incorporate the dmoz.org title/category/description if available -- Key: NUTCH-55 URL: http://issues.apache.org/jira/browse/NUTCH-55 Project: Nutch Type: New Feature Components: indexer, searcher Environment: all Reporter: byron miller Priority: Minor I am looking into the possibility of creating a dmoz.org plugin, so if you seed from the dmoz.org rdf the data you pull in could be used to extend the data you fetch. Possibilities: Searchable dmoz.org data or nutch summary + dmoz.org category in serps. ofcourse the data from dmoz.org isn't as descriptive as it used to be, but i think being able to integrate the category and href to a base url where the category resolves would be a nice feature (and homage to the dmoz.org data) -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
failure with crawl using 12/23 trunk
Not sure if it's because I have some of the older 7.x parameters for my plugins - did these change in trunk? 051223 194716 crawl-20051223193201/crawldb/current/part-0/data:0+809491 051223 194716 map 100% 051223 194717 crawl-20051223193201/linkdb/current/part-0/data:0+1270873 -adding org.apache.nutch.indexer.basic.BasicIndexingFilter -adding org.apache.nutch.indexer.more.MoreIndexingFilter 051223 194717 found resource common-terms.utf8 at file:/home/byron/n2/trunk/conf/common-terms.utf8 051223 194717 Optimizing index. java.lang.NullPointerException at org.apache.nutch.indexer.basic.BasicIndexingFilter.filter(BasicIndexingFilter.java:75) at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:63) at org.apache.nutch.crawl.Indexer.reduce(Indexer.java:217) at org.apache.nutch.mapred.ReduceTask.run(ReduceTask.java:260) at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:90) Exception in thread main java.io.IOException: Job failed! at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308) at org.apache.nutch.crawl.Indexer.index(Indexer.java:256) at org.apache.nutch.crawl.Crawl.main(Crawl.java:117)
Re: IndexSorter optimizer
I've got 400mill db i can run this against over the next few days. -byron --- Stefan Groschupf [EMAIL PROTECTED] wrote: Hi Andrzej, wow are really great news! Using the optimized index, I reported previously that some of the top-scoring results were missing. As it happens, the missing results were typically the junk pages with high tf/idf but low boost. Since we collect up to N hits, going from higher to lower boost values, the junk pages with low boost value were automatically eliminated. So, overall the subjective quality of results was improved. On the other hand, some of the legitimate results with a decent boost values were also skipped because they didn't fit within the fixed number of hits... ah, well. Perhaps we should limit the number of hits in LimitedCollector using a cutoff boost value, and not the maximum number of hits (or maybe both?). As far we experiment it would be good to have booth. To conclude, I will add the IndexSorter.java to the core classes, and I suggest to continue the experiments ... May someone out there in the community has a commercial search engine running (e.g. google appliance or similar) so we may can setup a nutch with the same pages and compare the results. I guess it will be difficult to compare nutch with yahoo or google since nobody of us has a 4 billion index up and running. I would run one on my laptop but I do not have the bandwidth to fetch until next two days. :-D Great work! Cheers, Stefan
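One way to read the "cutoff boost value vs. maximum number of hits" idea, as a toy collector (hypothetical; not the actual LimitedCollector from the patch): stop accepting hits once either the hit budget is spent or the per-document boost falls below a threshold.

// Toy version of the dual cutoff discussed above; not the actual LimitedCollector.
public class CutoffCollector {

  private final int maxHits;        // stop after this many hits...
  private final float minBoost;     // ...or once document boosts fall below this value
  private int collected = 0;
  private boolean done = false;

  public CutoffCollector(int maxHits, float minBoost) {
    this.maxHits = maxHits;
    this.minBoost = minBoost;
  }

  /** With a boost-sorted index, documents arrive in decreasing boost order. */
  public void collect(int doc, float score, float boost) {
    if (done) return;
    if (collected >= maxHits || boost < minBoost) {
      done = true;                  // everything after this point can be skipped
      return;
    }
    collected++;
    // ... record (doc, score) in a priority queue here ...
  }

  public boolean isDone() { return done; }
}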
Re: [VOTE] Commiter access for Stefan Groschupf
+1 Thanks for all the hard work! Very much appreciated --- Andrzej Bialecki [EMAIL PROTECTED] wrote: Hi, During the past year and more Stefan participated actively in the development, and contributed many high-quality patches. He's been spending considerable effort on addressing many issues in JIRA, and proposing fixes and improvements. Apparently he has too much free time on his hands, and it's best to catch him now, before he realizes that there are other ways of spending time than hacking Nutch code... ;-) So, I'd like to call for a vote on adding Stefan as a commiter. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets
[ http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12359649 ] byron miller commented on NUTCH-134: I would take more cpu for better summaries any day :) cpu power is cheaper than manual intervention! If any testing is needed, don't hesitate to drop me a patch.. i've been working on a 500million page index using mapred branch on a 10 node cluster so i have plenty of numbers to test against. Summarizer doesn't select the best snippets --- Key: NUTCH-134 URL: http://issues.apache.org/jira/browse/NUTCH-134 Project: Nutch Type: Bug Components: searcher Versions: 0.7, 0.8-dev, 0.7.1, 0.7.2-dev Reporter: Andrzej Bialecki Summarizer.java tries to select the best fragments from the input text, where the frequency of query terms is the highest. However, the logic in line 223 is flawed in that the excerptSet.add() operation will add new excerpts only if they are not already present - the test is performed using the Comparator that compares only the numUniqueTokens. This means that if there are two or more excerpts, which score equally high, only the first of them will be retained, and the rest of equally-scoring excerpts will be discarded, in favor of other excerpts (possibly lower scoring). To fix this the Set should be replaced with a List + a sort operation. To keep the relative position of excerpts in the original order the Excerpt class should be extended with an int order field, and the collected excerpts should be sorted in that order prior to adding them to the summary. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
standard version of log4j
Is there any way to make sure all plugins/modules reference a standard version of log4j? Seems to me there are at least 3 different versions (although the differences are minor): # find . | grep log4 ./plugins/parse-pdf/log4j-1.2.9.jar ./plugins/parse-pdf/PDFBox-0.7.2-log4j.jar ./plugins/parse-rss/log4j-1.2.6.jar ./plugins/clustering-carrot2/log4j-1.2.11.jar
RE: Halloween Joke at Google
I wish it did have something to do with halloween :) Google tells no lies! :P --- Nick Lothian [EMAIL PROTECTED] wrote: If you just do the search you'll see a link at the side of the page: Why these results? These results may seem politically slanted. Here's what happened. www.google.com/googleblog which links to http://googleblog.blogspot.com/2005/09/googlebombing-failure.html This particular Google Bomb has been around for quite a while. See http://en.wikipedia.org/wiki/Google_bomb (and has nothing to do with Halloween!) Nick
RE: Halloween Joke at Google
Actually, to add fuel to the fire, using nutch out of the box, searching for miserable failure yields the same thing. http://www.mozdex.com/search.jsp?query=miserablefailure --- Fuad Efendi [EMAIL PROTECTED] wrote: Thanks Nick, So this is why some search engines are not honest. I mean the commercial policy of putting links on top of a search for extra money. This particular Google Bomb has been around for quite a while. See http://en.wikipedia.org/wiki/Google_bomb (and has nothing to do with Halloween!) Nick
Re: Halloween Joke at Google
We run with fetchlist.score.by.link.count=true and indexer.boost.by.link.count=true. We haven't run a standalone analyze, so it's just how the database is updated when we run updatedb (per the recommendations a few months back, when that was found to give pretty darn close results!). Even though my scale is still much smaller than Google's, it is amazing how closely the results can match! Makes you wonder just how much of the net is useful ;) -byron --- Andrzej Bialecki [EMAIL PROTECTED] wrote: Byron Miller wrote: Actually, to add fuel to the fire, using nutch out of the box, searching for miserable failure yields the same thing. http://www.mozdex.com/search.jsp?query=miserablefailure I'm curious... could you check if the anchors come from the same site, or from different sites? Do you run with fetchlist.score.by.link.count=true and indexer.boost.by.link.count=true? Anyway, that's how the PageRank is _supposed_ to work - it should give a higher score to sites that are highly linked, and also it should strongly consider the anchor text as an indication of the page's true subject ... ;-) -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
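For reference, switching those two properties on in a 0.7.x-style nutch-site.xml would look roughly like this; the property names are taken from the message above, but check nutch-default.xml in your build for the exact root element and defaults.

<?xml version="1.0"?>
<!-- nutch-site.xml overrides; property names as quoted in the message above -->
<nutch-conf>
  <property>
    <name>fetchlist.score.by.link.count</name>
    <value>true</value>
  </property>
  <property>
    <name>indexer.boost.by.link.count</name>
    <value>true</value>
  </property>
</nutch-conf>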
Re: NekoHTML 0.9.5
I'll give tagsoup a try, i saw that was in there. thanks for the headsup! -byron --- Andrzej Bialecki [EMAIL PROTECTED] wrote: Byron Miller wrote: http://people.apache.org/~andyc/neko/doc/html/changes.html Any chance of getting that rolled in? Has a few fixes that look good. Did you try using TagSoup? Some time ago I added to parse-html the support for using TagSoup instead of NekoHTML (this is an option in the config file). I found that in many cases TagSoup gives much better results, especially for pages with multiple html or body elements, where neko would give up... -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
[jira] Commented: (NUTCH-39) pagination in search result
[ http://issues.apache.org/jira/browse/NUTCH-39?page=comments#action_12356374 ] byron miller commented on NUTCH-39: --- I'm using the above code snippet on mozdex and have run across some strange issues. For example, if you search for cnn.com it doesn't show up at all, but if you search for site:www.cnn.com cnn and find all cnn within that subquery, it works. I'm wondering if there are too many pages coming up for some results or something like that. Anyone else using this snippet? I like the way it works for the most part :) I will try to enable a debug page to chase down which variables are acting up. pagination in search result --- Key: NUTCH-39 URL: http://issues.apache.org/jira/browse/NUTCH-39 Project: Nutch Type: Improvement Components: web gui Environment: all Reporter: Jack Tang Priority: Trivial Now in nutch search.jsp, user navigate all search result using Next button. And google like pagination will feel better. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-49) Flag for generate to fetch only new pages to complement the -refetchonly flag
[ http://issues.apache.org/jira/browse/NUTCH-49?page=comments#action_12355864 ] byron miller commented on NUTCH-49: --- Can something like this be adapted to use the regex filter as well? it would be nice to say new only and match urls of x type or x link score or some other expressions. (not just the very topN) Flag for generate to fetch only new pages to complement the -refetchonly flag - Key: NUTCH-49 URL: http://issues.apache.org/jira/browse/NUTCH-49 Project: Nutch Type: New Feature Components: fetcher Reporter: Luke Baker Priority: Minor Attachments: fetchnewonly.patch It would be useful, especially for research/testing purposes, to have a flag for the FetchListTool that make sure to only include URLs in the fetchlist that have not already been fetched (according to the information from the webdb that you're generating the fetchlist from). -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira