RE: Sequence File Question
-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, March 28, 2007 4:34 PM
To: nutch-dev@lucene.apache.org
Subject: Re: Sequence File Question

> Steve Severance wrote:
>> Let me actually refine that question: why do some directories like the linkdb have a "current" subdirectory, and why do others like parse_data not? Is there a convention on this?
>
> First, to answer your original question: you should use the MapFileOutputFormat class for reading such output. It handles these part-* subdirectories automatically.
>
> Second, the "current" subdirectory is there in order to properly handle DB updates - or actually replacements - see e.g. the CrawlDb.install() method for details. This is not needed in the case of segments, which are created once and never updated.

How does the reader know which layout to expect? For instance, I can make a reader to read a linkdb just by instantiating it on the directory crawl/linkdb, and it knows to go inside the "current" directory. But when opening a parse_data there is no "current". So how does it know which to expect?

Steve

> Thirdly, although you didn't ask about it ;) the latest version of Hadoop contains a handy facility called Counters - if you use the PR power method you need to collect PR from dangling nodes in order to redistribute it later. You can use Counters for this, and save on a separate aggregation step.
>
> --
> Best regards,
> Andrzej Bialecki
> Information Retrieval, Semantic Web
> Embedded Unix, System Integration
> http://www.sigram.com
> Contact: info at sigram dot com
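To make the Counters suggestion concrete, here is a rough sketch in the classic mapred API. The enum name, the micro-unit scaling, and the method shapes are illustrative assumptions, not code from this thread - counters hold longs, so fractional PageRank mass has to be scaled:

import java.io.IOException;

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.RunningJob;

public class DanglingMass {

  // Counters are identified by an enum in the mapred API.
  public enum PR { DANGLING_MASS_MICRO }

  // Called from map() or reduce() whenever a dangling node is seen.
  // Counters hold longs, so the fractional PR mass is scaled to micro-units.
  public static void accumulate(float danglingRank, Reporter reporter) {
    reporter.incrCounter(PR.DANGLING_MASS_MICRO, (long) (danglingRank * 1000000L));
  }

  // Run one iteration, then read the aggregate back on the client side
  // so it can be redistributed in the next pass - no extra aggregation job.
  public static double runAndCollect(JobConf job) throws IOException {
    RunningJob running = JobClient.runJob(job);
    return running.getCounters().getCounter(PR.DANGLING_MASS_MICRO) / 1e6;
  }
}

JobClient.runJob() blocks until the job completes, so the aggregate is available immediately afterward without a second pass over the data.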
RE: Next release - 0.10.0 or 1.0.0 ?
Another way of looking at it might be to ask the question: what would make a great 1.0 release? What new features would be awesome? What might get people more excited? Having a 1.0 might make the project look like it has attained a real milestone.

Steve

-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, March 28, 2007 2:38 PM
To: nutch-dev@lucene.apache.org
Subject: Next release - 0.10.0 or 1.0.0 ?

Hi all,

I know it's a trivial issue, but still ... When this release is out, I propose that we should name the next release 1.0.0, and not 0.10.0. The effect is purely psychological, but it also reflects our confidence in the platform.

Many Open Source projects are afraid of going to 1.0.0 and seem to be unable to ever reach this level, as if it were a magic step beyond which they are obliged to make some implied but unjustified promises ... Perhaps it's because in the commercial world everyone knows what a 1.0.0 release means :)

The downside of version numbering that never reaches 1.0.0 is that casual users don't know how usable the software is - e.g. Nutch 0.10.0 could possibly mean that there are still 90 releases to go before it becomes usable. Therefore I propose the following:

* shorten the release cycle, so that we can make a release at least once every quarter. This was discussed before, and I hope we can make it happen, especially with the help of the new forces that joined the team ;)
* call the next version 1.0.0, and continue in increments of 0.1.0 for each bi-monthly or quarterly release,
* make critical bugfix / maintenance releases using increments of 0.0.1 - although the need for such would be greatly diminished with the shorter release cycle,
* once we arrive at versions greater than x.5.0 we should plan for a big release (increment of 1.0.0),
* we should use only single digits for small increments, i.e. limit them to values between 0-9.

What do you think?

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
Sequence File Question
Hey guys,

I have a MapReduce job that sets up a directory for PageRank. It iterates over all the segments and then outputs a MapFile containing the data. When I go to open the output directory with another MapReduce job, it fails, saying that it cannot find the path. The path that it is trying to open does not include the part-0 directory. My directory (and all other directories, for that matter) has the same structure, i.e. /path/part-0/whatever. I feel like this is a really stupid error and I have forgotten something that is easily fixed. Any ideas?

Steve
RE: Sequence File Question
Let me actually refine that question: why do some directories like the linkdb have a "current" subdirectory, and why do others like parse_data not? Is there a convention on this?

Steve

-----Original Message-----
From: Steve Severance [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, March 28, 2007 4:11 PM
To: nutch-dev@lucene.apache.org
Subject: Sequence File Question

Hey guys,

I have a MapReduce job that sets up a directory for PageRank. It iterates over all the segments and then outputs a MapFile containing the data. When I go to open the output directory with another MapReduce job, it fails, saying that it cannot find the path. The path that it is trying to open does not include the part-0 directory. My directory (and all other directories, for that matter) has the same structure, i.e. /path/part-0/whatever. I feel like this is a really stupid error and I have forgotten something that is easily fixed. Any ideas?

Steve
Image Search Engine Input
Hey all,

I am working on the basics of an image search engine, and I want to ask for feedback on something. Should I create a new directory in a segment, parse_image, and then put the images there? If not, where should I put them - in the parse_text?

I created a class ImageWritable just like the Jira task said. This class contains image metadata as well as two BytesWritable fields for the original image and the thumbnail.

One more question: what ramifications does that have for the type of Parse that I am returning? Do I need to create a ParseImage class to hold it? The actual parsing infrastructure is something that I am still studying, so any ideas here would be great.

Thanks,
Steve
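A minimal sketch of what an ImageWritable along these lines might look like - the metadata fields are assumptions based only on the description above, not the actual class:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

/** Holds an image plus its thumbnail and some metadata, as described above. */
public class ImageWritable implements Writable {
  private Text mimeType = new Text();                     // e.g. "image/jpeg" (illustrative)
  private int width, height;                              // original dimensions (illustrative)
  private BytesWritable original = new BytesWritable();   // raw image bytes
  private BytesWritable thumbnail = new BytesWritable();  // scaled-down copy

  public void write(DataOutput out) throws IOException {
    mimeType.write(out);
    out.writeInt(width);
    out.writeInt(height);
    original.write(out);
    thumbnail.write(out);
  }

  public void readFields(DataInput in) throws IOException {
    mimeType.readFields(in);
    width = in.readInt();
    height = in.readInt();
    original.readFields(in);
    thumbnail.readFields(in);
  }
}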
RE: Image Search Engine Input
So now that I have spent a few hours looking into how this works a lot more deeply, I am in even more of a conundrum. The fetcher passes the contents of the page to the parsers, and it assumes that text will be output from the parsers. For instance, even the SWF parser returns text. For all binary data - images, videos, music, etc. - this is problematic. Potentially confounding the problem even further, in the case of music both text and binary data can come from the same file. Even if that is a problem, I am not going to tackle it. So there are 3 choices for moving forward with an image search:

1. All image data can be encoded as strings. I really don't like that choice, since the indexer will index huge amounts of junk.
2. The fetcher can be modified to allow another output for binary data. This I think is the better choice, although it will be a lot more work. I am not sure that this is possible with MapReduce, since MapRunnable has only 1 output.
3. Images can be written into another directory for processing. This would need more work to automate but is probably a non-issue.

I want to do the right thing so that the image search can eventually be in the trunk. I don't want to have to change the way a lot of things work in the process. Let me know what you all think.

Steve

-----Original Message-----
From: Steve Severance [mailto:[EMAIL PROTECTED]]
Sent: Monday, March 26, 2007 4:04 PM
To: nutch-dev@lucene.apache.org
Subject: Image Search Engine Input

Hey all,

I am working on the basics of an image search engine, and I want to ask for feedback on something. Should I create a new directory in a segment, parse_image, and then put the images there? If not, where should I put them - in the parse_text?

I created a class ImageWritable just like the Jira task said. This class contains image metadata as well as two BytesWritable fields for the original image and the thumbnail.

One more question: what ramifications does that have for the type of Parse that I am returning? Do I need to create a ParseImage class to hold it? The actual parsing infrastructure is something that I am still studying, so any ideas here would be great.

Thanks,
Steve
Breaking change in webapp?
Hey,

I have an index that I am trying to search using the webapp. I am using the current trunk. When I run a search I get the following message:

HTTP Status 404 - no segments* file found: files:
type: Status report
message: no segments* file found: files:
description: The requested resource (no segments* file found: files:) is not available.
Apache Tomcat/5.5.20

The message I think refers to a Lucene issue: http://www.mail-archive.com/java-dev@lucene.apache.org/msg09044.html

I am not really sure what to do to fix it. I have not really delved into Lucene itself yet. Is there a different directory structure that I need to have for the index now? BTW, my searcher.dir is G:\NutchDeployment\crawl, which is the same thing that I used for 0.8.1. Any ideas?

Thanks,
Steve
indexing with current trunk
Hi,

I updated my test system to the 0.9 dev trunk, current as of yesterday. Now indexing does not work. I tried purging the linkdb and recreating it. I tried just running it on a single segment to see if I could find the error. Here is the output:

$ bin/nutch index crawl/index crawl/crawldb/ crawl/linkdb/ crawl/segments/20070307113353/
Indexer: starting
Indexer: linkdb: crawl/linkdb
Indexer: adding segment: crawl/segments/20070307113353
Optimizing index.
Indexer: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:402)
        at org.apache.nutch.indexer.Indexer.index(Indexer.java:273)
        at org.apache.nutch.indexer.Indexer.run(Indexer.java:295)
        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
        at org.apache.nutch.indexer.Indexer.main(Indexer.java:278)

Other things are working fine. Just not indexing.

Steve
RE: indexing with current trunk
Here is the log.

2007-03-22 15:45:39,851 WARN mapred.LocalJobRunner - job_pyll84
java.lang.NoSuchMethodError: org.apache.lucene.document.Document.add(Lorg/apache/lucene/document/Fieldable;)V
        at org.apache.nutch.indexer.basic.BasicIndexingFilter.filter(BasicIndexingFilter.java:62)
        at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:110)
        at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:215)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:317)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:137)
2007-03-22 15:45:40,043 FATAL indexer.Indexer - Indexer: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:402)
        at org.apache.nutch.indexer.Indexer.index(Indexer.java:273)
        at org.apache.nutch.indexer.Indexer.run(Indexer.java:295)
        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
        at org.apache.nutch.indexer.Indexer.main(Indexer.java:278)

Steve

-----Original Message-----
From: Sami Siren [mailto:[EMAIL PROTECTED]]
Sent: Thursday, March 22, 2007 4:03 PM
To: nutch-dev@lucene.apache.org
Subject: Re: indexing with current trunk

Steve Severance wrote:
> Hi,
> I updated my test system to the 0.9 dev trunk, current as of yesterday. Now indexing does not work. I tried purging the linkdb and recreating it. I tried just running it on a single segment to see if I could find the error. Here is the output:
>
> $ bin/nutch index crawl/index crawl/crawldb/ crawl/linkdb/ crawl/segments/20070307113353/
> Indexer: starting
> Indexer: linkdb: crawl/linkdb
> Indexer: adding segment: crawl/segments/20070307113353
> Optimizing index.
> Indexer: java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:402)
>         at org.apache.nutch.indexer.Indexer.index(Indexer.java:273)
>         at org.apache.nutch.indexer.Indexer.run(Indexer.java:295)
>         at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
>         at org.apache.nutch.indexer.Indexer.main(Indexer.java:278)
>
> Other things are working fine. Just not indexing.

Can you please check the log files for more specific error message(s)? Indexing works OK for me, but I have only tried it with small segments so far.

--
Sami Siren
RE: indexing with current trunk
-----Original Message-----
From: Sami Siren [mailto:[EMAIL PROTECTED]]
Sent: Thursday, March 22, 2007 4:27 PM
To: nutch-dev@lucene.apache.org
Subject: Re: indexing with current trunk

> Are you running on the local runner or in distributed mode? If distributed, then check that the Lucene version in the task tracker classpath is correct.

I am using a local runner. I have Lucene 2.0 and 2.1 in my lib dir for nutch.

Steve

> --
> Sami Siren
>
> Steve Severance wrote:
>> Here is the log.
>>
>> 2007-03-22 15:45:39,851 WARN mapred.LocalJobRunner - job_pyll84
>> java.lang.NoSuchMethodError: org.apache.lucene.document.Document.add(Lorg/apache/lucene/document/Fieldable;)V
>>         at org.apache.nutch.indexer.basic.BasicIndexingFilter.filter(BasicIndexingFilter.java:62)
>>         at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:110)
>>         at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:215)
>>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:317)
>>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:137)
>> 2007-03-22 15:45:40,043 FATAL indexer.Indexer - Indexer: java.io.IOException: Job failed!
>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:402)
>>         at org.apache.nutch.indexer.Indexer.index(Indexer.java:273)
>>         at org.apache.nutch.indexer.Indexer.run(Indexer.java:295)
>>         at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
>>         at org.apache.nutch.indexer.Indexer.main(Indexer.java:278)
>>
>> Steve
>>
>> -----Original Message-----
>> From: Sami Siren [mailto:[EMAIL PROTECTED]]
>> Sent: Thursday, March 22, 2007 4:03 PM
>> To: nutch-dev@lucene.apache.org
>> Subject: Re: indexing with current trunk
>>
>> Steve Severance wrote:
>>> Hi,
>>> I updated my test system to the 0.9 dev trunk, current as of yesterday. Now indexing does not work. I tried purging the linkdb and recreating it. I tried just running it on a single segment to see if I could find the error. Here is the output:
>>>
>>> $ bin/nutch index crawl/index crawl/crawldb/ crawl/linkdb/ crawl/segments/20070307113353/
>>> Indexer: starting
>>> Indexer: linkdb: crawl/linkdb
>>> Indexer: adding segment: crawl/segments/20070307113353
>>> Optimizing index.
>>> Indexer: java.io.IOException: Job failed!
>>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:402)
>>>         at org.apache.nutch.indexer.Indexer.index(Indexer.java:273)
>>>         at org.apache.nutch.indexer.Indexer.run(Indexer.java:295)
>>>         at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
>>>         at org.apache.nutch.indexer.Indexer.main(Indexer.java:278)
>>>
>>> Other things are working fine. Just not indexing.
>>
>> Can you please check the log files for more specific error message(s)? Indexing works OK for me, but I have only tried it with small segments so far.
>>
>> --
>> Sami Siren
[jira] Closed: (NUTCH-462) Noarchive urls are available via the cache link
[ https://issues.apache.org/jira/browse/NUTCH-462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Severance closed NUTCH-462.
---------------------------------

    Resolution: Fixed

Duplicate; see NUTCH-167. Has been fixed.

> Noarchive urls are available via the cache link
> -----------------------------------------------
>
>           Key: NUTCH-462
>           URL: https://issues.apache.org/jira/browse/NUTCH-462
>       Project: Nutch
>    Issue Type: Bug
>    Components: web gui
>      Reporter: Steve Severance
>       Fix For: 0.8.1
>
> If a robots.txt file specifies a Noarchive statement, then urls that are contained as part of that path should not be available via the cached link. For example, Noarchive: / means that no pages should be available via the cached link.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
Multi-pass algorithms
If I want to have an algorithm that runs over the same data multiple times (it is an iterative algorithm), is there a way to have my MapReduce job use the same directory for both input and output? Or do I need to make a temp directory for each iteration?

Steve
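There is no in-place output in MapReduce - a job will typically refuse to write into a directory that already exists - so the usual answer is a driver that alternates between directories, deleting the previous round's output once it has been consumed. A sketch against the old mapred API; the class name, job setup, and path naming are all illustrative:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class IterativeDriver {
  public static void main(String[] args) throws Exception {
    JobConf defaults = new JobConf(IterativeDriver.class);
    FileSystem fs = FileSystem.get(defaults);
    Path current = new Path(args[0]);           // initial input
    int iterations = Integer.parseInt(args[1]);

    for (int i = 0; i < iterations; i++) {
      Path next = new Path(args[0] + "-iter" + i);  // fresh output dir per pass
      JobConf job = new JobConf(defaults);
      job.setInputPath(current);
      job.setOutputPath(next);
      // job.setMapperClass(...); job.setReducerClass(...); etc.
      JobClient.runJob(job);
      if (i > 0) fs.delete(current);            // drop the previous round's temp dir
      current = next;                           // output becomes next round's input
    }
  }
}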
Launching custom classes
Hi all,

I have a custom class in the nutch jar. Everything works fine in Eclipse, but when I try to run it from the command line using bin/nutch it throws a java.lang.NoClassDefFoundError. All the pages on the internet helpfully suggested that I make sure that the jar is in the classpath. I think that everything is correct, since I can invoke any of the nutch classes via its class name, e.g. bin/nutch org.apache.nutch.crawl.Crawl. This may be a simple Java problem, but I have been banging my head against this all weekend.

Thanks,
Steve
RE: Launching custom classes
-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]]
Sent: Monday, March 19, 2007 10:18 AM
To: nutch-dev@lucene.apache.org
Subject: Re: Launching custom classes

> Steve Severance wrote:
>> Hi all,
>> I have a custom class in the nutch jar. Everything works fine in Eclipse, but when I try to run it from the command line using bin/nutch it throws a java.lang.NoClassDefFoundError. All the pages on the internet helpfully suggested that I make sure that the jar is in the classpath. I think that
>
> What needs to be on your classpath is the *.job jar. The bin/nutch script takes care of that if you built your Nutch using the command-line version of ant.

Ok. Thanks. 2 more things.

I have 2 directories for nutch: 1 is synchronized with SVN and the other is my working directory. If I run the ant package command in my working directory, ant says:

BUILD FAILED
g:\NutchInstance\build.xml:61: Specify at least one source--a file or resource collection.
Total time: 0 seconds

If I copy my source folder into the trunk dir for the directory that is synced with SVN, my class does not get added. I have been studying the build.xml file and I see the plugin generation jobs, but my reasoning is that since my package name is under org.apache.nutch, my package should be compiled into the core. Is this correct? Do I need to make a separate build job for my class or something like that?

Second, how do people generally set up their development machines? Do you use Eclipse, and if so do you just work off of the trunk, or what? What is the recommendation for source control in this situation? Is there a way to make a subversion repository for me so that I can add my own code but also receive updates from the trunk? Using an open source project like this seems to add some complexity to the source control process. But I am sure this problem has already been worked out.

Regards,
Steve

> --
> Best regards,
> Andrzej Bialecki
> Information Retrieval, Semantic Web
> Embedded Unix, System Integration
> http://www.sigram.com
> Contact: info at sigram dot com
[jira] Created: (NUTCH-462) Noarchive urls are available via the cache link
Noarchive urls are available via the cache link
-----------------------------------------------

          Key: NUTCH-462
          URL: https://issues.apache.org/jira/browse/NUTCH-462
      Project: Nutch
   Issue Type: Bug
   Components: web gui
     Reporter: Steve Severance
      Fix For: 0.8.1

If a robots.txt file specifies a Noarchive statement, then urls that are contained as part of that path should not be available via the cached link. For example, Noarchive: / means that no pages should be available via the cached link.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
RE: Indexing the Interesting Part Only...
I think if anyone here had the perfect answer for that one, they would have sold it to Google, Microsoft or Yahoo for a ton of money. You will need an algorithm that can detect ads. I have not written ad filters, since my search engine is currently using a domain whitelist. I can tell you that a whole-web crawl will definitely need it, since it can cut down on pages in the index by 10-20%. If you do a whole-web crawl you will also need spam detection. I would recommend looking for some academic papers on the topic - maybe use CiteSeer or something like that.

Steve

-----Original Message-----
From: d e [mailto:[EMAIL PROTECTED]]
Sent: Saturday, March 10, 2007 3:07 PM
To: nutch-dev@lucene.apache.org
Subject: Re: Indexing the Interesting Part Only...

We plan to index many websites. Got any suggestions on how to drop the junk without having to do too much work for each such site? Know anyone who has a background in doing this sort of thing? What sorts of approaches would you recommend? Are there existing plugins I should consider using?

On 3/9/07, J. Delgado [EMAIL PROTECTED] wrote:
> You have to build a special HTML junk parser.
>
> 2007/3/9, d e [EMAIL PROTECTED]:
>> If I'm indexing a news article, I want to avoid getting the junk (other than the title, author and article) into the index. I want to avoid getting the advertisements, etc. How do I do that sort of thing? What parts of what manual should I be reading so I will know how to do this sort of thing?
RE: How to read data from segments
-----Original Message-----
From: Dennis Kubes [mailto:[EMAIL PROTECTED]]
Sent: Friday, March 09, 2007 9:47 AM
To: nutch-dev@lucene.apache.org
Subject: Re: How to read data from segments

> Steve Severance wrote:
>> I am trying to learn the internals of Nutch, and by extension Hadoop, right now. I am implementing an algorithm that processes link and content data. I am stuck on how to open the ParseDatas contained in the segments. Each subdir of a segment (crawl_generate, etc...) contains a subdir part-0, which, if I understand correctly, means that if I had more computers as part of a hadoop cluster there would also be part-1 and so on.
>
> There is one directory for each split. One interesting thing to note is that multiple writers (i.e. map and reduce tasks) can't write to the same file on the DFS at the same time. So each reduce task writes out its own split to its own directory.

Does this mean that there might be some parts that are Map outputs and others that are Reduce outputs?

>> When I try to open them with an ArrayFile.Reader it cannot find the file. I know that the Path class is working properly, since it can enumerate subdirectories. I tried hard-coding the part-0 into the path, but that did not work either. The code is as follows:
>>
>> Path segmentDir = new Path(args[0]);
>> Path pageRankDir = new Path(args[1]);
>> Path segmentPath = new Path(segmentDir, "parse_data/part-0");
>> ArrayFile.Reader parses = null;
>> try {
>>   parses = new ArrayFile.Reader(segmentPath.getFileSystem(config), segmentPath.toString(), config);
>> } catch (IOException ex) {
>>   System.out.println("An error occurred while opening the segment. Message: " + ex.getMessage());
>> }
>>
>> The exception reports that it cannot open the file. I also tried merging the segments, but that did not work either. Any help would be greatly appreciated.
>
> Just like Andrzej said: it is in the output formats, and they have getReaders and getEntry methods. I have a little tool that is a MapFileExplorer; if you want it, let me know and I will send you a copy.

Yes, that would be great if you are willing to share it. I was already thinking about writing something similar.

>> One more thing. As a new nutch developer I am keeping a running list of problems/questions that I have and their solutions. A lot of questions arise from not understanding how to work with the internals, specifically understanding the building blocks of Hadoop, such as file types and why there are custom types that Hadoop uses, e.g. why Text instead of String. I noticed that in a mailing list post earlier this year the lack of detailed information for new developers was cited as a barrier to more involvement. I would be happy to contribute this back to the wiki if there is interest.
>
> Absolutely. The more documentation we have, especially for new developers, the better. If you need any questions answered in doing this, give me a shout and I will help as much as I can.
>
> Dennis Kubes

What is the best way to proceed with this? Should I make a new wiki page? Here is what I am thinking: have an overview of Nutch and Hadoop, including code samples of basic tasks like getting data. And by overview I mean a detailed overview, so that someone without distributed computing or search experience will be able to understand. It will not include IR basics, as those are fairly well documented elsewhere. The Hadoop one might want to live on its own wiki. I also am going to write up my implementation of PageRank as a tutorial, since it will cover, I think, a lot of Hadoop and Nutch basics, including Hadoop types, using Hadoop files, and MapReduce.

Regards,
Steve
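For reference, the getReaders/getEntry pattern Dennis mentions looks roughly like this - a sketch assuming Text keys and the old mapred API (earlier trunks used UTF8 keys); the class name and argument handling are purely illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapFileOutputFormat;
import org.apache.hadoop.mapred.lib.HashPartitioner;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.util.NutchConfiguration;

public class ParseDataLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    Path parseData = new Path(args[0], "parse_data"); // args[0] = a segment dir

    // One reader per part-* subdirectory; getReaders hides that layout.
    MapFile.Reader[] readers = MapFileOutputFormat.getReaders(fs, parseData, conf);

    // The same partitioner the writing job used routes the key to the right part.
    Text key = new Text(args[1]);                     // a URL
    ParseData value = new ParseData();
    ParseData found = (ParseData) MapFileOutputFormat.getEntry(
        readers, new HashPartitioner(), key, value);
    System.out.println(found == null ? "not found" : found.toString());
  }
}

The partitioner matters: getEntry only consults the part file the key would have hashed to, which is why the part-* layout stays invisible to callers.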
RE: Indexing the Interesting Part Only...
This is a Natural Language Processing problem, although you can certainly take hints from URL graph structures and host block lists. Nutch does not support this natively (that I know of), but you can certainly extend Nutch to be able to recognize and filter ads. Start by looking at how to develop plugins, and also look at the indexing plugin.

Regards,
Steve

-----Original Message-----
From: d e [mailto:[EMAIL PROTECTED]]
Sent: Friday, March 09, 2007 6:49 PM
To: nutch-dev@lucene.apache.org
Subject: Indexing the Interesting Part Only...

If I'm indexing a news article, I want to avoid getting the junk (other than the title, author and article) into the index. I want to avoid getting the advertisements, etc. How do I do that sort of thing? What parts of what manual should I be reading so I will know how to do this sort of thing?
How to read data from segments
I am trying to learn the internals of Nutch, and by extension Hadoop, right now. I am implementing an algorithm that processes link and content data. I am stuck on how to open the ParseDatas contained in the segments. Each subdir of a segment (crawl_generate, etc...) contains a subdir part-0, which, if I understand correctly, means that if I had more computers as part of a hadoop cluster there would also be part-1 and so on.

When I try to open them with an ArrayFile.Reader it cannot find the file. I know that the Path class is working properly, since it can enumerate subdirectories. I tried hard-coding the part-0 into the path, but that did not work either. The code is as follows:

Path segmentDir = new Path(args[0]);
Path pageRankDir = new Path(args[1]);
Path segmentPath = new Path(segmentDir, "parse_data/part-0");
ArrayFile.Reader parses = null;
try {
  parses = new ArrayFile.Reader(segmentPath.getFileSystem(config), segmentPath.toString(), config);
} catch (IOException ex) {
  System.out.println("An error occurred while opening the segment. Message: " + ex.getMessage());
}

The exception reports that it cannot open the file. I also tried merging the segments, but that did not work either. Any help would be greatly appreciated.

One more thing. As a new nutch developer I am keeping a running list of problems/questions that I have and their solutions. A lot of questions arise from not understanding how to work with the internals, specifically understanding the building blocks of Hadoop, such as file types and why there are custom types that Hadoop uses, e.g. why Text instead of String. I noticed that in a mailing list post earlier this year the lack of detailed information for new developers was cited as a barrier to more involvement. I would be happy to contribute this back to the wiki if there is interest.

Regards,
Steve
RE: How to read data from segments
Hi Andrzej,

Thanks for the reply. I have a couple more questions that I am not quite sure about. MapFile.Reader[] represents the individual readers for each piece of a MapFile, such that part-0, part-1, etc. are each represented by a reader? In that case, is the correct path to the segment something like crawl/segments/<some segment>, and is that the path that I should pass? Currently it is returning 0 readers.

Also, generally on PageRank: I implemented a version in .NET on MapReduce for another project that I was working on. However, that was at my last job, and I have started a new company that is developing a vertical search on nutch/hadoop. My basic idea of how to implement PageRank for nutch is as follows:

Step 1: Build basic data. I have created a PageRankDatum class to hold the information that PageRank requires for its computation. PageRankDatum contains the PageRank value and the number of outbound links. This would enable the key/value pair to be <Url, PageRankDatum>.

Step 2: Compute the ranks. Collect the resulting ranks to the output and write them out. Reduce would in effect be an identity function, I think. With this step we need to look up the inbound links for a Url, and then how many other outbound links each link has. That was the purpose of storing the outbound link count in addition to the page rank.

If I have a Hadoop cluster (currently I am running this on my dev machine; more machines are on the way for testing), is the linkdb accessible from all nodes? I am thinking that the PageRankDb will work basically the same way: after step 1, write it out so that it will be accessible. Also, several papers have shown that in parallel computation of PageRank, being able to look up the ranks that have been computed on other nodes can lead to faster convergence. Is this possible in the map reduce model?

Regards,
Steve

-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]]
Sent: Thursday, March 08, 2007 4:43 PM
To: nutch-dev@lucene.apache.org
Subject: Re: How to read data from segments

Steve Severance wrote:
> I am trying to learn the internals of Nutch, and by extension Hadoop, right now. I am implementing an algorithm that processes link and content data. I am stuck on how to open the ParseDatas contained in the segments. Each subdir of a segment (crawl_generate, etc...) contains a subdir part-0, which, if I understand correctly, means that if I had more computers as part of a hadoop cluster there would also be part-1 and so on.

Correct.

> When I try to open them with an ArrayFile.Reader it cannot find the file. I know that the Path class is working properly, since it can enumerate subdirectories. I tried hard-coding the part-0 into the path, but that did not work either. The code is as follows:
>
> Path segmentDir = new Path(args[0]);
> Path pageRankDir = new Path(args[1]);

Ah-ha, pageRankDir .. ;)

> Path segmentPath = new Path(segmentDir, "parse_data/part-0");

Please take a look at the classes MapFileOutputFormat and SequenceFileOutputFormat. Both support this nested dir structure, which is a by-product of producing the data via map-reduce, and offer methods for getting MapFile.Reader[] or SequenceFile.Reader[], and then getting a selected entry. Cf. also the code attached to the HADOOP-175 issue in JIRA.

> One more thing. As a new nutch developer I am keeping a running list of problems/questions that I have and their solutions. A lot of questions arise from not understanding how to work with the internals, specifically understanding the building blocks of Hadoop, such as file types and why there are custom types that Hadoop uses, e.g. why Text instead of String. I noticed that in a mailing list post earlier this year the lack of detailed information for new developers was cited as a barrier to more involvement. I would be happy to contribute this back to the wiki if there is interest.

Definitely, you are welcome to contribute in this area - this is always needed. Although this particular information might be more suitable for the Hadoop wiki ...

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
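A minimal sketch of the PageRankDatum described above, assuming Hadoop's Writable interface - only the rank value and outlink count are mentioned in the thread, so any other fields (and the helper method) are illustrative:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

/** Per-URL state for the PageRank iteration: current score plus out-degree. */
public class PageRankDatum implements Writable {
  private float score;        // current PageRank value
  private int outlinkCount;   // number of outbound links, used to split the score

  public PageRankDatum() {}

  public PageRankDatum(float score, int outlinkCount) {
    this.score = score;
    this.outlinkCount = outlinkCount;
  }

  /** The share of this page's rank passed along each outlink. */
  public float contributionPerLink() {
    return outlinkCount == 0 ? 0.0f : score / outlinkCount;
  }

  public void write(DataOutput out) throws IOException {
    out.writeFloat(score);
    out.writeInt(outlinkCount);
  }

  public void readFields(DataInput in) throws IOException {
    score = in.readFloat();
    outlinkCount = in.readInt();
  }
}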
RE: 0.9 release
Also, one thing that comes to my mind, as I have been struggling with it: there is no upgrade path that I know of from 0.8.x to 0.9.0. I followed the directions in the wiki and that did not work. I later found in a mailing list post that everything needs to be regenerated. There needs to be some guidance on whether a 0.8.x upgrade is possible and, if it is, how to do it.

Regards,
Steve

iVirtuoso, Inc
Steve Severance
Partner, Chief Technology Officer
[EMAIL PROTECTED]
mobile: (240) 472 - 9645

-----Original Message-----
From: Sami Siren [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, March 07, 2007 2:10 PM
To: nutch-dev@lucene.apache.org
Subject: Re: 0.9 release

> 2. Any outstanding things that need to get done that aren't really code that needs to get committed, e.g., things we need to close the loop on

One thing that comes to my mind is the web site: we have tutorials specifically for 0.7.x and 0.8.x, and it might be confusing for users if we left it as is and released 0.9.0.

--
Sami Siren
[jira] Commented: (NUTCH-296) Image Search
[ https://issues.apache.org/jira/browse/NUTCH-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478920 ]

Steve Severance commented on NUTCH-296:
---------------------------------------

I know the committers are hard at work on the 0.9.0 release, but I have begun to work on the first piece of this, the parser. I am looking for guidance as to how the images and thumbnails should be stored. One file per image is probably too inefficient. Are there existing file formats that the community would like to use?

I am building a parser that can handle most image types. Should I break them out into individual plugins so there is one per file type? E.g. jpg will have an extension, gif will have a separate extension, etc. This may be more flexible in the long run. This is the first project that I am undertaking on the nutch codebase, so any guidance would be great.

Steve

> Image Search
> ------------
>
>          Key: NUTCH-296
>          URL: https://issues.apache.org/jira/browse/NUTCH-296
>      Project: Nutch
>   Issue Type: New Feature
>     Reporter: Thomas Delnoij
>     Priority: Minor
>
> Per the discussion in the Nutch-User mailing list, there is a wish for an Image Search add-on component that will index images.
>
> Must have:
> - retrieve outlinks to image files from fetched pages
> - generate thumbnails from images
> - thumbnails are stored in the segments as ImageWritable that contains the compressed binary data and some meta data
>
> Should have:
> - implemented as hadoop map reduce job
> - should be separate from main Nutch codeline as it breaks general Nutch logic of one url == one index document.
>
> Could have:
> - store the original image in the segments
>
> Would like to have:
> - search interface for image index
> - parameterizable thumbnail generation (width, height, quality)

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
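On the "one file per image is probably too inefficient" point: a usual Hadoop answer is to pack many images into a single SequenceFile keyed by URL. A sketch - BytesWritable is used here only to keep the example self-contained; the issue's ImageWritable would slot in as the value class:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ImagePacker {

  // Appends each image as one <url, bytes> record in a single SequenceFile,
  // avoiding a huge number of tiny files on the DFS.
  public static void pack(Configuration conf, Path out,
                          String[] urls, byte[][] images) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, Text.class, BytesWritable.class);
    try {
      for (int i = 0; i < urls.length; i++) {
        writer.append(new Text(urls[i]), new BytesWritable(images[i]));
      }
    } finally {
      writer.close();
    }
  }
}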
RE: 0.9 release
I have gotten this working. A little bit of tweaking was involved, but everything works fine now.

Steve

-----Original Message-----
From: Steve Severance [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, March 07, 2007 2:19 PM
To: nutch-dev@lucene.apache.org
Subject: RE: 0.9 release

Also, one thing that comes to my mind, as I have been struggling with it: there is no upgrade path that I know of from 0.8.x to 0.9.0. I followed the directions in the wiki and that did not work. I later found in a mailing list post that everything needs to be regenerated. There needs to be some guidance on whether a 0.8.x upgrade is possible and, if it is, how to do it.

Regards,
Steve

iVirtuoso, Inc
Steve Severance
Partner, Chief Technology Officer
[EMAIL PROTECTED]
mobile: (240) 472 - 9645

-----Original Message-----
From: Sami Siren [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, March 07, 2007 2:10 PM
To: nutch-dev@lucene.apache.org
Subject: Re: 0.9 release

> 2. Any outstanding things that need to get done that aren't really code that needs to get committed, e.g., things we need to close the loop on

One thing that comes to my mind is the web site: we have tutorials specifically for 0.7.x and 0.8.x, and it might be confusing for users if we left it as is and released 0.9.0.

--
Sami Siren
[jira] Created: (NUTCH-453) Move stop words to a config file
Move stop words to a config file
--------------------------------

          Key: NUTCH-453
          URL: https://issues.apache.org/jira/browse/NUTCH-453
      Project: Nutch
   Issue Type: Improvement
   Components: indexer, searcher
     Reporter: Steve Severance
     Priority: Minor

Move the stop words from the code to a config file. This will allow the stop words to be modified without recompiling the code. The format could be the same as regex-urlfilter, where regexes are used to define the words, or a plain text file of words could be used.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
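The plain-text variant could be as simple as one word per line. A sketch of a loader - the file format (comments starting with '#', lower-casing) is an illustrative assumption, since NUTCH-453 does not specify one:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class StopWords {
  // Loads one stop word per line; '#' starts a comment, blank lines are skipped.
  public static Set<String> load(String file) throws IOException {
    Set<String> words = new HashSet<String>();
    BufferedReader in = new BufferedReader(new FileReader(file));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        line = line.trim();
        if (line.length() == 0 || line.startsWith("#")) continue;
        words.add(line.toLowerCase());
      }
    } finally {
      in.close();
    }
    return words;
  }
}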
[jira] Commented: (NUTCH-224) Nutch doesn't handle Korean text at all
[ https://issues.apache.org/jira/browse/NUTCH-224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12477190 ]

Steve Severance commented on NUTCH-224:
---------------------------------------

The PDF parser for 0.8.1 also fails on Korean text.

Steve

> Nutch doesn't handle Korean text at all
> ---------------------------------------
>
>               Key: NUTCH-224
>               URL: https://issues.apache.org/jira/browse/NUTCH-224
>           Project: Nutch
>        Issue Type: Bug
>        Components: indexer
>  Affects Versions: 0.7.1
>          Reporter: KuroSaka TeruHiko
>
> I was browsing NutchAnalysis.jj and found that Hangul Syllables (U+AC00 ... U+D7AF; U+XXXX means a Unicode character of the hex value XXXX) are not part of the LETTER or CJK class. This seems to me to mean that Nutch cannot handle Korean documents at all.
>
> I posted the above message at the nutch-user ML and Cheolgoo Kang [EMAIL PROTECTED] replied:
>
> There was a similar issue with Lucene's StandardTokenizer.jj:
> http://issues.apache.org/jira/browse/LUCENE-444
> and
> http://issues.apache.org/jira/browse/LUCENE-461
> I have almost no experience with Nutch, but you can handle it like those issues above.
>
> Both fixes should probably be ported back to NutchAnalysis.jj.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
RE: Performance optimization for Nutch index / query
Hi,

I would like to comment, if I might. I am not a Nutch/Lucene hacker yet - I have been working with it for only a few weeks. However, I am looking at extending it significantly to add some new features, and some of these will require extending Lucene as well.

First, I have a test implementation of PageRank - really an approximation - that runs on top of map reduce. Are people interested in having this in the index? I am interested in how this and other metadata might interact with your super field. For instance, I am also looking at using relevance feedback and having that as one of the criteria for ranking documents. I was also considering using an outside data source, possibly even another Lucene index, to store these values on a per-document basis.

The other major feature I am thinking about is using distance between words and text type. Do you know of anyone who has done this?

Regards,
Steve

iVirtuoso, Inc
Steve Severance
Partner, Chief Technology Officer
[EMAIL PROTECTED]
mobile: (240) 472 - 9645

-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]]
Sent: Thursday, February 22, 2007 7:44 PM
To: nutch-dev@lucene.apache.org
Subject: Performance optimization for Nutch index / query

Hi all,

This very long post is meant to initiate a discussion. There is no code yet. Be warned that it discusses low-level Nutch/Lucene stuff.

Nutch queries are currently translated into complex Lucene queries. This is necessary in order to take into account score factors coming from various document parts, such as URL, host, title, content, and anchors. Typically, the translation provided by query-basic looks like this for single-term queries:

(1)
Query: term1
Parsed: term1
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 title:term1^1.5 host:term1^2.0)

For queries consisting of two or more terms it looks like this (Nutch uses implicit AND):

(2)
Query: term1 term2
Parsed: term1 term2
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 title:term1^1.5 host:term1^2.0) +(url:term2^4.0 anchor:term2^2.0 content:term2 title:term2^1.5 host:term2^2.0) url:"term1 term2"~2147483647^4.0 anchor:"term1 term2"~4^2.0 content:"term1 term2"~2147483647 title:"term1 term2"~2147483647^1.5 host:"term1 term2"~2147483647^2.0

By the way, please note the absurd default slop value - in the case of anchors it defeats the purpose of having the ANCHOR_GAP ...

Let's list other common query types:

(3)
Query: term1 term2 term3
Parsed: term1 term2 term3
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 title:term1^1.5 host:term1^2.0) +(url:term2^4.0 anchor:term2^2.0 content:term2 title:term2^1.5 host:term2^2.0) +(url:term3^4.0 anchor:term3^2.0 content:term3 title:term3^1.5 host:term3^2.0) url:"term1 term2 term3"~2147483647^4.0 anchor:"term1 term2 term3"~4^2.0 content:"term1 term2 term3"~2147483647 title:"term1 term2 term3"~2147483647^1.5 host:"term1 term2 term3"~2147483647^2.0

For phrase queries it looks like this:

(4)
Query: "term1 term2"
Parsed: "term1 term2"
Translated: +(url:"term1 term2"^4.0 anchor:"term1 term2"^2.0 content:"term1 term2" title:"term1 term2"^1.5 host:"term1 term2"^2.0)

For mixed term and phrase queries it looks like this:

(5)
Query: term1 "term2 term3"
Parsed: term1 "term2 term3"
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 title:term1^1.5 host:term1^2.0) +(url:"term2 term3"^4.0 anchor:"term2 term3"^2.0 content:"term2 term3" title:"term2 term3"^1.5 host:"term2 term3"^2.0)

For queries with the NOT operator it looks like this:

(6)
Query: term1 -term2
Parsed: term1 -term2
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 title:term1^1.5 host:term1^2.0) -(url:term2^4.0 anchor:term2^2.0 content:term2 title:term2^1.5 host:term2^2.0)

(7)
Query: term1 term2 -term3
Parsed: term1 term2 -term3
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 title:term1^1.5 host:term1^2.0) +(url:term2^4.0 anchor:term2^2.0 content:term2 title:term2^1.5 host:term2^2.0) -(url:term3^4.0 anchor:term3^2.0 content:term3 title:term3^1.5 host:term3^2.0) url:"term1 term2"~2147483647^4.0 anchor:"term1 term2"~4^2.0 content:"term1 term2"~2147483647 title:"term1 term2"~2147483647^1.5 host:"term1 term2"~2147483647^2.0

(8)
Query: "term1 term2" -term3
Parsed: "term1 term2" -term3
Translated: +(url:"term1 term2"^4.0 anchor:"term1 term2"^2.0 content:"term1 term2" title:"term1 term2"^1.5 host:"term1 term2"^2.0) -(url:term3^4.0 anchor:term3^2.0 content:term3 title:term3^1.5 host:term3^2.0)

(9)
Query: term1 -"term2 term3"
Parsed: term1 -"term2 term3"
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 title:term1^1.5 host:term1^2.0) -(url:"term2 term3"^4.0 anchor:"term2 term3"^2.0 content:"term2 term3" title:"term2 term3"^1.5 host:"term2 term3"^2.0)

WHEW ... !!! Are you tired? Well, Lucene is tired of these queries too. They are too long! They are absurdly long and complex. For large indexes the time to evaluate them may run into several
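For readers following along, translation (1) above corresponds to a Lucene BooleanQuery built roughly as follows - a sketch of the expanded query's shape only, not Nutch's actual query-basic code:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class TranslatedQuery {

  // Builds the expanded form of translation (1):
  // +(url:term1^4.0 anchor:term1^2.0 content:term1 title:term1^1.5 host:term1^2.0)
  static BooleanQuery singleTerm(String term) {
    BooleanQuery perField = new BooleanQuery();
    perField.add(boosted("url", term, 4.0f), BooleanClause.Occur.SHOULD);
    perField.add(boosted("anchor", term, 2.0f), BooleanClause.Occur.SHOULD);
    perField.add(boosted("content", term, 1.0f), BooleanClause.Occur.SHOULD);
    perField.add(boosted("title", term, 1.5f), BooleanClause.Occur.SHOULD);
    perField.add(boosted("host", term, 2.0f), BooleanClause.Occur.SHOULD);

    BooleanQuery query = new BooleanQuery();
    query.add(perField, BooleanClause.Occur.MUST); // the leading '+'
    return query;
  }

  static TermQuery boosted(String field, String text, float boost) {
    TermQuery tq = new TermQuery(new Term(field, text));
    tq.setBoost(boost);
    return tq;
  }
}

Multi-term and phrase translations multiply this structure per term, plus the sloppy per-field phrase clauses shown above, which is exactly the blow-up the post is complaining about.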