RE: Sequence File Question

2007-03-29 Thread Steve Severance
 -Original Message-
 From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, March 28, 2007 4:34 PM
 To: nutch-dev@lucene.apache.org
 Subject: Re: Sequence File Question
 
 Steve Severance wrote:
  Let me actually refine that question: why do some directories like the linkdb have a current, and why do others like parse_data not? Is there a convention on this?
 
 First, to answer your original question: you should use
 MapFileOutputFormat class for reading such output. It handles these
 part- subdirectories automatically.
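 
 A minimal sketch of that, written against the Hadoop/Nutch API of this era (the Text key type and the exact getReaders() signature are from memory, so treat them as assumptions to verify):
 
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.MapFile;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapred.MapFileOutputFormat;
 import org.apache.nutch.parse.ParseData;
 import org.apache.nutch.util.NutchConfiguration;
 
 // Sketch: dump the parse_data of one segment; getReaders() opens every part-NNNNN for us.
 public class ParseDataDump {
   public static void main(String[] args) throws Exception {
     Configuration conf = NutchConfiguration.create();
     Path parseData = new Path(args[0], "parse_data");   // args[0] = segment directory
     FileSystem fs = parseData.getFileSystem(conf);
 
     MapFile.Reader[] readers = MapFileOutputFormat.getReaders(fs, parseData, conf);
     Text key = new Text();                               // assuming Text (URL) keys
     ParseData value = new ParseData();
     for (int i = 0; i < readers.length; i++) {
       while (readers[i].next(key, value)) {
         System.out.println(key + "\t" + value.getTitle());
       }
       readers[i].close();
     }
   }
 }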
 
 Second, the current subdirectory is there in order to properly handle
 DB updates - or actually replacements - see e.g. CrawlDb.install()
 method for details. This is not needed in case of segments, which are
 created once and never updated.

How does the reader know which layout to expect? For instance, I can make a reader to read a linkdb just by instantiating it on the directory crawl/linkdb, and it knows to go inside the current directory. But when opening a parse_data there is no current. So how does it know which to expect?

Steve

 
 Thirdly, although you didn't ask about it ;) the latest version of
 Hadoop contains a handy facility called Counters - if you use the PR
 PowerMethod you need to collect PR from dangling nodes in order to
 redistribute it later. You can use Counters for this, and save on a
 separate aggregation step.
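 
 A rough sketch of that idea. Counters hold longs, so the floating-point PR mass has to be scaled; the PageRankDatum accessors (the class discussed further down in this archive) and the exact Counters calls below are assumptions to check against the Hadoop version in use:
 
 import java.io.IOException;
 import org.apache.hadoop.io.Writable;
 import org.apache.hadoop.io.WritableComparable;
 import org.apache.hadoop.mapred.MapReduceBase;
 import org.apache.hadoop.mapred.Mapper;
 import org.apache.hadoop.mapred.OutputCollector;
 import org.apache.hadoop.mapred.Reporter;
 
 // Sketch: accumulate the PR mass of dangling pages in a counter instead of a separate job.
 public class DanglingMassMapper extends MapReduceBase implements Mapper {
 
   public enum PrCounter { DANGLING_MASS_MICROS }   // counters are longs, so scale by 1e6
 
   public void map(WritableComparable key, Writable value,
                   OutputCollector output, Reporter reporter) throws IOException {
     PageRankDatum datum = (PageRankDatum) value;   // hypothetical datum: score + outlink count
     if (datum.getOutlinkCount() == 0) {
       reporter.incrCounter(PrCounter.DANGLING_MASS_MICROS, (long) (datum.getScore() * 1e6));
     }
     output.collect(key, value);                    // pass the record through unchanged
   }
 }
 // In the driver, after RunningJob job = JobClient.runJob(conf), read it back with e.g.
 //   double danglingMass = job.getCounters().getCounter(PrCounter.DANGLING_MASS_MICROS) / 1e6;
 // and redistribute danglingMass in the next iteration.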
 
 
 --
 Best regards,
 Andrzej Bialecki 
   ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com



RE: Next release - 0.10.0 or 1.0.0 ?

2007-03-28 Thread Steve Severance
Another way of looking at it might be to ask the question what would make a 
great 1.0 release? What new features would be awesome? What might get people 
more excited?

Having a 1.0 might make the project look like it has attained a real milestone.

Steve
 -Original Message-
 From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, March 28, 2007 2:38 PM
 To: nutch-dev@lucene.apache.org
 Subject: Next release - 0.10.0 or 1.0.0 ?
 
 Hi all,
 
 I know it's a trivial issue, but still ... When this release is out, I
 propose that we should name the next release 1.0.0, and not 0.10.0. The
 effect is purely psychological, but it also reflects our confidence in
 the platform.
 
 Many Open Source projects are afraid of going to 1.0.0 and seem to be
 unable to ever reach this level, as if it were a magic step beyond
 which
 they are obliged to make some implied but unjustified promises ...
 Perhaps it's because in the commercial world everyone knows what a
 1.0.0
 release means :) The downside of the version numbering that never
 reaches 1.0.0 is that casual users don't know how usable the software
 is
 - e.g. Nutch 0.10.0 could possibly mean that there are still 90
 releases
 to go before it becomes usable.
 
 Therefore I propose the following:
 
 * shorten the release cycle, so that we can make a release at least
 once
 every quarter. This was discussed before, and I hope we can make it
 happen, especially with the help of new forces that joined the team ;)
 
 * call the next version 1.0.0, and continue in increments of 0.1.0 for
 each bi-monthly or quarterly release,
 
 * make critical bugfix / maintenance releases using increments of 0.0.1
 - although the need for such would be greatly diminished with the
 shorter release cycle.
 
 * once we arrive at versions greater than x.5.0 we should plan for a
 big
 release (increment of 1.0.0).
 
 * we should use only single digits for small increments, i.e. limit
 them
 to values between 0-9.
 
 What do you think?
 
 
 --
 Best regards,
 Andrzej Bialecki 
   ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com



Sequence File Question

2007-03-28 Thread Steve Severance
Hey guys,
I have a mapreduce job that sets up a directory for pagerank. It iterates
over all the segments and then outputs a MapFile containing the data. When I
go to open the outputted directory with another MapReduce job it fails
saying that it cannot find the path. The path that it thinks it is trying to
open does not include the part-0 directory. Both my directory (and all
other directories for that matter) have the same structure which is
/path/part-0/whatever. I feel like this is a really stupid error and I
have forgotten something that is easily fixed. Any ideas?

Steve



RE: Sequence File Question

2007-03-28 Thread Steve Severance
Let me actually refine that question: why do some directories like the linkdb have a current, and why do others like parse_data not? Is there a convention on this?

Steve

 -Original Message-
 From: Steve Severance [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, March 28, 2007 4:11 PM
 To: nutch-dev@lucene.apache.org
 Subject: Sequence File Question
 
 Hey guys,
 I have a mapreduce job that sets up a directory for pagerank. It
 iterates
 over all the segments and then outputs a MapFile containing the data.
 When I
 go to open the outputted directory with another MapReduce job it fails
 saying that it cannot find the path. The path that it thinks it is
 trying to
 open does not include the part-0 directory. Both my directory (and
 all
 other directories for that matter) have the same structure which is
 /path/part-0/whatever. I feel like this is a really stupid error
 and I
 have forgotten something that is easily fixed. Any ideas?
 
 Steve



Image Search Engine Input

2007-03-26 Thread Steve Severance
Hey all,
I am working on the basics of an image search engine. I want to ask for
feedback on something.

Should I create a new directory in a segment, parse_image, and then put the images there? If not, where should I put them, in the parse_text? I created a
class ImageWritable just like the Jira task said. This class contains image
meta data as well as two BytesWritable for the original image and the
thumbnail.
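
For reference, a rough sketch of what such an ImageWritable could look like; the field names here are mine, not from the Jira issue:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Sketch of an image record: a little metadata plus the original bytes and a thumbnail.
public class ImageWritable implements Writable {
  private Text mimeType = new Text();
  private int width, height;
  private BytesWritable original = new BytesWritable();
  private BytesWritable thumbnail = new BytesWritable();

  public void write(DataOutput out) throws IOException {
    mimeType.write(out);
    out.writeInt(width);
    out.writeInt(height);
    original.write(out);
    thumbnail.write(out);
  }

  public void readFields(DataInput in) throws IOException {
    mimeType.readFields(in);
    width = in.readInt();
    height = in.readInt();
    original.readFields(in);
    thumbnail.readFields(in);
  }
}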

One more question, what ramifications does that have for the type of Parse
that I am returning? Do I need to create a ParseImage class to hold it? The
actual parsing infrastructure is something that I am still studying so any
ideas here would be great. Thanks,

Steve



RE: Image Search Engine Input

2007-03-26 Thread Steve Severance
So now that I have spent a few hours looking into how this works a lot more deeply, I am in even more of a conundrum. The fetcher passes the contents of the
page to the parsers. It assumes that text will be output from the parsers.
For instance even the SWF parser returns text. For all binary data, images,
videos, music, etc... this is problematic. Potentially confounding the
problem even further in the case of music is that text and binary data can
come from the same file. Even if that is a problem I am not going to tackle
it. 

So there are 3 choices for moving forward with an image search,

1. All image data can be encoded as strings. I really don't like that choice
since the indexer will index huge amounts of junk.
2. The fetcher can be modified to allow another output for binary data. This
I think is the better choice although it will be a lot more work. I am not
sure that this is possible with MapReduce since MapRunnable has only 1
output.
3. Images can be written into another directory for processing. This would need more work to automate but is probably a non-issue (see the sketch below).

I want to do the right thing so that the image search can eventually be in
the trunk. I don't want to have to change the way a lot of things work in
the process. Let me know what you all think.
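
For what it's worth, option 3 could be little more than an extra MapFile under each segment. A rough sketch, where the parse_image directory name is hypothetical and ImageWritable is the class sketched earlier (keys must be appended in sorted order, as for any MapFile):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

// Sketch of option 3: write images into segment/parse_image as a MapFile keyed by URL.
public class ImageStore {
  public static MapFile.Writer open(Configuration conf, Path segment) throws Exception {
    FileSystem fs = segment.getFileSystem(conf);
    Path dir = new Path(segment, "parse_image");   // hypothetical directory name
    return new MapFile.Writer(conf, fs, dir.toString(), Text.class, ImageWritable.class);
  }
}
// usage:
//   MapFile.Writer images = ImageStore.open(conf, segmentPath);
//   images.append(new Text(url), imageWritable);   // URLs appended in sorted order
//   images.close();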

Steve

 -Original Message-
 From: Steve Severance [mailto:[EMAIL PROTECTED]
 Sent: Monday, March 26, 2007 4:04 PM
 To: nutch-dev@lucene.apache.org
 Subject: Image Search Engine Input
 
 Hey all,
 I am working on the basics of an image search engine. I want to ask for
 feedback on something.
 
 Should I create a new directory in a segment parse_image? And then put
 the
 images there? If not where should I put them, in the parse_text? I
 created a
 class ImageWritable just like the Jira task said. This class contains
 image
 meta data as well as two BytesWritable for the original image and the
 thumbnail.
 
 One more question, what ramifications does that have for the type of
 Parse
 that I am returning? Do I need to create a ParseImage class to hold it?
 The
 actual parsing infrastructure is something that I am still studying so
 any
 ideas here would be great. Thanks,
 
 Steve



Breaking change in webapp?

2007-03-23 Thread Steve Severance
Hey,
I have an index that I am trying to search using the webapp. I am
using the current trunk. When I run a search I get the following message,

HTTP Status 404 - no segments* file found: files:

type Status report

message no segments* file found: files:

description The requested resource (no segments* file found: files:) is not
available.
Apache Tomcat/5.5.20

The message I think refers to a Lucene issue:

http://www.mail-archive.com/java-dev@lucene.apache.org/msg09044.html

I am not really sure what to do to fix it. I have not really delved into
lucene itself yet. Is there a different directory structure that I need to
have for the index now?

BTW, my searcher.dir is G:\NutchDeployment\crawl which is the same thing
that I used for 0.8.1. Any ideas? Thanks,

Steve



indexing with current trunk

2007-03-22 Thread Steve Severance
Hi,
I updated my test system to the 0.9 dev trunk current as of yesterday. Now
indexing does not work. I tried purging the linkdb and recreating it. I
tried just running it on a single segment to see if I could find the error.
Here is the output:

$ bin/nutch index crawl/index crawl/crawldb/ crawl/linkdb/
crawl/segments/20070
307113353/
Indexer: starting
Indexer: linkdb: crawl/linkdb
Indexer: adding segment: crawl/segments/20070307113353
Optimizing index.
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:402)
at org.apache.nutch.indexer.Indexer.index(Indexer.java:273)
at org.apache.nutch.indexer.Indexer.run(Indexer.java:295)
at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
at org.apache.nutch.indexer.Indexer.main(Indexer.java:278)

Other things are working fine. Just not indexing.

Steve



RE: indexing with current trunk

2007-03-22 Thread Steve Severance
Here is the log.

2007-03-22 15:45:39,851 WARN  mapred.LocalJobRunner - job_pyll84
java.lang.NoSuchMethodError: org.apache.lucene.document.Document.add(Lorg/apache/lucene/document/Fieldable;)V
at org.apache.nutch.indexer.basic.BasicIndexingFilter.filter(BasicIndexingFilter.java:62)
at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:110)
at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:215)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:317)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:137)
2007-03-22 15:45:40,043 FATAL indexer.Indexer - Indexer:
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:402)
at org.apache.nutch.indexer.Indexer.index(Indexer.java:273)
at org.apache.nutch.indexer.Indexer.run(Indexer.java:295)
at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
at org.apache.nutch.indexer.Indexer.main(Indexer.java:278)

Steve
 -Original Message-
 From: Sami Siren [mailto:[EMAIL PROTECTED]
 Sent: Thursday, March 22, 2007 4:03 PM
 To: nutch-dev@lucene.apache.org
 Subject: Re: indexing with current trunk
 
 Steve Severance wrote:
  Hi,
  I updated my test system to the 0.9 dev trunk current as of
 yesterday. Now
  indexing does not work. I tried purging the linkdb and recreating it.
 I
  tried just running it on a single segment to see if I could find the
 error.
  Here is the output:
 
  $ bin/nutch index crawl/index crawl/crawldb/ crawl/linkdb/
  crawl/segments/20070
  307113353/
  Indexer: starting
  Indexer: linkdb: crawl/linkdb
  Indexer: adding segment: crawl/segments/20070307113353
  Optimizing index.
  Indexer: java.io.IOException: Job failed!
  at
 org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:402)
  at org.apache.nutch.indexer.Indexer.index(Indexer.java:273)
  at org.apache.nutch.indexer.Indexer.run(Indexer.java:295)
  at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
  at org.apache.nutch.indexer.Indexer.main(Indexer.java:278)
 
  Other things are working fine. Just not indexing.
 
 Can you please check the log files for more specific error message(s).
 Indexing works ok for me but I have only tried it with small segments
 so
 far.
 
 --
  Sami Siren



RE: indexing with current trunk

2007-03-22 Thread Steve Severance
 -Original Message-
 From: Sami Siren [mailto:[EMAIL PROTECTED]
 Sent: Thursday, March 22, 2007 4:27 PM
 To: nutch-dev@lucene.apache.org
 Subject: Re: indexing with current trunk
 
 Are you running on localrunner or distributed mode, is distributed then
 check that the lucene version in task tracker class path is correct.

I am using a localrunner. I have lucene 2.0 and 2.1 in my lib dir for nutch.

Steve

 
 --
  Sami Siren
 
 Steve Severance wrote:
  Here is the log.
 
  2007-03-22 15:45:39,851 WARN  mapred.LocalJobRunner - job_pyll84
  java.lang.NoSuchMethodError: org.apache.lucene.document.Document.add(Lorg/apache/lucene/document/Fieldable;)V
  at org.apache.nutch.indexer.basic.BasicIndexingFilter.filter(BasicIndexingFilter.java:62)
  at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:110)
  at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:215)
  at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:317)
  at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:137)
  2007-03-22 15:45:40,043 FATAL indexer.Indexer - Indexer:
  java.io.IOException: Job failed!
  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:402)
  at org.apache.nutch.indexer.Indexer.index(Indexer.java:273)
  at org.apache.nutch.indexer.Indexer.run(Indexer.java:295)
  at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
  at org.apache.nutch.indexer.Indexer.main(Indexer.java:278)
 
  Steve
  -Original Message-
  From: Sami Siren [mailto:[EMAIL PROTECTED]
  Sent: Thursday, March 22, 2007 4:03 PM
  To: nutch-dev@lucene.apache.org
  Subject: Re: indexing with current trunk
 
  Steve Severance wrote:
  Hi,
  I updated my test system to the 0.9 dev trunk current as of
  yesterday. Now
  indexing does not work. I tried purging the linkdb and recreating
 it.
  I
  tried just running it on a single segment to see if I could find
 the
  error.
  Here is the output:
 
  $ bin/nutch index crawl/index crawl/crawldb/ crawl/linkdb/
  crawl/segments/20070
  307113353/
  Indexer: starting
  Indexer: linkdb: crawl/linkdb
  Indexer: adding segment: crawl/segments/20070307113353
  Optimizing index.
  Indexer: java.io.IOException: Job failed!
  at
  org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:402)
  at org.apache.nutch.indexer.Indexer.index(Indexer.java:273)
  at org.apache.nutch.indexer.Indexer.run(Indexer.java:295)
  at
 org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
  at org.apache.nutch.indexer.Indexer.main(Indexer.java:278)
 
  Other things are working fine. Just not indexing.
  Can you please check the log files for more specific error
 message(s).
  Indexing works ok for me but I have only tried it with small
 segments
  so
  far.
 
  --
   Sami Siren
 
 



[jira] Closed: (NUTCH-462) Noarchive urls are available via the cache link

2007-03-20 Thread Steve Severance (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Severance closed NUTCH-462.
-

Resolution: Fixed

Duplicate; see NUTCH-167. Has been fixed.

 Noarchive urls are available via the cache link
 ---

 Key: NUTCH-462
 URL: https://issues.apache.org/jira/browse/NUTCH-462
 Project: Nutch
  Issue Type: Bug
  Components: web gui
Reporter: Steve Severance
 Fix For: 0.8.1


 If a robots.txt file specifies a Noarchive statement then urls that are 
 contained as part of that path should not be available via the cached link.
 For example Noarchive:/ means that no pages should be available via the 
 cached link.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Multi-pass algorithms

2007-03-20 Thread Steve Severance
If I want to have an algorithm that runs over the same data multiple times
(it is an iterative algorithm) is there a way to have my MapReduce job use
the same directory for both input and output? Or do I need to make a temp
directory for each iteration?

Steve
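
For reference, the pattern the existing Nutch code follows (e.g. CrawlDb.install()) is a fresh temporary output directory per pass followed by renames, rather than writing back into the input directory. A rough sketch of that approach, where runOnePass() stands in for a hypothetical per-iteration job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: run an iterative job over the same logical data, one temp directory per pass.
public class IterativeDriver {

  public static void iterate(Configuration conf, Path data, int passes) throws Exception {
    FileSystem fs = data.getFileSystem(conf);
    for (int i = 0; i < passes; i++) {
      Path tmp = new Path(data.getParent(), data.getName() + "-" + System.currentTimeMillis());
      runOnePass(conf, data, tmp);                 // input = data, output = tmp
      Path old = new Path(data.getParent(), data.getName() + ".old");
      if (fs.exists(old)) fs.delete(old);
      fs.rename(data, old);                        // keep the previous pass around briefly
      fs.rename(tmp, data);                        // promote the new output
      fs.delete(old);
    }
  }

  private static void runOnePass(Configuration conf, Path in, Path out) {
    // hypothetical: configure a JobConf with 'in' as input and 'out' as output,
    // then call JobClient.runJob(...)
  }
}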



Launching custom classes

2007-03-19 Thread Steve Severance
Hi all,
I have a custom class in the nutch jar. Everything works fine in eclipse but
when I try to run it from the command line using bin/nutch it throws the
java.lang.NoClassDefFoundError. All the pages on the internet helpfully
suggested that I make sure that the jar is in the classpath. I think that
everything is correct since I can invoke any of the nutch classes via its
class name e.g. bin/nutch org.apache.nutch.crawl.Crawl. This may be a simple
Java problem but I have been banging my head against this all weekend.

Thanks,

Steve



RE: Launching custom classes

2007-03-19 Thread Steve Severance
 -Original Message-
 From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
 Sent: Monday, March 19, 2007 10:18 AM
 To: nutch-dev@lucene.apache.org
 Subject: Re: Launching custom classes
 
 Steve Severance wrote:
  Hi all,
  I have a custom class in the nutch jar. Everything works fine in
 eclipse but
  when I try to run it from the command line using bin/nutch it throws
 the
  java.lang.NoClassDefFoundError. All the pages on the internet
 helpfully
  suggested that I make sure that the jar is in the classpath. I think
 that
 
 
 What needs to be on your classpath is the *.job jar. The bin/nutch
  script takes care of that if you built your Nutch using the command-line version of ant.

Ok. Thanks. Two more things. I have two directories for nutch: one is synchronized with SVN and the other is my working directory. If I run the ant package command in my working directory, ant says:

BUILD FAILED
g:\NutchInstance\build.xml:61: Specify at least one source--a file or resource collection.

Total time: 0 seconds

If I copy my source folder into the trunk dir for the directory that is synced with SVN, my class does not get added. I have been studying the build.xml file and I see the plugin generation jobs, but my reasoning is that since my package name is under org.apache.nutch, my package should be compiled into the core. Is this correct? Do I need to make a separate build job for my class or something like that?

Second, how do people generally set up their development machines? Do you use Eclipse, and if so do you just work off of the trunk? What is the recommendation for source control in this situation? Is there a way to make a
subversion repository for me so that I can add my own code but also receive 
updates from the trunk? Using an open source project like this seems to add 
some complexity to the source control process. But I am sure this problem has 
already been worked out.

Regards,

Steve

 
 --
 Best regards,
 Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com




[jira] Created: (NUTCH-462) Noarchive urls are available via the cache link

2007-03-19 Thread Steve Severance (JIRA)
Noarchive urls are available via the cache link
---

 Key: NUTCH-462
 URL: https://issues.apache.org/jira/browse/NUTCH-462
 Project: Nutch
  Issue Type: Bug
  Components: web gui
Reporter: Steve Severance
 Fix For: 0.8.1


If a robots.txt file specifies a Noarchive statement then urls that are 
contained as part of that path should not be available via the cached link.

For example Noarchive:/ means that no pages should be available via the cached 
link.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



RE: Indexing the Interesting Part Only...

2007-03-10 Thread Steve Severance
I think if anyone here had the perfect answer for that one they would have
sold it to Google, Microsoft, or Yahoo for a ton of money. You will need an
algorithm that can detect ads. I have not written ad filters since my search
engine is currently using a domain whitelist. I can tell you that a whole
web crawl will definitely need it since it can cut down on pages in the index
by 10-20%. If you do a whole web crawl you will also need spam detection.

I would recommend looking for some academic papers on the topic. Maybe use
CiteSeer or something like that.

Steve
-Original Message-
From: d e [mailto:[EMAIL PROTECTED] 
Sent: Saturday, March 10, 2007 3:07 PM
To: nutch-dev@lucene.apache.org
Subject: Re: Indexing the Interesting Part Only...

We plan to index many websites. Got any suggestions on how to drop the junk
without having to do too much work for each such site? Know anyone who has a
background on doing this sort of thing? What sorts of approaches would you
recommend?

Are there existing plug ins I should consider using?


On 3/9/07, J. Delgado [EMAIL PROTECTED] wrote:

 You have to build a special HTML Junk parser.

 2007/3/9, d e [EMAIL PROTECTED]:
 
  If I'm indexing a news article, I want to avoid getting the junk (other than the title, author and article) into the index. I want to avoid getting the advertisements, etc. How do I do that sort of thing?
 
  What parts of what manual should I be reading so I will know how to do
  this
  sort of thing.
 




RE: How to read data from segments

2007-03-09 Thread Steve Severance
 -Original Message-
 From: Dennis Kubes [mailto:[EMAIL PROTECTED]
 Sent: Friday, March 09, 2007 9:47 AM
 To: nutch-dev@lucene.apache.org
 Subject: Re: How to read data from segments
 
 
 
 Steve Severance wrote:
  I am trying to learn the internals of Nutch and by extension Hadoop
 right
  now. I am implementing an algorithm that processes link and content
 data. I
  am stuck on how to open the ParseDatas contained in the segments.
 Each
   subdir of a segment (crawl_generate, etc...) contains a subdir part-0, which, if I understand correctly, means that if I had more computers as part of a hadoop cluster there would also be part-1 and so on.
 
 There is one directory for each split.  One interesting thing to note
 is
 that multiple writers (i.e. map and reduce tasks) can't write to the
 same file on the DFS at the same time.  So each reduce task writes out
 its own split to its own directory.

Does this mean that there might be some parts that are Map outputs and
others that are Reduce outputs? 

  When I try to open them with an ArrayFile.Reader it cannot find the
 file. I
  know that the Path class is working properly since it can enumerate
 sub
   directories. I tried hard coding the part-0 into the path but that did
  not work either.
 
  The code is as follows:
 
   Path segmentDir = new Path(args[0]);
   Path pageRankDir = new Path(args[1]);
 
   Path segmentPath = new Path(segmentDir, "parse_data/part-0");
   ArrayFile.Reader parses = null;
   try {
     parses = new ArrayFile.Reader(segmentPath.getFileSystem(config), segmentPath.toString(), config);
   } catch (IOException ex) {
     System.out.println("An error occurred while opening the segment. Message: " + ex.getMessage());
   }
 
  The exception reports that it cannot open the file. I also tried
 merging the
  segments but that did not work either. Any help would be greatly
  appreciated.
 
 Just like Andrzej said.  It is in the outputformats and they have
 getReaders and getEntry methods.  I have a little tool that is a
 MapFileExplorer, if you want it let me know and I will send you a copy.

Yes, that would be great if you are willing to share it. I was already
thinking about writing something similar.

 
  One more thing. As a new nutch developer I am keeping a running list
 of
  problems/questions that I have and their solutions. A lot of
 questions arise
  from not understanding how to work with the internals, specifically
  understanding the building blocks of Hadoop such as filetypes and why
 there
  are custom types that Hadoop uses, e.g. why Text instead of String. I
  noticed that in a mailing list post earlier this year the lack of
 detailed
  information for new developers was cited as a barrier to more
 involvement. I
  would be happy to contribute this back to the wiki if there is
 interest.
 
 Absolutely.  The more documentation we have, especially for new
 developers, the better.  If you need any questions answered in doing
 this, give me a shout and I will help as much as I can.

What is the best way to proceed with this? Should I make a new wiki page?
Here is what I am thinking:
Have an overview of Nutch and Hadoop. This will include code samples of
basic tasks like getting data. And by overview I mean a detailed overview so
that someone without distributed computing or search experience will be able
to understand. It will not include IR basics as those are fairly well
documented elsewhere. The Hadoop one might want to live on its own wiki. I
also am going to write up my implementation of PageRank as a tutorial since
it will cover I think a lot of Hadoop and Nutch basics, including Hadoop
types, using Hadoop files and MapReduce.

 
 Dennis Kubes
 
  Regards,
 
  Steve
 

Steve
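
As a taste of the kind of minimal example such a tutorial could open with, here is a sketch of a job that simply copies a crawldb through map-reduce using the old JobConf API; the CopyJob name is made up, and the Text/CrawlDatum key and value classes assume a 0.9-era crawldb:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapFileOutputFormat;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.util.NutchConfiguration;

// Sketch: the smallest useful Nutch/Hadoop job - copy crawldb/current to a new MapFile output.
public class CopyJob {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(NutchConfiguration.create(), CopyJob.class);
    job.setJobName("copy " + args[0]);

    job.addInputPath(new Path(args[0], "current"));   // args[0] = crawldb directory
    job.setInputFormat(SequenceFileInputFormat.class);

    job.setMapperClass(IdentityMapper.class);
    job.setReducerClass(IdentityReducer.class);

    job.setOutputPath(new Path(args[1]));             // args[1] = output directory
    job.setOutputFormat(MapFileOutputFormat.class);
    job.setOutputKeyClass(Text.class);                // URL keys
    job.setOutputValueClass(CrawlDatum.class);

    JobClient.runJob(job);
  }
}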



RE: Indexing the Interesting Part Only...

2007-03-09 Thread Steve Severance
This is a Natural Language Processing problem, although you can certainly
take hints from URL graph structures and host block lists. Nutch does not
support this natively (that I know of) but you can certainly extend Nutch to
be able to recognize and filter ads. Start by looking at how to develop
plugins and also look at the indexing plugin.

Regards,

Steve

 -Original Message-
 From: d e [mailto:[EMAIL PROTECTED]
 Sent: Friday, March 09, 2007 6:49 PM
 To: nutch-dev@lucene.apache.org
 Subject: Indexing the Interesting Part Only...
 
 If I'm indexing a news article, I want to avoid getting the junk (other
 than
 the title, auther and article) into the index. I want to avoid getting
 the
 advertizments, etc. How do I do that sort of thing?
 
 What parts of what manual should I be reading so I will know how to do
 this
 sort of thing.



How to read data from segments

2007-03-08 Thread Steve Severance
I am trying to learn the internals of Nutch and by extension Hadoop right
now. I am implementing an algorithm that processes link and content data. I
am stuck on how to open the ParseDatas contained in the segments. Each
subdir of a segment (crawl_generate, etc...) contains a subdir part-0, which, if I understand correctly, means that if I had more computers as part of a hadoop cluster there would also be part-1 and so on.

When I try to open them with an ArrayFile.Reader it cannot find the file. I
know that the Path class is working properly since it can enumerate sub
directories. I tried hard coding the part-0 into the path but that did
not work either. 

The code is as follows:

Path segmentDir = new Path(args[0]);
Path pageRankDir = new Path(args[1]);

Path segmentPath = new Path(segmentDir, "parse_data/part-0");
ArrayFile.Reader parses = null;
try {
  parses = new ArrayFile.Reader(segmentPath.getFileSystem(config), segmentPath.toString(), config);
} catch (IOException ex) {
  System.out.println("An error occurred while opening the segment. Message: " + ex.getMessage());
}

The exception reports that it cannot open the file. I also tried merging the
segments but that did not work either. Any help would be greatly
appreciated.

One more thing. As a new nutch developer I am keeping a running list of
problems/questions that I have and their solutions. A lot of questions arise
from not understanding how to work with the internals, specifically
understanding the building blocks of Hadoop such as filetypes and why there
are custom types that Hadoop uses, e.g. why Text instead of String. I
noticed that in a mailing list post earlier this year the lack of detailed
information for new developers was cited as a barrier to more involvement. I
would be happy to contribute this back to the wiki if there is interest.

Regards,

Steve



RE: How to read data from segments

2007-03-08 Thread Steve Severance
Hi Andrzej,
Thanks for the reply. I have a couple more questions that I am not 
quite sure about. MapFile.Reader[] represents the individual readers for each 
piece of a MapFile such that part-0, part-1 are each represented by a 
reader? In that case is the correct path to the segment something like 
crawl/segments/some segment and that is the path that I should pass? 
Currently it is returning 0 readers. 

Also generally on PageRank, I implemented a version in .net on mapreduce for 
another project that I was working on. However that was at my last job and I 
have started a new company that is developing a vertical search on 
nutch/hadoop. My basic idea of how to implement PageRank for nutch is as follows:

Step 1: Build basic data
I have created a PageRankDatum class to hold the information that 
PageRank requires for its computation. PageRankDatum contains the PageRank 
value and the number of outbound links. This would enable the key/value pair to be (Url, PageRankDatum).

Step 2: Compute the ranks
Collect the resulting ranks to the output and write them out. Reduce 
would in effect be an Identity function I think. With this step we need to look 
up the inbound links for a Url and then how many other outbound links each link 
has. That was the purpose of storing the outbound link count in addition to the 
page rank. If I have a Hadoop cluster (currently I am running this on my dev 
machine, more machines on the way for testing) is the linkDB accessible from 
all nodes? I am thinking that the PageRankDb will work basically the same way. 
After step 1 write it out so that it will be accessible. Also several papers 
have shown that in parallel computation of PageRank, being able to look up the ranks that have been computed on other nodes can lead to faster convergence. 
Is this possible in the map reduce model?

Regards,

Steve
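
A rough sketch of the PageRankDatum described above; the field and accessor names are mine:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Sketch of PageRankDatum: current score plus outlink count.
public class PageRankDatum implements Writable {
  private float score = 1.0f;      // current PageRank value
  private int outlinkCount = 0;    // number of outbound links, used to split the score

  public float getScore() { return score; }
  public int getOutlinkCount() { return outlinkCount; }
  public float getShare() {        // score contributed to each outlink
    return outlinkCount == 0 ? 0.0f : score / outlinkCount;
  }

  public void write(DataOutput out) throws IOException {
    out.writeFloat(score);
    out.writeInt(outlinkCount);
  }

  public void readFields(DataInput in) throws IOException {
    score = in.readFloat();
    outlinkCount = in.readInt();
  }
}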
-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] 
Sent: Thursday, March 08, 2007 4:43 PM
To: nutch-dev@lucene.apache.org
Subject: Re: How to read data from segments

Steve Severance wrote:
 I am trying to learn the internals of Nutch and by extension Hadoop right
 now. I am implementing an algorithm that processes link and content data. I
 am stuck on how to open the ParseDatas contained in the segments. Each
 subdir of a segment (crawl_generate, etc...) contains a subdir part-0, which, if I understand correctly, means that if I had more computers as part of a hadoop cluster there would also be part-1 and so on.
   

Correct.

 When I try to open them with an ArrayFile.Reader it cannot find the file. I
 know that the Path class is working properly since it can enumerate sub
 directories. I tried hard coding the part-0 into the path but that did
 not work either. 

 The code is as follows:

 Path segmentDir = new Path(args[0]);
 Path pageRankDir = new Path(args[1]);
   

Ah-ha, pageRankDir .. ;)

   
 Path segmentPath = new Path(segmentDir, "parse_data/part-0");
   

Please take a look at the class MapFileOutputFormat and 
SequenceFileOutputFormat. Both support this nested dir structure which 
is a by-product of producing the data via map-reduce, and offer methods 
for getting MapFile.Reader[] or SequenceFile.Reader[], and then getting 
a selected entry.

Cf. also the code attached to HADOOP-175 issue in JIRA.
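
A small sketch of the single-entry lookup mentioned above, using getEntry() together with the same partitioner the writing job used (exact signatures are from memory, so treat them as assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.MapFileOutputFormat;
import org.apache.hadoop.mapred.lib.HashPartitioner;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.util.NutchConfiguration;

// Sketch: look up the ParseData of a single URL across all part-NNNNN files of a segment.
public class ParseDataLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    Path parseData = new Path(args[0], "parse_data");   // args[0] = segment, args[1] = URL
    FileSystem fs = parseData.getFileSystem(conf);

    MapFile.Reader[] readers = MapFileOutputFormat.getReaders(fs, parseData, conf);
    Text url = new Text(args[1]);
    ParseData value = new ParseData();
    // The partitioner decides which part-file holds the key; it must match the writing job.
    Writable hit = MapFileOutputFormat.getEntry(readers, new HashPartitioner(), url, value);
    System.out.println(hit == null ? "not found" : value.toString());
  }
}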


 One more thing. As a new nutch developer I am keeping a running list of
 problems/questions that I have and their solutions. A lot of questions arise
 from not understanding how to work with the internals, specifically
 understanding the building blocks of Hadoop such as filetypes and why there
 are custom types that Hadoop uses, e.g. why Text instead of String. I
 noticed that in a mailing list post earlier this year the lack of detailed
 information for new developers was cited as a barrier to more involvement. I
 would be happy to contribute this back to the wiki if there is interest.
   

Definitely, you are welcome to contribute in this area - this is always 
needed. Although this particular information might be more suitable for 
the Hadoop wiki ...

-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




RE: 0.9 release

2007-03-07 Thread Steve Severance
Also one thing that comes to my mind as I have been struggling with it,
there is no upgrade path that I know of from 0.8.x to 0.9.0. I followed the
directions in the wiki and that did not work. I later found in a mailing
list post that everything needs to be regenerated. There needs to be some
guidance on whether a 0.8.x upgrade is possible and, if it is, how to do it.

Regards,

Steve


iVirtuoso, Inc
Steve Severance
Partner, Chief Technology Officer
[EMAIL PROTECTED]
mobile: (240) 472 - 9645


-Original Message-
From: Sami Siren [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, March 07, 2007 2:10 PM
To: nutch-dev@lucene.apache.org
Subject: Re: 0.9 release

 2. Any outstanding things that need to get done that aren't really code
that
 needs to get committed, e.g., things we need to close the loop on

One thing that comes to my mind is the web site: we have tutorials specifically
for 0.7.x and 0.8.x, and it might be confusing for users if we left
it as is and released 0.9.0.

--
 Sami Siren



[jira] Commented: (NUTCH-296) Image Search

2007-03-07 Thread Steve Severance (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478920
 ] 

Steve Severance commented on NUTCH-296:
---

I know the committers are hard at work on the 0.9.0 release but I have begun to 
work on the first piece of this, the parser. I am looking for guidance as to 
how the images and thumbnails should be stored. One file per image is probably 
too inefficient. Are there existing file formats that the community would like 
to use?

I am building a parser that can handle most image types. Should I break them 
out into individual plugins so there is one per file type? e.g. jpg will have 
an extension, gif will have a separate extension etc... This may be more 
flexible in the long run. This is the first project that I am undertaking on 
the nutch codebase so any guidance would be great.

Steve

 Image Search
 

 Key: NUTCH-296
 URL: https://issues.apache.org/jira/browse/NUTCH-296
 Project: Nutch
  Issue Type: New Feature
Reporter: Thomas Delnoij
Priority: Minor

 Per the discussion in the Nutch-User mailing list, there is a wish for an 
 Image Search add-on component that will index images.
 Must have:
 - retrieve outlinks to image files from fetched pages
 - generate thumbnails from images
 - thumbnails are stored in the segments as ImageWritable that contains the 
 compressed binary data and some meta data 
 Should have:
 - implemented as hadoop map reduce job
 - should be seperate from main Nutch codeline as it breaks general Nutch 
 logic of one url == one index document.
 Could have:
 - store the original image in the segments
 Would like to have:
 - search interface for image index
 - parameterizable thumbnail generation (width, height, quality)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



RE: 0.9 release

2007-03-07 Thread Steve Severance
I have gotten this working. A little bit of tweaking was involved but
everything works fine now.

Steve
-Original Message-
From: Steve Severance [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, March 07, 2007 2:19 PM
To: nutch-dev@lucene.apache.org
Subject: RE: 0.9 release

Also one thing that comes to my mind as I have been struggling with it,
there is no upgrade path that I know of from 0.8.x to 0.9.0. I followed the
directions in the wiki and that did not work. I later found in a mailing
list post that everything needs to be regenerated. There needs to be some
guidance on whether a 0.8.x upgrade is possible and, if it is, how to do it.

Regards,

Steve


iVirtuoso, Inc
Steve Severance
Partner, Chief Technology Officer
[EMAIL PROTECTED]
mobile: (240) 472 - 9645


-Original Message-
From: Sami Siren [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, March 07, 2007 2:10 PM
To: nutch-dev@lucene.apache.org
Subject: Re: 0.9 release

 2. Any outstanding things that need to get done that aren't really code
that
 needs to get committed, e.g., things we need to close the loop on

One thing that comes to my mind is the web site: we have tutorials specifically
for 0.7.x and 0.8.x, and it might be confusing for users if we left
it as is and released 0.9.0.

--
 Sami Siren



[jira] Created: (NUTCH-453) Move stop words to a config file

2007-03-01 Thread Steve Severance (JIRA)
Move stop words to a config file


 Key: NUTCH-453
 URL: https://issues.apache.org/jira/browse/NUTCH-453
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, searcher
Reporter: Steve Severance
Priority: Minor


Move the stop words from the code to a config file. This will allow the stop 
words to be modified without recompiling the code. The format could be the same 
as the regex-urlfilter, where regexes are used to define the words, or a plain 
text file of words could be used. 
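
For the plain-text variant, a rough sketch of loading the file through the Hadoop Configuration resource loader; the property name, file name, and the getConfResourceAsReader() call are assumptions to verify:

import java.io.BufferedReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;

// Sketch: load stop words from a plain-text file on the classpath / conf directory.
public class StopWords {
  public static Set loadStopWords(Configuration conf) throws IOException {
    Set words = new HashSet();
    // hypothetical property, e.g. analysis.stopwords.file = stopwords.txt in nutch-default.xml
    String file = conf.get("analysis.stopwords.file", "stopwords.txt");
    BufferedReader in = new BufferedReader(conf.getConfResourceAsReader(file));
    String line;
    while ((line = in.readLine()) != null) {
      line = line.trim();
      if (line.length() > 0 && !line.startsWith("#")) {   // skip blanks and comments
        words.add(line.toLowerCase());
      }
    }
    in.close();
    return words;
  }
}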

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-224) Nutch doesn't handle Korean text at all

2007-03-01 Thread Steve Severance (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12477190
 ] 

Steve Severance commented on NUTCH-224:
---

The PDF Parser for 0.8.1 also fails on Korean text.

Steve

 Nutch doesn't handle Korean text at all
 ---

 Key: NUTCH-224
 URL: https://issues.apache.org/jira/browse/NUTCH-224
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 0.7.1
Reporter: KuroSaka TeruHiko

 I was browsing NutchAnalysis.jj and found that
 Hangul Syllables (U+AC00 ... U+D7AF; U+ means
 a Unicode character of the hex value ) are not
 part of LETTER or CJK class.  This seems to me that
 Nutch cannot handle Korean documents at all.
 I posted the above message at nutch-user ML and Cheolgoo Kang [EMAIL 
 PROTECTED]
 replied as:
 
 There was similar issue with Lucene's StandardTokenizer.jj.
 http://issues.apache.org/jira/browse/LUCENE-444
 and
 http://issues.apache.org/jira/browse/LUCENE-461
 I have almost no experience with Nutch, but you can handle it like
 those issues above.
 
 Both fixes should probably be ported back to NutchAnalysis.jj.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



RE: Performance optimization for Nutch index / query

2007-02-22 Thread Steve Severance
Hi,

I would like to comment if I might. I am not a Nutch/Lucene hacker yet. I have 
been working with it for only a few weeks. However I am looking at extending it 
significantly to add some new features. Now some of these will require 
extending Lucene as well. First I have a test implementation of PageRank that 
is really an approximation that runs on top of map reduce. Are people interested 
in having this in the index? I am interested in how this and other meta data 
might interact with your super field. For instance I am also looking at using 
relevance feedback and having that as one of the criteria for ranking 
documents. I was also considering using an outside data source, possibly even 
another Lucene index to store these values on a per document basis. 

The other major feature I am thinking about is using distance between words and 
text type. Do you know of anyone who has done this?

Regards,

Steve


iVirtuoso, Inc
Steve Severance
Partner, Chief Technology Officer
[EMAIL PROTECTED]
mobile: (240) 472 - 9645


-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] 
Sent: Thursday, February 22, 2007 7:44 PM
To: nutch-dev@lucene.apache.org
Subject: Performance optimization for Nutch index / query

Hi all,

This very long post is meant to initiate a discussion. There is no code 
yet. Be warned that it discusses low-level Nutch/Lucene stuff.

Nutch queries are currently translated into complex Lucene queries. This 
is necessary in order to take into account score factors coming from 
various document parts, such as URL, host, title, content, and anchors.

Typically, the translation provided by query-basic looks like this for 
single term queries:

(1)
Query: term1
Parsed: term1
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 
title:term1^1.5 host:term1^2.0)

For queries consisting of two or more terms it looks like this (Nutch 
uses implicit AND):

(2)
Query: term1 term2
Parsed: term1 term2
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 
title:term1^1.5 host:term1^2.0) +(url:term2^4.0 anchor:term2^2.0 
content:term2 title:term2^1.5 host:term2^2.0) url:"term1 term2"~2147483647^4.0 
anchor:"term1 term2"~4^2.0 content:"term1 term2"~2147483647 
title:"term1 term2"~2147483647^1.5 host:"term1 term2"~2147483647^2.0

By the way, please note the absurd default slop value - in case of 
anchors it defeats the purpose of having the ANCHOR_GAP ...

Let's list other common query types:

(3)
Query: term1 term2 term3
Parsed: term1 term2 term3
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 
title:term1^1.5 host:term1^2.0) +(url:term2^4.0 anchor:term2^2.0 
content:term2 title:term2^1.5 host:term2^2.0) +(url:term3^4.0 
anchor:term3^2.0 content:term3 title:term3^1.5 host:term3^2.0) 
url:"term1 term2 term3"~2147483647^4.0 anchor:"term1 term2 term3"~4^2.0 
content:"term1 term2 term3"~2147483647 title:"term1 term2 term3"~2147483647^1.5 
host:"term1 term2 term3"~2147483647^2.0

For phrase queries it looks like this:

(4)
Query: "term1 term2"
Parsed: "term1 term2"
Translated: +(url:"term1 term2"^4.0 anchor:"term1 term2"^2.0 
content:"term1 term2" title:"term1 term2"^1.5 host:"term1 term2"^2.0)

For mixed term and phrase queries it looks like this:

(5)
Query: term1 "term2 term3"
Parsed: term1 "term2 term3"
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 
title:term1^1.5 host:term1^2.0) +(url:"term2 term3"^4.0 anchor:"term2 term3"^2.0 
content:"term2 term3" title:"term2 term3"^1.5 host:"term2 term3"^2.0)

For queries with NOT operator it looks like this:

(6)
Query: term1 -term2
Parsed: term1 -term2
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 
title:term1^1.5 host:term1^2.0) -(url:term2^4.0 anchor:term2^2.0 
content:term2 title:term2^1.5 host:term2^2.0)

(7)
Query: term1 term2 -term3
Parsed: term1 term2 -term3
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 
title:term1^1.5 host:term1^2.0) +(url:term2^4.0 anchor:term2^2.0 
content:term2 title:term2^1.5 host:term2^2.0) -(url:term3^4.0 
anchor:term3^2.0 content:term3 title:term3^1.5 host:term3^2.0) 
url:"term1 term2"~2147483647^4.0 anchor:"term1 term2"~4^2.0 
content:"term1 term2"~2147483647 title:"term1 term2"~2147483647^1.5 
host:"term1 term2"~2147483647^2.0

(8)
Query: "term1 term2" -term3
Parsed: "term1 term2" -term3
Translated: +(url:"term1 term2"^4.0 anchor:"term1 term2"^2.0 
content:"term1 term2" title:"term1 term2"^1.5 host:"term1 term2"^2.0) 
-(url:term3^4.0 anchor:term3^2.0 content:term3 title:term3^1.5 
host:term3^2.0)

(9)
Query: term1 -"term2 term3"
Parsed: term1 -"term2 term3"
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 
title:term1^1.5 host:term1^2.0) -(url:"term2 term3"^4.0 anchor:"term2 term3"^2.0 
content:"term2 term3" title:"term2 term3"^1.5 host:"term2 term3"^2.0)


WHEW ... !!! Are you tired? Well, Lucene is tired of these queries too. 
They are too long! They are absurdly long and complex. For large indexes 
the time to evaluate them may run into several