Re: nightly build broken?

2006-04-11 Thread Byron Miller
I get the nightly to run, but it never completes anything;
it always gets stuck at 98% here and there. I'll try
today's build and see what happens.

--- Stefan Groschupf [EMAIL PROTECTED] wrote:

 Hi,
 
 looks like the latest nightly build is broken.
 Looks like the jar that comes with the nightly build
 contains some  
 patches that are not yet in the svn sources.
 Is someone able to get the latest nutch nightly to
 run?
 
 Thanks.
 Stefan
 
 
 



Re: scalability limits getDetails, mapFile Readers?

2006-03-02 Thread Byron Miller
I would like to see something like active, in-process
and inbound.  Active data is live and on the query
servers (both indexes and correlating segments); in-
process covers tasks currently being mapped out; and
inbound is processes/data that is pending to be
processed.

Active nodes report as being in the search pool.  In-process
nodes are really data nodes doing all of the number
crunching/import/merging/indexing, and inbound is
everything in fetch/pre-processing.

The cycle would be a pull cycle. Active nodes pull
from the corresponding data nodes, which in turn pull from
the corresponding inbound nodes.  Events/batches could
trigger the pull so that it is a complete or usable
data set. Some lightweight workflow engine could
allow you to process/manage the cycle of data.

I would like to see the DFS be block-aware - able to process
the data on the active data server where that data
resides (as much as possible). Such a file table
could be used to associate the data through the entire
process stream and allow for fairly linear growth.

Such a system could also be aware of its own capacity,
in that the inbound processes would fail/halt if disk
space on the DFS isn't capable of handling new
tasks, and vice versa: if the active nodes are at
capacity, tasks could be told to stop/hold.  You could
use this logic to add more nodes where necessary,
resume processing, and chart your growth.
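
A minimal sketch of the three pools and the pull cycle described above (plain Java; every class and method name here is made up for illustration, and capacity checks are reduced to simple booleans):

  // Hypothetical sketch of the active / in-process / inbound pools and the
  // pull cycle described above. Nothing here corresponds to real Nutch code.
  public class PullCycle {

    enum Pool { ACTIVE, IN_PROCESS, INBOUND }

    interface Node {
      Pool pool();
      boolean hasCapacity();          // e.g. free disk on the DFS volume
      boolean hasCompleteBatch();     // a usable, complete data set is ready
      void pullFrom(Node upstream);   // copy/merge the upstream batch locally
    }

    // One pass of the pull cycle: active pulls from in-process,
    // in-process pulls from inbound. Halt instead of overflowing.
    public void runOnce(Node active, Node inProcess, Node inbound) {
      if (inProcess.hasCompleteBatch() && active.hasCapacity()) {
        active.pullFrom(inProcess);
      }
      if (inbound.hasCompleteBatch() && inProcess.hasCapacity()) {
        inProcess.pullFrom(inbound);
      } // otherwise hold: add nodes or wait for capacity
    }
  }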

I come from the ERP/Oracle world, so I have very much
learned to appreciate distributed architecture and the
concept of an aware system: concurrent
processing that is distributed across many nodes,
aware of the status of each task, able to hold/wait
or act upon the condition of the system, and able to grow
fairly linearly as needed.

-byron

--- Stefan Groschupf [EMAIL PROTECTED] wrote:

 Hi Andrzej,
 
 
  * merge 80 segments into 1. A lot of IO
 involved... and you have to  
  repeat it from time to time. Ugly.
 I agree.
 
  * implement a search server as a map task. Several
 challenges: it  
  needs to partition the Lucene index, and it has to
 copy all parts  
  of segments and indexes from DFS to the local
 storage, otherwise  
  performance will suffer. However, the number of
 open files per  
  machine would be reduced, because (ideally) each
 machine would deal  
  with few or a single part of segment and a single
 part of index...
 
 Well I played around and already had a kind of
 prototype.
 I had seen the following problems:
 
 + having a kind of repository of active search
 servers
 possibility A: find all tasktrackers running a
 specific task (already  
 discussed in the hadoop mailing list)
 possibility B: having an rpc server running in the
 jvm that runs the
 search server client, add the hostname to the
 jobconf and, similar to
 task -> jobtracker, have the search server announce itself via
 heartbeat to the
 search server 'repository'.
 
 + having the index locally and the segment in the
 dfs.
 ++ adding to NutchBean init a dfs for index and one
 for segments  
 could fix this, or more general add support for
 streamhandlers like  
 dfs:// vs file://. (very long term)
 
 + downloading an index from dfs until the mapper
 starts or just index  
 the segment data to local hdd and let the mapper run
 for the next 30  
 days?
 
 Stefan 
 



Re: Carrot2 v. 1.0.1. [clustering plugin]

2006-02-03 Thread Byron Miller
I would love to see it continue as a plugin.  I'm
moving to mapreduce myself, so I would be interested in
utilizing it there.

Thanks for the great work! I look forward to trying out
your updates.

Feel free to contact me directly if you wish.

-byron

--- Dawid Weiss [EMAIL PROTECTED] wrote:

 
 Hi there,
 
 We've been quite busy with putting things together
 at Carrot2. Version 
 1.0.1 is out -- it is a stable release with a few
 tweaks and tunings 
 that appeared after 1.0. We also have a Web site ;)
 
 http://www.carrot2.org
 
 So... I think it's time for reintegrating that code
 into Nutch 
 clustering plugin. If folks still consider it
 valuable enough to be kept 
 in the codebase, that is (I realize it is a big
 plugin).
 
 Summarizing, my questions are:
 
 - Is there still enough interest in maintaining this
 functionality in 
 the Nutch codebase? I admit I'm really busy and I
 don't have enough time 
 to keep up with Nutch's development. My
 contributions will be there, but 
 their frequency might be insufficient to keep the
 plugin usable at all 
 times.
 
 - Which Nutch version do you want me to work with if
 there's still interest 
 -- map-reduce, maintenance line or both?
 
 Dawid
 



indexSorter - applied to SVN or patch in Jira?

2006-01-31 Thread Byron Miller
Has the IndexSorter code discussed a while back been
pushed to Jira or put in SVN?  I'd like to give it a
whirl on some of my indexes, but the archive I can find
cut off the post with the code attached.


[jira] Commented: (NUTCH-16) boost documents matching a url pattern

2006-01-28 Thread byron miller (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-16?page=comments#action_12364354 ] 

byron miller commented on NUTCH-16:
---

Cool.

An inverse of this plugin would be great, or an enhancement of this for +/- values 
based on patterns, as I think it would help to lower the score of domains like 
i.like.to.spam.with.keywords.in.my.url.pretending.im.a.good.site.dot.com
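
For illustration, the +/- idea might look roughly like the sketch below. This is not the attached plugin or a real IndexingFilter implementation; it is a self-contained helper (modern Java syntax, hypothetical class name, made-up patterns and multipliers) that computes a boost factor which a filter could then apply to the document.

  // Illustrative only: compute a +/- boost multiplier from URL patterns.
  import java.util.LinkedHashMap;
  import java.util.Map;
  import java.util.regex.Pattern;

  public class UrlPatternBooster {

    // pattern -> multiplier; > 1.0 promotes, < 1.0 demotes (values made up)
    private final Map<Pattern, Float> rules = new LinkedHashMap<Pattern, Float>();

    public UrlPatternBooster() {
      // favor a hypothetical intranet host
      rules.put(Pattern.compile("^https?://[^/]*intranet\\.example\\.com/"), 2.0f);
      // demote hosts with absurdly deep keyword-stuffed subdomains
      rules.put(Pattern.compile("^https?://([^./]+\\.){6,}[^./]+\\.[a-z]+/"), 0.2f);
    }

    public float boostFor(String url) {
      float boost = 1.0f;
      for (Map.Entry<Pattern, Float> e : rules.entrySet()) {
        if (e.getKey().matcher(url).find()) {
          boost *= e.getValue().floatValue();
        }
      }
      return boost;
    }
  }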

 boost documents matching a url pattern
 --

  Key: NUTCH-16
  URL: http://issues.apache.org/jira/browse/NUTCH-16
  Project: Nutch
 Type: New Feature
   Components: indexer
 Reporter: Stefan Groschupf
 Priority: Trivial
  Attachments: boost-url-src_and_bin.zip, boostingPluginPatch.txt

 The attached patch is a plugin that allows boosting documents matching a url 
 pattern. 
 This could be useful to rank documents from an intranet higher than external 
 pages.
 A README comes with the patch.
 Any comments are welcome.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-79) Fault tolerant searching.

2006-01-28 Thread byron miller (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-79?page=comments#action_12364357 ] 

byron miller commented on NUTCH-79:
---

Piotr,

Any update on this? Have you been able to run with this or still working out 
the kinks?

 Fault tolerant searching.
 -

  Key: NUTCH-79
  URL: http://issues.apache.org/jira/browse/NUTCH-79
  Project: Nutch
 Type: New Feature
   Components: searcher
 Reporter: Piotr Kosiorowski
  Attachments: patch

 I have finally managed to prepare the first version of fault-tolerant searching I 
 promised a long time ago. 
 It reads the server configuration from a search-groups.txt file (in the startup 
 directory or the directory specified by searcher.dir) if no search-servers.txt 
 file is present. If search-servers.txt is present, it would be read and 
 handled as previously.
 ---
 Format of search-groups.txt:
  *  search.group.count=[int] 
  *  search.group.name.[i]=[string] (for i=0 to count-1)
  *  
  *  For each name: 
  *  [name].part.count=[int] partitionCount 
  *  [name].part.[i].host=[string] (for i=0 to partitionCount-1)
  *  [name].part.[i].port=int (for i=0 to partitionCount-1)
  *  
  *  Example: 
  *  search.group.count=2 
  *  search.group.name.0=master
  *  search.group.name.1=backup
  *  
  *  master.part.count=2 
  *  master.part.0.host=host1 
  *  master.part.0.port=
  *  master.part.1.host=host2 
  *  master.part.1.port=
  *  
  *  backup.part.count=2 
  *  backup.part.0.host=host3 
  *  backup.part.0.port=
  *  backup.part.1.host=host4 
  *  backup.part.1.port=
 
 If more than one search group is defined in the configuration file, requests are 
 distributed among groups in round-robin fashion. If one of the servers from 
 a group fails to respond, the whole group is treated as inactive and removed 
 from the pool used to distribute requests. There is a separate recovery 
 thread that, every searcher.recovery.delay seconds (default 60), tries to 
 check whether an inactive group became alive and, if so, adds it back to the pool 
 of active groups.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-14) NullPointerException NutchBean.getSummary

2006-01-28 Thread byron miller (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-14?page=comments#action_12364358 ] 

byron miller commented on NUTCH-14:
---

Are you still hitting this Stefan?

 NullPointerException NutchBean.getSummary
 -

  Key: NUTCH-14
  URL: http://issues.apache.org/jira/browse/NUTCH-14
  Project: Nutch
 Type: Bug
   Components: searcher
 Reporter: Stefan Groschupf
 Priority: Minor


 In heavy load scenarios this may happen when the connection breaks.
 java.lang.NullPointerException
 at java.util.Hashtable.get(Hashtable.java:333)
 at net.nutch.ipc.Client.getConnection(Client.java:276)
 at net.nutch.ipc.Client.call(Client.java:251)
 at 
 net.nutch.searcher.DistributedSearch$Client.getSummary(DistributedSearch.java:418)
 at net.nutch.searcher.NutchBean.getSummary(NutchBean.java:236)
 at 
 org.apache.jsp.search_jsp._jspService(org.apache.jsp.search_jsp:396)
 at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:99)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
 at 
 org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:325)
 at 
 org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:295)
 at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:245)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
 at 
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
 at 
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
 at 
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214)
 at 
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
 at 
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
 at 
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
 at 
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
 at 
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
 at 
 org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:825)
 at 
 org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:738)
 at 
 org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:526)
 at 
 org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:80)
 at 
 org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
 at java.lang.Thread.run(Thread.java:552)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



Re: need volunteer to develop search for apache.org

2006-01-25 Thread Byron Miller
I'll be happy to do it.

--- Doug Cutting [EMAIL PROTECTED] wrote:

 Would someone volunteer to develop Nutch-based
 site-search engine for 
 all apache.org domains?  We now have a Solaris zone
 to host this.
 
 Thanks,
 
 Doug
 



[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets

2006-01-20 Thread byron miller (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12363400 ] 

byron miller commented on NUTCH-134:


Thanks Erik, I was able to pull down the highlighter and I'll be loading it up 
on mozdex.com to test out over the weekend (1/21/2006).  I'll let people know 
if my CPU skyrockets :)

 Summarizer doesn't select the best snippets
 ---

  Key: NUTCH-134
  URL: http://issues.apache.org/jira/browse/NUTCH-134
  Project: Nutch
 Type: Bug
   Components: searcher
 Versions: 0.7, 0.8-dev, 0.7.1, 0.7.2-dev
 Reporter: Andrzej Bialecki 


 Summarizer.java tries to select the best fragments from the input text, where 
 the frequency of query terms is the highest. However, the logic in line 223 
 is flawed in that the excerptSet.add() operation will add new excerpts only 
 if they are not already present - the test is performed using the Comparator 
 that compares only the numUniqueTokens. This means that if there are two or 
 more excerpts, which score equally high, only the first of them will be 
 retained, and the rest of equally-scoring excerpts will be discarded, in 
 favor of other excerpts (possibly lower scoring).
 To fix this the Set should be replaced with a List + a sort operation. To 
 keep the relative position of excerpts in the original order the Excerpt 
 class should be extended with an int order field, and the collected 
 excerpts should be sorted in that order prior to adding them to the summary.
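
For illustration, the fix described above might look roughly like the sketch below. This is not the actual Summarizer patch; the classes are stripped down (modern Java syntax) to show the List + sort approach and the added order field.

  // Sketch of the fix: collect excerpts in a List so equally-scoring
  // excerpts are kept, sort by score to pick the best, then restore the
  // original document order before building the summary.
  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.Comparator;
  import java.util.List;

  class Excerpt {
    int numUniqueTokens;   // the score used today
    int order;             // NEW: position of the excerpt in the original text
  }

  class ExcerptSelector {
    List<Excerpt> selectBest(List<Excerpt> all, int max) {
      List<Excerpt> best = new ArrayList<Excerpt>(all);
      // highest-scoring first; ties are all kept, unlike the Set-based code
      Collections.sort(best, new Comparator<Excerpt>() {
        public int compare(Excerpt a, Excerpt b) {
          return b.numUniqueTokens - a.numUniqueTokens;
        }
      });
      if (best.size() > max) {
        best = new ArrayList<Excerpt>(best.subList(0, max));
      }
      // restore the relative position in the original text
      Collections.sort(best, new Comparator<Excerpt>() {
        public int compare(Excerpt a, Excerpt b) {
          return a.order - b.order;
        }
      });
      return best;
    }
  }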

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-183) MapReduce has a series of problems concerning task-allocation to worker nodes

2006-01-20 Thread byron miller (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-183?page=comments#action_12363477 ] 

byron miller commented on NUTCH-183:


As Mr. Burns would say, "eggcelent."  I'll give this a try.  BTW, is it possible 
to implement functionality that would restart lagging jobs on nodes that 
have completed their tasks, like Google does?  For example, if you're 90% done and the 
last 10 jobs are hung because of bad hardware, slow response or failure, could you 
have the ability to redo the long-running jobs in parallel on alternate nodes 
and take whichever finishes first?  That way, if you have a huge crawl 
and certain nodes slow down or fail, those jobs can be rescheduled on nodes that 
have finished, to wrap up and terminate any dead jobs when done.

Hope that makes sense.

 MapReduce has a series of problems concerning task-allocation to worker nodes
 -

  Key: NUTCH-183
  URL: http://issues.apache.org/jira/browse/NUTCH-183
  Project: Nutch
 Type: Improvement
  Environment: All
 Reporter: Mike Cafarella
  Attachments: jobtracker.patch

 The MapReduce JobTracker is not great at allocating tasks to TaskTracker 
 worker nodes.
 Here are the problems:
 1) There is no speculative execution of tasks
 2) Reduce tasks must wait until all map tasks are completed before doing any 
 work
 3) TaskTrackers don't distinguish between Map and Reduce jobs.  Also, the 
 number of
 tasks at a single node is limited to some constant.  That means you can get 
 weird deadlock
 problems upon machine failure.  The reduces take up all the available 
 execution slots, but they
 don't do productive work, because they're waiting for a map task to complete. 
  Of course, that
 map task won't even be started until the reduce tasks finish, so you can see 
 the problem...
 4) The JobTracker is so complicated that it's hard to fix any of these.
 The right solution is a rewrite of the JobTracker to be a lot more flexible 
 in task handling.
 It has to be a lot simpler.  One way to make it simpler is to add an 
 abstraction I'll call
 TaskInProgress.  Jobs are broken into chunks called TasksInProgress.  All 
 the TaskInProgress
 objects must be complete, somehow, before the Job is complete.
 A single TaskInProgress can be executed by one or more Tasks.  TaskTrackers 
 are assigned Tasks.
 If a Task fails, we report it back to the JobTracker, where the 
 TaskInProgress lives.  The TIP can then
 decide whether to launch additional  Tasks or not.
 Speculative execution is handled within the TIP.  It simply launches multiple 
 Tasks in parallel.  The
 TaskTrackers have no idea that these Tasks are actually doing the same chunk 
 of work.  The TIP
 is complete when any one of its Tasks are complete.
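
A rough sketch of the TaskInProgress idea as described above (illustrative only, not the attached jobtracker.patch; all names and fields are made up):

  // Illustrative sketch of the TaskInProgress (TIP) abstraction: one chunk of
  // a job, possibly executed by several speculative Task attempts; the TIP is
  // done as soon as any one attempt succeeds.
  import java.util.ArrayList;
  import java.util.List;

  class TaskInProgress {
    private final String chunkId;
    private final List<String> attempts = new ArrayList<String>();
    private boolean complete = false;
    private int nextAttempt = 0;

    TaskInProgress(String chunkId) { this.chunkId = chunkId; }

    // Called by the JobTracker when a TaskTracker asks for work.
    synchronized String launchAttempt() {
      String id = chunkId + "_attempt_" + (nextAttempt++);
      attempts.add(id);
      return id;  // the tracker has no idea other attempts do the same work
    }

    // Speculative execution: if one attempt lags, just launch another.
    synchronized boolean shouldSpeculate(long runningMillis, long threshold) {
      return !complete && runningMillis > threshold;
    }

    synchronized void reportSuccess(String attemptId) {
      complete = true;   // first finisher wins; others can be killed
    }

    synchronized void reportFailure(String attemptId) {
      attempts.remove(attemptId);  // the TIP decides whether to launch another
    }

    synchronized boolean isComplete() { return complete; }
  }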

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



Re: Problem with latest SVN during reduce phase

2006-01-13 Thread Byron Miller
I'll pull it down today and give it a shot.

thanks,
-byron

--- Lukas Vlcek [EMAIL PROTECTED] wrote:

 Hi,
 
  Get the latest svn version. Andrzej committed some
  patches yesterday
  and now this issue is gone (at least it works fine
  for me). I believe
  that revision# 368167 is what we were after.
 
 Regards,
 Lukas
 
 On 1/13/06, Pashabhai [EMAIL PROTECTED]
 wrote:
  Hi ,
 
 You are right, Parse object is not null even
 though
  page has no content and title.
 
 Could it be FetcherOutput Object ???
 
 
  P
 
  --- Lukas Vlcek [EMAIL PROTECTED] wrote:
 
   Hi,
   I think this issue can be more complex. If I
   remember my test
   correctly then parse object was not null. Also
   parse.getText() was not
   null (it just contained empty String).
   If document is not parsed correctly then empty
   parse is returned
   instead: parseStatus.getEmptyParse(); which
 should
   be OK, but I didn't
   have a chance to check if this can cause any
   troubles during
   index optimization.
   Lukas
  
   On 1/12/06, Pashabhai [EMAIL PROTECTED]
   wrote:
Hi ,
   
    The very similar exception occurs while indexing a
 page which does not have body content (and title
 sometimes).

 051223 194717 Optimizing index.
 java.lang.NullPointerException
 at org.apache.nutch.indexer.basic.BasicIndexingFilter.filter(BasicIndexingFilter.java:75)
 at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:63)
 at org.apache.nutch.crawl.Indexer.reduce(Indexer.java:217)
 at org.apache.nutch.mapred.ReduceTask.run(ReduceTask.java:260)
 at

 Looking into the source code of BasicIndexingFilter,
 it is trying to
 doc.add(Field.UnStored("content", parse.getText()));

 I guess adding a check for null on the parse object,
 if (parse != null), should solve the problem.

 Can confirm when tested locally.

 Thanks
 P
   
   
   
   
--- Lukas Vlcek [EMAIL PROTECTED] wrote:
   
 Hi,
 I am facing this error as well. Now I
 located
   one
 particular document
  which is causing it (it is an MS Word document
  which
  can't be properly
  parsed by the parser). I have sent it to Andrzej
  in a
  separate email. Let's
 see if that helps...
 Lukas

 On 1/11/06, Dominik Friedrich
 [EMAIL PROTECTED] wrote:
  I got this exception a lot, too. I haven't
   tested
 the patch by Andrzej
  yet but instead I just put the doc.add()
 lines
   in
 the indexer reduce
  function in a try-catch block . This way
 the
 indexing finishes even with
  a null value and i can see which documents
   haven't
 been indexed in the
  log file.
 
  Wouldn't it be a good idea to catch every
  exception that only affects
  one document in loops like this? At least
 I
   don't
 like it if an indexing
  process dies after a few hours because one
 document triggers such an
  exception.
 
  best regards,
  Dominik
 
  Byron Miller wrote:
    60111 103432 reduce  reduce
    060111 103432 Optimizing index.
    060111 103433 closing  reduce
    060111 103434 closing  reduce
    060111 103435 closing  reduce
    java.lang.NullPointerException: value cannot be null
    at org.apache.lucene.document.Field.<init>(Field.java:469)
    at org.apache.lucene.document.Field.<init>(Field.java:412)
    at org.apache.lucene.document.Field.UnIndexed(Field.java:195)
    at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:198)
    at org.apache.nutch.mapred.ReduceTask.run(ReduceTask.java:260)
    at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:90)
    Exception in thread main java.io.IOException: Job failed!
    at
=== message truncated ===
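
For reference, the null guard suggested in the thread above would look roughly like this; the doc.add() call is quoted from the thread, while the wrapper class and method are illustrative only, not the actual BasicIndexingFilter code.

  // Sketch of the guard suggested above: skip documents whose parse text is
  // missing instead of letting the NullPointerException kill the whole reduce
  // task. Class and method wrapper are illustrative only.
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.nutch.parse.Parse;

  public class NullSafeIndexing {
    public static void addContent(Document doc, Parse parse) {
      if (parse != null && parse.getText() != null) {
        doc.add(Field.UnStored("content", parse.getText()));
      }
      // else: leave the document without a content field rather than failing
    }
  }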



RE: MapReduce and segment merging

2006-01-12 Thread Byron Miller
I was thinking that Nutch needs some sort of workflow
manager. This way you could build jobs off specific
workflows and hopefully recover jobs based upon the
portion of the workflow where they are stuck (or restart a
job if it failed, if processing time exceeds x hours, and other such
workflow process rules).

Something like that could also send notifications of
jobs done, trigger other events, and create a
management interface to what your cluster is up to, or
let configuration types be defined based upon the
batch job/workflow process in progress.  For example,
if I'm building a blog index I may want more, smaller
segments based upon daily fetches, while for other jobs
I may want fewer, larger segments.

Does something like that make much sense for where
the mapred branch is going?

Is workflow the right term for such a beast?
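
Not a proposal for an actual API, just a sketch of what such a per-workflow rule set might look like (all names, steps, and numbers are hypothetical):

  // Hypothetical sketch of a crawl workflow with per-step retry/restart
  // rules, as described above. Nothing here corresponds to real Nutch code.
  public class CrawlWorkflow {

    enum Step { INJECT, GENERATE, FETCH, UPDATEDB, INDEX, DEDUP, MERGE }

    static class StepRule {
      final Step step;
      final long maxHours;     // restart the job if it runs longer than this
      final int maxRetries;    // how often to retry a failed step
      StepRule(Step step, long maxHours, int maxRetries) {
        this.step = step; this.maxHours = maxHours; this.maxRetries = maxRetries;
      }
    }

    // Example: a daily "blog" workflow with small segments and tight deadlines.
    static StepRule[] blogWorkflow() {
      return new StepRule[] {
        new StepRule(Step.GENERATE, 1, 2),
        new StepRule(Step.FETCH, 6, 1),
        new StepRule(Step.UPDATEDB, 2, 2),
        new StepRule(Step.INDEX, 4, 1),
      };
    }
  }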

-byron



--- Goldschmidt, Dave [EMAIL PROTECTED]
wrote:

 Could you also just copy segments out of NDFS to
 local -- perform merges
 in local -- then copy segments back into NDFS?
 
 DaveG
 
 
 -Original Message-
 From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] 
 Sent: Thursday, January 12, 2006 2:14 PM
 To: nutch-dev@lucene.apache.org
 Subject: Re: MapReduce and segment merging
 
 Mike Alulin wrote:
  Then how people uses the new version if they need
 let's say daily
 crawls of the new/updated pages? I crawl updated
 pages every 24 hours
 and if I do not merge the segments, soon I will have
 hundreds of them.
 What is the best solution in this case? 
 
Full recrawl is not a good option as i have
 millions of documents
 and I DO know which of them were updated without
 requesting them.

 
 This is a development version, nobody said it's
 feature complete. 
 Patience, my friend... or spend some effort to
 improve it. ;-)
 
 -- 
 Best regards,
 Andrzej Bialecki 
  ___. ___ ___ ___ _ _  
 __
 [__ || __|__/|__||\/|  Information Retrieval,
 Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System
 Integration
 http://www.sigram.com  Contact: info at sigram dot
 com
 
 
 



Problem with latest SVN during reduce phase

2006-01-11 Thread Byron Miller
60111 103432 reduce  reduce
060111 103432 Optimizing index.
060111 103433 closing  reduce
060111 103434 closing  reduce
060111 103435 closing  reduce
java.lang.NullPointerException: value cannot be null
at
org.apache.lucene.document.Field.<init>(Field.java:469)
at
org.apache.lucene.document.Field.<init>(Field.java:412)
at
org.apache.lucene.document.Field.UnIndexed(Field.java:195)
at
org.apache.nutch.indexer.Indexer.reduce(Indexer.java:198)
at
org.apache.nutch.mapred.ReduceTask.run(ReduceTask.java:260)
at
org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:90)
Exception in thread main java.io.IOException: Job
failed!
at
org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
at
org.apache.nutch.indexer.Indexer.index(Indexer.java:259)
at
org.apache.nutch.crawl.Crawl.main(Crawl.java:121)
[EMAIL PROTECTED]:/data/nutch/trunk$


Pulled today's build and got the above error. No problems
with running out of disk space or anything like that. This
is a single instance on local file systems.

Is there any way to recover the crawl/finish the reduce job from
where it failed?


Re: Per-page crawling policy

2006-01-05 Thread Byron Miller
Excellent ideas, and that is what I'm hoping for: to use
some of the social bookmarking type ideas to build the
starter sites and link maps from.

I hope to work with Simpy or other bookmarking
projects to build somewhat of a popularity map (human-
edited authority) to merge and calculate against a
computer-generated map (via standard link processing,
anchor results and such).

My only continuing question is how to manage the
merge and index process of staging/processing your
crawl/fetch jobs such as this.  It seems all of our
theories assume a single crawl and publish of that
index rather than a living/breathing corpus.

Unless we map/bucket the segments to have some purpose,
it's difficult to manage how we process them, sort
them or analyze them to define or extract more meaning
from them.

Brain is exploding :)

-byron

--- Andrzej Bialecki [EMAIL PROTECTED] wrote:

 Hi,
 
 I've been toying with the following idea, which is
 an extension of the 
 existing URLFilter mechanism and the concept of a
 crawl frontier.
 
 Let's suppose we have several initial seed urls,
 each with a different 
 subjective quality. We would like to crawl these,
 and expand the 
 crawling frontier using outlinks. However, we
 don't want to do it 
 uniformly for every initial url, but rather
 propagate certain crawling 
 policy through the expanding trees of linked pages.
 This crawling 
 policy could consist of url filters, scoring
 methods, etc - basically 
 anything configurable in Nutch could be included in
 this policy. 
 Perhaps it could even be the new version of
 non-static NutchConf ;-)
 
 Then, if a given initial url is a known high-quality
 source, we would 
 like to apply a favor policy, where we e.g. add
 pages linked from that 
 url, and in doing so we give them a higher score.
 Recursively, we could 
 apply the same policy for the next generation pages,
 or perhaps only for 
 pages belonging to the same domain. So, in a sense
 the original notion 
 of high-quality would cascade down to other linked
 pages. The important 
 aspect of this to note is that all newly discovered
 pages would be 
 subject to the same policy - unless we have
 compelling reasons to switch 
 the policy (from favor to default or to
 distrust), which at that 
 point would essentially change the shape of the
 expanding frontier.
 
 If a given initial url is a known spammer, we would
 like to apply a 
 distrust policy for adding pages linked from that
 url (e.g. adding or 
 not adding, if adding then lowering their score, or
 applying different 
 score calculation). And recursively we could apply a
 similar policy of 
 distrust to any pages discovered this way. We
 could also change the 
 policy on the way, if there are compelling reasons
 to do so. This means 
 that we could follow some high-quality links from
 low-quality pages, 
 without drilling down the sites which are known to
 be of low quality.
 
 Special care needs to be taken if the same page is
 discovered from pages 
 with different policies, I haven't thought about
 this aspect yet... ;-)
 
 What would be the benefits of such approach?
 
 * the initial page + policy would both control the
 expanding crawling 
 frontier, and it could be differently defined for
 different starting 
 pages. I.e. in a single web database we could keep
 different 
 collections or areas of interest with
 differently specified 
 policies. But still we could reap the benefits of a
 single web db, 
 namely the link information.
 
 * URLFilters could be grouped into several policies,
 and it would be 
 easy to switch between them, or edit them.
 
 * if the crawl process realizes it ended up on a
 spam page, it can 
 switch the page policy to distrust, or the other
 way around, and stop 
 crawling unwanted content. From now on the pages
 linked from that page 
 will follow the new policy. In other words, if a
 crawling frontier 
 reaches pages with known quality problems, it would
 be easy to change 
 the policy on-the-fly to avoid them or pages linked
 from them, without 
 resorting to modifications of URLFilters.
 
 Some of the above you can do even now with
 URLFilters, but any change 
 you do now has global consequences. You may also end
 up with awfully 
 complicated rules if you try to cover all cases in
 one rule set.
 
 How to implement it? Surprisingly, I think that it's
 very simple - just 
 adding a CrawlDatum.policyId field would suffice,
 assuming we have a 
 means to store and retrieve these policies by ID;
 and then instantiate 
 it and call appropriate methods whenever we use
 today the URLFilters and 
 do the score calculations.
 
 Any comments?
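
 A rough sketch of the policyId idea described above (illustrative only; CrawlDatum has no such field today, and the policy interface, registry, and method names below are all made up, in modern Java syntax):

  // Per-page crawling policies keyed by an id that would travel with each
  // CrawlDatum, as proposed above. All names are hypothetical.
  import java.util.HashMap;
  import java.util.Map;

  interface CrawlPolicy {
    String filter(String url);                 // null means "reject this url"
    float score(String url, float linkScore);  // favor / default / distrust
    String policyForOutlink(String fromUrl, String toUrl);  // may switch policy
  }

  class PolicyRegistry {
    private final Map<String, CrawlPolicy> policies =
        new HashMap<String, CrawlPolicy>();

    void register(String id, CrawlPolicy p) { policies.put(id, p); }

    // Where the code today calls URLFilters.filter(url), it would instead
    // look up the policy recorded in the CrawlDatum and delegate to it.
    String filter(String policyId, String url) {
      CrawlPolicy p = policies.get(policyId);
      return p == null ? url : p.filter(url);
    }
  }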
 
 -- 
 Best regards,
 Andrzej Bialecki 
  ___. ___ ___ ___ _ _  
 __
 [__ || __|__/|__||\/|  Information Retrieval,
 Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System
 Integration
 http://www.sigram.com  Contact: info at sigram dot
 com
 
 
 



Re: mapred crawling exception - Job failed!

2006-01-04 Thread Byron Miller
Fixed in the copy I run, as I've been able to get my
100k pages indexed without getting that error.

-byron

--- Andrzej Bialecki [EMAIL PROTECTED] wrote:

 Lukas Vlcek wrote:
 
 Hi,
 
  I am trying to use the latest nutch-trunk version
  but I am facing an
  unexpected "Job failed!" exception. It seems that
  all crawling work
  has already been done but some threads are hung,
  which results in an
  exception after some timeout.
 
   
 
 
 This was fixed (or should be fixed :) in the
 revision r365576. Please 
 report if it doesn't fix it for you.
 
 -- 
 Best regards,
 Andrzej Bialecki 
  ___. ___ ___ ___ _ _  
 __
 [__ || __|__/|__||\/|  Information Retrieval,
 Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System
 Integration
 http://www.sigram.com  Contact: info at sigram dot
 com
 
 
 



Re: IndexSorter optimizer

2006-01-03 Thread Byron Miller
On optimizing performance, does anyone know if Google
is exporting its entire dataset as an index, or only
somehow indexing the top N% (since they only show the
first 1000 or so results anyway)?

With this patch and a top result set in the xml file,
does that mean it will stop scanning the index at that
point?  Is there a methodology to actually prune the
index on some scaling factor so that a 4-billion-page
index can be searchable only 1k results deep on
average?

Seems like some sort of method to do the above would
cut your search processing/index size down fairly
well. But it may be more expensive to post-process
at this scale than it is to simply push everything and let the
query optimizer ignore it as needed. After all, disk
space is getting rather cheap compared to CPU
processing and memory.



--- Doug Cutting [EMAIL PROTECTED] wrote:

 Andrzej Bialecki wrote:
  I'm happy to report that further tests performed
 on a larger index seem 
  to show that the overall impact of the IndexSorter
 is definitely 
  positive: performance improvements are
 significant, and the overall 
  quality of results seems at least comparable, if
 not actually better.
 
 Great news!
 
 I will submit the Lucene patches ASAP, now that we
 know they're useful.
 
 Doug
 



Adding some theory publication links into the Wiki..

2006-01-03 Thread Byron Miller
I figured since I'm in research mode I would start
compiling available information and resources and
putting them up on the wiki:

http://wiki.apache.org/nutch/Search_Theory

Sorry about all the cvs messages on edits; I'm not
used to the touchpad on this darned laptop :)

Anyhow, if you have any resources to share, or care to
look at all sorts of maps, theory and discussions on
search, relevance and ranking, it is a gold mine
of info!


[jira] Created: (NUTCH-159) Specify temp/working directory for crawl

2005-12-31 Thread byron miller (JIRA)
Specify temp/working directory for crawl


 Key: NUTCH-159
 URL: http://issues.apache.org/jira/browse/NUTCH-159
 Project: Nutch
Type: Bug
  Components: fetcher, indexer  
Versions: 0.8-dev
 Environment: Linux/Debian
Reporter: byron miller


I ran a crawl of 100k web pages and got:

org.apache.nutch.fs.FSError: java.io.IOException: No space left on device
at 
org.apache.nutch.fs.LocalFileSystem$LocalNFSFileOutputStream.write(LocalFileSystem.java:149)
at org.apache.nutch.fs.FileUtil.copyContents(FileUtil.java:65)
at 
org.apache.nutch.fs.LocalFileSystem.renameRaw(LocalFileSystem.java:178)
at org.apache.nutch.fs.NutchFileSystem.rename(NutchFileSystem.java:224)
at 
org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:80)
Caused by: java.io.IOException: No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:260)
at 
org.apache.nutch.fs.LocalFileSystem$LocalNFSFileOutputStream.write(LocalFileSystem.java:147)
... 4 more
Exception in thread main java.io.IOException: Job failed!
at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
at org.apache.nutch.crawl.Fetcher.fetch(Fetcher.java:335)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:107)
[EMAIL PROTECTED]:/data/nutch$ df -k


It appears the crawl created a /tmp/nutch directory that filled up even though I 
specified a db directory.

We need to add a parameter to the command line or make a globally configurable /tmp 
(work area) for the nutch instance so that crawls won't fail.
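
As a stopgap until such a flag exists, the working directory can presumably be pointed at a larger partition via a nutch-site.xml override; the property name below is an assumption based on mapred-default.xml and may differ in this 0.8-dev build.

  <!-- nutch-site.xml override; property name assumed, verify against
       mapred-default.xml / nutch-default.xml for your build -->
  <property>
    <name>mapred.local.dir</name>
    <value>/data/nutch/tmp</value>
    <description>Local scratch space for map/reduce intermediate files,
    moved off the small /tmp partition.</description>
  </property>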


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-123) Cache.jsp some times generate NullPointerException

2005-12-31 Thread byron miller (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-123?page=comments#action_12361473 ] 

byron miller commented on NUTCH-123:


Perhaps you should try the cache servlet as it dumps out the data as it sees it.

 Cache.jsp some times generate NullPointerException
 --

  Key: NUTCH-123
  URL: http://issues.apache.org/jira/browse/NUTCH-123
  Project: Nutch
 Type: Bug
   Components: web gui
  Environment: All systems
 Reporter: YourSoft
 Priority: Critical


 There is a problem with the following line in cached.jsp:
   String contentType = (String) metaData.get("Content-Type");
 In the segment data the key sometimes does not equal "Content-Type"; there are 
 "content-type" or "Content-type" etc.
 The solution: insert these lines above the line in question:
 for (Enumeration eNum = metaData.propertyNames(); eNum.hasMoreElements();) {
   content = (String) eNum.nextElement();
   if ("content-type".equalsIgnoreCase(content)) {
     break;
   }
 }
 final String contentType = (String) metaData.get(content);
 Regards,
 Ferenc

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-42) enhance search.jsp such that it can also returns XML

2005-12-31 Thread byron miller (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-42?page=comments#action_12361474 ] 

byron miller commented on NUTCH-42:
---

Safe to close.  (done)  We have XML/OpenSearch in latest trunk and other 
branches.

 enhance search.jsp such that it can also returns XML
 

  Key: NUTCH-42
  URL: http://issues.apache.org/jira/browse/NUTCH-42
  Project: Nutch
 Type: Wish
   Components: web gui
 Reporter: Michael Wechner
 Priority: Trivial
  Attachments: NutchRssSearch.zip, NutchRssSearch.zip, search.jsp.diff, 
 search.jsp.diff

 Enhance search.jsp such that by specifying a parameter format=xml the JSP 
 will return an XML, whereas if no format is being specified then it will 
 return HTML

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-158) Process Sitemap data in text, rss or xml format as well as OAI-PMH

2005-12-29 Thread byron miller (JIRA)
Process Sitemap data in text, rss or xml format as well as OAI-PMH
--

 Key: NUTCH-158
 URL: http://issues.apache.org/jira/browse/NUTCH-158
 Project: Nutch
Type: New Feature
  Components: fetcher  
Versions: 0.8-dev
Reporter: byron miller
Priority: Minor


Add support to the fetcher to look for sitemap files, download them and process 
them into webdb.

Perhaps create a robots.txt directive that can be used to create a standard 
format for sitemaps in RSS, XML or text format (one line per url) and process 
that.

I would love to see someone stomp on proprietary sitemap features, rather than 
making things as Google-specific as they are today :)

* RSS format/Atom format (standard)
* XML meta description
* OAI-PMH meta description 
(http://www.openarchives.org/OAI/openarchivesprotocol.html)

Perhaps even a pre-crawler that will scour for these and inject them into the web 
db to help build your link map, so you could even just index the topN.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-155) Remove web gui from the distribution to contrib and use OpenSearch Servlet

2005-12-29 Thread byron miller (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-155?page=comments#action_12361398 ] 

byron miller commented on NUTCH-155:


I don't know how I feel about moving the JSP stuff into a contrib and then 
fluffing it up more with the potential to support other web languages.

Moving out JSP doesn't negate the need for an app server, but making a 3rd-party 
contrib for everything that can use Nutch may be worthwhile.

 Remove web gui from the distribution to contrib and use OpenSearch Servlet
 

  Key: NUTCH-155
  URL: http://issues.apache.org/jira/browse/NUTCH-155
  Project: Nutch
 Type: Wish
   Components: web gui
 Versions: 0.8-dev
 Reporter: nutch.newbie


 Web gui JSP search pages should be moved to a contrib folder.  It would be 
 better to focus on OpenSearch-Servlet-based XML results. For example, in the 
 current tutorial at 
 http://lucene.apache.org/nutch/tutorial.html, under the searching section one 
 could imagine adding a script, OpenSearch (i.e. bin/nutch OpenSearch 
 search term -- bingo, XML results). 
 Therefore I suggest that the web gui move to contrib. I also 
 foresee PHP, Perl, Ruby, XSLT or other language-based GUIs being 
 developed and living under contrib as an addition to the JSP pages. 
 The current implementation's focus on JSP pages, Tomcat, etc. has nothing to do 
 with Nutch itself, but everything to do with how Nutch gets deployed. And 
 to my mind Nutch can be deployed in many ways, so why should just JSP and Tomcat 
 get the core attention?
 The above wish is not new; I have seen others in Jira with similar 
 thinking. Furthermore, Nutch is becoming big in size and the plugins are also 
 growing, so it would be a good idea to have a contrib directory just like Lucene. 
 Some of the plugins could also move there. Plugins like clustering and ontology 
 (i.e. not required for basic indexing/searching) are not obviously things that 
 should be part of the distribution. The point I am trying to make here is that 
 it's up to the search engine operator to download the plugins, rather than an 
 everyone-gets-everything.tar model.
 Above is still a wish :-) 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



Re: Mega-cleanup in trunk/

2005-12-28 Thread Byron Miller
I'll pull a build down tonight and let you know how it
goes!

-byron

--- Andrzej Bialecki [EMAIL PROTECTED] wrote:

 Hi,
 
  I just committed a large patch to clean up the trunk/
 of obsolete and 
 broken classes remaining from the 0.7.x development
 line. Please test 
 that things still work as they should ...
 
 -- 
 Best regards,
 Andrzej Bialecki 
  ___. ___ ___ ___ _ _  
 __
 [__ || __|__/|__||\/|  Information Retrieval,
 Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System
 Integration
 http://www.sigram.com  Contact: info at sigram dot
 com
 
 
 



[jira] Commented: (NUTCH-92) DistributedSearch incorrectly scores results

2005-12-28 Thread byron miller (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-92?page=comments#action_12361348 ] 

byron miller commented on NUTCH-92:
---

Has there been any advancement on this front? 

 DistributedSearch incorrectly scores results
 

  Key: NUTCH-92
  URL: http://issues.apache.org/jira/browse/NUTCH-92
  Project: Nutch
 Type: Bug
   Components: searcher
 Versions: 0.8-dev, 0.7
 Reporter: Andrzej Bialecki 
 Assignee: Andrzej Bialecki 


 When running search servers in a distributed setup, using 
 DistributedSearch$Server and Client, total scores are incorrectly calculated. 
 The symptoms are that scores differ depending on how segments are deployed to 
 Servers, i.e. if there is uneven distribution of terms in segment indexes 
 (due to segment size or content differences) then scores will differ 
 depending on how many and which segments are deployed on a particular Server. 
 This may lead to prioritizing of non-relevant results over more relevant ones.
 The underlying reason for this is that each IndexSearcher (which uses local 
 index on each Server) calculates scores based on the local IDFs of query 
 terms, and not the global IDFs from all indexes together. This means that 
 scores arriving from different Servers to the Client cannot be meaningfully 
 compared, unless all indexes have similar distribution of Terms and similar 
 numbers of documents in them. However, currently the Client mixes all scores 
 together, sorts them by absolute values and picks top hits. These absolute 
 values will change if segments are un-evenly deployed to Servers.
 Currently the workaround is to deploy the same number of documents in 
 segments per Server, and to ensure that segments contain well-randomized 
 content so that term frequencies for common terms are very similar.
 The solution proposed here (as a result of discussion between ab and cutting, 
 patches are coming) is to calculate global IDFs prior to running the query, 
 and pre-boost query Terms with these global IDFs. This will require one more 
 RPC call per each query (this can be optimized later, e.g. through caching). 
 Then the scores will become normalized according to the global IDFs, and 
 Client will be able to meaningfully compare them. Scores will also become 
 independent of the segment content or local number of documents per Server. 
 This will involve at least the following changes:
 * change NutchSimilarity.idf(Term, Searcher) to always return 1.0f. This 
 enables us to manipulate scores independently of local IDFs.
 * add a new method to Searcher interface, int[] getDocFreqs(Term[]), which 
 will return document frequencies for query terms.
 * modify getSegmentNames() so that it returns also the total number of 
 documents in each segment, or implement this as a separate method (this will 
 be called once during segment init)
 * in DistributedSearch$Client.search() first make a call to servers to return 
 local IDFs for the current query, and calculate global IDFs for each relevant 
 Term in that query.
 * multiply the TermQuery boosts by idf(totalDocFreq, totalIndexedDocs), and 
 PhraseQuery boosts by the sum of the idf(totalDocFreqs, totalIndexedDocs) for 
 all of its terms
 This solution should be applicable with only minor changes to all branches, 
 but initially the patches will be relative to trunk/ .
 Comments, suggestions and review are welcome!
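
 A self-contained sketch of the IDF-combination step described above (plain Java, no Lucene or RPC types; the idf formula mirrors Lucene's default log(N/(df+1)) + 1, and the class and method names are made up):

  // Computing global IDFs across search servers and turning them into
  // per-term query boosts, as proposed above.
  class GlobalIdf {

    /**
     * docFreqsPerServer[s][t] = document frequency of term t on server s,
     * docsPerServer[s]        = number of documents indexed on server s.
     * Returns one boost per query term, to be multiplied into the TermQuery
     * boost (or summed over terms for a PhraseQuery).
     */
    static float[] termBoosts(int[][] docFreqsPerServer, int[] docsPerServer) {
      int terms = docFreqsPerServer[0].length;
      long totalDocs = 0;
      for (int s = 0; s < docsPerServer.length; s++) {
        totalDocs += docsPerServer[s];
      }

      float[] boosts = new float[terms];
      for (int t = 0; t < terms; t++) {
        long totalDf = 0;
        for (int s = 0; s < docFreqsPerServer.length; s++) {
          totalDf += docFreqsPerServer[s][t];
        }
        // Lucene-style idf; with the local idf() forced to 1.0f, this boost
        // carries the global weighting instead.
        boosts[t] = (float) (Math.log((double) totalDocs / (totalDf + 1)) + 1.0);
      }
      return boosts;
    }
  }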

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets

2005-12-28 Thread byron miller (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12361350 ] 

byron miller commented on NUTCH-134:


Where is the Lucene summarizer from contrib?  I'm not seeing anything 
obvious (unless it's under a different name).

 Summarizer doesn't select the best snippets
 ---

  Key: NUTCH-134
  URL: http://issues.apache.org/jira/browse/NUTCH-134
  Project: Nutch
 Type: Bug
   Components: searcher
 Versions: 0.7.1, 0.7, 0.7.2-dev, 0.8-dev
 Reporter: Andrzej Bialecki 


 Summarizer.java tries to select the best fragments from the input text, where 
 the frequency of query terms is the highest. However, the logic in line 223 
 is flawed in that the excerptSet.add() operation will add new excerpts only 
 if they are not already present - the test is performed using the Comparator 
 that compares only the numUniqueTokens. This means that if there are two or 
 more excerpts, which score equally high, only the first of them will be 
 retained, and the rest of equally-scoring excerpts will be discarded, in 
 favor of other excerpts (possibly lower scoring).
 To fix this the Set should be replaced with a List + a sort operation. To 
 keep the relative position of excerpts in the original order the Excerpt 
 class should be extended with an int order field, and the collected 
 excerpts should be sorted in that order prior to adding them to the summary.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-95) DeleteDuplicates depends on the order of input segments

2005-12-27 Thread byron miller (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-95?page=comments#action_12361300 ] 

byron miller commented on NUTCH-95:
---

Number 2 sounds great, but wouldn't you always want the latest scoring document 
since that should reflect the latest updatedb and rank of the page even if it's 
lower or higher?

 DeleteDuplicates depends on the order of input segments
 ---

  Key: NUTCH-95
  URL: http://issues.apache.org/jira/browse/NUTCH-95
  Project: Nutch
 Type: Bug
   Components: indexer
 Versions: 0.8-dev, 0.6, 0.7
 Reporter: Andrzej Bialecki 
 Assignee: Andrzej Bialecki 


 DeleteDuplicates depends on what order the input segments are processed, 
 which in turn depends on the order of segment dirs returned from 
 NutchFileSystem.listFiles(File). In most cases this is undesired and may lead 
 to deleting wrong records from indexes. The silent assumption that segments 
 at the end of the listing are more recent is not always true.
 Here's the explanation:
 * Dedup first deletes the URL duplicates by computing MD5 hashes for each 
 URL, and then sorting all records by (hash, segmentIdx, docIdx). SegmentIdx 
 is just an int index to the array of open IndexReaders - and if segment dirs 
 are moved/copied/renamed then entries in that array may change their  order. 
 And then for all equal triples Dedup keeps just the first entry. Naturally, 
 if segmentIdx is changed due to dir renaming, a different record will be kept 
 and different ones will be deleted...
 * then Dedup deletes content duplicates, again by computing hashes for each 
 content, and then sorting records by (hash, segmentIdx, docIdx). However, by 
 now we already have a different set of undeleted docs depending on the order 
 of input segments. On top of that, the same factor acts here, i.e. segmentIdx 
 changes when you re-shuffle the input segment dirs - so again, when identical 
 entries are compared the one with the lowest (segmentIdx, docIdx) is picked.
 Solution: use the fetched date from the first record in each segment to 
 determine the order of segments. Alternatively, modify DeleteDuplicates to 
 use the newer algorithm from SegmentMergeTool. This algorithm works by 
 sorting records using tuples of (urlHash, contentHash, fetchDate, score, 
 urlLength). Then:
 1. If urlHash is the same, keep the doc with the highest fetchDate  (the 
 latest version, as recorded by Fetcher).
 2. If contentHash is the same, keep the doc with the highest score, and then 
 if the scores are the same, keep the doc with the shortest url.
 Initial fix will be prepared for the trunk/ and then backported to the 
 release branch.
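
 A sketch of the (urlHash, contentHash, fetchDate, score, urlLength) ordering described above (illustrative record class and comparator in modern Java syntax, not the SegmentMergeTool code):

  // Sort duplicate candidates so that the record to KEEP comes first within
  // each group of identical hashes. The record class is made up.
  import java.util.Comparator;

  class DedupRecord {
    String urlHash;      // MD5 of the URL
    String contentHash;  // MD5 of the content
    long fetchDate;      // keep the highest (latest fetch)
    float score;         // keep the highest
    int urlLength;       // keep the shortest
  }

  class DedupComparator implements Comparator<DedupRecord> {
    public int compare(DedupRecord a, DedupRecord b) {
      int c = a.urlHash.compareTo(b.urlHash);
      if (c != 0) return c;
      c = a.contentHash.compareTo(b.contentHash);
      if (c != 0) return c;
      if (a.fetchDate != b.fetchDate)            // latest version first
        return a.fetchDate > b.fetchDate ? -1 : 1;
      if (a.score != b.score)                    // then highest score
        return a.score > b.score ? -1 : 1;
      return a.urlLength - b.urlLength;          // then shortest URL
    }
  }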

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-55) Create dmoz.org search plugin - incorporate the dmoz.org title/category/description if available

2005-12-27 Thread byron miller (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-55?page=comments#action_12361301 ] 

byron miller commented on NUTCH-55:
---

You can close this ticket, duplicate of ticket NUTCH-59

 Create dmoz.org search plugin - incorporate the dmoz.org 
 title/category/description if available 
 --

  Key: NUTCH-55
  URL: http://issues.apache.org/jira/browse/NUTCH-55
  Project: Nutch
 Type: New Feature
   Components: indexer, searcher
  Environment: all
 Reporter: byron miller
 Priority: Minor


 I am looking into the possibility of creating a dmoz.org plugin, so if you 
 seed from the dmoz.org rdf the data you pull in could be used to extend the 
 data you fetch.
 Possibilities:  Searchable dmoz.org data or nutch summary + dmoz.org category 
 in serps.
 Of course the data from dmoz.org isn't as descriptive as it used to be, but I 
 think being able to integrate the category and an href to a base url where the 
 category resolves would be a nice feature (and an homage to the dmoz.org data).

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



failure with crawl using 12/23 trunk

2005-12-23 Thread Byron Miller
Not sure if it's because I have some of the older 0.7.x
parameters for my plugins - did these change in trunk?

051223 194716
crawl-20051223193201/crawldb/current/part-0/data:0+809491
051223 194716  map 100%
051223 194717
crawl-20051223193201/linkdb/current/part-0/data:0+1270873
-adding
org.apache.nutch.indexer.basic.BasicIndexingFilter
-adding
org.apache.nutch.indexer.more.MoreIndexingFilter
051223 194717 found resource common-terms.utf8 at
file:/home/byron/n2/trunk/conf/common-terms.utf8
051223 194717 Optimizing index.
java.lang.NullPointerException
at
org.apache.nutch.indexer.basic.BasicIndexingFilter.filter(BasicIndexingFilter.java:75)
at
org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:63)
at
org.apache.nutch.crawl.Indexer.reduce(Indexer.java:217)
at
org.apache.nutch.mapred.ReduceTask.run(ReduceTask.java:260)
at
org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:90)
Exception in thread main java.io.IOException: Job
failed!
at
org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
at
org.apache.nutch.crawl.Indexer.index(Indexer.java:256)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:117)


Re: IndexSorter optimizer

2005-12-21 Thread Byron Miller
I've got a 400-million-page db I can run this against over the
next few days.

-byron

--- Stefan Groschupf [EMAIL PROTECTED] wrote:

 Hi Andrzej,
 
 Wow, that is really great news!
  Using the optimized index, I reported previously
 that some of the  
  top-scoring results were missing. As it happens,
 the missing  
  results were typically the junk pages with high
 tf/idf but low  
  boost. Since we collect up to N hits, going from
 higher to lower  
  boost values, the junk pages with low boost
 value were  
  automatically eliminated. So, overall the
 subjective quality of  
  results was improved. On the other hand, some of
 the legitimate  
  results with a decent boost values were also
 skipped because they  
  didn't fit within the fixed number of hits... ah,
 well. Perhaps we  
  should limit the number of hits in
 LimitedCollector using a cutoff  
  boost value, and not the maximum number of hits
 (or maybe both?).
 
 As far we experiment it would be good to have booth.
 
  To conclude, I will add the IndexSorter.java to
 the core classes,  
  and I suggest to continue the experiments ...
 
  Maybe someone out there in the community has a
  commercial search engine
  running (e.g. a Google appliance or similar), so we
  could set up a
  Nutch instance with the same pages and compare the results.
  I guess it will be difficult to compare Nutch with
  Yahoo or Google,
  since none of us has a 4-billion-page index up and
  running. I would run
  one on my laptop, but I do not have the bandwidth to
  fetch within the next
  two days. :-D
 Great work!
 
 Cheers,
 Stefan 
 



Re: [VOTE] Commiter access for Stefan Groschupf

2005-12-16 Thread Byron Miller
+1

Thanks for all the hard work! Very much appreciated

--- Andrzej Bialecki [EMAIL PROTECTED] wrote:

 Hi,
 
 During the past year and more Stefan participated
 actively in the
 development, and contributed many high-quality
 patches. He's been
 spending considerable effort on addressing many
 issues in JIRA, and
 proposing fixes and improvements.
 
 Apparently he has too much free time on his hands,
 and it's best to
 catch him now, before he realizes that there are
 other ways of spending
 time than hacking Nutch code... ;-)
 
 So, I'd like to call for a vote on adding Stefan as
 a commiter.
 
 
 -- 
 Best regards,
 Andrzej Bialecki 
   ___. ___ ___ ___ _ _  
 __
 [__ || __|__/|__||\/|  Information Retrieval,
 Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System
 Integration
 http://www.sigram.com  Contact: info at sigram dot
 com
 
 



[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets

2005-12-07 Thread byron miller (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12359649 ] 

byron miller commented on NUTCH-134:


I would take more CPU for better summaries any day :) CPU power is cheaper than 
manual intervention!

If any testing is needed, don't hesitate to drop me a patch. I've been working 
on a 500-million-page index using the mapred branch on a 10-node cluster, so I have 
plenty of numbers to test against.

 Summarizer doesn't select the best snippets
 ---

  Key: NUTCH-134
  URL: http://issues.apache.org/jira/browse/NUTCH-134
  Project: Nutch
 Type: Bug
   Components: searcher
 Versions: 0.7, 0.8-dev, 0.7.1, 0.7.2-dev
 Reporter: Andrzej Bialecki 


 Summarizer.java tries to select the best fragments from the input text, where 
 the frequency of query terms is the highest. However, the logic in line 223 
 is flawed in that the excerptSet.add() operation will add new excerpts only 
 if they are not already present - the test is performed using the Comparator 
 that compares only the numUniqueTokens. This means that if there are two or 
 more excerpts, which score equally high, only the first of them will be 
 retained, and the rest of equally-scoring excerpts will be discarded, in 
 favor of other excerpts (possibly lower scoring).
 To fix this the Set should be replaced with a List + a sort operation. To 
 keep the relative position of excerpts in the original order the Excerpt 
 class should be extended with an int order field, and the collected 
 excerpts should be sorted in that order prior to adding them to the summary.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



standard version of log4j

2005-11-07 Thread Byron Miller
Is there any way to make sure all plugins/modules
reference a standard version of log4j?  Seems to me
there are at least 3 different versions (though the differences are minor):

# find . | grep log4
./plugins/parse-pdf/log4j-1.2.9.jar
./plugins/parse-pdf/PDFBox-0.7.2-log4j.jar
./plugins/parse-rss/log4j-1.2.6.jar
./plugins/clustering-carrot2/log4j-1.2.11.jar




RE: Halloween Joke at Google

2005-11-02 Thread Byron Miller
I wish it did have something to do with Halloween :)

Google tells no lies! :P

--- Nick Lothian [EMAIL PROTECTED] wrote:

 If you just do the search you'll see a link at the
 side of the page:
 
 Why these results?
 These results may seem politically
 slanted. Here's what happened.
 www.google.com/googleblog
 
 which links to

http://googleblog.blogspot.com/2005/09/googlebombing-failure.html
 
 This particular Google Bomb has been around for
 quite a while. See
 http://en.wikipedia.org/wiki/Google_bomb (and has
 nothing to do with
 Halloween!)
 
 Nick 



RE: Halloween Joke at Google

2005-11-02 Thread Byron Miller
Actually, to add fuel to the fire, using Nutch out of
the box, searching for "miserable failure" yields the
same thing.

http://www.mozdex.com/search.jsp?query=miserablefailure

--- Fuad Efendi [EMAIL PROTECTED] wrote:

 Thanks Nick,
 
 So this is why some search engines are not honest. I
 mean the commercial
 policy of putting links on top of a search for extra
 money.
 
 This particular Google Bomb has been around for
 quite a while. See
 http://en.wikipedia.org/wiki/Google_bomb (and has
 nothing to do with
 Halloween!)
 
 Nick 
 
 



Re: Halloween Joke at Google

2005-11-02 Thread Byron Miller
We run with

fetchlist.score.by.link.count=true and
indexer.boost.by.link.count=true

We haven't run a standalone analyze, so it's just how the
database is updated when we run updatedb (per the
recommendations a few months back, when it was found to
give pretty darn close results!).
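
For reference, that setup looks roughly like this as nutch-site.xml overrides (property names as quoted above; defaults may differ between branches):

  <property>
    <name>fetchlist.score.by.link.count</name>
    <value>true</value>
  </property>
  <property>
    <name>indexer.boost.by.link.count</name>
    <value>true</value>
  </property>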

Even though my scale is still much smaller than
Google's, it is amazing how closely the results can
match!

Makes you wonder just how much of the net is useful
;)

-byron



--- Andrzej Bialecki [EMAIL PROTECTED] wrote:

 Byron Miller wrote:
 
 Actually, to add fuel to the fire, using nutch out
 of
 the box, searching for miserable failure yields the
 same thing.
 

http://www.mozdex.com/search.jsp?query=miserablefailure
 
   
 
 
 I'm curious... could you check if the anchors come
 from the same site, 
 or from different sites? Do you run with 
 fetchlist.score.by.link.count=true and
 indexer.boost.by.link.count=true?
 
 Anyway, that's how the PageRank is _supposed_ to
 work - it should give a 
 higher score to sites that are highly linked, and
 also it should 
 strongly consider the anchor text as an indication
 of the page's true 
 subject ... ;-)
 
 -- 
 Best regards,
 Andrzej Bialecki 
  ___. ___ ___ ___ _ _  
 __
 [__ || __|__/|__||\/|  Information Retrieval,
 Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System
 Integration
 http://www.sigram.com  Contact: info at sigram dot
 com
 
 
 



Re: NekoHTML 0.9.5

2005-11-01 Thread Byron Miller
I'll give TagSoup a try; I saw that was in there.

Thanks for the heads-up!
-byron

--- Andrzej Bialecki [EMAIL PROTECTED] wrote:

 Byron Miller wrote:
 

http://people.apache.org/~andyc/neko/doc/html/changes.html
 
 Any chance of getting that rolled in? Has a few
 fixes
 that look good. 
   
 
 
 Did you try using TagSoup? Some time ago I added to
 parse-html the 
 support for using TagSoup instead of NekoHTML (this
 is an option in the 
 config file). I found that in many cases TagSoup
 gives much better 
  results, especially for pages with multiple <html>
  or <body> elements, 
 where neko would give up...
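
 The config switch mentioned here is presumably the parser.html.impl property (as in later nutch-default.xml; the exact name in this snapshot may differ):

  <!-- nutch-site.xml: choose the HTML parser implementation;
       property name assumed from later releases -->
  <property>
    <name>parser.html.impl</name>
    <value>tagsoup</value>
    <description>HTML parser implementation: neko or tagsoup.</description>
  </property>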
 
 -- 
 Best regards,
 Andrzej Bialecki 
  ___. ___ ___ ___ _ _  
 __
 [__ || __|__/|__||\/|  Information Retrieval,
 Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System
 Integration
 http://www.sigram.com  Contact: info at sigram dot
 com
 
 
 



[jira] Commented: (NUTCH-39) pagination in search result

2005-10-30 Thread byron miller (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-39?page=comments#action_12356374 ] 

byron miller commented on NUTCH-39:
---

I'm using the above code snippet on mozdex and ran across some strange issues.

For example, if you search for cnn.com it doesn't show up at all, but if you search 
for site:www.cnn.com cnn and find all cnn within that subquery it works. I'm 
wondering if there are too many pages coming up for some results or something 
like that.  Anyone else using this snippet?  I like the way it works for the 
most part :)

I will try to enable a debug page to chase down which variables are acting up.

 pagination in search result
 ---

  Key: NUTCH-39
  URL: http://issues.apache.org/jira/browse/NUTCH-39
  Project: Nutch
 Type: Improvement
   Components: web gui
  Environment: all
 Reporter: Jack Tang
 Priority: Trivial


 Right now in Nutch's search.jsp, the user navigates all search results using the 
 Next button.  Google-like pagination would feel better.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-49) Flag for generate to fetch only new pages to complement the -refetchonly flag

2005-10-25 Thread byron miller (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-49?page=comments#action_12355864 ] 

byron miller commented on NUTCH-49:
---

Can something like this be adapted to use the regex filter as well? It would be 
nice to say "new only" and match URLs of x type, or with more than x link score, 
or some other expressions (not just the very topN).



 Flag for generate to fetch only new pages to complement the -refetchonly flag
 -

  Key: NUTCH-49
  URL: http://issues.apache.org/jira/browse/NUTCH-49
  Project: Nutch
 Type: New Feature
   Components: fetcher
 Reporter: Luke Baker
 Priority: Minor
  Attachments: fetchnewonly.patch

 It would be useful, especially for research/testing purposes, to have a flag 
 for the FetchListTool that makes sure to only include URLs in the fetchlist 
 that have not already been fetched (according to the information from the 
 webdb that you're generating the fetchlist from).

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira