Re: Image Search Engine Input (General storage of extra data for use by Nutch)

2007-03-30 Thread Ed Whittaker

Hi,

My question is not strictly to do with image search but I can't help feeling
the issue is somewhat related in terms of where to store what: I want to
spell correct web pages prior to indexing and only index the corrected
terms. I still want to store the original errorful text so this can be given
in the summary and highlighted accordingly. Queries would similarly be
corrected at query time. (The need for an alignment is to recover spelling
mistakes and miskeys like "inthe" and "Nutc h" etc. although actually it's
more relevant to CJK languages.)

Of course, this would be trivial if we didn't want to highlight the terms in
the summary but to do highlighting we need to know where the original terms
are in relation to the corrected terms (i.e. what we indexed). We therefore
need somewhere to store an alignment of the errorful original with the
corrected text. At the moment we're storing this as an extra field inside
the index but this is not a very elegant solution. Hence my related
question: is there a more appropriate location to store such data? And where
is the best place to do the spelling correction in the Nutch workflow? A
separate mapreduce job or inside the indexing reduce job as I'm doing now?
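
To make the alignment idea concrete, here is a minimal sketch in Python (not Nutch code). It assumes the alignment is stored as a list of (corrected term, start, end) triples, where start and end are character offsets into the original errorful text. The names Alignment and highlight, and the <b> markers, are hypothetical and chosen only for illustration; how the alignment is produced and where it is stored are left open.

```python
from typing import List, Set, Tuple

# Assumed layout: (corrected_term, start, end). The corrected term is what
# was indexed; [start, end) is the character span it came from in the
# original, uncorrected text.
Alignment = List[Tuple[str, int, int]]


def highlight(original: str, alignment: Alignment, query_terms: Set[str],
              pre: str = "<b>", post: str = "</b>") -> str:
    """Wrap the original spans whose corrected term matches a query term."""
    out = []
    pos = 0
    for term, start, end in sorted(alignment, key=lambda a: a[1]):
        # Skip terms that were not queried and spans overlapping an
        # already-emitted highlight.
        if term not in query_terms or start < pos:
            continue
        out.append(original[pos:start])
        out.append(pre + original[start:end] + post)
        pos = end
    out.append(original[pos:])
    return "".join(out)


# The miskeys from the message above: "inthe" and "Nutc h".
original = "Search inthe Nutc h index"
alignment = [
    ("search", 0, 6),
    ("in", 7, 9),     # first half of "inthe"
    ("the", 9, 12),   # second half of "inthe"
    ("nutch", 13, 19),  # covers "Nutc h", including the stray space
    ("index", 20, 25),
]

print(highlight(original, alignment, {"nutch"}))
# Search inthe <b>Nutc h</b> index
```

Because the spans point into the original text, the summary can show the text exactly as written while matching is done on the corrected terms only.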

-Ed

On 3/30/07, Doug Cutting <[EMAIL PROTECTED]> wrote:


Steve Severance wrote:
> I am not looking to really make an image retrieval engine. During
> indexing referencing docs will be analyzed and text content will be
> associated with the image. Currently I want to keep this in a separate
> index. So despite the fact that images will be returned the search will be
> against text data.

So do you just want to be able to reference the cached images?  In that
case, I think the images should stay in the content directory and be
accessed like cached pages.  The parse should just contain enough
metadata to index so that the images can be located in the cache.  I
don't see a reason to keep this in a separate index, but perhaps a
separate field instead?  Then when displaying hits you can look up
associated images and display them too.  Does that work?

Steve Severance wrote:
> I like Mathijs's suggestion about using a DB for holding thumbnails. I
> just want access to be in constant time since I am going to probably need to
> grab at least 10 and maybe 50 for each query. That can be kept in the plugin
> as an option or something like that. Does that have any ramifications for
> being run on Hadoop?

I'm not sure how a database solves scalability issues.  It seems to me
that thumbnails should be handled similarly to summaries.  They should
be retrieved in parallel from segment data in a separate pass once the
final set of hits to be displayed has been determined.  Thumbnails could
be placed in a directory per segment as a separate mapreduce pass.  I
don't see this as a parser issue, although perhaps it could be
piggybacked on that mapreduce pass, which also processes content.
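
The "separate pass once the final set of hits is known" idea can be sketched roughly as follows, in Python rather than Nutch code. The sketch assumes a hypothetical local layout of <root>/<segment>/thumbnails/<key>.jpg and a hit represented as a (segment, key) pair; real Nutch segment data lives in Hadoop file formats, so the file access here is only a stand-in. The names thumbnail_path, load_thumbnail and fetch_thumbnails are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Dict, List, Optional, Tuple

# Hypothetical hit shape: (segment name, document key).
Hit = Tuple[str, str]


def thumbnail_path(root: Path, hit: Hit) -> Path:
    """Assumed layout: one thumbnails directory per segment."""
    segment, key = hit
    return root / segment / "thumbnails" / (key + ".jpg")


def load_thumbnail(root: Path, hit: Hit) -> Optional[bytes]:
    """Return the thumbnail bytes, or None if the segment has none."""
    path = thumbnail_path(root, hit)
    return path.read_bytes() if path.is_file() else None


def fetch_thumbnails(root: Path, hits: List[Hit],
                     workers: int = 8) -> Dict[Hit, Optional[bytes]]:
    """Fetch thumbnails for the final hit list only, in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        data = list(pool.map(lambda h: load_thumbnail(root, h), hits))
    return dict(zip(hits, data))
```

The point of the design is that only the 10 to 50 hits actually displayed are looked up, and each lookup is independent, so the cost is bounded by the slowest single fetch rather than by the sum of all of them.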

Doug



[jira] Updated: (NUTCH-167) Observation of directive

2006-01-07 Thread Ed Whittaker (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-167?page=all ]

Ed Whittaker updated NUTCH-167:
---

Component: web gui
  Version: 0.7.1

> Observation of  directive
> -
>
>  Key: NUTCH-167
>  URL: http://issues.apache.org/jira/browse/NUTCH-167
>  Project: Nutch
> Type: Improvement
>   Components: indexer, web gui
> Versions: 0.7.1
> Reporter: Ed Whittaker
> Priority: Critical

>
> Though not strictly a bug, this issue is potentially serious for users of
> Nutch who deploy live systems and might be threatened with legal action for
> caching copies of copyrighted material. The major search engines all observe
> this directive (even though apparently it's not standard), so there's every
> reason why Nutch should too.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-167) Observation of directive

2006-01-07 Thread Ed Whittaker (JIRA)
Observation of  directive
-

 Key: NUTCH-167
 URL: http://issues.apache.org/jira/browse/NUTCH-167
 Project: Nutch
Type: Improvement
  Components: indexer  
Reporter: Ed Whittaker
Priority: Critical


Though not strictly a bug, this issue is potentially serious for users of Nutch
who deploy live systems and might be threatened with legal action for caching
copies of copyrighted material. The major search engines all observe this
directive (even though apparently it's not standard), so there's every reason why
Nutch should too.
