[ http://issues.apache.org/jira/browse/NUTCH-59?page=comments#action_12364165 ]

James Jonas commented on NUTCH-59:
----------------------------------

Stefan,

Spot on.

Use of HashMaps - very fast

Use of separate file instead of extending WebDB - good
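
To make those two points concrete, here is a minimal sketch (hypothetical
names, plain JDK classes rather than Nutch's own io package) of a MetaDB that
keeps page metadata in an in-memory HashMap keyed by URL and persists it to
its own file, leaving WebDB itself untouched:

import java.io.*;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical sketch only: a metadata store kept in a file separate from
// WebDB. A HashMap gives fast in-memory lookups; plain JDK serialization
// stands in for whatever on-disk format the real patch uses.
public class MetaDB {

  // url -> (metadata key -> value)
  private final HashMap<String, Properties> pages =
    new HashMap<String, Properties>();
  private final File file;

  public MetaDB(File file) throws IOException {
    this.file = file;
    if (file.exists()) {
      load();
    }
  }

  public void put(String url, String key, String value) {
    Properties meta = pages.get(url);
    if (meta == null) {
      meta = new Properties();
      pages.put(url, meta);
    }
    meta.setProperty(key, value);
  }

  // Returns null when a page carries no such metadata - no dead space.
  public String get(String url, String key) {
    Properties meta = pages.get(url);
    return (meta == null) ? null : meta.getProperty(key);
  }

  private void load() throws IOException {
    ObjectInputStream in = new ObjectInputStream(new FileInputStream(file));
    try {
      pages.putAll((Map<String, Properties>) in.readObject());
    } catch (ClassNotFoundException e) {
      throw new IOException(e.toString());
    } finally {
      in.close();
    }
  }

  // Flush the whole map back to the MetaDB's own file.
  public void close() throws IOException {
    ObjectOutputStream out =
      new ObjectOutputStream(new FileOutputStream(file));
    out.writeObject(pages);
    out.close();
  }
}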

Background
Initially this will help limit the size of the MetaDB (the separate file). For 
example, the association of DMOZ topics to Pages would only be one-to-one on 
the first fetch. On subsequent fetches, other websites outside the DMOZ list 
would then carry a blank topic for that field, filling up needless space in 
WebDB. (Some databases are more efficient at managing this kind of dead space; 
Lucene may be one of them.)

The next scenario is adding a new metadata association (simple location - 
city, state(province), country). Here the MetaDB (a temporary name for the 
convenience of discussion) would only relate to the Regional section of the 
DMOZ list, but some of the non-DMOZ pages would also have such a Location 
association. This leads to the question of potentially splitting the file into 
multiple files, one per metadata artifact (topic, location). As the list of 
metadata artifacts grows, so does the number of files. This dance between 
denormalized data (single big files) and normalized data (many smaller files 
with complex relationships) will, over time, impact the speed of queries. The 
performance penalty associated with metadata becomes even more exacerbated 
when you move into metadata repositories, which persist both the metadata and 
the model of the metadata (the customer now rolls his eyes back and passes out 
as you continue speaking of meta-meta models).
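
To illustrate the split under discussion with a hypothetical sketch: give each
artifact (topic, location) its own store, so a non-DMOZ page simply has no
entry in the topic store instead of a blank field. Reusing the MetaDB class
sketched earlier:

import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the "one file per metadata artifact" layout.
// A page that lacks an artifact has no row in that artifact's store at
// all, so no dead space accrues in any of the files.
public class MetaStoreRegistry {

  private final Map<String, MetaDB> storesByArtifact =
    new HashMap<String, MetaDB>();
  private final File dir;

  public MetaStoreRegistry(File dir) {
    this.dir = dir;
  }

  // Lazily opens one store file per artifact (topic.db, location.db, ...).
  public MetaDB storeFor(String artifact) throws IOException {
    MetaDB db = storesByArtifact.get(artifact);
    if (db == null) {
      db = new MetaDB(new File(dir, artifact + ".db"));
      storesByArtifact.put(artifact, db);
    }
    return db;
  }
}

The trade-off described above shows up directly here: every new artifact
costs another file, and a query that touches several artifacts has to open
several stores.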

That being said, for simplicity's sake, I would not get too far ahead of the 
game. Your decision to use a single separate file gets the job done. Changes 
to the other components (index, QueryFilter) to handle Extensible Metadata 
seem like the higher priority. I just wanted to give you a flavor of how 
metadata stores grow from simple to complex, and to note that some planning is 
often helpful in order to avoid small hiccups in the user's migration from one 
set of simple metadata stores into more complex structures. Normally 
applications go through a series of learning experiences as they move up the 
complexity slope for metadata. (Sometimes these applications (companies) 
actually survive - several don't.)

Quick HOW TO for building a metadata store:
- Write down a list of metadata that you think you may wish to store
- Map this list to Use Cases that create specific value for the user
- For each metadata artifact, assign it a standard priority (must have, should 
have, could have, won't have) (or a, b, c - red, white, blue - whatever) based 
on your use cases
- Define the API containing only links to the metadata that seems most useful 
(the must-haves)
- Define a simple metadata model to contain that short list of metadata 
exposed in your API
- Define and implement the physical model to support that API. The semantics 
of the model will normally be greater than what is exposed
- Keep the API stable; grow the underlying physical model. Do Not Expose the 
physical model (a sketch of this rule follows the list)
- Carefully expand the scope of the API based on what creates real value for 
the user
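
As a sketch of those last two points (hypothetical names again): expose one
small, stable interface and keep every storage decision behind it, so the
physical model can be reorganized - single file, file per artifact, a full
repository - without breaking callers:

// Hypothetical sketch of the stable-API rule: callers (index, QueryFilter)
// only ever see this interface. No storage type leaks through, so the
// physical model behind it can change radically without touching clients.
public interface PageMetadata {

  // Returns the value of one metadata artifact, or null if the page has none.
  String get(String url, String artifact);

  // Associates one metadata artifact with a page.
  void put(String url, String artifact, String value);
}

An index or QueryFilter written against PageMetadata keeps working when the
store underneath is split or merged; only a new implementation of the
interface ships.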

What happens is that the underlying model will change radically over time and 
will often become the limiting factor in your persistence of more complex 
metadata artifacts. (Think of a person inside a hierarchical organization with 
matrixed relationships and associations to both titles and roles - yuck - it 
can get fun very quickly.) Most applications bind their software tightly to 
the physical metamodel (it's easy - just expose it). The result is unsatisfied 
customers as the metamodel has to change over time. Competition usually swoops 
in, since they can green-field their metamodels while you are stuck supporting 
the semantics of your previous application.

My 2 cents' worth of comments.

PS
I'm very interested in testing out your DMOZ Topic Metadata Extension on .8. I 
have a couple of websites that might find a use for it.

Thanks,
James

> meta data support in webdb
> --------------------------
>
>          Key: NUTCH-59
>          URL: http://issues.apache.org/jira/browse/NUTCH-59
>      Project: Nutch
>         Type: New Feature
>     Reporter: Stefan Groschupf
>     Priority: Minor
>  Attachments: webDBMetaDataPatch.txt
>
> Meta data support in the web db would be very useful for a new set of nutch 
> features that need long-lived meta data. 
> Currently, page meta data needs to be regenerated or looked up every 30 days 
> when a page is re-fetched; in a long-lived context, web db meta data would 
> bring a dramatic performance improvement for such tasks. 
> Furthermore, storage of meta data in the webdb would make a new generation 
> of linklist generation filters possible.  
