[ http://issues.apache.org/jira/browse/NUTCH-59?page=comments#action_12364165 ]
James Jonas commented on NUTCH-59:
----------------------------------

Stefan,

Spot on.
- Use of HashMaps - very fast
- Use of a separate file instead of extending WebDB - good

Background

Initially this will help limit the size of the MetaDB (the separate file). For example, the association of DMOZ topics to pages would only be one-to-one on the first fetch. On subsequent fetches, other websites outside the DMOZ list would then contain a blank topic for that field, filling WebDB with needless dead space. (Some databases are more efficient at managing this type of dead space; Lucene may be one of them.)

The next scenario is adding a new metadata association (a simple location: city, state (province), country). Here the MetaDB (a temporary name for the convenience of discussion) would only relate to the Regional section of the DMOZ list, but some of the non-DMOZ pages would also have such a Location association. This leads to the question of potentially splitting the file into multiple files, one per metadata artifact (topic, location). As the list of metadata artifacts grows, so does the number of files. This dance between denormalized data (single big files) and normalized data (many smaller files with complex relationships) will, over time, impact query speed. This kind of performance penalty can be even more exacerbated when you move into metadata repositories, which persist both the metadata and the model of the metadata (the customer now rolls his eyes back and passes out as you continue speaking of meta-meta models).

That being said, for simplicity's sake, I would not get too far ahead of the game. Your decision to use a single separate file gets the job done. Changes to the other components (index, QueryFilter) to handle extensible metadata seem like the higher priority.
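To make the "separate, sparse file" idea concrete, here is a minimal sketch of what such a MetaDB could look like in Java. The class and method names are invented for illustration and are not part of the actual NUTCH-59 patch; the point is that metadata is stored per URL only when present, so non-DMOZ pages never pay for a blank topic field.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a MetaDB kept separate from WebDB.
// Metadata artifacts (topic, location, ...) are stored sparsely per URL,
// so pages without a DMOZ topic consume no space at all.
public class MetaDB {
    // url -> (artifact name -> value); absent entries cost nothing
    private final Map<String, Map<String, String>> store = new HashMap<>();

    public void put(String url, String artifact, String value) {
        store.computeIfAbsent(url, k -> new HashMap<>()).put(artifact, value);
    }

    // Returns null when the page has no such metadata; no blank field is stored.
    public String get(String url, String artifact) {
        Map<String, String> meta = store.get(url);
        return meta == null ? null : meta.get(artifact);
    }

    public static void main(String[] args) {
        MetaDB db = new MetaDB();
        db.put("http://example.org/", "topic", "Computers/Software");
        System.out.println(db.get("http://example.org/", "topic"));
        // A page outside the DMOZ list: nothing stored, nothing wasted.
        System.out.println(db.get("http://other.org/", "topic"));
    }
}
```

An on-disk version would replace the in-memory HashMap with a persistent map keyed by URL, but the sparse shape of the data stays the same.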
I just wanted to give you a flavor for how metadata stores grow from simple to complex, and to note that some planning is often helpful to avoid small hiccups when users migrate from a set of simple metadata stores into more complex structures. Normally, applications go through a series of learning experiences as they move up the complexity slope for metadata. (Sometimes these applications (companies) actually survive - several don't.)

Quick HOW TO for building a metadata store:
- Write down a list of metadata that you think you may wish to store
- Map this list to use cases that create specific value for the user
- For each metadata artifact, assign it the standard rating (must have, should have, could have, won't have) (or a/b/c, red/white/blue - whatever) based on your use cases
- Define the API containing only a link to the metadata that seems most useful (the must-haves)
- Define a simple metadata model to contain that short list of metadata exposed in your API
- Define and implement the physical model to support that API. The semantics of the model will normally be greater than what is exposed
- Keep the API stable and grow the underlying physical model. Do not expose the physical model
- Carefully expand the scope of the API based on what creates real value for the user

What happens is that the underlying model will change radically over time and will often become the limiting factor in your persistence of more complex metadata artifacts. (Think of a person inside a hierarchical organization with matrixed relationships and associations to both titles and roles - yuck - it can get fun very quickly.) Most applications bind their software tightly to the physical metamodel (it's easy - just expose it). The result is unsatisfied customers as the metamodel has to change over time. Competition usually swoops in, since they can green-field their metamodels while you are stuck supporting the semantics of your previous application.
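The HOW TO steps above can be sketched as a small example: a stable metadata API that exposes only the "must have" artifacts, with the physical model hidden behind it so it can change later without breaking callers. All names here are invented for the sketch and are not from Nutch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// The stable API: only the must-have artifact (topic) is exposed.
// The storage model itself never leaks through this interface.
interface PageMetadata {
    Optional<String> topic(String url);
    void setTopic(String url, String topic);
}

// One possible physical model. It is free to change radically
// (e.g. to an on-disk file per artifact) without touching the API.
class HashMapPageMetadata implements PageMetadata {
    private final Map<String, String> topics = new HashMap<>();

    public Optional<String> topic(String url) {
        return Optional.ofNullable(topics.get(url));
    }

    public void setTopic(String url, String topic) {
        topics.put(url, topic);
    }

    public static void main(String[] args) {
        PageMetadata md = new HashMapPageMetadata();
        md.setTopic("http://example.org/", "Regional/Europe");
        System.out.println(md.topic("http://example.org/").orElse("(none)"));
        System.out.println(md.topic("http://unknown.org/").orElse("(none)"));
    }
}
```

Callers only ever see `PageMetadata`, which is what makes it safe to grow the underlying model over time.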
2 cents' worth of comments.

PS: I'm very interested in testing out your DMOZ Topic Metadata Extension on .8. I have a couple of websites that might find a use for it.

Thanks,
James

> meta data support in webdb
> --------------------------
>
>          Key: NUTCH-59
>          URL: http://issues.apache.org/jira/browse/NUTCH-59
>      Project: Nutch
>         Type: New Feature
>     Reporter: Stefan Groschupf
>     Priority: Minor
>  Attachments: webDBMetaDataPatch.txt
>
> Meta data support in web db would be very useful for a new set of Nutch
> features that need long-lived meta data.
> Currently, page meta data needs to be regenerated or looked up every 30 days
> when a page is re-fetched; for long-lived data, web db meta data would bring
> a dramatic performance improvement for such tasks.
> Furthermore, storage of meta data in webdb would make a new generation of
> linklist generation filters possible.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
