Hi Fergus,
The idea is that we have something generic which is applicable to
a large set of users. If the manifest is a text file, it can be read in
some standard way (say, line by line). So we can have an EntityProcessor
which reads a text file line by line and filters the lines by a regex,
the way 'grep' works.
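
As a rough sketch of the idea only (written in Python rather than DIH's
Java, with a made-up function name, made-up row keys, and made-up
manifest lines; it is not an existing DIH class), a grep-like line
reader could look something like this:

```python
import re
from typing import Iterable, Iterator


def grep_lines(lines: Iterable[str], pattern: str) -> Iterator[dict]:
    """Yield one row per line that matches `pattern`, much as grep would.

    `grep_lines`, "rawLine" and "groups" are hypothetical names chosen
    for this sketch.
    """
    regex = re.compile(pattern)
    for raw in lines:
        line = raw.rstrip("\r\n")
        match = regex.search(line)
        if match:
            # Expose the whole line plus any capture groups, so a later
            # step can pick the file name out of the matched line.
            yield {"rawLine": line, "groups": match.groups()}


if __name__ == "__main__":
    # Made-up manifest lines, for illustration only.
    manifest = ["x data/a.xml\n", "x data/readme.txt\n", "x data/b.xml\n"]
    for row in grep_lines(manifest, r"^x (.*\.xml)$"):
        print(row["groups"][0])
```

Each emitted row could then feed a nested entity that opens the named
file, so the manifest-specific part is reduced to choosing a regex.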

On Mon, Mar 9, 2009 at 10:44 PM, Fergus McMenemie <fer...@twig.me.uk> wrote:
>>Manifest processing has a very limited use case. Why can't it be
>>processed using a PlainTextEntityProcessor, with a Transformer written
>>to read lines using a regex?
>>
> Ehmmm Ok. The PlainTextEntityProcessor docs do not give me enough
> insight to see how this could be used to index each of the files
> listed by a 'tar xvf' report. Can you explain further?
>
> About the limited use case. Verity thought it was useful enough
> to have their own "bulk insert file" or bif file format that
> did the same and was far less flexible.
>
> In my experience we generally start off with some kind of
> file walker or crawler looking after file repositories. But
> these always proved slow and unreliable, and over time they
> were always replaced with some kind of manifest-based
> control of the indexer. Where we could get a report of changes
> we always used it, and only relied on walkers or crawlers
> where we had to.
>
> Fergus
>
>>
>>--Noble
>>
>>On Mon, Mar 9, 2009 at 8:30 PM, Fergus McMenemie <fer...@twig.me.uk> wrote:
>>> Hello,
>>>
>>> I have almost finished a new DIH EntityProcessor which
>>> I am calling the ManifestEntityProcessor. It is designed
>>> around the idea that whatever daemon is used to maintain
>>> your set of a few 100,000 XML documents, it is likely to
>>> drop a report or log file explaining what has been changed
>>> within your content store. This assumes a file-based
>>> content repository.
>>>
>>> The ManifestEntityProcessor is used as follows:
>>>
>>>       <entity name="jc"
>>>               processor="ManifestEntityProcessor"
>>>               baseDir="/Volumes/Techmore/ts/aaa/schema/data"
>>>               rootEntity="false"
>>>               dataSource="null"
>>>
>>>               allowRegex="^.*\.xml$"
>>>               manifestFileName="/Volumes/ts/man-find.txt"
>>>               manifestAddRegex="(.*)$"
>>>               >
>>>
>>> The idea is you have a log file or other report, perhaps
>>> from tar or zip, and you wish to use this to control the
>>> indexing of the new content. The new entity fields are as
>>> follows.
>>>
>>> manifestFileName is the name of the manifest file. If
>>>                 this value is relative, it is assumed to
>>>                 be relative to baseDir. Required.
>>>
>>> manifestAddRegex is a required regex to identify lines
>>>                 which when matched should cause docs to
>>>                 be added to the index.
>>>
>>> manifestDelRegex is an optional value of a regex to
>>>                 identify documents which when matched should
>>>                 be deleted from the index **PLANNED**
>>>
>>> allowRegex       a required regex to identify the portion
>>>                 of the ADD/DELete line identified above
>>>                 which contains the file or pathname to be
>>>                 ADDed or DELeted. If the resulting value is
>>>                 relative, it is assumed to be relative to
>>>                 baseDir.
>>>
>>> What do I do next?
>>>   Raise a JIRA issue and add the code?
>>>   Is DIH the right place to add this?
>>>   Suggestions for a different name?
>>>   Suggestions on how to do the delete bitty from within an entity?
>>>
>>> Regards Fergus.
>>--Noble Paul
>
> --
>
> ===============================================================
> Fergus McMenemie               Email:fer...@twig.me.uk
> Techmore Ltd                   Phone:(UK) 07721 376021
>
> Unix/Mac/Intranets             Analyst Programmer
> ===============================================================
>



-- 
--Noble Paul
