Re: a new DIH manifestEnityProcessor SOLR-1060 on jira

Fergus McMenemie Tue, 10 Mar 2009 11:24:50 -0700

OK, SOLR-1060 created.

>To these requirements I would add a basic one: this file
>(which Fergus calls the manifest, a name I still don't agree with)
>represents an update-set, and there should be a delete-set as well.
>
>A ChangeSetEntityProcessor is something I would jump at with both feet.
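
For illustration only, the update-set/delete-set idea could be sketched as below. Nothing here exists in DIH: the class name ChangeSetSketch, the two regex parameters, and the "A path" / "D path" line format used in the note afterwards are all assumptions. Each regex is assumed to capture the path in group 1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not DIH code: splits report lines into an
// update-set and a delete-set.
final class ChangeSetSketch {
    final List<String> updates = new ArrayList<>();
    final List<String> deletes = new ArrayList<>();

    // Both patterns are assumed to capture the file path in group 1.
    static ChangeSetSketch parse(List<String> lines,
                                 Pattern updateRegex,
                                 Pattern deleteRegex) {
        ChangeSetSketch changeSet = new ChangeSetSketch();
        for (String line : lines) {
            // Deletes are checked first so a line is never counted twice.
            Matcher del = deleteRegex.matcher(line);
            if (del.find()) {
                changeSet.deletes.add(del.group(1));
                continue;
            }
            Matcher upd = updateRegex.matcher(line);
            if (upd.find()) {
                changeSet.updates.add(upd.group(1));
            }
            // Lines matching neither pattern are ignored.
        }
        return changeSet;
    }
}
```

With the assumed patterns "^A (.*)$" and "^D (.*)$", a line "A docs/a.xml" would land in updates and "D docs/old.xml" in deletes.
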
>
>paul
>
>
>On 10 Mar 2009, at 05:40, Noble Paul wrote:
>
>> Hi Fergus, open a JIRA issue anyway. Put in your thoughts and we can
>> refine the requirements as part of the discussion.
>>
>> Basically, the requirements are:
>> 1) read a file line by line
>> 2) filter out lines (include or exclude) based on a regex
>> 3) extract parts (named parts) from the line using another regex
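
The three requirements above can be sketched in plain Java roughly as follows. This is a minimal illustration, not the DIH API: LineFilterSketch and its parameters are hypothetical names, and the lines are assumed to have already been read from the file one at a time.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of DIH: filters lines by regex and
// extracts named parts from each line that survives.
final class LineFilterSketch {

    // include: keep a line only if this pattern is found in it (null keeps all).
    // exclude: drop a line if this pattern is found in it (null drops none).
    // extract: pattern with named groups, applied to each kept line.
    // groupNames: names of the groups to copy into each result row.
    static List<Map<String, String>> process(List<String> lines,
                                             Pattern include,
                                             Pattern exclude,
                                             Pattern extract,
                                             List<String> groupNames) {
        List<Map<String, String>> rows = new ArrayList<>();
        for (String line : lines) {                       // 1) line by line
            if (include != null && !include.matcher(line).find()) {
                continue;                                 // 2) include filter
            }
            if (exclude != null && exclude.matcher(line).find()) {
                continue;                                 // 2) exclude filter
            }
            Matcher m = extract.matcher(line);
            if (!m.find()) {
                continue;                                 // nothing to extract
            }
            Map<String, String> row = new LinkedHashMap<>();
            for (String name : groupNames) {
                row.put(name, m.group(name));             // 3) named parts
            }
            rows.add(row);
        }
        return rows;
    }
}
```

For example, an extract pattern such as "^x (?<path>.*)$" with the group name "path" would pull the filename out of each line of a listing whose lines start with "x "; that line format is itself only an assumption.
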
>>
>> Noble
>>
>>
>> On Tue, Mar 10, 2009 at 1:50 AM, Fergus McMenemie  
>> <fer...@twig.me.uk> wrote:
>>>> Hi Fergus,
>>>> The idea is that we have something generic which can be applicable to
>>>> a large set of users. If the manifest is a text file, it can be read in
>>>> some standard way (say, line by line). So we can have an EntityProcessor
>>>> which reads a text file line by line and filters it by a regex, the way
>>>> 'grep' works.
>>> Yes. That is what I have written. It is just an alternate form of the
>>> FileListEntityProcessor, except that rather than walking the file system
>>> it reads from a file, line by line, and identifies the portion of the
>>> line containing the filename using a regexp.
>>>
>>>
>>>>
>>>> On Mon, Mar 9, 2009 at 10:44 PM, Fergus McMenemie  
>>>> <fer...@twig.me.uk> wrote:
>>>>>> Manifest processing has a very limited use case. Why can't it be
>>>>>> processed using a PlainTextEntityProcessor, with a Transformer written
>>>>>> to read lines using a regex?
>>>>>>
>>>>> Ehmmm Ok. The PlainTextEntityProcessor docs do not give me enough
>>>>> insight to see how this could be used to index each of the files
>>>>> listed by a 'tar xvf' report. Can you explain further?
>>>>>
>>>>> About the limited use case: Verity thought it was useful enough
>>>>> to have their own "bulk insert file" (bif) format that
>>>>> did the same and was far less flexible.
>>>>>
>>>>> In my experience we generally start off with some kind of
>>>>> file walker or crawler looking after file repositories. But
>>>>> these always proved slow and unreliable, and over time they
>>>>> were always replaced with some kind of manifest-based
>>>>> control of the indexer. Where we could get a report of changes
>>>>> we always used it, and only relied on walkers or crawlers
>>>>> where we had to.
>>>>>
>>>>> Fergus
>>>>>
>>>>>>
>>>>>> --Noble
>>>>>>
>>>>>> On Mon, Mar 9, 2009 at 8:30 PM, Fergus McMenemie <fer...@twig.me.uk> wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> I have almost finished a new DIH EntityProcessor which
>>>>>>> I am calling the ManifestEntityProcessor. It is designed
>>>>>>> around the idea that whatever daemon is used to maintain
>>>>>>> your set of a few hundred thousand XML documents, it is likely to
>>>>>>> drop a report or log file explaining what has been changed
>>>>>>> within your content store. This assumes a file-based
>>>>>>> content repository.
>>>>>>>
>>>>>>> The ManifestEntityProcessor is used as follows:
>>>>>>>
>>>>>>>       <entity name="jc"
>>>>>>>               processor="ManifestEntityProcessor"
>>>>>>>               baseDir="/Volumes/Techmore/ts/aaa/schema/data"
>>>>>>>               rootEntity="false"
>>>>>>>               dataSource="null"
>>>>>>>
>>>>>>>               allowRegex="^.*\.xml$"
>>>>>>>               manifestFileName="/Volumes/ts/man-find.txt"
>>>>>>>               manifestAddRegex="(.*)$"
>>>>>>>               >
>>>>>>>
>>>>>>> The idea is you have a log file or other report, perhaps
>>>>>>> from tar or zip, and you wish to use this to control the
>>>>>>> indexing of the new content. The new entity fields are as
>>>>>>> follows.
>>>>>>>
>>>>>>> manifestFileName is the name of the manifest file. If
>>>>>>>                 this value is relative, it is assumed to
>>>>>>>                 be relative to baseDir. Required.
>>>>>>>
>>>>>>> manifestAddRegex is a required regex to identify lines
>>>>>>>                 which, when matched, should cause docs to
>>>>>>>                 be added to the index.
>>>>>>>
>>>>>>> manifestDelRegex is an optional regex to identify
>>>>>>>                 documents which, when matched, should be
>>>>>>>                 deleted from the index. **PLANNED**
>>>>>>>
>>>>>>> allowRegex       is a required regex to identify the portion
>>>>>>>                 of the ADD/DELete line identified above
>>>>>>>                 which contains the file or pathname to be
>>>>>>>                 ADDed or DELeted. If the resulting value is
>>>>>>>                 relative, it is assumed to be relative to
>>>>>>>                 baseDir.
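
One possible reading of the field descriptions above, sketched in plain Java. This is not the code attached to SOLR-1060: the class ManifestSketch and the method filesToAdd are hypothetical, and the sketch assumes that the text matched by allowRegex is taken as the path, which the description does not spell out. Deletes are left out because manifestDelRegex is only planned.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the described semantics, not the actual processor.
final class ManifestSketch {

    // Returns the files that the manifest lines say should be added.
    static List<Path> filesToAdd(List<String> manifestLines,
                                 Path baseDir,
                                 Pattern manifestAddRegex,
                                 Pattern allowRegex) {
        List<Path> result = new ArrayList<>();
        for (String line : manifestLines) {
            // manifestAddRegex decides whether the line is an ADD line.
            if (!manifestAddRegex.matcher(line).find()) {
                continue;
            }
            // allowRegex picks out the portion holding the file or pathname
            // (assumption: the whole match is the path).
            Matcher m = allowRegex.matcher(line);
            if (!m.find()) {
                continue;
            }
            // Path.resolve returns the argument unchanged when it is
            // absolute, so only relative paths end up under baseDir.
            result.add(baseDir.resolve(Paths.get(m.group())));
        }
        return result;
    }
}
```

With the attribute values from the entity example (manifestAddRegex "(.*)$" and allowRegex "^.*\.xml$"), a manifest line "sub/a.xml" would resolve under baseDir, while a line such as "notes.txt" would be skipped.
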
>>>>>>>
>>>>>>> What do I do next?
>>>>>>>   Raise a JIRA issue and add the code?
>>>>>>>   Is DIH the right place to add this?
>>>>>>>   Suggestions for a different name?
>>>>>>>   Suggestions on how to do the delete bit from within an entity?
>>>>>>>
>>>>>>> Regards Fergus.
>>>>>> --Noble Paul
>>>>>
>>>>> --
>>>>>
>>>>> ===============================================================
>>>>> Fergus McMenemie               Email:fer...@twig.me.uk
>>>>> Techmore Ltd                   Phone:(UK) 07721 376021
>>>>>
>>>>> Unix/Mac/Intranets             Analyst Programmer
>>>>> ===============================================================
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> --Noble Paul
>>>
>>>
>>
>>
>>
>> -- 
>> --Noble Paul
>
>
>


