OK, SOLR-1060 created.

>To this requirement I would add the basic requirement that this file
>(what Fergus calls the manifest, a name I still don't agree with)
>represents an update-set and that there should be a delete-set as well.
>
>ChangeSetEntityProcessor, on there I would jump with two feet.
>
>paul
>
>
>On 10 March 2009 at 05:40, Noble Paul wrote:
>
>> Hi Fergus, open a JIRA issue anyway. Put in your thoughts and we can
>> refine the requirements as part of the discussion.
>>
>> Basically the requirements are:
>> 1) read a file line by line
>> 2) filter out lines (include or exclude) based on a regex
>> 3) extract parts (named parts) from the line using another regex
>>
>> Noble
>>
>>
>> On Tue, Mar 10, 2009 at 1:50 AM, Fergus McMenemie
>> <fer...@twig.me.uk> wrote:
>>>> Hi Fergus,
>>>> The idea is that we have something generic which can be applicable
>>>> to a large set of users. If the manifest is a text file it can be
>>>> read in some standard way (say line by line). So we can have an
>>>> EntityProcessor which reads a text file line by line and filters
>>>> it by a regex, the way 'grep' works.
>>> Yes. That is what I have written. It is just an alternate form of the
>>> FileListEntityProcessor except that rather than walking the file
>>> system it reads from a file, line by line, and identifies the portion
>>> of the line containing the filename using a regexp.
>>>
>>>
>>>>
>>>> On Mon, Mar 9, 2009 at 10:44 PM, Fergus McMenemie
>>>> <fer...@twig.me.uk> wrote:
>>>>>> Manifest processing has a very limited use case. Why can't it be
>>>>>> processed using a PlainTextEntityProcessor, with a Transformer
>>>>>> written to read lines using a regex?
>>>>>>
>>>>> Ehmmm OK. The PlainTextEntityProcessor docs do not give me enough
>>>>> insight to see how this could be used to index each of the files
>>>>> listed by a 'tar xvf' report. Can you explain further?
>>>>>
>>>>> About the limited use case.
>>>>> Verity thought it was useful enough
>>>>> to have their own "bulk insert file" or bif file format that
>>>>> did the same and was far less flexible.
>>>>>
>>>>> In my experience we generally start off with some kind of
>>>>> file walker or crawler looking after file repositories. But
>>>>> these always proved slow and unreliable and over time they
>>>>> were always replaced with some kind of manifest-based
>>>>> control of the indexer. Where we could get a report of changes
>>>>> we always used it, and only relied on walkers or crawlers
>>>>> where we had to.
>>>>>
>>>>> Fergus
>>>>>
>>>>>>
>>>>>> --Noble
>>>>>>
>>>>>> On Mon, Mar 9, 2009 at 8:30 PM, Fergus McMenemie
>>>>>> <fer...@twig.me.uk> wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> I have almost finished a new DIH EntityProcessor which
>>>>>>> I am calling the ManifestEntityProcessor. It is designed
>>>>>>> around the idea that whatever daemon is used to maintain
>>>>>>> your set of a few 100,000 XML documents, it is likely to
>>>>>>> drop a report or log file explaining what has been changed
>>>>>>> within your content store. This assumes a file-based
>>>>>>> content repository.
>>>>>>>
>>>>>>> The ManifestEntityProcessor is used as follows:
>>>>>>>
>>>>>>> <entity name="jc"
>>>>>>>         processor="ManifestEntityProcessor"
>>>>>>>         baseDir="/Volumes/Techmore/ts/aaa/schema/data"
>>>>>>>         rootEntity="false"
>>>>>>>         dataSource="null"
>>>>>>>
>>>>>>>         allowRegex="^.*\.xml$"
>>>>>>>         manifestFileName="/Volumes/ts/man-find.txt"
>>>>>>>         manifestAddRegex="(.*)$"
>>>>>>>         >
>>>>>>>
>>>>>>> The idea is you have a log file or other report, perhaps
>>>>>>> from tar or zip, and you wish to use this to control the
>>>>>>> indexing of the new content. The new entity fields are as
>>>>>>> follows.
>>>>>>>
>>>>>>> manifestFileName is the name of the manifest file. If
>>>>>>>                  this value is relative, it is assumed to
>>>>>>>                  be relative to baseDir. Required.
>>>>>>>
>>>>>>> manifestAddRegex is a required regex to identify lines
>>>>>>>                  which when matched should cause docs to
>>>>>>>                  be added to the index.
>>>>>>>
>>>>>>> manifestDelRegex is an optional regex to identify lines
>>>>>>>                  naming documents which when matched should
>>>>>>>                  be deleted from the index **PLANNED**
>>>>>>>
>>>>>>> allowRegex       a required regex to identify the portion
>>>>>>>                  of the ADD/DELete line identified above
>>>>>>>                  which contains the file or pathname to be
>>>>>>>                  ADDed or DELeted. If the resulting value
>>>>>>>                  is relative, it is assumed to be relative
>>>>>>>                  to baseDir.
>>>>>>>
>>>>>>> What do I do next?
>>>>>>>   Raise a JIRA issue and add the code?
>>>>>>>   Is DIH the right place to add this?
>>>>>>>   Suggestions for a different name?
>>>>>>>   Suggestions on how to do the delete bit from within an
>>>>>>>   entity?
>>>>>>>
>>>>>>> Regards Fergus.
>>>>>> --Noble Paul
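
As a rough illustration of the three requirements listed in the thread
(read a file line by line, filter lines with a regex, extract named parts
with a second regex), here is a minimal Python sketch. It is not DIH code
and not the SOLR-1060 patch; the function name scan_lines, its parameters,
the fileName group and the sample report are all made up for the example.

```python
import re
from typing import Dict, Iterable, Iterator, Optional


def scan_lines(
    lines: Iterable[str],
    extract: str,
    accept: Optional[str] = None,
    skip: Optional[str] = None,
) -> Iterator[Dict[str, str]]:
    """Yield the named groups of `extract` for each line that passes the filters."""
    extract_re = re.compile(extract)
    accept_re = re.compile(accept) if accept else None
    skip_re = re.compile(skip) if skip else None
    for raw in lines:                      # 1) read line by line
        line = raw.rstrip("\r\n")
        if accept_re and not accept_re.search(line):
            continue                       # 2) include filter did not match
        if skip_re and skip_re.search(line):
            continue                       # 2) exclude filter matched
        match = extract_re.search(line)
        if match:
            yield match.groupdict()        # 3) named parts of the line


# Hypothetical 'tar xvf'-style report; a real manifest will look different.
report = [
    "x data/a/one.xml\n",
    "x data/a/notes.txt\n",
    "x data/b/two.xml\n",
]

rows = list(scan_lines(report, extract=r"^x (?P<fileName>.+)$", accept=r"\.xml$"))
for row in rows:
    print(row["fileName"])
```

Passing an open file object instead of the list works the same way, since
iterating a text file also yields one line at a time. A delete-set could be
handled by a second call with a different pair of regexes.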
-- 

===============================================================
Fergus McMenemie               Email:fer...@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021

Unix/Mac/Intranets             Analyst Programmer
===============================================================