[
https://issues.apache.org/jira/browse/SOLR-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683851#action_12683851
]
Fergus McMenemie commented on SOLR-1060:
----------------------------------------
My original patch did all this in the ChangeListEntityProcessor, as an option!
However, as a separate issue, I do think we have an ambiguity in the face-value
behaviour of the following code when a mismatch occurs.
{code}
<field column="$deleteDocByQuery" regex="^DELETE.*" sourceColName="rawLine"/>
<field column="$deleteDocByQuery" regex="^DELETE.*" sourceColName="rawLine"
replaceWith="fileAbsolutePath:${jc.fileAbsolutePath}" />
{code}
While I do understand that under the hood one is a match and the other a
replace, I think that we could enhance the existing transformer somehow to
streamline its interface. After all, a new custom Transformer would just be
a regex by another name. Not sure what to do for the best. 1) I could put my
optional code back into ChangeListEntityProcessor? 2) I can also get around the
problem with temporary fields, but it is rather ugly:
{code}
<entity name="jc"
processor="ChangeListEntityProcessor"
acceptLineRegex="^.*\.xml$"
omitLineRegex="usc2009"
fileName="file:///Volumes/spare/ts/man-findlsurl.txt"
rootEntity="false"
dataSource="null"
baseLocation="file:///Volumes/spare/ts/ford/schema/"
transformer="RegexTransformer"
>
<field column="fileAbsolutePath" regex="^.*\s+([^ ]*)$"
replaceWith="${jc.baseLocation}/$1" sourceColName="rawLine"/>
<field column="dummy" regex="^DELETE.*"
replaceWith="fileAbsolutePath:${jc.fileAbsolutePath}" sourceColName="rawLine"/>
<field column="$deleteDocByQuery" regex="^fileAbsolutePath:"
sourceColName="dummy"/>
<entity name="x"
dataSource="myURIreader"
processor="XPathEntityProcessor"
{code}
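The match-versus-replace distinction above can be sketched with plain java.util.regex. This is only an illustration: it assumes a bare regex behaves like Matcher.find() and that replaceWith behaves like Matcher.replaceAll(). The class and method names are hypothetical, and this is not the RegexTransformer source.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of match-only versus replace behaviour on a mismatch. */
final class RegexMismatchSketch {

    /** Match-only: returns the matched text, or null when the regex does not match. */
    static String matchOnly(String regex, String value) {
        Matcher m = Pattern.compile(regex).matcher(value);
        return m.find() ? m.group() : null;
    }

    /** Replace: a value that does not match comes back unchanged. */
    static String replace(String regex, String replaceWith, String value) {
        return Pattern.compile(regex).matcher(value).replaceAll(replaceWith);
    }

    public static void main(String[] args) {
        String addLine = "ADD /data/a.xml";       // made-up manifest lines
        String deleteLine = "DELETE /data/b.xml";
        String query = "fileAbsolutePath:/data/b.xml";

        System.out.println(matchOnly("^DELETE.*", deleteLine));      // DELETE /data/b.xml
        System.out.println(matchOnly("^DELETE.*", addLine));         // null
        System.out.println(replace("^DELETE.*", query, deleteLine)); // fileAbsolutePath:/data/b.xml
        System.out.println(replace("^DELETE.*", query, addLine));    // ADD /data/a.xml
    }
}
```

Under these assumptions, the replace form hands a non-matching line through untouched, so the target column would still receive a value. That is why the workaround routes the result through a temporary field and then applies a match-only regex to it.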
> a new DIH EnityProcessor allowing text file lists of files to be indexed
> ------------------------------------------------------------------------
>
> Key: SOLR-1060
> URL: https://issues.apache.org/jira/browse/SOLR-1060
> Project: Solr
> Issue Type: New Feature
> Components: contrib - DataImportHandler
> Affects Versions: 1.4
> Reporter: Fergus McMenemie
> Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: SOLR-1060.patch, SOLR-1060.patch
>
> Original Estimate: 120h
> Remaining Estimate: 120h
>
> I have finished a new DIH EntityProcessor. It is designed around the idea
> that whatever daemon is used to maintain your content store, it is likely to
> drop a report or log file explaining what has changed within your content
> store. I wish to use this report file to control the indexing of the new or
> changed content and the removal of old content. The report files, perhaps
> from un-tar or un-zip, are likely to reference jpegs and directory stubs
> which need to be ignored. I assumed a file-based content repository, but this
> should be expanded to handle URIs as well.
>
> I feel that the current FileListEntityProcessor is poorly named. It should be
> called the dirWalkEntityProcessor or dirCrawlEntityProcessor or such, and
> this new EntityProcessor should have the name FileListEntityProcessor.
> However, what is done is done. I then came up with ManifestEntityProcessor,
> which I thought suited: manifest files are all over the content sets I deal
> with, and the dictionary definition seemed close enough ("ship's manifest").
> However, how about ChangeListEntityProcessor?
> {code}
> <entity name="jc"
> processor="ManifestEntityProcessor"
> baseDir="/Volumes/Techmore/ts/aaa/schema/data"
> rootEntity="false"
> dataSource="null"
> allowRegex="^.*\.xml$"
> blockRegex="usc2009"
> manifestFileName="/Volumes/ts/man-find.txt"
> docAddRegex=".*"
> >
> {code}
> The new entity fields are as follows.
>
> *manifestFileName* is the required location of the manifest file. If this
> value is relative, it is assumed to be relative to baseDir.
>
> *allowRegex* is an optional attribute that, if present, discards any line
> which does not match the regex.
>
> *blockRegex* is an optional attribute that is applied after any allowRegex
> and discards any line which matches the regex.
>
> *docAddRegex* is a required regex to identify lines which, when matched,
> should cause docs to be added to the index. As well as matching the line, it
> should also return the portion of the line which contains the filepath as
> group(1).
>
> *docDeleteRegex* is an optional regex to identify lines which, when matched,
> should cause docs to be deleted from the index. As well as matching the
> line, it should also return the portion of the line which contains the
> filepath as group(1). **PLANNED**
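The order in which the attributes described above are applied to each manifest line can be sketched as follows. This is a rough illustration and not the code from the attached patch: the class name, the method name, the sample lines, and the sample docAddRegex are all hypothetical, and it assumes "match" means Matcher.find() rather than a whole-line match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of the allowRegex, blockRegex, docAddRegex sequence. */
final class ManifestFilterSketch {

    /** Returns the file paths (group 1 of docAddRegex) of the lines that survive filtering. */
    static List<String> pathsToAdd(List<String> lines, String allowRegex,
                                   String blockRegex, String docAddRegex) {
        // allowRegex and blockRegex are optional, so null means "not configured".
        Pattern allow = allowRegex == null ? null : Pattern.compile(allowRegex);
        Pattern block = blockRegex == null ? null : Pattern.compile(blockRegex);
        Pattern add = Pattern.compile(docAddRegex);

        List<String> paths = new ArrayList<>();
        for (String line : lines) {
            // 1) Discard lines that do not match allowRegex.
            if (allow != null && !allow.matcher(line).find()) continue;
            // 2) Then discard lines that do match blockRegex.
            if (block != null && block.matcher(line).find()) continue;
            // 3) Lines matching docAddRegex contribute their group(1) as the path.
            Matcher m = add.matcher(line);
            if (m.find() && m.groupCount() >= 1) paths.add(m.group(1));
        }
        return paths;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "A docs/a.xml",     // kept
                "A img/pic.jpg",    // dropped: fails allowRegex
                "A usc2009/b.xml",  // dropped: hits blockRegex
                "D docs/old.xml");  // dropped: not an add line
        System.out.println(
                pathsToAdd(lines, "^.*\\.xml$", "usc2009", "^A\\s+(\\S+)$")); // [docs/a.xml]
    }
}
```

A docDeleteRegex step, once implemented, would presumably sit alongside step 3 and collect paths to delete in the same way.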