[jira] Issue Comment Edited: (SOLR-1060) a new DIH EnityProcessor allowing text file lists of files to be indexed

Fergus McMenemie (JIRA) Mon, 16 Mar 2009 13:10:15 -0700

    [ https://issues.apache.org/jira/browse/SOLR-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682441#action_12682441 ]


Fergus McMenemie edited comment on SOLR-1060 at 3/16/09 1:08 PM:
-----------------------------------------------------------------

Yes, briefly, I did, and I could not see how it could be done nicely; however, it 
is quite possible I am misunderstanding things.

To recap, the idea was to split "ChangeListEntityProcessor" into two halves. 
The first half would deal with reading lines from file:/// or http:// 
locations, with features to allow lines to be omitted or accepted. The second 
half would focus on analyzing each line, turning it into add/delete instructions 
and identifying the portion of the line which is to be operated on. Is this 
correct?

If my understanding is correct, then if baseLocation were allowed to be empty 
and "docAddRegex" and "docDeleteRegex" were not supplied, the line from the 
changelist could be returned by the entity exactly as read from the file. 
Further, if "acceptLineRegex" and "omitLineRegex" are also undefined, then the 
whole file is returned to the next entity. Would that make it the same as part 
one?
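
To make the "part one" behaviour concrete, here is a minimal Python sketch of the line-filtering half. It is only an illustration, not the DIH Java code: the function name filter_lines and its parameters are hypothetical, and it assumes a regex "matches" when it is found anywhere in the line, which the real processor may define differently. It also strips the trailing newline from each line.

```python
import re
from typing import Iterable, Iterator, Optional


def filter_lines(lines: Iterable[str],
                 accept_line_regex: Optional[str] = None,
                 omit_line_regex: Optional[str] = None) -> Iterator[str]:
    """Yield lines that pass the optional accept/omit filters.

    Hypothetical stand-in for the "first half" of the processor.
    Assumption: a regex "matches" if it is found anywhere in the line.
    """
    accept = re.compile(accept_line_regex) if accept_line_regex else None
    omit = re.compile(omit_line_regex) if omit_line_regex else None
    for raw in lines:
        line = raw.rstrip("\n")
        if accept is not None and not accept.search(line):
            continue  # accept filter is set and the line does not match it
        if omit is not None and omit.search(line):
            continue  # omit filter is set and the line matches it
        yield line


# Made-up changelist content, for illustration only.
sample = ["a/doc1.xml\n", "a/pic.jpg\n", "a/usc2009/doc2.xml\n"]
unfiltered = list(filter_lines(sample))                       # no filters: every line passes
filtered = list(filter_lines(sample, r"\.xml$", r"usc2009"))  # only a/doc1.xml survives
```

With neither regex supplied, every line of the file reaches the next entity, which is the case asked about above.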

I had looked at removing all my code for doing the second half described above 
and replacing it with transformers. I guess as long as the TemplateTransformer 
can assign to the fields $deleteDocId and $deleteDocQuery then it is doable. 
Is the following valid? In it I always assign to $deleteDocQuery but 
make $deleteDocId true/false to control actual deletion.

{code}
<!-- The "find ... bit" regexes are placeholders for the real patterns. -->
<entity name="jc"
        processor="ChangeListEntityProcessor"
        fileName="file:///Volumes/ts/man-findlsurl.txt"
        rootEntity="false"
        baseLocation="http://localhost/ford/"
        transformer="TemplateTransformer,RegexTransformer">
  <field column="id"              regex=".*(-- find juicy bit--).*" replaceWith="$1"/>
  <field column="$deleteDocQuery" regex=".*(-- find juicy bit--).*" replaceWith="$1"
         sourceColName="fileAbsolutePath"/>
  <field column="$deleteDocId"    template="false" regex=".*(-- find add/del bit--).*"
         replaceWith="true" sourceColName="fileAbsolutePath"/>
</entity>
{code}

?
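
As a rough way to reason about what the regex/replaceWith fields would compute, here is a small Python sketch. It assumes that regex plus replaceWith behaves like a replace-all over the source column value, which should be checked against the RegexTransformer code. The helper regex_replace, the "A"/"D" line format and the concrete patterns are all hypothetical stand-ins for the placeholders in the config.

```python
import re


def regex_replace(value: str, regex: str, replace_with: str) -> str:
    """Replace every match of regex in value (assumed regex/replaceWith semantics)."""
    return re.sub(regex, replace_with, value)


# Hypothetical changelist line: an action flag ("A" or "D") followed by a path.
row = {"fileAbsolutePath": "D /ford/manuals/old.xml"}

# Python writes the first capture group as \1 where the config above writes $1.
row["id"] = regex_replace(row["fileAbsolutePath"], r"^[AD] (.*)$", r"\1")
row["$deleteDocQuery"] = row["id"]

# Becomes "true" only for delete lines. An "A" line is left unchanged by this
# pattern, so something else still has to supply the "false" value.
row["$deleteDocId"] = regex_replace(row["fileAbsolutePath"], r"^D .*$", "true")
unchanged = regex_replace("A /ford/manuals/new.xml", r"^D .*$", "true")
```

Under that assumption a non-matching line passes through untouched, so whether $deleteDocId ends up "false" for add lines would depend on how the template value and the regex interact.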


  
> a new DIH EnityProcessor allowing text file lists of files to be indexed
> ------------------------------------------------------------------------
>
>                 Key: SOLR-1060
>                 URL: https://issues.apache.org/jira/browse/SOLR-1060
>             Project: Solr
>          Issue Type: New Feature
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>             Fix For: 1.4
>
>         Attachments: SOLR-1060.patch, SOLR-1060.patch
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> I have finished a new DIH EntityProcessor. It is designed around the idea 
> that whatever daemon is used to maintain your content store, it is likely to 
> drop a report or log file explaining what has changed within your content 
> store. I wish to use this report file to control the indexing of the new or 
> changed content and the removal of old content. The report files, perhaps 
> from un-tar or un-zip, are likely to reference JPEGs and directory stubs 
> which need to be ignored. I assumed a file-based content repository, but this 
> should be expanded to handle URIs as well.
>
> I feel that the current FileListEntityProcessor is poorly named. It should be 
> called the dirWalkEntityProcessor or dirCrawlEntityProcessor or such, and 
> this new EntityProcessor should have the name FileListEntityProcessor. 
> However, what is done is done. I then came up with ManifestEntityProcessor, 
> which I thought suited: manifest files are all over the content sets I deal 
> with and the dictionary definition seemed close enough ("ship's manifest"). 
> However, how about ChangeListEntityProcessor?
> {code}
>        <entity name="jc"
>                processor="ManifestEntityProcessor"
>                baseDir="/Volumes/Techmore/ts/aaa/schema/data"
>                rootEntity="false"
>                dataSource="null"
>                allowRegex="^.*\.xml$"
>                blockRegex="usc2009"
>                manifestFileName="/Volumes/ts/man-find.txt"
>                docAddRegex=".*"
>                >
> {code}
> The new entity attributes are as follows.
>
>    *manifestFileName* is the required location of the manifest file. If this 
> value is relative, it is assumed to be relative to baseDir.
>    *allowRegex* is an optional attribute that, if present, discards any line 
> which does not match the regex.
>    *blockRegex* is an optional attribute that is applied after any allowRegex 
> and discards any line which matches the regex.
>    *docAddRegex* is a required regex identifying lines which, when matched, 
> should cause docs to be added to the index. As well as matching the line, it 
> should also return the portion of the line which contains the filepath as 
> group(1).
>    *docDeleteRegex* is an optional regex identifying lines which, when 
> matched, should cause docs to be deleted from the index. As well as matching 
> the line, it should also return the portion of the line which contains the 
> filepath as group(1). **PLANNED**
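
The attribute list above can be read as a small per-line decision procedure. The following Python sketch is one possible reading of it, not the patch itself: classify_line, the "A"/"D" line prefixes and the sample patterns are hypothetical. It assumes delete lines are checked before add lines, and that a regex without a capture group (such as the docAddRegex=".*" in the example) falls back to the whole match.

```python
import re
from typing import Optional, Tuple


def classify_line(line: str,
                  doc_add_regex: str,
                  doc_delete_regex: Optional[str] = None,
                  allow_regex: Optional[str] = None,
                  block_regex: Optional[str] = None) -> Optional[Tuple[str, str]]:
    """Return ("add" | "delete", path) for a manifest line, or None if it is skipped."""
    if allow_regex and not re.search(allow_regex, line):
        return None  # allowRegex is set and the line does not match it
    if block_regex and re.search(block_regex, line):
        return None  # blockRegex is applied after allowRegex
    # Assumption: delete is tested first so that a catch-all docAddRegex
    # does not swallow delete lines.
    candidates = (("delete", doc_delete_regex), ("add", doc_add_regex))
    for action, pattern in candidates:
        if not pattern:
            continue
        m = re.search(pattern, line)
        if m:
            # Assumption: with no capture group, use the whole match as the path.
            path = m.group(1) if m.re.groups >= 1 else m.group(0)
            return (action, path)
    return None


# Hypothetical manifest lines: "A <path>" for additions, "D <path>" for deletions.
add_re, del_re = r"^A (.*)$", r"^D (.*)$"
added = classify_line("A docs/new.xml", add_re, del_re, r"^.*\.xml$", "usc2009")
deleted = classify_line("D docs/old.xml", add_re, del_re, r"^.*\.xml$", "usc2009")
skipped = classify_line("A docs/pic.jpg", add_re, del_re, r"^.*\.xml$", "usc2009")
```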

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
