[jira] Commented: (SOLR-1060) a new DIH EnityProcessor allowing text file lists of files to be indexed

Shalin Shekhar Mangar (JIRA) Wed, 15 Apr 2009 09:48:39 -0700

    [ https://issues.apache.org/jira/browse/SOLR-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699273#action_12699273 ]


Shalin Shekhar Mangar commented on SOLR-1060:
---------------------------------------------

bq. Are you referring to the fragment of code inside ChangeListEntityProcessor 
that opens the changelist, and its similarity to the functionality in 
URIDataSource?

Yes.

bq. I had not thought about arranging some kind of nested use of 
URIDataSource... is that what you are thinking about? 

Not exactly. EntityProcessors do not access HTTP resources or files directly; 
that is what DataSources are for. The ChangeListEntityProcessor should just 
call context.getDataSource() instead of creating a URLConnection directly. The 
only drawback of that approach is that the baseLocation must be specified on 
the <dataSource>. If you really need it to be returned with the row, you can 
add a template field carrying its value, assuming the baseLocation is fixed.
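
To make the separation concrete, here is a minimal, self-contained sketch of a 
processor that asks a data source for a Reader instead of opening a connection 
itself. ReaderSource and DelegatingLineReader are hypothetical stand-ins 
invented for this illustration; they are not the DIH DataSource or 
EntityProcessor classes, and the file name used in main is only a placeholder.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a data source that returns a Reader for a query.
// A real implementation would resolve the query against its own base location.
interface ReaderSource {
    Reader getData(String query);
}

// Hypothetical processor: it never opens a file or URL itself,
// it only consumes whatever Reader the data source hands back.
class DelegatingLineReader {
    private final ReaderSource source;

    DelegatingLineReader(ReaderSource source) {
        this.source = source;
    }

    // Reads every line of the resource named by 'query' through the data source.
    List<String> readLines(String query) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source.getData(query))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}

class DelegatingLineReaderDemo {
    public static void main(String[] args) {
        // In-memory source for the demo; no file system or network access.
        ReaderSource inMemory = query -> new StringReader("a.xml\nb.xml\n");
        System.out.println(new DelegatingLineReader(inMemory).readLines("man-find.txt"));
    }
}
```

With this shape, switching from files to HTTP would only mean supplying a 
different source; the processor code stays untouched.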

The more I look at this, the more I feel that the name 
'ChangeListEntityProcessor' is misleading. It doesn't really make any changes. 
It is actually what I imagined a LineEntityProcessor would be. It just streams 
lines one by one after accepting or rejecting some lines with regex. Whatever 
else you need to do (for your original use-case) can be done with nested 
entities and/or custom transformers.
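
As a rough illustration of that line-streaming behaviour, the sketch below 
filters lines with an allow pattern followed by a block pattern. The class 
name LineFilterSketch is hypothetical, and the use of Matcher.find() (a match 
anywhere in the line) is an assumption made for this example, not something 
taken from the patch. The two patterns in main are the ones from the sample 
configuration in the issue description.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: keeps a line only if it passes the allow pattern
// and is not caught by the block pattern.
class LineFilterSketch {
    private final Pattern allow; // null means "accept every line"
    private final Pattern block; // null means "reject no line"

    LineFilterSketch(String allowRegex, String blockRegex) {
        this.allow = allowRegex == null ? null : Pattern.compile(allowRegex);
        this.block = blockRegex == null ? null : Pattern.compile(blockRegex);
    }

    // The allow check runs first; the block check is applied afterwards.
    boolean accepts(String line) {
        if (allow != null && !allow.matcher(line).find()) {
            return false;
        }
        if (block != null && block.matcher(line).find()) {
            return false;
        }
        return true;
    }

    // Reads the input line by line and collects the accepted lines.
    // A real processor would hand out one row per call instead of a list.
    List<String> filter(Reader reader) {
        List<String> kept = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(reader)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (accepts(line)) {
                    kept.add(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return kept;
    }

    public static void main(String[] args) {
        LineFilterSketch f = new LineFilterSketch("^.*\\.xml$", "usc2009");
        Reader input = new StringReader("a.xml\nphoto.jpg\nusc2009/b.xml\n");
        System.out.println(f.filter(input));
    }
}
```

Anything beyond accept/reject, such as turning an accepted line into a file 
path for a nested entity, is left out here on purpose.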

What are the changes to TestRegexTransformer that this patch includes? Are 
these tests that you wrote for the RegexTransformer improvements/fixes that you 
found earlier? If yes, we should commit them through a different issue. The 
same should be done for the URIDataSource and associated changes.

What do you think?

> a new DIH EnityProcessor allowing text file lists of files to be indexed
> ------------------------------------------------------------------------
>
>                 Key: SOLR-1060
>                 URL: https://issues.apache.org/jira/browse/SOLR-1060
>             Project: Solr
>          Issue Type: New Feature
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: regex-fix.patch, SOLR-1060.patch, SOLR-1060.patch, 
> SOLR-1060.patch, SOLR-1060.patch, SOLR-1060.patch
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> I have finished a new DIH EntityProcessor. It is designed around the idea 
> that whatever daemon is used to maintain your content store, it is likely to 
> drop a report or log file explaining what has changed within your content 
> store. I wish to use this report file to control the indexing of the new or 
> changed content and the removal of old content. The report files, perhaps 
> from un-tar or un-zip, are likely to reference jpegs and directory stubs 
> which need to be ignored. I assumed a file based content repository but this 
> should be expanded to handle URIs as well.
> I feel that the current FileListEntityProcessor is poorly named. It should be 
> called the dirWalkEntityProcessor or dirCrawlEntityProcessor or such. And 
> this new EntityProcessor should have the name FileListEntityProcessor. 
> However, what is done is done. I then came up with ManifestEntityProcessor, 
> which I thought suited; manifest files are all over the content sets I deal 
> with and the dictionary definition seemed close enough ("ship's manifest"). 
> However, how about ChangeListEntityProcessor?
> {code}
>        <entity name="jc"
>                processor="ManifestEntityProcessor"
>                baseDir="/Volumes/Techmore/ts/aaa/schema/data"
>                rootEntity="false"
>                dataSource="null"
>                allowRegex="^.*\.xml$"
>                blockRegex="usc2009"
>                manifestFileName="/Volumes/ts/man-find.txt"
>                docAddRegex=".*"
>                >
> {code}
> The new entity fields are as follows.
>  
>    *manifestFileName* is the required location of the manifest file. If this 
> value is relative, it is assumed to be relative to baseDir.
>    *allowRegex* is an optional attribute that, if present, discards any line 
> which does not match the regex.
>    *blockRegex* is an optional attribute that is applied after any allowRegex 
> and discards any line which matches the regex.
>    *docAddRegex* is a required regex to identify lines which, when matched, 
> should cause docs to be added to the index. As well as matching the line, it 
> should also return the portion of the line which contains the filepath as 
> group(1).
>    *docDeleteRegex* is an optional regex to identify documents which, when 
> matched, should be deleted from the index. As well as matching the line, it 
> should also return the portion of the line which contains the filepath as 
> group(1). **PLANNED**

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
