Re: [Solr Wiki] Update of "DataImportHandler" by DNaber

Noble Paul നോബിള്‍ नोब्ळ् Wed, 09 Dec 2009 10:22:43 -0800

This needs to be reverted. There was data loss.

On Wed, Dec 9, 2009 at 8:46 PM, Apache Wiki <wikidi...@apache.org> wrote:
> Dear Wiki user,
>
> You have subscribed to a wiki page or wiki category on "Solr Wiki" for change 
> notification.
>
> The "DataImportHandler" page has been changed by DNaber.
> http://wiki.apache.org/solr/DataImportHandler?action=diff&rev1=220&rev2=221
>
> --------------------------------------------------
>
>  <dataConfig>
>      <dataSource type="FileDataSource" />
>      <document>
> +         <entity name="f" processor="FileListEntityProcessor" 
> baseDir="/some/path/tongle implicit field called 'plainText'. The content is 
> not parsed in any way, however you may add transformers to manipulate the 
> data within 'plainText' as needed or to create other additional fields.
> -         <entity name="f" processor="FileListEntityProcessor" 
> baseDir="/some/path/to/files" fileName=".*xml" newerThan="'NOW-3DAYS'" 
> recursive="true" rootEntity="false" dataSource="null">
> -             <entity name="x" processor="XPathEntityProcessor" 
> forEach="/the/record/xpath" url="${f.fileAbsolutePath}">
> -                 <field column="full_name" xpath="/field/xpath"/>
> -             </entity>
> -         </entity>
> -     </document>
> - </dataConfig>
> - }}}
> - Do not miss the `rootEntity` attribute. The implicit fields generated by 
> the !FileListEntityProcessor are `fileAbsolutePath, fileSize, 
> fileLastModified, fileName` and these are available for use within the entity 
> X as shown above. It should be noted that !FileListEntityProcessor returns a 
> list of pathnames and that the subsequent entity must use the !FileDataSource 
> to fetch the files content.
>
> + example:
> - === CachedSqlEntityProcessor ===
> - <<Anchor(cached)>>
> -
> - This is an extension of the !SqlEntityProcessor.  This !EntityProcessor 
> helps reduce the no: of DB queries executed by caching the rows. It does not 
> help to use it in the root most entity because only one sql is run for the 
> entity.
> -
> - Example 1.
>  {{{
> - <entity name="x" query="select * from x">
> -     <entity name="y" query="select * from y where xid=${x.id}" 
> processor="CachedSqlEntityProcessor">
> -     </entity>
> + <entity processor="PlainTextEntityProcessor" name="x" 
> url="http://abc.com/a.txt" dataSource="data-source-name">
> +    <!-- copies the text to a field called 'text' in Solr-->
> +   <field column="plainText" name="text"/>
> - <entity>
> + </entity>
>  }}}
>
> - The usage is exactly same as the other one. When a query is run the results 
> are stored and if the same query is run again it is fetched from the cache 
> and returned
> + Ensure that the dataSource is of type !DataSource<Reader> (!FileDataSource, 
> URL!DataSource)
>
> - Example 2:
> - {{{
> - <entity name="x" query="select * from x">
> -     <entity name="y" query="select * from y" 
> processor="CachedSqlEntityProcessor"  where="xid=x.id">
> -     </entity>
> - <entity>
> - }}}
> -
> - The difference with the previous one is the 'where' attribute. In this case 
> the query fetches all the rows from the table and stores all the rows in the 
> cache. The magic is in the 'where' value. The cache stores the values with 
> the 'xid' value in 'y' as the key. The value for 'x.id' is evaluated every 
> time the entity has to be run and the value is looked up in the cache an the 
> rows are returned.
> -
> - In the where the lhs (the part before '=') is the column in y and the rhs 
> (the part after '=') is the value to be computed for looking up the cache.
> -
> - === PlainTextEntityProcessor ===
> + === LineEntityProcessor ===
> - <<Anchor(plaintext)>>
> + <<Anchor(LineEntityProcessor)>>
>  <!> [[Solr1.4]]
>
> - This !EntityProcessor reads all content from the data source into an single 
> implicit field called 'plainText'. The content is not parsed in any way, 
> however you may add transformers to manipulate the data within 'plainText' as 
> needed or to create other additional fields.
> + This !EntityProcessor reads all content from the data source on a line by 
> line basis, a field called 'rawLine' is returned for each line read. The 
> content is not parsed in any way, however you may add transformers to 
> manipulate the data within 'rawLine' or to create other additional fields.
> +
> + The lines read can be filtered by two regular expressions 
> '''acceptLineRegex''' and '''omitLineRegex'''.
> + This entities additional attributes are:
> +  * '''`url`''' : a required attribute that specifies the location of the 
> input file in a way that is compatible with the configured datasource. If 
> this value is relative and you are using !FileDataSource or URL!DataSource, 
> it assumed to be relative to '''baseLoc'''.
> +  * '''`acceptLineRegex`''' :an optional attribute that if present discards 
> any line which does not match the regExp.
> +  * '''`omitLineRegex`''' : an optional attribute that is applied after any 
> acceptLineRegex and discards any line which matches this regExp.
> + example:
> + {{{
> + <entity name="jc"
> +         processor="LineEntityProcessor"
> +         acceptLineRegex="^.*\.xml$"
> +         omitLineRegex="/obsolete"
> +         url="file:///Volumes/ts/files.lis"
> +         rootEntity="false"
> +         dataSource="myURIreader1"
> +         transformer="RegexTransformer,DateFormatTransformer"
> +         >
> +    ...
> + }}}
> + While there are use cases where you might need to create a solr document 
> per line read from a file, it is expected that in most cases that the lines 
> read will consist of a pathname which is in turn consumed by another 
> !EntityProcessor
> + such as X!PathEntityProcessor.
> +
> + == DataSource ==
> + <<Anchor(datasource)>>
> + A class can extend `org.apache.solr.handler.dataimport.DataSource` . 
> [[http:/%ngle implicit field called 'plainText'. The content is not parsed in 
> any way, however you may add transformers to manipulate the data within 
> 'plainText' as needed or to create other additional fields.
>
>  example:
>  {{{
> @@ -1026, +1026 @@
>
>  {{attachment:interactive-dev-dataimporthandler.PNG}}
>
>  = Where to find it? =
> - DataImportHandler is a new addition to Solr. You can either:
> + DataImportHandler was added to Solr in Solr 1.3. You can either:
> -  * Download a nightly build of Solr from 
> [[http://lucene.apache.org/solr/|Solr website]], or
> +  * Download a build of Solr from [[http://lucene.apache.org/solr/|Solr 
> website]], or
>   * Use the steps given in Full Import Example to try it out.
>
>  For a history of development discussion related to DataImportHandler, please 
> see [[http://issues.apache.org/jira/browse/SOLR-469|SOLR-469]] in the Solr 
> JIRA.
>




-- 
-----------------------------------------------------
Noble Paul | Systems Architect| AOL | http://aol.com
