Re: [Solr Wiki] Update of "DataImportHandler" by NoblePaul

Noble Paul നോബിള്‍ नोब्ळ् Mon, 21 Sep 2009 03:13:34 -0700

The wiki screwed up the page. It has chopped off most of the content,
and now I am unable to roll it back to the previous version (208).


On Mon, Sep 21, 2009 at 2:50 PM, Apache Wiki <wikidi...@apache.org> wrote:
> Dear Wiki user,
>
> You have subscribed to a wiki page or wiki category on "Solr Wiki" for change 
> notification.
>
> The "DataImportHandler" page has been changed by NoblePaul:
> http://wiki.apache.org/solr/DataImportHandler?action=diff&rev1=208&rev2=209
>
>  {{{
>  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" 
> url="jdbc:mysql://localhost/dbname" user="db_username" 
> password="db_password"/>
>  }}}
> -  * The datasource configuration can also be done in solr config xml 
> [[#solrconfigdatasource]]
>   * The attribute 'type' specifies the implementation class. It is optional. 
> The default value is `'JdbcDataSource'`
>   * The attribute 'name' can be used if there are [[#multipleds|multiple 
> datasources]] used by multiple entities
>   * All other attributes in the <dataSource> tag are specific to the 
> particular dataSource implementation being configured.
> @@ -679, +678 @@
>
>
>  {{{
>    <requestHandler name="/dataimport" 
> class="org.apache.solr.handler.dataimport.DataImportHandler">
> +     <lst na0D
> -     <lst name="defaults">
> -        <str name="config">data-config.xml</str>
> -        </lst>
> -     <lst name="invariants">
> -        <!-- Pass through the prefix which needs stripped from
> -             an absolute disk path to give an absolute web path  -->
> -        <str name="img_installdir">/usr/local/apache2/htdocs</str>
> -        </lst>
> -     </requestHandler>
> - }}}
> -
> -
> - {{{
> -  <dataConfig>
> -  <dataSource name="myfilereader" type="FileDataSource"/>
> -    <document>
> -      <entity name="jc" rootEntity="false" dataSource="null"
> -            processor="FileListEntityProcessor"
> -            fileName="^.*\.xml$" recursive="true"
> -            baseDir="/usr/local/apache2/htdocs/imagery"
> -              >
> -        <entity name="x"rootEntity="true"
> -              dataSource="myfilereader"
> -              processor="XPathEntityProcessor"
> -              url="${jc.fileAbsolutePath}"
> -              stream="false" forEach="/mediaBlock"
> -              
> transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer,LogTransformer"
> -                logTemplate="      processing ${jc.fileAbsolutePath}"
> -                logLevel="info"
> -                >
> -
> -          <field column="fileAbsPath"     template="${jc.fileAbsolutePath}" 
> />
> -
> -          <field column="fileWebPath"     template="${x.fileAbsolutePath}"
> -                                          
> regex="${dataimporter.request.img_installdir}(.*)" replaceWith="$1"/>
> -
> -          <field column="fileWebDir"      regex="^(.*)/.*" replaceWith="$1" 
> sourceColName="fileWebPath"/>
> -
> -          <field column="imgFilename"     xpath="/mediaBlock/@url" />
> -          <field column="imgCaption"      xpath="/mediaBlock/caption"  />
> -          <field column="imgSrcArticle"   xpath="/mediaBlock/source"
> -                                          
> template="${x.fileWebDir}/${x.imgSrcArticle}/"/>
> -
> -          <field column="uid"             regex="^(.*)$" 
> replaceWith="$1#${x.imgFilename}" sourceColName="fileWebPath"/>
> -
> -          <!-- if imgFilename is not defined all the following will also not 
> be defined -->
> -          <field column="imgWebPathFULL"  
> template="${x.fileWebDir}/images/${x.imgFilename}"/>
> -          <field column="imgWebPathICON"  regex="^(.*)\.\w+$" 
> replaceWith="${x.fileWebDir}/images/s$1.png"
> -                                          sourceColName="imgFilename"/>
> -
> -        </entity>
> -      </entity>
> -    </document>
> -   </dataConfig>
> - }}}
> -
> - <<Anchor(custom-transformers)>>
> - === Writing Custom Transformers ===
> - It is simple to add you own transformers and this documented on the page 
> [[DIHCustomTransformer]]
> -
> - <<Anchor(entityprocessor)>>
> - == EntityProcessor ==
> - Each entity is handled by a default Entity processor called 
> !SqlEntityProcessor. This works well for systems which use RDBMS as a 
> datasource. For other kind of datasources like  REST or Non Sql datasources 
> you can choose to extend this abstract class 
> `org.apache.solr.handler.dataimport.Entityprocessor`. This is designed to 
> Stream rows one by one from an entity. The simplest way to implement your own 
> !EntityProcessor is to extend !EntityProcessorBase and override the `public 
> Map<String,Object> nextRow()` method.
> - '!EntityProcessor' rely on the !DataSource for fetching data. The return 
> type of the !DataSource is important for an !EntityProcessor. The built-in 
> ones are,
> -
> - === SqlEntityProcessor ===
> - This is the defaut. The !DataSource must be of type 
> `DataSource<Iterator<Map<String, Object>>>` . !JdbcDataSource can be used 
> with this.
> -
> - === XPathEntityProcessor ===
> - Used when indexing XML type data. The !DataSource must be of type 
> `DataSource<Reader>` . URL!DataSource <!> [[Solr1.4]] or !FileDataSource is 
> commonly used with X!PathEntityProcessor.
> -
> - === FileListEntityProcessor ===
> - A simple entity processor which can be used to enumerate the list of files 
> from a File System based on some criteria. It does not use a !DataSource. The 
> entity attributes are:
> -  * '''`fileName`''' :(required) A regex pattern to identify files
> -  * '''`baseDir`''' : (required) The Base directory (absolute path)
> -  * '''`recursive`''' : Recursive listing or not. Default is 'false'
> -  * '''`excludes`''' : A Regex pattern of excluded file names
> -  * '''`newerThan`''' : A date param . Use the format (`yyyy-MM-dd 
> HH:mm:ss`) . It can also be a datemath string eg: ('NOW-3DAYS'). The single 
> quote is necessary . Or it can be a valid variableresolver format like 
> (${var.name})
> -  * '''`olderThan`''' : A date param . Same rules as above
> -  * '''`rootEntity`''' :It must be false for this (Unless you wish to just 
> index filenames) An entity directly under the <document> is a root entity. 
> That means that for each row emitted by the root entity one document is 
> created in Solr/Lucene. But as in this case we do not wish to make one 
> document per file. We wish to make one document per row emitted by the 
> following entity 'x'. Because the entity 'f' has rootEntity=false the entity 
> directly under it becomes a root entity automatically and each row emitted by 
> that becomes a document.
> -  * '''`dataSource`''' :If you use Solr1.3 It must be set to "null" because 
> this does not use any DataSource. No need to specify that in Solr1.4 .It just 
> means that we won't create a DataSource instance. (In most of the cases there 
> is only one !DataSource (A !JdbcDataSource) and all entities just use them. 
> In case of !FileListEntityProcessor a !DataSource is not necessary.)
> -
> - example:
> - {{{
> - <dataConfig>
> -     <dataSource type="FileDataSource" />
> -     <document>
> -         <entity name="f" processor="FileListEntityProcessor" 
> baseDir="/some/path/to/files" fileName=".*xml" newerThan="'NOW-3DAYS'" 
> recursive="true" rootEntity="false" dataSource="null">
> -             <entity name="x" processor="XPathEntityProcessor" 
> forEach="/the/record/xpath" url="${f.fileAbsolutePath}">
> -                 <field column="full_name" xpath="/field/xpath"/>
> -             </entity>
> -         </entity>
> -     </document>
> - </dataConfig>
> - }}}
> - Do not miss the `rootEntity` attribute. The implicit fields generated by 
> the !FileListEntityProcessor are `fileAbsolutePath, fileSize, 
> fileLastModified, fileName` and these are available for use within the entity 
> X as shown above. It should be noted that !FileListEntityProcessor returns a 
> list of pathnames and that the subsequent entity must use the !FileDataSource 
> to fetch the files content.
> -
> - === CachedSqlEntityProcessor ===
> - <<Anchor(cached)>>
> -
>  This is an extension of the !SqlEntityProcessor.  This !EntityProcessor 
> helps reduce the no: of DB queries executed by caching the rows. It does not 
> help to use it in the root most entity because only one sql is run for the 
> entity.
>
>  Example 1.
> @@ -904, +804 @@
>
>  === ContentStreamDataSource ===
>  <!> [[Solr1.4]]
>
> + Use this to use the POST data as the DataSour0D
> + This is an extension of the !SqlEntityProcessor.  This !EntityProcessor 
> helps reduce the no: of DB queries executed by caching the rows. It does not 
> help to use it in the root most entity because only one sql is run for the 
> entity.
> +
> + Example 1.
> + {{{
> + <entity name="x" query="select * from x">
> +     <entity name="y" query="select * from y where xid=${x.id}" 
> processor="CachedSqlEntityProcessor">
> +     </entity>
> + <entity>
> + }}}
> +
> + The usage is exactly same as the other one. When a query is run the results 
> are stored and if the same query is run again it is fetched from the cache 
> and returned
> +
> + Example 2:
> + {{{
> + <entity name="x" query="select * from x">
> +     <entity name="y" query="select * from y" 
> processor="CachedSqlEntityProcessor"  where="xid=x.id">
> +     </entity>
> + <entity>
> + }}}
> +
> + The difference with the previous one is the 'where' attribute. In this case 
> the query fetches all the rows from the table and stores all the rows in the 
> cache. The magic is in the 'where' value. The cache stores the values with 
> the 'xid' value in 'y' as the key. The value for 'x.id' is evaluated every 
> time the entity has to be run and the value is looked up in the cache an the 
> rows are returned.
> +
> + In the where the lhs (the part before '=') is the column in y and the rhs 
> (the part after '=') is the value to be computed for looking up the cache.
> +
> + === PlainTextEntityProcessor ===
> + <<Anchor(plaintext)>>
> + <!> [[Solr1.4]]
> +
> + This !EntityProcessor reads all content from the data source into an single 
> implicit field called 'plainText'. The content is not parsed in any way, 
> however you may add transformers to manipulate the data within 'plainText' as 
> needed or to create other additional fields.
> +
> + example:
> + {{{
> + <entity processor="PlainTextEntityProcessor" name="x" 
> url="http://abc.com/a.txt" dataSource="data-source-name">
> +    <!-- copies the text to a field called 'text' in Solr-->
> +   <field column="plainText" name="text"/>
> + </entity>
> + }}}
> +
> + Ensure that the dataSource is of type !DataSource<Reader> (!FileDataSource, 
> URL!DataSource)
> +
> + === LineEntityProcessor ===
> + <<Anchor(LineEntityProcessor)>>
> + <!> [[Solr1.4]]
> +
> + This !EntityProcessor reads all content from the data source on a line by 
> line basis, a field called 'rawLine' is returned for each line read. The 
> content is not parsed in any way, however you may add transformers to 
> manipulate the data within 'rawLine' or to create other additional fields.
> +
> + The lines read can be filtered by two regular expressions 
> '''acceptLineRegex''' and '''omitLineRegex'''.
> + This entities additional attributes are:
> +  * '''`url`''' : a required attribute that specifies the location of the 
> input file in a way that is compatible with the configured datasource. If 
> this value is relative and you are using !FileDataSource or URL!DataSource, 
> it assumed to be relative to '''baseLoc'''.
> +  * '''`acceptLineRegex`''' :an optional attribute that if present discards 
> any line which does not match the regExp.
> +  * '''`omitLineRegex`''' : an optional attribute that is applied after any 
> acceptLineRegex and discards any line which matches this regExp.
> + example:
> + {{{
> + <entity name="jc"
> +         processor="LineEntityProcessor"
> +         acceptLineRegex="^.*\.xml$"
> +         omitLineRegex="/obsolete"
> +         url="file:///Volumes/ts/files.lis"
> +         rootEntity="false"
> +         dataSource="myURIreader1"
> +         transformer="RegexTransformer,DateFormatTransformer"
> +         >
> +    ...
> + }}}
> + While there are use cases where you might need to create a solr document 
> per line read from a file, it is expected that in most cases that the lines 
> read will consist of a pathname which is in turn consumed by another 
> !EntityProcessor
> + such as X!PathEntityProcessor.
> +
> + == DataSource ==
> + <<Anchor(datasource)>>
> + A class can extend `org.apache.solr.handler.dataimport.DataSource` . 
> [[http://svn.apache.org/viewvc/lucene/solr/trunk/contrib/dataimporthandler/src/main/java/org/apache/solr/handler/dataimport/DataSource.java?view=markup|See
>  source]]
> +
> + and can be used as a !DataSource. It must be configured in the dataSource 
> definition
> + {{{
> + <dataSource type="com.foo.FooDataSource" prop1="hello"/>
> + }}}
> + and it can be used in the entities like a standard one
> +
> + === JdbcdataSource ===
> + This is the default. See the  [[#jdbcdatasource|example]] . The signature 
> is as follows
> + {{{
> + public class JdbcDataSource extends DataSource<Iterator<Map<String, 
> Object>>>
> + }}}
> + It is designed to iterate rows in DB one by one. A row is represented as a 
> Map.
> +
> + === URLDataSource ===
> + <!> [[Solr1.4]]
> + This datasource is often used with X!PathEntityProcessor to fetch content 
> from an underlying file:// or http:// location. See the documentation 
> [[#httpds|here]] . The signature is as follows
> + {{{
> + public class URLDataSource extends DataSource<Reader>
> + }}}
> +
> + === HttpDataSource ===
> + <!> Http!DataSource is being deprecated in favour of URL!DataSource in 
> [[Solr1.4]]. There is no change in functionality between URL!DataSource and 
> !Http!DataSource, only a name change.
> +
> + === FileDataSource ===
> + This can be used like an URL!DataSource but used to fetch content from 
> files on disk. The only difference from URL!DataSource, when accessing disk 
> files, is how a pathname is specified. The signature is as follows
> + {{{
> + public class FileDataSource extends DataSource<Reader>
> + }}}
> +
> + The attributes are:
> +  * '''`basePath`''': (optional) The base path relative to which the value 
> is evaluated if it is not absolute
> +  * '''`encoding`''': (optional) If the files are to be read in an encoding 
> that is not same as the platform encoding
> +
> + === FieldReaderDataSource ===
> + <!> [[Solr1.4]]
> +
> + This can be used like an URL!DataSource . The signature is as follows
> + {{{
> + public class FieldReaderDataSource extends DataSource<Reader>
> + }}}
> + This can be useful for users who have a DB field containing XML and wish to 
> use a nested X!PathEntityProcessor to process the fields contents.
> + The datasouce may be configured as follows
> + {{{
> +   <datasource name="f" type="FieldReaderDataSource" />
> + }}}
> +
> + The enity which uses this datasource must keep the url value as the 
> variable name dataField="field-name". For instance , if the parent entity 
> 'dbEntity' has a field called 'xmlData' . Then he child entity woould look 
> like,
> + {{{
> + <entity dataSource="f" processor="XPathEntityProcessor" 
> dataField="dbEntity.xmlData"/>
> + }}}
> +
> + === ContentStreamDataSource ===
> + <!> [[Solr1.4]]
> +
>  Use this to use the POST data as the DataSource. This can be used with any 
> !EntityProcessor that uses a !DataSource<Reader>
>
>  == EventListeners ==
> @@ -926, +952 @@
>
>   * '''`$deleteDocByQuery`''' :Delete docs from Solr by this query. The value 
> must be a Solr Query <!> [[Solr1.4]]
>
>
> - == Adding datasource in solrconfig.xml ==
> - <<Anchor(solrconfigdatasource)>>
> -
> - It is possible to configure datasource in solrconfig.xml as well as the 
> data-config.xml, however the datasource attributes are expressed differently.
> - {{{
> -   <requestHandler name="/dataimport" 
> class="org.apache.solr.handler.dataimport.DataImportHandler">
> -     <lst name="defaults">
> -       <str name="config">/home/username/data-config.xml</str>
> -       <lst name="datasource">
> -          <str name="driver">com.mysql.jdbc.Driver</str>
> -          <str name="url">jdbc:mysql://localhost/dbname</str>
> -          <str name="user">db_username</str>
> -          <str name="password">db_password</str>
> -       </lst>
> -     </lst>
> -   </requestHandler>
> - }}}
>  <<Anchor(arch)>>
>  = Architecture =
>  The following diagram describes the logical flow for a sample configuration.
>



-- 
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com
