The wiki screwed up the page. It has chopped off most of the content, and now I am unable to roll it back to the previous version (208).

On Mon, Sep 21, 2009 at 2:50 PM, Apache Wiki <wikidi...@apache.org> wrote:
> Dear Wiki user,
>
> You have subscribed to a wiki page or wiki category on "Solr Wiki" for change
> notification.
>
> The "DataImportHandler" page has been changed by NoblePaul:
> http://wiki.apache.org/solr/DataImportHandler?action=diff&rev1=208&rev2=209
>
> {{{
> <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
> url="jdbc:mysql://localhost/dbname" user="db_username"
> password="db_password"/>
> }}}
> - * The datasource configuration can also be done in solr config xml
> [[#solrconfigdatasource]]
> * The attribute 'type' specifies the implementation class. It is optional.
> The default value is `'JdbcDataSource'`
> * The attribute 'name' can be used if there are [[#multipleds|multiple
> datasources]] used by multiple entities
> * All other attributes in the <dataSource> tag are specific to the
> particular dataSource implementation being configured.
> @@ -679, +678 @@
>
>
> {{{
> <requestHandler name="/dataimport"
> class="org.apache.solr.handler.dataimport.DataImportHandler">
> + <lst na0D
> - <lst name="defaults">
> - <str name="config">data-config.xml</str>
> - </lst>
> - <lst name="invariants">
> - <!-- Pass through the prefix which needs stripped from
> - an absolute disk path to give an absolute web path -->
> - <str name="img_installdir">/usr/local/apache2/htdocs</str>
> - </lst>
> - </requestHandler>
> - }}}
> -
> -
> - {{{
> - <dataConfig>
> - <dataSource name="myfilereader" type="FileDataSource"/>
> - <document>
> - <entity name="jc" rootEntity="false" dataSource="null"
> - processor="FileListEntityProcessor"
> - fileName="^.*\.xml$" recursive="true"
> - baseDir="/usr/local/apache2/htdocs/imagery"
> - >
> - <entity name="x"rootEntity="true"
> - dataSource="myfilereader"
> - processor="XPathEntityProcessor"
> - url="${jc.fileAbsolutePath}"
> - stream="false" forEach="/mediaBlock"
> -
> transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer,LogTransformer"
> - logTemplate=" processing ${jc.fileAbsolutePath}"
> - logLevel="info"
> - >
> -
> - <field column="fileAbsPath" template="${jc.fileAbsolutePath}"
> />
> -
> - <field column="fileWebPath" template="${x.fileAbsolutePath}"
> -
> regex="${dataimporter.request.img_installdir}(.*)" replaceWith="$1"/>
> -
> - <field column="fileWebDir" regex="^(.*)/.*" replaceWith="$1"
> sourceColName="fileWebPath"/>
> -
> - <field column="imgFilename" xpath="/mediaBlock/@url" />
> - <field column="imgCaption" xpath="/mediaBlock/caption" />
> - <field column="imgSrcArticle" xpath="/mediaBlock/source"
> -
> template="${x.fileWebDir}/${x.imgSrcArticle}/"/>
> -
> - <field column="uid" regex="^(.*)$"
> replaceWith="$1#${x.imgFilename}" sourceColName="fileWebPath"/>
> -
> - <!-- if imgFilename is not defined all the following will also not
> be defined -->
> - <field column="imgWebPathFULL"
> template="${x.fileWebDir}/images/${x.imgFilename}"/>
> - <field column="imgWebPathICON" regex="^(.*)\.\w+$"
> replaceWith="${x.fileWebDir}/images/s$1.png"
> - sourceColName="imgFilename"/>
> -
> - </entity>
> - </entity>
> - </document>
> - </dataConfig>
> - }}}
> -
> - <<Anchor(custom-transformers)>>
> - === Writing Custom Transformers ===
> - It is simple to add you own transformers and this documented on the page
> [[DIHCustomTransformer]]
> -
> - <<Anchor(entityprocessor)>>
> - == EntityProcessor ==
> - Each entity is handled by a default Entity processor called
> !SqlEntityProcessor. This works well for systems which use RDBMS as a
> datasource. For other kind of datasources like REST or Non Sql datasources
> you can choose to extend this abstract class
> `org.apache.solr.handler.dataimport.Entityprocessor`. This is designed to
> Stream rows one by one from an entity. The simplest way to implement your own
> !EntityProcessor is to extend !EntityProcessorBase and override the `public
> Map<String,Object> nextRow()` method.
> - '!EntityProcessor' rely on the !DataSource for fetching data. The return
> type of the !DataSource is important for an !EntityProcessor. The built-in
> ones are,
> -
> - === SqlEntityProcessor ===
> - This is the defaut. The !DataSource must be of type
> `DataSource<Iterator<Map<String, Object>>>` . !JdbcDataSource can be used
> with this.
> -
> - === XPathEntityProcessor ===
> - Used when indexing XML type data. The !DataSource must be of type
> `DataSource<Reader>` . URL!DataSource <!> [[Solr1.4]] or !FileDataSource is
> commonly used with X!PathEntityProcessor.
> -
> - === FileListEntityProcessor ===
> - A simple entity processor which can be used to enumerate the list of files
> from a File System based on some criteria. It does not use a !DataSource. The
> entity attributes are:
> - * '''`fileName`''' :(required) A regex pattern to identify files
> - * '''`baseDir`''' : (required) The Base directory (absolute path)
> - * '''`recursive`''' : Recursive listing or not. Default is 'false'
> - * '''`excludes`''' : A Regex pattern of excluded file names
> - * '''`newerThan`''' : A date param . Use the format (`yyyy-MM-dd
> HH:mm:ss`) . It can also be a datemath string eg: ('NOW-3DAYS'). The single
> quote is necessary . Or it can be a valid variableresolver format like
> (${var.name})
> - * '''`olderThan`''' : A date param . Same rules as above
> - * '''`rootEntity`''' :It must be false for this (Unless you wish to just
> index filenames) An entity directly under the <document> is a root entity.
> That means that for each row emitted by the root entity one document is
> created in Solr/Lucene. But as in this case we do not wish to make one
> document per file. We wish to make one document per row emitted by the
> following entity 'x'. Because the entity 'f' has rootEntity=false the entity
> directly under it becomes a root entity automatically and each row emitted by
> that becomes a document.
> - * '''`dataSource`''' :If you use Solr1.3 It must be set to "null" because
> this does not use any DataSource. No need to specify that in Solr1.4 .It just
> means that we won't create a DataSource instance. (In most of the cases there
> is only one !DataSource (A !JdbcDataSource) and all entities just use them.
> In case of !FileListEntityProcessor a !DataSource is not necessary.)
> -
> - example:
> - {{{
> - <dataConfig>
> - <dataSource type="FileDataSource" />
> - <document>
> - <entity name="f" processor="FileListEntityProcessor"
> baseDir="/some/path/to/files" fileName=".*xml" newerThan="'NOW-3DAYS'"
> recursive="true" rootEntity="false" dataSource="null">
> - <entity name="x" processor="XPathEntityProcessor"
> forEach="/the/record/xpath" url="${f.fileAbsolutePath}">
> - <field column="full_name" xpath="/field/xpath"/>
> - </entity>
> - </entity>
> - </document>
> - </dataConfig>
> - }}}
> - Do not miss the `rootEntity` attribute. The implicit fields generated by
> the !FileListEntityProcessor are `fileAbsolutePath, fileSize,
> fileLastModified, fileName` and these are available for use within the entity
> X as shown above. It should be noted that !FileListEntityProcessor returns a
> list of pathnames and that the subsequent entity must use the !FileDataSource
> to fetch the files content.
> -
> - === CachedSqlEntityProcessor ===
> - <<Anchor(cached)>>
> -
> This is an extension of the !SqlEntityProcessor. This !EntityProcessor
> helps reduce the no: of DB queries executed by caching the rows. It does not
> help to use it in the root most entity because only one sql is run for the
> entity.
>
> Example 1.
> @@ -904, +804 @@
>
> === ContentStreamDataSource ===
> <!> [[Solr1.4]]
>
> + Use this to use the POST data as the DataSour0D
> + This is an extension of the !SqlEntityProcessor. This !EntityProcessor
> helps reduce the no: of DB queries executed by caching the rows. It does not
> help to use it in the root most entity because only one sql is run for the
> entity.
> +
> + Example 1.
> + {{{
> + <entity name="x" query="select * from x">
> + <entity name="y" query="select * from y where xid=${x.id}"
> processor="CachedSqlEntityProcessor">
> + </entity>
> + <entity>
> + }}}
> +
> + The usage is exactly same as the other one. When a query is run the results
> are stored and if the same query is run again it is fetched from the cache
> and returned
> +
> + Example 2:
> + {{{
> + <entity name="x" query="select * from x">
> + <entity name="y" query="select * from y"
> processor="CachedSqlEntityProcessor" where="xid=x.id">
> + </entity>
> + <entity>
> + }}}
> +
> + The difference with the previous one is the 'where' attribute. In this case
> the query fetches all the rows from the table and stores all the rows in the
> cache. The magic is in the 'where' value. The cache stores the values with
> the 'xid' value in 'y' as the key. The value for 'x.id' is evaluated every
> time the entity has to be run and the value is looked up in the cache an the
> rows are returned.
> +
> + In the where the lhs (the part before '=') is the column in y and the rhs
> (the part after '=') is the value to be computed for looking up the cache.
> +
> + === PlainTextEntityProcessor ===
> + <<Anchor(plaintext)>>
> + <!> [[Solr1.4]]
> +
> + This !EntityProcessor reads all content from the data source into an single
> implicit field called 'plainText'. The content is not parsed in any way,
> however you may add transformers to manipulate the data within 'plainText' as
> needed or to create other additional fields.
> +
> + example:
> + {{{
> + <entity processor="PlainTextEntityProcessor" name="x"
> url="http://abc.com/a.txt" dataSource="data-source-name">
> + <!-- copies the text to a field called 'text' in Solr-->
> + <field column="plainText" name="text"/>
> + </entity>
> + }}}
> +
> + Ensure that the dataSource is of type !DataSource<Reader> (!FileDataSource,
> URL!DataSource)
> +
> + === LineEntityProcessor ===
> + <<Anchor(LineEntityProcessor)>>
> + <!> [[Solr1.4]]
> +
> + This !EntityProcessor reads all content from the data source on a line by
> line basis, a field called 'rawLine' is returned for each line read. The
> content is not parsed in any way, however you may add transformers to
> manipulate the data within 'rawLine' or to create other additional fields.
> +
> + The lines read can be filtered by two regular expressions
> '''acceptLineRegex''' and '''omitLineRegex'''.
> + This entities additional attributes are:
> + * '''`url`''' : a required attribute that specifies the location of the
> input file in a way that is compatible with the configured datasource. If
> this value is relative and you are using !FileDataSource or URL!DataSource,
> it assumed to be relative to '''baseLoc'''.
> + * '''`acceptLineRegex`''' :an optional attribute that if present discards
> any line which does not match the regExp.
> + * '''`omitLineRegex`''' : an optional attribute that is applied after any
> acceptLineRegex and discards any line which matches this regExp.
> + example:
> + {{{
> + <entity name="jc"
> + processor="LineEntityProcessor"
> + acceptLineRegex="^.*\.xml$"
> + omitLineRegex="/obsolete"
> + url="file:///Volumes/ts/files.lis"
> + rootEntity="false"
> + dataSource="myURIreader1"
> + transformer="RegexTransformer,DateFormatTransformer"
> + >
> + ...
> + }}}
> + While there are use cases where you might need to create a solr document
> per line read from a file, it is expected that in most cases that the lines
> read will consist of a pathname which is in turn consumed by another
> !EntityProcessor
> + such as X!PathEntityProcessor.
> +
> + == DataSource ==
> + <<Anchor(datasource)>>
> + A class can extend `org.apache.solr.handler.dataimport.DataSource` .
> [[http://svn.apache.org/viewvc/lucene/solr/trunk/contrib/dataimporthandler/src/main/java/org/apache/solr/handler/dataimport/DataSource.java?view=markup|See
> source]]
> +
> + and can be used as a !DataSource. It must be configured in the dataSource
> definition
> + {{{
> + <dataSource type="com.foo.FooDataSource" prop1="hello"/>
> + }}}
> + and it can be used in the entities like a standard one
> +
> + === JdbcdataSource ===
> + This is the default. See the [[#jdbcdatasource|example]] . The signature
> is as follows
> + {{{
> + public class JdbcDataSource extends DataSource<Iterator<Map<String,
> Object>>>
> + }}}
> + It is designed to iterate rows in DB one by one. A row is represented as a
> Map.
> +
> + === URLDataSource ===
> + <!> [[Solr1.4]]
> + This datasource is often used with X!PathEntityProcessor to fetch content
> from an underlying file:// or http:// location. See the documentation
> [[#httpds|here]] . The signature is as follows
> + {{{
> + public class URLDataSource extends DataSource<Reader>
> + }}}
> +
> + === HttpDataSource ===
> + <!> Http!DataSource is being deprecated in favour of URL!DataSource in
> [[Solr1.4]]. There is no change in functionality between URL!DataSource and
> !Http!DataSource, only a name change.
> +
> + === FileDataSource ===
> + This can be used like an URL!DataSource but used to fetch content from
> files on disk. The only difference from URL!DataSource, when accessing disk
> files, is how a pathname is specified. The signature is as follows
> + {{{
> + public class FileDataSource extends DataSource<Reader>
> + }}}
> +
> + The attributes are:
> + * '''`basePath`''': (optional) The base path relative to which the value
> is evaluated if it is not absolute
> + * '''`encoding`''': (optional) If the files are to be read in an encoding
> that is not same as the platform encoding
> +
> + === FieldReaderDataSource ===
> + <!> [[Solr1.4]]
> +
> + This can be used like an URL!DataSource . The signature is as follows
> + {{{
> + public class FieldReaderDataSource extends DataSource<Reader>
> + }}}
> + This can be useful for users who have a DB field containing XML and wish to
> use a nested X!PathEntityProcessor to process the fields contents.
> + The datasouce may be configured as follows
> + {{{
> + <datasource name="f" type="FieldReaderDataSource" />
> + }}}
> +
> + The enity which uses this datasource must keep the url value as the
> variable name dataField="field-name". For instance , if the parent entity
> 'dbEntity' has a field called 'xmlData' . Then he child entity woould look
> like,
> + {{{
> + <entity dataSource="f" processor="XPathEntityProcessor"
> dataField="dbEntity.xmlData"/>
> + }}}
> +
> + === ContentStreamDataSource ===
> + <!> [[Solr1.4]]
> +
> Use this to use the POST data as the DataSource. This can be used with any
> !EntityProcessor that uses a !DataSource<Reader>
>
> == EventListeners ==
> @@ -926, +952 @@
>
> * '''`$deleteDocByQuery`''' :Delete docs from Solr by this query. The value
> must be a Solr Query <!> [[Solr1.4]]
>
>
> - == Adding datasource in solrconfig.xml ==
> - <<Anchor(solrconfigdatasource)>>
> -
> - It is possible to configure datasource in solrconfig.xml as well as the
> data-config.xml, however the datasource attributes are expressed differently.
> - {{{
> - <requestHandler name="/dataimport"
> class="org.apache.solr.handler.dataimport.DataImportHandler">
> - <lst name="defaults">
> - <str name="config">/home/username/data-config.xml</str>
> - <lst name="datasource">
> - <str name="driver">com.mysql.jdbc.Driver</str>
> - <str name="url">jdbc:mysql://localhost/dbname</str>
> - <str name="user">db_username</str>
> - <str name="password">db_password</str>
> - </lst>
> - </lst>
> - </requestHandler>
> - }}}
> <<Anchor(arch)>>
> = Architecture =
> The following diagram describes the logical flow for a sample configuration.
>

--
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com