Re: DIH : modify document in sibling entity of root entity
Hi Chantal,

I'm not sure I understood you correctly (if at all). Two entities, not arranged as sub-entities, but using values from the previous entity? Could you paste your dataimport config and the relevant part of the logging output?

Regards
Stefan

On Thu, Mar 10, 2011 at 4:12 PM, Chantal Ackermann chantal.ackerm...@btelligent.de wrote:
> Dear all,
>
> in DIH, is it possible to have two sibling entities where:
>
> - the first one is the root entity that creates the documents by iterating over a table that has one row per document.
> - the second one is executed after the completion of the first entity iteration, and it provides more data that is added to the newly created documents.
>
> I've set up such a DIH configuration, and the second entity is executed, but no data is written into the index apart from the data extracted by the root entity (= no document is modified?).
>
> Documents are identified by the unique key 'id', which is defined by pk=id on both entities.
>
> Is this supposed to work at all? I haven't found anything so far on the net, but I could have used the wrong keywords for searching, of course.
>
> As an answer to the maybe obvious question of why I'm not using a subentity: I thought that this solution might be faster because it iterates over the second data source instead of hitting it with a query for each document.
>
> Anyway, the main reason I tried this is that I want to know whether it works. I'm still not sure whether it should work and I'm just doing something wrong...
>
> Thanks!
> Chantal
Re: DIH : modify document in sibling entity of root entity
On Thu, Mar 10, 2011 at 8:42 PM, Chantal Ackermann chantal.ackerm...@btelligent.de wrote:
> [...]
> Is this supposed to work at all? I haven't found anything so far on the net but I could have used the wrong keywords for searching, of course.
>
> As answer to the maybe obvious question why I'm not using a subentity: I thought that this solution might be faster because it iterates over the second data source instead of hitting it with a query per each document.
> [...]

I think that what you are after can be handled by Solr's CachedSqlEntityProcessor:
http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor

Two major caveats here:
* I am not 100% sure that I have understood your requirements.
* The documentation for CachedSqlEntityProcessor needs to be improved. Will see if I can test it, and come up with a better example.

As I have not actually used this, it could be that I have misunderstood its purpose.

Regards,
Gora
Re: DIH : modify document in sibling entity of root entity
Hi Stefan,

thanks for your time!

No, the second entity is not reusing values from the previous one. It just provides more fields to it and, of course, the unique identifier, which in the case of the second entity is not unique:

<document name="contributor">
  <entity name="contributor" pk="id" rootEntity="true"
          query="select CONTRIBUTOR_ID as id, CONTRIBUTOR_NAME as name, EXT_ID as extid from DIM_CONTRIBUTOR">
  </entity>
  <entity name="appearance" pk="id" rootEntity="false"
          transformer="RegexTransformer"
          query="select CONTENTID as contentid, SUBVALUE from CONTENT_VALUE where ID_ATTRIBUTE=170">
    <field column="ignore" sourceColName="SUBVALUE"
           groupNames="id,type,pos,character"
           regex="(\d+);(\d+);(\d+);([^;]*);\d*;[A-Z0-9]*;\d*" />
  </entity>
</document>

And here are the fields:

<field name="id" type="slong" indexed="true" stored="true" required="true" />
<field name="name" type="string" indexed="true" stored="true" required="true" termVectors="true" />
<field name="contentid" type="slong" indexed="true" stored="true" multiValued="true" />
<field name="character" type="string" indexed="true" stored="true" multiValued="true" termVectors="true" />
<field name="type" type="sint" indexed="true" stored="true" multiValued="true" />

(For the sake of simplicity I've removed some fields that would be created using copyField instructions and transformers.)

I'm currently trying to run this with a subentity that uses the SQL restriction

  SUBVALUE like '${contributor.id};%'

but this takes ages... The other one finished in under a minute (and it did actually process the second entity, I think; it just didn't modify the index). The current one has been running for about 30 minutes and has only processed 22,000 documents out of more than 390,000. (Of course, there is probably no index on that column.)

Thanks for any suggestions!
Chantal

On Thu, 2011-03-10 at 17:13 +0100, Stefan Matheis wrote:
> Hi Chantal,
>
> I'm not sure I understood you correctly (if at all). Two entities, not arranged as sub-entities, but using values from the previous entity? Could you paste your dataimport config and the relevant part of the logging output?
>
> Regards
> Stefan
>
> [...]
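For reference, the slow sub-entity variant described above would presumably look something like the sketch below. It is reconstructed from the description in this message, not taken from the actual config. The table and column names come from the config quoted above. The one deliberate change is an assumption: the first regex group is written without capturing parentheses and dropped from groupNames, on the guess that the child entity should not emit a second value for the single-valued id field.

```xml
<document name="contributor">
  <!-- Root entity: one row, and therefore one document, per contributor -->
  <entity name="contributor" pk="id" rootEntity="true"
          query="select CONTRIBUTOR_ID as id, CONTRIBUTOR_NAME as name, EXT_ID as extid from DIM_CONTRIBUTOR">

    <!-- Child entity: this query is executed once for EVERY contributor row.
         ${contributor.id} is replaced with the id of the current parent row. -->
    <entity name="appearance"
            transformer="RegexTransformer"
            query="select CONTENTID as contentid, SUBVALUE from CONTENT_VALUE
                   where ID_ATTRIBUTE=170 and SUBVALUE like '${contributor.id};%'">
      <!-- Assumption: the leading id is matched but not captured,
           so only type, pos and character are written to the document -->
      <field column="ignore" sourceColName="SUBVALUE"
             groupNames="type,pos,character"
             regex="\d+;(\d+);(\d+);([^;]*);\d*;[A-Z0-9]*;\d*" />
    </entity>
  </entity>
</document>
```

With more than 390,000 contributors this means more than 390,000 separate queries, each with a LIKE condition on a column that probably has no index, which would be consistent with the run time reported above.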
Re: DIH : modify document in sibling entity of root entity
Hi Gora,

thanks for making me read this part of the documentation again! This processor probably cannot do what I need out of the box but I will try to extend it to allow specifying a regular expression in its where attribute.

Thanks!
Chantal

On Thu, 2011-03-10 at 17:39 +0100, Gora Mohanty wrote:
> [...]
> I think that what you are after can be handled by Solr's CachedSqlEntityProcessor:
> http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor
> [...]
Re: DIH : modify document in sibling entity of root entity
The DIH is strictly tree-structured. Data flows down the tree. If the first sibling is the root entity, nothing is used from the second sibling. This configuration is something that the DIH should fail on.

On Thu, Mar 10, 2011 at 9:14 AM, Chantal Ackermann chantal.ackerm...@btelligent.de wrote:
> Hi Gora,
>
> thanks for making me read this part of the documentation again! This processor probably cannot do what I need out of the box but I will try to extend it to allow specifying a regular expression in its where attribute.
>
> Thanks!
> Chantal
>
> [...]

--
Lance Norskog
goks...@gmail.com
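To make the tree structure concrete, here is an untested sketch of how the second entity might be nested under the root entity and served from a cache, so that the second table is read once rather than once per document. It combines the config posted earlier in the thread with the CachedSqlEntityProcessor `where` form shown on the wiki page linked above. Several things here are assumptions and not from the thread: the `contributor_id` alias is made up for illustration, the SUBSTR/INSTR expression is Oracle-style SQL (the database was never stated), and the leading id group of the regex is left uncaptured so that the child does not emit a second id value.

```xml
<document name="contributor">
  <!-- Root entity: creates one document per contributor row -->
  <entity name="contributor" pk="id" rootEntity="true"
          query="select CONTRIBUTOR_ID as id, CONTRIBUTOR_NAME as name, EXT_ID as extid from DIM_CONTRIBUTOR">

    <!-- Child entity, nested so that its rows flow into the parent's document.
         The query runs once; the rows are cached and looked up per parent row.
         contributor_id is a hypothetical alias: the part of SUBVALUE before
         the first ';' (Oracle-style SUBSTR/INSTR, adjust for your database). -->
    <entity name="appearance"
            processor="CachedSqlEntityProcessor"
            transformer="RegexTransformer"
            query="select CONTENTID as contentid, SUBVALUE,
                          SUBSTR(SUBVALUE, 1, INSTR(SUBVALUE, ';') - 1) as contributor_id
                   from CONTENT_VALUE where ID_ATTRIBUTE=170"
            where="contributor_id=contributor.id">
      <!-- Assumption: the leading id is matched but not captured -->
      <field column="ignore" sourceColName="SUBVALUE"
             groupNames="type,pos,character"
             regex="\d+;(\d+);(\d+);([^;]*);\d*;[A-Z0-9]*;\d*" />
    </entity>
  </entity>
</document>
```

Two things to watch if you try this: the cached rows are held in memory for the duration of the import, and the lookup may only match when both sides of the `where` have the same type, so the extracted `contributor_id` might need a cast to the numeric type of `id`.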