Re: DIH : modify document in sibling entity of root entity

2011-03-10 Thread Stefan Matheis
Hi Chantal,

I'm not sure I understood you correctly (if at all). Two entities,
not arranged as sub-entities, but using values from the previous
entity? Could you paste your dataimport configuration and the relevant
part of the logging output?

Regards
Stefan

On Thu, Mar 10, 2011 at 4:12 PM, Chantal Ackermann
chantal.ackerm...@btelligent.de wrote:
> Dear all,
>
> in DIH, is it possible to have two sibling entities where:
>
> - the first one is the root entity that creates the documents by
>   iterating over a table that has one row per document.
> - the second one is executed after the first entity has completed its
>   iteration, and provides more data that is added to the newly created
>   documents.
>
> I've set up such a DIH configuration, and the second entity is executed,
> but no data is written into the index apart from the data extracted by
> the root entity (i.e. no document is modified?).
>
> Documents are identified by the unique key 'id', which is defined by
> pk=id on both entities.
>
> Is this supposed to work at all? I haven't found anything so far on the
> net, but I could have used the wrong keywords for searching, of course.
>
> To answer the maybe obvious question of why I'm not using a subentity:
> I thought that this solution might be faster because it iterates over
> the second data source once instead of hitting it with a query for each
> document.
>
> Anyway, the main reason I tried this is that I want to know whether
> it works. I'm still not sure whether it should work and I'm just doing
> something wrong...
>
> Thanks!
> Chantal




Re: DIH : modify document in sibling entity of root entity

2011-03-10 Thread Gora Mohanty
On Thu, Mar 10, 2011 at 8:42 PM, Chantal Ackermann
chantal.ackerm...@btelligent.de wrote:
[...]
> Is this supposed to work at all? I haven't found anything so far on the
> net, but I could have used the wrong keywords for searching, of course.
>
> To answer the maybe obvious question of why I'm not using a subentity:
> I thought that this solution might be faster because it iterates over
> the second data source once instead of hitting it with a query for each
> document.
[...]

I think that what you are after can be handled by Solr's
CachedSqlEntityProcessor:
http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor

Two major caveats here:
* I am not 100% sure that I have understood your requirements.
* The documentation for CachedSqlEntityProcessor needs to be improved.
  Will see if I can test it, and come up with a better example. As I have
  not actually used this, it could be that I have misunderstood its purpose.

Regards,
Gora


Re: DIH : modify document in sibling entity of root entity

2011-03-10 Thread Chantal Ackermann
Hi Stefan,

thanks for your time!

No, the second entity is not reusing values from the previous one. It
just provides more fields to it, and, of course the unique identifier -
which in case of the second entity is not unique:

<document name="contributor">
  <entity name="contributor" pk="id" rootEntity="true"
          query="select CONTRIBUTOR_ID as id,
                        CONTRIBUTOR_NAME as name,
                        EXT_ID as extid
                 from   DIM_CONTRIBUTOR">
  </entity>
  <entity name="appearance" pk="id" rootEntity="false"
          transformer="RegexTransformer"
          query="select CONTENTID as contentid,
                        SUBVALUE
                 from   CONTENT_VALUE
                 where  ID_ATTRIBUTE=170">
    <field column="ignore" sourceColName="SUBVALUE"
           groupNames="id,type,pos,character"
           regex="(\d+);(\d+);(\d+);([^;]*);\d*;[A-Z0-9]*;\d*" />
  </entity>
</document>


and here are the fields:

<field name="id" type="slong" indexed="true" stored="true"
       required="true" />
<field name="name" type="string" indexed="true" stored="true"
       required="true" termVectors="true" />
<field name="contentid" type="slong" indexed="true" stored="true"
       multiValued="true" />
<field name="character" type="string" indexed="true" stored="true"
       multiValued="true" termVectors="true" />
<field name="type" type="sint" indexed="true" stored="true"
       multiValued="true" />

(For the sake of simplicity I've removed some fields that would be
created using copyfield instructions and transformers.)
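
To make the intent of the regex in the second entity concrete, here is a
minimal Python sketch that applies the same pattern and group names to a
single value. The sample string is hypothetical (the real SUBVALUE contents
are not shown in this thread), and using a full match is a choice made for
this sketch, not a statement about how RegexTransformer matches.

```python
import re

# Pattern and group names copied from the "appearance" entity above.
PATTERN = re.compile(r"(\d+);(\d+);(\d+);([^;]*);\d*;[A-Z0-9]*;\d*")
GROUP_NAMES = ("id", "type", "pos", "character")


def parse_subvalue(subvalue):
    """Return a dict of the named groups, or None if the value does not match."""
    match = PATTERN.fullmatch(subvalue)
    if match is None:
        return None
    return dict(zip(GROUP_NAMES, match.groups()))


# Hypothetical sample value, invented for illustration only.
sample = "42;3;1;Hamlet;7;AB12;9"
parsed = parse_subvalue(sample)
print(parsed)
```

The first group is what would have to line up with the contributor's id for
the extra fields to end up on the right document.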

I'm currently trying to run this as a subentity with the SQL
restriction SUBVALUE like '${contributor.id};%' but this takes ages...

The sibling version finished in under a minute (and it did actually
process the second entity, I think; it just didn't modify the index).
The subentity version has been running for about 30 min and has only
processed 22,000 documents out of more than 390,000. (Of course, there
is probably no index on that column.)
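
The difference between the two runs can be illustrated outside of Solr. The
sketch below reads the second table once and groups its rows by the leading
id of SUBVALUE, so that each contributor is then a dictionary lookup rather
than a LIKE query. It uses an in-memory SQLite table with invented rows; the
table and column names are taken from the configuration above, everything
else (the data, the function name) is an assumption for illustration.

```python
import sqlite3
from collections import defaultdict

# In-memory stand-in for the real database; all rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table CONTENT_VALUE "
    "(CONTENTID integer, ID_ATTRIBUTE integer, SUBVALUE text)"
)
conn.executemany(
    "insert into CONTENT_VALUE values (?, ?, ?)",
    [
        (1001, 170, "42;3;1;Hamlet;7;AB12;9"),
        (1002, 170, "42;3;2;Ophelia;7;AB12;9"),
        (1003, 170, "57;1;1;Faust;2;CD34;1"),
        (1004, 99, "42;0;0;ignored;0;X;0"),  # other attribute, filtered out
    ],
)


def appearances_by_contributor(connection):
    """Scan the table once and group CONTENTIDs by the id prefix of SUBVALUE."""
    grouped = defaultdict(list)
    rows = connection.execute(
        "select CONTENTID, SUBVALUE from CONTENT_VALUE "
        "where ID_ATTRIBUTE = 170 order by CONTENTID"
    )
    for content_id, subvalue in rows:
        contributor_id = int(subvalue.split(";", 1)[0])
        grouped[contributor_id].append(content_id)
    return dict(grouped)


cache = appearances_by_contributor(conn)
for contributor_id in (42, 57, 99):
    # One lookup per contributor instead of one LIKE query per contributor.
    print(contributor_id, cache.get(contributor_id, []))
```

This is only the general idea of caching the second result set by key; it
makes no claim about how any Solr entity processor is implemented.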


Thanks for any suggestions!
Chantal







Re: DIH : modify document in sibling entity of root entity

2011-03-10 Thread Chantal Ackermann
Hi Gora,

thanks for making me read this part of the documentation again!
This processor probably cannot do what I need out of the box but I will
try to extend it to allow specifying a regular expression in its where
attribute.

Thanks!
Chantal




Re: DIH : modify document in sibling entity of root entity

2011-03-10 Thread Lance Norskog
The DIH is strictly tree-structured. Data flows down the tree. If the
first sibling is the root entity, nothing is used from the second
sibling. This configuration is something that the DIH should fail on.
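
As a rough illustration of "data flows down the tree", here is a toy Python
model, not DIH code: each root row becomes one document, and only an entity
nested under the root gets to see that row and add fields to it. The
function name and the sample data are invented for this sketch.

```python
def import_tree(root_rows, sub_entity=None):
    """Toy model: build one document per root row.

    sub_entity, if given, is called with the parent row and returns extra
    fields for that document. A sibling of the root has no parent row to
    receive, so in this model it simply has no place to plug in.
    """
    documents = []
    for row in root_rows:
        doc = dict(row)
        if sub_entity is not None:
            doc.update(sub_entity(row))
        documents.append(doc)
    return documents


# Hypothetical data for illustration.
contributors = [{"id": 42, "name": "A"}, {"id": 57, "name": "B"}]
appearances = {42: [1001, 1002], 57: [1003]}

# Nested: the child is evaluated per root row and enriches that document.
nested = import_tree(
    contributors,
    lambda parent: {"contentid": appearances.get(parent["id"], [])},
)

# Sibling layout: the root alone produces the documents.
flat = import_tree(contributors)

print(nested)
print(flat)
```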






-- 
Lance Norskog
goks...@gmail.com