Hi Stefan,

Thanks for your time!

No, the second entity is not reusing values from the previous one. It
just adds more fields to the documents and, of course, supplies the
unique identifier, which in the case of the second entity is not unique:

<document name="contributor">
        <entity name="contributor" pk="id" rootEntity="true"
                query="select   CONTRIBUTOR_ID as id,
                                CONTRIBUTOR_NAME as name,
                                EXT_ID as extid
                        from    DIM_CONTRIBUTOR">
        </entity>
        <entity name="appearance" pk="id" rootEntity="false"
                transformer="RegexTransformer"
                query="select   CONTENTID as contentid,
                                SUBVALUE
                        from    CONTENT_VALUE
                        where   ID_ATTRIBUTE=170">
                <field column="ignore" sourceColName="SUBVALUE"
                        groupNames="id,type,pos,character"
                        regex="(\d+);(\d+);(\d+);([^;]*);\d*;[A-Z0-9]*;\d*" />
        </entity>
</document>


and here are the fields:

<field name="id" type="slong" indexed="true" stored="true"
required="true" />
<field name="name" type="string" indexed="true" stored="true"
required="true" termVectors="true" />
<field name="contentid" type="slong" indexed="true" stored="true"
multiValued="true" />
<field name="character" type="string" indexed="true" stored="true"
multiValued="true" termVectors="true" />
<field name="type" type="sint" indexed="true" stored="true"
multiValued="true" />

(For the sake of simplicity I've removed some fields that would be
created using copyField instructions and transformers.)

I'm currently trying to run this with a sub-entity that uses the SQL
restriction "SUBVALUE like '${contributor.id};%'", but this takes ages...
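
For reference, the sub-entity variant looks roughly like the following.
This is only a sketch of the setup described above: the nesting and the
"where" clause follow that description, and everything else is assumed
to be carried over unchanged from the sibling configuration.

```xml
<document name="contributor">
        <entity name="contributor" pk="id" rootEntity="true"
                query="select   CONTRIBUTOR_ID as id,
                                CONTRIBUTOR_NAME as name,
                                EXT_ID as extid
                        from    DIM_CONTRIBUTOR">
                <!-- Nested entity: this query runs once per contributor row. -->
                <!-- Assumption: transformer and field mapping are the same
                     as in the sibling version. -->
                <entity name="appearance"
                        transformer="RegexTransformer"
                        query="select   CONTENTID as contentid,
                                        SUBVALUE
                                from    CONTENT_VALUE
                                where   ID_ATTRIBUTE=170
                                and     SUBVALUE like '${contributor.id};%'">
                        <field column="ignore" sourceColName="SUBVALUE"
                                groupNames="id,type,pos,character"
                                regex="(\d+);(\d+);(\d+);([^;]*);\d*;[A-Z0-9]*;\d*" />
                </entity>
        </entity>
</document>
```

With this layout the database has to evaluate the "like" restriction
once for every one of the 390,000+ contributor rows.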

The sibling-entity version finished in under a minute (and I think it
did actually process the second entity, it just didn't modify the
index). The sub-entity version has been running for about 30 minutes and
has only processed 22,000 documents out of more than 390,000. (Of
course, there is probably no index on that column...)


Thanks for any suggestions!
Chantal




On Thu, 2011-03-10 at 17:13 +0100, Stefan Matheis wrote:
> Hi Chantal,
> 
> I'm not sure I understood you correctly (if at all). Two entities,
> not arranged as sub-entities, but using values from the previous
> entity? Could you paste your dataimport config & the relevant part of
> the logging output?
> 
> Regards
> Stefan
> 
> On Thu, Mar 10, 2011 at 4:12 PM, Chantal Ackermann
> <chantal.ackerm...@btelligent.de> wrote:
> > Dear all,
> >
> > in DIH, is it possible to have two sibling entities where:
> >
> > - the first one is the root entity that creates the documents by
> > iterating over a table that has one row per document.
> > - the second one is executed after the completion of the first entity
> > iteration, and it provides more data that is added to the newly created
> > documents.
> >
> >
> > I've set up such a DIH configuration, and the second entity is executed,
> > but no data is written into the index apart from the data extracted by
> > the root entity (= no document is modified?).
> >
> > Documents are identified by the unique key 'id' which is defined by
> > pk="id" on both entities.
> >
> > Is this supposed to work at all? I haven't found anything so far on the
> > net but I could have used the wrong keywords for searching, of course.
> >
> > To answer the perhaps obvious question of why I'm not using a sub-entity:
> > I thought that this solution might be faster because it iterates over
> > the second data source once instead of hitting it with one query per
> > document.
> >
> > Anyway, the main reason I tried this is because I want to know whether
> > it works. I'm still not sure whether it should work but I'm doing
> > something wrong...
> >
> >
> > Thanks!
> > Chantal
> >
> >
