Hi Bianca,

On Thu, Jul 10, 2014 at 7:51 AM, <[email protected]> wrote:

>
> Hi all,
>
>   I started to use any23 recently and I had one issue extracting the
> information from one website (IMDB.com).
>
>  I want to extract triples from the webpages and I faced the following
> problem:
>

I cannot reproduce this. The output I get from the webpage, serialized as
JSON for readability, is here:
http://paste.apache.org/hhim
As you can see, no blank nodes appear as the subject of any relationship.
That said, I DO know what you mean, as I've encountered this before, and
to be honest I find the information about a blank node quite irrelevant.


>
> It seems that in this specific case I could use the content from the
> property */Person/url* as the unique identifier (*IRI*) for the entity. I
> suppose it is not a problem of the extractor but on how the page was
> created. But as many people are using schema.org I was wondering if there
> is any solution for this case. I would be very glad if someone has any idea
> of a solution.
>
>
Correct, this is NOT a problem with the extractor at all.
What I think you are suggesting is possibly a *better* way for us to have a
fallback value for blank nodes, like the one you provided in your example.
Is this a fair statement for me to make?

If this is true, then it would be a case of adding functionality to the
existing html-rdfa11 or html-head-title extractor (whichever one was used
in this particular case). I would ask you to log a Jira issue and explain
what you intend to add. We can certainly work towards addressing it, and I
will help you with this, no reservations.
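
To check that we mean the same thing by a "fallback value", here is a rough
sketch of the idea in Python. It is NOT any23 code and does not use the any23
API. Triples are plain tuples, blank nodes are strings starting with "_:",
and the predicate IRI, the function names and the example.org data are all
made up for illustration. It also assumes the url value is already an
absolute IRI.

```python
# Illustration only: not the any23 API. All names and data are hypothetical.
URL_PREDICATE = "http://schema.org/Person/url"  # assumed predicate IRI


def is_blank(term):
    """Treat any string starting with '_:' as a blank node label."""
    return isinstance(term, str) and term.startswith("_:")


def promote_blank_nodes(triples, url_predicate=URL_PREDICATE):
    """Replace blank nodes with the value of their url property, if any."""
    # Map each blank node to the first url value found for it.
    fallback = {}
    for s, p, o in triples:
        if is_blank(s) and p == url_predicate and s not in fallback:
            fallback[s] = o

    def rename(term):
        # Blank nodes without a url value are left untouched.
        return fallback.get(term, term) if is_blank(term) else term

    # Rewrite both subject and object positions so links stay consistent.
    return [(rename(s), p, rename(o)) for s, p, o in triples]


if __name__ == "__main__":
    sample = [
        ("_:node1", "http://schema.org/Person/name", "Jane Doe"),
        ("_:node1", URL_PREDICATE, "http://example.org/name/jane"),
        ("http://example.org/title/1", "http://schema.org/Movie/actors", "_:node1"),
        ("_:node2", "http://schema.org/Person/name", "No Url"),
    ]
    for triple in promote_blank_nodes(sample):
        print(triple)
```

Note that a blank node with no url property stays a blank node, so nothing
is invented for entities the page does not identify.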

BTW, as I write this I am thinking: is this fallback value kind of
*falsifying* the node relationships? I mean, the page is what it is. If we
use the fallback value, then I feel we are kind of manipulating the
relationships within the page! Does this make sense? Is this a valid point?
Thanks
Lewis
