Hi Bianca,

On Thu, Jul 10, 2014 at 7:51 AM, <[email protected]> wrote:
> Hi all,
>
> I started to use any23 recently and I had one issue extracting the
> information from one website (IMDB.com).
>
> I want to extract triples from the webpages and I faced the following
> problem:
>

I cannot reproduce this... The output I get from the webpage, serialized as JSON for readability, is as follows:

http://paste.apache.org/hhim

As you can see, no blank nodes are included as the subject of a relationship. That said, I DO know what you mean, as I've encountered this before, and I find the information about a blank node quite irrelevant, if I am honest.

> It seems that in this specific case I could use the content from the
> property */Person/url* as the unique identifier (*IRI*) for the entity. I
> suppose it is not a problem of the extractor but on how the page was
> created. But as many people are using schema.org I was wondering if there
> is any solution for this case. I would be very glad if someone has any idea
> of a solution.
>

Correct, this is NOT a problem with the extractor at all. What I think you are suggesting is possibly a *better* way for us to have a fallback value for blank nodes like the one you provided in your example. Is this a fair statement for me to make?

If so, then it would be a case of adding functionality to the existing html-rdfa11 or html-head-title extractor (whichever one was used in this particular case). I would ask you to log a Jira issue and explain what it is that you intend to add... we can certainly work towards addressing it, and I will help you on this, no reservations.

BTW, as I write I am thinking... is this fallback value kind of *falsifying* the node relationships? I mean, the page is what it is... if we use the fallback value then I feel we are kind of manipulating the relationships within the page! Does this make sense? Is this a valid point I am making?

Thanks
Lewis
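
P.S. To make the fallback idea concrete, here is a rough sketch in plain Python of what "use the url property as the IRI" could look like as a post-processing step over extracted triples. This is NOT Any23 code and does not use any Any23 API; the function name, the tuple representation of triples, the `_:` blank-node convention and the example.org data are all assumptions made up for illustration. It also returns the mapping it applied, so the original blank-node structure is not silently lost, which speaks to the "falsifying" worry above.

```python
# Assumed predicate IRI; a real page might use https or a different vocabulary.
SCHEMA_URL = "http://schema.org/url"


def is_blank(term):
    """Blank nodes are written N-Triples style in this sketch, e.g. '_:n1'."""
    return term.startswith("_:")


def promote_url_to_iri(triples, url_predicate=SCHEMA_URL):
    """Replace blank nodes that carry a url property with that url.

    `triples` is a list of (subject, predicate, object) string tuples.
    Returns (rewritten_triples, mapping), where mapping records which
    blank node was replaced by which IRI. Blank nodes without a url
    property are left untouched.
    """
    mapping = {}
    for s, p, o in triples:
        # The first url value seen for a blank node wins.
        if is_blank(s) and p == url_predicate and s not in mapping:
            mapping[s] = o
    rewritten = [
        (mapping.get(s, s), p, mapping.get(o, o)) for s, p, o in triples
    ]
    return rewritten, mapping


# Hypothetical input: a Person described by a blank node, referenced by another node.
triples = [
    ("_:n1", "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", "http://schema.org/Person"),
    ("_:n1", SCHEMA_URL, "http://example.org/name/p1"),
    ("_:n2", "http://schema.org/actor", "_:n1"),
]
rewritten, mapping = promote_url_to_iri(triples)
```

Here `_:n1` is replaced by its url value in both subject and object position, while `_:n2` has no url property and stays a blank node. Whether such a rewrite should be on by default is exactly the open question; keeping it optional and reporting the mapping seems the safer design.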
