Hi,

2014-07-10 21:08 GMT+01:00 Lewis John Mcgibbney <[email protected]>:
> Hi Bianca,
>
> I cannot reproduce this... The output I get from the webpage serialized as
> JSON for reading purposes is as follows:
> http://paste.apache.org/hhim As you can see there are no blank nodes
> being included as the subject relationship.
> This being said, I DO know what you mean as I've encountered this before
> and find the information about a blank node quite irrelevant if I am honest.
>

In order to reproduce this specific case I used the following commands:

wget http://www.imdb.com/title/tt0286560/?ref_=fn_al_tt_4
./apache-any23-core-1.0/bin/rover -f ntriples -o index.html?ref_=fn_al_tt_4.nt index.html?ref_=fn_al_tt_4

>
>> It seems that in this specific case I could use the content from the
>> property */Person/url* as the unique identifier (*IRI*) for the entity.
>> I suppose it is not a problem of the extractor but on how the page was
>> created. But as many people are using schema.org I was wondering if
>> there is any solution for this case. I would be very glad if someone has
>> any idea of a solution.
>>
>>

I tried to look into another website (Rotten Tomatoes) and I found the same
pattern. Again, IMHO, the URL could be used as the subject of the triples. I
am not sure whether this is valid for all triples on all websites, but in
those examples it seems to work fine. Here is one example from the webpage
http://www.rottentomatoes.com/m/sex_tape_2014/

_:nodecfcd208495d565ef66e7dff9f98764da <http://www.schema.org/Movie/name> "Sex Tape (2014)"@en .
_:nodecfcd208495d565ef66e7dff9f98764da <http://www.schema.org/Movie/contentRating> "R"@en .
_:nodecfcd208495d565ef66e7dff9f98764da <http://www.schema.org/Movie/datePublished> "Jul 18, 2014 Wide"@en .
_:nodecfcd208495d565ef66e7dff9f98764da <http://www.schema.org/Movie/image> <http://content9.flixster.com/movie/11/17/70/11177027_det.jpg> .
_:nodef4501543ed78d92c8615458a688986 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
*_:nodef4501543ed78d92c8615458a688986 <http://schema.org/Person/name> "Cameron Diaz"@en .*
*_:nodef4501543ed78d92c8615458a688986 <http://schema.org/Person/image> <http://content9.flixster.com/rtactor/42/17/42179_tmb.jpg> .*
*_:nodef4501543ed78d92c8615458a688986 <http://schema.org/Person/url> <file:./sex_tape_2014//celebrity/cameron_diaz/> .*
_:nodecfcd208495d565ef66e7dff9f98764da <http://www.schema.org/Movie/actors> _:nodef4501543ed78d92c8615458a688986 .

> Correct, this is NOT a problem with the extractor at all.
> What I think you are suggesting is possibly a *better* way for us to have a
> fallback value for blank nodes like the one you provided in your example.
> Is this a fair statement for me to make?
>

I don't know if it is a better way or not. Actually, I was hoping that
someone could tell me whether it is a reasonable idea or not =) This is the
first time I have really worked with data that is not already in triple
format.

> If this is true then it would be a case of adding functionality to the
> existing html-rdfa11 or html-head-title extractor (whichever one was used
> in this particular case). I would ask you to log a Jira issue and possibly
> explain what it is that you intend to add... we can certainly work towards
> addressing it and I will help you on this no reservations.
>

Sorry for my ignorance, but I don't know which extractor was used =/ I just
ran rover and asked for the output in ntriples format. How can I find out
which extractor was used?

>
> BTW, as I write I am thinking... is this fall back value kind of
> *falsifying* the node relationships? I mean the page is what it is... if we
> use the fall back value then I feel we are kind of manipulating the
> relationships within the page! Does this make sense? Is this a valid point
> I am making?
> Thanks
> Lewis
>

Regards,
Bianca
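
P.S. To make the idea concrete, here is a minimal sketch of the post-processing I have in mind, run on the ntriples output after rover has finished. It is not part of Any23 and does not use any Any23 API. The function name `promote_url_to_subject`, the `url_suffix` parameter and the short `_:movie` / `_:actor` labels are my own assumptions for illustration, and the regular expression only handles simple one-line statements like the ones above. A blank node is replaced only when it has a property whose IRI ends in `/url` with an IRI as its value; every other blank node (the Movie node in my example) is left untouched.

```python
import re

# Matches one simple N-Triples statement: subject, predicate, object, final dot.
# Assumption: one statement per line, no comments, no exotic escapes.
TRIPLE = re.compile(r'^\s*(\S+)\s+(<[^>]*>)\s+(.*?)\s*\.\s*$')


def promote_url_to_subject(ntriples: str, url_suffix: str = "/url") -> str:
    """Replace blank nodes that carry a .../url property with that IRI."""
    triples = []
    for line in ntriples.splitlines():
        match = TRIPLE.match(line)
        if match:
            triples.append(match.groups())

    # First pass: map each blank node label to the IRI found in its url property.
    replacement = {}
    for subj, pred, obj in triples:
        is_url_property = pred[1:-1].endswith(url_suffix)
        if subj.startswith("_:") and is_url_property and obj.startswith("<"):
            # Keep the first url seen if a node happens to have several.
            replacement.setdefault(subj, obj)

    # Second pass: substitute in subject and object position.
    # Blank nodes without a url property are kept exactly as they were.
    rewritten = []
    for subj, pred, obj in triples:
        new_subj = replacement.get(subj, subj)
        new_obj = replacement.get(obj, obj)
        rewritten.append(f"{new_subj} {pred} {new_obj} .")
    return "\n".join(rewritten)


# Shortened version of the Rotten Tomatoes example (labels are hypothetical).
SAMPLE = "\n".join([
    '_:movie <http://www.schema.org/Movie/actors> _:actor .',
    '_:actor <http://schema.org/Person/name> "Cameron Diaz"@en .',
    '_:actor <http://schema.org/Person/url> '
    '<file:./sex_tape_2014//celebrity/cameron_diaz/> .',
])

if __name__ == "__main__":
    print(promote_url_to_subject(SAMPLE))
```

Note that in my output the url value is a file: IRI, presumably because I ran rover on a local copy of the page, so a value like that would still need to be resolved against the real page address before it could serve as a useful identifier. The sketch also rewrites the relationships after extraction rather than inside the extractor, which I think is exactly the "manipulating the relationships" concern you raise.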
