Re: Document fragment vocabulary

Sebastian Hellmann Mon, 15 Aug 2011 23:17:00 -0700

On 16.08.2011 14:12, Michael Hausenblas wrote:

It is not really LinkedData friendly.


Why?

It does not scale for large documents. Let's say you have a 200 MB text file with an average of 3 annotations per line (200,000 lines, 600,000 triples).

Somebody attached an annotation on line 20000:

<http://example.com/text.txt#line=20000>  my:comment "Please remove this line. It is so negative!" .

When making a request with an RDF/XML Accept header, you would always need to retrieve all annotations for all lines. Then, after transferring the 200 MB, the client would throw away all triples but the one.
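
A minimal sketch of why this happens, assuming a plain HTTP client and a handful of made-up annotations (the triples and the helper names `request_target` and `select_subject` are illustrative only): the fragment is stripped before the request is made, so the server cannot know which line is meant and the selection has to happen on the client.

```python
from urllib.parse import urldefrag


def request_target(uri):
    """Return the URI an HTTP client actually requests (fragment removed)."""
    url, _fragment = urldefrag(uri)
    return url


def select_subject(triples, subject):
    """Client-side filtering: keep only the triples about one subject."""
    return [t for t in triples if t[0] == subject]


# Hypothetical annotations; a real document would carry 600,000 of them.
triples = [
    ("http://example.com/text.txt#line=1", "my:comment", "Nice opening."),
    ("http://example.com/text.txt#line=20000", "my:comment",
     "Please remove this line. It is so negative!"),
    ("http://example.com/text.txt#line=20001", "my:comment", "Typo here."),
]

wanted = "http://example.com/text.txt#line=20000"

# The server only ever sees the document URI, never "#line=20000".
print(request_target(wanted))

# Everything is transferred, then all triples but one are thrown away.
print(select_subject(triples, wanted))
```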

@Michael: is there some standardisation regarding URIs for text going on?
As you've rightly identified, an RFC already exists. What would this new standardisation activity be chartered for?
As an aside, this reminds me a bit of http://xkcd.com/927/

Hm, actually you created an extra standard yourself for CSV, because the approach by Wilde and Dürst did not cover your use case. It does not cover mine 100% either. Potentially, there are a lot of text-based formats, so there should be a way to extend the pattern somehow.

The approach by Wilde and Dürst[1] seems to lack stability.
I don't know what you mean by this. Lack of take-up, yes. Stability, what's that?

Wilde and Dürst provide integrity checks, but there is no proposal that produces robust fragment IDs, e.g. something that works on the context and not on line or position. A change in the document at position 0 might render all fragment IDs obsolete. E.g. "#range=(574,585)" would not be valid if one character was inserted at the beginning of the document.
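
To illustrate the point with a small sketch: the sample sentence and both helpers below are assumptions made for this example, and `resolve_context` is only one conceivable context-based scheme, not something defined by RFC 5147. An offset pair stops pointing at the right text after a single insertion at position 0, while a lookup by surrounding context still finds it.

```python
def resolve_offset(text, begin, end):
    """Offset-based addressing: return the characters in [begin, end)."""
    return text[begin:end]


def resolve_context(text, before, target, after):
    """Context-based addressing: locate target via its neighbouring text.

    Returns (begin, end) of the target, or None if the context is gone.
    """
    pos = text.find(before + target + after)
    if pos == -1:
        return None
    begin = pos + len(before)
    return begin, begin + len(target)


original = "The Semantic Web is a web of data."
target = "web of data"
begin = original.index(target)
end = begin + len(target)

# One character inserted at the very beginning of the document.
edited = "X" + original

print(repr(resolve_offset(original, begin, end)))  # the intended text
print(repr(resolve_offset(edited, begin, end)))    # shifted by one character

# The context-based lookup adapts to the new position.
new_begin, new_end = resolve_context(edited, "a ", target, ".")
print(repr(edited[new_begin:new_end]))
```

The trade-off is that a context-based identifier breaks when the context itself is edited, so it is more robust to distant changes but not unconditionally stable.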

Do you think we could do such standardisation for document fragments and text fragments within the Media Fragments Group [3]?
No. Disclaimer: I'm a MF WG member. Look at our charter [1] ...

Ok, thanks for clarifying that.


Maybe this thread should slowly be moved over to u...@w3.org [2]?

The # part not being sent to the server might be interesting for this list, as it is a linked data problem. Also, I think we should create an OWL vocabulary to describe, document and standardize different fragment identifiers, as Alexander has started. But we should only do it with the W3C. Otherwise it will truly become "competing standard 15".

The ontology could also just be descriptive, reflecting the RFCs.
Should we cross-post? Alternatively I could just start another thread there.
Sebastian

Cheers,
    Michael

[1] http://www.w3.org/2008/01/media-fragments-wg.html
[2] http://lists.w3.org/Archives/Public/uri/
--
Dr. Michael Hausenblas, Research Fellow
LiDRC - Linked Data Research Centre
DERI - Digital Enterprise Research Institute
NUIG - National University of Ireland, Galway
Ireland, Europe
Tel. +353 91 495730
http://linkeddata.deri.ie/
http://sw-app.org/about.html

On 16 Aug 2011, at 05:40, Sebastian Hellmann wrote:
Hi Michael and Alex,
sorry to answer so late, I was on holiday in France.
I looked at the three provided resources [1,2,3] and there are still some comments and questions I have.
1. The part after the # is actually not sent to the server. Are there any solutions for this? It is not really LinkedData friendly. Compare http://linkedgeodata.org/triplify/near/51.033333,13.733333/1000/class/Amenity
(Currently not working, but it gives all points within a 1000m radius)
The client would be required to calculate the subset of triples from the resource that are addressed.
2. [1] is quite basic and they are basically using position and lines. I made a qualitative comparison of different fragment ID approaches for text in [4], slide 7. I was wondering if anybody has researched such properties of URI fragments. Currently, I am benchmarking the stability of these URIs using Wikipedia changes.
Has such work been done before?
3. @Alex: In my opinion, your proposed fragment ontology can only be used to provide documentation for different fragments.
I would rather propose to just use one triple:
<http://www.w3.org/DesignIssues/LinkedData.html#offset__14406-14418>
  a <http://nlp2rdf.lod2.eu/schema/string/OffsetBasedString>
The ontology I made for Strings might be generalized for formats other than text based [5].
One triple is much shorter. As you can see I also tried to encode the type of fragment right into the fragment "offset", although a notation like "type=offset" might be better.
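
As a rough sketch of how a client might take such a URI apart (the regular expression and the function name `parse_offset_uri` are assumptions for this example, and whether the end offset is inclusive or exclusive is not stated above):

```python
import re
from urllib.parse import urldefrag

# Assumed pattern for fragments such as "offset__14406-14418".
OFFSET_FRAGMENT = re.compile(r"^offset__(\d+)-(\d+)$")


def parse_offset_uri(uri):
    """Split an offset-style URI into (document URI, begin, end)."""
    document, fragment = urldefrag(uri)
    match = OFFSET_FRAGMENT.match(fragment)
    if match is None:
        raise ValueError("not an offset-based fragment: %r" % fragment)
    return document, int(match.group(1)), int(match.group(2))


uri = "http://www.w3.org/DesignIssues/LinkedData.html#offset__14406-14418"
document, begin, end = parse_offset_uri(uri)
print(document, begin, end)
```

Because the fragment type is part of the fragment itself, the client can decide how to resolve it without fetching any additional triples.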
4. @Michael: is there some standardisation regarding URIs for text going on? I heard there would be a Language Technology W3C group. The approach by Wilde and Dürst [1] seems to lack stability. Do you think we could do such standardisation for document fragments and text fragments within the Media Fragments Group [3]? I really thought the liveUrl project was quite good, but it seems dead [6].
In LOD2 [7] and NIF [8] we will need some fragment identifiers to standardize NLP tools for the LOD2 stack. It would be great to reuse stuff instead of starting from scratch. I had to extend [1], for example, because it did not produce stable URIs and also did not contain the type of algorithm used to produce the URI.
All the best,
Sebastian


[1] http://tools.ietf.org/html/rfc5147
[2] http://tools.ietf.org/html/draft-hausenblas-csv-fragment
[3] http://www.w3.org/TR/media-frags/
[4] http://www.slideshare.net/kurzum/nif-nlp-interchange-format
[5] http://nlp2rdf.lod2.eu/schema/string/
[6] http://liveurls.mozdev.org/index.html
[7] http://lod2.eu
[8] http://aksw.org/Projects/NIF

On 04.08.2011 22:37, Michael Hausenblas wrote:
Alex,
Has something already done this? Is it even (mostly?) sane?
Sane yes, IMO. Done, sort of, see:

+ URI Fragment Identifiers for the text/plain [1]
+ URI Fragment Identifiers for the text/csv [2]

Cheers,
    Michael

[1] http://tools.ietf.org/html/rfc5147
[2] http://tools.ietf.org/html/draft-hausenblas-csv-fragment


On 4 Aug 2011, at 14:22, Alexander Dutton wrote:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi all,

Say I have an XML document, <http://example.org/something.xml>, and I
want to talk about some part of it in RDF. As this is XML, being
able to point into it using XPath sounds ideal, leading to something like:
<#fragment> a fragment:Fragment ;
 fragment:within <http://example.org/something.xml> ;
 fragment:locator "/some/path[1]"^^fragment:xpath .

(For now we can ignore whether we wanted a nodeset or a single node,
and how to handle XML namespaces.)
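
A minimal sketch of how a consumer might dereference such a locator, assuming Python's standard-library ElementTree (which supports only a subset of XPath) and a made-up sample document; the helper name `resolve_xpath` and the synthetic "holder" element are illustrative, and XML namespaces are ignored here as well:

```python
import xml.etree.ElementTree as ET


def resolve_xpath(xml_text, locator):
    """Return the elements matched by an absolute path such as /some/path[1].

    ElementTree evaluates paths relative to an element, so the document
    root is hung under a synthetic parent and the path is made relative.
    """
    root = ET.fromstring(xml_text)
    holder = ET.Element("holder")
    holder.append(root)
    return holder.findall("." + locator)


# Hypothetical stand-in for http://example.org/something.xml
sample = "<some><path>first</path><path>second</path></some>"

for element in resolve_xpath(sample, "/some/path[1]"):
    print(element.tag, element.text)
```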

More generally, we might want other ways of locating fragments
(probably with a datatype for each):

* character offsets / ranges
* byte offsets / ranges
* line numbers / ranges
* some sub-rectangle of an image
* XML node IDs
* page ranges of a paginated document

Some of these will be IMT-specific and may need some more thinking
about, but the idea is there.


Has something already done this? Is it even (mostly?) sane?


Yours,

Alex


NB. Our actual use-case is having pointers into an NLM XML file
(embodying a journal article) so we can hook up our in-text reference
pointer¹ URIs to the original XML elements (<xref/>s) they were
generated from. This will allow us to work out the context of each
citation for use in further analysis of the relationship between the
citing and cited articles.

¹ See
<http://opencitations.wordpress.com/2011/07/01/nomenclature-for-citations-and-references/>
for an explanation of the terminology.
- --
Alexander Dutton
Developer, data.ox.ac.uk, InfoDev, Oxford University Computing Services
Open Citations Project, Department of Zoology, University of Oxford
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org/

iEYEARECAAYFAk46nS4ACgkQS0pRIabRbjDVZQCdGblvoMgNqEietlE5EwAkPJY8
pikAn2KApM0HjcXj6TZegA+Dek/DJIQX
=UcCr
-----END PGP SIGNATURE-----
--
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org



