Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-05-07 Thread Alexander Johannesen
On Wed, May 6, 2009 at 18:44, Mike Taylor  wrote:
> Can't you just tell us?

Sorry, but surely you must be tired of me banging on this gong by now?
It's not that I don't want to be helpful, but I've already written a
fair bit about this here and don't want to be marked as a spammer for
Topic Maps.

In the Topic Maps world our global identifiers are called PSIs, for
Published Subject Indicators. There are a few subtleties to them,
but they're not so different from the identifiers you'll find
elsewhere (RDF, the library world, etc.), except of course they are
*always* URIs. Now, the thing here is that they should *always* be
published somewhere, whether as part of a list or on their own. The
next thing is that they should always resolve to something (the
standard doesn't require this, but I'd say you're doing it wrong if
they can't, even if a non-resolving PSI is sometimes a necessary
evil).

This last part is really the important bit: any PSI will act as
1) a global identifier, and 2) resolve to a human-readable text
explaining what it represents. Systems can "just use it" while at the
same time people can choose the right ones for their uses.

And, yes, the identifiers can be done any way you slice them. Some
might think that, e.g., a PSI set for all dates is crazy, as you'd
need to produce identifiers for all dates (and times), and that would
be just way too much to deal with; but again, that's not an
identification problem, that's a resolver problem. If I can browse to
a PSI and get the text "this is the 3rd of June, 1971, using the
such-and-such calendar style", then that's safe for me to use for my
birthday. Let's pretend the PSI is http://iso.org/datetime/03061971.
By releasing a URI template, computers can work with this
automatically, no frills.
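To make the URI-template point concrete, here's a minimal sketch in Python. The iso.org address is the made-up example from above, not a real published PSI set:

```python
from datetime import date

# Hypothetical PSI template for the example date PSI set above;
# the iso.org address is illustrative, not a real published set.
PSI_TEMPLATE = "http://iso.org/datetime/{d:02d}{m:02d}{y:04d}"

def date_psi(d: date) -> str:
    """Build the PSI for a given date from the URI template."""
    return PSI_TEMPLATE.format(d=d.day, m=d.month, y=d.year)

print(date_psi(date(1971, 6, 3)))
# http://iso.org/datetime/03061971
```

Given the template, any application can mint or recognise the PSI for a date without ever fetching the resolved page; the human-readable text behind the URI is only needed when a person wants to check what it means.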

Now a bit more technical: any topic (the Topic Maps representation of
any subject, where "subject" is defined as "anything you can ever
hope to think of") can have more than one PSI, because I might use
the PSI http://someother.org/time/date/3/6/1971 for my date instead.
If my application only understands the former set of PSIs, I can't
merge and find similar cross-semantics (which really is the core of
the problem this thread has been talking about). But simply attach
the second PSI to the same topic, and you can; both parties will
then understand perfectly what you're talking about.
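The merging rule can be sketched in a few lines: two topics represent the same subject if their PSI sets intersect, and merging unions the sets. The topic names and both URIs are just the illustrative ones from this email:

```python
# Minimal sketch of PSI-based merging: two topics are the same
# subject if their PSI sets overlap; merging unions the PSIs.

def merge_topics(a, b):
    """Merge two topics when they share at least one PSI."""
    if a["psis"] & b["psis"]:
        return {"name": a["name"], "psis": a["psis"] | b["psis"]}
    return None  # no shared PSI: nothing says these are the same subject

mine = {"name": "Alex's birthday",
        "psis": {"http://iso.org/datetime/03061971"}}
theirs = {"name": "3 June 1971",
          "psis": {"http://someother.org/time/date/3/6/1971",
                   "http://iso.org/datetime/03061971"}}

merged = merge_topics(mine, theirs)
print(merged["psis"])  # the merged topic now carries both PSIs
```

After the merge, an application that only knows one organisation's PSI set still finds the topic, because the merged topic carries identifiers from both.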

More complex is that the definition of PSI sets doesn't have to
happen at the subject level, i.e. on the topic called "Alex" to which
I tried to attach my birthday. It can be moved up to a meta-model
level, where you say the topic for "Time and dates" has the PSIs of
both organisations, and all topics just use one or the other; we're
shifting the explicitness of identification up a notch.
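A rough sketch of that shift, with invented names: the type topic "Time and dates" carries both organisations' PSIs once, and individual date topics only point at the type rather than each carrying every identifier themselves:

```python
# Lifting identification to the meta model: the type topic carries
# the PSIs of both (hypothetical) organisations, and instance topics
# just reference the type.

datetime_type = {
    "name": "Time and dates",
    "psis": {"http://iso.org/datetime",
             "http://someother.org/time/date"},
}

alex_birthday = {"name": "Alex's birthday", "type": datetime_type}

# An application trusting either PSI set can interpret every topic
# of this type, with no per-topic PSI bookkeeping.
for psi in alex_birthday["type"]["psis"]:
    print(psi)
```

The complexity lives in the small meta model (one type topic, two PSIs) instead of being repeated on every single date topic in the map.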

Having multiple PSIs might seem a bit disorderly, but it's based on
the notion of organic growth, just like the web. People will
gravitate towards using PSIs from the most trusted sources (or the
most accurate, or most whatever), shifting identification schemes
around. This is a good thing (organic growth) at the price of
multiple identifiers; but if the library world started creating
PSIs, I betcha humanity and the library world could both be saved in
one fell swoop! (That's another gong I like to bang.)

I'm kinda anticipating Jonathan saying this is all so complex now. :)
But it's not, really; your application only has to deal with the
complexity of the small meta model you set up, *not* of every single
topic you've got in your map. And topics are mergeable and shareable,
and as such can be merged and "fixed" (or cleaned, or sobered, or
made less complex) for all your various needs as well.

Anyway, that's the basics. Let me know if you want me to bang on. :)
For me, the problem the library world faces isn't really the
mechanics of this (because this is solvable, and I guess you'll just
have to trust that the Topic Maps community has been doing it for the
last 10 years or so :), but how you're going to fit existing
resources into FRBR and RDA. That's a separate discussion, though.


Regards,

Alex
-- 
---
 Project Wrangler, SOA, Information Alchemist, UX, RESTafarian, Topic Maps
-- http://shelter.nu/blog/ 


Re: [CODE4LIB] Wolfram Alpha (was: Another nail in the coffin)

2009-05-07 Thread st...@archive.org

thanks so much for your post Alex, i hadn't had a chance to
consider Wolfram|Alpha (WA) seriously until you posted the link
to the talk (and i had the time to actually watch it).

On 5/3/09 6:13 PM, Alexander Johannesen wrote:
> http://www.youtube.com/watch?v=5TIOH80Qg7Q
> Organisations and people are slowly turning into data
> producers, not book producers.

when i think of data producers, i think CRC press and the like,
companies that compile and publish scientific data. certainly
much of this data is now born-digital or being converted to
digital formats (or put on the web), rather than only being
published in books. but these organizations and people are
still producing data, and those that produce books are in a
rapidly changing space (aren't we all).

imo, the advent of WA will likely result in the production of
_more_ books, not fewer, and will almost certainly benefit
libraries and learners.

after watching Mr. Wolfram's talk, i realize that most of the
responses to Wolfram Alpha on the net appear to be missing the
point. more specifically,

* WA consists of curated (computable) data + algorithms (5M+
  lines of Mathematica) + (inverted) linguistic analysis[1] +
  automated presentation.

* afaict WA does not attempt to compete with Google or Wikipedia
  or open source/public science; they are all complementary and
  compatible!

* WA is admirably unique in its effort to make quality data
  useful, rather than merely organizing/regurgitating heaps of
  folk data and net garbage.

* the value added by WA is that it makes (so-called) public data
  "computable", in the NKS[2] sense, as executable Mathematica
  code.

as mentioned in the talk, Wolfram engineers take data from
proprietary, widely accepted, peer-reviewed sources (probably
familiar to any research librarian) and transform it into
datasets computable in the WA environment[3].

there is considerable confusion as to how WA compares to Google,
Wikipedia, and the Open Source world. i think Google is solving
a different problem with very different data, and Wikipedia (as
mentioned in the talk) is one of many input sources to WA. more
specifically,

* Google's input data set is un-curated (albeit cleverly ranked)
  links to web pages, plus _some_ data from the web. it (rightly)
  does not have "computable" data or the Mathematica
  computational engine, but it does have many of the natural
  language and automatic presentation features, as well as a
  search-engine-query-box type of interface (which i think is the
  cause of much incorrect comparison).

* Wikipedia is merely folk input to WA, complementary but
  missing _quality_ data (think CRC press), computational
  algorithms, natural language processing, and automated
  presentation. the only basis for comparison i can see here is
  that both Wikipedia and WA contain a lot of useful information
  - however, what is done with that data, and how you interact
  with it, is clearly very different.

* WA is not in danger of being "open-sourced", because curating
  and converting quality scientific data into computable
  datasets is non-trivial, and so is the Mathematica
  computational engine. the comparisons here, i think, stem from
  the fact that it has a web interface, and much of the data is
  available from public sources. for many problem-solvers, i
  think it's natural to respond with, "hmmm, how would i have
  done this..."

ultimately, i think Wolfram Alpha will be an extremely valuable
tool for libraries, and could (hopefully) change the way
learners think about how to get information and solve problems.

i think it's exciting to think that it could steer learners and
researchers away from looking to the web (unfortunately, almost
always Google by default) for quick answers, and back to
thinking about how they can answer questions for themselves,
given quality information, and powerful tools for problem
solving.


/st...@archive.org


Notes:

[1] as mentioned near 0:39:00 in the video, Wolfram explains
that the natural language problem WA attempts to solve (like
search engines do) is different from the traditional one.
the traditional NLP problem is taking a mass of data produced
by humans and trying to make sense of it, while the query-box
problem is taking short human utterances and trying to
formulate a problem which is computable from a mass of data.

[2] A New Kind of Science
http://www.wolframscience.com/nksonline/toc.html
i must confess, i haven't completely digested this material.

[3] as a long-time MATLAB user in a former life, this makes a
lot of sense. in MATLAB, everything is a computable matrix, and
solving problems in that environment is about taking (highly
non-linear) real-world problems and linearizing them to be
computable in the MATLAB environment. this approach has deep
mathematical roots, and is consistent across many scientific
disciplines, so the range of problems which can be solved with
the help of MATLAB is broad and deep.

the Mathematica computat