Dear all,

 

I don't think this is a difficult problem, but two points should be clarified
first:

1. Almost all facts on Wikipedia need not be sourced (only material that is
challenged or likely to be challenged requires an inline citation).

2. Sourcing cannot tell us whether a fact is true or false; at best, it
indicates the authority of the source.

 

I am not a lawyer, only an "Information Specialist," so the following should
be double-checked with legal:

 

Fact: Google caches practically everything it indexes.

However, attempts by some site owners to claim that this cache violates their
copyright have been consistently dismissed by US courts.

I am not even sure what legal grounds were used, but caching is considered
part of web technology; browser caches, for example, are also not considered
copyright violations.

 

On the drawing board for the Next Generation Search Engine is a "content
analytics" capability for assessing the authority of references, using:

1. A transitive bibliometric authority model (see the sketch after this list).

2. A metric of reference longevity (a fad-vs.-fact test).

3. Bootstrapping via content analysis where the full text of the source is
available.
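
To make item 1 concrete, here is a minimal sketch (in Python, my own
illustration rather than an existing component) of a transitive authority
computation: a PageRank-style iteration over a "who cites whom" graph, so a
source gains authority when it is cited by other authoritative sources. The
sample graph and damping factor below are placeholder assumptions.

# Sketch of a transitive bibliometric authority model (illustration only;
# the input graph and parameters are placeholders).

def authority_scores(citations, damping=0.85, iterations=50):
    """PageRank-style scores over a citation graph: a source is
    authoritative when it is cited by other authoritative sources."""
    nodes = set(citations) | {t for ts in citations.values() for t in ts}
    n = len(nodes)
    score = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new = {node: (1.0 - damping) / n for node in nodes}
        for src in nodes:
            targets = citations.get(src, [])
            if targets:
                # Pass this source's weight on to the sources it cites.
                share = damping * score[src] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A source that cites nothing spreads its weight evenly.
                for node in nodes:
                    new[node] += damping * score[src] / n
        score = new
    return score

# Hypothetical citation graph: each key cites the sources in its list.
example = {
    "journal-A": ["journal-B"],
    "journal-B": ["journal-A", "journal-C"],
    "journal-C": ["journal-A"],
}
print(authority_scores(example))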

 

While the above model is complex, it would take (me) about two weeks of work
to set up a prototype reference repository --

a combination of Nutch (crawler), Solr, and storage (say MySQL/HBase/Cassandra)
to index:

external links,

references with URLs,

references with no URLs.

This data would be immediately consumable over HTTP using standards-based
requests (a Solr feature).
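
As a sketch of what consuming that data could look like via Solr's standard
/select handler (the host, the core name "references", and the field names
are hypothetical placeholders, since the prototype's schema does not exist
yet):

import json
import urllib.parse
import urllib.request

# Query the hypothetical "references" core over HTTP and read the JSON reply.
params = urllib.parse.urlencode({
    "q": 'cited_url:"example.com"',   # references pointing at a given domain
    "fl": "article,cited_url,cited_title",
    "wt": "json",
    "rows": 10,
})
url = "http://localhost:8983/solr/references/select?" + params
with urllib.request.urlopen(url) as response:
    results = json.load(response)

for doc in results["response"]["docs"]:
    print(doc.get("article"), "->", doc.get("cited_url"))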

 

 

Integration with the existing search UI would probably take another two
weeks, as would adding support for caching/indexing the most significant
non-HTML document formats.

 

However, it would not be able to access content behind paywalls without
credentials. If, and only if, WMF sanctions this strategy, I could also draft
a 'win-win' policy for encouraging such "hidden web" resource owners to
provide free access to the crawler, and possibly even to open up their
paywalls to our editors.

For example, we could remove the "nofollow" directive from links to partners
that rank highly as reliable sources (WP:RS).

 

I hope this helps.

 

 

Oren Bochman.

 

MediaWiki Search Developer.

 

From: wikidata-l-boun...@lists.wikimedia.org
[mailto:wikidata-l-boun...@lists.wikimedia.org] On Behalf Of John Erling
Blad
Sent: Sunday, April 01, 2012 10:01 PM
To: Discussion list for the Wikidata project.
Subject: Re: [Wikidata-l] Archiving references for facts?

 

Archiving a page should be pretty safe as long as the archived copy is only
for internal use -- that means something like OTRS. If the archived copy is
republished, it _might_ be viewed as a copyright infringement.

Still note that the archived copy can be used for automatic verification,
i.e. extract a quote and check it against a stored value, without infringing
any copyright. If a publication is withdrawn, it might be an indication that
something is seriously wrong with the page, and no matter what the archived
copy at WebCitation says, the page can't be trusted.

It's really a very difficult problem.

Jeblad

On 1. apr. 2012 14.08, "Helder" <helder.w...@gmail.com> wrote:

