Re: [Foundation-l] Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship

2009-06-23 Thread John Vandenberg
On Wed, Jun 24, 2009 at 6:10 AM, Anthony  wrote:
>
> On Tue, Jun 23, 2009 at 3:58 PM, Anthony  wrote:
>
> > On Tue, Jun 23, 2009 at 2:24 PM, Brian  wrote:
> >
> >> Ok Shakespeare. But in plain english you appear to be saying that
> >> corporations are inherently greedy and have a tendency to be evil. Sure,
> >> but we expect more out of GOOG. This is not MSFT we are talking about.
> >
> >
> > Of course they're inherently greedy.  That's the whole purpose of a
> > for-profit corporation - to make as much money as possible for its
> > shareholders.
> >
>
> I guess even a non-profit is inherently greedy, it's just greedy for
> something other than money.  The WMF is greedy for the spread of free
> knowledge.
>
> But this is off-topic.  Let's take it to another list or something.

off-topic?? ... surely you jest!!

I think about _three_ of the 50+ emails in this thread have been on
the topic of open access journal articles on Wikisource.

--
John Vandenberg

___
foundation-l mailing list
foundation-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l


Re: [Foundation-l] Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship

2009-06-23 Thread Anthony
On Tue, Jun 23, 2009 at 3:58 PM, Anthony  wrote:

> On Tue, Jun 23, 2009 at 2:24 PM, Brian  wrote:
>
>> Ok Shakespeare. But in plain english you appear to be saying that
>> corporations are inherently greedy and have a tendency to be evil. Sure,
>> but we expect more out of GOOG. This is not MSFT we are talking about.
>
>
> Of course they're inherently greedy.  That's the whole purpose of a
> for-profit corporation - to make as much money as possible for its
> shareholders.
>

I guess even a non-profit is inherently greedy, it's just greedy for
something other than money.  The WMF is greedy for the spread of free
knowledge.

But this is off-topic.  Let's take it to another list or something.


Re: [Foundation-l] Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship

2009-06-23 Thread Anthony
On Tue, Jun 23, 2009 at 2:24 PM, Brian  wrote:

> Ok Shakespeare. But in plain english you appear to be saying that
> corporations are inherently greedy and have a tendency to be evil. Sure,
> but we expect more out of GOOG. This is not MSFT we are talking about.


Of course they're inherently greedy.  That's the whole purpose of a
for-profit corporation - to make as much money as possible for its
shareholders.  As for "tendency to be evil", I think that rests on your
definition of "evil".


Re: [Foundation-l] Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship

2009-06-23 Thread Anthony
On Tue, Jun 23, 2009 at 1:09 PM, Brian  wrote:

> 2009/6/23 Samuel Klein 
>
> > Yes, but my understanding is that while google provided part of the mbp
> > data and scans, its continued updates to ocr since then are not being
> > shared.  I would be glad to learn this was not the case...
> >
>
> The dataset you need to train an OCR system to be as good as theirs is the
> raw images and the plain text. They aren't making it easy to get either of
> those things :( They have presumably improved the software in other ways as
> well..
>
> WTF GOOG?


It's almost like they're trying to run a business or something.


Re: [Foundation-l] Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship

2009-06-23 Thread Brian
Ok Shakespeare. But in plain english you appear to be saying that
corporations are inherently greedy and have a tendency to be evil. Sure, but
we expect more out of GOOG. This is not MSFT we are talking about.

On Tue, Jun 23, 2009 at 12:13 PM, Michael Snow wrote:

> Brian wrote:
> > On Tue, Jun 23, 2009 at 11:44 AM, Michael Snow wrote:
> >
> >>> The dataset you need to train an OCR system to be as good as theirs is
> >>> the raw images and the plain text. They aren't making it easy to get
> >>> either of those things :( They have presumably improved the software in
> >>> other ways as well..
> >>>
> >>> WTF GOOG?
> >>>
> >> Well, when your shorthand uses their stock ticker symbol, your argument
> >> has already been coopted.
> >>
> >> --Michael Snow
> >>
> > I get the joke but um, I used it on purpose and which one of my arguments
> > been "coopted" ??
> >
> Coopting is not like rebutting; it does not bite chunks out of specific
> pieces, it swallows whole. Symbols are powerful things, perhaps even
> more so outside the mathematical logic of argument. They do not serve
> only your purposes, even if you use them purposefully. My observations
> may be wry, but they are not entirely in jest.
>
> --Michael Snow


Re: [Foundation-l] Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship

2009-06-23 Thread Michael Snow
Brian wrote:
> On Tue, Jun 23, 2009 at 11:44 AM, Michael Snow wrote:
>   
>>> The dataset you need to train an OCR system to be as good as theirs is
>>> the raw images and the plain text. They aren't making it easy to get
>>> either of those things :( They have presumably improved the software in
>>> other ways as well..
>>>
>>> WTF GOOG?
>>>
>> Well, when your shorthand uses their stock ticker symbol, your argument
>> has already been coopted.
>>
>> --Michael Snow
>> 
> I get the joke but um, I used it on purpose and which one of my arguments
> been "coopted" ??
>   
Coopting is not like rebutting; it does not bite chunks out of specific 
pieces, it swallows whole. Symbols are powerful things, perhaps even 
more so outside the mathematical logic of argument. They do not serve 
only your purposes, even if you use them purposefully. My observations 
may be wry, but they are not entirely in jest.

--Michael Snow



Re: [Foundation-l] Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship

2009-06-23 Thread Brian
On Tue, Jun 23, 2009 at 11:44 AM, Michael Snow wrote:

>
> > The dataset you need to train an OCR system to be as good as theirs is
> > the raw images and the plain text. They aren't making it easy to get
> > either of those things :( They have presumably improved the software in
> > other ways as well..
> >
> > WTF GOOG?
> >
> Well, when your shorthand uses their stock ticker symbol, your argument
> has already been coopted.
>
> --Michael Snow
>

I get the joke but um, I used it on purpose and which one of my arguments
been "coopted" ??


Re: [Foundation-l] Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship

2009-06-23 Thread Michael Snow
Brian wrote:
> 2009/6/23 Samuel Klein 
>   
>> Yes, but my understanding is that while google provided part of the mbp
>> data and scans, its continued updates to ocr since then are not being
>> shared.  I would be glad to learn this was not the case...
>> 
> The dataset you need to train an OCR system to be as good as theirs is the
> raw images and the plain text. They aren't making it easy to get either of
> those things :( They have presumably improved the software in other ways as
> well..
>
> WTF GOOG?
>   
Well, when your shorthand uses their stock ticker symbol, your argument 
has already been coopted.

--Michael Snow



Re: [Foundation-l] Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship

2009-06-23 Thread Brian
2009/6/23 Samuel Klein 

> Yes, but my understanding is that while google provided part of the mbp
> data and scans, its continued updates to ocr since then are not being
> shared.  I would be glad to learn this was not the case...
>

The dataset you need to train an OCR system to be as good as theirs is the
raw images and the plain text. They aren't making it easy to get either of
those things :( They have presumably improved the software in other ways as
well..

WTF GOOG?


[Foundation-l] Public repositories for research dumps

2009-06-23 Thread Felipe Ortega

Hello.

A few hours ago, a new public repository was created to host WikiXRay
database dumps, which contain information extracted from the public Wikipedia
database dumps. The repository is hosted by RedIRIS (in short, the Spanish
equivalent of Kennisnet in the Netherlands).

http://sunsite.rediris.es/mirror/WKP_research

ftp://ftp.rediris.es/mirror/WKP_research

These new dumps are intended to save other researchers time and effort, since
they won't need to parse the complete XML dumps to extract all relevant
activity metadata. We used mysqldump to create the dumps from our databases.

As of today, only some of the biggest Wikipedias are available. However, over
the coming days the full set of available languages will be ready for
download. The files will be updated regularly.

The procedure is as follows:

1. Find the research dump you are interested in. Download and decompress it
on your local system.

2. Create a local DB to import the information.

3. Load the dump file, using a MySQL user with insert privileges:

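$> mysql -u user -p myDB < dumpfile.sql

(With a space after -p, mysql prompts for the password; a word placed there
would be read as the database name, not as the password.)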

And you're done.
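
The three steps above can also be scripted. What follows is only a rough sketch, not part of WikiXRay: it assumes the downloaded file is gzip-compressed (the compression format is not stated above), and the function names, the user name, and the database name are placeholders invented for illustration. It builds the same mysql invocations one would type by hand and lets mysql prompt for the password.

```python
import gzip
import shutil
import subprocess
from pathlib import Path


def decompress_dump(archive: Path) -> Path:
    """Step 1: gunzip `name.sql.gz` next to itself and return `name.sql`.

    Assumption: the dump is gzip-compressed; adjust for other formats.
    """
    target = archive.with_suffix("")  # strips the trailing ".gz"
    with gzip.open(archive, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


def create_db_command(user: str, database: str) -> list:
    """Step 2: argv for creating the local database (password is prompted)."""
    statement = f"CREATE DATABASE IF NOT EXISTS `{database}`"
    return ["mysql", "-u", user, "-p", "-e", statement]


def import_command(user: str, database: str) -> list:
    """Step 3: argv for the import; the dump is fed on standard input."""
    return ["mysql", "-u", user, "-p", database]


def load_dump(archive: str, user: str, database: str) -> None:
    """Run all three steps in order, stopping at the first failure."""
    sql_file = decompress_dump(Path(archive))
    subprocess.run(create_db_command(user, database), check=True)
    with open(sql_file, "rb") as dump:
        subprocess.run(import_command(user, database), stdin=dump, check=True)
```

A call such as load_dump("dumpfile.sql.gz", "user", "myDB") would then prompt for the password twice, once per mysql invocation.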

Final warning: three fields in the revision table are not reliable yet:

rev_num_inlinks
rev_num_outlinks
rev_num_trans

All remaining fields/values are reliable (in particular rev_len,
rev_num_words, and so forth).
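
One way to keep the unreliable fields out of an analysis by accident is to filter them out when building queries. The snippet below is just an illustrative sketch: the three excluded names come from the warning above, while the table name "revision", the rev_id column, and the helper names are assumptions made for the example and should be checked against the actual dump schema.

```python
# Fields flagged above as not reliable yet.
UNRELIABLE_FIELDS = {"rev_num_inlinks", "rev_num_outlinks", "rev_num_trans"}


def reliable_fields(columns):
    """Return the given column names minus the unreliable ones, order kept."""
    return [name for name in columns if name not in UNRELIABLE_FIELDS]


def select_reliable(columns, table="revision"):
    """Build a SELECT statement restricted to reliable columns.

    The default table name is an assumption, not taken from the dump itself.
    """
    kept = reliable_fields(columns)
    if not kept:
        raise ValueError("no reliable columns requested")
    return f"SELECT {', '.join(kept)} FROM {table}"


# Example with a hypothetical column list:
query = select_reliable(["rev_id", "rev_len", "rev_num_words", "rev_num_trans"])
```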

Regards,

Felipe.


Re: [Foundation-l] Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship

2009-06-23 Thread Anthony
On Mon, Jun 22, 2009 at 9:15 PM, Platonides  wrote:

> Anthony wrote:
> > (although I still haven't seen the WMF step up
> > to the plate and make it easy for people to make a full history fork, or
> > even to download all the images)
>
> You'll find full history dumps of almost all wikis at
> http://download.wikimedia.org/


Key word being "almost".

> Although not trivial, downloading all images is in fact quite easy.


Yep.  All I need is permission.


> But do you have enough space to dedicate?


Not at the moment.  No sense in buying the drives when I don't have
permission to fill them up.


> How many wikis do you want to mirror? Just commons is more than 3 TB...


Commons and En.wikipedia would probably be good for starters.

The main thing I want is permission to scrape en.wikipedia, though.  (Not
really scraping, as I'd probably use the API and Special:Export.  Basically
I just would like someone official to tell me how *fast* I'm allowed to use
the API and Special:Export.  Special:Export especially, because I could
easily overwhelm the servers using that, due to a bug in the script.)
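
For what it's worth, whatever rate ends up being allowed, enforcing it on the client side is simple. Here is a minimal sketch of a throttle that spaces requests at least a fixed interval apart. The interval value is a placeholder, not an official limit, and fetch stands in for whatever function actually calls the API or Special:Export; both are assumptions for illustration.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch_all(titles, fetch, throttle):
    """Call fetch(title) for each title, one at a time, never faster than
    the throttle allows. `fetch` is a hypothetical caller-supplied function."""
    results = {}
    for title in titles:
        throttle.wait()
        results[title] = fetch(title)
    return results
```

The clock and sleep parameters exist only so the pacing can be checked without actually waiting; in real use the defaults are fine, e.g. Throttle(min_interval=1.0) with whatever interval is eventually agreed on.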

> That's the reason so few people were interested in the images when the
> image dump was available.


I downloaded it.  It was well under 1 TB at the time.


Re: [Foundation-l] Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship

2009-06-23 Thread Samuel Klein
Yes, but my understanding is that while google provided part of the mbp data
and scans, its continued updates to ocr since then are not being shared.  I
would be glad to learn this was not the case...

samuel klein.  s...@laptop.org.  +1 617 529 4266

On Jun 21, 2009 3:14 AM, "Nikola Smolenski"  wrote:

On Saturday 20 June 2009 18:29:24 Brian wrote:

> This has reminded me to complain about Google Books. Google has the
> world's best OCR (in virtue ...
Often, these books are available in the Million Books Project too.



Re: [Foundation-l] Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship

2009-06-23 Thread Peter Gervai
On Tue, Jun 23, 2009 at 03:15, Platonides wrote:
> Although not trivial, downloading all images is in fact quite easy. You
> can find scripts to do that already made. You can also ask Brion to
> rsync3 them.
> But do you have enough space to dedicate?
> How many wikis do you want to mirror? Just commons is more than 3 TB...

Well, disks are cheap nowadays. If it's really just a question of
asking, I may be interested, for example.

The more complex question is the parameters of such usage, meaning
what I can do with the images after I've got them. This is the main
reason for not publishing them in the first place: the images themselves
don't indicate any particular license.

Now that I wrote this, it would be possible (not sure if feasible,
though) to publish CC-BY-SA pictures with author info in the comment
of the image itself. Most image formats support sizeable comment
blocks, and standardised templates make it possible to select media by
license, and get author/copyright info to put into the file.
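
To make the idea concrete for one format: PNG files can carry keyword/text pairs in tEXt chunks, and "Author" and "Copyright" are among the standard keywords. The following is only a rough sketch using the Python standard library; the function names and the example values are made up for illustration, and pulling the author and license out of the wiki templates is not covered here.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Serialize one PNG chunk: 4-byte length, type, data, CRC of type+data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def make_text_chunk(keyword: str, text: str) -> bytes:
    """Build a tEXt chunk; keyword and text are Latin-1, separated by NUL."""
    data = keyword.encode("latin-1") + b"\x00" + text.encode("latin-1")
    return make_chunk(b"tEXt", data)


def add_text_chunks(png: bytes, fields: dict) -> bytes:
    """Insert one tEXt chunk per field right after the IHDR chunk."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    # IHDR is always the first chunk: 4 (length) + 4 (type) + 13 (data) + 4 (CRC).
    ihdr_end = len(PNG_SIGNATURE) + 25
    extra = b"".join(make_text_chunk(k, v) for k, v in fields.items())
    return png[:ihdr_end] + extra + png[ihdr_end:]


def read_text_chunks(png: bytes) -> dict:
    """Walk the chunk list and collect every tEXt keyword/text pair."""
    found = {}
    pos = len(PNG_SIGNATURE)
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        chunk_type = png[pos + 4:pos + 8]
        data = png[pos + 8:pos + 8 + length]
        if chunk_type == b"tEXt":
            keyword, _, text = data.partition(b"\x00")
            found[keyword.decode("latin-1")] = text.decode("latin-1")
        pos += 12 + length  # length + type + data + CRC
    return found
```

Something like add_text_chunks(png_bytes, {"Author": "Example Author", "Copyright": "CC-BY-SA"}) would then give a file that carries its attribution with it. Other formats would need their own mechanism (JPEG comment segments, for instance), which this sketch does not attempt.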

> That's the reason so few people were interested in the images when the
> image dump was available.

People are interested, generally, but not in mirroring the whole shebang. :-)

grin
