Re: [Foundation-l] Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship
On Wed, Jun 24, 2009 at 6:10 AM, Anthony wrote:
> On Tue, Jun 23, 2009 at 3:58 PM, Anthony wrote:
> > On Tue, Jun 23, 2009 at 2:24 PM, Brian wrote:
> > > Ok Shakespeare. But in plain english you appear to be saying that
> > > corporations are inherently greedy and have a tendency to be evil. Sure,
> > > but we expect more out of GOOG. This is not MSFT we are talking about.
> >
> > Of course they're inherently greedy. That's the whole purpose of a
> > for-profit corporation - to make as much money as possible for its
> > shareholders.
>
> I guess even a non-profit is inherently greedy, it's just greedy for
> something other than money. The WMF is greedy for the spread of free
> knowledge.
>
> But this is off-topic. Let's take it to another list or something.

off-topic?? ... surely you jest!!

I think about _three_ of the 50+ emails in this thread have been on the topic of open access journal articles on Wikisource.

--
John Vandenberg

___
foundation-l mailing list
foundation-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
Re: [Foundation-l] Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship
On Tue, Jun 23, 2009 at 3:58 PM, Anthony wrote:
> On Tue, Jun 23, 2009 at 2:24 PM, Brian wrote:
> > Ok Shakespeare. But in plain english you appear to be saying that
> > corporations are inherently greedy and have a tendency to be evil. Sure,
> > but we expect more out of GOOG. This is not MSFT we are talking about.
>
> Of course they're inherently greedy. That's the whole purpose of a
> for-profit corporation - to make as much money as possible for its
> shareholders.

I guess even a non-profit is inherently greedy, it's just greedy for something other than money. The WMF is greedy for the spread of free knowledge.

But this is off-topic. Let's take it to another list or something.
Re: [Foundation-l] Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship
On Tue, Jun 23, 2009 at 2:24 PM, Brian wrote:
> Ok Shakespeare. But in plain english you appear to be saying that
> corporations are inherently greedy and have a tendency to be evil. Sure,
> but we expect more out of GOOG. This is not MSFT we are talking about.

Of course they're inherently greedy. That's the whole purpose of a for-profit corporation - to make as much money as possible for its shareholders.

As for "tendency to be evil", I think that rests on your definition of "evil".
Re: [Foundation-l] Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship
On Tue, Jun 23, 2009 at 1:09 PM, Brian wrote:
> 2009/6/23 Samuel Klein
> > Yes, but my understanding is that while google provided part of the mbp
> > data and scans, its continued updates to ocr since then are not being
> > shared. I would be glad to learn this was not the case...
>
> The dataset you need to train an OCR system to be as good as theirs is the
> raw images and the plain text. They aren't making it easy to get either of
> those things :( They have presumably improved the software in other ways as
> well..
>
> WTF GOOG?

It's almost like they're trying to run a business or something.
Re: [Foundation-l] Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship
Ok Shakespeare. But in plain english you appear to be saying that corporations are inherently greedy and have a tendency to be evil. Sure, but we expect more out of GOOG. This is not MSFT we are talking about.

On Tue, Jun 23, 2009 at 12:13 PM, Michael Snow wrote:
> Brian wrote:
> > On Tue, Jun 23, 2009 at 11:44 AM, Michael Snow wrote:
> > > > The dataset you need to train an OCR system to be as good as theirs
> > > > is the raw images and the plain text. They aren't making it easy to
> > > > get either of those things :( They have presumably improved the
> > > > software in other ways as well..
> > > >
> > > > WTF GOOG?
> > >
> > > Well, when your shorthand uses their stock ticker symbol, your argument
> > > has already been coopted.
> > >
> > > --Michael Snow
> >
> > I get the joke but um, I used it on purpose and which one of my arguments
> > been "coopted" ??
>
> Coopting is not like rebutting; it does not bite chunks out of specific
> pieces, it swallows whole. Symbols are powerful things, perhaps even
> more so outside the mathematical logic of argument. They do not serve
> only your purposes, even if you use them purposefully. My observations
> may be wry, but they are not entirely in jest.
>
> --Michael Snow
Re: [Foundation-l] Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship
Brian wrote:
> On Tue, Jun 23, 2009 at 11:44 AM, Michael Snow wrote:
> > > The dataset you need to train an OCR system to be as good as theirs
> > > is the raw images and the plain text. They aren't making it easy to
> > > get either of those things :( They have presumably improved the
> > > software in other ways as well..
> > >
> > > WTF GOOG?
> >
> > Well, when your shorthand uses their stock ticker symbol, your argument
> > has already been coopted.
> >
> > --Michael Snow
>
> I get the joke but um, I used it on purpose and which one of my arguments
> been "coopted" ??

Coopting is not like rebutting; it does not bite chunks out of specific pieces, it swallows whole. Symbols are powerful things, perhaps even more so outside the mathematical logic of argument. They do not serve only your purposes, even if you use them purposefully. My observations may be wry, but they are not entirely in jest.

--Michael Snow
Re: [Foundation-l] Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship
On Tue, Jun 23, 2009 at 11:44 AM, Michael Snow wrote:
> > The dataset you need to train an OCR system to be as good as theirs is
> > the raw images and the plain text. They aren't making it easy to get
> > either of those things :( They have presumably improved the software in
> > other ways as well..
> >
> > WTF GOOG?
>
> Well, when your shorthand uses their stock ticker symbol, your argument
> has already been coopted.
>
> --Michael Snow

I get the joke but um, I used it on purpose and which one of my arguments been "coopted" ??
Re: [Foundation-l] Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship
Brian wrote:
> 2009/6/23 Samuel Klein
> > Yes, but my understanding is that while google provided part of the mbp
> > data and scans, its continued updates to ocr since then are not being
> > shared. I would be glad to learn this was not the case...
>
> The dataset you need to train an OCR system to be as good as theirs is the
> raw images and the plain text. They aren't making it easy to get either of
> those things :( They have presumably improved the software in other ways as
> well..
>
> WTF GOOG?

Well, when your shorthand uses their stock ticker symbol, your argument has already been coopted.

--Michael Snow
Re: [Foundation-l] Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship
2009/6/23 Samuel Klein
> Yes, but my understanding is that while google provided part of the mbp
> data and scans, its continued updates to ocr since then are not being
> shared. I would be glad to learn this was not the case...

The dataset you need to train an OCR system to be as good as theirs is the raw images and the plain text. They aren't making it easy to get either of those things :( They have presumably improved the software in other ways as well..

WTF GOOG?
[Foundation-l] Public repositories for research dumps
Hello.

A few hours ago, a new public repository was created to host WikiXRay database dumps, containing information extracted from the public Wikipedia database dumps. The repository is hosted by RedIRIS (in short, the Spanish equivalent of Kennisnet in the Netherlands).

http://sunsite.rediris.es/mirror/WKP_research
ftp://ftp.rediris.es/mirror/WKP_research

These new dumps are intended to save other researchers time and effort, since they won't need to parse the complete XML dumps to extract all relevant activity metadata. We used mysqldump to create the dumps from our databases.

As of today, only some of the biggest Wikipedias are available. However, in the following days the full set of available languages will be ready for downloading. The files will be updated regularly.

The procedure is as follows:

1. Find the research dump you are interested in. Download and decompress it on your local system.
2. Create a local DB to import the information into.
3. Load the dump file, using a MySQL user with insert privileges (you will be prompted for the password):

   $> mysql -u user -p myDB < dumpfile.sql

And you're done.

Final warning: three fields in the revision table are not reliable yet:

rev_num_inlinks
rev_num_outlinks
rev_num_trans

All remaining fields/values are reliable (in particular rev_len, rev_num_words, and so forth).

Regards,

Felipe.
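For anyone who would rather script step 3 than type it, here is a minimal Python sketch. It is only an illustration: the function names are made up for this example, the dump is assumed to be already decompressed, and the mysql client is assumed to be on your PATH. The user, database and file names are the same placeholders as in the command above.

```python
import subprocess
from pathlib import Path


def build_import_command(user: str, database: str) -> list[str]:
    """Return the argv for the mysql client.

    A bare -p (no value attached) makes mysql prompt for the password,
    so the password never appears on the command line.
    """
    return ["mysql", "-u", user, "-p", database]


def import_dump(dump_path: str, user: str, database: str) -> int:
    """Feed an already-decompressed .sql dump to mysql on stdin (step 3).

    Returns the exit code of the mysql client.
    """
    dump = Path(dump_path)
    if not dump.is_file():
        raise FileNotFoundError(f"dump file not found: {dump}")
    with dump.open("rb") as sql:
        # Equivalent to: mysql -u user -p myDB < dumpfile.sql
        result = subprocess.run(build_import_command(user, database), stdin=sql)
    return result.returncode
```

Usage would be something like import_dump("dumpfile.sql", "user", "myDB"); the target database has to exist already (step 2).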
Re: [Foundation-l] Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship
On Mon, Jun 22, 2009 at 9:15 PM, Platonides wrote:
> Anthony wrote:
> > (although I still haven't seen the WMF step up
> > to the plate and make it easy for people to make a full history fork, or
> > even to download all the images)
>
> You'll find full history dumps of almost all wikis at
> http://download.wikimedia.org/

Key word being "almost".

> Although not trivial, downloading all images is in fact quite easy.

Yep. All I need is permission.

> But do you have enough space to dedicate?

Not at the moment. No sense in buying the drives when I don't have permission to fill them up.

> How many wikis do you want to mirror? Just commons is more than 3 TB...

Commons and En.wikipedia would probably be good for starters. The main thing I want is permission to scrape en.wikipedia, though. (Not really scraping, as I'd probably use the API and Special:Export. Basically I just would like someone official to tell me how *fast* I'm allowed to use the API and Special:Export. Special:Export especially, because I could easily overwhelm the servers using that, due to a bug in the script.)

> That's the reason so few people were interested in the images when the
> image dump was available.

I downloaded it. It was well under 1 TB at the time.
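Whatever rate ends up being allowed, a client can enforce it on its own side. The Python sketch below is only an illustration of that idea: fetch_throttled is a made-up name, the fetch callable (an API or Special:Export request, for instance) is supplied by the caller, and the 1.0 second default is a placeholder, not an official limit.

```python
import time
from typing import Callable, Iterable, Iterator


def fetch_throttled(
    titles: Iterable[str],
    fetch: Callable[[str], bytes],
    min_interval: float = 1.0,  # placeholder value, not an official limit
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[tuple[str, bytes]]:
    """Call fetch(title) for each title, one at a time.

    Two consecutive calls never start less than min_interval seconds apart.
    """
    last_start = None
    for title in titles:
        if last_start is not None:
            # Time still owed since the previous request started.
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        yield title, fetch(title)
```

The sleep and clock parameters exist only so the pacing can be checked without real waiting. Requests are strictly serial, so a slow response slows the client down further rather than piling up requests.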
Re: [Foundation-l] Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship
Yes, but my understanding is that while google provided part of the mbp data and scans, its continued updates to ocr since then are not being shared. I would be glad to learn this was not the case...

samuel klein. s...@laptop.org. +1 617 529 4266

On Jun 21, 2009 3:14 AM, "Nikola Smolenski" wrote:

On Saturday 20 June 2009 18:29:24 Brian wrote:
> This has reminded me to complain about Google Books. Google has the world's
> best OCR (in virtue ...

Often, these books are available in the Million Books Project too.
Re: [Foundation-l] Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship
On Tue, Jun 23, 2009 at 03:15, Platonides wrote:
> Although not trivial, downloading all images is in fact quite easy. You
> can find scripts to do that already made. You can also ask Brion to
> rsync3 them.
> But do you have enough space to dedicate?
> How many wikis do you want to mirror? Just commons is more than 3 TB...

Well, disks are cheap nowadays. If it's really just a question of asking, I may be interested, for example.

The more complex question is the parameters of such usage, meaning what I can do with the images after I've got them. This is the main reason for not publishing them in the first place: the images themselves don't indicate any particular license.

Now that I've written this, it would be possible (not sure if feasible, though) to publish CC-BY-SA pictures with author info in the comment of the image itself. Most image formats support sizeable comment blocks, and standardised templates make it possible to select media by license, and get author/copyright info to put into the file.

> That's the reason so few people were interested in the images when the
> image dump was available.

People are interested, generally, but not in mirroring the whole shebang. :-)

grin
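To make the "author info in the comment of the image itself" idea concrete, here is a rough Python sketch for one format only, PNG, using its standard tEXt chunks ("Author" and "Copyright" are keywords predefined by the PNG specification). The function names are invented for this example, the author and license values would have to come from the image description page, and nothing here is an existing Wikimedia tool.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Serialize one PNG chunk: length, type, data, CRC-32 over type+data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def add_text_chunk(png: bytes, keyword: str, text: str) -> bytes:
    """Return a copy of png with a tEXt chunk inserted right after IHDR."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    # IHDR always comes first: 8-byte signature, then
    # 4 (length) + 4 (type) + 13 (data) + 4 (CRC) bytes.
    ihdr_end = len(PNG_SIGNATURE) + 4 + 4 + 13 + 4
    # tEXt payload is keyword, a NUL separator, then the text, all Latin-1.
    payload = keyword.encode("latin-1") + b"\x00" + text.encode("latin-1")
    return png[:ihdr_end] + make_chunk(b"tEXt", payload) + png[ihdr_end:]


def read_text_chunks(png: bytes) -> dict[str, str]:
    """Collect keyword -> text from every tEXt chunk in the file."""
    found = {}
    pos = len(PNG_SIGNATURE)
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        chunk_type = png[pos + 4:pos + 8]
        data = png[pos + 8:pos + 8 + length]
        if chunk_type == b"tEXt":
            keyword, _, text = data.partition(b"\x00")
            found[keyword.decode("latin-1")] = text.decode("latin-1")
        pos += 12 + length  # length + type + CRC fields, plus the data
    return found
```

tEXt is limited to Latin-1, so author names in other scripts would need the iTXt chunk instead, and other formats (JPEG comments, EXIF) would each need their own handling.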