[Wikisource-l] Converting pdf files into wiki markup

2013-06-12 Thread David Cuenca
It is not a trivial matter. The best bet would be to take an existing pdf
import tool for a word processor, and try to write a similar tool for
wikitext.

There is the Oracle PDF Import Extension for OpenOffice; its code can be
browsed, and maybe it can give you some ideas:
http://extensions.services.openoffice.org/project/pdfimport
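
As a rough illustration of what such a tool would have to do once the text
is out of the pdf, here is a minimal Python sketch. It assumes a
hypothetical extractor has already produced blocks of (font size, text);
the Block type, the size threshold and the "bigger font means heading" rule
are assumptions made for illustration, not taken from any existing import
tool.

```python
from dataclasses import dataclass


@dataclass
class Block:
    font_size: float  # point size reported by the (hypothetical) extractor
    text: str         # raw text of the block, possibly hard-wrapped


def join_lines(text: str) -> str:
    """Rejoin hard-wrapped lines, merging words hyphenated at a line end.

    Caveat: a genuine hyphen at the end of a line is merged too.
    """
    out = ""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if out.endswith("-"):
            out = out[:-1] + line  # "conver-" + "sion" -> "conversion"
        elif out:
            out += " " + line
        else:
            out = line
    return out


def to_wikitext(blocks, body_size=10.0):
    """Blocks set larger than body_size become headings, the rest paragraphs."""
    parts = []
    for block in blocks:
        text = join_lines(block.text)
        if not text:
            continue
        parts.append(f"== {text} ==" if block.font_size > body_size else text)
    return "\n\n".join(parts)
```

Real pdf files carry far less structure than this assumes (columns,
footnotes, running headers), which is where most of the work would go.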

Micru

On Wed, Jun 12, 2013 at 12:38 PM, Alex Brollo wrote:

> When we tried to convert a pdf file, which is an opaque, closed format,
> into wiki code (a needed step to add links and to turn files into a
> "wiki hypertext"), the work turned into a nightmare. If we simply load
> free pdf books "as they are", I don't see any advantage other than to
> "feed wikisource numbers/statistics", and this is presently far from my
> personal interest.
>
> As you may guess, I'm one of the users who don't share Aubrey's
> enthusiasm about born-digital texts, even if free. :-)
>
> Alex


-- 
Etiamsi omnes, ego non
___
Wikisource-l mailing list
Wikisource-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikisource-l


Re: [Wikisource-l] About texts without supporting files and "Index:" pages

2013-06-12 Thread Alex Brollo
When we tried to convert a pdf file, which is an opaque, closed format,
into wiki code (a needed step to add links and to turn files into a
"wiki hypertext"), the work turned into a nightmare. If we simply load
free pdf books "as they are", I don't see any advantage other than to
"feed wikisource numbers/statistics", and this is presently far from my
personal interest.

As you may guess, I'm one of the users who don't share Aubrey's
enthusiasm about born-digital texts, even if free. :-)

Alex


2013/6/12 David Cuenca 

> Nobody is saying anything about using copyrighted works. There are many
> books with an open license that would allow them to be included in
> Wikisource.
>
> For instance in ca-ws we have this translation from 2009:
>
> http://ca.wikisource.org/wiki/Llibre:El_secret_de_l%E2%80%99or_que_creix_%282009%29.djvu
>
> The original is in the PD, and the translator gave away his rights. It
> would have been much easier to work directly with the pdf, instead of
> converting to djvu.
>
> Micru


Re: [Wikisource-l] About texts without supporting files and "Index:" pages

2013-06-12 Thread David Cuenca
Nobody is saying anything about using copyrighted works. There are many
books with an open license that would allow them to be included in
Wikisource.

For instance in ca-ws we have this translation from 2009:
http://ca.wikisource.org/wiki/Llibre:El_secret_de_l%E2%80%99or_que_creix_%282009%29.djvu

The original is in the PD, and the translator gave away his rights. It
would have been much easier to work directly with the pdf, instead of
converting to djvu.

Micru

On Wed, Jun 12, 2013 at 10:47 AM, Aarti K. Dwivedi <
ellydwivedi2...@gmail.com> wrote:

> If I am not wrong, as of today, most books that were born digital are
> still under copyright. Of course, they are available freely on the
> internet. But we can't use the pirated copies. How would we go about the
> procurement of these books?
> If we procure these copyrighted books, then the only thing we would have
> to do is to check for proper formatting. Is that right?


Re: [Wikisource-l] About texts without supporting files and "Index:" pages

2013-06-12 Thread Aarti K. Dwivedi
If I am not wrong, as of today, most books that were born digital are
still under copyright. Of course, they are available freely on the
internet. But we can't use the pirated copies. How would we go about the
procurement of these books?
If we procure these copyrighted books, then the only thing we would have
to do is to check for proper formatting. Is that right?


On Wed, Jun 12, 2013 at 7:58 PM, Lars Aronsson  wrote:

> On 06/12/2013 02:48 PM, Andrea Zanni wrote:
>
>> We could define some tasks as
>> * corrected the page
>> * OPTIONAL added optional templates/links/annotations
>> *...
>>
>
> Geotagged all the photos, ...
>
> The list doesn't end. You need a generic mechanism
> for any new feature you can invent. But aren't our
> existing templates and categories the best way to
> do this? You could just add to each page:
> {{done|proofread=user1|validated=user2|geotagged=user4|...}}
>
>
> --
>   Lars Aronsson (l...@aronsson.se)
>   Project Runeberg - free Nordic literature - http://runeberg.org/



-- 
Aarti K. Dwivedi


Re: [Wikisource-l] About texts without supporting files and "Index:" pages

2013-06-12 Thread Lars Aronsson

On 06/12/2013 02:48 PM, Andrea Zanni wrote:

> We could define some tasks as
> * corrected the page
> * OPTIONAL added optional templates/links/annotations
> *...


Geotagged all the photos, ...

The list doesn't end. You need a generic mechanism
for any new feature you can invent. But aren't our
existing templates and categories the best way to
do this? You could just add to each page:
{{done|proofread=user1|validated=user2|geotagged=user4|...}}
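
To make the template idea concrete, here is a minimal Python sketch of how
a bot or gadget could read and update such a call. The {{done}} template is
only the hypothetical one proposed above; the function names are made up
for illustration, and the parser deliberately ignores nested templates and
piped links inside values.

```python
def parse_done(template: str) -> dict:
    """Parse '{{done|task=user|...}}' into {task: user}.

    Parameters without '=' (such as a literal '...') are ignored.
    """
    inner = template.strip()
    if not (inner.startswith("{{") and inner.endswith("}}")):
        raise ValueError("not a template call")
    name, *params = inner[2:-2].split("|")
    if name.strip() != "done":
        raise ValueError("not a {{done}} template")
    tasks = {}
    for param in params:
        key, sep, value = param.partition("=")
        if sep:
            tasks[key.strip()] = value.strip()
    return tasks


def format_done(tasks: dict) -> str:
    """Serialize {task: user} back into a {{done|...}} call."""
    if not tasks:
        return "{{done}}"
    return "{{done|" + "|".join(f"{k}={v}" for k, v in tasks.items()) + "}}"
```

A new kind of task then needs no new mechanism: add a key with
tasks["geotagged"] = "user4" and write the result of format_done(tasks)
back to the page.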


--
  Lars Aronsson (l...@aronsson.se)
  Project Runeberg - free Nordic literature - http://runeberg.org/





Re: [Wikisource-l] About texts without supporting files and "Index:" pages

2013-06-12 Thread David Cuenca
I think everything is doable; the problem is how to do it without
cluttering the interface, while keeping things simple.

Some levels might be redundant, and we could take the chance to consider
whether they are really necessary.

Some proposed changes:
- Proofread page levels: "Unused", "Proofread", "Proofread with format",
"Validated" (the "unused" level would cover pages with no text, pages with
only ocr text, and pages with irrelevant content).
- All pages would be created at the start with the extracted ocr text at
"unused" level, so that search engines could find our texts even before
proofreading has started.
- A checkbox list to tag pages: "damaged scan", "missing scan", "contains
media" (image, score, etc.).
- Color codes: as now, plus orange for "Proofread with format". Tags on a
page would affect the color too: "damaged" would make the color half purple
and half the corresponding proofread level color, and "contains media" could
add a (black?) square around the page number.
- Proofread book levels should be set automatically to the lowest page
level, plus two options: one to mark the book as "ready to export" and
another to mark it as "digital source", which would bring all pages to
"proofread" level.
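
The proposed levels, tags and automatic book level can be sketched in a few
lines of Python. This is only an illustration of the proposal, not
ProofreadPage code: the class and function names are invented, and reading
"digital source" as "raise every page to at least proofread" is an
assumption.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Level(IntEnum):
    UNUSED = 0
    PROOFREAD = 1
    PROOFREAD_WITH_FORMAT = 2
    VALIDATED = 3


# Checkbox tags from the proposal; they are independent of the level.
TAGS = {"damaged scan", "missing scan", "contains media"}


@dataclass
class Page:
    number: int
    level: Level = Level.UNUSED  # every page starts as "unused" ocr text
    tags: set = field(default_factory=set)


def mark_digital_source(pages):
    """Assumed meaning of 'digital source': no page stays below PROOFREAD."""
    for page in pages:
        if page.level < Level.PROOFREAD:
            page.level = Level.PROOFREAD


def book_level(pages):
    """The book level is simply the lowest level among its pages."""
    return min((page.level for page in pages), default=Level.UNUSED)
```

Because the book level is derived rather than stored, it can never drift
out of step with the pages.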

As for the metadata interface, I keep thinking about it, and my impression
is that we should start working from Template:Book [1] until we have a
version that can be used across Commons, Index pages, and books without
supporting scans (in this last case it could be the same header template
with an option to expand it to show the whole Template:Book).
That template might also need some coloring/reorganizing to reflect the
Work/Edition distinction that Wikidata is bringing [2].
And if it is possible to read/write Wikidata with Lua, then a migration
towards a Wikidata-powered Wikisource shouldn't be that far away.

Cheers,
Micru

[1] http://commons.wikimedia.org/wiki/Template:Book
[2] http://www.wikidata.org/wiki/Wikidata:Books_task_force


On Wed, Jun 12, 2013 at 8:48 AM, Andrea Zanni wrote:

>
> On Wed, Jun 12, 2013 at 2:32 PM, Thibaut Horel wrote:
>
>> 3. The current system with 4 quality levels to represent the proofreading
>> state of a page is not sufficient to represent the diversity of
>> proofreading scenarios. Indeed, there is a distinction to make between the
>> *correctness* of the text and its *formatting*. In the case of a scanned
>> edition which has been OCRed, we do need several passes before reaching a
>> satisfying level of confidence about the correctness of the text as well as
>> a suitable formatting (proper use of the wikicode, etc.). For digital-born
>> documents however, as billinghurst said, we can automatically assume that
>> the extracted text is correct, but that still doesn't mean that the text is
>> correctly formatted and ready to be transcluded in the main namespace.
>> Maybe we should add another level meaning "text is correct, still needs
>> formatting"? Ideally, we should have two scales of quality levels: one
>> dealing with the correctness of the text, and one dealing with its
>> formatting. This would probably be too heavy and confusing though...
>
>
> I couldn't agree more.
> I think this could also be an opportunity to make tasks *smaller* and
> *clearer*
> (in the direction of "microtasks": contributions to crowdsourcing
> projects that are small, well-defined and simple, e.g. GalaxyZoo, reCAPTCHA).
>
> We could define some tasks as
> * corrected the page
> * proofread the text
> * formatted the page
> * validated the formatting
> * OPTIONAL added optional templates/links/annotations
> *...
>
> We could even have qualifiers (all/part of the page, ...)
>
> Is this idea crazy, or somewhat doable?
>
> Aubrey


Re: [Wikisource-l] About texts without supporting files and "Index:" pages

2013-06-12 Thread Andrea Zanni
On Wed, Jun 12, 2013 at 2:32 PM, Thibaut Horel wrote:

> 3. The current system with 4 quality levels to represent the proofreading
> state of a page is not sufficient to represent the diversity of
> proofreading scenarios. Indeed, there is a distinction to make between the
> *correctness* of the text and its *formatting*. In the case of a scanned
> edition which has been OCRed, we do need several passes before reaching a
> satisfying level of confidence about the correctness of the text as well as
> a suitable formatting (proper use of the wikicode, etc.). For digital-born
> documents however, as billinghurst said, we can automatically assume that
> the extracted text is correct, but that still doesn't mean that the text is
> correctly formatted and ready to be transcluded in the main namespace.
> Maybe we should add another level meaning "text is correct, still needs
> formatting"? Ideally, we should have two scales of quality levels: one
> dealing with the correctness of the text, and one dealing with its
> formatting. This would probably be too heavy and confusing though...


I couldn't agree more.
I think this could also be an opportunity to make tasks *smaller* and
*clearer*
(in the direction of "microtasks": contributions to crowdsourcing
projects that are small, well-defined and simple, e.g. GalaxyZoo, reCAPTCHA).

We could define some tasks as
* corrected the page
* proofread the text
* formatted the page
* validated the formatting
* OPTIONAL added optional templates/links/annotations
*...

We could even have qualifiers (all/part of the page, ...)

Is this idea crazy, or somewhat doable?

Aubrey


Re: [Wikisource-l] About texts without supporting files and "Index:" pages

2013-06-12 Thread Thibaut Horel
Hi everybody,

Here is my attempt at giving my point of view while trying to summarize
the discussion:

1. I think the role of Index: pages should be to present the *source* of
a work. This is true whether the source is a scanned edition (as is most
often the case at the moment), or a digital PDF (that is, containing
text and not images) as is the case for most "digital-born" documents. I
think it is good to have a neat separation between the original source
and how Wikisource presents the work in the main namespace. Indeed, even
if Wikisource tries to be as true as possible to the original content,
there are very often some changes in the way it is presented in the main
namespace.

2. Ideally, the metadata about the source of a work (author, date of
printing, etc.) should be located in Wikidata. But metadata related to
proofreading (e.g. the proofreading level of each individual page),
being specific to the mission of Wikisource, should be located in
Wikisource. How to do this while keeping the interface simple (i.e. hide
it from the user so that she doesn't have to go from Wikisource to
Wikidata to Wikisource) is a valid and very important concern, but is
also beyond my current understanding of Wikidata and its integration
into Wikimedia projects.

3. The current system with 4 quality levels to represent the
proofreading state of a page is not sufficient to represent the
diversity of proofreading scenarios. Indeed, there is a distinction to
make between the *correctness* of the text and its *formatting*. In the
case of a scanned edition which has been OCRed, we do need several
passes before reaching a satisfying level of confidence about the
correctness of the text as well as a suitable formatting (proper use of
the wikicode, etc.). For digital-born documents however, as billinghurst
said, we can automatically assume that the extracted text is correct,
but that still doesn't mean that the text is correctly formatted and
ready to be transcluded in the main namespace. Maybe we should add
another level meaning "text is correct, still needs formatting"?
Ideally, we should have two scales of quality levels: one dealing with
the correctness of the text, and one dealing with its formatting. This
would probably be too heavy and confusing though...
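
For what it is worth, the two-scale idea is small when written down as
data. The Python sketch below is only an illustration: the level names on
each scale and the "ready for transclusion" threshold are assumptions, not
an existing Wikisource scheme.

```python
from enum import IntEnum
from typing import NamedTuple


class Text(IntEnum):
    UNCHECKED = 0  # raw OCR
    PROOFREAD = 1
    VALIDATED = 2


class Formatting(IntEnum):
    RAW = 0
    FORMATTED = 1
    VALIDATED = 2


class Quality(NamedTuple):
    text: Text
    formatting: Formatting


def initial_quality(born_digital: bool) -> Quality:
    """Born-digital text is assumed correct, but nothing starts out formatted."""
    text = Text.VALIDATED if born_digital else Text.UNCHECKED
    return Quality(text, Formatting.RAW)


def ready_for_transclusion(q: Quality) -> bool:
    """Assumed rule: both scales must have had at least one pass."""
    return q.text >= Text.PROOFREAD and q.formatting >= Formatting.FORMATTED
```

A born-digital page then starts as (VALIDATED, RAW): correct text that is
still not ready for the main namespace until someone formats it.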

Thibaut (user:Zaran on Wikisource)

On 06/12/2013 01:35 PM, Andrea Zanni wrote:
>
> On Wed, Jun 12, 2013 at 1:32 PM, billinghurst wrote:
>
>> If you are talking about how we represent digitally prepared text within
>> the validation process, I would have no issue with the text being ripped
>> and having a bot run through and take it straight to level 4 (green), and
>> then redefining green to mean validated, or digitally prepared text not
>> requiring validation.
>>
>> At the same time, if someone proposes and generates a fifth colour to
>> represent digitally prepared text not requiring proofreading, then I will
>> be happy with that. It may make someone happier in being a truer
>> representation, but in the end to me it is a moot point. In the end, each
>> of those is a local community decision, though one that should be made in
>> consideration of how the other wikis interpret their processes.
>
>
> Thanks for clarifying this.
> I agree with you, and would welcome both solutions.
>
> But a lot of wikisourcerors don't think this way, 
> so better discuss :-)
>
> Aubrey


Re: [Wikisource-l] About texts without supporting files and "Index:" pages

2013-06-12 Thread Andrea Zanni
On Wed, Jun 12, 2013 at 1:32 PM, billinghurst wrote:

> If you are talking about how we represent digitally prepared text within
> the validation process, I would have no issue with the text being ripped
> and having a bot run through and take it straight to level 4 (green), and
> then redefining green to mean validated, or digitally prepared text not
> requiring validation.
>
> At the same time, if someone proposes and generates a fifth colour to
> represent digitally prepared text not requiring proofreading, then I will
> be happy with that. It may make someone happier in being a truer
> representation, but in the end to me it is a moot point. In the end, each
> of those is a local community decision, though one that should be made in
> consideration of how the other wikis interpret their processes.
>

Thanks for clarifying this.
I agree with you, and would welcome both solutions.

But a lot of wikisourcerors don't think this way,
so better discuss :-)

Aubrey


Re: [Wikisource-l] About texts without supporting files and "Index:" pages

2013-06-12 Thread billinghurst
You need to be cautious when talking about "PDF" documents, as what matters
is not the document presentation format but the source of the text. So I
prefer to describe the source as either digitally prepared (not requiring
validation, though it may require formatting) or OCR'd (requiring
validation, and probably formatting).

If you are talking about how we represent digitally prepared text within
the validation process, I would have no issue with the text being ripped
and having a bot run through and take it straight to level 4 (green), and
then redefining green to mean validated, or digitally prepared text not
requiring validation.
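
The text-rewriting half of such a bot run could look like the Python sketch
below. It assumes each page stores its status in a
<pagequality level="N" user="..." /> tag; that storage format, the function
name and the bot user name are assumptions to be checked against the wiki's
actual markup, and fetching and saving pages is left out entirely.

```python
import re

# Assumed storage format for the page status; verify before running a bot.
PAGEQUALITY = re.compile(r'<pagequality level="(\d)" user="([^"]*)" />')


def promote(wikitext: str, bot_user: str, level: int = 4):
    """Set the first pagequality tag to `level`; return (new_text, changed)."""
    tag = f'<pagequality level="{level}" user="{bot_user}" />'
    # A function replacement avoids backslash processing in the new tag.
    new_text, count = PAGEQUALITY.subn(lambda match: tag, wikitext, count=1)
    return new_text, count == 1 and new_text != wikitext
```

Pages without a recognisable tag come back unchanged with changed set to
False, so the bot can skip them instead of guessing.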

At the same time, if someone proposes and generates a fifth colour to
represent digitally prepared text not requiring proofreading, then I will
be happy with that. It may make someone happier in being a truer
representation, but in the end to me it is a moot point. In the end, each
of those is a local community decision, though one that should be made in
consideration of how the other wikis interpret their processes.

Regards, Billinghurst


On Tue, 11 Jun 2013 15:12:41 -0400, David Cuenca wrote:
> @Billinghurst, I think Aubrey was referring mainly to pdf files, which
> sometimes have text and format but are not that easy to represent in
> Wikisource. The main problem is that our current workflow always assumes
> that we are going to proofread a text and have it stored as a web page.
>
> @others: for me it doesn't matter much if the representation of the
> metadata is done by a template, an index page, or something different
> (maybe related to the new Extension:BookManager?)
> However I think that from the user's point of view it is better to have a
> consistent system that can handle:
> 1) representation of book/source metadata
> 2) access to export/visualization options
>
> I'm preparing a document with some ideas that we can discuss here.
>
> Micru
>
> On Tue, Jun 11, 2013 at 7:48 AM, billinghurst wrote:
>
>> On Tue, 11 Jun 2013 12:16:54 +0530, "Aarti K. Dwivedi" wrote:
>> > A slightly off-topic question: Even if we modify the extension to
>> > proofread books which do not have scans (I am assuming books that were
>> > born digital), against what will these books be proofread?
>> >
>>
>> I am not sure why we are looking to proofread a digital only file, unless
>> of course it never had a text layer and it had to be OCR'd. Proofreading
>> surely only relates to scanned images where there has been the need to
>> proofread.
>>
>> Regards, Billinghurst