[Bug 57278] Issue with PDFs downloaded from Archive.org

2014-01-12 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=57278

Bawolff (Brian Wolff)  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |WORKSFORME

--- Comment #6 from Bawolff (Brian Wolff)  ---
Closing as WORKSFORME.

I downloaded the file and looked at it with various tools:
*The text layer appears to be empty. It has no OCR data, hence ProofreadPage
cannot retrieve the text of the document. (ProofreadPage doesn't do OCR; it
only extracts what is embedded in the document.)
*The file does have a low resolution. Other PDF tools also display it very
small.
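
The text-layer check can be reproduced with poppler's pdftotext (it ships in the same package as pdfinfo). A minimal sketch, assuming pdftotext is installed and on the PATH; the helper names are made up for illustration, and this is not necessarily how ProofreadPage or PdfHandler extract text internally:

```python
import subprocess


def extract_text(pdf_path):
    """Return the embedded text of a PDF via poppler's pdftotext.

    Assumes the pdftotext binary is available; "-" writes the text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def has_text_layer(extracted):
    """True if the extracted text contains anything besides whitespace.

    pdftotext separates pages with form feeds, which str.strip() treats
    as whitespace, so a scan-only PDF ends up as an empty string here.
    """
    return bool(extracted.strip())
```

Calling has_text_layer(extract_text("some_file.pdf")) on a scan without OCR data would be expected to return False.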

(If you think there's still a bug here, please re-open)

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 57278] Issue with PDFs downloaded from Archive.org

2014-01-12 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=57278

Marco  changed:

   What|Removed |Added

 CC||maic...@yahoo.com
 Blocks||41037



[Bug 57278] Issue with PDFs downloaded from Archive.org

2014-01-09 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=57278

--- Comment #5 from Bawolff (Brian Wolff)  ---
(In reply to comment #0)
>
> 2. When we create Index file in Wikisource (for example,
> https://ml.wikisource.org/wiki/Index:Pazhancholmala_Gundert_1845.pdf) and try
> to work on a page (for example,
> https://ml.wikisource.org/w/index.php?title=Page:Pazhancholmala_Gundert_1845.
> pdf/7&action=edit&redlink=1)
> you can see that nothing much can be seen on the scanned page. 
> 
> 
>

Are you sure that the PDF has an OCR layer (you can test by opening it in a PDF
viewer and seeing if you can select/copy text in the document)? PdfHandler
seems to think all the pages are blank -
https://commons.wikimedia.org/w/api.php?action=query&prop=imageinfo&iiprop=metadata&titles=File:Pazhancholmala_Gundert_1845.pdf
(scroll down to the text property)
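
The same query can be scripted. A minimal sketch that rebuilds the API URL above with format=json added; the helper names and the User-Agent string are made up for illustration, and the layout of the returned JSON is deliberately not assumed here (inspect the response to locate the text property):

```python
import json
import urllib.parse
import urllib.request

API = "https://commons.wikimedia.org/w/api.php"


def metadata_url(title):
    """Build the imageinfo/metadata query URL for a file title."""
    params = {
        "action": "query",
        "prop": "imageinfo",
        "iiprop": "metadata",
        "titles": title,
        "format": "json",
    }
    return API + "?" + urllib.parse.urlencode(params)


def fetch_metadata(title):
    """Fetch the raw API response as a Python object (needs network access)."""
    req = urllib.request.Request(
        metadata_url(title),
        headers={"User-Agent": "text-layer-check/0.1 (example script)"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def pages_are_blank(page_texts):
    """True if every per-page text string is empty or whitespace only."""
    return all(not text.strip() for text in page_texts)
```

Once the per-page strings have been pulled out of the response by hand, pages_are_blank() gives the same answer as scrolling through the text property.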



[Bug 57278] Issue with PDFs downloaded from Archive.org

2014-01-09 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=57278

Andre Klapper  changed:

   What|Removed |Added

   Priority|Normal  |Low
   Severity|normal  |minor

--- Comment #4 from Andre Klapper  ---
Shiju Alex: Can you please answer Nemo's questions in comment 3?:

> What is it that you don't see there? Do you
> still not see an image there? If you don't, is it consistent on all pages?


Looking for actionable items, I currently only see this:

(In reply to comment #3 by Nemo)
> It's possible that the resolution was guessed incorrectly



[Bug 57278] Issue with PDFs downloaded from Archive.org

2013-11-20 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=57278

praveenp  changed:

   What|Removed |Added

 Blocks||56295



[Bug 57278] Issue with PDFs downloaded from Archive.org

2013-11-20 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=57278

Andre Klapper  changed:

   What|Removed |Added

 Status|NEW |UNCONFIRMED
 Ever confirmed|1   |0



[Bug 57278] Issue with PDFs downloaded from Archive.org

2013-11-20 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=57278

Nemo  changed:

   What|Removed |Added

 CC||federicol...@tiscali.it

--- Comment #3 from Nemo  ---
(In reply to comment #0)
> 1. Inside Commons itself (for example,
> https://commons.wikimedia.org/wiki/File:Pazhancholmala_Gundert_1845.pdf) you
> can see that you cannot view the pages from this file in higher resolution. 

How is this unexpected? The PDF has low resolution (and it's only 2 MB), it's
correctly displayed.

$ pdfinfo Gundert_Pazhancholmala_1845.pdf
Title:          Pazhancholmala by Hermann Gundert 1845
Keywords:       http://archive.org/details/pazhancholmala_gundert_1845
Author:         Hermann Gundert
Creator:        Digitized by the Internet Archive
Producer:       Recoded by LuraDocument PDF v2.53
CreationDate:   Mon Sep 16 16:22:18 2013
ModDate:        Mon Sep 16 16:23:29 2013
Tagged:         no
Form:           none
Pages:          147
Encrypted:      no
Page size:      91 x 148 pts
Page rot:       0
File size:      2363482 bytes
Optimized:      yes
PDF version:    1.5

https://catalogd.archive.org/log/13313 tells me:
Source Gundert_Pazhancholmala_1845_images.zip : "Generic Raw Book Zip"
[...]
INFO: Global image dpi: 600

It's possible that the resolution was guessed incorrectly (unless the pages of
this book are very small, 147 pages at 600 dpi can't be only 35 MB): please
edit the metadata to add the correct resolution at which the images were
produced, see fixed-ppi instructions at
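
The arithmetic behind that suspicion can be checked from the pdfinfo output. A minimal sketch, assuming the PDF default of 72 points per inch and taking the 600 dpi figure from the log as given; the function names are made up for illustration, and pdfinfo rounds the page size, so the results are approximate:

```python
POINTS_PER_INCH = 72   # PDF default user space: 1 pt = 1/72 inch
MM_PER_INCH = 25.4


def points_to_mm(points):
    """Physical length in millimetres for a length given in PDF points."""
    return points / POINTS_PER_INCH * MM_PER_INCH


def pixels_at_dpi(points, dpi):
    """Pixel count along one side of the page at the given resolution."""
    return round(points / POINTS_PER_INCH * dpi)


width_pt, height_pt = 91, 148   # "Page size" line from pdfinfo
declared_dpi = 600              # "Global image dpi" from the log

print("page in mm:", points_to_mm(width_pt), "x", points_to_mm(height_pt))
print("pixels at declared dpi:",
      pixels_at_dpi(width_pt, declared_dpi), "x",
      pixels_at_dpi(height_pt, declared_dpi))
```

A declared page of roughly 32 x 52 mm is very small for a printed book. That would be consistent with the images having been produced at a lower resolution than the 600 dpi recorded in the metadata, though only the original scan settings can confirm it.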


> 
> 2. When we create Index file in Wikisource (for example,
> https://ml.wikisource.org/wiki/Index:Pazhancholmala_Gundert_1845.pdf) and try
> to work on a page (for example,
> https://ml.wikisource.org/w/index.php?title=Page:Pazhancholmala_Gundert_1845.
> pdf/7&action=edit&redlink=1)
> you can see that nothing much can be seen on the scanned page. 

What is it that you don't see there? The text isn't loaded, but this is expected
because, as you know very well, there is no OCR. I also see the image from the
PDF correctly, in my case
https://upload.wikimedia.org/wikipedia/commons/thumb/4/4d/Dharmaraja_1913.pdf/page11-500px-Dharmaraja_1913.pdf.jpg
which according to wget -S is Last-Modified: Wed, 20 Nov 2013 03:16:52 GMT so
may have been created when someone else clicked the link on comment 0. Do you
still not see an image there? If you don't, is it consistent on all pages?
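
The Last-Modified check done with wget -S can also be scripted. A minimal sketch using only the Python standard library; the function names and the User-Agent string are made up for illustration:

```python
import urllib.request
from email.utils import parsedate_to_datetime


def parse_http_date(value):
    """Parse an HTTP date header such as Last-Modified into a datetime."""
    return parsedate_to_datetime(value)


def last_modified(url):
    """Send a HEAD request and return Last-Modified as a datetime, or None.

    Needs network access; only the response headers are fetched.
    """
    req = urllib.request.Request(
        url,
        method="HEAD",
        headers={"User-Agent": "thumb-check/0.1 (example script)"},
    )
    with urllib.request.urlopen(req) as resp:
        value = resp.headers.get("Last-Modified")
    return parse_http_date(value) if value else None
```

Comparing the returned timestamp with the time the link was first clicked indicates whether the thumbnail was rendered on demand or already existed.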



[Bug 57278] Issue with PDFs downloaded from Archive.org

2013-11-19 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=57278

Shiju Alex  changed:

   What|Removed |Added

 CC||shijua...@hotmail.com

--- Comment #2 from Shiju Alex  ---
I am able to reproduce the issue with another PDF downloaded from Archive.org:
https://ml.wikisource.org/w/index.php?title=Page:Dharmaraja_1913.pdf/11&action=edit&redlink=1
Even though in this case we are just able to read the content (with some
difficulty), it is not good enough for Wikisource digitization efforts.



[Bug 57278] Issue with PDFs downloaded from Archive.org

2013-11-19 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=57278

Sam Reed (reedy)  changed:

   What|Removed |Added

   Priority|Unprioritized   |Normal
 CC||bawolff...@gmail.com,
   ||fflo...@wikimedia.org,
   ||mtrac...@member.fsf.org
  Component|ProofreadPage   |PdfHandler

--- Comment #1 from Sam Reed (reedy)  ---
I suspect that this probably isn't a ProofreadPage issue but one of either the
PdfHandler extension or, more likely, the tools that render PDF pages to images
on the Wikimedia image scalers, namely Ghostscript and ImageMagick.

Moving to PdfHandler for the time being.
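
To help separate a rendering problem from a low-resolution source file, a single page can be rendered locally with Ghostscript alone. A minimal sketch, assuming the gs binary is installed; the helper names are made up for illustration, and this is not necessarily the command line PdfHandler or the image scalers use:

```python
import subprocess


def gs_render_command(pdf_path, page, dpi, out_path):
    """Build a Ghostscript command that renders one PDF page to a JPEG."""
    return [
        "gs", "-q", "-dBATCH", "-dNOPAUSE", "-dSAFER",
        "-sDEVICE=jpeg",
        f"-r{dpi}",                  # output resolution in dpi
        f"-dFirstPage={page}",
        f"-dLastPage={page}",
        f"-sOutputFile={out_path}",
        pdf_path,
    ]


def render_page(pdf_path, page, dpi, out_path):
    """Run Ghostscript; raises CalledProcessError if rendering fails."""
    subprocess.run(gs_render_command(pdf_path, page, dpi, out_path), check=True)
```

If the page is still tiny or unreadable when rendered this way at a generous dpi, the limitation is more likely in the file than in the wiki software.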

Software on cluster:
reedy@tin:/a/common$ dpkg -l | grep ghostscript
ii  ghostscript         9.05~dfsg-0ubuntu4.2  interpreter for the PostScript language and for PDF
ii  gs-cjk-resource     1.20100103-3          Resource files for gs-cjk, ghostscript CJK-TrueType extension
reedy@tin:/a/common$ dpkg -l | grep imagemagick
ii  imagemagick         8:6.6.9.7-5ubuntu3.2  image manipulation programs
ii  imagemagick-common  8:6.6.9.7-5ubuntu3.2  image manipulation programs -- infrastructure


I note a similar output locally too on my dev wiki

reedy@ubuntu64-web-esxi:/var/www/wiki/mediawiki/core$ dpkg -l | grep ghostscript
ii  ghostscript         9.10~dfsg-0ubuntu2   amd64  interpreter for the PostScript language and for PDF
ii  gs-cjk-resource     1.20100103-3         all    Resource files for gs-cjk, ghostscript CJK-TrueType extension
reedy@ubuntu64-web-esxi:/var/www/wiki/mediawiki/core$ dpkg -l | grep imagemagick
ii  imagemagick         8:6.7.7.10-5ubuntu3  amd64  image manipulation programs
ii  imagemagick-common  8:6.7.7.10-5ubuntu3  all    image manipulation programs -- infrastructure



Hopefully it can get triaged a little before being dumped onto the WMF image
scaler component
