Re: OCR old software listing

2019-01-02 Thread Toby Thain via cctalk
On 2019-01-02 7:22 AM, Steve Malikoff via cctalk wrote:
> I timed how long it would take to clean up Mattis' supplied image so it
> might be able to be OCR'd more accurately. Using Paint.NET it took me 23
> minutes to get to the following:
> http://web.aanet.com.au/~malikoff/pdp11/dvY973s_cleaned.png
> 
> There are still a few little bits I missed, but I'm happy to see if it
> reads better. If the complete set is now up then I must have missed it.

I posted a mechanically cleaned-up version on 29 Dec:
https://docs.telegraphics.com.au/mattis/spcwar_pdp11_edit.tif
(_multipage_ TIF)

It's 61 pages, so at 23 minutes per page it's not something you'd want to
do by hand.
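
For anyone who wants to script that kind of cleanup, one standard trick (a
sketch only, not a description of exactly what I did) is a morphological
opening with a long, thin horizontal kernel, which isolates the ruling
lines so they can be subtracted. In Python with OpenCV:

import cv2

# Sketch: isolate long horizontal runs of ink, then subtract them.
page = cv2.imread("page.tif", cv2.IMREAD_GRAYSCALE)
ink = cv2.bitwise_not(page)  # work with ink as white on black

# A kernel far wider than any glyph, one pixel tall (width is a guess).
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (120, 1))
lines = cv2.morphologyEx(ink, cv2.MORPH_OPEN, kernel)

cleaned = cv2.bitwise_not(cv2.subtract(ink, lines))
cv2.imwrite("page_cleaned.tif", cleaned)

As Jon pointed out, wherever a line overlays a glyph this will also chew
dots out of the characters, so it's a first pass at best.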

--Toby

> 
> Steve.
> 
> 



Re: OCR old software listing

2019-01-02 Thread Steve Malikoff via cctalk
I timed how long it would take to clean up Mattis' supplied image so it might
be able to be OCR'd more accurately. Using Paint.NET it took me 23 minutes to
get to the following:
http://web.aanet.com.au/~malikoff/pdp11/dvY973s_cleaned.png

There are still a few little bits I missed, but I'm happy to see if it reads
better. If the complete set is now up then I must have missed it.

Steve.



Re: OCR old software listing

2019-01-02 Thread Larry Kraemer via cctalk
The only way I've been able to get any type of readable ASCII TEXT
from the .tif's is to do the following for each tif:

convert -density 1200 -resize 40% xaaa.tif -density 1 xaaa120040.tif
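
To batch that over all 61 pages, something like the following untested
sketch could work (it assumes ImageMagick's convert is on the PATH and the
single-page .tif's are in the current directory):

import glob
import subprocess

for src in sorted(glob.glob("*.tif")):
    if src.endswith("120040.tif"):
        continue  # skip outputs from an earlier run
    dst = src[:-4] + "120040.tif"  # e.g. xaaa.tif -> xaaa120040.tif
    subprocess.run(["convert", "-density", "1200", "-resize", "40%",
                    src, "-density", "1", dst], check=True)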

Then, OCR it with Irfanview with the KADMOS plugin installed.
For the first page I get the following ASCII:

 CHAR 00RG CHCR 000216R  CHlF 000224R
 CHRTAB = ** G CHSPC000232R  CH05 14R
 CH1  24R  CH15 62R  CH2  70R
-
 CH23 000110R  CH24 000134R  CH25 000140R
 CH26 000144R  CH3  000150R  CH4  000164R
_~__~___~_~
 CH5  080212R  CR = 15   CTLP   = 20
)DAC0   = ** G DACl   = ** G DAC2   = ** G
 DUM= 00   INC  000240R  INC6 000242R
 INC8 080244R  LF = 12   PC =%07
 R0 =%00   R1 =%01   R2 =%02
 R3 ~%03   R4 =%04   R5 =%000~~5
~___~~___~~
 R6 =%06   R7 =%07   SP =%06
 SPACE  = 48   .  = 000246R
 END   ?
;*
   ,.

... , . .
;
;   CHARACTER DISPLAY
~   VERSION 3C
;
;   NOV 15,1974

,;_~~._   ~
;   R0=PTR TO BUFFER OF CHARS-FIRST WORD #OF B~TES
;   R1=BIT TEST ROTATING MASK
~   R2=CHARACTER INCERMENT-DETERMINES CHARACTER SIZE
;   R3=POINTER AT CHAR DOT DATA
  . ;   R4=X POSITION OF FIRST CHAR
;   R5=Y POSlTlON OF FTRST CHAR
;
 00 ~   R0=%0
 01 R1=%1
 02 R2=%2
·
 03 R3=%3
 080004 R4=%4
 05 R5=%5
 06 R6=%6
 07 R7=%7
 07 PC=R7
 06 SP=R6
 20 CTLP=20
 40 SPACE=40
 15 CR=15
 12 LF=12'
 00 DUM=0
.TITLE .CHAR
.GLOBL  CHAR,DAC0,DAC1,0AC2,CHRTAB
 00 .CSECT   ,
  00 012046 CHRR:   MO~ (R0)+,-(SP) ;GET CHAR COUNT
  02 016702 MOV INC,R2  ~SET CHARACTER SIZE
 000232
  06 012737~MOV #-2048 .~#OAC2  ;TURN DOT OFF JUST IN CASE
 1740~0
 00
  14 005316 CH05:   DEC (SP>;IS THERE MORE CHARS?
  16 002002 BGE CH1 ;~ES-GO DRAW THEM
  000820 005726 TST (SP)+   ;NO-POP OLD CTR
  22 000207 RTS PC  ;RLL
DONE!~!!~!1~11!
  24 112AA7 CW·l·MnWR(P0~~...P-e·   .nrT r·uc,o

It's nowhere near 50% accurate, but it's the best I've got so far.

Page 2 is:

\ 1~
 42 001470 BEQ CHLF;YES-GO LF  ---
---~
 44 122703 CMPB#SPACE,R3   ;NO-IS THIS A SPACE?
40
 000850 001470 BEQ CHSPC   ;~ES-GO SPACE
 52 003003 BGT 0H15;NO-IF LESS
THAN.SRAC~-B~~-~*RR
   PAGE001.
 54 122703CMPB#137~R3;IS IT GREAYER TH8N-1~~2~
-
000137
 60 002003 BGE CH2 ;NO-GOOD CHAR~
~--~~-~---~-·
 62 012703~CH15:   MOV #CHRTAB,R3  ~YES-BAD CHAR
 0001300  ,
„
 66 000410 BR  CH23
 70 162703 CH2:SUB #37,R3  ;ZERO FOR FIRST CHAR
IN~TABLE
37
 ~ 010301 MOV R3,Rl   ~SAVE VALE TEMP
~--~--~~-
 76 006303 ASL R3  ;R3=R3*2
 000100 060103 ADD R1~R3   ;R3=R3*3   -
 000102 006303 ASL R3  ;R3=R3*6 (FOR 3 WORDS)
 0801~ 062703~ADD #CHRTAB,R3  ~~ P~ AT FIR~ CHAR~~ATR ~~~E
00
 000110 012701 CH23·   MnV #20A,R1 ~SET TEST BIT
-~-~--~-_~-~
000200
 000114 010546 MOV R5,-(SP>;SAVE lNITIAL ~ POSIT~ON
 000116 010467 MO~ R4,CH24+2   ;SA~E INITIAL X
80001.4   ·
 00~122 010367 MOV R3,CH25+2   ;SAVE INITIAL CHAR PTR
  08881
4
„„,..___ ,.~ ~.,,~,_,,__„,.,
 000126 005137~COM @#DAC2  ;TURN DOT ON INTO CHAR
008008~
 000132 000404 BR  CH26;SKIP
 000134 012704 CH24:   MOV #DUM,R4 ;RESET PTR
00

RE: OCR old software listing

2018-12-31 Thread Kevin Parker via cctalk
I've had a lot of success using Adobe's ClearScan for OCR'ing old stuff.
Admittedly it's not perfect but it can improve the quality of an old
document a lot.


Kevin Parker
0418 815 527

-Original Message-
From: cctalk  On Behalf Of Paul Koning via
cctalk
Sent: Tuesday, 1 January 2019 12:18 PM
To: dwight ; General Discussion: On-Topic and Off-Topic
Posts 
Subject: Re: OCR old software listing



> On Dec 31, 2018, at 7:13 PM, dwight via cctalk 
wrote:
> 
> Fred is right, OCR is only worth it if the document is in perfect
> condition. I just finished getting an old 4004 listing working. I made
> only two mistakes in the 4K of code that were not the fault of the poor
> quality of the listing. Twice I put LDM instead of LD. LDM was the most
> commonly used.

I wouldn't put it quite so strongly.  OCR, even if not perfect, can help a
lot.  You can often OCR + test assembly + proofread faster than retyping,
especially since that requires fixing typos and proofreading also.  Many OCR
errors are caught by the assembler, though not all of them of course.  I've
done both in an ongoing software preservation project; my conclusion still
is to use OCR when it works "well enough".  A couple of errors per page is
definitely "well enough".

The program used matters.  I looked at Tesseract a bit but its quality was
vastly inferior to commercial products in the examples I tried.  I now use
Abbyy FineReader, which handles a lot of line printer and typewriter
material quite well.

paul





Re: OCR old software listing

2018-12-31 Thread Fred Cisin via cctalk
On the other hand, just for the FUN of it, 
can you write some software to find and fix (or simply flag) the most 
common errors?



When I had to terminate my publisher, I was s'posed to receive a copy of 
their customer database.
They deleted all delimiters (spaces, commas, periods, and other 
punctuation), and then printed it out on greenbar with a font that used 
the same character for zero and letter 'O'; and same character for 
one, lower case 'l', and upper cse 'I'.  Surprisingly, they did NOT use a 
bad ribbon and printhead!


An acquaintance OCR'ed it. They were able to get what they thought was "80 
to 90%" from the originals, but not from a xerox copy that visually 
seemed to be just as good.


I spent a little time writing some simple code to parse and fix most of 
it. 
Mostly simple context, such as a zero between two letters is likely an 
'O', or an 'O' between two numerals is likely a zero.  Similarly with one, 
lower case 'L' and upper case 'I'.   Some OCR software now pays attention 
to context.
Five consecutive numerals following two capital letters are likely to be a 
zip code, and the end of the record.  USUALLY.  Comparing those digits and 
the two letters against a zipcode database provided partial confirmation.

. . . and so forth . . .
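
If anyone wants to reinvent it, the core substitutions boil down to a few
context rules, roughly like this (a fresh sketch in Python, NOT the
original code, and only the crude version of the zip-code rule above):

import re

def fix_common_confusions(line):
    # A zero between two letters is likely the letter O.
    line = re.sub(r"(?<=[A-Za-z])0(?=[A-Za-z])", "O", line)
    # An O between two digits is likely a zero.
    line = re.sub(r"(?<=[0-9])[Oo](?=[0-9])", "0", line)
    # An 'l' or 'I' between two digits is likely a one.
    line = re.sub(r"(?<=[0-9])[lI](?=[0-9])", "1", line)
    # A '1' or 'I' between lower-case letters is likely an 'l'.
    line = re.sub(r"(?<=[a-z])[1I](?=[a-z])", "l", line)
    return line

# Two capitals then five digits at the end of a record: likely a
# state + zip code, and the end of the record.  USUALLY -- so flag
# it for a human rather than trusting it blindly.
STATE_ZIP = re.compile(r"[A-Z]{2}\d{5}$")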

Not a practical use of time, but a fun exercise in parsing.


Another time, the .SRT file that I found for "Company Man" used upper case 
'I' instead of lower case 'L'!  (AND had a three minute offset for the 
start time.)  Did not take very long to fix.


--
Grumpy Ol' Fred ci...@xenosoft.com


On Tue, 1 Jan 2019, dwight wrote:


Fred is right, OCR is only worth it if the document is in perfect condition. I 
just finished getting an old 4004 listing working. I made only two mistakes in 
the 4K of code that were not the fault of the poor quality of the listing. 
Twice I put LDM instead of LD. LDM was the most commonly used.
There were still some 15 or so other errors due to the printing. It looked to 
have been done on an ASR33 with poor registration of the print drum. Cs and 0s 
were often missing the right 1/3, so expecting an OCR to do much would have 
been folly, even though some 85% to 90% could be read properly. It took me 
about 3 weeks of evenings to make heads or tails of the code. I've finally got 
it running correctly.
If it had been done with an OCR, in many cases it would have simply put a C 
instead of a 0. I'd have had to go through the listing, checking each C to 
make sure it was right. It was easier in many cases to analyse what I could 
see and make a judgement based on the general context as I was typing it in.
Dwight


From: cctalk  on behalf of Fred Cisin via cctalk 

Sent: Monday, December 31, 2018 9:46 AM
To: General Discussion: On-Topic and Off-Topic Posts
Subject: Re: OCR old software listing

On Mon, 31 Dec 2018, Larry Kraemer via cctalk wrote:

I used the libtiff-tools (Debian 8.x - 32 Bit) to extract all 61 .TIF's
from the Multipage .tif file.  While the .tif's look descent, and
RasterVect shows the .tif properties to be Group 4 Fax (1bpp) with 5100
x 6600 pixels - 300 DPI, I can't get tesseract 3.x, TextBridge Classic
2.0, or Irfanview with KADMOS Plugin to OCR any of the .tif files, with
descent results.  I'd expect an OCR of 85 to 90 % correct conversion to
ASCII text.


Software listings need more accuracy than that.
How many wrong characters does it take for a program not to work?
"desCent" isn't good enough.

85 to 90 % correct is a character wrong in every 6 to 10 characters.
How many errors is that PER LINE?

"But, you can start with that, and just fix the errors, without retyping
the rest."  Doing it that way is a desCent into madness.
BTDT.  Wore out the T-shirts.


A competent typist can retype the whole thing faster than fixing an error
in every six to ten characters.
Only if there is less than one error for every several hundred characters
does "patching it" save time for a competent typist.
In general, for a competent typist, the fastest way to reposition the
cursor to the next error in the line is to simply hit the keys of the
intervening letters.
It is NOT to move the cursor with the mouse, then put your hand back on
the keys to type a character.
Using cursor motion keys is no faster for a competent typist than hitting
the keys of the letters to skip over.


TIP: display the OCR'ed text that is to be corrected in a font that
exaggerates the difference between zero and the letter 'O', and between
one and lower case 'l'.  There are some programs that will attempt to
select those based on context.

--
Grumpy Ol' Fred  ci...@xenosoft.com


Re: OCR old software listing

2018-12-31 Thread Paul Koning via cctalk



> On Dec 31, 2018, at 7:13 PM, dwight via cctalk  wrote:
> 
> Fred is right, OCR is only worth it if the document is in perfect condition. 
> I just finished getting an old 4004 listing working. I made only two mistakes 
> in the 4K of code that were not the fault of the poor quality of the listing. 
> Twice I put LDM instead of LD. LDM was the most commonly used.

I wouldn't put it quite so strongly.  OCR, even if not perfect, can help a lot.  
You can often OCR + test assembly + proofread faster than retyping, especially 
since that requires fixing typos and proofreading also.  Many OCR errors are 
caught by the assembler, though not all of them of course.  I've done both in 
an ongoing software preservation project; my conclusion still is to use OCR 
when it works "well enough".  A couple of errors per page is definitely "well 
enough".

The program used matters.  I looked at Tesseract a bit but its quality was 
vastly inferior to commercial products in the examples I tried.  I now use 
Abbyy FineReader, which handles a lot of line printer and typewriter material 
quite well.

paul




Re: OCR old software listing

2018-12-31 Thread dwight via cctalk
Fred is right, OCR is only worth it if the document is in perfect condition. I 
just finished getting an old 4004 listing working. I made only two mistakes in 
the 4K of code that were not the fault of the poor quality of the listing. 
Twice I put LDM instead of LD. LDM was the most commonly used.
There were still some 15 or so other errors due to the printing. It looked to 
have been done on an ASR33 with poor registration of the print drum. Cs and 0s 
were often missing the right 1/3, so expecting an OCR to do much would have 
been folly, even though some 85% to 90% could be read properly. It took me 
about 3 weeks of evenings to make heads or tails of the code. I've finally got 
it running correctly.
If it had been done with an OCR, in many cases it would have simply put a C 
instead of a 0. I'd have had to go through the listing, checking each C to 
make sure it was right. It was easier in many cases to analyse what I could 
see and make a judgement based on the general context as I was typing it in.
Dwight


From: cctalk  on behalf of Fred Cisin via cctalk 

Sent: Monday, December 31, 2018 9:46 AM
To: General Discussion: On-Topic and Off-Topic Posts
Subject: Re: OCR old software listing

On Mon, 31 Dec 2018, Larry Kraemer via cctalk wrote:
> I used the libtiff-tools (Debian 8.x - 32 Bit) to extract all 61 .TIF's
> from the Multipage .tif file.  While the .tif's look descent, and
> RasterVect shows the .tif properties to be Group 4 Fax (1bpp) with 5100
> x 6600 pixels - 300 DPI, I can't get tesseract 3.x, TextBridge Classic
> 2.0, or Irfanview with KADMOS Plugin to OCR any of the .tif files, with
> descent results.  I'd expect an OCR of 85 to 90 % correct conversion to
> ASCII text.

Software listings need more accuracy than that.
How many wrong characters does it take for a program not to work?
"desCent" isn't good enough.

85 to 90 % correct is a character wrong in every 6 to 10 characters.
How many errors is that PER LINE?

"But, you can start with that, and just fix the errors, without retyping
the rest."  Doing it that way is a desCent into madness.
BTDT.  Wore out the T-shirts.


A competent typist can retype the whole thing faster than fixing an error
in every six to ten characters.
Only if there is less than one error for every several hundred characters
does "patching it" save time for a competent typist.
In general, for a competent typist, the fastest way to reposition the
cursor to the next error in the line is to simply hit the keys of the
intervening letters.
It is NOT to move the cursor with the mouse, then put your hand back on
the keys to type a character.
Using cursor motion keys is no faster for a competent typist than hitting
the keys of the letters to skip over.


TIP: display the OCR'ed text that is to be corrected in a font that
exaggerates the difference between zero and the letter 'O', and between
one and lower case 'l'.  There are some programs that will attempt to
select those based on context.

--
Grumpy Ol' Fred  ci...@xenosoft.com


Re: OCR old software listing

2018-12-31 Thread Fred Cisin via cctalk

On Mon, 31 Dec 2018, Larry Kraemer via cctalk wrote:
I used the libtiff-tools (Debian 8.x - 32 Bit) to extract all 61 .TIF's 
from the Multipage .tif file.  While the .tif's look descent, and 
RasterVect shows the .tif properties to be Group 4 Fax (1bpp) with 5100 
x 6600 pixels - 300 DPI, I can't get tesseract 3.x, TextBridge Classic 
2.0, or Irfanview with KADMOS Plugin to OCR any of the .tif files, with 
descent results.  I'd expect an OCR of 85 to 90 % correct conversion to 
ASCII text.


Software listings need more accuracy than that.
How many wrong characters does it take for a program not to work?
"desCent" isn't good enough.

85 to 90 % correct is a character wrong in every 6 to 10 characters.
How many errors is that PER LINE?

"But, you can start with that, and just fix the errors, without retyping 
the rest."  Doing it that way is a desCent into madness.

BTDT.  Wore out the T-shirts.


A competent typist can retype the whole thing faster than fixing an error 
in every six to ten characters.
Only if there is less than one error for every several hundred characters 
does "patching it" save time for a competent typist.
In general, for a competent typist, the fastest way to reposition the 
cursor to the next error in the line is to simply hit the keys of the 
intervening letters.
It is NOT to move the cursor with the mouse, then put your hand back on 
the keys to type a character.
Using cursor motion keys is no faster for a competent typist than hitting 
the keys of the letters to skip over.



TIP: display the OCR'ed text that is to be corrected in a font that 
exaggerates the difference between zero and the letter 'O', and between 
one and lower case 'l'.  There are some programs that will attempt to 
select those based on context.


--
Grumpy Ol' Fred ci...@xenosoft.com


Re: OCR old software listing

2018-12-31 Thread Toby Thain via cctalk
On 2018-12-31 7:20 AM, Larry Kraemer via cctalk wrote:
> I used the libtiff-tools (Debian 8.x - 32 Bit) to extract all 61 .TIF's
> from the
> Multipage .tif file.  While the .tif's look descent, and RasterVect shows
> the
> .tif properties to be Group 4 Fax (1bpp) with 5100 x 6600 pixels - 300 DPI,
> I can't get tesseract 3.x, TextBridge Classic 2.0, or Irfanview with KADMOS
> Plugin to OCR any of the .tif files, with descent results.  I'd expect an
> OCR
> of 85 to 90 % correct conversion to ASCII text.
> 
> Typically, one of the three above Software packages will do a descent job
> of OCRing .tif's of such scans.  (Most PDF's end up at 72 x 72 DPI, and
> converting them to 300 DPI allows them to be properly OCR'd.)
> 
> If anyone else has had better luck, I'd like to know what your process is.

I don't know if OCR software is sensitive to having correct resolution
(I've practically zero experience with it), but 300 dpi seems wrong for
Mattis' scans.

Seems they should be 600 dpi (21.7 cm x 28 cm).

--Toby

> 
> Thanks.
> 
> Larry
> 



Re: OCR old software listing

2018-12-31 Thread Larry Kraemer via cctalk
I used the libtiff-tools (Debian 8.x - 32 Bit) to extract all 61 .TIF's
from the
Multipage .tif file.  While the .tif's look descent, and RasterVect shows
the
.tif properties to be Group 4 Fax (1bpp) with 5100 x 6600 pixels - 300 DPI,
I can't get tesseract 3.x, TextBridge Classic 2.0, or Irfanview with KADMOS
Plugin to OCR any of the .tif files, with descent results.  I'd expect an
OCR
of 85 to 90 % correct conversion to ASCII text.

Typically, one of the three above Software packages will do a descent job
of OCRing .tif's of such scans.  (Most PDF's end up at 72 x 72 DPI, and
converting them to 300 DPI allows them to be properly OCR'd.)

If anyone else has had better luck, I'd like to know what your process is.

Thanks.

Larry


Re: OCR old software listing.

2018-12-29 Thread Toby Thain via cctalk
On 2018-12-29 1:32 AM, Toby Thain via cctalk wrote:
> On 2018-12-29 12:47 AM, Toby Thain via cctalk wrote:
>> On 2018-12-26 4:29 PM, Mattis Lind via cctalk wrote:
>>> Finally I got hold of the sources for the PDP-11 SPACE WAR that was
>>> submitted to DECUS by Bill Seiler.
>>>
>>> The format is scans of the PAL-11S listing output. It is easy to crop the
>>> image to contain only the actual source, then run OCR on it. I tried a few
>>> online versions and tesseract.
>>>
>>> The problem is that the paper that the listing is printed on has lines.
>>> Very black lines. They make the OCR go completely crazy. Source lines
>>> without black lines OCR ok. The others do not. The files need a massive
>>> amount of manual intervention.
>>>
>>> Does anyone have an idea how to process files like this?
>>>
>>> A good way to remove the black lines?
>>
>> Hi Mattis
>>
>> Here's a first cut. Can probably be improved slightly. Let me know how
>> much this still confuses Tesseract.
>>
>> https://docs.telegraphics.com.au/mattis/spcwar_pdp11_edit.tif
>>
> 
> That is a multipage TIF, and the page order key is listed below.
> 
> I just noticed that a handful of pages seem to be missing, so I'll look
> into that.
> 

Fixed that. I was also able to improve the quality. Same link.

The full page manifest is:

CHAR--
CHAR--0001
CHAR--0002
CHRTAB--
CHRTAB--0001
CHRTAB--0002
CHRTAB--0003
COMPAR--
COMPAR--0001
COMPAR--0002
COMPAR--0003
EXPLOD--
EXPLOD--0001
EXPLOD--0002
GRAVTY--
GRAVTY--0001
GRAVTY--0002
GRAVTY--0003
MULPLY--
MULPLY--0001
MULPLY--0002
PARM--
PARM--0001
PARM--0002
PARM--0003
PARM--0004
PARM--0005
PARM--0006
PARM--0007
PARM--0008
PARM--0009
PWRUP--
PWRUP--0001
RESET--
RESET--0001
RKT1--
RKT1--0001
RKT2--
RKT2--0001
SCORE--
SCORE--0001
SINCOS--
SINCOS--0001
SINCOS--0002
SINCOS--0003
SLINE--
SLINE--0001
SPCWAR--
SPCWAR--0001
SPCWAR--0002
SUN--
SUN--0001
SUN--0002
UPDAT1--
UPDAT1--0001
UPDAT1--0002
UPDAT2--
UPDAT2--0001
UPDAT2--0002
point--
point--0001


> 
>> --Toby
>>
>>>
>>> There are only 19 source files with three or four pages each so I don't
>>> think it makes sense to try to train tesseract to do it (training tesseract
>>> seems to be a huge undertaking).
>>>
>>> https://i.imgur.com/dvY973s.png
>>>
>>> /Mattis
>>>
>>
>>
> 
> 



Re: OCR old software listing.

2018-12-28 Thread Toby Thain via cctalk
On 2018-12-29 12:47 AM, Toby Thain via cctalk wrote:
> On 2018-12-26 4:29 PM, Mattis Lind via cctalk wrote:
>> Finally I got hold of the sources for the PDP-11 SPACE WAR that was
>> submitted to DECUS by Bill Seiler.
>>
>> The format is scans of the PAL-11S listing output. It is easy to crop the
>> image to contain only the actual source, then run OCR on it. I tried a few
>> online versions and tesseract.
>>
>> The problem is that the paper that the listing is printed on has lines.
>> Very black lines. They make the OCR go completely crazy. Source lines
>> without black lines OCR ok. The others do not. The files need a massive
>> amount of manual intervention.
>>
>> Does anyone have an idea how to process files like this?
>>
>> A good way to remove the black lines?
> 
> Hi Mattis
> 
> Here's a first cut. Can probably be improved slightly. Let me know how
> much this still confuses Tesseract.
> 
> https://docs.telegraphics.com.au/mattis/spcwar_pdp11_edit.tif
> 

That is a multipage TIF, and the page order key is listed below.

I just noticed that a handful of pages seem to be missing, so I'll look
into that.

CHAR--
CHAR--0001
CHAR--0002
CHRTAB--
CHRTAB--0001
CHRTAB--0002
COMPAR--
COMPAR--0001
COMPAR--0002
COMPAR--0003
EXPLOD--
EXPLOD--0001
EXPLOD--0002
GRAVTY--
GRAVTY--0001
GRAVTY--0002
GRAVTY--0003
MULPLY--
MULPLY--0001
MULPLY--0002
PARM--
PARM--0001
PARM--0002
PARM--0003
PARM--0005
PARM--0006
PARM--0007
PARM--0008
PARM--0009
PWRUP--
PWRUP--0001
RESET--
RESET--0001
RKT1--
RKT1--0001
RKT2--
RKT2--0001
SCORE--
SCORE--0001
SINCOS--
SINCOS--0001
SINCOS--0002
SLINE--
SLINE--0001
SPCWAR--
SPCWAR--0001
SPCWAR--0002
SUN--
SUN--0001
SUN--0002
UPDAT1--
UPDAT1--0001
UPDAT1--0002
UPDAT2--
UPDAT2--0002
point--
point--0001


> --Toby
> 
>>
>> There are only 19 source files with three or four pages each so I don't
>> think it makes sense to try to train tesseract to do it (training tesseract
>> seems to be a huge undertaking).
>>
>> https://i.imgur.com/dvY973s.png
>>
>> /Mattis
>>
> 
> 



Re: OCR old software listing.

2018-12-28 Thread Toby Thain via cctalk
On 2018-12-26 4:29 PM, Mattis Lind via cctalk wrote:
> Finally I got hold of the sources for the PDP-11 SPACE WAR that was
> submitted to DECUS by Bill Seiler.
> 
> The format is scans of the PAL-11S listing output. It is easy to crop the
> image to contain only the actual source, then run OCR on it. I tried a few
> online versions and tesseract.
> 
> The problem is that the paper that the listing is printed on has lines.
> Very black lines. They make the OCR go completely crazy. Source lines
> without black lines OCR ok. The others do not. The files need a massive
> amount of manual intervention.
> 
> Does anyone have an idea how to process files like this?
> 
> A good way to remove the black lines?

Hi Mattis

Here's a first cut. Can probably be improved slightly. Let me know how
much this still confuses Tesseract.

https://docs.telegraphics.com.au/mattis/spcwar_pdp11_edit.tif

--Toby

> 
> There are only 19 source files with three or four pages each so I don't
> think it makes sense to try to train tesseract to do it (training tesseract
> seems to be a huge undertaking).
> 
> https://i.imgur.com/dvY973s.png
> 
> /Mattis
> 



Re: OCR old software listing.

2018-12-27 Thread Paul Koning via cctalk



> On Dec 26, 2018, at 10:30 PM, Jon Elson via cctalk  
> wrote:
> 
> On 12/26/2018 03:29 PM, Mattis Lind via cctalk wrote:
>> 
>> A good way to remove the black lines?
>> 
>> 
>> 
>> https://i.imgur.com/dvY973s.png
>> 
>> 
> Oh, boy!  The printer was not properly aligned, so the lines actually overlay 
> the dot-matrix printed text!  This is going to make OCR very difficult!  I 
> don't think you can just get rid of the lines; that will drop dots from the 
> characters, too.  A bad situation.

At some point the simplest answer is to type it all in again.  I've been doing 
work on old software using old listings.  Some are nice and clean and OCR just 
fine.  Some are so muddy that they are hard to read for humans, and utterly 
hopeless for OCR.  It's no fun to type in 300 pages of assembly code, but 
sometimes that's the only way.

paul




Re: OCR old software listing.

2018-12-26 Thread Jon Elson via cctalk

On 12/26/2018 03:29 PM, Mattis Lind via cctalk wrote:


A good way to remove the black lines?



https://i.imgur.com/dvY973s.png


Oh, boy!  The printer was not properly aligned, so the lines 
actually overlay the dot-matrix printed text!  This is going 
to make OCR very difficult!  I don't think you can just get 
rid of the lines; that will drop dots from the characters, 
too.  A bad situation.


Jon


Re: OCR old software listing.

2018-12-26 Thread Eric Smith via cctalk
On Wed, Dec 26, 2018, 17:15 Chuck Guzis via cctalk 
wrote:

> On 12/26/18 3:17 PM, Al Kossow via cctalk wrote:
> > On 12/26/18 2:55 PM, Steve Malikoff via cctalk wrote:
> >> Scan them all as-is, put them up and 'crowd source' this list
> > And TYPE the programs in again
>
> I've found that it's often the best course of action and consumes the
> least time overall.  You also have a better chance of understanding the
> code.
>

I typed in 710 pages of listing of the main processor code of HP
2000C'/2000F Time-Shared BASIC. I typed in the full listing format,
including the line numbers, addresses, and object code. I used a simple awk
script to strip it back down to source form. I wrote an HP 21xx
cross-assembler in Perl, assembled the sources, and diffed the generated
assembly listing against the typed-in listing, in order to find the
inevitable errors. I believe the results contain no errors introduced by me
other than typos in the comments.
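
The stripping step amounts to cutting a fixed-width prefix off each line.
Roughly, in Python (a sketch; the 20 below is a guess at the column where
the source field starts):

SRC_COL = 20  # guessed column where the source text begins

with open("listing.txt") as listing, open("recovered.asm", "w") as out:
    for line in listing:
        out.write(line[SRC_COL:].rstrip() + "\n")

The recovered source can then be re-assembled and the generated listing
diffed against the typed-in one to surface the inevitable errors.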

This did take a fair bit of spare time.


Re: OCR old software listing.

2018-12-26 Thread Kyle Owen via cctalk
On Wed, Dec 26, 2018 at 6:15 PM Chuck Guzis via cctalk <
cctalk@classiccmp.org> wrote:

> On 12/26/18 3:17 PM, Al Kossow via cctalk wrote:
> >
> > And TYPE the programs in again
>
> I've found that it's often the best course of action and consumes the
> least time overall.  You also have a better chance of understanding the
> code.
>

I'm doing this right now for DECUS 8-152. Yes, it certainly helps you
understand what the code is actually doing.

Kyle


Re: OCR old software listing.

2018-12-26 Thread Chuck Guzis via cctalk
On 12/26/18 3:17 PM, Al Kossow via cctalk wrote:
> 
> 
> On 12/26/18 2:55 PM, Steve Malikoff via cctalk wrote:
> 
>> Scan them all as-is, put them up and 'crowd source' this list
> 
> And TYPE the programs in again

I've found that it's often the best course of action and consumes the
least time overall.  You also have a better chance of understanding the
code.

--Chuck


Re: OCR old software listing.

2018-12-26 Thread Al Kossow via cctalk



On 12/26/18 2:55 PM, Steve Malikoff via cctalk wrote:

> Scan them all as-is, put them up and 'crowd source' this list

And TYPE the programs in again




Re: OCR old software listing.

2018-12-26 Thread Steve Malikoff via cctalk
Mattis said
> Finally I got hold of the sources for the PDP-11 SPACE WAR that was
> submitted to DECUS by Bill Seiler.
>
> The format is scans of the PAL-11S listing output. It is easy to crop the
> image to contain only the actual source, then run OCR on it. I tried a few
> online versions and tesseract.
>
> The problem is that the paper that the listing is printed on has lines.
> Very black lines. They make the OCR go completely crazy. Source lines
> without black lines OCR ok. The others do not. The files need a massive
> amount of manual intervention.
>
> Does anyone have an idea how to process files like this?
>
> A good way to remove the black lines?
>
> There are only 19 source files with three or four pages each so I don't
> think it makes sense to try to train tesseract to do it (training tesseract
> seems to be a huge undertaking).
>
> https://i.imgur.com/dvY973s.png
>
> /Mattis

Scan them all as-is, put them up, and 'crowd source' this list to divvy up a 
bunch of pages per person, who could then use their favourite graphics 
program to paint out the lines and send you back the cleaned-up images?
I could do a few.

Steve.




Re: OCR old software listing.

2018-12-26 Thread Toby Thain via cctalk
On 2018-12-26 4:29 PM, Mattis Lind via cctalk wrote:
> Finally I got hold of the sources for the PDP-11 SPACE WAR that was
> submitted to DECUS by Bill Seiler.
> 
> The format is scans of the PAL-11S listing output. It is easy to crop the
> image to contain only the actual source, then run OCR on it. I tried a few
> online versions and tesseract.
> 
> The problem is that the paper that the listing is printed on has lines.
> Very black lines. They make the OCR go completely crazy. Source lines
> without black lines OCR ok. The others do not. The files need a massive
> amount of manual intervention.
> 
> Does anyone have an idea how to process files like this?
> 
> A good way to remove the black lines?
> 

Hi Mattis

I have some ideas. Can you give me access to the original scans?

--Toby


> There are only 19 source files with three or four pages each so I don't
> think it makes sense to try to train tesseract to do it (training tesseract
> seems to be a huge undertaking).
> 
> https://i.imgur.com/dvY973s.png
> 
> /Mattis
> 



Re: OCR old software listing.

2018-12-26 Thread Will Cooke via cctalk
> On December 26, 2018 at 4:29 PM Mattis Lind via cctech 
>  wrote:
> 
> 
> Finally I got hold of the sources for the PDP-11 SPACE WAR that was
> submitted to DECUS by Bill Seiler.
> 
> The format is scans of the PAL-11S listing output. It is easy to crop the
> image to contain only the actual source, then run OCR on it. I tried a few
> online versions and tesseract.
> 
> The problem is that the paper that the listing is printed on has lines.
> Very black lines. They make the OCR go completely crazy. Source lines
> without black lines OCR ok. The others do not. The files need a massive
> amount of manual intervention.
> 
> Does anyone have an idea how to process files like this?
> 
> A good way to remove the black lines?
> 
> There are only 19 source files with three or four pages each so I don't
> think it makes sense to try to train tesseract to do it (training tesseract
> seems to be a huge undertaking).
> 
> https://i.imgur.com/dvY973s.png
> 
> /Mattis
One thing you might try is to pull the scan images into MATLAB/GNU Octave and 
do a 2D FFT, remove the frequency band of the lines, inverse FFT, and save.  
I've had good luck removing regular patterns of noise from images that way.
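
The same idea in Python with NumPy, as a very rough sketch (the notch sizes
below are guesses and depend on the line spacing):

import numpy as np
from PIL import Image

img = np.asarray(Image.open("page.tif").convert("L"), dtype=float)
F = np.fft.fftshift(np.fft.fft2(img))

# Horizontal ruling lines put their energy along the vertical frequency
# axis; zero a narrow band there, keeping the low frequencies near DC.
h, w = F.shape
cy, cx = h // 2, w // 2
band = 3    # half-width of the notch, in frequency bins (a guess)
keep = 10   # protect this many bins around DC so the page survives

F[:cy - keep, cx - band:cx + band + 1] = 0
F[cy + keep + 1:, cx - band:cx + band + 1] = 0

cleaned = np.real(np.fft.ifft2(np.fft.ifftshift(F)))
Image.fromarray(np.clip(cleaned, 0, 255).astype(np.uint8)).save("cleaned.tif")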

Will

"He may look dumb but that's just a disguise."  -- Charlie Daniels


"The names of global variables should start with// "  -- https://isocpp.org