Re: OCR old software listing
On 2019-01-02 7:22 AM, Steve Malikoff via cctalk wrote:
> I timed myself how long it would take to clean up Mattis' supplied image so it might
> be able to be OCR'd more accurately. Using Paint.NET it took me 23 minutes to get to
> the following: http://web.aanet.com.au/~malikoff/pdp11/dvY973s_cleaned.png
>
> There are still a few little bits I missed, but happy to see if it reads better.
> If the complete set are now up then I must have missed it.

I posted a mechanically cleaned up version on 29 Dec:
https://docs.telegraphics.com.au/mattis/spcwar_pdp11_edit.tif (_multipage_ TIF)

It's 61 pages so at 23 minutes per page it's not something you'd want to do by hand.

--Toby

> Steve.
Re: OCR old software listing
I timed myself how long it would take to clean up Mattis' supplied image so it might be able to be OCR'd more accurately. Using Paint.NET it took me 23 minutes to get to the following: http://web.aanet.com.au/~malikoff/pdp11/dvY973s_cleaned.png There are still a few little bits I missed, but happy to see if it reads better. If the complete set are now up then I must have missed it. Steve.
Re: OCR old software listing
The only way I've been able to get any type of readable ASCII TEXT from the .tif's is to do the following for each tif:

convert -density 1200 -resize 40% xaaa.tif -density 1 xaaa120040.tif

Then, OCR it with Irfanview with the KADMOS plugin installed. For the first page I get the following ASCII:

CHAR 00RG CHCR 000216R CHlF 000224R CHRTAB = ** G CHSPC000232R CH05 14R CH1 24R CH15 62R CH2 70R - CH23 000110R CH24 000134R CH25 000140R CH26 000144R CH3 000150R CH4 000164R _~__~___~_~ CH5 080212R CR = 15 CTLP = 20 )DAC0 = ** G DACl = ** G DAC2 = ** G DUM= 00 INC 000240R INC6 000242R INC8 080244R LF = 12 PC =%07 R0 =%00 R1 =%01 R2 =%02 R3 ~%03 R4 =%04 R5 =%000~~5 ~___~~___~~ R6 =%06 R7 =%07 SP =%06 SPACE = 48 . = 000246R END ? ;* ,. ... , . . ; ; CHARACTER DISPLAY ~ VERSION 3C ; ; NOV 15,1974 ,;_~~._ ~ ; R0=PTR TO BUFFER OF CHARS-FIRST WORD #OF B~TES ; R1=BIT TEST ROTATING MASK ~ R2=CHARACTER INCERMENT-DETERMINES CHARACTER SIZE ; R3=POINTER AT CHAR DOT DATA . ; R4=X POSITION OF FIRST CHAR ; R5=Y POSlTlON OF FTRST CHAR ; 00 ~ R0=%0 01 R1=%1 02 R2=%2 · 03 R3=%3 080004 R4=%4 05 R5=%5 06 R6=%6 07 R7=%7 07 PC=R7 06 SP=R6 20 CTLP=20 40 SPACE=40 15 CR=15 12 LF=12' 00 DUM=0 .TITLE .CHAR .GLOBL CHAR,DAC0,DAC1,0AC2,CHRTAB 00 .CSECT , 00 012046 CHRR: MO~ (R0)+,-(SP) ;GET CHAR COUNT 02 016702 MOV INC,R2 ~SET CHARACTER SIZE 000232 06 012737~MOV #-2048 .~#OAC2 ;TURN DOT OFF JUST IN CASE 1740~0 00 14 005316 CH05: DEC (SP>;IS THERE MORE CHARS? 16 002002 BGE CH1 ;~ES-GO DRAW THEM 000820 005726 TST (SP)+ ;NO-POP OLD CTR 22 000207 RTS PC ;RLL DONE!~!!~!1~11! 24 112AA7 CW·l·MnWR(P0~~...P-e· .nrT r·uc,o

It's nowhere near 50% accurate, but it's the best I've got so far. Page 2 is:

\ 1~ 42 001470 BEQ CHLF;YES-GO LF --- ---~ 44 122703 CMPB#SPACE,R3 ;NO-IS THIS A SPACE? 40 000850 001470 BEQ CHSPC ;~ES-GO SPACE 52 003003 BGT 0H15;NO-IF LESS THAN.SRAC~-B~~-~*RR PAGE001. 54 122703CMPB#137~R3;IS IT GREAYER TH8N-1~~2~ - 000137 60 002003 BGE CH2 ;NO-GOOD CHAR~ ~--~~-~---~-· 62 012703~CH15: MOV #CHRTAB,R3 ~YES-BAD CHAR 0001300 , „ 66 000410 BR CH23 70 162703 CH2:SUB #37,R3 ;ZERO FOR FIRST CHAR IN~TABLE 37 ~ 010301 MOV R3,Rl ~SAVE VALE TEMP ~--~--~~- 76 006303 ASL R3 ;R3=R3*2 000100 060103 ADD R1~R3 ;R3=R3*3 - 000102 006303 ASL R3 ;R3=R3*6 (FOR 3 WORDS) 0801~ 062703~ADD #CHRTAB,R3 ~~ P~ AT FIR~ CHAR~~ATR ~~~E 00 000110 012701 CH23· MnV #20A,R1 ~SET TEST BIT -~-~--~-_~-~ 000200 000114 010546 MOV R5,-(SP>;SAVE lNITIAL ~ POSIT~ON 000116 010467 MO~ R4,CH24+2 ;SA~E INITIAL X 80001.4 · 00~122 010367 MOV R3,CH25+2 ;SAVE INITIAL CHAR PTR 08881 4 „„,..___ ,.~ ~.,,~,_,,__„,., 000126 005137~COM @#DAC2 ;TURN DOT ON INTO CHAR 008008~ 000132 000404 BR CH26;SKIP 000134 012704 CH24: MOV #DUM,R4 ;RESET PTR 00
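To repeat that preprocess-then-OCR step over all 61 pages, a minimal batch sketch follows (Python; it assumes ImageMagick's convert and the tesseract command-line tool are installed, and that the split pages are named xaaa.tif, xaab.tif, ... in the tiffsplit style matching the name quoted above):

import glob
import subprocess

# Upsample each split page, then shrink it, roughly following the convert
# invocation quoted above, and hand the result to tesseract. The --psm 6 mode
# ("assume a single uniform block of text") is only a guess at a setting that
# suits fixed-pitch listings; it is not something reported in this thread.
for page in sorted(glob.glob("x???.tif")):
    cleaned = page.replace(".tif", "_clean.tif")
    subprocess.run(["convert", "-density", "1200", page, "-resize", "40%", cleaned],
                   check=True)
    subprocess.run(["tesseract", cleaned, cleaned.rsplit(".", 1)[0], "--psm", "6"],
                   check=True)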
RE: OCR old software listing
I've had a lot of success using Adobe's Clearscan for OCR'ing old stuff. Admittedly it's not perfect but it can improve the quality of an old document a lot. Kevin Parker 0418 815 527 -Original Message- From: cctalk On Behalf Of Paul Koning via cctalk Sent: Tuesday, 1 January 2019 12:18 PM To: dwight ; General Discussion: On-Topic and Off-Topic Posts Subject: Re: OCR old software listing > On Dec 31, 2018, at 7:13 PM, dwight via cctalk wrote: > > Fred is right, OCR is only worth it if the document is in perfect condition. I just finish getting an old 4004 listing working. I made only two mistakes on the 4K of code that were not the fault of the poorness of the listing. Twice I put LDM instead of LD. LDM was the most commonly used. I wouldn't put it quite so strongly. OCR even if not perfect can help a lot. You can often OCR + test assembly + proofread faster than retyping, especially since that requires fixing typos and proofreading also. Many OCR errors are caught by the assembler, though not all of them of course. I've done both in an ongoing software preservation project; my conclusion still is to use OCR when it works "well enough". A couple of errors per page is definitely "well enough". The program used matters. I looked at Tesseract a bit but its quality was vastly inferior to commercial products in the examples I tried. I now use Abbyy FineReader, which handles a lot of line printer and typewriter material quite well. paul
Re: OCR old software listing
On the other hand, just for the FUN of it, can you write some software to find and fix (or simply flag) the most common errors?

When I had to terminate my publisher, I was s'posed to receive a copy of their customer database. They deleted all delimiters (spaces, commas, periods, and other punctuation), and then printed it out on greenbar with a font that used the same character for zero and letter 'O'; and same character for one, lower case 'l', and upper case 'I'. Surprisingly, they did NOT use a bad ribbon and printhead!

An acquaintance OCR'ed it. They were able to get what they thought was "80 to 90%" from the originals, but not from a xerox copy that visually seemed to be just as good.

I spent a little time writing some simple code to parse and fix most of it. Mostly simple context, such as a zero between two letters is likely an 'O', or an 'O' between two numerals is likely a zero. Similarly with one, lower case 'L' and upper case 'I'. Some OCR software now pays attention to context. Five consecutive numerals following two capital letters is likely to be a zip code, and end of the record. USUALLY. Comparison of those digits with the two letters in a zipcode database provided partial confirmation. . . . and so forth . . .

Not a practical use of time, but a fun exercise in parsing.

Another time, the .SRT file that I found for "Company Man" used upper case 'I' instead of lower case 'L'! (AND had a three minute offset for the start time) Did not take very long to fix.

-- Grumpy Ol' Fred ci...@xenosoft.com

On Tue, 1 Jan 2019, dwight wrote:

Fred is right, OCR is only worth it if the document is in perfect condition. I just finish getting an old 4004 listing working. I made only two mistakes on the 4K of code that were not the fault of the poorness of the listing. Twice I put LDM instead of LD. LDM was the most commonly used. There were still some 15 or so other errors do to the printing. It looked to be done on a ASR33 with poor registration of the print drum. Cs and 0s were often missing the right 1/3. Expecting an OCR to do much would have been a folly. Even though some 85% to 90% could be read properly. It took be about 3 weeks of evenings to make heads or tails of the code. I've finally got it running correctly. If it had been done with an OCR, many cases it would have simply put a C instead of a 0. I'd have had to go through the listing, checking each C to make sure it was right. It is easier in many cases to have analysed what I could see and make a judgement, based on what I could see and the general context as I was typing it in.

Dwight

From: cctalk on behalf of Fred Cisin via cctalk
Sent: Monday, December 31, 2018 9:46 AM
To: General Discussion: On-Topic and Off-Topic Posts
Subject: Re: OCR old software listing

On Mon, 31 Dec 2018, Larry Kraemer via cctalk wrote: I used the libtiff-tools (Debian 8.x - 32 Bit) to extract all 61 .TIF's from the Multipage .tif file. While the .tif's look descent, and RasterVect shows the .tif properties to be Group 4 Fax (1bpp) with 5100 x 6600 pixels - 300 DPI, I can't get tesseract 3.x, TextBridge Classic 2.0, or Irfanview with KADMOS Plugin to OCR any of the .tif files, with descent results. I'd expect an OCR of 85 to 90 % correct conversion to ASCII text.

Software listings need more accuraacy than that. How many wrong characters does it take for a program not to work? "desCent" isn't good enough. 85 to 90 % correct is a character wrong in every 6 to 10 characters. How many errors is that PER LINE?
"But, you can start with that, and just fix the errors, without retyping the rest." Doing it that way is a desCent into madness. BTDT. wore out the T-shirts. A competent typist can retype the whole thing faster than fixing an error in every six to ten characters. Only if there is less than one error for every several hundred characters does "patching it" save time for a competent typist. In general, for a competent typist, the fastest way to reposition the cursor to the next error in the line is to simply hit the keys of the intervening letters. It is NOT to move the cursor with the mouse, then put your hand back on the keys to type a character. Using cursor motion keys is no faster for a competent typist than hitting the keys of the letters toskip over. TIP: display the OCR'ed text that is to be corrected in a font that exaggerates the difference between zero and the letter 'O', and between one and lower case 'l'. There are some programs that will attempt to select those based on context. -- Grumpy Ol' Fred ci...@xenosoft.com
Re: OCR old software listing
> On Dec 31, 2018, at 7:13 PM, dwight via cctalk wrote: > > Fred is right, OCR is only worth it if the document is in perfect condition. > I just finish getting an old 4004 listing working. I made only two mistakes > on the 4K of code that were not the fault of the poorness of the listing. > Twice I put LDM instead of LD. LDM was the most commonly used. I wouldn't put it quite so strongly. OCR even if not perfect can help a lot. You can often OCR + test assembly + proofread faster than retyping, especially since that requires fixing typos and proofreading also. Many OCR errors are caught by the assembler, though not all of them of course. I've done both in an ongoing software preservation project; my conclusion still is to use OCR when it works "well enough". A couple of errors per page is definitely "well enough". The program used matters. I looked at Tesseract a bit but its quality was vastly inferior to commercial products in the examples I tried. I now use Abbyy FineReader, which handles a lot of line printer and typewriter material quite well. paul
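The "let the assembler catch the OCR errors" loop Paul describes can be largely automated; a small sketch follows, in Python. The assembler name (asm11) and its diagnostic format here are placeholders, not a real tool from this thread -- substitute whatever cross-assembler is actually in use:

import re
import subprocess

def proofread_worklist(source_file):
    # Run the (placeholder) assembler and collect the line numbers it complains
    # about; those are the lines worth re-reading against the scanned listing.
    result = subprocess.run(["asm11", source_file], capture_output=True, text=True)
    lines = set()
    for msg in result.stderr.splitlines():
        m = re.search(r":(\d+):", msg)   # assumes "file:line:" style diagnostics
        if m:
            lines.add(int(m.group(1)))
    return sorted(lines)

print(proofread_worklist("char.pal"))   # file name is made up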
Re: OCR old software listing
Fred is right, OCR is only worth it if the document is in perfect condition. I just finished getting an old 4004 listing working. I made only two mistakes on the 4K of code that were not the fault of the poorness of the listing. Twice I put LDM instead of LD. LDM was the most commonly used.

There were still some 15 or so other errors due to the printing. It looked to be done on an ASR33 with poor registration of the print drum. Cs and 0s were often missing the right 1/3. Expecting an OCR to do much would have been a folly, even though some 85% to 90% could be read properly. It took me about 3 weeks of evenings to make heads or tails of the code. I've finally got it running correctly.

If it had been done with an OCR, in many cases it would have simply put a C instead of a 0. I'd have had to go through the listing, checking each C to make sure it was right. It is easier in many cases to have analysed what I could see and made a judgement, based on the general context, as I was typing it in.

Dwight

From: cctalk on behalf of Fred Cisin via cctalk
Sent: Monday, December 31, 2018 9:46 AM
To: General Discussion: On-Topic and Off-Topic Posts
Subject: Re: OCR old software listing

On Mon, 31 Dec 2018, Larry Kraemer via cctalk wrote:
> I used the libtiff-tools (Debian 8.x - 32 Bit) to extract all 61 .TIF's
> from the Multipage .tif file. While the .tif's look descent, and
> RasterVect shows the .tif properties to be Group 4 Fax (1bpp) with 5100
> x 6600 pixels - 300 DPI, I can't get tesseract 3.x, TextBridge Classic
> 2.0, or Irfanview with KADMOS Plugin to OCR any of the .tif files, with
> descent results. I'd expect an OCR of 85 to 90 % correct conversion to
> ASCII text.

Software listings need more accuraacy than that. How many wrong characters does it take for a program not to work? "desCent" isn't good enough. 85 to 90 % correct is a character wrong in every 6 to 10 characters. How many errors is that PER LINE? "But, you can start with that, and just fix the errors, without retyping the rest." Doing it that way is a desCent into madness. BTDT. wore out the T-shirts.

A competent typist can retype the whole thing faster than fixing an error in every six to ten characters. Only if there is less than one error for every several hundred characters does "patching it" save time for a competent typist. In general, for a competent typist, the fastest way to reposition the cursor to the next error in the line is to simply hit the keys of the intervening letters. It is NOT to move the cursor with the mouse, then put your hand back on the keys to type a character. Using cursor motion keys is no faster for a competent typist than hitting the keys of the letters toskip over.

TIP: display the OCR'ed text that is to be corrected in a font that exaggerates the difference between zero and the letter 'O', and between one and lower case 'l'. There are some programs that will attempt to select those based on context.

-- Grumpy Ol' Fred ci...@xenosoft.com
Re: OCR old software listing
On Mon, 31 Dec 2018, Larry Kraemer via cctalk wrote:
> I used the libtiff-tools (Debian 8.x - 32 Bit) to extract all 61 .TIF's
> from the Multipage .tif file. While the .tif's look descent, and
> RasterVect shows the .tif properties to be Group 4 Fax (1bpp) with 5100
> x 6600 pixels - 300 DPI, I can't get tesseract 3.x, TextBridge Classic
> 2.0, or Irfanview with KADMOS Plugin to OCR any of the .tif files, with
> descent results. I'd expect an OCR of 85 to 90 % correct conversion to
> ASCII text.

Software listings need more accuraacy than that. How many wrong characters does it take for a program not to work? "desCent" isn't good enough. 85 to 90 % correct is a character wrong in every 6 to 10 characters. How many errors is that PER LINE?

"But, you can start with that, and just fix the errors, without retyping the rest." Doing it that way is a desCent into madness. BTDT. wore out the T-shirts. A competent typist can retype the whole thing faster than fixing an error in every six to ten characters. Only if there is less than one error for every several hundred characters does "patching it" save time for a competent typist.

In general, for a competent typist, the fastest way to reposition the cursor to the next error in the line is to simply hit the keys of the intervening letters. It is NOT to move the cursor with the mouse, then put your hand back on the keys to type a character. Using cursor motion keys is no faster for a competent typist than hitting the keys of the letters to skip over.

TIP: display the OCR'ed text that is to be corrected in a font that exaggerates the difference between zero and the letter 'O', and between one and lower case 'l'. There are some programs that will attempt to select those based on context.

-- Grumpy Ol' Fred ci...@xenosoft.com
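Those context rules, which Fred spells out in his 1 Jan follow-up further up the thread (a zero between two letters is probably an 'O', an 'O' between two digits is probably a zero, and likewise for 1/l/I), translate almost directly into a few regular expressions. A rough Python sketch, heuristics only and with a made-up sample string; in practice you would flag the changes for review rather than apply them blindly:

import re

def fix_line(line):
    # '0' flanked by letters is probably the letter 'O'
    line = re.sub(r"(?<=[A-Za-z])0(?=[A-Za-z])", "O", line)
    # 'O' flanked by digits is probably a zero
    line = re.sub(r"(?<=\d)[Oo](?=\d)", "0", line)
    # 'l' or 'I' flanked by digits is probably the digit one
    line = re.sub(r"(?<=\d)[lI](?=\d)", "1", line)
    # '1' flanked by lower-case letters is probably an 'l'
    line = re.sub(r"(?<=[a-z])1(?=[a-z])", "l", line)
    return line

print(fix_line("B0X 1O1 ZIP 945l0"))   # -> BOX 101 ZIP 94510

The zip-code check Fred mentions would be a second pass: match two capital letters followed by five digits at the end of a record and look the pair up in a state/ZIP table for partial confirmation.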
Re: OCR old software listing
On 2018-12-31 7:20 AM, Larry Kraemer via cctalk wrote:
> I used the libtiff-tools (Debian 8.x - 32 Bit) to extract all 61 .TIF's from the
> Multipage .tif file. While the .tif's look descent, and RasterVect shows the
> .tif properties to be Group 4 Fax (1bpp) with 5100 x 6600 pixels - 300 DPI,
> I can't get tesseract 3.x, TextBridge Classic 2.0, or Irfanview with KADMOS
> Plugin to OCR any of the .tif files, with descent results. I'd expect an OCR
> of 85 to 90 % correct conversion to ASCII text.
>
> Typically, one of the three above Software packages will do a descent job
> of OCRing .tif's of such scans. (Most PDF's end up at 72 x 72 DPI, and
> converting them to 300 DPI, allows them to be properly OCR'd.)
>
> If anyone else has had better luck, I'd like to know what your process is.

I don't know if OCR software is sensitive to having correct resolution (I've practically zero experience with it), but 300 dpi seems wrong for Mattis' scans. Seems they should be 600 dpi (21.7 cm x 28 cm).

--Toby

> Thanks.
>
> Larry
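The arithmetic supports Toby's guess: 5100 x 6600 pixels at 600 dpi is exactly 8.5 x 11 inches (about 21.6 x 28 cm), which matches the page size he gives, whereas taking the 300 dpi tag at face value would imply an implausible 17 x 22 inch page. So the images look like 600 dpi scans carrying a wrong resolution tag.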
Re: OCR old software listing
I used the libtiff-tools (Debian 8.x - 32 Bit) to extract all 61 .TIF's from the Multipage .tif file. While the .tif's look descent, and RasterVect shows the .tif properties to be Group 4 Fax (1bpp) with 5100 x 6600 pixels - 300 DPI, I can't get tesseract 3.x, TextBridge Classic 2.0, or Irfanview with KADMOS Plugin to OCR any of the .tif files, with descent results. I'd expect an OCR of 85 to 90 % correct conversion to ASCII text. Typically, one of the three above Software packages will do a descent job of OCRing .tif's of such scans. (Most PDF's end up at 72 x 72 DPI, and converting them to 300 DPI, allows them to be properly OCR'd.) If anyone else has had better luck, I'd like to know what your process is. Thanks. Larry
Re: OCR old software listing.
On 2018-12-29 1:32 AM, Toby Thain via cctalk wrote: > On 2018-12-29 12:47 AM, Toby Thain via cctalk wrote: >> On 2018-12-26 4:29 PM, Mattis Lind via cctalk wrote: >>> Finally I got hold of the sources for the PDP-11 SPACE WAR that was >>> submitted to DECUS by Bill Seiler. >>> >>> The format is scans of the PAL-11S listing output. It is easy to crop the >>> image to only contain actual source. Then running OCR on it. Tried a few >>> online versions and tesseract. >>> >>> The problem is that the paper that the listing is printed on has lines. >>> Very black lines. It makes the OCR go completely crazy. Source lines >>> without black lines OCR ok. The others do not. The files need massive >>> amount of manual intervention. >>> >>> Does anyone have an idea how to process files like this? >>> >>> A good way to remove the black lines? >> >> Hi Mattis >> >> Here's a first cut. Can probably be improved slightly. Let me know how >> much this still confuses Tesseract. >> >> https://docs.telegraphics.com.au/mattis/spcwar_pdp11_edit.tif >> > > That is a multipage TIF, and the page order key is listed below. > > I just noticed that a handful of pages seem to be missing, so I'll look > into that. > Fixed that. I was also able to improve the quality. Same link. The full page manifest is: CHAR-- CHAR--0001 CHAR--0002 CHRTAB-- CHRTAB--0001 CHRTAB--0002 CHRTAB--0003 COMPAR-- COMPAR--0001 COMPAR--0002 COMPAR--0003 EXPLOD-- EXPLOD--0001 EXPLOD--0002 GRAVTY-- GRAVTY--0001 GRAVTY--0002 GRAVTY--0003 MULPLY-- MULPLY--0001 MULPLY--0002 PARM-- PARM--0001 PARM--0002 PARM--0003 PARM--0004 PARM--0005 PARM--0006 PARM--0007 PARM--0008 PARM--0009 PWRUP-- PWRUP--0001 RESET-- RESET--0001 RKT1-- RKT1--0001 RKT2-- RKT2--0001 SCORE-- SCORE--0001 SINCOS-- SINCOS--0001 SINCOS--0002 SINCOS--0003 SLINE-- SLINE--0001 SPCWAR-- SPCWAR--0001 SPCWAR--0002 SUN-- SUN--0001 SUN--0002 UPDAT1-- UPDAT1--0001 UPDAT1--0002 UPDAT2-- UPDAT2--0001 UPDAT2--0002 point-- point--0001 > >> --Toby >> >>> >>> There are only 19 source files with three or four pages each so I don't >>> think it makes sense to try to train tesseract to do it (training tesseract >>> seems to be a huge undertaking). >>> >>> https://i.imgur.com/dvY973s.png >>> >>> /Mattis >>> >> >> > >
Re: OCR old software listing.
On 2018-12-29 12:47 AM, Toby Thain via cctalk wrote: > On 2018-12-26 4:29 PM, Mattis Lind via cctalk wrote: >> Finally I got hold of the sources for the PDP-11 SPACE WAR that was >> submitted to DECUS by Bill Seiler. >> >> The format is scans of the PAL-11S listing output. It is easy to crop the >> image to only contain actual source. Then running OCR on it. Tried a few >> online versions and tesseract. >> >> The problem is that the paper that the listing is printed on has lines. >> Very black lines. It makes the OCR go completely crazy. Source lines >> without black lines OCR ok. The others do not. The files need massive >> amount of manual intervention. >> >> Does anyone have an idea how to process files like this? >> >> A good way to remove the black lines? > > Hi Mattis > > Here's a first cut. Can probably be improved slightly. Let me know how > much this still confuses Tesseract. > > https://docs.telegraphics.com.au/mattis/spcwar_pdp11_edit.tif > That is a multipage TIF, and the page order key is listed below. I just noticed that a handful of pages seem to be missing, so I'll look into that. CHAR-- CHAR--0001 CHAR--0002 CHRTAB-- CHRTAB--0001 CHRTAB--0002 COMPAR-- COMPAR--0001 COMPAR--0002 COMPAR--0003 EXPLOD-- EXPLOD--0001 EXPLOD--0002 GRAVTY-- GRAVTY--0001 GRAVTY--0002 GRAVTY--0003 MULPLY-- MULPLY--0001 MULPLY--0002 PARM-- PARM--0001 PARM--0002 PARM--0003 PARM--0005 PARM--0006 PARM--0007 PARM--0008 PARM--0009 PWRUP-- PWRUP--0001 RESET-- RESET--0001 RKT1-- RKT1--0001 RKT2-- RKT2--0001 SCORE-- SCORE--0001 SINCOS-- SINCOS--0001 SINCOS--0002 SLINE-- SLINE--0001 SPCWAR-- SPCWAR--0001 SPCWAR--0002 SUN-- SUN--0001 SUN--0002 UPDAT1-- UPDAT1--0001 UPDAT1--0002 UPDAT2-- UPDAT2--0002 point-- point--0001 > --Toby > >> >> There are only 19 source files with three or four pages each so I don't >> think it makes sense to try to train tesseract to do it (training tesseract >> seems to be a huge undertaking). >> >> https://i.imgur.com/dvY973s.png >> >> /Mattis >> > >
Re: OCR old software listing.
On 2018-12-26 4:29 PM, Mattis Lind via cctalk wrote: > Finally I got hold of the sources for the PDP-11 SPACE WAR that was > submitted to DECUS by Bill Seiler. > > The format is scans of the PAL-11S listing output. It is easy to crop the > image to only contain actual source. Then running OCR on it. Tried a few > online versions and tesseract. > > The problem is that the paper that the listing is printed on has lines. > Very black lines. It makes the OCR go completely crazy. Source lines > without black lines OCR ok. The others do not. The files need massive > amount of manual intervention. > > Does anyone have an idea how to process files like this? > > A good way to remove the black lines? Hi Mattis Here's a first cut. Can probably be improved slightly. Let me know how much this still confuses Tesseract. https://docs.telegraphics.com.au/mattis/spcwar_pdp11_edit.tif --Toby > > There are only 19 source files with three or four pages each so I don't > think it makes sense to try to train tesseract to do it (training tesseract > seems to be a huge undertaking). > > https://i.imgur.com/dvY973s.png > > /Mattis >
Re: OCR old software listing.
> On Dec 26, 2018, at 10:30 PM, Jon Elson via cctalk > wrote: > > On 12/26/2018 03:29 PM, Mattis Lind via cctalk wrote: >> >> A good way to remove the black lines? >> >> >> >> https://i.imgur.com/dvY973s.png >> >> > Oh, boy! The printer was not properly aligned, so the lines actually overlay > the dot-matrix printed text! This is going to make OCR very difficult! I > don't think you can just get rid of the lines, that will drop dots from the > characters, too. A bad situation. At some point the simplest answer is to type it all in again. I've been doing work on old software using old listings. Some are nice and clean and OCR just fine. Some are so muddy that they are hard to read for humans, and utterly hopeless for OCR. It's no fun to type in 300 pages of assembly code, but sometimes that's the only way. paul
Re: OCR old software listing.
On 12/26/2018 03:29 PM, Mattis Lind via cctalk wrote: A good way to remove the black lines? https://i.imgur.com/dvY973s.png Oh, boy! The printer was not properly aligned, so the lines actually overlay the dot-matrix printed text! This is going to make OCR very difficult! I don't think you can just get rid of the lines, that will drop dots from the characters, too. A bad situation. Jon
Re: OCR old software listing.
On Wed, Dec 26, 2018, 17:15 Chuck Guzis via cctalk wrote: > On 12/26/18 3:17 PM, Al Kossow via cctalk wrote: > > On 12/26/18 2:55 PM, Steve Malikoff via cctalk wrote: > >> Scan them all as-is, put them up and 'crowd source' this list > > And TYPE the programs in again > > I've found that it's often the best course of action and consumes the > least time overall. You also have a better chance of understanding the > code. > I typed in 710 pages of listing of the main processor code of HP 2000C'/2000F Time-Shared BASIC. I typed in the full listing format, including the line numbers, addresses, and object code. I used a simple awk script to strip it back down to source form. I wrote an HP 21xx cross-assembler in Perl, assembled the sources, and diffed the generated assembly listing against the typed-in listing, in order to find the inevitable errors. I believe the results contain no errors introduced by me other than typos in the comments. This did take a fair bit of spare time.
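The "strip the listing back down to source" step described above (done there with a simple awk script) is easy to sketch; here is the same idea in Python. The column at which the source statement starts is an assumption -- real listing formats differ, so check a few lines of the actual listing first:

# Columns 0..SOURCE_COL-1 are assumed to hold the line number, address and
# object code; everything after that is treated as the source statement.
SOURCE_COL = 32

def listing_to_source(listing_path, source_path):
    with open(listing_path) as fin, open(source_path, "w") as fout:
        for line in fin:
            fout.write(line.rstrip("\n")[SOURCE_COL:].rstrip() + "\n")

listing_to_source("tsb_main.lst", "tsb_main.asm")   # file names are made up

The verification step is then the one described in the message above: reassemble the stripped source, regenerate a listing, and diff it against the typed-in listing to surface transcription errors.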
Re: OCR old software listing.
On Wed, Dec 26, 2018 at 6:15 PM Chuck Guzis via cctalk < cctalk@classiccmp.org> wrote: > On 12/26/18 3:17 PM, Al Kossow via cctalk wrote: > > > > And TYPE the programs in again > > I've found that it's often the best course of action and consumes the > least time overall. You also have a better chance of understanding the > code. > I'm doing this right now for DECUS 8-152. Yes, it certainly helps you understand what the code is actually doing. Kyle
Re: OCR old software listing.
On 12/26/18 3:17 PM, Al Kossow via cctalk wrote: > > > On 12/26/18 2:55 PM, Steve Malikoff via cctalk wrote: > >> Scan them all as-is, put them up and 'crowd source' this list > > And TYPE the programs in again I've found that it's often the best course of action and consumes the least time overall. You also have a better chance of understanding the code. --Chuck
Re: OCR old software listing.
On 12/26/18 2:55 PM, Steve Malikoff via cctalk wrote: > Scan them all as-is, put them up and 'crowd source' this list And TYPE the programs in again
Re: OCR old software listing.
Mattis said > Finally I got hold of the sources for the PDP-11 SPACE WAR that was > submitted to DECUS by Bill Seiler. > > The format is scans of the PAL-11S listing output. It is easy to crop the > image to only contain actual source. Then running OCR on it. Tried a few > online versions and tesseract. > > The problem is that the paper that the listing is printed on has lines. > Very black lines. It makes the OCR go completely crazy. Source lines > without black lines OCR ok. The others do not. The files need massive > amount of manual intervention. > > Does anyone have an idea how to process files like this? > > A good way to remove the black lines? > > There are only 19 source files with three or four pages each so I don't > think it makes sense to try to train tesseract to do it (training tesseract > seems to be a huge undertaking). > > https://i.imgur.com/dvY973s.png > > /Mattis Scan them all as-is, put them up and 'crowd source' this list to divvy up a bunch of pages per person who could then use their favourite graphics program to paint out the lines and send you back the cleaned-up images? I could do a few. Steve.
Re: OCR old software listing.
On 2018-12-26 4:29 PM, Mattis Lind via cctalk wrote:
> Finally I got hold of the sources for the PDP-11 SPACE WAR that was
> submitted to DECUS by Bill Seiler.
>
> The format is scans of the PAL-11S listing output. It is easy to crop the
> image to only contain actual source. Then running OCR on it. Tried a few
> online versions and tesseract.
>
> The problem is that the paper that the listing is printed on has lines.
> Very black lines. It makes the OCR go completely crazy. Source lines
> without black lines OCR ok. The others do not. The files need massive
> amount of manual intervention.
>
> Does anyone have an idea how to process files like this?
>
> A good way to remove the black lines?

Hi Mattis

I have some ideas. Can you give me access to the original scans?

--Toby

> There are only 19 source files with three or four pages each so I don't
> think it makes sense to try to train tesseract to do it (training tesseract
> seems to be a huge undertaking).
>
> https://i.imgur.com/dvY973s.png
>
> /Mattis
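Toby doesn't say here what his ideas were, but one common mechanical approach to long horizontal rules is a morphological pass: keep only strokes far wider than any character, then erase them. A sketch with OpenCV follows (the file names and the 151-pixel kernel width are assumptions to tune against the real scans); note that, as pointed out elsewhere in the thread, character dots sitting directly on a rule are erased along with it, so the result still needs proofreading:

import cv2

img = cv2.imread("page.tif", cv2.IMREAD_GRAYSCALE)
# Make ink white on black so the morphology operates on the strokes.
_, ink = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# An opening with a 151x1 kernel keeps only runs at least that wide, i.e. the rules.
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (151, 1))
rules = cv2.morphologyEx(ink, cv2.MORPH_OPEN, kernel)

# Remove the rule pixels and write the page back out as black-on-white.
cleaned = cv2.bitwise_and(ink, cv2.bitwise_not(rules))
cv2.imwrite("page_norules.tif", cv2.bitwise_not(cleaned))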
Re: OCR old software listing.
> On December 26, 2018 at 4:29 PM Mattis Lind via cctech wrote:
>
> Finally I got hold of the sources for the PDP-11 SPACE WAR that was
> submitted to DECUS by Bill Seiler.
>
> The format is scans of the PAL-11S listing output. It is easy to crop the
> image to only contain actual source. Then running OCR on it. Tried a few
> online versions and tesseract.
>
> The problem is that the paper that the listing is printed on has lines.
> Very black lines. It makes the OCR go completely crazy. Source lines
> without black lines OCR ok. The others do not. The files need massive
> amount of manual intervention.
>
> Does anyone have an idea how to process files like this?
>
> A good way to remove the black lines?
>
> There are only 19 source files with three or four pages each so I don't
> think it makes sense to try to train tesseract to do it (training tesseract
> seems to be a huge undertaking).
>
> https://i.imgur.com/dvY973s.png
>
> /Mattis

One thing you might try is to pull the scan images into matlab/gnu octave and do a 2d FFT, remove the frequency band of the lines, inverse fft, and save. I've had good luck removing regular patterns of noise from images that way.

Will

"He may look dumb but that's just a disguise." -- Charlie Daniels
"The names of global variables should start with// " -- https://isocpp.org
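Will's FFT idea, sketched in NumPy rather than MATLAB/Octave: rules with a regular vertical spacing show up as strong peaks along the vertical-frequency axis of the 2D spectrum, so zero a narrow band there (sparing the region around DC) and transform back. The band position and width below are placeholders -- they have to be chosen by actually looking at the spectrum of one of the scans:

import numpy as np
from PIL import Image

img = np.asarray(Image.open("page.tif").convert("L"), dtype=float)  # name made up

F = np.fft.fftshift(np.fft.fft2(img))
cy, cx = F.shape[0] // 2, F.shape[1] // 2

# Zero a narrow strip on the vertical-frequency axis, keeping +/-10 rows around DC.
mask = np.ones_like(F)
mask[:cy - 10, cx - 2:cx + 3] = 0
mask[cy + 11:, cx - 2:cx + 3] = 0

cleaned = np.real(np.fft.ifft2(np.fft.ifftshift(F * mask)))
Image.fromarray(np.clip(cleaned, 0, 255).astype(np.uint8)).save("page_fft.tif")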