Re: [firebird-support] Text search ...
On 14/03/2019 13:20, Steve Wiser st...@specializedbusinesssoftware.com [firebird-support] wrote:
> Sorry, to be more clear we used lucene outside of firebird. We also
> happened to store specific information about the files in firebird. I
> stayed away from the lucene within firebird approach because the
> libraries seemed out of date and I wasn't sure on support.

Having to live with MySQL on the WordPress sites, I don't want to add to that mess with more large databases that I can't back up easily. So the key here is to move things to a Firebird base where the nightly backup just happens without any disruption to the websites ;) That does not leave many options ...

--
Lester Caine - G8HFL
-
Contact - https://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - https://lsces.co.uk
EnquirySolve - https://enquirysolve.com/
Model Engineers Digital Workshop - https://medw.co.uk
Rainbow Digital Media - https://rainbowdigitalmedia.co.uk
Re: [firebird-support] Text search ...
Sorry, to be more clear we used lucene outside of firebird. We also happened to store specific information about the files in firebird. I stayed away from the lucene within firebird approach because the libraries seemed out of date and I wasn't sure on support.

hope that helps
-steve

On Thu, Mar 14, 2019 at 7:26 AM Lester Caine les...@lsces.co.uk [firebird-support] wrote:
> I've looked at lucene in the past and did have a trial configuration set
> up at one time, but support seems to be rather out of date? And .net
> seems to get mixed in ...
Re: [firebird-support] Text search ...
On 13/03/2019 14:01, Steve Wiser st...@specializedbusinesssoftware.com [firebird-support] wrote:
> We have used tesseract-ocr

It can be quite amazing what one misses even when searching for something specific. This has thrown up a couple of add-ons that look interesting as well. Just have to fight SUSE to get them installed ... just a pity that once again it's Google who are funding things ... now to see if I can get the ocrfeeder package for SUSE!

> lucene (plus firebird for the index data)
> for this in the past. It works pretty well, but not completely out of
> the box -- you will have to tweak the image a bit.

I've looked at Lucene in the past and did have a trial configuration set up at one time, but support seems to be rather out of date? And .NET seems to get mixed in ... I'm a little loath to add Java into the mix anyway, but all the popular full-text options seem to be Java based :(

https://firebirdsql.org/en/sphinx-full-text-search/ seems to be a dead end for Firebird if what I'm reading on V3 is correct re MySQL native access.

The links at http://www.firebirdfaq.org/faq328/ are somewhat tired and out of date. I have used IB_FTS in the past, but not being Windows based these days I'm only using IBObjects on a dwindling number of legacy systems, and even those will need to be moved to web based sooner rather than later. I need something that plays nicely with PHP7 ...

--
Lester Caine
Re: [firebird-support] Text search ...
We have used tesseract-ocr and lucene (plus firebird for the index data) for this in the past. It works pretty well, but not completely out of the box -- you will have to tweak the image a bit.

-steve

On Wed, Mar 13, 2019 at 4:33 AM Lester Caine les...@lsces.co.uk [firebird-support] wrote:
> Google and Bing crawl the pdf's but using
> them to provide search results is not acceptable for my local council
> clients so we need a 'local' search facility.
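
For readers following the thread, the OCR step Steve mentions might look roughly like the sketch below. It is not his actual setup: it only assumes the `tesseract` command-line tool is installed and on the PATH, and that each PDF page has already been rendered to an image file. The function names and the file name `page-001.png` are hypothetical. Image clean-up (the "tweak the image a bit" part) and the Lucene indexing are left out.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, language="eng"):
    """Return the argument list for one tesseract run.

    Passing "stdout" as the output base makes tesseract print the
    recognised text instead of writing a .txt file.
    """
    return ["tesseract", image_path, "stdout", "-l", language]


def ocr_image(image_path, language="eng"):
    """Run tesseract on one page image and return the recognised text.

    Returns None when the tesseract binary is not installed, so the
    caller can decide how to handle a missing OCR engine.
    """
    if shutil.which("tesseract") is None:
        return None
    completed = subprocess.run(
        build_tesseract_command(image_path, language),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract fails
    )
    return completed.stdout


if __name__ == "__main__":
    # Hypothetical file name, used only to show the generated command.
    print(build_tesseract_command("page-001.png"))
```

The text returned by `ocr_image` would then be handed to whatever indexer is in use, with the file details stored in Firebird as described above.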
Re: [firebird-support] Text search ...
On 13/03/2019 02:35, Andrew Lowe a...@wht.com.au [firebird-support] wrote:
> Maybe you might want to have a look at Zotero. It does a lot of stuff
> with pdf's, databases etc.

That is working on a different part of the problem although it might have provided pointers except for one thing ... "But if you value a built-in PDF reader, Zotero may not be the best tool for you." Basically it 'suffers' from the same problem I'm trying to address. We have the pdf's and in the case of some sites we can even access the original source used to create them. Google and Bing crawl the pdf's but using them to provide search results is not acceptable for my local council clients so we need a 'local' search facility.

BUGGER that was working fine yesterday !!!
https://northwaypc.org.uk/fisheye/gallery/75 is an example, but just found the dreaded 'white screen' on looking at a volume ... and the minutes and agendas galleries are not working either ... This site was fine yesterday and 'I've' not changed anything in my sleep :( Now to debug that ...

--
Lester Caine
Re: [firebird-support] Text search ...
On 13/03/19 06:44, Lester Caine les...@lsces.co.uk [firebird-support] wrote:
> I've got a few sites where I've got a growing number of pdf files
> which it would be nice to actually index the content. First problem is
> obviously the different qualities of pdf, and I've had finereader
> deployed in some cases to provide OCRed copies of the original, with the
> usual variable success. The question is just what is the best base to be
> working towards. I'm currently working on the basis that we store the
> original file, and I create thumbnails of the front page so I'm now
> looking to stripping the raw text. Anybody been there already? Any
> suggestions for Linux based solutions ...
>
> The current indexing process is pulling a list of words from the
> document and building a manual index. It was first working pre-Firebird
> and has not changed so is there a better way with FB3?

Maybe you might want to have a look at Zotero. It does a lot of stuff with pdf's, databases etc.

Andrew
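
The "list of words plus manual index" approach described in the original question could be sketched roughly as below. This is not Lester's actual code, which the thread does not show. It is a minimal in-memory inverted index in Python, assuming the raw text has already been extracted from each PDF. The stop-word list, minimum word length, and sample documents are all made up for illustration.

```python
import re
from collections import defaultdict

# Assumed tokenising rules: lower-case runs of letters and digits.
WORD_RE = re.compile(r"[a-z0-9]+")
# Illustrative stop words only; a real list would be much longer.
STOP_WORDS = frozenset({"the", "and", "of", "to", "a", "in"})


def extract_words(text, min_length=3):
    """Return the set of distinct index-worthy words in a text."""
    words = set()
    for word in WORD_RE.findall(text.lower()):
        if len(word) >= min_length and word not in STOP_WORDS:
            words.add(word)
    return words


def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in extract_words(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every word in the query."""
    terms = extract_words(query)
    if not terms:
        return set()
    result = None
    for term in terms:
        ids = index.get(term, set())
        result = set(ids) if result is None else result & ids
    return result


if __name__ == "__main__":
    # Made-up sample texts standing in for extracted PDF content.
    docs = {
        1: "Minutes of the parish council meeting",
        2: "Agenda for the council meeting",
        3: "Fisheye newsletter volume 75",
    }
    index = build_index(docs)
    print(sorted(search(index, "council meeting")))
```

In a database-backed version each (word, document id) pair would become a row in an index table, so the same lookup turns into a join. Ranking, stemming and phrase search are what a dedicated full-text engine adds on top of this.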