Thanks for all the answers. I was trying to learn from you how much of the 'spotting algorithm' (against taxonomies/ontologies/datastores) needs to be customized. Each domain has its own heuristics to add.
-harish

On Thu, Sep 6, 2012 at 8:46 AM, Mihály Héder <[email protected]> wrote:

> Hi!
>
> On 5 September 2012 19:03, harish suvarna <[email protected]> wrote:
>
> > Hi,
> > Nice work and thanks for sharing.
> >
> > You had quite a good store of book titles, around 5.6 million. Why is
> > it that the recall is around 50%?
>
> Well, this 5.6M is a rather small set. No one knows the total number of
> books ever written, but Google estimates (conservatively) that it is at
> least 130 million [1].
>
> And as you can imagine, there is a long tail effect if we talk about how
> well known certain books are. This is why you won't easily cover, say,
> 90% of the books with even a 50M data set.
> The 5.6 million set is the smallest one I experimented with - I like this
> size because it is easy to handle. To tell you the truth, I was quite
> happy with the 50% :)
>
> Anyway, in the long run, it would be much more important to include book
> sets for different languages. Of course, both BNB and OL have some
> foreign titles, but they are mostly for English.
>
> > Are the dropped titles (60-28-13=19) missing in the book bank?
>
> Most of them are missing; some of them are dropped because the author is
> not mentioned (explicitly).
>
> > Are you trying any more heuristics to reduce the false positives?
>
> The number of false positives is not a really good marker: the associated
> confidence measure of those annotations is even more important. There is
> no real problem with a false positive that has 0.001 confidence. We
> should have displayed that info (next time).
>
> Anyway, there are two things on my agenda:
> 1) restricting by author names. This is a typical false positive from
> text 22: http://openlibrary.org/works/OL15987840W/New_Haven
> It is marked as found (confidence 0.2) because both some parts of the
> title and the author can be found (New Haven Area Heritage Association:
> New Haven). That is a dumb thing to do because: a) the author includes
> the title b) the author and the title occurrence overlap. This can be
> fixed easily.
>
> 2) better understanding of the role of order and the token distance
> between author and title. I will probably experiment with different
> numbers and see how the test results change.
>
> These will happen in the next couple of weeks. Will let you know about
> the results.
>
> Cheers
> Mihály
>
> > Thanks,
> > Harish
>
> [1] http://booksearch.blogspot.hu/2010/08/books-of-world-stand-up-and-be-counted.html
>
> > On Wed, Sep 5, 2012 at 2:22 AM, Fabian Christ <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > nice engine ;) Thanks for sharing!
> > >
> > > Best,
> > > - Fabian
> > >
> > > 2012/9/3 Anuj Kumar <[email protected]>:
> > > > That's great! Thanks for the info.
> > > >
> > > > Regards,
> > > > Anuj
> > > >
> > > > On Mon, Sep 3, 2012 at 8:49 PM, Mihály Héder <[email protected]> wrote:
> > > >
> > > >> Hi!
> > > >>
> > > >> Sure, the 5.6M titles in a HashMap take about 1.3-1.5 GB of RAM,
> > > >> so I run the whole Stanbol with -Xmx2500M without issues.
> > > >>
> > > >> In earlier iterations I have used ehcache + sophisticated custom
> > > >> hit and miss handlers to save memory, but I had to realize that it
> > > >> creates more performance issues than it solves in everyday setups,
> > > >> so I gave up on that.
> > > >>
> > > >> Cheers
> > > >> Mihály
> > > >>
> > > >> On 3 September 2012 15:58, Anuj Kumar <[email protected]> wrote:
> > > >> > Hi Mihály,
> > > >> >
> > > >> > Thanks a lot for sharing this. Looks good.
> > > >> >
> > > >> > I was curious to know the memory requirements to load the 5.6
> > > >> > million titles and the whole system to run. If you have any
> > > >> > stats, can you please share that?
> > > >> >
> > > >> > Regards,
> > > >> > Anuj
> > > >> >
> > > >> > On Mon, Sep 3, 2012 at 7:14 PM, Mihály Héder <[email protected]> wrote:
> > > >> >
> > > >> >> Hi!
> > > >> >>
> > > >> >> let me introduce BookSpotter Enhancement Engine by Sztaki:
> > > >> >>
> > > >> >> http://blog.iks-project.eu/introducing-bookspotter-enhancement-engine-by-sztaki/
> > > >> >>
> > > >> >> Bookspotter uses a selection of 5.6M titles from the British
> > > >> >> National Bibliography and the Open Library.
> > > >> >> It scans the incoming text, looking for titles, and in case the
> > > >> >> author is also mentioned, it produces the corresponding entity
> > > >> >> annotations that refer to the proper resource uris of either
> > > >> >> BNB or OL.
> > > >> >>
> > > >> >> You can check the system out here:
> > > >> >> http://pedia2.sztaki.hu:9090/enhancer/chain/bookspotter
> > > >> >>
> > > >> >> Thanks to the Early Adopter Program, I was able to buy some
> > > >> >> student work hours for data cleaning and for some basic testing.
> > > >> >> You might want to read the report on our test set of 25 tests:
> > > >> >> http://pedia2.sztaki.hu/stanbol/bookspotter/Bookspotter_tests.pdf
> > > >> >>
> > > >> >> For details, see the blog post!
> > > >> >>
> > > >> >> Any comments are much appreciated!
> > > >> >> Cheers,
> > > >> >> Mihály
> > >
> > > --
> > > Fabian
> > > http://twitter.com/fctwitt
> >
> > --
> > Thanks
> > Harish

--
Thanks
Harish
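
For reference, the overlap fix Mihály describes under item 1) might look roughly like the sketch below. This is not BookSpotter code: the class and method names (OverlapFilter, Span, isSuspect) are hypothetical, and the token offsets in main are invented for illustration. It only shows the two checks named in the thread: the author string contains the title, or the author and title occurrences overlap in the text.

```java
import java.util.Locale;

public class OverlapFilter {

    /** Half-open token range [start, end) of a match in the scanned text. */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            if (start > end) {
                throw new IllegalArgumentException("start > end");
            }
            this.start = start;
            this.end = end;
        }

        /** True if the two ranges share at least one token position. */
        boolean overlaps(Span other) {
            return start < other.end && other.start < end;
        }
    }

    /** Check a): the author string includes the title (case-insensitive). */
    static boolean authorContainsTitle(String author, String title) {
        String a = author.toLowerCase(Locale.ROOT);
        String t = title.toLowerCase(Locale.ROOT).trim();
        return !t.isEmpty() && a.contains(t);
    }

    /** A candidate is suspect if check a) holds or b) the occurrences overlap. */
    static boolean isSuspect(String author, String title, Span authorSpan, Span titleSpan) {
        return authorContainsTitle(author, title) || authorSpan.overlaps(titleSpan);
    }

    public static void main(String[] args) {
        // The "New Haven" case from the thread; the offsets are made up.
        System.out.println(isSuspect("New Haven Area Heritage Association", "New Haven",
                new Span(10, 15), new Span(10, 12)));
        // A non-overlapping author/title pair, also with made-up offsets.
        System.out.println(isSuspect("Jane Austen", "Emma",
                new Span(0, 2), new Span(5, 6)));
    }
}
```

Whether a suspect candidate should be dropped outright or just have its confidence lowered is a separate choice; the thread suggests confidence matters more than the raw false-positive count.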
