Thanks for all the answers. I was trying to learn from you how much of the
'spotting algorithm' (into taxonomies/ontologies/datastore) needs to be
customized. Each domain has its own heuristics to add.

-harish

On Thu, Sep 6, 2012 at 8:46 AM, Mihály Héder <[email protected]> wrote:

> Hi!
>
> On 5 September 2012 19:03, harish suvarna <[email protected]> wrote:
>
> > Hi,
> > Nice work and thanks for sharing.
> >
> > You had quite a good store of book titles, around 5.6 million. Why is it
> > that the recall is around 50%?
> >
>
> Well this 5.6M is a rather small set. No one knows the total number of books
> ever written, but Google estimates (conservatively) that it is at least 130
> million [1].
>
> And as you can imagine there is a long tail effect if we talk about how
> well known certain books are. This is why you won't easily cover, say 90%
> of the books with even a 50M data set.
> The 5.6 million set is the smallest one I experimented with - I like this
> size because it is easy to handle. To tell you the truth I was quite happy
> with the 50% :)
>
> Anyway, in the long run, it would be much more important to include book
> sets for different languages. Of course, both BNB and OL have some foreign
> titles but they mostly cover English.
>
>
> > Are the dropped titles (60-28-13=19) missing in the book bank?
> >
> Most of them are missing, some of them are dropped because the author is
> not mentioned (explicitly).
>
>
> > Are you
> > trying any more heuristics to reduce the false positives?
> >
>
> The number of false positives is not a really good marker: the associated
> confidence measure of those annotations is even more important. There is no
> real problem with a false positive that has 0.001 confidence. We should
> have displayed that info (next time).
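The point about confidence can be illustrated with a minimal sketch. It is not BookSpotter code: the `Annotation` class, the `false_positives` helper, the sample titles and the 0.1 cutoff are all hypothetical; only the 0.2 and 0.001 confidence values echo the thread.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    title: str         # spotted book title (made-up sample data)
    confidence: float  # confidence reported with the annotation
    correct: bool      # ground truth from a hand-labelled test set


def false_positives(annotations, min_confidence=0.0):
    """Return wrong annotations whose confidence is at or above the cutoff."""
    return [a for a in annotations
            if not a.correct and a.confidence >= min_confidence]


annotations = [
    Annotation("Title A", 0.9, True),
    Annotation("Title B", 0.2, False),
    Annotation("Title C", 0.001, False),
]

# Raw count treats both wrong annotations alike.
raw = false_positives(annotations)
# With a (hypothetical) cutoff the 0.001 annotation no longer counts.
significant = false_positives(annotations, min_confidence=0.1)
```

A consumer that ignores annotations below some cutoff would only ever see the second list, which is why the raw count alone can be misleading.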
>
> Anyway, there are two things on my agenda:
> 1) restricting by author names. This is a typical false positive from text
> 22: http://openlibrary.org/works/OL15987840W/New_Haven
> It is marked as found (confidence 0.2) because both some parts of the title
> and the author can be found (New Haven Area Heritage Association: New
> Haven). That is a dumb thing to do because: a) the author includes the
> title b) the author and the title occurrence overlap. This can be fixed
> easily.
>
> 2) better understanding of the role of order and the token distance between
> author and title. I will probably experiment with different numbers and see
> how the test results change.
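The two agenda items above might be sketched roughly as follows. This is only an illustration under stated assumptions, not the engine's implementation: it uses exact, whitespace-tokenized matching (the engine apparently also matches parts of a title), it looks only at the first occurrence of each string, and the names `accept`, `find_span` and the `max_distance` default of 10 are hypothetical.

```python
def tokens(text):
    """Lowercase and split on whitespace (no punctuation handling)."""
    return text.lower().split()


def find_span(haystack, needle):
    """Return (start, end) token indices of the first match, or None."""
    n = len(needle)
    if n == 0:
        return None
    for i in range(len(haystack) - n + 1):
        if haystack[i:i + n] == needle:
            return (i, i + n)
    return None


def spans_overlap(a, b):
    """True if two half-open token spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def accept(text, title, author, max_distance=10):
    """Decide whether a title/author pair found in text should be kept."""
    text_t, title_t, author_t = tokens(text), tokens(title), tokens(author)
    # a) reject if the author name itself contains the title
    if find_span(author_t, title_t) is not None:
        return False
    title_span = find_span(text_t, title_t)
    author_span = find_span(text_t, author_t)
    if title_span is None or author_span is None:
        return False
    # b) reject if the title and author occurrences overlap
    if spans_overlap(title_span, author_span):
        return False
    # 2) token distance between the two occurrences, in either order
    gap = max(title_span[0] - author_span[1], author_span[0] - title_span[1])
    return gap <= max_distance
```

With these assumptions, the New Haven example from text 22 would be rejected by check a), and `max_distance` is the number one could vary to see how the test results change.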
>
> These will happen in the next couple of weeks. Will let you know about the
> results.
>
> Cheers
> Mihály
>
> > Thanks,
> > Harish
> >
>
> [1] http://booksearch.blogspot.hu/2010/08/books-of-world-stand-up-and-be-counted.html
>
>
> > On Wed, Sep 5, 2012 at 2:22 AM, Fabian Christ
> > <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > nice engine ;) Thanks for sharing!
> > >
> > > Best,
> > >  - Fabian
> > >
> > > 2012/9/3 Anuj Kumar <[email protected]>:
> > > > That's great! Thanks for the info.
> > > >
> > > > Regards,
> > > > Anuj
> > > >
> > > > On Mon, Sep 3, 2012 at 8:49 PM, Mihály Héder <[email protected]>
> > > > wrote:
> > > >
> > > >> Hi!
> > > >>
> > > > >> Sure, the 5.6M titles in a HashMap take about 1.3-1.5 GB of RAM, so I
> > > > >> run the whole Stanbol with -Xmx2500M without issues.
> > > > >>
> > > > >> In earlier iterations I used ehcache + sophisticated custom hit
> > > > >> and miss handlers to save memory, but I had to realize that it
> > > > >> creates more performance issues than it solves in everyday setups, so
> > > > >> I gave up on that.
> > > >>
> > > >> Cheers
> > > >> Mihály
> > > >>
> > > >> On 3 September 2012 15:58, Anuj Kumar <[email protected]> wrote:
> > > >> > Hi Mihály,
> > > >> >
> > > >> > Thanks a lot for sharing this. Looks good.
> > > >> >
> > > > >> > I was curious to know the memory requirements to load the 5.6 million
> > > > >> > titles and for the whole system to run. If you have any stats, can you
> > > > >> > please share them?
> > > >> >
> > > >> > Regards,
> > > >> > Anuj
> > > >> >
> > > > >> > On Mon, Sep 3, 2012 at 7:14 PM, Mihály Héder <[email protected]>
> > > > >> > wrote:
> > > >> >
> > > >> >> Hi!
> > > >> >>
> > > > >> >> let me introduce BookSpotter Enhancement Engine by Sztaki:
> > > >> >>
> > > >> >>
> > > >>
> > >
> >
> http://blog.iks-project.eu/introducing-bookspotter-enhancement-engine-by-sztaki/
> > > >> >>
> > > > >> >> Bookspotter uses a selection of 5.6M titles from the British National
> > > > >> >> Bibliography and the Open Library.
> > > > >> >> It scans the incoming text, looking for titles, and in case the author
> > > > >> >> is also mentioned, it produces the corresponding entity annotations
> > > > >> >> that refer to the proper resource URIs of either BNB or OL.
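The scan described above can be pictured with a small dictionary-based sketch. It is an assumption-laden illustration, not the engine's code: the functions `build_index` and `spot`, the `max_title_len` limit, the sample books and the `example.org` URIs are all hypothetical, and matching is exact on lowercased, whitespace-split tokens.

```python
def build_index(books):
    """Map each title (as a tuple of lowercased tokens) to its (author, uri) entries."""
    index = {}
    for title, author, uri in books:
        key = tuple(title.lower().split())
        index.setdefault(key, []).append((author, uri))
    return index


def spot(text, index, max_title_len=8):
    """Return (title, author, uri) for titles in text whose author is also mentioned."""
    words = text.lower().split()
    joined = " ".join(words)
    hits = []
    for start in range(len(words)):
        for length in range(1, max_title_len + 1):
            if start + length > len(words):
                break
            key = tuple(words[start:start + length])
            for author, uri in index.get(key, []):
                # Only annotate when the author string also occurs in the text
                # (plain substring test, a deliberate simplification).
                if " ".join(author.lower().split()) in joined:
                    hits.append((" ".join(key), author, uri))
    return hits


# Made-up sample data; the URIs are placeholders, not BNB or OL resources.
books = [
    ("Dune", "Frank Herbert", "http://example.org/book/dune"),
    ("Emma", "Jane Austen", "http://example.org/book/emma"),
]
index = build_index(books)
hits = spot("I just finished Dune by Frank Herbert", index)
```

Keeping the whole title index in one in-memory map makes each lookup cheap, at the cost of holding every title in RAM.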
> > > >> >>
> > > >> >> You can check the system out here:
> > > >> >> http://pedia2.sztaki.hu:9090/enhancer/chain/bookspotter
> > > >> >>
> > > > >> >> Thanks to the Early Adopter Program, I was able to buy some student
> > > > >> >> work hours for data cleaning and for some basic testing.
> > > > >> >> You might want to read the report on our test set of 25 tests:
> > > > >> >>
> > > > >> >> http://pedia2.sztaki.hu/stanbol/bookspotter/Bookspotter_tests.pdf
> > > >> >>
> > > >> >> For details, see the blog post!
> > > >> >>
> > > >> >> Any comments are much appreciated!
> > > >> >> Cheers,
> > > >> >> Mihály
> > > >> >>
> > > >>
> > >
> > >
> > >
> > > --
> > > Fabian
> > > http://twitter.com/fctwitt
> > >
> >
> >
> >
> > --
> > Thanks
> > Harish
> >
>



-- 
Thanks
Harish
