Thanks for the help, Katie. I'll look into how Solr has been integrated with the GeoData extension. As for wikidata-vagrant, no problem; I'll install it by following this page: <http://www.mediawiki.org/wiki/Extension:Wikibase>.
You're right, raw DB access can be painful and I'd need to rewrite a lot of code. I'm considering two options:

*i)* Use the database-related code in the Wikidata extension (I'm studying the DataModel classes and how they interact with the database) to fetch what I need and feed it into the recommendation engine.

*ii)* Don't access the DB at all. Instead, write map-reduce scripts that extract the training data and everything else I need for each Item from the wikidatawiki data dump, and feed that into the recommendation engine. A cron job can download the latest dump whenever one is available and run the scripts on it. Even if the engine lags behind by the interval at which the dumps are generated, I don't think that would be a problem, since recommendation is all about approximations anyway.

My request to the devs and the community: please discuss the pros and cons of each approach and suggest which one you think would be best, mainly in terms of performance. I personally feel that option (ii) would be cleaner.
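To make option (ii) a bit more concrete, here is a rough sketch of the kind of extraction script I have in mind. It's plain Python just for illustration (the real thing could be Hadoop map-reduce jobs or part of the Java engine), and the entity-JSON layout it assumes - a 'claims' map keyed by property id - is only a guess that I'd verify against the actual dump format:

#!/usr/bin/env python
# Rough sketch only (option ii): stream a wikidatawiki pages-articles dump
# and print "item<TAB>property" pairs as raw training data for the
# recommendation engine. The entity-JSON layout assumed below (a "claims"
# map keyed by property id) is a guess and must be checked against the
# real dump format.
import bz2
import json
import sys
import xml.etree.ElementTree as ET

# XML namespace of the dump; the version number may differ per dump.
NS = '{http://www.mediawiki.org/xml/export-0.8/}'

def emit_pairs(dump_path):
    with bz2.BZ2File(dump_path) as f:
        for _, elem in ET.iterparse(f):
            if elem.tag != NS + 'page':
                continue
            title = elem.findtext(NS + 'title') or ''
            text = elem.findtext(NS + 'revision/' + NS + 'text') or ''
            if title.startswith('Q'):  # main-namespace Items (Q42, ...)
                try:
                    entity = json.loads(text)
                except ValueError:
                    elem.clear()
                    continue
                for prop in entity.get('claims', {}):
                    sys.stdout.write('%s\t%s\n' % (title, prop))
            elem.clear()  # keep memory bounded while streaming

if __name__ == '__main__':
    emit_pairs(sys.argv[1])

If this direction sounds sane, the cron side would just be an entry along the lines of '0 3 * * 1 /usr/local/bin/refresh-dump-and-retrain.sh' (script name and schedule are placeholders) that fetches the newest dump, re-runs the extraction, and retrains the engine.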
Cheers,
Nilesh

On Fri, May 3, 2013 at 3:53 PM, aude <aude.w...@gmail.com> wrote:

> On Fri, May 3, 2013 at 5:39 AM, Nilesh Chakraborty <nil...@nileshc.com> wrote:
>
> > Hi Lydia,
> >
> > I am currently drafting my proposal; I shall submit it within a few hours, once the initial version is complete.
> >
> > I installed mediawiki-vagrant on my PC and it went quite smoothly. I could do all the usual things through the browser; I logged into the mysql server to examine the database schema.
> >
> > I also began to clone the wikidata-vagrant repo <https://github.com/SilkeMeyer/wikidata-vagrant>. But it seems that the 'git submodule update --init' part would take a long time - if I'm not mistaken, it's a huge download (excluding the 'vagrant up' command, which alone takes around 1.25 hours to download everything). I wanted to clarify something before downloading it all.
> >
> > Since the entity suggester will be working with Wikidata, it'll obviously need to access the whole live dataset from the database (not the xml dump) to make the recommendations. I tried searching for database access APIs or high-level REST APIs for Wikidata, but couldn't figure out how to do that. Could you point me to the proper documentation?
>
> One of the best examples of a MediaWiki extension interacting with a Java service is how Solr is used. Solr is still pretty new at Wikimedia, though. It is used with the GeoData extension, and Solr is then used by the GeoData API modules.
>
> I think Solr gets updated via a cronjob (solrupdate.php) which creates jobs in the job queue. Not 100% sure of the exact details.
>
> I do not think direct access to the live database is very practical. I think the data (json blobs) would need indexing in some particular way anyway to support what the entity selector needs to do.
>
> http://www.mediawiki.org/wiki/Extension:GeoData
>
> The Translate extension also uses Solr in some way, though I am not very familiar with the details.
>
> On the operations side, puppet is used to configure everything. The puppet git repo is available to see how things are done.
>
> https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=tree;f=modules/solr;hb=HEAD
>
> > And also, what is the best way to add a few .jar files to wikidata and execute them with custom commands (nohup java blah.jar --blah blah --> running as daemons)? I can of course set it up on my development box inside virtualbox - I want to know how to "integrate" it into the system so that any other user can download vagrant and wikidata and have the jars all ready and running. What is the proper development workflow for this?
>
> wikidata-vagrant is maintained in github, though I think it might not work perfectly right now. We need to update it (it's on our to-do list), and perhaps it could be moved to gerrit. I do not know about integrating the jars, but it should be possible.
>
> Cheers,
> Katie Filbert
>
> [answering from this email, as I am not subscribed to wikitech-l on my wikimedia.de email]
>
> > Thanks,
> > Nilesh
> >
> > On Sun, Apr 28, 2013 at 3:01 AM, Nilesh Chakraborty <nil...@nileshc.com> wrote:
> >
> > > Awesome. Got it.
> > >
> > > I see what you mean, great, thank you. :)
> > >
> > > Cheers,
> > > Nilesh
> > >
> > > On Apr 28, 2013 2:56 AM, "Lydia Pintscher" <lydia.pintsc...@wikimedia.de> wrote:
> > >
> > >> On Sat, Apr 27, 2013 at 11:14 PM, Nilesh Chakraborty <nil...@nileshc.com> wrote:
> > >> > Hi Lydia,
> > >> >
> > >> > That helps a lot, and makes it way more interesting. Rather than a one-size-fits-all solution, it seems to me that each property or each type of property (e.g. different relationships) will need individual attention and different methods/metrics for recommendation.
> > >> >
> > >> > The examples you gave - continents, sex, relations like father/son, uncle/aunt/spouse, or place-oriented properties like place of birth, country of citizenship, ethnic group etc. - each have a certain pattern to them (if a person was born in the US, the US should be one of the countries he was a citizen of; US census/ethnicity statistics may be used to predict ethnic group, etc.). I'm already starting to chalk out a few patterns and how they can be used for recommendation. In my proposal, should I go into details regarding these? Or should I just give a few examples and explain how the algorithms would work, to explain the idea?
> > >>
> > >> Give some examples and how you'd handle them. You definitely don't need to have it for all properties. What's important is giving an idea about how you'd tackle the problem. Give the reader the impression that you know what you are talking about and can handle the larger problem.
> > >>
> > >> Also: Don't make the system too intelligent, like having it know about US census data, for example. Keep it simple and stupid for now. Things like "property A is usually used with value X, Y or Z" should cover a lot already and are likely enough for most cases.
> > >>
> > >> Cheers
> > >> Lydia
> > >>
> > >> --
> > >> Lydia Pintscher - http://about.me/lydia.pintscher
> > >> Community Communications for Technical Projects
> > >>
> > >> Wikimedia Deutschland e.V.
> > >> Obentrautstr. 72
> > >> 10963 Berlin
> > >> www.wikimedia.de
> > >>
> > >> Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
> > >>
> > >> Registered in the register of associations of the Amtsgericht Berlin-Charlottenburg under the number 23855 Nz. Recognized as charitable by the Finanzamt für Körperschaften I Berlin, tax number 27/681/51985.
>
> --
> @wikimediadc / @wikidata

--
A quest eternal, a life so small! So don't just play the guitar, build one.
You can also email me at cont...@nileshc.com or visit my website <http://www.nileshc.com/>

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l