Hi everyone,

One more thing - should I create a new thread for discussing the
prototyping of my project (the entity suggester), the issues I run into
along the way, and any requests for help? Or should I just stick to this
old thread?
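While I'm at it: to make option (ii) from my earlier mail (quoted below)
a bit more concrete, here is a rough sketch of the kind of extraction
script I have in mind - stream the wikidatawiki pages XML dump and emit
one item/property pair per claim, as training data for the
recommendation engine. The class name and the regex are just
placeholders of mine, and the exact JSON layout inside the dump pages is
an assumption here; a real version would use a proper JSON parser
instead of a regex.

import java.io.FileInputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class DumpPropertyExtractor {

    // Crude stand-in for a real JSON parser: pull out anything that
    // looks like a property id in the page text. The actual JSON layout
    // of the dump needs to be checked - this pattern is an assumption.
    private static final Pattern PROP = Pattern.compile("\"[Pp](\\d+)\"");

    // args[0]: a *decompressed* pages XML dump (e.g. after bzcat)
    public static void main(String[] args) throws Exception {
        XMLStreamReader xml = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream(args[0]));
        String title = null;
        while (xml.hasNext()) {
            if (xml.next() != XMLStreamConstants.START_ELEMENT) continue;
            String tag = xml.getLocalName();
            if ("title".equals(tag)) {
                title = xml.getElementText();
            } else if ("text".equals(tag) && title != null
                    && title.matches("Q\\d+")) {
                // item pages are titled Q<number> in the main namespace
                Matcher m = PROP.matcher(xml.getElementText());
                while (m.find()) {
                    System.out.println(title + "\tP" + m.group(1));
                }
            }
        }
        xml.close();
    }
}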
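And in the spirit of Lydia's advice quoted below ("property A is usually
used with value X, Y or Z" - keep it simple and stupid), the first
version of the suggester could be nothing more than a co-occurrence
counter over those pairs: rank the properties an item does not yet have
by how often they accompany the properties it already has. Again just a
sketch - class and method names are illustrative, and it assumes the
tab-separated pairs produced above. No census data, no per-property
special cases:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CooccurrenceSuggester {

    // cooc.get(a).get(b) = number of items carrying both property a and b
    private final Map<String, Map<String, Integer>> cooc =
            new HashMap<String, Map<String, Integer>>();

    // Build co-occurrence counts from "item<TAB>property" lines, the
    // format assumed from the extraction sketch above.
    public void train(String tsvPath) throws IOException {
        Map<String, Set<String>> propsByItem =
                new HashMap<String, Set<String>>();
        BufferedReader in = new BufferedReader(new FileReader(tsvPath));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                String[] f = line.split("\t");
                if (f.length != 2) continue;
                Set<String> props = propsByItem.get(f[0]);
                if (props == null) {
                    propsByItem.put(f[0], props = new HashSet<String>());
                }
                props.add(f[1]);
            }
        } finally {
            in.close();
        }
        for (Set<String> props : propsByItem.values()) {
            for (String a : props) {
                Map<String, Integer> row = cooc.get(a);
                if (row == null) {
                    cooc.put(a, row = new HashMap<String, Integer>());
                }
                for (String b : props) {
                    if (a.equals(b)) continue;
                    Integer c = row.get(b);
                    row.put(b, c == null ? 1 : c + 1);
                }
            }
        }
    }

    // Rank properties the item does not have yet by how often they
    // co-occur with the properties it already has.
    public List<String> suggest(Set<String> existing, int topN) {
        final Map<String, Integer> score = new HashMap<String, Integer>();
        for (String a : existing) {
            Map<String, Integer> row = cooc.get(a);
            if (row == null) continue;
            for (Map.Entry<String, Integer> e : row.entrySet()) {
                if (existing.contains(e.getKey())) continue;
                Integer s = score.get(e.getKey());
                score.put(e.getKey(),
                        s == null ? e.getValue() : s + e.getValue());
            }
        }
        List<String> ranked = new ArrayList<String>(score.keySet());
        Collections.sort(ranked, new Comparator<String>() {
            public int compare(String x, String y) {
                return score.get(y) - score.get(x);
            }
        });
        return ranked.subList(0, Math.min(topN, ranked.size()));
    }
}

For example, after train("pairs.tsv"), calling suggest() with an item's
current set of properties would give the top-N candidate properties to
offer in the entity suggester UI.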
Cheers,
Nilesh

On Sat, May 4, 2013 at 11:05 PM, Nilesh Chakraborty <nil...@nileshc.com> wrote:

> Thanks for the help, Katie. I'll be looking into how Solr has been
> integrated with the GeoData extension. About wikidata-vagrant, no problem,
> I'll install it by following this page
> <http://www.mediawiki.org/wiki/Extension:Wikibase>.
>
> You're right, raw DB access can be painful and I'd need to rewrite a lot
> of code. I'm considering two options:
>
> *i)* Using the database-related code in the wikidata extension (I'm
> studying the DataModel classes and how they interact with the database) to
> fetch what I need and feed it into the recommendation engine.
>
> *ii)* Not accessing the DB at all. Instead, I can write map-reduce scripts
> that extract the training data and everything I need for each Item from
> the wikidatawiki data dump and feed it into the recommendation engine. I
> can use a cron job to download the latest data dump when it becomes
> available and run the scripts on it. I don't think it would be an issue
> even if the engine lags behind by the interval at which the dumps are
> generated, since the whole recommendation business is about approximation
> anyway.
>
> My request to the devs and the community: please discuss the pros and
> cons of each method and suggest which one you think would be best, mainly
> in terms of performance. I personally feel that option (ii) would be
> cleaner.
>
> Cheers,
> Nilesh
>
>
> On Fri, May 3, 2013 at 3:53 PM, aude <aude.w...@gmail.com> wrote:
>
>> On Fri, May 3, 2013 at 5:39 AM, Nilesh Chakraborty <nil...@nileshc.com> wrote:
>>
>> > Hi Lydia,
>> >
>> > I am currently drafting my proposal; I shall submit it within a few
>> > hours, once the initial version is complete.
>> >
>> > I installed mediawiki-vagrant on my PC and it went quite smoothly. I
>> > could do all the usual things through the browser, and I logged into
>> > the mysql server to examine the database schema.
>> >
>> > I also began to clone the wikidata-vagrant
>> > <https://github.com/SilkeMeyer/wikidata-vagrant> repo. But it seems
>> > that the 'git submodule update --init' part would take a long time -
>> > if I'm not mistaken, it's a huge download (excluding the 'vagrant up'
>> > command, which alone takes around 1.25 hours to download everything).
>> > I wanted to clarify something before downloading it all.
>> >
>> > Since the entity suggester will be working with wikidata, it'll
>> > obviously need to access the whole live dataset from the database (not
>> > the xml dump) to make the recommendations. I tried searching for
>> > database access APIs or high-level REST APIs for wikidata, but
>> > couldn't figure out how to do that. Could you point me to the proper
>> > documentation?
>>
>> One of the best examples of a MediaWiki extension interacting with a
>> Java service is how Solr is used. Solr is still pretty new at Wikimedia,
>> though. It is used with the GeoData extension, and the geodata api
>> modules in turn query Solr.
>>
>> I think Solr gets updated via a cronjob (solrupdate.php) which creates
>> jobs in the job queue. Not 100% sure of the exact details.
>>
>> I do not think direct access to the live database is very practical. I
>> think the data (json blobs) would in any case need to be indexed in some
>> particular way to support what the entity selector needs to do.
>>
>> http://www.mediawiki.org/wiki/Extension:GeoData
>>
>> The Translate extension also uses Solr in some way, though I am not very
>> familiar with the details.
>>
>> On the operations side, puppet is used to configure everything. The
>> puppet git repo is available to see how things are done.
>>
>> https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=tree;f=modules/solr;hb=HEAD
>>
>> > And also, what is the best way to add a few .jar files to wikidata and
>> > execute them with custom commands ('nohup java blah.jar --blah blah' -
>> > running them as daemons)? I can of course set it up on my development
>> > box inside virtualbox - but I want to know how to "integrate" it into
>> > the system, so that any other user can download vagrant and wikidata
>> > and have the jars all ready and running. What is the proper
>> > development workflow for this?
>>
>> wikidata-vagrant is maintained on github, though I think it might not
>> work perfectly right now. We need to update it - that's on our to-do
>> list - and perhaps it could be moved to gerrit. I do not know about
>> integrating the jars, but it should be possible.
>>
>> Cheers,
>> Katie Filbert
>>
>> [answering from this email, as I am not subscribed to wikitech-l on my
>> wikimedia.de email]
>>
>> > Thanks,
>> > Nilesh
>> >
>> > On Sun, Apr 28, 2013 at 3:01 AM, Nilesh Chakraborty
>> > <nil...@nileshc.com> wrote:
>> >
>> > > Awesome. Got it.
>> > >
>> > > I see what you mean - great, thank you. :)
>> > >
>> > > Cheers,
>> > > Nilesh
>> > > On Apr 28, 2013 2:56 AM, "Lydia Pintscher"
>> > > <lydia.pintsc...@wikimedia.de> wrote:
>> > >
>> > >> On Sat, Apr 27, 2013 at 11:14 PM, Nilesh Chakraborty
>> > >> <nil...@nileshc.com> wrote:
>> > >> > Hi Lydia,
>> > >> >
>> > >> > That helps a lot, and makes it way more interesting. Rather than
>> > >> > this being a one-size-fits-all solution, it seems to me that each
>> > >> > property, or each type of property (e.g. different relationships),
>> > >> > will need individual attention and different methods/metrics for
>> > >> > recommendation.
>> > >> >
>> > >> > The examples you gave - continents, sex, relations like
>> > >> > father/son, uncle/aunt/spouse, or place-oriented properties like
>> > >> > place of birth, country of citizenship, ethnic group etc. - each
>> > >> > type has a certain pattern to it (if a person was born in the US,
>> > >> > the US should be one of the countries he was a citizen of; US
>> > >> > census/ethnicity statistics could be used to predict ethnic
>> > >> > group, and so on). I'm already starting to chalk out a few
>> > >> > patterns and how they can be used for recommendation. In my
>> > >> > proposal, should I go into detail about these? Or should I just
>> > >> > give a few examples and explain how the algorithms would work, to
>> > >> > convey the idea?
>> > >>
>> > >> Give some examples and how you'd handle them. You definitely don't
>> > >> need to have it for all properties. What's important is giving an
>> > >> idea of how you'd tackle the problem. Give the reader the
>> > >> impression that you know what you are talking about and can handle
>> > >> the larger problem.
>> > >>
>> > >> Also: don't make the system too intelligent - for example, having
>> > >> it know about US census data. Keep it simple and stupid for now.
>> > >> Things like "property A is usually used with value X, Y or Z"
>> > >> should cover a lot already and are likely enough for most cases.
>> > >>
>> > >> Cheers
>> > >> Lydia
>> > >>
>> > >> --
>> > >> Lydia Pintscher - http://about.me/lydia.pintscher
>> > >> Community Communications for Technical Projects
>> > >>
>> > >> Wikimedia Deutschland e.V.
>> > >> Obentrautstr. 72
>> > >> 10963 Berlin
>> > >> www.wikimedia.de
>>
>> --
>> @wikimediadc / @wikidata

--
A quest eternal, a life so small! So don't just play the guitar, build one.
You can also email me at cont...@nileshc.com or visit my website
<http://www.nileshc.com/>

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l