Re: [Wikitech-l] Welcome Gergő Tisza!
Congrats! Gergő is a great software engineer and one of the most helpful members of the Hungarian community I've met. I'm sure he will be a fine addition to the engineering team.

On Tuesday, October 15, 2013, Bináris wrote:
> Just to be the first: welcome!
> All the folks should learn this letter "ő"<https://en.wikipedia.org/wiki/%C5%90>in Gergő's name. :-)

--
Oren Bochman
Mobile +972 54 4320067
skype id: orenbochman
e-mail: oren.boch...@gmail.com
Re: [Wikitech-l] Advance notice: I'm taking a sabbatical October-December
Congratulations - I hope you get to make many new bugs and even do some cool coding outside the WMF! I'm wondering if you'll update us about this unique experience as time allows.

On Wednesday, August 28, 2013, Sumana Harihareswara wrote: > I've been accepted to Hacker School <https://www.hackerschool.com>, a > writers' retreat for programmers in New York City. I will therefore be > taking an unpaid personal leave of absence from the Wikimedia Foundation > via our sabbatical program. My last workday before my leave will be > Friday, September 27. I plan to be on leave all of October, November, > and December, returning to WMF in January. > > During my absence, Quim Gil will be the temporary head of the > Engineering Community Team. Thank you, Quim! I'll spend much of > September turning over responsibilities to him. Over the next month I'll > be saying no to a lot of requests so I can ensure I take care of all my > commitments by September 27th, when I'll be turning off my wikimedia.org > email. > > If there's anything else I can do to minimize inconvenience, please let > me know. And -- I have to say this -- oh my gosh I'm so excited to be > going to Hacker School in just a month! Going from "advanced beginner" > to confident programmer! Learning face-to-face with other coders, 30-45% > of them women, all teaching each other! Thank you, WMF, for the > sabbatical program, and thanks to my team for supporting me on this. I > couldn't do this without you. > > -- > Sumana Harihareswara > Engineering Community Manager > Wikimedia Foundation

--
Oren Bochman
Mobile +972 54 4320067
skype id: orenbochman
e-mail: oren.boch...@gmail.com
Re: [Wikitech-l] Separation of Concerns
This schedule is excellent news. I am working on integrating Moodle with MediaWiki, and having SUL support would be great. We are looking at two basic use cases:

1. Allowing existing users to log into Moodle via OpenID.
2. Making edits, such as clearing the sandbox, on behalf of students.

Unfortunately OAuth is currently broken in the current version of Moodle and will require some work. However, I'm working on coordinating with our local Moodle dev community to help us out.

I am wondering whether OAuth will allow a user's privileges to be queried, or whether this can be done using the API? Also, are there unit tests for the respective MW extensions?

Thanks,
Oren Bochman

Sent from my iPhone

On Jun 4, 2013, at 5:43, Tyler Romeo wrote: > On Mon, Jun 3, 2013 at 8:18 PM, Chris Steipp wrote: > >> We are trying to finish the items in scope (SUL rework, OAuth, and a >> review of the OpenID extension) by the end of this month. > > Speaking of this, there's an OAuth framework attempt here: > https://gerrit.wikimedia.org/r/66286 > > Am I the only person who thinks it's a bad idea for the AuthPlugin class to > be relying on the ApiBase class for its interface? Especially since the > AuthPlugin framework isn't supposed to handle authorization logic anyway. > > *-- * > *Tyler Romeo* > Stevens Institute of Technology, Class of 2016 > Major in Computer Science > www.whizkidztech.com | tylerro...@gmail.com
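Coming back to my question above about querying a user's privileges: the core web API can already report this for the authenticated user via action=query&meta=userinfo. A minimal sketch in Java (not tied to Moodle; session/OAuth handling and JSON parsing are omitted, and the User-Agent string is made up):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // Sketch only: prints the raw JSON listing the current user's rights and
    // groups. Real code would reuse the session/OAuth credentials and parse
    // the JSON instead of printing it.
    public class UserRightsCheck {
        public static void main(String[] args) throws Exception {
            URL api = new URL("https://en.wikipedia.org/w/api.php"
                    + "?action=query&meta=userinfo&uiprop=rights%7Cgroups&format=json");
            HttpURLConnection conn = (HttpURLConnection) api.openConnection();
            conn.setRequestProperty("User-Agent", "moodle-mw-integration-sketch/0.1");
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // {"query":{"userinfo":{"rights":[...],"groups":[...]}}}
            }
            in.close();
        }
    }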
Re: [Wikitech-l] GSoC 2013 Proposal - jQuery.IME extensions for Firefox and Chrome
Interesting proposal. I would imagine that this does not impact most page views, since JS files are cached. It might be better to fix this bug through tighter integration of the JavaScript with ResourceLoader, lazy-loading the required elements as needed. In that case the solution would be less dependent on browser plugins and would require less long-term maintenance when the JS is updated.

On Apr 29, 2013, at 12:09, Praveen Singh wrote: > Hello, > > I have drafted a proposal for my GSoC Project: jQuery.IME extensions for > Firefox and Chrome. I would love to hear what you think about it. > I would really appreciate any kind of feedback and suggestions. Please let > me know if I can improve it in any way. > > My proposal can be found here: > http://www.mediawiki.org/wiki/User:Prageck/GSoC_2013_Application > > > Thanks, > > Praveen Singh
Re: [Wikitech-l] Indexing non-text content in LuceneSearch
-Original Message-
From: wikitech-l-boun...@lists.wikimedia.org [mailto:wikitech-l-boun...@lists.wikimedia.org] On Behalf Of Brion Vibber
Sent: Thursday, March 7, 2013 9:59 PM
To: Wikimedia developers
Subject: Re: [Wikitech-l] Indexing non-text content in LuceneSearch

On Thu, Mar 7, 2013 at 11:45 AM, Daniel Kinzler wrote: > 1) create a specialized XML dump that contains the text generated by > getTextForSearchIndex() instead of actual page content. That probably makes the most sense; alternately, make a dump that includes both "raw" data and "text for search". This also allows for indexing extra stuff for files -- such as extracted text from a PDF of DjVu or metadata from a JPEG -- if the dump process etc can produce appropriate indexable data. > However, that only works > if the dump is created using the PHP dumper. How are the regular dumps > currently generated on WMF infrastructure? Also, would be be feasible > to make an extra dump just for LuceneSearch (at least for wikidata.org)? The dumps are indeed created via MediaWiki. I think Ariel or someone can comment with more detail on how it currently runs, it's been a while since I was in the thick of it. > 2) We could re-implement the ContentHandler facility in Java, and > require extensions that define their own content types to provide a > Java based handler in addition to the PHP one. That seems like a > pretty massive undertaking of dubious value. But it would allow maximum > control over what is indexed how. No don't do it :) > 3) The indexer code (without plugins) should not know about Wikibase, > but it may have hard coded knowledge about JSON. It could have a > special indexing mode for JSON, in which the structure is deserialized > and traversed, and any values are added to the index (while the keys > used in the structure would be ignored). We may still be indexing > useless interna from the JSON, but at least there would be a lot fewer false > negatives. Indexing structured data could be awesome -- again I think of file metadata as well as wikidata-style stuff. But I'm not sure how easy that'll be. Should probably be in addition to the text indexing, rather than replacing. -- brion

I agree with Brion. Here are my five shekels' worth.

To index non-MediaWiki dumps with LuceneSearch I would:
1. Modify the indexing daemon to read the custom dump format, or update the XML dump to support a JSON dump; it uses the MWDumper codebase to do this now.
2. Add a Lucene analyzer to handle the new data type, say a JSON analyzer.
3. Add a Lucene document type per JSON-based Wikidata schema.
4. Update the query parser to handle the new queries and the modified Lucene documents.
5. For bonus points, modify spelling correction and write a Wikidata ranking algorithm.

But this would only solve reading the static dumps used to bootstrap the index; I would then have to change how MWSearch periodically polls Brion's OAIRepository to pull in updated pages.

Having coded some analytics over MWDumps from WMF/Wikia wikis for a research project, I can say this:
1. Most big dumps (e.g. the historic ones) inherit the issues of wikitext, namely unescaped tags and entities, which crash modern Java XML libraries - so escape your data and validate the XML!
2. The good old SAX code in MWDumper still works fine - so use it.
3. Use Lucene 2.4 with the deprecated old APIs.
4. Ariel is doing a great job (e.g. the 7z compression and the splitting of the dumps), but these are things MWDumper does not handle yet.
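To make the JSON indexing mode concrete (the analyzer/document steps above, along the lines of Daniel's option 3), here is a minimal hypothetical sketch - the field names and plumbing are made up, not the existing lucene-search-2 code - that flattens an already-deserialized JSON structure into a Lucene document, indexing only the values and ignoring the keys (Lucene 2.x-era Field API):

    import java.util.List;
    import java.util.Map;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // Hypothetical sketch: index the scalar values of a JSON structure
    // (maps, lists, scalars), deliberately skipping the keys.
    public class JsonDocBuilder {

        public static Document build(String pageTitle, Object json) {
            StringBuilder values = new StringBuilder();
            collectValues(json, values);
            Document doc = new Document();
            doc.add(new Field("title", pageTitle, Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("json_values", values.toString(), Field.Store.NO, Field.Index.ANALYZED));
            return doc;
        }

        private static void collectValues(Object node, StringBuilder out) {
            if (node instanceof Map<?, ?>) {
                for (Object child : ((Map<?, ?>) node).values()) {
                    collectValues(child, out); // keys are deliberately skipped
                }
            } else if (node instanceof List<?>) {
                for (Object child : (List<?>) node) {
                    collectValues(child, out);
                }
            } else if (node != null) {
                out.append(node.toString()).append(' '); // scalar value
            }
        }
    }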
Finally, based on my work with the i18n team on TranslateWiki search, indexing JSON data with Solr + Solarium requires no search-engine coding at all: you define the document schema, then use Solarium to push JSON and get results back. I could do a demo of how to do this at a coming hackathon if there is any interest; however, when I offered to replace LuceneSearch like this last October, the idea was rejected out of hand.

--
oren
Re: [Wikitech-l] Tag cloud
Re: ([01]+)

I was sorry about the Wikidata insanity, and am glad to see you around.

Templates are one way to go, but I think using real markup to mark them up would be even better. This would make the tags cheaper to process, and it would only require a fairly trivial extension. Regarding the UI - I've done something like this a while back, based on user scripts in the wild.

I do envision another issue: since talk pages don't use LiquidThreads, user comments are not "objects". Tagging an LT object is just adding a decorator, but tagging a blob of text is a can of worms - what is the scope of each tag (the page, the top-level section, the paragraph)? I don't see how this would work with templates, and without tag scope I don't see this being very useful for filtering/retrieval per your original use case. (It could be done, but it would require a semi-structured text processing kit on the other end.)

Anyhow, it seems that talk pages are being redesigned, which may render the project superfluous.

On Sun, Mar 3, 2013 at 2:03 PM, Bináris wrote: > Hi folks, > > we have an old problem that talks sink in the archives of talk pages and > village pumps. I already wrote a bot for huwiki that creates tables of > contents for these pages, but this is far not enough. The idea is to use > tags, For example, if the use of disambiguation pages has come up 113 times > in various village pumps, noticeboards and talk pages, a tag could help > users to connect these talks and find them. > > For the solution, there is a trivial way: use templates. Several templates > can be placed in a section. As the tag itself could be the parameter of the > template, special:whatlinkshere will unfortunately not help to collect > tags. A bot may easily be written for this purpose, not a big task. > > For what I write this here: is there a way so that mediaWiki or an > extension could solve this task more efficiently? Is this a good idea for > someone for GSoC? > Tasks: > * Easily place new tags to sections, choose among the existing or create > new. > * Easily find tagged sections in talk pages, village pumps, noticeboards > and archives of these. > > -- > Bináris

--
Oren Bochman
Mobile +972 54 4320067
skype id: orenbochman
e-mail: oren.boch...@gmail.com
Re: [Wikitech-l] a slightly weird search result in the Italian Wikipedia
The algorithm used to rank search results uses ... a variant of PageRank, so the reason may lie outside the actual page.

- Oren Bochman

On Mon, Sep 17, 2012 at 12:11 AM, Federico Leva (Nemo) wrote: > "Autopratica" is actually not a valid word: or rather, it's a neologism > from a new/fringe theory, possibly grammatical thanks to the productivity > of auto- but slightly confusing due to the jargon-meaning of "pratica" here. > > That said, the reason is surely in that link label, which is the only use > of the word on the wiki. > > Nemo

--
Oren Bochman
Office tel. 061 4921492
Mobile +36 30 866 6706
skype id: orenbochman
e-mail: o...@romai-horizon.com
site http://www.riverport.hu
Re: [Wikitech-l] MediaWiki Foundation (was Re: CentralAuth API access)
A number of comments:

1. The community is a massive untapped resource for development. (They like to edit wikis, upload photos and also to code.) E.g. the amount of template code is about 20 times the size of the MediaWiki code base.
2. I would seriously look at maximizing its potential before allocating more funds for paid development.
2.1 This means making it much easier to develop/test/deploy to live wikis (short tutorials, code samples, documentation).
2.2 Create a culture where new coders are assigned to work with experienced coders to fix and maintain existing code.
2.3 Motivate paid developers to work with (i.e. review and direct) the community.
2.4 Team up with the Wikia and WikiHow dev teams on common features and on small-wiki testing.
3. Looking at the metrics, the MediaWiki team is still not set up to do development like other leading open source development communities. Git is a step in the right direction, but the agility of the teams is too low to collaborate at the levels required, or to accept "anonymous donations" of source from the community. While I applaud Sumana, who does a great job with the community, this work needs to be followed through organically by all members of the development teams, or we will continue sending the community the message that we prefer to delay fixing bugs, pay a premium for new features, etc.
4. Only once such issues are addressed would it become productive to engage more developers with WMF or external funding.
5. The one point I do agree with is that features the community asks for should be given due priority, and that this process should be more transparent.

Oren Bochman

On Mon, Sep 3, 2012 at 8:10 PM, Mr. Gregory Varnum wrote: > I'll post more on the RFC, but I wonder if an entity within WMF would be > more appropriate and realistic. Utilizing the existing operations structure > would be far easier. Perhaps setup something like FDC to oversee priorities > and funds. > > My hunch is WMF would be far more likely to sign off on something they > retain a sense of sign-off on for the sake of maintaining the WMF projects > than having to deal with an independent entity that would have the legal > right to go rogue one day and not do what's in the best interest of the WMF > projects. I recognize to some extent that's the point, but looking down a 5 > year road of possibilities, is that something we'd ever want to happen? My > feeling is no and allowing WMF to maintain some level of authority in the > development of MediaWiki is in our collective best interests. From project > management, fundraising, usability, system resources and paid developer > support perspective. > > I would instead propose a MediaWiki department or collective (insert your > favorite term here). > > -Greg aka varnent > > > Sent from my iPhone. Apologies for any typos. A more detailed response may > be sent later. > > On Sep 1, 2012, at 10:42 PM, MZMcBride wrote: > > > Daniel Friesen wrote: > >> Done in true developer style "[RFC] MediaWiki Foundation": > >> > https://www.mediawiki.org/wiki/Requests_for_comment/MediaWiki_Foundation > > > > Thank you for this! This is exactly what I had in mind. > > > > It's interesting, with a lot of (proposed) non-profits, the biggest concerns > > are engaging volunteers and generating income. With this proposed > > foundation, I think most of the typical concerns aren't in play.
> Instead, as > > Nikerabbit so deftly commented on the RFC's talk page, the big question > is: > > > > What projects would a MediaWiki Foundation work on and how would those > > projects be chosen? > > > > This seems to be _the_ crucial issue. Getting grants from the Wikimedia > > Foundation or Wikia or others doesn't seem like it'd be very difficult. > > Assuming there was broad support for the creation of such a foundation > from > > active MediaWiki developers (and related stakeholders), getting the > > Wikimedia Foundation to release the trademark and domain also doesn't > seem > > like it would be very difficult. But there's a huge unresolved question > > about how, out of the infinite number of project ideas, a MediaWiki > > Foundation would choose which ideas to financially support. > > > >> As you command oh great catalyst[1]. > >> [1] Hope you don't mind. I found it amusing. And it kind of fits in a > >> positive way. > > > > Cute. :-) > > > > MZMcBride > > > > > > > > ___ > > Wikitech-l mailing list > > Wikitech-l@lists.wikimedia.org > > https://lists.wikimedia.org/mailman/listinfo
Re: [Wikitech-l] [testing] TitleBlacklist now tested under Jenkins
I've tried to do this for the Translate extension last week - so here are a couple of questions:

1. Is successfully running the tests a requirement for successfully scoring on Gerrit (i.e. how is Gerrit integrated)?
2. Does the extension need to include PHPUnit?

On Thu, Aug 30, 2012 at 9:43 AM, Antoine Musso wrote: > On 29/08/12 16:27, Chad wrote: > > Question: why does the config for non-extension tests attempt > > to load extensions? -Parser and -Misc both seem to be failing > > due to a broken inclusion of Wikibase. > > The -Parser and -Misc jobs are triggered by both the MediaWiki core job > and the one testing the Wikidata branch. I originally thought it was a > good idea to a job dedicated to a PHPUnit group, I will end up creating > a job dedicated to testing the Wikidata branch. > > > Core tests should be run without any extensions. > > Fully agree. We can later create a job to test core + the extension > deployed on the wmf and another one for a Semantic MediaWiki setup. > > -- > Antoine "hashar" Musso

--
Oren Bochman
Office tel. 061 4921492
Mobile +36 30 866 6706
skype id: orenbochman
e-mail: o...@romai-horizon.com
site http://www.riverport.hu
Re: [Wikitech-l] Gerrit evaluation: where we stand
> That said there are known negatives; the Java+Google Web Toolkit
> front-end is intimidating to people who might want to help improve
> the UI; even Gerrit devs don't love it. :)
> Improvements to the UI and to the git-review CLI tool are welcome...

Intimidating to PHP+JS-only devs, perhaps - but the Java + Google Web Toolkit front-end is this system's second iteration for Gerrit. I think it would be possible to add all the changes we need to Gerrit, and I personally feel more comfortable hacking Gerrit, which has an upstream and a community, than our previous code review plug-in, which had neither. A large number of our issues are already being addressed by the Gerrit community and by Chad. However, the comment above clearly highlights an issue arising from running an almost exclusively PHP+JS shop and from under-adoption of FOSS development methodologies.

That being said: using FOSS tools has a higher total cost of ownership. Managers who authorized a switch from a working system (SVN/CodeReview) to a new and immature system such as Git/Gerrit should have set aside resources (time & money) to offset the problems created by such migrations. These generally amount to several orders of magnitude more than the actual cost of the migration done by operations. The bulk of the work created by these changes is offloaded onto the individual developers whose projects will be broken by the change of workflow and who might not be active. It is passing strange how many of the extensions are under-maintained and unsupported. For example:

* Integration of Gerrit with our systems.
* Customization (adding features like better diffs).
* Acceptance - getting people to change workflow and getting core developers to actually review code.
* Education - teaching established and new users to work with Git/Gerrit, writing tutorials, training people at hackathons, updating project documentation and READMEs.
* Secondary migration - fixing scripts/APIs that depend on the current setup. E.g. my CI work in December needs to be updated to reflect using Git/Gerrit; build scripts of systems with independent modules like search + MWDumper; updating bots and so on.
* Tertiary migrations - on the developers' machines: replacing IDEs and workspaces to reflect the Git/Gerrit workflows.

Thus switching back and forth between different Gerrit alternatives is myopic. It ignores the friction and cost these moves create for the established developer community, who have created hundreds of extensions and documented them. I say we just get consensus on the priority queue of outstanding Gerrit issues and start fixing them until it rocks.

Oren Bochman
Lead of Search
Re: [Wikitech-l] suggestion: replace CAPTCHA with better approaches
Hi,

Wikipedia's CAPTCHA is a great opportunity for getting ''useful'' work done by humans. This is now called a [[game with a purpose]]. I think we could ideally use it to:

* OCR Wikisource text, like reCAPTCHA does.
* Translate article fragments, using geo-location of editors: "Translate [xyz-known] [...]", "Translate [xyz-new] [...]", checked using the BLEU metric, etc.
* Get more opinions on spam edits: "Is this diff [spam] [good faith edit] [ok]?"
* Collect linguistic information on the different language editions: "Is XYZ a [verb] / [noun] / [adjective] ... [other]?"
* Disambiguate: "Is [xyz-known] [xyz] ... [xyz] ... [xyz]?", "Is [yzx-unknown] [yzx1] ... [yzx1] ... [yzx1]?", etc.

This way, if people feel motivated to cheat at the CAPTCHA, they will end up helping Wikipedia anyway; it is up to us to try to balance things out. I'm pretty sure users will be less annoyed at solving CAPTCHAs that actually contribute some value.

-Original Message-
From: wikitech-l-boun...@lists.wikimedia.org [mailto:wikitech-l-boun...@lists.wikimedia.org] On Behalf Of matanya
Sent: Tuesday, July 24, 2012 4:12 PM
To: wikitech-l@lists.wikimedia.org
Subject: [Wikitech-l] suggestion: replace CAPTCHA with better approaches

As for the last few month the spam rate stewards deal with is raising. I suggest we implement a new mechanism: Instead of giving the user a CAPTCHA to solve, give him a image from commons and ask him to add a brief description in his own language. We can give him two images, one with known description, and the other with unknown, after enough users translate the unknown in the same why, we can use it as a verified translation. We base on the known image description to allow the user to create the account. Is it possible to embed a file from commons in the login page? is it possible to parse the entered text and store it? benefits: A) it would be harder for bots to create automated accounts. B) We will get translations to many languages with little effort from the users signing up. What do you think?
Re: [Wikitech-l] Creating a centralized access point for propriety databases/resources
Hi Ocaasi,

I agree that tighter work with the database providers is in order. 1000+ accounts for top contributors can make a significant impact on Wikipedia fact checking.

Based on my experience at university (where I taught a lab class on reference database usage), there are many more options for how to do this. Most users in universities do not need to log in at all (they work within an IP range that is enabled for the databases). Research libraries also implement floating licenses for databases that have limited access options. However, to implement this it is often necessary to work with a large database aggregator (which solves the tech issues), and the rest is implemented by the operations staff of the university.

Oren Bochman

-Original Message-
From: wikitech-l-boun...@lists.wikimedia.org [mailto:wikitech-l-boun...@lists.wikimedia.org] On Behalf Of Sumana Harihareswara
Sent: Wednesday, July 25, 2012 4:16 PM
To: Ocaasi Ocaasi; Wikimedia developers
Subject: Re: [Wikitech-l] Creating a centralized access point for propriety databases/resources

Ocaasi, please centralize your notes, ideas, and plans regarding this here: https://www.mediawiki.org/wiki/AcademicAccess I know Chad Horohoe, Ryan Lane, and Chris Steipp might have things to say about this; per https://www.mediawiki.org/wiki/Wikimedia_Engineering/2012-13_Goals#Activities_12 their team aims to work on OAuth and OpenID within the next 11 months, and AcademicAccess is a possible beneficiary of that. Thanks! -- Sumana Harihareswara Engineering Community Manager Wikimedia Foundation

On 07/25/2012 10:03 AM, Ocaasi Ocaasi wrote: > We currently have relationships with three separate resource databases. > > *HighBeam, 1000 authorized accounts, 700 active > (http://enwp.org/WP:HighBeam) *JSTOR, 100 accounts, all active > (http://enwp.org/WP:JSTOR) *Credo, 400 accounts, all active > (http://enwp.org/WP:CREDO) > > No parties have agreed to participate in The Wikipedia Library *yet*, as it's still in the concept stage, but my initial projection is that 1000 editors would have access to it, and 100 additional users per year would be granted. One of the challenges will be getting all the resource providers to agree on that number, but the hope is that once some do, it will create a cascade of adoption. > > So we're not looking at *thousands* of users, but more likely several hundreds. Still, given the impact of our most active editors, 1000 of them with access to the library would have significant impact. After all, we can't cannibalize these databases' subscription business by opening the library to ''all'' editors. It must be a carefully selected and limited group. > > > -Original Message- > From: wikitech-l-boun...@lists.wikimedia.org > [mailto:wikitech-l-boun...@lists.wikimedia.org] On Behalf Of Ocaasi > Ocaasi > Sent: Monday, July 23, 2012 6:22 PM > To: wikitech-l@lists.wikimedia.org > Subject: [Wikitech-l] Creating a centralized access point for > propriety databases/resources > > Hi Folks! > The problem: Many proprietary research databases have donated free > access to select Wikipedia editors (Credo Reference, HighBeam Research, JSTOR). > Managing separate account distribution for each service doesn't scale well. > The idea: Centralize access to these separate resources behind a > single secure (firewalled) gateway, to which accounts would be given > to a limited number of approved users.
After logging in to this single > gateway, users would be able to enter any of the multiple > participating research databases without needing to log in to each one separately. > The question: What are the basic technical specifications for setting > up such a system. What are open source options, ideally? What language > would be ideal? What is required to host such a system? Can you > suggest a sketch of the basic steps necessary to implement such an idea? > Any advice, from basics to details would be greatly appreciated. > Thanks so much! > Ocaasi > http://enwp.org/User:Ocaasi > ___ > Wikitech-l mailing list > Wikitech-l@lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/wikitech-l > ___ > Wikitech-l mailing list > Wikitech-l@lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/wikitech-l ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Creating a centralized access point for propriety databases/resources
Hi,

This looks similar to something I have been thinking about recently. However, I would go about it using OpenID, though it would require all the database sites to support OpenID. I think the extensions exist to do this in MediaWiki, but the WMF projects do not trust/support this method of authentication. If all parties were to support this standard, it would be possible to develop a gadget which could log users into all the sites at once.

Do you know how many users have been granted access to each database? This would be useful for estimating the importance/impact of this project.

Oren Bochman

-Original Message-
From: wikitech-l-boun...@lists.wikimedia.org [mailto:wikitech-l-boun...@lists.wikimedia.org] On Behalf Of Ocaasi Ocaasi
Sent: Monday, July 23, 2012 6:22 PM
To: wikitech-l@lists.wikimedia.org
Subject: [Wikitech-l] Creating a centralized access point for propriety databases/resources

Hi Folks! The problem: Many proprietary research databases have donated free access to select Wikipedia editors (Credo Reference, HighBeam Research, JSTOR). Managing separate account distribution for each service doesn't scale well. The idea: Centralize access to these separate resources behind a single secure (firewalled) gateway, to which accounts would be given to a limited number of approved users. After logging in to this single gateway, users would be able to enter any of the multiple participating research databases without needing to log in to each one separately. The question: What are the basic technical specifications for setting up such a system. What are open source options, ideally? What language would be ideal? What is required to host such a system? Can you suggest a sketch of the basic steps necessary to implement such an idea? Any advice, from basics to details would be greatly appreciated. Thanks so much! Ocaasi http://enwp.org/User:Ocaasi
Re: [Wikitech-l] [Wmfall] Announcement: Peter Youngmeister joins Wikimedia as Technical Operations Engineer
Great news! I'd also like to congratulate Peter. I was very impressed with his work on puppetizing the search configuration and look forward to working with him on new projects.

--
Oren Bochman
Office tel. 061 4921492
Mobile +36 30 866 6706
skype id: orenbochman
Re: [Wikitech-l] [Xmldatadumps-l] XML dumps/Media mirrors update
Dear Ariel,

Consider that the people who would need to use torrents most of all cannot host a mirror - this is a situation of the little guy being asked to do the heavy lifting. It would save WMF significant resources, and it would be more efficient than rsync. Doing this outside the WMF infrastructure does not make sense (authenticity, automation), and that is the reason why the use of torrents has failed traditionally. If the WMF does this, it should be possible for users to leverage all the mirrors simultaneously - which is why torrents are the preferred form of transport for Linux distributions.

Installing a torrent server should not significantly impact the workload. The main problem, as I see it, is to write a maintenance script that creates the magnet links/.torrent files once the dumps are generated and publishes them on the dump servers. With your blessing, I would try to help with it, in the context of say Labs, if it would be integrated into the dump release process.

Thanks for the great job with the dumps!

Oren Bochman

On Tue, Jun 5, 2012 at 3:15 PM, Ariel T. Glenn wrote: > This is a place where volunteers can step in and make it happen without > the need for Wikimedia's infrastructure. (This means I can concentrate > on my already very full plate of things too.) > > http://meta.wikimedia.org/wiki/Data_dump_torrents > > Have at! > > Ariel > > On Tue, 05-06-2012, at 08:57 -0400, Derric Atzrott wrote: > > I second this idea. Large archives should always be available using bittorrent. I would actually suggest posting magnet links for them though instead of .torrent files. This way you can leverage the acceptable source feature of magnet links. > > > > https://en.wikipedia.org/wiki/Magnet_URI_scheme#Web_links_to_the_file > > > > This way we get the best of both worlds: the constant availability of direct downloads, and the reduction in load that p2p filesharing provides. > > > > Thank you, > > Derric Atzrott > > > > -Original Message- > > From: wikitech-l-boun...@lists.wikimedia.org [mailto:wikitech-l-boun...@lists.wikimedia.org] On Behalf Of Oren Bochman > > Sent: 05 June 2012 08:44 > > To: 'Wikimedia developers' > > Subject: Re: [Wikitech-l] [Xmldatadumps-l] XML dumps/Media mirrors update > > > > Any chance that these archived can be served via bittorent - so that even partial downloaders can become servers - leveraging p2p to reduce overall bandwidth load on the servers and increase download times? > > > > > > -Original Message- > > From: wikitech-l-boun...@lists.wikimedia.org [mailto:wikitech-l-boun...@lists.wikimedia.org] On Behalf Of Mike Dupont > > Sent: Saturday, June 02, 2012 1:28 AM > > To: Wikimedia developers; wikiteam-disc...@googlegroups.com > > Subject: Re: [Wikitech-l] [Xmldatadumps-l] XML dumps/Media mirrors update > > > > I have run cron archiving now every 30 minutes, > http://ia700802.us.archive.org/34/items/wikipedia-delete-2012-06/ > > it is amazing how fast the stuff gets deleted on wikipedia. > > what about the proposed deletes are there categories for that?
> > thanks > > mike > > > > On Wed, May 30, 2012 at 6:26 AM, Mike Dupont < > jamesmikedup...@googlemail.com> wrote: > > > https://github.com/h4ck3rm1k3/wikiteam code here > > > > > > On Wed, May 30, 2012 at 6:26 AM, Mike Dupont > > > wrote: > > >> Ok, I merged the code from wikteam and have a full history dump > > >> script that uploads to archive.org, next step is to fix the bucket > > >> metadata in the script mike > > >> > > >> On Tue, May 29, 2012 at 3:08 AM, Mike Dupont > > >> wrote: > > >>> Well, I have now updated the script to include the xml dump in raw > > >>> format. I will have to add more information the achive.org item, at > > >>> least a basic readme. > > >>> other thing is that the wikipybot does not support the full history > > >>> it seems, so that I will have to move over to the wikiteam version > > >>> and rework it, I just spent 2 hours on this so i am pretty happy for > > >>> the first version. > > >>> > > >>> mike > > >>> > > >>> On Tue, May 29, 2012 at 1:52 AM, Hydriz Wikipedia < > ad...@alphacorp.tk> wrote: > > >>>> This is quite nice, though the item's metadata is too little :) > > >>>> > > >>>> On Tue, May 29, 2012 at 3:40 AM, Mike Dupont > > >>>> > >>>>> wrote: > > >>>> &g
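Regarding the maintenance script mentioned in my message above (the one that would create and publish magnet links once a dump is generated): a minimal sketch of the publishing half, assuming the .torrent info hash has already been computed by an external torrent-creation tool. The dump path below is only a placeholder, not an actual URL on dumps.wikimedia.org.

    import java.net.URLEncoder;

    // Hypothetical sketch: build a magnet link for a finished dump file,
    // pointing back at the canonical download URL as a web seed (ws=).
    public class DumpMagnetLink {
        public static String build(String fileName, String infoHashHex) throws Exception {
            String webSeed = "https://dumps.wikimedia.org/example/path/" + fileName; // placeholder path
            return "magnet:?xt=urn:btih:" + infoHashHex
                    + "&dn=" + URLEncoder.encode(fileName, "UTF-8")
                    + "&ws=" + URLEncoder.encode(webSeed, "UTF-8");
        }

        public static void main(String[] args) throws Exception {
            // 40-character hex digest produced by whatever tool creates the .torrent file
            System.out.println(build("enwiki-20120601-pages-articles.xml.bz2",
                    "0123456789abcdef0123456789abcdef01234567"));
        }
    }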
Re: [Wikitech-l] [Xmldatadumps-l] XML dumps/Media mirrors update
Any chance that these archived can be served via bittorent - so that even partial downloaders can become servers - leveraging p2p to reduce overall bandwidth load on the servers and increase download times? -Original Message- From: wikitech-l-boun...@lists.wikimedia.org [mailto:wikitech-l-boun...@lists.wikimedia.org] On Behalf Of Mike Dupont Sent: Saturday, June 02, 2012 1:28 AM To: Wikimedia developers; wikiteam-disc...@googlegroups.com Subject: Re: [Wikitech-l] [Xmldatadumps-l] XML dumps/Media mirrors update I have run cron archiving now every 30 minutes, http://ia700802.us.archive.org/34/items/wikipedia-delete-2012-06/ it is amazing how fast the stuff gets deleted on wikipedia. what about the proposed deletes are there categories for that? thanks mike On Wed, May 30, 2012 at 6:26 AM, Mike Dupont wrote: > https://github.com/h4ck3rm1k3/wikiteam code here > > On Wed, May 30, 2012 at 6:26 AM, Mike Dupont > wrote: >> Ok, I merged the code from wikteam and have a full history dump >> script that uploads to archive.org, next step is to fix the bucket >> metadata in the script mike >> >> On Tue, May 29, 2012 at 3:08 AM, Mike Dupont >> wrote: >>> Well, I have now updated the script to include the xml dump in raw >>> format. I will have to add more information the achive.org item, at >>> least a basic readme. >>> other thing is that the wikipybot does not support the full history >>> it seems, so that I will have to move over to the wikiteam version >>> and rework it, I just spent 2 hours on this so i am pretty happy for >>> the first version. >>> >>> mike >>> >>> On Tue, May 29, 2012 at 1:52 AM, Hydriz Wikipedia >>> wrote: This is quite nice, though the item's metadata is too little :) On Tue, May 29, 2012 at 3:40 AM, Mike Dupont wrote: > first version of the Script is ready , it gets the versions, puts > them in a zip and puts that on archive.org > https://github.com/h4ck3rm1k3/pywikipediabot/blob/master/export_de > leted.py > > here is an example output : > http://archive.org/details/wikipedia-delete-2012-05 > > http://ia601203.us.archive.org/24/items/wikipedia-delete-2012-05/a > rchive2012-05-28T21:34:02.302183.zip > > I will cron this, and it should give a start of saving deleted data. > Articles will be exported once a day, even if they they were > exported yesterday as long as they are in one of the categories. > > mike > > On Mon, May 21, 2012 at 7:21 PM, Mike Dupont > wrote: > > Thanks! and run that 1 time per day, they dont get deleted that quickly. > > mike > > > > On Mon, May 21, 2012 at 9:11 PM, emijrp wrote: > >> Create a script that makes a request to Special:Export using > >> this > category > >> as feed > >> https://en.wikipedia.org/wiki/Category:Candidates_for_speedy_de > >> letion > >> > >> More info > https://www.mediawiki.org/wiki/Manual:Parameters_to_Special:Export > >> > >> > >> 2012/5/21 Mike Dupont > >>> > >>> Well I whould be happy for items like this : > >>> http://en.wikipedia.org/wiki/Template:Db-a7 > >>> would it be possible to extract them easily? > >>> mike > >>> > >>> On Thu, May 17, 2012 at 2:23 PM, Ariel T. Glenn > >>> > >>> wrote: > >>> > There's a few other reasons articles get deleted: copyright > >>> > issues, personal identifying data, etc. This makes > >>> > maintaning the sort of mirror you propose problematic, although a > >>> > similar mirror is here: > >>> > http://deletionpedia.dbatley.com/w/index.php?title=Main_Page > >>> > > >>> > The dumps contain only data publically available at the time > >>> > of the > run, > >>> > without deleted data. 
> >>> > > >>> > The articles aren't permanently deleted of course. The > >>> > revisions > texts > >>> > live on in the database, so a query on toolserver, for > >>> > example, > could be > >>> > used to get at them, but that would need to be for research > >>> > purposes. > >>> > > >>> > Ariel > >>> > > >>> > Στις 17-05-2012, ημέρα Πεμ, και ώρα 13:30 +0200, ο/η Mike > >>> > Dupont > έγραψε: > >>> >> Hi, > >>> >> I am thinking about how to collect articles deleted based > >>> >> on the > "not > >>> >> notable" criteria, > >>> >> is there any way we can extract them from the mysql > >>> >> binlogs? how are these mirrors working? I would be > >>> >> interested in setting up a mirror > of > >>> >> deleted data, at least that which is not spam/vandalism > >>> >> based on > tags. > >>> >> mike > >>> >> > >>> >> On Thu, May 17, 2012 at 1:09 PM, Ariel T. Glenn < > ar...@wikimedia.org> > >>> >> wrote: > >>> >> > We now have three mirror sites, yay! The full list is > >>> >> > linked to > from >>>
Re: [Wikitech-l] [Wiki-research-l] MathJax comes to Wikipedia
Hey, this is so wonderful. I've been working with formulas on Wikipedia and on Meta, and they are so ugly. One really important feature to check is whether it is possible, for several numbered formulas, to have all the numbers appear aligned on the right. I'll be glad to beta test.

On Thu, May 3, 2012 at 6:49 PM, Erik Moeller wrote: > On Thu, May 3, 2012 at 9:44 AM, Dario Taraborelli > wrote: > > MathJax [1] is now enabled site-wide as an opt-in preference. You can > now see beautifully rendered, accessible, copy&pasteable and > standard-compliant (MathML) formulas on Wikipedia, replacing the old > TeX-rendered PNGs. > > Thanks Dario. There are definitely still bugs in this experimental > rendering mode, so please report issues in Bugzilla against the Math > component: > > > https://bugzilla.wikimedia.org/enter_bug.cgi?product=MediaWiki%20extensions&component=Math > > More here: > > http://www.mediawiki.org/wiki/Extension:Math/MathJax_testing > > -- > Erik Möller > VP of Engineering and Product Development, Wikimedia Foundation > > Support Free Knowledge: https://wikimediafoundation.org/wiki/Donate

--
Oren Bochman
Office tel. 061 4921492
Mobile +36 30 866 6706
skype id: orenbochman
e-mail: o...@romai-horizon.com
site http://www.riverport.hu
Re: [Wikitech-l] New Engineering Community Group, headed by Sumana Harihareswara
Great news. Congratulations Sumana !! I think that this is greatly deserved -- I'd be glad that you will be having an even greater impact bringing in more people into our ecosystem. Oren Bochman Lead of Search Operation Manager E-mail: o...@romai-horizon.com Mobil: +36 30 866 6706 Római Horizon Kft. H-1039 Budapest Királyok útja 291. D. ép. fszt. 2. Tel: +36 1 492 1492 Fax: +36 1 266 5529 -Original Message- From: wikitech-l-boun...@lists.wikimedia.org [mailto:wikitech-l-boun...@lists.wikimedia.org] On Behalf Of Rob Lanphier Sent: Wednesday, April 25, 2012 5:30 AM To: Wikimedia developers Subject: [Wikitech-l] New Engineering Community Group, headed by Sumana Harihareswara Hi everyone, I'm happy to announce that we have promoted Sumana Harihareswara as manager of Engineering Community group. Sumana started with us as a contractor back in February 2011, initially in a targeted engagement to help out with Google Summer of Code and with the Berlin Hackathon last year. Later that year, as we interviewed people to bring in as Volunteer Development Coordinator, not only did Sumana put in a strong application herself, but recruited very worthy competition for the role. After winning the role, she worked tirelessly to straighten out many kinks in our processes around volunteer development and systematically ensured that new volunteer developers get the recognition and (if needed) help they deserve. She has also applied focus and organization in many areas outside of her immediate purview, for example, recently stepping in as project manager for Git, and occasionally filling in for me when I've been unavailable for the larger Platform Engineering organization. The promotion to Engineering Community Manager isn't so much a change in the way things are done here so much as an official recognition of a vital role that she has already played for the past year. Sumana has been working with Guillaume Paumier and Mark Hershberger under the somewhat ad hoc group title of "Technical Liaison; Developer Relations (tl;dr)", serving as lead of that group since last year. Under the new "Engineering Community" name, this group will continue to serve many roles: facilitating collaboration and communication between Wikimedia Foundation and its employees and the larger Wikimedia developer community, as well as facilitating collaboration and communication between the Wikimedia developer community and other Wikimedia communities. Thank you, Sumana, for your hard work over the past year. I'm looking forward to seeing what you and the group accomplish moving forward. Congratulations! Rob ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Unified login vs. unified settings
I'd love it too - but I noticed that different wikis have quite different settings due to different gadgets and extensions being available. So a good solution would have to be smart enough to accommodate this. Oren Bochman -Original Message- From: wikitech-l-boun...@lists.wikimedia.org [mailto:wikitech-l-boun...@lists.wikimedia.org] On Behalf Of Bináris Sent: Monday, April 16, 2012 7:44 AM To: Wikimedia developers Subject: Re: [Wikitech-l] Unified login vs. unified settings 2012/4/15 Ole Palnatoke Andersen > Hi! > > I would love to be able to manage my settings in one place There is somewhere a bot that does it for you, but I don't remember where I saw it. -- Bináris ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] GSOC 2012 : Lucene Automatic Query Expansion From Wikipedia Text
Hi Robert Stojnic and Gautham Shankar,

I wanted to let Gautham know that he has written a great proposal, and to thank Robert for the feedback as well.

I wanted to point out that, from my point of view, the main goal of this multilingual wordnet isn't query expansion, but rather a means to ever greater cross-language capabilities in search and content analytics. A wordnet seme can be further disambiguated using a topic-map algorithm run which would consider all the contexts, like you suggest, but this is planned for later, so the wordnet would be a milestone.

To further clarify: Gautham's integration will place cross-language seme WordNet tokens during indexing for words it recognises, allowing the ranking algorithm to use knowledge drawn from all the Wikipedia articles. (For example, one part of the ranking would peek at a featured article in German on "A" and rank it >> then "B" featured in Hungarian, and use them as oracles to rank A >> B >> ... in English, where the picture might now be X >> Y >> Z >> ... B >> A ...)

I mention in passing that I have begun to develop a dataset for use with Open Relevance, to systematically review and evaluate dramatic changes in relevance due to changes in the search engine. I will post on this in due course as it matures, since I am working on a number of smaller projects I'd like to demo at Wikimania.

On Fri, Apr 6, 2012 at 6:01 PM, Gautham Shankar < gautham.shan...@hiveusers.com> wrote: > Robert Stojnic gmail.com> writes: > > Hi Gautham, > > > > I think mining wiktionary is an interesting project. However, about the > > more practical Lucene part: at some point I tried using wordnet to > > expand queries however I found that it introduces too many false > > positives. The most challenging part I think it *context-based* > > expansion. I.e. a simple synonym-based expansion is of no use because it > > introduces too many meanings that the user didn't quite have in mind. > > However, if we could somehow use the words in the query to find a > > meaning from a set of possible meanings that could be really helpful. > > > > You can look into existing lucene-search source to see how I used > > wordnet. I think in the end I ended up using it only for very obvious > > stuff (e.g. 11 = eleven, UK = United Kingdom, etc..). > > > > Cheers, r. > > > > On 06/04/12 01:58, Gautham Shankar wrote: > > > Hello, > > > > > > Based on the feedback i received i have updated my proposal page. > > > > > > https://www.mediawiki.org/wiki/User:Gautham_shankar/Gsoc > > > > > > There is about 20 Hrs for the deadline and any final feedback would be > > > useful. > > > I have also submitted the proposal at the GSOC page. > > > > > > Regards, > > > Gautham Shankar > > Hi Robert, > > Thank you for your feedback. > Like you pointed out, query expansion using the wordnet data directly, reduces > the quality of the search. > > I found this research paper very interesting. > www.sftw.umac.mo/~fstzgg/dexa2005.pdf<http://www.sftw.umac.mo/%7Efstzgg/dexa2005.pdf> > They have built a TSN (Term Semantic Network) for the given query based on the > usage of words in the documents. The expansion words obtained from the wordnet > are then filtered out based on the TSN data. > > I did not add this detail to my proposal since i thought it deals more with the > creation of the wordnet. I would love to implement the TSN concept once the > wordnet is complete.
> Regards,
> Gautham Shankar

--
Oren Bochman
Office tel. 061 4921492
Mobile +36 30 866 6706
skype id: orenbochman
e-mail: o...@romai-horizon.com
site http://www.riverport.hu
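To illustrate the "cross-language seme tokens during indexing" idea from my message above: a hypothetical sketch of a Lucene TokenFilter that injects a language-neutral seme token at the same position as the surface word, so queries routed through the multilingual wordnet can match across languages. It is written against the newer Lucene attribute API, not the 2.4-era API used by lucene-search-2, and the seme map is assumed to come from the wordnet build.

    import java.io.IOException;
    import java.util.Map;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

    // Hypothetical sketch: emit a "seme" token (e.g. "SEME:canine" for "dog")
    // at position increment 0, right after the surface token.
    public final class SemeInjectionFilter extends TokenFilter {
        private final Map<String, String> wordToSeme;
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
        private String pendingSeme;

        public SemeInjectionFilter(TokenStream input, Map<String, String> wordToSeme) {
            super(input);
            this.wordToSeme = wordToSeme;
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (pendingSeme != null) {
                // inject the seme token at the same position as the previous token
                termAtt.setEmpty().append(pendingSeme);
                posIncAtt.setPositionIncrement(0);
                pendingSeme = null;
                return true;
            }
            if (!input.incrementToken()) {
                return false;
            }
            pendingSeme = wordToSeme.get(termAtt.toString()); // null if unknown word
            return true;
        }

        @Override
        public void reset() throws IOException {
            super.reset();
            pendingSeme = null;
        }
    }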
Re: [Wikitech-l] GSOC 2012
Hi - we are running out of time.

Thanks for your interest in our projects at Wikimedia. The GSoC proposals should be mostly specified by the student. Those of you who have not done so should draft proposals and place them on www.mediawiki.org in your user space, then post a link here or email me so we can process them.

1. I have expanded the requirements of my project ideas a bit; however, I have left room for your ideas. There is plenty of similar work published on these subjects - research it and refine your proposals with the tools/algorithms you would like to use and your preferred formats, so that the deliverables can be widely reused.
2. I am contacting two researchers who have worked on similar projects to check if they wish to cooperate by contributing code and helping with the linguistics side of the mentoring.
3. I can answer specific questions you have about expectations.

To optimally match you with a suitable high-impact project, please let us know about your development experience: what projects have you done, and where? Specifically, what is your experience with:
* Java and other programming languages?
* PHP?
* Apache Lucene or Solr?
* Natural Language Processing?
* Data Mining?
* Corpus Linguistics?
* WordNet?

Since these projects are highly multilingual, please also tell us what your native language is and what other languages you can use (on a scale from 1, beginner, to 5, near native).

Operation Manager
E-mail: o...@romai-horizon.com
Mobil: +36 30 866 6706
Római Horizon Kft.
H-1039 Budapest Királyok útja 291. D. ép. fszt. 2.
Tel: +36 1 492 1492
Fax: +36 1 266 5529

-Original Message-
From: wikitech-l-boun...@lists.wikimedia.org [mailto:wikitech-l-boun...@lists.wikimedia.org] On Behalf Of Sudeep Singh
Sent: Tuesday, April 03, 2012 8:48 PM
To: wikitech-l@lists.wikimedia.org
Subject: [Wikitech-l] GSOC 2012

Hi, I am sudeep. I am final year student at Indian Institute of Technology, Kharagpur in the computer science department. I am interested to apply in the following projects for gsoc 2012 1. Lucene automatic query expansion from wikipedia text 2. Backwards compatibility extension 3. Semantic form rules 4. Index transcluded text in search I have a strong background in Information retrieval and Machine learning. I have worked previously with Yahoo Research Labs in the area of Information retrieval. We extracted association rules and attribite-value pairs from the webpages using unsupervised approach. I have also worked on another project with yahoo, which involved emotion detection of youtube videos, based on the comments of the users. We used various ML, Statisitcs andf IR techniques to achieve our goal. I last year succesfully completed GSOC 2011, with OSGEO and have good experience in Open Source Development. Kindly let me know how shall I proceed with my application. Thanks regards Sudeep
Re: [Wikitech-l] GSoC 2012: Proposal-Wikipedia Corpus Tools
You do understand correctly! The main idea for the NLP components, with the POS tagger as an example, is:

1. A fallback system that does unsupervised POS tagging.
2. The ability to plug in an existing POS tagger as these become available for specific languages.

As supervisor, I would recommend working with three languages: English, Hebrew, and the student's native language. If we could get QA from other native speakers, we would incorporate them into the workflow. I think that by using a deletion/reversion-based heuristic we may also be able to build a spam corpus to boost the accuracy of the corpora.

Operation Manager
E-mail: o...@romai-horizon.com
Mobil: +36 30 866 6706
Római Horizon Kft.
H-1039 Budapest Királyok útja 291. D. ép. fszt. 2.
Tel: +36 1 492 1492
Fax: +36 1 266 5529

-Original Message-
From: wikitech-l-boun...@lists.wikimedia.org [mailto:wikitech-l-boun...@lists.wikimedia.org] On Behalf Of Amir E. Aharoni
Sent: Tuesday, April 03, 2012 10:19 PM
To: Wikimedia developers
Subject: Re: [Wikitech-l] GSoC 2012: Proposal-Wikipedia Corpus Tools

2012/4/3 karthik prasad : > Hello, > I am a GSoC aspirant and have compiled a proposal for one of the > project ideas - Wikipedia Corpus Tools. [Mentor : Oren Bochman] I > would sincerely appreciate if you could kindly go through it and > suggest corrections/additions so that I can settle with a coherent proposal. > > Link to my proposal : > https://www.mediawiki.org/wiki/User:Karthikprasad/gsoc2012proposal

Nice, but why only English? If i understand the proposal correctly, this project is supposed to be able to work with almost any language with very little effort. -- Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי http://aharoni.wordpress.com “We're living in pieces, I want to live in peace.” – T. Moore
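A minimal sketch of the pluggable-tagger-with-fallback design in points 1-2 above; the interface and class names are hypothetical, not an existing MediaWiki or Lucene API:

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical interfaces only: plug in a real tagger per language,
    // fall back to an unsupervised baseline tagger otherwise.
    interface PosTagger {
        boolean supports(String languageCode);
        List<String> tag(List<String> tokens); // one POS label per token
    }

    class TaggerRegistry {
        private final List<PosTagger> taggers = new ArrayList<PosTagger>();
        private final PosTagger unsupervisedFallback;

        TaggerRegistry(PosTagger unsupervisedFallback) {
            this.unsupervisedFallback = unsupervisedFallback;
        }

        void register(PosTagger tagger) {
            taggers.add(tagger);
        }

        PosTagger taggerFor(String languageCode) {
            for (PosTagger t : taggers) {
                if (t.supports(languageCode)) {
                    return t; // a language-specific tagger, if one was plugged in
                }
            }
            return unsupervisedFallback; // otherwise the unsupervised baseline
        }
    }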
Re: [Wikitech-l] GSOC 2012 - Text Processing and Data Mining
Dear Karthik Prasad and other GSoC candidates,

I was not getting this list before, but I am now. The GSoC proposal should be specified by the student; I can expand the details of these projects and answer specific questions you have about expectations. To optimally match you with a suitable high-impact project, tell us to what extent you are familiar with:
* Java and other programming languages?
* PHP?
* Apache Lucene?
* Natural Language Processing?
* Corpus Linguistics?
* WordNet?

The listed projects would be either wrapped as services, consumed by downstream projects, or both. The corpus is the simplest, but it requires lots of attention to detail. When successful, it would be picked up by lots of researchers and companies who do not have the resources for such CPU-intensive tasks. For WMF it would provide a standardized corpus for future NLP work. A part-of-speech-tagged corpus would be immediately useful for roughly 80%-accurate word sense disambiguation in the search engine.

Automatic summaries are not a strategic priority AFAIK: 1. most articles provide a kind of abstract in their intro; 2. something like this is already provided in the dumps for Yahoo; and 3. I have been using a great pop-up preview widget on Wiktionary for a year or so. I do think it would be a great project for learning how to become a MediaWiki developer, but it is too small for a GSoC. However, I cannot speak for Jebald and other mentors on the mobile and other teams who might be interested in this.

If your essay grader is working, it could be the basis of another very exciting GSoC project aimed at article quality. An NLP-savvy "smart" article quality assessment service could improve/expand the current bots grading articles. Grammar and spelling are two good indicator features; however, a full assessment of Wikipedia articles would require more detail, both stylistic and information-based. Once you have covered sufficient features, building discriminators based on samples of graded articles would require some data mining ability. However, since there is an existing bot undergoing upgrades, we would have to check with its small dev team what it is currently doing, and it would be subject to community oversight.

Yours sincerely,

Oren Bochman
MediaWiki Search Developer