Re: [Wikitech-l] Welcome Gergő Tisza!

2013-10-17 Thread Oren Bochman
Congratulations!

Gergő is a great software engineer and one of the most helpful Hungarians
I've met.

I'm sure he will be a fine addition to the engineering team.

On Tuesday, October 15, 2013, Bináris wrote:

 Just to be the first: welcome!
 All the folks should learn this letter ő (https://en.wikipedia.org/wiki/%C5%90) in Gergő's name. :-)
 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l



-- 
Oren Bochman

Mobile +972 54 4320067
skype id: orenbochman
e-mail: oren.boch...@gmail.com
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Advance notice: I'm taking a sabbatical October-December

2013-08-31 Thread Oren Bochman
Congratulations - I hope you get to make many new bugs and even do some
cool coding outside the WMF!
I'm wondering if you'll update us about this unique experience as time
allows.



On Wednesday, August 28, 2013, Sumana Harihareswara wrote:

 I've been accepted to Hacker School https://www.hackerschool.com, a
 writers' retreat for programmers in New York City. I will therefore be
 taking an unpaid personal leave of absence from the Wikimedia Foundation
 via our sabbatical program. My last workday before my leave will be
 Friday, September 27. I plan to be on leave all of October, November,
 and December, returning to WMF in January.

 During my absence, Quim Gil will be the temporary head of the
 Engineering Community Team. Thank you, Quim! I'll spend much of
 September turning over responsibilities to him. Over the next month I'll
 be saying no to a lot of requests so I can ensure I take care of all my
 commitments by September 27th, when I'll be turning off my wikimedia.org
 email.

 If there's anything else I can do to minimize inconvenience, please let
 me know. And -- I have to say this -- oh my gosh I'm so excited to be
 going to Hacker School in just a month! Going from advanced beginner
 to confident programmer! Learning face-to-face with other coders, 30-45%
 of them women, all teaching each other! Thank you, WMF, for the
 sabbatical program, and thanks to my team for supporting me on this. I
 couldn't do this without you.

 --
 Sumana Harihareswara
 Engineering Community Manager
 Wikimedia Foundation

 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l



-- 
Oren Bochman

Mobile +972 54 4320067
skype id: orenbochman
e-mail: oren.boch...@gmail.com
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Separation of Concerns

2013-06-04 Thread oren bochman
This schedule is excellent news.

I am working on integrating Moodle with MediaWiki, and having SUL support
would be great.

We are looking at two basic use cases:
1. Allowing existing users to log into Moodle via OpenID.
2. Making edits, such as clearing the sandbox, on behalf of students.

Unfortunately, OAuth is broken in the current version of Moodle and will
require some work. However, I'm coordinating with the local Moodle dev
community to help us out.

I am wondering whether OAuth will allow a user's privileges to be queried,
or whether this can be done using the API.
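
As a concrete starting point, here is a minimal sketch of querying a user's
rights and groups through the action API (the wiki URL is just a placeholder,
and the session is assumed to already be authenticated via OAuth or a login):

import requests

# Placeholder endpoint; authentication (OAuth or a plain login) is assumed
# to have already happened on this session.
API_URL = "https://meta.wikimedia.org/w/api.php"
session = requests.Session()

resp = session.get(API_URL, params={
    "action": "query",
    "meta": "userinfo",
    "uiprop": "rights|groups",
    "format": "json",
})
userinfo = resp.json()["query"]["userinfo"]
print(userinfo["name"], userinfo.get("groups"), userinfo.get("rights"))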

Also, are there unit tests for the respective MW extensions?

Thanks

Oren Bochman



Sent from my iPhone

On Jun 4, 2013, at 5:43, Tyler Romeo tylerro...@gmail.com wrote:

 On Mon, Jun 3, 2013 at 8:18 PM, Chris Steipp cste...@wikimedia.org wrote:
 
 We are trying to finish the items in scope (SUL rework, OAuth, and a
 review of the OpenID extension) by the end of this month.
 
 Speaking of this, there's an OAuth framework attempt here:
 https://gerrit.wikimedia.org/r/66286
 
 Am I the only person who thinks it's a bad idea for the AuthPlugin class to
 be relying on the ApiBase class for its interface? Especially since the
 AuthPlugin framework isn't supposed to handle authorization logic anyway.
 
 *-- *
 *Tyler Romeo*
 Stevens Institute of Technology, Class of 2016
 Major in Computer Science
 www.whizkidztech.com | tylerro...@gmail.com
 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] GSoC 2013 Proposal - jQuery.IME extensions for Firefox and Chrome

2013-04-29 Thread oren bochman
Interesting proposal.

I would imagine that this does not impact most page views, since the JS files
are cached.

It might be better to fix this by integrating the JavaScript more tightly with
ResourceLoader, so that the required modules are lazy-loaded as needed.

In that case the solution would be less dependent on browser plugins and
would require less long-term maintenance when the JS is updated.

On Apr 29, 2013, at 12:09, Praveen Singh prag...@gmail.com wrote:

 Hello,
 
 I have drafted a proposal for my GSoC Project: jQuery.IME extensions for
 Firefox and Chrome. I would love to hear what you think about it.
 I would really appreciate any kind of feedback and suggestions. Please let
 me know if I can improve it in any way.
 
 My proposal can be found here:
 http://www.mediawiki.org/wiki/User:Prageck/GSoC_2013_Application
 
 
 Thanks,
 
 Praveen Singh
 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Indexing non-text content in LuceneSearch

2013-03-08 Thread oren bochman
-Original Message-
From: wikitech-l-boun...@lists.wikimedia.org 
[mailto:wikitech-l-boun...@lists.wikimedia.org] On Behalf Of Brion Vibber
Sent: Thursday, March 7, 2013 9:59 PM
To: Wikimedia developers
Subject: Re: [Wikitech-l] Indexing non-text content in LuceneSearch

On Thu, Mar 7, 2013 at 11:45 AM, Daniel Kinzler dan...@brightbyte.de wrote:
 1) create a specialized XML dump that contains the text generated by
 getTextForSearchIndex() instead of actual page content.

That probably makes the most sense; alternately, make a dump that includes both 
raw data and text for search. This also allows for indexing extra stuff for 
files -- such as extracted text from a PDF of DjVu or metadata from a JPEG -- 
if the dump process etc can produce appropriate indexable data.

 However, that only works
 if the dump is created using the PHP dumper. How are the regular dumps 
 currently generated on WMF infrastructure? Also, would be be feasible 
 to make an extra dump just for LuceneSearch (at least for wikidata.org)?

The dumps are indeed created via MediaWiki. I think Ariel or someone can 
comment with more detail on how it currently runs, it's been a while since I 
was in the thick of it.

 2) We could re-implement the ContentHandler facility in Java, and 
 require extensions that define their own content types to provide a 
 Java based handler in addition to the PHP one. That seems like a 
 pretty massive undertaking of dubious value. But it would allow maximum 
 control over what is indexed how.

No don't do it :)

 3) The indexer code (without plugins) should not know about Wikibase, 
 but it may have hard coded knowledge about JSON. It could have a 
 special indexing mode for JSON, in which the structure is deserialized 
 and traversed, and any values are added to the index (while the keys 
 used in the structure would be ignored). We may still be indexing 
 useless interna from the JSON, but at least there would be a lot fewer false 
 negatives.

Indexing structured data could be awesome -- again I think of file metadata as 
well as wikidata-style stuff. But I'm not sure how easy that'll be. Should 
probably be in addition to the text indexing, rather than replacing.

-- brion

I agree with Brion.

Here are my five shekels' worth.

To index non-MW dumps with LuceneSearch I would:
1. Modify the daemon to read the custom dump format, or update the XML dump to
support a JSON dump (it uses the MWDumper codebase to do this now).
2. Add a Lucene analyzer to handle the new data type, say a JSON analyzer (a
sketch of the idea follows below).
3. Add a Lucene document per JSON-based Wikidata schema.
4. Update the query parser to handle the new queries and the modified Lucene
documents.
5. For bonus points, modify the spelling correction and write a Wikidata
ranking algorithm.
But this would only solve reading the static dumps used to bootstrap the
index; I would then have to change how MWSearch periodically polls Brion's
OAIRepository to pull in updated pages.
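
To make step 2 concrete, here is a rough sketch of the traversal I have in
mind - written in Python for brevity, though the real code would live in the
Java indexer, and the file name is only a placeholder. It collects the values
of a Wikidata-style JSON blob and ignores the keys, as Daniel suggested:

import json

def json_text_values(node):
    # Recursively yield the value strings of a parsed JSON structure,
    # ignoring the keys, so that only values reach the search index.
    if isinstance(node, dict):
        for value in node.values():
            yield from json_text_values(value)
    elif isinstance(node, list):
        for item in node:
            yield from json_text_values(item)
    elif isinstance(node, str):
        yield node
    elif isinstance(node, (int, float)) and not isinstance(node, bool):
        yield str(node)

with open("Q42.json") as f:              # placeholder entity dump
    entity = json.loads(f.read())
indexable_text = " ".join(json_text_values(entity))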

Having coded some analytics over MW dumps of WMF/Wikia wikis for a research
project, I can say this:
1. Most big dumps (e.g. the historic ones) inherit the issues of wikitext,
namely unescaped tags and entities, which crash modern Java XML libraries - so
escape your data and validate the XML!
2. The good old SAX code in MWDumper still works fine - so use it.
3. Use Lucene 2.4 with the deprecated old APIs.
4. Ariel is doing a great job (e.g. the 7z compression and the splitting of the
dumps), but these are things MWDumper does not handle yet.

Finally, based on my work with the i18n team on TranslateWiki search, indexing
JSON data with Solr + Solarium requires no search-engine coding at all: you
define the document schema, then use Solarium to push JSON and get results
back. I could do a demo of how to do this at a coming hackathon if there is
any interest; however, when I offered to replace LuceneSearch like this last
October, the idea was rejected out of hand.
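
For the curious, the Solr side really is that small. A minimal sketch, talking
to Solr's JSON update handler directly instead of going through Solarium; the
core name and field names are made up:

import requests

SOLR = "http://localhost:8983/solr/wikidata"   # assumed core name

docs = [
    {"id": "Q42", "label_en": "Douglas Adams",
     "description_en": "English writer and humorist"},
]
# Push documents; the JSON update handler accepts an array of documents.
resp = requests.post(SOLR + "/update", params={"commit": "true"}, json=docs)
resp.raise_for_status()

# Query them back.
results = requests.get(SOLR + "/select",
                       params={"q": "label_en:Adams", "wt": "json"})
print(results.json()["response"]["numFound"])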

-- oren

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Tag cloud

2013-03-03 Thread Oren Bochman
Re: ([01]+)

I was sorry about the Wikidata insanity, and am glad to see you around.

Templates are one way to go, but I think using a real markup tag to mark them
up would be even better. This would make the tags cheaper to process and would
only require a fairly trivial extension. Regarding the UI, I've done something
like this a while back based on user scripts in the wild.

I do envision another issue: since talk pages don't use LiquidThreads, user
comments are not objects. Tagging an LT object is just adding a decorator, but
tagging a blob of text is a can of worms - what is the scope of each tag (the
page, the top-level section, the paragraph?). I don't see how this would work
with templates, and without tag scope I don't see this being very useful for
filtering/retrieval per your original use case.

(It could be done, but it would require a semi-structured text-processing kit
on the other end.)

Anyhow, it seems that talk pages are being redesigned, which may render this
project superfluous.


On Sun, Mar 3, 2013 at 2:03 PM, Bináris wikipo...@gmail.com wrote:

 Hi folks,

 we have an old problem that talks sink in the archives of talk pages and
 village pumps. I already wrote a bot for huwiki that creates tables of
 contents for these pages, but this is far from enough. The idea is to use
 tags. For example, if the use of disambiguation pages has come up 113 times
 in various village pumps, noticeboards and talk pages, a tag could help
 users to connect these talks and find them.

 For the solution, there is a trivial way: use templates. Several templates
 can be placed in a section. As the tag itself could be the parameter of the
 template, special:whatlinkshere will unfortunately not help to collect
 tags. A bot may easily be written for this purpose, not a big task.

 The reason I write this here: is there a way that MediaWiki or an
 extension could solve this task more efficiently? Is this a good idea for
 someone for GSoC?
 Tasks:
 * Easily place new tags to sections, choose among the existing or create
 new.
 * Easily find tagged sections in talk pages, village pumps, noticeboards
 and archives of these.

 --
 Bináris
 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l




-- 
Oren Bochman

Mobile +972 54 4320067
skype id: orenbochman
e-mail: oren.boch...@gmail.com
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] a slightly weird search result in the Italian Wikipedia

2012-09-17 Thread Oren Bochman
The algorithm used to rank search results uses a variant of PageRank,
so the reasons may lie outside the actual page.

- Oren Bochman

On Mon, Sep 17, 2012 at 12:11 AM, Federico Leva (Nemo)
nemow...@gmail.comwrote:

 Autopratica is actually not a valid word: or rather, it's a neologism
 from a new/fringe theory, possibly grammatical thanks to the productivity
 of auto- but slightly confusing due to the jargon-meaning of pratica here.

 That said, the reason is surely in that link label, which is the only use
 of the word on the wiki.

 Nemo


 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l




-- 

Oren Bochman

Office tel. 061 4921492
Mobile +36 30 866 6706
skype id: orenbochman
e-mail: o...@romai-horizon.com
site http://www.riverport.hu
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] MediaWiki Foundation (was Re: CentralAuth API access)

2012-09-03 Thread Oren Bochman
A number of comments:

1. The community is a massive untapped resource for development (they like to
edit wikis, upload photos, and also to code);
e.g. the amount of template code is about 20 times the size of the
MediaWiki code base.
2. I would seriously look at maximizing its potential before allocating more
funds for paid development.
2.1 This means making it much easier to develop/test/deploy to live wikis
(short tutorials, code samples, documentation).
2.2 Create a culture where new coders are assigned to work with experienced
coders to fix and maintain existing code.
2.3 Motivate paid developers to work with (i.e. review and direct) the
community.
2.4 Team up with the Wikia and wikiHow dev teams on common features and on
small-wiki testing.
3. Looking at the metrics, the MediaWiki team is still not set up to do
development like other leading open-source development communities.
Git is a step in the right direction, but the agility of the teams is
too low to collaborate at the levels required, e.g. to accept anonymous
donations of source code from the community.
While I applaud Sumana, who does a great job with the community, this
work needs to be followed through organically by all members of the
development teams,
or we will continue sending the community the message that we prefer
to delay fixing bugs, pay a premium for new features, etc.
4. Only once such issues are addressed would it become productive to engage
more developers with WMF or external funding.
5. The one point I do agree with is that features the community asks for
should be given due priority, and this process should be more transparent.

Oren Bochman


On Mon, Sep 3, 2012 at 8:10 PM, Mr. Gregory Varnum gregory.var...@gmail.com
 wrote:

 I'll post more on the RFC, but I wonder if an entity within WMF would be
 more appropriate and realistic. Utilizing the existing operations structure
 would be far easier. Perhaps setup something like FDC to oversee priorities
 and funds.

 My hunch is WMF would be far more likely to sign off on something they
 retain a sense of sign-off on for the sake of maintaining the WMF projects
 than having to deal with an independent entity that would have the legal
 right to go rogue one day and not do what's in the best interest of the WMF
 projects. I recognize to some extent that's the point, but looking down a 5
 year road of possibilities, is that something we'd ever want to happen?  My
 feeling is no and allowing WMF to maintain some level of authority in the
 development of MediaWiki is in our collective best interests. From project
 management, fundraising, usability, system resources and paid developer
 support perspective.

 I would instead propose a MediaWiki department or collective (insert your
 favorite term here).

 -Greg aka varnent

 
 Sent from my iPhone. Apologies for any typos. A more detailed response may
 be sent later.

 On Sep 1, 2012, at 10:42 PM, MZMcBride z...@mzmcbride.com wrote:

  Daniel Friesen wrote:
  Done in true developer style [RFC] MediaWiki Foundation:
 
 https://www.mediawiki.org/wiki/Requests_for_comment/MediaWiki_Foundation
 
  Thank you for this! This is exactly what I had in mind.
 
  It's interesting, with a lot of (proposed) non-profits, the biggest
 concerns
  are engaging volunteers and generating income. With this proposed
  foundation, I think most of the typical concerns aren't in play.
 Instead, as
  Nikerabbit so deftly commented on the RFC's talk page, the big question
 is:
 
  What projects would a MediaWiki Foundation work on and how would those
  projects be chosen?
 
  This seems to be _the_ crucial issue. Getting grants from the Wikimedia
  Foundation or Wikia or others doesn't seem like it'd be very difficult.
  Assuming there was broad support for the creation of such a foundation
 from
  active MediaWiki developers (and related stakeholders), getting the
  Wikimedia Foundation to release the trademark and domain also doesn't
 seem
  like it would be very difficult. But there's a huge unresolved question
  about how, out of the infinite number of project ideas, a MediaWiki
  Foundation would choose which ideas to financially support.
 
  As you command oh great catalyst[1].
  [1] Hope you don't mind. I found it amusing. And it kind of fits in a
  positive way.
 
  Cute. :-)
 
  MZMcBride
 
 
 
  ___
  Wikitech-l mailing list
  Wikitech-l@lists.wikimedia.org
  https://lists.wikimedia.org/mailman/listinfo/wikitech-l

 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l




-- 

Oren Bochman

Office tel. 061 4921492
Mobile +36 30 866 6706
skype id: orenbochman
e-mail: o...@romai-horizon.com
site http://www.riverport.hu
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] [testing] TitleBlacklist now tested under Jenkins

2012-08-30 Thread Oren Bochman
I tried to do this for the Translate extension last week, so here are a couple
of questions:
1. Is successfully running the tests a requirement for a passing score in
Gerrit (i.e. how is Gerrit integrated)?
2. Does the extension need to include PHPUnit?

On Thu, Aug 30, 2012 at 9:43 AM, Antoine Musso hashar+...@free.fr wrote:

 Le 29/08/12 16:27, Chad a écrit :
  Question: why does the config for non-extension tests attempt
  to load extensions? -Parser and -Misc both seem to be failing
  due to a broken inclusion of Wikibase.

 The -Parser and -Misc jobs are triggered by both the MediaWiki core job
 and the one testing the Wikidata branch.  I originally thought it was a
 good idea to a job dedicated to a PHPUnit group, I will end up creating
 a job dedicated to testing the Wikidata branch.

  Core tests should be run without any extensions.

 Fully agree. We can later create a job to test core + the extension
 deployed on the wmf and another one for a Semantic MediaWiki setup.

 --
 Antoine hashar Musso


 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l




-- 

Oren Bochman

Office tel. 061 4921492
Mobile +36 30 866 6706
skype id: orenbochman
e-mail: o...@romai-horizon.com
site http://www.riverport.hu
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Gerrit evaluation: where we stand

2012-08-16 Thread Oren Bochman
That said there are known negatives; the Java+Google Web Toolkit front-end
is intimidating to people who might want to help improve the UI; even
Gerrit devs don't love it. :)

Improvements to the UI and to the git-review CLI tool are welcome...

It is intimidating to PHP+JS-only devs, but Java + Google Web Toolkit is this
system's second iteration for Gerrit. I think it would be possible to add all
the changes we need to Gerrit; I personally feel more comfortable hacking
Gerrit, which has an upstream and a community, than our previous code review
plug-in, which had none. A large number of our issues are already being
addressed by the Gerrit community and by Chad. However, the comment above
clearly highlights an issue arising from running an almost exclusively PHP+JS
shop and from under-adoption of FOSS development methodologies.

That being said:

Using FOSS tools has a higher total cost of ownership. Managers who
authorized a switch from a working system (SVN/CodeReview) to a new and
immature system such as Git/Gerrit should have set aside resources (time
and money) to offset the problems created by such migrations.

These generally amount to several orders of magnitude of the actual cost of
the migration done by operations. The bulk of the work created by these
changes is offloaded to the individual developers whose projects will be
broken by the change of workflow and who might not be active. It is passing
strange how few of the extensions are under-maintained or unsupported.

For example:
* Integration of Gerrit into our systems.
* Customization (adding features like better diffs).
* Acceptance - getting people to change their workflow and getting core
developers to actually review code.
* Education - teaching established and new users to work with Git/Gerrit,
writing tutorials, training people at hackathons, and updating project
documentation and READMEs.
* Secondary migration - fixing scripts/APIs that depend on the current
setup; e.g. my CI work in December needs to be updated to reflect using
Git/Gerrit, as do build scripts of systems with independent modules like
search + MWDumper, bots, and so on.
* Tertiary migration - on developers' machines, replacing IDEs and
workspaces to reflect the Git/Gerrit workflows.

Thus switching back and forth between different Gerrit alternatives is myopic.
It ignores the friction and cost these moves create for the established
developer community, who have created and documented hundreds of extensions.
I say we just get consensus on a priority queue of outstanding Gerrit issues
and start fixing them until it rocks.


Oren Bochman
Lead of Search






___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Creating a centralized access point for propriety databases/resources

2012-07-25 Thread Oren Bochman
Hi

This looks similar to something I have been thinking about recently.

However, I would go about it using OpenID, though it would require all the
database sites to support OpenID. I think the extensions exist to do this with
MediaWiki, but
WMF projects do not trust/support this method of authentication.

If all parties were to support this standard, it would be possible to develop
a gadget which could log users into all the sites at once.

Do you know how many users have been granted access to each database? This
would be useful for estimating the importance/impact of this project.

Oren Bochman

-Original Message-
From: wikitech-l-boun...@lists.wikimedia.org
[mailto:wikitech-l-boun...@lists.wikimedia.org] On Behalf Of Ocaasi Ocaasi
Sent: Monday, July 23, 2012 6:22 PM
To: wikitech-l@lists.wikimedia.org
Subject: [Wikitech-l] Creating a centralized access point for propriety
databases/resources

Hi Folks!
The problem: Many proprietary research databases have donated free access to
select Wikipedia editors (Credo Reference, HighBeam Research, JSTOR).
Managing separate account distribution for each service doesn't scale well.
The idea: Centralize access to these separate resources behind a single
secure (firewalled) gateway, to which accounts would be given to a limited
number of approved users. After logging in to this single gateway, users
would be able to enter any of the multiple participating research databases
without needing to log in to each one separately.
The question: What are the basic technical specifications for setting up
such a system. What are open source options, ideally? What language would be
ideal? What is required to host such a system? Can you suggest a sketch of
the basic steps necessary to implement such an idea?
Any advice, from basics to details would be greatly appreciated.  Thanks so
much!
Ocaasi
http://enwp.org/User:Ocaasi
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Creating a centralized access point for propriety databases/resources

2012-07-25 Thread Oren Bochman
Hi Ocaasi

I agree that tighter work with the database providers is in order. 1000+
accounts for top contributors can make a significant impact on Wikipedia
fact checking.

Based on my experience at university (where I taught a lab class on
reference database usage), there are many more options for how to do this.
Most users at universities do not need to log in at all (they work within an
IP range that is enabled for the databases). Research libraries also implement
floating licenses for databases that have limited access options.

However, to implement this it is often necessary to work with a large
database aggregator (which solves the tech issues); the rest is implemented by
the university's operations staff.

Oren Bochman

-Original Message-
From: wikitech-l-boun...@lists.wikimedia.org
[mailto:wikitech-l-boun...@lists.wikimedia.org] On Behalf Of Sumana
Harihareswara
Sent: Wednesday, July 25, 2012 4:16 PM
To: Ocaasi Ocaasi; Wikimedia developers
Subject: Re: [Wikitech-l] Creating a centralized access point for propriety
databases/resources

Ocaasi, please centralize your notes, ideas, and plans regarding this here:

https://www.mediawiki.org/wiki/AcademicAccess

I know Chad Horohoe, Ryan Lane, and Chris Steipp might have things to say
about this; per
https://www.mediawiki.org/wiki/Wikimedia_Engineering/2012-13_Goals#Activities_12
their team aims to work on OAuth and OpenID within the next 11 months, and
AcademicAccess is a possible beneficiary of that.

Thanks!
--
Sumana Harihareswara
Engineering Community Manager
Wikimedia Foundation

On 07/25/2012 10:03 AM, Ocaasi Ocaasi wrote:
 We currently have relationships with three separate resource databases.
 
 *HighBeam, 1000 authorized accounts, 700 active 
 (http://enwp.org/WP:HighBeam) *JSTOR, 100 accounts, all active 
 (http://enwp.org/WP:JSTOR) *Credo, 400 accounts, all active 
 (http://enwp.org/WP:CREDO)
 
 No parties have agreed to participate in The Wikipedia Library *yet*, as
it's still in the concept stage, but my initial projection is that 1000
editors would have access to it, and 100 additional users per year would be
granted.  One of the challenges will be getting all the resource providers
to agree on that number, but the hope is that once some do, it will create a
cascade of adoption.  
 
 So we're not looking at *thousands* of users, but more likely several
hundreds.  Still, given the impact of our most active editors, 1000 of them
with access to the library would have significant impact.  After all, we
can't cannibalize these databases' subscription business by opening the
library to ''all'' editors.  It must be a carefully selected and limited
group.
 
 
 -Original Message-
 From: wikitech-l-boun...@lists.wikimedia.org
 [mailto:wikitech-l-boun...@lists.wikimedia.org] On Behalf Of Ocaasi 
 Ocaasi
 Sent: Monday, July 23, 2012 6:22 PM
 To: wikitech-l@lists.wikimedia.org
 Subject: [Wikitech-l] Creating a centralized access point for 
 propriety databases/resources
 
 Hi Folks!
 The problem: Many proprietary research databases have donated free 
 access to select Wikipedia editors (Credo Reference, HighBeam Research,
JSTOR).
 Managing separate account distribution for each service doesn't scale
well.
 The idea: Centralize access to these separate resources behind a 
 single secure (firewalled) gateway, to which accounts would be given 
 to a limited number of approved users. After logging in to this single 
 gateway, users would be able to enter any of the multiple 
 participating research databases without needing to log in to each one
separately.
 The question: What are the basic technical specifications for setting 
 up such a system. What are open source options, ideally? What language 
 would be ideal? What is required to host such a system? Can you 
 suggest a sketch of the basic steps necessary to implement such an idea?
 Any advice, from basics to details would be greatly appreciated.  
 Thanks so much!
 Ocaasi
 http://enwp.org/User:Ocaasi
 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l
 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] suggestion: replace CAPTCHA with better approaches

2012-07-25 Thread Oren Bochman
Hi

Wikipedia's CAPTCHA is a great opportunity for getting '''useful''' work
done by humans.
This is now called a [[game with a purpose]].

I think we can ideally use it to help:
* OCR Wikisource text, like reCAPTCHA does.
* Translate article fragments, using geo-location of editors:
  Translate [xyz-known] [...]
  Translate [xyz-new] [...]
  and check the known one against its stored reference using a BLEU metric
  etc. (a sketch follows below).
* Get more opinions on spam edits:
  Is this diff [spam] [good faith edit] [ok]
* Collect linguistic information for the different language editions:
  Is XYZ a [verb] / [noun] / [adjective] ... [other]
* Disambiguate:
  Is [xyz-known] [xyz] ... [xyz] ... [xyz] ...
  Is [yzx-unknown] [yzx1] ... [yzx1] ... [yzx1] ...
Etc.
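
For the translation check, the user's rendering of the known fragment can be
compared against the stored reference with something like a sentence-level
BLEU score. A rough sketch using NLTK's implementation (the threshold is an
arbitrary illustration, not a tuned value):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def passes_translation_check(candidate, reference, threshold=0.3):
    # Score the user's translation of the known fragment against the stored
    # reference; pass/fail decides whether the CAPTCHA counts as solved.
    score = sentence_bleu([reference.split()], candidate.split(),
                          smoothing_function=SmoothingFunction().method1)
    return score >= threshold

# passes_translation_check("the cat sat on the mat",
#                          "the cat is sitting on the mat")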

This way, if people feel motivated to cheat at the CAPTCHA, they will end up
helping Wikipedia anyway.
It is up to us to try to balance things out.

I'm pretty sure users will be less annoyed at solving CAPTCHAs that actually
contribute some value.



-Original Message-
From: wikitech-l-boun...@lists.wikimedia.org
[mailto:wikitech-l-boun...@lists.wikimedia.org] On Behalf Of matanya
Sent: Tuesday, July 24, 2012 4:12 PM
To: wikitech-l@lists.wikimedia.org
Subject: [Wikitech-l] suggestion: replace CAPTCHA with better approaches

Over the last few months the spam rate stewards deal with has been rising.
I suggest we implement a new mechanism:

Instead of giving the user a CAPTCHA to solve, give him an image from Commons
and ask him to add a brief description in his own language.

We can give him two images, one with a known description and the other
unknown; after enough users translate the unknown one in the same way, we can
use it as a verified translation. We rely on the known image description to
decide whether to allow the user to create the account.

Is it possible to embed a file from commons in the login page? is it
possible to parse the entered text and store it?

benefits:

A) it would be harder for bots to create automated accounts.

B) We will get translations to many languages with little effort from the
users signing up.

What do you think?



___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] [Wmfall] Announcement: Peter Youngmeister joins Wikimedia as Technical Operations Engineer

2012-07-17 Thread Oren Bochman
Great news!

I'd also like to congratulate Peter.

I was very impressed with his work on puppetizing the search configuration,
and I look forward to working with him on new projects.

-- 

Oren Bochman

Office tel. 061 4921492
Mobile +36 30 866 6706
skype id: orenbochman
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] [Xmldatadumps-l] XML dumps/Media mirrors update

2012-06-06 Thread Oren Bochman
Dear Ariel,

Consider that people who would need to use Torrent most of all cannot host
a mirrors - this is a situation of the little guy being asked to do the
heavy lifting.

It would be saving WMF significant resources, - it would be more efficient
than Rsync. Doing this outside the the WMF infrastructure does not make
sense (authenticity, automation) and is the reason why use of torrents has
failed traditionally. If the WMF does this - it should be possible for
users to leverage all the mirrors simultaneously - which is why torrents
are the preferred form of transport for Linux distribution.

Installing a torrent server should not significantly impact the workload.
The main problem, as I see it, is to write a maintenance script that creates
the magnet links/.torrent files once the dumps are generated and publishes
them on the dump servers.
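
To make the scope concrete, here is a rough sketch of the core of such a
script: it hashes a finished dump file into BitTorrent pieces, bencodes a
single-file info dict, and derives the magnet link. The piece size is a guess,
tracker handling and the publishing step are left out, and the file name at
the end is only an example.

import hashlib
from urllib.parse import quote

def bencode(obj):
    # Minimal bencoding, covering only the types used in an info dict.
    if isinstance(obj, int):
        return b"i%de" % obj
    if isinstance(obj, bytes):
        return b"%d:%s" % (len(obj), obj)
    if isinstance(obj, str):
        return bencode(obj.encode("utf-8"))
    if isinstance(obj, dict):
        out = b"d"
        for key in sorted(obj):          # keys must be in sorted order
            out += bencode(key) + bencode(obj[key])
        return out + b"e"
    raise TypeError(type(obj))

def magnet_for_dump(path, piece_length=2 ** 20):
    pieces, total = b"", 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(piece_length)
            if not chunk:
                break
            pieces += hashlib.sha1(chunk).digest()
            total += len(chunk)
    name = path.rsplit("/", 1)[-1]
    info = {"name": name, "length": total,
            "piece length": piece_length, "pieces": pieces}
    infohash = hashlib.sha1(bencode(info)).hexdigest()
    return "magnet:?xt=urn:btih:%s&dn=%s" % (infohash, quote(name))

# e.g. magnet_for_dump("/dumps/enwiki-latest-pages-articles.xml.bz2")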

With your blessing, I would try to help with it in the context of, say, a Labs
project, if it would be integrated into the dump release process.

Thanks for the great job with the dumps!

Oren Bochman

On Tue, Jun 5, 2012 at 3:15 PM, Ariel T. Glenn ar...@wikimedia.org wrote:

 This is a place where volunteers can step in and make it happen without
 the need for Wikimedia's infrastructure.  (This means I can concentrate
 on my already very full plate of things too.)

 http://meta.wikimedia.org/wiki/Data_dump_torrents

 Have at!

 Ariel

 On 05-06-2012 (Tue), at 08:57 -0400, Derric Atzrott
 wrote:
  I second this idea.  Large archives should always be available using
 bittorrent.  I would actually suggest posting magnet links for them though
 instead of .torrent files.  This way you can leverage the acceptable source
 feature of magnet links.
 
  https://en.wikipedia.org/wiki/Magnet_URI_scheme#Web_links_to_the_file
 
  This way we get the best of both worlds: the constant availability of
 direct downloads, and the reduction in load that p2p filesharing provides.
 
  Thank you,
  Derric Atzrott
 
  -Original Message-
  From: wikitech-l-boun...@lists.wikimedia.org [mailto:
 wikitech-l-boun...@lists.wikimedia.org] On Behalf Of Oren Bochman
  Sent: 05 June 2012 08:44
  To: 'Wikimedia developers'
  Subject: Re: [Wikitech-l] [Xmldatadumps-l] XML dumps/Media mirrors update
 
  Any chance that these archives can be served via BitTorrent - so that
 even partial downloaders can become servers - leveraging p2p to reduce
 overall bandwidth load on the servers and decrease download times?
 
 
  -Original Message-
  From: wikitech-l-boun...@lists.wikimedia.org [mailto:
 wikitech-l-boun...@lists.wikimedia.org] On Behalf Of Mike Dupont
  Sent: Saturday, June 02, 2012 1:28 AM
  To: Wikimedia developers; wikiteam-disc...@googlegroups.com
  Subject: Re: [Wikitech-l] [Xmldatadumps-l] XML dumps/Media mirrors update
 
  I have run cron archiving now every 30 minutes,
 http://ia700802.us.archive.org/34/items/wikipedia-delete-2012-06/
  it is amazing how fast the stuff gets deleted on wikipedia.
  what about the proposed deletes are there categories for that?
  thanks
  mike
 
  On Wed, May 30, 2012 at 6:26 AM, Mike  Dupont 
 jamesmikedup...@googlemail.com wrote:
   https://github.com/h4ck3rm1k3/wikiteam code here
  
   On Wed, May 30, 2012 at 6:26 AM, Mike  Dupont
   jamesmikedup...@googlemail.com wrote:
   Ok, I merged the code from wikteam and have a full history dump
   script that uploads to archive.org, next step is to fix the bucket
   metadata in the script mike
  
   On Tue, May 29, 2012 at 3:08 AM, Mike  Dupont
   jamesmikedup...@googlemail.com wrote:
   Well, I have now updated the script to include  the xml dump in raw
   format. I will have to add more information the achive.org item, at
   least a basic readme.
   other thing is that the wikipybot does not support the full history
   it seems, so that I will have to move over to the wikiteam version
   and rework it, I just spent 2 hours on this so i am pretty happy for
   the first version.
  
   mike
  
   On Tue, May 29, 2012 at 1:52 AM, Hydriz Wikipedia 
 ad...@alphacorp.tk wrote:
   This is quite nice, though the item's metadata is too little :)
  
   On Tue, May 29, 2012 at 3:40 AM, Mike Dupont
   jamesmikedup...@googlemail.com
   wrote:
  
   first version of the Script is ready , it gets the versions, puts
   them in a zip and puts that on archive.org
   https://github.com/h4ck3rm1k3/pywikipediabot/blob/master/export_de
   leted.py
  
   here is an example output :
   http://archive.org/details/wikipedia-delete-2012-05
  
   http://ia601203.us.archive.org/24/items/wikipedia-delete-2012-05/a
   rchive2012-05-28T21:34:02.302183.zip
  
   I will cron this, and it should give a start of saving deleted
 data.
   Articles will be exported once a day, even if they they were
   exported yesterday as long as they are in one of the categories.
  
   mike
  
   On Mon, May 21, 2012 at 7:21 PM, Mike  Dupont
   jamesmikedup...@googlemail.com wrote:
Thanks! and run that 1 time per day, they dont get deleted

Re: [Wikitech-l] [Xmldatadumps-l] XML dumps/Media mirrors update

2012-06-05 Thread Oren Bochman
Any chance that these archives can be served via BitTorrent - so that even
partial downloaders can become servers - leveraging p2p to reduce the overall
bandwidth load on the servers and decrease download times?


-Original Message-
From: wikitech-l-boun...@lists.wikimedia.org 
[mailto:wikitech-l-boun...@lists.wikimedia.org] On Behalf Of Mike Dupont
Sent: Saturday, June 02, 2012 1:28 AM
To: Wikimedia developers; wikiteam-disc...@googlegroups.com
Subject: Re: [Wikitech-l] [Xmldatadumps-l] XML dumps/Media mirrors update

I have run cron archiving now every 30 minutes, 
http://ia700802.us.archive.org/34/items/wikipedia-delete-2012-06/
it is amazing how fast the stuff gets deleted on wikipedia.
what about the proposed deletes are there categories for that?
thanks
mike

On Wed, May 30, 2012 at 6:26 AM, Mike  Dupont jamesmikedup...@googlemail.com 
wrote:
 https://github.com/h4ck3rm1k3/wikiteam code here

 On Wed, May 30, 2012 at 6:26 AM, Mike  Dupont 
 jamesmikedup...@googlemail.com wrote:
 Ok, I merged the code from wikteam and have a full history dump 
 script that uploads to archive.org, next step is to fix the bucket 
 metadata in the script mike

 On Tue, May 29, 2012 at 3:08 AM, Mike  Dupont 
 jamesmikedup...@googlemail.com wrote:
 Well, I have now updated the script to include  the xml dump in raw 
 format. I will have to add more information the achive.org item, at 
 least a basic readme.
 other thing is that the wikipybot does not support the full history 
 it seems, so that I will have to move over to the wikiteam version 
 and rework it, I just spent 2 hours on this so i am pretty happy for 
 the first version.

 mike

 On Tue, May 29, 2012 at 1:52 AM, Hydriz Wikipedia ad...@alphacorp.tk 
 wrote:
 This is quite nice, though the item's metadata is too little :)

 On Tue, May 29, 2012 at 3:40 AM, Mike Dupont 
 jamesmikedup...@googlemail.com
 wrote:

 first version of the Script is ready , it gets the versions, puts 
 them in a zip and puts that on archive.org 
 https://github.com/h4ck3rm1k3/pywikipediabot/blob/master/export_de
 leted.py

 here is an example output :
 http://archive.org/details/wikipedia-delete-2012-05

 http://ia601203.us.archive.org/24/items/wikipedia-delete-2012-05/a
 rchive2012-05-28T21:34:02.302183.zip

 I will cron this, and it should give a start of saving deleted data.
 Articles will be exported once a day, even if they they were 
 exported yesterday as long as they are in one of the categories.

 mike

 On Mon, May 21, 2012 at 7:21 PM, Mike  Dupont 
 jamesmikedup...@googlemail.com wrote:
  Thanks! and run that 1 time per day, they dont get deleted that quickly.
  mike
 
  On Mon, May 21, 2012 at 9:11 PM, emijrp emi...@gmail.com wrote:
  Create a script that makes a request to Special:Export using 
  this
 category
  as feed
  https://en.wikipedia.org/wiki/Category:Candidates_for_speedy_de
  letion
 
  More info
 https://www.mediawiki.org/wiki/Manual:Parameters_to_Special:Export
 
 
  2012/5/21 Mike Dupont jamesmikedup...@googlemail.com
 
  Well I whould be happy for items like this :
  http://en.wikipedia.org/wiki/Template:Db-a7
  would it be possible to extract them easily?
  mike
 
  On Thu, May 17, 2012 at 2:23 PM, Ariel T. Glenn 
  ar...@wikimedia.org
  wrote:
   There's a few other reasons articles get deleted: copyright 
   issues, personal identifying data, etc.  This makes 
   maintaning the sort of mirror you propose problematic, although a 
   similar mirror is here:
   http://deletionpedia.dbatley.com/w/index.php?title=Main_Page
  
   The dumps contain only data publically available at the time 
   of the
 run,
   without deleted data.
  
   The articles aren't permanently deleted of course.  The 
   revisions
 texts
   live on in the database, so a query on toolserver, for 
   example,
 could be
   used to get at them, but that would need to be for research 
   purposes.
  
   Ariel
  
    On 17-05-2012 (Thu), at 13:30 +0200, Mike Dupont
  wrote:
   Hi,
   I am thinking about how to collect articles deleted based 
   on the
 not
   notable criteria,
   is there any way we can extract them from the mysql 
   binlogs? how are these mirrors working? I would be 
   interested in setting up a mirror
 of
   deleted data, at least that which is not spam/vandalism 
   based on
 tags.
   mike
  
   On Thu, May 17, 2012 at 1:09 PM, Ariel T. Glenn 
 ar...@wikimedia.org
   wrote:
We now have three mirror sites, yay!  The full list is 
linked to
 from
http://dumps.wikimedia.org/ and is also available at
   
   
  http://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Current_Mirrors
   
Summarizing, we have:
   
C3L (Brazil) with the last 5 good known dumps, Masaryk 
University (Czech Republic) with the last 5 known good
 dumps,
Your.org (USA) with the complete archive of dumps, and
   
for the latest version of uploaded media, Your.org with 
http/ftp/rsync access.
   
Thanks to Carlos, Kevin and 

Re: [Wikitech-l] [Wiki-research-l] MathJax comes to Wikipedia

2012-05-03 Thread Oren Bochman
Hey, this is so wonderful. I've been working with formulas on Wikipedia and
on Meta, and they are so ugly.

One really important feature to check is whether it is possible, for several
numbered formulas, to have all the numbers appear aligned on the right.

I'll be glad to beta test.


On Thu, May 3, 2012 at 6:49 PM, Erik Moeller e...@wikimedia.org wrote:

 On Thu, May 3, 2012 at 9:44 AM, Dario Taraborelli
 dtarabore...@wikimedia.org wrote:
  MathJax [1] is now enabled site-wide as an opt-in preference. You can
 now see beautifully rendered, accessible, copypasteable and
 standard-compliant (MathML) formulas on Wikipedia, replacing the old
 TeX-rendered PNGs.

 Thanks Dario. There are definitely still bugs in this experimental
 rendering mode, so please report issues in Bugzilla against the Math
 component:


 https://bugzilla.wikimedia.org/enter_bug.cgi?product=MediaWiki%20extensions&component=Math

 More here:

 http://www.mediawiki.org/wiki/Extension:Math/MathJax_testing

 --
 Erik Möller
 VP of Engineering and Product Development, Wikimedia Foundation

 Support Free Knowledge: https://wikimediafoundation.org/wiki/Donate

 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l




-- 

Oren Bochman

Office tel. 061 4921492
Mobile +36 30 866 6706
skype id: orenbochman
e-mail: o...@romai-horizon.com
site http://www.riverport.hu
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] New Engineering Community Group, headed by Sumana Harihareswara

2012-04-25 Thread Oren Bochman
Great news.

Congratulations Sumana !!

I think that this is greatly deserved - I'm glad that you will be having
an even greater impact, bringing more people into our ecosystem.

Oren Bochman
Lead of Search

Operation Manager 
E-mail: o...@romai-horizon.com
Mobil: +36 30 866 6706



Római Horizon Kft. 
H-1039 Budapest 
Királyok útja  291. D. ép. fszt. 2.
Tel:   +36 1 492 1492
Fax:  +36 1 266 5529


-Original Message-
From: wikitech-l-boun...@lists.wikimedia.org
[mailto:wikitech-l-boun...@lists.wikimedia.org] On Behalf Of Rob Lanphier
Sent: Wednesday, April 25, 2012 5:30 AM
To: Wikimedia developers
Subject: [Wikitech-l] New Engineering Community Group, headed by Sumana
Harihareswara

Hi everyone,

I'm happy to announce that we have promoted Sumana Harihareswara as manager
of Engineering Community group.  Sumana started with us as a contractor back
in February 2011, initially in a targeted engagement to help out with Google
Summer of Code and with the Berlin Hackathon last year.  Later that year, as
we interviewed people to bring in as Volunteer Development Coordinator, not
only did Sumana put in a strong application herself, but recruited very
worthy competition for the role.  After winning the role, she worked
tirelessly to straighten out many kinks in our processes around volunteer
development and systematically ensured that new volunteer developers get the
recognition and (if needed) help they deserve.  She has also applied focus
and organization in many areas outside of her immediate purview, for
example, recently stepping in as project manager for Git, and
occasionally filling in for me when I've been unavailable for the larger
Platform Engineering organization.

The promotion to Engineering Community Manager isn't so much a change in the
way things are done here so much as an official recognition of a vital role
that she has already played for the past year.  Sumana has been working with
Guillaume Paumier and Mark Hershberger under the somewhat ad hoc group title
of Technical Liaison; Developer Relations (tl;dr), serving as lead of that
group since last year.  Under the new Engineering Community name, this
group will continue to serve many roles:  facilitating collaboration and
communication between Wikimedia Foundation and its employees and the larger
Wikimedia developer community, as well as facilitating collaboration and
communication between the Wikimedia developer community and other Wikimedia
communities.

Thank you, Sumana, for your hard work over the past year.  I'm looking
forward to seeing what you and the group accomplish moving forward.
Congratulations!

Rob

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Unified login vs. unified settings

2012-04-16 Thread Oren Bochman
I'd love it too - but I noticed that different wikis have quite different
settings due to different gadgets and extensions being available.

So a good solution would have to be smart enough to accommodate this.



Oren Bochman
-Original Message-
From: wikitech-l-boun...@lists.wikimedia.org
[mailto:wikitech-l-boun...@lists.wikimedia.org] On Behalf Of Bináris
Sent: Monday, April 16, 2012 7:44 AM
To: Wikimedia developers
Subject: Re: [Wikitech-l] Unified login vs. unified settings

2012/4/15 Ole Palnatoke Andersen palnat...@gmail.com

 Hi!

 I would love to be able to manage my settings in one place


There is somewhere a bot that does it for you, but I don't remember where I
saw it.

--
Bináris
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] GSOC 2012 : Lucene Automatic Query Expansion From Wikipedia Text

2012-04-06 Thread Oren Bochman
Hi Robert Stojnic and Gautham Shankar,

I wanted to let Gautham know that he has written a great proposal, and to
thank you for the feedback as well.

I wanted to point out that, in my view, the main goal of this
multilingual wordnet isn't query expansion, but rather a means to ever
greater cross-language capabilities in search and content analytics. A
wordnet seme can be further disambiguated using a topic-map algorithm run
that would consider all the contexts, as you suggest, but this is planned
for later, so the wordnet would be a milestone.
To clarify further, Gautham's integration will place cross-language-seme
WordNet tokens into the index for the words it recognises, allowing the
ranking algorithm to use knowledge drawn from all the Wikipedia editions.
(For example, one part of the ranking would peek at a featured article on A
in German and rank it, then at B featured in Hungarian, and use them as
oracles to rank A > B > ... in English, where the picture might now be X
> Y > Z > ... > B > A ...)
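
As a toy illustration of the seme tokens, using NLTK's English WordNet as a
stand-in for the multilingual wordnet the project would build, and skipping
the disambiguation step entirely:

from nltk.corpus import wordnet as wn   # needs the NLTK WordNet data installed

def seme_tokens(surface_token):
    # Language-independent synset identifiers to index alongside the surface
    # token; a real indexer would first disambiguate to one of them via the
    # topic-map pass instead of emitting them all.
    return [synset.name() for synset in wn.synsets(surface_token)]

# seme_tokens("bank") -> ['bank.n.01', 'bank.n.02', ...] (roughly)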

I mention in passing that I have begun to develop a dataset for use with Open
Relevance, to systematically review and evaluate dramatic changes in relevance
caused by changes to the search engine. I will post on this in due course as
it matures, since I am working on a number of smaller projects I'd like to
demo at Wikimania.

On Fri, Apr 6, 2012 at 6:01 PM, Gautham Shankar 
gautham.shan...@hiveusers.com wrote:

 Robert Stojnic rainmansr at gmail.com writes:

 
 
  Hi Gautham,
 
  I think mining wiktionary is an interesting project. However, about the
  more practical Lucene part: at some point I tried using wordnet to
  expand queries however I found that it introduces too many false
  positives. The most challenging part I think it *context-based*
  expansion. I.e. a simple synonym-based expansion is of no use because it
  introduces too many meanings that the user didn't quite have in mind.
  However, if we could somehow use the words in the query to find a
  meaning from a set of possible meanings that could be really helpful.
 
  You can look into existing lucene-search source to see how I used
  wordnet. I think in the end I ended up using it only for very obvious
  stuff (e.g. 11 = eleven, UK = United Kingdom, etc..).
 
  Cheers, r.
 
  On 06/04/12 01:58, Gautham Shankar wrote:
   Hello,
  
   Based on the feedback i received i have updated my proposal page.
  
   https://www.mediawiki.org/wiki/User:Gautham_shankar/Gsoc
  
   There is about 20 Hrs for the deadline and any final feedback would be
   useful.
   I have also submitted the proposal at the GSOC page.
  
   Regards,
   Gautham Shankar
   ___
   Wikitech-l mailing list
   Wikitech-l at lists.wikimedia.org
   https://lists.wikimedia.org/mailman/listinfo/wikitech-l
  
 

 Hi Robert,

 Thank you for your feedback.
 Like you pointed out, query expansion using the wordnet data directly,
 reduces
 the quality of the search.

 I found this research paper very interesting.
 http://www.sftw.umac.mo/~fstzgg/dexa2005.pdf
 They have built a TSN (Term Semantic Network) for the given query based on
 the
 usage of words in the documents. The expansion words obtained from the
 wordnet
 are then filtered out based on the TSN data.

 I did not add this detail to my proposal since i thought it deals more
 with the
 creation of the wordnet. I would love to implement the TSN concept once the
 wordnet is complete.

 Regards,
 Gautham Shankar



 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Hi again

-- 

Oren Bochman

Office tel. 061 4921492
Mobile +36 30 866 6706
skype id: orenbochman
e-mail: o...@romai-horizon.com
site http://www.riverport.hu
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] GSOC 2012

2012-04-05 Thread Oren Bochman
Hi, we are running out of time.

Thanks for your interest in our projects at Wikimedia.
The GSoC proposals should be mostly specified by the student.
Those of you who have not done so should draft proposals, place them on
www.mediawiki.org in your user space, and then post a link here or email me so
we can process them.

1. I have expanded the requirements of my project ideas a bit, but I
have left room for your ideas.
There is plenty of similar work published on these subjects - research it
and refine your proposals with the tools/algorithms you would like to use and
your preferred formats, so that the deliverables would be widely reused.
2. I am contacting two researchers who have worked on similar projects
to check whether they wish to cooperate by contributing code and helping with
the linguistics side of the mentoring.
3. I can answer specific questions you have about expectations.

To optimally match you with a suitable high-impact project, please let us
know your development experience - what projects have you done, and where -

and especially what your experience is with:
*Java and other programming languages
*PHP
*Apache Lucene or Solr
*Natural Language Processing
*Data Mining
*Corpus Linguistics
*WordNet

Since these projects are highly multilingual, please tell us what your
native language is and what other languages you can use (on a scale from 1,
beginner, to 5, near-native).








Operation Manager 
E-mail: o...@romai-horizon.com
Mobil: +36 30 866 6706



Római Horizon Kft. 
H-1039 Budapest 
Királyok útja  291. D. ép. fszt. 2.
Tel:   +36 1 492 1492
Fax:  +36 1 266 5529


-Original Message-
From: wikitech-l-boun...@lists.wikimedia.org
[mailto:wikitech-l-boun...@lists.wikimedia.org] On Behalf Of Sudeep Singh
Sent: Tuesday, April 03, 2012 8:48 PM
To: wikitech-l@lists.wikimedia.org
Subject: [Wikitech-l] GSOC 2012

Hi,

I am sudeep. I am final year student at Indian Institute of Technology,
Kharagpur in the computer science department.

I am interested to apply in the following projects for gsoc 2012

1. Lucene automatic query expansion from wikipedia text 2. Backwards
compatibility extension 3. Semantic form rules 4. Index transcluded text in
search

I have a strong background in Information retrieval and Machine learning. I
have worked previously with Yahoo Research Labs in the area of Information
retrieval. We extracted association rules and attribite-value pairs from the
webpages using unsupervised approach.

I have also worked on another project with yahoo, which involved emotion
detection of youtube videos, based on the comments of the users. We used
various ML, Statisitcs andf IR techniques to achieve our goal.

I last year succesfully completed GSOC 2011, with OSGEO and have good
experience in Open Source Development.

Kindly let me know how shall I proceed with my application.

Thanks
regards
Sudeep
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] GSoC 2012: Proposal-Wikipedia Corpus Tools

2012-04-04 Thread Oren Bochman
You do understand correctly!

The main idea for the NLP components, with the POS tagger as an example (see
the sketch below), is:

1. A fallback system that does unsupervised POS tagging.
2. The ability to plug in an existing POS tagger as one becomes available for
a specific language.
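
A minimal sketch of that plug-in arrangement (the registry and the placeholder
fallback are made up for illustration; the real fallback would be an
unsupervised tagger trained on the corpus itself):

from typing import Callable, Dict, List, Tuple

Tagger = Callable[[List[str]], List[Tuple[str, str]]]

# Registry of language-specific taggers, contributed as they become available.
TAGGERS: Dict[str, Tagger] = {}

def pos_tag(tokens: List[str], lang: str) -> List[Tuple[str, str]]:
    # Use a plugged-in tagger for the language if one exists; otherwise fall
    # back to a placeholder standing in for the unsupervised tagger.
    tagger = TAGGERS.get(lang)
    if tagger is not None:
        return tagger(tokens)
    return [(token, "X") for token in tokens]   # placeholder fallback

# Example: plug in NLTK's English tagger where it is available.
# import nltk
# TAGGERS["en"] = nltk.pos_tag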

As supervisor, I would recommend working with three languages:
English, Hebrew, and the GSoC student's native language.

If we could get QA from other native speakers, we would incorporate them into
the workflow.

I think that by using a deletion/reversion-based heuristic we may also be able
to build a spam corpus, which would boost the accuracy of the corpora.

Operation Manager 
E-mail: o...@romai-horizon.com
Mobil: +36 30 866 6706



Római Horizon Kft. 
H-1039 Budapest 
Királyok útja  291. D. ép. fszt. 2.
Tel:   +36 1 492 1492
Fax:  +36 1 266 5529

-Original Message-
From: wikitech-l-boun...@lists.wikimedia.org 
[mailto:wikitech-l-boun...@lists.wikimedia.org] On Behalf Of Amir E. Aharoni
Sent: Tuesday, April 03, 2012 10:19 PM
To: Wikimedia developers
Subject: Re: [Wikitech-l] GSoC 2012: Proposal-Wikipedia Corpus Tools

2012/4/3 karthik prasad karthikprasad...@gmail.com:
 Hello,
 I am a GSoC aspirant and have compiled a proposal for one of the 
 project ideas - Wikipedia Corpus Tools. [Mentor : Oren Bochman] I 
 would sincerely appreciate if you could kindly go through it and 
 suggest corrections/additions so that I can settle with a coherent proposal.

 Link to my proposal :
 https://www.mediawiki.org/wiki/User:Karthikprasad/gsoc2012proposal

Nice, but why only English?

If i understand the proposal correctly, this project is supposed to be able to 
work with almost any language with very little effort.

--
Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי http://aharoni.wordpress.com 
‪“We're living in pieces, I want to live in peace.” – T. Moore‬

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] GSOC 2012 - Text Processing and Data Mining

2012-04-02 Thread Oren Bochman
Dear Karthik Prasad and other GSoC candidates,

 

I was not getting this list but I am now.

 

The GSOC proposal should be specified by the student.

 

I can expand the details of these projects.

I can answer specific questions you have about expectations.

 

To optimally match you with a suitable high-impact project - to what extent
are you familiar with:

*Java and other programming languages?

*PHP?

*Apache Lucene?

*Natural Language Processing?

*Corpus Linguistics?

*WordNet?

 

The listed projects would be either wrapped as services, consumed by
downstream projects, or both.

 

The corpus is the simplest but requires lots of attention to detail. When
successful, it would be picked up by lots of researchers and companies who do
not have the resources for such CPU-intensive tasks.

For the WMF it would provide us with a standardized body for future NLP work.
A part-of-speech-tagged corpus would be immediately useful for an 80%-accurate
word-sense disambiguation in the search engine.

 

Automatic summaries are not a strategic priority AFAIK:

1. Most articles provide a kind of abstract in their intro.

2. Something like this is already provided in the dumps for Yahoo.

3. I have been using a great pop-up preview widget on Wiktionary for a year
or so.

 

I do think it would be a great project for learning how to become a MediaWiki
developer, but it is small for a GSoC.
However, I cannot speak for Jebald and other mentors on the mobile and other
teams who might be interested in this.



If your essay grader is working, it could be the basis of another very
exciting GSoC project aimed at article quality.

An NLP-savvy article-quality assessment service could improve/expand
the current bots grading articles.
Grammar and spelling are two good indicator features. However, a full
assessment of Wikipedia articles would
require more detail - both stylistic and information-based. Once you have
covered sufficient features,
building discriminators from samples of graded articles would require
some data-mining ability.

 

However, since there is an existing bot undergoing upgrades, we would have
to check with its small dev team what it is currently doing,
and it would be subject to community oversight.

 

Yours Sincerely,

 

Oren Bochman

 

MediaWiki Search Developer

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l