Re: [Wikitech-l] OPW intern looking for feedback!

2013-03-27 Thread bawolff
On Wed, Mar 27, 2013 at 9:31 AM, Daniel Friesen
 wrote:
> On Wed, 27 Mar 2013 00:19:53 -0700, Brian Wolff  wrote:

>
>
> Please don't. I've been trying to slowly move us away from depending on
> $wgSecretKey's secrecy for security. Eventually I hope to try and eliminate
> dependence on it from extensions too. And in an ideal case, eventually stop
> setting it in the installer (unless you have an edge case where a little
> more entropy for CryptRand could be useful; or maybe not, I need to
> double-check which case that was, but it might not even exist anymore with
> our version requirements).
>
> I see people, over and over, asking for help and inadvertently handing
> that information, which is supposed to remain secret, right over in public.
>
> Instead of trying to make the paths a secret, just don't put that data
> inside public /tmp directories.
> I recommend setting your git directory config to false and, in an extension
> setup function, setting it to some path based on the upload directory.
> This is basically what we used to do with $wgTmpDirectory, which was used by
> CACHE_DBA.
>
>
>
> --
> ~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://danielfriesen.name/]
>
>
>

Getting slightly off-topic, but a world where people stop spamming us
with $wgSecretKey would be nice ;)

However, you're still going to have $wgUpgradeKey and $wgDBpass ...
Perhaps it'd be cool to split LocalSettings.php into LocalSettings.php
and PrivateSettings.php.
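
Roughly something like this (untested sketch; PrivateSettings.php is just
the hypothetical name from above, and the values are placeholders):

<?php
// PrivateSettings.php (sketch): secrets only; the values are placeholders.
$wgDBpass     = 'placeholder-db-password';
$wgSecretKey  = 'placeholder-secret-key';
$wgUpgradeKey = 'placeholder-upgrade-key';

// ...and near the top of LocalSettings.php, just:
//   require_once __DIR__ . '/PrivateSettings.php';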


> I recommend setting your git directory config to false and in an extension
> setup function set it to some path based on the upload directory

Given that the upload directory is web-accessible (and many people
don't even turn off php_engine in that directory [speaking of which,
why don't we add that to the default .htaccess for that directory?]),
having arbitrary git checkouts in such a directory seems kind of scary
too.


--bawolff


Re: [Wikitech-l] OPW intern looking for feedback!

2013-03-27 Thread Daniel Friesen

On Wed, 27 Mar 2013 00:19:53 -0700, Brian Wolff  wrote:

> Note, using directory names that can be pre-determined in a public
> /tmp directory is a bit dangerous on a shared server. Another user
> could make the directory, put something malicious in it (for example
> an evil post-merge hook), and then have your script use the malicious
> data. One way around that could be to add the $wgSecretKey (and some
> salt) to the variables that generate the hash that becomes the
> directory name.


Please don't. I've been trying to slowly move us away from depending on
$wgSecretKey's secrecy for security. Eventually I hope to try and eliminate
dependence on it from extensions too. And in an ideal case, eventually stop
setting it in the installer (unless you have an edge case where a little
more entropy for CryptRand could be useful; or maybe not, I need to
double-check which case that was, but it might not even exist anymore with
our version requirements).


I see people, over and over, asking for help and inadvertently handing
that information, which is supposed to remain secret, right over in public.


Instead of trying to make the paths a secret, just don't put that data
inside public /tmp directories.
I recommend setting your git directory config to false and, in an extension
setup function, setting it to some path based on the upload directory.
This is basically what we used to do with $wgTmpDirectory, which was used
by CACHE_DBA.
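
For Git2Pages that pattern would look roughly like this (a sketch:
$wgGit2PagesDirectory and the setup function name are made up for
illustration; $wgUploadDirectory and $wgExtensionFunctions are the real
core bits):

<?php
// Sketch only: $wgGit2PagesDirectory and efGit2PagesSetup are made-up names;
// $wgUploadDirectory and $wgExtensionFunctions are real core settings.
$wgGit2PagesDirectory = false; // default: decide at setup time

$wgExtensionFunctions[] = 'efGit2PagesSetup';

function efGit2PagesSetup() {
    global $wgGit2PagesDirectory, $wgUploadDirectory;
    if ( $wgGit2PagesDirectory === false ) {
        // Keep clones under the upload directory instead of a shared /tmp.
        $wgGit2PagesDirectory = "$wgUploadDirectory/git2pages";
    }
}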



> Cheers,
> bawolff
>
> p.s. if any of that was confusing or unclear, please let me know.


--
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://danielfriesen.name/]



Re: [Wikitech-l] OPW intern looking for feedback!

2013-03-27 Thread Brian Wolff
Hi,

Your extension looks quite good. I did an extensive review of your
code over at https://www.mediawiki.org/wiki/User:Bawolff/review/Git2Pages
. While that page may look very long, most of the issues listed there
are minor nitpicks.


> - GitRepository will do a sparse checkout on the information, that is,
> it will clone the repository but only keep the specified file (this
> was implemented to save space)

I don't think that works the way you think it does. The entire repo is
still there (in fact, the entire history of the repo is there).

When I was first testing your extension, I tried to load a snippet
from mediawiki/core.git . The process of cloning the repo basically
made my web server unresponsive (eventually I got bored and restarted
Apache; I'm not sure why wfShellExec's limits didn't kick in before
that point). I think you need to do a shallow clone, since you only
need one revision of the file.
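
Something along these lines would do it (a sketch; $url, $branch and
$targetDir are placeholders for whatever the extension already has at
that point):

<?php
// Sketch: a shallow, single-branch clone, so the web server never downloads
// the full history of something like mediawiki/core.git.
$cmd = 'git clone --depth 1 --branch ' . wfEscapeShellArg( $branch ) .
    ' ' . wfEscapeShellArg( $url ) . ' ' . wfEscapeShellArg( $targetDir );
$retval = null;
$output = wfShellExec( $cmd, $retval );
if ( $retval !== 0 ) {
    wfDebugLog( 'git2pages', "git clone failed ($retval): $output" );
}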

> - The repositories will be cloned into a folder that is an md5 hash of
> the url + branch to make sure that the program isn't cloning a ton of
> copies of the same repository

Martijn wrote:
>>Why hash it, and not just keep the url + branch encoded to some charset
>>that is a valid path, saving rare yet hairy collisions?

I actually like the hashing. The chance of a collision happening seems
so unlikely that it's not worth worrying about (if it is a concern, one
could use sha256 encoded as base36 to make collisions even less
likely). It also prevents the somewhat unlikely possibility (but more
likely than a collision, imo) of a url+branch combination that is > 255
characters.

One thing I would recommend is prefixing the directory names with
something like mw-git2page- so that if a system admin sees all
these entries, s/he would know where they are coming from.

Note, using directory names that can be pre-determined in a public
/tmp directory is a bit dangerous on a shared server. Another user
could make the directory, put something malicious in it (for example
an evil post-merge hook), and then have your script use the malicious
data. One way around that could be to add the $wgSecretKey (and some
salt) to the variables that generate the hash that becomes the
directory name.
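
Roughly like this (a sketch; the function name is made up, and whether
mixing in $wgSecretKey is a good idea at all is debated elsewhere in
this thread):

<?php
// Sketch of the naming idea above; the function name is made up. The prefix
// makes the entries recognisable, the hash keeps names short and
// filesystem-safe, and the secret/salt only matters if the clones really do
// live in a world-readable /tmp.
function git2pagesCloneDirName( $url, $branch ) {
    global $wgSecretKey;
    $salt = 'git2pages'; // fixed, extension-specific salt (illustrative)
    $hash = hash( 'sha256', $salt . '|' . $wgSecretKey . '|' . $url . '|' . $branch );
    // base-36 is a bit shorter than the 64-character hex form
    return 'mw-git2page-' . wfBaseConvert( $hash, 16, 36 );
}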

>
> This is my baseline program. It works (for me at least). I have a few
> ideas of what to work on next, but I would really like to know if I'm
> going in the right direction. Is this something you would use? How
> does my code look, is the implementation up to the MediaWiki coding
> standard? You can find the progression of the code on
> gerrit[3].

Your code actually looks quite good. Better than much of the code that
has come out of "mentorship"-type projects in the past (imho). I put
quite extensive comments on the wiki page I linked above.

>
> Here are some ideas of what I might want to implement while still on
> the internship:
> - Instead of a  tag, encase it in a  tag if
> it's code, maybe add a flag for user to supply the language

That would be cool.

> - Keep a database of all the repositories that a wiki has (though not
> sure how to handle deletions)

One possibility with that (although quite a bit of work) would be to
have a post-commit hook on the git server that causes the wiki page to
be re-rendered whenever a snippet it includes changes.

As for deletion: check out how the LinksUpdate stuff works, and how
extensions like GlobalUsage that tie into it work. In particular, see
the various hooks in that file:
https://www.mediawiki.org/wiki/Category:MediaWiki_hooks_included_in_LinksUpdate.php
If you meant page deletion (instead of snippet deletion), see the
ArticleDeleteComplete hook.
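
A bare-bones sketch of the page-deletion side (class and table names
are made up for illustration, not Git2Pages' actual ones):

<?php
// Sketch: drop the extension's bookkeeping rows when a page is deleted.
$wgHooks['ArticleDeleteComplete'][] = 'Git2PagesHooks::onArticleDeleteComplete';

class Git2PagesHooks {
    public static function onArticleDeleteComplete( &$article, &$user, $reason, $id ) {
        $dbw = wfGetDB( DB_MASTER );
        // Hypothetical tracking table mapping page IDs to the repos/snippets they use.
        $dbw->delete( 'git2pages_usage', array( 'g2p_page_id' => $id ), __METHOD__ );
        return true;
    }
}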

>
> Here are some problems I might face:
> - If I update the working tree each time a file from the same
> repository is added, then the line numbers may not match the old file
> - Should I be periodically updating the repositories or perhaps keep
> multiple snapshots of the same repository

I think it's expected that if somebody puts {{#snippet:...}} on a page,
and then someone commits an update to the file being transcluded by
{{#snippet}}, the {{#snippet:...}} output would eventually update (i.e.
on the next page parse).

Having an option to include a snippet from a specific commit, e.g.
{{#snippet:filename=foo|repository=bar|revision=bf75623112354}}, would
be cool (or could the branch option be used to do this already?).
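
If the revision idea goes in, the named arguments could be picked apart
with something along these lines (a sketch; the real #snippet handler's
parameter handling may look quite different):

<?php
// Sketch: turning the raw key=value parts of {{#snippet:...}} into an options
// array with an optional 'revision'. The function name is made up.
function git2pagesParseSnippetArgs( array $rawArgs ) {
    $options = array();
    foreach ( $rawArgs as $arg ) {
        $parts = explode( '=', $arg, 2 );
        if ( count( $parts ) === 2 ) {
            $options[trim( $parts[0] )] = trim( $parts[1] );
        }
    }
    // Default: follow the branch tip unless a specific commit was asked for.
    $options += array( 'revision' => null );
    return $options;
}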

Also, having an option to start at some line based on some pattern
(start at the line that has the word foo on it) might be cool.

> - Cloning an entire repository and keeping only one file does not seem
> ideal, but I've yet to find a better solution, the more repositories
> being used concurrently the bigger an issue this might be

Yes, that's a problem, especially on a big repository. At the very
least it should be a shallow clone. Note that when you clone the repo,
you keep the entire repo, even if only some files are checked out.

> - I'm also worried about security implications of my program. Security
> isn't my area of expertise, and I would definitely appreciate

Re: [Wikitech-l] OPW intern looking for feedback!

2013-03-26 Thread Teresa Cho
> Why hash it, and not just keep the url + branch encoded to some charset
> that is a valid path, saving rare yet hairy collisions?

Hmm, good point. Originally, I had my program clone via a plain git
clone that would just clone into a folder named after the repository.
Therefore, if we had two repositories named MyRepo, it would cause
problems. We (my mentor Sebastien, aka Dereckson, and I) had done the
md5 hash as a way to combat that, but I suppose we didn't think of
creating a folder using just the name of the url. I think this was
back when we might use local repositories. My program can use urls
like this: /some/path/to/repo. In which case, you may get different
local urls? Thinking about it, it doesn't really make sense. I think
this is just one of those scenarios where we thought of a solution
that worked and the program evolved in a way where it wasn't necessary
anymore. Changing the hash back to a charset-encoded path should be an
easy fix.
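
Something as simple as this might do for the non-hashed naming (a
sketch; the length caveat Brian mentioned still applies, and the
function name is made up):

<?php
// Sketch of the non-hashed naming: rawurlencode() escapes '/' and other unsafe
// characters, so url + branch collapse to a single path component. Very long
// combinations can still exceed filename length limits.
function git2pagesEncodedDirName( $url, $branch ) {
    return 'mw-git2page-' . rawurlencode( $url . '|' . $branch );
}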

Speaking of local repositories, another idea was to disable the option
to have a local repository as the git url as a safety precaution or
something. Should I do away with the local repo feature entirely or
should I set a flag and have it be off by default, but allow the user
to choose to turn it on?

> Will there be a re-checkout for a duplicate request? Will the cache of
> files ever be cleaned?

I did a little test and it looks like if you duplicate the request,
the original content doesn't change. So even if it has been remotely
changed, the local copy doesn't change. Additionally, when you add a
file to the sparse-checkout, the ` git read-tree -mu HEAD ` command
only seems to pull the added file but doesn't seem to change the
contents of a file that's already in the repo. So right now the only
way to get new content is to delete the repository and rerun it. This
is a good thing to work on. I'm not sure what you mean by the cache of
the files being cleaned. Can you elaborate on the scenario so I can
give you a better answer?
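
For reference, one possible refresh step might look roughly like this
(completely untested sketch; whether it fits the current layout is an
open question, and $repoDir and $branch are placeholders):

<?php
// Untested sketch of a refresh step for an existing sparse clone.
$dir = wfEscapeShellArg( $repoDir );
$retval = null;
// Fetch the new tip of the tracked branch...
wfShellExec( "cd $dir && git fetch origin " . wfEscapeShellArg( $branch ), $retval );
// ...then move the (sparse) working tree onto it, rewriting the checked-out files.
if ( $retval === 0 ) {
    wfShellExec( "cd $dir && git reset --hard FETCH_HEAD", $retval );
}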

Thanks for your feedback! :)

-- Teresa


Re: [Wikitech-l] OPW intern looking for feedback!

2013-03-26 Thread Martijn Hoekstra
On Mar 27, 2013 1:31 AM, "Teresa Cho"  wrote:
>
> Hi everyone,
>
> My name is Teresa (or terrrydactyl if you've seen me on IRC) and I've
> been interning at Wikimedia for the last few months through the
> Outreach Program for Women[1]. My project, Git2Pages[2], is an
> extension to pull snippets of code/text from a git repository. I've
> been working hard on learning PHP and the MediaWiki
> framework/development cycle. My internship is ending soon and I wanted
> to reach out to the community and ask for feedback.
>

Cool stuff!

> Here's what the program currently does:
> - User supplies (git) url, filename, branch, startline, endline using
> the #snippet tag
> - Git2Pages.body.php will validate the information and then pass on
> the inputs into my library, GitRepository.php
> - GitRepository will do a sparse checkout on the information, that is,
> it will clone the repository but only keep the specified file (this
> was implemented to save space)
> - The repositories will be cloned into a folder that is an md5 hash of
> the url + branch to make sure that the program isn't cloning a ton of
> copies of the same repository

Why hash it, and not just keep the url + branch encoded to some charset
that is a valid path, saving rare yet hairy collisions?

> - If the repository already exists, the file will be added to the
> sparse-checkout file and the program will update the working tree

Will there be a re-checkout for a duplicate request? Will the cache of
files ever be cleaned?

> - Once the repo is cloned, the program will go and yank the lines that
> the user requested and it'll return the text encased in a  tag.
>
> This is my baseline program. It works (for me at least). I have a few
> ideas of what to work on next, but I would really like to know if I'm
> going in the right direction. Is this something you would use? How
> does my code look, is the implementation up to the MediaWiki coding
> standard? You can find the progression of the code on
> gerrit[3].
>
> Here are some ideas of what I might want to implement while still on
> the internship:
> - Instead of a  tag, encase it in a  tag if
> it's code, maybe add a flag for user to supply the language
> - Keep a database of all the repositories that a wiki has (though not
> sure how to handle deletions)
>
> Here are some problems I might face:
> - If I update the working tree each time a file from the same
> repository is added, then the line numbers may not match the old file
> - Should I be periodically updating the repositories or perhaps keep
> multiple snapshots of the same repository
> - Cloning an entire repository and keeping only one file does not seem
> ideal, but I've yet to find a better solution, the more repositories
> being used concurrently the bigger an issue this might be
> - I'm also worried about security implications of my program. Security
> isn't my area of expertise, and I would definitely appreciate some
> input from people with a security background
>
> Thanks for taking the time to read this and thanks in advance for any
> feedback, bug reports, etc.
>
> Have a great day,
> Teresa
> http://www.mediawiki.org/wiki/User:Chot
>
> [1] https://www.mediawiki.org/wiki/Outreach_Program_for_Women
> [2] http://www.mediawiki.org/wiki/Extension:Git2Pages
> [3] https://gerrit.wikimedia.org/r/#/q/project:mediawiki/extensions/Git2Pages,n,z
>

[Wikitech-l] OPW intern looking for feedback!

2013-03-26 Thread Teresa Cho
Hi everyone,

My name is Teresa (or terrrydactyl if you've seen me on IRC) and I've
been interning at Wikimedia for the last few months through the
Outreach Program for Women[1]. My project, Git2Pages[2], is an
extension to pull snippets of code/text from a git repository. I've
been working hard on learning PHP and the MediaWiki
framework/development cycle. My internship is ending soon and I wanted
to reach out to the community and ask for feedback.

Here's what the program currently does:
- User supplies (git) url, filename, branch, startline, endline using
the #snippet tag
- Git2Pages.body.php will validate the information and then pass on
the inputs into my library, GitRepository.php
- GitRepository will do a sparse checkout on the information, that is,
it will clone the repository but only keep the specified file (this
was implemented to save space; see the rough command sequence after this list)
- The repositories will be cloned into a folder that is an md5 hash of
the url + branch to make sure that the program isn't cloning a ton of
copies of the same repository
- If the repository already exists, the file will be added to the
sparse-checkout file and the program will update the working tree
- Once the repo is cloned, the program will go and yank the lines that
the user requested and it'll return the text encased in a  tag.
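
For reference, the sparse checkout boils down to roughly this command
sequence (a sketch; the actual GitRepository.php code may differ, and
$url, $branch, $cloneDir and $file are placeholders):

<?php
// Rough sketch of the sparse-checkout sequence described above.
$dir = wfEscapeShellArg( $cloneDir );
$retval = null;
// Clone without populating the working tree yet.
wfShellExec( 'git clone --no-checkout --branch ' . wfEscapeShellArg( $branch ) .
    ' ' . wfEscapeShellArg( $url ) . ' ' . $dir, $retval );
// Turn on sparse checkout and list the one file we want.
wfShellExec( "cd $dir && git config core.sparseCheckout true", $retval );
wfShellExec( "cd $dir && echo " . wfEscapeShellArg( $file ) . " >> .git/info/sparse-checkout", $retval );
// Populate the working tree with just that file.
wfShellExec( "cd $dir && git read-tree -mu HEAD", $retval );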

This is my baseline program. It works (for me at least). I have a few
ideas of what to work on next, but I would really like to know if I'm
going in the right direction. Is this something you would use? How
does my code look, is the implementation up to the MediaWiki coding
standard? You can find the progression of the code on
gerrit[3].

Here are some ideas of what I might want to implement while still on
the internship:
- Instead of a  tag, encase it in a  tag if
it's code, maybe add a flag for user to supply the language
- Keep a database of all the repositories that a wiki has (though not
sure how to handle deletions)

Here are some problems I might face:
- If I update the working tree each time a file from the same
repository is added, then the line numbers may not match the old file
- Should I be periodically updating the repositories or perhaps keep
multiple snapshots of the same repository
- Cloning an entire repository and keeping only one file does not seem
ideal, but I've yet to find a better solution, the more repositories
being used concurrently the bigger an issue this might be
- I'm also worried about security implications of my program. Security
isn't my area of expertise, and I would definitely appreciate some
input from people with a security background

Thanks for taking the time to read this and thanks in advance for any
feedback, bug reports, etc.

Have a great day,
Teresa
http://www.mediawiki.org/wiki/User:Chot

[1] https://www.mediawiki.org/wiki/Outreach_Program_for_Women
[2] http://www.mediawiki.org/wiki/Extension:Git2Pages
[3] https://gerrit.wikimedia.org/r/#/q/project:mediawiki/extensions/Git2Pages,n,z
