Re: Introducing codesearch.debian.net, a regexp code search engine
On Tue, Nov 06, 2012 at 07:05:43PM +0100, Michael Stapelberg wrote:
> Hi,
>
> I hereby announce a new Debian project: Debian Code Search.
> [...]
> You can use the search engine at http://codesearch.debian.net/
>
> Here are a few sample queries:
> • http://codesearch.debian.net/search?q=workaround+package%3Alinux
> • http://codesearch.debian.net/search?q=XCreateWindow
> • http://codesearch.debian.net/search?q=AnyEvent%3A%3AI3+filetype%3Aperl
>
> The corresponding thesis (and source code, of course) will be released soon (2013-01-15 being the deadline, but I hope I can do it earlier).
>
> I hope you find it useful and would love to hear your feedback.

Just in case the correct use of English is important (and I hope it is), the lines:

  amount of regexp results:
  amount of source results:

should be altered to either:

  Number of regexp results:
  Number of source results:

or simply

  regexp results:
  source results:

See:
http://grammar.about.com/od/words/a/amount.htm
http://grammarist.com/usage/amount-number/

--
If you're not careful, the newspapers will have you hating the people who are being oppressed, and loving the people who are doing the oppressing. --- Malcolm X

--
To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/20121107201157.GI24124@tal
Re: Introducing codesearch.debian.net, a regexp code search engine
Hi Chris,

Chris Bannister cbannis...@slingshot.co.nz writes:
> See:
> http://grammar.about.com/od/words/a/amount.htm
> http://grammarist.com/usage/amount-number/

Thanks. I have heard about this rule but must have forgotten it. I changed the text and will push an update soon.

--
Best regards,
Michael

Archive: http://lists.debian.org/x6vcdhcfyh@midna.zekjur.net
Re: Introducing codesearch.debian.net, a regexp code search engine
Hi Neil,

Neil Williams codeh...@debian.org writes:
> That's just swamped by licences, as would be "received" and lots of other common words (which are, rightly or wrongly, used as variable names or as part of function names).

Well, of course searching for common words will result in a lot of results. Asking the other way around: What is your expected result for something like "modify", even if comments were ignored?

> http://codesearch.debian.net/search?q=codehelp+filetype%3Aperl
>
> filetype:perl just doesn't seem to be working:
> http://codesearch.debian.net/search?q=QofBook+filetype%3Aperl
> ... lists a lot of .c files ...
>
> filetype:python does the same - some .py but then a lot more .c

Thanks, this is fixed now.

--
Best regards,
Michael

Archive: http://lists.debian.org/x6mwytcetq@midna.zekjur.net
Re: Introducing codesearch.debian.net, a regexp code search engine
On Wed, 07 Nov 2012 21:56:17 +0100 Michael Stapelberg stapelb...@debian.org wrote:
> Neil Williams codeh...@debian.org writes:
> > That's just swamped by licences, as would be "received" and lots of other common words (which are, rightly or wrongly, used as variable names or as part of function names).
>
> Well, of course searching for common words will result in a lot of results. Asking the other way around: What is your expected result for something like "modify", even if comments were ignored?

Function names and variables which contain the word "modify"... bytes_received could be a very common variable, but it could also be bytesReceived or received_bytes depending on the convention. It's just the kind of thing to search for when looking for buffer overflows.

My own initial query was QofBook.
http://codesearch.debian.net/search?q=QofBook+filetype%3Ac
http://codesearch.debian.net/search?q=QofBook+filetype%3Ac+package%3Aqof

Any variable/class which is used as a base struct/class across a library or which is contained within a lot of other structs in a library is going to come up again and again in documentation comments and in class/struct definitions.

> > http://codesearch.debian.net/search?q=codehelp+filetype%3Aperl
> >
> > filetype:perl just doesn't seem to be working:
> > http://codesearch.debian.net/search?q=QofBook+filetype%3Aperl
> > ... lists a lot of .c files ...
> >
> > filetype:python does the same - some .py but then a lot more .c
>
> Thanks, this is fixed now.

Now it's missing known hits:
http://codesearch.debian.net/search?skip=17q=noauth+filetype%3Aperl

Should find listings in multistrap, which this search finds:
http://codesearch.debian.net/search?q=noauth+package%3Amultistrap

Just because a file doesn't end in .pl, doesn't mean it isn't perl - Policy mandates that perl in /usr/bin does not end in .pl

Is this only finding perl modules and perl scripts in /usr/share? That's a bigger problem than the extra listings for comments.

e.g.
http://codesearch.debian.net/search?q=codehelp+filetype%3Aperl
Now lists lots of .pm and .pl files but nothing else. dpkg-cross is listed as a .pm but not as the executable dpkg-cross. wrap-lintian.pl is listed but not multistrap. Grip.pm is listed but not emgrip.

--
Neil Williams = http://www.linux.codehelp.co.uk/
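
The naming-convention point in this message (bytes_received vs. bytesReceived vs. received_bytes) is the kind of case a regular expression search handles in one query. A minimal sketch, using Python's `re` module rather than the search engine itself; the sample lines are made up for illustration, and whether the site accepts exactly this pattern is not confirmed anywhere in the thread:

```python
import re

# One pattern covering the three spellings named in the message.
VARIANTS = re.compile(r"bytes_?[Rr]eceived|[Rr]eceived_?[Bb]ytes")

# Hypothetical source lines, invented for this example.
LINES = [
    "size_t bytes_received = 0;",
    "int bytesReceived;",
    "total += received_bytes;",
    "/* number of bytes that were received */",
]


def matching_lines(lines):
    """Return the lines that contain any of the identifier spellings."""
    return [line for line in lines if VARIANTS.search(line)]


for line in matching_lines(LINES):
    print(line)
```

The pattern avoids backreferences and lookarounds, so it stays within the subset that automata-based engines such as RE2 are known to support. The prose comment in the last sample line is not matched, because the words are separated by other text.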
Re: Introducing codesearch.debian.net, a regexp code search engine
Hi Neil,

Neil Williams codeh...@debian.org writes:
> Just because a file doesn't end in .pl, doesn't mean it isn't perl - Policy mandates that perl in /usr/bin does not end in .pl
>
> Is this only finding perl modules and perl scripts in /usr/share?

As the FAQ¹ states, this is filtering by file extension. I will keep recognizing the actual file type in mind, but cannot make any promises as to whether this will be implemented or not.

¹ = http://codesearch.debian.net/faq

--
Best regards,
Michael

Archive: http://lists.debian.org/x6k3txcbtt@midna.zekjur.net
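
To make the difference between the two approaches concrete, here is a rough sketch of extension-based filtering with a fallback to the interpreter line. This is an illustration only, not how Debian Code Search is implemented; the tables `EXTENSIONS` and `INTERPRETERS`, the function names, and the sample shebang lines are all assumptions made for the example.

```python
import os

# Hypothetical lookup tables; a real service would need many more entries.
EXTENSIONS = {".pl": "perl", ".pm": "perl", ".py": "python", ".c": "c"}
INTERPRETERS = {"perl": "perl", "python": "python", "python3": "python"}


def type_by_extension(path):
    """Classify a file purely by its extension (the behaviour the FAQ describes)."""
    return EXTENSIONS.get(os.path.splitext(path)[1].lower())


def type_by_shebang(first_line):
    """Classify a script by its '#!' interpreter line, if it has one."""
    if not first_line.startswith("#!"):
        return None
    parts = first_line[2:].split()
    if not parts:
        return None
    program = os.path.basename(parts[0])
    # Handle the common '#!/usr/bin/env perl' form.
    if program == "env" and len(parts) > 1:
        program = os.path.basename(parts[1])
    return INTERPRETERS.get(program)


def detect(path, first_line):
    """Prefer the extension, fall back to the interpreter line."""
    return type_by_extension(path) or type_by_shebang(first_line)


# The shebang lines below are assumed, not taken from the actual packages.
print(detect("wrap-lintian.pl", ""))
print(detect("multistrap", "#!/usr/bin/perl"))
print(detect("debian/control", "Source: qof"))
```

With the fallback, an extensionless script such as multistrap would be classified as perl, while debian/control stays unclassified. The cost is that the first line of every file has to be read at indexing time.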
Re: Introducing codesearch.debian.net, a regexp code search engine
On Tue, 6 Nov 2012 19:05:43 +0100 Michael Stapelberg stapelb...@debian.org wrote:
> Debian Code Search is a search engine for program source code within Debian. It allows you to search all ≈ 17000 source packages, containing 130 GiB of FLOSS source code (including Debian packaging) with regular expressions.

It's pleasingly quick, which is always good. Might need to be able to exclude the debian/ directory from searches.

> You can use the search engine at http://codesearch.debian.net/
>
> Here are a few sample queries:
> • http://codesearch.debian.net/search?q=workaround+package%3Alinux
> • http://codesearch.debian.net/search?q=XCreateWindow
> • http://codesearch.debian.net/search?q=AnyEvent%3A%3AI3+filetype%3Aperl
>
> The corresponding thesis (and source code, of course) will be released soon (2013-01-15 being the deadline, but I hope I can do it earlier).
>
> I hope you find it useful and would love to hear your feedback.

First thing which occurs to me is that I'd prefer a summary page as the entry point for the search results - listing package, version and possibly a link to the PTS, possibly the number of hits for that package/package+version. First thing I've needed to do with every search result so far is find a relevant package within the results.

The search results (and any summary page) should probably be sorted by package name too - I'm getting results from packages starting with m before package names starting with e.

Maybe extend the keywords to allow regexp matching on package names?

Another important step would be a way of excluding matches within comments from the results.

The filetype seems a little confused in places too. Searching for things in filetype:perl I get matches in debian/control and debian/copyright.

--
Neil Williams = http://www.linux.codehelp.co.uk/
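
The sample query URLs quoted above are just the search term plus `package:` / `filetype:` keywords, URL-encoded into the `q` parameter. A small sketch of building such URLs with Python's standard library; the helper name `search_url` is made up, and only the `q` parameter and the two keywords seen in the announcement are assumed to exist:

```python
from urllib.parse import urlencode

BASE = "http://codesearch.debian.net/search"


def search_url(term, package=None, filetype=None):
    """Build a query URL in the same shape as the announcement's samples."""
    parts = [term]
    if package:
        parts.append("package:" + package)
    if filetype:
        parts.append("filetype:" + filetype)
    # urlencode turns spaces into '+' and ':' into '%3A'.
    return BASE + "?" + urlencode({"q": " ".join(parts)})


print(search_url("workaround", package="linux"))
print(search_url("AnyEvent::I3", filetype="perl"))
```

Both calls reproduce the corresponding sample URLs from the announcement character for character.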
Re: Introducing codesearch.debian.net, a regexp code search engine
2 words: Awe some

roughly speaking, how does it work internally?

On Tue, Nov 6, 2012 at 7:05 PM, Michael Stapelberg stapelb...@debian.org wrote:
> Hi,
>
> I hereby announce a new Debian project: Debian Code Search.
>
> Debian Code Search is a search engine for program source code within Debian. It allows you to search all ≈ 17000 source packages, containing 130 GiB of FLOSS source code (including Debian packaging) with regular expressions.
>
> You can use the search engine at http://codesearch.debian.net/
>
> Here are a few sample queries:
> • http://codesearch.debian.net/search?q=workaround+package%3Alinux
> • http://codesearch.debian.net/search?q=XCreateWindow
> • http://codesearch.debian.net/search?q=AnyEvent%3A%3AI3+filetype%3Aperl
>
> The corresponding thesis (and source code, of course) will be released soon (2013-01-15 being the deadline, but I hope I can do it earlier).
>
> I hope you find it useful and would love to hear your feedback.
>
> --
> Best regards,
> Michael

Archive: http://lists.debian.org/calkubt69c2xtzf9qvn2shkfrkor+d0tz8t-kjgjsdetjkob...@mail.gmail.com
Re: Introducing codesearch.debian.net, a regexp code search engine
LOVE IT!!! THANK YOU SO MUCH

On Tue, Nov 6, 2012 at 12:05 PM, Michael Stapelberg stapelb...@debian.org wrote:
> Hi,
>
> I hereby announce a new Debian project: Debian Code Search.
> [...]

--
James Michael DuPont
Member of Free Libre Open Source Software Kosova http://flossk.org
Saving wikipedia(tm) articles from deletion http://SpeedyDeletion.wikia.com
Contributor FOSM, the CC-BY-SA map of the world http://fosm.org
Mozilla Rep https://reps.mozilla.org/u/h4ck3rm1k3
Free Software Foundation Europe Fellow http://fsfe.org/support/?h4ck3rm1k3
Re: Introducing codesearch.debian.net, a regexp code search engine
Hi alberto,

alberto fuentes paj...@gmail.com writes:
> roughly speaking, how does it work internally?

It uses a trigram index and the RE2 regular expression engine. My work is based on Russ Cox’s ideas and code published at http://swtch.com/~rsc/regexp/regexp4.html

In case you are interested, I’m happy to send you (or anyone else) the current draft of my thesis, which describes the system in much more detail. In that case, just send me an email in private.

--
Best regards,
Michael

Archive: http://lists.debian.org/x6ehk6fqcw@midna.zekjur.net
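
For readers who do not want to read the whole article, the general idea of a trigram index can be sketched in a few lines. This is a toy illustration under several assumptions, not the Debian Code Search implementation: it uses Python's `re` module in place of RE2, keeps everything in memory, and expects the caller to supply a literal substring that every match must contain (deriving that from the regular expression automatically is the hard part, and is what the linked article covers). The file names and contents are made up.

```python
import re
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(files):
    """Map each trigram to the set of file names containing it."""
    index = defaultdict(set)
    for name, content in files.items():
        for gram in trigrams(content):
            index[gram].add(name)
    return index


def search(index, files, literal, pattern):
    """Narrow candidates with the index, then confirm with a real regex match.

    `literal` must be a string that every match of `pattern` contains.
    """
    candidates = set(files)
    for gram in trigrams(literal):
        candidates &= index.get(gram, set())
    regex = re.compile(pattern)
    return sorted(name for name in candidates if regex.search(files[name]))


# Hypothetical corpus, invented for this example.
FILES = {
    "x11/window.c": "win = XCreateWindow(dpy, root, 0, 0, 1, 1);",
    "x11/simple.c": "win = XCreateSimpleWindow(dpy, root);",
    "net/recv.c": "bytes_received += n;",
}
INDEX = build_index(FILES)

print(search(INDEX, FILES, "XCreate", r"XCreate\w*Window"))
```

The index lookup only ever rules files out; the regular expression still runs on every remaining candidate, so a query with no usable trigrams degrades to scanning everything.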
Re: Introducing codesearch.debian.net, a regexp code search engine
Hi Neil,

Neil Williams codeh...@debian.org writes:
> It's pleasingly quick, which is always good. Might need to be able to exclude the debian/ directory from searches.

File regular expressions and a minus operator are already on the TODO list :-).

> First thing which occurs to me is that I'd prefer a summary page as the entry point for the search results - listing package, version and possibly a link to the PTS, possibly the number of hits for that package/package+version. First thing I've needed to do with every search result so far is find a relevant package within the results.
>
> The search results (and any summary page) should probably be sorted by package name too - I'm getting results from packages starting with m before package names starting with e.

Changing the entry point of the search is not going to happen; I quite like the interface it currently has. However, adding a list of packages which are present in the current page of search results would be possible. Note that displaying the entire list of matching packages is unfortunately not possible because it’d require searching through all the files, which is, depending on the query, absolutely impossible when still wanting to guarantee a timely response :-).

> Maybe extend the keywords to allow regexp matching on package names?

I have also considered this. Probably I will resort to making the filename keyword (not yet implemented) use regular expressions and keep the package keyword an exact match. Since the package is part of the filename, complex things are possible while easy matches stay easy :-).

> Another important step would be a way of excluding matches within comments from the results.

I have considered this, but when you think about it, identifiers (variable names, function names, …) and comments are really all that is searchable in source code. Could you give me a few convincing points on why it would be useful to exclude comments (that is, examples)?

> The filetype seems a little confused in places too. Searching for things in filetype:perl I get matches in debian/control and debian/copyright.

Can you give me the exact query for which this happens, please?

--
Best regards,
Michael

Archive: http://lists.debian.org/x6390mfpmu@midna.zekjur.net
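
The filter semantics discussed in this message (exact-match package keyword, regular-expression filename keyword, and a minus operator for exclusions) could look roughly like the sketch below. Everything in it is hypothetical: the filename keyword and minus operator were only TODO items at the time, the `select` function and its parameters are invented, and the `package_version/path` layout and version numbers are assumptions based solely on the remark that the package is part of the filename.

```python
import re


def select(paths, package=None, filename=None, exclude=None):
    """Filter paths by exact package name, filename regex, and exclusion regex."""
    include_re = re.compile(filename) if filename else None
    exclude_re = re.compile(exclude) if exclude else None
    selected = []
    for path in paths:
        # Assumed layout: '<package>_<version>/<path inside the source tree>'.
        pkg = path.split("/", 1)[0].split("_", 1)[0]
        if package is not None and pkg != package:
            continue
        if include_re and not include_re.search(path):
            continue
        if exclude_re and exclude_re.search(path):
            continue
        selected.append(path)
    return selected


# Made-up paths and versions, for illustration only.
PATHS = [
    "qof_0.8.4/libqof/qof/qofbook.c",
    "qof_0.8.4/debian/control",
    "multistrap_2.1.20/multistrap",
    "multistrap_2.1.20/debian/copyright",
]

# Exact package match, excluding the debian/ directory.
print(select(PATHS, package="qof", exclude=r"/debian/"))

# Regex on the full filename doubles as a regex on the package name.
print(select(PATHS, filename=r"^(qof|multistrap)_[^/]*/debian/"))
```

The second call shows why a filename regex can cover package-name matching as well: the package is simply the first path component.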
Re: Introducing codesearch.debian.net, a regexp code search engine
On Tue, Nov 6, 2012 at 9:06 PM, Michael Stapelberg stapelb...@debian.org wrote:
> Hi alberto,
>
> alberto fuentes paj...@gmail.com writes:
> > roughly speaking, how does it work internally?
>
> It uses a trigram index and the RE2 regular expression engine. My work is based on Russ Cox’s ideas and code published at http://swtch.com/~rsc/regexp/regexp4.html

That read was enough to answer my questions on how it works. :)

Now some actual details would be appreciated, like size of database, size in memory, engine running, kind of machine, number of nodes, etc...

Have you run any benchmarks?

greets
aL

Archive: http://lists.debian.org/calkubt7iuvm8r0jgkajfzl4d-nrxalpinurfpdr4-t8cysa...@mail.gmail.com
Re: Introducing codesearch.debian.net, a regexp code search engine
Hi,

On Tuesday, 06.11.2012 at 19:05 +0100, Michael Stapelberg wrote:
> I hereby announce a new Debian project: Debian Code Search.

Great!

> I hope you find it useful and would love to hear your feedback.

Since you have all code extracted anyways, could you extend the page to allow for easy code browsing? Might be faster than apt-get source; less ... sometimes.

Greetings,
Joachim

--
Joachim nomeata Breitner
Debian Developer
nome...@debian.org | ICQ# 74513189 | GPG-Keyid: 4743206C
JID: nome...@joachim-breitner.de | http://people.debian.org/~nomeata
Re: Introducing codesearch.debian.net, a regexp code search engine
On Tue, Nov 06, 2012 at 07:05:43PM +0100, Michael Stapelberg wrote:
> Hi,

Hi!

> I hereby announce a new Debian project: Debian Code Search.
>
> Debian Code Search is a search engine for program source code within Debian. It allows you to search all ≈ 17000 source packages, containing 130 GiB of FLOSS source code (including Debian packaging) with regular expressions.

cool :)

> You can use the search engine at http://codesearch.debian.net/

nice

> I hope you find it useful and would love to hear your feedback.

yes, I think it is. it's an enabler kind of tool: people can study the code in new ways, and it also has applications in the security field. if you consider that Debian is one of the most extensive (and regularly used) collections of software, I'm sure it will be the joy of many :)

cheers,
Domenico

Archive: http://lists.debian.org/20121106215213.GA20829@glitch
Re: Introducing codesearch.debian.net, a regexp code search engine
Hi Joachim,

Joachim Breitner nome...@debian.org writes:
> Since you have all code extracted anyways, could you extend the page to allow for easy code browsing? Might be faster than apt-get source; less ... sometimes.

Very basic code browsing is on my agenda, but zack@ mentioned he wants to build a new sources.debian.org. Maybe his project is what you are looking for? :-)

--
Best regards,
Michael

Archive: http://lists.debian.org/x6fw4me621@midna.zekjur.net
Re: Introducing codesearch.debian.net, a regexp code search engine
Hi,

On Tuesday, 06.11.2012 at 23:10 +0100, Michael Stapelberg wrote:
> Joachim Breitner nome...@debian.org writes:
> > Since you have all code extracted anyways, could you extend the page to allow for easy code browsing? Might be faster than apt-get source; less ... sometimes.
>
> Very basic code browsing is on my agenda, but zack@ mentioned he wants to build a new sources.debian.org. Maybe his project is what you are looking for? :-)

Either works. Or rather, both should be one (or at least appear as one, e.g. search input field on sources.d.o; search results on codesearch.d.n linking back to sources.d.o).

Greetings,
Joachim

--
Joachim nomeata Breitner
Debian Developer
nome...@debian.org | ICQ# 74513189 | GPG-Keyid: 4743206C
JID: nome...@joachim-breitner.de | http://people.debian.org/~nomeata
Re: Introducing codesearch.debian.net, a regexp code search engine
On Tue, 06 Nov 2012 21:22:17 +0100 Michael Stapelberg stapelb...@debian.org wrote:
> > Another important step would be a way of excluding matches within comments from the results.
>
> I have considered this, but when you think about it, identifiers (variable names, function names, …) and comments are really all that is searchable in source code. Could you give me a few convincing points on why it would be useful to exclude comments (that is, examples)?

Any search term which can be a variable name and frequently occurs in licence headers or doxygen markup or email addresses (copyright). (I dread to think what results come from searching just for 'debian', even with filetype:c it's all licence headers / email addresses.)

http://codesearch.debian.net/search?q=QofBook+filetype%3Ac

Any similar term which is frequently used across doxygen-style API docs will give a mix of comments and code. e.g.

http://codesearch.debian.net/search?q=modify+filetype%3Ac

That's just swamped by licences, as would be "received" and lots of other common words (which are, rightly or wrongly, used as variable names or as part of function names).

Without exclusions on comments (and without fixes for the filetype: matches below), any common word is going to be swamped.

> > The filetype seems a little confused in places too. Searching for things in filetype:perl I get matches in debian/control and debian/copyright.
>
> Can you give me the exact query for which this happens, please?

http://codesearch.debian.net/search?q=codehelp+filetype%3Aperl

filetype:perl just doesn't seem to be working:
http://codesearch.debian.net/search?q=QofBook+filetype%3Aperl
... lists a lot of .c files ...

filetype:python does the same - some .py but then a lot more .c

--
Neil Williams = http://www.linux.codehelp.co.uk/
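
As an illustration of what excluding comments would mean in practice, here is a deliberately naive sketch for C sources. It is not a feature of the search engine: the function names and the sample source are invented, and the regex-based comment stripping is a simplification that would be fooled by comment markers inside string literals.

```python
import re

# Matches /* ... */ block comments and // line comments (naive: ignores strings).
C_COMMENT = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)


def strip_comments(source):
    """Blank out C comments, keeping newlines so line numbers stay stable."""
    def blank(match):
        return re.sub(r"[^\n]", " ", match.group(0))
    return C_COMMENT.sub(blank, source)


def code_matches(source, pattern):
    """Return (line number, original line) pairs whose non-comment part matches."""
    regex = re.compile(pattern)
    original = source.splitlines()
    hits = []
    for number, line in enumerate(strip_comments(source).splitlines(), 1):
        if regex.search(line):
            hits.append((number, original[number - 1]))
    return hits


# Made-up sample: a licence-style header followed by one declaration.
SAMPLE = (
    "/* You may modify this file under the terms\n"
    " * of the licence. */\n"
    "void modify_entry(int id);  // modify one entry\n"
)

print(code_matches(SAMPLE, "modify"))
```

Only the declaration on line 3 is reported; the occurrence in the licence header is ignored. Doing this for every language in the archive would need a per-language notion of what a comment is, which is part of why it is not a trivial feature.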