Re: [Wikitech-l] Wikipedia database
On Nov 20 2009, Platonides wrote:
>> Zeyi wrote:
>> I took the sub-current data from MediaWiki and imported it into Oracle.
> Which tool did you use for the import?

I used the xml2sql tool, which is easy to use.

>> I found two identical page_latest IDs in the page table. Then when I
>> tried to join the revision table and the page table, this produced two
>> identical rev_ids.
> Which pages are those?

All kinds of pages. Is the page_latest ID supposed to be unique? May I ask why I have two identical page_latest values in the page table, and what that means? If I want to put the revision table and the page table together, which columns should be the link point?

> You shouldn't have that situation. And why are you merging page and
> revision, anyway?

I need to use rev_user and page_namespace to do cross-analysis. How can I put them in one table? Thanks again.

Thanks,
Zeyi
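[For what it's worth, there is usually no need to merge the two tables into one; a join covers this kind of cross-analysis. A sketch against the standard MediaWiki schema, using plain SQL that should also run on Oracle:

    -- edits per (namespace, user): rev_page is the revision table's
    -- foreign key to page.page_id
    SELECT p.page_namespace, r.rev_user, COUNT(*) AS edits
    FROM page p
    JOIN revision r ON r.rev_page = p.page_id
    GROUP BY p.page_namespace, r.rev_user;

    -- to look at only the latest revision of each page, join on
    -- page_latest instead; this matches exactly one revision per page,
    -- so duplicate rev_ids should not appear:
    SELECT p.page_namespace, r.rev_user
    FROM page p
    JOIN revision r ON r.rev_id = p.page_latest;

If the second query still returns duplicate rev_ids, the import itself likely produced duplicate page rows.]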
Re: [Wikitech-l] Auto unit conversion
On Fri, Nov 20, 2009 at 12:04 AM, Sylvain Leroux <sylv...@chicoree.fr> wrote:
> The idea would be something like:
>   {{#unit: 1000 ft|m}} = <span title="304m (1000ft)">304m</span>

It would make more sense to have it output simply 304. That way it could be effectively used in a template to achieve any desired formatting.
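[If the parser function returned just the bare number, the hover-text markup above could still be produced by an ordinary template. A sketch, with a made-up template name and calling convention:

    <span title="{{#unit: {{{1}}} ft|m}}m ({{{1}}}ft)">{{#unit: {{{1}}} ft|m}}m</span>

A hypothetical {{ftm|1000}} would then render the same span as the original proposal, while keeping all formatting decisions in wikitext.]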
Re: [Wikitech-l] [MediaWiki] Enhancement: LaTeX images quality (eliminate white background)
On Fri, Nov 20, 2009 at 3:39 AM, Alexander Shulgin <alex.shul...@gmail.com> wrote:
> So, there is no way to provide a non-alpha-transparent image that
> satisfies all possible styles.

No, but if the overwhelming majority of viewers see a light blue background, it would make a lot more sense to have a light blue background on the formula than a white background. Both will look wrong under some style, but they may as well at least look right under the default style.

> I think we cannot do anything sensible now, so we'll just have to wait
> for MathML support from the most popular browsers.

That's definitely the long-term solution.
Re: [Wikitech-l] Size of DB/table of enwiki after import into MySQL
On Sat, Nov 21, 2009 at 6:39 PM, Aryeh Gregor <simetrical+wikil...@gmail.com> wrote:
> Selecting a list of all titles that are not redirects will take a long
> time on any database, unless you have everything in memory, because it
> requires a table scan -- there's no index that covers the relevant
> columns (IIRC).

You could build an index on page_is_redirect in the page table (see enwiki-*-page.sql.gz). But I'm pretty sure PostgreSQL wouldn't use it, and would do a sequential scan instead, since pretty much all of the pages are going to have to be accessed anyway. Five or ten minutes sounds about right. I can't imagine this is a query you want to run over and over again. If it is, you'd probably want to use partitioning (http://www.postgresql.org/docs/8.4/interactive/ddl-partitioning.html), but you're still not going to cut the query time down very much, as it's going to be returning millions of rows.

> It's also worth pointing out that Wikipedia uses a version of MySQL
> with substantial modifications, and Wikimedia sysadmins are very
> familiar with its behavior. Switching to a new technology might
> theoretically be better in the long term (although I wouldn't take
> that for granted in this case), but the transition cost would be
> substantial. Heck, Wikipedia hasn't even upgraded to MySQL 4.1, let
> alone a whole different DBMS.

Yes, it can be very hard to switch your DBMS, and that's a very good thing for MySQL. :)
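[For concreteness, the query and the single-column index being discussed would look something like this against the standard MediaWiki page table (a sketch; the index name is made up):

    -- list all non-redirect titles
    SELECT page_namespace, page_title
    FROM page
    WHERE page_is_redirect = 0;

    -- the single-column index suggested above
    CREATE INDEX page_redirect ON page (page_is_redirect);

With millions of non-redirect rows, the planner would likely ignore such a low-selectivity index and scan the table anyway, which is the point being made.]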
Re: [Wikitech-l] Size of DB/table of enwiki after import into MySQL
On Sat, Nov 21, 2009 at 7:02 PM, Anthony <wikim...@inbox.org> wrote:
> You could build an index on page_is_redirect in the page table (see
> enwiki-*-page.sql.gz). But I'm pretty sure PostgreSQL wouldn't use it,
> and would do a sequential scan instead, since pretty much all of the
> pages are going to have to be accessed anyway. Five or ten minutes
> sounds about right. I can't imagine this is a query you want to run
> over and over again. If it is, you'd probably want to use partitioning
> (http://www.postgresql.org/docs/8.4/interactive/ddl-partitioning.html),
> but you're still not going to cut the query time down very much, as
> it's going to be returning millions of rows.

Yeah, pretty much. If you had an index on (page_is_redirect, page_namespace, page_title) or such, that would speed it up significantly (at least in MySQL). It would have to scan through the whole index, but that only contains three columns plus a row ID of some kind, so it should be quite a lot faster than scanning the whole table. But this isn't a likely query for optimization.

> Yes, it can be very hard to switch your DBMS, and that's a very good
> thing for MySQL. :)

Let's not have a DBMS flame war here, please.
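[A sketch of the wider index described here (the name is made up). Because it contains every column the query touches, MySQL can answer the query by scanning just the index, never reading the table rows -- a "covering" index:

    CREATE INDEX page_redirect_cover
        ON page (page_is_redirect, page_namespace, page_title);
]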
Re: [Wikitech-l] Size of DB/table of enwiki after import into MySQL
On Sun, Nov 22, 2009 at 02:45, Aryeh Gregor <simetrical+wikil...@gmail.com> wrote:
> Yeah, pretty much. If you had an index on (page_is_redirect,
> page_namespace, page_title) or such, that would speed it up
> significantly (at least in MySQL). It would have to scan through the
> whole index, but that only contains three columns plus a row ID of
> some kind, so it should be quite a lot faster than scanning the whole
> table. But this isn't a likely query for optimization.

That's correct. We run this query once, and then we do a SELECT for each title (which only takes a few milliseconds, because it uses an index) and work with the result. Building the index would probably take longer than 5 or 10 minutes. The DB access is fast compared to all the other stuff we do, so there's not much need to optimize it.

MySQL with the indexes in tables.sql suits us fine so far, and I have no reason to suspect any other DB would be faster. The one thing that is slow is building the indexes after the data has been imported (eight hours or so). Maybe we could omit some indexes that are not used in our application, but I haven't really looked into that.

Christopher
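[The per-title SELECT is fast because tables.sql defines a unique name_title index on (page_namespace, page_title), so each lookup is a single index probe. A sketch (the selected columns are illustrative):

    SELECT page_id, page_latest
    FROM page
    WHERE page_namespace = 0
      AND page_title = 'Example';
]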
Re: [Wikitech-l] Size of DB/table of enwiki after import into MySQL
On Sat, Nov 21, 2009 at 8:45 PM, Aryeh Gregor <simetrical+wikil...@gmail.com> wrote:
> Yeah, pretty much. If you had an index on (page_is_redirect,
> page_namespace, page_title) or such, that would speed it up
> significantly (at least in MySQL). It would have to scan through the
> whole index, but that only contains three columns plus a row ID of
> some kind, so it should be quite a lot faster than scanning the whole
> table.

Quite a lot? The theoretical maximum would be about twice as fast, since (page_is_redirect, page_namespace, page_title) is going to take up at least half as much space as the whole page table. And I'm not sure even that theoretical maximum could be reached by MySQL. I know it wouldn't be reached by PostgreSQL, which would still do a sequential scan through the table. If you clustered on page_is_redirect, you'd save yourself from having to go through the parts of the table that are redirects, but you're still stuck with a sequential scan. In either database, if you really wanted the absolute fastest solution, you'd create a materialized view for exactly that query. But as you said, this isn't a likely query for optimization.

>> Yes, it can be very hard to switch your DBMS, and that's a very good
>> thing for MySQL. :)
> Let's not have a DBMS flame war here, please.

Aww, c'mon, just a little light ribbing... I couldn't resist, you set me right up for it.

I'm actually in the process of trying to import enwiki into a PostgreSQL database right now. Attempt 1 was to import everything into a MySQL database (with no indexes), export it as a TSV file, then import from the TSV file into PostgreSQL. I hit a snag with some data that Postgres says isn't valid UTF-8, which is probably due to something I did wrong with the import, but I can't figure out what it is.
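[PostgreSQL had no built-in materialized views at the time (CREATE MATERIALIZED VIEW only arrived in 9.3), so "materialized view" here would mean a manually refreshed table. A sketch, with a made-up table name:

    -- materialize the result set once
    CREATE TABLE non_redirect_titles AS
        SELECT page_namespace, page_title
        FROM page
        WHERE page_is_redirect = 0;

    -- refresh by hand whenever the page table changes
    TRUNCATE non_redirect_titles;
    INSERT INTO non_redirect_titles
        SELECT page_namespace, page_title
        FROM page
        WHERE page_is_redirect = 0;

Reads of the materialized table are then as fast as a sequential scan of those three columns can be, at the cost of stale data between refreshes.]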
Re: [Wikitech-l] Size of DB/table of enwiki after import into MySQL
On Sun, Nov 22, 2009 at 7:39 AM, Aryeh Gregor <simetrical+wikil...@gmail.com> wrote:
> It's also worth pointing out that Wikipedia uses a version of MySQL
> with substantial modifications,

Is the source available in the SVN? Can you point me in the right direction? Thanks.