Re: [Wikitech-l] Wikipedia database

2009-11-21 Thread zh509
On Nov 20 2009, Platonides wrote:

Zeyi wrote:
 I took the sub-current data dump from MediaWiki and imported it into Oracle. 
Which tool did you use for the import?

I used the xml2sql tool, which is easy to use. 

 I found two identical page_latest IDs in the page table. Then, when 
 I tried to join the revision table and the page table together, this 
 produced two identical rev_id values.

Which pages are those?
All kinds of pages. Is the page_latest ID supposed to be unique? 


 May I ask why I have two identical page_latest values in the page table, 
 and what that means? If I want to put the revision table and the page 
 table together, which column should be the join key?

You shouldn't have that situation.
And why are you merging page and revision, anyway?

I need to use rev_user and page_namespace to do a cross-analysis. How can I 
put them together in one table? Thanks again.
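
Roughly, what I am trying to build is something like the sketch below. This 
is just a sketch, assuming revision.rev_page is the link to page.page_id:

  -- Count edits per (namespace, user); rev_page -> page_id is the
  -- link between the revision and page tables.
  SELECT p.page_namespace, r.rev_user, COUNT(*) AS edits
  FROM revision AS r
  JOIN page AS p ON p.page_id = r.rev_page
  GROUP BY p.page_namespace, r.rev_user;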

 thanks,
 Zeyi


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l





Re: [Wikitech-l] 235 Auto unit conversion

2009-11-21 Thread Aryeh Gregor
On Fri, Nov 20, 2009 at 12:04 AM, Sylvain Leroux sylv...@chicoree.fr wrote:
 The idea would be something like:
 {{#unit: 1000 ft|m}} = <span title="304m (1000ft)">304m</span>

It would make more sense to have it output simply 304.  That way it
could be effectively used in a template to achieve any desired
formatting.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] [MediaWiki] Enhancement: LaTeX images quality (eliminate white background)

2009-11-21 Thread Aryeh Gregor
On Fri, Nov 20, 2009 at 3:39 AM, Alexander Shulgin
alex.shul...@gmail.com wrote:
 So there can be no way to provide a non-alpha-transparent image that 
 satisfies all possible styles.

No, but if the overwhelming majority of viewers see a light blue
background, it would make a lot more sense to have a light blue
background on the formula than a white background.  Both will look
wrong on some style, but they may as well at least look right on the
default style.

 I think we cannot do anything sensible now, so we'll just have to 
 wait for MathML support in the most popular browsers.

That's definitely the long-term solution.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Size of DB/table of enwiki after import into MySQL

2009-11-21 Thread Anthony
On Sat, Nov 21, 2009 at 6:39 PM, Aryeh Gregor
simetrical+wikil...@gmail.com wrote:
 Selecting a list of all titles that are not
 redirects will take a long time on any database, unless you have
 everything in memory, because it requires a table scan -- there's no
 index that covers the relevant columns (IIRC).

You could build an index on page_is_redirect in the page table (see
enwiki-*-page.sql.gz).  But I'm pretty sure Postgresql wouldn't use
it, and would do a sequential scan, since pretty much all the pages
are going to have to be accessed anyway.
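
For concreteness, something like this (the index name is made up):

  -- An index on the redirect flag alone; the planner may well ignore
  -- it and do a sequential scan anyway, for the reason above.
  CREATE INDEX page_redirect_idx ON page (page_is_redirect);

  -- The query in question: titles of all non-redirect pages.
  SELECT page_namespace, page_title
  FROM page
  WHERE page_is_redirect = 0;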

Five or ten minutes sounds about right.  I can't imagine this is a
query you want to run over and over again.  If it is, you'd probably
want to use partitioning
(http://www.postgresql.org/docs/8.4/interactive/ddl-partitioning.html),
but you're still not going to cut down the query time very much, as
it's going to be returning millions of rows.
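
For the record, 8.4 partitioning is table inheritance plus CHECK constraints, 
along these lines (table names are illustrative, and you would still need 
triggers or rules to route inserts into the right child):

  -- Child tables carry mutually exclusive CHECK constraints:
  CREATE TABLE page_articles  (CHECK (page_is_redirect = 0)) INHERITS (page);
  CREATE TABLE page_redirects (CHECK (page_is_redirect = 1)) INHERITS (page);
  -- With constraint_exclusion enabled, a WHERE page_is_redirect = 0
  -- query only has to touch page_articles.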

 It's also worth pointing out that Wikipedia uses a version of MySQL
 with substantial modifications, and Wikimedia sysadmins are very
 familiar with its behavior.  Switching to a new technology might
 theoretically be better in the long term (although I wouldn't take
 that for granted in this case), but the transition cost would be
 substantial.  Heck, Wikipedia hasn't even upgraded to MySQL 4.1, let
 alone a whole different DBMS.

Yes, it can be very hard to switch your DBMS, and that's a very good
thing for MySQL.  :)

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Size of DB/table of enwiki after import into MySQL

2009-11-21 Thread Aryeh Gregor
On Sat, Nov 21, 2009 at 7:02 PM, Anthony wikim...@inbox.org wrote:
 You could build an index on page_is_redirect in the page table (see
 enwiki-*-page.sql.gz).  But I'm pretty sure Postgresql wouldn't use
 it, and would do a sequential scan, since pretty much all the pages
 are going to have to be accessed anyway.

 Five or ten minutes sounds about right.  I can't imagine this is a
 query you want to run over and over again.  If it is, you'd probably
 want to use partitioning
 (http://www.postgresql.org/docs/8.4/interactive/ddl-partitioning.html),
 but you're still not going to cut down the query time very much, as
 it's going to be returning millions of rows.

Yeah, pretty much.  If you had an index on (page_is_redirect,
page_namespace, page_title) or such, that would speed it up
significantly (at least in MySQL).  It would have to scan through the
whole index, but that only contains three columns plus a row id of
some kind, so it should be quite a lot faster than scanning the whole
table.  But this isn't a likely query for optimization.
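
Concretely, something like this (hypothetical index name):

  -- A covering index: MySQL can answer the title listing by scanning
  -- this index alone (EXPLAIN shows "Using index") instead of the
  -- much wider page rows.
  CREATE INDEX page_redirect_ns_title
      ON page (page_is_redirect, page_namespace, page_title);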

 Yes, it can be very hard to switch your DBMS, and that's a very good
 thing for MySQL.  :)

Let's not have a DBMS flame war here, please.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Size of DB/table of enwiki after import into MySQL

2009-11-21 Thread Jona Christopher Sahnwaldt
On Sun, Nov 22, 2009 at 02:45, Aryeh Gregor
simetrical+wikil...@gmail.com wrote:
 Yeah, pretty much.  If you had an index on (page_is_redirect,
 page_namespace, page_title) or such, that would speed it up
 significantly (at least in MySQL).  It would have to scan through the
 whole index, but that only contains three columns plus a row id of
 some kind, so it should be quite a lot faster than scanning the whole
 table.  But this isn't a likely query for optimization.

That's correct. We run this query once, and then we do
a SELECT for each title (which only takes a few milliseconds,
because it uses an index) and work with the result.
Building the index would probably take longer than
5 or 10 minutes.
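
The per-title lookup is essentially the following sketch; it can use the 
unique (page_namespace, page_title) index from tables.sql, hence the few 
milliseconds:

  -- 'Some_title' is a placeholder; the (page_namespace, page_title)
  -- unique key makes this an index lookup rather than a scan.
  SELECT page_id, page_is_redirect, page_latest
  FROM page
  WHERE page_namespace = 0 AND page_title = 'Some_title';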

The DB access is fast compared to all the other stuff
we do, so there's not much need to optimize it. MySQL
with the indexes in tables.sql suits us fine so far, and I
have no reason to suspect any other DB would be faster.

The one thing that is slow is building the indexes after
the data has been imported (eight hours or so). Maybe
we could omit some indexes that are not used in our
application, but I haven't really looked into that.
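
If it came to that, it would probably just mean dropping lines from 
tables.sql before the import, or deferring index maintenance during the bulk 
load, something like this (assuming MyISAM tables, untested on our setup):

  -- Defer non-unique index maintenance until after the bulk load:
  ALTER TABLE revision DISABLE KEYS;
  -- ... bulk import here ...
  ALTER TABLE revision ENABLE KEYS;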

Christopher

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Size of DB/table of enwiki after import into MySQL

2009-11-21 Thread Anthony
On Sat, Nov 21, 2009 at 8:45 PM, Aryeh Gregor
simetrical+wikil...@gmail.com wrote:
 Yeah, pretty much.  If you had an index on (page_is_redirect,
 page_namespace, page_title) or such, that would speed it up
 significantly (at least in MySQL).  It would have to scan through the
 whole index, but that only contains three columns plus a row id of
 some kind, so it should be quite a lot faster than scanning the whole
 table.

Quite a lot?  The theoretical max would be about twice as fast, as
(page_is_redirect, page_namespace, page_title) is going to take up at
least half as much space as the whole page table.  But I'm not sure
even that theoretical max could be reached by MySQL.

I know it wouldn't be reached by PostgreSQL, which would still do a
sequential scan through the table.  If you clustered on
page_is_redirect you'd save yourself from having to go through the
parts of the table which were redirects, but you're still stuck with a
sequential scan.
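
For completeness, that clustering step is a one-off physical reorder of the 
table along an index on page_is_redirect (reusing the hypothetical index name 
from before); PostgreSQL does not keep the order up to date on later writes:

  -- One-off physical reorder by the redirect flag:
  CLUSTER page USING page_redirect_idx;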

In either database, if you really wanted the absolute fastest
solution, you'd create a materialized view for exactly that query.
But as you said, this isn't a likely query for optimization.
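
(PostgreSQL 8.4 has no built-in materialized views, so in practice that just 
means a table you refresh by hand, for example:)

  -- Precompute the answer once; re-run (or TRUNCATE and re-fill) to refresh.
  CREATE TABLE non_redirect_titles AS
  SELECT page_namespace, page_title
  FROM page
  WHERE page_is_redirect = 0;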

 Yes, it can be very hard to switch your DBMS, and that's a very good
 thing for MySQL.  :)

 Let's not have a DBMS flame war here, please.

Aww, c'mon, just a little light ribbing...  I couldn't resist, you set
me right up for it.

I'm actually in the process of trying to import enwiki into a
postgresql database right now.  Attempt 1 was to import everything
into a MySQL database (with no indexes), export it as a TSV file, then
import from the TSV file into Postgresql.  Hit a snag with some data
that Postgres is saying isn't valid UTF8, which is probably due to
something I did wrong with the import, but I can't figure out what it
is.
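
Roughly, the TSV round-trip looks like this (one table shown, paths made up; 
the UTF-8 error shows up at the COPY step):

  -- MySQL side: dump a table as tab-separated text.
  SELECT * FROM page
    INTO OUTFILE '/tmp/page.tsv'
    FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';

  -- PostgreSQL side: load the TSV (the text format is tab-delimited
  -- by default); any byte sequence that is not valid UTF-8 in a UTF-8
  -- database makes COPY abort.
  COPY page FROM '/tmp/page.tsv';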

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Size of DB/table of enwiki after import into MySQL

2009-11-21 Thread Ryan Chan
On Sun, Nov 22, 2009 at 7:39 AM, Aryeh Gregor
simetrical+wikil...@gmail.com wrote:
 It's also worth pointing out that Wikipedia uses a version of MySQL
 with substantial modifications,

Is the source available in SVN? Can you point me in the right direction?


Thanks.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l