Dan Andreescu <dandree...@wikimedia.org> wrote:

>> Maybe something exists already in Hadoop
>
> The page properties table is already loaded into Hadoop on a monthly basis
> (wmf_raw.mediawiki_page_props).  I haven't played with it much, but Hive
> also has JSON-parsing goodies, so give it a shot and let me know if you get
> stuck.  In general, data from the databases can be sqooped into Hadoop.  We
> do this for large pipelines like edit history
> <https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Edit_data_loading>
>  and
> it's very easy
> <https://github.com/wikimedia/analytics-refinery/blob/master/python/refinery/sqoop.py#L505>
> to add a table.  We're looking at just replicating the whole db on a more
> frequent basis, but we have to do some groundwork first to allow
> incremental updates (see Apache Iceberg if you're interested).
>
>
Yes, I like that and all of the other wmf_raw goodies! I'll follow up
off-thread about accessing the parser cache DBs (they're defined in site.pp
and db-eqiad.php, but I don't think refinery.util currently picks them up,
since they're not in the .dblist files).
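
In the meantime, here's a rough sketch of the kind of Hive query I have in
mind for the page props table (untested; it assumes the sqooped table keeps
the MediaWiki page_props column names -- pp_page, pp_propname, pp_value --
plus the usual snapshot/wiki_db partitions, and the property name, JSON
path, and snapshot value are just placeholders):

  -- Pull one month's snapshot for one wiki and parse a JSON-valued
  -- property with Hive's built-in get_json_object().
  -- 'some_json_property' and '$.some_field' are placeholders.
  SELECT
    pp_page,
    get_json_object(pp_value, '$.some_field') AS some_field
  FROM wmf_raw.mediawiki_page_props
  WHERE snapshot = '2024-01'
    AND wiki_db = 'enwiki'
    AND pp_propname = 'some_json_property'
  LIMIT 10;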