Hi,

In the week from 2014-09-15 to 2014-09-21, Andrew, Jeff, and I worked
on the following items around the Analytics Cluster and
Analytics-related Ops:

* Using kafkatee to generate TSVs
* Bringing Webstatscollector to Hive
* TSV generation through Hive
* Logstash demo
* Reorganizing Wikimetrics mounts
* Stream to Universities
* Analytics1021 issues not an artifact of kafka consumers
* X-Analytics php tag missing/wrong for some requests (Bug 70463)

(details below)

Have fun,
Christian



* Using kafkatee to generate TSVs

To meet the overall plan of ceasing to rely on udp2log for Analytics
tasks, we wanted to use kafkatee as a drop-in replacement for udp2log.
While initial tests were positive, kafkatee did not run smoothly when
we tried to use it in production. For example, it dropped some
partitions and did not update offset files. Both issues are blockers
for its use.

We're in contact with the kafkatee developer and are producing the
logs he needs to debug it, but the issues have not yet been resolved.

* Bringing Webstatscollector to Hive

We produced a first running Hive/Oozie implementation of
webstatscollector. The code still needs polishing, but it's working.
Once in production, this code will be the first real-world use of the
cluster.

* TSV generation through Hive

Since kafkatee showed some severe issues for us (see above), we
discussed a plan B for moving off of udp2log. After the initial
checks, it seems generating the TSVs through Hive could work out. It
would come with some nice benefits (like being able to re-run files,
or better control over when which data flows into them), but also
some real downsides: adding filters would require implementation
instead of configuration, and we could no longer use the existing
tooling around udp2log (think udp-filters to geolocate).

So we're still aiming to use kafkatee. But if that does not work out,
there are no immediate blockers for a Hive-based move away from
udp2log.

* Logstash demo

In order to raise visibility around Logstash and its usefulness
around Hive and Hadoop, there was a demo session that showed the basic
workflows.

* Reorganizing Wikimetrics mounts

Wikimetrics ran out of database disk space on the labs instances, so
more space got allocated and the contents of the instances have been
reshuffled a bit to make better use of the available disk space.

* Stream to Universities

For some years, some aspects of the udp2log multicast have been
streamed to universities for research purposes. Those streams caused
pain on many levels, and this week, the last one of those legacy
streams was turned off.

* Analytics1021 issues not an artifact of kafka consumers

Around analytics1021, progress has been slow, as the issue on
analytics1021 only occurs sporadically.
But kafka consumers got ruled out as the culprit for dropping
messages, since the missing lines have been found to be already
missing in kafka.

* X-Analytics php tag missing/wrong for some requests (Bug 70463)

The php={zend,hhvm} tagging happened twice for bits. Ops fixed the
double tagging, but now some requests get no tag at all. While this
is expected in some cases, Ops assume that some HHVM requests come
with php=zend tags. They are working on it.
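
To illustrate the tag states mentioned above (single tag, double
tagging, no tag at all), here is a minimal Python sketch of how one
could classify sampled X-Analytics header values. It assumes the
header carries semicolon-separated key=value pairs. The function
names and the sample values are hypothetical and not taken from the
actual Ops tooling.

```python
from collections import Counter


def php_tags(x_analytics):
    """Return every value of the php key found in one header value.

    Assumption: the header is a ';'-separated list of key=value pairs.
    """
    values = []
    for pair in x_analytics.split(";"):
        key, sep, value = pair.strip().partition("=")
        if sep and key == "php":
            values.append(value)
    return values


def classify(x_analytics):
    """Classify one header value by the state of its php tag."""
    tags = php_tags(x_analytics)
    if not tags:
        return "missing"   # no php tag at all
    if len(tags) > 1:
        return "double"    # tagged more than once
    return tags[0] if tags[0] in ("zend", "hhvm") else "unknown"


def tally(headers):
    """Count how often each state occurs in a sample of header values."""
    return Counter(classify(h) for h in headers)


if __name__ == "__main__":
    # Hypothetical sample values, for illustration only.
    sample = ["php=hhvm", "php=zend;php=zend", "", "foo=bar;php=zend"]
    for state, count in sorted(tally(sample).items()):
        print(state, count)
```

Such a tally only shows how often a tag is missing or doubled. It
cannot tell whether an HHVM request wrongly carries php=zend, since
that needs knowledge of which backend actually served the request.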



-- 
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
                           Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3     Email:  christ...@quelltextlich.at
4293 Gutau, Austria          Phone:          +43 7946 / 20 5 81
                             Fax:            +43 7946 / 20 5 81
                             Homepage: http://quelltextlich.at/
---------------------------------------------------------------


_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
