For Hadoop Streaming, it'll be a little annoying. This new data is in Parquet. Hadoop Streaming still uses the old MapReduce 1 API, while most of the officially supported Parquet input formats target the MapReduce 2 API, so by default Parquet and Hadoop Streaming are incompatible.
However! Some guy already ran into this problem and wrote this:

https://github.com/whale2/iow-hadoop-streaming/blob/master/src/main/java/net/iponweb/hadoop/streaming/parquet/ParquetAsJsonInputFormat.java

> On Jan 7, 2015, at 18:40, Nuria Ruiz <nu...@wikimedia.org> wrote:
>
> I am not sure if this is quite what you are asking but just in case:
>
> For streaming it is probably easier for you to use the newly created
> webrequest tables:
>
> https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive#Webrequest_Table.28s.29
>
> Those include an isPageview field so requests are pre-classified. You will
> need to wait a bit as data for those tables is being populated starting
> today.
>
> On Wed, Jan 7, 2015 at 3:35 PM, Aaron Halfaker <ahalfa...@wikimedia.org> wrote:
> Cool! Let's say I want to review the filters and apply them in a python
> script. What should I reference?
>
> On Wed, Jan 7, 2015 at 5:13 PM, Oliver Keyes <oke...@wikimedia.org> wrote:
> I'm pleased to say we now have the prototype pageviews definition as a UDF!
>
> For those with cluster access:
>
> CREATE TEMPORARY FUNCTION pageview as
> 'org.wikimedia.analytics.refinery.hive.isPageviewUDF';
>
> ...and then just apply it. It outputs a boolean, so you can easily go
> WHERE pageview(fields) and treat it as a conditional. Great
> success!
>
> What this means for the definition is twofold; it means it's a lot
> easier to test its accuracy, and it means that it's a lot easier to
> make sure we're all using the same definition going forward. Once we
> have the legacy definition as a UDF, refining and testing will proceed
> at great speed, although I encourage anyone with time on their hands
> who wants to help out to do some testing of their own :)
>
> --
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
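
For the Python script case raised in the quoted thread, here is a rough sketch of what a Hadoop Streaming mapper might look like. It is not a tested job. It assumes the input format hands each record to the mapper as one JSON object per line, optionally preceded by a key and a tab, and that the pre-classified flag is a boolean field. The field name `is_pageview` and the function names are placeholders for illustration; check the actual table schema before relying on them.

```python
import json
import sys

# Hypothetical field name: the thread only mentions an "isPageview" field,
# so verify the real column name against the table schema.
FLAG_FIELD = "is_pageview"


def is_pageview_record(line, flag_field=FLAG_FIELD):
    """Return True when the JSON record on this line has the flag set to true."""
    # Assumption: a streaming record is either "json" or "key<TAB>json".
    # rpartition returns the whole line as the last element if no tab exists.
    payload = line.rstrip("\n").rpartition("\t")[2]
    try:
        record = json.loads(payload)
    except ValueError:
        # Malformed or empty lines are skipped rather than failing the task.
        return False
    return isinstance(record, dict) and record.get(flag_field) is True


def filter_pageviews(lines, flag_field=FLAG_FIELD):
    """Yield only the input lines classified as pageviews, unchanged."""
    for line in lines:
        if is_pageview_record(line, flag_field):
            yield line.rstrip("\n")


if __name__ == "__main__":
    # Hadoop Streaming feeds records on stdin and collects stdout.
    for kept in filter_pageviews(sys.stdin):
        print(kept)
```

Saved as a script, something like this could be passed to Hadoop Streaming as the mapper. Because it only reads stdin and writes stdout, it can also be tried locally by piping a few sample JSON lines through it before running anything on the cluster.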
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics