Thanks, Tim,

This is helpful. Currently, we (I’m a colleague of Antoni) are using Impyla, which 
generates its Thrift bindings as part of its egg/wheel build process (I assume). 
Internally, we’ll figure out how to either match the Impyla Thrift version or 
build Impyla ourselves.

Does the JSON structure not match the nested Thrift structs?

Thanks,
Jenny

From: Tim Armstrong <tarmstr...@cloudera.com>
Date: Friday, August 9, 2019 at 5:20 PM
To: "u...@impala.apache.org" <u...@impala.apache.org>
Cc: "dev@impala.apache.org" <dev@impala.apache.org>, "Jenny Kwan (c)" 
<kje...@vmware.com>
Subject: Re: How to parse a query plan /summary/profile

Impala has two sets of information tracked on the coordinator node for each 
query: a summary and a profile.
The profile is currently accessible as a string, which is unwieldy for parsing. 
A thrift format is theoretically available, but there is a bug: 
https://issues.apache.org/jira/browse/IMPALA-8252
 , which is resolved in v3.2.0. So you need to have version >=3.2

The thrift format generally works fine; I know of a lot of tooling built on top 
of it (e.g. Cloudera Manager uses it extensively). The title of the JIRA sounds 
overly dramatic without context; basically, we had some issues with 
compatibility across versions. You'll be fine if you use the .thrift file 
corresponding to the version of Impala you're consuming profiles from. It's 
messier if you have a tool that uses an old thrift file, since there were some 
issues with backward compatibility, or if you're trying to consume profiles 
from multiple versions of Impala.

There's a toy Python profile decoder in the impala source tree that may be 
useful to get started - 
https://github.com/apache/impala/blob/master/bin/parse-thrift-profile.py
 and 
https://github.com/apache/impala/blob/24eab713a0d35f629509f59711f8a563e1346acf/lib/python/impala_py_lib/profiles.py
 . That just gets you from the base64-encoded strings to a thrift object.
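
As a rough illustration of that decoding step, here is a minimal sketch. It assumes the encoded profile is base64 text wrapping zlib-compressed, compact-protocol Thrift bytes, which is what the toy decoder linked above appears to handle. The function names are hypothetical, and the second function only works if the `thrift` package and Python bindings generated from the matching version's RuntimeProfile.thrift are importable.

```python
import base64
import zlib


def decode_profile_payload(encoded: str) -> bytes:
    """Return the raw serialized Thrift bytes from an encoded profile string.

    Assumption: the payload is base64(zlib(compact-protocol Thrift bytes)).
    """
    compressed = base64.b64decode(encoded.strip())
    return zlib.decompress(compressed)


def to_thrift_tree(raw: bytes):
    """Deserialize raw bytes into a TRuntimeProfileTree.

    Assumption: `thrift` is installed and `RuntimeProfile.ttypes` was
    generated from the .thrift file of the Impala version that produced
    the profile. Imports are local so the first function works without them.
    """
    from thrift.protocol import TCompactProtocol
    from thrift.TSerialization import deserialize
    from RuntimeProfile.ttypes import TRuntimeProfileTree

    tree = TRuntimeProfileTree()
    deserialize(tree, raw,
                protocol_factory=TCompactProtocol.TCompactProtocolFactory())
    return tree
```

Keeping the base64/zlib step separate from the Thrift step makes it easier to tell a corrupt download apart from a schema mismatch between Impala versions.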

A JSON format was added very recently (this week) into master - 
https://gerrit.cloudera.org/#/c/13801/.
 That's kinda experimental at the moment - we're not sure how convenient the 
current structure is without some experience actually using it - we'd welcome 
feedback about your use cases.

- Tim


On Fri, Aug 9, 2019 at 4:14 PM Antoni Ivanov 
<aiva...@vmware.com> wrote:
Hi,

We did some research on the topic; the answer we’ve come to so far is:

Impala has two sets of information tracked on the coordinator node for each 
query: a summary and a profile.
The profile is currently accessible as a string, which is unwieldy for parsing. 
A thrift format is theoretically available, but there is a bug: 
https://issues.apache.org/jira/browse/IMPALA-8252
 , which is resolved in v3.2.0. So you need to have version >=3.2


After that, the Thrift encoding from Twitter commons may be used:
https://github.com/twitter/commons/blob/06905dc0f1a26440a79ff1164831c85ce2d1bdf0/src/python/twitter/thrift/text/thrift_json_encoder.py


The thrift can be downloaded from the coordinator node, e.g. 
http://coord-node:25000/query_profile_encoded?query_id=442c057197d9c0d:81810ccd00000000
 (442c057197d9c0d:81810ccd00000000 is the query ID).
The thrift can also be downloaded from the Cloudera REST API (if using Cloudera).
Or, if using the impyla (https://github.com/cloudera/impyla) Python library, you 
can get the profile after execution:
        cur.execute(sql)
        return cur.get_profile(profile_format=TRuntimeProfileFormat.THRIFT)
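
For the coordinator-node option, a minimal sketch of fetching the encoded profile over HTTP might look like the following. The helper names are hypothetical. It assumes the debug web UI is reachable on port 25000 without authentication or TLS, and that the response body is just the encoded profile text.

```python
from urllib.parse import quote
from urllib.request import urlopen


def profile_url(host: str, query_id: str, port: int = 25000) -> str:
    """Build the encoded-profile URL for a coordinator's debug web UI."""
    # Keep ':' unescaped so the URL has the same form as the example above.
    return "http://{}:{}/query_profile_encoded?query_id={}".format(
        host, port, quote(query_id, safe=":"))


def fetch_encoded_profile(host: str, query_id: str, port: int = 25000,
                          timeout: float = 10.0) -> str:
    """Download the encoded profile text for one query.

    Assumption: no authentication or TLS on the web UI, and the body is
    the encoded profile string with nothing else around it.
    """
    with urlopen(profile_url(host, query_id, port), timeout=timeout) as resp:
        return resp.read().decode("utf-8").strip()
```

For example, `fetch_encoded_profile("coord-node", "442c057197d9c0d:81810ccd00000000")` would request the URL shown above.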


Just posting here in case it’s helpful to anyone following the user group.

-Antoni

From: Antoni Ivanov
Sent: Wednesday, August 7, 2019 10:13 AM
To: u...@impala.apache.org
Cc: dev@impala <dev@impala.apache.org>; Jenny 
Kwan (c) <kje...@vmware.com>
Subject: How to parse a query plan /summary/profile

Hi,

We’d like to get better visibility into the way our Impala cluster is used.
For example, there’s per-node utilization: sometimes fragments on a given 
node are slower, and this is visible in the profile. Also, some statistics are 
available only in the profile (like runtime filters used or Parquet file pruning 
stats).

I think you can download it as Thrift? But is it easily deserializable? (We 
would need to have the Thrift schema at least, I think.)
Thanks,
Antoni
