I work for a mobile game company. I'm solving a simple question: "Can we
efficiently/cheaply query the logs of a particular user within a given
date range?"

I've created a special JSON text-based file format that has these traits:
 - Snappy compressed, saved in AWS S3
 - Partitioned by date, i.e. 2017-01-01.sz, 2017-01-02.sz, ...
 - Sorted by a primary key (log_type) and a secondary key (user_id), then
Snappy-compressed in 5 MB blocks
 - Blocks are indexed by primary/secondary key in a companion file,
e.g. 2017-01-01.json
 - Efficient block-based random access on the primary key (log_type) and
secondary key (user_id) using the index (see the sketch below)
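Roughly, the lookup works like this (a minimal Scala sketch; BlockEntry,
blocksFor, and the index layout shown are illustrative assumptions, not
the actual on-disk format):

    case class BlockEntry(logType: String, firstUserId: String,
                          offset: Long, length: Long)

    // Blocks are sorted by (log_type, user_id), so the candidates for a
    // key are the blocks whose first key is <= the key and whose
    // successor's first key is >= it. The matching (offset, length)
    // ranges are then range-fetched from S3 and Snappy-decompressed.
    def blocksFor(index: IndexedSeq[BlockEntry],
                  logType: String, userId: String): Seq[BlockEntry] = {
      val ord = Ordering.Tuple2[String, String]
      val key = (logType, userId)
      index.indices.filter { i =>
        val first = (index(i).logType, index(i).firstUserId)
        val next =
          if (i + 1 < index.length)
            Some((index(i + 1).logType, index(i + 1).firstUserId))
          else None
        ord.lteq(first, key) && next.forall(n => ord.lteq(key, n))
      }.map(index)
    }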

I've created a Spark SQL DataFrame relation that can query this file
format.  Since the schema of each log type is fairly consistent, I've
reused the `InferSchema.inferSchema` method and `JacksonParser` in the
Spark SQL code to support structured querying.  I've also implemented
filter push-down to optimize the file access; a sketch of the relation's
shape follows.
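The relation is built on the public Data Source API. This is a
hypothetical sketch only: LogRelation and scanBlocks are made-up names,
and the actual schema inference and row parsing (where the private
InferSchema/JacksonParser calls live) are elided:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter,
      PrunedFilteredScan}
    import org.apache.spark.sql.types.StructType

    // Hypothetical relation over the block-indexed files. EqualTo
    // filters on the sort keys are pushed down to the index lookup;
    // Spark re-evaluates every filter after the scan, so ignoring the
    // other predicates here is safe.
    class LogRelation(path: String, inferredSchema: StructType)
                     (@transient val sqlContext: SQLContext)
      extends BaseRelation with PrunedFilteredScan {

      override def schema: StructType = inferredSchema

      override def buildScan(requiredColumns: Array[String],
                             filters: Array[Filter]): RDD[Row] = {
        val logType =
          filters.collectFirst { case EqualTo("log_type", v) => v.toString }
        val userId =
          filters.collectFirst { case EqualTo("user_id", v) => v.toString }
        scanBlocks(logType, userId, requiredColumns)
      }

      // The actual S3 range fetch, Snappy decode, and JSON parse (where
      // the private JacksonParser call lives) are elided.
      private def scanBlocks(logType: Option[String],
                             userId: Option[String],
                             columns: Array[String]): RDD[Row] =
        sqlContext.sparkContext.emptyRDD[Row]
    }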

It is very fast when querying for a single user, or for a single log type
at a sampling ratio of 10,000 to 1, compared to the Parquet file format.
(We do use Parquet for some log types when we need batch analysis.)

One of the problems we face is that the methods above are private APIs,
so we are forced to resort to hacks: copying the code, or placing our
classes in the org.apache.spark.sql package namespace, as sketched below.
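For the record, the namespace hack amounts to something like this
(illustrative only; PrivateApiShim is a made-up name):

    // A shim object declared inside Spark's own package tree so that
    // private[sql] members become visible to it. The exact InferSchema /
    // JacksonParser signatures vary by release, so the real forwarding
    // calls are elided rather than guessed at.
    package org.apache.spark.sql

    object PrivateApiShim {
      // Forward to e.g. execution.datasources.json.InferSchema from here
      // and re-export the results to code in our own packages. Copying
      // the upstream source files is the other (equally fragile) option.
    }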

I've been following the Spark SQL code since 1.4, and the JSON schema
inference code and JacksonParser seem to have been relatively stable
recently.  Could the core devs make these APIs public?

We are willing to open-source this file format because it works very well
for archiving user-related logs in S3.  The dependency on private Spark
SQL APIs is the main hurdle to making this a reality.

Thank you for reading!
