I work for a mobile game company. I'm trying to answer a simple question: "Can we efficiently and cheaply query the logs of a particular user within a given date range?"
I've created a custom JSON text-based file format with these traits:

- Snappy compressed, stored in AWS S3
- Partitioned by date, i.e. 2017-01-01.sz, 2017-01-02.sz, ...
- Sorted by a primary key (log_type) and a secondary key (user_id), Snappy block-compressed in 5MB blocks
- Blocks indexed by primary/secondary key in a sidecar file, e.g. 2017-01-01.json
- Efficient block-based random access on the primary key (log_type) and secondary key (user_id) via the index

I've implemented a Spark SQL DataFrame relation that can query this file format. Since the schema of each log type is fairly consistent, I reused the `InferSchema.inferSchema` method and `JacksonParser` from the Spark SQL code to support structured querying, and I've implemented filter push-down to optimize file access. Querying a single user, or a single log type with a sampling ratio of 10000 to 1, is very fast compared to the Parquet file format. (We do use Parquet for some log types when we need batch analysis.)

One problem we face is that the methods above are private APIs, so we are forced to resort to hacks such as copying the code or placing our classes in the org.apache.spark.sql package namespace.

I've been following the Spark SQL code since 1.4, and the JSON schema inference code and `JacksonParser` have been relatively stable recently. Could the core devs make these APIs public?

We are willing to open-source this file format, because it works very well for archiving user-related logs in S3; the dependency on private Spark SQL APIs is the main hurdle to making that a reality. Thank you for reading!
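To make the random-access idea concrete, here is a hypothetical Python sketch of how such a block index could work. The actual index layout isn't described above, so this assumes the sidecar .json stores, for each ~5MB Snappy block, its byte offset and the first (log_type, user_id) key it contains; the function names (`load_index`, `blocks_for`) are illustrative, not part of the real format. Because rows are sorted by (log_type, user_id), a binary search over the blocks' first keys narrows a point lookup to a handful of candidate blocks, which can then be fetched from S3 with ranged GETs and decompressed individually.

```python
import bisect
import json

def load_index(index_json):
    """Parse the sidecar index into a list of ((log_type, user_id), offset),
    sorted by key. Assumed entry shape:
    {"log_type": ..., "user_id": ..., "offset": ...}."""
    entries = json.loads(index_json)
    return sorted(
        ((e["log_type"], e["user_id"]), e["offset"]) for e in entries
    )

def blocks_for(index, log_type, user_id):
    """Return byte offsets of blocks that may contain (log_type, user_id).

    Matching rows can only live in the last block whose first key is
    <= the target, plus any following blocks whose first key equals
    the target (duplicates may spill across block boundaries)."""
    keys = [k for k, _ in index]
    target = (log_type, user_id)
    lo = bisect.bisect_left(keys, target)   # first block starting >= target
    hi = bisect.bisect_right(keys, target)  # past last block starting == target
    start = max(lo - 1, 0)                  # previous block may hold the target too
    return [off for _, off in index[start:hi]]

# Example: three log blocks; lookups touch only the relevant offsets.
entries = [
    {"log_type": "access", "user_id": 100, "offset": 0},
    {"log_type": "access", "user_id": 500, "offset": 5242880},
    {"log_type": "purchase", "user_id": 900, "offset": 10485760},
]
idx = load_index(json.dumps(entries))
print(blocks_for(idx, "access", 300))  # only the block starting at offset 0
```

A point query then decompresses just those blocks and scans them for the exact key, which is why a single-user lookup can skip almost the entire day's file.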