Li Jin created SPARK-22947: ------------------------------ Summary: SPIP: as-of join in Spark SQL Key: SPARK-22947 URL: https://issues.apache.org/jira/browse/SPARK-22947 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.1 Reporter: Li Jin
h2. Background and Motivation Time series analysis is one of the most common analysis on financial data. In time series analysis, as-of join is a very common operation. Supporting as-of join in Spark SQL will allow many use cases of using Spark SQL for time series analysis. As-of join is “join on time” with inexact time matching criteria. Various library has implemented asof join or similar functionality: Kdb: https://code.kx.com/wiki/Reference/aj Pandas: http://pandas.pydata.org/pandas-docs/version/0.19.0/merging.html#merging-merge-asof R: This functionality is called “Last Observation Carried Forward” https://www.rdocumentation.org/packages/zoo/versions/1.8-0/topics/na.locf JuliaDB: http://juliadb.org/latest/api/joins.html#IndexedTables.asofjoin Flint: https://github.com/twosigma/flint#temporal-join-functions This proposal advocates introducing new API in Spark SQL to support as-of join. h2. Target Personas Data scientists, data engineers h2. Goals * New API in Spark SQL that allows as-of join * As-of join of multiple table (>2) should be performant, because it’s very common that users need to join multiple data sources together for further analysis. * Define Distribution, Partitioning and shuffle strategy for ordered time series data h2. Non-Goals These are out of scope for the existing SPIP, should be considered in future SPIP as improvement to Spark’s time series analysis ability: * Utilize partition information from data source, i.e, begin/end of each partition to reduce sorting/shuffling * Define API for user to implement asof join time spec in business calendar (i.e. lookback one business day, this is very common in financial data analysis because of market calendars) * Support broadcast join h2. Proposed API Changes See attachment -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org