[
https://issues.apache.org/jira/browse/SPARK-47717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun updated SPARK-47717:
----------------------------------
Fix Version/s: (was: 3.3.2)
               (was: 3.4.1)
               (was: 3.5.1)
> Support Hive tables as a streaming source and sink
> --------------------------------------------------
>
> Key: SPARK-47717
> URL: https://issues.apache.org/jira/browse/SPARK-47717
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 3.3.2, 3.4.1, 3.5.1
> Reporter: Adi Suresh
> Priority: Major
>
> Many users keep data in Hive tables. These tables currently cannot be used
> with Spark streaming, so customers have no good way to stream this data
> natively in Spark. Current solutions rely on an intermediary that tracks which
> data has been read and periodically executes batch jobs. This use case should
> be supported by Spark's built-in streaming mechanism.
>
> From some research, Hive itself supports streaming
> [https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest+V2]
> but Spark does not support streaming on tables in Hive format. I don't think
> it makes sense to start copying Hive server-side code into Spark, but we
> could copy just the relevant logic and wrap it in the DataSourceV2 APIs to
> enable this feature. To avoid breaking backward compatibility, we would
> probably want to gate this behind a new Spark property.
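
For illustration only, the workaround mentioned in the description (an intermediary that tracks which data has been read and periodically runs batch jobs) might look roughly like the sketch below. It is plain Python with no Spark or Hive calls. The names PartitionTracker, next_batch, commit, and run_once are hypothetical, and treating a Hive partition as the unit of progress is an assumption, not something the issue specifies.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionTracker:
    """Hypothetical intermediary that remembers which partitions were processed."""
    processed: set = field(default_factory=set)

    def next_batch(self, available):
        """Return the partitions that have not been processed yet, sorted."""
        return sorted(p for p in available if p not in self.processed)

    def commit(self, batch):
        """Record a batch as processed so later runs skip it."""
        self.processed.update(batch)


def run_once(tracker, list_partitions, process):
    """One periodic run: find new partitions, process them, then commit.

    list_partitions and process are caller-supplied callables; in a real
    setup they would list table partitions and launch a batch job.
    """
    batch = tracker.next_batch(list_partitions())
    if batch:
        process(batch)         # run the batch job over the new partitions only
        tracker.commit(batch)  # commit after success, so a failed run is retried
    return batch
```

Because the commit happens only after processing succeeds, a failed run is retried on the next schedule, which gives at-least-once behavior. This bookkeeping is the part the issue asks Spark's streaming mechanism to take over.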