Bug report for reading Hive table as streaming source.

Xiaolong Wang Tue, 26 Mar 2024 02:00:42 -0700

Hi,

I found a weird bug when reading a Hive table as a streaming source.


In summary, if the first partition key is not time-related, the Hive table
cannot be read as a streaming source.

For example, I have a Hive table defined as:

```
CREATE TABLE article (
id BIGINT,
edition STRING,
dt STRING,
hh STRING
)
PARTITIONED BY (edition, dt, hh)
USING orc;
```
Then I try to query it as a streaming source:

```
INSERT INTO kafka_sink
SELECT id
FROM article /*+ OPTIONS('streaming-source.enable' = 'true',
'streaming-source.partition-order' = 'partition-name',
'streaming-source.consume-start-offset' =
'edition=en_US/dt=2024-03-26/hh=00') */
```

But I see no output in `kafka_sink`.
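
One way to rule out a missing or misnamed partition, independent of the streaming options, is to list the partitions registered for the table and run a bounded query against the same partition range. This is only a sketch: it assumes a Flink version and catalog setup in which `SHOW PARTITIONS` is available, and the filter simply mirrors the start offset used above.

```sql
-- List registered partitions and compare their names with the
-- configured start offset 'edition=en_US/dt=2024-03-26/hh=00'.
SHOW PARTITIONS article;

-- Bounded read of the same range (assumes batch execution, without
-- the streaming-source hints), to confirm the data itself is readable.
SELECT COUNT(*)
FROM article
WHERE edition = 'en_US'
  AND dt = '2024-03-26'
  AND hh >= '00';
```

If both statements return the expected partitions and rows, the problem is more likely in how the streaming source handles the partition order or start offset than in the table data.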

Then I defined an external table that points to the same path but has no
`edition` partition:

```
CREATE TABLE en_article (
id BIGINT,
edition STRING,
dt STRING,
hh STRING
)
PARTITIONED BY (dt, hh)
LOCATION 's3://xxx/article/edition=en_US'
USING orc;
```

Then I ran the insert with the following statement:

```
INSERT INTO kafka_sink
SELECT id
FROM en_article /*+ OPTIONS('streaming-source.enable' = 'true',
'streaming-source.partition-order' = 'partition-name',
'streaming-source.consume-start-offset' = 'dt=2024-03-26/hh=00') */
```

This time the data is written to the sink.
