loserwang1024 opened a new issue, #2793:
URL: https://github.com/apache/fluss/issues/2793

   ### Search before asking
   
   - [x] I searched in the [issues](https://github.com/apache/fluss/issues) and 
found nothing similar.
   
   
   ### Motivation
   
   ### Problem
   In PyArrow, `Table.to_batches` returns a list of Arrow record batches:
   
   
https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_batches
   ```python
   table.to_batches()[0].to_pandas()
      n_legs        animals
   0       2       Flamingo
   1       4          Horse
   2       5  Brittle stars
   3     100      Centipede
   ```
   
   
   Iceberg supports a table-level scan, e.g. via the generics API:
   ```java
   CloseableIterable<Record> records = IcebergGenerics.read(icebergTable).build();
   ```
   However, Fluss only supports a batch scanner per table bucket. If a user wants to read a limited number of rows from a whole table, they have to write something like the following:
   ```java
   try (Connection connection = ConnectionFactory.createConnection(flussConfig);
           Table table = connection.getTable(tablePath);
           Admin flussAdmin = connection.getAdmin()) {
       TableInfo tableInfo = flussAdmin.getTableInfo(tablePath).get();
       int bucketCount = tableInfo.getNumBuckets();

       // Enumerate every bucket of the table (per partition if partitioned).
       List<TableBucket> tableBuckets;
       if (tableInfo.isPartitioned()) {
           List<PartitionInfo> partitionInfos = flussAdmin.listPartitionInfos(tablePath).get();
           tableBuckets =
                   partitionInfos.stream()
                           .flatMap(
                                   partitionInfo ->
                                           IntStream.range(0, bucketCount)
                                                   .mapToObj(
                                                           bucketId ->
                                                                   new TableBucket(
                                                                           tableInfo.getTableId(),
                                                                           partitionInfo.getPartitionId(),
                                                                           bucketId)))
                           .collect(Collectors.toList());
       } else {
           tableBuckets =
                   IntStream.range(0, bucketCount)
                           .mapToObj(bucketId -> new TableBucket(tableInfo.getTableId(), bucketId))
                           .collect(Collectors.toList());
       }

       // Create one batch scanner per bucket and collect at most `limit` rows.
       Scan scan = table.newScan().limit(limit).project(projectedFields);
       List<BatchScanner> scanners =
               tableBuckets.stream()
                       .map(scan::createBatchScanner)
                       .collect(Collectors.toList());
       List<InternalRow> scannedRows = BatchScanUtils.collectLimitedRows(scanners, limit);
   }
   ```
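   For reference, the collection step in the snippet above can be modeled with plain JDK types. This is a hedged sketch of what a helper like `BatchScanUtils.collectLimitedRows` does conceptually; the stand-in scanner type (`Iterator<String>` instead of `BatchScanner`/`InternalRow`) and the method body are assumptions, not the real Fluss implementation:

   ```java
   import java.util.ArrayList;
   import java.util.Iterator;
   import java.util.List;

   // Hypothetical stand-in: each "scanner" is an iterator over already-fetched
   // rows (String here instead of InternalRow).
   public class CollectLimitedRowsSketch {

       /** Drain the scanners in order, stopping once {@code limit} rows are collected. */
       static List<String> collectLimitedRows(List<Iterator<String>> scanners, int limit) {
           List<String> rows = new ArrayList<>();
           for (Iterator<String> scanner : scanners) {
               while (scanner.hasNext() && rows.size() < limit) {
                   rows.add(scanner.next());
               }
               if (rows.size() >= limit) {
                   break; // limit reached; remaining scanners are never touched
               }
           }
           return rows;
       }

       public static void main(String[] args) {
           List<Iterator<String>> scanners = List.of(
                   List.of("r1", "r2").iterator(),
                   List.of("r3", "r4", "r5").iterator());
           System.out.println(collectLimitedRows(scanners, 3)); // prints [r1, r2, r3]
       }
   }
   ```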
   
   
   
   
   
   ### Solution
   
   I recommend providing a batch scanner for the whole table:
   ```java
   Table table = connection.getTable(tablePath);
   BatchScanner batchScanner =
           table.newScan()
                   .project(projectedFields)
                   .limit(limit)
                   .createBatchScanner();
   ```
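   Internally, such a table-level scanner could be a thin facade that chains the existing per-bucket scanners and enforces the limit globally. A minimal sketch with plain JDK iterators standing in for the real scanner types (all names and the chaining strategy here are hypothetical, not a proposed Fluss API):

   ```java
   import java.util.Iterator;
   import java.util.List;
   import java.util.NoSuchElementException;

   // Hypothetical sketch: chain per-bucket scanners (modeled as Iterator<String>
   // instead of Fluss's BatchScanner) into one table-level iterator with a global limit.
   public class TableScannerSketch {

       static Iterator<String> tableScanner(List<? extends Iterator<String>> buckets, int limit) {
           return new Iterator<String>() {
               private int emitted = 0; // rows returned so far across all buckets
               private int current = 0; // index of the bucket currently being drained

               @Override
               public boolean hasNext() {
                   if (emitted >= limit) {
                       return false; // global limit reached
                   }
                   while (current < buckets.size() && !buckets.get(current).hasNext()) {
                       current++; // skip exhausted buckets
                   }
                   return current < buckets.size();
               }

               @Override
               public String next() {
                   if (!hasNext()) {
                       throw new NoSuchElementException();
                   }
                   emitted++;
                   return buckets.get(current).next();
               }
           };
       }

       public static void main(String[] args) {
           Iterator<String> rows = tableScanner(
                   List.of(List.of("b0-r0").iterator(), List.of("b1-r0", "b1-r1").iterator()), 2);
           while (rows.hasNext()) {
               System.out.println(rows.next()); // prints b0-r0, then b1-r0; limit stops before b1-r1
           }
       }
   }
   ```

   A facade like this would keep the per-bucket scanner as the low-level primitive while sparing users the bucket-enumeration boilerplate shown in the Motivation section.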
   
   ### Anything else?
   
   _No response_
   
   ### Willingness to contribute
   
   - [ ] I'm willing to submit a PR!

