zhztheplayer opened a new issue, #10778:
URL: https://github.com/apache/incubator-gluten/issues/10778

   ### Backend
   
   VL (Velox)
   
   ### Bug description
   
   On a TPC-DS dataset in which 182300 files (which is a lot) were generated for table `web_sales` (the other fact tables similarly consist of many small files), Gluten is much slower than vanilla Spark when reading the data.
   
   Code versions: latest Gluten code + Spark 3.4 + Delta 2.4
   
   See test report:
   
   ```
   Test report: 
   
   Summary: 5 out of 5 queries passed. 
   
   | Query ID | Passed |    Row Count     | Planning Time (Millis)  |  Query Time (Millis)  | Speedup |
   |          |        | Vanilla | Gluten |  Vanilla   |   Gluten   |  Vanilla  |  Gluten   |         |
   |----------|--------|---------|--------|------------|------------|-----------|-----------|---------|
   |        q1|    true|      100|     100|       13348|       11773|      27781|       9613|  188.99%|
   |        q2|    true|     2513|    2513|        5610|        5452|      81757|     277679|  -70.56%|
   |        q3|    true|      100|     100|        3969|        4366|      31489|      18550|   69.75%|
   |        q4|    true|      100|     100|        2140|        2392|     300999|     658322|  -54.28%|
   |        q5|    true|      100|     100|        4013|        3674|     219881|     129141|   70.26%|
   |       all|    true|     2913|    2913|       29080|       27657|     661907|    1093305|  -39.46%|
   
   No failed queries. 
   ```
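
The Speedup column is not defined in the report. The figures are consistent with `(vanilla_time / gluten_time - 1) * 100`, so a negative value means Gluten was slower. The sketch below is not part of gluten-it: the formula is inferred from the numbers, and the names `QUERY_TIMES_MS`, `speedup_percent`, and `summarize` are hypothetical. It recomputes the column from the query times listed above.

```python
# Query times in milliseconds, copied from the test report:
# query id -> (vanilla Spark, Gluten).
QUERY_TIMES_MS = {
    "q1": (27781, 9613),
    "q2": (81757, 277679),
    "q3": (31489, 18550),
    "q4": (300999, 658322),
    "q5": (219881, 129141),
}


def speedup_percent(vanilla_ms, gluten_ms):
    """Relative speedup of Gluten over vanilla Spark, in percent.

    Assumption: this formula is inferred from the reported figures,
    not taken from the gluten-it source.
    """
    return (vanilla_ms / gluten_ms - 1.0) * 100.0


def summarize(times):
    """Return per-query speedups plus an 'all' entry based on summed times."""
    result = {
        query: round(speedup_percent(vanilla, gluten), 2)
        for query, (vanilla, gluten) in times.items()
    }
    total_vanilla = sum(vanilla for vanilla, _ in times.values())
    total_gluten = sum(gluten for _, gluten in times.values())
    result["all"] = round(speedup_percent(total_vanilla, total_gluten), 2)
    return result


if __name__ == "__main__":
    for query, pct in summarize(QUERY_TIMES_MS).items():
        print(f"{query:>4}: {pct:+.2f}%")
```

Under this reading, q2 and q4 took roughly 3.4x and 2.2x as long with Gluten as with vanilla Spark, which is what pulls the overall figure negative even though q1, q3, and q5 are faster.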
   
   Test command used (gluten-it):
   
   ```
   sbin/gluten-it.sh queries-compare --benchmark-type=ds --data-gen=once --local-cluster --auto-cluster-resource --off-heap-ratio=0.5 --enable-history --enable-ui --gen-partitioned-data -s=1000.0 --data-source=delta --data-dir=/root/data --extra-conf=spark.gluten.sql.columnar.scanOnly=true --queries=q1,q2,q3,q4,q5 --shuffle-partitions=100
   ```
   
   Hardware (64 CPUs + 256 GiB RAM):
   
   ```
   Gluten Version: 1.6.0-SNAPSHOT
   Commit: a1edfafcd4025440caef8bba5a0d5a1c432c2480
   CMake Version: 3.28.3
   System: Linux-6.1.141-155.222.amzn2023.x86_64
   Arch: x86_64
   CPU Name: Model name:                              Intel(R) Xeon(R) Platinum 8488C
   C++ Compiler: /usr/bin/c++
   C++ Compiler Version: 13.3.0
   C Compiler: /usr/bin/cc
   C Compiler Version: 13.3.0
   CMake Prefix Path: /usr/local;/usr;/;/root/.local/share/uv/tools/cmake/lib/python3.12/site-packages/cmake/data;/usr/local;/usr/X11R6;/usr/pkg;/opt
   ```
   
   ### Gluten version
   
   _No response_
   
   ### Spark version
   
   None
   
   ### Spark configurations
   
   _No response_
   
   ### System information
   
   _No response_
   
   ### Relevant logs
   
   ```bash
   
   ```



