Re: [PR] Clarify documentation about gathering statistics for parquet files [datafusion]

2025-05-28 Thread via GitHub


xudong963 commented on code in PR #16157:
URL: https://github.com/apache/datafusion/pull/16157#discussion_r2112300128


##
docs/source/user-guide/sql/ddl.md:
##
@@ -91,6 +93,23 @@ STORED AS PARQUET
 LOCATION '/mnt/nyctaxi/tripdata.parquet';
 ```
 
+:::{note}

Review Comment:
   > Here is an example of what this looks like rendered
   
   TIL



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Clarify documentation about gathering statistics for parquet files [datafusion]

2025-05-28 Thread via GitHub


xudong963 merged PR #16157:
URL: https://github.com/apache/datafusion/pull/16157





Re: [PR] Clarify documentation about gathering statistics for parquet files [datafusion]

2025-05-27 Thread via GitHub


comphead commented on code in PR #16157:
URL: https://github.com/apache/datafusion/pull/16157#discussion_r2110078298


##
docs/source/user-guide/sql/ddl.md:
##
@@ -91,6 +93,23 @@ STORED AS PARQUET
 LOCATION '/mnt/nyctaxi/tripdata.parquet';
 ```
 
+:::{note}
+Statistics
+: By default, when a table is created, DataFusion will _NOT_ read the files
+to gather statistics, which can be expensive but can accelerate subsequent
+queries substantially. If you want to gather statistics
+when creating a table, set the `datafusion.explain.show_statistics`
+configuration option to `true` before creating the table. For example:
+
+```sql
+SET datafusion.explain.show_statistics = true;

Review Comment:
   ```suggestion
   SET datafusion.execution.collect_statistics = true;
   ```
   ?






Re: [PR] Clarify documentation about gathering statistics for parquet files [datafusion]

2025-05-27 Thread via GitHub


alamb commented on code in PR #16157:
URL: https://github.com/apache/datafusion/pull/16157#discussion_r2103274433


##
docs/source/user-guide/sql/ddl.md:
##
@@ -91,6 +93,23 @@ STORED AS PARQUET
 LOCATION '/mnt/nyctaxi/tripdata.parquet';
 ```
 
+:::{note}

Review Comment:
   Here is an example of what this looks like rendered
   
   ![Screenshot 2025-05-22 at 3 22 13 PM](https://github.com/user-attachments/assets/9b7639b4-4a79-4f68-b97d-646ae96df586)
   






Re: [PR] Clarify documentation about gathering statistics for parquet files [datafusion]

2025-05-27 Thread via GitHub


comphead commented on code in PR #16157:
URL: https://github.com/apache/datafusion/pull/16157#discussion_r2110073876


##
docs/source/user-guide/sql/ddl.md:
##
@@ -91,6 +93,23 @@ STORED AS PARQUET
 LOCATION '/mnt/nyctaxi/tripdata.parquet';
 ```
 
+:::{note}
+Statistics
+: By default, when a table is created, DataFusion will _NOT_ read the files
+to gather statistics, which can be expensive but can accelerate subsequent
+queries substantially. If you want to gather statistics
+when creating a table, set the `datafusion.explain.show_statistics`

Review Comment:
   ```suggestion
   when creating a table, set the `datafusion.execution.collect_statistics`
   ```
   ?






Re: [PR] Clarify documentation about gathering statistics for parquet files [datafusion]

2025-05-27 Thread via GitHub


comphead commented on code in PR #16157:
URL: https://github.com/apache/datafusion/pull/16157#discussion_r2110073876


##
docs/source/user-guide/sql/ddl.md:
##
@@ -91,6 +93,23 @@ STORED AS PARQUET
 LOCATION '/mnt/nyctaxi/tripdata.parquet';
 ```
 
+:::{note}
+Statistics
+: By default, when a table is created, DataFusion will _NOT_ read the files
+to gather statistics, which can be expensive but can accelerate subsequent
+queries substantially. If you want to gather statistics
+when creating a table, set the `datafusion.explain.show_statistics`

Review Comment:
   ```suggestion
   when creating a table, set the `datafusion.explain.collect_statistics`
   ```






Re: [PR] Clarify documentation about gathering statistics for parquet files [datafusion]

2025-05-26 Thread via GitHub


xudong963 commented on code in PR #16157:
URL: https://github.com/apache/datafusion/pull/16157#discussion_r2107141726


##
docs/source/user-guide/sql/ddl.md:
##
@@ -91,6 +93,23 @@ STORED AS PARQUET
 LOCATION '/mnt/nyctaxi/tripdata.parquet';
 ```
 
+:::{note}
+Statistics
+: By default, when a table is created, DataFusion will _NOT_ read the files
+to gather statistics, which can be expensive but can accelerate subsequent
+queries substantially. If you want to gather statistics
+when creating a table, set the `datafusion.explain.show_statistics`

Review Comment:
   `datafusion.explain.collect_statistics`?


