[I] Improve Jupyter notebook support with SQL magic commands and examples [datafusion-ballista]

via GitHub Thu, 22 Jan 2026 07:20:37 -0800


andygrove opened a new issue, #1398:
URL: https://github.com/apache/datafusion-ballista/issues/1398


   ## Summary
   
   Improve the Jupyter notebook experience for Ballista by adding SQL magic 
commands, example notebooks, and notebook-specific features. While basic 
notebook support already works via `_repr_html_`, there's an opportunity to 
provide a richer, more integrated experience.
   
   ## Current State
   
   PyBallista already supports basic Jupyter usage:
   
   ```python
   from ballista import BallistaSessionContext
   
   ctx = BallistaSessionContext("df://localhost:50050")
   df = ctx.sql("SELECT * FROM my_table LIMIT 10")
   df  # Renders as HTML table via _repr_html_()
   ```
   
   **What works today:**
   - `_repr_html_()` - DataFrames render as styled HTML tables
   - `to_pandas()` / `to_arrow_table()` / `to_polars()` - Data conversion
   - `show()` - Terminal-style output
   - Example `.py` files with `# %%` cell markers
   
   ## Proposed Improvements
   
   ### Phase 1: Documentation & Examples (Low Effort)
   
   1. **Add example Jupyter notebooks** to `python/examples/`:
      - `getting_started.ipynb` - Basic connection and queries
      - `dataframe_api.ipynb` - DataFrame transformations
      - `distributed_queries.ipynb` - Multi-stage query examples
   
   2. **Document notebook support** in `python/README.md`
   
   ### Phase 2: SQL Magic Commands (Medium Effort)
   
   Add IPython magic commands for a more interactive SQL experience:
   
   ```python
   %load_ext ballista.jupyter
   
   # Connect to cluster
   %ballista connect df://localhost:50050
   
   # Line magic for simple queries
   %sql SELECT COUNT(*) FROM orders
   
   # Cell magic for complex queries (%%sql must be the first line of its own cell)
   %%sql
   SELECT 
       customer_id,
       SUM(amount) as total
   FROM orders
   GROUP BY customer_id
   ORDER BY total DESC
   LIMIT 10
   ```
   
   **Implementation sketch:**
   
   ```python
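   # ballista/jupyter.py
   from IPython.core.magic import Magics, magics_class, line_magic, line_cell_magic
   
   from ballista import BallistaSessionContext
   
   @magics_class
   class BallistaMagics(Magics):
       def __init__(self, shell):
           super().__init__(shell)
           self.ctx = None  # set by `%ballista connect`
       
       @line_magic
       def ballista(self, line):
           """Ballista commands: connect, status, disconnect"""
           # An empty line yields cmd == "" instead of an unpacking error
           cmd, *args = line.split() or [""]
           if cmd == "connect":
               if not args:
                   raise ValueError("Usage: %ballista connect df://host:port")
               self.ctx = BallistaSessionContext(args[0])
               return f"Connected to {args[0]}"
           elif cmd == "status":
               # Show cluster status (not implemented in this sketch)
               pass
           elif cmd == "disconnect":
               self.ctx = None
           else:
               raise ValueError(f"Unknown command: {cmd!r}")
       
       @line_cell_magic
       def sql(self, line, cell=None):
           """Execute SQL as a line magic (%sql) or a cell magic (%%sql)"""
           if self.ctx is None:
               raise ValueError("Not connected. Use: %ballista connect df://host:port")
           # %%sql passes the query as the cell body; %sql passes it on the line
           return self.ctx.sql(cell if cell is not None else line)
   
   def load_ipython_extension(ipython):
       ipython.register_magics(BallistaMagics)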
   ```
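
As the `%ballista` magic gains subcommands (`connect`, `status`, `tables`, `schema`), the argument parsing could be kept separate from IPython so it can be tested on its own. The following is only a sketch: `parse_ballista_command` and the `SUBCOMMANDS` table are hypothetical names that do not exist in PyBallista, and the argument counts are assumptions.

```python
# Hypothetical helper: not part of PyBallista. Subcommand names come from the
# proposal; the argument counts are assumptions made for illustration.
SUBCOMMANDS = {
    "connect": 1,     # %ballista connect df://host:port
    "status": 0,      # %ballista status
    "disconnect": 0,  # %ballista disconnect
    "tables": 0,      # %ballista tables
    "schema": 1,      # %ballista schema <table>
}


def parse_ballista_command(line):
    """Split a magic line into (command, args) and validate the argument count."""
    parts = line.split()
    if not parts:
        raise ValueError("Usage: %ballista <" + "|".join(SUBCOMMANDS) + ">")
    cmd, args = parts[0], parts[1:]
    if cmd not in SUBCOMMANDS:
        raise ValueError(f"Unknown command: {cmd!r}")
    expected = SUBCOMMANDS[cmd]
    if len(args) != expected:
        raise ValueError(f"{cmd} expects {expected} argument(s), got {len(args)}")
    return cmd, args
```

The `ballista` line magic could then call this helper and dispatch on the returned command, which keeps usage errors consistent across subcommands.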
   
   **Alternative**: Integrate with 
[JupySQL](https://github.com/ploomber/jupysql) which provides a mature `%%sql` 
magic with features like:
   - Query composition
   - Result caching  
   - Plotting integration
   - Multiple connection management
   
   ### Phase 3: Enhanced Notebook Features (Medium Effort)
   
   1. **Query plan visualization**
      ```python
      df.explain_visual()  # Render SVG of execution plan in notebook
      ```
      Leverage the existing `/api/job/{job_id}/dot_svg` endpoint.
   
   2. **Progress indicators for long queries**
      ```python
      # Show progress bar during distributed query execution
      from ipywidgets import FloatProgress
      ```
   
   3. **Result size warnings**
      ```python
      # Warn before collecting large results
      df.collect()  # Warning: Query will return ~1M rows. Use .limit() or proceed? [y/N]
      ```
   
   4. **Schema exploration**
      ```python
      %ballista tables           # List registered tables
      %ballista schema orders    # Show schema for table
      ```
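
For the query plan visualization item, one possible shape is a small helper that fetches the SVG from the `/api/job/{job_id}/dot_svg` endpoint and wraps it in an object that Jupyter can render through the `_repr_svg_` display hook. This is a sketch under assumptions: `plan_svg_url`, `PlanSvg`, and `fetch_plan_svg` are hypothetical names, the REST base URL is a placeholder, and how a DataFrame exposes its job ID is not specified in this issue.

```python
from urllib.parse import quote
from urllib.request import urlopen


def plan_svg_url(rest_base_url, job_id):
    """Build the plan SVG URL for a job (base URL and job ID are caller-supplied)."""
    return f"{rest_base_url.rstrip('/')}/api/job/{quote(job_id, safe='')}/dot_svg"


class PlanSvg:
    """Holds SVG markup; Jupyter renders objects that define _repr_svg_()."""

    def __init__(self, svg):
        self.svg = svg

    def _repr_svg_(self):
        return self.svg


def fetch_plan_svg(rest_base_url, job_id, timeout=10.0):
    """Download the plan SVG and wrap it for notebook display (hypothetical helper)."""
    with urlopen(plan_svg_url(rest_base_url, job_id), timeout=timeout) as resp:
        return PlanSvg(resp.read().decode("utf-8"))
```

A proposed `df.explain_visual()` could return such an object as the last expression of a cell, so the plan renders inline without an explicit display call.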
   
   ## Benefits
   
   1. **Lower barrier to entry** - SQL magic is familiar to data scientists
   2. **Interactive exploration** - Faster iteration in notebooks
   3. **Discoverability** - Example notebooks show what's possible
   4. **Ecosystem alignment** - Follows patterns from ipython-sql, JupySQL, 
DuckDB
   
   ## Prior Art
   
   - [JupySQL](https://github.com/ploomber/jupysql) - Modern SQL magic for 
Jupyter
   - [ipython-sql](https://github.com/catherinedevlin/ipython-sql) - Original 
`%%sql` magic
   - [DuckDB Jupyter](https://duckdb.org/docs/guides/python/jupyter.html) - 
DuckDB's notebook integration
   - [Spark magic](https://github.com/baifengbai/sql_magic) - SQL magic for 
Spark
   
   ## Implementation Checklist
   
   - [ ] Add example `.ipynb` notebooks to `python/examples/`
   - [ ] Document notebook support in Python README
   - [ ] Create `ballista.jupyter` module with magic commands
   - [ ] Add `%ballista connect/status/tables/schema` line magics
   - [ ] Add `%%sql` cell magic
   - [ ] Add `explain_visual()` method for query plan rendering
   - [ ] Consider JupySQL integration as alternative/complement
   - [ ] Add progress indicator support for long-running queries


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


