GitHub user sfirke created a discussion: Script to scan Superset metadata DB 
and find unquoted URLs in text/markdown widgets

In Superset 4.x, a link (`a href`, `img src`, iframe embed) would work even
if the URL wasn't quoted. For example,
`<img src=https://www.example.com/img1.jpg>` would load.

But in Superset >=5.0 that changed, and URLs must be quoted, e.g.,
`<img src="https://www.example.com/img1.jpg">`. I created this script to
search the metadata DB and find any such links, so that we could go through
and proactively fix broken images and links. I used an LLM, as the structure
and comments probably make clear, but it took enough tweaking that it feels
worth sharing - there are a lot of edge cases, like the URL appearing in the
link text itself, or not appearing in a link at all, etc.
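
To make the quote check concrete, here's a minimal standalone sketch of what
the script's URL regex captures for a quoted vs. a bare `src` attribute (the
sample snippets are made up):

```python
import re

# Same URL regex as in the full script below
url_pattern = re.compile(r'(["\']?)(https?://[^\s"\'<>)}\]]+)(["\']?)')

# Made-up widget snippets: one quoted src, one bare src
samples = [
    '<img src="https://www.example.com/img1.jpg">',
    '<img src=https://www.example.com/img1.jpg>',
]
for snippet in samples:
    for m in url_pattern.finditer(snippet):
        # Groups 1 and 3 hold the surrounding quote characters, if any
        print(repr(m.group(1)), m.group(2), repr(m.group(3)))
# prints: '"' https://www.example.com/img1.jpg '"'
#         '' https://www.example.com/img1.jpg ''
```

The full script: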

```python
import pandas as pd
import sqlalchemy
import re

superset_metadata_db = sqlalchemy.create_engine(
    # <your connection string, you should already have this if you're
    # doing usage analytics>
    ...
)

# Simple SQL: get dashboards with URLs in their JSON fields
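# NOTE: the ::text casts and the ~ regex operator are Postgres syntax;
# adjust the query if your metadata DB runs on MySQL or another engine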
dashboards_with_urls = pd.read_sql(
    """
        SELECT 
            id,
            dashboard_title,
            position_json::text AS position_json
        FROM dashboards
        WHERE position_json::text ~ 'https?://'
    """,
    superset_metadata_db
)

# Python regex to extract URLs with context
url_pattern = re.compile(r'(["\']?)(https?://[^\s"\'<>)}\]]+)(["\']?)')
markdown_pattern = re.compile(r'\[[^\]]+\]\(https?://[^)]+\)')

results = []

for _, row in dashboards_with_urls.iterrows():
    dashboard_id = row['id']
    dashboard_title = row['dashboard_title']
    
    # Process only the position_json field (json_metadata not needed)
    source_field = 'position_json'
    json_text = row['position_json']
    if not json_text or 'http' not in json_text:
        continue
    
    # Find all URL matches
    for match in url_pattern.finditer(json_text):
        quote_before = match.group(1)
        url = match.group(2)
        quote_after = match.group(3)
        
        # Get context around this specific match (widen window to catch
        # long markdown link text)
        start_pos = max(0, match.start() - 200)
        end_pos = min(len(json_text), match.end() + 200)
        context = json_text[start_pos:end_pos]
        
        # Skip markdown links like [text](url)
        if markdown_pattern.search(context):
            continue
        
        # Decide whether to include this match. By default include, but
        # if it's a bare URL (no surrounding quotes) we only keep it when
        # it is actually part of HTML (href attribute, inside an anchor's
        # inner text between > and <, or wrapped in angle brackets).
        include_match = True
        if quote_before == '' and quote_after == '':
            # Check for href="...url..." in the nearby context
            href_regex = re.compile(
                r'href\s*=\s*["\']' + re.escape(url) + r'["\']', re.IGNORECASE
            )
            if href_regex.search(context):
                include_match = True
            else:
                # Check if the URL appears as anchor inner text: there is a '>' before
                # and a '<' after the match within the captured context slice
                # Compute absolute positions in the original text
                g_start = match.start()
                g_end = match.end()
                # look backwards for a '>' and forwards for a '<' within 200 chars
                prev_gt = json_text.rfind('>', max(0, g_start - 200), g_start)
                next_lt = json_text.find('<', g_end, min(len(json_text), g_end + 200))
                # Also accept explicit angle-bracket wrapping like <https://...>
                wrapped_left = (g_start > 0 and json_text[g_start - 1] == '<')
                wrapped_right = (g_end < len(json_text) and json_text[g_end] == '>')

                if (prev_gt != -1 and next_lt != -1) or (wrapped_left and wrapped_right):
                    include_match = True
                else:
                    include_match = False
                    
        if not include_match:
            continue
        
        # Classify quote status
        if quote_before in ('"', "'") and quote_after in ('"', "'"):
            quote_status = 'quoted'
        elif quote_before == '' and quote_after == '':
            quote_status = 'bare'
        else:
            quote_status = 'mixed'
            
        results.append({
            'id': dashboard_id,
            'dashboard_title': dashboard_title,
            'source_field': source_field,
            'url': url,
            'quote_status': quote_status
        })

# Create DataFrame and remove duplicates
URLs = (
    pd.DataFrame(results)
    .drop_duplicates()
    .sort_values(['id', 'url'])
    .reset_index(drop=True)
)
```

At this point, write it to a .csv, load it into your data warehouse, or work
with the DataFrame as-is.
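
For example, a minimal dump to a file for review (the filename is just a
placeholder):

```python
# Write the flagged URLs to disk for manual cleanup; path is arbitrary
URLs.to_csv('unquoted_dashboard_urls.csv', index=False)
```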

GitHub link: https://github.com/apache/superset/discussions/35649
