GitHub user sfirke created a discussion: Script to scan Superset metadata DB and find unquoted URLs in text/markdown widgets
In Superset 4.x a link (`a href`, `img src`, iFrame embed) would work even if the URL wasn't quoted. For example, `<img src=https://www.example.com/img1.jpg>` would load. But in Superset >=5.0 that changed and URLs have to be quoted. I created this script to search the metadata DB and find any such links, so that we could go through and proactively fix broken images and links. I used an LLM, as the structure and comments probably make clear, but it took enough tweaking that it feels worth sharing - there are a lot of edge cases, like the URL appearing in the link text itself, or not in a link at all, etc.

```python
import pandas as pd
import sqlalchemy
import re

superset_metadata_db = sqlalchemy.create_engine(
    ...  # <your connection string, you should already have this if you're doing usage analytics>
)

# Simple SQL: get dashboards with URLs in their JSON fields
dashboards_with_urls = pd.read_sql(
    """
    SELECT id, dashboard_title, position_json::text AS position_json
    FROM dashboards
    WHERE position_json::text ~ 'https?://'
    """,
    superset_metadata_db
)

# Python regex to extract URLs with context
url_pattern = re.compile(r'(["\']?)(https?://[^\s"\'<>)}\]]+)(["\']?)')
markdown_pattern = re.compile(r'\[[^\]]+\]\(https?://[^)]+\)')

results = []

for _, row in dashboards_with_urls.iterrows():
    dashboard_id = row['id']
    dashboard_title = row['dashboard_title']

    # Process only the position_json field (json_metadata not needed)
    source_field = 'position_json'
    json_text = row['position_json']

    if not json_text or 'http' not in json_text:
        continue

    # Find all URL matches
    for match in url_pattern.finditer(json_text):
        quote_before = match.group(1)
        url = match.group(2)
        quote_after = match.group(3)

        # Get context around this specific match
        # (widen window to catch long markdown link text)
        start_pos = max(0, match.start() - 200)
        end_pos = min(len(json_text), match.end() + 200)
        context = json_text[start_pos:end_pos]

        # Skip markdown links like [text](url)
        if markdown_pattern.search(context):
            continue

        # Decide whether to include this match. By default include, but
        # if it's a bare URL (no surrounding quotes) we only keep it when
        # it is actually part of HTML (href attribute, inside an anchor's
        # inner text between > and <, or wrapped in angle brackets).
        include_match = True
        if quote_before == '' and quote_after == '':
            # Check for href="...url..." in the nearby context
            href_regex = re.compile(
                r'href\s*=\s*["\']' + re.escape(url) + r'["\']', re.IGNORECASE
            )
            if href_regex.search(context):
                include_match = True
            else:
                # Check if the URL appears as anchor inner text: there is a '>'
                # before and a '<' after the match within the captured context.
                # Compute absolute positions in the original text
                g_start = match.start()
                g_end = match.end()
                # look backwards for a '>' and forwards for a '<' within 200 chars
                prev_gt = json_text.rfind('>', max(0, g_start - 200), g_start)
                next_lt = json_text.find('<', g_end, min(len(json_text), g_end + 200))
                # Also accept explicit angle-bracket wrapping like <https://...>
                wrapped_left = (g_start > 0 and json_text[g_start - 1] == '<')
                wrapped_right = (g_end < len(json_text) and json_text[g_end] == '>')
                if (prev_gt != -1 and next_lt != -1) or (wrapped_left and wrapped_right):
                    include_match = True
                else:
                    include_match = False

        if not include_match:
            continue

        # Classify quote status
        if quote_before in ('"', "'") and quote_after in ('"', "'"):
            quote_status = 'quoted'
        elif quote_before == '' and quote_after == '':
            quote_status = 'bare'
        else:
            quote_status = 'mixed'

        results.append({
            'id': dashboard_id,
            'dashboard_title': dashboard_title,
            'source_field': source_field,
            'url': url,
            'quote_status': quote_status
        })

# Create DataFrame and remove duplicates
# (columns are given explicitly so the frame sorts cleanly even with zero matches)
URLs = pd.DataFrame(
    results,
    columns=['id', 'dashboard_title', 'source_field', 'url', 'quote_status']
).drop_duplicates().sort_values(['id', 'url']).reset_index(drop=True)
```

At this point write it to a .csv or your data warehouse, or access it as-is.

GitHub link: https://github.com/apache/superset/discussions/35649
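To see why capturing the optional quotes on each side of the URL is enough to classify a link, here is a minimal sketch of the same `url_pattern` and quote-status logic from the script, run against two toy widget snippets (hypothetical examples, not from a real dashboard):

```python
import re

# Same pattern as in the script: optional quote, URL, optional quote
url_pattern = re.compile(r'(["\']?)(https?://[^\s"\'<>)}\]]+)(["\']?)')

def quote_status(text):
    """Classify the first URL in `text` as 'quoted', 'bare', or 'mixed'."""
    m = url_pattern.search(text)
    if m is None:
        return None
    before, after = m.group(1), m.group(3)
    if before in ('"', "'") and after in ('"', "'"):
        return 'quoted'
    if before == '' and after == '':
        return 'bare'
    return 'mixed'

# The quoted form keeps working in Superset >=5.0...
print(quote_status('<img src="https://www.example.com/img1.jpg">'))  # -> quoted
# ...while the bare form is what the script flags for fixing
print(quote_status('<img src=https://www.example.com/img1.jpg>'))    # -> bare
```

Because the URL character class excludes quotes, whitespace, and angle brackets, the match stops exactly where the attribute value ends, so the surrounding quote groups are reliable indicators of how the URL was written.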
