Re: [PR] Updating documentation on site - PDF creation for some files [daffodil-site]

via GitHub Mon, 27 Oct 2025 07:27:20 -0700


stevedlawrence commented on code in PR #193:
URL: https://github.com/apache/daffodil-site/pull/193#discussion_r2465690243



##########
site/_pandoc/Makefile:
##########
@@ -0,0 +1,63 @@
+# ==========================================================
+# Pandoc PDF generator for Jekyll site
+# Scans Markdown files with "pdf: true" in YAML front matter
+# and produces PDFs in the site's ./pdf/ directory
+# ==========================================================
+
+# --- Configuration ---
+SITE_ROOT := ..
+AWK_UNWRAP := $(SITE_ROOT)/_pandoc/unwrap-pandoc.awk
+AWK_LIST   := $(SITE_ROOT)/_pandoc/list-pdf-sources.awk
+PANDOC := pandoc
+
+# Output directory for generated PDFs (at site root)
+PDF_OUTDIR := $(SITE_ROOT)/pdf
+
+DEFAULTS := $(SITE_ROOT)/_pandoc/basic.yaml
+
+# --- Candidate Markdown files (exclude build/tool/output dirs) ---
+# Use find + awk pipeline — awk -f avoids executable bit.
+MD_CANDIDATES := $(shell find $(SITE_ROOT) \
+  -type f -name '*.md' \
+  -not -path '*/_*/*' \
+  -not -path '*/node_modules/*' \
+  -not -path '*/vendor/*' \
+  -not -path '*/pdf/*' \
+  -print0 | xargs -0 -r awk -f $(AWK_LIST))

Review Comment:
   We don't have a vendor directory, and I think the pdf directory shouldn't 
contain any .md files? And I think node_modules is in the repo root, not the 
site root, so the find never reaches it. Also, `_*` directories should never 
have pdf: true in them. All this to say, it feels like we could simplify this 
quite a bit and just use grep, e.g.:
   
   ```bash
   MD_CANDIDATES := $(shell grep -Rl --include '*.md' '^pdf: true$$' $(SITE_ROOT))
   ```
   
   It's not as relaxed about whitespace as the awk/sed script, and doesn't 
even require pdf: true to be in the header, but I think that's probably fine. I 
doubt we'll ever have a single line with just "pdf: true" in it anywhere. And 
it is much easier to maintain than the awk script.
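
If the whitespace tolerance of the awk script turns out to matter, grep can provide it with an extended regex. The following is only a sketch under assumptions: `list_pdf_candidates` is a hypothetical helper name that is not part of the PR, and it relies on the `--include` option offered by GNU and BSD grep. Like the one-liner above, it does not check that the match is inside the front matter.

```shell
#!/bin/bash
# Sketch: list Markdown files that contain a "pdf: true" line.
# list_pdf_candidates is a hypothetical name, not something from the PR.
list_pdf_candidates() {
  local root="$1"
  # -R recurse, -l print only file names, -E extended regex.
  # --include restricts the search to *.md files.
  # The pattern tolerates leading, inner, and trailing whitespace.
  grep -RlE --include='*.md' \
    '^[[:space:]]*pdf:[[:space:]]*true[[:space:]]*$' "$root" | sort
}

# Only run when a directory argument is given.
if [ "$#" -gt 0 ]; then
  list_pdf_candidates "$1"
fi
```

In a Makefile this could be called through `$(shell ...)`, keeping the regex out of the Makefile so that `$` does not need to be doubled.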



##########
site/dfdl-extensions.md:
##########
@@ -21,38 +22,60 @@ See the License for the specific language governing 
permissions and
 limitations under the License.
 {% endcomment %}
 -->
+<!-- 
+The :target="_blank" syntax below makes this open in a new tab 
+and work in the PDF and jekyll web page.
+But displays as literal text in the IDE markdown previewer. 
+--> 
+<div class="only-jekyll" markdown="1">
+_This page is available as a [downloadable 
PDF](../pdf/dfdl-extensions.pdf){:target="_blank"}._

Review Comment:
   Wondering if we can move this logic somewhere else so we don't have to 
duplicate it every time we mark a page as pdf: true. Maybe this is rare enough 
that it's not necessary? But for example, you could put something like this 
in _navigation.html.
   
   ```
   {% if page.pdf == true %}
   <div><i>This page is available as a <a href="../pdf/{{ page.title }}.pdf" target="_blank">downloadable PDF</a></i></div>
   {% endif %}
   ```



##########
site/_pandoc/list-pdf-sources.awk:
##########
@@ -0,0 +1,59 @@
+#!/usr/bin/awk -f
+# Prints FILENAME iff the file has YAML front matter with "pdf: true".
+# - Must be called with filenames (works via find/xargs).
+# - Ignores matches outside the front matter.
+# - Front matter is the lines between the first '---' and the next '---'.
+
+# We process each file independently.
+# Use a per-file BEGINFILE block if available (GNU awk). Otherwise reset on first record.
+BEGIN {
+  have_beginfile = 0
+}
+BEGINFILE {
+  have_beginfile = 1
+  in_front = 0
+  seen_start = 0
+  want_pdf = 0
+}
+
+# For non-GNU awk compatibility:
+# Reset on the first line of each file if BEGINFILE isn't supported.
+FNR == 1 && !have_beginfile {
+  in_front = 0
+  seen_start = 0
+  want_pdf = 0
+}
+
+{
+  # Detect start of front matter
+  if (!seen_start) {
+    if ($0 ~ /^[[:space:]]*---[[:space:]]*$/) {
+      seen_start = 1
+      in_front = 1
+      next
+    } else {
+      # No front matter: skip file
+      nextfile
+    }
+  }
+
+  # If in front matter, look for end and for pdf:true
+  if (in_front) {
+    if ($0 ~ /^[[:space:]]*---[[:space:]]*$/) {
+      # End of front matter
+      in_front = 0
+      if (want_pdf) {
+        print FILENAME
+      }
+      nextfile
+    }
+    # Match "pdf: true" allowing spaces; ensure it's a key at line start
+    if ($0 ~ /^[[:space:]]*pdf:[[:space:]]*true([[:space:]]|$)/) {
+      want_pdf = 1
+    }
+    next
+  }
+
+  # If we got here, we’ve passed front matter without finding pdf:true
+  nextfile
+}

Review Comment:
   Thoughts on a simple sed/bash script?
   
   ```bash
   #!/bin/bash
   
   for FILE in "$@"; do
     sed -n '/^---$/,/^---$/p' "$FILE" |
       grep -Eq '^[[:space:]]*pdf:[[:space:]]*true[[:space:]]*$' && echo "$FILE"
   done
   ```
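
One caveat with the range `/^---$/,/^---$/`, as far as I can tell: sed restarts the range at every later `---` line, so a page that uses `---` horizontal rules in its body (as the new README does) could match on body text. A possible variant, offered only as a sketch, anchors the range at line 1 and quits when the file does not start with front matter. `has_pdf_true` is a hypothetical function name.

```shell
#!/bin/bash
# Sketch: succeed only if the FIRST front-matter block contains "pdf: true".
# has_pdf_true is a hypothetical name, not something from the PR.
has_pdf_true() {
  # 1{/^---$/!q;}  quit right away unless line 1 is exactly '---'
  # 1,/^---$/p     print from line 1 through the next '---', then stop;
  #                a line-number start address cannot match a second time
  sed -n '1{/^---$/!q;}; 1,/^---$/p' "$1" |
    grep -Eq '^[[:space:]]*pdf:[[:space:]]*true[[:space:]]*$'
}

# Print every file argument whose front matter enables PDF output.
for FILE in "$@"; do
  if has_pdf_true "$FILE"; then
    echo "$FILE"
  fi
done
```

It stays close to the two-command pipeline above, so it should remain easier to maintain than the awk script.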



##########
site/_pandoc/README.md:
##########
@@ -0,0 +1,223 @@
+---
+layout: page
+title: Pandoc + Jekyll Integration
+pdf: false
+---
+# 🧭 Pandoc + Jekyll Integration
+
+This directory contains tools for generating **PDF versions** of selected 
Jekyll pages while keeping the same Markdown files usable by Jekyll for the 
website.
+
+The goal is to have **one Markdown source** that:
+- renders cleanly in the Jekyll site (for HTML),
+- and can also be converted into a polished PDF using **Pandoc + LaTeX**.
+
+---
+
+## 🏗️ Directory Layout
+
+```
+_pandoc/
+│
+├── README.md              ← this file
+├── Makefile               ← builds all PDFs
+├── unwrap-pandoc.awk      ← preprocessor that removes comment wrappers
+├── template.latex         ← (optional) custom LaTeX template
+├── header.tex             ← (optional) extra LaTeX header content
+└── ../pdf/                ← generated PDFs appear here
+```
+
+At the root of the Jekyll site:
+
+```
+_config.yml
+_posts/
+pages/
+assets/
+_pandoc/
+pdf/
+```
+
+---
+
+## 🧩 How It Works

Review Comment:
   Can we remove these emojis? I don't think they add anything; if anything, 
they make things more confusing.



##########
site/_pandoc/README.md:
##########
@@ -0,0 +1,223 @@
+---
+layout: page
+title: Pandoc + Jekyll Integration
+pdf: false
+---
+# 🧭 Pandoc + Jekyll Integration
+
+This directory contains tools for generating **PDF versions** of selected 
Jekyll pages while keeping the same Markdown files usable by Jekyll for the 
website.
+
+The goal is to have **one Markdown source** that:
+- renders cleanly in the Jekyll site (for HTML),
+- and can also be converted into a polished PDF using **Pandoc + LaTeX**.
+
+---
+
+## 🏗️ Directory Layout
+
+```
+_pandoc/
+│
+├── README.md              ← this file
+├── Makefile               ← builds all PDFs
+├── unwrap-pandoc.awk      ← preprocessor that removes comment wrappers
+├── template.latex         ← (optional) custom LaTeX template
+├── header.tex             ← (optional) extra LaTeX header content

Review Comment:
   These don't exist, can we remove these?



##########
site/_pandoc/README.md:
##########
@@ -0,0 +1,223 @@
+---
+layout: page
+title: Pandoc + Jekyll Integration
+pdf: false
+---
+# 🧭 Pandoc + Jekyll Integration
+
+This directory contains tools for generating **PDF versions** of selected 
Jekyll pages while keeping the same Markdown files usable by Jekyll for the 
website.
+
+The goal is to have **one Markdown source** that:
+- renders cleanly in the Jekyll site (for HTML),
+- and can also be converted into a polished PDF using **Pandoc + LaTeX**.
+
+---
+
+## 🏗️ Directory Layout
+
+```
+_pandoc/
+│
+├── README.md              ← this file
+├── Makefile               ← builds all PDFs
+├── unwrap-pandoc.awk      ← preprocessor that removes comment wrappers
+├── template.latex         ← (optional) custom LaTeX template
+├── header.tex             ← (optional) extra LaTeX header content
+└── ../pdf/                ← generated PDFs appear here
+```
+
+At the root of the Jekyll site:
+
+```
+_config.yml
+_posts/
+pages/
+assets/
+_pandoc/
+pdf/
+```

Review Comment:
   We don't have `_config.yml`, `_posts`, or `pages`. Suggest we just remove 
this as it doesn't provide anything useful.



##########
site/_pandoc/README.md:
##########
@@ -0,0 +1,223 @@
+---
+layout: page
+title: Pandoc + Jekyll Integration
+pdf: false
+---
+# 🧭 Pandoc + Jekyll Integration
+
+This directory contains tools for generating **PDF versions** of selected 
Jekyll pages while keeping the same Markdown files usable by Jekyll for the 
website.
+
+The goal is to have **one Markdown source** that:
+- renders cleanly in the Jekyll site (for HTML),
+- and can also be converted into a polished PDF using **Pandoc + LaTeX**.
+
+---
+
+## 🏗️ Directory Layout
+
+```
+_pandoc/
+│
+├── README.md              ← this file
+├── Makefile               ← builds all PDFs
+├── unwrap-pandoc.awk      ← preprocessor that removes comment wrappers
+├── template.latex         ← (optional) custom LaTeX template
+├── header.tex             ← (optional) extra LaTeX header content
+└── ../pdf/                ← generated PDFs appear here
+```
+
+At the root of the Jekyll site:
+
+```
+_config.yml
+_posts/
+pages/
+assets/
+_pandoc/
+pdf/
+```
+
+---
+
+## 🧩 How It Works
+
+### 1. Mark pages that should have PDFs
+
+Any Markdown file (in `_posts`, `pages/`, or elsewhere) can be tagged with:
+
+```yaml
+---
+title: Example Page
+layout: page
+pdf: true
+---
+```
+
+The Makefile will scan the entire Jekyll project and automatically detect 
these files.
+
+---
+
+### 2. Use HTML comment wrappers for Pandoc-only content
+
+Pandoc sometimes needs LaTeX code for things like custom tables, math, or page 
layout.  
+We hide that LaTeX from Jekyll using **HTML comments**, which Jekyll ignores 
but our AWK preprocessor removes before running Pandoc.
+
+Example:
+
+````markdown
+Regular Markdown content here.
+
+<!-- PANDOC:START -->
+<!--
+```{=latex}
+\begin{tabular}{ll}
+A & B \\
+C & D \\
+\end{tabular}
+```
+-->
+<!-- PANDOC:END -->
+
+More Markdown content.
+````
+
+When viewed on the Jekyll site:
+- This section is hidden (HTML comments are ignored).
+
+When built via Pandoc:
+- The `unwrap-pandoc.awk` script strips the comment wrappers,
+- The inner LaTeX becomes active, producing a correct PDF table.
+
+---
+
+### 3. The Makefile
+
+The `_pandoc/Makefile` automates the whole process.
+
+It:
+
+1. Recursively scans the site for Markdown files with `pdf: true`.
+2. Runs `unwrap-pandoc.awk` to clean up `<!-- PANDOC:START -->` / `<!-- 
PANDOC:END -->` wrappers.
+3. Invokes Pandoc with the configured LaTeX template to produce a PDF.
+
+The resulting PDFs go into:
+
+```
+_pandoc/output/
+```
+
+keeping them separate from the Jekyll site itself.
+
+---
+
+## 🧮 Example Commands
+
+From inside the `_pandoc/` directory:
+
+### Build all PDFs
+```bash
+make
+```
+
+### Clean all generated PDFs
+```bash
+make clean
+```
+
+### List all Markdown files with `pdf: true`
+```bash
+make list
+```
+
+### Force rebuild of one PDF
+```bash
+make ../about.pdf
+```
+
+---
+
+## 🧰 How the AWK Script Works
+
+`unwrap-pandoc.awk` removes the HTML comment wrappers used to hide LaTeX from 
Jekyll.
+
+Input example:
+
+````markdown
+<!-- PANDOC:START -->
+<!--
+\LaTeX code
+-->
+<!-- PANDOC:END -->
+````
+
+Output to Pandoc:
+
+```markdown
+\LaTeX code
+```
+
+That means Pandoc receives clean, valid LaTeX syntax while Jekyll never sees 
it.
+
+---

Review Comment:
   I think the above section kind of already mentions PANDOC:START/END.



##########
site/_pandoc/README.md:
##########
@@ -0,0 +1,223 @@
+---
+layout: page
+title: Pandoc + Jekyll Integration
+pdf: false
+---
+# 🧭 Pandoc + Jekyll Integration
+
+This directory contains tools for generating **PDF versions** of selected 
Jekyll pages while keeping the same Markdown files usable by Jekyll for the 
website.
+
+The goal is to have **one Markdown source** that:
+- renders cleanly in the Jekyll site (for HTML),
+- and can also be converted into a polished PDF using **Pandoc + LaTeX**.
+
+---
+
+## 🏗️ Directory Layout
+
+```
+_pandoc/
+│
+├── README.md              ← this file
+├── Makefile               ← builds all PDFs
+├── unwrap-pandoc.awk      ← preprocessor that removes comment wrappers
+├── template.latex         ← (optional) custom LaTeX template
+├── header.tex             ← (optional) extra LaTeX header content
+└── ../pdf/                ← generated PDFs appear here
+```
+
+At the root of the Jekyll site:
+
+```
+_config.yml
+_posts/
+pages/
+assets/
+_pandoc/
+pdf/
+```
+
+---
+
+## 🧩 How It Works
+
+### 1. Mark pages that should have PDFs
+
+Any Markdown file (in `_posts`, `pages/`, or elsewhere) can be tagged with:
+
+```yaml
+---
+title: Example Page
+layout: page
+pdf: true
+---
+```
+
+The Makefile will scan the entire Jekyll project and automatically detect 
these files.
+
+---
+
+### 2. Use HTML comment wrappers for Pandoc-only content
+
+Pandoc sometimes needs LaTeX code for things like custom tables, math, or page 
layout.  
+We hide that LaTeX from Jekyll using **HTML comments**, which Jekyll ignores 
but our AWK preprocessor removes before running Pandoc.
+
+Example:
+
+````markdown
+Regular Markdown content here.
+
+<!-- PANDOC:START -->
+<!--
+```{=latex}
+\begin{tabular}{ll}
+A & B \\
+C & D \\
+\end{tabular}
+```
+-->
+<!-- PANDOC:END -->
+
+More Markdown content.
+````
+
+When viewed on the Jekyll site:
+- This section is hidden (HTML comments are ignored).
+
+When built via Pandoc:
+- The `unwrap-pandoc.awk` script strips the comment wrappers,
+- The inner LaTeX becomes active, producing a correct PDF table.
+
+---
+
+### 3. The Makefile
+
+The `_pandoc/Makefile` automates the whole process.
+
+It:
+
+1. Recursively scans the site for Markdown files with `pdf: true`.
+2. Runs `unwrap-pandoc.awk` to clean up `<!-- PANDOC:START -->` / `<!-- 
PANDOC:END -->` wrappers.
+3. Invokes Pandoc with the configured LaTeX template to produce a PDF.
+
+The resulting PDFs go into:
+
+```
+_pandoc/output/
+```
+
+keeping them separate from the Jekyll site itself.
+
+---
+
+## 🧮 Example Commands
+
+From inside the `_pandoc/` directory:
+
+### Build all PDFs
+```bash
+make
+```
+
+### Clean all generated PDFs
+```bash
+make clean
+```
+
+### List all Markdown files with `pdf: true`
+```bash
+make list
+```
+
+### Force rebuild of one PDF
+```bash
+make ../about.pdf
+```
+
+---
+
+## 🧰 How the AWK Script Works
+
+`unwrap-pandoc.awk` removes the HTML comment wrappers used to hide LaTeX from 
Jekyll.
+
+Input example:
+
+````markdown
+<!-- PANDOC:START -->
+<!--
+\LaTeX code
+-->
+<!-- PANDOC:END -->
+````
+
+Output to Pandoc:
+
+```markdown
+\LaTeX code
+```
+
+That means Pandoc receives clean, valid LaTeX syntax while Jekyll never sees 
it.
+
+---
+
+## ⚙️ Customizing Pandoc
+
+Pandoc is run with `--defaults=basic.yaml` which specifies the 
`template_basic.tex` is used.
+The template can be modified to change the PDF output. 
+
+---
+
+## 🧱 Recommended Workflow
+
+1. Write Markdown pages normally for your Jekyll site.
+2. When you also want a PDF version, add `pdf: true` to front matter.
+3. If needed, wrap LaTeX-specific content in `<!-- PANDOC:START -->` / `<!-- 
PANDOC:END -->` blocks.
+4. From `_pandoc/`, run:
+   ```bash
+   make
+   ```
+5. Find the generated PDFs in `pdf/`.
+
+---
+
+## 🪄 Why This Setup Works
+
+| Concern | Solution |
+|----------|-----------|
+| Jekyll shouldn’t see LaTeX | Hidden in HTML comments |
+| Pandoc must see LaTeX | AWK removes wrappers |
+| Need automatic PDF generation | Makefile scans for `pdf: true` |
+| Keep tools separate | Everything lives in `_pandoc/` |
+
+---

Review Comment:
   Suggest we remove this section.



##########
site/_pandoc/README.md:
##########
@@ -0,0 +1,223 @@
+---
+layout: page
+title: Pandoc + Jekyll Integration
+pdf: false
+---
+# 🧭 Pandoc + Jekyll Integration
+
+This directory contains tools for generating **PDF versions** of selected 
Jekyll pages while keeping the same Markdown files usable by Jekyll for the 
website.
+
+The goal is to have **one Markdown source** that:
+- renders cleanly in the Jekyll site (for HTML),
+- and can also be converted into a polished PDF using **Pandoc + LaTeX**.
+
+---
+
+## 🏗️ Directory Layout
+
+```
+_pandoc/
+│
+├── README.md              ← this file
+├── Makefile               ← builds all PDFs
+├── unwrap-pandoc.awk      ← preprocessor that removes comment wrappers
+├── template.latex         ← (optional) custom LaTeX template
+├── header.tex             ← (optional) extra LaTeX header content
+└── ../pdf/                ← generated PDFs appear here
+```
+
+At the root of the Jekyll site:
+
+```
+_config.yml
+_posts/
+pages/
+assets/
+_pandoc/
+pdf/
+```
+
+---
+
+## 🧩 How It Works
+
+### 1. Mark pages that should have PDFs
+
+Any Markdown file (in `_posts`, `pages/`, or elsewhere) can be tagged with:
+
+```yaml
+---
+title: Example Page
+layout: page
+pdf: true
+---
+```
+
+The Makefile will scan the entire Jekyll project and automatically detect 
these files.
+
+---
+
+### 2. Use HTML comment wrappers for Pandoc-only content
+
+Pandoc sometimes needs LaTeX code for things like custom tables, math, or page 
layout.  
+We hide that LaTeX from Jekyll using **HTML comments**, which Jekyll ignores 
but our AWK preprocessor removes before running Pandoc.
+
+Example:
+
+````markdown
+Regular Markdown content here.
+
+<!-- PANDOC:START -->
+<!--
+```{=latex}
+\begin{tabular}{ll}
+A & B \\
+C & D \\
+\end{tabular}
+```
+-->
+<!-- PANDOC:END -->
+
+More Markdown content.
+````
+
+When viewed on the Jekyll site:
+- This section is hidden (HTML comments are ignored).
+
+When built via Pandoc:
+- The `unwrap-pandoc.awk` script strips the comment wrappers,
+- The inner LaTeX becomes active, producing a correct PDF table.
+
+---
+
+### 3. The Makefile
+
+The `_pandoc/Makefile` automates the whole process.
+
+It:
+
+1. Recursively scans the site for Markdown files with `pdf: true`.
+2. Runs `unwrap-pandoc.awk` to clean up `<!-- PANDOC:START -->` / `<!-- 
PANDOC:END -->` wrappers.
+3. Invokes Pandoc with the configured LaTeX template to produce a PDF.
+
+The resulting PDFs go into:
+
+```
+_pandoc/output/
+```
+
+keeping them separate from the Jekyll site itself.
+
+---
+
+## 🧮 Example Commands
+
+From inside the `_pandoc/` directory:
+
+### Build all PDFs
+```bash
+make
+```
+
+### Clean all generated PDFs
+```bash
+make clean
+```
+
+### List all Markdown files with `pdf: true`
+```bash
+make list
+```
+
+### Force rebuild of one PDF
+```bash
+make ../about.pdf
+```
+
+---
+
+## 🧰 How the AWK Script Works
+
+`unwrap-pandoc.awk` removes the HTML comment wrappers used to hide LaTeX from 
Jekyll.
+
+Input example:
+
+````markdown
+<!-- PANDOC:START -->
+<!--
+\LaTeX code
+-->
+<!-- PANDOC:END -->
+````
+
+Output to Pandoc:
+
+```markdown
+\LaTeX code
+```
+
+That means Pandoc receives clean, valid LaTeX syntax while Jekyll never sees 
it.
+
+---
+
+## ⚙️ Customizing Pandoc
+
+Pandoc is run with `--defaults=basic.yaml` which specifies the 
`template_basic.tex` is used.
+The template can be modified to change the PDF output. 
+
+---
+
+## 🧱 Recommended Workflow
+
+1. Write Markdown pages normally for your Jekyll site.
+2. When you also want a PDF version, add `pdf: true` to front matter.
+3. If needed, wrap LaTeX-specific content in `<!-- PANDOC:START -->` / `<!-- 
PANDOC:END -->` blocks.
+4. From `_pandoc/`, run:
+   ```bash
+   make
+   ```
+5. Find the generated PDFs in `pdf/`.
+
+---
+
+## 🪄 Why This Setup Works
+
+| Concern | Solution |
+|----------|-----------|
+| Jekyll shouldn’t see LaTeX | Hidden in HTML comments |
+| Pandoc must see LaTeX | AWK removes wrappers |
+| Need automatic PDF generation | Makefile scans for `pdf: true` |
+| Keep tools separate | Everything lives in `_pandoc/` |
+
+---
+
+## 🧾 Example Output
+
+```
+pdf/
+├── about.pdf
+└── _posts/
+    └── 2025-01-01-example.pdf
+```
+
+---
+
+**Maintainer Notes**
+
+- `_pandoc/Makefile` assumes it’s run from `_pandoc/`, with site root as `..`
+- Pandoc and AWK must be available on your `PATH`
+
+---
+
+## Pandoc Tools Installation
+
+These tools run on Linux.
+
+On Ubuntu you have to install these things:
+
+    sudo apt install pandoc texlive-latex-base texlive-latex-recommended \
+      texlive-fonts-recommended texlive-xetex texlive-latex-extra
+
+I have found one must update pandoc to a more up to date version.
+This is currently dependent on pandoc 3.7.0.2 which can be downloaded from
+https://github.com/jgm/pandoc/releases/tag/3.7.0.2 . 

Review Comment:
   Does this really depend on 3.7? I don't think we should rely on devs needing 
to install newer versions from source. Fedora only ships with 3.1, suggest we 
avoid modern features to make these easier to test.
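
Whatever minimum version ends up being required, the build could check it up front instead of failing inside Pandoc. A minimal sketch, with assumptions stated: `version_ge`, `check_pandoc`, and `MIN_PANDOC` are hypothetical names, the default of 3.1 is for illustration only, and the comparison relies on `sort -V` from GNU coreutils.

```shell
#!/bin/bash
# version_ge A B: succeed when version A is greater than or equal to B.
# Uses `sort -V` (version sort); the smaller version sorts first.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Hypothetical minimum, for illustration only; override via the environment.
MIN_PANDOC="${MIN_PANDOC:-3.1}"

check_pandoc() {
  local found
  # The first line of `pandoc --version` has the form "pandoc 3.1.3".
  found=$(pandoc --version 2>/dev/null | head -n1 | awk '{print $2}')
  if [ -z "$found" ]; then
    echo "pandoc not found on PATH" >&2
    return 1
  fi
  if ! version_ge "$found" "$MIN_PANDOC"; then
    echo "pandoc $found is older than required $MIN_PANDOC" >&2
    return 1
  fi
  echo "pandoc $found OK"
}
```

Calling `check_pandoc` before the first PDF is built would give a clear message on distributions that ship an older release.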



##########
site/dfdl-extensions.md:
##########
@@ -21,38 +22,60 @@ See the License for the specific language governing 
permissions and
 limitations under the License.
 {% endcomment %}
 -->
+<!-- 
+The :target="_blank" syntax below makes this open in a new tab 
+and work in the PDF and jekyll web page.
+But displays as literal text in the IDE markdown previewer. 
+--> 
+<div class="only-jekyll" markdown="1">
+_This page is available as a [downloadable 
PDF](../pdf/dfdl-extensions.pdf){:target="_blank"}._
+
+### Table of Contents
+{:.no_toc}
+
+1. use ordered table of contents 
+{:toc}
+</div>
+
+<div class="only-pandoc" markdown="1">
+# Introduction
+</div>

Review Comment:
   Should we just always have the Introduction header? The fewer differences 
between PDF and website the better.



##########
site/dfdl-extensions.md:
##########
@@ -21,38 +22,60 @@ See the License for the specific language governing 
permissions and
 limitations under the License.
 {% endcomment %}
 -->
+<!-- 
+The :target="_blank" syntax below makes this open in a new tab 
+and work in the PDF and jekyll web page.
+But displays as literal text in the IDE markdown previewer. 
+--> 
+<div class="only-jekyll" markdown="1">
+_This page is available as a [downloadable 
PDF](../pdf/dfdl-extensions.pdf){:target="_blank"}._
+
+### Table of Contents
+{:.no_toc}
+
+1. use ordered table of contents 
+{:toc}
+</div>
+
+<div class="only-pandoc" markdown="1">
+# Introduction
+</div>
 
 Daffodil provides extensions to the DFDL specification. 
-These properties are in the namespace defined by the URI 
+These functions and properties are in the namespace defined by the URI 
 ``http://www.ogf.org/dfdl/dfdl-1.0/extensions`` which is normally bound to the 
``dfdlx`` prefix 
 like so: 
 
 
 ``` xml
-<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
-           xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
-           xmlns:dfdlx="http://www.ogf.org/dfdl/dfdl-1.0/extensions"
+<schema xmlns="http://www.w3.org/2001/XMLSchema"
+        xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
+        xmlns:dfdlx="http://www.ogf.org/dfdl/dfdl-1.0/extensions"
 >
 ```
 
-The following symbols defined in this namespace are described below.
+The DFDL language extensions described below have Long Term Support (LTS) in 
Daffodil 
+going forward, and are proposed for inclusion in a future revision of the DFDL 
+standard. 
+DFDL schema authors can depend on the features and behaviors defined here 
without fear 
+that these extensions will be withdrawn in the future. 
 
-### Expression Functions
+# Expression Functions

Review Comment:
   I think there was a reason we used the `###`. That said, we should probably 
figure out why and change things. It is pretty annoying that our .md pages 
can't use normal headings. I'll look into this and see if I can figure out the 
reasoning and if we can change it.



##########
site/dfdl-extensions.md:
##########
@@ -87,46 +110,112 @@ found after fields `a` and `b`:
 <xs:element name="tag" type="xs:int" dfdl:length="8" />
 ```
 
-Bitwise Functions
+## Bitwise Functions: `bitAnd`, `bitOr`, `bitXor`, `bitNot`, `leftShift`, 
`rightShift`
+
+These functions are defined on types `long`, `int`, `short`, `byte`, 
`unsignedLong`, 
+`unsignedInt`, `unsignedShort`, and `unsignedByte`
+
+### `dfdlx:bitAnd(arg1, arg2)`
+
+This computes the bitwise AND of two integers. 
+
+- Both arguments must be signed, or both must be unsigned.
+- If the two arguments are not the same type the smaller one is converted into 
the type of the 
+larger one. 
+- If the smaller argument is signed, this conversion does sign-extension.
+- The result type is the that of the largest argument. 
+
+### `dfdlx:bitOr(arg1, arg2)`
+
+This computes the bitwise OR of two integers.
+
+- Both arguments must be signed, or both must be unsigned.
+- If the two arguments are not the same type the smaller one is converted into 
the type of the
+larger one.
+- If the smaller argument is signed, this conversion does sign-extension.
+- The result type is the that of the largest argument.
+
+### `dfdlx:bitXor(arg1, arg2)`
+
+This computes the bitwise Exclusive OR of two integers.
+
+- Both arguments must be signed, or both must be unsigned.
+- If the two arguments are not the same type the smaller one is converted into 
the type of the
+larger one. 
+- If the smaller argument is signed, this conversion does sign-extension.
+- The result type is the that of the largest argument.
+
+### `dfdlx:bitNot(arg)`
+
+This computes the bitwise NOT of an integer. Every bit is inverted. The result 
type is the same 
+as the argument type. 
+
+### `dfdlx:leftShift(value, shiftCount)`
+
+This is the _logical_ shift left, meaning that bits are shifted from 
less-significant positions 
+to more-significant positions. 
+
+- The left-most bits shifted out are discarded. 
+- Zeros are shifted in for the right-most bits. 
+- The result type is the same as the `value` argument type. 
+- It is a processing error if the `shiftCount` argument is < 0.
+- It is a processing error if the `shiftCount` argument is greater than the 
number of 
+  bits in the type of the value argument. 
+
+### `dfdlx:rightShift(value, shiftCount)`
+
+This is the _arithmetic_ shift right, meaning bits move from most-significant 
to 
+less-significant positions.
+If _logical_ (zero-filling) shift right is needed, you must use unsigned types.
+
+- The `value` argument is shifted by the `shiftCount`.
+- The right-most bits shifted out are discarded. 
+- If the `value` is signed, then the sign bit is shifted in for the left-most 
bits.
+- If the `value` is unsigned, then zeros are shifted in for the left-most 
bits. 
+- The result type is the same as the `value` argument type.
+- It is a processing error if the `shiftCount` argument is < 0.
+- It is a processing error if the `shiftCount` argument is greater than the 
number of
+  bits in the type of the value argument.
+
+## `dfdlx:doubleFromRawLong(longArg): double` and 
`dfdlx:doubleToRawLong(doubleArg): long`
 
-   : TBD, but the complete list (all ``dfdlx``) is `BitAnd`, `BitNot`, 
`BitOr`, `BitXor`, `LeftShift`, 
-   `RightShift`
+IEEE binary float and double values that are not NaN will parse to base 10 
text and unparse back
+to the same exact IEEE binary bits. 
+However, the same cannot be said for NaN (not a number) values, of which there 
are many bit 
+patterns. 
+To preserve float and double NaN values bit for bit you can use these 
functions to compute
+`xs:long` values that enable the DFDL Infoset to preserve the bits of a float 
or double value 
+even if it is a NaN. 
 
-``dfdlx:doubleFromRawLong`` and ``dfdlx:doubleToRawLong``
 
-   : Converting binary floating point numbers to/from base 10 text can result 
in lost information.
-The base 10 representation, converted back to binary representation, may not 
be bit-for-bit 
-   identical. These functions can be used to carry 8-byte double precision 
IEEE floating point 
-   numbers as type `xs:long` so that no information is lost. The DFDL schema 
can still obtain 
-   and operate on the floating point value by converting these `xs:long` 
values into type 
-   `xs:double`, and back if necessary for unparsing a new value. 
 
-### Properties
+# Properties
 
-``dfdlx:parseUnparsePolicy``
+## `dfdlx:parseUnparsePolicy`
 
-   : A property applied to simple and complex elements, which specifies 
whether the element supports only parsing, only unparsing, or both parsing and 
unparse. Valid values for this property are ``parse``, ``unparse``, or 
``both``. This allows one to leave off properties that are required for only 
parse or only unparse, such as ``dfdl:outputValueCalc`` or 
``dfdl:outputNewLine``, so that one may have a valid schema if only a subset of 
functionality is needed.
+A property applied to simple and complex elements, which specifies whether the 
element supports only parsing, only unparsing, or both parsing and unparse. 
Valid values for this property are ``parse``, ``unparse``, or ``both``. This 
allows one to leave off properties that are required for only parse or only 
unparse, such as ``dfdl:outputValueCalc`` or ``dfdl:outputNewLine``, so that 
one may have a valid schema if only a subset of functionality is needed.
 
-     All elements must have a compatible parseUnparsePolicy with the 
compilation parseUnparsePolicy (which is defined by the root element 
daf:parseUnparsePolicy and/or the Daffodil parseUnparsePolicy tunable) or it is 
a Schema Definition Error. An element is defined to have a compatible 
parseUnparsePolicy if it has the same value as the compilation 
parseUnparsePolicy or if it has the value ``both``.
+All elements must have a compatible parseUnparsePolicy with the compilation 
parseUnparsePolicy (which is defined by the root element daf:parseUnparsePolicy 
and/or the Daffodil parseUnparsePolicy tunable) or it is a Schema Definition 
Error. An element is defined to have a compatible parseUnparsePolicy if it has 
the same value as the compilation parseUnparsePolicy or if it has the value 
``both``.
 
-     For compatibility, if this property is not defined, it is assumed to be 
``both``.
+For compatibility, if this property is not defined, it is assumed to be 
``both``.
 
-``dfdlx:layer``
+## `dfdlx:layer`
 
-   : [Layers](/layers) provide algorithmic capabilities for decoding/encoding 
data or computing 
+_Layers_ provide algorithmic capabilities for decoding/encoding data or 
computing 
    checksums. Some are built-in to Daffodil. New layers can be created in 
Java/Scala and 
    plugged-in to Daffodil dynamically. 
+There is [separate Layer documentation](/layers).
 
-``dfdlx:direction``
+## `dfdlx:direction`
 
-   : TBD
+This property has 

Review Comment:
   Should probably stay TBD until documented.



##########
site/_pandoc/only.lua:
##########
@@ -0,0 +1,60 @@
+-- only.lua: drop .only-jekyll, keep contents of .only-pandoc
+-- Handles both native Div/Span nodes and raw HTML <div> wrappers.
+
+local List = require 'pandoc.List'
+
+local function has_class(classes, cls)
+  return classes and List.includes(classes, cls)
+end
+
+-- Native block divs (Pandoc recognized <div class="..."> as Div)
+function Div(el)
+  if has_class(el.classes, 'only-jekyll') then
+    return {}                  -- drop entirely
+  elseif has_class(el.classes, 'only-pandoc') then
+    return el.content          -- unwrap: keep inner blocks
+  end
+end
+
+-- Native inline spans
+function Span(el)
+  if has_class(el.classes, 'only-jekyll') then
+    return {}
+  elseif has_class(el.classes, 'only-pandoc') then
+    return el.content
+  end
+end
+
+-- Fallback for raw HTML wrappers when Pandoc didn’t turn them into Divs.
+function Pandoc(doc)
+  local out = List()
+  local mode = nil  -- nil | 'drop' | 'keep'
+
+  local function is_open_of(txt, klass)
+    -- match <div ... class="... klass ...">
+    return txt:match('<div[^>]-class=[\'"][^\'"]-' .. klass .. '[^\'"]-[\'"]')
+  end
+
+  for _, blk in ipairs(doc.blocks) do
+    if blk.t == 'RawBlock' and blk.format:match('html') then
+      local t = blk.text
+      if is_open_of(t, 'only%-jekyll') then
+        mode = 'drop'   -- drop wrapper and its inner content
+      elseif is_open_of(t, 'only%-pandoc') then
+        mode = 'keep'   -- drop wrapper, keep inner content
+      elseif t:match('</div>') and mode ~= nil then
+        mode = nil
+      else
+        if not mode or mode == 'keep' then out:insert(blk) end
+      end
+    else
+      if not mode then
+        out:insert(blk)
+      elseif mode == 'keep' then
+        out:insert(blk)
+      end
+    end
+  end
+
+  return pandoc.Pandoc(out, doc.meta)
+end

Review Comment:
   What is the difference between the `.only-pandoc` class and the 
`PANDOC:START`/`PANDOC:END` comment wrappers? They seem like two different ways 
to do the same thing.



##########
site/_pandoc/README.md:
##########
@@ -0,0 +1,223 @@
+---
+layout: page
+title: Pandoc + Jekyll Integration
+pdf: false
+---
+# 🧭 Pandoc + Jekyll Integration
+
+This directory contains tools for generating **PDF versions** of selected 
Jekyll pages while keeping the same Markdown files usable by Jekyll for the 
website.
+
+The goal is to have **one Markdown source** that:
+- renders cleanly in the Jekyll site (for HTML),
+- and can also be converted into a polished PDF using **Pandoc + LaTeX**.
+
+---
+
+## 🏗️ Directory Layout
+
+```
+_pandoc/
+│
+├── README.md              ← this file
+├── Makefile               ← builds all PDFs
+├── unwrap-pandoc.awk      ← preprocessor that removes comment wrappers
+├── template.latex         ← (optional) custom LaTeX template
+├── header.tex             ← (optional) extra LaTeX header content
+└── ../pdf/                ← generated PDFs appear here
+```
+
+At the root of the Jekyll site:
+
+```
+_config.yml
+_posts/
+pages/
+assets/
+_pandoc/
+pdf/
+```
+
+---
+
+## 🧩 How It Works
+
+### 1. Mark pages that should have PDFs
+
+Any Markdown file (in `_posts`, `pages/`, or elsewhere) can be tagged with:
+
+```yaml
+---
+title: Example Page
+layout: page
+pdf: true
+---
+```
+
+The Makefile will scan the entire Jekyll project and automatically detect 
these files.
+
+---
+
+### 2. Use HTML comment wrappers for Pandoc-only content
+
+Pandoc sometimes needs LaTeX code for things like custom tables, math, or page 
layout.  
+We hide that LaTeX from Jekyll using **HTML comments**, which Jekyll ignores 
but our AWK preprocessor removes before running Pandoc.
+
+Example:
+
+````markdown
+Regular Markdown content here.
+
+<!-- PANDOC:START -->
+<!--
+```{=latex}
+\begin{tabular}{ll}
+A & B \\
+C & D \\
+\end{tabular}
+```
+-->
+<!-- PANDOC:END -->
+
+More Markdown content.
+````
+
+When viewed on the Jekyll site:
+- This section is hidden (HTML comments are ignored).
+
+When built via Pandoc:
+- The `unwrap-pandoc.awk` script strips the comment wrappers,
+- The inner LaTeX becomes active, producing a correct PDF table.
+
+---
+
+### 3. The Makefile
+
+The `_pandoc/Makefile` automates the whole process.
+
+It:
+
+1. Recursively scans the site for Markdown files with `pdf: true`.
+2. Runs `unwrap-pandoc.awk` to clean up `<!-- PANDOC:START -->` / `<!-- 
PANDOC:END -->` wrappers.
+3. Invokes Pandoc with the configured LaTeX template to produce a PDF.
+
+The resulting PDFs go into:
+
+```
+_pandoc/output/
+```
+
+keeping them separate from the Jekyll site itself.
+
+---
+
+## 🧮 Example Commands
+
+From inside the `_pandoc/` directory:
+
+### Build all PDFs
+```bash
+make
+```
+
+### Clean all generated PDFs
+```bash
+make clean
+```
+
+### List all Markdown files with `pdf: true`
+```bash
+make list
+```
+
+### Force rebuild of one PDF
+```bash
+make ../about.pdf

Review Comment:
   Is this right? Shouldn't this be `make ../pdf/about.pdf`?
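
   For reference, the `patsubst` rule in the Makefile rewrites `$(SITE_ROOT)/%.md` to `$(PDF_OUTDIR)/%.pdf`, so the target for a page at the site root should land under `../pdf/`. A minimal shell sketch of that mapping, assuming a hypothetical `../about.md` source; `pdf_target` is a helper name made up for illustration and only covers paths that start with `$SITE_ROOT/`:

```shell
# Mirror the Makefile's SITE_ROOT and PDF_OUTDIR settings.
SITE_ROOT=..
PDF_OUTDIR="$SITE_ROOT/pdf"

# pdf_target SRC: hypothetical helper that imitates
#   $(patsubst $(SITE_ROOT)/%.md,$(PDF_OUTDIR)/%.pdf,SRC)
# for a source path located under SITE_ROOT.
pdf_target() {
  rel="${1#"$SITE_ROOT"/}"                         # drop the leading "../"
  printf '%s/%s.pdf\n' "$PDF_OUTDIR" "${rel%.md}"  # swap .md for .pdf under pdf/
}

pdf_target "$SITE_ROOT/about.md"   # prints ../pdf/about.pdf
```

Under those assumptions the README example would need to name `../pdf/about.pdf` for `make` to find a matching target.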



##########
site/_pandoc/README.md:
##########
@@ -0,0 +1,223 @@
+---
+layout: page
+title: Pandoc + Jekyll Integration
+pdf: false
+---
+# 🧭 Pandoc + Jekyll Integration
+
+This directory contains tools for generating **PDF versions** of selected 
Jekyll pages while keeping the same Markdown files usable by Jekyll for the 
website.
+
+The goal is to have **one Markdown source** that:
+- renders cleanly in the Jekyll site (for HTML),
+- and can also be converted into a polished PDF using **Pandoc + LaTeX**.
+
+---
+
+## 🏗️ Directory Layout
+
+```
+_pandoc/
+│
+├── README.md              ← this file
+├── Makefile               ← builds all PDFs
+├── unwrap-pandoc.awk      ← preprocessor that removes comment wrappers
+├── template.latex         ← (optional) custom LaTeX template
+├── header.tex             ← (optional) extra LaTeX header content
+└── ../pdf/                ← generated PDFs appear here
+```
+
+At the root of the Jekyll site:
+
+```
+_config.yml
+_posts/
+pages/
+assets/
+_pandoc/
+pdf/
+```
+
+---
+
+## 🧩 How It Works
+
+### 1. Mark pages that should have PDFs
+
+Any Markdown file (in `_posts`, `pages/`, or elsewhere) can be tagged with:
+
+```yaml
+---
+title: Example Page
+layout: page
+pdf: true
+---
+```
+
+The Makefile will scan the entire Jekyll project and automatically detect 
these files.
+
+---
+
+### 2. Use HTML comment wrappers for Pandoc-only content
+
+Pandoc sometimes needs LaTeX code for things like custom tables, math, or page 
layout.  
+We hide that LaTeX from Jekyll using **HTML comments**, which Jekyll ignores 
but our AWK preprocessor removes before running Pandoc.
+
+Example:
+
+````markdown
+Regular Markdown content here.
+
+<!-- PANDOC:START -->
+<!--
+```{=latex}
+\begin{tabular}{ll}
+A & B \\
+C & D \\
+\end{tabular}
+```
+-->
+<!-- PANDOC:END -->
+
+More Markdown content.
+````
+
+When viewed on the Jekyll site:
+- This section is hidden (HTML comments are ignored).
+
+When built via Pandoc:
+- The `unwrap-pandoc.awk` script strips the comment wrappers,
+- The inner LaTeX becomes active, producing a correct PDF table.
+
+---
+
+### 3. The Makefile
+
+The `_pandoc/Makefile` automates the whole process.
+
+It:
+
+1. Recursively scans the site for Markdown files with `pdf: true`.
+2. Runs `unwrap-pandoc.awk` to clean up `<!-- PANDOC:START -->` / `<!-- 
PANDOC:END -->` wrappers.
+3. Invokes Pandoc with the configured LaTeX template to produce a PDF.
+
+The resulting PDFs go into:
+
+```
+_pandoc/output/

Review Comment:
   I thought they go in the `../pdf` directory?



##########
site/_pandoc/README.md:
##########
@@ -0,0 +1,223 @@
+---
+layout: page
+title: Pandoc + Jekyll Integration
+pdf: false
+---
+# 🧭 Pandoc + Jekyll Integration
+
+This directory contains tools for generating **PDF versions** of selected 
Jekyll pages while keeping the same Markdown files usable by Jekyll for the 
website.
+
+The goal is to have **one Markdown source** that:
+- renders cleanly in the Jekyll site (for HTML),
+- and can also be converted into a polished PDF using **Pandoc + LaTeX**.
+
+---
+
+## 🏗️ Directory Layout
+
+```
+_pandoc/
+│
+├── README.md              ← this file
+├── Makefile               ← builds all PDFs
+├── unwrap-pandoc.awk      ← preprocessor that removes comment wrappers
+├── template.latex         ← (optional) custom LaTeX template
+├── header.tex             ← (optional) extra LaTeX header content
+└── ../pdf/                ← generated PDFs appear here
+```
+
+At the root of the Jekyll site:
+
+```
+_config.yml
+_posts/
+pages/
+assets/
+_pandoc/
+pdf/
+```
+
+---
+
+## 🧩 How It Works
+
+### 1. Mark pages that should have PDFs
+
+Any Markdown file (in `_posts`, `pages/`, or elsewhere) can be tagged with:
+
+```yaml
+---
+title: Example Page
+layout: page
+pdf: true
+---
+```
+
+The Makefile will scan the entire Jekyll project and automatically detect 
these files.
+
+---
+
+### 2. Use HTML comment wrappers for Pandoc-only content
+
+Pandoc sometimes needs LaTeX code for things like custom tables, math, or page 
layout.  
+We hide that LaTeX from Jekyll using **HTML comments**, which Jekyll ignores 
but our AWK preprocessor removes before running Pandoc.

Review Comment:
   Do we need this? I don't think we actually use LaTeX anywhere. I would also 
want to avoid cases where our PDF files differ from our markdown. It seems 
like pandoc should be able to convert our normal markdown files to PDF 
without a problem. It would be nice to remove all the preprocessor stuff if 
we don't really use it.
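
   One quick way to check whether any page relies on the preprocessor is to search for the wrapper marker. A rough sketch, assuming the marker appears literally as `PANDOC:START` and that the `_pandoc/` directory should be skipped because its README shows the marker as an example; `find_wrapper_users` is a name made up here:

```shell
# find_wrapper_users DIR: list .md files under DIR that contain the
# PANDOC:START marker, skipping the _pandoc tooling directory.
# Prints nothing (and still succeeds) when no page uses the wrappers.
find_wrapper_users() {
  grep -Rl --include='*.md' --exclude-dir='_pandoc' 'PANDOC:START' "$1" || true
}

# Only search when a directory is passed on the command line,
# e.g. `sh check-wrappers.sh ..` from inside _pandoc/ (script name is hypothetical).
if [ "$#" -gt 0 ]; then
  find_wrapper_users "$1"
fi
```

If this prints nothing for the site root, the AWK unwrap step is probably not doing any work and could be dropped.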



##########
site/_pandoc/README.md:
##########
@@ -0,0 +1,223 @@
+---
+layout: page
+title: Pandoc + Jekyll Integration
+pdf: false
+---
+# 🧭 Pandoc + Jekyll Integration
+
+This directory contains tools for generating **PDF versions** of selected 
Jekyll pages while keeping the same Markdown files usable by Jekyll for the 
website.
+
+The goal is to have **one Markdown source** that:
+- renders cleanly in the Jekyll site (for HTML),
+- and can also be converted into a polished PDF using **Pandoc + LaTeX**.
+
+---
+
+## 🏗️ Directory Layout
+
+```
+_pandoc/
+│
+├── README.md              ← this file
+├── Makefile               ← builds all PDFs
+├── unwrap-pandoc.awk      ← preprocessor that removes comment wrappers
+├── template.latex         ← (optional) custom LaTeX template
+├── header.tex             ← (optional) extra LaTeX header content
+└── ../pdf/                ← generated PDFs appear here
+```
+
+At the root of the Jekyll site:
+
+```
+_config.yml
+_posts/
+pages/
+assets/
+_pandoc/
+pdf/
+```
+
+---
+
+## 🧩 How It Works
+
+### 1. Mark pages that should have PDFs
+
+Any Markdown file (in `_posts`, `pages/`, or elsewhere) can be tagged with:

Review Comment:
   We don't have `_posts` or `_pages`; suggest we just say `.md` files.



##########
site/pdf/dfdl-extensions.pdf:
##########


Review Comment:
   We should not commit changes to these PDF files. Instead, we should modify 
the build/publish CI tool to rebuild the PDF files and commit them along with 
site changes.



##########
site/_pandoc/Makefile:
##########
@@ -0,0 +1,63 @@
+# ==========================================================
+# Pandoc PDF generator for Jekyll site
+# Scans Markdown files with "pdf: true" in YAML front matter
+# and produces PDFs in the site's ./pdf/ directory
+# ==========================================================
+
+# --- Configuration ---
+SITE_ROOT := ..
+AWK_UNWRAP := $(SITE_ROOT)/_pandoc/unwrap-pandoc.awk
+AWK_LIST   := $(SITE_ROOT)/_pandoc/list-pdf-sources.awk
+PANDOC := pandoc
+
+# Output directory for generated PDFs (at site root)
+PDF_OUTDIR := $(SITE_ROOT)/pdf
+
+DEFAULTS := $(SITE_ROOT)/_pandoc/basic.yaml
+
+# --- Candidate Markdown files (exclude build/tool/output dirs) ---
+# Use find + awk pipeline — awk -f avoids executable bit.
+MD_CANDIDATES := $(shell find $(SITE_ROOT) \
+  -type f -name '*.md' \
+  -not -path '*/_*/*' \
+  -not -path '*/node_modules/*' \
+  -not -path '*/vendor/*' \
+  -not -path '*/pdf/*' \
+  -print0 | xargs -0 -r awk -f $(AWK_LIST))
+
+# --- Files to build ---
+PDF_SRCS := $(MD_CANDIDATES)
+PDFS     := $(patsubst $(SITE_ROOT)/%.md,$(PDF_OUTDIR)/%.pdf,$(PDF_SRCS))
+

Review Comment:
   I wonder how useful it is to have a single PDF per page. For example, if 
someone wants a PDF of a page, they can just print it to a PDF.
   
   It feels like more useful documentation would combine chosen pages into 
a single large PDF. That way users could download it and have a single 
offline source for all Daffodil-related things. That might simplify much of 
the logic too. For example, we would have just one navigation link to download 
the offline documentation, rather than a bunch of pages with individual links.
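
   To make the idea concrete, here is a rough sketch of a combined build that reuses the `grep -Rl '^pdf: true$'` selection suggested earlier in this review. It only prints the pandoc command it would run (pandoc concatenates multiple input files into one document). The output name `daffodil-docs.pdf` and the function names are placeholders, and the sorted file order is just a stand-in for whatever ordering we would actually want:

```shell
# list_pdf_sources DIR: .md files under DIR that have a line that is
# exactly "pdf: true", in sorted order.
list_pdf_sources() {
  grep -Rl --include='*.md' '^pdf: true$' "$1" | sort
}

# combined_pdf_command DIR OUT: print (do not run) a single pandoc
# invocation that would combine every selected page into OUT.
# Paths containing spaces are not quoted in the printed command.
combined_pdf_command() {
  printf 'pandoc -o %s' "$2"
  list_pdf_sources "$1" | while IFS= read -r f; do
    printf ' %s' "$f"
  done
  printf '\n'
}

# Only run when a directory is passed on the command line.
if [ "$#" -gt 0 ]; then
  combined_pdf_command "$1" daffodil-docs.pdf
fi
```

With something like this the Makefile would need only one target and the site only one download link.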


