Re: [PR] feat(huggingFace): add HuggingFaceModelResource for model browsing and media proxy [texera]

via GitHub Tue, 19 May 2026 16:28:00 -0700


PG1204 commented on code in PR #5124:
URL: https://github.com/apache/texera/pull/5124#discussion_r3270304878



##########
amber/src/main/scala/org/apache/texera/web/resource/HuggingFaceModelResource.scala:
##########
@@ -0,0 +1,504 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.texera.web.resource
+
+import com.fasterxml.jackson.core.`type`.TypeReference
+import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}
+import kong.unirest.Unirest
+
+import javax.ws.rs._
+import javax.ws.rs.core.{MediaType, Response}
+import java.nio.file.{Files, Path => NioPath, Paths}
+import java.util.concurrent.ConcurrentHashMap
+import java.util.stream.Collectors
+import scala.jdk.CollectionConverters._
+
+/**
+  * REST resource that proxies the Hugging Face Hub API to list
+  * models for the HuggingFace operator.
+  *
+  * Browse mode:  GET /api/huggingface/models?task=text-generation
+  *   Fetches ALL models for the task from HF Hub (paginated internally),
+  *   caches the full list server-side, and returns it.
+  *
+  * Search mode:  GET /api/huggingface/models?task=text-generation&search=bert
+  *   Forwards the search query to HF Hub API (searches all models).
+  */
+@Path("/huggingface")
+@Produces(Array(MediaType.APPLICATION_JSON))
+class HuggingFaceModelResource {
+
+  import HuggingFaceModelResource._
+
+  @GET
+  @Path("/models")
+  def listModels(
+      @QueryParam("task") @DefaultValue("text-generation") task: String,
+      @QueryParam("search") search: String
+  ): Response = {
+    try {
+      val hfToken = Option(System.getenv("HF_TOKEN")).getOrElse("")
+
+      // ── Search mode: forward query to HF Hub, return results directly ──
+      if (search != null && search.trim.nonEmpty) {
+        return fetchSearchResults(task, search.trim, hfToken)
+      }
+
+      // ── Browse mode: return ALL models for this task (cached) ──
+      val cached = modelCache.get(task)
+      if (cached != null) {
+        return Response.ok(cached).build()
+      }
+
+      // Not cached — fetch all pages from HF Hub API
+      val allModels = fetchAllModelsForTask(task, hfToken)
+      val json = objectMapper.writeValueAsString(allModels)
+      modelCache.put(task, json)
+
+      Response.ok(json).build()
+
+    } catch {
+      case e: Exception =>
+        Response
+          .status(Response.Status.INTERNAL_SERVER_ERROR)
+          .entity(s"""{"error":"Failed to fetch models: ${e.getMessage}"}""")
+          .build()
+    }
+  }
+
+  @POST
+  @Path("/upload-audio")
+  @Consumes(Array(MediaType.WILDCARD))
+  def uploadAudioReference(
+      @QueryParam("filename") filename: String,
+      bytes: Array[Byte]
+  ): Response = {
+    try {
+      if (bytes == null || bytes.isEmpty) {
+        return Response
+          .status(Response.Status.BAD_REQUEST)
+          .entity("""{"error":"Audio payload is empty."}""")
+          .build()
+      }
+
+      val safeFileName = Option(filename)
+        .map(_.trim)
+        .filter(_.nonEmpty)
+        .map(name => Paths.get(name).getFileName.toString)
+        .getOrElse("audio.bin")
+      val extension = {
+        val idx = safeFileName.lastIndexOf('.')
+        if (idx >= 0 && idx < safeFileName.length - 1) 
safeFileName.substring(idx) else ".bin"
+      }
+
+      val tempDir = Paths.get(System.getProperty("java.io.tmpdir"), 
"texera-hf-audio")
+      Files.createDirectories(tempDir)
+      val tempFile: NioPath = Files.createTempFile(tempDir, "hf-audio-", 
extension)
+      Files.write(tempFile, bytes)
+
+      val json = objectMapper.writeValueAsString(
+        Map(
+          "path" -> tempFile.toAbsolutePath.toString,
+          "fileName" -> safeFileName
+        ).asJava
+      )
+      Response.ok(json).build()
+    } catch {
+      case e: Exception =>
+        Response
+          .status(Response.Status.INTERNAL_SERVER_ERROR)
+          .entity(s"""{"error":"Failed to upload audio: ${e.getMessage}"}""")
+          .build()
+    }
+  }
+
+  @GET
+  @Path("/audio-preview")
+  def previewUploadedAudio(@QueryParam("path") path: String): Response = {
+    try {
+      val trimmedPath = Option(path).map(_.trim).getOrElse("")
+      if (trimmedPath.isEmpty) {
+        return Response
+          .status(Response.Status.BAD_REQUEST)
+          .entity("""{"error":"Audio path is required."}""")
+          .build()
+      }
+
+      val tempDir = Paths
+        .get(System.getProperty("java.io.tmpdir"), "texera-hf-audio")
+        .toAbsolutePath
+        .normalize()
+      val requestedPath = Paths.get(trimmedPath).toAbsolutePath.normalize()
+      if (!requestedPath.startsWith(tempDir)) {
+        return Response
+          .status(Response.Status.FORBIDDEN)
+          .entity("""{"error":"Audio path is outside the allowed preview 
directory."}""")
+          .build()
+      }
+      if (!Files.exists(requestedPath) || !Files.isRegularFile(requestedPath)) 
{
+        return Response
+          .status(Response.Status.NOT_FOUND)
+          .entity("""{"error":"Uploaded audio file was not found."}""")
+          .build()
+      }
+
+      val contentType = Option(Files.probeContentType(requestedPath))
+        .filter(_.trim.nonEmpty)
+        .getOrElse(inferAudioContentType(requestedPath))
+      Response.ok(Files.readAllBytes(requestedPath), contentType).build()
+    } catch {
+      case e: Exception =>
+        Response
+          .status(Response.Status.INTERNAL_SERVER_ERROR)
+          .entity(s"""{"error":"Failed to read uploaded audio: 
${e.getMessage}"}""")
+          .build()
+    }
+  }
+
+  @GET
+  @Path("/media-proxy")
+  def proxyRemoteMedia(@QueryParam("url") url: String): Response = {
+    try {
+      val trimmedUrl = Option(url).map(_.trim).getOrElse("")
+      if (trimmedUrl.isEmpty) {
+        return Response
+          .status(Response.Status.BAD_REQUEST)
+          .entity("""{"error":"Media URL is required."}""")
+          .build()
+      }
+      if (!trimmedUrl.startsWith("http://";) && 
!trimmedUrl.startsWith("https://";)) {

Review Comment:
   Changes:
   
   Defined a list of host names we actually need to proxy (HuggingFace, Fal, 
Replicate). The endpoint now parses the URL, extracts the host, and rejects 
with HTTP 403 if the host isn't on the list - before any outbound request goes 
out.
   
   Subdomain matching uses a leading-dot suffix check (endsWith("." + suffix)), 
which is important for security: cdn-lfs.huggingface.co is correctly accepted 
(real HF subdomain), but evilhuggingface.co is correctly rejected (lookalike).



##########
amber/src/main/scala/org/apache/texera/web/resource/HuggingFaceModelResource.scala:
##########
@@ -0,0 +1,504 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.texera.web.resource
+
+import com.fasterxml.jackson.core.`type`.TypeReference
+import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}
+import kong.unirest.Unirest
+
+import javax.ws.rs._
+import javax.ws.rs.core.{MediaType, Response}
+import java.nio.file.{Files, Path => NioPath, Paths}
+import java.util.concurrent.ConcurrentHashMap
+import java.util.stream.Collectors
+import scala.jdk.CollectionConverters._
+
+/**
+  * REST resource that proxies the Hugging Face Hub API to list
+  * models for the HuggingFace operator.
+  *
+  * Browse mode:  GET /api/huggingface/models?task=text-generation
+  *   Fetches ALL models for the task from HF Hub (paginated internally),
+  *   caches the full list server-side, and returns it.
+  *
+  * Search mode:  GET /api/huggingface/models?task=text-generation&search=bert
+  *   Forwards the search query to HF Hub API (searches all models).
+  */
+@Path("/huggingface")
+@Produces(Array(MediaType.APPLICATION_JSON))
+class HuggingFaceModelResource {
+
+  import HuggingFaceModelResource._
+
+  @GET
+  @Path("/models")
+  def listModels(
+      @QueryParam("task") @DefaultValue("text-generation") task: String,
+      @QueryParam("search") search: String
+  ): Response = {
+    try {
+      val hfToken = Option(System.getenv("HF_TOKEN")).getOrElse("")
+
+      // ── Search mode: forward query to HF Hub, return results directly ──
+      if (search != null && search.trim.nonEmpty) {
+        return fetchSearchResults(task, search.trim, hfToken)
+      }
+
+      // ── Browse mode: return ALL models for this task (cached) ──
+      val cached = modelCache.get(task)
+      if (cached != null) {
+        return Response.ok(cached).build()
+      }
+
+      // Not cached — fetch all pages from HF Hub API
+      val allModels = fetchAllModelsForTask(task, hfToken)
+      val json = objectMapper.writeValueAsString(allModels)
+      modelCache.put(task, json)
+
+      Response.ok(json).build()
+
+    } catch {
+      case e: Exception =>
+        Response
+          .status(Response.Status.INTERNAL_SERVER_ERROR)
+          .entity(s"""{"error":"Failed to fetch models: ${e.getMessage}"}""")
+          .build()
+    }

Review Comment:
   - Added an `errorJson(message)` helper that uses Jackson to build the 
response, so escaping is library-handled and the response is always well-formed 
JSON.
   - Added an `errorResponse(status, message)` helper that wraps it.
   - Rewrote every error path in the file (catch blocks + upstream-API error
     returns) to follow the same pattern: log the real exception/status
     to the server log, return a generic message to the client.
   
   Concretely the old:
   entity(s"""{"error":"Failed to fetch models: ${e.getMessage}"}""")
   
   becomes:
   logger.error(s"Failed to fetch HF models for task '$task'", e) 
errorResponse(Response.Status.INTERNAL_SERVER_ERROR, "Failed to fetch models.")
   
   So clients only ever see e.g. `{"error":"Failed to fetch models."}`, and 
operators still have the full exception + stack trace in the server log for 
debugging.



##########
amber/src/main/scala/org/apache/texera/web/resource/HuggingFaceModelResource.scala:
##########
@@ -0,0 +1,504 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.texera.web.resource
+
+import com.fasterxml.jackson.core.`type`.TypeReference
+import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}
+import kong.unirest.Unirest
+
+import javax.ws.rs._
+import javax.ws.rs.core.{MediaType, Response}
+import java.nio.file.{Files, Path => NioPath, Paths}
+import java.util.concurrent.ConcurrentHashMap
+import java.util.stream.Collectors
+import scala.jdk.CollectionConverters._
+
+/**
+  * REST resource that proxies the Hugging Face Hub API to list
+  * models for the HuggingFace operator.
+  *
+  * Browse mode:  GET /api/huggingface/models?task=text-generation
+  *   Fetches ALL models for the task from HF Hub (paginated internally),
+  *   caches the full list server-side, and returns it.
+  *
+  * Search mode:  GET /api/huggingface/models?task=text-generation&search=bert
+  *   Forwards the search query to HF Hub API (searches all models).
+  */
+@Path("/huggingface")
+@Produces(Array(MediaType.APPLICATION_JSON))
+class HuggingFaceModelResource {
+
+  import HuggingFaceModelResource._
+
+  @GET
+  @Path("/models")
+  def listModels(
+      @QueryParam("task") @DefaultValue("text-generation") task: String,
+      @QueryParam("search") search: String
+  ): Response = {
+    try {
+      val hfToken = Option(System.getenv("HF_TOKEN")).getOrElse("")
+
+      // ── Search mode: forward query to HF Hub, return results directly ──
+      if (search != null && search.trim.nonEmpty) {
+        return fetchSearchResults(task, search.trim, hfToken)
+      }
+
+      // ── Browse mode: return ALL models for this task (cached) ──
+      val cached = modelCache.get(task)
+      if (cached != null) {
+        return Response.ok(cached).build()
+      }
+
+      // Not cached — fetch all pages from HF Hub API
+      val allModels = fetchAllModelsForTask(task, hfToken)
+      val json = objectMapper.writeValueAsString(allModels)
+      modelCache.put(task, json)
+
+      Response.ok(json).build()
+
+    } catch {
+      case e: Exception =>
+        Response
+          .status(Response.Status.INTERNAL_SERVER_ERROR)
+          .entity(s"""{"error":"Failed to fetch models: ${e.getMessage}"}""")
+          .build()
+    }
+  }
+
+  @POST
+  @Path("/upload-audio")
+  @Consumes(Array(MediaType.WILDCARD))
+  def uploadAudioReference(
+      @QueryParam("filename") filename: String,
+      bytes: Array[Byte]
+  ): Response = {
+    try {
+      if (bytes == null || bytes.isEmpty) {
+        return Response
+          .status(Response.Status.BAD_REQUEST)
+          .entity("""{"error":"Audio payload is empty."}""")
+          .build()
+      }
+
+      val safeFileName = Option(filename)
+        .map(_.trim)
+        .filter(_.nonEmpty)
+        .map(name => Paths.get(name).getFileName.toString)
+        .getOrElse("audio.bin")
+      val extension = {
+        val idx = safeFileName.lastIndexOf('.')
+        if (idx >= 0 && idx < safeFileName.length - 1) 
safeFileName.substring(idx) else ".bin"
+      }
+
+      val tempDir = Paths.get(System.getProperty("java.io.tmpdir"), 
"texera-hf-audio")
+      Files.createDirectories(tempDir)
+      val tempFile: NioPath = Files.createTempFile(tempDir, "hf-audio-", 
extension)
+      Files.write(tempFile, bytes)
+
+      val json = objectMapper.writeValueAsString(
+        Map(
+          "path" -> tempFile.toAbsolutePath.toString,
+          "fileName" -> safeFileName
+        ).asJava
+      )
+      Response.ok(json).build()
+    } catch {
+      case e: Exception =>
+        Response
+          .status(Response.Status.INTERNAL_SERVER_ERROR)
+          .entity(s"""{"error":"Failed to upload audio: ${e.getMessage}"}""")
+          .build()
+    }
+  }
+
+  @GET
+  @Path("/audio-preview")
+  def previewUploadedAudio(@QueryParam("path") path: String): Response = {
+    try {
+      val trimmedPath = Option(path).map(_.trim).getOrElse("")
+      if (trimmedPath.isEmpty) {
+        return Response
+          .status(Response.Status.BAD_REQUEST)
+          .entity("""{"error":"Audio path is required."}""")
+          .build()
+      }
+
+      val tempDir = Paths
+        .get(System.getProperty("java.io.tmpdir"), "texera-hf-audio")
+        .toAbsolutePath
+        .normalize()
+      val requestedPath = Paths.get(trimmedPath).toAbsolutePath.normalize()
+      if (!requestedPath.startsWith(tempDir)) {
+        return Response
+          .status(Response.Status.FORBIDDEN)
+          .entity("""{"error":"Audio path is outside the allowed preview 
directory."}""")
+          .build()
+      }
+      if (!Files.exists(requestedPath) || !Files.isRegularFile(requestedPath)) 
{
+        return Response
+          .status(Response.Status.NOT_FOUND)
+          .entity("""{"error":"Uploaded audio file was not found."}""")
+          .build()
+      }
+
+      val contentType = Option(Files.probeContentType(requestedPath))
+        .filter(_.trim.nonEmpty)
+        .getOrElse(inferAudioContentType(requestedPath))
+      Response.ok(Files.readAllBytes(requestedPath), contentType).build()
+    } catch {
+      case e: Exception =>
+        Response
+          .status(Response.Status.INTERNAL_SERVER_ERROR)
+          .entity(s"""{"error":"Failed to read uploaded audio: 
${e.getMessage}"}""")
+          .build()
+    }
+  }
+
+  @GET
+  @Path("/media-proxy")
+  def proxyRemoteMedia(@QueryParam("url") url: String): Response = {
+    try {
+      val trimmedUrl = Option(url).map(_.trim).getOrElse("")
+      if (trimmedUrl.isEmpty) {
+        return Response
+          .status(Response.Status.BAD_REQUEST)
+          .entity("""{"error":"Media URL is required."}""")
+          .build()
+      }
+      if (!trimmedUrl.startsWith("http://";) && 
!trimmedUrl.startsWith("https://";)) {
+        return Response
+          .status(Response.Status.BAD_REQUEST)
+          .entity("""{"error":"Only http(s) media URLs are supported."}""")
+          .build()
+      }
+
+      val upstreamResponse = Unirest
+        .get(trimmedUrl)
+        .connectTimeout(10000)
+        .socketTimeout(120000)
+        .asBytes()
+
+      if (upstreamResponse.getStatus != 200) {
+        return Response
+          .status(upstreamResponse.getStatus)
+          .entity(
+            s"""{"error":"Failed to fetch remote media: 
${upstreamResponse.getStatusText}"}"""
+          )
+          .build()
+      }
+
+      val contentType = 
Option(upstreamResponse.getHeaders.getFirst("Content-Type"))
+        .filter(_.trim.nonEmpty)
+        .getOrElse(MediaType.APPLICATION_OCTET_STREAM)
+      Response.ok(upstreamResponse.getBody, contentType).build()
+    } catch {
+      case e: Exception =>
+        Response
+          .status(Response.Status.INTERNAL_SERVER_ERROR)
+          .entity(s"""{"error":"Failed to proxy remote media: 
${e.getMessage}"}""")
+          .build()
+    }
+  }
+
+  /** Search HF Hub for models matching a query within a task. */
+  private def fetchSearchResults(task: String, query: String, hfToken: 
String): Response = {
+    var request = Unirest
+      .get("https://huggingface.co/api/models";)
+      .queryString("pipeline_tag", task)
+      .queryString("sort", "downloads")
+      .queryString("direction", "-1")
+      .queryString("limit", "100")

Review Comment:
   Went with the "tell the caller it's truncated" option rather than full 
pagination:
   
   - Search response gets an `X-Texera-Truncated: true` header when results hit 
the SEARCH_LIMIT (100) cap. Header is absent below the cap.
   - Body shape stays a flat JSON array - frontend reads the header alongside, 
doesn't have to deal with an envelope shape.
   - Same truncation mechanism as the browse path uses when it hits MAX_PAGES 
(see L334) - one signal, two endpoints, frontend handles it uniformly.
   
   Avoided full pagination here because the search endpoint is intended for 
interactive "search as you type" - looping through up to 50 sequential HF pages 
on every keystroke would make the picker feel broken. The truncation hint 
nudges users to refine the query, which is more natural for search UX anyway. 
Browse mode (which legitimately wants the full list) is the place where the 
full pagination loop runs.
   
   Frontend hint UI ("showing top 100, refine to see more") lands with the 
picker PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] feat(huggingFace): add HuggingFaceModelResource for model browsing and media proxy [texera]

Reply via email to