This is an automated email from the ASF dual-hosted git repository. rzo1 pushed a commit to branch genai-text-extractor in repository https://gitbox.apache.org/repos/asf/stormcrawler.git
commit 47da87d3252c06a8dd62b132dea182d9d4fce301 Author: Richard Zowalla <[email protected]> AuthorDate: Thu Jun 12 16:56:10 2025 +0200 Adds a first draft for an LLM-based text extractor. Still needs some more love in terms of testing --- THIRD-PARTY.txt | 57 ++++---- .../stormcrawler/parse/JSoupTextExtractor.java | 5 - .../apache/stormcrawler/parse/TextExtractor.java | 5 + external/ai/README.md | 68 ++++++++ external/ai/ai-conf.yaml | 34 +++++ external/ai/pom.xml | 57 ++++++++ .../stormcrawler/ai/LlmResponseListener.java | 8 +- .../apache/stormcrawler/ai/LlmTextExtractor.java | 146 +++++++++++++++ .../ai/src/main/resources/llm-default-prompt.txt | 44 +++++++ .../stormcrawler/ai/LlmTextExtractorTest.java | 20 ++- external/ai/src/test/resources/stormcrawler.html | 118 +++++++++++++++++ external/pom.xml | 2 +- pom.xml | 2 + 13 files changed, 525 insertions(+), 41 deletions(-) diff --git a/THIRD-PARTY.txt b/THIRD-PARTY.txt index 42e0dbdf..ed55d797 100644 --- a/THIRD-PARTY.txt +++ b/THIRD-PARTY.txt @@ -14,11 +14,7 @@ List of third-party dependencies grouped by their license type. AL 2.0, GPL v2, MPL 2.0 - * RabbitMQ Java Client (com.rabbitmq:amqp-client:5.23.0 - https://www.rabbitmq.com) - - Apache License - - * Log4j Implemented Over SLF4J (org.slf4j:log4j-over-slf4j:2.0.16 - http://www.slf4j.org) + * RabbitMQ Java Client (com.rabbitmq:amqp-client:5.24.0 - https://www.rabbitmq.com) Apache License, Version 2.0 @@ -44,20 +40,20 @@ List of third-party dependencies grouped by their license type. 
* Apache FontBox (org.apache.pdfbox:fontbox:3.0.5 - http://pdfbox.apache.org/) * Apache Hadoop Client API (org.apache.hadoop:hadoop-client-api:3.4.1 - no url defined) * Apache Hadoop Client Runtime (org.apache.hadoop:hadoop-client-runtime:3.4.1 - no url defined) - * Apache HBase - Client (org.apache.hbase:hbase-client:2.6.1-hadoop3 - https://hbase.apache.org/hbase-build-configuration/hbase-client) - * Apache HBase - Common (org.apache.hbase:hbase-common:2.6.1-hadoop3 - https://hbase.apache.org/hbase-build-configuration/hbase-common) - * Apache HBase - Hadoop Compatibility (org.apache.hbase:hbase-hadoop-compat:2.6.1-hadoop3 - https://hbase.apache.org/hbase-build-configuration/hbase-hadoop-compat) - * Apache HBase - Hadoop Two Compatibility (org.apache.hbase:hbase-hadoop2-compat:2.6.1-hadoop3 - https://hbase.apache.org/hbase-build-configuration/hbase-hadoop2-compat) - * Apache HBase - Logging (org.apache.hbase:hbase-logging:2.6.1-hadoop3 - https://hbase.apache.org/hbase-build-configuration/hbase-logging) - * Apache HBase - Metrics API (org.apache.hbase:hbase-metrics-api:2.6.1-hadoop3 - https://hbase.apache.org/hbase-build-configuration/hbase-metrics-api) - * Apache HBase - Metrics Implementation (org.apache.hbase:hbase-metrics:2.6.1-hadoop3 - https://hbase.apache.org/hbase-build-configuration/hbase-metrics) - * Apache HBase Patched and Relocated (Shaded) Protobuf (org.apache.hbase.thirdparty:hbase-shaded-protobuf:4.1.9 - https://hbase.apache.org/hbase-shaded-protobuf) - * Apache HBase - Protocol (org.apache.hbase:hbase-protocol:2.6.1-hadoop3 - https://hbase.apache.org/hbase-build-configuration/hbase-protocol) - * Apache HBase Relocated (Shaded) GSON Libs (org.apache.hbase.thirdparty:hbase-shaded-gson:4.1.9 - https://hbase.apache.org/hbase-shaded-gson) - * Apache HBase Relocated (Shaded) Netty Libs (org.apache.hbase.thirdparty:hbase-shaded-netty:4.1.9 - https://hbase.apache.org/hbase-shaded-netty) - * Apache HBase Relocated (Shaded) Third-party Miscellaneous Libs 
(org.apache.hbase.thirdparty:hbase-shaded-miscellaneous:4.1.9 - https://hbase.apache.org/hbase-shaded-miscellaneous) - * Apache HBase - Shaded Protocol (org.apache.hbase:hbase-protocol-shaded:2.6.1-hadoop3 - https://hbase.apache.org/hbase-build-configuration/hbase-protocol-shaded) - * Apache HBase Unsafe Wrapper (org.apache.hbase.thirdparty:hbase-unsafe:4.1.9 - https://hbase.apache.org/hbase-unsafe) + * Apache HBase - Client (org.apache.hbase:hbase-client:2.6.2-hadoop3 - https://hbase.apache.org/hbase-build-configuration/hbase-client) + * Apache HBase - Common (org.apache.hbase:hbase-common:2.6.2-hadoop3 - https://hbase.apache.org/hbase-build-configuration/hbase-common) + * Apache HBase - Hadoop Compatibility (org.apache.hbase:hbase-hadoop-compat:2.6.2-hadoop3 - https://hbase.apache.org/hbase-build-configuration/hbase-hadoop-compat) + * Apache HBase - Hadoop Two Compatibility (org.apache.hbase:hbase-hadoop2-compat:2.6.2-hadoop3 - https://hbase.apache.org/hbase-build-configuration/hbase-hadoop2-compat) + * Apache HBase - Logging (org.apache.hbase:hbase-logging:2.6.2-hadoop3 - https://hbase.apache.org/hbase-build-configuration/hbase-logging) + * Apache HBase - Metrics API (org.apache.hbase:hbase-metrics-api:2.6.2-hadoop3 - https://hbase.apache.org/hbase-build-configuration/hbase-metrics-api) + * Apache HBase - Metrics Implementation (org.apache.hbase:hbase-metrics:2.6.2-hadoop3 - https://hbase.apache.org/hbase-build-configuration/hbase-metrics) + * Apache HBase Patched and Relocated (Shaded) Protobuf (org.apache.hbase.thirdparty:hbase-shaded-protobuf:4.1.10 - https://hbase.apache.org/hbase-shaded-protobuf) + * Apache HBase - Protocol (org.apache.hbase:hbase-protocol:2.6.2-hadoop3 - https://hbase.apache.org/hbase-build-configuration/hbase-protocol) + * Apache HBase Relocated (Shaded) GSON Libs (org.apache.hbase.thirdparty:hbase-shaded-gson:4.1.10 - https://hbase.apache.org/hbase-shaded-gson) + * Apache HBase Relocated (Shaded) Netty Libs 
(org.apache.hbase.thirdparty:hbase-shaded-netty:4.1.10 - https://hbase.apache.org/hbase-shaded-netty) + * Apache HBase Relocated (Shaded) Third-party Miscellaneous Libs (org.apache.hbase.thirdparty:hbase-shaded-miscellaneous:4.1.10 - https://hbase.apache.org/hbase-shaded-miscellaneous) + * Apache HBase - Shaded Protocol (org.apache.hbase:hbase-protocol-shaded:2.6.2-hadoop3 - https://hbase.apache.org/hbase-build-configuration/hbase-protocol-shaded) + * Apache HBase Unsafe Wrapper (org.apache.hbase.thirdparty:hbase-unsafe:4.1.10 - https://hbase.apache.org/hbase-unsafe) * Apache HttpAsyncClient (org.apache.httpcomponents:httpasyncclient:4.1.5 - http://hc.apache.org/httpcomponents-asyncclient) * Apache HttpClient (org.apache.httpcomponents:httpclient:4.5.14 - http://hc.apache.org/httpcomponents-client-ga) * Apache HttpClient Mime (org.apache.httpcomponents:httpmime:4.5.14 - http://hc.apache.org/httpcomponents-client-ga) @@ -83,6 +79,7 @@ List of third-party dependencies grouped by their license type. * Apache Lucene (module: spatial3d) (org.apache.lucene:lucene-spatial3d:9.12.1 - https://lucene.apache.org/) * Apache Lucene (module: spatial-extras) (org.apache.lucene:lucene-spatial-extras:9.12.1 - https://lucene.apache.org/) * Apache Lucene (module: suggest) (org.apache.lucene:lucene-suggest:9.12.1 - https://lucene.apache.org/) + * Apache OpenNLP Tools (org.apache.opennlp:opennlp-tools:2.5.4 - https://www.apache.org/opennlp/opennlp-tools/) * Apache PDFBox (org.apache.pdfbox:pdfbox:3.0.5 - https://www.apache.org/pdfbox-parent/pdfbox/) * Apache PDFBox io (org.apache.pdfbox:pdfbox-io:3.0.5 - https://www.apache.org/pdfbox-parent/pdfbox-io/) * Apache PDFBox tools (org.apache.pdfbox:pdfbox-tools:3.0.5 - https://www.apache.org/pdfbox-parent/pdfbox-tools/) @@ -140,12 +137,11 @@ List of third-party dependencies grouped by their license type. 
* error-prone annotations (com.google.errorprone:error_prone_annotations:2.38.0 - https://errorprone.info/error_prone_annotations) * FindBugs-jsr305 (com.google.code.findbugs:jsr305:3.0.2 - http://findbugs.sourceforge.net/) * Google Android Annotations Library (com.google.android:annotations:4.1.1.4 - http://source.android.com/) - * Graphite Integration for Metrics (io.dropwizard.metrics:metrics-graphite:4.2.29 - https://metrics.dropwizard.io/metrics-graphite) + * Graphite Integration for Metrics (io.dropwizard.metrics:metrics-graphite:4.2.30 - https://metrics.dropwizard.io/metrics-graphite) * Gson (com.google.code.gson:gson:2.11.0 - https://github.com/google/gson) * Gson (com.google.code.gson:gson:2.12.1 - https://github.com/google/gson) * Guava: Google Core Libraries for Java (com.google.guava:guava:18.0 - http://code.google.com/p/guava-libraries/guava) * Guava: Google Core Libraries for Java (com.google.guava:guava:33.2.1-android - https://github.com/google/guava) - * Guava: Google Core Libraries for Java (com.google.guava:guava:33.4.0-jre - https://github.com/google/guava) * Guava: Google Core Libraries for Java (com.google.guava:guava:33.4.8-jre - https://github.com/google/guava) * Guava InternalFutureFailureAccess and InternalFutures (com.google.guava:failureaccess:1.0.2 - https://github.com/google/guava/failureaccess) * Guava InternalFutureFailureAccess and InternalFutures (com.google.guava:failureaccess:1.0.3 - https://github.com/google/guava/failureaccess) @@ -180,17 +176,23 @@ List of third-party dependencies grouped by their license type. 
* Joda-Time (joda-time:joda-time:2.12.7 - https://www.joda.org/joda-time/) * jsonic (net.arnx:jsonic:1.2.11 - http://jsonic.sourceforge.jp/) * JSpecify annotations (org.jspecify:jspecify:1.0.0 - http://jspecify.org/) - * JVM Integration for Metrics (io.dropwizard.metrics:metrics-jvm:4.2.29 - https://metrics.dropwizard.io/metrics-jvm) + * JVM Integration for Metrics (io.dropwizard.metrics:metrics-jvm:4.2.30 - https://metrics.dropwizard.io/metrics-jvm) * jwarc (org.netpreserve:jwarc:0.31.1 - https://github.com/iipc/jwarc) * Kotlin Stdlib (org.jetbrains.kotlin:kotlin-stdlib:1.8.21 - https://kotlinlang.org/) * Kotlin Stdlib Common (org.jetbrains.kotlin:kotlin-stdlib-common:1.9.10 - https://kotlinlang.org/) * Kotlin Stdlib Jdk7 (org.jetbrains.kotlin:kotlin-stdlib-jdk7:1.8.21 - https://kotlinlang.org/) * Kotlin Stdlib Jdk8 (org.jetbrains.kotlin:kotlin-stdlib-jdk8:1.8.21 - https://kotlinlang.org/) + * LangChain4j :: Core (dev.langchain4j:langchain4j-core:1.0.1 - https://github.com/langchain4j/langchain4j/tree/main/langchain4j-core) + * LangChain4j :: HTTP Client :: JDK HttpClient (dev.langchain4j:langchain4j-http-client-jdk:1.0.1 - https://github.com/langchain4j/langchain4j/tree/main/langchain4j-http-client-jdk) + * LangChain4j :: HTTP Client (dev.langchain4j:langchain4j-http-client:1.0.1 - https://github.com/langchain4j/langchain4j/tree/main/langchain4j-http-client) + * LangChain4j :: Integration :: OpenAI (dev.langchain4j:langchain4j-open-ai:1.0.1 - https://github.com/langchain4j/langchain4j/tree/main/langchain4j-open-ai) + * LangChain4j (dev.langchain4j:langchain4j:1.0.1 - https://github.com/langchain4j/langchain4j/tree/main/langchain4j) * lang-mustache (org.opensearch.plugin:lang-mustache-client:2.19.1 - https://github.com/opensearch-project/OpenSearch.git) * language-detector (com.optimaize.languagedetector:language-detector:0.6 - https://github.com/optimaize/language-detector) + * Log4j Implemented Over SLF4J (org.slf4j:log4j-over-slf4j:2.0.17 - 
http://www.slf4j.org) * mapper-extras (org.opensearch.plugin:mapper-extras-client:2.19.1 - https://github.com/opensearch-project/OpenSearch.git) - * Metrics Core (io.dropwizard.metrics:metrics-core:4.2.29 - https://metrics.dropwizard.io/metrics-core) - * Metrics Integration with JMX (io.dropwizard.metrics:metrics-jmx:4.2.29 - https://metrics.dropwizard.io/metrics-jmx) + * Metrics Core (io.dropwizard.metrics:metrics-core:4.2.30 - https://metrics.dropwizard.io/metrics-core) + * Metrics Integration with JMX (io.dropwizard.metrics:metrics-jmx:4.2.30 - https://metrics.dropwizard.io/metrics-jmx) * Netty/Buffer (io.netty:netty-buffer:4.1.105.Final - https://netty.io/netty-buffer/) * Netty/Codec (io.netty:netty-codec:4.1.105.Final - https://netty.io/netty-codec/) * Netty/Common (io.netty:netty-common:4.1.105.Final - https://netty.io/netty-common/) @@ -286,12 +288,9 @@ List of third-party dependencies grouped by their license type. Bouncy Castle Licence - * Bouncy Castle ASN.1 Extension and Utility APIs (org.bouncycastle:bcutil-jdk18on:1.79 - https://www.bouncycastle.org/java.html) * Bouncy Castle ASN.1 Extension and Utility APIs (org.bouncycastle:bcutil-jdk18on:1.80 - https://www.bouncycastle.org/download/bouncy-castle-java/) * Bouncy Castle JavaMail Jakarta S/MIME APIs (org.bouncycastle:bcjmail-jdk18on:1.80 - https://www.bouncycastle.org/download/bouncy-castle-java/) - * Bouncy Castle PKIX, CMS, EAC, TSP, PKCS, OCSP, CMP, and CRMF APIs (org.bouncycastle:bcpkix-jdk18on:1.79 - https://www.bouncycastle.org/java.html) * Bouncy Castle PKIX, CMS, EAC, TSP, PKCS, OCSP, CMP, and CRMF APIs (org.bouncycastle:bcpkix-jdk18on:1.80 - https://www.bouncycastle.org/download/bouncy-castle-java/) - * Bouncy Castle Provider (org.bouncycastle:bcprov-jdk18on:1.79 - https://www.bouncycastle.org/java.html) * Bouncy Castle Provider (org.bouncycastle:bcprov-jdk18on:1.80 - https://www.bouncycastle.org/download/bouncy-castle-java/) BSD-2-Clause, Public Domain, per Creative Commons CC0 @@ -362,18 
+361,18 @@ List of third-party dependencies grouped by their license type. * Animal Sniffer Annotations (org.codehaus.mojo:animal-sniffer-annotations:1.24 - https://www.mojohaus.org/animal-sniffer/animal-sniffer-annotations) * Checker Qual (org.checkerframework:checker-qual:3.42.0 - https://checkerframework.org/) - * Checker Qual (org.checkerframework:checker-qual:3.43.0 - https://checkerframework.org/) * dd-plist (com.googlecode.plist:dd-plist:1.28 - http://www.github.com/3breadt/dd-plist) * JCodings (org.jruby.jcodings:jcodings:1.0.58 - http://nexus.sonatype.org/oss-repository-hosting.html/jcodings) * Joni (org.jruby.joni:joni:2.2.1 - http://nexus.sonatype.org/oss-repository-hosting.html/joni) * JOpt Simple (net.sf.jopt-simple:jopt-simple:5.0.4 - http://jopt-simple.github.io/jopt-simple) * jsoup Java HTML Parser (org.jsoup:jsoup:1.20.1 - https://jsoup.org/) + * JTokkit (com.knuddels:jtokkit:1.1.0 - https://github.com/knuddelsgmbh/jtokkit) * org.brotli:dec (org.brotli:dec:0.1.2 - http://brotli.org/dec) * semver4j (org.semver4j:semver4j:5.3.0 - https://github.com/semver4j/semver4j) * SLF4J API Module (org.slf4j:slf4j-api:1.7.36 - http://www.slf4j.org) * SLF4J API Module (org.slf4j:slf4j-api:1.7.6 - http://www.slf4j.org) * SLF4J API Module (org.slf4j:slf4j-api:2.0.13 - http://www.slf4j.org) - * SLF4J API Module (org.slf4j:slf4j-api:2.0.16 - http://www.slf4j.org) + * SLF4J API Module (org.slf4j:slf4j-api:2.0.17 - http://www.slf4j.org) * xsoup (us.codecraft:xsoup:0.3.7 - https://github.com/code4craft/xsoup/) Similar to Apache License but with the acknowledgment clause removed diff --git a/core/src/main/java/org/apache/stormcrawler/parse/JSoupTextExtractor.java b/core/src/main/java/org/apache/stormcrawler/parse/JSoupTextExtractor.java index 55b9279f..99deee32 100644 --- a/core/src/main/java/org/apache/stormcrawler/parse/JSoupTextExtractor.java +++ b/core/src/main/java/org/apache/stormcrawler/parse/JSoupTextExtractor.java @@ -60,11 +60,6 @@ import 
org.jsoup.select.NodeVisitor; */ public class JSoupTextExtractor implements TextExtractor { - public static final String INCLUDE_PARAM_NAME = "textextractor.include.pattern"; - public static final String EXCLUDE_PARAM_NAME = "textextractor.exclude.tags"; - public static final String NO_TEXT_PARAM_NAME = "textextractor.no.text"; - public static final String TEXT_MAX_TEXT_PARAM_NAME = "textextractor.skip.after"; - private final List<String> inclusionPatterns; private final Set<String> excludedTags; private final boolean noText; diff --git a/core/src/main/java/org/apache/stormcrawler/parse/TextExtractor.java b/core/src/main/java/org/apache/stormcrawler/parse/TextExtractor.java index c9678bd2..1a29d58e 100644 --- a/core/src/main/java/org/apache/stormcrawler/parse/TextExtractor.java +++ b/core/src/main/java/org/apache/stormcrawler/parse/TextExtractor.java @@ -18,5 +18,10 @@ package org.apache.stormcrawler.parse; public interface TextExtractor { + String INCLUDE_PARAM_NAME = "textextractor.include.pattern"; + String EXCLUDE_PARAM_NAME = "textextractor.exclude.tags"; + String NO_TEXT_PARAM_NAME = "textextractor.no.text"; + String TEXT_MAX_TEXT_PARAM_NAME = "textextractor.skip.after"; + String text(Object element); } diff --git a/external/ai/README.md b/external/ai/README.md new file mode 100644 index 00000000..7e1c722e --- /dev/null +++ b/external/ai/README.md @@ -0,0 +1,68 @@ +# stormcrawler-ai +================================ + +The `LlmTextExtractor` is a StormCrawler-compatible content extraction component that uses a Large Language Model (LLM) via an OpenAI-compatible API to extract meaningful text from HTML documents. +This enables context-aware and semantically rich extraction beyond traditional rule-based approaches. 
+ +## Prerequisites + +Add `stormcrawler-ai` to the dependencies of your project: + +```xml +<dependency> + <groupId>org.apache.stormcrawler</groupId> + <artifactId>stormcrawler-ai</artifactId> + <version>XXXX</version> +</dependency> +``` + +## Features + +- Uses OpenAI-compatible LLMs (e.g., LLaMA 3) for intelligent HTML parsing and content extraction. +- Customizable prompts for both system and user messages. +- Easily integrates with StormCrawler parsing pipelines. +- Optional listener interface for logging or usage metrics. + +## Configuration + +To use the `LlmTextExtractor`, your configuration file must include the following: + +```yaml +# Required: specify the extractor class +textextractor.class: "org.apache.stormcrawler.ai.LlmTextExtractor" + +# Required: LLM API settings +textextractor.llm.api_key: "<your-api-key>" +textextractor.llm.url: "https://<your-openai-compatible-endpoint>" +textextractor.llm.model: "<your-model-to-use>" + +# Optional: system prompt sent to the LLM +textextractor.system.prompt: "You are an expert in extracting content from plain HTML input." + +# Optional: user prompt template (with placeholders) for custom use cases. Note: We provide a default prompt in `src/main/resources/llm-default-prompt.txt` +textextractor.llm.prompt: | + Please extract the main textual content from the following HTML: + {HTML} + + {REQUEST} + +# Optional: extra request passed into the user prompt +textextractor.llm.user_request: "Only include body content relevant to articles." + +# Optional: listener class implementing LlmResponseListener to hook into success/failure of the LLM response, e.g., for tracking usage metrics. +textextractor.llm.listener.clazz: "<your-listener-class>" +``` + +Note: You **must** set `textextractor.class` to use this extractor in a StormCrawler topology. 
+ +The `LlmTextExtractor` does not support the following configuration options from the default `TextExtractor`: + +- `textextractor.include.pattern` +- `textextractor.exclude.tags` +- `textextractor.no.text` +- `textextractor.skip.after` + +## Additional Notes +- LLM Costs: Calls to LLM APIs may incur costs - monitor usage if billing is a concern. +- Performance: LLM responses add latency to a crawl; this extractor is best used for high-value pages or specific use-cases. +- Security: Never hard-code or expose your API key in public repositories. \ No newline at end of file diff --git a/external/ai/ai-conf.yaml b/external/ai/ai-conf.yaml new file mode 100644 index 00000000..29e2c567 --- /dev/null +++ b/external/ai/ai-conf.yaml @@ -0,0 +1,34 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +############################### +# AI Configuration # +############################### + +textextractor.llm.api_key: "" +textextractor.llm.url: "https://openai.inference.de-txl.ionos.com/v1" +textextractor.llm.model: "meta-llama/Meta-Llama-3.1-8B-Instruct" + +# Allows to define a custom system prompt. +#textextractor.system.prompt: "You are an expert in extracting content from plain HTML input." + +# Allows to define a custom prompt. 
{HTML} is replaced with the page html, {REQUEST} is replaced with the content of textextractor.llm.user_request +#textextractor.llm.prompt: "see llm-default-prompt.txt - can be a multi line string with placeholders" + +# Allows to configure a special user request which the LLM should honour. +#textextractor.llm.user_request: "-" + +# Allows to define the listener class to have the possibility to hook in usage metrics (i.e. for payment related metrics) +#textextractor.llm.listener.clazz: "org.apache.stormcrawler.ai.AiResponseListener" \ No newline at end of file diff --git a/external/ai/pom.xml b/external/ai/pom.xml new file mode 100644 index 00000000..426cc588 --- /dev/null +++ b/external/ai/pom.xml @@ -0,0 +1,57 @@ +<?xml version="1.0" encoding="UTF-8"?> + +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. 
+--> + +<project xmlns="http://maven.apache.org/POM/4.0.0" + xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" + xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> + <modelVersion>4.0.0</modelVersion> + <parent> + <groupId>org.apache.stormcrawler</groupId> + <artifactId>stormcrawler-external</artifactId> + <version>3.3.1-SNAPSHOT</version> + <relativePath>../pom.xml</relativePath> + </parent> + + + <artifactId>stormcrawler-ai</artifactId> + <name>stormcrawler-ai</name> + + <url>https://github.com/apache/stormcrawler/tree/master/external/ai</url> + <description>AI resources for StormCrawler</description> + + <properties> + <langchain4j.version>1.0.1</langchain4j.version> + </properties> + + <dependencies> + <dependency> + <groupId>dev.langchain4j</groupId> + <artifactId>langchain4j</artifactId> + <version>${langchain4j.version}</version> <!-- TODO: We might be able to trim down dependencies--> + </dependency> + <dependency> + <groupId>dev.langchain4j</groupId> + <artifactId>langchain4j-open-ai</artifactId> + <version>${langchain4j.version}</version> + </dependency> + </dependencies> + +</project> \ No newline at end of file diff --git a/core/src/main/java/org/apache/stormcrawler/parse/TextExtractor.java b/external/ai/src/main/java/org/apache/stormcrawler/ai/LlmResponseListener.java similarity index 85% copy from core/src/main/java/org/apache/stormcrawler/parse/TextExtractor.java copy to external/ai/src/main/java/org/apache/stormcrawler/ai/LlmResponseListener.java index c9678bd2..cf491e5e 100644 --- a/core/src/main/java/org/apache/stormcrawler/parse/TextExtractor.java +++ b/external/ai/src/main/java/org/apache/stormcrawler/ai/LlmResponseListener.java @@ -14,9 +14,11 @@ * See the License for the specific language governing permissions and * limitations under the License. 
*/ -package org.apache.stormcrawler.parse; +package org.apache.stormcrawler.ai; -public interface TextExtractor { +public interface LlmResponseListener { - String text(Object element); + void onResponse(Object o); + + void onFailure(Object o); } diff --git a/external/ai/src/main/java/org/apache/stormcrawler/ai/LlmTextExtractor.java b/external/ai/src/main/java/org/apache/stormcrawler/ai/LlmTextExtractor.java new file mode 100644 index 00000000..4068d16e --- /dev/null +++ b/external/ai/src/main/java/org/apache/stormcrawler/ai/LlmTextExtractor.java @@ -0,0 +1,146 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to you under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.stormcrawler.ai; + +import dev.langchain4j.data.message.SystemMessage; +import dev.langchain4j.data.message.UserMessage; +import dev.langchain4j.model.chat.request.ChatRequest; +import dev.langchain4j.model.chat.response.ChatResponse; +import dev.langchain4j.model.openai.OpenAiChatModel; +import java.io.FileNotFoundException; +import java.io.IOException; +import java.io.InputStream; +import java.lang.reflect.InvocationTargetException; +import java.nio.charset.StandardCharsets; +import java.nio.file.Files; +import java.nio.file.Path; +import java.util.Map; +import org.apache.storm.Config; +import org.apache.stormcrawler.parse.TextExtractor; +import org.apache.stormcrawler.util.ConfUtils; +import org.jsoup.nodes.Element; +import org.jsoup.parser.Parser; + +public class LlmTextExtractor implements TextExtractor { + + public static final String API_KEY = "textextractor.llm.api_key"; + public static final String BASE_URL = "textextractor.llm.url"; + public static final String MODEL_NAME = "textextractor.llm.model"; + public static final String SYSTEM_PROMPT = "textextractor.system.prompt"; + public static final String USER_PROMPT = "textextractor.llm.prompt"; + public static final String USER_REQUEST = "textextractor.llm.user_request"; + public static final String LISTENER_CLASS = "textextractor.llm.listener.clazz"; + + private final OpenAiChatModel model; + private final SystemMessage systemMessage; + private final String userMessage; + private final String userRequest; + private final LlmResponseListener listener; + + public LlmTextExtractor(Map<String, Object> stormConf) { + this.model = + OpenAiChatModel.builder() + .apiKey(ConfUtils.getString(stormConf, API_KEY)) + .baseUrl(ConfUtils.getString(stormConf, BASE_URL)) + .modelName(ConfUtils.getString(stormConf, MODEL_NAME)) + .build(); + this.systemMessage = + SystemMessage.from( + ConfUtils.getString( + stormConf, + SYSTEM_PROMPT, + "You are an expert in extracting content from plain HTML 
input.")); + this.userMessage = + ConfUtils.getString( + stormConf, USER_PROMPT, readFromClasspath("llm-default-prompt.txt")); + this.userRequest = ConfUtils.getString(stormConf, USER_REQUEST, ""); + final String clazz = + ConfUtils.getString(stormConf, LISTENER_CLASS, NoOpListener.class.getName()); + try { + listener = + (LlmResponseListener) + Class.forName(clazz).getDeclaredConstructor().newInstance(); + } catch (ClassNotFoundException + | InvocationTargetException + | InstantiationException + | IllegalAccessException + | NoSuchMethodException e) { + throw new RuntimeException(e); + } + } + + private String readFromClasspath(String resource) { + try { + final ClassLoader classLoader = Thread.currentThread().getContextClassLoader(); + try (InputStream is = classLoader.getResourceAsStream(resource)) { + if (is == null) { + throw new FileNotFoundException("Resource not found: " + resource); + } + return new String(is.readAllBytes(), StandardCharsets.UTF_8); + } + } catch (IOException e) { + throw new RuntimeException(e); + } + } + + @Override + public String text(Object element) { + if (element instanceof Element e) { + try { + final ChatRequest chatRequest = + ChatRequest.builder() + .messages( + systemMessage, + UserMessage.from( + replacePlaceholders(userMessage, e.html()))) + .build(); + final ChatResponse response = model.chat(chatRequest); + listener.onResponse(response); + return response.aiMessage().text(); + } catch (RuntimeException ex) { + listener.onFailure(ex); + } + } + return ""; + } + + private String replacePlaceholders(String userMessage, String html) { + userMessage = userMessage.replace("{HTML}", html); + userMessage = userMessage.replace("{REQUEST}", userRequest); + return userMessage; + } + + private static class NoOpListener implements LlmResponseListener { + + @Override + public void onResponse(Object o) {} + + @Override + public void onFailure(Object o) {} + } + + public static void main(String[] args) throws IOException { + final 
Map<String, Object> conf = ConfUtils.loadConf(args[0], new Config()); + final LlmTextExtractor textExtractor = new LlmTextExtractor(conf); + + final String html = Files.readString(Path.of(args[1]), StandardCharsets.UTF_8); + + final String text = textExtractor.text(Parser.htmlParser().parseInput(html, "").body()); + + System.out.println(text); + } +} diff --git a/external/ai/src/main/resources/llm-default-prompt.txt b/external/ai/src/main/resources/llm-default-prompt.txt new file mode 100644 index 00000000..8ce4604b --- /dev/null +++ b/external/ai/src/main/resources/llm-default-prompt.txt @@ -0,0 +1,44 @@ +Your task is to filter and convert HTML content into clean, focused markdown that's optimized for use with LLMs and information retrieval systems. + +TASK DETAILS: +1. Content Selection +- DO: Keep essential information, main content, key details +- DO: Preserve hierarchical structure using markdown headers +- DO: Keep code blocks, tables, key lists +- DON'T: Include navigation menus, ads, footers, cookie notices +- DON'T: Keep social media widgets, sidebars, related content + +2. Content Transformation +- DO: Use proper markdown syntax (#, ##, **, `, etc) +- DO: Convert tables to markdown tables +- DO: Preserve code formatting with ```language blocks +- DO: Maintain link texts but remove tracking parameters +- DON'T: Include HTML tags in output +- DON'T: Keep class names, ids, or other HTML attributes + +3. Content Organization +- DO: Maintain logical flow of information +- DO: Group related content under appropriate headers +- DO: Use consistent header levels +- DON'T: Fragment related content +- DON'T: Duplicate information + +IMPORTANT: If a user-specific instruction is provided, ignore the above guidelines and prioritize those requirements instead. + +OUTPUT FORMAT: +Wrap your response in <content> tags. Use proper markdown throughout. +<content> +[Your markdown content here] +</content> + +Begin filtering now. 
+ +-------------------------------------------- + +<|HTML_CONTENT_START|> +{HTML} +<|HTML_CONTENT_END|> + +<|USER_INSTRUCTION_START|> +{REQUEST} +<|USER_INSTRUCTION_END|> \ No newline at end of file diff --git a/core/src/main/java/org/apache/stormcrawler/parse/TextExtractor.java b/external/ai/src/test/java/org/apache/stormcrawler/ai/LlmTextExtractorTest.java similarity index 55% copy from core/src/main/java/org/apache/stormcrawler/parse/TextExtractor.java copy to external/ai/src/test/java/org/apache/stormcrawler/ai/LlmTextExtractorTest.java index c9678bd2..3a51f6c6 100644 --- a/core/src/main/java/org/apache/stormcrawler/parse/TextExtractor.java +++ b/external/ai/src/test/java/org/apache/stormcrawler/ai/LlmTextExtractorTest.java @@ -14,9 +14,23 @@ * See the License for the specific language governing permissions and * limitations under the License. */ -package org.apache.stormcrawler.parse; +package org.apache.stormcrawler.ai; -public interface TextExtractor { +import java.io.IOException; - String text(Object element); +import org.apache.storm.Config; +import org.junit.jupiter.api.Test; +import org.junit.jupiter.api.condition.EnabledIfEnvironmentVariable; + +@EnabledIfEnvironmentVariable(named = "OPENAI_API", matches = ".+") +public class LlmTextExtractorTest { + + @Test + void testExtraction() throws IOException { + Config conf = new Config(); + conf.put(LlmTextExtractor.API_KEY, System.getProperty("OPENAI_API_KEY")); + conf.put(LlmTextExtractor.BASE_URL, System.getProperty("OPENAI_API_BASE_URL")); + conf.put(LlmTextExtractor.MODEL_NAME, System.getProperty("OPENAI_API_MODEL_NAME")); + //TODO + } } diff --git a/external/ai/src/test/resources/stormcrawler.html b/external/ai/src/test/resources/stormcrawler.html new file mode 100644 index 00000000..43b85886 --- /dev/null +++ b/external/ai/src/test/resources/stormcrawler.html @@ -0,0 +1,118 @@ + +<!DOCTYPE html> +<html> + +<head> + <meta charset="utf-8"> + <meta http-equiv="X-UA-Compatible" content="IE=edge"> + <meta 
name="viewport" content="width=device-width, initial-scale=1">
+
+    <title>Apache StormCrawler</title>
+    <meta name="description" content="Apache StormCrawler is a collection of resources for building low-latency, scalable web crawlers on Apache Storm
+">
+
+    <link rel="stylesheet" href="/css/main.css">
+    <link rel="canonical" href="https://stormcrawler.apache.org/">
+    <link rel="alternate" type="application/rss+xml" title="Apache StormCrawler" href="https://stormcrawler.apache.org/feed.xml">
+    <link rel="icon" type="image/png" href="/img/favicon.png" />
+</head>
+
+
+<body class="home">
+
+<header class="site-header">
+    <div class="site-header__wrap">
+        <div class="site-header__logo">
+            <a href="/"><img src="/img/logo-small.png" alt="Apache StormCrawler"></a>
+        </div>
+    </div>
+</header>
+<nav class="site-nav">
+    <ul>
+        <li><a href="/index.html">Home</a></li>
+        <li><a href="/download/index.html">Download</a></li>
+        <li><a href="/getting-started/">Getting Started</a></li>
+        <li><a href="/contribute/">Contribute</a></li>
+        <li><a href="https://javadoc.io/doc/org.apache.stormcrawler/stormcrawler-core/3.3.0/index.html">JavaDocs</a></li>
+        <li><a href="/faq/">FAQ</a></li>
+        <li><a href="/support/">Support</a></li>
+    </ul>
+</nav>
+<span id="forkongithub"><a href="https://github.com/apache/incubator-stormcrawler">Fork me on GitHub</a></span>
+
+
+<main class="main-content">
+    <div class="page-title">
+        <h1>A collection of resources for building low-latency, scalable web crawlers on Apache Storm®</h1>
+    </div>
+    <div class="row row-col">
+        <p><strong>Apache StormCrawler</strong> is an open source SDK for building distributed web crawlers based on <a href="http://storm.apache.org">Apache Storm®</a>.
The project is under Apache License v2 and consists of a collection of reusable resources and components, written mostly in Java.</p>
+        <p>The aim of Apache StormCrawler is to help build web crawlers that are:</p>
+        <ul>
+            <li>scalable</li>
+            <li>resilient</li>
+            <li>low latency</li>
+            <li>easy to extend</li>
+            <li>polite yet efficient</li>
+        </ul>
+        <p><strong>Apache StormCrawler</strong> is a library and collection of resources that developers can leverage to build their own crawlers. The good news is that doing so can be pretty straightforward! Have a look at the <a href="getting-started/">Getting Started</a> section for more details.</p>
+        <p>Apart from the core components, we provide some <a href="https://github.com/apache/incubator-stormcrawler/tree/main/external">external resources</a> that you can reuse in your project, for instance our spout and bolts for <a href="https://opensearch.org/">OpenSearch®</a> or a ParserBolt which uses <a href="http://tika.apache.org">Apache Tika®</a> to parse various document formats.</p>
+        <p><strong>Apache StormCrawler</strong> is perfectly suited to use cases where the URLs to fetch and parse come as streams, but it is also an appropriate solution for large-scale recursive crawls, particularly where low latency is required.
The project is used in production by <a href="https://github.com/apache/incubator-stormcrawler/wiki/Powered-By">many organisations</a> and is actively developed and maintained.</p>
+        <p>The <a href="https://github.com/apache/incubator-stormcrawler/wiki/Presentations">Presentations</a> page contains links to some recent presentations made about this project.</p>
+    </div>
+
+    <div class="row row-col">
+        <div class="used-by-panel">
+            <h2>Used by</h2>
+            <a href="https://pixray.com/" target="_blank">
+                <img src="/img/pixray.png" alt="Pixray" height=80>
+            </a>
+            <a href="https://www.gov.nt.ca/" target="_blank">
+                <img src="/img/gnwt.png" alt="Government of Northwest Territories">
+            </a>
+            <a href="https://www.stolencamerafinder.com/" target="_blank">
+                <img src="/img/stolen-camera-finder.png" alt="StolenCameraFinder">
+            </a>
+            <a href="https://www.polecat.com/" target="_blank">
+                <img src="/img/polecat.svg" alt="Polecat" height=70>
+            </a>
+            <br>
+            <a href="http://github.com/apache/incubator-stormcrawler/wiki/Powered-By">and many more...</a>
+        </div>
+    </div>
+
+</main>
+
+<footer class="site-footer">
+    © 2025 <a href="https://www.apache.org/">The Apache Software Foundation</a><br/><br/>
+    Licensed under the <a href="https://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>. <br/> Apache StormCrawler, StormCrawler, and the Apache feather logo are trademarks of The Apache Software Foundation. <br/> All other marks mentioned may be trademarks or registered trademarks of their respective owners.
<br/><br/>
+    <a href="https://privacy.apache.org/policies/privacy-policy-public.html">Privacy Policy</a> | <a href="https://www.apache.org/security/">Security</a> | <a href="https://www.apache.org/foundation/sponsorship">Sponsorship</a> | <a href="https://www.apache.org/foundation/sponsors">Sponsors</a><br/><br/>
+    <div class="footer-widget">
+        <a class="acevent" data-format="wide" data-mode="dark"></a>
+    </div>
+</footer>
+
+
+</body>
+
+<script src="https://www.apachecon.com/event-images/snippet.js"></script>
+
+<!-- Matomo -->
+<script>
+    var _paq = window._paq = window._paq || [];
+    /* tracker methods like "setCustomDimension" should be called before "trackPageView" */
+    _paq.push(["setDoNotTrack", true]);
+    _paq.push(["disableCookies"]);
+    _paq.push(['trackPageView']);
+    _paq.push(['enableLinkTracking']);
+    (function() {
+        var u="https://analytics.apache.org/";
+        _paq.push(['setTrackerUrl', u+'matomo.php']);
+        _paq.push(['setSiteId', '58']);
+        var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0];
+        g.async=true; g.src=u+'matomo.js'; s.parentNode.insertBefore(g,s);
+    })();
+</script>
+<!-- End Matomo Code -->
+</html>
diff --git a/external/pom.xml b/external/pom.xml
index 21c6fe44..377d46bf 100644
--- a/external/pom.xml
+++ b/external/pom.xml
@@ -32,7 +32,7 @@ under the License.
     <artifactId>stormcrawler-external</artifactId>
     <packaging>pom</packaging>
 
-	<dependencies>
+    <dependencies>
         <dependency>
             <groupId>org.apache.storm</groupId>
             <artifactId>storm-client</artifactId>
diff --git a/pom.xml b/pom.xml
index af855c75..cb01f968 100644
--- a/pom.xml
+++ b/pom.xml
@@ -497,6 +497,7 @@ under the License.
                        <exclude>**/README.md</exclude>
                        <exclude>**/target/**</exclude>
                        <exclude>**/warc.inputs</exclude>
+                        <exclude>**/llm-default-prompt.txt</exclude>
                        <exclude>LICENSE</exclude>
                        <exclude>NOTICE</exclude>
                        <exclude>DISCLAIMER</exclude>
@@ -615,6 +616,7 @@ under the License.
     <modules>
         <module>core</module>
         <module>external</module>
+        <module>external/ai</module>
         <module>external/aws</module>
         <module>external/langid</module>
         <module>external/opensearch</module>
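For orientation, the default prompt file in this commit leaves two mechanical steps to the Java side: filling the {HTML} and {REQUEST} placeholders before the request, and unwrapping the <content>...</content> block the model is asked to emit. A minimal, self-contained sketch of that plumbing is below; the helper names are hypothetical and are not taken from the LlmTextExtractor implementation in the diff:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PromptTemplateSketch {

    // Substitute the {HTML} and {REQUEST} placeholders used by llm-default-prompt.txt.
    static String fillPrompt(String template, String html, String request) {
        return template.replace("{HTML}", html).replace("{REQUEST}", request);
    }

    // Pull the markdown out of the <content> wrapper the prompt asks the model
    // to emit; fall back to the raw response when no wrapper is present.
    static String extractContent(String response) {
        Matcher m = Pattern.compile("<content>(.*?)</content>", Pattern.DOTALL).matcher(response);
        return m.find() ? m.group(1).trim() : response.trim();
    }

    public static void main(String[] args) {
        String template = "<|HTML_CONTENT_START|>\n{HTML}\n<|HTML_CONTENT_END|>\n\n"
                + "<|USER_INSTRUCTION_START|>\n{REQUEST}\n<|USER_INSTRUCTION_END|>";
        System.out.println(fillPrompt(template, "<p>Hello</p>", "keep the main text only"));
        System.out.println(extractContent("<content>\n# Hello\n</content>"));
    }
}
```

The regex fallback matters in practice: smaller models occasionally ignore the OUTPUT FORMAT instruction, so treating the unwrapped response as the result is a safer default than failing the parse.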
