adityamparikh opened a new pull request, #88: URL: https://github.com/apache/solr-mcp/pull/88
## Summary

Adds a new MCP tool `index-file-document` that enables users to upload files of any format (PDF, Word, Excel, PowerPoint, etc.) through their AI chat client and have the content indexed into Solr for full-text search.

Closes https://github.com/apache/solr-mcp/issues/69

## How it works

When a user uploads a file in an AI chat client like Claude Desktop, the **client** handles text extraction, not the MCP server. Here's the flow:

1. **User uploads a file** (e.g., `report.pdf`) in Claude Desktop
2. **Claude Desktop extracts text** from the PDF before Claude ever sees it: Claude receives the readable text content, not the raw binary bytes
3. **Claude calls the `index-file-document` tool**, passing the extracted text as `content` and the original filename (`report.pdf`) as `filename`
4. **The MCP server indexes** a SolrInputDocument with `id` (auto-generated UUID), `content` (the full text), and `filename` (for filtering/display)
5. **User can now search** over the indexed content using existing search tools

This design means no binary parsing library (Tika, Docling, etc.) is needed on the server side, because the AI chat client already does the heavy lifting of text extraction before invoking MCP tools. This keeps the server lightweight and avoids ~100MB of transitive dependencies.

### Tool signature

```
index-file-document(collection, content, filename)
```

| Parameter | Description |
|-----------|-------------|
| `collection` | Solr collection to index into |
| `content` | Text content extracted from the file by the chat client |
| `filename` | Original filename with extension (e.g. `report.pdf`); stored as a searchable field |

## Changes

- **`FileDocumentCreator`** (new): `@Component` that creates a `SolrInputDocument` with `id`, `content`, and `filename` fields. Does not implement `SolrDocumentCreator` because it requires a filename parameter in addition to content.
- **`IndexingDocumentCreator`**: Added `FileDocumentCreator` dependency and `createSchemalessDocumentsFromFile()` delegation method
- **`IndexingService`**: New `indexFileDocument()` MCP tool with `@PreAuthorize("isAuthenticated()")`
- **`AGENTS.md`**: Updated architecture docs

## Test plan

- [x] `FileDocumentCreatorTest`: 9 unit tests covering valid input, null/empty/blank content, null/empty filename, oversized content, unique IDs, and multiline content
- [x] `FileIndexingTest`: Spring Boot integration test through `IndexingDocumentCreator`
- [x] `IndexingServiceTest`: 2 new Testcontainers integration tests verifying the index-then-search round-trip (search by content, search by filename)
- [x] `IndexingServiceTest.UnitTests`: 2 new mocked unit tests for the MCP tool method
- [x] Existing test constructors updated for new `FileDocumentCreator` parameter
- [x] `./gradlew build` passes with all tests green

🤖 Generated with [Claude Code](https://claude.com/claude-code)
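
For reviewers who want a concrete picture of the `FileDocumentCreator` behavior described under Changes and exercised by the test plan, here is a minimal sketch. It is not the code in this PR: the class name `FileDocumentSketch`, the `MAX_CONTENT_CHARS` limit, and the choice of `IllegalArgumentException` are all assumptions made for illustration, and a plain `Map` stands in for `SolrInputDocument` so the example runs without SolrJ on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

/**
 * Illustrative stand-in for the PR's FileDocumentCreator.
 * A Map replaces SolrInputDocument so the sketch has no SolrJ dependency.
 */
final class FileDocumentSketch {

    // Hypothetical limit: the PR tests "oversized content" but does not state the threshold.
    static final int MAX_CONTENT_CHARS = 1_000_000;

    /** Builds a document with the three fields named in the PR: id, content, filename. */
    static Map<String, Object> create(String content, String filename) {
        // Assumed failure mode: the PR does not say which exception invalid input raises.
        if (content == null || content.isBlank()) {
            throw new IllegalArgumentException("content must not be null, empty, or blank");
        }
        if (filename == null || filename.isEmpty()) {
            throw new IllegalArgumentException("filename must not be null or empty");
        }
        if (content.length() > MAX_CONTENT_CHARS) {
            throw new IllegalArgumentException("content exceeds " + MAX_CONTENT_CHARS + " characters");
        }

        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", UUID.randomUUID().toString()); // auto-generated, distinct per call
        doc.put("content", content);                 // full extracted text, newlines preserved
        doc.put("filename", filename);               // kept for filtering and display
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = create("Quarterly revenue grew.\nCosts fell.", "report.pdf");
        System.out.println(doc.keySet()); // field names in insertion order
    }
}
```

Validating before building the document keeps every rejection path ahead of any Solr call, which matches the split in the test plan between unit tests on the creator and integration tests on the indexing round-trip.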
