Not able to fulltext index Microsoft Office documents - PDF works fine

Dirk Wed, 27 Aug 2014 01:38:42 -0700

Hi,

using elasticsearch-1.3.2 with


Plug-in
-----
name: mapper-attachments
version: 2.3.1
description: Adds the attachment type allowing to parse difference
attachment formats
jvm: true
site: false

on Windows 8 for evaluation purposes.

JVM 
-----
version: 1.7.0_67
vm_name: Java HotSpot(TM) Client VM
vm_version: 24.65-b04
vm_vendor: Oracle Corporation


I have created the following mapping:

{
  "myIndex": {
    "mappings": {
      "dokument": {
        "properties": {
          "created": {
            "type": "date",
            "format": "dateOptionalTime"
          },
          "description": {
            "type": "string"
          },
          "file": {
            "type": "attachment",
            "path": "full",
            "fields": {
              "file": {
                "type": "string",
                "store": true,
                "term_vector": "with_positions_offsets"
              },
              "author": {
                "type": "string"
              },
              "title": {
                "type": "string"
              },
              "name": {
                "type": "string"
              },
              "date": {
                "type": "date",
                "format": "dateOptionalTime"
              },
              "keywords": {
                "type": "string"
              },
              "content_type": {
                "type": "string"
              },
              "content_length": {
                "type": "integer"
              },
              "language": {
                "type": "string"
              }
            }
          },
          "id": {
            "type": "string"
          },
          "title": {
            "type": "string"
          }
        }
      }
    }
  }
}

Because I like to use ES from C#/.NET, I have created a little C# app that
reads a file from the hard drive, base64-encodes its content and puts the
document into the ES index. I'm working with this POST request:

{
  "id": "8dbf1d73-44d1-4e20-aa35-13b18ddf5057",
  "title": "Test",
  "description": "Test Description",
  "created": "2014-01-20T19:04:20.1019885+01:00",
  "file": {
    "_content_type": "application/pdf",
    "_name": "Test.pdf",
    "content": "---my base64 stuff here---"
  }
}

and send it as an index command to ES like this:

myIndex/dokument/8dbf1d73-44d1-4e20-aa35-13b18ddf5057?refresh=true
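
For illustration, the indexing step can be sketched roughly as follows, in
Python rather than the original C#. This is a minimal sketch, not the actual
app: the function names (build_index_body, index_path, send_index_request)
are hypothetical, and the base URL passed to send_index_request is assumed to
be a locally running node.

```python
import base64
import json
import urllib.request


def build_index_body(doc_id, title, description, created,
                     file_name, content_type, raw_bytes):
    """Build the JSON body for a document with an attachment field."""
    return {
        "id": doc_id,
        "title": title,
        "description": description,
        "created": created,
        "file": {
            "_content_type": content_type,
            "_name": file_name,
            # The attachment field carries the raw file bytes as base64 text.
            "content": base64.b64encode(raw_bytes).decode("ascii"),
        },
    }


def index_path(index, doc_type, doc_id):
    """Build the index command path, refreshing so the document is searchable at once."""
    return "{}/{}/{}?refresh=true".format(index, doc_type, doc_id)


def send_index_request(base_url, path, body):
    """POST the body to ES. base_url is an assumption, e.g. http://localhost:9200."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/" + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Only send_index_request touches the network; the other two helpers can be
checked on their own to confirm that the base64 content and the path look
as expected before anything is sent.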

After that I query ES with this request:

{
  "fields": [],
  "query": {
    "match": {
      "file": "test"
    }
  },
  "highlight": {
    "fields": {
      "file": {}
    }
  }
}

If my input is a *.pdf or *.txt file, everything works as expected: the
content of the document is recognized by the mapper-attachments plug-in, and
the occurrences of my search string "test" are highlighted in the results.
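
A minimal sketch of how the query and the highlight check could look in
Python is given here for illustration. The helper names are hypothetical, and
collect_highlights assumes the usual search response layout in which each
entry under hits.hits may carry a "highlight" object keyed by field name.

```python
def build_search_body(field, text):
    """Build a match query on one field with highlighting on the same field."""
    return {
        "fields": [],
        "query": {"match": {field: text}},
        "highlight": {"fields": {field: {}}},
    }


def collect_highlights(response, field):
    """Map each hit's _id to its highlighted fragments for the given field.

    Hits without a highlight entry for the field map to an empty list.
    """
    result = {}
    for hit in response.get("hits", {}).get("hits", []):
        result[hit["_id"]] = hit.get("highlight", {}).get(field, [])
    return result
```

An empty result from collect_highlights means the query matched nothing at
all, which is what I see for the Office documents.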

I have been searching for hours for a way to do the same with Microsoft
Office documents, but I cannot get it to work. ES does not return any error
message when the documents are added, but a search does not find the content
of my Office documents.
Can anyone please help me and give me a sample of how to index a *.doc,
*.docx, *.xls, *.xlsx etc.?

I have tried to give ES a hint for the content type / MIME type based on
this link http://filext.com/faq/office_mime_types.php but this makes no
difference.
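
For reference, the content type hint can be derived from the file extension
along these lines. This is only a sketch in Python: the table lists the
standard MIME types for the common Office formats, while the helper name
content_type_for and the fallback value are assumptions for illustration.
Setting the hint this way is what made no difference for me.

```python
import os

# Standard MIME types for the common Microsoft Office formats.
OFFICE_MIME_TYPES = {
    ".doc": "application/msword",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".xls": "application/vnd.ms-excel",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".ppt": "application/vnd.ms-powerpoint",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def content_type_for(file_name, default="application/octet-stream"):
    """Return the MIME type for an Office file name, or the default if unknown."""
    extension = os.path.splitext(file_name)[1].lower()
    return OFFICE_MIME_TYPES.get(extension, default)
```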

Thanks in advance!
Dirk



