Hello - I'm using Elasticsearch 1.2.1, with mapper-attachments-2.0.0. I'm a little baffled by how to surface the text that Tika extracts from a PDF into the structured document that ES is storing.
Long story short, with a trivial PDF file with one line of text, I'm getting something like this:

    {
      "_index" : "test",
      "_type" : "doc",
      "_id" : "1",
      "_score" : 0.067124054,
      "fields" : {
        "my_attachment.date" : [ "2014-07-31T23:29:45.000Z" ],
        "my_attachment.keywords" : [ "TestKeyword1, TestKeyword2" ],
        "my_attachment.title" : [ "Untitled" ]
      }
    }

When what I want is this (with the content of the file included):

    {
      "_index" : "test",
      "_type" : "doc",
      "_id" : "1",
      "_score" : 0.067124054,
      "fields" : {
        "my_attachment.date" : [ "2014-07-31T23:29:45.000Z" ],
        "my_attachment.keywords" : [ "TestKeyword1, TestKeyword2" ],
        "my_attachment.title" : [ "Untitled" ],
        "my_attachment.file" : "This is the easiest PDF ever."
      }
    }

A somewhat related question: I'm also a bit confused as to the difference between the "fields" from the attachment, and other fields in my document that I'm storing in my _source. If I ask for the attachment fields, I don't get anything else I stored in the document; if I don't ask for any fields, I get everything from _source. Is there a way I can make the my_attachment.* fields and the "Thing" field I store in my document co-equals? I think what I want is for the my_attachment fields to show up without having to explicitly ask for them.

My sample PDF documents are here:

http://pages.cs.wisc.edu/~epaulson/simplepdfs/Untitled1.pdf
http://pages.cs.wisc.edu/~epaulson/simplepdfs/Untitled2.pdf

And my curl/shell is below, followed by the sample output of a run.

    curl -X DELETE localhost:9200/test
    curl -X PUT localhost:9200/test
    curl -X PUT localhost:9200/test/doc/_mapping -d '
    {
      "doc" : {
        "properties" : {
          "my_attachment" : {
            "type" : "attachment",
            "fields" : {
              "title" : { "store" : "yes" },
              "date" : { "store" : "yes" },
              "author" : { "store" : "yes" },
              "keywords" : { "store" : "yes" },
              "content_type" : { "store" : "yes" },
              "content_length" : { "store" : "yes" },
              "language" : { "store" : "yes" },
              "file" : { "store" : "yes", "term_vector" : "with_positions_offsets" }
            }
          }
        }
      }
    }'
    echo

    echo "Uploading a PDF with 'This is the easiest PDF ever'"
    coded=`cat simple/Untitled1.pdf | base64`
    json="{\"Thing\":\"first\",\"my_attachment\":\"${coded}\"}"
    echo "$json" > json.file
    curl -X PUT 'localhost:9200/test/doc/1?refresh=true' -d @json.file
    rm json.file
    echo

    echo "Uploading a PDF with 'This is the second easiest PDF ever'"
    coded=`cat simple/Untitled2.pdf | base64`
    json="{\"Thing\": \"followup\", \"my_attachment\":\"${coded}\"}"
    echo "$json" > json.file
    curl -X PUT 'localhost:9200/test/doc/2?refresh=true' -d @json.file
    rm json.file
    echo

    echo "Querying: Should get two hits"
    curl -X POST 'localhost:9200/test/doc/_search?pretty=true' -d '{
      "fields" : ["title", "author", "date", "file", "keywords"],
      "query" : { "match" : { "_all" : "easiest" } }
    }'
    echo
    echo

    echo "Querying: Should get one hit"
    curl -X POST 'localhost:9200/test/doc/_search?pretty=true' -d '{
      "fields" : "*",
      "query" : { "match" : { "_all" : "second" } }
    }'
    echo
    echo

    echo "Directly loading object 1"
    echo
    curl 'localhost:9200/test/doc/1'
    echo

And the output:

    {"acknowledged":true}{"acknowledged":true}{"acknowledged":true}
    Uploading a PDF with 'This is the easiest PDF ever'
    {"_index":"test","_type":"doc","_id":"1","_version":1,"created":true}
    Uploading a PDF with 'This is the second easiest PDF ever'
    {"_index":"test","_type":"doc","_id":"2","_version":1,"created":true}
    Querying: Should get two hits
    {
      "took" : 2,
      "timed_out" : false,
      "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 },
      "hits" : {
        "total" : 2,
        "max_score" : 0.067124054,
        "hits" : [ {
          "_index" : "test",
          "_type" : "doc",
          "_id" : "2",
          "_score" : 0.067124054,
          "fields" : {
            "my_attachment.date" : [ "2014-07-31T21:48:21.000Z" ],
            "my_attachment.keywords" : [ "" ],
            "my_attachment.title" : [ "Untitled" ]
          }
        }, {
          "_index" : "test",
          "_type" : "doc",
          "_id" : "1",
          "_score" : 0.067124054,
          "fields" : {
            "my_attachment.date" : [ "2014-07-31T23:29:45.000Z" ],
            "my_attachment.keywords" : [ "TestKeyword1, TestKeyword2" ],
            "my_attachment.title" : [ "Untitled" ]
          }
        } ]
      }
    }

    Querying: Should get one hit
    {
      "took" : 2,
      "timed_out" : false,
      "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 },
      "hits" : {
        "total" : 1,
        "max_score" : 0.067124054,
        "hits" : [ {
          "_index" : "test",
          "_type" : "doc",
          "_id" : "2",
          "_score" : 0.067124054,
          "fields" : {
            "my_attachment.content_type" : [ "application/pdf" ],
            "my_attachment.keywords" : [ "" ],
            "my_attachment.title" : [ "Untitled" ],
            "my_attachment.date" : [ "2014-07-31T21:48:21.000Z" ],
            "my_attachment.content_length" : [ 9458 ]
          }
        } ]
      }
    }

    Directly loading object 1

    {"_index":"test","_type":"doc","_id":"1","_version":1,"found":true,"_source":{"Thing":"first","my_attachment":"JVBERi0xLjMKJcTl....lots of base64 data removed....VPRgo="}}

Thanks for any help you can point me at!

-Erik
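
A side note on the upload step in the script above: some `base64` implementations (GNU coreutils, for example) wrap their output at 76 columns by default, and a raw line break inside a JSON string is not valid JSON. The following is only a minimal sketch of building a wrap-free payload, assuming a POSIX shell with `base64`, `tr`, `printf` and `mktemp` available. The temporary directory and `sample.txt` are hypothetical stand-ins for the real PDFs, and the commented-out `curl` line simply repeats the call from the script.

```shell
#!/bin/sh
# Hypothetical stand-in for a real PDF: 600 bytes of text, long enough
# that a wrapping base64 tool would split the output over several lines.
tmpdir=$(mktemp -d)
i=0
while [ "$i" -lt 20 ]; do
  printf 'This is the easiest PDF ever. '
  i=$((i + 1))
done > "$tmpdir/sample.txt"

# Encode, then strip any line breaks the base64 tool may have inserted.
coded=$(base64 < "$tmpdir/sample.txt" | tr -d '\n')

# Build the document body. printf sidesteps the platform-specific
# escape handling of echo.
printf '{"Thing":"first","my_attachment":"%s"}\n' "$coded" > "$tmpdir/json.file"

# The payload should now be exactly one line.
lines=$(wc -l < "$tmpdir/json.file" | tr -d ' ')
echo "payload lines: $lines"

# The upload itself would then be the same call as in the script:
# curl -X PUT 'localhost:9200/test/doc/1?refresh=true' -d @"$tmpdir/json.file"

rm -rf "$tmpdir"
```

On a system where `base64` does not wrap, the `tr` step is a no-op, so the same lines should be safe on either kind of platform.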