Hello -

I'm using Elasticsearch 1.2.1, with mapper-attachments-2.0.0. I'm a little
baffled by how to surface the text that Tika extracts from a PDF into the
structured document that ES is storing.

Long story short, with a trivial PDF file with one line of text, I'm
getting something like this:

> {
>       "_index" : "test",
>       "_type" : "doc",
>       "_id" : "1",
>       "_score" : 0.067124054,
>       "fields" : {
>         "my_attachment.date" : [ "2014-07-31T23:29:45.000Z" ],
>         "my_attachment.keywords" : [ "TestKeyword1, TestKeyword2" ],
>         "my_attachment.title" : [ "Untitled" ]
>       }
>     }

When what I want is this (with the content of the file included):

> {
>       "_index" : "test",
>       "_type" : "doc",
>       "_id" : "1",
>       "_score" : 0.067124054,
>       "fields" : {
>         "my_attachment.date" : [ "2014-07-31T23:29:45.000Z" ],
>         "my_attachment.keywords" : [ "TestKeyword1, TestKeyword2" ],
>         "my_attachment.title" : [ "Untitled" ],
>         "my_attachment.file" : "This is the easiest PDF ever."
>       }
>     }

A somewhat related question: I'm also a bit confused as to the difference
between the "fields" from the attachment, and other fields in my document
that I'm storing in my _source. If I ask for the attachment fields, I don't
get anything else I stored in the document; if I don't ask for any fields,
I get everything from _source. Is there a way I can make the
my_attachment.* fields and the "Thing" field I store in my document
co-equals? I think what I want is for the my_attachment fields to show up
without having to explicitly ask for them.

My sample PDF documents are here:



And my curl/shell is below, followed by the sample output of a run.

curl -X DELETE localhost:9200/test
curl -X PUT localhost:9200/test

curl -X PUT localhost:9200/test/doc/_mapping -d '{
    "doc" : {
        "properties" : {
            "my_attachment" : {
                "type" : "attachment",
                "fields" : {
                    "title" : { "store" : "yes" },
                    "date" : { "store" : "yes" },
                    "author" : { "store" : "yes" },
                    "keywords" : { "store" : "yes" },
                    "content_type" : { "store" : "yes" },
                    "content_length" : { "store" : "yes" },
                    "language" : { "store" : "yes" },
                    "file" : { "store" : "yes", "term_vector":

echo "Uploading a PDF with 'This is the easiest PDF ever'"
coded=`cat simple/Untitled1.pdf | base64`
echo "$json" > json.file
curl -X PUT 'localhost:9200/test/doc/1?refresh=true' -d @json.file
rm json.file

echo "Uploading a PDF with 'This is the second easiest PDF ever'"
coded=`cat simple/Untitled2.pdf | base64`
json="{\"Thing\": \"followup\", \"my_attachment\":\"${coded}\"}"
echo "$json" > json.file
curl -X PUT 'localhost:9200/test/doc/2?refresh=true' -d @json.file
rm json.file

echo "Querying: Should get two hits"
curl -X POST 'localhost:9200/test/doc/_search?pretty=true' -d '{
  "fields": ["title", "author", "date", "file", "keywords"],
  "query" : { "match" : { "_all" : "easiest" } }
}'

echo "Querying: Should get one hit"
curl -X POST 'localhost:9200/test/doc/_search?pretty=true' -d '{
  "fields": "*",
  "query" : { "match" : { "_all" : "second" } }
}'

echo "Directly loading object 1"
curl 'localhost:9200/test/doc/1'
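
The upload steps in the script paste the output of base64 straight into a JSON string. Here is a minimal sketch of a helper that does the same but strips line breaks from the encoded text, since GNU base64 wraps its output at 76 columns by default and a raw newline inside a JSON string is invalid. The function name make_payload is hypothetical and not part of my original script, and it assumes the "Thing" value contains nothing that needs JSON escaping.

```shell
#!/bin/sh
# Hypothetical helper (not in the original script): print the JSON body
# for one document, given a "Thing" value and the path to a PDF.
make_payload() {
    thing=$1
    file=$2
    # Encode the file and drop every newline so the base64 text stays
    # on a single line inside the JSON string.
    coded=$(base64 < "$file" | tr -d '\n')
    # Assumption: "$thing" has no quotes or backslashes to escape.
    printf '{"Thing": "%s", "my_attachment": "%s"}\n' "$thing" "$coded"
}
```

With that in place, an upload step could be written as `make_payload followup simple/Untitled2.pdf > json.file`, followed by the same curl -X PUT call as before.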

And the output:


Uploading a PDF with 'This is the easiest PDF ever'

Uploading a PDF with 'This is the second easiest PDF ever'

Querying: Should get two hits
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.067124054,
    "hits" : [ {
      "_index" : "test",
      "_type" : "doc",
      "_id" : "2",
      "_score" : 0.067124054,
      "fields" : {
        "my_attachment.date" : [ "2014-07-31T21:48:21.000Z" ],
        "my_attachment.keywords" : [ "" ],
        "my_attachment.title" : [ "Untitled" ]
      }
    }, {
      "_index" : "test",
      "_type" : "doc",
      "_id" : "1",
      "_score" : 0.067124054,
      "fields" : {
        "my_attachment.date" : [ "2014-07-31T23:29:45.000Z" ],
        "my_attachment.keywords" : [ "TestKeyword1, TestKeyword2" ],
        "my_attachment.title" : [ "Untitled" ]
      }
    } ]
  }
}

Querying: Should get one hit
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.067124054,
    "hits" : [ {
      "_index" : "test",
      "_type" : "doc",
      "_id" : "2",
      "_score" : 0.067124054,
      "fields" : {
        "my_attachment.content_type" : [ "application/pdf" ],
        "my_attachment.keywords" : [ "" ],
        "my_attachment.title" : [ "Untitled" ],
        "my_attachment.date" : [ "2014-07-31T21:48:21.000Z" ],
        "my_attachment.content_length" : [ 9458 ]
      }
    } ]
  }
}

Directly loading object 1
of base64 data removed....VPRgo="}}

Thanks for any help you can point me to!

