Hello -

I'm using Elasticsearch 1.2.1, with mapper-attachments-2.0.0. I'm a little
baffled by how to surface the text that Tika extracts from a PDF into the
structured document that ES is storing.

Long story short, with a trivial PDF file with one line of text, I'm
getting something like this:

{
  "_index" : "test",
  "_type" : "doc",
  "_id" : "1",
  "_score" : 0.067124054,
  "fields" : {
    "my_attachment.date" : [ "2014-07-31T23:29:45.000Z" ],
    "my_attachment.keywords" : [ "TestKeyword1, TestKeyword2" ],
    "my_attachment.title" : [ "Untitled" ]
  }
}


When what I want is this (with the content of the file included):

{
  "_index" : "test",
  "_type" : "doc",
  "_id" : "1",
  "_score" : 0.067124054,
  "fields" : {
    "my_attachment.date" : [ "2014-07-31T23:29:45.000Z" ],
    "my_attachment.keywords" : [ "TestKeyword1, TestKeyword2" ],
    "my_attachment.title" : [ "Untitled" ],
    "my_attachment.file" : "This is the easiest PDF ever."
  }
}


A somewhat related question: I'm also a bit confused as to the difference
between the "fields" from the attachment, and other fields in my document
that I'm storing in my _source. If I ask for the attachment fields, I don't
get anything else I stored in the document; if I don't ask for any fields,
I get everything from _source. Is there a way I can make the
my_attachment.* fields and the "Thing" field I store in my document
co-equals? I think what I want is for the my_attachment fields to show up
without having to explicitly ask for them.

My sample PDF documents are here:

http://pages.cs.wisc.edu/~epaulson/simplepdfs/Untitled1.pdf

http://pages.cs.wisc.edu/~epaulson/simplepdfs/Untitled2.pdf

And my curl/shell is below, followed by the sample output of a run.

curl -X DELETE localhost:9200/test
curl -X PUT localhost:9200/test

curl -X PUT localhost:9200/test/doc/_mapping -d '
{
    "doc" : {
        "properties" : {
            "my_attachment" : {
              "type" : "attachment",
              "fields": {
                "title" : { "store" : "yes" },
                "date" : {"store" : "yes"},
                "author" : {"store" : "yes"},
                "keywords" : {"store" : "yes"},
                "content_type" : {"store" : "yes"},
                "content_length" : {"store" : "yes"},
                "language" : {"store" : "yes"},
                "file" : {"store" : "yes", "term_vector" : "with_positions_offsets"}
               }
            }
        }
    }
}'

echo
echo "Uploading a PDF with 'This is the easiest PDF ever'"
coded=`cat simple/Untitled1.pdf | base64`
json="{\"Thing\":\"first\",\"my_attachment\":\"${coded}\"}"
echo "$json" > json.file
curl -X PUT 'localhost:9200/test/doc/1?refresh=true' -d @json.file
rm json.file

echo
echo "Uploading a PDF with 'This is the second easiest PDF ever'"
coded=`cat simple/Untitled2.pdf | base64`
json="{\"Thing\": \"followup\", \"my_attachment\":\"${coded}\"}"
echo "$json" > json.file
curl -X PUT 'localhost:9200/test/doc/2?refresh=true' -d @json.file
rm json.file

echo
echo "Querying: Should get two hits"
curl -X POST 'localhost:9200/test/doc/_search?pretty=true' -d '{
        "fields": ["title", "author", "date", "file", "keywords"],
        "query" : { "match" : { "_all" : "easiest" } }
}'
echo
echo
echo "Querying: Should get one hit"
curl -X POST 'localhost:9200/test/doc/_search?pretty=true' -d '{
  "fields": "*",
  "query" : { "match" : { "_all" : "second" } }
}'
echo
echo
echo "Directly loading object 1"
echo
curl 'localhost:9200/test/doc/1'
echo
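
One side note on the upload steps above: GNU `base64` wraps its output at 76 columns by default, so on some systems the encoded PDF would contain raw newlines inside the JSON string. Below is a minimal sketch of a helper that strips them. The function name `make_attachment_json` is my own invention for illustration and is not part of the script I actually ran; it also assumes the "Thing" value needs no JSON escaping.

```shell
# Hypothetical helper (not in the original script): print a one-line JSON
# document with the file's base64-encoded content in "my_attachment".
# Assumes the "Thing" value contains no quotes or backslashes.
make_attachment_json() {
    thing="$1"
    pdf_path="$2"
    # tr -d '\n' removes the line wrapping that GNU base64 adds by default
    coded=$(base64 < "$pdf_path" | tr -d '\n')
    printf '{"Thing":"%s","my_attachment":"%s"}\n' "$thing" "$coded"
}
```

It would be used as `make_attachment_json first simple/Untitled1.pdf > json.file`, followed by the same `curl -X PUT ... -d @json.file` call as before.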

And the output:

{"acknowledged":true}{"acknowledged":true}{"acknowledged":true}

Uploading a PDF with 'This is the easiest PDF ever'

{"_index":"test","_type":"doc","_id":"1","_version":1,"created":true}

Uploading a PDF with 'This is the second easiest PDF ever'

{"_index":"test","_type":"doc","_id":"2","_version":1,"created":true}

Querying: Should get two hits

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.067124054,
    "hits" : [ {
      "_index" : "test",
      "_type" : "doc",
      "_id" : "2",
      "_score" : 0.067124054,
      "fields" : {
        "my_attachment.date" : [ "2014-07-31T21:48:21.000Z" ],
        "my_attachment.keywords" : [ "" ],
        "my_attachment.title" : [ "Untitled" ]
      }
    }, {
      "_index" : "test",
      "_type" : "doc",
      "_id" : "1",
      "_score" : 0.067124054,
      "fields" : {
        "my_attachment.date" : [ "2014-07-31T23:29:45.000Z" ],
        "my_attachment.keywords" : [ "TestKeyword1, TestKeyword2" ],
        "my_attachment.title" : [ "Untitled" ]
      }
    } ]
  }
}


Querying: Should get one hit

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.067124054,
    "hits" : [ {
      "_index" : "test",
      "_type" : "doc",
      "_id" : "2",
      "_score" : 0.067124054,
      "fields" : {
        "my_attachment.content_type" : [ "application/pdf" ],
        "my_attachment.keywords" : [ "" ],
        "my_attachment.title" : [ "Untitled" ],
        "my_attachment.date" : [ "2014-07-31T21:48:21.000Z" ],
        "my_attachment.content_length" : [ 9458 ]
      }
    } ]
  }
}


Directly loading object 1
{"_index":"test","_type":"doc","_id":"1","_version":1,"found":true,"_source":{"Thing":"first","my_attachment":"JVBERi0xLjMKJcTl....lots of base64 data removed....VPRgo="}}

Thanks for any help or pointers!

-Erik
