I index PDFs using Apache with the following mapping:

.field("type", "attachment")
.field("fields")
.startObject()
    .startObject("file")
        .field("store", "yes")
    .endObject()

I now want to index photos as well. I am able to extract their text using 
OCR, but I am confused about how to index that text. Do I treat it like any 
other document rather than as an attachment? After extraction I have the 
text as a String, not as base64 as in the case of PDFs.
I am also unsure how the text gets stored and what I need to do to make it 
available during search. Can someone explain how to do this?
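
To make the difference between the two inputs concrete, here is a minimal sketch using only the JDK (Java 8 or later). The class name AttachmentVsText and both method names are hypothetical and exist only for illustration. It assumes the attachment field is fed the raw file bytes as base64, as with the PDFs above, while OCR output is already text and could go into an ordinary string field unchanged.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical illustration class, not part of any Elasticsearch API.
final class AttachmentVsText {

    // Binary files (e.g. PDFs) are base64-encoded before being put into
    // the document, because JSON cannot carry raw bytes.
    static String encodeForAttachment(byte[] fileBytes) {
        return Base64.getEncoder().encodeToString(fileBytes);
    }

    // Reverse of the above, shown only to make the round trip visible.
    static byte[] decodeAttachment(String base64) {
        return Base64.getDecoder().decode(base64);
    }

    // OCR output is already a String, so no encoding step is needed
    // (assumption: it goes into a plain string field, not an attachment).
    static String plainTextValue(String ocrText) {
        return ocrText;
    }

    public static void main(String[] args) {
        byte[] fakePdfBytes = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(encodeForAttachment(fakePdfBytes));
        System.out.println(plainTextValue("text extracted from image"));
    }
}
```
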

XContentFactory.jsonBuilder().startObject()
    .startObject(INDEX_TYPE)
        // This line disables storing the whole _source (including the base64 content)
        .startObject("_source").field("enabled", "no").endObject()
        .startObject("properties")

So my photo object becomes something like the following. But what about the 
source (the image itself)?
jsonObject
{
  "content": "text extracted from image",
  "name": "my_photo.png"
}
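
As a rough sketch of how such an object could be assembled from the OCR result, the following uses only the JDK. The class PhotoDocument and its methods are hypothetical names chosen for illustration; the field names content and name are the ones from the object above. It assumes the OCR text is carried as a plain string field, so the only care needed is JSON escaping.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds the JSON source for a photo whose text has
// already been extracted by OCR.
final class PhotoDocument {

    // Escape the characters that JSON does not allow raw inside a string.
    static String escapeJson(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters as \u00XX escapes
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Build {"content": ..., "name": ...}, keeping insertion order.
    static String toJson(String ocrText, String fileName) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("content", ocrText);
        fields.put("name", fileName);

        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (!first) {
                sb.append(",");
            }
            sb.append('"').append(escapeJson(e.getKey())).append("\":\"")
              .append(escapeJson(e.getValue())).append('"');
            first = false;
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        System.out.println(toJson("text extracted from image", "my_photo.png"));
    }
}
```

The resulting string could then be passed as the request source in place of jsonObject.toString(). In real code a JSON library would normally handle the escaping.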


// add to the bulk indexer for indexing
bulkProcessor.add(Requests.indexRequest(INDEX_NAME).type(INDEX_TYPE)
    .id(jsonObject.getString("name")).source(jsonObject.toString()));
