I am a bit confused about this topic. I would like to index images 
(PNG, JPEG, GIF, ...). My understanding is that I need to extract and index 
the text portions of the images; I don't care about the metadata. So I 
looked online and decided to use Apache Tika, which I already use to extract 
text from PDFs and index them (PDFs work fine).
- How do I get the text part of images? All I am able to extract is 
metadata, which I don't need.
- Ideally, if an image has no text to extract, I want to discard/ignore it.

Can you please clarify this topic a bit more and provide any samples if 
available? Additionally, I don't want to store the base64-encoded document.
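
To make the "no text, then discard" requirement concrete: whatever ends up producing the text, the check itself can be done on the extracted string before indexing. A minimal sketch, assuming the extracted text arrives as a plain String; the names ExtractedTextFilter and indexableText are hypothetical and are not part of Tika or Elasticsearch:

```java
import java.util.Optional;

class ExtractedTextFilter {

    // Hypothetical helper: returns the trimmed text if it contains at least
    // one letter or digit; otherwise returns an empty Optional so the caller
    // can skip indexing the image.
    static Optional<String> indexableText(String extracted) {
        if (extracted == null) {
            return Optional.empty();
        }
        String trimmed = extracted.trim();
        boolean hasWordChar = trimmed.codePoints().anyMatch(Character::isLetterOrDigit);
        return hasWordChar ? Optional.of(trimmed) : Optional.empty();
    }

    public static void main(String[] args) {
        // Whitespace-only extraction: nothing worth indexing.
        System.out.println(indexableText("  \n\t ").isPresent());        // prints "false"
        // Real text: index the trimmed value.
        System.out.println(indexableText(" Invoice 42 ").orElse("<skip>")); // prints "Invoice 42"
    }
}
```

Requiring a letter or digit, rather than just a non-empty string, is a design choice in this sketch; it also drops extractions that consist only of punctuation or stray symbols.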

PutMappingResponse putMappingResponse = new PutMappingRequestBuilder(client.admin().indices())
    .setIndices(INDEX_NAME)
    .setType(INDEX_TYPE)
    .setSource(XContentFactory.jsonBuilder()
        .startObject()
            .startObject(INDEX_TYPE)
                // I believe this line prevents the whole base64 _source from being
                // stored; below I store only the text portion of the file ("file").
                .startObject("_source").field("enabled", "no").endObject()
                .startObject("properties")
                    .startObject("file")
                        .field("term_vector", "with_positions_offsets")
                        .field("store", "no")
                        .field("type", "attachment")
                        .startObject("fields")
                            .startObject("file")
                                .field("store", "yes")
                            .endObject()
                        .endObject()
                    .endObject()
                .endObject()
            .endObject()
        .endObject())
    .execute().actionGet();
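
For context on the base64 concern: a field of type "attachment" receives the file content as a base64 string inside the document, which is why the full _source would otherwise carry the encoded file. A minimal sketch of building such a document body with only the JDK; the class name AttachmentPayload and the method toJson are hypothetical, and the field name "file" simply mirrors the mapping above:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class AttachmentPayload {

    // Hypothetical helper: wraps the raw file bytes as a base64 string in a
    // JSON object with a single "file" field. The standard base64 alphabet
    // contains no characters that need JSON escaping.
    static String toJson(byte[] fileBytes) {
        String encoded = Base64.getEncoder().encodeToString(fileBytes);
        return "{\"file\":\"" + encoded + "\"}";
    }

    public static void main(String[] args) {
        byte[] sample = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(toJson(sample)); // prints {"file":"aGVsbG8="}
    }
}
```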


public static void testImage(File file) throws IOException, SAXException, TikaException {
    Tika tika = new Tika();

    // try-with-resources closes the stream even if parsing fails
    try (InputStream inputStream = new BufferedInputStream(new FileInputStream(file))) {
        Metadata metadata = new Metadata();
        ContentHandler handler = new DefaultHandler();  // no-op handler: SAX events are ignored
        Parser parser = new JpegParser();
        ParseContext context = new ParseContext();

        // detect() marks and resets the buffered stream, so it can still be parsed afterwards
        String mimeType = tika.detect(inputStream);
        metadata.set(Metadata.CONTENT_TYPE, mimeType);

        parser.parse(inputStream, handler, metadata, context);

        // Metadata: I don't care about this
        for (String name : metadata.names()) {
            System.out.println(name + " : " + metadata.get(name));
        }
    }
}

