I am a bit confused about this topic. I would like to index images (PNG, JPEG, GIF, ...). My understanding is that I need to extract and index the text portions of the images; I don't really care about the metadata. So I looked online and decided to use Apache Tika, which I already use to extract text from PDFs and index them (PDFs work fine).

- How do I get the text part of images? All I am able to extract is metadata, which I don't need.
- Ideally, if an image has no text to extract, I want to discard/ignore it. How can I do that?

Can you please clarify this topic a bit more and provide any samples if available? Additionally, I don't want to store the base64-encoded document.
PutMappingResponse putMappingResponse = new PutMappingRequestBuilder(client.admin().indices())
    .setIndices(INDEX_NAME)
    .setType(INDEX_TYPE)
    .setSource(XContentFactory.jsonBuilder()
        .startObject()
            .startObject(INDEX_TYPE)
                // I believe this line will not store the whole base64 _source;
                // below I store only the text portion of the file, "file"
                .startObject("_source").field("enabled", "no").endObject()
                .startObject("properties")
                    .startObject("file")
                        .field("term_vector", "with_positions_offsets")
                        .field("store", "no")
                        .field("type", "attachment")
                        .field("fields")
                        .startObject()
                            .startObject("file")
                                .field("store", "yes")
                            .endObject()
                        .endObject()
                    .endObject()
                .endObject()
            .endObject()
        .endObject())
    .execute().actionGet();

public static void testImage(File file) throws IOException, SAXException, TikaException {
    Tika tika = new Tika();
    InputStream inputStream = new BufferedInputStream(new FileInputStream(file));
    Metadata metadata = new Metadata();
    ContentHandler handler = new DefaultHandler();
    Parser parser = new JpegParser();
    ParseContext context = new ParseContext();
    String mimeType = tika.detect(inputStream);
    metadata.set(Metadata.CONTENT_TYPE, mimeType);
    parser.parse(inputStream, handler, metadata, context);
    for (int i = 0; i < metadata.names().length; i++) {
        // metadata - I don't care for this
        String name = metadata.names()[i];
        System.out.println(name + " : " + metadata.get(name));
    }
}
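
To make the "discard if there is no text" part of my question concrete, here is a rough sketch of the behaviour I am after. It is only an illustration: ImageTextFilter, hasIndexableText and textToIndex are names I made up, and the extractor argument is a placeholder for whatever Tika (or OCR) call actually produces text from an image, which is exactly the piece I am missing. The minChars threshold is also just an assumption to filter out stray characters.

```java
import java.util.Optional;
import java.util.function.Function;

// Sketch only: these names are hypothetical, not part of Tika or Elasticsearch.
final class ImageTextFilter {

    private ImageTextFilter() {}

    // True when the extracted string contains at least minChars letters or digits.
    static boolean hasIndexableText(String extracted, int minChars) {
        if (extracted == null) {
            return false;
        }
        long count = extracted.codePoints()
                .filter(Character::isLetterOrDigit)
                .count();
        return count >= minChars;
    }

    // Applies the (placeholder) extractor to the raw image bytes.
    // Returns the trimmed text to index, or empty if the image should be skipped.
    static Optional<String> textToIndex(byte[] image,
                                        Function<byte[], String> extractor,
                                        int minChars) {
        String text = extractor.apply(image);
        return hasIndexableText(text, minChars)
                ? Optional.of(text.trim())
                : Optional.empty();
    }
}
```

The idea is that I would call textToIndex for each image and send only the present values to the index, so images with no usable text never reach Elasticsearch.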