Thanks David. I agree that OCR, and probably any kind of text extraction,
should be done before indexing into Elasticsearch. I am just wondering
whether Apache Tika supports OCR, or whether anyone has experience with a
particular tool for it. I do plan to do the extraction before indexing.
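
For the "extract first, then decide whether to index" step, here is a
minimal sketch. It assumes the OCR or extraction tool, whichever one that
ends up being, hands back plain text as a String. ImageTextFilter,
indexableText and MIN_CHARS are hypothetical names, and the threshold is an
arbitrary guess rather than a recommended value.

```java
import java.util.Optional;

// Hypothetical helper: decides whether extracted text is worth indexing.
final class ImageTextFilter {
    // Assumed minimum number of letters/digits before an image counts
    // as "has text". Tune this for your own data.
    static final int MIN_CHARS = 3;

    // Returns the whitespace-normalized text, or empty if the image
    // should be discarded because nothing useful was extracted.
    static Optional<String> indexableText(String extracted) {
        if (extracted == null) {
            return Optional.empty();
        }
        String normalized = extracted.replaceAll("\\s+", " ").trim();
        long useful = normalized.chars()
                .filter(Character::isLetterOrDigit)
                .count();
        return useful >= MIN_CHARS
                ? Optional.of(normalized)
                : Optional.empty();
    }
}
```

Only documents for which indexableText(...) is present would be sent to
the index, and only that text, so the original image bytes never have to be
stored.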
On Thursday, February 20, 2014 11:38:31 AM UTC-5, ZenMaster80 wrote:
I am a bit confused about this topic. I would like to index images (PNGs,
JPEGs, GIFs...). My understanding is that I need to extract and index the
text portions of the images; I don't really care about the metadata. So I
looked online and decided to use Apache Tika, which I already use to extract
text from PDFs and index them (PDFs work fine).
- How do I get the text part of images? All I am able to extract is
metadata, which I don't need.
- Ideally, if an image has no text to extract, I want to discard/ignore it.
Can you please clarify this topic a bit more and provide any samples if
available? Additionally, I don't want to store the base64-encoded
document.

PutMappingResponse putMappingResponse = new PutMappingRequestBuilder(
        client.admin().indices())
    .setIndices(INDEX_NAME)
    .setType(INDEX_TYPE)
    .setSource(XContentFactory.jsonBuilder()
        .startObject()
            .startObject(INDEX_TYPE)
                // I believe this will not store the whole base64 _source;
                // below I store only the text portion of the file.
                .startObject("_source").field("enabled", "no").endObject()
                .startObject("properties")
                    .startObject("file")
                        .field("term_vector", "with_positions_offsets")
                        .field("store", "no")
                        .field("type", "attachment")
                        .field("fields")
                        .startObject()
                            .startObject("file")
                                .field("store", "yes")
                            .endObject()
                        .endObject()
                    .endObject()
                .endObject()
            .endObject()
        .endObject())
    .execute().actionGet();

public static void testImage(File file)
        throws IOException, SAXException, TikaException {
    Tika tika = new Tika();
    try (InputStream inputStream =
            new BufferedInputStream(new FileInputStream(file))) {
        Metadata metadata = new Metadata();
        // DefaultHandler ignores every SAX event, so no body text is kept.
        ContentHandler handler = new DefaultHandler();
        Parser parser = new JpegParser();
        ParseContext context = new ParseContext();
        String mimeType = tika.detect(inputStream);
        metadata.set(Metadata.CONTENT_TYPE, mimeType);
        parser.parse(inputStream, handler, metadata, context);
        // Metadata only; I don't care about this.
        for (String name : metadata.names()) {
            System.out.println(name + ": " + metadata.get(name));
        }
    }
}
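
One note on the handler in testImage: Tika parsers report any text they
find as SAX events, and a plain DefaultHandler silently drops all of them.
The sketch below shows a handler that collects character data instead. It
is an illustration under assumptions, not Tika code: TextCollectingHandler
and extract are hypothetical names, and the JDK's own SAX parser stands in
for a Tika parser so the example runs without extra dependencies. Tika
ships org.apache.tika.sax.BodyContentHandler for the same purpose.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: keeps the character data that DefaultHandler
// would otherwise throw away.
class TextCollectingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    String getText() {
        return text.toString().trim();
    }

    // Demo entry point: feeds XHTML through the JDK SAX parser, which
    // plays the role a Tika parser would play in real code.
    static String extract(String xhtml) {
        TextCollectingHandler handler = new TextCollectingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(
                            xhtml.getBytes(StandardCharsets.UTF_8)),
                    handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return handler.getText();
    }
}
```

Even with a collecting handler, JpegParser as far as I know only reads
metadata such as EXIF, so text that exists purely as pixels would still
need a separate OCR step before anything reaches the handler.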