Indexing Images

2014-02-20 Thread ZenMaster80
I am a bit confused about this topic. I would like to index images 
(PNGs, JPEGs, GIFs...). My understanding is that I need to extract and index 
the text portions of the images; I don't really care about the metadata. So I 
looked online and decided to use Apache Tika, which I also use to extract 
text from PDFs and index it (PDFs work fine).
- How do I get the text part of images? All I am able to extract is 
metadata, which I don't need.
- Ideally, if an image has no text to extract, I want to discard/ignore it. 
Can you please clarify this topic a bit more and provide any samples if 
available? Additionally, I don't want to store the base64-encoded 
document.

PutMappingResponse putMappingResponse = new PutMappingRequestBuilder(client.admin().indices())
    .setIndices(INDEX_NAME)
    .setType(INDEX_TYPE)
    .setSource(XContentFactory.jsonBuilder()
        .startObject()
          .startObject(INDEX_TYPE)
            // I believe this line prevents the whole base64 _source from being stored;
            // below I store only the text portion of the file
            .startObject("_source").field("enabled", "no").endObject()
            .startObject("properties")
              .startObject("file")
                .field("term_vector", "with_positions_offsets")
                .field("store", "no")
                .field("type", "attachment")
                .field("fields")
                .startObject()
                  .startObject("file")
                    .field("store", "yes")
                  .endObject()
                .endObject()
              .endObject()
            .endObject()
          .endObject()
        .endObject())
    .execute().actionGet();


public static void testImage(File file) throws IOException, SAXException, TikaException {
    Tika tika = new Tika();
    InputStream inputStream = new BufferedInputStream(new FileInputStream(file));
    Metadata metadata = new Metadata();
    ContentHandler handler = new DefaultHandler();
    Parser parser = new JpegParser();
    ParseContext context = new ParseContext();

    String mimeType = tika.detect(inputStream);
    metadata.set(Metadata.CONTENT_TYPE, mimeType);
    parser.parse(inputStream, handler, metadata, context);

    // This prints metadata only, which I don't care for
    for (int i = 0; i < metadata.names().length; i++) {
        String name = metadata.names()[i];
        System.out.println(name + " : " + metadata.get(name));
    }
}


-- 
You received this message because you are subscribed to the Google Groups 
elasticsearch group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/dbfe132a-c25b-40f0-93a7-7957cf978004%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


Re: Indexing Images

2014-02-20 Thread ZenMaster80
Thanks David. I agree that OCR, and maybe any kind of text extraction, should 
be done before indexing into Elasticsearch. But I am just wondering whether 
Apache Tika supports this, or whether anyone has experience with a particular 
tool. I do plan to do the extraction before indexing.
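
For what it's worth, the "extract first, then index" flow could look roughly 
like the sketch below. It is only an illustration under assumptions: 
TextExtractor is a hypothetical interface standing in for whichever OCR or 
extraction tool ends up being used, and the field names "name" and "text" 
are made up for the example. It shows the two things asked about above: 
images with no text are discarded, and only the extracted text (never the 
base64 content) goes into the document that would be indexed.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

public class PreIndexExtraction {

    /** Hypothetical stand-in for the real OCR / text-extraction tool. */
    public interface TextExtractor {
        String extract(byte[] imageBytes) throws Exception;
    }

    /**
     * Builds the document to index, or returns empty when the image
     * yields no usable text. The raw image bytes are never put in the document.
     */
    public static Optional<Map<String, Object>> toDocument(
            String fileName, byte[] imageBytes, TextExtractor extractor) {
        String text;
        try {
            text = extractor.extract(imageBytes);
        } catch (Exception e) {
            // Extraction failed: treat the image as having nothing to index.
            return Optional.empty();
        }
        if (text == null || text.trim().isEmpty()) {
            // No text found in the image: discard it.
            return Optional.empty();
        }
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("name", fileName);      // assumed field name
        doc.put("text", text.trim());   // assumed field name
        return Optional.of(doc);
    }

    public static void main(String[] args) {
        // Fake extractor for demonstration only; a real one would run OCR.
        TextExtractor fakeOcr = bytes -> bytes.length > 0 ? "  STOP sign  " : "";
        System.out.println(toDocument("sign.jpg", new byte[] {1}, fakeOcr));
        System.out.println(toDocument("blank.png", new byte[0], fakeOcr));
    }
}
```

With something like this, the map returned for a non-empty result could be 
passed as the source of a normal index request, so the attachment type and 
the base64 payload would not be needed at all.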

On Thursday, February 20, 2014 11:38:31 AM UTC-5, ZenMaster80 wrote:
