Re: Indexing PDF and other binary formats

2014-01-16 Thread ZenMaster80
Thanks for the reply. As I understand it, the attachment plugin encodes content 
before indexing it, which sounds like an expensive operation if we have lots 
of PDFs. I was thinking of extracting the text from the PDFs early on and 
working with the text instead.
Does the plugin also work for binaries like images?
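
To put a rough number on the encoding concern, here is a minimal sketch that compares the two request bodies. It assumes the attachment field takes the whole file as base64, as described above; the field names `file` and `content` and the sample bytes are made up for illustration and do not come from the plugin documentation.

```python
import base64
import json


def attachment_payload(raw: bytes) -> str:
    """Build a JSON body that carries the whole file as base64.

    The field name "file" is a hypothetical placeholder.
    """
    return json.dumps({"file": base64.b64encode(raw).decode("ascii")})


def text_payload(extracted_text: str) -> str:
    """Build a JSON body that carries only text extracted beforehand.

    The field name "content" is a hypothetical placeholder.
    """
    return json.dumps({"content": extracted_text})


# 3072 bytes standing in for a binary file such as a PDF.
raw = bytes(range(256)) * 12

# Base64 turns every 3 input bytes into 4 output characters.
encoded_len = len(base64.b64encode(raw))

if __name__ == "__main__":
    print("raw bytes:         ", len(raw))
    print("base64 characters: ", encoded_len)
    print("attachment body:   ", len(attachment_payload(raw)))
    print("text-only body:    ", len(text_payload("text pulled out of the file")))
```

The base64 form is always about a third larger than the file itself, while a text-only body is only as large as the text you decide to keep.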

On Thursday, January 16, 2014 4:12:47 PM UTC-5, David Pilato wrote:

 You can use Tika yourself (recommended). See how I did it in the fsriver 
 project.
 You can use the mapper attachment plugin, which uses Tika behind the scenes 
 but gives you less control IMHO.

 About versions: Elasticsearch does not keep old versions around. If you 
 need that, you have to manage it yourself.
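
One way to "manage it yourself" is to copy the outgoing revision somewhere else before overwriting it. The sketch below is only an illustration of that idea: two in-memory dicts stand in for a "current" index and a "history" index, and the class name, the `<id>-v<n>` key scheme, and the version counter are all assumptions, not Elasticsearch API calls.

```python
from typing import Dict, Optional


class VersionedStore:
    """Keeps the latest revision per id and archives every replaced one."""

    def __init__(self) -> None:
        # Stand-in for the live index: doc_id -> {"version", "source"}.
        self.current: Dict[str, dict] = {}
        # Stand-in for a separate history index: "<doc_id>-v<n>" -> entry.
        self.history: Dict[str, dict] = {}

    def index(self, doc_id: str, source: dict) -> int:
        """Store a new revision and return its version number."""
        previous = self.current.get(doc_id)
        version = 1 if previous is None else previous["version"] + 1
        if previous is not None:
            # Archive the revision that is about to be replaced.
            self.history[f"{doc_id}-v{previous['version']}"] = previous
        self.current[doc_id] = {"version": version, "source": dict(source)}
        return version

    def get(self, doc_id: str, version: Optional[int] = None) -> Optional[dict]:
        """Return the latest revision, or an older one when version is given."""
        latest = self.current.get(doc_id)
        if latest is None:
            return None
        if version is None or version == latest["version"]:
            return latest["source"]
        archived = self.history.get(f"{doc_id}-v{version}")
        return None if archived is None else archived["source"]


if __name__ == "__main__":
    store = VersionedStore()
    store.index("report", {"content": "first draft"})
    store.index("report", {"content": "second draft"})
    print(store.get("report"))
    print(store.get("report", version=1))
```

With a real cluster the same pattern would mean writing the old document under its own id before indexing the new one, so that both remain searchable.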

 HTH

 --
 David ;-)
 Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

 On 16 Jan 2014 at 20:42, ZenMaster80 sabda...@gmail.com wrote:

 - Is there any literature on how to index PDF documents and binary formats 
 like images?
 - Versioning question: If I update an already indexed document, I believe 
 ES will update the version number. Does it keep the previous document? 
 What if I need access to the previous document?

 -- 
 You received this message because you are subscribed to the Google Groups 
 elasticsearch group.
 To unsubscribe from this group and stop receiving emails from it, send an 
 email to elasticsearc...@googlegroups.com.
 To view this discussion on the web visit 
 https://groups.google.com/d/msgid/elasticsearch/a9e8f331-c4bd-4a4c-be5a-b91e4f2f0e26%40googlegroups.com.
 For more options, visit https://groups.google.com/groups/opt_out.





Re: Indexing PDF and other binary formats

2014-01-16 Thread David Pilato
Yes. Some metadata is extracted with Tika.

As you said, you should do that extraction before indexing (that is, index 
only what you really need).
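
A minimal sketch of "extract first, index only what you need". The extraction step is passed in as a plain function so the example runs on its own; in practice that function would wrap Tika. `build_document`, `fake_extract`, and the field names are hypothetical and not taken from fsriver or the plugin.

```python
from typing import Callable, Dict, Iterable, Tuple

# An extractor takes raw file bytes and returns (text, metadata).
Extractor = Callable[[bytes], Tuple[str, Dict[str, str]]]


def build_document(
    raw: bytes,
    extract: Extractor,
    keep_metadata: Iterable[str] = ("title", "author"),
) -> dict:
    """Build the JSON-ready document to index from a binary file.

    Only the extracted text and the whitelisted metadata keys are kept;
    the raw bytes never reach the index.
    """
    text, metadata = extract(raw)
    # Collapse whitespace runs left over from extraction.
    doc = {"content": " ".join(text.split())}
    for key in keep_metadata:
        if key in metadata:
            doc[key] = metadata[key]
    return doc


def fake_extract(raw: bytes) -> Tuple[str, Dict[str, str]]:
    """Stand-in for a real Tika call, used only to make the sketch runnable."""
    metadata = {"title": "Report", "author": "Z", "producer": "SomeTool"}
    return raw.decode("utf-8"), metadata


if __name__ == "__main__":
    print(build_document(b"Hello \n  world", fake_extract))
```

Because the extractor is injected, the decision about which fields to keep stays in your own code rather than in the plugin's mapping.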

--
David ;-)
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

