Re: Indexing PDF and other binary formats

2014-01-16 Thread ZenMaster80

Thanks for the reply. As I understand it, the attachment plugin encodes the
content before indexing it, which sounds like an expensive operation if we
have lots of PDFs. I was thinking of extracting the text from the PDFs early
on and dealing with plain text instead.
Does the plugin also work for binaries like images?

On Thursday, January 16, 2014 4:12:47 PM UTC-5, David Pilato wrote:

You can use Tika yourself (recommended). See how I did it in the fsriver
project.
Or you can use the mapper attachment plugin, which uses Tika behind the
scenes but gives you less control IMHO.

About versions, elasticsearch does not keep old versions around. If you
need that, you have to manage it yourself.
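
A minimal sketch of one way to manage old versions yourself, assuming every
revision gets its own document ID so that nothing is overwritten. The
VersionedStore class, the in-memory dict standing in for an index, and the
"<id>_v<n>" naming scheme are all hypothetical illustrations, not part of
elasticsearch:

```python
class VersionedStore:
    """Hypothetical stand-in for an index that keeps every revision."""

    def __init__(self):
        self._docs = {}    # revision id ("<id>_v<n>") -> document body
        self._latest = {}  # logical id -> latest revision number

    def index(self, doc_id, body):
        # Assign the next revision number instead of overwriting in place.
        version = self._latest.get(doc_id, 0) + 1
        self._docs[f"{doc_id}_v{version}"] = dict(body)
        self._latest[doc_id] = version
        return version

    def get(self, doc_id, version=None):
        # Default to the latest revision; older ones stay retrievable.
        if version is None:
            version = self._latest[doc_id]
        return self._docs[f"{doc_id}_v{version}"]


store = VersionedStore()
store.index("report", {"text": "first draft"})
store.index("report", {"text": "second draft"})
previous = store.get("report", version=1)
current = store.get("report")
```

With a real index, the same idea would mean indexing each revision under its
own ID (or in a separate history index) rather than updating one document.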

HTH

--
David ;-)
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

On 16 Jan 2014 at 20:42, ZenMaster80 sabda...@gmail.com wrote:

--
You received this message because you are subscribed to the Google Groups
elasticsearch group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearc...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/a9e8f331-c4bd-4a4c-be5a-b91e4f2f0e26%40googlegroups.com
.
For more options, visit https://groups.google.com/groups/opt_out.


Re: Indexing PDF and other binary formats

2014-01-16 Thread David Pilato

Yes. Tika extracts some metadata from them.

As you said, you should do that extraction before indexing (that is, only
index what you really need).
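
To make the trade-off concrete, here is a rough sketch comparing the two
payloads. It assumes the attachment plugin route sends the whole file
base64-encoded, while the do-it-yourself route sends only text that was
extracted beforehand. The field names ("file", "content") and the sample
bytes are placeholders, and the extraction step itself (Tika) is not shown:

```python
import base64
import json


def attachment_payload(raw_bytes):
    # Whole file, base64-encoded, so extraction happens on the server side.
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return json.dumps({"file": encoded})


def text_payload(extracted_text):
    # Only the text we decided to keep, extracted before indexing.
    return json.dumps({"content": extracted_text})


# Stand-in for a binary PDF; a real file would be read from disk.
raw = b"%PDF-1.4 " + bytes(range(256)) * 40
text = "Only the words we want to search."

encoded_size = len(attachment_payload(raw))
text_size = len(text_payload(text))
```

Base64 adds roughly a third to the size of the original bytes, so the encoded
payload is always larger than the file itself, whereas the extracted text is
usually much smaller.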

--
David ;-)
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

On 16 Jan 2014 at 22:51, ZenMaster80 sabdall...@gmail.com wrote:

On Thursday, January 16, 2014 4:12:47 PM UTC-5, David Pilato wrote:

You can use Tika yourself (recommended). See how I did it in the fsriver
project.
Or you can use the mapper attachment plugin, which uses Tika behind the
scenes but gives you less control IMHO.

About versions, elasticsearch does not keep old versions around. If you need
that, you have to manage it yourself.

HTH

--
David ;-)
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

On 16 Jan 2014 at 20:42, ZenMaster80 sabda...@gmail.com wrote:

- Is there any literature on how to index PDF documents and binary formats
like images?
- Versioning question: if I update an already indexed document, I believe
ES will increment the version number. Does it keep the previous document?
What if I need access to the previous version?
