Hi,
For anyone having the same problem as me, here is an answer I received from
Pablo in the PT group:
About your problem, I believe this is a limitation of Apache Tika [1],
which is used by the mapper-attachment plugin.
I think that searching for Tika's PDF limitations, or asking on their
mailing list, will help you more than we can.
You may also want to ask on the main Elasticsearch list [2], which is
bigger than ours and includes the Elasticsearch engineers.
I am sorry I cannot be of more help.
Cheers,
Pablo
[1] http://tika.apache.org/
[2] elasti...@googlegroups.com
On Wednesday, 18 February 2015 15:37:33 UTC+1, Marria wrote:
Hi everybody,
I want to extract URLs from my PDF files. I use the mapper-attachment
plugin to index them.
In order to run regex queries and extract all the URLs present in a PDF
file, I used the uax_url_email tokenizer:
curl -X PUT localhost:9200/test -d '{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "default": {
            "type": "custom",
            "tokenizer": "uax_url_email",
            "filter": ["standard", "lowercase", "stop"]
          }
        }
      }
    }
  }
}'
and the mapping:
curl -X PUT localhost:9200/test/attachment/_mapping -d '{
  "attachment": {
    "properties": {
      "file": {
        "type": "attachment",
        "fields": {
          "title": { "store": "yes" },
          "file": { "term_vector": "with_positions_offsets", "store": "yes" }
        }
      }
    }
  }
}'
I indexed some PDF files. The problem is that for one file I get this
(although the URLs in this file start with http://):
https://lh3.googleusercontent.com/-6uzhp-v0qFs/VOSfMU95byI/AUc/H4c6xvb54kg/s1600/Capture%2Bd’écran%2B2015-02-18%2Bà%2B15.17.19.png
For another file, I got this (it leaves the http://):
https://lh3.googleusercontent.com/-1rYIYWJJEbU/VOSfweFpgbI/AUk/bWzfst_uZUE/s1600/Capture%2Bd’écran%2B2015-02-18%2Bà%2B15.19.43.png
But the problem is that the URLs are not recognized completely. Look at this:
https://lh3.googleusercontent.com/-vsKUj5I9MiA/VOSgtyS3yWI/AUw/64lgO4gYSdI/s1600/Capture%2Bd’écran%2B2015-02-18%2Bà%2B15.22.32.png
Is it caused by the double column representation in the PDF file?
https://lh4.googleusercontent.com/-c7n5-oMygRM/VOShm4hwnWI/AU4/CQNjTTctMnY/s1600/Capture%2Bd’écran%2B2015-02-18%2Bà%2B15.26.46.png
So, what did I do wrong? How can I fix this and use regexp queries
successfully to extract all the URLs?
Thank you
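
To make the double-column question above concrete, here is a minimal Python
sketch of how a URL that is wrapped across two lines in the extracted text
would be cut short by a pattern match, and one heuristic for gluing it back
together. It assumes the extracted text really contains a line break inside
the URL, which the thread does not confirm. The sample text, the regular
expressions, and the function names are all hypothetical, and the pattern is
a simplification, not the grammar used by the uax_url_email tokenizer.

```python
import re

# Simplified URL pattern, for illustration only (an assumption, not the
# UAX#29-based grammar behind the uax_url_email tokenizer).
URL_RE = re.compile(r"https?://[^\s<>\"']+")

# Heuristic: a URL that ends a line on a separator character (/ - . _ ? = &)
# is assumed to continue with the first token of the next line.
WRAP_RE = re.compile(r"(https?://\S*[/\-._?=&])\n(\S+)")


def extract_urls(text):
    """Return every substring of text that looks like an http(s) URL."""
    return URL_RE.findall(text)


def rejoin_wrapped_urls(text):
    """Glue URLs split by a line break back together (heuristic, may over-join)."""
    return WRAP_RE.sub(r"\1\2", text)


# Hypothetical extracted text in which a URL was wrapped at a column edge.
sample = "See the docs at http://example.org/guides/\ninstall.html for details.\n"

print(extract_urls(sample))
# ['http://example.org/guides/']
print(extract_urls(rejoin_wrapped_urls(sample)))
# ['http://example.org/guides/install.html']
```

The heuristic will also join a URL that legitimately ends a sentence with a
period at the end of a line, so it is only a starting point for checking
whether line wrapping is the cause.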