Hi everybody, I want to extract URLs from my PDF files. I use the mapper-attachment plugin to index the PDF files.
In order to run regexp queries and extract all the URLs present in a PDF file, I used the uax_url_email tokenizer:

    curl -X PUT "localhost:9200/test" -d '{
      "settings" : {
        "index" : {
          "analysis" : {
            "analyzer" : {
              "default" : {
                "type" : "custom",
                "tokenizer" : "uax_url_email",
                "filter" : ["standard", "lowercase", "stop"]
              }
            }
          }
        }
      }
    }'

and the mapping:

    curl -X PUT "localhost:9200/test/attachment/_mapping" -d '{
      "attachment" : {
        "properties" : {
          "file" : {
            "type" : "attachment",
            "fields" : {
              "title" : { "store" : "yes" },
              "file" : { "term_vector" : "with_positions_offsets", "store" : "yes" }
            }
          }
        }
      }
    }'

I indexed some PDF files. The problem is that for one file I get this (while the URLs in this file start with http://):

<https://lh3.googleusercontent.com/-6uzhp-v0qFs/VOSfMU95byI/AAAAAAAAAUc/H4c6xvb54kg/s1600/Capture%2Bd’écran%2B2015-02-18%2Bà%2B15.17.19.png>

For another file I got this (it leaves the http://):

<https://lh3.googleusercontent.com/-1rYIYWJJEbU/VOSfweFpgbI/AAAAAAAAAUk/bWzfst_uZUE/s1600/Capture%2Bd’écran%2B2015-02-18%2Bà%2B15.19.43.png>

But the problem is that the URLs are not recognized completely. Look at this:

<https://lh3.googleusercontent.com/-vsKUj5I9MiA/VOSgtyS3yWI/AAAAAAAAAUw/64lgO4gYSdI/s1600/Capture%2Bd’écran%2B2015-02-18%2Bà%2B15.22.32.png>

Is it caused by the two-column layout of the PDF file?

<https://lh4.googleusercontent.com/-c7n5-oMygRM/VOShm4hwnWI/AAAAAAAAAU4/CQNjTTctMnY/s1600/Capture%2Bd’écran%2B2015-02-18%2Bà%2B15.26.46.png>

So, what did I do wrong? How can I fix this and use regexp queries successfully to extract all the URLs?

Thank you
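
P.S. To make the two-column suspicion concrete outside Elasticsearch, here is a minimal shell sketch. It assumes that the text extracted from the PDF contains a hard line break in the middle of a URL. The sample text, the example.com URL and the helper name extract_urls are made up for illustration and are not taken from my files. A pattern that stops at whitespace then returns only the part of the URL before the break, which resembles the truncated URLs in the third screenshot.

```shell
#!/bin/sh
# Hypothetical helper: print every http(s) URL found on stdin, one per line.
# A URL is taken to end at the first whitespace character.
extract_urls() {
  grep -oE 'https?://[^[:space:]]+'
}

# Made-up sample: the same URL once intact and once broken by a line break,
# as could happen when a narrow column wraps a long URL.
intact='see http://example.com/docs/page.html for details'
wrapped='see http://example.com/docs/
page.html for details'

printf '%s\n' "$intact"  | extract_urls   # the full URL
printf '%s\n' "$wrapped" | extract_urls   # only the part before the line break
```

If the extracted text really looks like the second sample, a tokenizer that stops at whitespace would not see the full URL either, so the question would be about how the text is extracted rather than how it is analyzed.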