Hi everybody,

I want to extract URLs from my PDF files. I use the mapper-attachments 
plugin to index them.

To be able to run regexp queries and extract all the URLs present in a 
PDF file, I used the uax_url_email tokenizer:

curl -X PUT "localhost:9200/test" -d '{
  "settings" : {
    "index" : {
      "analysis" : {
        "analyzer" : {
          "default" : {
            "type" : "custom",
            "tokenizer" : "uax_url_email",
            "filter" : ["standard", "lowercase", "stop"]
          }
        }
      }
    }
  }
}'

And the mapping:

curl -X PUT "localhost:9200/test/attachment/_mapping" -d '{
  "attachment" : {
    "properties" : {
      "file" : {
        "type" : "attachment",
        "fields" : {
          "title" : { "store" : "yes" },
          "file" : { "term_vector" : "with_positions_offsets", "store" : "yes" }
        }
      }
    }
  }
}'


I indexed some PDF files. The problem is that for one file I get this (even 
though the URLs in this file start with http://):

<https://lh3.googleusercontent.com/-6uzhp-v0qFs/VOSfMU95byI/AAAAAAAAAUc/H4c6xvb54kg/s1600/Capture%2Bd’écran%2B2015-02-18%2Bà%2B15.17.19.png>

For another file, I get this (here it leaves the http://):

<https://lh3.googleusercontent.com/-1rYIYWJJEbU/VOSfweFpgbI/AAAAAAAAAUk/bWzfst_uZUE/s1600/Capture%2Bd’écran%2B2015-02-18%2Bà%2B15.19.43.png>

But the problem is that the URLs are not recognized completely. Look at this:

<https://lh3.googleusercontent.com/-vsKUj5I9MiA/VOSgtyS3yWI/AAAAAAAAAUw/64lgO4gYSdI/s1600/Capture%2Bd’écran%2B2015-02-18%2Bà%2B15.22.32.png>

Is it caused by the double column representation in the PDF file?

<https://lh4.googleusercontent.com/-c7n5-oMygRM/VOShm4hwnWI/AAAAAAAAAU4/CQNjTTctMnY/s1600/Capture%2Bd’écran%2B2015-02-18%2Bà%2B15.26.46.png>
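
To illustrate what I suspect is happening, here is a small shell sketch. It is 
only a stand-in: it uses grep with a simple regular expression, not the 
uax_url_email tokenizer, and the sample text and the helper name extract_urls 
are made up for this example. The assumption is that the text extracted from a 
narrow column contains a line break in the middle of a URL, so anything that 
splits on whitespace only sees the first part:

```shell
#!/bin/sh
# Hypothetical helper: print every http(s) URL found on stdin, one per line.
# This is a plain regex stand-in, NOT the uax_url_email tokenizer.
extract_urls() {
  grep -oE 'https?://[^[:space:]]+'
}

# Made-up sample text: the same URL, once intact and once broken by a line
# wrap, as might happen when text is extracted from a narrow PDF column.
intact='see http://www.example.com/docs/report.pdf for details'
wrapped='see http://www.example.com/docs/
report.pdf for details'

printf '%s\n' "$intact"  | extract_urls
printf '%s\n' "$wrapped" | extract_urls
```

The first call prints the full URL, the second only 
http://www.example.com/docs/, which looks a lot like the truncated URLs I get 
back from the index.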

So, what did I do wrong? How can I fix this and use regexp queries 
successfully to extract all the URLs?


Thank you





 
 

 

 



