Re: problem with using uax_url_email

2015-02-19 Thread Marria
Hi,

For anyone who runs into the same problem, here is an answer I received from 
Pablo in the PT group:

About your problem: I believe this is a constraint of Apache Tika [1], 
which is used by the mapper-attachment plugin.
I believe that searching for Tika's PDF limitations or asking on their 
list will help you more than we can.
Anyway, you may want to ask on the main Elasticsearch list [2], which is 
bigger than ours and includes the Elasticsearch engineers.

I am sorry for not being able to help you that much.

Cheers,
Pablo

[1] http://tika.apache.org/
[2] elasti...@googlegroups.com



-- 
You received this message because you are subscribed to the Google Groups 
elasticsearch group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/12a4f452-6c6f-4e4f-ba00-97208efdbcba%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


problem with using uax_url_email

2015-02-18 Thread Marria
Hi everybody,

I want to extract URLs from my PDF files. I use the mapper-attachment 
plugin to index them.

To be able to run regexp queries and extract all the URLs present in a 
PDF file, I used the uax_url_email tokenizer:

curl -X PUT localhost:9200/test -d '{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "default": {
            "type": "custom",
            "tokenizer": "uax_url_email",
            "filter": ["standard", "lowercase", "stop"]
          }
        }
      }
    }
  }
}'



and the mapping:

curl -X PUT localhost:9200/test/attachment/_mapping -d '{
  "attachment": {
    "properties": {
      "file": {
        "type": "attachment",
        "fields": {
          "title": { "store": "yes" },
          "file": { "term_vector": "with_positions_offsets", "store": "yes" }
        }
      }
    }
  }
}'


I indexed some PDF files. The problem is that for one file I get this 
(although the URLs in this file start with http://):


https://lh3.googleusercontent.com/-6uzhp-v0qFs/VOSfMU95byI/AUc/H4c6xvb54kg/s1600/Capture%2Bd’écran%2B2015-02-18%2Bà%2B15.17.19.png
For another file, I got this (it leaves the http://):

https://lh3.googleusercontent.com/-1rYIYWJJEbU/VOSfweFpgbI/AUk/bWzfst_uZUE/s1600/Capture%2Bd’écran%2B2015-02-18%2Bà%2B15.19.43.png
But the problem is that the URLs are not recognized completely. Look at this:

https://lh3.googleusercontent.com/-vsKUj5I9MiA/VOSgtyS3yWI/AUw/64lgO4gYSdI/s1600/Capture%2Bd’écran%2B2015-02-18%2Bà%2B15.22.32.png

Is it caused by the double column representation in the PDF file?

https://lh4.googleusercontent.com/-c7n5-oMygRM/VOShm4hwnWI/AU4/CQNjTTctMnY/s1600/Capture%2Bd’écran%2B2015-02-18%2Bà%2B15.26.46.png
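
One way to probe the double-column question outside Elasticsearch is to look at what a line break inside a URL does to a simple URL pattern. The sketch below is only an illustration under assumptions: the sample text and the example.org URL are made up, and the grep pattern is a rough stand-in, not the rules the uax_url_email tokenizer actually applies. If the text extracted from the PDF wraps a URL across two lines, only the first fragment still looks like a URL:

```shell
# Hypothetical extracted text: the URL was wrapped after "docs/",
# as could happen in a narrow column.
extracted='See http://www.example.org/docs/
guide.html for details.'

# Rough URL pattern (an assumption): a scheme followed by non-whitespace.
# grep works line by line, so the part after the line break is lost.
found=$(printf '%s\n' "$extracted" | grep -oE 'https?://[^[:space:]]+')

echo "$found"
```

If the text extracted from the affected PDF shows the same kind of break, the truncation happens before the analyzer sees the text, and changing the tokenizer would probably not help.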

So, what did I do wrong? How can I fix this and use regexp queries 
successfully to extract all the URLs?
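
As far as regexp queries go, keep in mind that Elasticsearch matches the pattern against whole indexed terms, so the pattern has to cover the entire token and not just a part of it. The sketch below imitates that with grep -x on a made-up list of terms (one per line). The terms and the patterns are assumptions for illustration, and grep's syntax is not identical to the Lucene regexp syntax Elasticsearch uses:

```shell
# Hypothetical terms, one per line, as an analyzer might emit them.
terms='see
http://www.example.org/docs/guide.html
details'

# -x makes grep match whole lines only, imitating term-level anchoring.
whole=$(printf '%s\n' "$terms" | grep -xE 'https?://.*')

# A pattern that covers only part of a term matches nothing when anchored.
# grep -c prints 0 and exits non-zero here, hence the "|| true".
partial=$(printf '%s\n' "$terms" | grep -cxE 'example' || true)

echo "$whole"
echo "$partial"
```

So a pattern along the lines of https?://.* can only find URLs that were indexed as complete, single terms.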


Thank you
