Re: Tika trouble
What I can say is that if you want to index a PDF, you should use a PDF extractor, which extracts both the text content and the metadata of the file. I suppose you have simply opened the PDF and indexed it as is, so you stored the raw binary data and nothing more. For my application I used PdfExtractor, but the PDFBox project could also be used.

Antonio

2009/11/16 Markus Jelsma - Buyways B.V. mar...@buyways.nl

Does anyone have a clue?

List,

I somehow fail to index certain PDF files using the ExtractingRequestHandler in Solr 1.4 with the default solrconfig.xml but a modified schema. I have a very simple schema for this case, using only an ID field, a timestamp field and two dynamic fields, ignored_* and attr_*, both indexed, stored and multivalued strings. They are multivalued simply because some HTML files fail when storing multiple hyperlinks.

I have posted multiple files to http://.../update/extract?literal.id=doc1 including:

1. the whitepaper at http://www.lucidimagination.com/whitepaper/whats-new-in-lucene-2-9?sc=AP
2. the HTML file of the front page of http://nu.nl/
3. another PDF at http://www.google.nl/url?sa=tsource=webct=rescd=1ved=0CAcQFjAAurl=http%3A%2F%2Fcsl.stanford.edu%2F~christos%2Fpublications%2F2007.cmp_mapreduce.hpca.pdfrct=jq=2007.cmp_mapreduce.hpca.pdfei=PPz7SpiiOM6l4QbZjKjRAwusg=AFQjCNHs-olxbUQrGCXpNMHfcZvY8aMk8A

For each document I have the corresponding select/?q=*:* result:

1. No text? Should I see something?

<doc>
  <str name="id">doc1</str>
  <arr name="ignored_content_type">
    <str>application/octet-stream</str>
  </arr>
  <arr name="ignored_stream_content_type">
    <str>text/xml; charset=UTF-8; boundary=cf57b4ad644d</str>
  </arr>
  <arr name="ignored_stream_size">
    <str>491238</str>
  </arr>
  <arr name="ignored_text">
    <str/>
  </arr>
  <date name="timestamp">2009-11-12T12:17:23.016Z</date>
</doc>

2. Plenty of data, this seems to be OK

<doc>
  <str name="id">doc1</str>
  <arr name="ignored_content_type">
    <str>application/xhtml+xml</str>
  </arr>
  <arr name="ignored_links">
    <str>http://www.nu.nl/</str>
    <str>http://www.nu.nl/</str>
    <str>http://www.nu.nl/algemeen/</str>
    <str>http://www.nu.nl/economie/</str>
  </arr>
  <arr name="ignored_stream_content_type">
    <str>text/xml; charset=UTF-8; boundary=b6e44d087bdd</str>
  </arr>
  <arr name="ignored_stream_size">
    <str>36991</str>
  </arr>
  <arr name="ignored_text">
    <str>A LOT OF TEXT HERE</str>
  </arr>
  <date name="timestamp">2009-11-12T12:19:15.415Z</date>
</doc>

3. A lot of garbage

<doc>
  <str name="id">doc1</str>
  <arr name="ignored_content_encoding">
    <str>windows-1252</str>
  </arr>
  <arr name="ignored_content_language">
    <str>fr</str>
  </arr>
  <arr name="ignored_content_type">
    <str>text/plain</str>
  </arr>
  <arr name="ignored_language">
    <str>fr</str>
  </arr>
  <arr name="ignored_stream_content_type">
    <str>text/xml; charset=UTF-8; boundary=83df0fd4d358</str>
  </arr>
  <arr name="ignored_stream_size">
    <str>361458</str>
  </arr>
  <arr name="ignored_text">
    <str>A LOT OF GARBAGE HERE including (binary bytes) endstream endobj 137 0 obj/Type/Encoding/BaseEncoding/WinAnsiEncoding/Differences[1/W/o/r/d/C/u/n/t/M/a/i/x/l/S/g/c/h/K/m/e/s/R/v/I/P/A/H/L/space/p] endobj 138 0 obj/Type/FontDescriptor/FontFile2 136 0 R/FontBBox[0 -210 942 728]/FontName/WQHWKD+TTE31911E0t00/Flags 4/MissingWidth 750/StemV 141/CapHeight 728/Ascent 728/Descent -210/ItalicAngle 0 endobj 139 0 obj/Count 12/Kids[140 0 R 141 0 R]/Type/Pages endobj 140 0 obj/Count 6/Kids[147 0 R 1 0 R 4 0 R 7 0 R 22 0 R 25 0 R]/Type/Pages/Parent 139 0 R endobj 141 0 obj/Count 6/Kids[39 0 R 42 0 R 45 0 R 82 0 R 92 0 R 122 0 R]/Type/Pages/Parent</str>
  </arr>
  <date name="timestamp">2009-11-12T12:21:28.306Z</date>
</doc>

Any ideas? Why doesn't the first whitepaper produce any results, and why is the second PDF full of garbage? At least I'm happy that HTML works fine.

Regards,

- Markus Jelsma, Technical Architect
Buyways B.V.
Friesestraatweg 215c, 9743 AD Groningen
http://www.buyways.nl
Alg. 050-853 6600, Tel. 050-853 6620, Fax. 050-3118124, Mob. 06-5025 8350
KvK 01074105
In: http://www.linkedin.com/in/markus17

--
Antonio Calò
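In the responses quoted above, ignored_stream_content_type is text/xml for all three uploads and the second PDF was detected as text/plain, which suggests the handler may not have received a useful type hint for the PDFs. A minimal Python sketch of two checks one could try before posting: confirm the payload really starts like a PDF, and build an /update/extract URL that passes an explicit type. The host name and both helper functions are hypothetical; literal.id, uprefix and stream.type are ExtractingRequestHandler request parameters, and whether a type hint cures this particular case is an assumption.

```python
from urllib.parse import urlencode

PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the PDF header marker appears in the first 1024 bytes."""
    # Hypothetical helper: a quick sanity check, not a full PDF validation.
    return PDF_MAGIC in data[:1024]


def extract_url(base_url: str, doc_id: str, content_type: str) -> str:
    """Build an /update/extract URL with an explicit content type hint."""
    params = {
        "literal.id": doc_id,         # stored as the document's id field
        "uprefix": "ignored_",        # prefix for fields not in the schema
        "stream.type": content_type,  # MIME type hint handed to Tika
        "commit": "true",
    }
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)


if __name__ == "__main__":
    # The base URL below is an assumption for illustration only.
    print(extract_url("http://localhost:8983/solr", "doc1", "application/pdf"))
```

The file body would then be posted to that URL as the request payload; the sketch deliberately stops short of the network call so it can be run without a Solr instance.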
Highlighting exact phrases with solr
Hi guys,

I'm going crazy with highlighting in Solr. The problem is the following: when I submit an exact phrase query, I get the matching results and the related snippets with highlighting, but I've noticed that *the single terms of the phrase are highlighted too*. Here is an example: if I search for the phrase "quick brown fox", I get the correct result, the document which contains the phrase, but the snippets come back like this:

<lst name="highlighting">
  <lst name="14">
    <arr name="DocumentText">
      <str>The <em>quick brown fox</em> jump over the lazy dog. The <em>fox</em> is a nice animal.</str>
    </arr>
  </lst>
</lst>

Also, for some documents only single terms are highlighted instead of the exact phrase, even if the exact phrase is contained in the document, e.g.:

<lst name="highlighting">
  <lst name="14">
    <arr name="DocumentText">
      <str>The <em>fox</em> is a nice animal.</str>
    </arr>
  </lst>
</lst>

My understanding of highlighting is that if I search for an exact phrase, only the exact phrase should be highlighted. Here is an extract of my solrconfig.xml and schema.xml.

solrconfig.xml:

<highlighting>
  <!-- Configure the standard fragmenter -->
  <!-- This could most likely be commented out in the default case -->
  <fragmenter name="gap" class="org.apache.solr.highlight.GapFragmenter">
    <lst name="defaults">
      <int name="hl.fragsize">500</int>
    </lst>
  </fragmenter>
  <!-- A regular-expression-based fragmenter (f.i., for sentence extraction) -->
  <fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter" default="true">
    <lst name="defaults">
      <!-- slightly smaller fragsizes work better because of slop -->
      <int name="hl.fragsize">700</int>
      <!-- allow 50% slop on fragment sizes -->
      <float name="hl.regex.slop">0.5</float>
      <!-- a basic sentence pattern -->
      <str name="hl.regex.pattern">[-\w ,/\n\']{20,200}</str>
      <bool name="hl.usePhraseHighlighter">true</bool>
      <bool name="hl.highlightMultiTerm">true</bool>
    </lst>
  </fragmenter>
  <!-- Configure the standard formatter -->
  <formatter name="html" class="org.apache.solr.highlight.HtmlFormatter">
    <lst name="highlighting">
      <str name="hl.simple.pre"><![CDATA[<strong>]]></str>
      <str name="hl.simple.post"><![CDATA[</strong>]]></str>
    </lst>
  </formatter>

schema.xml:

  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stop_italiano.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stop_italiano.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>

Maybe I'm missing something, or my understanding of the highlighting feature is not correct. Any ideas?

As always, thanks for your support!

Regards,
Antonio
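In the configuration quoted above, hl.usePhraseHighlighter and hl.highlightMultiTerm sit inside the regex fragmenter's defaults. As far as I know these are highlighting parameters read from the request, so it may be worth sending them with the query itself (or placing them in the request handler defaults) to see whether the behaviour changes. A minimal Python sketch of such a query string follows; the helper name is hypothetical, the field name DocumentText is taken from the snippets above, and the phrase is assumed to contain no double quotes.

```python
from urllib.parse import urlencode


def phrase_highlight_query(phrase: str, field: str) -> str:
    """Build a select query string for an exact phrase with phrase-aware highlighting."""
    params = [
        ("q", f'{field}:"{phrase}"'),         # quoted, so it is parsed as a phrase query
        ("hl", "true"),                       # switch highlighting on
        ("hl.fl", field),                     # field to build snippets from
        ("hl.usePhraseHighlighter", "true"),  # highlight the phrase rather than each term
        ("hl.highlightMultiTerm", "true"),
    ]
    return urlencode(params)


if __name__ == "__main__":
    # Append the result to .../select? on the Solr instance being tested.
    print(phrase_highlight_query("quick brown fox", "DocumentText"))
```

Comparing the snippets returned with and without hl.usePhraseHighlighter on the request should show whether the setting in the fragmenter defaults was being picked up at all.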
Re: Solr Porting to .Net
Hi Mauricio,

thanks for your feedback. I suppose we will move to a mixed solution: Solr on Tomcat and a .Net client (maybe SolrNet). But Solr on IKVM could be interesting. If I have time I'll try it, and I'll let you know if it works.

Antonio

2009/9/30 Mauricio Scheffer mauricioschef...@gmail.com

Solr is a server that runs on Java and exposes an HTTP interface. SolrNet is a client library for .Net that connects to a Solr instance via its HTTP interface. My experiment (let's call it SolrIKVM) is an attempt to run Solr on .Net. Hope that clears things up.

On Wed, Sep 30, 2009 at 11:50 AM, Antonio Calò anton.c...@gmail.com wrote:

Hi guys,

thanks for your prompt feedback. So, you are saying that SolrNet is just a wrapper written in C# that connects to Solr (still written in Java and running on IKVM)? Is my understanding correct?

Regards
Antonio

2009/9/30 Mauricio Scheffer mauricioschef...@gmail.com

SolrNet is only an HTTP client to Solr. I've been experimenting with IKVM but wasn't very successful... There seem to be some issues with class loading, but unfortunately I don't have much time to continue these experiments right now. In case you're interested in continuing this, here's the repository: http://code.google.com/p/mausch/source/browse/trunk/SolrIKVM

Also, someone recently registered a project on Google Code with the same intentions, but there are no commits yet: http://code.google.com/p/solrwin/

Cheers,
Mauricio

On Wed, Sep 30, 2009 at 7:09 AM, Pravin Paratey prav...@gmail.com wrote:

You may want to check out http://code.google.com/p/solrnet/

2009/9/30 Antonio Calò anton.c...@gmail.com:

Hi all,

I'm wondering whether a Solr version for .Net is already available or whether it is still under development/planning. I've searched the Solr website but found only information on the Lucene.Net project.

Best regards
Antonio

--
Antonio Calò
Software Developer Engineer @ Intellisemantic
Mail anton.c...@gmail.com
Tel. 011-56.90.429
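Because Solr exposes an HTTP interface, a client in any language mostly has to do two things: send a request and read the response. A minimal Python sketch of the reading half, assuming the standard Solr XML response format (doc elements with str children carrying a name attribute); the function name and the sample response are made up for illustration, and a real client such as SolrNet does considerably more.

```python
import xml.etree.ElementTree as ET
from typing import List


def doc_ids(response_xml: str, id_field: str = "id") -> List[str]:
    """Collect the value of the id field from every <doc> in a Solr XML response."""
    root = ET.fromstring(response_xml)
    ids = []
    for doc in root.iter("doc"):
        for node in doc.findall("str"):
            if node.get("name") == id_field:
                ids.append(node.text or "")
    return ids


# Hypothetical response body, shaped like the output of a /select request.
SAMPLE = """<response>
  <result name="response" numFound="2" start="0">
    <doc><str name="id">doc1</str><str name="title">first</str></doc>
    <doc><str name="id">doc2</str></doc>
  </result>
</response>"""

if __name__ == "__main__":
    print(doc_ids(SAMPLE))
```

The same parsing logic carries over to a .Net client unchanged in spirit, which is why running Solr itself on Tomcat and talking to it over HTTP is a workable mixed setup.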
Solr Porting to .Net
Hi all,

I'm wondering whether a Solr version for .Net is already available or whether it is still under development/planning. I've searched the Solr website but found only information on the Lucene.Net project.

Best regards
Antonio

--
Antonio Calò
Software Developer Engineer @ Intellisemantic
Mail anton.c...@gmail.com
Tel. 011-56.90.429
Re: Solr Porting to .Net
Hi guys,

thanks for your prompt feedback. So, you are saying that SolrNet is just a wrapper written in C# that connects to Solr (still written in Java and running on IKVM)? Is my understanding correct?

Regards
Antonio

2009/9/30 Mauricio Scheffer mauricioschef...@gmail.com

SolrNet is only an HTTP client to Solr. I've been experimenting with IKVM but wasn't very successful... There seem to be some issues with class loading, but unfortunately I don't have much time to continue these experiments right now. In case you're interested in continuing this, here's the repository: http://code.google.com/p/mausch/source/browse/trunk/SolrIKVM

Also, someone recently registered a project on Google Code with the same intentions, but there are no commits yet: http://code.google.com/p/solrwin/

Cheers,
Mauricio

On Wed, Sep 30, 2009 at 7:09 AM, Pravin Paratey prav...@gmail.com wrote:

You may want to check out http://code.google.com/p/solrnet/

2009/9/30 Antonio Calò anton.c...@gmail.com:

Hi all,

I'm wondering whether a Solr version for .Net is already available or whether it is still under development/planning. I've searched the Solr website but found only information on the Lucene.Net project.

Best regards
Antonio

--
Antonio Calò
Software Developer Engineer @ Intellisemantic
Mail anton.c...@gmail.com
Tel. 011-56.90.429