Re: Tika trouble

2009-11-16 Thread Antonio Calò
What I can say is that if you want to index a PDF, you should use a PDF
extractor. A PDF extractor pulls out the text content and the metadata of
the file. I suppose you just opened the PDF and indexed it as is, so you
stored binary data and nothing more. For my application I used
PdfExtractor, but the PDFBox project could also be used.
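
As a rough check for this, a stored field can be scanned for PDF file syntax.
The sketch below is only a heuristic in Python; the function name and the
marker list are my own assumptions and are not part of Solr, Tika or PDFBox.

```python
import re

# Tokens that belong to PDF file syntax and should not appear in properly
# extracted text. The list is illustrative, not exhaustive.
PDF_SYNTAX_MARKERS = ("endstream", "endobj", "/FontDescriptor", "/Type/Pages")

# Matches PDF indirect object headers such as "138 0 obj".
OBJ_HEADER = re.compile(r"\b\d+ \d+ obj\b")


def looks_like_raw_pdf(text: str) -> bool:
    """Heuristic: True if the text still contains PDF object syntax."""
    if text.lstrip().startswith("%PDF-"):
        return True
    hits = sum(marker in text for marker in PDF_SYNTAX_MARKERS)
    return hits >= 2 or bool(OBJ_HEADER.search(text))
```

If this returns True for the indexed text field, the document was most likely
stored without any real text extraction.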

Antonio

2009/11/16 Markus Jelsma - Buyways B.V. mar...@buyways.nl

 Does anyone have a clue?



  List,
 
 
  I somehow fail to index certain PDF files using the
  ExtractingRequestHandler in Solr 1.4 with the default solrconfig.xml but a
  modified schema. I have a very simple schema for this case using only
  an ID field, a timestamp field and two dynamic fields, ignored_* and
  attr_*, both indexed, stored and multivalued strings. They are
  multivalued simply because some HTML files fail when storing multiple
  hyperlinks.
 
  I have posted multiple files to
  http://.../update/extract?literal.id=doc1 including:
  1. the whitepaper at
  http://www.lucidimagination.com/whitepaper/whats-new-in-lucene-2-9?sc=AP
  2. the html file of the frontpage of http://nu.nl/
  3. another pdf at
 
 http://www.google.nl/url?sa=t&source=web&ct=res&cd=1&ved=0CAcQFjAA&url=http%3A%2F%2Fcsl.stanford.edu%2F~christos%2Fpublications%2F2007.cmp_mapreduce.hpca.pdf&rct=j&q=2007.cmp_mapreduce.hpca.pdf&ei=PPz7SpiiOM6l4QbZjKjRAw&usg=AFQjCNHs-olxbUQrGCXpNMHfcZvY8aMk8A
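
For reference, an upload like the ones listed above can be sketched in Python
as follows. This is a minimal sketch under assumptions: the base URL is a
placeholder, the helper name is made up, and setting Content-Type from the
file name is a guess at a sensible default, not something confirmed in this
thread. Only the /update/extract path and the literal.id parameter come from
the message itself.

```python
import mimetypes
import urllib.parse
import urllib.request


def build_extract_request(base_url: str, filename: str, body: bytes,
                          doc_id: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST of one raw file to /update/extract."""
    params = urllib.parse.urlencode({"literal.id": doc_id, "commit": "true"})
    # Guess the MIME type from the file name; fall back to a generic type.
    content_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    return urllib.request.Request(
        f"{base_url}/update/extract?{params}",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen() would send it; the
sketch stops short of that so it can be inspected first.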
 
  For each document I have a corresponding select/?q=*:*:
 
 
  1. No text? Should I see something?
 
  <doc>
  <str name="id">doc1</str>
  <arr name="ignored_content_type">
  <str>application/octet-stream</str>
  </arr>
  <arr name="ignored_stream_content_type">
  <str>
  text/xml; charset=UTF-8;
  boundary=cf57b4ad644d
  </str>
  </arr>
  <arr name="ignored_stream_size">
  <str>491238</str>
  </arr>
  <arr name="ignored_text">
  <str></str>
  </arr>
  <date name="timestamp">2009-11-12T12:17:23.016Z</date>
  </doc>
 
 
  2. Plenty of data, this seems to be ok
 
  <doc>
  <str name="id">doc1</str>
  <arr name="ignored_content_type">
  <str>application/xhtml+xml</str>
  </arr>
  <arr name="ignored_links">
  <str>http://www.nu.nl/</str>
  <str>http://www.nu.nl/</str>
  <str>http://www.nu.nl/algemeen/</str>
  <str>http://www.nu.nl/economie/</str>
  
  <arr name="ignored_stream_content_type">
  <str>
  text/xml; charset=UTF-8;
  boundary=b6e44d087bdd
  </str>
  </arr>
  <arr name="ignored_stream_size">
  <str>36991</str>
  </arr>
  <arr name="ignored_text">
  <str>
  A LOT OF TEXT HERE
  </str>
  </arr>
  <date name="timestamp">2009-11-12T12:19:15.415Z</date>
  </doc>
 
 
  3. a lot of garbage
 
  <doc>
  <str name="id">doc1</str>
  <arr name="ignored_content_encoding">
  <str>windows-1252</str>
  </arr>
  <arr name="ignored_content_language">
  <str>fr</str>
  </arr>
  <arr name="ignored_content_type">
  <str>text/plain</str>
  </arr>
  <arr name="ignored_language">
  <str>fr</str>
  </arr>
  <arr name="ignored_stream_content_type">
  <str>
  text/xml; charset=UTF-8;
  boundary=83df0fd4d358
  </str>
  </arr>
  <arr name="ignored_stream_size">
  <str>361458</str>
  </arr>
  <arr name="ignored_text">
  <str>
  A LOT OF GARBAGE HERE including
 
  ió½·Þp™ó 4­0›
  š©xÓ ^ CøùI3람š³î¨V ÚÜ¡yS4 ¹£ ² ›H 6õɨ5¤ ÅÜ磩bädÒøŸ\ �s%OîÐÙIÑYRäŠ ;4
  ¢9r —!rEôˆÌ {SìûD²à £©ïœ«{‘ínÆ N÷ô¥F»�™ ±¡Ë'ú\³=·m„Þ »ý)³Å=j¶B¢)`  Ñ
  „Ï™hjCu{£É5{¢¯ç6½Ñhr¢ºÃ=J M- AqsøtÜì ÿ^Rl S?¿óšM‰—lv‘Ø›Qüãý´ þžŽ
  $S;¾¦wze³Ù)qÉú§ ‰› ãqó…Ó ‰ªU:šBÝ‘GuŠë
  MM±Òv �~ ‚N‹t¢ä§~Ì ÞŒS—Êòö¼ÊÄQaº¸¿7tñ ¾Áç œãØŒ58$O 3Å~�8¿L  ‡ëŽó©pk _
  Ša Â=u×; (ä�...@.œ÷ä ù° µk+ÿ PP~ ¨*ݤ¿Œ™¡D»   @fI$0°�Î Ù·p“Œ,Øâ  †¶v
  ¤v1#8¼0 ›  èð€-†šZ 6¾  ! ñb ˆbˆ¤v)LS)T X² ¬ l...@€  6E$Q
  endstream
  endobj
  137 0
 
 obj/Type/Encoding/BaseEncoding/WinAnsiEncoding/Differences[1/W/o/r/d/C/u/n/t/M/a/i/x/l/S/g/c/h/K/m/e/s/R/v/I/P/A/H/L/space/p]
  endobj
  138 0 obj/Type/FontDescriptor/FontFile2 136 0 R/FontBBox[0 -210 942
  728]/FontName/WQHWKD+TTE31911E0t00/Flags 4/MissingWidth 750/StemV
  141/CapHeight 728/Ascent 728/Descent -210/ItalicAngle 0
  endobj
  139 0 obj/Count 12/Kids[140 0 R 141 0 R]/Type/Pages
  endobj
  140 0 obj/Count 6/Kids[147 0 R 1 0 R 4 0 R 7 0 R 22 0 R 25 0
  R]/Type/Pages/Parent 139 0 R
  endobj
  141 0 obj/Count 6/Kids[39 0 R 42 0 R 45 0 R 82 0 R 92 0 R 122 0
  R]/Type/Pages/Parent
 
  
 
  </str>
  </arr>
  <date name="timestamp">2009-11-12T12:21:28.306Z</date>
  </doc>
 
 
  Any ideas? Why doesn't the whitepaper produce any results and why is the
  next whitepaper full of garbage? At least I'm happy that HTML works
  fine.
 
 
 
  Regards,
 
  -
  Markus Jelsma  Buyways B.V.
  Technisch ArchitectFriesestraatweg 215c
  http://www.buyways.nl  9743 AD Groningen
 
 
  Alg. 050-853 6600  KvK  01074105
  Tel. 050-853 6620  Fax. 050-3118124
  Mob. 06-5025 8350  In: http://www.linkedin.com/in/markus17
 




-- 
Antonio Calò

Re: HighLithing exact phrases with solr

2009-10-20 Thread Antonio Calò
 catenateAll="1"/>
 <filter class="solr.LowerCaseFilterFactory"/>
 <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
 </analyzer>
 </fieldtype>


 Maybe I'm missing something, or my understanding of the highlighting
 feature
 is not correct. Any Idea?

 As always, thanks for your support!

 Regards, Antonio







-- 
Antonio Calò
--
Software Developer Engineer
@ Intellisemantic
Mail anton.c...@gmail.com
Tel. 011-56.90.429
--


HighLithing exact phrases with solr

2009-10-05 Thread Antonio Calò
Hi guys,

I'm going crazy with highlighting in Solr. The problem is the following:
when I submit an exact phrase query, I get the matching results and the
related snippets with highlighting. But I've noticed that the *single terms
of the phrase are highlighted too*. Here is an example:

If I search for "quick brown fox", I get the correct result with the doc
which contains the phrase, but the snippets come back like this:

<lst name="highlighting">
  <lst name="14">
    <arr name="DocumentText">
      <str>
The <em>quick brown fox</em> jump over the lazy dog. The <em>fox</em> is a
nice animal.
      </str>
    </arr>
  </lst>
</lst>


Also, with some documents only single terms are highlighted instead of the
exact phrase, even though the exact phrase is contained in the document,
for example:
<lst name="highlighting">
  <lst name="14">
    <arr name="DocumentText">
      <str>
The <em>fox</em> is a nice animal.
      </str>
    </arr>
  </lst>
</lst>
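
To make the symptom easy to spot across many snippets, a small check like the
following could be run over each returned fragment. It is a sketch in Python
under assumptions: the function name is made up, and it assumes the whole
phrase arrives inside a single em element, as in the first sample above.

```python
import re

# Captures the text inside each <em>...</em> pair of a highlighted snippet.
EM_SPAN = re.compile(r"<em>(.*?)</em>", re.DOTALL)


def stray_highlights(snippet: str, phrase: str) -> list:
    """Return highlighted spans that are not the complete phrase."""
    wanted = phrase.casefold()
    return [span for span in EM_SPAN.findall(snippet)
            if span.casefold() != wanted]
```

An empty list means every highlighted span was the full phrase; anything else
lists the single terms that were highlighted on their own.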


My understanding of highlighting is that if I search for an exact phrase,
only the exact phrase should be highlighted.
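
One thing that could be tried is passing the phrase-highlighting switch
directly as a request parameter. The Python sketch below only builds such a
URL; the base URL and the helper name are assumptions, and whether this
changes the result for this setup is untested. The field name DocumentText
and the hl.usePhraseHighlighter parameter are taken from this message.

```python
from urllib.parse import urlencode


def phrase_highlight_url(base_url: str, phrase: str, field: str) -> str:
    """Build a /select URL for an exact phrase query with highlighting on."""
    params = {
        "q": f'{field}:"{phrase}"',         # quoted, so it is a phrase query
        "hl": "true",                       # turn highlighting on
        "hl.fl": field,                     # field to highlight
        "hl.usePhraseHighlighter": "true",  # ask for phrase-aware highlighting
    }
    return f"{base_url}/select?{urlencode(params)}"
```

Comparing the snippets from this request with the ones from the configured
defaults would show whether the parameter is being picked up at all.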

Here is an extract of my solrconfig.xml and schema.xml:

solrconfig.xml:

<highlighting>
   <!-- Configure the standard fragmenter -->
   <!-- This could most likely be commented out in the default case -->
   <fragmenter name="gap" class="org.apache.solr.highlight.GapFragmenter">
     <lst name="defaults">
       <int name="hl.fragsize">500</int>
     </lst>
   </fragmenter>

   <!-- A regular-expression-based fragmenter (f.i., for sentence
        extraction) -->
   <fragmenter name="regex"
               class="org.apache.solr.highlight.RegexFragmenter" default="true">
     <lst name="defaults">
       <!-- slightly smaller fragsizes work better because of slop -->
       <int name="hl.fragsize">700</int>
       <!-- allow 50% slop on fragment sizes -->
       <float name="hl.regex.slop">0.5</float>
       <!-- a basic sentence pattern -->
       <str name="hl.regex.pattern">[-\w ,/\n\']{20,200}</str>

       <bool name="hl.usePhraseHighlighter">true</bool>

       <bool name="hl.highlightMultiTerm">true</bool>
     </lst>
   </fragmenter>

   <!-- Configure the standard formatter -->
   <formatter name="html" class="org.apache.solr.highlight.HtmlFormatter">
     <lst name="highlighting">
       <str name="hl.simple.pre"><![CDATA[<strong>]]></str>
       <str name="hl.simple.post"><![CDATA[</strong>]]></str>
     </lst>
   </formatter>


schema.xml:

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stop_italiano.txt"/>
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1" catenateWords="1"
          catenateNumbers="1" catenateAll="0"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>


<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory"
          synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stop_italiano.txt"/>
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1" catenateWords="1"
          catenateNumbers="1" catenateAll="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldtype>


Maybe I'm missing something, or my understanding of the highlighting feature
is not correct. Any Idea?

As always, thanks for your support!

Regards, Antonio


Re: Solr Porting to .Net

2009-10-05 Thread Antonio Calò
Hi Mauricio, thanks for your feedback.

I suppose we will move to a mixed solution: Solr on Tomcat and a .Net client
(maybe SolrNet).
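
To illustrate the client half of such a setup: because Solr is reached over
HTTP, a client mainly builds request URLs and parses the responses. The
sketch below is in Python for brevity and says nothing about how SolrNet is
implemented; the base URL and the function names are assumptions.

```python
import json
from urllib.parse import urlencode


def select_url(base_url: str, query: str, rows: int = 10) -> str:
    """Build a /select URL that asks Solr for a JSON response."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url}/select?{params}"


def docs_from_response(body: str) -> list:
    """Pull the list of matching documents out of a JSON response body."""
    return json.loads(body)["response"]["docs"]
```

Fetching the URL with any HTTP library and passing the body to
docs_from_response() gives the matching documents as plain dictionaries.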

But Solr on IKVM could be interesting. If I have time I'll try it and let
you know if it works.

Antonio

2009/9/30 Mauricio Scheffer mauricioschef...@gmail.com

 Solr is a server that runs on Java and exposes an HTTP interface. SolrNet
 is a client library for .Net that connects to a Solr instance via its HTTP
 interface.
 My experiment (let's call it SolrIKVM) is an attempt to run Solr on .Net.

 Hope that clears things up.

 On Wed, Sep 30, 2009 at 11:50 AM, Antonio Calò anton.c...@gmail.com
 wrote:

  Hi guys, thanks for your prompt feedback.
 
 
  So, you are saying that SolrNet is just a wrapper written in C# that
  connects to Solr (still written in Java, running on IKVM)?
 
  Is my understanding correct?
 
  Regards
 
  Antonio
 
  2009/9/30 Mauricio Scheffer mauricioschef...@gmail.com
 
   SolrNet is only a http client to Solr.
   I've been experimenting with IKVM but wasn't very successful... There seem
   to be some issues with class loading, but unfortunately I don't have much
   time to continue these experiments right now. In case you're interested in
   continuing this, here's the repository:
   http://code.google.com/p/mausch/source/browse/trunk/SolrIKVM
  
   Also recently someone registered a project on google code with the same
   intentions, but no commits yet: http://code.google.com/p/solrwin/
  
   Cheers,
   Mauricio
  
   On Wed, Sep 30, 2009 at 7:09 AM, Pravin Paratey prav...@gmail.com wrote:
  
    You may want to check out - http://code.google.com/p/solrnet/
   
    2009/9/30 Antonio Calò anton.c...@gmail.com:
     Hi All
    
     I'm wondering if is already available a Solr version for .Net or if it is
     still under development/planning. I've searched on Solr website but I've
     found only info on Lucene .Net project.
    
     Best Regards
    
     Antonio


   
  
 
 
 
 




-- 
Antonio Calò


Solr Porting to .Net

2009-09-30 Thread Antonio Calò
Hi All

I'm wondering whether a Solr version for .Net is already available or
whether it is still under development/planning. I searched the Solr website
but found only info on the Lucene.Net project.

Best Regards

Antonio

-- 
Antonio Calò


Re: Solr Porting to .Net

2009-09-30 Thread Antonio Calò
Hi guys, thanks for your prompt feedback.


So, you are saying that SolrNet is just a wrapper written in C# that
connects to Solr (still written in Java, running on IKVM)?

Is my understanding correct?

Regards

Antonio

2009/9/30 Mauricio Scheffer mauricioschef...@gmail.com

 SolrNet is only a http client to Solr.
 I've been experimenting with IKVM but wasn't very successful... There seem
 to be some issues with class loading, but unfortunately I don't have much
 time to continue these experiments right now. In case you're interested in
 continuing this, here's the repository:
 http://code.google.com/p/mausch/source/browse/trunk/SolrIKVM

 Also recently someone registered a project on google code with the same
 intentions, but no commits yet: http://code.google.com/p/solrwin/

 Cheers,
 Mauricio

 On Wed, Sep 30, 2009 at 7:09 AM, Pravin Paratey prav...@gmail.com wrote:

  You may want to check out - http://code.google.com/p/solrnet/
 
  2009/9/30 Antonio Calò anton.c...@gmail.com:
   Hi All
  
   I'm wondering if is already available a Solr version for .Net or if it is
   still under development/planning. I've searched on Solr website but I've
   found only info on Lucene .Net project.
  
   Best Regards
  
   Antonio
  
  
 




-- 
Antonio Calò