Re: Solr on Google App Engine

2010-12-13 Thread Praveen Agrawal
Thanks a lot, Mauricio.

Does anyone have any experience on Amazon EC2, or can point me to existing
discussions?

Appreciate your help.
Thanks.
Praveen

On Thu, Dec 9, 2010 at 6:20 PM, Mauricio Scheffer 
mauricioschef...@gmail.com wrote:

 Solr on GAE has been discussed a couple of times, see these threads:

 http://www.mail-archive.com/java-user@lucene.apache.org/msg26010.html
 http://www.mail-archive.com/solr-user@lucene.apache.org/msg24473.html
 http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3co2w952e01251005032245r79d6bfd6zbe08ece212c82...@mail.gmail.com%3e

 --
 Mauricio



 On Thu, Dec 9, 2010 at 9:07 AM, Praveen Agrawal pkal...@gmail.com wrote:

  Hi,
  I was wondering if Solr can be deployed/run on Google App Engine. GAE has
  some restrictions, notably no local file write access is allowed, instead
  applications must use JDO/JPA etc.
 
  I believe Solr can be deployed/run on Amazon EC2.
 
  Has anyone tried Solr on these two hosts?
 
  Thanks.
  Praveen
 



Re: Solr on Google App Engine

2010-12-13 Thread Praveen Agrawal
Thanks Dave..

On Mon, Dec 13, 2010 at 4:06 PM, Dave Searle dave.sea...@magicalia.com wrote:

 EC2 instances are just Windows/Linux machines, so this would be a
 normal setup. I have a Solr server running on a small instance with 1.7 GB
 of RAM and a 50 GB EBS volume mounted; it seems to run fine. It costs about
 $115 a month.




Solr on Google App Engine

2010-12-09 Thread Praveen Agrawal
Hi,
I was wondering if Solr can be deployed/run on Google App Engine. GAE has
some restrictions, notably no local file write access is allowed, instead
applications must use JDO/JPA etc.

I believe Solr can be deployed/run on Amazon EC2.

Has anyone tried Solr on these two hosts?

Thanks.
Praveen


Re: Example of using stream.file to post a binary file to solr

2010-05-07 Thread Praveen Agrawal
Sandhya,
Chris's link (with the anchor name) goes directly to the SolrJ example.


On Fri, May 7, 2010 at 8:15 PM, Sandhya Agarwal sagar...@opentext.com wrote:

 Yes, I did. But, I don't find a solrj example there. The example in
 the doc uses curl.

 - Sent from iPhone

 On 07-May-2010, at 8:12 PM, Chris Hostetter
 hossman_luc...@fucit.org wrote:

  : Sorry. That is what I meant. But, I put it wrongly. I have not been
  : able to find examples of using solrj, for this.
 
  did you look at the link i included?
 
  :  To POST a raw stream using SolrJ you need to use the
  :  ContentStreamUpdateRequest...
  : 
  : 
 http://wiki.apache.org/solr/ExtractingRequestHandler#Sending_documents_to_Solr
 
 
  -Hoss
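
For readers who want to script the request instead of using curl or SolrJ, here is a minimal Python sketch that only builds an `/update/extract` URL using the `stream.file` parameter from the subject line. The host, port, file path and `literal.id` value are placeholder assumptions, not values from this thread, and `stream.file` only works if remote streaming is enabled in solrconfig.xml.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, file_path, doc_id, commit=True):
    """Build a Solr Cell (ExtractingRequestHandler) URL that asks Solr to read
    a file from the Solr server's own file system via stream.file.
    Nothing is sent over the network; only the URL string is returned."""
    params = [
        ("stream.file", file_path),  # path as seen by the Solr server
        ("literal.id", doc_id),      # literal field value stored with the document
    ]
    if commit:
        params.append(("commit", "true"))
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)


# Placeholder values, chosen only for illustration.
url = build_extract_url("http://localhost:8983/solr", "/tmp/sample.pdf", "doc1")
print(url)
```

The resulting URL could then be requested with any HTTP client; the SolrJ route Hoss mentions (ContentStreamUpdateRequest) is the one documented on the wiki page linked above.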
 



Re: Problem with pdf, upgrading Cell

2010-05-05 Thread Praveen Agrawal
Marc and Sandhya,
Did you use Solr from trunk?
I used the Solr 1.4 distribution, and even after copying all the jars I still get the
same results for the PDFs I posted here.
Thanks.

On Wed, May 5, 2010 at 1:09 PM, Marc Ghorayeb dekay...@hotmail.com wrote:


 Hey,
 I have the same list, and I added the extraction library (the Apache Solr
 Cell jar) to it, though you might not need it specifically inside the WAR file.
 Marc
  From: sagar...@opentext.com
  To: solr-user@lucene.apache.org
  Date: Wed, 5 May 2010 10:21:36 +0530
  Subject: RE: Problem with pdf, upgrading Cell
 
  Looks like the highlighting may not work here. Following is the list of
 jars I copied :
 
  asm-3.1.jar
  bcmail-jdk15-1.45.jar
  bcprov-jdk15-1.45.jar
  commons-compress-1.0.jar
  commons-logging-1.1.1.jar
  dom4j-1.6.1.jar
  fontbox-1.1.0.jar
  geronimo-stax-api_1.0_spec-1.0.1.jar
  jempbox-1.1.0.jar
  log4j-1.2.14.jar
  metadata-extractor-2.4.0-beta-1.jar
  pdfbox-1.1.0.jar
  poi-3.6.jar
  poi-ooxml-3.6.jar
  poi-ooxml-schemas-3.6.jar
  poi-scratchpad-3.6.jar
  tagsoup-1.2.jar
  tika-core-0.7.jar
  tika-parsers-0.7.jar
  xml-apis-1.0.b2.jar
  xmlbeans-2.3.0.jar
 
  Thanks,
  Sandhya
 
 
 
  -Original Message-
  From: Sandhya Agarwal [mailto:sagar...@opentext.com]
  Sent: Wednesday, May 05, 2010 10:06 AM
  To: solr-user@lucene.apache.org
  Subject: RE: Problem with pdf, upgrading Cell
 
  Praveen,
 
  I only have the highlighted jars copied. Not sure if we need the other
  jars. Also, I copied the jars directly into solr\WEB-INF\lib, like you did.
 
  Thanks,
  Sandhya
 
 
 

Re: Problem with pdf, upgrading Cell

2010-05-05 Thread Praveen Agrawal
' to
 classloader
 
  May 4, 2010 12:50:16 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/dist/apache-solr-cell-1.4.0.jar' to classloader
  May 4, 2010 12:50:20 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/dist/apache-solr-clustering-1.4.0.jar' to classloader
  May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/carrot2-mini-3.1.0.jar' to classloader
  May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/commons-lang-2.4.jar' to classloader
  May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/ehcache-1.6.2.jar' to classloader
  May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/google-collections-1.0-rc2.jar' to classloader
  May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/jackson-core-asl-0.9.9-6.jar' to classloader
  May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/jackson-mapper-asl-0.9.9-6.jar' to classloader
  May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/log4j-1.2.14.jar' to classloader
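
A startup log like the one above can be scanned to see which jars the Solr resource loader actually picked up. The following is a minimal Python sketch, assuming only the "INFO: Adding '...' to classloader" line format visible in this log; the function name and the sample text are illustrative, and the sample is deliberately wrapped the way mail archives tend to wrap long lines.

```python
import re

# Matches the "INFO: Adding 'file:...jar' to classloader" entries seen in the log.
ADDING = re.compile(r"INFO: Adding '(?P<url>file:[^']+)' to classloader")


def loaded_jars(log_text):
    """Return the jar file names that SolrResourceLoader reports adding."""
    # Mail archives wrap long log lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    return [m.group("url").rsplit("/", 1)[-1] for m in ADDING.finditer(flat)]


sample = """
May 4, 2010 12:50:16 PM org.apache.solr.core.SolrResourceLoader
 replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/dist/apache-solr-cell-1.4.0.jar'
 to classloader
INFO: Adding
 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/log4j-1.2.14.jar' to
 classloader
"""

jars = loaded_jars(sample)
has_tika = any(name.startswith("tika-") for name in jars)
print(jars, has_tika)
```

Running it over a full log and checking for `tika-core` and `tika-parsers` entries is one quick way to confirm whether the Tika jars are being loaded from contrib/extraction/lib.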
 
 
 
  Thanks,
 
  Sandhya
 
 
 
  -Original Message-
  From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant
 Ingersoll
  Sent: Tuesday, May 04, 2010 6:13 AM
  Cc: solr-user@lucene.apache.org
  Subject: Re: Problem with pdf, upgrading Cell
 
 
 
  Little more info... Seems to be a classloading issue.  The tests pass,
 but they aren't loading the Tika libraries via the Solr ResourceLoader,
 whereas the example is.  Marc, one thing to try is to unjar the Solr WAR
 file and put the Tika libs in there, as I bet it will then work.  Note,
 however, I haven't tried this.
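
Grant's suggestion of putting the Tika libraries inside the WAR can be scripted, since a WAR file is an ordinary zip archive. This is a minimal Python sketch under that assumption; the function name, the temporary paths and the placeholder jar are all hypothetical, and, like the suggestion itself, it has not been tried against a real Solr deployment.

```python
import os
import tempfile
import zipfile


def add_jars_to_war(war_path, jar_paths):
    """Append jar files under WEB-INF/lib inside an existing WAR (zip) file.
    Returns the archive names that were added; jars already present are skipped."""
    added = []
    with zipfile.ZipFile(war_path, "a") as war:
        existing = set(war.namelist())
        for jar in jar_paths:
            arcname = "WEB-INF/lib/" + os.path.basename(jar)
            if arcname in existing:
                continue  # avoid writing a second entry for the same jar
            war.write(jar, arcname)
            added.append(arcname)
    return added


# Demo on throwaway files: a tiny stand-in WAR and a placeholder "jar".
with tempfile.TemporaryDirectory() as tmp:
    war_path = os.path.join(tmp, "solr.war")
    jar_path = os.path.join(tmp, "tika-core-0.7.jar")
    with zipfile.ZipFile(war_path, "w") as war:
        war.writestr("WEB-INF/web.xml", "<web-app/>")
    with open(jar_path, "wb") as fh:
        fh.write(b"placeholder bytes, not a real jar")
    added = add_jars_to_war(war_path, [jar_path])
    with zipfile.ZipFile(war_path) as war:
        names = war.namelist()

print(added)
```

Work on a copy of the WAR; appending does not remove older versions of a library that may already sit in WEB-INF/lib under a different file name.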
 
 
 
  On May 3, 2010, at 6:24 PM, Grant Ingersoll wrote:
 
 
 
  I've opened https://issues.apache.org/jira/browse/SOLR-1902 to track
 this.  It is indeed a bug somewhere (still investigating).  It seems that
 Tika is now picking an EmptyParser implementation when trying to determine
 which parser to use, despite the fact that it properly identifies the MIME
 Type.
 
 
 
  -Grant
 
 
 
  On May 3, 2010, at 5:36 PM, Grant Ingersoll wrote:
 
 
 
  I'm investigating.
 
 
 
  On May 3, 2010, at 5:17 AM, Marc Ghorayeb wrote:
 
 
 
 
 
  Hi,
 
  Grant, i confirm what Praveen has said, any PDF i try does not work
 with the new Tika and SVN versions. :(
 
  Marc
 
 
 
  From: sagar...@opentext.com
 
  To: solr-user@lucene.apache.org
 
  Date: Mon, 3 May 2010 13:05:24 +0530
 
  Subject: RE: Problem with pdf, upgrading Cell
 
 
 
  Hello,
 
 
 
  Please let me know if anybody figured out a way out of this issue.
 
 
 
  Thanks,
 
  Sandhya
 
 
 
  -Original Message-
 
  From: Praveen Agrawal [mailto:pkal...@gmail.com]
 
  Sent: Friday, April 30, 2010 11:14 PM
 
  To: solr-user@lucene.apache.org
 
  Subject: Re: Problem with pdf, upgrading Cell
 
 
 
  Grant,
  You can try any of the sample PDFs that come in the /docs folder of the Solr 1.4
  distribution. I had tried 'Installing Solr in Tomcat.pdf', 'index.pdf' etc. Only
  metadata, i.e. stream_size and content_type, apart from my own literals is
  indexed, and the content is missing.
 
 
 
 
 
  On Fri, Apr 30, 2010 at 8:52 PM, Grant Ingersoll gsing...@apache.org wrote:
 
 
 
  Praveen and Marc,
 
 
 
  Can you share the PDF (feel free to email my private email) that
 fails in
 
  Solr?
 
 
 
  Thanks,
 
  Grant
 
 
 
 
 
  On Apr 30, 2010, at 7:55 AM, Marc Ghorayeb wrote:
 
 
 
 
 
  Hi,
  Nope, I didn't get it to work. Just like you, the command-line version of Tika
  extracts the content correctly, but once it is included in Solr, no content
  is extracted.
  What I have tried so far:
  - Updating the Tika libraries inside the public Solr 1.4 release: no luck there.
  - Downloading the latest SVN version, compiling it, and starting from a simple
    schema: still no luck.
  - Getting other versions compiled on Hudson (nightly builds) and testing them
    too: still no extraction.
  I sent a mail to the developers' mailing list, but they told me I should
  just mail here. I hope some developer reads this, because it is quite an
  important feature of Solr and somehow it got broken between the 1.4 release
  and the latest version in SVN.
  Marc

Re: Problem with pdf, upgrading Cell

2010-05-04 Thread Praveen Agrawal
 org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/tagsoup-1.2.jar' to classloader
  May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/tika-core-0.7.jar' to classloader
  May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/tika-parsers-0.7.jar' to classloader
  May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/xercesImpl-2.8.1.jar' to classloader
  May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/xml-apis-1.0.b2.jar' to classloader
  May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/xmlbeans-2.3.0.jar' to classloader
 

Re: Problem with pdf, upgrading Cell

2010-05-04 Thread Praveen Agrawal
I seem to have mixed results.

Here is what I did:
I copied the new Tika/POI/JempBox/PDFBox/FontBox/log4j jars etc. into
contrib/extraction/lib (removing the old ones, of course), as well as into
WEB-INF/lib of the Solr web app in Tomcat.

Now it extracts content from some PDFs, but from others either no content
or only a single line. For example, /docs/Installing Solr in Tomcat.pdf
still shows no content. I have two other PDFs for which it extracts only one
line of content.

Also, I'm now getting a single value in the 'title' field for some PDFs and two
values for others. Where it can extract the full content, it shows as title
what I gave as a literal while submitting the PDF. For a PDF where no content was
extracted, it shows one empty title and mine. For a PDF where it extracted
only one line of content, it shows that line as a title as well as mine.
The 'title' field is defined as multivalued in the schema.

Any idea what's going on? Or am I missing something?



On Tue, May 4, 2010 at 4:13 PM, Marc Ghorayeb dekay...@hotmail.com wrote:


 Hey,
 I got it to work. I just redid my steps, i had forgotten several libraries
 that were imported through the xml. PDF extraction seems to work once again,
 i have yet to find one that raises an exception!

 Thanks for the investigation, at least we now have a fix :)
 Marc



Re: Problem with pdf, upgrading Cell

2010-05-04 Thread Praveen Agrawal
Yes Sandhya,
I copied the new POI/JempBox/PDFBox/FontBox etc. jars too. I believe this is what
you were asking.
Thanks.


On Tue, May 4, 2010 at 5:01 PM, Sandhya Agarwal sagar...@opentext.com wrote:

 Praveen,

 Along with the Tika core and parser jars, did you run mvn
 dependency:copy-dependencies to generate all the dependencies too?

 Thanks,
 Sandhya




Re: Problem with pdf, upgrading Cell

2010-05-04 Thread Praveen Agrawal
This email contained a .zip file attachment. Raytheon does not allow email 
attachments that are considered likely to contain malicious code. For your 
protection this attachment has been removed.

If this email is from an unknown source, please simply delete this email.

If this email was expected, and it is from a known sender, you may follow the 
below suggested instructions to obtain these types of attachments.

+ Instruct the sender to enclose the file(s) in a .zip compressed file, and 
rename the .zip compressed file with a different extension, such as, 
.rtnzip.  Password protecting the renamed .zip compressed file adds an 
additional layer of protection. When you receive the file, please rename it 
with the extension .zip.

Additional instructions and options on how to receive these attachments can be 
found at:

http://security.it.ray.com/antivirus/extensions.html
http://security.it.ray.com/news/2007/zipfiles.html

Should you have any questions or difficulty with these instructions, please 
contact the Help Desk at 877.844.4712

---

It bounced because of attachment's size..
attaching one by one now..


On Tue, May 4, 2010 at 5:17 PM, Praveen Agrawal pkal...@gmail.com wrote:

 I noticed the following relationship between producer/creator and content
 extraction. Not sure if it is helpful (as Grant said earlier, PDFs are notorious):

 Producer: Bullzip PDF Printer / www.bullzip.com / Freeware Edition (not registered)
 Creator: PScript5.dll Version 5.2.2
 Extraction: no content (Installing Solr in Tomcat.pdf, attached, which I generated)
 --

 Producer: Acrobat Distiller 7.0.5 (Windows)
 Creator: PScript5.dll Version 5.2.2
 Extraction: one line of content
 --

 Producer: Acrobat Distiller 8.1.0 (Windows)
 Creator: Acrobat PDFMaker 8.1 for Word
 Extraction: one line of content (Free_Two_way_Radio_Guide.pdf, attached, which
 was freely available on their website)
 --

 Producer: FOP 0.20.5
 Extraction: full content (/docs/features.pdf, linkmap.pdf, etc.)
 --
 Thanks.
 Praveen








Re: Problem with pdf, upgrading Cell

2010-05-04 Thread Praveen Agrawal
another one here..







Re: Problem with pdf, upgrading Cell

2010-05-04 Thread Praveen Agrawal
Hi Sandhya,
I must be missing something. I copied all the dependency jars to both the
contrib/extraction/lib and WEB-INF/lib folders. Here is the list of jars
copied:

asm-3.1.jar
bcmail-jdk15-1.45.jar
bcprov-jdk15-1.45.jar
commons-compress-1.0.jar
commons-logging-1.1.1.jar
dom4j-1.6.1.jar
fontbox-1.1.0.jar
geronimo-stax-api_1.0_spec-1.0.1.jar
hamcrest-core-1.1.jar
jempbox-1.1.0.jar
junit-3.8.1.jar
log4j-1.2.14.jar
metadata-extractor-2.4.0-beta-1.jar
mockito-core-1.7.jar
nekohtml-1.9.9.jar
objenesis-1.0.jar
ooxml-schemas-1.0.jar
pdfbox-1.1.0.jar
poi-3.6.jar
poi-ooxml-3.6.jar
poi-ooxml-schemas-3.6.jar
poi-scratchpad-3.6.jar
tagsoup-1.2.jar
tika-core-0.7.jar
tika-parsers-0.7.jar
xml-apis-1.0.b2.jar
xmlbeans-2.3.0.jar

Still the same result for me.

Marc,
I'm on Windows, and I copied the above jars directly into the already extracted
folder webapps/solr/WEB-INF/lib, in addition to what was already there. I
didn't explicitly un-jar and re-jar solr.war; do you think that could be the
issue? I think Tomcat extracts the war and uses the folder in webapps (I
didn't put the war file in webapps, but put the extracted solr folder there
directly).
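
Since the new jars were copied in addition to what was already in the lib
folder, one thing worth ruling out is two versions of the same library sitting
side by side. The following is only a sketch in Python, not something from the
thread: the function name and the "name-version.jar" naming pattern are
assumptions, and it does not claim that a version clash is the actual cause.

```python
import re
from collections import defaultdict

# Assumed naming scheme: name-version.jar, where the version starts at the
# first "-<digit>" (e.g. tika-core-0.7.jar, metadata-extractor-2.4.0-beta-1.jar).
JAR_PATTERN = re.compile(r"^(?P<name>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def find_version_conflicts(jar_names):
    """Return {artifact name: sorted versions} for artifacts present in more than one version."""
    versions = defaultdict(set)
    for jar in jar_names:
        match = JAR_PATTERN.match(jar)
        if match:  # jars without a recognizable version suffix are ignored
            versions[match.group("name")].add(match.group("version"))
    return {name: sorted(found) for name, found in versions.items() if len(found) > 1}
```

It could be called with the directory listing of the lib folder, for example
find_version_conflicts(os.listdir("webapps/solr/WEB-INF/lib")) after importing
os; an empty result means no artifact appears in two versions.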

If it has worked for you guys, especially with my two PDFs, then that's
really great. Please let me know your exact procedure, including everything
you copied and where, or whether you see something obvious that I missed.

Thanks,
Praveen


On Tue, May 4, 2010 at 5:28 PM, Sandhya Agarwal sagar...@opentext.com wrote:

 Both the files work for me, Praveen.

 Thanks,
 Sandhya

 From: Praveen Agrawal [mailto:pkal...@gmail.com]
 Sent: Tuesday, May 04, 2010 5:22 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Problem with pdf, upgrading Cell

 another one here..
 On Tue, May 4, 2010 at 5:20 PM, Praveen Agrawal pkal...@gmail.com wrote:
 It bounced because of the attachments' size.
 Attaching them one by one now.


 On Tue, May 4, 2010 at 5:17 PM, Praveen Agrawal pkal...@gmail.com wrote:
 I noticed the following relationship between the producer/creator and content
 extraction; not sure if it is helpful (as Grant said earlier, PDFs are
 notorious):

 Producer: Bullzip PDF Printer / www.bullzip.com / Freeware Edition (not
 registered)
 Creator: PScript5.dll Version 5.2.2
 Extraction: no content
 File: Installing Solr in Tomcat.pdf (attached; I generated it)
 --

 Producer: Acrobat Distiller 7.0.5 (Windows)
 Creator: PScript5.dll Version 5.2.2
 Extraction: one line of content
 --

 Producer: Acrobat Distiller 8.1.0 (Windows)
 Creator: Acrobat PDFMaker 8.1 for Word
 Extraction: one line of content
 File: Free_Two_way_Radio_Guide.pdf (attached; it was freely available on
 their website)
 --

 Producer: FOP 0.20.5
 Extraction: full content
 Files: /docs/features.pdf, linkmap.pdf etc.
 --
 Thanks.
 Praveen
 Praveen


 On Tue, May 4, 2010 at 5:05 PM, Praveen Agrawal pkal...@gmail.com wrote:
 Yes Sandhya,
 I copied the new poi/jempbox/pdfbox/fontbox etc. jars too. I believe this is
 what you were asking.
 Thanks.


 On Tue, May 4, 2010 at 5:01 PM, Sandhya Agarwal sagar...@opentext.com wrote:
 Praveen,

 Along with the Tika core and parser jars, did you run mvn
 dependency:copy-dependencies to generate all the dependencies too?

 Thanks,
 Sandhya

 -Original Message-
 From: Praveen Agrawal [mailto:pkal...@gmail.com]
 Sent: Tuesday, May 04, 2010 4:52 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Problem with pdf, upgrading Cell

 I seem to have mixed results.

 Here is what I did:
 copied the new Tika/poi/jempbox/pdfbox/fontbox/log4j jars etc. into
 contrib/extraction/lib (of course removing the old ones), as well as into
 WEB-INF/lib of the Solr web app in Tomcat.

 Now it extracts content from some PDFs, but from others it extracts either
 no content or only one line. For example, /docs/Installing Solr in Tomcat.pdf
 still shows no content. I have two other PDFs for which it extracts only one
 line of content.

 Also, I'm now getting a single value in the 'title' field for some PDFs, and
 two for others. Where it can extract the full content, it shows the title
 that I gave as a literal while submitting the PDF. For PDFs where no content
 was extracted, it shows one empty title plus mine. For PDFs where it
 extracted only one line of content, it shows that line as a title, plus mine.
 The 'title' field is defined as multivalued in the schema.

 Any idea what's going on? Or am I missing something?



 On Tue, May 4, 2010 at 4:13 PM, Marc Ghorayeb dekay...@hotmail.com wrote:

  Hey,
  I got it to work. I just redid my steps; I had forgotten several libraries
  that were imported through the XML. PDF extraction seems to work once again,
  and I have yet to find one that raises an exception!
 
  Thanks for the investigation, at least we now have a fix :)
  Marc

Any way to get top 'n' queries searched from Solr?

2010-04-30 Thread Praveen Agrawal
Hi,
I need to know the top 'n' (say 100) most frequently searched queries that
users tried, along with their frequencies. Does Solr keep this information and
can it return it, or else what options do I have here?
Thanks,
Praveen


Re: Any way to get top 'n' queries searched from Solr?

2010-04-30 Thread Praveen Agrawal
Thanks Mitch..
I have an application fronting Solr for updating/searching etc., and I'll
make use of that to store this info.

Thanks to all for suggestions.


On Fri, Apr 30, 2010 at 3:43 PM, MitchK mitc...@web.de wrote:


 The simplest way is to send the query string to your Solr client *and* to
 your custom query-fetcher, which could be any database you like. Doing so,
 you can count how often each query was sent, etc.
 *And* you can make them searchable by exporting those datasets to another
 Solr core.
 Why an extra DB?
 Because if a crash occurs, Solr gives you no guarantees. Keep
 in mind that Solr is only an index/search server, not a real database.

 This is pretty much the easiest way to implement such a feature, I think.

 Good luck.
 - Mitch
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Any-way-to-get-top-n-queries-searched-from-Solr-tp767165p767489.html
 Sent from the Solr - User mailing list archive at Nabble.com.
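
The approach Mitch describes (log every query in a separate database and count
it there) could be sketched roughly as follows. This is a minimal Python
illustration using SQLite, not anything Solr provides: the table name, the
function names, and the choice to lower-case and collapse whitespace before
counting are all assumptions made for the example.

```python
import sqlite3


def open_query_log(path=":memory:"):
    """Open (or create) the query-log database; the table name is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS query_log ("
        "query TEXT PRIMARY KEY, hits INTEGER NOT NULL)"
    )
    return conn


def record_query(conn, query):
    """Count one search; call this wherever the application forwards a query to Solr."""
    # Assumption: counts should ignore case and extra whitespace.
    normalized = " ".join(query.lower().split())
    conn.execute(
        "INSERT OR IGNORE INTO query_log (query, hits) VALUES (?, 0)", (normalized,)
    )
    conn.execute("UPDATE query_log SET hits = hits + 1 WHERE query = ?", (normalized,))
    conn.commit()


def top_queries(conn, n=100):
    """Return the n most frequent queries as (query, hits) pairs, ties in alphabetical order."""
    rows = conn.execute(
        "SELECT query, hits FROM query_log ORDER BY hits DESC, query ASC LIMIT ?", (n,)
    )
    return rows.fetchall()
```

The fronting application would call record_query() next to each search it
sends to Solr, and top_queries(conn, 100) whenever the report is needed. Using
a file path instead of ":memory:" keeps the counts across restarts.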



Re: Problem with pdf, upgrading Cell

2010-04-30 Thread Praveen Agrawal
I did try the standalone version of Tika 0.7, and it extracted the PDF content
successfully. Then I replaced the Tika-related jars in contrib/extraction/lib of
the Solr 1.4 distribution with their newer versions, and now it doesn't extract
content from ANY PDF.
Earlier (with 0.4) it was throwing an exception for a few PDFs, but now there is
neither content nor an exception.


On Fri, Apr 30, 2010 at 4:14 PM, Grant Ingersoll gsing...@apache.org wrote:

 Can you share the PDF it is failing on?  FWIW, PDFs are notoriously hard to
 extract.  They come in all shapes and flavors and I've seen many a
 commercial extractor fail on them too.  Have you tried using either Tika
 standalone or PDFBox standalone?  Does the file work there?

 On Apr 26, 2010, at 8:35 AM, Marc Ghorayeb wrote:

 
  Okay, I've been digging a little through the Java code from SVN, and it
  seems the load function inside the ExtractingDocumentLoader class does not
  receive the ContentStream (it is set to null...). Maybe I should send this
  to the developer mailing list?
  Marc
 
  From: dekay...@hotmail.com
  To: solr-user@lucene.apache.org
  Subject: RE: Problem with pdf, upgrading Cell
  Date: Fri, 23 Apr 2010 16:03:28 +0200
 
 
  Seems like I'm not the only one with this no-extraction problem:
  http://www.mail-archive.com/solr-user@lucene.apache.org/msg33609.html
  Apparently he tried the same thing, building from the trunk and indexing a
  PDF, and no extraction occurred... Strange.
  Marc G.
 

 --
 Grant Ingersoll
 http://www.lucidimagination.com/

 Search the Lucene ecosystem using Solr/Lucene:
 http://www.lucidimagination.com/search




Re: Problem with pdf, upgrading Cell

2010-04-30 Thread Praveen Agrawal
Grant,
You can try any of the sample PDFs that come in the /docs folder of the Solr 1.4
distribution. I tried 'Installing Solr in Tomcat.pdf', 'index.pdf' etc. Only
metadata (i.e. stream_size and content_type), apart from my own literals, is
indexed, and the content is missing.


On Fri, Apr 30, 2010 at 8:52 PM, Grant Ingersoll gsing...@apache.org wrote:

 Praveen and Marc,

 Can you share the PDF (feel free to email my private email) that fails in
 Solr?

 Thanks,
 Grant


 On Apr 30, 2010, at 7:55 AM, Marc Ghorayeb wrote:

 
  Hi
  Nope, I didn't get it to work... Just like you, the command-line version of
  Tika extracts the content correctly, but once included in Solr, no content
  is extracted.
  What I have tried so far:
  - Updating the Tika libraries inside the public Solr 1.4 release; no luck there.
  - Downloading the latest SVN version, compiling it, and starting from a
    simple schema; still no luck.
  - Getting other versions compiled on Hudson (nightly builds) and testing
    them too; still no extraction.
  I sent a mail to the developers mailing list, but they told me I should
  just mail here. I hope some developer reads this, because it's quite an
  important feature of Solr and somehow it got broken between the 1.4 release
  and the latest version in SVN.
  Marc

 --
 Grant Ingersoll
 http://www.lucidimagination.com/

 Search the Lucene ecosystem using Solr/Lucene:
 http://www.lucidimagination.com/search




Re: Solr throws TikaException while parsing sample PDF

2010-04-21 Thread Praveen Agrawal
Can somebody please guide me here?


On Tue, Apr 20, 2010 at 10:53 AM, Praveen Agrawal pkal...@gmail.com wrote:

 I'm using the Solr 1.4 distribution, with Solr Cell. Can I update only Tika
 to the new version in the Solr 1.4 distribution? If yes, is there any guide?
 Thanks.



 On Mon, Apr 19, 2010 at 4:36 PM, Koji Sekiguchi k...@r.email.ne.jp wrote:

 Praveen Agrawal wrote:

 Hi Grant,
 I tried the command-line version of Tika 0.7 (the newest), and it parsed the
 file. I believe Solr 1.4 contains version 0.4 of Tika.
 Do you suggest upgrading to the new Tika? Can I upgrade only Tika in Solr 1.4,
 or do I need to wait until Solr ships with the new Tika?
 Thanks.


 Solr trunk uses Tika 0.7. I'm not SolrCell user, so this is just an FYI.

 Koji

 --
 http://www.rondhuit.com/en/





Re: Solr throws TikaException while parsing sample PDF

2010-04-19 Thread Praveen Agrawal
Hi Grant,
I tried the command-line version of Tika 0.7 (the newest), and it parsed the
file. I believe Solr 1.4 contains version 0.4 of Tika.
Do you suggest upgrading to the new Tika? Can I upgrade only Tika in Solr 1.4,
or do I need to wait until Solr ships with the new Tika?
Thanks.


On Sun, Apr 18, 2010 at 11:24 PM, Grant Ingersoll gsing...@apache.org wrote:

 Can you extract content from this using Tika's standalone command line
 tool?  PDFs are notorious for problems in extracting.  To me, it looks like
 a bug in PDFBox.  I would try to isolate it down to there and then send, if
 possible, the sample document to PDFBox and see if they can come up w/ a
 fix.

 -Grant

 On Apr 18, 2010, at 1:12 PM, pk wrote:

 
  Hi,
  while posting a sample PDF (that comes with the Solr distribution) to Solr, I'm
  getting a TikaException.
  I'm using Solr 1.4 and SolrJ (StreamingUpdateSolrServer) for posting PDFs to
  Solr.
  Other sample PDFs can be parsed and indexed successfully. I'm getting the
  same error with some other PDFs too (but Adobe Reader can open them fine, so
  I don't think they have a formatting issue or are corrupt)... Here is the
  trace:
 
  
  found uploaded file : C:\solr_1.4.0\docs\Installing Solr in Tomcat.pdf :: size=286242
  Apr 18, 2010 10:31:34 PM org.apache.solr.update.processor.LogUpdateProcessor finish
  INFO: {} 0 640
  Apr 18, 2010 10:31:34 PM org.apache.solr.common.SolrException log
  SEVERE: org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unable to extract PDF content
      at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
      at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
      at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
      at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
      at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
      at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
      at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
      at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
      at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
      at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
      at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
      at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
      at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
      at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
      at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
      at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:873)
      at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
      at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
      at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
      at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
      at java.lang.Thread.run(Thread.java:595)
  Caused by: org.apache.tika.exception.TikaException: Unable to extract PDF content
      at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:58)
      at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:51)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
      at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
      at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
      ... 20 more
  Caused by: java.util.zip.ZipException: incorrect header check
      at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:140)
      at org.pdfbox.filter.FlateFilter.decode(FlateFilter.java:97)
      at org.pdfbox.cos.COSStream.doDecode(COSStream.java:290)
      at org.pdfbox.cos.COSStream.doDecode(COSStream.java:235)
      at org.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:170)
      at org.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:101)
      at org.pdfbox.cos.COSStream.getStreamTokens(COSStream.java:132)
      at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:202)
      at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
      at
  

Re: Solr throws TikaException while parsing sample PDF

2010-04-19 Thread Praveen Agrawal
I'm using the Solr 1.4 distribution, with Solr Cell. Can I update only Tika
to the new version in the Solr 1.4 distribution? If yes, is there any guide?
Thanks.


On Mon, Apr 19, 2010 at 4:36 PM, Koji Sekiguchi k...@r.email.ne.jp wrote:

 Praveen Agrawal wrote:

 Hi Grant,
 I tried the command-line version of Tika 0.7 (the newest), and it parsed the
 file. I believe Solr 1.4 contains version 0.4 of Tika.
 Do you suggest upgrading to the new Tika? Can I upgrade only Tika in Solr 1.4,
 or do I need to wait until Solr ships with the new Tika?
 Thanks.


 Solr trunk uses Tika 0.7. I'm not SolrCell user, so this is just an FYI.

 Koji

 --
 http://www.rondhuit.com/en/




Autofill 'id' field with the URL of files posted to Solr?

2010-04-18 Thread Praveen Agrawal
Hi,
I need to submit thousands of online PDF/HTML files to Solr. I can submit
one file using SolrJ (StreamingUpdateSolrServer and
..solr.common.util.ContentStreamBase.URLStream), setting the
literal.id parameter to the URL. I can't do the same with a batch of
multiple files, as their 'id' values must be unique (set to their URLs).

I couldn't get this to work. Is there a way to get the 'id' field
set automatically to the URL of each file posted to Solr (something like
'stream_name')? How can I set this in solrconfig.xml or schema.xml, or is
there any other way?

Thanks.
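
One workaround, short of having Solr fill in the id by itself, is to issue one
extract request per file so that each request carries its own literal.id. The
Python sketch below only builds the request URLs. It assumes the extracting
handler is mapped at /update/extract, that remote streaming (the stream.url
parameter) is enabled in solrconfig.xml, and that Solr runs at the placeholder
address shown; none of these details come from the question, and the function
name is made up for the example.

```python
from urllib.parse import urlencode

# Hypothetical Solr location; adjust to the real host, port, and handler path.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def build_extract_requests(file_urls, solr_extract_url=SOLR_EXTRACT_URL):
    """Return one extract-request URL per file, each with literal.id set to that file's URL."""
    requests = []
    for file_url in file_urls:
        params = {
            "literal.id": file_url,  # unique id = the document's own URL
            "stream.url": file_url,  # assumes remote streaming is enabled
        }
        requests.append(solr_extract_url + "?" + urlencode(params))
    return requests
```

Each returned URL can then be sent as its own request, followed by a single
commit at the end of the batch. The same per-file pattern applies in SolrJ:
one request per URL, each with its own literal.id.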