Re: Solr on Google App Engine
Thanks a lot, Mauricio.

Does anyone have any experience with Solr on Amazon EC2, or can anyone point me to existing discussions? I appreciate your help.

Thanks.
Praveen

On Thu, Dec 9, 2010 at 6:20 PM, Mauricio Scheffer mauricioschef...@gmail.com wrote:

Solr on GAE has been discussed a couple of times, see these threads:

http://www.mail-archive.com/java-user@lucene.apache.org/msg26010.html
http://www.mail-archive.com/solr-user@lucene.apache.org/msg24473.html
http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3co2w952e01251005032245r79d6bfd6zbe08ece212c82...@mail.gmail.com%3e

--
Mauricio

On Thu, Dec 9, 2010 at 9:07 AM, Praveen Agrawal pkal...@gmail.com wrote:

Hi,

I was wondering if Solr can be deployed/run on Google App Engine. GAE has some restrictions, notably that no local file write access is allowed; instead, applications must use JDO/JPA etc. I believe Solr can be deployed/run on Amazon EC2. Has anyone tried Solr on these two hosts?

Thanks.
Praveen
Re: Solr on Google App Engine
Thanks Dave.

On Mon, Dec 13, 2010 at 4:06 PM, Dave Searle dave.sea...@magicalia.com wrote:

EC2 installations are just Windows/Linux machines, so this would just be a normal setup. I have a Solr server running on a small instance with 1.7 GB RAM and a 50 GB EBS volume mounted; it seems to run fine. Costs about $115 a month.

-----Original Message-----
From: Praveen Agrawal [mailto:pkal...@gmail.com]
Sent: 13 December 2010 09:20
To: solr-user@lucene.apache.org
Subject: Re: Solr on Google App Engine

Thanks a lot, Mauricio.

Does anyone have any experience with Solr on Amazon EC2, or can anyone point me to existing discussions? I appreciate your help.

Thanks.
Praveen

On Thu, Dec 9, 2010 at 6:20 PM, Mauricio Scheffer mauricioschef...@gmail.com wrote:

Solr on GAE has been discussed a couple of times, see these threads:

http://www.mail-archive.com/java-user@lucene.apache.org/msg26010.html
http://www.mail-archive.com/solr-user@lucene.apache.org/msg24473.html
http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3co2w952e01251005032245r79d6bfd6zbe08ece212c82...@mail.gmail.com%3e

--
Mauricio

On Thu, Dec 9, 2010 at 9:07 AM, Praveen Agrawal pkal...@gmail.com wrote:

Hi,

I was wondering if Solr can be deployed/run on Google App Engine. GAE has some restrictions, notably that no local file write access is allowed; instead, applications must use JDO/JPA etc. I believe Solr can be deployed/run on Amazon EC2. Has anyone tried Solr on these two hosts?

Thanks.
Praveen
Solr on Google App Engine
Hi,

I was wondering if Solr can be deployed/run on Google App Engine. GAE has some restrictions, notably that no local file write access is allowed; instead, applications must use JDO/JPA etc. I believe Solr can be deployed/run on Amazon EC2. Has anyone tried Solr on these two hosts?

Thanks.
Praveen
Re: Example of using stream.file to post a binary file to solr
Sandhya,

Chris's link (with the anchor name) goes directly to the SolrJ example.

On Fri, May 7, 2010 at 8:15 PM, Sandhya Agarwal sagar...@opentext.com wrote:

Yes, I did. But I don't find a SolrJ example there. The example in the doc uses curl.

On 07-May-2010, at 8:12 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

: Sorry. That is what I meant. But, I put it wrongly. I have not been
: able to find examples of using solrj, for this.

did you look at the link i included?

: To POST a raw stream using SolrJ you need to use the
: ContentStreamUpdateRequest...
:
: http://wiki.apache.org/solr/ExtractingRequestHandler#Sending_documents_to_Solr

-Hoss
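
For the stream.file variant named in the subject line, the request itself is just a URL with a few parameters, so it can be sketched without SolrJ. The snippet below is only an illustration under assumptions: the base URL, the /update/extract handler path, the file path and the document id are made-up example values, and stream.file only works when remote streaming is enabled on the server and the path is readable by Solr itself. It builds the URL and does not send anything.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, file_path, doc_id, commit=True):
    """Build a URL that asks Solr's extracting handler to read a server-side file.

    base_url, file_path and doc_id are illustrative values, not taken from the thread.
    """
    params = [
        ("stream.file", file_path),  # path as seen by the Solr server, not the client
        ("literal.id", doc_id),      # unique key supplied as a literal field
    ]
    if commit:
        params.append(("commit", "true"))
    # Assumes the handler is registered at /update/extract.
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)


url = build_extract_url("http://localhost:8983/solr", "/data/docs/report.pdf", "doc1")
print(url)
```

The SolrJ route Hoss points to (ContentStreamUpdateRequest) covers the other case, where the client uploads the bytes itself instead of naming a file on the server.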
Re: Problem with pdf, upgrading Cell
Marc, Sandhya,

Did you use Solr from trunk? I used the Solr 1.4 distribution, and even after copying all the jars, I still get the same results for the PDFs I posted here.

Thanks.

On Wed, May 5, 2010 at 1:09 PM, Marc Ghorayeb dekay...@hotmail.com wrote:

Hey,

I have the same list, and I added to it the extraction library (apache solr cell jar), though you might not need it specifically inside the war file.

Marc

From: sagar...@opentext.com
To: solr-user@lucene.apache.org
Date: Wed, 5 May 2010 10:21:36 +0530
Subject: RE: Problem with pdf, upgrading Cell

Looks like the highlighting may not work here. Following is the list of jars I copied:

asm-3.1.jar
bcmail-jdk15-1.45.jar
bcprov-jdk15-1.45.jar
commons-compress-1.0.jar
commons-logging-1.1.1.jar
dom4j-1.6.1.jar
fontbox-1.1.0.jar
geronimo-stax-api_1.0_spec-1.0.1.jar
jempbox-1.1.0.jar
log4j-1.2.14.jar
metadata-extractor-2.4.0-beta-1.jar
pdfbox-1.1.0.jar
poi-3.6.jar
poi-ooxml-3.6.jar
poi-ooxml-schemas-3.6.jar
poi-scratchpad-3.6.jar
tagsoup-1.2.jar
tika-core-0.7.jar
tika-parsers-0.7.jar
xml-apis-1.0.b2.jar
xmlbeans-2.3.0.jar

Thanks,
Sandhya

-----Original Message-----
From: Sandhya Agarwal [mailto:sagar...@opentext.com]
Sent: Wednesday, May 05, 2010 10:06 AM
To: solr-user@lucene.apache.org
Subject: RE: Problem with pdf, upgrading Cell

Praveen,

I only have the highlighted jars copied. Not sure if we need the other jars. Also, I copied the jars directly into solr\WEB-INF\lib, like you did.

Thanks,
Sandhya

-----Original Message-----
From: Praveen Agrawal [mailto:pkal...@gmail.com]
Sent: Tuesday, May 04, 2010 8:10 PM
To: solr-user@lucene.apache.org
Subject: Re: Problem with pdf, upgrading Cell

Hi Sandhya,

I must be missing something. I copied all dependency jars to both the contrib/extraction/lib and WEB-INF/lib folders. Here is the list of jars copied:

asm-3.1.jar
bcmail-jdk15-1.45.jar
bcprov-jdk15-1.45.jar
commons-compress-1.0.jar
commons-logging-1.1.1.jar
dom4j-1.6.1.jar
fontbox-1.1.0.jar
geronimo-stax-api_1.0_spec-1.0.1.jar
hamcrest-core-1.1.jar
jempbox-1.1.0.jar
junit-3.8.1.jar
log4j-1.2.14.jar
metadata-extractor-2.4.0-beta-1.jar
mockito-core-1.7.jar
nekohtml-1.9.9.jar
objenesis-1.0.jar
ooxml-schemas-1.0.jar
pdfbox-1.1.0.jar
poi-3.6.jar
poi-ooxml-3.6.jar
poi-ooxml-schemas-3.6.jar
poi-scratchpad-3.6.jar
tagsoup-1.2.jar
tika-core-0.7.jar
tika-parsers-0.7.jar
xml-apis-1.0.b2.jar
xmlbeans-2.3.0.jar

Still the same result for me.

Marc, I'm on Windows, and I copied the above jars directly into the already extracted folder webapps/solr/WEB-INF/lib, in addition to what was already there. I didn't explicitly un-jar and re-jar solr.war; do you think that could be the issue? I think Tomcat extracts the war and uses the folder in webapps (I didn't put the war file in webapps; instead I put the extracted solr folder there directly).

If it has worked for you guys, especially with my two PDFs, then that's really great. Please let me know your exact procedure, including what you copied and where, or if you see that I missed something obvious.

Thanks,
Praveen

On Tue, May 4, 2010 at 5:28 PM, Sandhya Agarwal sagar...@opentext.com wrote:

Both the files work for me, Praveen.

Thanks,
Sandhya

From: Praveen Agrawal [mailto:pkal...@gmail.com]
Sent: Tuesday, May 04, 2010 5:22 PM
To: solr-user@lucene.apache.org
Subject: Re: Problem with pdf, upgrading Cell

another one here..

On Tue, May 4, 2010 at 5:20 PM, Praveen Agrawal pkal...@gmail.com wrote:

It bounced because of the attachment's size.. attaching one by one now..

On Tue, May 4, 2010 at 5:17 PM, Praveen Agrawal pkal...@gmail.com wrote:

I noticed the following pattern/relationship between producer/creator and content extraction, not sure if it's helpful (as Grant said earlier, PDFs are notorious):

Producer: Bullzip PDF Printer / www.bullzip.com / Freeware Edition (not registered)
Creator: PScript5.dll Version 5.2.2
Extraction: no content -- installing Solr in Tomcat.pdf (attached - I generated)

Producer: Acrobat Distiller 7.0.5 (Windows)
Creator: PScript5.dll Version 5.2.2
Extraction: one line of content

Producer: Acrobat Distiller 8.1.0 (Windows)
Creator: Acrobat PDFMaker 8.1 for Word
Extraction: one line of content (Free_Two_way_Radio_Guide.pdf - attached - was available freely on their website)

Producer: FOP 0.20.5
Extraction: full content -- /docs/features.pdf | linkmap.pdf etc
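
Since this message quotes two jar lists (Sandhya's and Praveen's) and the question is whether something is missing, a quick set difference makes the comparison concrete. This is only a sketch: the jar names are copied from the two lists above, while the variable names are made up for the example. It shows which files differ, not why extraction fails.

```python
# Jar names copied from the two lists quoted in this message.
sandhya_jars = """
asm-3.1.jar bcmail-jdk15-1.45.jar bcprov-jdk15-1.45.jar commons-compress-1.0.jar
commons-logging-1.1.1.jar dom4j-1.6.1.jar fontbox-1.1.0.jar
geronimo-stax-api_1.0_spec-1.0.1.jar jempbox-1.1.0.jar log4j-1.2.14.jar
metadata-extractor-2.4.0-beta-1.jar pdfbox-1.1.0.jar poi-3.6.jar poi-ooxml-3.6.jar
poi-ooxml-schemas-3.6.jar poi-scratchpad-3.6.jar tagsoup-1.2.jar tika-core-0.7.jar
tika-parsers-0.7.jar xml-apis-1.0.b2.jar xmlbeans-2.3.0.jar
""".split()

praveen_jars = sandhya_jars + """
hamcrest-core-1.1.jar junit-3.8.1.jar mockito-core-1.7.jar nekohtml-1.9.9.jar
objenesis-1.0.jar ooxml-schemas-1.0.jar
""".split()

# Jars present in one setup but not in the other.
only_praveen = sorted(set(praveen_jars) - set(sandhya_jars))
only_sandhya = sorted(set(sandhya_jars) - set(praveen_jars))

print("only in Praveen's list:", only_praveen)
print("only in Sandhya's list:", only_sandhya)
```

The same comparison could be run against the actual directory contents (for example with os.listdir on contrib/extraction/lib and WEB-INF/lib) to confirm that both folders really hold the same set of files.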
Re: Problem with pdf, upgrading Cell
' to classloader
May 4, 2010 12:50:16 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/dist/apache-solr-cell-1.4.0.jar' to classloader
May 4, 2010 12:50:20 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/dist/apache-solr-clustering-1.4.0.jar' to classloader
May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/carrot2-mini-3.1.0.jar' to classloader
May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/commons-lang-2.4.jar' to classloader
May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/ehcache-1.6.2.jar' to classloader
May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/google-collections-1.0-rc2.jar' to classloader
May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/jackson-core-asl-0.9.9-6.jar' to classloader
May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/jackson-mapper-asl-0.9.9-6.jar' to classloader
May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/log4j-1.2.14.jar' to classloader

Thanks,
Sandhya

-----Original Message-----
From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
Sent: Tuesday, May 04, 2010 6:13 AM
Cc: solr-user@lucene.apache.org
Subject: Re: Problem with pdf, upgrading Cell

Little more info... Seems to be a classloading issue. The tests pass, but they aren't loading the Tika libraries via the Solr ResourceLoader, whereas the example is. Marc, one thing to try is to unjar the Solr WAR file and put the Tika libs in there, as I bet it will then work. Note, however, I haven't tried this.

On May 3, 2010, at 6:24 PM, Grant Ingersoll wrote:

I've opened https://issues.apache.org/jira/browse/SOLR-1902 to track this. It is indeed a bug somewhere (still investigating). It seems that Tika is now picking an EmptyParser implementation when trying to determine which parser to use, despite the fact that it properly identifies the MIME type.

-Grant

On May 3, 2010, at 5:36 PM, Grant Ingersoll wrote:

I'm investigating.

On May 3, 2010, at 5:17 AM, Marc Ghorayeb wrote:

Hi,

Grant, I confirm what Praveen has said: any PDF I try does not work with the new Tika and SVN versions. :(

Marc

From: sagar...@opentext.com
To: solr-user@lucene.apache.org
Date: Mon, 3 May 2010 13:05:24 +0530
Subject: RE: Problem with pdf, upgrading Cell

Hello,

Please let me know if anybody figured out a way out of this issue.

Thanks,
Sandhya

-----Original Message-----
From: Praveen Agrawal [mailto:pkal...@gmail.com]
Sent: Friday, April 30, 2010 11:14 PM
To: solr-user@lucene.apache.org
Subject: Re: Problem with pdf, upgrading Cell

Grant,

You can try any of the sample PDFs that come in the /docs folder of the Solr 1.4 distribution. I had tried 'Installing Solr in Tomcat.pdf', 'index.pdf' etc. Only metadata (i.e. stream_size, content_type) apart from my own literals is indexed, and the content is missing.

On Fri, Apr 30, 2010 at 8:52 PM, Grant Ingersoll gsing...@apache.org wrote:

Praveen and Marc,

Can you share the PDF (feel free to email my private email) that fails in Solr?

Thanks,
Grant

On Apr 30, 2010, at 7:55 AM, Marc Ghorayeb wrote:

Hi,

Nope, I didn't get it to work... Just like you, the command-line version of Tika extracts the content correctly, but once included in Solr, no content is extracted.

What I have tried until now is:
- Updating the Tika libraries inside the Solr 1.4 public version, no luck there.
- Downloading the latest SVN version, compiling it, and starting from a simple schema, still no luck.
- Getting other versions compiled on Hudson (nightly builds), and testing them also, still no extraction.

I sent a mail to the developers mailing list but they told me I should just mail here. I hope some developer reads this, because it's quite an important feature of Solr and somehow it got broken between the 1.4 release and the latest version in SVN.

Marc
Re: Problem with pdf, upgrading Cell
org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/tagsoup-1.2.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/tika-core-0.7.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/tika-parsers-0.7.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/xercesImpl-2.8.1.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/xml-apis-1.0.b2.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/xmlbeans-2.3.0.jar' to classloader
May 4, 2010 12:50:16 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/dist/apache-solr-cell-1.4.0.jar' to classloader
May 4, 2010 12:50:20 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/dist/apache-solr-clustering-1.4.0.jar' to classloader
May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/carrot2-mini-3.1.0.jar' to classloader
May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/commons-lang-2.4.jar' to classloader
May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/ehcache-1.6.2.jar' to classloader
May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/google-collections-1.0-rc2.jar' to classloader
May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/jackson-core-asl-0.9.9-6.jar' to classloader
May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/jackson-mapper-asl-0.9.9-6.jar' to classloader
May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/log4j-1.2.14.jar' to classloader

Thanks,
Sandhya

-----Original Message-----
From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
Sent: Tuesday, May 04, 2010 6:13 AM
Cc: solr-user@lucene.apache.org
Subject: Re: Problem with pdf, upgrading Cell

Little more info... Seems to be a classloading issue. The tests pass, but they aren't loading the Tika libraries via the Solr ResourceLoader, whereas the example is. Marc, one thing to try is to unjar the Solr WAR file and put the Tika libs in there, as I bet it will then work. Note, however, I haven't tried this.

On May 3, 2010, at 6:24 PM, Grant Ingersoll wrote:

I've opened https://issues.apache.org/jira/browse/SOLR-1902 to track this. It is indeed a bug somewhere (still investigating). It seems that Tika is now picking an EmptyParser implementation when trying to determine which parser to use, despite the fact that it properly identifies the MIME type.

-Grant

On May 3, 2010, at 5:36 PM, Grant Ingersoll wrote:

I'm investigating.

On May 3, 2010, at 5:17 AM, Marc Ghorayeb wrote:

Hi,

Grant, I confirm what Praveen has said: any PDF I try does not work with the new Tika and SVN versions. :(

Marc

From: sagar...@opentext.com
To: solr-user@lucene.apache.org
Date: Mon, 3 May 2010 13:05:24 +0530
Subject: RE: Problem with pdf, upgrading Cell

Hello,

Please let me know if anybody figured out a way out of this issue.

Thanks,
Sandhya

-----Original Message-----
From: Praveen Agrawal [mailto:pkal...@gmail.com]
Sent: Friday, April 30, 2010 11:14 PM
To: solr-user@lucene.apache.org
Subject: Re: Problem with pdf, upgrading Cell

Grant,

You can try any of the sample PDFs that come in the /docs folder of the Solr 1.4 distribution. I had tried 'Installing Solr in Tomcat.pdf', 'index.pdf' etc. Only metadata (i.e. stream_size, content_type) apart from my own literals is indexed, and the content is missing.

On Fri, Apr 30, 2010 at 8:52 PM
Re: Problem with pdf, upgrading Cell
I seem to have mixed results. Here is what I did: copied the new Tika/poi/jempbox/pdfbox/fontbox/log4j jars etc. into contrib/extraction/lib (of course removed the old ones), as well as into WEB-INF/lib of the Solr web app in Tomcat.

Now it extracts content from some PDFs, but either no content from others, or only a line of content. For example, /docs/Installing Solr in Tomcat.pdf still shows no content. I have two other PDFs for which it extracts only one line of content.

Also, I'm now getting a single value for the 'title' field for some PDFs, and two for others. Where it can extract the full content, it shows the title as what I gave as a literal while submitting the PDF. For a PDF where no content was extracted, it shows one empty title and mine. For a PDF where it extracted only one line of content, it shows that line as a title too, plus mine. The 'title' field is defined as multivalued in the schema.

Any idea what's going on? Or am I missing something?

On Tue, May 4, 2010 at 4:13 PM, Marc Ghorayeb dekay...@hotmail.com wrote:

Hey,

I got it to work. I just redid my steps; I had forgotten several libraries that were imported through the XML. PDF extraction seems to work once again, I have yet to find one that raises an exception! Thanks for the investigation, at least we now have a fix :)

Marc
Re: Problem with pdf, upgrading Cell
Yes Sandhya, I copied the new poi/jempbox/pdfbox/fontbox etc. jars too. I believe this is what you were asking.

Thanks.

On Tue, May 4, 2010 at 5:01 PM, Sandhya Agarwal sagar...@opentext.com wrote:

Praveen,

Along with the Tika core and parser jars, did you run mvn dependency:copy-dependencies to generate all the dependencies too?

Thanks,
Sandhya

-----Original Message-----
From: Praveen Agrawal [mailto:pkal...@gmail.com]
Sent: Tuesday, May 04, 2010 4:52 PM
To: solr-user@lucene.apache.org
Subject: Re: Problem with pdf, upgrading Cell

I seem to have mixed results. Here is what I did: copied the new Tika/poi/jempbox/pdfbox/fontbox/log4j jars etc. into contrib/extraction/lib (of course removed the old ones), as well as into WEB-INF/lib of the Solr web app in Tomcat.

Now it extracts content from some PDFs, but either no content from others, or only a line of content. For example, /docs/Installing Solr in Tomcat.pdf still shows no content. I have two other PDFs for which it extracts only one line of content.

Also, I'm now getting a single value for the 'title' field for some PDFs, and two for others. Where it can extract the full content, it shows the title as what I gave as a literal while submitting the PDF. For a PDF where no content was extracted, it shows one empty title and mine. For a PDF where it extracted only one line of content, it shows that line as a title too, plus mine. The 'title' field is defined as multivalued in the schema.

Any idea what's going on? Or am I missing something?

On Tue, May 4, 2010 at 4:13 PM, Marc Ghorayeb dekay...@hotmail.com wrote:

Hey,

I got it to work. I just redid my steps; I had forgotten several libraries that were imported through the XML. PDF extraction seems to work once again, I have yet to find one that raises an exception! Thanks for the investigation, at least we now have a fix :)

Marc
Re: Problem with pdf, upgrading Cell
It bounced because of the attachment's size.. attaching one by one now..

On Tue, May 4, 2010 at 5:17 PM, Praveen Agrawal pkal...@gmail.com wrote:

I noticed the following pattern/relationship between producer/creator and content extraction, not sure if it's helpful (as Grant said earlier, PDFs are notorious):

Producer: Bullzip PDF Printer / www.bullzip.com / Freeware Edition (not registered)
Creator: PScript5.dll Version 5.2.2
Extraction: no content -- installing Solr in Tomcat.pdf (attached - I generated)

Producer: Acrobat Distiller 7.0.5 (Windows)
Creator: PScript5.dll Version 5.2.2
Extraction: one line of content

Producer: Acrobat Distiller 8.1.0 (Windows)
Creator: Acrobat PDFMaker 8.1 for Word
Extraction: one line of content (Free_Two_way_Radio_Guide.pdf - attached - was available freely on their website)

Producer: FOP 0.20.5
Extraction: full content -- /docs/features.pdf | linkmap.pdf etc

Thanks.
Praveen

On Tue, May 4, 2010 at 5:05 PM, Praveen Agrawal pkal...@gmail.com wrote:

Yes Sandhya, I copied the new poi/jempbox/pdfbox/fontbox etc. jars too. I believe this is what you were asking.

Thanks.

On Tue, May 4, 2010 at 5:01 PM, Sandhya Agarwal sagar...@opentext.com wrote:

Praveen,

Along with the Tika core and parser jars, did you run mvn dependency:copy-dependencies to generate all the dependencies too?

Thanks,
Sandhya

-----Original Message-----
From: Praveen Agrawal [mailto:pkal...@gmail.com]
Sent: Tuesday, May 04, 2010 4:52 PM
To: solr-user@lucene.apache.org
Subject: Re: Problem with pdf, upgrading Cell

I seem to have mixed results. Here is what I did: copied the new Tika/poi/jempbox/pdfbox/fontbox/log4j jars etc. into contrib/extraction/lib (of course removed the old ones), as well as into WEB-INF/lib of the Solr web app in Tomcat.

Now it extracts content from some PDFs, but either no content from others, or only a line of content. For example, /docs/Installing Solr in Tomcat.pdf still shows no content. I have two other PDFs for which it extracts only one line of content.

Also, I'm now getting a single value for the 'title' field for some PDFs, and two for others. Where it can extract the full content, it shows the title as what I gave as a literal while submitting the PDF. For a PDF where no content was extracted, it shows one empty title and mine. For a PDF where it extracted only one line of content, it shows that line as a title too, plus mine. The 'title' field is defined as multivalued in the schema.

Any idea what's going on? Or am I missing something?

On Tue, May 4, 2010 at 4:13 PM, Marc Ghorayeb dekay...@hotmail.com wrote:

Hey,

I got it to work. I just redid my steps; I had forgotten several libraries that were imported through the XML. PDF extraction seems to work once again, I have yet to find one that raises an exception! Thanks for the investigation, at least we now have a fix :)

Marc
Re: Problem with pdf, upgrading Cell
another one here..

On Tue, May 4, 2010 at 5:20 PM, Praveen Agrawal pkal...@gmail.com wrote:

It bounced because of the attachment's size.. attaching one by one now..

On Tue, May 4, 2010 at 5:17 PM, Praveen Agrawal pkal...@gmail.com wrote:

I noticed the following pattern/relationship between producer/creator and content extraction, not sure if it's helpful (as Grant said earlier, PDFs are notorious):

Producer: Bullzip PDF Printer / www.bullzip.com / Freeware Edition (not registered)
Creator: PScript5.dll Version 5.2.2
Extraction: no content -- installing Solr in Tomcat.pdf (attached - I generated)

Producer: Acrobat Distiller 7.0.5 (Windows)
Creator: PScript5.dll Version 5.2.2
Extraction: one line of content

Producer: Acrobat Distiller 8.1.0 (Windows)
Creator: Acrobat PDFMaker 8.1 for Word
Extraction: one line of content (Free_Two_way_Radio_Guide.pdf - attached - was available freely on their website)

Producer: FOP 0.20.5
Extraction: full content -- /docs/features.pdf | linkmap.pdf etc

Thanks.
Praveen

On Tue, May 4, 2010 at 5:05 PM, Praveen Agrawal pkal...@gmail.com wrote:

Yes Sandhya, I copied the new poi/jempbox/pdfbox/fontbox etc. jars too. I believe this is what you were asking.

Thanks.

On Tue, May 4, 2010 at 5:01 PM, Sandhya Agarwal sagar...@opentext.com wrote:

Praveen,

Along with the Tika core and parser jars, did you run mvn dependency:copy-dependencies to generate all the dependencies too?

Thanks,
Sandhya

-----Original Message-----
From: Praveen Agrawal [mailto:pkal...@gmail.com]
Sent: Tuesday, May 04, 2010 4:52 PM
To: solr-user@lucene.apache.org
Subject: Re: Problem with pdf, upgrading Cell

I seem to have mixed results. Here is what I did: copied the new Tika/poi/jempbox/pdfbox/fontbox/log4j jars etc. into contrib/extraction/lib (of course removed the old ones), as well as into WEB-INF/lib of the Solr web app in Tomcat.

Now it extracts content from some PDFs, but either no content from others, or only a line of content. For example, /docs/Installing Solr in Tomcat.pdf still shows no content. I have two other PDFs for which it extracts only one line of content.

Also, I'm now getting a single value for the 'title' field for some PDFs, and two for others. Where it can extract the full content, it shows the title as what I gave as a literal while submitting the PDF. For a PDF where no content was extracted, it shows one empty title and mine. For a PDF where it extracted only one line of content, it shows that line as a title too, plus mine. The 'title' field is defined as multivalued in the schema.

Any idea what's going on? Or am I missing something?

On Tue, May 4, 2010 at 4:13 PM, Marc Ghorayeb dekay...@hotmail.com wrote:

Hey,

I got it to work. I just redid my steps; I had forgotten several libraries that were imported through the XML. PDF extraction seems to work once again, I have yet to find one that raises an exception! Thanks for the investigation, at least we now have a fix :)

Marc
Re: Problem with pdf, upgrading Cell
Hi Sandhya.. I must be missing something. I copied all dependencies jars to both contrib/extraction/lib and web-in/lib folders. Here is the list of jars copied: asm-3.1.jar bcmail-jdk15-1.45.jar bcprov-jdk15-1.45.jar commons-compress-1.0.jar commons-logging-1.1.1.jar dom4j-1.6.1.jar fontbox-1.1.0.jar geronimo-stax-api_1.0_spec-1.0.1.jar hamcrest-core-1.1.jar jempbox-1.1.0.jar junit-3.8.1.jar log4j-1.2.14.jar metadata-extractor-2.4.0-beta-1.jar mockito-core-1.7.jar nekohtml-1.9.9.jar objenesis-1.0.jar ooxml-schemas-1.0.jar pdfbox-1.1.0.jar poi-3.6.jar poi-ooxml-3.6.jar poi-ooxml-schemas-3.6.jar poi-scratchpad-3.6.jar tagsoup-1.2.jar tika-core-0.7.jar tika-parsers-0.7.jar xml-apis-1.0.b2.jar xmlbeans-2.3.0.jar Still same result for me.. Marc, i'm on windows, and i copied above jars directly into already extracted folder webapps/solr/web-in/lib, in addition to what were already there. I didn;t explicitly un-jar'd and re-jar'd the solr.war, but do you think that could be the issue? i think tomcat extract the war and use the folder in webapps (i didn;t put the war file in webapps, instead had put extracted solr folder directly) If it has worked for you guys, specially with my two pdfs, then that's really great. Please let me know your exact procedure, including what all you copied and where, or if you see i missed something obvious.. Thanks, Praveen On Tue, May 4, 2010 at 5:28 PM, Sandhya Agarwal sagar...@opentext.comwrote: Both the files work for me, Praveen. Thanks, Sandhya From: Praveen Agrawal [mailto:pkal...@gmail.com] Sent: Tuesday, May 04, 2010 5:22 PM To: solr-user@lucene.apache.org Subject: Re: Problem with pdf, upgrading Cell another one here.. On Tue, May 4, 2010 at 5:20 PM, Praveen Agrawal pkal...@gmail.commailto: pkal...@gmail.com wrote: It bounced because of attachment's size.. attaching one by one now.. 
On Tue, May 4, 2010 at 5:17 PM, Praveen Agrawal pkal...@gmail.commailto: pkal...@gmail.com wrote: I noticed following pattern/relationship b/w producer/creator and content extraction, not sure if helpful (as Grant told earlier pdfs are notorious): producer: Bullzip PDF Printer / www.bullzip.comhttp://www.bullzip.com / Freeware Edition (not registered) Creator: PScript5.dll Version 5.2.2 Extraction: no content -- installing Solr in Tomcat.pdf (attached - i generated) - Producer: Acrobat Distiller 7.0.5 (Windows) creator: PScript5.dll Version 5.2.2 Extraction: One line content -- Producer: Acrobat Distiller 8.1.0 (Windows) creator: Acrobat PDFMaker 8.1 for Word Extraction: one line of content(Free_Two_way_Radio_Guide.pdf - attached - was available freely on their website) - Producer: FOP 0.20.5 Extraction: full content/docs/features.pdf | linkmap.pdf etc -- Thanks. Praveen On Tue, May 4, 2010 at 5:05 PM, Praveen Agrawal pkal...@gmail.commailto: pkal...@gmail.com wrote: Yes Sandhya, i copied new poi/jempbox/pdfbox/fontbox etc jars too. I believe this is what you were asking. Thanks. On Tue, May 4, 2010 at 5:01 PM, Sandhya Agarwal sagar...@opentext.com mailto:sagar...@opentext.com wrote: Praveen, Along with the tika core and parser jars, did you run mvn dependency:copy-dependencies, to generate all the dependencies too. Thanks, Sandhya -Original Message- From: Praveen Agrawal [mailto:pkal...@gmail.commailto:pkal...@gmail.com] Sent: Tuesday, May 04, 2010 4:52 PM To: solr-user@lucene.apache.orgmailto:solr-user@lucene.apache.org Subject: Re: Problem with pdf, upgrading Cell I seems to have mixed results: Here is what i did: copied new Tika/poi/jempbox/pdfbox/fontbox/log4j jars etc in contrib/extraction/lib (of-course removed old ones),. as well as in web-inf/lib of solr web app in tomcat. Now it extracts contents from some pdf, but either no content from others, or only a line of content. For ex, /docs/Installing Solr in Tomcat.pdf still shows no contents. 
I have two other pdfs for which it extracts only one line of content. Also, I'm now getting a single value in the 'title' field for some pdfs, and two for others. Where it can extract the full content, it shows the title as what I gave as a literal while submitting the pdf. For a pdf where no content was extracted, it shows one empty title and mine. For a pdf where it extracted only one line of content, it shows that line as a title too, plus mine. The 'title' field is defined as multivalued in the schema. Any idea what's going on? Or am I missing something? On Tue, May 4, 2010 at 4:13 PM, Marc Ghorayeb dekay...@hotmail.com wrote: Hey, I got it to work. I just redid my steps; I had forgotten several libraries that were imported through the xml. PDF extraction seems to work once again, I have yet to find one that raises an exception! Thanks for the investigation, at least we now have a fix :) Marc
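Since the new jars were copied into WEB-INF/lib "in addition to what was already there", one thing that may be worth ruling out is two versions of the same library sitting side by side on the classpath. The sketch below is only a heuristic check, not something proposed in the thread: the class name JarVersionCheck and the rule "the version starts at the first hyphen followed by a digit" are assumptions made for illustration, and the sample file names in main are made up.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: reports libraries that appear in more than one version. */
public class JarVersionCheck {
    // Assumed naming rule: <artifact>-<version>.jar, where the version
    // starts at the first "-" that is followed by a digit.
    private static final Pattern JAR = Pattern.compile("^(.+?)-(\\d.*)\\.jar$");

    /** Maps each artifact name to its versions, keeping only artifacts with two or more. */
    static Map<String, List<String>> conflicts(List<String> jarNames) {
        Map<String, List<String>> versions = new TreeMap<>();
        for (String name : jarNames) {
            Matcher m = JAR.matcher(name);
            if (m.matches()) {
                versions.computeIfAbsent(m.group(1), k -> new ArrayList<>()).add(m.group(2));
            }
        }
        // Drop every artifact that is present in a single version only.
        versions.values().removeIf(v -> v.size() < 2);
        return versions;
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>();
        if (args.length > 0) {
            // Pass a lib directory, e.g. the WEB-INF/lib folder of the webapp.
            String[] listed = new File(args[0]).list();
            if (listed != null) {
                names.addAll(Arrays.asList(listed));
            }
        } else {
            // Made-up sample: an older jar left next to a newer one.
            names.addAll(Arrays.asList(
                "tika-core-0.7.jar", "tika-core-0.4.jar", "pdfbox-1.1.0.jar", "poi-3.6.jar"));
        }
        conflicts(names).forEach((artifact, v) -> System.out.println(artifact + " -> " + v));
    }
}
```

Running it against both contrib/extraction/lib and WEB-INF/lib would show whether any old jar survived the copy; an empty result does not prove the classpath is clean, since renamed or repackaged jars are not detected.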
Any way to get top 'n' queries searched from Solr?
Hi, I need to know the top 'n' (say 100) most frequently searched queries that users tried, along with their frequencies. Does Solr keep this information and can it return it, or else what options do I have here? Thanks, Praveen
Re: Any way to get top 'n' queries searched from Solr?
Thanks Mitch.. I have an application fronting Solr for updating/searching etc, and I'll make use of that to store this info. Thanks to all for the suggestions. On Fri, Apr 30, 2010 at 3:43 PM, MitchK mitc...@web.de wrote: The simplest way is to send the query string to your Solr client *and* to your custom query store, which could be any database you like. Doing so, you can count how often each query was sent etc. *And* you can make them searchable by exporting those datasets to another Solr core. Why an extra DB? Because if a crash occurs, Solr gives you no guarantees. Keep in mind that Solr is only an index/search server, not a real database. This is probably the easiest way to implement such a feature, I think. Good luck. - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Any-way-to-get-top-n-queries-searched-from-Solr-tp767165p767489.html Sent from the Solr - User mailing list archive at Nabble.com.
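The counting part of Mitch's suggestion could look roughly like the sketch below, living in the application that fronts Solr. It is a minimal in-memory version under stated assumptions: the class name QueryLog and the trim/lower-case normalisation are hypothetical choices, and the durable storage in an external database that Mitch recommends is deliberately left out.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: counts query strings and reports the most frequent ones. */
public class QueryLog {
    private final Map<String, Integer> counts = new HashMap<>();

    /** Record one user query; call this wherever the application forwards a search to Solr. */
    public synchronized void record(String query) {
        // Assumed normalisation: "Solr " and "solr" count as the same query.
        String key = query.trim().toLowerCase(Locale.ROOT);
        if (!key.isEmpty()) {
            counts.merge(key, 1, Integer::sum);
        }
    }

    /** Return up to n entries, most frequent first; ties are ordered alphabetically. */
    public synchronized List<Map.Entry<String, Integer>> top(int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            // Copy each entry so callers never see later changes to the live map.
            entries.add(new AbstractMap.SimpleImmutableEntry<String, Integer>(e.getKey(), e.getValue()));
        }
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        int limit = Math.max(0, Math.min(n, entries.size()));
        return new ArrayList<>(entries.subList(0, limit));
    }

    public static void main(String[] args) {
        QueryLog log = new QueryLog();
        for (String q : new String[] {"solr", "tika pdf", "Solr ", "ec2", "tika pdf", "solr"}) {
            log.record(q);
        }
        for (Map.Entry<String, Integer> e : log.top(2)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

Because the counts live only in memory they are lost on restart, which is exactly why Mitch suggests a separate database; the same record/top interface could sit in front of a table with a query column and a counter.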
Re: Problem with pdf, upgrading Cell
I did try the standalone version of Tika 0.7, and it extracted the pdf content successfully. Then I replaced the Tika-related jars in contrib/extraction/lib of the Solr 1.4 distribution with their newer versions, and now it doesn't extract content from ANY pdf. Earlier (0.4) it was throwing an exception for a few pdfs, but now there is neither content nor an exception. On Fri, Apr 30, 2010 at 4:14 PM, Grant Ingersoll gsing...@apache.org wrote: Can you share the PDF it is failing on? FWIW, PDFs are notoriously hard to extract. They come in all shapes and flavors and I've seen many a commercial extractor fail on them too. Have you tried using either Tika standalone or PDFBox standalone? Does the file work there? On Apr 26, 2010, at 8:35 AM, Marc Ghorayeb wrote: Okay, I've been digging a little bit through the Java code from the SVN, and it seems the load function inside the ExtractingDocumentLoader class does not receive the ContentStream (it is set to null...). Maybe I should send this to the developer mailing list? Marc From: dekay...@hotmail.com To: solr-user@lucene.apache.org Subject: RE: Problem with pdf, upgrading Cell Date: Fri, 23 Apr 2010 16:03:28 +0200 Seems like I'm not the only one with this no-extraction problem: http://www.mail-archive.com/solr-user@lucene.apache.org/msg33609.html Apparently he tried the same thing, building from the trunk and indexing a pdf, and no extraction occurred... Strange. Marc G. -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
Re: Problem with pdf, upgrading Cell
Grant, You can try any of the sample pdfs that come in the /docs folder of the Solr 1.4 distribution. I tried 'Installing Solr in Tomcat.pdf', 'index.pdf' etc. Only metadata, i.e. stream_size and content_type, apart from my own literals, is indexed, and the content is missing. On Fri, Apr 30, 2010 at 8:52 PM, Grant Ingersoll gsing...@apache.org wrote: Praveen and Marc, Can you share the PDF (feel free to email my private email) that fails in Solr? Thanks, Grant On Apr 30, 2010, at 7:55 AM, Marc Ghorayeb wrote: Hi, Nope, I didn't get it to work... Just like you, the command line version of Tika extracts the content correctly, but once included in Solr, no content is extracted. What I have tried so far: - Updating the Tika libraries inside the public Solr 1.4 release, no luck there. - Downloading the latest SVN version, compiling it, and starting from a simple schema, still no luck. - Getting other versions compiled on Hudson (nightly builds) and testing them too, still no extraction. I sent a mail to the developers mailing list but they told me I should just mail here. I hope some developer reads this, because it's quite an important feature of Solr and somehow it got broken between the 1.4 release and the latest version in SVN. Marc -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
Re: Solr throws TikaException while parsing sample PDF
Can somebody please guide me here? On Tue, Apr 20, 2010 at 10:53 AM, Praveen Agrawal pkal...@gmail.com wrote: I'm using the Solr 1.4 distribution, with Solr Cell. Can I update only Tika to a new version in the Solr 1.4 distribution? If yes, is there any guide etc? Thanks. On Mon, Apr 19, 2010 at 4:36 PM, Koji Sekiguchi k...@r.email.ne.jp wrote: Praveen Agrawal wrote: Hi Grant, I tried the command line of Tika v0.7 (newest), and it parsed the file.. I believe Solr 1.4 contains version 0.4 of Tika. Do you suggest upgrading to the new Tika? Can I upgrade only Tika in Solr 1.4, or do I need to wait till Solr ships with the new Tika? Thanks. Solr trunk uses Tika 0.7. I'm not a SolrCell user, so this is just an FYI. Koji -- http://www.rondhuit.com/en/
Re: Solr throws TikaException while parsing sample PDF
Hi Grant, I tried the command line of Tika v0.7 (newest), and it parsed the file.. I believe Solr 1.4 contains version 0.4 of Tika. Do you suggest upgrading to the new Tika? Can I upgrade only Tika in Solr 1.4, or do I need to wait till Solr ships with the new Tika? Thanks. On Sun, Apr 18, 2010 at 11:24 PM, Grant Ingersoll gsing...@apache.org wrote: Can you extract content from this using Tika's standalone command line tool? PDFs are notorious for problems in extracting. To me, it looks like a bug in PDFBox. I would try to isolate it down to there and then send, if possible, the sample document to PDFBox and see if they can come up w/ a fix. -Grant On Apr 18, 2010, at 1:12 PM, pk wrote: Hi, while posting a sample pdf (that comes with the Solr distribution) to Solr, I'm getting a TikaException. I'm using Solr 1.4 and SolrJ (StreamingUpdateSolrServer) for posting the pdf to Solr. Other sample pdfs can be parsed and indexed successfully. I'm getting the same error with some other pdfs also (but Adobe Reader can open them fine, so I don't think they have a formatting issue or are corrupt etc)... Here is the trace...
found uploaded file : C:\solr_1.4.0\docs\Installing Solr in Tomcat.pdf :: size=286242
Apr 18, 2010 10:31:34 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {} 0 640
Apr 18, 2010 10:31:34 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unable to extract PDF content
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:873)
    at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
    at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
    at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
    at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
    at java.lang.Thread.run(Thread.java:595)
Caused by: org.apache.tika.exception.TikaException: Unable to extract PDF content
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:58)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:51)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
    ... 20 more
Caused by: java.util.zip.ZipException: incorrect header check
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:140)
    at org.pdfbox.filter.FlateFilter.decode(FlateFilter.java:97)
    at org.pdfbox.cos.COSStream.doDecode(COSStream.java:290)
    at org.pdfbox.cos.COSStream.doDecode(COSStream.java:235)
    at org.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:170)
    at org.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:101)
    at org.pdfbox.cos.COSStream.getStreamTokens(COSStream.java:132)
    at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:202)
    at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
    at
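The innermost cause in the trace is java.util.zip.ZipException raised from InflaterInputStream, i.e. the bytes handed to the inflater did not look like a zlib stream. The sketch below only illustrates that JDK behaviour in isolation; it uses no PDFBox or Tika code, the class and method names (FlateCheck, inflateError, deflate) are made up, and it says nothing about why PDFBox passed such bytes for this particular PDF.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

/** Hypothetical demo: shows how the JDK inflater reacts to valid and invalid zlib input. */
public class FlateCheck {
    /** Returns null if the data inflates cleanly, otherwise "<ExceptionName>: <message>". */
    static String inflateError(byte[] data) {
        try (InflaterInputStream in = new InflaterInputStream(new ByteArrayInputStream(data))) {
            byte[] buf = new byte[4096];
            while (in.read(buf) != -1) {
                // Discard the output; only success or failure matters here.
            }
            return null;
        } catch (IOException e) {
            return e.getClass().getSimpleName() + ": " + e.getMessage();
        }
    }

    /** Compresses bytes into a well-formed zlib stream. */
    static byte[] deflate(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream d = new DeflaterOutputStream(out)) {
            d.write(raw);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] good = deflate("hello".getBytes(StandardCharsets.UTF_8));
        byte[] bad = {1, 2, 3, 4}; // arbitrary bytes, not a zlib header
        System.out.println("valid stream:   " + inflateError(good));
        System.out.println("invalid stream: " + inflateError(bad));
    }
}
```

For the invalid input the JDK reports a ZipException, typically with the same "incorrect header check" message seen in the trace, which would be consistent with Grant's reading that the problem sits in how the old PDFBox decodes the stream rather than in Solr itself.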
Re: Solr throws TikaException while parsing sample PDF
I'm using the Solr 1.4 distribution, with Solr Cell. Can I update only Tika to a new version in the Solr 1.4 distribution? If yes, is there any guide etc? Thanks. On Mon, Apr 19, 2010 at 4:36 PM, Koji Sekiguchi k...@r.email.ne.jp wrote: Praveen Agrawal wrote: Hi Grant, I tried the command line of Tika v0.7 (newest), and it parsed the file.. I believe Solr 1.4 contains version 0.4 of Tika. Do you suggest upgrading to the new Tika? Can I upgrade only Tika in Solr 1.4, or do I need to wait till Solr ships with the new Tika? Thanks. Solr trunk uses Tika 0.7. I'm not a SolrCell user, so this is just an FYI. Koji -- http://www.rondhuit.com/en/
Autofill 'id' field with the URL of files posted to Solr?
Hi, I need to submit thousands of online PDF/html files to Solr. I can submit one file using SolrJ (StreamingUpdateSolrServer and ..solr.common.util.ContentStreamBase.URLStream), setting the literal.id parameter to the url. I can't do the same with a batch of multiple files, as each 'id' should be unique (set to the file's url), and I couldn't get this to work. Is there a way to get the 'id' field set automatically to the url of each file posted to Solr (similar to 'stream_name')? How would I set this in solrconfig.xml or schema.xml, or is there any other way? Thanks.
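One workaround, since literal.* parameters apply to a whole request, is to send one extract request per file so that each request carries its own literal.id. The sketch below only builds the request URLs and does not call SolrJ; the class name ExtractRequests, the base URL in main, and the example document URLs are hypothetical, /update/extract is assumed to be where the extracting handler is mapped, and stream.url only works if remote streaming is enabled in solrconfig.xml.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: builds one extract request per document URL. */
public class ExtractRequests {
    /** URL-encodes a parameter value as UTF-8. */
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Builds a request whose id is the document's own URL. */
    static String requestFor(String solrBase, String docUrl) {
        return solrBase + "/update/extract"
                + "?literal.id=" + encode(docUrl)
                + "&stream.url=" + encode(docUrl);
    }

    /** One request per document, so every document gets a unique literal.id. */
    static List<String> requestsFor(String solrBase, List<String> docUrls) {
        List<String> requests = new ArrayList<>();
        for (String docUrl : docUrls) {
            requests.add(requestFor(solrBase, docUrl));
        }
        return requests;
    }

    public static void main(String[] args) {
        // Hypothetical Solr location and document URLs.
        List<String> docs = Arrays.asList(
                "http://example.com/docs/a.pdf", "http://example.com/docs/b.html");
        for (String request : requestsFor("http://localhost:8983/solr", docs)) {
            System.out.println(request);
        }
    }
}
```

The same idea carries over to SolrJ: create a separate update request per file and set literal.id on each, then issue a single commit after the whole batch rather than one per document.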