Index pdf files.
Hi, I'm new to Solr. I want to index PDF files using the Data Import Handler. I'm using Solr 4.3.0. I followed the steps given in this post:
http://lucene.472066.n3.nabble.com/indexing-with-DIH-and-with-problems-td3731129.html

However, I get the following error:

Full Import failed:java.lang.NoClassDefFoundError: org/apache/tika/parser/Parser

Please help! Thanks

--
View this message in context: http://lucene.472066.n3.nabble.com/Index-pdf-files-tp4074278.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Index pdf files.
The Tika jars are not in your classpath. You need to add all the jars inside the contrib/extraction/lib directory to your classpath.

On Mon, Jul 1, 2013 at 2:00 PM, archit2112 archit2...@gmail.com wrote:
Full Import failed:java.lang.NoClassDefFoundError: org/apache/tika/parser/Parser

--
Regards,
Shalin Shekhar Mangar.
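A quick way to confirm a missing-jar diagnosis like this one is to ask the JVM whether it can load the class named in the error. The following is only a minimal sketch: the class name ClasspathCheck and its method are made up for illustration, and the result only describes the JVM and classpath the check itself runs with, which may differ from the class loaders used inside the servlet container that hosts Solr.

```java
// Diagnostic sketch (hypothetical helper, not part of Solr or Tika).
final class ClasspathCheck {

    /** Returns true if the named class can be located by this class's loader. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class named in the NoClassDefFoundError reported above
        String tikaParser = "org.apache.tika.parser.Parser";
        System.out.println(tikaParser + " on classpath: " + isOnClasspath(tikaParser));
    }
}
```

Running it with the jars from contrib/extraction/lib on the classpath, and once without them, should show the difference.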
Re: Index pdf files.
Hi, thanks a lot. I did what you said. Now I'm getting the following error:

Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: java.util.regex.PatternSyntaxException: Dangling meta character '*' near index 0
Re: Index pdf files.
OK, have you done anything custom? Where do you get this? In the Solr logs? Echoed back in the browser? In response to what command?

You haven't provided enough info to help us help you. You might review:
http://wiki.apache.org/solr/UsingMailingLists

Best
Erick

On Mon, Jul 1, 2013 at 6:08 AM, archit2112 archit2...@gmail.com wrote:
Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: java.util.regex.PatternSyntaxException: Dangling meta character '*' near index 0
Re: Index pdf files.
I figured it out. It was a problem with the regular expression I used in data-config.xml.
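The thread never shows the offending expression, so the following is a guess at a typical cause rather than the poster's actual configuration: a shell-style glob such as *.pdf used where a Java regular expression is expected. In a regex, '*' is a quantifier and needs something before it to repeat, which is what "Dangling meta character '*' near index 0" means. The class name FileNamePatternCheck and the sample patterns are illustrative only.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Sketch: compare a glob-style pattern with its regex equivalent.
final class FileNamePatternCheck {

    /** Returns true if the string compiles as a java.util.regex pattern. */
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String glob = "*.pdf";      // glob syntax: '*' has nothing to repeat, so compile fails
        String regex = ".*\\.pdf";  // regex syntax: any characters, then a literal ".pdf"
        System.out.println(glob + " is a valid regex: " + isValidRegex(glob));
        System.out.println(regex + " is a valid regex: " + isValidRegex(regex));
        System.out.println("matches report.pdf: " + Pattern.matches(regex, "report.pdf"));
    }
}
```

In an XML file such as data-config.xml the pattern would be written with a single backslash (.*\.pdf); the doubled backslash above is only Java string escaping.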
index pdf files
I wrote a simple Java program to import a PDF file. I get a result when I search *:* from the admin page, but I get nothing if I search for a word. I wonder if I did something wrong or missed a setting.

Here is part of the result I get when I do the *:* search:

    <doc>
      <arr name="attr_Author"><str>Hristovski D</str></arr>
      <arr name="attr_Content-Type"><str>application/pdf</str></arr>
      <arr name="attr_Keywords"><str>microarray analysis, literature-based discovery, semantic predications, natural language processing</str></arr>
      <arr name="attr_Last-Modified"><str>Thu Aug 12 10:58:37 EDT 2010</str></arr>
      <arr name="attr_content"><str>Combining Semantic Relations and DNA Microarray Data for Novel Hypotheses Generation Combining Semantic Relations and DNA Microarray Data for Novel Hypotheses Generation Dimitar Hristovski, PhD,1 Andrej Kastrin,2...

Please help me out if anyone has experience with PDF files. I really appreciate it!

Thanks so much,
Re: index pdf files
To help you, we need the description of your fields in your schema.xml and the query that you run when you search for only a single word.

Marco Martínez Bautista
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42

2010/8/12 Ma, Xiaohui (NIH/NLM/LHC) [C] xiao...@mail.nlm.nih.gov
I wrote a simple Java program to import a PDF file.
RE: index pdf files
Thanks so much. I didn't know how to make any changes in schema.xml for PDF files, so I used the default Solr schema.xml. Please tell me what I need to do in schema.xml. The simple Java program I use is below. I also attached the PDF file. I really appreciate your help!

    import java.io.File;
    import java.io.IOException;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

    public class importPDF {
        public static void main(String[] args) {
            try {
                String fileName = "pub2009001.pdf";
                String solrId = "pub2009001.pdf";
                indexFilesSolrCell(fileName, solrId);
            } catch (Exception ex) {
                System.out.println(ex.toString());
            }
        }

        public static void indexFilesSolrCell(String fileName, String solrId)
                throws IOException, SolrServerException {
            String urlString = "http://lhcinternal.nlm.nih.gov:8989/solr/lhcpdf";
            SolrServer solr = new CommonsHttpSolrServer(urlString);
            // Send the file to the extraction handler
            ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
            up.addFile(new File(fileName));
            up.setParam("literal.id", solrId);           // document id
            up.setParam("uprefix", "attr_");             // prefix for fields not in the schema
            up.setParam("fmap.content", "attr_content"); // extracted text goes to attr_content
            up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
            solr.request(up);
        }
    }

-----Original Message-----
From: Marco Martinez [mailto:mmarti...@paradigmatecnologico.com]
Sent: Thursday, August 12, 2010 11:45 AM
To: solr-user@lucene.apache.org
Subject: Re: index pdf files

To help you we need the description of your fields in your schema.xml and the query that you do when you search only a single word.
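The program above maps the extracted text to attr_content. One possible reason a bare word finds nothing (an assumption, since the thread never shows the schema or the failing query) is that an unqualified query searches the schema's default field rather than attr_content. Naming the field in the query is a quick way to test that. The sketch below only builds the request URL; the class FieldQueryUrl is made up for illustration, and it assumes the standard /select handler is available on the core.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Sketch: build a field-qualified Solr query URL (hypothetical helper).
final class FieldQueryUrl {

    /** Builds <coreUrl>/select?q=<field>:<term> with the query URL-encoded. */
    static String build(String coreUrl, String field, String term) {
        // A plain word needs no Solr query escaping; special characters would
        String query = field + ":" + term;
        try {
            return coreUrl + "/select?q=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // Core URL taken from the program above; the search term is an example
        String core = "http://lhcinternal.nlm.nih.gov:8989/solr/lhcpdf";
        System.out.println(build(core, "attr_content", "microarray"));
    }
}
```

If the field-qualified query returns the document while the bare word does not, the default search field is the likely culprit.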
RE: index pdf files
Does anyone know if I need to define fields in schema.xml for indexing PDF files? If I do, please tell me how. For XML files, I defined fields in schema.xml and created a data-config file using XPath. Would you please tell me whether I need to do the same for PDF files, and how? Thanks so much for your help as always!
Re: index pdf files
Maybe this helps:
http://www.packtpub.com/article/indexing-data-solr-1.4-enterprise-search-server-2

Cheers,
Stefan

On 12.08.2010 19:45, Ma, Xiaohui (NIH/NLM/LHC) [C] wrote:
Does anyone know if I need define fields in schema.xml for indexing pdf files?

--
Stefan Moises
Senior Software Developer
shoptimax GmbH
Guntherstraße 45 a
90461 Nürnberg
Amtsgericht Nürnberg HRB 21703
Managing Director: Friedrich Schreieck
Tel.: 0911/25566-25
Fax: 0911/25566-29
moi...@shoptimax.de
http://www.shoptimax.de
RE: index pdf files
Thanks so much for your help! I defined a dynamic field in schema.xml as follows:

    <dynamicField name="metadata_*" type="string" indexed="true" stored="true" multiValued="false"/>

But I wonder what I should put for <uniqueKey></uniqueKey>. I really appreciate your help!

-----Original Message-----
From: Stefan Moises [mailto:moi...@shoptimax.de]
Sent: Thursday, August 12, 2010 1:58 PM
To: solr-user@lucene.apache.org
Subject: Re: index pdf files

Maybe this helps: http://www.packtpub.com/article/indexing-data-solr-1.4-enterprise-search-server-2
RE: index pdf files
Thanks so much. I got it working now. I really appreciate your help!

Xiaohui
RE: index pdf files
I got the following error when I indexed some PDF files. I wonder if anyone has had this issue before and knows how to fix it. Thanks so much in advance!

HTTP ERROR: 500

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.pdfpar...@44ffb2

org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.pdfpar...@44ffb2
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
    at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
    at org.mortbay.jetty.Server.handle(Server.java:285)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
    at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
    at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.pdfpar...@44ffb2
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:121)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
Re: index pdf files
: Subject: index pdf files
: References: aanlktim1wgref511p+unovqcu=b0usxnm8vxzn5bu...@mail.gmail.com
:     4c63ed43.4030...@r.email.ne.jp
:     aanlkti=28tulxqjtibrwcbxtok0avwbvbrjnxpdej...@mail.gmail.com
: In-Reply-To: aanlkti=28tulxqjtibrwcbxtok0avwbvbrjnxpdej...@mail.gmail.com

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to an existing message; instead, start a fresh email. Even if you change the subject line of your email, other mail headers still track which thread you replied to, so your question is hidden in that thread and gets less attention. It also makes following discussions in the mailing list archives particularly difficult.

See Also: http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking

-Hoss