One of the files that the post tool identified as XML is not actually XML. It may be a 404 error page or something similar, so the parser tries to read the file and sees non-XML content right at the start. Or, if you are sure it is an XML file, maybe it begins with a BOM (byte order mark). Either way, try to isolate the specific file.
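If it helps to narrow things down, here is a rough sketch in Python (not part of post.jar) that looks at the first bytes of a saved response and says whether they look like the start of an XML document. The names `classify_prolog` and `classify_file` are made up for illustration, and the check is deliberately crude: an HTML error page also starts with "<", so this only catches a BOM, an empty body, or plain non-markup content, which is what an error at line 1, column 1 suggests.

```python
import codecs

# Byte order marks that can precede the first "<" of an XML document.
# The UTF-32 marks are listed first because the UTF-32 LE mark begins
# with the same two bytes as the UTF-16 LE mark.
_BOMS = (
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF16_BE,
)


def classify_prolog(data: bytes) -> str:
    """Roughly classify the first bytes of a response body (hypothetical helper)."""
    if not data:
        return "empty"
    for bom in _BOMS:
        if data.startswith(bom):
            return "bom"
    # Leading whitespace is tolerated here; anything else before "<" is suspicious.
    if data.lstrip(b" \t\r\n").startswith(b"<"):
        return "xml-like"
    return "not-xml"


def classify_file(path: str, sample_size: int = 64) -> str:
    """Classify a saved response by reading only its first few bytes."""
    with open(path, "rb") as handle:
        return classify_prolog(handle.read(sample_size))
```

Saving each fetched page to disk and running `classify_file` over the saved files should point at any response that is not "xml-like".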
On a bigger picture though, if crawling is an actual part of the project rather than just a test, you should use a proper crawler that integrates with Solr: Apache Nutch, StormCrawler, etc.

Regards,
   Alex

On Thu, Apr 11, 2019, 6:09 AM Shivprasad Shetty, <shivpras...@orioninc.com> wrote:

> Hello Team,
>
> I am working on solr for the first time and got the setup
> done. Now I have created a core using command line and want to perform
> webcrawl of a third party site.
> If I try it with individual links, I am able to do the crawl and index it
> to the core. This was done using
>
> java -Ddata=web -Durl=https://solr:8985/solr/solrhelp/update -jar post.jar http://www.example.com
>
> Now what I intend to do is to give a url and using the recursive option
> (-Drecursive) and let it crawl the entire site.
> Note that I am pointing to a website that has around 125 pages and I am
> using the below command
>
> java -Ddata=web -Durl=https://solr:8985/solr/solrhelp/update -Drecursive=yes -jar post.jar http://www.example.com
> and
> java -Ddata=web -Durl=https://solr:8985/solr/solrhelp/update -Drecursive=2 -jar post.jar http://www.example.com
>
> and I am getting the below error message.
> Error:
>
> POSTed web resource http://www.example.com (depth: 0)
> [Fatal Error] :1:1: Content is not allowed in prolog.
> Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
>         at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1252)
>         at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:616)
>         at org.apache.solr.util.SimplePostTool.postWebPages(SimplePostTool.java:563)
>         at org.apache.solr.util.SimplePostTool.doWebMode(SimplePostTool.java:365)
>         at org.apache.solr.util.SimplePostTool.execute(SimplePostTool.java:187)
>         at org.apache.solr.util.SimplePostTool.main(SimplePostTool.java:172)
> Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
>         at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
>         at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>         at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
>         at org.apache.solr.util.SimplePostTool.makeDom(SimplePostTool.java:1061)
>         at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1232)
>         ... 5 more
>
> I would be very grateful if anyone could get me to solve this issue I have
> been trying to fix for a couple of days.
>
> Regards,
> ShivprasadS