One of the files that the post tool identified as XML is not actually XML.
Possibly it is a 404 error page or some such, so the tool tries to parse
the file and sees non-XML content right at the start. Or, if you are sure
it is an XML file, maybe it starts with a BOM (byte order mark). Either
way, try to isolate the specific file.
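
To narrow it down, something like the following could be run against each
page the crawl touches. This is only a rough sketch and not part of
post.jar: the names diagnose_prolog and check_url are made up for
illustration, and it only looks at the first bytes of the response, which
is where a "line 1, column 1" error comes from.

```python
import codecs
import urllib.error
import urllib.request

# Byte order marks, longest first so UTF-32 LE is not mistaken for UTF-16 LE.
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
]


def diagnose_prolog(data: bytes) -> str:
    """Guess why an XML parser might reject `data` at line 1, column 1."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return f"starts with a {name} BOM"
    stripped = data.lstrip(b" \t\r\n")
    if not stripped:
        return "empty"
    if stripped.startswith(b"<"):
        return "looks like markup"
    return f"non-XML content, begins with {stripped[:20]!r}"


def check_url(url: str) -> str:
    """Fetch `url` (hypothetical helper) and report status plus diagnosis."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            content_type = resp.headers.get("Content-Type", "unknown")
            head = resp.read(64)  # the first bytes are enough here
            return f"{resp.status} {content_type}: {diagnose_prolog(head)}"
    except urllib.error.HTTPError as err:
        # 404 and friends end up here; the error body is usually not XML
        return f"HTTP error {err.code}"
```

Calling check_url on each suspect link should show which one answers with
an error status, or with a body that does not start like markup.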

On the bigger picture though, if crawling is an actual part of the project
rather than just a test, you should use a proper crawler that integrates
with Solr: Nutch, StormCrawler, etc.

Regards,
     Alex

On Thu, Apr 11, 2019, 6:09 AM Shivprasad Shetty, <shivpras...@orioninc.com>
wrote:

> Hello Team,
>
>
> I am working on Solr for the first time and have got the setup done. Now
> I have created a core using the command line and want to perform a web
> crawl of a third-party site.
> If I try it with individual links, I am able to do the crawl and index it
> to the core. This was done using:
> java -Ddata=web -Durl=https://solr:8985/solr/solrhelp/update -jar post.jar http://www.example.com
>
> Now what I intend to do is to give a URL, use the recursive option
> (-Drecursive), and let it crawl the entire site.
> Note that I am pointing to a website that has around 125 pages and I am
> using the commands below:
> java -Ddata=web -Durl=https://solr:8985/solr/solrhelp/update -Drecursive=yes -jar post.jar http://www.example.com
> and
> java -Ddata=web -Durl=https://solr:8985/solr/solrhelp/update -Drecursive=2 -jar post.jar http://www.example.com
>
> and I am getting the below error message.
> Error:
>
>
> POSTed web resource http://www.example.com (depth: 0)
> [Fatal Error] :1:1: Content is not allowed in prolog.
> Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
>         at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1252)
>         at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:616)
>         at org.apache.solr.util.SimplePostTool.postWebPages(SimplePostTool.java:563)
>         at org.apache.solr.util.SimplePostTool.doWebMode(SimplePostTool.java:365)
>         at org.apache.solr.util.SimplePostTool.execute(SimplePostTool.java:187)
>         at org.apache.solr.util.SimplePostTool.main(SimplePostTool.java:172)
> Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
>         at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
>         at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>         at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
>         at org.apache.solr.util.SimplePostTool.makeDom(SimplePostTool.java:1061)
>         at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1232)
>         ... 5 more
>
>
>
> I would be very grateful if anyone could help me solve this issue, which
> I have been trying to fix for a couple of days.
>
>
> Regards,
> ShivprasadS
>
>
