Re: solr 7.0.1: exception running post to crawl simple website

2017-10-27 Thread Cassandra Targett
Toby, Your mention of "-recursive" causing a problem reminded me of a simple crawl (of the 7.0 Ref Guide) using bin/post I was trying to get to work the other day and couldn't. The order of the parameters seems to make a difference with what error you get (this is using 7.1): 1. "./bin/post -c

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-27 Thread toby1851
Amrit Sarkar wrote
> The above is SAXParse, runtime exception. Nothing can be done at Solr end
> except curating your own data.
I'm trying to replace a solr-4.6.0 system (which has been working brilliantly for 3 years!) with solr-7.1.0. I'm running into this exact same problem. I do not believe

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Rick Leir
On 2017-10-13 04:19 PM, Kevin Layer wrote: Amrit Sarkar wrote: Kevin, fileType => md is not recognizable format in SimplePostTool, anyway, moving on. OK, thanks. Looks like I'll have to abandon using solr for this project (or find another way to crawl the site). Thank you for all the help,

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Kevin Layer
Amrit Sarkar wrote:
>> Kevin,
>>
>> fileType => md is not recognizable format in SimplePostTool, anyway, moving
>> on.
OK, thanks. Looks like I'll have to abandon using solr for this project (or find another way to crawl the site). Thank you for all the help, though. I appreciate it.
>> The

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Amrit Sarkar
Kevin, fileType => md is not recognizable format in SimplePostTool, anyway, moving on. The above is SAXParse, runtime exception. Nothing can be done at Solr end except curating your own data. Some helpful links:

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Kevin Layer
Amrit Sarkar wrote:
>> Kevin,
>>
>> I am not able to replicate the issue on my system, which is bit annoying
>> for me. Try this out for last time:
>>
>> docker exec -it --user=solr solr bin/post -c handbook
>> http://quadra.franz.com:9091/index.md -recursive 10 -delay 0 -filetypes html
>>
>>

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Amrit Sarkar
Kevin,
I am not able to replicate the issue on my system, which is bit annoying for me. Try this out for last time:
docker exec -it --user=solr solr bin/post -c handbook http://quadra.franz.com:9091/index.md -recursive 10 -delay 0 -filetypes html
and have Content-Type: "html" and "text/html",

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Kevin Layer
Amrit Sarkar wrote:
>> ah oh, dockers. They are placed under [solr-home]/server/log/solr/log in
>> the machine. I haven't played much with docker, any way you can get that
>> file from that location.
I see these files:
/opt/solr/server/logs/archived
/opt/solr/server/logs/solr_gc.log.0.current

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Amrit Sarkar
pardon: [solr-home]/server/log/solr.log

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Amrit Sarkar
ah oh, dockers. They are placed under [solr-home]/server/log/solr/log in the machine. I haven't played much with docker, any way you can get that file from that location.

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Kevin Layer
Amrit Sarkar wrote:
>> Hi Kevin,
>>
>> Can you post the solr log in the mail thread. I don't think it handled the
>> .md by itself by first glance at code.
Note that when I use the admin web interface, and click on "Logging" on the left, I just see a spinner that implies it's trying to retrieve

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Kevin Layer
Amrit Sarkar wrote:
>> Hi Kevin,
>>
>> Can you post the solr log in the mail thread. I don't think it handled the
>> .md by itself by first glance at code.
How do I extract the log you want?

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Amrit Sarkar
Hi Kevin, Can you post the solr log in the mail thread. I don't think it handled the .md by itself by first glance at code.

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Kevin Layer
Amrit Sarkar wrote:
>> Kevin,
>>
>> Just put "html" too and give it a shot. These are the types it is expecting:
Same thing.
>>
>> mimeMap = new HashMap<>();
>> mimeMap.put("xml", "application/xml");
>> mimeMap.put("csv", "text/csv");
>> mimeMap.put("json", "application/json");
>>

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Kevin Layer
Amrit Sarkar wrote:
>> Reference to the code:
>>
>> .
>>
>> String rawContentType = conn.getContentType();
>> String type = rawContentType.split(";")[0];
>> if(typeSupported(type) || "*".equals(fileTypes)) {
>> String encoding = conn.getContentEncoding();
>>
>> .
>>
>> protected

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Amrit Sarkar
Ah! Only supported type is: text/html; encoding=utf-8
I am not confident of this either :) but this should work. See the code-snippet below:
..
if(res.httpStatus == 200) {
  // Raw content type of form "text/html; encoding=utf-8"
  String rawContentType = conn.getContentType();
  String

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Amrit Sarkar
Kevin,
Just put "html" too and give it a shot. These are the types it is expecting:
mimeMap = new HashMap<>();
mimeMap.put("xml", "application/xml");
mimeMap.put("csv", "text/csv");
mimeMap.put("json", "application/json");
mimeMap.put("jsonl", "application/json");
mimeMap.put("pdf",

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Amrit Sarkar
Reference to the code:
.
String rawContentType = conn.getContentType();
String type = rawContentType.split(";")[0];
if(typeSupported(type) || "*".equals(fileTypes)) {
  String encoding = conn.getContentEncoding();
.
protected boolean typeSupported(String type) {
  for(String key :
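The check quoted above can be illustrated with a small standalone sketch. This is not the SimplePostTool source: the class name FileTypeFilter, the two-argument typeSupported signature, and the "html" map entry are assumptions made for illustration, and only the xml, csv, json and jsonl entries are taken from the thread. It shows one plausible way a -filetypes value could be matched against a response's MIME type, under which a value that has no entry in the map (such as "md") never matches text/html.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; names and matching rule are assumptions.
class FileTypeFilter {
    // Extension -> MIME type, mirroring the entries quoted in the thread.
    static final Map<String, String> MIME_MAP = new HashMap<>();
    static {
        MIME_MAP.put("xml", "application/xml");
        MIME_MAP.put("csv", "text/csv");
        MIME_MAP.put("json", "application/json");
        MIME_MAP.put("jsonl", "application/json");
        MIME_MAP.put("html", "text/html"); // assumed entry, not shown in the thread
    }

    // True when some extension listed in fileTypes maps to the given MIME type,
    // or when fileTypes is the wildcard "*".
    static boolean typeSupported(String type, String fileTypes) {
        if ("*".equals(fileTypes)) {
            return true;
        }
        Set<String> selected = new HashSet<>(Arrays.asList(fileTypes.split(",")));
        for (Map.Entry<String, String> entry : MIME_MAP.entrySet()) {
            if (entry.getValue().equals(type) && selected.contains(entry.getKey())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(typeSupported("text/html", "md"));
        System.out.println(typeSupported("text/html", "html"));
        System.out.println(typeSupported("text/html", "*"));
    }
}
```

Under these assumptions, the extension in the URL (index.md) plays no part in the decision; only the Content-Type header and the -filetypes list do.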

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Kevin Layer
Amrit Sarkar wrote:
>> Strange,
>>
>> Can you add: "text/html;charset=utf-8". This is wiki.apache.org page's
>> Content-Type. Let's see what it says now.
Same thing. Verified Content-Type:
quadra[git:master]$ wget -S -O /dev/null http://quadra:9091/index.md |& grep Content-Type
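For readers who want to run the same Content-Type check from Java rather than wget, here is a rough, self-contained sketch. It does not contact the site from the thread; instead it starts a throwaway local server using the JDK's built-in com.sun.net.httpserver package, with one path that sets Content-Type and one that does not. The class name ContentTypeProbe, the method names, and the /typed and /untyped paths are all hypothetical.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; not part of Solr.
class ContentTypeProbe {
    // Returns the raw Content-Type header for a URL, or null if the server sent none.
    static String probe(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        try {
            conn.getResponseCode(); // forces the request to be sent
            return conn.getContentType();
        } finally {
            conn.disconnect();
        }
    }

    // Local demo server: /typed sets Content-Type, /untyped omits it.
    static HttpServer startDemoServer() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        byte[] body = "<h1>hello</h1>".getBytes(StandardCharsets.UTF_8);
        server.createContext("/typed", ex -> {
            ex.getResponseHeaders().set("Content-Type", "text/html; charset=utf-8");
            ex.sendResponseHeaders(200, body.length);
            ex.getResponseBody().write(body);
            ex.close();
        });
        server.createContext("/untyped", ex -> {
            ex.sendResponseHeaders(200, body.length);
            ex.getResponseBody().write(body);
            ex.close();
        });
        server.start();
        return server;
    }

    // Probes both paths and returns {typed header, untyped header}.
    static String[] runDemo() {
        try {
            HttpServer server = startDemoServer();
            try {
                String base = "http://127.0.0.1:" + server.getAddress().getPort();
                return new String[] { probe(base + "/typed"), probe(base + "/untyped") };
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] headers = runDemo();
        System.out.println("typed:   " + headers[0]);
        System.out.println("untyped: " + headers[1]);
    }
}
```

To check a real server, probe could be called with its URL directly; the local server is only there so the sketch runs without network access.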

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Amrit Sarkar
Strange, Can you add: "text/html;charset=utf-8". This is wiki.apache.org page's Content-Type. Let's see what it says now.

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Kevin Layer
OK, so I hacked markserv to add Content-Type text/html, but now I get
SimplePostTool: WARNING: Skipping URL with unsupported type text/html
What is it expecting?
$ docker exec -it --user=solr solr bin/post -c handbook http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Kevin Layer
Amrit Sarkar wrote:
>> Kevin,
>>
>> You are getting NPE at:
>>
>> String type = rawContentType.split(";")[0]; //HERE - rawContentType is NULL
>>
>> // related code
>>
>> String rawContentType = conn.getContentType();
>>
>> public String getContentType() {
>> return

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Amrit Sarkar
Kevin,
You are getting NPE at:
String type = rawContentType.split(";")[0]; //HERE - rawContentType is NULL
// related code
String rawContentType = conn.getContentType();
public String getContentType() {
  return getHeaderField("content-type");
}
HttpURLConnection conn = (HttpURLConnection)
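A minimal sketch of how the null case described above could be guarded against follows. It is not the SimplePostTool code and not a proposed patch; the class ContentTypeGuard and the method baseType are hypothetical names. The idea is simply to test for a missing Content-Type header before splitting it.

```java
// Illustrative sketch only; names are hypothetical.
class ContentTypeGuard {
    // Returns the MIME type without parameters ("text/html" for
    // "text/html; charset=utf-8"), or null when the header is absent.
    static String baseType(String rawContentType) {
        if (rawContentType == null) {
            return null; // server sent no Content-Type header
        }
        // indexOf is used instead of split(";")[0] because ";".split(";")
        // yields an empty array in Java, which would throw on [0].
        int semi = rawContentType.indexOf(';');
        String base = semi >= 0 ? rawContentType.substring(0, semi) : rawContentType;
        return base.trim();
    }

    public static void main(String[] args) {
        System.out.println(baseType("text/html; charset=utf-8"));
        System.out.println(baseType(null));
    }
}
```

A caller could then treat a null result as "unknown type, skip this URL" rather than failing with a NullPointerException.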

solr 7.0.1: exception running post to crawl simple website

2017-10-11 Thread Kevin Layer
I want to use solr to index a markdown website. The files are in native markdown, but they are served in HTML (by markserv). Here's what I did:
docker run --name solr -d -p 8983:8983 -t solr
docker exec -it --user=solr solr bin/solr create_core -c handbook
Then, to crawl the site: