Solr indexing with Tika DIH - ZeroByteFileException

2019-06-11 Thread neilb
Hi, while going through the Solr logs, I found data import errors for certain
documents. Here are the details of the error.

Exception while processing: file document :
null:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read content Processing Document # 7866
    at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:69)
    at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:171)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:267)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:517)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:424)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
    at org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:466)
    at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:122)
    at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:165)


How do I find out which document (name and path) is #7866? And how can I
ignore ZeroByteFileException? The document network share is not under my
control, and users can upload PDFs of any size to it.
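
One way to narrow down the offending files, offered as a sketch rather than a
confirmed fix: ZeroByteFileException is raised for empty input, so a recursive
scan of the same directory that baseDir points to should list the candidates.
The function name find_zero_byte_files and the ".pdf" suffix filter below are
assumptions made for illustration; they do not come from Solr or Tika.

```python
import os
from pathlib import Path
from typing import Iterator


def find_zero_byte_files(base_dir: str, suffix: str = ".pdf") -> Iterator[Path]:
    """Yield files under base_dir (recursively) whose name ends with
    suffix (case-insensitive) and whose size is 0 bytes."""
    for root, _dirs, files in os.walk(base_dir):
        for name in files:
            if not name.lower().endswith(suffix):
                continue
            path = Path(root) / name
            try:
                if path.stat().st_size == 0:
                    yield path
            except OSError:
                # The file vanished or cannot be stat'ed; skip it in this sketch.
                continue
```

Calling list(find_zero_byte_files(r"\\myserver\pdfshare")) under an account
that can read the share would return the paths of the empty PDFs. On the
import side, DIH entities accept an onError attribute (abort, skip or
continue); whether onError="skip" on the Tika entity suppresses this
particular exception in a given Solr version is an assumption worth testing
on a small sample first.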

Thanks!



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr indexing with Tika DIH local vs network share

2019-04-04 Thread neilb
Thank you Erick, this is very helpful!





Re: Solr indexing with Tika DIH local vs network share

2019-03-29 Thread neilb
Hi Erick, I am using the solrconfig.xml from the samples, and it has very few
entries. I have attached my config files for review along with this reply.

Thanks
solrconfig.xml
tika-data-config.xml
managed-schema







Re: Solr indexing with Tika DIH local vs network share

2019-03-29 Thread neilb
Hi Erick, thanks a lot for your suggestions. I will look into them. But to
answer my own query: I was a little impatient and was checking the indexing
status every minute. What I found is that after a few hours the status started
updating with the document count, and the indexing process finished in around
5 hours.
Do you see anything wrong with the current setup of Solr and Tika DIH? All I
am looking for is full-text search over PDFs, integrated into a web app
dashboard using AJAX queries. Also, this particular article was helpful in
getting Solr running as a Windows service with a 4G memory configuration under
the LocalSystem account.

Thanks again!





Solr indexing with Tika DIH local vs network share

2019-03-26 Thread neilb
Hi, I am trying to set up Solr for our project so that it can return full-text
search results on PDF documents. I am able to run the sample Tika DIH example
locally on my Windows server machine. It indexes all PDF documents
recursively under the "baseDir" set in the config XML. At present "baseDir"
points to a local folder on the same machine that holds around 10K PDF files.
This whole setup works as expected.

The next step is to import PDF documents located on a network share. I created
another core with very similar configuration files, except that this time
baseDir points to the network share ("\\myserver\pdfshare"). I have had no
success indexing these documents on the newly created core. I have tried
mapping the network share to a local drive and updating the config
accordingly, but still no success.
I did manage to copy all PDF files from the network share to the local folder
that the example core with the sample Tika DIH points to, and there I am able
to index all of them.

So I am not sure why the Tika config with the network path is not able to
index the files. Looking into the log I can see the following entries, but
they don't explain anything. Can someone guide me in resolving the issue?

2019-03-26 13:58:37.250 DEBUG (Scheduler-1147580192) [   ] o.e.j.i.FillInterest onFail FillInterest@419eacc8{AC.ReadCB@1ad637ed{HttpConnection@1ad637ed::SocketChannelEndPoint@6190d407{/10.206.11.68:51486<->/10.205.53.163:8983,OPEN,fill=FI,flush=-,to=120010/12}{io=1/1,kio=1,kro=1}->HttpConnection@1ad637ed[p=HttpParser{s=START,0 of -1},g=HttpGenerator@7d81e85c{s=START}]=>HttpChannelOverHttp@10e588cc{r=2,c=false,a=IDLE,uri=null,age=0}}}
java.util.concurrent.TimeoutException: Idle timeout expired: 120010/12 ms
    at org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:166) [jetty-io-9.4.14.v20181114.jar:9.4.14.v20181114]
    at org.eclipse.jetty.io.IdleTimeout$1.run(IdleTimeout.java:50) [jetty-io-9.4.14.v20181114.jar:9.4.14.v20181114]
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) [?:1.8.0_201]
    at java.util.concurrent.FutureTask.run(Unknown Source) [?:1.8.0_201]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(Unknown Source) [?:1.8.0_201]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) [?:1.8.0_201]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:1.8.0_201]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:1.8.0_201]
    at java.lang.Thread.run(Unknown Source) [?:1.8.0_201]


Is it possible that Solr is not able to access the network share? Is there
any way that I can run solr.cmd under a different user (one who has access to
the network share) in a Windows environment?
Please let me know if you need any more details about the issue.
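
A quick way to test the access question, as a rough sketch: run a small check
under the same Windows account that starts Solr and see whether the share can
be listed at all. The helper name check_share_access is made up for
illustration. One possibly relevant detail is that mapped drive letters on
Windows belong to a logon session, so a process started as a service or under
another account may not see a drive mapped interactively; the UNC path is the
safer thing to test.

```python
import os


def check_share_access(path: str) -> dict:
    """Report whether path exists, is a directory, and can be listed
    by the account running this script."""
    result = {
        "exists": os.path.exists(path),
        "is_dir": os.path.isdir(path),
        "listable": False,
        "error": None,
    }
    try:
        # Opening the directory and reading one entry is enough to prove
        # that the current account may list it.
        with os.scandir(path) as entries:
            next(entries, None)
        result["listable"] = True
    except OSError as exc:
        result["error"] = str(exc)
    return result
```

For example, print(check_share_access(r"\\myserver\pdfshare")) run from a
command prompt opened as the Solr account shows whether the failure is a
permission problem or something in the DIH configuration.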


Thanks in advance



