RE: Problems indexing very large set of documents

2011-04-11 Thread Brandon Waterloo
e.apache.org Subject: Re: Problems indexing very large set of documents There is a library called iText. It parses and writes PDFs very very well, and a simple program will let you do a batch conversion. PDFs are made by a wide range of programs, not just Adobe code. Many of these do weird things and

Re: Problems indexing very large set of documents

2011-04-10 Thread Lance Norskog
> From: Ezequiel Calderara [ezech...@gmail.com] > Sent: Friday, April 08, 2011 11:35 AM > To: solr-user@lucene.apache.org > Cc: Brandon Waterloo > Subject: Re: Problems indexing very large set of documents > > Maybe those files are created w

RE: Problems indexing very large set of documents

2011-04-08 Thread Brandon Waterloo
ara [ezech...@gmail.com] Sent: Friday, April 08, 2011 11:35 AM To: solr-user@lucene.apache.org Cc: Brandon Waterloo Subject: Re: Problems indexing very large set of documents Maybe those files are created with a different Adobe Format version... See this: http://lucene.472066.n3.nabble.com/PDF-parser

Re: Problems indexing very large set of documents

2011-04-08 Thread Ezequiel Calderara
ay, April 08, 2011 10:40 AM >> To: solr-user@lucene.apache.org >> Subject: RE: Problems indexing very large set of documents >> >> I had some time to do some research into the problems. From what I can >> tell, it appears Solr is tripping up over the filename. These are str

Re: Problems indexing very large set of documents

2011-04-08 Thread Ezequiel Calderara
cond naming format. > > From: Brandon Waterloo [brandon.water...@matrix.msu.edu] > Sent: Friday, April 08, 2011 10:40 AM > To: solr-user@lucene.apache.org > Subject: RE: Problems indexing very large set of documents > > I had some time to do some research into the problems. From what I

RE: Problems indexing very large set of documents

2011-04-08 Thread Brandon Waterloo
. From: Brandon Waterloo [brandon.water...@matrix.msu.edu] Sent: Friday, April 08, 2011 10:40 AM To: solr-user@lucene.apache.org Subject: RE: Problems indexing very large set of documents I had some time to do some research into the problems. From what I can

RE: Problems indexing very large set of documents

2011-04-08 Thread Brandon Waterloo
From: Chris Hostetter [hossman_luc...@fucit.org] Sent: Tuesday, April 05, 2011 3:03 PM To: solr-user@lucene.apache.org Subject: RE: Problems indexing very large set of documents : It wasn't just a single file, it was dozens of files all having problems : toward the end just bef

RE: Problems indexing very large set of documents

2011-04-05 Thread Chris Hostetter
: It wasn't just a single file, it was dozens of files all having problems : toward the end just before I killed the process. ... : That is by no means all the errors, that is just a sample of a few. : You can see they all threw HTTP 500 errors. What is strange is, nearly : every file

Re: Problems indexing very large set of documents

2011-04-05 Thread Anuj Kumar
errors. What is strange is, nearly every > file succeeded before about the 2200-files-mark, and nearly every file after > that failed. > > > ~Brandon Waterloo > > ____ > From: Anuj Kumar [anujs...@gmail.com] > Sent: Monday, April 04, 2011 2:48 PM >

RE: Problems indexing very large set of documents

2011-04-05 Thread Brandon Waterloo
at failed. ~Brandon Waterloo From: Anuj Kumar [anujs...@gmail.com] Sent: Monday, April 04, 2011 2:48 PM To: solr-user@lucene.apache.org Cc: Brandon Waterloo Subject: Re: Problems indexing very large set of documents In the log messages are you able to locate

Re: Problems indexing very large set of documents

2011-04-04 Thread Anuj Kumar
solr-user@lucene.apache.org > Cc: Brandon Waterloo > Subject: Re: Problems indexing very large set of documents > > This is related to Apache TIKA. Which version are you using? > Please see this thread for more details- > http://lucene.472066.n3.nabble.com/PDF-parser-exceptio

RE: Problems indexing very large set of documents

2011-04-04 Thread Brandon Waterloo
andon Waterloo Subject: Re: Problems indexing very large set of documents This is related to Apache TIKA. Which version are you using? Please see this thread for more details- http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html

Re: Problems indexing very large set of documents

2011-04-04 Thread Anuj Kumar
This is related to Apache TIKA. Which version are you using? Please see this thread for more details- http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html Hope it helps. Regards, Anuj On Mon, Apr 4, 2011 at 1

Problems indexing very large set of documents

2011-04-04 Thread Brandon Waterloo
Hey everybody, I've been running into some issues indexing a very large set of documents. There are about 4000 PDF files, ranging in size from 160MB to 10KB. Obviously this is a big task for Solr. I have a PHP script that iterates over the directory and uses PHP cURL to query Solr to index th
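
For readers following the thread, here is a rough sketch of the kind of loop the original post describes: walk a directory of PDFs and post each one to Solr's extracting handler. It is written in Python rather than PHP and is not the poster's script. The host, port, and handler path in SOLR_EXTRACT_URL, using the filename as literal.id, and the helper names are all assumptions for illustration. Since a later reply suspects Solr is tripping over the filename, the sketch URL-encodes the filename before it goes into the query string.

```python
from pathlib import Path
from urllib.parse import urlencode
import urllib.error
import urllib.request

# Assumption: a default single-core Solr with the extracting handler
# mapped at /update/extract. Adjust to match your own installation.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def build_extract_url(pdf_path, base_url=SOLR_EXTRACT_URL):
    """Build the request URL, URL-encoding the filename used as the document id."""
    params = {"literal.id": Path(pdf_path).name, "commit": "false"}
    return base_url + "?" + urlencode(params)


def iter_pdfs(directory):
    """Return the PDF files directly inside `directory`, in sorted order."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def index_pdf(pdf_path, base_url=SOLR_EXTRACT_URL):
    """POST one PDF to Solr and return the HTTP status code."""
    request = urllib.request.Request(
        build_extract_url(pdf_path, base_url),
        data=Path(pdf_path).read_bytes(),
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=300) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # An HTTP 500 from the extracting handler ends up here.
        return err.code


def index_directory(directory):
    """Index every PDF and return the (filename, status) pairs that failed."""
    failures = []
    for pdf in iter_pdfs(directory):
        status = index_pdf(pdf)
        if status != 200:
            failures.append((pdf.name, status))
    return failures
```

Collecting the failures instead of stopping at the first HTTP 500 makes it easier to see whether the failing files share something in common, such as a naming format. With commit set to false on every request, a separate commit would still be needed once the run finishes.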