Wow! Thanks a lot for the helpful tips. I will implement this in the next two days and report back with my indexing speed. I have one more question...
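Just to check my understanding before I start, this is roughly the shape I have in mind for the plain-Lucene side of those tips (only a sketch: Lucene 4.x-era API, placeholder analyzer and index path; since I actually index through Solr, I would set ramBufferSizeMB in solrconfig.xml instead):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class FastIndexingSettings {
    // Opens a single IndexWriter with the suggested settings; the writer is
    // thread-safe, so several indexing threads can share it.
    public static IndexWriter openWriter() throws Exception {
        IndexWriterConfig cfg =
            new IndexWriterConfig(Version.LUCENE_44, new StandardAnalyzer(Version.LUCENE_44));
        cfg.setRAMBufferSizeMB(512);                         // highish RAM buffer (~512 MB)
        return new IndexWriter(
            FSDirectory.open(new File("/local/ssd/index")),  // local disk / SSD, not a network mount
            cfg);
        // ...and call commit() only every few hours, never after every document.
    }
}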
I tried committing to SolrCloud, but something was not right: it would stop indexing after a few documents. There also seems to be a problem around ZooKeeper: when we add documents using SolrJ it works fine as long as the insert load is light, but once we start doing many inserts it throws a lot of errors. I am doing something like this:

CloudSolrServer solrCoreCloud = new CloudSolrServer(cloudURL);
solrCoreCloud.setDefaultCollection("Image");
UpdateResponse up = solrCoreCloud.addBean(resultItem);
UpdateResponse upr = solrCoreCloud.commit();

Since I have to reindex anyway, I am wondering whether I need to use SolrCloud at all? (A rough sketch of the batched version I have in mind is at the bottom of this mail.)

On Wed, Oct 23, 2013 at 8:41 PM, Michael McCandless <luc...@mikemccandless.com> wrote:

> Indexing 100M web pages really should not take months; if you fix
> committing after every row that should make things much faster.
>
> Use multiple index threads, set a highish RAM buffer (~512 MB), use a
> local disk not a remote mounted fileserver, ideally an SSD, etc. See
> http://wiki.apache.org/lucene-java/ImproveIndexingSpeed for more ideas.
>
> Only commit periodically, when enough indexing has happened that you
> would be upset to lose that work since the last commit (e.g. maybe
> every few hours or something).
>
> Also, be sure your IO system is "healthy" / does not disregard fsync,
> and if the index is really important, back it up to a different
> storage device every so often.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Wed, Oct 23, 2013 at 10:58 AM, Chris <christu...@gmail.com> wrote:
> > Actually, it contains about 100 million webpages and was built out of a
> > web index for NLP processing :(
> >
> > I did the indexing & crawling on one small-sized server, and researching
> > and getting it all to this stage took me this much time... and now my
> > index is unusable :(
> >
> > On Wed, Oct 23, 2013 at 8:16 PM, Michael McCandless
> > <luc...@mikemccandless.com> wrote:
> >
> >> On Wed, Oct 23, 2013 at 10:33 AM, Chris <christu...@gmail.com> wrote:
> >> > I am not exactly sure if the commit() was run, as I am inserting each
> >> > row and doing a commit right away. My Solr will not load the index...
> >>
> >> I'm confused: if you are doing a commit right away after every row
> >> (which is REALLY bad practice: that's incredibly slow and
> >> unnecessary), then surely you've had many commits succeed?
> >>
> >> > Is there any way that I can fix this? I have a huge index and will
> >> > lose months if I try to reindex :( I didn't know Lucene was not
> >> > stable, I thought it was.
> >>
> >> Sorry, but no.
> >>
> >> In theory ... a tool could be created that would try to "reconstitute"
> >> a segments file by looking at all the various files that exist, but
> >> this is not in general easy (and may not be possible): the segments
> >> file has very important metadata, like which codec was used to write
> >> each segment, etc.
> >>
> >> Did it really take months to do this indexing? That is really way too
> >> long; how many documents?
> >>
> >> Lucene (Solr) is stable, i.e. a successful commit should ensure your
> >> index survives power loss. If somehow that was not the case here,
> >> then we need to figure out why and fix it ...
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
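And based on Mike's earlier advice, this is the batched version of my SolrJ code that I am thinking of. Only a sketch: the ZooKeeper address is made up, the batch size of 1000 is a guess I would tune, and fetchResultItems()/Object are stand-ins for my real document source and bean class. The point is just to send beans with addBeans() in batches and commit rarely (or rely on autoCommit in solrconfig.xml) instead of committing after every addBean():

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.CloudSolrServer;

public class BatchedCloudIndexer {
    public static void main(String[] args) throws Exception {
        String cloudURL = "zk1:2181,zk2:2181,zk3:2181";   // placeholder ZooKeeper ensemble
        CloudSolrServer solrCoreCloud = new CloudSolrServer(cloudURL);
        solrCoreCloud.setDefaultCollection("Image");

        List<Object> batch = new ArrayList<Object>();     // Object stands in for my bean class
        for (Object resultItem : fetchResultItems()) {    // however the documents are produced
            batch.add(resultItem);
            if (batch.size() >= 1000) {                   // guessed batch size, to be tuned
                solrCoreCloud.addBeans(batch);            // one request for the whole batch
                batch.clear();
                // no commit here: durability comes from autoCommit in solrconfig.xml,
                // or from an explicit commit every few hours at most
            }
        }
        if (!batch.isEmpty()) {
            solrCoreCloud.addBeans(batch);                // flush the final partial batch
        }
        solrCoreCloud.commit();                           // single commit at the end of the run
        solrCoreCloud.shutdown();
    }

    private static List<Object> fetchResultItems() {
        return new ArrayList<Object>();                   // stand-in for my real document source
    }
}

The idea is that each addBeans() call becomes one update request instead of one per document, and commits stop being on the hot path entirely.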