Hi Mike,

I changed my program and now the indexing is better. However, I have run
into another issue - I get characters like -
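The runs of � below are U+FFFD replacement characters, which appear when
bytes are decoded with a charset that cannot make sense of them - typically
a crawled CJK page being read as UTF-8 (or with the JVM's platform default)
when it was actually encoded as something like Shift_JIS or GBK. SolrJ
normally handles the wire encoding itself, so the damage is usually done
before addBean() is ever called. A minimal sketch of the failure and the
fix (the byte values and class name are illustrative):

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class MojibakeSketch {
        public static void main(String[] args) {
            // "日本語" encoded as Shift_JIS, as a crawled Japanese page might be.
            byte[] shiftJis = {(byte) 0x93, (byte) 0xFA, (byte) 0x96,
                               (byte) 0x7B, (byte) 0x8C, (byte) 0xEA};

            // Wrong: decoding as UTF-8. Invalid byte sequences become U+FFFD,
            // so this prints something like ���{�� - exactly the symptom here.
            String garbled = new String(shiftJis, StandardCharsets.UTF_8);

            // Right: decode with the page's actual charset (from the HTTP
            // Content-Type header or the <meta> tag) before filling the bean
            // fields that addBean() will serialize.
            String correct = new String(shiftJis, Charset.forName("Shift_JIS"));

            System.out.println(garbled);
            System.out.println(correct); // 日本語
        }
    }

Auditing every new String(bytes) and InputStreamReader in the crawl
pipeline for a missing charset argument is usually the quickest way to
find where this happens.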
������������������ - CTA������������ - in the Solr index. I am adding
Java beans to Solr with the addBean() method. This seems to be a character
encoding issue. Any pointers on how to resolve this one? I have seen that
this occurs mostly for Japanese and Chinese characters.

Warm Regards,
Chris

On Thu, Oct 24, 2013 at 1:30 AM, Michael McCandless
<luc...@mikemccandless.com> wrote:

> Hi Chris,
>
> Sorry, I don't know much about Solr cloud; maybe ask on the solr-user
> list, and give details about what went wrong?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Oct 23, 2013 at 11:25 AM, Chris <christu...@gmail.com> wrote:
> > Wow !!! Thanks a lot for the helpful tips. I will implement this in the
> > next two days & report back with my indexing speed... I have one more
> > question...
> >
> > I tried committing to Solr cloud, but then something was not correct,
> > as it would not index after a few documents...
> >
> > Also, there seems to be something wrong in ZooKeeper: when we try to
> > add documents using SolrJ, it works fine as long as the insert load is
> > not much, but once we start doing many inserts, it throws a lot of
> > errors...
> >
> > I am doing something like -
> >
> > CloudSolrServer solrCoreCloud = new CloudSolrServer(cloudURL);
> > solrCoreCloud.setDefaultCollection("Image");
> > UpdateResponse up = solrCoreCloud.addBean(resultItem);
> > UpdateResponse upr = solrCoreCloud.commit();
> >
> > Since I have to reindex, I am wondering whether I need to use
> > SolrCloud or not?
> >
> > On Wed, Oct 23, 2013 at 8:41 PM, Michael McCandless
> > <luc...@mikemccandless.com> wrote:
> >
> >> Indexing 100M web pages really should not take months; if you fix
> >> committing after every row that should make things much faster.
> >>
> >> Use multiple index threads, set a highish RAM buffer (~512 MB), use a
> >> local disk not a remote mounted fileserver, ideally an SSD, etc. See
> >> http://wiki.apache.org/lucene-java/ImproveIndexingSpeed for more
> >> ideas.
> >>
> >> Only commit periodically, when enough indexing has happened that you
> >> would be upset to lose that work since the last commit (e.g. maybe
> >> every few hours or something).
> >>
> >> Also, be sure your IO system is "healthy" / does not disregard fsync,
> >> and if the index is really important, back it up to a different
> >> storage device every so often.
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
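Putting that advice together with the CloudSolrServer snippet earlier in
the thread: drop the commit() after every document in favour of batched
adds and a single commit at the end (or periodic commits). A rough sketch
against the SolrJ 4.x API, with an illustrative ZooKeeper address,
collection name, and a document loop standing in for the real row source:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchedIndexer {
        public static void main(String[] args) throws Exception {
            // Illustrative ZooKeeper ensemble address.
            CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            solr.setDefaultCollection("Image");

            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 100000; i++) {   // stand-in for the real rows
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "page-" + i);
                doc.addField("title", "title " + i);
                batch.add(doc);
                if (batch.size() == 1000) {      // one round trip per 1000 docs
                    solr.add(batch);             // add, but do NOT commit here
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                solr.add(batch);
            }
            solr.commit();                       // commit once, not per row
            solr.shutdown();
        }
    }

addBeans(batch) is the bean-based equivalent of add(batch), and an
autoCommit block in solrconfig.xml can take the place of even the final
explicit commit().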
> >> On Wed, Oct 23, 2013 at 10:58 AM, Chris <christu...@gmail.com> wrote:
> >> > Actually, it contains about 100 million webpages and was built out
> >> > of a web index for NLP processing :(
> >> >
> >> > I did the indexing & crawling on one small-sized server... and
> >> > researching and getting it all to this stage took me this much
> >> > time... and now my index is unusable :(
> >> >
> >> > On Wed, Oct 23, 2013 at 8:16 PM, Michael McCandless
> >> > <luc...@mikemccandless.com> wrote:
> >> >
> >> >> On Wed, Oct 23, 2013 at 10:33 AM, Chris <christu...@gmail.com> wrote:
> >> >> > I am not exactly sure if the commit() was run, as I am inserting
> >> >> > each row & doing a commit right away. My Solr will not load the
> >> >> > index....
> >> >>
> >> >> I'm confused: if you are doing a commit right away after every row
> >> >> (which is REALLY bad practice: that's incredibly slow and
> >> >> unnecessary), then surely you've had many commits succeed?
> >> >>
> >> >> > is there any way that I can fix this? I have a huge index & will
> >> >> > lose months if I try to reindex :( I didn't know Lucene was not
> >> >> > stable, I thought it was
> >> >>
> >> >> Sorry, but no.
> >> >>
> >> >> In theory ... a tool could be created that would try to
> >> >> "reconstitute" a segments file by looking at all the various files
> >> >> that exist, but this is not in general easy (and may not be
> >> >> possible): the segments file has very important metadata, like
> >> >> which codec was used to write each segment, etc.
> >> >>
> >> >> Did it really take months to do this indexing? That is really way
> >> >> too long; how many documents?
> >> >>
> >> >> Lucene (Solr) is stable, i.e. a successful commit should ensure
> >> >> your index survives power loss. If somehow that was not the case
> >> >> here, then we need to figure out why and fix it ...
> >> >>
> >> >> Mike McCandless
> >> >>
> >> >> http://blog.mikemccandless.com
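For anyone who hits a similar state: if the segments_N file is still
present and only some segment files are damaged, Lucene's stock CheckIndex
tool will at least report which segments are readable - though, for the
metadata reasons Mike describes, it cannot recreate a missing segments
file. A minimal sketch against the Lucene 4.x API, with an illustrative
index path:

    import java.io.File;
    import org.apache.lucene.index.CheckIndex;
    import org.apache.lucene.store.FSDirectory;

    public class InspectIndex {
        public static void main(String[] args) throws Exception {
            FSDirectory dir = FSDirectory.open(new File("/path/to/index"));
            try {
                CheckIndex checker = new CheckIndex(dir);
                checker.setInfoStream(System.out);  // print per-segment detail
                CheckIndex.Status status = checker.checkIndex(); // read-only scan
                System.out.println(status.clean
                    ? "index is clean"
                    : "problems found in one or more segments");
            } finally {
                dir.close();
            }
        }
    }

The same check is available from the command line as:
java -cp lucene-core.jar org.apache.lucene.index.CheckIndex /path/to/index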