Hi Mike,

I changed my program and now the indexing is better. However, I have run
into another issue - I get characters like -
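The runs of � below are U+FFFD replacement characters, which appear when
bytes are decoded with a charset that cannot make sense of them - typically
a crawled CJK page being read as UTF-8 (or with the JVM's platform default)
when it was actually encoded as something like Shift_JIS or GBK. SolrJ
normally handles the wire encoding itself, so the damage is usually done
before addBean() is ever called. A minimal sketch of the failure and the
fix (the byte values and class name are illustrative):

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class MojibakeSketch {
        public static void main(String[] args) {
            // "日本語" encoded as Shift_JIS, as a crawled Japanese page might be.
            byte[] shiftJis = {(byte) 0x93, (byte) 0xFA, (byte) 0x96,
                               (byte) 0x7B, (byte) 0x8C, (byte) 0xEA};

            // Wrong: decoding as UTF-8. Invalid byte sequences become U+FFFD,
            // so this prints something like ���{�� - exactly the symptom here.
            String garbled = new String(shiftJis, StandardCharsets.UTF_8);

            // Right: decode with the page's actual charset (from the HTTP
            // Content-Type header or the <meta> tag) before filling the bean
            // fields that addBean() will serialize.
            String correct = new String(shiftJis, Charset.forName("Shift_JIS"));

            System.out.println(garbled);
            System.out.println(correct); // 日本語
        }
    }

Auditing every new String(bytes) and InputStreamReader in the crawl
pipeline for a missing charset argument is usually the quickest way to
find where this happens.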
������������������ - CTA������������ - in the Solr index. I am adding
Java beans to Solr with the addBean() method. This seems to be a character
encoding issue. Any pointers on how to resolve this one? I have seen that
this occurs mostly for Japanese and Chinese characters.

Warm Regards,
Chris

On Thu, Oct 24, 2013 at 1:30 AM, Michael McCandless
<luc...@mikemccandless.com> wrote:

> Hi Chris,
>
> Sorry, I don't know much about Solr cloud; maybe ask on the solr-user
> list, and give details about what went wrong?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Oct 23, 2013 at 11:25 AM, Chris <christu...@gmail.com> wrote:
> > Wow !!! Thanks a lot for the helpful tips. I will implement this in the
> > next two days & report back with my indexing speed... I have one more
> > question...
> >
> > I tried committing to Solr cloud, but then something was not correct,
> > as it would not index after a few documents...
> >
> > Also, there seems to be something wrong in ZooKeeper: when we try to
> > add documents using SolrJ, it works fine as long as the insert load is
> > not much, but once we start doing many inserts, it throws a lot of
> > errors...
> >
> > I am doing something like -
> >
> > CloudSolrServer solrCoreCloud = new CloudSolrServer(cloudURL);
> > solrCoreCloud.setDefaultCollection("Image");
> > UpdateResponse up = solrCoreCloud.addBean(resultItem);
> > UpdateResponse upr = solrCoreCloud.commit();
> >
> > Since I have to reindex, I am wondering whether I need to use
> > SolrCloud or not?
> >
> > On Wed, Oct 23, 2013 at 8:41 PM, Michael McCandless
> > <luc...@mikemccandless.com> wrote:
> >
> >> Indexing 100M web pages really should not take months; if you fix
> >> committing after every row that should make things much faster.
> >>
> >> Use multiple index threads, set a highish RAM buffer (~512 MB), use a
> >> local disk not a remote mounted fileserver, ideally an SSD, etc. See
> >> http://wiki.apache.org/lucene-java/ImproveIndexingSpeed for more
> >> ideas.
> >>
> >> Only commit periodically, when enough indexing has happened that you
> >> would be upset to lose that work since the last commit (e.g. maybe
> >> every few hours or something).
> >>
> >> Also, be sure your IO system is "healthy" / does not disregard fsync,
> >> and if the index is really important, back it up to a different
> >> storage device every so often.
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
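Putting that advice together with the CloudSolrServer snippet earlier in
the thread: drop the commit() after every document in favour of batched
adds and a single commit at the end (or periodic commits). A rough sketch
against the SolrJ 4.x API, with an illustrative ZooKeeper address,
collection name, and a document loop standing in for the real row source:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchedIndexer {
        public static void main(String[] args) throws Exception {
            // Illustrative ZooKeeper ensemble address.
            CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            solr.setDefaultCollection("Image");

            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 100000; i++) {   // stand-in for the real rows
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "page-" + i);
                doc.addField("title", "title " + i);
                batch.add(doc);
                if (batch.size() == 1000) {      // one round trip per 1000 docs
                    solr.add(batch);             // add, but do NOT commit here
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                solr.add(batch);
            }
            solr.commit();                       // commit once, not per row
            solr.shutdown();
        }
    }

addBeans(batch) is the bean-based equivalent of add(batch), and an
autoCommit block in solrconfig.xml can take the place of even the final
explicit commit().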
> >> On Wed, Oct 23, 2013 at 10:58 AM, Chris <christu...@gmail.com> wrote:
> >> > Actually, it contains about 100 million webpages and was built out
> >> > of a web index for NLP processing :(
> >> >
> >> > I did the indexing & crawling on one small-sized server... and
> >> > researching and getting it all to this stage took me this much
> >> > time... and now my index is unusable :(
> >> >
> >> > On Wed, Oct 23, 2013 at 8:16 PM, Michael McCandless
> >> > <luc...@mikemccandless.com> wrote:
> >> >
> >> >> On Wed, Oct 23, 2013 at 10:33 AM, Chris <christu...@gmail.com> wrote:
> >> >> > I am not exactly sure if the commit() was run, as I am inserting
> >> >> > each row & doing a commit right away. My Solr will not load the
> >> >> > index....
> >> >>
> >> >> I'm confused: if you are doing a commit right away after every row
> >> >> (which is REALLY bad practice: that's incredibly slow and
> >> >> unnecessary), then surely you've had many commits succeed?
> >> >>
> >> >> > is there any way that I can fix this? I have a huge index & will
> >> >> > lose months if I try to reindex :( I didn't know Lucene was not
> >> >> > stable, I thought it was
> >> >>
> >> >> Sorry, but no.
> >> >>
> >> >> In theory ... a tool could be created that would try to
> >> >> "reconstitute" a segments file by looking at all the various files
> >> >> that exist, but this is not in general easy (and may not be
> >> >> possible): the segments file has very important metadata, like
> >> >> which codec was used to write each segment, etc.
> >> >>
> >> >> Did it really take months to do this indexing? That is really way
> >> >> too long; how many documents?
> >> >>
> >> >> Lucene (Solr) is stable, i.e. a successful commit should ensure
> >> >> your index survives power loss. If somehow that was not the case
> >> >> here, then we need to figure out why and fix it ...
> >> >>
> >> >> Mike McCandless
> >> >>
> >> >> http://blog.mikemccandless.com
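For anyone who hits a similar state: if the segments_N file is still
present and only some segment files are damaged, Lucene's stock CheckIndex
tool will at least report which segments are readable - though, for the
metadata reasons Mike describes, it cannot recreate a missing segments
file. A minimal sketch against the Lucene 4.x API, with an illustrative
index path:

    import java.io.File;
    import org.apache.lucene.index.CheckIndex;
    import org.apache.lucene.store.FSDirectory;

    public class InspectIndex {
        public static void main(String[] args) throws Exception {
            FSDirectory dir = FSDirectory.open(new File("/path/to/index"));
            try {
                CheckIndex checker = new CheckIndex(dir);
                checker.setInfoStream(System.out);  // print per-segment detail
                CheckIndex.Status status = checker.checkIndex(); // read-only scan
                System.out.println(status.clean
                    ? "index is clean"
                    : "problems found in one or more segments");
            } finally {
                dir.close();
            }
        }
    }

The same check is available from the command line as:
java -cp lucene-core.jar org.apache.lucene.index.CheckIndex /path/to/index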