Re: How to index large set data

2009-05-24 Thread nk 11
Hello
Interesting thread. One request, please: since I don't have much experience
with Solr, could you use full terms rather than abbreviations like DIH, RES, etc.?

Thanks :)


Re: Getting 404 for MoreLikeThis handler

2009-05-24 Thread jlist9
Thanks. Will that still be the MoreLikeThisRequestHandler?
Or the StandardRequestHandler with mlt option?

>> Hi, I'm trying out the mlt handler but I'm getting a 404 error.
>>
>> HTTP Status 404 - /solr/mlt
>>
>> solrconfig.xml seem to say that mlt handler is available by default.
>> i wonder if there's anything else I should do before I can use it?
>> I'm using version 1.3.
>
> Try /solr/select with mlt=on parameter.
>
> Koji
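To make the suggestion above concrete: the mlt parameters ride on the standard /select handler, so no separate handler registration is needed for that route. A sketch of such a request (host, port, and field names are illustrative):

```
http://localhost:8983/solr/select?q=id:1234&mlt=true&mlt.fl=title,body&mlt.count=5
```

The dedicated MoreLikeThisHandler is the other route; for /solr/mlt to stop returning 404, solrconfig.xml would need an entry registering it explicitly, along these lines:

```
<requestHandler name="/mlt" class="solr.MoreLikeThisHandler"/>
```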


Re: How to index large set data

2009-05-24 Thread Jianbin Dai

Hi Paul,

Hope you have a great weekend so far.
I still have a couple of questions you might help me out:

1. In your earlier email, you said "if possible, you can setup multiple DIH say 
/dataimport1, /dataimport2 etc and split your files and can achieve 
parallelism".
I am not sure I understand it right. I put two requestHandler entries in 
solrconfig.xml, like this:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">./data-config.xml</str>
  </lst>
</requestHandler>

<requestHandler name="/dataimport2" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">./data-config2.xml</str>
  </lst>
</requestHandler>


and created data-config.xml and data-config2.xml, then ran the command
http://host:8080/solr/dataimport?command=full-import

But only one data set (the first one) was indexed. Did I get something wrong?
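That behavior would be expected: each registered handler has to be triggered by its own request, so hitting /solr/dataimport starts only the first import. A sketch, assuming the two handlers are registered under the names suggested earlier in the thread:

```
http://host:8080/solr/dataimport?command=full-import
http://host:8080/solr/dataimport2?command=full-import
```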


2. I noticed that after Solr indexed about 8M documents (around two hours), it 
gets very, very slow. I used the "top" command in Linux and noticed that RES 
(resident memory) is 1g. I did several experiments; every time RES reaches 1g, 
the indexing process becomes extremely slow. Is this memory limit set by the JVM? 
And how can I set the JVM memory when I use DIH through the web command full-import?
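For reference, the heap limit is set on the servlet container's JVM at startup, not through the import URL. A sketch, assuming either Tomcat or the Jetty-based Solr example distribution (paths and values are illustrative):

```shell
# Tomcat: export before starting catalina.sh, or put in bin/setenv.sh
export CATALINA_OPTS="-Xms512m -Xmx2g"

# Jetty (Solr example distribution): pass the flags directly
java -Xmx2g -jar start.jar
```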

Thanks!


JB




--- On Fri, 5/22/09, Noble Paul നോബിള്‍ नोब्ळ् wrote:

> From: Noble Paul നോബിള്‍ नोब्ळ्
> Subject: Re: How to index large set data
> To: "Jianbin Dai"
> Date: Friday, May 22, 2009, 10:04 PM
> On Sat, May 23, 2009 at 10:27 AM, Jianbin Dai wrote:
> >
> > Hi Paul, but in your previous post, you said "there is
> > already an issue for writing to Solr in multiple threads,
> > SOLR-1089". Do you think using solrj alone would be better
> > than DIH?
>
> nope
> you will have to do indexing in multiple threads
>
> if possible, you can set up multiple DIH, say /dataimport1,
> /dataimport2 etc., and split your files, and can achieve
> parallelism
>
> > Thanks and have a good weekend!
> >
> > --- On Fri, 5/22/09, Noble Paul നോബിള്‍ नोब्ळ् wrote:
> >
> >> no need to use embedded Solrserver..
> >> you can use SolrJ with streaming
> >> in multiple threads
> >>
> >> On Fri, May 22, 2009 at 8:36 PM, Jianbin Dai wrote:
> >> >
> >> > If I do the xml parsing by myself and use an embedded
> >> > client to do the push, would it be more efficient than DIH?
> >> >
> >> > --- On Fri, 5/22/09, Grant Ingersoll wrote:
> >> >
> >> >> From: Grant Ingersoll
> >> >> Subject: Re: How to index large set data
> >> >> To: solr-user@lucene.apache.org
> >> >> Date: Friday, May 22, 2009, 5:38 AM
> >> >> Can you parallelize this?  I don't know that the DIH can
> >> >> handle it, but having multiple threads sending docs to
> >> >> Solr is the best performance-wise, so maybe you need to
> >> >> look at alternatives to pulling with DIH and instead use
> >> >> a client to push into Solr.
> >> >>
> >> >> On May 22, 2009, at 3:42 AM, Jianbin Dai wrote:
> >> >>
> >> >> > about 2.8 m total docs were created. only the first run
> >> >> > finishes. In my 2nd try, it hangs there forever at the
> >> >> > end of indexing (I guess right before commit), with cpu
> >> >> > usage of 100%. Total 5G (2050) index files are created.
> >> >> > Now I have two problems:
> >> >> > 1. why it hangs there and failed?
> >> >> > 2. how can i speed up the indexing?
> >> >> >
> >> >> > Here is my solrconfig.xml
> >> >> > false
> >> >> > 3000
> >> >> > 1000
> >> >> > 2147483647
> >> >> > 1
> >> >> > false
> >> >> >
> >> >> > --- On Thu, 5/21/09, Noble Paul നോബിള്‍ नोब्ळ् wrote:
> >> >> >
> >> >> >> From: Noble Paul നോബിള്‍ नोब्ळ्
> >> >> >> Subject: Re: How to index large set data
> >> >> >> To: solr-user@lucene.apache.org
> >> >> >> Date: Thursday, May 21, 2009, 10:39 PM
> >> >> >> what is the total no: of docs created?  I guess it may
> >> >> >> not be memory bound. indexing is mostly an IO bound
> >> >> >> operation. You may be able to get a better perf if an
> >> >> >> SSD (solid state disk) is used.
> >> >> >>
> >> >> >> On Fri, May 22, 2009 at 10:46 AM, Jianbin Dai wrote:
> >> >> >>>
> >> >> >>> Hi Paul,
> >> >> >>>
> >> >> >>> Thank you so much for answering my questions. It
> >> >> >>> really helped.
> >> >> >>> After some adjustment, basically setting mergeFactor
> >> >> >>> to 1000 from the default value of 10, I can finish the
> >> >> >>> whole job in 2.5 hours. I checked that during running
> >> >> >>> time, only around 18% of memory is being used, and
> >> >> >>> VIRT is always 1418m. I am thinking it may be
> >> >> >>> restricted by the JVM memory setting. But I run the
> >> >> >>> data import command through web, i.e.,
> >> >> >>> http://:/solr/dataimport?command=full-import,
> >> >> >>> how can I set the memory allocation for
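The multi-threaded client push suggested in the quoted thread can be sketched generically. This is an illustration only: sendBatch() is a stand-in for a real client call (such as a SolrJ add()), and the document IDs are made up; the point is the partition-and-submit pattern, not a working Solr client.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelIndexer {
    // Collects everything "sent" so the sketch can verify delivery.
    static final Queue<String> sent = new ConcurrentLinkedQueue<>();

    // Placeholder for a real push, e.g. an HTTP POST or SolrJ server.add(docs).
    static void sendBatch(List<String> docs) {
        sent.addAll(docs);
    }

    public static void main(String[] args) throws Exception {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 1000; i++) docs.add("doc-" + i);

        int threads = 4, batch = 50;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        // Split the document list into batches and push them concurrently.
        for (int i = 0; i < docs.size(); i += batch) {
            final List<String> slice =
                new ArrayList<>(docs.subList(i, Math.min(i + batch, docs.size())));
            pool.submit(() -> sendBatch(slice));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println(sent.size()); // prints 1000: every doc delivered once
    }
}
```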

Re: Getting 404 for MoreLikeThis handler

2009-05-24 Thread Koji Sekiguchi

jlist9 wrote:

Hi, I'm trying out the mlt handler but I'm getting a 404 error.

HTTP Status 404 - /solr/mlt

solrconfig.xml seem to say that mlt handler is available by default.
i wonder if there's anything else I should do before I can use it?
I'm using version 1.3.

Thanks

Try /solr/select with mlt=on parameter.

Koji




More questions about MoreLikeThis

2009-05-24 Thread jlist9
The wiki page (http://wiki.apache.org/solr/MoreLikeThis) says:

mlt.fl: The fields to use for similarity. NOTE: if possible, these
should have a stored TermVector

I didn't set termVectors to true, yet MoreLikeThis with the StandardRequestHandler
seems to work fine. My first question is: is the term vector only a performance
optimization?

My second question is: after I changed the mlt.fl fields from both indexed and
stored to indexed only, I started to get zero results back. Do mlt.fl fields
always need to be stored?

Thanks
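For context on the behavior described above (hedged, based on how Lucene's MoreLikeThis works): when a field has no term vector, MLT falls back to re-analyzing the stored value of the field, so the term vector is indeed a performance optimization on that path. A field that is indexed-only, with no term vector, gives MLT nothing to read, which would explain the zero results. Term vectors are enabled per field in schema.xml; a sketch with illustrative field and type names:

```
<field name="body" type="text" indexed="true" stored="true" termVectors="true"/>
```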


Getting 404 for MoreLikeThis handler

2009-05-24 Thread jlist9
Hi, I'm trying out the mlt handler but I'm getting a 404 error.

HTTP Status 404 - /solr/mlt

solrconfig.xml seems to say that the mlt handler is available by default.
I wonder if there's anything else I should do before I can use it?
I'm using version 1.3.

Thanks


Re: solr replication 1.3

2009-05-24 Thread Yonik Seeley
On Sun, May 24, 2009 at 12:37 AM, Otis Gospodnetic
 wrote:
> Yes.  Although it might work under Cygwin, too.

cygwin wouldn't work:
http://www.lucidimagination.com/search/document/32471da18a69b169/replication_in_1_3

-Yonik
http://www.lucidimagination.com