OK, I was thinking more along the lines of this blog:

http://searchhub.org/dev/2012/02/14/indexing-with-solrj/

which uses Tika directly to process the docs on the client
(wherever you run it) and only sends the results to
Solr....

The SolrJ program you're referencing uses a different approach...
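
To make the difference concrete, here is a minimal Java sketch of the client-side pattern: extract each document locally, then send the results in small batches, so that only one batch is held in memory at a time. The class `BatchingIndexer`, the `extract` and `send` callbacks, and the batch size are all placeholders made up for illustration. In a real program `extract` would wrap the Tika parsing and `send` would wrap the SolrJ add call; neither is shown here, and this is not the code from the blog.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

public class BatchingIndexer {

    /**
     * Runs `extract` on each input and passes the results to `send` in
     * batches of at most `batchSize`. Returns the number of batches sent.
     * Both callbacks are hypothetical stand-ins (Tika parsing, SolrJ add).
     */
    public static <I, D> int indexInBatches(Iterable<I> inputs,
                                            Function<I, D> extract,
                                            Consumer<List<D>> send,
                                            int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        List<D> batch = new ArrayList<>(batchSize);
        int batchesSent = 0;
        for (I input : inputs) {
            batch.add(extract.apply(input));
            if (batch.size() >= batchSize) {
                // Hand over a copy, then clear, so memory use stays bounded
                // by one batch regardless of how many inputs there are.
                send.accept(new ArrayList<>(batch));
                batch.clear();
                batchesSent++;
            }
        }
        if (!batch.isEmpty()) {
            // Flush the final, partially filled batch.
            send.accept(new ArrayList<>(batch));
            batchesSent++;
        }
        return batchesSent;
    }

    public static void main(String[] args) {
        // File names are illustrative only; nothing is read from disk.
        List<String> files = Arrays.asList("a.pdf", "b.doc", "c.xml", "d.pdf", "e.txt");
        int sent = indexInBatches(
                files,
                (String name) -> "extracted text of " + name,
                (List<String> docs) -> System.out.println("sending " + docs.size() + " docs"),
                2);
        System.out.println(sent + " batches");
    }
}
```

With five inputs and a batch size of 2, three batches are sent (2, 2 and 1 documents). A batch size of 1 degenerates to sending every document on its own, which keeps the least in memory but costs a round trip per document.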

FWIW,
Erick

On Tue, Sep 25, 2012 at 10:04 AM,  <johannes.schwendin...@blum.com> wrote:
> The difference with Solr Cell is that I'm sending every single document
> to Solr Cell and don't collect them until I have a couple of them in
> memory.
> I'm mainly using the code from here:
> http://wiki.apache.org/solr/ExtractingRequestHandler#SolrJ
>
>
> Erick Erickson <erickerick...@gmail.com> wrote on 25.09.2012 15:47:34:
>
>> From: Erick Erickson <erickerick...@gmail.com>
>> To: solr-user@lucene.apache.org
>> Date: 25.09.2012 15:48
>> Subject: Re: Re: Solr Cell Questions
>>
>> bq: how many documents per minute, second, what ever can i put into solr
>>
>> Too many variables to say. I've seen several thousand truly simple
>> docs/sec. But since you're doing the Tika processing that's probably
>> going to be your limiting factor. And it'll be many fewer...
>>
>> I don't understand your OOM issue when running Tika on the client. Or,
>> rather, why you think using SolrCell makes this different. SolrCell also
>> uses Tika. So my suspicion is that your client-side process simply isn't
>> allocating much memory to the JVM. Did you try bumping the memory
>> on your client?
>>
>> Best
>> Erick
>>
>> On Tue, Sep 25, 2012 at 5:23 AM,  <johannes.schwendin...@blum.com> wrote:
>> > Thank you Erick for your response,
>> >
>> > I've already tried what you suggested and got some out of memory
>> > exceptions. Because of this I like the solution with Solr Cell, where I
>> > can send the file directly to Solr via a stream and don't collect them
>> > in memory.
>> >
>> > Another question that came to mind: how many documents per minute,
>> > second, whatever, can I put into Solr? Say XML format and from 100 KB
>> > to 100 MB.
>> > Is there a number, or is it too dependent on hardware and settings?
>> >
>> >
>> > Best
>> > Johannes
>> >
>> > Erick Erickson <erickerick...@gmail.com> wrote on 25.09.2012 00:22:26:
>> >
>> >> From: Erick Erickson <erickerick...@gmail.com>
>> >> To: solr-user@lucene.apache.org
>> >> Date: 25.09.2012 00:23
>> >> Subject: Re: Solr Cell Questions
>> >>
>> >> If you're concerned about throughput, consider moving all the
>> >> SolrCell (Tika) processing off the server. SolrCell is way cool
>> >> for showing what can be done, but its downside is you're
>> >> moving all the processing of the structured documents to the
>> >> same machine doing the indexing. Pretty soon, especially
>> >> with significant size files, you're spending all your CPU cycles
>> >> parsing the files...
>> >>
>> >> Happens there's a blog about this:
>> >> http://searchhub.org/dev/2012/02/14/indexing-with-solrj/
>> >>
>> >> By moving the indexing to N clients, you can increase
>> >> throughput until you make Solr work hard to do the indexing....
>> >>
>> >> Best
>> >> Erick
>> >>
>> >> On Mon, Sep 24, 2012 at 10:04 AM,  <johannes.schwendin...@blum.com> wrote:
>> >> > Hi,
>> >> >
>> >> > I'm currently experimenting with Solr Cell to index files into Solr.
>> >> > While doing this, some questions came up.
>> >> >
>> >> > 1. Is it possible (and wise) to connect to Solr Cell with multiple
>> >> > threads at the same time to index several documents at once?
>> >> > This question came up because my program takes about 6 hours to
>> >> > index around 35000 docs. (No production environment, only the example
>> >> > Solr and a little desktop machine, but I think it's very slow, and I
>> >> > know Solr isn't the bottleneck (yet).)
>> >> >
>> >> > 2. If 1 is possible, how many threads should do this and how much
>> >> > memory does Solr need? I've tried it but I run into an out of memory
>> >> > exception.
>> >> >
>> >> > Thanks in advance
>> >> >
>> >> > Best Regards
>> >> > Johannes
